jedisct1/huggingface-xet
Implementation of the HuggingFace Xet Protocol.
A pure Zig implementation of the XET protocol for efficient file storage and retrieval through content-defined chunking and deduplication.
XET is a protocol for handling large files by breaking them into chunks based on their content (not fixed sizes), compressing them, and storing them in a way that eliminates duplicates.
It's particularly useful for managing large models and datasets, like those hosted on HuggingFace.
This library implements the full XET protocol spec in Zig: content-defined chunking, BLAKE3 hashing with Merkle trees, LZ4 compression with byte grouping, deduplication, and the xorb and MDB shard storage formats.
The implementation has been cross-verified against the Rust reference implementation to ensure correctness.
# Build the project
zig build
# Run tests (62 tests covering all components)
zig build test
# Run the demo CLI
zig build run
# Run benchmarks
zig build bench
The most common use case is downloading models efficiently:
# Set your HuggingFace token
export HF_TOKEN="your_token_here"
# Run the download example
zig build run-example-download
This downloads a model using the XET protocol, which handles all the chunking, deduplication, and reconstruction automatically.
Add to your build.zig.zon:
.dependencies = .{
    .xet = .{
        .url = "https://github.com/yourusername/zig-xet/archive/main.tar.gz",
        // .hash is required; run `zig build` once and copy the hash it reports
        .hash = "...",
    },
},
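To make the module importable, it also needs to be wired up in build.zig. A minimal sketch, assuming the dependency key is `xet` and the package exposes a module of the same name (check the package's own build.zig for the actual module name):

// build.zig (excerpt): assumes the usual `b`, `target`, `optimize`,
// and `exe` from a standard build script.
const xet_dep = b.dependency("xet", .{
    .target = target,
    .optimize = optimize,
});
exe.root_module.addImport("xet", xet_dep.module("xet"));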
Then in your code:
const xet = @import("xet");
// Chunk a file using content-defined chunking
var chunks = try xet.chunking.chunkBuffer(allocator, data);
defer chunks.deinit(allocator);
// Hash chunks with BLAKE3
const hash = xet.hashing.computeDataHash(chunk_data);
// Build a Merkle tree for verification
const merkle_root = try xet.hashing.buildMerkleTree(allocator, &nodes);
// Download a model from HuggingFace
const config = xet.model_download.DownloadConfig{
    .repo_id = "org/model",
    .repo_type = "model",
    .revision = "main",
    .file_hash_hex = "...",
};
try xet.model_download.downloadModelToFile(allocator, config, "output.gguf");
The XET protocol processes files in several stages:
Chunking: Files are split using a rolling hash algorithm. Instead of fixed-size chunks, boundaries are determined by content patterns, which means similar files share many identical chunks.
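For intuition, here is a toy content-defined chunker in Zig: a rolling hash is updated byte by byte, and a boundary is declared wherever the hash's low bits are all zero. The mask, size limits, and per-byte mixer below are illustrative assumptions, not the gear hash parameters the XET spec actually mandates:

const std = @import("std");

// Illustrative parameters only; the XET spec fixes its own gear table
// and min/target/max chunk sizes.
const MASK: u64 = (1 << 16) - 1; // boundary when the low 16 hash bits are zero
const MIN_CHUNK: usize = 8 * 1024;
const MAX_CHUNK: usize = 128 * 1024;

fn gearValue(byte: u8) u64 {
    // Toy per-byte mixer; real gear hashes use a precomputed 256-entry table.
    var x: u64 = byte;
    x *%= 0x9E3779B97F4A7C15;
    return x ^ (x >> 31);
}

pub fn main() !void {
    // Random input stands in for a real file.
    var prng = std.Random.DefaultPrng.init(42);
    var data: [1 << 18]u8 = undefined;
    prng.random().bytes(&data);

    var start: usize = 0;
    var h: u64 = 0;
    for (data, 0..) |byte, i| {
        h = (h << 1) +% gearValue(byte);
        const len = i + 1 - start;
        // Declare a boundary when the hash pattern hits (past the minimum
        // size) or when the hard size cap is reached.
        if ((len >= MIN_CHUNK and (h & MASK) == 0) or len >= MAX_CHUNK) {
            std.debug.print("chunk [{d}..{d}) len={d}\n", .{ start, i + 1, len });
            start = i + 1;
            h = 0;
        }
    }
    if (start < data.len)
        std.debug.print("chunk [{d}..{d}) len={d}\n", .{ start, data.len, data.len - start });
}

Because boundaries depend only on local content, inserting bytes near the start of a file shifts a few chunk boundaries but leaves the rest identical, which is what makes deduplication effective.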
Hashing: Each chunk gets a BLAKE3 hash. A Merkle tree combines these hashes to create a single file identifier.
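A sketch of that stage using the BLAKE3 implementation in Zig's standard library. The parent rule below (hash of the two concatenated child digests, with odd nodes promoted) is the textbook Merkle construction; the XET spec defines its own node rules, so this shows only the shape of the computation:

const std = @import("std");
const Blake3 = std.crypto.hash.Blake3;

const Hash = [Blake3.digest_length]u8;

fn leafHash(chunk: []const u8) Hash {
    var out: Hash = undefined;
    Blake3.hash(chunk, &out, .{});
    return out;
}

fn parentHash(left: Hash, right: Hash) Hash {
    // Generic Merkle parent: hash the concatenated child digests.
    var h = Blake3.init(.{});
    h.update(&left);
    h.update(&right);
    var out: Hash = undefined;
    h.final(&out);
    return out;
}

/// Reduce a level of nodes pairwise until a single root remains.
/// Odd nodes are carried up unchanged (one common convention).
fn merkleRoot(allocator: std.mem.Allocator, leaves: []const Hash) !Hash {
    std.debug.assert(leaves.len > 0);
    const level = try allocator.dupe(Hash, leaves);
    defer allocator.free(level);
    var n = level.len;
    while (n > 1) {
        var w: usize = 0;
        var i: usize = 0;
        while (i + 1 < n) : (i += 2) {
            level[w] = parentHash(level[i], level[i + 1]);
            w += 1;
        }
        if (i < n) { // odd node carried up
            level[w] = level[i];
            w += 1;
        }
        n = w;
    }
    return level[0];
}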
Compression: Chunks are compressed with LZ4, optionally with byte grouping preprocessing for better ratios.
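Byte grouping is a transpose applied before LZ4: all first bytes of each fixed-width element are laid out together, then all second bytes, and so on, which makes repetitive byte positions (float exponents, for instance) compress better. A minimal sketch, with the element width as an assumed parameter:

const std = @import("std");

/// Transpose `src` so all 1st bytes of each `width`-byte element come
/// first, then all 2nd bytes, etc. `src.len` must be a multiple of `width`.
fn byteGroup(allocator: std.mem.Allocator, src: []const u8, width: usize) ![]u8 {
    std.debug.assert(src.len % width == 0);
    const n = src.len / width; // number of elements
    const dst = try allocator.alloc(u8, src.len);
    for (0..width) |b| {
        for (0..n) |e| {
            dst[b * n + e] = src[e * width + b];
        }
    }
    return dst;
}

/// Inverse transform, applied after LZ4 decompression.
fn byteUngroup(allocator: std.mem.Allocator, src: []const u8, width: usize) ![]u8 {
    std.debug.assert(src.len % width == 0);
    const n = src.len / width;
    const dst = try allocator.alloc(u8, src.len);
    for (0..width) |b| {
        for (0..n) |e| {
            dst[e * width + b] = src[b * n + e];
        }
    }
    return dst;
}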
Deduplication: Identical chunks (same hash) are stored only once, saving space when you have multiple similar files.
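Deduplication falls out of content addressing: a chunk's BLAKE3 digest is its identity, so a store keyed by digest keeps each distinct chunk exactly once. A minimal in-memory sketch (the real protocol deduplicates against remote shards rather than a local map):

const std = @import("std");
const Blake3 = std.crypto.hash.Blake3;

const Hash = [Blake3.digest_length]u8;

const ChunkStore = struct {
    allocator: std.mem.Allocator,
    map: std.AutoHashMap(Hash, []const u8),

    fn init(allocator: std.mem.Allocator) ChunkStore {
        return .{
            .allocator = allocator,
            .map = std.AutoHashMap(Hash, []const u8).init(allocator),
        };
    }

    /// Store a chunk under its hash; returns false if an identical
    /// chunk was already present (nothing is written twice).
    fn put(self: *ChunkStore, chunk: []const u8) !bool {
        var key: Hash = undefined;
        Blake3.hash(chunk, &key, .{});
        const gop = try self.map.getOrPut(key);
        if (gop.found_existing) return false;
        gop.value_ptr.* = try self.allocator.dupe(u8, chunk);
        return true;
    }
};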
Storage: Chunks are bundled into "xorbs" and metadata is stored in "MDB shards" for efficient retrieval.
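Conceptually, the relationship looks like the following (these struct names and fields are illustrative only; the actual xorb and MDB shard layouts are fixed byte-for-byte by the spec):

// Conceptual model only, not the on-wire format.
const ChunkRef = struct {
    hash: [32]u8, // BLAKE3 identity of the chunk
    xorb_index: u32, // which xorb bundle holds it
    offset: u32, // byte offset of the compressed payload in that xorb
    length: u32, // compressed length
};

const Xorb = struct {
    payload: []const u8, // concatenated LZ4-compressed chunks
};

const FileEntry = struct {
    file_hash: [32]u8, // Merkle root identifying the whole file
    chunks: []const ChunkRef, // in-order list needed to rebuild the file
};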
When downloading from HuggingFace, the library queries the CAS (content-addressable storage) API to find which chunks are needed, fetches them, decompresses, and reconstructs the original file.
This implementation follows the official XET protocol specification exactly.
All constants, algorithms, and formats match the reference Rust implementation byte-for-byte. The test suite includes cross-verification tests to ensure continued compatibility.
Requests to HuggingFace are authenticated via the HF_TOKEN environment variable.