jedisct1/zig-ultracdc
A Zig implementation of UltraCDC, a fast content-defined chunking algorithm for data deduplication.
Content-defined chunking (CDC) splits data into variable-sized pieces based on the data itself, not arbitrary boundaries. This makes it useful for deduplication: if you change one paragraph in a document, only that chunk changes, not everything after it.
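To make the idea concrete, here is a minimal sketch of a content-defined boundary rule. It is a toy, not UltraCDC's actual rule: the names `isBoundary` and `firstChunkLen`, the 8-byte window, the reference pattern, and the distance threshold are all assumptions chosen for illustration, and there is no minimum or maximum chunk size. The point is only that the cut decision depends on nearby bytes rather than on the absolute offset, so inserting a byte in front of the data shifts each boundary along with the content.

```zig
const std = @import("std");

// Toy parameters, chosen for illustration only.
const window = 8; // bytes examined for each boundary decision
const pattern: u64 = 0xAAAA_AAAA_AAAA_AAAA; // arbitrary reference pattern
const max_distance = 26; // cut when the window is at most this far from the pattern

/// Returns true when a chunk should end right after `data[end - 1]`.
/// The decision depends only on the `window` bytes that precede `end`.
fn isBoundary(data: []const u8, end: usize) bool {
    if (end < window or end > data.len) return false;
    const word = std.mem.readInt(u64, data[end - window ..][0..window], .little);
    // Hamming distance: number of bits that differ from the pattern.
    return @popCount(word ^ pattern) <= max_distance;
}

/// Returns the length of the first chunk, or `data.len` if no boundary is found.
fn firstChunkLen(data: []const u8) usize {
    var end: usize = window;
    while (end <= data.len) : (end += 1) {
        if (isBoundary(data, end)) return end;
    }
    return data.len;
}

test "boundaries move with the content when a byte is inserted in front" {
    const n = 4096;
    var original: [n]u8 = undefined;
    // Fill the buffer with a simple LCG so the test needs no external input.
    var state: u64 = 1;
    for (&original) |*byte| {
        state = state *% 6364136223846793005 +% 1442695040888963407;
        byte.* = @truncate(state >> 56);
    }

    // Same content, shifted right by one inserted byte.
    var shifted: [n + 1]u8 = undefined;
    shifted[0] = 0xFF;
    @memcpy(shifted[1..], &original);

    // Every boundary in the original reappears one byte later.
    var end: usize = window;
    while (end <= n) : (end += 1) {
        if (isBoundary(&original, end)) {
            try std.testing.expect(isBoundary(&shifted, end + 1));
        }
    }
}
```

With fixed-size chunks, the same one-byte insertion would move every later boundary relative to the content, so no later chunk would match its earlier version.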
UltraCDC is a CDC algorithm, introduced in a 2022 IEEE paper, that is both fast and stable. This implementation processes data at around 2.7 GB/s, which makes it practical for real-world use.
You'll need Zig 0.16 or later.
zig build -Doptimize=ReleaseFast
The ultracdc tool analyzes how well your files would deduplicate:
# Basic usage
zig-out/bin/ultracdc file1.dat file2.dat
# With custom chunk sizes
zig-out/bin/ultracdc --min-size 4096 --max-size 262144 backup.tar
It will show you:
const ultracdc = @import("ultracdc");
// Use default options (8KB min, 64KB normal, 128KB max)
const options = ultracdc.ChunkerOptions{};
// Find the first chunk boundary
const cutpoint = ultracdc.UltraCDC.find(options, data, data.len);
// Process the chunk
const chunk = data[0..cutpoint];
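The call above finds only the first boundary. Splitting a whole buffer could look like the sketch below, which repeats the same call on the unprocessed tail. This rests on two assumptions that the snippet above does not confirm: that `find` returns a length relative to the slice it is given, and that this length is never zero for non-empty input. The helper `countChunks` is hypothetical and not part of the library.

```zig
const std = @import("std");
const ultracdc = @import("ultracdc");

/// Hypothetical helper: walks `data` chunk by chunk and returns the chunk count.
fn countChunks(data: []const u8) usize {
    const options = ultracdc.ChunkerOptions{};
    var offset: usize = 0;
    var count: usize = 0;
    while (offset < data.len) {
        const remaining = data[offset..];
        // Same call as in the snippet above, applied to the unprocessed tail.
        const cutpoint = ultracdc.UltraCDC.find(options, remaining, remaining.len);
        // Assumption: cutpoint is in 1..remaining.len. Stop rather than loop
        // forever or read out of bounds if that does not hold.
        if (cutpoint == 0 or cutpoint > remaining.len) break;
        const chunk = remaining[0..cutpoint];
        _ = chunk; // hash, store, or compare the chunk here
        count += 1;
        offset += cutpoint;
    }
    return count;
}
```

Passing the tail slice rather than an offset keeps each call identical to the single-chunk example, so the only new logic is the bookkeeping around it.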
UltraCDC slides a window over your data and evaluates the "fingerprint" of each window using Hamming distance. When a fingerprint matches a pattern, it makes a cut. The algorithm is designed to:
zig build test
The tests cover edge cases like minimum-size data, low-entropy detection, and maximum chunk size enforcement.
zig build bench-find
The algorithm comes from: