A streaming, deterministic-by-design, allocation-conscious Content-Defined Chunking (CDC) engine for delta sync and deduplication systems.
chunkrsis a library crate, not a workload scheduler. It focuses on correct, fast, single-stream chunking and leaves concurrency orchestration, storage policy, and sync semantics to the application layer.
Byte Stream → CDC Boundary Detection → Chunk Assembly → Chunk Hashing → Output Stream
- Streaming-first - Process byte streams without full-file buffering
- High throughput - Saturate modern I/O without intra-file parallelism
- Deterministic - Identical inputs produce identical boundaries and hashes
- Zero-copy - Efficient
Bytesslicing with minimal allocations - Allocator-disciplined - Avoid contention under high throughput
- Std-quality API - Small, predictable, no hidden state
- Memory-safe -
#![forbid(unsafe_code)]throughout
chunkrs does not handle:
- Inter-file parallelism or thread pool management
- I/O scheduling, device throttling, or storage coordination
- Chunk persistence, deduplication indexing, or storage backends
- Network protocols, sync negotiation, or application-level logic
- HDD vs SSD vs NVMe detection or device-specific optimizations
These are application responsibilities - chunkrs provides pure CDC.
chunkrs implements a FastCDC-style rolling hash for boundary detection:
- Byte-by-byte rolling hash
- Mask-based boundary check
- Configurable minimum / average / maximum chunk sizes
Rolling hash is used only to decide where chunks end — never as a content identifier.
Each emitted chunk is finalized with a strong cryptographic hash (default: BLAKE3):
- Chunk hash defines identity
- Used for deduplication, delta sync, verification, ect.
- Rolling hash state does not affect identity
- Identical byte streams + identical configuration → identical chunk boundaries
- Identical byte streams + identical configuration → identical chunk hashes
- CDC behavior is byte-by-byte serial, ensuring deterministic boundaries regardless of:
- Input batch sizes (1 byte vs 1MB vs streaming)
- Number of
push()calls - Call timing
The FastCDC algorithm processes each byte sequentially, maintaining rolling hash state across all calls. This ensures:
- Exact boundary determinism - same byte positions always produce same boundaries
- No dependency on execution strategy or batching patterns
- Perfect reproducibility across different streaming scenarios
let mut chunker = Chunker::new(config);
let (chunks, pending) = chunker.push(data_bytes);
let final_chunk = chunker.finish();push(Bytes)- Feed data in any size (1 byte to megabytes)finish()- Emit final incomplete chunk when stream ends- Returns -
(Vec<Chunk>, Bytes)- Complete chunks and pending bytes
- Chunk data is sliced directly from input
Bytes- no copying - Caller owns the underlying memory
- Pending bytes held internally only between
push()calls
- Caller must process/drop chunks promptly (accumulating may OOM)
- Caller controls backpressure and memory management
- No global buffer pools or cross-thread state
CDC is inherently serial over a byte stream:
- Rolling hash state at byte
ndepends on bytes[0..n) - Input may be split into batches via multiple
push()calls, but state persists - Implementation processes bytes one-by-one for exact determinism
Therefore:
chunkrsdoes not parallelize CDC within a file- Modern CPUs are sufficient to saturate I/O bandwidth without intra-file parallelism
Applications achieve parallelism by:
- Running multiple
Chunkerinstances on different streams - Using async executors (tokio) with blocking tasks
- Managing their own thread pools
The library provides pure CDC - concurrency and I/O orchestration are application responsibilities.
chunkrs accepts Bytes from any source and emits Chunk objects:
- Input: Files, network, buffers - any source providing
Bytes - Output: Chunk with hash, length, offset, and zero-copy payload
- Errors: Localized to stream, no global state corruption
- Recovery: Checkpointing/resume is application's responsibility
The crate does not persist, index, or manage chunks.
| Aspect | fastcdc | chunkrs |
|---|---|---|
| Streaming API | Limited | First-class |
| Zero-copy | No | Yes |
| Rust edition | 2018 | 2024+ |
| API quality | Experimental | Std-style |
chunkrs focuses on API quality and streaming correctness, not just speed.
chunkrs is a deterministic, streaming, zero-copy CDC engine with a simple push/finish API. Byte-by-byte processing ensures exact determinism, while Bytes slicing provides zero-copy efficiency. The library handles pure CDC - orchestration, I/O, and storage are application responsibilities.
chunkrs uses a flat API design for "small, composable primitive" positioning. All public types are accessible directly from the crate root:
chunkrs::Chunk
chunkrs::ChunkHash
chunkrs::Chunker
chunkrs::ChunkConfig
chunkrs::HashConfig
chunkrs::ChunkError
chunkrs/
├── lib.rs # Public API: pub use re-exports only
├── chunk/ # Private: Chunk, ChunkHash
├── chunker/ # Private: Chunker with push/finish API
├── config/ # Private: ChunkConfig, HashConfig
├── error/ # Private: ChunkError
├── cdc/ # Private: FastCDC rolling hash
├── hash/ # Private: BLAKE3 (feature-gated)
└── util/ # Private: Internal helpers
- Public modules: None - modules are private for code organization only
- Public API: Only
pub usere-exports inlib.rs - Internal sharing: Private modules use
pubfor crate-local sharing - No
pub(crate): Eliminated for cleaner boundaries
This design ensures:
- No duplicate access paths (e.g.,
chunkrs::Chunkvschunkrs::chunk::Chunk) - Minimal public API surface
- Clear separation between public API and implementation details