block: zstd compression (codec auto-detect) + deduplication-model docs by jaredLunde · Pull Request #73 · beyondoss/glidefs

jaredLunde · 2026-06-06T17:54:56Z

Summary

Switch newly written packs from LZ4 to zstd, keeping every existing LZ4 pack readable, and document the deduplication/addressing model the codebase kept forcing people to re-derive.

Why

Measured per 128 KiB block on real blessed images, zstd output is ~27% smaller than LZ4 at level 3 and ~37% smaller at level 19: a direct cut to S3 storage and egress, orthogonal to dedup. Reads get faster on net: smaller packs transfer less, and zstd decode (~60µs/block) is a measured ~0.1% of an S3 GET. Bless is offline and write-once/read-many, and zstd decode speed is roughly level-independent, so the most-read data uses the max level (zstd-19) for free.

How (mixed-codec safe, no format change)

Codec is detected on read by sniffing the zstd frame magic. A legacy LZ4 block's size-prefix can never collide (high byte 0x00 vs zstd's 0xFD), and per-block self-describing frames survive compaction's byte-reuse (a pack may legitimately hold both codecs). So old LZ4 packs read forever; content_pack_id is unchanged.
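
The sniffing rule above can be sketched as follows. This is a minimal illustration, not the PR's `decompress_block`: it assumes the legacy LZ4 prefix is a little-endian `u32` holding the decompressed size, and the names `detect_codec`, `Codec`, `DetectError`, and `MAX_BLOCK_LEN` are hypothetical. The zstd magic value itself comes from the zstd frame format.

```rust
/// Magic number that opens every zstd frame, read as a little-endian u32.
/// On the wire the bytes are 28 B5 2F FD, so the high byte is 0xFD.
const ZSTD_MAGIC: u32 = 0xFD2F_B528;

/// Assumed to mirror the 2 MiB guard mentioned in the PR (hypothetical name).
const MAX_BLOCK_LEN: u32 = 2 * 1024 * 1024;

#[derive(Debug, PartialEq, Eq)]
enum Codec {
    Zstd,
    /// Legacy block; carries the decompressed size read from the prefix.
    Lz4 { decompressed_len: u32 },
}

#[derive(Debug, PartialEq, Eq)]
enum DetectError {
    TooShort,
    SizeOverGuard(u32),
}

/// Classify one stored block by its first four bytes.
/// Assumption: legacy LZ4 blocks start with a little-endian u32 size prefix.
fn detect_codec(block: &[u8]) -> Result<Codec, DetectError> {
    if block.len() < 4 {
        return Err(DetectError::TooShort);
    }
    let word = u32::from_le_bytes([block[0], block[1], block[2], block[3]]);
    if word == ZSTD_MAGIC {
        Ok(Codec::Zstd)
    } else if word <= MAX_BLOCK_LEN {
        // Any size within the guard has a high byte of 0x00, so it can
        // never be mistaken for the zstd magic.
        Ok(Codec::Lz4 { decompressed_len: word })
    } else {
        Err(DetectError::SizeOverGuard(word))
    }
}
```

Because the decision is made per block, a pack that mixes both codecs after compaction needs no pack-level flag in this sketch.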

- block_map: compress_block(data, level) / decompress_block (auto-detect, 2 MiB guard) / zstd_compress / CompressError; COMPRESSION_LZ4 sentinel, RUNTIME_DEFAULT (zstd-1), BLESS (zstd-19).
- Default codec is zstd-1 (GLIDEFS_COMPRESSION_LEVEL overrides; 0 pins LZ4). Carried on CacheInner (atomic, set once before flush) rather than WriteCacheConfig, which avoids churning 83 struct literals. bless overrides to zstd-19.
- Read paths (read.rs ×3, compact.rs) and both compress paths (flush.rs, ext4_store::store_ext4_stream) routed through the new helpers.
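
A rough sketch of how the level selection described above might fit together. The constant values follow the PR text (0 pins LZ4, zstd-1 at runtime, zstd-19 for bless); everything else is an assumption for illustration: the `resolve_level` helper, the fallback to the default on unparsable or out-of-range input, the 1..=22 range check, and the shape of `CacheInner` are all hypothetical.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Assumed sentinel value: level 0 means "write LZ4".
const COMPRESSION_LZ4: i32 = 0;
/// zstd level used for ordinary runtime flushes.
const RUNTIME_DEFAULT: i32 = 1;
/// zstd level used by bless.
const BLESS: i32 = 19;
/// Highest regular zstd level.
const ZSTD_MAX_LEVEL: i32 = 22;

/// Turn the raw value of GLIDEFS_COMPRESSION_LEVEL into a level.
/// Assumption: anything unparsable or out of range falls back to the default.
fn resolve_level(env_value: Option<&str>) -> i32 {
    match env_value.and_then(|v| v.trim().parse::<i32>().ok()) {
        Some(COMPRESSION_LZ4) => COMPRESSION_LZ4,
        Some(level) if (1..=ZSTD_MAX_LEVEL).contains(&level) => level,
        _ => RUNTIME_DEFAULT,
    }
}

/// Hypothetical stand-in for the real CacheInner: the level lives in an
/// atomic so it can be set once before the first flush without touching
/// the config struct.
struct CacheInner {
    compression_level: AtomicI32,
}

impl CacheInner {
    fn new(level: i32) -> Self {
        Self { compression_level: AtomicI32::new(level) }
    }

    fn set_compression_level(&self, level: i32) {
        self.compression_level.store(level, Ordering::Relaxed);
    }

    fn compression_level(&self) -> i32 {
        self.compression_level.load(Ordering::Relaxed)
    }
}
```

In this sketch a caller would pass `std::env::var("GLIDEFS_COMPRESSION_LEVEL").ok().as_deref()` to `resolve_level`, and the bless path would call `set_compression_level(BLESS)` before flushing.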

Docs

ARCHITECTURE.md gains a Deduplication Model section (three tiers / three granularities; the deliberate position-addressed-S3 vs content-addressed-cache asymmetry and its consequences) + corrected flex_bg on-disk layout + WriterOption table. README.md gains Core Properties including the addressing split as a first-class property.
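
The addressing asymmetry can be illustrated with a toy model. This is not glidefs code: `PositionStore` and `ContentStore` are hypothetical, blocks are keyed by their raw bytes as a stand-in for a content hash, and real S3 packs are addressed at whole-pack granularity rather than per block.

```rust
use std::collections::{HashMap, HashSet};

/// Position-addressed: a block is identified by where it lives.
#[derive(Default)]
struct PositionStore {
    /// Key is (pack_id, offset); both names are illustrative.
    blocks: HashMap<(u64, u64), Vec<u8>>,
}

impl PositionStore {
    fn put(&mut self, pack_id: u64, offset: u64, block: &[u8]) {
        self.blocks.insert((pack_id, offset), block.to_vec());
    }

    fn stored_bytes(&self) -> usize {
        self.blocks.values().map(|b| b.len()).sum()
    }
}

/// Content-addressed: a block is identified by what it contains.
#[derive(Default)]
struct ContentStore {
    /// The raw bytes act as the key here; a real cache would use a hash.
    blocks: HashSet<Vec<u8>>,
}

impl ContentStore {
    fn put(&mut self, block: &[u8]) {
        // Inserting identical content a second time is a no-op.
        self.blocks.insert(block.to_vec());
    }

    fn stored_bytes(&self) -> usize {
        self.blocks.iter().map(|b| b.len()).sum()
    }
}
```

Writing the same 4 KiB block at two positions costs 8 KiB in the position-addressed store and 4 KiB in the content-addressed one, which is the density advantage the cache is optimized for.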

Testing

- 423 lib + 220 integration + 10 ublk zero-copy tests pass on zstd-1 by default (flipping the default means these suites exercise zstd end-to-end: snapshots, compaction, fork, crash-recovery, data-safety).
- New: zstd roundtrip, codec auto-detect, legacy-LZ4 read, old-LZ4-rejects-zstd-frame (documents the rollback floor), mixed-codec pack, end-to-end zstd flush→S3→cold-read (levels 1/3/19, asserts the stored frame is genuinely zstd), mixed-codec-across-flushes cold read.
- Fuzz target renamed to fuzz_decompress_block (exercises both branches).
- compress_probe bin: per-block ratio + decode-speed measurement.
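
For readers who want to reproduce the kind of numbers quoted in this PR, the two measurements can be sketched with the standard library alone. This is not the `compress_probe` source: `savings_percent` and `time_decode` are hypothetical helpers, and the decoder is passed in as a closure so the sketch does not depend on any particular compression crate.

```rust
use std::time::{Duration, Instant};

/// Percentage by which `candidate_len` is smaller than `baseline_len`.
/// Returns 0.0 for an empty baseline to avoid dividing by zero.
fn savings_percent(baseline_len: usize, candidate_len: usize) -> f64 {
    if baseline_len == 0 {
        return 0.0;
    }
    (1.0 - candidate_len as f64 / baseline_len as f64) * 100.0
}

/// Average wall-clock time of one `decode` call over `iterations` runs.
/// `decode` returns the decoded length so the work cannot be optimized away.
fn time_decode<F>(iterations: u32, mut decode: F) -> Duration
where
    F: FnMut() -> usize,
{
    let start = Instant::now();
    let mut sink = 0usize;
    for _ in 0..iterations {
        sink = sink.wrapping_add(decode());
    }
    std::hint::black_box(sink);
    // Guard against iterations == 0 when averaging.
    start.elapsed() / iterations.max(1)
}
```

With these helpers, an LZ4 block of 100 bytes against a zstd block of 73 bytes reports roughly 27% savings, matching how the level-3 figure in the summary would be expressed.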

Rollout (single-shot — forward-only)

Once a zstd pack is written, a pre-this-change binary cannot read it: it hard-fails cleanly (the 2 MiB guard trips on zstd's 0xFD high byte) — never silent corruption. Deploy is forward-only; GLIDEFS_COMPRESSION_LEVEL=0 can pin LZ4 if a staged rollout is ever wanted.
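
The clean hard-fail can be seen by following what an older reader would do with a zstd frame. A minimal sketch, assuming the legacy reader interprets the first four bytes as a little-endian `u32` decompressed size and checks it against the 2 MiB guard; `legacy_lz4_declared_len` and `LegacyReadError` are hypothetical names, not the pre-change code.

```rust
/// Assumed to mirror the 2 MiB guard in the pre-change reader.
const MAX_BLOCK_LEN: usize = 2 * 1024 * 1024;

#[derive(Debug, PartialEq, Eq)]
enum LegacyReadError {
    TooShort,
    SizeOverGuard(usize),
}

/// Return the decompressed size a legacy LZ4-only reader would trust,
/// or an error if the prefix exceeds the guard.
fn legacy_lz4_declared_len(block: &[u8]) -> Result<usize, LegacyReadError> {
    if block.len() < 4 {
        return Err(LegacyReadError::TooShort);
    }
    let declared =
        u32::from_le_bytes([block[0], block[1], block[2], block[3]]) as usize;
    if declared > MAX_BLOCK_LEN {
        // A zstd frame starts with 28 B5 2F FD, which reads as 0xFD2FB528
        // (about 4 GiB), so it always lands here instead of being decoded.
        Err(LegacyReadError::SizeOverGuard(declared))
    } else {
        Ok(declared)
    }
}
```

Under these assumptions the old binary rejects the block before attempting any LZ4 decode, which is why the failure is an error rather than silent corruption.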

🤖 Generated with Claude Code

Capture the model the codebase kept making people re-derive. Dedup happens in three tiers at three granularities: lineage CoW (shared packs along ancestry), the content-addressed host clean cache (per-block, host-global), and position-addressed S3 packs (whole-pack).

Explains the deliberate asymmetry (S3 is position-addressed for range-read/request economics; the cache is content-addressed for density; they optimize opposite things) and its consequences: cross-lineage overlap isn't deduped in S3 except via --layered; alignment helps the cache, not intra-rootfs S3; more S3 dedup is only pack/layer-granular.

- ARCHITECTURE.md: new "Deduplication Model" section; corrected on-disk layout (flex_bg, reserved backup-superblock holes); WriterOption table.
- README.md: "Core Properties" section incl. the position-vs-content addressing split as a first-class property.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Switch new packs from LZ4 to zstd: ~27% smaller at level 3, ~37% at 19 (measured per-128KiB-block on real images), cutting S3 storage and egress. Reads get net faster: smaller packs transfer less and zstd decode (~60µs/block) is ~0.1% of an S3 GET. Orthogonal to dedup.

Codec is detected on read by sniffing the zstd frame magic, so there is no on-disk format change and no pack-version bump. A legacy LZ4 block's size-prefix can never collide (high byte 0x00 vs zstd's 0xFD), and per-block self-describing frames survive compaction's byte-reuse (a pack may legitimately hold both codecs). So existing LZ4 packs stay readable forever; content_pack_id is unchanged.

- block_map: compress_block(data, level) / decompress_block (auto-detect, 2 MiB guard) / zstd_compress / CompressError; COMPRESSION_LZ4 sentinel, RUNTIME_DEFAULT (zstd-1), BLESS (zstd-19).
- Default codec is zstd-1 (env GLIDEFS_COMPRESSION_LEVEL overrides; 0 = pin LZ4). Carried on CacheInner (atomic, set once before flush) rather than WriteCacheConfig to avoid churning 83 literals. bless overrides to zstd-19 (offline, write-once/read-many; decode is ~level-independent).

Read paths (read.rs x3, compact.rs) and both compress paths (flush.rs, ext4_store store_ext4_stream) routed through the new helpers.

Tests: zstd roundtrip, codec auto-detect, legacy-LZ4 read, old-LZ4-rejects-zstd-frame (documents the single-shot rollback floor), mixed-codec pack, end-to-end zstd flush->S3->cold-read (levels 1/3/19, asserts the stored frame is actually zstd), mixed-codec-across-flushes cold read. Fuzz target renamed to fuzz_decompress_block (both branches). 423 lib + 220 integration + 10 ublk zero-copy tests pass on zstd-1 by default.

compress_probe bin: per-block LZ4-vs-zstd ratio + decode-speed measurement.

NOTE (single-shot rollout): once a zstd pack is written, a pre-this-change binary cannot read it (clean hard-fail, never corruption). Forward-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two CI failures from the codec swap:

- cli/bless.rs tests decompressed bless output with raw lz4_decompress, but bless now writes zstd-19 → use the codec-detecting decompress_block. These are test-utils-gated, so a plain `cargo test --lib` missed them; CI runs `--features test-utils`.
- The fuzz CI step still invoked the old target `fuzz_lz4_decompress`; point it at the renamed `fuzz_decompress_block`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The docs predated the codec swap and still described LZ4 as *the* compressor. Update the write/read/sync paths, the Pack term, the on-disk pack format, the integrity/verification chain, and the Compression section to: zstd by default (runtime zstd-1, bless zstd-19; GLIDEFS_COMPRESSION_LEVEL=0 pins LZ4), with per-block codec auto-detection on read so legacy LZ4 packs stay readable. The remaining LZ4 mentions are intentional (legacy/auto-detect context).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jaredLunde and others added 4 commits June 6, 2026 10:54

jaredLunde merged commit e77676a into main Jun 6, 2026
24 checks passed

jaredLunde deleted the jared/zstd branch June 6, 2026 18:37
