ext4: group-aware allocator, metadata-aware dedup alignment, e2fsck-gated harness by jaredLunde · Pull Request #72 · beyondoss/glidefs

jaredLunde · 2026-06-06T04:25:12Z

Summary

Improves host/S3 deduplication by aligning large file payloads in blessed base images to the dedup block grid — and, in the process, fixes a pre-existing shipping data-corruption bug in the ext4 writer that was masked by a lenient in-crate reader.

The work is gated end-to-end on kernel-grade oracles (e2fsck, loop-mount), not the crate's own reader.

The dedup win

Fixed 128 KiB blocks over ext4 miss byte-identical content placed at different offsets (positional windows). On real images this loses ~half the achievable dedup. Aligning large file payloads to the grid makes the existing content-addressing fire:

- cross-image dedup 26% → 52% (3-image probe set), matching the file-level ground truth (52%) and capturing 99% of the FastCDC ceiling
- range reads preserved (no CDC); the host content-addressed cache + S3 packs dedup the aligned blocks for free
- padding is holes/zeros the block store drops → costs address space, not stored bytes
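
The positional-window effect can be reproduced in miniature. The sketch below is illustrative only: it uses an 8-byte block instead of 128 KiB, compares block bytes directly as a stand-in for content addressing, and all function names are hypothetical rather than taken from the crate.

```rust
use std::collections::HashSet;

/// Illustrative block size; the PR's dedup grid is 128 KiB.
const BLOCK: usize = 8;

/// Build an image: `offset` bytes of filler followed by the payload.
fn image_with_payload(offset: usize, filler: u8, payload: &[u8]) -> Vec<u8> {
    let mut img = vec![filler; offset];
    img.extend_from_slice(payload);
    img
}

/// Round `offset` up to the next multiple of BLOCK.
fn align_up(offset: usize) -> usize {
    (offset + BLOCK - 1) / BLOCK * BLOCK
}

/// Count distinct fixed-size blocks that appear in both images.
/// Comparing raw block bytes stands in for comparing content hashes.
fn shared_blocks(a: &[u8], b: &[u8]) -> usize {
    let sa: HashSet<&[u8]> = a.chunks(BLOCK).collect();
    let sb: HashSet<&[u8]> = b.chunks(BLOCK).collect();
    sa.intersection(&sb).count()
}
```

With a 32-byte payload of distinct bytes placed at offsets 3 and 5, `shared_blocks` finds nothing in common; placing the same payload at `align_up(3)` and `align_up(5)` makes all four payload blocks identical across the two images.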

The bug it surfaced

Validating with the real e2fsck (after fixing a stray s_journal_uuid that made e2fsck abort) revealed that multi-block-group images (>128 MiB) placed file data on the Group 1 backup superblock (block 32768) → multiply-claimed blocks the kernel rejects → /home, /media etc. read back as "Structure needs cleaning". Pre-existing, independent of alignment, hidden by the lenient reader.

Fix: a group-aware allocator skips reserved backup-superblock/GDT blocks and fragments files around them. Alignment then composes on top (metadata-aware placement + padding holes cleared from the bitmap).
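
A rough sketch of what "skips reserved blocks and fragments files around them" means, assuming 4 KiB filesystem blocks (32768 blocks per group, so Group 1 starts at block 32768) and the standard sparse_super rule that backups live in groups 0, 1 and powers of 3, 5 and 7. The function names and the `reserved_len` parameter (backup superblock plus however many GDT blocks follow it) are hypothetical; this is not the crate's `physical_runs`.

```rust
/// Blocks per group for a 4 KiB block size (one bitmap block holds 8 * 4096 bits).
const BLOCKS_PER_GROUP: u64 = 32_768;

/// True if `n` is a power of `base` (including base^0 == 1).
fn is_power_of(mut n: u64, base: u64) -> bool {
    if n == 0 {
        return false;
    }
    while n % base == 0 {
        n /= base;
    }
    n == 1
}

/// sparse_super: superblock/GDT copies live in groups 0, 1 and powers of 3, 5, 7.
fn has_backup_superblock(group: u64) -> bool {
    group <= 1 || is_power_of(group, 3) || is_power_of(group, 5) || is_power_of(group, 7)
}

/// A block is reserved if it is one of the first `reserved_len` blocks
/// of a group that carries a superblock/GDT copy.
fn is_reserved(block: u64, reserved_len: u64) -> bool {
    has_backup_superblock(block / BLOCKS_PER_GROUP) && block % BLOCKS_PER_GROUP < reserved_len
}

/// Lay out `count` data blocks starting at `start`, skipping reserved blocks.
/// Returns (first_block, length) runs; a file that crosses a reserved region
/// comes back as more than one run, i.e. it is fragmented around it.
fn physical_runs_sketch(start: u64, count: u64, reserved_len: u64) -> Vec<(u64, u64)> {
    let mut runs: Vec<(u64, u64)> = Vec::new();
    let mut block = start;
    let mut remaining = count;
    while remaining > 0 {
        if is_reserved(block, reserved_len) {
            block += 1; // never place data here
            continue;
        }
        let extends_last = matches!(runs.last(), Some(&(s, l)) if s + l == block);
        if extends_last {
            runs.last_mut().unwrap().1 += 1;
        } else {
            runs.push((block, 1));
        }
        block += 1;
        remaining -= 1;
    }
    runs
}
```

For example, four blocks starting at 32766 with two reserved blocks at the head of Group 1 yield the runs `(32766, 2)` and `(32770, 2)`. The sketch steps one block at a time for clarity; a real allocator would jump over reserved regions in one move.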

Changes

- ext4 writer: s_journal_uuid fix; group-aware allocator (write_file_data, physical_runs, rewritten extent emission); metadata-aware WriterOption::AlignData (opt-in) with free-hole bitmap accounting.
- bless: all three paths (CLI merged, CLI layered, server-side background) enable AlignData at the 128 KiB block grid, with device-size headroom for padding.
- harness (ext4/tests/fsck_validity.rs): e2fsck-oracle integration tests covering single/multi-group clean, content-survives-fragmentation, a property fuzzer over random multi-group layouts (both align modes; EXT4_FUZZ_SEEDS to scale; ran clean at 64 seeds = 128 images), and an opt-in kernel_mount_content check (real loop-mount, byte-exact; EXT4_MOUNT_TEST=1).
- dedup_probe bin: empirical dedup measurement (replaces the unsound oci_dedup_measure model).
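
To illustrate how a seed-driven fuzzer stays reproducible, here is a minimal sketch of a deterministic fileset generator. It assumes a SplitMix64 generator and arbitrary small/medium/large size classes; the harness's actual generator, size ranges, and default seed count are not shown in this PR, so every name and constant below is hypothetical.

```rust
/// SplitMix64: a tiny deterministic PRNG, so a failing seed can be replayed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Generate `count` file sizes (bytes) mixing small, medium and large files.
/// The class boundaries are illustrative, not the harness's real ones.
fn random_fileset(seed: u64, count: usize) -> Vec<u64> {
    const KIB: u64 = 1024;
    let mut state = seed;
    (0..count)
        .map(|_| {
            let class = splitmix64(&mut state) % 3;
            let r = splitmix64(&mut state);
            match class {
                0 => 1 + r % (16 * KIB),                  // small: 1 B ..= 16 KiB
                1 => 16 * KIB + r % (1024 * KIB),         // medium: up to ~1 MiB
                _ => 1024 * KIB + r % (7 * 1024 * KIB),   // large: under 8 MiB
            }
        })
        .collect()
}

/// Parse a seed count such as the value of EXT4_FUZZ_SEEDS, falling back
/// to `default` when the variable is unset or not a number.
fn parse_seed_count(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}
```

A test loop could then call `parse_seed_count(std::env::var("EXT4_FUZZ_SEEDS").ok().as_deref(), 8)` and build one aligned and one unaligned image per seed, which is how 64 seeds would produce 128 images.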

Validation

- e2fsck clean on synthetic + real (debian, python:3.12) images
- kernel loop-mount reads every file byte-exact, both aligned and unaligned
- determinism preserved; full ext4 suite (52 tests) + harness green
- live bless of debian (single-group) and python:3.12 (multi-group) completes end-to-end with alignment on

Notes / follow-ups

- To realize the win in production, base images must be rebuilt with the new bless (previously-blessed images intentionally not migrated).
- Worth a follow-up: a shared bless_writer_options() helper so the three bless sites can't drift (this PR already had one missing call site); would also unify a random-vs-deterministic UUID inconsistency in the server-side path.

🤖 Generated with Claude Code

Two latent, shipping bugs in the ext4 image writer, both surfaced by gating validation on the real `e2fsck`/kernel instead of the lenient in-crate reader:

1. s_journal_uuid was stamped with the filesystem UUID. That field names an *external* journal device, so the kernel and e2fsck searched for a nonexistent external journal and aborted before checking anything, meaning no image this writer produced could ever be fsck-validated. Zeroed it (internal journal carries its UUID in the jbd2 superblock).

2. Multi-block-group images (>128 MiB) placed file data on the Group 1 backup superblock + group descriptors (block 32768), producing multiply-claimed blocks the kernel rejects: real corruption (e.g. /home, /media read back as "Structure needs cleaning"). New group-aware allocator: file data skips the reserved backup-SB/GDT blocks at sparse_super group boundaries and fragments around them (write_file_data, physical_runs, rewritten extent emission keyed on data_start_block). Unfragmented files produce byte-identical output to before.

Validated on synthetic and real (python:3.12-slim) images: e2fsck clean, kernel mount reads all files, byte-exact content across fragmentation, determinism preserved, full ext4 suite green.

Adds:
- ext4/tests/fsck_validity.rs: e2fsck-oracle integration harness.
- glidefs/src/bin/dedup_probe.rs: empirical dedup probe (replaces the unsound oci_dedup_measure model).
- WriterOption::AlignData: opt-in, gated off, marked KNOWN-LIMITATION (block-alignment dedup work in progress; not yet metadata-aware).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
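
The s_journal_uuid condition is easy to check directly on an image. The sketch below assumes the field offsets from the ext4 on-disk layout documentation (superblock at byte 1024, s_magic at 0x38, s_journal_uuid at 0xD0, 16 bytes long); the function name is hypothetical and is not part of the crate or the harness.

```rust
/// Byte offset of the primary superblock within the image.
const SUPERBLOCK_OFFSET: usize = 1024;
const SUPERBLOCK_LEN: usize = 1024;
/// Field offsets inside the superblock (per the ext4 layout docs).
const S_MAGIC: usize = 0x38;
const S_JOURNAL_UUID: usize = 0xD0;
const EXT4_MAGIC: u16 = 0xEF53;

/// Returns Ok(true) when s_journal_uuid is all zeros, which is what an
/// image with an internal journal should carry according to this PR.
fn journal_uuid_is_zero(image: &[u8]) -> Result<bool, String> {
    let sb = image
        .get(SUPERBLOCK_OFFSET..SUPERBLOCK_OFFSET + SUPERBLOCK_LEN)
        .ok_or("image too small to hold a superblock")?;
    let magic = u16::from_le_bytes([sb[S_MAGIC], sb[S_MAGIC + 1]]);
    if magic != EXT4_MAGIC {
        return Err(format!("bad superblock magic {magic:#06x}"));
    }
    Ok(sb[S_JOURNAL_UUID..S_JOURNAL_UUID + 16].iter().all(|&b| b == 0))
}
```

A check like this is cheap enough to run on every generated image, but it only covers this one field; e2fsck remains the real oracle.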

…rness

Completes the block-alignment dedup work on top of the group-aware allocator. Alignment (WriterOption::AlignData) is now correct and e2fsck-clean:

- Composes with reserved-block skipping: when an aligned file start lands on a backup-superblock/GDT block, write_file_data skips past it before recording data_start_block.
- Padding gaps are unreferenced free space, not data. record_free_hole tracks them (excluding reserved metadata blocks), and close() clears them from the otherwise-dense block bitmap so the free counts are correct. Previously the dense bitmap marked padding as used, which e2fsck rejected.

Validated end to end on real images (python:3.12/3.13-slim): aligned builds are e2fsck-clean, kernel-mount and read byte-exact, deterministic, and realize the dedup win: 26%->52% cross-image (3-image set), matching the file-level ground truth (52%) and capturing 98% of the FastCDC ceiling, with zero stored cost (padding is dropped zeros).

Adds fuzz_multigroup_validity_and_content: random multi-group filesets (mixed small/medium/large) must be e2fsck-clean AND read back byte-exact, both unaligned and aligned. Deterministic seeds reproduce failures; EXT4_FUZZ_SEEDS scales coverage (ran clean at 64 seeds = 128 images). This is the generalized gate over the size/position space where the data-on-metadata and alignment-bitmap bugs lived.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
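
The free-hole accounting can be sketched as follows. This is a simplified model under stated assumptions, not the crate's `record_free_hole` or `close()`: the bitmap starts fully set ("dense"), holes are recorded as (start, length) block ranges, and a caller-supplied predicate marks reserved metadata blocks that must stay in use. All type and function names are hypothetical.

```rust
/// Block bitmap that starts out dense: every block is marked used.
struct BlockBitmap {
    bits: Vec<u8>,
    blocks: u64,
}

impl BlockBitmap {
    fn dense(blocks: u64) -> Self {
        BlockBitmap { bits: vec![0xFF; ((blocks + 7) / 8) as usize], blocks }
    }

    fn is_used(&self, block: u64) -> bool {
        (self.bits[(block / 8) as usize] & (1u8 << (block % 8))) != 0
    }

    fn clear(&mut self, block: u64) {
        self.bits[(block / 8) as usize] &= !(1u8 << (block % 8));
    }

    /// Number of blocks within the device that are marked free.
    fn free_count(&self) -> u64 {
        (0..self.blocks).filter(|&b| !self.is_used(b)).count() as u64
    }
}

/// Remember a padding gap so it can be freed later. Empty gaps are ignored.
fn record_free_hole_sketch(holes: &mut Vec<(u64, u64)>, start: u64, len: u64) {
    if len > 0 {
        holes.push((start, len));
    }
}

/// At close time, clear every hole block from the bitmap unless it is a
/// reserved metadata block, which must remain marked as in use.
fn clear_holes(bitmap: &mut BlockBitmap, holes: &[(u64, u64)], is_reserved: impl Fn(u64) -> bool) {
    for &(start, len) in holes {
        for block in start..start + len {
            if !is_reserved(block) {
                bitmap.clear(block);
            }
        }
    }
}
```

With a 64-block bitmap, a hole of six blocks starting at block 10, and block 12 treated as reserved, five blocks end up free and block 12 stays marked used, so the free count agrees with what is actually unreferenced.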

Wire the validated dedup alignment into both bless paths:

- run_bless_oci (merged image) and run_bless_oci_layered via layer_store (per-layer) now pass AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE } (128 KiB = the volume's block size). Threshold = one full dedup block: measured as the sweet spot, capturing 99% of the FastCDC ceiling (34% cross-image reduction on the 3-image probe set) with less logical inflation than a 16 KiB threshold.
- device_size estimates get headroom for alignment padding (bless x3->x4, layer_store x2->x3). Padding is holes/zeros the block store drops, so it costs address space, not stored bytes.

Automate the strongest oracle: kernel_mount_content loop-mounts the image with the real Linux ext4 driver and verifies every file byte-exact against known input, for both aligned and unaligned builds. Opt-in via EXT4_MOUNT_TEST=1 (needs root/passwordless sudo), skips by default so CI stays green; runs in a privileged/nightly job. Verified passing locally.

dedup_probe: alignment threshold is now configurable via DEDUP_ALIGN_THRESHOLD for measurement sweeps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
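
How the `align` / `min_size` pair might drive placement, as a minimal sketch. The two field names come from the commit message; the struct definition, the `place` function, the byte-offset model, and the choice of `>=` for the threshold are assumptions made for illustration.

```rust
/// 128 KiB, the dedup block size named in the commit message.
const BLOCK_SIZE: u64 = 128 * 1024;

/// Stand-in for the writer option's payload (field names from the PR).
#[derive(Clone, Copy, Debug)]
struct AlignData {
    align: u64,
    min_size: u64,
}

/// Decide where a file's data starts, given the current write cursor.
/// Returns (start_offset, padding_bytes). Files below `min_size` are
/// packed as-is; larger files start on the next `align` boundary.
fn place(cursor: u64, file_size: u64, opt: AlignData) -> (u64, u64) {
    if opt.align == 0 || file_size < opt.min_size {
        return (cursor, 0);
    }
    let start = (cursor + opt.align - 1) / opt.align * opt.align;
    (start, start - cursor)
}
```

With `AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE }` and the cursor at byte 5000, a 200,000-byte file starts at 131072 with 126072 bytes of padding, while a 100-byte file stays at 5000 with none. The padding is what the device-size headroom has to absorb.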

router.rs run_bless_oci_task (the in-server background bless triggered via the API) was missing the AlignData wiring and device-size headroom that the two CLI bless paths got. Wire it consistently so every bless path produces grid-aligned, cross-image-deduppable base images.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The group-aware allocator skipped reserved block-group backup-superblock/GDT blocks for FILE DATA, but the journal, inode table, and bitmaps are written contiguously at close() via raw writes and could still land on (or straddle) them. Since bless enables a 4 MiB journal, this was a real corruption path: e.g. a ~124 MiB image puts the journal inode across block 32768 (Group 1's backup superblock) -> multiply-claimed block the kernel rejects. The inode table straddles the same way for workloads with many files. Caught by running the fsck harness with the production journal config (it previously ran journal-less).

Fix: reserve_contiguous(n) places a contiguous run clear of reserved regions (skipping past any it would straddle, recording the gap as a free hole); used for the journal, inode table, and bitmaps.

Tests now run with Journal(1024) (matching bless) and add:
- fsck_journal_straddles_group_boundary: sweeps 120-136 MiB
- fsck_inode_table_straddles_boundary: many-file workloads at the boundary

Both verified to fail before the fix (journal: inode 8 multiply-claim; inode table: "Group 1's inode table at 32768 conflicts"). Full suite + 64-seed fuzz + kernel mount green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
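
The idea behind reserve_contiguous(n) can be sketched like this. It is a simplified model, not the crate's implementation: reserved regions are passed in as non-empty (start, length) block ranges, and the function returns both the chosen start and the free gaps it skipped. The name and signature are hypothetical.

```rust
/// Find a start block for `n` contiguous blocks at or after `cursor` such
/// that the run overlaps none of the `reserved` (start, len) ranges.
/// Each range must have len > 0. Returns the start plus the free gaps
/// that were skipped, so the caller can record them as holes.
fn reserve_contiguous_sketch(
    cursor: u64,
    n: u64,
    reserved: &[(u64, u64)],
) -> (u64, Vec<(u64, u64)>) {
    let mut start = cursor;
    let mut holes = Vec::new();
    // Hop past reserved regions until [start, start + n) is clear of all of them.
    while let Some(&(rs, rlen)) = reserved
        .iter()
        .find(|&&(rs, rlen)| rs < start + n && start < rs + rlen)
    {
        if rs > start {
            // Blocks between the old start and the reserved region stay free.
            holes.push((start, rs - start));
        }
        start = rs + rlen;
    }
    (start, holes)
}
```

For a 1024-block journal (4 MiB at 4 KiB blocks) with the cursor at block 32000 and a two-block reserved region at 32768, the run would straddle the region, so the sketch moves it to block 32770 and reports the 768 skipped blocks as a hole.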

Bring the doc in line with actual behavior:

- Extent building: file data fragments around reserved blocks (write_file_data / physical_runs / write_extents), not always contiguous.
- New Block-Group Metadata Reservation section: sparse_super backup superblocks at block 32768 etc., data + close()-time structures skip them (reserve_contiguous), free-hole bitmap accounting; the multi-group corruption bug and why the in-crate reader hid it.
- On-disk layout: correct flex_bg layout (metadata clustered at the end), reserved holes, journal placement.
- WriterOption table: add Uuid, Journal (s_journal_uuid must be zero), AlignData.
- Fix stale "why no journal" (now optional, bless enables it) and the zero-UUID determinism/checksum claims.
- Testing: document the fsck_validity e2fsck/mount/fuzz harness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jaredLunde and others added 6 commits June 5, 2026 18:25

jaredLunde merged commit 74b07b2 into main Jun 6, 2026
24 checks passed

jaredLunde deleted the jared/good branch June 6, 2026 05:09
