ext4: group-aware allocator, metadata-aware dedup alignment, e2fsck-gated harness#72

Merged
jaredLunde merged 6 commits into main from jared/good on Jun 6, 2026

@jaredLunde
Contributor

Summary

Improves host/S3 deduplication by aligning large file payloads in blessed base images to the dedup block grid — and, in the process, fixes a pre-existing shipping data-corruption bug in the ext4 writer that was masked by a lenient in-crate reader.

The work is gated end-to-end on kernel-grade oracles (e2fsck, loop-mount), not the crate's own reader.

The dedup win

Fixed 128 KiB blocks over an ext4 image miss byte-identical content that lands at different offsets within a block (positional windows), so the same file in two images hashes to different blocks. On real images this loses roughly half the achievable dedup. Aligning large file payloads to the grid lets the existing content addressing match them:

  • cross-image dedup 26% → 52% (3-image probe set), matching the file-level ground truth (52%) and capturing 99% of the FastCDC ceiling
  • range reads preserved (no CDC); the host content-addressed cache + S3 packs dedup the aligned blocks for free
  • padding is holes/zeros the block store drops → costs address space, not stored bytes
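
The effect described above can be reproduced on a toy scale. The sketch below is not the crate's code: `align_up`, `build_image`, and `shared_blocks` are hypothetical helpers, and the test uses an 8-byte block instead of 128 KiB so the arithmetic is easy to follow. It places the same payload after prefixes of different lengths and counts how many fixed-size blocks the two images have in common, with and without zero padding up to the grid.

```rust
use std::collections::HashSet;

/// Round `offset` up to the next multiple of `align` (`align` must be non-zero).
fn align_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) / align * align
}

/// Lay out `prefix` followed by `payload`. When `align` is set, zero-pad
/// after the prefix so the payload starts on the block grid.
fn build_image(prefix: &[u8], payload: &[u8], align: Option<usize>) -> Vec<u8> {
    let mut image = prefix.to_vec();
    if let Some(a) = align {
        image.resize(align_up(image.len(), a), 0);
    }
    image.extend_from_slice(payload);
    image
}

/// Count the fixed-size blocks of `b` whose bytes also occur as a block of `a`.
fn shared_blocks(a: &[u8], b: &[u8], block: usize) -> usize {
    let seen: HashSet<&[u8]> = a.chunks(block).collect();
    b.chunks(block).filter(|c| seen.contains(c)).count()
}
```

With a 32-byte payload behind a 3-byte prefix in one image and a 5-byte prefix in the other, the unaligned images share no 8-byte blocks, while the aligned ones share all four payload blocks.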

The bug it surfaced

Validating with the real e2fsck (after fixing a stray s_journal_uuid that made e2fsck abort) revealed that multi-block-group images (>128 MiB) placed file data on the Group 1 backup superblock (block 32768). That produces multiply-claimed blocks, which the kernel rejects, so /home, /media, etc. read back as "Structure needs cleaning". The bug is pre-existing, independent of alignment, and was hidden by the lenient reader.

Fix: a group-aware allocator skips reserved backup-superblock/GDT blocks and fragments files around them. Alignment then composes on top (metadata-aware placement + padding holes cleared from the bitmap).
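
A rough sketch of the skipping logic, under stated assumptions: 4 KiB blocks (so 32768 blocks per group), the standard sparse_super rule that backups live in groups 0, 1 and powers of 3, 5 and 7, and a caller-supplied count of group-descriptor blocks. The names `has_backup_superblock`, `reserved_range`, and `split_around_reserved` are illustrative and are not the crate's `write_file_data` or `physical_runs`.

```rust
use std::ops::Range;

/// Blocks per group with a 4 KiB block size (one bitmap block holds 8 * 4096 bits).
const BLOCKS_PER_GROUP: u64 = 32_768;

/// sparse_super: backup superblocks live in groups 0, 1 and powers of 3, 5, 7.
fn has_backup_superblock(group: u64) -> bool {
    fn is_power_of(mut n: u64, base: u64) -> bool {
        while n % base == 0 {
            n /= base;
        }
        n == 1
    }
    group <= 1 || [3, 5, 7].iter().any(|&b| is_power_of(group, b))
}

/// Reserved blocks at the start of a backup group: superblock copy plus
/// `gdt_blocks` descriptor blocks. Group 0 (primary metadata) is assumed
/// to be handled elsewhere, so it returns `None` here.
fn reserved_range(group: u64, gdt_blocks: u64) -> Option<Range<u64>> {
    if group == 0 || !has_backup_superblock(group) {
        return None;
    }
    let start = group * BLOCKS_PER_GROUP;
    Some(start..start + 1 + gdt_blocks)
}

/// Split a request for `len` data blocks beginning at `start` into physical
/// runs that never touch a reserved range.
fn split_around_reserved(mut start: u64, mut len: u64, gdt_blocks: u64) -> Vec<Range<u64>> {
    let mut runs: Vec<Range<u64>> = Vec::new();
    while len > 0 {
        let group = start / BLOCKS_PER_GROUP;
        if let Some(r) = reserved_range(group, gdt_blocks) {
            if r.contains(&start) {
                start = r.end; // hop over the backup superblock + GDT
                continue;
            }
        }
        // Reserved ranges only begin at group boundaries, so stop there.
        let take = len.min((group + 1) * BLOCKS_PER_GROUP - start);
        match runs.last_mut() {
            Some(last) if last.end == start => last.end = start + take, // merge adjacent
            _ => runs.push(start..start + take),
        }
        start += take;
        len -= take;
    }
    runs
}
```

A 20-block file starting at block 32760 with one GDT block comes out as two runs, 32760..32768 and 32770..32782, which is the fragmentation that would then be expressed as two extents.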

Changes

  • ext4 writer: s_journal_uuid fix; group-aware allocator (write_file_data, physical_runs, rewritten extent emission); metadata-aware WriterOption::AlignData (opt-in) with free-hole bitmap accounting.
  • bless: all three paths (CLI merged, CLI layered, server-side background) enable AlignData at the 128 KiB block grid, with device-size headroom for padding.
  • harness (ext4/tests/fsck_validity.rs): e2fsck-oracle integration tests — single/multi-group clean, content-survives-fragmentation, a property fuzzer over random multi-group layouts (both align modes; EXT4_FUZZ_SEEDS to scale; ran clean at 64 seeds = 128 images), and an opt-in kernel_mount_content check (real loop-mount, byte-exact; EXT4_MOUNT_TEST=1).
  • dedup_probe bin: empirical dedup measurement (replaces the unsound oci_dedup_measure model).
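
For readers unfamiliar with using e2fsck as a test oracle, here is a minimal sketch of the idea. It is not the harness in ext4/tests/fsck_validity.rs: `FsckVerdict`, `classify_exit`, and `fsck_image` are hypothetical names. It relies only on e2fsck's documented behavior: `-f` forces a check, `-n` opens the image read-only and answers "no" to every prompt, and the exit status is a bit mask (0 clean, 4 errors left uncorrected, 8 and above operational or usage failures).

```rust
use std::path::Path;
use std::process::Command;

/// Verdict derived from e2fsck's exit-status bit mask.
#[derive(Debug, PartialEq)]
enum FsckVerdict {
    Clean,
    Errors,
    OperationalFailure,
}

/// Map an exit code to a verdict. Bits 8, 16, 32 and 128 mean the tool
/// itself failed (operational error, usage error, cancelled, library error);
/// any other non-zero value means filesystem errors were found.
fn classify_exit(code: i32) -> FsckVerdict {
    if code == 0 {
        FsckVerdict::Clean
    } else if code & (8 | 16 | 32 | 128) != 0 {
        FsckVerdict::OperationalFailure
    } else {
        FsckVerdict::Errors
    }
}

/// Run a forced, read-only check of `image`. Requires e2fsck on PATH.
fn fsck_image(image: &Path) -> std::io::Result<FsckVerdict> {
    let status = Command::new("e2fsck").arg("-f").arg("-n").arg(image).status()?;
    // `code()` is None when the process was killed by a signal.
    Ok(status.code().map_or(FsckVerdict::OperationalFailure, classify_exit))
}
```

A test gated this way would assert `fsck_image(&path)? == FsckVerdict::Clean` after each build, rather than trusting a reader that shares assumptions with the writer.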

Validation

  • e2fsck clean on synthetic + real (debian, python:3.12) images
  • kernel loop-mount reads every file byte-exact, both aligned and unaligned
  • determinism preserved; full ext4 suite (52 tests) + harness green
  • live bless of debian (single-group) and python:3.12 (multi-group) completes end-to-end with alignment on

Notes / follow-ups

  • To realize the win in production, base images must be rebuilt with the new bless (previously-blessed images intentionally not migrated).
  • Worth a follow-up: a shared bless_writer_options() helper so the three bless sites can't drift (this PR already had one missing call site); would also unify a random-vs-deterministic UUID inconsistency in the server-side path.

🤖 Generated with Claude Code

jaredLunde and others added 6 commits June 5, 2026 18:25
Two latent, shipping bugs in the ext4 image writer, both surfaced by
gating validation on the real `e2fsck`/kernel instead of the lenient
in-crate reader:

1. s_journal_uuid was stamped with the filesystem UUID. That field names
   an *external* journal device, so the kernel and e2fsck searched for a
   nonexistent external journal and aborted before checking anything —
   meaning no image this writer produced could ever be fsck-validated.
   Zeroed it (internal journal carries its UUID in the jbd2 superblock).

2. Multi-block-group images (>128 MiB) placed file data on the Group 1
   backup superblock + group descriptors (block 32768), producing
   multiply-claimed blocks the kernel rejects — real corruption
   (e.g. /home, /media read back as "Structure needs cleaning").
   New group-aware allocator: file data skips the reserved backup-SB/GDT
   blocks at sparse_super group boundaries and fragments around them
   (write_file_data, physical_runs, rewritten extent emission keyed on
   data_start_block). Unfragmented files produce byte-identical output to
   before.
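
To make item 1 concrete, the following sketch shows the two superblock fields involved. It assumes the standard ext4 superblock layout (s_uuid at byte 0x68 and s_journal_uuid at byte 0xD0, 16 bytes each, inside the 1024-byte superblock); `stamp_uuids` is a hypothetical helper, not the writer's actual function, and a real writer would recompute the superblock checksum afterwards.

```rust
/// Byte offsets within the 1024-byte ext4 superblock.
const S_UUID: usize = 0x68;
const S_JOURNAL_UUID: usize = 0xD0;
const UUID_LEN: usize = 16;

/// Write the filesystem UUID and leave the external-journal UUID zeroed.
/// A non-zero s_journal_uuid tells the kernel and e2fsck to look for an
/// external journal device; an internal journal does not need it.
fn stamp_uuids(sb: &mut [u8; 1024], fs_uuid: [u8; UUID_LEN]) {
    sb[S_UUID..S_UUID + UUID_LEN].copy_from_slice(&fs_uuid);
    sb[S_JOURNAL_UUID..S_JOURNAL_UUID + UUID_LEN].fill(0);
}
```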

Validated on synthetic and real (python:3.12-slim) images: e2fsck clean,
kernel mount reads all files, byte-exact content across fragmentation,
determinism preserved, full ext4 suite green.

Adds:
- ext4/tests/fsck_validity.rs: e2fsck-oracle integration harness.
- glidefs/src/bin/dedup_probe.rs: empirical dedup probe (replaces the
  unsound oci_dedup_measure model).
- WriterOption::AlignData: opt-in, gated off, marked KNOWN-LIMITATION
  (block-alignment dedup work in progress; not yet metadata-aware).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rness

Completes the block-alignment dedup work on top of the group-aware
allocator.

Alignment (WriterOption::AlignData) is now correct and e2fsck-clean:
- Composes with reserved-block skipping: when an aligned file start lands
  on a backup-superblock/GDT block, write_file_data skips past it before
  recording data_start_block.
- Padding gaps are unreferenced free space, not data. record_free_hole
  tracks them (excluding reserved metadata blocks), and close() clears
  them from the otherwise-dense block bitmap so the free counts are
  correct. Previously the dense bitmap marked padding as used, which
  e2fsck rejected.
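
The bitmap accounting can be pictured with a small sketch. This is an illustration under assumptions, not the crate's `record_free_hole` or `close()`: `BlockBitmap` is a hypothetical type that starts dense (every block up to the high-water mark in use) and then clears recorded padding holes so the free count matches what e2fsck computes from the extents.

```rust
use std::ops::Range;

/// Bit-packed block bitmap, least-significant bit first (as ext4 stores it).
struct BlockBitmap {
    bits: Vec<u8>,
    total: u64,
}

impl BlockBitmap {
    /// Start dense: blocks `0..used` are marked in use, the rest free.
    fn dense(total: u64, used: u64) -> Self {
        let mut bm = BlockBitmap { bits: vec![0; ((total + 7) / 8) as usize], total };
        for b in 0..used.min(total) {
            bm.bits[(b / 8) as usize] |= 1u8 << (b % 8);
        }
        bm
    }

    /// Mark a padding hole as free. The caller is expected to have already
    /// excluded reserved metadata blocks from `hole`.
    fn clear_range(&mut self, hole: Range<u64>) {
        for b in hole.start..hole.end.min(self.total) {
            self.bits[(b / 8) as usize] &= !(1u8 << (b % 8));
        }
    }

    /// Free-block count, as it would be written to the group descriptor.
    fn free_blocks(&self) -> u64 {
        let used: u64 = self.bits.iter().map(|byte| u64::from(byte.count_ones())).sum();
        self.total - used
    }
}
```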

Validated end to end on real images (python:3.12/3.13-slim): aligned
builds are e2fsck-clean, kernel-mount and read byte-exact, deterministic,
and realize the dedup win — 26%->52% cross-image (3-image set), matching
the file-level ground truth (52%) and capturing 98% of the FastCDC
ceiling, with zero stored cost (padding is dropped zeros).

Adds fuzz_multigroup_validity_and_content: random multi-group filesets
(mixed small/medium/large) must be e2fsck-clean AND read back byte-exact,
both unaligned and aligned. Deterministic seeds reproduce failures;
EXT4_FUZZ_SEEDS scales coverage (ran clean at 64 seeds = 128 images).
This is the generalized gate over the size/position space where the
data-on-metadata and alignment-bitmap bugs lived.
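
The reproducibility claim rests on seeding. A minimal sketch of a seed-driven fileset generator follows; it is an assumption about the shape of such a fuzzer, not the code behind fuzz_multigroup_validity_and_content. `SplitMix64` is the well-known SplitMix64 generator, and `random_fileset` and its size classes (up to 4 KiB, up to 128 KiB, up to 4 MiB) are hypothetical.

```rust
/// SplitMix64: a tiny deterministic PRNG, sufficient for test-input generation.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Produce `count` file sizes in bytes, mixing small, medium and large files.
/// The same seed always yields the same list, so a failing seed can be replayed.
fn random_fileset(seed: u64, count: usize) -> Vec<u64> {
    let mut rng = SplitMix64(seed);
    (0..count)
        .map(|_| {
            let (lo, hi): (u64, u64) = match rng.next_u64() % 3 {
                0 => (1, 4 * 1024),                     // small
                1 => (4 * 1024 + 1, 128 * 1024),        // medium
                _ => (128 * 1024 + 1, 4 * 1024 * 1024), // large
            };
            lo + rng.next_u64() % (hi - lo + 1)
        })
        .collect()
}
```

Each seed would then be built twice, once unaligned and once aligned, which is how 64 seeds yield 128 images.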

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the validated dedup alignment into both bless paths:
- run_bless_oci (merged image) and run_bless_oci_layered via layer_store
  (per-layer) now pass AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE }
  (128 KiB = the volume's block size). Threshold = one full dedup block:
  measured as the sweet spot — captures 99% of the FastCDC ceiling (34%
  cross-image reduction on the 3-image probe set) with less logical
  inflation than a 16 KiB threshold.
- device_size estimates get headroom for alignment padding (bless x3->x4,
  layer_store x2->x3). Padding is holes/zeros the block store drops, so it
  costs address space, not stored bytes.
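
To see why the device-size estimate needs headroom, consider this sketch of the placement rule. It mirrors the two fields named above (`align`, `min_size`) in a hypothetical `AlignPolicy` type, works in bytes, and assumes a file is aligned when its size is at least `min_size`; the real writer's comparison and block-level details may differ.

```rust
/// 128 KiB, the dedup block size quoted in the text.
const BLOCK_SIZE: u64 = 128 * 1024;

/// Hypothetical stand-in for the two fields of the alignment option.
struct AlignPolicy {
    align: u64,
    min_size: u64,
}

impl AlignPolicy {
    /// Byte offset at which a file of `size` bytes starts, given the write cursor.
    /// Files below the threshold are packed; larger ones start on the grid.
    fn place(&self, cursor: u64, size: u64) -> u64 {
        if size >= self.min_size {
            (cursor + self.align - 1) / self.align * self.align
        } else {
            cursor
        }
    }
}

/// Lay files out back to back; return (logical end offset, total padding bytes).
fn layout(policy: &AlignPolicy, sizes: &[u64]) -> (u64, u64) {
    let (mut cursor, mut padding) = (0u64, 0u64);
    for &size in sizes {
        let start = policy.place(cursor, size);
        padding += start - cursor;
        cursor = start + size;
    }
    (cursor, padding)
}
```

For sizes of 4096, 131072, 100 and 262144 bytes, the layout ends at 655360 bytes with 257948 bytes of padding, so the logical image is noticeably larger than the 397412 bytes of payload even though the padding is never stored.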

Automate the strongest oracle: kernel_mount_content loop-mounts the image
with the real Linux ext4 driver and verifies every file byte-exact
against known input, for both aligned and unaligned builds. Opt-in via
EXT4_MOUNT_TEST=1 (needs root/passwordless sudo), skips by default so CI
stays green; runs in a privileged/nightly job. Verified passing locally.
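
A sketch of the two pieces such a check needs, the opt-in gate and the byte-exact comparison, is given below. It is not the kernel_mount_content test itself: `mount_test_enabled` and `verify_tree` are hypothetical helpers, and the loop-mount step (which needs root) is left out. `verify_tree` would be pointed at the mount point once the image is mounted.

```rust
use std::{env, fs, path::Path};

/// Opt-in gate: run only when EXT4_MOUNT_TEST is set to exactly "1".
fn mount_test_enabled() -> bool {
    env::var("EXT4_MOUNT_TEST").map_or(false, |v| v == "1")
}

/// Compare every expected file under `root` byte for byte.
/// `expected` pairs a path relative to `root` with the bytes written at build time.
fn verify_tree(root: &Path, expected: &[(&str, &[u8])]) -> Result<(), String> {
    for (rel, want) in expected {
        let got = fs::read(root.join(rel)).map_err(|e| format!("{rel}: {e}"))?;
        if got != *want {
            return Err(format!(
                "{rel}: content mismatch ({} bytes read, {} expected)",
                got.len(),
                want.len()
            ));
        }
    }
    Ok(())
}
```

A test body would begin with `if !mount_test_enabled() { return; }`, so the default CI run skips it and a privileged job exercises it.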

dedup_probe: alignment threshold is now configurable via
DEDUP_ALIGN_THRESHOLD for measurement sweeps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
router.rs run_bless_oci_task (the in-server background bless triggered via
the API) was missing the AlignData wiring and device-size headroom that
the two CLI bless paths got. Wire it consistently so every bless path
produces grid-aligned, cross-image-deduppable base images.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The group-aware allocator skipped reserved block-group backup-superblock/GDT
blocks for FILE DATA, but the journal, inode table, and bitmaps are written
contiguously at close() via raw writes and could still land on (or straddle)
them. Since bless enables a 4 MiB journal, this was a real corruption path:
e.g. a ~124 MiB image puts the journal (inode 8) across block 32768 (Group 1's
backup superblock) -> multiply-claimed block the kernel rejects. The inode
table straddles the same way for workloads with many files.

Caught by running the fsck harness with the production journal config (it
previously ran journal-less). Fix: reserve_contiguous(n) places a contiguous
run clear of reserved regions (skipping past any it would straddle, recording
the gap as a free hole); used for the journal, inode table, and bitmaps.
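
The placement rule can be sketched as follows. This is an illustration only: `place_contiguous` and `Placement` are hypothetical names, not the crate's `reserve_contiguous`, and the reserved ranges are assumed to be sorted ascending and non-overlapping. Unlike file data, the run is never fragmented; it is moved wholesale past any reserved range it would touch, and the skipped free blocks are reported as holes.

```rust
use std::ops::Range;

/// Where a contiguous run ended up, plus the free gaps skipped to get there.
#[derive(Debug, PartialEq)]
struct Placement {
    start: u64,
    holes: Vec<Range<u64>>,
}

/// Place `n` contiguous blocks at or after `cursor`, clear of every reserved range.
/// `reserved` must be sorted ascending and non-overlapping.
fn place_contiguous(cursor: u64, n: u64, reserved: &[Range<u64>]) -> Placement {
    let mut start = cursor;
    let mut holes = Vec::new();
    for r in reserved {
        // Would the candidate run [start, start + n) overlap this reserved range?
        if start < r.end && r.start < start + n {
            if start < r.start {
                // Blocks before the reserved range stay unreferenced: a free hole.
                holes.push(start..r.start);
            }
            start = r.end; // restart just past the reserved blocks
        }
    }
    Placement { start, holes }
}
```

With a 1024-block journal, a cursor at block 32000, and blocks 32768..32770 reserved, the journal lands at 32770 and blocks 32000..32768 are recorded as a hole to be cleared from the bitmap.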

Tests now run with Journal(1024) (matching bless) and add:
- fsck_journal_straddles_group_boundary: sweeps 120-136 MiB
- fsck_inode_table_straddles_boundary: many-file workloads at the boundary
Both verified to fail before the fix (journal: inode 8 multiply-claim; inode
table: "Group 1's inode table at 32768 conflicts"). Full suite + 64-seed fuzz
+ kernel mount green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bring the doc in line with actual behavior:
- Extent building: file data fragments around reserved blocks
  (write_file_data / physical_runs / write_extents), not always contiguous.
- New Block-Group Metadata Reservation section: sparse_super backup
  superblocks at block 32768 etc., data + close()-time structures skip them
  (reserve_contiguous), free-hole bitmap accounting; the multi-group
  corruption bug and why the in-crate reader hid it.
- On-disk layout: correct flex_bg layout (metadata clustered at the end),
  reserved holes, journal placement.
- WriterOption table: add Uuid, Journal (s_journal_uuid must be zero),
  AlignData.
- Fix stale "why no journal" (now optional, bless enables it) and the
  zero-UUID determinism/checksum claims.
- Testing: document the fsck_validity e2fsck/mount/fuzz harness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit 74b07b2 into main Jun 6, 2026
24 checks passed
@jaredLunde jaredLunde deleted the jared/good branch June 6, 2026 05:09