From d9e15e022db00e8bbc1aca7702d0a99f2362cb0e Mon Sep 17 00:00:00 2001
From: Jared Lunde <jared.lunde@gmail.com>
Date: Fri, 5 Jun 2026 18:25:18 -0700
Subject: [PATCH 1/6] fix(ext4): group-aware allocator + e2fsck-validated
 writer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two latent, shipping bugs in the ext4 image writer, both surfaced by
gating validation on the real `e2fsck`/kernel instead of the lenient
in-crate reader:

1. s_journal_uuid was stamped with the filesystem UUID. That field names
   an *external* journal device, so the kernel and e2fsck searched for a
   nonexistent external journal and aborted before checking anything —
   meaning no image this writer produced could ever be fsck-validated.
   Zeroed it (internal journal carries its UUID in the jbd2 superblock).

2. Multi-block-group images (>128 MiB) placed file data on the Group 1
   backup superblock + group descriptors (block 32768), producing
   multiply-claimed blocks the kernel rejects — real corruption
   (e.g. /home, /media read back as "Structure needs cleaning").
   New group-aware allocator: file data skips the reserved backup-SB/GDT
   blocks at sparse_super group boundaries and fragments around them
   (write_file_data, physical_runs, rewritten extent emission keyed on
   data_start_block). Unfragmented files produce byte-identical output to
   before.

Validated on synthetic and real (python:3.12-slim) images: e2fsck clean,
kernel mount reads all files, byte-exact content across fragmentation,
determinism preserved, full ext4 suite green.

Adds:
- ext4/tests/fsck_validity.rs: e2fsck-oracle integration harness.
- glidefs/src/bin/dedup_probe.rs: empirical dedup probe (replaces the
  unsound oci_dedup_measure model).
- WriterOption::AlignData: opt-in, gated off, marked KNOWN-LIMITATION
  (block-alignment dedup work in progress; not yet metadata-aware).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
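Reviewer note (not part of the commit): the reservation rule from fix 2
can be sketched standalone as below. This is a minimal illustration, not
the writer's code: it assumes 4 KiB blocks and 32768 blocks per group,
takes the reserved length (`reserve`, i.e. 1 + gd_blocks) as a plain
parameter, and walks block by block where the real `physical_runs` jumps
straight to the next reserved block.

```rust
// Sketch only: the constant and the `reserve` parameter are assumptions
// made for illustration; the writer derives them from its own state.
const BLOCKS_PER_GROUP: u32 = 32_768; // 128 MiB groups of 4 KiB blocks

/// True if group `g` is 0, 1, or a power of 3, 5, or 7 (sparse_super).
fn has_super_backup(g: u32) -> bool {
    if g <= 1 {
        return true;
    }
    [3u32, 5, 7].iter().any(|&base| {
        let mut p = base;
        while p < g {
            match p.checked_mul(base) {
                Some(n) => p = n,
                None => return false, // overflowed before reaching `g`
            }
        }
        p == g
    })
}

/// True if block `b` lies in the reserved head of an interior backup group.
fn is_reserved(b: u32, reserve: u32) -> bool {
    let g = b / BLOCKS_PER_GROUP;
    g >= 1 && has_super_backup(g) && b % BLOCKS_PER_GROUP < reserve
}

/// Split [start, end) into (first_block, length) runs that avoid
/// reserved blocks.
fn physical_runs(start: u32, end: u32, reserve: u32) -> Vec<(u32, u32)> {
    let mut runs: Vec<(u32, u32)> = Vec::new();
    for b in start..end {
        if is_reserved(b, reserve) {
            continue;
        }
        match runs.last_mut() {
            // Extend the current run when `b` directly follows it.
            Some((s, len)) if *s + *len == b => *len += 1,
            _ => runs.push((b, 1)),
        }
    }
    runs
}

fn main() {
    // A file spanning the group 0 / group 1 boundary, two reserved blocks.
    let runs = physical_runs(32_760, 32_780, 2);
    println!("{runs:?}");
}
```

With `reserve = 2`, blocks 32768 and 32769 are skipped, so the range
splits into two runs: `[(32760, 8), (32770, 10)]`. Each run then becomes
one or more extent leaves with consecutive logical offsets.
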
 ext4/src/tests.rs              |   5 +-
 ext4/src/writer.rs             | 266 ++++++++++++++---
 ext4/tests/fsck_validity.rs    | 216 ++++++++++++++
 glidefs/src/bin/dedup_probe.rs | 530 +++++++++++++++++++++++++++++++++
 4 files changed, 968 insertions(+), 49 deletions(-)
 create mode 100644 ext4/tests/fsck_validity.rs
 create mode 100644 glidefs/src/bin/dedup_probe.rs
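
Reviewer note (not part of the commit): the padding arithmetic behind
`WriterOption::AlignData` can be sketched as below. This is an
illustrative sketch under stated assumptions, not the writer's code:
`pad_to_align` is a hypothetical helper, and the constants (4 KiB blocks,
32768 blocks per group, a 128 KiB dedup grid) are taken from the values
used elsewhere in this patch. The last check shows why the option is
marked KNOWN-LIMITATION: the first block of group 1 is itself a grid
boundary, so an aligned start can coincide with a reserved block.

```rust
/// Zero bytes needed to advance `pos` to the next `align` boundary.
/// `align == 0` means alignment is disabled (hypothetical helper).
fn pad_to_align(pos: u64, align: u64) -> u64 {
    if align == 0 {
        return 0;
    }
    debug_assert!(align.is_power_of_two());
    let rem = pos % align;
    if rem == 0 { 0 } else { align - rem }
}

fn main() {
    const GRID: u64 = 128 * 1024; // assumed dedup window size
    const BLOCK_SIZE: u64 = 4096; // assumed ext4 block size
    const BLOCKS_PER_GROUP: u64 = 32_768; // assumed blocks per group

    // A write position a little past block 5.
    let pos = 5 * BLOCK_SIZE + 100;
    let pad = pad_to_align(pos, GRID);
    println!("pos={pos} pad={pad} aligned_start={}", pos + pad);
    assert_eq!((pos + pad) % GRID, 0);

    // Group 1 begins at 128 MiB, which is a multiple of the 128 KiB grid.
    let group1_start = BLOCKS_PER_GROUP * BLOCK_SIZE;
    assert_eq!(group1_start % GRID, 0);
}
```

A metadata-aware version would have to re-check the aligned start against
the reserved regions and choose the next free grid boundary instead.
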

diff --git a/ext4/src/tests.rs b/ext4/src/tests.rs
index 79ec7de5..deb0c02a 100644
--- a/ext4/src/tests.rs
+++ b/ext4/src/tests.rs
@@ -971,7 +971,10 @@ fn test_journal_roundtrip() {
             "HAS_JOURNAL flag not set"
         );
         assert_eq!(sb.journal_inum, format::INODE_JOURNAL, "journal_inum should be 8");
-        assert_eq!(sb.journal_uuid, uuid, "journal_uuid should match filesystem uuid");
+        // s_journal_uuid identifies an EXTERNAL journal device; for an internal
+        // journal it must be zero, or the kernel/e2fsck search for a nonexistent
+        // external journal and abort ("Can't find external journal").
+        assert_eq!(sb.journal_uuid, [0u8; 16], "journal_uuid must be zero for an internal journal");
         assert_ne!(sb.journal_blocks[0], 0, "journal_blocks backup should be populated");
     }
 
diff --git a/ext4/src/writer.rs b/ext4/src/writer.rs
index 1ad23bcd..589675fd 100644
--- a/ext4/src/writer.rs
+++ b/ext4/src/writer.rs
@@ -66,6 +66,21 @@ pub enum WriterOption {
     /// Create an internal journal with the given size in 4 KiB blocks.
     /// Typical values: 1024 (4 MiB), 4096 (16 MiB), 16384 (64 MiB).
     Journal(u32),
+    /// Start the data of every regular file of at least `min_size` bytes on
+    /// an `align`-byte boundary (padding the gap with a hole). Aligning large
+    /// file payloads to the downstream dedup block grid makes the same file
+    /// produce the same blocks regardless of what was written before it, so
+    /// content-addressed dedup survives unrelated upstream churn. `align` must
+    /// be a power of two; `align == 0` disables (the default).
+    ///
+    /// KNOWN LIMITATION (do not enable in production yet): the current pad is
+    /// not metadata-aware. Padding can land a file's data on an ext4 block-group
+    /// reserved block (e.g. the backup superblock at block `blocks_per_group`),
+    /// producing an extent the *kernel* rejects ("invalid extent entries"),
+    /// even though the in-crate reader accepts it. A correct implementation must
+    /// skip group-metadata blocks when aligning. Verified via `dedup_probe` +
+    /// `e2fsck`/loop-mount.
+    AlignData { align: u32, min_size: u32 },
 }
 
 // ---- Internal inode ----
@@ -233,6 +248,15 @@ pub struct Writer<W: Read + Write + Seek> {
     gd_blocks: u32,
     uuid: [u8; 16],
     journal_blocks: u32,
+    /// Boundary (bytes) for large-file data alignment; 0 = disabled.
+    data_align: i64,
+    /// Minimum file size (bytes) that triggers data alignment.
+    data_align_min: i64,
+    /// Physical block where the in-progress file's data begins. File data skips
+    /// blocks reserved for block-group metadata (backup superblocks + GDT), so
+    /// the data is generally non-contiguous and `pos - data_written` no longer
+    /// locates the start — this does.
+    data_start_block: u32,
 }
 
 impl<W: Read + Write + Seek> Writer<W> {
@@ -251,6 +275,9 @@ impl<W: Read + Write + Seek> Writer<W> {
             gd_blocks: 0,
             uuid: [0u8; 16],
             journal_blocks: 0,
+            data_align: 0,
+            data_align_min: 0,
+            data_start_block: 0,
         };
         for opt in opts {
             match opt {
@@ -268,6 +295,11 @@ impl<W: Read + Write + Seek> Writer<W> {
                 }
                 WriterOption::Uuid(u) => w.uuid = *u,
                 WriterOption::Journal(blocks) => w.journal_blocks = *blocks,
+                WriterOption::AlignData { align, min_size } => {
+                    debug_assert!(*align == 0 || align.is_power_of_two());
+                    w.data_align = i64::from(*align);
+                    w.data_align_min = i64::from(*min_size);
+                }
             }
         }
         w
@@ -331,6 +363,101 @@ impl<W: Read + Write + Seek> Writer<W> {
         Ok(())
     }
 
+    // ---- block-group metadata reservation ----
+    //
+    // ext4's sparse_super layout reserves the first `1 + gd_blocks` blocks of
+    // certain groups (0, 1, and powers of 3/5/7) for a backup superblock + a
+    // group-descriptor copy. Group 0's reservation is skipped at init(); the
+    // interior ones (block 32768, 98304, ...) sit in the middle of the data
+    // region. File data must not be written onto them, or the kernel rejects the
+    // extent as overlapping a system zone (multiply-claimed block).
+
+    /// Number of reserved blocks at the start of a backup group.
+    fn group_reserve(&self) -> u32 {
+        1 + self.gd_blocks
+    }
+
+    /// Is physical block `b` reserved for an interior block-group backup?
+    fn is_reserved_block(&self, b: u32) -> bool {
+        let g = b / BLOCKS_PER_GROUP;
+        if g == 0 {
+            return false; // group 0's primary metadata is handled by init()'s seek
+        }
+        (b % BLOCKS_PER_GROUP) < self.group_reserve() && has_super_backup(g)
+    }
+
+    /// Smallest reserved block >= `from`, or None if there is none within the maximum device size.
+    fn next_reserved_block_ge(&self, from: u32) -> Option<u32> {
+        let max_group = (self.max_disk_size / (i64::from(BLOCKS_PER_GROUP) * BLOCK_SIZE as i64)) as u32 + 1;
+        let mut g = from / BLOCKS_PER_GROUP;
+        while g <= max_group {
+            if g >= 1 && has_super_backup(g) {
+                let rstart = g * BLOCKS_PER_GROUP;
+                let rend = rstart + self.group_reserve();
+                let cand = from.max(rstart);
+                if cand < rend {
+                    return Some(cand);
+                }
+            }
+            g += 1;
+        }
+        None
+    }
+
+    /// If `pos` sits at the start of a reserved region, seek past it.
+    fn skip_reserved_at_pos(&mut self) -> io::Result<()> {
+        while self.pos % BLOCK_SIZE as i64 == 0 && self.is_reserved_block(self.block()) {
+            let g = self.block() / BLOCKS_PER_GROUP;
+            let region_end = g * BLOCKS_PER_GROUP + self.group_reserve();
+            self.seek_block(region_end)?;
+        }
+        Ok(())
+    }
+
+    /// Write file data, skipping reserved block-group metadata regions. Records
+    /// the file's first data block on the first call.
+    fn write_file_data(&mut self, b: &[u8]) -> io::Result<usize> {
+        if self.data_written == 0 {
+            self.skip_reserved_at_pos()?;
+            self.data_start_block = self.block();
+        }
+        let mut off = 0usize;
+        while off < b.len() {
+            self.skip_reserved_at_pos()?;
+            let cur = self.block();
+            let limit = match self.next_reserved_block_ge(cur) {
+                // next_reserved >= cur, and cur is not reserved, so r > cur.
+                Some(r) => i64::from(r) * BLOCK_SIZE as i64 - self.pos,
+                None => i64::MAX,
+            };
+            let take = ((b.len() - off) as i64).min(limit) as usize;
+            let w = self.write_bytes(&b[off..off + take])?;
+            off += w;
+            if w < take {
+                break; // short write
+            }
+        }
+        Ok(off)
+    }
+
+    /// The contiguous, non-reserved physical runs covering [start, end).
+    fn physical_runs(&self, start: u32, end: u32) -> Vec<(u32, u32)> {
+        let mut runs = Vec::new();
+        let mut b = start;
+        while b < end {
+            if self.is_reserved_block(b) {
+                b += 1;
+                continue;
+            }
+            let run_start = b;
+            // Jump to the next reserved block (or end) rather than stepping.
+            let next_res = self.next_reserved_block_ge(b).unwrap_or(end).min(end);
+            b = next_res;
+            runs.push((run_start, b - run_start));
+        }
+        runs
+    }
+
     // ---- inode management ----
 
     fn get_inode(&self, i: InodeNumber) -> Option<&Inode> {
@@ -615,6 +742,7 @@ impl<W: Read + Write + Seek> Writer<W> {
         self.cur_inode = Some((ino - 1) as usize);
         self.data_written = 0;
         self.data_max = size;
+        self.data_start_block = 0;
         Ok(())
     }
 
@@ -645,64 +773,72 @@ impl<W: Read + Write + Seek> Writer<W> {
     }
 
     fn write_extents(&mut self, idx: usize) -> io::Result<()> {
-        let start = self.pos - self.data_written;
-        if start % BLOCK_SIZE as i64 != 0 {
-            return Err(io::Error::other(
-                "data start position is not block-aligned",
-            ));
-        }
+        // Flush the partial final data block, then resolve the file's physical
+        // layout. Data skips reserved block-group metadata, so it may be split
+        // across several contiguous runs; `data_start_block` (not
+        // `pos - data_written`) locates the start.
         self.next_block()?;
-
-        let start_block = (start / BLOCK_SIZE as i64) as u32;
-        let blocks = self.block() - start_block;
-        let mut used_blocks = blocks;
+        let start_block = self.data_start_block;
+        let end_block = self.block();
+        let runs = self.physical_runs(start_block, end_block);
+
+        // Flatten runs into extent leaves, each at most MAX_BLOCKS_PER_EXTENT.
+        // For an unfragmented file this yields exactly the same leaves the old
+        // contiguous arithmetic produced.
+        let mut leaves: Vec<(u32, u32, u32)> = Vec::new(); // (logical, phys, len)
+        let mut logical = 0u32;
+        for (phys, len) in &runs {
+            let mut o = 0u32;
+            while o < *len {
+                let l = (*len - o).min(MAX_BLOCKS_PER_EXTENT);
+                leaves.push((logical, phys + o, l));
+                logical += l;
+                o += l;
+            }
+        }
+        let mut used_blocks = logical; // data blocks (reserved gaps excluded)
 
         const EXTENT_NODE_SIZE: u32 = 12;
         const EXTENTS_PER_BLOCK: u32 = (BLOCK_SIZE as u32) / EXTENT_NODE_SIZE - 1;
 
-        let extents = if blocks == 0 { 0 } else { blocks.div_ceil(MAX_BLOCKS_PER_EXTENT) };
+        let n_ext = leaves.len() as u32;
         let mut data = Vec::new();
 
-        if extents == 0 {
+        if n_ext == 0 {
             // Nothing to do
-        } else if extents <= 4 {
-            // Fits in inode directly
-            write_extent_header_to_vec(&mut data, extents as u16, 4, 0);
-            for i in 0..extents {
-                let block_offset = i * MAX_BLOCKS_PER_EXTENT;
-                let mut length = blocks - block_offset;
-                if length > MAX_BLOCKS_PER_EXTENT {
-                    length = MAX_BLOCKS_PER_EXTENT;
-                }
-                write_extent_leaf_to_vec(&mut data, block_offset, length as u16, start_block + block_offset);
+        } else if n_ext <= 4 {
+            // Fits in the inode directly.
+            write_extent_header_to_vec(&mut data, n_ext as u16, 4, 0);
+            for (lblk, phys, len) in &leaves {
+                write_extent_leaf_to_vec(&mut data, *lblk, *len as u16, *phys);
             }
-            // Pad to 4 extents worth
-            let padding = (4 - extents) * EXTENT_NODE_SIZE;
+            let padding = (4 - n_ext) * EXTENT_NODE_SIZE;
             data.extend(std::iter::repeat_n(0u8, padding as usize));
-        } else if extents <= 4 * EXTENTS_PER_BLOCK {
-            let extent_blocks = extents.div_ceil(EXTENTS_PER_BLOCK);
-            used_blocks += extent_blocks;
+        } else if n_ext <= 4 * EXTENTS_PER_BLOCK {
+            let extent_blocks = n_ext.div_ceil(EXTENTS_PER_BLOCK);
 
-            // Root: index nodes
+            // Root: index nodes pointing at leaf blocks.
             write_extent_header_to_vec(&mut data, extent_blocks as u16, 4, 1);
-            // We'll fill in the index nodes after writing the leaf blocks
             let index_start = data.len();
             data.resize(index_start + 4 * EXTENT_NODE_SIZE as usize, 0);
 
             for i in 0..extent_blocks {
+                // Extent-tree blocks must avoid reserved metadata too.
+                self.skip_reserved_at_pos()?;
                 let leaf_block = self.block();
-                // Fill in the index node
+                used_blocks += 1;
+
+                let first = (i * EXTENTS_PER_BLOCK) as usize;
+                let extents_in_block = (n_ext - i * EXTENTS_PER_BLOCK).min(EXTENTS_PER_BLOCK);
+
+                // Index node: logical offset of this leaf block's first extent.
                 let idx_off = index_start + (i * EXTENT_NODE_SIZE) as usize;
-                let block_off = i * EXTENTS_PER_BLOCK * MAX_BLOCKS_PER_EXTENT;
-                data[idx_off..idx_off + 4].copy_from_slice(&block_off.to_le_bytes());
+                data[idx_off..idx_off + 4].copy_from_slice(&leaves[first].0.to_le_bytes());
                 data[idx_off + 4..idx_off + 8].copy_from_slice(&leaf_block.to_le_bytes());
                 // idx_off + 8..12 stays zero (leaf_high + unused)
 
-                let extents_in_block = (extents - i * EXTENTS_PER_BLOCK).min(EXTENTS_PER_BLOCK);
                 let mut leaf_buf = vec![0u8; BLOCK_SIZE as usize];
                 let mut leaf_pos = 0usize;
-
-                // Write extent header
                 leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&format::EXTENT_HEADER_MAGIC.to_le_bytes());
                 leaf_pos += 2;
                 leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(extents_in_block as u16).to_le_bytes());
@@ -714,21 +850,15 @@ impl<W: Read + Write + Seek> Writer<W> {
                 leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&0u32.to_le_bytes()); // generation
                 leaf_pos += 4;
 
-                let offset = i * EXTENTS_PER_BLOCK * MAX_BLOCKS_PER_EXTENT;
-                for j in 0..extents_in_block {
-                    let block_off2 = offset + j * MAX_BLOCKS_PER_EXTENT;
-                    let mut length = blocks - block_off2;
-                    if length > MAX_BLOCKS_PER_EXTENT {
-                        length = MAX_BLOCKS_PER_EXTENT;
-                    }
-                    let start = start_block + block_off2;
-                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&block_off2.to_le_bytes());
+                for j in 0..extents_in_block as usize {
+                    let (lblk, phys, len) = leaves[first + j];
+                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&lblk.to_le_bytes());
                     leaf_pos += 4;
-                    leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(length as u16).to_le_bytes());
+                    leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(len as u16).to_le_bytes());
                     leaf_pos += 2;
                     leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&0u16.to_le_bytes()); // start_high
                     leaf_pos += 2;
-                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&start.to_le_bytes());
+                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&phys.to_le_bytes());
                     leaf_pos += 4;
                 }
 
@@ -892,6 +1022,18 @@ impl<W: Read + Write + Seek> Writer<W> {
 
         if self.inode_ref(child_ino)?.mode & TYPE_MASK == format::S_IFREG {
             self.start_inode(name, child_ino, f.size)?;
+            // Align the start of large file payloads to the dedup block grid.
+            // Done before any data is written (data_written == 0), so the
+            // file's first data block, recorded in `data_start_block` by
+            // write_file_data, lands on the boundary. The padded gap is a hole
+            // (zeros), which the downstream block store drops for free.
+            if self.data_align > 0 && f.size >= self.data_align_min {
+                let align = self.data_align;
+                let rem = self.pos % align;
+                if rem != 0 {
+                    self.write_zeros(align - rem)?;
+                }
+            }
         }
         Ok(())
     }
@@ -1024,7 +1166,7 @@ impl<W: Read + Write + Seek> Writer<W> {
             self.data_written += b.len() as i64;
             Ok(b.len())
         } else {
-            let n = self.write_bytes(b)?;
+            let n = self.write_file_data(b)?;
             self.data_written += n as i64;
             Ok(n)
         }
@@ -1478,7 +1620,12 @@ impl<W: Read + Write + Seek> Writer<W> {
                 | format::RoCompatFeature::HUGE_FILE
                 | format::RoCompatFeature::EXTRA_ISIZE,
             uuid: self.uuid,
-            journal_uuid: self.uuid,
+            // s_journal_uuid identifies an *external* journal device. We only
+            // ever use an internal journal (inode 8) or none, so this must stay
+            // zero — a non-zero value makes the kernel and e2fsck search for an
+            // external journal and abort ("Can't find external journal"). The
+            // journal's own jbd2 superblock still carries the fs UUID.
+            journal_uuid: [0u8; 16],
             journal_inum: if self.journal_blocks > 0 { format::INODE_JOURNAL } else { 0 },
             hash_seed: [
                 u32::from_le_bytes(self.uuid[0..4].try_into().unwrap()),
@@ -1582,6 +1729,29 @@ fn best_group_count(blocks: u32, inodes: u32) -> (u32, u32) {
     (best_groups, best_ipg)
 }
 
+/// Does block group `g` hold a backup superblock + group-descriptor copy?
+/// With the sparse_super feature, backups live in groups 0, 1, and every power
+/// of 3, 5, and 7. The kernel/e2fsck reserve those blocks regardless of whether
+/// valid backup content is written, so file data must never claim them.
+fn has_super_backup(g: u32) -> bool {
+    if g <= 1 {
+        return true;
+    }
+    for base in [3u32, 5, 7] {
+        let mut p = base;
+        while p < g {
+            match p.checked_mul(base) {
+                Some(n) => p = n,
+                None => break,
+            }
+        }
+        if p == g {
+            return true;
+        }
+    }
+    false
+}
+
 fn write_extent_header_to_vec(buf: &mut Vec<u8>, entries: u16, max: u16, depth: u16) {
     buf.extend_from_slice(&format::EXTENT_HEADER_MAGIC.to_le_bytes());
     buf.extend_from_slice(&entries.to_le_bytes());
diff --git a/ext4/tests/fsck_validity.rs b/ext4/tests/fsck_validity.rs
new file mode 100644
index 00000000..803968fb
--- /dev/null
+++ b/ext4/tests/fsck_validity.rs
@@ -0,0 +1,216 @@
+//! Filesystem-validity harness gated on the REAL `e2fsck`, not the in-crate
+//! reader. The in-crate reader is lenient and hid a real multi-block-group
+//! corruption bug for a long time; `e2fsck` (kernel-grade structural check)
+//! catches it. Every writer change should be validated here.
+//!
+//! These tests shell out to `e2fsck`; they skip (pass with a notice) when it is
+//! not installed so they do not break environments without e2fsprogs.
+//!
+//! Run with: `cargo test -p ext4 --test fsck_validity -- --include-ignored`
+
+use std::path::PathBuf;
+use std::process::Command;
+
+use ext4::tar_convert::{ConvertOptions, convert_tar_to_ext4};
+use ext4::writer::WriterOption;
+
+/// Locate `e2fsck` (often in /sbin, not on a service PATH). Returns None to skip.
+fn find_e2fsck() -> Option<PathBuf> {
+    for p in ["/sbin/e2fsck", "/usr/sbin/e2fsck", "/usr/bin/e2fsck", "/bin/e2fsck"] {
+        if std::path::Path::new(p).exists() {
+            return Some(PathBuf::from(p));
+        }
+    }
+    None
+}
+
+/// Deterministic, non-trivial file content (so blocks are actually allocated and
+/// the layout is reproducible — no RNG).
+fn content(seed: u64, len: usize) -> Vec<u8> {
+    let mut v = Vec::with_capacity(len);
+    let mut s = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
+    while v.len() < len {
+        s ^= s << 13;
+        s ^= s >> 7;
+        s ^= s << 17;
+        v.extend_from_slice(&s.to_le_bytes());
+    }
+    v.truncate(len);
+    v
+}
+
+/// Build an in-memory tar of `(path, size)` files, convert to ext4 via the real
+/// production path, write it to a temp file, and run `e2fsck -fn` on it.
+/// Returns Ok(()) if e2fsck reports a clean filesystem (exit 0), else Err(report).
+fn build_and_fsck(files: &[(&str, usize)], align: Option<(u32, u32)>) -> Result<(), String> {
+    let Some(e2fsck) = find_e2fsck() else {
+        eprintln!("SKIP: e2fsck not installed");
+        return Ok(());
+    };
+
+    // 1. Synthesize a tar stream.
+    let mut tar = tar::Builder::new(Vec::new());
+    for (i, (path, size)) in files.iter().enumerate() {
+        let data = content(i as u64, *size);
+        let mut h = tar::Header::new_gnu();
+        h.set_size(data.len() as u64);
+        h.set_mode(0o644);
+        h.set_mtime(0);
+        h.set_entry_type(tar::EntryType::Regular);
+        h.set_cksum();
+        tar.append_data(&mut h, path, &data[..]).map_err(|e| format!("tar append: {e}"))?;
+    }
+    let tar_bytes = tar.into_inner().map_err(|e| format!("tar finish: {e}"))?;
+
+    // 2. Convert to a real ext4 image (same code bless uses).
+    let mut writer_options = vec![
+        WriterOption::MaximumDiskSize(2 * 1024 * 1024 * 1024),
+        WriterOption::Uuid([0x11; 16]),
+    ];
+    if let Some((a, m)) = align {
+        writer_options.push(WriterOption::AlignData { align: a, min_size: m });
+    }
+    let opts = ConvertOptions { convert_backslash: false, writer_options };
+
+    let tmp = tempfile::NamedTempFile::new().map_err(|e| format!("tmp: {e}"))?;
+    let out = tmp.reopen().map_err(|e| format!("reopen: {e}"))?;
+    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), out, &opts)
+        .map_err(|e| format!("convert: {e}"))?;
+    tmp.as_file().sync_all().ok();
+
+    // 3. The oracle: e2fsck -fn. Exit 0 == clean.
+    let output = Command::new(&e2fsck)
+        .args(["-fn"])
+        .arg(tmp.path())
+        .output()
+        .map_err(|e| format!("spawn e2fsck: {e}"))?;
+    let code = output.status.code().unwrap_or(-1);
+    if code == 0 {
+        Ok(())
+    } else {
+        let mut report = format!("e2fsck exit={code} (nonzero = filesystem errors)\n");
+        report.push_str(&String::from_utf8_lossy(&output.stdout));
+        // Trim the giant bitmap-difference dumps to keep failures readable.
+        let trimmed: String = report.lines().take(40).collect::<Vec<_>>().join("\n");
+        Err(trimmed)
+    }
+}
+
+/// Baseline: a single-block-group filesystem (< 128 MiB) must be e2fsck-clean.
+/// This proves the harness works and the writer is sound when it doesn't cross
+/// a block-group boundary.
+#[test]
+fn fsck_single_group_clean() {
+    let files = &[
+        ("etc/hostname", 12),
+        ("etc/config.toml", 4096),
+        ("usr/bin/tool", 8 * 1024 * 1024),
+        ("usr/lib/data.bin", 32 * 1024 * 1024),
+        ("var/log/app.log", 1024),
+    ];
+    if let Err(report) = build_and_fsck(files, None) {
+        panic!("single-group image is not e2fsck-clean:\n{report}");
+    }
+}
+
+/// A filesystem that crosses a block-group boundary (> 128 MiB) must be
+/// e2fsck-clean. This is the regression test for the old linear allocator,
+/// which placed file data on the Group 1 backup superblock / group descriptors
+/// (block 32768+), producing multiply-claimed blocks the kernel rejects. The
+/// group-aware allocator skips those reserved blocks.
+#[test]
+fn fsck_multi_group_clean() {
+    // ~160 MiB of file data guarantees crossing into block group 1 (32768 blocks
+    // == 128 MiB), regardless of group-0 metadata overhead.
+    let files = &[
+        ("data/a.bin", 40 * 1024 * 1024),
+        ("data/b.bin", 40 * 1024 * 1024),
+        ("data/c.bin", 40 * 1024 * 1024),
+        ("data/d.bin", 40 * 1024 * 1024),
+    ];
+    if let Err(report) = build_and_fsck(files, None) {
+        panic!("multi-group image is not e2fsck-clean:\n{report}");
+    }
+}
+
+/// Content correctness across fragmentation. e2fsck validates *structure*, not
+/// that a file's logical blocks are in the right order. A file that spans a
+/// reserved metadata region is split into multiple extents by the allocator; if
+/// the logical offsets were assigned wrong, the bytes would come back reordered.
+/// Build a multi-group image with KNOWN deterministic content, read every file
+/// back through the reader (which assembles by the on-disk extent tree,
+/// independent of the writer's allocation state), and assert byte-exact equality.
+#[test]
+fn content_survives_fragmentation() {
+    // Files sized so several cross the block-group boundary (block 32768).
+    let specs: &[(&str, usize)] = &[
+        ("data/a.bin", 50 * 1024 * 1024),
+        ("data/b.bin", 50 * 1024 * 1024),
+        ("data/c.bin", 50 * 1024 * 1024),
+        ("small/x", 1234),
+        ("data/d.bin", 30 * 1024 * 1024),
+    ];
+
+    // Build the tar.
+    let mut tar = tar::Builder::new(Vec::new());
+    let mut expected: std::collections::HashMap<String, Vec<u8>> = std::collections::HashMap::new();
+    for (i, (path, size)) in specs.iter().enumerate() {
+        let data = content(i as u64, *size);
+        let mut h = tar::Header::new_gnu();
+        h.set_size(data.len() as u64);
+        h.set_mode(0o644);
+        h.set_mtime(0);
+        h.set_entry_type(tar::EntryType::Regular);
+        h.set_cksum();
+        tar.append_data(&mut h, path, &data[..]).unwrap();
+        expected.insert((*path).to_string(), data);
+    }
+    let tar_bytes = tar.into_inner().unwrap();
+
+    // Convert to ext4 (multi-group, group-aware allocator).
+    let opts = ConvertOptions {
+        convert_backslash: false,
+        writer_options: vec![
+            WriterOption::MaximumDiskSize(2 * 1024 * 1024 * 1024),
+            WriterOption::Uuid([0x22; 16]),
+        ],
+    };
+    let mut img: Vec<u8> = Vec::new();
+    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), std::io::Cursor::new(&mut img), &opts).unwrap();
+
+    // Read every file back via the reader and compare to known input.
+    let mut reader = ext4::reader::Reader::new(std::io::Cursor::new(&img)).unwrap();
+    let entries = reader.walk().unwrap();
+    let mut checked = 0;
+    for e in entries {
+        if (e.mode & 0xF000) != 0x8000 {
+            continue;
+        }
+        let path = e.path.trim_start_matches('/').to_string();
+        let Some(want) = expected.get(&path) else { continue };
+        let inode = reader.read_inode(e.inode_number).unwrap();
+        let got = reader.read_data(&inode).unwrap();
+        assert_eq!(got.len(), want.len(), "{path}: size mismatch");
+        assert!(got == *want, "{path}: CONTENT MISMATCH (fragmentation reordered bytes)");
+        checked += 1;
+    }
+    assert_eq!(checked, specs.len(), "did not read back all files");
+}
+
+/// The aligned multi-group build must ALSO be e2fsck-clean (alignment padding
+/// must be marked free, and aligned file starts must skip reserved blocks).
+/// Captures both the metadata-collision and the padding-bitmap issues found
+/// via e2fsck. Ignored until `AlignData` is metadata-aware.
+#[test]
+#[ignore = "KNOWN BUG: alignment padding marked used + lands on group metadata; needs metadata-aware align"]
+fn fsck_multi_group_aligned_clean() {
+    let files = &[
+        ("data/a.bin", 40 * 1024 * 1024),
+        ("data/b.bin", 40 * 1024 * 1024),
+        ("data/c.bin", 40 * 1024 * 1024),
+        ("data/d.bin", 40 * 1024 * 1024),
+    ];
+    if let Err(report) = build_and_fsck(files, Some((128 * 1024, 16 * 1024))) {
+        panic!("aligned multi-group image is not e2fsck-clean:\n{report}");
+    }
+}
diff --git a/glidefs/src/bin/dedup_probe.rs b/glidefs/src/bin/dedup_probe.rs
new file mode 100644
index 00000000..a3c63863
--- /dev/null
+++ b/glidefs/src/bin/dedup_probe.rs
@@ -0,0 +1,530 @@
+#![allow(clippy::cast_possible_wrap, clippy::cast_sign_loss, clippy::cast_possible_truncation)]
+//! Empirical probe for the block-alignment dedup hypothesis.
+//!
+//! Claim 1 — the problem is real: fixed-grid block dedup over ext4 silently
+//! fails to dedup byte-identical content when that content sits at a different
+//! offset, because the dedup window is positional. We prove this two ways:
+//!   E1 (mechanism): take ONE real ext4 image and dedup it against a copy of
+//!      *itself* shifted by δ bytes. δ that is a whole-grid multiple → ~100%
+//!      dedup (the method works); a sub-grid δ → dedup collapses. Same bytes,
+//!      only the offset changes, so alignment is provably the whole cause.
+//!   E2 (bite): over real, related images, fixed-grid dedup is far below what
+//!      content-defined chunking (FastCDC) recovers from the *same* ext4 bytes.
+//!
+//! Claim 2 — the fix is easy: rebuild the same images with the production
+//!   writer's new `AlignData` option (large file payloads start on the grid).
+//!   E3: fixed-grid dedup on the aligned images should jump to ~the FastCDC
+//!   ceiling, and the aligned images still round-trip through the ext4 reader.
+//!   The padding is holes (zeros), which the block store never stores.
+//!
+//! Everything runs through the real production code paths
+//! (`convert_oci_layers_to_ext4`, the real `block_map` hashing/compression), so
+//! the numbers reflect what GlideFS would actually store — not a re-model.
+//!
+//! Usage:
+//!   skopeo copy docker://python:3.12-slim-bookworm dir:/tmp/oci/py312
+//!   skopeo copy docker://python:3.13-slim-bookworm dir:/tmp/oci/py313
+//!   cargo run --release --bin dedup_probe -- /tmp/oci/py312 /tmp/oci/py313 ...
+
+use std::collections::HashMap;
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom};
+use std::path::{Path, PathBuf};
+
+use ext4::tar_convert::{ConvertOptions, convert_oci_layers_to_ext4};
+use ext4::writer::WriterOption;
+use glidefs::block::block_map::{Blake3Hash, blake3_128, lz4_compress};
+
+const GRID: usize = 128 * 1024; // production BLOCK_SIZE — the dedup window size
+const ALIGN_THRESHOLD: u32 = 16 * 1024; // align files >= 16 KiB; pack smaller ones
+
+// ---- shared image plumbing (mirrors bless: deterministic, content-addressed) ----
+
+fn deterministic_uuid(seed: &str) -> [u8; 16] {
+    let mut uuid = *blake3_128(seed.as_bytes()).as_bytes();
+    uuid[6] = (uuid[6] & 0x0f) | 0x80;
+    uuid[8] = (uuid[8] & 0x3f) | 0x80;
+    uuid
+}
+
+fn decompress_layer(blob: &Path) -> std::io::Result<File> {
+    let mut f = File::open(blob)?;
+    let mut magic = [0u8; 4];
+    f.read_exact(&mut magic)?;
+    f.seek(SeekFrom::Start(0))?;
+    let mut out = tempfile::tempfile()?;
+    if magic[0] == 0x1f && magic[1] == 0x8b {
+        std::io::copy(&mut flate2::read::GzDecoder::new(f), &mut out)?;
+    } else if magic == [0x28, 0xb5, 0x2f, 0xfd] {
+        std::io::copy(&mut zstd::Decoder::new(f)?, &mut out)?;
+    } else {
+        std::io::copy(&mut f, &mut out)?;
+    }
+    out.seek(SeekFrom::Start(0))?;
+    Ok(out)
+}
+
+struct Image {
+    name: String,
+    seed: String,
+    layer_blobs: Vec<PathBuf>,
+}
+
+fn load_image(dir: &Path) -> Image {
+    let bytes = std::fs::read(dir.join("manifest.json")).expect("read manifest.json");
+    let seed = format!("blake3:{:032x}", u128::from_le_bytes(*blake3_128(&bytes).as_bytes()));
+    let v: serde_json::Value = serde_json::from_slice(&bytes).expect("parse manifest");
+    let layer_blobs = v["layers"]
+        .as_array()
+        .expect("layers[]")
+        .iter()
+        .map(|l| {
+            let d = l["digest"].as_str().expect("digest");
+            dir.join(d.strip_prefix("sha256:").unwrap_or(d))
+        })
+        .collect();
+    Image { name: dir.file_name().unwrap().to_string_lossy().into_owned(), seed, layer_blobs }
+}
+
+/// Build a real ext4 image from the layers. `align` toggles the new writer
+/// option; everything else is identical so the only variable is alignment.
+/// Fixed 4 GiB device for all builds so the block grids are comparable.
+fn build_ext4(img: &Image, align: bool) -> File {
+    let mut layers: Vec<File> =
+        img.layer_blobs.iter().map(|p| decompress_layer(p).expect("decompress")).collect();
+    let mut writer_options = vec![
+        WriterOption::MaximumDiskSize(4 * 1024 * 1024 * 1024),
+        WriterOption::Uuid(deterministic_uuid(&img.seed)),
+    ];
+    // Escape hatch for fsck runs. With s_journal_uuid set, `e2fsck` looks for an
+    // "external journal" and aborts before it inspects bitmaps; set
+    // DEDUP_PROBE_NO_JOURNAL to build without a journal. Production keeps it.
+    if std::env::var("DEDUP_PROBE_NO_JOURNAL").is_err() {
+        writer_options.push(WriterOption::Journal(1024));
+    }
+    if align {
+        writer_options.push(WriterOption::AlignData { align: GRID as u32, min_size: ALIGN_THRESHOLD });
+    }
+    let opts = ConvertOptions { convert_backslash: false, writer_options };
+    let out = tempfile::tempfile().expect("tempfile");
+    let mut fs = convert_oci_layers_to_ext4(&mut layers, out, &opts).expect("convert");
+    fs.seek(SeekFrom::Start(0)).unwrap();
+    fs
+}
+
+fn is_zero(d: &[u8]) -> bool {
+    let (p, c, s) = unsafe { d.align_to::<u64>() }; // SAFETY: every bit pattern is a valid u64
+    p.iter().all(|&b| b == 0) && c.iter().all(|&w| w == 0) && s.iter().all(|&b| b == 0)
+}
+
+fn read_all(mut f: File) -> Vec<u8> {
+    let mut v = Vec::new();
+    f.seek(SeekFrom::Start(0)).unwrap();
+    f.read_to_end(&mut v).unwrap();
+    v
+}
+
+// ---- chunking schemes; each returns (content hash -> stored bytes) units ----
+
+/// Fixed grid (what production does today). Skips zero blocks, exactly like the
+/// real flush path — so alignment padding (holes) is free here.
+fn fixed_grid_units(img: &[u8], stored: &mut HashMap<Blake3Hash, usize>, raw: &mut u64, zero: &mut u64) {
+    for blk in img.chunks(GRID) {
+        if is_zero(blk) {
+            *zero += 1;
+            continue;
+        }
+        *raw += blk.len() as u64;
+        stored.entry(blake3_128(blk)).or_insert_with(|| lz4_compress(blk).len());
+    }
+}
+
+/// Gear table for the gear-hash content-defined chunker below (the alignment-immune
+/// ceiling). `cdc_units` cuts where the rolling hash hits the mask; MIN/MAX clamp chunk length.
+fn gear_table() -> [u64; 256] {
+    // splitmix64 from a fixed seed — deterministic across runs.
+    let mut s = 0x9E3779B97F4A7C15u64;
+    let mut t = [0u64; 256];
+    for e in &mut t {
+        s = s.wrapping_add(0x9E3779B97F4A7C15);
+        let mut z = s;
+        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
+        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
+        *e = z ^ (z >> 31);
+    }
+    t
+}
+
+fn cdc_units(img: &[u8], gear: &[u64; 256], stored: &mut HashMap<Blake3Hash, usize>, raw: &mut u64) {
+    const MIN: usize = 16 * 1024;
+    const MAX: usize = 1024 * 1024;
+    const MASK: u64 = (1 << 17) - 1; // 2^-17 cut odds per byte: ~128 KiB mean past MIN, near the grid
+    let n = img.len();
+    let mut start = 0;
+    while start < n {
+        let mut hash = 0u64;
+        let mut j = start;
+        let cut = loop {
+            if j >= n {
+                break n;
+            }
+            if j - start >= MAX {
+                break j;
+            }
+            hash = (hash << 1).wrapping_add(gear[img[j] as usize]);
+            j += 1;
+            if j - start >= MIN && (hash & MASK) == 0 {
+                break j;
+            }
+        };
+        let chunk = &img[start..cut];
+        if !is_zero(chunk) {
+            *raw += chunk.len() as u64;
+            stored.entry(blake3_128(chunk)).or_insert_with(|| lz4_compress(chunk).len());
+        }
+        start = cut;
+    }
+}
+
+fn human(b: u64) -> String {
+    let f = b as f64;
+    if b >= 1 << 30 {
+        format!("{:.2} GiB", f / (1u64 << 30) as f64)
+    } else if b >= 1 << 20 {
+        format!("{:.1} MiB", f / (1u64 << 20) as f64)
+    } else {
+        format!("{:.1} KiB", f / (1u64 << 10) as f64)
+    }
+}
+
+/// Sum of stored (lz4) bytes over the union of unique content hashes.
+fn union_stored(maps: &[&HashMap<Blake3Hash, usize>]) -> u64 {
+    let mut u: HashMap<Blake3Hash, usize> = HashMap::new();
+    for m in maps {
+        for (h, &s) in *m {
+            u.entry(*h).or_insert(s);
+        }
+    }
+    u.values().map(|&v| v as u64).sum()
+}
+
+fn main() {
+    let dirs: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
+    if dirs.is_empty() {
+        eprintln!("usage: dedup_probe <skopeo-dir> [<skopeo-dir> ...]");
+        std::process::exit(2);
+    }
+    let images: Vec<Image> = dirs.iter().map(|d| load_image(d)).collect();
+    let gear = gear_table();
+
+    eprintln!("Building real ext4 images (unaligned + aligned) via production convert path...");
+    let mut unaligned: Vec<Vec<u8>> = Vec::new();
+    let mut aligned: Vec<Vec<u8>> = Vec::new();
+    for img in &images {
+        eprint!("  {} ...", img.name);
+        unaligned.push(read_all(build_ext4(img, false)));
+        aligned.push(read_all(build_ext4(img, true)));
+        eprintln!(" done");
+    }
+
+    // =====================================================================
+    // E1 — MECHANISM: same bytes, shifted. Isolates alignment, zero confound.
+    // =====================================================================
+    println!("\n================ E1: self-shift control ({}) ================", images[0].name);
+    println!("Dedup of the image against a copy of ITSELF shifted by δ bytes.");
+    println!("(fixed {GRID}-byte grid; shared = blocks whose content hash matches)");
+    let base = &unaligned[0];
+    let mut base_blocks: HashMap<Blake3Hash, ()> = HashMap::new();
+    let mut base_n = 0u64;
+    for blk in base.chunks(GRID) {
+        if !is_zero(blk) {
+            base_blocks.insert(blake3_128(blk), ());
+            base_n += 1;
+        }
+    }
+    for &delta in &[0usize, 4096, 8192, 65536, GRID, 2 * GRID] {
+        let shifted = &base[delta.min(base.len())..];
+        let mut shared = 0u64;
+        let mut total = 0u64;
+        for blk in shifted.chunks(GRID) {
+            if is_zero(blk) {
+                continue;
+            }
+            total += 1;
+            if base_blocks.contains_key(&blake3_128(blk)) {
+                shared += 1;
+            }
+        }
+        let pct = if total > 0 { 100.0 * shared as f64 / total as f64 } else { 0.0 };
+        let tag = if delta % GRID == 0 { "  (whole-grid multiple → control)" } else { "  (sub-grid shift)" };
+        println!("  δ = {:>7} B : {:>5.1}% blocks dedup ({shared}/{total}){tag}", delta, pct);
+    }
+    println!("  base non-zero blocks: {base_n}");
+
+    // =====================================================================
+    // E2 + E3 — same real ext4 bytes under 3 schemes.
+    // =====================================================================
+    let mut fixed_maps = Vec::new();
+    let mut aligned_maps = Vec::new();
+    let mut cdc_maps = Vec::new();
+    let (mut fixed_raw, mut aligned_raw, mut cdc_raw) = (0u64, 0u64, 0u64);
+    let (mut unaligned_zero, mut aligned_zero) = (0u64, 0u64);
+
+    println!("\n================ E2/E3: per-image + cross-image dedup ================");
+    println!("Same production ext4 bytes. Three schemes:");
+    println!("  TODAY   = fixed 128 KiB grid on the unaligned image (what ships now)");
+    println!("  ALIGNED = fixed 128 KiB grid on the aligned image (the proposed fix)");
+    println!("  CDC     = content-defined chunking on the unaligned image (ceiling)\n");
+    for (i, img) in images.iter().enumerate() {
+        let mut fm = HashMap::new();
+        let mut am = HashMap::new();
+        let mut cm = HashMap::new();
+        let (mut r1, mut r2, mut r3) = (0u64, 0u64, 0u64);
+        let (mut z1, mut z2) = (0u64, 0u64);
+        fixed_grid_units(&unaligned[i], &mut fm, &mut r1, &mut z1);
+        fixed_grid_units(&aligned[i], &mut am, &mut r2, &mut z2);
+        cdc_units(&unaligned[i], &gear, &mut cm, &mut r3);
+        println!(
+            "  {:14} stored: TODAY {:>10} | ALIGNED {:>10} | CDC {:>10}",
+            img.name,
+            human(fm.values().map(|&v| v as u64).sum()),
+            human(am.values().map(|&v| v as u64).sum()),
+            human(cm.values().map(|&v| v as u64).sum()),
+        );
+        fixed_raw += r1;
+        aligned_raw += r2;
+        cdc_raw += r3;
+        unaligned_zero += z1;
+        aligned_zero += z2;
+        fixed_maps.push(fm);
+        aligned_maps.push(am);
+        cdc_maps.push(cm);
+    }
+
+    let sum_indiv = |maps: &[HashMap<Blake3Hash, usize>]| -> u64 {
+        maps.iter().map(|m| m.values().map(|&v| v as u64).sum::<u64>()).sum()
+    };
+    let fixed_indiv = sum_indiv(&fixed_maps);
+    let aligned_indiv = sum_indiv(&aligned_maps);
+    let cdc_indiv = sum_indiv(&cdc_maps);
+    let fixed_union = union_stored(&fixed_maps.iter().collect::<Vec<_>>());
+    let aligned_union = union_stored(&aligned_maps.iter().collect::<Vec<_>>());
+    let cdc_union = union_stored(&cdc_maps.iter().collect::<Vec<_>>());
+
+    let row = |label: &str, raw: u64, indiv: u64, union: u64| {
+        let cross = indiv.saturating_sub(union);
+        let cross_pct = if indiv > 0 { 100.0 * cross as f64 / indiv as f64 } else { 0.0 };
+        println!(
+            "  {label:8}: store-each {:>10} | store-union {:>10} | cross-image dedup {:>9} ({:.1}%)",
+            human(indiv),
+            human(union),
+            human(cross),
+            cross_pct,
+        );
+        let _ = raw;
+    };
+    println!("\n  --- storing ALL {} images (lz4-compressed, zeros dropped) ---", images.len());
+    row("TODAY", fixed_raw, fixed_indiv, fixed_union);
+    row("ALIGNED", aligned_raw, aligned_indiv, aligned_union);
+    row("CDC", cdc_raw, cdc_indiv, cdc_union);
+
+    println!("\n  store-union is total S3/cache bytes for the whole set.");
+    if fixed_union > 0 {
+        let saved = fixed_union.saturating_sub(aligned_union);
+        println!(
+            "  ALIGNED vs TODAY: {} smaller ({:.1}%) — this is the recovered dedup.",
+            human(saved),
+            100.0 * saved as f64 / fixed_union as f64
+        );
+        let ceil = fixed_union.saturating_sub(cdc_union);
+        let captured = if ceil > 0 { 100.0 * saved as f64 / ceil as f64 } else { 0.0 };
+        println!("  CDC ceiling would save {} — ALIGNED captures {:.0}% of it.", human(ceil), captured);
+    }
+
+    // =====================================================================
+    // Padding cost + round-trip validity of the aligned images.
+    // =====================================================================
+    println!("\n================ Cost & validity of the fix ================");
+    println!(
+        "  zero blocks (holes): unaligned {} → aligned {} (+{} blocks of padding)",
+        unaligned_zero,
+        aligned_zero,
+        aligned_zero.saturating_sub(unaligned_zero)
+    );
+    println!("  padding is zeros → NOT stored: TODAY/ALIGNED store-union already exclude it.");
+    for (i, img) in images.iter().enumerate() {
+        match verify_roundtrip(&aligned[i]) {
+            Ok((files, bytes)) => println!(
+                "  {:14} aligned image round-trips OK: {} files, {} of data read back",
+                img.name,
+                files,
+                human(bytes)
+            ),
+            Err(e) => println!("  {:14} ROUND-TRIP FAILED: {e}", img.name),
+        }
+    }
+
+    // =====================================================================
+    // V1 — DETERMINISM. Content-addressing requires byte-identical rebuilds.
+    // =====================================================================
+    println!("\n================ V1: determinism ================");
+    let a1 = read_all(build_ext4(&images[0], true));
+    let a2 = read_all(build_ext4(&images[0], true));
+    println!(
+        "  {} aligned built twice: {} (and differs from unaligned: {})",
+        images[0].name,
+        if a1 == a2 { "IDENTICAL ✓" } else { "DIFFERENT ✗ — BUG" },
+        if a1 != unaligned[0] { "yes ✓" } else { "no ✗" },
+    );
+
+    // =====================================================================
+    // V2 — INDEPENDENT GROUND TRUTH. Hash whole regular files (no chunking at
+    // all) read back from the ALIGNED images. This both proves the files
+    // survived alignment intact and measures the true shared-content fraction
+    // with a method that shares NO code with the block-grid path.
+    // =====================================================================
+    println!("\n================ V2: file-level ground truth (chunking-agnostic) ================");
+    let file_maps: Vec<HashMap<Blake3Hash, u64>> =
+        aligned.iter().map(|img| extract_file_contents(img)).collect();
+    let file_indiv: u64 = file_maps.iter().map(|m| m.values().sum::<u64>()).sum();
+    let mut file_union: HashMap<Blake3Hash, u64> = HashMap::new();
+    for m in &file_maps {
+        for (h, &s) in m {
+            file_union.entry(*h).or_insert(s);
+        }
+    }
+    let file_union_bytes: u64 = file_union.values().sum();
+    let file_cross = file_indiv.saturating_sub(file_union_bytes);
+    let file_pct = if file_indiv > 0 { 100.0 * file_cross as f64 / file_indiv as f64 } else { 0.0 };
+    println!("  whole-file content hashing (raw bytes, identical files counted once):");
+    println!(
+        "    store-each {} | store-union {} | cross-image identical-file content {} ({:.1}%)",
+        human(file_indiv),
+        human(file_union_bytes),
+        human(file_cross),
+        file_pct
+    );
+    println!("  → THIS is how much byte-identical content genuinely exists across the set.");
+    println!("    Compare the E2/E3 rows: TODAY falls short; ALIGNED should land near this; if it does,");
+    println!("    the win is real and the CDC ceiling was not mis-tuned.");
+
+    // =====================================================================
+    // V3 — PER-FILE SPOTLIGHT. Take one large file that is byte-identical in
+    // two images and show its blocks dedup under ALIGNED but not under TODAY.
+    // =====================================================================
+    if images.len() >= 2 {
+        println!("\n================ V3: per-file spotlight (img[0] vs img[1]) ================");
+        spotlight(&images, &aligned, &unaligned, &aligned_maps, &fixed_maps);
+    }
+
+    // Dump images so they can be checked with the real e2fsck (external proof
+    // of filesystem validity, independent of our own reader).
+    let outdir = std::path::Path::new("/tmp/dedup_verify");
+    std::fs::create_dir_all(outdir).ok();
+    for (i, img) in images.iter().enumerate() {
+        std::fs::write(outdir.join(format!("{}-unaligned.img", img.name)), &unaligned[i]).ok();
+        std::fs::write(outdir.join(format!("{}-aligned.img", img.name)), &aligned[i]).ok();
+    }
+    println!("\n  images dumped to {} for external `e2fsck -fn` validation.", outdir.display());
+}
+
+/// Read every regular file from an ext4 image and map its whole-content hash to
+/// its size. Uses only the reader — no block-grid code — so it is an
+/// independent oracle for "how much identical file content exists".
+fn extract_file_contents(img: &[u8]) -> HashMap<Blake3Hash, u64> {
+    let mut out: HashMap<Blake3Hash, u64> = HashMap::new();
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor).expect("reader");
+    let entries = reader.walk().expect("walk");
+    for e in entries {
+        if (e.mode & 0xF000) == 0x8000 && e.size > 0 {
+            let inode = reader.read_inode(e.inode_number).expect("inode");
+            let data = reader.read_data(&inode).expect("data");
+            // Key on the content hash (value = size); identical files within the image count once.
+            out.entry(blake3_128(&data)).or_insert(data.len() as u64);
+        }
+    }
+    out
+}
+
+fn read_file_by_path(img: &[u8], path: &str) -> Option<Vec<u8>> {
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor).ok()?;
+    let entries = reader.walk().ok()?;
+    for e in entries {
+        if e.path == path && (e.mode & 0xF000) == 0x8000 {
+            let inode = reader.read_inode(e.inode_number).ok()?;
+            return reader.read_data(&inode).ok();
+        }
+    }
+    None
+}
+
+fn spotlight(
+    images: &[Image],
+    aligned: &[Vec<u8>],
+    unaligned: &[Vec<u8>],
+    aligned_maps: &[HashMap<Blake3Hash, usize>],
+    fixed_maps: &[HashMap<Blake3Hash, usize>],
+) {
+    // Find a path present in BOTH images with byte-identical content and > 256 KiB.
+    let cur = std::io::Cursor::new(&aligned[0]);
+    let mut r0 = ext4::reader::Reader::new(cur).expect("reader");
+    let walk0 = r0.walk().expect("walk");
+    let mut candidates: Vec<(String, u64)> = walk0
+        .iter()
+        .filter(|e| (e.mode & 0xF000) == 0x8000 && e.size > 256 * 1024)
+        .map(|e| (e.path.clone(), e.size))
+        .collect();
+    candidates.sort_by_key(|(_, s)| std::cmp::Reverse(*s));
+
+    for (path, size) in candidates {
+        let c0 = read_file_by_path(&aligned[0], &path);
+        let c1 = read_file_by_path(&aligned[1], &path);
+        let (Some(c0), Some(c1)) = (c0, c1) else { continue };
+        if c0 != c1 {
+            continue; // not byte-identical across the two images
+        }
+        // Hash this file's content in 128 KiB chunks. Under ALIGNED the file
+        // starts on the grid, so its full sub-blocks ARE these content chunks.
+        let chunk_hashes: Vec<Blake3Hash> = c0.chunks(GRID).filter(|b| !is_zero(b)).map(blake3_128).collect();
+        let n = chunk_hashes.len();
+        // How many of this file's content-chunks appear as real grid blocks in
+        // the OTHER image (img[1]) under each scheme?
+        let in_aligned = chunk_hashes.iter().filter(|h| aligned_maps[1].contains_key(*h)).count();
+        let in_today = chunk_hashes.iter().filter(|h| fixed_maps[1].contains_key(*h)).count();
+        println!("  file: {path}  ({}, byte-identical in {} & {})", human(size), images[0].name, images[1].name);
+        println!("    its {n} content-chunks found as grid blocks in {}:", images[1].name);
+        println!(
+            "      under ALIGNED: {in_aligned}/{n} ({:.0}%) dedup",
+            100.0 * in_aligned as f64 / n.max(1) as f64
+        );
+        println!(
+            "      under TODAY:   {in_today}/{n} ({:.0}%) dedup",
+            100.0 * in_today as f64 / n.max(1) as f64
+        );
+        let _ = unaligned;
+        return;
+    }
+    println!("  (no >256 KiB byte-identical file shared across the two images found)");
+}
+
+/// Read the aligned ext4 back through the production reader to prove the image
+/// is a valid filesystem (not just bytes that happen to dedup well).
+fn verify_roundtrip(img: &[u8]) -> std::io::Result<(usize, u64)> {
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor)?;
+    let entries = reader.walk()?;
+    let mut files = 0usize;
+    let mut bytes = 0u64;
+    for e in entries {
+        if (e.mode & 0xF000) == 0x8000 {
+            // regular file
+            let inode = reader.read_inode(e.inode_number)?;
+            let data = reader.read_data(&inode)?;
+            bytes += data.len() as u64;
+            files += 1;
+        }
+    }
+    Ok((files, bytes))
+}

From ff6d465eccee6184ba209e2a1cc9001ddfa4e7bf Mon Sep 17 00:00:00 2001
From: Jared Lunde <jared.lunde@gmail.com>
Date: Fri, 5 Jun 2026 18:37:28 -0700
Subject: [PATCH 2/6] feat(ext4): metadata-aware data alignment + property-fuzz
 validity harness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Completes the block-alignment dedup work on top of the group-aware
allocator.

Alignment (WriterOption::AlignData) is now correct and e2fsck-clean:
- Composes with reserved-block skipping: when an aligned file start lands
  on a backup-superblock/GDT block, write_file_data skips past it before
  recording data_start_block.
- Padding gaps are unreferenced free space, not data. record_free_hole
  tracks them (excluding reserved metadata blocks), and close() clears
  them from the otherwise-dense block bitmap so the free counts are
  correct. Previously the dense bitmap marked padding as used, which
  e2fsck rejected.
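
The two steps above can be sketched in isolation. This is a minimal,
standalone illustration, not the writer's code: `free_runs` and
`clear_holes` are hypothetical names, the `is_reserved` predicate stands
in for the writer's reserved-block check, and the bitmap is assumed to
hold one bit per block, least-significant bit first.

```rust
/// Assumed group size in blocks (matches a 4 KiB block, 128 MiB group).
const BLOCKS_PER_GROUP: u32 = 32768;

/// Split [start, end) into (start, len) free runs, skipping every block
/// that `is_reserved` reports. Reserved blocks never become holes.
fn free_runs(start: u32, end: u32, is_reserved: impl Fn(u32) -> bool) -> Vec<(u32, u32)> {
    let mut runs = Vec::new();
    let mut b = start;
    while b < end {
        if is_reserved(b) {
            b += 1;
            continue;
        }
        let run_start = b;
        while b < end && !is_reserved(b) {
            b += 1;
        }
        runs.push((run_start, b - run_start));
    }
    runs
}

/// Clear the part of each hole that falls inside group `g` from that
/// group's bitmap. Returns the number of bits actually cleared so the
/// caller can lower its used-block count by the same amount.
fn clear_holes(bitmap: &mut [u8], g: u32, holes: &[(u32, u32)]) -> u32 {
    let gstart = g * BLOCKS_PER_GROUP;
    let gend = gstart + BLOCKS_PER_GROUP;
    let mut cleared = 0;
    for &(start, len) in holes {
        let lo = start.max(gstart);
        let hi = (start + len).min(gend);
        // An empty range (lo >= hi) means the hole misses this group.
        for b in lo..hi {
            let j = (b - gstart) as usize;
            let mask = 1u8 << (j % 8);
            if bitmap[j / 8] & mask != 0 {
                bitmap[j / 8] &= !mask;
                cleared += 1;
            }
        }
    }
    cleared
}
```

Counting only bits that were set keeps the used-block total correct even
if two recorded holes overlap or a hole is applied twice.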

Validated end to end on real images (python:3.12/3.13-slim): aligned
builds are e2fsck-clean, mount in the kernel and read back byte-exact,
are deterministic, and realize the dedup win. Cross-image dedup rises
from 26% to 52% (3-image set), matching the file-level ground truth
(52%) and capturing 98% of the FastCDC ceiling, at zero stored cost
(padding is dropped zeros).
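
Why grid alignment matters can be shown without any ext4 at all. The
sketch below is an assumption-laden toy, not the probe: it uses the
standard library's `DefaultHasher` instead of BLAKE3, a 4 KiB grid
instead of 128 KiB, and pseudo-random bytes instead of image data. The
helper names are hypothetical. It compares a buffer against a copy of
itself shifted by some number of bytes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash one grid block (stand-in for the production content hash).
fn block_hash(b: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    b.hash(&mut h);
    h.finish()
}

/// Fraction of the grid blocks of `data[shift..]` whose hash also
/// appears among the grid blocks of `data`.
fn shared_fraction(data: &[u8], grid: usize, shift: usize) -> f64 {
    let base: HashSet<u64> = data.chunks(grid).map(block_hash).collect();
    let shifted = &data[shift.min(data.len())..];
    let total = shifted.chunks(grid).count();
    if total == 0 {
        return 0.0;
    }
    let shared = shifted.chunks(grid).filter(|b| base.contains(&block_hash(b))).count();
    shared as f64 / total as f64
}

/// Deterministic pseudo-random bytes (splitmix64) standing in for content.
fn pseudo_random(len: usize) -> Vec<u8> {
    let mut s = 0x9E3779B97F4A7C15u64;
    let mut out = Vec::with_capacity(len + 8);
    while out.len() < len {
        s = s.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = s;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        out.extend_from_slice(&(z ^ (z >> 31)).to_le_bytes());
    }
    out.truncate(len);
    out
}
```

With `let d = pseudo_random(64 * 1024)`, a whole-grid shift
(`shared_fraction(&d, 4096, 4096)`) shares every block, while a sub-grid
shift (`shared_fraction(&d, 4096, 512)`) shares none, although the bytes
are the same in both cases.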

Adds fuzz_multigroup_validity_and_content: random multi-group filesets
(mixed small/medium/large) must be e2fsck-clean AND read back byte-exact,
both unaligned and aligned. Deterministic seeds reproduce failures;
EXT4_FUZZ_SEEDS scales coverage (ran clean at 64 seeds = 128 images).
This is the generalized gate over the size/position space where the
data-on-metadata and alignment-bitmap bugs lived.
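
The seeding idea can be sketched as follows. This is a hypothetical
shape, not the harness in this patch: `splitmix64`, `fileset`, and
`seed_count` are illustrative names, and the size classes and file
counts are assumptions chosen only to show that one seed always yields
the same fileset.

```rust
/// One splitmix64 step: advances `state` and returns the next value.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Derive a (path, size) fileset from a seed. Same seed, same list, so a
/// failing seed can be replayed exactly.
fn fileset(seed: u64) -> Vec<(String, usize)> {
    let mut s = seed;
    let count = 4 + (splitmix64(&mut s) % 12) as usize;
    (0..count)
        .map(|i| {
            let size = match splitmix64(&mut s) % 3 {
                // small: under one 4 KiB block
                0 => (splitmix64(&mut s) % 4096) as usize,
                // medium: 64 KiB up to just over 1 MiB
                1 => 64 * 1024 + (splitmix64(&mut s) % (1 << 20)) as usize,
                // large: 8 MiB up to 24 MiB
                _ => 8 * 1024 * 1024 + (splitmix64(&mut s) % (1 << 24)) as usize,
            };
            (format!("f{i}.bin"), size)
        })
        .collect()
}

/// Parse a seed count such as the value of EXT4_FUZZ_SEEDS, falling back
/// to `default` when the variable is unset or not a number.
fn seed_count(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}
```

A caller could pass `std::env::var("EXT4_FUZZ_SEEDS").ok().as_deref()`
to `seed_count` and then build one unaligned and one aligned image per
seed in `0..n`.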

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ext4/src/writer.rs          |  56 +++++++++++--
 ext4/tests/fsck_validity.rs | 151 +++++++++++++++++++++++++++++-------
 2 files changed, 176 insertions(+), 31 deletions(-)

diff --git a/ext4/src/writer.rs b/ext4/src/writer.rs
index 589675fd..e03c8df1 100644
--- a/ext4/src/writer.rs
+++ b/ext4/src/writer.rs
@@ -257,6 +257,11 @@ pub struct Writer<W: Read + Write + Seek> {
     /// the data is generally non-contiguous and `pos - data_written` no longer
     /// locates the start — this does.
     data_start_block: u32,
+    /// Unreferenced free block ranges created by data alignment padding. The
+    /// block bitmap assumes a densely packed data region; these holes must be
+    /// cleared from it so the filesystem is consistent. Empty unless alignment
+    /// is enabled.
+    free_holes: Vec<(u32, u32)>,
 }
 
 impl<W: Read + Write + Seek> Writer<W> {
@@ -278,6 +283,7 @@ impl<W: Read + Write + Seek> Writer<W> {
             data_align: 0,
             data_align_min: 0,
             data_start_block: 0,
+            free_holes: Vec::new(),
         };
         for opt in opts {
             match opt {
@@ -440,6 +446,23 @@ impl<W: Read + Write + Seek> Writer<W> {
         Ok(off)
     }
 
+    /// Record [start, end) as free holes, excluding reserved metadata blocks, which
+    /// stay marked used: they hold backup superblocks and group descriptors, not free space.
+    fn record_free_hole(&mut self, start: u32, end: u32) {
+        let mut b = start;
+        while b < end {
+            if self.is_reserved_block(b) {
+                b += 1;
+                continue;
+            }
+            let run_start = b;
+            b = self.next_reserved_block_ge(b).unwrap_or(end).min(end);
+            if b > run_start {
+                self.free_holes.push((run_start, b - run_start));
+            }
+        }
+    }
+
     /// The contiguous, non-reserved physical runs covering [start, end).
     fn physical_runs(&self, start: u32, end: u32) -> Vec<(u32, u32)> {
         let mut runs = Vec::new();
@@ -1022,16 +1045,20 @@ impl<W: Read + Write + Seek> Writer<W> {
 
         if self.inode_ref(child_ino)?.mode & TYPE_MASK == format::S_IFREG {
             self.start_inode(name, child_ino, f.size)?;
-            // Align the start of large file payloads to the dedup block grid.
-            // Done before any data is written (data_written == 0), so the
-            // file's first data block — recorded as `pos - data_written` in
-            // write_extents — lands on the boundary. The padded gap is a hole
-            // (zeros), which the downstream block store drops for free.
+            // Align the start of large file payloads to the dedup block grid, so
+            // the same file produces the same blocks regardless of upstream
+            // churn. The padded gap is unreferenced free space; record it so the
+            // block bitmap marks it free (reserved metadata blocks within the
+            // gap stay used). write_file_data then skips any reserved block at
+            // the aligned position before recording data_start_block.
             if self.data_align > 0 && f.size >= self.data_align_min {
                 let align = self.data_align;
                 let rem = self.pos % align;
                 if rem != 0 {
+                    let pad_start = self.block();
                     self.write_zeros(align - rem)?;
+                    let pad_end = self.block();
+                    self.record_free_hole(pad_start, pad_end);
                 }
             }
         }
@@ -1510,6 +1537,8 @@ impl<W: Read + Write + Seek> Writer<W> {
         let inode_table_size_per_group = inodes_per_group * INODE_SIZE as u32 / BLOCK_SIZE as u32;
         let mut total_used_blocks: u32 = 0;
         let mut total_used_inodes: u32 = 0;
+        // Alignment padding holes to clear from the otherwise-dense bitmap.
+        let free_holes = std::mem::take(&mut self.free_holes);
 
         for g in 0..groups {
             let mut bitmap_buf = vec![0u8; BLOCK_SIZE as usize * 2];
@@ -1542,6 +1571,23 @@ impl<W: Read + Write + Seek> Writer<W> {
                     used_block_count += 1;
                 }
             }
+            // Clear alignment padding holes: the bitmap is dense by default, but
+            // these blocks are unreferenced free space.
+            let gstart = g * BLOCKS_PER_GROUP;
+            for &(hstart, hlen) in &free_holes {
+                let lo = hstart.max(gstart);
+                let hi = (hstart + hlen).min(gstart + BLOCKS_PER_GROUP);
+                let mut b = lo;
+                while b < hi {
+                    let j = b - gstart;
+                    let mask = 1u8 << (j % 8);
+                    if bitmap_buf[(j / 8) as usize] & mask != 0 {
+                        bitmap_buf[(j / 8) as usize] &= !mask;
+                        used_block_count -= 1;
+                    }
+                    b += 1;
+                }
+            }
 
             // Inode bitmap
             for j in 0..inodes_per_group {
diff --git a/ext4/tests/fsck_validity.rs b/ext4/tests/fsck_validity.rs
index 803968fb..1a7acda5 100644
--- a/ext4/tests/fsck_validity.rs
+++ b/ext4/tests/fsck_validity.rs
@@ -8,6 +8,7 @@
 //!
 //! Run with: `cargo test -p ext4 --test fsck_validity -- --include-ignored`
 
+use std::io::Write;
 use std::path::PathBuf;
 use std::process::Command;
 
@@ -43,12 +44,13 @@ fn content(seed: u64, len: usize) -> Vec<u8> {
 /// production path, write it to a temp file, and run `e2fsck -fn` on it.
 /// Returns Ok(()) if e2fsck reports a clean filesystem (exit 0), else Err(report).
 fn build_and_fsck(files: &[(&str, usize)], align: Option<(u32, u32)>) -> Result<(), String> {
-    let Some(e2fsck) = find_e2fsck() else {
-        eprintln!("SKIP: e2fsck not installed");
-        return Ok(());
-    };
+    let owned: Vec<(String, usize)> = files.iter().map(|(p, s)| ((*p).to_string(), *s)).collect();
+    e2fsck_clean(&build_image(&owned, align))
+}
 
-    // 1. Synthesize a tar stream.
+/// Build a real ext4 image (production convert path) from `(path, size)` files,
+/// each filled with the deterministic `content(index, size)`.
+fn build_image(files: &[(String, usize)], align: Option<(u32, u32)>) -> Vec<u8> {
     let mut tar = tar::Builder::new(Vec::new());
     for (i, (path, size)) in files.iter().enumerate() {
         let data = content(i as u64, *size);
@@ -58,42 +60,81 @@ fn build_and_fsck(files: &[(&str, usize)], align: Option<(u32, u32)>) -> Result<
         h.set_mtime(0);
         h.set_entry_type(tar::EntryType::Regular);
         h.set_cksum();
-        tar.append_data(&mut h, path, &data[..]).map_err(|e| format!("tar append: {e}"))?;
+        tar.append_data(&mut h, path, &data[..]).unwrap();
     }
-    let tar_bytes = tar.into_inner().map_err(|e| format!("tar finish: {e}"))?;
+    let tar_bytes = tar.into_inner().unwrap();
 
-    // 2. Convert to a real ext4 image (same code bless uses).
     let mut writer_options = vec![
-        WriterOption::MaximumDiskSize(2 * 1024 * 1024 * 1024),
+        WriterOption::MaximumDiskSize(4 * 1024 * 1024 * 1024),
         WriterOption::Uuid([0x11; 16]),
     ];
     if let Some((a, m)) = align {
         writer_options.push(WriterOption::AlignData { align: a, min_size: m });
     }
     let opts = ConvertOptions { convert_backslash: false, writer_options };
+    let mut img: Vec<u8> = Vec::new();
+    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), std::io::Cursor::new(&mut img), &opts)
+        .unwrap();
+    img
+}
 
-    let tmp = tempfile::NamedTempFile::new().map_err(|e| format!("tmp: {e}"))?;
-    let out = tmp.reopen().map_err(|e| format!("reopen: {e}"))?;
-    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), out, &opts)
-        .map_err(|e| format!("convert: {e}"))?;
-    tmp.as_file().sync_all().ok();
-
-    // 3. The oracle: e2fsck -fn. Exit 0 == clean.
+/// The oracle: run `e2fsck -fn` on the image. Ok == clean (exit 0). Skips
+/// (returns Ok) when e2fsck is not installed.
+fn e2fsck_clean(img: &[u8]) -> Result<(), String> {
+    let Some(e2fsck) = find_e2fsck() else {
+        eprintln!("SKIP: e2fsck not installed");
+        return Ok(());
+    };
+    let mut tmp = tempfile::NamedTempFile::new().map_err(|e| format!("tmp: {e}"))?;
+    tmp.write_all(img).map_err(|e| format!("write: {e}"))?;
+    tmp.flush().ok();
     let output = Command::new(&e2fsck)
         .args(["-fn"])
         .arg(tmp.path())
         .output()
         .map_err(|e| format!("spawn e2fsck: {e}"))?;
-    let code = output.status.code().unwrap_or(-1);
-    if code == 0 {
+    if output.status.code() == Some(0) {
         Ok(())
     } else {
-        let mut report = format!("e2fsck exit={code} (nonzero = filesystem errors)\n");
+        let mut report = format!("e2fsck exit={:?} (nonzero = filesystem errors)\n", output.status.code());
         report.push_str(&String::from_utf8_lossy(&output.stdout));
         // Trim the giant bitmap-difference dumps to keep failures readable.
-        let trimmed: String = report.lines().take(40).collect::<Vec<_>>().join("\n");
-        Err(trimmed)
+        Err(report.lines().take(30).collect::<Vec<_>>().join("\n"))
+    }
+}
+
+/// Read every file back via the reader (which assembles from the on-disk extent
+/// tree) and assert byte-exact equality with the known input — catches any
+/// logical-ordering bug introduced by fragmentation around reserved blocks.
+fn content_matches(img: &[u8], files: &[(String, usize)]) -> Result<(), String> {
+    let mut want: std::collections::HashMap<String, (u64, usize)> = std::collections::HashMap::new();
+    for (i, (p, s)) in files.iter().enumerate() {
+        want.insert(p.trim_start_matches('/').to_string(), (i as u64, *s));
+    }
+    let mut reader =
+        ext4::reader::Reader::new(std::io::Cursor::new(img)).map_err(|e| format!("reader: {e}"))?;
+    let entries = reader.walk().map_err(|e| format!("walk: {e}"))?;
+    let mut checked = 0;
+    for e in entries {
+        if (e.mode & 0xF000) != 0x8000 {
+            continue;
+        }
+        let path = e.path.trim_start_matches('/').to_string();
+        let Some(&(idx, size)) = want.get(&path) else { continue };
+        let inode = reader.read_inode(e.inode_number).map_err(|e| format!("{path}: inode: {e}"))?;
+        let got = reader.read_data(&inode).map_err(|e| format!("{path}: read: {e}"))?;
+        if got.len() != size {
+            return Err(format!("{path}: size {} != {size}", got.len()));
+        }
+        if got != content(idx, size) {
+            return Err(format!("{path}: CONTENT MISMATCH (fragmentation reordered bytes)"));
+        }
+        checked += 1;
+    }
+    if checked != files.len() {
+        return Err(format!("read back {checked}/{} files", files.len()));
     }
+    Ok(())
 }
 
 /// Baseline: a single-block-group filesystem (< 128 MiB) must be e2fsck-clean.
@@ -114,10 +155,10 @@ fn fsck_single_group_clean() {
 }
 
 /// A filesystem that crosses a block-group boundary (> 128 MiB) must be
-/// e2fsck-clean. It currently is NOT: the writer's linear allocator places file
-/// data on the Group 1 backup superblock / group descriptors (block 32768+),
-/// producing multiply-claimed blocks the kernel rejects. Un-ignore once the
-/// allocator reserves group metadata.
+/// e2fsck-clean. Regression for the original corruption: the linear allocator
+/// used to place file data on the Group 1 backup superblock / group descriptors
+/// (block 32768+), producing multiply-claimed blocks. The group-aware allocator
+/// now skips those reserved blocks and fragments files around them.
 #[test]
 fn fsck_multi_group_clean() {
     // ~160 MiB of file data guarantees crossing into block group 1 (32768 blocks
@@ -202,7 +243,6 @@ fn content_survives_fragmentation() {
 /// must skip reserved blocks). Captures both the metadata-collision and the
 /// padding-bitmap issues found via e2fsck.
 #[test]
-#[ignore = "KNOWN BUG: alignment padding marked used + lands on group metadata; needs metadata-aware align"]
 fn fsck_multi_group_aligned_clean() {
     let files = &[
         ("data/a.bin", 40 * 1024 * 1024),
@@ -214,3 +254,62 @@ fn fsck_multi_group_aligned_clean() {
         panic!("aligned multi-group image is not e2fsck-clean:\n{report}");
     }
 }
+
+/// Property fuzzer: random multi-group filesets must be e2fsck-clean AND read
+/// back byte-exact — both unaligned and aligned. Seeds are deterministic so any
+/// failure reproduces verbatim (the panic prints the seed and fileset). This is
+/// the generalized gate: it sweeps the size/position space where the original
+/// data-on-backup-superblock and alignment-bitmap bugs lived. Crank coverage
+/// with `EXT4_FUZZ_SEEDS=64 cargo test -p ext4 --test fsck_validity fuzz`.
+#[test]
+fn fuzz_multigroup_validity_and_content() {
+    if find_e2fsck().is_none() {
+        eprintln!("SKIP: e2fsck not installed");
+        return;
+    }
+    let seeds: u64 = std::env::var("EXT4_FUZZ_SEEDS")
+        .ok()
+        .and_then(|s| s.parse().ok())
+        .unwrap_or(8);
+
+    for seed in 0..seeds {
+        let mut state = seed.wrapping_mul(0x9E3779B97F4A7C15) | 1;
+        let mut next = move || {
+            state ^= state << 13;
+            state ^= state >> 7;
+            state ^= state << 17;
+            state
+        };
+
+        // Random fileset with a deliberate mix: small (<= 64 KiB, on both sides of
+        // the 16 KiB align threshold), medium (<= ~8 MiB), and large (8-48 MiB).
+        let nfiles = 4 + (next() % 10) as usize;
+        let mut files: Vec<(String, usize)> = Vec::new();
+        let mut total: u64 = 0;
+        for k in 0..nfiles {
+            let size = match next() % 10 {
+                0..=3 => 1 + (next() % (64 * 1024)) as usize,
+                4..=6 => 4096 + (next() % (8 * 1024 * 1024)) as usize,
+                _ => 8 * 1024 * 1024 + (next() % (40 * 1024 * 1024)) as usize,
+            };
+            files.push((format!("d/s{seed}_f{k}.bin"), size));
+            total += size as u64;
+        }
+        // Guarantee at least one block-group boundary (128 MiB) is crossed so the
+        // reserved-block / fragmentation paths are always exercised.
+        if total < 160 * 1024 * 1024 {
+            let pad = (160 * 1024 * 1024 - total) as usize + 4 * 1024 * 1024;
+            files.push((format!("d/s{seed}_big.bin"), pad));
+        }
+
+        for align in [None, Some((128 * 1024u32, 16 * 1024u32))] {
+            let img = build_image(&files, align);
+            if let Err(e) = e2fsck_clean(&img) {
+                panic!("seed={seed} align={align:?} NOT e2fsck-clean:\n{e}\nfiles={files:?}");
+            }
+            if let Err(e) = content_matches(&img, &files) {
+                panic!("seed={seed} align={align:?} content error: {e}\nfiles={files:?}");
+            }
+        }
+    }
+}

From a8d91f03d56284ec4f2e7e14c32146a74e9ec0f7 Mon Sep 17 00:00:00 2001
From: Jared Lunde <jared.lunde@gmail.com>
Date: Fri, 5 Jun 2026 20:50:19 -0700
Subject: [PATCH 3/6] feat(bless): enable block-grid alignment + automate
 kernel-mount check
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Wire the validated dedup alignment into both bless paths:
- run_bless_oci (merged image) and run_bless_oci_layered via layer_store
  (per-layer) now pass AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE }
  (128 KiB = the volume's block size). Threshold = one full dedup block:
  measured as the sweet spot — captures 99% of the FastCDC ceiling (34%
  cross-image reduction on the 3-image probe set) with less logical
  inflation than a 16 KiB threshold.
- device_size estimates get headroom for alignment padding (bless x3->x4,
  layer_store x2->x3). Padding is holes/zeros the block store drops, so it
  costs address space, not stored bytes.
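
For illustration, the headroom arithmetic can be sketched as below. This is
a minimal sketch, not the code in bless.rs or layer_store.rs:
`estimate_device_size` and `MIN_DEVICE_SIZE` are hypothetical names; only
the multiply, the 64 MiB floor, and the power-of-two round-up come from the
diff.

```rust
/// 64 MiB floor, as used by both bless paths in this patch.
const MIN_DEVICE_SIZE: u64 = 64 * 1024 * 1024;

/// Hypothetical helper: scale the compressed size by `factor`, apply the
/// floor, then round up to the next power of two.
fn estimate_device_size(total_compressed: u64, factor: u64) -> u64 {
    total_compressed
        .saturating_mul(factor)
        .max(MIN_DEVICE_SIZE)
        .next_power_of_two()
}

fn main() {
    let mib: u64 = 1024 * 1024;
    for compressed in [10 * mib, 150 * mib] {
        let old = estimate_device_size(compressed, 3);
        let new = estimate_device_size(compressed, 4);
        println!(
            "{} MiB compressed: x3 -> {} MiB, x4 -> {} MiB",
            compressed / mib,
            old / mib,
            new / mib
        );
    }
}
```

Because of the power-of-two round-up, the larger factor only changes the
device size when the scaled estimate crosses a power-of-two boundary.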

Automate the strongest oracle: kernel_mount_content loop-mounts the image
with the real Linux ext4 driver and verifies every file byte-exact
against known input, for both aligned and unaligned builds. Opt-in via
EXT4_MOUNT_TEST=1 (needs root/passwordless sudo), skips by default so CI
stays green; runs in a privileged/nightly job. Verified passing locally.

dedup_probe: alignment threshold is now configurable via
DEDUP_ALIGN_THRESHOLD for measurement sweeps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ext4/tests/fsck_validity.rs    | 70 ++++++++++++++++++++++++++++++++++
 glidefs/src/bin/dedup_probe.rs |  7 +++-
 glidefs/src/cli/bless.rs       | 15 ++++++--
 glidefs/src/oci/layer_store.rs | 10 ++++-
 4 files changed, 95 insertions(+), 7 deletions(-)

diff --git a/ext4/tests/fsck_validity.rs b/ext4/tests/fsck_validity.rs
index 1a7acda5..626ed173 100644
--- a/ext4/tests/fsck_validity.rs
+++ b/ext4/tests/fsck_validity.rs
@@ -313,3 +313,73 @@ fn fuzz_multigroup_validity_and_content() {
         }
     }
 }
+
+/// Privileged kernel-mount content check: loop-mount the image with the REAL
+/// Linux ext4 driver and verify every file's bytes against the known input —
+/// the strongest oracle, independent of both the writer and the in-crate reader.
+/// Opt-in (needs root / passwordless sudo + loop devices), so it skips by
+/// default and runs in a privileged/nightly job:
+///   EXT4_MOUNT_TEST=1 cargo test -p ext4 --test fsck_validity kernel_mount
+#[test]
+fn kernel_mount_content() {
+    if std::env::var("EXT4_MOUNT_TEST").as_deref() != Ok("1") {
+        eprintln!("SKIP: set EXT4_MOUNT_TEST=1 to run the privileged kernel-mount check");
+        return;
+    }
+    let sudo_ok = Command::new("sudo")
+        .args(["-n", "true"])
+        .status()
+        .map(|s| s.success())
+        .unwrap_or(false);
+    if !sudo_ok {
+        eprintln!("SKIP: passwordless sudo not available for mount");
+        return;
+    }
+
+    // Multi-group fileset (~180 MiB) with known deterministic content.
+    let files: Vec<(String, usize)> = vec![
+        ("data/a.bin".into(), 50 * 1024 * 1024),
+        ("data/b.bin".into(), 50 * 1024 * 1024),
+        ("small/x".into(), 4096),
+        ("data/c.bin".into(), 50 * 1024 * 1024),
+        ("data/d.bin".into(), 30 * 1024 * 1024),
+    ];
+
+    for align in [None, Some((128 * 1024u32, 128 * 1024u32))] {
+        let img = build_image(&files, align);
+        let mut tmp = tempfile::NamedTempFile::new().unwrap();
+        tmp.write_all(&img).unwrap();
+        tmp.flush().unwrap();
+        let mnt = tempfile::tempdir().unwrap();
+
+        let mounted = Command::new("sudo")
+            .args(["-n", "mount", "-o", "ro,loop"])
+            .arg(tmp.path())
+            .arg(mnt.path())
+            .status()
+            .expect("spawn mount");
+        assert!(mounted.success(), "kernel mount failed (align={align:?})");
+
+        // Read every file through the kernel and compare to known input. Collect
+        // the result first so we always unmount, even on mismatch.
+        let mut err: Option<String> = None;
+        for (i, (path, size)) in files.iter().enumerate() {
+            match std::fs::read(mnt.path().join(path)) {
+                Ok(got) if got.len() == *size && got == content(i as u64, *size) => {}
+                Ok(got) => {
+                    err = Some(format!("{path}: kernel read mismatch (len {} vs {size})", got.len()));
+                    break;
+                }
+                Err(e) => {
+                    err = Some(format!("{path}: kernel read failed: {e}"));
+                    break;
+                }
+            }
+        }
+
+        let _ = Command::new("sudo").args(["-n", "umount"]).arg(mnt.path()).status();
+        if let Some(e) = err {
+            panic!("kernel-mount content check failed (align={align:?}): {e}");
+        }
+    }
+}
diff --git a/glidefs/src/bin/dedup_probe.rs b/glidefs/src/bin/dedup_probe.rs
index a3c63863..65e23486 100644
--- a/glidefs/src/bin/dedup_probe.rs
+++ b/glidefs/src/bin/dedup_probe.rs
@@ -36,7 +36,10 @@ use ext4::writer::WriterOption;
 use glidefs::block::block_map::{Blake3Hash, blake3_128, lz4_compress};
 
 const GRID: usize = 128 * 1024; // production BLOCK_SIZE — the dedup window size
-const ALIGN_THRESHOLD: u32 = 16 * 1024; // align files >= 16 KiB; pack smaller ones
+// align files >= threshold; pack smaller ones. Override with DEDUP_ALIGN_THRESHOLD.
+fn align_threshold() -> u32 {
+    std::env::var("DEDUP_ALIGN_THRESHOLD").ok().and_then(|s| s.parse().ok()).unwrap_or(16 * 1024)
+}
 
 // ---- shared image plumbing (mirrors bless: deterministic, content-addressed) ----
 
@@ -103,7 +106,7 @@ fn build_ext4(img: &Image, align: bool) -> File {
         writer_options.push(WriterOption::Journal(1024));
     }
     if align {
-        writer_options.push(WriterOption::AlignData { align: GRID as u32, min_size: ALIGN_THRESHOLD });
+        writer_options.push(WriterOption::AlignData { align: GRID as u32, min_size: align_threshold() });
     }
     let opts = ConvertOptions { convert_backslash: false, writer_options };
     let out = tempfile::tempfile().expect("tempfile");
diff --git a/glidefs/src/cli/bless.rs b/glidefs/src/cli/bless.rs
index 66961f42..2ba0dea5 100644
--- a/glidefs/src/cli/bless.rs
+++ b/glidefs/src/cli/bless.rs
@@ -171,10 +171,13 @@ pub async fn run_bless_oci(
         .await
         .map_err(|e| anyhow::anyhow!("failed to resolve image: {e}"))?;
 
-    // Estimate device size: sum compressed layer sizes × 3 (decompression + ext4 overhead).
-    // Round up to next power-of-2 MiB boundary. Minimum 64 MiB.
+    // Estimate device size: sum compressed layer sizes × 4 (decompression + ext4
+    // overhead + block-grid alignment headroom). Round up to next power-of-2.
+    // Minimum 64 MiB. The ×4 (vs ×3) covers the logical inflation from aligning
+    // large files to the dedup block grid; that padding is holes/zeros which the
+    // block store drops, so it costs address space, not stored bytes.
     let total_compressed: u64 = resolved.layers.iter().map(|l| l.size as u64).sum();
-    let estimated = (total_compressed * 3).max(64 * 1024 * 1024);
+    let estimated = (total_compressed * 4).max(64 * 1024 * 1024);
     let device_size = estimated.next_power_of_two();
 
     info!(
@@ -251,6 +254,12 @@ pub async fn run_bless_oci(
             WriterOption::MaximumDiskSize(device_size as i64),
             WriterOption::Uuid(uuid),
             WriterOption::Journal(1024), // 4 MiB journal
+            // Align large file payloads to the dedup block grid (the volume's
+            // 128 KiB block size) so the same file produces the same blocks
+            // across images and the host's content-addressed cache + S3 packs
+            // dedup it. Only files >= one full block are aligned, bounding the
+            // padding. See dedup_probe / fsck_validity for the validation.
+            WriterOption::AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE },
         ],
     };
 
diff --git a/glidefs/src/oci/layer_store.rs b/glidefs/src/oci/layer_store.rs
index 1b6edfa8..9e9afaf1 100644
--- a/glidefs/src/oci/layer_store.rs
+++ b/glidefs/src/oci/layer_store.rs
@@ -28,7 +28,7 @@ use object_store::{ObjectStore, PutPayload};
 use serde::{Deserialize, Serialize};
 
 use crate::block::content_store::ContentStore;
-use crate::oci::ext4_store::{deterministic_uuid, store_ext4_stream};
+use crate::oci::ext4_store::{deterministic_uuid, store_ext4_stream, BLOCK_SIZE};
 
 /// Manifest name for a stored layer (its sole VolumeManifest).
 const LAYER_MANIFEST_NAME: &str = "layer";
@@ -92,7 +92,10 @@ impl ImageDescriptor {
 /// Zero blocks past the real content are skipped at store time, so oversizing
 /// costs nothing in storage.
 fn layer_device_size(tar_len: u64) -> u64 {
-    (tar_len.saturating_mul(2).max(64 * 1024 * 1024)).next_power_of_two()
+    // ×3 (not ×2): extra headroom for block-grid alignment padding, which
+    // inflates the logical ext4. The padding is holes/zeros dropped by the
+    // block store, so it costs address space, not stored bytes.
+    (tar_len.saturating_mul(3).max(64 * 1024 * 1024)).next_power_of_two()
 }
 
 /// Ensure a single OCI layer is stored as a content-addressed ext4 artifact.
@@ -137,6 +140,9 @@ pub async fn ensure_layer_stored<R: Read + Seek>(
             WriterOption::MaximumDiskSize(device_size as i64),
             WriterOption::Uuid(deterministic_uuid(digest)),
             WriterOption::Journal(1024), // 4 MiB journal — same as bless
+            // Align large file payloads to the dedup block grid (the volume's
+            // 128 KiB block size) so the same file dedups across layers/images.
+            WriterOption::AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE },
         ],
     };
     let mut ext4_tmp = tempfile::tempfile().context("layer ext4 tempfile")?;

From 6eae506a6baa50a88faed08ecafaff4555ad2454 Mon Sep 17 00:00:00 2001
From: Jared Lunde <jared.lunde@gmail.com>
Date: Fri, 5 Jun 2026 21:21:22 -0700
Subject: [PATCH 4/6] fix(bless): align the server-side background bless path
 too

router.rs run_bless_oci_task (the in-server background bless triggered via
the API) was missing the AlignData wiring and device-size headroom that
the two CLI bless paths got. Wire it consistently so every bless path
produces grid-aligned, cross-image-deduppable base images.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 glidefs/src/block/router.rs | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/glidefs/src/block/router.rs b/glidefs/src/block/router.rs
index 259b1fb5..b04615ba 100644
--- a/glidefs/src/block/router.rs
+++ b/glidefs/src/block/router.rs
@@ -1819,9 +1819,11 @@ impl ExportRouter {
             .await
             .map_err(|e| RouterError::OciPull(format!("failed to resolve image: {e}")))?;
 
-        // Estimate device size: compressed × 3, next power-of-2, min 64 MiB.
+        // Estimate device size: compressed × 4, next power-of-2, min 64 MiB.
+        // The ×4 (vs ×3) leaves headroom for block-grid alignment padding, which
+        // inflates the logical ext4 with holes/zeros the block store drops.
         let total_compressed: u64 = resolved.layers.iter().map(|l| l.size as u64).sum();
-        let estimated = (total_compressed * 3).max(64 * 1024 * 1024);
+        let estimated = (total_compressed * 4).max(64 * 1024 * 1024);
         let device_size = estimated.next_power_of_two();
 
         info!(
@@ -1885,6 +1887,9 @@ impl ExportRouter {
                 WriterOption::MaximumDiskSize(device_size as i64),
                 WriterOption::Uuid(uuid),
                 WriterOption::Journal(1024), // 4 MiB journal
+                // Align large file payloads to the dedup block grid (the volume
+                // block size) so the same file dedups across blessed images.
+                WriterOption::AlignData { align: block_size_u32, min_size: block_size_u32 },
             ],
         };
 

From 5b26eec31b3fd0839493593bf5b55c2156a784bc Mon Sep 17 00:00:00 2001
From: Jared Lunde <jared.lunde@gmail.com>
Date: Fri, 5 Jun 2026 21:43:56 -0700
Subject: [PATCH 5/6] fix(ext4): keep close()-time contiguous structures off
 reserved blocks

The group-aware allocator skipped reserved block-group backup-superblock/GDT
blocks for FILE DATA only. The journal, inode table, and bitmaps are written
contiguously at close() via raw writes and could still land on (or straddle)
them. Since bless enables a 4 MiB journal, this was a real corruption path:
e.g. a ~124 MiB image puts the journal's extent across block 32768 (Group 1's
backup superblock), a multiply-claimed block the kernel rejects. The inode
table straddles the same way for workloads with many files.

Caught by running the fsck harness with the production journal config (it
previously ran journal-less). Fix: reserve_contiguous(n) places a contiguous
run clear of reserved regions (skipping past any it would straddle, recording
the gap as a free hole); used for the journal, inode table, and bitmaps.
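
The placement rule can be modelled outside the writer roughly as follows.
This is a hedged sketch, not the writer.rs code: it takes the cursor, the
per-group reserve (1 + gd_blocks), and the group count as plain arguments
instead of `&mut self`, returns the skipped holes instead of calling
record_free_hole, and assumes holes are half-open ranges. `is_power_of` is
a hypothetical helper; the signatures of `has_super_backup` and
`next_reserved_block_ge` are assumptions. Requires n >= 1.

```rust
const BLOCKS_PER_GROUP: u32 = 32_768;

/// True if `n` is base^k for some k >= 0 (hypothetical helper).
fn is_power_of(mut n: u32, base: u32) -> bool {
    if n == 0 {
        return false;
    }
    while n % base == 0 {
        n /= base;
    }
    n == 1
}

/// sparse_super rule: groups 0, 1 and powers of 3, 5, 7 carry backups.
fn has_super_backup(group: u32) -> bool {
    group <= 1 || is_power_of(group, 3) || is_power_of(group, 5) || is_power_of(group, 7)
}

/// First reserved block >= `block`, scanning groups below `groups`.
fn next_reserved_block_ge(block: u32, reserve: u32, groups: u32) -> Option<u32> {
    let mut g = block / BLOCKS_PER_GROUP;
    while g < groups {
        if has_super_backup(g) {
            let region_start = g * BLOCKS_PER_GROUP;
            if block < region_start + reserve {
                return Some(block.max(region_start));
            }
        }
        g += 1;
    }
    None
}

/// Returns the start of a clean run of `n` blocks plus the skipped holes.
fn reserve_contiguous(mut pos: u32, n: u32, reserve: u32, groups: u32) -> (u32, Vec<(u32, u32)>) {
    let mut holes = Vec::new();
    loop {
        match next_reserved_block_ge(pos, reserve, groups) {
            Some(r) if r < pos + n => {
                // The run would touch the reserved region: free the lead-in
                // blocks (if any) and restart just past the region.
                if r > pos {
                    holes.push((pos, r));
                }
                pos = (r / BLOCKS_PER_GROUP) * BLOCKS_PER_GROUP + reserve;
            }
            _ => return (pos, holes),
        }
    }
}

fn main() {
    // A 1024-block journal starting 768 blocks before block 32768.
    let (start, holes) = reserve_contiguous(32_000, 1024, 2, 4);
    println!("journal start = {start}, holes = {holes:?}");
}
```

In this model a run that already fits before the next reserved region is
left where it is, so images that never straddle are unaffected.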

Tests now run with Journal(1024) (matching bless) and add:
- fsck_journal_straddles_group_boundary: sweeps 120-136 MiB
- fsck_inode_table_straddles_boundary: many-file workloads at the boundary
Both verified to fail before the fix (journal: inode 8 multiply-claim; inode
table: "Group 1's inode table at 32768 conflicts"). Full suite + 64-seed fuzz
+ kernel mount green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ext4/src/writer.rs          | 46 ++++++++++++++++++++++++++----
 ext4/tests/fsck_validity.rs | 57 +++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+), 6 deletions(-)

diff --git a/ext4/src/writer.rs b/ext4/src/writer.rs
index e03c8df1..57afd929 100644
--- a/ext4/src/writer.rs
+++ b/ext4/src/writer.rs
@@ -463,6 +463,30 @@ impl<W: Read + Write + Seek> Writer<W> {
         }
     }
 
+    /// Position the cursor so the next `n` blocks form a single contiguous run
+    /// that contains no reserved block-group metadata, and return that start
+    /// block. Used for structures that must be contiguous (the journal, the
+    /// flex_bg inode table, the bitmaps). Unlike file data, they can't be
+    /// fragmented around a reserved block, so the whole run is moved past any
+    /// reserved region it would straddle. The skipped lead-in blocks become free
+    /// holes; the reserved blocks stay used. `n` is always << a block group, so
+    /// at most one interior backup region is ever in the way.
+    fn reserve_contiguous(&mut self, n: u32) -> io::Result<u32> {
+        loop {
+            self.skip_reserved_at_pos()?;
+            let start = self.block();
+            match self.next_reserved_block_ge(start) {
+                Some(r) if r < start + n => {
+                    let g = r / BLOCKS_PER_GROUP;
+                    let region_end = g * BLOCKS_PER_GROUP + self.group_reserve();
+                    self.record_free_hole(start, r);
+                    self.seek_block(region_end)?;
+                }
+                _ => return Ok(start),
+            }
+        }
+    }
+
     /// The contiguous, non-reserved physical runs covering [start, end).
     fn physical_runs(&self, start: u32, end: u32) -> Vec<(u32, u32)> {
         let mut runs = Vec::new();
@@ -1316,7 +1340,10 @@ impl<W: Read + Write + Seek> Writer<W> {
     /// journal blocks. The superblock is updated in close() to set
     /// HAS_JOURNAL, journal_inum, and the journal_blocks backup.
     fn write_journal(&mut self) -> io::Result<()> {
-        let journal_start = self.block();
+        // The journal is one contiguous extent; keep it clear of reserved
+        // block-group metadata (a straddle would make inode 8 multiply-claim the
+        // backup superblock).
+        let journal_start = self.reserve_contiguous(self.journal_blocks)?;
 
         // Write JBD2 v2 superblock (first block of journal)
         // All multi-byte fields are big-endian per JBD2 spec.
@@ -1510,14 +1537,21 @@ impl<W: Read + Write + Seek> Writer<W> {
             self.write_journal()?;
         }
 
-        // Write the inode table
-        let inode_table_offset = self.block();
-        let (groups, inodes_per_group) = best_group_count(inode_table_offset, self.inodes.len() as u32);
+        // Write the inode table. It is contiguous and located via per-group
+        // descriptors (inode_table_low + g * size_per_group), so it must avoid
+        // reserved block-group metadata. Reserve a clean run sized for the group
+        // count; padding the start can bump the count by one group, so reserve a
+        // one-group margin and recompute against the final offset.
+        let n_inodes = self.inodes.len() as u32;
+        let (g0, ipg0) = best_group_count(self.block(), n_inodes);
+        let itspg0 = ipg0 * INODE_SIZE as u32 / BLOCK_SIZE as u32;
+        let inode_table_offset = self.reserve_contiguous((g0 + 1) * itspg0 + 2)?;
+        let (groups, inodes_per_group) = best_group_count(inode_table_offset, n_inodes);
         self.write_inode_table(groups * inodes_per_group * INODE_SIZE as u32)?;
 
-        // Write bitmaps
-        let bitmap_offset = self.block();
+        // Write bitmaps (also contiguous and GD-located).
         let bitmap_size = groups * 2;
+        let bitmap_offset = self.reserve_contiguous(bitmap_size)?;
         let valid_data_size = bitmap_offset + bitmap_size;
         let mut disk_size = valid_data_size;
         let min_size = (groups - 1) * BLOCKS_PER_GROUP + 1;
diff --git a/ext4/tests/fsck_validity.rs b/ext4/tests/fsck_validity.rs
index 626ed173..b5d606ac 100644
--- a/ext4/tests/fsck_validity.rs
+++ b/ext4/tests/fsck_validity.rs
@@ -67,6 +67,11 @@ fn build_image(files: &[(String, usize)], align: Option<(u32, u32)>) -> Vec<u8>
     let mut writer_options = vec![
         WriterOption::MaximumDiskSize(4 * 1024 * 1024 * 1024),
         WriterOption::Uuid([0x11; 16]),
+        // Match the production bless config: an internal 4 MiB journal. The
+        // journal (and inode table / dir blocks) are written at close() and can
+        // land near a block-group backup-superblock boundary, so exercising it
+        // is part of validating the reserved-block handling.
+        WriterOption::Journal(1024),
     ];
     if let Some((a, m)) = align {
         writer_options.push(WriterOption::AlignData { align: a, min_size: m });
@@ -255,6 +260,58 @@ fn fsck_multi_group_aligned_clean() {
     }
 }
 
+/// Targeted sweep of the block-group boundary (block 32768 == 128 MiB). With a
+/// journal enabled (production config), the journal — and the inode table / dir
+/// blocks — are written at close() and can land straddling the Group 1 backup
+/// superblock. Sweep data sizes that push those close()-time structures across
+/// the boundary; every one must be e2fsck-clean. Regression for reserved-block
+/// handling of NON-file-data writes.
+#[test]
+fn fsck_journal_straddles_group_boundary() {
+    if find_e2fsck().is_none() {
+        eprintln!("SKIP: e2fsck not installed");
+        return;
+    }
+    // 120..136 MiB in 1 MiB steps: data ends near block 32768, so the trailing
+    // journal (1024 blocks = 4 MiB) and inode table cross the boundary.
+    for mib in 120..=136 {
+        let files = vec![(format!("data/blob_{mib}.bin"), mib * 1024 * 1024)];
+        for align in [None, Some((128 * 1024u32, 128 * 1024u32))] {
+            let img = build_image(&files, align);
+            if let Err(e) = e2fsck_clean(&img) {
+                panic!("{mib} MiB align={align:?} NOT e2fsck-clean:\n{e}");
+            }
+            if let Err(e) = content_matches(&img, &files) {
+                panic!("{mib} MiB align={align:?} content error: {e}");
+            }
+        }
+    }
+}
+
+/// Many files whose data ends near block 32768 make the flex_bg inode table
+/// large enough to straddle the Group 1 backup superblock. The inode table is
+/// contiguous and pointed at by per-group descriptors, so it must also dodge
+/// reserved blocks. Regression for inode-table/bitmap reserved-block handling.
+#[test]
+fn fsck_inode_table_straddles_boundary() {
+    if find_e2fsck().is_none() {
+        eprintln!("SKIP: e2fsck not installed");
+        return;
+    }
+    // N one-block files put data just below block 32768 while the inode table
+    // (N/16 blocks) crosses it. Sweep a few counts so one reliably straddles.
+    for n in [30_500usize, 31_000, 31_500] {
+        let files: Vec<(String, usize)> =
+            (0..n).map(|i| (format!("d{}/f{i}.bin", i % 256), 4096)).collect();
+        for align in [None, Some((128 * 1024u32, 128 * 1024u32))] {
+            let img = build_image(&files, align);
+            if let Err(e) = e2fsck_clean(&img) {
+                panic!("inode-table straddle n={n} align={align:?} NOT e2fsck-clean:\n{e}");
+            }
+        }
+    }
+}
+
 /// Property fuzzer: random multi-group filesets must be e2fsck-clean AND read
 /// back byte-exact — both unaligned and aligned. Seeds are deterministic so any
 /// failure reproduces verbatim (the panic prints the seed and fileset). This is

From 78b4eb1454af23fc3d196b6140a23cf8e713c17f Mon Sep 17 00:00:00 2001
From: Jared Lunde <jared.lunde@gmail.com>
Date: Fri, 5 Jun 2026 21:45:59 -0700
Subject: [PATCH 6/6] docs(ext4): update ARCHITECTURE.md for group-aware
 allocator + alignment

Bring the doc in line with actual behavior:
- Extent building: file data fragments around reserved blocks
  (write_file_data / physical_runs / write_extents), not always contiguous.
- New Block-Group Metadata Reservation section: sparse_super backup
  superblocks at block 32768 etc., data + close()-time structures skip them
  (reserve_contiguous), free-hole bitmap accounting; the multi-group
  corruption bug and why the in-crate reader hid it.
- On-disk layout: correct flex_bg layout (metadata clustered at the end),
  reserved holes, journal placement.
- WriterOption table: add Uuid, Journal (s_journal_uuid must be zero),
  AlignData.
- Fix stale "why no journal" (now optional, bless enables it) and the
  zero-UUID determinism/checksum claims.
- Testing: document the fsck_validity e2fsck/mount/fuzz harness.
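
For readers of the new Extent Building section, the fragmentation rule can
be sketched as below. This is an illustrative model, not the writer.rs
code: it passes the reserved regions in as a sorted list of half-open
ranges, treats runs as (start, end) pairs, and `leaves` is a hypothetical
name for the step that splits runs into extents.

```rust
const MAX_BLOCKS_PER_EXTENT: u32 = 0x8000;

/// Non-reserved runs covering [start, end). `reserved` must be sorted,
/// non-overlapping, half-open (start, end) block ranges.
fn physical_runs(start: u32, end: u32, reserved: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut runs = Vec::new();
    let mut pos = start;
    for &(r_start, r_end) in reserved {
        if r_end <= pos {
            continue; // region lies entirely before the cursor
        }
        if r_start >= end {
            break; // region lies entirely after the file
        }
        if r_start > pos {
            runs.push((pos, r_start));
        }
        pos = r_end; // jump over the reserved blocks
    }
    if pos < end {
        runs.push((pos, end));
    }
    runs
}

/// Split runs into (logical, physical, len) extents. The logical offset
/// advances over data blocks only, never over the reserved gaps.
fn leaves(runs: &[(u32, u32)]) -> Vec<(u32, u32, u32)> {
    let mut out = Vec::new();
    let mut logical = 0;
    for &(start, end) in runs {
        let mut phys = start;
        while phys < end {
            let len = (end - phys).min(MAX_BLOCKS_PER_EXTENT);
            out.push((logical, phys, len));
            logical += len;
            phys += len;
        }
    }
    out
}

fn main() {
    // Group 0 and group 1 backups, assuming a 2-block reserve per group.
    let reserved = [(0, 2), (32_768, 32_770)];
    let runs = physical_runs(32_000, 33_000, &reserved);
    println!("runs   = {runs:?}");
    println!("leaves = {:?}", leaves(&runs));
}
```

A file that crosses no reserved range yields a single run, which matches
the doc's claim that unfragmented files are laid out as before.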

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ext4/ARCHITECTURE.md | 122 +++++++++++++++++++++++++++++++------------
 1 file changed, 89 insertions(+), 33 deletions(-)

diff --git a/ext4/ARCHITECTURE.md b/ext4/ARCHITECTURE.md
index 9796fede..1f463929 100644
--- a/ext4/ARCHITECTURE.md
+++ b/ext4/ARCHITECTURE.md
@@ -143,31 +143,33 @@ Comparison ignores atime, ctime (volatile), inode_number (internal), and links_c
 
 ## On-Disk Layout
 
+The writer uses `flex_bg`, so per-group metadata is **not** interleaved per
+group — all inode tables and bitmaps are clustered at the end of the image,
+after the data region:
+
 ```
 Byte 0                          Block 0 (4096 bytes)
 ├─ [0..1024)    zeros           (boot sector area)
-├─ [1024..2048) SuperBlock      (1024 bytes)
+├─ [1024..2048) SuperBlock      (primary, 1024 bytes)
 └─ [2048..4096) zeros
 
-Block 1                         Group Descriptor Table
-├─ 128 × GroupDescriptor        (32 bytes each = 4096 bytes)
-└─ (repeated if >128 groups)
-
-Block gd_end .. gd_end+N        Inode Table (per group)
-├─ 16 inodes per block          (256 bytes each)
-└─ N blocks = ceil(inodes_per_group / 16)
+Block 1 .. 1+gd_blocks          Group Descriptor Table (primary)
+└─ GroupDescriptor × groups     (32 bytes each)
 
-Block inode_end .. data_start   Block Bitmap + Inode Bitmap
-├─ block_bitmap: 1 block        (1 bit per block in group)
-└─ inode_bitmap: 1 block        (1 bit per inode in group)
+Data region (streamed forward, may contain reserved holes):
+├─ lost+found, file data, directory blocks, xattr blocks, extent index blocks
+├─ Journal (optional)           contiguous run, placed via reserve_contiguous
+└─ ⟂ reserved holes at sparse_super group starts (block 32768, 98304, …):
+     backup superblock + GDT copy — never claimed by data
 
-Block data_start .. end         Data Blocks
-├─ Directory blocks             (packed dir entries)
-├─ File data blocks             (streamed content)
-├─ xattr blocks                 (for large xattr sets)
-└─ Extent index blocks          (for very large files)
+Trailing metadata (flex_bg, all groups clustered, reserve_contiguous-placed):
+├─ Inode Table                  groups × inodes_per_group × 256 bytes
+└─ Block + Inode Bitmaps        2 blocks per group
 ```
 
+The superblock and primary GDT are written last (seek back to block 0/1) at
+`close()`, after the layout is known.
+
 ## Inode Number Allocation
 
 ```
@@ -197,22 +199,58 @@ The 60-byte `inode.data` area holds:
 
 Each extent covers at most `MAX_BLOCKS_PER_EXTENT = 0x8000` (32,768) blocks = 128 MiB. Adjacent same-physical-run blocks are merged into one extent.
 
-### Extent Building (writer.rs:write_extent)
+### Extent Building (writer.rs:write_file_data, physical_runs, write_extents)
+
+File data is streamed forward, but it is **not** always one contiguous run: the
+allocator skips blocks reserved for block-group metadata (see below), so a file
+spanning such a block is split into multiple extents.
 
 ```
-on each data write:
-  extend current_extent if blocks are contiguous
-  else:
-    flush current_extent to inode.data (if fits in 4 entries)
-    or to pending extent_index_block (depth 2)
-    start new extent
-
-on finish_inode:
-  flush last extent
-  if depth==2: write extent_index_block to disk
-  seek to inode slot, write inode
+write(&[u8]) → write_file_data:
+  on first byte: skip any reserved block at pos, record data_start_block
+  stream data, jumping over reserved regions (write up to the next reserved
+    block, seek past it, continue) — pos advances over the skipped blocks
+
+finish_inode → write_extents:
+  runs   = physical_runs(data_start_block, end_block)   // non-reserved spans
+  leaves = split each run into ≤ MAX_BLOCKS_PER_EXTENT extents (logical offset
+           accumulates over data blocks only, excluding reserved gaps)
+  emit:
+    ≤4 leaves            → inline in inode.data (depth 0)
+    ≤4×EXTENTS_PER_BLOCK → one index level (depth 1); leaf blocks avoid reserved blocks
+    else                 → error (file too large)
 ```
 
+A file that crosses no reserved block yields exactly one run — identical output
+to a plain contiguous writer. `block_count` counts data + extent-tree blocks
+only, never the reserved gaps.
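+
+The `finish_inode` steps above can be sketched in Rust as follows. This is a
+minimal illustration, not the code in `writer.rs`: the `Run` and `Leaf` types,
+the `split_into_leaves` helper, the signatures, and the `is_reserved` closure
+are all assumptions made for the example.
+
+```rust
+/// A physically contiguous span of data blocks (hypothetical type).
+#[derive(Debug, PartialEq, Clone, Copy)]
+struct Run { start: u64, len: u64 }
+
+/// One extent leaf: `len` blocks at logical offset `logical` (hypothetical type).
+#[derive(Debug, PartialEq, Clone, Copy)]
+struct Leaf { logical: u64, physical: u64, len: u64 }
+
+/// Non-reserved spans inside the half-open block range [start, end).
+fn physical_runs(start: u64, end: u64, is_reserved: impl Fn(u64) -> bool) -> Vec<Run> {
+    let mut runs = Vec::new();
+    let mut run_start: Option<u64> = None;
+    for b in start..end {
+        match (is_reserved(b), run_start) {
+            // A reserved block closes the run that was open.
+            (true, Some(s)) => { runs.push(Run { start: s, len: b - s }); run_start = None; }
+            // The first data block after a gap opens a new run.
+            (false, None) => run_start = Some(b),
+            _ => {}
+        }
+    }
+    if let Some(s) = run_start { runs.push(Run { start: s, len: end - s }); }
+    runs
+}
+
+/// Split runs into leaves of at most `max_len` blocks (`max_len` must be > 0).
+/// The logical offset advances over data blocks only, never over reserved gaps.
+fn split_into_leaves(runs: &[Run], max_len: u64) -> Vec<Leaf> {
+    let mut leaves = Vec::new();
+    let mut logical = 0;
+    for run in runs {
+        let (mut physical, mut left) = (run.start, run.len);
+        while left > 0 {
+            let len = left.min(max_len);
+            leaves.push(Leaf { logical, physical, len });
+            logical += len;
+            physical += len;
+            left -= len;
+        }
+    }
+    leaves
+}
+```
+
+With blocks 32768 and 32769 reserved, a file occupying blocks 32760..32780
+yields two runs (8 and 10 blocks), and the second run starts at logical block 8,
+not 10. The per-block scan keeps the sketch short; a real allocator would more
+likely jump from one group boundary to the next.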
+
+### Block-Group Metadata Reservation (writer.rs:is_reserved_block, has_super_backup)
+
+With the `sparse_super` feature, block groups 0, 1, and every power of 3, 5, and
+7 hold a **backup superblock + group-descriptor copy** in their first
+`1 + gd_blocks` blocks (e.g. block 32768 for group 1, 98304 for group 3). The
+kernel reserves these regardless of whether valid backup content is written, so
+**file or metadata data must never claim them** — an overlapping extent is a
+multiply-claimed block that `e2fsck` and the kernel reject (the file reads back
+as "Structure needs cleaning"). Group 0's reservation is skipped at `init()`.
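+
+The rule above can be written down as a small predicate. The sketch below
+borrows the names `has_super_backup` and `is_reserved_block` from the heading,
+but the signatures, the `is_power_of` helper, and the fixed `BLOCKS_PER_GROUP`
+constant (4 KiB blocks) are assumptions for illustration, not the writer's
+actual code.
+
+```rust
+/// Blocks per group with 4 KiB blocks (assumed; puts group 1 at block 32768).
+const BLOCKS_PER_GROUP: u64 = 32_768;
+
+/// True if `n` is a power of `base`, counting base^0 = 1 (hypothetical helper).
+fn is_power_of(mut n: u64, base: u64) -> bool {
+    if n == 0 {
+        return false;
+    }
+    while n % base == 0 {
+        n /= base;
+    }
+    n == 1
+}
+
+/// sparse_super: groups 0, 1, and every power of 3, 5, and 7 carry a backup.
+fn has_super_backup(group: u64) -> bool {
+    group <= 1 || is_power_of(group, 3) || is_power_of(group, 5) || is_power_of(group, 7)
+}
+
+/// True if `block` is one of the first `1 + gd_blocks` blocks of a backup group.
+fn is_reserved_block(block: u64, gd_blocks: u64) -> bool {
+    let group = block / BLOCKS_PER_GROUP;
+    let offset = block % BLOCKS_PER_GROUP;
+    has_super_backup(group) && offset < 1 + gd_blocks
+}
+```
+
+With `gd_blocks = 1`, blocks 32768 and 32769 (group 1) and 98304 and 98305
+(group 3) are reserved, while group 2 (block 65536) holds no backup.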
+
+The allocator keeps everything off these blocks:
+
+- **File data** fragments around them (`write_file_data` / `physical_runs`).
+- **Contiguous close()-time structures** — the journal inode, the flex_bg inode
+  table, and the bitmaps — can't fragment (they're single extents or located by
+  group-descriptor offsets), so `reserve_contiguous(n)` instead places the whole
+  run *past* any reserved region it would straddle. The skipped lead-in blocks
+  become free holes.
+- **Padding holes** (from alignment and from `reserve_contiguous`) are recorded
+  in `free_holes` and cleared from the otherwise-dense block bitmap in `close()`.
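+
+The placement rule for contiguous structures can be sketched as below. The
+`Allocator` struct, its fields, and the `is_reserved` closure parameter are
+hypothetical; only the names `reserve_contiguous` and `free_holes` come from the
+text above. The sketch also assumes that skipped blocks which are themselves
+reserved are not recorded as free holes.
+
+```rust
+/// Hypothetical allocator state for the sketch.
+struct Allocator {
+    pos: u64,                    // next unallocated block
+    free_holes: Vec<(u64, u64)>, // (start, len) runs to clear in the block bitmap
+}
+
+impl Allocator {
+    /// Place `n` contiguous blocks at or after `pos`, clear of reserved blocks.
+    /// Returns the first block of the run and advances `pos` past it.
+    fn reserve_contiguous(&mut self, n: u64, is_reserved: impl Fn(u64) -> bool) -> u64 {
+        let mut start = self.pos;
+        // If the candidate run touches a reserved block, restart just past the
+        // last one it touches, and repeat until the run is clean.
+        while let Some(hit) = (start..start + n).rev().find(|&b| is_reserved(b)) {
+            start = hit + 1;
+        }
+        // Skipped lead-in blocks that are not reserved become free holes.
+        let mut b = self.pos;
+        while b < start {
+            if is_reserved(b) {
+                b += 1;
+                continue;
+            }
+            let hole_start = b;
+            while b < start && !is_reserved(b) {
+                b += 1;
+            }
+            self.free_holes.push((hole_start, b - hole_start));
+        }
+        self.pos = start + n;
+        start
+    }
+}
+```
+
+Starting at `pos = 32760` with blocks 32768 and 32769 reserved, a 16-block
+request lands at 32770 and records the hole `(32760, 8)`.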
+
+This was a real, latent corruption bug for any image larger than one block group
+(>128 MiB): the linear allocator wrote straight through block 32768. It was
+hidden because the in-crate reader is lenient; the real `e2fsck` and a kernel
+loop-mount catch it (see Testing).
+
 ## xattr Storage Strategy
 
 Extended attributes use a two-tier storage model:
@@ -290,11 +328,12 @@ Directory entries reference inode numbers. Hard links and `link()` calls can ass
 
 GlideFS content-addresses blocks with BLAKE3. If two nodes generate the same OCI layer, they must produce byte-identical ext4 images or they'll compute different hashes and store duplicate data. Determinism requires:
 - No uninitialized bytes (zero all padding)
-- No random UUIDs (UUID is all-zeros)
+- A deterministic UUID — `WriterOption::Uuid` set to a content-derived value (e.g. the manifest digest), or all-zeros if unset. Never random.
 - No timestamps (`mtime=0`, `wtime=0` in superblock)
 - Sorted directory entries (by inode number, then name)
 - Sorted xattr entries
 - `BTreeMap` for all child/xattr collections
+- A deterministic layout: file→block placement (including reserved-block skips and any alignment padding) is a pure function of the input, so the same tar always lands the same bytes in the same blocks.
 
 ### Why port from hcsshim instead of using an existing crate?
 
@@ -306,13 +345,13 @@ hcsshim's `compactext4` is the reference implementation for OCI-compatible ext4
 
 The port preserves the same on-disk layout, making images identical to those produced by the Go implementation.
 
-### Why no journal?
+### Why is the journal optional?
 
-Container layer images are read-only once mounted by the overlay filesystem. A journal adds ~128 MiB of overhead for no benefit. The `HAS_JOURNAL` compat feature is intentionally absent.
+The journal is off by default in the writer: a container layer mounted read-only through overlay never needs one. A blessed base image that backs a *mutable* volume does need one, so bless enables `WriterOption::Journal(1024)` (1024 × 4 KiB = 4 MiB). When enabled, the journal is inode 8 and the `HAS_JOURNAL` feature is set; `s_journal_uuid` stays zero because it identifies an *external* journal device, and a non-zero value makes the kernel and e2fsck search for that nonexistent device and abort. When disabled, `HAS_JOURNAL` is absent.
 
 ### Why no checksums?
 
-`METADATA_CSUM` and `GDT_CSUM` are not enabled. Checksums require the UUID as a seed, but a zero UUID makes all checksums trivially zero — enabling the feature would silently produce invalid checksums. Since images are content-addressed externally, internal ext4 checksums are redundant.
+`METADATA_CSUM` and `GDT_CSUM` are not enabled. Metadata checksums are seeded by the UUID and would have to be computed for every metadata structure the writer emits. Images are already content-addressed externally (BLAKE3 over the bytes) and validated against the real `e2fsck`/kernel in tests, so internal ext4 checksums are redundant.
 
 ## Package Structure
 
@@ -320,7 +359,7 @@ Container layer images are read-only once mounted by the overlay filesystem. A j
 |------|---------|
 | `mod.rs` | Re-exports public API: `Writer`, `Reader`, `File`, `WriterOption`, `convert_tar_to_ext4` |
 | `format.rs` | On-disk binary structures: `SuperBlock`, `GroupDescriptor`, `ParsedInode`, `ExtentHeader/Leaf/Index`, `DirEntry`, xattr helpers. Both serialization (`write_to`) and deserialization (`read_from`, `get_xattrs`) for shared on-disk types. |
-| `writer.rs` | Core filesystem builder. Manages inode lifecycle, block allocation, extent tree construction, xattr packing, directory serialization, superblock finalization. |
+| `writer.rs` | Core filesystem builder. Manages inode lifecycle, reserved-block-aware allocation (data fragments around backup-superblock blocks; contiguous structures use `reserve_contiguous`), extent tree construction, optional alignment + free-hole accounting, xattr packing, directory serialization, journal, superblock finalization. |
 | `reader.rs` | ext4 image parser. Reads superblock, group descriptors, inode table, extent trees, directory entries, and xattrs. Exports via `walk()` and `to_tar()`. |
 | `tar_convert.rs` | tar→ext4 bridge. Maps tar entry types to writer operations, handles OCI whiteouts and PAX xattrs. |
 | `diff.rs` | Incremental export: diffs two ext4 snapshots and produces an OCI-compatible delta tar layer with whiteout markers for deletions. |
@@ -332,6 +371,9 @@ Container layer images are read-only once mounted by the overlay filesystem. A j
 |--------|---------|--------|
 | `WriterOption::InlineData` | disabled | Store files ≤136 bytes inside the inode instead of allocating data blocks. Reduces image size for layers with many small files (e.g., config files, scripts). |
 | `WriterOption::MaximumDiskSize(n)` | 16 GiB | Maximum filesystem size. Controls the number of block groups pre-allocated in the group descriptor table. Range: 0..16 TiB. |
+| `WriterOption::Uuid([u8;16])` | all-zeros | Filesystem UUID, written to the superblock and used as the directory-hash seed. Callers that content-address the image pass a deterministic (e.g. manifest-derived) UUID so the same input yields the same bytes. |
+| `WriterOption::Journal(blocks)` | none | Create an internal jbd2 journal of `blocks` 4 KiB blocks (e.g. 1024 = 4 MiB) as inode 8 and set the `HAS_JOURNAL` feature. `s_journal_uuid` is left **zero**: it names an *external* journal device, and a non-zero value makes the kernel and e2fsck search for that nonexistent device and abort. bless enables this option. |
+| `WriterOption::AlignData { align, min_size }` | disabled | Start the data of every regular file ≥ `min_size` on an `align`-byte boundary, padding the gap with a (free) hole. Aligning large payloads to the downstream dedup block grid is meant to make the same file produce the same blocks regardless of upstream churn, so content-addressed dedup survives. Opt-in and marked KNOWN-LIMITATION: the block-alignment dedup work is still in progress. Aligned starts still skip reserved blocks. |
 
 ## Limits
 
@@ -399,6 +441,20 @@ Three test tiers in `tests.rs`:
 
 Run without Docker: `cargo test --features test-utils --lib` and `cargo test --features test-utils --test integration`
 
+**Filesystem-validity harness** (`tests/fsck_validity.rs`) — gates correctness on kernel-grade oracles, not the in-crate reader (which is lenient and once hid a multi-group corruption bug). Skips cleanly where `e2fsck` is absent.
+
+| Test | What it covers |
+|------|---------------|
+| `fsck_single_group_clean` / `fsck_multi_group_clean` | `e2fsck -fn` clean for single- and multi-group images |
+| `fsck_multi_group_aligned_clean` | aligned build is e2fsck-clean (padding marked free, aligned starts dodge reserved blocks) |
+| `content_survives_fragmentation` | files split around reserved blocks read back byte-exact (right logical order) |
+| `fsck_journal_straddles_group_boundary` | journal must not straddle a backup superblock (sweeps 120–136 MiB) |
+| `fsck_inode_table_straddles_boundary` | inode table must not straddle a backup superblock (many-file workloads) |
+| `fuzz_multigroup_validity_and_content` | random multi-group filesets: e2fsck-clean + byte-exact, both align modes (`EXT4_FUZZ_SEEDS` to scale) |
+| `kernel_mount_content` | opt-in (`EXT4_MOUNT_TEST=1`): real loop-mount, every file byte-exact vs known input |
+
+All tests build with `Journal(1024)` to match the production bless config.
+
 ## Failure Modes
 
 | Failure | Behavior |