diff --git a/ext4/ARCHITECTURE.md b/ext4/ARCHITECTURE.md
index 9796fed..1f46392 100644
--- a/ext4/ARCHITECTURE.md
+++ b/ext4/ARCHITECTURE.md
@@ -143,31 +143,33 @@ Comparison ignores atime, ctime (volatile), inode_number (internal), and links_c
 
 ## On-Disk Layout
 
+The writer uses `flex_bg`, so block-group metadata is **not** interleaved with
+each group's data. Instead, all inode tables and bitmaps are clustered at the
+end of the image, after the data region:
+
 ```
 Byte 0                          Block 0 (4096 bytes)
 ├─ [0..1024)    zeros           (boot sector area)
-├─ [1024..2048) SuperBlock      (1024 bytes)
+├─ [1024..2048) SuperBlock      (primary, 1024 bytes)
 └─ [2048..4096) zeros
 
-Block 1                         Group Descriptor Table
-├─ 128 × GroupDescriptor        (32 bytes each = 4096 bytes)
-└─ (repeated if >128 groups)
-
-Block gd_end .. gd_end+N        Inode Table (per group)
-├─ 16 inodes per block          (256 bytes each)
-└─ N blocks = ceil(inodes_per_group / 16)
+Block 1 .. 1+gd_blocks          Group Descriptor Table (primary)
+└─ GroupDescriptor × groups     (32 bytes each)
 
-Block inode_end .. data_start   Block Bitmap + Inode Bitmap
-├─ block_bitmap: 1 block        (1 bit per block in group)
-└─ inode_bitmap: 1 block        (1 bit per inode in group)
+Data region (streamed forward, may contain reserved holes):
+├─ lost+found, file data, directory blocks, xattr blocks, extent index blocks
+├─ Journal (optional)           contiguous run, placed via reserve_contiguous
+└─ ⟂ reserved holes at sparse_super group starts (block 32768, 98304, …):
+     backup superblock + GDT copy — never claimed by data
 
-Block data_start .. end         Data Blocks
-├─ Directory blocks             (packed dir entries)
-├─ File data blocks             (streamed content)
-├─ xattr blocks                 (for large xattr sets)
-└─ Extent index blocks          (for very large files)
+Trailing metadata (flex_bg, all groups clustered, reserve_contiguous-placed):
+├─ Inode Table                  groups × inodes_per_group × 256 bytes
+└─ Block + Inode Bitmaps        2 blocks per group
 ```
 
+The superblock and primary GDT are written last (seek back to block 0/1) at
+`close()`, after the layout is known.
+
 ## Inode Number Allocation
 
 ```
@@ -197,22 +199,58 @@ The 60-byte `inode.data` area holds:
 
 Each extent covers at most `MAX_BLOCKS_PER_EXTENT = 0x8000` (32,768) blocks = 128 MiB. Adjacent same-physical-run blocks are merged into one extent.
 
-### Extent Building (writer.rs:write_extent)
+### Extent Building (writer.rs:write_file_data, physical_runs, write_extents)
+
+File data is streamed forward, but it is **not** always one contiguous run: the
+allocator skips blocks reserved for block-group metadata (see below), so a file
+spanning such a block is split into multiple extents.
 
 ```
-on each data write:
-  extend current_extent if blocks are contiguous
-  else:
-    flush current_extent to inode.data (if fits in 4 entries)
-    or to pending extent_index_block (depth 2)
-    start new extent
-
-on finish_inode:
-  flush last extent
-  if depth==2: write extent_index_block to disk
-  seek to inode slot, write inode
+write(&[u8]) → write_file_data:
+  on first byte: skip any reserved block at pos, record data_start_block
+  stream data, jumping over reserved regions (write up to the next reserved
+    block, seek past it, continue) — pos advances over the skipped blocks
+
+finish_inode → write_extents:
+  runs   = physical_runs(data_start_block, end_block)   // non-reserved spans
+  leaves = split each run into ≤ MAX_BLOCKS_PER_EXTENT extents (logical offset
+           accumulates over data blocks only, excluding reserved gaps)
+  emit:
+    ≤4 leaves            → inline in inode.data (depth 0)
+    ≤4×EXTENTS_PER_BLOCK → one index level (depth 1), leaf blocks skip reserved
+    else                 → error (file too large)
 ```
 
+A file that crosses no reserved block yields exactly one run — identical output
+to a plain contiguous writer. `block_count` counts data + extent-tree blocks
+only, never the reserved gaps.
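
The run-to-leaf split described above can be sketched in isolation. This is a minimal illustration under assumptions, not the writer's code: `split_runs_into_leaves` is a hypothetical helper, and the physical runs are passed in as `(phys_start, len)` pairs rather than computed from the writer's state.

```rust
/// Largest number of blocks a single extent may cover (0x8000, as stated above).
const MAX_BLOCKS_PER_EXTENT: u32 = 0x8000;

/// Hypothetical helper: turn physical runs `(phys_start, len)` into extent
/// leaves `(logical, phys, len)`, each at most MAX_BLOCKS_PER_EXTENT long.
/// The logical offset advances only over data blocks, so the reserved gaps
/// between runs never show up in it.
fn split_runs_into_leaves(runs: &[(u32, u32)]) -> Vec<(u32, u32, u32)> {
    let mut leaves = Vec::new();
    let mut logical = 0u32;
    for &(phys, len) in runs {
        let mut done = 0u32;
        while done < len {
            // Cap each leaf at the per-extent maximum.
            let l = (len - done).min(MAX_BLOCKS_PER_EXTENT);
            leaves.push((logical, phys + done, l));
            logical += l;
            done += l;
        }
    }
    leaves
}
```

For instance, a 200-block file starting at block 32700, with blocks 32768 and 32769 reserved, has the runs `(32700, 68)` and `(32770, 132)`. The sketch turns these into the leaves `(0, 32700, 68)` and `(68, 32770, 132)`, which fit inline at depth 0.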
+
+### Block-Group Metadata Reservation (writer.rs:is_reserved_block, has_super_backup)
+
+With the `sparse_super` feature, block groups 0, 1, and every power of 3, 5, and
+7 hold a **backup superblock + group-descriptor copy** in their first
+`1 + gd_blocks` blocks (e.g. block 32768 for group 1, 98304 for group 3). The
+kernel reserves these regardless of whether valid backup content is written, so
+**neither file data nor other metadata may ever claim them**: an overlapping
+extent is a multiply-claimed block that `e2fsck` and the kernel reject (the file
+reads back as "Structure needs cleaning"). Group 0's reservation is skipped at
+`init()`.
+
+The allocator keeps everything off these blocks:
+
+- **File data** fragments around them (`write_file_data` / `physical_runs`).
+- **Contiguous close()-time structures** — the journal inode, the flex_bg inode
+  table, and the bitmaps — can't fragment (they're single extents or located by
+  group-descriptor offsets), so `reserve_contiguous(n)` instead places the whole
+  run *past* any reserved region it would straddle. The skipped lead-in blocks
+  become free holes.
+- **Padding holes** (from alignment and from `reserve_contiguous`) are recorded
+  in `free_holes` and cleared from the otherwise-dense block bitmap in `close()`.
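
The contiguous-placement rule in the list above can be modelled as a pure function. This is a sketch under assumptions, not the writer's `reserve_contiguous`: `place_contiguous` is a hypothetical name, and the reserved regions are supplied as an ascending list of half-open `[start, end)` block ranges instead of being derived from the group layout.

```rust
/// Hypothetical pure model of contiguous placement.
/// `reserved` holds half-open reserved regions `[start, end)` in ascending
/// order. Returns the chosen start block for an `n`-block run plus the free
/// holes `(start, len)` that were skipped to get there.
fn place_contiguous(
    mut start: u32,
    n: u32,
    reserved: &[(u32, u32)],
) -> (u32, Vec<(u32, u32)>) {
    let mut holes = Vec::new();
    for &(r_start, r_end) in reserved {
        if r_end <= start {
            continue; // region lies entirely behind the cursor
        }
        if r_start >= start + n {
            break; // the whole run fits before this region
        }
        // The run would touch this region: the skipped data blocks become a
        // free hole, and the cursor jumps to the first block after the region.
        if r_start > start {
            holes.push((start, r_start - start));
        }
        start = r_end;
    }
    (start, holes)
}
```

With reserved regions `[32768, 32770)` and `[98304, 98306)`, a 1024-block journal requested at block 32000 is placed at block 32770, leaving the hole `(32000, 768)` to be cleared from the block bitmap.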
+
+Before this reservation logic existed, every image larger than one block group
+(>128 MiB) was silently corrupt: the linear allocator wrote straight through
+block 32768. The bug stayed hidden because the in-crate reader is lenient; the
+real `e2fsck` and a kernel loop-mount catch it (see Testing).
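
The `sparse_super` rule above can be sketched as a standalone predicate. The writer's own `has_super_backup` is not shown in this document, so the code below is an assumption-labelled illustration: `group_has_backup`, `is_power_of`, and `is_interior_reserved` are hypothetical names, and `gd_blocks` is passed in explicitly.

```rust
/// Blocks per group at a 4 KiB block size (one bitmap block = 8 * 4096 bits).
const BLOCKS_PER_GROUP: u32 = 32_768;

/// True if `g` is a power of `base` (base^0 = 1 counts).
fn is_power_of(mut g: u32, base: u32) -> bool {
    if g == 0 {
        return false;
    }
    while g % base == 0 {
        g /= base;
    }
    g == 1
}

/// sparse_super rule: groups 0, 1, and powers of 3, 5, and 7 carry a backup.
fn group_has_backup(g: u32) -> bool {
    g <= 1 || is_power_of(g, 3) || is_power_of(g, 5) || is_power_of(g, 7)
}

/// Is block `b` inside an interior backup region (backup superblock + GDT copy)?
/// Group 0 is excluded because its primary metadata is handled separately.
fn is_interior_reserved(b: u32, gd_blocks: u32) -> bool {
    let g = b / BLOCKS_PER_GROUP;
    g != 0 && b % BLOCKS_PER_GROUP < 1 + gd_blocks && group_has_backup(g)
}
```

Under this rule the backup groups below 30 are 0, 1, 3, 5, 7, 9, 25, and 27. With `gd_blocks = 1`, blocks 32768 and 32769 are reserved, while block 65536 (the start of group 2) is not.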
+
 ## xattr Storage Strategy
 
 Extended attributes use a two-tier storage model:
@@ -290,11 +328,12 @@ Directory entries reference inode numbers. Hard links and `link()` calls can ass
 
 GlideFS content-addresses blocks with BLAKE3. If two nodes generate the same OCI layer, they must produce byte-identical ext4 images or they'll compute different hashes and store duplicate data. Determinism requires:
 - No uninitialized bytes (zero all padding)
-- No random UUIDs (UUID is all-zeros)
+- A deterministic UUID — `WriterOption::Uuid` set to a content-derived value (e.g. the manifest digest), or all-zeros if unset. Never random.
 - No timestamps (`mtime=0`, `wtime=0` in superblock)
 - Sorted directory entries (by inode number, then name)
 - Sorted xattr entries
 - `BTreeMap` for all child/xattr collections
+- A deterministic layout: file→block placement (including reserved-block skips and any alignment padding) is a pure function of the input, so the same tar always lands the same bytes in the same blocks.
 
 ### Why port from hcsshim instead of using an existing crate?
 
@@ -306,13 +345,13 @@ hcsshim's `compactext4` is the reference implementation for OCI-compatible ext4
 
 The port preserves the same on-disk layout, making images identical to those produced by the Go implementation.
 
-### Why no journal?
+### Why is the journal optional?
 
-Container layer images are read-only once mounted by the overlay filesystem. A journal adds ~128 MiB of overhead for no benefit. The `HAS_JOURNAL` compat feature is intentionally absent.
+The journal is off by default in the writer: a container layer mounted read-only through overlay never needs one. But a blessed base image that backs a *mutable* volume does, so bless enables `WriterOption::Journal(1024)` (4 MiB). When enabled, the journal is inode 8 with the `HAS_JOURNAL` feature; `s_journal_uuid` stays zero because it identifies an *external* journal device (with a non-zero value, the kernel and e2fsck search for that nonexistent device and abort). When disabled, `HAS_JOURNAL` is absent.
 
 ### Why no checksums?
 
-`METADATA_CSUM` and `GDT_CSUM` are not enabled. Checksums require the UUID as a seed, but a zero UUID makes all checksums trivially zero — enabling the feature would silently produce invalid checksums. Since images are content-addressed externally, internal ext4 checksums are redundant.
+`METADATA_CSUM` and `GDT_CSUM` are not enabled. Metadata checksums are seeded by the UUID and would have to be recomputed for every structure; since images are content-addressed externally (BLAKE3 over the bytes) and validated against the real `e2fsck`/kernel in tests, internal ext4 checksums are redundant.
 
 ## Package Structure
 
@@ -320,7 +359,7 @@ Container layer images are read-only once mounted by the overlay filesystem. A j
 |------|---------|
 | `mod.rs` | Re-exports public API: `Writer`, `Reader`, `File`, `WriterOption`, `convert_tar_to_ext4` |
 | `format.rs` | On-disk binary structures: `SuperBlock`, `GroupDescriptor`, `ParsedInode`, `ExtentHeader/Leaf/Index`, `DirEntry`, xattr helpers. Both serialization (`write_to`) and deserialization (`read_from`, `get_xattrs`) for shared on-disk types. |
-| `writer.rs` | Core filesystem builder. Manages inode lifecycle, block allocation, extent tree construction, xattr packing, directory serialization, superblock finalization. |
+| `writer.rs` | Core filesystem builder. Manages inode lifecycle, reserved-block-aware allocation (data fragments around backup-superblock blocks; contiguous structures use `reserve_contiguous`), extent tree construction, optional alignment + free-hole accounting, xattr packing, directory serialization, journal, superblock finalization. |
 | `reader.rs` | ext4 image parser. Reads superblock, group descriptors, inode table, extent trees, directory entries, and xattrs. Exports via `walk()` and `to_tar()`. |
 | `tar_convert.rs` | tar→ext4 bridge. Maps tar entry types to writer operations, handles OCI whiteouts and PAX xattrs. |
 | `diff.rs` | Incremental export: diffs two ext4 snapshots and produces an OCI-compatible delta tar layer with whiteout markers for deletions. |
@@ -332,6 +371,9 @@ Container layer images are read-only once mounted by the overlay filesystem. A j
 |--------|---------|--------|
 | `WriterOption::InlineData` | disabled | Store files ≤136 bytes inside the inode instead of allocating data blocks. Reduces image size for layers with many small files (e.g., config files, scripts). |
 | `WriterOption::MaximumDiskSize(n)` | 16 GiB | Maximum filesystem size. Controls the number of block groups pre-allocated in the group descriptor table. Range: 0..16 TiB. |
+| `WriterOption::Uuid([u8;16])` | all-zeros | Filesystem UUID, written to the superblock and used as the directory-hash seed. Callers that content-address the image pass a deterministic (e.g. manifest-derived) UUID so the same input yields the same bytes. |
+| `WriterOption::Journal(blocks)` | none | Create an internal jbd2 journal of `blocks` 4 KiB blocks (e.g. 1024 = 4 MiB) as inode 8 and set the `HAS_JOURNAL` feature. `s_journal_uuid` is left **zero** (it names an *external* journal device; with a non-zero value, the kernel and e2fsck search for that nonexistent device and abort). bless enables this. |
+| `WriterOption::AlignData { align, min_size }` | disabled | Start the data of every regular file ≥ `min_size` on an `align`-byte boundary, padding the gap with a (free) hole. Aligning large payloads to the downstream dedup block grid makes the same file produce the same blocks regardless of upstream churn, so content-addressed dedup survives. Metadata-aware: composes with reserved-block skipping. |
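
The padding arithmetic behind `AlignData` can be sketched as a pure function. This is an illustration under assumptions, not the writer's code: `align_padding` is a hypothetical helper, positions are byte offsets, and `align` is taken to be either zero (disabled) or a power of two.

```rust
/// Hypothetical helper: bytes of padding to insert before a file of `size`
/// bytes whose data would otherwise start at byte offset `pos`.
/// Returns 0 when alignment is disabled (`align == 0`), when the file is
/// smaller than `min_size`, or when `pos` already sits on an `align` boundary.
fn align_padding(pos: u64, size: u64, align: u64, min_size: u64) -> u64 {
    if align == 0 || size < min_size {
        return 0;
    }
    let rem = pos % align;
    if rem == 0 {
        0
    } else {
        align - rem
    }
}
```

As an example with an assumed 64 KiB grid, a 1 MiB file whose data would start at byte 20,480 (block 5) gets 45,056 bytes (11 blocks) of padding and starts at block 16. Those 11 blocks are the free hole the bitmap must clear.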
 
 ## Limits
 
@@ -399,6 +441,20 @@ Three test tiers in `tests.rs`:
 
 Run without Docker: `cargo test --features test-utils --lib` and `cargo test --features test-utils --test integration`
 
+**Filesystem-validity harness** (`tests/fsck_validity.rs`) — gates correctness on kernel-grade oracles, not the in-crate reader (which is lenient and once hid a multi-group corruption bug). Skips cleanly where `e2fsck` is absent.
+
+| Test | What it covers |
+|------|---------------|
+| `fsck_single_group_clean` / `fsck_multi_group_clean` | `e2fsck -fn` clean for single- and multi-group images |
+| `fsck_multi_group_aligned_clean` | aligned build is e2fsck-clean (padding marked free, aligned starts dodge reserved blocks) |
+| `content_survives_fragmentation` | files split around reserved blocks read back byte-exact (right logical order) |
+| `fsck_journal_straddles_group_boundary` | journal must not straddle a backup superblock (sweeps 120–136 MiB) |
+| `fsck_inode_table_straddles_boundary` | inode table must not straddle a backup superblock (many-file workloads) |
+| `fuzz_multigroup_validity_and_content` | random multi-group filesets: e2fsck-clean + byte-exact, both align modes (`EXT4_FUZZ_SEEDS` to scale) |
+| `kernel_mount_content` | opt-in (`EXT4_MOUNT_TEST=1`): real loop-mount, every file byte-exact vs known input |
+
+All tests build with `Journal(1024)` to match the production bless config.
+
 ## Failure Modes
 
 | Failure | Behavior |
diff --git a/ext4/src/tests.rs b/ext4/src/tests.rs
index 79ec7de..deb0c02 100644
--- a/ext4/src/tests.rs
+++ b/ext4/src/tests.rs
@@ -971,7 +971,10 @@ fn test_journal_roundtrip() {
             "HAS_JOURNAL flag not set"
         );
         assert_eq!(sb.journal_inum, format::INODE_JOURNAL, "journal_inum should be 8");
-        assert_eq!(sb.journal_uuid, uuid, "journal_uuid should match filesystem uuid");
+        // s_journal_uuid identifies an EXTERNAL journal device; for an internal
+        // journal it must be zero, or the kernel/e2fsck search for a nonexistent
+        // external journal and abort ("Can't find external journal").
+        assert_eq!(sb.journal_uuid, [0u8; 16], "journal_uuid must be zero for an internal journal");
         assert_ne!(sb.journal_blocks[0], 0, "journal_blocks backup should be populated");
     }
 
diff --git a/ext4/src/writer.rs b/ext4/src/writer.rs
index 1ad23bc..57afd92 100644
--- a/ext4/src/writer.rs
+++ b/ext4/src/writer.rs
@@ -66,6 +66,21 @@ pub enum WriterOption {
     /// Create an internal journal with the given size in 4 KiB blocks.
     /// Typical values: 1024 (4 MiB), 4096 (16 MiB), 16384 (64 MiB).
     Journal(u32),
+    /// Start the data of every regular file at least `min_size` bytes large on
+    /// an `align`-byte boundary (padding the gap with a hole). Aligning large
+    /// file payloads to the downstream dedup block grid makes the same file
+    /// produce the same blocks regardless of what was written before it, so
+    /// content-addressed dedup survives unrelated upstream churn. `align` must
+    /// be a power of two; `align == 0` disables (the default).
+    ///
+    /// Alignment is metadata-aware: if padding lands the cursor on a block
+    /// reserved for block-group metadata (e.g. the backup superblock at block
+    /// `blocks_per_group`), `write_file_data` skips the reserved region before
+    /// recording the file's first data block, and reserved blocks inside the
+    /// padded gap stay marked used. Validated against the real `e2fsck` by
+    /// `fsck_multi_group_aligned_clean` and
+    /// `fuzz_multigroup_validity_and_content`.
+    AlignData { align: u32, min_size: u32 },
 }
 
 // ---- Internal inode ----
@@ -233,6 +248,20 @@ pub struct Writer<W: Read + Write + Seek> {
     gd_blocks: u32,
     uuid: [u8; 16],
     journal_blocks: u32,
+    /// Boundary (bytes) for large-file data alignment; 0 = disabled.
+    data_align: i64,
+    /// Minimum file size (bytes) that triggers data alignment.
+    data_align_min: i64,
+    /// Physical block where the in-progress file's data begins. File data skips
+    /// blocks reserved for block-group metadata (backup superblocks + GDT), so
+    /// the data is generally non-contiguous and `pos - data_written` no longer
+    /// locates the start — this does.
+    data_start_block: u32,
+    /// Unreferenced free block ranges created by data-alignment padding and by
+    /// `reserve_contiguous` skipping ahead of a reserved region. The block
+    /// bitmap assumes a densely packed data region; these holes must be cleared
+    /// from it so the filesystem is consistent.
+    free_holes: Vec<(u32, u32)>,
 }
 
 impl<W: Read + Write + Seek> Writer<W> {
@@ -251,6 +280,10 @@ impl<W: Read + Write + Seek> Writer<W> {
             gd_blocks: 0,
             uuid: [0u8; 16],
             journal_blocks: 0,
+            data_align: 0,
+            data_align_min: 0,
+            data_start_block: 0,
+            free_holes: Vec::new(),
         };
         for opt in opts {
             match opt {
@@ -268,6 +301,11 @@ impl<W: Read + Write + Seek> Writer<W> {
                 }
                 WriterOption::Uuid(u) => w.uuid = *u,
                 WriterOption::Journal(blocks) => w.journal_blocks = *blocks,
+                WriterOption::AlignData { align, min_size } => {
+                    debug_assert!(
+                        *align == 0 || align.is_power_of_two(),
+                        "AlignData.align must be 0 or a power of two"
+                    );
+                    w.data_align = i64::from(*align);
+                    w.data_align_min = i64::from(*min_size);
+                }
             }
         }
         w
@@ -331,6 +369,142 @@ impl<W: Read + Write + Seek> Writer<W> {
         Ok(())
     }
 
+    // ---- block-group metadata reservation ----
+    //
+    // ext4's sparse_super layout reserves the first `1 + gd_blocks` blocks of
+    // certain groups (0, 1, and powers of 3/5/7) for a backup superblock + a
+    // group-descriptor copy. Group 0's reservation is skipped at init(); the
+    // interior ones (block 32768, 98304, ...) sit in the middle of the data
+    // region. File data must not be written onto them, or the kernel rejects the
+    // extent as overlapping a system zone (multiply-claimed block).
+
+    /// Number of reserved blocks at the start of a backup group.
+    fn group_reserve(&self) -> u32 {
+        1 + self.gd_blocks
+    }
+
+    /// Is physical block `b` reserved for an interior block-group backup?
+    fn is_reserved_block(&self, b: u32) -> bool {
+        let g = b / BLOCKS_PER_GROUP;
+        if g == 0 {
+            return false; // group 0's primary metadata is handled by init()'s seek
+        }
+        (b % BLOCKS_PER_GROUP) < self.group_reserve() && has_super_backup(g)
+    }
+
+    /// Smallest reserved block >= `from`, or `None` if there is none within the
+    /// maximum disk size.
+    fn next_reserved_block_ge(&self, from: u32) -> Option<u32> {
+        let max_group = (self.max_disk_size / (i64::from(BLOCKS_PER_GROUP) * BLOCK_SIZE as i64)) as u32 + 1;
+        let mut g = from / BLOCKS_PER_GROUP;
+        while g <= max_group {
+            if g >= 1 && has_super_backup(g) {
+                let rstart = g * BLOCKS_PER_GROUP;
+                let rend = rstart + self.group_reserve();
+                let cand = from.max(rstart);
+                if cand < rend {
+                    return Some(cand);
+                }
+            }
+            g += 1;
+        }
+        None
+    }
+
+    /// If `pos` sits at the start of a reserved region, seek past it.
+    fn skip_reserved_at_pos(&mut self) -> io::Result<()> {
+        while self.pos % BLOCK_SIZE as i64 == 0 && self.is_reserved_block(self.block()) {
+            let g = self.block() / BLOCKS_PER_GROUP;
+            let region_end = g * BLOCKS_PER_GROUP + self.group_reserve();
+            self.seek_block(region_end)?;
+        }
+        Ok(())
+    }
+
+    /// Write file data, skipping reserved block-group metadata regions. Records
+    /// the file's first data block on the first call.
+    fn write_file_data(&mut self, b: &[u8]) -> io::Result<usize> {
+        if self.data_written == 0 {
+            self.skip_reserved_at_pos()?;
+            self.data_start_block = self.block();
+        }
+        let mut off = 0usize;
+        while off < b.len() {
+            self.skip_reserved_at_pos()?;
+            let cur = self.block();
+            let limit = match self.next_reserved_block_ge(cur) {
+                // next_reserved >= cur, and cur is not reserved, so r > cur.
+                Some(r) => i64::from(r) * BLOCK_SIZE as i64 - self.pos,
+                None => i64::MAX,
+            };
+            let take = ((b.len() - off) as i64).min(limit) as usize;
+            let w = self.write_bytes(&b[off..off + take])?;
+            off += w;
+            if w < take {
+                break; // short write
+            }
+        }
+        Ok(off)
+    }
+
+    /// Record [start, end) as free holes, excluding reserved metadata blocks
+    /// (which stay marked used — they hold backup superblocks, not free space).
+    fn record_free_hole(&mut self, start: u32, end: u32) {
+        let mut b = start;
+        while b < end {
+            if self.is_reserved_block(b) {
+                b += 1;
+                continue;
+            }
+            let run_start = b;
+            b = self.next_reserved_block_ge(b).unwrap_or(end).min(end);
+            if b > run_start {
+                self.free_holes.push((run_start, b - run_start));
+            }
+        }
+    }
+
+    /// Position the cursor so the next `n` blocks form a single contiguous run
+    /// that contains no reserved block-group metadata, and return that start
+    /// block. Used for structures that must be contiguous (journal inode, the
+    /// flex_bg inode table, bitmaps) — unlike file data, they can't be
+    /// fragmented around a reserved block, so instead we skip the whole run past
+    /// any reserved region it would straddle. Skipped data blocks become free
+    /// holes; the reserved blocks stay used. `n` is always << a block group, so
+    /// at most one interior backup region is ever in the way.
+    fn reserve_contiguous(&mut self, n: u32) -> io::Result<u32> {
+        loop {
+            self.skip_reserved_at_pos()?;
+            let start = self.block();
+            match self.next_reserved_block_ge(start) {
+                Some(r) if r < start + n => {
+                    let g = r / BLOCKS_PER_GROUP;
+                    let region_end = g * BLOCKS_PER_GROUP + self.group_reserve();
+                    self.record_free_hole(start, r);
+                    self.seek_block(region_end)?;
+                }
+                _ => return Ok(start),
+            }
+        }
+    }
+
+    /// The contiguous, non-reserved physical runs covering [start, end).
+    fn physical_runs(&self, start: u32, end: u32) -> Vec<(u32, u32)> {
+        let mut runs = Vec::new();
+        let mut b = start;
+        while b < end {
+            if self.is_reserved_block(b) {
+                b += 1;
+                continue;
+            }
+            let run_start = b;
+            // Jump to the next reserved block (or end) rather than stepping.
+            let next_res = self.next_reserved_block_ge(b).unwrap_or(end).min(end);
+            b = next_res;
+            runs.push((run_start, b - run_start));
+        }
+        runs
+    }
+
     // ---- inode management ----
 
     fn get_inode(&self, i: InodeNumber) -> Option<&Inode> {
@@ -615,6 +789,7 @@ impl<W: Read + Write + Seek> Writer<W> {
         self.cur_inode = Some((ino - 1) as usize);
         self.data_written = 0;
         self.data_max = size;
+        self.data_start_block = 0;
         Ok(())
     }
 
@@ -645,64 +820,72 @@ impl<W: Read + Write + Seek> Writer<W> {
     }
 
     fn write_extents(&mut self, idx: usize) -> io::Result<()> {
-        let start = self.pos - self.data_written;
-        if start % BLOCK_SIZE as i64 != 0 {
-            return Err(io::Error::other(
-                "data start position is not block-aligned",
-            ));
-        }
+        // Flush the partial final data block, then resolve the file's physical
+        // layout. Data skips reserved block-group metadata, so it may be split
+        // across several contiguous runs; `data_start_block` (not
+        // `pos - data_written`) locates the start.
         self.next_block()?;
-
-        let start_block = (start / BLOCK_SIZE as i64) as u32;
-        let blocks = self.block() - start_block;
-        let mut used_blocks = blocks;
+        let start_block = self.data_start_block;
+        let end_block = self.block();
+        let runs = self.physical_runs(start_block, end_block);
+
+        // Flatten runs into extent leaves, each at most MAX_BLOCKS_PER_EXTENT.
+        // For an unfragmented file this yields exactly the same leaves the old
+        // contiguous arithmetic produced.
+        let mut leaves: Vec<(u32, u32, u32)> = Vec::new(); // (logical, phys, len)
+        let mut logical = 0u32;
+        for (phys, len) in &runs {
+            let mut o = 0u32;
+            while o < *len {
+                let l = (*len - o).min(MAX_BLOCKS_PER_EXTENT);
+                leaves.push((logical, phys + o, l));
+                logical += l;
+                o += l;
+            }
+        }
+        let mut used_blocks = logical; // data blocks (reserved gaps excluded)
 
         const EXTENT_NODE_SIZE: u32 = 12;
         const EXTENTS_PER_BLOCK: u32 = (BLOCK_SIZE as u32) / EXTENT_NODE_SIZE - 1;
 
-        let extents = if blocks == 0 { 0 } else { blocks.div_ceil(MAX_BLOCKS_PER_EXTENT) };
+        let n_ext = leaves.len() as u32;
         let mut data = Vec::new();
 
-        if extents == 0 {
+        if n_ext == 0 {
             // Nothing to do
-        } else if extents <= 4 {
-            // Fits in inode directly
-            write_extent_header_to_vec(&mut data, extents as u16, 4, 0);
-            for i in 0..extents {
-                let block_offset = i * MAX_BLOCKS_PER_EXTENT;
-                let mut length = blocks - block_offset;
-                if length > MAX_BLOCKS_PER_EXTENT {
-                    length = MAX_BLOCKS_PER_EXTENT;
-                }
-                write_extent_leaf_to_vec(&mut data, block_offset, length as u16, start_block + block_offset);
+        } else if n_ext <= 4 {
+            // Fits in the inode directly.
+            write_extent_header_to_vec(&mut data, n_ext as u16, 4, 0);
+            for (lblk, phys, len) in &leaves {
+                write_extent_leaf_to_vec(&mut data, *lblk, *len as u16, *phys);
             }
-            // Pad to 4 extents worth
-            let padding = (4 - extents) * EXTENT_NODE_SIZE;
+            let padding = (4 - n_ext) * EXTENT_NODE_SIZE;
             data.extend(std::iter::repeat_n(0u8, padding as usize));
-        } else if extents <= 4 * EXTENTS_PER_BLOCK {
-            let extent_blocks = extents.div_ceil(EXTENTS_PER_BLOCK);
-            used_blocks += extent_blocks;
+        } else if n_ext <= 4 * EXTENTS_PER_BLOCK {
+            let extent_blocks = n_ext.div_ceil(EXTENTS_PER_BLOCK);
 
-            // Root: index nodes
+            // Root: index nodes pointing at leaf blocks.
             write_extent_header_to_vec(&mut data, extent_blocks as u16, 4, 1);
-            // We'll fill in the index nodes after writing the leaf blocks
             let index_start = data.len();
             data.resize(index_start + 4 * EXTENT_NODE_SIZE as usize, 0);
 
             for i in 0..extent_blocks {
+                // Extent-tree blocks must avoid reserved metadata too.
+                self.skip_reserved_at_pos()?;
                 let leaf_block = self.block();
-                // Fill in the index node
+                used_blocks += 1;
+
+                let first = (i * EXTENTS_PER_BLOCK) as usize;
+                let extents_in_block = (n_ext - i * EXTENTS_PER_BLOCK).min(EXTENTS_PER_BLOCK);
+
+                // Index node: logical offset of this leaf block's first extent.
                 let idx_off = index_start + (i * EXTENT_NODE_SIZE) as usize;
-                let block_off = i * EXTENTS_PER_BLOCK * MAX_BLOCKS_PER_EXTENT;
-                data[idx_off..idx_off + 4].copy_from_slice(&block_off.to_le_bytes());
+                data[idx_off..idx_off + 4].copy_from_slice(&leaves[first].0.to_le_bytes());
                 data[idx_off + 4..idx_off + 8].copy_from_slice(&leaf_block.to_le_bytes());
                 // idx_off + 8..12 stays zero (leaf_high + unused)
 
-                let extents_in_block = (extents - i * EXTENTS_PER_BLOCK).min(EXTENTS_PER_BLOCK);
                 let mut leaf_buf = vec![0u8; BLOCK_SIZE as usize];
                 let mut leaf_pos = 0usize;
-
-                // Write extent header
                 leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&format::EXTENT_HEADER_MAGIC.to_le_bytes());
                 leaf_pos += 2;
                 leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(extents_in_block as u16).to_le_bytes());
@@ -714,21 +897,15 @@ impl<W: Read + Write + Seek> Writer<W> {
                 leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&0u32.to_le_bytes()); // generation
                 leaf_pos += 4;
 
-                let offset = i * EXTENTS_PER_BLOCK * MAX_BLOCKS_PER_EXTENT;
-                for j in 0..extents_in_block {
-                    let block_off2 = offset + j * MAX_BLOCKS_PER_EXTENT;
-                    let mut length = blocks - block_off2;
-                    if length > MAX_BLOCKS_PER_EXTENT {
-                        length = MAX_BLOCKS_PER_EXTENT;
-                    }
-                    let start = start_block + block_off2;
-                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&block_off2.to_le_bytes());
+                for j in 0..extents_in_block as usize {
+                    let (lblk, phys, len) = leaves[first + j];
+                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&lblk.to_le_bytes());
                     leaf_pos += 4;
-                    leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(length as u16).to_le_bytes());
+                    leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(len as u16).to_le_bytes());
                     leaf_pos += 2;
                     leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&0u16.to_le_bytes()); // start_high
                     leaf_pos += 2;
-                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&start.to_le_bytes());
+                    leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&phys.to_le_bytes());
                     leaf_pos += 4;
                 }
 
@@ -892,6 +1069,22 @@ impl<W: Read + Write + Seek> Writer<W> {
 
         if self.inode_ref(child_ino)?.mode & TYPE_MASK == format::S_IFREG {
             self.start_inode(name, child_ino, f.size)?;
+            // Align the start of large file payloads to the dedup block grid, so
+            // the same file produces the same blocks regardless of upstream
+            // churn. The padded gap is unreferenced free space; record it so the
+            // block bitmap marks it free (reserved metadata blocks within the
+            // gap stay used). write_file_data then skips any reserved block at
+            // the aligned position before recording data_start_block.
+            if self.data_align > 0 && f.size >= self.data_align_min {
+                let align = self.data_align;
+                let rem = self.pos % align;
+                if rem != 0 {
+                    let pad_start = self.block();
+                    self.write_zeros(align - rem)?;
+                    let pad_end = self.block();
+                    self.record_free_hole(pad_start, pad_end);
+                }
+            }
         }
         Ok(())
     }
@@ -1024,7 +1217,7 @@ impl<W: Read + Write + Seek> Writer<W> {
             self.data_written += b.len() as i64;
             Ok(b.len())
         } else {
-            let n = self.write_bytes(b)?;
+            let n = self.write_file_data(b)?;
             self.data_written += n as i64;
             Ok(n)
         }
@@ -1147,7 +1340,10 @@ impl<W: Read + Write + Seek> Writer<W> {
     /// journal blocks. The superblock is updated in close() to set
     /// HAS_JOURNAL, journal_inum, and the journal_blocks backup.
     fn write_journal(&mut self) -> io::Result<()> {
-        let journal_start = self.block();
+        // The journal is one contiguous extent; keep it clear of reserved
+        // block-group metadata (a straddle would make inode 8 multiply-claim the
+        // backup superblock).
+        let journal_start = self.reserve_contiguous(self.journal_blocks)?;
 
         // Write JBD2 v2 superblock (first block of journal)
         // All multi-byte fields are big-endian per JBD2 spec.
@@ -1341,14 +1537,21 @@ impl<W: Read + Write + Seek> Writer<W> {
             self.write_journal()?;
         }
 
-        // Write the inode table
-        let inode_table_offset = self.block();
-        let (groups, inodes_per_group) = best_group_count(inode_table_offset, self.inodes.len() as u32);
+        // Write the inode table. It is contiguous and located via per-group
+        // descriptors (inode_table_low + g * size_per_group), so it must avoid
+        // reserved block-group metadata. Reserve a clean run sized for the group
+        // count; padding the start can bump the count by one group, so reserve a
+        // one-group margin and recompute against the final offset.
+        let n_inodes = self.inodes.len() as u32;
+        let (g0, ipg0) = best_group_count(self.block(), n_inodes);
+        let itspg0 = ipg0 * INODE_SIZE as u32 / BLOCK_SIZE as u32;
+        let inode_table_offset = self.reserve_contiguous((g0 + 1) * itspg0 + 2)?;
+        let (groups, inodes_per_group) = best_group_count(inode_table_offset, n_inodes);
         self.write_inode_table(groups * inodes_per_group * INODE_SIZE as u32)?;
 
-        // Write bitmaps
-        let bitmap_offset = self.block();
+        // Write bitmaps (also contiguous and GD-located).
         let bitmap_size = groups * 2;
+        let bitmap_offset = self.reserve_contiguous(bitmap_size)?;
         let valid_data_size = bitmap_offset + bitmap_size;
         let mut disk_size = valid_data_size;
         let min_size = (groups - 1) * BLOCKS_PER_GROUP + 1;
@@ -1368,6 +1571,8 @@ impl<W: Read + Write + Seek> Writer<W> {
         let inode_table_size_per_group = inodes_per_group * INODE_SIZE as u32 / BLOCK_SIZE as u32;
         let mut total_used_blocks: u32 = 0;
         let mut total_used_inodes: u32 = 0;
+        // Alignment padding holes to clear from the otherwise-dense bitmap.
+        let free_holes = std::mem::take(&mut self.free_holes);
 
         for g in 0..groups {
             let mut bitmap_buf = vec![0u8; BLOCK_SIZE as usize * 2];
@@ -1400,6 +1605,23 @@ impl<W: Read + Write + Seek> Writer<W> {
                     used_block_count += 1;
                 }
             }
+            // Clear alignment padding holes: the bitmap is dense by default, but
+            // these blocks are unreferenced free space.
+            let gstart = g * BLOCKS_PER_GROUP;
+            for &(hstart, hlen) in &free_holes {
+                let lo = hstart.max(gstart);
+                let hi = (hstart + hlen).min(gstart + BLOCKS_PER_GROUP);
+                let mut b = lo;
+                while b < hi {
+                    let j = b - gstart;
+                    let mask = 1u8 << (j % 8);
+                    if bitmap_buf[(j / 8) as usize] & mask != 0 {
+                        bitmap_buf[(j / 8) as usize] &= !mask;
+                        used_block_count -= 1;
+                    }
+                    b += 1;
+                }
+            }
 
             // Inode bitmap
             for j in 0..inodes_per_group {
@@ -1478,7 +1700,12 @@ impl<W: Read + Write + Seek> Writer<W> {
                 | format::RoCompatFeature::HUGE_FILE
                 | format::RoCompatFeature::EXTRA_ISIZE,
             uuid: self.uuid,
-            journal_uuid: self.uuid,
+            // s_journal_uuid identifies an *external* journal device. We only
+            // ever use an internal journal (inode 8) or none, so this must stay
+            // zero — a non-zero value makes the kernel and e2fsck search for an
+            // external journal and abort ("Can't find external journal"). The
+            // journal's own jbd2 superblock still carries the fs UUID.
+            journal_uuid: [0u8; 16],
             journal_inum: if self.journal_blocks > 0 { format::INODE_JOURNAL } else { 0 },
             hash_seed: [
                 u32::from_le_bytes(self.uuid[0..4].try_into().unwrap()),
@@ -1582,6 +1809,29 @@ fn best_group_count(blocks: u32, inodes: u32) -> (u32, u32) {
     (best_groups, best_ipg)
 }
 
+/// Does block group `g` hold a backup superblock + group-descriptor copy?
+/// With the sparse_super feature, backups live in groups 0, 1, and every power
+/// of 3, 5, and 7. The kernel/e2fsck reserve those blocks regardless of whether
+/// valid backup content is written, so file data must never claim them.
+fn has_super_backup(g: u32) -> bool {
+    if g <= 1 {
+        return true;
+    }
+    for base in [3u32, 5, 7] {
+        let mut p = base;
+        while p < g {
+            match p.checked_mul(base) {
+                Some(n) => p = n,
+                None => break,
+            }
+        }
+        if p == g {
+            return true;
+        }
+    }
+    false
+}
+
 fn write_extent_header_to_vec(buf: &mut Vec<u8>, entries: u16, max: u16, depth: u16) {
     buf.extend_from_slice(&format::EXTENT_HEADER_MAGIC.to_le_bytes());
     buf.extend_from_slice(&entries.to_le_bytes());
diff --git a/ext4/tests/fsck_validity.rs b/ext4/tests/fsck_validity.rs
new file mode 100644
index 0000000..b5d606a
--- /dev/null
+++ b/ext4/tests/fsck_validity.rs
@@ -0,0 +1,442 @@
+//! Filesystem-validity harness gated on the REAL `e2fsck`, not the in-crate
+//! reader. The in-crate reader is lenient and hid a real multi-block-group
+//! corruption bug for a long time; `e2fsck` (kernel-grade structural check)
+//! catches it. Every writer change should be validated here.
+//!
+//! These tests shell out to `e2fsck`; they skip (pass with a notice) when it is
+//! not installed so they do not break environments without e2fsprogs.
+//!
+//! Run with: `cargo test -p ext4 --test fsck_validity` (no test here is `#[ignore]`d)
+
+use std::io::Write;
+use std::path::PathBuf;
+use std::process::Command;
+
+use ext4::tar_convert::{ConvertOptions, convert_tar_to_ext4};
+use ext4::writer::WriterOption;
+
+/// Locate `e2fsck` (often in /sbin, not on a service PATH). Returns None to skip.
+fn find_e2fsck() -> Option<PathBuf> {
+    for p in ["/sbin/e2fsck", "/usr/sbin/e2fsck", "/usr/bin/e2fsck", "/bin/e2fsck"] {
+        if std::path::Path::new(p).exists() {
+            return Some(PathBuf::from(p));
+        }
+    }
+    None
+}
+
+/// Deterministic, non-trivial file content (so blocks are actually allocated and
+/// the layout is reproducible — no RNG).
+fn content(seed: u64, len: usize) -> Vec<u8> {
+    let mut v = Vec::with_capacity(len);
+    let mut s = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
+    while v.len() < len {
+        s ^= s << 13;
+        s ^= s >> 7;
+        s ^= s << 17;
+        v.extend_from_slice(&s.to_le_bytes());
+    }
+    v.truncate(len);
+    v
+}
+
+/// Build an in-memory tar of `(path, size)` files, convert to ext4 via the real
+/// production path, write it to a temp file, and run `e2fsck -fn` on it.
+/// Returns Ok(()) if e2fsck reports a clean filesystem (exit 0), else Err(report).
+fn build_and_fsck(files: &[(&str, usize)], align: Option<(u32, u32)>) -> Result<(), String> {
+    let owned: Vec<(String, usize)> = files.iter().map(|(p, s)| ((*p).to_string(), *s)).collect();
+    e2fsck_clean(&build_image(&owned, align))
+}
+
+/// Build a real ext4 image (production convert path) from `(path, size)` files,
+/// each filled with the deterministic `content(index, size)`.
+fn build_image(files: &[(String, usize)], align: Option<(u32, u32)>) -> Vec<u8> {
+    let mut tar = tar::Builder::new(Vec::new());
+    for (i, (path, size)) in files.iter().enumerate() {
+        let data = content(i as u64, *size);
+        let mut h = tar::Header::new_gnu();
+        h.set_size(data.len() as u64);
+        h.set_mode(0o644);
+        h.set_mtime(0);
+        h.set_entry_type(tar::EntryType::Regular);
+        h.set_cksum();
+        tar.append_data(&mut h, path, &data[..]).unwrap();
+    }
+    let tar_bytes = tar.into_inner().unwrap();
+
+    let mut writer_options = vec![
+        WriterOption::MaximumDiskSize(4 * 1024 * 1024 * 1024),
+        WriterOption::Uuid([0x11; 16]),
+        // Match the production bless config: an internal 4 MiB journal. The
+        // journal (and inode table / dir blocks) are written at close() and can
+        // land near a block-group backup-superblock boundary, so exercising it
+        // is part of validating the reserved-block handling.
+        WriterOption::Journal(1024),
+    ];
+    if let Some((a, m)) = align {
+        writer_options.push(WriterOption::AlignData { align: a, min_size: m });
+    }
+    let opts = ConvertOptions { convert_backslash: false, writer_options };
+    let mut img: Vec<u8> = Vec::new();
+    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), std::io::Cursor::new(&mut img), &opts)
+        .unwrap();
+    img
+}
+
+/// The oracle: run `e2fsck -fn` on the image. Ok == clean (exit 0). Skips
+/// (returns Ok) when e2fsck is not installed.
+fn e2fsck_clean(img: &[u8]) -> Result<(), String> {
+    let Some(e2fsck) = find_e2fsck() else {
+        eprintln!("SKIP: e2fsck not installed");
+        return Ok(());
+    };
+    let mut tmp = tempfile::NamedTempFile::new().map_err(|e| format!("tmp: {e}"))?;
+    tmp.write_all(img).map_err(|e| format!("write: {e}"))?;
+    tmp.flush().ok();
+    let output = Command::new(&e2fsck)
+        .args(["-fn"])
+        .arg(tmp.path())
+        .output()
+        .map_err(|e| format!("spawn e2fsck: {e}"))?;
+    if output.status.code() == Some(0) {
+        Ok(())
+    } else {
+        let mut report = format!("e2fsck exit={:?} (nonzero = filesystem errors)\n", output.status.code());
+        report.push_str(&String::from_utf8_lossy(&output.stdout));
+        // Trim the giant bitmap-difference dumps to keep failures readable.
+        Err(report.lines().take(30).collect::<Vec<_>>().join("\n"))
+    }
+}
+
+/// Read every file back via the reader (which assembles from the on-disk extent
+/// tree) and assert byte-exact equality with the known input — catches any
+/// logical-ordering bug introduced by fragmentation around reserved blocks.
+fn content_matches(img: &[u8], files: &[(String, usize)]) -> Result<(), String> {
+    let mut want: std::collections::HashMap<String, (u64, usize)> = std::collections::HashMap::new();
+    for (i, (p, s)) in files.iter().enumerate() {
+        want.insert(p.trim_start_matches('/').to_string(), (i as u64, *s));
+    }
+    let mut reader =
+        ext4::reader::Reader::new(std::io::Cursor::new(img)).map_err(|e| format!("reader: {e}"))?;
+    let entries = reader.walk().map_err(|e| format!("walk: {e}"))?;
+    let mut checked = 0;
+    for e in entries {
+        if (e.mode & 0xF000) != 0x8000 {
+            continue;
+        }
+        let path = e.path.trim_start_matches('/').to_string();
+        let Some(&(idx, size)) = want.get(&path) else { continue };
+        let inode = reader.read_inode(e.inode_number).map_err(|e| format!("{path}: inode: {e}"))?;
+        let got = reader.read_data(&inode).map_err(|e| format!("{path}: read: {e}"))?;
+        if got.len() != size {
+            return Err(format!("{path}: size {} != {size}", got.len()));
+        }
+        if got != content(idx, size) {
+            return Err(format!("{path}: CONTENT MISMATCH (fragmentation reordered bytes)"));
+        }
+        checked += 1;
+    }
+    if checked != files.len() {
+        return Err(format!("read back {checked}/{} files", files.len()));
+    }
+    Ok(())
+}
+
+/// Baseline: a single-block-group filesystem (< 128 MiB) must be e2fsck-clean.
+/// This proves the harness works and the writer is sound when it doesn't cross
+/// a block-group boundary.
+#[test]
+fn fsck_single_group_clean() {
+    let files = &[
+        ("etc/hostname", 12),
+        ("etc/config.toml", 4096),
+        ("usr/bin/tool", 8 * 1024 * 1024),
+        ("usr/lib/data.bin", 32 * 1024 * 1024),
+        ("var/log/app.log", 1024),
+    ];
+    if let Err(report) = build_and_fsck(files, None) {
+        panic!("single-group image is not e2fsck-clean:\n{report}");
+    }
+}
+
+/// A filesystem that crosses a block-group boundary (> 128 MiB) must be
+/// e2fsck-clean. Regression for the original corruption: the linear allocator
+/// used to place file data on the Group 1 backup superblock / group descriptors
+/// (block 32768+), producing multiply-claimed blocks. The group-aware allocator
+/// now skips those reserved blocks and fragments files around them.
+#[test]
+fn fsck_multi_group_clean() {
+    // ~160 MiB of file data guarantees crossing into block group 1 (32768 blocks
+    // == 128 MiB), regardless of group-0 metadata overhead.
+    let files = &[
+        ("data/a.bin", 40 * 1024 * 1024),
+        ("data/b.bin", 40 * 1024 * 1024),
+        ("data/c.bin", 40 * 1024 * 1024),
+        ("data/d.bin", 40 * 1024 * 1024),
+    ];
+    if let Err(report) = build_and_fsck(files, None) {
+        panic!("multi-group image is not e2fsck-clean:\n{report}");
+    }
+}
+
+/// Content correctness across fragmentation. e2fsck validates *structure*, not
+/// that a file's logical blocks are in the right order. A file that spans a
+/// reserved metadata region is split into multiple extents by the allocator; if
+/// the logical offsets were assigned wrong, the bytes would come back reordered.
+/// Build a multi-group image with KNOWN deterministic content, read every file
+/// back through the reader (which assembles by the on-disk extent tree,
+/// independent of the writer's allocation state), and assert byte-exact equality.
+#[test]
+fn content_survives_fragmentation() {
+    // Files sized so several cross the block-group boundary (block 32768).
+    let specs: &[(&str, usize)] = &[
+        ("data/a.bin", 50 * 1024 * 1024),
+        ("data/b.bin", 50 * 1024 * 1024),
+        ("data/c.bin", 50 * 1024 * 1024),
+        ("small/x", 1234),
+        ("data/d.bin", 30 * 1024 * 1024),
+    ];
+
+    // Build the tar.
+    let mut tar = tar::Builder::new(Vec::new());
+    let mut expected: std::collections::HashMap<String, Vec<u8>> = std::collections::HashMap::new();
+    for (i, (path, size)) in specs.iter().enumerate() {
+        let data = content(i as u64, *size);
+        let mut h = tar::Header::new_gnu();
+        h.set_size(data.len() as u64);
+        h.set_mode(0o644);
+        h.set_mtime(0);
+        h.set_entry_type(tar::EntryType::Regular);
+        h.set_cksum();
+        tar.append_data(&mut h, path, &data[..]).unwrap();
+        expected.insert((*path).to_string(), data);
+    }
+    let tar_bytes = tar.into_inner().unwrap();
+
+    // Convert to ext4 (multi-group, group-aware allocator).
+    let opts = ConvertOptions {
+        convert_backslash: false,
+        writer_options: vec![
+            WriterOption::MaximumDiskSize(2 * 1024 * 1024 * 1024),
+            WriterOption::Uuid([0x22; 16]),
+        ],
+    };
+    let mut img: Vec<u8> = Vec::new();
+    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), std::io::Cursor::new(&mut img), &opts).unwrap();
+
+    // Read every file back via the reader and compare to known input.
+    let mut reader = ext4::reader::Reader::new(std::io::Cursor::new(&img)).unwrap();
+    let entries = reader.walk().unwrap();
+    let mut checked = 0;
+    for e in entries {
+        if (e.mode & 0xF000) != 0x8000 {
+            continue;
+        }
+        let path = e.path.trim_start_matches('/').to_string();
+        let Some(want) = expected.get(&path) else { continue };
+        let inode = reader.read_inode(e.inode_number).unwrap();
+        let got = reader.read_data(&inode).unwrap();
+        assert_eq!(got.len(), want.len(), "{path}: size mismatch");
+        assert!(got == *want, "{path}: CONTENT MISMATCH (fragmentation reordered bytes)");
+        checked += 1;
+    }
+    assert_eq!(checked, specs.len(), "did not read back all files");
+}
+
+/// Once the allocator is group-aware, the aligned multi-group build must ALSO be
+/// e2fsck-clean (alignment padding must be marked free, and aligned file starts
+/// must skip reserved blocks). Captures both the metadata-collision and the
+/// padding-bitmap issues found via e2fsck.
+#[test]
+fn fsck_multi_group_aligned_clean() {
+    let files = &[
+        ("data/a.bin", 40 * 1024 * 1024),
+        ("data/b.bin", 40 * 1024 * 1024),
+        ("data/c.bin", 40 * 1024 * 1024),
+        ("data/d.bin", 40 * 1024 * 1024),
+    ];
+    if let Err(report) = build_and_fsck(files, Some((128 * 1024, 16 * 1024))) {
+        panic!("aligned multi-group image is not e2fsck-clean:\n{report}");
+    }
+}
+
+/// Targeted sweep of the block-group boundary (block 32768 == 128 MiB). With a
+/// journal enabled (production config), the journal — and the inode table / dir
+/// blocks — are written at close() and can land straddling the Group 1 backup
+/// superblock. Sweep data sizes that push those close()-time structures across
+/// the boundary; every one must be e2fsck-clean. Regression for reserved-block
+/// handling of NON-file-data writes.
+#[test]
+fn fsck_journal_straddles_group_boundary() {
+    if find_e2fsck().is_none() {
+        eprintln!("SKIP: e2fsck not installed");
+        return;
+    }
+    // 120..136 MiB in 1 MiB steps: data ends near block 32768, so the trailing
+    // journal (1024 blocks = 4 MiB) and inode table cross the boundary.
+    for mib in 120..=136 {
+        let files = vec![(format!("data/blob_{mib}.bin"), mib * 1024 * 1024)];
+        for align in [None, Some((128 * 1024u32, 128 * 1024u32))] {
+            let img = build_image(&files, align);
+            if let Err(e) = e2fsck_clean(&img) {
+                panic!("{mib} MiB align={align:?} NOT e2fsck-clean:\n{e}");
+            }
+            if let Err(e) = content_matches(&img, &files) {
+                panic!("{mib} MiB align={align:?} content error: {e}");
+            }
+        }
+    }
+}
+
+/// Many files whose data ends near block 32768 make the flex_bg inode table
+/// large enough to straddle the Group 1 backup superblock. The inode table is
+/// contiguous and pointed at by per-group descriptors, so it must also dodge
+/// reserved blocks. Regression for inode-table/bitmap reserved-block handling.
+#[test]
+fn fsck_inode_table_straddles_boundary() {
+    if find_e2fsck().is_none() {
+        eprintln!("SKIP: e2fsck not installed");
+        return;
+    }
+    // N one-block files put data just below block 32768 while the inode table
+    // (N/16 blocks) crosses it. Sweep a few counts so one reliably straddles.
+    for n in [30_500usize, 31_000, 31_500] {
+        let files: Vec<(String, usize)> =
+            (0..n).map(|i| (format!("d{}/f{i}.bin", i % 256), 4096)).collect();
+        for align in [None, Some((128 * 1024u32, 128 * 1024u32))] {
+            let img = build_image(&files, align);
+            if let Err(e) = e2fsck_clean(&img) {
+                panic!("inode-table straddle n={n} align={align:?} NOT e2fsck-clean:\n{e}");
+            }
+        }
+    }
+}
+
+/// Property fuzzer: random multi-group filesets must be e2fsck-clean AND read
+/// back byte-exact — both unaligned and aligned. Seeds are deterministic so any
+/// failure reproduces verbatim (the panic prints the seed and fileset). This is
+/// the generalized gate: it sweeps the size/position space where the original
+/// data-on-backup-superblock and alignment-bitmap bugs lived. Crank coverage
+/// with `EXT4_FUZZ_SEEDS=64 cargo test -p ext4 --test fsck_validity fuzz`.
+#[test]
+fn fuzz_multigroup_validity_and_content() {
+    if find_e2fsck().is_none() {
+        eprintln!("SKIP: e2fsck not installed");
+        return;
+    }
+    let seeds: u64 = std::env::var("EXT4_FUZZ_SEEDS")
+        .ok()
+        .and_then(|s| s.parse().ok())
+        .unwrap_or(8);
+
+    for seed in 0..seeds {
+        let mut state = seed.wrapping_mul(0x9E3779B97F4A7C15) | 1;
+        let mut next = move || {
+            state ^= state << 13;
+            state ^= state >> 7;
+            state ^= state << 17;
+            state
+        };
+
+        // Random fileset with a deliberate mix: small files (below the align
+        // threshold), medium, and large (which straddle group boundaries).
+        let nfiles = 4 + (next() % 10) as usize;
+        let mut files: Vec<(String, usize)> = Vec::new();
+        let mut total: u64 = 0;
+        for k in 0..nfiles {
+            let size = match next() % 10 {
+                0..=3 => 1 + (next() % (64 * 1024)) as usize,
+                4..=6 => 4096 + (next() % (8 * 1024 * 1024)) as usize,
+                _ => 8 * 1024 * 1024 + (next() % (40 * 1024 * 1024)) as usize,
+            };
+            files.push((format!("d/s{seed}_f{k}.bin"), size));
+            total += size as u64;
+        }
+        // Guarantee at least one block-group boundary (128 MiB) is crossed so the
+        // reserved-block / fragmentation paths are always exercised.
+        if total < 160 * 1024 * 1024 {
+            let pad = (160 * 1024 * 1024 - total) as usize + 4 * 1024 * 1024;
+            files.push((format!("d/s{seed}_big.bin"), pad));
+        }
+
+        for align in [None, Some((128 * 1024u32, 16 * 1024u32))] {
+            let img = build_image(&files, align);
+            if let Err(e) = e2fsck_clean(&img) {
+                panic!("seed={seed} align={align:?} NOT e2fsck-clean:\n{e}\nfiles={files:?}");
+            }
+            if let Err(e) = content_matches(&img, &files) {
+                panic!("seed={seed} align={align:?} content error: {e}\nfiles={files:?}");
+            }
+        }
+    }
+}
+
+/// Privileged kernel-mount content check: loop-mount the image with the REAL
+/// Linux ext4 driver and verify every file's bytes against the known input —
+/// the strongest oracle, independent of both the writer and the in-crate reader.
+/// Opt-in (needs root / passwordless sudo + loop devices), so it skips by
+/// default and runs in a privileged/nightly job:
+///   EXT4_MOUNT_TEST=1 cargo test -p ext4 --test fsck_validity kernel_mount
+#[test]
+fn kernel_mount_content() {
+    if std::env::var("EXT4_MOUNT_TEST").is_err() {
+        eprintln!("SKIP: set EXT4_MOUNT_TEST=1 to run the privileged kernel-mount check");
+        return;
+    }
+    let sudo_ok = Command::new("sudo")
+        .args(["-n", "true"])
+        .status()
+        .map(|s| s.success())
+        .unwrap_or(false);
+    if !sudo_ok {
+        eprintln!("SKIP: passwordless sudo not available for mount");
+        return;
+    }
+
+    // Multi-group fileset (~180 MiB) with known deterministic content.
+    let files: Vec<(String, usize)> = vec![
+        ("data/a.bin".into(), 50 * 1024 * 1024),
+        ("data/b.bin".into(), 50 * 1024 * 1024),
+        ("small/x".into(), 4096),
+        ("data/c.bin".into(), 50 * 1024 * 1024),
+        ("data/d.bin".into(), 30 * 1024 * 1024),
+    ];
+
+    for align in [None, Some((128 * 1024u32, 128 * 1024u32))] {
+        let img = build_image(&files, align);
+        let mut tmp = tempfile::NamedTempFile::new().unwrap();
+        tmp.write_all(&img).unwrap();
+        tmp.flush().unwrap();
+        let mnt = tempfile::tempdir().unwrap();
+
+        let mounted = Command::new("sudo")
+            .args(["-n", "mount", "-o", "ro,loop"])
+            .arg(tmp.path())
+            .arg(mnt.path())
+            .status()
+            .expect("spawn mount");
+        assert!(mounted.success(), "kernel mount failed (align={align:?})");
+
+        // Read every file through the kernel and compare to known input. Collect
+        // the result first so we always unmount, even on mismatch.
+        let mut err: Option<String> = None;
+        for (i, (path, size)) in files.iter().enumerate() {
+            match std::fs::read(mnt.path().join(path)) {
+                Ok(got) if got.len() == *size && got == content(i as u64, *size) => {}
+                Ok(got) => {
+                    err = Some(format!("{path}: kernel read mismatch (len {} vs {size})", got.len()));
+                    break;
+                }
+                Err(e) => {
+                    err = Some(format!("{path}: kernel read failed: {e}"));
+                    break;
+                }
+            }
+        }
+
+        let _ = Command::new("sudo").args(["-n", "umount"]).arg(mnt.path()).status();
+        if let Some(e) = err {
+            panic!("kernel-mount content check failed (align={align:?}): {e}");
+        }
+    }
+}
diff --git a/glidefs/src/bin/dedup_probe.rs b/glidefs/src/bin/dedup_probe.rs
new file mode 100644
index 0000000..65e2348
--- /dev/null
+++ b/glidefs/src/bin/dedup_probe.rs
@@ -0,0 +1,533 @@
+#![allow(clippy::cast_possible_wrap, clippy::cast_sign_loss, clippy::cast_possible_truncation)]
+//! Empirical probe for the block-alignment dedup hypothesis.
+//!
+//! Claim 1 — the problem is real: fixed-grid block dedup over ext4 silently
+//! fails to dedup byte-identical content when that content sits at a different
+//! offset, because the dedup window is positional. We prove this two ways:
+//!   E1 (mechanism): take ONE real ext4 image and dedup it against a copy of
+//!      *itself* shifted by δ bytes. δ that is a whole-grid multiple → ~100%
+//!      dedup (the method works); a sub-grid δ → dedup collapses. Same bytes,
+//!      only the offset changes, so alignment is provably the whole cause.
+//!   E2 (bite): over real, related images, fixed-grid dedup falls far below what
+//!      gear-hash CDC (FastCDC-style chunking) recovers from the *same* ext4 bytes.
+//!
+//! Claim 2 — the fix is easy: rebuild the same images with the production
+//!   writer's new `AlignData` option (large file payloads start on the grid).
+//!   E3: fixed-grid dedup on the aligned images should jump to ~the FastCDC
+//!   ceiling, and the aligned images still round-trip through the ext4 reader.
+//!   The padding is holes (zeros), which the block store never stores.
+//!
+//! Everything runs through the real production code paths
+//! (`convert_oci_layers_to_ext4`, the real `block_map` hashing/compression), so
+//! the numbers reflect what GlideFS would actually store — not a re-model.
+//!
+//! Usage:
+//!   skopeo copy docker://python:3.12-slim-bookworm dir:/tmp/oci/py312
+//!   skopeo copy docker://python:3.13-slim-bookworm dir:/tmp/oci/py313
+//!   cargo run --release --bin dedup_probe -- /tmp/oci/py312 /tmp/oci/py313 ...
+
+use std::collections::HashMap;
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom};
+use std::path::{Path, PathBuf};
+
+use ext4::tar_convert::{ConvertOptions, convert_oci_layers_to_ext4};
+use ext4::writer::WriterOption;
+use glidefs::block::block_map::{Blake3Hash, blake3_128, lz4_compress};
+
+const GRID: usize = 128 * 1024; // production BLOCK_SIZE — the dedup window size
+// align files >= threshold; pack smaller ones. Override with DEDUP_ALIGN_THRESHOLD.
+fn align_threshold() -> u32 {
+    std::env::var("DEDUP_ALIGN_THRESHOLD").ok().and_then(|s| s.parse().ok()).unwrap_or(16 * 1024)
+}
+
+// ---- shared image plumbing (mirrors bless: deterministic, content-addressed) ----
+
+fn deterministic_uuid(seed: &str) -> [u8; 16] {
+    let mut uuid = *blake3_128(seed.as_bytes()).as_bytes();
+    uuid[6] = (uuid[6] & 0x0f) | 0x80;
+    uuid[8] = (uuid[8] & 0x3f) | 0x80;
+    uuid
+}
+
+fn decompress_layer(blob: &Path) -> std::io::Result<File> {
+    let mut f = File::open(blob)?;
+    let mut magic = [0u8; 4];
+    f.read_exact(&mut magic)?;
+    f.seek(SeekFrom::Start(0))?;
+    let mut out = tempfile::tempfile()?;
+    if magic[0] == 0x1f && magic[1] == 0x8b {
+        std::io::copy(&mut flate2::read::GzDecoder::new(f), &mut out)?;
+    } else if magic == [0x28, 0xb5, 0x2f, 0xfd] {
+        std::io::copy(&mut zstd::Decoder::new(f)?, &mut out)?;
+    } else {
+        std::io::copy(&mut f, &mut out)?;
+    }
+    out.seek(SeekFrom::Start(0))?;
+    Ok(out)
+}
+
+struct Image {
+    name: String,
+    seed: String,
+    layer_blobs: Vec<PathBuf>,
+}
+
+fn load_image(dir: &Path) -> Image {
+    let bytes = std::fs::read(dir.join("manifest.json")).expect("read manifest.json");
+    let seed = format!("blake3:{:032x}", u128::from_le_bytes(*blake3_128(&bytes).as_bytes()));
+    let v: serde_json::Value = serde_json::from_slice(&bytes).expect("parse manifest");
+    let layer_blobs = v["layers"]
+        .as_array()
+        .expect("layers[]")
+        .iter()
+        .map(|l| {
+            let d = l["digest"].as_str().expect("digest");
+            dir.join(d.strip_prefix("sha256:").unwrap_or(d))
+        })
+        .collect();
+    Image { name: dir.file_name().unwrap().to_string_lossy().into_owned(), seed, layer_blobs }
+}
+
+/// Build a real ext4 image from the layers. `align` toggles the new writer
+/// option; everything else is identical so the only variable is alignment.
+/// Fixed 4 GiB device for all builds so the block grids are comparable.
+fn build_ext4(img: &Image, align: bool) -> File {
+    let mut layers: Vec<File> =
+        img.layer_blobs.iter().map(|p| decompress_layer(p).expect("decompress")).collect();
+    let mut writer_options = vec![
+        WriterOption::MaximumDiskSize(4 * 1024 * 1024 * 1024),
+        WriterOption::Uuid(deterministic_uuid(&img.seed)),
+    ];
+    // A non-zero s_journal_uuid used to trip `e2fsck` ("external journal")
+    // before it inspected bitmaps; the writer now zeroes that field. The opt-out
+    // stays for journal-free fsck runs. Production keeps the journal.
+    if std::env::var("DEDUP_PROBE_NO_JOURNAL").is_err() {
+        writer_options.push(WriterOption::Journal(1024));
+    }
+    if align {
+        writer_options.push(WriterOption::AlignData { align: GRID as u32, min_size: align_threshold() });
+    }
+    let opts = ConvertOptions { convert_backslash: false, writer_options };
+    let out = tempfile::tempfile().expect("tempfile");
+    let mut fs = convert_oci_layers_to_ext4(&mut layers, out, &opts).expect("convert");
+    fs.seek(SeekFrom::Start(0)).unwrap();
+    fs
+}
+
+fn is_zero(d: &[u8]) -> bool {
+    let (p, c, s) = unsafe { d.align_to::<u64>() };
+    p.iter().all(|&b| b == 0) && c.iter().all(|&w| w == 0) && s.iter().all(|&b| b == 0)
+}
+
+fn read_all(mut f: File) -> Vec<u8> {
+    let mut v = Vec::new();
+    f.seek(SeekFrom::Start(0)).unwrap();
+    f.read_to_end(&mut v).unwrap();
+    v
+}
+
+// ---- chunking schemes; each fills a (content hash -> stored bytes) map ----
+
+/// Fixed grid (what production does today). Skips zero blocks, exactly like the
+/// real flush path — so alignment padding (holes) is free here.
+fn fixed_grid_units(img: &[u8], stored: &mut HashMap<Blake3Hash, usize>, raw: &mut u64, zero: &mut u64) {
+    for blk in img.chunks(GRID) {
+        if is_zero(blk) {
+            *zero += 1;
+            continue;
+        }
+        *raw += blk.len() as u64;
+        stored.entry(blake3_128(blk)).or_insert_with(|| lz4_compress(blk).len());
+    }
+}
+
+/// Gear table for gear-hash CDC (the alignment-immune ceiling). `cdc_units` cuts
+/// where the rolling hash hits the mask; min/max clamp chunk length.
+fn gear_table() -> [u64; 256] {
+    // splitmix64 from a fixed seed — deterministic across runs.
+    let mut s = 0x9E3779B97F4A7C15u64;
+    let mut t = [0u64; 256];
+    for e in &mut t {
+        s = s.wrapping_add(0x9E3779B97F4A7C15);
+        let mut z = s;
+        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
+        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
+        *e = z ^ (z >> 31);
+    }
+    t
+}
+
+fn cdc_units(img: &[u8], gear: &[u64; 256], stored: &mut HashMap<Blake3Hash, usize>, raw: &mut u64) {
+    const MIN: usize = 16 * 1024;
+    const MAX: usize = 1024 * 1024;
+    const MASK: u64 = (1 << 17) - 1; // ~128 KiB expected past MIN, close to the grid
+    let n = img.len();
+    let mut start = 0;
+    while start < n {
+        let mut hash = 0u64;
+        let mut j = start;
+        let cut = loop {
+            if j >= n {
+                break n;
+            }
+            if j - start >= MAX {
+                break j;
+            }
+            hash = (hash << 1).wrapping_add(gear[img[j] as usize]);
+            j += 1;
+            if j - start >= MIN && (hash & MASK) == 0 {
+                break j;
+            }
+        };
+        let chunk = &img[start..cut];
+        if !is_zero(chunk) {
+            *raw += chunk.len() as u64;
+            stored.entry(blake3_128(chunk)).or_insert_with(|| lz4_compress(chunk).len());
+        }
+        start = cut;
+    }
+}
+
+fn human(b: u64) -> String {
+    let f = b as f64;
+    if b >= 1 << 30 {
+        format!("{:.2} GiB", f / (1u64 << 30) as f64)
+    } else if b >= 1 << 20 {
+        format!("{:.1} MiB", f / (1u64 << 20) as f64)
+    } else {
+        format!("{:.1} KiB", f / (1u64 << 10) as f64)
+    }
+}
+
+/// Sum of stored (lz4) bytes over the union of unique content hashes.
+fn union_stored(maps: &[&HashMap<Blake3Hash, usize>]) -> u64 {
+    let mut u: HashMap<Blake3Hash, usize> = HashMap::new();
+    for m in maps {
+        for (h, &s) in *m {
+            u.entry(*h).or_insert(s);
+        }
+    }
+    u.values().map(|&v| v as u64).sum()
+}
+
+fn main() {
+    let dirs: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
+    if dirs.is_empty() {
+        eprintln!("usage: dedup_probe <skopeo-dir> [<skopeo-dir> ...]");
+        std::process::exit(2);
+    }
+    let images: Vec<Image> = dirs.iter().map(|d| load_image(d)).collect();
+    let gear = gear_table();
+
+    eprintln!("Building real ext4 images (unaligned + aligned) via production convert path...");
+    let mut unaligned: Vec<Vec<u8>> = Vec::new();
+    let mut aligned: Vec<Vec<u8>> = Vec::new();
+    for img in &images {
+        eprint!("  {} ...", img.name);
+        unaligned.push(read_all(build_ext4(img, false)));
+        aligned.push(read_all(build_ext4(img, true)));
+        eprintln!(" done");
+    }
+
+    // =====================================================================
+    // E1 — MECHANISM: same bytes, shifted. Isolates alignment, zero confound.
+    // =====================================================================
+    println!("\n================ E1: self-shift control ({}) ================", images[0].name);
+    println!("Dedup of the image against a copy of ITSELF shifted by δ bytes.");
+    println!("(fixed {GRID}-byte grid; shared = blocks whose content hash matches)");
+    let base = &unaligned[0];
+    let mut base_blocks: HashMap<Blake3Hash, ()> = HashMap::new();
+    let mut base_n = 0u64;
+    for blk in base.chunks(GRID) {
+        if !is_zero(blk) {
+            base_blocks.insert(blake3_128(blk), ());
+            base_n += 1;
+        }
+    }
+    for &delta in &[0usize, 4096, 8192, 65536, GRID, 2 * GRID] {
+        let shifted = &base[delta.min(base.len())..];
+        let mut shared = 0u64;
+        let mut total = 0u64;
+        for blk in shifted.chunks(GRID) {
+            if is_zero(blk) {
+                continue;
+            }
+            total += 1;
+            if base_blocks.contains_key(&blake3_128(blk)) {
+                shared += 1;
+            }
+        }
+        let pct = if total > 0 { 100.0 * shared as f64 / total as f64 } else { 0.0 };
+        let tag = if delta % GRID == 0 { "  (whole-grid multiple → control)" } else { "  (sub-grid shift)" };
+        println!("  δ = {:>7} B : {:>5.1}% blocks dedup ({shared}/{total}){tag}", delta, pct);
+    }
+    println!("  base non-zero blocks: {base_n}");
+
+    // =====================================================================
+    // E2 + E3 — same real ext4 bytes under 3 schemes.
+    // =====================================================================
+    let mut fixed_maps = Vec::new();
+    let mut aligned_maps = Vec::new();
+    let mut cdc_maps = Vec::new();
+    let (mut fixed_raw, mut aligned_raw, mut cdc_raw) = (0u64, 0u64, 0u64);
+    let (mut unaligned_zero, mut aligned_zero) = (0u64, 0u64);
+
+    println!("\n================ E2/E3: per-image + cross-image dedup ================");
+    println!("Same production ext4 bytes. Three schemes:");
+    println!("  TODAY   = fixed 128 KiB grid on the unaligned image (what ships now)");
+    println!("  ALIGNED = fixed 128 KiB grid on the aligned image (the proposed fix)");
+    println!("  CDC     = content-defined chunking on the unaligned image (ceiling)\n");
+    for (i, img) in images.iter().enumerate() {
+        let mut fm = HashMap::new();
+        let mut am = HashMap::new();
+        let mut cm = HashMap::new();
+        let (mut r1, mut r2, mut r3) = (0u64, 0u64, 0u64);
+        let (mut z1, mut z2) = (0u64, 0u64);
+        fixed_grid_units(&unaligned[i], &mut fm, &mut r1, &mut z1);
+        fixed_grid_units(&aligned[i], &mut am, &mut r2, &mut z2);
+        cdc_units(&unaligned[i], &gear, &mut cm, &mut r3);
+        println!(
+            "  {:14} stored: TODAY {:>10} | ALIGNED {:>10} | CDC {:>10}",
+            img.name,
+            human(fm.values().map(|&v| v as u64).sum()),
+            human(am.values().map(|&v| v as u64).sum()),
+            human(cm.values().map(|&v| v as u64).sum()),
+        );
+        fixed_raw += r1;
+        aligned_raw += r2;
+        cdc_raw += r3;
+        unaligned_zero += z1;
+        aligned_zero += z2;
+        fixed_maps.push(fm);
+        aligned_maps.push(am);
+        cdc_maps.push(cm);
+    }
+
+    let sum_indiv = |maps: &[HashMap<Blake3Hash, usize>]| -> u64 {
+        maps.iter().map(|m| m.values().map(|&v| v as u64).sum::<u64>()).sum()
+    };
+    let fixed_indiv = sum_indiv(&fixed_maps);
+    let aligned_indiv = sum_indiv(&aligned_maps);
+    let cdc_indiv = sum_indiv(&cdc_maps);
+    let fixed_union = union_stored(&fixed_maps.iter().collect::<Vec<_>>());
+    let aligned_union = union_stored(&aligned_maps.iter().collect::<Vec<_>>());
+    let cdc_union = union_stored(&cdc_maps.iter().collect::<Vec<_>>());
+
+    let row = |label: &str, raw: u64, indiv: u64, union: u64| {
+        let cross = indiv.saturating_sub(union);
+        let cross_pct = if indiv > 0 { 100.0 * cross as f64 / indiv as f64 } else { 0.0 };
+        println!(
+            "  {label:8}: store-each {:>10} | store-union {:>10} | cross-image dedup {:>9} ({:.1}%)",
+            human(indiv),
+            human(union),
+            human(cross),
+            cross_pct,
+        );
+        let _ = raw;
+    };
+    println!("\n  --- storing ALL {} images (lz4-compressed, zeros dropped) ---", images.len());
+    row("TODAY", fixed_raw, fixed_indiv, fixed_union);
+    row("ALIGNED", aligned_raw, aligned_indiv, aligned_union);
+    row("CDC", cdc_raw, cdc_indiv, cdc_union);
+
+    println!("\n  store-union is total S3/cache bytes for the whole set.");
+    if fixed_union > 0 {
+        let saved = fixed_union.saturating_sub(aligned_union);
+        println!(
+            "  ALIGNED vs TODAY: {} smaller ({:.1}%) — this is the recovered dedup.",
+            human(saved),
+            100.0 * saved as f64 / fixed_union as f64
+        );
+        let ceil = fixed_union.saturating_sub(cdc_union);
+        let captured = if ceil > 0 { 100.0 * saved as f64 / ceil as f64 } else { 0.0 };
+        println!("  CDC ceiling would save {} — ALIGNED captures {:.0}% of it.", human(ceil), captured);
+    }
+
+    // =====================================================================
+    // Padding cost + round-trip validity of the aligned images.
+    // =====================================================================
+    println!("\n================ Cost & validity of the fix ================");
+    println!(
+        "  zero blocks (holes): unaligned {} → aligned {} (+{} blocks of padding)",
+        unaligned_zero,
+        aligned_zero,
+        aligned_zero.saturating_sub(unaligned_zero)
+    );
+    println!("  padding is zeros → NOT stored: TODAY/ALIGNED store-union already exclude it.");
+    for (i, img) in images.iter().enumerate() {
+        match verify_roundtrip(&aligned[i]) {
+            Ok((files, bytes)) => println!(
+                "  {:14} aligned image round-trips OK: {} files, {} of data read back",
+                img.name,
+                files,
+                human(bytes)
+            ),
+            Err(e) => println!("  {:14} ROUND-TRIP FAILED: {e}", img.name),
+        }
+    }
+
+    // =====================================================================
+    // V1 — DETERMINISM. Content-addressing requires byte-identical rebuilds.
+    // =====================================================================
+    println!("\n================ V1: determinism ================");
+    let a1 = read_all(build_ext4(&images[0], true));
+    let a2 = read_all(build_ext4(&images[0], true));
+    println!(
+        "  {} aligned built twice: {} (and differs from unaligned: {})",
+        images[0].name,
+        if a1 == a2 { "IDENTICAL ✓" } else { "DIFFERENT ✗ — BUG" },
+        if a1 != unaligned[0] { "yes ✓" } else { "no ✗" },
+    );
+
+    // =====================================================================
+    // V2 — INDEPENDENT GROUND TRUTH. Hash whole regular files (no chunking at
+    // all) read back from the ALIGNED images. This both proves the files
+    // survived alignment intact and measures the true shared-content fraction
+    // with a method that shares NO code with the block-grid path.
+    // =====================================================================
+    println!("\n================ V2: file-level ground truth (chunking-agnostic) ================");
+    let file_maps: Vec<HashMap<Blake3Hash, u64>> =
+        aligned.iter().map(|img| extract_file_contents(img)).collect();
+    let file_indiv: u64 = file_maps.iter().map(|m| m.values().sum::<u64>()).sum();
+    let mut file_union: HashMap<Blake3Hash, u64> = HashMap::new();
+    for m in &file_maps {
+        for (h, &s) in m {
+            file_union.entry(*h).or_insert(s);
+        }
+    }
+    let file_union_bytes: u64 = file_union.values().sum();
+    let file_cross = file_indiv.saturating_sub(file_union_bytes);
+    let file_pct = if file_indiv > 0 { 100.0 * file_cross as f64 / file_indiv as f64 } else { 0.0 };
+    println!("  whole-file content hashing (raw bytes, identical files counted once):");
+    println!(
+        "    store-each {} | store-union {} | cross-image identical-file content {} ({:.1}%)",
+        human(file_indiv),
+        human(file_union_bytes),
+        human(file_cross),
+        file_pct
+    );
+    println!("  → THIS is how much byte-identical content genuinely exists across the set.");
+    println!("    Compare with the cross-image dedup % above: if ALIGNED is close to this while");
+    println!("    TODAY falls well short, the win is real and the CDC ceiling was not mis-tuned.");
+
+    // =====================================================================
+    // V3 — PER-FILE SPOTLIGHT. Take one large file that is byte-identical in
+    // two images and show its blocks dedup under ALIGNED but not under TODAY.
+    // =====================================================================
+    if images.len() >= 2 {
+        println!("\n================ V3: per-file spotlight (img[0] vs img[1]) ================");
+        spotlight(&images, &aligned, &unaligned, &aligned_maps, &fixed_maps);
+    }
+
+    // Dump images so they can be checked with the real e2fsck (external proof
+    // of filesystem validity, independent of our own reader).
+    let outdir = std::path::Path::new("/tmp/dedup_verify");
+    std::fs::create_dir_all(outdir).ok();
+    for (i, img) in images.iter().enumerate() {
+        std::fs::write(outdir.join(format!("{}-unaligned.img", img.name)), &unaligned[i]).ok();
+        std::fs::write(outdir.join(format!("{}-aligned.img", img.name)), &aligned[i]).ok();
+    }
+    println!("\n  images dumped to {} for external `e2fsck -fn` validation.", outdir.display());
+}
+
+/// Read every regular file from an ext4 image and map its whole-content hash to
+/// its size. Uses only the reader — no block-grid code — so it is an
+/// independent oracle for "how much identical file content exists".
+fn extract_file_contents(img: &[u8]) -> HashMap<Blake3Hash, u64> {
+    let mut out: HashMap<Blake3Hash, u64> = HashMap::new();
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor).expect("reader");
+    let entries = reader.walk().expect("walk");
+    for e in entries {
+        if (e.mode & 0xF000) == 0x8000 && e.size > 0 {
+            let inode = reader.read_inode(e.inode_number).expect("inode");
+            let data = reader.read_data(&inode).expect("data");
+            // Key on the content hash, value is the size; identical files within the image count once.
+            out.entry(blake3_128(&data)).or_insert(data.len() as u64);
+        }
+    }
+    out
+}
+
+fn read_file_by_path(img: &[u8], path: &str) -> Option<Vec<u8>> {
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor).ok()?;
+    let entries = reader.walk().ok()?;
+    for e in entries {
+        if e.path == path && (e.mode & 0xF000) == 0x8000 {
+            let inode = reader.read_inode(e.inode_number).ok()?;
+            return reader.read_data(&inode).ok();
+        }
+    }
+    None
+}
+
+fn spotlight(
+    images: &[Image],
+    aligned: &[Vec<u8>],
+    unaligned: &[Vec<u8>],
+    aligned_maps: &[HashMap<Blake3Hash, usize>],
+    fixed_maps: &[HashMap<Blake3Hash, usize>],
+) {
+    // Find a path present in BOTH images with byte-identical content and > 256 KiB.
+    let cur = std::io::Cursor::new(&aligned[0]);
+    let mut r0 = ext4::reader::Reader::new(cur).expect("reader");
+    let walk0 = r0.walk().expect("walk");
+    let mut candidates: Vec<(String, u64)> = walk0
+        .iter()
+        .filter(|e| (e.mode & 0xF000) == 0x8000 && e.size > 256 * 1024)
+        .map(|e| (e.path.clone(), e.size))
+        .collect();
+    candidates.sort_by_key(|(_, s)| std::cmp::Reverse(*s));
+
+    for (path, size) in candidates {
+        let c0 = read_file_by_path(&aligned[0], &path);
+        let c1 = read_file_by_path(&aligned[1], &path);
+        let (Some(c0), Some(c1)) = (c0, c1) else { continue };
+        if c0 != c1 {
+            continue; // not byte-identical across the two images
+        }
+        // Hash this file's content in full 128 KiB chunks. Under ALIGNED the file
+        // starts on the grid, so its full sub-blocks ARE these chunks (a short tail cannot match).
+        let chunk_hashes: Vec<Blake3Hash> = c0.chunks_exact(GRID).filter(|b| !is_zero(b)).map(blake3_128).collect();
+        let n = chunk_hashes.len();
+        // How many of this file's content-chunks appear as real grid blocks in
+        // the OTHER image (img[1]) under each scheme?
+        let in_aligned = chunk_hashes.iter().filter(|h| aligned_maps[1].contains_key(*h)).count();
+        let in_today = chunk_hashes.iter().filter(|h| fixed_maps[1].contains_key(*h)).count();
+        println!("  file: {path}  ({}, byte-identical in {} & {})", human(size), images[0].name, images[1].name);
+        println!("    its {n} content-chunks found as grid blocks in {}:", images[1].name);
+        println!(
+            "      under ALIGNED: {in_aligned}/{n} ({:.0}%) dedup",
+            100.0 * in_aligned as f64 / n.max(1) as f64
+        );
+        println!(
+            "      under TODAY:   {in_today}/{n} ({:.0}%) dedup",
+            100.0 * in_today as f64 / n.max(1) as f64
+        );
+        let _ = unaligned;
+        return;
+    }
+    println!("  (no >256 KiB byte-identical file shared across the two images found)");
+}
+
+/// Read the aligned ext4 back through the production reader to prove the image
+/// is a valid filesystem (not just bytes that happen to dedup well).
+fn verify_roundtrip(img: &[u8]) -> std::io::Result<(usize, u64)> {
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor)?;
+    let entries = reader.walk()?;
+    let mut files = 0usize;
+    let mut bytes = 0u64;
+    for e in entries {
+        if (e.mode & 0xF000) == 0x8000 {
+            // regular file
+            let inode = reader.read_inode(e.inode_number)?;
+            let data = reader.read_data(&inode)?;
+            bytes += data.len() as u64;
+            files += 1;
+        }
+    }
+    Ok((files, bytes))
+}
diff --git a/glidefs/src/block/router.rs b/glidefs/src/block/router.rs
index 259b1fb..b04615b 100644
--- a/glidefs/src/block/router.rs
+++ b/glidefs/src/block/router.rs
@@ -1819,9 +1819,11 @@ impl ExportRouter {
             .await
             .map_err(|e| RouterError::OciPull(format!("failed to resolve image: {e}")))?;
 
-        // Estimate device size: compressed × 3, next power-of-2, min 64 MiB.
+        // Estimate device size: compressed × 4, next power-of-2, min 64 MiB.
+        // The ×4 (vs ×3) leaves headroom for block-grid alignment padding, which
+        // inflates the logical ext4 with holes/zeros the block store drops.
         let total_compressed: u64 = resolved.layers.iter().map(|l| l.size as u64).sum();
-        let estimated = (total_compressed * 3).max(64 * 1024 * 1024);
+        let estimated = (total_compressed * 4).max(64 * 1024 * 1024);
         let device_size = estimated.next_power_of_two();
 
         info!(
@@ -1885,6 +1887,9 @@ impl ExportRouter {
                 WriterOption::MaximumDiskSize(device_size as i64),
                 WriterOption::Uuid(uuid),
                 WriterOption::Journal(1024), // 4 MiB journal
+                // Align large file payloads to the dedup block grid (the volume
+                // block size) so the same file dedups across blessed images.
+                WriterOption::AlignData { align: block_size_u32, min_size: block_size_u32 },
             ],
         };
 
diff --git a/glidefs/src/cli/bless.rs b/glidefs/src/cli/bless.rs
index 66961f4..2ba0dea 100644
--- a/glidefs/src/cli/bless.rs
+++ b/glidefs/src/cli/bless.rs
@@ -171,10 +171,13 @@ pub async fn run_bless_oci(
         .await
         .map_err(|e| anyhow::anyhow!("failed to resolve image: {e}"))?;
 
-    // Estimate device size: sum compressed layer sizes × 3 (decompression + ext4 overhead).
-    // Round up to next power-of-2 MiB boundary. Minimum 64 MiB.
+    // Estimate device size: sum compressed layer sizes × 4 (decompression + ext4
+    // overhead + block-grid alignment headroom). Round up to next power-of-2.
+    // Minimum 64 MiB. The ×4 (vs ×3) covers the logical inflation from aligning
+    // large files to the dedup block grid; that padding is holes/zeros which the
+    // block store drops, so it costs address space, not stored bytes.
     let total_compressed: u64 = resolved.layers.iter().map(|l| l.size as u64).sum();
-    let estimated = (total_compressed * 3).max(64 * 1024 * 1024);
+    let estimated = (total_compressed * 4).max(64 * 1024 * 1024);
     let device_size = estimated.next_power_of_two();
 
     info!(
@@ -251,6 +254,12 @@ pub async fn run_bless_oci(
             WriterOption::MaximumDiskSize(device_size as i64),
             WriterOption::Uuid(uuid),
             WriterOption::Journal(1024), // 4 MiB journal
+            // Align large file payloads to the dedup block grid (the volume's
+            // 128 KiB block size) so the same file produces the same blocks
+            // across images and the host's content-addressed cache + S3 packs
+            // dedup it. Only files >= one full block are aligned, bounding the
+            // padding. See dedup_probe / fsck_validity for the validation.
+            WriterOption::AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE },
         ],
     };
 
diff --git a/glidefs/src/oci/layer_store.rs b/glidefs/src/oci/layer_store.rs
index 1b6edfa..9e9afaf 100644
--- a/glidefs/src/oci/layer_store.rs
+++ b/glidefs/src/oci/layer_store.rs
@@ -28,7 +28,7 @@ use object_store::{ObjectStore, PutPayload};
 use serde::{Deserialize, Serialize};
 
 use crate::block::content_store::ContentStore;
-use crate::oci::ext4_store::{deterministic_uuid, store_ext4_stream};
+use crate::oci::ext4_store::{deterministic_uuid, store_ext4_stream, BLOCK_SIZE};
 
 /// Manifest name for a stored layer (its sole VolumeManifest).
 const LAYER_MANIFEST_NAME: &str = "layer";
@@ -92,7 +92,10 @@ impl ImageDescriptor {
 /// Zero blocks past the real content are skipped at store time, so oversizing
 /// costs nothing in storage.
 fn layer_device_size(tar_len: u64) -> u64 {
-    (tar_len.saturating_mul(2).max(64 * 1024 * 1024)).next_power_of_two()
+    // ×3 (not ×2): extra headroom for block-grid alignment padding, which
+    // inflates the logical ext4. The padding is holes/zeros dropped by the
+    // block store, so it costs address space, not stored bytes.
+    (tar_len.saturating_mul(3).max(64 * 1024 * 1024)).next_power_of_two()
 }
 
 /// Ensure a single OCI layer is stored as a content-addressed ext4 artifact.
@@ -137,6 +140,9 @@ pub async fn ensure_layer_stored<R: Read + Seek>(
             WriterOption::MaximumDiskSize(device_size as i64),
             WriterOption::Uuid(deterministic_uuid(digest)),
             WriterOption::Journal(1024), // 4 MiB journal — same as bless
+            // Align large file payloads to the dedup block grid (the volume's
+            // 128 KiB block size) so the same file dedups across layers/images.
+            WriterOption::AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE },
         ],
     };
     let mut ext4_tmp = tempfile::tempfile().context("layer ext4 tempfile")?;