diff --git a/ext4/ARCHITECTURE.md b/ext4/ARCHITECTURE.md index 9796fed..1f46392 100644 --- a/ext4/ARCHITECTURE.md +++ b/ext4/ARCHITECTURE.md @@ -143,31 +143,33 @@ Comparison ignores atime, ctime (volatile), inode_number (internal), and links_c ## On-Disk Layout +The writer uses `flex_bg`, so per-group metadata is **not** interleaved per +group — all inode tables and bitmaps are clustered at the end of the image, +after the data region: + ``` Byte 0 Block 0 (4096 bytes) ├─ [0..1024) zeros (boot sector area) -├─ [1024..2048) SuperBlock (1024 bytes) +├─ [1024..2048) SuperBlock (primary, 1024 bytes) └─ [2048..4096) zeros -Block 1 Group Descriptor Table -├─ 128 × GroupDescriptor (32 bytes each = 4096 bytes) -└─ (repeated if >128 groups) - -Block gd_end .. gd_end+N Inode Table (per group) -├─ 16 inodes per block (256 bytes each) -└─ N blocks = ceil(inodes_per_group / 16) +Block 1 .. 1+gd_blocks Group Descriptor Table (primary) +└─ GroupDescriptor × groups (32 bytes each) -Block inode_end .. data_start Block Bitmap + Inode Bitmap -├─ block_bitmap: 1 block (1 bit per block in group) -└─ inode_bitmap: 1 block (1 bit per inode in group) +Data region (streamed forward, may contain reserved holes): +├─ lost+found, file data, directory blocks, xattr blocks, extent index blocks +├─ Journal (optional) contiguous run, placed via reserve_contiguous +└─ ⟂ reserved holes at sparse_super group starts (block 32768, 98304, …): + backup superblock + GDT copy — never claimed by data -Block data_start .. end Data Blocks -├─ Directory blocks (packed dir entries) -├─ File data blocks (streamed content) -├─ xattr blocks (for large xattr sets) -└─ Extent index blocks (for very large files) +Trailing metadata (flex_bg, all groups clustered, reserve_contiguous-placed): +├─ Inode Table groups × inodes_per_group × 256 bytes +└─ Block + Inode Bitmaps 2 blocks per group ``` +The superblock and primary GDT are written last (seek back to block 0/1) at +`close()`, after the layout is known. 
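The reserved-hole positions in the layout above follow from simple arithmetic. The sketch below is an illustration only, not the writer's code: it assumes 4 KiB blocks with 32,768 blocks per group (consistent with block 32768 being the start of group 1), and the names `is_backup_group` and `reserved_hole_starts` are hypothetical helpers introduced for this example.

```rust
/// Assumed group size: 32,768 blocks of 4 KiB each (128 MiB per group).
const BLOCKS_PER_GROUP: u32 = 32_768;

/// True if group `g` is 0, 1, or a power of 3, 5, or 7, i.e. one of the
/// groups that carries a backup superblock under sparse_super.
fn is_backup_group(g: u32) -> bool {
    if g <= 1 {
        return true;
    }
    [3u32, 5, 7].iter().any(|&base| {
        let mut p = base;
        while p < g {
            match p.checked_mul(base) {
                Some(n) => p = n,
                // Overflow means no power of `base` can equal `g`.
                None => return false,
            }
        }
        p == g
    })
}

/// First block of every interior (group >= 1) reserved hole in an image
/// that spans `groups` block groups.
fn reserved_hole_starts(groups: u32) -> Vec<u32> {
    (1..groups)
        .filter(|&g| is_backup_group(g))
        .map(|g| g * BLOCKS_PER_GROUP)
        .collect()
}
```

For a ten-group image, `reserved_hole_starts(10)` lists the holes at the starts of groups 1, 3, 5, 7, and 9; each hole then extends for `1 + gd_blocks` blocks.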
+ ## Inode Number Allocation ``` @@ -197,22 +199,58 @@ The 60-byte `inode.data` area holds: Each extent covers at most `MAX_BLOCKS_PER_EXTENT = 0x8000` (32,768) blocks = 128 MiB. Adjacent same-physical-run blocks are merged into one extent. -### Extent Building (writer.rs:write_extent) +### Extent Building (writer.rs:write_file_data, physical_runs, write_extents) + +File data is streamed forward, but it is **not** always one contiguous run: the +allocator skips blocks reserved for block-group metadata (see below), so a file +spanning such a block is split into multiple extents. ``` -on each data write: - extend current_extent if blocks are contiguous - else: - flush current_extent to inode.data (if fits in 4 entries) - or to pending extent_index_block (depth 2) - start new extent - -on finish_inode: - flush last extent - if depth==2: write extent_index_block to disk - seek to inode slot, write inode +write(&[u8]) → write_file_data: + on first byte: skip any reserved block at pos, record data_start_block + stream data, jumping over reserved regions (write up to the next reserved + block, seek past it, continue) — pos advances over the skipped blocks + +finish_inode → write_extents: + runs = physical_runs(data_start_block, end_block) // non-reserved spans + leaves = split each run into ≤ MAX_BLOCKS_PER_EXTENT extents (logical offset + accumulates over data blocks only, excluding reserved gaps) + emit: + ≤4 leaves → inline in inode.data (depth 0) + ≤4×EXTENTS_PER_BLOCK → one index level (depth 1), leaf blocks skip reserved + else → error (file too large) ``` +A file that crosses no reserved block yields exactly one run — identical output +to a plain contiguous writer. `block_count` counts data + extent-tree blocks +only, never the reserved gaps. 
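As a worked illustration of the run and leaf arithmetic described above, here is a minimal standalone sketch. It is simplified on purpose: the hypothetical `is_reserved` treats only the start of group 1 as reserved and takes the reservation length as a parameter, whereas the real writer consults `has_super_backup` and `gd_blocks`. The function names mirror the ones in the text, but the bodies are assumptions written for this example.

```rust
const BLOCKS_PER_GROUP: u32 = 32_768;
const MAX_BLOCKS_PER_EXTENT: u32 = 0x8000;

/// Simplified stand-in for the reservation check: only the first `reserve`
/// blocks of group 1 are treated as reserved.
fn is_reserved(b: u32, reserve: u32) -> bool {
    b >= BLOCKS_PER_GROUP && b < BLOCKS_PER_GROUP + reserve
}

/// Contiguous non-reserved runs covering [start, end), as (first_block, len).
fn physical_runs(start: u32, end: u32, reserve: u32) -> Vec<(u32, u32)> {
    let mut runs = Vec::new();
    let mut b = start;
    while b < end {
        if is_reserved(b, reserve) {
            b += 1;
            continue;
        }
        let run_start = b;
        while b < end && !is_reserved(b, reserve) {
            b += 1;
        }
        runs.push((run_start, b - run_start));
    }
    runs
}

/// Split runs into (logical, physical, len) leaves of at most
/// MAX_BLOCKS_PER_EXTENT blocks. The logical offset counts data blocks only,
/// so reserved gaps never appear in it.
fn leaves(runs: &[(u32, u32)]) -> Vec<(u32, u32, u32)> {
    let mut out = Vec::new();
    let mut logical = 0u32;
    for &(phys, len) in runs {
        let mut o = 0;
        while o < len {
            let l = (len - o).min(MAX_BLOCKS_PER_EXTENT);
            out.push((logical, phys + o, l));
            logical += l;
            o += l;
        }
    }
    out
}
```

With a two-block reservation, a file occupying physical blocks 32760 up to (but excluding) 32780 splits into the runs `(32760, 8)` and `(32770, 10)`, which become two inline leaves whose logical offsets are 0 and 8. A file that never touches the reserved region produces a single run.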
+ +### Block-Group Metadata Reservation (writer.rs:is_reserved_block, has_super_backup) + +With the `sparse_super` feature, block groups 0, 1, and every power of 3, 5, and +7 hold a **backup superblock + group-descriptor copy** in their first +`1 + gd_blocks` blocks (e.g. block 32768 for group 1, 98304 for group 3). The +kernel reserves these regardless of whether valid backup content is written, so +**file or metadata data must never claim them** — an overlapping extent is a +multiply-claimed block that `e2fsck` and the kernel reject (the file reads back +as "Structure needs cleaning"). Group 0's reservation is skipped at `init()`. + +The allocator keeps everything off these blocks: + +- **File data** fragments around them (`write_file_data` / `physical_runs`). +- **Contiguous close()-time structures** — the journal inode, the flex_bg inode + table, and the bitmaps — can't fragment (they're single extents or located by + group-descriptor offsets), so `reserve_contiguous(n)` instead places the whole + run *past* any reserved region it would straddle. The skipped lead-in blocks + become free holes. +- **Padding holes** (from alignment and from `reserve_contiguous`) are recorded + in `free_holes` and cleared from the otherwise-dense block bitmap in `close()`. + +This was a real, latent corruption bug for any image larger than one block group +(>128 MiB): the linear allocator wrote straight through block 32768. It was +hidden because the in-crate reader is lenient; the real `e2fsck` and a kernel +loop-mount catch it (see Testing). + ## xattr Storage Strategy Extended attributes use a two-tier storage model: @@ -290,11 +328,12 @@ Directory entries reference inode numbers. Hard links and `link()` calls can ass GlideFS content-addresses blocks with BLAKE3. If two nodes generate the same OCI layer, they must produce byte-identical ext4 images or they'll compute different hashes and store duplicate data. 
Determinism requires: - No uninitialized bytes (zero all padding) -- No random UUIDs (UUID is all-zeros) +- A deterministic UUID — `WriterOption::Uuid` set to a content-derived value (e.g. the manifest digest), or all-zeros if unset. Never random. - No timestamps (`mtime=0`, `wtime=0` in superblock) - Sorted directory entries (by inode number, then name) - Sorted xattr entries - `BTreeMap` for all child/xattr collections +- A content-addressed layout: file→block placement (including reserved-block skips and any alignment padding) is a pure function of the input, so the same tar always lands the same bytes in the same blocks. ### Why port from hcsshim instead of using an existing crate? @@ -306,13 +345,13 @@ hcsshim's `compactext4` is the reference implementation for OCI-compatible ext4 The port preserves the same on-disk layout, making images identical to those produced by the Go implementation. -### Why no journal? +### Why is the journal optional? -Container layer images are read-only once mounted by the overlay filesystem. A journal adds ~128 MiB of overhead for no benefit. The `HAS_JOURNAL` compat feature is intentionally absent. +The journal is off by default in the writer: a container layer mounted read-only through overlay never needs one. But a blessed base image that backs a *mutable* volume does, so bless enables `WriterOption::Journal(1024)` (4 MiB). When enabled, the journal is inode 8 with the `HAS_JOURNAL` feature; `s_journal_uuid` stays zero because it identifies an *external* journal device (a non-zero value makes the kernel/e2fsck abort searching for one). When disabled, `HAS_JOURNAL` is absent. ### Why no checksums? -`METADATA_CSUM` and `GDT_CSUM` are not enabled. Checksums require the UUID as a seed, but a zero UUID makes all checksums trivially zero — enabling the feature would silently produce invalid checksums. Since images are content-addressed externally, internal ext4 checksums are redundant. +`METADATA_CSUM` and `GDT_CSUM` are not enabled. 
Metadata checksums are seeded by the UUID and would have to be recomputed for every structure; since images are content-addressed externally (BLAKE3 over the bytes) and validated against the real `e2fsck`/kernel in tests, internal ext4 checksums are redundant. ## Package Structure @@ -320,7 +359,7 @@ Container layer images are read-only once mounted by the overlay filesystem. A j |------|---------| | `mod.rs` | Re-exports public API: `Writer`, `Reader`, `File`, `WriterOption`, `convert_tar_to_ext4` | | `format.rs` | On-disk binary structures: `SuperBlock`, `GroupDescriptor`, `ParsedInode`, `ExtentHeader/Leaf/Index`, `DirEntry`, xattr helpers. Both serialization (`write_to`) and deserialization (`read_from`, `get_xattrs`) for shared on-disk types. | -| `writer.rs` | Core filesystem builder. Manages inode lifecycle, block allocation, extent tree construction, xattr packing, directory serialization, superblock finalization. | +| `writer.rs` | Core filesystem builder. Manages inode lifecycle, reserved-block-aware allocation (data fragments around backup-superblock blocks; contiguous structures use `reserve_contiguous`), extent tree construction, optional alignment + free-hole accounting, xattr packing, directory serialization, journal, superblock finalization. | | `reader.rs` | ext4 image parser. Reads superblock, group descriptors, inode table, extent trees, directory entries, and xattrs. Exports via `walk()` and `to_tar()`. | | `tar_convert.rs` | tar→ext4 bridge. Maps tar entry types to writer operations, handles OCI whiteouts and PAX xattrs. | | `diff.rs` | Incremental export: diffs two ext4 snapshots and produces an OCI-compatible delta tar layer with whiteout markers for deletions. | @@ -332,6 +371,9 @@ Container layer images are read-only once mounted by the overlay filesystem. A j |--------|---------|--------| | `WriterOption::InlineData` | disabled | Store files ≤136 bytes inside the inode instead of allocating data blocks. 
Reduces image size for layers with many small files (e.g., config files, scripts). | | `WriterOption::MaximumDiskSize(n)` | 16 GiB | Maximum filesystem size. Controls the number of block groups pre-allocated in the group descriptor table. Range: 0..16 TiB. | +| `WriterOption::Uuid([u8;16])` | all-zeros | Filesystem UUID, written to the superblock and used as the directory-hash seed. Callers that content-address the image pass a deterministic (e.g. manifest-derived) UUID so the same input yields the same bytes. | +| `WriterOption::Journal(blocks)` | none | Create an internal jbd2 journal of `blocks` 4 KiB blocks (e.g. 1024 = 4 MiB) as inode 8, set the `HAS_JOURNAL` feature. `s_journal_uuid` is left **zero** (it names an *external* journal device; a non-zero value makes the kernel/e2fsck abort looking for one). bless enables this. | +| `WriterOption::AlignData { align, min_size }` | disabled | Start the data of every regular file ≥ `min_size` on an `align`-byte boundary, padding the gap with a (free) hole. Aligning large payloads to the downstream dedup block grid makes the same file produce the same blocks regardless of upstream churn, so content-addressed dedup survives. Metadata-aware: composes with reserved-block skipping. | ## Limits @@ -399,6 +441,20 @@ Three test tiers in `tests.rs`: Run without Docker: `cargo test --features test-utils --lib` and `cargo test --features test-utils --test integration` +**Filesystem-validity harness** (`tests/fsck_validity.rs`) — gates correctness on kernel-grade oracles, not the in-crate reader (which is lenient and once hid a multi-group corruption bug). Skips cleanly where `e2fsck` is absent. 
+ +| Test | What it covers | +|------|---------------| +| `fsck_single_group_clean` / `fsck_multi_group_clean` | `e2fsck -fn` clean for single- and multi-group images | +| `fsck_multi_group_aligned_clean` | aligned build is e2fsck-clean (padding marked free, aligned starts dodge reserved blocks) | +| `content_survives_fragmentation` | files split around reserved blocks read back byte-exact (right logical order) | +| `fsck_journal_straddles_group_boundary` | journal must not straddle a backup superblock (sweeps 120–136 MiB) | +| `fsck_inode_table_straddles_boundary` | inode table must not straddle a backup superblock (many-file workloads) | +| `fuzz_multigroup_validity_and_content` | random multi-group filesets: e2fsck-clean + byte-exact, both align modes (`EXT4_FUZZ_SEEDS` to scale) | +| `kernel_mount_content` | opt-in (`EXT4_MOUNT_TEST=1`): real loop-mount, every file byte-exact vs known input | + +All tests build with `Journal(1024)` to match the production bless config. + ## Failure Modes | Failure | Behavior | diff --git a/ext4/src/tests.rs b/ext4/src/tests.rs index 79ec7de..deb0c02 100644 --- a/ext4/src/tests.rs +++ b/ext4/src/tests.rs @@ -971,7 +971,10 @@ fn test_journal_roundtrip() { "HAS_JOURNAL flag not set" ); assert_eq!(sb.journal_inum, format::INODE_JOURNAL, "journal_inum should be 8"); - assert_eq!(sb.journal_uuid, uuid, "journal_uuid should match filesystem uuid"); + // s_journal_uuid identifies an EXTERNAL journal device; for an internal + // journal it must be zero, or the kernel/e2fsck search for a nonexistent + // external journal and abort ("Can't find external journal"). 
+ assert_eq!(sb.journal_uuid, [0u8; 16], "journal_uuid must be zero for an internal journal"); assert_ne!(sb.journal_blocks[0], 0, "journal_blocks backup should be populated"); } diff --git a/ext4/src/writer.rs b/ext4/src/writer.rs index 1ad23bc..57afd92 100644 --- a/ext4/src/writer.rs +++ b/ext4/src/writer.rs @@ -66,6 +66,21 @@ pub enum WriterOption { /// Create an internal journal with the given size in 4 KiB blocks. /// Typical values: 1024 (4 MiB), 4096 (16 MiB), 16384 (64 MiB). Journal(u32), + /// Start the data of every regular file at least `min_size` bytes large on + /// an `align`-byte boundary (padding the gap with a hole). Aligning large + /// file payloads to the downstream dedup block grid makes the same file + /// produce the same blocks regardless of what was written before it, so + /// content-addressed dedup survives unrelated upstream churn. `align` must + /// be a power of two; `align == 0` disables (the default). + /// + /// Metadata-aware: the pad composes with reserved-block skipping. If the + /// aligned position lands on an ext4 block-group reserved block (e.g. the + /// backup superblock at block `blocks_per_group`), `write_file_data` skips + /// past it before recording the file's first data block, so the data then + /// starts just after the reserved region instead of on the boundary. + /// Padding blocks are recorded as free holes and cleared from the block + /// bitmap in `close()`. Validated by `fsck_multi_group_aligned_clean`. + AlignData { align: u32, min_size: u32 }, } // ---- Internal inode ---- @@ -233,6 +248,20 @@ pub struct Writer { gd_blocks: u32, uuid: [u8; 16], journal_blocks: u32, + /// Boundary (bytes) for large-file data alignment; 0 = disabled. + data_align: i64, + /// Minimum file size (bytes) that triggers data alignment. + data_align_min: i64, + /// Physical block where the in-progress file's data begins.
File data skips + /// blocks reserved for block-group metadata (backup superblocks + GDT), so + /// the data is generally non-contiguous and `pos - data_written` no longer + /// locates the start; this does. + data_start_block: u32, + /// Unreferenced free block ranges created by data alignment padding and by + /// `reserve_contiguous` skips. The block bitmap assumes a densely packed + /// data region; these holes must be cleared from it so the filesystem is + /// consistent. Empty when neither alignment nor a skip has occurred. + free_holes: Vec<(u32, u32)>, } impl Writer { @@ -251,6 +280,10 @@ impl Writer { gd_blocks: 0, uuid: [0u8; 16], journal_blocks: 0, + data_align: 0, + data_align_min: 0, + data_start_block: 0, + free_holes: Vec::new(), }; for opt in opts { match opt { @@ -268,6 +301,11 @@ impl Writer { } WriterOption::Uuid(u) => w.uuid = *u, WriterOption::Journal(blocks) => w.journal_blocks = *blocks, + WriterOption::AlignData { align, min_size } => { + debug_assert!(*align == 0 || align.is_power_of_two()); + w.data_align = i64::from(*align); + w.data_align_min = i64::from(*min_size); + } } } w @@ -331,6 +369,142 @@ impl Writer { Ok(()) } + // ---- block-group metadata reservation ---- + // + // ext4's sparse_super layout reserves the first `1 + gd_blocks` blocks of + // certain groups (0, 1, and powers of 3/5/7) for a backup superblock + a + // group-descriptor copy. Group 0's reservation is skipped at init(); the + // interior ones (block 32768, 98304, ...) sit in the middle of the data + // region. File data must not be written onto them, or the kernel rejects the + // extent as overlapping a system zone (multiply-claimed block). + + /// Number of reserved blocks at the start of a backup group. + fn group_reserve(&self) -> u32 { + 1 + self.gd_blocks + } + + /// Is physical block `b` reserved for an interior block-group backup?
+ fn is_reserved_block(&self, b: u32) -> bool { + let g = b / BLOCKS_PER_GROUP; + if g == 0 { + return false; // group 0's primary metadata is handled by init()'s seek + } + (b % BLOCKS_PER_GROUP) < self.group_reserve() && has_super_backup(g) + } + + /// Smallest reserved block >= `from`, or None if none up to the max device. + fn next_reserved_block_ge(&self, from: u32) -> Option<u32> { + let max_group = (self.max_disk_size / (i64::from(BLOCKS_PER_GROUP) * BLOCK_SIZE as i64)) as u32 + 1; + let mut g = from / BLOCKS_PER_GROUP; + while g <= max_group { + if g >= 1 && has_super_backup(g) { + let rstart = g * BLOCKS_PER_GROUP; + let rend = rstart + self.group_reserve(); + let cand = from.max(rstart); + if cand < rend { + return Some(cand); + } + } + g += 1; + } + None + } + + /// If `pos` sits at the start of a reserved region, seek past it. + fn skip_reserved_at_pos(&mut self) -> io::Result<()> { + while self.pos % BLOCK_SIZE as i64 == 0 && self.is_reserved_block(self.block()) { + let g = self.block() / BLOCKS_PER_GROUP; + let region_end = g * BLOCKS_PER_GROUP + self.group_reserve(); + self.seek_block(region_end)?; + } + Ok(()) + } + + /// Write file data, skipping reserved block-group metadata regions. Records + /// the file's first data block on the first call. + fn write_file_data(&mut self, b: &[u8]) -> io::Result<usize> { + if self.data_written == 0 { + self.skip_reserved_at_pos()?; + self.data_start_block = self.block(); + } + let mut off = 0usize; + while off < b.len() { + self.skip_reserved_at_pos()?; + let cur = self.block(); + let limit = match self.next_reserved_block_ge(cur) { + // next_reserved >= cur, and cur is not reserved, so r > cur.
+ Some(r) => i64::from(r) * BLOCK_SIZE as i64 - self.pos, + None => i64::MAX, + }; + let take = ((b.len() - off) as i64).min(limit) as usize; + let w = self.write_bytes(&b[off..off + take])?; + off += w; + if w < take { + break; // short write + } + } + Ok(off) + } + + /// Record [start, end) as free holes, excluding reserved metadata blocks + /// (which stay marked used: they hold backup superblocks, not free space). + fn record_free_hole(&mut self, start: u32, end: u32) { + let mut b = start; + while b < end { + if self.is_reserved_block(b) { + b += 1; + continue; + } + let run_start = b; + b = self.next_reserved_block_ge(b).unwrap_or(end).min(end); + if b > run_start { + self.free_holes.push((run_start, b - run_start)); + } + } + } + + /// Position the cursor so the next `n` blocks form a single contiguous run + /// that contains no reserved block-group metadata, and return that start + /// block. Used for structures that must be contiguous (journal inode, the + /// flex_bg inode table, bitmaps); unlike file data, they can't be + /// fragmented around a reserved block, so instead we skip the whole run past + /// any reserved region it would straddle. Skipped data blocks become free + /// holes; the reserved blocks stay used. `n` is always << a block group, so + /// at most one interior backup region is ever in the way. + fn reserve_contiguous(&mut self, n: u32) -> io::Result<u32> { + loop { + self.skip_reserved_at_pos()?; + let start = self.block(); + match self.next_reserved_block_ge(start) { + Some(r) if r < start + n => { + let g = r / BLOCKS_PER_GROUP; + let region_end = g * BLOCKS_PER_GROUP + self.group_reserve(); + self.record_free_hole(start, r); + self.seek_block(region_end)?; + } + _ => return Ok(start), + } + } + } + + /// The contiguous, non-reserved physical runs covering [start, end).
+ fn physical_runs(&self, start: u32, end: u32) -> Vec<(u32, u32)> { + let mut runs = Vec::new(); + let mut b = start; + while b < end { + if self.is_reserved_block(b) { + b += 1; + continue; + } + let run_start = b; + // Jump to the next reserved block (or end) rather than stepping. + let next_res = self.next_reserved_block_ge(b).unwrap_or(end).min(end); + b = next_res; + runs.push((run_start, b - run_start)); + } + runs + } + // ---- inode management ---- fn get_inode(&self, i: InodeNumber) -> Option<&Inode> { @@ -615,6 +789,7 @@ impl Writer { self.cur_inode = Some((ino - 1) as usize); self.data_written = 0; self.data_max = size; + self.data_start_block = 0; Ok(()) } @@ -645,64 +820,72 @@ impl Writer { } fn write_extents(&mut self, idx: usize) -> io::Result<()> { - let start = self.pos - self.data_written; - if start % BLOCK_SIZE as i64 != 0 { - return Err(io::Error::other( - "data start position is not block-aligned", - )); - } + // Flush the partial final data block, then resolve the file's physical + // layout. Data skips reserved block-group metadata, so it may be split + // across several contiguous runs; `data_start_block` (not + // `pos - data_written`) locates the start. self.next_block()?; - - let start_block = (start / BLOCK_SIZE as i64) as u32; - let blocks = self.block() - start_block; - let mut used_blocks = blocks; + let start_block = self.data_start_block; + let end_block = self.block(); + let runs = self.physical_runs(start_block, end_block); + + // Flatten runs into extent leaves, each at most MAX_BLOCKS_PER_EXTENT. + // For an unfragmented file this yields exactly the same leaves the old + // contiguous arithmetic produced. 
+ let mut leaves: Vec<(u32, u32, u32)> = Vec::new(); // (logical, phys, len) + let mut logical = 0u32; + for (phys, len) in &runs { + let mut o = 0u32; + while o < *len { + let l = (*len - o).min(MAX_BLOCKS_PER_EXTENT); + leaves.push((logical, phys + o, l)); + logical += l; + o += l; + } + } + let mut used_blocks = logical; // data blocks (reserved gaps excluded) const EXTENT_NODE_SIZE: u32 = 12; const EXTENTS_PER_BLOCK: u32 = (BLOCK_SIZE as u32) / EXTENT_NODE_SIZE - 1; - let extents = if blocks == 0 { 0 } else { blocks.div_ceil(MAX_BLOCKS_PER_EXTENT) }; + let n_ext = leaves.len() as u32; let mut data = Vec::new(); - if extents == 0 { + if n_ext == 0 { // Nothing to do - } else if extents <= 4 { - // Fits in inode directly - write_extent_header_to_vec(&mut data, extents as u16, 4, 0); - for i in 0..extents { - let block_offset = i * MAX_BLOCKS_PER_EXTENT; - let mut length = blocks - block_offset; - if length > MAX_BLOCKS_PER_EXTENT { - length = MAX_BLOCKS_PER_EXTENT; - } - write_extent_leaf_to_vec(&mut data, block_offset, length as u16, start_block + block_offset); + } else if n_ext <= 4 { + // Fits in the inode directly. + write_extent_header_to_vec(&mut data, n_ext as u16, 4, 0); + for (lblk, phys, len) in &leaves { + write_extent_leaf_to_vec(&mut data, *lblk, *len as u16, *phys); } - // Pad to 4 extents worth - let padding = (4 - extents) * EXTENT_NODE_SIZE; + let padding = (4 - n_ext) * EXTENT_NODE_SIZE; data.extend(std::iter::repeat_n(0u8, padding as usize)); - } else if extents <= 4 * EXTENTS_PER_BLOCK { - let extent_blocks = extents.div_ceil(EXTENTS_PER_BLOCK); - used_blocks += extent_blocks; + } else if n_ext <= 4 * EXTENTS_PER_BLOCK { + let extent_blocks = n_ext.div_ceil(EXTENTS_PER_BLOCK); - // Root: index nodes + // Root: index nodes pointing at leaf blocks. 
write_extent_header_to_vec(&mut data, extent_blocks as u16, 4, 1); - // We'll fill in the index nodes after writing the leaf blocks let index_start = data.len(); data.resize(index_start + 4 * EXTENT_NODE_SIZE as usize, 0); for i in 0..extent_blocks { + // Extent-tree blocks must avoid reserved metadata too. + self.skip_reserved_at_pos()?; let leaf_block = self.block(); - // Fill in the index node + used_blocks += 1; + + let first = (i * EXTENTS_PER_BLOCK) as usize; + let extents_in_block = (n_ext - i * EXTENTS_PER_BLOCK).min(EXTENTS_PER_BLOCK); + + // Index node: logical offset of this leaf block's first extent. let idx_off = index_start + (i * EXTENT_NODE_SIZE) as usize; - let block_off = i * EXTENTS_PER_BLOCK * MAX_BLOCKS_PER_EXTENT; - data[idx_off..idx_off + 4].copy_from_slice(&block_off.to_le_bytes()); + data[idx_off..idx_off + 4].copy_from_slice(&leaves[first].0.to_le_bytes()); data[idx_off + 4..idx_off + 8].copy_from_slice(&leaf_block.to_le_bytes()); // idx_off + 8..12 stays zero (leaf_high + unused) - let extents_in_block = (extents - i * EXTENTS_PER_BLOCK).min(EXTENTS_PER_BLOCK); let mut leaf_buf = vec![0u8; BLOCK_SIZE as usize]; let mut leaf_pos = 0usize; - - // Write extent header leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&format::EXTENT_HEADER_MAGIC.to_le_bytes()); leaf_pos += 2; leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(extents_in_block as u16).to_le_bytes()); @@ -714,21 +897,15 @@ impl Writer { leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&0u32.to_le_bytes()); // generation leaf_pos += 4; - let offset = i * EXTENTS_PER_BLOCK * MAX_BLOCKS_PER_EXTENT; - for j in 0..extents_in_block { - let block_off2 = offset + j * MAX_BLOCKS_PER_EXTENT; - let mut length = blocks - block_off2; - if length > MAX_BLOCKS_PER_EXTENT { - length = MAX_BLOCKS_PER_EXTENT; - } - let start = start_block + block_off2; - leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&block_off2.to_le_bytes()); + for j in 0..extents_in_block as usize { + let (lblk, phys, len) = 
leaves[first + j]; + leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&lblk.to_le_bytes()); leaf_pos += 4; - leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(length as u16).to_le_bytes()); + leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&(len as u16).to_le_bytes()); leaf_pos += 2; leaf_buf[leaf_pos..leaf_pos + 2].copy_from_slice(&0u16.to_le_bytes()); // start_high leaf_pos += 2; - leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&start.to_le_bytes()); + leaf_buf[leaf_pos..leaf_pos + 4].copy_from_slice(&phys.to_le_bytes()); leaf_pos += 4; } @@ -892,6 +1069,22 @@ impl Writer { if self.inode_ref(child_ino)?.mode & TYPE_MASK == format::S_IFREG { self.start_inode(name, child_ino, f.size)?; + // Align the start of large file payloads to the dedup block grid, so + // the same file produces the same blocks regardless of upstream + // churn. The padded gap is unreferenced free space; record it so the + // block bitmap marks it free (reserved metadata blocks within the + // gap stay used). write_file_data then skips any reserved block at + // the aligned position before recording data_start_block. + if self.data_align > 0 && f.size >= self.data_align_min { + let align = self.data_align; + let rem = self.pos % align; + if rem != 0 { + let pad_start = self.block(); + self.write_zeros(align - rem)?; + let pad_end = self.block(); + self.record_free_hole(pad_start, pad_end); + } + } } Ok(()) } @@ -1024,7 +1217,7 @@ impl Writer { self.data_written += b.len() as i64; Ok(b.len()) } else { - let n = self.write_bytes(b)?; + let n = self.write_file_data(b)?; self.data_written += n as i64; Ok(n) } @@ -1147,7 +1340,10 @@ impl Writer { /// journal blocks. The superblock is updated in close() to set /// HAS_JOURNAL, journal_inum, and the journal_blocks backup. 
fn write_journal(&mut self) -> io::Result<()> { - let journal_start = self.block(); + // The journal is one contiguous extent; keep it clear of reserved + // block-group metadata (a straddle would make inode 8 multiply-claim the + // backup superblock). + let journal_start = self.reserve_contiguous(self.journal_blocks)?; // Write JBD2 v2 superblock (first block of journal) // All multi-byte fields are big-endian per JBD2 spec. @@ -1341,14 +1537,21 @@ impl Writer { self.write_journal()?; } - // Write the inode table - let inode_table_offset = self.block(); - let (groups, inodes_per_group) = best_group_count(inode_table_offset, self.inodes.len() as u32); + // Write the inode table. It is contiguous and located via per-group + // descriptors (inode_table_low + g * size_per_group), so it must avoid + // reserved block-group metadata. Reserve a clean run sized for the group + // count; padding the start can bump the count by one group, so reserve a + // one-group margin and recompute against the final offset. + let n_inodes = self.inodes.len() as u32; + let (g0, ipg0) = best_group_count(self.block(), n_inodes); + let itspg0 = ipg0 * INODE_SIZE as u32 / BLOCK_SIZE as u32; + let inode_table_offset = self.reserve_contiguous((g0 + 1) * itspg0 + 2)?; + let (groups, inodes_per_group) = best_group_count(inode_table_offset, n_inodes); self.write_inode_table(groups * inodes_per_group * INODE_SIZE as u32)?; - // Write bitmaps - let bitmap_offset = self.block(); + // Write bitmaps (also contiguous and GD-located). 
let bitmap_size = groups * 2; + let bitmap_offset = self.reserve_contiguous(bitmap_size)?; let valid_data_size = bitmap_offset + bitmap_size; let mut disk_size = valid_data_size; let min_size = (groups - 1) * BLOCKS_PER_GROUP + 1; @@ -1368,6 +1571,8 @@ impl Writer { let inode_table_size_per_group = inodes_per_group * INODE_SIZE as u32 / BLOCK_SIZE as u32; let mut total_used_blocks: u32 = 0; let mut total_used_inodes: u32 = 0; + // Alignment padding holes to clear from the otherwise-dense bitmap. + let free_holes = std::mem::take(&mut self.free_holes); for g in 0..groups { let mut bitmap_buf = vec![0u8; BLOCK_SIZE as usize * 2]; @@ -1400,6 +1605,23 @@ impl Writer { used_block_count += 1; } } + // Clear alignment padding holes: the bitmap is dense by default, but + // these blocks are unreferenced free space. + let gstart = g * BLOCKS_PER_GROUP; + for &(hstart, hlen) in &free_holes { + let lo = hstart.max(gstart); + let hi = (hstart + hlen).min(gstart + BLOCKS_PER_GROUP); + let mut b = lo; + while b < hi { + let j = b - gstart; + let mask = 1u8 << (j % 8); + if bitmap_buf[(j / 8) as usize] & mask != 0 { + bitmap_buf[(j / 8) as usize] &= !mask; + used_block_count -= 1; + } + b += 1; + } + } // Inode bitmap for j in 0..inodes_per_group { @@ -1478,7 +1700,12 @@ impl Writer { | format::RoCompatFeature::HUGE_FILE | format::RoCompatFeature::EXTRA_ISIZE, uuid: self.uuid, - journal_uuid: self.uuid, + // s_journal_uuid identifies an *external* journal device. We only + // ever use an internal journal (inode 8) or none, so this must stay + // zero — a non-zero value makes the kernel and e2fsck search for an + // external journal and abort ("Can't find external journal"). The + // journal's own jbd2 superblock still carries the fs UUID. 
+ journal_uuid: [0u8; 16], journal_inum: if self.journal_blocks > 0 { format::INODE_JOURNAL } else { 0 }, hash_seed: [ u32::from_le_bytes(self.uuid[0..4].try_into().unwrap()), @@ -1582,6 +1809,29 @@ fn best_group_count(blocks: u32, inodes: u32) -> (u32, u32) { (best_groups, best_ipg) } +/// Does block group `g` hold a backup superblock + group-descriptor copy? +/// With the sparse_super feature, backups live in groups 0, 1, and every power +/// of 3, 5, and 7. The kernel/e2fsck reserve those blocks regardless of whether +/// valid backup content is written, so file data must never claim them. +fn has_super_backup(g: u32) -> bool { + if g <= 1 { + return true; + } + for base in [3u32, 5, 7] { + let mut p = base; + while p < g { + match p.checked_mul(base) { + Some(n) => p = n, + None => break, + } + } + if p == g { + return true; + } + } + false +} + fn write_extent_header_to_vec(buf: &mut Vec<u8>, entries: u16, max: u16, depth: u16) { buf.extend_from_slice(&format::EXTENT_HEADER_MAGIC.to_le_bytes()); buf.extend_from_slice(&entries.to_le_bytes()); diff --git a/ext4/tests/fsck_validity.rs b/ext4/tests/fsck_validity.rs new file mode 100644 index 0000000..b5d606a --- /dev/null +++ b/ext4/tests/fsck_validity.rs @@ -0,0 +1,442 @@ +//! Filesystem-validity harness gated on the REAL `e2fsck`, not the in-crate +//! reader. The in-crate reader is lenient and hid a real multi-block-group +//! corruption bug for a long time; `e2fsck` (kernel-grade structural check) +//! catches it. Every writer change should be validated here. +//! +//! These tests shell out to `e2fsck`; they skip (pass with a notice) when it is +//! not installed so they do not break environments without e2fsprogs. +//! +//!
Run with: `cargo test -p ext4 --test fsck_validity -- --include-ignored`
+
+use std::io::Write;
+use std::path::PathBuf;
+use std::process::Command;
+
+use ext4::tar_convert::{ConvertOptions, convert_tar_to_ext4};
+use ext4::writer::WriterOption;
+
+/// Locate `e2fsck` (often in /sbin, not on a service PATH). Returns None to skip.
+fn find_e2fsck() -> Option<PathBuf> {
+    for p in ["/sbin/e2fsck", "/usr/sbin/e2fsck", "/usr/bin/e2fsck", "/bin/e2fsck"] {
+        if std::path::Path::new(p).exists() {
+            return Some(PathBuf::from(p));
+        }
+    }
+    None
+}
+
+/// Deterministic, non-trivial file content (so blocks are actually allocated and
+/// the layout is reproducible — no RNG).
+fn content(seed: u64, len: usize) -> Vec<u8> {
+    let mut v = Vec::with_capacity(len);
+    let mut s = seed.wrapping_mul(0x9E3779B97F4A7C15).wrapping_add(1);
+    while v.len() < len {
+        s ^= s << 13;
+        s ^= s >> 7;
+        s ^= s << 17;
+        v.extend_from_slice(&s.to_le_bytes());
+    }
+    v.truncate(len);
+    v
+}
+
+/// Build an in-memory tar of `(path, size)` files, convert to ext4 via the real
+/// production path, write it to a temp file, and run `e2fsck -fn` on it.
+/// Returns Ok(()) if e2fsck reports a clean filesystem (exit 0), else Err(report).
+fn build_and_fsck(files: &[(&str, usize)], align: Option<(u32, u32)>) -> Result<(), String> {
+    let owned: Vec<(String, usize)> = files.iter().map(|(p, s)| ((*p).to_string(), *s)).collect();
+    e2fsck_clean(&build_image(&owned, align))
+}
+
+/// Build a real ext4 image (production convert path) from `(path, size)` files,
+/// each filled with the deterministic `content(index, size)`.
+fn build_image(files: &[(String, usize)], align: Option<(u32, u32)>) -> Vec<u8> {
+    let mut tar = tar::Builder::new(Vec::new());
+    for (i, (path, size)) in files.iter().enumerate() {
+        let data = content(i as u64, *size);
+        let mut h = tar::Header::new_gnu();
+        h.set_size(data.len() as u64);
+        h.set_mode(0o644);
+        h.set_mtime(0);
+        h.set_entry_type(tar::EntryType::Regular);
+        h.set_cksum();
+        tar.append_data(&mut h, path, &data[..]).unwrap();
+    }
+    let tar_bytes = tar.into_inner().unwrap();
+
+    let mut writer_options = vec![
+        WriterOption::MaximumDiskSize(4 * 1024 * 1024 * 1024),
+        WriterOption::Uuid([0x11; 16]),
+        // Match the production bless config: an internal 4 MiB journal. The
+        // journal (and inode table / dir blocks) are written at close() and can
+        // land near a block-group backup-superblock boundary, so exercising it
+        // is part of validating the reserved-block handling.
+        WriterOption::Journal(1024),
+    ];
+    if let Some((a, m)) = align {
+        writer_options.push(WriterOption::AlignData { align: a, min_size: m });
+    }
+    let opts = ConvertOptions { convert_backslash: false, writer_options };
+    let mut img: Vec<u8> = Vec::new();
+    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), std::io::Cursor::new(&mut img), &opts)
+        .unwrap();
+    img
+}
+
+/// The oracle: run `e2fsck -fn` on the image. Ok == clean (exit 0). Skips
+/// (returns Ok) when e2fsck is not installed.
+fn e2fsck_clean(img: &[u8]) -> Result<(), String> {
+    let Some(e2fsck) = find_e2fsck() else {
+        eprintln!("SKIP: e2fsck not installed");
+        return Ok(());
+    };
+    let mut tmp = tempfile::NamedTempFile::new().map_err(|e| format!("tmp: {e}"))?;
+    tmp.write_all(img).map_err(|e| format!("write: {e}"))?;
+    tmp.flush().ok();
+    let output = Command::new(&e2fsck)
+        .args(["-fn"])
+        .arg(tmp.path())
+        .output()
+        .map_err(|e| format!("spawn e2fsck: {e}"))?;
+    if output.status.code() == Some(0) {
+        Ok(())
+    } else {
+        let mut report = format!("e2fsck exit={:?} (nonzero = filesystem errors)\n", output.status.code());
+        report.push_str(&String::from_utf8_lossy(&output.stdout));
+        // Trim the giant bitmap-difference dumps to keep failures readable.
+        Err(report.lines().take(30).collect::<Vec<_>>().join("\n"))
+    }
+}
+
+/// Read every file back via the reader (which assembles from the on-disk extent
+/// tree) and assert byte-exact equality with the known input — catches any
+/// logical-ordering bug introduced by fragmentation around reserved blocks.
+fn content_matches(img: &[u8], files: &[(String, usize)]) -> Result<(), String> {
+    let mut want: std::collections::HashMap<String, (u64, usize)> = std::collections::HashMap::new();
+    for (i, (p, s)) in files.iter().enumerate() {
+        want.insert(p.trim_start_matches('/').to_string(), (i as u64, *s));
+    }
+    let mut reader =
+        ext4::reader::Reader::new(std::io::Cursor::new(img)).map_err(|e| format!("reader: {e}"))?;
+    let entries = reader.walk().map_err(|e| format!("walk: {e}"))?;
+    let mut checked = 0;
+    for e in entries {
+        if (e.mode & 0xF000) != 0x8000 {
+            continue;
+        }
+        let path = e.path.trim_start_matches('/').to_string();
+        let Some(&(idx, size)) = want.get(&path) else { continue };
+        let inode = reader.read_inode(e.inode_number).map_err(|e| format!("{path}: inode: {e}"))?;
+        let got = reader.read_data(&inode).map_err(|e| format!("{path}: read: {e}"))?;
+        if got.len() != size {
+            return Err(format!("{path}: size {} != {size}", got.len()));
+        }
+        if got != content(idx, size) {
+            return Err(format!("{path}: CONTENT MISMATCH (fragmentation reordered bytes)"));
+        }
+        checked += 1;
+    }
+    if checked != files.len() {
+        return Err(format!("read back {checked}/{} files", files.len()));
+    }
+    Ok(())
+}
+
+/// Baseline: a single-block-group filesystem (< 128 MiB) must be e2fsck-clean.
+/// This proves the harness works and the writer is sound when it doesn't cross
+/// a block-group boundary.
+#[test]
+fn fsck_single_group_clean() {
+    let files = &[
+        ("etc/hostname", 12),
+        ("etc/config.toml", 4096),
+        ("usr/bin/tool", 8 * 1024 * 1024),
+        ("usr/lib/data.bin", 32 * 1024 * 1024),
+        ("var/log/app.log", 1024),
+    ];
+    if let Err(report) = build_and_fsck(files, None) {
+        panic!("single-group image is not e2fsck-clean:\n{report}");
+    }
+}
+
+/// A filesystem that crosses a block-group boundary (> 128 MiB) must be
+/// e2fsck-clean. 
Regression for the original corruption: the linear allocator +/// used to place file data on the Group 1 backup superblock / group descriptors +/// (block 32768+), producing multiply-claimed blocks. The group-aware allocator +/// now skips those reserved blocks and fragments files around them. +#[test] +fn fsck_multi_group_clean() { + // ~160 MiB of file data guarantees crossing into block group 1 (32768 blocks + // == 128 MiB), regardless of group-0 metadata overhead. + let files = &[ + ("data/a.bin", 40 * 1024 * 1024), + ("data/b.bin", 40 * 1024 * 1024), + ("data/c.bin", 40 * 1024 * 1024), + ("data/d.bin", 40 * 1024 * 1024), + ]; + if let Err(report) = build_and_fsck(files, None) { + panic!("multi-group image is not e2fsck-clean:\n{report}"); + } +} + +/// Content correctness across fragmentation. e2fsck validates *structure*, not +/// that a file's logical blocks are in the right order. A file that spans a +/// reserved metadata region is split into multiple extents by the allocator; if +/// the logical offsets were assigned wrong, the bytes would come back reordered. +/// Build a multi-group image with KNOWN deterministic content, read every file +/// back through the reader (which assembles by the on-disk extent tree, +/// independent of the writer's allocation state), and assert byte-exact equality. +#[test] +fn content_survives_fragmentation() { + // Files sized so several cross the block-group boundary (block 32768). + let specs: &[(&str, usize)] = &[ + ("data/a.bin", 50 * 1024 * 1024), + ("data/b.bin", 50 * 1024 * 1024), + ("data/c.bin", 50 * 1024 * 1024), + ("small/x", 1234), + ("data/d.bin", 30 * 1024 * 1024), + ]; + + // Build the tar. 
+    let mut tar = tar::Builder::new(Vec::new());
+    let mut expected: std::collections::HashMap<String, Vec<u8>> = std::collections::HashMap::new();
+    for (i, (path, size)) in specs.iter().enumerate() {
+        let data = content(i as u64, *size);
+        let mut h = tar::Header::new_gnu();
+        h.set_size(data.len() as u64);
+        h.set_mode(0o644);
+        h.set_mtime(0);
+        h.set_entry_type(tar::EntryType::Regular);
+        h.set_cksum();
+        tar.append_data(&mut h, path, &data[..]).unwrap();
+        expected.insert((*path).to_string(), data);
+    }
+    let tar_bytes = tar.into_inner().unwrap();
+
+    // Convert to ext4 (multi-group, group-aware allocator).
+    let opts = ConvertOptions {
+        convert_backslash: false,
+        writer_options: vec![
+            WriterOption::MaximumDiskSize(2 * 1024 * 1024 * 1024),
+            WriterOption::Uuid([0x22; 16]),
+        ],
+    };
+    let mut img: Vec<u8> = Vec::new();
+    convert_tar_to_ext4(std::io::Cursor::new(tar_bytes), std::io::Cursor::new(&mut img), &opts).unwrap();
+
+    // Read every file back via the reader and compare to known input.
+    let mut reader = ext4::reader::Reader::new(std::io::Cursor::new(&img)).unwrap();
+    let entries = reader.walk().unwrap();
+    let mut checked = 0;
+    for e in entries {
+        if (e.mode & 0xF000) != 0x8000 {
+            continue;
+        }
+        let path = e.path.trim_start_matches('/').to_string();
+        let Some(want) = expected.get(&path) else { continue };
+        let inode = reader.read_inode(e.inode_number).unwrap();
+        let got = reader.read_data(&inode).unwrap();
+        assert_eq!(got.len(), want.len(), "{path}: size mismatch");
+        assert!(got == *want, "{path}: CONTENT MISMATCH (fragmentation reordered bytes)");
+        checked += 1;
+    }
+    assert_eq!(checked, specs.len(), "did not read back all files");
+}
+
+/// Once the allocator is group-aware, the aligned multi-group build must ALSO be
+/// e2fsck-clean (alignment padding must be marked free, and aligned file starts
+/// must skip reserved blocks). Captures both the metadata-collision and the
+/// padding-bitmap issues found via e2fsck.
+#[test] +fn fsck_multi_group_aligned_clean() { + let files = &[ + ("data/a.bin", 40 * 1024 * 1024), + ("data/b.bin", 40 * 1024 * 1024), + ("data/c.bin", 40 * 1024 * 1024), + ("data/d.bin", 40 * 1024 * 1024), + ]; + if let Err(report) = build_and_fsck(files, Some((128 * 1024, 16 * 1024))) { + panic!("aligned multi-group image is not e2fsck-clean:\n{report}"); + } +} + +/// Targeted sweep of the block-group boundary (block 32768 == 128 MiB). With a +/// journal enabled (production config), the journal — and the inode table / dir +/// blocks — are written at close() and can land straddling the Group 1 backup +/// superblock. Sweep data sizes that push those close()-time structures across +/// the boundary; every one must be e2fsck-clean. Regression for reserved-block +/// handling of NON-file-data writes. +#[test] +fn fsck_journal_straddles_group_boundary() { + if find_e2fsck().is_none() { + eprintln!("SKIP: e2fsck not installed"); + return; + } + // 120..136 MiB in 1 MiB steps: data ends near block 32768, so the trailing + // journal (1024 blocks = 4 MiB) and inode table cross the boundary. + for mib in 120..=136 { + let files = vec![(format!("data/blob_{mib}.bin"), mib * 1024 * 1024)]; + for align in [None, Some((128 * 1024u32, 128 * 1024u32))] { + let img = build_image(&files, align); + if let Err(e) = e2fsck_clean(&img) { + panic!("{mib} MiB align={align:?} NOT e2fsck-clean:\n{e}"); + } + if let Err(e) = content_matches(&img, &files) { + panic!("{mib} MiB align={align:?} content error: {e}"); + } + } + } +} + +/// Many files whose data ends near block 32768 make the flex_bg inode table +/// large enough to straddle the Group 1 backup superblock. The inode table is +/// contiguous and pointed at by per-group descriptors, so it must also dodge +/// reserved blocks. Regression for inode-table/bitmap reserved-block handling. 
+#[test] +fn fsck_inode_table_straddles_boundary() { + if find_e2fsck().is_none() { + eprintln!("SKIP: e2fsck not installed"); + return; + } + // N one-block files put data just below block 32768 while the inode table + // (N/16 blocks) crosses it. Sweep a few counts so one reliably straddles. + for n in [30_500usize, 31_000, 31_500] { + let files: Vec<(String, usize)> = + (0..n).map(|i| (format!("d{}/f{i}.bin", i % 256), 4096)).collect(); + for align in [None, Some((128 * 1024u32, 128 * 1024u32))] { + let img = build_image(&files, align); + if let Err(e) = e2fsck_clean(&img) { + panic!("inode-table straddle n={n} align={align:?} NOT e2fsck-clean:\n{e}"); + } + } + } +} + +/// Property fuzzer: random multi-group filesets must be e2fsck-clean AND read +/// back byte-exact — both unaligned and aligned. Seeds are deterministic so any +/// failure reproduces verbatim (the panic prints the seed and fileset). This is +/// the generalized gate: it sweeps the size/position space where the original +/// data-on-backup-superblock and alignment-bitmap bugs lived. Crank coverage +/// with `EXT4_FUZZ_SEEDS=64 cargo test -p ext4 --test fsck_validity fuzz`. +#[test] +fn fuzz_multigroup_validity_and_content() { + if find_e2fsck().is_none() { + eprintln!("SKIP: e2fsck not installed"); + return; + } + let seeds: u64 = std::env::var("EXT4_FUZZ_SEEDS") + .ok() + .and_then(|s| s.parse().ok()) + .unwrap_or(8); + + for seed in 0..seeds { + let mut state = seed.wrapping_mul(0x9E3779B97F4A7C15) | 1; + let mut next = move || { + state ^= state << 13; + state ^= state >> 7; + state ^= state << 17; + state + }; + + // Random fileset with a deliberate mix: small files (below the align + // threshold), medium, and large (which straddle group boundaries). 
+ let nfiles = 4 + (next() % 10) as usize; + let mut files: Vec<(String, usize)> = Vec::new(); + let mut total: u64 = 0; + for k in 0..nfiles { + let size = match next() % 10 { + 0..=3 => 1 + (next() % (64 * 1024)) as usize, + 4..=6 => 4096 + (next() % (8 * 1024 * 1024)) as usize, + _ => 8 * 1024 * 1024 + (next() % (40 * 1024 * 1024)) as usize, + }; + files.push((format!("d/s{seed}_f{k}.bin"), size)); + total += size as u64; + } + // Guarantee at least one block-group boundary (128 MiB) is crossed so the + // reserved-block / fragmentation paths are always exercised. + if total < 160 * 1024 * 1024 { + let pad = (160 * 1024 * 1024 - total) as usize + 4 * 1024 * 1024; + files.push((format!("d/s{seed}_big.bin"), pad)); + } + + for align in [None, Some((128 * 1024u32, 16 * 1024u32))] { + let img = build_image(&files, align); + if let Err(e) = e2fsck_clean(&img) { + panic!("seed={seed} align={align:?} NOT e2fsck-clean:\n{e}\nfiles={files:?}"); + } + if let Err(e) = content_matches(&img, &files) { + panic!("seed={seed} align={align:?} content error: {e}\nfiles={files:?}"); + } + } + } +} + +/// Privileged kernel-mount content check: loop-mount the image with the REAL +/// Linux ext4 driver and verify every file's bytes against the known input — +/// the strongest oracle, independent of both the writer and the in-crate reader. 
+/// Opt-in (needs root / passwordless sudo + loop devices), so it skips by +/// default and runs in a privileged/nightly job: +/// EXT4_MOUNT_TEST=1 cargo test -p ext4 --test fsck_validity kernel_mount +#[test] +fn kernel_mount_content() { + if std::env::var("EXT4_MOUNT_TEST").is_err() { + eprintln!("SKIP: set EXT4_MOUNT_TEST=1 to run the privileged kernel-mount check"); + return; + } + let sudo_ok = Command::new("sudo") + .args(["-n", "true"]) + .status() + .map(|s| s.success()) + .unwrap_or(false); + if !sudo_ok { + eprintln!("SKIP: passwordless sudo not available for mount"); + return; + } + + // Multi-group fileset (~180 MiB) with known deterministic content. + let files: Vec<(String, usize)> = vec![ + ("data/a.bin".into(), 50 * 1024 * 1024), + ("data/b.bin".into(), 50 * 1024 * 1024), + ("small/x".into(), 4096), + ("data/c.bin".into(), 50 * 1024 * 1024), + ("data/d.bin".into(), 30 * 1024 * 1024), + ]; + + for align in [None, Some((128 * 1024u32, 128 * 1024u32))] { + let img = build_image(&files, align); + let mut tmp = tempfile::NamedTempFile::new().unwrap(); + tmp.write_all(&img).unwrap(); + tmp.flush().unwrap(); + let mnt = tempfile::tempdir().unwrap(); + + let mounted = Command::new("sudo") + .args(["-n", "mount", "-o", "ro,loop"]) + .arg(tmp.path()) + .arg(mnt.path()) + .status() + .expect("spawn mount"); + assert!(mounted.success(), "kernel mount failed (align={align:?})"); + + // Read every file through the kernel and compare to known input. Collect + // the result first so we always unmount, even on mismatch. 
+        let mut err: Option<String> = None;
+        for (i, (path, size)) in files.iter().enumerate() {
+            match std::fs::read(mnt.path().join(path)) {
+                Ok(got) if got.len() == *size && got == content(i as u64, *size) => {}
+                Ok(got) => {
+                    err = Some(format!("{path}: kernel read mismatch (len {} vs {size})", got.len()));
+                    break;
+                }
+                Err(e) => {
+                    err = Some(format!("{path}: kernel read failed: {e}"));
+                    break;
+                }
+            }
+        }
+
+        let _ = Command::new("sudo").args(["-n", "umount"]).arg(mnt.path()).status();
+        if let Some(e) = err {
+            panic!("kernel-mount content check failed (align={align:?}): {e}");
+        }
+    }
+}
diff --git a/glidefs/src/bin/dedup_probe.rs b/glidefs/src/bin/dedup_probe.rs
new file mode 100644
index 0000000..65e2348
--- /dev/null
+++ b/glidefs/src/bin/dedup_probe.rs
@@ -0,0 +1,533 @@
+#![allow(clippy::cast_possible_wrap, clippy::cast_sign_loss, clippy::cast_possible_truncation)]
+//! Empirical probe for the block-alignment dedup hypothesis.
+//!
+//! Claim 1 — the problem is real: fixed-grid block dedup over ext4 silently
+//! fails to dedup byte-identical content when that content sits at a different
+//! offset, because the dedup window is positional. We prove this two ways:
+//! E1 (mechanism): take ONE real ext4 image and dedup it against a copy of
+//! *itself* shifted by δ bytes. δ that is a whole-grid multiple → ~100%
+//! dedup (the method works); a sub-grid δ → dedup collapses. Same bytes,
+//! only the offset changes, so alignment is provably the whole cause.
+//! E2 (bite): over real, related images, fixed-grid dedup is far below what
+//! content-defined chunking (FastCDC) recovers from the *same* ext4 bytes.
+//!
+//! Claim 2 — the fix is easy: rebuild the same images with the production
+//! writer's new `AlignData` option (large file payloads start on the grid).
+//! E3: fixed-grid dedup on the aligned images should jump to ~the FastCDC
+//! ceiling, and the aligned images still round-trip through the ext4 reader.
+//! 
The padding is holes (zeros), which the block store never stores. +//! +//! Everything runs through the real production code paths +//! (`convert_oci_layers_to_ext4`, the real `block_map` hashing/compression), so +//! the numbers reflect what GlideFS would actually store — not a re-model. +//! +//! Usage: +//! skopeo copy docker://python:3.12-slim-bookworm dir:/tmp/oci/py312 +//! skopeo copy docker://python:3.13-slim-bookworm dir:/tmp/oci/py313 +//! cargo run --release --bin dedup_probe -- /tmp/oci/py312 /tmp/oci/py313 ... + +use std::collections::HashMap; +use std::fs::File; +use std::io::{Read, Seek, SeekFrom}; +use std::path::{Path, PathBuf}; + +use ext4::tar_convert::{ConvertOptions, convert_oci_layers_to_ext4}; +use ext4::writer::WriterOption; +use glidefs::block::block_map::{Blake3Hash, blake3_128, lz4_compress}; + +const GRID: usize = 128 * 1024; // production BLOCK_SIZE — the dedup window size +// align files >= threshold; pack smaller ones. Override with DEDUP_ALIGN_THRESHOLD. +fn align_threshold() -> u32 { + std::env::var("DEDUP_ALIGN_THRESHOLD").ok().and_then(|s| s.parse().ok()).unwrap_or(16 * 1024) +} + +// ---- shared image plumbing (mirrors bless: deterministic, content-addressed) ---- + +fn deterministic_uuid(seed: &str) -> [u8; 16] { + let mut uuid = *blake3_128(seed.as_bytes()).as_bytes(); + uuid[6] = (uuid[6] & 0x0f) | 0x80; + uuid[8] = (uuid[8] & 0x3f) | 0x80; + uuid +} + +fn decompress_layer(blob: &Path) -> std::io::Result { + let mut f = File::open(blob)?; + let mut magic = [0u8; 4]; + f.read_exact(&mut magic)?; + f.seek(SeekFrom::Start(0))?; + let mut out = tempfile::tempfile()?; + if magic[0] == 0x1f && magic[1] == 0x8b { + std::io::copy(&mut flate2::read::GzDecoder::new(f), &mut out)?; + } else if magic == [0x28, 0xb5, 0x2f, 0xfd] { + std::io::copy(&mut zstd::Decoder::new(f)?, &mut out)?; + } else { + std::io::copy(&mut f, &mut out)?; + } + out.seek(SeekFrom::Start(0))?; + Ok(out) +} + +struct Image { + name: String, + seed: String, + 
layer_blobs: Vec, +} + +fn load_image(dir: &Path) -> Image { + let bytes = std::fs::read(dir.join("manifest.json")).expect("read manifest.json"); + let seed = format!("blake3:{:032x}", u128::from_le_bytes(*blake3_128(&bytes).as_bytes())); + let v: serde_json::Value = serde_json::from_slice(&bytes).expect("parse manifest"); + let layer_blobs = v["layers"] + .as_array() + .expect("layers[]") + .iter() + .map(|l| { + let d = l["digest"].as_str().expect("digest"); + dir.join(d.strip_prefix("sha256:").unwrap_or(d)) + }) + .collect(); + Image { name: dir.file_name().unwrap().to_string_lossy().into_owned(), seed, layer_blobs } +} + +/// Build a real ext4 image from the layers. `align` toggles the new writer +/// option; everything else is identical so the only variable is alignment. +/// Fixed 4 GiB device for all builds so the block grids are comparable. +fn build_ext4(img: &Image, align: bool) -> File { + let mut layers: Vec = + img.layer_blobs.iter().map(|p| decompress_layer(p).expect("decompress")).collect(); + let mut writer_options = vec![ + WriterOption::MaximumDiskSize(4 * 1024 * 1024 * 1024), + WriterOption::Uuid(deterministic_uuid(&img.seed)), + ]; + // The writer's internal-journal feature flag trips `e2fsck` ("external + // journal") before it inspects bitmaps. Allow disabling it for fsck runs so + // the full structural check actually executes. Production keeps the journal. 
+ if std::env::var("DEDUP_PROBE_NO_JOURNAL").is_err() { + writer_options.push(WriterOption::Journal(1024)); + } + if align { + writer_options.push(WriterOption::AlignData { align: GRID as u32, min_size: align_threshold() }); + } + let opts = ConvertOptions { convert_backslash: false, writer_options }; + let out = tempfile::tempfile().expect("tempfile"); + let mut fs = convert_oci_layers_to_ext4(&mut layers, out, &opts).expect("convert"); + fs.seek(SeekFrom::Start(0)).unwrap(); + fs +} + +fn is_zero(d: &[u8]) -> bool { + let (p, c, s) = unsafe { d.align_to::() }; + p.iter().all(|&b| b == 0) && c.iter().all(|&w| w == 0) && s.iter().all(|&b| b == 0) +} + +fn read_all(mut f: File) -> Vec { + let mut v = Vec::new(); + f.seek(SeekFrom::Start(0)).unwrap(); + f.read_to_end(&mut v).unwrap(); + v +} + +// ---- chunking schemes; each returns (content hash -> stored bytes) units ---- + +/// Fixed grid (what production does today). Skips zero blocks, exactly like the +/// real flush path — so alignment padding (holes) is free here. +fn fixed_grid_units(img: &[u8], stored: &mut HashMap, raw: &mut u64, zero: &mut u64) { + for blk in img.chunks(GRID) { + if is_zero(blk) { + *zero += 1; + continue; + } + *raw += blk.len() as u64; + stored.entry(blake3_128(blk)).or_insert_with(|| lz4_compress(blk).len()); + } +} + +/// Gear-hash content-defined chunking (the alignment-immune ceiling). Boundary +/// where the rolling hash hits the mask; min/max clamp run length. +fn gear_table() -> [u64; 256] { + // splitmix64 from a fixed seed — deterministic across runs. 
+ let mut s = 0x9E3779B97F4A7C15u64; + let mut t = [0u64; 256]; + for e in &mut t { + s = s.wrapping_add(0x9E3779B97F4A7C15); + let mut z = s; + z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9); + z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB); + *e = z ^ (z >> 31); + } + t +} + +fn cdc_units(img: &[u8], gear: &[u64; 256], stored: &mut HashMap, raw: &mut u64) { + const MIN: usize = 16 * 1024; + const MAX: usize = 1024 * 1024; + const MASK: u64 = (1 << 17) - 1; // avg ~128 KiB, matching the grid + let n = img.len(); + let mut start = 0; + while start < n { + let mut hash = 0u64; + let mut j = start; + let cut = loop { + if j >= n { + break n; + } + if j - start >= MAX { + break j; + } + hash = (hash << 1).wrapping_add(gear[img[j] as usize]); + j += 1; + if j - start >= MIN && (hash & MASK) == 0 { + break j; + } + }; + let chunk = &img[start..cut]; + if !is_zero(chunk) { + *raw += chunk.len() as u64; + stored.entry(blake3_128(chunk)).or_insert_with(|| lz4_compress(chunk).len()); + } + start = cut; + } +} + +fn human(b: u64) -> String { + let f = b as f64; + if b >= 1 << 30 { + format!("{:.2} GiB", f / (1u64 << 30) as f64) + } else if b >= 1 << 20 { + format!("{:.1} MiB", f / (1u64 << 20) as f64) + } else { + format!("{:.1} KiB", f / (1u64 << 10) as f64) + } +} + +/// Sum of stored (lz4) bytes over the union of unique content hashes. 
+fn union_stored(maps: &[&HashMap]) -> u64 { + let mut u: HashMap = HashMap::new(); + for m in maps { + for (h, &s) in *m { + u.entry(*h).or_insert(s); + } + } + u.values().map(|&v| v as u64).sum() +} + +fn main() { + let dirs: Vec = std::env::args().skip(1).map(PathBuf::from).collect(); + if dirs.is_empty() { + eprintln!("usage: dedup_probe [ ...]"); + std::process::exit(2); + } + let images: Vec = dirs.iter().map(|d| load_image(d)).collect(); + let gear = gear_table(); + + eprintln!("Building real ext4 images (unaligned + aligned) via production convert path..."); + let mut unaligned: Vec> = Vec::new(); + let mut aligned: Vec> = Vec::new(); + for img in &images { + eprint!(" {} ...", img.name); + unaligned.push(read_all(build_ext4(img, false))); + aligned.push(read_all(build_ext4(img, true))); + eprintln!(" done"); + } + + // ===================================================================== + // E1 — MECHANISM: same bytes, shifted. Isolates alignment, zero confound. + // ===================================================================== + println!("\n================ E1: self-shift control ({}) ================", images[0].name); + println!("Dedup of the image against a copy of ITSELF shifted by δ bytes."); + println!("(fixed {GRID}-byte grid; shared = blocks whose content hash matches)"); + let base = &unaligned[0]; + let mut base_blocks: HashMap = HashMap::new(); + let mut base_n = 0u64; + for blk in base.chunks(GRID) { + if !is_zero(blk) { + base_blocks.insert(blake3_128(blk), ()); + base_n += 1; + } + } + for &delta in &[0usize, 4096, 8192, 65536, GRID, 2 * GRID] { + let shifted = &base[delta.min(base.len())..]; + let mut shared = 0u64; + let mut total = 0u64; + for blk in shifted.chunks(GRID) { + if is_zero(blk) { + continue; + } + total += 1; + if base_blocks.contains_key(&blake3_128(blk)) { + shared += 1; + } + } + let pct = if total > 0 { 100.0 * shared as f64 / total as f64 } else { 0.0 }; + let tag = if delta % GRID == 0 { " (whole-grid multiple 
→ control)" } else { " (sub-grid shift)" }; + println!(" δ = {:>7} B : {:>5.1}% blocks dedup ({shared}/{total}){tag}", delta, pct); + } + println!(" base non-zero blocks: {base_n}"); + + // ===================================================================== + // E2 + E3 — same real ext4 bytes under 3 schemes. + // ===================================================================== + let mut fixed_maps = Vec::new(); + let mut aligned_maps = Vec::new(); + let mut cdc_maps = Vec::new(); + let (mut fixed_raw, mut aligned_raw, mut cdc_raw) = (0u64, 0u64, 0u64); + let (mut unaligned_zero, mut aligned_zero) = (0u64, 0u64); + + println!("\n================ E2/E3: per-image + cross-image dedup ================"); + println!("Same production ext4 bytes. Three schemes:"); + println!(" TODAY = fixed 128 KiB grid on the unaligned image (what ships now)"); + println!(" ALIGNED = fixed 128 KiB grid on the aligned image (the proposed fix)"); + println!(" CDC = content-defined chunking on the unaligned image (ceiling)\n"); + for (i, img) in images.iter().enumerate() { + let mut fm = HashMap::new(); + let mut am = HashMap::new(); + let mut cm = HashMap::new(); + let (mut r1, mut r2, mut r3) = (0u64, 0u64, 0u64); + let (mut z1, mut z2) = (0u64, 0u64); + fixed_grid_units(&unaligned[i], &mut fm, &mut r1, &mut z1); + fixed_grid_units(&aligned[i], &mut am, &mut r2, &mut z2); + cdc_units(&unaligned[i], &gear, &mut cm, &mut r3); + println!( + " {:14} stored: TODAY {:>10} | ALIGNED {:>10} | CDC {:>10}", + img.name, + human(fm.values().map(|&v| v as u64).sum()), + human(am.values().map(|&v| v as u64).sum()), + human(cm.values().map(|&v| v as u64).sum()), + ); + fixed_raw += r1; + aligned_raw += r2; + cdc_raw += r3; + unaligned_zero += z1; + aligned_zero += z2; + fixed_maps.push(fm); + aligned_maps.push(am); + cdc_maps.push(cm); + } + + let sum_indiv = |maps: &[HashMap]| -> u64 { + maps.iter().map(|m| m.values().map(|&v| v as u64).sum::()).sum() + }; + let fixed_indiv = 
sum_indiv(&fixed_maps); + let aligned_indiv = sum_indiv(&aligned_maps); + let cdc_indiv = sum_indiv(&cdc_maps); + let fixed_union = union_stored(&fixed_maps.iter().collect::>()); + let aligned_union = union_stored(&aligned_maps.iter().collect::>()); + let cdc_union = union_stored(&cdc_maps.iter().collect::>()); + + let row = |label: &str, raw: u64, indiv: u64, union: u64| { + let cross = indiv.saturating_sub(union); + let cross_pct = if indiv > 0 { 100.0 * cross as f64 / indiv as f64 } else { 0.0 }; + println!( + " {label:8}: store-each {:>10} | store-union {:>10} | cross-image dedup {:>9} ({:.1}%)", + human(indiv), + human(union), + human(cross), + cross_pct, + ); + let _ = raw; + }; + println!("\n --- storing ALL {} images (lz4-compressed, zeros dropped) ---", images.len()); + row("TODAY", fixed_raw, fixed_indiv, fixed_union); + row("ALIGNED", aligned_raw, aligned_indiv, aligned_union); + row("CDC", cdc_raw, cdc_indiv, cdc_union); + + println!("\n store-union is total S3/cache bytes for the whole set."); + if fixed_union > 0 { + let saved = fixed_union.saturating_sub(aligned_union); + println!( + " ALIGNED vs TODAY: {} smaller ({:.1}%) — this is the recovered dedup.", + human(saved), + 100.0 * saved as f64 / fixed_union as f64 + ); + let ceil = fixed_union.saturating_sub(cdc_union); + let captured = if ceil > 0 { 100.0 * saved as f64 / ceil as f64 } else { 0.0 }; + println!(" CDC ceiling would save {} — ALIGNED captures {:.0}% of it.", human(ceil), captured); + } + + // ===================================================================== + // Padding cost + round-trip validity of the aligned images. 
+ // ===================================================================== + println!("\n================ Cost & validity of the fix ================"); + println!( + " zero blocks (holes): unaligned {} → aligned {} (+{} blocks of padding)", + unaligned_zero, + aligned_zero, + aligned_zero.saturating_sub(unaligned_zero) + ); + println!(" padding is zeros → NOT stored: TODAY/ALIGNED store-union already exclude it."); + for (i, img) in images.iter().enumerate() { + match verify_roundtrip(&aligned[i]) { + Ok((files, bytes)) => println!( + " {:14} aligned image round-trips OK: {} files, {} of data read back", + img.name, + files, + human(bytes) + ), + Err(e) => println!(" {:14} ROUND-TRIP FAILED: {e}", img.name), + } + } + + // ===================================================================== + // V1 — DETERMINISM. Content-addressing requires byte-identical rebuilds. + // ===================================================================== + println!("\n================ V1: determinism ================"); + let a1 = read_all(build_ext4(&images[0], true)); + let a2 = read_all(build_ext4(&images[0], true)); + println!( + " {} aligned built twice: {} (and differs from unaligned: {})", + images[0].name, + if a1 == a2 { "IDENTICAL ✓" } else { "DIFFERENT ✗ — BUG" }, + if a1 != unaligned[0] { "yes ✓" } else { "no ✗" }, + ); + + // ===================================================================== + // V2 — INDEPENDENT GROUND TRUTH. Hash whole regular files (no chunking at + // all) read back from the ALIGNED images. This both proves the files + // survived alignment intact and measures the true shared-content fraction + // with a method that shares NO code with the block-grid path. 
+ // ===================================================================== + println!("\n================ V2: file-level ground truth (chunking-agnostic) ================"); + let file_maps: Vec> = + aligned.iter().map(|img| extract_file_contents(img)).collect(); + let file_indiv: u64 = file_maps.iter().map(|m| m.values().sum::()).sum(); + let mut file_union: HashMap = HashMap::new(); + for m in &file_maps { + for (h, &s) in m { + file_union.entry(*h).or_insert(s); + } + } + let file_union_bytes: u64 = file_union.values().sum(); + let file_cross = file_indiv.saturating_sub(file_union_bytes); + let file_pct = if file_indiv > 0 { 100.0 * file_cross as f64 / file_indiv as f64 } else { 0.0 }; + println!(" whole-file content hashing (raw bytes, identical files counted once):"); + println!( + " store-each {} | store-union {} | cross-image identical-file content {} ({:.1}%)", + human(file_indiv), + human(file_union_bytes), + human(file_cross), + file_pct + ); + println!(" → THIS is how much byte-identical content genuinely exists across the set."); + println!(" TODAY grid recovers 26-ish%; ALIGNED grid recovers ~this; if they match,"); + println!(" the win is real and the CDC ceiling was not mis-tuned."); + + // ===================================================================== + // V3 — PER-FILE SPOTLIGHT. Take one large file that is byte-identical in + // two images and show its blocks dedup under ALIGNED but not under TODAY. + // ===================================================================== + if images.len() >= 2 { + println!("\n================ V3: per-file spotlight (img[0] vs img[1]) ================"); + spotlight(&images, &aligned, &unaligned, &aligned_maps, &fixed_maps); + } + + // Dump images so they can be checked with the real e2fsck (external proof + // of filesystem validity, independent of our own reader). 
+    let outdir = std::path::Path::new("/tmp/dedup_verify");
+    std::fs::create_dir_all(outdir).ok();
+    for (i, img) in images.iter().enumerate() {
+        std::fs::write(outdir.join(format!("{}-unaligned.img", img.name)), &unaligned[i]).ok();
+        std::fs::write(outdir.join(format!("{}-aligned.img", img.name)), &aligned[i]).ok();
+    }
+    println!("\n images dumped to {} for external `e2fsck -fn` validation.", outdir.display());
+}
+
+/// Read every regular file from an ext4 image and map its whole-content hash to
+/// its size. Uses only the reader — no block-grid code — so it is an
+/// independent oracle for "how much identical file content exists".
+fn extract_file_contents(img: &[u8]) -> HashMap {
+    let mut out: HashMap<_, u64> = HashMap::new();
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor).expect("reader");
+    let entries = reader.walk().expect("walk");
+    for e in entries {
+        if (e.mode & 0xF000) == 0x8000 && e.size > 0 {
+            let inode = reader.read_inode(e.inode_number).expect("inode");
+            let data = reader.read_data(&inode).expect("data");
+            // Key on the content hash, value is the size; identical files
+            // within the image are counted once.
+            out.entry(blake3_128(&data)).or_insert(data.len() as u64);
+        }
+    }
+    out
+}
+
+fn read_file_by_path(img: &[u8], path: &str) -> Option<Vec<u8>> {
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor).ok()?;
+    let entries = reader.walk().ok()?;
+    for e in entries {
+        if e.path == path && (e.mode & 0xF000) == 0x8000 {
+            let inode = reader.read_inode(e.inode_number).ok()?;
+            return reader.read_data(&inode).ok();
+        }
+    }
+    None
+}
+
+fn spotlight(
+    images: &[Image],
+    aligned: &[Vec<u8>],
+    unaligned: &[Vec<u8>],
+    aligned_maps: &[HashMap],
+    fixed_maps: &[HashMap],
+) {
+    // Find a path present in BOTH images with byte-identical content and > 256 KiB.
+    let cur = std::io::Cursor::new(&aligned[0]);
+    let mut r0 = ext4::reader::Reader::new(cur).expect("reader");
+    let walk0 = r0.walk().expect("walk");
+    let mut candidates: Vec<(String, u64)> = walk0
+        .iter()
+        .filter(|e| (e.mode & 0xF000) == 0x8000 && e.size > 256 * 1024)
+        .map(|e| (e.path.clone(), e.size))
+        .collect();
+    candidates.sort_by_key(|(_, s)| std::cmp::Reverse(*s));
+
+    for (path, size) in candidates {
+        let c0 = read_file_by_path(&aligned[0], &path);
+        let c1 = read_file_by_path(&aligned[1], &path);
+        let (Some(c0), Some(c1)) = (c0, c1) else { continue };
+        if c0 != c1 || blake3_128(&c0) != blake3_128(&c1) {
+            continue; // not byte-identical across the two images
+        }
+        // Hash this file's content in 128 KiB chunks. Under ALIGNED the file
+        // starts on the grid, so its full sub-blocks ARE these content chunks.
+        let chunk_hashes: Vec<_> = c0.chunks(GRID).filter(|b| !is_zero(b)).map(blake3_128).collect();
+        let n = chunk_hashes.len();
+        // How many of this file's content-chunks appear as real grid blocks in
+        // the OTHER image (img[1]) under each scheme?
+        let in_aligned = chunk_hashes.iter().filter(|h| aligned_maps[1].contains_key(*h)).count();
+        let in_today = chunk_hashes.iter().filter(|h| fixed_maps[1].contains_key(*h)).count();
+        println!(" file: {path} ({}, byte-identical in {} & {})", human(size), images[0].name, images[1].name);
+        println!(" its {n} content-chunks found as grid blocks in {}:", images[1].name);
+        println!(
+            " under ALIGNED: {in_aligned}/{n} ({:.0}%) dedup",
+            100.0 * in_aligned as f64 / n.max(1) as f64
+        );
+        println!(
+            " under TODAY: {in_today}/{n} ({:.0}%) dedup",
+            100.0 * in_today as f64 / n.max(1) as f64
+        );
+        let _ = unaligned;
+        return;
+    }
+    println!(" (no >256 KiB byte-identical file shared across the two images found)");
+}
+
+/// Read the aligned ext4 back through the production reader to prove the image
+/// is a valid filesystem (not just bytes that happen to dedup well).
+fn verify_roundtrip(img: &[u8]) -> std::io::Result<(usize, u64)> {
+    let cursor = std::io::Cursor::new(img);
+    let mut reader = ext4::reader::Reader::new(cursor)?;
+    let entries = reader.walk()?;
+    let mut files = 0usize;
+    let mut bytes = 0u64;
+    for e in entries {
+        if (e.mode & 0xF000) == 0x8000 {
+            // regular file
+            let inode = reader.read_inode(e.inode_number)?;
+            let data = reader.read_data(&inode)?;
+            bytes += data.len() as u64;
+            files += 1;
+        }
+    }
+    Ok((files, bytes))
+}
diff --git a/glidefs/src/block/router.rs b/glidefs/src/block/router.rs
index 259b1fb..b04615b 100644
--- a/glidefs/src/block/router.rs
+++ b/glidefs/src/block/router.rs
@@ -1819,9 +1819,11 @@ impl ExportRouter {
             .await
             .map_err(|e| RouterError::OciPull(format!("failed to resolve image: {e}")))?;
 
-        // Estimate device size: compressed × 3, next power-of-2, min 64 MiB.
+        // Estimate device size: compressed × 4, next power-of-2, min 64 MiB.
+        // The ×4 (vs ×3) leaves headroom for block-grid alignment padding, which
+        // inflates the logical ext4 with holes/zeros the block store drops.
         let total_compressed: u64 = resolved.layers.iter().map(|l| l.size as u64).sum();
-        let estimated = (total_compressed * 3).max(64 * 1024 * 1024);
+        let estimated = (total_compressed * 4).max(64 * 1024 * 1024);
         let device_size = estimated.next_power_of_two();
 
         info!(
@@ -1885,6 +1887,9 @@ impl ExportRouter {
                 WriterOption::MaximumDiskSize(device_size as i64),
                 WriterOption::Uuid(uuid),
                 WriterOption::Journal(1024), // 4 MiB journal
+                // Align large file payloads to the dedup block grid (the volume
+                // block size) so the same file dedups across blessed images.
+                WriterOption::AlignData { align: block_size_u32, min_size: block_size_u32 },
             ],
         };
 
diff --git a/glidefs/src/cli/bless.rs b/glidefs/src/cli/bless.rs
index 66961f4..2ba0dea 100644
--- a/glidefs/src/cli/bless.rs
+++ b/glidefs/src/cli/bless.rs
@@ -171,10 +171,13 @@ pub async fn run_bless_oci(
         .await
         .map_err(|e| anyhow::anyhow!("failed to resolve image: {e}"))?;
 
-    // Estimate device size: sum compressed layer sizes × 3 (decompression + ext4 overhead).
-    // Round up to next power-of-2 MiB boundary. Minimum 64 MiB.
+    // Estimate device size: sum compressed layer sizes × 4 (decompression + ext4
+    // overhead + block-grid alignment headroom). Round up to next power-of-2.
+    // Minimum 64 MiB. The ×4 (vs ×3) covers the logical inflation from aligning
+    // large files to the dedup block grid; that padding is holes/zeros which the
+    // block store drops, so it costs address space, not stored bytes.
     let total_compressed: u64 = resolved.layers.iter().map(|l| l.size as u64).sum();
-    let estimated = (total_compressed * 3).max(64 * 1024 * 1024);
+    let estimated = (total_compressed * 4).max(64 * 1024 * 1024);
     let device_size = estimated.next_power_of_two();
 
     info!(
@@ -251,6 +254,12 @@ pub async fn run_bless_oci(
             WriterOption::MaximumDiskSize(device_size as i64),
             WriterOption::Uuid(uuid),
             WriterOption::Journal(1024), // 4 MiB journal
+            // Align large file payloads to the dedup block grid (the volume's
+            // 128 KiB block size) so the same file produces the same blocks
+            // across images and the host's content-addressed cache + S3 packs
+            // dedup it. Only files >= one full block are aligned, bounding the
+            // padding. See dedup_probe / fsck_validity for the validation.
+            WriterOption::AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE },
         ],
     };
 
diff --git a/glidefs/src/oci/layer_store.rs b/glidefs/src/oci/layer_store.rs
index 1b6edfa..9e9afaf 100644
--- a/glidefs/src/oci/layer_store.rs
+++ b/glidefs/src/oci/layer_store.rs
@@ -28,7 +28,7 @@
 use object_store::{ObjectStore, PutPayload};
 use serde::{Deserialize, Serialize};
 use crate::block::content_store::ContentStore;
-use crate::oci::ext4_store::{deterministic_uuid, store_ext4_stream};
+use crate::oci::ext4_store::{deterministic_uuid, store_ext4_stream, BLOCK_SIZE};
 
 /// Manifest name for a stored layer (its sole VolumeManifest).
 const LAYER_MANIFEST_NAME: &str = "layer";
@@ -92,7 +92,10 @@ impl ImageDescriptor {
 /// Zero blocks past the real content are skipped at store time, so oversizing
 /// costs nothing in storage.
 fn layer_device_size(tar_len: u64) -> u64 {
-    (tar_len.saturating_mul(2).max(64 * 1024 * 1024)).next_power_of_two()
+    // ×3 (not ×2): extra headroom for block-grid alignment padding, which
+    // inflates the logical ext4. The padding is holes/zeros dropped by the
+    // block store, so it costs address space, not stored bytes.
+    (tar_len.saturating_mul(3).max(64 * 1024 * 1024)).next_power_of_two()
 }
 
 /// Ensure a single OCI layer is stored as a content-addressed ext4 artifact.
@@ -137,6 +140,9 @@ pub async fn ensure_layer_stored(
             WriterOption::MaximumDiskSize(device_size as i64),
             WriterOption::Uuid(deterministic_uuid(digest)),
             WriterOption::Journal(1024), // 4 MiB journal — same as bless
+            // Align large file payloads to the dedup block grid (the volume's
+            // 128 KiB block size) so the same file dedups across layers/images.
+            WriterOption::AlignData { align: BLOCK_SIZE, min_size: BLOCK_SIZE },
         ],
     };
     let mut ext4_tmp = tempfile::tempfile().context("layer ext4 tempfile")?;