Skip to content

feat(keydir): store offsets and revalidate CRC on every get (#4)#28

Merged
robertoberto merged 2 commits into
mainfrom
feat/keydir-offset
May 21, 2026
Merged

feat(keydir): store offsets and revalidate CRC on every get (#4)#28
robertoberto merged 2 commits into
mainfrom
feat/keydir-offset

Conversation

@robertoberto
Copy link
Copy Markdown
Contributor

Summary

Switches DataWal's in-memory keydir from HashMap<Vec<u8>, Vec<u8>> (full payload cached) to HashMap<Vec<u8>, RecordRef> (segment + offset + len) backed by a small LRU fd-pool. Every get now does one pread and revalidates the full record CRC before returning.

This trades RAM for I/O. A DataWal with N keys and average payload P used to hold N * P bytes of payload in memory; it now holds N RecordRefs (~40 bytes each) and at most FdPool::DEFAULT_CAPACITY open fds. Soak driver smoke continues to hold RSS stable at ~5.6 MB under ~1 GB of write traffic.

Closes #4.

What this is, what it is not

Is Is not
HashMap<Vec<u8>, RecordRef> + fd-pool LRU An mmap path or page-cache layer
pread + CRC revalidate on every get An optimistic / unchecked read path
Lazy open (no payload materialisation) A removal of the in-memory keydir
Single-threaded &mut self semantics A concurrent reader API (RecordLogReader from PR #27 covers that)

What ships

  • crates/datawal-core/src/fd_pool.rs (new, ~205 LOC + 6 unit tests)

    • Tiny VecDeque-backed LRU keyed by segment_id.
    • read_at(dir, segment_id, offset, len) uses pread (FileExt::read_exact_at) on Unix, falls back to seek+read on non-Unix.
    • Default capacity 16; configurable via with_capacity. Capacity 0 disables caching (single replaceable slot) and is used by tests.
  • crates/datawal-core/src/datawal.rs (rewritten, ~340 LOC)

  • crates/datawal-core/tests/keydir_by_offset.rs (new, 245 LOC, 8 tests)

    • get after drop+reopen, CRC revalidation on get, ref_of tracks deletes, ref_of advances on overwrite, get after rotation / compaction, fd reuse across 500 repeated gets (Linux-only /proc/self/fd assertion), items() revalidates CRC, compact_to preserves live state via the offset keydir.

Breaking changes (0.1.x semver-acceptable)

Four DataWal methods change signature:

Method Before After
get &self &mut self
items &self -> Vec<...> &mut self -> Result<Vec<...>>
compact_to &self -> Result<...> &mut self -> Result<...>
export_jsonl &self -> Result<()> &mut self -> Result<()>

All four perform reads against the underlying segments and must mutate the fd-pool. &mut self is the honest signature; no interior-mutability hack. Documented in the module rustdoc.

All in-tree callers (tests, examples, benches, CLI cmd_get, soak driver) are updated to bind handles as let mut. No semantic change in any of them.

What stays the same

  • Wire format frozen at WIRE_VERSION = 1. Six corpus fixtures unchanged.
  • Recovery is still longest valid prefix. Tail truncation on the active segment is non-fatal, CRC mismatch on a sealed segment is a hard error during scan_iter.
  • Single-writer, single-process invariant via fs2 exclusive lock on .lock is unchanged.
  • Compaction stays snapshot-style (source untouched).
  • MSRV stays at 1.75.0.
  • Public surface adds exactly one method (DataWal::ref_of). No new types: RecordRef was already re-exported.

Local validation

Gate Result
cargo fmt --all -- --check clean
cargo clippy --workspace --all-targets -- -D warnings clean
cargo test --workspace --all-targets 137 passed, 1 ignored
RUSTDOCFLAGS='-D warnings' cargo doc --workspace --no-deps clean
cargo bench --workspace --no-run clean
cargo publish --dry-run -p datawal --allow-dirty clean

Out of scope (deferred)

  • No fd-pool tuning APIs on DataWal (default 16 is plenty for current workloads; revisit when benchmarks demand it).
  • No pread_record low-level escape hatch.
  • No mmap path. pread keeps semantics simple and matches what the CRC check expects.
  • No benchmark refresh in this PR -- saved for the 0.1.4 release PR (Roadmap: crash injection and soak testing #8 in the kit roadmap).

Switches `DataWal`'s in-memory keydir from
`HashMap<Vec<u8>, Vec<u8>>` (full payload cached) to
`HashMap<Vec<u8>, RecordRef>` (segment + offset + len) backed by a
small LRU fd-pool. Every `get` now does one `pread` and revalidates
the full record CRC before returning.

This trades RAM for I/O. A DataWal with N keys and average payload P
used to hold N * P bytes of payload in memory; it now holds N
RecordRefs (~40 bytes each) and at most `FdPool::DEFAULT_CAPACITY`
open fds. Soak driver smoke continues to hold RSS stable at ~5.6 MB
under 1 GB of write traffic.

What ships
----------

* `crates/datawal-core/src/fd_pool.rs` (new, ~205 LOC + 6 unit tests)
  - Tiny VecDeque-backed LRU keyed by `segment_id`.
  - `read_at(dir, segment_id, offset, len)` uses `pread`
    (`FileExt::read_exact_at`) on Unix, falls back to seek+read on
    non-Unix.
  - Default capacity 16, configurable via `with_capacity`.
  - Capacity 0 disables caching (single replaceable slot) and is
    used by tests.

* `crates/datawal-core/src/datawal.rs` (rewritten, ~340 LOC)
  - Field `map` is now `HashMap<Vec<u8>, RecordRef>`.
  - `open` rebuilds the keydir lazily via `scan_iter` (PR #23),
    never materialising payloads.
  - `get(&mut self, key)` reads the record bytes through the
    fd-pool, decodes them, checks `record_type == Put`, and
    validates that the consumed byte count matches `RecordRef.len`.
    Truncation, CRC mismatch, and structural decode errors are
    reported with explicit segment/offset context.
  - New helper `ref_of(&self, key) -> Option<RecordRef>` exposes
    the keydir entry without touching disk.

* `crates/datawal-core/tests/keydir_by_offset.rs` (new, 245 LOC, 8
  tests)
  - get after drop+reopen, CRC revalidation on get, ref_of tracks
    deletes, ref_of advances on overwrite, get after rotation /
    compaction, fd reuse across 500 repeated gets (Linux-only
    /proc/self/fd assertion), items() revalidates CRC, compact_to
    preserves live state via the offset keydir.

Breaking changes (0.1.x semver-acceptable per owner)
----------------------------------------------------

Four `DataWal` methods change signature:

* `get`         : `&self` -> `&mut self`
* `items`       : `&self -> Vec<...>` -> `&mut self -> Result<Vec<...>>`
* `compact_to`  : `&self -> Result<...>` -> `&mut self -> Result<...>`
* `export_jsonl`: `&self -> Result<()>` -> `&mut self -> Result<()>`

All four perform reads against the underlying segments and must
mutate the fd-pool. `&mut self` is the honest signature; no
interior-mutability hack. Documented in the module rustdoc.

All in-tree callers (tests, examples, benches, CLI `cmd_get`,
soak driver) are updated to bind handles as `let mut`. No semantic
change in any of them.

What stays the same
-------------------

* Wire format frozen at WIRE_VERSION=1. Six corpus fixtures
  unchanged.
* Recovery is still longest valid prefix; tail truncation on the
  active segment is non-fatal, CRC mismatch on a sealed segment is
  a hard error during scan_iter.
* Single-writer, single-process invariant via `fs2` exclusive lock
  on `.lock` is unchanged.
* Compaction stays snapshot-style (source untouched).
* MSRV stays at 1.75.0.
* Public surface adds exactly one method (`DataWal::ref_of`). No
  new types: `RecordRef` was already re-exported.

Local validation
----------------

cargo fmt --all -- --check          : clean
cargo clippy --workspace --all-targets -- -D warnings : clean
cargo test --workspace --all-targets : 137 passed, 1 ignored
cargo doc --workspace --no-deps      : clean (RUSTDOCFLAGS=-D warnings)
cargo bench --workspace --no-run     : clean
cargo publish --dry-run -p datawal --allow-dirty : clean

Out of scope (deferred)
-----------------------

* No fd-pool tuning APIs on `DataWal` (default 16 is plenty for
  the current workloads; revisit when benchmarks demand it).
* No `pread_record` low-level escape hatch.
* No mmap path. `pread` keeps semantics simple and matches what the
  CRC check expects.
* No benchmark refresh in this PR -- saved for the 0.1.4 release PR
  (#8).

Refs #4.
The fd_pool_reuses_fd_across_repeated_gets test asserted that the
process-level fd count grew by no more than 4 across 500 gets. The
GHA MSRV 1.75 runner reliably grows by 7 due to harness/runner fds
unrelated to the pool, which has DEFAULT_CAPACITY=16 and at most
one segment opened in this scenario.

Relax the bound to capacity + slack (32) so the test signals only
when the pool genuinely fails to reuse, not when the runner happens
to hold extra fds. Local re-run: passes.
@robertoberto robertoberto merged commit ef1530f into main May 21, 2026
7 checks passed
@robertoberto robertoberto deleted the feat/keydir-offset branch May 21, 2026 13:39
robertoberto added a commit that referenced this pull request May 21, 2026
Replaces the TBD scaffold in docs/benchmarks/v0.1.4-reference.md
with the numbers from the criterion run on the dia host
(DATAWAL_BENCH_DIR on ext4 with relatime,discard, NVMe-PLP).

Stack:
  ext4 with relatime,discard, on a single NVMe-PLP
  raw fsync(64B) median 1785 us, p10 1720 us, p90 1858 us

Highlights vs alpha (where comparable):
- append_no_fsync: ~same as alpha (codec path unchanged).
- append_fsync_each: ~1.78 ms (alpha reported ~3.7 us; see below).
- datawal_get: ~700-780 ns (alpha ~30-46 ns). Expected: PR #28
  keydir-by-offset adds one pread + CRC32C verify per get.
- datawal_open_rebuild: ~25.6 ms at 100k (alpha ~39 ms). Faster
  by design: offset-keydir does not materialise values on replay.
- compact_to: 3-9x slower than alpha. Two factors, both expected:
  (1) real-fsync floor, (2) PR #28 re-reads each live value via
  fd-pool LRU instead of copying from memory.

Bench-dir caveat (the reason the alpha numbers differed so much in
fsync-bearing tables): a raw-fsync sanity check on the same machine
gave 0.59 us on tmpfs vs 1785 us on ext4. The alpha-reference
single-digit-microsecond fsyncs are only consistent with a tmpfs
bench dir (i.e. DATAWAL_BENCH_DIR unset, defaulting to /tmp). The
v0.1.4 run was collected with DATAWAL_BENCH_DIR pointing at the
real ext4 mount, so the new numbers reflect honest durability
costs. A retrospective note has been added at the top of
v0.1.0-alpha-reference.md flagging this; historical numbers in
that doc are kept verbatim as a record of what was published.

Code in this PR: docs only. No source / config / Cargo.* changes.
The release/0.1.4 branch is now ready for merge + tag v0.1.4 push
once the rest of the PR review wraps.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Roadmap: keydir by offset instead of storing full values

1 participant