Add dm-flakey power-loss harness + release 0.1.5 (#31) #35

Merged
robertoberto merged 4 commits into main from power-loss/dm-flakey-harness on May 21, 2026

@robertoberto
Contributor

Closes #31 (v1 dm-flakey harness; dm-log-writes replay tracked as v2).

Releases datawal 0.1.5.

Scope

Two main commits on this branch (the two later fix-up commits only address rustfmt and clippy findings in the new examples):

  1. feat(testing): add dm-flakey power-loss simulation harness (#31) -- adds the harness and its docs.
  2. chore: release datawal 0.1.5 -- version bump, CHANGELOG entry, and README cleanup so the repo no longer claims an "alpha" status it has outgrown.

What is new in 0.1.5

Power-loss harness (Linux, root, not CI)

  • crates/datawal-core/examples/power_loss_workload.rs -- deterministic workload generator that fsyncs after every operation and appends an fsync-ordered oracle JSONL on a separate filesystem.
  • crates/datawal-core/examples/power_loss_validate.rs -- post-fault validator that reopens the store and checks invariants 3 (per-key prefix), 4 (payload integrity), and 5 (no extras) against the oracle.
  • scripts/power_loss_dm_flakey.sh -- orchestrator: backing file -> losetup -> dmsetup flakey table -> mkfs.ext4 -> mount -> workload -> flip to error_writes -> force unmount -> reload healthy -> remount -> validator.
  • scripts/power_loss_cleanup.sh -- idempotent prefix-guarded teardown.
  • docs/power-loss-testing.md -- how to run, env vars, exit codes, honest "what this is not" section.
  • docs/power-loss-results.md -- sanitized record of the first run.
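
A minimal sketch of the ordering the workload generator relies on, assuming the oracle line for an operation is appended only after the store's fsync for that operation has returned. Everything here is illustrative: `apply_op`, `run_workload`, the record format, and the signature of `write_oracle_line` are hypothetical stand-ins, and a plain append-only file replaces the datawal store.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical stand-in for one store operation: append the bytes and
/// fsync before returning, so a successful return means "durable".
fn apply_op(store: &mut File, bytes: &[u8]) -> io::Result<()> {
    store.write_all(bytes)?;
    store.sync_all()
}

/// Append one oracle line and fsync it. Called only after `apply_op`
/// succeeded, so the oracle never claims more than the store acknowledged.
fn write_oracle_line(oracle: &mut File, line: &str) -> io::Result<()> {
    oracle.write_all(line.as_bytes())?;
    oracle.write_all(b"\n")?;
    oracle.sync_all()
}

fn open_append(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Run `ops` operations; returns how many were recorded in the oracle.
fn run_workload(store_path: &Path, oracle_path: &Path, ops: u64) -> io::Result<u64> {
    let mut store = open_append(store_path)?;
    let mut oracle = open_append(oracle_path)?;
    let mut recorded = 0;
    for seq in 1..=ops {
        let record = format!("put key{} v{}\n", seq % 4, seq);
        // Stop at the first failed write or fsync: nothing after it is acknowledged.
        if apply_op(&mut store, record.as_bytes()).is_err() {
            break;
        }
        write_oracle_line(&mut oracle, &format!("{{\"seq\":{seq},\"op\":\"put\"}}"))?;
        recorded += 1;
    }
    Ok(recorded)
}
```

In the harness the oracle path sits on a separate filesystem, so it stays writable and readable after the device under test starts failing writes. Call `run_workload` from a `main` or a test with two paths of your choosing.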

Durability framing in README

  • New "Durability evidence" section linking fuzz, proptest, SIGKILL crash-injection, and dm-flakey.
  • Both the docs and the README carry the mandatory wording: "This is stricter than process-level crash testing but is not a substitute for real power-cut testing on real hardware." and "DataWal trusts the storage stack below it to honor fsync."

README cleanup (0.1.5)

  • Dropped "alpha" framing in favor of "pre-1.0 crate suitable for local recoverable logs where JSONL would otherwise be used, with documented limits".
  • "What is in" now reflects the actual surface: RecordLogReader (snapshot-at-open, no live tailing), scan_iter, datawal CLI (crates/datawal-cli/), four TLA+ models (including ReadWhileWrite.tla), dm-flakey harness, fs2 fd lock, explicit fsync boundary, compact_to only.
  • Limits table corrected: readers entry now describes RecordLogReader; scan() entry points at scan_iter for the lazy path; production-status note clarified.
  • Evidence stack reframed by layer instead of point-in-time counts.
  • Formal models section now says four models and adds ReadWhileWrite.tla.
  • Running section adds cargo run -p datawal-cli -- --help.

What did NOT change

  • No edits under crates/datawal-core/src/, tests/, benches/, corpus/, formal/, or fuzz/.
  • WIRE_VERSION = 1 is unchanged. Corpus fixtures are unchanged.
  • Public Rust surface is unchanged.
  • MSRV stays at 1.75.0.

Run 1 (sanitized)

Single end-to-end run executed on a Linux x86_64, apt-based host with a 6.12-series kernel and dm-flakey available. Backing was a loopback file on a local temporary filesystem; the filesystem under test was ext4. No hostname, user, paths, kernel patchlevel, mount table, partition UUID, device IDs, container IDs, environment, or raw uname/mount/env output is published. The full sanitized record is in docs/power-loss-results.md.

Result:

  • Workload: 50000 ops (47827 puts + 2173 dels), fsync per op, ~9s wall, seed 42, effective key space ~4096.
  • Fault: dmsetup reloaded to flakey ... error_writes, force unmount, reload to healthy, remount.
  • Recovery report: files_scanned=1 records_replayed=50000 tail_truncated_segs=0 tail_bytes_discarded=0 mid_stream_errors=0 last_txid_seen=50000.
  • Validator: OK observed_live=3918 oracle_live=3918 survived=3918 dropped=0 oracle_dead=178 extras=0.

Reading: invariants 3 (per-key prefix), 4 (payload integrity), and 5 (no extras) all hold: no dropped keys, no payload mismatches, no extras, nothing truncated on recovery, and tombstoned keys correctly absent.
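
One way to read the validator counters is sketched below. It assumes the oracle has been reduced to the last acknowledged effect per key (`Some(payload)` for live, `None` for deleted). The `Verdict` type, the `mismatched` counter, and `validate` are hypothetical, and this is a simplified reading of the invariants: the real validator may be more permissive, for example about an operation that became durable just before its oracle line was written.

```rust
use std::collections::HashMap;

/// Last acknowledged effect per key: Some(payload) = live, None = deleted.
type OracleEffect = HashMap<String, Option<Vec<u8>>>;

#[derive(Debug, Default, PartialEq)]
struct Verdict {
    observed_live: usize,
    oracle_live: usize,
    oracle_dead: usize,
    survived: usize,
    dropped: usize,    // acknowledged live key missing after recovery
    mismatched: usize, // key present but payload differs from the oracle
    extras: usize,     // observed key the oracle does not consider live
}

impl Verdict {
    fn ok(&self) -> bool {
        self.dropped == 0 && self.mismatched == 0 && self.extras == 0
    }
}

fn validate(oracle: &OracleEffect, observed: &HashMap<String, Vec<u8>>) -> Verdict {
    let mut v = Verdict { observed_live: observed.len(), ..Verdict::default() };
    for (key, effect) in oracle {
        match effect {
            None => v.oracle_dead += 1,
            Some(want) => {
                v.oracle_live += 1;
                match observed.get(key) {
                    Some(got) if got == want => v.survived += 1,
                    Some(_) => v.mismatched += 1,
                    None => v.dropped += 1,
                }
            }
        }
    }
    // Anything observed that the oracle does not list as live is an extra,
    // including a key whose last acknowledged operation was a delete.
    v.extras = observed
        .keys()
        .filter(|k| !matches!(oracle.get(*k), Some(Some(_))))
        .count();
    v
}
```

Under this reading, the reported run corresponds to `survived == oracle_live == observed_live` with `dropped`, `mismatched`, and `extras` all zero.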

This is one fault class (dm-flakey error_writes, clean), not a general "power-loss safe" claim. Lying storage, torn writes, reordering, bit rot, mid-workload faults, and dm-log-writes replay are explicitly out of scope for this run.

CI signals expected

rust (stable), rust (1.75.0), formal, corpus, publish-dry-run.

Release follow-up (not in this PR)

Tag v0.1.5 after CI is green on main to trigger the release job that runs cargo publish with CARGO_REGISTRY_TOKEN.

Add a Linux device-mapper based power-loss harness that exercises the
WAL under a fault model stronger than SIGKILL: the harness flips the
ext4-backing dm-flakey layer to error_writes, force-unmounts, then
remounts the layer healthy and verifies the reopened store against an
fsync-ordered oracle.

New artefacts:
- crates/datawal-core/examples/power_loss_workload.rs: deterministic
  put/delete workload, fsync per op, append-on-fsync JSONL oracle on a
  separate filesystem.
- crates/datawal-core/examples/power_loss_validate.rs: post-fault
  validator. Prints RecoveryReport, checks per-key prefix (Inv 3),
  payload integrity (Inv 4) and no-extras (Inv 5).
- scripts/power_loss_dm_flakey.sh: orchestrator. losetup -> dmsetup
  flakey healthy -> mkfs.ext4 -> mount -> workload -> reload to
  error_writes -> umount -f -> reload healthy -> remount -> validate.
  Strict prefix guardrails: /tmp/datawal-powerloss-* and
  datawal-test-* dm names only. Linux + root only.
- scripts/power_loss_cleanup.sh: idempotent teardown, prefix-guarded.
- docs/power-loss-testing.md: harness contract, env-var table, exit
  codes, 'what this is not' framing.
- docs/power-loss-results.md: sanitized record of a verified run
  (50000 ops, 47827 puts + 2173 dels, 3918 live keys; recovery
  files_scanned=1 records_replayed=50000 tail_truncated=0
  mid_stream_errors=0; validator OK with extras=0).
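
The healthy -> error_writes -> healthy flip in the orchestrator comes down to reloading device-mapper table lines. The sketch below only builds those lines as strings, following the dm-flakey table layout from the kernel documentation (`<start> <length> flakey <dev> <offset> <up interval> <down interval> [<num_features> <features>]`). The function names, the interval values, and the example device path are assumptions for illustration, not values taken from scripts/power_loss_dm_flakey.sh.

```rust
/// Interval length in seconds used for both tables (an arbitrary choice here).
const INTERVAL_SECS: u32 = 180;

/// Table line for a pass-through target: up for the whole interval, never down.
fn healthy_table(sectors: u64, dev: &str) -> String {
    format!("0 {sectors} flakey {dev} 0 {INTERVAL_SECS} 0")
}

/// Table line for the fault phase: never up, down for the whole interval,
/// with the single feature `error_writes` so every write fails with an
/// I/O error while reads still succeed.
fn error_writes_table(sectors: u64, dev: &str) -> String {
    format!("0 {sectors} flakey {dev} 0 0 {INTERVAL_SECS} 1 error_writes")
}
```

A script would pass each line to `dmsetup` when creating or reloading the mapped device; the exact invocation the harness uses is not shown in this PR text.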

Doc updates:
- README.md: Durability evidence section linking the harness.
- docs/roadmap.md: row for issue #31, dm-log-writes noted as v2.
- CHANGELOG.md: Unreleased entry.

Public Rust API unchanged. Wire format unchanged. Linux-only,
root-only, not part of CI.

Docs-and-testing release. Adds the dm-flakey power-loss harness from
#31 (committed separately) and aligns the README with the post-alpha
state of the crate.

Public Rust API unchanged. Wire format (WIRE_VERSION = 1) unchanged.
Corpus fixtures unchanged.

README cleanup:
- Replaced 'alpha crate' / 'alpha release' / 'alpha limits' / 'not
  production-ready' wording with pre-1.0 scoped-production framing.
- 'What is in' now lists RecordLogReader, scan_iter, datawal CLI,
  four TLA+ models, fuzz, proptest, crash injection, ENOSPC, soak,
  dm-flakey, Criterion benches. Removed stale fixed counts.
- 'What is not in' dropped 'Reader API / concurrent reads' (now in)
  and added 'Group commit / configurable fsync policy'.
- Limits table updated: Readers = snapshot-at-open RecordLogReader;
  scan() = eager Vec<Record> with scan_iter() lazy alternative;
  DataWal keydir = offsets in memory with on-demand CRC-checked I/O;
  Production status = scoped production use.
- Evidence stack reframed by layer (spec / wire / formal /
  parser / recovery / long-run / performance / operations) instead
  of fixed counts.
- Formal models section updated to four models including
  ReadWhileWrite.tla.
- Running section adds 'cargo run -p datawal-cli -- --help'.

Single-line collapse of write_oracle_line(OracleLine::Del { .. }) call to
satisfy rustfmt. No behavioral change.

Reformats only crates/datawal-core/examples/power_loss_workload.rs. Same
content -- the inline struct literal is now on one line so rustfmt is
happy under stable.

Two findings from cargo clippy -- -D warnings, both in the new examples
added in cd79bfa. Public API and behaviour unchanged.

power_loss_validate.rs:
- Replace the 4-tuple return of load_oracle() with a named struct
  OracleAggregate { effect, last_seq, puts, dels } so the signature
  is no longer flagged by clippy::type_complexity. Call-site uses
  field destructuring; no behavioural change.
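
A minimal sketch of this refactor, under stated assumptions: the field names `effect`, `last_seq`, `puts`, `dels` come from the commit text, while the field types, the `OracleLine::Put` variant, the variant fields, and the in-memory input (instead of reading the JSONL file) are hypothetical.

```rust
use std::collections::HashMap;

/// One oracle entry. `Del` is named in the commit text; `Put` and all
/// fields are assumed for illustration.
enum OracleLine {
    Put { seq: u64, key: String, payload: Vec<u8> },
    Del { seq: u64, key: String },
}

/// Named replacement for a `(HashMap<..>, u64, u64, u64)` return type.
#[derive(Debug, Default)]
struct OracleAggregate {
    effect: HashMap<String, Option<Vec<u8>>>,
    last_seq: u64,
    puts: u64,
    dels: u64,
}

fn load_oracle(lines: &[OracleLine]) -> OracleAggregate {
    let mut agg = OracleAggregate::default();
    for line in lines {
        match line {
            OracleLine::Put { seq, key, payload } => {
                agg.effect.insert(key.clone(), Some(payload.clone()));
                agg.last_seq = agg.last_seq.max(*seq);
                agg.puts += 1;
            }
            OracleLine::Del { seq, key } => {
                agg.effect.insert(key.clone(), None);
                agg.last_seq = agg.last_seq.max(*seq);
                agg.dels += 1;
            }
        }
    }
    agg
}

/// Call sites destructure by field name instead of by tuple position.
fn summarize(lines: &[OracleLine]) -> (usize, u64) {
    let OracleAggregate { effect, last_seq, puts, dels } = load_oracle(lines);
    debug_assert_eq!(puts + dels, lines.len() as u64);
    let live = effect.values().filter(|v| v.is_some()).count();
    (live, last_seq)
}
```

Naming the fields keeps the three `u64` counters from being swapped silently at the call site, which a positional tuple would allow.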

power_loss_workload.rs:
- Add #[allow(clippy::modulo_one)] on fn run() with a comment
  explaining why FSYNC_BATCH=1 today and why the modulo structure
  is intentional (it stays correct if the constant is ever raised
  for batched-fsync tuning).
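
The shape of that allowance might look like the sketch below. Only `FSYNC_BATCH`, `run`, and the lint name come from the commit text; the signature and loop body are hypothetical, with a counter standing in for the real fsync call.

```rust
/// Number of operations between fsyncs. 1 means "fsync after every op".
const FSYNC_BATCH: u64 = 1;

/// Returns how many fsyncs a run of `ops` operations would issue.
// With FSYNC_BATCH = 1 the check below is `i % 1 == 0`, which is always
// true and is the pattern the `modulo_one` lint targets. The modulo is
// kept so the loop stays correct if FSYNC_BATCH is ever raised.
#[allow(clippy::modulo_one)]
fn run(ops: u64) -> u64 {
    let mut fsyncs = 0;
    for i in 1..=ops {
        // ... apply operation i to the store ...
        if i % FSYNC_BATCH == 0 {
            fsyncs += 1; // stand-in for the real fsync call
        }
    }
    fsyncs
}
```

With the constant at 1 this yields one fsync per operation, matching the "fsync per op" workload described in the PR; raising the constant to N would yield one fsync every N operations without touching the loop.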

Verified locally:
- cargo clippy --workspace --all-targets -- -D warnings   clean
- cargo fmt --all -- --check                              clean
- cargo test --workspace                                  passes
robertoberto merged commit 7bc7466 into main on May 21, 2026
7 checks passed
robertoberto deleted the power-loss/dm-flakey-harness branch on May 21, 2026 at 18:36

Development

Successfully merging this pull request may close these issues.

Roadmap: power-loss simulation with Linux device-mapper