Add dm-flakey power-loss harness + release 0.1.5 (#31) by robertoberto · Pull Request #35 · deepcausa/datawal

robertoberto · 2026-05-21T18:13:10Z

Closes #31 (v1 dm-flakey harness; dm-log-writes replay tracked as v2).

Releases datawal 0.1.5.

Scope

Two commits on this branch:

feat(testing): add dm-flakey power-loss simulation harness (#31) -- adds the harness and its docs.
chore: release datawal 0.1.5 -- version bump, CHANGELOG entry, and README cleanup so the repo no longer claims an "alpha" status it has outgrown.

What is new in 0.1.5

Power-loss harness (Linux, root, not CI)

crates/datawal-core/examples/power_loss_workload.rs -- deterministic workload generator that fsyncs after every operation and appends an fsync-ordered oracle JSONL on a separate filesystem.
crates/datawal-core/examples/power_loss_validate.rs -- post-fault validator that reopens the store and checks invariants 3 (per-key prefix), 4 (payload integrity), and 5 (no extras) against the oracle.
scripts/power_loss_dm_flakey.sh -- orchestrator: backing file -> losetup -> dmsetup flakey table -> mkfs.ext4 -> mount -> workload -> flip to error_writes -> force unmount -> reload healthy -> remount -> validator.
scripts/power_loss_cleanup.sh -- idempotent prefix-guarded teardown.
docs/power-loss-testing.md -- how to run, env vars, exit codes, honest "what this is not" section.
docs/power-loss-results.md -- sanitized record of the first run.
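
The "fsync-ordered oracle" idea in the workload generator can be sketched as follows. This is a minimal illustration, not the code in power_loss_workload.rs: the Op type, the JSONL field names, and the plain-text stand-in for the store are all hypothetical, and keys and values are assumed to need no JSON escaping. The point it shows is the ordering: an oracle line is appended only after the store's fsync has returned, so every oracle line describes an operation the store already acknowledged as durable.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical operation type; the real oracle schema is not shown in this PR.
enum Op<'a> {
    Put { key: &'a str, value: &'a str },
    Del { key: &'a str },
}

/// Open a file for appending, creating it if needed.
fn open_append(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Apply one op to a stand-in store file, fsync it, and only then append the
/// matching JSONL line to the oracle (which should live on a separate filesystem).
fn apply_and_record(store: &mut File, oracle: &mut File, seq: u64, op: &Op) -> io::Result<()> {
    // Step 1: write to the store under test and force it to stable storage.
    match op {
        Op::Put { key, value } => writeln!(store, "PUT {key} {value}")?,
        Op::Del { key } => writeln!(store, "DEL {key}")?,
    }
    store.sync_all()?;

    // Step 2: the store fsync has returned, so the op counts as acknowledged.
    // Assumption: keys and values contain no characters that need JSON escaping.
    let line = match op {
        Op::Put { key, value } => {
            format!("{{\"seq\":{seq},\"op\":\"put\",\"key\":\"{key}\",\"value\":\"{value}\"}}")
        }
        Op::Del { key } => format!("{{\"seq\":{seq},\"op\":\"del\",\"key\":\"{key}\"}}"),
    };
    writeln!(oracle, "{line}")?;
    oracle.sync_all()
}
```

If the process or device fails between step 1 and step 2, the store may hold one more operation than the oracle, but never fewer, which is the direction a validator can tolerate.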

Durability framing in README

New "Durability evidence" section linking fuzz, proptest, SIGKILL crash-injection, and dm-flakey.
Both the docs and the README carry the mandatory wording: "This is stricter than process-level crash testing but is not a substitute for real power-cut testing on real hardware." and "DataWal trusts the storage stack below it to honor fsync."

README cleanup (0.1.5)

Dropped "alpha" framing in favor of "pre-1.0 crate suitable for local recoverable logs where JSONL would otherwise be used, with documented limits".
"What is in" now reflects the actual surface: RecordLogReader (snapshot-at-open, no live tailing), scan_iter, datawal CLI (crates/datawal-cli/), four TLA+ models (including ReadWhileWrite.tla), dm-flakey harness, fs2 fd lock, explicit fsync boundary, compact_to only.
Limits table corrected: readers entry now describes RecordLogReader; scan() entry points at scan_iter for the lazy path; production-status note clarified.
Evidence stack reframed by layer instead of point-in-time counts.
Formal models section now says four models and adds ReadWhileWrite.tla.
Running section adds cargo run -p datawal-cli -- --help.

What did NOT change

No edits under crates/datawal-core/src/, tests/, benches/, corpus/, formal/, or fuzz/.
WIRE_VERSION = 1 is unchanged. Corpus fixtures are unchanged.
Public Rust surface is unchanged.
MSRV stays at 1.75.0.

Run 1 (sanitized)

Single end-to-end run executed on a Linux x86_64, apt-based host with kernel 6.12 series and dm-flakey available. Backing was a loopback file on a local temporary filesystem; filesystem under test was ext4. No hostname, user, paths, kernel patchlevel, mount table, partition UUID, device IDs, container IDs, environment, or raw uname/mount/env output is published. Full sanitized record in docs/power-loss-results.md.

Result:

Workload: 50000 ops (47827 puts + 2173 dels), fsync per op, ~9s wall, seed 42, effective key space ~4096.
Fault: dmsetup reloaded to flakey ... error_writes, force unmount, reload to healthy, remount.
Recovery report: files_scanned=1 records_replayed=50000 tail_truncated_segs=0 tail_bytes_discarded=0 mid_stream_errors=0 last_txid_seen=50000.
Validator: OK observed_live=3918 oracle_live=3918 survived=3918 dropped=0 oracle_dead=178 extras=0.
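
For readers wondering what "deterministic workload, seed 42, key space ~4096" can look like in practice, here is a rough sketch. It is an assumption-laden stand-in: the PRNG (SplitMix64), the Op type, and the DEL_ONE_IN ratio are hypothetical and are not taken from power_loss_workload.rs. It only demonstrates that a fixed seed yields the same put/delete sequence on every run, which is what lets a failed run be reproduced.

```rust
/// SplitMix64, a small well-known PRNG, used here purely as a stand-in.
/// The generator actually used by the harness is not shown in this PR.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical operation type with integer keys and payloads.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Op {
    Put { key: u64, value: u64 },
    Del { key: u64 },
}

/// Key space matching the "~4096" figure reported for the run.
const KEY_SPACE: u64 = 4096;
/// Assumed delete frequency (one in twenty); the real ratio is not documented here.
const DEL_ONE_IN: u64 = 20;

/// Generate `ops` operations that depend only on `seed`.
fn generate(seed: u64, ops: usize) -> Vec<Op> {
    let mut rng = SplitMix64(seed);
    (0..ops)
        .map(|_| {
            let key = rng.next_u64() % KEY_SPACE;
            if rng.next_u64() % DEL_ONE_IN == 0 {
                Op::Del { key }
            } else {
                Op::Put { key, value: rng.next_u64() }
            }
        })
        .collect()
}
```

Calling generate(42, 50_000) twice returns identical vectors, so a rerun with the same seed replays the same workload against a fresh device.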

Reading: invariants 3 (per-key prefix), 4 (payload integrity), and 5 (no extras) all hold; no payload mismatches; nothing truncated on recovery; tombstones correctly absent.
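
One way to read the validator counters is as a comparison between two maps: the oracle's final acknowledged effect per key and the live keys observed after reopening the store. The sketch below is a guess at that bookkeeping, not the logic in power_loss_validate.rs. The Verdict struct and compare function are hypothetical, the field names are borrowed from the validator output line, and their meanings here are assumptions. It covers only the situation in this run, where the fault lands after the workload finished, so the expected state for each key is simply its last oracle entry.

```rust
use std::collections::HashMap;

/// Hypothetical summary mirroring the counters printed by the validator.
#[derive(Debug, Default, PartialEq, Eq)]
struct Verdict {
    observed_live: usize,
    oracle_live: usize,
    survived: usize,
    dropped: usize,
    oracle_dead: usize,
    extras: usize,
    payload_mismatches: usize,
}

impl Verdict {
    /// Assumed pass condition: nothing lost, nothing extra, nothing corrupted.
    fn ok(&self) -> bool {
        self.dropped == 0 && self.extras == 0 && self.payload_mismatches == 0
    }
}

/// `oracle`: last acknowledged effect per key (Some(payload) = live, None = deleted).
/// `observed`: live keys and payloads read back from the reopened store.
fn compare(
    oracle: &HashMap<String, Option<Vec<u8>>>,
    observed: &HashMap<String, Vec<u8>>,
) -> Verdict {
    let mut v = Verdict { observed_live: observed.len(), ..Default::default() };
    for (key, effect) in oracle {
        match (effect, observed.get(key)) {
            // Live in both, same bytes: the acknowledged write survived.
            (Some(want), Some(got)) if want == got => {
                v.oracle_live += 1;
                v.survived += 1;
            }
            // Live in both, different bytes: payload integrity violated.
            (Some(_), Some(_)) => {
                v.oracle_live += 1;
                v.payload_mismatches += 1;
            }
            // Acknowledged as live but missing after recovery.
            (Some(_), None) => {
                v.oracle_live += 1;
                v.dropped += 1;
            }
            // Acknowledged delete that came back: counted as an extra.
            (None, Some(_)) => {
                v.oracle_dead += 1;
                v.extras += 1;
            }
            // Tombstone correctly absent.
            (None, None) => v.oracle_dead += 1,
        }
    }
    // Keys the oracle never mentions at all are also extras.
    v.extras += observed.keys().filter(|k| !oracle.contains_key(*k)).count();
    v
}
```

Under these assumed definitions, the reported run corresponds to survived equal to oracle_live with dropped, extras, and payload mismatches all zero.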

This is one fault class (dm-flakey error_writes, clean), not a general "power-loss safe" claim. Lying storage, torn writes, reordering, bit rot, mid-workload faults, and dm-log-writes replay are explicitly out of scope for this run.

CI signals expected

rust (stable), rust (1.75.0), formal, corpus, publish-dry-run.

Release follow-up (not in this PR)

Tag v0.1.5 after CI is green on main to trigger the release job that runs cargo publish with CARGO_REGISTRY_TOKEN.

Add a Linux device-mapper based power-loss harness that exercises the WAL under a fault model stronger than SIGKILL: the harness flips the ext4-backing dm-flakey layer to error_writes, force-unmounts, then remounts the layer healthy and verifies the reopened store against an fsync-ordered oracle.

New artefacts:
- crates/datawal-core/examples/power_loss_workload.rs: deterministic put/delete workload, fsync per op, append-on-fsync JSONL oracle on a separate filesystem.
- crates/datawal-core/examples/power_loss_validate.rs: post-fault validator. Prints RecoveryReport, checks per-key prefix (Inv 3), payload integrity (Inv 4) and no-extras (Inv 5).
- scripts/power_loss_dm_flakey.sh: orchestrator. losetup -> dmsetup flakey healthy -> mkfs.ext4 -> mount -> workload -> reload to error_writes -> umount -f -> reload healthy -> remount -> validate. Strict prefix guardrails: /tmp/datawal-powerloss-* and datawal-test-* dm names only. Linux + root only.
- scripts/power_loss_cleanup.sh: idempotent teardown, prefix-guarded.
- docs/power-loss-testing.md: harness contract, env-var table, exit codes, 'what this is not' framing.
- docs/power-loss-results.md: sanitized record of a verified run (50000 ops, 47827 puts + 2173 dels, 3918 live keys; recovery files_scanned=1 records_replayed=50000 tail_truncated=0 mid_stream_errors=0; validator OK with extras=0).

Doc updates:
- README.md: Durability evidence section linking the harness.
- docs/roadmap.md: row for issue #31, dm-log-writes noted as v2.
- CHANGELOG.md: Unreleased entry.

Public Rust API unchanged. Wire format unchanged. Linux-only, root-only, not part of CI.

Docs-and-testing release. Adds the dm-flakey power-loss harness from #31 (committed separately) and aligns the README with the post-alpha state of the crate. Public Rust API unchanged. Wire format (WIRE_VERSION = 1) unchanged. Corpus fixtures unchanged.

README cleanup:
- Replaced 'alpha crate' / 'alpha release' / 'alpha limits' / 'not production-ready' wording with pre-1.0 scoped-production framing.
- 'What is in' now lists RecordLogReader, scan_iter, datawal CLI, four TLA+ models, fuzz, proptest, crash injection, ENOSPC, soak, dm-flakey, Criterion benches. Removed stale fixed counts.
- 'What is not in' dropped 'Reader API / concurrent reads' (now in) and added 'Group commit / configurable fsync policy'.
- Limits table updated: Readers = snapshot-at-open RecordLogReader; scan() = eager Vec<Record> with scan_iter() lazy alternative; DataWal keydir = offsets in memory with on-demand CRC-checked I/O; Production status = scoped production use.
- Evidence stack reframed by layer (spec / wire / formal / parser / recovery / long-run / performance / operations) instead of fixed counts.
- Formal models section updated to four models including ReadWhileWrite.tla.
- Running section adds 'cargo run -p datawal-cli -- --help'.

Single-line collapse of write_oracle_line(OracleLine::Del { .. }) call to satisfy rustfmt. No behavioral change. Reformats only crates/datawal-core/examples/power_loss_workload.rs. Same content -- the inline struct literal is now on one line so rustfmt is happy under stable.

Two findings from cargo clippy -- -D warnings, both in the new examples added in cd79bfa. Public API and behaviour unchanged.

power_loss_validate.rs:
- Replace the 4-tuple return of load_oracle() with a named struct OracleAggregate { effect, last_seq, puts, dels } so the signature is no longer flagged by clippy::type_complexity. Call-site uses field destructuring; no behavioural change.

power_loss_workload.rs:
- Add #[allow(clippy::modulo_one)] on fn run() with a comment explaining why FSYNC_BATCH=1 today and why the modulo structure is intentional (it stays correct if the constant is ever raised for batched-fsync tuning).

Verified locally:
- cargo clippy --workspace --all-targets -- -D warnings clean
- cargo fmt --all -- --check clean
- cargo test --workspace passes
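
The modulo_one allowance is easier to see with a small example. The following is an illustrative reduction, not the body of fn run(): only the FSYNC_BATCH name and its value of 1 come from the commit message, while count_fsyncs and the loop shape are hypothetical. With the constant at 1 the expression `x % 1` is always 0, which is what the lint flags, yet keeping the modulo means the loop stays correct if the constant is later raised.

```rust
/// Operations per fsync. The commit message states this is 1 today;
/// raising it would batch several ops under one fsync.
const FSYNC_BATCH: u64 = 1;

/// Hypothetical helper: count how many fsyncs a run of `ops` operations issues.
/// The lint is silenced on purpose because `% FSYNC_BATCH` is only trivial
/// while the constant happens to be 1.
#[allow(clippy::modulo_one)]
fn count_fsyncs(ops: u64) -> u64 {
    let mut fsyncs = 0;
    for i in 0..ops {
        // ... apply operation `i` to the store here ...
        if (i + 1) % FSYNC_BATCH == 0 {
            // In the real workload this is where the store would be fsynced.
            fsyncs += 1;
        }
    }
    fsyncs
}
```

With FSYNC_BATCH = 1, count_fsyncs(50_000) is 50_000, matching the "fsync per op" workload described in the run.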

robertoberto added 2 commits May 21, 2026 17:54

robertoberto mentioned this pull request May 21, 2026

Roadmap: power-loss simulation with Linux device-mapper #31

Closed

robertoberto added 2 commits May 21, 2026 18:20

robertoberto merged commit 7bc7466 into main May 21, 2026
7 checks passed

robertoberto deleted the power-loss/dm-flakey-harness branch May 21, 2026 18:36

robertoberto merged 4 commits into main from power-loss/dm-flakey-harness
