Add dm-flakey power-loss harness + release 0.1.5 (#31) #35
Merged
Add a Linux device-mapper based power-loss harness that exercises the WAL under a fault model stronger than SIGKILL: the harness flips the ext4-backing dm-flakey layer to error_writes, force-unmounts, then remounts the layer healthy and verifies the reopened store against an fsync-ordered oracle.

New artefacts:
- crates/datawal-core/examples/power_loss_workload.rs: deterministic put/delete workload, fsync per op, append-on-fsync JSONL oracle on a separate filesystem.
- crates/datawal-core/examples/power_loss_validate.rs: post-fault validator. Prints RecoveryReport, checks per-key prefix (Inv 3), payload integrity (Inv 4) and no-extras (Inv 5).
- scripts/power_loss_dm_flakey.sh: orchestrator. losetup -> dmsetup flakey healthy -> mkfs.ext4 -> mount -> workload -> reload to error_writes -> umount -f -> reload healthy -> remount -> validate. Strict prefix guardrails: /tmp/datawal-powerloss-* and datawal-test-* dm names only. Linux + root only.
- scripts/power_loss_cleanup.sh: idempotent teardown, prefix-guarded.
- docs/power-loss-testing.md: harness contract, env-var table, exit codes, 'what this is not' framing.
- docs/power-loss-results.md: sanitized record of a verified run (50000 ops, 47827 puts + 2173 dels, 3918 live keys; recovery files_scanned=1 records_replayed=50000 tail_truncated=0 mid_stream_errors=0; validator OK with extras=0).

Doc updates:
- README.md: Durability evidence section linking the harness.
- docs/roadmap.md: row for issue #31, dm-log-writes noted as v2.
- CHANGELOG.md: Unreleased entry.

Public Rust API unchanged. Wire format unchanged. Linux-only, root-only, not part of CI.
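The "fsync per op, append-on-fsync oracle" ordering described above can be sketched roughly as follows. This is not the code in power_loss_workload.rs: the Op enum, the apply_and_record and open_append helpers, the toy record encoding, and the JSON field names are all assumptions made for illustration. The only point it tries to show is the ordering, namely that the oracle line is appended after the store's fsync has returned, so the oracle never claims an operation the store was not asked to make durable.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical stand-in for one workload operation.
enum Op<'a> {
    Put { key: &'a str, value: &'a [u8] },
    Del { key: &'a str },
}

/// Append one record to the store file and fsync it, then (and only then)
/// append the matching line to the oracle file and fsync that as well.
fn apply_and_record(
    store: &mut File,
    oracle: &mut File,
    seq: u64,
    op: &Op<'_>,
) -> io::Result<()> {
    // Toy text encoding, NOT the datawal wire format.
    match op {
        Op::Put { key, value } => writeln!(store, "P {} {} {}", seq, key, value.len())?,
        Op::Del { key } => writeln!(store, "D {} {}", seq, key)?,
    }
    store.sync_all()?; // durability point for the store

    // Assumed JSONL shape; keys are not escaped in this sketch.
    let line = match op {
        Op::Put { key, value } => format!(
            r#"{{"seq":{},"op":"put","key":"{}","len":{}}}"#,
            seq,
            key,
            value.len()
        ),
        Op::Del { key } => format!(r#"{{"seq":{},"op":"del","key":"{}"}}"#, seq, key),
    };
    writeln!(oracle, "{}", line)?;
    oracle.sync_all() // the oracle line is durable only after the store write was
}

/// Open a file for appending, creating it if needed.
fn open_append(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("oracle-sketch-{}", std::process::id()));
    std::fs::create_dir_all(&dir)?;
    let mut store = open_append(&dir.join("store.log"))?;
    let mut oracle = open_append(&dir.join("oracle.jsonl"))?;

    apply_and_record(&mut store, &mut oracle, 1, &Op::Put { key: "a", value: b"hello" })?;
    apply_and_record(&mut store, &mut oracle, 2, &Op::Del { key: "a" })?;

    print!("{}", std::fs::read_to_string(dir.join("oracle.jsonl"))?);
    std::fs::remove_dir_all(&dir)
}
```

In the real harness the oracle lives on a separate filesystem from the store under test, so it stays writable after the dm-flakey layer starts failing writes; the sketch uses one temporary directory only to stay self-contained.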
Docs-and-testing release. Adds the dm-flakey power-loss harness from #31 (committed separately) and aligns the README with the post-alpha state of the crate. Public Rust API unchanged. Wire format (WIRE_VERSION = 1) unchanged. Corpus fixtures unchanged.

README cleanup:
- Replaced 'alpha crate' / 'alpha release' / 'alpha limits' / 'not production-ready' wording with pre-1.0 scoped-production framing.
- 'What is in' now lists RecordLogReader, scan_iter, datawal CLI, four TLA+ models, fuzz, proptest, crash injection, ENOSPC, soak, dm-flakey, Criterion benches. Removed stale fixed counts.
- 'What is not in' dropped 'Reader API / concurrent reads' (now in) and added 'Group commit / configurable fsync policy'.
- Limits table updated: Readers = snapshot-at-open RecordLogReader; scan() = eager Vec<Record> with scan_iter() lazy alternative; DataWal keydir = offsets in memory with on-demand CRC-checked I/O; Production status = scoped production use.
- Evidence stack reframed by layer (spec / wire / formal / parser / recovery / long-run / performance / operations) instead of fixed counts.
- Formal models section updated to four models including ReadWhileWrite.tla.
- Running section adds 'cargo run -p datawal-cli -- --help'.
Single-line collapse of write_oracle_line(OracleLine::Del { .. }) call to
satisfy rustfmt. No behavioral change.
Reformats only crates/datawal-core/examples/power_loss_workload.rs. Same
content -- the inline struct literal is now on one line so rustfmt is
happy under stable.
Two findings under clippy -D warnings, both in the new examples added in cd79bfa. Public API and behaviour unchanged.

power_loss_validate.rs:
- Replace the 4-tuple return of load_oracle() with a named struct OracleAggregate { effect, last_seq, puts, dels } so the signature is no longer flagged by clippy::type_complexity. The call site uses field destructuring; no behavioural change.

power_loss_workload.rs:
- Add #[allow(clippy::modulo_one)] on fn run() with a comment explaining why FSYNC_BATCH=1 today and why the modulo structure is intentional (it stays correct if the constant is ever raised for batched-fsync tuning).

Verified locally:
- cargo clippy --workspace --all-targets -- -D warnings: clean
- cargo fmt --all -- --check: clean
- cargo test --workspace: passes
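For readers who have not opened the examples, the two clippy fixes have roughly this shape. The struct name and its four field names come from the commit message; the field types, the Effect and OracleLine definitions, and the fsync_due helper are assumptions, and the real #[allow] sits on fn run() rather than on a helper.

```rust
use std::collections::BTreeMap;

/// Last effect recorded for a key (assumed shape).
#[derive(Debug, Clone, PartialEq)]
enum Effect {
    Put(Vec<u8>),
    Del,
}

/// One parsed oracle line (assumed fields).
enum OracleLine {
    Put { seq: u64, key: String, value: Vec<u8> },
    Del { seq: u64, key: String },
}

/// Named aggregate in place of a 4-tuple return, which is what
/// clippy::type_complexity objected to. Field types are assumptions.
#[derive(Debug, Default)]
struct OracleAggregate {
    effect: BTreeMap<String, Effect>,
    last_seq: u64,
    puts: u64,
    dels: u64,
}

/// Fold oracle lines into the per-key last effect plus counters.
fn load_oracle(lines: Vec<OracleLine>) -> OracleAggregate {
    let mut agg = OracleAggregate::default();
    for line in lines {
        match line {
            OracleLine::Put { seq, key, value } => {
                agg.effect.insert(key, Effect::Put(value));
                agg.puts += 1;
                agg.last_seq = agg.last_seq.max(seq);
            }
            OracleLine::Del { seq, key } => {
                agg.effect.insert(key, Effect::Del);
                agg.dels += 1;
                agg.last_seq = agg.last_seq.max(seq);
            }
        }
    }
    agg
}

/// One fsync per operation today; raising this would batch fsyncs.
const FSYNC_BATCH: u64 = 1;

/// `x % 1` is always 0, which clippy::modulo_one flags. The modulo is kept
/// so the check stays correct if FSYNC_BATCH is ever raised.
#[allow(clippy::modulo_one)]
fn fsync_due(op_index: u64) -> bool {
    (op_index + 1) % FSYNC_BATCH == 0
}

fn main() {
    let lines = vec![
        OracleLine::Put { seq: 1, key: "a".into(), value: vec![1] },
        OracleLine::Put { seq: 2, key: "b".into(), value: vec![2] },
        OracleLine::Del { seq: 3, key: "a".into() },
    ];
    // Call site uses field destructuring instead of tuple positions.
    let OracleAggregate { effect, last_seq, puts, dels } = load_oracle(lines);
    println!("keys={} last_seq={} puts={} dels={}", effect.len(), last_seq, puts, dels);
    println!("fsync due after first op: {}", fsync_due(0));
}
```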
Closes #31 (v1 dm-flakey harness; dm-log-writes replay tracked as v2).
Releases datawal 0.1.5.
Scope
Two commits on this branch:
- feat(testing): add dm-flakey power-loss simulation harness (#31) -- adds the harness and its docs.
- chore: release datawal 0.1.5 -- version bump, CHANGELOG entry, and README cleanup so the repo no longer claims an "alpha" status it has outgrown.
What is new in 0.1.5
Power-loss harness (Linux, root, not CI)
- crates/datawal-core/examples/power_loss_workload.rs -- deterministic workload generator that fsyncs after every operation and appends an fsync-ordered oracle JSONL on a separate filesystem.
- crates/datawal-core/examples/power_loss_validate.rs -- post-fault validator that reopens the store and checks per-key prefix invariants 3, 4, 5 against the oracle.
- scripts/power_loss_dm_flakey.sh -- orchestrator: backing file -> losetup -> dmsetup flakey table -> mkfs.ext4 -> mount -> workload -> flip to error_writes -> force unmount -> reload healthy -> remount -> validator.
- scripts/power_loss_cleanup.sh -- idempotent prefix-guarded teardown.
- docs/power-loss-testing.md -- how to run, env vars, exit codes, honest "what this is not" section.
- docs/power-loss-results.md -- sanitized record of the first run.
Durability framing in README
fsync."
README cleanup (0.1.5)
- RecordLogReader (snapshot-at-open, no live tailing), scan_iter, datawal CLI (crates/datawal-cli/), four TLA+ models (including ReadWhileWrite.tla), dm-flakey harness, fs2 fd lock, explicit fsync boundary, compact_to only.
- RecordLogReader; scan() entry points at scan_iter for the lazy path; production-status note clarified.
- ReadWhileWrite.tla.
- Running section adds cargo run -p datawal-cli -- --help.
What did NOT change
- crates/datawal-core/src/, tests/, benches/, corpus/, formal/, or fuzz/.
- WIRE_VERSION = 1 is unchanged. Corpus fixtures are unchanged.
Run 1 (sanitized)
Single end-to-end run executed on a Linux x86_64, apt-based host with a 6.12-series kernel and dm-flakey available. Backing was a loopback file on a local temporary filesystem; the filesystem under test was ext4. No hostname, user, paths, kernel patchlevel, mount table, partition UUID, device IDs, container IDs, environment, or raw uname / mount / env output is published. Full sanitized record in docs/power-loss-results.md.
Result:
- fsync per op, ~9s wall, seed 42, effective key space ~4096.
- dmsetup reloaded to flakey ... error_writes, force unmount, reload to healthy, remount.
- files_scanned=1 records_replayed=50000 tail_truncated_segs=0 tail_bytes_discarded=0 mid_stream_errors=0 last_txid_seen=50000.
- OK observed_live=3918 oracle_live=3918 survived=3918 dropped=0 oracle_dead=178 extras=0.
Reading: per-key prefix invariants 3, 4, 5 all hold; no extras; no payload mismatches; nothing truncated on recovery; tombstones correctly absent.
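One plausible reading of the validator counters is sketched below. The real definitions of invariants 3, 4, and 5 live in power_loss_validate.rs and are not reproduced in this PR description, so the Verdict struct, the validate function, and the exact meaning given to survived, dropped, and extras here are assumptions. The sketch compares only final per-key state and does not model the prefix check against sequence numbers.

```rust
use std::collections::BTreeMap;

/// Counters modelled on the validator's output line (assumed semantics).
#[derive(Debug, Default, PartialEq)]
struct Verdict {
    observed_live: usize,
    oracle_live: usize,
    survived: usize,
    dropped: usize,
    oracle_dead: usize,
    extras: usize,
    payload_mismatches: usize,
}

/// `oracle` maps each key to its last fsynced effect: Some(payload) for a
/// put, None for a delete. `observed` is the live key set after reopening.
fn validate(
    oracle: &BTreeMap<String, Option<Vec<u8>>>,
    observed: &BTreeMap<String, Vec<u8>>,
) -> Verdict {
    let mut v = Verdict { observed_live: observed.len(), ..Verdict::default() };
    for (key, last) in oracle {
        match last {
            Some(want) => {
                v.oracle_live += 1;
                match observed.get(key) {
                    Some(got) if got == want => v.survived += 1,
                    Some(_) => v.payload_mismatches += 1, // payload integrity
                    None => v.dropped += 1,               // acknowledged put lost
                }
            }
            None => {
                v.oracle_dead += 1;
                if observed.contains_key(key) {
                    v.extras += 1; // tombstoned key resurfaced
                }
            }
        }
    }
    // Keys the oracle never recorded also count as extras in this sketch.
    v.extras += observed.keys().filter(|k| !oracle.contains_key(*k)).count();
    v
}

fn main() {
    let mut oracle = BTreeMap::new();
    oracle.insert("a".to_string(), Some(b"1".to_vec()));
    oracle.insert("b".to_string(), None);
    let mut observed = BTreeMap::new();
    observed.insert("a".to_string(), b"1".to_vec());
    println!("{:?}", validate(&oracle, &observed));
}
```

Under this reading, a clean result is one where dropped, extras, and payload_mismatches are all zero and survived equals oracle_live, which matches the shape of the OK line reported for Run 1.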
This is one fault class (dm-flakey error_writes, clean), not a general "power-loss safe" claim. Lying storage, torn writes, reordering, bit rot, mid-workload faults, and dm-log-writes replay are explicitly out of scope for this run.
CI signals expected
rust (stable), rust (1.75.0), formal, corpus, publish-dry-run.
Release follow-up (not in this PR)
Tag v0.1.5 after CI is green on main to trigger the release job that runs cargo publish with CARGO_REGISTRY_TOKEN.