fix(datastore): symmetric drain to close data-delete TOCTOU race (swamp-club#234)#1303
Merged
fix(datastore): symmetric drain to close data-delete TOCTOU race (swamp-club#234)#1303
Conversation
…mp-club#234) `requireInitializedRepo` waited for per-model locks to drain before acquiring the global lock, but a writer could inspect the global lock between the drain ending and the global lock being taken — see it not held, take a per-model lock, pass its own TOCTOU recheck (also not held), and start writing a new version directory while the deleter's recursive rmdir was in flight. The result was ENOTEMPTY (Linux: os error 39, macOS: os error 66) on `swamp data delete --force` against a concurrent writer. Fix: run `waitForPerModelLocks` a second time after the global lock is acquired. Writers that slipped past the first drain will either back off on their next inspect (because the global lock is now held) or finish their write before the deleter proceeds. The writer's existing TOCTOU recheck runs immediately after the per-model lock acquisition, before any data I/O — so there is no remaining window where a writer can write data without the second drain seeing the per-model lock. design/datastores.md "Lock Lifecycle" section gains a "Symmetric Drain" subsection documenting the contract bidirectionally with the code, so the second drain cannot be silently removed as redundant. The deterministic regression for the polling primitive lives in `src/cli/repo_context_test.ts`; a 50-iteration smoke test in `integration/data_delete_test.ts` exercises the end-to-end flow. Verified against the original reproduction (4 parallel writers + tight delete loop): 250 delete attempts, zero ENOTEMPTY. Original bug reproduced in ~40 attempts under identical pressure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
No blocking issues found. This is a clean, well-reasoned fix for a real concurrency bug.
Strengths
- Correct fix: The second
waitForPerModelLockscall after global lock acquisition closes the symmetric TOCTOU window precisely. The protocol reasoning in the comments and design doc is sound — a writer that slipped past the first drain either (a) sees the now-held global lock and backs off, or (b) is already committed and must finish before the second drain returns. - Bidirectional documentation contract: The cross-reference between
src/cli/repo_context.tsanddesign/datastores.mdprevents the second drain from being silently removed as "redundant" — a nice architectural guardrail. - Test coverage at two levels: Deterministic unit tests for the polling primitive (
waitForPerModelLockswith injected scanner) cover the logic; the 50-iteration smoke test inintegration/data_delete_test.tsexercises the real end-to-end race. - Test seam design: The
findModelLocksOverrideparameter is well-scoped — optional, documented as test-only, not re-exported from any barrel. - No import boundary violations:
waitForPerModelLocksis exported fromsrc/cli/repo_context.tsbut only consumed by the adjacent_test.tsfile; not leaked throughsrc/libswamp/mod.ts. - LogTape conventions: Structured logging uses bare
{count}interpolation, no double-quoting. - DDD: The change is purely in the CLI/infrastructure coordination layer — no domain model concerns.
Suggestions
- The verbose inline comments in
requireInitializedRepo(lines 428–452) are justified here since the concurrency protocol is genuinely non-obvious, but the comments in thewaitForPerModelLocksdocstring (Test seam: findModelLocksOverride is for unit tests only — production callers must omit it. Not exported from any barrel; used solely by repo_context_test.ts.) could be trimmed — the function signature and its test-only usage are self-evident. - The integration test's
runSwamphelper spawns subprocesses without anAbortSignal/timeout. This is consistent with existing integration tests, but a hung subprocess would stall the suite indefinitely. A future follow-up could add a per-iteration timeout (e.g.,signal: AbortSignal.timeout(10_000)) to fail fast. Not a concern for this PR.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes swamp-club#234.
swamp data delete --forcecould fail withDirectory not empty(Linux:os error 39, macOS:os error 66) when a concurrent writer was creating a new version of the same data artifact.The root cause was a symmetric TOCTOU window in
requireInitializedRepo: the deleter drained per-model locks before acquiring the global lock, but between those two steps a writer could inspect the global lock (not yet held), take a per-model lock, pass its own recheck (also not held), and start writing a new version directory. The deleter would then runDeno.remove(dataNameDir, { recursive: true })against a directory the writer was actively repopulating.The fix runs
waitForPerModelLocksa second time after the global lock is acquired. With the global lock held, the writer's existing TOCTOU recheck (which runs before any data I/O) sees the held global lock and backs off. The second drain catches any writer already past its recheck and waits for it to finish before structural work proceeds.design/datastores.mdgains a "Symmetric Drain" subsection documenting the contract bidirectionally with the code, so the second drain cannot be silently removed as redundant.Test Plan
deno check/deno lint/deno fmt --check— cleandeno run test— 5405 passed, 0 faileddeno run compile— produces./swampbinarycommand/shell+ tightdata delete --forceloop). Original bug reproduced in ~40 attempts; with the fix, 250 delete attempts produced zero ENOTEMPTY errors.src/cli/repo_context_test.ts: 3 new tests for the polling primitive (immediate return, retry-until-zero, end-to-end stale-lock pass-through).integration/data_delete_test.ts: 50 iterations of concurrent CLI writer + deleter, asserts no ENOTEMPTY/os-error-39/os-error-66 strings appear.Follow-ups (separate work, not bundled)
systeminit/swamp-uatto strengthentests/cli/adversarial/data_delete_test.tsScenario 2 — today only checks for crash markers, should additionally assert no ENOTEMPTY string.unified_data_repository.ts:632(silent.catch(() => {})) vs:639(re-throw) is now architecturally unnecessary; tracked as a follow-up cleanup.🤖 Generated with Claude Code