Skip to content

fix(datastore): symmetric drain to close data-delete TOCTOU race (swamp-club#234)#1303

Merged
stack72 merged 1 commit intomainfrom
worktree-234
May 5, 2026
Merged

fix(datastore): symmetric drain to close data-delete TOCTOU race (swamp-club#234)#1303
stack72 merged 1 commit intomainfrom
worktree-234

Conversation

@stack72
Copy link
Copy Markdown
Contributor

@stack72 stack72 commented May 5, 2026

Summary

Fixes swamp-club#234. swamp data delete --force could fail with Directory not empty (Linux: os error 39, macOS: os error 66) when a concurrent writer was creating a new version of the same data artifact.

The root cause was a symmetric TOCTOU window in requireInitializedRepo: the deleter drained per-model locks before acquiring the global lock, but between those two steps a writer could inspect the global lock (not yet held), take a per-model lock, pass its own recheck (also not held), and start writing a new version directory. The deleter would then run Deno.remove(dataNameDir, { recursive: true }) against a directory the writer was actively repopulating.

The fix runs waitForPerModelLocks a second time after the global lock is acquired. With the global lock held, the writer's existing TOCTOU recheck (which runs before any data I/O) sees the held global lock and backs off. The second drain catches any writer already past its recheck and waits for it to finish before structural work proceeds.

design/datastores.md gains a "Symmetric Drain" subsection documenting the contract bidirectionally with the code, so the second drain cannot be silently removed as redundant.

Test Plan

  • deno check / deno lint / deno fmt --check — clean
  • deno run test5405 passed, 0 failed
  • deno run compile — produces ./swamp binary
  • Manual repro verified. Reused the original reproduction (4 parallel writers via command/shell + tight data delete --force loop). Original bug reproduced in ~40 attempts; with the fix, 250 delete attempts produced zero ENOTEMPTY errors.
  • Deterministic regression in src/cli/repo_context_test.ts: 3 new tests for the polling primitive (immediate return, retry-until-zero, end-to-end stale-lock pass-through).
  • Smoke regression in integration/data_delete_test.ts: 50 iterations of concurrent CLI writer + deleter, asserts no ENOTEMPTY/os-error-39/os-error-66 strings appear.

Follow-ups (separate work, not bundled)

  • File issue in systeminit/swamp-uat to strengthen tests/cli/adversarial/data_delete_test.ts Scenario 2 — today only checks for crash markers, should additionally assert no ENOTEMPTY string.
  • The asymmetric error handling in unified_data_repository.ts:632 (silent .catch(() => {})) vs :639 (re-throw) is now architecturally unnecessary; tracked as a follow-up cleanup.

🤖 Generated with Claude Code

…mp-club#234)

`requireInitializedRepo` waited for per-model locks to drain before
acquiring the global lock, but a writer could inspect the global lock
between the drain ending and the global lock being taken — see it not
held, take a per-model lock, pass its own TOCTOU recheck (also not
held), and start writing a new version directory while the deleter's
recursive rmdir was in flight. The result was ENOTEMPTY (Linux: os
error 39, macOS: os error 66) on `swamp data delete --force` against a
concurrent writer.

Fix: run `waitForPerModelLocks` a second time after the global lock is
acquired. Writers that slipped past the first drain will either back
off on their next inspect (because the global lock is now held) or
finish their write before the deleter proceeds. The writer's existing
TOCTOU recheck runs immediately after the per-model lock acquisition,
before any data I/O — so there is no remaining window where a writer
can write data without the second drain seeing the per-model lock.

design/datastores.md "Lock Lifecycle" section gains a "Symmetric Drain"
subsection documenting the contract bidirectionally with the code, so
the second drain cannot be silently removed as redundant. The
deterministic regression for the polling primitive lives in
`src/cli/repo_context_test.ts`; a 50-iteration smoke test in
`integration/data_delete_test.ts` exercises the end-to-end flow.

Verified against the original reproduction (4 parallel writers + tight
delete loop): 250 delete attempts, zero ENOTEMPTY. Original bug
reproduced in ~40 attempts under identical pressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

No blocking issues found. This is a clean, well-reasoned fix for a real concurrency bug.

Strengths

  • Correct fix: The second waitForPerModelLocks call after global lock acquisition closes the symmetric TOCTOU window precisely. The protocol reasoning in the comments and design doc is sound — a writer that slipped past the first drain either (a) sees the now-held global lock and backs off, or (b) is already committed and must finish before the second drain returns.
  • Bidirectional documentation contract: The cross-reference between src/cli/repo_context.ts and design/datastores.md prevents the second drain from being silently removed as "redundant" — a nice architectural guardrail.
  • Test coverage at two levels: Deterministic unit tests for the polling primitive (waitForPerModelLocks with injected scanner) cover the logic; the 50-iteration smoke test in integration/data_delete_test.ts exercises the real end-to-end race.
  • Test seam design: The findModelLocksOverride parameter is well-scoped — optional, documented as test-only, not re-exported from any barrel.
  • No import boundary violations: waitForPerModelLocks is exported from src/cli/repo_context.ts but only consumed by the adjacent _test.ts file; not leaked through src/libswamp/mod.ts.
  • LogTape conventions: Structured logging uses bare {count} interpolation, no double-quoting.
  • DDD: The change is purely in the CLI/infrastructure coordination layer — no domain model concerns.

Suggestions

  1. The verbose inline comments in requireInitializedRepo (lines 428–452) are justified here since the concurrency protocol is genuinely non-obvious, but the comments in the waitForPerModelLocks docstring (Test seam: findModelLocksOverride is for unit tests only — production callers must omit it. Not exported from any barrel; used solely by repo_context_test.ts.) could be trimmed — the function signature and its test-only usage are self-evident.
  2. The integration test's runSwamp helper spawns subprocesses without an AbortSignal/timeout. This is consistent with existing integration tests, but a hung subprocess would stall the suite indefinitely. A future follow-up could add a per-iteration timeout (e.g., signal: AbortSignal.timeout(10_000)) to fail fast. Not a concern for this PR.

@stack72 stack72 merged commit 1781d38 into main May 5, 2026
11 checks passed
@stack72 stack72 deleted the worktree-234 branch May 5, 2026 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant