Skip to content

fix(persistence): close TOCTOU window in find* walkers (review followup)#1307

Merged
stack72 merged 1 commit intomainfrom
fix/persistence-toctou-walkers
May 5, 2026
Merged

fix(persistence): close TOCTOU window in find* walkers (review followup)#1307
stack72 merged 1 commit intomainfrom
fix/persistence-toctou-walkers

Conversation

@stack72
Copy link
Copy Markdown
Contributor

@stack72 stack72 commented May 5, 2026

Followup to #1306 (swamp-club#240). Addresses review comments that surfaced after merge.

What

TOCTOU fix in find* walkers (the one that matters)

Deno.stat and Deno.readTextFile in the workflow-run and output repos' `find*` methods were unguarded. A concurrent delete — GC, `deleteAllByWorkflowId`, output cleanup, another swamp invocation — between `readDir` and the per-file ops throws `NotFound`, propagates to the outer catch, and:

  • Workflow run repo (`findRunsSinceByWorkflowId` / `findAllByWorkflowId`): returns `[]` for that workflow → silently discards every run already parsed and pushed to the array.
  • Output repo (`findAllGlobalSince` / `findAll(type)`): `continue`s past the entire current model type → silently drops the rest of that type's outputs.

The fix mirrors the pattern `unified_data_repository.ts:273` already uses: wrap each per-file body in its own `try/catch` that `continue`s on `NotFound`. The outer catch keeps its role of handling "directory itself doesn't exist." All four walkers across the workflow-run and output repos now use the same shape.

This isn't a regression introduced by #1306 — the pre-existing `findAllByWorkflowId` and `findAll(type)` had it too — but the new `findAllGlobalSince` methods are specifically designed for large-repo scenarios where the race window is wider (3 ops per file instead of 1), so it's worth fixing now while the design is fresh.

Direct unit tests for findAllGlobalSince

The original PR added dedicated tests for the workflow-run repo's two-stage filter but left the output and data repo implementations covered only indirectly via service-layer mocks. Future divergence between the three implementations could ship undetected. Added tests mirroring the workflow-run patterns:

  • In-window-only filtering on output and data repos.
  • The "file deleted mid-iteration" race that this PR's TOCTOU fix prevents (and pins the data repo's existing per-file try/catch so it can't regress).

Polish

  • `--limit count:number` → `<n:number>` to match `data_search` and `workflow_run_search` placeholders in `--help`.
  • Added `swamp summarise --limit 10` to the command's example block.

Test plan

  • `deno check` — clean
  • `deno lint` — clean
  • `deno fmt` — clean
  • `deno run test` — 5430 passing, 0 failing (+17 new tests vs. main: 4 TOCTOU + 4 findAllGlobalSince across the three repos, plus the polish stays in the existing CLI tests)
  • `deno run compile` — binary refreshed
  • All four walkers (`findAllByWorkflowId`, `findRunsSinceByWorkflowId`, `findAll(type)`, output `findAllGlobalSince`) have a "file deleted mid-iteration is skipped, not fatal" test that fails against the pre-fix code and passes against the new pattern

🤖 Generated with Claude Code

Followup to swamp-club#240 / #1306. Review surfaced two related issues that
landed without being addressed:

1. **TOCTOU race in find* walkers**: `Deno.stat` and `Deno.readTextFile`
   in `YamlWorkflowRunRepository.findRunsSinceByWorkflowId` and
   `YamlOutputRepository.findAllGlobalSince` were unguarded. A concurrent
   delete (GC, `deleteAllByWorkflowId`, output cleanup) between `readDir`
   and the per-file ops would throw `NotFound`, propagate to the outer
   catch, and silently discard runs already collected for the workflow
   (run repo) or abandon the rest of the model type's outputs (output
   repo). The pre-existing `findAllByWorkflowId` and `findAll(type)`
   shared the bug.

   Fix: wrap each per-file body in its own `try/catch` that `continue`s
   on `NotFound`, matching the pattern `unified_data_repository.ts:273`
   already uses. The outer catch keeps its role of handling "directory
   itself doesn't exist." All four walkers across the workflow-run and
   output repos now use the same shape.

2. **`YamlOutputRepository.findAllGlobalSince` and
   `FileSystemUnifiedDataRepository.findAllGlobalSince` lacked direct
   unit tests**. They were covered indirectly via service-layer mocks,
   so a divergence between the three implementations could ship
   undetected. Added tests mirroring the workflow-run repo's patterns:
   in-window-only filtering and the "file deleted mid-iteration" race
   that this PR's #1 fix prevents.

Also includes two cosmetic polishes called out in review:
- `--limit <count:number>` → `<n:number>` to match `data_search` and
  `workflow_run_search` flag placeholders in `--help` output.
- Added `swamp summarise --limit 10` to the command's example block so
  the flag shows up alongside `--since` examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI UX Review

Blocking

None.

Suggestions

None.

Verdict

PASS — The only UX-facing change is in summarise.ts: the --limit placeholder was renamed from <count:number> to <n:number>, making it consistent with data_search, workflow_run_search, and data_query which all use <n:number>. A new example (swamp summarise --limit 10) was also added to improve discoverability. All remaining changes are internal persistence fixes and tests with no user-visible effect.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Clean, well-scoped followup. The TOCTOU fix is correct and consistently applied across all four walkers, matching the existing pattern in unified_data_repository.ts:273. The new tests directly exercise both the time-window filtering and the race-condition resilience for each repository, which pins the fix against future regressions.

Blocking Issues

None.

Suggestions

  1. Pre-existing TOCTOU in findById / delete walkersYamlOutputRepository.findById (line 65–93) and YamlWorkflowRunRepository.findById (line 71–103) have the same unguarded readTextFile inside a readDir loop. A concurrent delete of a non-target file stops the scan early and returns null, potentially missing the actual target listed later. Same for YamlOutputRepository.delete. Not in scope of this PR (and the race window is smaller since these return on first match), but worth a follow-up ticket.

  2. Comment verbosity — The per-file try/catch blocks carry a 4–5 line comment in each of the four walkers explaining the same TOCTOU rationale. The WHY is genuinely non-obvious and warrants a comment, but after the first instance a shorter one-liner (e.g. // Skip files deleted between readDir and stat/read (TOCTOU)) would suffice and reduce duplication. Minor style point — each method is independently readable as-is.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

I traced every code path in the four modified walkers (findAll, findAllGlobalSince in the output repo; findAllByWorkflowId, findRunsSinceByWorkflowId in the workflow-run repo), the three new test files, and the CLI polish change.

Critical / High

None.

Medium

None.

Low

  1. Pre-existing TOCTOU at directory level in findAll (yaml_output_repository.ts:150): If a method directory is deleted between readDir(typeDir) and readDir(methodDir), the NotFound escapes to the outer catch which returns [], discarding all outputs already collected for that type. The per-file fix in this PR is the right granularity — directory deletion is far rarer than file deletion (it only happens via bulk ops like deleteAllByWorkflowId) — but noting for completeness that the window still exists one level up. Not introduced by this PR.

  2. Pre-existing TOCTOU in findById (both repos): findById still has readTextFile after readDir without per-file try/catch. A concurrent deletion of a non-target file during the scan would cause the search to abort and return null even if the target file exists. The failure mode is benign (false "not found") and the single-file lookup has a much narrower race window than the multi-file walkers this PR fixes. Not introduced by this PR.

Verdict

PASS. The TOCTOU fix is correct and consistent across all four walkers. Each inner continue targets the right loop level (verified for the triple-nested findAllGlobalSince in the output repo). The inner catch re-throws non-NotFound errors so genuine I/O failures still propagate. The outer catch retains its original role (directory-doesn't-exist). The findAllGlobalSince tests properly simulate the race by deleting files between save and query. The CLI polish (<count:number><n:number>, added example) is harmless. Clean PR.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Critical / High

None.

Medium

None.

Low

  1. Pre-existing TOCTOU at directory level in findAll (yaml_output_repository.ts:150): If a method directory is deleted between readDir(typeDir) and readDir(methodDir), the NotFound escapes to the outer catch which returns [], discarding all outputs already collected for that type. The per-file fix in this PR is the right granularity -- directory deletion is far rarer than file deletion (it only happens via bulk ops like deleteAllByWorkflowId) -- but noting for completeness that the window still exists one level up. Not introduced by this PR.

  2. Pre-existing TOCTOU in findById (both repos): findById still has readTextFile after readDir without per-file try/catch. A concurrent deletion of a non-target file during the scan would cause the search to abort and return null even if the target file exists. The failure mode is benign (false "not found") and the single-file lookup has a much narrower race window than the multi-file walkers this PR fixes. Not introduced by this PR.

Verdict

PASS. The TOCTOU fix is correct and consistent across all four walkers. Each inner continue targets the right loop level (verified for the triple-nested findAllGlobalSince in the output repo). The inner catch re-throws non-NotFound errors so genuine I/O failures still propagate. The outer catch retains its original role (directory-doesn't-exist). The findAllGlobalSince tests properly simulate the race by deleting files between save and query. The CLI polish (count to n, added example) is harmless. Clean PR.

@stack72 stack72 merged commit f630b98 into main May 5, 2026
11 checks passed
@stack72 stack72 deleted the fix/persistence-toctou-walkers branch May 5, 2026 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant