Skip to content

fix(headless): harden drain/finalize against leaked reader threads (#1025)#1078

Merged
vybe merged 2 commits into
devfrom
feature/1025-harden-headless-drain-finalize
Jun 16, 2026
Merged

fix(headless): harden drain/finalize against leaked reader threads (#1025)#1078
vybe merged 2 commits into
devfrom
feature/1025-harden-headless-drain-finalize

Conversation

@dolho

@dolho dolho commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Salvages the two orthogonal robustness improvements from the closed PR #980 that the #973 fix (the actual 2h false-timeout bug) did not include. Pure hardening of the already-bounded drain/finalize path — defense-in-depth, no behavioural change on the clean path.

D20 — capture daemon-thread exceptions in _drain_bounded

subprocess_lifecycle._drain_bounded swallowed every drain exception via except Exception: pass, making a drain that raised indistinguishable from a clean completion (and hiding the leaked reader). It now captures + logs the exception and returns a DrainOutcome (completed | budget_exceeded | errored). Existing callers that ignore the return value (claude_code.py) are unaffected.

D19 — _finalize_headless_result snapshot isolation

On the budget-exceeded / errored drain path a reader thread is leaked and may still be mutating ctx's shared buffers / metadata. The run context now carries drain_budget_exceeded; when set, finalize rebinds ctx to a deep snapshot (list(...) of each buffer + metadata.model_copy(deep=True)) via _snapshot_for_finalize, retry-guarded against the "changed size during iteration" race with a live-ctx fallback. Clean drains keep the zero-copy fast path.

Explicitly out of scope (per the issue)

Tests

  • test_drain_bounded.py: outcome matrix — completed / budget_exceeded / errored, plus no-swallow logging assertion
  • test_headless_finalize_snapshot.py (new): list + metadata isolation from post-snapshot mutation, retry-exhaustion fallback to live ctx, default-off fast path
  • Full related suite (-k "drain or headless or claude or subprocess or pgroup or orphan"): 130 passed

Related to #1025

🤖 Generated with Claude Code

# done is set: _target finished, so ``errored`` is fully settled.
if errored.is_set():
return "errored"
return "completed"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot trigger misses the leaked drain outcome — the exact race this PR fixes still fires on the "completed" path.

_drain_bounded only returns "budget_exceeded" (timeout) or "errored" (raise). But utils/subprocess_pgroup.py:drain_reader_threads has a third leak case: after force-close it computes leaked = [t for t in still_stuck if t.is_alive()], logs outcome=leaked, and returns normally (return [], well within the 90s budget). So a confirmed-alive leaked reader makes _drain_bounded return "completed"ctx.drain_budget_exceeded stays False_finalize_headless_result skips _snapshot_for_finalize and reads raw_messages / response_parts / execution_log / metadata while the leaked reader is still mutating them. metadata.model_copy(deep=True) (or iteration) can then raise RuntimeError: dictionary changed size during iteration with no retry guard, surfacing as an opaque HTTP 500.

The correct "a reader may still mutate ctx" signal is drain_outcome != "completed" including the leaked case — surface a "leaked" member on DrainOutcome fed from that branch, or treat it like the other two. As written the fix has a hole on the very path it declares safe.

last_exc: Optional[RuntimeError] = None
for _ in range(_SNAPSHOT_RETRY_ATTEMPTS):
try:
return replace(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot doesn't isolate auth_abort_event / auth_abort_reason, which finalize reads off the still-live objects.

replace(ctx, ...) only overrides raw_messages, response_parts, execution_log, verbose_output_lines, metadata. The new context keeps the same auth_abort_event (Event) and auth_abort_reason (list) — both mutated by the read_stderr() reader thread (lines 438-439). On the leaked-reader path, finalize then reads ctx.auth_abort_event.is_set() (line 737) and ctx.auth_abort_reason[0] (738) on those live objects.

So if the leaked thread is the stderr reader and it matches an auth pattern in a late line after the snapshot but before line 737, an otherwise-successful run is converted to a spurious HTTP 503 — which task_execution_service (SUB-003) treats as an auth failure and uses to auto-switch the subscription. Narrow, but real.

At minimum this contradicts the helper's own docstring ("Snapshot EVERY field finalize reads"). Either snapshot the auth fields too, or document explicitly why they're exempt (as parse_failure_count/_sample already are).

drain_outcome = _drain_bounded(process, stdout_thread, stderr_thread,
grace=3, pgid=ctx.process_pgid,
execution_tag=ctx.task_session_id)
ctx.drain_budget_exceeded = drain_outcome in ("budget_exceeded", "errored")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This drain_budget_exceeded write is dead for its stated purpose.

The TimeoutExpired branch sets the flag and then immediately raises. execute_headless_task catches TimeoutExpired and raises HTTPException(504) without calling _finalize_headless_result, so the flag set here is never read — only the post-exit drain at line 612 feeds finalize. Harmless today, but it reads as if the snapshot protects the timeout path when it does not, and it's a maintenance trap if the timeout branch is ever wired into finalize later. Consider dropping the assignment here (with a comment) or routing the timeout path through finalize so the snapshot actually applies.

"never swallow exceptions silently" rule).
"""
done = threading.Event()
errored = threading.Event()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor (cleanup): the second errored Event is redundant. It's written by one thread and read by one thread only after the done barrier, so a single write-once cell captures the same thing with the ordering self-evident:

outcome = "completed"
def _target():
    nonlocal outcome
    try:
        asyncio.run(_drain_reader_threads(...))
    except Exception:
        outcome = "errored"
        logger.exception(...)
    finally:
        done.set()
...
if not done.wait(timeout=_DRAIN_BUDGET_SECONDS):
    return "budget_exceeded"
return outcome

Removes one Event allocation per drain and the load-bearing "errored is fully settled" comment. (A concurrent.futures Future would fold the timeout + exception-capture + outcome together even more cleanly.)

response_parts=list(ctx.response_parts),
execution_log=list(ctx.execution_log),
verbose_output_lines=list(ctx.verbose_output_lines),
metadata=ctx.metadata.model_copy(deep=True),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor (cleanup): model_copy(deep=True)ExecutionMetadata's only nested mutable field is compact_events, which finalize overwrites from the JSONL on the success path (_extract_compact_events_from_jsonl). Everything else is scalars, where deep=True == shallow. The deep copy is also precisely what can raise the "changed size during iteration" the 3-attempt retry loop exists to absorb. A deep=False (or model_copy(update={'compact_events': list(...)}) if a tear-safe copy of that one list is actually wanted) narrows the race window the retry is guarding.

@dolho

dolho commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Code review — /code-review (high effort, 7 finder angles)

Reviewed the 4-file diff. The core design is sound (verified: errored/done Event ordering is correct; ctx never escapes execute_headless_task so finalize mutating the snapshot copy is fine; replace() import + dataclass field ordering OK; both _drain_bounded callers safely ignore the changed return type). Local runtime verification also passed (clean headless task + concurrent tasks, no regression).

Findings posted inline, ranked:

  1. 🅰 Correctness — leaked drain outcome not surfaced (subprocess_lifecycle.py:124). drain_reader_threads returns normally on its leaked branch with readers confirmed still-alive, so _drain_bounded reports "completed" and finalize skips the snapshot — the exact race this PR fixes, on the path it declares safe. Strongest finding.
  2. 🅱 Correctness (narrow) — auth_abort_* not snapshotted (headless_executor.py:642). Finalize reads auth_abort_event/auth_abort_reason off the still-live objects after the snapshot rebind; a leaked stderr reader setting auth-abort late → spurious 503 + SUB-003 auto-switch. Also contradicts the "snapshot EVERY field" docstring.
  3. Dead flag on the timeout path (headless_executor.py:603) — set then raise; the 504 path never finalizes, so it's never read.
  4. Cleanup — redundant second Event (subprocess_lifecycle.py:90); unnecessary model_copy(deep=True) (headless_executor.py:648).

Not inline-able (file outside the diff)

  1. Consistency — the chat path wasn't hardened (services/claude_code.py:386). execute_claude_code is structurally identical — same shared buffers, same two _drain_bounded sites, a finalize-equivalent block — but it discards the new DrainOutcome and never snapshots, then reads ''.join(stderr_lines) immediately after the drain (L390). The same leaked-reader race remains fully open on the higher-traffic interactive /api/chat surface. Harden headless drain/finalize: capture daemon-thread exceptions + finalize snapshot isolation (salvage from #980) #1025 scoped itself to the headless path, so this is arguably a follow-up rather than a blocker — but combined with finding 1 it suggests the real fix belongs at the shared _drain_bounded/finalize seam both callers cross, not re-implemented per-caller.

Findings 3-5 are low-severity; 1 and 2 are worth addressing before merge (1 especially — it's a gap in the fix's own trigger logic).

— 🤖 generated by /code-review

@dolho

dolho commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Review findings addressed — e65238a5

# Finding Fix
1 🔴 leaked drain outcome missed → snapshot skipped while a confirmed-alive reader mutates ctx _drain_bounded adds a "leaked" DrainOutcome: after a within-budget, non-raising drain it checks whether any reader thread it was handed is still alive (no dependency on drain_reader_threads internals). Flag renamed reader_may_be_live, set on drain_outcome != "completed" so budget_exceeded / errored / leaked all snapshot.
2 🟠 auth_abort_* not snapshotted → possible spurious 503 / SUB-003 auto-switch _snapshot_for_finalize now freezes both: a fresh Event mirroring is_set() (_freeze_event) + a list(...) copy of auth_abort_reason. The "snapshots every field finalize reads" claim now holds.
3 Dead flag on the timeout path Dropped — the TimeoutExpired path re-raises to HTTP 504 without finalizing, so nothing reads it (replaced with a comment).
4 Redundant errored Event Collapsed to a single write-once outcome cell set by the daemon thread, read after the done barrier.
5 model_copy(deep=True) Kept deliberatelycompact_events is a mutable list a leaked reader can append to, so the deep copy is the correct isolation on the snapshot path; the retry guard is needed regardless of depth (it covers __pydantic_fields_set__ growth).
Chat path (claude_code.py) has the same race Left as a follow-up — out of #1025's headless scope, and findings 1+2 argue the real fix is to lift this onto the shared _drain_bounded/finalize seam both callers cross rather than re-implement per-caller. Worth its own issue.

New tests: "leaked"-outcome (within-budget drain leaving a live reader) + auth-abort freeze isolation; 21 passed locally. Re-verified on a live agent-server — clean task still 200s through the fast path with zero leaked/snapshot warnings.

@dolho

dolho commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Second review pass (post-fix e65238a5) — follow-up findings

First-round findings are resolved. A deeper pass surfaced 4 smaller items (none merge-blocking on their own) + known notes. Flagging status-in-progress until 2–4 are addressed.

1. 🟠 Residual torn read of response_parts (pre-existing, snapshot doesn't fully close).
_snapshot_for_finalize does list(ctx.response_parts), but stream_parser.py:299-300 does response_parts.clear() then .append(result_text). If a leaked reader runs that pair while list() executes between them, the snapshot captures [] → spurious empty-result 500/502. list() doesn't raise, so the retry guard never fires. Not a regression (old code read the live list directly), and the proper fix lives in stream_parser (build-then-assign / lock) — tracking as a follow-up rather than expanding this PR's scope.

2. Misleading "leaked" warning on the chat path. The new _drain_bounded log says "reporting 'leaked' so finalize snapshots the run context", but it fires for all callers including claude_code.py (chat), which discards the return and has no snapshot — misleading for operators. Reword to drop the caller-specific promise.

3. Double-scan self-contradiction. _drain_bounded computes any(t.is_alive()) for the return and sum(1 for t … is_alive()) for the warning — two walks; a reader dying between them can log "reporting 'leaked' … 0 reader thread(s) are still alive." Compute the alive list once.

4. Test gap. test_snapshot_freezes_auth_abort_signal only covers not-set→set; the load-bearing if event.is_set(): frozen.set() branch (the 503 path) is untested. Add the already-set→stays-set assertion.

Known / by-design (not actionable here)

Refuted

  • metadata value-tearing across fieldsmodel_copy snapshots __dict__ in one atomic copy under the GIL; the retry already covers the only real race (__pydantic_fields_set__ growth).

Plan: fix 2/3/4 (small), leave 1 as a tracked follow-up.

— 🤖 second pass via /code-review

@dolho dolho added the status-in-progress Currently being worked on label Jun 5, 2026
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

@vybe

vybe commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Validation: approve the fix on its merits, but please rebase before merge.

The D19/D20 hardening itself looks solid — typed DrainOutcome with capture+log replacing the silent except: pass, the leaked outcome catching the #586 within-budget-but-reader-alive gap, and the finalize snapshot isolation (incl. the auth_abort_event freeze that prevents a late auth match flipping success→503). Clean-drain keeps the zero-copy fast path, so no behavioural change on the happy path. Tests cover the outcome matrix + isolation/retry-fallback.

One blocking concern — the branch is stale and predates the fix it's meant to complement. It was cut from a dev that does not contain:

dev has 8 commits touching headless_executor.py since this branch's merge-base. The 3-way merge auto-resolves cleanly (no conflicts), but a clean textual merge isn't a semantic guarantee: #973 and D19/D20 both touch the drain/finalize path, and the "130 passed" run was against a base that lacks #973 — so the suite hasn't actually exercised the merged result.

Please rebase onto current dev (post-#973) and re-run the headless suite (-k "drain or headless or claude or subprocess or pgroup or orphan") so we can confirm the salvaged hardening composes correctly with #973's wait-bounding changes.

(For reference: the README.md +219/−219 in the diff is a harmless artifact of the stale merge-base — the README/hero blobs are byte-identical to what's already on dev/main, so the merge is a README no-op. No action needed there.)

dolho and others added 2 commits June 12, 2026 11:24
…1025)

Salvages the two orthogonal robustness improvements from the closed PR #980
that the #973 fix did not include. Pure hardening of the already-bounded
drain/finalize path — defense-in-depth, not a behavioural change on the
clean path.

D20 — capture daemon-thread exceptions in `_drain_bounded`
  `subprocess_lifecycle._drain_bounded` swallowed every drain exception via
  `except Exception: pass`, making a drain that raised indistinguishable
  from a clean completion (and hiding the leaked reader). It now captures +
  logs the exception and returns a `DrainOutcome`
  (`completed` | `budget_exceeded` | `errored`) so the caller can tell the
  leaked-reader cases apart. Existing callers that ignore the return value
  are unaffected.

D19 — `_finalize_headless_result` snapshot isolation
  On the budget-exceeded / errored drain path a reader thread is leaked and
  may still be mutating ctx's shared buffers / metadata. The headless run
  context now carries `drain_budget_exceeded`; when set, finalize rebinds
  `ctx` to a deep snapshot (`list(...)` of each buffer +
  `metadata.model_copy(deep=True)`) via `_snapshot_for_finalize`, which is
  retry-guarded against the "changed size during iteration" race and falls
  back to the live ctx if every attempt loses. Clean drains keep the
  zero-copy fast path.

Out of scope (per the issue): the #980 asyncio.wait_for margin widening and
the #970 lsof/proc pipe-holder hunt — both correctly excluded.

Tests: `_drain_bounded` outcome matrix (completed/budget_exceeded/errored +
no-swallow logging) in test_drain_bounded.py; snapshot list/metadata
isolation, retry-exhaustion fallback, and default-off fast path in
test_headless_finalize_snapshot.py. Full related suite: 130 passed.

Related to #1025

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#1025)

Self-review of the #1025 PR surfaced two correctness gaps in the hardening
itself, plus a dead assignment and a redundant Event. Fixed:

1. Leaked-reader outcome was missed (strongest finding). drain_reader_threads
   force-closes and *returns normally* when a grandchild-held reader won't EOF
   (its own outcome=leaked METRIC, #586), so a within-budget, non-raising drain
   did NOT prove the readers were dead — _drain_bounded returned "completed",
   finalize skipped the snapshot, and read buffers a confirmed-alive reader was
   still mutating. _drain_bounded now adds a "leaked" DrainOutcome: after a
   within-budget return it checks whether any reader thread it was handed is
   still alive (no dependency on drain_reader_threads internals). The headless
   flag is renamed reader_may_be_live and set on `drain_outcome != "completed"`
   so budget_exceeded / errored / leaked all trigger the snapshot.

2. Snapshot didn't isolate the auth-abort signal. finalize reads
   auth_abort_event.is_set() / auth_abort_reason[0], but the stderr reader (the
   thread that leaks) mutates them — a late auth match could flip a success to a
   spurious 503 (→ SUB-003 auto-switch). _snapshot_for_finalize now freezes both
   (a fresh Event mirroring is_set() via _freeze_event + a list copy), so the
   snapshot's "every field finalize reads" claim holds.

3. Dropped the dead reader_may_be_live assignment on the TimeoutExpired path —
   that path re-raises to HTTP 504 without finalizing, so nothing reads it
   (replaced with a comment).

4. Collapsed the redundant `errored` Event in _drain_bounded to a single
   write-once `outcome` cell set by the daemon thread and read after the `done`
   barrier — removes the load-bearing two-Event ordering invariant.

Tests: new "leaked"-outcome case (within-budget drain leaving a live reader)
and auth-abort freeze isolation; renamed the flag test. 21 passed locally.
Re-verified against a live agent-server: clean task still 200s through the
fast path with zero leaked/snapshot warnings.

Related to #1025

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dolho dolho force-pushed the feature/1025-harden-headless-drain-finalize branch from e65238a to b5daa8b Compare June 12, 2026 08:26
@dolho dolho requested a review from vybe June 15, 2026 14:29

@vybe vybe left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Verified the load-bearing claim independently: _snapshot_for_finalize captures exactly the 7 ctx buffers _finalize_headless_result reads (lists, metadata deep-copy, frozen auth-abort event/reason); the only iterating read (execution_log) is over the captured copy; tool_start_times is correctly omitted (read only by the #970 stall-watchdog, never by finalize). Outcome matrix + no-swallow logging + snapshot isolation/fallback/auth-freeze are well-tested and CI is green.

⚠️ Cross-PR coordination with #1229: this PR justifies NOT setting reader_may_be_live on the kill path because "the 504 path never calls _finalize_headless_result." #1229 adds _timeout_504_detail(ctx, …) which reads ctx.metadata/buffers on exactly that kill/504 path. Whichever lands second must reconcile — read 504 telemetry from a snapshot (or set reader_may_be_live on the kill path) so the leaked-reader race isn't reopened on the timeout path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status-in-progress Currently being worked on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants