fix(headless): harden drain/finalize against leaked reader threads (#1025) by dolho · Pull Request #1078 · Abilityai/trinity

dolho · 2026-06-05T10:52:47Z

Summary

Salvages the two orthogonal robustness improvements from the closed PR #980 that the #973 fix (the actual 2h false-timeout bug) did not include. Pure hardening of the already-bounded drain/finalize path — defense-in-depth, no behavioural change on the clean path.

D20 — capture daemon-thread exceptions in `_drain_bounded`

subprocess_lifecycle._drain_bounded swallowed every drain exception via except Exception: pass, making a drain that raised indistinguishable from a clean completion (and hiding the leaked reader). It now captures + logs the exception and returns a DrainOutcome (completed | budget_exceeded | errored). Existing callers that ignore the return value (claude_code.py) are unaffected.

D19 — `_finalize_headless_result` snapshot isolation

On the budget-exceeded / errored drain path a reader thread is leaked and may still be mutating ctx's shared buffers / metadata. The run context now carries drain_budget_exceeded; when set, finalize rebinds ctx to a deep snapshot (list(...) of each buffer + metadata.model_copy(deep=True)) via _snapshot_for_finalize, retry-guarded against the "changed size during iteration" race with a live-ctx fallback. Clean drains keep the zero-copy fast path.

Explicitly out of scope (per the issue)

The fix(agent-runtime): decouple headless executor from leaked reader threads (#970) #980 asyncio.wait_for outer-margin widening (fix(#970): bound headless wait on result message + tool stall #973 leaves it as-is)
The bug: headless_executor reader thread leak causes 2h execution hangs #970 lsof//proc pipe-holder hunt (reintroduces the D-state deadlock bug: Runaway subprocess after task failure makes agent-server unresponsive, triggering persistent CB cascade #817 removed)

Tests

test_drain_bounded.py: outcome matrix — completed / budget_exceeded / errored, plus no-swallow logging assertion
test_headless_finalize_snapshot.py (new): list + metadata isolation from post-snapshot mutation, retry-exhaustion fallback to live ctx, default-off fast path
Full related suite (-k "drain or headless or claude or subprocess or pgroup or orphan"): 130 passed

Related to #1025

🤖 Generated with Claude Code

dolho · 2026-06-05T11:15:51Z

+    # done is set: _target finished, so ``errored`` is fully settled.
+    if errored.is_set():
+        return "errored"
+    return "completed"


The snapshot trigger misses the leaked drain outcome — the exact race this PR fixes still fires on the "completed" path.

_drain_bounded only returns "budget_exceeded" (timeout) or "errored" (raise). But utils/subprocess_pgroup.py:drain_reader_threads has a third leak case: after force-close it computes leaked = [t for t in still_stuck if t.is_alive()], logs outcome=leaked, and returns normally (return [], well within the 90s budget). So a confirmed-alive leaked reader makes _drain_bounded return "completed" → ctx.drain_budget_exceeded stays False → _finalize_headless_result skips _snapshot_for_finalize and reads raw_messages / response_parts / execution_log / metadata while the leaked reader is still mutating them. metadata.model_copy(deep=True) (or iteration) can then raise RuntimeError: dictionary changed size during iteration with no retry guard, surfacing as an opaque HTTP 500.

The correct "a reader may still mutate ctx" signal is drain_outcome != "completed" including the leaked case — surface a "leaked" member on DrainOutcome fed from that branch, or treat it like the other two. As written the fix has a hole on the very path it declares safe.

dolho · 2026-06-05T11:16:38Z

+    last_exc: Optional[RuntimeError] = None
+    for _ in range(_SNAPSHOT_RETRY_ATTEMPTS):
+        try:
+            return replace(


The snapshot doesn't isolate auth_abort_event / auth_abort_reason, which finalize reads off the still-live objects.

replace(ctx, ...) only overrides raw_messages, response_parts, execution_log, verbose_output_lines, metadata. The new context keeps the same auth_abort_event (Event) and auth_abort_reason (list) — both mutated by the read_stderr() reader thread (lines 438-439). On the leaked-reader path, finalize then reads ctx.auth_abort_event.is_set() (line 737) and ctx.auth_abort_reason[0] (738) on those live objects.

So if the leaked thread is the stderr reader and it matches an auth pattern in a late line after the snapshot but before line 737, an otherwise-successful run is converted to a spurious HTTP 503 — which task_execution_service (SUB-003) treats as an auth failure and uses to auto-switch the subscription. Narrow, but real.

At minimum this contradicts the helper's own docstring ("Snapshot EVERY field finalize reads"). Either snapshot the auth fields too, or document explicitly why they're exempt (as parse_failure_count/_sample already are).

dolho · 2026-06-05T11:16:39Z

+        drain_outcome = _drain_bounded(process, stdout_thread, stderr_thread,
+                                       grace=3, pgid=ctx.process_pgid,
+                                       execution_tag=ctx.task_session_id)
+        ctx.drain_budget_exceeded = drain_outcome in ("budget_exceeded", "errored")


This drain_budget_exceeded write is dead for its stated purpose.

The TimeoutExpired branch sets the flag and then immediately raises. execute_headless_task catches TimeoutExpired and raises HTTPException(504) without calling _finalize_headless_result, so the flag set here is never read — only the post-exit drain at line 612 feeds finalize. Harmless today, but it reads as if the snapshot protects the timeout path when it does not, and it's a maintenance trap if the timeout branch is ever wired into finalize later. Consider dropping the assignment here (with a comment) or routing the timeout path through finalize so the snapshot actually applies.

dolho · 2026-06-05T11:16:40Z

+      "never swallow exceptions silently" rule).
    """
    done = threading.Event()
+    errored = threading.Event()


Minor (cleanup): the second errored Event is redundant. It's written by one thread and read by one thread only after the done barrier, so a single write-once cell captures the same thing with the ordering self-evident:

outcome = "completed" def _target(): nonlocal outcome try: asyncio.run(_drain_reader_threads(...)) except Exception: outcome = "errored" logger.exception(...) finally: done.set() ... if not done.wait(timeout=_DRAIN_BUDGET_SECONDS): return "budget_exceeded" return outcome

Removes one Event allocation per drain and the load-bearing "errored is fully settled" comment. (A concurrent.futures Future would fold the timeout + exception-capture + outcome together even more cleanly.)

dolho · 2026-06-05T11:16:41Z

+                response_parts=list(ctx.response_parts),
+                execution_log=list(ctx.execution_log),
+                verbose_output_lines=list(ctx.verbose_output_lines),
+                metadata=ctx.metadata.model_copy(deep=True),


Minor (cleanup): model_copy(deep=True) — ExecutionMetadata's only nested mutable field is compact_events, which finalize overwrites from the JSONL on the success path (_extract_compact_events_from_jsonl). Everything else is scalars, where deep=True == shallow. The deep copy is also precisely what can raise the "changed size during iteration" the 3-attempt retry loop exists to absorb. A deep=False (or model_copy(update={'compact_events': list(...)}) if a tear-safe copy of that one list is actually wanted) narrows the race window the retry is guarding.

dolho · 2026-06-05T11:17:05Z

Code review — `/code-review` (high effort, 7 finder angles)

Reviewed the 4-file diff. The core design is sound (verified: errored/done Event ordering is correct; ctx never escapes execute_headless_task so finalize mutating the snapshot copy is fine; replace() import + dataclass field ordering OK; both _drain_bounded callers safely ignore the changed return type). Local runtime verification also passed (clean headless task + concurrent tasks, no regression).

Findings posted inline, ranked:

🅰 Correctness — leaked drain outcome not surfaced (subprocess_lifecycle.py:124). drain_reader_threads returns normally on its leaked branch with readers confirmed still-alive, so _drain_bounded reports "completed" and finalize skips the snapshot — the exact race this PR fixes, on the path it declares safe. Strongest finding.
🅱 Correctness (narrow) — auth_abort_* not snapshotted (headless_executor.py:642). Finalize reads auth_abort_event/auth_abort_reason off the still-live objects after the snapshot rebind; a leaked stderr reader setting auth-abort late → spurious 503 + SUB-003 auto-switch. Also contradicts the "snapshot EVERY field" docstring.
Dead flag on the timeout path (headless_executor.py:603) — set then raise; the 504 path never finalizes, so it's never read.
Cleanup — redundant second Event (subprocess_lifecycle.py:90); unnecessary model_copy(deep=True) (headless_executor.py:648).

Not inline-able (file outside the diff)

Consistency — the chat path wasn't hardened (services/claude_code.py:386). execute_claude_code is structurally identical — same shared buffers, same two _drain_bounded sites, a finalize-equivalent block — but it discards the new DrainOutcome and never snapshots, then reads ''.join(stderr_lines) immediately after the drain (L390). The same leaked-reader race remains fully open on the higher-traffic interactive /api/chat surface. Harden headless drain/finalize: capture daemon-thread exceptions + finalize snapshot isolation (salvage from #980) #1025 scoped itself to the headless path, so this is arguably a follow-up rather than a blocker — but combined with finding 1 it suggests the real fix belongs at the shared _drain_bounded/finalize seam both callers cross, not re-implemented per-caller.

Findings 3-5 are low-severity; 1 and 2 are worth addressing before merge (1 especially — it's a gap in the fix's own trigger logic).

— 🤖 generated by /code-review

dolho · 2026-06-05T11:29:34Z

Review findings addressed — `e65238a5`

#	Finding	Fix
1	🔴 `leaked` drain outcome missed → snapshot skipped while a confirmed-alive reader mutates ctx	`_drain_bounded` adds a `"leaked"` `DrainOutcome`: after a within-budget, non-raising drain it checks whether any reader thread it was handed is still alive (no dependency on `drain_reader_threads` internals). Flag renamed `reader_may_be_live`, set on `drain_outcome != "completed"` so budget_exceeded / errored / leaked all snapshot.
2	🟠 `auth_abort_*` not snapshotted → possible spurious 503 / SUB-003 auto-switch	`_snapshot_for_finalize` now freezes both: a fresh Event mirroring `is_set()` (`_freeze_event`) + a `list(...)` copy of `auth_abort_reason`. The "snapshots every field finalize reads" claim now holds.
3	Dead flag on the timeout path	Dropped — the `TimeoutExpired` path re-raises to HTTP 504 without finalizing, so nothing reads it (replaced with a comment).
4	Redundant `errored` Event	Collapsed to a single write-once `outcome` cell set by the daemon thread, read after the `done` barrier.
5	`model_copy(deep=True)`	Kept deliberately — `compact_events` is a mutable list a leaked reader can append to, so the deep copy is the correct isolation on the snapshot path; the retry guard is needed regardless of depth (it covers `__pydantic_fields_set__` growth).
—	Chat path (`claude_code.py`) has the same race	Left as a follow-up — out of #1025's headless scope, and findings 1+2 argue the real fix is to lift this onto the shared `_drain_bounded`/finalize seam both callers cross rather than re-implement per-caller. Worth its own issue.

New tests: "leaked"-outcome (within-budget drain leaving a live reader) + auth-abort freeze isolation; 21 passed locally. Re-verified on a live agent-server — clean task still 200s through the fast path with zero leaked/snapshot warnings.

dolho · 2026-06-05T13:36:23Z

Second review pass (post-fix `e65238a5`) — follow-up findings

First-round findings are resolved. A deeper pass surfaced 4 smaller items (none merge-blocking on their own) + known notes. Flagging status-in-progress until 2–4 are addressed.

1. 🟠 Residual torn read of response_parts (pre-existing, snapshot doesn't fully close).
_snapshot_for_finalize does list(ctx.response_parts), but stream_parser.py:299-300 does response_parts.clear() then .append(result_text). If a leaked reader runs that pair while list() executes between them, the snapshot captures [] → spurious empty-result 500/502. list() doesn't raise, so the retry guard never fires. Not a regression (old code read the live list directly), and the proper fix lives in stream_parser (build-then-assign / lock) — tracking as a follow-up rather than expanding this PR's scope.

2. Misleading "leaked" warning on the chat path. The new _drain_bounded log says "reporting 'leaked' so finalize snapshots the run context", but it fires for all callers including claude_code.py (chat), which discards the return and has no snapshot — misleading for operators. Reword to drop the caller-specific promise.

3. Double-scan self-contradiction. _drain_bounded computes any(t.is_alive()) for the return and sum(1 for t … is_alive()) for the warning — two walks; a reader dying between them can log "reporting 'leaked' … 0 reader thread(s) are still alive." Compute the alive list once.

4. Test gap. test_snapshot_freezes_auth_abort_signal only covers not-set→set; the load-bearing if event.is_set(): frozen.set() branch (the 503 path) is untested. Add the already-set→stays-set assertion.

Known / by-design (not actionable here)

Chat path (claude_code.py) shares the same race — deliberately out of Harden headless drain/finalize: capture daemon-thread exceptions + finalize snapshot isolation (salvage from #980) #1025's headless scope; better lifted onto the shared _drain_bounded/finalize seam as its own issue.
_snapshot_for_finalize's field list is hand-maintained (drift risk as HeadlessRunContext grows).
3-retry fallback returns the live ctx as a last resort; and the approach is symptom-level (root cause = uncancellable run_in_executor thread, already scoped out).

Refuted

metadata value-tearing across fields — model_copy snapshots __dict__ in one atomic copy under the GIL; the retry already covers the only real race (__pydantic_fields_set__ growth).

Plan: fix 2/3/4 (small), leave 1 as a tracked follow-up.

— 🤖 second pass via /code-review

github-actions · 2026-06-06T08:43:23Z

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

vybe · 2026-06-08T14:01:56Z

Validation: approve the fix on its merits, but please rebase before merge.

The D19/D20 hardening itself looks solid — typed DrainOutcome with capture+log replacing the silent except: pass, the leaked outcome catching the #586 within-budget-but-reader-alive gap, and the finalize snapshot isolation (incl. the auth_abort_event freeze that prevents a late auth match flipping success→503). Clean-drain keeps the zero-copy fast path, so no behavioural change on the happy path. Tests cover the outcome matrix + isolation/retry-fallback.

One blocking concern — the branch is stale and predates the fix it's meant to complement. It was cut from a dev that does not contain:

b4c5bfdc (fix(#970): bound headless wait on result message + tool stall #973) — the actual 2h false-timeout fix this PR is framed as salvaging the orthogonal bits of
abb6654c (bug: agent container OOM cascades into backend worker saturation and freezes admin UI #904), ab9f4012 (bug: Runaway subprocess after task failure makes agent-server unresponsive, triggering persistent CB cascade #817) — other headless_executor.py fixes since landed

dev has 8 commits touching headless_executor.py since this branch's merge-base. The 3-way merge auto-resolves cleanly (no conflicts), but a clean textual merge isn't a semantic guarantee: #973 and D19/D20 both touch the drain/finalize path, and the "130 passed" run was against a base that lacks #973 — so the suite hasn't actually exercised the merged result.

Please rebase onto current dev (post-#973) and re-run the headless suite (-k "drain or headless or claude or subprocess or pgroup or orphan") so we can confirm the salvaged hardening composes correctly with #973's wait-bounding changes.

(For reference: the README.md +219/−219 in the diff is a harmless artifact of the stale merge-base — the README/hero blobs are byte-identical to what's already on dev/main, so the merge is a README no-op. No action needed there.)

…1025) Salvages the two orthogonal robustness improvements from the closed PR #980 that the #973 fix did not include. Pure hardening of the already-bounded drain/finalize path — defense-in-depth, not a behavioural change on the clean path. D20 — capture daemon-thread exceptions in `_drain_bounded` `subprocess_lifecycle._drain_bounded` swallowed every drain exception via `except Exception: pass`, making a drain that raised indistinguishable from a clean completion (and hiding the leaked reader). It now captures + logs the exception and returns a `DrainOutcome` (`completed` | `budget_exceeded` | `errored`) so the caller can tell the leaked-reader cases apart. Existing callers that ignore the return value are unaffected. D19 — `_finalize_headless_result` snapshot isolation On the budget-exceeded / errored drain path a reader thread is leaked and may still be mutating ctx's shared buffers / metadata. The headless run context now carries `drain_budget_exceeded`; when set, finalize rebinds `ctx` to a deep snapshot (`list(...)` of each buffer + `metadata.model_copy(deep=True)`) via `_snapshot_for_finalize`, which is retry-guarded against the "changed size during iteration" race and falls back to the live ctx if every attempt loses. Clean drains keep the zero-copy fast path. Out of scope (per the issue): the #980 asyncio.wait_for margin widening and the #970 lsof/proc pipe-holder hunt — both correctly excluded. Tests: `_drain_bounded` outcome matrix (completed/budget_exceeded/errored + no-swallow logging) in test_drain_bounded.py; snapshot list/metadata isolation, retry-exhaustion fallback, and default-off fast path in test_headless_finalize_snapshot.py. Full related suite: 130 passed. Related to #1025 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

#1025) Self-review of the #1025 PR surfaced two correctness gaps in the hardening itself, plus a dead assignment and a redundant Event. Fixed: 1. Leaked-reader outcome was missed (strongest finding). drain_reader_threads force-closes and *returns normally* when a grandchild-held reader won't EOF (its own outcome=leaked METRIC, #586), so a within-budget, non-raising drain did NOT prove the readers were dead — _drain_bounded returned "completed", finalize skipped the snapshot, and read buffers a confirmed-alive reader was still mutating. _drain_bounded now adds a "leaked" DrainOutcome: after a within-budget return it checks whether any reader thread it was handed is still alive (no dependency on drain_reader_threads internals). The headless flag is renamed reader_may_be_live and set on `drain_outcome != "completed"` so budget_exceeded / errored / leaked all trigger the snapshot. 2. Snapshot didn't isolate the auth-abort signal. finalize reads auth_abort_event.is_set() / auth_abort_reason[0], but the stderr reader (the thread that leaks) mutates them — a late auth match could flip a success to a spurious 503 (→ SUB-003 auto-switch). _snapshot_for_finalize now freezes both (a fresh Event mirroring is_set() via _freeze_event + a list copy), so the snapshot's "every field finalize reads" claim holds. 3. Dropped the dead reader_may_be_live assignment on the TimeoutExpired path — that path re-raises to HTTP 504 without finalizing, so nothing reads it (replaced with a comment). 4. Collapsed the redundant `errored` Event in _drain_bounded to a single write-once `outcome` cell set by the daemon thread and read after the `done` barrier — removes the load-bearing two-Event ordering invariant. Tests: new "leaked"-outcome case (within-budget drain leaving a live reader) and auth-abort freeze isolation; renamed the flag test. 21 passed locally. Re-verified against a live agent-server: clean task still 200s through the fast path with zero leaked/snapshot warnings. Related to #1025 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vybe

Approving. Verified the load-bearing claim independently: _snapshot_for_finalize captures exactly the 7 ctx buffers _finalize_headless_result reads (lists, metadata deep-copy, frozen auth-abort event/reason); the only iterating read (execution_log) is over the captured copy; tool_start_times is correctly omitted (read only by the #970 stall-watchdog, never by finalize). Outcome matrix + no-swallow logging + snapshot isolation/fallback/auth-freeze are well-tested and CI is green.

⚠️ Cross-PR coordination with #1229: this PR justifies NOT setting reader_may_be_live on the kill path because "the 504 path never calls _finalize_headless_result." #1229 adds _timeout_504_detail(ctx, …) which reads ctx.metadata/buffers on exactly that kill/504 path. Whichever lands second must reconcile — read 504 telemetry from a snapshot (or set reader_may_be_live on the kill path) so the leaked-reader race isn't reopened on the timeout path.

dolho commented Jun 5, 2026

View reviewed changes

dolho added the status-in-progress Currently being worked on label Jun 5, 2026

dolho and others added 2 commits June 12, 2026 11:24

dolho force-pushed the feature/1025-harden-headless-drain-finalize branch from e65238a to b5daa8b Compare June 12, 2026 08:26

dolho requested a review from vybe June 15, 2026 14:29

vybe approved these changes Jun 16, 2026

View reviewed changes

vybe merged commit d2543c8 into dev Jun 16, 2026
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(headless): harden drain/finalize against leaked reader threads (#1025)#1078

fix(headless): harden drain/finalize against leaked reader threads (#1025)#1078
vybe merged 2 commits into
devfrom
feature/1025-harden-headless-drain-finalize

dolho commented Jun 5, 2026

Uh oh!

dolho Jun 5, 2026

Uh oh!

dolho Jun 5, 2026

Uh oh!

dolho Jun 5, 2026

Uh oh!

dolho Jun 5, 2026

Uh oh!

dolho Jun 5, 2026

Uh oh!

dolho commented Jun 5, 2026

Uh oh!

dolho commented Jun 5, 2026

Uh oh!

dolho commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 6, 2026

Uh oh!

vybe commented Jun 8, 2026

Uh oh!

vybe left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dolho commented Jun 5, 2026

Summary

D20 — capture daemon-thread exceptions in _drain_bounded

D19 — _finalize_headless_result snapshot isolation

Explicitly out of scope (per the issue)

Tests

Uh oh!

dolho Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

dolho Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

dolho Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

dolho Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

dolho Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

dolho commented Jun 5, 2026

Code review — /code-review (high effort, 7 finder angles)

Not inline-able (file outside the diff)

Uh oh!

dolho commented Jun 5, 2026

Review findings addressed — e65238a5

Uh oh!

dolho commented Jun 5, 2026

Second review pass (post-fix e65238a5) — follow-up findings

Known / by-design (not actionable here)

Refuted

Uh oh!

github-actions Bot commented Jun 6, 2026

Uh oh!

vybe commented Jun 8, 2026

Uh oh!

vybe left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

D20 — capture daemon-thread exceptions in `_drain_bounded`

D19 — `_finalize_headless_result` snapshot isolation

Code review — `/code-review` (high effort, 7 finder angles)

Review findings addressed — `e65238a5`

Second review pass (post-fix `e65238a5`) — follow-up findings