fix(agent): salvage execution telemetry on agent-side timeout 504 (#1201) by dolho · Pull Request #1229 · Abilityai/trinity

dolho · 2026-06-16T11:25:57Z

Problem

When an agent execution times out, the terminal schedule_executions write is bare — status='failed' + an error string. All telemetry parsed before the kill (cost, context, tool calls) is discarded, so operators get zero cost accounting for exactly the long, tool-heavy runs that reach the timeout.

Root cause (the tractable case)

On an agent-side budget timeout (max_duration) or per-tool stall kill, headless_executor.py raised a 504 whose detail carried only {message, termination_reason}. ctx.metadata (cost/context/tool-calls) was fully populated and in scope, one dict key away from being returned.

The backend already has the salvage machinery: the agent 504 reaches response.raise_for_status() → httpx.HTTPStatusError → the existing except httpx.HTTPError branch in task_execution_service.py, which reads detail["metadata"] and rolls cost/context onto the FAILED row (the #678 path). It just never received metadata.
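A rough sketch of what such a salvage read might look like, written against a parsed response body rather than a live httpx response so that it stays self-contained. The function name, the `{"detail": ...}` envelope, and the mapping of `input_tokens` to `context_used` are assumptions made for illustration; the real branch in task_execution_service.py is not shown in this PR and also reads the cache_* fields, which are omitted here.

```python
from typing import Any


def salvage_from_error_body(body: Any) -> dict:
    """Pull cost/context fields out of a parsed 504 body, tolerating absence.

    Hypothetical helper: the envelope shape and the field mapping are assumed.
    """
    detail = body.get("detail") if isinstance(body, dict) else None
    metadata = detail.get("metadata") if isinstance(detail, dict) else None
    if not isinstance(metadata, dict):
        # Pre-fix behaviour: no metadata in the body, so nothing to salvage.
        return {}

    salvaged = {}
    if metadata.get("cost_usd") is not None:
        salvaged["cost"] = metadata["cost_usd"]
    if metadata.get("input_tokens") is not None:
        salvaged["context_used"] = metadata["input_tokens"]  # assumed mapping
    if metadata.get("context_window") is not None:
        salvaged["context_max"] = metadata["context_window"]
    return salvaged
```

With a body that carries only message and termination_reason, this returns an empty dict, which corresponds to the bare FAILED row described above.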

Fix (agent-side only)

Add _timeout_504_detail(ctx, message, reason, stalled_tool=None), which attaches metadata = sanitize_dict(ctx.metadata.model_dump()) (the same structured shape the #678 empty-result body uses), and route all three timeout-504 sites (outer safety-net, inner max_duration, stall_no_output) through it. This also de-duplicates the three near-identical bodies.
No backend change: the existing HTTPError salvage branch now finds metadata and persists salvaged cost / context_used / context_max. termination_reason and the timeout-failure semantics are unchanged.

Out of scope (per issue)

The backend httpx ReadTimeout case (task_execution_service.py except httpx.TimeoutException) — no response body exists, so telemetry genuinely can't be salvaged without a side-channel. Bigger problem, overlaps #1022.

Verification

tests/unit/test_1201_timeout_telemetry_salvage.py — pins the agent-side contract: max_duration and stall_no_output bodies carry metadata with cost/context/tool fields; the metadata shape matches every key the backend salvage reads (cost_usd, input_tokens, cache_*, context_window); empty-metadata timeout still produces a well-formed body. Existing #1094 stall-reason tests still pass (6 passed total). py_compile clean.
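One way such a contract check could be expressed, as a sketch only. The function and constant names are hypothetical, and only the three fully named keys are checked; the cache_* keys are elided in the description, so they are left out rather than guessed.

```python
# Keys named in full in the PR description; cache_* names are not spelled out.
REQUIRED_KEYS = ("cost_usd", "input_tokens", "context_window")


def missing_salvage_keys(detail: dict) -> list:
    """Return the required metadata keys absent from a 504 detail body."""
    metadata = detail.get("metadata")
    if not isinstance(metadata, dict):
        # No metadata at all: every required key is missing.
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if key not in metadata]
```

A test in this style would assert that the list is empty for both the max_duration and stall_no_output bodies.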

Note: takes effect once the agent base image is rebuilt and agents recreated (agent-server change).

Related to #1201

🤖 Generated with Claude Code

… survive (#1201)

When an agent execution hits its own max_duration budget or the per-tool stall watchdog, headless_executor raised a 504 whose detail carried only {message, termination_reason}. The cost/context/tool-call telemetry parsed before the kill (ctx.metadata) was discarded, so the backend wrote a bare FAILED row: zero cost accounting for exactly the long, tool-heavy runs that reach the timeout.

- Add `_timeout_504_detail(ctx, message, reason, stalled_tool=None)` helper that includes `metadata = sanitize_dict(ctx.metadata.model_dump())` (the same structured shape the #678 empty-result body uses) and route all three timeout 504 sites (outer safety-net, inner max_duration, stall_no_output) through it (also de-duplicates the body).
- No backend change: the agent 504 already reaches the existing HTTPError salvage branch (response.raise_for_status -> httpx.HTTPStatusError), which reads detail["metadata"] and persists salvaged cost / context_used / context_max onto the FAILED row. termination_reason / failure semantics unchanged.

Out of scope (per issue): the backend httpx ReadTimeout case (no response body); telemetry is genuinely unavailable there; overlaps #1022.

Tests: tests/unit/test_1201_timeout_telemetry_salvage.py pins the agent-side contract (max_duration + stall bodies carry metadata; shape matches the backend salvage reader; empty-metadata body still well-formed). #1094 stall-reason tests still pass.

Note: takes effect once the agent base image is rebuilt and agents recreated.

Related to #1201

github-actions · 2026-06-16T11:27:53Z

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

vybe · 2026-06-16T15:13:47Z

⚠️ Coordination note — now live in dev via #1078 (merged): #1078 hardened _finalize_headless_result to snapshot the run context when a drain leaves a possibly-live reader (reader_may_be_live), but it deliberately does not set that flag on the timeout/kill path. Its stated justification: the 504 path never calls _finalize_headless_result, so there's no ctx read to protect there.

This PR changes that: _timeout_504_detail(ctx, …) now reads ctx.metadata (and salvages buffers) on exactly that kill path. If a timeout kill also leaks a reader past the grace=3 drain budget, metadata.model_dump()/buffer reads can race the live reader — the precise tear #1078 fixes for finalize, reopened on the 504 path.

Suggest reading the 504 telemetry from a snapshot (reuse _snapshot_for_finalize) or setting reader_may_be_live + snapshotting on the kill path before this merges. Refs: #1078 / #1025.
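To illustrate the snapshot idea in general terms: copy the metadata under a lock before building the 504 body, so a reader thread that outlives the drain cannot mutate it mid-serialization. This is a generic sketch, not a description of how _snapshot_for_finalize works; the class, the lock, and the dict-based metadata are all hypothetical.

```python
import copy
import threading


class RunContext:
    """Hypothetical run context shared between executor and reader thread."""

    def __init__(self):
        self.lock = threading.Lock()
        self.metadata = {"cost_usd": None, "tool_calls": []}

    def record_tool_call(self, name):
        # Called from the reader thread as output is parsed.
        with self.lock:
            self.metadata["tool_calls"].append(name)

    def snapshot_metadata(self):
        # Deep copy under the lock so a still-live reader cannot change
        # the data while the 504 body is being built from it.
        with self.lock:
            return copy.deepcopy(self.metadata)
```

The kill path would build its body from `snapshot_metadata()` instead of the live object, so later writes from a leaked reader no longer affect the response.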

dolho mentioned this pull request on Jun 16, 2026, in #1223: fix(ci): gate required schema-parity & verify-non-root via changes job (#1222) (merged).

dolho requested a review from vybe on June 16, 2026, 11:43.

vybe mentioned this pull request on Jun 16, 2026, in #1078: fix(headless): harden drain/finalize against leaked reader threads (#1025) (merged).

dolho wants to merge 1 commit into dev from fix/1201-timeout-telemetry-salvage
