Skip to content

fix(agent): salvage execution telemetry on agent-side timeout 504 (#1201)#1229

Open
dolho wants to merge 1 commit into
devfrom
fix/1201-timeout-telemetry-salvage
Open

fix(agent): salvage execution telemetry on agent-side timeout 504 (#1201)#1229
dolho wants to merge 1 commit into
devfrom
fix/1201-timeout-telemetry-salvage

Conversation

@dolho

@dolho dolho commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Problem

When an agent execution times out, the terminal schedule_executions write is bare — status='failed' + an error string. All telemetry parsed before the kill (cost, context, tool calls) is discarded, so operators get zero cost accounting for exactly the long, tool-heavy runs that reach the timeout.

Root cause (the tractable case)

On an agent-side budget timeout (max_duration) or per-tool stall kill, headless_executor.py raises a 504 whose detail carried only {message, termination_reason}. ctx.metadata (cost/context/tool-calls) was fully populated and in scope — one dict-key away from being returned.

The backend already has the salvage machinery: the agent 504 reaches response.raise_for_status()httpx.HTTPStatusError → the existing except httpx.HTTPError branch in task_execution_service.py, which reads detail["metadata"] and rolls cost/context onto the FAILED row (the #678 path). It just never received metadata.

Fix (agent-side only)

  • Add _timeout_504_detail(ctx, message, reason, stalled_tool=None) that attaches metadata = sanitize_dict(ctx.metadata.model_dump()) — the same structured shape the Async chat_with_agent: long execution silently fails with null response (reader-thread) #678 empty-result body uses — and route all three timeout-504 sites (outer safety-net, inner max_duration, stall_no_output) through it. Also de-duplicates the three near-identical bodies.
  • No backend change: the existing HTTPError salvage branch now finds metadata and persists salvaged cost / context_used / context_max. termination_reason and the timeout-failure semantics are unchanged.

Out of scope (per issue)

The backend httpx ReadTimeout case (task_execution_service.py except httpx.TimeoutException) — no response body exists, so telemetry genuinely can't be salvaged without a side-channel. Bigger problem, overlaps #1022.

Verification

tests/unit/test_1201_timeout_telemetry_salvage.py — pins the agent-side contract: max_duration and stall_no_output bodies carry metadata with cost/context/tool fields; the metadata shape matches every key the backend salvage reads (cost_usd, input_tokens, cache_*, context_window); empty-metadata timeout still produces a well-formed body. Existing #1094 stall-reason tests still pass (6 passed total). py_compile clean.

Note: takes effect once the agent base image is rebuilt and agents recreated (agent-server change).

Related to #1201

🤖 Generated with Claude Code

… survive (#1201)

When an agent execution hits its own max_duration budget or the per-tool stall
watchdog, headless_executor raised a 504 whose detail carried only
{message, termination_reason}. The cost/context/tool-call telemetry parsed
before the kill (ctx.metadata) was discarded, so the backend wrote a bare
FAILED row — zero cost accounting for exactly the long, tool-heavy runs that
reach the timeout.

- Add `_timeout_504_detail(ctx, message, reason, stalled_tool=None)` helper that
  includes `metadata = sanitize_dict(ctx.metadata.model_dump())` — the same
  structured shape the #678 empty-result body uses — and route all three
  timeout 504 sites (outer safety-net, inner max_duration, stall_no_output)
  through it (also de-duplicates the body).
- No backend change: the agent 504 already reaches the existing HTTPError
  salvage branch (response.raise_for_status -> httpx.HTTPStatusError), which
  reads detail["metadata"] and persists salvaged cost / context_used /
  context_max onto the FAILED row. termination_reason / failure semantics
  unchanged.

Out of scope (per issue): the backend httpx ReadTimeout case (no response body)
— telemetry genuinely unavailable there; overlaps #1022.

Tests: tests/unit/test_1201_timeout_telemetry_salvage.py pins the agent-side
contract (max_duration + stall bodies carry metadata; shape matches the backend
salvage reader; empty-metadata body still well-formed). #1094 stall-reason
tests still pass.

Note: takes effect once the agent base image is rebuilt and agents recreated.

Related to #1201
@github-actions

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

@vybe

vybe commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

⚠️ Coordination note — now live in dev via #1078 (merged): #1078 hardened _finalize_headless_result to snapshot the run context when a drain leaves a possibly-live reader (reader_may_be_live), but it deliberately does not set that flag on the timeout/kill path. Its stated justification: the 504 path never calls _finalize_headless_result, so there's no ctx read to protect there.

This PR changes that: _timeout_504_detail(ctx, …) now reads ctx.metadata (and salvages buffers) on exactly that kill path. If a timeout kill also leaks a reader past the grace=3 drain budget, metadata.model_dump()/buffer reads can race the live reader — the precise tear #1078 fixes for finalize, reopened on the 504 path.

Suggest reading the 504 telemetry from a snapshot (reuse _snapshot_for_finalize) or setting reader_may_be_live + snapshotting on the kill path before this merges. Refs: #1078 / #1025.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants