fix(agent): salvage execution telemetry on agent-side timeout 504 (#1201) #1229
dolho wants to merge 1 commit into
… survive (#1201)

When an agent execution hits its own max_duration budget or the per-tool stall watchdog, headless_executor raised a 504 whose detail carried only {message, termination_reason}. The cost/context/tool-call telemetry parsed before the kill (ctx.metadata) was discarded, so the backend wrote a bare FAILED row: zero cost accounting for exactly the long, tool-heavy runs that reach the timeout.

- Add `_timeout_504_detail(ctx, message, reason, stalled_tool=None)` helper that includes `metadata = sanitize_dict(ctx.metadata.model_dump())` (the same structured shape the #678 empty-result body uses) and route all three timeout 504 sites (outer safety-net, inner max_duration, stall_no_output) through it. This also de-duplicates the body.
- No backend change: the agent 504 already reaches the existing HTTPError salvage branch (response.raise_for_status -> httpx.HTTPStatusError), which reads detail["metadata"] and persists salvaged cost / context_used / context_max onto the FAILED row. termination_reason and failure semantics are unchanged.

Out of scope (per issue): the backend httpx ReadTimeout case (no response body). Telemetry is genuinely unavailable there; overlaps #1022.

Tests: tests/unit/test_1201_timeout_telemetry_salvage.py pins the agent-side contract (max_duration + stall bodies carry metadata; shape matches the backend salvage reader; empty-metadata body still well-formed). #1094 stall-reason tests still pass.

Note: takes effect once the agent base image is rebuilt and agents recreated.

Related to #1201
This PR changes that: Suggest reading the 504 telemetry from a snapshot (reuse |
Problem
When an agent execution times out, the terminal `schedule_executions` write is bare: `status='failed'` plus an `error` string. All telemetry parsed before the kill (cost, context, tool calls) is discarded, so operators get zero cost accounting for exactly the long, tool-heavy runs that reach the timeout.

Root cause (the tractable case)
On an agent-side budget timeout (`max_duration`) or per-tool stall kill, `headless_executor.py` raised a 504 whose `detail` carried only `{message, termination_reason}`. `ctx.metadata` (cost/context/tool-calls) was fully populated and in scope, one dict key away from being returned.

The backend already has the salvage machinery: the agent 504 reaches `response.raise_for_status()` → `httpx.HTTPStatusError` → the existing `except httpx.HTTPError` branch in `task_execution_service.py`, which reads `detail["metadata"]` and rolls cost/context onto the FAILED row (the #678 path). It just never received `metadata`.

Fix (agent-side only)
- Add `_timeout_504_detail(ctx, message, reason, stalled_tool=None)`, which attaches `metadata = sanitize_dict(ctx.metadata.model_dump())` (the same structured shape the empty-result body from #678, "Async chat_with_agent: long execution silently fails with null response (reader-thread)", uses), and route all three timeout-504 sites (outer safety-net, inner `max_duration`, `stall_no_output`) through it. This also de-duplicates the three near-identical bodies.
- No backend change: the existing HTTPError salvage branch already reads `metadata` and persists salvaged `cost` / `context_used` / `context_max`. `termination_reason` and the timeout-failure semantics are unchanged.

Out of scope (per issue)
The backend httpx ReadTimeout case (`task_execution_service.py`, `except httpx.TimeoutException`): no response body exists, so telemetry genuinely can't be salvaged without a side-channel. Bigger problem, overlaps #1022.

Verification
- `tests/unit/test_1201_timeout_telemetry_salvage.py` pins the agent-side contract: `max_duration` and `stall_no_output` bodies carry `metadata` with cost/context/tool fields; the metadata shape matches every key the backend salvage reads (`cost_usd`, `input_tokens`, `cache_*`, `context_window`); an empty-metadata timeout still produces a well-formed body.
- Existing #1094 stall-reason tests still pass (6 passed total).
- `py_compile` clean.

Related to #1201
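The contract the tests pin can be read from the backend side as well. The sketch below is a guess at the shape of the salvage reader, not the code in `task_execution_service.py`: `salvage_telemetry` is a hypothetical name, the mapping from `input_tokens` to `context_used` is an assumption, and the `cache_*` keys are left out because their exact names are not spelled out in this PR.

```python
from typing import Any, Optional


def salvage_telemetry(detail: Any) -> dict[str, Optional[float]]:
    """Pull salvageable telemetry out of a timeout 504 detail, if present."""
    metadata = detail.get("metadata") if isinstance(detail, dict) else None
    if not isinstance(metadata, dict):
        # Agents built before this fix send no metadata: nothing to salvage.
        return {}
    return {
        "cost": metadata.get("cost_usd"),
        "context_used": metadata.get("input_tokens"),  # assumed mapping
        "context_max": metadata.get("context_window"),
    }
```

A reader that tolerates a missing `metadata` key keeps old and new agent images interchangeable, which matters because the fix only takes effect once the agent base image is rebuilt.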
🤖 Generated with Claude Code