Summary
When SUB-003 auto-switches an agent's subscription (on a 429 rate-limit or auth failure), it recreates the agent container to inject the new CLAUDE_CODE_OAUTH_TOKEN — but it stops the container without draining in-flight executions. Any other execution running in that agent's parallel slots is killed mid-flight and recorded as a failure (HTTP error: RemoteProtocolError / ReadError). Long-running scheduled tasks (research agents, multi-minute/30-min runs) are silently lost and the spend wasted. The same no-drain recreate path is reachable from manual subscription assignment and agent rebuild/deploy.
Component
Backend / Subscription auto-switch (SUB-003) — services/subscription_auto_switch.py
Priority
P1 — a core feature (auto-switch) destroys in-flight work from another core feature (parallel scheduled execution), with no in-product workaround. Recurs automatically whenever a shared subscription rate-limits while agents have long executions running.
Error
Recorded on the killed execution:
schedule_executions.error = "HTTP error: RemoteProtocolError" (or "HTTP error: ReadError")
Backend log sequence (timestamps from one occurrence; the stop precedes the recreate-complete by ~3s):
12:12:43 [TaskExecService] Failed to execute task on [AGENT]: HTTP error: RemoteProtocolError <- in-flight exec killed
12:12:43 [Slots] Agent '[AGENT]' released slot for execution [EXEC_ID] ... FAILED
12:12:46 Recreated container for agent [AGENT] with updated configuration <- recreate completes
12:12:57 [SUB-003] Auto-switch complete: {'switched': True, 'agent_name': '[AGENT]',
'failure_kind': 'rate_limit', 'restart_result': 'success'}
The connection is severed because the agent's HTTP server is stopped while the backend is still awaiting the /api/task response — httpx surfaces RemoteProtocolError (peer closed mid-response) or ReadError (connection dropped during read).
Location
- File:
src/backend/services/subscription_auto_switch.py
- Function:
_restart_agent() (line ~266), called from _perform_auto_switch() (line ~203)
The offending sequence:
# _restart_agent(), no check for running executions:
await container_stop(container) # kills every in-flight execution in the agent's slots
await start_agent_internal(agent_name)
container_stop is gated only on agent_status.status == "running" (container is up), not on whether the agent is idle.
Same no-drain recreate is also reachable via:
- Manual assign/unassign —
src/backend/routers/subscriptions.py:252
- Agent rebuild / base-image update (
recreate_container_with_updated_config, src/backend/services/agent_service/lifecycle.py)
Root Cause
Subscription token changes require a container recreate (the token is injected as a build-time env var). The recreate stops the container immediately, with no grace for executions running in the agent's other parallel slots. Because auto-switch is triggered by a 429 on one task, any other long-running task on the same agent becomes collateral damage.
Aggravating factor: when many agents share one subscription, that subscription hitting its model rate limit produces a burst of 429s → simultaneous auto-switch + recreate across all of them → many in-flight executions killed at once. Observed: 9 auto-switches in one afternoon (8 off a single shared subscription) → 8 transport-drop execution failures the same day.
Note: the backend correctly classifies RemoteProtocolError/ReadError as non-circuit-failures (services/agent_client.py, #474), so this does not (and should not) open the circuit breaker — but there is no retry/drain compensating, so the execution just dies terminal-FAILED.
Reproduction Steps
Deterministic (no real rate-limit needed — exercises the same recreate path):
- Pick an agent
A with a subscription assigned and max_parallel_tasks >= 1.
- Start a long execution:
POST /api/agents/A/task with a multi-minute prompt. Confirm status=running (slot acquired) via GET /api/agents/A/executions.
- While it runs, recreate the container via the path SUB-003 uses:
PUT /api/subscriptions/agents/A?subscription_name=<other-subscription>.
- Observe: backend logs
Recreated container for agent A; the running execution dies with schedule_executions.error = "HTTP error: RemoteProtocolError" and never completes.
End-to-end (auto-switch path): two agents share one near-limit subscription with auto-switch ON, each max_parallel_tasks >= 2. Start a long task on A; start a second task on A that hits the 429 → SUB-003 auto-switches A → the long task from step 1 dies with RemoteProtocolError.
Suggested Fix
Drain (or refuse to stop) before recreating on a token change:
- Before
container_stop in _restart_agent(), check the agent's running-execution / slot count; if non-zero, defer the recreate until the agent is idle (the new token only needs to apply to future executions, and the switch already records the rate-limit event so the exhausted subscription won't be re-selected meanwhile), or
- Apply a bounded graceful drain (wait up to N seconds / until slots free), or
- Inject the switched token without a full container recreate if feasible.
# Sketch in _restart_agent():
running = get_running_execution_count(agent_name) # slot/registry query
if running > 0:
# defer: token applies to next execution; don't kill in-flight work
schedule_recreate_when_idle(agent_name)
return "deferred_running"
await container_stop(container)
await start_agent_internal(agent_name)
The manual-assign (routers/subscriptions.py) and rebuild (agent_service/lifecycle.py) recreate paths should share the same drain guard.
Environment
- Trinity commit:
353b7c05
- Backend: FastAPI + uvicorn 0.37.0; agents run their own HTTP server on the internal Docker network
- Execution path affected:
/api/task (scheduled + async executions)
Related
Summary
When SUB-003 auto-switches an agent's subscription (on a 429 rate-limit or auth failure), it recreates the agent container to inject the new
CLAUDE_CODE_OAUTH_TOKEN— but it stops the container without draining in-flight executions. Any other execution running in that agent's parallel slots is killed mid-flight and recorded as a failure (HTTP error: RemoteProtocolError/ReadError). Long-running scheduled tasks (research agents, multi-minute/30-min runs) are silently lost and the spend wasted. The same no-drain recreate path is reachable from manual subscription assignment and agent rebuild/deploy.Component
Backend / Subscription auto-switch (SUB-003) —
services/subscription_auto_switch.pyPriority
P1 — a core feature (auto-switch) destroys in-flight work from another core feature (parallel scheduled execution), with no in-product workaround. Recurs automatically whenever a shared subscription rate-limits while agents have long executions running.
Error
Recorded on the killed execution:
Backend log sequence (timestamps from one occurrence; the stop precedes the recreate-complete by ~3s):
The connection is severed because the agent's HTTP server is stopped while the backend is still awaiting the
/api/taskresponse — httpx surfacesRemoteProtocolError(peer closed mid-response) orReadError(connection dropped during read).Location
src/backend/services/subscription_auto_switch.py_restart_agent()(line ~266), called from_perform_auto_switch()(line ~203)The offending sequence:
container_stopis gated only onagent_status.status == "running"(container is up), not on whether the agent is idle.Same no-drain recreate is also reachable via:
src/backend/routers/subscriptions.py:252recreate_container_with_updated_config,src/backend/services/agent_service/lifecycle.py)Root Cause
Subscription token changes require a container recreate (the token is injected as a build-time env var). The recreate stops the container immediately, with no grace for executions running in the agent's other parallel slots. Because auto-switch is triggered by a 429 on one task, any other long-running task on the same agent becomes collateral damage.
Aggravating factor: when many agents share one subscription, that subscription hitting its model rate limit produces a burst of 429s → simultaneous auto-switch + recreate across all of them → many in-flight executions killed at once. Observed: 9 auto-switches in one afternoon (8 off a single shared subscription) → 8 transport-drop execution failures the same day.
Note: the backend correctly classifies
RemoteProtocolError/ReadErroras non-circuit-failures (services/agent_client.py, #474), so this does not (and should not) open the circuit breaker — but there is no retry/drain compensating, so the execution just dies terminal-FAILED.Reproduction Steps
Deterministic (no real rate-limit needed — exercises the same recreate path):
Awith a subscription assigned andmax_parallel_tasks >= 1.POST /api/agents/A/taskwith a multi-minute prompt. Confirmstatus=running(slot acquired) viaGET /api/agents/A/executions.PUT /api/subscriptions/agents/A?subscription_name=<other-subscription>.Recreated container for agent A; the running execution dies withschedule_executions.error = "HTTP error: RemoteProtocolError"and never completes.End-to-end (auto-switch path): two agents share one near-limit subscription with auto-switch ON, each
max_parallel_tasks >= 2. Start a long task onA; start a second task onAthat hits the 429 → SUB-003 auto-switchesA→ the long task from step 1 dies withRemoteProtocolError.Suggested Fix
Drain (or refuse to stop) before recreating on a token change:
container_stopin_restart_agent(), check the agent's running-execution / slot count; if non-zero, defer the recreate until the agent is idle (the new token only needs to apply to future executions, and the switch already records the rate-limit event so the exhausted subscription won't be re-selected meanwhile), orThe manual-assign (
routers/subscriptions.py) and rebuild (agent_service/lifecycle.py) recreate paths should share the same drain guard.Environment
353b7c05/api/task(scheduled + async executions)Related
services/agent_client.pytransport-drop classification (bug: dropped MCP sync connection causes broken pipe that trips circuit breaker #474) — correctly keeps the circuit closed on these errors; not the bug, but explains why no retry kicks in.routers/subscriptions.pyPUT assign restarts the container (documented "restarts container" behavior) — same no-drain recreate.