bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037

@obasilakis

Summary

When SUB-003 auto-switches an agent's subscription (on a 429 rate-limit or auth failure), it recreates the agent container to inject the new CLAUDE_CODE_OAUTH_TOKEN — but it stops the container without draining in-flight executions. Any other execution running in that agent's parallel slots is killed mid-flight and recorded as a failure (HTTP error: RemoteProtocolError / ReadError). Long-running scheduled tasks (research agents, multi-minute/30-min runs) are silently lost and the spend wasted. The same no-drain recreate path is reachable from manual subscription assignment and agent rebuild/deploy.

Component

Backend / Subscription auto-switch (SUB-003) — services/subscription_auto_switch.py

Priority

P1 — a core feature (auto-switch) destroys in-flight work from another core feature (parallel scheduled execution), with no in-product workaround. Recurs automatically whenever a shared subscription rate-limits while agents have long executions running.

Error

Recorded on the killed execution:

schedule_executions.error = "HTTP error: RemoteProtocolError"   (or "HTTP error: ReadError")

Backend log sequence (timestamps from one occurrence; the stop precedes the recreate-complete by ~3s):

12:12:43  [TaskExecService] Failed to execute task on [AGENT]: HTTP error: RemoteProtocolError   <- in-flight exec killed
12:12:43  [Slots] Agent '[AGENT]' released slot for execution [EXEC_ID] ... FAILED
12:12:46  Recreated container for agent [AGENT] with updated configuration                       <- recreate completes
12:12:57  [SUB-003] Auto-switch complete: {'switched': True, 'agent_name': '[AGENT]',
          'failure_kind': 'rate_limit', 'restart_result': 'success'}

The connection is severed because the agent's HTTP server is stopped while the backend is still awaiting the /api/task response — httpx surfaces RemoteProtocolError (peer closed mid-response) or ReadError (connection dropped during read).

Location

  • File: src/backend/services/subscription_auto_switch.py
  • Function: _restart_agent() (line ~266), called from _perform_auto_switch() (line ~203)

The offending sequence:

# _restart_agent(), no check for running executions:
await container_stop(container)        # kills every in-flight execution in the agent's slots
await start_agent_internal(agent_name)

container_stop is gated only on agent_status.status == "running" (container is up), not on whether the agent is idle.

Same no-drain recreate is also reachable via:

  • Manual assign/unassign — src/backend/routers/subscriptions.py:252
  • Agent rebuild / base-image update (recreate_container_with_updated_config, src/backend/services/agent_service/lifecycle.py)

Root Cause

Subscription token changes require a container recreate (the token is injected as an env var when the container is created, so a running container cannot pick up a new one). The recreate stops the container immediately, with no grace period for executions running in the agent's other parallel slots. Because auto-switch is triggered by a 429 on one task, any other long-running task on the same agent becomes collateral damage.

Aggravating factor: when many agents share one subscription, that subscription hitting its model rate limit produces a burst of 429s → simultaneous auto-switch + recreate across all of them → many in-flight executions killed at once. Observed: 9 auto-switches in one afternoon (8 off a single shared subscription) → 8 transport-drop execution failures the same day.

Note: the backend correctly classifies RemoteProtocolError/ReadError as non-circuit-failures (services/agent_client.py, #474), so this does not (and should not) open the circuit breaker — but there is no retry/drain compensating, so the execution just dies terminal-FAILED.
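
To make the classification above concrete, here is a minimal sketch of separating "peer vanished mid-request" errors from real agent failures. It is not the code in services/agent_client.py; the function names and the match on exception class name are assumptions, chosen so the example runs without httpx installed.

```python
# Hypothetical sketch: names below are illustrative, not taken from the codebase.
TRANSPORT_DROP_ERRORS = frozenset({"RemoteProtocolError", "ReadError"})


def is_transport_drop(exc: BaseException) -> bool:
    """True when the error looks like the agent's HTTP server went away mid-request."""
    return type(exc).__name__ in TRANSPORT_DROP_ERRORS


def counts_toward_circuit(exc: BaseException) -> bool:
    """Transport drops say nothing about agent health, so they do not trip the breaker."""
    return not is_transport_drop(exc)


def killed_by_recreate(exc: BaseException, recreate_in_progress: bool) -> bool:
    """Flag executions that most likely died because their container was recreated."""
    return is_transport_drop(exc) and recreate_in_progress
```

Flagging these executions separately would at least let them be reported (or requeued) as recreate casualties instead of plain failures; whether a requeue is safe depends on the task being idempotent.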

Reproduction Steps

Deterministic (no real rate-limit needed — exercises the same recreate path):

  1. Pick an agent A with a subscription assigned and max_parallel_tasks >= 1.
  2. Start a long execution: POST /api/agents/A/task with a multi-minute prompt. Confirm status=running (slot acquired) via GET /api/agents/A/executions.
  3. While it runs, recreate the container via the path SUB-003 uses: PUT /api/subscriptions/agents/A?subscription_name=<other-subscription>.
  4. Observe: backend logs Recreated container for agent A; the running execution dies with schedule_executions.error = "HTTP error: RemoteProtocolError" and never completes.

End-to-end (auto-switch path): two agents share one near-limit subscription with auto-switch ON, each with max_parallel_tasks >= 2. Start a long task on A; then start a second task on A that hits the 429 → SUB-003 auto-switches A → the first (long) task dies with RemoteProtocolError.
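
The deterministic steps above could be scripted roughly as follows. This is a sketch only: the three endpoints are the ones listed in the steps, but the base URL, the task payload shape, and any auth headers are assumptions the caller has to supply.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_repro_requests(base_url, agent, other_subscription, task_payload):
    """Build (but do not send) the three requests from the deterministic repro.

    task_payload is caller-supplied because the /task body schema is not
    shown in this issue; authentication is likewise left out.
    """
    base = base_url.rstrip("/")
    name = quote(agent, safe="")
    # Step 2a: start the long-running execution.
    start_task = Request(
        f"{base}/api/agents/{name}/task",
        data=json.dumps(task_payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Step 2b: confirm the execution is running (slot acquired).
    check_running = Request(f"{base}/api/agents/{name}/executions", method="GET")
    # Step 3: reassign the subscription, which triggers the container recreate.
    query = urlencode({"subscription_name": other_subscription})
    reassign = Request(f"{base}/api/subscriptions/agents/{name}?{query}", method="PUT")
    return [start_task, check_running, reassign]
```

Each request can then be sent with `urllib.request.urlopen`. If the task endpoint blocks until the execution finishes, the first request has to be sent from a separate thread or process so the reassign can land while it is still running.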

Suggested Fix

Drain (or refuse to stop) before recreating on a token change:

  • Before container_stop in _restart_agent(), check the agent's running-execution / slot count; if non-zero, defer the recreate until the agent is idle (the new token only needs to apply to future executions, and the switch already records the rate-limit event so the exhausted subscription won't be re-selected meanwhile), or
  • Apply a bounded graceful drain (wait up to N seconds / until slots free), or
  • Inject the switched token without a full container recreate if feasible.

# Sketch in _restart_agent():
running = get_running_execution_count(agent_name)   # slot/registry query
if running > 0:
    # defer: token applies to next execution; don't kill in-flight work
    schedule_recreate_when_idle(agent_name)
    return "deferred_running"
await container_stop(container)
await start_agent_internal(agent_name)
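
The bounded-drain option could look something like the sketch below. The helper name, its parameters, and the default timeout are assumptions; `running_count` and `recreate` stand in for the slot query and the stop/start pair above.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def drain_then_recreate(
    running_count: Callable[[], int],
    recreate: Callable[[], Awaitable[None]],
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
) -> str:
    """Wait for in-flight executions to finish, then recreate the container.

    Returns "success" if the agent went idle and was recreated, or
    "deferred_running" if executions were still running at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while running_count() > 0:
        if time.monotonic() >= deadline:
            # Give up without stopping the container; the caller decides what next.
            return "deferred_running"
        await asyncio.sleep(poll_s)
    await recreate()
    return "success"
```

A polling drain like this still has a gap between the last idle check and the stop: a real implementation would likely also need to stop admitting new executions to the agent while the drain is in progress.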

The manual-assign (routers/subscriptions.py) and rebuild (agent_service/lifecycle.py) recreate paths should share the same drain guard.
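
One possible shape for a shared guard, assuming all three paths can route their recreate through a single object and that the slot manager can notify it on acquire/release. The class and method names are hypothetical; it only illustrates the "recreate when idle" behaviour.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Set


class RecreateGuard:
    """Defers a container recreate until the agent has no running executions."""

    def __init__(self, recreate: Callable[[str], Awaitable[None]]) -> None:
        self._recreate = recreate           # e.g. wraps stop + start for one agent
        self._running: Dict[str, int] = {}  # agent name -> running execution count
        self._pending: Set[str] = set()     # agents with a recreate waiting

    def slot_acquired(self, agent: str) -> None:
        self._running[agent] = self._running.get(agent, 0) + 1

    async def slot_released(self, agent: str) -> None:
        self._running[agent] = max(0, self._running.get(agent, 0) - 1)
        # Last execution finished: run the recreate that was waiting, if any.
        if self._running[agent] == 0 and agent in self._pending:
            self._pending.discard(agent)
            await self._recreate(agent)

    async def request_recreate(self, agent: str) -> str:
        if self._running.get(agent, 0) > 0:
            self._pending.add(agent)        # repeated requests coalesce into one
            return "deferred_running"
        await self._recreate(agent)
        return "success"
```

Because pending requests are kept in a set, a burst of auto-switches for the same agent collapses into a single recreate once it goes idle. This sketch keeps state in memory only, so a pending recreate would be lost on a backend restart.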

Environment

  • Trinity commit: 353b7c05
  • Backend: FastAPI + uvicorn 0.37.0; agents run their own HTTP server on the internal Docker network
  • Execution path affected: /api/task (scheduled + async executions)

Related
