bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError)

## Summary

When SUB-003 auto-switches an agent's subscription (on a 429 rate-limit or auth failure), it recreates the agent container to inject the new `CLAUDE_CODE_OAUTH_TOKEN` — but it stops the container **without draining in-flight executions**. Any other execution running in that agent's parallel slots is killed mid-flight and recorded as a failure (`HTTP error: RemoteProtocolError` / `ReadError`). Long-running scheduled tasks (research agents, multi-minute/30-min runs) are silently lost and the spend wasted. The same no-drain recreate path is reachable from manual subscription assignment and agent rebuild/deploy.

## Component

Backend / Subscription auto-switch (SUB-003) — `services/subscription_auto_switch.py`

## Priority

P1 — a core feature (auto-switch) destroys in-flight work from another core feature (parallel scheduled execution), with no in-product workaround. Recurs automatically whenever a shared subscription rate-limits while agents have long executions running.

## Error

Recorded on the killed execution:
```
schedule_executions.error = "HTTP error: RemoteProtocolError"   (or "HTTP error: ReadError")
```

Backend log sequence (timestamps from one occurrence; the stop precedes the recreate-complete by ~3s):
```
12:12:43  [TaskExecService] Failed to execute task on [AGENT]: HTTP error: RemoteProtocolError   <- in-flight exec killed
12:12:43  [Slots] Agent '[AGENT]' released slot for execution [EXEC_ID] ... FAILED
12:12:46  Recreated container for agent [AGENT] with updated configuration                       <- recreate completes
12:12:57  [SUB-003] Auto-switch complete: {'switched': True, 'agent_name': '[AGENT]',
          'failure_kind': 'rate_limit', 'restart_result': 'success'}
```
The connection is severed because the agent's HTTP server is stopped while the backend is still awaiting the `/api/task` response — httpx surfaces `RemoteProtocolError` (peer closed mid-response) or `ReadError` (connection dropped during read).

## Location

- **File**: `src/backend/services/subscription_auto_switch.py`
- **Function**: `_restart_agent()` (line ~266), called from `_perform_auto_switch()` (line ~203)

The offending sequence:
```python
# _restart_agent(), no check for running executions:
await container_stop(container)        # kills every in-flight execution in the agent's slots
await start_agent_internal(agent_name)
```
`container_stop` is gated only on `agent_status.status == "running"` (container is up), **not** on whether the agent is idle.

Same no-drain recreate is also reachable via:
- Manual assign/unassign — `src/backend/routers/subscriptions.py:252`
- Agent rebuild / base-image update (`recreate_container_with_updated_config`, `src/backend/services/agent_service/lifecycle.py`)

## Root Cause

Subscription token changes require a container recreate (the token is injected as a build-time env var). The recreate stops the container immediately, with no grace for executions running in the agent's other parallel slots. Because auto-switch is triggered by a 429 on *one* task, any *other* long-running task on the same agent becomes collateral damage.

**Aggravating factor:** when many agents share one subscription, that subscription hitting its model rate limit produces a burst of 429s → simultaneous auto-switch + recreate across all of them → many in-flight executions killed at once. Observed: 9 auto-switches in one afternoon (8 off a single shared subscription) → 8 transport-drop execution failures the same day.

Note: the backend correctly classifies `RemoteProtocolError`/`ReadError` as non-circuit-failures (`services/agent_client.py`, #474), so this does not (and should not) open the circuit breaker — but there is no retry/drain compensating, so the execution just dies terminal-FAILED.

## Reproduction Steps

Deterministic (no real rate-limit needed — exercises the same recreate path):
1. Pick an agent `A` with a subscription assigned and `max_parallel_tasks >= 1`.
2. Start a long execution: `POST /api/agents/A/task` with a multi-minute prompt. Confirm `status=running` (slot acquired) via `GET /api/agents/A/executions`.
3. While it runs, recreate the container via the path SUB-003 uses: `PUT /api/subscriptions/agents/A?subscription_name=<other-subscription>`.
4. **Observe:** backend logs `Recreated container for agent A`; the running execution dies with `schedule_executions.error = "HTTP error: RemoteProtocolError"` and never completes.

End-to-end (auto-switch path): two agents share one near-limit subscription with auto-switch ON, each `max_parallel_tasks >= 2`. Start a long task on `A`; start a second task on `A` that hits the 429 → SUB-003 auto-switches `A` → the long task from step 1 dies with `RemoteProtocolError`.

## Suggested Fix

Drain (or refuse to stop) before recreating on a token change:
- Before `container_stop` in `_restart_agent()`, check the agent's running-execution / slot count; if non-zero, **defer the recreate** until the agent is idle (the new token only needs to apply to *future* executions, and the switch already records the rate-limit event so the exhausted subscription won't be re-selected meanwhile), or
- Apply a bounded graceful drain (wait up to N seconds / until slots free), or
- Inject the switched token without a full container recreate if feasible.

```python
# Sketch in _restart_agent():
running = get_running_execution_count(agent_name)   # slot/registry query
if running > 0:
    # defer: token applies to next execution; don't kill in-flight work
    schedule_recreate_when_idle(agent_name)
    return "deferred_running"
await container_stop(container)
await start_agent_internal(agent_name)
```

The manual-assign (`routers/subscriptions.py`) and rebuild (`agent_service/lifecycle.py`) recreate paths should share the same drain guard.

## Environment

- Trinity commit: `353b7c05`
- Backend: FastAPI + uvicorn 0.37.0; agents run their own HTTP server on the internal Docker network
- Execution path affected: `/api/task` (scheduled + async executions)

## Related

- `services/agent_client.py` transport-drop classification (#474) — correctly keeps the circuit closed on these errors; not the bug, but explains why no retry kicks in.
- `routers/subscriptions.py` PUT assign restarts the container (documented "restarts container" behavior) — same no-drain recreate.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037

Summary

Component

Priority

Error

Location

Root Cause

Reproduction Steps

Suggested Fix

Environment

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037

Description

Summary

Component

Priority

Error

Location

Root Cause

Reproduction Steps

Suggested Fix

Environment

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions