bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart

## Summary

Concurrent failures on the same agent (two chat requests in flight, or a chat request overlapping a scheduled task) both enter `handle_subscription_failure` in `services/subscription_auto_switch.py` with no mutual exclusion. The read-decide-write window from `db.get_agent_subscription_id` (line 83) through `_restart_agent` return (line 176) is unsynchronized, so both coroutines can pick the same alternative subscription, both call `assign_subscription_to_agent`, and both fire `_restart_agent`. The second `container_stop` racing the first `start_agent_internal` can produce a wedged container, duplicate switch notifications, or trip the `#421` `was_already_running` ambiguity.

## Component

Backend / Subscription Auto-Switch (SUB-003)

## Priority

P2 — real but narrow blast radius. Requires concurrent subscription failures on the same agent within seconds of each other, which only burst-loaded agents see in practice.

## Reachable from

- `src/backend/routers/chat.py:475` (429 path) and `:504` (auth-class path) — concurrent chat requests for the same agent
- `src/backend/services/task_execution_service.py:623` and `:629` — concurrent scheduled tasks for the same agent

## Concrete failure scenario

Burst on one agent:

1. Two coroutines (A, B) both hit the agent and both get a 429.
2. Both enter `handle_subscription_failure` with `current_sub_id = sub-A`.
3. Both call `db.select_best_alternative_subscription(sub-A)` and pick `sub-B`.
4. Both call `db.assign_subscription_to_agent(agent, sub-B)`.
5. Both call `_restart_agent` → `container_stop(container)` then `start_agent_internal(agent_name)`.
6. B's `container_stop` lands while A's `start_agent_internal` is mid-flight — the container ends up in one of:
   - wedged (stopped while starting),
   - duplicate restart with two switch notifications and two activity rows,
   - second `start_agent_internal` sees a freshly-recreated container as `running` and trips the `#421` `was_already_running` skip incorrectly.

## Suggested fix

Wrap `handle_subscription_failure` in a per-`agent_name` `asyncio.Lock` from a module-level `dict[str, asyncio.Lock]`. The lock must span from `db.get_agent_subscription_id` (line 83) through `_restart_agent` return (line 176) — the full read-decide-write window.

```python
# services/subscription_auto_switch.py
import asyncio
from collections import defaultdict

_agent_switch_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def handle_subscription_failure(agent_name: str, ...) -> Optional[dict]:
    async with _agent_switch_locks[agent_name]:
        # existing body unchanged
        ...
```

Process-local is sufficient today (single-process backend). If/when the backend goes multi-process, escalate to a Redis `SETNX` lock keyed `auto_switch:{agent_name}` with TTL ≥ the longest plausible container start (currently ~60s).

## Test coverage

Add a concurrency test that fires two `handle_subscription_failure` coroutines for the same agent under `asyncio.gather` and asserts:

1. Exactly ONE `assign_subscription_to_agent` call observed.
2. Exactly ONE `_restart_agent` call observed.
3. Exactly ONE switch notification created.

The second caller should see the post-switch state (`current_sub_id == sub-B`) and either no-op or pick a different alternative — not double-switch.

## Existing reference

The carve-out is documented at `tests/unit/test_subscription_auto_switch_no_cred_import.py:36-39` ("Explicit non-claim" docstring) — that test pins the credential-import short-circuit (`#606`) but explicitly does NOT cover this race.

## Related

- Feature: Subscription Auto-Switch (SUB-003) — original `#153` (closed)
- Bug: `#606` (this issue's parent context — credential-import short-circuit; in-progress on branch `AndriiPasternak31/issue-606-plan`)
- Related: `#792` (retry triggering execution after successful auto-switch) — different concern, same surface
- Backend service: `services/subscription_auto_switch.py`
- Backend service: `services/agent_service/lifecycle.py`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799

Summary

Component

Priority

Reachable from

Concrete failure scenario

Suggested fix

Test coverage

Existing reference

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799

Description

Summary

Component

Priority

Reachable from

Concrete failure scenario

Suggested fix

Test coverage

Existing reference

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions