Summary
Concurrent failures on the same agent (two chat requests in flight, or a chat request overlapping a scheduled task) both enter handle_subscription_failure in services/subscription_auto_switch.py with no mutual exclusion. The read-decide-write window from db.get_agent_subscription_id (line 83) through _restart_agent return (line 176) is unsynchronized, so both coroutines can pick the same alternative subscription, both call assign_subscription_to_agent, and both fire _restart_agent. The second container_stop racing the first start_agent_internal can produce a wedged container, duplicate switch notifications, or trip the #421 was_already_running ambiguity.
Component
Backend / Subscription Auto-Switch (SUB-003)
Priority
P2 — real but narrow blast radius. Requires concurrent subscription failures on the same agent within seconds of each other, which only burst-loaded agents see in practice.
Reachable from
src/backend/routers/chat.py:475 (429 path) and :504 (auth-class path) — concurrent chat requests for the same agent
src/backend/services/task_execution_service.py:623 and :629 — concurrent scheduled tasks for the same agent
Concrete failure scenario
Burst on one agent:
- Two coroutines (A, B) both hit the agent and both get a 429.
- Both enter
handle_subscription_failure with current_sub_id = sub-A.
- Both call
db.select_best_alternative_subscription(sub-A) and pick sub-B.
- Both call
db.assign_subscription_to_agent(agent, sub-B).
- Both call
_restart_agent → container_stop(container) then start_agent_internal(agent_name).
- B's
container_stop lands while A's start_agent_internal is mid-flight — the container ends up in one of:
- wedged (stopped while starting),
- duplicate restart with two switch notifications and two activity rows,
- second
start_agent_internal sees a freshly-recreated container as running and trips the #421 was_already_running skip incorrectly.
Suggested fix
Wrap handle_subscription_failure in a per-agent_name asyncio.Lock from a module-level dict[str, asyncio.Lock]. The lock must span from db.get_agent_subscription_id (line 83) through _restart_agent return (line 176) — the full read-decide-write window.
# services/subscription_auto_switch.py
import asyncio
from collections import defaultdict
_agent_switch_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
async def handle_subscription_failure(agent_name: str, ...) -> Optional[dict]:
async with _agent_switch_locks[agent_name]:
# existing body unchanged
...
Process-local is sufficient today (single-process backend). If/when the backend goes multi-process, escalate to a Redis SETNX lock keyed auto_switch:{agent_name} with TTL ≥ the longest plausible container start (currently ~60s).
Test coverage
Add a concurrency test that fires two handle_subscription_failure coroutines for the same agent under asyncio.gather and asserts:
- Exactly ONE
assign_subscription_to_agent call observed.
- Exactly ONE
_restart_agent call observed.
- Exactly ONE switch notification created.
The second caller should see the post-switch state (current_sub_id == sub-B) and either no-op or pick a different alternative — not double-switch.
Existing reference
The carve-out is documented at tests/unit/test_subscription_auto_switch_no_cred_import.py:36-39 ("Explicit non-claim" docstring) — that test pins the credential-import short-circuit (#606) but explicitly does NOT cover this race.
Related
- Feature: Subscription Auto-Switch (SUB-003) — original
#153 (closed)
- Bug:
#606 (this issue's parent context — credential-import short-circuit; in-progress on branch AndriiPasternak31/issue-606-plan)
- Related:
#792 (retry triggering execution after successful auto-switch) — different concern, same surface
- Backend service:
services/subscription_auto_switch.py
- Backend service:
services/agent_service/lifecycle.py
Summary
Concurrent failures on the same agent (two chat requests in flight, or a chat request overlapping a scheduled task) both enter
handle_subscription_failureinservices/subscription_auto_switch.pywith no mutual exclusion. The read-decide-write window fromdb.get_agent_subscription_id(line 83) through_restart_agentreturn (line 176) is unsynchronized, so both coroutines can pick the same alternative subscription, both callassign_subscription_to_agent, and both fire_restart_agent. The secondcontainer_stopracing the firststart_agent_internalcan produce a wedged container, duplicate switch notifications, or trip the#421was_already_runningambiguity.Component
Backend / Subscription Auto-Switch (SUB-003)
Priority
P2 — real but narrow blast radius. Requires concurrent subscription failures on the same agent within seconds of each other, which only burst-loaded agents see in practice.
Reachable from
src/backend/routers/chat.py:475(429 path) and:504(auth-class path) — concurrent chat requests for the same agentsrc/backend/services/task_execution_service.py:623and:629— concurrent scheduled tasks for the same agentConcrete failure scenario
Burst on one agent:
handle_subscription_failurewithcurrent_sub_id = sub-A.db.select_best_alternative_subscription(sub-A)and picksub-B.db.assign_subscription_to_agent(agent, sub-B)._restart_agent→container_stop(container)thenstart_agent_internal(agent_name).container_stoplands while A'sstart_agent_internalis mid-flight — the container ends up in one of:start_agent_internalsees a freshly-recreated container asrunningand trips the#421was_already_runningskip incorrectly.Suggested fix
Wrap
handle_subscription_failurein a per-agent_nameasyncio.Lockfrom a module-leveldict[str, asyncio.Lock]. The lock must span fromdb.get_agent_subscription_id(line 83) through_restart_agentreturn (line 176) — the full read-decide-write window.Process-local is sufficient today (single-process backend). If/when the backend goes multi-process, escalate to a Redis
SETNXlock keyedauto_switch:{agent_name}with TTL ≥ the longest plausible container start (currently ~60s).Test coverage
Add a concurrency test that fires two
handle_subscription_failurecoroutines for the same agent underasyncio.gatherand asserts:assign_subscription_to_agentcall observed._restart_agentcall observed.The second caller should see the post-switch state (
current_sub_id == sub-B) and either no-op or pick a different alternative — not double-switch.Existing reference
The carve-out is documented at
tests/unit/test_subscription_auto_switch_no_cred_import.py:36-39("Explicit non-claim" docstring) — that test pins the credential-import short-circuit (#606) but explicitly does NOT cover this race.Related
#153(closed)#606(this issue's parent context — credential-import short-circuit; in-progress on branchAndriiPasternak31/issue-606-plan)#792(retry triggering execution after successful auto-switch) — different concern, same surfaceservices/subscription_auto_switch.pyservices/agent_service/lifecycle.py