Skip to content

bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799

@AndriiPasternak31

Description

@AndriiPasternak31

Summary

Concurrent failures on the same agent (two chat requests in flight, or a chat request overlapping a scheduled task) both enter handle_subscription_failure in services/subscription_auto_switch.py with no mutual exclusion. The read-decide-write window from db.get_agent_subscription_id (line 83) through _restart_agent return (line 176) is unsynchronized, so both coroutines can pick the same alternative subscription, both call assign_subscription_to_agent, and both fire _restart_agent. The second container_stop racing the first start_agent_internal can produce a wedged container, duplicate switch notifications, or trip the #421 was_already_running ambiguity.

Component

Backend / Subscription Auto-Switch (SUB-003)

Priority

P2 — real but narrow blast radius. Requires concurrent subscription failures on the same agent within seconds of each other, which only burst-loaded agents see in practice.

Reachable from

  • src/backend/routers/chat.py:475 (429 path) and :504 (auth-class path) — concurrent chat requests for the same agent
  • src/backend/services/task_execution_service.py:623 and :629 — concurrent scheduled tasks for the same agent

Concrete failure scenario

Burst on one agent:

  1. Two coroutines (A, B) both hit the agent and both get a 429.
  2. Both enter handle_subscription_failure with current_sub_id = sub-A.
  3. Both call db.select_best_alternative_subscription(sub-A) and pick sub-B.
  4. Both call db.assign_subscription_to_agent(agent, sub-B).
  5. Both call _restart_agentcontainer_stop(container) then start_agent_internal(agent_name).
  6. B's container_stop lands while A's start_agent_internal is mid-flight — the container ends up in one of:
    • wedged (stopped while starting),
    • duplicate restart with two switch notifications and two activity rows,
    • second start_agent_internal sees a freshly-recreated container as running and trips the #421 was_already_running skip incorrectly.

Suggested fix

Wrap handle_subscription_failure in a per-agent_name asyncio.Lock from a module-level dict[str, asyncio.Lock]. The lock must span from db.get_agent_subscription_id (line 83) through _restart_agent return (line 176) — the full read-decide-write window.

# services/subscription_auto_switch.py
import asyncio
from collections import defaultdict

_agent_switch_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def handle_subscription_failure(agent_name: str, ...) -> Optional[dict]:
    async with _agent_switch_locks[agent_name]:
        # existing body unchanged
        ...

Process-local is sufficient today (single-process backend). If/when the backend goes multi-process, escalate to a Redis SETNX lock keyed auto_switch:{agent_name} with TTL ≥ the longest plausible container start (currently ~60s).

Test coverage

Add a concurrency test that fires two handle_subscription_failure coroutines for the same agent under asyncio.gather and asserts:

  1. Exactly ONE assign_subscription_to_agent call observed.
  2. Exactly ONE _restart_agent call observed.
  3. Exactly ONE switch notification created.

The second caller should see the post-switch state (current_sub_id == sub-B) and either no-op or pick a different alternative — not double-switch.

Existing reference

The carve-out is documented at tests/unit/test_subscription_auto_switch_no_cred_import.py:36-39 ("Explicit non-claim" docstring) — that test pins the credential-import short-circuit (#606) but explicitly does NOT cover this race.

Related

  • Feature: Subscription Auto-Switch (SUB-003) — original #153 (closed)
  • Bug: #606 (this issue's parent context — credential-import short-circuit; in-progress on branch AndriiPasternak31/issue-606-plan)
  • Related: #792 (retry triggering execution after successful auto-switch) — different concern, same surface
  • Backend service: services/subscription_auto_switch.py
  • Backend service: services/agent_service/lifecycle.py

Metadata

Metadata

Labels

complexity-mediumComplexity: medium (board points 5-8)priority-p2Importantstatus-in-devMerged to dev, awaiting release cut to mainstatus-readyGreenlit and ready for development (vetted; counterpart to status-incubating)theme-reliabilityTheme: Reliabilitytype-bugBug fix

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions