Skip to content

fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799)#1166

Merged
vybe merged 1 commit into
devfrom
AndriiPasternak31/issue-799
Jun 16, 2026
Merged

fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799)#1166
vybe merged 1 commit into
devfrom
AndriiPasternak31/issue-799

Conversation

@AndriiPasternak31

Copy link
Copy Markdown
Contributor

What

Adds a per-agent lock so concurrent SUB-003 subscription failures on the same agent can't race the auto-switch, plus a stale-failure guard so 3+ viable subscriptions don't cascade. Closes #799.

Problem

services/subscription_auto_switch.py::handle_subscription_failure ran an unsynchronized read→decide→assign→restart sequence. Two failures on the same agent (two chat requests, or a chat overlapping a scheduled task — burst-loaded agents) both:

  • read the same current subscription,
  • pick the same alternative,
  • assign_subscription_to_agent, and
  • fire _restart_agent.

The second container_stop racing the first start_agent_internal wedges the container, fires duplicate switch notifications/activity rows, or trips the #421 was_already_running ambiguity.

Fix

  • Per-agent asyncio.Lock (lazy + event-loop-safe — mirrors services/agent_call_limiter.py, not defaultdict(asyncio.Lock), which binds locks to the wrong loop across pytest's per-test loops) wrapping the whole switch critical section. At most one switch + restart per burst.
  • Stale-failure guard (Codex C8): snapshot the subscription at handler entry; under the lock, bail if it changed. Without this, a loser whose failure was about sub-A would attribute it to the new current sub-B and cascade A→B→C for 3+ viable subscriptions. The 2h skip-list masks this for 2 subs only.

Process-local is sufficient today: the backend runs a single uvicorn process, and the scheduler delegates execution to the backend via /api/internal/execute-task rather than calling this module in its own process. Multi-process escalation path (Redis SETNX) is documented in-code.

Tests

tests/unit/test_subscription_auto_switch_concurrency.py:

  • 2-sub concurrent burst → exactly one switch, loser no-ops.
  • 3-sub no-cascade regression (the critical one) → asserts no A→B→C; verified to fail without the guard (cascade ['B','C']).
  • distinct-agent locks don't serialize.

Regression: existing pingpong + the #606 carve-out test (which enters at _restart_agent) stay green; tests/lint_sys_modules.py clean (no new violations).

Scope note

This was originally planned to land with #1037 (drain-aware recreate). An engineering review (with a Codex outside-voice pass) found the bundled drain-gate design had multiple correctness holes — /chat overflow doesn't actually suspend execution, the 60s maintenance sweep bypasses an acquire()-level gate, the recreate-pending TTL can't span a 7200s execution, recreate-failure drains onto the bad token, and #526's breaker can wipe the queued tasks. So #1037 is split into its own PR built around a slot-acquire-primitive gate; the findings are captured on #1037 as its design seed. This PR is the small, correct, independent half.

🤖 Generated with Claude Code

…ailures (#799)

Concurrent subscription failures on the same agent (two chat requests, or a
chat overlapping a scheduled task) both ran handle_subscription_failure's
read→decide→assign→restart sequence with no mutual exclusion: both could pick
the same alternative, both assign_subscription_to_agent, and both fire
_restart_agent — wedging the container, duplicating the switch notification, or
tripping the #421 was_already_running ambiguity.

- Add a per-agent asyncio.Lock (lazy + event-loop-safe, mirroring
  agent_call_limiter.py rather than defaultdict(asyncio.Lock)) wrapping the
  whole switch critical section.
- Add a stale-failure guard: snapshot the subscription at handler entry and,
  under the lock, bail when it changed — so 3+ viable subscriptions don't
  cascade A→B→C (a loser whose failure was about the now-previous sub).
- Tests: concurrent 2-sub (exactly one switch), 3-sub no-cascade regression,
  distinct-agent locks. Pins the race the #606 carve-out test deferred.

Process-local lock is sufficient today (single backend process; the scheduler
delegates execution to the backend via /api/internal/execute-task rather than
calling this module in its own process). #1037 (drain-aware recreate) is split
into its own PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@AndriiPasternak31

Copy link
Copy Markdown
Contributor Author

Why this is #799-only (decision trail)

This started as a combined #799 + #1037 PR per the do-together note. The plan went through /plan-eng-review + a Codex outside-voice pass, which reshaped the scope:

  1. D1 — for bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037 I first chose an admission gate in CapacityManager.acquire() over a passive recreate-on-idle (a prior cross-model learning flagged the passive approach as best-effort: new admissions run on the stale token, busy agents starve the recreate).
  2. Codex outside voice — proved the acquire()-level gate still leaks: /chat queue_in_memory is observability-only and doesn't suspend execution; the 60s drain_orphans_all()BacklogService.drain_next() acquires SlotService directly, bypassing the gate; the feat: per-agent dispatch circuit breaker to prevent backlog poisoning (RELIABILITY-007) #526 half-open probe bypasses it; the recreate-pending TTL can't span a 7200s execution; recreate-failure drains onto the bad token; feat: per-agent dispatch circuit breaker to prevent backlog poisoning (RELIABILITY-007) #526's breaker can wipe deliberately-queued rows. It also caught that the bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 lock alone double-switches A→B→C for 3+ subs (→ the stale-failure guard in this PR).
  3. D3bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037-done-right belongs in the slot-acquire primitive (the one chokepoint all admission paths funnel through) and is its own focused PR. Ship bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 standalone now.

The #1037 redesign seed (the full Codex finding list) is posted on #1037.

@github-actions

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

@vybe vybe left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated & approved (/validate-pr)

P2 reliability bug fix (Closes #799). 2 files, +337/-35.

Correctness:

  • Per-agent asyncio.Lock via lazy, event-loop-safe creation (agent_switch_lock) — correctly avoids the defaultdict(asyncio.Lock) cross-loop binding trap; guard check→set has no await between, so it's atomic on the single event loop, and setdefault under the guard resolves the create race.
  • Stale-failure guard (snapshot at entry → re-read under lock → bail on change) correctly short-circuits BEFORE alternative selection, preventing the A→B→C cascade for 3+ viable subs.
  • Process-local invariant explicitly documented with the Redis SETNX escalation path for multi-worker.

Tests: test_subscription_auto_switch_concurrency.py — deterministic (pre-hold-lock), covers exactly-one-switch, the critical 3-sub no-cascade regression (asserts B never reaches alternative selection), and distinct-agent locks. Snapshot/restore sys.modules fixture keeps blast radius bounded.

  • Base → dev ✓ · conventional commit ✓ · CI all green
  • Security scan clean; no new backend modules / env vars / infra changes
  • Bug fix → descriptive commit sufficient (no docs required)

Note: head branch AndriiPasternak31/issue-799 is the base of #1196 (stacked) — merge this first; #1196 auto-retargets to dev.

@vybe vybe merged commit 4274f9f into dev Jun 16, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants