fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799)#1166
Conversation
…ailures (#799) Concurrent subscription failures on the same agent (two chat requests, or a chat overlapping a scheduled task) both ran handle_subscription_failure's read→decide→assign→restart sequence with no mutual exclusion: both could pick the same alternative, both assign_subscription_to_agent, and both fire _restart_agent — wedging the container, duplicating the switch notification, or tripping the #421 was_already_running ambiguity. - Add a per-agent asyncio.Lock (lazy + event-loop-safe, mirroring agent_call_limiter.py rather than defaultdict(asyncio.Lock)) wrapping the whole switch critical section. - Add a stale-failure guard: snapshot the subscription at handler entry and, under the lock, bail when it changed — so 3+ viable subscriptions don't cascade A→B→C (a loser whose failure was about the now-previous sub). - Tests: concurrent 2-sub (exactly one switch), 3-sub no-cascade regression, distinct-agent locks. Pins the race the #606 carve-out test deferred. Process-local lock is sufficient today (single backend process; the scheduler delegates execution to the backend via /api/internal/execute-task rather than calling this module in its own process). #1037 (drain-aware recreate) is split into its own PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why this is #799-only (decision trail)This started as a combined #799 + #1037 PR per the do-together note. The plan went through
The #1037 redesign seed (the full Codex finding list) is posted on #1037. |
|
Resolve by running |
vybe
left a comment
There was a problem hiding this comment.
✅ Validated & approved (/validate-pr)
P2 reliability bug fix (Closes #799). 2 files, +337/-35.
Correctness:
- Per-agent
asyncio.Lockvia lazy, event-loop-safe creation (agent_switch_lock) — correctly avoids thedefaultdict(asyncio.Lock)cross-loop binding trap; guard check→set has no await between, so it's atomic on the single event loop, andsetdefaultunder the guard resolves the create race. - Stale-failure guard (snapshot at entry → re-read under lock → bail on change) correctly short-circuits BEFORE alternative selection, preventing the A→B→C cascade for 3+ viable subs.
- Process-local invariant explicitly documented with the Redis
SETNXescalation path for multi-worker.
Tests: test_subscription_auto_switch_concurrency.py — deterministic (pre-hold-lock), covers exactly-one-switch, the critical 3-sub no-cascade regression (asserts B never reaches alternative selection), and distinct-agent locks. Snapshot/restore sys.modules fixture keeps blast radius bounded.
- Base →
dev✓ · conventional commit ✓ · CI all green - Security scan clean; no new backend modules / env vars / infra changes
- Bug fix → descriptive commit sufficient (no docs required)
Note: head branch AndriiPasternak31/issue-799 is the base of #1196 (stacked) — merge this first; #1196 auto-retargets to dev.
What
Adds a per-agent lock so concurrent SUB-003 subscription failures on the same agent can't race the auto-switch, plus a stale-failure guard so 3+ viable subscriptions don't cascade. Closes #799.
Problem
services/subscription_auto_switch.py::handle_subscription_failureran an unsynchronized read→decide→assign→restart sequence. Two failures on the same agent (two chat requests, or a chat overlapping a scheduled task — burst-loaded agents) both:assign_subscription_to_agent, and_restart_agent.The second
container_stopracing the firststart_agent_internalwedges the container, fires duplicate switch notifications/activity rows, or trips the #421was_already_runningambiguity.Fix
asyncio.Lock(lazy + event-loop-safe — mirrorsservices/agent_call_limiter.py, notdefaultdict(asyncio.Lock), which binds locks to the wrong loop across pytest's per-test loops) wrapping the whole switch critical section. At most one switch + restart per burst.A→B→Cfor 3+ viable subscriptions. The 2h skip-list masks this for 2 subs only.Process-local is sufficient today: the backend runs a single uvicorn process, and the scheduler delegates execution to the backend via
/api/internal/execute-taskrather than calling this module in its own process. Multi-process escalation path (RedisSETNX) is documented in-code.Tests
tests/unit/test_subscription_auto_switch_concurrency.py:A→B→C; verified to fail without the guard (cascade['B','C']).Regression: existing pingpong + the #606 carve-out test (which enters at
_restart_agent) stay green;tests/lint_sys_modules.pyclean (no new violations).Scope note
This was originally planned to land with #1037 (drain-aware recreate). An engineering review (with a Codex outside-voice pass) found the bundled drain-gate design had multiple correctness holes —
/chatoverflow doesn't actually suspend execution, the 60s maintenance sweep bypasses anacquire()-level gate, the recreate-pending TTL can't span a 7200s execution, recreate-failure drains onto the bad token, and #526's breaker can wipe the queued tasks. So #1037 is split into its own PR built around a slot-acquire-primitive gate; the findings are captured on #1037 as its design seed. This PR is the small, correct, independent half.🤖 Generated with Claude Code