fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799) by AndriiPasternak31 · Pull Request #1166 · Abilityai/trinity

AndriiPasternak31 · 2026-06-11T15:22:29Z

What

Adds a per-agent lock so concurrent SUB-003 subscription failures on the same agent can't race the auto-switch, plus a stale-failure guard so 3+ viable subscriptions don't cascade. Closes #799.

Problem

services/subscription_auto_switch.py::handle_subscription_failure ran an unsynchronized read→decide→assign→restart sequence. Two failures on the same agent (two chat requests, or a chat overlapping a scheduled task — burst-loaded agents) both:

read the same current subscription,
pick the same alternative,
assign_subscription_to_agent, and
fire _restart_agent.

The second container_stop racing the first start_agent_internal wedges the container, fires duplicate switch notifications/activity rows, or trips the #421 was_already_running ambiguity.

Fix

Per-agent asyncio.Lock (lazy + event-loop-safe — mirrors services/agent_call_limiter.py, not defaultdict(asyncio.Lock), which binds locks to the wrong loop across pytest's per-test loops) wrapping the whole switch critical section. At most one switch + restart per burst.
Stale-failure guard (Codex C8): snapshot the subscription at handler entry; under the lock, bail if it changed. Without this, a loser whose failure was about sub-A would attribute it to the new current sub-B and cascade A→B→C for 3+ viable subscriptions. The 2h skip-list masks this for 2 subs only.

Process-local is sufficient today: the backend runs a single uvicorn process, and the scheduler delegates execution to the backend via /api/internal/execute-task rather than calling this module in its own process. Multi-process escalation path (Redis SETNX) is documented in-code.

Tests

tests/unit/test_subscription_auto_switch_concurrency.py:

2-sub concurrent burst → exactly one switch, loser no-ops.
3-sub no-cascade regression (the critical one) → asserts no A→B→C; verified to fail without the guard (cascade ['B','C']).
distinct-agent locks don't serialize.

Regression: existing pingpong + the #606 carve-out test (which enters at _restart_agent) stay green; tests/lint_sys_modules.py clean (no new violations).

Scope note

This was originally planned to land with #1037 (drain-aware recreate). An engineering review (with a Codex outside-voice pass) found the bundled drain-gate design had multiple correctness holes — /chat overflow doesn't actually suspend execution, the 60s maintenance sweep bypasses an acquire()-level gate, the recreate-pending TTL can't span a 7200s execution, recreate-failure drains onto the bad token, and #526's breaker can wipe the queued tasks. So #1037 is split into its own PR built around a slot-acquire-primitive gate; the findings are captured on #1037 as its design seed. This PR is the small, correct, independent half.

🤖 Generated with Claude Code

…ailures (#799) Concurrent subscription failures on the same agent (two chat requests, or a chat overlapping a scheduled task) both ran handle_subscription_failure's read→decide→assign→restart sequence with no mutual exclusion: both could pick the same alternative, both assign_subscription_to_agent, and both fire _restart_agent — wedging the container, duplicating the switch notification, or tripping the #421 was_already_running ambiguity. - Add a per-agent asyncio.Lock (lazy + event-loop-safe, mirroring agent_call_limiter.py rather than defaultdict(asyncio.Lock)) wrapping the whole switch critical section. - Add a stale-failure guard: snapshot the subscription at handler entry and, under the lock, bail when it changed — so 3+ viable subscriptions don't cascade A→B→C (a loser whose failure was about the now-previous sub). - Tests: concurrent 2-sub (exactly one switch), 3-sub no-cascade regression, distinct-agent locks. Pins the race the #606 carve-out test deferred. Process-local lock is sufficient today (single backend process; the scheduler delegates execution to the backend via /api/internal/execute-task rather than calling this module in its own process). #1037 (drain-aware recreate) is split into its own PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

AndriiPasternak31 · 2026-06-11T15:22:58Z

Why this is #799-only (decision trail)

This started as a combined #799 + #1037 PR per the do-together note. The plan went through /plan-eng-review + a Codex outside-voice pass, which reshaped the scope:

D1 — for bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037 I first chose an admission gate in CapacityManager.acquire() over a passive recreate-on-idle (a prior cross-model learning flagged the passive approach as best-effort: new admissions run on the stale token, busy agents starve the recreate).
Codex outside voice — proved the acquire()-level gate still leaks: /chat queue_in_memory is observability-only and doesn't suspend execution; the 60s drain_orphans_all() → BacklogService.drain_next() acquires SlotService directly, bypassing the gate; the feat: per-agent dispatch circuit breaker to prevent backlog poisoning (RELIABILITY-007) #526 half-open probe bypasses it; the recreate-pending TTL can't span a 7200s execution; recreate-failure drains onto the bad token; feat: per-agent dispatch circuit breaker to prevent backlog poisoning (RELIABILITY-007) #526's breaker can wipe deliberately-queued rows. It also caught that the bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 lock alone double-switches A→B→C for 3+ subs (→ the stale-failure guard in this PR).
D3 — bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037-done-right belongs in the slot-acquire primitive (the one chokepoint all admission paths funnel through) and is its own focused PR. Ship bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 standalone now.

The #1037 redesign seed (the full Codex finding list) is posted on #1037.

github-actions · 2026-06-12T10:39:35Z

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

vybe

✅ Validated & approved (/validate-pr)

P2 reliability bug fix (Closes #799). 2 files, +337/-35.

Correctness:

Per-agent asyncio.Lock via lazy, event-loop-safe creation (agent_switch_lock) — correctly avoids the defaultdict(asyncio.Lock) cross-loop binding trap; guard check→set has no await between, so it's atomic on the single event loop, and setdefault under the guard resolves the create race.
Stale-failure guard (snapshot at entry → re-read under lock → bail on change) correctly short-circuits BEFORE alternative selection, preventing the A→B→C cascade for 3+ viable subs.
Process-local invariant explicitly documented with the Redis SETNX escalation path for multi-worker.

Tests: test_subscription_auto_switch_concurrency.py — deterministic (pre-hold-lock), covers exactly-one-switch, the critical 3-sub no-cascade regression (asserts B never reaches alternative selection), and distinct-agent locks. Snapshot/restore sys.modules fixture keeps blast radius bounded.

Base → dev ✓ · conventional commit ✓ · CI all green
Security scan clean; no new backend modules / env vars / infra changes
Bug fix → descriptive commit sufficient (no docs required)

Note: head branch AndriiPasternak31/issue-799 is the base of #1196 (stacked) — merge this first; #1196 auto-retargets to dev.

This was referenced Jun 11, 2026

bug: subscription auto-switch (SUB-003) recreates agent container without draining in-flight executions, killing them (RemoteProtocolError) #1037

Closed

bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799

Open

AndriiPasternak31 mentioned this pull request Jun 13, 2026

feat: rotate subscription credentials via hot-reload, not container recreate (#1089) #1196

Open

3 tasks

AndriiPasternak31 self-assigned this Jun 13, 2026

AndriiPasternak31 requested a review from vybe June 13, 2026 23:54

vybe approved these changes Jun 15, 2026

View reviewed changes

vybe merged commit 4274f9f into dev Jun 16, 2026
14 checks passed

This was referenced Jun 16, 2026

CI: path-filtered required checks (schema-parity, verify-non-root) brick unrelated PRs #1237

Open

ci: self-skip schema-parity + verify-non-root so they're safe to require (#1237) #1238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799)#1166

fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799)#1166
vybe merged 1 commit into
devfrom
AndriiPasternak31/issue-799

AndriiPasternak31 commented Jun 11, 2026

Uh oh!

AndriiPasternak31 commented Jun 11, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

vybe left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AndriiPasternak31 commented Jun 11, 2026

What

Problem

Fix

Tests

Scope note

Uh oh!

AndriiPasternak31 commented Jun 11, 2026

Why this is #799-only (decision trail)

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

vybe left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants