From 89bfce15ec6a2898e6ef39e8546bbe3a3f42bec9 Mon Sep 17 00:00:00 2001 From: Andrii Pasternak Date: Thu, 11 Jun 2026 16:21:51 +0100 Subject: [PATCH 1/6] =?UTF-8?q?fix(sub-003):=20per-agent=20lock=20for=20au?= =?UTF-8?q?to-switch=20=E2=80=94=20serialize=20concurrent=20failures=20(#7?= =?UTF-8?q?99)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Concurrent subscription failures on the same agent (two chat requests, or a chat overlapping a scheduled task) both ran handle_subscription_failure's read→decide→assign→restart sequence with no mutual exclusion: both could pick the same alternative, both assign_subscription_to_agent, and both fire _restart_agent — wedging the container, duplicating the switch notification, or tripping the #421 was_already_running ambiguity. - Add a per-agent asyncio.Lock (lazy + event-loop-safe, mirroring agent_call_limiter.py rather than defaultdict(asyncio.Lock)) wrapping the whole switch critical section. - Add a stale-failure guard: snapshot the subscription at handler entry and, under the lock, bail when it changed — so 3+ viable subscriptions don't cascade A→B→C (a loser whose failure was about the now-previous sub). - Tests: concurrent 2-sub (exactly one switch), 3-sub no-cascade regression, distinct-agent locks. Pins the race the #606 carve-out test deferred. Process-local lock is sufficient today (single backend process; the scheduler delegates execution to the backend via /api/internal/execute-task rather than calling this module in its own process). #1037 (drain-aware recreate) is split into its own PR. Co-Authored-By: Claude Opus 4.8 --- .../services/subscription_auto_switch.py | 141 ++++++++--- ...st_subscription_auto_switch_concurrency.py | 231 ++++++++++++++++++ 2 files changed, 337 insertions(+), 35 deletions(-) create mode 100644 tests/unit/test_subscription_auto_switch_concurrency.py diff --git a/src/backend/services/subscription_auto_switch.py b/src/backend/services/subscription_auto_switch.py index 6707a7a5..7ab05767 100644 --- a/src/backend/services/subscription_auto_switch.py +++ b/src/backend/services/subscription_auto_switch.py @@ -20,6 +20,7 @@ tests pinning that contract. """ +import asyncio import logging from typing import Optional @@ -29,6 +30,55 @@ logger = logging.getLogger(__name__) +# --------------------------------------------------------------------------- +# Per-agent switch lock (#799) +# --------------------------------------------------------------------------- +# +# Concurrent subscription failures on the SAME agent (two chat requests, or a +# chat overlapping a scheduled task) both enter `handle_subscription_failure` +# and, without mutual exclusion, both pick the same alternative, both +# `assign_subscription_to_agent`, and both fire `_restart_agent` — the second +# `container_stop` racing the first `start_agent_internal` wedges the container, +# duplicates the switch notification, or trips the #421 `was_already_running` +# ambiguity. A per-agent lock serializes the read→decide→assign→restart window. +# +# Mirrors the event-loop-safe lazy pattern in `services/agent_call_limiter.py` +# (module dict + a lazily-created guard) rather than `defaultdict(asyncio.Lock)`: +# a defaultdict binds each lock to whatever event loop is current at first key +# access and persists it, which breaks across pytest's per-test loops +# ("Future attached to a different loop"). Creating the locks lazily on the +# running loop avoids that. +# +# INVARIANT: process-local. Correct only while (a) the backend runs a single +# process and (b) the scheduler delegates execution to the backend via +# `/api/internal/execute-task` rather than calling this module in its own +# process — both true today. If the backend ever runs multiple workers, escalate +# to a Redis `SETNX` lock keyed `auto_switch:{agent_name}` (TTL ≥ longest +# plausible container start, ~60s). +_AGENT_SWITCH_LOCKS: dict[str, asyncio.Lock] = {} +_AGENT_SWITCH_LOCKS_GUARD: Optional[asyncio.Lock] = None + + +async def agent_switch_lock(agent_name: str) -> asyncio.Lock: + """Return the per-agent switch lock, creating it lazily on the running loop.""" + global _AGENT_SWITCH_LOCKS_GUARD + if _AGENT_SWITCH_LOCKS_GUARD is None: + _AGENT_SWITCH_LOCKS_GUARD = asyncio.Lock() + lock = _AGENT_SWITCH_LOCKS.get(agent_name) + if lock is None: + async with _AGENT_SWITCH_LOCKS_GUARD: + lock = _AGENT_SWITCH_LOCKS.setdefault(agent_name, asyncio.Lock()) + return lock + + +def _reset_locks_for_test() -> None: + """Test hook: drop all per-agent locks + the guard so each test's event loop + starts clean (locks are loop-bound).""" + global _AGENT_SWITCH_LOCKS_GUARD + _AGENT_SWITCH_LOCKS.clear() + _AGENT_SWITCH_LOCKS_GUARD = None + + # Substrings that indicate an auth-class subscription failure. Mirrors the # scheduler's classification at `src/scheduler/service.py` (which now imports # this same list to keep the two surfaces from drifting). @@ -101,50 +151,71 @@ async def handle_subscription_failure( Returns: dict with switch details if auto-switch occurred, None otherwise. """ - # 1. Check if auto-switch is enabled (default: on, #441) + # 1. Check if auto-switch is enabled (default: on, #441). Cheap, lock-free — + # a disabled platform never contends for the per-agent lock. enabled = db.get_setting_value("auto_switch_subscriptions", default="true") == "true" if not enabled: return None - # 2. Check if agent has a subscription assigned - current_sub_id = db.get_agent_subscription_id(agent_name) - if not current_sub_id: + # 2. Snapshot the agent's subscription BEFORE acquiring the lock. This is the + # subscription our failure was (approximately) about. If a concurrent failure + # switches the agent off it while we wait for the lock, our failure is stale. + sub_at_entry = db.get_agent_subscription_id(agent_name) + if not sub_at_entry: return None - # 3. Record the failure event in the rate-limit table. Auth-class events - # share the same table — the table tracks "subscription failure events" - # generically; `is_subscription_rate_limited` treats any event in the 2h - # window as a reason to skip the subscription as a candidate, which is the - # behavior we want for both kinds of failure. - consecutive_count = db.record_rate_limit_event( - agent_name=agent_name, - subscription_id=current_sub_id, - error_message=error_message, - ) - - # 4. Find a viable alternative subscription - alternative = db.select_best_alternative_subscription(current_sub_id) - if not alternative: - logger.warning( - f"[SUB-003] Agent '{agent_name}' hit a {failure_kind} failure on " - f"subscription {current_sub_id} (event #{consecutive_count}) " - f"but no viable alternative subscription is available" + # #799: serialize the read→decide→assign→restart window per agent so two + # concurrent failures on the same agent can't both switch + restart it. + async with await agent_switch_lock(agent_name): + # Re-read under the lock. If another coroutine already switched the agent + # off `sub_at_entry`, this failure is stale — return rather than switch + # again. This is what makes the fix correct for 3+ subscriptions: without + # it, a loser whose failure was about sub-A would attribute it to the new + # current sub-B and cascade A→B→C (#799 / Codex C8). + current_sub_id = db.get_agent_subscription_id(agent_name) + if current_sub_id != sub_at_entry: + logger.info( + f"[SUB-003] Agent '{agent_name}' already switched off subscription " + f"{sub_at_entry} (now {current_sub_id}) before this {failure_kind} " + f"failure acquired the lock — stale failure, skipping" + ) + return None + + # 3. Record the failure event in the rate-limit table. Auth-class events + # share the same table — the table tracks "subscription failure events" + # generically; `is_subscription_rate_limited` treats any event in the 2h + # window as a reason to skip the subscription as a candidate, which is the + # behavior we want for both kinds of failure. + consecutive_count = db.record_rate_limit_event( + agent_name=agent_name, + subscription_id=current_sub_id, + error_message=error_message, ) - return None - # Get current subscription name for logging / notification - current_sub = db.get_subscription(current_sub_id) - old_name = current_sub.name if current_sub else current_sub_id + # 4. Find a viable alternative subscription + alternative = db.select_best_alternative_subscription(current_sub_id) + if not alternative: + logger.warning( + f"[SUB-003] Agent '{agent_name}' hit a {failure_kind} failure on " + f"subscription {current_sub_id} (event #{consecutive_count}) " + f"but no viable alternative subscription is available" + ) + return None - # 5. Perform the switch - return await _perform_auto_switch( - agent_name=agent_name, - old_subscription_id=current_sub_id, - old_subscription_name=old_name, - new_subscription=alternative, - failure_kind=failure_kind, - event_count=consecutive_count, - ) + # Get current subscription name for logging / notification + current_sub = db.get_subscription(current_sub_id) + old_name = current_sub.name if current_sub else current_sub_id + + # 5. Perform the switch (still under the lock — the assign + restart must + # not interleave with a concurrent switch for this agent). + return await _perform_auto_switch( + agent_name=agent_name, + old_subscription_id=current_sub_id, + old_subscription_name=old_name, + new_subscription=alternative, + failure_kind=failure_kind, + event_count=consecutive_count, + ) async def handle_rate_limit_error( diff --git a/tests/unit/test_subscription_auto_switch_concurrency.py b/tests/unit/test_subscription_auto_switch_concurrency.py new file mode 100644 index 00000000..24f50932 --- /dev/null +++ b/tests/unit/test_subscription_auto_switch_concurrency.py @@ -0,0 +1,231 @@ +"""Concurrency tests for SUB-003 auto-switch (#799). + +Pins the per-agent switch lock + stale-failure guard added in +``services/subscription_auto_switch.py``. These are the race the carve-out +test ``test_subscription_auto_switch_no_cred_import.py`` explicitly left out +of scope ("Explicit non-claim: this is NOT a concurrency test"). + +Bug being fixed: two subscription failures on the SAME agent (two chat +requests, or a chat overlapping a scheduled task) both ran the +read→decide→assign→restart sequence with no mutual exclusion, so both could +pick the same alternative, both ``assign_subscription_to_agent``, and both +fire ``_restart_agent`` — wedging the container / duplicating the switch. + +What's pinned here: + - Part A: ``handle_subscription_failure`` is serialized per agent by + ``agent_switch_lock``, so concurrent failures yield exactly ONE switch. + - Part B (Codex C8): the loser snapshots the subscription at entry and, + after taking the lock, bails when it changed — so 3+ viable + subscriptions do NOT cascade A→B→C. + +Determinism: rather than rely on asyncio interleaving (``asyncio.Lock.acquire`` +does not yield on a free lock, so a naive gather lets the winner switch before +the loser snapshots), the tests PRE-HOLD the agent lock. Both coroutines then +park at ``async with lock`` having already snapshotted the old sub; releasing +the lock lets exactly one win and the other observe the post-switch state. + +Modules under test: + src/backend/services/subscription_auto_switch.py::handle_subscription_failure + src/backend/services/subscription_auto_switch.py::agent_switch_lock +""" + +from __future__ import annotations + +import asyncio +import importlib +import importlib.util +import os +import sys +import types +from unittest.mock import MagicMock + +import pytest + + +_BACKEND = os.path.abspath( + os.path.join(os.path.dirname(__file__), "..", "..", "src", "backend") +) + +# Slots this file stubs/loads at import time. The autouse snapshot/restore +# fixture (precedent: test_subscription_auto_switch_no_cred_import.py) bounds +# the blast radius so these don't leak into other files sharing the session. +_STUBBED_MODULE_NAMES = [ + "services", + "database", + "db_models", + "services.subscription_auto_switch", +] + + +@pytest.fixture(autouse=True) +def _restore_sys_modules(): + saved = {name: sys.modules.get(name) for name in _STUBBED_MODULE_NAMES} + try: + yield + finally: + for name, value in saved.items(): + if value is None: + sys.modules.pop(name, None) + else: + sys.modules[name] = value + + +def _load_auto_switch(mock_db): + """Load ``subscription_auto_switch`` in isolation with ``database`` stubbed. + + The module's top-level imports are only ``asyncio``/``logging``/``typing`` + + ``from database import db`` + ``from db_models import NotificationCreate`` + (the ``services.*`` imports are lazy, inside ``_restart_agent`` / + ``_perform_auto_switch``), so loading it by file location under its real + dotted name needs no real ``services`` package boot. + """ + sys.modules.setdefault("services", types.ModuleType("services")) + + db_module = types.ModuleType("database") + db_module.db = mock_db + sys.modules["database"] = db_module + + # db_models is pure Pydantic — import the real one rather than stubbing + # (a bare stub would persist and break unrelated tests). + if "db_models" not in sys.modules: + if _BACKEND not in sys.path: + sys.path.insert(0, _BACKEND) + import db_models # noqa: F401 — registers in sys.modules + + sys.modules.pop("services.subscription_auto_switch", None) + spec = importlib.util.spec_from_file_location( + "services.subscription_auto_switch", + os.path.join(_BACKEND, "services", "subscription_auto_switch.py"), + ) + mod = importlib.util.module_from_spec(spec) + sys.modules["services.subscription_auto_switch"] = mod + spec.loader.exec_module(mod) + return mod + + +def _sub(sub_id: str) -> MagicMock: + """A subscription stand-in with ``.id`` / ``.name`` (``.name`` must be set + after construction — ``MagicMock(name=...)`` is the repr name, not a field).""" + s = MagicMock() + s.id = sub_id + s.name = f"sub-{sub_id}" + return s + + +def _build_db(initial_sub: str, alt_map: dict) -> MagicMock: + """A ``db`` stub whose current subscription is mutable. + + ``alt_map`` maps current-sub-id → the subscription + ``select_best_alternative_subscription`` returns for it (or absent → None). + """ + state = {"current": initial_sub} + db = MagicMock() + db.get_setting_value.return_value = "true" + db.get_agent_subscription_id.side_effect = lambda _agent: state["current"] + db.record_rate_limit_event.return_value = 1 + db.get_subscription.side_effect = lambda sub_id: _sub(sub_id) + db.select_best_alternative_subscription.side_effect = ( + lambda cur: alt_map.get(cur) + ) + db._state = state # expose for assertions / the perform spy + return db + + +def _install_perform_spy(mod, db): + """Replace ``_perform_auto_switch`` with a spy that simulates the assign + (mutates the stub's current sub) and records each switch.""" + calls: list[str] = [] + + async def spy(*, agent_name, old_subscription_id, old_subscription_name, + new_subscription, failure_kind, event_count): + calls.append(new_subscription.id) + db._state["current"] = new_subscription.id # the real assign side effect + return { + "switched": True, + "agent_name": agent_name, + "old_subscription": old_subscription_name, + "new_subscription": new_subscription.name, + } + + mod._perform_auto_switch = spy + return calls + + +@pytest.mark.asyncio +async def test_concurrent_failures_switch_exactly_once_two_subs(): + """Two concurrent failures on one agent (A→B viable) → exactly ONE switch; + the loser returns None.""" + db = _build_db("A", alt_map={"A": _sub("B")}) + mod = _load_auto_switch(db) + mod._reset_locks_for_test() + calls = _install_perform_spy(mod, db) + + # Pre-hold the agent lock so both coroutines snapshot "A" and park at + # `async with lock` before either can switch. + lock = await mod.agent_switch_lock("burst-agent") + await lock.acquire() + + t1 = asyncio.create_task(mod.handle_subscription_failure("burst-agent", "429", "rate_limit")) + t2 = asyncio.create_task(mod.handle_subscription_failure("burst-agent", "429", "rate_limit")) + await asyncio.sleep(0) # let both reach `await lock.acquire()` + + lock.release() + results = await asyncio.gather(t1, t2) + + assert calls == ["B"], f"expected exactly one switch to B, got {calls!r}" + assert db._state["current"] == "B" + switched = [r for r in results if r is not None] + skipped = [r for r in results if r is None] + assert len(switched) == 1 and len(skipped) == 1, results + assert db.assign_subscription_to_agent.call_count == 0 # spy stands in + + +@pytest.mark.asyncio +async def test_concurrent_failures_no_cascade_three_subs(): + """C8 regression (CRITICAL): with A,B,C all viable, two concurrent failures + that both observed sub-A must NOT cascade A→B→C. + + Without the stale-failure guard, the loser would read current=B (after the + winner's A→B switch) and switch B→C. The guard makes it a no-op: it never + even reaches alternative selection for B. + """ + db = _build_db("A", alt_map={"A": _sub("B"), "B": _sub("C")}) + mod = _load_auto_switch(db) + mod._reset_locks_for_test() + calls = _install_perform_spy(mod, db) + + lock = await mod.agent_switch_lock("burst-agent") + await lock.acquire() + + t1 = asyncio.create_task(mod.handle_subscription_failure("burst-agent", "429", "rate_limit")) + t2 = asyncio.create_task(mod.handle_subscription_failure("burst-agent", "429", "rate_limit")) + await asyncio.sleep(0) + + lock.release() + results = await asyncio.gather(t1, t2) + + assert calls == ["B"], f"expected ONLY A→B, got cascade {calls!r}" + assert db._state["current"] == "B" + assert len([r for r in results if r is None]) == 1, "loser must no-op" + # The guard short-circuits BEFORE alternative selection for the new sub: + selected = [c.args[0] for c in db.select_best_alternative_subscription.call_args_list] + assert "B" not in selected, ( + f"loser reached alternative selection for the post-switch sub — the " + f"stale-failure guard did not fire (selected={selected!r})" + ) + + +@pytest.mark.asyncio +async def test_different_agents_use_distinct_locks(): + """Distinct agents get distinct lock objects (so they don't serialize); + the same agent always gets the same lock.""" + db = _build_db("A", alt_map={}) + mod = _load_auto_switch(db) + mod._reset_locks_for_test() + + lock_a1 = await mod.agent_switch_lock("agent-a") + lock_a2 = await mod.agent_switch_lock("agent-a") + lock_b = await mod.agent_switch_lock("agent-b") + + assert lock_a1 is lock_a2, "same agent must reuse its lock" + assert lock_a1 is not lock_b, "different agents must not share a lock" From 146e244d3bbe3f408b02deb4c3e90c755b0892b3 Mon Sep 17 00:00:00 2001 From: Andrii Pasternak Date: Sat, 13 Jun 2026 14:36:04 +0100 Subject: [PATCH 2/6] feat: rotate subscription credentials via hot-reload, not container recreate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rotating an agent's subscription token recreated the container, making "rotate a credential" and "kill every in-flight turn" the same operation (#1037 — one 429 on a shared subscription auto-switched and destroyed every parallel execution). Token rotation now hot-reloads the running container; recreate is reserved for image/template/auth-mode changes (TARGET_ARCHITECTURE §Agent Runtime). - Agent server: POST /api/credentials/reload-token mutates the process CLAUDE_CODE_OAUTH_TOKEN env so the NEXT claude subprocess uses the rotated token while in-flight subprocesses finish on the old one (single uvicorn worker; Claude authenticates purely from the env var). Does not rewrite .env/.mcp.json. - Durable override: token persisted to /var/lib/trinity/oauth-token (0600, writable layer, NOT the /home/developer volume); startup.sh exports it before launching the agent server so a plain fleet restart (raw stop+start, bypasses start_agent_internal) keeps the rotated token. Self-reconciling — wiped on recreate, so a DB-driven recreate re-bakes Config.Env from the DB token. - Three producer paths converted under the #799 agent_switch_lock: auto-switch (SUB-003), manual sub->sub reassignment (auth-mode changes still recreate), and key-rollover fan-out on the POST /api/subscriptions upsert (best-effort). - Back-compat: old base images return 404 -> fall back to _restart_agent (today's behavior, no regression; per-agent adoption on next recreate). Builds on #799 (per-agent agent_switch_lock). 19 new unit tests. Closes #1089 Co-Authored-By: Claude Opus 4.8 (1M context) --- docker/base-image/Dockerfile | 7 + docker/base-image/agent_server/models.py | 17 ++ .../agent_server/routers/credentials.py | 48 ++++ docker/base-image/startup.sh | 14 + docs/memory/architecture.md | 11 +- docs/memory/feature-flows.md | 1 + .../feature-flows/subscription-auto-switch.md | 25 +- docs/memory/requirements.md | 27 ++ docs/planning/TARGET_ARCHITECTURE.md | 2 +- src/backend/routers/subscriptions.py | 115 +++++--- .../services/subscription_auto_switch.py | 102 +++++++- tests/unit/test_reload_token_endpoint.py | 121 +++++++++ .../test_subscription_auto_switch_pingpong.py | 246 ++++++++++++++++++ .../test_subscription_reassign_hotreload.py | 241 +++++++++++++++++ 14 files changed, 933 insertions(+), 44 deletions(-) create mode 100644 tests/unit/test_reload_token_endpoint.py create mode 100644 tests/unit/test_subscription_reassign_hotreload.py diff --git a/docker/base-image/Dockerfile b/docker/base-image/Dockerfile index 98ce515c..2056be45 100644 --- a/docker/base-image/Dockerfile +++ b/docker/base-image/Dockerfile @@ -130,6 +130,13 @@ RUN mkdir -p /workspace /data /logs && \ RUN mkdir -p /tmp/secure && \ chmod 1777 /tmp/secure +# #1089: writable-layer dir for the subscription-token hot-reload override +# (/var/lib/trinity/oauth-token). Owned by the agent (UID 1000) so the +# agent-server process can write it; on the writable layer (NOT /home/developer) +# so it survives a plain stop+start but is wiped on recreate. +RUN mkdir -p /var/lib/trinity && \ + chown developer:developer /var/lib/trinity + USER developer CMD ["/app/startup.sh"] diff --git a/docker/base-image/agent_server/models.py b/docker/base-image/agent_server/models.py index 4fa19d96..c570eca2 100644 --- a/docker/base-image/agent_server/models.py +++ b/docker/base-image/agent_server/models.py @@ -261,3 +261,20 @@ class CredentialInjectResponse(BaseModel): """Response from credential injection""" status: str # "success" files_written: List[str] + + +class TokenReloadRequest(BaseModel): + """Request to hot-reload the subscription OAuth token (#1089). + + Surgical alternative to a container recreate: mutates the agent-server + process env so the NEXT claude subprocess uses the rotated token while + in-flight turns keep their already-inherited old token and finish. + """ + token: str # CLAUDE_CODE_OAUTH_TOKEN value to apply + remove_api_key: bool = False # also drop ANTHROPIC_API_KEY from env + + +class TokenReloadResponse(BaseModel): + """Response from a subscription token hot-reload""" + status: str # "success" + reloaded: bool diff --git a/docker/base-image/agent_server/routers/credentials.py b/docker/base-image/agent_server/routers/credentials.py index cb0a40db..da00e18f 100644 --- a/docker/base-image/agent_server/routers/credentials.py +++ b/docker/base-image/agent_server/routers/credentials.py @@ -15,6 +15,8 @@ CredentialReadResponse, CredentialInjectRequest, CredentialInjectResponse, + TokenReloadRequest, + TokenReloadResponse, ) from ..state import agent_state from ..services.trinity_mcp import inject_trinity_mcp_if_configured @@ -132,6 +134,52 @@ async def update_credentials(request: CredentialUpdateRequest): raise HTTPException(status_code=500, detail=f"Credential update failed: {str(e)}") +# Writable-layer override path (#1089). Deliberately NOT under /home/developer — +# that path is the persistent agent-{name}-workspace volume which +# `recreate_container_with_updated_config` preserves, so a token written there +# would survive a recreate and shadow the freshly-baked Config.Env (DB token). +# The writable layer instead survives a plain stop+start (same container) but is +# wiped on recreate (new container, fresh layer) — self-reconciling by Docker +# semantics, no marker logic needed. The directory is created + chowned to UID +# 1000 in the base-image Dockerfile (before the USER switch). +_TOKEN_OVERRIDE = Path("/var/lib/trinity/oauth-token") + + +@router.post("/api/credentials/reload-token", response_model=TokenReloadResponse) +async def reload_subscription_token(request: TokenReloadRequest): + """Hot-reload CLAUDE_CODE_OAUTH_TOKEN for the NEXT claude subprocess (#1089). + + Mutates the agent-server process env so the next `subprocess.Popen` for + `claude` inherits the rotated token; in-flight subprocesses keep their + already-inherited old token and finish. Also persists the token to the + writable-layer override so it survives a plain stop+start (fleet restart + bypasses `start_agent_internal`, which would otherwise revert to the old + Config.Env token — F2). + + Deliberately does NOT rewrite .env / .mcp.json or re-inject Trinity MCP: the + subscription token is not a .env credential, and the `/update` / `/inject` + endpoints destructively rewrite whole files. + """ + if not request.token: + raise HTTPException(status_code=400, detail="token is required") + + os.environ["CLAUDE_CODE_OAUTH_TOKEN"] = request.token + if request.remove_api_key: + os.environ.pop("ANTHROPIC_API_KEY", None) + + # Persist to the writable-layer override (0600). Parent dir is created + + # chowned in the Dockerfile, so the agent (UID 1000) can write here. + _TOKEN_OVERRIDE.write_text(request.token) + _TOKEN_OVERRIDE.chmod(0o600) + + # Add the new token to the log-redaction set (drops the old exact-match + # value; OAuth tokens stay caught by the sk-ant value regex regardless). + refresh_credential_values() + + logger.info("Hot-reloaded CLAUDE_CODE_OAUTH_TOKEN (next subprocess; in-flight turns unaffected)") + return TokenReloadResponse(status="success", reloaded=True) + + @router.get("/api/credentials/status") async def get_credentials_status(): """ diff --git a/docker/base-image/startup.sh b/docker/base-image/startup.sh index 59d0f7a0..4f566159 100644 --- a/docker/base-image/startup.sh +++ b/docker/base-image/startup.sh @@ -343,6 +343,20 @@ if [ -d "/config/mcp-servers" ]; then done fi +# === Rotated subscription token: durable override (#1089) === +# A hot-reload (POST /api/credentials/reload-token) persists the rotated +# CLAUDE_CODE_OAUTH_TOKEN to this writable-layer path so it survives a plain +# stop+start. The container's baked Config.Env still holds the OLD token and a +# fleet restart (ops.py) does a raw stop+start that bypasses start_agent_internal +# — so export the override (when present and non-empty) BEFORE launching the +# agent server, so the rotated token wins. The file is wiped on recreate (fresh +# writable layer), so a DB-driven recreate cleanly reverts to the freshly-baked +# Config.Env token — no marker logic needed. +if [ -s /var/lib/trinity/oauth-token ]; then + export CLAUDE_CODE_OAUTH_TOKEN="$(cat /var/lib/trinity/oauth-token)" + echo "Applied rotated subscription token from durable override" +fi + # Start Agent Web Server (self-contained UI) if [ "${ENABLE_AGENT_UI}" = "true" ]; then echo "Starting Agent Web UI on port ${AGENT_SERVER_PORT:-8000}..." diff --git a/docs/memory/architecture.md b/docs/memory/architecture.md index cd7f5d04..6c973042 100644 --- a/docs/memory/architecture.md +++ b/docs/memory/architecture.md @@ -276,12 +276,15 @@ Vector 0.43.1 (`timberio/vector:0.43.1-alpine`). Captures all container stdout/s **Internal server** `agent-server.py` (FastAPI, port 8000): - `/api/chat` - Claude Code execution (messages persisted to database) - `/health` - Health check. Returns `{status}` plus `active_tasks` (concurrent executions across `/api/chat` + `/api/task`), `last_task_at`, `consecutive_failures` (reset on success — consumed by the dispatch breaker #526 and fleet health #307) and the #333 `diagnostics` gauges (#1020). `mailbox_depth` intentionally NOT emitted — no agent-side mailbox until the actor model (#945); the backend derives queue depth from `CapacityManager`. Counters live in `agent_server/state.py`; backend reads them in `monitoring_service.py` with graceful defaults for older images. -- `/api/credentials/update` - Hot-reload credentials +- `/api/credentials/update` - Hot-reload credentials (rewrites `.env`/`.mcp.json`) +- `/api/credentials/reload-token` - Surgical subscription-token hot-reload (#1089): mutates the agent-server process `os.environ["CLAUDE_CODE_OAUTH_TOKEN"]` so the NEXT claude subprocess uses the rotated token while in-flight subprocesses keep theirs; persists to the writable-layer override `/var/lib/trinity/oauth-token` (0600). Does NOT touch `.env`/`.mcp.json`. See [Subscription Token Rotation](#subscription-token-rotation-via-hot-reload-1089) - `/api/chat/session` - Context window stats - `/api/files`, `/api/files/download` (100MB limit), `/api/files/mkdir` (workspace-confined, #37) The agent server also runs two loops: the 15-min git `auto_sync` heartbeat (see [Git Sync Health](#git-sync-health-389390)) and the 5s liveness heartbeat (see [Heartbeat Liveness](#heartbeat-liveness-reliability-004-307)). +**Durable subscription-token override (#1089):** `startup.sh` exports `CLAUDE_CODE_OAUTH_TOKEN` from `/var/lib/trinity/oauth-token` (when present, non-empty) **before** launching the agent server, so a token rotated via hot-reload survives a plain stop+start (a fleet restart via `routers/ops.py` does a raw `container_stop`+`container_start` that bypasses `start_agent_internal` and would otherwise revert to the baked `Config.Env` token). The path is deliberately on the writable layer, **not** under the persisted `/home/developer` volume: it survives `stop`→`start` (same container) but is wiped on recreate (fresh layer), so a DB-driven recreate cleanly re-bakes `Config.Env` from the DB and the stale override is gone — self-reconciling, no marker logic. Dir created+chowned to UID 1000 in the base-image Dockerfile. + **Template-supplied pre-check** (SCHED-COND-001, #454): if the template ships an executable `~/.trinity/pre-check`, the backend's internal endpoint `POST /api/internal/agents/{name}/pre-check` runs it via `docker exec` before a cron-triggered chat. Language-agnostic — interpreter selected by shebang. The hook's stdout becomes the chat message; empty stdout + exit 0 records a skipped execution (Claude never invoked). Uses the same `execute_command_in_container` primitive as `git_service.py`, `ssh_service.py`, and the agent terminal — no agent-server HTTP endpoint. **Persistent chat:** all chat messages auto-saved to SQLite (`chat_sessions`, `chat_messages`) with full observability (costs, context, tool calls, execution time); sessions survive container restarts/deletions; users see only their own messages (admins see all). @@ -367,6 +370,12 @@ agent:heartbeat:misses:{name} → STRING(int), ~60s TTL. Consecutive-miss counte Trigger-boundary dedup — policy in Architectural Invariant #18, table DDL under `idempotency_keys`. `services/idempotency_service.py` (key derivation + `begin`/`complete`/`fail`) over `db/idempotency.py`. The `(scope, key)` PRIMARY KEY is the atomic claim: `claim()` INSERTs an `in_flight` row; a concurrent loser catches `IntegrityError` and reads the surviving row — cross-process safe across uvicorn workers and the standalone scheduler (shared SQLite file). Lifecycle: `claim` → (`attach_execution`) → `complete` (stores `response_snapshot` for replay) or `release` (deletes the in_flight row so a failed attempt can retry; never deletes a `completed` row). Rows older than 24h are treated as expired and re-claimed; the cleanup service purges them (`idempotency_purge_expired`). Duplicates within 24h short-circuit with the original result + `X-Idempotent-Replay: true`; an in-flight duplicate returns 409. Fail-open — a key never blocks a real execution. +### Subscription Token Rotation via Hot-Reload (#1089) + +Rotating an agent's subscription token used to recreate the container, making "rotate a credential" and "kill every in-flight turn" the same operation (#1037). Token rotation now hot-reloads the running container; recreate is reserved for image/template/auth-**mode** changes (TARGET_ARCHITECTURE §Agent Runtime). The agent server authenticates Claude purely from `CLAUDE_CODE_OAUTH_TOKEN` (no `.credentials.json`) and is a single uvicorn worker, so mutating its process env makes the **next** subprocess use the new token while in-flight subprocesses finish on the old one. + +Backend orchestration in `services/subscription_auto_switch.py`: `_hot_reload_subscription_token(agent_name)` POSTs the agent's current DB token to the agent-server `POST /api/credentials/reload-token`, falling back to `_restart_agent` on a 404 (old base image), transport failure, or missing token (`no_container`/`not_running` short-circuit otherwise). Three producer paths converted, all under the #799 `agent_switch_lock`: **auto-switch** (`_perform_auto_switch`, SUB-003), **manual sub→sub reassignment** (`PUT /api/subscriptions/agents/{name}` — auth-mode changes none/api-key→sub still recreate), and **key rollover** (`reload_subscription_for_all_agents(sub_id)` fans a best-effort reload across every running agent on a re-registered subscription). Durable override (`/var/lib/trinity/oauth-token`) + `startup.sh` read make a rotation survive a plain restart — see the agent-server [Durable subscription-token override](#agent-containers) note. Agent-server endpoint mirroring follows Invariant #5. + ### Real-time Delivery (RELIABILITY-003, #306) **Transport** (`event_bus.py`): Redis Streams. `ConnectionManager`/`FilteredWebSocketManager` are thin shims that `XADD` to the MAXLEN-trimmed `trinity:events` stream; one `StreamDispatcher` per backend process runs `XREAD BLOCK` and fans out to registered clients, evicting a client after 3 consecutive delivery failures. New broadcast sites keep calling `manager.broadcast(...)` / `filtered_manager.broadcast_filtered(...)` — never publish to the stream directly (Invariant #10). diff --git a/docs/memory/feature-flows.md b/docs/memory/feature-flows.md index d78d6df4..f9c198f1 100644 --- a/docs/memory/feature-flows.md +++ b/docs/memory/feature-flows.md @@ -11,6 +11,7 @@ | Date | ID | Feature | Flow | |------|-----|---------|------| +| 2026-06-13 | #1089 | feat: subscription token rotation via **hot-reload, not container recreate** — a dedicated agent-server `POST /api/credentials/reload-token` mutates the running container's `CLAUDE_CODE_OAUTH_TOKEN` env so the next claude subprocess uses the rotated token while in-flight turns finish (closes the #1037 collateral-kill class). Three producer paths converted under the #799 `agent_switch_lock`: auto-switch (SUB-003), manual sub→sub reassignment (auth-mode changes still recreate), and key-rollover fan-out on `POST /api/subscriptions` upsert. Durable writable-layer override (`/var/lib/trinity/oauth-token` + `startup.sh` read) survives a plain restart; recreate self-reconciles to the DB token. Falls back to the old `_restart_agent` recreate on a 404 (old base image). | [subscription-auto-switch.md](feature-flows/subscription-auto-switch.md) | | 2026-06-10 | #1130 | fix: retired `gemini-2.0-flash` replaced with env-configurable models — `GEMINI_TEXT_MODEL` (image-gen prompt refinement) + `GEMINI_TRANSCRIPTION_MODEL` (Telegram voice), both default `gemini-3.5-flash`, defined in `config.py`, empty-string-safe wiring in both compose files (#1076 pattern). | [image-generation.md](feature-flows/image-generation.md), [telegram-integration.md](feature-flows/telegram-integration.md) | | 2026-06-10 | #1108 | feat(ui): Agent Detail **Guardrails** tab renamed to **Settings** — sectioned config home. New `components/settings/SettingsPanel.vue` renders `GuardrailsPanel` unchanged as section #1; future per-agent settings land as additive sections, not new tabs. `?tab=guardrails` deep links alias to `settings` via `TAB_ALIASES`. Pure frontend. | [agent-guardrails.md](feature-flows/agent-guardrails.md) | | 2026-06-10 | #1114 | feat(ui): Agent Detail tabs overflow into a **"More ▾"** dropdown instead of horizontal scroll. New reusable `components/OverflowTabs.vue` ("priority+" pattern): a hidden, zero-layout mirror row measures every `{id,label,badge?}` tab's width (+ a worst-case "More" button) so the visible row renders as many tabs as fit and collapses the trailing remainder into a right-aligned disclosure menu. Re-measures on container resize (`ResizeObserver` on the outer wrapper, width-diff-guarded + rAF-debounced) and after `document.fonts.ready`; re-measures on tab/label/badge changes via a derived-signature `watch` (`flush:'post'`). Defaults to all-inline before the first measure (no first-paint snap; no "More" when everything fits). Active-in-overflow reflected on the trigger (active underline + dot), tab order never reshuffled. Plain `