feat: rotate subscription credentials via hot-reload, not container recreate (#1089)#1196
feat: rotate subscription credentials via hot-reload, not container recreate (#1089)#1196AndriiPasternak31 wants to merge 7 commits into
Conversation
…ailures (#799) Concurrent subscription failures on the same agent (two chat requests, or a chat overlapping a scheduled task) both ran handle_subscription_failure's read→decide→assign→restart sequence with no mutual exclusion: both could pick the same alternative, both assign_subscription_to_agent, and both fire _restart_agent — wedging the container, duplicating the switch notification, or tripping the #421 was_already_running ambiguity. - Add a per-agent asyncio.Lock (lazy + event-loop-safe, mirroring agent_call_limiter.py rather than defaultdict(asyncio.Lock)) wrapping the whole switch critical section. - Add a stale-failure guard: snapshot the subscription at handler entry and, under the lock, bail when it changed — so 3+ viable subscriptions don't cascade A→B→C (a loser whose failure was about the now-previous sub). - Tests: concurrent 2-sub (exactly one switch), 3-sub no-cascade regression, distinct-agent locks. Pins the race the #606 carve-out test deferred. Process-local lock is sufficient today (single backend process; the scheduler delegates execution to the backend via /api/internal/execute-task rather than calling this module in its own process). #1037 (drain-aware recreate) is split into its own PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ecreate Rotating an agent's subscription token recreated the container, making "rotate a credential" and "kill every in-flight turn" the same operation (#1037 — one 429 on a shared subscription auto-switched and destroyed every parallel execution). Token rotation now hot-reloads the running container; recreate is reserved for image/template/auth-mode changes (TARGET_ARCHITECTURE §Agent Runtime). - Agent server: POST /api/credentials/reload-token mutates the process CLAUDE_CODE_OAUTH_TOKEN env so the NEXT claude subprocess uses the rotated token while in-flight subprocesses finish on the old one (single uvicorn worker; Claude authenticates purely from the env var). Does not rewrite .env/.mcp.json. - Durable override: token persisted to /var/lib/trinity/oauth-token (0600, writable layer, NOT the /home/developer volume); startup.sh exports it before launching the agent server so a plain fleet restart (raw stop+start, bypasses start_agent_internal) keeps the rotated token. Self-reconciling — wiped on recreate, so a DB-driven recreate re-bakes Config.Env from the DB token. - Three producer paths converted under the #799 agent_switch_lock: auto-switch (SUB-003), manual sub->sub reassignment (auth-mode changes still recreate), and key-rollover fan-out on the POST /api/subscriptions upsert (best-effort). - Back-compat: old base images return 404 -> fall back to _restart_agent (today's behavior, no regression; per-agent adoption on next recreate). Builds on #799 (per-agent agent_switch_lock). 19 new unit tests. Closes #1089 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Full local verification — backend + agent sideVerified end-to-end against real production images on an isolated local stack: the backend prod image, and a freshly-rebuilt Backend (verify-local Docker stages, isolated sibling stack)
Agent side (rebuilt
|
| Area | Result |
|---|---|
/var/lib/trinity = developer:developer 0755, container UID 1000, writable (Dockerfile) |
✅ |
Fresh container: no override file, baked Config.Env token wins |
✅ |
POST /api/credentials/reload-token → {"status":"success","reloaded":true} |
✅ |
Override file persisted 0600, correct content |
✅ |
Endpoint does not rewrite .env / .mcp.json |
✅ |
F2: after stop+start (same container), agent-server env = rotated token (beats baked Config.Env); startup.sh logs the override application |
✅ |
| Recreate: brand-new container (fresh writable layer) → override gone, reverts to baked token (self-reconciles, no marker logic) | ✅ |
The agent-server process env was read from /proc/<pid>/environ to confirm the startup.sh durable-override path wins on a plain restart and that a recreate cleanly drops a stale override. Test used obviously-fake placeholder tokens.
Conclusion: both the backend hot-reload path and the agent-side endpoint + durable-override + Dockerfile changes are validated on real images.
) write_text()+chmod() created the override file under the process umask (typically 0644) and left the token world-readable in the window between create and the follow-up chmod. Use os.open(..., O_CREAT|O_TRUNC, 0o600) so the file is created with 0600 in one syscall, and fchmod the fd as well so a pre-existing override (older write path or tampering) is forced back to 0600 — O_CREAT only applies its mode arg on creation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The agent's current subscription was read BEFORE entering agent_switch_lock, so a concurrent auto-switch could change the assignment between that read and the in-lock DB assign (TOCTOU). The stale old_sub_id could then mis-classify a sub->sub swap as an auth-mode change (or vice-versa) and route it into a needless container recreate instead of a hot-reload. Move the get_agent_subscription_id() snapshot inside the lock, immediately before the assign. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vybe
left a comment
There was a problem hiding this comment.
✅ Code validated & approved (/validate-pr) — merge gated on the conditions below.
P2 reliability feature (Closes #1089). 14 files, +1023/-45. Stacked on #1166 (#799).
Code review — clean, no defects:
- Agent-server
POST /api/credentials/reload-token: token never logged; atomic 0600 create viaos.open(..., O_WRONLY|O_CREAT|O_TRUNC, 0o600)+os.fchmod(correctly closes the umask race and forces perms on a pre-existing file); 400 on empty token;refresh_credential_values()adds the new token to the redaction set; non-destructive (no.env/.mcp.jsonrewrite). Follows the established agent-server endpoint pattern (Invariant #5 — backend mirror present insubscriptions.py). - Backend
_hot_reload_subscription_token: logs only agent name + status codes; 404 (old base image) / transport failure / no-token →_restart_agentfallback = identical to today's behavior, no regression. Per-agent activation, not fleet-wide on deploy. - Durable override on the writable layer (
/var/lib/trinity/oauth-token, NOT under/home/developer): survives stop+start, wiped on recreate → self-reconciling by Docker semantics, no marker logic.startup.shexports it before launch and does not echo the value; Dockerfilemkdir+chown developerbefore theUSERswitch (Invariant #17 ✓). - All 3 producer paths (auto-switch, manual reassign, key rollover) under the #799
agent_switch_lock; mode-change still recreates (dropsANTHROPIC_API_KEY). - Docs complete (architecture + requirements + feature-flows index + subscription-auto-switch.md + TARGET_ARCHITECTURE); 19 TDD unit tests; security scan clean (no token-value logging anywhere).
- #1166 must merge first (this PR is stacked on
AndriiPasternak31/issue-799). - Retarget base →
devafter that. - Full CI matrix must go green — the backend-unit-test matrix (pytest seeds / lint / Analyze / regression-diff) did not run here because the base is a feature branch, not
dev; current green is onlyprod-image-smoke+verify-non-root(e2e skipped). Test evidence is local-only until retarget. - Run the
verify-localintegration the test plan leaves unchecked (rebuild base image → assert a parallel turn survives a rotation, container ID unchanged, fleet restart keeps the rotated token, old-image 404 fallback) — this is a credential-rotation + container-lifecycle change.
Reminder: base-image rebuild (./scripts/deploy/build-base-image.sh) required for the endpoint to exist on agents. Cross-worker race correctly flagged for #799/#1166 (Redis SETNX escalation).
|
@AndriiPasternak31 heads-up: #1166 ( Could you rebase onto the updated |
|
Resolve by running |
…-issue-1089 # Conflicts: # docs/memory/feature-flows.md
|
|
||
| logger.info( | ||
| f"Assigned subscription '{subscription_name}' to agent '{agent_name}' " | ||
| f"by {current_user.username}" | ||
| ) | ||
|
|
||
| restart_result = None | ||
| injection_result = None |
…o fix seed-flaky reassign tests (#1089) The manual-reassign / key-rollover fixtures stubbed services.subscription_auto_switch via `import ... as auto_switch`, which binds the `services` package attribute. The endpoint resolves the helpers through a function-local `from services.subscription_auto_switch import (...)`, which binds from sys.modules. The conftest #762 autouse fixture restores sys.modules["services"] before/after every test, and under some pytest-randomly orderings (seed 99999) the package attribute and the sys.modules submodule entry drift to two different module objects — so the fixture patched one object while the endpoint called the other (the real, unpatched helper), and 4 tests flaked. Resolve the module via sys.modules in the fixtures (new _live_auto_switch helper) so the patch always lands on the object the endpoint calls. Verified: full tests/unit suite green under pytest-randomly seeds 99999/12345/67890 (the 4 failures under 99999 are gone, no regression on the others). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Rotating an agent's subscription token used to recreate the container — so "rotate a credential" and "kill every in-flight turn" were the same operation. One 429 on a shared subscription triggered SUB-003 auto-switch →
_restart_agent→container_stop+ recreate, destroying every parallel execution on that agent (#1037 collateral kills). This PR makes token rotation hot-reload the running container instead; recreate is reserved for image/template/auth-mode changes. It removes the credential↔execution collision class structurally rather than recovering from it after the fact (TARGET_ARCHITECTURE.md§Agent Runtime).Why hot-reload is safe (verified)
subprocess.Popen(..., env={**os.environ, ...})and authenticates purely from theCLAUDE_CODE_OAUTH_TOKENenv var (no.credentials.jsonwrite), and it is a single uvicorn worker.os.environ["CLAUDE_CODE_OAUTH_TOKEN"]makes the next Claude subprocess use the new token, while in-flight subprocesses keep their already-inherited old token and finish. The DB stays source of truth;check_api_key_env_matchesreconcilesConfig.Envon the next recreate.What changed
Agent server (
docker/base-image/agent_server/)POST /api/credentials/reload-token({token, remove_api_key}) — mutatesos.environ, persists the token to a writable-layer override, and refreshes the log-redaction set. Does not rewrite.env/.mcp.jsonor re-inject Trinity MCP (those whole-file flows are destructive; the sub token is not a.envcredential).TokenReloadRequest/TokenReloadResponsemodels.Durable override across a plain restart (F2) —
docker/base-image/Dockerfile+startup.sh/var/lib/trinity/oauth-token(0600), deliberately NOT under/home/developer(the persistedagent-{name}-workspacevolume that recreate preserves).startup.shexportsCLAUDE_CODE_OAUTH_TOKENfrom that file (when present, non-empty) before launching the agent server. A fleet restart (routers/ops.py) does a rawcontainer_stop+container_startthat bypassesstart_agent_internal, so without this it would silently revert to the old bakedConfig.Envtoken.stop→start(same container) but is wiped on recreate (new container, fresh layer). After a DB-driven recreate the override is gone and the freshly-bakedConfig.Env(DB token) wins — no marker logic.Backend (
src/backend/)_hot_reload_subscription_token(agent_name)— POSTs the agent's current DB token to the endpoint; falls back to_restart_agenton a 404 (old base image), transport failure, or no resolvable token; short-circuitsno_container/not_running.agent_switch_lock:_perform_auto_switchhot-reloads instead of recreating.PUT /api/subscriptions/agents/{name}) — sub→sub swap hot-reloads under the lock; an auth-mode change (none/api-key → subscription) still recreates soANTHROPIC_API_KEYis dropped and the token is baked intoConfig.Env.POST /api/subscriptionsupsert) —reload_subscription_for_all_agentsfans a best-effort hot-reload across every running agent on the re-registered subscription (one agent's failure never fails the upsert nor blocks the others).Back-compat & rollout
_restart_agent, identical to today's behavior (no regression).recreate_container_with_updated_configreuses the old image, so an agent only gains the endpoint once recreated onto a rebuilt base image — hot-reload activates per-agent, not fleet-wide on deploy. Base-image rebuild required (./scripts/deploy/build-base-image.sh).Test plan
reload-tokenwith the DB token; 404/transport/no-token → restart fallback;no_container/not_runningshort-circuit), auto-switch wire-in (restart_result == "hot_reloaded", no recreate), manual sub→sub (nocontainer_stop; mode-change still recreates; under the lock), key-rollover fan-out (continues past one agent's failure; upsert never fails), and the agent-server endpoint (sets env, writes the 0600 override, no.envwrite,remove_api_keypopsANTHROPIC_API_KEY, empty token → 400).tests/lint_sys_modules.pyclean (no new violations); fulltests/unit/sweep shows zero regressions (the 13 pre-existing failures + 1 collection error reproduce identically on the clean bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 base).verify-local(recommended before merge): rebuild the base image, subscription-backed agent with two parallel long tasks, triggerhandle_subscription_failure→ assert (a) the non-failing parallel turn survives, (b) container ID unchanged, (c) a new task uses the new token, (d) a fleet restart keeps the rotated token, (e) an old-image agent hits the 404 fallback.Known limitations
uvicorn --workers 2; theagent_switch_lockis a process-localasyncio.Lock, so two same-agent switches on different workers aren't serialized. feat: credential rotation via hot-reload, not container recreate #1089 doesn't worsen it (divergence self-heals via the durable override /check_api_key_env_matches). Recommend escalating to a RedisSETNXlock (auto_switch:{agent}) in bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799.delete_subscriptionstill leaves the deleted token live until next start (pre-existing, out of scope).Closes #1089
🤖 Generated with Claude Code