feat: rotate subscription credentials via hot-reload, not container recreate (#1089) by AndriiPasternak31 · Pull Request #1196 · Abilityai/trinity

AndriiPasternak31 · 2026-06-13T13:36:58Z

Summary

Rotating an agent's subscription token used to recreate the container — so "rotate a credential" and "kill every in-flight turn" were the same operation. One 429 on a shared subscription triggered SUB-003 auto-switch → _restart_agent → container_stop + recreate, destroying every parallel execution on that agent (#1037 collateral kills). This PR makes token rotation hot-reload the running container instead; recreate is reserved for image/template/auth-mode changes. It removes the credential↔execution collision class structurally rather than recovering from it after the fact (TARGET_ARCHITECTURE.md §Agent Runtime).

Stacked on #799 (AndriiPasternak31/issue-799, PR #1166) — base branch is that PR, not dev. #799 adds the per-agent agent_switch_lock and runs _perform_auto_switch inside it; this PR extends the lock to the manual-reassign + key-rollover paths. Rebase onto dev once #1166 merges.

Why hot-reload is safe (verified)

The agent server spawns Claude via subprocess.Popen(..., env={**os.environ, ...}) and authenticates purely from the CLAUDE_CODE_OAUTH_TOKEN env var (no .credentials.json write), and it is a single uvicorn worker.
⟹ Mutating the agent-server process os.environ["CLAUDE_CODE_OAUTH_TOKEN"] makes the next Claude subprocess use the new token, while in-flight subprocesses keep their already-inherited old token and finish. The DB stays source of truth; check_api_key_env_matches reconciles Config.Env on the next recreate.

What changed

Agent server (docker/base-image/agent_server/)

New POST /api/credentials/reload-token ({token, remove_api_key}) — mutates os.environ, persists the token to a writable-layer override, and refreshes the log-redaction set. Does not rewrite .env/.mcp.json or re-inject Trinity MCP (those whole-file flows are destructive; the sub token is not a .env credential).
TokenReloadRequest/TokenReloadResponse models.

Durable override across a plain restart (F2) — docker/base-image/Dockerfile + startup.sh

The reload also writes /var/lib/trinity/oauth-token (0600), deliberately NOT under /home/developer (the persisted agent-{name}-workspace volume that recreate preserves).
startup.sh exports CLAUDE_CODE_OAUTH_TOKEN from that file (when present, non-empty) before launching the agent server. A fleet restart (routers/ops.py) does a raw container_stop+container_start that bypasses start_agent_internal, so without this it would silently revert to the old baked Config.Env token.
Self-reconciling by Docker semantics: the writable layer survives stop→start (same container) but is wiped on recreate (new container, fresh layer). After a DB-driven recreate the override is gone and the freshly-baked Config.Env (DB token) wins — no marker logic.

Backend (src/backend/)

_hot_reload_subscription_token(agent_name) — POSTs the agent's current DB token to the endpoint; falls back to _restart_agent on a 404 (old base image), transport failure, or no resolvable token; short-circuits no_container/not_running.
Three producer paths converted, all under the bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 agent_switch_lock:
1. Auto-switch (SUB-003) — _perform_auto_switch hot-reloads instead of recreating.
2. Manual reassignment (PUT /api/subscriptions/agents/{name}) — sub→sub swap hot-reloads under the lock; an auth-mode change (none/api-key → subscription) still recreates so ANTHROPIC_API_KEY is dropped and the token is baked into Config.Env.
3. Key rollover (POST /api/subscriptions upsert) — reload_subscription_for_all_agents fans a best-effort hot-reload across every running agent on the re-registered subscription (one agent's failure never fails the upsert nor blocks the others).

Back-compat & rollout

Running containers on an older base image return 404 for the endpoint → the helper falls back to _restart_agent, identical to today's behavior (no regression).
recreate_container_with_updated_config reuses the old image, so an agent only gains the endpoint once recreated onto a rebuilt base image — hot-reload activates per-agent, not fleet-wide on deploy. Base-image rebuild required (./scripts/deploy/build-base-image.sh).

Test plan

19 new unit tests, TDD (red→green): hot-reload helper (happy path POSTs reload-token with the DB token; 404/transport/no-token → restart fallback; no_container/not_running short-circuit), auto-switch wire-in (restart_result == "hot_reloaded", no recreate), manual sub→sub (no container_stop; mode-change still recreates; under the lock), key-rollover fan-out (continues past one agent's failure; upsert never fails), and the agent-server endpoint (sets env, writes the 0600 override, no .env write, remove_api_key pops ANTHROPIC_API_KEY, empty token → 400).
ruff + focused mypy clean on changed files; tests/lint_sys_modules.py clean (no new violations); full tests/unit/ sweep shows zero regressions (the 13 pre-existing failures + 1 collection error reproduce identically on the clean bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 base).
Integration / verify-local (recommended before merge): rebuild the base image, subscription-backed agent with two parallel long tasks, trigger handle_subscription_failure → assert (a) the non-failing parallel turn survives, (b) container ID unchanged, (c) a new task uses the new token, (d) a fleet restart keeps the rotated token, (e) an old-image agent hits the 404 fallback.

Known limitations

Cross-worker race (flagged for bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799/fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799) #1166): prod runs uvicorn --workers 2; the agent_switch_lock is a process-local asyncio.Lock, so two same-agent switches on different workers aren't serialized. feat: credential rotation via hot-reload, not container recreate #1089 doesn't worsen it (divergence self-heals via the durable override / check_api_key_env_matches). Recommend escalating to a Redis SETNX lock (auto_switch:{agent}) in bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799.
Bulk delete_subscription still leaves the deleted token live until next start (pre-existing, out of scope).

Closes #1089

🤖 Generated with Claude Code

…ailures (#799) Concurrent subscription failures on the same agent (two chat requests, or a chat overlapping a scheduled task) both ran handle_subscription_failure's read→decide→assign→restart sequence with no mutual exclusion: both could pick the same alternative, both assign_subscription_to_agent, and both fire _restart_agent — wedging the container, duplicating the switch notification, or tripping the #421 was_already_running ambiguity. - Add a per-agent asyncio.Lock (lazy + event-loop-safe, mirroring agent_call_limiter.py rather than defaultdict(asyncio.Lock)) wrapping the whole switch critical section. - Add a stale-failure guard: snapshot the subscription at handler entry and, under the lock, bail when it changed — so 3+ viable subscriptions don't cascade A→B→C (a loser whose failure was about the now-previous sub). - Tests: concurrent 2-sub (exactly one switch), 3-sub no-cascade regression, distinct-agent locks. Pins the race the #606 carve-out test deferred. Process-local lock is sufficient today (single backend process; the scheduler delegates execution to the backend via /api/internal/execute-task rather than calling this module in its own process). #1037 (drain-aware recreate) is split into its own PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ecreate Rotating an agent's subscription token recreated the container, making "rotate a credential" and "kill every in-flight turn" the same operation (#1037 — one 429 on a shared subscription auto-switched and destroyed every parallel execution). Token rotation now hot-reloads the running container; recreate is reserved for image/template/auth-mode changes (TARGET_ARCHITECTURE §Agent Runtime). - Agent server: POST /api/credentials/reload-token mutates the process CLAUDE_CODE_OAUTH_TOKEN env so the NEXT claude subprocess uses the rotated token while in-flight subprocesses finish on the old one (single uvicorn worker; Claude authenticates purely from the env var). Does not rewrite .env/.mcp.json. - Durable override: token persisted to /var/lib/trinity/oauth-token (0600, writable layer, NOT the /home/developer volume); startup.sh exports it before launching the agent server so a plain fleet restart (raw stop+start, bypasses start_agent_internal) keeps the rotated token. Self-reconciling — wiped on recreate, so a DB-driven recreate re-bakes Config.Env from the DB token. - Three producer paths converted under the #799 agent_switch_lock: auto-switch (SUB-003), manual sub->sub reassignment (auth-mode changes still recreate), and key-rollover fan-out on the POST /api/subscriptions upsert (best-effort). - Back-compat: old base images return 404 -> fall back to _restart_agent (today's behavior, no regression; per-agent adoption on next recreate). Builds on #799 (per-agent agent_switch_lock). 19 new unit tests. Closes #1089 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AndriiPasternak31 · 2026-06-13T13:58:04Z

✅ Full local verification — backend + agent side

Verified end-to-end against real production images on an isolated local stack: the backend prod image, and a freshly-rebuilt trinity-agent-base exercised through the complete token-rotation lifecycle on a real agent container.

Backend (verify-local Docker stages, isolated sibling stack)

build + import-smoke ✅ — backend prod image builds; python -c 'import main' runs inside the built image (the packaging/missing-COPY check)
boot + health ✅ — stack boots, migrations apply, /health green with these changes
integration ✅ — 6 passed, 69 skipped, 2 deselected against the live stack
unit ✅ — the new #1089 suites pass directly: 39 passed (reload-token endpoint, auto-switch concurrency, ping-pong, manual sub→sub reassign hot-reload)

Note: the verify-local unit stage itself aborts on a pre-existing, unrelated collection error in tests/unit/test_1073_voip_media_stream_ticket.py — the repo-root config/ namespace package shadows src/backend/config.py (from config import VOIP_MAX_CALL_DURATION → "unknown location"). It reproduces identically on dev and on this file run alone; none of config.py / twilio_media_stream.py / test_1073 are touched by this branch. Worked around with --skip-unit after running the new suites directly.

Agent side (rebuilt `trinity-agent-base`, real standalone container) — 12/12 ✅

Area	Result
`/var/lib/trinity` = `developer:developer 0755`, container UID 1000, writable (Dockerfile)	✅
Fresh container: no override file, baked `Config.Env` token wins	✅
`POST /api/credentials/reload-token` → `{"status":"success","reloaded":true}`	✅
Override file persisted `0600`, correct content	✅
Endpoint does not rewrite `.env` / `.mcp.json`	✅
F2: after `stop`+`start` (same container), agent-server env = rotated token (beats baked `Config.Env`); `startup.sh` logs the override application	✅
Recreate: brand-new container (fresh writable layer) → override gone, reverts to baked token (self-reconciles, no marker logic)	✅

The agent-server process env was read from /proc/<pid>/environ to confirm the startup.sh durable-override path wins on a plain restart and that a recreate cleanly drops a stale override. Test used obviously-fake placeholder tokens.

Conclusion: both the backend hot-reload path and the agent-side endpoint + durable-override + Dockerfile changes are validated on real images.

) write_text()+chmod() created the override file under the process umask (typically 0644) and left the token world-readable in the window between create and the follow-up chmod. Use os.open(..., O_CREAT|O_TRUNC, 0o600) so the file is created with 0600 in one syscall, and fchmod the fd as well so a pre-existing override (older write path or tampering) is forced back to 0600 — O_CREAT only applies its mode arg on creation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The agent's current subscription was read BEFORE entering agent_switch_lock, so a concurrent auto-switch could change the assignment between that read and the in-lock DB assign (TOCTOU). The stale old_sub_id could then mis-classify a sub->sub swap as an auth-mode change (or vice-versa) and route it into a needless container recreate instead of a hot-reload. Move the get_agent_subscription_id() snapshot inside the lock, immediately before the assign. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

) Add a regression test asserting a non-admin caller gets 403 from register_subscription before any subscription is created or the key-rollover fan-out runs — mirroring the owner gate already covered on the reassign path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vybe

✅ Code validated & approved (/validate-pr) — merge gated on the conditions below.

P2 reliability feature (Closes #1089). 14 files, +1023/-45. Stacked on #1166 (#799).

Code review — clean, no defects:

Agent-server POST /api/credentials/reload-token: token never logged; atomic 0600 create via os.open(..., O_WRONLY|O_CREAT|O_TRUNC, 0o600) + os.fchmod (correctly closes the umask race and forces perms on a pre-existing file); 400 on empty token; refresh_credential_values() adds the new token to the redaction set; non-destructive (no .env/.mcp.json rewrite). Follows the established agent-server endpoint pattern (Invariant #5 — backend mirror present in subscriptions.py).
Backend _hot_reload_subscription_token: logs only agent name + status codes; 404 (old base image) / transport failure / no-token → _restart_agent fallback = identical to today's behavior, no regression. Per-agent activation, not fleet-wide on deploy.
Durable override on the writable layer (/var/lib/trinity/oauth-token, NOT under /home/developer): survives stop+start, wiped on recreate → self-reconciling by Docker semantics, no marker logic. startup.sh exports it before launch and does not echo the value; Dockerfile mkdir+chown developer before the USER switch (Invariant #17 ✓).
All 3 producer paths (auto-switch, manual reassign, key rollover) under the #799 agent_switch_lock; mode-change still recreates (drops ANTHROPIC_API_KEY).
Docs complete (architecture + requirements + feature-flows index + subscription-auto-switch.md + TARGET_ARCHITECTURE); 19 TDD unit tests; security scan clean (no token-value logging anywhere).

⚠️ Merge conditions (this approval is the code-review gate, not a merge-now):

#1166 must merge first (this PR is stacked on AndriiPasternak31/issue-799).
Retarget base → dev after that.
Full CI matrix must go green — the backend-unit-test matrix (pytest seeds / lint / Analyze / regression-diff) did not run here because the base is a feature branch, not dev; current green is only prod-image-smoke + verify-non-root (e2e skipped). Test evidence is local-only until retarget.
Run the verify-local integration the test plan leaves unchecked (rebuild base image → assert a parallel turn survives a rotation, container ID unchanged, fleet restart keeps the rotated token, old-image 404 fallback) — this is a credential-rotation + container-lifecycle change.

Reminder: base-image rebuild (./scripts/deploy/build-base-image.sh) required for the endpoint to exist on agents. Cross-worker race correctly flagged for #799/#1166 (Redis SETNX escalation).

vybe · 2026-06-16T15:13:45Z

@AndriiPasternak31 heads-up: #1166 (issue-799) was just squash-merged to dev, and I retargeted this PR's base from issue-799 → dev. Because the squash detached the shared history, this PR now re-carries #799's diff (+1360/-80) and shows as CONFLICTING.

Could you rebase onto the updated dev and drop the now-redundant #799 commits? It should then reduce back to just the #1089 hot-reload diff and be mergeable. It's already approved — just needs the rebase.

github-actions · 2026-06-17T11:11:39Z

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

…-issue-1089 # Conflicts: # docs/memory/feature-flows.md

+
+            logger.info(
+                f"Assigned subscription '{subscription_name}' to agent '{agent_name}' "
+                f"by {current_user.username}"
+            )
+
+            restart_result = None
+            injection_result = None


…o fix seed-flaky reassign tests (#1089) The manual-reassign / key-rollover fixtures stubbed services.subscription_auto_switch via `import ... as auto_switch`, which binds the `services` package attribute. The endpoint resolves the helpers through a function-local `from services.subscription_auto_switch import (...)`, which binds from sys.modules. The conftest #762 autouse fixture restores sys.modules["services"] before/after every test, and under some pytest-randomly orderings (seed 99999) the package attribute and the sys.modules submodule entry drift to two different module objects — so the fixture patched one object while the endpoint called the other (the real, unpatched helper), and 4 tests flaked. Resolve the module via sys.modules in the fixtures (new _live_auto_switch helper) so the patch always lands on the object the endpoint calls. Verified: full tests/unit suite green under pytest-randomly seeds 99999/12345/67890 (the 4 failures under 99999 are gone, no regression on the others). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AndriiPasternak31 and others added 2 commits June 11, 2026 16:21

AndriiPasternak31 and others added 3 commits June 13, 2026 15:21

AndriiPasternak31 self-assigned this Jun 13, 2026

AndriiPasternak31 requested review from dolho and vybe and removed request for dolho June 13, 2026 23:53

vybe mentioned this pull request Jun 15, 2026

fix(sub-003): per-agent lock for auto-switch — serialize concurrent failures (#799) #1166

Merged

vybe approved these changes Jun 15, 2026

View reviewed changes

vybe mentioned this pull request Jun 15, 2026

feat(agent-runtime): OpenAI Codex as the third agent runtime + resolve E2E findings (#1187) #1203

Open

vybe changed the base branch from AndriiPasternak31/issue-799 to dev June 16, 2026 15:01

Merge remote-tracking branch 'origin/dev' into AndriiPasternak31/plan…

2c3270a

…-issue-1089 # Conflicts: # docs/memory/feature-flows.md

github-advanced-security AI found potential problems Jun 17, 2026

View reviewed changes

Comment thread src/backend/routers/subscriptions.py

Comment on lines +282 to +289

logger.info(

f"Assigned subscription '{subscription_name}' to agent '{agent_name}' "

f"by {current_user.username}"

)

restart_result = None

injection_result = None

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: rotate subscription credentials via hot-reload, not container recreate (#1089)#1196

feat: rotate subscription credentials via hot-reload, not container recreate (#1089)#1196
AndriiPasternak31 wants to merge 7 commits into
devfrom
AndriiPasternak31/plan-issue-1089

AndriiPasternak31 commented Jun 13, 2026

Uh oh!

AndriiPasternak31 commented Jun 13, 2026

Uh oh!

vybe left a comment

Uh oh!

vybe commented Jun 16, 2026

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

AndriiPasternak31 commented Jun 13, 2026

Summary

Why hot-reload is safe (verified)

What changed

Back-compat & rollout

Test plan

Known limitations

Uh oh!

AndriiPasternak31 commented Jun 13, 2026

✅ Full local verification — backend + agent side

Backend (verify-local Docker stages, isolated sibling stack)

Agent side (rebuilt trinity-agent-base, real standalone container) — 12/12 ✅

Uh oh!

vybe left a comment

Choose a reason for hiding this comment

Uh oh!

vybe commented Jun 16, 2026

Uh oh!

github-actions Bot commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Agent side (rebuilt `trinity-agent-base`, real standalone container) — 12/12 ✅