Skip to content

feat: rotate subscription credentials via hot-reload, not container recreate (#1089)#1196

Open
AndriiPasternak31 wants to merge 7 commits into
devfrom
AndriiPasternak31/plan-issue-1089
Open

feat: rotate subscription credentials via hot-reload, not container recreate (#1089)#1196
AndriiPasternak31 wants to merge 7 commits into
devfrom
AndriiPasternak31/plan-issue-1089

Conversation

@AndriiPasternak31

Copy link
Copy Markdown
Contributor

Summary

Rotating an agent's subscription token used to recreate the container — so "rotate a credential" and "kill every in-flight turn" were the same operation. One 429 on a shared subscription triggered SUB-003 auto-switch → _restart_agentcontainer_stop + recreate, destroying every parallel execution on that agent (#1037 collateral kills). This PR makes token rotation hot-reload the running container instead; recreate is reserved for image/template/auth-mode changes. It removes the credential↔execution collision class structurally rather than recovering from it after the fact (TARGET_ARCHITECTURE.md §Agent Runtime).

Stacked on #799 (AndriiPasternak31/issue-799, PR #1166) — base branch is that PR, not dev. #799 adds the per-agent agent_switch_lock and runs _perform_auto_switch inside it; this PR extends the lock to the manual-reassign + key-rollover paths. Rebase onto dev once #1166 merges.

Why hot-reload is safe (verified)

  • The agent server spawns Claude via subprocess.Popen(..., env={**os.environ, ...}) and authenticates purely from the CLAUDE_CODE_OAUTH_TOKEN env var (no .credentials.json write), and it is a single uvicorn worker.
  • ⟹ Mutating the agent-server process os.environ["CLAUDE_CODE_OAUTH_TOKEN"] makes the next Claude subprocess use the new token, while in-flight subprocesses keep their already-inherited old token and finish. The DB stays source of truth; check_api_key_env_matches reconciles Config.Env on the next recreate.

What changed

Agent server (docker/base-image/agent_server/)

  • New POST /api/credentials/reload-token ({token, remove_api_key}) — mutates os.environ, persists the token to a writable-layer override, and refreshes the log-redaction set. Does not rewrite .env/.mcp.json or re-inject Trinity MCP (those whole-file flows are destructive; the sub token is not a .env credential).
  • TokenReloadRequest/TokenReloadResponse models.

Durable override across a plain restart (F2)docker/base-image/Dockerfile + startup.sh

  • The reload also writes /var/lib/trinity/oauth-token (0600), deliberately NOT under /home/developer (the persisted agent-{name}-workspace volume that recreate preserves).
  • startup.sh exports CLAUDE_CODE_OAUTH_TOKEN from that file (when present, non-empty) before launching the agent server. A fleet restart (routers/ops.py) does a raw container_stop+container_start that bypasses start_agent_internal, so without this it would silently revert to the old baked Config.Env token.
  • Self-reconciling by Docker semantics: the writable layer survives stopstart (same container) but is wiped on recreate (new container, fresh layer). After a DB-driven recreate the override is gone and the freshly-baked Config.Env (DB token) wins — no marker logic.

Backend (src/backend/)

  • _hot_reload_subscription_token(agent_name) — POSTs the agent's current DB token to the endpoint; falls back to _restart_agent on a 404 (old base image), transport failure, or no resolvable token; short-circuits no_container/not_running.
  • Three producer paths converted, all under the bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 agent_switch_lock:
    1. Auto-switch (SUB-003) — _perform_auto_switch hot-reloads instead of recreating.
    2. Manual reassignment (PUT /api/subscriptions/agents/{name}) — sub→sub swap hot-reloads under the lock; an auth-mode change (none/api-key → subscription) still recreates so ANTHROPIC_API_KEY is dropped and the token is baked into Config.Env.
    3. Key rollover (POST /api/subscriptions upsert) — reload_subscription_for_all_agents fans a best-effort hot-reload across every running agent on the re-registered subscription (one agent's failure never fails the upsert nor blocks the others).

Back-compat & rollout

  • Running containers on an older base image return 404 for the endpoint → the helper falls back to _restart_agent, identical to today's behavior (no regression).
  • recreate_container_with_updated_config reuses the old image, so an agent only gains the endpoint once recreated onto a rebuilt base image — hot-reload activates per-agent, not fleet-wide on deploy. Base-image rebuild required (./scripts/deploy/build-base-image.sh).

Test plan

  • 19 new unit tests, TDD (red→green): hot-reload helper (happy path POSTs reload-token with the DB token; 404/transport/no-token → restart fallback; no_container/not_running short-circuit), auto-switch wire-in (restart_result == "hot_reloaded", no recreate), manual sub→sub (no container_stop; mode-change still recreates; under the lock), key-rollover fan-out (continues past one agent's failure; upsert never fails), and the agent-server endpoint (sets env, writes the 0600 override, no .env write, remove_api_key pops ANTHROPIC_API_KEY, empty token → 400).
  • ruff + focused mypy clean on changed files; tests/lint_sys_modules.py clean (no new violations); full tests/unit/ sweep shows zero regressions (the 13 pre-existing failures + 1 collection error reproduce identically on the clean bug: SUB-003 auto-switch has no per-agent lock — concurrent 429s race the restart #799 base).
  • Integration / verify-local (recommended before merge): rebuild the base image, subscription-backed agent with two parallel long tasks, trigger handle_subscription_failure → assert (a) the non-failing parallel turn survives, (b) container ID unchanged, (c) a new task uses the new token, (d) a fleet restart keeps the rotated token, (e) an old-image agent hits the 404 fallback.

Known limitations

Closes #1089

🤖 Generated with Claude Code

AndriiPasternak31 and others added 2 commits June 11, 2026 16:21
…ailures (#799)

Concurrent subscription failures on the same agent (two chat requests, or a
chat overlapping a scheduled task) both ran handle_subscription_failure's
read→decide→assign→restart sequence with no mutual exclusion: both could pick
the same alternative, both assign_subscription_to_agent, and both fire
_restart_agent — wedging the container, duplicating the switch notification, or
tripping the #421 was_already_running ambiguity.

- Add a per-agent asyncio.Lock (lazy + event-loop-safe, mirroring
  agent_call_limiter.py rather than defaultdict(asyncio.Lock)) wrapping the
  whole switch critical section.
- Add a stale-failure guard: snapshot the subscription at handler entry and,
  under the lock, bail when it changed — so 3+ viable subscriptions don't
  cascade A→B→C (a loser whose failure was about the now-previous sub).
- Tests: concurrent 2-sub (exactly one switch), 3-sub no-cascade regression,
  distinct-agent locks. Pins the race the #606 carve-out test deferred.

Process-local lock is sufficient today (single backend process; the scheduler
delegates execution to the backend via /api/internal/execute-task rather than
calling this module in its own process). #1037 (drain-aware recreate) is split
into its own PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ecreate

Rotating an agent's subscription token recreated the container, making
"rotate a credential" and "kill every in-flight turn" the same operation
(#1037 — one 429 on a shared subscription auto-switched and destroyed every
parallel execution). Token rotation now hot-reloads the running container;
recreate is reserved for image/template/auth-mode changes (TARGET_ARCHITECTURE
§Agent Runtime).

- Agent server: POST /api/credentials/reload-token mutates the process
  CLAUDE_CODE_OAUTH_TOKEN env so the NEXT claude subprocess uses the rotated
  token while in-flight subprocesses finish on the old one (single uvicorn
  worker; Claude authenticates purely from the env var). Does not rewrite
  .env/.mcp.json.
- Durable override: token persisted to /var/lib/trinity/oauth-token (0600,
  writable layer, NOT the /home/developer volume); startup.sh exports it before
  launching the agent server so a plain fleet restart (raw stop+start, bypasses
  start_agent_internal) keeps the rotated token. Self-reconciling — wiped on
  recreate, so a DB-driven recreate re-bakes Config.Env from the DB token.
- Three producer paths converted under the #799 agent_switch_lock: auto-switch
  (SUB-003), manual sub->sub reassignment (auth-mode changes still recreate),
  and key-rollover fan-out on the POST /api/subscriptions upsert (best-effort).
- Back-compat: old base images return 404 -> fall back to _restart_agent
  (today's behavior, no regression; per-agent adoption on next recreate).

Builds on #799 (per-agent agent_switch_lock). 19 new unit tests.

Closes #1089

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AndriiPasternak31

Copy link
Copy Markdown
Contributor Author

✅ Full local verification — backend + agent side

Verified end-to-end against real production images on an isolated local stack: the backend prod image, and a freshly-rebuilt trinity-agent-base exercised through the complete token-rotation lifecycle on a real agent container.

Backend (verify-local Docker stages, isolated sibling stack)

  • build + import-smoke ✅ — backend prod image builds; python -c 'import main' runs inside the built image (the packaging/missing-COPY check)
  • boot + health ✅ — stack boots, migrations apply, /health green with these changes
  • integration ✅ — 6 passed, 69 skipped, 2 deselected against the live stack
  • unit ✅ — the new #1089 suites pass directly: 39 passed (reload-token endpoint, auto-switch concurrency, ping-pong, manual sub→sub reassign hot-reload)

Note: the verify-local unit stage itself aborts on a pre-existing, unrelated collection error in tests/unit/test_1073_voip_media_stream_ticket.py — the repo-root config/ namespace package shadows src/backend/config.py (from config import VOIP_MAX_CALL_DURATION → "unknown location"). It reproduces identically on dev and on this file run alone; none of config.py / twilio_media_stream.py / test_1073 are touched by this branch. Worked around with --skip-unit after running the new suites directly.

Agent side (rebuilt trinity-agent-base, real standalone container) — 12/12 ✅

Area Result
/var/lib/trinity = developer:developer 0755, container UID 1000, writable (Dockerfile)
Fresh container: no override file, baked Config.Env token wins
POST /api/credentials/reload-token{"status":"success","reloaded":true}
Override file persisted 0600, correct content
Endpoint does not rewrite .env / .mcp.json
F2: after stop+start (same container), agent-server env = rotated token (beats baked Config.Env); startup.sh logs the override application
Recreate: brand-new container (fresh writable layer) → override gone, reverts to baked token (self-reconciles, no marker logic)

The agent-server process env was read from /proc/<pid>/environ to confirm the startup.sh durable-override path wins on a plain restart and that a recreate cleanly drops a stale override. Test used obviously-fake placeholder tokens.

Conclusion: both the backend hot-reload path and the agent-side endpoint + durable-override + Dockerfile changes are validated on real images.

AndriiPasternak31 and others added 3 commits June 13, 2026 15:21
)

write_text()+chmod() created the override file under the process umask
(typically 0644) and left the token world-readable in the window between
create and the follow-up chmod. Use os.open(..., O_CREAT|O_TRUNC, 0o600)
so the file is created with 0600 in one syscall, and fchmod the fd as
well so a pre-existing override (older write path or tampering) is forced
back to 0600 — O_CREAT only applies its mode arg on creation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The agent's current subscription was read BEFORE entering agent_switch_lock,
so a concurrent auto-switch could change the assignment between that read
and the in-lock DB assign (TOCTOU). The stale old_sub_id could then
mis-classify a sub->sub swap as an auth-mode change (or vice-versa) and
route it into a needless container recreate instead of a hot-reload.
Move the get_agent_subscription_id() snapshot inside the lock, immediately
before the assign.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

Add a regression test asserting a non-admin caller gets 403 from
register_subscription before any subscription is created or the
key-rollover fan-out runs — mirroring the owner gate already covered on
the reassign path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@vybe vybe left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code validated & approved (/validate-pr) — merge gated on the conditions below.

P2 reliability feature (Closes #1089). 14 files, +1023/-45. Stacked on #1166 (#799).

Code review — clean, no defects:

  • Agent-server POST /api/credentials/reload-token: token never logged; atomic 0600 create via os.open(..., O_WRONLY|O_CREAT|O_TRUNC, 0o600) + os.fchmod (correctly closes the umask race and forces perms on a pre-existing file); 400 on empty token; refresh_credential_values() adds the new token to the redaction set; non-destructive (no .env/.mcp.json rewrite). Follows the established agent-server endpoint pattern (Invariant #5 — backend mirror present in subscriptions.py).
  • Backend _hot_reload_subscription_token: logs only agent name + status codes; 404 (old base image) / transport failure / no-token → _restart_agent fallback = identical to today's behavior, no regression. Per-agent activation, not fleet-wide on deploy.
  • Durable override on the writable layer (/var/lib/trinity/oauth-token, NOT under /home/developer): survives stop+start, wiped on recreate → self-reconciling by Docker semantics, no marker logic. startup.sh exports it before launch and does not echo the value; Dockerfile mkdir+chown developer before the USER switch (Invariant #17 ✓).
  • All 3 producer paths (auto-switch, manual reassign, key rollover) under the #799 agent_switch_lock; mode-change still recreates (drops ANTHROPIC_API_KEY).
  • Docs complete (architecture + requirements + feature-flows index + subscription-auto-switch.md + TARGET_ARCHITECTURE); 19 TDD unit tests; security scan clean (no token-value logging anywhere).

⚠️ Merge conditions (this approval is the code-review gate, not a merge-now):

  1. #1166 must merge first (this PR is stacked on AndriiPasternak31/issue-799).
  2. Retarget base → dev after that.
  3. Full CI matrix must go green — the backend-unit-test matrix (pytest seeds / lint / Analyze / regression-diff) did not run here because the base is a feature branch, not dev; current green is only prod-image-smoke + verify-non-root (e2e skipped). Test evidence is local-only until retarget.
  4. Run the verify-local integration the test plan leaves unchecked (rebuild base image → assert a parallel turn survives a rotation, container ID unchanged, fleet restart keeps the rotated token, old-image 404 fallback) — this is a credential-rotation + container-lifecycle change.

Reminder: base-image rebuild (./scripts/deploy/build-base-image.sh) required for the endpoint to exist on agents. Cross-worker race correctly flagged for #799/#1166 (Redis SETNX escalation).

@vybe vybe changed the base branch from AndriiPasternak31/issue-799 to dev June 16, 2026 15:01
@vybe

vybe commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@AndriiPasternak31 heads-up: #1166 (issue-799) was just squash-merged to dev, and I retargeted this PR's base from issue-799dev. Because the squash detached the shared history, this PR now re-carries #799's diff (+1360/-80) and shows as CONFLICTING.

Could you rebase onto the updated dev and drop the now-redundant #799 commits? It should then reduce back to just the #1089 hot-reload diff and be mergeable. It's already approved — just needs the rebase.

@github-actions

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

…-issue-1089

# Conflicts:
#	docs/memory/feature-flows.md
Comment on lines +282 to +289

logger.info(
f"Assigned subscription '{subscription_name}' to agent '{agent_name}' "
f"by {current_user.username}"
)

restart_result = None
injection_result = None
…o fix seed-flaky reassign tests (#1089)

The manual-reassign / key-rollover fixtures stubbed
services.subscription_auto_switch via `import ... as auto_switch`, which binds the
`services` package attribute. The endpoint resolves the helpers through a
function-local `from services.subscription_auto_switch import (...)`, which binds
from sys.modules. The conftest #762 autouse fixture restores
sys.modules["services"] before/after every test, and under some pytest-randomly
orderings (seed 99999) the package attribute and the sys.modules submodule entry
drift to two different module objects — so the fixture patched one object while
the endpoint called the other (the real, unpatched helper), and 4 tests flaked.

Resolve the module via sys.modules in the fixtures (new _live_auto_switch helper)
so the patch always lands on the object the endpoint calls. Verified: full
tests/unit suite green under pytest-randomly seeds 99999/12345/67890 (the 4
failures under 99999 are gone, no regression on the others).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants