fix(supervisor,hub): kill legacy agent spawn + cap orphan replay + cap restarts#86

Merged: finedesignz merged 1 commit into main from fix/orphan-replay-and-legacy-agent on May 27, 2026

@finedesignz
Owner

Summary

Structural fixes from the 2026-05-27 autonomous-loop RCA.

  • Supervisor process-manager.ts:spawn() no longer invokes the retired CLI agent. Every session.start finalizes immediately as stopped with exit_reason='legacy_agent_spawn_disabled'. The cached buggy v0.4.1 of the retired package was the source of the loop.
  • Supervisor scheduleRestart() caps at MAX_RESTART_COUNT=10. After 10 consecutive restarts the run finalizes as max_restarts_exceeded instead of looping forever at the 30s backoff plateau.
  • Hub auto-resume (ws/agent.ts on supervisor.hello):
    • Filters orphan session_runs to started_at > now() - 24h.
    • Sweeps older open rows as exit_reason='stale' in one UPDATE.
    • Skips replay for restart_count >= 10, finalizing as max_restarts_exceeded.
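The hub-side triage above can be sketched as a pure decision function. This is a minimal illustration, not the code in ws/agent.ts: the `OrphanRun` shape, the `triageOrphan` name, and the choice to check the age cap before the restart cap are all assumptions. The real change applies the stale sweep as a single SQL UPDATE rather than row by row.

```typescript
// Hypothetical shape of an open session_runs row; only the two columns the
// summary mentions are modelled.
interface OrphanRun {
  id: string;
  startedAt: Date;
  restartCount: number;
}

type OrphanDecision = "replay" | "stale" | "max_restarts_exceeded";

const MAX_ORPHAN_AGE_MS = 24 * 60 * 60 * 1000; // 24h age cap from the summary
const MAX_RESTART_COUNT = 10; // restart cap from the summary

// Decide what auto-resume should do with one orphaned run.
// Assumption: the age check runs first, so an old row is always 'stale'.
function triageOrphan(run: OrphanRun, now: Date): OrphanDecision {
  const cutoff = now.getTime() - MAX_ORPHAN_AGE_MS;
  // Only rows with started_at > now() - 24h are replay candidates.
  if (run.startedAt.getTime() <= cutoff) return "stale";
  if (run.restartCount >= MAX_RESTART_COUNT) return "max_restarts_exceeded";
  return "replay";
}
```

Keeping the decision pure makes both caps testable without a database, whereas the PR's own test is DB-gated.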

Tests

  • supervisor/test/no-legacy-agent-spawn.test.ts — canary that greps supervisor/src/** for the retired package name + --append-system-prompt. Fails the build if either reappears.
  • supervisor/test/process-manager.test.ts — happy-path tests rewritten to assert the new legacy_agent_spawn_disabled stopped state; concurrency/audit tests monkey-patch spawn() to hold the slot so cap gates stay exercisable.
  • hub/test/auto-resume-caps.test.ts — DB-gated unit test covering the 24h age cap and restart_count >= 10 cap (skipped without REMO_E2E_DB_URL).
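The canary idea can be sketched as a scan over source text for forbidden substrings. This is an illustration only: `findForbidden` and `Hit` are hypothetical names, and the sketch takes file contents as an in-memory map, whereas the actual test reads supervisor/src/** from disk. The retired package name does not appear in this PR text, so the usage note uses a placeholder.

```typescript
// One forbidden-token match: which file, which 1-based line, which token.
interface Hit {
  file: string;
  line: number;
  needle: string;
}

// Scan every source file line by line and collect each forbidden token found.
// `sources` maps a file path to its full text (assumed to be loaded by the caller).
function findForbidden(sources: Record<string, string>, needles: string[]): Hit[] {
  const hits: Hit[] = [];
  for (const [file, text] of Object.entries(sources)) {
    text.split("\n").forEach((content, index) => {
      for (const needle of needles) {
        if (content.includes(needle)) {
          hits.push({ file, line: index + 1, needle });
        }
      }
    });
  }
  return hits;
}
```

A test would call something like `findForbidden(sources, ["<retired-package-name>", "--append-system-prompt"])` and fail when the result is non-empty.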

Version bump

Supervisor 0.5.1 → 0.5.2 across Cargo.toml, tauri.conf.json, ui/package.json, and the inline VERSION constants in supervisor/src/{index,hub-client}.ts.

Follow-up (not in this PR)

The in-process claude-runner, which will bridge claude --input-format stream-json --output-format stream-json over the supervisor WS, is deferred to a follow-up PR. Until that lands, session.start rolls cleanly to stopped instead of trapping the supervisor in a respawn loop.

Test plan

  • cd supervisor && bun test → 48 pass, 0 fail
  • cd hub && bun test → 367 pass (same 8 pre-existing test-ordering flakes as main, unrelated to this PR)
  • Hub agent.ts module loads cleanly
  • All 4 version sources updated in lockstep

🤖 Generated with Claude Code

…p restarts

Structural fixes from the 2026-05-27 autonomous-loop RCA.

Supervisor (process-manager.ts):
- spawn() no longer invokes the retired CLI agent via npx -y. Every
  session.start now finalizes immediately as stopped with
  exit_reason='legacy_agent_spawn_disabled'. The cached buggy v0.4.1 of
  that retired package was the source of the "Always delegate" loop.
- scheduleRestart() now hard-caps at MAX_RESTART_COUNT=10. After 10
  consecutive attempts the run finalizes as max_restarts_exceeded and
  the supervisor stops respawning, instead of looping forever at the
  30s backoff plateau.
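A minimal sketch of the capped restart decision, for illustration. The commit message gives only `MAX_RESTART_COUNT=10` and the 30s plateau. The 1s base delay, the doubling schedule, and the names `decideRestart`, `BACKOFF_BASE_MS`, and `BACKOFF_CEILING_MS` are assumptions, not the contents of process-manager.ts.

```typescript
const MAX_RESTART_COUNT = 10; // cap named in the commit message
const BACKOFF_CEILING_MS = 30_000; // the 30s plateau named in the commit message
const BACKOFF_BASE_MS = 1_000; // assumption: the base delay is not stated in the source

type RestartDecision =
  | { action: "restart"; delayMs: number }
  | { action: "finalize"; exitReason: "max_restarts_exceeded" };

// `restartCount` is the number of restarts already attempted for this run.
function decideRestart(restartCount: number): RestartDecision {
  if (restartCount >= MAX_RESTART_COUNT) {
    // Cap reached: finalize the run instead of scheduling another attempt.
    return { action: "finalize", exitReason: "max_restarts_exceeded" };
  }
  // Assumed schedule: exponential backoff that flattens at the ceiling.
  const delayMs = Math.min(BACKOFF_BASE_MS * 2 ** restartCount, BACKOFF_CEILING_MS);
  return { action: "restart", delayMs };
}
```

Without the first branch, every call past the plateau would return a 30s restart indefinitely, which is the loop this change removes.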

Hub auto-resume (ws/agent.ts on supervisor.hello):
- Filters orphan session_runs to rows where started_at > now() - 24h.
- Sweeps older open rows in one UPDATE, finalizing them as
  exit_reason='stale'.
- Skips replay for any orphan whose restart_count >= 10, finalizing as
  exit_reason='max_restarts_exceeded'.

Tests:
- supervisor/test/no-legacy-agent-spawn.test.ts — canary that greps
  supervisor/src/** for the retired package name and --append-system-prompt
  flag. Fails the build if either reappears.
- supervisor/test/process-manager.test.ts — happy-path test rewritten
  to assert the new legacy_agent_spawn_disabled stopped state; concurrency
  + audit tests now monkey-patch spawn() to hold the slot so the cap
  gates remain testable until the inline runner lands.
- hub/test/auto-resume-caps.test.ts — DB-gated unit test covering the
  24h age cap and restart_count >= 10 cap (skipped without REMO_E2E_DB_URL).

Version bump: supervisor 0.5.1 → 0.5.2 across Cargo.toml, tauri.conf.json,
ui/package.json, and the inline VERSION constants in supervisor/src/{index,
hub-client}.ts.

Follow-up not in this PR: the in-process claude-runner that bridges
stream-json over the supervisor WS. Until that lands, session.start
rolls to stopped instead of trapping the supervisor in a respawn loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@finedesignz finedesignz merged commit 7898dd7 into main May 27, 2026
1 check passed
finedesignz added a commit that referenced this pull request May 28, 2026
… bridge (#96)

Phase 09 (PR #86) gutted ProcessManager.spawn() and returned every
session.start with `legacy_agent_spawn_disabled`. Result: clicking Launch
on a repo in Settings → Supervisor / Sidebar did nothing — the supervisor
received `session.start` from the hub and immediately finalized the run
as stopped. Reproduced in supervisor.log at 2026-05-27 23:51:54.

Root cause: PR #86 killed the legacy external CLI agent spawn path
(autonomous-loop bug in cached v0.4.1) without an in-process replacement.

Fix: re-wire spawn() to construct a `SessionBridge` per run. Each bridge:
  - opens its own /ws/agent connection with project_dir = repoPath and
    role = 'agent' (mirroring what the retired external agent did);
  - spawns `claude --input-format stream-json --output-format stream-json
    --verbose` directly via a new `ClaudeRunner`;
  - parses Claude's stream-json events and translates them onto the hub's
    agent-protocol (thinking / text_delta / tool_use / tool_result /
    status / assistant_message / permission_request / user_question);
  - routes hub-originated user_message / permission_response /
    question_response / cancel / shutdown back into the runner.
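The parse-and-translate step can be sketched as below, under loudly stated assumptions. It assumes stream-json output arrives as newline-delimited JSON in arbitrary chunks. The input event shape (`kind`, `text`) is a simplified, hypothetical stand-in and is not Claude's actual schema. Only the hub message names `thinking` and `text_delta` come from the list above; `NdjsonDecoder` and `toHubMessage` are illustrative names, not the real `SessionBridge` or `ClaudeRunner` API.

```typescript
// Buffers stdout chunks and yields one parsed JSON value per complete line.
class NdjsonDecoder {
  private buffer = "";

  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""); keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        events.push(JSON.parse(trimmed));
      } catch {
        // Skip lines that are not valid JSON instead of crashing the bridge.
      }
    }
    return events;
  }
}

// Subset of the hub agent-protocol message types listed above.
type HubMessage = { type: "thinking" | "text_delta"; text: string };

// Translate one hypothetical runner event into a hub message, or null if unhandled.
function toHubMessage(event: unknown): HubMessage | null {
  if (typeof event !== "object" || event === null) return null;
  const { kind, text } = event as { kind?: unknown; text?: unknown };
  if (typeof text !== "string") return null;
  if (kind === "thinking") return { type: "thinking", text };
  if (kind === "text") return { type: "text_delta", text };
  return null;
}
```

Buffering the trailing partial line matters because a process's stdout chunks are not guaranteed to end on line boundaries.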

Preserved from PR #86:
  - sandbox-escape, not-git-repo, concurrency, dangerous-skip-permissions
    gates + audit-log entry per decision;
  - CIRCUIT_THRESHOLD (5 crashes / 10 min) and MAX_RESTART_COUNT (10)
    caps so a broken spawn can no longer loop forever.
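For illustration, the crash circuit can be sketched as a sliding-window counter. The threshold of 5 crashes within 10 minutes comes from the text above. The `CrashCircuit` class, the `CIRCUIT_WINDOW_MS` name, and the rule that the circuit trips on the fifth crash inside the window are assumptions about the behaviour, not the ProcessManager code.

```typescript
const CIRCUIT_THRESHOLD = 5; // crashes, per the commit message
const CIRCUIT_WINDOW_MS = 10 * 60 * 1000; // 10 min window; the constant name is assumed

class CrashCircuit {
  private crashes: number[] = [];

  // Record a crash at `nowMs`; returns true when the circuit should trip.
  recordCrash(nowMs: number): boolean {
    this.crashes.push(nowMs);
    // Drop crashes that have aged out of the sliding window.
    this.crashes = this.crashes.filter((t) => nowMs - t < CIRCUIT_WINDOW_MS);
    return this.crashes.length >= CIRCUIT_THRESHOLD;
  }
}
```

The circuit bounds the crash rate and `MAX_RESTART_COUNT` bounds the total number of attempts, so a run that crashes slowly but persistently is still stopped by the second cap.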

Phase 09 canary `no-legacy-agent-spawn.test.ts` still passes — the new
runner never references the retired npm package name or forbidden flags.
ProcessManager unit tests rewritten to use `bridgeFactory` instead of the
old `spawnImpl` (Bun.spawn) hook.

Supervisor bumped 0.5.2 → 0.5.3 in all four source-of-truth locations
(supervisor/src/index.ts, supervisor/src/hub-client.ts,
supervisor/tauri/src-tauri/{Cargo.toml,tauri.conf.json},
supervisor/tauri/ui/package.json) so the next MSI advertises the fix.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>