
fix(runner): prevent ghost sessions from orphaned spawn webhooks#440

Open
BigKunLun wants to merge 1 commit into tiann:main from BigKunLun:fix/runner-webhook-timeout-orphan-sessions

Conversation

@BigKunLun

Summary

spawnSession() in cli/src/runner/run.ts currently registers a tracked session entry before the child's "Session started" webhook arrives. If the 15-second webhook timeout fires, only pidToAwaiter / pidToErrorAwaiter are cleaned up; the pidToTrackedSession entry is left in place and the detached child keeps running. When the child eventually starts and reports its webhook, onHappySessionWebhook() finds the stale tracking entry and silently promotes the orphan into a normal runner-managed session: a message-less "ghost session" appears in the web UI, and the user has no idea where it came from.
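The cleanup gap can be modelled in isolation. In this minimal TypeScript sketch the map names come from run.ts, but the bodies are simplified stand-ins, not the real implementation:

```typescript
// Minimal model of the current timeout cleanup. Only the awaiter map is
// cleared; the tracking entry survives, so a late webhook for the same PID
// is still "recognised" by the webhook handler and promoted into a session.
const pidToAwaiter = new Map<number, () => void>();
const pidToTrackedSession = new Map<number, { sessionId: string }>();

function onWebhookTimeout(pid: number): void {
    pidToAwaiter.delete(pid);            // cleaned up today
    // pidToTrackedSession.delete(pid);  // missing today -- the bug
}

pidToAwaiter.set(63881, () => {});
pidToTrackedSession.set(63881, { sessionId: 'pending' });
onWebhookTimeout(63881);
// pidToTrackedSession still holds PID 63881: the seed of a ghost session.
```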

Reproduction

Seen repeatedly on opus[1m] --resume on a macOS machine running the current Homebrew build (0.16.6). Concretely:

  1. Tap Resume on a session whose claude-code child takes >15s to boot (observed real start delays up to ~60 minutes under rate-limit + large session restore).
  2. The hub's POST /api/sessions/:id/resume returns a 500 after ~15s.
  3. The user, seeing 500s, taps Resume a few more times; each click spawns a fresh detached child. All of them eventually finish booting and report their webhooks together, creating one ghost session per attempt.

In one user's dashboard this produced four ghost Obsidian1210 sessions from a single troubled session, all rooted at the same --resume <sessionId>.

Relevant log excerpt:

18:19:35  Spawning session (...)  PID 63881  → webhook timeout 15s later
18:20:02  Spawning session (...)  PID 63921  → webhook timeout
18:20:14  Spawning session (...)  PID 63997  → webhook timeout
19:13:45  Spawning session (...)  PID 70231  → webhook timeout
19:20:16  PID 63921 webhook arrives → promoted to ghost session ab2799dd
19:20:16  PID 70231 webhook arrives → promoted to ghost session 778d408c
19:20:17  PID 63997 webhook arrives → promoted to ghost session 69cdf975
19:20:17  PID 63881 webhook arrives → promoted to ghost session 5996f036

Note the ~60-minute gap between "Waiting for session webhook" and the children actually reporting in, far beyond the current hard-coded 15s ceiling.

Fix

Three small, composable changes in cli/src/runner/run.ts:

  1. Make the webhook timeout configurable via HAPI_RUNNER_WEBHOOK_TIMEOUT_MS (default unchanged at 15_000). Users on slow models / large resumes can raise the ceiling so the timeout never fires in the first place. Zero behaviour change for existing users.

  2. On timeout, clean pidToTrackedSession and SIGTERM the child. Without this, the orphan's late webhook can still be silently promoted into a live session by onHappySessionWebhook().

  3. Defence in depth in onHappySessionWebhook(): if a webhook arrives for a PID that has no tracked session but its payload claims startedBy: 'runner', treat it as an orphan — log it, SIGTERM the child, and return without registering the session. Genuine terminal-launched children correctly report startedBy: 'terminal' (see commands/claude.ts, codex/loop.ts, opencode/runOpencode.ts), so this branch cannot false-positive on them.
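Change (1) boils down to a small resolver. A sketch, where the env-var name and the 15_000 default come from this PR but the helper name and fallback handling are illustrative assumptions:

```typescript
// Resolve the webhook timeout, preferring HAPI_RUNNER_WEBHOOK_TIMEOUT_MS and
// falling back to the existing 15_000 ms default on absent or junk values.
const DEFAULT_WEBHOOK_TIMEOUT_MS = 15_000;

function resolveWebhookTimeoutMs(env: Record<string, string | undefined>): number {
    const raw = env.HAPI_RUNNER_WEBHOOK_TIMEOUT_MS;
    if (raw === undefined || raw === '') return DEFAULT_WEBHOOK_TIMEOUT_MS;
    const parsed = Number(raw);
    // Reject NaN, zero and negatives so a typo can never disable the timeout.
    return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_WEBHOOK_TIMEOUT_MS;
}
```

A user on a slow resume would then export HAPI_RUNNER_WEBHOOK_TIMEOUT_MS=3600000 and the timeout simply never fires.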

The original hard-coded timeout even carried the comment // I have seen timeouts on 10 seconds even though session was still created successfully in ~2 more seconds; the race it describes is the same race addressed by fixes (2) and (3).

Out of scope / follow-ups

  • No new tests added in this PR. runner.integration.test.ts covers most of this flow, but exercising the 60-min late-webhook window requires wall-clock mocking. Happy to add a regression test if reviewers prefer; just say the word and I'll wire one in with vi.useFakeTimers().
  • Docs: I did not touch cli/src/runner/README.md or docs/; let me know if you'd like HAPI_RUNNER_WEBHOOK_TIMEOUT_MS documented there.

Verification

Manually reviewed against:

  • spawnSession() (spawn → awaiter Promise)
  • onHappySessionWebhook() (webhook dispatch)
  • stopSession() / onChildExited() (cleanup paths — no new leaks)
  • controlServer.ts (webhook payload shape — startedBy is always present)

Building locally is not currently possible on the submitter's machine (bun is not installed), but each change is small enough to review by eye, and CI will catch any typecheck slip. I can provide a local smoke-test log from a Homebrew-replaced binary after merge if useful.

Why this matters

Until the fix ships, every slow --resume on a slow-starting model produces a compounding pile of ghost sessions that the user can only clean up manually via DELETE /api/sessions/:id (and only after first POST .../archive to clear the active flag). The underlying bug has been latent since detached: true was introduced; it only becomes visible when the model's boot time crosses the 15s ceiling, which opus[1m] does routinely.

Thanks for maintaining HAPI 🙏

spawnSession() registers a tracked session entry before the child's
"Session started" webhook arrives. When the 15s webhook timeout fires,
only pidToAwaiter / pidToErrorAwaiter are cleared; the
pidToTrackedSession entry is left in place and the detached child
keeps running. If the child eventually starts and reports its webhook,
onHappySessionWebhook() still finds the stale tracking entry and
promotes the orphan into a normal runner-managed session, surfacing
a message-less "ghost session" in the web UI.

This is easy to reproduce on opus[1m] --resume: observed real-world
case where four rapid "resume" clicks produced four orphan children
that all reported their webhooks ~60 minutes later, creating four
ghost sessions on the dashboard.

Three-part fix:

1. Make the webhook timeout configurable via
   HAPI_RUNNER_WEBHOOK_TIMEOUT_MS (default unchanged at 15_000). Users
   on slow models / large resumes can raise the ceiling so the
   timeout never fires in the first place.

2. On timeout, also delete the pidToTrackedSession entry and SIGTERM
   the child, so a late webhook cannot promote the orphan.

3. Defence in depth: if onHappySessionWebhook() receives a webhook
   from a PID that is not tracked but whose payload claims
   startedBy: 'runner', ignore it and SIGTERM the child. Genuine
   terminal-launched children correctly report
   startedBy: 'terminal', so this branch cannot false-positive on
   them.
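The defence-in-depth check in (3) reduces to a small predicate. A hedged sketch — the payload shape and function name are assumptions for illustration, not the actual run.ts code:

```typescript
// Decide whether a "Session started" webhook may register a session.
// An untracked PID claiming startedBy: 'runner' can only be an orphan
// from a timed-out spawn, so it is rejected (and the child terminated).
interface WebhookPayload {
    pid: number;
    startedBy: 'runner' | 'terminal';
}

function shouldRegisterSession(
    payload: WebhookPayload,
    trackedPids: ReadonlySet<number>,
): boolean {
    if (trackedPids.has(payload.pid)) return true;     // normal runner-managed child
    if (payload.startedBy === 'terminal') return true; // genuine terminal launch
    return false;                                      // orphaned runner spawn
}
```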

@github-actions github-actions bot left a comment


Findings

  • [Major] Timeout/orphan cleanup only signals the wrapper process, not the session process tree — both new branches use a direct SIGTERM (process.kill(pid, 'SIGTERM') / happyProcess?.kill('SIGTERM')), but runner-owned sessions already use killProcessByChildProcess() because the actual agent subprocesses can outlive the top-level HAPI child. That means a timed-out/orphaned session can keep running after the runner has already reported failure and dropped tracking, which still leaves stray agent processes mutating the worktree. Evidence cli/src/runner/run.ts:208, cli/src/runner/run.ts:567; related existing handling cli/src/runner/run.ts:644, cli/src/utils/process.ts:77.
    Suggested fix:
    const orphan = pidToTrackedSession.get(pid);
    if (orphan?.childProcess) {
        void killProcessByChildProcess(orphan.childProcess);
    }
  • [Major] Timed-out worktree spawns can now leak their temporary worktree — the new timeout branch resolves immediately after sending SIGTERM, but maybeCleanupWorktree('spawn-error') only removes the worktree if the child is already gone. In practice the child is usually still alive in that same tick, so cleanup is skipped and onChildExited() never retries it. A slow worktree session that times out will now leave the created worktree behind. Evidence cli/src/runner/run.ts:567, cli/src/runner/run.ts:575; related cleanup gate cli/src/runner/run.ts:329 and exit path cli/src/runner/run.ts:670.
    Suggested fix:
    let cleanupWorktreeAfterTimeout = false;
    
    const timeout = setTimeout(() => {
        cleanupWorktreeAfterTimeout = true;
        void killProcessByChildProcess(happyProcess!);
        resolve({ type: 'error', errorMessage: buildWebhookFailureMessage('timeout') });
    }, webhookTimeoutMs);
    
    happyProcess.on('exit', () => {
        if (cleanupWorktreeAfterTimeout) {
            void cleanupWorktree();
        }
    });

Summary
Review mode: initial
Two correctness regressions in the new timeout/orphan cleanup path: the new termination logic bypasses the existing tree-kill helper, and the timeout path can skip worktree cleanup permanently after returning an error.

Testing
Not run (automation/security policy). Suggested coverage: a fake-timer runner test that times out a worktree spawn and asserts both the child process tree and the created worktree are cleaned up.
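A dependency-free sketch of the core of that coverage; a hand-rolled clock stands in for vi.useFakeTimers(), the map name mirrors run.ts, and the handler body is illustrative:

```typescript
// Tiny manual clock so the 15s timeout can be exercised without real waiting.
class FakeClock {
    private now = 0;
    private timers: { at: number; fn: () => void }[] = [];
    setTimeout(fn: () => void, ms: number): void {
        this.timers.push({ at: this.now + ms, fn });
    }
    advance(ms: number): void {
        this.now += ms;
        const due = this.timers.filter(t => t.at <= this.now);
        this.timers = this.timers.filter(t => t.at > this.now);
        due.forEach(t => t.fn());
    }
}

const clock = new FakeClock();
const pidToTrackedSession = new Map<number, { sessionId: string }>();
pidToTrackedSession.set(63881, { sessionId: 'pending' });

// The fixed timeout handler must drop tracking so a late webhook is rejected.
clock.setTimeout(() => pidToTrackedSession.delete(63881), 15_000);
clock.advance(15_000);
```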

HAPI Bot

`[RUNNER RUN] Ignoring late webhook from orphaned runner-spawned PID ${pid} (session ${sessionId}). Terminating child.`
);
try {
process.kill(pid, 'SIGTERM');


[MAJOR] This only signals the top-level HAPI child. Runner-owned sessions already use killProcessByChildProcess() elsewhere because Claude/Codex/Cursor descendants can outlive the wrapper, so this branch can still leave the real agent subtree running after tracking has been dropped. Suggested fix:

const orphan = pidToTrackedSession.get(pid);
if (orphan?.childProcess) {
    void killProcessByChildProcess(orphan.childProcess);
}

// to clean up; even if the signal is ignored we already removed
// their tracking entry above, so they cannot be promoted.
try {
happyProcess?.kill('SIGTERM');


[MAJOR] Same tree-kill issue here, and this now races worktree cleanup: maybeCleanupWorktree('spawn-error') only removes the worktree if the PID is already dead, but this branch resolves immediately after sending SIGTERM, so timed-out worktree sessions will usually skip cleanup and never retry it. Suggested fix: kill the full process tree here and trigger cleanupWorktree() after the child has actually exited.
