
fix(runner): prevent ghost sessions from orphaned spawn webhooks#440

Open
BigKunLun wants to merge 1 commit into tiann:main from BigKunLun:fix/runner-webhook-timeout-orphan-sessions

Conversation

@BigKunLun

Summary

spawnSession() in cli/src/runner/run.ts currently registers a tracked session entry before the child's "Session started" webhook arrives. If the 15-second webhook timeout fires, only pidToAwaiter / pidToErrorAwaiter are cleaned up; the pidToTrackedSession entry is left in place and the detached child keeps running. When the child eventually starts and reports its webhook, onHappySessionWebhook() finds the stale tracking entry and silently promotes the orphan into a normal runner-managed session: a message-less "ghost session" appears in the web UI, and the user has no idea where it came from.
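The cleanup gap can be modelled in isolation. In this minimal TypeScript sketch the map names come from run.ts, but the bodies are simplified stand-ins, not the real implementation:

```typescript
// Minimal model of the current timeout cleanup. Only the awaiter map is
// cleared; the tracking entry survives, so a late webhook for the same PID
// is still "recognised" by the webhook handler and promoted into a session.
const pidToAwaiter = new Map<number, () => void>();
const pidToTrackedSession = new Map<number, { sessionId: string }>();

function onWebhookTimeout(pid: number): void {
    pidToAwaiter.delete(pid);            // cleaned up today
    // pidToTrackedSession.delete(pid);  // missing today -- the bug
}

pidToAwaiter.set(63881, () => {});
pidToTrackedSession.set(63881, { sessionId: 'pending' });
onWebhookTimeout(63881);
// pidToTrackedSession still holds PID 63881: the seed of a ghost session.
```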

Reproduction

Seen repeatedly on opus[1m] --resume on a macOS machine running the current Homebrew build (0.16.6). Concretely:

  1. Tap Resume on a session whose claude-code child takes >15s to boot (observed real start delays up to ~60 minutes under rate-limit + large session restore).
  2. The hub's POST /api/sessions/:id/resume returns a 500 after ~15s.
  3. The user, seeing 500s, taps Resume a few more times; each click spawns a fresh detached child. All of them eventually finish booting and report their webhooks together, creating one ghost session per attempt.

In one user's dashboard this produced four ghost Obsidian1210 sessions from a single troubled session, all rooted at the same --resume <sessionId>.

Relevant log excerpt:

18:19:35  Spawning session (...)  PID 63881  → webhook timeout 15s later
18:20:02  Spawning session (...)  PID 63921  → webhook timeout
18:20:14  Spawning session (...)  PID 63997  → webhook timeout
19:13:45  Spawning session (...)  PID 70231  → webhook timeout
19:20:16  PID 63921 webhook arrives → promoted to ghost session ab2799dd
19:20:16  PID 70231 webhook arrives → promoted to ghost session 778d408c
19:20:17  PID 63997 webhook arrives → promoted to ghost session 69cdf975
19:20:17  PID 63881 webhook arrives → promoted to ghost session 5996f036

Note the ~60-minute gap between "Waiting for session webhook" and the children actually reporting in, far beyond the current hard-coded 15s ceiling.

Fix

Three small, composable changes in cli/src/runner/run.ts:

  1. Make the webhook timeout configurable via HAPI_RUNNER_WEBHOOK_TIMEOUT_MS (default unchanged at 15_000). Users on slow models / large resumes can raise the ceiling so the timeout never fires in the first place. Zero behaviour change for existing users.

  2. On timeout, clean pidToTrackedSession and SIGTERM the child. Without this, the orphan's late webhook can still be silently promoted into a live session by onHappySessionWebhook().

  3. Defence in depth in onHappySessionWebhook(): if a webhook arrives for a PID that has no tracked session but its payload claims startedBy: 'runner', treat it as an orphan — log it, SIGTERM the child, and return without registering the session. Genuine terminal-launched children correctly report startedBy: 'terminal' (see commands/claude.ts, codex/loop.ts, opencode/runOpencode.ts), so this branch cannot false-positive on them.
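Change (1) boils down to a small resolver. A sketch, where the env-var name and the 15_000 default come from this PR but the helper name and fallback handling are illustrative assumptions:

```typescript
// Resolve the webhook timeout, preferring HAPI_RUNNER_WEBHOOK_TIMEOUT_MS and
// falling back to the existing 15_000 ms default on absent or junk values.
const DEFAULT_WEBHOOK_TIMEOUT_MS = 15_000;

function resolveWebhookTimeoutMs(env: Record<string, string | undefined>): number {
    const raw = env.HAPI_RUNNER_WEBHOOK_TIMEOUT_MS;
    if (raw === undefined || raw === '') return DEFAULT_WEBHOOK_TIMEOUT_MS;
    const parsed = Number(raw);
    // Reject NaN, zero and negatives so a typo can never disable the timeout.
    return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_WEBHOOK_TIMEOUT_MS;
}
```

A user on a slow resume would then export HAPI_RUNNER_WEBHOOK_TIMEOUT_MS=3600000 and the timeout simply never fires.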

The original hard-coded timeout even carried the comment // I have seen timeouts on 10 seconds even though session was still created successfully in ~2 more seconds; the race it describes is the same race addressed by fixes (2) and (3).

Out of scope / follow-ups

  • No new tests added in this PR. runner.integration.test.ts covers most of this flow, but exercising the 60-min late-webhook window requires wall-clock mocking. Happy to add a regression test if reviewers prefer; just say the word and I'll wire one in with vi.useFakeTimers().
  • Docs: I did not touch cli/src/runner/README.md or docs/; let me know if you'd like HAPI_RUNNER_WEBHOOK_TIMEOUT_MS documented there.

Verification

Manually reviewed against:

  • spawnSession() (spawn → awaiter Promise)
  • onHappySessionWebhook() (webhook dispatch)
  • stopSession() / onChildExited() (cleanup paths — no new leaks)
  • controlServer.ts (webhook payload shape — startedBy is always present)

Building locally is not currently possible on the submitter's machine (bun is not installed), but each change is small enough to review by eye, and CI will catch any typecheck slip. I can provide a local smoke-test log from a Homebrew-replaced binary after merge if useful.

Why this matters

Until the fix ships, every slow --resume on a slow-starting model produces a compounding pile of ghost sessions that the user can only clean up manually via DELETE /api/sessions/:id (and only after first POST .../archive to clear the active flag). The underlying bug has been latent since detached: true was introduced; it only becomes visible when the model's boot time crosses the 15s ceiling, which opus[1m] does routinely.

Thanks for maintaining HAPI 🙏

spawnSession() registers a tracked session entry before the child's
"Session started" webhook arrives. When the 15s webhook timeout fires,
only pidToAwaiter / pidToErrorAwaiter are cleared; the
pidToTrackedSession entry is left in place and the detached child
keeps running. If the child eventually starts and reports its webhook,
onHappySessionWebhook() still finds the stale tracking entry and
promotes the orphan into a normal runner-managed session, surfacing
a message-less "ghost session" in the web UI.

This is easy to reproduce on opus[1m] --resume: observed real-world
case where four rapid "resume" clicks produced four orphan children
that all reported their webhooks ~60 minutes later, creating four
ghost sessions on the dashboard.

Three-part fix:

1. Make the webhook timeout configurable via
   HAPI_RUNNER_WEBHOOK_TIMEOUT_MS (default unchanged at 15_000). Users
   on slow models / large resumes can raise the ceiling so the
   timeout never fires in the first place.

2. On timeout, also delete the pidToTrackedSession entry and SIGTERM
   the child, so a late webhook cannot promote the orphan.

3. Defence in depth: if onHappySessionWebhook() receives a webhook
   from a PID that is not tracked but whose payload claims
   startedBy: 'runner', ignore it and SIGTERM the child. Genuine
   terminal-launched children correctly report
   startedBy: 'terminal', so this branch cannot false-positive on
   them.
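The defence-in-depth check in (3) reduces to a small predicate. A hedged sketch — the payload shape and function name are assumptions for illustration, not the actual run.ts code:

```typescript
// Decide whether a "Session started" webhook may register a session.
// An untracked PID claiming startedBy: 'runner' can only be an orphan
// from a timed-out spawn, so it is rejected (and the child terminated).
interface WebhookPayload {
    pid: number;
    startedBy: 'runner' | 'terminal';
}

function shouldRegisterSession(
    payload: WebhookPayload,
    trackedPids: ReadonlySet<number>,
): boolean {
    if (trackedPids.has(payload.pid)) return true;     // normal runner-managed child
    if (payload.startedBy === 'terminal') return true; // genuine terminal launch
    return false;                                      // orphaned runner spawn
}
```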

@github-actions github-actions bot left a comment


Findings

  • [Major] Timeout/orphan cleanup only signals the wrapper process, not the session process tree — both new branches use a direct SIGTERM (process.kill(pid, 'SIGTERM') / happyProcess?.kill('SIGTERM')), but runner-owned sessions already use killProcessByChildProcess() because the actual agent subprocesses can outlive the top-level HAPI child. That means a timed-out/orphaned session can keep running after the runner has already reported failure and dropped tracking, which still leaves stray agent processes mutating the worktree. Evidence cli/src/runner/run.ts:208, cli/src/runner/run.ts:567; related existing handling cli/src/runner/run.ts:644, cli/src/utils/process.ts:77.
    Suggested fix:
    const orphan = pidToTrackedSession.get(pid);
    if (orphan?.childProcess) {
        void killProcessByChildProcess(orphan.childProcess);
    }
  • [Major] Timed-out worktree spawns can now leak their temporary worktree — the new timeout branch resolves immediately after sending SIGTERM, but maybeCleanupWorktree('spawn-error') only removes the worktree if the child is already gone. In practice the child is usually still alive in that same tick, so cleanup is skipped and onChildExited() never retries it. A slow worktree session that times out will now leave the created worktree behind. Evidence cli/src/runner/run.ts:567, cli/src/runner/run.ts:575; related cleanup gate cli/src/runner/run.ts:329 and exit path cli/src/runner/run.ts:670.
    Suggested fix:
    let cleanupWorktreeAfterTimeout = false;
    
    const timeout = setTimeout(() => {
        cleanupWorktreeAfterTimeout = true;
        void killProcessByChildProcess(happyProcess!);
        resolve({ type: 'error', errorMessage: buildWebhookFailureMessage('timeout') });
    }, webhookTimeoutMs);
    
    happyProcess.on('exit', () => {
        if (cleanupWorktreeAfterTimeout) {
            void cleanupWorktree();
        }
    });

Summary
Review mode: initial
Two correctness regressions in the new timeout/orphan cleanup path: the new termination logic bypasses the existing tree-kill helper, and the timeout path can skip worktree cleanup permanently after returning an error.

Testing
Not run (automation/security policy). Suggested coverage: a fake-timer runner test that times out a worktree spawn and asserts both the child process tree and the created worktree are cleaned up.
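A dependency-free sketch of the core of that coverage; a hand-rolled clock stands in for vi.useFakeTimers(), the map name mirrors run.ts, and the handler body is illustrative:

```typescript
// Tiny manual clock so the 15s timeout can be exercised without real waiting.
class FakeClock {
    private now = 0;
    private timers: { at: number; fn: () => void }[] = [];
    setTimeout(fn: () => void, ms: number): void {
        this.timers.push({ at: this.now + ms, fn });
    }
    advance(ms: number): void {
        this.now += ms;
        const due = this.timers.filter(t => t.at <= this.now);
        this.timers = this.timers.filter(t => t.at > this.now);
        due.forEach(t => t.fn());
    }
}

const clock = new FakeClock();
const pidToTrackedSession = new Map<number, { sessionId: string }>();
pidToTrackedSession.set(63881, { sessionId: 'pending' });

// The fixed timeout handler must drop tracking so a late webhook is rejected.
clock.setTimeout(() => pidToTrackedSession.delete(63881), 15_000);
clock.advance(15_000);
```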

HAPI Bot

`[RUNNER RUN] Ignoring late webhook from orphaned runner-spawned PID ${pid} (session ${sessionId}). Terminating child.`
);
try {
process.kill(pid, 'SIGTERM');


[MAJOR] This only signals the top-level HAPI child. Runner-owned sessions already use killProcessByChildProcess() elsewhere because Claude/Codex/Cursor descendants can outlive the wrapper, so this branch can still leave the real agent subtree running after tracking has been dropped. Suggested fix:

const orphan = pidToTrackedSession.get(pid);
if (orphan?.childProcess) {
    void killProcessByChildProcess(orphan.childProcess);
}

// to clean up; even if the signal is ignored we already removed
// their tracking entry above, so they cannot be promoted.
try {
happyProcess?.kill('SIGTERM');


[MAJOR] Same tree-kill issue here, and this now races worktree cleanup: maybeCleanupWorktree('spawn-error') only removes the worktree if the PID is already dead, but this branch resolves immediately after sending SIGTERM, so timed-out worktree sessions will usually skip cleanup and never retry it. Suggested fix: kill the full process tree here and trigger cleanupWorktree() after the child has actually exited.
