bug(lifecycle): a sticky waiting_input session whose runtime pane died is never reaped to exited (stuck needs_input, frozen activity, dead terminal) #290

@neversettle17-101

Bug

A worker that finishes its turn and reports waiting_input is correct to show needs_input. But if its runtime pane then dies (daemon/app restart, zellij exit, crash), the session is never reconciled to exited. It stays permanently:

  • status: needs_input / activity.state: waiting_input
  • activity.lastActivityAt frozen at the last hook signal
  • a dead terminal pane that shows nothing new (the agent process is gone)

This was observed live on agent-orchestrator-2: the worker committed and opened PR #278, went waiting_input at 08:12Z, its zellij pane later exited, and the session sat frozen at needs_input with lastActivityAt=08:12Z while the reaper probed it dead every 5s.

Source: local observation (session agent-orchestrator-2) | Reported by: @aditi1178 | Analyzed against: 7c9ae53 (lifecycle code identical on main)
Confidence: High

Reproduction

  1. Spawn a worker; let it finish a task so it reports waiting_input (needs_input).
  2. Kill its runtime pane (restart the daemon/app, or let the zellij pane exit or crash).
  3. The reaper probes the pane dead every 5s, but the session never flips to exited.
  4. UI keeps showing needs_input with a stale lastActivityAt and a dead/blank terminal.

Root Cause

The OBSERVE-layer reaper correctly reports the pane as dead, but the LCM's death-arbitration gate suppresses it for sticky states.

  • backend/internal/observe/reaper/reaper.go — every 5s, IsAlive returns false → reports ports.ProbeDead via ApplyRuntimeObservation. Working.
  • backend/internal/lifecycle/manager.go:94 ApplyRuntimeObservation only marks IsTerminated + ActivityExited when runtimeClearlyDead(...) is true.
  • backend/internal/lifecycle/runtime.go:25-28:
    func runtimeClearlyDead(f ports.RuntimeFacts, activity domain.Activity, now time.Time, window time.Duration) bool {
        observedAt := timeOr(f.ObservedAt, now)
        return f.Probe == ports.ProbeDead && !hasRecentActivity(activity, observedAt, window)
    }
  • hasRecentActivity (runtime.go:12-23) returns true unconditionally for any sticky state:
    case a.State.IsSticky():   // ActivityWaitingInput
        return true
  • IsSticky() (backend/internal/domain/activity.go:19-21) is true for waiting_input.

So for a waiting_input session, hasRecentActivity is always true → !hasRecentActivity is always false → runtimeClearlyDead is always false → the dead-runtime fact is discarded forever.

The sticky exemption is right for aging by passage of time ("a paused agent is still paused until a new signal says so"), but it is wrong to apply it against an authoritative dead-runtime probe: if the process is gone, the agent cannot be "waiting for input" — there is nothing to receive it. hasRecentActivity has exactly one caller (runtimeClearlyDead), so the sticky branch exists solely to (incorrectly) influence death arbitration.

Maps to all three reported symptoms

  1. Commits don't show in the terminal — the worker's commits are real (in the worktree git log), but its pane is dead; because the session is never marked exited, the UI still presents it as a live needs_input session with an attachable terminal that has no live PTY behind it.
  2. Stuck in needs_input — legit at first, then frozen by this bug.
  3. Activity not updated — no new hook signals (process dead) AND the reaper-driven exited transition (which would stamp lastActivityAt to the observed-dead time) is suppressed.

Fix

In runtimeClearlyDead, let an authoritative ProbeDead override the sticky exemption while still honoring the recent-activity time window (so we don't race a just-reported signal during a relaunch):

func runtimeClearlyDead(f ports.RuntimeFacts, activity domain.Activity, now time.Time, window time.Duration) bool {
    if f.Probe != ports.ProbeDead || activity.State == domain.ActivityExited {
        return false
    }
    observedAt := timeOr(f.ObservedAt, now)
    if activity.LastActivityAt.IsZero() {
        return true
    }
    // A dead runtime is authoritative; a sticky state no longer means "still
    // waiting" once the process is gone. Only protect against racing a *recent*
    // live signal via the time window — do NOT treat sticky as infinitely recent.
    return observedAt.Sub(activity.LastActivityAt) > window
}

This keeps sticky semantics for time-based aging while letting a provably-dead pane be reaped after the window. Needs a unit test: waiting_input + ProbeDead + LastActivityAt older than window → exited.

Impact

  • Workers/orchestrators that pause for input and then lose their pane are stuck forever as needs_input — they pollute the session list, mislead the orchestrator/UI, and present dead terminals.
  • Notification correctness: a stuck needs_input keeps signaling "needs input" for a session that can never receive it.

Related

  • #241 — MarkSpawned clobbers a racing activity signal / resurrects exited sessions (same activity-vs-runtime arbitration surface).
  • #265 — hung zellij pane blocks PR nudges (dead-pane handling).
  • #32 — reaper known gaps.

Labels: bug, lcm-sm (Lifecycle + Session Manager lane), priority: high