Heartbeat health check misses zombie sessions after API 529 (overloaded) #498

@DXristianin

Summary

When the Anthropic API returns 529 (overloaded), agent sessions die but the heartbeat health check fails to detect and recover them. Workers remain stuck with `active: true` indefinitely, blocking issue progress until manual intervention.

Environment

  • OpenClaw: 2026.3.2
  • DevClaw: @laurentenhoor/devclaw 1.6.10
  • Auth: Claude Max (single profile anthropic:manual)

Steps to Reproduce

  1. DevClaw heartbeat dispatches an agent (e.g., ARCHITECT) on an issue
  2. Anthropic API returns 529 (overloaded) during the session
  3. OpenClaw embedded runner retries 3x (300ms/600ms/1200ms) — all fail
  4. Session dies, but the gateway still shows the session with `abortedLastRun: false`
  5. Worker slot in `projects.json` remains `active: true`, `sessionKey` still set
  6. Heartbeat health check runs but does NOT detect the problem
  7. Issue stays stuck forever until manual reset

Root Cause Analysis

The health check in `lib/services/heartbeat/health.ts` contains a stale-worker check:

```js
if (slot.active && slot.startTime && sessionKey && sessions && isSessionAlive(sessionKey, sessions)) {
  const hours = (Date.now() - new Date(slot.startTime).getTime()) / 36e5;
  if (hours > staleWorkerHours) { // default 2h
    // deactivate slot
  }
}
```

This check requires all of:

  1. `slot.startTime` is set — but after a 529 crash/recovery cycle, `startTime` can be `null`
  2. `isSessionAlive()` returns `true` — the session still exists in the gateway
  3. `abortedLastRun` is `false` — 529 does not set this flag

When `startTime` is `null`, the entire stale worker check is bypassed: `slot.active && slot.startTime` evaluates to `null`, which is falsy, so the body of the `if` never runs.

The `session_stalled` check also fails because `session.updatedAt` keeps getting refreshed by gateway polling, so `sessionIdleMs` stays below the 15-minute threshold even though the session is not actually doing any work.

The `session_dead` check only fires when the session is missing from the gateway — but the session file still exists, so `isSessionAlive()` returns true.

Net result: every health check branch passes, so the zombie goes undetected.
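To make the bypass concrete, here is a minimal sketch that wraps the guard quoted above in a function and feeds it two slots. The `isSessionAlive` stand-in, the `Map`-based `sessions` shape, the session key string, and the timestamps are all assumptions for illustration, not the actual DevClaw implementation.

```javascript
// Hypothetical stand-in for the real helper: a session counts as alive
// if the gateway still lists its key.
function isSessionAlive(sessionKey, sessions) {
  return sessions.has(sessionKey);
}

// Mirrors the guard from health.ts; returns true when the slot would be deactivated.
function staleCheckFires(slot, sessions, staleWorkerHours = 2, now = Date.now()) {
  const { sessionKey } = slot;
  if (slot.active && slot.startTime && sessionKey && sessions && isSessionAlive(sessionKey, sessions)) {
    const hours = (now - new Date(slot.startTime).getTime()) / 36e5;
    return hours > staleWorkerHours;
  }
  return false; // guard short-circuited, nothing is checked
}

// Illustrative data only: the key format and dates are made up.
const sessions = new Map([["architect-74", { abortedLastRun: false }]]);
const now = Date.parse("2026-03-10T12:00:00Z");

const oldButTracked = { active: true, startTime: "2026-03-10T08:00:00Z", sessionKey: "architect-74" };
const zombie = { active: true, startTime: null, sessionKey: "architect-74" };

console.log(staleCheckFires(oldButTracked, sessions, 2, now)); // true  (4h > 2h)
console.log(staleCheckFires(zombie, sessions, 2, now));        // false (startTime is null)
```

The second slot has been running for an unknown amount of time, yet the check reports nothing because the missing `startTime` short-circuits the condition before any elapsed time is computed.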

Observed Behavior

Two ARCHITECT workers on issue #74 were stuck for 41 and 75 hours respectively. Both had:

  • `active: true`
  • `startTime: null`
  • `sessionKey` pointing to existing gateway sessions
  • `abortedLastRun: false`
  • Gateway `updatedAt` kept refreshing

Manual fix required: delete sessions via `sessions.delete`, reset slots in `projects.json`.

Suggested Fixes

1. Handle startTime: null as a stuck state

If a slot is `active` but `startTime` is null, the health check should treat it as stuck and deactivate:

```js
if (slot.active && !slot.startTime) {
  // Missing startTime: staleness cannot be determined, so treat the slot as stuck
  await deactivateSlot();
}
```

2. Set abortedLastRun: true on terminal 529/overloaded errors

The embedded runner maps 529 to the `rate_limit` failover reason, but the session is not marked as aborted when failover is exhausted. If the session terminates because all failover attempts have been used up, `abortedLastRun` should be set to `true` so the `context_overflow` health check branch catches it.
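A rough sketch of what this could look like is below. It is not the OpenClaw runner: `runWithFailover`, `OverloadedError`, and the shape of `session` are hypothetical names, and it assumes "retries 3x" means one initial attempt followed by three retries at the delays listed in the reproduction steps.

```javascript
// Hypothetical error type standing in for a 529 response from the API client.
class OverloadedError extends Error {
  constructor() {
    super("API overloaded");
    this.status = 529;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `attempt` once, then retries after each delay while it keeps failing with 529.
// When every attempt fails, the session record is marked aborted before rethrowing.
async function runWithFailover(session, attempt, delaysMs = [300, 600, 1200], wait = sleep) {
  let lastError;
  for (let i = 0; i <= delaysMs.length; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (err.status !== 529) throw err; // only overloaded errors are retried here
      lastError = err;
      if (i < delaysMs.length) await wait(delaysMs[i]);
    }
  }
  session.abortedLastRun = true; // the proposed change: flag terminal failover exhaustion
  throw lastError;
}
```

With the flag set at the point where the retries run out, the health check no longer has to infer the failure from timestamps.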

3. Add a secondary staleness check based on session content

Even when `updatedAt` looks fresh, if `outputTokens` and `inputTokens` haven't changed between two consecutive heartbeat ticks, the session is likely dead. This would catch the zombie case regardless of gateway polling behavior.
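One possible shape for this check, as a sketch only: it assumes the gateway session object exposes `inputTokens` and `outputTokens` as numbers, and `hasStalledSinceLastTick` plus the in-memory `lastSeen` map are hypothetical names.

```javascript
// Remembers the token counters seen on the previous heartbeat tick, keyed by session key.
const lastSeen = new Map();

// Returns true when the session's token counters are unchanged since the previous tick.
// The first observation of a session only records a baseline and returns false.
function hasStalledSinceLastTick(sessionKey, session) {
  const current = { inputTokens: session.inputTokens, outputTokens: session.outputTokens };
  const previous = lastSeen.get(sessionKey);
  lastSeen.set(sessionKey, current);
  if (!previous) return false;
  return (
    previous.inputTokens === current.inputTokens &&
    previous.outputTokens === current.outputTokens
  );
}
```

A session waiting on a long tool call might also show unchanged counters for one tick, so a real implementation would probably want to require several unchanged ticks in a row before deactivating the slot.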

Workaround

Deployed a cron-based watchdog (runs every 5 min) that catches:

  1. `active: true` + `startTime: null` → reset slot
  2. Session key not in gateway → reset slot
  3. Session `updatedAt` > 30 min old → kill session + reset slot
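The three watchdog rules can be sketched as a single decision function. This is an illustration rather than the deployed script: `classifySlot` is a hypothetical name, `sessions` is assumed to be a `Map` from session key to gateway session, and `updatedAt` is assumed to be an epoch timestamp in milliseconds.

```javascript
// Decides what the watchdog should do for one worker slot.
// Returns "reset", "kill_and_reset", or "ok".
function classifySlot(slot, sessions, now = Date.now(), maxIdleMs = 30 * 60 * 1000) {
  if (!slot.active) return "ok";
  if (slot.startTime == null) return "reset"; // rule 1: active but no startTime
  const session = sessions.get(slot.sessionKey);
  if (!session) return "reset"; // rule 2: session key not in gateway
  if (now - session.updatedAt > maxIdleMs) return "kill_and_reset"; // rule 3: idle too long
  return "ok";
}
```

For the zombies described above, rule 1 is the one that fires, since their `updatedAt` kept refreshing and rule 3 would not have triggered.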
