Heartbeat health check misses zombie sessions after API 529 (overloaded)

## Summary

When the Anthropic API returns 529 (overloaded), agent sessions die but the heartbeat health check fails to detect and recover them. Workers remain stuck with `active: true` indefinitely, blocking issue progress until manual intervention.

## Environment

- OpenClaw: 2026.3.2
- DevClaw: @laurentenhoor/devclaw 1.6.10
- Auth: Claude Max (single profile `anthropic:manual`)

## Steps to Reproduce

1. DevClaw heartbeat dispatches an agent (e.g., ARCHITECT) on an issue
2. Anthropic API returns 529 (overloaded) during the session
3. OpenClaw embedded runner retries 3x (300ms/600ms/1200ms) — all fail
4. Session dies — the gateway shows the session with `abortedLastRun: false`
5. Worker slot in `projects.json` remains `active: true`, `sessionKey` still set
6. **Heartbeat health check runs but does NOT detect the problem**
7. Issue stays stuck forever until manual reset

## Root Cause Analysis

The health check in `lib/services/heartbeat/health.ts` includes a stale-worker detection branch:

```js
if (slot.active && slot.startTime && sessionKey && sessions && isSessionAlive(sessionKey, sessions)) {
    const hours = (Date.now() - new Date(slot.startTime).getTime()) / 36e5;
    if (hours > staleWorkerHours) { // default 2h
        // deactivate slot
    }
}
```

Three conditions come together for a zombie session after a 529:
1. `slot.startTime` must be set for this check to run, but after a 529 crash/recovery cycle `startTime` can be `null`.
2. `isSessionAlive()` returns `true` because the session still exists in the gateway.
3. `abortedLastRun` is `false` because a 529 does not set this flag.

When `startTime` is `null`, `slot.active && slot.startTime` evaluates to `null`, which is falsy, so the entire stale worker check is skipped.

The `session_stalled` check also fails because `session.updatedAt` keeps getting refreshed by gateway polling, so `sessionIdleMs` stays below the 15-minute threshold even though the session is not actually doing any work.

The `session_dead` check only fires when the session is **missing** from the gateway, but the session file still exists, so `isSessionAlive()` returns `true`.

**Net result:** all health check branches pass, the zombie goes undetected.
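
To make the interaction concrete, here is a minimal sketch of the branches described above, evaluated against a slot in the observed zombie state. Only the stale-worker condition comes from the excerpt. The `checkSlot` function, the `Map` of sessions, the object shapes, and the branch ordering are assumptions for illustration, not the actual code in `health.ts`.

```js
// Hypothetical simplification of the health-check branches described above.
// Function names, object shapes, and branch order are assumptions.
const STALE_WORKER_HOURS = 2;            // default quoted in the excerpt
const STALL_THRESHOLD_MS = 15 * 60_000;  // 15-minute threshold from the text

function isSessionAlive(sessionKey, sessions) {
  return sessions.has(sessionKey);
}

function checkSlot(slot, sessions, now = Date.now()) {
  if (!slot.active) return "healthy";
  const session = sessions.get(slot.sessionKey);

  // session_dead: fires only when the session is missing from the gateway.
  if (!isSessionAlive(slot.sessionKey, sessions)) return "session_dead";

  // context_overflow: relies on abortedLastRun, which a 529 never sets.
  if (session.abortedLastRun) return "context_overflow";

  // session_stalled: defeated because gateway polling refreshes updatedAt.
  if (now - session.updatedAt > STALL_THRESHOLD_MS) return "session_stalled";

  // Stale worker: skipped entirely when startTime is null.
  if (slot.startTime) {
    const hours = (now - new Date(slot.startTime).getTime()) / 36e5;
    if (hours > STALE_WORKER_HOURS) return "stale_worker";
  }
  return "healthy";
}

// Zombie state as reported: active, no startTime, session present and "fresh".
const now = Date.now();
const sessions = new Map([
  ["architect-74", { abortedLastRun: false, updatedAt: now - 5_000 }],
]);
const zombie = { active: true, startTime: null, sessionKey: "architect-74" };

console.log(checkSlot(zombie, sessions, now)); // "healthy"
```

In this sketch the same slot would be reported as `stale_worker` if `startTime` had survived and was more than two hours old, which is why the `null` value matters.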

## Observed Behavior

Two ARCHITECT workers on issue #74 were stuck for **41 and 75 hours** respectively. Both had:
- `active: true`
- `startTime: null`
- `sessionKey` pointing to existing gateway sessions
- `abortedLastRun: false`
- Gateway `updatedAt` kept refreshing

Manual fix required: delete sessions via `sessions.delete`, reset slots in `projects.json`.

## Suggested Fixes

### 1. Handle `startTime: null` as a stuck state

If a slot is `active` but `startTime` is `null`, the health check should treat it as stuck and deactivate:

```js
if (slot.active && !slot.startTime) {
    // Missing startTime: cannot determine staleness, treat as stuck
    await deactivateSlot();
}
```

### 2. Set `abortedLastRun: true` on terminal 529/overloaded errors

The embedded runner maps 529 to the `rate_limit` failover reason, but the session is not marked as aborted when failover is exhausted. If the session terminates due to exhausted failover attempts, `abortedLastRun` should be set to `true` so the `context_overflow` health check branch catches it.
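
A rough sketch of what this could look like, assuming a retry wrapper around the API call. `runWithRetries`, the `session` object, and the `status` field on the error are hypothetical names for illustration; the real embedded runner may be structured quite differently. The default delays mirror the 300ms/600ms/1200ms schedule from the reproduction steps.

```js
// Hypothetical sketch: mark the session aborted once overloaded retries are exhausted.
// runWithRetries, the session shape, and err.status are assumptions, not OpenClaw APIs.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithRetries(session, attempt, delaysMs = [300, 600, 1200]) {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      const overloaded = err && err.status === 529;
      if (!overloaded) throw err; // other errors are out of scope for this sketch

      if (i >= delaysMs.length) {
        // Every retry failed: record the terminal state so health checks can see it.
        session.abortedLastRun = true;
        throw err;
      }
      await sleep(delaysMs[i]);
    }
  }
}
```

With the default delays this makes one initial attempt plus three retries, then sets the flag before rethrowing.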

### 3. Add a secondary staleness check based on session content

Even when `updatedAt` looks fresh, if `outputTokens` and `inputTokens` haven't changed between two consecutive heartbeat ticks, the session is likely dead. This would catch the zombie case regardless of gateway polling behavior.
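
One possible shape for this check, as a sketch only: remember the token counters seen on the previous tick and flag the session when they repeat. `createProgressTracker` and its `unchangedTicksLimit` parameter are hypothetical, and the sketch assumes the gateway session object exposes `inputTokens` and `outputTokens` as numbers.

```js
// Hypothetical sketch: flag sessions whose token counters stop moving between ticks.
// The tracker and its parameter are assumptions; field names follow the issue text.
function createProgressTracker(unchangedTicksLimit = 1) {
  const seen = new Map(); // sessionKey -> { fingerprint, unchangedTicks }

  // Call once per heartbeat tick; returns true when the session looks dead.
  return function observe(sessionKey, session) {
    const fingerprint = `${session.inputTokens}:${session.outputTokens}`;
    const prev = seen.get(sessionKey);
    const unchangedTicks =
      prev && prev.fingerprint === fingerprint ? prev.unchangedTicks + 1 : 0;
    seen.set(sessionKey, { fingerprint, unchangedTicks });
    return unchangedTicks >= unchangedTicksLimit;
  };
}

const observe = createProgressTracker();
console.log(observe("architect-74", { inputTokens: 100, outputTokens: 50 })); // false
console.log(observe("architect-74", { inputTokens: 100, outputTokens: 50 })); // true
```

A limit of 1 matches the "two consecutive ticks" wording above. A higher limit may be safer if a healthy session can legitimately go a full tick without token movement.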

## Workaround

Deployed a cron-based watchdog (runs every 5 min) that catches:
1. `active: true` + `startTime: null` → reset slot
2. Session key not in gateway → reset slot
3. Session `updatedAt` > 30 min old → kill session + reset slot
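
The three rules can be sketched as a single decision function. This is an illustration of the logic, not the deployed watchdog script: `classifySlot`, the action names, and the `Map` of gateway sessions are assumptions.

```js
// Hypothetical sketch of the watchdog's decision logic; not the deployed script.
const MAX_SESSION_AGE_MS = 30 * 60_000; // 30 minutes, from rule 3

function classifySlot(slot, sessions, now = Date.now()) {
  if (!slot.active) return "ok";

  // Rule 1: active slot without a startTime.
  if (slot.startTime == null) return "reset_slot";

  // Rule 2: session key not present in the gateway.
  const session = sessions.get(slot.sessionKey);
  if (!session) return "reset_slot";

  // Rule 3: session has not been updated for more than 30 minutes.
  if (now - session.updatedAt > MAX_SESSION_AGE_MS) return "kill_session_and_reset_slot";

  return "ok";
}
```

Keeping the decision separate from the side effects (deleting sessions, rewriting `projects.json`) makes the rules easy to test on their own.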
