Summary
When the Anthropic API returns 529 (overloaded), agent sessions die but the heartbeat health check fails to detect and recover them. Workers remain stuck with active: true indefinitely, blocking issue progress until manual intervention.
Environment
- OpenClaw: 2026.3.2
- DevClaw: @laurentenhoor/devclaw 1.6.10
- Auth: Claude Max (single profile `anthropic:manual`)
Steps to Reproduce
- DevClaw heartbeat dispatches an agent (e.g., ARCHITECT) on an issue
- Anthropic API returns 529 (overloaded) during the session
- OpenClaw embedded runner retries 3x (300ms/600ms/1200ms) — all fail
- Session dies — the gateway shows the session with `abortedLastRun: false`
- Worker slot in `projects.json` remains `active: true`, `sessionKey` still set
- Heartbeat health check runs but does NOT detect the problem
- Issue stays stuck forever until manual reset
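The retry sequence in step 3 is a plain exponential backoff (each delay doubles from 300ms). A minimal sketch of that behavior — `retryWithBackoff` and its parameters are illustrative names, not actual OpenClaw internals:

```js
// Sketch of the observed retry behavior: 3 attempts, delays 300ms/600ms/1200ms.
// On the final failure the error propagates and the session dies.
async function retryWithBackoff(fn, attempts = 3, baseDelayMs = 300) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // all retries exhausted
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
}
```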
Root Cause Analysis
The health check in `lib/services/heartbeat/health.ts` has a stale-worker detection branch:
```js
if (slot.active && slot.startTime && sessionKey && sessions && isSessionAlive(sessionKey, sessions)) {
  const hours = (Date.now() - new Date(slot.startTime).getTime()) / 36e5;
  if (hours > staleWorkerHours) { // default 2h
    // deactivate slot
  }
}
```
For this branch to deactivate a slot, all of the following must hold:
- `slot.startTime` is set — but after a 529 crash/recovery cycle, `startTime` can be `null`
- `isSessionAlive()` returns `true` — the session still exists in the gateway
- `abortedLastRun` is `false` — 529 does not set this flag
When `startTime` is `null`, the entire stale-worker check is bypassed: `slot.active && slot.startTime` short-circuits to a falsy value.
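A two-line demonstration of the bypass, using a slot shaped like the zombie entries observed in `projects.json` (the `sessionKey` value is made up for illustration):

```js
// A zombie slot: active, but startTime was lost in the 529 crash/recovery cycle
const slot = { active: true, startTime: null, sessionKey: "agent:architect:74" };

// && short-circuits on the falsy startTime, so staleness is never computed
console.log(Boolean(slot.active && slot.startTime)); // false
```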
The `session_stalled` check also fails because `session.updatedAt` keeps getting refreshed by gateway polling, so `sessionIdleMs` stays below the 15-minute threshold even though the session is not actually doing any work.
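A sketch of that `session_stalled` condition as described above (the 15-minute threshold comes from this report; the field and function names are assumptions, not verified source):

```js
// session_stalled as described: idle time derived from the gateway's updatedAt
function isStalled(session, now = Date.now()) {
  const sessionIdleMs = now - new Date(session.updatedAt).getTime();
  return sessionIdleMs > 15 * 60 * 1000; // 15-minute idle threshold
}
// Gateway polling keeps refreshing updatedAt, so this never trips for zombies.
```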
The `session_dead` check only fires when the session is missing from the gateway — but the session file still exists, so `isSessionAlive()` returns true.
Net result: all health check branches pass, the zombie goes undetected.
Observed Behavior
Two ARCHITECT workers on issue #74 were stuck for 41 and 75 hours respectively. Both had:
- `active: true`
- `startTime: null`
- `sessionKey` pointing to existing gateway sessions
- `abortedLastRun: false`
- Gateway `updatedAt` kept refreshing
Manual fix required: delete sessions via `sessions.delete`, reset slots in `projects.json`.
Suggested Fixes
1. Handle `startTime: null` as a stuck state
If a slot is `active` but `startTime` is null, the health check should treat it as stuck and deactivate:
```js
if (slot.active && !slot.startTime) {
  // Missing startTime — cannot determine staleness, treat as stuck
  await deactivateSlot();
}
```
2. Set `abortedLastRun: true` on terminal 529/overloaded errors
The embedded runner maps 529 to `rate_limit` failover reason, but the session is not marked as aborted when failover is exhausted. If the session terminates due to exhausted failover attempts, `abortedLastRun` should be set to `true` so the `context_overflow` health check branch catches it.
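A hedged sketch of where that flag could be set. `runWithFailover`, `failoverReason`, and `attemptsExhausted` are illustrative names, not verified OpenClaw internals:

```js
// Sketch: flag the session when failover is exhausted on a 529, so the
// context_overflow health check branch can catch it on the next heartbeat.
async function runWithFailover(session, run) {
  try {
    return await run();
  } catch (err) {
    if (err.failoverReason === "rate_limit" && err.attemptsExhausted) {
      session.abortedLastRun = true; // make the zombie visible to health checks
    }
    throw err;
  }
}
```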
3. Add a secondary staleness check based on session content
Even when `updatedAt` looks fresh, if `outputTokens` and `inputTokens` haven't changed between two consecutive heartbeat ticks, the session is likely dead. This would catch the zombie case regardless of gateway polling behavior.
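A sketch of that comparison. `inputTokens`/`outputTokens` follow the gateway session fields named above; persisting the previous tick's values between heartbeats is left to the caller:

```js
// True when token counts did not move between two consecutive heartbeat ticks,
// which suggests the session is no longer doing any work.
function tokensUnchanged(session, lastSeen) {
  return (
    session.inputTokens === lastSeen.inputTokens &&
    session.outputTokens === lastSeen.outputTokens
  );
}
```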
Workaround
Deployed a cron-based watchdog (runs every 5 min) that catches:
- `active: true` + `startTime: null` → reset slot
- Session key not in gateway → reset slot
- Session `updatedAt` > 30 min old → kill session + reset slot
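The three rules above can be sketched as a single classifier (the function name and action strings are illustrative; the conditions mirror the list):

```js
// Watchdog decision logic: one verdict per (slot, gateway session) pair
function classifySlot(slot, session, now = Date.now()) {
  if (slot.active && !slot.startTime) return "reset-slot";      // active with null startTime
  if (!session) return "reset-slot";                            // sessionKey missing from gateway
  const idleMs = now - new Date(session.updatedAt).getTime();
  if (idleMs > 30 * 60 * 1000) return "kill-session-and-reset"; // updatedAt over 30 min old
  return "healthy";
}
```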