Heartbeat Service Deadlock After Gateway RPC Timeout
Summary
The work_heartbeat service stops executing ticks after a single RPC timeout to the gateway. Root cause: the _tickRunning flag remains locked after fetchGatewaySessions() times out, preventing all subsequent ticks.
Environment
- DevClaw version: v1.6.10
- OpenClaw version: (from npm-global)
- Node.js: v22.22.0
- OS: Linux 5.15.0-164-generic (x64)
Steps to Reproduce
- Start gateway with DevClaw plugin enabled
- Wait for heartbeat to execute its first tick (happens ~2s after startup)
- Trigger gateway RPC timeout (e.g., gateway under load, network latency)
- Observe: heartbeat stops ticking entirely
Expected Behavior
Heartbeat should continue ticking every 60 seconds even if individual ticks fail or time out.
Actual Behavior
After the first RPC timeout:
- The _tickRunning flag remains true
- All subsequent ticks return early at the guard: if (_tickRunning) return;
- Heartbeat is dead until gateway restart
Evidence
Log proof (timestamps as logged; the Z suffix denotes UTC, not CET):
2026-03-05T09:41:35.722Z - work_heartbeat tick: 1 pickups, 1 health fixes, 1 review transitions, 0 review skips, 0 test skips, 8 skipped
2026-03-05T10:01:26.517Z - Gateway call failed: Error: gateway timeout after 10000ms
2026-03-05T10:06:09.680Z - Gateway call failed: Error: gateway timeout after 10000ms
[NO MORE TICKS AFTER THIS]
Last successful tick: 09:41
First timeout: 10:01
Current time when observed: 11:28 (no ticks for ~1h 45min)
Root Cause Analysis
In lib/services/heartbeat/index.ts:
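let _tickRunning = false;
async function runHeartbeatTick(...): Promise<void> {
  if (_tickRunning) return; // ← EARLY RETURN (every tick ends here once the flag is stuck)
  _tickRunning = true;
  try {
    const config = resolveHeartbeatConfig(ctx.pluginConfig);
    if (!config.enabled) return; // ← initially suspected, but finally still runs on this path
    const agents = discoverAgents(ctx.config);
    if (agents.length === 0) return; // ← initially suspected, but finally still runs on this path
    const result = await processAllAgents(...); // ← Can throw, or never settle
    logTickResult(result, logger);
  } catch (err) {
    logger.error(`work_heartbeat tick failed: ${err}`);
  } finally {
    _tickRunning = false;
  }
}
Problem 1 (initial suspicion, ruled out): the early returns for !config.enabled and agents.length === 0 looked like they skip the finally block. In JavaScript a return inside try still runs finally, so these paths do reset the flag and cannot cause the deadlock on their own.
Problem 2 (likely cause): if the promise returned by processAllAgents() → fetchGatewaySessions() never settles (for example, the underlying RPC hangs without rejecting), the await never resumes, neither catch nor finally runs, and _tickRunning stays true. A thrown error or rejected promise is not the problem, because catch and finally handle it.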
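The difference between these paths can be checked with a small stand-alone experiment. The sketch below is not DevClaw code: tick, running, and demo are hypothetical stand-ins for the guard pattern shown above, and a promise that never settles stands in for a hung gateway RPC.

```typescript
// Hypothetical stand-in for the _tickRunning guard; not DevClaw code.
let running = false;

async function tick(work: () => Promise<void>, enabled = true): Promise<void> {
  if (running) return;
  running = true;
  try {
    if (!enabled) return; // early return: finally still runs
    await work();
  } catch {
    // a rejection lands here, and finally still runs afterwards
  } finally {
    running = false;
  }
}

async function demo(): Promise<boolean[]> {
  const observed: boolean[] = [];

  await tick(async () => {}, false); // early-return path
  observed.push(running);

  await tick(async () => {
    throw new Error("boom"); // rejecting path
  });
  observed.push(running);

  void tick(() => new Promise<void>(() => {})); // promise that never settles
  await new Promise((resolve) => setTimeout(resolve, 10));
  observed.push(running);

  return observed;
}
```

Under these assumptions, the flag is released after the early return and after the rejection, and only the never-settling promise leaves it set. That matches the hang hypothesis rather than the early-return one.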
Attempted Fix
Applied this patch:
async function runHeartbeatTick(...): Promise<void> {
  if (_tickRunning) {
    logger.warn("work_heartbeat: skipping tick (previous tick still running)");
    return;
  }
  _tickRunning = true;
  try {
    const config = resolveHeartbeatConfig(ctx.pluginConfig);
    if (!config.enabled) {
      _tickRunning = false; // ← Added
      return;
    }
    const agents = discoverAgents(ctx.config);
    if (agents.length === 0) {
      _tickRunning = false; // ← Added
      return;
    }
    const result = await processAllAgents(...);
    logTickResult(result, logger);
  } catch (err) {
    logger.error(`work_heartbeat tick failed: ${err}`);
  } finally {
    _tickRunning = false;
  }
}
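Result: the heartbeat still deadlocks after ~1 hour. The added resets only cover the early-return paths, which finally already covers (a return inside try still runs finally), so they do not address a hang.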
Hypothesis: fetchGatewaySessions() or processAllAgents() hangs without throwing, leaving the promise pending forever.
Additional Issue: Plugin Lifecycle
After systemctl --user restart openclaw-gateway, the DevClaw plugin loads (DevClaw plugin registered) but the heartbeat service does not start:
2026-03-05T10:18:40.953Z - DevClaw plugin registered (23 tools, 1 CLI command group, 1 service, 3 hooks)
[NO "work_heartbeat service started" LOG]
[NO TICKS AFTER THIS]
The service seems to fail silently on restart.
Proposed Solutions
1. Timeout Guard for processAllAgents()
Wrap the entire tick execution in a timeout:
async function runHeartbeatTick(...): Promise<void> {
  if (_tickRunning) {
    logger.warn("work_heartbeat: skipping tick (previous tick still running)");
    return;
  }
  _tickRunning = true;
  const timeoutMs = 50_000; // 50s (less than the 60s interval)
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Tick timeout")), timeoutMs);
  });
  try {
    const config = resolveHeartbeatConfig(ctx.pluginConfig);
    if (!config.enabled) return; // finally below still resets the flag
    const agents = discoverAgents(ctx.config);
    if (agents.length === 0) return; // finally below still resets the flag
    // Promise.race only stops waiting after timeoutMs; it does not cancel processAllAgents().
    const result = await Promise.race([
      processAllAgents(agents, config, pluginConfig, logger, runCommand, runtime),
      timeoutPromise,
    ]);
    logTickResult(result, logger);
  } catch (err) {
    logger.error(`work_heartbeat tick failed: ${err}`);
  } finally {
    // Clear the timer so early returns and fast ticks do not leave it pending
    // (on an early return nothing is awaiting timeoutPromise, so its rejection would go unhandled).
    clearTimeout(timer);
    _tickRunning = false;
  }
}
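Both proposals use the same race-against-a-timer pattern, so it could be factored into one helper. The following is a minimal sketch, not existing DevClaw code: withTimeout and its label parameter are hypothetical names. It clears the timer once the race is decided, so a fast call does not leave a pending timer behind.

```typescript
// Hypothetical helper; the name and signature are assumptions for illustration.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first decides the outcome of the race.
    return await Promise.race([work, timeout]);
  } finally {
    // Runs on success, failure, and timeout alike.
    clearTimeout(timer);
  }
}
```

A call site would then read something like await withTimeout(processAllAgents(...), 50_000, "work_heartbeat tick"). One caveat: losing the race does not cancel the underlying work, so a hung call keeps running in the background and may still complete during a later tick.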
2. Per-Agent Timeout in fetchGatewaySessions()
Add a timeout to the RPC call itself in lib/services/heartbeat/health.ts:
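export async function fetchGatewaySessions(
  agentId: string | undefined,
  runCommand: RunCommand,
): Promise<SessionLookup | null> {
  const timeoutMs = 8_000; // 8s
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const result = await Promise.race([
      runCommand({ command: "sessions", args: { action: "list", agentId } }),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("Gateway RPC timeout")), timeoutMs);
      }),
    ]);
    // ... rest of the function
  } catch (err) {
    // err is unknown in strict TypeScript, so narrow it before reading .message
    const message = err instanceof Error ? err.message : String(err);
    logger.warn(`Failed to fetch gateway sessions: ${message}`);
    return null; // Graceful degradation
  } finally {
    clearTimeout(timer); // do not leave the timer pending after a fast response
  }
}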
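The two timeout guards only help on code paths that are actually wrapped. A complementary option is to make the lock itself expire, so that a hung tick cannot block the heartbeat forever. The sketch below is a suggestion under stated assumptions, not DevClaw code: TickLock, maxHoldMs, and the token scheme are hypothetical, and the injectable clock exists only to make the behavior easy to test.

```typescript
// Hypothetical replacement for a boolean _tickRunning flag; all names are illustrative.
class TickLock {
  private holder: { token: number; startedAt: number } | null = null;
  private nextToken = 1;
  private readonly maxHoldMs: number;
  private readonly now: () => number;

  constructor(maxHoldMs: number, now: () => number = Date.now) {
    this.maxHoldMs = maxHoldMs;
    this.now = now;
  }

  /** Returns a token if the caller now owns the lock, or null if a tick is still in its window. */
  tryAcquire(): number | null {
    const t = this.now();
    if (this.holder !== null && t - this.holder.startedAt < this.maxHoldMs) {
      return null;
    }
    // Either the lock is free, or the previous holder exceeded maxHoldMs and is treated as hung.
    const token = this.nextToken++;
    this.holder = { token, startedAt: t };
    return token;
  }

  /** Releases the lock only if the token belongs to the current holder. */
  release(token: number): void {
    if (this.holder !== null && this.holder.token === token) {
      this.holder = null;
    }
  }
}
```

A tick would call tryAcquire() at the top, return if it gets null, and call release(token) in finally. The token check means a hung tick that finally completes cannot release a lock that a newer tick has since taken over. The trade-off is that the old tick's work may overlap with the new one, so maxHoldMs should be comfortably longer than a normal tick.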
3. Investigate Plugin Service Lifecycle
Why does the heartbeat service fail to start on gateway restart? Logs show plugin registration but no service startup.
Check whether the start callback passed to api.registerService() is invoked after restart.
Impact
Severity: High
- Heartbeat is critical for automatic worker dispatch
- Manual intervention required every ~1 hour to restart gateway
- Affects all DevClaw installations with active projects
Workarounds (Temporary)
- Manual gateway restart via cron: systemctl --user restart openclaw-gateway every hour
- Manual task dispatch via the task_start tool when the heartbeat is dead
Neither workaround is sustainable for production use.
Request: Please investigate and apply a robust fix that ensures _tickRunning is always reset, even in timeout/hang scenarios. Consider adding the timeout guards proposed above.
Thanks for the great work on DevClaw! 🦾