Heartbeat Service Deadlock After Gateway RPC Timeout #499

@ziomik

Summary

The work_heartbeat service stops executing ticks after a single RPC timeout to the gateway. Root cause: the _tickRunning flag remains locked after fetchGatewaySessions() times out, preventing all subsequent ticks.

Environment

  • DevClaw version: v1.6.10
  • OpenClaw version: (from npm-global)
  • Node.js: v22.22.0
  • OS: Linux 5.15.0-164-generic (x64)

Steps to Reproduce

  1. Start gateway with DevClaw plugin enabled
  2. Wait for heartbeat to execute its first tick (happens ~2s after startup)
  3. Trigger gateway RPC timeout (e.g., gateway under load, network latency)
  4. Observe: heartbeat stops ticking entirely

Expected Behavior

Heartbeat should continue ticking every 60 seconds even if individual ticks fail or time out.

Actual Behavior

After the first RPC timeout:

  • _tickRunning flag remains true
  • All subsequent ticks return early: if (_tickRunning) return;
  • Heartbeat is dead until gateway restart

Evidence

Log proof (the Z suffix marks these timestamps as UTC):

2026-03-05T09:41:35.722Z - work_heartbeat tick: 1 pickups, 1 health fixes, 1 review transitions, 0 review skips, 0 test skips, 8 skipped
2026-03-05T10:01:26.517Z - Gateway call failed: Error: gateway timeout after 10000ms
2026-03-05T10:06:09.680Z - Gateway call failed: Error: gateway timeout after 10000ms
[NO MORE TICKS AFTER THIS]

Last successful tick: 09:41
First timeout: 10:01
Current time when observed: 11:28 (no ticks for ~1h 45min)

Root Cause Analysis

In lib/services/heartbeat/index.ts:

let _tickRunning = false;

async function runHeartbeatTick(...): Promise<void> {
  if (_tickRunning) return;  // ← EARLY RETURN
  _tickRunning = true;
  try {
    const config = resolveHeartbeatConfig(ctx.pluginConfig);
    if (!config.enabled) return;  // ← inside try, so finally still resets the flag

    const agents = discoverAgents(ctx.config);
    if (agents.length === 0) return;  // ← inside try, so finally still resets the flag

    const result = await processAllAgents(...);  // ← Can throw or never settle
    logTickResult(result, logger);
  } catch (err) {
    logger.error(`work_heartbeat tick failed: ${err}`);
  } finally {
    _tickRunning = false;
  }
}

Problem 1 (ruled out): The early returns for !config.enabled and agents.length === 0 looked suspicious, but both sit inside the try block, and in JavaScript a return inside try still runs finally. They cannot leave the flag set.

Problem 2: If fetchGatewaySessions() (called from processAllAgents()) throws or rejects, catch and finally run and the flag is reset. If the promise never settles, the await never resumes, finally is never reached, and _tickRunning stays true.
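
Both problems can be checked in isolation. The sketch below is a stand-alone reduction of the guard, not DevClaw code: tick, work and demo are hypothetical names, and work stands in for processAllAgents(). It records the flag after an early return, after a rejection, and while the awaited promise never settles.

```typescript
let tickRunning = false;

// Same guard structure as runHeartbeatTick; `work` is a stand-in for processAllAgents().
async function tick(enabled: boolean, work: () => Promise<void>): Promise<void> {
  if (tickRunning) return;
  tickRunning = true;
  try {
    if (!enabled) return; // early return inside try
    await work();
  } catch (err) {
    console.error(`tick failed: ${err}`);
  } finally {
    tickRunning = false;
  }
}

async function demo(): Promise<boolean[]> {
  const observed: boolean[] = [];

  await tick(false, async () => {});
  observed.push(tickRunning); // flag after an early return

  await tick(true, async () => {
    throw new Error("gateway timeout");
  });
  observed.push(tickRunning); // flag after a rejection

  // A promise that never settles. The tick is not awaited, or demo() would hang too.
  void tick(true, () => new Promise<void>(() => {}));
  await new Promise((resolve) => setTimeout(resolve, 10));
  observed.push(tickRunning); // flag while the tick is stuck

  return observed;
}
```

Under these assumptions demo() resolves to [false, false, true]: finally runs after the early return and after the rejection, so only the never-settling promise leaves the flag set.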

Attempted Fix

Applied this patch:

async function runHeartbeatTick(...): Promise<void> {
  if (_tickRunning) {
    logger.warn("work_heartbeat: skipping tick (previous tick still running)");
    return;
  }
  _tickRunning = true;
  try {
    const config = resolveHeartbeatConfig(ctx.pluginConfig);
    if (!config.enabled) {
      _tickRunning = false;  // ← Added
      return;
    }

    const agents = discoverAgents(ctx.config);
    if (agents.length === 0) {
      _tickRunning = false;  // ← Added
      return;
    }

    const result = await processAllAgents(...);
    logTickResult(result, logger);
  } catch (err) {
    logger.error(`work_heartbeat tick failed: ${err}`);
  } finally {
    _tickRunning = false;
  }
}

Result: No effective change. The explicit resets are redundant, because finally already runs on those early-return paths, and the heartbeat still deadlocks after ~1 hour.

Hypothesis: fetchGatewaySessions() or processAllAgents() hangs without throwing, leaving the promise pending forever.

Additional Issue: Plugin Lifecycle

After systemctl --user restart openclaw-gateway, the DevClaw plugin loads (DevClaw plugin registered) but the heartbeat service does not start:

2026-03-05T10:18:40.953Z - DevClaw plugin registered (23 tools, 1 CLI command group, 1 service, 3 hooks)
[NO "work_heartbeat service started" LOG]
[NO TICKS AFTER THIS]

The service seems to fail silently on restart.

Proposed Solutions

1. Timeout Guard for processAllAgents()

Wrap the entire tick execution in a timeout:

async function runHeartbeatTick(...): Promise<void> {
  if (_tickRunning) {
    logger.warn("work_heartbeat: skipping tick (previous tick still running)");
    return;
  }
  _tickRunning = true;

  const timeoutMs = 50_000; // 50s (less than 60s interval)
  let timer: ReturnType<typeof setTimeout> | undefined;

  try {
    const config = resolveHeartbeatConfig(ctx.pluginConfig);
    if (!config.enabled) return;  // finally resets the flag

    const agents = discoverAgents(ctx.config);
    if (agents.length === 0) return;  // finally resets the flag

    // Create the timeout only once it is needed. If it were created before the
    // early returns, it would reject later with no handler attached.
    const timeoutPromise = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('Tick timeout')), timeoutMs);
    });

    const result = await Promise.race([
      processAllAgents(agents, config, pluginConfig, logger, runCommand, runtime),
      timeoutPromise,
    ]);
    logTickResult(result, logger);
  } catch (err) {
    logger.error(`work_heartbeat tick failed: ${err}`);
  } finally {
    clearTimeout(timer);  // no stray timer after a tick that finished in time
    _tickRunning = false;
  }
}
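
The race-against-a-timer pattern above could be factored into a small helper so that the timer is always cleared. This is a minimal sketch; withTimeout is a hypothetical name and not part of DevClaw or OpenClaw.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms`,
// and clears its timer whichever side wins the race.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

With such a helper the tick body would reduce to something like await withTimeout(processAllAgents(...), 50_000, "work_heartbeat tick"). Note that the timeout only unblocks the caller. The underlying work is not cancelled and may still complete in the background, so a late result can overlap with the next tick.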

2. Per-Agent Timeout in fetchGatewaySessions()

Add a timeout to the RPC call itself in lib/services/heartbeat/health.ts:

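export async function fetchGatewaySessions(
  agentId: string | undefined,
  runCommand: RunCommand,
): Promise<SessionLookup | null> {
  const timeoutMs = 8_000; // 8s
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const result = await Promise.race([
      runCommand({ command: "sessions", args: { action: "list", agentId } }),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('Gateway RPC timeout')), timeoutMs);
      }),
    ]);
    // ... rest of the function
  } catch (err) {
    // Assumes a `logger` is in scope here; it is not a parameter of this function.
    const message = err instanceof Error ? err.message : String(err);
    logger.warn(`Failed to fetch gateway sessions: ${message}`);
    return null; // Graceful degradation
  } finally {
    clearTimeout(timer); // avoid a stray timer when the RPC answers in time
  }
}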

3. Investigate Plugin Service Lifecycle

Why does the heartbeat service fail to start on gateway restart? Logs show plugin registration but no service startup.

Check whether the start callback passed to api.registerService() is invoked after a restart.

Impact

Severity: High

  • Heartbeat is critical for automatic worker dispatch
  • Manual intervention required every ~1 hour to restart gateway
  • Affects all DevClaw installations with active projects

Workarounds (Temporary)

  1. Scheduled gateway restart via cron: run systemctl --user restart openclaw-gateway every hour
  2. Manual task dispatch via task_start tool when heartbeat is dead

Neither workaround is sustainable for production use.


Request: Please investigate and apply a robust fix that ensures _tickRunning is always reset, even in timeout/hang scenarios. Consider adding the timeout guards proposed above.
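
One possible shape for a guard that recovers from hangs is a lock that records when it was taken and treats an old lock as stale. This is only a sketch under assumptions: TickLock, its methods and the staleness threshold are hypothetical and do not exist in DevClaw.

```typescript
// Hypothetical replacement for the boolean flag.
class TickLock {
  private holder: { token: number; acquiredAt: number } | null = null;
  private nextToken = 1;
  private readonly staleAfterMs: number;

  constructor(staleAfterMs: number) {
    this.staleAfterMs = staleAfterMs;
  }

  // Returns a token if the caller may run a tick, or null if a recent tick
  // still holds the lock. A lock at least staleAfterMs old is assumed to
  // belong to a hung tick and is taken over.
  tryAcquire(now: number = Date.now()): number | null {
    if (this.holder !== null && now - this.holder.acquiredAt < this.staleAfterMs) {
      return null;
    }
    const token = this.nextToken++;
    this.holder = { token, acquiredAt: now };
    return token;
  }

  // Releases only if the token still owns the lock, so a hung tick that
  // finishes late cannot release a lock held by a newer tick.
  release(token: number): void {
    if (this.holder?.token === token) this.holder = null;
  }
}
```

A tick would call tryAcquire() at the top, return if it gets null, and call release(token) in finally. The trade-off is that a taken-over tick is not cancelled, so its work may overlap with the new tick. Combining this with a timeout on the RPC itself keeps that window small.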

Thanks for the great work on DevClaw! 🦾
