Skip to content

fix: exempt non-autonomous agents from heartbeat inactivity timeout#708

Open
Fail-Safe wants to merge 1 commit intoRightNow-AI:mainfrom
Fail-Safe:fix/heartbeat-idle-reactive-agents
Open

fix: exempt non-autonomous agents from heartbeat inactivity timeout#708
Fail-Safe wants to merge 1 commit intoRightNow-AI:mainfrom
Fail-Safe:fix/heartbeat-idle-reactive-agents

Conversation

@Fail-Safe
Copy link
Contributor

Problem

Non-autonomous (reactive) agents were being flagged as unresponsive and crash-recovered after sitting idle for default_timeout_secs (180s), even though idle is their normal state. They wait for incoming messages and have no expected self-trigger schedule.

This caused unnecessary crash/recovery cycles for healthy agents that simply hadn't received a message in the last 3 minutes.

Root Cause

check_agents in heartbeat.rs applied the same inactivity timeout to all Running agents regardless of type. For agents without an autonomous config block, it fell back to config.default_timeout_secs (180s). A reactive agent idle for 3+ minutes would be indistinguishable from an autonomous agent that had stalled.

Fix

Make the inactivity check conditional on agent type:

  • Autonomous agents retain the heartbeat_interval_secs × 2 inactivity check — meaningful because they are expected to fire periodically.
  • Non-autonomous agents are only flagged when their state is Crashed; idle time is no longer checked at all.
// Before
let timeout_secs = entry_ref.manifest.autonomous.as_ref()
    .map(|a| a.heartbeat_interval_secs * UNRESPONSIVE_MULTIPLIER)
    .unwrap_or(config.default_timeout_secs) as i64;
let unresponsive = entry_ref.state == AgentState::Crashed || inactive_secs > timeout_secs;

// After
let timeout_secs: Option<i64> = entry_ref.manifest.autonomous.as_ref()
    .map(|a| (a.heartbeat_interval_secs * UNRESPONSIVE_MULTIPLIER) as i64);
let unresponsive = match timeout_secs {
    Some(t) => entry_ref.state == AgentState::Crashed || inactive_secs > t,
    None    => entry_ref.state == AgentState::Crashed,
};

Testing

Verified with a 5-agent setup (mix of autonomous and reactive). Before the fix, idle reactive agents were logged as unresponsive at inactive_secs=210. After the fix, they show heartbeat OK at 210s, 240s, and beyond with no crash/recovery cycles.

  • cargo build --workspace --lib passes
  • cargo clippy --workspace --all-targets -- -D warnings passes (zero warnings)
  • cargo test --workspace passes
  • Live integration tested

Reactive (non-autonomous) agents wait indefinitely for incoming messages
and have no expected self-trigger schedule. Applying an inactivity timeout
to them was incorrect — they would be flagged as unresponsive after the
default 180s simply for being idle, causing unnecessary crash/recovery
cycles.

The fix makes timeout behaviour conditional on agent type:
- Autonomous agents retain the `heartbeat_interval_secs × 2` inactivity
  check, which is meaningful because they are expected to fire periodically.
- Non-autonomous agents are only flagged when their state is `Crashed`;
  idle time is irrelevant and no longer checked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Member

@jaberjaber23 jaberjaber23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real bug, clean fix, not slop. But needs tests before merge.

The fix correctly exempts reactive agents from inactivity timeouts — they're designed to sit idle waiting for messages. However:

  1. No tests. The check_agents function is pure and trivially testable. Need at minimum: (a) reactive agent idle 5 min is NOT flagged unresponsive, (b) reactive agent in Crashed state IS flagged, (c) autonomous agent idle beyond timeout IS flagged.

  2. default_timeout_secs becomes dead code after this PR — never read at runtime. Should document why it's retained or remove it.

  3. The warn! macro now logs timeout_secs=Some(60) instead of timeout_secs=60. Should unwrap the Some in the log since it's guaranteed at that point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants