fix: exempt non-autonomous agents from heartbeat inactivity timeout #708
Open
Fail-Safe wants to merge 1 commit into RightNow-AI:main from
Reactive (non-autonomous) agents wait indefinitely for incoming messages and have no expected self-trigger schedule. Applying an inactivity timeout to them was incorrect: they would be flagged as unresponsive after the default 180s simply for being idle, causing unnecessary crash/recovery cycles.

The fix makes timeout behaviour conditional on agent type:

- Autonomous agents retain the `heartbeat_interval_secs × 2` inactivity check, which is meaningful because they are expected to fire periodically.
- Non-autonomous agents are only flagged when their state is `Crashed`; idle time is irrelevant and no longer checked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
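The conditional check described above can be sketched roughly as follows. This is a minimal model, not the code from `heartbeat.rs`: the `AgentSnapshot` struct, the two-variant `AgentState` enum, and both function names are hypothetical, and the strict `>` comparison and the handling of crashed autonomous agents are assumptions.

```rust
/// Simplified agent state; the real enum may have more variants (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AgentState {
    Running,
    Crashed,
}

/// Hypothetical view of what the heartbeat check needs to know about one agent.
struct AgentSnapshot {
    state: AgentState,
    /// `Some(heartbeat_interval_secs)` for autonomous agents, `None` for reactive ones.
    heartbeat_interval_secs: Option<u64>,
    /// Seconds since the agent was last active.
    inactive_secs: u64,
}

/// Inactivity timeout for an agent, or `None` when idle time should not be checked.
fn inactivity_timeout_secs(agent: &AgentSnapshot) -> Option<u64> {
    // Autonomous agents: heartbeat_interval_secs × 2. Reactive agents: no timeout.
    agent
        .heartbeat_interval_secs
        .map(|interval| interval.saturating_mul(2))
}

/// Returns true when the agent should be flagged as unresponsive.
fn is_unresponsive(agent: &AgentSnapshot) -> bool {
    // Assumption: a Crashed agent is flagged regardless of type.
    if agent.state == AgentState::Crashed {
        return true;
    }
    match inactivity_timeout_secs(agent) {
        // Autonomous: flagged only once idle time exceeds the timeout.
        Some(timeout) => agent.inactive_secs > timeout,
        // Reactive: idle time is never a reason to flag.
        None => false,
    }
}
```

Modelling the timeout as an `Option<u64>` keeps the "no timeout applies" case explicit instead of encoding it as a very large number.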
jaberjaber23 (Member) requested changes on Mar 19, 2026 and left a comment:
Real bug, clean fix, not slop. But needs tests before merge.
The fix correctly exempts reactive agents from inactivity timeouts — they're designed to sit idle waiting for messages. However:
- No tests. The `check_agents` function is pure and trivially testable. Need at minimum: (a) reactive agent idle 5 min is NOT flagged unresponsive, (b) reactive agent in `Crashed` state IS flagged, (c) autonomous agent idle beyond timeout IS flagged.
- `default_timeout_secs` becomes dead code after this PR: it is never read at runtime. Should document why it's retained or remove it.
- The `warn!` macro now logs `timeout_secs=Some(60)` instead of `timeout_secs=60`. Should unwrap the `Some` in the log since it's guaranteed at that point.
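The last point can be shown in isolation. The sketch below assumes the timeout is held as an `Option<u64>`; `format!` stands in for the `warn!` call, and both function names are hypothetical.

```rust
/// Formats the timeout with `{:?}`, which keeps the `Some(..)` wrapper in the output.
fn log_line_wrapped(timeout_secs: Option<u64>) -> String {
    format!("timeout_secs={:?}", timeout_secs)
}

/// Binds the inner value first, so the message carries the bare number.
/// Returns `None` when there is no timeout, i.e. nothing to log.
fn log_line_unwrapped(timeout_secs: Option<u64>) -> Option<String> {
    // `map` extracts the value without calling `unwrap()`, so a missing
    // timeout cannot panic here.
    timeout_secs.map(|t| format!("timeout_secs={}", t))
}
```

With `Some(60)` as input, the first function yields `timeout_secs=Some(60)` and the second yields `timeout_secs=60`. In the real code the same effect would likely come from binding the value with `if let Some(t) = ...` in the branch that emits the warning.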
Problem
Non-autonomous (reactive) agents were being flagged as unresponsive and crash-recovered after sitting idle for `default_timeout_secs` (180s), even though idle is their normal state. They wait for incoming messages and have no expected self-trigger schedule.
This caused unnecessary crash/recovery cycles for healthy agents that simply hadn't received a message in the last 3 minutes.
Root Cause
`check_agents` in `heartbeat.rs` applied the same inactivity timeout to all Running agents regardless of type. For agents without an `autonomous` config block, it fell back to `config.default_timeout_secs` (180s). A reactive agent idle for 3+ minutes would be indistinguishable from an autonomous agent that had stalled.
Fix
Make the inactivity check conditional on agent type:
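- Autonomous agents retain the `heartbeat_interval_secs × 2` inactivity check, which is meaningful because they are expected to fire periodically.
- Non-autonomous agents are only flagged when their state is `Crashed`; idle time is no longer checked at all.
Testing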
Verified with a 5-agent setup (mix of autonomous and reactive). Before the fix, idle reactive agents were logged as unresponsive at `inactive_secs=210`. After the fix, they show `heartbeat OK` at 210s, 240s, and beyond with no crash/recovery cycles.
- `cargo build --workspace --lib` passes
- `cargo clippy --workspace --all-targets -- -D warnings` passes (zero warnings)
- `cargo test --workspace` passes
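The 210s observation follows directly from the arithmetic of the two timeout rules. A small model of the before and after selection, with hypothetical function names and an assumed strict `>` comparison, makes that concrete:

```rust
/// Pre-fix selection (simplified): reactive agents, which have no interval,
/// fall back to the default timeout.
fn timeout_before_fix(heartbeat_interval_secs: Option<u64>, default_timeout_secs: u64) -> u64 {
    heartbeat_interval_secs
        .map(|interval| interval.saturating_mul(2))
        .unwrap_or(default_timeout_secs)
}

/// Post-fix selection (simplified): reactive agents get no inactivity timeout at all.
fn timeout_after_fix(heartbeat_interval_secs: Option<u64>) -> Option<u64> {
    heartbeat_interval_secs.map(|interval| interval.saturating_mul(2))
}

/// Whether an idle period exceeds the timeout; a `None` timeout never trips.
fn idle_too_long(inactive_secs: u64, timeout_secs: Option<u64>) -> bool {
    timeout_secs.map_or(false, |t| inactive_secs > t)
}
```

For a reactive agent (`None` interval) with the 180s default, `timeout_before_fix` returns 180, so 210s of idle time trips the check. `timeout_after_fix` returns `None`, so neither 210s nor 240s does.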