fix: stamp last_active before LLM call to prevent mid-iteration heartbeat timeouts #709
Open
Fail-Safe wants to merge 1 commit into RightNow-AI:main
…beat timeouts

Slow local models (e.g. 27B quantised MLX models) can take 3–4+ minutes per iteration, well beyond the default 180s heartbeat timeout. Because last_active was only updated at the end of an iteration, never during it, the heartbeat monitor would flag the agent as unresponsive mid-call and initiate crash/recovery while the loop was still running correctly.

Changes:
- Add `touch()` to `AgentRegistry`: refreshes `last_active` with no other side-effects.
- Add `touch_agent(&self, agent_id: &str)` to `KernelHandle` trait with a default no-op, so existing mock implementations require no changes.
- Implement `touch_agent` on `OpenFangKernel`: parses the UUID and delegates to `registry.touch()`.
- Call `kernel.touch_agent(agent_id)` at the top of each agent loop iteration, immediately before the `call_with_retry` LLM call. This resets the inactivity clock at the start of every iteration rather than only at completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jaberjaber23 (Member) requested changes on Mar 19, 2026
Real bug, clean fix, not slop. But needs tests before merge.
Stamping last_active before LLM call correctly prevents false Crashed state on slow local models. The fix only fires at loop iteration start so it won't mask genuinely stuck agents.
Blocking issue: no tests. AgentRegistry::touch() is trivially testable — register agent, set last_active to 5 min ago, call touch(), verify last_active is now recent. The project has 2074+ tests; this should meet that standard.
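The test described in the review might look roughly like the sketch below. It is not the project's code: the `AgentRegistry` here is a hypothetical stand-in that tracks only `last_active`, IDs are plain strings rather than UUIDs, and the `now` parameters are an assumption added so the test can simulate "5 min ago" deterministically (the real `touch()` presumably reads the clock itself).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real registry; it models only `last_active`.
#[derive(Default)]
pub struct AgentRegistry {
    last_active: HashMap<String, Instant>,
}

impl AgentRegistry {
    /// Register an agent and stamp it as active at `now`.
    pub fn register(&mut self, id: &str, now: Instant) {
        self.last_active.insert(id.to_string(), now);
    }

    /// Refresh `last_active` only; unknown IDs are silently ignored.
    pub fn touch(&mut self, id: &str, now: Instant) {
        if let Some(ts) = self.last_active.get_mut(id) {
            *ts = now;
        }
    }

    /// Whole seconds since the agent was last active, if it is registered.
    pub fn inactive_secs(&self, id: &str, now: Instant) -> Option<u64> {
        self.last_active
            .get(id)
            .map(|ts| now.saturating_duration_since(*ts).as_secs())
    }
}

#[test]
fn touch_refreshes_last_active() {
    let registered_at = Instant::now();
    let mut registry = AgentRegistry::default();
    registry.register("agent-1", registered_at);

    // Five minutes later the agent looks stale (past a 180s timeout).
    let later = registered_at + Duration::from_secs(300);
    assert_eq!(registry.inactive_secs("agent-1", later), Some(300));

    // After touch() the inactivity clock is back to zero.
    registry.touch("agent-1", later);
    assert_eq!(registry.inactive_secs("agent-1", later), Some(0));

    // Touching an unknown ID neither panics nor creates an entry.
    registry.touch("missing", later);
    assert_eq!(registry.inactive_secs("missing", later), None);
}
```

Passing the timestamp in keeps the test independent of wall-clock time; a test against the real registry would instead backdate `last_active` directly, as the review suggests.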
Problem
Agents running slow local models (e.g. 27B quantised MLX models running on-device) can take 3–4+ minutes per loop iteration. The heartbeat monitor's default `inactive_secs` timeout is 180s. Because `last_active` was only stamped at the end of an iteration, after the LLM call returns, the heartbeat would fire mid-call and flag the agent as unresponsive, triggering crash/recovery while the loop was still running correctly.

Root Cause
`last_active` updates happen as a side-effect of state/data mutations in `AgentRegistry` (e.g. `set_state`, `update_model`). There was no mechanism for the agent loop to signal liveness during a long-running LLM call: the only updates were at iteration boundaries, not within one.

Fix
Add a lightweight `touch_agent` callback through the existing `KernelHandle` trait so the agent loop can refresh `last_active` without mutating any real state:

- `AgentRegistry::touch(id)`: updates `last_active` only, no other side-effects. Silently ignores unknown IDs (non-blocking).
- `KernelHandle::touch_agent(&str)`: new trait method with a default no-op, so all existing mock/test implementations require zero changes.
- `OpenFangKernel::touch_agent`: parses the UUID string and delegates to `registry.touch()`.
- `agent_loop`: calls `kernel.touch_agent(agent_id)` at the top of each iteration, immediately before `call_with_retry`, resetting the inactivity clock at the start of every LLM call rather than only at completion.

Impact
`KernelHandle` default no-op means downstream forks/embedders that implement the trait don't need any changes.

Testing
Verified on a local 27B 4-bit MLX model. Before the fix, agents were killed and restarted mid-loop. After the fix, `inactive_secs` resets to ~0 at the start of each iteration and the loop completes without interruption.

- `cargo build --workspace --lib` passes
- `cargo clippy --workspace --all-targets -- -D warnings` passes (zero warnings)
- `cargo test --workspace` passes
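For readers unfamiliar with the pattern, the default no-op trait method and its call site in the loop can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the trait is reduced to the single new method, `MockKernel` and `CountingKernel` are hypothetical implementors, and the loop is synchronous with the LLM call replaced by a closure (the real `agent_loop` and `call_with_retry` have different signatures).

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Simplified stand-in for the `KernelHandle` trait (one method only).
pub trait KernelHandle {
    /// Default no-op: implementors written before this method existed
    /// keep compiling without any change.
    fn touch_agent(&self, _agent_id: &str) {}
}

/// Hypothetical pre-existing mock that does not override `touch_agent`.
pub struct MockKernel;
impl KernelHandle for MockKernel {}

/// Hypothetical kernel that overrides the hook; here it only counts calls,
/// where a real kernel would delegate to its registry.
#[derive(Default)]
pub struct CountingKernel {
    touches: AtomicU32,
}

impl CountingKernel {
    /// Number of times `touch_agent` has been called.
    pub fn touch_count(&self) -> u32 {
        self.touches.load(Ordering::SeqCst)
    }
}

impl KernelHandle for CountingKernel {
    fn touch_agent(&self, _agent_id: &str) {
        self.touches.fetch_add(1, Ordering::SeqCst);
    }
}

/// Assumed loop shape: signal liveness first, then make the slow call.
pub fn agent_loop<K: KernelHandle>(
    kernel: &K,
    agent_id: &str,
    iterations: u32,
    mut call_llm: impl FnMut(u32) -> String,
) -> Vec<String> {
    let mut replies = Vec::new();
    for i in 0..iterations {
        // Reset the inactivity clock before the potentially multi-minute call.
        kernel.touch_agent(agent_id);
        replies.push(call_llm(i));
    }
    replies
}
```

With this shape, `agent_loop(&MockKernel, ...)` runs unchanged, while a kernel that overrides `touch_agent` is notified exactly once per iteration.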