
fix: stamp last_active before LLM call to prevent mid-iteration heartbeat timeouts#709

Open
Fail-Safe wants to merge 1 commit into RightNow-AI:main from Fail-Safe:fix/touch-agent-before-llm-call

Conversation

@Fail-Safe (Contributor)

Problem

Agents running slow local models (e.g. 27B quantised MLX models running on-device) can take 3–4+ minutes per loop iteration. The heartbeat monitor's default inactive_secs timeout is 180s. Because last_active was only stamped at the end of an iteration — after the LLM call returns — the heartbeat would fire mid-call and flag the agent as unresponsive, triggering crash/recovery while the loop was still running correctly.

Root Cause

last_active updates happen as a side-effect of state/data mutations in AgentRegistry (e.g. set_state, update_model). There was no mechanism for the agent loop to signal liveness during a long-running LLM call — the only updates were at iteration boundaries, not within one.
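To make the failure mode concrete, here is a minimal Rust sketch of the heartbeat check described above. The struct and function names are illustrative assumptions, not the project's actual API; only the `last_active` / `inactive_secs` semantics come from the PR.

```rust
use std::time::{Duration, Instant};

// Stand-in for a registry entry; the real AgentRegistry stores more state.
struct AgentEntry {
    last_active: Instant,
}

// The heartbeat monitor's core test: has the agent gone quiet too long?
fn is_unresponsive(entry: &AgentEntry, inactive_secs: u64) -> bool {
    // If last_active is only stamped at iteration *boundaries*, any single
    // LLM call longer than `inactive_secs` trips this check mid-iteration.
    entry.last_active.elapsed() > Duration::from_secs(inactive_secs)
}

fn main() {
    // 200s since the last stamp exceeds the 180s default: falsely flagged
    // even though the agent is still inside a healthy LLM call.
    let entry = AgentEntry {
        last_active: Instant::now() - Duration::from_secs(200),
    };
    assert!(is_unresponsive(&entry, 180));
    println!("agent falsely flagged unresponsive");
}
```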

Fix

Add a lightweight touch_agent callback through the existing KernelHandle trait so the agent loop can refresh last_active without mutating any real state:

  1. AgentRegistry::touch(id) — updates last_active only, no other side-effects. Silently ignores unknown IDs (non-blocking).
  2. KernelHandle::touch_agent(&str) — new trait method with a default no-op, so all existing mock/test implementations require zero changes.
  3. OpenFangKernel::touch_agent — parses the UUID string and delegates to registry.touch().
  4. agent_loop — calls kernel.touch_agent(agent_id) at the top of each iteration, immediately before call_with_retry, resetting the inactivity clock at the start of every LLM call rather than only at completion.
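The four steps above can be sketched together as follows. This is a self-contained approximation under stated assumptions: field layouts, the `Mutex` around the registry, and the use of `String` keys (the real code parses them into UUIDs) are stand-ins; the names `AgentRegistry::touch`, `KernelHandle::touch_agent`, and the loop-top call site come from the PR.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

// (1) AgentRegistry::touch — refreshes last_active only, no other side-effects.
struct AgentRegistry {
    last_active: HashMap<String, Instant>, // keyed by agent UUID in the real code
}

impl AgentRegistry {
    fn touch(&mut self, id: &str) {
        if let Some(stamp) = self.last_active.get_mut(id) {
            *stamp = Instant::now();
        } // unknown IDs are silently ignored (non-blocking)
    }
}

// (2) KernelHandle::touch_agent — default no-op, so mocks and downstream
// implementors keep compiling unchanged.
trait KernelHandle {
    fn touch_agent(&self, _agent_id: &str) {}
}

// (3) Kernel override delegates to the registry (the real kernel first
// parses the &str into a UUID).
struct OpenFangKernel {
    registry: Mutex<AgentRegistry>,
}

impl KernelHandle for OpenFangKernel {
    fn touch_agent(&self, agent_id: &str) {
        self.registry.lock().unwrap().touch(agent_id);
    }
}

// (4) Agent loop: stamp liveness at the top of each iteration, immediately
// before the potentially minutes-long LLM call.
fn agent_loop_iteration(kernel: &dyn KernelHandle, agent_id: &str) {
    kernel.touch_agent(agent_id); // resets the inactivity clock
    // call_with_retry(...) would run here
}

fn main() {
    // Simulate an agent whose last stamp is 5 minutes old.
    let mut reg = AgentRegistry { last_active: HashMap::new() };
    reg.last_active
        .insert("agent-1".into(), Instant::now() - Duration::from_secs(300));
    let kernel = OpenFangKernel { registry: Mutex::new(reg) };

    agent_loop_iteration(&kernel, "agent-1");
    let stamp = kernel.registry.lock().unwrap().last_active["agent-1"];
    assert!(stamp.elapsed() < Duration::from_secs(5));
    println!("last_active refreshed at iteration start");
}
```

The default trait method in (2) is what keeps the change non-breaking: any existing `KernelHandle` implementor inherits the no-op and opts in only if it wants liveness tracking.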

Impact

  • Remote API providers (OpenAI, Anthropic, Groq) are rarely affected since calls complete in seconds; for them the extra touch is negligible overhead.
  • Local model users (mlx-lm, llama.cpp via OpenAI-compat, Ollama) benefit directly: a 4-minute 27B model call no longer triggers false crash detection.
  • The KernelHandle default no-op means downstream forks/embedders that implement the trait don't need any changes.

Testing

Verified on a local 27B 4-bit MLX model. Before the fix, agents were killed and restarted mid-loop. After the fix, inactive_secs resets to ~0 at the start of each iteration and the loop completes without interruption.

  • cargo build --workspace --lib passes
  • cargo clippy --workspace --all-targets -- -D warnings passes (zero warnings)
  • cargo test --workspace passes
  • Live integration tested with slow local LLM (27B MLX model, ~3–4 min/iteration)

fix: stamp last_active before LLM call to prevent mid-iteration heartbeat timeouts

Slow local models (e.g. 27B quantised MLX models) can take 3–4+ minutes
per iteration, well beyond the default 180s heartbeat timeout. Because
last_active was only updated at the end of an iteration — never during it —
the heartbeat monitor would flag the agent as unresponsive mid-call and
initiate crash/recovery while the loop was still running correctly.

Changes:
- Add `touch()` to `AgentRegistry`: refreshes `last_active` with no other
  side-effects.
- Add `touch_agent(&self, agent_id: &str)` to `KernelHandle` trait with a
  default no-op, so existing mock implementations require no changes.
- Implement `touch_agent` on `OpenFangKernel`: parses the UUID and
  delegates to `registry.touch()`.
- Call `kernel.touch_agent(agent_id)` at the top of each agent loop
  iteration, immediately before the `call_with_retry` LLM call. This
  resets the inactivity clock at the start of every iteration rather than
  only at completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jaberjaber23 (Member) left a comment


Real bug, clean fix, not slop. But needs tests before merge.

Stamping last_active before the LLM call correctly prevents the false Crashed state on slow local models. Because the touch only fires at the start of each loop iteration, it won't mask genuinely stuck agents.

Blocking issue: no tests. AgentRegistry::touch() is trivially testable — register agent, set last_active to 5 min ago, call touch(), verify last_active is now recent. The project has 2074+ tests; this should meet that standard.
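The test the review asks for might look like the sketch below, written against a stand-in registry (the real `AgentRegistry`'s fields and constructor will differ; only the register/backdate/touch/verify shape is from the comment).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Stand-in for the real AgentRegistry; touch() matches the PR's contract.
struct AgentRegistry {
    last_active: HashMap<String, Instant>,
}

impl AgentRegistry {
    fn touch(&mut self, id: &str) {
        if let Some(stamp) = self.last_active.get_mut(id) {
            *stamp = Instant::now();
        } // unknown IDs are silently ignored
    }
}

fn main() {
    // Register an agent and backdate its stamp by five minutes.
    let mut reg = AgentRegistry { last_active: HashMap::new() };
    reg.last_active
        .insert("agent-1".into(), Instant::now() - Duration::from_secs(300));

    reg.touch("agent-1");

    // last_active must now be recent again.
    assert!(reg.last_active["agent-1"].elapsed() < Duration::from_secs(5));

    // Touching an unknown ID must not panic or insert anything.
    reg.touch("no-such-agent");
    assert_eq!(reg.last_active.len(), 1);
    println!("touch() test passed");
}
```

In the real test suite this would be a `#[test]` function alongside the other `AgentRegistry` tests rather than a `main`.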
