
fix: stamp last_active before LLM call to prevent mid-iteration heartbeat timeouts#709

Open
Fail-Safe wants to merge 1 commit into RightNow-AI:main from Fail-Safe:fix/touch-agent-before-llm-call

Conversation

@Fail-Safe (Contributor)

Problem

Agents running slow local models (e.g. 27B quantised MLX models running on-device) can take 3–4+ minutes per loop iteration. The heartbeat monitor's default inactive_secs timeout is 180s. Because last_active was only stamped at the end of an iteration — after the LLM call returns — the heartbeat would fire mid-call and flag the agent as unresponsive, triggering crash/recovery while the loop was still running correctly.

Root Cause

last_active updates happen as a side-effect of state/data mutations in AgentRegistry (e.g. set_state, update_model). There was no mechanism for the agent loop to signal liveness during a long-running LLM call — the only updates were at iteration boundaries, not within one.
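To make the failure mode concrete, here is a minimal Rust sketch of the heartbeat check described above. The struct and function names are illustrative assumptions, not the project's actual API; only the `last_active` / `inactive_secs` semantics come from the PR.

```rust
use std::time::{Duration, Instant};

// Stand-in for a registry entry; the real AgentRegistry stores more state.
struct AgentEntry {
    last_active: Instant,
}

// The heartbeat monitor's core test: has the agent gone quiet too long?
fn is_unresponsive(entry: &AgentEntry, inactive_secs: u64) -> bool {
    // If last_active is only stamped at iteration *boundaries*, any single
    // LLM call longer than `inactive_secs` trips this check mid-iteration.
    entry.last_active.elapsed() > Duration::from_secs(inactive_secs)
}

fn main() {
    // 200s since the last stamp exceeds the 180s default: falsely flagged
    // even though the agent is still inside a healthy LLM call.
    let entry = AgentEntry {
        last_active: Instant::now() - Duration::from_secs(200),
    };
    assert!(is_unresponsive(&entry, 180));
    println!("agent falsely flagged unresponsive");
}
```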

Fix

Add a lightweight touch_agent callback through the existing KernelHandle trait so the agent loop can refresh last_active without mutating any real state:

  1. AgentRegistry::touch(id) — updates last_active only, no other side-effects. Silently ignores unknown IDs (non-blocking).
  2. KernelHandle::touch_agent(&str) — new trait method with a default no-op, so all existing mock/test implementations require zero changes.
  3. OpenFangKernel::touch_agent — parses the UUID string and delegates to registry.touch().
  4. agent_loop — calls kernel.touch_agent(agent_id) at the top of each iteration, immediately before call_with_retry, resetting the inactivity clock at the start of every LLM call rather than only at completion.
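The four steps above can be sketched together as follows. This is a self-contained approximation under stated assumptions: field layouts, the `Mutex` around the registry, and the use of `String` keys (the real code parses them into UUIDs) are stand-ins; the names `AgentRegistry::touch`, `KernelHandle::touch_agent`, and the loop-top call site come from the PR.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

// (1) AgentRegistry::touch — refreshes last_active only, no other side-effects.
struct AgentRegistry {
    last_active: HashMap<String, Instant>, // keyed by agent UUID in the real code
}

impl AgentRegistry {
    fn touch(&mut self, id: &str) {
        if let Some(stamp) = self.last_active.get_mut(id) {
            *stamp = Instant::now();
        } // unknown IDs are silently ignored (non-blocking)
    }
}

// (2) KernelHandle::touch_agent — default no-op, so mocks and downstream
// implementors keep compiling unchanged.
trait KernelHandle {
    fn touch_agent(&self, _agent_id: &str) {}
}

// (3) Kernel override delegates to the registry (the real kernel first
// parses the &str into a UUID).
struct OpenFangKernel {
    registry: Mutex<AgentRegistry>,
}

impl KernelHandle for OpenFangKernel {
    fn touch_agent(&self, agent_id: &str) {
        self.registry.lock().unwrap().touch(agent_id);
    }
}

// (4) Agent loop: stamp liveness at the top of each iteration, immediately
// before the potentially minutes-long LLM call.
fn agent_loop_iteration(kernel: &dyn KernelHandle, agent_id: &str) {
    kernel.touch_agent(agent_id); // resets the inactivity clock
    // call_with_retry(...) would run here
}

fn main() {
    // Simulate an agent whose last stamp is 5 minutes old.
    let mut reg = AgentRegistry { last_active: HashMap::new() };
    reg.last_active
        .insert("agent-1".into(), Instant::now() - Duration::from_secs(300));
    let kernel = OpenFangKernel { registry: Mutex::new(reg) };

    agent_loop_iteration(&kernel, "agent-1");
    let stamp = kernel.registry.lock().unwrap().last_active["agent-1"];
    assert!(stamp.elapsed() < Duration::from_secs(5));
    println!("last_active refreshed at iteration start");
}
```

The default trait method in (2) is what keeps the change non-breaking: any existing `KernelHandle` implementor inherits the no-op and opts in only if it wants liveness tracking.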

Impact

  • Remote API providers (OpenAI, Anthropic, Groq) are rarely affected since calls complete in seconds; for them the extra touch is negligible overhead.
  • Local model users (mlx-lm, llama.cpp via OpenAI-compat, Ollama) benefit directly: a 4-minute 27B model call no longer triggers false crash detection.
  • The KernelHandle default no-op means downstream forks/embedders that implement the trait don't need any changes.

Testing

Verified on a local 27B 4-bit MLX model. Before the fix, agents were killed and restarted mid-loop. After the fix, inactive_secs resets to ~0 at the start of each iteration and the loop completes without interruption.

  • cargo build --workspace --lib passes
  • cargo clippy --workspace --all-targets -- -D warnings passes (zero warnings)
  • cargo test --workspace passes
  • Live integration tested with slow local LLM (27B MLX model, ~3–4 min/iteration)

fix: stamp last_active before LLM call to prevent mid-iteration heartbeat timeouts

Slow local models (e.g. 27B quantised MLX models) can take 3–4+ minutes
per iteration, well beyond the default 180s heartbeat timeout. Because
last_active was only updated at the end of an iteration — never during it —
the heartbeat monitor would flag the agent as unresponsive mid-call and
initiate crash/recovery while the loop was still running correctly.

Changes:
- Add `touch()` to `AgentRegistry`: refreshes `last_active` with no other
  side-effects.
- Add `touch_agent(&self, agent_id: &str)` to `KernelHandle` trait with a
  default no-op, so existing mock implementations require no changes.
- Implement `touch_agent` on `OpenFangKernel`: parses the UUID and
  delegates to `registry.touch()`.
- Call `kernel.touch_agent(agent_id)` at the top of each agent loop
  iteration, immediately before the `call_with_retry` LLM call. This
  resets the inactivity clock at the start of every iteration rather than
  only at completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jaberjaber23 (Member) left a comment


Real bug, clean fix, not slop. But needs tests before merge.

Stamping last_active before the LLM call correctly prevents the false Crashed state on slow local models. Because the touch only fires at the start of each loop iteration, it won't mask genuinely stuck agents.

Blocking issue: no tests. AgentRegistry::touch() is trivially testable — register agent, set last_active to 5 min ago, call touch(), verify last_active is now recent. The project has 2074+ tests; this should meet that standard.
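The test the review asks for might look like the sketch below, written against a stand-in registry (the real `AgentRegistry`'s fields and constructor will differ; only the register/backdate/touch/verify shape is from the comment).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Stand-in for the real AgentRegistry; touch() matches the PR's contract.
struct AgentRegistry {
    last_active: HashMap<String, Instant>,
}

impl AgentRegistry {
    fn touch(&mut self, id: &str) {
        if let Some(stamp) = self.last_active.get_mut(id) {
            *stamp = Instant::now();
        } // unknown IDs are silently ignored
    }
}

fn main() {
    // Register an agent and backdate its stamp by five minutes.
    let mut reg = AgentRegistry { last_active: HashMap::new() };
    reg.last_active
        .insert("agent-1".into(), Instant::now() - Duration::from_secs(300));

    reg.touch("agent-1");

    // last_active must now be recent again.
    assert!(reg.last_active["agent-1"].elapsed() < Duration::from_secs(5));

    // Touching an unknown ID must not panic or insert anything.
    reg.touch("no-such-agent");
    assert_eq!(reg.last_active.len(), 1);
    println!("touch() test passed");
}
```

In the real test suite this would be a `#[test]` function alongside the other `AgentRegistry` tests rather than a `main`.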
