fix(engine): emit heartbeat on interval regardless of output growth #60

Open
mvanhorn wants to merge 3 commits into danshapiro:main from mvanhorn:osc/51-heartbeat-output-gated

@mvanhorn
Contributor

Fixes #51

Summary

The stage_heartbeat emission was gated on stdout/stderr growth, so quiet-but-active phases (like npm install downloading) produced sparse or missing heartbeat events. This made healthy stages appear stalled in monitoring.

Changes

  • progress.go: Added appendProgressLivenessOnly() - writes events to progress.ndjson for observability without resetting the stall watchdog timer. This is the key to decoupling heartbeat observability from stall detection.

  • codergen_router.go (CLI path, ~line 1162): Heartbeat goroutine now emits on every ticker interval. It uses appendProgress when output has grown (resets the stall timer) and appendProgressLivenessOnly when output is static (preserves stall detection). Added since_last_output_s field.

  • codergen_router.go (API path, ~line 330): Same pattern - always emits heartbeat, uses liveness-only when event count hasn't changed. Added since_last_output_s field.

  • codergen_heartbeat_test.go: Added TestRunWithConfig_HeartbeatEmitsDuringQuietPeriods - verifies heartbeats emit during a 3-second quiet period and include the since_last_output_s field. Updated stall watchdog test comments.

Design

The fix separates two concerns:

  1. Observability (always emit heartbeats) - operators can see the stage is alive
  2. Stall detection (only reset timer on actual progress) - watchdog still fires for truly stalled processes

Test plan

  • TestRunWithConfig_HeartbeatEmitsDuringQuietPeriods - heartbeats during quiet periods with since_last_output_s
  • TestRunWithConfig_HeartbeatEmitsDuringCodergen - existing test still passes
  • TestRunWithConfig_APIBackend_StallWatchdogFiresDespiteHeartbeatGoroutine - watchdog still fires
  • TestRunWithConfig_CLIBackend_StallWatchdogFiresDespiteHeartbeatGoroutine - watchdog still fires
  • TestRunWithConfig_HeartbeatStopsAfterProcessExit - heartbeats stop after process exit

This contribution was developed with AI assistance (Claude Code).

Decouples heartbeat emission from output-growth gating so active stages
produce predictable liveness signals even during quiet periods. Adds
appendProgressLivenessOnly to write events without resetting the stall
watchdog timer, preserving correct stall detection while improving
observability. Heartbeat events now include since_last_output_s for
distinguishing "alive but quiet" from "alive and producing output".
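
For readers consuming progress.ndjson, a monitoring tool might use since_last_output_s along these lines. This is a sketch under assumptions: the "event" key, the heartbeat struct, the classify function, and the quietAfterS threshold are hypothetical and not part of this change.

```go
package main

import (
	"encoding/json"
)

// heartbeat mirrors only the fields this sketch needs from one NDJSON line.
// The "event" key name is an assumption; since_last_output_s is from the PR.
type heartbeat struct {
	Event            string  `json:"event"`
	SinceLastOutputS float64 `json:"since_last_output_s"`
}

// classify labels one progress line. quietAfterS is a caller-chosen
// threshold in seconds for treating a stage as "alive but quiet".
func classify(line []byte, quietAfterS float64) (string, error) {
	var hb heartbeat
	if err := json.Unmarshal(line, &hb); err != nil {
		return "", err
	}
	if hb.Event != "stage_heartbeat" {
		return "other", nil
	}
	if hb.SinceLastOutputS >= quietAfterS {
		return "alive-quiet", nil
	}
	return "alive-active", nil
}
```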

Fixes danshapiro#51

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mvanhorn and others added 2 commits March 9, 2026 20:40
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…i/ paths

- Add gpt-5.3-codex-spark to cliOnlyModelIDs map (fixes TestIsCLIOnlyModel)
- Replace root .ai/verify_errors.log and .ai/test-evidence/latest/ with
  run-scoped .ai/runs/$KILROY_RUN_ID/ paths in demo spec (fixes
  TestReferenceSurfaces_NoLegacyRootAIScratchPaths)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

stage heartbeat is output-gated, causing false stall perception during quiet long-running operations