fix(stop-hooks): circuit breaker to stop orchestrator LLM-turn-polling runaway (#29)#45
Conversation
…bmad-code-org#29) The story-automator Stop hook read its stdin and discarded it, so it never inspected `stop_hook_active` and blocked unconditionally while stories remained. On a long-lived/resumed orchestrator this fought Claude Code's "go idle, get woken on completion" model: every stop attempt was re-blocked until the harness force-ended the turn at CLAUDE_CODE_STOP_HOOK_BLOCK_CAP, and each empty "Holding" turn replayed the whole growing transcript. Issue bmad-code-org#29 measured the result: 1,599 turns / 478M cache_read tokens in one resumed session, a 5-hour quota window burned in ~30 minutes. Fixes: 1. Stop-hook circuit breaker (commands/basic.py): parse the hook input, honor `stop_hook_active`, and count consecutive blocks WITHOUT step progress in the marker. After STORY_AUTOMATOR_MAX_STOP_BLOCKS (default 5, below the harness cap of 9) release the stop with a systemMessage instead of busy-waiting. Progress = a story completed or the marker heartbeat bumped at a verified step (wired into step-03-execute.md). 2. Reload anti-polling guidance on resume: add monitoring-pattern.md to the LOAD-ONCE set in data-file-index.md and mandate reading it in step-03-execute.md and step-01b-continue.md before any tmux interaction, so a Stop-hook resume cannot improvise per-turn cat/capture-pane polling. 3. Cost ceiling in stop-hook-recovery.md: replace unbounded "do whatever it takes" with an explicit ceiling and forbidden-polling decision rows. 4. Tighten the false-complete contract (commands/tmux.py): monitor-session re-confirms a completed-but-unverified session in-process up to --completion-rechecks (default 3) before returning `incomplete`, absorbing false-completes inside the single blocking call. A broken verifier contract still returns immediately so it escalates. Adds tests for the circuit breaker (test_stop_hooks.py) and the monitor-session re-poll (test_monitor_session.py).
WalkthroughAdds a Stop-hook circuit breaker to ChangesStop-hook circuit breaker and monitor-session false-complete rechecking
Sequence Diagram(s)sequenceDiagram
participant StopHook as Stop Hook (cmd_stop_hook)
participant Marker as Active Marker (JSON)
participant Monitor as monitor-session (cmd_monitor_session)
participant Verifier as _verify_monitor_completion
StopHook->>Marker: read stopHookBlocks, heartbeat, storiesRemaining
alt progress detected (heartbeat changed or storiesRemaining decreased)
StopHook->>Marker: reset stopHookBlocks=0
StopHook-->>StopHook: block turn (normal)
else no progress, blocks < max
StopHook->>Marker: increment stopHookBlocks
StopHook-->>StopHook: block turn
else blocks >= STORY_AUTOMATOR_MAX_STOP_BLOCKS
StopHook->>Marker: reset stopHookBlocks=0, persist
StopHook-->>StopHook: emit systemMessage auto-pause
end
Note over Monitor,Verifier: On next execution turn
Monitor->>Verifier: verify artifact on completed state
alt verified immediately
Monitor-->>StopHook: final_state=completed
else unverified, rechecks > 0
loop up to completion_rechecks
Monitor->>Monitor: sleep RECHECK_GRACE_SECONDS
Monitor->>Verifier: re-verify
end
Monitor-->>StopHook: final_state=completed OR incomplete
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@skills/bmad-story-automator/src/story_automator/commands/basic.py`:
- Around line 240-246: The `_write_marker` function currently silently swallows
OSError exceptions, which prevents marker state from being persisted and can
cause the circuit breaker to stall permanently by always returning "block" on
subsequent invocations. Instead of silently passing on OSError, log the error
with details about what failed (include the path and error information) and then
re-raise the exception so that the caller can decide how to handle the
persistence failure appropriately, allowing the breaker to fail open rather than
remain blocked indefinitely.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 58d03866-e3c9-4d5e-8916-a6dbcbf4a84b
📒 Files selected for processing (9)
skills/bmad-story-automator/data/data-file-index.mdskills/bmad-story-automator/data/monitoring-pattern.mdskills/bmad-story-automator/data/stop-hook-recovery.mdskills/bmad-story-automator/src/story_automator/commands/basic.pyskills/bmad-story-automator/src/story_automator/commands/tmux.pyskills/bmad-story-automator/steps-c/step-01b-continue.mdskills/bmad-story-automator/steps-c/step-03-execute.mdtests/test_monitor_session.pytests/test_stop_hooks.py
| def _write_marker(path: Path, payload: dict[str, object]) -> None: | ||
| try: | ||
| atomic_write(path, json.dumps(payload, indent=2) + "\n") | ||
| except OSError: | ||
| # A marker we cannot persist must never crash the Stop hook; the | ||
| # breaker simply won't advance this cycle. | ||
| pass |
There was a problem hiding this comment.
Fail open when marker persistence fails, or the breaker can stall permanently.
If marker writes fail, stopHookBlocks is never persisted; subsequent stop-hook invocations can keep returning "decision": "block" forever and never reach the release threshold. This undermines the circuit-breaker guarantee under filesystem/permission failures.
Proposed fix
-def _write_marker(path: Path, payload: dict[str, object]) -> None:
+def _write_marker(path: Path, payload: dict[str, object]) -> bool:
try:
atomic_write(path, json.dumps(payload, indent=2) + "\n")
+ return True
except OSError:
- # A marker we cannot persist must never crash the Stop hook; the
- # breaker simply won't advance this cycle.
- pass
+ return False payload["stopHookBlocks"] = blocks
- _write_marker(marker, payload)
+ if not _write_marker(marker, payload):
+ print(
+ json.dumps(
+ {
+ "systemMessage": (
+ "Story Automator auto-paused because stop-hook state "
+ "could not be persisted. Fix marker file permissions and resume."
+ )
+ },
+ indent=2,
+ )
+ )
+ return 0🧰 Tools
🪛 ast-grep (0.43.0)
[info] 241-241: use jsonify instead of json.dumps for JSON output
Context: json.dumps(payload, indent=2)
Note: Security best practice.
(use-jsonify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/bmad-story-automator/src/story_automator/commands/basic.py` around
lines 240 - 246, The `_write_marker` function currently silently swallows
OSError exceptions, which prevents marker state from being persisted and can
cause the circuit breaker to stall permanently by always returning "block" on
subsequent invocations. Instead of silently passing on OSError, log the error
with details about what failed (include the path and error information) and then
re-raise the exception so that the caller can decide how to handle the
persistence failure appropriately, allowing the breaker to fail open rather than
remain blocked indefinitely.
|
Thanks for tackling the Stop-hook runaway problem here — nice work. I ran into a related but distinct issue with the same hook. Because it's registered project-scoped, it blocks every Claude Code session in the project while a run is active — not just the orchestrator that owns the run. A second session opened to watch the run (or unrelated work in another worktree) can't stop; it gets blocked too. The A small owner-gating addition fixes it cleanly: record the orchestrator's Since this PR already restructures |
Addresses #29 (and the closely related #4).
Problem
The story-automator
Stophook read its stdin and discarded it — it never inspectedstop_hook_active(Claude Code's documented guard against infinite Stop-hook loops) and blocked unconditionally while stories remained. On a long-lived / Stop-hook-resumed orchestrator this fights Claude Code's "go idle, get woken when the background work completes" model: every stop attempt is re-blocked until the harness force-ends the turn atCLAUDE_CODE_STOP_HOOK_BLOCK_CAP, and each empty "Holding" turn replays the entire (growing) transcript, paying fullcache_read.That, combined with
monitor-sessionfalse-completes pushing the orchestrator into per-turncat /tmp/*.json/tmux capture-panepolling, is the runaway #29 measured: 1,599 turns / 478M cache_read tokens in one resumed session, a 5-hour quota window consumed in ~30 minutes.Fixes
1. Stop-hook circuit breaker (
commands/basic.py)Parse the hook input, honor
stop_hook_active, and count consecutive blocks without step progress in the marker. AfterSTORY_AUTOMATOR_MAX_STOP_BLOCKS(default 5, below the harness cap of 9) the hook releases the stop with asystemMessageinstead of letting the session busy-wait a quota window away. "Progress" = a story completed (storiesRemainingdecreased) or the orchestrator bumped the marker heartbeat at a verified step (wired intostep-03-execute.md), so healthy long runs never trip it.2. Reload anti-polling guidance on resume (issue #29 suggestion 1 — the cheap win)
Add
monitoring-pattern.md(the FORBIDDEN-polling table + single-call spawn→monitor→verify cycle) to the LOAD-ONCE set indata-file-index.md, and mandate reading it instep-03-execute.mdandstep-01b-continue.mdbefore any tmux interaction. A Stop-hook resume can no longer improvise per-turn polling because the guidance was never surfaced.3. Cost ceiling in
stop-hook-recovery.md(issue #29 suggestion 4)Replace the unbounded "do whatever it takes / only stop when unrecoverable" with an explicit cost-ceiling section and forbidden-polling decision-matrix rows matching the breaker.
4. Tighten the false-complete contract (
commands/tmux.py, issue #29 suggestion 2)monitor-sessionnow re-confirms acompleted-but-unverified session in-process up to--completion-rechecks(default 3) before returningincomplete, absorbing false-completes inside the single blocking call instead of bouncing anincompletethe orchestrator then hand-polls. A broken verifier contract (verifier_contract_invalid) still returns immediately so it escalates rather than waiting pointlessly. Theincompletedecision-flow now says "re-spawnmonitor-session, never hand-poll."Tests
tests/test_stop_hooks.py: circuit-breaker block/release, reset-on-heartbeat-progress, reset-on-stop_hook_active=false, reset-on-storiesRemainingdecrease, no-marker allows stop.tests/test_monitor_session.py: verified-completion returns immediately, false-complete re-polls thenincomplete, recheck recovers when the verifier passes,--completion-rechecks 1disables re-poll, contract errors still return immediately.All new tests pass and the existing suite is unchanged (one pre-existing, unrelated failure on
mainaboutclaude --printvs--dangerously-skip-permissions).Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests