fix(stop-hooks): circuit breaker to stop orchestrator LLM-turn-polling runaway (#29) by Yash-1511 · Pull Request #45 · bmad-code-org/bmad-automator

Yash-1511 · 2026-06-15T13:30:10Z

Addresses #29 (and the closely related #4).

Problem

The story-automator Stop hook read its stdin and discarded it — it never inspected stop_hook_active (Claude Code's documented guard against infinite Stop-hook loops) and blocked unconditionally while stories remained. On a long-lived / Stop-hook-resumed orchestrator this fights Claude Code's "go idle, get woken when the background work completes" model: every stop attempt is re-blocked until the harness force-ends the turn at CLAUDE_CODE_STOP_HOOK_BLOCK_CAP, and each empty "Holding" turn replays the entire (growing) transcript, paying full cache_read.

That, combined with monitor-session false-completes pushing the orchestrator into per-turn cat /tmp/*.json / tmux capture-pane polling, is the runaway #29 measured: 1,599 turns / 478M cache_read tokens in one resumed session, a 5-hour quota window consumed in ~30 minutes.

Fixes

1. Stop-hook circuit breaker (commands/basic.py)
Parse the hook input, honor stop_hook_active, and count consecutive blocks without step progress in the marker. After STORY_AUTOMATOR_MAX_STOP_BLOCKS (default 5, below the harness cap of 9) the hook releases the stop with a systemMessage instead of letting the session busy-wait a quota window away. "Progress" = a story completed (storiesRemaining decreased) or the orchestrator bumped the marker heartbeat at a verified step (wired into step-03-execute.md), so healthy long runs never trip it.

2. Reload anti-polling guidance on resume (issue #29 suggestion 1 — the cheap win)
Add monitoring-pattern.md (the FORBIDDEN-polling table + single-call spawn→monitor→verify cycle) to the LOAD-ONCE set in data-file-index.md, and mandate reading it in step-03-execute.md and step-01b-continue.md before any tmux interaction. A Stop-hook resume can no longer improvise per-turn polling because the guidance was never surfaced.

3. Cost ceiling in stop-hook-recovery.md (issue #29 suggestion 4)
Replace the unbounded "do whatever it takes / only stop when unrecoverable" with an explicit cost-ceiling section and forbidden-polling decision-matrix rows matching the breaker.

4. Tighten the false-complete contract (commands/tmux.py, issue #29 suggestion 2)
monitor-session now re-confirms a completed-but-unverified session in-process up to --completion-rechecks (default 3) before returning incomplete, absorbing false-completes inside the single blocking call instead of bouncing an incomplete the orchestrator then hand-polls. A broken verifier contract (verifier_contract_invalid) still returns immediately so it escalates rather than waiting pointlessly. The incomplete decision-flow now says "re-spawn monitor-session, never hand-poll."

Suggestion 3 (standalone safe-poll helper) was intentionally deferred — the in-process re-poll in (4) addresses the same root cause without a new stateful command. Suggestion 5 (transcript compaction) is left for a follow-up.

Tests

tests/test_stop_hooks.py: circuit-breaker block/release, reset-on-heartbeat-progress, reset-on-stop_hook_active=false, reset-on-storiesRemaining decrease, no-marker allows stop.
tests/test_monitor_session.py: verified-completion returns immediately, false-complete re-polls then incomplete, recheck recovers when the verifier passes, --completion-rechecks 1 disables re-poll, contract errors still return immediately.

All new tests pass and the existing suite is unchanged (one pre-existing, unrelated failure on main about claude --print vs --dangerously-skip-permissions).

Summary by CodeRabbit

Release Notes

New Features
- Added configurable completion rechecking to re-verify sessions reporting as completed but failing verification
- Implemented circuit breaker mechanism to prevent infinite retry loops when sessions stall without progress
- Added heartbeat signaling to notify real progress and reset safety mechanisms
Documentation
- Updated monitoring and execution guidance with mandatory preconditions before execution
- Clarified stop-hook behavior and procedures to prevent polling anti-patterns
- Enhanced stop-hook recovery documentation with circuit breaker details and common mistake warnings
Tests
- Added test coverage for session monitoring and completion verification behavior
- Added test coverage for stop-hook circuit breaker functionality

…bmad-code-org#29) The story-automator Stop hook read its stdin and discarded it, so it never inspected `stop_hook_active` and blocked unconditionally while stories remained. On a long-lived/resumed orchestrator this fought Claude Code's "go idle, get woken on completion" model: every stop attempt was re-blocked until the harness force-ended the turn at CLAUDE_CODE_STOP_HOOK_BLOCK_CAP, and each empty "Holding" turn replayed the whole growing transcript. Issue bmad-code-org#29 measured the result: 1,599 turns / 478M cache_read tokens in one resumed session, a 5-hour quota window burned in ~30 minutes. Fixes: 1. Stop-hook circuit breaker (commands/basic.py): parse the hook input, honor `stop_hook_active`, and count consecutive blocks WITHOUT step progress in the marker. After STORY_AUTOMATOR_MAX_STOP_BLOCKS (default 5, below the harness cap of 9) release the stop with a systemMessage instead of busy-waiting. Progress = a story completed or the marker heartbeat bumped at a verified step (wired into step-03-execute.md). 2. Reload anti-polling guidance on resume: add monitoring-pattern.md to the LOAD-ONCE set in data-file-index.md and mandate reading it in step-03-execute.md and step-01b-continue.md before any tmux interaction, so a Stop-hook resume cannot improvise per-turn cat/capture-pane polling. 3. Cost ceiling in stop-hook-recovery.md: replace unbounded "do whatever it takes" with an explicit ceiling and forbidden-polling decision rows. 4. Tighten the false-complete contract (commands/tmux.py): monitor-session re-confirms a completed-but-unverified session in-process up to --completion-rechecks (default 3) before returning `incomplete`, absorbing false-completes inside the single blocking call. A broken verifier contract still returns immediately so it escalates. Adds tests for the circuit breaker (test_stop_hooks.py) and the monitor-session re-poll (test_monitor_session.py).

coderabbitai · 2026-06-15T13:30:27Z

Walkthrough

Adds a Stop-hook circuit breaker to cmd_stop_hook that limits consecutive blocks without step progress (controlled by STORY_AUTOMATOR_MAX_STOP_BLOCKS, default 5), and adds --completion-rechecks to cmd_monitor_session for in-process false-complete re-verification. Workflow documentation is updated to mandate monitoring-pattern.md loading and heartbeat signaling; new test suites cover both features.

Changes

Stop-hook circuit breaker and monitor-session false-complete rechecking

Layer / File(s)	Summary
monitor-session false-complete recheck logic `skills/bmad-story-automator/src/story_automator/commands/tmux.py`	Adds `COMPLETION_RECHECKS` and `RECHECK_GRACE_SECONDS` constants, `--completion-rechecks` CLI option, `unverified_completions` counter, and an in-process re-poll loop that retries verification when a session appears completed but the artifact is unverified, falling back to `incomplete` after the budget is exhausted.
Stop-hook circuit breaker implementation `skills/bmad-story-automator/src/story_automator/commands/basic.py`	Adds `DEFAULT_MAX_STOP_BLOCKS`, `_max_stop_blocks()`, `_write_marker()`, and reworks `cmd_stop_hook` to track `stopHookBlocks`, detect progress via heartbeat/`storiesRemaining` deltas, and emit a `systemMessage` auto-pause when the cap is reached.
Workflow documentation updates `skills/bmad-story-automator/data/monitoring-pattern.md`, `skills/bmad-story-automator/data/stop-hook-recovery.md`, `skills/bmad-story-automator/data/data-file-index.md`, `skills/bmad-story-automator/steps-c/step-01b-continue.md`, `skills/bmad-story-automator/steps-c/step-03-execute.md`	Documents `--completion-rechecks` option, circuit-breaker rules, Common Mistakes entries forbidding per-turn polling, mandatory `monitoring-pattern.md` loading at step start and on resume, and required `orchestrator-helper marker heartbeat` calls after create-story and dev-story milestones.
Tests for monitor-session rechecks and stop-hook circuit breaker `tests/test_monitor_session.py`, `tests/test_stop_hooks.py`	New `MonitorSessionRepollTests` suite covers verified completion, false-complete exhaustion, verifier pass-on-recheck, configurable recheck count, and active-resume edge case; new `StopHookTests` circuit-breaker cases cover block increments, cap release, heartbeat reset, `stop_hook_active=false` reset, and `storiesRemaining` decrease reset.

Sequence Diagram(s)

sequenceDiagram
  participant StopHook as Stop Hook (cmd_stop_hook)
  participant Marker as Active Marker (JSON)
  participant Monitor as monitor-session (cmd_monitor_session)
  participant Verifier as _verify_monitor_completion

  StopHook->>Marker: read stopHookBlocks, heartbeat, storiesRemaining
  alt progress detected (heartbeat changed or storiesRemaining decreased)
    StopHook->>Marker: reset stopHookBlocks=0
    StopHook-->>StopHook: block turn (normal)
  else no progress, blocks < max
    StopHook->>Marker: increment stopHookBlocks
    StopHook-->>StopHook: block turn
  else blocks >= STORY_AUTOMATOR_MAX_STOP_BLOCKS
    StopHook->>Marker: reset stopHookBlocks=0, persist
    StopHook-->>StopHook: emit systemMessage auto-pause
  end

  Note over Monitor,Verifier: On next execution turn
  Monitor->>Verifier: verify artifact on completed state
  alt verified immediately
    Monitor-->>StopHook: final_state=completed
  else unverified, rechecks > 0
    loop up to completion_rechecks
      Monitor->>Monitor: sleep RECHECK_GRACE_SECONDS
      Monitor->>Verifier: re-verify
    end
    Monitor-->>StopHook: final_state=completed OR incomplete
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[bug] Orchestrator falls into LLM-turn polling after monitor-session false-completes (Opus 4.7 1M context, 1,600 turns / 478M cache_read tokens in one resumed session) #29: This PR directly addresses issue #29 by implementing in-process false-complete re-verification via --completion-rechecks in tmux.py, a Stop-hook circuit breaker in basic.py, mandatory monitoring-pattern.md loading in workflow step documents, and documentation prohibiting per-turn polling patterns.

Possibly related PRs

bmad-code-org/bmad-automator#2: Modifies cmd_monitor_session and tmux completion detection in the same file and code path where this PR adds --completion-rechecks false-complete logic.
bmad-code-org/bmad-automator#18: Touches _verify_monitor_completion and cmd_monitor_session in tmux.py, the same verification flow extended by this PR's recheck logic.

Suggested reviewers

bma-d

Poem

🐇 Hop, hop, the circuit breaks!
No more polling for goodness' sake,
One blocking call is all we need,
Heartbeat resets — confirmed indeed!
The Stop hook guards, not clocks the beat,
False-completes rechecked — how neat! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 3.70% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: implementing a circuit breaker for stop-hook handling to prevent LLM-turn-polling runaway, directly addressing issue `#29`.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/bmad-story-automator/src/story_automator/commands/basic.py`:
- Around line 240-246: The `_write_marker` function currently silently swallows
OSError exceptions, which prevents marker state from being persisted and can
cause the circuit breaker to stall permanently by always returning "block" on
subsequent invocations. Instead of silently passing on OSError, log the error
with details about what failed (include the path and error information) and then
re-raise the exception so that the caller can decide how to handle the
persistence failure appropriately, allowing the breaker to fail open rather than
remain blocked indefinitely.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58d03866-e3c9-4d5e-8916-a6dbcbf4a84b

📥 Commits

Reviewing files that changed from the base of the PR and between f332173 and 3198167.

📒 Files selected for processing (9)

skills/bmad-story-automator/data/data-file-index.md
skills/bmad-story-automator/data/monitoring-pattern.md
skills/bmad-story-automator/data/stop-hook-recovery.md
skills/bmad-story-automator/src/story_automator/commands/basic.py
skills/bmad-story-automator/src/story_automator/commands/tmux.py
skills/bmad-story-automator/steps-c/step-01b-continue.md
skills/bmad-story-automator/steps-c/step-03-execute.md
tests/test_monitor_session.py
tests/test_stop_hooks.py

coderabbitai · 2026-06-15T13:36:45Z

+def _write_marker(path: Path, payload: dict[str, object]) -> None:
+    try:
+        atomic_write(path, json.dumps(payload, indent=2) + "\n")
+    except OSError:
+        # A marker we cannot persist must never crash the Stop hook; the
+        # breaker simply won't advance this cycle.
+        pass


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail open when marker persistence fails, or the breaker can stall permanently.

If marker writes fail, stopHookBlocks is never persisted; subsequent stop-hook invocations can keep returning "decision": "block" forever and never reach the release threshold. This undermines the circuit-breaker guarantee under filesystem/permission failures.

Proposed fix

-def _write_marker(path: Path, payload: dict[str, object]) -> None: +def _write_marker(path: Path, payload: dict[str, object]) -> bool: try: atomic_write(path, json.dumps(payload, indent=2) + "\n") + return True except OSError: - # A marker we cannot persist must never crash the Stop hook; the - # breaker simply won't advance this cycle. - pass + return False

payload["stopHookBlocks"] = blocks - _write_marker(marker, payload) + if not _write_marker(marker, payload): + print( + json.dumps( + { + "systemMessage": ( + "Story Automator auto-paused because stop-hook state " + "could not be persisted. Fix marker file permissions and resume." + ) + }, + indent=2, + ) + ) + return 0

🧰 Tools

🪛 ast-grep (0.43.0)

[info] 241-241: use jsonify instead of json.dumps for JSON output
Context: json.dumps(payload, indent=2)
Note: Security best practice.

(use-jsonify)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@skills/bmad-story-automator/src/story_automator/commands/basic.py` around lines 240 - 246, The `_write_marker` function currently silently swallows OSError exceptions, which prevents marker state from being persisted and can cause the circuit breaker to stall permanently by always returning "block" on subsequent invocations. Instead of silently passing on OSError, log the error with details about what failed (include the path and error information) and then re-raise the exception so that the caller can decide how to handle the persistence failure appropriately, allowing the breaker to fail open rather than remain blocked indefinitely.

cannt · 2026-06-25T10:52:28Z

Thanks for tackling the Stop-hook runaway problem here — nice work.

I ran into a related but distinct issue with the same hook. Because it's registered project-scoped, it blocks every Claude Code session in the project while a run is active — not just the orchestrator that owns the run. A second session opened to watch the run (or unrelated work in another worktree) can't stop; it gets blocked too. The STORY_AUTOMATOR_CHILD env only exempts spawned children, and the circuit breaker here still blocks a bystander up to MAX_STOP_BLOCKS times before releasing.

A small owner-gating addition fixes it cleanly: record the orchestrator's CLAUDE_CODE_SESSION_ID as ownerSession in the active marker at create-time, then in cmd_stop_hook compare it against the caller's session_id (already in the stdin payload you're now parsing) and return 0 for any non-owner session. Markers without ownerSession keep the current behavior, so it's backward-compatible.

Since this PR already restructures cmd_stop_hook and parses the hook input, the gating slots in as ~6 lines on top of your change. I had it as a standalone PR (#48, now closed to avoid two overlapping stop-hook PRs). Would you/the maintainers prefer I fold the owner-gating into this PR, or send it as a follow-up rebased on top once this lands? Happy either way.

Yash-1511 requested a review from bma-d as a code owner June 15, 2026 13:30

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

cannt mentioned this pull request Jun 25, 2026

fix: Stop hook blocks only the run's owning session, not other sessions in the project #48

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(stop-hooks): circuit breaker to stop orchestrator LLM-turn-polling runaway (#29)#45

fix(stop-hooks): circuit breaker to stop orchestrator LLM-turn-polling runaway (#29)#45
Yash-1511 wants to merge 1 commit into
bmad-code-org:mainfrom
Yash-1511:fix/issue-29-stop-hook-circuit-breaker

Yash-1511 commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Uh oh!

cannt commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Yash-1511 commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fixes

Tests

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

cannt commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Yash-1511 commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading