Skip to content

fix(stop-hooks): circuit breaker to stop orchestrator LLM-turn-polling runaway (#29)#45

Open
Yash-1511 wants to merge 1 commit into
bmad-code-org:mainfrom
Yash-1511:fix/issue-29-stop-hook-circuit-breaker
Open

fix(stop-hooks): circuit breaker to stop orchestrator LLM-turn-polling runaway (#29)#45
Yash-1511 wants to merge 1 commit into
bmad-code-org:mainfrom
Yash-1511:fix/issue-29-stop-hook-circuit-breaker

Conversation

@Yash-1511

@Yash-1511 Yash-1511 commented Jun 15, 2026

Copy link
Copy Markdown

Addresses #29 (and the closely related #4).

Problem

The story-automator Stop hook read its stdin and discarded it — it never inspected stop_hook_active (Claude Code's documented guard against infinite Stop-hook loops) and blocked unconditionally while stories remained. On a long-lived / Stop-hook-resumed orchestrator this fights Claude Code's "go idle, get woken when the background work completes" model: every stop attempt is re-blocked until the harness force-ends the turn at CLAUDE_CODE_STOP_HOOK_BLOCK_CAP, and each empty "Holding" turn replays the entire (growing) transcript, paying full cache_read.

That, combined with monitor-session false-completes pushing the orchestrator into per-turn cat /tmp/*.json / tmux capture-pane polling, is the runaway #29 measured: 1,599 turns / 478M cache_read tokens in one resumed session, a 5-hour quota window consumed in ~30 minutes.

Fixes

1. Stop-hook circuit breaker (commands/basic.py)
Parse the hook input, honor stop_hook_active, and count consecutive blocks without step progress in the marker. After STORY_AUTOMATOR_MAX_STOP_BLOCKS (default 5, below the harness cap of 9) the hook releases the stop with a systemMessage instead of letting the session busy-wait a quota window away. "Progress" = a story completed (storiesRemaining decreased) or the orchestrator bumped the marker heartbeat at a verified step (wired into step-03-execute.md), so healthy long runs never trip it.

2. Reload anti-polling guidance on resume (issue #29 suggestion 1 — the cheap win)
Add monitoring-pattern.md (the FORBIDDEN-polling table + single-call spawn→monitor→verify cycle) to the LOAD-ONCE set in data-file-index.md, and mandate reading it in step-03-execute.md and step-01b-continue.md before any tmux interaction. A Stop-hook resume can no longer improvise per-turn polling because the guidance was never surfaced.

3. Cost ceiling in stop-hook-recovery.md (issue #29 suggestion 4)
Replace the unbounded "do whatever it takes / only stop when unrecoverable" with an explicit cost-ceiling section and forbidden-polling decision-matrix rows matching the breaker.

4. Tighten the false-complete contract (commands/tmux.py, issue #29 suggestion 2)
monitor-session now re-confirms a completed-but-unverified session in-process up to --completion-rechecks (default 3) before returning incomplete, absorbing false-completes inside the single blocking call instead of bouncing an incomplete the orchestrator then hand-polls. A broken verifier contract (verifier_contract_invalid) still returns immediately so it escalates rather than waiting pointlessly. The incomplete decision-flow now says "re-spawn monitor-session, never hand-poll."

Suggestion 3 (standalone safe-poll helper) was intentionally deferred — the in-process re-poll in (4) addresses the same root cause without a new stateful command. Suggestion 5 (transcript compaction) is left for a follow-up.

Tests

  • tests/test_stop_hooks.py: circuit-breaker block/release, reset-on-heartbeat-progress, reset-on-stop_hook_active=false, reset-on-storiesRemaining decrease, no-marker allows stop.
  • tests/test_monitor_session.py: verified-completion returns immediately, false-complete re-polls then incomplete, recheck recovers when the verifier passes, --completion-rechecks 1 disables re-poll, contract errors still return immediately.

All new tests pass and the existing suite is unchanged (one pre-existing, unrelated failure on main about claude --print vs --dangerously-skip-permissions).

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable completion rechecking to re-verify sessions reporting as completed but failing verification
    • Implemented circuit breaker mechanism to prevent infinite retry loops when sessions stall without progress
    • Added heartbeat signaling to notify real progress and reset safety mechanisms
  • Documentation

    • Updated monitoring and execution guidance with mandatory preconditions before execution
    • Clarified stop-hook behavior and procedures to prevent polling anti-patterns
    • Enhanced stop-hook recovery documentation with circuit breaker details and common mistake warnings
  • Tests

    • Added test coverage for session monitoring and completion verification behavior
    • Added test coverage for stop-hook circuit breaker functionality

…bmad-code-org#29)

The story-automator Stop hook read its stdin and discarded it, so it never
inspected `stop_hook_active` and blocked unconditionally while stories
remained. On a long-lived/resumed orchestrator this fought Claude Code's
"go idle, get woken on completion" model: every stop attempt was re-blocked
until the harness force-ended the turn at CLAUDE_CODE_STOP_HOOK_BLOCK_CAP,
and each empty "Holding" turn replayed the whole growing transcript. Issue
bmad-code-org#29 measured the result: 1,599 turns / 478M cache_read tokens in one
resumed session, a 5-hour quota window burned in ~30 minutes.

Fixes:

1. Stop-hook circuit breaker (commands/basic.py): parse the hook input,
   honor `stop_hook_active`, and count consecutive blocks WITHOUT step
   progress in the marker. After STORY_AUTOMATOR_MAX_STOP_BLOCKS (default 5,
   below the harness cap of 9) release the stop with a systemMessage instead
   of busy-waiting. Progress = a story completed or the marker heartbeat
   bumped at a verified step (wired into step-03-execute.md).

2. Reload anti-polling guidance on resume: add monitoring-pattern.md to the
   LOAD-ONCE set in data-file-index.md and mandate reading it in
   step-03-execute.md and step-01b-continue.md before any tmux interaction,
   so a Stop-hook resume cannot improvise per-turn cat/capture-pane polling.

3. Cost ceiling in stop-hook-recovery.md: replace unbounded "do whatever it
   takes" with an explicit ceiling and forbidden-polling decision rows.

4. Tighten the false-complete contract (commands/tmux.py): monitor-session
   re-confirms a completed-but-unverified session in-process up to
   --completion-rechecks (default 3) before returning `incomplete`, absorbing
   false-completes inside the single blocking call. A broken verifier
   contract still returns immediately so it escalates.

Adds tests for the circuit breaker (test_stop_hooks.py) and the
monitor-session re-poll (test_monitor_session.py).
@Yash-1511 Yash-1511 requested a review from bma-d as a code owner June 15, 2026 13:30
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

Adds a Stop-hook circuit breaker to cmd_stop_hook that limits consecutive blocks without step progress (controlled by STORY_AUTOMATOR_MAX_STOP_BLOCKS, default 5), and adds --completion-rechecks to cmd_monitor_session for in-process false-complete re-verification. Workflow documentation is updated to mandate monitoring-pattern.md loading and heartbeat signaling; new test suites cover both features.

Changes

Stop-hook circuit breaker and monitor-session false-complete rechecking

Layer / File(s) Summary
monitor-session false-complete recheck logic
skills/bmad-story-automator/src/story_automator/commands/tmux.py
Adds COMPLETION_RECHECKS and RECHECK_GRACE_SECONDS constants, --completion-rechecks CLI option, unverified_completions counter, and an in-process re-poll loop that retries verification when a session appears completed but the artifact is unverified, falling back to incomplete after the budget is exhausted.
Stop-hook circuit breaker implementation
skills/bmad-story-automator/src/story_automator/commands/basic.py
Adds DEFAULT_MAX_STOP_BLOCKS, _max_stop_blocks(), _write_marker(), and reworks cmd_stop_hook to track stopHookBlocks, detect progress via heartbeat/storiesRemaining deltas, and emit a systemMessage auto-pause when the cap is reached.
Workflow documentation updates
skills/bmad-story-automator/data/monitoring-pattern.md, skills/bmad-story-automator/data/stop-hook-recovery.md, skills/bmad-story-automator/data/data-file-index.md, skills/bmad-story-automator/steps-c/step-01b-continue.md, skills/bmad-story-automator/steps-c/step-03-execute.md
Documents --completion-rechecks option, circuit-breaker rules, Common Mistakes entries forbidding per-turn polling, mandatory monitoring-pattern.md loading at step start and on resume, and required orchestrator-helper marker heartbeat calls after create-story and dev-story milestones.
Tests for monitor-session rechecks and stop-hook circuit breaker
tests/test_monitor_session.py, tests/test_stop_hooks.py
New MonitorSessionRepollTests suite covers verified completion, false-complete exhaustion, verifier pass-on-recheck, configurable recheck count, and active-resume edge case; new StopHookTests circuit-breaker cases cover block increments, cap release, heartbeat reset, stop_hook_active=false reset, and storiesRemaining decrease reset.

Sequence Diagram(s)

sequenceDiagram
  participant StopHook as Stop Hook (cmd_stop_hook)
  participant Marker as Active Marker (JSON)
  participant Monitor as monitor-session (cmd_monitor_session)
  participant Verifier as _verify_monitor_completion

  StopHook->>Marker: read stopHookBlocks, heartbeat, storiesRemaining
  alt progress detected (heartbeat changed or storiesRemaining decreased)
    StopHook->>Marker: reset stopHookBlocks=0
    StopHook-->>StopHook: block turn (normal)
  else no progress, blocks < max
    StopHook->>Marker: increment stopHookBlocks
    StopHook-->>StopHook: block turn
  else blocks >= STORY_AUTOMATOR_MAX_STOP_BLOCKS
    StopHook->>Marker: reset stopHookBlocks=0, persist
    StopHook-->>StopHook: emit systemMessage auto-pause
  end

  Note over Monitor,Verifier: On next execution turn
  Monitor->>Verifier: verify artifact on completed state
  alt verified immediately
    Monitor-->>StopHook: final_state=completed
  else unverified, rechecks > 0
    loop up to completion_rechecks
      Monitor->>Monitor: sleep RECHECK_GRACE_SECONDS
      Monitor->>Verifier: re-verify
    end
    Monitor-->>StopHook: final_state=completed OR incomplete
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

  • bmad-code-org/bmad-automator#2: Modifies cmd_monitor_session and tmux completion detection in the same file and code path where this PR adds --completion-rechecks false-complete logic.
  • bmad-code-org/bmad-automator#18: Touches _verify_monitor_completion and cmd_monitor_session in tmux.py, the same verification flow extended by this PR's recheck logic.

Suggested reviewers

  • bma-d

Poem

🐇 Hop, hop, the circuit breaks!
No more polling for goodness' sake,
One blocking call is all we need,
Heartbeat resets — confirmed indeed!
The Stop hook guards, not clocks the beat,
False-completes rechecked — how neat! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: implementing a circuit breaker for stop-hook handling to prevent LLM-turn-polling runaway, directly addressing issue #29.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/bmad-story-automator/src/story_automator/commands/basic.py`:
- Around line 240-246: The `_write_marker` function currently silently swallows
OSError exceptions, which prevents marker state from being persisted and can
cause the circuit breaker to stall permanently by always returning "block" on
subsequent invocations. Instead of silently passing on OSError, log the error
with details about what failed (include the path and error information) and then
re-raise the exception so that the caller can decide how to handle the
persistence failure appropriately, allowing the breaker to fail open rather than
remain blocked indefinitely.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58d03866-e3c9-4d5e-8916-a6dbcbf4a84b

📥 Commits

Reviewing files that changed from the base of the PR and between f332173 and 3198167.

📒 Files selected for processing (9)
  • skills/bmad-story-automator/data/data-file-index.md
  • skills/bmad-story-automator/data/monitoring-pattern.md
  • skills/bmad-story-automator/data/stop-hook-recovery.md
  • skills/bmad-story-automator/src/story_automator/commands/basic.py
  • skills/bmad-story-automator/src/story_automator/commands/tmux.py
  • skills/bmad-story-automator/steps-c/step-01b-continue.md
  • skills/bmad-story-automator/steps-c/step-03-execute.md
  • tests/test_monitor_session.py
  • tests/test_stop_hooks.py

Comment on lines +240 to +246
def _write_marker(path: Path, payload: dict[str, object]) -> None:
try:
atomic_write(path, json.dumps(payload, indent=2) + "\n")
except OSError:
# A marker we cannot persist must never crash the Stop hook; the
# breaker simply won't advance this cycle.
pass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail open when marker persistence fails, or the breaker can stall permanently.

If marker writes fail, stopHookBlocks is never persisted; subsequent stop-hook invocations can keep returning "decision": "block" forever and never reach the release threshold. This undermines the circuit-breaker guarantee under filesystem/permission failures.

Proposed fix
-def _write_marker(path: Path, payload: dict[str, object]) -> None:
+def _write_marker(path: Path, payload: dict[str, object]) -> bool:
     try:
         atomic_write(path, json.dumps(payload, indent=2) + "\n")
+        return True
     except OSError:
-        # A marker we cannot persist must never crash the Stop hook; the
-        # breaker simply won't advance this cycle.
-        pass
+        return False
     payload["stopHookBlocks"] = blocks
-    _write_marker(marker, payload)
+    if not _write_marker(marker, payload):
+        print(
+            json.dumps(
+                {
+                    "systemMessage": (
+                        "Story Automator auto-paused because stop-hook state "
+                        "could not be persisted. Fix marker file permissions and resume."
+                    )
+                },
+                indent=2,
+            )
+        )
+        return 0
🧰 Tools
🪛 ast-grep (0.43.0)

[info] 241-241: use jsonify instead of json.dumps for JSON output
Context: json.dumps(payload, indent=2)
Note: Security best practice.

(use-jsonify)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/bmad-story-automator/src/story_automator/commands/basic.py` around
lines 240 - 246, The `_write_marker` function currently silently swallows
OSError exceptions, which prevents marker state from being persisted and can
cause the circuit breaker to stall permanently by always returning "block" on
subsequent invocations. Instead of silently passing on OSError, log the error
with details about what failed (include the path and error information) and then
re-raise the exception so that the caller can decide how to handle the
persistence failure appropriately, allowing the breaker to fail open rather than
remain blocked indefinitely.

@cannt

cannt commented Jun 25, 2026

Copy link
Copy Markdown

Thanks for tackling the Stop-hook runaway problem here — nice work.

I ran into a related but distinct issue with the same hook. Because it's registered project-scoped, it blocks every Claude Code session in the project while a run is active — not just the orchestrator that owns the run. A second session opened to watch the run (or unrelated work in another worktree) can't stop; it gets blocked too. The STORY_AUTOMATOR_CHILD env only exempts spawned children, and the circuit breaker here still blocks a bystander up to MAX_STOP_BLOCKS times before releasing.

A small owner-gating addition fixes it cleanly: record the orchestrator's CLAUDE_CODE_SESSION_ID as ownerSession in the active marker at create-time, then in cmd_stop_hook compare it against the caller's session_id (already in the stdin payload you're now parsing) and return 0 for any non-owner session. Markers without ownerSession keep the current behavior, so it's backward-compatible.

Since this PR already restructures cmd_stop_hook and parses the hook input, the gating slots in as ~6 lines on top of your change. I had it as a standalone PR (#48, now closed to avoid two overlapping stop-hook PRs). Would you/the maintainers prefer I fold the owner-gating into this PR, or send it as a follow-up rebased on top once this lands? Happy either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants