feat(consensus): bound worst-case wall-time without truncating slow-but-good responses#159

Merged
antonbabenko merged 6 commits into master from feat/consensus-timeout-resilience
Jun 26, 2026

@antonbabenko

Owner

Problem

A recent /consensus run took 45+ minutes. The debug log showed a single provider call at ~38 min: a Codex (codex exec) subprocess running unbounded, because its kill timer was armed only when a timeoutMs was passed and the consensus path passed none. Several other calls sat at 420-600s. There was no retry, no wall-time budget, and no circuit-breaker, so a chronically-slow peer added its full ceiling to every one of up to 5 rounds.
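
The failure mode above comes down to a kill timer that was armed only conditionally. A minimal sketch of an always-armed timer, in the shape of the fix described under Approach, might look like the following. Only CODEX_DEFAULT_TIMEOUT_MS is a name from this PR; resolveTimeoutMs, runBounded, and the option names are hypothetical and are not taken from core/providers/codex.js.

```javascript
// Hypothetical sketch, not the code in core/providers/codex.js.
const { spawn } = require("node:child_process");

const CODEX_DEFAULT_TIMEOUT_MS = 600_000; // 600s default per-call ceiling

// Pick a finite ceiling: per-call override first, then per-construction,
// then the default. The result is never undefined, so a timer can always be armed.
function resolveTimeoutMs(perCallMs, perConstructionMs) {
  for (const candidate of [perCallMs, perConstructionMs]) {
    if (Number.isFinite(candidate) && candidate > 0) return candidate;
  }
  return CODEX_DEFAULT_TIMEOUT_MS;
}

// Run a subprocess with a kill timer that is armed unconditionally.
function runBounded(command, args, { timeoutMs, defaultTimeoutMs } = {}) {
  const ceilingMs = resolveTimeoutMs(timeoutMs, defaultTimeoutMs);
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "inherit"] });
    let stdout = "";
    let timedOut = false;

    // Armed on every call, whether or not the caller passed a timeout.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, ceilingMs);

    child.stdout.on("data", (chunk) => {
      stdout += chunk;
    });
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      resolve({ code, stdout, timedOut });
    });
  });
}
```

With this shape, a caller that passes no timeout at all (as the consensus path did) still gets the 600s ceiling instead of an unbounded subprocess.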

Approach

Four layered guards (smallest blast radius first) plus the config/docs to expose them. The strategy was designed via a cross-model panel (GPT / Gemini / Grok / OpenRouter); the plan and the final diff were cross-reviewed.

  • Bound Codex (core/providers/codex.js): default 600s per-call ceiling (CODEX_DEFAULT_TIMEOUT_MS), overridable per-call or per-construction. The kill timer is now always armed. This alone fixes the observed incident on every arbiter path.
  • Transport-only retry (core/orchestrate.js callProvider): consume the previously-ignored retryable flag for exactly one case: a pre-response network error, retried once. Timeouts, rate-limit errors, and auth errors are never retried (a retry there is non-idempotent in cost and would risk the slow-but-good case).
  • Global wall-time budget (runToConvergence): optional maxWallMs stops starting a new round once spent; it never aborts an in-flight call, so a legitimately slow Gemini answer is always collected. Returns stopReason: "budget-exhausted".
  • Session circuit-breaker: a peer that times out in 2 consecutive rounds is dropped from the panel for the rest of the session. Once every peer has been dropped, the loop returns stopReason: "all-providers-circuit-broken".
  • Config + wiring + docs: consensus.maxWallMs (default 1200000 ms = 20 min), wired config -> resolveConsensus -> the consensus tool -> the loop; stopReason surfaced in the tool payload. TECHNICAL / README / SETUP / CLAUDE docs updated.
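
The transport-only retry rule in the list above can be sketched as follows. The error shape (kind, retryable, responseStarted) and both function names are assumptions made for illustration; the real callProvider in core/orchestrate.js may classify errors differently.

```javascript
// Hypothetical sketch of "retry once, pre-response network errors only".
// Assumed error shape: { kind, retryable, responseStarted }.
function isTransportRetryable(err) {
  return Boolean(
    err &&
      err.retryable === true &&
      err.kind === "network" &&
      !err.responseStarted
  );
}

// invoke is any function returning a promise for the provider response.
async function callWithTransportRetry(invoke) {
  try {
    return await invoke();
  } catch (err) {
    // Timeouts, rate limits, and auth errors propagate immediately.
    if (!isTransportRetryable(err)) throw err;
    // Exactly one retry; a second failure propagates to the caller.
    return invoke();
  }
}
```

Keeping the predicate separate from the wrapper makes the "exactly one case" rule easy to unit-test on its own.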

Explicitly out of scope (documented in the plan): hedging (rejected: the fan-out already is the redundancy); budget/breaker on the host-driven consensus-step path (the Codex bound still fixes that path's incident); a Codex timeout config key.

Worst case

Every per-call path now has a ceiling (Codex 600s, Gemini ~420s, Grok/OpenRouter 180s). Per-round wall-time is approximately max(peers) + max(adjudication, revision), and the global budget caps the total. No unbounded path remains.
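
As a rough worked example of the formula above: the per-provider ceilings, the 5-round limit, and the 20-minute budget come from this PR, while the 600s ceilings for adjudication and revision are a pessimistic assumption (the PR does not state them). It also assumes the budget is checked only at round boundaries, so the round in flight when the budget runs out still completes.

```javascript
// Back-of-the-envelope worst case, in seconds.
const PEER_CEILING_S = { codex: 600, gemini: 420, grok: 180, openrouter: 180 };

// Assumption: adjudication and revision are each bounded by the slowest ceiling.
const ASSUMED_ADJUDICATION_S = 600;
const ASSUMED_REVISION_S = 600;

const MAX_ROUNDS = 5;
const BUDGET_S = 1_200_000 / 1000; // consensus.maxWallMs default, 20 min

// Per-round: max(peers) + max(adjudication, revision)
const maxPeerS = Math.max(...Object.values(PEER_CEILING_S));
const roundWorstS = maxPeerS + Math.max(ASSUMED_ADJUDICATION_S, ASSUMED_REVISION_S);

// Without a budget, every round can hit its ceiling.
const unbudgetedWorstS = MAX_ROUNDS * roundWorstS;

// With a budget, no round starts after the budget is spent, but a round
// that started just before that point is allowed to finish.
const budgetedWorstS = Math.min(unbudgetedWorstS, BUDGET_S + roundWorstS);

console.log({ roundWorstS, unbudgetedWorstS, budgetedWorstS });
```

Under these assumptions a round is bounded by 1200s, five rounds by 6000s, and a budgeted run by 2400s (the budget plus one in-flight round).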

Test plan

  • New unit tests: Codex timeout plumbing (CX-timeout-1/2/3), transport retry (ORX-retry-1/2/3), budget (RC-budget), circuit-breaker (RC-breaker), config-path flow + payload (CB-MW1/2/3, CA16).
  • npm run check: typecheck clean; 579 pass / 11 fail. All 11 failures are pre-existing listen EPERM 127.0.0.1 sandbox network-bind denials in test/grok.test.js / test/openrouter.test.js, untouched by this branch (they fail identically on master). No code-caused failures.
  • Manual: a consensus run with a small maxWallMs returns fewer rounds and stopReason: "budget-exhausted".
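
The manual check above relies on the budget being evaluated before each round rather than during it. A minimal sketch of that loop shape is below. Only maxWallMs and the "budget-exhausted" stop reason come from this PR; runWithBudget, runRound, the injectable now clock, and the "converged" / "max-rounds" stop reasons are hypothetical and do not mirror the real runToConvergence signature.

```javascript
// Hypothetical sketch of a wall-time budget checked at round boundaries.
async function runWithBudget({ runRound, maxRounds = 5, maxWallMs, now = Date.now }) {
  const startedAt = now();
  const rounds = [];

  for (let i = 0; i < maxRounds; i += 1) {
    // The budget is optional: an undefined maxWallMs disables the check.
    const budgetSpent =
      Number.isFinite(maxWallMs) && now() - startedAt >= maxWallMs;
    if (budgetSpent) return { rounds, stopReason: "budget-exhausted" };

    // An in-flight round is never aborted, so a slow answer is still collected.
    const result = await runRound(i);
    rounds.push(result);

    if (result.converged) return { rounds, stopReason: "converged" };
  }
  return { rounds, stopReason: "max-rounds" };
}
```

Injecting the clock keeps a test deterministic: with a fake clock that advances 700 ms per round and maxWallMs of 1000, two rounds complete and the third never starts.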

Review

The plan and the final whole-branch diff were cross-reviewed by GPT / Gemini / Grok / OpenRouter plus a per-task spec+quality gate. The final whole-branch review caught a critical wiring gap (the config key was stripped by resolveConsensus and never reached the loop, so the budget was inert in production); it was fixed in the last commit and re-verified end-to-end.

Commits

  • fix(codex): bound unbounded per-call timeout (default 600s)
  • fix(orchestrate): retry once on pre-response network errors only
  • feat(consensus): optional global wall-time budget for the convergence loop
  • feat(consensus): circuit-break a peer after 2 consecutive timeouts
  • feat(consensus): add consensus.maxWallMs budget config + docs
  • fix(consensus): wire maxWallMs through config resolution and surface stopReason
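
For the circuit-breaker commit in the list above, the bookkeeping can be sketched as a per-peer streak counter. The class and method names are hypothetical, and the sketch assumes that any non-timeout round resets a peer's streak (the PR says only "2 consecutive timeouts").

```javascript
// Hypothetical sketch of a session-scoped timeout circuit-breaker.
class TimeoutBreaker {
  constructor(peers, threshold = 2) {
    this.threshold = threshold;
    this.streaks = new Map(peers.map((peer) => [peer, 0]));
    this.dropped = new Set();
  }

  // Record one round's outcome for a peer.
  record(peer, timedOut) {
    if (this.dropped.has(peer)) return; // dropped for the rest of the session
    // Assumption: any non-timeout outcome resets the consecutive count.
    const streak = timedOut ? (this.streaks.get(peer) ?? 0) + 1 : 0;
    this.streaks.set(peer, streak);
    if (streak >= this.threshold) this.dropped.add(peer);
  }

  // Peers still eligible for the next round.
  activePeers() {
    return [...this.streaks.keys()].filter((peer) => !this.dropped.has(peer));
  }

  // Non-null only once every peer has been dropped.
  stopReason() {
    return this.activePeers().length === 0 ? "all-providers-circuit-broken" : null;
  }
}
```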

…stopReason

C1: resolveConsensus() in server/openrouter/config.js never emitted maxWallMs,
so cc.maxWallMs in runConsensusAuto was always undefined and the 20-min budget
never applied. Add DEFAULT_CONSENSUS_MAX_WALL_MS=1200000, compute maxWallMs in
resolveConsensus with the same soft-degrade pattern as maxRounds (except that an
invalid value falls back to the default rather than being omitted), and include
it in both the wrap() return and the no-file fallback so both config paths carry
it.

I1: runConsensusTool built a new envelope from runConsensusAuto's payload but
dropped stopReason. Add stopReason to both the runConsensusAuto payload and the
runConsensusTool envelope so "budget-exhausted" / "all-providers-circuit-broken"
reaches the tool caller as documented.

Update deepEqual assertions in CB1, CB7, CB8, SU8 to include maxWallMs:1200000.
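
The C1 fix in the commit message above amounts to always emitting a usable maxWallMs from config resolution. A minimal sketch of that soft-degrade behaviour follows. DEFAULT_CONSENSUS_MAX_WALL_MS and the consensus.maxWallMs key come from this PR; resolveMaxWallMs, resolveConsensusSketch, and the rule that "valid" means a positive finite number are assumptions, not the code in server/openrouter/config.js.

```javascript
// Hypothetical sketch of the maxWallMs resolution described in C1.
const DEFAULT_CONSENSUS_MAX_WALL_MS = 1_200_000; // 20 min

// Assumption: a valid value is a positive finite number. Anything else
// falls back to the default instead of being omitted.
function resolveMaxWallMs(raw) {
  return Number.isFinite(raw) && raw > 0 ? raw : DEFAULT_CONSENSUS_MAX_WALL_MS;
}

// Both paths (config file present, no config file) carry maxWallMs.
function resolveConsensusSketch(fileConfig) {
  const consensus = (fileConfig && fileConfig.consensus) || {};
  return { maxWallMs: resolveMaxWallMs(consensus.maxWallMs) };
}

console.log(resolveConsensusSketch(undefined));
console.log(resolveConsensusSketch({ consensus: { maxWallMs: 60_000 } }));
```

Because the resolved object always contains the key, the loop never sees an undefined budget, which was the wiring gap that left the budget inert.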
@antonbabenko antonbabenko merged commit 3073a9e into master Jun 26, 2026
1 check passed
@antonbabenko antonbabenko deleted the feat/consensus-timeout-resilience branch June 26, 2026 14:14