feat(consensus): bound worst-case wall-time without truncating slow-but-good responses #159
Merged
…stopReason

C1: resolveConsensus() in server/openrouter/config.js never emitted maxWallMs, so cc.maxWallMs in runConsensusAuto was always undefined and the 20-min budget never applied. Add DEFAULT_CONSENSUS_MAX_WALL_MS=1200000, compute maxWallMs in resolveConsensus with the same soft-degrade pattern as maxRounds (except invalid falls back to the default rather than omitting), and include it in both the wrap() return and the no-file fallback so both config paths carry it.

I1: runConsensusTool built a new envelope from runConsensusAuto's payload but dropped stopReason. Add stopReason to both the runConsensusAuto payload and the runConsensusTool envelope so "budget-exhausted" / "all-providers-circuit-broken" reaches the tool caller as documented.

Update deepEqual assertions in CB1, CB7, CB8, SU8 to include maxWallMs:1200000.
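The soft-degrade behaviour described in C1 can be sketched roughly as follows. This is a minimal illustration, not the actual code in `server/openrouter/config.js`: the helper names `resolveMaxWallMs` and `resolveConsensusSketch` are hypothetical, and the validity rule (a positive finite number) is an assumption, since the commit message does not spell out what counts as invalid.

```javascript
// Hypothetical sketch. Only DEFAULT_CONSENSUS_MAX_WALL_MS and maxWallMs are
// names taken from the commit message; everything else is illustrative.
const DEFAULT_CONSENSUS_MAX_WALL_MS = 1200000; // 20 minutes

// A valid value passes through; a missing or invalid one falls back to the
// default instead of being omitted. ASSUMPTION: "valid" means a positive,
// finite number.
function resolveMaxWallMs(raw) {
  const valid = typeof raw === "number" && Number.isFinite(raw) && raw > 0;
  return valid ? raw : DEFAULT_CONSENSUS_MAX_WALL_MS;
}

// Both config paths (file present, no file) use the same helper, so the
// resolved object always carries maxWallMs.
function resolveConsensusSketch(fileConfig) {
  const consensus = (fileConfig && fileConfig.consensus) || {};
  return { maxWallMs: resolveMaxWallMs(consensus.maxWallMs) };
}

console.log(resolveConsensusSketch({ consensus: { maxWallMs: 60000 } }));
console.log(resolveConsensusSketch({ consensus: { maxWallMs: "soon" } }));
console.log(resolveConsensusSketch(undefined));
```

Falling back to the default, rather than omitting the key, means the loop never sees an undefined budget, which is the failure mode C1 describes.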
Problem
A recent `/consensus` run took 45+ minutes. The debug log showed a single provider call at ~38 min: a Codex (`codex exec`) subprocess running unbounded, because its kill timer was armed only when a `timeoutMs` was passed and the consensus path passed none. Several other calls sat at 420-600s. There was no retry, no wall-time budget, and no circuit-breaker, so a chronically-slow peer added its full ceiling to every one of up to 5 rounds.
Approach
Four layered guards (smallest blast radius first) plus the config/docs to expose them. The strategy was designed via a cross-model panel (GPT / Gemini / Grok / OpenRouter); the plan and the final diff were cross-reviewed.
- `core/providers/codex.js`: default 600s per-call ceiling (`CODEX_DEFAULT_TIMEOUT_MS`), overridable per-call or per-construction. The kill timer is now always armed. This alone fixes the observed incident on every arbiter path.
- `core/orchestrate.js` `callProvider`: consume the previously-ignored `retryable` flag for exactly one case, a pre-response `network` error, retried once. Never on timeout / rate-limit / auth (non-idempotent in cost; would risk the slow-but-good case).
- `runToConvergence`: optional `maxWallMs` prevents a new round from starting once the budget is spent; it never aborts an in-flight call, so a legitimately slow Gemini answer is always collected. Returns `stopReason: "budget-exhausted"`.
- Circuit breaker: a peer is circuit-broken after 2 consecutive timeouts (`stopReason: "all-providers-circuit-broken"`).
- `consensus.maxWallMs` (default 1200000 ms = 20 min), wired config -> `resolveConsensus` -> the `consensus` tool -> the loop; `stopReason` surfaced in the tool payload. TECHNICAL / README / SETUP / CLAUDE docs updated.

Explicitly out of scope (documented in the plan): hedging (rejected: the fan-out already is the redundancy); budget/breaker on the host-driven `consensus-step` path (the Codex bound still fixes that path's incident); a Codex timeout config key.
Worst case
Every per-call path now has a ceiling (Codex 600s, Gemini ~420s, Grok/OpenRouter 180s). Per-round wall-time is approximately max(peers) + max(adjudication, revision); the global budget caps the total. No unbounded path remains.
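A budget that gates only the start of a round, as described under Approach, might look like the sketch below. It is an illustration under assumptions, not the real `runToConvergence`: the parameter names `runRound`, `maxRounds` and `now`, and the `"converged"` / `"max-rounds"` stop reasons, are placeholders; only `maxWallMs` and `"budget-exhausted"` come from the PR text.

```javascript
// Hypothetical sketch of a wall-time budget checked only between rounds.
async function runToConvergenceSketch({ runRound, maxRounds, maxWallMs, now = Date.now }) {
  const startedAt = now();
  const rounds = [];
  for (let round = 1; round <= maxRounds; round += 1) {
    // The budget decides whether a NEW round may start. A round that is
    // already in flight is always awaited to completion, never aborted.
    if (maxWallMs !== undefined && now() - startedAt >= maxWallMs) {
      return { rounds, stopReason: "budget-exhausted" };
    }
    const result = await runRound(round);
    rounds.push(result);
    if (result.converged) {
      return { rounds, stopReason: "converged" }; // placeholder stop reason
    }
  }
  return { rounds, stopReason: "max-rounds" }; // placeholder stop reason
}

// Demo with a fake clock: each round "takes" 7 minutes against a 10-minute budget.
let fakeTime = 0;
const demo = runToConvergenceSketch({
  runRound: async (round) => {
    fakeTime += 7 * 60 * 1000;
    return { round, converged: false };
  },
  maxRounds: 5,
  maxWallMs: 10 * 60 * 1000,
  now: () => fakeTime,
});
demo.then((out) => console.log(out.rounds.length, out.stopReason));
```

In the demo, round 2 starts at 7 minutes and finishes at 14, past the budget, and its result is still collected; only round 3 is refused. Under this design the total can overshoot the budget by at most one round's ceiling.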
Test plan
- `npm run check`: typecheck clean; 579 pass / 11 fail. All 11 failures are pre-existing `listen EPERM 127.0.0.1` sandbox network-bind denials in `test/grok.test.js` / `test/openrouter.test.js`, untouched by this branch (they fail identically on `master`). No code-caused failures.
- `consensus` run with a small `maxWallMs` returns fewer rounds and `stopReason: "budget-exhausted"`.

Review
The plan and the final whole-branch diff were cross-reviewed by GPT / Gemini / Grok / OpenRouter plus a per-task spec+quality gate. The final whole-branch review caught a critical wiring gap (the config key was stripped by `resolveConsensus` and never reached the loop, so the budget was inert in production); it was fixed in the last commit and re-verified end-to-end.
Commits
- fix(codex): bound unbounded per-call timeout (default 600s)
- fix(orchestrate): retry once on pre-response network errors only
- feat(consensus): optional global wall-time budget for the convergence loop
- feat(consensus): circuit-break a peer after 2 consecutive timeouts
- feat(consensus): add `consensus.maxWallMs` budget config + docs
- fix(consensus): wire `maxWallMs` through config resolution and surface `stopReason`
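
For readers unfamiliar with the pattern, the circuit-breaker commit above could be approximated by the sketch below. The class `TimeoutBreaker` and the helper `nextRoundPlan` are hypothetical names, and the rule that any non-timeout outcome resets a peer's streak is an assumption inferred from the word "consecutive"; the branch's actual bookkeeping may differ.

```javascript
// Hypothetical sketch: a per-peer breaker that opens after N consecutive timeouts.
class TimeoutBreaker {
  constructor(threshold = 2) {
    this.threshold = threshold;
    this.streaks = new Map(); // peer name -> consecutive timeout count
  }

  // ASSUMPTION: any outcome other than "timeout" resets the streak.
  record(peer, outcome) {
    const next = outcome === "timeout" ? (this.streaks.get(peer) || 0) + 1 : 0;
    this.streaks.set(peer, next);
  }

  isOpen(peer) {
    return (this.streaks.get(peer) || 0) >= this.threshold;
  }
}

// Decide which peers take part in the next round, or stop if none are left.
function nextRoundPlan(breaker, peers) {
  const active = peers.filter((peer) => !breaker.isOpen(peer));
  return active.length === 0
    ? { peers: [], stopReason: "all-providers-circuit-broken" }
    : { peers: active, stopReason: null };
}

const breaker = new TimeoutBreaker(2);
const peers = ["codex", "gemini"];
breaker.record("codex", "timeout");
breaker.record("codex", "timeout"); // codex is now circuit-broken
breaker.record("gemini", "timeout");
breaker.record("gemini", "ok"); // a good answer resets gemini's streak
breaker.record("gemini", "timeout");
const first = nextRoundPlan(breaker, peers);
breaker.record("gemini", "timeout"); // second consecutive timeout
const second = nextRoundPlan(breaker, peers);
console.log(first, second);
```

Counting only consecutive timeouts keeps a slow-but-good peer in the panel as long as it answers at least every other round.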