feat(consensus): bound worst-case wall-time without truncating slow-but-good responses by antonbabenko · Pull Request #159 · antonbabenko/deliberation

antonbabenko · 2026-06-26T14:12:05Z

Problem

A recent /consensus run took 45+ minutes. The debug log showed a single provider call at ~38 min: a Codex (codex exec) subprocess running unbounded, because its kill timer was armed only when a timeoutMs was passed and the consensus path passed none. Several other calls sat at 420-600s. There was no retry, no wall-time budget, and no circuit-breaker, so a chronically-slow peer added its full ceiling to every one of up to 5 rounds.
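The failure mode is easy to reproduce in miniature: if the kill timer is created only inside an `if (timeoutMs)` branch, a caller that passes nothing gets no ceiling at all. Below is a minimal sketch of an always-armed timer. `runWithCeiling` and `DEFAULT_TIMEOUT_MS` are hypothetical names used for illustration, not the code in core/providers/codex.js; only `spawn`, `kill`, and the timer calls are real Node.js APIs.

```javascript
const { spawn } = require("node:child_process");

// Assumed default for this sketch, mirroring the 600s ceiling named in the PR.
const DEFAULT_TIMEOUT_MS = 600_000;

// Hypothetical helper: runs a command and always arms a kill timer.
function runWithCeiling(command, args, { timeoutMs } = {}) {
  // Resolve the ceiling before spawning so "no timeoutMs" never means "no timer".
  const ceilingMs =
    Number.isFinite(timeoutMs) && timeoutMs > 0 ? timeoutMs : DEFAULT_TIMEOUT_MS;

  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "ignore"] });
    let stdout = "";
    let timedOut = false;

    child.stdout.on("data", (chunk) => {
      stdout += chunk;
    });

    // Armed unconditionally; cleared when the child exits on its own.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, ceilingMs);

    child.on("error", (err) => {
      clearTimeout(timer);
      resolve({ ok: false, timedOut, stdout, error: err.message });
    });

    child.on("close", (code) => {
      clearTimeout(timer);
      resolve({ ok: !timedOut && code === 0, timedOut, stdout });
    });
  });
}
```

Calling `runWithCeiling("codex", ["exec", prompt])` with no options would still be bounded by the default, which is the property the consensus path was missing.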

Approach

Four layered guards (smallest blast radius first) plus the config/docs to expose them. The strategy was designed via a cross-model panel (GPT / Gemini / Grok / OpenRouter); the plan and the final diff were cross-reviewed.

Bound Codex (core/providers/codex.js): default 600s per-call ceiling (CODEX_DEFAULT_TIMEOUT_MS), overridable per-call or per-construction. The kill timer is now always armed. This alone fixes the observed incident on every arbiter path.
Transport-only retry (core/orchestrate.js callProvider): consume the previously-ignored retryable flag for exactly one case, a pre-response network error, retried once. Never on timeout / rate-limit / auth (non-idempotent in cost; would risk the slow-but-good case).
Global wall-time budget (runToConvergence): optional maxWallMs stops starting a new round once spent; it never aborts an in-flight call, so a legitimately slow Gemini answer is always collected. Returns stopReason: "budget-exhausted".
Session circuit-breaker: a peer that times out in 2 consecutive rounds is dropped from the panel for the rest of the session (stopReason: "all-providers-circuit-broken").
Config + wiring + docs: consensus.maxWallMs (default 1200000 ms = 20 min), wired config -> resolveConsensus -> the consensus tool -> the loop; stopReason surfaced in the tool payload. TECHNICAL / README / SETUP / CLAUDE docs updated.
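Of the guards above, the retry rule is the one most easily over-applied, so here is a minimal sketch of "retry once, transport errors only". The error shape (`retryable`, `kind`, `responseStarted`) and the helper name `callWithTransportRetry` are assumptions for illustration; the real `callProvider` may classify errors differently.

```javascript
// Hypothetical error shape for this sketch:
//   err.retryable       - boolean flag set by the provider layer
//   err.kind            - "network" | "timeout" | "rate-limit" | "auth"
//   err.responseStarted - true once any response bytes have arrived
async function callWithTransportRetry(call) {
  try {
    return await call();
  } catch (err) {
    const transportOnly =
      Boolean(err) &&
      err.retryable === true &&
      err.kind === "network" &&
      err.responseStarted !== true;

    // Timeouts, rate limits, and auth failures are rethrown untouched.
    if (!transportOnly) throw err;

    // Exactly one retry; a second failure propagates to the caller.
    return await call();
  }
}
```

Keeping the predicate this narrow means a slow provider is never called twice, so the retry cannot double the cost or the wall-time of a slow-but-good answer.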

Explicitly out of scope (documented in the plan): hedging (rejected: the fan-out already is the redundancy); budget/breaker on the host-driven consensus-step path (the Codex bound still fixes that path's incident); a Codex timeout config key.

Worst case

Every per-call path now has a ceiling (Codex 600s, Gemini ~420s, Grok/OpenRouter 180s). Per-round wall-time is approximately max(peers) + max(adjudication, revision), and the global budget caps the total across rounds. No unbounded path remains.
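As a worked example of that bound, the arithmetic can be sketched as below. The peer ceilings and the 5-round limit come from this description; the adjudication and revision ceilings are placeholders, since which provider handles those steps is not stated here. The sketch also assumes the budget is checked only before a round starts, so a round that begins just under budget can still run to its full length. It ignores retries and orchestration overhead.

```javascript
// All values are in seconds.
function worstCaseSeconds({ peerCeilings, adjudication, revision, maxRounds, budget }) {
  // Peers run in parallel, so a round waits for the slowest one,
  // then for the slower of adjudication and revision.
  const perRound = Math.max(...peerCeilings) + Math.max(adjudication, revision);
  const withoutBudget = perRound * maxRounds;
  // The budget stops new rounds but never aborts one already in flight.
  const withBudget = Math.min(withoutBudget, budget + perRound);
  return { perRound, withoutBudget, withBudget };
}

const bound = worstCaseSeconds({
  peerCeilings: [600, 420, 180, 180], // Codex, Gemini, Grok, OpenRouter
  adjudication: 600, // placeholder: assumes the slowest ceiling applies
  revision: 600, // placeholder: assumes the slowest ceiling applies
  maxRounds: 5,
  budget: 1200, // consensus.maxWallMs default, 1200000 ms
});

console.log(bound);
// { perRound: 1200, withoutBudget: 6000, withBudget: 2400 }
```

Under these assumptions the default budget cuts the worst case from 100 minutes to 40, with the extra 20 minutes being the one round allowed to finish after the budget is spent.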

Test plan

New unit tests: Codex timeout plumbing (CX-timeout-1/2/3), transport retry (ORX-retry-1/2/3), budget (RC-budget), circuit-breaker (RC-breaker), config-path flow + payload (CB-MW1/2/3, CA16).
npm run check: typecheck clean; 579 pass / 11 fail. All 11 failures are pre-existing listen EPERM 127.0.0.1 sandbox network-bind denials in test/grok.test.js / test/openrouter.test.js, untouched by this branch (they fail identically on master). No code-caused failures.
Manual: a consensus run with a small maxWallMs returns fewer rounds and stopReason: "budget-exhausted".
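The behaviour exercised by RC-budget, RC-breaker, and the manual check can be illustrated with a toy loop. This is a sketch under stated assumptions, not the real `runToConvergence`: the function name, the peer shape (`name`, `call`), the injectable `now` clock, and the `"converged"` / `"max-rounds"` labels are all hypothetical. Only `maxWallMs` and the two documented `stopReason` values come from this PR.

```javascript
async function runToConvergenceSketch({
  peers,
  maxRounds = 5,
  maxWallMs,
  now = Date.now,
  hasConverged = () => false,
}) {
  const startedAt = now();
  const timeoutStreak = new Map(peers.map((peer) => [peer.name, 0]));
  let active = [...peers];
  const rounds = [];

  for (let round = 1; round <= maxRounds; round += 1) {
    if (active.length === 0) {
      return { rounds, stopReason: "all-providers-circuit-broken" };
    }
    // The budget is checked only between rounds; nothing in flight is aborted.
    if (maxWallMs != null && now() - startedAt >= maxWallMs) {
      return { rounds, stopReason: "budget-exhausted" };
    }

    // Every started call is awaited, however slow it is.
    const results = await Promise.all(
      active.map(async (peer) => ({ name: peer.name, ...(await peer.call(round)) })),
    );

    for (const result of results) {
      const streak = result.timedOut ? timeoutStreak.get(result.name) + 1 : 0;
      timeoutStreak.set(result.name, streak);
    }
    // Drop any peer that has now timed out in 2 consecutive rounds.
    active = active.filter((peer) => timeoutStreak.get(peer.name) < 2);

    const answers = results.filter((result) => !result.timedOut);
    rounds.push(answers);
    if (hasConverged(answers)) {
      return { rounds, stopReason: "converged" }; // hypothetical label
    }
  }
  return { rounds, stopReason: "max-rounds" }; // hypothetical label
}
```

With a fake clock that advances 600 ms per call and `maxWallMs: 1000`, this loop completes two rounds, keeps both slow answers, and then stops with `"budget-exhausted"` instead of starting a third round.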

Review

The plan and the final whole-branch diff were cross-reviewed by GPT / Gemini / Grok / OpenRouter plus a per-task spec+quality gate. The final whole-branch review caught a critical wiring gap (the config key was stripped by resolveConsensus and never reached the loop, so the budget was inert in production); it was fixed in the last commit and re-verified end-to-end.

Commits

fix(codex): bound unbounded per-call timeout (default 600s)
fix(orchestrate): retry once on pre-response network errors only
feat(consensus): optional global wall-time budget for the convergence loop
feat(consensus): circuit-break a peer after 2 consecutive timeouts
feat(consensus): add consensus.maxWallMs budget config + docs
fix(consensus): wire maxWallMs through config resolution and surface stopReason

…stopReason

C1: resolveConsensus() in server/openrouter/config.js never emitted maxWallMs, so cc.maxWallMs in runConsensusAuto was always undefined and the 20-min budget never applied. Add DEFAULT_CONSENSUS_MAX_WALL_MS=1200000, compute maxWallMs in resolveConsensus with the same soft-degrade pattern as maxRounds (except that an invalid value falls back to the default rather than being omitted), and include it in both the wrap() return and the no-file fallback so both config paths carry it.

I1: runConsensusTool built a new envelope from runConsensusAuto's payload but dropped stopReason. Add stopReason to both the runConsensusAuto payload and the runConsensusTool envelope so "budget-exhausted" / "all-providers-circuit-broken" reach the tool caller as documented.

Update deepEqual assertions in CB1, CB7, CB8, SU8 to include maxWallMs:1200000.
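The C1 fix can be sketched as follows. Only `DEFAULT_CONSENSUS_MAX_WALL_MS` and the `consensus.maxWallMs` key come from the commit message; the helper names and the validation rule (a positive finite number) are assumptions, and the real resolveConsensus also handles other keys such as maxRounds.

```javascript
const DEFAULT_CONSENSUS_MAX_WALL_MS = 1200000; // 20 min

// Assumed validation rule: accept a positive finite number, otherwise
// fall back to the default instead of omitting the key.
function resolveMaxWallMs(raw) {
  return typeof raw === "number" && Number.isFinite(raw) && raw > 0
    ? raw
    : DEFAULT_CONSENSUS_MAX_WALL_MS;
}

// Hypothetical stand-in for resolveConsensus: both the "config file present"
// and the "no file" paths go through the same resolver, so both carry the key.
function resolveConsensusSketch(fileConfig) {
  const consensus = (fileConfig && fileConfig.consensus) || {};
  return { maxWallMs: resolveMaxWallMs(consensus.maxWallMs) };
}
```

Routing the no-file path through the same resolver is what keeps the budget from silently becoming undefined again when a config file is absent.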

antonbabenko added 6 commits June 26, 2026 15:05

fix(codex): bound unbounded per-call timeout (default 600s)

455d7aa

fix(orchestrate): retry once on pre-response network errors only

e149705

feat(consensus): optional global wall-time budget for the convergence loop

5e0fff8

feat(consensus): circuit-break a peer after 2 consecutive timeouts

411abea

feat(consensus): add consensus.maxWallMs budget config + docs

dd19345

antonbabenko merged commit 3073a9e into master Jun 26, 2026
1 check passed

antonbabenko deleted the feat/consensus-timeout-resilience branch June 26, 2026 14:14

antonbabenko merged 6 commits into master from feat/consensus-timeout-resilience
