supernovae · supernovae · Apr 20, 2026 · Apr 20, 2026
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -24,6 +24,8 @@ Picture data moving left to right on the **happy path**:
 
 Side channels include **budget** enforcement (RFC 0038) and **sandbox** allow/deny lists (RFC 0017), which can pre-empt a transition or force `fail_safe` without giving unsafe payloads back to the model.
 
+For streamed decoding, deployments should treat budget control as an active circuit-breaker (preflight budget gate + mid-stream cancellation), not a post-hoc accounting report. The reference harness now supports this runtime pattern; see [Model adaptation for budget control](./model-adaptation-budget-control.md).
+
 ## Major components
 
 These names describe responsibilities; a single deployment may fold multiple roles into one service, but the boundaries stay conceptually distinct.
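
The preflight budget gate and provider-side cap referenced in this hunk can be sketched as below. This is a minimal illustration under stated assumptions, not the harness implementation: `preflight` and `PreflightResult` are hypothetical names, and `estimateTokens` reuses the 4-characters-per-token heuristic that the mock backend in this PR applies.

```typescript
// Sketch only. `preflight` and `PreflightResult` are hypothetical names;
// `estimateTokens` copies the mock backend's 4-chars-per-token heuristic.
interface PreflightResult {
  allowed: boolean;        // false means: stop before decode starts
  promptEstimate: number;  // estimated prompt tokens
  maxOutputTokens: number; // per-request cap derived from what is left
}

function estimateTokens(text: string): number {
  if (!text) return 0;
  return Math.max(1, Math.ceil(text.length / 4));
}

function preflight(
  promptTexts: string[],
  remainingTokens: number,
  minCompletionTokens: number = 1,
): PreflightResult {
  let promptEstimate = 0;
  for (const text of promptTexts) {
    promptEstimate += estimateTokens(text);
  }
  // Whatever the prompt does not consume becomes the completion allowance.
  const maxOutputTokens = Math.max(0, remainingTokens - promptEstimate);
  return {
    allowed: maxOutputTokens >= minCompletionTokens,
    promptEstimate,
    maxOutputTokens,
  };
}
```

When `allowed` is false the caller would force the terminal status instead of issuing the request; otherwise `maxOutputTokens` is what would be passed to the provider as the per-request cap.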

diff --git a/docs/model-adaptation-budget-control.md b/docs/model-adaptation-budget-control.md
@@ -0,0 +1,76 @@
+# Model Adaptation for Budget-Constrained Reasoning
+
+This note describes practical guidance for running Open-CoT with strict token budgets, streamed cancellation, and fallback routing.
+
+## Runtime control path (what the harness enforces)
+
+In the reference harness, budget/safety control is enforced at two layers:
+
+1. **Provider-side cap**: each call gets a per-request `max_tokens` cap derived from remaining budget.
+2. **Harness-side stream breaker**:
+   - preflight prompt estimate gate,
+   - mid-stream completion-budget interruption,
+   - mid-stream safety interruption (runaway/pattern checks),
+   - forced transition to terminal FSM status before any follow-on side effect.
+
+This means budget enforcement does not depend on the model obeying budget instructions.
+
+## Model behavior profile: what tends to work best
+
+Budget-following quality is usually higher for models with:
+
+- strong instruction adherence in system prompts,
+- native tool-call behavior and function-schema compliance,
+- stable short-form planning (can compress plans under hard limits),
+- lower tendency to emit long reflective preambles.
+
+Budget-following quality is usually worse for models with:
+
+- weak instruction tuning (treats budget as advisory text),
+- verbose default style (long chain-style narration before action),
+- fragile tool-call formatting under constrained output length.
+
+These are deployment traits, not absolute rules. Evaluate with your own tasks and policies.
+
+## Fine-tuning / adaptation recommendations
+
+If you train or adapt models for this harness, prioritize data and objectives that reward controlled reasoning depth:
+
+1. **Budget-conditioned demonstrations**
+   - Include explicit `budget_remaining` context in prompts.
+   - Provide successful traces at multiple budgets (tight/medium/high).
+2. **Compression preference**
+   - Reward concise plans that keep high-value steps and drop redundant rationale.
+3. **Tool-first economy**
+   - Reward early tool requests when external evidence is required, instead of long speculative reasoning.
+4. **Truncation-aware recovery**
+   - Include examples where the model says what is missing and asks for retry/escalation when budget is insufficient.
+5. **Policy-aware refusal**
+   - Include traces that correctly stop/escalate when policy or safety constraints prevent completion.
+
+## Routing when reasoning is incomplete due to budget
+
+When a run ends with `budget_exhausted`, use a deterministic policy instead of silent retries:
+
+1. **Narrow retry (same model)**  
+   Retry with a smaller objective slice (single sub-problem) and explicit compact-output instruction.
+2. **Model escalation (same route family)**  
+   Route to a stronger instruction-following model for a bounded rescue pass.
+3. **Tool-heavy route**  
+   Shift from free-form reasoning to an evidence/tool-driven route with minimal synthesis tokens.
+4. **Human escalation**  
+   If the task is policy-critical or high-risk, require a human approval/resolution path.
+
+Each retry should carry forward prior observations and a remaining-budget contract so failures are auditable rather than hidden.
+
+## Suggested evaluation matrix
+
+For each candidate model family, track at least:
+
+- completion-under-budget rate,
+- correctness at fixed budget tiers,
+- tool-call validity under low output caps,
+- rate of `budget_exhausted` recoveries that succeed after one retry,
+- safety/fail-safe trigger precision (true positives vs false positives).
+
+Open-CoT experiments under `docs/experiments/` can be used as baseline scaffolding for this matrix.
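
The routing section of the new doc lists four options for a `budget_exhausted` run but does not fix an order. One possible deterministic policy is sketched below, assuming the options are tried in the order listed and that high-risk runs go straight to a human. All names (`Route`, `ExhaustedRun`, `nextRoute`) are hypothetical and not part of the harness.

```typescript
// Sketch only. All names are hypothetical; treating the doc's list as an
// ordered ladder is an assumption, not something the doc states.
type Route =
  | "narrow_retry"
  | "model_escalation"
  | "tool_heavy"
  | "human_escalation";

interface ExhaustedRun {
  attempts: Route[];             // routes already tried for this task
  highRisk: boolean;             // policy-critical or high-risk task
  remainingBudgetTokens: number; // budget contract carried into the retry
  observations: string[];        // prior observations to carry forward
}

interface RoutingDecision {
  route: Route;
  carryForward: { observations: string[]; remainingBudgetTokens: number };
}

const LADDER: Route[] = ["narrow_retry", "model_escalation", "tool_heavy"];

function nextRoute(run: ExhaustedRun): RoutingDecision {
  // Every decision carries prior observations and the remaining budget,
  // so a failed retry stays auditable.
  const carryForward = {
    observations: run.observations.slice(),
    remainingBudgetTokens: run.remainingBudgetTokens,
  };
  if (run.highRisk) {
    return { route: "human_escalation", carryForward };
  }
  for (const candidate of LADDER) {
    if (run.attempts.indexOf(candidate) === -1) {
      return { route: candidate, carryForward };
    }
  }
  // Every automated route has been tried once: hand off to a human.
  return { route: "human_escalation", carryForward };
}
```

Because the decision is a pure function of the run record, the same exhausted run always yields the same route, which is what makes the policy deterministic rather than a silent retry loop.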
diff --git a/harness/README.md b/harness/README.md
@@ -111,6 +111,17 @@ The budget tracker (RFC 0038) enforces:
 
 When any hard-enforced budget hits zero, the agent is force-stopped with `budget_exhausted` status and the trace records why.
 
+### Streaming decode circuit breaker
+
+The harness now enforces token/safety limits during streamed decoding (not only after full responses):
+
+- **Preflight budget gate**: estimate prompt token cost before each model call; if the remaining budget is insufficient, stop before decode starts.
+- **Mid-stream token breaker**: stream callbacks track emitted completion tokens and abort decode once the remaining completion allowance is exhausted.
+- **Mid-stream safety breaker**: stream callbacks can stop runaway or unsafe output patterns and route to `fail_safe`.
+- **FSM-first shutdown**: on breaker trip, the run is forced into a terminal state (`budget_exhausted`, `fail_safe`, or `external_stop`) before any subsequent tool side effects.
+
+This keeps authority in the harness FSM even when a model ignores budget instructions.
+
 ## Tool contracts
 
 Every tool is registered with a contract (RFC 0003 + RFC 0018):
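
The mid-stream token and safety breakers described in the README hunk above can be sketched as a small chunk callback. This is an illustration under assumptions, not the contents of `llm-circuit-breaker.ts` (which is not shown here): `createStreamBreaker` and its `abort` parameter are hypothetical, the chunk shape follows what the mock backend emits in this PR, and the safety check is reduced to the `maxDecodedChars` limit the agents configure.

```typescript
// Sketch only. `createStreamBreaker` is a hypothetical name; the chunk fields
// mirror what the mock backend in this PR passes to `onChunk`.
interface StreamChunk {
  contentDelta: string;
  content: string;                   // aggregate text decoded so far
  completionTokensEstimated: number; // running completion-token estimate
}

type BreakerStatus = "budget_exhausted" | "fail_safe";

interface StreamBreaker {
  onChunk: (chunk: StreamChunk) => void;
  status: () => BreakerStatus | undefined;
}

function createStreamBreaker(
  limits: { maxCompletionTokens: number; maxDecodedChars: number },
  abort: () => void, // assumed to cancel the in-flight decode
): StreamBreaker {
  let tripped: BreakerStatus | undefined;
  return {
    onChunk(chunk: StreamChunk): void {
      if (tripped) return; // already tripped: ignore late chunks
      if (chunk.completionTokensEstimated > limits.maxCompletionTokens) {
        tripped = "budget_exhausted";
      } else if (chunk.content.length > limits.maxDecodedChars) {
        tripped = "fail_safe"; // runaway-output guard
      }
      if (tripped) abort();    // cancel exactly once
    },
    status: () => tripped,
  };
}
```

In practice `abort` would likely be `() => controller.abort()` on an `AbortController` whose signal is handed to the backend, and the caller would force the FSM into the status returned by `status()` before running any tool.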

diff --git a/harness/src/agents/chat-agent.ts b/harness/src/agents/chat-agent.ts
@@ -8,6 +8,7 @@ import type { AgentState } from "../core/state.js";
 import { createAgentState } from "../core/state.js";
 import { transition, forceStop } from "../core/transitions.js";
 import { createBudgetTracker } from "../core/budget-tracker.js";
+import { callLLMWithCircuitBreaker } from "../core/llm-circuit-breaker.js";
 import type { ToolRegistry } from "../core/tool-registry.js";
 import {
   emitPlan,
@@ -56,19 +57,19 @@ export async function runChatAgent(
   let lastResponse: LLMResponseWithTools | undefined;
 
   const callLLM = async (messages: LLMMessage[]): Promise<LLMResponseWithTools> => {
-    if (halted(state)) {
-      return { content: "", tokensUsed: 0, model: "noop", finishReason: "stop" };
-    }
-    try {
-      const response = await backend.chat(messages);
-      budget.recordTokens(state, response.tokensUsed, `LLM (${backend.name})`);
-      lastResponse = response;
-      return response;
-    } catch (err) {
-      const msg = err instanceof Error ? err.message : String(err);
-      forceStop(state, "fail_safe", msg);
-      return { content: "", tokensUsed: 0, model: "error", finishReason: "stop" };
-    }
+    const response = await callLLMWithCircuitBreaker({
+      backend,
+      messages,
+      state,
+      budget,
+      llmReason: `LLM (${backend.name})`,
+      stream: true,
+      safety: {
+        maxDecodedChars: 12_000,
+      },
+    });
+    lastResponse = response;
+    return response;
   };
 
   const end = (answer: string): Trace => {

diff --git a/harness/src/agents/coder-agent.ts b/harness/src/agents/coder-agent.ts
@@ -7,6 +7,7 @@ import type { AgentState } from "../core/state.js";
 import { createAgentState } from "../core/state.js";
 import { transition, forceStop } from "../core/transitions.js";
 import { createBudgetTracker } from "../core/budget-tracker.js";
+import { callLLMWithCircuitBreaker } from "../core/llm-circuit-breaker.js";
 import type { ToolRegistry } from "../core/tool-registry.js";
 import {
   emitPlan,
@@ -55,19 +56,19 @@ export async function runCoderAgent(
   let lastResponse: LLMResponseWithTools | undefined;
 
   const callLLM = async (messages: LLMMessage[]): Promise<LLMResponseWithTools> => {
-    if (halted(state)) {
-      return { content: "", tokensUsed: 0, model: "noop", finishReason: "stop" };
-    }
-    try {
-      const response = await backend.chat(messages);
-      budget.recordTokens(state, response.tokensUsed, `LLM (${backend.name})`);
-      lastResponse = response;
-      return response;
-    } catch (err) {
-      const msg = err instanceof Error ? err.message : String(err);
-      forceStop(state, "fail_safe", msg);
-      return { content: "", tokensUsed: 0, model: "error", finishReason: "stop" };
-    }
+    const response = await callLLMWithCircuitBreaker({
+      backend,
+      messages,
+      state,
+      budget,
+      llmReason: `LLM (${backend.name})`,
+      stream: true,
+      safety: {
+        maxDecodedChars: 20_000,
+      },
+    });
+    lastResponse = response;
+    return response;
   };
 
   const end = (answer: string): Trace => {

diff --git a/harness/src/agents/governed-agent.ts b/harness/src/agents/governed-agent.ts
@@ -9,6 +9,7 @@ import type { AgentState } from "../core/state.js";
 import { createAgentState } from "../core/state.js";
 import { transition, forceStop } from "../core/transitions.js";
 import { createBudgetTracker } from "../core/budget-tracker.js";
+import { callLLMWithCircuitBreaker } from "../core/llm-circuit-breaker.js";
 import type { ToolRegistry } from "../core/tool-registry.js";
 import { PermissionManager } from "../governance/permission-manager.js";
 import { PolicyEvaluator } from "../governance/policy-evaluator.js";
@@ -75,23 +76,19 @@ export async function runGovernedAgent(
   const callLLM = async (
     messages: LLMMessage[],
   ): Promise<LLMResponseWithTools> => {
-    if (halted(state)) {
-      return { content: "", tokensUsed: 0, model: "noop", finishReason: "stop" };
-    }
-    try {
-      const response = await config.backend.chat(messages);
-      budget.recordTokens(
-        state,
-        response.tokensUsed,
-        `LLM (${config.backend.name})`,
-      );
-      lastResponse = response;
-      return response;
-    } catch (err) {
-      const msg = err instanceof Error ? err.message : String(err);
-      forceStop(state, "fail_safe", `LLM failure: ${msg}`);
-      return { content: "", tokensUsed: 0, model: "error", finishReason: "stop" };
-    }
+    const response = await callLLMWithCircuitBreaker({
+      backend: config.backend,
+      messages,
+      state,
+      budget,
+      llmReason: `LLM (${config.backend.name})`,
+      stream: true,
+      safety: {
+        maxDecodedChars: 16_000,
+      },
+    });
+    lastResponse = response;
+    return response;
   };
 
   const finish = (answer: string): GovernedAgentResult => {

diff --git a/harness/src/backends/index.ts b/harness/src/backends/index.ts
@@ -1,4 +1,13 @@
-export type { LLMBackend, LLMMessage, LLMResponse, LLMResponseWithTools, ToolCallRequest } from "./types.js";
+export type {
+  LLMBackend,
+  LLMChatOptions,
+  LLMFinishReason,
+  LLMMessage,
+  LLMResponse,
+  LLMResponseWithTools,
+  LLMStreamChunk,
+  ToolCallRequest,
+} from "./types.js";
 export { MockLLMBackend } from "./mock.js";
 export { OpenAICompatBackend, configFromEnv } from "./openai-compat.js";
 export type { OpenAICompatConfig } from "./openai-compat.js";
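
The newly exported `LLMChatOptions` and `LLMStreamChunk` types are defined in `types.ts`, which is not part of this excerpt. The sketch below shows shapes inferred only from how `mock.ts` uses them in this PR, so the real definitions may differ; `previewChunks` is a hypothetical helper that reproduces the mock's word-level chunking so the callback payloads can be inspected.

```typescript
// Sketch only. Field names are inferred from mock.ts usage in this PR; the
// real definitions in types.ts may differ. `previewChunks` is hypothetical.
interface LLMStreamChunk {
  contentDelta: string;              // text added by this chunk
  content: string;                   // aggregate text so far
  completionTokensEstimated: number; // running estimate
}

interface LLMChatOptions {
  stream?: boolean;
  maxOutputTokens?: number;
  signal?: { readonly aborted: boolean }; // structural stand-in for AbortSignal
  onChunk?: (chunk: LLMStreamChunk) => void | Promise<void>;
}

function estimateTokens(text: string): number {
  if (!text) return 0;
  return Math.max(1, Math.ceil(text.length / 4));
}

// Split on whitespace-delimited words, as the mock's emitStream does, and
// build the payload each onChunk call would receive.
function previewChunks(content: string): LLMStreamChunk[] {
  const pieces: string[] = content.match(/\S+\s*/g) ?? [content];
  const chunks: LLMStreamChunk[] = [];
  let aggregate = "";
  let completionTokensEstimated = 0;
  for (const piece of pieces) {
    aggregate += piece;
    completionTokensEstimated += estimateTokens(piece);
    chunks.push({ contentDelta: piece, content: aggregate, completionTokensEstimated });
  }
  return chunks;
}

// Usage: collect the chunks a streaming caller would observe.
const seen: LLMStreamChunk[] = [];
const options: LLMChatOptions = {
  stream: true,
  maxOutputTokens: 64,
  onChunk: (chunk) => { seen.push(chunk); },
};
for (const chunk of previewChunks("hello big world")) {
  options.onChunk?.(chunk);
}
```

Note that the running estimate sums per-piece estimates, each rounded up, so it can exceed an estimate computed over the whole aggregate string.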
diff --git a/harness/src/backends/mock.ts b/harness/src/backends/mock.ts
@@ -7,6 +7,7 @@
 
 import type {
   LLMBackend,
+  LLMChatOptions,
   LLMMessage,
   LLMResponseWithTools,
   ToolCallRequest,
@@ -31,7 +32,24 @@ export class MockLLMBackend implements LLMBackend {
     this.routes.push(route);
   }
 
-  async chat(messages: LLMMessage[]): Promise<LLMResponseWithTools> {
+  async chat(
+    messages: LLMMessage[],
+    options?: LLMChatOptions,
+  ): Promise<LLMResponseWithTools> {
+    if (options?.signal?.aborted) {
+      throw makeAbortError("Mock backend aborted before decode");
+    }
+
+    const response = this.resolveResponse(messages);
+    const capped = applyOutputCap(response, options?.maxOutputTokens);
+
+    if (options?.stream && options.onChunk && capped.content) {
+      await emitStream(capped.content, options);
+    }
+    return capped;
+  }
+
+  private resolveResponse(messages: LLMMessage[]): LLMResponseWithTools {
     const raw =
       [...messages].reverse().find((m) => m.role === "user")?.content ?? "";
 
@@ -90,6 +108,70 @@ export class MockLLMBackend implements LLMBackend {
   }
 }
 
+function makeAbortError(message: string): Error {
+  const err = new Error(message);
+  err.name = "AbortError";
+  return err;
+}
+
+function estimateTokens(text: string): number {
+  if (!text) return 0;
+  return Math.max(1, Math.ceil(text.length / 4));
+}
+
+function applyOutputCap(
+  response: LLMResponseWithTools,
+  maxOutputTokens?: number,
+): LLMResponseWithTools {
+  if (
+    maxOutputTokens === undefined ||
+    maxOutputTokens <= 0 ||
+    response.content.length === 0
+  ) {
+    return response;
+  }
+
+  const completionEstimate = estimateTokens(response.content);
+  if (completionEstimate <= maxOutputTokens) {
+    return response;
+  }
+
+  const maxChars = Math.max(1, maxOutputTokens * 4);
+  const truncated = response.content.slice(0, maxChars);
+  const promptEstimate = Math.max(
+    0,
+    response.tokensUsed - completionEstimate,
+  );
+  return {
+    ...response,
+    content: truncated,
+    tokensUsed: promptEstimate + estimateTokens(truncated),
+    finishReason: "length",
+  };
+}
+
+async function emitStream(
+  content: string,
+  options: LLMChatOptions,
+): Promise<void> {
+  const chunks = content.match(/\S+\s*/g) ?? [content];
+  let aggregate = "";
+  let completionTokensEstimated = 0;
+
+  for (const piece of chunks) {
+    if (options.signal?.aborted) {
+      throw makeAbortError("Mock backend aborted during streamed decode");
+    }
+    aggregate += piece;
+    completionTokensEstimated += estimateTokens(piece);
+    await options.onChunk?.({
+      contentDelta: piece,
+      content: aggregate,
+      completionTokensEstimated,
+    });
+  }
+}
+
 function getDefaultRoutes(): MockRoute[] {
   return [
     {