2 changes: 2 additions & 0 deletions docs/architecture.md
Original file line number Diff line number Diff line change
@@ -24,6 +24,8 @@ Picture data moving left to right on the **happy path**:

Side channels include **budget** enforcement (RFC 0038) and **sandbox** allow/deny lists (RFC 0017), which can pre-empt a transition or force `fail_safe` without giving unsafe payloads back to the model.

For streamed decoding, deployments should treat budget control as an active circuit-breaker (preflight budget gate + mid-stream cancellation), not a post-hoc accounting report. The reference harness now supports this runtime pattern; see [Model adaptation for budget control](./model-adaptation-budget-control.md).

## Major components

These names describe responsibilities; a single deployment may fold multiple roles into one service, but the boundaries stay conceptually distinct.
76 changes: 76 additions & 0 deletions docs/model-adaptation-budget-control.md
@@ -0,0 +1,76 @@
# Model Adaptation for Budget-Constrained Reasoning

This note describes practical guidance for running Open-CoT with strict token budgets, streamed cancellation, and fallback routing.

## Runtime control path (what the harness enforces)

In the reference harness, budget/safety control is enforced at two layers:

1. **Provider-side cap**: each call gets a per-request `max_tokens` cap derived from remaining budget.
2. **Harness-side stream breaker**:
   - preflight prompt estimate gate,
   - mid-stream completion-budget interruption,
   - mid-stream safety interruption (runaway/pattern checks),
   - forced transition to terminal FSM status before any follow-on side effect.

This means budget enforcement is not dependent on model obedience.
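
As a rough illustration of how the preflight gate and the provider-side cap relate, the sketch below derives a per-request cap from the remaining budget. The names `preflight`, `estimatePromptTokens`, and `minCompletionTokens` are hypothetical and not taken from the harness, and the four-characters-per-token estimate is only a heuristic.

```typescript
// Hypothetical sketch; the reference harness may structure this differently.
type PreflightDecision =
  | { proceed: true; maxTokens: number }
  | { proceed: false; status: "budget_exhausted"; reason: string };

// Heuristic assumption: roughly four characters per token.
function estimatePromptTokens(prompt: string): number {
  return prompt.length === 0 ? 0 : Math.ceil(prompt.length / 4);
}

function preflight(
  prompt: string,
  remainingTokens: number,
  minCompletionTokens = 16, // assumed floor below which a call is not worth making
): PreflightDecision {
  const promptCost = estimatePromptTokens(prompt);
  const completionAllowance = remainingTokens - promptCost;
  if (completionAllowance < minCompletionTokens) {
    // Stop before decode starts: the budget cannot pay for a useful reply.
    return {
      proceed: false,
      status: "budget_exhausted",
      reason: `prompt needs ~${promptCost} tokens, ${remainingTokens} remain`,
    };
  }
  // Provider-side cap: never request more than the budget can cover.
  return { proceed: true, maxTokens: completionAllowance };
}
```

The returned `maxTokens` would be passed to the provider as the per-request cap, while the stream breaker remains the backstop if the provider or model overshoots.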

## Model behavior profile: what tends to work best

Budget-following quality is usually higher for models with:

- strong instruction adherence in system prompts,
- native tool-call behavior and function-schema compliance,
- stable short-form planning (can compress plans under hard limits),
- lower tendency to emit long reflective preambles.

Budget-following quality is usually worse for models with:

- weak instruction tuning (treats budget as advisory text),
- verbose default style (long chain-style narration before action),
- fragile tool-call formatting under constrained output length.

These are deployment traits, not absolute rules. Evaluate with your own tasks and policies.

## Fine-tuning / adaptation recommendations

If you train or adapt models for this harness, prioritize data and objectives that reward controlled reasoning depth:

1. **Budget-conditioned demonstrations**
   - Include explicit `budget_remaining` context in prompts.
   - Provide successful traces at multiple budgets (tight/medium/high).
2. **Compression preference**
   - Reward concise plans that keep high-value steps and drop redundant rationale.
3. **Tool-first economy**
   - Reward early tool requests when external evidence is required, instead of long speculative reasoning.
4. **Truncation-aware recovery**
   - Include examples where the model says what is missing and asks for retry/escalation when budget is insufficient.
5. **Policy-aware refusal**
   - Include traces that correctly stop/escalate when policy or safety constraints prevent completion.
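
One way to produce budget-conditioned demonstrations is to prefix each training prompt with the budget tier it was solved under. The sketch below assumes a simple prompt/completion record; the tier sizes, the `Demonstration` shape, and `buildDemonstration` are illustrative choices, not part of Open-CoT.

```typescript
// Illustrative only: tier sizes and record shape are assumptions.
type BudgetTier = "tight" | "medium" | "high";

const TIER_TOKENS: Record<BudgetTier, number> = {
  tight: 256,
  medium: 1024,
  high: 4096,
};

interface Demonstration {
  tier: BudgetTier;
  prompt: string;
  completion: string;
}

// Pair a task with a successful trace recorded under the given budget tier.
function buildDemonstration(
  task: string,
  successfulTrace: string,
  tier: BudgetTier,
): Demonstration {
  return {
    tier,
    prompt: `budget_remaining: ${TIER_TOKENS[tier]}\n\n${task}`,
    completion: successfulTrace,
  };
}
```

Emitting the same task at several tiers, each with a trace that actually fit its budget, gives the model paired examples of how much reasoning to drop as the budget tightens.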

## Routing when reasoning is incomplete due to budget

When a run ends with `budget_exhausted`, use a deterministic policy instead of silent retries:

1. **Narrow retry (same model)**
   Retry with a smaller objective slice (a single sub-problem) and an explicit compact-output instruction.
2. **Model escalation (same route family)**
   Route to a stronger instruction-following model for a bounded rescue pass.
3. **Tool-heavy route**
   Shift from free-form reasoning to an evidence/tool-driven route with minimal synthesis tokens.
4. **Human escalation**
   If the task is policy-critical or high-risk, require a human approval or resolution path.

Each retry should carry forward prior observations and a remaining-budget contract so failures are auditable rather than hidden.
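
The ladder above can be sketched as a pure function, which keeps the routing decision deterministic and easy to audit. This is a minimal sketch under assumptions: `ExhaustedRun`, `RetryPlan`, and `planRetry` are hypothetical names, and treating the steps as a strict attempt-indexed ladder is one possible reading of the policy.

```typescript
// Hypothetical types for illustration; not the harness's actual API.
type Route =
  | "narrow_retry"
  | "model_escalation"
  | "tool_heavy"
  | "human_escalation";

interface ExhaustedRun {
  attempt: number; // 1 for the first budget_exhausted run
  highRisk: boolean; // policy-critical or high-risk task
  observations: string[]; // evidence gathered so far
}

interface RetryPlan {
  route: Route;
  carriedObservations: string[];
  budgetContractTokens: number;
}

// Fixed ladder, in the order the steps are listed above.
const LADDER: Route[] = ["narrow_retry", "model_escalation", "tool_heavy"];

function planRetry(run: ExhaustedRun, retryBudgetTokens: number): RetryPlan {
  // High-risk tasks skip the ladder; so does anything past its last rung.
  const route: Route = run.highRisk
    ? "human_escalation"
    : LADDER[run.attempt - 1] ?? "human_escalation";
  return {
    route,
    carriedObservations: [...run.observations],
    budgetContractTokens: retryBudgetTokens,
  };
}
```

Because the plan carries both the prior observations and an explicit token contract, each retry can be logged alongside the run that triggered it.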

## Suggested evaluation matrix

For each candidate model family, track at least:

- completion-under-budget rate,
- correctness at fixed budget tiers,
- tool-call validity under low output caps,
- rate of `budget_exhausted` recoveries that succeed after one retry,
- safety/fail-safe trigger precision (true positives vs false positives).

Open-CoT experiments under `docs/experiments/` can be used as baseline scaffolding for this matrix.
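
A few of these metrics can be computed directly from per-run records. The sketch below assumes a hypothetical `RunRecord` shape with hand-labeled `recoveredAfterOneRetry` and `failSafeWarranted` fields; it is not the format used by the experiments directory.

```typescript
// Hypothetical record shape; adapt to whatever your traces actually store.
interface RunRecord {
  status: "completed" | "budget_exhausted" | "fail_safe";
  recoveredAfterOneRetry?: boolean; // only meaningful for budget_exhausted runs
  failSafeWarranted?: boolean; // labeled true positive for fail_safe runs
}

// Returns null instead of NaN when there is nothing to measure.
function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

function summarize(runs: RunRecord[]) {
  const completed = runs.filter((r) => r.status === "completed");
  const exhausted = runs.filter((r) => r.status === "budget_exhausted");
  const failSafe = runs.filter((r) => r.status === "fail_safe");
  return {
    completionUnderBudgetRate: ratio(completed.length, runs.length),
    oneRetryRecoveryRate: ratio(
      exhausted.filter((r) => r.recoveredAfterOneRetry === true).length,
      exhausted.length,
    ),
    failSafePrecision: ratio(
      failSafe.filter((r) => r.failSafeWarranted === true).length,
      failSafe.length,
    ),
  };
}
```

Correctness at fixed budget tiers and tool-call validity would need task-specific graders, so they are left out of this sketch.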
11 changes: 11 additions & 0 deletions harness/README.md
@@ -111,6 +111,17 @@ The budget tracker (RFC 0038) enforces:

When any hard-enforced budget hits zero, the agent is force-stopped with `budget_exhausted` status and the trace records why.

### Streaming decode circuit breaker

The harness now enforces token/safety limits during streamed decoding (not only after full responses):

- **Preflight budget gate**: estimate the prompt token cost before each model call; if the remaining budget is insufficient, stop before decode starts.
- **Mid-stream token breaker**: stream callbacks track emitted completion tokens and abort decode once the remaining completion allowance is exhausted.
- **Mid-stream safety breaker**: stream callbacks can stop runaway or unsafe output patterns and route to `fail_safe`.
- **FSM-first shutdown**: on a breaker trip, the run is forced into a terminal state (`budget_exhausted`, `fail_safe`, or `external_stop`) before any subsequent tool side effects.

This keeps authority in the harness FSM even when a model ignores budget instructions.
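
The mid-stream breakers can be pictured as a small stateful callback that inspects each streamed chunk and returns a verdict. The sketch below is a simplified, synchronous illustration: `createStreamBreaker` and `consume` are hypothetical names, the token count is a four-characters-per-token estimate, and the only safety check modeled is a `maxDecodedChars` runaway limit.

```typescript
// Simplified sketch; the real breaker lives in the harness and may differ.
type BreakerVerdict = "continue" | "budget_exhausted" | "fail_safe";

function createStreamBreaker(
  completionAllowance: number, // tokens the budget can still pay for
  maxDecodedChars: number, // runaway-output safety limit
): (contentDelta: string) => BreakerVerdict {
  let emittedTokens = 0;
  let decodedChars = 0;
  return (contentDelta) => {
    decodedChars += contentDelta.length;
    emittedTokens += Math.ceil(contentDelta.length / 4); // heuristic estimate
    if (decodedChars > maxDecodedChars) return "fail_safe";
    if (emittedTokens > completionAllowance) return "budget_exhausted";
    return "continue";
  };
}

// Feed chunks through the breaker; stop at the first non-"continue" verdict.
function consume(
  chunks: Iterable<string>,
  onChunk: (contentDelta: string) => BreakerVerdict,
): { status: "completed" | "budget_exhausted" | "fail_safe"; content: string } {
  let content = "";
  for (const chunk of chunks) {
    const verdict = onChunk(chunk);
    if (verdict !== "continue") {
      // The tripping chunk is dropped so it never reaches downstream consumers.
      return { status: verdict, content };
    }
    content += chunk;
  }
  return { status: "completed", content };
}
```

In a real deployment the non-`continue` verdict would also abort the in-flight request and force the FSM into the matching terminal state before any tool runs.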

## Tool contracts

Every tool is registered with a contract (RFC 0003 + RFC 0018):
27 changes: 14 additions & 13 deletions harness/src/agents/chat-agent.ts
@@ -8,6 +8,7 @@ import type { AgentState } from "../core/state.js";
import { createAgentState } from "../core/state.js";
import { transition, forceStop } from "../core/transitions.js";
import { createBudgetTracker } from "../core/budget-tracker.js";
import { callLLMWithCircuitBreaker } from "../core/llm-circuit-breaker.js";
import type { ToolRegistry } from "../core/tool-registry.js";
import {
emitPlan,
@@ -56,19 +57,19 @@ export async function runChatAgent(
let lastResponse: LLMResponseWithTools | undefined;

const callLLM = async (messages: LLMMessage[]): Promise<LLMResponseWithTools> => {
if (halted(state)) {
return { content: "", tokensUsed: 0, model: "noop", finishReason: "stop" };
}
try {
const response = await backend.chat(messages);
budget.recordTokens(state, response.tokensUsed, `LLM (${backend.name})`);
lastResponse = response;
return response;
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
forceStop(state, "fail_safe", msg);
return { content: "", tokensUsed: 0, model: "error", finishReason: "stop" };
}
const response = await callLLMWithCircuitBreaker({
backend,
messages,
state,
budget,
llmReason: `LLM (${backend.name})`,
stream: true,
safety: {
maxDecodedChars: 12_000,
},
});
lastResponse = response;
return response;
};

const end = (answer: string): Trace => {
27 changes: 14 additions & 13 deletions harness/src/agents/coder-agent.ts
@@ -7,6 +7,7 @@ import type { AgentState } from "../core/state.js";
import { createAgentState } from "../core/state.js";
import { transition, forceStop } from "../core/transitions.js";
import { createBudgetTracker } from "../core/budget-tracker.js";
import { callLLMWithCircuitBreaker } from "../core/llm-circuit-breaker.js";
import type { ToolRegistry } from "../core/tool-registry.js";
import {
emitPlan,
@@ -55,19 +56,19 @@ export async function runCoderAgent(
let lastResponse: LLMResponseWithTools | undefined;

const callLLM = async (messages: LLMMessage[]): Promise<LLMResponseWithTools> => {
if (halted(state)) {
return { content: "", tokensUsed: 0, model: "noop", finishReason: "stop" };
}
try {
const response = await backend.chat(messages);
budget.recordTokens(state, response.tokensUsed, `LLM (${backend.name})`);
lastResponse = response;
return response;
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
forceStop(state, "fail_safe", msg);
return { content: "", tokensUsed: 0, model: "error", finishReason: "stop" };
}
const response = await callLLMWithCircuitBreaker({
backend,
messages,
state,
budget,
llmReason: `LLM (${backend.name})`,
stream: true,
safety: {
maxDecodedChars: 20_000,
},
});
lastResponse = response;
return response;
};

const end = (answer: string): Trace => {
31 changes: 14 additions & 17 deletions harness/src/agents/governed-agent.ts
@@ -9,6 +9,7 @@ import type { AgentState } from "../core/state.js";
import { createAgentState } from "../core/state.js";
import { transition, forceStop } from "../core/transitions.js";
import { createBudgetTracker } from "../core/budget-tracker.js";
import { callLLMWithCircuitBreaker } from "../core/llm-circuit-breaker.js";
import type { ToolRegistry } from "../core/tool-registry.js";
import { PermissionManager } from "../governance/permission-manager.js";
import { PolicyEvaluator } from "../governance/policy-evaluator.js";
@@ -75,23 +76,19 @@ export async function runGovernedAgent(
const callLLM = async (
messages: LLMMessage[],
): Promise<LLMResponseWithTools> => {
if (halted(state)) {
return { content: "", tokensUsed: 0, model: "noop", finishReason: "stop" };
}
try {
const response = await config.backend.chat(messages);
budget.recordTokens(
state,
response.tokensUsed,
`LLM (${config.backend.name})`,
);
lastResponse = response;
return response;
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
forceStop(state, "fail_safe", `LLM failure: ${msg}`);
return { content: "", tokensUsed: 0, model: "error", finishReason: "stop" };
}
const response = await callLLMWithCircuitBreaker({
backend: config.backend,
messages,
state,
budget,
llmReason: `LLM (${config.backend.name})`,
stream: true,
safety: {
maxDecodedChars: 16_000,
},
});
lastResponse = response;
return response;
};

const finish = (answer: string): GovernedAgentResult => {
11 changes: 10 additions & 1 deletion harness/src/backends/index.ts
@@ -1,4 +1,13 @@
export type { LLMBackend, LLMMessage, LLMResponse, LLMResponseWithTools, ToolCallRequest } from "./types.js";
export type {
LLMBackend,
LLMChatOptions,
LLMFinishReason,
LLMMessage,
LLMResponse,
LLMResponseWithTools,
LLMStreamChunk,
ToolCallRequest,
} from "./types.js";
export { MockLLMBackend } from "./mock.js";
export { OpenAICompatBackend, configFromEnv } from "./openai-compat.js";
export type { OpenAICompatConfig } from "./openai-compat.js";
84 changes: 83 additions & 1 deletion harness/src/backends/mock.ts
@@ -7,6 +7,7 @@

import type {
LLMBackend,
LLMChatOptions,
LLMMessage,
LLMResponseWithTools,
ToolCallRequest,
@@ -31,7 +32,24 @@ export class MockLLMBackend implements LLMBackend {
this.routes.push(route);
}

async chat(messages: LLMMessage[]): Promise<LLMResponseWithTools> {
async chat(
messages: LLMMessage[],
options?: LLMChatOptions,
): Promise<LLMResponseWithTools> {
if (options?.signal?.aborted) {
throw makeAbortError("Mock backend aborted before decode");
}

const response = this.resolveResponse(messages);
const capped = applyOutputCap(response, options?.maxOutputTokens);

if (options?.stream && options.onChunk && capped.content) {
await emitStream(capped.content, options);
}
return capped;
}

private resolveResponse(messages: LLMMessage[]): LLMResponseWithTools {
const raw =
[...messages].reverse().find((m) => m.role === "user")?.content ?? "";

@@ -90,6 +108,70 @@ export class MockLLMBackend implements LLMBackend {
}
}

function makeAbortError(message: string): Error {
const err = new Error(message);
err.name = "AbortError";
return err;
}

function estimateTokens(text: string): number {
if (!text) return 0;
return Math.max(1, Math.ceil(text.length / 4));
}

function applyOutputCap(
response: LLMResponseWithTools,
maxOutputTokens?: number,
): LLMResponseWithTools {
if (
maxOutputTokens === undefined ||
maxOutputTokens <= 0 ||
response.content.length === 0
) {
return response;
}

const completionEstimate = estimateTokens(response.content);
if (completionEstimate <= maxOutputTokens) {
return response;
}

const maxChars = Math.max(1, maxOutputTokens * 4);
const truncated = response.content.slice(0, maxChars);
const promptEstimate = Math.max(
0,
response.tokensUsed - completionEstimate,
);
return {
...response,
content: truncated,
tokensUsed: promptEstimate + estimateTokens(truncated),
finishReason: "length",
};
}

async function emitStream(
content: string,
options: LLMChatOptions,
): Promise<void> {
const chunks = content.match(/\S+\s*/g) ?? [content];
let aggregate = "";
let completionTokensEstimated = 0;

for (const piece of chunks) {
if (options.signal?.aborted) {
throw makeAbortError("Mock backend aborted during streamed decode");
}
aggregate += piece;
completionTokensEstimated += estimateTokens(piece);
await options.onChunk?.({
contentDelta: piece,
content: aggregate,
completionTokensEstimated,
});
}
}

function getDefaultRoutes(): MockRoute[] {
return [
{