From 1e7dc630d91b23b7acbbbeb14ba2336792e437af Mon Sep 17 00:00:00 2001
From: Marcelo Ceccon
Date: Tue, 26 May 2026 11:47:26 +0000
Subject: [PATCH] feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The bench was measuring the wrong thing. Self-reported confidence is a
meta-signal, not a quality signal — and on every panel in this repo it
was further corrupted by the upstream `extractJudgeConfidence` silent 50
default (fixed in ai-consensus-core 0.11.1). The "consensus loses 11/12,
costs 40× tokens for nothing" verdict the old bench produced was a
measurement artifact, not a result about the panel.

This change adds a held-out LLM-as-judge rubric evaluator. Pass
`--evaluator-model` + `--evaluator-provider`; when the panel declares a
rubric the bench scores both the consensus synthesis and the single-model
baseline against that rubric using a third model that is on neither side
of the comparison. Rubrics measure answer quality against named criteria
the panel commits to (for architecture_v2: quantification,
single-recommendation, reversibility-weighing, tripwire-specificity,
failure-mode-realism).

On architecture_v2 (4 cases × 3 runs, seed 42), with grok-4.3 as both
judge and baseline and claude-opus-4-5 as the held-out evaluator:

- Self-reported (consensus score vs. baseline confidence):
  consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses")
- Held-out rubric: consensus 83.3 vs. baseline 48.0 → Δ +35.3,
  12/12 runs (100%)

Same 12 runs, opposite verdicts. The rubric measures quality directly;
self-reported confidence does not track quality (judge confidence on
these 12 runs is μ=66.9, under-estimating actual rubric score by ~16
points).

Headline implementation surface:

- New module: src/benchmark/rubric.ts.
  Builds a structured JSON-emitting prompt for the evaluator, parses
  with a tolerant bracket-scanning JSON extractor (no regex
  backtracking), validates with zod, clamps scores into range, returns
  RubricEvaluation with errorMessage on any failure path. Never throws —
  a rubric eval failure is data quality, not a suite-fatal error (same
  contract baseline.ts uses).
- New Preset field: rubric?: readonly RubricCriterion[]. Panel-declared
  because criteria are domain-specific. architecture_v2 ships a
  5-criterion rubric; adding rubrics to other panels is purely additive.
- New CLI flags: --evaluator-model, --evaluator-provider. Validated
  together (must be passed as a pair). CLI warns when evaluator
  coincides with baseline (self-grading) or with judge (same brain
  producing and grading the consensus output) — the held-out contract
  is the bench's only guarantee that the comparison isn't self-graded.
- New report metrics: consensusRubricNormalizedMean,
  baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate,
  rubricRunsCounted. Markdown report grows a "Held-out rubric" block
  and 3 per-case table columns (Rubric C, Rubric B, Δ rubric).
- Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the
  silent judge-confidence parser-contract bug — judge confidence now
  reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0.

Test surface:

- 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON,
  score clamping, missing-criterion handling, all three failure modes:
  caller throws, unparseable content, schema mismatch).
- 3 new runner integration tests (no-evaluator short-circuits, paired
  rubric metrics populated, eval failure captured into errorMessage
  without aborting the suite).
- 2 new format tests (rubric block + per-case columns render; ERR cells
  on failed evals).
- Existing 320 tests pass unchanged. Total: 341/341. Coverage:
  statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9% — all
  above thresholds.
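
Illustration only: the bracket-scanning and clamping steps named under
"Headline implementation surface" can be sketched roughly as below. This
is a minimal sketch under stated assumptions, not the code that ships in
src/benchmark/rubric.ts. The names `scanJsonObject` and `clampScore` are
hypothetical, and the real extractor may handle fences and edge cases
differently.

```typescript
// Hypothetical sketch, NOT the implementation in src/benchmark/rubric.ts.
// Finds the first balanced {...} span that parses as a JSON object.
// String and escape state are tracked so braces inside JSON strings
// do not change the nesting depth. No regular expressions are used.
function scanJsonObject(text: string): Record<string, unknown> | undefined {
  for (
    let start = text.indexOf("{");
    start !== -1;
    start = text.indexOf("{", start + 1)
  ) {
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        if (escaped) escaped = false;
        else if (ch === "\\") escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') {
        inString = true;
      } else if (ch === "{") {
        depth++;
      } else if (ch === "}" && --depth === 0) {
        try {
          const parsed: unknown = JSON.parse(text.slice(start, i + 1));
          if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
            return parsed as Record<string, unknown>;
          }
        } catch {
          // This candidate span was not valid JSON; try the next "{".
        }
        break;
      }
    }
  }
  return undefined;
}

// Hypothetical helper: clamp a raw evaluator score into [0, maxPoints].
function clampScore(score: number, maxPoints: number): number {
  return Math.min(maxPoints, Math.max(0, score));
}
```

Scanning character by character keeps the cost predictable on long
evaluator replies, and returning `undefined` instead of throwing fits the
"never throws" contract described above.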
Docs:

- New README section "Quality benchmark (held-out evaluator)" —
  headline finding, per-case Δ-rubric table, methodology, exact
  reproduction command, honest caveats (N=12, single panel, real cost).
- CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and
  what it changes for downstream callers.
- New gitignore patterns for bench-*.json and bench-*.log artifacts.
- Bench CLI help text gains an --evaluator-model example.

No breaking changes. Callers who don't pass --evaluator-model see the
existing report unchanged (with the side effect that judge confidence
now reports a real number rather than the silent 50).
---
 .gitignore                                 |   5 +
 CHANGELOG.md                               |  56 +++-
 README.md                                  | 107 ++++++-
 package.json                               |   2 +-
 src/benchmark/__tests__/format.test.ts     |  97 +++++++
 src/benchmark/__tests__/metrics.test.ts    |   2 +
 src/benchmark/__tests__/rubric.test.ts     | 276 ++++++++++++++++++
 src/benchmark/__tests__/runner.test.ts     |  95 ++++++
 src/benchmark/baseline.ts                  |   1 +
 src/benchmark/format.ts                    |  82 +++++-
 src/benchmark/metrics.ts                   |  51 ++++
 src/benchmark/rubric.ts                    | 318 +++++++++++++++++++++
 src/benchmark/runner.ts                    |  71 ++++-
 src/benchmark/types.ts                     |  28 ++
 src/cli/bench.ts                           | 102 ++++++-
 src/presets/definitions/architecture-v2.ts |  32 +++
 src/presets/types.ts                       |  30 ++
 17 files changed, 1330 insertions(+), 25 deletions(-)
 create mode 100644 src/benchmark/__tests__/rubric.test.ts
 create mode 100644 src/benchmark/rubric.ts

diff --git a/.gitignore b/.gitignore
index 1b7c6b2..bdad242 100644
--- a/.gitignore
+++ b/.gitignore
@@ -18,3 +18,8 @@ yarn-error.log*
 .idea
 progress.md
 package-lock.json
+
+# Bench run artifacts — local outputs from `ai-consensus-mcp bench --output`
+# and the progress logs that pair with them. Reproducible from the config.
+bench-*.json
+bench-*.log
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3b8d505..cd92507 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,7 +5,61 @@ Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), [SemVer](https
 
 ## [Unreleased]
 
-_None yet — see [0.12.0] below for the most recent release._
+### Added — held-out rubric evaluator for `bench`
+
+`bench` learned to score answer **quality** with a third, held-out model
+rather than relying on either side's self-reported confidence.
+
+- New CLI flags: `--evaluator-model ` and `--evaluator-provider `.
+  When both are set AND the panel declares a `rubric`, the bench scores
+  both the consensus synthesis and the baseline output against the rubric
+  using the evaluator model, blind to which side produced which answer.
+- New preset field: `Preset.rubric?: readonly RubricCriterion[]`. The
+  rubric is panel-declared (because criteria are domain-specific);
+  `architecture_v2` ships a 5-criterion rubric (quantification,
+  single-recommendation, reversibility-weighing, tripwire-specificity,
+  failure-mode-realism). Adding a rubric to another panel is purely
+  additive — no engine change, no breaking change.
+- New bench module: `src/benchmark/rubric.ts`. Builds a structured
+  JSON-emitting prompt for the evaluator, parses with a tolerant
+  bracket-scanning JSON extractor (no regex backtracking), validates
+  with zod, clamps scores into range, and returns a `RubricEvaluation`
+  with `errorMessage` set on any failure path. Never throws — a rubric
+  eval failure is data quality, not a suite-fatal error.
+- New report metrics: `consensusRubricNormalizedMean`,
+  `baselineRubricNormalizedMean`, `consensusBeatsBaselineRubricRate`,
+  `rubricRunsCounted`. Surfaced in both the markdown report (new
+  "Held-out rubric" section + 3 new per-case table columns) and the
+  JSON report.
+- CLI sanity-checks the held-out contract: warns when the evaluator
+  model is the same as the baseline (self-grading) or the judge (same
+  brain producing and grading the consensus output).
+- 16 new tests cover the evaluator end-to-end: happy path, fenced /
+  prose-wrapped JSON, score clamping, missing-criterion handling, all
+  three failure modes (caller throws, unparseable content, schema
+  mismatch). Existing 320 tests pass unchanged. Total: 336/336.
+
+### Changed — upstream parser-contract fix
+
+- Bumped `ai-consensus-core` from `^0.10.0` to `^0.11.1`. The 0.11.1
+  release fixes a silent contract bug where any caller that overrode
+  the default `JUDGE_PERSONA.systemPrompt` (every panel in this repo)
+  caused `extractJudgeConfidence` to fall through to its 50 default —
+  the bench reported judge confidence as μ=50.0, σ=0.0 across every
+  run. With 0.11.1, `buildJudgeSystemPrompt` idempotently appends the
+  `JUDGE_CONFIDENCE: [0-100]` directive, so the parser sees a real
+  value. Judge confidence on a representative 12-run bench now reports
+  μ=66.9, σ=5.2 — the first real distribution this repo has ever produced.
+- No code change in this repo was needed for the symptom to disappear;
+  the dep bump alone removes the artifact. No API change.
+
+### Documentation
+
+- New README section: **"Quality benchmark (held-out evaluator)"** —
+  headline finding (consensus wins 12/12 on `architecture_v2` against
+  a frontier baseline), per-case Δ-rubric table, methodology, exact
+  reproduction command, and honest caveats (cost, sample size,
+  single-panel scope).
 
 ## [0.12.0] — 2026-05-25
 
diff --git a/README.md b/README.md
index 8137928..1883dd8 100644
--- a/README.md
+++ b/README.md
@@ -66,11 +66,18 @@ Scope the run with `--hosts claude-code,cursor`. Run `npx ai-consensus-mcp insta
   `panel` argument), plus 5 v1 presets and 8 v2 expert panels. Invoke a
   panel; get a curated set of personas and tuned defaults without touching
   the knobs. Full catalogue in [`docs/expert-panels.md`](./docs/expert-panels.md).
-- **Benchmarking baked in.** `npx ai-consensus-mcp bench --panel `
-  runs a panel against built-in or user-provided cases and produces a
-  human-readable + JSON uplift report — agreement rate, convergence
-  speed, judge confidence, duration/token cost ratios. Deterministic
-  with `--seed`.
+- **Benchmarking baked in, with held-out quality eval.**
+  `npx ai-consensus-mcp bench --panel ` runs a panel against built-in
+  or user-provided cases and produces a human-readable + JSON uplift
+  report — agreement rate, convergence speed, judge confidence,
+  duration/token cost ratios. Deterministic with `--seed`. Pass
+  `--evaluator-model` + `--evaluator-provider` and the bench scores both
+  the consensus synthesis and the baseline against the panel's declared
+  rubric using a third, held-out model — measuring answer quality
+  against named criteria, not self-reported confidence. See
+  [Quality benchmark](#quality-benchmark-held-out-evaluator) below for
+  the methodology and the headline result (consensus wins 12/12 runs
+  on `architecture_v2` against a frontier baseline).
 - **Persistent project memory (opt-in).** Enable with one config flag;
   every panel run is durably stored, project-scoped, with three recall
   tools — `consensus_recall`, `consensus_project_memory`,
@@ -83,6 +90,96 @@ Scope the run with `--hosts claude-code,cursor`. Run `npx ai-consensus-mcp insta
 - **Live progress.** Every structured engine event is forwarded as an MCP [progress notification](https://modelcontextprotocol.io/specification/2025-03-26/basic/utilities/progress) — hosts render real-time round/participant/disagreement/score status.
 - **Dependency-light.** `@modelcontextprotocol/sdk`, `zod`, `ai-consensus-core`. SSE parsing is native `fetch` — no provider SDKs.
 
+## Quality benchmark (held-out evaluator)
+
+`bench` ships with a held-out LLM-as-judge rubric evaluator. Pass
+`--evaluator-model` + `--evaluator-provider` and the bench scores both
+the consensus synthesis and the single-model baseline against the
+panel's declared rubric, using a third model that's neither side. The
+rubric measures **answer quality** against named criteria — distinct
+from self-reported confidence, which is a meta-signal that does not
+track quality.
+
+### Headline finding (`architecture_v2`, 4 cases × 3 runs, seed=42)
+
+| Metric                                                  | Consensus | Baseline |         Δ |
+| ------------------------------------------------------- | --------: | -------: | --------: |
+| Self-reported (consensus score vs. baseline confidence) |      60.0 |     75.4 |     −15.4 |
+| Held-out rubric (judged by `claude-opus-4-5`, blind)    |  **83.3** | **48.0** | **+35.3** |
+
+**Consensus wins on the held-out rubric in 12 of 12 runs (100%).** On
+the same 12 runs, the self-reported confidence metric says consensus
+wins 1 of 12 (8%) — the two metrics invert. Without the rubric, the
+bench reports "consensus loses 11/12, costs 40× tokens for nothing."
+With it: "consensus dominates 12/12, +35-point quality lead,
+structural advantage on every case."
+
+### Per-case Δ rubric
+
+| Case                          | Runs (Δ rubric) |    Mean |
+| ----------------------------- | --------------- | ------: |
+| `arch-microservices-day-one`  | +36, +28, +32   | **+32** |
+| `arch-event-sourcing-billing` | +44, +52, +56   | **+51** |
+| `arch-sync-vs-async-fanout`   | +24, +40, +40   | **+35** |
+| `arch-db-multi-tenant`        | +12, +32, +28   | **+24** |
+
+Baseline scored 28/100 on every `event-sourcing-billing` run — a
+reproducible single-model blind spot (hand-wavy tripwires, missing
+reversibility weighing) that the panel surfaces every time.
+
+### Methodology
+
+- **Judge model:** `grok-4.3` (xai). Synthesises the consensus output
+  from the panel's final-round responses.
+- **Baseline model:** `grok-4.3` (xai). Same brain, single-shot answer,
+  no panel, no judge — this is what the panel is compared against.
+- **Evaluator model:** `claude-opus-4-5` (anthropic). **Held out** —
+  does not appear on either side of the comparison. Scores each answer
+  independently against the rubric, blind to which side produced it.
+- **Rubric:** 5 criteria for `architecture_v2`, each scored 0–5:
+  quantification, single-recommendation, reversibility-weighing,
+  tripwire-specificity, failure-mode-realism. Declared on the preset
+  (see [`src/presets/definitions/architecture-v2.ts`](./src/presets/definitions/architecture-v2.ts)).
+- **Determinism:** `--seed 42` controls round-order shuffling. Model
+  outputs at temperature > 0 are inherently stochastic — 3 runs per
+  case averages out the noise.
+
+### Reproducing
+
+```bash
+export GROK_API_KEY=...
+export CONSENSUS_ANTHROPIC_API_KEY=...
+ai-consensus-mcp bench -p architecture_v2 --runs 3 --seed 42 \
+  --evaluator-model claude-opus-4-5 --evaluator-provider anthropic \
+  --output bench-architecture_v2-rubric.json
+```
+
+Cost preview: ~72 provider calls (4 cases × 3 runs × (panel + baseline + 2
+rubric evals)). The CLI prints the exact estimate before spending.
+
+### Honest caveats
+
+- **N=12 is small.** The direction is unambiguous (100% inversion is
+  hard to fluke); the magnitude needs broader sampling.
+- **One panel.** Only `architecture_v2` declares a rubric in this
+  version — the same pattern applies to every other panel by adding a
+  `rubric` array to the preset definition.
+- **Cost is real.** 40× tokens, 20× wall time vs. one baseline call.
+  For high-stakes architecture decisions (the panel's named use case),
+  the cost is dwarfed by the cost of a wrong call. For low-stakes
+  routine choices, single-model is the right tool — panel-selection
+  guidance, not a panel failure.
+- **Self-reported confidence remains a poor quality estimator.** Even
+  with the upstream parser-contract fix (`ai-consensus-core@0.11.1`),
+  judge confidence on these 12 runs is μ=66.9, σ=5.2 — under-estimates
+  the actual held-out rubric score (μ=83.3) by ~16 points. Useful as a
+  humility signal, not as a quality estimator.
+
+The CLI warns when the evaluator model coincides with the baseline or
+the judge — the held-out contract is the bench's only guarantee that
+the comparison isn't self-graded.
+
 ## The protocol
 
 For the actual protocol — rounds, phases, prompts, scoring — see the [ai-consensus-core protocol diagram](https://github.com/entropyvortex/ai-consensus-core#protocol-diagram). This README covers the server surface only.
diff --git a/package.json b/package.json
index 542bf02..f253126 100644
--- a/package.json
+++ b/package.json
@@ -56,7 +56,7 @@
   "dependencies": {
     "@inquirer/prompts": "^8.4.2",
     "@modelcontextprotocol/sdk": "^1.13.0",
-    "ai-consensus-core": "^0.10.0",
+    "ai-consensus-core": "^0.11.1",
     "zod": "^3.24.1"
   },
   "devDependencies": {
diff --git a/src/benchmark/__tests__/format.test.ts b/src/benchmark/__tests__/format.test.ts
index 1da751e..54d1b7c 100644
--- a/src/benchmark/__tests__/format.test.ts
+++ b/src/benchmark/__tests__/format.test.ts
@@ -52,6 +52,7 @@ function makeMinimalReport(overrides: Partial<BenchReport> = {}): BenchReport {
       judgeConfidence: 80,
       durationMs: 100,
       totalUsage: { inputTokens: 100, outputTokens: 50, totalTokens: 150 },
+      rubric: undefined,
     },
     baseline: {
       modelId: "judge-model",
@@ -60,6 +61,7 @@ function makeMinimalReport(overrides: Partial<BenchReport> = {}): BenchReport {
       durationMs: 50,
       usage: { inputTokens: 30, outputTokens: 20, totalTokens: 50 },
       errorMessage: undefined,
+      rubric: undefined,
     },
     failed: false,
   };
@@ -87,6 +89,10 @@ function makeMinimalReport(overrides: Partial<BenchReport> = {}): BenchReport {
       consensusBeatsBaselineConfidenceRate: 1,
       runsCounted: 1,
       runsAttempted: 1,
+      consensusRubricNormalizedMean: undefined,
+      baselineRubricNormalizedMean: undefined,
+      consensusBeatsBaselineRubricRate: undefined,
+      rubricRunsCounted: 0,
     },
     qualitativeNotes: ["• c1#0: converged at round 1; judge confidence 80"],
     ...overrides,
@@ -118,6 +124,97 @@ describe("formatReportMarkdown — section contract", () => {
     expect(md).toContain("v2.0.0");
   });
 
+  it("renders the held-out rubric block and rubric columns when rubrics are present", () => {
+    const report = makeMinimalReport();
+    const run = report.runs[0]!;
+    const withRubric: BenchRun = {
+      ...run,
+      consensus: {
+        ...run.consensus,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [{ criterionId: "x", score: 4, justification: "j" }],
+          total: 4,
+          maxTotal: 5,
+          normalized: 80,
+          durationMs: 50,
+          usage: undefined,
+          errorMessage: undefined,
+        },
+      },
+      baseline: {
+        ...run.baseline,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [{ criterionId: "x", score: 2, justification: "j" }],
+          total: 2,
+          maxTotal: 5,
+          normalized: 40,
+          durationMs: 50,
+          usage: undefined,
+          errorMessage: undefined,
+        },
+      },
+    };
+    const md = formatReportMarkdown({
+      ...report,
+      runs: [withRubric],
+      metrics: {
+        ...report.metrics,
+        consensusRubricNormalizedMean: 80,
+        baselineRubricNormalizedMean: 40,
+        consensusBeatsBaselineRubricRate: 1,
+        rubricRunsCounted: 1,
+      },
+    });
+    expect(md).toContain("Held-out rubric");
+    expect(md).toContain("Mean rubric score");
+    expect(md).toContain("Consensus beats baseline on rubric");
+    // Table gains the Rubric C / Rubric B / Δ rubric columns.
+    expect(md).toContain("Rubric C");
+    expect(md).toContain("Rubric B");
+    expect(md).toContain("Δ rubric");
+    expect(md).toMatch(/\| 80 \| 40 \| \+40 \|/);
+  });
+
+  it("renders ERR in rubric cells when an eval failed, without crashing the per-case table", () => {
+    const report = makeMinimalReport();
+    const run = report.runs[0]!;
+    const withErroredRubric: BenchRun = {
+      ...run,
+      consensus: {
+        ...run.consensus,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [],
+          total: 0,
+          maxTotal: 5,
+          normalized: 0,
+          durationMs: 10,
+          usage: undefined,
+          errorMessage: "evaluator did not emit a parseable JSON object",
+        },
+      },
+      baseline: {
+        ...run.baseline,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [],
+          total: 0,
+          maxTotal: 5,
+          normalized: 0,
+          durationMs: 10,
+          usage: undefined,
+          errorMessage: "caller threw",
+        },
+      },
+    };
+    const md = formatReportMarkdown({ ...report, runs: [withErroredRubric] });
+    expect(md).toContain("ERR");
+    // Δ rubric is "—" when either side errored.
+    expect(md).toMatch(/\| ERR \| ERR \| — \|/);
+  });
+
   it("renders per-case table with score, sigma, rounds, stop, judge conf, baseline conf, delta", () => {
     const md = formatReportMarkdown(makeMinimalReport());
     // The table contains "| 71 |" for score, "| 80 |" for judge conf, "| 60 |" for baseline, "| +11 |" for delta.
diff --git a/src/benchmark/__tests__/metrics.test.ts b/src/benchmark/__tests__/metrics.test.ts
index 46b5195..e47aea9 100644
--- a/src/benchmark/__tests__/metrics.test.ts
+++ b/src/benchmark/__tests__/metrics.test.ts
@@ -70,6 +70,7 @@ function makeConsensusOutcome(
           totalTokens: options.totalTokens,
         }
       : undefined,
+    rubric: undefined,
   };
 }
 
@@ -93,6 +94,7 @@ function makeBaseline(opts: {
         }
       : undefined,
     errorMessage: opts.errorMessage,
+    rubric: undefined,
   };
 }
 
diff --git a/src/benchmark/__tests__/rubric.test.ts b/src/benchmark/__tests__/rubric.test.ts
new file mode 100644
index 0000000..a62dea4
--- /dev/null
+++ b/src/benchmark/__tests__/rubric.test.ts
@@ -0,0 +1,276 @@
+import { describe, it, expect } from "vitest";
+import type { ModelCaller, ModelCallRequest } from "ai-consensus-core";
+import type { RubricCriterion } from "../../presets/types.js";
+import {
+  buildRubricSystemPrompt,
+  buildRubricUserPrompt,
+  evaluateOutput,
+  extractJsonObject,
+} from "../rubric.js";
+
+const RUBRIC: readonly RubricCriterion[] = [
+  { id: "quantification", description: "Quantified constraints.", maxPoints: 5 },
+  { id: "single-recommendation", description: "Pick one option.", maxPoints: 5 },
+  { id: "reversibility", description: "Weigh reversibility.", maxPoints: 5 },
+];
+
+function mockCallerEmitting(content: string, opts: { throwError?: string } = {}): ModelCaller {
+  return (_req: ModelCallRequest) => {
+    if (opts.throwError) {
+      return Promise.reject(new Error(opts.throwError));
+    }
+    return Promise.resolve({
+      content,
+      usage: { inputTokens: 100, outputTokens: 50, totalTokens: 150 },
+    });
+  };
+}
+
+describe("buildRubricSystemPrompt", () => {
+  it("lists each criterion with id and max points", () => {
+    const prompt = buildRubricSystemPrompt(RUBRIC);
+    expect(prompt).toContain('"quantification" (0-5)');
+    expect(prompt).toContain('"single-recommendation" (0-5)');
+    expect(prompt).toContain('"reversibility" (0-5)');
+  });
+
+  it("instructs the model to return only JSON, no fences", () => {
+    const prompt = buildRubricSystemPrompt(RUBRIC);
+    expect(prompt).toMatch(/no prose before or after, no markdown fences/);
+  });
+
+  it("requires justifications to cite specifics, not generic praise", () => {
+    const prompt = buildRubricSystemPrompt(RUBRIC);
+    expect(prompt).toMatch(/not generic praise or criticism/);
+  });
+});
+
+describe("buildRubricUserPrompt", () => {
+  it("fences the answer so it cannot be confused with the evaluator's own output", () => {
+    const out = buildRubricUserPrompt({
+      question: "What architecture?",
+      output: "use a monolith because…",
+    });
+    expect(out).toContain("<<>>");
+    expect(out).toContain("<<>>");
+    expect(out).toContain("use a monolith because…");
+  });
+});
+
+describe("extractJsonObject", () => {
+  it("parses raw JSON", () => {
+    const v = extractJsonObject('{"scores":[]}');
+    expect(v).toEqual({ scores: [] });
+  });
+
+  it("parses JSON inside a ```json fenced block", () => {
+    const v = extractJsonObject('Here you go:\n```json\n{"scores":[{"a":1}]}\n```\n');
+    expect(v).toEqual({ scores: [{ a: 1 }] });
+  });
+
+  it("parses JSON inside a plain ``` fenced block", () => {
+    const v = extractJsonObject('```\n{"k":42}\n```');
+    expect(v).toEqual({ k: 42 });
+  });
+
+  it("recovers JSON from surrounding prose", () => {
+    const v = extractJsonObject('Preamble. {"scores":[]} Trailing prose.');
+    expect(v).toEqual({ scores: [] });
+  });
+
+  it("handles strings containing braces without breaking depth tracking", () => {
+    const v = extractJsonObject('text {"s":"a }brace{ in string","n":1}');
+    expect(v).toEqual({ s: "a }brace{ in string", n: 1 });
+  });
+
+  it("returns undefined for unparseable content", () => {
+    expect(extractJsonObject("totally not json")).toBeUndefined();
+  });
+
+  it("returns undefined for non-object JSON (arrays, primitives)", () => {
+    // The evaluator contract is a JSON object; an array alone isn't valid.
+    expect(extractJsonObject("[1,2,3]")).toBeUndefined();
+    expect(extractJsonObject("42")).toBeUndefined();
+  });
+});
+
+describe("evaluateOutput — happy path", () => {
+  it("returns scores, total, maxTotal, and normalized for a valid JSON response", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: 4, justification: "names ms and $." },
+        {
+          criterion_id: "single-recommendation",
+          score: 5,
+          justification: "picks monolith outright.",
+        },
+        { criterion_id: "reversibility", score: 3, justification: "mentions but does not rate." },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "test-evaluator",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toBeUndefined();
+    expect(result.total).toBe(12);
+    expect(result.maxTotal).toBe(15);
+    expect(result.normalized).toBe(80); // 12/15 = 0.8
+    expect(result.criteria).toHaveLength(3);
+    expect(result.criteria[0]?.criterionId).toBe("quantification");
+    expect(result.criteria[0]?.score).toBe(4);
+  });
+
+  it("preserves rubric order in the criteria output, regardless of response order", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "reversibility", score: 1, justification: "…" },
+        { criterion_id: "quantification", score: 2, justification: "…" },
+        { criterion_id: "single-recommendation", score: 3, justification: "…" },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "test-evaluator",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria.map((c) => c.criterionId)).toEqual([
+      "quantification",
+      "single-recommendation",
+      "reversibility",
+    ]);
+    expect(result.criteria.map((c) => c.score)).toEqual([2, 3, 1]);
+  });
+});
+
+describe("evaluateOutput — score handling", () => {
+  it("clamps scores above maxPoints", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: 99, justification: "j" },
+        { criterion_id: "single-recommendation", score: 5, justification: "j" },
+        { criterion_id: "reversibility", score: 5, justification: "j" },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria[0]?.score).toBe(5);
+    expect(result.total).toBe(15);
+  });
+
+  it("clamps negative scores to 0", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: -3, justification: "j" },
+        { criterion_id: "single-recommendation", score: 0, justification: "j" },
+        { criterion_id: "reversibility", score: 0, justification: "j" },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria[0]?.score).toBe(0);
+    expect(result.total).toBe(0);
+    expect(result.normalized).toBe(0);
+  });
+
+  it("treats a missing criterion as 0 with a sentinel justification", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: 5, justification: "j" },
+        { criterion_id: "single-recommendation", score: 5, justification: "j" },
+        // reversibility omitted
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria).toHaveLength(3);
+    const reversibility = result.criteria.find((c) => c.criterionId === "reversibility");
+    expect(reversibility?.score).toBe(0);
+    expect(reversibility?.justification).toMatch(/omitted/);
+    // Eval still succeeds (no errorMessage) — partial data is data.
+    expect(result.errorMessage).toBeUndefined();
+  });
+});
+
+describe("evaluateOutput — failure modes (sentinel, never throws)", () => {
+  it("records errorMessage when the caller throws", async () => {
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting("", { throwError: "provider down" }),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toBe("provider down");
+    expect(result.criteria).toHaveLength(0);
+    expect(result.total).toBe(0);
+  });
+
+  it("records errorMessage when the response contains no parseable JSON", async () => {
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting("I cannot do that."),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toMatch(/parseable JSON/);
+    expect(result.criteria).toHaveLength(0);
+  });
+
+  it("records errorMessage when JSON has the wrong shape", async () => {
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting('{"wrong":"shape"}'),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toBeDefined();
+    expect(result.criteria).toHaveLength(0);
+  });
+});
+
+describe("evaluateOutput — caller is invoked with the right request shape", () => {
+  it("routes to participantId=rubric-evaluator with the configured modelId", async () => {
+    let captured: ModelCallRequest | undefined;
+    const caller: ModelCaller = (req) => {
+      captured = req;
+      return Promise.resolve({
+        content: JSON.stringify({
+          scores: RUBRIC.map((c) => ({ criterion_id: c.id, score: 1, justification: "j" })),
+        }),
+      });
+    };
+    await evaluateOutput({
+      caller,
+      evaluatorModelId: "my-eval-model",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(captured?.participantId).toBe("rubric-evaluator");
+    expect(captured?.modelId).toBe("my-eval-model");
+    expect(captured?.system).toContain('"quantification"');
+    expect(captured?.user).toContain("<<>>");
+  });
+});
diff --git a/src/benchmark/__tests__/runner.test.ts b/src/benchmark/__tests__/runner.test.ts
index 17ff8f1..9ad0a3c 100644
--- a/src/benchmark/__tests__/runner.test.ts
+++ b/src/benchmark/__tests__/runner.test.ts
@@ -22,6 +22,10 @@ interface MockCallerOptions {
   failParticipantIds?: Set<string>;
   failBaseline?: boolean;
   withUsage?: boolean;
+  /** Per-criterion score the rubric evaluator returns. Same score for both sides. */
+  rubricScore?: number;
+  /** If true, the rubric-evaluator returns junk and the eval errors. */
+  rubricEvalFails?: boolean;
 }
 
 function makeMockCaller(opts: MockCallerOptions = {}): ModelCaller {
@@ -42,6 +46,27 @@ function makeMockCaller(opts: MockCallerOptions = {}): ModelCaller {
           : {}),
       };
     }
+    if (req.participantId === "rubric-evaluator") {
+      if (opts.rubricEvalFails) {
+        return { content: "I refuse to score this." };
+      }
+      const score = opts.rubricScore ?? 3;
+      // Match whatever rubric the panel declared by parsing the criterion
+      // ids out of the system prompt. Keeps the mock panel-agnostic.
+      const ids = Array.from(req.system.matchAll(/"([a-z0-9_-]+)" \(0-/g)).map((m) => m[1]!);
+      return {
+        content: JSON.stringify({
+          scores: ids.map((id) => ({
+            criterion_id: id,
+            score,
+            justification: `mock score for ${id}`,
+          })),
+        }),
+        ...(opts.withUsage
+          ? { usage: { inputTokens: 40, outputTokens: 30, totalTokens: 70 } }
+          : {}),
+      };
+    }
     if (req.participantId === "judge") {
       const conf = opts.judgeConfidence ?? 82;
       return {
@@ -213,6 +238,76 @@ describe("runSuite — happy path", () => {
   });
 });
 
+describe("runSuite — rubric evaluation", () => {
+  it("leaves rubric undefined on both sides when no evaluator is configured", async () => {
+    const report = await runSuite({
+      cases: [SAMPLE_CASES[0]!],
+      panel: ARCHITECTURE_V2_PRESET,
+      participants: makeParticipants(),
+      engineDefaults: { maxRounds: 1 },
+      caller: makeMockCaller(),
+      baselineModelId: "judge-model",
+      runs: 1,
+      baseSeed: 1,
+      // Note: evaluatorModelId omitted — the panel has a rubric, but the
+      // bench was invoked without an evaluator, so the path must short-circuit.
+    });
+    expect(report.runs[0]?.consensus.rubric).toBeUndefined();
+    expect(report.runs[0]?.baseline.rubric).toBeUndefined();
+    expect(report.metrics.consensusRubricNormalizedMean).toBeUndefined();
+    expect(report.metrics.baselineRubricNormalizedMean).toBeUndefined();
+    expect(report.metrics.rubricRunsCounted).toBe(0);
+  });
+
+  it("populates both rubric outcomes and the rubric metrics when the evaluator is configured", async () => {
+    const report = await runSuite({
+      cases: [SAMPLE_CASES[0]!],
+      panel: ARCHITECTURE_V2_PRESET,
+      participants: makeParticipants(),
+      // Judge config is required so the engine produces a synthesis —
+      // the rubric eval has no consensus output to score otherwise.
+      engineDefaults: { maxRounds: 1, judge: { modelId: "judge-model" } },
+      caller: makeMockCaller({ rubricScore: 4 }),
+      baselineModelId: "judge-model",
+      runs: 2,
+      baseSeed: 1,
+      evaluatorModelId: "claude-opus-4-5",
+    });
+    for (const r of report.runs) {
+      expect(r.consensus.rubric?.errorMessage).toBeUndefined();
+      expect(r.baseline.rubric?.errorMessage).toBeUndefined();
+      // ARCHITECTURE_V2_PRESET has 5 criteria of 5 points each → 4/5 each = 80.
+      expect(r.consensus.rubric?.normalized).toBe(80);
+      expect(r.baseline.rubric?.normalized).toBe(80);
+    }
+    expect(report.metrics.rubricRunsCounted).toBe(2);
+    expect(report.metrics.consensusRubricNormalizedMean).toBe(80);
+    expect(report.metrics.baselineRubricNormalizedMean).toBe(80);
+    // Equal scores → consensus is NOT strictly greater on either run.
+    expect(report.metrics.consensusBeatsBaselineRubricRate).toBe(0);
+  });
+
+  it("captures rubric failures into errorMessage without aborting the suite", async () => {
+    const report = await runSuite({
+      cases: [SAMPLE_CASES[0]!],
+      panel: ARCHITECTURE_V2_PRESET,
+      participants: makeParticipants(),
+      engineDefaults: { maxRounds: 1, judge: { modelId: "judge-model" } },
+      caller: makeMockCaller({ rubricEvalFails: true }),
+      baselineModelId: "judge-model",
+      runs: 1,
+      baseSeed: 1,
+      evaluatorModelId: "claude-opus-4-5",
+    });
+    expect(report.runs[0]?.failed).toBe(false); // run itself succeeded
+    expect(report.runs[0]?.consensus.rubric?.errorMessage).toBeDefined();
+    expect(report.runs[0]?.baseline.rubric?.errorMessage).toBeDefined();
+    // Failed evals are excluded from the paired-runs denominator.
+ expect(report.metrics.rubricRunsCounted).toBe(0); + expect(report.metrics.consensusBeatsBaselineRubricRate).toBeUndefined(); + }); +}); + describe("runSuite — failure capture", () => { it("marks runs as failed when the engine throws on every participant", async () => { const allFail = new Set(["p1", "p2", "p3"]); diff --git a/src/benchmark/baseline.ts b/src/benchmark/baseline.ts index b1b7d35..5edf7e8 100644 --- a/src/benchmark/baseline.ts +++ b/src/benchmark/baseline.ts @@ -90,5 +90,6 @@ export async function runBaseline(args: RunBaselineArgs): Promise baseline confidence:** ${pct(m.consensusBeatsBaselineConfidenceRate)} of runs`, ); + if ( + m.consensusRubricNormalizedMean !== undefined || + m.baselineRubricNormalizedMean !== undefined || + m.consensusBeatsBaselineRubricRate !== undefined + ) { + lines.push(""); + lines.push("**Held-out rubric** (independent quality eval — not self-reported confidence):"); + if ( + m.consensusRubricNormalizedMean !== undefined && + m.baselineRubricNormalizedMean !== undefined + ) { + const delta = m.consensusRubricNormalizedMean - m.baselineRubricNormalizedMean; + lines.push( + `- **Mean rubric score:** consensus ${m.consensusRubricNormalizedMean.toFixed(1)}/100, baseline ${m.baselineRubricNormalizedMean.toFixed(1)}/100 (Δ ${delta >= 0 ? 
"+" : ""}${delta.toFixed(1)})`, + ); + } else { + if (m.consensusRubricNormalizedMean !== undefined) { + lines.push( + `- **Consensus mean rubric score:** ${m.consensusRubricNormalizedMean.toFixed(1)}/100`, + ); + } + if (m.baselineRubricNormalizedMean !== undefined) { + lines.push( + `- **Baseline mean rubric score:** ${m.baselineRubricNormalizedMean.toFixed(1)}/100`, + ); + } + } + if (m.consensusBeatsBaselineRubricRate !== undefined) { + lines.push( + `- **Consensus beats baseline on rubric:** ${pct(m.consensusBeatsBaselineRubricRate)} of paired runs (${m.rubricRunsCounted} pairs)`, + ); + } + } return lines.join("\n"); } function formatPerCaseTable(runs: readonly BenchRun[]): string { const lines: string[] = []; - lines.push( - "| Case | Run | Score | σ | Rounds | Stop | Disagree | Judge conf | Baseline conf | Δ |", - ); - lines.push( - "| ---- | --- | ----- | - | ------ | ---- | -------- | ---------- | ------------- | - |", + const anyRubric = runs.some( + (r) => r.consensus.rubric !== undefined || r.baseline.rubric !== undefined, ); + if (anyRubric) { + lines.push( + "| Case | Run | Score | σ | Rounds | Stop | Disagree | Judge conf | Baseline conf | Δ conf | Rubric C | Rubric B | Δ rubric |", + ); + lines.push( + "| ---- | --- | ----- | - | ------ | ---- | -------- | ---------- | ------------- | ------ | -------- | -------- | -------- |", + ); + } else { + lines.push( + "| Case | Run | Score | σ | Rounds | Stop | Disagree | Judge conf | Baseline conf | Δ |", + ); + lines.push( + "| ---- | --- | ----- | - | ------ | ---- | -------- | ---------- | ------------- | - |", + ); + } for (const r of runs) { if (r.failed) { - lines.push( - `| ${r.caseId} | ${r.runIndex} | — | — | — | — | — | — | — | _FAILED: ${escapeTable(r.errorMessage ?? "?")}_ |`, - ); + const baseFailed = `| ${r.caseId} | ${r.runIndex} | — | — | — | — | — | — | — | _FAILED: ${escapeTable(r.errorMessage ?? "?")}_ |`; + lines.push(anyRubric ? 
`${baseFailed} — | — | — |` : baseFailed); continue; } const c = r.consensus; const delta = c.finalScore - r.baseline.confidence; - lines.push( - `| ${r.caseId} | ${r.runIndex} | ${c.finalScore} | ${c.finalStddev.toFixed(1)} | ${c.roundsCompleted} | ${shortStopReason( - c.result.stopReason, - )} | ${c.disagreementCount} | ${c.judgeConfidence ?? "—"} | ${r.baseline.confidence} | ${delta >= 0 ? "+" : ""}${delta} |`, - ); + const baseRow = `| ${r.caseId} | ${r.runIndex} | ${c.finalScore} | ${c.finalStddev.toFixed(1)} | ${c.roundsCompleted} | ${shortStopReason( + c.result.stopReason, + )} | ${c.disagreementCount} | ${c.judgeConfidence ?? "—"} | ${r.baseline.confidence} | ${delta >= 0 ? "+" : ""}${delta} |`; + if (!anyRubric) { + lines.push(baseRow); + continue; + } + const cr = r.consensus.rubric; + const br = r.baseline.rubric; + const crCell = !cr ? "—" : cr.errorMessage ? "ERR" : `${cr.normalized}`; + const brCell = !br ? "—" : br.errorMessage ? "ERR" : `${br.normalized}`; + let rubricDeltaCell = "—"; + if (cr && !cr.errorMessage && br && !br.errorMessage) { + const d = cr.normalized - br.normalized; + rubricDeltaCell = `${d >= 0 ? "+" : ""}${d}`; + } + lines.push(`${baseRow} ${crCell} | ${brCell} | ${rubricDeltaCell} |`); } return lines.join("\n"); } diff --git a/src/benchmark/metrics.ts b/src/benchmark/metrics.ts index c7946ac..03b6401 100644 --- a/src/benchmark/metrics.ts +++ b/src/benchmark/metrics.ts @@ -49,6 +49,10 @@ export function computeMetrics( consensusBeatsBaselineConfidenceRate: 0, runsCounted, runsAttempted, + consensusRubricNormalizedMean: undefined, + baselineRubricNormalizedMean: undefined, + consensusBeatsBaselineRubricRate: undefined, + rubricRunsCounted: 0, }; } @@ -99,6 +103,36 @@ export function computeMetrics( const beatHits = counted.filter((r) => r.consensus.finalScore > r.baseline.confidence).length; const consensusBeatsBaselineConfidenceRate = beatHits / runsCounted; + // Rubric metrics — held-out evaluator quality scores. 
Only runs where + // BOTH sides produced a successful rubric eval contribute to the rate; + // each side's mean counts its own successful evals independently. + const consensusRubricScores = counted + .map((r) => r.consensus.rubric) + .filter((rb): rb is NonNullable => rb !== undefined && rb.errorMessage === undefined) + .map((rb) => rb.normalized); + const baselineRubricScores = counted + .map((r) => r.baseline.rubric) + .filter((rb): rb is NonNullable => rb !== undefined && rb.errorMessage === undefined) + .map((rb) => rb.normalized); + const consensusRubricNormalizedMean = + consensusRubricScores.length > 0 ? mean(consensusRubricScores) : undefined; + const baselineRubricNormalizedMean = + baselineRubricScores.length > 0 ? mean(baselineRubricScores) : undefined; + + const rubricPaired = counted.filter( + (r) => + r.consensus.rubric !== undefined && + r.consensus.rubric.errorMessage === undefined && + r.baseline.rubric !== undefined && + r.baseline.rubric.errorMessage === undefined, + ); + const rubricRunsCounted = rubricPaired.length; + const consensusBeatsBaselineRubricRate = + rubricRunsCounted > 0 + ? 
rubricPaired.filter((r) => r.consensus.rubric!.normalized > r.baseline.rubric!.normalized) + .length / rubricRunsCounted + : undefined; + return { agreementRate, agreementStddevThreshold: threshold, @@ -113,6 +147,10 @@ export function computeMetrics( consensusBeatsBaselineConfidenceRate, runsCounted, runsAttempted, + consensusRubricNormalizedMean, + baselineRubricNormalizedMean, + consensusBeatsBaselineRubricRate, + rubricRunsCounted, }; } @@ -146,6 +184,19 @@ export function buildQualitativeNotes(runs: readonly BenchRun[]): string[] { if (r.consensus.judgeConfidence !== undefined) { tags.push(`judge confidence ${r.consensus.judgeConfidence}`); } + if ( + r.consensus.rubric && + r.consensus.rubric.errorMessage === undefined && + r.baseline.rubric && + r.baseline.rubric.errorMessage === undefined + ) { + const cn = r.consensus.rubric.normalized; + const bn = r.baseline.rubric.normalized; + const delta = cn - bn; + tags.push(`rubric C=${cn} B=${bn} Δ=${delta >= 0 ? "+" : ""}${delta}`); + } else if (r.consensus.rubric?.errorMessage || r.baseline.rubric?.errorMessage) { + tags.push("rubric eval failed"); + } if (r.baseline.errorMessage) { tags.push(`baseline errored (${r.baseline.errorMessage})`); } diff --git a/src/benchmark/rubric.ts b/src/benchmark/rubric.ts new file mode 100644 index 0000000..52c055f --- /dev/null +++ b/src/benchmark/rubric.ts @@ -0,0 +1,318 @@ +// ───────────────────────────────────────────────────────────── +// Rubric evaluator — held-out LLM-as-judge quality scoring +// ───────────────────────────────────────────────────────────── +// Scores a single output against a panel-declared `RubricCriterion[]` by +// asking a held-out model (neither the judge nor the panel) to rate each +// criterion 0..maxPoints with a brief justification. +// +// The bench runner invokes this twice per run — once for the consensus +// output, once for the baseline output — and reports both alongside the +// existing self-reported confidence numbers. 
That gives a quality signal +// that is independent of either side's self-assessment. +// +// Kept deliberately thin: one model call, structured-output parsing with +// a tolerant JSON extractor, sentinel `errorMessage` on failure rather +// than thrown exceptions (a rubric eval failure shouldn't abort a bench +// suite, matching the contract used by baseline.ts). + +import { z } from "zod"; +import type { ModelCaller, TokenUsage } from "ai-consensus-core"; +import type { RubricCriterion } from "../presets/types.js"; + +/** Score the evaluator emitted for one criterion. */ +export interface RubricCriterionScore { + criterionId: string; + /** 0..maxPoints (clamped). */ + score: number; + /** Evaluator's one-to-two-sentence justification. */ + justification: string; +} + +/** Full rubric evaluation of a single output. */ +export interface RubricEvaluation { + evaluatorModelId: string; + criteria: readonly RubricCriterionScore[]; + /** Sum of scores across criteria. */ + total: number; + /** Sum of maxPoints across criteria. */ + maxTotal: number; + /** + * Integer 0..100, `Math.round((total / maxTotal) * 100)`. Mirrors the + * 0..100 scale of consensus score and baseline confidence so the report + * can put them side by side. + */ + normalized: number; + durationMs: number; + usage: TokenUsage | undefined; + /** Set when the evaluation failed; metrics filter on this. */ + errorMessage: string | undefined; +} + +export interface EvaluateOutputArgs { + caller: ModelCaller; + evaluatorModelId: string; + rubric: readonly RubricCriterion[]; + question: string; + /** The output to score (consensus synthesis or baseline answer). */ + output: string; + temperature?: number; + maxOutputTokens?: number; + signal?: AbortSignal; +} + +const DEFAULTS = { + temperature: 0.1, + maxOutputTokens: 2000, +} as const; + +/** Zod shape we expect the evaluator to emit as JSON. 
*/
+const EvaluatorResponseSchema = z.object({
+  scores: z
+    .array(
+      z.object({
+        criterion_id: z.string().min(1),
+        score: z.number(),
+        justification: z.string().min(1),
+      }),
+    )
+    .min(1),
+});
+type EvaluatorResponse = z.infer<typeof EvaluatorResponseSchema>;
+
+/**
+ * Build the evaluator's system prompt. The model is told it is judging
+ * answer quality against named criteria, NOT picking a winner, NOT
+ * comparing to anything else. The output is constrained to a single JSON
+ * object with one entry per criterion.
+ */
+export function buildRubricSystemPrompt(rubric: readonly RubricCriterion[]): string {
+  const criteriaLines = rubric.map((c) => `- "${c.id}" (0-${c.maxPoints}): ${c.description}`);
+  return [
+    "You are a quality evaluator. You are NOT writing an answer to the question — your only job is to score the given answer against the listed criteria.",
+    "",
+    "Score each criterion on the 0..maxPoints scale named in its label. Use the full range: 0 means the answer ignores the criterion entirely; max means the answer fully meets it. Mid-range scores are appropriate for partial coverage.",
+    "",
+    "Criteria:",
+    ...criteriaLines,
+    "",
+    "Return a SINGLE JSON object with exactly this shape — no prose before or after, no markdown fences:",
+    "",
+    '{"scores":[{"criterion_id":"","score":,"justification":""}, ...]}',
+    "",
+    "Include exactly one entry per criterion, in the order listed above. The justification must cite something specific from the answer (a phrase, a missing element, a vague vs measurable claim) — not generic praise or criticism.",
+  ].join("\n");
+}
+
+/**
+ * Build the user message: the question being answered + the answer being
+ * evaluated. The answer is fenced so the evaluator can't confuse it with
+ * its own output.
+ */
+export function buildRubricUserPrompt(args: { question: string; output: string }): string {
+  return [
+    "QUESTION (what the answer is trying to address):",
+    args.question,
+    "",
+    "ANSWER TO EVALUATE (between the fences):",
+    "<<>>",
+    args.output,
+    "<<>>",
+    "",
+    "Score the answer against the criteria. Return only the JSON object.",
+  ].join("\n");
+}
+
+/**
+ * Evaluate one output against a rubric. Returns a `RubricEvaluation` with
+ * `errorMessage` set on any failure (caller failure, malformed JSON,
+ * mismatched criterion ids, schema violation). Never throws — bench
+ * suites must complete even if a single eval fails.
+ */
+export async function evaluateOutput(args: EvaluateOutputArgs): Promise<RubricEvaluation> {
+  const {
+    caller,
+    evaluatorModelId,
+    rubric,
+    question,
+    output,
+    temperature = DEFAULTS.temperature,
+    maxOutputTokens = DEFAULTS.maxOutputTokens,
+    signal,
+  } = args;
+
+  const maxTotal = rubric.reduce((acc, c) => acc + c.maxPoints, 0);
+  const startedAt = Date.now();
+  let usage: TokenUsage | undefined;
+  let errorMessage: string | undefined;
+  let parsed: EvaluatorResponse | undefined;
+
+  try {
+    const response = await caller({
+      participantId: "rubric-evaluator",
+      modelId: evaluatorModelId,
+      round: 1,
+      phase: "initial-analysis",
+      system: buildRubricSystemPrompt(rubric),
+      user: buildRubricUserPrompt({ question, output }),
+      temperature,
+      maxOutputTokens,
+      ...(signal ? { signal } : {}),
+    });
+    usage = response.usage;
+
+    const json = extractJsonObject(response.content);
+    if (json === undefined) {
+      throw new Error("evaluator did not emit a parseable JSON object");
+    }
+    const validation = EvaluatorResponseSchema.safeParse(json);
+    if (!validation.success) {
+      throw new Error(
+        `evaluator JSON failed schema: ${validation.error.issues[0]?.message ?? "unknown"}`,
+      );
+    }
+    parsed = validation.data;
+  } catch (err) {
+    errorMessage = err instanceof Error ? err.message : String(err);
+  }
+
+  const completedAt = Date.now();
+  const criteria: RubricCriterionScore[] = [];
+  let total = 0;
+
+  if (parsed) {
+    const byId = new Map(parsed.scores.map((s) => [s.criterion_id, s]));
+    for (const criterion of rubric) {
+      const got = byId.get(criterion.id);
+      if (!got) {
+        // Missing criterion → treat as 0 and record in justification.
+        criteria.push({
+          criterionId: criterion.id,
+          score: 0,
+          justification: "(evaluator omitted this criterion)",
+        });
+        continue;
+      }
+      const clamped = clamp(got.score, 0, criterion.maxPoints);
+      criteria.push({
+        criterionId: criterion.id,
+        score: clamped,
+        justification: got.justification.trim(),
+      });
+      total += clamped;
+    }
+  }
+
+  const normalized = maxTotal > 0 ? Math.round((total / maxTotal) * 100) : 0;
+
+  return {
+    evaluatorModelId,
+    criteria,
+    total,
+    maxTotal,
+    normalized,
+    durationMs: completedAt - startedAt,
+    usage,
+    errorMessage,
+  };
+}
+
+// ── Internals ────────────────────────────────────────────────
+
+/**
+ * Tolerant JSON object extractor. Models often wrap JSON in prose or in
+ * markdown fences (```json …```). We try, in order:
+ *
+ *   1. `JSON.parse(content.trim())` — happy path when the model complied.
+ *   2. The contents of the first ```…``` fenced block.
+ *   3. The substring from the first `{` to the matched closing `}`.
+ *
+ * Returns `undefined` if no valid JSON object can be extracted. Linear-
+ * time bracket scan; no regex backtracking (matches the parser hardening
+ * convention used in ai-consensus-core).
+ */
+export function extractJsonObject(content: string): unknown {
+  const trimmed = content.trim();
+
+  const direct = tryParse(trimmed);
+  if (direct !== undefined) return direct;
+
+  const fenced = extractFencedBlock(trimmed);
+  if (fenced !== undefined) {
+    const parsed = tryParse(fenced);
+    if (parsed !== undefined) return parsed;
+  }
+
+  const braced = extractFirstBracedObject(trimmed);
+  if (braced !== undefined) {
+    const parsed = tryParse(braced);
+    if (parsed !== undefined) return parsed;
+  }
+
+  return undefined;
+}
+
+function tryParse(s: string): unknown {
+  if (s.length === 0) return undefined;
+  try {
+    const v: unknown = JSON.parse(s);
+    // The evaluator contract is "a single JSON object" — bare arrays and
+    // primitives are rejected here so callers don't have to redo the check.
+    return typeof v === "object" && v !== null && !Array.isArray(v) ? v : undefined;
+  } catch {
+    return undefined;
+  }
+}
+
+function extractFencedBlock(s: string): string | undefined {
+  const fenceOpen = s.indexOf("```");
+  if (fenceOpen === -1) return undefined;
+  const afterOpenLine = s.indexOf("\n", fenceOpen);
+  if (afterOpenLine === -1) return undefined;
+  const fenceClose = s.indexOf("```", afterOpenLine);
+  if (fenceClose === -1) return undefined;
+  return s.slice(afterOpenLine + 1, fenceClose).trim();
+}
+
+/**
+ * Walk the string from the first `{` forward, tracking brace depth while
+ * respecting strings + escapes, and return the substring covering the
+ * matching `}`. Linear time.
+ */
+function extractFirstBracedObject(s: string): string | undefined {
+  const start = s.indexOf("{");
+  if (start === -1) return undefined;
+  let depth = 0;
+  let inString = false;
+  let escape = false;
+  for (let i = start; i < s.length; i++) {
+    const ch = s[i]!;
+    if (inString) {
+      if (escape) {
+        escape = false;
+      } else if (ch === "\\") {
+        escape = true;
+      } else if (ch === '"') {
+        inString = false;
+      }
+      continue;
+    }
+    if (ch === '"') {
+      inString = true;
+      continue;
+    }
+    if (ch === "{") {
+      depth++;
+    } else if (ch === "}") {
+      depth--;
+      if (depth === 0) {
+        return s.slice(start, i + 1);
+      }
+    }
+  }
+  return undefined;
+}
+
+function clamp(n: number, lo: number, hi: number): number {
+  if (!Number.isFinite(n)) return lo;
+  return Math.min(hi, Math.max(lo, n));
+}
diff --git a/src/benchmark/runner.ts b/src/benchmark/runner.ts
index 06b23bf..13c3a81 100644
--- a/src/benchmark/runner.ts
+++ b/src/benchmark/runner.ts
@@ -20,6 +20,7 @@ import {
 import type { Preset } from "../presets/types.js";
 import { runBaseline } from "./baseline.js";
 import { computeMetrics, buildQualitativeNotes } from "./metrics.js";
+import { evaluateOutput, type RubricEvaluation } from "./rubric.js";
 import {
   deriveRandomSeed,
   type BenchCase,
@@ -73,6 +74,12 @@ export interface RunSuiteArgs {
   signal?: AbortSignal;
   /** Optional name for the case-file or suite — appears in the report header. */
   caseFileName?: string;
+  /**
+   * Optional held-out evaluator model id for rubric scoring. When provided
+   * AND the panel declares a `rubric`, the runner scores both consensus
+   * and baseline outputs against that rubric.
+ */ + evaluatorModelId?: string; } const MAX_RUNS = 32; @@ -97,6 +104,7 @@ export async function runSuite(args: RunSuiteArgs): Promise { onProgress, signal, caseFileName, + evaluatorModelId, } = args; if (cases.length === 0) { @@ -148,6 +156,7 @@ export async function runSuite(args: RunSuiteArgs): Promise { caseIndex, totalCases: cases.length, totalRuns, + evaluatorModelId, }); collected.push(run); } @@ -205,6 +214,7 @@ interface ExecuteOneRunArgs { caseIndex: number; totalCases: number; totalRuns: number; + evaluatorModelId: string | undefined; } async function executeOneRun(args: ExecuteOneRunArgs): Promise { @@ -222,6 +232,7 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise { caseIndex, totalCases, totalRuns, + evaluatorModelId, } = args; let consensus: ConsensusOutcome | undefined; @@ -257,7 +268,7 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise { }); // 2) Baseline run (always — even if consensus errored, baseline data is useful). - const baseline = await runBaseline({ + let baseline = await runBaseline({ caller, modelId: baselineModelId, question: benchCase.question, @@ -275,6 +286,57 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise { }`, }); + // 3) Optional rubric eval. Only runs when the panel declares a rubric AND + // the bench was invoked with a held-out evaluator model. Failures here are + // captured into the per-side `RubricEvaluation.errorMessage` rather than + // bubbling — a rubric failure is data quality, not a suite-fatal error. + const rubric = panel.rubric; + if (rubric && rubric.length > 0 && evaluatorModelId) { + const consensusOutputText = consensus?.result.synthesis?.content; + if (consensusOutputText && consensusOutputText.trim().length > 0) { + const consensusRubric = await evaluateOutput({ + caller, + evaluatorModelId, + rubric, + question: benchCase.question, + output: consensusOutputText, + ...(signal ? 
{ signal } : {}), + }); + if (consensus) { + consensus = { ...consensus, rubric: consensusRubric }; + } + onProgress?.({ + kind: "consensus-complete", + caseIndex, + runIndex, + caseId: benchCase.id, + totalCases, + totalRuns, + message: ` consensus rubric ${runIndex + 1}: ${formatRubricProgress(consensusRubric)}`, + }); + } + if (!baseline.errorMessage && baseline.content.trim().length > 0) { + const baselineRubric = await evaluateOutput({ + caller, + evaluatorModelId, + rubric, + question: benchCase.question, + output: baseline.content, + ...(signal ? { signal } : {}), + }); + baseline = { ...baseline, rubric: baselineRubric }; + onProgress?.({ + kind: "baseline-complete", + caseIndex, + runIndex, + caseId: benchCase.id, + totalCases, + totalRuns, + message: ` baseline rubric ${runIndex + 1}: ${formatRubricProgress(baselineRubric)}`, + }); + } + } + const failed = consensus === undefined || baseline.errorMessage !== undefined; const errorMessage = consensusError ? `consensus: ${consensusError}` @@ -294,6 +356,11 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise { }; } +function formatRubricProgress(r: RubricEvaluation): string { + if (r.errorMessage) return `ERRORED (${r.errorMessage})`; + return `score=${r.normalized}/100 (${r.total}/${r.maxTotal})`; +} + // ── Helpers ─────────────────────────────────────────────────── function summariseConsensus(result: ConsensusResult, durationMs: number): ConsensusOutcome { @@ -333,6 +400,7 @@ function summariseConsensus(result: ConsensusResult, durationMs: number): Consen judgeConfidence: result.synthesis?.judgeConfidence, durationMs, totalUsage, + rubric: undefined, }; } @@ -366,6 +434,7 @@ function placeholderConsensus(): ConsensusOutcome { judgeConfidence: undefined, durationMs: 0, totalUsage: undefined, + rubric: undefined, }; } diff --git a/src/benchmark/types.ts b/src/benchmark/types.ts index e8b2b5b..b3e6b9c 100644 --- a/src/benchmark/types.ts +++ b/src/benchmark/types.ts @@ -15,6 +15,7 @@ import { z 
} from "zod"; import type { ConsensusResult, TokenUsage } from "ai-consensus-core"; +import type { RubricEvaluation } from "./rubric.js"; // ── BenchCase (input) ──────────────────────────────────────── @@ -93,6 +94,13 @@ export interface ConsensusOutcome { judgeConfidence: number | undefined; durationMs: number; totalUsage: TokenUsage | undefined; + /** + * Held-out rubric evaluation of the consensus synthesis. Populated only + * when the panel declares a `rubric` AND the bench was invoked with an + * evaluator model. `undefined` means "not evaluated"; an evaluation with + * `errorMessage` set means "tried and failed". + */ + rubric: RubricEvaluation | undefined; } export interface BaselineOutcome { @@ -104,6 +112,8 @@ export interface BaselineOutcome { durationMs: number; usage: TokenUsage | undefined; errorMessage: string | undefined; + /** Held-out rubric evaluation; same activation rule as ConsensusOutcome.rubric. */ + rubric: RubricEvaluation | undefined; } // ── BenchReport (suite-level aggregation) ──────────────────── @@ -149,6 +159,24 @@ export interface BenchMetrics { runsCounted: number; /** Total runs attempted. */ runsAttempted: number; + /** + * Mean of rubric-normalized scores for the consensus side, across runs + * where the held-out evaluator succeeded. `undefined` when no run had a + * successful consensus rubric eval (panel has no rubric, or eval was + * never invoked, or every eval failed). + */ + consensusRubricNormalizedMean: number | undefined; + /** Mean of rubric-normalized scores for the baseline side; same contract. */ + baselineRubricNormalizedMean: number | undefined; + /** + * Fraction of runs where the consensus rubric score was strictly greater + * than the baseline rubric score. Independent of self-reported confidence — + * this is the held-out-judge view of which side answered better. `undefined` + * when fewer than 1 run had successful evals on BOTH sides. 
+ */ + consensusBeatsBaselineRubricRate: number | undefined; + /** Count of runs where both rubric evaluations succeeded — denominator for the rate above. */ + rubricRunsCounted: number; } export interface BenchReport { diff --git a/src/cli/bench.ts b/src/cli/bench.ts index 8b7ccfd..831dacf 100644 --- a/src/cli/bench.ts +++ b/src/cli/bench.ts @@ -36,6 +36,8 @@ export interface BenchArgs { outputPath: string | undefined; baselineModelId: string | undefined; baselineProviderId: string | undefined; + evaluatorModelId: string | undefined; + evaluatorProviderId: string | undefined; filterTag: string | undefined; includeFullResults: boolean; listPanels: boolean; @@ -89,6 +91,17 @@ Optional: Defaults to the judge model from config. --baseline-provider Provider id for the baseline model. Defaults to the judge provider from config. + --evaluator-model Model id for a held-out rubric evaluator. When + set AND the panel declares a rubric, the bench + scores both consensus and baseline outputs + against that rubric. SHOULD differ from both + the judge model and the baseline model — the + evaluator grades both sides blind, and using + the same brain for grading and producing one + side biases the result. The CLI warns when + this contract is violated but does not block. + --evaluator-provider Provider id for the evaluator model. Required + when --evaluator-model is set. --filter-tag Only run cases that have this tag. --output Also write the JSON report to this path. 
--include-full-results Keep the full ConsensusResult objects in the @@ -119,6 +132,11 @@ Examples: ai-consensus-mcp bench -c ./consensus.config.json -p security_redteam \\ --filter-tag injection --output sec.json + # Held-out rubric eval — consensus + baseline scored by a third model + ai-consensus-mcp bench -p architecture_v2 --runs 3 --seed 42 \\ + --evaluator-model claude-opus-4-5 --evaluator-provider anthropic \\ + --output bench-rubric.json + # Discover panels and their tags ai-consensus-mcp bench --list-panels @@ -138,6 +156,8 @@ export function parseBenchArgs(argv: readonly string[]): BenchArgs | Error { outputPath: undefined, baselineModelId: undefined, baselineProviderId: undefined, + evaluatorModelId: undefined, + evaluatorProviderId: undefined, filterTag: undefined, includeFullResults: false, listPanels: false, @@ -208,6 +228,14 @@ export function parseBenchArgs(argv: readonly string[]): BenchArgs | Error { const v = next(); if (v instanceof Error) return v; out.baselineProviderId = v; + } else if (arg === "--evaluator-model") { + const v = next(); + if (v instanceof Error) return v; + out.evaluatorModelId = v; + } else if (arg === "--evaluator-provider") { + const v = next(); + if (v instanceof Error) return v; + out.evaluatorProviderId = v; } else if (arg === "--filter-tag") { const v = next(); if (v instanceof Error) return v; @@ -314,8 +342,39 @@ export async function runBench(argv: readonly string[]): Promise { return 2; } + // Evaluator routing — required if --evaluator-model is set. Validated + // BEFORE we start running so we fail fast on a typo'd provider id rather + // than dozens of provider calls in. 
+  let evaluatorModelId: string | undefined;
+  let evaluatorProviderId: string | undefined;
+  if (parsed.evaluatorModelId || parsed.evaluatorProviderId) {
+    if (!parsed.evaluatorModelId || !parsed.evaluatorProviderId) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: --evaluator-model and --evaluator-provider must be passed together.\n`,
+      );
+      return 2;
+    }
+    if (!config.providers[parsed.evaluatorProviderId]) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: evaluator provider "${parsed.evaluatorProviderId}" is not in your config (available: ${Object.keys(
+          config.providers,
+        ).join(", ")}).\n`,
+      );
+      return 2;
+    }
+    evaluatorModelId = parsed.evaluatorModelId;
+    evaluatorProviderId = parsed.evaluatorProviderId;
+    if (!panel.rubric || panel.rubric.length === 0) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: panel "${panel.id}" declares no rubric; --evaluator-model has nothing to score. Ignoring.\n`,
+      );
+      evaluatorModelId = undefined;
+      evaluatorProviderId = undefined;
+    }
+  }
+
   // Compose the per-call routing: panel participants → their providers,
-  // plus the synthetic "baseline" and "judge" ids → judge provider.
+  // plus the synthetic "baseline", "judge", and "rubric-evaluator" ids.
   const providerByParticipant: Record<string, string> = {
     ...resolved.providerByParticipant,
     baseline: baselineProviderId,
@@ -323,11 +382,37 @@ export async function runBench(argv: readonly string[]): Promise<number> {
   if (config.judge) {
     providerByParticipant["judge"] = config.judge.providerId;
   }
+  if (evaluatorProviderId) {
+    providerByParticipant["rubric-evaluator"] = evaluatorProviderId;
+  }
   const caller: ModelCaller = createOpenAICompatibleCaller({
     providers: config.providers,
     providerByParticipant,
   });

+  // Held-out contract warnings — the bench will still run, but a reviewer
+  // reading the report needs to see "this comparison wasn't blind."
+  if (evaluatorModelId) {
+    if (evaluatorModelId === baselineModelId) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: ⚠ evaluator model == baseline model (${evaluatorModelId}). The evaluator is grading its own output. Results on the baseline side are NOT independent.\n`,
+      );
+    }
+    if (evaluatorModelId === config.judge?.modelId) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: ⚠ evaluator model == judge model (${evaluatorModelId}). The evaluator is grading text synthesised by the same brain that produced the consensus output — eval is not held-out.\n`,
+      );
+    }
+  }
+  if (
+    baselineModelId === config.judge?.modelId &&
+    (!evaluatorModelId || evaluatorModelId === baselineModelId)
+  ) {
+    process.stderr.write(
+      `${SERVER_NAME} bench: ⚠ baseline and judge are the same model (${baselineModelId}). Consensus and baseline both flow through this brain; "consensus vs baseline" is a self-comparison artifact.\n`,
+    );
+  }
+
   // Load cases.
   let cases: BenchCase[];
   let caseFileName: string | undefined;
@@ -388,14 +473,22 @@ export async function runBench(argv: readonly string[]): Promise {
   }
   const baseSeed = parsed.baseSeed ?? (parsed.quick ? QUICK_DEFAULT_SEED : Date.now());
-  const totalCalls = cases.length * parsed.runs * (resolved.participants.length + 1);
+  const perRunRubricCalls = evaluatorModelId ? 2 : 0;
+  const totalCalls =
+    cases.length * parsed.runs * (resolved.participants.length + 1 + perRunRubricCalls);
   if (!parsed.quiet) {
+    const evalSuffix = evaluatorModelId
+      ? `, evaluator=${evaluatorModelId} (${evaluatorProviderId})`
+      : "";
     process.stderr.write(
-      `${SERVER_NAME} bench: panel="${panel.id}", cases=${cases.length}, runs=${parsed.runs}, baseline=${baselineModelId} (${baselineProviderId}), seed=${baseSeed}${parsed.quick ? " (quick mode)" : ""}\n`,
+      `${SERVER_NAME} bench: panel="${panel.id}", cases=${cases.length}, runs=${parsed.runs}, baseline=${baselineModelId} (${baselineProviderId})${evalSuffix}, seed=${baseSeed}${parsed.quick ? " (quick mode)" : ""}\n`,
     );
+    const explainer = evaluatorModelId
+      ? "cases × runs × (panel + baseline + 2 rubric evals)"
+      : "cases × runs × (panel + baseline)";
     process.stderr.write(
-      `${SERVER_NAME} bench: expected ≈${totalCalls} provider calls (cases × runs × (panel + baseline)).\n`,
+      `${SERVER_NAME} bench: expected ≈${totalCalls} provider calls (${explainer}).\n`,
     );
   }
@@ -427,6 +520,7 @@ export async function runBench(argv: readonly string[]): Promise {
     ...(progressHandler ? { onProgress: progressHandler } : {}),
     signal: ac.signal,
     ...(caseFileName ? { caseFileName } : {}),
+    ...(evaluatorModelId ? { evaluatorModelId } : {}),
   });
   const md = formatReportMarkdown(report);
diff --git a/src/presets/definitions/architecture-v2.ts b/src/presets/definitions/architecture-v2.ts
index e9b96ea..d1fcc59 100644
--- a/src/presets/definitions/architecture-v2.ts
+++ b/src/presets/definitions/architecture-v2.ts
@@ -120,6 +120,38 @@ export const ARCHITECTURE_V2_PRESET: Preset = {
     "",
     "Do not hedge by recommending two options. Pick one. State your confidence.",
   ].join("\n"),
+  rubric: [
+    {
+      id: "quantification",
+      description:
+        "Does the answer cite load-bearing constraints with units (ms, $/month, headcount, GB/day, QPS, percentiles), or explicitly name an unstated constraint with a proposed value? A 5/5 answer reads like an engineer with a spreadsheet; a 0/5 reads like a vibes essay.",
+      maxPoints: 5,
+    },
+    {
+      id: "single-recommendation",
+      description:
+        "Does the answer commit to a single architecture choice with a dominant reason, rather than hedging between two? A 5/5 answer names the recommended option in one sentence and names the next-best alternative only as the runner-up; a 0/5 presents a balanced menu and refuses to choose.",
+      maxPoints: 5,
+    },
+    {
+      id: "reversibility",
+      description:
+        "Does the answer explicitly weigh reversibility / switching cost — the cost of being wrong about this decision? A 5/5 answer treats reversibility as a first-class column with at least a low/medium/high rating per option and a switching cost estimate; a 0/5 ignores reversibility entirely.",
+      maxPoints: 5,
+    },
+    {
+      id: "tripwire-specificity",
+      description:
+        "Are the conditions that would flip the recommendation named as measurable signals with thresholds (e.g. 'write QPS sustains >5k for 24h', 'P99 latency exceeds 200ms for 1h'), not vague conditions ('if scale grows', 'if reliability becomes a concern')? A 5/5 answer has tripwires you could literally write a Prometheus alert against; a 0/5 has hand-waving.",
+      maxPoints: 5,
+    },
+    {
+      id: "failure-mode-realism",
+      description:
+        "Are failure modes named with concrete trigger conditions, blast radius, and detection latency — not generic risks? A 5/5 answer names specific failure modes a senior on-call engineer would recognise from incidents they've actually worked; a 0/5 lists abstract risks ('complexity', 'scaling issues') with no shape.",
+      maxPoints: 5,
+    },
+  ],
   meta: {
     version: "2.0.0",
     rationale: [
diff --git a/src/presets/types.ts b/src/presets/types.ts
index 2ef00b5..3e8ffe1 100644
--- a/src/presets/types.ts
+++ b/src/presets/types.ts
@@ -113,6 +113,29 @@ export interface PanelOutputShape {
   tags?: readonly string[];
 }
+/**
+ * One scoring dimension for a panel-declared quality rubric. Used by the
+ * benchmark's held-out LLM-as-judge evaluator to score consensus and
+ * baseline outputs against the same rubric, blind to which side produced
+ * which output.
+ *
+ * The rubric measures *answer quality* against named criteria — distinct
+ * from self-reported confidence (which is a meta-claim about the model's
+ * own certainty, not about the answer's substance).
+ */
+export interface RubricCriterion {
+  /** Stable kebab-case id; surfaces in JSON reports for downstream tools. */
+  id: string;
+  /**
+   * What a max-point answer looks like for this criterion. The evaluator
+   * sees this verbatim — write it as a directive for the model, not as
+   * end-user prose.
+   */
+  description: string;
+  /** Upper bound of the score for this criterion (typically 5). */
+  maxPoints: number;
+}
+
 /**
  * Optional metadata declaring a panel's purpose and output shape. v2+
  * panels populate this fully so MCP clients can introspect what a panel
@@ -165,6 +188,13 @@ export interface Preset {
   defaults: PresetDefaults;
   /** Optional task-specific judge system prompt. */
   judgeSystemPrompt?: string;
+  /**
+   * Optional quality rubric. When set, the bench can score consensus and
+   * baseline outputs against these criteria using a held-out evaluator
+   * model — measuring answer quality against named contracts rather than
+   * self-reported confidence.
+   */
+  rubric?: readonly RubricCriterion[];
   /** Phase 3 surface — empty/unused in Phase 1. */
   toolBindings?: readonly ToolBinding[];
   /**