From 1e7dc630d91b23b7acbbbeb14ba2336792e437af Mon Sep 17 00:00:00 2001
From: Marcelo Ceccon <marcelo@ceccon.org>
Date: Tue, 26 May 2026 11:47:26 +0000
Subject: [PATCH] feat(bench): held-out rubric evaluator + dep bump to
 ai-consensus-core 0.11.1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The bench was measuring the wrong thing. Self-reported confidence is a
meta-signal, not a quality signal — and on every panel in this repo it
was further corrupted by the upstream `extractJudgeConfidence` silent
50 default (fixed in ai-consensus-core 0.11.1). The "consensus loses
11/12, costs 40× tokens for nothing" verdict the old bench produced
was a measurement artifact, not a result about the panel.

This change adds a held-out LLM-as-judge rubric evaluator. Pass
`--evaluator-model` + `--evaluator-provider`; when the panel declares
a rubric the bench scores both the consensus synthesis and the
single-model baseline against that rubric using a third model that
is on neither side of the comparison. Rubrics measure answer quality
against named criteria the panel commits to (for architecture_v2:
quantification, single-recommendation, reversibility-weighing,
tripwire-specificity, failure-mode-realism).
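
As a rough illustration of how per-criterion scores roll up into the
0-100 numbers reported below, here is a minimal sketch. It assumes what
the new unit tests assert: scores are clamped to [0, maxPoints], a
criterion the evaluator omits counts as 0, and the normalized score is
total / maxTotal × 100. `Criterion`, `RawScore`, and `scoreRubric` are
hypothetical names for illustration, not the shapes used in rubric.ts.

```typescript
// Hypothetical shapes; the real types in src/presets/types.ts and
// src/benchmark/types.ts may differ.
interface Criterion {
  id: string;
  maxPoints: number;
}

interface RawScore {
  criterionId: string;
  score: number;
}

function scoreRubric(rubric: readonly Criterion[], raw: readonly RawScore[]) {
  const byId = new Map<string, number>();
  for (const r of raw) byId.set(r.criterionId, r.score);

  let total = 0;
  let maxTotal = 0;
  for (const c of rubric) {
    const reported = byId.get(c.id) ?? 0; // omitted criterion counts as 0
    total += Math.min(c.maxPoints, Math.max(0, reported)); // clamp into range
    maxTotal += c.maxPoints;
  }
  // Multiply before dividing so integer inputs give an exact result.
  const normalized = maxTotal === 0 ? 0 : (total * 100) / maxTotal;
  return { total, maxTotal, normalized };
}

// Three criteria worth 5 points each, scored 4, 5 and 3: 12 of 15 points.
const example = scoreRubric(
  [
    { id: "quantification", maxPoints: 5 },
    { id: "single-recommendation", maxPoints: 5 },
    { id: "reversibility", maxPoints: 5 },
  ],
  [
    { criterionId: "quantification", score: 4 },
    { criterionId: "single-recommendation", score: 5 },
    { criterionId: "reversibility", score: 3 },
  ],
);
console.log(example.normalized); // 80
```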

On architecture_v2 (4 cases × 3 runs × seed 42), with grok-4.3 as
both judge and baseline and claude-opus-4-5 as the held-out evaluator:

  - Self-reported (consensus score vs. baseline confidence):
      consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses")
  - Held-out rubric:
      consensus 83.3 vs. baseline 48.0 → Δ +35.3, 12/12 runs (100%)

Same 12 runs, opposite verdicts. The rubric measures quality directly;
self-reported confidence does not track quality (judge confidence on
these 12 runs is μ=66.9, under-estimating actual rubric score by ~16
points).

Headline implementation surface:

- New module: src/benchmark/rubric.ts. Builds a structured JSON-emitting
  prompt for the evaluator, parses with a tolerant bracket-scanning JSON
  extractor (no regex backtracking), validates with zod, clamps scores
  into range, returns RubricEvaluation with errorMessage on any failure
  path. Never throws: a rubric eval failure is a data-quality issue, not
  a suite-fatal error (same contract baseline.ts uses).
- New Preset field: rubric?: readonly RubricCriterion[]. Panel-declared
  because criteria are domain-specific. architecture_v2 ships a 5-
  criterion rubric; adding rubrics to other panels is purely additive.
- New CLI flags: --evaluator-model, --evaluator-provider. Validated
  together (must be passed as a pair). CLI warns when evaluator coincides
  with baseline (self-grading) or with judge (same brain producing and
  grading the consensus output) — the held-out contract is the bench's
  only guarantee that the comparison isn't self-graded.
- New report metrics: consensusRubricNormalizedMean,
  baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate,
  rubricRunsCounted. Markdown report grows a "Held-out rubric" block
  and 3 per-case table columns (Rubric C, Rubric B, Δ rubric).
- Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the
  silent judge-confidence parser-contract bug — judge confidence now
  reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0.
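
For readers unfamiliar with the bracket-scanning approach named above,
the following is a minimal sketch of the idea, not the code in
rubric.ts. It walks the text once per candidate "{", tracks brace depth
while skipping over string literals, and hands the balanced slice to
JSON.parse. The name `extractJsonObjectSketch` is hypothetical, and edge
cases (for example an object nested inside a top-level array) may be
handled differently by the real extractor.

```typescript
// Sketch only: returns the first balanced {...} span that parses as a
// JSON object, or undefined when none is found.
function extractJsonObjectSketch(text: string): Record<string, unknown> | undefined {
  for (let start = text.indexOf("{"); start !== -1; start = text.indexOf("{", start + 1)) {
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        // Inside a string literal, braces do not affect depth.
        if (escaped) escaped = false;
        else if (ch === "\\") escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') {
        inString = true;
      } else if (ch === "{") {
        depth++;
      } else if (ch === "}" && --depth === 0) {
        try {
          const parsed: unknown = JSON.parse(text.slice(start, i + 1));
          if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
            return parsed as Record<string, unknown>;
          }
        } catch {
          // Balanced braces but not valid JSON; try the next "{".
        }
        break;
      }
    }
  }
  return undefined;
}
```

Because the scan is a single forward pass with an explicit depth
counter, it tolerates markdown fences and surrounding prose without
relying on a regular expression.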

Test surface:

- 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON,
  score clamping, missing-criterion handling, all three failure modes:
  caller throws, unparseable content, schema mismatch).
- 3 new runner integration tests (no-evaluator short-circuits, paired
  rubric metrics populated, eval failure captured into errorMessage
  without aborting the suite).
- 2 new format tests (rubric block + per-case columns render; ERR cells
  on failed evals).
- Existing 320 tests pass unchanged. Total: 341/341. Coverage:
  statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9% — all
  above thresholds.

Docs:

- New README section "Quality benchmark (held-out evaluator)" — headline
  finding, per-case Δ-rubric table, methodology, exact reproduction
  command, honest caveats (N=12, single panel, real cost).
- CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and
  what it changes for downstream callers.
- New gitignore patterns for bench-*.json and bench-*.log artifacts.
- Bench CLI help text gains an --evaluator-model example.

No breaking changes. Callers who don't pass --evaluator-model see the
existing report unchanged (with the side effect that judge confidence
now reports a real number rather than the silent 50).
---
 .gitignore                                 |   5 +
 CHANGELOG.md                               |  56 +++-
 README.md                                  | 107 ++++++-
 package.json                               |   2 +-
 src/benchmark/__tests__/format.test.ts     |  97 +++++++
 src/benchmark/__tests__/metrics.test.ts    |   2 +
 src/benchmark/__tests__/rubric.test.ts     | 276 ++++++++++++++++++
 src/benchmark/__tests__/runner.test.ts     |  95 ++++++
 src/benchmark/baseline.ts                  |   1 +
 src/benchmark/format.ts                    |  82 +++++-
 src/benchmark/metrics.ts                   |  51 ++++
 src/benchmark/rubric.ts                    | 318 +++++++++++++++++++++
 src/benchmark/runner.ts                    |  71 ++++-
 src/benchmark/types.ts                     |  28 ++
 src/cli/bench.ts                           | 102 ++++++-
 src/presets/definitions/architecture-v2.ts |  32 +++
 src/presets/types.ts                       |  30 ++
 17 files changed, 1330 insertions(+), 25 deletions(-)
 create mode 100644 src/benchmark/__tests__/rubric.test.ts
 create mode 100644 src/benchmark/rubric.ts

diff --git a/.gitignore b/.gitignore
index 1b7c6b2..bdad242 100644
--- a/.gitignore
+++ b/.gitignore
@@ -18,3 +18,8 @@ yarn-error.log*
 .idea
 progress.md
 package-lock.json
+
+# Bench run artifacts — local outputs from `ai-consensus-mcp bench --output`
+# and the progress logs that pair with them. Reproducible from the config.
+bench-*.json
+bench-*.log
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3b8d505..cd92507 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,7 +5,61 @@ Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), [SemVer](https
 
 ## [Unreleased]
 
-_None yet — see [0.12.0] below for the most recent release._
+### Added — held-out rubric evaluator for `bench`
+
+`bench` learned to score answer **quality** with a third, held-out model
+rather than relying on either side's self-reported confidence.
+
+- New CLI flags: `--evaluator-model <id>` and `--evaluator-provider <id>`.
+  When both are set AND the panel declares a `rubric`, the bench scores
+  both the consensus synthesis and the baseline output against the rubric
+  using the evaluator model, blind to which side produced which answer.
+- New preset field: `Preset.rubric?: readonly RubricCriterion[]`. The
+  rubric is panel-declared (because criteria are domain-specific);
+  `architecture_v2` ships a 5-criterion rubric (quantification,
+  single-recommendation, reversibility-weighing, tripwire-specificity,
+  failure-mode-realism). Adding a rubric to another panel is purely
+  additive — no engine change, no breaking change.
+- New bench module: `src/benchmark/rubric.ts`. Builds a structured
+  JSON-emitting prompt for the evaluator, parses with a tolerant
+  bracket-scanning JSON extractor (no regex backtracking), validates
+  with zod, clamps scores into range, and returns a `RubricEvaluation`
+  with `errorMessage` set on any failure path. Never throws: a rubric
+  eval failure is a data-quality issue, not a suite-fatal error.
+- New report metrics: `consensusRubricNormalizedMean`,
+  `baselineRubricNormalizedMean`, `consensusBeatsBaselineRubricRate`,
+  `rubricRunsCounted`. Surfaced in both the markdown report (new
+  "Held-out rubric" section + 3 new per-case table columns) and the
+  JSON report.
+- CLI sanity-checks the held-out contract: warns when the evaluator
+  model is the same as the baseline (self-grading) or the judge (same
+  brain producing and grading the consensus output).
+- 21 new tests: 16 cover the evaluator (happy path, fenced /
+  prose-wrapped JSON, score clamping, missing-criterion handling, three
+  failure modes: caller throws, unparseable content, schema mismatch);
+  5 cover the runner and report format. Existing 320 pass. Total: 341/341.
+
+### Changed — upstream parser-contract fix
+
+- Bumped `ai-consensus-core` from `^0.10.0` to `^0.11.1`. The 0.11.1
+  release fixes a silent contract bug where any caller that overrode
+  the default `JUDGE_PERSONA.systemPrompt` (every panel in this repo)
+  caused `extractJudgeConfidence` to fall through to its 50 default —
+  the bench reported judge confidence as μ=50.0, σ=0.0 across every
+  run. With 0.11.1, `buildJudgeSystemPrompt` idempotently appends the
+  `JUDGE_CONFIDENCE: [0-100]` directive, so the parser sees a real
+  value. Judge confidence on a representative 12-run bench now reports
+  μ=66.9, σ=5.2 — the first real distribution this repo has ever produced.
+- No code change in this repo was needed for the symptom to disappear;
+  the dep bump alone removes the artifact. No API change.
+
+### Documentation
+
+- New README section: **"Quality benchmark (held-out evaluator)"** —
+  headline finding (consensus wins 12/12 on `architecture_v2` against
+  a frontier baseline), per-case Δ-rubric table, methodology, exact
+  reproduction command, and honest caveats (cost, sample size,
+  single-panel scope).
 
 ## [0.12.0] — 2026-05-25
 
diff --git a/README.md b/README.md
index 8137928..1883dd8 100644
--- a/README.md
+++ b/README.md
@@ -66,11 +66,18 @@ Scope the run with `--hosts claude-code,cursor`. Run `npx ai-consensus-mcp insta
   `panel` argument), plus 5 v1 presets and 8 v2 expert panels. Invoke a
   panel; get a curated set of personas and tuned defaults without touching
   the knobs. Full catalogue in [`docs/expert-panels.md`](./docs/expert-panels.md).
-- **Benchmarking baked in.** `npx ai-consensus-mcp bench --panel <id>`
-  runs a panel against built-in or user-provided cases and produces a
-  human-readable + JSON uplift report — agreement rate, convergence
-  speed, judge confidence, duration/token cost ratios. Deterministic
-  with `--seed`.
+- **Benchmarking baked in, with held-out quality eval.**
+  `npx ai-consensus-mcp bench --panel <id>` runs a panel against built-in
+  or user-provided cases and produces a human-readable + JSON uplift
+  report — agreement rate, convergence speed, judge confidence,
+  duration/token cost ratios. Deterministic with `--seed`. Pass
+  `--evaluator-model` + `--evaluator-provider` and the bench scores both
+  the consensus synthesis and the baseline against the panel's declared
+  rubric using a third, held-out model — measuring answer quality
+  against named criteria, not self-reported confidence. See
+  [Quality benchmark](#quality-benchmark-held-out-evaluator) below for
+  the methodology and the headline result (consensus wins 12/12 runs
+  on `architecture_v2` against a frontier baseline).
 - **Persistent project memory (opt-in).** Enable with one config flag;
   every panel run is durably stored, project-scoped, with three recall
   tools — `consensus_recall`, `consensus_project_memory`,
@@ -83,6 +90,96 @@ Scope the run with `--hosts claude-code,cursor`. Run `npx ai-consensus-mcp insta
 - **Live progress.** Every structured engine event is forwarded as an MCP [progress notification](https://modelcontextprotocol.io/specification/2025-03-26/basic/utilities/progress) — hosts render real-time round/participant/disagreement/score status.
 - **Dependency-light.** `@modelcontextprotocol/sdk`, `zod`, `ai-consensus-core`. SSE parsing is native `fetch` — no provider SDKs.
 
+## Quality benchmark (held-out evaluator)
+
+`bench` ships with a held-out LLM-as-judge rubric evaluator. Pass
+`--evaluator-model` + `--evaluator-provider` and the bench scores both
+the consensus synthesis and the single-model baseline against the
+panel's declared rubric, using a third model that is on neither side. The
+rubric measures **answer quality** against named criteria — distinct
+from self-reported confidence, which is a meta-signal that does not
+track quality.
+
+### Headline finding (`architecture_v2`, 4 cases × 3 runs, seed=42)
+
+| Metric                                                  | Consensus | Baseline |         Δ |
+| ------------------------------------------------------- | --------: | -------: | --------: |
+| Self-reported (consensus score vs. baseline confidence) |      60.0 |     75.4 |     −15.4 |
+| Held-out rubric (judged by `claude-opus-4-5`, blind)    |  **83.3** | **48.0** | **+35.3** |
+
+**Consensus wins on the held-out rubric in 12 of 12 runs (100%).** On
+the same 12 runs, the self-reported confidence metric says consensus
+wins 1 of 12 (8%) — the two metrics invert. Without the rubric, the
+bench reports "consensus loses 11/12, costs 40× tokens for nothing."
+With it: "consensus dominates 12/12, +35-point quality lead,
+structural advantage on every case."
+
+### Per-case Δ rubric
+
+| Case                          | Runs (Δ rubric) |    Mean |
+| ----------------------------- | --------------- | ------: |
+| `arch-microservices-day-one`  | +36, +28, +32   | **+32** |
+| `arch-event-sourcing-billing` | +44, +52, +56   | **+51** |
+| `arch-sync-vs-async-fanout`   | +24, +40, +40   | **+35** |
+| `arch-db-multi-tenant`        | +12, +32, +28   | **+24** |
+
+Baseline scored 28/100 on every `event-sourcing-billing` run — a
+reproducible single-model blind spot (hand-wavy tripwires, missing
+reversibility weighing) that the panel surfaces every time.
+
+### Methodology
+
+- **Judge model:** `grok-4.3` (xai). Synthesises the consensus output
+  from the panel's final-round responses.
+- **Baseline model:** `grok-4.3` (xai). Same brain, single-shot answer,
+  no panel, no judge — this is what the panel is compared against.
+- **Evaluator model:** `claude-opus-4-5` (anthropic). **Held out** —
+  does not appear on either side of the comparison. Scores each answer
+  independently against the rubric, blind to which side produced it.
+- **Rubric:** 5 criteria for `architecture_v2`, each scored 0–5:
+  quantification, single-recommendation, reversibility-weighing,
+  tripwire-specificity, failure-mode-realism. Declared on the preset
+  (see [`src/presets/definitions/architecture-v2.ts`](./src/presets/definitions/architecture-v2.ts)).
+- **Determinism:** `--seed 42` controls round-order shuffling. Model
+  outputs at temperature > 0 are inherently stochastic, so each case
+  runs 3 times to average over the noise.
+
+### Reproducing
+
+```bash
+export GROK_API_KEY=...
+export CONSENSUS_ANTHROPIC_API_KEY=...
+ai-consensus-mcp bench -p architecture_v2 --runs 3 --seed 42 \
+  --evaluator-model claude-opus-4-5 --evaluator-provider anthropic \
+  --output bench-architecture_v2-rubric.json
+```
+
+Cost preview: ~72 provider calls (4 cases × 3 runs × (panel + baseline
+plus 2 rubric evals)). The CLI prints the exact estimate before
+spending.
+
+### Honest caveats
+
+- **N=12 is small.** The direction is unambiguous (100% inversion is
+  hard to fluke); the magnitude needs broader sampling.
+- **One panel.** Only `architecture_v2` declares a rubric in this
+  version — the same pattern applies to every other panel by adding a
+  `rubric` array to the preset definition.
+- **Cost is real.** 40× tokens, 20× wall time vs. one baseline call.
+  For high-stakes architecture decisions (the panel's named use case),
+  the cost is dwarfed by the cost of a wrong call. For low-stakes
+  routine choices, single-model is the right tool; that is guidance on
+  panel selection, not a panel failure.
+- **Self-reported confidence remains a poor quality estimator.** Even
+  with the upstream parser-contract fix (`ai-consensus-core@0.11.1`),
+  judge confidence on these 12 runs is μ=66.9, σ=5.2, which under-estimates
+  the actual held-out rubric score (μ=83.3) by ~16 points. Useful as a
+  humility signal, not as a quality estimator.
+
+The CLI warns when the evaluator model coincides with the baseline or
+the judge — the held-out contract is the bench's only guarantee that
+the comparison isn't self-graded.
+
 ## The protocol
 
 For the actual protocol — rounds, phases, prompts, scoring — see the [ai-consensus-core protocol diagram](https://github.com/entropyvortex/ai-consensus-core#protocol-diagram). This README covers the server surface only.
diff --git a/package.json b/package.json
index 542bf02..f253126 100644
--- a/package.json
+++ b/package.json
@@ -56,7 +56,7 @@
   "dependencies": {
     "@inquirer/prompts": "^8.4.2",
     "@modelcontextprotocol/sdk": "^1.13.0",
-    "ai-consensus-core": "^0.10.0",
+    "ai-consensus-core": "^0.11.1",
     "zod": "^3.24.1"
   },
   "devDependencies": {
diff --git a/src/benchmark/__tests__/format.test.ts b/src/benchmark/__tests__/format.test.ts
index 1da751e..54d1b7c 100644
--- a/src/benchmark/__tests__/format.test.ts
+++ b/src/benchmark/__tests__/format.test.ts
@@ -52,6 +52,7 @@ function makeMinimalReport(overrides: Partial<BenchReport> = {}): BenchReport {
       judgeConfidence: 80,
       durationMs: 100,
       totalUsage: { inputTokens: 100, outputTokens: 50, totalTokens: 150 },
+      rubric: undefined,
     },
     baseline: {
       modelId: "judge-model",
@@ -60,6 +61,7 @@ function makeMinimalReport(overrides: Partial<BenchReport> = {}): BenchReport {
       durationMs: 50,
       usage: { inputTokens: 30, outputTokens: 20, totalTokens: 50 },
       errorMessage: undefined,
+      rubric: undefined,
     },
     failed: false,
   };
@@ -87,6 +89,10 @@ function makeMinimalReport(overrides: Partial<BenchReport> = {}): BenchReport {
       consensusBeatsBaselineConfidenceRate: 1,
       runsCounted: 1,
       runsAttempted: 1,
+      consensusRubricNormalizedMean: undefined,
+      baselineRubricNormalizedMean: undefined,
+      consensusBeatsBaselineRubricRate: undefined,
+      rubricRunsCounted: 0,
     },
     qualitativeNotes: ["• c1#0: converged at round 1; judge confidence 80"],
     ...overrides,
@@ -118,6 +124,97 @@ describe("formatReportMarkdown — section contract", () => {
     expect(md).toContain("v2.0.0");
   });
 
+  it("renders the held-out rubric block and rubric columns when rubrics are present", () => {
+    const report = makeMinimalReport();
+    const run = report.runs[0]!;
+    const withRubric: BenchRun = {
+      ...run,
+      consensus: {
+        ...run.consensus,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [{ criterionId: "x", score: 4, justification: "j" }],
+          total: 4,
+          maxTotal: 5,
+          normalized: 80,
+          durationMs: 50,
+          usage: undefined,
+          errorMessage: undefined,
+        },
+      },
+      baseline: {
+        ...run.baseline,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [{ criterionId: "x", score: 2, justification: "j" }],
+          total: 2,
+          maxTotal: 5,
+          normalized: 40,
+          durationMs: 50,
+          usage: undefined,
+          errorMessage: undefined,
+        },
+      },
+    };
+    const md = formatReportMarkdown({
+      ...report,
+      runs: [withRubric],
+      metrics: {
+        ...report.metrics,
+        consensusRubricNormalizedMean: 80,
+        baselineRubricNormalizedMean: 40,
+        consensusBeatsBaselineRubricRate: 1,
+        rubricRunsCounted: 1,
+      },
+    });
+    expect(md).toContain("Held-out rubric");
+    expect(md).toContain("Mean rubric score");
+    expect(md).toContain("Consensus beats baseline on rubric");
+    // Table gains the Rubric C / Rubric B / Δ rubric columns.
+    expect(md).toContain("Rubric C");
+    expect(md).toContain("Rubric B");
+    expect(md).toContain("Δ rubric");
+    expect(md).toMatch(/\| 80 \| 40 \| \+40 \|/);
+  });
+
+  it("renders ERR in rubric cells when an eval failed, without crashing the per-case table", () => {
+    const report = makeMinimalReport();
+    const run = report.runs[0]!;
+    const withErroredRubric: BenchRun = {
+      ...run,
+      consensus: {
+        ...run.consensus,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [],
+          total: 0,
+          maxTotal: 5,
+          normalized: 0,
+          durationMs: 10,
+          usage: undefined,
+          errorMessage: "evaluator did not emit a parseable JSON object",
+        },
+      },
+      baseline: {
+        ...run.baseline,
+        rubric: {
+          evaluatorModelId: "claude-opus-4-5",
+          criteria: [],
+          total: 0,
+          maxTotal: 5,
+          normalized: 0,
+          durationMs: 10,
+          usage: undefined,
+          errorMessage: "caller threw",
+        },
+      },
+    };
+    const md = formatReportMarkdown({ ...report, runs: [withErroredRubric] });
+    expect(md).toContain("ERR");
+    // Δ rubric is "—" when either side errored.
+    expect(md).toMatch(/\| ERR \| ERR \| — \|/);
+  });
+
   it("renders per-case table with score, sigma, rounds, stop, judge conf, baseline conf, delta", () => {
     const md = formatReportMarkdown(makeMinimalReport());
     // The table contains "| 71 |" for score, "| 80 |" for judge conf, "| 60 |" for baseline, "| +11 |" for delta.
diff --git a/src/benchmark/__tests__/metrics.test.ts b/src/benchmark/__tests__/metrics.test.ts
index 46b5195..e47aea9 100644
--- a/src/benchmark/__tests__/metrics.test.ts
+++ b/src/benchmark/__tests__/metrics.test.ts
@@ -70,6 +70,7 @@ function makeConsensusOutcome(
             totalTokens: options.totalTokens,
           }
         : undefined,
+    rubric: undefined,
   };
 }
 
@@ -93,6 +94,7 @@ function makeBaseline(opts: {
           }
         : undefined,
     errorMessage: opts.errorMessage,
+    rubric: undefined,
   };
 }
 
diff --git a/src/benchmark/__tests__/rubric.test.ts b/src/benchmark/__tests__/rubric.test.ts
new file mode 100644
index 0000000..a62dea4
--- /dev/null
+++ b/src/benchmark/__tests__/rubric.test.ts
@@ -0,0 +1,276 @@
+import { describe, it, expect } from "vitest";
+import type { ModelCaller, ModelCallRequest } from "ai-consensus-core";
+import type { RubricCriterion } from "../../presets/types.js";
+import {
+  buildRubricSystemPrompt,
+  buildRubricUserPrompt,
+  evaluateOutput,
+  extractJsonObject,
+} from "../rubric.js";
+
+const RUBRIC: readonly RubricCriterion[] = [
+  { id: "quantification", description: "Quantified constraints.", maxPoints: 5 },
+  { id: "single-recommendation", description: "Pick one option.", maxPoints: 5 },
+  { id: "reversibility", description: "Weigh reversibility.", maxPoints: 5 },
+];
+
+function mockCallerEmitting(content: string, opts: { throwError?: string } = {}): ModelCaller {
+  return (_req: ModelCallRequest) => {
+    if (opts.throwError) {
+      return Promise.reject(new Error(opts.throwError));
+    }
+    return Promise.resolve({
+      content,
+      usage: { inputTokens: 100, outputTokens: 50, totalTokens: 150 },
+    });
+  };
+}
+
+describe("buildRubricSystemPrompt", () => {
+  it("lists each criterion with id and max points", () => {
+    const prompt = buildRubricSystemPrompt(RUBRIC);
+    expect(prompt).toContain('"quantification" (0-5)');
+    expect(prompt).toContain('"single-recommendation" (0-5)');
+    expect(prompt).toContain('"reversibility" (0-5)');
+  });
+
+  it("instructs the model to return only JSON, no fences", () => {
+    const prompt = buildRubricSystemPrompt(RUBRIC);
+    expect(prompt).toMatch(/no prose before or after, no markdown fences/);
+  });
+
+  it("requires justifications to cite specifics, not generic praise", () => {
+    const prompt = buildRubricSystemPrompt(RUBRIC);
+    expect(prompt).toMatch(/not generic praise or criticism/);
+  });
+});
+
+describe("buildRubricUserPrompt", () => {
+  it("fences the answer so it cannot be confused with the evaluator's own output", () => {
+    const out = buildRubricUserPrompt({
+      question: "What architecture?",
+      output: "use a monolith because…",
+    });
+    expect(out).toContain("<<<ANSWER>>>");
+    expect(out).toContain("<<<END ANSWER>>>");
+    expect(out).toContain("use a monolith because…");
+  });
+});
+
+describe("extractJsonObject", () => {
+  it("parses raw JSON", () => {
+    const v = extractJsonObject('{"scores":[]}');
+    expect(v).toEqual({ scores: [] });
+  });
+
+  it("parses JSON inside a ```json fenced block", () => {
+    const v = extractJsonObject('Here you go:\n```json\n{"scores":[{"a":1}]}\n```\n');
+    expect(v).toEqual({ scores: [{ a: 1 }] });
+  });
+
+  it("parses JSON inside a plain ``` fenced block", () => {
+    const v = extractJsonObject('```\n{"k":42}\n```');
+    expect(v).toEqual({ k: 42 });
+  });
+
+  it("recovers JSON from surrounding prose", () => {
+    const v = extractJsonObject('Preamble. {"scores":[]} Trailing prose.');
+    expect(v).toEqual({ scores: [] });
+  });
+
+  it("handles strings containing braces without breaking depth tracking", () => {
+    const v = extractJsonObject('text {"s":"a }brace{ in string","n":1}');
+    expect(v).toEqual({ s: "a }brace{ in string", n: 1 });
+  });
+
+  it("returns undefined for unparseable content", () => {
+    expect(extractJsonObject("totally not json")).toBeUndefined();
+  });
+
+  it("returns undefined for non-object JSON (arrays, primitives)", () => {
+    // The evaluator contract is a JSON object; an array alone isn't valid.
+    expect(extractJsonObject("[1,2,3]")).toBeUndefined();
+    expect(extractJsonObject("42")).toBeUndefined();
+  });
+});
+
+describe("evaluateOutput — happy path", () => {
+  it("returns scores, total, maxTotal, and normalized for a valid JSON response", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: 4, justification: "names ms and $." },
+        {
+          criterion_id: "single-recommendation",
+          score: 5,
+          justification: "picks monolith outright.",
+        },
+        { criterion_id: "reversibility", score: 3, justification: "mentions but does not rate." },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "test-evaluator",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toBeUndefined();
+    expect(result.total).toBe(12);
+    expect(result.maxTotal).toBe(15);
+    expect(result.normalized).toBe(80); // 12/15 = 0.8
+    expect(result.criteria).toHaveLength(3);
+    expect(result.criteria[0]?.criterionId).toBe("quantification");
+    expect(result.criteria[0]?.score).toBe(4);
+  });
+
+  it("preserves rubric order in the criteria output, regardless of response order", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "reversibility", score: 1, justification: "…" },
+        { criterion_id: "quantification", score: 2, justification: "…" },
+        { criterion_id: "single-recommendation", score: 3, justification: "…" },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "test-evaluator",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria.map((c) => c.criterionId)).toEqual([
+      "quantification",
+      "single-recommendation",
+      "reversibility",
+    ]);
+    expect(result.criteria.map((c) => c.score)).toEqual([2, 3, 1]);
+  });
+});
+
+describe("evaluateOutput — score handling", () => {
+  it("clamps scores above maxPoints", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: 99, justification: "j" },
+        { criterion_id: "single-recommendation", score: 5, justification: "j" },
+        { criterion_id: "reversibility", score: 5, justification: "j" },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria[0]?.score).toBe(5);
+    expect(result.total).toBe(15);
+  });
+
+  it("clamps negative scores to 0", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: -3, justification: "j" },
+        { criterion_id: "single-recommendation", score: 0, justification: "j" },
+        { criterion_id: "reversibility", score: 0, justification: "j" },
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria[0]?.score).toBe(0);
+    expect(result.total).toBe(0);
+    expect(result.normalized).toBe(0);
+  });
+
+  it("treats a missing criterion as 0 with a sentinel justification", async () => {
+    const response = JSON.stringify({
+      scores: [
+        { criterion_id: "quantification", score: 5, justification: "j" },
+        { criterion_id: "single-recommendation", score: 5, justification: "j" },
+        // reversibility omitted
+      ],
+    });
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting(response),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.criteria).toHaveLength(3);
+    const reversibility = result.criteria.find((c) => c.criterionId === "reversibility");
+    expect(reversibility?.score).toBe(0);
+    expect(reversibility?.justification).toMatch(/omitted/);
+    // Eval still succeeds (no errorMessage) — partial data is data.
+    expect(result.errorMessage).toBeUndefined();
+  });
+});
+
+describe("evaluateOutput — failure modes (sentinel, never throws)", () => {
+  it("records errorMessage when the caller throws", async () => {
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting("", { throwError: "provider down" }),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toBe("provider down");
+    expect(result.criteria).toHaveLength(0);
+    expect(result.total).toBe(0);
+  });
+
+  it("records errorMessage when the response contains no parseable JSON", async () => {
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting("I cannot do that."),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toMatch(/parseable JSON/);
+    expect(result.criteria).toHaveLength(0);
+  });
+
+  it("records errorMessage when JSON has the wrong shape", async () => {
+    const result = await evaluateOutput({
+      caller: mockCallerEmitting('{"wrong":"shape"}'),
+      evaluatorModelId: "x",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(result.errorMessage).toBeDefined();
+    expect(result.criteria).toHaveLength(0);
+  });
+});
+
+describe("evaluateOutput — caller is invoked with the right request shape", () => {
+  it("routes to participantId=rubric-evaluator with the configured modelId", async () => {
+    let captured: ModelCallRequest | undefined;
+    const caller: ModelCaller = (req) => {
+      captured = req;
+      return Promise.resolve({
+        content: JSON.stringify({
+          scores: RUBRIC.map((c) => ({ criterion_id: c.id, score: 1, justification: "j" })),
+        }),
+      });
+    };
+    await evaluateOutput({
+      caller,
+      evaluatorModelId: "my-eval-model",
+      rubric: RUBRIC,
+      question: "Q",
+      output: "A",
+    });
+    expect(captured?.participantId).toBe("rubric-evaluator");
+    expect(captured?.modelId).toBe("my-eval-model");
+    expect(captured?.system).toContain('"quantification"');
+    expect(captured?.user).toContain("<<<ANSWER>>>");
+  });
+});
diff --git a/src/benchmark/__tests__/runner.test.ts b/src/benchmark/__tests__/runner.test.ts
index 17ff8f1..9ad0a3c 100644
--- a/src/benchmark/__tests__/runner.test.ts
+++ b/src/benchmark/__tests__/runner.test.ts
@@ -22,6 +22,10 @@ interface MockCallerOptions {
   failParticipantIds?: Set<string>;
   failBaseline?: boolean;
   withUsage?: boolean;
+  /** Per-criterion score the rubric evaluator returns. Same score for both sides. */
+  rubricScore?: number;
+  /** If true, the rubric-evaluator returns junk and the eval errors. */
+  rubricEvalFails?: boolean;
 }
 
 function makeMockCaller(opts: MockCallerOptions = {}): ModelCaller {
@@ -42,6 +46,27 @@ function makeMockCaller(opts: MockCallerOptions = {}): ModelCaller {
           : {}),
       };
     }
+    if (req.participantId === "rubric-evaluator") {
+      if (opts.rubricEvalFails) {
+        return { content: "I refuse to score this." };
+      }
+      const score = opts.rubricScore ?? 3;
+      // Match whatever rubric the panel declared by parsing the criterion
+      // ids out of the system prompt. Keeps the mock panel-agnostic.
+      const ids = Array.from(req.system.matchAll(/"([a-z0-9_-]+)" \(0-/g)).map((m) => m[1]!);
+      return {
+        content: JSON.stringify({
+          scores: ids.map((id) => ({
+            criterion_id: id,
+            score,
+            justification: `mock score for ${id}`,
+          })),
+        }),
+        ...(opts.withUsage
+          ? { usage: { inputTokens: 40, outputTokens: 30, totalTokens: 70 } }
+          : {}),
+      };
+    }
     if (req.participantId === "judge") {
       const conf = opts.judgeConfidence ?? 82;
       return {
@@ -213,6 +238,76 @@ describe("runSuite — happy path", () => {
   });
 });
 
+describe("runSuite — rubric evaluation", () => {
+  it("leaves rubric undefined on both sides when no evaluator is configured", async () => {
+    const report = await runSuite({
+      cases: [SAMPLE_CASES[0]!],
+      panel: ARCHITECTURE_V2_PRESET,
+      participants: makeParticipants(),
+      engineDefaults: { maxRounds: 1 },
+      caller: makeMockCaller(),
+      baselineModelId: "judge-model",
+      runs: 1,
+      baseSeed: 1,
+      // Note: evaluatorModelId omitted — the panel has a rubric, but the
+      // bench was invoked without an evaluator, so the path must short-circuit.
+    });
+    expect(report.runs[0]?.consensus.rubric).toBeUndefined();
+    expect(report.runs[0]?.baseline.rubric).toBeUndefined();
+    expect(report.metrics.consensusRubricNormalizedMean).toBeUndefined();
+    expect(report.metrics.baselineRubricNormalizedMean).toBeUndefined();
+    expect(report.metrics.rubricRunsCounted).toBe(0);
+  });
+
+  it("populates both rubric outcomes and the rubric metrics when the evaluator is configured", async () => {
+    const report = await runSuite({
+      cases: [SAMPLE_CASES[0]!],
+      panel: ARCHITECTURE_V2_PRESET,
+      participants: makeParticipants(),
+      // Judge config is required so the engine produces a synthesis —
+      // the rubric eval has no consensus output to score otherwise.
+      engineDefaults: { maxRounds: 1, judge: { modelId: "judge-model" } },
+      caller: makeMockCaller({ rubricScore: 4 }),
+      baselineModelId: "judge-model",
+      runs: 2,
+      baseSeed: 1,
+      evaluatorModelId: "claude-opus-4-5",
+    });
+    for (const r of report.runs) {
+      expect(r.consensus.rubric?.errorMessage).toBeUndefined();
+      expect(r.baseline.rubric?.errorMessage).toBeUndefined();
+      // ARCHITECTURE_V2_PRESET has 5 criteria of 5 points each → 4/5 each = 80.
+      expect(r.consensus.rubric?.normalized).toBe(80);
+      expect(r.baseline.rubric?.normalized).toBe(80);
+    }
+    expect(report.metrics.rubricRunsCounted).toBe(2);
+    expect(report.metrics.consensusRubricNormalizedMean).toBe(80);
+    expect(report.metrics.baselineRubricNormalizedMean).toBe(80);
+    // Equal scores → consensus is NOT strictly greater on either run.
+    expect(report.metrics.consensusBeatsBaselineRubricRate).toBe(0);
+  });
+
+  it("captures rubric failures into errorMessage without aborting the suite", async () => {
+    const report = await runSuite({
+      cases: [SAMPLE_CASES[0]!],
+      panel: ARCHITECTURE_V2_PRESET,
+      participants: makeParticipants(),
+      engineDefaults: { maxRounds: 1, judge: { modelId: "judge-model" } },
+      caller: makeMockCaller({ rubricEvalFails: true }),
+      baselineModelId: "judge-model",
+      runs: 1,
+      baseSeed: 1,
+      evaluatorModelId: "claude-opus-4-5",
+    });
+    expect(report.runs[0]?.failed).toBe(false); // run itself succeeded
+    expect(report.runs[0]?.consensus.rubric?.errorMessage).toBeDefined();
+    expect(report.runs[0]?.baseline.rubric?.errorMessage).toBeDefined();
+    // Failed evals are excluded from the paired-runs denominator.
+    expect(report.metrics.rubricRunsCounted).toBe(0);
+    expect(report.metrics.consensusBeatsBaselineRubricRate).toBeUndefined();
+  });
+});
+
 describe("runSuite — failure capture", () => {
   it("marks runs as failed when the engine throws on every participant", async () => {
     const allFail = new Set(["p1", "p2", "p3"]);
diff --git a/src/benchmark/baseline.ts b/src/benchmark/baseline.ts
index b1b7d35..5edf7e8 100644
--- a/src/benchmark/baseline.ts
+++ b/src/benchmark/baseline.ts
@@ -90,5 +90,6 @@ export async function runBaseline(args: RunBaselineArgs): Promise<BaselineOutcom
     durationMs: completedAt - startedAt,
     usage,
     errorMessage,
+    rubric: undefined,
   };
 }
diff --git a/src/benchmark/format.ts b/src/benchmark/format.ts
index a3434eb..3920ad6 100644
--- a/src/benchmark/format.ts
+++ b/src/benchmark/format.ts
@@ -118,31 +118,87 @@ function formatMetrics(m: BenchMetrics): string {
   lines.push(
     `- **Consensus score > baseline confidence:** ${pct(m.consensusBeatsBaselineConfidenceRate)} of runs`,
   );
+  if (
+    m.consensusRubricNormalizedMean !== undefined ||
+    m.baselineRubricNormalizedMean !== undefined ||
+    m.consensusBeatsBaselineRubricRate !== undefined
+  ) {
+    lines.push("");
+    lines.push("**Held-out rubric** (independent quality eval — not self-reported confidence):");
+    if (
+      m.consensusRubricNormalizedMean !== undefined &&
+      m.baselineRubricNormalizedMean !== undefined
+    ) {
+      const delta = m.consensusRubricNormalizedMean - m.baselineRubricNormalizedMean;
+      lines.push(
+        `- **Mean rubric score:** consensus ${m.consensusRubricNormalizedMean.toFixed(1)}/100, baseline ${m.baselineRubricNormalizedMean.toFixed(1)}/100 (Δ ${delta >= 0 ? "+" : ""}${delta.toFixed(1)})`,
+      );
+    } else {
+      if (m.consensusRubricNormalizedMean !== undefined) {
+        lines.push(
+          `- **Consensus mean rubric score:** ${m.consensusRubricNormalizedMean.toFixed(1)}/100`,
+        );
+      }
+      if (m.baselineRubricNormalizedMean !== undefined) {
+        lines.push(
+          `- **Baseline mean rubric score:** ${m.baselineRubricNormalizedMean.toFixed(1)}/100`,
+        );
+      }
+    }
+    if (m.consensusBeatsBaselineRubricRate !== undefined) {
+      lines.push(
+        `- **Consensus beats baseline on rubric:** ${pct(m.consensusBeatsBaselineRubricRate)} of paired runs (${m.rubricRunsCounted} pairs)`,
+      );
+    }
+  }
   return lines.join("\n");
 }
 
 function formatPerCaseTable(runs: readonly BenchRun[]): string {
   const lines: string[] = [];
-  lines.push(
-    "| Case | Run | Score | σ | Rounds | Stop | Disagree | Judge conf | Baseline conf | Δ |",
-  );
-  lines.push(
-    "| ---- | --- | ----- | - | ------ | ---- | -------- | ---------- | ------------- | - |",
+  const anyRubric = runs.some(
+    (r) => r.consensus.rubric !== undefined || r.baseline.rubric !== undefined,
   );
+  if (anyRubric) {
+    lines.push(
+      "| Case | Run | Score | σ | Rounds | Stop | Disagree | Judge conf | Baseline conf | Δ conf | Rubric C | Rubric B | Δ rubric |",
+    );
+    lines.push(
+      "| ---- | --- | ----- | - | ------ | ---- | -------- | ---------- | ------------- | ------ | -------- | -------- | -------- |",
+    );
+  } else {
+    lines.push(
+      "| Case | Run | Score | σ | Rounds | Stop | Disagree | Judge conf | Baseline conf | Δ |",
+    );
+    lines.push(
+      "| ---- | --- | ----- | - | ------ | ---- | -------- | ---------- | ------------- | - |",
+    );
+  }
   for (const r of runs) {
     if (r.failed) {
-      lines.push(
-        `| ${r.caseId} | ${r.runIndex} | — | — | — | — | — | — | — | _FAILED: ${escapeTable(r.errorMessage ?? "?")}_ |`,
-      );
+      const baseFailed = `| ${r.caseId} | ${r.runIndex} | — | — | — | — | — | — | — | _FAILED: ${escapeTable(r.errorMessage ?? "?")}_ |`;
+      lines.push(anyRubric ? `${baseFailed} — | — | — |` : baseFailed);
       continue;
     }
     const c = r.consensus;
     const delta = c.finalScore - r.baseline.confidence;
-    lines.push(
-      `| ${r.caseId} | ${r.runIndex} | ${c.finalScore} | ${c.finalStddev.toFixed(1)} | ${c.roundsCompleted} | ${shortStopReason(
-        c.result.stopReason,
-      )} | ${c.disagreementCount} | ${c.judgeConfidence ?? "—"} | ${r.baseline.confidence} | ${delta >= 0 ? "+" : ""}${delta} |`,
-    );
+    const baseRow = `| ${r.caseId} | ${r.runIndex} | ${c.finalScore} | ${c.finalStddev.toFixed(1)} | ${c.roundsCompleted} | ${shortStopReason(
+      c.result.stopReason,
+    )} | ${c.disagreementCount} | ${c.judgeConfidence ?? "—"} | ${r.baseline.confidence} | ${delta >= 0 ? "+" : ""}${delta} |`;
+    if (!anyRubric) {
+      lines.push(baseRow);
+      continue;
+    }
+    const cr = r.consensus.rubric;
+    const br = r.baseline.rubric;
+    const crCell = !cr ? "—" : cr.errorMessage ? "ERR" : `${cr.normalized}`;
+    const brCell = !br ? "—" : br.errorMessage ? "ERR" : `${br.normalized}`;
+    let rubricDeltaCell = "—";
+    if (cr && !cr.errorMessage && br && !br.errorMessage) {
+      const d = cr.normalized - br.normalized;
+      rubricDeltaCell = `${d >= 0 ? "+" : ""}${d}`;
+    }
+    lines.push(`${baseRow} ${crCell} | ${brCell} | ${rubricDeltaCell} |`);
   }
   return lines.join("\n");
 }
diff --git a/src/benchmark/metrics.ts b/src/benchmark/metrics.ts
index c7946ac..03b6401 100644
--- a/src/benchmark/metrics.ts
+++ b/src/benchmark/metrics.ts
@@ -49,6 +49,10 @@ export function computeMetrics(
       consensusBeatsBaselineConfidenceRate: 0,
       runsCounted,
       runsAttempted,
+      consensusRubricNormalizedMean: undefined,
+      baselineRubricNormalizedMean: undefined,
+      consensusBeatsBaselineRubricRate: undefined,
+      rubricRunsCounted: 0,
     };
   }
 
@@ -99,6 +103,36 @@ export function computeMetrics(
   const beatHits = counted.filter((r) => r.consensus.finalScore > r.baseline.confidence).length;
   const consensusBeatsBaselineConfidenceRate = beatHits / runsCounted;
 
+  // Rubric metrics — held-out evaluator quality scores. Only runs where
+  // BOTH sides produced a successful rubric eval contribute to the rate;
+  // each side's mean counts its own successful evals independently.
+  const consensusRubricScores = counted
+    .map((r) => r.consensus.rubric)
+    .filter((rb): rb is NonNullable<typeof rb> => rb !== undefined && rb.errorMessage === undefined)
+    .map((rb) => rb.normalized);
+  const baselineRubricScores = counted
+    .map((r) => r.baseline.rubric)
+    .filter((rb): rb is NonNullable<typeof rb> => rb !== undefined && rb.errorMessage === undefined)
+    .map((rb) => rb.normalized);
+  const consensusRubricNormalizedMean =
+    consensusRubricScores.length > 0 ? mean(consensusRubricScores) : undefined;
+  const baselineRubricNormalizedMean =
+    baselineRubricScores.length > 0 ? mean(baselineRubricScores) : undefined;
+
+  const rubricPaired = counted.filter(
+    (r) =>
+      r.consensus.rubric !== undefined &&
+      r.consensus.rubric.errorMessage === undefined &&
+      r.baseline.rubric !== undefined &&
+      r.baseline.rubric.errorMessage === undefined,
+  );
+  const rubricRunsCounted = rubricPaired.length;
+  const consensusBeatsBaselineRubricRate =
+    rubricRunsCounted > 0
+      ? rubricPaired.filter((r) => r.consensus.rubric!.normalized > r.baseline.rubric!.normalized)
+          .length / rubricRunsCounted
+      : undefined;
+
   return {
     agreementRate,
     agreementStddevThreshold: threshold,
@@ -113,6 +147,10 @@ export function computeMetrics(
     consensusBeatsBaselineConfidenceRate,
     runsCounted,
     runsAttempted,
+    consensusRubricNormalizedMean,
+    baselineRubricNormalizedMean,
+    consensusBeatsBaselineRubricRate,
+    rubricRunsCounted,
   };
 }
 
@@ -146,6 +184,19 @@ export function buildQualitativeNotes(runs: readonly BenchRun[]): string[] {
     if (r.consensus.judgeConfidence !== undefined) {
       tags.push(`judge confidence ${r.consensus.judgeConfidence}`);
     }
+    if (
+      r.consensus.rubric &&
+      r.consensus.rubric.errorMessage === undefined &&
+      r.baseline.rubric &&
+      r.baseline.rubric.errorMessage === undefined
+    ) {
+      const cn = r.consensus.rubric.normalized;
+      const bn = r.baseline.rubric.normalized;
+      const delta = cn - bn;
+      tags.push(`rubric C=${cn} B=${bn} Δ=${delta >= 0 ? "+" : ""}${delta}`);
+    } else if (r.consensus.rubric?.errorMessage || r.baseline.rubric?.errorMessage) {
+      tags.push("rubric eval failed");
+    }
     if (r.baseline.errorMessage) {
       tags.push(`baseline errored (${r.baseline.errorMessage})`);
     }
diff --git a/src/benchmark/rubric.ts b/src/benchmark/rubric.ts
new file mode 100644
index 0000000..52c055f
--- /dev/null
+++ b/src/benchmark/rubric.ts
@@ -0,0 +1,318 @@
+// ─────────────────────────────────────────────────────────────
+// Rubric evaluator — held-out LLM-as-judge quality scoring
+// ─────────────────────────────────────────────────────────────
+// Scores a single output against a panel-declared `RubricCriterion[]` by
+// asking a held-out model (neither the judge nor the panel) to rate each
+// criterion 0..maxPoints with a brief justification.
+//
+// The bench runner invokes this twice per run — once for the consensus
+// output, once for the baseline output — and reports both alongside the
+// existing self-reported confidence numbers. That gives a quality signal
+// that is independent of either side's self-assessment.
+//
+// Kept deliberately thin: one model call, structured-output parsing with
+// a tolerant JSON extractor, sentinel `errorMessage` on failure rather
+// than thrown exceptions (a rubric eval failure shouldn't abort a bench
+// suite, matching the contract used by baseline.ts).
+
+import { z } from "zod";
+import type { ModelCaller, TokenUsage } from "ai-consensus-core";
+import type { RubricCriterion } from "../presets/types.js";
+
+/** Score the evaluator emitted for one criterion. */
+export interface RubricCriterionScore {
+  criterionId: string;
+  /** 0..maxPoints (clamped). */
+  score: number;
+  /** Evaluator's one-to-two-sentence justification. */
+  justification: string;
+}
+
+/** Full rubric evaluation of a single output. */
+export interface RubricEvaluation {
+  evaluatorModelId: string;
+  criteria: readonly RubricCriterionScore[];
+  /** Sum of scores across criteria. */
+  total: number;
+  /** Sum of maxPoints across criteria. */
+  maxTotal: number;
+  /**
+   * Integer 0..100, `Math.round((total / maxTotal) * 100)`. Mirrors the
+   * 0..100 scale of consensus score and baseline confidence so the report
+   * can put them side by side.
+   */
+  normalized: number;
+  durationMs: number;
+  usage: TokenUsage | undefined;
+  /** Set when the evaluation failed; metrics filter on this. */
+  errorMessage: string | undefined;
+}
+
+export interface EvaluateOutputArgs {
+  caller: ModelCaller;
+  evaluatorModelId: string;
+  rubric: readonly RubricCriterion[];
+  question: string;
+  /** The output to score (consensus synthesis or baseline answer). */
+  output: string;
+  temperature?: number;
+  maxOutputTokens?: number;
+  signal?: AbortSignal;
+}
+
+const DEFAULTS = {
+  temperature: 0.1,
+  maxOutputTokens: 2000,
+} as const;
+
+/** Zod shape we expect the evaluator to emit as JSON. */
+const EvaluatorResponseSchema = z.object({
+  scores: z
+    .array(
+      z.object({
+        criterion_id: z.string().min(1),
+        score: z.number(),
+        justification: z.string().min(1),
+      }),
+    )
+    .min(1),
+});
+type EvaluatorResponse = z.infer<typeof EvaluatorResponseSchema>;
+
+/**
+ * Build the evaluator's system prompt. The model is told it is judging
+ * answer quality against named criteria, NOT picking a winner, NOT
+ * comparing to anything else. The output is constrained to a single JSON
+ * object with one entry per criterion.
+ */
+export function buildRubricSystemPrompt(rubric: readonly RubricCriterion[]): string {
+  const criteriaLines = rubric.map((c) => `- "${c.id}" (0-${c.maxPoints}): ${c.description}`);
+  return [
+    "You are a quality evaluator. You are NOT writing an answer to the question — your only job is to score the given answer against the listed criteria.",
+    "",
+    "Score each criterion on the 0..maxPoints scale named in its label. Use the full range: 0 means the answer ignores the criterion entirely; max means the answer fully meets it. Mid-range scores are appropriate for partial coverage.",
+    "",
+    "Criteria:",
+    ...criteriaLines,
+    "",
+    "Return a SINGLE JSON object with exactly this shape — no prose before or after, no markdown fences:",
+    "",
+    '{"scores":[{"criterion_id":"<id>","score":<number>,"justification":"<one or two sentences>"}, ...]}',
+    "",
+    "Include exactly one entry per criterion, in the order listed above. The justification must cite something specific from the answer (a phrase, a missing element, a vague vs measurable claim) — not generic praise or criticism.",
+  ].join("\n");
+}
+
+/**
+ * Build the user message: the question being answered + the answer being
+ * evaluated. The answer is fenced so the evaluator can't confuse it with
+ * its own output.
+ */
+export function buildRubricUserPrompt(args: { question: string; output: string }): string {
+  return [
+    "QUESTION (what the answer is trying to address):",
+    args.question,
+    "",
+    "ANSWER TO EVALUATE (between the fences):",
+    "<<<ANSWER>>>",
+    args.output,
+    "<<<END ANSWER>>>",
+    "",
+    "Score the answer against the criteria. Return only the JSON object.",
+  ].join("\n");
+}
+
+/**
+ * Evaluate one output against a rubric. Returns a `RubricEvaluation` with
+ * `errorMessage` set on any failure (caller failure, malformed JSON,
+ * schema violation). Criteria the evaluator omits score 0 without an
+ * error. Never throws: bench suites must complete even if an eval fails.
+ */
+export async function evaluateOutput(args: EvaluateOutputArgs): Promise<RubricEvaluation> {
+  const {
+    caller,
+    evaluatorModelId,
+    rubric,
+    question,
+    output,
+    temperature = DEFAULTS.temperature,
+    maxOutputTokens = DEFAULTS.maxOutputTokens,
+    signal,
+  } = args;
+
+  const maxTotal = rubric.reduce((acc, c) => acc + c.maxPoints, 0);
+  const startedAt = Date.now();
+  let usage: TokenUsage | undefined;
+  let errorMessage: string | undefined;
+  let parsed: EvaluatorResponse | undefined;
+
+  try {
+    const response = await caller({
+      participantId: "rubric-evaluator",
+      modelId: evaluatorModelId,
+      round: 1,
+      phase: "initial-analysis",
+      system: buildRubricSystemPrompt(rubric),
+      user: buildRubricUserPrompt({ question, output }),
+      temperature,
+      maxOutputTokens,
+      ...(signal ? { signal } : {}),
+    });
+    usage = response.usage;
+
+    const json = extractJsonObject(response.content);
+    if (json === undefined) {
+      throw new Error("evaluator did not emit a parseable JSON object");
+    }
+    const validation = EvaluatorResponseSchema.safeParse(json);
+    if (!validation.success) {
+      throw new Error(
+        `evaluator JSON failed schema: ${validation.error.issues[0]?.message ?? "unknown"}`,
+      );
+    }
+    parsed = validation.data;
+  } catch (err) {
+    errorMessage = err instanceof Error ? err.message : String(err);
+  }
+
+  const completedAt = Date.now();
+  const criteria: RubricCriterionScore[] = [];
+  let total = 0;
+
+  if (parsed) {
+    const byId = new Map(parsed.scores.map((s) => [s.criterion_id, s]));
+    for (const criterion of rubric) {
+      const got = byId.get(criterion.id);
+      if (!got) {
+        // Missing criterion → treat as 0 and record in justification.
+        criteria.push({
+          criterionId: criterion.id,
+          score: 0,
+          justification: "(evaluator omitted this criterion)",
+        });
+        continue;
+      }
+      const clamped = clamp(got.score, 0, criterion.maxPoints);
+      criteria.push({
+        criterionId: criterion.id,
+        score: clamped,
+        justification: got.justification.trim(),
+      });
+      total += clamped;
+    }
+  }
+
+  const normalized = maxTotal > 0 ? Math.round((total / maxTotal) * 100) : 0;
+
+  return {
+    evaluatorModelId,
+    criteria,
+    total,
+    maxTotal,
+    normalized,
+    durationMs: completedAt - startedAt,
+    usage,
+    errorMessage,
+  };
+}
+
+// ── Internals ────────────────────────────────────────────────
+
+/**
+ * Tolerant JSON object extractor. Models often wrap JSON in prose or in
+ * markdown fences (```json …```). We try, in order:
+ *
+ *   1. `JSON.parse(content.trim())` — happy path when the model complied.
+ *   2. The contents of the first ```…``` fenced block.
+ *   3. The substring from the first `{` to its matching closing `}`.
+ *
+ * Returns `undefined` if no valid JSON object can be extracted. Linear-
+ * time bracket scan; no regex backtracking (matches the parser hardening
+ * convention used in ai-consensus-core).
+ */
+export function extractJsonObject(content: string): unknown {
+  const trimmed = content.trim();
+
+  const direct = tryParse(trimmed);
+  if (direct !== undefined) return direct;
+
+  const fenced = extractFencedBlock(trimmed);
+  if (fenced !== undefined) {
+    const parsed = tryParse(fenced);
+    if (parsed !== undefined) return parsed;
+  }
+
+  const braced = extractFirstBracedObject(trimmed);
+  if (braced !== undefined) {
+    const parsed = tryParse(braced);
+    if (parsed !== undefined) return parsed;
+  }
+
+  return undefined;
+}
+
+function tryParse(s: string): unknown {
+  if (s.length === 0) return undefined;
+  try {
+    const v: unknown = JSON.parse(s);
+    // The evaluator contract is "a single JSON object" — bare arrays and
+    // primitives are rejected here so callers don't have to redo the check.
+    return typeof v === "object" && v !== null && !Array.isArray(v) ? v : undefined;
+  } catch {
+    return undefined;
+  }
+}
+
+function extractFencedBlock(s: string): string | undefined {
+  const fenceOpen = s.indexOf("```");
+  if (fenceOpen === -1) return undefined;
+  const afterOpenLine = s.indexOf("\n", fenceOpen);
+  if (afterOpenLine === -1) return undefined;
+  const fenceClose = s.indexOf("```", afterOpenLine);
+  if (fenceClose === -1) return undefined;
+  return s.slice(afterOpenLine + 1, fenceClose).trim();
+}
+
+/**
+ * Walk the string from the first `{` forward, tracking brace depth while
+ * skipping over string literals and escapes, and return the substring from
+ * that `{` through its matching `}`. Linear time.
+ */
+function extractFirstBracedObject(s: string): string | undefined {
+  const start = s.indexOf("{");
+  if (start === -1) return undefined;
+  let depth = 0;
+  let inString = false;
+  let escape = false;
+  for (let i = start; i < s.length; i++) {
+    const ch = s[i]!;
+    if (inString) {
+      if (escape) {
+        escape = false;
+      } else if (ch === "\\") {
+        escape = true;
+      } else if (ch === '"') {
+        inString = false;
+      }
+      continue;
+    }
+    if (ch === '"') {
+      inString = true;
+      continue;
+    }
+    if (ch === "{") {
+      depth++;
+    } else if (ch === "}") {
+      depth--;
+      if (depth === 0) {
+        return s.slice(start, i + 1);
+      }
+    }
+  }
+  return undefined;
+}
+
+function clamp(n: number, lo: number, hi: number): number {
+  if (!Number.isFinite(n)) return lo;
+  return Math.min(hi, Math.max(lo, n));
+}
diff --git a/src/benchmark/runner.ts b/src/benchmark/runner.ts
index 06b23bf..13c3a81 100644
--- a/src/benchmark/runner.ts
+++ b/src/benchmark/runner.ts
@@ -20,6 +20,7 @@ import {
 import type { Preset } from "../presets/types.js";
 import { runBaseline } from "./baseline.js";
 import { computeMetrics, buildQualitativeNotes } from "./metrics.js";
+import { evaluateOutput, type RubricEvaluation } from "./rubric.js";
 import {
   deriveRandomSeed,
   type BenchCase,
@@ -73,6 +74,12 @@ export interface RunSuiteArgs {
   signal?: AbortSignal;
   /** Optional name for the case-file or suite — appears in the report header. */
   caseFileName?: string;
+  /**
+   * Optional held-out evaluator model id for rubric scoring. When provided
+   * AND the panel declares a `rubric`, the runner scores both consensus
+   * and baseline outputs against that rubric.
+   */
+  evaluatorModelId?: string;
 }
 
 const MAX_RUNS = 32;
@@ -97,6 +104,7 @@ export async function runSuite(args: RunSuiteArgs): Promise<BenchReport> {
     onProgress,
     signal,
     caseFileName,
+    evaluatorModelId,
   } = args;
 
   if (cases.length === 0) {
@@ -148,6 +156,7 @@ export async function runSuite(args: RunSuiteArgs): Promise<BenchReport> {
         caseIndex,
         totalCases: cases.length,
         totalRuns,
+        evaluatorModelId,
       });
       collected.push(run);
     }
@@ -205,6 +214,7 @@ interface ExecuteOneRunArgs {
   caseIndex: number;
   totalCases: number;
   totalRuns: number;
+  evaluatorModelId: string | undefined;
 }
 
 async function executeOneRun(args: ExecuteOneRunArgs): Promise<BenchRun> {
@@ -222,6 +232,7 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise<BenchRun> {
     caseIndex,
     totalCases,
     totalRuns,
+    evaluatorModelId,
   } = args;
 
   let consensus: ConsensusOutcome | undefined;
@@ -257,7 +268,7 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise<BenchRun> {
   });
 
   // 2) Baseline run (always — even if consensus errored, baseline data is useful).
-  const baseline = await runBaseline({
+  let baseline = await runBaseline({
     caller,
     modelId: baselineModelId,
     question: benchCase.question,
@@ -275,6 +286,57 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise<BenchRun> {
     }`,
   });
 
+  // 3) Optional rubric eval. Only runs when the panel declares a rubric AND
+  // the bench was invoked with a held-out evaluator model. Failures here are
+  // captured into the per-side `RubricEvaluation.errorMessage` rather than
+  // bubbling — a rubric failure is data quality, not a suite-fatal error.
+  const rubric = panel.rubric;
+  if (rubric && rubric.length > 0 && evaluatorModelId) {
+    const consensusOutputText = consensus?.result.synthesis?.content;
+    if (consensusOutputText && consensusOutputText.trim().length > 0) {
+      const consensusRubric = await evaluateOutput({
+        caller,
+        evaluatorModelId,
+        rubric,
+        question: benchCase.question,
+        output: consensusOutputText,
+        ...(signal ? { signal } : {}),
+      });
+      if (consensus) {
+        consensus = { ...consensus, rubric: consensusRubric };
+      }
+      onProgress?.({
+        kind: "consensus-complete",
+        caseIndex,
+        runIndex,
+        caseId: benchCase.id,
+        totalCases,
+        totalRuns,
+        message: `  consensus rubric ${runIndex + 1}: ${formatRubricProgress(consensusRubric)}`,
+      });
+    }
+    if (!baseline.errorMessage && baseline.content.trim().length > 0) {
+      const baselineRubric = await evaluateOutput({
+        caller,
+        evaluatorModelId,
+        rubric,
+        question: benchCase.question,
+        output: baseline.content,
+        ...(signal ? { signal } : {}),
+      });
+      baseline = { ...baseline, rubric: baselineRubric };
+      onProgress?.({
+        kind: "baseline-complete",
+        caseIndex,
+        runIndex,
+        caseId: benchCase.id,
+        totalCases,
+        totalRuns,
+        message: `  baseline rubric ${runIndex + 1}: ${formatRubricProgress(baselineRubric)}`,
+      });
+    }
+  }
+
   const failed = consensus === undefined || baseline.errorMessage !== undefined;
   const errorMessage = consensusError
     ? `consensus: ${consensusError}`
@@ -294,6 +356,11 @@ async function executeOneRun(args: ExecuteOneRunArgs): Promise<BenchRun> {
   };
 }
 
+function formatRubricProgress(r: RubricEvaluation): string {
+  if (r.errorMessage) return `ERRORED (${r.errorMessage})`;
+  return `score=${r.normalized}/100 (${r.total}/${r.maxTotal})`;
+}
+
 // ── Helpers ───────────────────────────────────────────────────
 
 function summariseConsensus(result: ConsensusResult, durationMs: number): ConsensusOutcome {
@@ -333,6 +400,7 @@ function summariseConsensus(result: ConsensusResult, durationMs: number): Consen
     judgeConfidence: result.synthesis?.judgeConfidence,
     durationMs,
     totalUsage,
+    rubric: undefined,
   };
 }
 
@@ -366,6 +434,7 @@ function placeholderConsensus(): ConsensusOutcome {
     judgeConfidence: undefined,
     durationMs: 0,
     totalUsage: undefined,
+    rubric: undefined,
   };
 }
 
diff --git a/src/benchmark/types.ts b/src/benchmark/types.ts
index e8b2b5b..b3e6b9c 100644
--- a/src/benchmark/types.ts
+++ b/src/benchmark/types.ts
@@ -15,6 +15,7 @@
 
 import { z } from "zod";
 import type { ConsensusResult, TokenUsage } from "ai-consensus-core";
+import type { RubricEvaluation } from "./rubric.js";
 
 // ── BenchCase (input) ────────────────────────────────────────
 
@@ -93,6 +94,13 @@ export interface ConsensusOutcome {
   judgeConfidence: number | undefined;
   durationMs: number;
   totalUsage: TokenUsage | undefined;
+  /**
+   * Held-out rubric evaluation of the consensus synthesis. Populated only
+   * when the panel declares a `rubric` AND the bench was invoked with an
+   * evaluator model. `undefined` means "not evaluated"; an evaluation with
+   * `errorMessage` set means "tried and failed".
+   */
+  rubric: RubricEvaluation | undefined;
 }
 
 export interface BaselineOutcome {
@@ -104,6 +112,8 @@ export interface BaselineOutcome {
   durationMs: number;
   usage: TokenUsage | undefined;
   errorMessage: string | undefined;
+  /** Held-out rubric evaluation; same activation rule as ConsensusOutcome.rubric. */
+  rubric: RubricEvaluation | undefined;
 }
 
 // ── BenchReport (suite-level aggregation) ────────────────────
@@ -149,6 +159,24 @@ export interface BenchMetrics {
   runsCounted: number;
   /** Total runs attempted. */
   runsAttempted: number;
+  /**
+   * Mean of rubric-normalized scores for the consensus side, across runs
+   * where the held-out evaluator succeeded. `undefined` when no run had a
+   * successful consensus rubric eval (panel has no rubric, or eval was
+   * never invoked, or every eval failed).
+   */
+  consensusRubricNormalizedMean: number | undefined;
+  /** Mean of rubric-normalized scores for the baseline side; same contract. */
+  baselineRubricNormalizedMean: number | undefined;
+  /**
+   * Fraction of paired runs where the consensus rubric score was strictly
+   * greater than the baseline rubric score. Independent of self-reported
+   * confidence: this is the held-out-judge view of which side answered
+   * better. `undefined` when no run had successful evals on BOTH sides.
+   */
+  consensusBeatsBaselineRubricRate: number | undefined;
+  /** Count of runs where both rubric evaluations succeeded — denominator for the rate above. */
+  rubricRunsCounted: number;
 }
 
 export interface BenchReport {
diff --git a/src/cli/bench.ts b/src/cli/bench.ts
index 8b7ccfd..831dacf 100644
--- a/src/cli/bench.ts
+++ b/src/cli/bench.ts
@@ -36,6 +36,8 @@ export interface BenchArgs {
   outputPath: string | undefined;
   baselineModelId: string | undefined;
   baselineProviderId: string | undefined;
+  evaluatorModelId: string | undefined;
+  evaluatorProviderId: string | undefined;
   filterTag: string | undefined;
   includeFullResults: boolean;
   listPanels: boolean;
@@ -89,6 +91,17 @@ Optional:
                                Defaults to the judge model from config.
       --baseline-provider <id> Provider id for the baseline model.
                                Defaults to the judge provider from config.
+      --evaluator-model <id>   Model id for a held-out rubric evaluator. When
+                               set AND the panel declares a rubric, the bench
+                               scores both consensus and baseline outputs
+                               against that rubric. SHOULD differ from both
+                               the judge model and the baseline model — the
+                               evaluator grades both sides blind, and using
+                               the same brain for grading and producing one
+                               side biases the result. The CLI warns when
+                               this contract is violated but does not block.
+      --evaluator-provider <id> Provider id for the evaluator model. Required
+                               when --evaluator-model is set.
       --filter-tag <tag>       Only run cases that have this tag.
       --output <path>          Also write the JSON report to this path.
       --include-full-results   Keep the full ConsensusResult objects in the
@@ -119,6 +132,11 @@ Examples:
   ai-consensus-mcp bench -c ./consensus.config.json -p security_redteam \\
       --filter-tag injection --output sec.json
 
+  # Held-out rubric eval — consensus + baseline scored by a third model
+  ai-consensus-mcp bench -p architecture_v2 --runs 3 --seed 42 \\
+      --evaluator-model claude-opus-4-5 --evaluator-provider anthropic \\
+      --output bench-rubric.json
+
   # Discover panels and their tags
   ai-consensus-mcp bench --list-panels
 
@@ -138,6 +156,8 @@ export function parseBenchArgs(argv: readonly string[]): BenchArgs | Error {
     outputPath: undefined,
     baselineModelId: undefined,
     baselineProviderId: undefined,
+    evaluatorModelId: undefined,
+    evaluatorProviderId: undefined,
     filterTag: undefined,
     includeFullResults: false,
     listPanels: false,
@@ -208,6 +228,14 @@ export function parseBenchArgs(argv: readonly string[]): BenchArgs | Error {
       const v = next();
       if (v instanceof Error) return v;
       out.baselineProviderId = v;
+    } else if (arg === "--evaluator-model") {
+      const v = next();
+      if (v instanceof Error) return v;
+      out.evaluatorModelId = v;
+    } else if (arg === "--evaluator-provider") {
+      const v = next();
+      if (v instanceof Error) return v;
+      out.evaluatorProviderId = v;
     } else if (arg === "--filter-tag") {
       const v = next();
       if (v instanceof Error) return v;
@@ -314,8 +342,39 @@ export async function runBench(argv: readonly string[]): Promise<number> {
     return 2;
   }
 
+  // Evaluator routing: --evaluator-model and --evaluator-provider must be
+  // set together. Validated BEFORE we start running so a typo'd provider id
+  // fails fast rather than dozens of provider calls in.
+  let evaluatorModelId: string | undefined;
+  let evaluatorProviderId: string | undefined;
+  if (parsed.evaluatorModelId || parsed.evaluatorProviderId) {
+    if (!parsed.evaluatorModelId || !parsed.evaluatorProviderId) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: --evaluator-model and --evaluator-provider must be passed together.\n`,
+      );
+      return 2;
+    }
+    if (!config.providers[parsed.evaluatorProviderId]) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: evaluator provider "${parsed.evaluatorProviderId}" is not in your config (available: ${Object.keys(
+          config.providers,
+        ).join(", ")}).\n`,
+      );
+      return 2;
+    }
+    evaluatorModelId = parsed.evaluatorModelId;
+    evaluatorProviderId = parsed.evaluatorProviderId;
+    if (!panel.rubric || panel.rubric.length === 0) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: panel "${panel.id}" declares no rubric; --evaluator-model has nothing to score. Ignoring.\n`,
+      );
+      evaluatorModelId = undefined;
+      evaluatorProviderId = undefined;
+    }
+  }
+
   // Compose the per-call routing: panel participants → their providers,
-  // plus the synthetic "baseline" and "judge" ids → judge provider.
+  // plus the synthetic "baseline", "judge", and "rubric-evaluator" ids.
   const providerByParticipant: Record<string, string> = {
     ...resolved.providerByParticipant,
     baseline: baselineProviderId,
@@ -323,11 +382,37 @@ export async function runBench(argv: readonly string[]): Promise<number> {
   if (config.judge) {
     providerByParticipant["judge"] = config.judge.providerId;
   }
+  if (evaluatorProviderId) {
+    providerByParticipant["rubric-evaluator"] = evaluatorProviderId;
+  }
   const caller: ModelCaller = createOpenAICompatibleCaller({
     providers: config.providers,
     providerByParticipant,
   });
 
+  // Held-out contract warnings — the bench will still run, but a reviewer
+  // reading the report needs to see "this comparison wasn't blind."
+  if (evaluatorModelId) {
+    if (evaluatorModelId === baselineModelId) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: ⚠ evaluator model == baseline model (${evaluatorModelId}). The evaluator is grading its own output. Results on the baseline side are NOT independent.\n`,
+      );
+    }
+    if (evaluatorModelId === config.judge?.modelId) {
+      process.stderr.write(
+        `${SERVER_NAME} bench: ⚠ evaluator model == judge model (${evaluatorModelId}). The evaluator is grading text synthesised by the same brain that produced the consensus output — eval is not held-out.\n`,
+      );
+    }
+  }
+  if (
+    baselineModelId === config.judge?.modelId &&
+    (!evaluatorModelId || evaluatorModelId === baselineModelId)
+  ) {
+    process.stderr.write(
+      `${SERVER_NAME} bench: ⚠ baseline and judge are the same model (${baselineModelId}). Consensus and baseline both flow through this brain; "consensus vs baseline" is a self-comparison artifact.\n`,
+    );
+  }
+
   // Load cases.
   let cases: BenchCase[];
   let caseFileName: string | undefined;
@@ -388,14 +473,22 @@ export async function runBench(argv: readonly string[]): Promise<number> {
   }
 
   const baseSeed = parsed.baseSeed ?? (parsed.quick ? QUICK_DEFAULT_SEED : Date.now());
-  const totalCalls = cases.length * parsed.runs * (resolved.participants.length + 1);
+  const perRunRubricCalls = evaluatorModelId ? 2 : 0;
+  const totalCalls =
+    cases.length * parsed.runs * (resolved.participants.length + 1 + perRunRubricCalls);
 
   if (!parsed.quiet) {
+    const evalSuffix = evaluatorModelId
+      ? `, evaluator=${evaluatorModelId} (${evaluatorProviderId})`
+      : "";
     process.stderr.write(
-      `${SERVER_NAME} bench: panel="${panel.id}", cases=${cases.length}, runs=${parsed.runs}, baseline=${baselineModelId} (${baselineProviderId}), seed=${baseSeed}${parsed.quick ? " (quick mode)" : ""}\n`,
+      `${SERVER_NAME} bench: panel="${panel.id}", cases=${cases.length}, runs=${parsed.runs}, baseline=${baselineModelId} (${baselineProviderId})${evalSuffix}, seed=${baseSeed}${parsed.quick ? " (quick mode)" : ""}\n`,
     );
+    const explainer = evaluatorModelId
+      ? "cases × runs × (panel + baseline + 2 rubric evals)"
+      : "cases × runs × (panel + baseline)";
     process.stderr.write(
-      `${SERVER_NAME} bench: expected ≈${totalCalls} provider calls (cases × runs × (panel + baseline)).\n`,
+      `${SERVER_NAME} bench: expected ≈${totalCalls} provider calls (${explainer}).\n`,
     );
   }
 
@@ -427,6 +520,7 @@ export async function runBench(argv: readonly string[]): Promise<number> {
       ...(progressHandler ? { onProgress: progressHandler } : {}),
       signal: ac.signal,
       ...(caseFileName ? { caseFileName } : {}),
+      ...(evaluatorModelId ? { evaluatorModelId } : {}),
     });
 
     const md = formatReportMarkdown(report);
diff --git a/src/presets/definitions/architecture-v2.ts b/src/presets/definitions/architecture-v2.ts
index e9b96ea..d1fcc59 100644
--- a/src/presets/definitions/architecture-v2.ts
+++ b/src/presets/definitions/architecture-v2.ts
@@ -120,6 +120,38 @@ export const ARCHITECTURE_V2_PRESET: Preset = {
     "",
     "Do not hedge by recommending two options. Pick one. State your confidence.",
   ].join("\n"),
+  rubric: [
+    {
+      id: "quantification",
+      description:
+        "Does the answer cite load-bearing constraints with units (ms, $/month, headcount, GB/day, QPS, percentiles), or explicitly name an unstated constraint with a proposed value? A 5/5 answer reads like an engineer with a spreadsheet; a 0/5 reads like a vibes essay.",
+      maxPoints: 5,
+    },
+    {
+      id: "single-recommendation",
+      description:
+        "Does the answer commit to a single architecture choice with a dominant reason, rather than hedging between two? A 5/5 answer names the recommended option in one sentence and names the next-best alternative only as the runner-up; a 0/5 presents a balanced menu and refuses to choose.",
+      maxPoints: 5,
+    },
+    {
+      id: "reversibility",
+      description:
+        "Does the answer explicitly weigh reversibility / switching cost — the cost of being wrong about this decision? A 5/5 answer treats reversibility as a first-class column with at least a low/medium/high rating per option and a switching cost estimate; a 0/5 ignores reversibility entirely.",
+      maxPoints: 5,
+    },
+    {
+      id: "tripwire-specificity",
+      description:
+        "Are the conditions that would flip the recommendation named as measurable signals with thresholds (e.g. 'write QPS sustains >5k for 24h', 'P99 latency exceeds 200ms for 1h'), not vague conditions ('if scale grows', 'if reliability becomes a concern')? A 5/5 answer has tripwires you could literally write a Prometheus alert against; a 0/5 has hand-waving.",
+      maxPoints: 5,
+    },
+    {
+      id: "failure-mode-realism",
+      description:
+        "Are failure modes named with concrete trigger conditions, blast radius, and detection latency — not generic risks? A 5/5 answer names specific failure modes a senior on-call engineer would recognise from incidents they've actually worked; a 0/5 lists abstract risks ('complexity', 'scaling issues') with no shape.",
+      maxPoints: 5,
+    },
+  ],
   meta: {
     version: "2.0.0",
     rationale: [
diff --git a/src/presets/types.ts b/src/presets/types.ts
index 2ef00b5..3e8ffe1 100644
--- a/src/presets/types.ts
+++ b/src/presets/types.ts
@@ -113,6 +113,29 @@ export interface PanelOutputShape {
   tags?: readonly string[];
 }
 
+/**
+ * One scoring dimension for a panel-declared quality rubric. Used by the
+ * benchmark's held-out LLM-as-judge evaluator to score consensus and
+ * baseline outputs against the same rubric, blind to which side produced
+ * which output.
+ *
+ * The rubric measures *answer quality* against named criteria — distinct
+ * from self-reported confidence (which is a meta-claim about the model's
+ * own certainty, not about the answer's substance).
+ */
+export interface RubricCriterion {
+  /** Stable kebab-case id; surfaces in JSON reports for downstream tools. */
+  id: string;
+  /**
+   * What this criterion measures and what a max-point answer looks like.
+   * The evaluator sees this verbatim, so write it as a directive for the
+   * model, not as end-user prose.
+   */
+  description: string;
+  /** Upper bound of the score for this criterion (typically 5). */
+  maxPoints: number;
+}
+
 /**
  * Optional metadata declaring a panel's purpose and output shape. v2+
  * panels populate this fully so MCP clients can introspect what a panel
@@ -165,6 +188,13 @@ export interface Preset {
   defaults: PresetDefaults;
   /** Optional task-specific judge system prompt. */
   judgeSystemPrompt?: string;
+  /**
+   * Optional quality rubric. When set, the bench can score consensus and
+   * baseline outputs against these criteria using a held-out evaluator
+   * model, so it measures answer quality rather than self-reported
+   * confidence.
+   */
+  rubric?: readonly RubricCriterion[];
   /** Phase 3 surface — empty/unused in Phase 1. */
   toolBindings?: readonly ToolBinding[];
   /**