feat: adversarial benchmark verification grid — prevent hallucinated metrics #330

@justrach

Description

Problem

When users run DevSwarm for benchmarking or performance-related tasks, swarm agents sometimes fabricate benchmark results. This is a known failure mode across LLM-based coding agents:

  • Agent claims "30% faster" without running any benchmark
  • Agent invents numbers that look plausible but were never measured
  • Agent runs a benchmark once, gets a fluke result, and reports it as fact
  • Agent cherry-picks a favorable run and ignores regressions
  • Agent writes a benchmark that doesn't actually exercise the code path it claims to test

This is especially dangerous because fake benchmarks look real. A human reviewer might trust a nicely formatted table of numbers that were entirely hallucinated. The swarm's synthesis step compounds this — it merges outputs from multiple workers, and if any worker hallucinated metrics, the synthesis inherits the lie.

Solution: Adversarial Verification Grid

Introduce a dedicated verification grid — a set of adversarial agents whose sole job is to challenge and independently reproduce any quantitative claims made by other grids.

How it works

When the swarm detects that a task involves benchmarks, performance claims, or quantitative metrics, it activates an adversarial grid:

┌─────────────────────────────────────────────────┐
│  WORK GRIDS (existing)                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │ Search   │  │ Review   │  │  Fix     │      │
│  │ Grid     │→ │ Grid     │→ │  Grid    │      │
│  └──────────┘  └──────────┘  └──────────┘      │
│        │              │             │            │
│        ▼              ▼             ▼            │
│  ┌─────────────────────────────────────────┐    │
│  │         Claims Collector                │    │
│  │  (extracts quantitative assertions)     │    │
│  └─────────────────────────────────────────┘    │
│                      │                           │
│                      ▼                           │
│  ┌─────────────────────────────────────────┐    │
│  │      ADVERSARIAL VERIFICATION GRID      │    │
│  │                                         │    │
│  │  ┌───────────┐  ┌───────────────────┐   │    │
│  │  │ Reproducer│  │ Skeptic           │   │    │
│  │  │ Agent     │  │ Agent             │   │    │
│  │  │           │  │                   │   │    │
│  │  │ Re-runs   │  │ Challenges:       │   │    │
│  │  │ benchmarks│  │ - Was it run?     │   │    │
│  │  │ from      │  │ - Methodology ok? │   │    │
│  │  │ scratch   │  │ - Stats valid?    │   │    │
│  │  │ N times   │  │ - Baseline exist? │   │    │
│  │  └───────────┘  └───────────────────┘   │    │
│  │                                         │    │
│  │  Verdict: VERIFIED / UNVERIFIED /       │    │
│  │           DISPUTED / FRAUD              │    │
│  └─────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘

Adversarial agents

1. Reproducer Agent (writable, sandboxed)

  • Takes any benchmark command from the work grid's output
  • Re-runs it independently N times (default 3)
  • Compares its results against the claimed numbers
  • Reports: actual mean, stddev, whether the claim is within tolerance
  • If no benchmark command was actually provided → flags as UNVERIFIED (agent likely hallucinated)
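
The Reproducer's comparison step can be sketched as follows. This is an illustrative Python sketch, not DevSwarm's implementation; `compare_claim` and the result field names are assumptions, with the ±15% tolerance taken from the verdict table in this proposal.

```python
import statistics

# ±15% tolerance, matching the VERIFIED threshold proposed below.
TOLERANCE = 0.15

def compare_claim(claimed: float, samples: list[float]) -> dict:
    """Compare a claimed measurement against N independent re-runs.

    `samples` are the Reproducer's own measurements of the same
    benchmark command; `claimed` is the number the work grid reported.
    """
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    within = abs(mean - claimed) <= TOLERANCE * abs(claimed)
    return {"mean": mean, "stddev": stddev, "within_tolerance": within}
```

Reporting the stddev alongside the mean is what lets the grid catch the "fluke result reported as fact" failure mode: a claim inside tolerance but with huge run-to-run variance is still suspect.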

2. Skeptic Agent (read-only)

  • Reviews the benchmark methodology, not the numbers
  • Checks:
    • Was a baseline measured before the change?
    • Is the benchmark actually exercising the claimed code path? (uses find_callers/blast_radius from codegraff)
    • Is the sample size sufficient?
    • Are there confounding variables (warm-up, GC, I/O)?
    • Does the benchmark code import/call the function it claims to test?
  • Reports methodology issues as structured findings
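
A "structured finding" could look like the following sketch. The field names and the `sample_size_check` helper are hypothetical, not an actual DevSwarm schema; the sketch only shows the shape a Skeptic check might emit.

```python
from dataclasses import dataclass

@dataclass
class MethodologyFinding:
    check: str    # e.g. "baseline", "code_path", "sample_size"
    passed: bool
    detail: str = ""

def sample_size_check(n_runs: int, minimum: int = 3) -> MethodologyFinding:
    """One example check: flag benchmarks run fewer than `minimum` times."""
    return MethodologyFinding(
        check="sample_size",
        passed=n_runs >= minimum,
        detail=f"{n_runs} run(s); minimum {minimum}",
    )
```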

Claims Collector

A lightweight parser that extracts quantitative assertions from agent output:

  • Regex patterns for benchmark-like claims: \d+(\.\d+)?[x%]?\s*(faster|slower|improvement|regression|speedup)
  • Table detection: any markdown table with numeric columns
  • Command detection: any shell command containing bench, perf, hyperfine, time, criterion
  • Each extracted claim becomes a verification task for the adversarial grid
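
The heuristics above can be sketched like this. The claim regex is the one quoted in this proposal; the function names and the substring-based command detection are illustrative assumptions, not the shipped parser.

```python
import re

# Regex for benchmark-like claims, as described in this proposal.
CLAIM_RE = re.compile(
    r"\d+(\.\d+)?[x%]?\s*(faster|slower|improvement|regression|speedup)",
    re.IGNORECASE,
)

# Commands *containing* these tokens count as benchmark commands.
BENCH_TOKENS = ("bench", "perf", "hyperfine", "time", "criterion")

def extract_claims(output: str) -> list[str]:
    """Return quantitative assertions found in agent output."""
    return [m.group(0) for m in CLAIM_RE.finditer(output)]

def has_benchmark_command(command: str) -> bool:
    return any(tok in command for tok in BENCH_TOKENS)
```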

Verdict system

Each claim gets a verdict:

| Verdict | Meaning | Action |
| --- | --- | --- |
| VERIFIED | Reproducer confirmed within ±15% tolerance | Include in synthesis with ✅ |
| UNVERIFIED | No benchmark command found, or the agent never actually ran it | Strip from synthesis, flag with ⚠️ |
| DISPUTED | Reproducer got significantly different numbers | Include both sets of numbers with ⚠️ |
| FRAUD | Benchmark code doesn't exercise the claimed path, or numbers are statistically impossible | Strip from synthesis, flag with 🚫 |
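
The classification rules can be expressed as a small decision function. This is a hedged sketch of the rules as stated here (FRAUD dominates, then UNVERIFIED, then the ±15% tolerance split); the function signature is an assumption.

```python
TOLERANCE = 0.15  # VERIFIED threshold from the table above

def verdict(claimed, measured_mean, benchmark_ran, exercises_claimed_path):
    """Classify one quantitative claim.

    claimed: the number the work grid reported (None if none was given)
    measured_mean: mean of the Reproducer's independent re-runs
    benchmark_ran: whether a real benchmark command was found and executed
    exercises_claimed_path: Skeptic's check that the benchmark actually
        calls the code it claims to test
    """
    if not exercises_claimed_path:
        return "FRAUD"
    if not benchmark_ran or claimed is None:
        return "UNVERIFIED"
    if abs(measured_mean - claimed) <= TOLERANCE * abs(claimed):
        return "VERIFIED"
    return "DISPUTED"
```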

Grid configuration

const VerificationGrid = Grid{
    .name = "adversarial-verify",
    .roles = &.{
        .{ .name = "reproducer",
           .system_prompt = REPRODUCER_PROMPT,
           .tool_allowlist = null,  // needs to run commands
           .model = "claude-sonnet-4-6",
           .max_tool_calls = 30 },  // benchmark runs take many calls
        .{ .name = "skeptic",
           .system_prompt = SKEPTIC_PROMPT,
           .tool_allowlist = &.{ "zigrep", "zigread", "find_callers", "blast_radius" },
           .model = "claude-sonnet-4-6",
           .max_tool_calls = 15 },
    },
    .policy = .writable,  // reproducer needs to run benchmarks
    .synthesis = .adversarial_merge,  // special merge that attaches verdicts
    .telemetry = .{ .track_verdicts = true },
};

Activation

The verification grid should activate automatically when:

  1. The user's task mentions benchmarks, performance, or metrics
  2. Any work grid agent output contains quantitative claims (detected by Claims Collector)
  3. The user explicitly requests verification (--verify flag)

It should be skippable (--no-verify) for speed when the user trusts the results.
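
The activation rules above amount to a short predicate. A minimal sketch, assuming a hypothetical keyword list and flag-set interface (the `--verify`/`--no-verify` flag names are from this proposal):

```python
# Hypothetical keyword list; the real trigger set is an open design question.
PERF_KEYWORDS = {"benchmark", "performance", "metrics", "faster", "latency"}

def should_verify(task: str, outputs: list[str], flags: set[str]) -> bool:
    """Decide whether to activate the adversarial verification grid."""
    if "--no-verify" in flags:      # explicit opt-out wins
        return False
    if "--verify" in flags:         # explicit opt-in wins
        return True
    if any(k in task.lower() for k in PERF_KEYWORDS):
        return True
    # Fall back to Claims Collector-style detection on worker output.
    return any("faster" in o or "speedup" in o for o in outputs)
```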

Why this needs the Grid abstraction (see #321)

This proposal depends on the grid system from #321's telemetry issue:

  • Grid isolation: the adversarial grid must have its own permission set, resource budget, and synthesis strategy
  • Grid-to-grid handoff: work grid outputs feed into the Claims Collector, which creates tasks for the verification grid
  • Grid-level telemetry: we need to track verification overhead (how much extra time/cost does verification add?)
  • Grid ordering: verification grid runs AFTER work grids complete, not in parallel

This makes the grid abstraction P0 — it's not just an organizational nicety, it's required for adversarial verification to work.

Implementation Tasks

  1. Claims Collector — regex + heuristic parser that extracts quantitative assertions from agent output
  2. Reproducer Agent prompt — system prompt that instructs the agent to independently re-run benchmarks and compare results
  3. Skeptic Agent prompt — system prompt focused on methodology review, leveraging codegraff's find_callers/blast_radius
  4. Verdict system — VERIFIED/UNVERIFIED/DISPUTED/FRAUD classification with tolerance thresholds
  5. Grid abstraction — the Grid struct with isolation, handoff protocol, and ordering (shared with feat: swarm telemetry & observability system + grid abstraction #321)
  6. Synthesis integration — adversarial verdicts attached to claims in the final synthesis output
  7. Auto-activation — keyword detection on task + output to trigger verification grid

Acceptance

  • Quantitative claims in agent output are automatically detected
  • Reproducer agent independently re-runs benchmark commands and compares results
  • Skeptic agent validates methodology (code path coverage, baseline existence, sample size)
  • Each claim receives a VERIFIED/UNVERIFIED/DISPUTED/FRAUD verdict
  • Unverified and fraudulent claims are stripped or flagged in synthesis output
  • Verification grid activates automatically for benchmark-related tasks
  • --no-verify flag to skip for speed
  • Grid abstraction supports ordered execution (work grids → verification grid)

Related

  • feat: swarm telemetry & observability system + grid abstraction #321 — Swarm telemetry & observability (grid abstraction is shared infrastructure)
  • Better CodeDB doc: Proposal 8 (Agent-to-Agent Review Chains) — adversarial verification is a specialized review chain
  • Better CodeDB doc: Proposal 10 (Swarm Role Registry) — reproducer and skeptic are new built-in roles
