## Problem

When users run DevSwarm for benchmarking or performance-related tasks, swarm agents fabricate benchmark results. This is a known failure mode across all LLM-based coding agents:

- Agent claims "30% faster" without running any benchmark
- Agent invents numbers that look plausible but were never measured
- Agent runs a benchmark once, gets a fluke result, and reports it as fact
- Agent cherry-picks a favorable run and ignores regressions
- Agent writes a benchmark that doesn't actually exercise the code path it claims to test
This is especially dangerous because fake benchmarks look real. A human reviewer might trust a nicely formatted table of numbers that were entirely hallucinated. The swarm's synthesis step compounds this — it merges outputs from multiple workers, and if any worker hallucinated metrics, the synthesis inherits the lie.
## Solution: Adversarial Verification Grid
Introduce a dedicated verification grid — a set of adversarial agents whose sole job is to challenge and independently reproduce any quantitative claims made by other grids.
### How it works
When the swarm detects that a task involves benchmarks, performance claims, or quantitative metrics, it activates an adversarial grid:
```
┌─────────────────────────────────────────────────┐
│              WORK GRIDS (existing)              │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│  │  Search  │   │  Review  │   │   Fix    │     │
│  │   Grid   │ → │   Grid   │ → │   Grid   │     │
│  └──────────┘   └──────────┘   └──────────┘     │
│       │             │              │            │
│       ▼             ▼              ▼            │
│  ┌─────────────────────────────────────────┐    │
│  │            Claims Collector             │    │
│  │   (extracts quantitative assertions)    │    │
│  └─────────────────────────────────────────┘    │
│                       │                         │
│                       ▼                         │
│  ┌─────────────────────────────────────────┐    │
│  │      ADVERSARIAL VERIFICATION GRID      │    │
│  │                                         │    │
│  │  ┌───────────┐   ┌───────────────────┐  │    │
│  │  │ Reproducer│   │      Skeptic      │  │    │
│  │  │   Agent   │   │       Agent       │  │    │
│  │  │           │   │                   │  │    │
│  │  │ Re-runs   │   │ Challenges:       │  │    │
│  │  │ benchmarks│   │ - Was it run?     │  │    │
│  │  │ from      │   │ - Methodology ok? │  │    │
│  │  │ scratch   │   │ - Stats valid?    │  │    │
│  │  │ N times   │   │ - Baseline exist? │  │    │
│  │  └───────────┘   └───────────────────┘  │    │
│  │                                         │    │
│  │ Verdict: VERIFIED / UNVERIFIED / FRAUD  │    │
│  └─────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘
```
### Adversarial agents
**1. Reproducer Agent** (writable, sandboxed)

- Takes any benchmark command from the work grid's output
- Re-runs it independently N times (default 3)
- Compares its results against the claimed numbers
- Reports: actual mean, stddev, whether the claim is within tolerance
- If no benchmark command was actually provided → flags as UNVERIFIED (agent likely hallucinated)
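The Reproducer's core loop can be sketched as below. This is an illustrative sketch in Python (DevSwarm itself is Zig); the function name, signature, and the naive wall-clock timing are assumptions, and real runs would happen inside the sandbox rather than via a plain shell call.

```python
import statistics
import subprocess
import time

def reproduce(cmd: str, claimed_seconds: float, runs: int = 3,
              tolerance: float = 0.15) -> dict:
    """Re-run `cmd` independently N times and compare against the claim.

    Hypothetical helper: times each run with a wall clock, then checks
    whether the claimed number falls within ±tolerance of the observed mean.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, shell=True, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples) if runs > 1 else 0.0
    return {
        "mean": mean,
        "stddev": stddev,
        # claim is accepted only if it is within ±tolerance of our mean
        "within_tolerance": abs(claimed_seconds - mean) / mean <= tolerance,
    }
```

A production version would pin CPU affinity, discard warm-up runs, and parse the benchmark tool's own output instead of timing the whole process.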
**2. Skeptic Agent** (read-only)

- Reviews the benchmark methodology, not the numbers
- Checks:
  - Was a baseline measured before the change?
  - Is the benchmark actually exercising the claimed code path? (uses `find_callers`/`blast_radius` from codegraff)
  - Is the sample size sufficient?
  - Are there confounding variables (warm-up, GC, I/O)?
  - Does the benchmark code import/call the function it claims to test?
- Reports methodology issues as structured findings
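One possible shape for those structured findings, sketched in Python for illustration — the class and field names are assumptions, not a real DevSwarm schema:

```python
from dataclasses import dataclass, field

@dataclass
class MethodologyFinding:
    check: str      # which checklist item failed, e.g. "baseline"
    severity: str   # "warn" (suspicious) or "fail" (disqualifying)
    detail: str     # human-readable explanation

@dataclass
class SkepticReport:
    claim_id: str
    findings: list[MethodologyFinding] = field(default_factory=list)

    def flag(self, check: str, severity: str, detail: str) -> None:
        self.findings.append(MethodologyFinding(check, severity, detail))

    @property
    def passed(self) -> bool:
        # methodology passes only if no disqualifying finding was recorded
        return not any(f.severity == "fail" for f in self.findings)
```

Keeping findings structured (rather than free prose) lets the synthesis step attach them mechanically to the claims they dispute.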
### Claims Collector

A lightweight parser that extracts quantitative assertions from agent output:

- Regex patterns for benchmark-like claims: `\d+(\.\d+)?[x%]?\s*(faster|slower|improvement|regression|speedup)`
- Table detection: any markdown table with numeric columns
- Command detection: any shell command containing `bench`, `perf`, `hyperfine`, `time`, `criterion`
- Each extracted claim becomes a verification task for the adversarial grid
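A minimal sketch of the collector, using the regex from the proposal. The function name, the claim dict shape, and the `$`-prefix heuristic for spotting shell commands are illustrative assumptions:

```python
import re

# Claim pattern taken verbatim from the proposal.
CLAIM_RE = re.compile(r"\d+(\.\d+)?[x%]?\s*(faster|slower|improvement|regression|speedup)")
BENCH_CMDS = ("bench", "perf", "hyperfine", "time", "criterion")

def extract_claims(output: str) -> list[dict]:
    """Return one verification task per quantitative assertion found."""
    claims = []
    for i, line in enumerate(output.splitlines()):
        # benchmark-like claims ("1.3x faster", "12% improvement", ...)
        for m in CLAIM_RE.finditer(line):
            claims.append({"kind": "metric", "text": m.group(0), "line": i})
        # markdown table row whose cells contain numbers
        if line.strip().startswith("|") and re.search(r"\|\s*\d", line):
            claims.append({"kind": "table_row", "text": line.strip(), "line": i})
        # shell command mentioning a benchmark tool (assumes "$ cmd" transcripts)
        if line.lstrip().startswith("$") and any(c in line for c in BENCH_CMDS):
            claims.append({"kind": "command", "text": line.strip(), "line": i})
    return claims
```

Note that substring matching on `time` will over-trigger; a real collector would tokenize the command line first.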
### Verdict system

Each claim gets a verdict:

| Verdict | Meaning | Action |
|---|---|---|
| VERIFIED | Reproducer confirmed within ±15% tolerance | Include in synthesis with ✅ |
| UNVERIFIED | No benchmark command found, or agent didn't actually run it | Strip from synthesis, flag with ⚠️ |
| DISPUTED | Reproducer got significantly different numbers | Include both sets of numbers with ⚠️ |
| FRAUD | Benchmark code doesn't exercise the claimed path, or numbers are statistically impossible | Strip from synthesis, flag with 🚫 |
### Grid configuration

```zig
const VerificationGrid = Grid{
    .name = "adversarial-verify",
    .roles = &.{
        .{ .name = "reproducer",
           .system_prompt = REPRODUCER_PROMPT,
           .tool_allowlist = null, // needs to run commands
           .model = "claude-sonnet-4-6",
           .max_tool_calls = 30 }, // benchmark runs take many calls
        .{ .name = "skeptic",
           .system_prompt = SKEPTIC_PROMPT,
           .tool_allowlist = &.{ "zigrep", "zigread", "find_callers", "blast_radius" },
           .model = "claude-sonnet-4-6",
           .max_tool_calls = 15 },
    },
    .policy = .writable, // reproducer needs to run benchmarks
    .synthesis = .adversarial_merge, // special merge that attaches verdicts
    .telemetry = .{ .track_verdicts = true },
};
```
### Activation
The verification grid should activate automatically when:

- The user's task mentions benchmarks, performance, or metrics
- Any work grid agent output contains quantitative claims (detected by Claims Collector)
- The user explicitly requests verification (`--verify` flag)

It should be skippable (`--no-verify`) for speed when the user trusts the results.
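The activation rules above can be sketched as a small predicate. The keyword list and function name are illustrative assumptions; the flag names come from the proposal, and the claim regex is reused from the Claims Collector:

```python
import re

# hypothetical keyword trigger for performance-flavored tasks
PERF_KEYWORDS = re.compile(r"\b(benchmark|performance|perf|latency|throughput|metrics?)\b", re.I)
CLAIM_RE = re.compile(r"\d+(\.\d+)?[x%]?\s*(faster|slower|improvement|regression|speedup)")

def should_verify(task: str, worker_outputs: list[str], flags: set[str]) -> bool:
    if "--no-verify" in flags:
        return False                 # user explicitly opted out
    if "--verify" in flags:
        return True                  # user explicitly opted in
    if PERF_KEYWORDS.search(task):
        return True                  # task mentions performance
    # fall back to the Claims Collector heuristic on worker output
    return any(CLAIM_RE.search(out) for out in worker_outputs)
```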
## Why this needs the Grid abstraction (see #321)

This proposal depends on the grid system from #321's telemetry issue:

- **Grid isolation**: the adversarial grid must have its own permission set, resource budget, and synthesis strategy
- **Grid-to-grid handoff**: work grid outputs feed into the Claims Collector, which creates tasks for the verification grid
- **Grid-level telemetry**: we need to track verification overhead (how much extra time/cost does verification add?)
- **Grid ordering**: verification grid runs AFTER work grids complete, not in parallel

This makes the grid abstraction P0 — it's not just an organizational nicety, it's required for adversarial verification to work.
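The ordering and handoff constraints imply a pipeline roughly like the following. This is a hedged sketch: `run_pipeline` and the grid/collector interfaces are hypothetical stand-ins, not the Grid API from #321.

```python
def run_pipeline(work_grids, claims_collector, verification_grid, task):
    """Run work grids, collect claims from their outputs, then verify.

    Enforces the ordering constraint: the verification grid starts only
    AFTER all work grids have completed.
    """
    outputs = [grid.run(task) for grid in work_grids]
    # grid-to-grid handoff: collector turns outputs into verification tasks
    claims = [c for out in outputs for c in claims_collector(out)]
    verdicts = verification_grid.run(claims) if claims else []
    return outputs, verdicts
```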
## Implementation Tasks

- Claims Collector — regex + heuristic parser that extracts quantitative assertions from agent output
- Reproducer Agent prompt — system prompt that instructs the agent to independently re-run benchmarks and compare results
- Skeptic Agent prompt — system prompt focused on methodology review, leveraging codegraff's `find_callers`/`blast_radius`
- Verdict system — VERIFIED/UNVERIFIED/DISPUTED/FRAUD classification with tolerance thresholds
- Grid abstraction — the `Grid` struct with isolation, handoff protocol, and ordering (shared with #321)
- Synthesis integration — adversarial verdicts attached to claims in the final synthesis output
- Auto-activation — keyword detection on task + output to trigger verification grid
## Acceptance

- `--no-verify` flag to skip verification for speed
## Related

- #321 — swarm telemetry & observability system + grid abstraction (grid abstraction is shared infrastructure)
- Better CodeDB doc: Proposal 8 (Agent-to-Agent Review Chains) — adversarial verification is a specialized review chain
- Better CodeDB doc: Proposal 10 (Swarm Role Registry) — reproducer and skeptic are new built-in roles