feat: adversarial benchmark verification grid — prevent hallucinated metrics #330

@justrach

Description

Problem

When users run DevSwarm for benchmarking or performance-related tasks, swarm agents sometimes fabricate benchmark results. This is a known failure mode across LLM-based coding agents:

  • Agent claims "30% faster" without running any benchmark
  • Agent invents numbers that look plausible but were never measured
  • Agent runs a benchmark once, gets a fluke result, and reports it as fact
  • Agent cherry-picks a favorable run and ignores regressions
  • Agent writes a benchmark that doesn't actually exercise the code path it claims to test

This is especially dangerous because fake benchmarks look real. A human reviewer might trust a nicely formatted table of numbers that were entirely hallucinated. The swarm's synthesis step compounds this — it merges outputs from multiple workers, and if any worker hallucinated metrics, the synthesis inherits the lie.

Solution: Adversarial Verification Grid

Introduce a dedicated verification grid — a set of adversarial agents whose sole job is to challenge and independently reproduce any quantitative claims made by other grids.

How it works

When the swarm detects that a task involves benchmarks, performance claims, or quantitative metrics, it activates an adversarial grid:

┌─────────────────────────────────────────────────┐
│  WORK GRIDS (existing)                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │ Search   │  │ Review   │  │  Fix     │      │
│  │ Grid     │→ │ Grid     │→ │  Grid    │      │
│  └──────────┘  └──────────┘  └──────────┘      │
│        │              │             │            │
│        ▼              ▼             ▼            │
│  ┌─────────────────────────────────────────┐    │
│  │         Claims Collector                │    │
│  │  (extracts quantitative assertions)     │    │
│  └─────────────────────────────────────────┘    │
│                      │                           │
│                      ▼                           │
│  ┌─────────────────────────────────────────┐    │
│  │      ADVERSARIAL VERIFICATION GRID      │    │
│  │                                         │    │
│  │  ┌───────────┐  ┌───────────────────┐   │    │
│  │  │ Reproducer│  │ Skeptic           │   │    │
│  │  │ Agent     │  │ Agent             │   │    │
│  │  │           │  │                   │   │    │
│  │  │ Re-runs   │  │ Challenges:       │   │    │
│  │  │ benchmarks│  │ - Was it run?     │   │    │
│  │  │ from      │  │ - Methodology ok? │   │    │
│  │  │ scratch   │  │ - Stats valid?    │   │    │
│  │  │ N times   │  │ - Baseline exist? │   │    │
│  │  └───────────┘  └───────────────────┘   │    │
│  │                                         │    │
│  │  Verdict: VERIFIED / UNVERIFIED /       │    │
│  │           DISPUTED / FRAUD              │    │
│  └─────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘

Adversarial agents

1. Reproducer Agent (writable, sandboxed)

  • Takes any benchmark command from the work grid's output
  • Re-runs it independently N times (default 3)
  • Compares its results against the claimed numbers
  • Reports: actual mean, stddev, whether the claim is within tolerance
  • If no benchmark command was actually provided → flags as UNVERIFIED (agent likely hallucinated)
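
The Reproducer's comparison step can be sketched as follows. This is an illustrative Python sketch, not DevSwarm's implementation; `compare_claim` and the result field names are assumptions, with the ±15% tolerance taken from the verdict table in this proposal.

```python
import statistics

# ±15% tolerance, matching the VERIFIED threshold proposed below.
TOLERANCE = 0.15

def compare_claim(claimed: float, samples: list[float]) -> dict:
    """Compare a claimed measurement against N independent re-runs.

    `samples` are the Reproducer's own measurements of the same
    benchmark command; `claimed` is the number the work grid reported.
    """
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    within = abs(mean - claimed) <= TOLERANCE * abs(claimed)
    return {"mean": mean, "stddev": stddev, "within_tolerance": within}
```

Reporting the stddev alongside the mean is what lets the grid catch the "fluke result reported as fact" failure mode: a claim inside tolerance but with huge run-to-run variance is still suspect.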

2. Skeptic Agent (read-only)

  • Reviews the benchmark methodology, not the numbers
  • Checks:
    • Was a baseline measured before the change?
    • Is the benchmark actually exercising the claimed code path? (uses find_callers/blast_radius from codegraff)
    • Is the sample size sufficient?
    • Are there confounding variables (warm-up, GC, I/O)?
    • Does the benchmark code import/call the function it claims to test?
  • Reports methodology issues as structured findings
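
A "structured finding" could look like the following sketch. The field names and the `sample_size_check` helper are hypothetical, not an actual DevSwarm schema; the sketch only shows the shape a Skeptic check might emit.

```python
from dataclasses import dataclass

@dataclass
class MethodologyFinding:
    check: str    # e.g. "baseline", "code_path", "sample_size"
    passed: bool
    detail: str = ""

def sample_size_check(n_runs: int, minimum: int = 3) -> MethodologyFinding:
    """One example check: flag benchmarks run fewer than `minimum` times."""
    return MethodologyFinding(
        check="sample_size",
        passed=n_runs >= minimum,
        detail=f"{n_runs} run(s); minimum {minimum}",
    )
```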

Claims Collector

A lightweight parser that extracts quantitative assertions from agent output:

  • Regex patterns for benchmark-like claims: \d+(\.\d+)?[x%]?\s*(faster|slower|improvement|regression|speedup)
  • Table detection: any markdown table with numeric columns
  • Command detection: any shell command containing bench, perf, hyperfine, time, criterion
  • Each extracted claim becomes a verification task for the adversarial grid
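
The heuristics above can be sketched like this. The claim regex is the one quoted in this proposal; the function names and the substring-based command detection are illustrative assumptions, not the shipped parser.

```python
import re

# Regex for benchmark-like claims, as described in this proposal.
CLAIM_RE = re.compile(
    r"\d+(\.\d+)?[x%]?\s*(faster|slower|improvement|regression|speedup)",
    re.IGNORECASE,
)

# Commands *containing* these tokens count as benchmark commands.
BENCH_TOKENS = ("bench", "perf", "hyperfine", "time", "criterion")

def extract_claims(output: str) -> list[str]:
    """Return quantitative assertions found in agent output."""
    return [m.group(0) for m in CLAIM_RE.finditer(output)]

def has_benchmark_command(command: str) -> bool:
    return any(tok in command for tok in BENCH_TOKENS)
```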

Verdict system

Each claim gets a verdict:

| Verdict | Meaning | Action |
| --- | --- | --- |
| VERIFIED | Reproducer confirmed within ±15% tolerance | Include in synthesis with ✅ |
| UNVERIFIED | No benchmark command found, or the agent never actually ran it | Strip from synthesis, flag with ⚠️ |
| DISPUTED | Reproducer got significantly different numbers | Include both sets of numbers with ⚠️ |
| FRAUD | Benchmark code doesn't exercise the claimed path, or numbers are statistically impossible | Strip from synthesis, flag with 🚫 |
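
The classification rules can be expressed as a small decision function. This is a hedged sketch of the rules as stated here (FRAUD dominates, then UNVERIFIED, then the ±15% tolerance split); the function signature is an assumption.

```python
TOLERANCE = 0.15  # VERIFIED threshold from the table above

def verdict(claimed, measured_mean, benchmark_ran, exercises_claimed_path):
    """Classify one quantitative claim.

    claimed: the number the work grid reported (None if none was given)
    measured_mean: mean of the Reproducer's independent re-runs
    benchmark_ran: whether a real benchmark command was found and executed
    exercises_claimed_path: Skeptic's check that the benchmark actually
        calls the code it claims to test
    """
    if not exercises_claimed_path:
        return "FRAUD"
    if not benchmark_ran or claimed is None:
        return "UNVERIFIED"
    if abs(measured_mean - claimed) <= TOLERANCE * abs(claimed):
        return "VERIFIED"
    return "DISPUTED"
```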

Grid configuration

const VerificationGrid = Grid{
    .name = "adversarial-verify",
    .roles = &.{
        .{ .name = "reproducer",
           .system_prompt = REPRODUCER_PROMPT,
           .tool_allowlist = null,  // needs to run commands
           .model = "claude-sonnet-4-6",
           .max_tool_calls = 30 },  // benchmark runs take many calls
        .{ .name = "skeptic",
           .system_prompt = SKEPTIC_PROMPT,
           .tool_allowlist = &.{ "zigrep", "zigread", "find_callers", "blast_radius" },
           .model = "claude-sonnet-4-6",
           .max_tool_calls = 15 },
    },
    .policy = .writable,  // reproducer needs to run benchmarks
    .synthesis = .adversarial_merge,  // special merge that attaches verdicts
    .telemetry = .{ .track_verdicts = true },
};

Activation

The verification grid should activate automatically when:

  1. The user's task mentions benchmarks, performance, or metrics
  2. Any work grid agent output contains quantitative claims (detected by Claims Collector)
  3. The user explicitly requests verification (--verify flag)

It should be skippable (--no-verify) for speed when the user trusts the results.
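
The activation rules above amount to a short predicate. A minimal sketch, assuming a hypothetical keyword list and flag-set interface (the `--verify`/`--no-verify` flag names are from this proposal):

```python
# Hypothetical keyword list; the real trigger set is an open design question.
PERF_KEYWORDS = {"benchmark", "performance", "metrics", "faster", "latency"}

def should_verify(task: str, outputs: list[str], flags: set[str]) -> bool:
    """Decide whether to activate the adversarial verification grid."""
    if "--no-verify" in flags:      # explicit opt-out wins
        return False
    if "--verify" in flags:         # explicit opt-in wins
        return True
    if any(k in task.lower() for k in PERF_KEYWORDS):
        return True
    # Fall back to Claims Collector-style detection on worker output.
    return any("faster" in o or "speedup" in o for o in outputs)
```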

Why this needs the Grid abstraction (see #321)

This proposal depends on the grid system from #321's telemetry issue:

  • Grid isolation: the adversarial grid must have its own permission set, resource budget, and synthesis strategy
  • Grid-to-grid handoff: work grid outputs feed into the Claims Collector, which creates tasks for the verification grid
  • Grid-level telemetry: we need to track verification overhead (how much extra time/cost does verification add?)
  • Grid ordering: verification grid runs AFTER work grids complete, not in parallel

This makes the grid abstraction P0 — it's not just an organizational nicety, it's required for adversarial verification to work.

Implementation Tasks

  1. Claims Collector — regex + heuristic parser that extracts quantitative assertions from agent output
  2. Reproducer Agent prompt — system prompt that instructs the agent to independently re-run benchmarks and compare results
  3. Skeptic Agent prompt — system prompt focused on methodology review, leveraging codegraff's find_callers/blast_radius
  4. Verdict system — VERIFIED/UNVERIFIED/DISPUTED/FRAUD classification with tolerance thresholds
  5. Grid abstraction — the Grid struct with isolation, handoff protocol, and ordering (shared with feat: swarm telemetry & observability system + grid abstraction #321)
  6. Synthesis integration — adversarial verdicts attached to claims in the final synthesis output
  7. Auto-activation — keyword detection on task + output to trigger verification grid

Acceptance

  • Quantitative claims in agent output are automatically detected
  • Reproducer agent independently re-runs benchmark commands and compares results
  • Skeptic agent validates methodology (code path coverage, baseline existence, sample size)
  • Each claim receives a VERIFIED/UNVERIFIED/DISPUTED/FRAUD verdict
  • Unverified and fraudulent claims are stripped or flagged in synthesis output
  • Verification grid activates automatically for benchmark-related tasks
  • --no-verify flag to skip for speed
  • Grid abstraction supports ordered execution (work grids → verification grid)

Related

  • feat: swarm telemetry & observability system + grid abstraction #321 — Swarm telemetry & observability (grid abstraction is shared infrastructure)
  • Better CodeDB doc: Proposal 8 (Agent-to-Agent Review Chains) — adversarial verification is a specialized review chain
  • Better CodeDB doc: Proposal 10 (Swarm Role Registry) — reproducer and skeptic are new built-in roles
