This evaluation mirrors the probe and runtime sweeps for reward-system judges. The harness replays captured session trajectories, swaps judge pairings, and reports how each combination scores, escalates, and performs. Use it to validate default reward models and justify alternates before rolling them into production configs.
- Compare small/large judge stacks against the dataset that ships with `configs/examples/openai_agent.yaml`.
- Track average reward, variance, escalation frequency, latency, and correlation with a baseline judge.
- Produce reproducible JSON artifacts and summary tables aligned with probe (`docs/probe_eval.md`) and runtime (`docs/runtime_eval.md`) reports.
- Location: `atlas/data/reward_eval_trajectories.jsonl`
- Shape: newline-delimited JSON containing a README comment followed by full `SessionTrajectory` payloads. Each object includes `task`, `final_answer`, `plan`, `steps`, `execution_mode`, `teacher_intervened`, `adaptive_summary`, `session_metadata`, `trajectory_type`, and an optional `focus_prompt`.
- Composition: Start with the curated seed trajectories (paired, auto, coach) included in the repo, then append fresh captures that contain full `final_answer` text and step outputs. The capture tool automatically skips blank runs so the dataset only contains usable trajectories.
- Expansion: Use `scripts/collect_reward_trajectories.py` to record new trajectories without editing runtime code. Run multiple passes (optionally with different task datasets or repeats) until at least 30 trajectories are captured; the script appends to the JSONL file while preserving the README header.

  ```bash
  python -m scripts.collect_reward_trajectories \
    --tasks atlas/data/synthetic_runtime_tasks.jsonl \
    --output atlas/data/reward_eval_trajectories.jsonl \
    --limit 30 \
    --repeats 2 \
    --shuffle \
    --append \
    --config configs/examples/openai_agent.yaml
  ```
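Before running a benchmark, it can help to sanity-check the captured dataset. The sketch below is a hypothetical helper (not part of the harness) that assumes the README line is a `#`-prefixed comment and validates the field names listed above:

```python
import json
from pathlib import Path

# Field names from the dataset description; focus_prompt is optional, so it
# is deliberately excluded from the required set.
REQUIRED_FIELDS = {
    "task", "final_answer", "plan", "steps",
    "execution_mode", "teacher_intervened",
    "adaptive_summary", "session_metadata", "trajectory_type",
}

def load_trajectories(path):
    """Yield parsed SessionTrajectory payloads, skipping the README comment line."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # README/comment lines are not payloads
        payload = json.loads(line)
        missing = REQUIRED_FIELDS - payload.keys()
        if missing:
            raise ValueError(f"trajectory missing fields: {sorted(missing)}")
        yield payload
```

A quick `len(list(load_trajectories(...)))` confirms you have reached the 30-trajectory minimum before benchmarking.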
| ID | Small Judge (provider) | Large Judge (provider) | Notes |
|---|---|---|---|
| `gemini_pair` | `gemini/gemini-2.5-flash` (Gemini) | `gemini/gemini-2.5-pro` (Gemini) | Current default shipped in `openai_agent.yaml`; used as baseline. |
| `claude_stack` | `claude-haiku-4-5` (Anthropic) | `claude-sonnet-4-5-20250929` (Anthropic) | Claude-only stack for provider redundancy. |
| `gpt5_stack` | `gpt-5-mini` (OpenAI) | `gpt-5` (OpenAI) | Evaluates OpenAI's small/large pairing for parity with runtime options. |
| `grok_stack` | `xai/grok-4-fast` (xAI) | `xai/grok-4` (xAI) | Tests xAI's fast vs. larger Grok judges without cross-provider escalation. |
Configuration-first: the defaults above live in `configs/eval/reward/models.yaml`. Add new combinations or swap models by editing that file; no orchestrator code changes are required. `scripts/benchmark_reward_models.py` loads the file automatically when present and falls back to the baked-in defaults if it is missing.
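The file might look roughly like the following. The key names (`combos`, `small`, `large`) are an assumption based on the combo table above, not the harness's actual schema, so check the shipped file before editing:

```yaml
# Hypothetical layout for configs/eval/reward/models.yaml.
combos:
  gemini_pair:
    small: gemini/gemini-2.5-flash
    large: gemini/gemini-2.5-pro
  claude_stack:
    small: claude-haiku-4-5
    large: claude-sonnet-4-5-20250929
```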
- Ensure the dataset is committed and up to date (`atlas/data/reward_eval_trajectories.jsonl`).
- Export required API keys (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, `XAI_API_KEY`, etc.) and optionally store them in `.env`. The harness loads `.env` automatically via `load_dotenv_if_available()`.
- Run the evaluator harness:

  ```bash
  python -m scripts.benchmark_reward_models \
    --dataset atlas/data/reward_eval_trajectories.jsonl \
    --output results/reward/eval_gemini_claude.json \
    --repeats 1 \
    --concurrency 1 \
    --collect-audit
  ```
- Inspect the console table for high-level metrics and review the JSON artifact for per-run details. When you are ready to compare every judge pairing in one pass, run:
```bash
python -m scripts.benchmark_reward_models \
  --judge-combos gemini_pair claude_stack gpt5_stack grok_stack \
  --dataset atlas/data/reward_eval_trajectories.jsonl \
  --output results/reward/latest.json \
  --repeats 1 \
  --concurrency 1 \
  --collect-audit
```
Key command options:
- `--judge-combos`: space-separated combo IDs to evaluate (defaults to all built-ins).
- `--baseline`: combo ID used for deltas and correlation (defaults to `gemini_pair`).
- `--repeats`: number of passes over the dataset (helps quantify variance).
- `--concurrency`: maximum concurrent evaluations per combo.
- `--collect-audit`: capture minimal prompt/response context for debugging judge behaviour.
- `--markdown-output`: override the Markdown summary path (defaults to `results/reward/<timestamp>.md`).
Each run writes:

- A Markdown summary table to `results/reward/<timestamp>.md` (or the path you pass via `--markdown-output`).
- A JSON artifact when `--output` is provided.

Both outputs surface the same metrics:
- Reward score statistics: mean and standard deviation across successful runs.
- Uncertainty: average model-reported uncertainty.
- Escalation rate: fraction of runs that triggered arbiter escalation.
- Latency: mean, median, and p95 wall-clock latency (milliseconds).
- Failures: count of judge requests that raised exceptions.
- Baseline agreement: delta statistics, Pearson correlation, and fraction of runs within ±0.02 reward of the baseline pair.
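To make the baseline-agreement metrics concrete, here is an illustrative computation over two aligned lists of reward scores. This is a sketch of the definitions above, not the harness's actual implementation:

```python
from statistics import mean, stdev

def baseline_agreement(candidate, baseline, tolerance=0.02):
    """Delta stats, Pearson correlation, and fraction within +/- tolerance."""
    deltas = [c - b for c, b in zip(candidate, baseline)]
    mc, mb = mean(candidate), mean(baseline)
    # Pearson r computed directly from its definition.
    cov = sum((c - mc) * (b - mb) for c, b in zip(candidate, baseline))
    denom = (sum((c - mc) ** 2 for c in candidate)
             * sum((b - mb) ** 2 for b in baseline)) ** 0.5
    return {
        "delta_mean": mean(deltas),
        "delta_stdev": stdev(deltas) if len(deltas) > 1 else 0.0,
        "pearson_r": cov / denom if denom else float("nan"),
        "within_tolerance": sum(abs(d) <= tolerance for d in deltas) / len(deltas),
    }
```

A combo with `pearson_r` near 1.0 and a high `within_tolerance` fraction tracks the baseline closely even if its absolute scores drift.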
Tip: divide `latency_*_ms` by 1,000 when reporting seconds.
The CLI emits a Markdown report for every run (see results/reward/). Import the latest generated table into your release notes or dashboards instead of maintaining static values here.
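If you feed the JSON artifact into a dashboard instead, a small reader is enough to rank combos. The snippet below assumes a top-level `combos` map keyed by combo ID with a `reward_mean` field per entry; that shape is an assumption, so adjust it to the actual artifact schema:

```python
import json

def summarize(path):
    """Return (combo_id, stats) pairs sorted by mean reward, best first."""
    with open(path) as f:
        report = json.load(f)
    combos = report.get("combos", {})  # assumed key; verify against a real artifact
    return sorted(combos.items(),
                  key=lambda kv: kv[1].get("reward_mean", 0.0),
                  reverse=True)
```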