
feat: experiment-based result layout and read-only Studio mode #972

@christso


Objective

Public leaderboard that reuses the existing Studio result artifact format (JSONL + benchmark.json). Benchmark-agnostic, aligned with AgentV's positioning as a general-purpose eval framework.

Supersedes the implementation approach in #966 (the objective from #966 remains the same).

Why change the approach

PR #970 (implementing #966) introduced a custom result.schema.json with SWE-bench-specific fields (instance_id, resolved, dataset: "swe-bench-lite"). This creates problems:

  1. Tied to SWE-bench — AgentV evaluates all kinds of agents, not just coding agents. A new schema would be needed for every benchmark type.
  2. Parallel data format — Studio already reads index.jsonl + benchmark.json + timing.json from .agentv/results/. Inventing a second format means maintaining two schemas and two rendering paths.
  3. Duplicated UI work — Studio already has sortable tables, experiment comparison, and drill-down views. Building a separate Astro component reimplements this.

Design

Architecture

  • Separate benchmark results repo (e.g., EntityProcess/agentv-benchmarks) — contains only result artifacts, no source code. Keeps the main agentv repo clean.
  • Studio in read-only mode serves as the leaderboard UI — no new UI to build. Deployment topology (subdomain, reverse proxy, etc.) is the operator's responsibility.
  • Any developer can deploy their own internal leaderboard by pointing a read-only Studio instance at their own results repo/directory.

Data hierarchy

| Leaderboard concept | Studio field | Example |
| --- | --- | --- |
| Benchmark | Project | swe-bench-lite |
| Submission (model + workspace + skills) | Experiment | claude-opus-4-custom-skills |
| Execution | Run (timestamp) | 2026-04-08T12-00-00 |
| Test case | Test | django__django-11099 |

Project = the results repo registered in Studio. Each benchmark repo is a project.

Experiment = the submission identity. This is the full agent stack (model + workspace template + skills), not just the model. The submitter names it. This becomes the primary leaderboard row.

Run = a timestamped execution within an experiment. Multiple runs per experiment are supported (re-runs, improved harness).
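
Expressed as illustrative types (all names here are hypothetical; Studio's internal data model is unchanged by this proposal, see Non-goals):

```ts
// Illustrative sketch of the hierarchy, not Studio's actual model.
type TestResult = {
  test: string; // e.g. "django__django-11099"
};

type Run = {
  timestamp: string;   // e.g. "2026-04-08T12-00-00"
  tests: TestResult[]; // the rows of that run's index.jsonl
};

type Experiment = {
  name: string; // submission identity, e.g. "claude-opus-4-custom-skills"
  runs: Run[];  // re-runs under the same submission
};

type Project = {
  name: string; // one benchmark results repo, e.g. "swe-bench-lite"
  experiments: Experiment[];
};
```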

Directory layout (breaking change)

Results are always nested under an experiment:

.agentv/results/runs/
├── claude-opus-4-custom-skills/       ← experiment
│   ├── 2026-04-08T12-00-00/           ← run
│   │   ├── index.jsonl
│   │   ├── benchmark.json
│   │   └── timing.json
│   └── 2026-04-07T10-00-00/
├── gpt-4o-swe-agent/
│   └── 2026-04-05T08-00-00/
└── default/                           ← no --experiment specified
    └── 2026-04-06T09-00-00/
  • --experiment provided → runs/<experiment>/<timestamp>/
  • No --experiment → runs/default/<timestamp>/
  • Always runs/<experiment>/<timestamp>/ — one code path, consistent structure
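
A sketch of that single code path (the helper name is hypothetical; agentv's real implementation may differ):

```ts
import * as path from "node:path";

// Hypothetical helper illustrating the one-code-path layout above.
function resolveRunDir(resultsRoot: string, experiment?: string): string {
  const name = experiment ?? "default"; // no --experiment → "default"
  // Filesystem-safe timestamp: colons replaced with dashes.
  const timestamp = new Date().toISOString().slice(0, 19).replace(/:/g, "-");
  return path.join(resultsRoot, "runs", name, timestamp);
}

// resolveRunDir(".agentv/results", "claude-opus-4-custom-skills")
//   → ".agentv/results/runs/claude-opus-4-custom-skills/2026-04-08T12-00-00"
```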

Data format

Use the existing Studio result artifacts that agentv eval already produces:

| Artifact | Commit to results repo? | Purpose |
| --- | --- | --- |
| index.jsonl | Yes | Scores, verdicts, cost, tool calls, timing |
| benchmark.json | Yes | Aggregate stats (pass rate, mean score, total cost) |
| timing.json | Yes | Aggregate timing |
| grading.json (per-test) | Optional | Per-evaluator breakdown for drill-down |
| input.md / response.md | No | Full conversations; too large, potentially sensitive |

For 300 SWE-bench Lite instances, index.jsonl ≈ 220 KB per model. 100 models ≈ 22 MB. Manageable for git.
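
For orientation, each index.jsonl line is one test result carrying roughly the fields below. The field names here are illustrative, not Studio's actual schema; the authoritative format is whatever agentv eval already writes today:

```ts
// Illustrative shape of one index.jsonl row (one JSON object per line).
type IndexRow = {
  test: string;       // e.g. "django__django-11099"
  verdict: string;    // pass/fail-style verdict
  score: number;      // evaluator score
  costUsd: number;    // model spend for this test
  toolCalls: number;  // tool-call count
  durationMs: number; // wall-clock timing
};
```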

Submission flow

Run eval locally → commit safe artifacts to benchmark results repo → open PR → CI validates JSONL → merge → Studio serves updated leaderboard

No agentv results export --public CLI command needed — submitters commit the safe artifacts directly. CI validation is the gate that rejects PRs containing conversation files or invalid JSONL.
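
A minimal sketch of that CI gate, assuming the validator receives the list of files changed by the PR (helper name and structure are hypothetical):

```ts
import { readFileSync } from "node:fs";

// Conversation artifacts must never be committed to the results repo.
const FORBIDDEN = /(input|response)\.md$/;

// Returns a list of validation errors; CI fails the PR if non-empty.
export function validateSubmission(changedFiles: string[]): string[] {
  const errors: string[] = [];
  for (const file of changedFiles) {
    if (FORBIDDEN.test(file)) {
      errors.push(`${file}: conversation files must not be committed`);
      continue;
    }
    if (file.endsWith("index.jsonl")) {
      const lines = readFileSync(file, "utf8").split("\n").filter(Boolean);
      lines.forEach((line, i) => {
        try {
          JSON.parse(line); // every line must be a valid JSON object
        } catch {
          errors.push(`${file}:${i + 1}: invalid JSON`);
        }
      });
    }
  }
  return errors;
}
```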

Studio UX for leaderboard

Studio already supports this hierarchy. The key UX change:

  1. Home → Projects dashboard (each benchmark repo = a project card)
  2. Project page → Experiments tab as default view (this IS the leaderboard table: experiment name, pass rate, cost, runs, last run)
  3. Click experiment → Runs for that submission (timestamped executions)
  4. Click run → Individual test results with existing drill-down

The Experiments tab already exists in Studio. It just needs to be the default/primary view instead of Recent Runs.

CLI changes

  • Add --experiment flag to agentv eval — currently only available on agentv pipeline input/run. Required for leaderboard submissions, defaults to default when omitted.
  • Add --read-only flag to agentv results serve — disables RunEvalModal, FeedbackPanel, and write endpoints for public deployment.
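
Expected invocations once both flags exist (all other arguments omitted):

```
# writes results to runs/claude-opus-4-custom-skills/<timestamp>/
agentv eval --experiment claude-opus-4-custom-skills

# serves the leaderboard with RunEvalModal, FeedbackPanel, and write endpoints disabled
agentv results serve --read-only
```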

What stays from #966

  • benchmarks/swe-bench-lite/ directory with setup.ts, grader, README
  • GitHub PR submission workflow
  • CI schema validation (validates JSONL format, not a custom JSON schema)
  • Pareto frontier chart, multi-dimensional columns

What changes from #966 / PR #970

  • No custom result.schema.json — use existing JSONL schema
  • No results/*.json in custom format — commit standard .agentv/results/ artifacts
  • No agentv results export --public — not needed with separate results repo
  • No landing page integration — Studio IS the leaderboard, deployed independently
  • No Astro component — Studio serves the UI directly
  • Leaderboard reads JSONL, not a bespoke JSON format
  • Benchmark-agnostic — same leaderboard works for any future benchmark

Execution path

  1. Docker workspace execution environments (#965): done (PR #971 merged)
  2. Add --experiment flag to agentv eval, implement runs/<experiment>/<timestamp>/ directory layout
  3. Default Studio view to Experiments tab for project pages
  4. Add --read-only mode to agentv results serve
  5. benchmarks/swe-bench-lite/ setup + grader (reuse from #970)
  6. Set up benchmark results repo with CI validation for JSONL submissions
  7. Pareto frontier chart on Experiments view

Acceptance signals

  • agentv eval --experiment <name> writes results to runs/<name>/<timestamp>/
  • agentv eval without --experiment writes to runs/default/<timestamp>/
  • Studio Experiments tab is default view for project pages
  • agentv results serve --read-only disables write operations
  • SWE-bench Lite results committed in standard JSONL format in separate results repo
  • Studio renders leaderboard from committed JSONL artifacts
  • Same format works for a non-coding benchmark (proves it's agnostic)
  • Sortable multi-dimensional columns (score, cost, $/fix, tool calls, latency)
  • Pareto frontier chart
  • CI validates submitted JSONL in results repo
  • Submission workflow documented

Non-goals

  • Modifying Studio's internal data format
  • agentv results export --public CLI command (not needed with separate repo)
  • Landing page or Astro component integration
  • Prescribing deployment topology (subdomains, etc.)
