Objective
Public leaderboard that reuses the existing Studio result artifact format (JSONL + benchmark.json). Benchmark-agnostic, aligned with AgentV's positioning as a general-purpose eval framework.
Supersedes the implementation approach in #966 (the objective from #966 remains the same).
Why change the approach
PR #970 (implementing #966) introduced a custom result.schema.json with SWE-bench-specific fields (instance_id, resolved, dataset: "swe-bench-lite"). This creates problems:
Tied to SWE-bench — AgentV evaluates all kinds of agents, not just coding agents. A new schema would be needed for every benchmark type.
Parallel data format — Studio already reads index.jsonl + benchmark.json + timing.json from .agentv/results/. Inventing a second format means maintaining two schemas and two rendering paths.
Duplicated UI work — Studio already has sortable tables, experiment comparison, and drill-down views. Building a separate Astro component reimplements this.
Design
Architecture
Separate benchmark results repo (e.g., EntityProcess/agentv-benchmarks) — contains only result artifacts, no source code. Keeps the main agentv repo clean.
Studio in read-only mode serves as the leaderboard UI — no new UI to build. Deployment topology (subdomain, reverse proxy, etc.) is the operator's responsibility.
Any developer can deploy their own internal leaderboard by pointing a read-only Studio instance at their own results repo/directory.
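As a sketch of that internal-deployment case (a minimal example: the repo name is the one used above, the --read-only flag is the one proposed under CLI changes below, and serving straight from a cloned results directory is an assumption about how Studio is pointed at it):

```sh
# Hypothetical internal leaderboard: clone the results repo and serve it read-only.
git clone https://github.com/EntityProcess/agentv-benchmarks
cd agentv-benchmarks
agentv results serve --read-only
```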
Data hierarchy
| Leaderboard concept | Studio field | Example |
| --- | --- | --- |
| Benchmark | Project | swe-bench-lite |
| Submission (model + workspace + skills) | Experiment | claude-opus-4-custom-skills |
| Execution | Run (timestamp) | 2026-04-08T12-00-00 |
| Test case | Test | django__django-11099 |
Project = the results repo registered in Studio. Each benchmark repo is a project.
Experiment = the submission identity. This is the full agent stack (model + workspace template + skills), not just the model. The submitter names it. This becomes the primary leaderboard row.
Run = a timestamped execution within an experiment. Multiple runs per experiment are supported (re-runs, improved harness).
Directory layout (breaking change)
Results are always nested under an experiment:
--experiment provided → runs/<experiment>/<timestamp>/
No --experiment → runs/default/<timestamp>/
Always runs/<experiment>/<timestamp>/ — one code path, consistent structure
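For illustration, one submission in the results repo might look like this (the experiment name and timestamp reuse the examples from the hierarchy table; per-test grading.json files are optional and follow whatever agentv eval already writes, so they are omitted here):

```
runs/
  claude-opus-4-custom-skills/
    2026-04-08T12-00-00/
      index.jsonl
      benchmark.json
      timing.json
```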
Data format
Use the existing Studio result artifacts that agentv eval already produces:
| Artifact | Commit to results repo? | Purpose |
| --- | --- | --- |
| index.jsonl | Yes | Scores, verdicts, cost, tool calls, timing |
| benchmark.json | Yes | Aggregate stats (pass rate, mean score, total cost) |
| timing.json | Yes | Aggregate timing |
| grading.json (per-test) | Optional | Per-evaluator breakdown for drill-down |
| input.md / response.md | No | Full conversations — too large, potentially sensitive |
For 300 SWE-bench Lite instances, index.jsonl ≈ 220 KB per model. 100 models ≈ 22 MB. Manageable for git.
Submission flow
Run eval locally → commit safe artifacts to benchmark results repo → open PR → CI validates JSONL → merge → Studio serves updated leaderboard
No agentv results export --public CLI command needed — submitters commit the safe artifacts directly. CI validation is the gate that rejects PRs containing conversation files or invalid JSONL.
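The CI gate could be a small script along these lines: a minimal sketch, not AgentV's actual validation code, assuming the results repo is checked out at the working directory and conversation artifacts keep the input.md / response.md names listed above.

```ts
// ci-validate.ts: sketch of the PR gate that rejects conversation files and invalid JSONL.
import { readFileSync, readdirSync, statSync } from "node:fs";
import { basename, join } from "node:path";

const FORBIDDEN = new Set(["input.md", "response.md"]); // never committed: too large, potentially sensitive

// Recursively list every file in the repo, skipping .git.
function walk(dir: string): string[] {
  return readdirSync(dir).flatMap((name) => {
    if (name === ".git") return [];
    const path = join(dir, name);
    return statSync(path).isDirectory() ? walk(path) : [path];
  });
}

let failed = false;
for (const file of walk(".")) {
  const name = basename(file);
  if (FORBIDDEN.has(name)) {
    console.error(`Forbidden conversation artifact: ${file}`);
    failed = true;
  }
  if (name === "index.jsonl") {
    // Every non-empty line must parse as standalone JSON.
    readFileSync(file, "utf8")
      .split("\n")
      .filter((line) => line.trim().length > 0)
      .forEach((line, i) => {
        try {
          JSON.parse(line);
        } catch {
          console.error(`${file}:${i + 1} is not valid JSON`);
          failed = true;
        }
      });
  }
}
process.exit(failed ? 1 : 0);
```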
Studio UX for leaderboard
Studio already supports this hierarchy. The key UX change:
Home → Projects dashboard (each benchmark repo = a project card)
Project page → Experiments tab as default view (this IS the leaderboard table: experiment name, pass rate, cost, runs, last run)
Click experiment → Runs for that submission (timestamped executions)
Click run → Individual test results with existing drill-down
The Experiments tab already exists in Studio. It just needs to be the default/primary view instead of Recent Runs.
CLI changes
Add --experiment flag to agentv eval — currently only available on agentv pipeline input/run. Required for leaderboard submissions, defaults to default when omitted.
Add --read-only flag to agentv results serve — disables RunEvalModal, FeedbackPanel, and write endpoints for public deployment.
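Assuming those flags land as proposed, submission usage might look like this (any other arguments the eval suite needs are omitted; the experiment name reuses the example from the hierarchy table):

```sh
# Run the benchmark under a named experiment (the submission identity).
# Results are written to runs/claude-opus-4-custom-skills/<timestamp>/ per the layout above.
agentv eval --experiment claude-opus-4-custom-skills

# Omitting --experiment writes to runs/default/<timestamp>/ instead.
agentv eval
```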
What stays from #966
benchmarks/swe-bench-lite/ directory with setup.ts, grader, README
What changes from #966 / PR #970
result.schema.json — use existing JSONL schema
results/*.json in custom format — commit standard .agentv/results/ artifacts
agentv results export --public — not needed with separate results repo
Execution path
Docker workspace environments (#965) — done (PR #971 merged)
--experiment flag to agentv eval, implement runs/<experiment>/<timestamp>/ directory layout
--read-only mode to agentv results serve
benchmarks/swe-bench-lite/ setup + grader (reuse from #970)
Acceptance signals
agentv eval --experiment <name> writes results to runs/<name>/<timestamp>/
agentv eval without --experiment writes to runs/default/<timestamp>/
agentv results serve --read-only disables write operations
Non-goals
agentv results export --public CLI command (not needed with separate repo)
Dependencies
Related