
feat: curated public benchmark dataset and leaderboard #970

Closed

christso wants to merge 10 commits into main from feat/966-leaderboard

Conversation

christso (Collaborator) commented Apr 8, 2026

Curated Public Benchmark Dataset & Leaderboard

Closes #966

What this PR adds

This PR adds two connected pieces:

  1. benchmarks/swe-bench-lite/ benchmark tooling and result validation
  2. a public leaderboard page at /leaderboard in the docs/web app

Where the leaderboard lives and how it is hosted

  • The leaderboard is implemented in apps/web, not in Studio.
  • apps/web is the Astro/Starlight site for https://agentv.dev (apps/web/astro.config.mjs sets site: 'https://agentv.dev').
  • apps/web/wrangler.toml points Wrangler at ./dist, so the deployment target is the existing Cloudflare-hosted docs site.
  • There is no new GitHub Actions deployment workflow for the leaderboard in this PR. The workflows changed here are validation-oriented; deploy automation for apps/web appears to be handled separately from this branch.
  • apps/studio is a separate Vite/React app in the monorepo and has no runtime relationship to this leaderboard page beyond sharing the repo.

Benchmark infrastructure (benchmarks/swe-bench-lite/)

  • setup.ts — downloads SWE-bench Lite metadata and generates per-instance eval files
  • graders/swe-bench-grader.ts — reusable grader implementation for SWE-bench-style patch evaluation
  • validate-result.ts — zero-dependency result JSON validator with format constraints and length limits (see the sketch after this list)
  • result.schema.json — JSON Schema for CI validation
  • results/*.json — sample benchmark result files used to render the leaderboard UI
  • e2e-test/ — Docker-backed end-to-end validation fixture used to prove the grading path works with real providers
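
For orientation, here is a minimal sketch of what a zero-dependency validator of this shape can look like. This is an assumption-laden illustration: the field names (model, provider, resolution_rate) and limits are hypothetical, not the actual result.schema.json contract.

```ts
// Minimal sketch of a zero-dependency result validator.
// Field names and limits are illustrative, not result.schema.json's.
import { readFileSync } from "node:fs";

interface ValidationIssue {
  field: string;
  message: string;
}

function validateResult(raw: string): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  let data: Record<string, unknown>;
  try {
    data = JSON.parse(raw);
  } catch {
    return [{ field: "(root)", message: "not valid JSON" }];
  }
  if (typeof data !== "object" || data === null) {
    return [{ field: "(root)", message: "top level must be an object" }];
  }

  // Length-limited, format-constrained string fields.
  const checkString = (field: string, pattern: RegExp, maxLen: number) => {
    const value = data[field];
    if (typeof value !== "string") {
      issues.push({ field, message: "missing or not a string" });
    } else if (value.length > maxLen) {
      issues.push({ field, message: `longer than ${maxLen} characters` });
    } else if (!pattern.test(value)) {
      issues.push({ field, message: "does not match expected format" });
    }
  };

  checkString("model", /^[\w.\- ]+$/, 100);
  checkString("provider", /^[\w.\- ]+$/, 50);

  const rate = data["resolution_rate"];
  if (typeof rate !== "number" || rate < 0 || rate > 1) {
    issues.push({ field: "resolution_rate", message: "must be a number in [0, 1]" });
  }
  return issues;
}

const issues = validateResult(readFileSync(process.argv[2], "utf8"));
if (issues.length > 0) {
  for (const i of issues) console.error(`${i.field}: ${i.message}`);
  process.exit(1);
}
```

One upside of keeping the validator dependency-free is that CI can run it directly against result files without installing the workspace first.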

Leaderboard UI (apps/web/src/pages/leaderboard.astro)

  • sortable table with rank, model, provider, resolution rate, cost, cost/fix, tool calls, latency, and date
  • model-type filters and provider dropdown filter
  • Pareto frontier chart for score vs cost (see the sketch after this list)
  • CTA section linking users to the benchmark setup/run flow
  • landing-page nav + CTA integration in Lander.astro
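
The Pareto frontier computation behind the chart is small enough to sketch. Assuming illustrative field names (not the page's actual data shape), a model is Pareto-optimal when no other model is strictly better on both axes at once:

```ts
// Sketch: marking Pareto-optimal entries for a score-vs-cost scatter.
// A model is on the frontier if no other model is strictly better on
// both axes (higher resolution rate AND lower cost). Field names are
// illustrative, not taken from the actual leaderboard data files.
interface Entry {
  model: string;
  resolutionRate: number; // fraction of instances resolved, 0..1
  costUsd: number;        // total run cost in USD
}

function paretoFrontier(entries: Entry[]): Entry[] {
  return entries.filter(
    (a) =>
      !entries.some(
        (b) => b.resolutionRate > a.resolutionRate && b.costUsd < a.costUsd,
      ),
  );
}
```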

Core Docker fix included here

To validate the benchmark path end-to-end, this PR also includes a Docker behavior fix:

  • DockerWorkspaceProvider.pullImage() now checks docker image inspect before trying docker pull (see the sketch after this list)
  • this fixes locally built images failing with pull access denied
  • unit tests were updated for the new inspect-then-pull behavior
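
A minimal sketch of the inspect-then-pull behavior, assuming a child_process-based implementation; the helper name here is illustrative, not the actual DockerWorkspaceProvider code:

```ts
// Sketch of inspect-then-pull: only hit the registry when the image
// is not already available locally.
import { execFileSync } from "node:child_process";

function ensureImage(image: string): void {
  try {
    // Exits 0 if the image exists locally — including locally built
    // images that are not present in any registry.
    execFileSync("docker", ["image", "inspect", image], { stdio: "ignore" });
    return;
  } catch {
    // Not available locally; fall back to pulling from the registry.
    execFileSync("docker", ["pull", image], { stdio: "inherit" });
  }
}
```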

Security hardening

  • command injection mitigation in grader execution (execFileSync + test-name validation; sketched after this list)
  • YAML field validation in benchmark generation
  • CSS class sanitization in the leaderboard page to avoid XSS through provider names
  • stricter result validation constraints in the standalone validator
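
A sketch of the first and third items, under the assumption that the grader shells out to docker exec; the helper names (runTests, providerClass, SAFE_TEST_NAME) are hypothetical, not the PR's actual identifiers:

```ts
// Sketch: command-injection mitigation plus CSS-class sanitization.
import { execFileSync } from "node:child_process";

// Only allow conservative pytest-style test identifiers, e.g.
// "tests/test_core.py::TestFoo::test_bar".
const SAFE_TEST_NAME = /^[\w./:\[\]-]+$/;

function runTests(containerId: string, testNames: string[]): string {
  for (const name of testNames) {
    if (!SAFE_TEST_NAME.test(name)) {
      throw new Error(`unsafe test name rejected: ${name}`);
    }
  }
  // execFileSync passes arguments as an array — no shell interpretation,
  // so metacharacters in test names cannot become injected commands.
  return execFileSync(
    "docker",
    ["exec", containerId, "python", "-m", "pytest", ...testNames],
    { encoding: "utf8" },
  );
}

// CSS-class sanitization: reduce a provider name to a safe token before
// interpolating it into a class attribute.
const providerClass = (provider: string) =>
  provider.toLowerCase().replace(/[^a-z0-9-]/g, "-");
```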

CI

  • added benchmark-results validation to .github/workflows/validate.yml

Validation completed

  • ✅ Gemini target (gemini-3-flash-preview) passes Docker-backed E2E eval: score 1.0
  • ✅ Azure target (gpt-5.4-mini) passes Docker-backed E2E eval: score 1.0
  • ✅ leaderboard UI validated locally with agent-browser (table, filters, chart, CTA)
  • ✅ test suite passing: 385 tests
  • ✅ web build passing: 44 pages

Important reviewer context / likely follow-ups

  • The leaderboard currently renders from checked-in sample result JSON; it is not yet backed by an automated submission ingestion pipeline.
  • This PR adds the benchmark + UI foundation, but not moderation/governance for accepting public result submissions.
  • If reviewers want automated publication, the next likely step is to add or document the deploy path for apps/web; this PR does not establish a new deploy workflow.
  • If reviewers want to productionize the benchmark path further, the next likely step is to tighten the grader/image packaging story for generated SWE-bench evals beyond the dedicated E2E fixture.

cloudflare-workers-and-pages (Bot) commented Apr 8, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 94f12d0
Status: ✅  Deploy successful!
Preview URL: https://97bc182f.agentv.pages.dev
Branch Preview URL: https://feat-966-leaderboard.agentv.pages.dev


christso force-pushed the feat/966-leaderboard branch from 4848f9e to 4dc6b4b on April 8, 2026 06:00
christso and others added 9 commits April 8, 2026 06:41
SWE-bench Lite benchmark infrastructure and public leaderboard on agentv.dev.

Closes #966

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- setup.ts: downloads dataset from HuggingFace, generates EVAL.yaml files
- graders/swe-bench-grader.ts: code-grader template for SWE-bench
- validate-result.ts: Zod-based result JSON validation
- result.schema.json: JSON Schema for CI validation
- README.md: run/submit instructions
- 6 sample result files for leaderboard development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- /leaderboard route with SWE-bench Lite results
- Sortable multi-dimensional table (%, cost, $/Fix, tools, latency)
- SVG Pareto frontier chart (score vs cost scatter)
- Filter by model type (proprietary, open-weights, open-source)
- Cost-normalized ranking ($/Fix) with color coding
- Pareto frontier badges on optimal models
- CTA section with run/submit instructions
- Leaderboard link in landing page nav + CTA section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validates SWE-bench Lite result files against schema on PRs and pushes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rewrite validate-result.ts without zod dependency (runs standalone)
- Make per_instance count mismatch a warning (supports partial results)
- Add provider filter dropdown to leaderboard page
- Both model type and provider filters apply simultaneously

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements Docker-based workspace type for coding benchmarks (SWE-bench).
Agent runs on host, grader runs inside container.

Closes #965

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Merge feat/965-docker-workspace into leaderboard branch
- Rewrite swe-bench-grader.ts to apply patches and run pytest inside container
- Add Docker prerequisites to benchmark README
- Fix eval-schema.json formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The problem_statement from HuggingFace contains multiline content
(code blocks, markdown) that must be indented to match the YAML
block scalar indentation level. Without proper indentation, the
YAML parser fails on content like backtick fences.

All 3 test EVAL.yaml files now pass agentv validate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
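
A minimal sketch of the indentation logic this commit describes, with a hypothetical helper name rather than the actual setup.ts code: every line of the embedded problem_statement must be indented past the YAML key, or a line like a backtick fence at column 0 terminates the block scalar early.

```ts
// Sketch: embedding multiline content under a YAML block scalar.
// Content lines must be indented deeper than the key; empty lines
// may stay empty.
function toBlockScalar(key: string, text: string, indent = 2): string {
  const pad = " ".repeat(indent);
  const body = text
    .split("\n")
    .map((line) => (line.length > 0 ? pad + line : line))
    .join("\n");
  return `${key}: |\n${body}`;
}
```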
- Grader: replace execSync with execFileSync (no shell interpretation)
- Grader: validate test names against safe pattern before execution
- Setup: validate instance_id, repo, base_commit, version fields
- Leaderboard: sanitize provider names for CSS class interpolation
- Validator: add length limits and format constraints on string fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
christso force-pushed the feat/966-leaderboard branch from dfa2f23 to a0c5954 on April 8, 2026 06:43
- DockerWorkspaceProvider.pullImage() now checks if image exists locally
  via 'docker image inspect' before attempting 'docker pull'
- Fixes local-only Docker images failing with 'pull access denied'
- Added E2E test eval (calculator-bug) with Python grader running in container
- Fixed setup.ts to use 'command' instead of 'value' for code-grader
- Fixed config nesting: grader config fields at assertion level, not nested
- Updated Docker workspace unit tests for new inspect-then-pull behavior
- Validated E2E with Gemini (score 1.0) and Azure GPT-5.4-mini (score 1.0)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
christso (Collaborator, Author) commented Apr 8, 2026

Design pivot: use Studio result artifacts instead of custom JSON schema

After reviewing this PR, we identified a better approach for the leaderboard data format:

Problem with current approach: The result.schema.json is SWE-bench-specific (instance_id, resolved, dataset: "swe-bench-lite"). AgentV is a general-purpose eval framework — a new schema would be needed for every benchmark type.

Revised approach: Reuse the existing Studio result artifacts (index.jsonl + benchmark.json + timing.json) that agentv eval already produces. These are benchmark-agnostic (~750 bytes/test), already rendered by Studio, and contain richer data (per-evaluator breakdowns, assertions, evidence) without needing a parallel format.

See #972 for the revised design.

What's been split out

  • Docker inspect-then-pull fix — split out to #973 (see closing comment)

Recommendation

Close this PR in favor of a new implementation against #972.

christso (Collaborator, Author) commented Apr 8, 2026

Closing in favor of #972 (revised design using Studio result artifacts). Docker fix split out to #973.

christso closed this Apr 8, 2026