
feat: curated public benchmark dataset and leaderboard #970

Closed

christso wants to merge 10 commits into main from feat/966-leaderboard

Conversation

christso (Collaborator) commented Apr 8, 2026

Curated Public Benchmark Dataset & Leaderboard

Closes #966

What this PR adds

This PR adds two connected pieces:

  1. benchmarks/swe-bench-lite/ benchmark tooling and result validation
  2. a public leaderboard page at /leaderboard in the docs/web app

Where the leaderboard lives and how it is hosted

  • The leaderboard is implemented in apps/web, not in Studio.
  • apps/web is the Astro/Starlight site for https://agentv.dev (apps/web/astro.config.mjs sets site: 'https://agentv.dev').
  • apps/web/wrangler.toml points Wrangler at ./dist, so the deployment target is the existing Cloudflare-hosted docs site.
  • There is no new GitHub Actions deployment workflow for the leaderboard in this PR. The workflows changed here are validation-oriented; deploy automation for apps/web appears to be handled separately from this branch.
  • apps/studio is a separate Vite/React app in the monorepo and has no runtime relationship to this leaderboard page beyond sharing the repo.

Benchmark infrastructure (benchmarks/swe-bench-lite/)

  • setup.ts — downloads SWE-bench Lite metadata and generates per-instance eval files
  • graders/swe-bench-grader.ts — reusable grader implementation for SWE-bench-style patch evaluation
  • validate-result.ts — zero-dependency result JSON validator with format constraints and length limits (see the sketch after this list)
  • result.schema.json — JSON Schema for CI validation
  • results/*.json — sample benchmark result files used to render the leaderboard UI
  • e2e-test/ — Docker-backed end-to-end validation fixture used to prove the grading path works with real providers
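
For orientation, here is a minimal sketch of what a zero-dependency validator of this shape can look like. This is an assumption-laden illustration: the field names (model, provider, resolution_rate) and limits are hypothetical, not the actual result.schema.json contract.

```ts
// Minimal sketch of a zero-dependency result validator.
// Field names and limits are illustrative, not result.schema.json's.
import { readFileSync } from "node:fs";

interface ValidationIssue {
  field: string;
  message: string;
}

function validateResult(raw: string): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  let data: Record<string, unknown>;
  try {
    data = JSON.parse(raw);
  } catch {
    return [{ field: "(root)", message: "not valid JSON" }];
  }
  if (typeof data !== "object" || data === null) {
    return [{ field: "(root)", message: "top level must be an object" }];
  }

  // Length-limited, format-constrained string fields.
  const checkString = (field: string, pattern: RegExp, maxLen: number) => {
    const value = data[field];
    if (typeof value !== "string") {
      issues.push({ field, message: "missing or not a string" });
    } else if (value.length > maxLen) {
      issues.push({ field, message: `longer than ${maxLen} characters` });
    } else if (!pattern.test(value)) {
      issues.push({ field, message: "does not match expected format" });
    }
  };

  checkString("model", /^[\w.\- ]+$/, 100);
  checkString("provider", /^[\w.\- ]+$/, 50);

  const rate = data["resolution_rate"];
  if (typeof rate !== "number" || rate < 0 || rate > 1) {
    issues.push({ field: "resolution_rate", message: "must be a number in [0, 1]" });
  }
  return issues;
}

const issues = validateResult(readFileSync(process.argv[2], "utf8"));
if (issues.length > 0) {
  for (const i of issues) console.error(`${i.field}: ${i.message}`);
  process.exit(1);
}
```

One upside of keeping the validator dependency-free is that CI can run it directly against result files without installing the workspace first.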

Leaderboard UI (apps/web/src/pages/leaderboard.astro)

  • sortable table with rank, model, provider, resolution rate, cost, cost/fix, tool calls, latency, and date
  • model-type filters and provider dropdown filter
  • Pareto frontier chart for score vs cost (see the sketch after this list)
  • CTA section linking users to the benchmark setup/run flow
  • landing-page nav + CTA integration in Lander.astro
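
The Pareto frontier computation behind the chart is small enough to sketch. Assuming illustrative field names (not the page's actual data shape), a model is Pareto-optimal when no other model is strictly better on both axes at once:

```ts
// Sketch: marking Pareto-optimal entries for a score-vs-cost scatter.
// A model is on the frontier if no other model is strictly better on
// both axes (higher resolution rate AND lower cost). Field names are
// illustrative, not taken from the actual leaderboard data files.
interface Entry {
  model: string;
  resolutionRate: number; // fraction of instances resolved, 0..1
  costUsd: number;        // total run cost in USD
}

function paretoFrontier(entries: Entry[]): Entry[] {
  return entries.filter(
    (a) =>
      !entries.some(
        (b) => b.resolutionRate > a.resolutionRate && b.costUsd < a.costUsd,
      ),
  );
}
```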

Core Docker fix included here

To validate the benchmark path end-to-end, this PR also includes a Docker behavior fix:

  • DockerWorkspaceProvider.pullImage() now checks docker image inspect before trying docker pull (see the sketch after this list)
  • this fixes locally built images failing with pull access denied
  • unit tests were updated for the new inspect-then-pull behavior
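
A minimal sketch of the inspect-then-pull behavior, assuming a child_process-based implementation; the helper name here is illustrative, not the actual DockerWorkspaceProvider code:

```ts
// Sketch of inspect-then-pull: only hit the registry when the image
// is not already available locally.
import { execFileSync } from "node:child_process";

function ensureImage(image: string): void {
  try {
    // Exits 0 if the image exists locally — including locally built
    // images that are not present in any registry.
    execFileSync("docker", ["image", "inspect", image], { stdio: "ignore" });
    return;
  } catch {
    // Not available locally; fall back to pulling from the registry.
    execFileSync("docker", ["pull", image], { stdio: "inherit" });
  }
}
```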

Security hardening

  • command injection mitigation in grader execution (execFileSync + test-name validation; sketched after this list)
  • YAML field validation in benchmark generation
  • CSS class sanitization in the leaderboard page to avoid XSS through provider names
  • stricter result validation constraints in the standalone validator
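
A sketch of the first and third items, under the assumption that the grader shells out to docker exec; the helper names (runTests, providerClass, SAFE_TEST_NAME) are hypothetical, not the PR's actual identifiers:

```ts
// Sketch: command-injection mitigation plus CSS-class sanitization.
import { execFileSync } from "node:child_process";

// Only allow conservative pytest-style test identifiers, e.g.
// "tests/test_core.py::TestFoo::test_bar".
const SAFE_TEST_NAME = /^[\w./:\[\]-]+$/;

function runTests(containerId: string, testNames: string[]): string {
  for (const name of testNames) {
    if (!SAFE_TEST_NAME.test(name)) {
      throw new Error(`unsafe test name rejected: ${name}`);
    }
  }
  // execFileSync passes arguments as an array — no shell interpretation,
  // so metacharacters in test names cannot become injected commands.
  return execFileSync(
    "docker",
    ["exec", containerId, "python", "-m", "pytest", ...testNames],
    { encoding: "utf8" },
  );
}

// CSS-class sanitization: reduce a provider name to a safe token before
// interpolating it into a class attribute.
const providerClass = (provider: string) =>
  provider.toLowerCase().replace(/[^a-z0-9-]/g, "-");
```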

CI

  • added benchmark-results validation to .github/workflows/validate.yml

Validation completed

  • ✅ Gemini target (gemini-3-flash-preview) passes Docker-backed E2E eval: score 1.0
  • ✅ Azure target (gpt-5.4-mini) passes Docker-backed E2E eval: score 1.0
  • ✅ leaderboard UI validated locally with agent-browser (table, filters, chart, CTA)
  • ✅ test suite passing: 385 tests
  • ✅ web build passing: 44 pages

Important reviewer context / likely follow-ups

  • The leaderboard currently renders from checked-in sample result JSON; it is not yet backed by an automated submission ingestion pipeline.
  • This PR adds the benchmark + UI foundation, but not moderation/governance for accepting public result submissions.
  • If reviewers want automated publication, the next likely step is to add or document the deploy path for apps/web; this PR does not establish a new deploy workflow.
  • If reviewers want to productionize the benchmark path further, the next likely step is to tighten the grader/image packaging story for generated SWE-bench evals beyond the dedicated E2E fixture.

cloudflare-workers-and-pages (Bot) commented Apr 8, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 94f12d0
Status: ✅  Deploy successful!
Preview URL: https://97bc182f.agentv.pages.dev
Branch Preview URL: https://feat-966-leaderboard.agentv.pages.dev


christso force-pushed the feat/966-leaderboard branch from 4848f9e to 4dc6b4b on April 8, 2026 06:00
christso and others added 9 commits April 8, 2026 06:41
SWE-bench Lite benchmark infrastructure and public leaderboard on agentv.dev.

Closes #966

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- setup.ts: downloads dataset from HuggingFace, generates EVAL.yaml files
- graders/swe-bench-grader.ts: code-grader template for SWE-bench
- validate-result.ts: Zod-based result JSON validation
- result.schema.json: JSON Schema for CI validation
- README.md: run/submit instructions
- 6 sample result files for leaderboard development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- /leaderboard route with SWE-bench Lite results
- Sortable multi-dimensional table (%, cost, $/Fix, tools, latency)
- SVG Pareto frontier chart (score vs cost scatter)
- Filter by model type (proprietary, open-weights, open-source)
- Cost-normalized ranking ($/Fix) with color coding
- Pareto frontier badges on optimal models
- CTA section with run/submit instructions
- Leaderboard link in landing page nav + CTA section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validates SWE-bench Lite result files against schema on PRs and pushes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rewrite validate-result.ts without zod dependency (runs standalone)
- Make per_instance count mismatch a warning (supports partial results)
- Add provider filter dropdown to leaderboard page
- Both model type and provider filters apply simultaneously

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements Docker-based workspace type for coding benchmarks (SWE-bench).
Agent runs on host, grader runs inside container.

Closes #965

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Merge feat/965-docker-workspace into leaderboard branch
- Rewrite swe-bench-grader.ts to apply patches and run pytest inside container
- Add Docker prerequisites to benchmark README
- Fix eval-schema.json formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The problem_statement from HuggingFace contains multiline content
(code blocks, markdown) that must be indented to match the YAML
block scalar indentation level. Without proper indentation, the
YAML parser fails on content like backtick fences.

All 3 test EVAL.yaml files now pass agentv validate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
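
A minimal sketch of the indentation logic this commit describes, with a hypothetical helper name rather than the actual setup.ts code: every line of the embedded problem_statement must be indented past the YAML key, or a line like a backtick fence at column 0 terminates the block scalar early.

```ts
// Sketch: embedding multiline content under a YAML block scalar.
// Content lines must be indented deeper than the key; empty lines
// may stay empty.
function toBlockScalar(key: string, text: string, indent = 2): string {
  const pad = " ".repeat(indent);
  const body = text
    .split("\n")
    .map((line) => (line.length > 0 ? pad + line : line))
    .join("\n");
  return `${key}: |\n${body}`;
}
```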
- Grader: replace execSync with execFileSync (no shell interpretation)
- Grader: validate test names against safe pattern before execution
- Setup: validate instance_id, repo, base_commit, version fields
- Leaderboard: sanitize provider names for CSS class interpolation
- Validator: add length limits and format constraints on string fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
christso force-pushed the feat/966-leaderboard branch from dfa2f23 to a0c5954 on April 8, 2026 06:43
- DockerWorkspaceProvider.pullImage() now checks if image exists locally
  via 'docker image inspect' before attempting 'docker pull'
- Fixes local-only Docker images failing with 'pull access denied'
- Added E2E test eval (calculator-bug) with Python grader running in container
- Fixed setup.ts to use 'command' instead of 'value' for code-grader
- Fixed config nesting: grader config fields at assertion level, not nested
- Updated Docker workspace unit tests for new inspect-then-pull behavior
- Validated E2E with Gemini (score 1.0) and Azure GPT-5.4-mini (score 1.0)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
christso (Collaborator, Author) commented Apr 8, 2026

Design pivot: use Studio result artifacts instead of custom JSON schema

After reviewing this PR, we identified a better approach for the leaderboard data format:

Problem with current approach: The result.schema.json is SWE-bench-specific (instance_id, resolved, dataset: "swe-bench-lite"). AgentV is a general-purpose eval framework — a new schema would be needed for every benchmark type.

Revised approach: Reuse the existing Studio result artifacts (index.jsonl + benchmark.json + timing.json) that agentv eval already produces. These are benchmark-agnostic (~750 bytes/test), already rendered by Studio, and contain richer data (per-evaluator breakdowns, assertions, evidence) without needing a parallel format.

See #972 for the revised design.

What's been split out

  • Docker inspect-then-pull fix — split out to #973 (see closing comment)

Recommendation

Close this PR in favor of a new implementation against #972.

christso (Collaborator, Author) commented Apr 8, 2026

Closing in favor of #972 (revised design using Studio result artifacts). Docker fix split out to #973.

christso closed this Apr 8, 2026