feat: curated public benchmark dataset and leaderboard #970
Closed
Conversation
Deploying agentv with Cloudflare Pages

| | |
|---|---|
| Latest commit | 94f12d0 |
| Status | ✅ Deploy successful! |
| Preview URL | https://97bc182f.agentv.pages.dev |
| Branch Preview URL | https://feat-966-leaderboard.agentv.pages.dev |
Force-pushed from 4848f9e to 4dc6b4b.
SWE-bench Lite benchmark infrastructure and public leaderboard on agentv.dev.

Closes #966

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- setup.ts: downloads dataset from HuggingFace, generates EVAL.yaml files
- graders/swe-bench-grader.ts: code-grader template for SWE-bench
- validate-result.ts: Zod-based result JSON validation
- result.schema.json: JSON Schema for CI validation
- README.md: run/submit instructions
- 6 sample result files for leaderboard development
- /leaderboard route with SWE-bench Lite results
- Sortable multi-dimensional table (%, cost, $/Fix, tools, latency)
- SVG Pareto frontier chart (score vs cost scatter)
- Filter by model type (proprietary, open-weights, open-source)
- Cost-normalized ranking ($/Fix) with color coding
- Pareto frontier badges on optimal models
- CTA section with run/submit instructions
- Leaderboard link in landing page nav + CTA section
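The Pareto frontier badges and the $/Fix column described above can be computed with a small amount of logic. A minimal sketch, in which the `Entry` shape, field names, and helper names are illustrative assumptions rather than the actual leaderboard code:

```typescript
// Hypothetical shape for a leaderboard row; field names are illustrative,
// not taken from the PR's actual result schema.
interface Entry {
  model: string;
  scorePct: number; // resolved-instance percentage (higher is better)
  costUsd: number;  // total run cost (lower is better)
}

// An entry is on the Pareto frontier if no other entry is at least as good
// on both axes and strictly better on at least one.
function paretoFrontier(entries: Entry[]): Entry[] {
  return entries.filter((a) =>
    !entries.some(
      (b) =>
        b !== a &&
        b.scorePct >= a.scorePct &&
        b.costUsd <= a.costUsd &&
        (b.scorePct > a.scorePct || b.costUsd < a.costUsd),
    ),
  );
}

// Cost-normalized ranking: dollars per resolved instance ($/Fix).
function dollarsPerFix(e: Entry, totalInstances: number): number {
  const fixes = (e.scorePct / 100) * totalInstances;
  return fixes > 0 ? e.costUsd / fixes : Infinity;
}
```

Separating the dominance check from rendering keeps the badge logic trivially unit-testable, independent of the SVG chart.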
Validates SWE-bench Lite result files against the schema on PRs and pushes.
- Rewrite validate-result.ts without zod dependency (runs standalone)
- Make per_instance count mismatch a warning (supports partial results)
- Add provider filter dropdown to leaderboard page
- Both model type and provider filters apply simultaneously
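A zero-dependency validator of this kind can be sketched as follows. The field names, limits, and messages here are assumptions for illustration, not the actual `validate-result.ts` or `result.schema.json`; the one behavior taken from the commit above is that a per_instance count mismatch is downgraded to a warning:

```typescript
// Minimal dependency-free validation sketch. Errors fail validation;
// warnings (e.g. partial results) do not.
type Check = { errors: string[]; warnings: string[] };

function validateResult(r: any): Check {
  const errors: string[] = [];
  const warnings: string[] = [];
  if (typeof r.model !== "string" || r.model.length === 0 || r.model.length > 200) {
    errors.push("model must be a non-empty string (max 200 chars)");
  }
  if (typeof r.score !== "number" || r.score < 0 || r.score > 1) {
    errors.push("score must be a number in [0, 1]");
  }
  if (!Array.isArray(r.per_instance)) {
    errors.push("per_instance must be an array");
  } else if (
    typeof r.instance_count === "number" &&
    r.per_instance.length !== r.instance_count
  ) {
    // Partial results are allowed: a count mismatch is only a warning.
    warnings.push(
      `per_instance has ${r.per_instance.length} entries, expected ${r.instance_count}`,
    );
  }
  return { errors, warnings };
}
```

Hand-rolled checks like this trade zod's expressiveness for a validator that runs under plain `node` with no install step, which matters for contributor-submitted result files.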
Implements a Docker-based workspace type for coding benchmarks (SWE-bench). The agent runs on the host; the grader runs inside the container. Closes #965
- Merge feat/965-docker-workspace into the leaderboard branch
- Rewrite swe-bench-grader.ts to apply patches and run pytest inside the container
- Add Docker prerequisites to the benchmark README
- Fix eval-schema.json formatting
The problem_statement field from HuggingFace contains multiline content (code blocks, markdown) that must be indented to match the YAML block scalar's indentation level. Without proper indentation, the YAML parser fails on content like backtick fences. All 3 test EVAL.yaml files now pass `agentv validate`.
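The indentation fix can be pictured as a small helper that prefixes each non-empty line before embedding the text under a block scalar. A sketch, assuming hypothetical helper names; only the `problem_statement: |` shape comes from the commit above:

```typescript
// Indent every line of a multiline string so it nests correctly under a YAML
// block scalar (e.g. `problem_statement: |`). Without this, backtick fences
// at column 0 would end the scalar early and confuse the parser.
function indentForBlockScalar(text: string, indent: string): string {
  return text
    .split("\n")
    .map((line) => (line.length > 0 ? indent + line : line))
    .join("\n");
}

// Example assembly of one YAML entry (two-space indent assumed).
function toYamlBlock(key: string, text: string): string {
  return `${key}: |\n${indentForBlockScalar(text, "  ")}`;
}
```

Empty lines are left unindented on purpose: YAML permits blank lines inside a block scalar, and trailing whitespace on them is noise.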
- Grader: replace execSync with execFileSync (no shell interpretation)
- Grader: validate test names against a safe pattern before execution
- Setup: validate instance_id, repo, base_commit, version fields
- Leaderboard: sanitize provider names for CSS class interpolation
- Validator: add length limits and format constraints on string fields
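The execSync-to-execFileSync change plus test-name validation might look roughly like this sketch; the allow-list pattern, length limit, and `runTests` shape are assumptions, not the grader's actual code:

```typescript
import { execFileSync } from "node:child_process";

// Allow-list for pytest-style test identifiers, e.g.
// "tests/test_core.py::TestFoo::test_bar". Illustrative pattern only.
const SAFE_TEST_NAME = /^[A-Za-z0-9_.\/\[\]:,-]+$/;

function assertSafeTestName(name: string): string {
  if (!SAFE_TEST_NAME.test(name) || name.length > 300) {
    throw new Error(`unsafe test name: ${name}`);
  }
  return name;
}

// execFileSync passes argv directly to the process — no shell, so characters
// like `;` or `$(...)` in a test name are never interpreted even if the
// allow-list were to miss something.
function runTests(containerId: string, tests: string[]): void {
  const argv = ["exec", containerId, "python", "-m", "pytest",
    ...tests.map(assertSafeTestName)];
  execFileSync("docker", argv, { stdio: "inherit" });
}
```

Defense in depth is the point: the allow-list rejects obviously hostile names, and the shell-free exec ensures that anything it misses still cannot be interpreted.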
Force-pushed from dfa2f23 to a0c5954.
- DockerWorkspaceProvider.pullImage() now checks whether the image exists locally via `docker image inspect` before attempting `docker pull`
- Fixes local-only Docker images failing with "pull access denied"
- Added an E2E test eval (calculator-bug) with a Python grader running in the container
- Fixed setup.ts to use `command` instead of `value` for code-grader
- Fixed config nesting: grader config fields go at the assertion level, not nested
- Updated Docker workspace unit tests for the new inspect-then-pull behavior
- Validated E2E with Gemini (score 1.0) and Azure GPT-5.4-mini (score 1.0)
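The inspect-then-pull behavior can be sketched with the Docker invocation injected, which also makes the fallback logic testable without a daemon. Function and type names here are illustrative, not the actual `DockerWorkspaceProvider` API:

```typescript
// Real code would pass (args) => execFileSync("docker", args).
type DockerExec = (args: string[]) => void; // throws on non-zero exit

function ensureImage(image: string, docker: DockerExec): "local" | "pulled" {
  try {
    // Succeeds iff the image already exists locally.
    docker(["image", "inspect", image]);
    return "local";
  } catch {
    // Not local: fall back to pulling from a registry. For local-only images,
    // this pull is where "pull access denied" used to surface unconditionally.
    docker(["pull", image]);
    return "pulled";
  }
}
```

Checking `docker image inspect` first means images built locally (as benchmark fixtures often are) never hit the registry at all.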
This was referenced Apr 8, 2026
Author (Collaborator):
Design pivot: use Studio result artifacts instead of custom JSON schema

After reviewing this PR, we identified a better approach for the leaderboard data format.

Problem with current approach: the …

Revised approach: reuse the existing Studio result artifacts (… ). See #972 for the revised design.

What's been split out: …

Recommendation: close this PR in favor of a new implementation against #972.
Curated Public Benchmark Dataset & Leaderboard
Closes #966
What this PR adds
This PR adds two connected pieces:
- `benchmarks/swe-bench-lite/` — benchmark tooling and result validation
- `/leaderboard` in the docs/web app

Where the leaderboard lives and how it is hosted

- The leaderboard page lives in `apps/web`, not in Studio.
- `apps/web` is the Astro/Starlight site for https://agentv.dev (`apps/web/astro.config.mjs` sets `site: 'https://agentv.dev'`).
- `apps/web/wrangler.toml` points Wrangler at `./dist`, so the deployment target is the existing Cloudflare-hosted docs site.
- Deploys of `apps/web` appear to be handled separately from the workflows in this branch.
- `apps/studio` is a separate Vite/React app in the monorepo and has no runtime relationship to this leaderboard page beyond sharing the repo.

Benchmark infrastructure (`benchmarks/swe-bench-lite/`)

- `setup.ts` — downloads SWE-bench Lite metadata and generates per-instance eval files
- `graders/swe-bench-grader.ts` — reusable grader implementation for SWE-bench-style patch evaluation
- `validate-result.ts` — zero-dependency result JSON validator with format constraints and length limits
- `result.schema.json` — JSON Schema for CI validation
- `results/*.json` — sample benchmark result files used to render the leaderboard UI
- `e2e-test/` — Docker-backed end-to-end validation fixture used to prove the grading path works with real providers

Leaderboard UI (`apps/web/src/pages/leaderboard.astro`)

- Sortable results table, Pareto frontier chart, and model-type/provider filters (see commit notes above)
- Leaderboard link added to `Lander.astro`

Core Docker fix included here

To validate the benchmark path end-to-end, this PR also includes a Docker behavior fix:

- `DockerWorkspaceProvider.pullImage()` now checks `docker image inspect` before trying `docker pull`
- Fixes local-only Docker images failing with `pull access denied`

Security hardening

- Grader uses `execFileSync` plus test-name validation (no shell interpretation)

CI

- Adds `benchmark-results` validation to `.github/workflows/validate.yml`

Validation completed

- Gemini (`gemini-3-flash-preview`) passes the Docker-backed E2E eval: score 1.0
- Azure (`gpt-5.4-mini`) passes the Docker-backed E2E eval: score 1.0

Important reviewer context / likely follow-ups

- Deploys of `apps/web` are handled outside this branch; this PR does not establish a new deploy workflow.