Closed
19 changes: 19 additions & 0 deletions .github/workflows/validate.yml
@@ -53,3 +53,22 @@ jobs:

      - name: Validate eval schemas
        run: bun apps/cli/dist/cli.js validate 'examples/features/**/evals/**/*.eval.yaml' 'examples/features/**/*.EVAL.yaml'

  benchmark-results:
    name: Validate Benchmark Results
    runs-on: ubuntu-latest
    if: >-
      contains(github.event.pull_request.title, 'benchmark') ||
      contains(join(github.event.pull_request.labels.*.name, ','), 'benchmark') ||
      github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/setup-bun

      - name: Validate SWE-bench Lite result JSON files
        run: |
          if ls benchmarks/swe-bench-lite/results/*.json 1> /dev/null 2>&1; then
            bun benchmarks/swe-bench-lite/validate-result.ts benchmarks/swe-bench-lite/results/*.json
          else
            echo "No result files found — skipping"
          fi
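
The guard in the step above matters because a shell glob that matches nothing is normally passed to the command as the literal pattern. As a minimal sketch of the same check without `ls`, the helper below tests whether the first glob match exists before running a command on the matches. The function name `validate_results` and its argument order are hypothetical and not part of this PR; the sketch is written for POSIX `sh` and is one possible alternative, not the workflow's actual implementation.

```shell
# Hypothetical helper (not part of the PR): run a command on DIR/*.json
# only when at least one matching file exists.
# Usage: validate_results DIR COMMAND [ARGS...]
validate_results() {
  dir=$1
  shift
  for f in "$dir"/*.json; do
    # An unmatched glob stays literal, so this existence test fails
    # when the directory holds no .json files.
    if [ -e "$f" ]; then
      "$@" "$dir"/*.json
      return
    fi
    break
  done
  echo "No result files found, skipping"
}

# Possible call from the workflow step (assumes the paths used in the diff):
#   validate_results benchmarks/swe-bench-lite/results \
#     bun benchmarks/swe-bench-lite/validate-result.ts
```

Passing the command as trailing arguments keeps the helper independent of `bun`, so the same guard could wrap any per-file validator.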
14 changes: 14 additions & 0 deletions apps/web/src/components/Lander.astro
@@ -14,6 +14,7 @@
</a>
<div class="av-nav-links">
<a href="/docs/">Docs</a>
<a href="/leaderboard">Leaderboard</a>
<a href="https://github.com/EntityProcess/agentv" target="_blank" rel="noopener noreferrer">GitHub</a>
<button class="av-nav-pill" data-command="npm install -g agentv">
<code>npm install -g agentv</code>
@@ -118,6 +119,19 @@
</div>
</section>

<!-- Leaderboard CTA Section -->
<section class="av-features" style="border-top: 1px solid rgba(255,255,255,0.04);">
<div class="av-container" style="text-align:center;">
<h2 class="av-section-heading">Public Leaderboard</h2>
<p style="color:#94a3b8; max-width:560px; margin:0 auto 1.5rem; font-size:0.95rem;">
SWE-bench Lite results with richer metrics — cost efficiency, tool usage, and Pareto-optimal rankings. See how models actually compare.
</p>
<a href="/leaderboard" class="av-btn-primary" style="display:inline-block; padding:0.75rem 2rem; font-size:0.9rem;">
View Leaderboard →
</a>
</div>
</section>

<!-- Quick Start Section -->
<section class="av-quickstart">
<div class="av-container">