The benchmark that measures what AI coding agents actually do in production
Why • Quick Start • Tasks • Evaluation • Leaderboard • Contributing
<img src="https://img.shields.io/badge/version-1.0.0-brightgreen" alt="Version">
Existing benchmarks are disconnected from reality. See our full thesis for detailed analysis.
| Benchmark | Focus | Real-World Signal | Limitation |
|---|---|---|---|
| HumanEval | Algorithmic puzzles | ❌ Low | Not production code |
| SWE-bench | Bug fixes in 12 repos | ⚠️ Partial | 63% of passing patches are suspicious |
| SWE-bench Pro | Multi-file tasks | ⚠️ Partial | Scores drop from 70% to 23% |
| VibeCodingBench | Full-stack features | ✅ High | — (tasks are production-aligned) |
Developer Time Distribution (Sonar Research):
- Writing new code: 32% | Code maintenance: 19% | Testing: 12%
- Developers code only 52 minutes/day on average
The Boilerplate Burden (GitHub Octoverse 2025):
- 2.4M repos use Notebooks (+75% YoY)
- 1.9M repos use Dockerfiles (+120% YoY)
- Developers need help with repetitive patterns: auth, CRUD, integrations
SWE-EVO Exposes the Gap (arxiv:2512.18470):
- Best models: 65% on simple fixes → only 21% on code evolution
- "Current AI agents struggle with comprehensive planning and execution"
Quality Beyond Pass Rate (Qodo 2025):
- "Claude Sonnet 4 averaged 2.11 issues per passing task"
- Pass rate alone hides production risks
Developer Frustration (Stack Overflow 2025):
- 66% cite "AI solutions almost right, but not quite" as top frustration
- 45% say "debugging AI code is more time-consuming"
```bash
git clone https://github.com/alt-research/vibe-coding-benchmark-public.git
cd vibe-coding-benchmark-public
npm install
npm run build
```

```bash
# List tasks
node packages/cli/dist/index.js list

# Run a task with the mock agent
node packages/cli/dist/index.js run saas-core/auth/supabase-oauth --agent mock

# Run with a real agent (requires an API key)
export ANTHROPIC_API_KEY=your_key
node packages/cli/dist/index.js run saas-core/auth/supabase-oauth --agent claude

# Run the full evaluation across agents
node packages/cli/dist/index.js eval --agents claude,glm,minimax

# Watch live execution
node packages/cli/dist/index.js run <task-id> --agent claude --live
```

| Category | Weight | Tasks | Languages | Examples |
|---|---|---|---|---|
| SaaS Core | 25% | 30 | TS, Go, Python, Java, Rust | supabase-oauth, jwt-refresh-tokens, rbac-permissions |
| Glue Code | 20% | 30 | Python, Go, TS, Java, Rust | csv-normalizer, kafka-producer, cdc-pipeline |
| AI Integration | 20% | 30 | Python, TS, Go | pdf-qa, research-agent, semantic-search |
| Frontend | 15% | 30 | React, Vue, Svelte, RN | landing-page, data-grid, collaborative-editor |
| API Integrations | 10% | 30 | TS, Go, Python, Java | checkout-session, twilio-sms, saml-sso |
| Code Evolution | 10% | 30 | TS, Python, Go, Kotlin | flask-to-fastapi, java-to-kotlin, secrets-rotation |

Total: 180 tasks across 10 languages and frameworks (TypeScript, Python, Go, Java, Kotlin, Rust, C#, React, Vue, Svelte)
Based on GitHub Octoverse 2025 and Stack Overflow Developer Survey 2025:
| Language | % of Tasks | Rationale |
|---|---|---|
| TypeScript/JavaScript | 40% | #1 on GitHub, dominant in web dev |
| Python | 25% | #2 on GitHub, AI/ML leader |
| Go | 15% | Rising for cloud-native, microservices |
| Java/Kotlin | 10% | Enterprise, Android development |
| Rust | 5% | Systems programming, performance-critical |
| C# | 5% | Enterprise, game development |
Each task is a self-contained directory:
```
tasks/saas-core/auth/supabase-oauth/
├── task.yaml              # Metadata, constraints
├── PROMPT.md              # Instructions for the agent
├── tests/                 # Evaluation tests
│   └── auth.test.ts       # Playwright E2E tests
├── docker-compose.yaml    # Services (DB, mock APIs)
└── golden/                # Reference implementation (optional)
```
Hot-reload support: Add new tasks while the benchmark is running!
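For orientation, here is a minimal TypeScript sketch of how a harness might load and validate `task.yaml`. The schema mirrors the fields shown in the contributing example below (`name`, `category`, `difficulty`, `stack`, `tags`); the loader itself is an assumption, not the repo's actual code.

```typescript
// Hypothetical sketch: loading task.yaml. The TaskSpec fields mirror the
// contributing example below; everything else here is an assumption.
import { readFileSync } from "node:fs";
import { load } from "js-yaml";

interface TaskSpec {
  name: string;
  category: string;                        // e.g. "saas-core"
  difficulty: "easy" | "medium" | "hard";
  stack: string;                           // starter template, e.g. "nextjs-supabase"
  tags: string[];                          // e.g. ["typescript", "auth"]
}

function loadTask(dir: string): TaskSpec {
  const spec = load(readFileSync(`${dir}/task.yaml`, "utf8")) as Partial<TaskSpec>;
  if (!spec.name || !spec.category) {
    throw new Error(`Invalid task.yaml in ${dir}`);
  }
  return spec as TaskSpec;
}
```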
We measure what senior engineers care about:
| Dimension | Weight | Method | Why It Matters |
|---|---|---|---|
| Functional | 40% | Playwright E2E, Pass@k | Does it work? |
| Visual | 20% | Pixel diff vs reference | Does it look right? |
| Quality | 20% | ESLint + Semgrep + complexity | Is it maintainable? |
| Cost | 10% | Tokens used, context pollution | Is it efficient? |
| Speed | 10% | Wall-clock time, step count | Is it fast? |
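On the Functional row, Pass@k presumably refers to the standard unbiased estimator from Chen et al. (2021), where each task gets $n$ samples of which $c$ pass; whether this harness uses that exact estimator is an assumption:

$$\text{pass@}k = \mathbb{E}_{\text{tasks}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$$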
Any Critical/High vulnerability = automatic fail. We use Semgrep with OWASP rules.
```
Final = (Functional × 0.4) + (Visual × 0.2) + (Quality × 0.2)
        − (Cost Penalty) − (Speed Penalty)

Security Fail → Final = 0
```
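In TypeScript, a minimal sketch of that rule. Only the weights and the security gate are specified above; the shape and magnitude of the penalty inputs are assumptions.

```typescript
// Minimal sketch of the published scoring rule. The penalty ranges are
// assumptions; only the weights and the security gate are specified above.
interface TaskResult {
  functional: number;    // 0..1, from Playwright E2E / Pass@k
  visual: number;        // 0..1, pixel diff vs reference
  quality: number;       // 0..1, ESLint + Semgrep + complexity
  costPenalty: number;   // assumed 0..0.1, from tokens used
  speedPenalty: number;  // assumed 0..0.1, from wall-clock time and steps
  securityFail: boolean; // any Critical/High Semgrep finding
}

function finalScore(r: TaskResult): number {
  if (r.securityFail) return 0; // security gate: automatic fail
  const score =
    r.functional * 0.4 + r.visual * 0.2 + r.quality * 0.2
    - r.costPenalty - r.speedPenalty;
  return Math.max(0, score);
}
```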
| Agent | Model | Status | Config | Pricing (Input/Output per MTok) |
|---|---|---|---|---|
| Claude | Haiku 4.5 | ✅ Supported | ANTHROPIC_API_KEY | $1.00 / $5.00 |
| Claude | Opus 4.5 | ✅ Supported | ANTHROPIC_API_KEY | $5.00 / $25.00 |
| Qwen | Qwen3-Max | ✅ Supported | QWEN_API_KEY | $1.20 / $6.00 |
| GLM | GLM-4.7 | ✅ Supported | GLM_API_KEY | $0.60 / $2.20 |
| MiniMax | M2.1 | ✅ Supported | MINIMAX_API_KEY | $0.30 / $1.20 |
| OpenAI | GPT-5.2 | ✅ Supported | OPENAI_API_KEY | $1.75 / $14.00 |
| DeepSeek | Chat-v3 | ✅ Supported | DEEPSEEK_API_KEY | $0.40 / $1.60 |
| Gemini | 3-Flash Preview | ✅ Supported | GOOGLE_API_KEY | $0.50 / $3.00 |
📈 LEADERBOARD (2026-01-27) - 180 tasks evaluated, 15 models

```
╔══════╤══════════════════════╤═══════╤═══════════╤════════════╤════════════╤══════════════╤═════════════╗
║ Rank │ Model                │ Final │ Pass Rate │ Total Cost │ Total Time │ Avg Time/Task│ Total Tokens║
╟──────┼──────────────────────┼───────┼───────────┼────────────┼────────────┼──────────────┼─────────────╢
║ #1   │ Claude Opus 4.5      │ 89.2% │ 100.0%    │ $12.31     │ 2h 12m     │ 44s          │ 648K        ║
║ #2   │ Claude Haiku 4.5     │ 89.0% │ 99.4%     │ $3.03      │ 1h 5m      │ 22s          │ 798K        ║
║ #3   │ Grok 4 Fast          │ 88.8% │ 98.9%     │ $0.21      │ 1h 57m     │ 70s          │ 520K        ║
║ #4   │ OpenAI GPT-5.2       │ 88.8% │ 98.3%     │ $5.01      │ 1h 24m     │ 28s          │ 485K        ║
║ #5   │ Qwen3 Max            │ 88.6% │ 100.0%    │ $5.42      │ 2h 15m     │ 45s          │ 949K        ║
║ #6   │ Claude Sonnet 4.5    │ 88.6% │ 98.3%     │ $6.98      │ 2h 6m      │ 42s          │ 612K        ║
║ #7   │ GLM 4-Plus           │ 88.2% │ 98.9%     │ $0.93      │ 4h 49m     │ 96s          │ 794K        ║
║ #8   │ DeepSeek v3.2        │ 88.2% │ 98.3%     │ $0.50      │ 4h 29m     │ 90s          │ 543K        ║
║ #9   │ Grok 4               │ 88.0% │ 97.8%     │ $5.47      │ 2h 5m      │ 75s          │ 480K        ║
║ #10  │ MiniMax M2.1         │ 87.4% │ 99.4%     │ $2.40      │ 8h 15m     │ 165s         │ 2.78M       ║
║ #11  │ Grok 4.1 Fast        │ 86.8% │ 97.2%     │ $0.24      │ 2h 27m     │ 89s          │ 580K        ║
║ #12  │ Gemini 3 Pro Preview │ 85.8% │ 95.0%     │ $10.34     │ 1h 36m     │ 32s          │ 738K        ║
║ #13  │ GLM-4.7              │ 83.9% │ 85.6%     │ $0.73      │ 2h 50m     │ 57s          │ 623K        ║
║ #14  │ GLM 4.7 Flash        │ 83.8% │ 92.8%     │ $1.11      │ 2h 15m     │ 45s          │ 650K        ║
║ #15  │ Gemini 3 Flash       │ 83.4% │ 92.2%     │ $0.86      │ 1h 23m     │ 28s          │ 384K        ║
╚══════╧══════════════════════╧═══════╧═══════════╧════════════╧════════════╧══════════════╧═════════════╝
```
| Model | Input $/M | Output $/M |
|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Qwen3 Max | $1.20 | $6.00 |
| OpenAI GPT-5.2 | $1.75 | $14.00 |
| Grok 4 | $3.00 | $15.00 |
| Grok 4 Fast | $0.20 | $0.50 |
| Grok 4.1 Fast | $0.20 | $0.50 |
| GLM 4-Plus/4.7 | $0.40 | $1.50 |
| GLM 4.7 Flash | $0.07 | $0.40 |
| DeepSeek v3.2 | $0.30 | $1.20 |
| Gemini 3 Flash | $0.50 | $3.00 |
| Gemini 3 Pro | $2.00 | $12.00 |
| MiniMax M2.1 | $0.27 | $1.12 |
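To connect these rates to the leaderboard's Total Cost column, a small TypeScript sketch of the arithmetic. The input/output token split below is an assumption chosen to be consistent with the Claude Haiku 4.5 row (798K tokens, $3.03); the leaderboard itself only reports totals.

```typescript
// Cost from per-MTok rates. The 240K/558K input/output split is an
// assumption consistent with the Haiku 4.5 row; only totals are published.
function runCost(
  inputTokens: number, outputTokens: number,
  inputPerMTok: number, outputPerMTok: number,
): number {
  return (inputTokens / 1e6) * inputPerMTok + (outputTokens / 1e6) * outputPerMTok;
}

// Claude Haiku 4.5 at $1.00 / $5.00 per MTok:
console.log(runCost(240_000, 558_000, 1.0, 5.0)); // ≈ 3.03
```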
| Model | Functional | Quality | Cost/Task | Tokens/Task |
|---|---|---|---|---|
| Claude Opus 4.5 | 85.0% | 80.0% | $0.0684 | 3,599 |
| Claude Haiku 4.5 | 84.5% | 79.6% | $0.0168 | 4,435 |
| Grok 4 Fast | 84.1% | 80.0% | $0.0012 | 2,889 |
| Qwen3 Max | 85.0% | 80.0% | $0.0301 | 5,273 |
| OpenAI GPT-5.2 | 83.6% | 79.6% | $0.0278 | 2,694 |
| Claude Sonnet 4.5 | 83.6% | 80.0% | $0.0388 | 3,400 |
| GLM 4-Plus | 84.1% | 80.0% | $0.0052 | 4,412 |
| DeepSeek v3.2 | 83.6% | 80.0% | $0.0028 | 3,015 |
| Grok 4 | 83.6% | 80.0% | $0.0304 | 2,667 |
| MiniMax M2.1 | 84.5% | 80.0% | $0.0133 | 15,436 |
| Grok 4.1 Fast | 82.6% | 78.7% | $0.0013 | 3,222 |
| Gemini 3 Pro Preview | 80.8% | 77.3% | $0.0574 | 4,099 |
| GLM-4.7 | 72.7% | 79.6% | $0.0041 | 3,464 |
| GLM 4.7 Flash | 78.9% | 79.6% | $0.0062 | 3,611 |
| Gemini 3 Flash | 78.4% | 75.1% | $0.0048 | 2,133 |
Live Dashboard: https://vibecoding.llmbench.xyz
We welcome contributions! See CONTRIBUTING.md for details.
1. Create a task directory:

   ```bash
   mkdir -p tasks/<category>/<subcategory>/<task-name>
   ```

2. Add `task.yaml`:

   ```yaml
   name: My New Task
   category: saas-core
   difficulty: medium
   stack: nextjs-supabase
   tags: [typescript, auth]
   ```

3. Write `PROMPT.md` with clear requirements.

4. Add tests (Playwright for web, pytest for Python); a minimal sketch follows this list.

5. Submit a PR using the template.
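For step 4, here is a minimal Playwright sketch of the kind of E2E check a task might ship. The route, button label, and redirect are placeholders, not taken from any real task in this repo.

```typescript
// tests/auth.test.ts — hypothetical Playwright E2E sketch; the route,
// selector, and assertion are placeholders, not from a real task.
import { test, expect } from "@playwright/test";

test("user can sign in via OAuth", async ({ page }) => {
  await page.goto("http://localhost:3000/login");
  await page.getByRole("button", { name: "Continue with GitHub" }).click();
  // With the mock OAuth provider, expect a redirect to the dashboard.
  await expect(page).toHaveURL(/\/dashboard/);
});
```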
```
vibecodingbench/
├── packages/
│   ├── cli/               # CLI tool
│   ├── evaluator/         # Scoring engine
│   └── leaderboard/       # Web dashboard
├── tasks/                 # 180 benchmark tasks
│   ├── saas-core/         # 30 tasks
│   ├── glue-code/         # 30 tasks
│   ├── ai-integration/    # 30 tasks
│   ├── frontend/          # 30 tasks
│   ├── api-integrations/  # 30 tasks
│   └── code-evolution/    # 30 tasks
├── templates/             # Starter codebases
│   ├── nextjs-supabase/
│   ├── fastapi-postgres/
│   ├── go-fiber/
│   └── rust-axum/
└── docker/                # Base images
```
```bash
# Build and run the production stack
./scripts/deploy.sh docker

# Or run in the background
./scripts/deploy.sh docker --detach

# Services available at:
# - Dashboard: http://localhost:3000
# - API:       http://localhost:3001
```

```bash
cd packages/leaderboard
fly launch --config fly.toml
fly deploy
```

```bash
# Required
export ANTHROPIC_API_KEY=...   # Claude (Anthropic)
export OPENAI_API_KEY=...      # OpenAI
export GOOGLE_API_KEY=...      # Gemini (Google AI)

# Optional
export GLM_API_KEY=...         # GLM (Zhipu AI)
export MINIMAX_API_KEY=...     # MiniMax
export QWEN_API_KEY=...        # Qwen (Alibaba DashScope)
export DEEPSEEK_API_KEY=...    # DeepSeek
```

If you use VibeCodingBench in your research, please cite:
```bibtex
@software{vibecodingbench2025,
  title = {VibeCodingBench: A Benchmark for AI Coding Agents on Real-World Developer Tasks},
  year = {2025},
  url = {https://github.com/alt-research/vibe-coding-benchmark-public}
}
```

Built with ❤️ by the open-source community