Open
40 commits
9275c48
feat: Add SWE-bench accuracy evaluation via mini-swe-agent
tianmu-li Jun 5, 2026
4976431
fix: write swe-bench eval outputs under report_dir, add setup runbook
tianmu-li Jun 5, 2026
c0ba830
refactor: align SWEBenchScorer with VBench subproject pattern
tianmu-li Jun 5, 2026
c607cd4
test: add SWE-bench accuracy smoke test config
tianmu-li Jun 5, 2026
2103f09
test: verify SWEBenchScorer compatibility with multi-turn perf dataset
tianmu-li Jun 5, 2026
6010d31
fix: disallow swe_bench as perf dataset; use explicit JSONL for perf …
tianmu-li Jun 5, 2026
6f42f3a
fix: address SWEBenchScorer code review findings
tianmu-li Jun 5, 2026
9317e36
feat: add SWEBenchScorer preflight checks before benchmark starts
tianmu-li Jun 5, 2026
57c30f9
Remove unintended changes
tianmu-li Jun 5, 2026
3ed5b86
feat(swe-bench): set opinionated defaults for SWEBenchScorer
tianmu-li Jun 5, 2026
105784c
test: add coverage for SWEBenchScorer subprocess error paths
tianmu-li Jun 5, 2026
7972782
chore: remove swe_bench smoke YAML examples
tianmu-li Jun 5, 2026
f262705
refactor: trim SWEBenchScorer docstrings to match repo style
tianmu-li Jun 5, 2026
fabd09b
test: simplify SWEBenchScorer tests
tianmu-li Jun 5, 2026
1fe1e3e
fix: skip endpoint accuracy phase for SWEBenchScorer; resolve report_…
tianmu-li Jun 7, 2026
79ec94d
fix: skip sample_index_map load for SKIP_ENDPOINT_PHASE scorers
tianmu-li Jun 7, 2026
6a39053
fix: address code review findings in SWEBenchScorer and related code
tianmu-li Jun 14, 2026
bb9b307
fix: wire preflight() call and fix post-rebase issues
tianmu-li Jun 15, 2026
e724bd7
fix: address code review issues in SWEBenchScorer branch
tianmu-li Jun 15, 2026
1e542c9
chore: move SWE-bench files into examples/09_MultiTurn/
tianmu-li Jun 8, 2026
c3c6921
fix: add parser mapping for dummy_1k.jsonl perf dataset in swe_bench_…
tianmu-li Jun 8, 2026
e7d1f65
Trim example doc and yaml
tianmu-li Jun 8, 2026
4dcabe7
fix: restore import order in dataset.py and shopify __init__.py
tianmu-li Jun 8, 2026
1665131
fix: address post-cherry-pick review findings
tianmu-li Jun 15, 2026
37a0343
chore: move SWE-bench files from 09_MultiTurn into 10_Agentic_Inference
tianmu-li Jun 15, 2026
20cc392
fix: address code review findings for SWEBenchScorer
tianmu-li Jun 15, 2026
1bcdb63
fix: address code review findings for SWEBenchScorer (round 2)
tianmu-li Jun 15, 2026
c407508
refactor: simplify and consolidate SWE-bench tests
tianmu-li Jun 15, 2026
662ffe3
feat: forward max_new_tokens to LiteLLM as max_tokens in SWEBenchScorer
tianmu-li Jun 15, 2026
0d3216c
feat: stream SWEBenchScorer subprocess output and clear stale output dir
tianmu-li Jun 16, 2026
e9a1d56
feat: use PTY for SWEBenchScorer subprocess output and suppress LiteL…
tianmu-li Jun 16, 2026
0e56db2
feat: replace PTY passthrough with tqdm progress bar for SWEBenchScorer
tianmu-li Jun 16, 2026
5c5b662
Update benchmark yaml
tianmu-li Jun 16, 2026
08ef760
fix: forward endpoint api_key to SWEBenchScorer mini-swe-agent config
tianmu-li Jun 16, 2026
d1ff52e
Update standalone yaml; increase timeout
tianmu-li Jun 21, 2026
7855120
Merge branch 'main' into feat/swe_bench_scorer
tianmu-li Jun 21, 2026
1b978b7
Address review comments
tianmu-li Jun 21, 2026
6895b46
Merge remote-tracking branch 'origin/feat/swe_bench_scorer' into feat…
tianmu-li Jun 21, 2026
2aa6ea7
fix: address swe-bench review follow-ups
tianmu-li Jun 23, 2026
a0a2820
fix: pre-pull swe-bench images in scorer preflight
tianmu-li Jun 24, 2026
26 changes: 26 additions & 0 deletions examples/10_Agentic_Inference/README.md
@@ -194,3 +194,29 @@ Update the first `datasets` entry (`name` and `path`), `model_params.name`, and
uv run inference-endpoint benchmark from-config \
  --config examples/10_Agentic_Inference/kimi_agentic_benchmark.yaml
```

## SWE-bench Accuracy

`swe_bench_accuracy.yaml` runs the SWE-bench accuracy evaluation alongside a
minimal performance dataset. The benchmark framework skips its built-in
accuracy phase for this dataset; instead, `SWEBenchScorer` shells out to
`mini-swe-agent` and the `swebench` evaluation harness, and that external flow
drives requests to the configured endpoint.

The isolated `uv` environment for those tools lives in `accuracy/`. Sync it
once before running:

```bash
cd examples/10_Agentic_Inference/accuracy
uv sync
```

Then run the benchmark from the repo root:

```bash
uv run inference-endpoint benchmark from-config \
  --config examples/10_Agentic_Inference/swe_bench_accuracy.yaml
```

See `accuracy/RUNBOOK.md` for preconditions, sanity checks, and common failure
modes.
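
As a rough illustration of the "shells out" step, the sketch below builds `uv run --project` command lines for the two external tools. It is not taken from `SWEBenchScorer`: the helper name `build_uv_command` and the `DEFAULT_PROJECT` constant are hypothetical, and only `uv run --project`, `mini-extra`, and `python -m swebench.harness.run_evaluation` come from this change.

```python
from pathlib import Path

# Hypothetical default location of the isolated uv project (an assumption,
# not necessarily how the real scorer resolves it).
DEFAULT_PROJECT = Path("examples/10_Agentic_Inference/accuracy")


def build_uv_command(tool_args: list[str], project: Path = DEFAULT_PROJECT) -> list[str]:
    """Return an argv list that runs `tool_args` inside the isolated uv project."""
    return ["uv", "run", "--project", project.as_posix(), *tool_args]


# The same two sanity-check invocations the runbook uses, expressed as argv lists.
agent_help = build_uv_command(["mini-extra", "--help"])
harness_help = build_uv_command(
    ["python", "-m", "swebench.harness.run_evaluation", "--help"]
)
print(" ".join(agent_help))
print(" ".join(harness_help))
```

Building an argv list rather than a shell string keeps the call usable with `subprocess.run(...)` without shell quoting, and the parent process never has to import the tools themselves.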
54 changes: 54 additions & 0 deletions examples/10_Agentic_Inference/accuracy/RUNBOOK.md
@@ -0,0 +1,54 @@
# SWE-bench Accuracy Smoke-Test Runbook

End-to-end validation for the SWE-bench accuracy pipeline. Unit tests mock all
subprocesses, so running the real pipeline is the only way to catch Docker,
HuggingFace access, or mini-swe-agent wiring issues.

## 0. Preconditions

- Docker daemon running (swebench harness spawns one container per instance).
- Docker Hub auth or a pre-seeded image cache for uncached SWE-bench images.
- Network egress to PyPI and HuggingFace Hub.
- `uv` binary on PATH (`curl -LsSf https://astral.sh/uv/install.sh | sh`).
- Parent endpoints env already synced (`uv sync --extra dev` from repo root).

## 1. Sync the accuracy subproject

From the repo root:

```bash
cd examples/10_Agentic_Inference/accuracy
uv sync
```

Sanity check:

```bash
uv run mini-extra --help
uv run python -m swebench.harness.run_evaluation --help
```

Override the default subproject path via env var if needed:

```bash
export SWE_BENCH_PROJECT_PATH=/path/to/examples/10_Agentic_Inference/accuracy
```

## 2. End-to-end test (requires live endpoint)

From the repo root:

```bash
uv run inference-endpoint benchmark from-config \
  --config examples/10_Agentic_Inference/swe_bench_accuracy.yaml
```

Scorer preflight resolves the requested SWE-bench instances and pre-pulls any
missing Docker images before `mini-extra swebench` starts. Cached images are
skipped.
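
The skip-if-cached behavior can be approximated with two standard Docker CLI calls: `docker image inspect`, which exits non-zero when an image is not present locally, and `docker pull`. The sketch below is a minimal illustration under that assumption, not the scorer's implementation; `pre_pull` and its `run` parameter are hypothetical, and the error text simply mirrors the symptom listed in the failure table.

```python
import subprocess
from collections.abc import Callable, Iterable

Runner = Callable[[list[str]], int]


def _run(argv: list[str]) -> int:
    """Run a command with output suppressed and return its exit code."""
    return subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def pre_pull(images: Iterable[str], run: Runner = _run) -> list[str]:
    """Pull images missing from the local cache; return the ones that were pulled."""
    pulled: list[str] = []
    for image in images:
        # Exit code 0 means the image already exists locally, so skip it.
        if run(["docker", "image", "inspect", image]) == 0:
            continue
        if run(["docker", "pull", image]) != 0:
            raise RuntimeError(
                f"Failed to pre-pull required SWE-bench Docker image: {image}"
            )
        pulled.append(image)
    return pulled
```

Failing here, before the agent run starts, surfaces Docker Hub rate limits or missing auth early instead of partway through the evaluation.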

## Common failure modes

| Symptom | Likely cause | Fix |
| ---------------------------------------------------- | ------------------------------------- | --------------------------------------------------------- |
| `FileNotFoundError: SWE-bench subproject not found` | subproject not synced | Run `uv sync` in `examples/10_Agentic_Inference/accuracy` |
| Docker error during `run_evaluation` | Docker daemon not running | Start Docker and retry |
| `Failed to pre-pull required SWE-bench Docker image` | Docker Hub rate limit or missing auth | Run `docker login` or use a local image cache/mirror |
29 changes: 29 additions & 0 deletions examples/10_Agentic_Inference/accuracy/pyproject.toml
@@ -0,0 +1,29 @@
# Isolated uv project for the SWE-bench accuracy evaluator.
#
# mini-swe-agent and swebench pin specific versions of litellm, docker,
# and other packages that are not part of the parent endpoints env. Keeping
# the swebench env separate means the parent lockfile stays solvable and
# the evaluation env stays reproducible.
#
# `inference_endpoint.evaluation.scoring.SWEBenchScorer` invokes
# mini-extra and swebench.harness.run_evaluation via `uv run --project`,
# so the main benchmark process never needs to import these packages.
#
# Usage on the accuracy host:
# cd examples/10_Agentic_Inference/accuracy
# uv sync
# # SWEBenchScorer in the parent will shell out automatically.

[project]
name = "swe-bench-accuracy"
version = "0.1.0"
description = "Isolated SWE-bench accuracy environment for the multi-turn agentic benchmark."
requires-python = ">=3.12"
dependencies = [
    "mini-swe-agent==2.3.0",
    "swebench==4.1.0",
]

[tool.uv]
# Script-runner env: no build, no install of this project itself.
package = false
7 changes: 7 additions & 0 deletions examples/10_Agentic_Inference/kimi_agentic_benchmark.yaml
@@ -23,6 +23,13 @@ datasets:
num_trajectories_to_issue: 990 # Should be integer multiple of 990.
# Required benchmark default; set to true only for faster optimization/debug runs.
stop_issuing_on_first_user_complete: false
- name: swe_bench
type: "accuracy"
accuracy_config:
eval_method: "swe_bench_scorer"
num_repeats: 1
extras:
num_instances: 200

settings:
runtime:
48 changes: 48 additions & 0 deletions examples/10_Agentic_Inference/qwen_agentic_benchmark.yaml
@@ -0,0 +1,48 @@
name: "qwen-agentic-benchmark"
version: "1.0"
type: "online"

model_params:
name: "Qwen/Qwen3.6-35B-A3B"
temperature: 1.0
top_k: 20
top_p: 0.95
repetition_penalty: 1.0
presence_penalty: 1.5
max_new_tokens: 8192
chat_template_kwargs:
preserve_thinking: true

datasets:
- name: agentic_coding
type: performance
path: /path/to/agentic_combined.jsonl
accuracy_config:
eval_method: agentic_inference_inline # required benchmark default.
agentic_inference:
turn_timeout_s: 14400.0
enable_salt: true # do not change.
inject_tool_delay: true # do not change.
- name: swe_bench
type: "accuracy"
accuracy_config:
eval_method: "swe_bench_scorer"
num_repeats: 1
extras:
num_instances: 200

settings:
runtime:
min_duration_ms: 0
max_duration_ms: 36000000

load_pattern:
type: agentic_inference
target_concurrency: 8 # Submission-specific concurrency.

endpoint_config:
endpoints:
- "http://localhost:30000"
api_type: openai

report_dir: logs/qwen_agentic
42 changes: 42 additions & 0 deletions examples/10_Agentic_Inference/swe_bench_accuracy.yaml
@@ -0,0 +1,42 @@
type: "online"

model_params:
name: "Qwen/Qwen3.6-35B-A3B"
temperature: 1.0
top_p: 0.95
top_k: 20
repetition_penalty: 1.0
presence_penalty: 1.5
max_new_tokens: 8192
chat_template_kwargs:
preserve_thinking: true

datasets:
# Minimal performance dataset required by the framework.
- name: swe_bench_perf
type: "performance"
path: "tests/assets/datasets/dummy_1k.jsonl"
parser:
prompt: text_input

# Accuracy dataset — instance_id rows tell mini-swe-agent which instances to run.
# First run downloads ~10 MB from HuggingFace and caches to datasets_dir.
- name: swe_bench
type: "accuracy"
accuracy_config:
eval_method: "swe_bench_scorer"
num_repeats: 1
extras:
num_instances: 200

settings:
load_pattern:
type: "concurrency"
target_concurrency: 10 # mini-extra inherits target_concurrency from performance dataset
runtime:
n_samples_to_issue: 10

endpoint_config:
endpoints:
- "http://localhost:30000"
api_type: "openai"