[Feature]: SWE-bench using mini-swe-agent #310

Description

@tianmu-li

Motivation

Enable SWE-bench accuracy evaluation, aligned with the performance dataset to be used in the agentic inference benchmark.

Proposed Solution

Call mini-swe-agent directly with a modified version of https://github.com/SWE-agent/mini-swe-agent/blob/main/src/minisweagent/config/benchmarks/swebench.yaml that supports a custom model and sampling parameters. Wait for the run to finish, then use SWE-bench to verify the pass rate and collect results.
Proposal: start with the princeton-nlp/SWE-bench_Lite dev split (23 samples), then extend to the test split (300 samples) or SWE-bench_Verified as needed.
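
A minimal sketch of the two-step flow described above (run the agent, then score the predictions), assuming the `mini-extra swebench` batch entry point and the `swebench.harness.run_evaluation` module are installed. The flag names should be checked against each tool's `--help` output. The helper names, the `swebench_custom.yaml` config file, the `my-model` model name, and the `preds.json` output location are hypothetical placeholders, not part of this proposal.

```python
import subprocess
from pathlib import Path

DATASET = "princeton-nlp/SWE-bench_Lite"


def agent_command(config_path, model, output_dir, split="dev", workers=4):
    """Build the mini-swe-agent batch command.

    Flag names are assumptions; verify with `mini-extra swebench --help`.
    """
    return [
        "mini-extra", "swebench",
        "--config", str(config_path),   # modified swebench.yaml (custom model/sampling)
        "--model", model,
        "--subset", "lite",
        "--split", split,
        "--workers", str(workers),
        "--output", str(output_dir),
    ]


def eval_command(predictions_path, run_id, split="dev", max_workers=4):
    """Build the SWE-bench harness command that scores the predictions."""
    return [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", DATASET,
        "--split", split,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]


def pass_rate(resolved, total):
    """Fraction of instances resolved, e.g. out of the 23 dev-split samples."""
    if total <= 0:
        raise ValueError("total must be positive")
    return resolved / total


def run_pipeline(config_path, model, output_dir, run_id, split="dev"):
    """Run the agent, wait for it to finish, then run the evaluation."""
    output_dir = Path(output_dir)
    # check=True blocks until the process exits and raises on a non-zero status.
    subprocess.run(agent_command(config_path, model, output_dir, split), check=True)
    # Assumption: the agent writes its predictions to <output_dir>/preds.json.
    subprocess.run(eval_command(output_dir / "preds.json", run_id, split), check=True)


# Hypothetical usage:
# run_pipeline("swebench_custom.yaml", "my-model", "results/lite_dev", "lite-dev-run")
```

Building the commands as argument lists keeps them easy to log and unit-test, and moving to the test split or SWE-bench_Verified would mainly mean changing `split`, `--subset`, and `DATASET`.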

Alternatives Considered

No response

Additional Context

No response
