feat: SWE-bench submission format export #979

@christso

Objective

Add an export adapter that converts AgentV results to the SWE-bench submission format, so that runs can be submitted to the SWE-bench leaderboard.

Motivation

AgentV's Docker workspace feature (#971) makes it capable of running SWE-bench evaluations. However, the official SWE-bench leaderboard only accepts results in its own submission format (all_preds.jsonl + metadata.yaml + trajs/). A built-in exporter bridges this gap.

Design

Following the "Lightweight Core, Plugin Extensibility" principle, this should be a CLI command or export adapter, not a core feature.

Proposed interface

agentv export --format swe-bench <results-dir> --output <submission-dir>
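The shape of this command can be sketched with Python's argparse. This is only an illustration of the argument layout, not the agentv CLI itself; the parser structure and the destination name `results_dir` are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the proposed `agentv export` interface."""
    parser = argparse.ArgumentParser(prog="agentv")
    subcommands = parser.add_subparsers(dest="command", required=True)

    export = subcommands.add_parser("export", help="Export results to an external format")
    # Only swe-bench is proposed in this issue; other formats would extend `choices`.
    export.add_argument("--format", choices=["swe-bench"], required=True)
    export.add_argument("results_dir", help="Directory containing agentv results")
    export.add_argument("--output", required=True, help="Submission directory to create")
    return parser


# Parse the example invocation from the proposal.
args = build_parser().parse_args(
    ["export", "--format", "swe-bench", "results/", "--output", "submission/"]
)
```

Restricting `--format` to a fixed set of choices keeps unsupported formats from failing late, after results have already been read.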

Output structure

submission/
├── all_preds.jsonl          # {instance_id, model_name_or_path, model_patch}
├── metadata.yaml            # Agent scaffold description
├── README.md                # Auto-generated from config
├── trajs/                   # Agent trajectories
│   └── <instance_id>.json   # From agentv trace data
└── results.json             # Converted from index.jsonl
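A minimal sketch of writing the two central pieces of this layout, all_preds.jsonl and trajs/, is shown here. The function name `write_submission` and the in-memory shapes of `predictions` and `trajectories` are assumptions for illustration; metadata.yaml, README.md, and results.json are left out.

```python
import json
from pathlib import Path


def write_submission(out_dir: str, predictions: list[dict], trajectories: dict[str, object]) -> Path:
    """Write all_preds.jsonl and trajs/<instance_id>.json under out_dir (hypothetical helper)."""
    out = Path(out_dir)
    trajs = out / "trajs"
    trajs.mkdir(parents=True, exist_ok=True)

    # One JSON object per line, as the .jsonl extension implies.
    with (out / "all_preds.jsonl").open("w", encoding="utf-8") as handle:
        for prediction in predictions:
            handle.write(json.dumps(prediction) + "\n")

    # One trajectory file per instance, named after its instance_id.
    for instance_id, trajectory in trajectories.items():
        (trajs / f"{instance_id}.json").write_text(
            json.dumps(trajectory, indent=2), encoding="utf-8"
        )
    return out
```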

Mapping (agentv → SWE-bench)

| AgentV Field          | SWE-bench Field          |
|-----------------------|--------------------------|
| test_id               | instance_id              |
| unified_diff / output | model_patch              |
| target                | model_name_or_path       |
| trace_summary         | trajs/<instance_id>.json |
| score, scores[]       | results.json             |
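The first three rows of the mapping can be sketched as a small conversion function. The function name `to_prediction` is hypothetical, and the precedence between `unified_diff` and `output` is an assumption: the mapping lists both without saying which wins, so this sketch prefers `unified_diff` and falls back to `output`.

```python
def to_prediction(result: dict) -> dict:
    """Convert one agentv result record into a SWE-bench prediction (illustrative only)."""
    # Assumption: prefer unified_diff when present, otherwise use the raw output.
    patch = result.get("unified_diff") or result.get("output") or ""
    return {
        "instance_id": result["test_id"],
        "model_name_or_path": result["target"],
        "model_patch": patch,
    }


# Example record with made-up values, shaped after the field names in the mapping.
example = {
    "test_id": "example__repo-123",
    "target": "my-agent",
    "unified_diff": "--- a/file.py\n+++ b/file.py\n",
    "output": "free-form agent output",
}
prediction = to_prediction(example)
```

An empty patch is emitted rather than raising when neither field is present, so that a failed instance still appears in all_preds.jsonl; whether that is the desired behavior is a design decision for the exporter.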

Acceptance Criteria

  • agentv export --format swe-bench produces a valid SWE-bench submission directory
  • Generates all_preds.jsonl with the correct instance_id → model_patch mapping
  • Generates trajectory files from trace data
  • Auto-generates metadata.yaml from the agentv config
  • The generated submission passes SWE-bench leaderboard validation
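A local pre-check on all_preds.jsonl could catch the most basic problems before submission. The sketch below is not the official SWE-bench validation; it only checks the three keys named in the output structure and flags duplicate instance IDs, and the function name `check_predictions` is hypothetical.

```python
import json

# Keys listed for all_preds.jsonl in the output structure above.
REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def check_predictions(lines: list[str]) -> list[str]:
    """Return a list of human-readable problems found in all_preds.jsonl lines."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"line {number}: missing keys {sorted(missing)}")
        instance_id = record.get("instance_id")
        if instance_id is not None:
            if instance_id in seen_ids:
                problems.append(f"line {number}: duplicate instance_id {instance_id!r}")
            seen_ids.add(instance_id)
    return problems
```

An empty return value means no problems were found by this check; it says nothing about whether the patches apply or the leaderboard accepts the submission.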

Non-goals

  • Auto-submitting to leaderboard (just generate the format)
