omxyz/jina-test-swt-benchmark

Jina Test — SWT-Bench Lite Results

GPT-5.4 + Jina Test on SWT-Bench Lite (Unit Test Mode).

Results

| Metric | Value |
|---|---|
| Success rate (𝒮) | 63.6% (175/275 resolved) |
| Coverage delta (Δ𝒞) | 51.8% |
| Mean coverage | 63.3% |
| Completed | 247 |
| Errors | 28 |
| Dataset | SWT-Bench Lite |
| Mode | Unit Test |
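
The headline numbers above can be sanity-checked with a few lines of arithmetic. The only assumption is that every instance either completed or errored:

```python
# Sanity check of the results table: values copied from the table above.
resolved, total = 175, 275
completed, errors = 247, 28

# Assumption: completed + errored instances account for all 275.
assert completed + errors == total
print(f"Success rate: {resolved / total:.1%}")  # Success rate: 63.6%
```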

Leaderboard Comparison (Unit Test Mode)

| Rank | System | 𝒮 | Δ𝒞 |
|---|---|---|---|
| 🥇 | GPT-5.4 + Jina Test | 63.6% | 51.8% |
| 🥈 | e-Otter++ (IBM) | 52.5% | 56.4% |
| 🥉 | Amazon Q Developer (AWS) | 39.9% | 52.7% |
| 4 | AssertFlip (University of Waterloo) | 38.0% | 44.2% |

Repository Layout

```
.
├── README.md                     ← this file
├── submission/                   ← files ready for submit@swtbench.com
│   ├── predictions.jsonl         ← 275 predictions in SWT-Bench format
│   ├── predictions.zip           ← zipped predictions for email
│   ├── report.json               ← local SWT-Bench eval report
│   ├── metadata.json             ← run configuration
│   └── SUBMISSION.md             ← approach + reproduction summary
├── results/
│   └── swt_lite/
│       ├── summary.json          ← headline numbers
│       ├── instance_ids.json     ← per-instance resolved/unresolved/error IDs
│       └── cost_report.jsonl     ← per-instance LLM cost
└── prompts/
    └── unittest.j2               ← the prompt used for test generation
```

Approach

Jina Test is a test generation agent built around OpenAI GPT-5.4 with a structured, verify-before-submit workflow. For each SWT-Bench instance:

  1. Understand — read the issue, explore the repo, locate relevant source and test files, identify the test framework.
  2. Locate the bug — narrow down the exact module / class / function that is broken.
  3. Write minimal failing tests — add tests to the project's existing test suite, following its conventions (pytest / Django test / unittest / etc.).
  4. Verify failure — run the tests and confirm they fail on the buggy code for the right reason. Iterate if they unexpectedly pass.
  5. No source edits — only test files are modified.
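
The control flow of the five steps above can be sketched as a retry loop. Here `write_tests` and `run_tests` are hypothetical stand-ins for the agent's real tool calls; only the structure mirrors the workflow:

```python
def generate_failing_tests(instance, write_tests, run_tests, max_attempts=5):
    """Retry until the generated tests fail on the buggy code (steps 1-4)."""
    for attempt in range(max_attempts):
        patch = write_tests(instance, attempt)  # steps 1-3: explore, locate, write tests
        result = run_tests(patch)               # step 4: verify failure on buggy code
        if result == "fail":
            return patch                        # step 5: patch touches only test files
    return None                                 # give up; instance counted as an error
```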

Model Configuration

  • Base model: OpenAI GPT-5.4
  • Reasoning: extended thinking, 200k budget, reasoning_effort=high
  • Output: max 128k tokens
  • Max iterations per instance: 500
  • Critic runs: 3 (finish_and_message critic)
  • Tools: terminal, file editor, task tracker
  • Context management: LLM-based condenser (max_size=240, keep_first=2)
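
The bullets above correspond roughly to a run-configuration file such as `submission/metadata.json`. The field names below are illustrative, not the actual schema:

```json
{
  "model": "gpt-5.4",
  "reasoning_effort": "high",
  "thinking_budget_tokens": 200000,
  "max_output_tokens": 128000,
  "max_iterations": 500,
  "critic_runs": 3,
  "condenser": { "type": "llm", "max_size": 240, "keep_first": 2 }
}
```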

Execution

Each instance runs inside a Docker container built on top of the SWE-Bench base image for that repository. The agent interacts with the container via a tool API. The resulting git diff (touching only test files) is captured as the prediction.

Submission

Email submission/predictions.zip + submission/report.json to submit@swtbench.com with a link to this repository.

See submission/SUBMISSION.md for the full submission document.

Dataset

Predictions were generated against the SWT-Bench Lite test split (275 instances after filtering). Evaluation was run with the SWT-Bench harness, using SWE-Bench Lite as the reference dataset.
