# GPT-5.4 + Jina Test on SWT-Bench Lite (Unit Test Mode)

| Metric | Value |
|---|---|
| Success rate (𝒮) | 63.6% (175/275 resolved) |
| Coverage delta (Δ𝒞) | 51.8% |
| Mean coverage | 63.3% |
| Completed | 247 |
| Errors | 28 |
| Dataset | SWT-Bench Lite |
| Mode | Unit Test |

Leaderboard comparison (SWT-Bench Lite, Unit Test mode):

| Rank | System | 𝒮 | Δ𝒞 |
|---|---|---|---|
| 🥇 | GPT-5.4 + Jina Test | 63.6% | 51.8% |
| 🥈 | e-Otter++ (IBM) | 52.5% | 56.4% |
| 🥉 | Amazon Q Developer (AWS) | 39.9% | 52.7% |
| 4 | AssertFlip (University of Waterloo) | 38.0% | 44.2% |
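The headline metrics follow arithmetically from the per-instance outcomes. A minimal sketch (the dict literal is illustrative; the actual per-instance data lives in results/swt_lite/summary.json and instance_ids.json):

```python
# Sketch: recomputing the headline metrics from per-instance outcomes.
# The counts below mirror this run; the field names are illustrative.
outcomes = {
    "resolved": 175,   # generated tests reproduce the issue
    "unresolved": 72,  # completed but did not resolve
    "error": 28,       # agent or harness error
}

total = sum(outcomes.values())         # 275 instances
completed = total - outcomes["error"]  # 247 completed runs
success_rate = outcomes["resolved"] / total

print(f"S = {success_rate:.1%} ({outcomes['resolved']}/{total} resolved)")
# S = 63.6% (175/275 resolved)
```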
Repository layout:

```
.
├── README.md                    ← this file
├── submission/                  ← files ready for submit@swtbench.com
│   ├── predictions.jsonl        ← 275 predictions in SWT-Bench format
│   ├── predictions.zip          ← zipped predictions for email
│   ├── report.json              ← local SWT-Bench eval report
│   ├── metadata.json            ← run configuration
│   └── SUBMISSION.md            ← approach + reproduction summary
├── results/
│   └── swt_lite/
│       ├── summary.json         ← headline numbers
│       ├── instance_ids.json    ← per-instance resolved/unresolved/error IDs
│       └── cost_report.jsonl    ← per-instance LLM cost
└── prompts/
    └── unittest.j2              ← the prompt used for test generation
```
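Before emailing, it is worth sanity-checking predictions.jsonl. The required field names below follow the SWE-Bench prediction convention (instance_id, model_name_or_path, model_patch) and are an assumption here; confirm them against the SWT-Bench harness documentation:

```python
import json

# Sketch: validate that every line of a predictions JSONL file carries
# the fields the evaluation harness expects. Field names are assumed to
# follow the SWE-/SWT-Bench convention.
def check_predictions(path: str) -> int:
    required = {"instance_id", "model_name_or_path", "model_patch"}
    count = 0
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = required - record.keys()
            if missing:
                raise ValueError(f"{record.get('instance_id')}: missing {missing}")
            count += 1
    return count
```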
Jina Test is a test generation agent built around OpenAI GPT-5.4 with a structured, verify-before-submit workflow. For each SWT-Bench instance:
- Understand — read the issue, explore the repo, locate relevant source and test files, identify the test framework.
- Locate the bug — narrow down the exact module / class / function that is broken.
- Write minimal failing tests — add tests to the project's existing test suite, following its conventions (pytest / Django test / unittest / etc.).
- Verify failure — run the tests and confirm they fail on the buggy code for the right reason. Iterate if they unexpectedly pass.
- No source edits — only test files are modified.
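The verify-failure step above can be sketched as a small gate: run the candidate tests and accept them only if they fail and the failure output mentions the issue. The command/marker interface here is illustrative, not the agent's actual API:

```python
import subprocess

# Sketch of the "verify failure" gate: new tests must FAIL on the buggy
# code, and fail for the right reason (e.g. the output names the broken
# function), before the patch is accepted.
def fails_for_right_reason(cmd: list[str], expected_marker: str) -> bool:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    failed = proc.returncode != 0
    right_reason = expected_marker in (proc.stdout + proc.stderr)
    return failed and right_reason
```

If the tests unexpectedly pass, the agent iterates on them rather than submitting.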
- Base model: OpenAI GPT-5.4
- Reasoning: extended thinking, 200k budget, `reasoning_effort=high`
- Output: max 128k tokens
- Max iterations per instance: 500
- Critic runs: 3 (finish_and_message critic)
- Tools: terminal, file editor, task tracker
- Context management: LLM-based condenser (max_size=240, keep_first=2)
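A minimal sketch of what a condenser with these parameters could do: once the event history exceeds max_size, keep the first keep_first events verbatim, summarize the middle, and retain the most recent events. The real condenser is LLM-based, so summarize() is a stub here, and the keep-recent-tail policy is an assumption:

```python
# Sketch of an event-history condenser (max_size=240, keep_first=2).
# summarize() stands in for the actual LLM summarization call.
def condense(events: list, max_size: int = 240, keep_first: int = 2,
             summarize=lambda chunk: f"<summary of {len(chunk)} events>"):
    if len(events) <= max_size:
        return events
    head = events[:keep_first]           # keep the earliest events verbatim
    tail_len = max_size - keep_first - 1  # reserve one slot for the summary
    tail = events[-tail_len:]            # keep the freshest context
    middle = events[keep_first:len(events) - tail_len]
    return head + [summarize(middle)] + tail
```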
Each instance runs inside a Docker container built on top of the SWE-Bench base image for that repository. The agent interacts with the container via a tool API. The resulting git diff (touching only test files) is captured as the prediction.
Email submission/predictions.zip and submission/report.json to submit@swtbench.com with a link to this repository. See submission/SUBMISSION.md for the full submission document.
Predictions were generated against the SWT-Bench Lite test split (275 instances after filtering). Evaluation was run with the SWT-Bench harness, using SWE-Bench Lite as the reference dataset.