# GPT-5.4 + Jina Test on SWT-Bench Lite (Unit Test Mode)

| Metric | Value |
|---|---|
| Success rate (𝒮) | 63.6% (175/275 resolved) |
| Coverage delta (Δ𝒞) | 51.8% |
| Mean coverage | 63.3% |
| Completed | 247 |
| Errors | 28 |
| Dataset | SWT-Bench Lite |
| Mode | Unit Test |

Leaderboard comparison (SWT-Bench Lite, Unit Test mode):

| Rank | System | 𝒮 | Δ𝒞 |
|---|---|---|---|
| 🥇 | GPT-5.4 + Jina Test | 63.6% | 51.8% |
| 🥈 | e-Otter++ (IBM) | 52.5% | 56.4% |
| 🥉 | Amazon Q Developer (AWS) | 39.9% | 52.7% |
| 4 | AssertFlip (University of Waterloo) | 38.0% | 44.2% |
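The headline metrics follow arithmetically from the per-instance outcomes. A minimal sketch (the dict literal is illustrative; the actual per-instance data lives in results/swt_lite/summary.json and instance_ids.json):

```python
# Sketch: recomputing the headline metrics from per-instance outcomes.
# The counts below mirror this run; the field names are illustrative.
outcomes = {
    "resolved": 175,   # generated tests reproduce the issue
    "unresolved": 72,  # completed but did not resolve
    "error": 28,       # agent or harness error
}

total = sum(outcomes.values())         # 275 instances
completed = total - outcomes["error"]  # 247 completed runs
success_rate = outcomes["resolved"] / total

print(f"S = {success_rate:.1%} ({outcomes['resolved']}/{total} resolved)")
# S = 63.6% (175/275 resolved)
```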
Repository layout:

```
.
├── README.md                    ← this file
├── submission/                  ← files ready for submit@swtbench.com
│   ├── predictions.jsonl        ← 275 predictions in SWT-Bench format
│   ├── predictions.zip          ← zipped predictions for email
│   ├── report.json              ← local SWT-Bench eval report
│   ├── metadata.json            ← run configuration
│   └── SUBMISSION.md            ← approach + reproduction summary
├── results/
│   └── swt_lite/
│       ├── summary.json         ← headline numbers
│       ├── instance_ids.json    ← per-instance resolved/unresolved/error IDs
│       └── cost_report.jsonl    ← per-instance LLM cost
└── prompts/
    └── unittest.j2              ← the prompt used for test generation
```
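Before emailing, it is worth sanity-checking predictions.jsonl. The required field names below follow the SWE-Bench prediction convention (instance_id, model_name_or_path, model_patch) and are an assumption here; confirm them against the SWT-Bench harness documentation:

```python
import json

# Sketch: validate that every line of a predictions JSONL file carries
# the fields the evaluation harness expects. Field names are assumed to
# follow the SWE-/SWT-Bench convention.
def check_predictions(path: str) -> int:
    required = {"instance_id", "model_name_or_path", "model_patch"}
    count = 0
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = required - record.keys()
            if missing:
                raise ValueError(f"{record.get('instance_id')}: missing {missing}")
            count += 1
    return count
```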
Jina Test is a test generation agent built around OpenAI GPT-5.4 with a structured, verify-before-submit workflow. For each SWT-Bench instance:
- Understand — read the issue, explore the repo, locate relevant source and test files, identify the test framework.
- Locate the bug — narrow down the exact module / class / function that is broken.
- Write minimal failing tests — add tests to the project's existing test suite, following its conventions (pytest / Django test / unittest / etc.).
- Verify failure — run the tests and confirm they fail on the buggy code for the right reason. Iterate if they unexpectedly pass.
- No source edits — only test files are modified.
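The verify-failure step above can be sketched as a small gate: run the candidate tests and accept them only if they fail and the failure output mentions the issue. The command/marker interface here is illustrative, not the agent's actual API:

```python
import subprocess

# Sketch of the "verify failure" gate: new tests must FAIL on the buggy
# code, and fail for the right reason (e.g. the output names the broken
# function), before the patch is accepted.
def fails_for_right_reason(cmd: list[str], expected_marker: str) -> bool:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    failed = proc.returncode != 0
    right_reason = expected_marker in (proc.stdout + proc.stderr)
    return failed and right_reason
```

If the tests unexpectedly pass, the agent iterates on them rather than submitting.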
- Base model: OpenAI GPT-5.4
- Reasoning: extended thinking, 200k budget, `reasoning_effort=high`
- Output: max 128k tokens
- Max iterations per instance: 500
- Critic runs: 3 (finish_and_message critic)
- Tools: terminal, file editor, task tracker
- Context management: LLM-based condenser (max_size=240, keep_first=2)
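A minimal sketch of what a condenser with these parameters could do: once the event history exceeds max_size, keep the first keep_first events verbatim, summarize the middle, and retain the most recent events. The real condenser is LLM-based, so summarize() is a stub here, and the keep-recent-tail policy is an assumption:

```python
# Sketch of an event-history condenser (max_size=240, keep_first=2).
# summarize() stands in for the actual LLM summarization call.
def condense(events: list, max_size: int = 240, keep_first: int = 2,
             summarize=lambda chunk: f"<summary of {len(chunk)} events>"):
    if len(events) <= max_size:
        return events
    head = events[:keep_first]           # keep the earliest events verbatim
    tail_len = max_size - keep_first - 1  # reserve one slot for the summary
    tail = events[-tail_len:]            # keep the freshest context
    middle = events[keep_first:len(events) - tail_len]
    return head + [summarize(middle)] + tail
```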
Each instance runs inside a Docker container built on top of the SWE-Bench base image for that repository. The agent interacts with the container via a tool API. The resulting git diff (touching only test files) is captured as the prediction.
Email submission/predictions.zip and submission/report.json to submit@swtbench.com with a link to this repository. See submission/SUBMISSION.md for the full submission document.
Predictions were generated against the SWT-Bench Lite test split (275 instances after filtering). Evaluation was run with the SWT-Bench harness, using SWE-Bench Lite as the reference dataset.