Description
Extend the evaluation harness to run the 4-way comparison (raw vs impl vs full vs nlcmd) against the nginx/nginx-tests repository. This adds a second benchmark alongside SWE-bench Verified to validate the planning flow's effectiveness on a different domain (C/nginx vs Python/astropy), providing the third data point needed for the evaluation report.
Context from team meeting:
- Current eval results on SWE-bench Verified (5 tasks) show planning improves correctness and patch quality
- Natural language command orchestration is 2.6x slower than script orchestration but produces richer artifacts
- With three data points (SWE-bench + nginx), the team will have sufficient data to report
Proposed Solution
Implementation Plan: Nginx-Tests Benchmark for Evaluation Harness
Consensus Summary
The reducer's minimal approach is adopted as the foundation: add nginx support directly via if/elif branches in the existing single-file harness, with 5 curated tasks in a JSON file, and `prove` exit-code-based scoring. From the bold proposal, we retain the structured task schema design (with `test_repo`, `test_commit`, and `modules_required` fields) and the compile-failure-as-distinct-status idea. From the critique, we adopt the critical prerequisite: validate one nginx task end-to-end manually before building the automation, and we raise the LOC estimate to account for the two-repo worktree complexity and prompt parameterization that the reducer underestimated.
Goal
Add nginx/nginx-tests as a second benchmark to the eval harness so the same 4-way mode comparison (raw/impl/full/nlcmd) runs against C-language bug-fix tasks, providing a third cross-domain data point for the evaluation report.
Success criteria:
- `python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run` completes without error
- At least 5 curated nginx tasks load from `nginx_tasks.json` and produce valid worktrees with both repos
- Scoring via `prove` correctly reports pass/fail for a gold-patched nginx build
- The existing SWE-bench flow (`--benchmark swebench`, the default) is completely unchanged
Out of scope:
- Generic benchmark plugin/adapter registry for N benchmarks
  - ✅ Good to have in the future: extract a dispatch-dict adapter protocol when a 3rd benchmark is added (Rule of Three).
- Automated task mining script from nginx git history
  - ✅ Good to have in the future: write `scripts/mine_nginx_tasks.py` when scaling beyond 10 curated tasks.
- Full TAP output parser with per-test granularity
  - ✅ Good to have in the future: parse `prove -v` output for partial-credit scoring (~20 LOC addition).
- Docker-based nginx compilation environment
  - ❌ Not needed: nginx builds with `./auto/configure && make` on macOS/Linux; Docker adds complexity without benefit at this stage.
Bug Reproduction
Skip reason: This is a feature request, not a bug fix. No reproduction needed.
Codebase Analysis
Files verified (docs/code checked by agents):
- `python/agentize/eval/eval_harness.py`: 1054-line single-file harness. `load_tasks` is HuggingFace-specific. `setup_worktree` assumes SWE-bench task keys (`repo`, `instance_id`, `base_commit`, `problem_statement`). Mode dispatch in `_cmd_run` (lines 926-951) calls different functions per mode, but each has SWE-bench-specific prompts. `extract_patch` (lines 713-732) uses generic `git diff` — reusable as-is. `score_predictions` (lines 739-762) calls `swebench.harness.run_evaluation` — nginx needs a completely different scorer.
- `python/agentize/eval/eval_harness.md`: documents the single-file philosophy, three-layer architecture, prerequisites, and ramp-up strategy. Must be updated with an nginx benchmark section.
- `python/agentize/eval/eval-report-2026-03-01.md`: 4-way comparison on 5 SWE-bench tasks. Establishes the reporting format nginx results must follow.
- `tests/cli/test-eval-harness-cli.sh`: 58-line shell test covering module import, help text, `aggregate_metrics`, and `write_overrides`. Must be extended for the `--benchmark` flag.
- `CLAUDE.md`: confirms "early stage, breaking changes OK" and single-file preferences.
File changes:
| File | Level | Purpose |
|---|---|---|
| `python/agentize/eval/eval_harness.py` | major | Add `--benchmark` CLI flag, `load_nginx_tasks()`, `setup_nginx_worktree()`, `score_nginx()`, benchmark dispatch branches in `_cmd_run` and `_cmd_score`, `compile_failed` status support (~180 LOC added) |
| `python/agentize/eval/eval_harness.md` | medium | Document nginx benchmark usage, prerequisites, task format (~50 LOC added) |
| `python/agentize/eval/nginx_tasks.json` (new) | major | 5 curated nginx bug-fix task definitions (est. 100 LOC) |
| `tests/cli/test-eval-harness-cli.sh` | minor | Add `--benchmark nginx` flag parsing and nginx task loading tests (~20 LOC added) |
Current architecture notes:
- The harness is a linear pipeline: load → setup → run → extract → score
- Mode dispatch (`raw`/`impl`/`full`/`nlcmd`) is in `_cmd_run` via `if`/`elif` — this pattern is reused for benchmark dispatch
- The `run_impl` prompt at line 248 is hardcoded for SWE-bench ("Read .issue.md and implement the fix described") — nginx tasks use the same `.issue.md` convention, so this prompt works without modification
- `_make_result()` returns a dict with a `status` field — must support a new `"compile_failed"` value
- `extract_patch()` uses generic `git diff` — reusable for nginx without changes
Interface Design
New interfaces:
- `load_nginx_tasks(task_file: str, instance_ids: list[str] | None, limit: int | None) -> list[dict]`
  - Reads `nginx_tasks.json`, filters by `instance_ids` and `limit`
  - Returns a list of task dicts with keys: `instance_id`, `repo`, `test_repo`, `base_commit`, `test_commit`, `problem_statement`, `test_files`, `modules_required`
  - Simple JSON load + filter, <20 LOC
- `setup_nginx_worktree(task: dict, repos_dir: Path, worktrees_dir: Path) -> str`
  - Clones the nginx source as a bare repo (cached), creates a worktree at `base_commit`
  - Clones the nginx-tests repo at `test_commit` into a sibling directory
  - Writes `.issue.md` from `problem_statement`
  - Returns the worktree path
  - Steps:
    - Clone `nginx/nginx` as a bare repo (reuses the existing `setup_worktree` pattern)
    - Create a detached worktree at `task["base_commit"]`
    - Clone `nginx/nginx-tests` at `task["test_commit"]` into `worktrees_dir/<instance_id>__tests/`
    - Write `.issue.md` with the problem statement
    - Return the worktree path
- `score_nginx(wt_path: str, task: dict, tests_path: str) -> dict`
  - Compiles nginx from the worktree: `./auto/configure <modules> && make`
  - If compilation fails, returns `{"status": "compile_failed", "resolved": False}`
  - Runs `prove -v <test_files>` with `TEST_NGINX_BINARY` pointing to the compiled binary
  - Returns `{"status": "completed", "resolved": <bool from exit code>}`
  - Steps:
    - Run `./auto/configure` with the `task["modules_required"]` flags
    - Run `make -j$(nproc)`
    - If make fails → return `compile_failed`
    - Set `TEST_NGINX_BINARY=<wt>/objs/nginx`
    - Run `prove <test_files>` from the tests directory
    - Return `resolved=True` if exit code is 0, else `resolved=False`
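The load and score interfaces above can be sketched in Python. This is a minimal sketch, not the final implementation — the `run` parameter on `score_nginx` is a hypothetical injection point (defaulting to `subprocess.run`) so the compile/prove flow can be exercised without actually building nginx:

```python
import json
import os
import subprocess
from pathlib import Path


def load_nginx_tasks(task_file, instance_ids=None, limit=None):
    """Load curated nginx tasks from JSON; filter by instance id, then apply limit."""
    tasks = json.loads(Path(task_file).read_text())
    if instance_ids:
        wanted = set(instance_ids)
        tasks = [t for t in tasks if t["instance_id"] in wanted]
    if limit is not None:
        tasks = tasks[:limit]
    return tasks


def score_nginx(wt_path, task, tests_path, run=subprocess.run):
    """Compile nginx in the worktree, then score via the prove exit code."""
    # Configure + make; any build failure is reported as a distinct status
    if run(["./auto/configure", *task["modules_required"]], cwd=wt_path).returncode != 0:
        return {"status": "compile_failed", "resolved": False}
    if run(["make", "-j4"], cwd=wt_path).returncode != 0:
        return {"status": "compile_failed", "resolved": False}
    # Point the nginx test suite at the freshly built binary
    env = {**os.environ, "TEST_NGINX_BINARY": str(Path(wt_path) / "objs" / "nginx")}
    prove = run(["prove", *task["test_files"]], cwd=tests_path, env=env)
    return {"status": "completed", "resolved": prove.returncode == 0}
```

The dependency-injected `run` keeps the scorer unit-testable inside the single-file harness without mocking frameworks.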
Modified interfaces:
`_cmd_run(args)` — add benchmark dispatch:

```diff
  # Load tasks
- print(f"Loading tasks from {args.dataset}...")
- tasks = load_tasks(
-     dataset_name=args.dataset,
-     instance_ids=args.instance_ids,
-     limit=args.limit,
- )
+ if args.benchmark == "nginx":
+     tasks_file = Path(__file__).parent / "nginx_tasks.json"
+     print(f"Loading nginx tasks from {tasks_file}...")
+     tasks = load_nginx_tasks(str(tasks_file), args.instance_ids, args.limit)
+ else:
+     print(f"Loading tasks from {args.dataset}...")
+     tasks = load_tasks(
+         dataset_name=args.dataset,
+         instance_ids=args.instance_ids,
+         limit=args.limit,
+     )
```

`_cmd_run(args)` — worktree setup dispatch:

```diff
- wt_path = setup_worktree(task, repos_dir, worktrees_dir)
+ if args.benchmark == "nginx":
+     wt_path = setup_nginx_worktree(task, repos_dir, worktrees_dir)
+ else:
+     wt_path = setup_worktree(task, repos_dir, worktrees_dir)
```

`_cmd_run(args)` — scoring dispatch (after patch extraction):

```diff
+ # Nginx: score immediately after extraction
+ if args.benchmark == "nginx" and result["status"] == "completed":
+     tests_path = worktrees_dir / (instance_id + "__tests")
+     score = score_nginx(wt_path, task, str(tests_path))
+     result.update(score)
```

CLI argument addition:

```diff
  run_parser.add_argument("--mode", choices=["raw", "impl", "full", "nlcmd"], default="raw")
+ run_parser.add_argument("--benchmark", choices=["swebench", "nginx"], default="swebench")
```

Documentation changes:
- `python/agentize/eval/eval_harness.md` — add a "Nginx Benchmark" section
Documentation Planning
Interface docs
`python/agentize/eval/eval_harness.md` — update with nginx benchmark section:

````diff
+ ## Nginx Benchmark
+
+ The harness supports nginx/nginx-tests as a second benchmark via `--benchmark nginx`.
+ This runs the same 4-way mode comparison against C-language bug-fix tasks from the
+ nginx web server.
+
+ ### Prerequisites (nginx-specific)
+
+ | Dependency | Purpose | Install |
+ |------------|---------|---------|
+ | C compiler (gcc/clang) | Compile nginx from source | System package |
+ | PCRE library | nginx regex support | `brew install pcre` / `apt install libpcre3-dev` |
+ | zlib | nginx gzip support | `brew install zlib` / `apt install zlib1g-dev` |
+ | OpenSSL | nginx SSL support | `brew install openssl` / `apt install libssl-dev` |
+ | Perl + prove | Run nginx test suite | Pre-installed on macOS/Linux |
+ | Test::Nginx | Perl test framework for nginx | `cpan Test::Nginx` |
+
+ ### Usage
+
+ ```bash
+ # Dry-run
+ python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run
+
+ # Single task
+ python -m agentize.eval.eval_harness run --benchmark nginx --mode raw \
+   --instance-ids nginx__12345 --timeout 1800
+ ```
+
+ ### Task Format (nginx_tasks.json)
+
+ Each task specifies:
+ - `instance_id`: Unique identifier (nginx commit hash)
+ - `repo`: `nginx/nginx` (source repo)
+ - `test_repo`: `nginx/nginx-tests` (test suite repo)
+ - `base_commit`: Pre-fix commit in nginx source
+ - `test_commit`: Corresponding commit in nginx-tests
+ - `problem_statement`: Bug description for the AI
+ - `test_files`: List of `.t` files to run
+ - `modules_required`: nginx configure flags needed
+
+ ### Scoring
+
+ Scoring compiles nginx from the patched worktree and runs `prove` against
+ the specified test files. Results are binary: resolved (all tests pass) or
+ not resolved (any test fails or compilation fails). Compilation failures
+ are reported as `compile_failed` status.
````

Test Strategy
Test modifications:
- `tests/cli/test-eval-harness-cli.sh` — extend existing tests
  - Test case: `--benchmark nginx` flag is accepted by `run --help`
  - Test case: `load_nginx_tasks` loads and filters from JSON correctly
  - Test case: `_make_result` supports `compile_failed` status (via `aggregate_metrics` test)

Test data required:
- `python/agentize/eval/nginx_tasks.json` — the curated task file serves as the test fixture for load tests
Implementation Steps
Step 1: Update documentation (Estimated: 50 LOC)
- Add an nginx benchmark section to `python/agentize/eval/eval_harness.md` (as shown in Documentation Planning)
- Dependencies: none
- Correspondence:
  - Docs: defines the nginx benchmark interface, prerequisites, task format, and scoring semantics
  - Tests: N/A
Step 2: Create curated nginx task definitions (Estimated: 100 LOC)
- Create `python/agentize/eval/nginx_tasks.json` with 5 validated tasks
- Each task must be manually verified: check out `base_commit`, confirm the test fails, apply the gold patch, confirm the test passes
- Task schema:

```json
[
  {
    "instance_id": "nginx__<short_hash>",
    "repo": "nginx/nginx",
    "test_repo": "nginx/nginx-tests",
    "base_commit": "<commit_before_fix>",
    "test_commit": "<corresponding_test_commit>",
    "problem_statement": "<bug description derived from commit message and code context>",
    "test_files": ["proxy.t"],
    "modules_required": ["--with-http_ssl_module"],
    "gold_patch": "<the actual fix diff for validation>"
  }
]
```

- Dependencies: Step 1
- Correspondence:
  - Docs: implements the task format defined in Step 1
  - Tests: serves as fixture for `load_nginx_tasks` tests
Step 3: Add CLI tests for nginx benchmark (Estimated: 20 LOC)
- Extend `tests/cli/test-eval-harness-cli.sh`:

```bash
# Test 7: --benchmark flag is accepted
python -m agentize.eval.eval_harness run --help 2>&1 | grep -q "benchmark" || test_fail "--benchmark flag missing"

# Test 8: load_nginx_tasks loads tasks from JSON
output=$(python -c "
import json
from agentize.eval.eval_harness import load_nginx_tasks
tasks = load_nginx_tasks('python/agentize/eval/nginx_tasks.json', None, 2)
print(json.dumps({'count': len(tasks), 'has_id': 'instance_id' in tasks[0]}))
")
echo "$output" | python -c "
import sys, json
d = json.load(sys.stdin)
assert d['count'] <= 2, f'limit not applied: {d[\"count\"]}'
assert d['has_id'], 'missing instance_id'
" || test_fail "load_nginx_tasks failed"
```

- Dependencies: Step 2
- Correspondence:
  - Docs: tests the interfaces defined in Step 1
  - Tests: new test cases for the `--benchmark` flag and `load_nginx_tasks`
Step 4: Implement nginx harness functions (Estimated: 180 LOC)
- Add to `python/agentize/eval/eval_harness.py`:
  - `load_nginx_tasks()` — JSON loader with filter support (~15 LOC)
  - `setup_nginx_worktree()` — two-repo clone + worktree (~50 LOC)
  - `score_nginx()` — compile + prove + exit-code check (~60 LOC)
  - `--benchmark` CLI argument addition (~5 LOC)
  - Benchmark dispatch branches in `_cmd_run` — load/setup/score dispatching (~30 LOC)
  - Benchmark dispatch in `_cmd_score` — skip the SWE-bench scorer for nginx (~10 LOC)
  - Support for the `compile_failed` status in `_make_result` and `aggregate_metrics` (~10 LOC)
- Dependencies: Step 3
- Correspondence:
  - Docs: implements all interfaces from Step 1
  - Tests: satisfies test cases from Step 3
Total estimated complexity: ~350 LOC (Medium)
Recommended approach: Milestone commits
Milestone strategy:
- M0 (prerequisite, before any code): manually validate one nginx task end-to-end: clone the nginx source at a pre-fix commit, clone nginx-tests at the corresponding test commit, verify the test fails, apply the gold patch, compile, verify the test passes. Document the exact commands in `.tmp/nginx-poc.md`. This must succeed before proceeding.
- M1: Steps 1-2 (docs + task JSON). Commit: `[feat][eval] Add nginx benchmark docs and curated task definitions`
- M2: Steps 3-4 (tests + implementation). Commit: `[feat][eval] Add nginx benchmark support to eval harness`
- Delivery: run `--benchmark nginx --mode raw --limit 1 --dry-run` successfully, then run one real task
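The M0 validation could be captured as a command template in `.tmp/nginx-poc.md`. The commit hashes, configure flags, and test file below are placeholders to be filled in during curation, so this is a non-runnable sketch of the procedure:

```bash
# M0 proof-of-concept template (placeholders: <base_commit>, <test_commit>, <test>.t)
git clone https://github.com/nginx/nginx nginx-poc
git clone https://github.com/nginx/nginx-tests nginx-poc-tests
git -C nginx-poc checkout <base_commit>
git -C nginx-poc-tests checkout <test_commit>

# Build at the pre-fix commit (extra --with-* flags per the task's modules_required)
(cd nginx-poc && ./auto/configure --with-http_ssl_module && make -j4)

# Confirm the target test FAILS before the fix
(cd nginx-poc-tests && TEST_NGINX_BINARY="$(pwd)/../nginx-poc/objs/nginx" prove <test>.t)

# Apply the gold patch, rebuild, and confirm the test PASSES
git -C nginx-poc apply gold.patch
(cd nginx-poc && make -j4)
(cd nginx-poc-tests && TEST_NGINX_BINARY="$(pwd)/../nginx-poc/objs/nginx" prove <test>.t)
```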
Success Criteria
- Manual end-to-end proof of concept documented in `.tmp/nginx-poc.md` (M0)
- `python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run` exits 0
- `load_nginx_tasks` loads 5 tasks from JSON; filtering by `instance_ids` and `limit` works
- `setup_nginx_worktree` clones both repos and creates a valid worktree
- `score_nginx` compiles nginx and scores via the `prove` exit code
- Existing SWE-bench tests pass unchanged (`tests/cli/test-eval-harness-cli.sh`)
- `compile_failed` is a distinct status in metrics (not lumped with `error`)
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| No suitable nginx tasks exist (can't find 5 single-commit fixes with deterministic tests) | Medium | High | M0 proof-of-concept validates feasibility before any code is written. If <5 tasks found, scope down to 3 or pivot to a different C project. |
| Two-repo commit correlation is ambiguous (can't reliably pair source fix with test commit) | Medium | High | Pin exact commit pairs in task JSON. Each pair is validated end-to-end during curation. No automated correlation needed. |
| C compilation failures dominate AI results (LLMs are weaker at C than Python) | High | Medium | Select tasks requiring minimal C changes (1-3 lines). Report compile_failed as distinct status for meaningful analysis. |
| Test::Nginx CPAN module not installed | Medium | Low | Document in prerequisites. Check for module availability in score_nginx and report clear error. |
| Flaky nginx tests (timing-dependent network assertions) | Medium | Medium | Only curate tasks with deterministic assertions (regex on response body, not timing). Validate each task 3x during curation. |
| Single file grows too large (1054 + 180 = ~1234 LOC) | Low | Low | 1234 LOC is still manageable for a linear pipeline. If a 3rd benchmark is added, extract to adapter pattern at that point. |
Dependencies
- System dependencies (must be pre-installed):
  - C compiler (gcc or clang)
  - PCRE, zlib, and OpenSSL development libraries
  - Perl + `prove` (Test::Harness)
  - `Test::Nginx` CPAN module
- No new Python dependencies
- Repository access: `https://github.com/nginx/nginx` and `https://github.com/nginx/nginx-tests` (public repos)
Related PR
TBD - will be updated when PR is created