[plan][test]: Extend eval harness to benchmark against nginx/nginx-tests #979

@ayazhankadessova

Description

Extend the evaluation harness to run the 4-way comparison (raw vs impl vs full vs nlcmd) against the nginx/nginx-tests repository. This adds a second benchmark alongside SWE-bench Verified to validate the planning flow's effectiveness on a different domain (C/nginx vs Python/astropy), providing the third data point needed for the evaluation report.

Context from team meeting:

  • Current eval results on SWE-bench Verified (5 tasks) show planning improves correctness and patch quality
  • Natural language command orchestration is 2.6x slower than script orchestration but produces richer artifacts
  • With three data points (SWE-bench plus nginx), the team will have enough data for the evaluation report

Proposed Solution

Implementation Plan: Nginx-Tests Benchmark for Evaluation Harness

Consensus Summary

The reducer's minimal approach is adopted as the foundation: add nginx support directly via if/elif branches in the existing single-file harness, with 5 curated tasks in a JSON file and scoring based on the prove exit code. From the bold proposal, we retain the structured task schema (with test_repo, test_commit, and modules_required fields) and the idea of treating compile failure as a distinct status. From the critique, we adopt the critical prerequisite: validate one nginx task end-to-end manually before building the automation. We also raise the LOC estimate to account for the two-repo worktree complexity and prompt parameterization that the reducer underestimated.

Goal

Add nginx/nginx-tests as a second benchmark to the eval harness so the same 4-way mode comparison (raw/impl/full/nlcmd) runs against C-language bug-fix tasks, providing a third cross-domain data point for the evaluation report.

Success criteria:

  • python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run completes without error
  • At least 5 curated nginx tasks load from nginx_tasks.json and produce valid worktrees with both repos
  • Scoring via prove correctly reports pass/fail for a gold-patched nginx build
  • The existing SWE-bench flow (--benchmark swebench, the default) is completely unchanged

Out of scope:

  • Generic benchmark plugin/adapter registry for N benchmarks
    • ✅ Good to have in the future: Extract a dispatch-dict adapter protocol when a 3rd benchmark is added (Rule of Three).
  • Automated task mining script from nginx git history
    • ✅ Good to have in the future: Write scripts/mine_nginx_tasks.py when scaling beyond 10 curated tasks.
  • Full TAP output parser with per-test granularity
    • ✅ Good to have in the future: Parse prove -v output for partial-credit scoring (~20 LOC addition).
  • Docker-based nginx compilation environment
    • ❌ Not needed: nginx builds with ./auto/configure && make on macOS/Linux; Docker adds complexity without benefit at this stage.

Bug Reproduction

Skip reason: This is a feature request, not a bug fix. No reproduction needed.

Codebase Analysis

Files verified (docs/code checked by agents):

  • python/agentize/eval/eval_harness.py: 1054-line single-file harness. load_tasks is HuggingFace-specific. setup_worktree assumes SWE-bench task keys (repo, instance_id, base_commit, problem_statement). Mode dispatch in _cmd_run (lines 926-951) calls different functions per mode but each has SWE-bench-specific prompts. extract_patch (lines 713-732) uses generic git diff — reusable as-is. score_predictions (lines 739-762) calls swebench.harness.run_evaluation — nginx needs a completely different scorer.
  • python/agentize/eval/eval_harness.md: Documents single-file philosophy, three-layer architecture, prerequisites, ramp-up strategy. Must be updated with nginx benchmark section.
  • python/agentize/eval/eval-report-2026-03-01.md: 4-way comparison on 5 SWE-bench tasks. Establishes the reporting format nginx results must follow.
  • tests/cli/test-eval-harness-cli.sh: 58-line shell test covering module import, help text, aggregate_metrics, write_overrides. Must be extended for --benchmark flag.
  • CLAUDE.md: Confirms "early stage, breaking changes OK" and single-file preferences.

File changes:

| File | Level | Purpose |
|------|-------|---------|
| python/agentize/eval/eval_harness.py | major | Add --benchmark CLI flag, load_nginx_tasks(), setup_nginx_worktree(), score_nginx(), benchmark dispatch branches in _cmd_run and _cmd_score, compile_failed status support (~180 LOC added) |
| python/agentize/eval/eval_harness.md | medium | Document nginx benchmark usage, prerequisites, task format (~50 LOC added) |
| python/agentize/eval/nginx_tasks.json (new) | major | 5 curated nginx bug-fix task definitions (est. 100 LOC) |
| tests/cli/test-eval-harness-cli.sh | minor | Add --benchmark nginx flag parsing and nginx task loading tests (~20 LOC added) |

Current architecture notes:

  • The harness is a linear pipeline: load → setup → run → extract → score
  • Mode dispatch (raw/impl/full/nlcmd) is in _cmd_run via if/elif — this pattern is reused for benchmark dispatch
  • The run_impl prompt at line 248 is hardcoded for SWE-bench ("Read .issue.md and implement the fix described") — nginx tasks use the same .issue.md convention so this prompt works without modification
  • _make_result() returns a dict with status field — must support a new "compile_failed" value
  • extract_patch() uses generic git diff — reusable for nginx without changes
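
The compile_failed requirement in the notes above can be illustrated with a small sketch. This is hypothetical code: the real _make_result and aggregate_metrics live in eval_harness.py and their exact shapes are not reproduced here; the sketch only assumes results are dicts with a status field.

```python
from collections import Counter

def aggregate_statuses(results: list[dict]) -> dict:
    """Count result statuses, keeping compile_failed distinct from error."""
    counts = Counter(r.get("status", "error") for r in results)
    return {
        "completed": counts.get("completed", 0),
        "compile_failed": counts.get("compile_failed", 0),
        "error": counts.get("error", 0),
    }
```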

Interface Design

New interfaces:

  1. load_nginx_tasks(task_file: str, instance_ids: list[str] | None, limit: int | None) -> list[dict]

    • Reads nginx_tasks.json, filters by instance_ids and limit
    • Returns list of task dicts with keys: instance_id, repo, test_repo, base_commit, test_commit, problem_statement, test_files, modules_required
    • Simple JSON load + filter, <20 LOC
  2. setup_nginx_worktree(task: dict, repos_dir: Path, worktrees_dir: Path) -> str

    • Clones nginx source bare repo (cache), creates worktree at base_commit
    • Clones nginx-tests repo at test_commit into a sibling directory
    • Writes .issue.md from problem_statement
    • Returns worktree path
    • Steps:
      1. Clone nginx/nginx bare repo (reuses existing setup_worktree pattern)
      2. Create detached worktree at task["base_commit"]
      3. Clone nginx/nginx-tests at task["test_commit"] into worktrees_dir/<instance_id>__tests/
      4. Write .issue.md with problem statement
      5. Return worktree path
  3. score_nginx(wt_path: str, task: dict, tests_path: str) -> dict

    • Compiles nginx from worktree: ./auto/configure <modules> && make
    • If compilation fails, returns {"status": "compile_failed", "resolved": False}
    • Runs prove -v <test_files> with TEST_NGINX_BINARY pointing to compiled binary
    • Returns {"status": "completed", "resolved": <bool from exit code>}
    • Steps:
      1. Run ./auto/configure with task["modules_required"] flags
      2. Run make -j$(nproc)
      3. If make fails → return compile_failed
      4. Set TEST_NGINX_BINARY=<wt>/objs/nginx
      5. Run prove <test_files> from the tests directory
      6. Return resolved=True if exit code 0, else resolved=False
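
The first and third interfaces above might be sketched as follows. This is a hedged sketch under the plan's assumptions (function names, task keys, and the objs/nginx binary path come from this plan, not from existing code), with error handling and timeouts elided:

```python
import json
import os
import subprocess
from pathlib import Path

def load_nginx_tasks(task_file: str, instance_ids=None, limit=None) -> list[dict]:
    """Load curated nginx tasks from JSON, filtering by instance id, then limit."""
    tasks = json.loads(Path(task_file).read_text())
    if instance_ids:
        wanted = set(instance_ids)
        tasks = [t for t in tasks if t["instance_id"] in wanted]
    if limit is not None:
        tasks = tasks[:limit]
    return tasks

def score_nginx(wt_path: str, task: dict, tests_path: str) -> dict:
    """Compile the patched worktree, then run prove on the task's .t files."""
    # ./auto/configure resolves relative to cwd (the worktree) on POSIX.
    configure = ["./auto/configure", *task.get("modules_required", [])]
    if subprocess.run(configure, cwd=wt_path).returncode != 0:
        return {"status": "compile_failed", "resolved": False}
    if subprocess.run(["make", f"-j{os.cpu_count() or 1}"], cwd=wt_path).returncode != 0:
        return {"status": "compile_failed", "resolved": False}
    # Point the nginx-tests harness at the freshly built binary.
    env = dict(os.environ, TEST_NGINX_BINARY=str(Path(wt_path) / "objs" / "nginx"))
    proc = subprocess.run(["prove", *task["test_files"]], cwd=tests_path, env=env)
    return {"status": "completed", "resolved": proc.returncode == 0}
```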

Modified interfaces:

_cmd_run(args) — add benchmark dispatch:

     # Load tasks
-    print(f"Loading tasks from {args.dataset}...")
-    tasks = load_tasks(
-        dataset_name=args.dataset,
-        instance_ids=args.instance_ids,
-        limit=args.limit,
-    )
+    if args.benchmark == "nginx":
+        tasks_file = Path(__file__).parent / "nginx_tasks.json"
+        print(f"Loading nginx tasks from {tasks_file}...")
+        tasks = load_nginx_tasks(str(tasks_file), args.instance_ids, args.limit)
+    else:
+        print(f"Loading tasks from {args.dataset}...")
+        tasks = load_tasks(
+            dataset_name=args.dataset,
+            instance_ids=args.instance_ids,
+            limit=args.limit,
+        )

_cmd_run(args) — worktree setup dispatch:

-        wt_path = setup_worktree(task, repos_dir, worktrees_dir)
+        if args.benchmark == "nginx":
+            wt_path = setup_nginx_worktree(task, repos_dir, worktrees_dir)
+        else:
+            wt_path = setup_worktree(task, repos_dir, worktrees_dir)

_cmd_run(args) — scoring dispatch (after patch extraction):

+        # Nginx: score immediately after extraction
+        if args.benchmark == "nginx" and result["status"] == "completed":
+            tests_path = worktrees_dir / (instance_id + "__tests")
+            score = score_nginx(wt_path, task, str(tests_path))
+            result.update(score)

CLI argument addition:

  run_parser.add_argument("--mode", choices=["raw", "impl", "full", "nlcmd"], default="raw")
+ run_parser.add_argument("--benchmark", choices=["swebench", "nginx"], default="swebench")

Documentation changes:

  • python/agentize/eval/eval_harness.md — add "Nginx Benchmark" section

Documentation Planning

Interface docs

  • python/agentize/eval/eval_harness.md — update with nginx benchmark section:
+ ## Nginx Benchmark
+
+ The harness supports nginx/nginx-tests as a second benchmark via `--benchmark nginx`.
+ This runs the same 4-way mode comparison against C-language bug-fix tasks from the
+ nginx web server.
+
+ ### Prerequisites (nginx-specific)
+
+ | Dependency | Purpose | Install |
+ |------------|---------|---------|
+ | C compiler (gcc/clang) | Compile nginx from source | System package |
+ | PCRE library | nginx regex support | `brew install pcre` / `apt install libpcre3-dev` |
+ | zlib | nginx gzip support | `brew install zlib` / `apt install zlib1g-dev` |
+ | OpenSSL | nginx SSL support | `brew install openssl` / `apt install libssl-dev` |
+ | Perl + prove | Run nginx test suite | Pre-installed on macOS/Linux |
+ | Test::Nginx | Perl test framework for nginx | `cpan Test::Nginx` |
+
+ ### Usage
+
+ ```bash
+ # Dry-run
+ python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run
+
+ # Single task
+ python -m agentize.eval.eval_harness run --benchmark nginx --mode raw \
+     --instance-ids nginx__12345 --timeout 1800
+ ```
+
+ ### Task Format (nginx_tasks.json)
+
+ Each task specifies:
+ - `instance_id`: Unique identifier (nginx commit hash)
+ - `repo`: `nginx/nginx` (source repo)
+ - `test_repo`: `nginx/nginx-tests` (test suite repo)
+ - `base_commit`: Pre-fix commit in nginx source
+ - `test_commit`: Corresponding commit in nginx-tests
+ - `problem_statement`: Bug description for the AI
+ - `test_files`: List of `.t` files to run
+ - `modules_required`: nginx configure flags needed
+
+ ### Scoring
+
+ Scoring compiles nginx from the patched worktree and runs `prove` against
+ the specified test files. Results are binary: resolved (all tests pass) or
+ not resolved (any test fails or compilation fails). Compilation failures
+ are reported as `compile_failed` status.

Test Strategy

Test modifications:

  • tests/cli/test-eval-harness-cli.sh — extend existing tests
    • Test case: --benchmark nginx flag is accepted by run --help
    • Test case: load_nginx_tasks loads and filters from JSON correctly
    • Test case: _make_result supports compile_failed status (via aggregate_metrics test)

Test data required:

  • python/agentize/eval/nginx_tasks.json — the curated task file serves as test fixture for load tests

Implementation Steps

Step 1: Update documentation (Estimated: 50 LOC)

  • Add nginx benchmark section to python/agentize/eval/eval_harness.md (as shown in Documentation Planning)
  • Dependencies: None
  • Correspondence:
    • Docs: Defines nginx benchmark interface, prerequisites, task format, and scoring semantics
    • Tests: N/A

Step 2: Create curated nginx task definitions (Estimated: 100 LOC)

  • Create python/agentize/eval/nginx_tasks.json with 5 validated tasks
  • Each task must be manually verified: checkout base_commit, confirm test fails, apply gold patch, confirm test passes
  • Task schema:
[
  {
    "instance_id": "nginx__<short_hash>",
    "repo": "nginx/nginx",
    "test_repo": "nginx/nginx-tests",
    "base_commit": "<commit_before_fix>",
    "test_commit": "<corresponding_test_commit>",
    "problem_statement": "<bug description derived from commit message and code context>",
    "test_files": ["proxy.t"],
    "modules_required": ["--with-http_ssl_module"],
    "gold_patch": "<the actual fix diff for validation>"
  }
]
  • Dependencies: Step 1
  • Correspondence:
    • Docs: Implements the task format defined in Step 1
    • Tests: Serves as fixture for load_nginx_tasks tests
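
During curation, a tiny schema check catches malformed entries before they reach the harness. A sketch, assuming the key set shown in the schema above (the helper name is hypothetical, not part of the current harness):

```python
REQUIRED_TASK_KEYS = {
    "instance_id", "repo", "test_repo", "base_commit", "test_commit",
    "problem_statement", "test_files", "modules_required", "gold_patch",
}

def validate_nginx_task(task: dict) -> list[str]:
    """Return a list of problems with a task entry (empty list = valid)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_TASK_KEYS - task.keys())]
    if not task.get("test_files"):
        problems.append("test_files must list at least one .t file")
    return problems
```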

Step 3: Add CLI tests for nginx benchmark (Estimated: 20 LOC)

  • Extend tests/cli/test-eval-harness-cli.sh:
# Test 7: --benchmark flag is accepted
python -m agentize.eval.eval_harness run --help 2>&1 | grep -q "benchmark" || test_fail "--benchmark flag missing"

# Test 8: load_nginx_tasks loads tasks from JSON
output=$(python -c "
import json
from agentize.eval.eval_harness import load_nginx_tasks
tasks = load_nginx_tasks('python/agentize/eval/nginx_tasks.json', None, 2)
print(json.dumps({'count': len(tasks), 'has_id': 'instance_id' in tasks[0]}))
")
echo "$output" | python -c "
import sys, json
d = json.load(sys.stdin)
assert d['count'] <= 2, f'limit not applied: {d[\"count\"]}'
assert d['has_id'], 'missing instance_id'
" || test_fail "load_nginx_tasks failed"
  • Dependencies: Step 2
  • Correspondence:
    • Docs: Tests the interfaces defined in Step 1
    • Tests: New test cases for --benchmark flag and load_nginx_tasks


Step 4: Implement nginx harness functions (Estimated: 180 LOC)

  • Add to python/agentize/eval/eval_harness.py:
    1. load_nginx_tasks() — JSON loader with filter support (~15 LOC)
    2. setup_nginx_worktree() — two-repo clone + worktree (~50 LOC)
    3. score_nginx() — compile + prove + exit-code check (~60 LOC)
    4. --benchmark CLI argument addition (~5 LOC)
    5. Benchmark dispatch branches in _cmd_run — load/setup/score dispatching (~30 LOC)
    6. Benchmark dispatch in _cmd_score — skip SWE-bench scorer for nginx (~10 LOC)
    7. Support compile_failed status in _make_result and aggregate_metrics (~10 LOC)
  • Dependencies: Step 3
  • Correspondence:
    • Docs: Implements all interfaces from Step 1
    • Tests: Satisfies test cases from Step 3

Total estimated complexity: ~350 LOC (Medium)
Recommended approach: Milestone commits

Milestone strategy:

  • M0 (prerequisite, before any code): Manually validate one nginx task end-to-end: clone nginx source at a pre-fix commit, clone nginx-tests at the corresponding test commit, verify test fails, apply gold patch, compile, verify test passes. Document the exact commands in .tmp/nginx-poc.md. This must succeed before proceeding.
  • M1: Steps 1-2 (docs + task JSON). Commit: [feat][eval] Add nginx benchmark docs and curated task definitions
  • M2: Steps 3-4 (tests + implementation). Commit: [feat][eval] Add nginx benchmark support to eval harness
  • Delivery: Run --benchmark nginx --mode raw --limit 1 --dry-run successfully, then run one real task

Success Criteria

  • Manual end-to-end proof of concept documented in .tmp/nginx-poc.md (M0)
  • python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run exits 0
  • load_nginx_tasks loads 5 tasks from JSON, filtering by instance_ids and limit works
  • setup_nginx_worktree clones both repos and creates a valid worktree
  • score_nginx compiles nginx and scores via prove exit code
  • Existing SWE-bench tests pass unchanged (tests/cli/test-eval-harness-cli.sh)
  • compile_failed is a distinct status in metrics (not lumped with error)

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| No suitable nginx tasks exist (can't find 5 single-commit fixes with deterministic tests) | Medium | High | M0 proof-of-concept validates feasibility before any code is written. If <5 tasks are found, scope down to 3 or pivot to a different C project. |
| Two-repo commit correlation is ambiguous (can't reliably pair a source fix with its test commit) | Medium | High | Pin exact commit pairs in the task JSON. Each pair is validated end-to-end during curation; no automated correlation needed. |
| C compilation failures dominate AI results (LLMs are weaker at C than Python) | High | Medium | Select tasks requiring minimal C changes (1-3 lines). Report compile_failed as a distinct status for meaningful analysis. |
| Test::Nginx CPAN module not installed | Medium | Low | Document in prerequisites. Check for module availability in score_nginx and report a clear error. |
| Flaky nginx tests (timing-dependent network assertions) | Medium | Medium | Only curate tasks with deterministic assertions (regex on response body, not timing). Validate each task 3x during curation. |
| Single file grows too large (1054 + 180 = ~1234 LOC) | Low | Low | 1234 LOC is still manageable for a linear pipeline. If a 3rd benchmark is added, extract to an adapter pattern at that point. |

Dependencies

  • System dependencies (must be pre-installed):
    • C compiler (gcc or clang)
    • PCRE, zlib, OpenSSL development libraries
    • Perl + prove (Test::Harness)
    • Test::Nginx CPAN module
  • No new Python dependencies
  • Repository access: https://github.com/nginx/nginx and https://github.com/nginx/nginx-tests (public repos)
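
The system dependencies above can be probed up front rather than failing mid-run. A hedged sketch (the helper name is hypothetical, not part of the current harness; it only checks tool presence, not versions):

```python
import shutil
import subprocess

def check_nginx_prereqs() -> list[str]:
    """Return names of missing system dependencies (empty list = ready)."""
    missing = [t for t in ("cc", "make", "perl", "prove") if shutil.which(t) is None]
    if shutil.which("perl") is not None:
        # Test::Nginx is a CPAN module; `perl -MTest::Nginx -e 1` exits 0 iff it loads.
        ok = subprocess.run(["perl", "-MTest::Nginx", "-e", "1"],
                            capture_output=True).returncode == 0
        if not ok:
            missing.append("Test::Nginx (install via cpan)")
    return missing
```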

Related PR

TBD - will be updated when PR is created

Metadata

  • Labels: agentize:plan (Plan created by /ultra-planner command)
  • Assignees, type, projects, milestone: none
  • Relationships: none yet
  • Development: no branches or pull requests