[plan][test]: Extend eval harness to benchmark against nginx/nginx-tests #979

@ayazhankadessova

Description

Extend the evaluation harness to run the 4-way comparison (raw vs impl vs full vs nlcmd) against the nginx/nginx-tests repository. This adds a second benchmark alongside SWE-bench Verified to validate the planning flow's effectiveness on a different domain (C/nginx vs Python/astropy), providing the third data point needed for the evaluation report.

Context from team meeting:

  • Current eval results on SWE-bench Verified (5 tasks) show planning improves correctness and patch quality
  • Natural language command orchestration is 2.6x slower than script orchestration but produces richer artifacts
  • With three data points (SWE-bench plus nginx), the team will have enough data for the evaluation report

Proposed Solution

Implementation Plan: Nginx-Tests Benchmark for Evaluation Harness

Consensus Summary

The reducer's minimal approach is adopted as the foundation: add nginx support directly via if/elif branches in the existing single-file harness, with 5 curated tasks in a JSON file and scoring based on the prove exit code. From the bold proposal, we retain the structured task schema (with test_repo, test_commit, and modules_required fields) and the idea of treating compile failure as a distinct status. From the critique, we adopt the critical prerequisite: validate one nginx task end-to-end manually before building the automation. We also raise the LOC estimate to account for the two-repo worktree complexity and prompt parameterization that the reducer underestimated.

Goal

Add nginx/nginx-tests as a second benchmark to the eval harness so the same 4-way mode comparison (raw/impl/full/nlcmd) runs against C-language bug-fix tasks, providing a third cross-domain data point for the evaluation report.

Success criteria:

  • python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run completes without error
  • At least 5 curated nginx tasks load from nginx_tasks.json and produce valid worktrees with both repos
  • Scoring via prove correctly reports pass/fail for a gold-patched nginx build
  • The existing SWE-bench flow (--benchmark swebench, the default) is completely unchanged

Out of scope:

  • Generic benchmark plugin/adapter registry for N benchmarks
    • ✅ Good to have in the future: Extract a dispatch-dict adapter protocol when a 3rd benchmark is added (Rule of Three).
  • Automated task mining script from nginx git history
    • ✅ Good to have in the future: Write scripts/mine_nginx_tasks.py when scaling beyond 10 curated tasks.
  • Full TAP output parser with per-test granularity
    • ✅ Good to have in the future: Parse prove -v output for partial-credit scoring (~20 LOC addition).
  • Docker-based nginx compilation environment
    • ❌ Not needed: nginx builds with ./auto/configure && make on macOS/Linux; Docker adds complexity without benefit at this stage.

Bug Reproduction

Skip reason: This is a feature request, not a bug fix. No reproduction needed.

Codebase Analysis

Files verified (docs/code checked by agents):

  • python/agentize/eval/eval_harness.py: 1054-line single-file harness. load_tasks is HuggingFace-specific. setup_worktree assumes SWE-bench task keys (repo, instance_id, base_commit, problem_statement). Mode dispatch in _cmd_run (lines 926-951) calls different functions per mode but each has SWE-bench-specific prompts. extract_patch (lines 713-732) uses generic git diff — reusable as-is. score_predictions (lines 739-762) calls swebench.harness.run_evaluation — nginx needs a completely different scorer.
  • python/agentize/eval/eval_harness.md: Documents single-file philosophy, three-layer architecture, prerequisites, ramp-up strategy. Must be updated with nginx benchmark section.
  • python/agentize/eval/eval-report-2026-03-01.md: 4-way comparison on 5 SWE-bench tasks. Establishes the reporting format nginx results must follow.
  • tests/cli/test-eval-harness-cli.sh: 58-line shell test covering module import, help text, aggregate_metrics, write_overrides. Must be extended for --benchmark flag.
  • CLAUDE.md: Confirms "early stage, breaking changes OK" and single-file preferences.

File changes:

| File | Level | Purpose |
|------|-------|---------|
| python/agentize/eval/eval_harness.py | major | Add --benchmark CLI flag, load_nginx_tasks(), setup_nginx_worktree(), score_nginx(), benchmark dispatch branches in _cmd_run and _cmd_score, compile_failed status support (~180 LOC added) |
| python/agentize/eval/eval_harness.md | medium | Document nginx benchmark usage, prerequisites, task format (~50 LOC added) |
| python/agentize/eval/nginx_tasks.json (new) | major | 5 curated nginx bug-fix task definitions (est. 100 LOC) |
| tests/cli/test-eval-harness-cli.sh | minor | Add --benchmark nginx flag parsing and nginx task loading tests (~20 LOC added) |

Current architecture notes:

  • The harness is a linear pipeline: load → setup → run → extract → score
  • Mode dispatch (raw/impl/full/nlcmd) is in _cmd_run via if/elif — this pattern is reused for benchmark dispatch
  • The run_impl prompt at line 248 is hardcoded for SWE-bench ("Read .issue.md and implement the fix described") — nginx tasks use the same .issue.md convention so this prompt works without modification
  • _make_result() returns a dict with status field — must support a new "compile_failed" value
  • extract_patch() uses generic git diff — reusable for nginx without changes
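
The compile_failed requirement in the notes above can be illustrated with a small sketch. This is hypothetical code: the real _make_result and aggregate_metrics live in eval_harness.py and their exact shapes are not reproduced here; the sketch only assumes results are dicts with a status field.

```python
from collections import Counter

def aggregate_statuses(results: list[dict]) -> dict:
    """Count result statuses, keeping compile_failed distinct from error."""
    counts = Counter(r.get("status", "error") for r in results)
    return {
        "completed": counts.get("completed", 0),
        "compile_failed": counts.get("compile_failed", 0),
        "error": counts.get("error", 0),
    }
```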

Interface Design

New interfaces:

  1. load_nginx_tasks(task_file: str, instance_ids: list[str] | None, limit: int | None) -> list[dict]

    • Reads nginx_tasks.json, filters by instance_ids and limit
    • Returns list of task dicts with keys: instance_id, repo, test_repo, base_commit, test_commit, problem_statement, test_files, modules_required
    • Simple JSON load + filter, <20 LOC
  2. setup_nginx_worktree(task: dict, repos_dir: Path, worktrees_dir: Path) -> str

    • Clones nginx source bare repo (cache), creates worktree at base_commit
    • Clones nginx-tests repo at test_commit into a sibling directory
    • Writes .issue.md from problem_statement
    • Returns worktree path
    • Steps:
      1. Clone nginx/nginx bare repo (reuses existing setup_worktree pattern)
      2. Create detached worktree at task["base_commit"]
      3. Clone nginx/nginx-tests at task["test_commit"] into worktrees_dir/<instance_id>__tests/
      4. Write .issue.md with problem statement
      5. Return worktree path
  3. score_nginx(wt_path: str, task: dict, tests_path: str) -> dict

    • Compiles nginx from worktree: ./auto/configure <modules> && make
    • If compilation fails, returns {"status": "compile_failed", "resolved": False}
    • Runs prove -v <test_files> with TEST_NGINX_BINARY pointing to compiled binary
    • Returns {"status": "completed", "resolved": <bool from exit code>}
    • Steps:
      1. Run ./auto/configure with task["modules_required"] flags
      2. Run make -j$(nproc)
      3. If make fails → return compile_failed
      4. Set TEST_NGINX_BINARY=<wt>/objs/nginx
      5. Run prove <test_files> from the tests directory
      6. Return resolved=True if exit code 0, else resolved=False
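
The first and third interfaces above might be sketched as follows. This is a hedged sketch under the plan's assumptions (function names, task keys, and the objs/nginx binary path come from this plan, not from existing code), with error handling and timeouts elided:

```python
import json
import os
import subprocess
from pathlib import Path

def load_nginx_tasks(task_file: str, instance_ids=None, limit=None) -> list[dict]:
    """Load curated nginx tasks from JSON, filtering by instance id, then limit."""
    tasks = json.loads(Path(task_file).read_text())
    if instance_ids:
        wanted = set(instance_ids)
        tasks = [t for t in tasks if t["instance_id"] in wanted]
    if limit is not None:
        tasks = tasks[:limit]
    return tasks

def score_nginx(wt_path: str, task: dict, tests_path: str) -> dict:
    """Compile the patched worktree, then run prove on the task's .t files."""
    # ./auto/configure resolves relative to cwd (the worktree) on POSIX.
    configure = ["./auto/configure", *task.get("modules_required", [])]
    if subprocess.run(configure, cwd=wt_path).returncode != 0:
        return {"status": "compile_failed", "resolved": False}
    if subprocess.run(["make", f"-j{os.cpu_count() or 1}"], cwd=wt_path).returncode != 0:
        return {"status": "compile_failed", "resolved": False}
    # Point the nginx-tests harness at the freshly built binary.
    env = dict(os.environ, TEST_NGINX_BINARY=str(Path(wt_path) / "objs" / "nginx"))
    proc = subprocess.run(["prove", *task["test_files"]], cwd=tests_path, env=env)
    return {"status": "completed", "resolved": proc.returncode == 0}
```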

Modified interfaces:

_cmd_run(args) — add benchmark dispatch:

     # Load tasks
-    print(f"Loading tasks from {args.dataset}...")
-    tasks = load_tasks(
-        dataset_name=args.dataset,
-        instance_ids=args.instance_ids,
-        limit=args.limit,
-    )
+    if args.benchmark == "nginx":
+        tasks_file = Path(__file__).parent / "nginx_tasks.json"
+        print(f"Loading nginx tasks from {tasks_file}...")
+        tasks = load_nginx_tasks(str(tasks_file), args.instance_ids, args.limit)
+    else:
+        print(f"Loading tasks from {args.dataset}...")
+        tasks = load_tasks(
+            dataset_name=args.dataset,
+            instance_ids=args.instance_ids,
+            limit=args.limit,
+        )

_cmd_run(args) — worktree setup dispatch:

-        wt_path = setup_worktree(task, repos_dir, worktrees_dir)
+        if args.benchmark == "nginx":
+            wt_path = setup_nginx_worktree(task, repos_dir, worktrees_dir)
+        else:
+            wt_path = setup_worktree(task, repos_dir, worktrees_dir)

_cmd_run(args) — scoring dispatch (after patch extraction):

+        # Nginx: score immediately after extraction
+        if args.benchmark == "nginx" and result["status"] == "completed":
+            tests_path = worktrees_dir / (instance_id + "__tests")
+            score = score_nginx(wt_path, task, str(tests_path))
+            result.update(score)

CLI argument addition:

  run_parser.add_argument("--mode", choices=["raw", "impl", "full", "nlcmd"], default="raw")
+ run_parser.add_argument("--benchmark", choices=["swebench", "nginx"], default="swebench")

Documentation changes:

  • python/agentize/eval/eval_harness.md — add "Nginx Benchmark" section

Documentation Planning

Interface docs

  • python/agentize/eval/eval_harness.md — update with nginx benchmark section:
+ ## Nginx Benchmark
+
+ The harness supports nginx/nginx-tests as a second benchmark via `--benchmark nginx`.
+ This runs the same 4-way mode comparison against C-language bug-fix tasks from the
+ nginx web server.
+
+ ### Prerequisites (nginx-specific)
+
+ | Dependency | Purpose | Install |
+ |------------|---------|---------|
+ | C compiler (gcc/clang) | Compile nginx from source | System package |
+ | PCRE library | nginx regex support | `brew install pcre` / `apt install libpcre3-dev` |
+ | zlib | nginx gzip support | `brew install zlib` / `apt install zlib1g-dev` |
+ | OpenSSL | nginx SSL support | `brew install openssl` / `apt install libssl-dev` |
+ | Perl + prove | Run nginx test suite | Pre-installed on macOS/Linux |
+ | Test::Nginx | Perl test framework for nginx | `cpan Test::Nginx` |
+
+ ### Usage
+
+ ```bash
+ # Dry-run
+ python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run
+
+ # Single task
+ python -m agentize.eval.eval_harness run --benchmark nginx --mode raw \
+     --instance-ids nginx__12345 --timeout 1800
+ ```
+
+ ### Task Format (nginx_tasks.json)
+
+ Each task specifies:
+ - `instance_id`: Unique identifier (nginx commit hash)
+ - `repo`: `nginx/nginx` (source repo)
+ - `test_repo`: `nginx/nginx-tests` (test suite repo)
+ - `base_commit`: Pre-fix commit in nginx source
+ - `test_commit`: Corresponding commit in nginx-tests
+ - `problem_statement`: Bug description for the AI
+ - `test_files`: List of `.t` files to run
+ - `modules_required`: nginx configure flags needed
+
+ ### Scoring
+
+ Scoring compiles nginx from the patched worktree and runs `prove` against
+ the specified test files. Results are binary: resolved (all tests pass) or
+ not resolved (any test fails or compilation fails). Compilation failures
+ are reported as `compile_failed` status.

Test Strategy

Test modifications:

  • tests/cli/test-eval-harness-cli.sh — extend existing tests
    • Test case: --benchmark nginx flag is accepted by run --help
    • Test case: load_nginx_tasks loads and filters from JSON correctly
    • Test case: _make_result supports compile_failed status (via aggregate_metrics test)

Test data required:

  • python/agentize/eval/nginx_tasks.json — the curated task file serves as test fixture for load tests

Implementation Steps

Step 1: Update documentation (Estimated: 50 LOC)

  • Add nginx benchmark section to python/agentize/eval/eval_harness.md (as shown in Documentation Planning)
  • Dependencies: None
  • Correspondence:
    • Docs: Defines nginx benchmark interface, prerequisites, task format, and scoring semantics
    • Tests: N/A

Step 2: Create curated nginx task definitions (Estimated: 100 LOC)

  • Create python/agentize/eval/nginx_tasks.json with 5 validated tasks
  • Each task must be manually verified: checkout base_commit, confirm test fails, apply gold patch, confirm test passes
  • Task schema:
[
  {
    "instance_id": "nginx__<short_hash>",
    "repo": "nginx/nginx",
    "test_repo": "nginx/nginx-tests",
    "base_commit": "<commit_before_fix>",
    "test_commit": "<corresponding_test_commit>",
    "problem_statement": "<bug description derived from commit message and code context>",
    "test_files": ["proxy.t"],
    "modules_required": ["--with-http_ssl_module"],
    "gold_patch": "<the actual fix diff for validation>"
  }
]
  • Dependencies: Step 1
  • Correspondence:
    • Docs: Implements the task format defined in Step 1
    • Tests: Serves as fixture for load_nginx_tasks tests
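
During curation, a tiny schema check catches malformed entries before they reach the harness. A sketch, assuming the key set shown in the schema above (the helper name is hypothetical, not part of the current harness):

```python
REQUIRED_TASK_KEYS = {
    "instance_id", "repo", "test_repo", "base_commit", "test_commit",
    "problem_statement", "test_files", "modules_required", "gold_patch",
}

def validate_nginx_task(task: dict) -> list[str]:
    """Return a list of problems with a task entry (empty list = valid)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_TASK_KEYS - task.keys())]
    if not task.get("test_files"):
        problems.append("test_files must list at least one .t file")
    return problems
```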

Step 3: Add CLI tests for nginx benchmark (Estimated: 20 LOC)

  • Extend tests/cli/test-eval-harness-cli.sh:
# Test 7: --benchmark flag is accepted
python -m agentize.eval.eval_harness run --help 2>&1 | grep -q "benchmark" || test_fail "--benchmark flag missing"

# Test 8: load_nginx_tasks loads tasks from JSON
output=$(python -c "
import json
from agentize.eval.eval_harness import load_nginx_tasks
tasks = load_nginx_tasks('python/agentize/eval/nginx_tasks.json', None, 2)
print(json.dumps({'count': len(tasks), 'has_id': 'instance_id' in tasks[0]}))
")
echo "$output" | python -c "
import sys, json
d = json.load(sys.stdin)
assert d['count'] <= 2, f'limit not applied: {d[\"count\"]}'
assert d['has_id'], 'missing instance_id'
" || test_fail "load_nginx_tasks failed"
  • Dependencies: Step 2
  • Correspondence:
    • Docs: Tests the interfaces defined in Step 1
    • Tests: New test cases for --benchmark flag and load_nginx_tasks


Step 4: Implement nginx harness functions (Estimated: 180 LOC)

  • Add to python/agentize/eval/eval_harness.py:
    1. load_nginx_tasks() — JSON loader with filter support (~15 LOC)
    2. setup_nginx_worktree() — two-repo clone + worktree (~50 LOC)
    3. score_nginx() — compile + prove + exit-code check (~60 LOC)
    4. --benchmark CLI argument addition (~5 LOC)
    5. Benchmark dispatch branches in _cmd_run — load/setup/score dispatching (~30 LOC)
    6. Benchmark dispatch in _cmd_score — skip SWE-bench scorer for nginx (~10 LOC)
    7. Support compile_failed status in _make_result and aggregate_metrics (~10 LOC)
  • Dependencies: Step 3
  • Correspondence:
    • Docs: Implements all interfaces from Step 1
    • Tests: Satisfies test cases from Step 3

Total estimated complexity: ~350 LOC (Medium)
Recommended approach: Milestone commits

Milestone strategy:

  • M0 (prerequisite, before any code): Manually validate one nginx task end-to-end: clone nginx source at a pre-fix commit, clone nginx-tests at the corresponding test commit, verify test fails, apply gold patch, compile, verify test passes. Document the exact commands in .tmp/nginx-poc.md. This must succeed before proceeding.
  • M1: Steps 1-2 (docs + task JSON). Commit: [feat][eval] Add nginx benchmark docs and curated task definitions
  • M2: Steps 3-4 (tests + implementation). Commit: [feat][eval] Add nginx benchmark support to eval harness
  • Delivery: Run --benchmark nginx --mode raw --limit 1 --dry-run successfully, then run one real task

Success Criteria

  • Manual end-to-end proof of concept documented in .tmp/nginx-poc.md (M0)
  • python -m agentize.eval.eval_harness run --benchmark nginx --mode raw --limit 1 --dry-run exits 0
  • load_nginx_tasks loads 5 tasks from JSON, filtering by instance_ids and limit works
  • setup_nginx_worktree clones both repos and creates a valid worktree
  • score_nginx compiles nginx and scores via prove exit code
  • Existing SWE-bench tests pass unchanged (tests/cli/test-eval-harness-cli.sh)
  • compile_failed is a distinct status in metrics (not lumped with error)

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| No suitable nginx tasks exist (can't find 5 single-commit fixes with deterministic tests) | Medium | High | M0 proof-of-concept validates feasibility before any code is written. If <5 tasks are found, scope down to 3 or pivot to a different C project. |
| Two-repo commit correlation is ambiguous (can't reliably pair a source fix with its test commit) | Medium | High | Pin exact commit pairs in the task JSON. Each pair is validated end-to-end during curation; no automated correlation needed. |
| C compilation failures dominate AI results (LLMs are weaker at C than Python) | High | Medium | Select tasks requiring minimal C changes (1-3 lines). Report compile_failed as a distinct status for meaningful analysis. |
| Test::Nginx CPAN module not installed | Medium | Low | Document in prerequisites. Check for module availability in score_nginx and report a clear error. |
| Flaky nginx tests (timing-dependent network assertions) | Medium | Medium | Only curate tasks with deterministic assertions (regex on response body, not timing). Validate each task 3x during curation. |
| Single file grows too large (1054 + 180 = ~1234 LOC) | Low | Low | 1234 LOC is still manageable for a linear pipeline. If a 3rd benchmark is added, extract to an adapter pattern at that point. |

Dependencies

  • System dependencies (must be pre-installed):
    • C compiler (gcc or clang)
    • PCRE, zlib, OpenSSL development libraries
    • Perl + prove (Test::Harness)
    • Test::Nginx CPAN module
  • No new Python dependencies
  • Repository access: https://github.com/nginx/nginx and https://github.com/nginx/nginx-tests (public repos)
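
The system dependencies above can be probed up front rather than failing mid-run. A hedged sketch (the helper name is hypothetical, not part of the current harness; it only checks tool presence, not versions):

```python
import shutil
import subprocess

def check_nginx_prereqs() -> list[str]:
    """Return names of missing system dependencies (empty list = ready)."""
    missing = [t for t in ("cc", "make", "perl", "prove") if shutil.which(t) is None]
    if shutil.which("perl") is not None:
        # Test::Nginx is a CPAN module; `perl -MTest::Nginx -e 1` exits 0 iff it loads.
        ok = subprocess.run(["perl", "-MTest::Nginx", "-e", "1"],
                            capture_output=True).returncode == 0
        if not ok:
            missing.append("Test::Nginx (install via cpan)")
    return missing
```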

Related PR

TBD - will be updated when PR is created

Metadata

  • Labels: agentize:plan (Plan created by /ultra-planner command)
  • Assignees, type, projects, milestone: none
  • Relationships: none yet
  • Development: no branches or pull requests