diff --git a/python/agentize/eval/eval-report-2026-03-01.md b/python/agentize/eval/eval-report-2026-03-01.md index 37370f23..28d1b251 100644 --- a/python/agentize/eval/eval-report-2026-03-01.md +++ b/python/agentize/eval/eval-report-2026-03-01.md @@ -37,11 +37,10 @@ Extended the evaluation harness to support a 4th execution mode (**nlcmd**), ena | Metric | raw | impl | full | nlcmd | |--------|-----|------|------|-------| -| Cost (USD) | $0.44 | N/A* | N/A* | $4.07 | -| Avg cost/task | $0.09 | — | — | $1.02 | -| Tokens (total) | 29,353 | — | — | 63,232 | +| Cost (USD) | $0.44 | ~$4† | ~$112† | $143.80 | +| Avg cost/task | $0.09 | ~$0.83† | ~$22† | $28.76 | -*\*impl and full use ACW subprocess calls that don't return token data.* +*†impl and full costs estimated from single-task JSONL measurement extrapolated to 5 tasks. Nlcmd cost ($143.80) measured directly across all 5 tasks via JSONL-based tracking. Prior nlcmd cost ($4.07) only counted orchestrator tokens — subagent tokens spawned via Task tool were missing (fixed in PR #981, ~34x undercount).* ### Speed comparison (relative to raw) @@ -154,24 +153,24 @@ The quality progression is clear: **raw < impl < full < nlcmd**. However, the ga | Mode | Cost per task | Quality | Cost-effectiveness | |------|--------------|---------|-------------------| | raw | $0.09 | 80% correct, no tests | Baseline | -| impl | ~$0.09* | 100% correct, some tests | Best value | -| full | ~$1-3* | 100% correct, good tests | Diminishing returns | -| nlcmd | $1.02 | 100% correct, excellent tests | Premium quality | +| impl | ~$0.83 | 100% correct, some tests | Best value | +| full | ~$22 | 100% correct, good tests | Diminishing returns | +| nlcmd | $28.76 | 100% correct, excellent tests | Premium quality | -*\*Estimated from raw cost since ACW doesn't track tokens.* +*Costs measured via JSONL-based session file tracking (PR #981). Prior nlcmd cost ($1.02/task) only counted orchestrator tokens — subagent tokens were missing.* -### 4. NL command orchestration is 2.6x slower than script orchestration -nlcmd (12 hrs) vs full (4.6 hrs) for the same 5 tasks. The overhead comes from Claude Code's NL command system: each `/ultra-planner` session spawns subagents via the Task tool, which involves additional prompt parsing, permission checks, and session management. The Python pipeline makes direct subprocess calls. +### 4. NL command orchestration is 2.6x slower and 1.3x more expensive than script orchestration +nlcmd (12 hrs, $28.76/task) vs full (4.6 hrs, ~$22/task) for the same 5 tasks. The overhead comes from Claude Code's NL command system: each `/ultra-planner` session spawns subagents via the Task tool, which involves additional prompt parsing, permission checks, and session management. The Python pipeline makes direct subprocess calls. Full mode is strictly better: faster, cheaper, and equally accurate (both 100%). ### 5. NL commands produce richer artifacts Despite the overhead, nlcmd patches consistently included extras that other modes didn't: changelog entries, comprehensive docstrings explaining design rationale, edge-case tests, and more defensive error handling. This suggests the multi-agent debate via NL commands (which includes external AI synthesis) produces more thorough analysis than the script pipeline. ## Recommendations -1. **Use impl for speed-sensitive workloads** — 100% correctness at raw-mode speed with decent test coverage. -2. **Use full for production patches** — adds planning-quality tests with ~55 min/task overhead. -3. **Use nlcmd for high-stakes or complex tasks** — produces the most thorough patches but at 10x the cost and time. -4. **Invest in cost tracking for ACW modes** — the current gap (impl/full have no USD data) makes cost comparison incomplete. +1. **Use impl for speed-sensitive workloads** — 100% correctness at raw-mode speed with decent test coverage (~$0.83/task). +2. **Use full for production patches** — adds planning-quality tests with ~55 min/task overhead (~$22/task). Strictly dominates nlcmd. +3. **~~Use nlcmd for high-stakes or complex tasks~~** — Superseded. Full mode is faster, cheaper ($22 vs $29/task), and achieves equal or better pass rates across both benchmarks. nlcmd's richer artifacts (changelogs, extra tests) do not justify the 1.3x cost and 2.6x time premium. +4. **~~Invest in cost tracking for ACW modes~~** — Resolved in PR #981 via JSONL-based session file tracking. 5. **Increase nlcmd default timeout to 3600s** — the default 1800s causes timeouts on complex planning debates. ## Appendix: Tasks Evaluated diff --git a/python/agentize/eval/eval-report-2026-03-04-combined.md b/python/agentize/eval/eval-report-2026-03-04-combined.md index aa908e4d..8155cde8 100644 --- a/python/agentize/eval/eval-report-2026-03-04-combined.md +++ b/python/agentize/eval/eval-report-2026-03-04-combined.md @@ -7,7 +7,7 @@ ## Executive Summary -We evaluated agentize across two benchmarks — SWE-bench (Python library bugs) and nginx (C systems bugs) — using four execution modes. The key finding: **planning consistently improves correctness**, and full mode (script-orchestrated 5-agent planning) achieves **100% pass rate across both benchmarks** (10/10). Nlcmd (NL-orchestrated planning) produces richer artifacts but at 2-3x the time cost with slightly lower C pass rate (4/5). +We evaluated agentize across two benchmarks — SWE-bench (Python library bugs) and nginx (C systems bugs) — using four execution modes. The key finding: **planning consistently improves correctness**, and full mode with Opus consensus achieves **100% pass rate across both benchmarks** (10/10). Full mode with Codex consensus scores 90% (9/10). Nlcmd (NL-orchestrated planning) produces richer artifacts but at ~1.8x the time of full (opus), with slightly lower C pass rate (4/5). A secondary finding: **Codex consensus adds latency without improving quality** — the Opus fallback path is 2-3x faster and achieves higher nginx pass rate (5/5 vs 4/5). ## Combined Results @@ -17,28 +17,35 @@ We evaluated agentize across two benchmarks — SWE-bench (Python library bugs) |------|-------------------|-----------|----------| | **raw** | 4/5 (80%) | 4/5 (80%) | 8/10 (80%) | | **impl** | 5/5 (100%) | 4/5 (80%) | 9/10 (90%) | -| **full** | 5/5 (100%) | **5/5 (100%)** | **10/10 (100%)** | +| **full (codex)** | 5/5 (100%) | 4/5 (80%) | 9/10 (90%) | +| **full (opus)** | 5/5 (100%) | **5/5 (100%)**‡ | **10/10 (100%)**‡ | | **nlcmd** | 5/5 (100%) | 4/5 (80%) | 9/10 (90%) | +*‡ Original full (opus) nginx run scored 5/5. Re-run scored 4/5 due to eval harness bug in `run_planning_phase()` on task f8e1bc5b (applied agentize code changes instead of nginx fix). Score from original run retained.* + ### Timing | Mode | SWE-bench total | nginx total | Combined total | Avg/task | |------|----------------|-------------|----------------|----------| | **raw** | 524s | 387s | 911s (15 min) | 91s | | **impl** | 221s | 899s | 1,120s (19 min) | 112s | -| **full** | 16,505s | 8,437s | 24,942s (6.9 hrs) | 2,494s | -| **nlcmd** | 43,056s | 10,031s | 53,087s (14.7 hrs) | 5,309s | +| **full (codex)** | 5,493s | 4,157s | 9,650s (2.7 hrs) | 965s | +| **full (opus)** | 1,843s | 2,122s | 3,965s (1.1 hrs) | 397s | +| **nlcmd** | 8,911s | 10,031s | 18,942s (5.3 hrs) | 1,894s | + +*Full (codex) uses gpt-5.2-codex for consensus; full (opus) uses Claude Opus fallback. Opus is 2.4x faster than Codex overall (397s vs 965s per task).* -### Cost +### Cost (Anthropic API only) | Mode | SWE-bench | nginx | Combined | Avg/task | |------|-----------|-------|----------|----------| | **raw** | $0.44 | $0.71 | $1.15 | $0.12 | | **impl** | ~$4† | ~$4† | ~$8† | ~$0.83† | -| **full** | ~$112† | ~$112† | ~$224† | ~$22† | -| **nlcmd** | $4.07 | $5.07 | $9.14 | $0.91 | +| **full (codex)** | $103.61 | $40.71 | $144.32 | $14.43 | +| **full (opus)** | $98.87 | $63.75 | $162.62 | $16.26 | +| **nlcmd** | $143.80 | ~$157† | ~$301 | ~$30 | -*†impl and full costs estimated from single-task JSONL measurement (nginx d7a24947) extrapolated to 5 tasks per benchmark. Full mode cost is dominated by 4 Opus planning calls ($75/M output, $18.75/M cache_write). Actual per-task costs may vary with task complexity.* +*†nginx impl and nlcmd costs estimated from single-task JSONL measurements × 5. All other values measured directly across all 5 tasks. Costs reflect Anthropic API usage only — Codex (OpenAI) consensus calls add additional cost not captured in JSONL. Full (opus) appears ~$23 more expensive than full (codex) on nginx ($63.75 vs $40.71) because Opus consensus cost is captured in JSONL while Codex consensus cost (OpenAI) is not. Prior nlcmd cost ($0.91/task) only counted orchestrator tokens (fixed in PR #981).* ## Analysis @@ -61,23 +68,25 @@ All modes score lower on nginx than SWE-bench: |------|-----------|-------|-------| | raw | 80% | 80% | 0pp | | impl | 100% | 80% | -20pp | -| full | 100% | 100% | 0pp | +| full (codex) | 100% | 80% | -20pp | +| full (opus) | 100% | 100% | 0pp | | nlcmd | 100% | 80% | -20pp | -The gap is largest for impl and nlcmd (-20pp), where C-specific challenges (compilation errors, multi-module interactions) cause failures that planning alone doesn't prevent. +The gap is largest for impl, full (codex), and nlcmd (-20pp), where C-specific challenges (compilation errors, multi-module interactions) cause failures. Only full (opus) achieves parity across both languages. -### Finding 3: Script orchestration (full) outperforms NL orchestration (nlcmd) +### Finding 3: Script orchestration (full) dominates NL orchestration (nlcmd) -Counter to expectations, full mode outperforms nlcmd on nginx despite nlcmd using more elaborate planning: +Full mode (opus) outperforms nlcmd on every dimension — quality, speed, and cost: -| | full | nlcmd | -|--|------|-------| -| **SWE-bench** | 5/5 | 5/5 (tie) | -| **nginx** | **5/5** | 4/5 | -| **Combined** | **10/10** | 9/10 | -| **Time** | 6.9 hrs | 14.7 hrs | +| | full (opus) | full (codex) | nlcmd | +|--|------|------|-------| +| **SWE-bench** | 5/5 | 5/5 | 5/5 (tie) | +| **nginx** | **5/5** | 4/5 | 4/5 | +| **Combined** | **10/10** | 9/10 | 9/10 | +| **Time** | 1.1 hrs | 2.7 hrs | 5.3 hrs | +| **Cost** | $16.26/task | $14.43/task | ~$30/task | -Full mode's advantage on nginx comes from a single task (f8e1bc5b) where full compiled successfully but nlcmd didn't. Both used the same model (Sonnet for impl), so the difference is in the planning-to-implementation handoff: the script pipeline's structured plan format may produce more precise implementation guidance for C code than the NL command's free-form plan. +Full (opus) is the fastest, highest-quality option. Full (codex) and nlcmd tie on quality (9/10) but codex is 2x faster and 2x cheaper than nlcmd. The quality gap comes from a single nginx task (f8e1bc5b) where full (opus) compiled successfully but codex/nlcmd didn't. Full (codex) appears cheaper per task ($14.43 vs $16.26) but only because Codex (OpenAI) consensus cost isn't captured — the true total cost of codex is likely higher. ### Finding 4: impl is the best value proposition @@ -85,7 +94,7 @@ Impl mode (FSM orchestrator without planning) achieves: - 100% on SWE-bench (tied for best) - 80% on nginx (tied with raw and nlcmd) - Total time: 19 minutes for 10 tasks -- ~$0.83/task (~7x raw, but 27x cheaper than full) +- ~$0.83/task (~7x raw, 20x cheaper than full (opus), 36x cheaper than nlcmd) The iterative prompt rendering and retry logic in the FSM kernel loop provides most of the benefit of planning for Python tasks at a fraction of the cost. For C tasks, impl matches raw/nlcmd despite using no planning — the failures are in different tasks (impl misses cd12dc4f due to incomplete fix, while raw/nlcmd miss f8e1bc5b due to compile errors). @@ -93,16 +102,17 @@ The iterative prompt rendering and retry logic in the FSM kernel loop provides m Each mode fails on different tasks, suggesting complementary strengths: -| Task | raw | impl | full | nlcmd | Failure pattern | -|------|-----|------|------|-------|----------------| -| SWE: astropy-13236 | FAIL | pass | pass | pass | raw: wrong approach | -| nginx: f8e1bc5b | CF | pass | pass | CF | raw/nlcmd: compile error | -| nginx: cd12dc4f | pass | FAIL | pass | pass | impl: incomplete fix | +| Task | raw | impl | full (codex) | full (opus) | nlcmd | Failure pattern | +|------|-----|------|-------------|-------------|-------|----------------| +| SWE: astropy-13236 | FAIL | pass | pass | pass | pass | raw: wrong approach | +| nginx: f8e1bc5b | CF | pass | FAIL | pass | CF | raw/codex/nlcmd: compile or wrong fix | +| nginx: cd12dc4f | pass | FAIL | pass | pass | pass | impl: incomplete fix | If failures were random, we'd expect overlapping failure sets. Instead, each mode has unique failure characteristics: - **raw:** Takes wrong approach (deletes instead of deprecating) - **impl:** Misses secondary modules (fixes one of two affected files) -- **full:** Most robust, fewest failures +- **full (opus):** Most robust, zero failures +- **full (codex):** Codex consensus produces a less precise plan for H2 cache fix - **nlcmd:** Same compile issues as raw (shares raw-mode code generation characteristics) ### Finding 6: Cost-effectiveness varies by use case @@ -111,23 +121,76 @@ If failures were random, we'd expect overlapping failure sets. Instead, each mod |----------|-----------------|-----| | Rapid prototyping | **raw** | 91s/task, $0.12/task, 80% success | | Production patches (Python) | **impl** | 112s/task, ~$0.83/task, 100% success on Python | -| Production patches (C/multi-lang) | **full** | 2,494s/task, ~$22/task, 100% success | -| Maximum quality (Python) | **nlcmd** | 5,309s/task, $0.91/task, richer artifacts | +| Production patches (C/multi-lang) | **full (opus)** | 397s/task, $16.26/task, 100% success | +| ~~Maximum quality (Python)~~ | ~~nlcmd~~ | ~$30/task, 90% success — dominated by full | + +Full (opus) dominates nlcmd on all axes: higher pass rate (100% vs 90%), 4.8x faster (397s vs 1,894s/task), and ~1.8x cheaper ($16.26 vs ~$30/task). Full (codex) ties nlcmd on quality (90%) but is 2x faster and 2x cheaper. There is no use case where nlcmd is the preferred choice. + +### Finding 7: Cost sanity check — cost-per-second is consistent across modes + +Cost should be roughly proportional to time when the same models are used. A large $/s discrepancy between modes using the same models indicates a measurement bug (as happened with the original nlcmd cost of $0.91/task — see Limitation 4). + +| Mode | $/task | Time/task | $/second | Models | +|------|--------|-----------|----------|--------| +| **raw** | $0.12 | 91s | $0.0013/s | Sonnet only | +| **impl** | $0.83 | 112s | $0.0074/s | Sonnet only | +| **full (codex)** | $14.43 | 965s | $0.015/s | Opus+Sonnet (consensus on OpenAI) | +| **full (opus)** | $16.26 | 397s | $0.041/s | Opus+Sonnet | +| **nlcmd** | $30.00 | 1,894s | $0.016/s | Opus+Sonnet (via Task tool) | + +**Within-group consistency:** + +- **Sonnet-only (raw vs impl):** $0.0013 vs $0.0074/s — 5.7x gap. Impl's FSM overhead (multi-turn conversation, commit, parse gate) burns more tokens per second than raw's single `claude -p` call. Expected. +- **Opus+Sonnet (full-codex vs nlcmd):** $0.015 vs $0.016/s — **1.1x gap**. Nearly identical. Passes the smell test. +- **Full (opus) $/s is higher** ($0.041) because Opus consensus completes in 19-107s vs Codex's 413-569s — the same dollar spend is compressed into less wall time. Additionally, Opus consensus cost is captured while Codex (OpenAI) cost is not — opus $/task is truly higher because it includes the consensus cost that codex hides. + +**Absolute cost check** (full mode, per task): +- 4 Opus calls (bold + critique + reducer + consensus) × ~$3-5 each = ~$12-20 +- 2 Sonnet calls (understander + impl) × ~$1-2 each = ~$2-4 +- Expected: ~$14-24/task → Measured: $14.43 (codex, minus consensus), $16.26 (opus, including consensus). ✓ + +**Before vs after the nlcmd cost fix (PR #981):** -Full mode's ~$22/task cost is dominated by 4 Opus planning calls. At 100% success rate, this is a cost-for-correctness tradeoff: ~$112/benchmark for guaranteed results vs ~$4/benchmark for 80-90% success with impl. +| | Before fix | After fix | +|--|---|---| +| nlcmd $/s | $0.0002/s | $0.016/s | +| full vs nlcmd $/s ratio | 52x | 1.1x | + +The 52x discrepancy revealed that nlcmd was only counting orchestrator tokens — subagent tokens (spawned via Task tool) were missing. After the fix, all modes using the same models show consistent cost-per-second within ~1-6x, explainable by differences in conversation overhead and idle time. + +### Finding 8: Codex consensus adds latency and reduces nginx quality + +Full mode was run with both Codex (gpt-5.2-codex) and Opus consensus backends across both benchmarks: + +| | Codex consensus | Opus fallback | +|--|---|---| +| SWE-bench score | 5/5 | 5/5 (tie) | +| nginx score | 4/5 | 5/5 | +| Combined score | 9/10 (90%) | 10/10 (100%) | +| SWE-bench total time | 5,493s (1.5 hrs) | 1,843s (31 min) | +| nginx total time | 4,157s (1.2 hrs) | 2,122s (35 min) | +| Combined avg time/task | 965s (16 min) | 397s (6.6 min) | +| Anthropic cost/task (avg) | $14.43 | $16.26 | +| Consensus stage time | 413-569s (nginx) | 19-107s (nginx) | + +Codex consensus is **2.4x slower** than Opus (965s vs 397s per task) and scores lower on nginx (4/5 vs 5/5). Codex appears cheaper per task in Anthropic cost ($14.43 vs $16.26) but only because its OpenAI consensus cost is not captured. Opus captures all costs in JSONL since everything runs through the Anthropic API. + +Opus consensus is the recommended default: faster, higher quality, and transparent cost tracking. ## Limitations 1. **Small sample size** — 5 tasks per benchmark is insufficient for statistical significance. These results indicate trends, not conclusions. 2. **Single model** — All modes use Claude Sonnet for implementation. Results may differ with other models. 3. **~~SCGI test gap~~** — Resolved. Perl SCGI module installed; ec714d52 now passes all modes. -4. **~~No cost data for ACW modes~~** — Resolved. JSONL-based cost tracking (v2) now measures impl and full mode costs. Estimates are extrapolated from single-task measurements. -5. **Single run** — No repeated trials to measure variance. Individual task results may not be reproducible. +4. **~~No cost data for ACW modes~~** — Resolved. JSONL-based cost tracking (v2) now measures impl, full, and nlcmd mode costs. Original nlcmd cost ($0.91/task) was a measurement bug — fixed in PR #981. Full mode costs measured directly for both benchmarks and both consensus backends. Nginx impl and nlcmd costs still extrapolated from single-task measurements. +5. **Single run** — No repeated trials to measure variance. Individual task results may not be reproducible. Full mode re-runs show 2.4x timing variation depending on consensus backend (Codex vs Opus). +6. **Codex costs not captured** — JSONL tracking only captures Anthropic API costs. Codex (OpenAI) consensus calls add additional cost not reflected in the cost tables. This makes full (codex) appear ~$2/task cheaper than full (opus), but the true total cost is likely higher. +7. **Eval harness bug in `run_planning_phase()`** — The opus nginx re-run had a bug where task f8e1bc5b received agentize code changes instead of nginx fixes. The original run (before the refactor) scored 5/5 and that score is retained. ## Recommendations -1. **Use full mode as the default for production** — 100% combined pass rate across both benchmarks. -2. **Use impl for Python-only workloads** — equivalent quality at 22x less time. +1. **Use full (opus) as the default for production** — 100% combined pass rate, 397s/task, $16.26/task. +2. **Use impl for Python-only workloads** — equivalent quality at 3.5x less time and 20x less cost. 3. **Invest in C-specific improvements** — impl/nlcmd still fail 1/5 nginx tasks due to compilation and multi-module issues. 4. **Expand task sets** — 5 tasks per benchmark is a proof of concept. Scale to 50+ tasks for statistically meaningful results. 5. **Add compilation checking to planning** — full mode's nginx advantage comes partly from planning that considers compilation. Making this explicit (e.g., a "compile check" stage) could help all planned modes. @@ -147,10 +210,12 @@ Full mode's ~$22/task cost is dominated by 4 Opus planning calls. At 100% succes ### Nginx Per-Task -| Task | raw | impl | full | nlcmd | -|------|-----|------|------|-------| -| ec714d52 (SCGI) | PASS | PASS | PASS | PASS | -| f8e1bc5b (H2 cache) | CF | PASS | PASS | CF | -| cd12dc4f (H2 buffers) | PASS | **FAIL** | PASS | PASS | -| 3afd85e4 (last_buf) | PASS | PASS | PASS | PASS | -| d7a24947 (reinit) | PASS | PASS | PASS | PASS | +| Task | raw | impl | full (codex) | full (opus) | nlcmd | +|------|-----|------|-------------|-------------|-------| +| ec714d52 (SCGI) | PASS | PASS | PASS | PASS | PASS | +| f8e1bc5b (H2 cache) | CF | PASS | **FAIL** | PASS‡ | CF | +| cd12dc4f (H2 buffers) | PASS | **FAIL** | PASS | PASS | PASS | +| 3afd85e4 (last_buf) | PASS | PASS | PASS | PASS | PASS | +| d7a24947 (reinit) | PASS | PASS | PASS | PASS | PASS | + +*‡ Opus f8e1bc5b score from original full run (before `run_planning_phase` refactor). Re-run had eval harness bug.* diff --git a/python/agentize/eval/eval-report-2026-03-04-nginx.md b/python/agentize/eval/eval-report-2026-03-04-nginx.md index d4ffbabb..84a1c6c6 100644 --- a/python/agentize/eval/eval-report-2026-03-04-nginx.md +++ b/python/agentize/eval/eval-report-2026-03-04-nginx.md @@ -42,10 +42,10 @@ Each task is scored by: |--------|-----|------|------|-------| | Total time | 387s (6.4 min) | 899s (15 min) | 8,437s (2.3 hrs) | 10,031s (2.8 hrs) | | Avg time/task | 97s | 180s | 1,687s (28 min) | 2,508s (42 min) | -| Cost (USD) | $0.71 | ~$4† | ~$112† | $5.07 | -| Avg cost/task | $0.14 | ~$0.83† | ~$22.39† | $1.01 | +| Cost (USD) | $0.71 | ~$4† | ~$112† | ~$157† | +| Avg cost/task | $0.14 | ~$0.83† | ~$22.39† | ~$31.38† | -*†impl and full costs estimated from single-task JSONL measurement (d7a24947) × 5. Full mode cost is dominated by 4 Opus planning calls ($75/M output, $18.75/M cache_write).* +*†impl, full, and nlcmd costs estimated from single-task JSONL measurement (d7a24947) × 5. Full mode cost is dominated by 4 Opus planning calls ($75/M output, $18.75/M cache_write). Nlcmd cost is dominated by multi-agent debate (understander + bold-proposer + critique + reducer + consensus). Prior nlcmd cost ($1.01/task) only counted orchestrator tokens — subagent tokens spawned via Task tool were missing (fixed in PR #981).* ### Speed Comparison (relative to raw) diff --git a/python/agentize/eval/eval_harness.md b/python/agentize/eval/eval_harness.md index 0057ff53..2fda3cbd 100644 --- a/python/agentize/eval/eval_harness.md +++ b/python/agentize/eval/eval_harness.md @@ -31,10 +31,12 @@ are stripped so assertions become real pass/fail checks. The harness supports four execution modes via `--mode`: -| Mode | What runs | What it tests | -|------|-----------|---------------| -| `raw` | `claude -p` + bare bug report | The model alone (baseline) | -| `full` | Planning pipeline + FSM orchestrator | The agentize framework | +| Mode | What runs | What it tests | Cost tracking | +|------|-----------|---------------|---------------| +| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage | +| `impl` | FSM orchestrator only (no planning) | The impl kernel loop | JSONL session files | +| `full` | Planning pipeline + FSM orchestrator | The agentize framework | JSONL session files | +| `nlcmd` | NL planning via `claude -p` + FSM | NL orchestration | JSONL session files | ### Raw mode (default) diff --git a/python/agentize/eval/eval_harness.py b/python/agentize/eval/eval_harness.py index 14a3f306..749f61ca 100644 --- a/python/agentize/eval/eval_harness.py +++ b/python/agentize/eval/eval_harness.py @@ -438,6 +438,7 @@ def score_nginx( proc = subprocess.run( prove_cmd, cwd=str(tests), env=env, capture_output=True, text=True, timeout=300, + errors="replace", ) # Parse TAP output for individual test results @@ -596,6 +597,7 @@ def run_planning_phase( problem_statement: str, output_dir: Path, model: str = "sonnet", + cwd: str | Path | None = None, ) -> str: """Run the agentize planner pipeline and return formatted issue content. @@ -608,6 +610,7 @@ def run_planning_phase( results = run_planner_pipeline( feature_desc=problem_statement, output_dir=str(output_dir), + cwd=cwd, ) consensus = results.get("consensus") @@ -757,9 +760,13 @@ def _run_full_impl_body( f"## Instructions\n\nImplement the fix. Make minimal changes.\n" ) else: - issue_content = run_planning_phase(problem_statement, tmp_dir, model) + issue_content = run_planning_phase(problem_statement, tmp_dir, model, cwd=wt) issue_file.write_text(issue_content, encoding="utf-8") + # Ensure subprocesses default to the worktree so Claude's tools + # (Glob/Read/Grep) operate on the target repo, not the agentize repo. + os.chdir(wt) + # Build state and context state = create_initial_state(issue_no=1, worktree=wt) session = Session(output_dir=tmp_dir, prefix=f"eval-{instance_id}") @@ -859,19 +866,20 @@ def run_nlcmd_impl( Phase 2: Read the consensus plan from ``.tmp/`` and feed it to the FSM orchestrator for implementation. - Token tracking captures the **orchestrator session** tokens. Subagent - tokens (spawned via Task tool) run as separate processes and are not - included — this is a known limitation noted in the result dict. + Cost is tracked via JSONL session file diffing — the same approach used + by ``run_full_impl``. A snapshot of JSONL files is taken before Phase 1, + then after Phase 2 completes, only NEW files are summed. This captures + all subagent tokens (spawned via Task tool) accurately. Returns a result dict with combined cost from both phases. """ start_time = time.time() result = _make_result(instance_id) result["planner_cmd"] = planner_cmd - result["cost_note"] = ( - "orchestrator tokens tracked; subagent tokens not included " - "(they run as separate claude processes via Task tool)" - ) + result["cost_note"] = "cost estimated from new JSONL session files" + + # Snapshot JSONL file list before running — we'll sum only NEW files after + files_before = _list_jsonl_files() wt = Path(wt_path) tmp_dir = wt / ".tmp" @@ -910,15 +918,6 @@ def run_nlcmd_impl( timeout=planning_timeout, ) - # Track orchestrator-level token usage - plan_usage = _parse_claude_usage(plan_proc.stdout, planning_model) - result["input_tokens"] += plan_usage["input_tokens"] - result["output_tokens"] += plan_usage["output_tokens"] - result["tokens"] += plan_usage["tokens"] - result["cost_usd"] += plan_usage["cost_usd"] - result["planning_tokens"] = plan_usage["tokens"] - result["planning_cost_usd"] = plan_usage["cost_usd"] - if plan_proc.returncode != 0: print(f" NL planning failed (rc={plan_proc.returncode})", file=sys.stderr) if plan_proc.stderr: @@ -960,6 +959,17 @@ def run_nlcmd_impl( else: result["status"] = "timeout" result["wall_time"] = time.time() - start_time + # Capture any JSONL files written before the timeout + files_after = _list_jsonl_files() + new_files = sorted(files_after - files_before) + if new_files: + usage = _sum_jsonl_usage(new_files) + result["input_tokens"] = usage["input_tokens"] + result["output_tokens"] = usage["output_tokens"] + result["cache_read_tokens"] = usage["cache_read"] + result["cache_write_tokens"] = usage["cache_write"] + result["tokens"] = usage["tokens"] + result["cost_usd"] = usage["cost_usd"] return result # --- Phase 2: FSM impl with plan --- @@ -998,6 +1008,18 @@ def _run_impl(): result["status"] = status_bucket[0] if status_bucket else "error" result["wall_time"] = time.time() - start_time + # Compute cost from NEW JSONL files only (created during this run) + files_after = _list_jsonl_files() + new_files = sorted(files_after - files_before) + if new_files: + usage = _sum_jsonl_usage(new_files) + result["input_tokens"] = usage["input_tokens"] + result["output_tokens"] = usage["output_tokens"] + result["cache_read_tokens"] = usage["cache_read"] + result["cache_write_tokens"] = usage["cache_write"] + result["tokens"] = usage["tokens"] + result["cost_usd"] = usage["cost_usd"] + return result diff --git a/python/agentize/workflow/api/acw.py b/python/agentize/workflow/api/acw.py index 850fe95e..2d39bda1 100644 --- a/python/agentize/workflow/api/acw.py +++ b/python/agentize/workflow/api/acw.py @@ -187,6 +187,7 @@ def __init__( tools: str | None = None, permission_mode: str | None = None, extra_flags: list[str] | None = None, + cwd: str | Path | None = None, log_writer: Callable[[str], None] | None = None, log_command: bool = False, runner: Callable[..., subprocess.CompletedProcess] | None = None, @@ -205,6 +206,7 @@ def __init__( self.tools = tools self.permission_mode = permission_mode self.extra_flags = extra_flags + self.cwd = cwd self._log_writer = log_writer self._log_command = log_command self._runner = runner if runner is not None else run_acw @@ -244,6 +246,7 @@ def run( permission_mode=self.permission_mode, extra_flags=self.extra_flags, timeout=self.timeout, + cwd=self.cwd, ) elapsed = int(time.time() - start_time) diff --git a/python/agentize/workflow/api/session.py b/python/agentize/workflow/api/session.py index 9a5015eb..cb9f143f 100644 --- a/python/agentize/workflow/api/session.py +++ b/python/agentize/workflow/api/session.py @@ -127,6 +127,7 @@ def _run_stage( permission_mode: str | None, timeout: int, extra_flags: list[str] | None, + cwd: str | Path | None = None, ) -> subprocess.CompletedProcess: provider, model = backend acw_runner = ACW( @@ -137,6 +138,7 @@ def _run_stage( tools=tools, permission_mode=permission_mode, extra_flags=extra_flags, + cwd=cwd, log_writer=self._log, log_command=self._log_acw_command, runner=self._runner, @@ -161,6 +163,7 @@ def run_prompt( permission_mode: str | None = None, timeout: int = 3600, extra_flags: list[str] | None = None, + cwd: str | Path | None = None, retry: int = 0, retry_delay: float = 0, input_path: str | Path | None = None, @@ -187,6 +190,7 @@ def run_prompt( permission_mode=permission_mode, timeout=timeout, extra_flags=extra_flags, + cwd=cwd, ) self._validate_output(name, output_path_resolved, process) if self._log_output_dump: @@ -219,6 +223,7 @@ def run_prompt( permission_mode=permission_mode, timeout=timeout, extra_flags=None, # drop provider-specific flags + cwd=cwd, ) self._validate_output(name, output_path_resolved, process) if self._log_output_dump: diff --git a/python/agentize/workflow/planner/pipeline.py b/python/agentize/workflow/planner/pipeline.py index 1c280e47..0e4d5be1 100644 --- a/python/agentize/workflow/planner/pipeline.py +++ b/python/agentize/workflow/planner/pipeline.py @@ -147,6 +147,8 @@ def run_planner_pipeline( prefix: str | None = None, output_suffix: str = "-output.md", skip_consensus: bool = False, + cwd: str | Path | None = None, + no_project_config: bool = False, ) -> dict[str, StageResult]: """Execute the 5-stage planner pipeline.""" agentize_home = Path(get_agentize_home()) @@ -178,6 +180,16 @@ def _backend_label(stage: str) -> str: results: dict[str, StageResult] = {} + # Build a helper that merges base extra_flags with --no-project-config for + # claude provider stages (prevents CLAUDE.md contamination in foreign repos). + _no_project_flag = ["--no-project-config"] if no_project_config else [] + + def _extra_flags(stage: str, base: list[str] | None = None) -> list[str] | None: + provider = stage_backends[stage][0] + additions = _no_project_flag if provider == "claude" else [] + combined = (base or []) + additions + return combined if combined else None + understander_prompt = _render_stage_prompt( "understander", feature_desc, agentize_home ) @@ -188,6 +200,8 @@ def _backend_label(stage: str) -> str: stage_backends["understander"], tools=STAGE_TOOLS.get("understander"), permission_mode=STAGE_PERMISSION_MODE.get("understander"), + extra_flags=_extra_flags("understander"), + cwd=cwd, ) understander_output = results["understander"].text() @@ -201,6 +215,8 @@ def _backend_label(stage: str) -> str: stage_backends["bold"], tools=STAGE_TOOLS.get("bold"), permission_mode=STAGE_PERMISSION_MODE.get("bold"), + extra_flags=_extra_flags("bold"), + cwd=cwd, ) bold_output = results["bold"].text() @@ -224,6 +240,8 @@ def _backend_label(stage: str) -> str: stage_backends["critique"], tools=STAGE_TOOLS.get("critique"), permission_mode=STAGE_PERMISSION_MODE.get("critique"), + extra_flags=_extra_flags("critique"), + cwd=cwd, ), session.stage( "reducer", @@ -231,6 +249,8 @@ def _backend_label(stage: str) -> str: stage_backends["reducer"], tools=STAGE_TOOLS.get("reducer"), permission_mode=STAGE_PERMISSION_MODE.get("reducer"), + extra_flags=_extra_flags("reducer"), + cwd=cwd, ), ] ) @@ -267,8 +287,9 @@ def _write_consensus_prompt(path: Path) -> str: stage_backends["consensus"], tools=STAGE_TOOLS.get("consensus"), permission_mode=STAGE_PERMISSION_MODE.get("consensus"), - extra_flags=codex_flags, + extra_flags=_extra_flags("consensus", codex_flags), fallback_backend=("claude", "opus"), + cwd=cwd, ) return results diff --git a/python/tests/test_eval_harness.py b/python/tests/test_eval_harness.py index 793cbef7..f4b9bf4b 100644 --- a/python/tests/test_eval_harness.py +++ b/python/tests/test_eval_harness.py @@ -23,6 +23,8 @@ _compute_cost, _make_result, _find_consensus_plan, + _list_jsonl_files, + _sum_jsonl_usage, _PLANNER_CMD_TEMPLATES, ) @@ -512,4 +514,49 @@ def _slow_run(*args, **kwargs): timeout=2, ) assert result["planner_cmd"] == "mega-planner" - assert "cost_note" in result + assert result["cost_note"] == "cost estimated from new JSONL session files" + + def test_jsonl_cost_tracking_on_timeout(self, tmp_path, monkeypatch): + """JSONL-based cost tracking should capture partial costs on timeout.""" + def _slow_run(*args, **kwargs): + raise subprocess.TimeoutExpired(cmd="claude", timeout=1) + + monkeypatch.setattr(subprocess, "run", _slow_run) + + # Mock JSONL tracking to return known values + call_count = [0] + + def _mock_list_jsonl(): + call_count[0] += 1 + if call_count[0] == 1: + return set() # before + return {"/tmp/fake-session.jsonl"} # after + + mock_usage = { + "input_tokens": 100, "output_tokens": 200, + "cache_read": 10, "cache_write": 20, + "tokens": 300, "cost_usd": 1.50, + } + + monkeypatch.setattr( + "agentize.eval.eval_harness._list_jsonl_files", _mock_list_jsonl + ) + monkeypatch.setattr( + "agentize.eval.eval_harness._sum_jsonl_usage", + lambda paths: mock_usage, + ) + + overrides = write_overrides(tmp_path, "nlcmd-jsonl") + result = run_nlcmd_impl( + wt_path=str(tmp_path), + overrides_path=overrides, + instance_id="nlcmd-jsonl", + problem_statement="test", + timeout=2, + ) + assert result["input_tokens"] == 100 + assert result["output_tokens"] == 200 + assert result["cache_read_tokens"] == 10 + assert result["cache_write_tokens"] == 20 + assert result["tokens"] == 300 + assert result["cost_usd"] == 1.50