Merged
27 changes: 13 additions & 14 deletions python/agentize/eval/eval-report-2026-03-01.md
@@ -37,11 +37,10 @@ Extended the evaluation harness to support a 4th execution mode (**nlcmd**), ena

| Metric | raw | impl | full | nlcmd |
|--------|-----|------|------|-------|
| Cost (USD) | $0.44 | N/A* | N/A* | $4.07 |
| Avg cost/task | $0.09 | — | — | $1.02 |
| Tokens (total) | 29,353 | — | — | 63,232 |
| Cost (USD) | $0.44 | ~$4† | ~$112† | $143.80 |
| Avg cost/task | $0.09 | ~$0.83† | ~$22† | $28.76 |

*\*impl and full use ACW subprocess calls that don't return token data.*
*†impl and full costs estimated from a single-task JSONL measurement extrapolated to 5 tasks. The nlcmd cost ($143.80) was measured directly across all 5 tasks via JSONL-based tracking. The prior nlcmd cost ($4.07) counted only orchestrator tokens; subagent tokens spawned via the Task tool were missing (fixed in PR #981, ~34x undercount).*
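The JSONL-based tracking referenced above can be sketched as summing per-message `usage` records from Claude session files. This is a minimal sketch, not the harness's actual `_sum_jsonl_usage`; the line shape (`message.usage`) and the per-million-token prices are assumptions:

```python
import json
from pathlib import Path

# Illustrative per-million-token prices (assumption, not the harness's table).
PRICE_PER_M = {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75}

def sum_jsonl_usage(paths):
    """Sum per-message usage records across Claude session JSONL files.

    Assumes each line is a JSON object that may carry a `message.usage`
    dict; malformed lines and lines without usage are skipped.
    """
    totals = {"input_tokens": 0, "output_tokens": 0, "cache_read": 0, "cache_write": 0}
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            try:
                usage = json.loads(line).get("message", {}).get("usage")
            except (json.JSONDecodeError, AttributeError):
                continue  # not a usage-bearing record
            if not usage:
                continue
            totals["input_tokens"] += usage.get("input_tokens", 0)
            totals["output_tokens"] += usage.get("output_tokens", 0)
            totals["cache_read"] += usage.get("cache_read_input_tokens", 0)
            totals["cache_write"] += usage.get("cache_creation_input_tokens", 0)
    # Simplification: "tokens" counts only input + output, not cache traffic.
    totals["tokens"] = totals["input_tokens"] + totals["output_tokens"]
    totals["cost_usd"] = (
        totals["input_tokens"] * PRICE_PER_M["input"]
        + totals["output_tokens"] * PRICE_PER_M["output"]
        + totals["cache_read"] * PRICE_PER_M["cache_read"]
        + totals["cache_write"] * PRICE_PER_M["cache_write"]
    ) / 1_000_000
    return totals
```

Because every subagent writes its own session file, summing across files captures Task-tool spawns that orchestrator-level accounting misses.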

### Speed comparison (relative to raw)

@@ -154,24 +153,24 @@ The quality progression is clear: **raw < impl < full < nlcmd**. However, the ga
| Mode | Cost per task | Quality | Cost-effectiveness |
|------|--------------|---------|-------------------|
| raw | $0.09 | 80% correct, no tests | Baseline |
| impl | ~$0.09* | 100% correct, some tests | Best value |
| full | ~$1-3* | 100% correct, good tests | Diminishing returns |
| nlcmd | $1.02 | 100% correct, excellent tests | Premium quality |
| impl | ~$0.83 | 100% correct, some tests | Best value |
| full | ~$22 | 100% correct, good tests | Diminishing returns |
| nlcmd | $28.76 | 100% correct, excellent tests | Premium quality |

*\*Estimated from raw cost since ACW doesn't track tokens.*
*Costs measured via JSONL-based session file tracking (PR #981). The prior nlcmd cost ($1.02/task) counted only orchestrator tokens; subagent tokens were missing.*

### 4. NL command orchestration is 2.6x slower than script orchestration
nlcmd (12 hrs) vs full (4.6 hrs) for the same 5 tasks. The overhead comes from Claude Code's NL command system: each `/ultra-planner` session spawns subagents via the Task tool, which involves additional prompt parsing, permission checks, and session management. The Python pipeline makes direct subprocess calls.
### 4. NL command orchestration is 2.6x slower and 1.3x more expensive than script orchestration
nlcmd (12 hrs, $28.76/task) vs full (4.6 hrs, ~$22/task) for the same 5 tasks. The overhead comes from Claude Code's NL command system: each `/ultra-planner` session spawns subagents via the Task tool, which involves additional prompt parsing, permission checks, and session management. The Python pipeline makes direct subprocess calls. Full mode is strictly better: faster, cheaper, and equally accurate (both 100%).

### 5. NL commands produce richer artifacts
Despite the overhead, nlcmd patches consistently included extras that other modes didn't: changelog entries, comprehensive docstrings explaining design rationale, edge-case tests, and more defensive error handling. This suggests the multi-agent debate via NL commands (which includes external AI synthesis) produces more thorough analysis than the script pipeline.

## Recommendations

1. **Use impl for speed-sensitive workloads** — 100% correctness at raw-mode speed with decent test coverage.
2. **Use full for production patches** — adds planning-quality tests with ~55 min/task overhead.
3. **Use nlcmd for high-stakes or complex tasks** — produces the most thorough patches but at 10x the cost and time.
4. **Invest in cost tracking for ACW modes** — the current gap (impl/full have no USD data) makes cost comparison incomplete.
1. **Use impl for speed-sensitive workloads** — 100% correctness at raw-mode speed with decent test coverage (~$0.83/task).
2. **Use full for production patches** — adds planning-quality tests with ~55 min/task overhead (~$22/task). Strictly dominates nlcmd.
3. **~~Use nlcmd for high-stakes or complex tasks~~** — Superseded. Full mode is faster, cheaper ($22 vs $29/task), and achieves equal or better pass rates across both benchmarks. nlcmd's richer artifacts (changelogs, extra tests) do not justify the 1.3x cost and 2.6x time premium.
4. **~~Invest in cost tracking for ACW modes~~** — Resolved in PR #981 via JSONL-based session file tracking.
5. **Increase nlcmd default timeout to 3600s** — the default 1800s causes timeouts on complex planning debates.

## Appendix: Tasks Evaluated
145 changes: 105 additions & 40 deletions python/agentize/eval/eval-report-2026-03-04-combined.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions python/agentize/eval/eval-report-2026-03-04-nginx.md
@@ -42,10 +42,10 @@ Each task is scored by:
|--------|-----|------|------|-------|
| Total time | 387s (6.4 min) | 899s (15 min) | 8,437s (2.3 hrs) | 10,031s (2.8 hrs) |
| Avg time/task | 97s | 180s | 1,687s (28 min) | 2,508s (42 min) |
| Cost (USD) | $0.71 | ~$4† | ~$112† | $5.07 |
| Avg cost/task | $0.14 | ~$0.83† | ~$22.39† | $1.01 |
| Cost (USD) | $0.71 | ~$4† | ~$112† | ~$157† |
| Avg cost/task | $0.14 | ~$0.83† | ~$22.39† | ~$31.38† |

*†impl and full costs estimated from single-task JSONL measurement (d7a24947) × 5. Full mode cost is dominated by 4 Opus planning calls ($75/M output, $18.75/M cache_write).*
*†impl, full, and nlcmd costs estimated from single-task JSONL measurement (d7a24947) × 5. Full mode cost is dominated by 4 Opus planning calls ($75/M output, $18.75/M cache_write). Nlcmd cost is dominated by the multi-agent debate (understander + bold-proposer + critique + reducer + consensus). The prior nlcmd cost ($1.01/task) counted only orchestrator tokens; subagent tokens spawned via the Task tool were missing (fixed in PR #981).*
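The † extrapolation is plain arithmetic: multiply each single-task measurement by the 5 evaluated tasks. A quick check that the per-task figures reproduce the table's totals:

```python
# Per-task costs from the single-task JSONL measurement (d7a24947).
single_task_cost = {"impl": 0.83, "full": 22.39, "nlcmd": 31.38}
n_tasks = 5

# Extrapolated run totals, as used in the "Cost (USD)" row.
totals = {mode: round(cost * n_tasks, 2) for mode, cost in single_task_cost.items()}
# impl: 4.15, full: 111.95, nlcmd: 156.9 (matching the ~$4 / ~$112 / ~$157 rows)
```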

### Speed Comparison (relative to raw)

10 changes: 6 additions & 4 deletions python/agentize/eval/eval_harness.md
@@ -31,10 +31,12 @@ are stripped so assertions become real pass/fail checks.

The harness supports four execution modes via `--mode`:

| Mode | What runs | What it tests |
|------|-----------|---------------|
| `raw` | `claude -p` + bare bug report | The model alone (baseline) |
| `full` | Planning pipeline + FSM orchestrator | The agentize framework |
| Mode | What runs | What it tests | Cost tracking |
|------|-----------|---------------|---------------|
| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage |
| `impl` | FSM orchestrator only (no planning) | The impl kernel loop | JSONL session files |
| `full` | Planning pipeline + FSM orchestrator | The agentize framework | JSONL session files |
| `nlcmd` | NL planning via `claude -p` + FSM | NL orchestration | JSONL session files |

### Raw mode (default)

56 changes: 39 additions & 17 deletions python/agentize/eval/eval_harness.py
@@ -438,6 +438,7 @@ def score_nginx(
    proc = subprocess.run(
        prove_cmd, cwd=str(tests),
        env=env, capture_output=True, text=True, timeout=300,
        errors="replace",
    )

    # Parse TAP output for individual test results
@@ -596,6 +597,7 @@ def run_planning_phase(
    problem_statement: str,
    output_dir: Path,
    model: str = "sonnet",
    cwd: str | Path | None = None,
) -> str:
    """Run the agentize planner pipeline and return formatted issue content.

@@ -608,6 +610,7 @@
    results = run_planner_pipeline(
        feature_desc=problem_statement,
        output_dir=str(output_dir),
        cwd=cwd,
    )

    consensus = results.get("consensus")
@@ -757,9 +760,13 @@ def _run_full_impl_body(
            f"## Instructions\n\nImplement the fix. Make minimal changes.\n"
        )
    else:
        issue_content = run_planning_phase(problem_statement, tmp_dir, model)
        issue_content = run_planning_phase(problem_statement, tmp_dir, model, cwd=wt)
    issue_file.write_text(issue_content, encoding="utf-8")

    # Ensure subprocesses default to the worktree so Claude's tools
    # (Glob/Read/Grep) operate on the target repo, not the agentize repo.
    os.chdir(wt)

    # Build state and context
    state = create_initial_state(issue_no=1, worktree=wt)
    session = Session(output_dir=tmp_dir, prefix=f"eval-{instance_id}")
@@ -859,19 +866,20 @@ def run_nlcmd_impl(
    Phase 2: Read the consensus plan from ``.tmp/`` and feed it to the FSM
    orchestrator for implementation.

    Token tracking captures the **orchestrator session** tokens. Subagent
    tokens (spawned via Task tool) run as separate processes and are not
    included — this is a known limitation noted in the result dict.
    Cost is tracked via JSONL session file diffing — the same approach used
    by ``run_full_impl``. A snapshot of JSONL files is taken before Phase 1,
    then after Phase 2 completes, only NEW files are summed. This captures
    all subagent tokens (spawned via Task tool) accurately.

    Returns a result dict with combined cost from both phases.
    """
    start_time = time.time()
    result = _make_result(instance_id)
    result["planner_cmd"] = planner_cmd
    result["cost_note"] = (
        "orchestrator tokens tracked; subagent tokens not included "
        "(they run as separate claude processes via Task tool)"
    )
    result["cost_note"] = "cost estimated from new JSONL session files"

    # Snapshot JSONL file list before running — we'll sum only NEW files after
    files_before = _list_jsonl_files()

    wt = Path(wt_path)
    tmp_dir = wt / ".tmp"
@@ -910,15 +918,6 @@
        timeout=planning_timeout,
    )

    # Track orchestrator-level token usage
    plan_usage = _parse_claude_usage(plan_proc.stdout, planning_model)
    result["input_tokens"] += plan_usage["input_tokens"]
    result["output_tokens"] += plan_usage["output_tokens"]
    result["tokens"] += plan_usage["tokens"]
    result["cost_usd"] += plan_usage["cost_usd"]
    result["planning_tokens"] = plan_usage["tokens"]
    result["planning_cost_usd"] = plan_usage["cost_usd"]

    if plan_proc.returncode != 0:
        print(f" NL planning failed (rc={plan_proc.returncode})", file=sys.stderr)
        if plan_proc.stderr:
@@ -960,6 +959,17 @@
    else:
        result["status"] = "timeout"
    result["wall_time"] = time.time() - start_time
    # Capture any JSONL files written before the timeout
    files_after = _list_jsonl_files()
    new_files = sorted(files_after - files_before)
    if new_files:
        usage = _sum_jsonl_usage(new_files)
        result["input_tokens"] = usage["input_tokens"]
        result["output_tokens"] = usage["output_tokens"]
        result["cache_read_tokens"] = usage["cache_read"]
        result["cache_write_tokens"] = usage["cache_write"]
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]
    return result

# --- Phase 2: FSM impl with plan ---
@@ -998,6 +1008,18 @@ def _run_impl():
    result["status"] = status_bucket[0] if status_bucket else "error"
    result["wall_time"] = time.time() - start_time

    # Compute cost from NEW JSONL files only (created during this run)
    files_after = _list_jsonl_files()
    new_files = sorted(files_after - files_before)
    if new_files:
        usage = _sum_jsonl_usage(new_files)
        result["input_tokens"] = usage["input_tokens"]
        result["output_tokens"] = usage["output_tokens"]
        result["cache_read_tokens"] = usage["cache_read"]
        result["cache_write_tokens"] = usage["cache_write"]
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]

    return result

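The cost-attribution idiom used in both paths above (snapshot the JSONL file set, run the phases, then sum only files created in between) can be sketched in isolation. This is a sketch, not the harness's actual helpers; the `~/.claude/projects` location and the function names are assumptions:

```python
from pathlib import Path

# Assumed session-file location; the real helper may look elsewhere.
SESSION_DIR = Path.home() / ".claude" / "projects"

def list_jsonl_files(root: Path = SESSION_DIR) -> set[str]:
    """Snapshot the set of session JSONL files currently on disk."""
    if not root.exists():
        return set()
    return {str(p) for p in root.rglob("*.jsonl")}

def run_with_cost_attribution(work, lister=list_jsonl_files):
    """Run `work()` and return (result, new_files).

    Only files created while `work` ran are attributed to it, so older
    or concurrent sessions are never double-counted.
    """
    before = lister()   # snapshot before Phase 1
    result = work()     # planning + implementation phases
    after = lister()    # snapshot after Phase 2
    return result, sorted(after - before)
```

Set difference makes the attribution order-independent: pre-existing files drop out even if they were modified during the run, which is why the harness sums only new files rather than re-reading everything.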

3 changes: 3 additions & 0 deletions python/agentize/workflow/api/acw.py
@@ -187,6 +187,7 @@ def __init__(
        tools: str | None = None,
        permission_mode: str | None = None,
        extra_flags: list[str] | None = None,
        cwd: str | Path | None = None,
        log_writer: Callable[[str], None] | None = None,
        log_command: bool = False,
        runner: Callable[..., subprocess.CompletedProcess] | None = None,
@@ -205,6 +206,7 @@
        self.tools = tools
        self.permission_mode = permission_mode
        self.extra_flags = extra_flags
        self.cwd = cwd
        self._log_writer = log_writer
        self._log_command = log_command
        self._runner = runner if runner is not None else run_acw
@@ -244,6 +246,7 @@ def run(
            permission_mode=self.permission_mode,
            extra_flags=self.extra_flags,
            timeout=self.timeout,
            cwd=self.cwd,
        )

        elapsed = int(time.time() - start_time)
5 changes: 5 additions & 0 deletions python/agentize/workflow/api/session.py
@@ -127,6 +127,7 @@ def _run_stage(
        permission_mode: str | None,
        timeout: int,
        extra_flags: list[str] | None,
        cwd: str | Path | None = None,
    ) -> subprocess.CompletedProcess:
        provider, model = backend
        acw_runner = ACW(
@@ -137,6 +138,7 @@
            tools=tools,
            permission_mode=permission_mode,
            extra_flags=extra_flags,
            cwd=cwd,
            log_writer=self._log,
            log_command=self._log_acw_command,
            runner=self._runner,
@@ -161,6 +163,7 @@ def run_prompt(
        permission_mode: str | None = None,
        timeout: int = 3600,
        extra_flags: list[str] | None = None,
        cwd: str | Path | None = None,
        retry: int = 0,
        retry_delay: float = 0,
        input_path: str | Path | None = None,
@@ -187,6 +190,7 @@
            permission_mode=permission_mode,
            timeout=timeout,
            extra_flags=extra_flags,
            cwd=cwd,
        )
        self._validate_output(name, output_path_resolved, process)
        if self._log_output_dump:
@@ -219,6 +223,7 @@
                permission_mode=permission_mode,
                timeout=timeout,
                extra_flags=None,  # drop provider-specific flags
                cwd=cwd,
            )
            self._validate_output(name, output_path_resolved, process)
            if self._log_output_dump:
23 changes: 22 additions & 1 deletion python/agentize/workflow/planner/pipeline.py
@@ -147,6 +147,8 @@ def run_planner_pipeline(
    prefix: str | None = None,
    output_suffix: str = "-output.md",
    skip_consensus: bool = False,
    cwd: str | Path | None = None,
    no_project_config: bool = False,
) -> dict[str, StageResult]:
    """Execute the 5-stage planner pipeline."""
    agentize_home = Path(get_agentize_home())
@@ -178,6 +180,16 @@ def _backend_label(stage: str) -> str:

    results: dict[str, StageResult] = {}

    # Build a helper that merges base extra_flags with --no-project-config for
    # claude provider stages (prevents CLAUDE.md contamination in foreign repos).
    _no_project_flag = ["--no-project-config"] if no_project_config else []

    def _extra_flags(stage: str, base: list[str] | None = None) -> list[str] | None:
        provider = stage_backends[stage][0]
        additions = _no_project_flag if provider == "claude" else []
        combined = (base or []) + additions
        return combined if combined else None

    understander_prompt = _render_stage_prompt(
        "understander", feature_desc, agentize_home
    )
@@ -188,6 +200,8 @@
        stage_backends["understander"],
        tools=STAGE_TOOLS.get("understander"),
        permission_mode=STAGE_PERMISSION_MODE.get("understander"),
        extra_flags=_extra_flags("understander"),
        cwd=cwd,
    )
    understander_output = results["understander"].text()

@@ -201,6 +215,8 @@
        stage_backends["bold"],
        tools=STAGE_TOOLS.get("bold"),
        permission_mode=STAGE_PERMISSION_MODE.get("bold"),
        extra_flags=_extra_flags("bold"),
        cwd=cwd,
    )
    bold_output = results["bold"].text()

@@ -224,13 +240,17 @@
            stage_backends["critique"],
            tools=STAGE_TOOLS.get("critique"),
            permission_mode=STAGE_PERMISSION_MODE.get("critique"),
            extra_flags=_extra_flags("critique"),
            cwd=cwd,
        ),
        session.stage(
            "reducer",
            reducer_prompt,
            stage_backends["reducer"],
            tools=STAGE_TOOLS.get("reducer"),
            permission_mode=STAGE_PERMISSION_MODE.get("reducer"),
            extra_flags=_extra_flags("reducer"),
            cwd=cwd,
        ),
    ]
)
@@ -267,8 +287,9 @@ def _write_consensus_prompt(path: Path) -> str:
        stage_backends["consensus"],
        tools=STAGE_TOOLS.get("consensus"),
        permission_mode=STAGE_PERMISSION_MODE.get("consensus"),
        extra_flags=codex_flags,
        extra_flags=_extra_flags("consensus", codex_flags),
        fallback_backend=("claude", "opus"),
        cwd=cwd,
    )

    return results
49 changes: 48 additions & 1 deletion python/tests/test_eval_harness.py
@@ -23,6 +23,8 @@
    _compute_cost,
    _make_result,
    _find_consensus_plan,
    _list_jsonl_files,
    _sum_jsonl_usage,
    _PLANNER_CMD_TEMPLATES,
)

@@ -512,4 +514,49 @@ def _slow_run(*args, **kwargs):
            timeout=2,
        )
        assert result["planner_cmd"] == "mega-planner"
        assert "cost_note" in result
        assert result["cost_note"] == "cost estimated from new JSONL session files"

    def test_jsonl_cost_tracking_on_timeout(self, tmp_path, monkeypatch):
        """JSONL-based cost tracking should capture partial costs on timeout."""
        def _slow_run(*args, **kwargs):
            raise subprocess.TimeoutExpired(cmd="claude", timeout=1)

        monkeypatch.setattr(subprocess, "run", _slow_run)

        # Mock JSONL tracking to return known values
        call_count = [0]

        def _mock_list_jsonl():
            call_count[0] += 1
            if call_count[0] == 1:
                return set()  # before
            return {"/tmp/fake-session.jsonl"}  # after

        mock_usage = {
            "input_tokens": 100, "output_tokens": 200,
            "cache_read": 10, "cache_write": 20,
            "tokens": 300, "cost_usd": 1.50,
        }

        monkeypatch.setattr(
            "agentize.eval.eval_harness._list_jsonl_files", _mock_list_jsonl
        )
        monkeypatch.setattr(
            "agentize.eval.eval_harness._sum_jsonl_usage",
            lambda paths: mock_usage,
        )

        overrides = write_overrides(tmp_path, "nlcmd-jsonl")
        result = run_nlcmd_impl(
            wt_path=str(tmp_path),
            overrides_path=overrides,
            instance_id="nlcmd-jsonl",
            problem_statement="test",
            timeout=2,
        )
        assert result["input_tokens"] == 100
        assert result["output_tokens"] == 200
        assert result["cache_read_tokens"] == 10
        assert result["cache_write_tokens"] == 20
        assert result["tokens"] == 300
        assert result["cost_usd"] == 1.50