Merged
27 changes: 13 additions & 14 deletions python/agentize/eval/eval-report-2026-03-01.md
@@ -37,11 +37,10 @@ Extended the evaluation harness to support a 4th execution mode (**nlcmd**), ena

| Metric | raw | impl | full | nlcmd |
|--------|-----|------|------|-------|
| Cost (USD) | $0.44 | N/A* | N/A* | $4.07 |
| Avg cost/task | $0.09 | — | — | $1.02 |
| Tokens (total) | 29,353 | — | — | 63,232 |
| Cost (USD) | $0.44 | ~$4† | ~$112† | $143.80 |
| Avg cost/task | $0.09 | ~$0.83† | ~$22† | $28.76 |

*\*impl and full use ACW subprocess calls that don't return token data.*
*†impl and full costs estimated from a single-task JSONL measurement extrapolated to 5 tasks. The nlcmd cost ($143.80) was measured directly across all 5 tasks via JSONL-based tracking. The prior nlcmd cost ($4.07) counted only orchestrator tokens; subagent tokens spawned via the Task tool were missing (fixed in PR #981, ~34x undercount).*
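The JSONL-based tracking referenced above can be sketched as summing per-message `usage` records from Claude session files. This is a minimal sketch, not the harness's actual `_sum_jsonl_usage`; the line shape (`message.usage`) and the per-million-token prices are assumptions:

```python
import json
from pathlib import Path

# Illustrative per-million-token prices (assumption, not the harness's table).
PRICE_PER_M = {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75}

def sum_jsonl_usage(paths):
    """Sum per-message usage records across Claude session JSONL files.

    Assumes each line is a JSON object that may carry a `message.usage`
    dict; malformed lines and lines without usage are skipped.
    """
    totals = {"input_tokens": 0, "output_tokens": 0, "cache_read": 0, "cache_write": 0}
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            try:
                usage = json.loads(line).get("message", {}).get("usage")
            except (json.JSONDecodeError, AttributeError):
                continue  # not a usage-bearing record
            if not usage:
                continue
            totals["input_tokens"] += usage.get("input_tokens", 0)
            totals["output_tokens"] += usage.get("output_tokens", 0)
            totals["cache_read"] += usage.get("cache_read_input_tokens", 0)
            totals["cache_write"] += usage.get("cache_creation_input_tokens", 0)
    # Simplification: "tokens" counts only input + output, not cache traffic.
    totals["tokens"] = totals["input_tokens"] + totals["output_tokens"]
    totals["cost_usd"] = (
        totals["input_tokens"] * PRICE_PER_M["input"]
        + totals["output_tokens"] * PRICE_PER_M["output"]
        + totals["cache_read"] * PRICE_PER_M["cache_read"]
        + totals["cache_write"] * PRICE_PER_M["cache_write"]
    ) / 1_000_000
    return totals
```

Because every subagent writes its own session file, summing across files captures Task-tool spawns that orchestrator-level accounting misses.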

### Speed comparison (relative to raw)

@@ -154,24 +153,24 @@ The quality progression is clear: **raw < impl < full < nlcmd**. However, the ga
| Mode | Cost per task | Quality | Cost-effectiveness |
|------|--------------|---------|-------------------|
| raw | $0.09 | 80% correct, no tests | Baseline |
| impl | ~$0.09* | 100% correct, some tests | Best value |
| full | ~$1-3* | 100% correct, good tests | Diminishing returns |
| nlcmd | $1.02 | 100% correct, excellent tests | Premium quality |
| impl | ~$0.83 | 100% correct, some tests | Best value |
| full | ~$22 | 100% correct, good tests | Diminishing returns |
| nlcmd | $28.76 | 100% correct, excellent tests | Premium quality |

*\*Estimated from raw cost since ACW doesn't track tokens.*
*Costs measured via JSONL-based session file tracking (PR #981). The prior nlcmd cost ($1.02/task) counted only orchestrator tokens; subagent tokens were missing.*

### 4. NL command orchestration is 2.6x slower than script orchestration
nlcmd (12 hrs) vs full (4.6 hrs) for the same 5 tasks. The overhead comes from Claude Code's NL command system: each `/ultra-planner` session spawns subagents via the Task tool, which involves additional prompt parsing, permission checks, and session management. The Python pipeline makes direct subprocess calls.
### 4. NL command orchestration is 2.6x slower and 1.3x more expensive than script orchestration
nlcmd (12 hrs, $28.76/task) vs full (4.6 hrs, ~$22/task) for the same 5 tasks. The overhead comes from Claude Code's NL command system: each `/ultra-planner` session spawns subagents via the Task tool, which involves additional prompt parsing, permission checks, and session management. The Python pipeline makes direct subprocess calls. Full mode is strictly better: faster, cheaper, and equally accurate (both 100%).

### 5. NL commands produce richer artifacts
Despite the overhead, nlcmd patches consistently included extras that other modes didn't: changelog entries, comprehensive docstrings explaining design rationale, edge-case tests, and more defensive error handling. This suggests the multi-agent debate via NL commands (which includes external AI synthesis) produces more thorough analysis than the script pipeline.

## Recommendations

1. **Use impl for speed-sensitive workloads** — 100% correctness at raw-mode speed with decent test coverage.
2. **Use full for production patches** — adds planning-quality tests with ~55 min/task overhead.
3. **Use nlcmd for high-stakes or complex tasks** — produces the most thorough patches but at 10x the cost and time.
4. **Invest in cost tracking for ACW modes** — the current gap (impl/full have no USD data) makes cost comparison incomplete.
1. **Use impl for speed-sensitive workloads** — 100% correctness at raw-mode speed with decent test coverage (~$0.83/task).
2. **Use full for production patches** — adds planning-quality tests with ~55 min/task overhead (~$22/task). Strictly dominates nlcmd.
3. **~~Use nlcmd for high-stakes or complex tasks~~** — Superseded. Full mode is faster, cheaper ($22 vs $29/task), and achieves equal or better pass rates across both benchmarks. nlcmd's richer artifacts (changelogs, extra tests) do not justify the 1.3x cost and 2.6x time premium.
4. **~~Invest in cost tracking for ACW modes~~** — Resolved in PR #981 via JSONL-based session file tracking.
5. **Increase nlcmd default timeout to 3600s** — the default 1800s causes timeouts on complex planning debates.

## Appendix: Tasks Evaluated
145 changes: 105 additions & 40 deletions python/agentize/eval/eval-report-2026-03-04-combined.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions python/agentize/eval/eval-report-2026-03-04-nginx.md
@@ -42,10 +42,10 @@ Each task is scored by:
|--------|-----|------|------|-------|
| Total time | 387s (6.4 min) | 899s (15 min) | 8,437s (2.3 hrs) | 10,031s (2.8 hrs) |
| Avg time/task | 97s | 180s | 1,687s (28 min) | 2,508s (42 min) |
| Cost (USD) | $0.71 | ~$4† | ~$112† | $5.07 |
| Avg cost/task | $0.14 | ~$0.83† | ~$22.39† | $1.01 |
| Cost (USD) | $0.71 | ~$4† | ~$112† | ~$157† |
| Avg cost/task | $0.14 | ~$0.83† | ~$22.39† | ~$31.38† |

*†impl and full costs estimated from single-task JSONL measurement (d7a24947) × 5. Full mode cost is dominated by 4 Opus planning calls ($75/M output, $18.75/M cache_write).*
*†impl, full, and nlcmd costs estimated from single-task JSONL measurement (d7a24947) × 5. Full mode cost is dominated by 4 Opus planning calls ($75/M output, $18.75/M cache_write). Nlcmd cost is dominated by the multi-agent debate (understander + bold-proposer + critique + reducer + consensus). The prior nlcmd cost ($1.01/task) counted only orchestrator tokens; subagent tokens spawned via the Task tool were missing (fixed in PR #981).*
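The † extrapolation is plain arithmetic: multiply each single-task measurement by the 5 evaluated tasks. A quick check that the per-task figures reproduce the table's totals:

```python
# Per-task costs from the single-task JSONL measurement (d7a24947).
single_task_cost = {"impl": 0.83, "full": 22.39, "nlcmd": 31.38}
n_tasks = 5

# Extrapolated run totals, as used in the "Cost (USD)" row.
totals = {mode: round(cost * n_tasks, 2) for mode, cost in single_task_cost.items()}
# impl: 4.15, full: 111.95, nlcmd: 156.9 (matching the ~$4 / ~$112 / ~$157 rows)
```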

### Speed Comparison (relative to raw)

10 changes: 6 additions & 4 deletions python/agentize/eval/eval_harness.md
@@ -31,10 +31,12 @@ are stripped so assertions become real pass/fail checks.

The harness supports four execution modes via `--mode`:

| Mode | What runs | What it tests |
|------|-----------|---------------|
| `raw` | `claude -p` + bare bug report | The model alone (baseline) |
| `full` | Planning pipeline + FSM orchestrator | The agentize framework |
| Mode | What runs | What it tests | Cost tracking |
|------|-----------|---------------|---------------|
| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage |
| `impl` | FSM orchestrator only (no planning) | The impl kernel loop | JSONL session files |
| `full` | Planning pipeline + FSM orchestrator | The agentize framework | JSONL session files |
| `nlcmd` | NL planning via `claude -p` + FSM | NL orchestration | JSONL session files |

### Raw mode (default)

56 changes: 39 additions & 17 deletions python/agentize/eval/eval_harness.py
@@ -438,6 +438,7 @@ def score_nginx(
    proc = subprocess.run(
        prove_cmd, cwd=str(tests),
        env=env, capture_output=True, text=True, timeout=300,
        errors="replace",
    )

    # Parse TAP output for individual test results
@@ -596,6 +597,7 @@ def run_planning_phase(
    problem_statement: str,
    output_dir: Path,
    model: str = "sonnet",
    cwd: str | Path | None = None,
) -> str:
    """Run the agentize planner pipeline and return formatted issue content.

@@ -608,6 +610,7 @@
    results = run_planner_pipeline(
        feature_desc=problem_statement,
        output_dir=str(output_dir),
        cwd=cwd,
    )

    consensus = results.get("consensus")
@@ -757,9 +760,13 @@ def _run_full_impl_body(
            f"## Instructions\n\nImplement the fix. Make minimal changes.\n"
        )
    else:
        issue_content = run_planning_phase(problem_statement, tmp_dir, model)
        issue_content = run_planning_phase(problem_statement, tmp_dir, model, cwd=wt)
    issue_file.write_text(issue_content, encoding="utf-8")

    # Ensure subprocesses default to the worktree so Claude's tools
    # (Glob/Read/Grep) operate on the target repo, not the agentize repo.
    os.chdir(wt)

    # Build state and context
    state = create_initial_state(issue_no=1, worktree=wt)
    session = Session(output_dir=tmp_dir, prefix=f"eval-{instance_id}")
@@ -859,19 +866,20 @@ def run_nlcmd_impl(
    Phase 2: Read the consensus plan from ``.tmp/`` and feed it to the FSM
    orchestrator for implementation.

    Token tracking captures the **orchestrator session** tokens. Subagent
    tokens (spawned via Task tool) run as separate processes and are not
    included — this is a known limitation noted in the result dict.
    Cost is tracked via JSONL session file diffing — the same approach used
    by ``run_full_impl``. A snapshot of JSONL files is taken before Phase 1,
    then after Phase 2 completes, only NEW files are summed. This captures
    all subagent tokens (spawned via Task tool) accurately.

    Returns a result dict with combined cost from both phases.
    """
    start_time = time.time()
    result = _make_result(instance_id)
    result["planner_cmd"] = planner_cmd
    result["cost_note"] = (
        "orchestrator tokens tracked; subagent tokens not included "
        "(they run as separate claude processes via Task tool)"
    )
    result["cost_note"] = "cost estimated from new JSONL session files"

    # Snapshot JSONL file list before running — we'll sum only NEW files after
    files_before = _list_jsonl_files()

    wt = Path(wt_path)
    tmp_dir = wt / ".tmp"
@@ -910,15 +918,6 @@
        timeout=planning_timeout,
    )

    # Track orchestrator-level token usage
    plan_usage = _parse_claude_usage(plan_proc.stdout, planning_model)
    result["input_tokens"] += plan_usage["input_tokens"]
    result["output_tokens"] += plan_usage["output_tokens"]
    result["tokens"] += plan_usage["tokens"]
    result["cost_usd"] += plan_usage["cost_usd"]
    result["planning_tokens"] = plan_usage["tokens"]
    result["planning_cost_usd"] = plan_usage["cost_usd"]

    if plan_proc.returncode != 0:
        print(f" NL planning failed (rc={plan_proc.returncode})", file=sys.stderr)
        if plan_proc.stderr:
@@ -960,6 +959,17 @@
    else:
        result["status"] = "timeout"
    result["wall_time"] = time.time() - start_time
    # Capture any JSONL files written before the timeout
    files_after = _list_jsonl_files()
    new_files = sorted(files_after - files_before)
    if new_files:
        usage = _sum_jsonl_usage(new_files)
        result["input_tokens"] = usage["input_tokens"]
        result["output_tokens"] = usage["output_tokens"]
        result["cache_read_tokens"] = usage["cache_read"]
        result["cache_write_tokens"] = usage["cache_write"]
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]
    return result

# --- Phase 2: FSM impl with plan ---
@@ -998,6 +1008,18 @@ def _run_impl():
    result["status"] = status_bucket[0] if status_bucket else "error"
    result["wall_time"] = time.time() - start_time

    # Compute cost from NEW JSONL files only (created during this run)
    files_after = _list_jsonl_files()
    new_files = sorted(files_after - files_before)
    if new_files:
        usage = _sum_jsonl_usage(new_files)
        result["input_tokens"] = usage["input_tokens"]
        result["output_tokens"] = usage["output_tokens"]
        result["cache_read_tokens"] = usage["cache_read"]
        result["cache_write_tokens"] = usage["cache_write"]
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]

    return result

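The cost-attribution idiom used in both paths above (snapshot the JSONL file set, run the phases, then sum only files created in between) can be sketched in isolation. This is a sketch, not the harness's actual helpers; the `~/.claude/projects` location and the function names are assumptions:

```python
from pathlib import Path

# Assumed session-file location; the real helper may look elsewhere.
SESSION_DIR = Path.home() / ".claude" / "projects"

def list_jsonl_files(root: Path = SESSION_DIR) -> set[str]:
    """Snapshot the set of session JSONL files currently on disk."""
    if not root.exists():
        return set()
    return {str(p) for p in root.rglob("*.jsonl")}

def run_with_cost_attribution(work, lister=list_jsonl_files):
    """Run `work()` and return (result, new_files).

    Only files created while `work` ran are attributed to it, so older
    or concurrent sessions are never double-counted.
    """
    before = lister()   # snapshot before Phase 1
    result = work()     # planning + implementation phases
    after = lister()    # snapshot after Phase 2
    return result, sorted(after - before)
```

Set difference makes the attribution order-independent: pre-existing files drop out even if they were modified during the run, which is why the harness sums only new files rather than re-reading everything.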

3 changes: 3 additions & 0 deletions python/agentize/workflow/api/acw.py
@@ -187,6 +187,7 @@ def __init__(
        tools: str | None = None,
        permission_mode: str | None = None,
        extra_flags: list[str] | None = None,
        cwd: str | Path | None = None,
        log_writer: Callable[[str], None] | None = None,
        log_command: bool = False,
        runner: Callable[..., subprocess.CompletedProcess] | None = None,
@@ -205,6 +206,7 @@
        self.tools = tools
        self.permission_mode = permission_mode
        self.extra_flags = extra_flags
        self.cwd = cwd
        self._log_writer = log_writer
        self._log_command = log_command
        self._runner = runner if runner is not None else run_acw
@@ -244,6 +246,7 @@ def run(
            permission_mode=self.permission_mode,
            extra_flags=self.extra_flags,
            timeout=self.timeout,
            cwd=self.cwd,
        )

        elapsed = int(time.time() - start_time)
5 changes: 5 additions & 0 deletions python/agentize/workflow/api/session.py
@@ -127,6 +127,7 @@ def _run_stage(
        permission_mode: str | None,
        timeout: int,
        extra_flags: list[str] | None,
        cwd: str | Path | None = None,
    ) -> subprocess.CompletedProcess:
        provider, model = backend
        acw_runner = ACW(
@@ -137,6 +138,7 @@
            tools=tools,
            permission_mode=permission_mode,
            extra_flags=extra_flags,
            cwd=cwd,
            log_writer=self._log,
            log_command=self._log_acw_command,
            runner=self._runner,
@@ -161,6 +163,7 @@ def run_prompt(
        permission_mode: str | None = None,
        timeout: int = 3600,
        extra_flags: list[str] | None = None,
        cwd: str | Path | None = None,
        retry: int = 0,
        retry_delay: float = 0,
        input_path: str | Path | None = None,
@@ -187,6 +190,7 @@
            permission_mode=permission_mode,
            timeout=timeout,
            extra_flags=extra_flags,
            cwd=cwd,
        )
        self._validate_output(name, output_path_resolved, process)
        if self._log_output_dump:
@@ -219,6 +223,7 @@
                permission_mode=permission_mode,
                timeout=timeout,
                extra_flags=None,  # drop provider-specific flags
                cwd=cwd,
            )
            self._validate_output(name, output_path_resolved, process)
            if self._log_output_dump:
23 changes: 22 additions & 1 deletion python/agentize/workflow/planner/pipeline.py
@@ -147,6 +147,8 @@ def run_planner_pipeline(
    prefix: str | None = None,
    output_suffix: str = "-output.md",
    skip_consensus: bool = False,
    cwd: str | Path | None = None,
    no_project_config: bool = False,
) -> dict[str, StageResult]:
    """Execute the 5-stage planner pipeline."""
    agentize_home = Path(get_agentize_home())
@@ -178,6 +180,16 @@ def _backend_label(stage: str) -> str:

    results: dict[str, StageResult] = {}

    # Build a helper that merges base extra_flags with --no-project-config for
    # claude provider stages (prevents CLAUDE.md contamination in foreign repos).
    _no_project_flag = ["--no-project-config"] if no_project_config else []

    def _extra_flags(stage: str, base: list[str] | None = None) -> list[str] | None:
        provider = stage_backends[stage][0]
        additions = _no_project_flag if provider == "claude" else []
        combined = (base or []) + additions
        return combined if combined else None

    understander_prompt = _render_stage_prompt(
        "understander", feature_desc, agentize_home
    )
@@ -188,6 +200,8 @@
        stage_backends["understander"],
        tools=STAGE_TOOLS.get("understander"),
        permission_mode=STAGE_PERMISSION_MODE.get("understander"),
        extra_flags=_extra_flags("understander"),
        cwd=cwd,
    )
    understander_output = results["understander"].text()

@@ -201,6 +215,8 @@
        stage_backends["bold"],
        tools=STAGE_TOOLS.get("bold"),
        permission_mode=STAGE_PERMISSION_MODE.get("bold"),
        extra_flags=_extra_flags("bold"),
        cwd=cwd,
    )
    bold_output = results["bold"].text()

@@ -224,13 +240,17 @@
            stage_backends["critique"],
            tools=STAGE_TOOLS.get("critique"),
            permission_mode=STAGE_PERMISSION_MODE.get("critique"),
            extra_flags=_extra_flags("critique"),
            cwd=cwd,
        ),
        session.stage(
            "reducer",
            reducer_prompt,
            stage_backends["reducer"],
            tools=STAGE_TOOLS.get("reducer"),
            permission_mode=STAGE_PERMISSION_MODE.get("reducer"),
            extra_flags=_extra_flags("reducer"),
            cwd=cwd,
        ),
    ]
)
@@ -267,8 +287,9 @@ def _write_consensus_prompt(path: Path) -> str:
        stage_backends["consensus"],
        tools=STAGE_TOOLS.get("consensus"),
        permission_mode=STAGE_PERMISSION_MODE.get("consensus"),
        extra_flags=codex_flags,
        extra_flags=_extra_flags("consensus", codex_flags),
        fallback_backend=("claude", "opus"),
        cwd=cwd,
    )

    return results
49 changes: 48 additions & 1 deletion python/tests/test_eval_harness.py
@@ -23,6 +23,8 @@
    _compute_cost,
    _make_result,
    _find_consensus_plan,
    _list_jsonl_files,
    _sum_jsonl_usage,
    _PLANNER_CMD_TEMPLATES,
)

@@ -512,4 +514,49 @@ def _slow_run(*args, **kwargs):
            timeout=2,
        )
        assert result["planner_cmd"] == "mega-planner"
        assert "cost_note" in result
        assert result["cost_note"] == "cost estimated from new JSONL session files"

    def test_jsonl_cost_tracking_on_timeout(self, tmp_path, monkeypatch):
        """JSONL-based cost tracking should capture partial costs on timeout."""
        def _slow_run(*args, **kwargs):
            raise subprocess.TimeoutExpired(cmd="claude", timeout=1)

        monkeypatch.setattr(subprocess, "run", _slow_run)

        # Mock JSONL tracking to return known values
        call_count = [0]

        def _mock_list_jsonl():
            call_count[0] += 1
            if call_count[0] == 1:
                return set()  # before
            return {"/tmp/fake-session.jsonl"}  # after

        mock_usage = {
            "input_tokens": 100, "output_tokens": 200,
            "cache_read": 10, "cache_write": 20,
            "tokens": 300, "cost_usd": 1.50,
        }

        monkeypatch.setattr(
            "agentize.eval.eval_harness._list_jsonl_files", _mock_list_jsonl
        )
        monkeypatch.setattr(
            "agentize.eval.eval_harness._sum_jsonl_usage",
            lambda paths: mock_usage,
        )

        overrides = write_overrides(tmp_path, "nlcmd-jsonl")
        result = run_nlcmd_impl(
            wt_path=str(tmp_path),
            overrides_path=overrides,
            instance_id="nlcmd-jsonl",
            problem_statement="test",
            timeout=2,
        )
        assert result["input_tokens"] == 100
        assert result["output_tokens"] == 200
        assert result["cache_read_tokens"] == 10
        assert result["cache_write_tokens"] == 20
        assert result["tokens"] == 300
        assert result["cost_usd"] == 1.50