[plan][bugfix]: Fix undercounted cost estimation for nlcmd mode

## Description

The eval harness `run_nlcmd_impl()` only tracks orchestrator-level tokens from the top-level `claude -p` call via `_parse_claude_usage()`, but misses all subagent tokens spawned via the Task tool (understander, bold-proposer, critique, reducer, consensus). This causes nlcmd to report ~$0.91/task while the actual cost is likely ~$20-30+/task.

**The sanity check that reveals the bug:**
- full mode: $22/task in 2,494s → $0.009/second
- nlcmd mode: $0.91/task in 5,309s → $0.0002/second
- Both use Opus for planning and Sonnet for impl → cost-per-second should be similar
- 52x difference in cost-per-second is impossible with the same models

**Affected module:** `python/agentize/eval/eval_harness.py`

## Proposed Solution

### File Changes

| File | Level | Purpose |
|------|-------|---------|
| `python/agentize/eval/eval_harness.md` | minor | Update Execution Modes table and nlcmd cost-tracking description |
| `python/tests/test_eval_harness.py` | minor | Update `cost_note` assertion; add test verifying JSONL tracking path |
| `python/agentize/eval/eval_harness.py` | major | Snapshot JSONL files before Phase 1, sum new files after Phase 2, remove `_parse_claude_usage` call and `planning_tokens`/`planning_cost_usd` fields |

### Implementation Steps

**Step 1: Update documentation** (Estimated: ~15 LOC)
- `python/agentize/eval/eval_harness.md` — Add `nlcmd` row to the Mode table. Add nlcmd subsection explaining JSONL-based cost tracking covers both Phase 1 (NL planning) and Phase 2 (FSM impl). Remove stale "subagent tokens not included" text.

**Step 2: Update tests** (Estimated: ~15 LOC)
- `python/tests/test_eval_harness.py`, `TestNlcmdImpl.test_result_has_planner_cmd` — Update `cost_note` value expectation.
- Same file, `TestNlcmdImpl` — Add `test_jsonl_cost_tracking_applied` that monkeypatches `_list_jsonl_files` and `_sum_jsonl_usage` to verify the JSONL path is exercised.

**Step 3: Implement JSONL tracking in `run_nlcmd_impl()`** (Estimated: ~15 LOC)
- `python/agentize/eval/eval_harness.py`, `run_nlcmd_impl()`:
  1. Add `files_before = _list_jsonl_files()` before Phase 1 starts (~line 870)
  2. Update `cost_note` to `"cost estimated from new JSONL session files"`
  3. Remove lines 913-920 (`_parse_claude_usage` call and `planning_tokens`/`planning_cost_usd`)
  4. After Phase 2 completion (~line 999), add JSONL diff block:
     ```python
     files_after = _list_jsonl_files()
     new_files = sorted(files_after - files_before)
     if new_files:
         usage = _sum_jsonl_usage(new_files)
         result["input_tokens"] = usage["input_tokens"]
         result["output_tokens"] = usage["output_tokens"]
         result["cache_read_tokens"] = usage["cache_read"]
         result["cache_write_tokens"] = usage["cache_write"]
         result["tokens"] = usage["tokens"]
         result["cost_usd"] = usage["cost_usd"]
     ```
  5. Also add JSONL diff to timeout early-return path (~line 961)

### Test Strategy

- `python -m pytest python/tests/test_eval_harness.py -v`
- Updated `test_result_has_planner_cmd` must pass (cost_note field present)
- New `test_jsonl_cost_tracking_applied` must pass (JSONL diff path active)

## Related PR

TBD - will be updated when PR is created


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[plan][bugfix]: Fix undercounted cost estimation for nlcmd mode #980

Description

Proposed Solution

File Changes

Implementation Steps

Test Strategy

Related PR

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

File	Level	Purpose
`python/agentize/eval/eval_harness.md`	minor	Update Execution Modes table and nlcmd cost-tracking description
`python/tests/test_eval_harness.py`	minor	Update `cost_note` assertion; add test verifying JSONL tracking path
`python/agentize/eval/eval_harness.py`	major	Snapshot JSONL files before Phase 1, sum new files after Phase 2, remove `_parse_claude_usage` call and `planning_tokens`/`planning_cost_usd` fields

[plan][bugfix]: Fix undercounted cost estimation for nlcmd mode #980

Description

Description

Proposed Solution

File Changes

Implementation Steps

Test Strategy

Related PR

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions