[plan][bugfix]: Fix undercounted cost estimation for nlcmd mode #980

@ayazhankadessova

Description

The eval harness function run_nlcmd_impl() tracks only orchestrator-level tokens, taken from the top-level claude -p call via _parse_claude_usage(). It misses every token consumed by subagents spawned through the Task tool (understander, bold-proposer, critique, reducer, consensus). As a result, nlcmd reports ~$0.91/task while the actual cost is likely $20-30 or more per task.

The sanity check that reveals the bug:

  • full mode: $22/task in 2,494s → $0.009/second
  • nlcmd mode: $0.91/task in 5,309s → $0.0002/second
  • Both use Opus for planning and Sonnet for impl → cost-per-second should be similar
  • A roughly 50x gap in cost-per-second is implausible when both modes use the same models
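The arithmetic behind this sanity check can be reproduced in a few lines. This is a minimal sketch that uses only the four figures quoted above; the variable names are illustrative and are not taken from the harness.

```python
# Figures quoted in the sanity check above.
full_cost_usd, full_seconds = 22.0, 2494
nlcmd_cost_usd, nlcmd_seconds = 0.91, 5309

# Cost per wall-clock second for each mode.
full_rate = full_cost_usd / full_seconds
nlcmd_rate = nlcmd_cost_usd / nlcmd_seconds

# How many times cheaper per second nlcmd appears to be.
ratio = full_rate / nlcmd_rate

print(f"full:  ${full_rate:.4f}/second")
print(f"nlcmd: ${nlcmd_rate:.4f}/second")
print(f"ratio: {ratio:.1f}x")
```

Wall-clock time is only a rough proxy for token spend, so the ratio is a smell test rather than a cost estimate.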

Affected module: python/agentize/eval/eval_harness.py

Proposed Solution

File Changes

| File | Level | Purpose |
|------|-------|---------|
| python/agentize/eval/eval_harness.md | minor | Update Execution Modes table and nlcmd cost-tracking description |
| python/tests/test_eval_harness.py | minor | Update cost_note assertion; add test verifying JSONL tracking path |
| python/agentize/eval/eval_harness.py | major | Snapshot JSONL files before Phase 1, sum new files after Phase 2, remove _parse_claude_usage call and planning_tokens/planning_cost_usd fields |

Implementation Steps

Step 1: Update documentation (Estimated: ~15 LOC)

  • python/agentize/eval/eval_harness.md — Add an nlcmd row to the Mode table. Add an nlcmd subsection explaining that JSONL-based cost tracking covers both Phase 1 (NL planning) and Phase 2 (FSM impl). Remove the stale "subagent tokens not included" text.

Step 2: Update tests (Estimated: ~15 LOC)

  • python/tests/test_eval_harness.py, TestNlcmdImpl.test_result_has_planner_cmd — Update cost_note value expectation.
  • Same file, TestNlcmdImpl — Add test_jsonl_cost_tracking_applied that monkeypatches _list_jsonl_files and _sum_jsonl_usage to verify the JSONL path is exercised.
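A rough sketch of what test_jsonl_cost_tracking_applied could look like follows. To stay self-contained it does not import the real harness: `harness` is a hypothetical stand-in namespace, and the `run_nlcmd_impl` here is a stripped-down imitation that contains only the snapshot/diff logic. It also uses `unittest.mock` from the standard library in place of pytest's `monkeypatch` fixture; the real test would patch the two helpers on the actual module.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for python/agentize/eval/eval_harness.py.
harness = SimpleNamespace(
    _list_jsonl_files=lambda: set(),
    _sum_jsonl_usage=lambda files: {},
)


def run_nlcmd_impl():
    """Simplified imitation: snapshot, (phases would run here), diff, sum."""
    result = {"cost_note": "cost estimated from new JSONL session files"}
    files_before = harness._list_jsonl_files()
    # Phase 1 (planning) and Phase 2 (impl) are omitted in this sketch.
    files_after = harness._list_jsonl_files()
    new_files = sorted(files_after - files_before)
    if new_files:
        usage = harness._sum_jsonl_usage(new_files)
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]
    return result


def test_jsonl_cost_tracking_applied():
    # First call returns the "before" snapshot, second call the "after" one.
    snapshots = [{"old.jsonl"}, {"old.jsonl", "new.jsonl"}]
    fake_usage = {"tokens": 1234, "cost_usd": 5.0}
    with mock.patch.object(harness, "_list_jsonl_files", side_effect=snapshots), \
         mock.patch.object(harness, "_sum_jsonl_usage", return_value=fake_usage) as summed:
        result = run_nlcmd_impl()

    # Only the file created during the run should be summed.
    summed.assert_called_once_with(["new.jsonl"])
    assert result["tokens"] == 1234
    assert result["cost_usd"] == 5.0


test_jsonl_cost_tracking_applied()
```

Asserting on the arguments passed to `_sum_jsonl_usage` checks that pre-existing session files are excluded from the total, and not merely that the helper was called.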

Step 3: Implement JSONL tracking in run_nlcmd_impl() (Estimated: ~15 LOC)

  • python/agentize/eval/eval_harness.py, run_nlcmd_impl():
    1. Add files_before = _list_jsonl_files() before Phase 1 starts (~line 870)
    2. Update cost_note to "cost estimated from new JSONL session files"
    3. Remove lines 913-920 (_parse_claude_usage call and planning_tokens/planning_cost_usd)
    4. After Phase 2 completion (~line 999), add JSONL diff block:
      files_after = _list_jsonl_files()
      new_files = sorted(files_after - files_before)
      if new_files:
          usage = _sum_jsonl_usage(new_files)
          result["input_tokens"] = usage["input_tokens"]
          result["output_tokens"] = usage["output_tokens"]
          result["cache_read_tokens"] = usage["cache_read"]
          result["cache_write_tokens"] = usage["cache_write"]
          result["tokens"] = usage["tokens"]
          result["cost_usd"] = usage["cost_usd"]
    5. Also add JSONL diff to timeout early-return path (~line 961)
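The snapshot/diff pattern in the steps above, including the timeout early-return path, can be sketched as a standalone script. Everything here is an assumption made for illustration: `list_jsonl_files`, `sum_jsonl_usage`, and `run_with_usage_tracking` are hypothetical helpers (the real `_list_jsonl_files()` takes no arguments), and the `message.usage` field names are an assumed session-file layout. `cost_usd` is left out because the issue does not state per-token prices.

```python
import json
import tempfile
from pathlib import Path

# Assumed counter names inside each record's message.usage object.
USAGE_KEYS = {
    "input_tokens": "input_tokens",
    "output_tokens": "output_tokens",
    "cache_read": "cache_read_input_tokens",
    "cache_write": "cache_creation_input_tokens",
}


def list_jsonl_files(root: Path) -> set[Path]:
    """Snapshot every *.jsonl file currently under root."""
    return set(root.rglob("*.jsonl"))


def sum_jsonl_usage(files: list[Path]) -> dict[str, int]:
    """Sum token counters across files, skipping lines without usage data."""
    totals = {name: 0 for name in USAGE_KEYS}
    for path in files:
        for line in path.read_text().splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # blank or partially written line
            if not isinstance(record, dict):
                continue
            message = record.get("message")
            usage = message.get("usage") if isinstance(message, dict) else None
            if not isinstance(usage, dict):
                continue
            for name, key in USAGE_KEYS.items():
                totals[name] += int(usage.get(key) or 0)
    totals["tokens"] = sum(totals.values())
    return totals


def run_with_usage_tracking(root: Path, phases) -> dict:
    """Run phases in order; a phase returning False simulates a timeout."""
    result = {"status": "ok"}
    files_before = list_jsonl_files(root)
    try:
        for phase in phases:
            if not phase():
                result["status"] = "timeout"
                return result  # the finally block still fills in usage
    finally:
        new_files = sorted(list_jsonl_files(root) - files_before)
        if new_files:
            result.update(sum_jsonl_usage(new_files))
    return result


def write_session(path: Path, input_tokens: int, output_tokens: int) -> None:
    """Write a one-line fake session file in the assumed layout."""
    usage = {"input_tokens": input_tokens, "output_tokens": output_tokens}
    path.write_text(json.dumps({"message": {"usage": usage}}) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_session(root / "old.jsonl", 999, 999)  # pre-existing, must be ignored

    def planning() -> bool:
        write_session(root / "orchestrator.jsonl", 100, 20)
        write_session(root / "subagent.jsonl", 400, 80)
        return True

    def impl() -> bool:
        return False  # simulated timeout

    demo = run_with_usage_tracking(root, [planning, impl])

print(demo)
```

Placing the diff in a `finally` block is one way to satisfy item 5 without duplicating the block at each return site; the plan as written adds the diff explicitly at both locations, which works equally well.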

Test Strategy

  • python -m pytest python/tests/test_eval_harness.py -v
  • Updated test_result_has_planner_cmd must pass (cost_note field present)
  • New test_jsonl_cost_tracking_applied must pass (JSONL diff path active)

Related PR

TBD - will be updated when PR is created

Metadata

Labels: agentize:plan (Plan created by /ultra-planner command)
