Closed
Labels
agentize:plan (Plan created by /ultra-planner command)
Description
The eval harness function `run_nlcmd_impl()` tracks only orchestrator-level tokens, which it reads from the top-level `claude -p` call via `_parse_claude_usage()`. It misses all tokens used by subagents spawned via the Task tool (understander, bold-proposer, critique, reducer, consensus). As a result, nlcmd reports ~$0.91/task while the actual cost is likely ~$20-30+/task.
The sanity check that reveals the bug:
- full mode: $22/task in 2,494s → $0.009/second
- nlcmd mode: $0.91/task in 5,309s → $0.0002/second
- Both use Opus for planning and Sonnet for impl → cost-per-second should be similar
- 52x difference in cost-per-second is impossible with the same models
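The sanity check above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the rounded per-task figures quoted in this issue; the names `runs` and `cost_per_second` are illustrative and not part of the harness. With these rounded inputs the ratio comes out near 51x, which is consistent with the reported 52x if the original figure was computed from unrounded costs.

```python
# Figures quoted in the issue: (cost in USD per task, wall-clock seconds per task).
runs = {
    "full": (22.00, 2494),
    "nlcmd": (0.91, 5309),
}


def cost_per_second(cost_usd: float, seconds: float) -> float:
    """Average spend rate over one task run."""
    return cost_usd / seconds


rates = {mode: cost_per_second(*figures) for mode, figures in runs.items()}

# If both modes use the same models, this ratio should be close to 1.
ratio = rates["full"] / rates["nlcmd"]

for mode, rate in rates.items():
    print(f"{mode}: ${rate:.4f}/second")
print(f"full/nlcmd spend-rate ratio: {ratio:.1f}x")
```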
Affected module: python/agentize/eval/eval_harness.py
Proposed Solution
File Changes
| File | Level | Purpose |
|---|---|---|
| `python/agentize/eval/eval_harness.md` | minor | Update Execution Modes table and nlcmd cost-tracking description |
| `python/tests/test_eval_harness.py` | minor | Update `cost_note` assertion; add test verifying JSONL tracking path |
| `python/agentize/eval/eval_harness.py` | major | Snapshot JSONL files before Phase 1, sum new files after Phase 2, remove `_parse_claude_usage` call and `planning_tokens`/`planning_cost_usd` fields |
Implementation Steps
Step 1: Update documentation (Estimated: ~15 LOC)
`python/agentize/eval/eval_harness.md`: Add an `nlcmd` row to the Mode table. Add an nlcmd subsection explaining that JSONL-based cost tracking covers both Phase 1 (NL planning) and Phase 2 (FSM impl). Remove the stale "subagent tokens not included" text.
Step 2: Update tests (Estimated: ~15 LOC)
- `python/tests/test_eval_harness.py`, `TestNlcmdImpl.test_result_has_planner_cmd`: Update the `cost_note` value expectation.
- Same file, `TestNlcmdImpl`: Add `test_jsonl_cost_tracking_applied` that monkeypatches `_list_jsonl_files` and `_sum_jsonl_usage` to verify the JSONL path is exercised.
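The new test could take roughly the following shape. This is a self-contained sketch, not the actual test: `harness` and `run_nlcmd_impl_stub` are hypothetical stand-ins for the real module and for `run_nlcmd_impl()`, and `unittest.mock` is used in place of pytest's `monkeypatch` fixture so the example runs without pytest.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the harness module; the real helpers live in
# python/agentize/eval/eval_harness.py and may have different signatures.
harness = SimpleNamespace(
    _list_jsonl_files=lambda: set(),
    _sum_jsonl_usage=lambda files: {},
)


def run_nlcmd_impl_stub() -> dict:
    """Stub mirroring only the snapshot/diff flow described in Step 3."""
    files_before = harness._list_jsonl_files()
    # ... Phase 1 (planning) and Phase 2 (implementation) would run here ...
    files_after = harness._list_jsonl_files()
    new_files = sorted(files_after - files_before)
    result = {"cost_note": "cost estimated from new JSONL session files"}
    if new_files:
        usage = harness._sum_jsonl_usage(new_files)
        result["tokens"] = usage["tokens"]
        result["cost_usd"] = usage["cost_usd"]
    return result


def test_jsonl_cost_tracking_applied():
    # First call returns the "before" snapshot, second call the "after" one.
    snapshots = [set(), {"session-a.jsonl"}]
    fake_usage = {"tokens": 1234, "cost_usd": 5.67}
    with mock.patch.object(
        harness, "_list_jsonl_files", side_effect=snapshots
    ) as list_mock, mock.patch.object(
        harness, "_sum_jsonl_usage", return_value=fake_usage
    ) as sum_mock:
        result = run_nlcmd_impl_stub()

    # The JSONL path was exercised: two snapshots, one sum over the new file.
    assert list_mock.call_count == 2
    sum_mock.assert_called_once_with(["session-a.jsonl"])
    assert result["tokens"] == 1234
    assert result["cost_usd"] == 5.67
    assert result["cost_note"] == "cost estimated from new JSONL session files"


test_jsonl_cost_tracking_applied()
```

Patching both helpers keeps the test independent of any real session files on disk, so it checks only that the diff path is wired up.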
Step 3: Implement JSONL tracking in run_nlcmd_impl() (Estimated: ~15 LOC)
`python/agentize/eval/eval_harness.py`, `run_nlcmd_impl()`:
- Add `files_before = _list_jsonl_files()` before Phase 1 starts (~line 870)
- Update `cost_note` to `"cost estimated from new JSONL session files"`
- Remove lines 913-920 (`_parse_claude_usage` call and `planning_tokens`/`planning_cost_usd`)
- After Phase 2 completion (~line 999), add JSONL diff block:

      files_after = _list_jsonl_files()
      new_files = sorted(files_after - files_before)
      if new_files:
          usage = _sum_jsonl_usage(new_files)
          result["input_tokens"] = usage["input_tokens"]
          result["output_tokens"] = usage["output_tokens"]
          result["cache_read_tokens"] = usage["cache_read"]
          result["cache_write_tokens"] = usage["cache_write"]
          result["tokens"] = usage["tokens"]
          result["cost_usd"] = usage["cost_usd"]

- Also add JSONL diff to timeout early-return path (~line 961)
- Add
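To illustrate why the snapshot-and-diff approach in Step 3 captures subagent usage, here is a minimal runnable sketch. Everything in it is an assumption made for illustration: `list_jsonl_files` and `sum_jsonl_usage` are simplified stand-ins for the harness helpers, the record layout (a `usage` object per JSONL line) is hypothetical, and the per-token prices are placeholders rather than real model pricing.

```python
import json
import tempfile
from pathlib import Path

# Placeholder per-token prices in USD (hypothetical; real pricing is model-specific).
PRICE_PER_TOKEN = {"input_tokens": 3e-6, "output_tokens": 15e-6}


def list_jsonl_files(root: Path) -> set[Path]:
    """Snapshot every .jsonl file currently under root, including subdirectories."""
    return set(root.rglob("*.jsonl"))


def sum_jsonl_usage(files) -> dict:
    """Sum token counts over all records in the given files (assumed layout)."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for path in files:
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            usage = json.loads(line).get("usage", {})
            for key in ("input_tokens", "output_tokens"):
                totals[key] += usage.get(key, 0)
    totals["tokens"] = totals["input_tokens"] + totals["output_tokens"]
    totals["cost_usd"] = sum(
        totals[key] * price for key, price in PRICE_PER_TOKEN.items()
    )
    return totals


def write_session(path: Path, input_tokens: int, output_tokens: int) -> None:
    record = {"usage": {"input_tokens": input_tokens, "output_tokens": output_tokens}}
    path.write_text(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A session file that predates the run must not be counted.
    write_session(root / "old.jsonl", 999, 999)

    files_before = list_jsonl_files(root)

    # Simulate Phase 1 and Phase 2: the orchestrator and a subagent each
    # write their own session file.
    write_session(root / "orchestrator.jsonl", 100, 20)
    (root / "subagents").mkdir()
    write_session(root / "subagents" / "critique.jsonl", 400, 80)

    files_after = list_jsonl_files(root)
    new_files = sorted(files_after - files_before)
    usage = sum_jsonl_usage(new_files)

print(len(new_files), usage)
```

Because the diff is taken over files rather than over a single process's reported usage, any session file written during the run is counted regardless of which agent produced it. One consequence of this design is that unrelated sessions writing to the same directory during the run would also be counted.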
Test Strategy
`python -m pytest python/tests/test_eval_harness.py -v`
- Updated `test_result_has_planner_cmd` must pass (`cost_note` field present)
- New `test_jsonl_cost_tracking_applied` must pass (JSONL diff path active)
Related PR
TBD - will be updated when PR is created