
[#980][bugfix] Fix undercounted nlcmd cost estimation via JSONL tracking #981

Merged
ayazhankadessova merged 9 commits into main from issue-980 on Mar 12, 2026

Conversation

@ayazhankadessova
Contributor

Summary

  • Replace _parse_claude_usage() with _list_jsonl_files()/_sum_jsonl_usage() in run_nlcmd_impl() to capture all subagent tokens
  • Previously only top-level orchestrator tokens were tracked ($0.91/task); actual cost is ~$20-30+/task
  • Same JSONL-based approach already used by run_full_impl()

Changes

  • python/agentize/eval/eval_harness.py: Add JSONL snapshot before Phase 1, diff after Phase 2 (and on timeout)
  • python/agentize/eval/eval_harness.md: Update mode table with all 4 modes and cost tracking method
  • python/tests/test_eval_harness.py: Add test_jsonl_cost_tracking_on_timeout, update cost_note assertion

Test plan

  • All 47 existing tests pass
  • New test_jsonl_cost_tracking_on_timeout validates JSONL path
  • Re-run nlcmd eval with fix to verify accurate cost reporting

Closes #980

- Replace _parse_claude_usage() with _list_jsonl_files()/_sum_jsonl_usage()
  in run_nlcmd_impl() to capture all subagent tokens (understander,
  bold-proposer, critique, reducer, consensus)
- Previously only top-level orchestrator tokens were tracked ($0.91/task);
  actual cost is ~$20-30+/task including subagents
- Add JSONL diff to both normal and timeout paths
- Update eval_harness.md mode table with all 4 modes and cost tracking
- Add test_jsonl_cost_tracking_on_timeout test
- Update cost_note assertion in test_result_has_planner_cmd

Closes #980
@ayazhankadessova added the agentize:pr (PR created by agentize) label Mar 5, 2026
…$0.91)

- eval-report-2026-03-04-combined.md: nlcmd cost $0.91 → ~$31/task,
  updated executive summary, findings 3/4/6, and limitations section.
  nlcmd is now dominated by full on all axes (quality, speed, cost).
- eval-report-2026-03-04-nginx.md: nlcmd cost $1.01 → ~$31.38/task,
  updated footnote explaining measurement bug fixed in PR #981.

Prior nlcmd cost only counted orchestrator tokens — subagent tokens
spawned via Task tool were missing.
- eval-report-2026-03-04-combined.md: SWE-bench nlcmd cost updated from
  ~$157 (extrapolated from nginx) to $143.80 (measured across 5 tasks).
  Avg/task: ~$30 ($28.76 SWE-bench, $31.38 nginx).
  Updated findings 3/4/6 and limitations footnote.
…/task, was 1.02)

- eval-report-2026-03-01.md: Replace N/A impl/full costs with JSONL measurements
  (~0.83 and ~22/task). Update nlcmd from 1.02 to 28.76/task (was 34x
  undercounted -- only orchestrator tokens counted, subagent tokens missing).
  Update recommendations: full mode strictly dominates nlcmd. Mark cost tracking
  gap as resolved (PR #981).
- eval-report-2026-03-04-combined.md: Minor wording fix in executive summary.
…x findings

- Timing: full SWE-bench 16,505s → 5,493s (measured with Codex consensus)
- Timing: nlcmd SWE-bench 43,056s → 8,911s (measured re-run)
- Cost: full SWE-bench ~$112 (extrapolated) → $103.61 (measured)
- Added Finding 7: Codex consensus 3x slower than Opus fallback
  (18 min vs 6 min/task) with same Anthropic cost (~$20/task)
- Added limitation: Codex (OpenAI) costs not captured in JSONL
- Updated recommendations with corrected timing ratios
…ost tables

- Timing: full (codex) 1,099s/task vs full (opus) 369s/task (3x faster)
- Cost: full (codex) $20.72/task vs full (opus) $19.77/task (same ballpark)
- Renamed Cost table to "Cost (Anthropic API only)" to clarify Codex
  costs are not captured
- Added note that nginx codex/opus breakdown not yet available
- Cost-per-second table across all modes validates measurement accuracy
- Sonnet-only group: 5.7x $/s gap (raw vs impl) — explained by FSM overhead
- Opus+Sonnet group: 1.2x $/s gap (full vs nlcmd) — passes smell test
- Absolute cost check: 4 Opus + 2 Sonnet ≈ $18-24 → measured $20.72 ✓
- Documents before/after nlcmd fix: 52x discrepancy → 1.2x
- Renumbered Codex finding to Finding 8
- nginx full data placed in opus row (original run used Opus fallback)
- full (opus) combined: 10,280s (2.9 hrs), ~$211, ~$21/task
- nginx full (codex) marked TBD — re-run needed
…ss bugs

Report updates:
- Fill in nginx full (codex) data: 4,157s, $40.71, 4/5 resolved
- Replace nginx full (opus) extrapolations with measured data: 2,122s (was 8,437s), $63.75 (was ~$112)
- Split pass rates, Finding 2, Finding 5, and per-task appendix into codex/opus columns
- Update Finding 3, 6, 7, 8 with new measured values
- Add Limitation 7 documenting run_planning_phase harness bug

Eval harness fixes:
- Add errors='replace' to score_nginx() subprocess.run() to handle binary output from prove tests (fixes UnicodeDecodeError on proxy_h2_cache.t)
- Add os.chdir(wt) before FSM orchestrator to ensure Claude tools operate on target repo
- Thread cwd parameter through run_planning_phase → run_planner_pipeline → Session → ACW
- Remove hardcoded no_project_config=True (--no-project-config not a valid claude CLI flag)
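The `errors='replace'` fix above can be sketched with a stand-in child process. The real call wraps `prove`, which can emit raw bytes that are not valid UTF-8; with `text=True` alone, decoding raises `UnicodeDecodeError`. Here a small Python subprocess stands in for `prove`, so the command shown is illustrative only.

```python
import subprocess
import sys

# The child writes an invalid UTF-8 byte (0xff) followed by "ok", mimicking
# binary output from a prove test like proxy_h2_cache.t.
result = subprocess.run(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write(b'\\xffok')"],
    capture_output=True,
    text=True,            # decode stdout/stderr to str
    errors="replace",     # undecodable bytes become U+FFFD instead of raising
)
print(repr(result.stdout))  # -> "'\ufffdok'"
```

Without `errors="replace"`, the same call raises `UnicodeDecodeError` while decoding the child's stdout, which is the failure mode the fix addresses.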

Pipeline cwd support:
- Add cwd parameter to run_planner_pipeline(), propagated to all 5 stages
- Add cwd parameter to Session._run_stage() and Session.run_prompt()
- Add cwd parameter to ACW.__init__() and ACW.run()
- Add no_project_config parameter and _extra_flags() helper to pipeline (unused for now)
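The `cwd` threading above can be illustrated with a minimal sketch. Class and function names follow the PR description (`ACW`, `run_planner_pipeline`), but the signatures and the subprocess-based bodies are assumptions; the point is that `cwd` flows from the pipeline entry point down to the process spawn instead of relying on a process-wide `os.chdir()`.

```python
import subprocess

class ACW:
    """Thin wrapper around a CLI invocation (sketch)."""

    def __init__(self, cmd, cwd=None):
        self.cmd = cmd
        self.cwd = cwd  # None falls back to the caller's working directory

    def run(self, prompt):
        # cwd= scopes the working directory to this child process only,
        # so concurrent runs against different repos cannot interfere.
        return subprocess.run(
            self.cmd + [prompt],
            cwd=self.cwd,
            capture_output=True,
            text=True,
            errors="replace",
        )

def run_planner_pipeline(stages, cwd=None):
    """Propagate cwd to every stage rather than mutating global state."""
    return [ACW(stage, cwd=cwd).run("go") for stage in stages]
```

Passing `cwd` per call is what makes the earlier `os.chdir(wt)` fix a stopgap: once every layer accepts the parameter, the chdir can be dropped.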
@ayazhankadessova ayazhankadessova merged commit ecabf50 into main Mar 12, 2026
1 check failed
Linked issue: [plan][bugfix]: Fix undercounted cost estimation for nlcmd mode