
[#980][bugfix] Fix undercounted nlcmd cost estimation via JSONL tracking #981

Merged
ayazhankadessova merged 9 commits into main from issue-980 on Mar 12, 2026

Conversation

@ayazhankadessova
Contributor

Summary

  • Replace _parse_claude_usage() with _list_jsonl_files()/_sum_jsonl_usage() in run_nlcmd_impl() to capture all subagent tokens
  • Previously only top-level orchestrator tokens were tracked ($0.91/task); actual cost is ~$20-30+/task
  • Same JSONL-based approach already used by run_full_impl()

Changes

  • python/agentize/eval/eval_harness.py: Add JSONL snapshot before Phase 1, diff after Phase 2 (and on timeout)
  • python/agentize/eval/eval_harness.md: Update mode table with all 4 modes and cost tracking method
  • python/tests/test_eval_harness.py: Add test_jsonl_cost_tracking_on_timeout, update cost_note assertion

Test plan

  • All 47 existing tests pass
  • New test_jsonl_cost_tracking_on_timeout validates JSONL path
  • Re-run nlcmd eval with fix to verify accurate cost reporting

Closes #980

- Replace _parse_claude_usage() with _list_jsonl_files()/_sum_jsonl_usage()
  in run_nlcmd_impl() to capture all subagent tokens (understander,
  bold-proposer, critique, reducer, consensus)
- Previously only top-level orchestrator tokens were tracked ($0.91/task);
  actual cost is ~$20-30+/task including subagents
- Add JSONL diff to both normal and timeout paths
- Update eval_harness.md mode table with all 4 modes and cost tracking
- Add test_jsonl_cost_tracking_on_timeout test
- Update cost_note assertion in test_result_has_planner_cmd

Closes #980
@ayazhankadessova added the agentize:pr (PR created by agentize) label Mar 5, 2026
…$0.91)

- eval-report-2026-03-04-combined.md: nlcmd cost $0.91 → ~$31/task,
  updated executive summary, findings 3/4/6, and limitations section.
  nlcmd is now dominated by full on all axes (quality, speed, cost).
- eval-report-2026-03-04-nginx.md: nlcmd cost $1.01 → ~$31.38/task,
  updated footnote explaining measurement bug fixed in PR #981.

Prior nlcmd cost only counted orchestrator tokens — subagent tokens
spawned via Task tool were missing.
- eval-report-2026-03-04-combined.md: SWE-bench nlcmd cost updated from
  ~$157 (extrapolated from nginx) to $143.80 (measured across 5 tasks).
  Avg/task: ~$30 ($28.76 SWE-bench, $31.38 nginx).
  Updated findings 3/4/6 and limitations footnote.
…/task, was 1.02)

- eval-report-2026-03-01.md: Replace N/A impl/full costs with JSONL measurements
  (~0.83 and ~22/task). Update nlcmd from 1.02 to 28.76/task (was 34x
  undercounted -- only orchestrator tokens counted, subagent tokens missing).
  Update recommendations: full mode strictly dominates nlcmd. Mark cost tracking
  gap as resolved (PR #981).
- eval-report-2026-03-04-combined.md: Minor wording fix in executive summary.
…x findings

- Timing: full SWE-bench 16,505s → 5,493s (measured with Codex consensus)
- Timing: nlcmd SWE-bench 43,056s → 8,911s (measured re-run)
- Cost: full SWE-bench ~$112 (extrapolated) → $103.61 (measured)
- Added Finding 7: Codex consensus 3x slower than Opus fallback
  (18 min vs 6 min/task) with same Anthropic cost (~$20/task)
- Added limitation: Codex (OpenAI) costs not captured in JSONL
- Updated recommendations with corrected timing ratios
…ost tables

- Timing: full (codex) 1,099s/task vs full (opus) 369s/task (3x faster)
- Cost: full (codex) $20.72/task vs full (opus) $19.77/task (same ballpark)
- Renamed Cost table to "Cost (Anthropic API only)" to clarify Codex
  costs are not captured
- Added note that nginx codex/opus breakdown not yet available
- Cost-per-second table across all modes validates measurement accuracy
- Sonnet-only group: 5.7x $/s gap (raw vs impl) — explained by FSM overhead
- Opus+Sonnet group: 1.2x $/s gap (full vs nlcmd) — passes smell test
- Absolute cost check: 4 Opus + 2 Sonnet ≈ $18-24 → measured $20.72 ✓
- Documents before/after nlcmd fix: 52x discrepancy → 1.2x
- Renumbered Codex finding to Finding 8
- nginx full data placed in opus row (original run used Opus fallback)
- full (opus) combined: 10,280s (2.9 hrs), ~$211, ~$21/task
- nginx full (codex) marked TBD — re-run needed
…ss bugs

Report updates:
- Fill in nginx full (codex) data: 4,157s, $40.71, 4/5 resolved
- Replace nginx full (opus) extrapolations with measured data: 2,122s (was 8,437s), $63.75 (was ~$112)
- Split pass rates, Finding 2, Finding 5, and per-task appendix into codex/opus columns
- Update Finding 3, 6, 7, 8 with new measured values
- Add Limitation 7 documenting run_planning_phase harness bug

Eval harness fixes:
- Add errors='replace' to score_nginx() subprocess.run() to handle binary output from prove tests (fixes UnicodeDecodeError on proxy_h2_cache.t)
- Add os.chdir(wt) before FSM orchestrator to ensure Claude tools operate on target repo
- Thread cwd parameter through run_planning_phase → run_planner_pipeline → Session → ACW
- Remove hardcoded no_project_config=True (--no-project-config not a valid claude CLI flag)
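The `errors='replace'` fix above can be sketched with a stand-in child process. The real call wraps `prove`, which can emit raw bytes that are not valid UTF-8; with `text=True` alone, decoding raises `UnicodeDecodeError`. Here a small Python subprocess stands in for `prove`, so the command shown is illustrative only.

```python
import subprocess
import sys

# The child writes an invalid UTF-8 byte (0xff) followed by "ok", mimicking
# binary output from a prove test like proxy_h2_cache.t.
result = subprocess.run(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write(b'\\xffok')"],
    capture_output=True,
    text=True,            # decode stdout/stderr to str
    errors="replace",     # undecodable bytes become U+FFFD instead of raising
)
print(repr(result.stdout))  # -> "'\ufffdok'"
```

Without `errors="replace"`, the same call raises `UnicodeDecodeError` while decoding the child's stdout, which is the failure mode the fix addresses.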

Pipeline cwd support:
- Add cwd parameter to run_planner_pipeline(), propagated to all 5 stages
- Add cwd parameter to Session._run_stage() and Session.run_prompt()
- Add cwd parameter to ACW.__init__() and ACW.run()
- Add no_project_config parameter and _extra_flags() helper to pipeline (unused for now)
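The `cwd` threading above can be illustrated with a minimal sketch. Class and function names follow the PR description (`ACW`, `run_planner_pipeline`), but the signatures and the subprocess-based bodies are assumptions; the point is that `cwd` flows from the pipeline entry point down to the process spawn instead of relying on a process-wide `os.chdir()`.

```python
import subprocess

class ACW:
    """Thin wrapper around a CLI invocation (sketch)."""

    def __init__(self, cmd, cwd=None):
        self.cmd = cmd
        self.cwd = cwd  # None falls back to the caller's working directory

    def run(self, prompt):
        # cwd= scopes the working directory to this child process only,
        # so concurrent runs against different repos cannot interfere.
        return subprocess.run(
            self.cmd + [prompt],
            cwd=self.cwd,
            capture_output=True,
            text=True,
            errors="replace",
        )

def run_planner_pipeline(stages, cwd=None):
    """Propagate cwd to every stage rather than mutating global state."""
    return [ACW(stage, cwd=cwd).run("go") for stage in stages]
```

Passing `cwd` per call is what makes the earlier `os.chdir(wt)` fix a stopgap: once every layer accepts the parameter, the chdir can be dropped.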
@ayazhankadessova ayazhankadessova merged commit ecabf50 into main Mar 12, 2026
1 check failed
Linked issue: [plan][bugfix]: Fix undercounted cost estimation for nlcmd mode