[#980][bugfix] Fix undercounted nlcmd cost estimation via JSONL tracking#981
Merged
ayazhankadessova merged 9 commits intomainfrom Mar 12, 2026
Merged
[#980][bugfix] Fix undercounted nlcmd cost estimation via JSONL tracking#981ayazhankadessova merged 9 commits intomainfrom
ayazhankadessova merged 9 commits intomainfrom
Conversation
- Replace _parse_claude_usage() with _list_jsonl_files()/_sum_jsonl_usage() in run_nlcmd_impl() to capture all subagent tokens (understander, bold-proposer, critique, reducer, consensus) - Previously only tracked top-level orchestrator tokens ($0.91/task), actual cost is ~$20-30+/task including subagents - Add JSONL diff to both normal and timeout paths - Update eval_harness.md mode table with all 4 modes and cost tracking - Add test_jsonl_cost_tracking_on_timeout test - Update cost_note assertion in test_result_has_planner_cmd Closes #980
…$0.91) - eval-report-2026-03-04-combined.md: nlcmd cost $0.91 → ~$31/task, updated executive summary, findings 3/4/6, and limitations section. nlcmd is now dominated by full on all axes (quality, speed, cost). - eval-report-2026-03-04-nginx.md: nlcmd cost $1.01 → ~$31.38/task, updated footnote explaining measurement bug fixed in PR #981. Prior nlcmd cost only counted orchestrator tokens — subagent tokens spawned via Task tool were missing.
- eval-report-2026-03-04-combined.md: SWE-bench nlcmd cost updated from ~$157 (extrapolated from nginx) to $143.80 (measured across 5 tasks). Avg/task: ~$30 ($28.76 SWE-bench, $31.38 nginx). Updated findings 3/4/6 and limitations footnote.
…/task, was 1.02) - eval-report-2026-03-01.md: Replace N/A impl/full costs with JSONL measurements (~0.83 and ~22/task). Update nlcmd from 1.02 to 28.76/task (was 34x undercounted -- only orchestrator tokens counted, subagent tokens missing). Update recommendations: full mode strictly dominates nlcmd. Mark cost tracking gap as resolved (PR #981). - eval-report-2026-03-04-combined.md: Minor wording fix in executive summary.
…x findings - Timing: full SWE-bench 16,505s → 5,493s (measured with Codex consensus) - Timing: nlcmd SWE-bench 43,056s → 8,911s (measured re-run) - Cost: full SWE-bench ~$112 (extrapolated) → $103.61 (measured) - Added Finding 7: Codex consensus 3x slower than Opus fallback (18 min vs 6 min/task) with same Anthropic cost (~$20/task) - Added limitation: Codex (OpenAI) costs not captured in JSONL - Updated recommendations with corrected timing ratios
…ost tables - Timing: full (codex) 1,099s/task vs full (opus) 369s/task (3x faster) - Cost: full (codex) $20.72/task vs full (opus) $19.77/task (same ballpark) - Renamed Cost table to "Cost (Anthropic API only)" to clarify Codex costs are not captured - Added note that nginx codex/opus breakdown not yet available
- Cost-per-second table across all modes validates measurement accuracy - Sonnet-only group: 5.7x $/s gap (raw vs impl) — explained by FSM overhead - Opus+Sonnet group: 1.2x $/s gap (full vs nlcmd) — passes smell test - Absolute cost check: 4 Opus + 2 Sonnet ≈ $18-24 → measured $20.72 ✓ - Documents before/after nlcmd fix: 52x discrepancy → 1.2x - Renumbered Codex finding to Finding 8
- nginx full data placed in opus row (original run used Opus fallback) - full (opus) combined: 10,280s (2.9 hrs), ~$211, ~$21/task - nginx full (codex) marked TBD — re-run needed
…ss bugs Report updates: - Fill in nginx full (codex) data: 4,157s, $40.71, 4/5 resolved - Replace nginx full (opus) extrapolations with measured data: 2,122s (was 8,437s), $63.75 (was ~$112) - Split pass rates, Finding 2, Finding 5, and per-task appendix into codex/opus columns - Update Finding 3, 6, 7, 8 with new measured values - Add Limitation 7 documenting run_planning_phase harness bug Eval harness fixes: - Add errors='replace' to score_nginx() subprocess.run() to handle binary output from prove tests (fixes UnicodeDecodeError on proxy_h2_cache.t) - Add os.chdir(wt) before FSM orchestrator to ensure Claude tools operate on target repo - Thread cwd parameter through run_planning_phase → run_planner_pipeline → Session → ACW - Remove hardcoded no_project_config=True (--no-project-config not a valid claude CLI flag) Pipeline cwd support: - Add cwd parameter to run_planner_pipeline(), propagated to all 5 stages - Add cwd parameter to Session._run_stage() and Session.run_prompt() - Add cwd parameter to ACW.__init__() and ACW.run() - Add no_project_config parameter and _extra_flags() helper to pipeline (unused for now)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_parse_claude_usage()with_list_jsonl_files()/_sum_jsonl_usage()inrun_nlcmd_impl()to capture all subagent tokensrun_full_impl()Changes
python/agentize/eval/eval_harness.py: Add JSONL snapshot before Phase 1, diff after Phase 2 (and on timeout)python/agentize/eval/eval_harness.md: Update mode table with all 4 modes and cost tracking methodpython/tests/test_eval_harness.py: Addtest_jsonl_cost_tracking_on_timeout, updatecost_noteassertionTest plan
test_jsonl_cost_tracking_on_timeoutvalidates JSONL pathCloses #980