
feat(parser): raise default token budget from 2048 to 4096 (#845) #852

Merged

gfargo merged 1 commit into main from feat/raise-token-budget-default-845 on May 6, 2026

Conversation

gfargo (Owner) commented on May 6, 2026

First optimization PR in the #845 sprint: two single-line fallback raises that bring the parser's defaults in line with the budget the rest of the system already assumes.

Why

The diff-condensing pipeline had two 2048 fallbacks for maxTokens (both sketched after the list):

  • src/lib/parsers/default/utils/summarizeDiffs.ts:250 (default param)
  • src/lib/parsers/default/index.ts:55 (maxTokens || 2048 in fileChangeParser)
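For orientation, here is a condensed sketch of the two sites as this PR leaves them. The surrounding signatures and option shapes are abbreviations, not the exact code; only the literals changed.

```ts
// src/lib/parsers/default/utils/summarizeDiffs.ts — the default-param fallback.
// Signature condensed; other parameters elided.
export async function summarizeDiffs(
  diffs: string[],
  maxTokens: number = 4096 // was 2048
): Promise<string> {
  // ...condense diffs until they fit inside the token budget...
  return diffs.join('\n');
}

// src/lib/parsers/default/index.ts — the `||` fallback in fileChangeParser.
// The options shape here is an assumption.
export async function fileChangeParser(opts: { maxTokens?: number }) {
  const maxTokens = opts.maxTokens || 4096; // was `maxTokens || 2048`
  // ...parse file changes within `maxTokens`...
  return summarizeDiffs([], maxTokens);
}
```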

Both came from when 4k context was a stretch for fast models. Every shipped service config in src/lib/langchain/utils.ts already sets tokenLimit: 4096 (openai, anthropic, ollama defaults), so the fallback only fires when:

  1. A user's custom service definition omits tokenLimit
  2. A caller skips the service.tokenLimit → maxTokens plumbing entirely

Both cases land in a needlessly tight budget that triggers extra pre-summarization on diffs the model could swallow whole. Raising the fallback to 4096 just synchronizes the parser's "no value" assumption with the rest of the system.
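To make case 1 concrete, here is a hypothetical custom service definition that would have landed on the old 2048 fallback. The field names are illustrative, not the project's exact config schema:

```ts
// Hypothetical user-supplied service definition. Because `tokenLimit` is
// omitted, nothing flows into the parser's `maxTokens`, so the in-parser
// fallback decides the budget (2048 before this PR, 4096 after).
const customService = {
  provider: 'openai',   // illustrative fields, not the real schema
  model: 'gpt-4o-mini',
  temperature: 0.2,
  // tokenLimit: 4096,  // omitted, so the fallback fires
};
```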

Bench (against the realistic post-#849 baseline)

bin/benchmark.ts's default maxTokens was also bumped from 2048 to 4096 so per-PR diffs reflect the most-common production budget.

| fixture        | calls before | calls after | Δ calls     | wall before | wall after | Δ wall            |
|----------------|-------------:|------------:|------------:|------------:|-----------:|------------------:|
| tiny           |            0 |           0 |           0 |        1 ms |       2 ms |             +1 ms |
| medium         |           20 |          19 |          -1 |   31,137 ms |  29,267 ms |         -1,870 ms |
| large          |           41 |          30 | -11 (-27%)  |   72,093 ms |  59,992 ms | -12,101 ms (-17%) |
| feature-add    |           11 |          11 |           0 |   15,967 ms |  19,591 ms |         +3,624 ms |
| refactor       |           28 |          20 |  -8 (-29%)  |   33,999 ms |  41,340 ms |         +7,341 ms |
| initial-commit |           41 |          30 | -11 (-27%)  |   72,285 ms |  60,034 ms | -12,251 ms (-17%) |
| docs-update    |            8 |           7 |          -1 |   18,570 ms |  18,563 ms |             -7 ms |
| dep-bump       |            0 |           0 |           0 |        0 ms |       0 ms |              0 ms |

Reading the numbers

The headline is the 27-29% drop in LLM call count on the heavy fixtures (large, initial-commit, refactor). That's direct API cost reduction — the user pays for fewer round-trips regardless of wall-clock effects.

Wall-clock follows the call-count drop on large / initial-commit (-12 s / -17%). On refactor it moves the other way (+7 s): with fewer-but-larger calls, each request carries a bigger payload, and under the bench's latency model the extra per-call time outweighs the base cost saved by dropping eight calls. With realistic API latency the crossover may sit elsewhere; worth measuring on a real run before declaring this a regression rather than a wall-clock-neutral cost win.
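A back-of-envelope check of that trade-off, assuming the bench charges each call a flat base plus a per-token rate. Both constants below are made up for illustration, not the bench's real numbers:

```ts
// Assumed linear latency model: wall = calls * (base + perToken * tokensPerCall).
// Serial execution and a constant total token volume are also assumptions.
const base = 800;     // ms fixed cost per call (assumed)
const perToken = 0.5; // ms per prompt token (assumed)

const wall = (calls: number, tokensPerCall: number): number =>
  calls * (base + perToken * tokensPerCall);

// refactor-like shape: 28 smaller calls vs 20 larger calls, same 42k total tokens.
console.log(wall(28, 1_500)); // 43,400 ms
console.log(wall(20, 2_100)); // 37,000 ms
// Under a flat per-token rate, fewer calls always wins by the saved base cost,
// so the observed +7 s on `refactor` implies per-call time grows faster than
// linearly with payload in the bench's latency model.
```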

Test plan

  • npm run lint
  • npm run test:jest (1250 tests pass — no behavior change to assert beyond the existing default-bound tests)
  • npm run build
  • npm run test:cli
  • npm run bench → numbers above

Plan reference

PR 1 of the #845 sprint. PR 2 (skip-trivial-diffs) is the next chunk.

The merged commit's message:

Match the canonical service tokenLimit shipped in
`langchain/utils.ts` for openai / anthropic / ollama (all 4096).
The 2048 fallback was a holdover from when 4k was a stretch for
fast models — today every shipped service overrides it to 4096
already, so the fallback only fires for users whose custom
service definition omits tokenLimit. Without this raise, those
users hit a needlessly tight budget that triggers extra
pre-summarization on diffs the model could absorb whole.

Two call sites updated:
  - `summarizeDiffs.ts:250` default param
  - `parsers/default/index.ts:55` `||` fallback

Bench (bin/benchmark.ts default also bumped to 4096 so per-PR
diffs reflect the most-common production budget):

| fixture        | calls before | calls after | Δ calls |
|----------------|-------------:|------------:|--------:|
| tiny           |            0 |           0 |       0 |
| medium         |           20 |          19 |      -1 |
| large          |           41 |          30 |     -11 (-27%) |
| feature-add    |           11 |          11 |       0 |
| refactor       |           28 |          20 |      -8 (-29%) |
| initial-commit |           41 |          30 |     -11 (-27%) |
| docs-update    |            8 |           7 |      -1 |
| dep-bump       |            0 |           0 |       0 |

Heavy fixtures (large, initial-commit, refactor) get a real 27-29%
reduction in LLM call count, a direct API cost reduction. Wall
clock for `large` / `initial-commit` improved 12 s (17%); the
`refactor` wall went up (+7 s) because fewer-but-larger calls each
carry a bigger payload under the bench latency model, trading API
spend for a small wall-clock cost. Net is a clear win on the cost
dimension that scales with diff size.
gfargo merged commit 259b93b into main on May 6, 2026. 9 checks passed.

gfargo deleted the feat/raise-token-budget-default-845 branch on May 6, 2026 at 01:06.
