feat(parser): raise default token budget from 2048 to 4096 (#845) #852
Merged
Conversation
Match the canonical service tokenLimit shipped in `langchain/utils.ts` for openai / anthropic / ollama (all 4096). The 2048 fallback was a holdover from when 4k was a stretch for fast models; today every shipped service overrides it to 4096 already, so the fallback only fires for users whose custom service definition omits tokenLimit. Without this raise, those users hit a needlessly tight budget that triggers extra pre-summarization on diffs the model could absorb whole.

Two call sites updated:

- `summarizeDiffs.ts:250` default param
- `parsers/default/index.ts:55` `||` fallback

Bench (bin/benchmark.ts default also bumped to 4096 so per-PR diffs reflect the most-common production budget):

| fixture        | calls before | calls after | Δ calls    |
|----------------|-------------:|------------:|-----------:|
| tiny           |            0 |           0 |          0 |
| medium         |           20 |          19 |         -1 |
| large          |           41 |          30 | -11 (-27%) |
| feature-add    |           11 |          11 |          0 |
| refactor       |           28 |          20 |  -8 (-29%) |
| initial-commit |           41 |          30 | -11 (-27%) |
| docs-update    |            8 |           7 |         -1 |
| dep-bump       |            0 |           0 |          0 |

Heavy fixtures (large, initial-commit, refactor) get a real 27-29% reduction in LLM call count: direct API cost reduction. Wall clock for `large` / `initial-commit` improved 12 s (17%); the `refactor` wall went up slightly because fewer-but-larger calls serialize a bit (each pays the latency model's per-call base cost), trading API spend for a small wall-clock cost. Net is a clear win on the cost dimension that scales with diff size.
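For orientation, a minimal TypeScript sketch of the two call sites. The surrounding signatures and `resolveBudget` are hypothetical; only the 2048 → 4096 fallback constants reflect this PR:

```typescript
// Sketch only: the real code lives at summarizeDiffs.ts:250 and
// parsers/default/index.ts:55; everything but the fallback values is assumed.

// Default-parameter fallback (summarizeDiffs.ts):
async function summarizeDiffs(
  diffs: string[],
  maxTokens: number = 4096, // was 2048
): Promise<string[]> {
  // pre-summarization only triggers for diffs that exceed maxTokens
  return diffs;
}

// `||` fallback (parsers/default/index.ts, inside fileChangeParser):
function resolveBudget(maxTokens?: number): number {
  return maxTokens || 4096; // was `maxTokens || 2048`
}
```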
First optimization PR in the #845 sprint. Two single-line fallback raises that match the budget the rest of the system already assumes.
Why
The diff-condensing pipeline had two 2048 fallbacks for `maxTokens`:

- `src/lib/parsers/default/utils/summarizeDiffs.ts:250` (default param)
- `src/lib/parsers/default/index.ts:55` (`maxTokens || 2048` in `fileChangeParser`)

Both came from when 4k context was a stretch for fast models. Every shipped service config in `src/lib/langchain/utils.ts` already sets `tokenLimit: 4096` (openai, anthropic, ollama defaults), so the fallback only fires when:

- a custom service definition omits `tokenLimit`, or
- a call site bypasses the `service.tokenLimit → maxTokens` plumbing entirely.

Both cases land in a needlessly tight budget that triggers extra pre-summarization on diffs the model could swallow whole. Raising the fallback to 4096 just synchronizes the parser's "no value" assumption with the rest of the system.
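To make the "only fires when" cases concrete, a hedged sketch of that plumbing; `ServiceConfig` and `resolveMaxTokens` are illustrative names, not the actual `langchain/utils.ts` shapes:

```typescript
// Illustrative only: shows which configs ever reach the parser fallback.
interface ServiceConfig {
  model: string;
  tokenLimit?: number; // all shipped configs set 4096; custom ones may omit it
}

const shipped: ServiceConfig = { model: "gpt-4", tokenLimit: 4096 };
const custom: ServiceConfig = { model: "my-local-model" }; // no tokenLimit

// Mirrors the parser-side `||` fallback at the end of the plumbing.
function resolveMaxTokens(service: ServiceConfig): number {
  return service.tokenLimit || 4096; // bottomed out at 2048 before this PR
}

resolveMaxTokens(shipped); // 4096, fallback never consulted
resolveMaxTokens(custom);  // 4096 now; 2048 before
```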
Bench (against the realistic post-#849 baseline)
`bin/benchmark.ts`'s default `maxTokens` was also bumped from 2048 to 4096 so per-PR diffs reflect the most-common production budget (full per-fixture numbers are in the merge commit message above).

Reading the numbers
The headline is the 27-29% drop in LLM call count on the heavy fixtures (large, initial-commit, refactor). That's direct API cost reduction — the user pays for fewer round-trips regardless of wall-clock effects.
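As a first-order model of where the drop comes from (illustrative, not the actual chunker): extra pre-summarization calls scale roughly with ceil(tokens / budget) per oversized file, so doubling the budget both shrinks chunk counts and lets mid-sized files skip summarization entirely. The observed 27-29% sits below the theoretical halving because many files already fit in 2048:

```typescript
// Toy model: count pre-summarization calls for a set of per-file token sizes.
function extraCalls(fileTokens: number[], budget: number): number {
  return fileTokens.reduce(
    (calls, tokens) =>
      calls + (tokens > budget ? Math.ceil(tokens / budget) : 0),
    0,
  );
}

const files = [500, 1200, 3000, 3000, 9000]; // hypothetical fixture shape
extraCalls(files, 2048); // 0 + 0 + 2 + 2 + 5 = 9
extraCalls(files, 4096); // 0 + 0 + 0 + 0 + 3 = 3 (mid-sized files drop out)
```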
Wall-clock follows the call-count drop on `large` / `initial-commit` (-12 s / -17%). On `refactor` it moves the other way (+7 s) because fewer-but-larger calls each pay the bench latency model's per-call base cost twice over; with realistic API latency the cross-over point may sit differently, so it's worth measuring on a real run before declaring this a regression rather than a wall-clock-neutral cost win.
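The serialization effect can be seen in a toy concurrency model (every constant below is made up; only the shape of the trade-off matters):

```typescript
// Toy model: up to `concurrency` calls run in parallel; each call pays a
// fixed base latency plus a per-token cost. Fewer-but-larger calls save
// billable round-trips but stretch each wave, which can nudge wall clock up.
function wallClockMs(
  calls: number,
  tokensPerCall: number,
  concurrency: number,
): number {
  const BASE_MS = 500; // hypothetical per-call base cost
  const MS_PER_TOKEN = 10; // hypothetical per-token cost
  const waves = Math.ceil(calls / concurrency);
  return waves * (BASE_MS + tokensPerCall * MS_PER_TOKEN);
}

// Same ~28k-token workload, shaped like the refactor fixture (28 -> 20 calls):
wallClockMs(28, 1000, 8); // 4 waves * 10.5 s = 42 s
wallClockMs(20, 1400, 8); // 3 waves * 14.5 s = 43.5 s: slightly worse wall,
                          // with 8 fewer billable calls
```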
Test plan

- `npm run lint`
- `npm run test:jest` (1250 tests pass; no behavior change to assert beyond the existing default-bound tests)
- `npm run build`
- `npm run test:cli`
- `npm run bench` → numbers above

Plan reference
PR 1 of the #845 sprint. PR 2 (skip-trivial-diffs) is the next chunk.