feat(parser): skip pure-shape diffs in pre-process pass (#845)#853
Merged
Pure additions / deletions / renames-no-edit / binary file changes have no information content beyond the diff's shape: an LLM summary just produces "Added X" / "Removed Y" / "Renamed A to B", which we can template for free.

New `trivialDiff.ts` helper detects these shapes from the hunk body and returns a deterministic summary string; `summarizeFileDiff` short-circuits on a non-undefined return and skips the LLM call entirely.

Detection rules (cheap on purpose, since this runs per file in pre-process):

- "Binary files X and Y differ" header → 'binary'
- rename from / rename to headers AND no `+`/`-` body → 'rename'
- all body lines start with `+` (and at least one does) → 'addition'
- all body lines start with `-` (and at least one does) → 'deletion'
- otherwise → undefined (LLM path stays in charge)

Headers (diff --git, index, ---, +++, @@, new file mode, etc.) are ignored when classifying so the metadata `--- /dev/null` doesn't fool the deletion detector.

Bench (post-PR-1 baseline → PR 2):

| fixture        | calls before | calls after | wall before | wall after   |
|----------------|-------------:|------------:|------------:|-------------:|
| tiny           | 0            | 0           | 1 ms        | 1 ms         |
| medium         | 19           | 6 (-68%)    | 29.3 s      | 6.9 s (-76%) |
| large          | 30           | 6 (-80%)    | 60.0 s      | 9.7 s (-84%) |
| feature-add    | 11           | 4 (-64%)    | 19.6 s      | 5.6 s (-71%) |
| refactor       | 20           | 20          | 41.3 s      | 41.3 s       |
| initial-commit | 30           | 6 (-80%)    | 60.0 s      | 9.8 s (-84%) |
| docs-update    | 7            | 7           | 18.6 s      | 18.6 s       |
| dep-bump       | 0            | 0           | 0 ms        | 0 ms         |

Initial-commit-shaped repos (the user's reported #845 pain point) collapse from 60 s / 30 LLM calls to 9.8 s / 6 calls, an 84 % wall-clock cut. Modification-heavy fixtures (refactor, docs-update) still need real LLM work and stay flat as expected; their optimization comes from PR 4 (continuous-queue waves) and PR 6 (per-type prompts).

15 new tests in `trivialDiff.test.ts` cover the four trivial shapes, headers-vs-body classification, line-count templating, singular wording for 1-line edits, and the rename-with-edit fall-through (which must NOT be classified as trivial because the body has actual changes).
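
The detection rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `trivialDiff.ts`: the function name `detectTrivialDiff`, the `TrivialDiff` return shape, and the choice to track hunk position are all assumptions. Only the four shapes, the header-vs-body distinction, and the summary wording come from the PR text.

```typescript
// Hypothetical sketch; names and return shape are assumptions, not the PR's code.
type TrivialKind = "binary" | "rename" | "addition" | "deletion";

interface TrivialDiff {
  kind: TrivialKind;
  summary: string;
}

// Singular wording for 1-line changes, plural otherwise.
function lineCount(n: number): string {
  return n === 1 ? "1 line" : `${n} lines`;
}

function detectTrivialDiff(path: string, diff: string): TrivialDiff | undefined {
  let inHunk = false; // false until the first "@@" header of the current file
  let added = 0;
  let removed = 0;
  let context = 0;
  let binary = false;
  let renameFrom: string | undefined;
  let renameTo: string | undefined;

  for (const line of diff.split("\n")) {
    if (line.startsWith("diff --git ")) { inHunk = false; continue; }
    if (line.startsWith("@@")) { inHunk = true; continue; }
    if (!inHunk) {
      // Everything before the first hunk is metadata ("index", "--- /dev/null",
      // "+++ b/x", "new file mode", ...), so it never counts as body.
      if (line.startsWith("Binary files ")) binary = true;
      else if (line.startsWith("rename from ")) renameFrom = line.slice("rename from ".length);
      else if (line.startsWith("rename to ")) renameTo = line.slice("rename to ".length);
      continue;
    }
    if (line.startsWith("+")) added++;
    else if (line.startsWith("-")) removed++;
    else if (line.startsWith(" ")) context++;
    // Other lines ("\ No newline at end of file", trailing blanks) are ignored.
  }

  if (binary) {
    return { kind: "binary", summary: `Updated binary file \`${path}\`.` };
  }
  if (renameFrom !== undefined && renameTo !== undefined) {
    // Rename-with-edit must fall through to the LLM path.
    if (added + removed > 0) return undefined;
    return { kind: "rename", summary: `Renamed \`${renameFrom}\` → \`${renameTo}\`.` };
  }
  if (context === 0 && added > 0 && removed === 0) {
    return { kind: "addition", summary: `Added \`${path}\` (${lineCount(added)}).` };
  }
  if (context === 0 && removed > 0 && added === 0) {
    return { kind: "deletion", summary: `Removed \`${path}\` (${lineCount(removed)}).` };
  }
  return undefined; // mixed changes: leave it to the LLM
}
```

Tracking whether a line falls before or after the first `@@` is one way to keep `--- /dev/null` and `+++ b/file` out of the body count. Matching against a fixed list of header prefixes would be another, at the cost of misreading a deleted line whose content itself begins with `-- `.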
PR 2 of the #845 sprint. Headline: initial-commit-shaped repos drop from 60 s / 30 LLM calls to 10 s / 6 calls — an 84% wall-clock cut, no quality loss.
Why
Pure additions / deletions / renames-with-no-edit / binary file changes have no information content beyond the diff's shape. The LLM was getting a multi-thousand-token raw diff for `+ const foo = 1\n+ const bar = 2\n...` and producing "Added foo.ts with N lines", a string we can template for free. New `trivialDiff.ts` helper detects those shapes from the hunk body and returns a deterministic summary; `summarizeFileDiff` short-circuits and skips the LLM entirely.

Detection rules
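Cheap on purpose (runs per file in pre-process):

| shape | detected when | templated summary |
|---|---|---|
| binary | `Binary files X and Y differ` header | Updated binary file `X`. |
| rename | `rename from` + `rename to` headers AND no `+`/`-` body lines | Renamed `old` → `new`. |
| addition | all body lines start with `+` (≥1 such line) | Added `file` (N lines). |
| deletion | all body lines start with `-` (≥1 such line) | Removed `file` (N lines). |
| modification | mixed `+`/`-` or rename-with-edit | none (LLM path stays in charge) |

Headers (`diff --git`, `index`, `---`, `+++`, `@@`, `new file mode`, `deleted file mode`, `similarity index`, `rename from`, `rename to`, `Binary files`) are ignored when classifying so the metadata `--- /dev/null` doesn't fool the deletion detector.

Bench (post-PR-1 baseline → PR 2)

| fixture        | calls before | calls after | wall before | wall after   |
|----------------|-------------:|------------:|------------:|-------------:|
| tiny           | 0            | 0           | 1 ms        | 1 ms         |
| medium         | 19           | 6 (-68%)    | 29.3 s      | 6.9 s (-76%) |
| large          | 30           | 6 (-80%)    | 60.0 s      | 9.7 s (-84%) |
| feature-add    | 11           | 4 (-64%)    | 19.6 s      | 5.6 s (-71%) |
| refactor       | 20           | 20          | 41.3 s      | 41.3 s       |
| initial-commit | 30           | 6 (-80%)    | 60.0 s      | 9.8 s (-84%) |
| docs-update    | 7            | 7           | 18.6 s      | 18.6 s       |
| dep-bump       | 0            | 0           | 0 ms        | 0 ms         |
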
Reading the wins
Initial-commit shape (the user's #845 pain point) collapses by 84%. A fresh `git init` with 50 source files used to take a minute of LLM work; now it takes ~10 seconds because all the pure-add files take the templated path.

Modification-heavy fixtures are unchanged, as expected: refactor (mixed `+`/`-` lines) and docs-update (markdown edits) still need real LLM work. They're targets for PR 4 (continuous-queue waves) and PR 6 (per-type prompts) respectively.

Why not 0 calls on the pure-add fixtures? Pre-process only runs for files larger than `maxFileTokens` (1024). Smaller pure-add files keep their raw diffs in the directory aggregate; the wave consolidation then summarizes those directories. Extending skip-trivial to wave consolidation (when all files in a directory are trivial) would eliminate those 6 remaining calls. That is a possible follow-up PR if the marginal win is worth it.

Test plan
- `npm run lint`
- `npm run test:jest` (1265 tests pass; 15 new in `trivialDiff.test.ts` covering the four trivial shapes, headers-vs-body classification, line-count templating, singular wording for 1-line edits, and the rename-with-edit fall-through that MUST NOT classify as trivial)
- `npm run build`
- `npm run test:cli`
- `npm run bench` (numbers above)

Plan reference

PR 2 of the #845 sprint. PR 3 (raise concurrency cap + adaptive backoff) is next, then PR 4 (continuous-queue waves), then PR 5 (disk cache).
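
For reviewers who want the control flow from "Why" and "Reading the wins" in one place, the per-file decision might look like the sketch below. It is written as a pure routing function so it can be read without the surrounding pipeline. The names `routeFileDiff`, `FileDiff`, and `Route`, the precomputed `tokens` field, and the injected `detectTrivial` callback are all hypothetical; the PR only states that pre-process runs for files larger than `maxFileTokens` and that `summarizeFileDiff` skips the LLM when the helper returns a summary.

```typescript
// Hypothetical sketch of the routing described in this PR; not the actual code.
interface FileDiff {
  path: string;
  diff: string;
  tokens: number; // assumed to be counted elsewhere
}

type Route =
  | { via: "aggregate" }                 // small file: raw diff stays in the directory aggregate
  | { via: "template"; summary: string } // trivial shape: deterministic summary, no LLM call
  | { via: "llm" };                      // anything else: LLM summary as before

type TrivialDetector = (path: string, diff: string) => string | undefined;

function routeFileDiff(
  file: FileDiff,
  maxFileTokens: number,
  detectTrivial: TrivialDetector,
): Route {
  // Pre-process only runs for files larger than maxFileTokens.
  if (file.tokens <= maxFileTokens) return { via: "aggregate" };

  // Short-circuit: a defined summary means the LLM call is skipped.
  const summary = detectTrivial(file.path, file.diff);
  if (summary !== undefined) return { via: "template", summary };

  return { via: "llm" };
}

// Counts how many per-file LLM calls a batch would need under this routing.
function countPreProcessLlmCalls(
  files: FileDiff[],
  maxFileTokens: number,
  detectTrivial: TrivialDetector,
): number {
  return files.filter((f) => routeFileDiff(f, maxFileTokens, detectTrivial).via === "llm").length;
}
```

Under this reading, the calls that remain on pure-add fixtures come from the `aggregate` branch: those files never reach the detector, so their directories are still summarized during wave consolidation.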