feat(parser): skip pure-shape diffs in pre-process pass (#845) #853

Merged: gfargo merged 1 commit into main from feat/skip-trivial-diffs-845 on May 6, 2026

@gfargo gfargo commented May 6, 2026

PR 2 of the #845 sprint. Headline: initial-commit-shaped repos drop from 60 s / 30 LLM calls to 10 s / 6 calls — an 84% wall-clock cut, no quality loss.

Why

Pure additions / deletions / renames-with-no-edit / binary file changes have no information content beyond the diff's shape. The LLM was getting a multi-thousand-token raw diff for `+ const foo = 1\n+ const bar = 2\n...` and producing "Added foo.ts with N lines", a string we can template for free. The new `trivialDiff.ts` helper detects those shapes from the hunk body and returns a deterministic summary; `summarizeFileDiff` short-circuits and skips the LLM entirely.

Detection rules

Cheap on purpose (runs per file in pre-process):

| Shape | Detection | Templated summary |
|---|---|---|
| binary | `Binary files X and Y differ` header | Updated binary file `X`. |
| rename | `rename from` + `rename to` headers AND no `+`/`-` body lines | Renamed `old` → `new`. |
| addition | All body lines start with `+` (≥1 such line) | Added `file` (N lines). |
| deletion | All body lines start with `-` (≥1 such line) | Removed `file` (N lines). |
| modification | mixed `+`/`-` or rename-with-edit | `undefined` → LLM path stays |

Headers (diff --git, index, ---, +++, @@, new file mode, deleted file mode, similarity index, rename from, rename to, Binary files) are ignored when classifying so the metadata --- /dev/null doesn't fool the deletion detector.
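The rules above can be sketched roughly as follows. This is a minimal illustration, not the actual `trivialDiff.ts`: the names `scanDiff`, `classifyTrivialDiff`, and `summarizeTrivialDiff`, their signatures, and the choice to treat everything before the first `@@` line as header metadata are all assumptions. It also assumes a single-file unified diff.

```typescript
// Hypothetical sketch; names and signatures are illustrative only.
type TrivialShape = "binary" | "rename" | "addition" | "deletion";

interface DiffScan {
  binary: boolean;
  renameFrom?: string;
  renameTo?: string;
  added: number;
  removed: number;
  context: number;
}

// Walk a single-file unified diff once. Lines before the first "@@" are
// treated as headers, so "--- /dev/null" is never counted as a deletion.
function scanDiff(diff: string): DiffScan {
  const scan: DiffScan = { binary: false, added: 0, removed: 0, context: 0 };
  let inHunk = false;
  for (const line of diff.split("\n")) {
    if (line.startsWith("@@")) { inHunk = true; continue; }
    if (!inHunk) {
      if (line.startsWith("Binary files ")) scan.binary = true;
      else if (line.startsWith("rename from ")) scan.renameFrom = line.slice("rename from ".length);
      else if (line.startsWith("rename to ")) scan.renameTo = line.slice("rename to ".length);
      continue;
    }
    if (line.startsWith("+")) scan.added++;
    else if (line.startsWith("-")) scan.removed++;
    else if (line.startsWith(" ")) scan.context++;
    // Anything else (trailing empty line, "\ No newline at end of file") is ignored.
  }
  return scan;
}

function classify(s: DiffScan): TrivialShape | undefined {
  if (s.binary) return "binary";
  const isRename = s.renameFrom !== undefined && s.renameTo !== undefined;
  const hasBody = s.added + s.removed + s.context > 0;
  if (isRename) return hasBody ? undefined : "rename"; // rename-with-edit falls through
  if (s.context > 0) return undefined;
  if (s.added > 0 && s.removed === 0) return "addition";
  if (s.removed > 0 && s.added === 0) return "deletion";
  return undefined; // mixed +/- stays on the LLM path
}

function classifyTrivialDiff(diff: string): TrivialShape | undefined {
  return classify(scanDiff(diff));
}

// Singular wording for 1-line changes, plural otherwise.
const lineCount = (n: number): string => `${n} ${n === 1 ? "line" : "lines"}`;

function summarizeTrivialDiff(file: string, diff: string): string | undefined {
  const s = scanDiff(diff);
  switch (classify(s)) {
    case "binary": return `Updated binary file \`${file}\`.`;
    case "rename": return `Renamed \`${s.renameFrom}\` → \`${s.renameTo}\`.`;
    case "addition": return `Added \`${file}\` (${lineCount(s.added)}).`;
    case "deletion": return `Removed \`${file}\` (${lineCount(s.removed)}).`;
    default: return undefined;
  }
}

const exampleAdd = [
  "diff --git a/foo.ts b/foo.ts",
  "new file mode 100644",
  "--- /dev/null",
  "+++ b/foo.ts",
  "@@ -0,0 +1,2 @@",
  "+const foo = 1",
  "+const bar = 2",
].join("\n");

console.log(summarizeTrivialDiff("foo.ts", exampleAdd)); // Added `foo.ts` (2 lines).
```

Gating on the first `@@` line rather than matching header prefixes on every line is one possible design. It avoids misreading a removed body line such as `--- comment` as a header, at the cost of assuming one file per diff.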

Bench (post-PR-1 baseline → PR 2)

| fixture | calls before | calls after | Δ calls | wall before | wall after | Δ wall |
|---|---:|---:|---:|---:|---:|---:|
| tiny | 0 | 0 | 0 | 1 ms | 1 ms | 0 ms |
| medium | 19 | 6 | -13 (-68%) | 29,267 ms | 6,906 ms | -22,361 ms (-76%) |
| large | 30 | 6 | -24 (-80%) | 59,992 ms | 9,749 ms | -50,243 ms (-84%) |
| feature-add | 11 | 4 | -7 (-64%) | 19,591 ms | 5,640 ms | -13,951 ms (-71%) |
| refactor | 20 | 20 | 0 | 41,340 ms | 41,347 ms | +7 ms |
| initial-commit | 30 | 6 | -24 (-80%) | 60,034 ms | 9,818 ms | -50,216 ms (-84%) |
| docs-update | 7 | 7 | 0 | 18,563 ms | 18,564 ms | +1 ms |
| dep-bump | 0 | 0 | 0 | 0 ms | 0 ms | 0 ms |
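For anyone re-deriving the Δ columns, the arithmetic is a plain difference plus a rounded percentage of the "before" value. The helper below is a hypothetical sketch (the `delta` name and the choice to leave the percentage undefined when the baseline is 0 are assumptions, not part of the bench script):

```typescript
// Hypothetical helper for reproducing the Δ columns; not part of the bench tooling.
interface Delta {
  abs: number;              // after - before
  pct: number | undefined;  // rounded percent change, undefined when before is 0
}

function delta(before: number, after: number): Delta {
  const abs = after - before;
  const pct = before === 0 ? undefined : Math.round((abs / before) * 100);
  return { abs, pct };
}

// initial-commit fixture, wall-clock in ms
const wall = delta(60034, 9818);
console.log(wall.abs, wall.pct); // -50216 -84
```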

Reading the wins

The initial-commit shape (the user's #845 pain point) drops 84% in wall-clock time. A fresh git init with 50 source files used to take a minute of LLM work; now it takes ~10 seconds because all the pure-add files take the templated path.

Modification-heavy fixtures are unchanged, as expected — refactor (mixed +/- lines) and docs-update (markdown edits) still need real LLM work. They're targets for PR 4 (continuous-queue waves) and PR 6 (per-type prompts) respectively.

Why not 0 calls on the pure-add fixtures? Pre-process only runs for files larger than maxFileTokens (1024). Smaller pure-add files keep their raw diffs in the directory aggregate; the wave consolidation then summarizes those directories. Extending skip-trivial to wave consolidation (when all files in a directory are trivial) would eliminate those 6 remaining calls — possible follow-up PR if the marginal win is worth it.
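The routing described above, and the possible follow-up, can be sketched as below. Everything here is a hypothetical illustration: `FileDiff`, `routeFile`, `directoryIsAllTrivial`, and the injected `isTrivial` predicate are assumed names, and token counts are taken as precomputed. Only the 1024 threshold and the "larger than `maxFileTokens`" gate come from the description.

```typescript
// Hypothetical sketch of the pre-process routing; not the actual parser code.
type Route = "keep-raw" | "template" | "llm";

interface FileDiff {
  path: string;
  diff: string;
  tokens: number; // assumed to be computed elsewhere
}

type TrivialPredicate = (diff: string) => boolean;

// Pre-process only touches files larger than maxFileTokens. Smaller files
// keep their raw diff and are summarized later with their directory.
function routeFile(file: FileDiff, isTrivial: TrivialPredicate, maxFileTokens = 1024): Route {
  if (file.tokens <= maxFileTokens) return "keep-raw";
  return isTrivial(file.diff) ? "template" : "llm";
}

// Possible follow-up: skip the wave-consolidation call for a directory
// when every file in it is trivial.
function directoryIsAllTrivial(files: FileDiff[], isTrivial: TrivialPredicate): boolean {
  return files.length > 0 && files.every((f) => isTrivial(f.diff));
}

// Stand-in predicate for the demo: every non-empty line starts with "+".
const pureAdd: TrivialPredicate = (diff) => {
  const body = diff.split("\n").filter((l) => l.length > 0);
  return body.length > 0 && body.every((l) => l.startsWith("+"));
};

const small: FileDiff = { path: "src/a.ts", diff: "+const a = 1", tokens: 12 };
const big: FileDiff = { path: "src/b.ts", diff: "+const b = 2\n+const c = 3", tokens: 4000 };

console.log(routeFile(small, pureAdd)); // keep-raw
console.log(routeFile(big, pureAdd));   // template
```

In this sketch the small pure-add file is the kind that still reaches wave consolidation, which is why the pure-add fixtures do not reach 0 calls.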

Test plan

  • npm run lint
  • npm run test:jest (1265 tests pass — 15 new in trivialDiff.test.ts covering the four trivial shapes, headers-vs-body classification, line-count templating, singular wording for 1-line edits, and the rename-with-edit fall-through that MUST NOT classify as trivial)
  • npm run build
  • npm run test:cli
  • npm run bench — numbers above

Plan reference

PR 2 of the #845 sprint. PR 3 (raise concurrency cap + adaptive backoff) is next, then PR 4 (continuous-queue waves), then PR 5 (disk cache).

Pure additions / deletions / renames-no-edit / binary file changes
have no information content beyond the diff's shape — an LLM
summary just produces "Added X" / "Removed Y" / "Renamed A to B"
that we can template for free. New `trivialDiff.ts` helper
detects these shapes from the hunk body and returns a
deterministic summary string; `summarizeFileDiff` short-circuits
on a non-undefined return and skips the LLM call entirely.

Detection rules (cheap on purpose — runs per file in pre-process):
  - "Binary files X and Y differ" header → 'binary'
  - rename from / rename to headers AND no `+`/`-` body → 'rename'
  - all body lines start with `+` (and at least one does) → 'addition'
  - all body lines start with `-` (and at least one does) → 'deletion'
  - otherwise → undefined (LLM path stays in charge)

Headers (diff --git, index, ---, +++, @@, new file mode, etc.) are
ignored when classifying so the metadata `--- /dev/null` doesn't
fool the deletion detector.

Bench (post-PR-1 baseline → PR 2):

| fixture        | calls before | calls after | wall before | wall after |
|----------------|-------------:|------------:|------------:|-----------:|
| tiny           |            0 |           0 |      1 ms   |     1 ms   |
| medium         |           19 |  6  (-68%)  |     29.3 s  |   6.9 s (-76%) |
| large          |           30 |  6  (-80%)  |     60.0 s  |   9.7 s (-84%) |
| feature-add    |           11 |  4  (-64%)  |     19.6 s  |   5.6 s (-71%) |
| refactor       |           20 |          20 |     41.3 s  |  41.3 s    |
| initial-commit |           30 |  6  (-80%)  |     60.0 s  |   9.8 s (-84%) |
| docs-update    |            7 |           7 |     18.6 s  |  18.6 s    |
| dep-bump       |            0 |           0 |      0 ms   |     0 ms   |

Initial-commit-shaped repos (the user's reported #845 pain point)
collapse from 60 s / 30 LLM calls to 9.8 s / 6 calls — an 84 %
wall-clock cut. Modification-heavy fixtures (refactor, docs-update)
still need real LLM work and stay flat as expected; their
optimization comes from PR 4 (continuous-queue waves) and PR 6
(per-type prompts).

15 new tests in `trivialDiff.test.ts` cover the four trivial
shapes, headers-vs-body classification, line-count templating,
singular wording for 1-line edits, and the rename-with-edit
fall-through (which must NOT be classified as trivial because
the body has actual changes).
@gfargo gfargo merged commit c809e81 into main May 6, 2026
9 checks passed
@gfargo gfargo deleted the feat/skip-trivial-diffs-845 branch May 6, 2026 02:56