feat(parser): skip pure-shape diffs in pre-process pass (#845) by gfargo · Pull Request #853 · gfargo/coco

gfargo · 2026-05-06T01:14:37Z

PR 2 of the #845 sprint. Headline: initial-commit-shaped repos drop from 60 s / 30 LLM calls to 10 s / 6 calls — an 84% wall-clock cut, no quality loss.

Why

Pure additions / deletions / renames-with-no-edit / binary file changes have no information content beyond the diff's shape. The LLM was getting a multi-thousand-token raw diff for + const foo = 1\n+ const bar = 2\n... and producing "Added foo.ts with N lines", a string we can template for free. The new trivialDiff.ts helper detects those shapes from the hunk body and returns a deterministic summary; summarizeFileDiff short-circuits on that result and skips the LLM entirely.
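The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `summarizeFileDiffSketch`, `TrivialDetector`, `LlmSummarizer`, and the `usedLlm` flag are hypothetical names, and the detector and LLM call are passed in as parameters so the example stays self-contained.

```typescript
// Hypothetical sketch: names and signatures are illustrative, not the actual coco API.

// Returns a templated summary for a trivial diff, or undefined to defer to the LLM.
type TrivialDetector = (diff: string) => string | undefined;

// Stand-in for the real LLM summarization call.
type LlmSummarizer = (diff: string) => Promise<string>;

interface FileSummary {
  summary: string;
  usedLlm: boolean;
}

async function summarizeFileDiffSketch(
  diff: string,
  detectTrivial: TrivialDetector,
  callLlm: LlmSummarizer,
): Promise<FileSummary> {
  // Cheap, deterministic check first.
  const templated = detectTrivial(diff);
  if (templated !== undefined) {
    // Trivial shape: return the template and never touch the LLM.
    return { summary: templated, usedLlm: false };
  }
  // Anything else (mixed +/- lines, rename-with-edit) stays on the LLM path.
  return { summary: await callLlm(diff), usedLlm: true };
}
```

Checking `!== undefined` rather than truthiness keeps the contract explicit: only a missing result sends the diff to the LLM.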

Detection rules

Cheap on purpose (runs per file in pre-process):

Shape	Detection	Templated summary
`binary`	`Binary files X and Y differ` header	Updated binary file `X`.
`rename`	`rename from` + `rename to` headers AND no `+`/`-` body lines	Renamed `old` → `new`.
`addition`	All body lines start with `+` (≥1 such line)	Added `file` (N lines).
`deletion`	All body lines start with `-` (≥1 such line)	Removed `file` (N lines).
`modification`	mixed `+`/`-` or rename-with-edit	undefined → LLM path stays
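The templated summaries in the table could be produced by something like the sketch below. The `TrivialDiff` type, `templateSummary`, and `lineLabel` are hypothetical names, and the exact singular wording ("1 line") is an assumption; the PR only states that 1-line edits get singular wording.

```typescript
// Hypothetical sketch: type and function names are illustrative only.
type TrivialDiff =
  | { shape: "binary"; file: string }
  | { shape: "rename"; from: string; to: string }
  | { shape: "addition" | "deletion"; file: string; lineCount: number };

// Assumed wording: "1 line" for a single line, "N lines" otherwise.
function lineLabel(count: number): string {
  return count === 1 ? "1 line" : `${count} lines`;
}

function templateSummary(d: TrivialDiff): string {
  switch (d.shape) {
    case "binary":
      return `Updated binary file \`${d.file}\`.`;
    case "rename":
      return `Renamed \`${d.from}\` → \`${d.to}\`.`;
    case "addition":
      return `Added \`${d.file}\` (${lineLabel(d.lineCount)}).`;
    case "deletion":
      return `Removed \`${d.file}\` (${lineLabel(d.lineCount)}).`;
  }
}
```

A discriminated union keeps each shape's required fields explicit, so a rename cannot be templated without both paths.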

Headers (diff --git, index, ---, +++, @@, new file mode, deleted file mode, similarity index, rename from, rename to, Binary files) are ignored when classifying so the metadata --- /dev/null doesn't fool the deletion detector.
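As a rough illustration of these rules, the sketch below classifies a single-file diff. It is not the trivialDiff.ts implementation: `classifyTrivialDiff` is a hypothetical name, and it separates headers from body by position (everything before the first `@@` line counts as header) instead of matching the prefix list above, which is an assumption made to keep the example short. It also assumes `\ No newline at end of file` markers should be ignored.

```typescript
// Hypothetical sketch of the detection rules; not the actual trivialDiff.ts code.
type TrivialShape = "binary" | "rename" | "addition" | "deletion";

function classifyTrivialDiff(diff: string): TrivialShape | undefined {
  const lines = diff.split("\n");

  // Assumption: all lines before the first hunk header are metadata.
  const firstHunk = lines.findIndex((l) => l.startsWith("@@"));
  const headers = firstHunk === -1 ? lines : lines.slice(0, firstHunk);

  // Body = hunk content, minus later hunk headers, empty trailing lines,
  // and "\ No newline at end of file" markers.
  const body = (firstHunk === -1 ? [] : lines.slice(firstHunk + 1)).filter(
    (l) => l !== "" && !l.startsWith("@@") && !l.startsWith("\\ "),
  );

  if (headers.some((l) => /^Binary files .* differ$/.test(l))) return "binary";

  const isRename =
    headers.some((l) => l.startsWith("rename from ")) &&
    headers.some((l) => l.startsWith("rename to "));
  const hasEdits = body.some((l) => l.startsWith("+") || l.startsWith("-"));
  // Rename-with-edit must fall through to the LLM path.
  if (isRename) return hasEdits ? undefined : "rename";

  if (body.length === 0) return undefined;
  if (body.every((l) => l.startsWith("+"))) return "addition";
  if (body.every((l) => l.startsWith("-"))) return "deletion";
  return undefined; // mixed +/- or context lines: a real modification
}
```

Because `--- /dev/null` and `+++ b/file` sit before the first hunk header, they never reach the body check and cannot be mistaken for deleted or added lines.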

Bench (post-PR-1 baseline → PR 2)

fixture	calls before	calls after	Δ calls	wall before	wall after	Δ wall
tiny	0	0	0	1 ms	1 ms	0 ms
medium	19	6	-13 (-68%)	29,267 ms	6,906 ms	-22,361 ms (-76%)
large	30	6	-24 (-80%)	59,992 ms	9,749 ms	-50,243 ms (-84%)
feature-add	11	4	-7 (-64%)	19,591 ms	5,640 ms	-13,951 ms (-71%)
refactor	20	20	0	41,340 ms	41,347 ms	+7 ms
initial-commit	30	6	-24 (-80%)	60,034 ms	9,818 ms	-50,216 ms (-84%)
docs-update	7	7	0	18,563 ms	18,564 ms	+1 ms
dep-bump	0	0	0	0 ms	0 ms	0 ms
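The percentage columns follow from the before/after values. A minimal sketch of that arithmetic, assuming plain rounding to the nearest integer (`percentDelta` is a hypothetical helper, not part of the bench script):

```typescript
// Hypothetical helper: percentage change from `before` to `after`, rounded to an integer.
function percentDelta(before: number, after: number): number {
  // Fixtures with a zero baseline (e.g. dep-bump) have no meaningful ratio.
  if (before === 0) return 0;
  return Math.round(((after - before) / before) * 100);
}

// Values taken from the table above.
const largeWall = percentDelta(59992, 9749); // large fixture, wall time
const mediumCalls = percentDelta(19, 6); // medium fixture, LLM calls
```

For the large fixture this is (9,749 − 59,992) / 59,992 ≈ −0.8375, which rounds to the −84% shown in the table.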

Reading the wins

The initial-commit shape (the user's #845 pain point) drops 84% in wall-clock time. A fresh git init with 50 source files used to take a minute of LLM work; now it takes ~10 seconds because all the pure-add files take the templated path.

Modification-heavy fixtures are unchanged, as expected — refactor (mixed +/- lines) and docs-update (markdown edits) still need real LLM work. They're targets for PR 4 (continuous-queue waves) and PR 6 (per-type prompts) respectively.

Why not 0 calls on the pure-add fixtures? Pre-process only runs for files larger than maxFileTokens (1024). Smaller pure-add files keep their raw diffs in the directory aggregate; the wave consolidation then summarizes those directories. Extending skip-trivial to wave consolidation (when all files in a directory are trivial) would eliminate those 6 remaining calls — possible follow-up PR if the marginal win is worth it.
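The size gate that explains the remaining calls can be pictured with the sketch below. It is an assumption-laden illustration: `FileDiffEntry`, `routeByTokenBudget`, and the `tokenCount` field are hypothetical, and only the threshold value (1024) and the "larger than" comparison come from the description above.

```typescript
// Hypothetical sketch of the pre-process gate; not the actual coco code.
interface FileDiffEntry {
  path: string;
  diff: string;
  tokenCount: number; // assumed to be computed upstream
}

interface RoutedFiles {
  preProcess: FileDiffEntry[]; // summarized per file; trivial shapes skip the LLM here
  aggregate: FileDiffEntry[]; // raw diffs kept for directory-level wave consolidation
}

function routeByTokenBudget(
  files: FileDiffEntry[],
  maxFileTokens = 1024,
): RoutedFiles {
  const routed: RoutedFiles = { preProcess: [], aggregate: [] };
  for (const file of files) {
    // Only files strictly larger than the budget go through pre-process.
    if (file.tokenCount > maxFileTokens) {
      routed.preProcess.push(file);
    } else {
      routed.aggregate.push(file);
    }
  }
  return routed;
}
```

Under this reading, a small pure-add file lands in `aggregate` and never reaches the trivial-diff check, which is why its directory still costs one LLM call during consolidation.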

Test plan

npm run lint
npm run test:jest (1265 tests pass — 15 new in trivialDiff.test.ts covering the four trivial shapes, headers-vs-body classification, line-count templating, singular wording for 1-line edits, and the rename-with-edit fall-through that MUST NOT classify as trivial)
npm run build
npm run test:cli
npm run bench — numbers above

Plan reference

PR 2 of the #845 sprint. PR 3 (raise concurrency cap + adaptive backoff) is next, then PR 4 (continuous-queue waves), then PR 5 (disk cache).

Pure additions / deletions / renames-no-edit / binary file changes have no information content beyond the diff's shape: an LLM summary just produces "Added X" / "Removed Y" / "Renamed A to B", which we can template for free. New `trivialDiff.ts` helper detects these shapes from the hunk body and returns a deterministic summary string; `summarizeFileDiff` short-circuits on a non-undefined return and skips the LLM call entirely.

Detection rules (cheap on purpose; runs per file in pre-process):

- "Binary files X and Y differ" header → 'binary'
- rename from / rename to headers AND no `+`/`-` body → 'rename'
- all body lines start with `+` (and at least one does) → 'addition'
- all body lines start with `-` (and at least one does) → 'deletion'
- otherwise → undefined (LLM path stays in charge)

Headers (diff --git, index, ---, +++, @@, new file mode, etc.) are ignored when classifying so the metadata `--- /dev/null` doesn't fool the deletion detector.

Bench (post-PR-1 baseline → PR 2):

| fixture        | calls before | calls after | wall before | wall after    |
|----------------|-------------:|------------:|------------:|--------------:|
| tiny           | 0            | 0           | 1 ms        | 1 ms          |
| medium         | 19           | 6 (-68%)    | 29.3 s      | 6.9 s (-76%)  |
| large          | 30           | 6 (-80%)    | 60.0 s      | 9.7 s (-84%)  |
| feature-add    | 11           | 4 (-64%)    | 19.6 s      | 5.6 s (-71%)  |
| refactor       | 20           | 20          | 41.3 s      | 41.3 s        |
| initial-commit | 30           | 6 (-80%)    | 60.0 s      | 9.8 s (-84%)  |
| docs-update    | 7            | 7           | 18.6 s      | 18.6 s        |
| dep-bump       | 0            | 0           | 0 ms        | 0 ms          |

Initial-commit-shaped repos (the user's reported #845 pain point) collapse from 60 s / 30 LLM calls to 9.8 s / 6 calls, an 84% wall-clock cut. Modification-heavy fixtures (refactor, docs-update) still need real LLM work and stay flat as expected; their optimization comes from PR 4 (continuous-queue waves) and PR 6 (per-type prompts).

15 new tests in `trivialDiff.test.ts` cover the four trivial shapes, headers-vs-body classification, line-count templating, singular wording for 1-line edits, and the rename-with-edit fall-through (which must NOT be classified as trivial because the body has actual changes).

gfargo merged commit c809e81 into main May 6, 2026
9 checks passed

gfargo deleted the feat/skip-trivial-diffs-845 branch May 6, 2026 02:56
