feat(bench): realistic per-language fixture generators + scenarios (#845) #849
Merged
The v0 fixtures from #847 used a seeded LCG to generate noise. Good for deterministic latency measurement, useless for telling whether an optimization translates to real-shaped diffs. This PR swaps that out for code-shaped content per file type and adds named scenarios that mirror real commit workflows.

Generators (src/lib/parsers/default/__fixtures__/generators.ts):
- generateTypeScript: imports, types, functions, classes, JSDoc
- generatePython: imports, defs, classes, decorators, docstrings
- generateMarkdown: headers, lists, paragraphs, code blocks, tables
- generateJson: nested config object with realistic key names
- generateYaml: CI workflow shape
- generateLockfile: yarn lock-style entries
- generateContentForFile: extension-based dispatcher

Diff-shape wrappers (diffs.ts):
- asAdditionDiff / asDeletionDiff: pure +/- shapes
- asModificationDiff: context + remove + add interleaving
- asRenameDiff: git rename header (no body)
- asBinaryDiff: binary file marker

Scenarios in addition to the original tiny/medium/large:
- feature-add (14 files): new module + tests + docs touch
- refactor (30 files): rename + ~25 modifications
- initial-commit (50 files): same shape as user's #845 repro
- docs-update (9 files): markdown-heavy
- dep-bump (3 files): package.json + lockfile + CHANGELOG

Re-captured baseline (committed at .bench/baseline.json):

| fixture        | wall-clock | calls |  llm total | prompt tokens |
|----------------|-----------:|------:|-----------:|--------------:|
| tiny           |       2 ms |     0 |       0 ms |             0 |
| medium         |  31,124 ms |    20 | 106,333 ms |        34,237 |
| large          |  72,151 ms |    41 | 244,112 ms |        74,197 |
| feature-add    |  15,967 ms |    11 |  54,726 ms |        18,937 |
| refactor       |  33,994 ms |    28 | 153,871 ms |        52,430 |
| initial-commit |  72,291 ms |    41 | 245,148 ms |        74,546 |
| docs-update    |  18,563 ms |     8 |  56,293 ms |        13,908 |
| dep-bump       |  27,158 ms |     1 |  27,141 ms |        19,597 |

Three observations the realistic fixtures surface that the LCG fixtures hid:
1. dep-bump pays 27s for one LLM call, the lockfile pre-summary. Skip-trivial / per-extension fast-path should basically zero this.
2. refactor (30 files of mixed +/-) fires 28 LLM calls. The continuous-queue wave consolidation work (PR 4) targets exactly this shape.
3. docs-update is markdown-heavy with 8 calls in 19s. A markdown-specific shorter prompt could measurably trim this.

Tests: 14 new generator tests + 5 new fixture-level tests covering determinism, expected-marker presence, scaling behavior, and shape properties of the rename / dep-bump scenarios.
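The determinism and scaling properties those tests cover can be illustrated with a minimal sketch. This is not the code in generators.ts: `makeRng`, `pick`, `sketchTypeScript`, `sketchMarkdown`, and `sketchContentForFile` are hypothetical names, and the generated content is far simpler than what the real generators are described as emitting. It only shows the pattern of a seeded generator per file type behind an extension-based dispatcher.

```typescript
// Hypothetical sketch: every name here is an assumption, not the real generators.ts API.

/** Linear congruential generator: the same seed always yields the same sequence. */
function makeRng(seed: number): () => number {
  let state = Math.abs(Math.floor(seed)) % 4294967296;
  return () => {
    // Numerical Recipes constants; the product stays below 2^53, so the math is exact.
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // always in [0, 1)
  };
}

/** Pick one item using the seeded generator instead of Math.random. */
function pick<T>(rng: () => number, items: T[]): T {
  return items[Math.floor(rng() * items.length)];
}

/** Code-shaped TypeScript: an import line, then JSDoc-commented exported functions. */
function sketchTypeScript(seed: number, functionCount: number): string {
  const rng = makeRng(seed);
  const verbs = ['load', 'parse', 'format', 'resolve'];
  const nouns = ['Config', 'Diff', 'Summary', 'Token'];
  const lines = ["import { readFile } from 'node:fs/promises';", ''];
  for (let i = 0; i < functionCount; i++) {
    const name = pick(rng, verbs) + pick(rng, nouns) + i;
    lines.push('/** Generated filler function. */');
    lines.push('export function ' + name + '(input: string): string {');
    lines.push('  return input.trim();', '}', '');
  }
  return lines.join('\n');
}

/** Markdown-shaped content: a title, then sections with a short list each. */
function sketchMarkdown(seed: number, sectionCount: number): string {
  const rng = makeRng(seed);
  const topics = ['Setup', 'Usage', 'Configuration', 'Troubleshooting'];
  const lines = ['# Generated document', ''];
  for (let i = 0; i < sectionCount; i++) {
    lines.push('## ' + pick(rng, topics) + ' ' + (i + 1), '');
    lines.push('- first step', '- second step', '');
  }
  return lines.join('\n');
}

/** Extension-based dispatch; unknown extensions fall back to plain text. */
function sketchContentForFile(path: string, seed: number, size: number): string {
  const dot = path.lastIndexOf('.');
  const ext = dot === -1 ? '' : path.slice(dot + 1).toLowerCase();
  switch (ext) {
    case 'ts':
      return sketchTypeScript(seed, size);
    case 'md':
      return sketchMarkdown(seed, size);
    default:
      return 'plain text fixture\n';
  }
}
```

Calling `sketchContentForFile('src/a.ts', 42, 3)` twice returns identical strings, which is the property a determinism test would assert, and raising `size` grows the output, which is what a scaling test would check.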
gfargo added a commit that referenced this pull request on May 6, 2026
#845) (#850)

The dep-bump fixture from #849 included `yarn.lock` and reported 27 seconds of LLM work for it. That's a bench-fixture artifact, not a real-world cost. Lockfiles live in `DEFAULT_IGNORED_FILES` and the `.lock` extension lives in `DEFAULT_IGNORED_EXTENSIONS` (see `src/lib/config/constants.ts`), so `getChanges` strips them before the diff-condensing pipeline ever sees them on a real `coco commit`.

- Drop `yarn.lock` from DEP_BUMP_FILES; the realistic shape is just `package.json` + `CHANGELOG.md`.
- Update the dep-bump-shape test to assert the post-filter invariant (no lockfiles in the fixture) instead of asserting a lockfile is present.
- Add a guard test that fails loudly if any future fixture accidentally drifts back into including a default-ignored file.
- Re-baseline. dep-bump now reports 0 ms / 0 LLM calls (early exit, total budget already under threshold), reflecting what a real dep bump costs the pipeline.

The "lockfile fast-path" optimization angle from the original plan is dropped: the existing ignore filter already handles that, and any pipeline-level skip would be redundant.
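The guard described in this commit could look roughly like the sketch below. It is an illustration only: the two `ASSUMED_*` lists stand in for `DEFAULT_IGNORED_FILES` and `DEFAULT_IGNORED_EXTENSIONS`, whose real contents are not shown here, and `isDefaultIgnored` / `findIgnoredFixtureFiles` are hypothetical helper names rather than functions from the repository.

```typescript
// Hypothetical sketch. The real lists live in src/lib/config/constants.ts;
// the values below are assumptions based only on the commit message.
const ASSUMED_IGNORED_FILES = ['yarn.lock'];
const ASSUMED_IGNORED_EXTENSIONS = ['.lock'];

/** True when `value` ends with `suffix`. */
function hasSuffix(value: string, suffix: string): boolean {
  return value.length >= suffix.length && value.slice(value.length - suffix.length) === suffix;
}

/** True when a path would be dropped by the (assumed) default ignore lists. */
function isDefaultIgnored(path: string): boolean {
  const base = path.slice(path.lastIndexOf('/') + 1);
  if (ASSUMED_IGNORED_FILES.indexOf(base) !== -1) return true;
  return ASSUMED_IGNORED_EXTENSIONS.some((ext) => hasSuffix(base, ext));
}

/** Collect "fixture: file" entries for every fixture file the ignore lists would strip. */
function findIgnoredFixtureFiles(fixtures: { [name: string]: string[] }): string[] {
  const offenders: string[] = [];
  for (const name of Object.keys(fixtures)) {
    for (const file of fixtures[name]) {
      if (isDefaultIgnored(file)) offenders.push(name + ': ' + file);
    }
  }
  return offenders;
}
```

A guard test built on this would assert that `findIgnoredFixtureFiles(allFixtures)` is empty and print the offending entries otherwise, so a fixture that regains a lockfile fails with the file name in the message.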
Builds on the bench harness from #847. Replaces the LCG-noise content with per-language code-shaped generators and adds named scenario fixtures that mirror real commit workflows. Re-captures the baseline so all subsequent #845 PRs measure against realistic input.
Why bother
The v0 LCG fixtures were good for deterministic latency measurement but useless for telling whether an optimization actually translates to real-shaped diffs. The skip-trivial work (PR 2 in the plan) needs realistic shape detection (pure additions vs. modifications vs. renames). The markdown fast-path (PR 6) needs realistic markdown content. The cache work (PR 5) needs realistic content hashing. Without that, "wins" claimed against synthetic noise might disappear on the user's actual diffs.
What's in the PR
Generators (`src/lib/parsers/default/__fixtures__/generators.ts`), one per file type, each seeded for determinism:
- `generateTypeScript`: imports, types, functions, classes, JSDoc
- `generatePython`: imports, defs, classes, decorators, docstrings
- `generateMarkdown`: headers, lists, paragraphs, code blocks, tables
- `generateJson`: nested config object with realistic key names
- `generateYaml`: CI workflow shape
- `generateLockfile`: yarn lock-style entries
- `generateContentForFile`: extension-based dispatcher

Diff-shape wrappers (`diffs.ts`), which produce `git diff` output with the right header + body format for each shape:
- `asAdditionDiff` / `asDeletionDiff`: pure +/- shapes
- `asModificationDiff`: context + remove + add interleaving
- `asRenameDiff`: git rename header (no body)
- `asBinaryDiff`: binary file marker

New scenario fixtures (in addition to the original sized ones):
- `feature-add` (14 files): new module + tests + docs touch
- `refactor` (30 files): rename + ~25 modifications
- `initial-commit` (50 files): mirrors the user's #845 repro ("coco commit pipeline takes ~4 minutes on a 43-file / 77k-token initial commit")
- `docs-update` (9 files): markdown-heavy
- `dep-bump` (3 files): package.json + lockfile + CHANGELOG

Re-captured baseline
Three things the realistic fixtures surface that the LCG hid
1. `dep-bump` pays 27 seconds for a single LLM call, the lockfile pre-summary. Skip-trivial / per-extension fast-path should basically zero this.
2. `refactor` fires 28 LLM calls on 30 files of mixed +/-. The continuous-queue wave consolidation work (PR 4 in the plan) targets exactly this shape.
3. `docs-update` is markdown-heavy with 8 calls in 19s. A markdown-specific shorter prompt could measurably trim this.

Test plan
- `npm run lint`
- `npm run test:jest` (1249 tests pass; 19 new across `generators.test.ts` and `index.test.ts` covering determinism, expected-marker presence, scaling, and shape properties of the rename / dep-bump scenarios)
- `npm run build`
- `npm run test:cli`
- `npm run bench --update` produces stable numbers (re-runs match within ms)

Plan reference
`/Users/gfargo/.claude/plans/polymorphic-wondering-sunbeam.md` (the #845 master plan). PR 1 (raise default token budget) is the next chunk; with realistic fixtures committed it can post both wall-clock and per-scenario behavior diffs in its description.
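For readers unfamiliar with the diff shapes listed under "What's in the PR", here is a minimal sketch of what addition, deletion, and rename wrappers might emit. The function names (`sketchAdditionDiff`, `sketchDeletionDiff`, `sketchRenameDiff`) are hypothetical and are not the wrappers in diffs.ts; the sketch assumes file mode 100644 and omits the optional `index` header line that real `git diff` output carries.

```typescript
// Hypothetical sketch of diff-shape wrappers; names and simplifications are assumptions.

/** Split content into lines, ignoring a single trailing newline. */
function splitLines(content: string): string[] {
  if (content === '') return [];
  return content.replace(/\n$/, '').split('\n');
}

/** Hunk range as git prints it: the count is omitted when it equals 1. */
function hunkRange(start: number, count: number): string {
  return count === 1 ? String(start) : start + ',' + count;
}

/** Pure addition: a new file where every body line is prefixed with '+'. */
function sketchAdditionDiff(path: string, content: string): string {
  const lines = splitLines(content);
  const out = ['diff --git a/' + path + ' b/' + path, 'new file mode 100644'];
  if (lines.length > 0) {
    out.push('--- /dev/null', '+++ b/' + path);
    out.push('@@ -0,0 +' + hunkRange(1, lines.length) + ' @@');
    for (const line of lines) out.push('+' + line);
  }
  return out.join('\n') + '\n';
}

/** Pure deletion: a removed file where every body line is prefixed with '-'. */
function sketchDeletionDiff(path: string, content: string): string {
  const lines = splitLines(content);
  const out = ['diff --git a/' + path + ' b/' + path, 'deleted file mode 100644'];
  if (lines.length > 0) {
    out.push('--- a/' + path, '+++ /dev/null');
    out.push('@@ -' + hunkRange(1, lines.length) + ' +0,0 @@');
    for (const line of lines) out.push('-' + line);
  }
  return out.join('\n') + '\n';
}

/** Pure rename: header only, no hunk, because the content is unchanged. */
function sketchRenameDiff(oldPath: string, newPath: string): string {
  return [
    'diff --git a/' + oldPath + ' b/' + newPath,
    'similarity index 100%',
    'rename from ' + oldPath,
    'rename to ' + newPath,
  ].join('\n') + '\n';
}
```

The rename shape is the interesting one for the skip-trivial work mentioned above: it has no hunk body at all, so a shape detector can recognize it from the header lines alone.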