feat(bench): realistic per-language fixture generators + scenarios (#845) #849

Merged
gfargo merged 1 commit into main from feat/realistic-bench-fixtures-845
May 6, 2026


@gfargo gfargo commented May 6, 2026

Builds on the bench harness from #847. Replaces the LCG-noise content with per-language code-shaped generators and adds named scenario fixtures that mirror real commit workflows. Re-captures the baseline so all subsequent #845 PRs measure against realistic input.

Why bother

The v0 LCG fixtures were fine for deterministic latency measurement but cannot show whether an optimization carries over to realistically shaped diffs. The skip-trivial work (PR 2 in the plan) needs realistic shape detection (pure additions vs. modifications vs. renames). The markdown fast-path (PR 6) needs realistic markdown content. The cache work (PR 5) needs realistic content hashing. Without realistic fixtures, "wins" measured against synthetic noise might disappear on a user's actual diffs.

What's in the PR

Generators (src/lib/parsers/default/__fixtures__/generators.ts) — one per file type, each seeded for determinism:

  • generateTypeScript — imports, types, functions, classes, JSDoc
  • generatePython — imports, defs, classes, decorators, docstrings
  • generateMarkdown — headers, lists, paragraphs, code blocks, tables
  • generateJson — nested config object with realistic key names
  • generateYaml — CI workflow shape
  • generateLockfile — yarn lock-style entries
  • generateContentForFile — extension-based dispatcher
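
As a rough illustration of what "seeded for determinism" plus an "extension-based dispatcher" can look like, here is a minimal sketch. It is not the code in generators.ts: every name (`sketchRng`, `sketchTypeScript`, `sketchMarkdown`, `sketchContentForFile`), the word lists, and the choice of the mulberry32 PRNG are assumptions made for the example.

```typescript
// Illustrative sketch only: every name below is hypothetical, not the PR's code.

/** mulberry32, a small seeded PRNG yielding floats in [0, 1). */
function sketchRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Pick one item using the seeded generator, never Math.random. */
function sketchPick(rand: () => number, items: readonly string[]): string {
  return items[Math.floor(rand() * items.length)] as string;
}

const SKETCH_VERBS = ["load", "save", "validate", "format"];
const SKETCH_NOUNS = ["User", "Order", "Session", "Invoice"];

/** Emit `count` JSDoc-annotated functions whose names depend only on the seed. */
function sketchTypeScript(seed: number, count: number): string {
  const rand = sketchRng(seed);
  const lines: string[] = ["import { join } from 'node:path'", ""];
  for (let i = 0; i < count; i++) {
    const name = `${sketchPick(rand, SKETCH_VERBS)}${sketchPick(rand, SKETCH_NOUNS)}${i}`;
    lines.push(`/** Generated helper ${i}. */`);
    lines.push(`export function ${name}(input: string): string {`);
    lines.push("  return join('fixtures', input)");
    lines.push("}", "");
  }
  return lines.join("\n");
}

/** Emit `count` markdown sections, each with a header and a one-item list. */
function sketchMarkdown(seed: number, count: number): string {
  const rand = sketchRng(seed);
  const lines: string[] = ["# Generated notes", ""];
  for (let i = 0; i < count; i++) {
    lines.push(`## ${sketchPick(rand, SKETCH_NOUNS)} section ${i}`, "");
    lines.push(`- ${sketchPick(rand, SKETCH_VERBS)} step`, "");
  }
  return lines.join("\n");
}

/** Dispatch on the file extension; unknown extensions get placeholder text. */
function sketchContentForFile(path: string, seed: number, count: number): string {
  const dot = path.lastIndexOf(".");
  const ext = dot === -1 ? "" : path.slice(dot + 1).toLowerCase();
  switch (ext) {
    case "ts":
    case "tsx":
      return sketchTypeScript(seed, count);
    case "md":
      return sketchMarkdown(seed, count);
    default:
      return `placeholder content for ${path}\n`;
  }
}
```

Calling `sketchContentForFile("src/a.ts", 42, 3)` twice returns byte-identical strings, which is the property a stable benchmark baseline depends on.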

Diff-shape wrappers (diffs.ts) — produce git diff output with the right header + body format for each shape:

  • asAdditionDiff / asDeletionDiff — pure +/- shapes
  • asModificationDiff — context + remove + add interleaving
  • asRenameDiff — git rename header (no body)
  • asBinaryDiff — binary file marker
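
To make the diff shapes concrete, the following sketch wraps content in `git diff`-style output for the addition, deletion, and rename cases. The function names are hypothetical stand-ins for the wrappers in diffs.ts, and the blob hashes are placeholders, so treat it as a sketch of the format, not the PR's implementation.

```typescript
// Illustrative sketch only: hypothetical helpers, not the wrappers in diffs.ts.
// The index hashes are placeholders; real git output carries actual blob ids.

/** Wrap content as a "new file" diff: every body line is prefixed with '+'. */
function sketchAdditionDiff(path: string, content: string): string {
  const lines = content.split("\n");
  return [
    `diff --git a/${path} b/${path}`,
    "new file mode 100644",
    "index 0000000..1111111",
    "--- /dev/null",
    `+++ b/${path}`,
    `@@ -0,0 +1,${lines.length} @@`,
    ...lines.map((line) => `+${line}`),
  ].join("\n");
}

/** Wrap content as a "deleted file" diff: every body line is prefixed with '-'. */
function sketchDeletionDiff(path: string, content: string): string {
  const lines = content.split("\n");
  return [
    `diff --git a/${path} b/${path}`,
    "deleted file mode 100644",
    "index 1111111..0000000",
    `--- a/${path}`,
    "+++ /dev/null",
    `@@ -1,${lines.length} +0,0 @@`,
    ...lines.map((line) => `-${line}`),
  ].join("\n");
}

/** A pure rename: header lines only, with no hunks, since the content is unchanged. */
function sketchRenameDiff(oldPath: string, newPath: string): string {
  return [
    `diff --git a/${oldPath} b/${newPath}`,
    "similarity index 100%",
    `rename from ${oldPath}`,
    `rename to ${newPath}`,
  ].join("\n");
}
```

The rename case is the interesting one for shape detection: it has a header but no hunk body, so a pipeline can recognize it without reading any content.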

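New scenario fixtures (in addition to the original sized ones):

  • feature-add (14 files): new module + tests + docs touch
  • refactor (30 files): rename + ~25 modifications
  • initial-commit (50 files): same shape as the user's #845 repro
  • docs-update (9 files): markdown-heavy
  • dep-bump (3 files): package.json + lockfile + CHANGELOG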

Re-captured baseline

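| fixture        | wall-clock | llm calls | llm total ms | prompt tokens |
|----------------|-----------:|----------:|-------------:|--------------:|
| tiny           |       2 ms |         0 |         0 ms |             0 |
| medium         |  31,124 ms |        20 |   106,333 ms |        34,237 |
| large          |  72,151 ms |        41 |   244,112 ms |        74,197 |
| feature-add    |  15,967 ms |        11 |    54,726 ms |        18,937 |
| refactor       |  33,994 ms |        28 |   153,871 ms |        52,430 |
| initial-commit |  72,291 ms |        41 |   245,148 ms |        74,546 |
| docs-update    |  18,563 ms |         8 |    56,293 ms |        13,908 |
| dep-bump       |  27,158 ms |         1 |    27,141 ms |        19,597 |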

Three things the realistic fixtures surface that the LCG hid

  1. dep-bump pays 27 seconds for a single LLM call — the lockfile pre-summary. Skip-trivial / per-extension fast-path should basically zero this.
  2. refactor fires 28 LLM calls on 30 files of mixed +/-. The continuous-queue wave consolidation work (PR 4 in the plan) targets exactly this shape.
  3. docs-update is markdown-heavy with 8 calls in 19s. A markdown-specific shorter prompt could measurably trim this.
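
The observations above can be cross-checked with two derived numbers per fixture: the average LLM time per call, and the ratio of summed LLM time to wall-clock time. The sketch below copies four rows from the baseline table; the type and function names are hypothetical, and reading a ratio above 1 as "calls overlapped" assumes that "llm total ms" is the sum of individual call durations.

```typescript
// Illustrative sketch: figures copied from the baseline table; names are hypothetical.
interface SketchBaselineRow {
  fixture: string;
  wallClockMs: number;
  llmCalls: number;
  llmTotalMs: number;
}

const SKETCH_BASELINE: SketchBaselineRow[] = [
  { fixture: "medium", wallClockMs: 31124, llmCalls: 20, llmTotalMs: 106333 },
  { fixture: "refactor", wallClockMs: 33994, llmCalls: 28, llmTotalMs: 153871 },
  { fixture: "docs-update", wallClockMs: 18563, llmCalls: 8, llmTotalMs: 56293 },
  { fixture: "dep-bump", wallClockMs: 27158, llmCalls: 1, llmTotalMs: 27141 },
];

/** Average LLM time per call; 0 when the fixture made no calls. */
function sketchAvgMsPerCall(row: SketchBaselineRow): number {
  return row.llmCalls === 0 ? 0 : row.llmTotalMs / row.llmCalls;
}

/** Summed LLM time divided by wall-clock time; values above 1 imply overlapping calls. */
function sketchOverlapRatio(row: SketchBaselineRow): number {
  return row.wallClockMs === 0 ? 0 : row.llmTotalMs / row.wallClockMs;
}

const sketchSummary = SKETCH_BASELINE.map((row) => ({
  fixture: row.fixture,
  avgMsPerCall: Math.round(sketchAvgMsPerCall(row)),
  overlapRatio: Number(sketchOverlapRatio(row).toFixed(2)),
}));
```

For dep-bump the ratio sits just under 1, consistent with a single call accounting for nearly all of the wall-clock time, while the multi-call fixtures land well above 1.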

Test plan

  • npm run lint
  • npm run test:jest (1249 tests pass — 19 new across generators.test.ts and index.test.ts covering determinism, expected-marker presence, scaling, and shape properties of the rename / dep-bump scenarios)
  • npm run build
  • npm run test:cli
  • npm run bench --update produces stable numbers (re-runs match within ms)

Plan reference

/Users/gfargo/.claude/plans/polymorphic-wondering-sunbeam.md (the #845 master plan). PR 1 (raise default token budget) is the next chunk; with realistic fixtures committed it can post both wall-clock and per-scenario behavior diffs in its description.


The v0 fixtures from #847 used a seeded LCG to generate noise.
Good for deterministic latency measurement, useless for telling
whether an optimization translates to real-shaped diffs. This PR
swaps that out for code-shaped content per file type and adds
named scenarios that mirror real commit workflows.

Generators (src/lib/parsers/default/__fixtures__/generators.ts):
  - generateTypeScript — imports, types, functions, classes, JSDoc
  - generatePython — imports, defs, classes, decorators, docstrings
  - generateMarkdown — headers, lists, paragraphs, code blocks, tables
  - generateJson — nested config object with realistic key names
  - generateYaml — CI workflow shape
  - generateLockfile — yarn lock-style entries
  - generateContentForFile — extension-based dispatcher

Diff-shape wrappers (diffs.ts):
  - asAdditionDiff / asDeletionDiff — pure +/- shapes
  - asModificationDiff — context + remove + add interleaving
  - asRenameDiff — git rename header (no body)
  - asBinaryDiff — binary file marker

Scenarios in addition to the original tiny/medium/large:
  - feature-add (14 files)  — new module + tests + docs touch
  - refactor (30 files)     — rename + ~25 modifications
  - initial-commit (50)     — same shape as user's #845 repro
  - docs-update (9)         — markdown-heavy
  - dep-bump (3)            — package.json + lockfile + CHANGELOG

Re-captured baseline (committed at .bench/baseline.json):

| fixture        | wall-clock | calls | llm total ms | prompt tokens |
|----------------|-----------:|------:|-------------:|--------------:|
| tiny           |       2 ms |     0 |         0 ms |             0 |
| medium         |  31,124 ms |    20 |   106,333 ms |        34,237 |
| large          |  72,151 ms |    41 |   244,112 ms |        74,197 |
| feature-add    |  15,967 ms |    11 |    54,726 ms |        18,937 |
| refactor       |  33,994 ms |    28 |   153,871 ms |        52,430 |
| initial-commit |  72,291 ms |    41 |   245,148 ms |        74,546 |
| docs-update    |  18,563 ms |     8 |    56,293 ms |        13,908 |
| dep-bump       |  27,158 ms |     1 |    27,141 ms |        19,597 |

Three observations the realistic fixtures surface that the LCG
fixtures hid:

1. dep-bump pays 27s for one LLM call — the lockfile pre-summary.
   Skip-trivial / per-extension fast-path should basically zero this.
2. refactor (30 files of mixed +/-) fires 28 LLM calls. The
   continuous-queue wave consolidation work (PR 4) targets exactly
   this shape.
3. docs-update is markdown-heavy with 8 calls in 19s. A markdown-
   specific shorter prompt could measurably trim this.

Tests: 14 new generator tests + 5 new fixture-level tests covering
determinism, expected-marker presence, scaling behavior, and shape
properties of the rename / dep-bump scenarios.
@gfargo gfargo merged commit ff491f9 into main May 6, 2026
9 checks passed
gfargo added a commit that referenced this pull request May 6, 2026
#845) (#850)

The dep-bump fixture from #849 included `yarn.lock` and reported
27 seconds of LLM work for it. That's a bench-fixture artifact,
not a real-world cost. Lockfiles live in `DEFAULT_IGNORED_FILES`
and the `.lock` extension lives in `DEFAULT_IGNORED_EXTENSIONS`
(see `src/lib/config/constants.ts`), so `getChanges` strips them
before the diff-condensing pipeline ever sees them on a real
`coco commit`.

- Drop `yarn.lock` from DEP_BUMP_FILES; the realistic shape is
  just `package.json` + `CHANGELOG.md`.
- Update the dep-bump-shape test to assert the post-filter
  invariant (no lockfiles in the fixture) instead of asserting a
  lockfile is present.
- Add a guard test that fails loudly if any future fixture
  accidentally drifts back into including a default-ignored file.
- Re-baseline. dep-bump now reports 0 ms / 0 LLM calls (early
  exit, total budget already under threshold), reflecting what a
  real dep bump costs the pipeline.

The "lockfile fast-path" optimization angle from the original
plan is dropped — the existing ignore filter already handles
that, and any pipeline-level skip would be redundant.
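
A guard of the kind described above could be sketched as follows. The two constants are stand-ins for `DEFAULT_IGNORED_FILES` and `DEFAULT_IGNORED_EXTENSIONS`, whose real contents live in `src/lib/config/constants.ts` and may differ. Matching on the basename is also an assumption and not necessarily how `getChanges` filters.

```typescript
// Illustrative sketch only: stand-in lists and a hypothetical helper.
// The real DEFAULT_IGNORED_FILES / DEFAULT_IGNORED_EXTENSIONS may contain other entries.
const SKETCH_IGNORED_FILES: readonly string[] = ["yarn.lock", "package-lock.json"];
const SKETCH_IGNORED_EXTENSIONS: readonly string[] = [".lock"];

/** Return every fixture path that a default ignore filter would have stripped. */
function sketchFindIgnoredFixtureFiles(
  paths: readonly string[],
  ignoredFiles: readonly string[] = SKETCH_IGNORED_FILES,
  ignoredExtensions: readonly string[] = SKETCH_IGNORED_EXTENSIONS,
): string[] {
  return paths.filter((path) => {
    // Assumption: compare against the basename, not the full path.
    const base = path.slice(path.lastIndexOf("/") + 1);
    return ignoredFiles.includes(base) || ignoredExtensions.some((ext) => base.endsWith(ext));
  });
}

/** Throw with the offending paths so a drifting fixture fails loudly. */
function sketchAssertNoIgnoredFiles(fixtureName: string, paths: readonly string[]): void {
  const offenders = sketchFindIgnoredFixtureFiles(paths);
  if (offenders.length > 0) {
    throw new Error(`${fixtureName} includes default-ignored files: ${offenders.join(", ")}`);
  }
}
```

Run against the original dep-bump shape (`package.json`, `yarn.lock`, `CHANGELOG.md`), the helper flags `yarn.lock`; the trimmed two-file shape passes.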