feat(bench): measurement infrastructure for the diff-condensing pipeline (#845) · #847
Merged
Conversation
First chunk of the #845 perf overhaul: a reproducible benchmark harness so every later PR can show concrete before/after numbers instead of hand-waving about "should be faster".

Three pieces:

1. **Telemetry persistence in `observability.ts`.** When `COCO_BENCH=1` is set (or any non-`0` value), every LLM call accumulates into a narrow `LlmBenchCall` buffer; `flushLlmBenchRun` writes the record to `<cwd>/.coco-bench.json` (overridable via `COCO_BENCH_FILE`). Best-effort: write failures are silent and the buffer self-clears after each flush.
2. **Synthetic diff fixtures at `src/lib/parsers/default/__fixtures__/`.** Three sizes:
   - tiny (5 files, ~790 tokens) — early-exit path
   - medium (25 files, ~36k tokens) — typical commit
   - large (50 files, ~83k tokens) — initial-commit shape

   Content comes from a seeded LCG so before/after runs compare the same input. Each fixture exports a fully-populated `DiffNode` tree so `summarizeDiffs` runs without a real git repo.
3. **`bin/benchmark.ts` runner (`npm run bench`).** Plugs the fixtures into `summarizeDiffs` with a duck-typed mock chain that simulates per-call latency proportional to input size (deterministic so PR diffs are apples-to-apples, not real-world wall-clock). Captures stage timings plus per-call telemetry. `--update` overwrites `.bench/baseline.json`; `--fixture=<name>` narrows to a single fixture for tighter feedback loops.

Baseline numbers committed at `.bench/baseline.json` against current `main`:

| fixture | wall-clock | llm calls | llm total ms | prompt tokens |
|---------|-----------:|----------:|-------------:|--------------:|
| tiny    | 2 ms       | 0  | 0 ms       | 0       |
| medium  | 30,213 ms  | 20 | 102,723 ms | 91,766  |
| large   | 70,048 ms  | 41 | 236,818 ms | 220,199 |

The 3.4× spread between the large fixture's wall-clock and total LLM time (236 s of model work in 70 s wall) reflects the existing `maxConcurrent=6` parallelism. Subsequent PRs in the #845 sprint will move these numbers, and the deltas will land directly in PR descriptions.
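For illustration, the seeded-LCG fixture content can be sketched like this. This is a minimal sketch, not the repo's actual implementation: the class name, constants (classic Numerical Recipes parameters), and `noiseLine` helper are assumptions; only the "seeded LCG → identical input on every run" idea comes from the PR.

```typescript
// Hypothetical sketch of a seeded linear congruential generator (LCG)
// for deterministic fixture content. Same seed -> same byte stream,
// so before/after benchmark runs compare identical input.
class Lcg {
  private state: number;

  constructor(seed: number) {
    this.state = seed >>> 0; // coerce to unsigned 32-bit
  }

  /** Next value in [0, 1). */
  next(): number {
    // 32-bit LCG step: state = (a * state + c) mod 2^32
    this.state = (Math.imul(1664525, this.state) + 1013904223) >>> 0;
    return this.state / 0x100000000;
  }

  /** Integer in [0, max). */
  nextInt(max: number): number {
    return Math.floor(this.next() * max);
  }
}

// Deterministic "noise" line for a fixture file (hypothetical helper).
function noiseLine(rng: Lcg, width = 8): string {
  const words = ["alpha", "beta", "gamma", "delta", "omega"];
  return Array.from({ length: width }, () => words[rng.nextInt(words.length)]).join(" ");
}
```

Determinism is the whole point: re-seeding with the same value replays the exact sequence, which is what makes the committed baseline comparable across PRs.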
gfargo added a commit that referenced this pull request on May 6, 2026:
(#849) The v0 fixtures from #847 used a seeded LCG to generate noise: good for deterministic latency measurement, useless for telling whether an optimization translates to real-shaped diffs. This PR swaps that out for code-shaped content per file type and adds named scenarios that mirror real commit workflows.

Generators (`src/lib/parsers/default/__fixtures__/generators.ts`):

- `generateTypeScript` — imports, types, functions, classes, JSDoc
- `generatePython` — imports, defs, classes, decorators, docstrings
- `generateMarkdown` — headers, lists, paragraphs, code blocks, tables
- `generateJson` — nested config object with realistic key names
- `generateYaml` — CI workflow shape
- `generateLockfile` — yarn lock-style entries
- `generateContentForFile` — extension-based dispatcher

Diff-shape wrappers (`diffs.ts`):

- `asAdditionDiff` / `asDeletionDiff` — pure +/- shapes
- `asModificationDiff` — context + remove + add interleaving
- `asRenameDiff` — git rename header (no body)
- `asBinaryDiff` — binary file marker

Scenarios in addition to the original tiny/medium/large:

- feature-add (14 files) — new module + tests + docs touch
- refactor (30 files) — rename + ~25 modifications
- initial-commit (50 files) — same shape as the user's #845 repro
- docs-update (9 files) — markdown-heavy
- dep-bump (3 files) — package.json + lockfile + CHANGELOG

Re-captured baseline (committed at `.bench/baseline.json`):

| fixture | wall-clock | calls | llm total ms | prompt tokens |
|----------------|-----------:|------:|-------------:|--------------:|
| tiny | 2 ms | 0 | 0 ms | 0 |
| medium | 31,124 ms | 20 | 106,333 ms | 34,237 |
| large | 72,151 ms | 41 | 244,112 ms | 74,197 |
| feature-add | 15,967 ms | 11 | 54,726 ms | 18,937 |
| refactor | 33,994 ms | 28 | 153,871 ms | 52,430 |
| initial-commit | 72,291 ms | 41 | 245,148 ms | 74,546 |
| docs-update | 18,563 ms | 8 | 56,293 ms | 13,908 |
| dep-bump | 27,158 ms | 1 | 27,141 ms | 19,597 |

Three observations the realistic fixtures surface that the LCG fixtures hid:

1. dep-bump pays 27 s for one LLM call — the lockfile pre-summary. A skip-trivial / per-extension fast path should basically zero this.
2. refactor (30 files of mixed +/-) fires 28 LLM calls. The continuous-queue wave-consolidation work (PR 4) targets exactly this shape.
3. docs-update is markdown-heavy with 8 calls in 19 s. A markdown-specific shorter prompt could measurably trim this.

Tests: 14 new generator tests + 5 new fixture-level tests covering determinism, expected-marker presence, scaling behavior, and shape properties of the rename / dep-bump scenarios.
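The extension-based dispatch pattern above can be sketched as follows. Only the function names (`generateTypeScript`, `generateMarkdown`, `generateJson`, `generateContentForFile`) come from the PR; the generator bodies here are illustrative stubs, and the lookup-table shape is an assumption.

```typescript
// Hypothetical sketch of per-file-type content generation with an
// extension-based dispatcher. Real generators would thread a seeded RNG
// through; these stubs only show the dispatch mechanics.
type Generator = (rng: () => number, lines: number) => string;

const generateTypeScript: Generator = (_rng, lines) =>
  Array.from({ length: lines }, (_, i) => `export const v${i} = ${i};`).join("\n");

const generateMarkdown: Generator = (_rng, lines) =>
  Array.from({ length: lines }, (_, i) => `- item ${i}`).join("\n");

const generateJson: Generator = (_rng, lines) =>
  JSON.stringify({ entries: Array.from({ length: lines }, (_, i) => `k${i}`) }, null, 2);

// Extension -> generator lookup table (assumed shape).
const byExtension: Record<string, Generator> = {
  ".ts": generateTypeScript,
  ".md": generateMarkdown,
  ".json": generateJson,
};

// Dispatch on the file extension, falling back to markdown-ish prose
// for extensions without a dedicated generator.
function generateContentForFile(path: string, rng: () => number, lines = 10): string {
  const ext = path.slice(path.lastIndexOf("."));
  return (byExtension[ext] ?? generateMarkdown)(rng, lines);
}
```

The payoff over LCG noise is that each file in a fixture now has the token distribution of its real file type, so per-extension optimizations (like the lockfile fast path in observation 1) show up in the numbers.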
First chunk of the #845 perf overhaul. None of the perf claims in subsequent PRs are credible without a reproducible baseline; this PR is purely measurement scaffolding (no behavior changes).
**Three pieces**

1. **Telemetry persistence in `observability.ts`.** When `COCO_BENCH=1` is set (or any non-`0` value), every LLM call accumulates into a narrow `LlmBenchCall` buffer; `flushLlmBenchRun` writes the record to `<cwd>/.coco-bench.json` (overridable via `COCO_BENCH_FILE`). Best-effort: write failures are silent and the buffer self-clears after each flush. Production runs (no env var set) pay zero overhead.
2. **Synthetic diff fixtures at `src/lib/parsers/default/__fixtures__/`.** Three sizes:
   - tiny (5 files, ~790 tokens) — early-exit path
   - medium (25 files, ~36k tokens) — typical commit
   - large (50 files, ~83k tokens) — initial-commit shape (mirrors the #845 repro: "coco commit pipeline takes ~4 minutes on a 43-file / 77k-token initial commit")

   Content comes from a seeded LCG so before/after runs compare the same input. Each fixture exports a fully-populated `DiffNode` tree so `summarizeDiffs` runs without a real git repo.
3. **`bin/benchmark.ts` runner.** Plugs the fixtures into `summarizeDiffs` with a duck-typed mock chain that simulates per-call latency proportional to input size. Latency is deterministic, not realistic — the goal is apples-to-apples PR diffs, not wall-clock predictions for real-world runs. `--update` overwrites `.bench/baseline.json`; `--fixture=<name>` narrows to a single fixture for tighter feedback loops.

**Baseline** (committed at `.bench/baseline.json`):

| fixture | wall-clock | llm calls | llm total ms | prompt tokens |
|---------|-----------:|----------:|-------------:|--------------:|
| tiny    | 2 ms       | 0  | 0 ms       | 0       |
| medium  | 30,213 ms  | 20 | 102,723 ms | 91,766  |
| large   | 70,048 ms  | 41 | 236,818 ms | 220,199 |

The 3.4× spread between the large fixture's wall-clock (70 s) and total LLM time (236 s) reflects the existing `maxConcurrent=6` parallelism. PRs in the rest of the #845 sprint will move these numbers and post the deltas in their descriptions.

**Test plan**

- `npm run lint`
- `npm run test:jest` (1199 tests pass — no behavior change to assert)
- `npm run build`
- `npm run test:cli`
- `npm run bench --update` produces stable numbers (a re-run gives identical wall-clock within a few ms thanks to deterministic mock latency)

**Plan reference**

See `/Users/gfargo/.claude/plans/polymorphic-wondering-sunbeam.md` for the full phased plan. PR 1 (raise default token budget) is the next chunk; subsequent PRs land each optimization with an `npm run bench` diff in their description.