feat(bench): measurement infrastructure for the diff-condensing pipeline (#845) · #847
Merged
Conversation
First chunk of the #845 perf overhaul: a reproducible benchmark harness so every later PR can show concrete before/after numbers instead of hand-waving about "should be faster".

Three pieces:

1. **Telemetry persistence in `observability.ts`.** When `COCO_BENCH=1` is set (or any non-`0` value), every LLM call accumulates into a narrow `LlmBenchCall` buffer; `flushLlmBenchRun` writes the record to `<cwd>/.coco-bench.json` (overridable via `COCO_BENCH_FILE`). Best-effort: write failures are silent and the buffer self-clears after each flush.
2. **Synthetic diff fixtures at `src/lib/parsers/default/__fixtures__/`.** Three sizes:
   - tiny (5 files, ~790 tokens) — early-exit path
   - medium (25 files, ~36k tokens) — typical commit
   - large (50 files, ~83k tokens) — initial-commit shape

   Content comes from a seeded LCG so before/after runs compare the same input. Each fixture exports a fully-populated `DiffNode` tree so `summarizeDiffs` runs without a real git repo.
3. **`bin/benchmark.ts` runner (`npm run bench`).** Plugs the fixtures into `summarizeDiffs` with a duck-typed mock chain that simulates per-call latency proportional to input size (deterministic so PR diffs are apples-to-apples, not real-world wall-clock). Captures stage timings plus per-call telemetry. `--update` overwrites `.bench/baseline.json`; `--fixture=<name>` narrows to a single fixture for tighter feedback loops.

Baseline numbers committed at `.bench/baseline.json` against current `main`:

| fixture | wall-clock | llm calls | llm total ms | prompt tokens |
|---------|-----------:|----------:|-------------:|--------------:|
| tiny    | 2 ms       | 0  | 0 ms       | 0       |
| medium  | 30,213 ms  | 20 | 102,723 ms | 91,766  |
| large   | 70,048 ms  | 41 | 236,818 ms | 220,199 |

The 3.4× spread between the large fixture's wall-clock and total LLM time (236 s of model work in 70 s wall) reflects the existing `maxConcurrent=6` parallelism. Subsequent PRs in the #845 sprint will move these numbers, and the deltas will land directly in PR descriptions.
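For illustration, the seeded-LCG fixture content can be sketched like this. This is a minimal sketch, not the repo's actual implementation: the class name, constants (classic Numerical Recipes parameters), and `noiseLine` helper are assumptions; only the "seeded LCG → identical input on every run" idea comes from the PR.

```typescript
// Hypothetical sketch of a seeded linear congruential generator (LCG)
// for deterministic fixture content. Same seed -> same byte stream,
// so before/after benchmark runs compare identical input.
class Lcg {
  private state: number;

  constructor(seed: number) {
    this.state = seed >>> 0; // coerce to unsigned 32-bit
  }

  /** Next value in [0, 1). */
  next(): number {
    // 32-bit LCG step: state = (a * state + c) mod 2^32
    this.state = (Math.imul(1664525, this.state) + 1013904223) >>> 0;
    return this.state / 0x100000000;
  }

  /** Integer in [0, max). */
  nextInt(max: number): number {
    return Math.floor(this.next() * max);
  }
}

// Deterministic "noise" line for a fixture file (hypothetical helper).
function noiseLine(rng: Lcg, width = 8): string {
  const words = ["alpha", "beta", "gamma", "delta", "omega"];
  return Array.from({ length: width }, () => words[rng.nextInt(words.length)]).join(" ");
}
```

Determinism is the whole point: re-seeding with the same value replays the exact sequence, which is what makes the committed baseline comparable across PRs.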
gfargo added a commit that referenced this pull request on May 6, 2026:
(#849) The v0 fixtures from #847 used a seeded LCG to generate noise: good for deterministic latency measurement, useless for telling whether an optimization translates to real-shaped diffs. This PR swaps that out for code-shaped content per file type and adds named scenarios that mirror real commit workflows.

Generators (`src/lib/parsers/default/__fixtures__/generators.ts`):

- `generateTypeScript` — imports, types, functions, classes, JSDoc
- `generatePython` — imports, defs, classes, decorators, docstrings
- `generateMarkdown` — headers, lists, paragraphs, code blocks, tables
- `generateJson` — nested config object with realistic key names
- `generateYaml` — CI workflow shape
- `generateLockfile` — yarn lock-style entries
- `generateContentForFile` — extension-based dispatcher

Diff-shape wrappers (`diffs.ts`):

- `asAdditionDiff` / `asDeletionDiff` — pure +/- shapes
- `asModificationDiff` — context + remove + add interleaving
- `asRenameDiff` — git rename header (no body)
- `asBinaryDiff` — binary file marker

Scenarios in addition to the original tiny/medium/large:

- feature-add (14 files) — new module + tests + docs touch
- refactor (30 files) — rename + ~25 modifications
- initial-commit (50 files) — same shape as the user's #845 repro
- docs-update (9 files) — markdown-heavy
- dep-bump (3 files) — package.json + lockfile + CHANGELOG

Re-captured baseline (committed at `.bench/baseline.json`):

| fixture | wall-clock | calls | llm total ms | prompt tokens |
|----------------|-----------:|------:|-------------:|--------------:|
| tiny | 2 ms | 0 | 0 ms | 0 |
| medium | 31,124 ms | 20 | 106,333 ms | 34,237 |
| large | 72,151 ms | 41 | 244,112 ms | 74,197 |
| feature-add | 15,967 ms | 11 | 54,726 ms | 18,937 |
| refactor | 33,994 ms | 28 | 153,871 ms | 52,430 |
| initial-commit | 72,291 ms | 41 | 245,148 ms | 74,546 |
| docs-update | 18,563 ms | 8 | 56,293 ms | 13,908 |
| dep-bump | 27,158 ms | 1 | 27,141 ms | 19,597 |

Three observations the realistic fixtures surface that the LCG fixtures hid:

1. dep-bump pays 27 s for one LLM call — the lockfile pre-summary. A skip-trivial / per-extension fast path should basically zero this.
2. refactor (30 files of mixed +/-) fires 28 LLM calls. The continuous-queue wave-consolidation work (PR 4) targets exactly this shape.
3. docs-update is markdown-heavy with 8 calls in 19 s. A markdown-specific shorter prompt could measurably trim this.

Tests: 14 new generator tests + 5 new fixture-level tests covering determinism, expected-marker presence, scaling behavior, and shape properties of the rename / dep-bump scenarios.
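The extension-based dispatch pattern above can be sketched as follows. Only the function names (`generateTypeScript`, `generateMarkdown`, `generateJson`, `generateContentForFile`) come from the PR; the generator bodies here are illustrative stubs, and the lookup-table shape is an assumption.

```typescript
// Hypothetical sketch of per-file-type content generation with an
// extension-based dispatcher. Real generators would thread a seeded RNG
// through; these stubs only show the dispatch mechanics.
type Generator = (rng: () => number, lines: number) => string;

const generateTypeScript: Generator = (_rng, lines) =>
  Array.from({ length: lines }, (_, i) => `export const v${i} = ${i};`).join("\n");

const generateMarkdown: Generator = (_rng, lines) =>
  Array.from({ length: lines }, (_, i) => `- item ${i}`).join("\n");

const generateJson: Generator = (_rng, lines) =>
  JSON.stringify({ entries: Array.from({ length: lines }, (_, i) => `k${i}`) }, null, 2);

// Extension -> generator lookup table (assumed shape).
const byExtension: Record<string, Generator> = {
  ".ts": generateTypeScript,
  ".md": generateMarkdown,
  ".json": generateJson,
};

// Dispatch on the file extension, falling back to markdown-ish prose
// for extensions without a dedicated generator.
function generateContentForFile(path: string, rng: () => number, lines = 10): string {
  const ext = path.slice(path.lastIndexOf("."));
  return (byExtension[ext] ?? generateMarkdown)(rng, lines);
}
```

The payoff over LCG noise is that each file in a fixture now has the token distribution of its real file type, so per-extension optimizations (like the lockfile fast path in observation 1) show up in the numbers.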
First chunk of the #845 perf overhaul. None of the perf claims in subsequent PRs are credible without a reproducible baseline; this PR is purely measurement scaffolding (no behavior changes).
**Three pieces**

1. **Telemetry persistence in `observability.ts`.** When `COCO_BENCH=1` is set (or any non-`0` value), every LLM call accumulates into a narrow `LlmBenchCall` buffer; `flushLlmBenchRun` writes the record to `<cwd>/.coco-bench.json` (overridable via `COCO_BENCH_FILE`). Best-effort: write failures are silent and the buffer self-clears after each flush. Production runs (no env var set) pay zero overhead.
2. **Synthetic diff fixtures at `src/lib/parsers/default/__fixtures__/`.** Three sizes:
   - tiny (5 files, ~790 tokens) — early-exit path
   - medium (25 files, ~36k tokens) — typical commit
   - large (50 files, ~83k tokens) — initial-commit shape (mirrors the #845 repro: "coco commit pipeline takes ~4 minutes on a 43-file / 77k-token initial commit")

   Content comes from a seeded LCG so before/after runs compare the same input. Each fixture exports a fully-populated `DiffNode` tree so `summarizeDiffs` runs without a real git repo.
3. **`bin/benchmark.ts` runner.** Plugs the fixtures into `summarizeDiffs` with a duck-typed mock chain that simulates per-call latency proportional to input size. Latency is deterministic, not realistic — the goal is apples-to-apples PR diffs, not wall-clock predictions for real-world runs. `--update` overwrites `.bench/baseline.json`; `--fixture=<name>` narrows to a single fixture for tighter feedback loops.

**Baseline** (committed at `.bench/baseline.json`):

| fixture | wall-clock | llm calls | llm total ms | prompt tokens |
|---------|-----------:|----------:|-------------:|--------------:|
| tiny    | 2 ms       | 0  | 0 ms       | 0       |
| medium  | 30,213 ms  | 20 | 102,723 ms | 91,766  |
| large   | 70,048 ms  | 41 | 236,818 ms | 220,199 |

The 3.4× spread between the large fixture's wall-clock (70 s) and total LLM time (236 s) reflects the existing `maxConcurrent=6` parallelism. PRs in the rest of the #845 sprint will move these numbers and post the deltas in their descriptions.

**Test plan**

- `npm run lint`
- `npm run test:jest` (1199 tests pass — no behavior change to assert)
- `npm run build`
- `npm run test:cli`
- `npm run bench --update` produces stable numbers (a re-run gives identical wall-clock within a few ms thanks to deterministic mock latency)

**Plan reference**

See `/Users/gfargo/.claude/plans/polymorphic-wondering-sunbeam.md` for the full phased plan. PR 1 (raise default token budget) is the next chunk; subsequent PRs land each optimization with an `npm run bench` diff in their description.