
feat(bench): measurement infrastructure for the diff-condensing pipeline (#845)#847

Merged
gfargo merged 1 commit into main from
feat/diff-pipeline-bench-845
May 6, 2026
Conversation


@gfargo gfargo commented May 5, 2026

First chunk of the #845 perf overhaul. None of the perf claims in subsequent PRs are credible without a reproducible baseline; this PR is purely measurement scaffolding (no behavior changes).

Three pieces:

  1. Telemetry persistence in observability.ts. When COCO_BENCH=1 is set (or any non-0 value), every LLM call accumulates into a narrow LlmBenchCall buffer; flushLlmBenchRun writes the record to <cwd>/.coco-bench.json (overridable via COCO_BENCH_FILE). Best-effort: write failures are silent and the buffer self-clears after each flush. Production runs (no env var set) pay zero overhead.
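The buffer-and-flush pattern described above can be sketched as follows. Only the env-var names (`COCO_BENCH`, `COCO_BENCH_FILE`), the output path, and the identifiers `LlmBenchCall` / `flushLlmBenchRun` come from the PR; the field names and record shape are illustrative assumptions, not the actual `observability.ts` implementation.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical shape of one buffered record; the real LlmBenchCall
// fields are not shown in the PR description.
interface LlmBenchCall {
  stage: string;
  durationMs: number;
  promptTokens: number;
}

const buffer: LlmBenchCall[] = [];

// Enabled when COCO_BENCH is set to any non-"0" value; production runs
// (env var unset) skip all bookkeeping.
function benchEnabled(): boolean {
  const v = process.env.COCO_BENCH;
  return v !== undefined && v !== "0";
}

function recordLlmCall(call: LlmBenchCall): void {
  if (!benchEnabled()) return;
  buffer.push(call);
}

// Best-effort flush: silent on write failure, buffer self-clears either way.
function flushLlmBenchRun(): void {
  if (!benchEnabled() || buffer.length === 0) return;
  const file =
    process.env.COCO_BENCH_FILE ?? path.join(process.cwd(), ".coco-bench.json");
  try {
    fs.writeFileSync(file, JSON.stringify({ calls: buffer }, null, 2));
  } catch {
    // best-effort: ignore write failures
  }
  buffer.length = 0;
}
```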

  2. Synthetic diff fixtures at src/lib/parsers/default/__fixtures__/. Three sizes:

      • tiny (5 files, ~790 tokens) — early-exit path
      • medium (25 files, ~36k tokens) — typical commit
      • large (50 files, ~83k tokens) — initial-commit shape

    Content comes from a seeded LCG so before/after runs compare the same input. Each fixture exports a fully-populated DiffNode tree so summarizeDiffs runs without a real git repo.
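A seeded LCG is the minimal tool for this job: any fixed-seed PRNG gives the property the fixtures need, namely identical content on every run. A sketch (constants are the common Numerical Recipes pair; the fixtures' actual generator is not shown in the PR):

```typescript
// Linear congruential generator with a fixed seed: same seed, same
// sequence, so fixture content is byte-identical across benchmark runs.
function makeLcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    // 32-bit LCG step (Numerical Recipes constants, assumed here).
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 2 ** 32; // uniform in [0, 1)
  };
}
```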

  3. bin/benchmark.ts runner. Plugs the fixtures into summarizeDiffs with a duck-typed mock chain that simulates per-call latency proportional to input size. Latency is deterministic, not realistic — the goal is apples-to-apples PR diffs, not wall-clock predictions for real-world runs. --update overwrites .bench/baseline.json; --fixture=<name> narrows to a single fixture for tighter feedback loops.
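The "duck-typed mock chain" idea can be sketched like this. `makeMockChain` and its `msPerKChar` rate are hypothetical names for illustration; the point is that latency is a pure function of prompt size, so two runs over the same fixtures produce identical per-call timings.

```typescript
interface MockCallRecord {
  promptChars: number;
  latencyMs: number;
}

// Duck-typed stand-in for the real LLM chain: anything with an invoke()
// method. Latency is proportional to input size and fully deterministic.
function makeMockChain(msPerKChar = 250) {
  const calls: MockCallRecord[] = [];
  return {
    calls,
    async invoke(prompt: string): Promise<string> {
      const latencyMs = Math.round((prompt.length / 1000) * msPerKChar);
      calls.push({ promptChars: prompt.length, latencyMs });
      await new Promise((resolve) => setTimeout(resolve, latencyMs));
      return `summary(${prompt.length} chars)`;
    },
  };
}
```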

Baseline (committed at .bench/baseline.json)

| fixture | wall-clock | llm calls | llm total ms | prompt tokens |
|---------|-----------:|----------:|-------------:|--------------:|
| tiny    |       2 ms |         0 |         0 ms |             0 |
| medium  |  30,213 ms |        20 |   102,723 ms |        91,766 |
| large   |  70,048 ms |        41 |   236,818 ms |       220,199 |

The 3.4× spread between the large fixture's wall-clock (70 s) and total LLM time (236 s) reflects the existing maxConcurrent=6 parallelism. PRs in the rest of the #845 sprint will move these numbers and post the deltas in their descriptions.
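The arithmetic behind that spread, using the committed baseline numbers for the large fixture:

```typescript
// Effective parallelism = total simulated LLM time / wall-clock time.
// The ceiling is maxConcurrent = 6; the measured ratio lands well below it,
// i.e. the six slots are not kept fully busy for the whole run.
const llmTotalMs = 236_818;
const wallClockMs = 70_048;
const effectiveParallelism = llmTotalMs / wallClockMs; // ~3.4
```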

Test plan

  • npm run lint
  • npm run test:jest (1199 tests pass — no behavior change to assert)
  • npm run build
  • npm run test:cli
  • npm run bench --update produces stable numbers (re-run gives identical wall-clock within a few ms thanks to deterministic mock latency)

Plan reference

See /Users/gfargo/.claude/plans/polymorphic-wondering-sunbeam.md for the full phased plan. PR 1 (raise default token budget) is the next chunk; subsequent PRs land each optimization with an npm run bench diff in their description.

…ine (#845)

First chunk of the #845 perf overhaul: a reproducible benchmark
harness so every later PR can show concrete before/after numbers
instead of hand-waving about "should be faster". Three pieces:

1. Telemetry persistence in `observability.ts`. When COCO_BENCH=1
   is set (or any non-`0` value), every llm call accumulates into
   a narrow `LlmBenchCall` buffer; `flushLlmBenchRun` writes the
   record to `<cwd>/.coco-bench.json` (overridable via
   COCO_BENCH_FILE). Best-effort: write failures are silent and
   the buffer self-clears after each flush.

2. Synthetic diff fixtures at
   `src/lib/parsers/default/__fixtures__/`. Three sizes:
     - tiny  ( 5 files,  ~790 tokens)  — early-exit path
     - medium (25 files, ~36k tokens)  — typical commit
     - large  (50 files, ~83k tokens)  — initial-commit shape
   Content comes from a seeded LCG so before/after runs compare
   the same input. Each fixture exports a fully-populated DiffNode
   tree so `summarizeDiffs` runs without a real git repo.

3. `bin/benchmark.ts` runner (`npm run bench`). Plugs the fixtures
   into `summarizeDiffs` with a duck-typed mock chain that simulates
   per-call latency proportional to input size (deterministic so
   PR diffs are apples-to-apples, not real-world wall-clock).
   Captures stage timings + per-call telemetry. `--update`
   overwrites `.bench/baseline.json`; `--fixture=<name>` narrows
   to a single fixture for tighter feedback loops.

Baseline numbers committed at `.bench/baseline.json` against
current `main`:

| fixture | wall-clock | llm calls | llm total ms | prompt tokens |
|---------|------------|-----------|--------------|---------------|
| tiny    |     2 ms   |     0     |     0 ms     |       0       |
| medium  | 30,213 ms  |    20     | 102,723 ms   |    91,766     |
| large   | 70,048 ms  |    41     | 236,818 ms   |   220,199     |

The 3.4× spread between large fixture's wall-clock and total LLM
time (236 s of model work in 70 s wall) reflects the existing
`maxConcurrent=6` parallelism. Subsequent PRs in the #845 sprint
will move these numbers and the deltas will land directly in PR
descriptions.
@gfargo gfargo merged commit bcb246b into main May 6, 2026
9 checks passed
@gfargo gfargo deleted the feat/diff-pipeline-bench-845 branch May 6, 2026 00:20
gfargo added a commit that referenced this pull request May 6, 2026
) (#849)

The v0 fixtures from #847 used a seeded LCG to generate noise.
Good for deterministic latency measurement, useless for telling
whether an optimization translates to real-shaped diffs. This PR
swaps that out for code-shaped content per file type and adds
named scenarios that mirror real commit workflows.

Generators (src/lib/parsers/default/__fixtures__/generators.ts):
  - generateTypeScript — imports, types, functions, classes, JSDoc
  - generatePython — imports, defs, classes, decorators, docstrings
  - generateMarkdown — headers, lists, paragraphs, code blocks, tables
  - generateJson — nested config object with realistic key names
  - generateYaml — CI workflow shape
  - generateLockfile — yarn lock-style entries
  - generateContentForFile — extension-based dispatcher
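The dispatcher at the end of that list can be sketched as follows. The name `generateContentForFile` comes from the commit; the generator bodies here are one-line placeholders (the real ones emit code-shaped content per file type), and the fallback choice is an assumption.

```typescript
// Per-file-type generator signature: seeded so output is deterministic.
type Generator = (seed: number, lines: number) => string;

// Placeholder bodies; the real generators produce imports/types/classes etc.
const generators: Record<string, Generator> = {
  ".ts": (s, n) => `// ts stub seed=${s}\n`.repeat(n),
  ".py": (s, n) => `# py stub seed=${s}\n`.repeat(n),
  ".md": (s, n) => `<!-- md stub seed=${s} -->\n`.repeat(n),
};

// Extension-based dispatch; falling back to markdown for unknown
// extensions is an assumption made for this sketch.
function generateContentForFile(filePath: string, seed: number, lines = 10): string {
  const ext = filePath.slice(filePath.lastIndexOf("."));
  const gen = generators[ext] ?? generators[".md"];
  return gen(seed, lines);
}
```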

Diff-shape wrappers (diffs.ts):
  - asAdditionDiff / asDeletionDiff — pure +/- shapes
  - asModificationDiff — context + remove + add interleaving
  - asRenameDiff — git rename header (no body)
  - asBinaryDiff — binary file marker
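A sketch of the modification shape, the most involved of the wrappers above. The name `asModificationDiff` comes from the commit; the interleaving rule (every third line becomes a remove/add pair) is an illustrative choice, not the fixture's actual logic.

```typescript
// Produce a unified diff where context lines are interleaved with
// remove/add pairs, mirroring the "context + remove + add" shape.
function asModificationDiff(file: string, lines: string[]): string {
  const hunk = lines.flatMap((line, i) =>
    i % 3 === 1
      ? [`-${line}`, `+${line} // edited`] // a changed line: old then new
      : [` ${line}`] // unchanged context line
  );
  return [
    `diff --git a/${file} b/${file}`,
    `--- a/${file}`,
    `+++ b/${file}`,
    `@@ -1,${lines.length} +1,${lines.length} @@`,
    ...hunk,
  ].join("\n");
}
```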

Scenarios in addition to the original tiny/medium/large:
  - feature-add (14 files)  — new module + tests + docs touch
  - refactor (30 files)     — rename + ~25 modifications
  - initial-commit (50)     — same shape as user's #845 repro
  - docs-update (9)         — markdown-heavy
  - dep-bump (3)            — package.json + lockfile + CHANGELOG

Re-captured baseline (committed at .bench/baseline.json):

| fixture        | wall-clock | calls | llm total ms | prompt tokens |
|----------------|-----------:|------:|-------------:|--------------:|
| tiny           |       2 ms |     0 |          0 ms|             0 |
| medium         |   31,124 ms|    20 |    106,333 ms|        34,237 |
| large          |   72,151 ms|    41 |    244,112 ms|        74,197 |
| feature-add    |   15,967 ms|    11 |     54,726 ms|        18,937 |
| refactor       |   33,994 ms|    28 |    153,871 ms|        52,430 |
| initial-commit |   72,291 ms|    41 |    245,148 ms|        74,546 |
| docs-update    |   18,563 ms|     8 |     56,293 ms|        13,908 |
| dep-bump       |   27,158 ms|     1 |     27,141 ms|        19,597 |

Three observations the realistic fixtures surface that the LCG
fixtures hid:

1. dep-bump pays 27s for one LLM call — the lockfile pre-summary.
   Skip-trivial / per-extension fast-path should basically zero this.
2. refactor (30 files of mixed +/-) fires 28 LLM calls. The
   continuous-queue wave consolidation work (PR 4) targets exactly
   this shape.
3. docs-update is markdown-heavy with 8 calls in 19s. A markdown-
   specific shorter prompt could measurably trim this.

Tests: 14 new generator tests + 5 new fixture-level tests covering
determinism, expected-marker presence, scaling behavior, and shape
properties of the rename / dep-bump scenarios.
