feat(parser): continuous-queue scheduler + tail-variance bench (#845)#857
Merged
Two changes that together let the bench show wave-locking
pathology and PR 4's fix for it.
## Continuous queue replaces wave-by-wave Promise.all
Both `summarizeInWaves` (directory consolidation in
`summarizeDiffs.ts`) and `processInWaves` (per-file pre-process
in `summarizeLargeFiles.ts`) used a wave-by-wave loop:
```
for (each batch of maxConcurrent items)
  await Promise.all(batch)
```
The slowest call in each wave forced the next wave to wait — even
if the other wave members had finished long before. Real-world
LLM latencies have meaningful tail variance (one slow call in
every wave is common); the wave-locking compounded that into
significant dead time.
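To see why this compounds, here is a toy deterministic wall-clock model (the numbers are illustrative, not taken from the bench): under wave-based scheduling the wall clock is the sum of per-wave maxima, while a continuous queue hands each freed slot the next item immediately.

```typescript
// Toy wall-clock model (illustrative sketch only, not the bench's code).

// Wave-based: every wave waits for its slowest member.
function waveBasedWall(latencies: number[], maxConcurrent: number): number {
  let wall = 0;
  for (let i = 0; i < latencies.length; i += maxConcurrent) {
    wall += Math.max(...latencies.slice(i, i + maxConcurrent));
  }
  return wall;
}

// Continuous queue: a freed slot immediately takes the next item in order.
function continuousWall(latencies: number[], maxConcurrent: number): number {
  const slotFreeAt = new Array<number>(maxConcurrent).fill(0);
  for (const latency of latencies) {
    // dispatch to whichever slot frees first
    const slot = slotFreeAt.indexOf(Math.min(...slotFreeAt));
    slotFreeAt[slot] += latency;
  }
  return Math.max(...slotFreeAt);
}

// Eight calls at concurrency 4, one 6x straggler per wave-sized chunk:
const lat = [100, 100, 100, 600, 100, 100, 100, 600];
console.log(waveBasedWall(lat, 4)); // 1200: both stragglers paid in full
console.log(continuousWall(lat, 4)); // 800: fast slots keep working
```

With tail variance, the wave-based total pays every straggler serially; the continuous queue overlaps stragglers with the rest of the queue.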
Replaced both passes with a `createLimit(maxConcurrent)`
semaphore (the same primitive `collectDiffs` already uses).
Items dispatch in order; as soon as any in-flight slot frees,
the next item starts. The directory pass also re-checks the
budget at dispatch time — if earlier completions already dropped
the total under maxTokens, subsequent items return their input
unchanged instead of paying for an unnecessary LLM call.
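The shape of the change looks roughly like this sketch. `createLimit` here is a minimal p-limit-style semaphore standing in for the repo's primitive of the same name (its real signature may differ), and `tokenCount`/`summarize`/`consolidate` are hypothetical helpers, not the PR's actual API:

```typescript
// Minimal semaphore sketch, assumed similar to the repo's createLimit.
function createLimit(maxConcurrent: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const release = () => {
    active--;
    queue.shift()?.();
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(release);
      };
      if (active < maxConcurrent) run();
      else queue.push(run);
    });
}

// Directory-pass shape: all items are dispatched through the limiter up
// front; a finished slot immediately pulls the next item, and the budget
// is re-checked at dispatch time so late items can skip their LLM call.
async function consolidate(
  items: string[],
  maxConcurrent: number,
  maxTokens: number,
  tokenCount: (s: string) => number,
  summarize: (s: string) => Promise<string>,
): Promise<string[]> {
  const limit = createLimit(maxConcurrent);
  let total = items.reduce((n, s) => n + tokenCount(s), 0);
  return Promise.all(
    items.map((item) =>
      limit(async () => {
        if (total <= maxTokens) return item; // budget already met: no call
        const summary = await summarize(item);
        total -= tokenCount(item) - tokenCount(summary);
        return summary;
      }),
    ),
  );
}
```

Because the budget check runs when the slot frees rather than when the pass starts, a run whose early completions shrink the total enough lets every remaining item pass through untouched.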
## Bench: tail-variance latency model + monorepo fixture
The previous mock latency formula (base + per-token) gave every
call the same shape, so wave-locking never showed in numbers.
Added a deterministic per-call multiplier keyed on the input
hash: ~10% of calls land in a 3x slow bucket, ~3% in a 6x slow
bucket. Same input still yields the same latency every run
(reproducible) but the workload now exhibits the long-tail
behavior real LLMs do.
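The idea can be sketched as follows (the hash function and bucket boundaries are assumptions for illustration, not the bench's actual code): hash the input, map ~3% of hash values to a 6x bucket and the next ~10% to a 3x bucket, so the multiplier is deterministic per input.

```typescript
// Sketch of a deterministic tail-variance multiplier (illustrative only).

// FNV-1a 32-bit: cheap, deterministic string hash.
function hashInput(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function latencyMultiplier(input: string): number {
  const bucket = hashInput(input) % 100;
  if (bucket < 3) return 6; // ~3% of calls: very slow
  if (bucket < 13) return 3; // next ~10%: slow
  return 1; // everyone else: baseline
}

// mockLatencyMs ≈ (baseLatencyMs + perTokenMs * tokens) * latencyMultiplier(input)
```

Same input, same hash, same multiplier on every run, which keeps the bench reproducible while giving the workload a long tail.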
Added a `monorepo` fixture (80 modification-shaped files across
35 directories) sized to produce more LLM calls than the default
maxConcurrent. With the variance model + monorepo fixture, the
wave-locking is finally measurable.
## Bench: wave-based vs continuous-queue (variance latency)
| fixture | wave-based | continuous-queue | Δ wall |
|----------------|------------:|-----------------:|------------:|
| tiny | 4 ms | 2 ms | -2 ms |
| medium | 8,511 ms | 8,502 ms | -9 ms |
| large | 14,420 ms | 14,413 ms | -7 ms |
| feature-add | 13,066 ms | 13,062 ms | -4 ms |
| refactor | 51,148 ms | 51,116 ms | -32 ms |
| initial-commit | 16,974 ms | 16,965 ms | -9 ms |
| docs-update | 24,128 ms | 24,384 ms | +256 ms |
| dep-bump | 0 ms | 0 ms | 0 ms |
| **monorepo** | **213,233 ms** | **88,957 ms** | **-124,276 ms (-58%)** |
All fixtures except monorepo have ≤20 LLM calls: they fit in a single wave at concurrency 24, so the scheduler choice is moot for them. The monorepo fixture (80 calls = ~4 waves wave-based) is where wave-locking shows: the continuous queue cuts the wall by 58%.
The other fixtures stay flat as expected (within noise). The
LLM call counts are unchanged — this PR is pure scheduler
restructuring, no skip-trivial-style call-elimination.
PR 4 of the #845 sprint. Replaces both pipeline passes' wave-by-wave Promise.all with a continuous-queue semaphore so the slowest call in a batch no longer blocks the next batch from starting. Monorepo-shape fixtures see a 58% wall-clock cut (213s → 89s); other fixtures stay flat because they fit in a single wave at concurrency 24.
Three changes

1. Continuous queue

Both schedulers replaced:

- `summarizeInWaves` in `summarizeDiffs.ts` (directory consolidation)
- `processInWaves` in `summarizeLargeFiles.ts` (per-file pre-process)

Old loop: chunk the items into batches of `maxConcurrent` and `await Promise.all(batch)` per batch.

New loop: a `createLimit(maxConcurrent)` semaphore. Items dispatch in order; the next item starts as soon as any in-flight slot frees. The directory pass also re-checks the budget at dispatch time, so subsequent items short-circuit when earlier completions have already dropped the total under `maxTokens`.

2. Tail-variance latency model in the bench
The previous mock latency was `baseLatencyMs + perTokenMs * tokens`: the same shape for every call. That hid wave-locking entirely, because the slowest call in any wave was no slower than every other call. Real LLMs don't behave that way; they have meaningful tail variance (one of every ten calls is noticeably slow, sometimes 5-10x).

Added a deterministic per-call multiplier keyed on the input hash: ~10% of calls land in a 3x slow bucket and ~3% in a 6x bucket, with the same input yielding the same latency every run.
This finally makes wave-locking measurable in the bench. Without it, the architectural change was correct but invisible.
3. New `monorepo` fixture

80 modification-shaped files across 35 directories, sized to produce more LLM calls than `maxConcurrent` (24), so wave-based scheduling requires multiple waves and shows the wave-locking pathology under variance.

Bench: wave-based vs continuous-queue (both at variance latency)
Every other fixture has ≤20 LLM calls — they fit in a single wave at concurrency 24, so scheduler choice is moot. The monorepo fixture (80 calls = ~4 waves at wave-based) is where wave-locking surfaces.
LLM call counts are unchanged across the board. PR 4 is pure scheduler restructuring; no calls eliminated like in PR 2's skip-trivial. The win is purely from filling slots more efficiently under variance.
Test plan
- `npm run lint`
- `npm run test:jest` (1294 tests pass; 1 updated assertion in `summarizeDiffs.test.ts` to match the new "continuous queue" log line, plus 19 new fixture tests covering the monorepo addition)
- `npm run build`
- `npm run test:cli`
- `npm run bench` (numbers above)

Plan reference
PR 4 of the #845 sprint. PR 5 (per-repo content-hashed disk cache) is the final headline win: re-runs of `coco commit` after small changes will hit the cache for unchanged files and skip the LLM entirely.