feat(parser): raise default maxConcurrent to 24 + adaptive backoff (#845) #856
Merged
Conversation
PR 3 of the #845 sprint. Headline: modification-heavy fixtures (refactor, docs-update) get a 35-45% wall-clock cut, and the pipeline now degrades gracefully under rate limits instead of failing the whole run on a transient 429.
Two changes
1. Default maxConcurrent: 12 → 24
In `src/lib/langchain/utils.ts`, both the openai and anthropic service defaults are bumped. The fast tier on each provider comfortably handles ~30 concurrent requests on the per-key default rate limit; 24 doubles throughput while leaving headroom for retries. Ollama stays at 1 (single local instance; more would just queue). The bench default in `bin/benchmark.ts` is also bumped so per-PR diffs reflect the new production setting.

2. Adaptive backoff in summarize()

New `invokeWithBackoff` wrapper around `chain.invoke()` that catches retryable errors and waits with exponential backoff before retrying. Retryable means any of:

- HTTP 429 / 502 / 503 / 504
- `code: 'rate_limit_exceeded'`
- `code: 'ECONNRESET'` / `'ETIMEDOUT'`
- a message matching `/rate.?limit|429|too many requests|timeout|temporarily unavailable/i`

Non-retryable errors (4xx other than 429) still propagate immediately so a malformed prompt fails fast. 3 retries, 1s / 2s / 4s waits (capped at 5s). Worst-case extra latency: ~7s before surfacing the failure. That's bounded enough that the user doesn't wait forever, and long enough to ride out brief rate-limit blips that would otherwise kill an entire pipeline mid-run. Both pieces are sketched below.
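First, a minimal sketch of the shape of the config change in `src/lib/langchain/utils.ts`. Only the values come from this PR; the object and field names here are illustrative, not the file's exact structure.

```ts
// Hypothetical shape of the per-service defaults in src/lib/langchain/utils.ts.
const serviceDefaults = {
  openai:    { maxConcurrent: 24 }, // was 12
  anthropic: { maxConcurrent: 24 }, // was 12
  ollama:    { maxConcurrent: 1 },  // unchanged: single local instance, more would just queue
};
```

And a sketch of the backoff wrapper under the criteria listed above. The option names (`maxRetries`, `baseDelayMs`, `maxDelayMs`), the error fields it inspects, and the call-site shape inside `summarize()` are assumptions for illustration, not the PR's actual code; the retry counts, waits, and retryable-error criteria are from the description.

```ts
// Sketch of an invokeWithBackoff-style wrapper: up to 3 retries, 1s / 2s / 4s waits
// capped at 5s, retrying only on the error shapes the PR calls retryable.
const RETRYABLE_STATUS = new Set([429, 502, 503, 504]);
const RETRYABLE_CODES = new Set(['rate_limit_exceeded', 'ECONNRESET', 'ETIMEDOUT']);
const RETRYABLE_MESSAGE =
  /rate.?limit|429|too many requests|timeout|temporarily unavailable/i;

interface MaybeRetryable {
  status?: number;
  code?: string;
  message?: string;
}

function isRetryable(err: unknown): boolean {
  const e = (err ?? {}) as MaybeRetryable;
  if (typeof e.status === 'number' && RETRYABLE_STATUS.has(e.status)) return true;
  if (typeof e.code === 'string' && RETRYABLE_CODES.has(e.code)) return true;
  return typeof e.message === 'string' && RETRYABLE_MESSAGE.test(e.message);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function invokeWithBackoff<T>(
  fn: () => Promise<T>,
  { maxRetries = 3, baseDelayMs = 1000, maxDelayMs = 5000 } = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up after maxRetries attempts, or immediately on non-retryable errors
      // (e.g. a 400 from a malformed prompt), so bad input still fails fast.
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      // 1s, 2s, 4s, capped at maxDelayMs; worst case adds ~7s before the failure surfaces.
      await sleep(Math.min(baseDelayMs * 2 ** attempt, maxDelayMs));
    }
  }
}

// Inside summarize(), the chain call would be wrapped along these lines
// (the exact input shape passed to chain.invoke() depends on the chain):
// const summary = await invokeWithBackoff(() => chain.invoke({ input_documents: docs }));
```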
Bench (post-PR-2 baseline → PR 3)

| fixture        | wall before | wall after |            Δ wall |
|----------------|------------:|-----------:|------------------:|
| tiny           |        1 ms |       2 ms |             +1 ms |
| medium         |    6,906 ms |   6,905 ms |             -1 ms |
| large          |    9,749 ms |   9,754 ms |             +5 ms |
| feature-add    |    5,640 ms |   5,641 ms |             +1 ms |
| refactor       |   41,347 ms |  26,718 ms | -14,629 ms (-35%) |
| initial-commit |    9,818 ms |   9,816 ms |             -2 ms |
| docs-update    |   18,564 ms |  10,248 ms |  -8,316 ms (-45%) |
| dep-bump       |        0 ms |       0 ms |              0 ms |
PR 3 targets the modification-heavy fixtures that PR 2 couldn't help.
`refactor` (20 LLM calls) and `docs-update` (7 calls) were serializing across multiple waves at the old concurrency of 12; at 24 they fit in a single wave, so the slowest call in the wave dominates instead of the wave count. Pure-add fixtures already had ≤7 calls each, which fit in one wave at concurrency 12, so bumping to 24 doesn't help them and they stay flat. They needed the skip-trivial work from PR 2 instead, which is what they got.
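As a back-of-envelope check on `refactor`: 20 calls at concurrency 12 is ceil(20 / 12) = 2 waves, roughly two slowest-call latencies back to back; at concurrency 24 it's ceil(20 / 24) = 1 wave. The observed drop is less than a clean 50% because the slowest call in the remaining wave and the non-LLM work don't shrink.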
Cumulative state vs original baseline (#847)
Test plan
- `npm run lint`
- `npm run test:jest` (1275 tests pass; 10 new in `chains/summarize/index.test.ts` covering the full backoff matrix: success-no-retry, retry-then-succeed for each retryable error shape, no-retry on non-retryable, give-up after 3 retries, unchanged Document[] across retries; one case is sketched below)
- `npm run build`
- `npm run test:cli`
- `npm run bench` (numbers above)
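A sketch of what two of the backoff-matrix cases could look like. It assumes `invokeWithBackoff` is exported from the summarize module and accepts the hypothetical `baseDelayMs` option from the sketch above so the test doesn't sleep for real; the actual tests in `chains/summarize/index.test.ts` may structure this differently (e.g. exercising the wrapper through `summarize()` with a mocked chain).

```ts
// Hypothetical retry-then-succeed and no-retry cases; names and options are
// assumptions carried over from the sketch above.
import { invokeWithBackoff } from './index';

describe('invokeWithBackoff', () => {
  it('retries a 429 once and then succeeds', async () => {
    const invoke = jest
      .fn()
      .mockRejectedValueOnce({ status: 429, message: 'rate limit exceeded' })
      .mockResolvedValueOnce('summary');

    await expect(invokeWithBackoff(invoke, { baseDelayMs: 1 })).resolves.toBe('summary');
    expect(invoke).toHaveBeenCalledTimes(2);
  });

  it('does not retry a non-retryable 400', async () => {
    const invoke = jest.fn().mockRejectedValue({ status: 400, message: 'malformed prompt' });

    await expect(invokeWithBackoff(invoke, { baseDelayMs: 1 })).rejects.toMatchObject({ status: 400 });
    expect(invoke).toHaveBeenCalledTimes(1);
  });
});
```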
Plan reference

PR 3 of the #845 sprint. PR 4 (continuous-queue waves, which kill the wave-locking on refactor's slowest call) is next, then PR 5 (disk cache).