Context
Incremental indexing already uses a reusable worker pool (src/commands/incremental_index_command.ts) to avoid per-file worker startup/teardown overhead.
Full indexing (src/commands/full_index_producer.ts) still spawns one Worker per file, which can be a major perf and stability hit on large repos.
Historical context / likely rationale
I searched the repo's merged PR history for explicit rationale around producer worker lifecycle (worker pools vs per-file workers) and did not find a PR/issue that directly explains this choice.
The closest related production-history signals are about consumer-side overload when Elasticsearch is slow or timing out.
Those fixes are about not overwhelming memory when ES is slow. They don't directly require per-file producer workers; producer concurrency is already bounded by p-queue.
Plausible reasons for per-file workers (no historical ticket found):
- Simplicity: spawn a worker, parse one file, terminate.
- Defensive memory reset: if tree-sitter/native parsing accumulates memory over time, terminating per file forces a reset of native allocations.
Why this matters
- Worker startup/teardown overhead dominates when indexing large codebases.
- High worker churn increases memory pressure and can trigger OS limits / slowdowns.
Where in code
src/commands/full_index_producer.ts: creates a new Worker inside the per-file loop.
Suggested fix
Pool worker threads for full indexing while keeping safety under load:
- Create N workers upfront (N = min(CPU_CORES, configured pool size, file count)).
- Maintain an idle-worker queue and assign jobs.
- Ensure event listeners are cleaned up per job.
- Terminate workers only once after queue drain.
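The steps above could be sketched roughly as follows. This is illustrative, not existing code: ProducerWorkerPool and PoolWorker are hypothetical names, and the spawn callback stands in for real worker_threads.Worker construction so the pool logic stays self-contained.

```typescript
import { cpus } from "node:os";

// Hypothetical shape of a pooled worker; in the real code this would wrap
// worker_threads.Worker with a postMessage/once-per-job protocol.
interface PoolWorker {
  run(file: string): Promise<void>;
  terminate(): Promise<void>;
}

class ProducerWorkerPool {
  private idle: PoolWorker[] = [];
  private waiters: Array<(w: PoolWorker) => void> = [];

  constructor(spawn: () => PoolWorker, poolSize: number, fileCount: number) {
    // N = min(CPU cores, configured pool size, file count), never below 1.
    const n = Math.max(1, Math.min(cpus().length, poolSize, fileCount));
    for (let i = 0; i < n; i++) this.idle.push(spawn());
  }

  private acquire(): Promise<PoolWorker> {
    const w = this.idle.pop();
    if (w !== undefined) return Promise.resolve(w);
    // No idle worker: queue the job until one is released.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  private release(w: PoolWorker): void {
    const next = this.waiters.shift();
    if (next) next(w);
    else this.idle.push(w);
  }

  // One job = one file; per-job listeners are scoped inside run(), so they
  // are cleaned up when the job settles.
  async runJob(file: string): Promise<void> {
    const worker = await this.acquire();
    try {
      await worker.run(file);
    } finally {
      this.release(worker);
    }
  }

  // Terminate workers only once, after all runJob() promises have settled.
  async drain(): Promise<void> {
    await Promise.all(this.idle.map((w) => w.terminate()));
    this.idle = [];
  }
}
```

Because Worker creation is injected, the same pool can be exercised in unit tests with a counting stub instead of real threads.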
Optional safety guardrail (to preserve the likely “memory reset” benefit of per-file workers):
- Add a worker recycle policy (terminate/recreate a worker after N files or after a memory threshold).
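A recycle policy could look roughly like this. Names (RecyclingSlot, maxJobsPerWorker) are hypothetical, and only the job-count trigger is shown; a memory-threshold trigger would hook into the same spot.

```typescript
// Hypothetical pooled-worker shape, as in the pool sketch above.
interface PoolWorker {
  run(file: string): Promise<void>;
  terminate(): Promise<void>;
}

// One pool slot that retires its worker after N jobs.
class RecyclingSlot {
  private jobsDone = 0;
  private worker: PoolWorker;

  constructor(
    private spawn: () => PoolWorker,
    private maxJobsPerWorker: number,
  ) {
    this.worker = spawn();
  }

  async run(file: string): Promise<void> {
    await this.worker.run(file);
    // Recycle after N files to release any native (e.g. tree-sitter)
    // allocations, preserving the "memory reset" benefit of per-file workers.
    if (++this.jobsDone >= this.maxJobsPerWorker) {
      await this.worker.terminate();
      this.worker = this.spawn();
      this.jobsDone = 0;
    }
  }

  terminate(): Promise<void> {
    return this.worker.terminate();
  }
}
```

With maxJobsPerWorker set high, this degrades to a plain pooled worker; set to 1, it reproduces today's per-file behavior, so the policy can be tuned rather than argued about.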
Config knob: expose PRODUCER_WORKER_POOL_SIZE and keep the same clamp-to-CPU behavior.
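A minimal sketch of reading the knob, assuming it arrives as an environment variable (the parsing and clamping below are illustrative, not existing code):

```typescript
import { cpus } from "node:os";

// Resolve the producer pool size from PRODUCER_WORKER_POOL_SIZE, clamped to
// min(CPU cores, configured size, file count) and never below 1.
function resolvePoolSize(fileCount: number): number {
  const raw = process.env.PRODUCER_WORKER_POOL_SIZE;
  const configured = raw !== undefined ? Number(raw) : cpus().length;
  // Fall back to CPU count if the value is missing or not a finite number.
  const bounded = Number.isFinite(configured) ? configured : cpus().length;
  return Math.max(1, Math.min(cpus().length, bounded, fileCount));
}
```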
Test plan
- Unit test asserting that Worker is instantiated at most poolSize times during full indexing of many files.