Problem
Current ES bulk indexing doesn't use several available optimizations. For large repos (21+ hour indexing runs), the savings add up significantly.
Current: Unoptimized Bulk Indexing
══════════════════════════════════
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Bulk Request │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 500 docs × 5KB = 2.5MB uncompressed │ │
│ │ │ │
│ │ ──────────────► Network ──────────────► │ │
│ │ (2.5MB payload each request) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Meanwhile, ES is also doing: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • Auto-refresh every 1 second (I/O + CPU) │ │
│ │ • Replicating to replica shards (network + I/O) │ │
│ │ • Competing for resources with bulk indexing │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Current State
// elasticsearch.ts
const baseOptions: Partial<ClientOptions> = {
  requestTimeout: 90000,
  // compression: NOT SET (defaults to none)
};

const bulkOptions = {
  refresh: false, // ✓ Good: don't wait for refresh
  operations,
  // But ES still auto-refreshes every 1 second in background
};
Note: refresh: false only tells ES not to refresh as part of this request. The index still auto-refreshes every second (the default refresh_interval of 1s), consuming resources during bulk indexing.
Proposed Optimizations
1. Enable Gzip Compression (Easy Win)
Before: 2.5MB per request
════════════════════════
Client ────── 2.5MB ──────► ES Server
(uncompressed)
After: ~250KB per request
═════════════════════════
Client ────── 250KB ──────► ES Server
(gzipped)
~10x smaller payloads = faster network transfer
Implementation:
import { Client } from '@elastic/elasticsearch';

const client = new Client({
  node: elasticsearchConfig.endpoint,
  compression: 'gzip', // ← one-line change (v8 clients take a boolean: compression: true)
});
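The ~10x figure depends on payload content, but it's easy to sanity-check with Node's built-in zlib on a synthetic bulk body. The index name, doc shape, and sizes below are illustrative stand-ins, not the real schema; real source-code chunks are also highly repetitive, so they compress well:

```typescript
import { gzipSync } from 'node:zlib';

// Build a synthetic bulk payload: 500 docs × ~5KB, matching the 2.5MB
// figure from the diagram above.
const lines: string[] = [];
for (let i = 0; i < 500; i++) {
  lines.push(JSON.stringify({ index: { _index: 'code-chunks', _id: i } }));
  lines.push(JSON.stringify({
    path: `src/module_${i}.ts`,
    content: 'export function handler(req, res) { /* ... */ }\n'.repeat(100),
  }));
}
const raw = Buffer.from(lines.join('\n') + '\n');
const compressed = gzipSync(raw);

console.log(`uncompressed: ${(raw.length / 1024).toFixed(0)} KB`);
console.log(`gzipped:      ${(compressed.length / 1024).toFixed(0)} KB`);
console.log(`ratio:        ${(raw.length / compressed.length).toFixed(1)}x`);
```

The exact ratio varies with how repetitive the indexed text is; this synthetic payload overstates it, real code typically lands in the 5-10x range.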
2. Disable Auto-Refresh During Bulk Indexing
Current: Background refreshes compete with indexing
═══════════════════════════════════════════════════
Time ─────────────────────────────────────────────────────────►
Bulk 1 ──► Bulk 2 ──► Bulk 3 ──► Bulk 4 ──► Bulk 5
│ │ │
▼ ▼ ▼
[refresh] [refresh] [refresh]
(1 sec) (1 sec) (1 sec)
│ │ │
└─────── CPU/IO work ───┘
competes with
bulk indexing
Proposed: Single refresh at end
════════════════════════════════
Time ─────────────────────────────────────────────────────────►
[Set refresh=-1] ──► Bulk 1 ──► Bulk 2 ──► ... ──► Done ──► [Refresh]
│
▼
Single refresh
All data searchable
Implementation:
// Before bulk indexing
await client.indices.putSettings({
  index: indexName,
  body: { 'index.refresh_interval': '-1' }
});

// ... bulk indexing ...

// After bulk indexing
await client.indices.putSettings({
  index: indexName,
  body: { 'index.refresh_interval': '1s' }
});
await client.indices.refresh({ index: indexName });
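One caveat: if bulk indexing throws between the two putSettings calls above, the index is left with refresh disabled forever. A try/finally wrapper avoids that. This is a sketch against a mocked client surface; `withRefreshDisabled` and `RefreshClient` are hypothetical names, not existing API, and the real code would call `indices.putSettings` / `indices.refresh` on the @elastic/elasticsearch client:

```typescript
// Minimal stand-in for the slice of the ES client used here.
interface RefreshClient {
  putSettings(index: string, settings: Record<string, string>): Promise<void>;
  refresh(index: string): Promise<void>;
}

// Hypothetical helper: run `work` with auto-refresh disabled, restoring the
// interval and forcing one refresh even when `work` throws.
async function withRefreshDisabled(
  client: RefreshClient,
  index: string,
  work: () => Promise<void>,
): Promise<void> {
  await client.putSettings(index, { 'index.refresh_interval': '-1' });
  try {
    await work();
  } finally {
    await client.putSettings(index, { 'index.refresh_interval': '1s' });
    await client.refresh(index);
  }
}

// Recording fake client to demonstrate the failure path.
const calls: string[] = [];
const fake: RefreshClient = {
  async putSettings(_index, settings) {
    calls.push(`put ${settings['index.refresh_interval']}`);
  },
  async refresh(_index) {
    calls.push('refresh');
  },
};

try {
  await withRefreshDisabled(fake, 'code-chunks', async () => {
    throw new Error('bulk indexing failed');
  });
} catch {
  // The failure still propagates, but settings were restored first.
}
console.log(calls); // [ 'put -1', 'put 1s', 'refresh' ]
```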
3. Disable Replicas During Bulk Indexing
Current: Every write replicated immediately
═══════════════════════════════════════════
Primary ──► Write ──► Replicate ──► Wait for ACK
│
▼
Replica shard
(network + I/O)
Proposed: Replicate once at end
════════════════════════════════
[Set replicas=0] ──► Bulk writes (primary only) ──► [Set replicas=1]
│
▼
ES replicates
everything at once
Implementation:
// Before bulk indexing: remember the current replica count
const originalSettings = await client.indices.getSettings({ index: indexName });
const originalReplicas =
  originalSettings[indexName]?.settings?.index?.number_of_replicas ?? '1';
await client.indices.putSettings({
  index: indexName,
  body: { 'index.number_of_replicas': 0 }
});

// ... bulk indexing ...

// After bulk indexing: restore the original replica count
await client.indices.putSettings({
  index: indexName,
  body: { 'index.number_of_replicas': originalReplicas }
});
Expected Impact
| Optimization | Effort | Impact |
|---|---|---|
| Gzip compression | One line | 5-10x smaller payloads |
| Disable refresh | ~20 lines | 2-5x faster bulk writes |
| Disable replicas | ~20 lines | ~2x faster (if replicas > 0) |
Combined: Potentially 2-10x faster ES indexing phase.
Implementation Plan
Phase 1: Compression (Low Risk)
- Add compression: 'gzip' to ES client options
- No lifecycle changes needed
- Can be toggled via env var if needed
Phase 2: Index Settings Lifecycle (Medium Risk)
- Add startBulkIndexing(indexName) - saves settings, optimizes for bulk
- Add finishBulkIndexing(indexName) - restores settings, refreshes
- Wrap producer+worker in index_command.ts
- Handle errors (restore settings on failure)
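A rough sketch of the Phase 2 pair, against a mocked client surface. The function bodies are assumptions about what "saves settings" and "restores settings" entail; `EsLike` stands in for the real @elastic/elasticsearch client, and the mock at the bottom only demonstrates the call pattern from index_command.ts:

```typescript
// Minimal stand-in for the settings-related ES client surface.
interface EsLike {
  getReplicas(index: string): Promise<string>;
  putSettings(index: string, settings: Record<string, string | number>): Promise<void>;
  refresh(index: string): Promise<void>;
}

// startBulkIndexing: remember the replica count, then optimize for bulk writes.
async function startBulkIndexing(es: EsLike, index: string): Promise<string> {
  const originalReplicas = await es.getReplicas(index);
  await es.putSettings(index, {
    'index.refresh_interval': '-1',
    'index.number_of_replicas': 0,
  });
  return originalReplicas;
}

// finishBulkIndexing: restore settings and make everything searchable.
async function finishBulkIndexing(
  es: EsLike, index: string, originalReplicas: string,
): Promise<void> {
  await es.putSettings(index, {
    'index.refresh_interval': '1s',
    'index.number_of_replicas': originalReplicas,
  });
  await es.refresh(index);
}

// Mock that records settings in a plain object.
const settings: Record<string, string | number> = { 'index.number_of_replicas': '1' };
const es: EsLike = {
  async getReplicas(_i) { return String(settings['index.number_of_replicas']); },
  async putSettings(_i, s) { Object.assign(settings, s); },
  async refresh(_i) { settings['refreshed'] = 1; },
};

// In index_command.ts, the pair would wrap producer+worker in try/finally
// so settings are restored even on failure:
const saved = await startBulkIndexing(es, 'code-chunks');
try {
  // ... producer + worker run here ...
} finally {
  await finishBulkIndexing(es, 'code-chunks', saved);
}
console.log(settings);
```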
Code Changes
| File | Change |
|---|---|
| elasticsearch.ts | Add compression: 'gzip' to client |
| elasticsearch.ts | Add startBulkIndexing(), finishBulkIndexing() |
| index_command.ts | Wrap indexing in lifecycle calls |
| config.ts | Add ENABLE_BULK_OPTIMIZATIONS flag (default: true) |
Interaction with Other Issues
| Issue | Interaction |
|---|---|
| #121 (filePaths aggregation) | If #121 changes from bulk to update API, verify the settings optimizations still apply. Both APIs benefit from refresh_interval=-1 and replicas=0. |
| #122 (parallel enqueue/dequeue) | Settings lifecycle must handle producer failure mid-way. Use try/finally to restore settings even on error. |
| #120 (auto-retry) | No conflict. Auto-retry happens after bulk indexing, settings already restored. |
Acceptance Criteria
- refresh_interval set to -1 during bulk indexing
- number_of_replicas set to 0 during bulk indexing (configurable)
- Works with the --clean flag
- Works with both the bulk and update APIs (future-proof for bug #121: documents with identical content overwrite each other - only one file discoverable via search)