feat: ES bulk indexing optimizations (compression, refresh, replicas) #124

@kapral18

Description

Problem

Current ES bulk indexing doesn't use the optimizations Elasticsearch provides for bulk ingest. For large repositories (21+ hours of indexing), the missed savings add up significantly.

Current: Unoptimized Bulk Indexing
══════════════════════════════════

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Bulk Request                                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  500 docs × 5KB = 2.5MB uncompressed                    │   │
│  │                                                         │   │
│  │  ──────────────► Network ──────────────►               │   │
│  │     (2.5MB payload each request)                        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Meanwhile, ES is also doing:                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  • Auto-refresh every 1 second (I/O + CPU)              │   │
│  │  • Replicating to replica shards (network + I/O)        │   │
│  │  • Competing for resources with bulk indexing           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Current State

// elasticsearch.ts
const baseOptions: Partial<ClientOptions> = {
  requestTimeout: 90000,
  // compression: NOT SET (defaults to none)
};

const bulkOptions = {
  refresh: false,  // ✓ Good: don't wait for refresh
  operations,
  // But ES still auto-refreshes every 1 second in background
};

Note: refresh: false only means "don't wait synchronously." ES still refreshes automatically every second, consuming resources during bulk indexing.

Proposed Optimizations

1. Enable Gzip Compression (Easy Win)

Before: 2.5MB per request
════════════════════════

Client ────── 2.5MB ──────► ES Server
         (uncompressed)


After: ~250KB per request
═════════════════════════

Client ────── 250KB ──────► ES Server
          (gzipped)
          
~10x smaller payloads = faster network transfer

Implementation:

const client = new Client({
  node: elasticsearchConfig.endpoint,
  compression: 'gzip',  // ← One line change
});

2. Disable Auto-Refresh During Bulk Indexing

Current: Background refreshes compete with indexing
═══════════════════════════════════════════════════

Time ─────────────────────────────────────────────────────────►

Bulk 1 ──► Bulk 2 ──► Bulk 3 ──► Bulk 4 ──► Bulk 5
              │           │           │
              ▼           ▼           ▼
         [refresh]   [refresh]   [refresh]
          (1 sec)     (1 sec)     (1 sec)
              │           │           │
              └─────── CPU/IO work ───┘
                   competes with
                   bulk indexing


Proposed: Single refresh at end
════════════════════════════════

Time ─────────────────────────────────────────────────────────►

[Set refresh=-1] ──► Bulk 1 ──► Bulk 2 ──► ... ──► Done ──► [Refresh]
                                                                 │
                                                                 ▼
                                                        Single refresh
                                                        All data searchable

Implementation:

// Before bulk indexing
await client.indices.putSettings({
  index: indexName,
  body: { 'index.refresh_interval': '-1' }
});

// ... bulk indexing ...

// After bulk indexing
await client.indices.putSettings({
  index: indexName,
  body: { 'index.refresh_interval': '1s' }
});
await client.indices.refresh({ index: indexName });

3. Disable Replicas During Bulk Indexing

Current: Every write replicated immediately
═══════════════════════════════════════════

Primary ──► Write ──► Replicate ──► Wait for ACK
                          │
                          ▼
                    Replica shard
                    (network + I/O)


Proposed: Replicate once at end
════════════════════════════════

[Set replicas=0] ──► Bulk writes (primary only) ──► [Set replicas=1]
                                                           │
                                                           ▼
                                                    ES replicates
                                                    everything at once

Implementation:

// Before bulk indexing: remember the original replica count
const originalSettings = await client.indices.getSettings({ index: indexName });
const originalReplicas =
  originalSettings[indexName]?.settings?.index?.number_of_replicas ?? '1';
await client.indices.putSettings({
  index: indexName,
  body: { 'index.number_of_replicas': 0 }
});

// ... bulk indexing ...

// After bulk indexing: restore replicas; ES copies everything in one pass
await client.indices.putSettings({
  index: indexName,
  body: { 'index.number_of_replicas': originalReplicas }
});

Expected Impact

Optimization       Effort      Impact
─────────────────  ──────────  ─────────────────────────────
Gzip compression   one line    5-10x smaller payloads
Disable refresh    ~20 lines   2-5x faster bulk writes
Disable replicas   ~20 lines   ~2x faster (if replicas > 0)

Combined: Potentially 2-10x faster ES indexing phase.

Implementation Plan

Phase 1: Compression (Low Risk)

  • Add compression: 'gzip' to ES client options
  • No lifecycle changes needed
  • Can be toggled via env var if needed

Phase 2: Index Settings Lifecycle (Medium Risk)

  • Add startBulkIndexing(indexName) - saves settings, optimizes for bulk
  • Add finishBulkIndexing(indexName) - restores settings, refreshes
  • Wrap producer+worker in index_command.ts
  • Handle errors (restore settings on failure)

Code Changes

File              Change
────────────────  ──────────────────────────────────────────────────
elasticsearch.ts  Add compression: 'gzip' to client options
elasticsearch.ts  Add startBulkIndexing(), finishBulkIndexing()
index_command.ts  Wrap indexing in lifecycle calls
config.ts         Add ENABLE_BULK_OPTIMIZATIONS flag (default: true)

Interaction with Other Issues

  • #121 (filePaths aggregation): if #121 switches from the bulk API to the update API, verify the settings optimizations still apply. Both APIs benefit from refresh_interval=-1 and replicas=0.
  • #122 (parallel enqueue/dequeue): the settings lifecycle must handle a producer failure mid-way. Use try/finally to restore settings even on error.
  • #120 (auto-retry): no conflict. Auto-retry happens after bulk indexing, when settings are already restored.

Acceptance Criteria

  • Gzip compression enabled on ES client
  • refresh_interval set to -1 during bulk indexing
  • number_of_replicas set to 0 during bulk indexing (configurable)
  • Original settings restored after bulk indexing completes
  • Settings restored even on error (try/finally)
  • Final refresh ensures all data is searchable
  • Can be disabled via environment variable
  • Works correctly with --clean flag
  • Works correctly with incremental indexing
  • Works with both the bulk and update APIs (future-proof for #121: documents with identical content overwrite each other, leaving only one file discoverable via search)
