feat: add detailed timing metrics for bottleneck identification #123

@kapral18

Description

Problem

Current metrics capture counts and totals, but not per-phase timing breakdowns. When indexing takes 21 hours, we can't identify which phase is the bottleneck.

Current Metrics: We Know WHAT, Not WHERE
════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────┐
│                     batchDuration = 5000ms                      │
│                         (total only)                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   [Dequeue]     [ES Bulk Request]      [Commit]                 │
│    SQLite        Network + Infer        SQLite                  │
│                                                                 │
│     ???              ???                 ???                    │
│                                                                 │
│   Is this      Is inference slow?    Is this                    │
│   slow?        Network latency?      slow?                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

We see: "batch took 5000ms"
We don't see: "4500ms was ES inference, 300ms was dequeue, 200ms was commit"

Why Not Use ES/Kibana Monitoring?

ES Stack Monitoring shows server-side metrics:

  • Bulk indexing rate
  • Shard-level latency
  • Cluster health

But it cannot show client-side timing:

  • SQLite queue operations (enqueue/dequeue/commit)
  • File parsing time
  • Network latency vs ES processing
  • Time waiting for inference

We need client-side instrumentation to identify whether bottlenecks are in:

  • Our code (parsing, SQLite)
  • Network (latency to ES)
  • ES processing (inference, indexing)

What Exists Today

// metrics.ts - Current metrics
ParserMetrics:
   filesProcessed (counter)
   filesFailed (counter)
   chunksCreated (counter)
   chunkSize (histogram)
   parseDuration         // MISSING
   enqueueDuration       // MISSING

QueueMetrics:
   documentsEnqueued (counter)
   documentsDequeued (counter)
   enqueueDuration       // MISSING
   dequeueDuration       // MISSING

IndexerMetrics:
   batchDuration (histogram)  // Total only
   batchSize (histogram)
   esRequestDuration     // MISSING - ES call breakdown
   commitDuration        // MISSING

Proposed Solution

Add timing histograms for each phase:

Proposed: Per-Phase Timing
══════════════════════════

Producer Phase:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  [Read File]  →  [Parse AST]  →  [Extract]  →  [Enqueue]        │
│                                                                 │
│  fileReadDuration parseDuration extractDuration enqueueDuration │
│    (histogram)     (histogram)    (histogram)     (histogram)   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Worker Phase:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  [Dequeue]  →  [ES Bulk Request]  →  [Commit]                   │
│                                                                 │
│  dequeueDuration   esRequestDuration   commitDuration           │
│   (histogram)        (histogram)        (histogram)             │
│                                                                 │
│                          │                                      │
│                          ▼                                      │
│              (includes network + inference)                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Implementation

New Metrics

// Producer timing
producer.fileReadDuration    // Time to read file from disk
producer.parseDuration       // Time for tree-sitter/parser
producer.extractDuration     // Time for symbol extraction
producer.enqueueDuration     // Time for SQLite INSERT

// Worker timing  
worker.dequeueDuration       // Time for SQLite SELECT + UPDATE
worker.esRequestDuration     // Time for ES bulk API call
worker.commitDuration        // Time for SQLite DELETE
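Each of these histograms could be fed by a small shared timing wrapper rather than ad-hoc `performance.now()` calls at every site. A minimal sketch, assuming a simple `record`-style histogram interface (the actual metrics API may differ):

```typescript
// Histogram stand-in; the real metrics backend is an assumption here.
type Histogram = { record: (valueMs: number) => void };

// Wrap any async phase and record its wall-clock duration.
// The `finally` block ensures failed phases are still measured.
async function timed<T>(histogram: Histogram, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    histogram.record(performance.now() - start);
  }
}
```

Usage would then be uniform across phases, e.g. `const batch = await timed(worker.dequeueDuration, () => queue.dequeue())`.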

Instrumentation Points

File                  Add Timing For
─────────────────────────────────────────────────────
producer_worker.ts    File read, parse, extract
sqlite_queue.ts       Enqueue, dequeue, commit operations
indexer_worker.ts     ES request (wrap indexCodeChunks call)
elasticsearch.ts      Optionally: internal bulk timing
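For the worker phase, the three instrumentation points could look like the sketch below. The `Queue` shape and the `indexCodeChunks` signature are assumptions inferred from the table above, and the histogram objects are the metrics proposed earlier; actual signatures may differ.

```typescript
// Histogram stand-in; the real metrics backend is an assumption.
interface Recorder { record(ms: number): void }

interface WorkerTimings {
  dequeueDuration: Recorder;
  esRequestDuration: Recorder;
  commitDuration: Recorder;
}

// Hypothetical queue shape, sketched from the instrumentation table.
interface Queue<T> {
  dequeue(): Promise<T[]>;
  commit(batch: T[]): Promise<void>;
}

async function processBatch<T>(
  metrics: WorkerTimings,
  queue: Queue<T>,
  indexCodeChunks: (batch: T[]) => Promise<void>,
): Promise<void> {
  let t = performance.now();
  const batch = await queue.dequeue();          // SQLite SELECT + UPDATE
  metrics.dequeueDuration.record(performance.now() - t);

  t = performance.now();
  await indexCodeChunks(batch);                 // ES bulk call (network + inference)
  metrics.esRequestDuration.record(performance.now() - t);

  t = performance.now();
  await queue.commit(batch);                    // SQLite DELETE
  metrics.commitDuration.record(performance.now() - t);
}
```

This keeps batchDuration untouched while attributing each slice of it to a named phase.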

Example Output

After implementation, logs/metrics would show:

Bottleneck Analysis (P95 timings):
──────────────────────────────────
Phase               P95      % of Total
─────────────────────────────────────────
File Read           5ms         0.1%
Parse              15ms         0.3%
Extract            10ms         0.2%
Enqueue            20ms         0.4%
Dequeue            25ms         0.5%
ES Request       4800ms        96.0%  ◄── BOTTLENECK
Commit             25ms         0.5%
─────────────────────────────────────────

Recommendation: ES inference is the bottleneck.
Consider: Batch size tuning, ES scaling, or async inference.
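The P95 column in a report like the one above can be derived from raw duration samples. A minimal nearest-rank sketch; a real histogram backend (e.g. OpenTelemetry) would estimate percentiles from buckets instead:

```typescript
// Nearest-rank P95 over raw duration samples, in milliseconds.
function p95(samples: number[]): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length) - 1; // nearest-rank index
  return sorted[Math.min(rank, sorted.length - 1)];
}
```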

Acceptance Criteria

  • Producer timing metrics: fileReadDuration, parseDuration, extractDuration, enqueueDuration
  • Worker timing metrics: dequeueDuration, esRequestDuration, commitDuration
  • All timing metrics are histograms (for percentile analysis)
  • Metrics include repo/branch labels for filtering
  • Existing batchDuration preserved for backward compatibility
  • Documentation updated with new metric names
