Skip to content

perf: trigram index v2 — integer postings, delta compression, query planner #142

@justrach

Description

@justrach

Summary

Redesign the trigram index for better build time, memory, and query planning. Current implementation uses string-heavy HashMaps with no compression or query optimization.

Tier 1 — Highest Leverage

1. Integer-only postings

Replace HashMap(Trigram, StringHashMap(PostingMask)) with HashMap(Trigram, SortedDocIdList). Store doc IDs (u32) instead of string paths. Path lookup via separate doc_id -> path table.

  • Better cache locality
  • Smaller postings
  • Faster intersections
  • Less allocator churn

2. Delta/block-compressed postings

Store sorted doc IDs as delta-encoded blocks instead of raw arrays.

  • [381, 994, 2001] → [381, +613, +1007] with varint encoding
  • Block-based so you can skip/decode small chunks
  • Reduces memory bandwidth (often the real bottleneck)

3. Query planner — smallest postings first

Current candidates() intersects trigrams in arbitrary order. Instead:

  • Estimate posting size for each trigram
  • Start from rarest trigram
  • Progressively intersect
  • Abort early when candidate set is small enough

4. Separate path/content/symbol indexes

Path queries and content queries behave differently. Keep separate trigram indexes for:

  • File path trigrams (short, structured)
  • File content trigrams (long, noisy)
  • Symbol/name index (exact token lookup)

5. Incremental segment-based indexing

Instead of full rebuild:

  • Track file content hash
  • Only re-tokenize changed files
  • Maintain segment-based index (LSM-tree style)
  • Background compaction merges segments

Tier 2

  • Hybrid sparse list + Roaring bitmap for dense trigrams
  • Thread-local build then merge (avoid global map contention)
  • Mmap immutable segments (fast startup, OS page cache)
  • Arena allocators / contiguous memory pools

Tier 3

  • SIMD intersection/decode
  • Coarse shard/block-level indexes
  • Bloom/sketch prefilters per file block

Mental model

speed = fewer candidates x cheaper intersections x less memory movement

Not: smarter compression papers. Don't SIMD your way out of a bad data structure.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions