perf: trigram index v2 — integer postings, delta compression, query planner #142
Summary
Redesign the trigram index for better build time, memory use, and query planning. The current implementation uses string-keyed HashMaps with no compression and no query optimization.
Tier 1 — Highest Leverage
1. Integer-only postings
Replace HashMap(Trigram, StringHashMap(PostingMask)) with HashMap(Trigram, SortedDocIdList). Store doc IDs (u32) instead of string paths. Path lookup via separate doc_id -> path table.
- Better cache locality
- Smaller postings
- Faster intersections
- Less allocator churn
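A minimal sketch of the integer-only layout (Python for illustration; the real structures are in the codebase's systems language, and `TrigramIndex`, `add_file`, `lookup` are hypothetical names):

```python
from bisect import insort

class TrigramIndex:
    def __init__(self):
        self.postings = {}  # trigram -> sorted list of integer doc IDs
        self.paths = []     # doc_id -> path table; doc_id is the list index

    def add_file(self, path, content):
        doc_id = len(self.paths)
        self.paths.append(path)
        # Deduplicate trigrams per file, then insert the doc ID in sorted order.
        for tri in {content[i:i + 3] for i in range(len(content) - 2)}:
            insort(self.postings.setdefault(tri, []), doc_id)
        return doc_id

    def lookup(self, trigram):
        # Resolve doc IDs back to paths only at the end of a query,
        # so intersections stay integer-only.
        return [self.paths[d] for d in self.postings.get(trigram, [])]
```

Intersections then operate on sorted u32 runs instead of hashing path strings, which is where the cache-locality win comes from.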
2. Delta/block-compressed postings
Store sorted doc IDs as delta-encoded blocks instead of raw arrays.
[381, 994, 2001] → [381, +613, +1007] with varint encoding
- Block-based so you can skip/decode small chunks
- Reduces memory bandwidth (often the real bottleneck)
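A sketch of the delta + varint step, using the standard LEB128-style encoding (7 payload bits per byte, high bit = continuation); function names are illustrative and blocking is omitted:

```python
def encode_deltas(doc_ids):
    """Delta-encode a sorted doc-ID list, varint-encoding each gap."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        gap, prev = d - prev, d
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits + continuation flag
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_deltas(buf):
    ids, cur, acc, shift = [], 0, 0, 0
    for b in buf:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7          # more bytes follow for this gap
        else:
            cur += acc          # gap complete: add it to the running doc ID
            ids.append(cur)
            acc = shift = 0
    return ids
```

The example list from above, `[381, 994, 2001]`, encodes to 6 bytes instead of 12 as raw u32s.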
3. Query planner — smallest postings first
The current candidates() intersects trigram postings in arbitrary order. Instead:
- Estimate posting size for each trigram
- Start from rarest trigram
- Progressively intersect
- Abort early when candidate set is small enough
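The planner loop above, sketched (the `postings` shape and `cutoff` parameter are illustrative, not the real API):

```python
def candidates(postings, trigrams, cutoff=None):
    """Intersect postings rarest-first; bail early once the candidate
    set is small enough to just verify directly."""
    if any(t not in postings for t in trigrams):
        return set()  # a missing trigram means no file can match
    # Size estimate: exact len() here; a real index might store counts.
    ordered = sorted(trigrams, key=lambda t: len(postings[t]))
    result = set(postings[ordered[0]])
    for t in ordered[1:]:
        if cutoff is not None and len(result) <= cutoff:
            break  # small enough: stop intersecting, verify candidates
        result &= set(postings[t])
        if not result:
            break
    return result
```

Starting from the rarest trigram keeps every intermediate set no larger than the smallest posting list, which is what makes the later intersections cheap.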
4. Separate path/content/symbol indexes
Path queries and content queries behave differently. Keep separate trigram indexes for:
- File path trigrams (short, structured)
- File content trigrams (long, noisy)
- Symbol/name index (exact token lookup)
5. Incremental segment-based indexing
Instead of full rebuild:
- Track file content hash
- Only re-tokenize changed files
- Maintain segment-based index (LSM-tree style)
- Background compaction merges segments
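The hash-gating part of this can be sketched as follows (segment writing and compaction omitted; `IncrementalIndexer` and its fields are hypothetical names):

```python
import hashlib

class IncrementalIndexer:
    """Only re-tokenize files whose content hash changed since last build."""
    def __init__(self):
        self.hashes = {}     # path -> content hash from the previous build
        self.reindexed = []  # paths re-tokenized this round

    def update(self, path, content):
        h = hashlib.sha256(content.encode()).hexdigest()
        if self.hashes.get(path) == h:
            return False     # unchanged: skip tokenization entirely
        self.hashes[path] = h
        self.reindexed.append(path)  # a real index would emit a new segment
        return True
```

Changed files land in small fresh segments; queries consult all live segments, and background compaction merges them LSM-style so the segment count stays bounded.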
Tier 2
- Hybrid sparse list + Roaring bitmap for dense trigrams
- Thread-local build then merge (avoid global map contention)
- Mmap immutable segments (fast startup, OS page cache)
- Arena allocators / contiguous memory pools
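The hybrid-representation idea from Tier 2, sketched with a Python int as a stand-in bitset for Roaring (the cutoff value is a made-up placeholder to tune against real corpora):

```python
DENSE_CUTOFF = 4096  # hypothetical threshold for switching representations

def make_posting(doc_ids):
    """Sorted list when sparse, bitmap when dense."""
    if len(doc_ids) < DENSE_CUTOFF:
        return ("sparse", sorted(doc_ids))
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return ("dense", bits)

def intersect(a, b):
    if a[0] == "dense" and b[0] == "dense":
        return ("dense", a[1] & b[1])   # one AND over machine words
    if a[0] == "sparse" and b[0] == "sparse":
        return ("sparse", sorted(set(a[1]) & set(b[1])))
    sparse, dense = (a, b) if a[0] == "sparse" else (b, a)
    # Mixed case: probe the bitmap for each sparse doc ID.
    return ("sparse", [d for d in sparse[1] if dense[1] >> d & 1])
```

Dense trigrams (e.g. `the`, `ing` in content) get O(words) intersection; rare trigrams stay as compact lists.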
Tier 3
- SIMD intersection/decode
- Coarse shard/block-level indexes
- Bloom/sketch prefilters per file block
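For the prefilter bullet, a tiny per-block Bloom sketch (sizes and hash scheme are illustrative; md5 is used only for deterministic slot selection, not security):

```python
import hashlib

class BlockBloom:
    """Bloom-style prefilter per file block: false 'maybe present' is
    possible, false 'absent' is not, so negative blocks skip cheaply."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _slots(self, trigram):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{trigram}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, trigram):
        for s in self._slots(trigram):
            self.field |= 1 << s

    def might_contain(self, trigram):
        return all(self.field >> s & 1 for s in self._slots(trigram))
```

A query first probes each block's filter for its trigrams and only decodes postings for blocks that answer "maybe".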
Mental model
speed = fewer candidates x cheaper intersections x less memory movement
Not: smarter compression from papers. Don't SIMD your way out of a bad data structure.