perf: mmap-backed trigram index for instant startup and lower RSS #164
Closed
Labels
priority:p2 (Medium priority)
Description
Problem
The trigram index is currently loaded from disk into heap-allocated HashMaps on every startup. For a 5k-file repo:
- Disk: ~52MB (trigram.postings + trigram.lookup)
- In-memory: ~160MB (HashMap overhead, string keys, pointer chasing)
- Load time: ~200ms (read + deserialize + build HashMaps)
After releaseContents, the trigram index is the single largest remaining memory consumer.
Proposed Solution
Memory-map the trigram index files instead of deserializing into HashMaps. The disk format already has a sorted lookup table — binary search on mmap'd data gives O(log n) lookups with zero deserialization.
Current disk format (already in src/index.zig)
Postings file (trigram.postings): Header (51 bytes) + file table + DiskPosting entries (8 bytes each: file_id u32, next_mask u8, loc_mask u8, pad 2)
Lookup file (trigram.lookup): Header (12 bytes) + sorted LookupEntry entries (12 bytes each: trigram u32, offset u32, count u32)
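As a quick sanity check on the record sizes above, here is a minimal sketch of the two layouts using Python's struct module (Python only to keep the example short and runnable; the real code lives in src/index.zig). Little-endian byte order and the pack_trigram key encoding are assumptions, since the issue specifies neither.

```python
import struct

# Assumption: records are little-endian and packed with no implicit padding.
DISK_POSTING = struct.Struct("<IBB2x")  # file_id u32, next_mask u8, loc_mask u8, 2 pad bytes
LOOKUP_ENTRY = struct.Struct("<III")    # trigram u32, offset u32, count u32


def pack_trigram(tri: bytes) -> int:
    """Pack three bytes into a u32 key.

    Hypothetical encoding: first byte in the highest position, so integer
    order matches lexicographic order of the trigram bytes.
    """
    if len(tri) != 3:
        raise ValueError("a trigram is exactly three bytes")
    return (tri[0] << 16) | (tri[1] << 8) | tri[2]


# One posting and one lookup entry, round-tripped through their byte layouts.
posting = DISK_POSTING.pack(7, 0b0000_0101, 0b0000_0001)
entry = LOOKUP_ENTRY.pack(pack_trigram(b"han"), 0, 1)

print(DISK_POSTING.size, LOOKUP_ENTRY.size)  # 8 12
print(DISK_POSTING.unpack(posting))          # (7, 5, 1)
```

Both records are fixed-size, which is what makes index arithmetic (base + i * size) and binary search directly on the mapped bytes possible.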
mmap query flow
- mmap both files (lazy — OS handles paging)
- For query "handleRemote":
- Extract trigrams: "han", "and", "ndl", "dle", ...
- For each trigram: binary search the sorted LookupEntry array (O(log n), ~17 comparisons for 100k entries)
- Read DiskPosting entries at the offset (zero-copy from mmap)
- Collect and intersect file_ids (sorted merge)
- Resolve file_ids to paths via file table
- Return candidate paths
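The flow above can be sketched roughly as follows, in Python for brevity rather than Zig. The functions operate on any bytes-like buffer, so they work the same on an mmap object. Several details are assumptions not stated in the issue: little-endian records, the trigram packing, and offset being an entry index into the postings array (it could equally be a byte offset). Path resolution is left out because the file table layout is not described here.

```python
import struct

LOOKUP_ENTRY = struct.Struct("<III")    # trigram, offset, count (assumed little-endian)
DISK_POSTING = struct.Struct("<IBB2x")  # file_id, next_mask, loc_mask, 2 pad bytes


def trigram_keys(query: bytes) -> list[int]:
    """Distinct trigrams of the query, packed as u32 keys (hypothetical packing)."""
    return sorted({(query[i] << 16) | (query[i + 1] << 8) | query[i + 2]
                   for i in range(len(query) - 2)})


def find_entry(lookup, base: int, n: int, key: int):
    """Binary search n sorted LookupEntry records starting at byte `base`."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        tri, offset, count = LOOKUP_ENTRY.unpack_from(lookup, base + mid * LOOKUP_ENTRY.size)
        if tri < key:
            lo = mid + 1
        elif tri > key:
            hi = mid
        else:
            return offset, count
    return None


def posting_ids(postings, base: int, offset: int, count: int) -> list[int]:
    """Read file_ids straight from the buffer; `offset` is assumed to be an entry index."""
    start = base + offset * DISK_POSTING.size
    return [DISK_POSTING.unpack_from(postings, start + i * DISK_POSTING.size)[0]
            for i in range(count)]


def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    """Sorted-merge intersection of two ascending id lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def candidates(lookup, lookup_base, n_entries, postings, postings_base, query: bytes):
    """file_ids that contain every trigram of `query` (empty if any trigram is absent)."""
    result = None
    for key in trigram_keys(query):
        hit = find_entry(lookup, lookup_base, n_entries, key)
        if hit is None:
            return []
        ids = posting_ids(postings, postings_base, *hit)
        result = ids if result is None else intersect_sorted(result, ids)
        if not result:
            return []
    return result or []
```

As a simplification, queries shorter than three bytes return no candidates here; a real implementation would more likely fall back to scanning all files.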
Memory savings
- Current (HashMap): ~160MB RSS, ~200ms startup
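- mmap: ~0MB heap (data stays in the OS page cache; pages touched by queries still count toward RSS, but they are file-backed and reclaimable under memory pressure), ~1ms startup (just the mmap syscalls)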
- Trade-off: O(log n) binary search per trigram vs O(1) hash lookup — negligible for typical query sizes
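To put a number on the trade-off, here is a back-of-the-envelope calculation in Python. It assumes a plain binary search with one probe per iteration and uses the 100k-entry figure and the "handleRemote" query from this issue; the function names are illustrative only.

```python
def max_probes(n_entries: int) -> int:
    """Worst-case probes for a binary search over n sorted entries: floor(log2 n) + 1."""
    return n_entries.bit_length()


def trigram_count(query: str) -> int:
    """Number of overlapping trigrams in a query (0 for queries under 3 chars)."""
    return max(len(query) - 2, 0)


def query_probes(query: str, n_entries: int) -> int:
    """Upper bound on lookup-table probes for one query."""
    return trigram_count(query) * max_probes(n_entries)


print(max_probes(100_000))                    # 17
print(query_probes("handleRemote", 100_000))  # 170
```

At 12 bytes per entry, a 100k-entry lookup table is about 1.2MB, so these probes land in a small region that should stay resident once it has been touched.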
Implementation plan
- Add MmapTrigramIndex struct in src/index.zig with candidates() via binary search
- In scanBg: after writeToDisk, swap heap index for mmap index
- In Explorer: support both index types (tagged union or runtime dispatch)
Interaction with existing optimizations
- Integer doc IDs (#142): the on-disk format already uses u32 file_ids, so mmap is a natural fit
- Batch-accumulate: still used during index building, then swapped to mmap
- releaseContents: still frees file content — mmap replaces the trigram index
- Sorted merge intersection: works on mmap postings since they are sorted
Files to modify
- src/index.zig: add MmapTrigramIndex
- src/explore.zig: support index type swap
- src/main.zig: swap to mmap in scanBg
Test cases
- mmap candidates returns same results as in-memory index
- Binary search on sorted lookup table
- Handles missing/corrupt files gracefully
- File table path resolution
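For the missing/corrupt case, one possible shape of the check is sketched below in Python (not the Zig implementation). It assumes the 12-byte header and 12-byte entries described above and a little-endian trigram field. The names open_lookup and lookup_is_sorted are hypothetical, and no header fields are validated because their layout is not given in this issue.

```python
import mmap
import os
import struct

LOOKUP_HEADER_SIZE = 12  # from the format description in this issue
LOOKUP_ENTRY_SIZE = 12   # trigram u32, offset u32, count u32


def open_lookup(path):
    """Return (mapping, entry_count), or None if the file is missing or malformed."""
    try:
        f = open(path, "rb")
    except OSError:
        return None  # missing or unreadable: caller can fall back to the heap index
    with f:
        size = os.fstat(f.fileno()).st_size
        body = size - LOOKUP_HEADER_SIZE
        # Reject files shorter than the header or with a partial trailing entry.
        if body < 0 or body % LOOKUP_ENTRY_SIZE != 0:
            return None
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mm, body // LOOKUP_ENTRY_SIZE


def lookup_is_sorted(mm, n_entries: int) -> bool:
    """Check that trigram keys are strictly increasing (assumes little-endian u32 keys)."""
    prev = -1
    for i in range(n_entries):
        (tri,) = struct.unpack_from("<I", mm, LOOKUP_HEADER_SIZE + i * LOOKUP_ENTRY_SIZE)
        if tri <= prev:
            return False
        prev = tri
    return True
```

The sortedness scan touches every page of the lookup file, so it suits a test or a debug assertion better than the startup path.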
Why this matters
- Startup: MCP server is ready to serve queries in ~1ms instead of ~200ms
- Memory: ~160MB freed — data lives in OS page cache
- Scalability: 100k+ file repos without RAM concerns