perf: mmap-backed trigram index for instant startup and lower RSS #164

@justrach

Problem

The trigram index is currently loaded from disk into heap-allocated HashMaps on every startup. For a 5k-file repo:

  • Disk: ~52MB (trigram.postings + trigram.lookup)
  • In-memory: ~160MB (HashMap overhead, string keys, pointer chasing)
  • Load time: ~200ms (read + deserialize + build HashMaps)

After releaseContents, the trigram index is the single largest remaining memory consumer.

Proposed Solution

Memory-map the trigram index files instead of deserializing into HashMaps. The disk format already has a sorted lookup table — binary search on mmap'd data gives O(log n) lookups with zero deserialization.

Current disk format (already in src/index.zig)

Postings file (trigram.postings): Header (51 bytes) + file table + DiskPosting entries (8 bytes each: file_id u32, next_mask u8, loc_mask u8, 2 bytes padding)

Lookup file (trigram.lookup): Header (12 bytes) + sorted LookupEntry entries (12 bytes each: trigram u32, offset u32, count u32)
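
To make the record sizes above concrete, here is a minimal sketch of the two fixed-size records using Python's `struct` module. It is an illustration only, not the Zig definitions in src/index.zig. The little-endian byte order and the `pack_trigram` helper (how three bytes are packed into the u32) are assumptions; the issue does not specify either.

```python
import struct

# Assumption: little-endian fields; the issue does not state the byte order.
DISK_POSTING = struct.Struct("<IBB2x")  # file_id u32, next_mask u8, loc_mask u8, 2 pad bytes
LOOKUP_ENTRY = struct.Struct("<III")    # trigram u32, offset u32, count u32


def pack_trigram(tri: bytes) -> int:
    """Hypothetical packing: three bytes into the low 24 bits of a u32."""
    if len(tri) != 3:
        raise ValueError("a trigram is exactly three bytes")
    return (tri[0] << 16) | (tri[1] << 8) | tri[2]


# Round-trip one record of each kind.
entry_bytes = LOOKUP_ENTRY.pack(pack_trigram(b"han"), 0, 2)
posting_bytes = DISK_POSTING.pack(7, 0b0000_0001, 0b0000_0010)

entry = LOOKUP_ENTRY.unpack(entry_bytes)        # (trigram, offset, count)
posting = DISK_POSTING.unpack(posting_bytes)    # (file_id, next_mask, loc_mask)
```

Because both records have a fixed size, entry `i` sits at a computable byte offset, which is what allows binary search directly on the mapped file.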

mmap query flow

  1. mmap both files (lazy — OS handles paging)
  2. For query "handleRemote":
    • Extract trigrams: "han", "and", "ndl", "dle", ...
    • For each trigram: binary search the sorted LookupEntry array (O(log n), ~17 comparisons for 100k entries)
    • Read DiskPosting entries at the offset (zero-copy from mmap)
    • Collect and intersect file_ids (sorted merge)
    • Resolve file_ids to paths via file table
  3. Return candidate paths
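
The query flow above can be sketched end to end in Python. This is a model of the idea under stated assumptions, not the proposed Zig code: byte order is assumed little-endian, `pack_trigram` is a hypothetical packing, `offset` is assumed to count DiskPosting entries rather than bytes, the file table is replaced by a plain `paths` list, and set intersection stands in for the sorted merge.

```python
import mmap
import os
import struct
import tempfile

LOOKUP_HEADER = 12                      # bytes, per the format above
POSTINGS_HEADER = 51                    # bytes, per the format above
LOOKUP_ENTRY = struct.Struct("<III")    # trigram, offset, count (byte order assumed)
DISK_POSTING = struct.Struct("<IBB2x")  # file_id, next_mask, loc_mask, 2 pad bytes


def pack_trigram(tri: bytes) -> int:
    # Hypothetical packing of three bytes into a u32.
    return (tri[0] << 16) | (tri[1] << 8) | tri[2]


def trigrams(text: bytes):
    # Every overlapping 3-byte window.
    return [text[i:i + 3] for i in range(len(text) - 2)]


def find_entry(lookup, key: int):
    # Binary search over the sorted LookupEntry array, read in place.
    lo, hi = 0, (len(lookup) - LOOKUP_HEADER) // LOOKUP_ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        tri, offset, count = LOOKUP_ENTRY.unpack_from(
            lookup, LOOKUP_HEADER + mid * LOOKUP_ENTRY.size)
        if tri == key:
            return offset, count
        if tri < key:
            lo = mid + 1
        else:
            hi = mid
    return None


def candidates(lookup, postings, base: int, paths, query: bytes):
    # base: byte offset of the first DiskPosting (header + file table).
    result = None
    for tri in trigrams(query):
        hit = find_entry(lookup, pack_trigram(tri))
        if hit is None:
            return []  # a trigram missing from the index rules out every file
        offset, count = hit
        ids = {DISK_POSTING.unpack_from(postings, base + (offset + i) * DISK_POSTING.size)[0]
               for i in range(count)}
        result = ids if result is None else result & ids
    return [paths[i] for i in sorted(result or ())]


def demo(query: bytes):
    # Build a two-file index on disk, then answer the query from mmap'd files.
    paths = ["a.zig", "b.zig"]
    docs = [b"handleRemote", b"handleLocal"]
    index = {}
    for fid, text in enumerate(docs):
        for tri in trigrams(text):
            index.setdefault(pack_trigram(tri), set()).add(fid)
    lookup_bytes = bytearray(LOOKUP_HEADER)     # header left as zeros
    posting_bytes = bytearray(POSTINGS_HEADER)  # file table omitted in this sketch
    n = 0
    for tri in sorted(index):
        fids = sorted(index[tri])
        lookup_bytes += LOOKUP_ENTRY.pack(tri, n, len(fids))
        for fid in fids:
            posting_bytes += DISK_POSTING.pack(fid, 0, 0)
        n += len(fids)
    with tempfile.TemporaryDirectory() as d:
        lp, pp = os.path.join(d, "trigram.lookup"), os.path.join(d, "trigram.postings")
        with open(lp, "wb") as f:
            f.write(lookup_bytes)
        with open(pp, "wb") as f:
            f.write(posting_bytes)
        with open(lp, "rb") as lf, open(pp, "rb") as pf, \
                mmap.mmap(lf.fileno(), 0, access=mmap.ACCESS_READ) as lookup, \
                mmap.mmap(pf.fileno(), 0, access=mmap.ACCESS_READ) as postings:
            return candidates(lookup, postings, POSTINGS_HEADER, paths, query)
```

Calling `demo(b"handleRemote")` returns only the first path, while `demo(b"handle")` returns both, since every trigram of "handle" occurs in both sample files. Nothing is deserialized up front; `unpack_from` reads each record straight out of the mapping.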

Memory savings

  • Current (HashMap): ~160MB RSS, ~200ms startup
  • mmap: ~0MB heap (touched pages are file-backed, live in the OS page cache, and can be evicted under memory pressure), ~1ms startup (just the mmap syscalls)
  • Trade-off: O(log n) binary search per trigram vs O(1) hash lookup — negligible for typical query sizes
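
As a rough check on the trade-off, the worst-case number of binary search probes can be computed directly. The sketch below uses the 100k-entry figure and the "handleRemote" query from this issue; the function names are illustrative only.

```python
def worst_case_probes(n_entries: int) -> int:
    # A binary search over n sorted entries needs at most ceil(log2(n + 1))
    # probes, which equals the bit length of n.
    return n_entries.bit_length()


def trigram_count(query: str) -> int:
    # Number of overlapping 3-character windows in the query.
    return max(len(query) - 2, 0)


n = 100_000
query = "handleRemote"
probes = worst_case_probes(n)                 # per trigram
total = probes * trigram_count(query)         # for the whole query
lookup_table_bytes = 12 + n * 12              # 12-byte header + 12-byte entries
```

For 100k entries this gives 17 probes per trigram and 170 for the ten trigrams of "handleRemote", over a lookup table of about 1.2MB.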

Implementation plan

  1. Add MmapTrigramIndex struct in src/index.zig with candidates() via binary search
  2. In scanBg: after writeToDisk, swap heap index for mmap index
  3. In Explorer: support both index types (tagged union or runtime dispatch)
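
One way to picture step 3 is runtime dispatch over a shared `candidates()` method. The Python sketch below is only a model of that design: `HeapIndex`, `MmapIndex`, `swap_index`, and `search` are hypothetical stand-ins (backed by in-memory data here), and a real swap would also need to free the old index and guard against concurrent queries.

```python
import bisect


class HeapIndex:
    """Stand-in for the current HashMap-backed index."""

    def __init__(self, table):
        self.table = table  # trigram -> set of paths

    def candidates(self, trigram):
        return sorted(self.table.get(trigram, ()))


class MmapIndex:
    """Stand-in for the proposed index: sorted keys plus binary search."""

    def __init__(self, table):
        self.keys = sorted(table)
        self.values = [sorted(table[k]) for k in self.keys]

    def candidates(self, trigram):
        i = bisect.bisect_left(self.keys, trigram)
        if i < len(self.keys) and self.keys[i] == trigram:
            return list(self.values[i])
        return []


class Explorer:
    def __init__(self, index):
        self.index = index

    def swap_index(self, new_index):
        # Freeing the old index and synchronizing with in-flight queries
        # are out of scope for this sketch.
        self.index = new_index

    def search(self, trigram):
        return self.index.candidates(trigram)


table = {"han": {"a.zig", "b.zig"}, "leR": {"a.zig"}}
explorer = Explorer(HeapIndex(table))
before = [explorer.search(t) for t in ("han", "leR", "zzz")]
explorer.swap_index(MmapIndex(table))   # what scanBg would do after writeToDisk
after = [explorer.search(t) for t in ("han", "leR", "zzz")]
```

Callers of `search` see identical results before and after the swap, which is the property the first test case in this issue asks for.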

Interaction with existing optimizations

Files to modify

  • src/index.zig — add MmapTrigramIndex
  • src/explore.zig — support index type swap
  • src/main.zig — swap to mmap in scanBg

Test cases

  • mmap candidates returns same results as in-memory index
  • Binary search on sorted lookup table
  • Handles missing/corrupt files gracefully
  • File table path resolution
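
For the missing/corrupt case, a minimal sketch of the kind of validation that could run before trusting a mapped lookup file. `open_lookup` is a hypothetical helper, little-endian byte order is assumed, and the caller is assumed to fall back to the heap index when it returns `None`. Note that the sortedness scan touches every page, so a real implementation might skip it or rely on a checksum instead.

```python
import mmap
import os
import struct
import tempfile

LOOKUP_HEADER = 12                    # bytes, per the format in this issue
LOOKUP_ENTRY = struct.Struct("<III")  # trigram, offset, count (byte order assumed)


def open_lookup(path):
    """Return a read-only mmap of the lookup file, or None if it is unusable."""
    try:
        with open(path, "rb") as f:
            body = os.fstat(f.fileno()).st_size - LOOKUP_HEADER
            if body < 0 or body % LOOKUP_ENTRY.size != 0:
                return None  # truncated header or partial trailing entry
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except OSError:
        return None  # missing or unreadable file
    prev = -1
    for off in range(LOOKUP_HEADER, len(m), LOOKUP_ENTRY.size):
        tri = LOOKUP_ENTRY.unpack_from(m, off)[0]
        if tri <= prev:
            m.close()
            return None  # not strictly sorted, so binary search would be unsound
        prev = tri
    return m


def check_demo():
    with tempfile.TemporaryDirectory() as d:
        missing = open_lookup(os.path.join(d, "nope.lookup"))
        bad = os.path.join(d, "bad.lookup")
        with open(bad, "wb") as f:
            f.write(bytes(LOOKUP_HEADER + 5))  # ends in a partial entry
        good = os.path.join(d, "good.lookup")
        with open(good, "wb") as f:
            f.write(bytes(LOOKUP_HEADER)
                    + LOOKUP_ENTRY.pack(1, 0, 1)
                    + LOOKUP_ENTRY.pack(2, 1, 1))
        truncated = open_lookup(bad)
        m = open_lookup(good)
        ok = m is not None
        if m is not None:
            m.close()  # release the mapping before the directory is removed
        return missing, truncated, ok
```

`check_demo()` exercises the three cases: a missing file, a file ending in a partial entry, and a well-formed file with two sorted entries.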

Why this matters

  • Startup: MCP server is ready to serve queries after ~1ms instead of ~200ms
  • Memory: ~160MB freed — data lives in OS page cache
  • Scalability: 100k+ file repos without RAM concerns
