perf: trigram index v2 — 36% faster indexing, 59% less CPU#144
perf: trigram index v2 — 36% faster indexing, 59% less CPU#144
Conversation
10 regression tests covering: - Find all matching files (no false negatives) - No false positives - Intersection narrows correctly - File removal cleans index - Re-indexing replaces old content - Disk roundtrip preserves results - 500-file scale test - Short query fallback - Word index alongside trigram These tests must pass before and after any trigram index changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace string-based postings (StringHashMap per trigram) with integer doc ID postings (sorted ArrayList(DocPosting) with binary search). Changes: - DocPosting struct: doc_id (u32) + next_mask + loc_mask - PostingList with sorted insertion and binary search lookup - path_to_id / id_to_path mappings for string resolution - candidates() intersects by u32 doc_id, resolves to paths at end - All disk persistence updated to use new structure Benchmark (2000 files): - Memory: 31MB → 27MB (-14%) - Selective search: 331µs → 307µs (-7%) - Index time: unchanged - All regression tests pass Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of calling getOrAddPosting (binary search + sorted insert) for every byte position in the file, accumulate masks in a local HashMap first, then bulk-append one posting per trigram to the global index. This eliminates O(n) array shifts per byte position and reduces the inner loop to pure HashMap lookups (O(1) amortized). Benchmark (5200 files, 200 with 200 lines each): - Index time: 475ms → 306ms (-36%) - User CPU: 0.28s → 0.12s (-57%) - I/O unchanged (expected) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two more optimizations: 1. Skip pure-whitespace trigrams during indexing — these are 12% of all trigram occurrences but terrible filters (match nearly every file). Reduces index size and build time. 2. Replace HashMap-based intersection with sorted merge intersection. Since posting lists are sorted by doc_id, we can intersect in O(n+m) with zero allocations instead of O(n*log(m)) with HashMap overhead. Combined benchmark (5200 files): - Index: 480ms → 306ms (-36%) - CPU: 0.29s → 0.12s (-59%) - Dense search: 763µs → 280µs (-63%) - Selective search: unchanged (already fast) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d34138d67b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if ((c0 == ' ' or c0 == '\t' or c0 == '\n' or c0 == '\r') and | ||
| (c1 == ' ' or c1 == '\t' or c1 == '\n' or c1 == '\r') and | ||
| (c2 == ' ' or c2 == '\t' or c2 == '\n' or c2 == '\r')) continue; |
There was a problem hiding this comment.
Preserve whitespace trigrams in index
Skipping all-whitespace trigrams during indexing introduces false negatives because query trigram extraction still includes them. For example, searching for strings like "foo bar" (or any query containing three consecutive whitespace chars) will hit candidates()/candidatesRegex() with a trigram that was never indexed, so the candidate set becomes empty even when matching files exist. This is a correctness regression from the previous behavior.
Useful? React with 👍 / 👎.
| } | ||
| trigrams.deinit(self.allocator); | ||
| _ = self.file_trigrams.remove(path); | ||
| _ = self.path_to_id.remove(path); |
There was a problem hiding this comment.
Reuse doc IDs when reindexing an existing path
indexFile() always calls removeFile() first, and removeFile() deletes the path_to_id entry while leaving id_to_path append-only. That means every reindex of the same file allocates a brand-new doc ID and appends another path pointer, so normal edit/save churn causes unbounded id_to_path growth and eventually risks u32 doc ID exhaustion in long-lived processes.
Useful? React with 👍 / 👎.
Summary
Addresses #142 — Trigram index v2 with integer doc IDs, batch processing, and optimized intersection.
Optimizations implemented
Combined benchmark (5200 files, 200 with 200 lines each)
Regression suite
10 new tests covering all correctness invariants:
Test plan
Tracked in #142. Architecture plan from @justrach.