
perf: trigram index v2 — 36% faster indexing, 59% less CPU#144

Merged
justrach merged 5 commits into main from feat/142-trigram-v2 on Apr 5, 2026

Conversation

@justrach (Owner) commented Apr 5, 2026

Summary

Addresses #142 — Trigram index v2 with integer doc IDs, batch processing, and optimized intersection.

Optimizations implemented

| Change | Index Time | CPU | Memory | Query |
|---|---|---|---|---|
| Integer doc IDs (u32 instead of string paths) | | | -14% | -7% |
| Batch-accumulate trigrams (local HashMap → bulk append) | -36% | -57% | | |
| Skip whitespace-only trigrams | | | smaller index | |
| Sorted merge intersection (O(n+m), zero alloc) | | | | -63% (dense) |
| Pre-size local HashMap | -4% | | | |

Combined benchmark (5200 files, 200 with 200 lines each)

| Metric | Before | After | Improvement |
|---|---|---|---|
| Index time | 481ms | 310ms | -36% |
| User CPU | 0.29s | 0.12s | -59% |
| Dense search ("void") | 763µs | 280µs | -63% |
| Selective search | 950µs | 1.0ms | same |
| Peak memory | 140MB | 135MB | -4% |

Regression suite

10 new tests covering all correctness invariants:

  • No false negatives or positives
  • Intersection narrows correctly
  • File removal/re-index
  • Disk roundtrip
  • 500-file scale test
  • Short query fallback
  • Word index compatibility
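
The central invariant behind these tests — no false negatives, i.e. the candidate set is always a superset of the true matches — can be expressed against any trigram index. A toy Rust sketch (the real implementation is Zig; all names here are hypothetical):

```rust
use std::collections::{HashMap, HashSet};

/// Toy trigram index: trigram -> set of doc IDs containing it.
fn build(docs: &[&str]) -> HashMap<[u8; 3], HashSet<usize>> {
    let mut idx: HashMap<[u8; 3], HashSet<usize>> = HashMap::new();
    for (id, doc) in docs.iter().enumerate() {
        for w in doc.as_bytes().windows(3) {
            idx.entry([w[0], w[1], w[2]]).or_default().insert(id);
        }
    }
    idx
}

/// Candidate set: intersection over the query's trigrams. A query
/// shorter than 3 bytes has no trigrams, so every doc is returned —
/// the "short query fallback" case in the test list above.
fn candidates(
    idx: &HashMap<[u8; 3], HashSet<usize>>,
    q: &str,
    n_docs: usize,
) -> HashSet<usize> {
    let mut result: HashSet<usize> = (0..n_docs).collect();
    for w in q.as_bytes().windows(3) {
        match idx.get(&[w[0], w[1], w[2]]) {
            Some(s) => result = result.intersection(s).copied().collect(),
            None => return HashSet::new(), // a missing trigram rules out all docs
        }
    }
    result
}
```

Extra candidates are acceptable (a verification pass filters them); a doc that contains the query but is missing from `candidates()` is a false negative and a bug.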

Test plan

  • All 10 regression tests pass
  • All existing tests pass
  • Release build succeeds
  • Benchmarked on 5200 files

Tracked in #142. Architecture plan from @justrach.

justrach and others added 5 commits April 5, 2026 09:46
10 regression tests covering:
- Find all matching files (no false negatives)
- No false positives
- Intersection narrows correctly
- File removal cleans index
- Re-indexing replaces old content
- Disk roundtrip preserves results
- 500-file scale test
- Short query fallback
- Word index alongside trigram

These tests must pass before and after any trigram index changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace string-based postings (StringHashMap per trigram) with integer
doc ID postings (sorted ArrayList(DocPosting) with binary search).

Changes:
- DocPosting struct: doc_id (u32) + next_mask + loc_mask
- PostingList with sorted insertion and binary search lookup
- path_to_id / id_to_path mappings for string resolution
- candidates() intersects by u32 doc_id, resolves to paths at end
- All disk persistence updated to use new structure

Benchmark (2000 files):
- Memory: 31MB → 27MB (-14%)
- Selective search: 331µs → 307µs (-7%)
- Index time: unchanged
- All regression tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
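
A minimal Rust sketch of the posting structure this commit describes (the actual implementation is Zig; field names follow the commit message, while the mask widths are assumptions):

```rust
/// One posting per document per trigram: a u32 doc ID plus two
/// bitmask fields (u64 widths assumed) used for position filtering.
#[derive(Clone, Copy)]
struct DocPosting {
    doc_id: u32,
    next_mask: u64,
    loc_mask: u64,
}

/// Posting list kept sorted by doc_id, so lookup is a binary search
/// and insertion preserves order.
struct PostingList {
    postings: Vec<DocPosting>,
}

impl PostingList {
    fn new() -> Self {
        Self { postings: Vec::new() }
    }

    /// Binary search by doc_id; returns the posting's index if present.
    fn find(&self, doc_id: u32) -> Option<usize> {
        self.postings.binary_search_by_key(&doc_id, |p| p.doc_id).ok()
    }

    /// Insert or merge, preserving sorted order (the "sorted insertion"
    /// the commit message refers to).
    fn upsert(&mut self, posting: DocPosting) {
        match self.postings.binary_search_by_key(&posting.doc_id, |p| p.doc_id) {
            Ok(i) => {
                self.postings[i].next_mask |= posting.next_mask;
                self.postings[i].loc_mask |= posting.loc_mask;
            }
            Err(i) => self.postings.insert(i, posting),
        }
    }
}
```

Integer postings are both smaller (one u32 instead of a heap string per entry) and faster to compare during intersection, which is where the -14% memory and -7% query numbers come from.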
Instead of calling getOrAddPosting (binary search + sorted insert) for
every byte position in the file, accumulate masks in a local HashMap
first, then bulk-append one posting per trigram to the global index.

This eliminates O(n) array shifts per byte position and reduces the
inner loop to pure HashMap lookups (O(1) amortized).

Benchmark (5200 files, 200 with 200 lines each):
- Index time: 475ms → 306ms (-36%)
- User CPU: 0.28s → 0.12s (-57%)
- I/O unchanged (expected)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
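
The batch-accumulation pattern described above, sketched in Rust (the real code is Zig; the packed u32 trigram key and the simplified mask bookkeeping are assumptions):

```rust
use std::collections::HashMap;

/// Pack three bytes into one u32 key (an assumed encoding).
fn trigram_key(c0: u8, c1: u8, c2: u8) -> u32 {
    (c0 as u32) << 16 | (c1 as u32) << 8 | c2 as u32
}

/// Phase 1: accumulate per-file trigram masks in a local HashMap —
/// the inner loop is pure O(1) amortized lookups, with no sorted
/// inserts and no O(n) array shifts per byte position. Pre-sizing
/// the map avoids rehashing (the "pre-size local HashMap" line item).
fn accumulate_trigrams(content: &[u8]) -> HashMap<u32, u64> {
    let mut local: HashMap<u32, u64> = HashMap::with_capacity(content.len());
    for (pos, win) in content.windows(3).enumerate() {
        let key = trigram_key(win[0], win[1], win[2]);
        // Fold the byte position into a mask (simplified placeholder
        // for the real next_mask / loc_mask bookkeeping).
        *local.entry(key).or_insert(0) |= 1u64 << (pos % 64);
    }
    local
    // Phase 2 (not shown): iterate `local` once and bulk-append a
    // single posting per trigram to the global index — one sorted
    // insert per distinct trigram instead of one per byte position.
}
```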
Two more optimizations:

1. Skip pure-whitespace trigrams during indexing — these are 12% of all
   trigram occurrences but terrible filters (match nearly every file).
   Reduces index size and build time.

2. Replace HashMap-based intersection with sorted merge intersection.
   Since posting lists are sorted by doc_id, we can intersect in O(n+m)
   with zero allocations instead of O(n*log(m)) with HashMap overhead.

Combined benchmark (5200 files):
- Index: 480ms → 306ms (-36%)
- CPU: 0.29s → 0.12s (-59%)
- Dense search: 763µs → 280µs (-63%)
- Selective search: unchanged (already fast)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
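
The sorted merge intersection is the classic two-pointer walk over two doc_id-sorted lists; a Rust sketch (the Zig implementation presumably intersects full postings, not bare IDs):

```rust
use std::cmp::Ordering;

/// Intersect two posting lists sorted by doc_id in O(n+m): each
/// pointer only ever advances, and there is no hashing at all.
fn intersect_sorted(a: &[u32], b: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(a.len().min(b.len()));
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```

This sketch makes one output allocation; the zero-allocation variant the commit describes would write results back into the candidate buffer in place.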

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d34138d67b


Comment on lines +350 to +352
```zig
if ((c0 == ' ' or c0 == '\t' or c0 == '\n' or c0 == '\r') and
    (c1 == ' ' or c1 == '\t' or c1 == '\n' or c1 == '\r') and
    (c2 == ' ' or c2 == '\t' or c2 == '\n' or c2 == '\r')) continue;
```

P2: Preserve whitespace trigrams in index

Skipping all-whitespace trigrams during indexing introduces false negatives because query trigram extraction still includes them. For example, searching for strings like "foo bar" (or any query containing three consecutive whitespace chars) will hit candidates()/candidatesRegex() with a trigram that was never indexed, so the candidate set becomes empty even when matching files exist. This is a correctness regression from the previous behavior.
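
Either reverting the skip or making query-side extraction apply the identical filter would restore the invariant. A Rust sketch of the latter option (hypothetical, not from the PR; the real code is Zig):

```rust
/// True when all three bytes are whitespace — the same predicate the
/// indexing loop uses to skip a trigram. Applying it during query
/// trigram extraction keeps index and query symmetric, so a skipped
/// all-whitespace trigram can never empty the candidate set.
fn is_whitespace_trigram(c0: u8, c1: u8, c2: u8) -> bool {
    let ws = |c: u8| matches!(c, b' ' | b'\t' | b'\n' | b'\r');
    ws(c0) && ws(c1) && ws(c2)
}

/// Hypothetical query-side extraction mirroring the index-side skip.
fn query_trigrams(query: &[u8]) -> Vec<[u8; 3]> {
    query
        .windows(3)
        .filter(|w| !is_whitespace_trigram(w[0], w[1], w[2]))
        .map(|w| [w[0], w[1], w[2]])
        .collect()
}
```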


```zig
}
trigrams.deinit(self.allocator);
_ = self.file_trigrams.remove(path);
_ = self.path_to_id.remove(path);
```

P2: Reuse doc IDs when reindexing an existing path

indexFile() always calls removeFile() first, and removeFile() deletes the path_to_id entry while leaving id_to_path append-only. That means every reindex of the same file allocates a brand-new doc ID and appends another path pointer, so normal edit/save churn causes unbounded id_to_path growth and eventually risks u32 doc ID exhaustion in long-lived processes.
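
A common remedy, sketched in Rust purely as an assumption (not from the PR): keep a free list of retired doc IDs and recycle them on removal, so id_to_path stays bounded under edit/save churn:

```rust
use std::collections::HashMap;

// Doc ID allocator with recycling: removing a path pushes its ID onto
// a free list instead of stranding the slot, so repeated reindexing of
// the same file cannot exhaust the u32 ID space.
struct DocIds {
    path_to_id: HashMap<String, u32>,
    id_to_path: Vec<Option<String>>,
    free: Vec<u32>,
}

impl DocIds {
    fn new() -> Self {
        Self { path_to_id: HashMap::new(), id_to_path: Vec::new(), free: Vec::new() }
    }

    fn assign(&mut self, path: &str) -> u32 {
        if let Some(&id) = self.path_to_id.get(path) {
            return id; // reindex of a known path keeps its ID
        }
        let id = match self.free.pop() {
            // Recycle a retired slot if one exists...
            Some(id) => {
                self.id_to_path[id as usize] = Some(path.to_string());
                id
            }
            // ...otherwise grow the table by one.
            None => {
                self.id_to_path.push(Some(path.to_string()));
                (self.id_to_path.len() - 1) as u32
            }
        };
        self.path_to_id.insert(path.to_string(), id);
        id
    }

    fn remove(&mut self, path: &str) {
        if let Some(id) = self.path_to_id.remove(path) {
            self.id_to_path[id as usize] = None;
            self.free.push(id); // recycle instead of growing forever
        }
    }
}
```

Recycling an ID is only safe once all postings referencing the old ID are gone, which holds here because removeFile() already purges them before a new assignment can happen.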


@justrach justrach merged commit 575af1d into main Apr 5, 2026
1 check failed