perf: trigram index v2 — 36% faster indexing, 59% less CPU by justrach · Pull Request #144 · justrach/codedb

justrach · 2026-04-05T02:18:25Z

Summary

Addresses #142 — Trigram index v2 with integer doc IDs, batch processing, and optimized intersection.

Optimizations implemented

Change	Index Time	CPU	Memory	Query
Integer doc IDs (u32 instead of string paths)	—	—	-14%	-7%
Batch-accumulate trigrams (local HashMap → bulk append)	-36%	-57%	—	—
Skip whitespace-only trigrams	—	—	smaller index	—
Sorted merge intersection (O(n+m), zero alloc)	—	—	—	-63% dense
Pre-size local HashMap	—	—	-4%	—

Combined benchmark (5200 files, 200 with 200 lines each)

Metric	Before	After	Improvement
Index time	481ms	310ms	-36%
User CPU	0.29s	0.12s	-59%
Dense search ("void")	763µs	280µs	-63%
Selective search	950µs	1.0ms	Same
Peak memory	140MB	135MB	-4%

Regression suite

10 new tests covering all correctness invariants:

No false negatives or positives
Intersection narrows correctly
File removal/re-index
Disk roundtrip
500-file scale test
Short query fallback
Word index compatibility

Test plan

All 10 regression tests pass
All existing tests pass
Release build succeeds
Benchmarked on 5200 files

Tracked in #142. Architecture plan from @justrach.

10 regression tests covering: - Find all matching files (no false negatives) - No false positives - Intersection narrows correctly - File removal cleans index - Re-indexing replaces old content - Disk roundtrip preserves results - 500-file scale test - Short query fallback - Word index alongside trigram These tests must pass before and after any trigram index changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace string-based postings (StringHashMap per trigram) with integer doc ID postings (sorted ArrayList(DocPosting) with binary search). Changes: - DocPosting struct: doc_id (u32) + next_mask + loc_mask - PostingList with sorted insertion and binary search lookup - path_to_id / id_to_path mappings for string resolution - candidates() intersects by u32 doc_id, resolves to paths at end - All disk persistence updated to use new structure Benchmark (2000 files): - Memory: 31MB → 27MB (-14%) - Selective search: 331µs → 307µs (-7%) - Index time: unchanged - All regression tests pass Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Instead of calling getOrAddPosting (binary search + sorted insert) for every byte position in the file, accumulate masks in a local HashMap first, then bulk-append one posting per trigram to the global index. This eliminates O(n) array shifts per byte position and reduces the inner loop to pure HashMap lookups (O(1) amortized). Benchmark (5200 files, 200 with 200 lines each): - Index time: 475ms → 306ms (-36%) - User CPU: 0.28s → 0.12s (-57%) - I/O unchanged (expected) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two more optimizations: 1. Skip pure-whitespace trigrams during indexing — these are 12% of all trigram occurrences but terrible filters (match nearly every file). Reduces index size and build time. 2. Replace HashMap-based intersection with sorted merge intersection. Since posting lists are sorted by doc_id, we can intersect in O(n+m) with zero allocations instead of O(n*log(m)) with HashMap overhead. Combined benchmark (5200 files): - Index: 480ms → 306ms (-36%) - CPU: 0.29s → 0.12s (-59%) - Dense search: 763µs → 280µs (-63%) - Selective search: unchanged (already fast) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d34138d67b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-05T02:23:33Z

src/index.zig

+                if ((c0 == ' ' or c0 == '\t' or c0 == '\n' or c0 == '\r') and
+                    (c1 == ' ' or c1 == '\t' or c1 == '\n' or c1 == '\r') and
+                    (c2 == ' ' or c2 == '\t' or c2 == '\n' or c2 == '\r')) continue;


Preserve whitespace trigrams in index

Skipping all-whitespace trigrams during indexing introduces false negatives because query trigram extraction still includes them. For example, searching for strings like "foo bar" (or any query containing three consecutive whitespace chars) will hit candidates()/candidatesRegex() with a trigram that was never indexed, so the candidate set becomes empty even when matching files exist. This is a correctness regression from the previous behavior.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-05T02:23:33Z

src/index.zig

        }
        trigrams.deinit(self.allocator);
        _ = self.file_trigrams.remove(path);
+        _ = self.path_to_id.remove(path);


Reuse doc IDs when reindexing an existing path

indexFile() always calls removeFile() first, and removeFile() deletes the path_to_id entry while leaving id_to_path append-only. That means every reindex of the same file allocates a brand-new doc ID and appends another path pointer, so normal edit/save churn causes unbounded id_to_path growth and eventually risks u32 doc ID exhaustion in long-lived processes.

Useful? React with 👍 / 👎.

justrach and others added 5 commits April 5, 2026 09:46

perf: pre-size local trigram HashMap, restore all opts (#142)

d34138d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector bot reviewed Apr 5, 2026

View reviewed changes

justrach merged commit 575af1d into main Apr 5, 2026
1 check failed

justrach mentioned this pull request Apr 5, 2026

perf: trigram index v2 — integer postings, delta compression, query planner #142

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: trigram index v2 — 36% faster indexing, 59% less CPU#144

perf: trigram index v2 — 36% faster indexing, 59% less CPU#144
justrach merged 5 commits intomainfrom
feat/142-trigram-v2

justrach commented Apr 5, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 5, 2026

Uh oh!

chatgpt-codex-connector bot Apr 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

justrach commented Apr 5, 2026

Summary

Optimizations implemented

Combined benchmark (5200 files, 200 with 200 lines each)

Regression suite

Test plan

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant