You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A text search engine built from scratch in modern C++20. Features BM25 ranking, boolean/phrase query support, skip-pointer accelerated intersection, concurrent query execution, LRU caching, binary index persistence, and memory-mapped reads.
Architecture
flowchart TD
Corpus["Corpus (jsonl / text)"] --> Pipeline
subgraph Pipeline ["Indexer Pipeline"]
direction LR
T["Tokenizer"] --> SF["Stopword Filter"] --> ST["Stemmer"]
end
Pipeline --> FwdIdx["Forward Index<br>(doc_id to len, title)"]
Pipeline --> InvIdx["Inverted Index<br>(term to postings + skip index)"]
FwdIdx --> Ser["Serializer<br>(binary fmt)"]
InvIdx --> Ser
Ser --> Disk[("index.bin<br>(on disk)")]
Disk --> Mmap["mmap reader"]
Mmap --> Engine
subgraph Engine ["Query Engine"]
direction TB
QP["Parser<br>(AST tree)"] --> QE["Executor<br>(skip+gallop intersection)"]
QE --> Scorer["BM25 Scorer<br>(precomputed constants)"]
Scorer --> TopK["Top-K Heap"]
TopK --> Cache["LRU Cache"]
Cache --> TP["Thread Pool<br>(concurrent)"]
end
Loading
Benchmarks
Measured on Apple M2, release build (-O3 -march=native), corpus of 50k documents.
metric
value
p50
13 µs
p95
528 µs
p99
1,257 µs
build
2,510 ms
peak qps
57,409
Quick Start
Build
mkdir build &&cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)# use -j$(sysctl -n hw.ncpu) on macOS
Generate Corpus
python3 data/generate_corpus.py
# generates 50k docs across 15 domains: physics, medicine, music, economics, etc.
Build Index
cd build
./search_engine --corpus ../data/sample_corpus.jsonl --stopwords ../data/stopwords.txt --mode build
> black hole AND gravity
1. [4290] Gravity and Black Hole (score: 25.60)
2. [32252] Fluid Dynamics and Gravity (score: 21.63)
3. [18144] Black Hole and Velocity (score: 21.59)
> shakespeare AND poetry
1. [46629] Shakespeare and Poetry (score: 16.42)
2. [12614] Comedy and Poetry (score: 14.15)
> cancer AND chemotherapy
1. [45126] Chemotherapy and Cancer (score: 15.50)
2. [24260] Surgery and Chemotherapy (score: 13.46)
roman empire, silk road, magna carta, french revolution
Economics
cryptocurrency, blockchain, hedge fund, inflation
Literature
shakespeare, magical realism, dystopia, sonnet
Philosophy
epistemology, existentialism, free will, kant
Geography
tectonic plate, climate change, glacier, volcano
Music
jazz, symphony, chord progression, beethoven
Engineering
robotics, 3d printing, aerospace, smart grid
Psychology
cognitive bias, depression, mindfulness, pavlov
CLI Reference
Flag
Default
Description
--corpus
-
Path to JSONL file or text directory
--index
index.bin
Path to serialized index file
--stopwords
data/stopwords.txt
Stopwords file path
--mode
interactive
build, query, or interactive
--query
-
Query string (for query mode)
--top-k
10
Results per query
--k1
1.2
BM25 term frequency saturation
--b
0.75
BM25 document length normalization
--cache-size
1024
LRU cache capacity
BM25 Tuning
k1 controls term frequency saturation. At k1=0, term frequency is ignored. At k1=2.0, high-frequency terms dominate. Default 1.2 works well for general text.
b controls length normalization. At b=0, document length is ignored. At b=1.0, short documents are heavily boosted. Default 0.75 balances precision and recall.
Design Decisions
Decision
Rationale
Sorted posting lists
O(n+m) merge intersection for AND queries; prerequisite for SIMD acceleration
Flat forward index
Vector indexed by doc_id -> O(1) length lookups during scoring, zero pointer chasing
Pre-resolved BM25 scorer
All posting list pointers and BM25 constants precomputed at construction; per-document scoring is pure arithmetic
Skip pointers
Block-128 skip index on long posting lists; enables galloping search for O(log n) advancement
Galloping intersection
When list sizes differ by 8×+, iterate short list and binary-search the long one
Thread pool
Fixed workers sharing a read-only index; near-linear scaling (3.82× on 4 cores)
LRU query cache
O(1) hash+list cache with mutex; 28% hit rate under realistic workloads
Sorted stopwords
Binary search over ~130 words fits in L1 cache; beats hash table for this cardinality
No external deps
Pure C++20 standard library + POSIX (mmap, stat); no Boost, no JSON libs
No inheritance
Structs with free functions; the problem domain doesn't benefit from polymorphism
Compile Flags
Flag
CMake Option
Effect
ENABLE_POSITIONS
-DENABLE_POSITIONS=ON (default)
Stores token positions for phrase queries. ~2× memory but enables exact phrase matching
ENABLE_COMPRESSION
-DENABLE_COMPRESSION=OFF (default)
Reserved for delta+varint posting list compression
Future Work
Posting list compression - delta encoding + varint to reduce index size 3–5× and improve cache line utilization
SIMD intersection - NEON-vectorized sorted merge for AND queries on ARM64
Concurrent index updates - log-structured merge for adding documents without full rebuild
Two-phase scoring - cheap term-count filter pass, then full BM25 on shortlist only
Sharded index - partition by term hash for distributed multi-node query processing
Learned term weights - adjust IDF using click-through data or relevance judgments
About
fast full-text search engine in c++, inverted index, bm25 ranking, sub-millisecond query latency