mini-search-engine

A text search engine built from scratch in modern C++20. Features BM25 ranking, boolean/phrase query support, skip-pointer accelerated intersection, concurrent query execution, LRU caching, binary index persistence, and memory-mapped reads.

Architecture

flowchart TD
    Corpus["Corpus (jsonl / text)"] --> Pipeline
    
    subgraph Pipeline ["Indexer Pipeline"]
        direction LR
        T["Tokenizer"] --> SF["Stopword Filter"] --> ST["Stemmer"]
    end
    
    Pipeline --> FwdIdx["Forward Index<br>(doc_id to len, title)"]
    Pipeline --> InvIdx["Inverted Index<br>(term to postings + skip index)"]
    
    FwdIdx --> Ser["Serializer<br>(binary fmt)"]
    InvIdx --> Ser
    
    Ser --> Disk[("index.bin<br>(on disk)")]
    Disk --> Mmap["mmap reader"]
    
    Mmap --> Engine
    
    subgraph Engine ["Query Engine"]
        direction TB
        QP["Parser<br>(AST tree)"] --> QE["Executor<br>(skip+gallop intersection)"]
        QE --> Scorer["BM25 Scorer<br>(precomputed constants)"]
        Scorer --> TopK["Top-K Heap"]
        TopK --> Cache["LRU Cache"]
        Cache --> TP["Thread Pool<br>(concurrent)"]
    end

Benchmarks

Measured on Apple M2, release build (-O3 -march=native), corpus of 50k documents.

metric	value
p50	13 µs
p95	528 µs
p99	1,257 µs
build	2,510 ms
peak qps	57,409

Quick Start

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)    # use -j$(sysctl -n hw.ncpu) on macOS

Generate Corpus

python3 data/generate_corpus.py
# generates 50k docs across 15 domains: physics, medicine, music, economics, etc.

Build Index

cd build
./search_engine --corpus ../data/sample_corpus.jsonl --stopwords ../data/stopwords.txt --mode build

Interactive Search

./search_engine --index index.bin --stopwords ../data/stopwords.txt

> black hole AND gravity
  1. [4290] Gravity and Black Hole  (score: 25.60)
  2. [32252] Fluid Dynamics and Gravity  (score: 21.63)
  3. [18144] Black Hole and Velocity  (score: 21.59)

> shakespeare AND poetry
  1. [46629] Shakespeare and Poetry  (score: 16.42)
  2. [12614] Comedy and Poetry  (score: 14.15)

> cancer AND chemotherapy
  1. [45126] Chemotherapy and Cancer  (score: 15.50)
  2. [24260] Surgery and Chemotherapy  (score: 13.46)

One-Shot Query

./search_engine --index index.bin --stopwords ../data/stopwords.txt \
    --mode query --query "quantum mechanics AND electron"

Run Tests

cd build && ./run_tests    # 60 tests across all components

Run Benchmarks

cd build
./bench_index ../data/sample_corpus.jsonl ../data/stopwords.txt
./bench_query ../data/sample_corpus.jsonl ../data/stopwords.txt 10000 4

Query Syntax

Syntax	Meaning
`algorithm`	Single term - finds all docs with "algorithm"
`algorithm data structure`	Implicit OR - docs with any of the terms
`algorithm AND sorting`	Both terms must appear in the document
`algorithm NOT sorting`	Has "algorithm", excludes "sorting"
`"data structure"`	Exact phrase - terms must be adjacent

Corpus Domains

The built-in generator covers 15 knowledge domains with 840+ source terms:

Domain	Example Terms
Computer Science	algorithm, neural network, distributed system, kubernetes
Physics	quantum mechanics, black hole, superconductor, spacetime
Mathematics	topology, fourier transform, markov chain, hilbert space
Biology	crispr, gene editing, neurotransmitter, photosynthesis
Chemistry	periodic table, spectroscopy, polymer, catalyst
Medicine	penicillin, chemotherapy, mri, epidemiology
Astronomy	exoplanet, gravitational wave, james webb, nebula
History	roman empire, silk road, magna carta, french revolution
Economics	cryptocurrency, blockchain, hedge fund, inflation
Literature	shakespeare, magical realism, dystopia, sonnet
Philosophy	epistemology, existentialism, free will, kant
Geography	tectonic plate, climate change, glacier, volcano
Music	jazz, symphony, chord progression, beethoven
Engineering	robotics, 3d printing, aerospace, smart grid
Psychology	cognitive bias, depression, mindfulness, pavlov

CLI Reference

Flag	Default	Description
`--corpus`	-	Path to JSONL file or text directory
`--index`	`index.bin`	Path to serialized index file
`--stopwords`	`data/stopwords.txt`	Stopwords file path
`--mode`	`interactive`	`build`, `query`, or `interactive`
`--query`	-	Query string (for `query` mode)
`--top-k`	`10`	Results per query
`--k1`	`1.2`	BM25 term frequency saturation
`--b`	`0.75`	BM25 document length normalization
`--cache-size`	`1024`	LRU cache capacity

BM25 Tuning

k1 controls term frequency saturation. At k1=0, term frequency is ignored. At k1=2.0, high-frequency terms dominate. Default 1.2 works well for general text.
b controls length normalization. At b=0, document length is ignored. At b=1.0, short documents are heavily boosted. Default 0.75 balances precision and recall.

Design Decisions

Decision	Rationale
Sorted posting lists	O(n+m) merge intersection for AND queries; prerequisite for SIMD acceleration
Flat forward index	Vector indexed by doc_id -> O(1) length lookups during scoring, zero pointer chasing
Pre-resolved BM25 scorer	All posting list pointers and BM25 constants precomputed at construction; per-document scoring is pure arithmetic
Skip pointers	Block-128 skip index on long posting lists; enables galloping search for O(log n) advancement
Galloping intersection	When list sizes differ by 8×+, iterate short list and binary-search the long one
Thread pool	Fixed workers sharing a read-only index; near-linear scaling (3.82× on 4 cores)
LRU query cache	O(1) hash+list cache with mutex; 28% hit rate under realistic workloads
Sorted stopwords	Binary search over ~130 words fits in L1 cache; beats hash table for this cardinality
No external deps	Pure C++20 standard library + POSIX (`mmap`, `stat`); no Boost, no JSON libs
No inheritance	Structs with free functions; the problem domain doesn't benefit from polymorphism

Compile Flags

Flag	CMake Option	Effect
`ENABLE_POSITIONS`	`-DENABLE_POSITIONS=ON` (default)	Stores token positions for phrase queries. ~2× memory but enables exact phrase matching
`ENABLE_COMPRESSION`	`-DENABLE_COMPRESSION=OFF` (default)	Reserved for delta+varint posting list compression

Future Work

Posting list compression - delta encoding + varint to reduce index size 3–5× and improve cache line utilization
SIMD intersection - NEON-vectorized sorted merge for AND queries on ARM64
Concurrent index updates - log-structured merge for adding documents without full rebuild
Two-phase scoring - cheap term-count filter pass, then full BM25 on shortlist only
Sharded index - partition by term hash for distributed multi-node query processing
Learned term weights - adjust IDF using click-through data or relevance judgments

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
benchmarks		benchmarks
data		data
src		src
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
demo.gif		demo.gif

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mini-search-engine

Architecture

Benchmarks

Quick Start

Build

Generate Corpus

Build Index

Interactive Search

One-Shot Query

Run Tests

Run Benchmarks

Query Syntax

Corpus Domains

CLI Reference

BM25 Tuning

Design Decisions

Compile Flags

Future Work

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mini-search-engine

Architecture

Benchmarks

Quick Start

Build

Generate Corpus

Build Index

Interactive Search

One-Shot Query

Run Tests

Run Benchmarks

Query Syntax

Corpus Domains

CLI Reference

BM25 Tuning

Design Decisions

Compile Flags

Future Work

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages