Skip to content

sciguy-code/mini-search-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

mini-search-engine

mini-search-engine demo A text search engine built from scratch in modern C++20. Features BM25 ranking, boolean/phrase query support, skip-pointer accelerated intersection, concurrent query execution, LRU caching, binary index persistence, and memory-mapped reads.


Architecture

flowchart TD
    Corpus["Corpus (jsonl / text)"] --> Pipeline
    
    subgraph Pipeline ["Indexer Pipeline"]
        direction LR
        T["Tokenizer"] --> SF["Stopword Filter"] --> ST["Stemmer"]
    end
    
    Pipeline --> FwdIdx["Forward Index<br>(doc_id to len, title)"]
    Pipeline --> InvIdx["Inverted Index<br>(term to postings + skip index)"]
    
    FwdIdx --> Ser["Serializer<br>(binary fmt)"]
    InvIdx --> Ser
    
    Ser --> Disk[("index.bin<br>(on disk)")]
    Disk --> Mmap["mmap reader"]
    
    Mmap --> Engine
    
    subgraph Engine ["Query Engine"]
        direction TB
        QP["Parser<br>(AST tree)"] --> QE["Executor<br>(skip+gallop intersection)"]
        QE --> Scorer["BM25 Scorer<br>(precomputed constants)"]
        Scorer --> TopK["Top-K Heap"]
        TopK --> Cache["LRU Cache"]
        Cache --> TP["Thread Pool<br>(concurrent)"]
    end
Loading

Benchmarks

Measured on Apple M2, release build (-O3 -march=native), corpus of 50k documents.

metric value
p50 13 µs
p95 528 µs
p99 1,257 µs
build 2,510 ms
peak qps 57,409


Quick Start

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)    # use -j$(sysctl -n hw.ncpu) on macOS

Generate Corpus

python3 data/generate_corpus.py
# generates 50k docs across 15 domains: physics, medicine, music, economics, etc.

Build Index

cd build
./search_engine --corpus ../data/sample_corpus.jsonl --stopwords ../data/stopwords.txt --mode build

Interactive Search

./search_engine --index index.bin --stopwords ../data/stopwords.txt
> black hole AND gravity
  1. [4290] Gravity and Black Hole  (score: 25.60)
  2. [32252] Fluid Dynamics and Gravity  (score: 21.63)
  3. [18144] Black Hole and Velocity  (score: 21.59)

> shakespeare AND poetry
  1. [46629] Shakespeare and Poetry  (score: 16.42)
  2. [12614] Comedy and Poetry  (score: 14.15)

> cancer AND chemotherapy
  1. [45126] Chemotherapy and Cancer  (score: 15.50)
  2. [24260] Surgery and Chemotherapy  (score: 13.46)

One-Shot Query

./search_engine --index index.bin --stopwords ../data/stopwords.txt \
    --mode query --query "quantum mechanics AND electron"

Run Tests

cd build && ./run_tests    # 60 tests across all components

Run Benchmarks

cd build
./bench_index ../data/sample_corpus.jsonl ../data/stopwords.txt
./bench_query ../data/sample_corpus.jsonl ../data/stopwords.txt 10000 4

Query Syntax

Syntax Meaning
algorithm Single term - finds all docs with "algorithm"
algorithm data structure Implicit OR - docs with any of the terms
algorithm AND sorting Both terms must appear in the document
algorithm NOT sorting Has "algorithm", excludes "sorting"
"data structure" Exact phrase - terms must be adjacent

Corpus Domains

The built-in generator covers 15 knowledge domains with 840+ source terms:

Domain Example Terms
Computer Science algorithm, neural network, distributed system, kubernetes
Physics quantum mechanics, black hole, superconductor, spacetime
Mathematics topology, fourier transform, markov chain, hilbert space
Biology crispr, gene editing, neurotransmitter, photosynthesis
Chemistry periodic table, spectroscopy, polymer, catalyst
Medicine penicillin, chemotherapy, mri, epidemiology
Astronomy exoplanet, gravitational wave, james webb, nebula
History roman empire, silk road, magna carta, french revolution
Economics cryptocurrency, blockchain, hedge fund, inflation
Literature shakespeare, magical realism, dystopia, sonnet
Philosophy epistemology, existentialism, free will, kant
Geography tectonic plate, climate change, glacier, volcano
Music jazz, symphony, chord progression, beethoven
Engineering robotics, 3d printing, aerospace, smart grid
Psychology cognitive bias, depression, mindfulness, pavlov

CLI Reference

Flag Default Description
--corpus - Path to JSONL file or text directory
--index index.bin Path to serialized index file
--stopwords data/stopwords.txt Stopwords file path
--mode interactive build, query, or interactive
--query - Query string (for query mode)
--top-k 10 Results per query
--k1 1.2 BM25 term frequency saturation
--b 0.75 BM25 document length normalization
--cache-size 1024 LRU cache capacity

BM25 Tuning

  • k1 controls term frequency saturation. At k1=0, term frequency is ignored. At k1=2.0, high-frequency terms dominate. Default 1.2 works well for general text.
  • b controls length normalization. At b=0, document length is ignored. At b=1.0, short documents are heavily boosted. Default 0.75 balances precision and recall.

Design Decisions

Decision Rationale
Sorted posting lists O(n+m) merge intersection for AND queries; prerequisite for SIMD acceleration
Flat forward index Vector indexed by doc_id -> O(1) length lookups during scoring, zero pointer chasing
Pre-resolved BM25 scorer All posting list pointers and BM25 constants precomputed at construction; per-document scoring is pure arithmetic
Skip pointers Block-128 skip index on long posting lists; enables galloping search for O(log n) advancement
Galloping intersection When list sizes differ by 8×+, iterate short list and binary-search the long one
Thread pool Fixed workers sharing a read-only index; near-linear scaling (3.82× on 4 cores)
LRU query cache O(1) hash+list cache with mutex; 28% hit rate under realistic workloads
Sorted stopwords Binary search over ~130 words fits in L1 cache; beats hash table for this cardinality
No external deps Pure C++20 standard library + POSIX (mmap, stat); no Boost, no JSON libs
No inheritance Structs with free functions; the problem domain doesn't benefit from polymorphism

Compile Flags

Flag CMake Option Effect
ENABLE_POSITIONS -DENABLE_POSITIONS=ON (default) Stores token positions for phrase queries. ~2× memory but enables exact phrase matching
ENABLE_COMPRESSION -DENABLE_COMPRESSION=OFF (default) Reserved for delta+varint posting list compression

Future Work

  • Posting list compression - delta encoding + varint to reduce index size 3–5× and improve cache line utilization
  • SIMD intersection - NEON-vectorized sorted merge for AND queries on ARM64
  • Concurrent index updates - log-structured merge for adding documents without full rebuild
  • Two-phase scoring - cheap term-count filter pass, then full BM25 on shortlist only
  • Sharded index - partition by term hash for distributed multi-node query processing
  • Learned term weights - adjust IDF using click-through data or relevance judgments

About

fast full-text search engine in c++, inverted index, bm25 ranking, sub-millisecond query latency

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors