A small but real document-retrieval system written from scratch in pure Python (no external libraries). It implements every classical IR component: text preprocessing, an inverted index, TF-IDF weighting, cosine similarity ranking, and a CLI for indexing and querying.
- Text preprocessing: regex tokenization, lowercasing, stop-word filtering, Porter stemming
- Inverted index with term-frequency posting lists (
term → {doc_id → tf}) - TF-IDF weighting with smoothed IDF:
tf · log((N+1)/(df+1)) + 1 - Cosine similarity ranking on sparse tf-idf vectors
- CLI with three commands:
index,query, and an interactiveshell - Persistence: indices serialize to/from JSON
- Tested: 28 unit + integration tests, all passing
- No dependencies: standard library only
There's nothing to compile. Either run via PYTHONPATH=src (no install
needed) or pip install -e . from the project root.
# Option A: zero-install
PYTHONPATH=src python3 -m mini_search --help
# Option B: editable install (puts a `mini-search` script on your PATH)
pip install -e .
mini-search --helpmake demoThis builds an index over the bundled corpus/ (10 small documents about
information retrieval) and runs three sample queries.
# Build an index from a directory of .txt / .md files
python -m mini_search index corpus --out index.json
# Run one query
python -m mini_search query "inverted index posting list" --index index.json --top 5
# Interactive shell
python -m mini_search shell --index index.jsonSample output:
5 result(s) for 'inverted index posting list':
1. [0.4410] 03-inverted-index.txt
An inverted index is the data structure at the heart of every classical
search engine. For each term in the vocabulary, it stores a posting...
2. [0.1910] 09-python.txt
...search engines because its dictionary, list, and set primitives map
cleanly to posting lists and sparse vectors...
from mini_search import SearchEngine
engine = SearchEngine()
engine.index_directory("corpus")
for r in engine.search("tf-idf cosine ranking", top_k=5):
print(f"{r.score:.3f} {r.title}")
engine.save("index.json")
later = SearchEngine.load("index.json")| File | Responsibility |
|---|---|
tokenizer.py |
Tokenization pipeline (lowercase, regex, stop-words, stem) |
stemmer.py |
Porter (1980) suffix-stripping algorithm |
stopwords.py |
Built-in English stop-word list |
inverted_index.py |
term → {doc_id → tf} posting lists + serialization |
tfidf.py |
IDF, doc/query vectors, cosine similarity |
engine.py |
High-level facade; ranking + snippets |
persistence.py |
JSON save/load of an engine |
cli.py |
index / query / shell subcommands |
┌──────────────┐
│ Tokenizer │ (shared between
│ lower → stop │ indexing and querying)
│ → stem │
└──────┬───────┘
│
┌────────────────────┼────────────────────┐
▼ ▼
┌─────────────┐ ┌──────────────┐
│ Documents │ index → tokens → │ Query │
└──────┬──────┘ posting lists └──────┬───────┘
│ │
▼ ▼
┌─────────────┐ ┌──────────────┐
│ Inverted │ ◀───── candidate ───── │ Query vector │
│ Index │ documents │ (tf-idf) │
└──────┬──────┘ └──────┬───────┘
│ │
└────────► Cosine similarity scoring ◀───┘
│
▼
Ranked results
make test28 tests in 4 files covering: stemmer, tokenizer, inverted index, TF-IDF math, the engine, and end-to-end ranking quality on the bundled corpus.
Inspired by Ben Boyter's Building a Vector Space Indexing Engine in Python, which demonstrates cosine similarity over raw term counts in ~60 lines.
This project extends that seed with everything a real search engine needs:
| Component | Boyter article | This project |
|---|---|---|
| Cosine similarity | ✅ | ✅ |
| TF-IDF weighting | ❌ | ✅ (smoothed IDF) |
| Inverted index | ❌ (full scan) | ✅ (posting lists) |
| Stop-word removal | ❌ | ✅ |
| Stemming | ❌ | ✅ (Porter 1980) |
| Tokenization | .split(' ') |
regex + length filter |
| CLI | raw_input() |
argparse with subcommands |
| Persistence | ❌ | ✅ (JSON round-trip) |
| Tests | ❌ | ✅ (28 unit + integration) |