Mini Search Engine with Vector Space Ranking

A small but real document-retrieval system written from scratch in pure Python (no external libraries). It implements every classical IR component: text preprocessing, an inverted index, TF-IDF weighting, cosine similarity ranking, and a CLI for indexing and querying.

Features

Text preprocessing: regex tokenization, lowercasing, stop-word filtering, Porter stemming
Inverted index with term-frequency posting lists (term → {doc_id → tf})
TF-IDF weighting with smoothed IDF: tf · log((N+1)/(df+1)) + 1
Cosine similarity ranking on sparse tf-idf vectors
CLI with three commands: index, query, and an interactive shell
Persistence: indices serialize to/from JSON
Tested: 28 unit + integration tests, all passing
No dependencies: standard library only

Install / build

There's nothing to compile. Either run via PYTHONPATH=src (no install needed) or pip install -e . from the project root.

# Option A: zero-install
PYTHONPATH=src python3 -m mini_search --help

# Option B: editable install (puts a `mini-search` script on your PATH)
pip install -e .
mini-search --help

Quick demo

make demo

This builds an index over the bundled corpus/ (10 small documents about information retrieval) and runs three sample queries.

CLI usage

# Build an index from a directory of .txt / .md files
python -m mini_search index corpus --out index.json

# Run one query
python -m mini_search query "inverted index posting list" --index index.json --top 5

# Interactive shell
python -m mini_search shell --index index.json

Sample output:

5 result(s) for 'inverted index posting list':

   1. [0.4410] 03-inverted-index.txt
      An inverted index is the data structure at the heart of every classical
      search engine. For each term in the vocabulary, it stores a posting...

   2. [0.1910] 09-python.txt
      ...search engines because its dictionary, list, and set primitives map
      cleanly to posting lists and sparse vectors...

Library usage

from mini_search import SearchEngine

engine = SearchEngine()
engine.index_directory("corpus")

for r in engine.search("tf-idf cosine ranking", top_k=5):
    print(f"{r.score:.3f}  {r.title}")

engine.save("index.json")
later = SearchEngine.load("index.json")

Module layout

File	Responsibility
`tokenizer.py`	Tokenization pipeline (lowercase, regex, stop-words, stem)
`stemmer.py`	Porter (1980) suffix-stripping algorithm
`stopwords.py`	Built-in English stop-word list
`inverted_index.py`	`term → {doc_id → tf}` posting lists + serialization
`tfidf.py`	IDF, doc/query vectors, cosine similarity
`engine.py`	High-level facade; ranking + snippets
`persistence.py`	JSON save/load of an engine
`cli.py`	`index` / `query` / `shell` subcommands

Architecture

                          ┌──────────────┐
                          │   Tokenizer  │  (shared between
                          │ lower → stop │   indexing and querying)
                          │ → stem       │
                          └──────┬───────┘
                                 │
            ┌────────────────────┼────────────────────┐
            ▼                                         ▼
     ┌─────────────┐                          ┌──────────────┐
     │  Documents  │  index → tokens →        │    Query     │
     └──────┬──────┘  posting lists           └──────┬───────┘
            │                                        │
            ▼                                        ▼
     ┌─────────────┐                          ┌──────────────┐
     │   Inverted  │  ◀───── candidate ───── │  Query vector │
     │    Index    │        documents       │   (tf-idf)    │
     └──────┬──────┘                          └──────┬───────┘
            │                                        │
            └────────► Cosine similarity scoring ◀───┘
                              │
                              ▼
                        Ranked results

Tests

make test

28 tests in 4 files covering: stemmer, tokenizer, inverted index, TF-IDF math, the engine, and end-to-end ranking quality on the bundled corpus.

Inspiration & extensions beyond the source

Inspired by Ben Boyter's Building a Vector Space Indexing Engine in Python, which demonstrates cosine similarity over raw term counts in ~60 lines.

This project extends that seed with everything a real search engine needs:

Component	Boyter article	This project
Cosine similarity	✅	✅
TF-IDF weighting	❌	✅ (smoothed IDF)
Inverted index	❌ (full scan)	✅ (posting lists)
Stop-word removal	❌	✅
Stemming	❌	✅ (Porter 1980)
Tokenization	`.split(' ')`	regex + length filter
CLI	`raw_input()`	`argparse` with subcommands
Persistence	❌	✅ (JSON round-trip)
Tests	❌	✅ (28 unit + integration)

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
corpus		corpus
src/mini_search		src/mini_search
tests		tests
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mini Search Engine with Vector Space Ranking

Features

Install / build

Quick demo

CLI usage

Library usage

Module layout

Architecture

Tests

Inspiration & extensions beyond the source

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Mini Search Engine with Vector Space Ranking

Features

Install / build

Quick demo

CLI usage

Library usage

Module layout

Architecture

Tests

Inspiration & extensions beyond the source

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages