Skip to content

Dileepkumar65/mini-search-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mini Search Engine with Vector Space Ranking

A small but real document-retrieval system written from scratch in pure Python (no external libraries). It implements every classical IR component: text preprocessing, an inverted index, TF-IDF weighting, cosine similarity ranking, and a CLI for indexing and querying.

Features

  • Text preprocessing: regex tokenization, lowercasing, stop-word filtering, Porter stemming
  • Inverted index with term-frequency posting lists (term → {doc_id → tf})
  • TF-IDF weighting with smoothed IDF: tf · log((N+1)/(df+1)) + 1
  • Cosine similarity ranking on sparse tf-idf vectors
  • CLI with three commands: index, query, and an interactive shell
  • Persistence: indices serialize to/from JSON
  • Tested: 28 unit + integration tests, all passing
  • No dependencies: standard library only

Install / build

There's nothing to compile. Either run via PYTHONPATH=src (no install needed) or pip install -e . from the project root.

# Option A: zero-install
PYTHONPATH=src python3 -m mini_search --help

# Option B: editable install (puts a `mini-search` script on your PATH)
pip install -e .
mini-search --help

Quick demo

make demo

This builds an index over the bundled corpus/ (10 small documents about information retrieval) and runs three sample queries.

CLI usage

# Build an index from a directory of .txt / .md files
python -m mini_search index corpus --out index.json

# Run one query
python -m mini_search query "inverted index posting list" --index index.json --top 5

# Interactive shell
python -m mini_search shell --index index.json

Sample output:

5 result(s) for 'inverted index posting list':

   1. [0.4410] 03-inverted-index.txt
      An inverted index is the data structure at the heart of every classical
      search engine. For each term in the vocabulary, it stores a posting...

   2. [0.1910] 09-python.txt
      ...search engines because its dictionary, list, and set primitives map
      cleanly to posting lists and sparse vectors...

Library usage

from mini_search import SearchEngine

engine = SearchEngine()
engine.index_directory("corpus")

for r in engine.search("tf-idf cosine ranking", top_k=5):
    print(f"{r.score:.3f}  {r.title}")

engine.save("index.json")
later = SearchEngine.load("index.json")

Module layout

File Responsibility
tokenizer.py Tokenization pipeline (lowercase, regex, stop-words, stem)
stemmer.py Porter (1980) suffix-stripping algorithm
stopwords.py Built-in English stop-word list
inverted_index.py term → {doc_id → tf} posting lists + serialization
tfidf.py IDF, doc/query vectors, cosine similarity
engine.py High-level facade; ranking + snippets
persistence.py JSON save/load of an engine
cli.py index / query / shell subcommands

Architecture

                          ┌──────────────┐
                          │   Tokenizer  │  (shared between
                          │ lower → stop │   indexing and querying)
                          │ → stem       │
                          └──────┬───────┘
                                 │
            ┌────────────────────┼────────────────────┐
            ▼                                         ▼
     ┌─────────────┐                          ┌──────────────┐
     │  Documents  │  index → tokens →        │    Query     │
     └──────┬──────┘  posting lists           └──────┬───────┘
            │                                        │
            ▼                                        ▼
     ┌─────────────┐                          ┌──────────────┐
     │   Inverted  │  ◀───── candidate ───── │  Query vector │
     │    Index    │        documents       │   (tf-idf)    │
     └──────┬──────┘                          └──────┬───────┘
            │                                        │
            └────────► Cosine similarity scoring ◀───┘
                              │
                              ▼
                        Ranked results

Tests

make test

28 tests in 4 files covering: stemmer, tokenizer, inverted index, TF-IDF math, the engine, and end-to-end ranking quality on the bundled corpus.

Inspiration & extensions beyond the source

Inspired by Ben Boyter's Building a Vector Space Indexing Engine in Python, which demonstrates cosine similarity over raw term counts in ~60 lines.

This project extends that seed with everything a real search engine needs:

Component Boyter article This project
Cosine similarity
TF-IDF weighting ✅ (smoothed IDF)
Inverted index ❌ (full scan) ✅ (posting lists)
Stop-word removal
Stemming ✅ (Porter 1980)
Tokenization .split(' ') regex + length filter
CLI raw_input() argparse with subcommands
Persistence ✅ (JSON round-trip)
Tests ✅ (28 unit + integration)

About

A small TF-IDF + cosine-similarity search engine in pure Python — inverted index, Porter stemmer, CLI

Topics

Resources

Stars

Watchers

Forks

Contributors