This repository contains a search engine built to understand how real-world search engines are designed beyond the basic "term → document list" idea. It has been extended with AI-powered hybrid search and RAG (Retrieval-Augmented Generation).
A single-node, multi-segment search engine that:
- Indexes Wikipedia pages
- Uses an inverted index with:
  - term frequency
  - positions
  - character offsets
- Supports:
  - boolean AND queries
  - TF-IDF ranking
  - hybrid search (TF-IDF + semantic vector search with RRF fusion)
  - semantic search (cosine similarity over Gemini embeddings)
  - RAG (retrieval-augmented generation for natural language Q&A)
  - snippet generation
- Persists data in PostgreSQL
- Uses immutable segments and explicit merges
The design deliberately mirrors how real search engines (Lucene-style) think about indexing.
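As a rough illustration of the index shape listed above, here is a minimal sketch of a positional inverted index with a boolean AND query. The class and method names (`PostingSketch`, `InvertedIndexSketch`, `add`, `and`) are hypothetical and are not the repository's API. The tokenizer is an assumption too: it splits on non-alphanumeric characters, lowercases, and skips stop-word filtering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: names are hypothetical, not the repository's InvertedIndex/Posting API.
final class PostingSketch {
    final int docId;
    final List<Integer> positions = new ArrayList<>(); // token index of each occurrence
    final List<Integer> offsets = new ArrayList<>();   // character offset of each occurrence

    PostingSketch(int docId) {
        this.docId = docId;
    }

    // Term frequency is simply the number of recorded occurrences.
    int termFrequency() {
        return positions.size();
    }
}

final class InvertedIndexSketch {
    // term -> (docId -> posting)
    private final Map<String, Map<Integer, PostingSketch>> index = new TreeMap<>();

    // Assumed tokenizer: split on non-alphanumerics, lowercase, no stop-word filtering.
    void add(int docId, String text) {
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (start == i) {
                break; // only trailing separators were left
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            PostingSketch posting = index
                    .computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, PostingSketch::new);
            posting.positions.add(position++);
            posting.offsets.add(start);
        }
    }

    // Returns null when the term does not occur in the document.
    PostingSketch posting(String term, int docId) {
        return index.getOrDefault(term, Map.of()).get(docId);
    }

    // Boolean AND: ids of documents that contain every term.
    Set<Integer> and(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs =
                    index.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of()).keySet();
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Set.of() : result;
    }
}
```

Keeping character offsets next to token positions is what lets snippet generation cut text straight out of the stored document without re-tokenizing it.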
This system accurately models several non-obvious realities of search engines.
- Each indexing operation creates a new segment
- A segment contains:
  - its own documents
  - its own inverted index
- Once created, a segment is never mutated
This makes indexing:
- failure-isolated
- concurrency-friendly
- conceptually simple
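One way to get this "never mutated" property in Java 17 is a record that takes defensive, unmodifiable copies in its constructor. This is a sketch under assumptions: `SegmentSketch` and its fields are hypothetical, and the repository's `Segment` class may be structured differently.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical shape; the repository's Segment class may look different.
record SegmentSketch(long id, List<String> documents, Map<String, List<Integer>> postings) {

    // Compact constructor: copy everything into unmodifiable collections so
    // neither the caller nor a reader can change the segment after it is built.
    SegmentSketch {
        documents = List.copyOf(documents);
        postings = postings.entrySet().stream()
                .collect(Collectors.toUnmodifiableMap(
                        Map.Entry::getKey,
                        entry -> List.copyOf(entry.getValue())));
    }
}
```

Because nothing can write to a segment after construction, readers need no locks, and a failed indexing run leaves earlier segments untouched.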
Documents are split into two identities:
Canonical document
- Unique URL
- Stored once
- Never deleted
Segment document
- Local doc_id
- Points to a canonical document
- Exists only inside a segment
This separates content identity from index identity.
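The two identities can be pictured as two small record types. The field names and the in-memory `CanonicalStoreSketch` below are assumptions made for illustration; the real project persists these rows in PostgreSQL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical types; field names are assumptions, not the repository's schema.
record CanonicalDocument(long id, String url, String content) {}

// A segment-local entry: its docId is only meaningful inside one segment.
record SegmentDocument(long segmentId, int docId, long canonicalId) {}

final class CanonicalStoreSketch {
    private final Map<String, CanonicalDocument> byUrl = new LinkedHashMap<>();
    private long nextId = 1;

    // Content is stored once per URL; asking again returns the existing entry.
    CanonicalDocument getOrCreate(String url, String content) {
        return byUrl.computeIfAbsent(url, u -> new CanonicalDocument(nextId++, u, content));
    }
}
```

Two segments can then each hold a `SegmentDocument` with local `docId` 0 that points at the same canonical row.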
- Segments are marked active = true | false
- Queries only run against active segments
- Inactive segments (and their postings) are retained
- Two active segments can be merged:
  - A new segment is created
  - Documents are re-indexed
  - Old segments are marked inactive
  - No in-place mutation occurs
This exposes why:
- scoring changes after merges
- global statistics are expensive
- merges are operationally costly
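The lifecycle above can be sketched with an in-memory manager. `SegmentManagerSketch` is hypothetical, and "re-indexing" is reduced here to concatenating the two document lists into a fresh segment; the repository's `SegmentManager` may behave differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical manager; the repository's SegmentManager may work differently.
final class SegmentManagerSketch {
    // segment id -> canonical document ids (immutable once stored)
    private final Map<Integer, List<Long>> segments = new LinkedHashMap<>();
    private final Map<Integer, Boolean> active = new LinkedHashMap<>();
    private int nextId = 1;

    int create(List<Long> canonicalIds) {
        int id = nextId++;
        segments.put(id, List.copyOf(canonicalIds));
        active.put(id, true);
        return id;
    }

    // Builds a new segment from two active ones, then deactivates the inputs.
    int merge(int a, int b) {
        if (a == b || !isActive(a) || !isActive(b)) {
            throw new IllegalArgumentException("merge needs two distinct active segments");
        }
        List<Long> combined = new ArrayList<>(segments.get(a));
        combined.addAll(segments.get(b));
        int merged = create(combined); // stands in for re-indexing the documents
        active.put(a, false);          // old segments are retained, only hidden
        active.put(b, false);
        return merged;
    }

    boolean isActive(int id) {
        return active.getOrDefault(id, false);
    }

    // Only these segments would be visible to queries.
    List<Integer> activeSegments() {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, Boolean> entry : active.entrySet()) {
            if (entry.getValue()) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    List<Long> documents(int id) {
        return segments.get(id);
    }

    int totalSegments() {
        return segments.size();
    }
}
```

The cost of a merge shows up even in this toy: every document of both inputs is touched again, and the old segments still occupy storage afterwards.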
The system demonstrates that:
- The same document
- With the same query
- Can produce different scores depending on segment composition
This happens because:
- N (number of documents) changes
- DF is segment-scoped
- Merging alters statistical context
This is intentional and educational.
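A small worked example makes the effect concrete. It assumes the common form score = tf × ln(N / df); the repository's exact TF-IDF formula is not shown in this README and may differ. `SegmentScopedTfIdf` is a hypothetical name.

```java
// Assumes score = tf * ln(N / df); the repository's exact formula may differ.
final class SegmentScopedTfIdf {

    // N and df are counted inside one segment, not across the whole corpus.
    static double score(int termFrequency, int docsInSegment, int docsContainingTerm) {
        if (docsContainingTerm == 0) {
            return 0.0; // the term does not occur in this segment
        }
        return termFrequency * Math.log((double) docsInSegment / docsContainingTerm);
    }

    public static void main(String[] args) {
        // Same document, same query term, tf = 3 in both cases.
        double before = score(3, 2, 1); // small segment: N = 2, df = 1
        double after = score(3, 4, 3);  // merged segment: N = 4, df = 3
        System.out.println("before merge: " + before);
        System.out.println("after merge:  " + after);
    }
}
```

With these numbers the score drops from about 2.08 to about 0.86, although neither the document nor the query changed. Only the statistical context did.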
Results from keyword search (TF-IDF) and semantic vector search (cosine similarity) are merged using Reciprocal Rank Fusion (RRF) with k=60. This demonstrates how modern search engines combine multiple ranking signals.
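Reciprocal Rank Fusion gives each document the sum of 1 / (k + rank) over every result list it appears in, with ranks starting at 1. The sketch below uses k = 60 as stated above; the class and method names are hypothetical, and the repository's `HybridQueryEngine` may differ in details such as tie-breaking.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative RRF; names are hypothetical, not the HybridQueryEngine API.
final class RrfSketch {
    static final int K = 60;

    // Sums 1 / (k + rank) for each document over all rankings (rank is 1-based).
    static Map<String, Double> rrfScores(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1;
                scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        return scores;
    }

    // Returns document ids ordered by fused score, best first.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = rrfScores(rankings, k);
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> scores.get(id)).reversed());
        return ids;
    }

    public static void main(String[] args) {
        List<String> keyword = List.of("doc-a", "doc-b", "doc-c");  // TF-IDF order
        List<String> semantic = List.of("doc-b", "doc-d", "doc-a"); // cosine order
        System.out.println(fuse(List.of(keyword, semantic), K));
    }
}
```

RRF only looks at ranks, so the TF-IDF and cosine scores never have to be normalized onto a common scale. In the example, `doc-b` (ranks 2 and 1) ends up ahead of `doc-a` (ranks 1 and 3).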
Top-K documents from hybrid search are fed as context to an LLM (Gemini 2.5 Flash Lite) to generate direct, cited answers to natural language questions.
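The context-assembly step might look roughly like the following. Everything here is an assumption: the prompt wording, the `[n]` citation format, the per-source character limit, and the names `RagPromptSketch` and `Source`. The call to the Gemini API is left out on purpose.

```java
import java.util.List;

// Hypothetical prompt layout; the repository's ContextAssembler and prompt text may differ.
final class RagPromptSketch {
    record Source(String title, String text) {}

    // Builds a prompt from the top-K ranked sources, numbering them for citation.
    static String buildPrompt(String question, List<Source> rankedSources,
                              int topK, int maxCharsPerSource) {
        StringBuilder prompt = new StringBuilder();
        prompt.append("Answer the question using only the numbered sources below. ")
              .append("Cite sources as [n].\n\n");
        int limit = Math.min(topK, rankedSources.size());
        for (int i = 0; i < limit; i++) {
            Source source = rankedSources.get(i);
            String text = source.text();
            if (text.length() > maxCharsPerSource) {
                text = text.substring(0, maxCharsPerSource); // keep the prompt bounded
            }
            prompt.append('[').append(i + 1).append("] ")
                  .append(source.title()).append('\n')
                  .append(text).append("\n\n");
        }
        prompt.append("Question: ").append(question).append('\n');
        return prompt.toString();
    }
}
```

Numbering the sources in the prompt is what allows the model's `[1]`, `[2]` markers to be mapped back to retrieved documents.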
                  ┌────────────┐
                  │ Wikipedia  │
                  │  Fetcher   │
                  └─────┬──────┘
                        │
                  ┌─────▼──────┐
                  │    Text    │
                  │ Processing │
                  └─────┬──────┘
                        │
        ┌───────────────▼────────────────┐
        │      Segment (immutable)       │
        │               ~                │
        │   Documents + Inverted Index   │
        └───────────┬───────┬────────────┘
                    │       │
         ┌──────────▼──┐  ┌─▼───────────────┐
         │ PostgreSQL  │  │ Gemini Embedding│
         │  Storage    │  │    (768-dim)    │
         │             │  └────────┬────────┘
         │ canonical   │           │
         │ _documents  │◄──────────┘
         │ (+ embedding│  stored as JSON
         │  TEXT col)  │
         │ segment_docs│
         │ postings    │
         │ terms       │
         │ segments    │
         └──────┬──────┘
                │
     ┌──────────▼───────────┐
     │      Query Time      │
     │                      │
     │  ┌───────┐ ┌──────┐  │
     │  │TF-IDF │ │Cosine│  │
     │  │Keyword│ │Simil.│  │
     │  └───┬───┘ └──┬───┘  │
     │      └────┬───┘      │
     │      ┌────▼────┐     │
     │      │   RRF   │     │
     │      │ Fusion  │     │
     │      └────┬────┘     │
     │           │          │
     │     ┌─────▼─────┐    │
     │     │    RAG    │    │
     │     │ (optional)│    │
     │     │ Gemini LLM│    │
     │     └───────────┘    │
     └──────────────────────┘
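The semantic branch of the diagram compares a query embedding with embeddings stored as JSON text. A minimal sketch of that comparison follows, assuming the stored value is a flat JSON array of numbers such as `[0.12, -0.5]`. The hand-rolled parser and the names `CosineSketch` and `parseEmbedding` are hypothetical; the repository's `EmbeddingStore` and `CosineSimilarity` may use a JSON library instead.

```java
// Illustrative only; assumes embeddings are stored as a flat JSON array of numbers.
final class CosineSketch {

    // Minimal parser for strings like "[0.12, -0.5, 3e-2]"; not a general JSON parser.
    static double[] parseEmbedding(String json) {
        String body = json.trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("expected a JSON array of numbers");
        }
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return new double[0];
        }
        String[] parts = body.split(",");
        double[] vector = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            vector[i] = Double.parseDouble(parts[i].trim());
        }
        return vector;
    }

    // cos(a, b) = (a · b) / (|a| * |b|); returns 0 when either vector has zero length.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Storing vectors as text means every query has to parse and scan all embeddings. That is fine for a learning project, and it is the step a dedicated vector index would replace in a larger system.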
    fetch/
        WikipediaFetcher      Full article text via Wikipedia extract API
    processing/
        Tokenization
        Normalization
        Stop-word filtering
    index/
        InvertedIndex
        Posting
            Stores:
            - doc_id
            - term frequency
            - positions
            - offsets
    segment/
        Segment
        SegmentManager
            Owns lifecycle and visibility of segments
    storage/
        PostgreSQL persistence layer
        Explicit separation of:
        - documents
        - terms
        - postings
        - segments
    embedding/                ← NEW
        EmbeddingService      Gemini embedding API client (768-dim)
        EmbeddingStore        JSON text storage in PostgreSQL
    query/
        Boolean query execution
        Snippet generation using offsets
        SemanticQueryEngine   ← NEW: cosine similarity search
        HybridQueryEngine     ← NEW: RRF fusion of TF-IDF + semantic
    ranking/
        TF-IDF ranking implementation
        CosineSimilarity      ← NEW: vector similarity
    rag/                      ← NEW
        LlmClient             Gemini 2.5 Flash Lite API client
        ContextAssembler      Top-K doc context builder
        RagPipeline           Full RAG orchestration
    shell/
        Interactive CLI (SearchShell)
        Exposes:
        - seed
        - seed-file
        - search (hybrid)
        - semantic-search
        - ask (RAG)
        - reindex-embeddings
        - merge
- Java 17+
- Docker + Docker Compose
- Gemini API key (optional, for AI features)
# Set Gemini API key (required for semantic search, hybrid search, and RAG)
export GEMINI_API_KEY=your_key_here
Get a free API key from Google AI Studio.
Without the API key, the engine falls back to keyword-only TF-IDF search.
docker-compose up -d
psql -h localhost -U search -d search -f schema.sql
If adding AI features to an existing database, run:
ALTER TABLE canonical_documents ADD COLUMN IF NOT EXISTS embedding TEXT;
mvn clean package
java -jar target/search-engine.jar
- seed <page-key>: Fetch and index a Wikipedia page (also generates embedding if AI enabled)
- seed-file <path>: Seed multiple pages from a file
- search <query>: Hybrid search (TF-IDF + semantic) or keyword-only if AI disabled
- semantic-search <query>: Pure semantic vector search (requires GEMINI_API_KEY)
- ask <question>: RAG (retrieves docs, generates AI answer with citations; requires GEMINI_API_KEY)
- reindex-embeddings: Generate embeddings for all documents missing them (requires GEMINI_API_KEY)
- merge <segA> <segB>: Merge two segments
- load: Reload from database
- exit: Exit
This project exists to answer questions like:
- Why are search engines append-only?
- Why does scoring change after merges?
- Why is deletion expensive?
- Why do real systems tolerate approximation?
- How do hybrid search systems combine keyword and semantic signals?
- How does RAG ground LLM answers in retrieved evidence?
The answers emerge naturally once you try to build one.