A from-scratch, fully local Retrieval-Augmented Generation system over a Wikipedia subset. Built end-to-end on Apple Silicon to explore the practical trade-offs of small-model RAG: embeddings, vector search, prompt construction, and on-device generation.
- Hardware: Apple M4 MacBook Pro (arm64)
- LLM runtime: MLX via `mlx-lm`
- LLM: Phi-2 (`microsoft/phi-2`)
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2` (384-dim)
- Vector store: ChromaDB (local, persistent)
- Dataset: HuggingFace `wikimedia/wikipedia`, config `20231101.simple`
- Language: Python 3.11
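The stack above maps to a small dependency set. This is an inferred, unpinned sketch, not the actual committed file:

```text
# Inferred from the stack above; the committed requirements.txt is authoritative.
mlx-lm                  # MLX runtime + Phi-2 loading/generation
chromadb                # local persistent vector store
sentence-transformers   # all-MiniLM-L6-v2 embeddings
datasets                # streams wikimedia/wikipedia
```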
- Chunk: Split each source document into overlapping word windows (here, 300 words with 50-word overlap) so each chunk fits the embedder and stays semantically coherent.
- Embed: Run every chunk through a sentence-transformer to produce a dense vector that captures meaning rather than surface tokens.
- Index: Store the chunks and vectors in a vector database (ChromaDB), keyed for fast nearest-neighbor lookup.
- Retrieve: Embed the user's question with the same model and pull the top-k most similar chunks from the index.
- Generate: Feed those chunks plus the question into a local LLM, instructing it to ground its answer in the provided context.
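A minimal sketch of this whole loop, assuming hypothetical names (the `wikipedia` collection, `chunk_words`, the prompt wording) that won't exactly match `ingest.py`/`query.py`:

```python
import chromadb
from mlx_lm import load, generate
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("wikipedia")  # assumed name

def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    # Overlapping word windows: stepping by size - overlap leaves 50 shared
    # words between neighbors, so facts near a boundary survive intact.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def ingest(doc_id: str, text: str) -> None:
    chunks = chunk_words(text)
    collection.upsert(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
    )

def answer(question: str, k: int = 3) -> str:
    # The question must be embedded with the same model used at ingest time.
    hits = collection.query(
        query_embeddings=embedder.encode([question]).tolist(), n_results=k
    )
    context = "\n\n".join(hits["documents"][0])
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    model, tokenizer = load("microsoft/phi-2")  # load once and cache in a real script
    return generate(model, tokenizer, prompt=prompt, max_tokens=200)
```

With size 300 and overlap 50 the window advances 250 words at a time, so any fact within 50 words of a chunk boundary still lands whole inside one of the two neighboring chunks.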
- Python 3.11 (arm64 build on Apple Silicon)
- ~2 GB free disk for the Phi-2 weights and the Chroma index
- A working HuggingFace cache (`~/.cache/huggingface`) — Phi-2 will be pulled on first run if not already present
```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run in this order:
```bash
# 1. Build the index (streams 500 wiki articles → chunk → embed → store)
python src/ingest.py

# 2. Ask one-off questions
python src/query.py "Your question here"

# 3. Run the canned evaluation suite
python src/evaluate_rag.py
```

- LLM model path: update `LLM_MODEL` in `src/query.py` to match whatever Phi-2 weights you actually have cached. `microsoft/phi-2` works; the `mlx-community/phi-2` repo is currently broken.
- Re-ingest: `ingest.py` uses `collection.add`, which will raise on duplicate IDs. Delete `db/` before re-running, or switch to `upsert` (see the sketch after this list).
- Dataset: this repo targets `wikimedia/wikipedia` with config `20231101.simple`. The older `wikipedia/20220301.simple` builder is deprecated and may fail to load.
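A sketch of both fixes together, assuming the collection is named `wikipedia` (match whatever `ingest.py` actually creates):

```python
import chromadb
from datasets import load_dataset

# Stream the current dataset/config; the deprecated builder may fail to load.
wiki = load_dataset("wikimedia/wikipedia", "20231101.simple",
                    split="train", streaming=True)

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("wikipedia")  # assumed name

# upsert is idempotent on IDs, so re-running ingestion will not raise the
# duplicate-ID error that collection.add does. Real ingestion should pass
# the MiniLM embeddings explicitly instead of relying on Chroma's default.
for article in wiki.take(3):
    collection.upsert(ids=[article["id"]], documents=[article["text"]])
```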
Five canned probes from `src/evaluate_rag.py`:
| # | Question | Retrieval | Answer |
|---|---|---|---|
| 1 | Who was Albert Einstein? | good | good |
| 2 | What is the capital of France? | weak | correct but ungrounded |
| 3 | What is photosynthesis? | good | good |
| 4 | When did World War II end? | good | good |
| 5 | What is the Great Wall of China? | weak | poor |
- Weak retrieval (semantic mismatch) — The Great Wall question pulls chunks about walls / Chinese history that don't actually contain the target facts. MiniLM similarity is approximate, and the simple Wikipedia subset has thin coverage of the specific entity.
- Ungrounded generation — Even when retrieval finds the right "Paris is the capital of France" chunk, Phi-2 sometimes ignores the context and recites parametric knowledge. Phi-2 is not strongly instruction-tuned and doesn't reliably defer to retrieved evidence.
- Coverage gaps — Only 500 articles are ingested. Anything not covered is either silently wrong or answered from the LLM's own weights, which defeats the point of the retrieval step.
- Hybrid search — combine BM25 (lexical) with semantic search to catch cases where the question shares rare terms with the source.
- Reranking — use a cross-encoder (e.g. `bge-reranker`) over the top-k retrievals to push the most relevant chunk to position 1; a sketch follows this list.
- Instruction-tuned LLM — swap Phi-2 for an instruction-tuned model (Mistral-Instruct, Llama-3-Instruct, Phi-3-mini) so the model actually follows "answer only from context" prompts.
- Larger corpus — ingest the full simple-Wikipedia dump (~200k articles) or the full English Wikipedia for serious coverage.
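For the reranking idea above, a minimal sketch using sentence-transformers' `CrossEncoder`; `BAAI/bge-reranker-base` is one concrete `bge-reranker` checkpoint, and the chunks here are stand-ins for real retrievals:

```python
from sentence_transformers import CrossEncoder

question = "What is the Great Wall of China?"
retrieved_chunks = [  # stand-ins for the top-k chunks from ChromaDB
    "The Great Wall of China is a series of fortifications across northern China.",
    "Ancient Chinese walls were often built from rammed earth.",
]

reranker = CrossEncoder("BAAI/bge-reranker-base")
# Score each (question, chunk) pair jointly; higher score = more relevant.
scores = reranker.predict([(question, chunk) for chunk in retrieved_chunks])
ranked = [chunk for _, chunk in
          sorted(zip(scores, retrieved_chunks), key=lambda p: p[0], reverse=True)]
print(ranked[0])  # the most relevant chunk moves to position 1
```

Because the cross-encoder reads question and chunk together, it is far more precise than bi-encoder cosine similarity; running it only over the top-k candidates keeps the extra forward passes affordable.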
```text
rag-experiment/
├── src/
│   ├── ingest.py          # stream + chunk + embed + persist
│   ├── query.py           # CLI: question → retrieval → Phi-2 answer
│   └── evaluate_rag.py    # batch eval over fixed test set
├── data/
│   └── eval_results.txt   # latest evaluation output (committed)
├── db/                    # ChromaDB persistent store (gitignored)
├── notebooks/             # exploration
├── requirements.txt
├── README.md
└── CLAUDE.md              # notes for future Claude sessions
```
Built with assistance from Claude (Anthropic) for code generation and Claude Code for project scaffolding. All system design, evaluation, and analysis are my own.