A from-scratch, fully local Retrieval-Augmented Generation system over a Wikipedia subset. Built end-to-end on Apple Silicon to explore the practical trade-offs of small-model RAG: embeddings, vector search, prompt construction, and on-device generation.
- Hardware: Apple M4 MacBook Pro (arm64)
- LLM runtime: MLX via `mlx-lm`
- LLM: Phi-2 (`microsoft/phi-2`)
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2` (384-dim)
- Vector store: ChromaDB (local, persistent)
- Dataset: HuggingFace `wikimedia/wikipedia`, config `20231101.simple`
- Language: Python 3.11
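The stack above maps to a small dependency set. This is an inferred, unpinned sketch, not the actual committed file:

```text
# Inferred from the stack above; the committed requirements.txt is authoritative.
mlx-lm                  # MLX runtime + Phi-2 loading/generation
chromadb                # local persistent vector store
sentence-transformers   # all-MiniLM-L6-v2 embeddings
datasets                # streams wikimedia/wikipedia
```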
- Chunk: Split each source document into overlapping word windows (here, 300 words with 50-word overlap) so each chunk fits the embedder and stays semantically coherent.
- Embed: Run every chunk through a sentence-transformer to produce a dense vector that captures meaning rather than surface tokens.
- Index: Store the chunks and vectors in a vector database (ChromaDB), keyed for fast nearest-neighbor lookup.
- Retrieve: Embed the user's question with the same model and pull the top-k most similar chunks from the index.
- Generate: Feed those chunks plus the question into a local LLM, instructing it to ground its answer in the provided context.
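A minimal sketch of this whole loop, assuming hypothetical names (the `wikipedia` collection, `chunk_words`, the prompt wording) that won't exactly match `ingest.py`/`query.py`:

```python
import chromadb
from mlx_lm import load, generate
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("wikipedia")  # assumed name

def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    # Overlapping word windows: stepping by size - overlap leaves 50 shared
    # words between neighbors, so facts near a boundary survive intact.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def ingest(doc_id: str, text: str) -> None:
    chunks = chunk_words(text)
    collection.upsert(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
    )

def answer(question: str, k: int = 3) -> str:
    # The question must be embedded with the same model used at ingest time.
    hits = collection.query(
        query_embeddings=embedder.encode([question]).tolist(), n_results=k
    )
    context = "\n\n".join(hits["documents"][0])
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    model, tokenizer = load("microsoft/phi-2")  # load once and cache in a real script
    return generate(model, tokenizer, prompt=prompt, max_tokens=200)
```

With size 300 and overlap 50 the window advances 250 words at a time, so any fact within 50 words of a chunk boundary still lands whole inside one of the two neighboring chunks.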
- Python 3.11 (arm64 build on Apple Silicon)
- ~2 GB free disk for the Phi-2 weights and the Chroma index
- A working HuggingFace cache (`~/.cache/huggingface`) — Phi-2 will be pulled on first run if not already present
```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run in this order:
```bash
# 1. Build the index (streams 500 wiki articles → chunk → embed → store)
python src/ingest.py

# 2. Ask one-off questions
python src/query.py "Your question here"

# 3. Run the canned evaluation suite
python src/evaluate_rag.py
```

- LLM model path: update `LLM_MODEL` in `src/query.py` to match whatever Phi-2 weights you actually have cached. `microsoft/phi-2` works; the `mlx-community/phi-2` repo is currently broken.
- Re-ingest: `ingest.py` uses `collection.add`, which will raise on duplicate IDs. Delete `db/` before re-running, or switch to `upsert` (see the sketch after this list).
- Dataset: this repo targets `wikimedia/wikipedia` with config `20231101.simple`. The older `wikipedia/20220301.simple` builder is deprecated and may fail to load.
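A sketch of both fixes together, assuming the collection is named `wikipedia` (match whatever `ingest.py` actually creates):

```python
import chromadb
from datasets import load_dataset

# Stream the current dataset/config; the deprecated builder may fail to load.
wiki = load_dataset("wikimedia/wikipedia", "20231101.simple",
                    split="train", streaming=True)

client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("wikipedia")  # assumed name

# upsert is idempotent on IDs, so re-running ingestion will not raise the
# duplicate-ID error that collection.add does. Real ingestion should pass
# the MiniLM embeddings explicitly instead of relying on Chroma's default.
for article in wiki.take(3):
    collection.upsert(ids=[article["id"]], documents=[article["text"]])
```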
Five canned probes from `src/evaluate_rag.py`:
| # | Question | Retrieval | Answer |
|---|---|---|---|
| 1 | Who was Albert Einstein? | good | good |
| 2 | What is the capital of France? | weak | correct but ungrounded |
| 3 | What is photosynthesis? | good | good |
| 4 | When did World War II end? | good | good |
| 5 | What is the Great Wall of China? | weak | poor |
- Weak retrieval (semantic mismatch) — The Great Wall question pulls chunks about walls / Chinese history that don't actually contain the target facts. MiniLM similarity is approximate, and the simple Wikipedia subset has thin coverage of the specific entity.
- Ungrounded generation — Even when retrieval finds the right "Paris is the capital of France" chunk, Phi-2 sometimes ignores the context and recites parametric knowledge. Phi-2 is not strongly instruction-tuned and doesn't reliably defer to retrieved evidence.
- Coverage gaps — Only 500 articles are ingested. Anything not covered is either silently wrong or answered from the LLM's own weights, which defeats the point of the retrieval step.
- Hybrid search — combine BM25 (lexical) with semantic search to catch cases where the question shares rare terms with the source.
- Reranking — use a cross-encoder (e.g. `bge-reranker`) over the top-k retrievals to push the most relevant chunk to position 1; a sketch follows this list.
- Instruction-tuned LLM — swap Phi-2 for an instruction-tuned model (Mistral-Instruct, Llama-3-Instruct, Phi-3-mini) so the model actually follows "answer only from context" prompts.
- Larger corpus — ingest the full simple-Wikipedia dump (~200k articles) or the full English Wikipedia for serious coverage.
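For the reranking idea above, a minimal sketch using sentence-transformers' `CrossEncoder`; `BAAI/bge-reranker-base` is one concrete `bge-reranker` checkpoint, and the chunks here are stand-ins for real retrievals:

```python
from sentence_transformers import CrossEncoder

question = "What is the Great Wall of China?"
retrieved_chunks = [  # stand-ins for the top-k chunks from ChromaDB
    "The Great Wall of China is a series of fortifications across northern China.",
    "Ancient Chinese walls were often built from rammed earth.",
]

reranker = CrossEncoder("BAAI/bge-reranker-base")
# Score each (question, chunk) pair jointly; higher score = more relevant.
scores = reranker.predict([(question, chunk) for chunk in retrieved_chunks])
ranked = [chunk for _, chunk in
          sorted(zip(scores, retrieved_chunks), key=lambda p: p[0], reverse=True)]
print(ranked[0])  # the most relevant chunk moves to position 1
```

Because the cross-encoder reads question and chunk together, it is far more precise than bi-encoder cosine similarity; running it only over the top-k candidates keeps the extra forward passes affordable.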
```text
rag-experiment/
├── src/
│   ├── ingest.py          # stream + chunk + embed + persist
│   ├── query.py           # CLI: question → retrieval → Phi-2 answer
│   └── evaluate_rag.py    # batch eval over fixed test set
├── data/
│   └── eval_results.txt   # latest evaluation output (committed)
├── db/                    # ChromaDB persistent store (gitignored)
├── notebooks/             # exploration
├── requirements.txt
├── README.md
└── CLAUDE.md              # notes for future Claude sessions
```
Built with assistance from Claude (Anthropic) for code generation and Claude Code for project scaffolding. All system design, evaluation, and analysis are my own.