Local RAG Experiment

A from-scratch, fully local Retrieval-Augmented Generation system over a Wikipedia subset. Built end-to-end on Apple Silicon to explore the practical trade-offs of small-model RAG: embeddings, vector search, prompt construction, and on-device generation.

Hardware & Stack

  • Hardware: Apple M4 MacBook Pro (arm64)
  • LLM runtime: MLX via mlx-lm
  • LLM: Phi-2 (microsoft/phi-2)
  • Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384-dim)
  • Vector store: ChromaDB (local, persistent)
  • Dataset: HuggingFace wikimedia/wikipedia, config 20231101.simple
  • Language: Python 3.11

How RAG Works (5 steps)

  1. Chunk: Split each source document into overlapping word windows (here, 300 words with 50-word overlap) so each chunk fits the embedder and stays semantically coherent.
  2. Embed: Run every chunk through a sentence-transformer to produce a dense vector that captures meaning rather than surface tokens.
  3. Index: Store the chunks and vectors in a vector database (ChromaDB), keyed for fast nearest-neighbor lookup.
  4. Retrieve: Embed the user's question with the same model and pull the top-k most similar chunks from the index.
  5. Generate: Feed those chunks plus the question into a local LLM, instructing it to ground its answer in the provided context (a minimal end-to-end sketch follows this list).
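
To make the steps concrete, the sketch below strings them together with the stack listed above. It is illustrative only, not a copy of src/ingest.py or src/query.py: the function names, collection name, and prompt wording are placeholders.

# Minimal end-to-end RAG sketch (illustrative; not the exact project code)
from sentence_transformers import SentenceTransformer
import chromadb
from mlx_lm import load, generate

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection("wiki")
model, tokenizer = load("microsoft/phi-2")  # pulled from the HF cache on first run

# 1. Chunk: overlapping 300-word windows with 50-word overlap
def chunk_words(text, size=300, overlap=50):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# 2-3. Embed each chunk and index it in ChromaDB
def index_document(doc_id, text):
    chunks = chunk_words(text)
    embeddings = embedder.encode(chunks).tolist()
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
    )

# 4-5. Retrieve the top-k chunks for a question and generate a grounded answer
def answer(question, k=3):
    query_emb = embedder.encode([question]).tolist()
    hits = collection.query(query_embeddings=query_emb, n_results=k)
    context = "\n\n".join(hits["documents"][0])
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=200)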

Prerequisites

  • Python 3.11 (arm64 build on Apple Silicon)
  • ~2 GB free disk for the Phi-2 weights and the Chroma index
  • A working HuggingFace cache (~/.cache/huggingface) — Phi-2 will be pulled on first run if not already present

Installation

python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

Run in this order:

# 1. Build the index (streams 500 wiki articles → chunk → embed → store)
python src/ingest.py

# 2. Ask one-off questions
python src/query.py "Your question here"

# 3. Run the canned evaluation suite
python src/evaluate_rag.py

Important Notes

  • LLM model path: update LLM_MODEL in src/query.py to match whatever Phi-2 weights you actually have cached. microsoft/phi-2 works; the mlx-community/phi-2 repo is currently broken.
  • Re-ingest: ingest.py uses collection.add, which will raise on duplicate IDs. Delete db/ before re-running, or switch to upsert (see the sketch after these notes).
  • Dataset: this repo targets wikimedia/wikipedia with config 20231101.simple. The older wikipedia/20220301.simple builder is deprecated and may fail to load.
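
If you would rather make ingestion idempotent than delete db/ each time, the duplicate-ID problem goes away by writing with Chroma's upsert instead of add. Sketch only; the ids, chunks, and embeddings below stand in for whatever ingest.py already passes to add:

# upsert overwrites entries whose IDs already exist instead of failing on them
collection.upsert(
    ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)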

Evaluation Results

Five canned probes from src/evaluate_rag.py:

#   Question                            Retrieval   Answer
1   Who was Albert Einstein?            good        good
2   What is the capital of France?      weak        correct but ungrounded
3   What is photosynthesis?             good        good
4   When did World War II end?          good        good
5   What is the Great Wall of China?    weak        poor

RAG Failure Modes (Observed)

  1. Weak retrieval (semantic mismatch) — The Great Wall question pulls chunks about walls / Chinese history that don't actually contain the target facts. MiniLM similarity is approximate, and the simple Wikipedia subset has thin coverage of the specific entity.
  2. Ungrounded generation — Even when retrieval finds the right "Paris is the capital of France" chunk, Phi-2 sometimes ignores the context and recites parametric knowledge. Phi-2 is not strongly instruction-tuned and doesn't reliably defer to retrieved evidence.
  3. Coverage gaps — Only 500 articles are ingested. Anything not covered is either silently wrong or answered from the LLM's own weights, which defeats the point of the retrieval step.

Potential Improvements

  1. Hybrid search — combine BM25 (lexical) with semantic search to catch cases where the question shares rare terms with the source (a rough sketch follows this list).
  2. Reranking — use a cross-encoder (e.g. bge-reranker) over the top-k retrievals to push the most relevant chunk to position 1.
  3. Instruction-tuned LLM — swap Phi-2 for an instruction-tuned model (Mistral-Instruct, Llama-3-Instruct, Phi-3-mini) so the model actually follows "answer only from context" prompts.
  4. Larger corpus — ingest the full simple-Wikipedia dump (~200k articles) or the full English Wikipedia for serious coverage.
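
As a rough sketch of the hybrid-search idea, one simple approach is to rank chunks with BM25 and with MiniLM separately, then merge the two rankings via reciprocal rank fusion. The snippet assumes the rank_bm25 package and an in-memory chunks list, and it re-embeds every chunk on each call, so treat it as an illustration of the idea rather than something to run over a large corpus as-is.

# Hybrid retrieval sketch: fuse lexical (BM25) and semantic (MiniLM) rankings
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def hybrid_retrieve(question, chunks, k=5, rrf_k=60):
    # Lexical ranking: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    bm25_order = np.argsort(-bm25.get_scores(question.lower().split()))

    # Semantic ranking: cosine similarity of normalized MiniLM embeddings
    chunk_emb = embedder.encode(chunks, normalize_embeddings=True)
    q_emb = embedder.encode([question], normalize_embeddings=True)[0]
    vec_order = np.argsort(-(chunk_emb @ q_emb))

    # Reciprocal rank fusion: score(chunk) = sum over rankings of 1 / (rrf_k + rank)
    fused = {}
    for order in (bm25_order, vec_order):
        for rank, idx in enumerate(order):
            fused[int(idx)] = fused.get(int(idx), 0.0) + 1.0 / (rrf_k + rank)
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [chunks[i] for i in top]

A cross-encoder reranker (improvement 2) would then only need to rescore the k fused candidates rather than the whole index.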

Project Structure

rag-experiment/
├── src/
│   ├── ingest.py         # stream + chunk + embed + persist
│   ├── query.py          # CLI: question → retrieval → Phi-2 answer
│   └── evaluate_rag.py   # batch eval over fixed test set
├── data/
│   └── eval_results.txt  # latest evaluation output (committed)
├── db/                   # ChromaDB persistent store (gitignored)
├── notebooks/            # exploration
├── requirements.txt
├── README.md
└── CLAUDE.md             # notes for future Claude sessions

Acknowledgements

Built with assistance from Claude (Anthropic) for code generation and Claude Code for project scaffolding. All system design, evaluation, and analysis are my own.
