A fully local, offline Retrieval-Augmented Generation (RAG) pipeline that lets you ask natural language questions over a collection of PDF research papers — with cited answers powered by a local LLM (no API key required).
- Ingests PDF research papers using PyMuPDF
- Chunks each page into ~400-token pieces with overlap
- Embeds chunks into 384-dimensional vectors using Sentence Transformers
- Indexes them in a FAISS vector store for fast similarity search
- Retrieves the top-4 most relevant passages for any query
- Generates a grounded answer with source citations using a local Ollama LLM
```
PDF Papers
    │
    ▼
Text Extraction (PyMuPDF)
    │
    ▼
400-token Chunks (tiktoken + LangChain splitter)
    │
    ▼
Embeddings (sentence-transformers/all-MiniLM-L6-v2 · 384-dim)
    │
    ▼
FAISS Index (IndexFlatIP · cosine similarity)
    │
    ▼ ◄── User Question (embedded the same way)
Top-4 Retrieval
    │
    ▼
LLM (Ollama · llama3.2) ──► Answer + Citations
```
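The ~400-token chunks with 50-token overlap (tiktoken for counting, LangChain's splitter for the actual splitting) boil down to a sliding window over tokens. A minimal pure-Python sketch of that windowing logic (a simplification of what the splitter does, not the pipeline's actual code):

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into overlapping windows.

    Consecutive chunks share `overlap` tokens, so sentences near a
    chunk boundary still appear with surrounding context.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# In the real pipeline the tokens come from tiktoken's cl100k_base encoding:
#   enc = tiktoken.get_encoding("cl100k_base")
#   tokens = enc.encode(page_text)
```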
| Component | Library / Tool |
|---|---|
| PDF parsing | PyMuPDF (fitz) |
| Tokenization | tiktoken (cl100k_base) |
| Text splitting | LangChain RecursiveCharacterTextSplitter |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Vector store | FAISS (IndexFlatIP) |
| LLM | Ollama (llama3.2 3B — local, free, offline) |
| Orchestration | LangChain (langchain-core, langchain-ollama) |
| Interface | Jupyter Notebook |
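A note on the vector store: `IndexFlatIP` scores by inner product, which equals cosine similarity only when the embeddings are L2-normalized first. A NumPy sketch of what the index computes at query time (illustrative, not the notebook's code):

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=4):
    """Brute-force equivalent of FAISS IndexFlatIP over normalized vectors:
    inner product == cosine similarity, search == top-k over all rows."""
    # L2-normalize so dot products equal cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                   # one similarity per stored chunk
    top = np.argsort(-scores)[:k]    # indices of the k most similar chunks
    return top, scores[top]
```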
- Python 3.11+ (via conda recommended)
- Ollama installed and running
- `llama3.2` model pulled
```bash
git clone https://github.com/<your-username>/RAG-Research-QA.git
cd RAG-Research-QA
conda create -n rag_env python=3.11 -y
conda activate rag_env
pip install -r requirements.txt
```

```bash
# macOS
brew install ollama
brew services start ollama
ollama pull llama3.2
```

Drop any PDF research papers into the `papers/` folder.

```bash
jupyter notebook RAG_Pipeline.ipynb
```

Register the kernel if needed:

```bash
python -m ipykernel install --user --name rag_env --display-name "RAG Project (Python 3.11)"
```

| Cell | Purpose |
|---|---|
| Cell 1 | Install all dependencies |
| Cell 2 | Configure paths, chunk size, model names |
| Cell 3 | Extract text from all PDFs in papers/ |
| Cell 4 | Split pages into 400-token chunks |
| Cell 5 | Generate embeddings (384-dim, CPU) |
| Cell 6 | Build & save FAISS index to data/ |
| Cell 6b | (Optional) Load existing index — skip re-embedding |
| Cell 7 | Ask a question → get answer + citations |
Typical workflow:
- First time / adding new papers: Run Cells 2 → 3 → 4 → 5 → 6, then Cell 7
- Returning session: Run Cell 6b (loads saved index), then Cell 7
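Cell 7's answer step comes down to stuffing the retrieved passages into a prompt that instructs the LLM to cite its sources. A sketch of that prompt assembly (the chunk dict layout and instruction wording here are assumptions, not the notebook's exact code):

```python
def build_prompt(question, chunks):
    """Assemble retrieved chunks into a grounded, citable prompt.

    Each chunk is assumed to be a dict with 'text', 'source', and
    'page' keys, as produced during ingestion.
    """
    context = "\n\n".join(
        f"[{i + 1}] ({c['source']}, p. {c['page']}) {c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources as [1], [2], ... after each claim. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The resulting string is what gets passed to the Ollama model; the numbered tags let the answer's citations be mapped back to specific papers and pages.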
```
RAG/
├── RAG_Pipeline.ipynb               # Main notebook — full pipeline
├── requirements.txt                 # Python dependencies
├── .env.example                     # Environment variable template
├── papers/                          # Drop your PDF papers here
│   ├── attention_is_all_you_need.pdf
│   ├── bert.pdf
│   ├── gpt3.pdf
│   ├── llama.pdf
│   └── rag_paper.pdf
└── data/                            # Auto-generated (gitignored)
    ├── index.faiss                  # FAISS vector index
    └── metadata.pkl                 # Chunk metadata
```
All settings are in Cell 2 of the notebook:
```python
PAPERS_DIR = "papers"       # folder with your PDFs
CHUNK_SIZE = 400            # tokens per chunk
CHUNK_OVERLAP = 50          # token overlap between chunks
TOP_K = 4                   # number of passages retrieved per query
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
OLLAMA_MODEL = "llama3.2"   # any model you have pulled in Ollama
```

| Metric | Value |
|---|---|
| Embedding model size | ~90 MB |
| Embedding speed | ~500 chunks/min on CPU |
| FAISS search latency | < 50 ms for 18K vectors |
| LLM response time | ~1–2 min (llama3.2 on CPU) |
| Scales to | ~10,000+ papers before needing ANN index |
This setup handles 200+ papers with no code changes — just drop PDFs in and re-run Cells 3→6.
For 10,000+ papers, switch to an approximate index:
```python
# In Cell 6, replace IndexFlatIP with an IVF index. Pass the inner-product
# metric explicitly — IndexIVFFlat defaults to L2:
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 100, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)  # learn the 100 cluster centroids
index.add(embeddings)    # then add the vectors to the trained index
```

MIT