
# 📄 Research Paper Question Answering System (RAG)

A fully local, offline Retrieval-Augmented Generation (RAG) pipeline that lets you ask natural language questions over a collection of PDF research papers — with cited answers powered by a local LLM (no API key required).


## 🧠 What It Does

  1. Ingests PDF research papers using PyMuPDF
  2. Chunks each page into ~400-token pieces with overlap
  3. Embeds chunks into 384-dimensional vectors using Sentence Transformers
  4. Indexes them in a FAISS vector store for fast similarity search
  5. Retrieves the top-4 most relevant passages for any query
  6. Generates a grounded answer with source citations using a local Ollama LLM

## 🏗️ Architecture

```text
PDF Papers
    │
    ▼
Text Extraction (PyMuPDF)
    │
    ▼
400-token Chunks (tiktoken + LangChain splitter)
    │
    ▼
Embeddings (sentence-transformers/all-MiniLM-L6-v2 · 384-dim)
    │
    ▼
FAISS Index (IndexFlatIP · cosine similarity)
    │
    ▼  ◄── User Question (embedded the same way)
Top-4 Retrieval
    │
    ▼
LLM (Ollama · llama3.2) ──► Answer + Citations
```

## 🛠️ Tech Stack

| Component | Library / Tool |
|---|---|
| PDF parsing | PyMuPDF (`fitz`) |
| Tokenization | tiktoken (`cl100k_base`) |
| Text splitting | LangChain `RecursiveCharacterTextSplitter` |
| Embeddings | `sentence-transformers/all-MiniLM-L6-v2` |
| Vector store | FAISS (`IndexFlatIP`) |
| LLM | Ollama (llama3.2 3B — local, free, offline) |
| Orchestration | LangChain (`langchain-core`, `langchain-ollama`) |
| Interface | Jupyter Notebook |

## 📋 Prerequisites

- Python 3.11+ (via conda recommended)
- Ollama installed and running
- llama3.2 model pulled

## ⚡ Quick Start

### 1. Clone the repo

```bash
git clone https://github.com/<your-username>/RAG-Research-QA.git
cd RAG-Research-QA
```

### 2. Create a conda environment

```bash
conda create -n rag_env python=3.11 -y
conda activate rag_env
pip install -r requirements.txt
```

### 3. Install and start Ollama

```bash
# macOS
brew install ollama
brew services start ollama
ollama pull llama3.2
```

### 4. Add your papers

Drop any PDF research papers into the `papers/` folder.

### 5. Open the notebook

```bash
jupyter notebook RAG_Pipeline.ipynb
```

Register the kernel if needed:

```bash
python -m ipykernel install --user --name rag_env --display-name "RAG Project (Python 3.11)"
```

## 📓 Notebook Walkthrough

| Cell | Purpose |
|---|---|
| Cell 1 | Install all dependencies |
| Cell 2 | Configure paths, chunk size, model names |
| Cell 3 | Extract text from all PDFs in `papers/` |
| Cell 4 | Split pages into 400-token chunks |
| Cell 5 | Generate embeddings (384-dim, CPU) |
| Cell 6 | Build & save FAISS index to `data/` |
| Cell 6b | (Optional) Load existing index — skip re-embedding |
| Cell 7 | Ask a question → get answer + citations |

Typical workflow:

- **First time / adding new papers:** Run Cells 2 → 3 → 4 → 5 → 6, then Cell 7
- **Returning session:** Run Cell 6b (loads saved index), then Cell 7
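The question-answering step (Cell 7) is largely prompt assembly: number the retrieved chunks, inline their source and page metadata, and instruct the model to cite by number. A sketch of that pattern — the chunk data and prompt wording here are illustrative, not the notebook's exact text, and the final call would be along the lines of `ChatOllama(model="llama3.2").invoke(prompt)` from `langchain-ollama`:

```python
# Illustrative retrieved chunks; in the pipeline these come from FAISS top-4.
retrieved = [
    {"source": "rag_paper.pdf", "page": 3,
     "text": "RAG combines a dense retriever with a seq2seq generator."},
    {"source": "attention_is_all_you_need.pdf", "page": 1,
     "text": "The Transformer relies entirely on attention mechanisms."},
]

# Number each passage and inline its source/page so the model can cite [n].
context = "\n\n".join(
    f"[{i}] ({c['source']}, p.{c['page']}) {c['text']}"
    for i, c in enumerate(retrieved, start=1)
)
prompt = (
    "Answer the question using ONLY the context below. "
    "Cite supporting passages as [n].\n\n"
    f"Context:\n{context}\n\n"
    "Question: What is RAG?"
)
```

Grounding the model this way is what turns raw retrieval into "answer + citations": the citations are just the bracket numbers mapped back to source file and page.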

## 📂 Project Structure

```text
RAG/
├── RAG_Pipeline.ipynb   # Main notebook — full pipeline
├── requirements.txt     # Python dependencies
├── .env.example         # Environment variable template
├── papers/              # Drop your PDF papers here
│   ├── attention_is_all_you_need.pdf
│   ├── bert.pdf
│   ├── gpt3.pdf
│   ├── llama.pdf
│   └── rag_paper.pdf
└── data/                # Auto-generated (gitignored)
    ├── index.faiss      # FAISS vector index
    └── metadata.pkl     # Chunk metadata
```

## ⚙️ Configuration

All settings are in Cell 2 of the notebook:

```python
PAPERS_DIR    = "papers"      # folder with your PDFs
CHUNK_SIZE    = 400           # tokens per chunk
CHUNK_OVERLAP = 50            # token overlap between chunks
TOP_K         = 4             # number of passages retrieved per query
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
OLLAMA_MODEL  = "llama3.2"    # any model you have pulled in Ollama
```

## 📈 Performance (approximate)

| Metric | Value |
|---|---|
| Embedding model size | ~90 MB |
| Embedding speed | ~500 chunks/min on CPU |
| FAISS search latency | < 50 ms for 18K vectors |
| LLM response time | ~1–2 min (llama3.2 on CPU) |
| Capacity | ~10,000+ papers before an ANN index is needed |

## 🔄 Scaling Up

This setup handles 200+ papers with no code changes — just drop PDFs in and re-run Cells 3→6.

For 10,000+ papers, switch to an approximate index:

```python
# In Cell 6, replace IndexFlatIP with an IVF index.
# Note: IndexIVFFlat defaults to L2, so the inner-product metric must be passed explicitly.
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 100, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)   # IVF indexes must be trained before adding vectors
index.add(embeddings)
```

## 📄 License

MIT
