This repository contains an advanced Retrieval-Augmented Generation (RAG) pipeline that combines hybrid search (Dense + Sparse) with Reciprocal Rank Fusion (RRF), and leverages Google's Gemini models for powerful reranking and response generation with precise source citations.
- Hybrid Retrieval System:
  - Dense Retrieval: Uses sentence-transformers (all-MiniLM-L6-v2) and ChromaDB for semantic search.
  - Sparse Retrieval: Uses BM25 (rank_bm25) for exact keyword matching.
- Reciprocal Rank Fusion (RRF): Merges the dense and sparse rankings by summing each chunk's reciprocal rank across both lists, so chunks ranked highly by either retriever rise to the top.
- LLM Reranking: Sends up to 20 fused candidates to the Gemini API, which selects the 5 most contextually relevant chunks.
- Citation-Backed Generation: Uses the Gemini API to answer the user's query while strictly citing the sources inline (e.g., [1], [2]).
- Evaluation Module: Built-in evaluation script to test the pipeline against a truth dataset and verify citation coverage and relevance.
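
The RRF step in the feature list can be illustrated with a minimal sketch. It is not the code in src/search.py: the function name, the chunk IDs, and the constant k=60 (a commonly used default for RRF) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not a value
    taken from this repository.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs standing in for ChromaDB and BM25 results.
dense_hits = ["nlp_3", "ds_1", "ds_4"]
sparse_hits = ["nlp_3", "nlp_7", "ds_1"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['nlp_3', 'ds_1', 'nlp_7', 'ds_4']
```

A chunk that appears in both lists (nlp_3, ds_1) outranks one that appears in only one, which is how keyword matches from BM25 can lift a document that dense search ranked lower.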
rag-project/
├── data/
│ └── belgeler/ # Put your .txt and .pdf files here
├── db/ # Auto-generated ChromaDB storage and BM25 chunks
├── src/
│ ├── ingest.py # Reads docs, chunks them, and stores embeddings
│ ├── search.py # Performs Hybrid Search (ChromaDB + BM25) and RRF fusion
│ ├── rerank.py # Sends top 20 to Gemini for top-5 reranking
│ ├── generate.py # Generates the final answer with citations via Gemini
│ └── eval.py # Runs tests from eval_set.json
├── eval_set.json # Custom question-answer pairs for evaluation
├── main.py # Interactive CLI application
├── requirements.txt # Python dependencies
├── .env # Environment variables (API Keys)
└── .gitignore
Make sure you have Python installed, then install the required packages:
pip install -r requirements.txt
This project uses the Google Gemini API for reranking and generating answers.
- Get a free API key from Google AI Studio.
- Create a .env file in the root directory (already ignored by Git):
  GEMINI_API_KEY=AIzaSy_YOUR_API_KEY_HERE
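
How the project reads this key is not shown in this README. As a rough sketch using only the standard library, a script could load the file and fail early when the key is missing. The helper names load_env_file and require_api_key are hypothetical; a dedicated package such as python-dotenv is a common alternative.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a .env file into os.environ.

    Variables that are already set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def require_api_key(name="GEMINI_API_KEY"):
    """Return the key, or raise a readable error if it is not set."""
    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is not set; add it to .env or export it.")
    return key
```

Calling load_env_file() followed by require_api_key() at startup surfaces a missing key before any documents are processed.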
Drop your .txt or .pdf files into the data/belgeler/ folder.
Run the ingestion script to chunk the texts, create embeddings, and build the search index:
python src/ingest.py
Note: The first time you run this, it will download the sentence-transformer model (~80MB).
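
The chunking strategy used by src/ingest.py is not described here. As an illustration only, one simple approach is fixed-size character windows with a small overlap, so that a sentence cut at a boundary still appears whole in one chunk. The function name and the chunk_size and overlap defaults below are assumptions.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows.

    chunk_size and overlap are illustrative defaults, not the values
    used by this project.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk would then be embedded for ChromaDB and tokenized for BM25, so both indexes refer to the same units of text.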
Start the interactive CLI to ask questions about your documents:
python main.py
Type your question and watch the pipeline search, rerank, and generate a well-cited answer!
To evaluate the system, add some questions and expected keywords into eval_set.json:
[
{
"question": "What is machine learning?",
"expected_keywords": ["learning", "data", "algorithms"]
}
]
Then run the evaluation script to see how well the pipeline performs (verifying both citations and keywords):
python src/eval.py
(Note: If you are using the free tier of the Gemini API, you may occasionally see a 503 UNAVAILABLE error during heavy automated eval tests. Simply wait a minute and retry.)
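
The exact pass criteria in src/eval.py are not spelled out in this README. A plausible sketch of a keyword-plus-citation check follows; the function name, the rule that every expected keyword must appear, and the rule that every [n] marker must point to a retrieved source are all assumptions.

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def evaluate_answer(answer, expected_keywords, num_sources):
    """Check one generated answer against an eval_set.json entry.

    Assumed criteria: all expected keywords occur in the answer
    (case-insensitive), at least one citation is present, and every
    citation number refers to one of the num_sources retrieved chunks.
    """
    text = answer.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    cited = [int(n) for n in CITATION_RE.findall(answer)]
    valid = [n for n in cited if 1 <= n <= num_sources]
    return {
        "passed": not missing and bool(cited) and len(valid) == len(cited),
        "missing_keywords": missing,
        "citations": f"{len(valid)}/{len(cited)}",
    }


sample = "Machine learning finds patterns in data using algorithms [1][2]."
print(evaluate_answer(sample, ["learning", "data", "algorithms"], num_sources=2))
```

Under these assumptions, an answer with no citations at all reports 0/0 and fails, even if its wording happens to contain the keywords.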
We ran a comparison test to demonstrate why Hybrid Search is necessary. When querying "Transformer mimarisinin avantajları nelerdir?" (What are the advantages of the Transformer architecture?):
Only Dense Search (ChromaDB):
- nlp_temelleri.txt ✅ (Correctly identifies Transformers)
- veri_bilimi.txt ❌ (Completely unrelated text about Data Science metrics)
- veri_bilimi.txt ❌ (Unrelated text about Feature Engineering)
Why? Dense search alone can sometimes be biased by the semantic structure of sentences rather than strict keyword matching.
Hybrid Search (Dense + BM25 + RRF):
- nlp_temelleri.txt ✅ (Correctly identifies Transformers)
- nlp_temelleri.txt ✅ (Related context about NLP and Word Embeddings)
- veri_bilimi.txt ❌
Why? BM25 caught the exact keywords "Transformer" and "RNN", which boosted the relevance of the NLP document and gave the LLM richer, more accurate candidates.
| Metric | Score |
|---|---|
| Overall Accuracy | 80% (4/5) |
| Citation Coverage | 14/14 successful citations |
| Hybrid vs Dense | BM25 removed 2 irrelevant chunks |
| Resilience | 503 errors handled via retry + fallback |
Tested on 3 Turkish documents: Yapay Zeka (Artificial Intelligence), Veri Bilimi (Data Science), and NLP Temelleri (NLP Fundamentals).
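
The "retry + fallback" handling in the Resilience row is not detailed in this README. One way such handling could look is exponential backoff around the API call, with an optional fallback once the retries run out. Everything below is an assumption for illustration: the function name, the delays, and detecting a transient error by looking for "503" in the message.

```python
import time


def call_with_retry(primary, fallback=None, retries=3, base_delay=1.0,
                    is_transient=lambda exc: "503" in str(exc),
                    sleep=time.sleep):
    """Call primary(), retrying transient failures with exponential backoff.

    If every attempt fails and a fallback is given, return fallback()
    instead, e.g. keeping the fused order when reranking is unavailable.
    Non-transient errors are re-raised immediately.
    """
    for attempt in range(retries):
        try:
            return primary()
        except Exception as exc:
            if not is_transient(exc):
                raise
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    if fallback is not None:
        return fallback()
    raise RuntimeError(f"still unavailable after {retries} attempts")


# Simulated API call that fails twice with a 503 and then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("503 UNAVAILABLE")
    return "ok"


print(call_with_retry(flaky_call, sleep=lambda seconds: None))  # ok
```

Passing sleep as a parameter keeps the sketch testable without real waiting; production code would leave the default time.sleep in place.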
$ python main.py
==================================================
RAG Pipeline -- Soru-Cevap
==================================================
❓ SORU: Transformer mimarisinin geleneksel RNN modellerine göre avantajı nedir?
🔍 [1/3] Hibrit arama yapılıyor (Dense + BM25 + RRF)...
→ 7 aday chunk bulundu.
🎯 [2/3] Reranking yapılıyor (Gemini)...
[RERANK] Gemini'ye 7 aday gönderiliyor...
[RERANK] Top 5 parça seçildi.
✍️ [3/3] Yanıt üretiliyor (Gemini)...
[GENERATE] Yanıt üretildi (490 karakter).
============================================================
📝 YANIT:
------------------------------------------------------------
Transformer mimarisi, RNN'ye kıyasla birçok kritik avantaj sunar:
* **Paralel İşleme:** RNN'ler sıralı işler; Transformer tüm sekansı
aynı anda işler. Bu eğitimi dramatik biçimde hızlandırır [1].
* **Öz-Dikkat (Self-Attention):** Cümledeki tüm kelimeler arasındaki
uzun menzilli bağımlılıkları doğrudan yakalar [1].
* **Hafıza Kaybı Yok:** RNN'lerin gradient vanishing sorunu yoktur;
Transformer bu problemi mimari olarak aşar [2].
📚 KAYNAKLAR: [1] nlp_temelleri.txt [2] nlp_temelleri.txt
============================================================
[EVAL] 5 soru değerlendiriliyor...
[1/5] ✓ BAŞARILI — Atıf: 1/1
[2/5] ✓ BAŞARILI — Atıf: 5/5
[3/5] ✓ BAŞARILI — Atıf: 3/3
[4/5] ✗ BAŞARISIZ — Atıf: 0/0
[5/5] ✓ BAŞARILI — Atıf: 5/5
[SONUÇ] Doğruluk oranı: 80.0% (4/5)
- ChromaDB: Vector database for semantic search.
- Sentence-Transformers: Lightweight embedding generation.
- Rank-BM25: Keyword-based sparse retrieval.
- Google GenAI SDK: LLM integration for reranking and generation.
- PyMuPDF (fitz): High-speed PDF parsing.