RAG Pipeline with Hybrid Search and Gemini

This repository contains a Retrieval-Augmented Generation (RAG) pipeline that combines hybrid search (dense + sparse) through Reciprocal Rank Fusion (RRF), then uses Google's Gemini models to rerank the candidates and generate answers with inline source citations.

🌟 Key Features

Hybrid Retrieval System:
- Dense Retrieval: Uses sentence-transformers (all-MiniLM-L6-v2) and ChromaDB for semantic search.
- Sparse Retrieval: Uses BM25 (rank_bm25) for exact keyword matching.
Reciprocal Rank Fusion (RRF): Merges the dense and sparse result lists by rank, so chunks that either retriever places near the top rise in the fused ranking.
LLM Reranking: Sends up to 20 fused candidates to the Gemini API, which selects the 5 most contextually relevant chunks.
Citation-Backed Generation: Uses the Gemini API to answer the user's query while citing the sources inline (e.g., [1], [2]).
Evaluation Module: Built-in evaluation script that tests the pipeline against a ground-truth dataset and checks citation coverage and relevance.
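
The fusion step listed above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the code in src/search.py: the function name `rrf_fuse`, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs returned by the two retrievers.
dense_hits = ["nlp_3", "ds_1", "ds_4"]
sparse_hits = ["nlp_3", "nlp_5", "ds_1"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

A chunk that appears in both lists accumulates two reciprocal-rank terms, which is why agreement between the retrievers tends to outrank a single high placement.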

📁 Project Structure

rag-project/
├── data/
│   └── belgeler/        # Put your .txt and .pdf files here
├── db/                  # Auto-generated ChromaDB storage and BM25 chunks
├── src/
│   ├── ingest.py        # Reads docs, chunks them, and stores embeddings
│   ├── search.py        # Performs Hybrid Search (ChromaDB + BM25) and RRF fusion
│   ├── rerank.py        # Sends top 20 to Gemini for top-5 reranking
│   ├── generate.py      # Generates the final answer with citations via Gemini
│   └── eval.py          # Runs tests from eval_set.json
├── eval_set.json        # Custom question-answer pairs for evaluation
├── main.py              # Interactive CLI application
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (API Keys)
└── .gitignore

🚀 Getting Started

1. Install Dependencies

Make sure you have Python installed, then install the required packages:

pip install -r requirements.txt

2. Set Up API Keys

This project uses the Google Gemini API for reranking and generating answers.

Get a free API key from Google AI Studio.
Create a .env file in the root directory (already ignored by Git):
```
GEMINI_API_KEY=AIzaSy_YOUR_API_KEY_HERE
```
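
A script can fail fast when the key is missing. The sketch below assumes the variable has already been loaded into the process environment (how the .env file is loaded in this repository is not shown here), and `require_api_key` is a hypothetical helper name.

```python
import os


def require_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return key
```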

3. Ingest Documents

Drop your .txt or .pdf files into the data/belgeler/ folder. Run the ingestion script to chunk the texts, create embeddings, and build the search index:

python src/ingest.py

Note: The first time you run this, it will download the sentence-transformer model (~80MB).
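
For readers unfamiliar with chunking, here is one simple way to do it: fixed-size character windows with a small overlap so sentences cut at a boundary still appear whole in a neighboring chunk. The strategy, the name `chunk_text`, and the default sizes are assumptions for illustration; src/ingest.py may chunk differently.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows (assumed strategy)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that are only whitespace
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
```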

4. Run the RAG Pipeline

Start the interactive CLI to ask questions about your documents:

python main.py

Type your question and watch the pipeline search, rerank, and generate a well-cited answer!
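
Inline citations such as [1] and [2] work when the prompt numbers each retrieved chunk. The sketch below shows one plausible way to build such a prompt; the function name `build_cited_prompt`, the chunk dictionary keys, and the instruction wording are assumptions, not the prompt used in src/generate.py.

```python
def build_cited_prompt(question, chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    lines = [
        "Answer the question using only the sources below.",
        "Cite each claim inline with its source number, e.g. [1].",
        "",
    ]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Hypothetical reranked chunks.
prompt = build_cited_prompt(
    "What is self-attention?",
    [{"source": "nlp_temelleri.txt", "text": "Self-attention relates all tokens."}],
)
```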

🧪 Evaluation

To evaluate the system, add questions and their expected keywords to eval_set.json:

[
  {
    "question": "What is machine learning?",
    "expected_keywords": ["learning", "data", "algorithms"]
  }
]

Then run the evaluation script to see how well the pipeline performs (verifying both citations and keywords):

python src/eval.py
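
The two checks mentioned above (keywords and citations) could look roughly like this for a single answer. It is a sketch under assumptions: `score_answer` and its return keys are hypothetical, and src/eval.py may define success differently.

```python
import re


def score_answer(answer, expected_keywords, num_sources):
    """Check keyword coverage and that every [n] citation is in range."""
    lowered = answer.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in lowered]
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    valid = [n for n in cited if 1 <= n <= num_sources]
    return {
        "keywords_ok": not missing,
        "missing_keywords": missing,
        "citations_valid": len(valid),
        "citations_total": len(cited),
    }


result = score_answer(
    "Machine learning fits algorithms to data [1][3].",
    ["learning", "data", "algorithms"],
    num_sources=2,
)
```

Here the citation [3] points past the two available sources, so it is counted in the total but not as valid.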

(Note: If you are using the free tier of the Gemini API, you may occasionally see a 503 UNAVAILABLE error during heavy automated eval tests. Simply wait a minute and retry.)
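
Waiting and retrying can also be automated. The following is a generic exponential-backoff sketch, not the repository's retry logic: `call_with_retry` is a hypothetical helper, and it catches every exception for brevity, whereas real code would catch only the SDK's transient errors.

```python
import time


def call_with_retry(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.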

📊 Dense vs Hybrid Search Comparison

We ran a comparison test to show what hybrid search adds over dense-only retrieval. For the query "Transformer mimarisinin avantajları nelerdir?" (What are the advantages of the Transformer architecture?), the top three results were:

Only Dense Search (ChromaDB):

nlp_temelleri.txt ✅ (Correctly identifies Transformers)
veri_bilimi.txt ❌ (Completely unrelated text about Data Science metrics)
veri_bilimi.txt ❌ (Unrelated text about Feature Engineering)

Why? Dense search alone can favor chunks whose sentences are semantically similar to the query even when they lack its exact keywords.

Hybrid Search (Dense + BM25 + RRF):

nlp_temelleri.txt ✅ (Correctly identifies Transformers)
nlp_temelleri.txt ✅ (Related context about NLP and Word Embeddings)
veri_bilimi.txt ❌

Why? BM25 matched the exact keywords "Transformer" and "RNN" and boosted the NLP document, giving the LLM more relevant candidates.
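
The keyword effect described above can be reproduced with a toy scorer. This is a textbook Okapi BM25 variant written from scratch for illustration; it is not the rank_bm25 implementation the project uses, and the documents, query, and parameter values (k1=1.5, b=0.75) are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the doc contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical two-document corpus.
docs = [
    "transformer self attention replaces the rnn".split(),
    "feature engineering improves model accuracy".split(),
]
scores = bm25_scores("transformer rnn".split(), docs)
```

Only the first document contains the query terms, so it receives a positive score while the second scores zero, regardless of how semantically close the two texts might look to an embedding model.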

🧪 Eval Results

| Metric | Score |
|---|---|
| Overall Accuracy | 80% (4/5) |
| Citation Coverage | 14/14 successful citations |
| Hybrid vs Dense | BM25 removed 2 irrelevant chunks |
| Resilience | 503 errors handled via retry + fallback |

Tested on three Turkish documents: Yapay Zeka (Artificial Intelligence), Veri Bilimi (Data Science), and NLP Temelleri (NLP Fundamentals).
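
One plausible reading of the "fallback" in the Resilience row is that the pipeline keeps the fused ranking when reranking fails. The sketch below illustrates that idea only; `select_top_chunks`, the `rerank_fn` callback, and its index-list return format are assumptions, not the behavior of src/rerank.py.

```python
def select_top_chunks(candidates, rerank_fn, top_k=5):
    """Use the reranker's order if it succeeds, else keep the fused order.

    rerank_fn is assumed to return a list of indices into candidates.
    """
    try:
        order = rerank_fn(candidates)
        picked = [candidates[i] for i in order if 0 <= i < len(candidates)]
    except Exception:
        picked = []
    if not picked:
        picked = list(candidates)  # fallback: trust the fused ranking
    return picked[:top_k]
```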

🎬 Demo

$ python main.py

==================================================
   RAG Pipeline -- Question Answering
==================================================

❓ QUESTION: What is the advantage of the Transformer architecture over traditional RNN models?

🔍 [1/3] Running hybrid search (Dense + BM25 + RRF)...
   → 7 candidate chunks found.

🎯 [2/3] Reranking (Gemini)...
[RERANK] Sending 7 candidates to Gemini...
[RERANK] Top 5 chunks selected.

✍️  [3/3] Generating answer (Gemini)...
[GENERATE] Answer generated (490 characters).

============================================================
📝 ANSWER:
------------------------------------------------------------
The Transformer architecture offers several critical advantages over RNNs:

* **Parallel Processing:** RNNs process tokens sequentially; the Transformer
  processes the whole sequence at once. This speeds up training dramatically [1].
* **Self-Attention:** Directly captures long-range dependencies between
  all the words in a sentence [1].
* **No Memory Loss:** The vanishing-gradient problem of RNNs is absent;
  the Transformer overcomes it architecturally [2].

📚 SOURCES: [1] nlp_temelleri.txt  [2] nlp_temelleri.txt
============================================================

$ python src/eval.py

[EVAL] Evaluating 5 questions...
[1/5] ✓ PASSED - Citations: 1/1
[2/5] ✓ PASSED - Citations: 5/5
[3/5] ✓ PASSED - Citations: 3/3
[4/5] ✗ FAILED - Citations: 0/0
[5/5] ✓ PASSED - Citations: 5/5
[RESULT] Accuracy: 80.0% (4/5)

🛠 Technologies Used

ChromaDB: Vector database for semantic search.
Sentence-Transformers: Lightweight embedding generation.
Rank-BM25: Keyword-based sparse retrieval.
Google GenAI SDK: LLM integration for reranking and generation.
PyMuPDF (fitz): High-speed PDF parsing.
