jasstt/rag_project

RAG Pipeline with Hybrid Search and Gemini

Python Gemini ChromaDB BM25

This repository contains a Retrieval-Augmented Generation (RAG) pipeline that combines hybrid search (dense + sparse) through Reciprocal Rank Fusion (RRF) and uses Google's Gemini models for reranking and for generating answers with inline source citations.

🌟 Key Features

  • Hybrid Retrieval System:
    • Dense Retrieval: Uses sentence-transformers (all-MiniLM-L6-v2) and ChromaDB for semantic search.
    • Sparse Retrieval: Uses BM25 (rank_bm25) for exact keyword matching.
  • Reciprocal Rank Fusion (RRF): Merges the dense and sparse rankings by summing reciprocal-rank scores, so chunks ranked highly by either retriever rise to the top.
  • LLM Reranking: Sends up to 20 fused candidates to the Gemini API, which selects the 5 most contextually relevant chunks.
  • Citation-Backed Generation: Uses the Gemini API to answer the user's query, citing sources inline (e.g., [1], [2]).
  • Evaluation Module: Built-in evaluation script that tests the pipeline against a ground-truth dataset and verifies citation coverage and relevance.
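
The fusion step described above can be sketched in a few lines. This is a minimal illustration, not the code in src/search.py: the function name, the constant k=60 (a commonly used default for RRF), and the cut-off of 20 candidates are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) per chunk (rank starts at 1).
    k=60 and top_n=20 are assumed values, not taken from the repository.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical chunk IDs returned by the dense and sparse retrievers.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Chunks that appear in both lists ("c1" and "c3" here) accumulate two contributions and therefore outrank chunks found by only one retriever.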

📁 Project Structure

rag-project/
├── data/
│   └── belgeler/        # Put your .txt and .pdf files here
├── db/                  # Auto-generated ChromaDB storage and BM25 chunks
├── src/
│   ├── ingest.py        # Reads docs, chunks them, and stores embeddings
│   ├── search.py        # Performs Hybrid Search (ChromaDB + BM25) and RRF fusion
│   ├── rerank.py        # Sends top 20 to Gemini for top-5 reranking
│   ├── generate.py      # Generates the final answer with citations via Gemini
│   └── eval.py          # Runs tests from eval_set.json
├── eval_set.json        # Custom question-answer pairs for evaluation
├── main.py              # Interactive CLI application
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (API Keys)
└── .gitignore

🚀 Getting Started

1. Install Dependencies

Make sure you have Python installed, then install the required packages:

pip install -r requirements.txt

2. Set Up API Keys

This project uses the Google Gemini API for reranking and generating answers.

  1. Get a free API key from Google AI Studio.
  2. Create a .env file in the root directory (already ignored by Git):
    GEMINI_API_KEY=AIzaSy_YOUR_API_KEY_HERE
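
For illustration, a script could read the key from .env along these lines. This is a standard-library sketch under assumptions: the helper names load_env_file and get_gemini_api_key are hypothetical, and the project itself may rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Variables already set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_gemini_api_key(path=".env"):
    """Return GEMINI_API_KEY, failing early with a readable message."""
    load_env_file(path)
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
    return key
```

Failing early with a clear message avoids a less obvious authentication error later in the rerank or generate step.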

3. Ingest Documents

Drop your .txt or .pdf files into the data/belgeler/ folder. Run the ingestion script to chunk the texts, create embeddings, and build the search index:

python src/ingest.py

Note: The first time you run this, it will download the sentence-transformer model (~80MB).
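
The chunking step can be pictured as a sliding window over each document. The sketch below is illustrative only: character-based windows, a chunk size of 500, and an overlap of 50 are assumptions, and src/ingest.py may split text differently.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows.

    chunk_size and overlap are assumed values for the example.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            chunks.append(piece)
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(sample_chunks)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.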

4. Run the RAG Pipeline

Start the interactive CLI to ask questions about your documents:

python main.py

Type your question and watch the pipeline search, rerank, and generate a well-cited answer!
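
To make the citation mechanism concrete, here is one way the reranked chunks could be numbered before being sent to the model. The function name, the chunk dictionary shape ("text" and "source" keys), and the prompt wording are all hypothetical; the actual prompt in src/generate.py is not shown in this README.

```python
def build_cited_prompt(question, chunks):
    """Number each chunk so the model can cite it inline as [1], [2], ...

    `chunks` is assumed to be a list of dicts with "text" and "source" keys.
    Returns the prompt string and a matching list of source labels.
    """
    context_lines = []
    sources = []
    for number, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{number}] ({chunk['source']}) {chunk['text']}")
        sources.append(f"[{number}] {chunk['source']}")
    prompt = (
        "Answer the question using only the numbered sources below. "
        "Cite every claim inline with its source number, e.g. [1].\n\n"
        + "\n".join(context_lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return prompt, sources


example_chunks = [
    {"text": "Transformers process the whole sequence in parallel.", "source": "nlp_temelleri.txt"},
    {"text": "Self-attention captures long-range dependencies.", "source": "nlp_temelleri.txt"},
]
prompt, sources = build_cited_prompt("Why are Transformers faster to train than RNNs?", example_chunks)
print(prompt)
print(sources)
```

Keeping the source list alongside the prompt lets the CLI print a sources footer that matches the bracketed numbers in the answer.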

🧪 Evaluation

To evaluate the system, add questions and their expected keywords to eval_set.json:

[
  {
    "question": "What is machine learning?",
    "expected_keywords": ["learning", "data", "algorithms"]
  }
]

Then run the evaluation script to see how well the pipeline performs (verifying both citations and keywords):

python src/eval.py
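
As a rough picture of what such a check involves, the sketch below scores a single answer on keyword coverage and citation validity. It is an assumption-laden illustration: the function name, the pass criteria (all keywords present, at least one citation, every citation pointing to a real source), and the returned fields are hypothetical and may differ from src/eval.py.

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def evaluate_answer(answer, expected_keywords, num_sources):
    """Check one generated answer against expected keywords and its citations.

    num_sources is the number of chunks that were passed to the generator.
    """
    lowered = answer.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in lowered]
    cited = {int(number) for number in CITATION_PATTERN.findall(answer)}
    valid = {number for number in cited if 1 <= number <= num_sources}
    return {
        # Assumed pass rule: no missing keywords and only valid, non-empty citations.
        "passed": not missing and bool(cited) and cited == valid,
        "missing_keywords": missing,
        "valid_citations": len(valid),
        "total_citations": len(cited),
    }


report = evaluate_answer(
    "Machine learning builds algorithms that learn from data [1][2].",
    ["learning", "data", "algorithms"],
    num_sources=2,
)
print(report)
```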

(Note: If you are using the free tier of the Gemini API, you may occasionally see a 503 UNAVAILABLE error during heavy automated eval tests. Simply wait a minute and retry.)
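
Waiting and retrying can also be automated. The following is a generic exponential-backoff sketch, not the repository's retry logic: the helper name, the attempt count, the delays, and the rule for detecting a retryable error (the text "503" in the exception message) are assumptions.

```python
import random
import time


def call_with_retry(func, max_attempts=4, base_delay=1.0,
                    is_retryable=lambda exc: "503" in str(exc),
                    sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff.

    The retry test and all timing values are assumed for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)
```

A caller would wrap the API request in a zero-argument function, for example call_with_retry(lambda: send_request(prompt)), where send_request is a placeholder for whatever function performs the Gemini call.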

📊 Dense vs Hybrid Search Comparison

We ran a comparison test to demonstrate why Hybrid Search is necessary. When querying "Transformer mimarisinin avantajları nelerdir?" (What are the advantages of the Transformer architecture?):

Only Dense Search (ChromaDB):

  1. nlp_temelleri.txt ✅ (Correctly identifies Transformers)
  2. veri_bilimi.txt ❌ (Completely unrelated text about Data Science metrics)
  3. veri_bilimi.txt ❌ (Unrelated text about Feature Engineering)

Why? Dense search alone can favor chunks whose sentences look semantically similar to the query even when they do not contain its key terms.

Hybrid Search (Dense + BM25 + RRF):

  1. nlp_temelleri.txt ✅ (Correctly identifies Transformers)
  2. nlp_temelleri.txt ✅ (Related context about NLP and Word Embeddings)
  3. veri_bilimi.txt

Why? BM25 matched the exact keywords "Transformer" and "RNN", which boosted the NLP document and gave the LLM richer, more accurate candidates.
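
To see why the sparse side rewards exact term overlap, here is a small self-contained BM25 scorer. It is a textbook variant written for illustration, with assumed parameters k1=1.5 and b=0.75 and a non-negative IDF; the rank_bm25 package used by the project computes IDF slightly differently, and the token lists below are made up.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with BM25."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # Terms absent from the document contribute nothing.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    ["transformer", "self", "attention", "rnn"],
    ["feature", "engineering", "data"],
    ["data", "science", "metrics"],
]
scores = bm25_scores(["transformer", "rnn"], corpus)
print(scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score, whereas an embedding model may still assign the other two a moderate similarity.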

🧪 Eval Results

Metric              Score
Overall Accuracy    80% (4/5)
Citation Coverage   14/14 successful citations
Hybrid vs Dense     BM25 removed 2 irrelevant chunks
Resilience          503 errors handled via retry + fallback

Tested on 3 Turkish documents: Yapay Zeka (Artificial Intelligence), Veri Bilimi (Data Science), and NLP Temelleri (NLP Fundamentals).

🎬 Demo

$ python main.py

==================================================
   RAG Pipeline -- Question Answering
==================================================

❓ QUESTION: What is the advantage of the Transformer architecture over traditional RNN models?

🔍 [1/3] Running hybrid search (Dense + BM25 + RRF)...
   → 7 candidate chunks found.

🎯 [2/3] Reranking (Gemini)...
[RERANK] Sending 7 candidates to Gemini...
[RERANK] Top 5 chunks selected.

✍️  [3/3] Generating answer (Gemini)...
[GENERATE] Answer generated (490 characters).

============================================================
📝 ANSWER:
------------------------------------------------------------
The Transformer architecture offers several critical advantages over RNNs:

* **Parallel Processing:** RNNs process tokens sequentially; the Transformer
  processes the entire sequence at once. This speeds up training dramatically [1].
* **Self-Attention:** Directly captures long-range dependencies between
  all words in a sentence [1].
* **No Memory Loss:** The vanishing gradient problem of RNNs is absent;
  the Transformer overcomes it architecturally [2].

📚 SOURCES: [1] nlp_temelleri.txt  [2] nlp_temelleri.txt
============================================================

[EVAL] Evaluating 5 questions...
[1/5] ✓ PASSED - Citations: 1/1
[2/5] ✓ PASSED - Citations: 5/5
[3/5] ✓ PASSED - Citations: 3/3
[4/5] ✗ FAILED - Citations: 0/0
[5/5] ✓ PASSED - Citations: 5/5
[RESULT] Accuracy: 80.0% (4/5)

🛠 Technologies Used
