A research benchmark that compares Retrieval-Augmented Generation (RAG) techniques for obstetric medical question answering. The project evaluates multiple retrieval strategies and model backends with RAGAS metrics, producing reproducible JSON reports for side-by-side analysis.
> [!IMPORTANT]
> Make sure the following requirements are in place before you start:
>
> - Python 3.8+
> - `OPENAI_API_KEY` in your `.env` file
> - Embeddings initialized with `python scripts/create_embeddings.py`
> [!TIP]
> For reproducible comparisons, keep retrieval defaults aligned across architectures (collection `guia_embarazo_parto`, embedding model `text-embedding-3-small`, and retrieval `k=5`).
> [!NOTE]
> Final answers are intentionally generated in Spanish to match the medical domain use case.
Comparative benchmark of obstetric RAG pipelines across retrieval strategies and language models.
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Architecture at a Glance](#architecture-at-a-glance)
- [RAG Variants](#rag-variants)
- [Evaluation](#evaluation)
- [Results](#results)
- [Configuration Defaults](#configuration-defaults)
- [Project Structure](#project-structure)
- [Extend the Benchmark](#extend-the-benchmark)
## Overview

This repository provides an end-to-end pipeline for:
- building a Chroma vector index from obstetric guidance documents,
- running multiple RAG implementations,
- evaluating quality with RAGAS,
- comparing outputs across model providers.
It is designed for research iteration: same data, same metrics, different retrieval and generation strategies.
## Quick Start

```bash
git clone https://github.com/NicolasHoyosDevs/RAG-Benchmark.git
cd RAG-Benchmark
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create `.env` in the repository root:

```
OPENAI_API_KEY=your_openai_api_key_here
```

Initialize the embeddings:

```bash
python scripts/create_embeddings.py
```

Run an evaluation:

```bash
python scripts/run_evaluation.py hybrid
```

## Architecture at a Glance

```mermaid
flowchart LR
U[Research User] --> Q[Clinical Question]
subgraph P["RAG Pipeline"]
Q --> R1[Query Transform]
R1 --> R2[Retriever]
R2 --> C[(ChromaDB and JSON Chunks)]
C --> R3[Top-K Context]
R3 --> R4[Generator LLM or SLM]
R4 --> A[Spanish Answer]
end
A --> E[Evaluation Orchestrator]
subgraph M["RAGAS"]
E --> M1[Faithfulness]
E --> M2[Answer Relevancy]
E --> M3[Context Precision]
E --> M4[Context Recall]
end
M1 --> O[Benchmark Reports]
M2 --> O
M3 --> O
M4 --> O
```
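The flow above maps to a handful of calls. The following is a minimal, standalone sketch of the retrieve-then-generate path, assuming a persisted Chroma index under `data/`; the index path, prompt, and generator model are illustrative, not the repository's code:

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer(question: str, k: int = 5) -> str:
    # Embed the clinical question with the benchmark's default embedding model.
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # Retrieve the top-k chunks from the persistent collection.
    chroma = chromadb.PersistentClient(path="data/embeddings")  # illustrative path
    collection = chroma.get_collection("guia_embarazo_parto")
    hits = collection.query(query_embeddings=[emb], n_results=k)
    context = "\n\n".join(hits["documents"][0])

    # Generate the final answer in Spanish, grounded in the retrieved context.
    chat = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Responde en español usando solo el contexto."},
            {"role": "user", "content": f"Contexto:\n{context}\n\nPregunta: {question}"},
        ],
    )
    return chat.choices[0].message.content
```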
## RAG Variants

The benchmark currently compares four retrieval strategies:
- Simple Semantic: dense vector similarity retrieval only.
- Hybrid: dense retrieval plus sparse lexical retrieval (BM25); a fusion sketch follows this list.
- HyDE: generates a hypothetical answer and retrieves from that representation.
- Query Rewriter: produces multiple query variants and aggregates retrieval.
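To make the Hybrid variant concrete, here is a minimal sketch of one common way to fuse a dense ranking with a BM25 ranking, Reciprocal Rank Fusion. It illustrates the technique under assumed inputs (lists of document IDs); it is not the repository's implementation:

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) across the lists; k=60 is the
    constant from the original RRF paper and damps the head of each list.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse a dense (vector) ranking with a sparse (BM25) ranking.
dense = ["chunk_12", "chunk_03", "chunk_45"]
sparse = ["chunk_03", "chunk_45", "chunk_07"]
print(reciprocal_rank_fusion([dense, sparse])[:5])
# chunk_03 ranks first because it sits near the top of both lists.
```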
## Evaluation

RAG quality is assessed with RAGAS using four core metrics (a standalone usage sketch follows the list):
- Faithfulness
- Answer Relevancy
- Context Precision
- Context Recall
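For orientation, this is roughly how those four metrics are computed with the RAGAS library on a single record. It is a minimal sketch assuming the classic `evaluate`/`Dataset` API (metric names and availability vary across RAGAS versions) and invented sample text, not the repository's orchestrator:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation record; `contexts` is the list of retrieved chunks.
data = {
    "question": ["¿Cuándo se recomienda el monitoreo fetal continuo?"],
    "answer": ["Se recomienda en embarazos de alto riesgo..."],
    "contexts": [["El monitoreo fetal continuo está indicado en..."]],
    "ground_truth": ["En embarazos de alto riesgo durante el trabajo de parto."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores in [0, 1]
```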
Run modes:
```bash
# Single architecture
python scripts/run_evaluation.py simple
python scripts/run_evaluation.py hybrid
python scripts/run_evaluation.py hyde
python scripts/run_evaluation.py rewriter
# Multi-model for one architecture
python scripts/run_evaluation.py multi-model hybrid
# Full matrix (all models x all RAGs)
python scripts/run_evaluation.py all-models-all-rags
```

> [!WARNING]
> Evaluation runs can trigger a large number of API calls and incur real costs. Prefer targeted runs while iterating.
## Results

Outputs are saved in `results/` as timestamped JSON files, for example:
- `ragas_evaluation_hybrid_YYYYMMDD_HHMMSS.json`
- `ragas_comprehensive_all_rags_all_models_YYYYMMDD_HHMMSS.json`
Typical structure:
```json
{
"metadata": {
"rag_type": "hybrid",
"model_used": "gpt-4o",
"timestamp": "20260311_110842"
},
"rag_results": {
"faithfulness": 0.85,
"answer_relevancy": 0.78,
"context_precision": 0.92,
"context_recall": 0.76
}
}
```
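Because every report shares this shape, side-by-side comparison is a small scripting exercise. A minimal sketch that tabulates the per-architecture reports (the glob pattern and keys follow the structure shown above):

```python
import glob
import json

# Collect per-architecture reports and keep one row per run.
rows = []
for path in sorted(glob.glob("results/ragas_evaluation_*.json")):
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    meta, scores = report["metadata"], report["rag_results"]
    rows.append((meta["rag_type"], meta["model_used"], scores))

# Print a fixed-width comparison table.
header = ("rag", "model", "faith", "ans_rel", "ctx_prec", "ctx_rec")
print("{:<10} {:<12} {:>6} {:>8} {:>9} {:>8}".format(*header))
for rag, model, s in rows:
    print("{:<10} {:<12} {:>6.2f} {:>8.2f} {:>9.2f} {:>8.2f}".format(
        rag, model,
        s["faithfulness"], s["answer_relevancy"],
        s["context_precision"], s["context_recall"],
    ))
```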
## Configuration Defaults

- Vector DB: ChromaDB (persistent)
- Embedding model: `text-embedding-3-small`
- Collection: `guia_embarazo_parto`
- Typical retrieval depth: `k=5`
## Project Structure

```
RAG-Benchmark/
├── config/          # Runtime and pricing configs
├── data/            # Corpus, chunks, and persistent embeddings
├── docs/            # Architecture notes and guides
├── public/          # Banner and static assets
├── results/         # Evaluation JSON outputs
├── scripts/         # Embedding/evaluation CLI entrypoints
├── src/
│   ├── common/      # Shared providers and utilities
│   ├── evaluation/  # RAGAS orchestration
│   └── rag/         # RAG implementations
└── tests/           # Project tests
```
## Extend the Benchmark

Add a new RAG variant (a skeleton sketch follows these steps):

- Add a module in `src/rag/`.
- Implement `query_for_evaluation(...)` returning at least: `question`, `answer`, `contexts`, and `metadata`.
- Register the variant in `src/evaluation/ragas_evaluator.py`.
- Run targeted and comparative evaluations.
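A skeleton for such a module might look like the following; the class name, retriever and generator wiring are illustrative (only the `query_for_evaluation` contract above comes from the project):

```python
# src/rag/my_variant.py (illustrative module name)


class MyVariantRAG:
    """Skeleton of a new retrieval strategy for the benchmark."""

    def __init__(self, retriever, generator):
        self.retriever = retriever  # e.g. a Chroma-backed retriever
        self.generator = generator  # e.g. an OpenAI chat model wrapper

    def query_for_evaluation(self, question: str) -> dict:
        # 1. Retrieve: any strategy-specific query transformation happens here.
        contexts = self.retriever.retrieve(question)  # hypothetical interface

        # 2. Generate a Spanish answer grounded in the retrieved chunks.
        answer = self.generator.generate(question, contexts)  # hypothetical

        # 3. Return the fields the RAGAS evaluator expects, at minimum.
        return {
            "question": question,
            "answer": answer,
            "contexts": contexts,
            "metadata": {"rag_type": "my_variant"},
        }
```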
Add a new model (a provider sketch follows these steps):

- Add model support in `src/common/model_provider.py`.
- Verify compatibility with the evaluation and pricing flow.
- Run a `multi-model` evaluation for regression comparison.
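If the provider follows a simple name-to-client factory pattern, adding a model could be a single registry entry, as in this sketch (the factory shape is an assumption; only the file path comes from the project):

```python
# src/common/model_provider.py (sketch, assuming a factory-style provider)
from langchain_openai import ChatOpenAI

_MODEL_REGISTRY = {
    "gpt-4o": lambda: ChatOpenAI(model="gpt-4o", temperature=0),
    # New model: one extra entry, picked up by multi-model runs.
    "gpt-4o-mini": lambda: ChatOpenAI(model="gpt-4o-mini", temperature=0),
}


def get_model(name: str):
    try:
        return _MODEL_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown model: {name!r}") from None
```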
For the contribution workflow and licensing details, see the repository's dedicated files.
