
Obstetric RAG Research


A research benchmark that compares Retrieval-Augmented Generation (RAG) techniques for obstetric medical question answering. The project evaluates multiple retrieval strategies and model backends with RAGAS metrics, producing reproducible JSON reports for side-by-side analysis.

Important

Make sure the following requirements are configured before you begin.

  • Python 3.8+
  • OPENAI_API_KEY in your .env file
  • Embeddings initialized with python scripts/create_embeddings.py

Tip

For reproducible comparisons, keep retrieval defaults aligned across architectures (collection guia_embarazo_parto, embedding model text-embedding-3-small, and retrieval k=5).

Note

Final answers are intentionally generated in Spanish to match the medical domain use case.


Comparative benchmark of obstetric RAG pipelines across retrieval strategies and language models.


Overview

This repository provides an end-to-end pipeline for:

  • building a Chroma vector index from obstetric guidance documents,
  • running multiple RAG implementations,
  • evaluating quality with RAGAS,
  • comparing outputs across model providers.

It is designed for research iteration: same data, same metrics, different retrieval and generation strategies.

Quick Start

1) Install

git clone https://github.com/NicolasHoyosDevs/RAG-Benchmark.git
cd RAG-Benchmark
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2) Configure

Create .env in the repository root:

OPENAI_API_KEY=your_openai_api_key_here
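
The scripts are expected to read this key from the environment. A minimal sanity check, assuming python-dotenv is installed (a common companion to .env files, not confirmed by this README):

# check_env.py: verify the key is visible to Python (assumes python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
print("OPENAI_API_KEY loaded.")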

3) Build Embeddings

python scripts/create_embeddings.py
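
For orientation, here is a rough sketch of what an embedding-build step like this typically does, using the defaults listed under Configuration Defaults; the chunk contents, module choices, and persist path are illustrative assumptions, not the repository's actual code:

# embeddings_sketch.py: illustrative only; the real logic lives in scripts/create_embeddings.py
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Hypothetical chunks; the repository keeps its own JSON chunks under data/
chunks = [
    "El trabajo de parto se divide en tres etapas...",
    "La preeclampsia se caracteriza por hipertensión y proteinuria...",
]

Chroma.from_texts(
    texts=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="guia_embarazo_parto",
    persist_directory="data/chroma",  # assumed location of the persistent index
)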

4) Run a First Evaluation

python scripts/run_evaluation.py hybrid

Architecture at a Glance

flowchart LR
  U[Research User] --> Q[Clinical Question]

  subgraph P["RAG Pipeline"]
    Q --> R1[Query Transform]
    R1 --> R2[Retriever]
    R2 --> C[(ChromaDB and JSON Chunks)]
    C --> R3[Top-K Context]
    R3 --> R4[Generator LLM or SLM]
    R4 --> A[Spanish Answer]
  end

  A --> E[Evaluation Orchestrator]

  subgraph M["RAGAS"]
    E --> M1[Faithfulness]
    E --> M2[Answer Relevancy]
    E --> M3[Context Precision]
    E --> M4[Context Recall]
  end

  M1 --> O[Benchmark Reports]
  M2 --> O
  M3 --> O
  M4 --> O

RAG Variants

The benchmark currently compares four retrieval strategies:

  1. Simple Semantic: dense vector similarity retrieval only.
  2. Hybrid: dense retrieval plus sparse lexical retrieval (BM25); see the sketch after this list.
  3. HyDE: generates a hypothetical answer and retrieves from that representation.
  4. Query Rewriter: produces multiple query variants and aggregates retrieval.
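
To make the hybrid variant concrete, here is a minimal sketch of dense-plus-BM25 retrieval using LangChain's ensemble retriever; the weights, persist path, and corpus are assumptions, not the project's actual settings:

# hybrid_sketch.py: illustrative dense + BM25 ensemble, not the repo's implementation
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

dense = Chroma(
    collection_name="guia_embarazo_parto",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="data/chroma",  # assumed path
).as_retriever(search_kwargs={"k": 5})

corpus = ["chunk de ejemplo 1", "chunk de ejemplo 2"]  # in practice, the JSON chunks
sparse = BM25Retriever.from_texts(corpus)
sparse.k = 5

hybrid = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.5, 0.5])
docs = hybrid.invoke("¿Cuáles son los signos de alarma durante el embarazo?")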

Evaluation

RAG quality is assessed with RAGAS using four core metrics:

  • Faithfulness
  • Answer Relevancy
  • Context Precision
  • Context Recall
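
For reference, a minimal RAGAS invocation over a single example, assuming the classic ragas evaluate/metrics API together with Hugging Face datasets; the repository's own orchestration lives in src/evaluation/:

# ragas_sketch.py: minimal metric scoring, not the project's orchestrator
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

data = Dataset.from_dict({
    "question": ["¿Qué es la preeclampsia?"],
    "answer": ["Es un trastorno hipertensivo del embarazo..."],
    "contexts": [["La preeclampsia se define como hipertensión de novo..."]],
    "ground_truth": ["Trastorno hipertensivo que aparece después de la semana 20..."],
})

scores = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric averages in [0, 1]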

Run modes:

# Single architecture
python scripts/run_evaluation.py simple
python scripts/run_evaluation.py hybrid
python scripts/run_evaluation.py hyde
python scripts/run_evaluation.py rewriter

# Multi-model for one architecture
python scripts/run_evaluation.py multi-model hybrid

# Full matrix (all models x all RAGs)
python scripts/run_evaluation.py all-models-all-rags

Warning

Evaluation runs can trigger many API calls and incur real costs. Prefer targeted runs while iterating.

Results

Outputs are saved in results/ as timestamped JSON files, for example:

  • ragas_evaluation_hybrid_YYYYMMDD_HHMMSS.json
  • ragas_comprehensive_all_rags_all_models_YYYYMMDD_HHMMSS.json

Typical structure:

{
  "metadata": {
    "rag_type": "hybrid",
    "model_used": "gpt-4o",
    "timestamp": "20260311_110842"
  },
  "rag_results": {
    "faithfulness": 0.85,
    "answer_relevancy": 0.78,
    "context_precision": 0.92,
    "context_recall": 0.76
  }
}
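
A small helper for skimming saved reports side by side; the glob pattern follows the filenames above, and the key layout follows the typical structure shown:

# compare_results.py: print one row per saved evaluation report
import glob
import json

for path in sorted(glob.glob("results/ragas_evaluation_*.json")):
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    meta = report.get("metadata", {})
    metrics = report.get("rag_results", {})
    print(meta.get("rag_type"), meta.get("model_used"), metrics)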

Configuration Defaults

  • Vector DB: ChromaDB (persistent)
  • Embedding model: text-embedding-3-small
  • Collection: guia_embarazo_parto
  • Typical retrieval depth: k=5

Project Structure

RAG-Benchmark/
├── config/            # Runtime and pricing configs
├── data/              # Corpus, chunks, and persistent embeddings
├── docs/              # Architecture notes and guides
├── public/            # Banner and static assets
├── results/           # Evaluation JSON outputs
├── scripts/           # Embedding/evaluation CLI entrypoints
├── src/
│   ├── common/        # Shared providers and utilities
│   ├── evaluation/    # RAGAS orchestration
│   └── rag/           # RAG implementations
└── tests/             # Project tests

Extend the Benchmark

Add a new RAG variant

  1. Add a module in src/rag/.
  2. Implement query_for_evaluation(...) returning at least the following (see the skeleton after this list):
    • question
    • answer
    • contexts
    • metadata
  3. Register the variant in src/evaluation/ragas_evaluator.py.
  4. Run targeted and comparative evaluations.
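
A skeleton of the expected interface; the class name and method body are illustrative, so match whatever signature src/evaluation/ragas_evaluator.py actually consumes:

# src/rag/my_variant.py: illustrative skeleton for a new RAG variant
class MyVariantRAG:
    def query_for_evaluation(self, question: str) -> dict:
        # Retrieve top-k contexts for the question (implementation-specific).
        contexts = ["retrieved chunk 1", "retrieved chunk 2"]
        # Generate the Spanish answer from the retrieved contexts.
        answer = "Respuesta generada a partir del contexto recuperado."
        return {
            "question": question,
            "answer": answer,
            "contexts": contexts,
            "metadata": {"rag_type": "my_variant"},
        }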

Add a new model backend

  1. Add model support in src/common/model_provider.py (a hypothetical dispatch sketch follows this list).
  2. Verify compatibility with evaluation and pricing flow.
  3. Run multi-model evaluation for regression comparison.
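
The provider module itself is not documented in this README, but a name-to-factory dispatch along these lines is a common pattern; get_model and the registry below are hypothetical, not the repository's API:

# provider_sketch.py: hypothetical model dispatch, not src/common/model_provider.py
from langchain_openai import ChatOpenAI

_MODELS = {
    "gpt-4o": lambda: ChatOpenAI(model="gpt-4o", temperature=0),
    "gpt-4o-mini": lambda: ChatOpenAI(model="gpt-4o-mini", temperature=0),
}

def get_model(name: str):
    # New backends register a factory here and become visible to multi-model runs.
    try:
        return _MODELS[name]()
    except KeyError:
        raise ValueError(f"Unknown model backend: {name}") from None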

For the contribution workflow and licensing details, refer to the repository's dedicated files.
