https://arxiv.org/html/2407.12101v2
A production-ready implementation of the Dartboard RAG algorithm that addresses redundancy in document retrieval by optimizing both relevance and diversity.
The Dartboard RAG process addresses a common challenge in large knowledge bases: ensuring the retrieved information is both relevant and non-redundant. By explicitly optimizing a combined relevance-diversity scoring function, it prevents multiple documents from offering the same information.
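The core idea can be illustrated with a minimal greedy sketch. This is not the paper's exact estimator (which works with information gain over smoothed probability distributions), and the names `dartboard_select`, `query_dists`, and `doc_dists` are illustrative, not part of this codebase:

```python
import numpy as np

def dartboard_select(query_dists, doc_dists, k,
                     relevance_weight=1.0, diversity_weight=1.0):
    """Greedily pick k documents, balancing closeness to the query
    (relevance) against distance from already-selected documents
    (diversity). query_dists: (n,) query-document distances;
    doc_dists: (n, n) pairwise document distances."""
    selected = [int(np.argmin(query_dists))]  # seed with the most relevant doc
    while len(selected) < k:
        best, best_score = -1, -np.inf
        for i in range(len(query_dists)):
            if i in selected:
                continue
            relevance = -query_dists[i]               # closer to the query is better
            diversity = doc_dists[i, selected].min()  # distance to the nearest pick
            score = relevance_weight * relevance + diversity_weight * diversity
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

Note how a candidate that is nearly identical to an already-selected document gets a diversity score near zero, so a slightly less relevant but novel document can win instead.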
This implementation is based on the paper: "Better RAG using Relevant Information Gain"
- Relevance & Diversity Balance: Combines document relevance to the query with diversity among selected documents
- Configurable Weights: Adjustable `RELEVANCE_WEIGHT` and `DIVERSITY_WEIGHT` for dynamic control
- Production Ready: Clean, modular code design for easy integration
- Multiple Retrieval Modes: Support for simple top-k and advanced dartboard retrieval
- Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables: create a `.env` file with your OpenAI API key:

```
OPENAI_API_KEY=your_api_key_here
```

Run the demo:

```bash
python main.py
```

This will:
- Download sample data
- Create a vector store
- Demonstrate both simple and dartboard retrieval
- Show the effects of different weight configurations
```python
from ingestion import DocumentIngestion

# Initialize ingestion system
ingestion = DocumentIngestion()

# Create vector store from PDF
vector_store = ingestion.encode_pdf(
    path="path/to/your/document.pdf",
    chunk_size=1000,
    chunk_overlap=200,
    density_multiplier=1  # Increase to simulate dense datasets
)

# Save for later use
ingestion.save_vector_store(vector_store, "my_vector_store")
```

```python
from retrieval import DartboardRetrieval
from ingestion import DocumentIngestion

# Load existing vector store
ingestion = DocumentIngestion()
vector_store = ingestion.load_vector_store("my_vector_store")

# Initialize retrieval with custom weights
retrieval = DartboardRetrieval(
    vector_store=vector_store,
    diversity_weight=1.0,
    relevance_weight=1.0,
    sigma=0.1
)

# Perform dartboard retrieval
query = "What is climate change?"
texts, scores = retrieval.get_context_with_dartboard(
    query=query,
    num_results=5,
    oversampling_factor=3
)

# Compare with simple retrieval
retrieval.compare_retrievals(query, k=5)
```

- Document Retrieval: Initial candidate selection using similarity search
- Distance Calculation: Compute distances between query-documents and document-document pairs
- Dartboard Selection: Iteratively select documents balancing relevance and diversity
- Score Combination:

```
combined_score = diversity_weight * diversity + relevance_weight * relevance
```

Key parameters:

- `diversity_weight`: Controls the importance of diversity (default: 1.0)
- `relevance_weight`: Controls the importance of relevance (default: 1.0)
- `sigma`: Smoothing parameter for probability conversion (default: 0.1)
- `oversampling_factor`: Multiplier for initial candidate retrieval (default: 3)
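The role of `sigma` can be illustrated with a small sketch of distance-to-probability conversion. This is an illustrative Gaussian-kernel version, not necessarily the exact conversion used in the code:

```python
import numpy as np

def distances_to_logprobs(dists, sigma=0.1):
    # Map distances to log-probabilities with a Gaussian kernel and a
    # log-softmax. Smaller sigma sharpens the distribution, concentrating
    # probability mass on the closest documents; larger sigma flattens it.
    logits = -np.asarray(dists, dtype=float) ** 2 / (2 * sigma ** 2)
    return logits - np.logaddexp.reduce(logits)

probs_sharp = np.exp(distances_to_logprobs([0.1, 0.2], sigma=0.1))
probs_flat = np.exp(distances_to_logprobs([0.1, 0.2], sigma=0.5))
```

With the default `sigma=0.1`, the closest document dominates; raising `sigma` spreads probability more evenly across candidates.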
- Dense Knowledge Bases: When documents contain overlapping information
- Comprehensive Answers: When you need diverse perspectives on a topic
- Avoiding Echo Chambers: When simple top-k retrieval returns repetitive content
```
Rag/dartboard/
├── requirements.txt   # Python dependencies
├── ingestion.py       # Document processing and vector store creation
├── retrieval.py       # Dartboard retrieval algorithm
├── main.py            # Complete workflow demonstration
└── README.md          # This file
```
```python
# Favor diversity over relevance
retrieval.update_weights(
    diversity_weight=3.0,
    relevance_weight=1.0,
    sigma=0.1
)
```

```python
# Favor relevance over diversity
retrieval.update_weights(
    diversity_weight=1.0,
    relevance_weight=3.0,
    sigma=0.1
)
```

```python
# Balanced weights with extra smoothing
retrieval.update_weights(
    diversity_weight=1.5,
    relevance_weight=1.5,
    sigma=0.15
)
```

- Oversampling Factor: Higher values provide better diversity but increase computation
- Vector Store Size: Larger stores benefit more from dartboard selection
- Query Complexity: Complex queries may benefit from higher relevance weights
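The oversampling trade-off can be sketched with stand-in functions; `search` and `select` below are hypothetical placeholders for the similarity-search and dartboard-selection steps, not this codebase's API:

```python
def retrieve_with_oversampling(search, select, query, k, oversampling_factor=3):
    # Fetch a wider candidate pool than needed, then let the diversity-aware
    # selector prune it. A larger factor widens the pool (better diversity)
    # at the cost of more pairwise-distance computations.
    candidates = search(query, k * oversampling_factor)
    return select(candidates, k)

# Toy stand-ins: docs pre-ranked by similarity, and a selector that truncates.
docs = [f"doc{i}" for i in range(30)]
search = lambda q, n: docs[:n]
select = lambda cands, k: cands[:k]
results = retrieve_with_oversampling(search, select, "query", k=5)
```

With `k=5` and the default factor of 3, the dartboard step sees 15 candidates but returns only 5.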
The dartboard retrieval can be easily integrated with:
- Hybrid Retrieval: Combine dense and sparse (BM25) similarities
- Cross-Encoders: Use cross-encoder scores directly
- Custom Embeddings: Replace OpenAI embeddings with any embedding provider
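For the hybrid option, one simple blending scheme is to min-max normalize each score list and take a weighted sum. The helper `hybrid_scores` and its `alpha` parameter are illustrative assumptions, not part of this repository:

```python
import numpy as np

def hybrid_scores(dense_scores, sparse_scores, alpha=0.5):
    # Min-max normalize so dense (embedding) and sparse (BM25) scores live
    # on the same [0, 1] scale, then blend: alpha weights the dense side,
    # 1 - alpha the sparse side.
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(dense_scores) + (1 - alpha) * norm(sparse_scores)
```

The blended scores can then be fed into the dartboard selection step in place of raw embedding distances.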
- "Vector store not found": Run ingestion first or check the save path
- OpenAI API errors: Verify your API key in the `.env` file
- Memory issues: Reduce `oversampling_factor` or use smaller chunks
- Pre-compute and save vector stores for large documents
- Adjust chunk size based on your document type
- Use an appropriate `density_multiplier` for testing vs. production
Feel free to submit issues and pull requests. This implementation aims to be production-ready while maintaining clarity and ease of use.
- Original paper: "Better RAG using Relevant Information Gain"
- Based on the official implementation but reorganized for production use
- LangChain integration for document processing and embeddings
This implementation is provided as-is for educational and commercial use.