Fast, efficient, minimal, extensible, and elegant RAG system
RocketRAG is a high-performance Retrieval-Augmented Generation (RAG) system designed with a focus on speed, simplicity, and extensibility. Built on top of state-of-the-art libraries, it provides both CLI and web server capabilities for seamless integration into any workflow.
Demo video: rocketrag.mp4
RocketRAG aims to be the fastest and most efficient RAG library while maintaining:
- Minimal footprint - Clean, lightweight codebase
- Maximum extensibility - Pluggable architecture for all components
- Peak performance - Leveraging the best-in-class libraries
- Ease of use - Simple CLI and API interfaces
RocketRAG is built on top of cutting-edge, performance-optimized libraries:
- Chonkie - Ultra-fast semantic chunking with model2vec
- Kreuzberg - Lightning-fast document loading and processing
- llama-cpp-python - Optimized LLM inference with GGUF support
- Milvus Lite - High-performance vector database
- Sentence Transformers - State-of-the-art embeddings
```bash
pip install rocketrag
```

```bash
# Run directly without installation
uvx rocketrag --help

# Or install globally
uv tool install rocketrag
```

```python
from rocketrag import RocketRAG
rag = RocketRAG("./data")  # Path to your data (supports PDF, TXT, MD, etc.)
rag.prepare() # Construct vector database
# Ask questions
answer, sources = rag.ask("What is the main topic of the documents?")
print(answer)
```

```bash
# Prepare documents from a directory
rocketrag prepare --data-dir ./documents
# Ask questions via CLI
rocketrag ask "What are the key findings?"
# Start web server
rocketrag server --port 8000
```

```bash
# Same commands work with uvx
uvx rocketrag prepare --data-dir ./documents
uvx rocketrag ask "What are the key findings?"
uvx rocketrag server --port 8000
# Run as module
uvx --from rocketrag python -m rocketrag --help
```

RocketRAG follows a modular, plugin-based architecture:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Document     │     │    Chunking     │     │  Vectorization  │
│    Loaders      │────▶│    (Chonkie)    │────▶│ (SentenceTransf)│
│   (Kreuzberg)   │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
┌─────────────────┐     ┌─────────────────┐              │
│       LLM       │     │    Vector DB    │◀─────────────┘
│ (llama-cpp-py)  │◀────│  (Milvus Lite)  │
│                 │     │                 │
└─────────────────┘     └─────────────────┘
```
- BaseLoader: Pluggable document loading (PDF, TXT, MD, etc.); see the example after this list
- BaseChunker: Configurable chunking strategies (semantic, recursive, etc.)
- BaseVectorizer: Flexible embedding models
- BaseLLM: Swappable language models
- MilvusLiteDB: High-performance vector storage and retrieval
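Any of these can be swapped out by subclassing the corresponding base class. The sketch below is illustrative rather than library code: only the `rocketrag.loaders` module and the `BaseLoader` name come from this README, while the `load` method name and its yield-of-`(text, metadata)` contract are assumptions.

```python
from pathlib import Path

from rocketrag.loaders import BaseLoader


class PlainTextLoader(BaseLoader):
    """Illustrative loader that reads every .txt file under a directory."""

    def load(self, data_dir: str):
        # Assumed contract: yield (text, metadata) pairs, one per document
        for path in sorted(Path(data_dir).glob("**/*.txt")):
            yield path.read_text(encoding="utf-8"), {"source": str(path)}
```

A fully customized pipeline is then assembled by passing component instances to RocketRAG: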
```python
from rocketrag import RocketRAG
from rocketrag.vectors import SentenceTransformersVectorizer
from rocketrag.chonk import ChonkieChunker
from rocketrag.llm import LLamaLLM
from rocketrag.loaders import KreuzbergLoader
# Configure high-performance components
vectorizer = SentenceTransformersVectorizer(
model_name="minishlab/potion-multilingual-128M" # Fast multilingual model
)
chunker = ChonkieChunker(
method="semantic", # Semantic chunking for better context
embedding_model="minishlab/potion-multilingual-128M",
chunk_size=512
)
llm = LLamaLLM(
repo_id="unsloth/gemma-3n-E2B-it-GGUF",
filename="*Q8_0.gguf" # Quantized for speed
)
loader = KreuzbergLoader() # Ultra-fast document processing
rag = RocketRAG(
vectorizer=vectorizer,
chunker=chunker,
llm=llm,
loader=loader
)
```

```bash
# Custom chunking strategy
rocketrag prepare \
--chonker chonkie \
--chonker-args '{"method": "semantic", "chunk_size": 512}' \
--vectorizer-args '{"model_name": "all-MiniLM-L6-v2"}'
# Custom LLM for inference
rocketrag ask "Your question" \
--repo-id "microsoft/DialoGPT-medium" \
--filename "*.gguf"
```

RocketRAG includes a FastAPI-based web server with OpenAI-compatible endpoints:
```bash
# Start server
rocketrag server --port 8000 --host 0.0.0.0
```

- `GET /` - Interactive web interface
- `POST /ask` - Question answering
- `POST /ask/stream` - Streaming responses (see the client sketch below)
- `GET /chat` - Chat interface
- `GET /browse` - Document browser
- `GET /visualize` - Vector visualization
- `GET /health` - Health check
```python
import requests
response = requests.post(
"http://localhost:8000/ask",
json={"question": "What are the main findings?"}
)
result = response.json()
print(result["answer"])
print(result["sources"])- β‘ Ultra-fast document processing with Kreuzberg
- Ultra-fast document processing with Kreuzberg
- Semantic chunking with Chonkie and model2vec
- High-performance vector search with Milvus Lite
- Optimized LLM inference with llama-cpp-python
- Rich CLI interface with progress bars and formatting
- Web server with interactive UI
- Pluggable architecture for easy customization
- Vector visualization for debugging and analysis
- Document browsing interface
- Streaming responses for real-time interaction
- Batch processing for large document sets
- Metadata preservation throughout the pipeline
- Context-aware chunking for better retrieval
```bash
git clone https://github.com/yourusername/rocketrag.git
cd rocketrag
pip install -e ".[dev]"
```

```bash
pytest tests/
```

```bash
ruff check .
ruff format .
```

RocketRAG is designed for speed:
- Document Loading: 10x faster with Kreuzberg's optimized parsers
- Chunking: Semantic chunking with model2vec for superior context preservation
- Vectorization: Optimized batch processing with sentence-transformers
- Retrieval: Sub-millisecond vector search with Milvus Lite
- Generation: GGUF quantization for 4x faster inference
We welcome contributions! RocketRAG's modular architecture makes it easy to:
- Add new document loaders
- Implement custom chunking strategies (see the sketch after this list)
- Integrate different embedding models
- Support additional LLM backends
- Enhance the web interface
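For instance, a custom chunking strategy would plug in as a `BaseChunker` subclass. A rough sketch: the `rocketrag.chonk` module and the `BaseChunker` name come from this README, while the `chunk(text) -> list[str]` contract and everything else here are hypothetical.

```python
from rocketrag.chonk import BaseChunker


class FixedWindowChunker(BaseChunker):
    """Hypothetical chunker: fixed-size character windows with overlap."""

    def __init__(self, chunk_size: int = 512, overlap: int = 64):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str) -> list[str]:
        # Assumed contract: take one document's text, return chunk strings
        step = self.chunk_size - self.overlap
        return [text[i : i + self.chunk_size] for i in range(0, len(text), step)]
```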
RocketRAG builds upon the excellent work of:
- Chonkie for semantic chunking
- Kreuzberg for document processing
- llama-cpp-python for LLM inference
- Milvus for vector storage
- Sentence Transformers for embeddings