This repository presents a comparative analysis of Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG) architectures, focusing on API cost optimization and local deployment capabilities. The implementation is inspired by the paper *Don't Do RAG* by Chan et al.
This case study explores:
- Cost Reduction: Comparing API costs between RAG and CAG approaches
- Local Deployment: CPU-optimized implementation using Ollama
- Multi-Model Analysis: Implementation across different models and platforms
  - Google Gemini: for cloud-based API implementation
  - Llama 3.2 3B (via Ollama): for local CPU-optimized deployment
CAG Implementation:
- Efficient caching mechanism
- Reduced API calls
- Cost-effective knowledge retrieval
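The caching idea can be sketched as a small in-memory store keyed by the knowledge context: the corpus is ingested once, and subsequent questions reuse the cached context instead of triggering fresh API calls. This is only an illustration of the mechanism; the class and method names below are hypothetical, not the repo's actual API.

```python
import hashlib

class KnowledgeCache:
    """Minimal sketch: preload knowledge once, reuse it for every question."""

    def __init__(self):
        self._store = {}   # context hash -> cached knowledge
        self.api_calls = 0

    def _key(self, context: str) -> str:
        return hashlib.sha256(context.encode("utf-8")).hexdigest()

    def load(self, context: str) -> str:
        """Ingest and cache the knowledge context (one-time cost)."""
        key = self._key(context)
        if key not in self._store:
            self.api_calls += 1          # one call to ingest the corpus
            self._store[key] = context   # stand-in for a model KV cache
        return key

    def answer(self, key: str, question: str) -> str:
        """Answer from the cached context without re-sending the corpus."""
        context = self._store[key]
        return f"answer({question!r}) over {len(context)} cached chars"

cache = KnowledgeCache()
key = cache.load("Paris is the capital of France.")
cache.load("Paris is the capital of France.")  # cache hit: no new API call
print(cache.api_calls)  # 1
```

The point of the sketch is the call-count asymmetry: ingestion is paid once per corpus, not once per question.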
Local Optimization:
- CPU-optimized Ollama integration
- Efficient batch processing
- Thread optimization
- Memory-efficient operations
- Install Dependencies:
```bash
pip install -r requirements.txt
```
- Install Ollama (for running the model locally):
```bash
# For Windows, download from https://ollama.ai/download
# For Linux:
curl https://ollama.ai/install.sh | sh
```
- Setup Environment:
```bash
cp .env.template .env
# Add your API keys
```
- Pull Llama Model:
```bash
ollama pull llama3.2
```

Run the CAG benchmark with Ollama:
```bash
python kvcache_ollama.py --dataset hotpotqa-train \
    --similarity bertscore \
    --maxKnowledge 27 \
    --maxQuestion 27 \
    --model "llama3.2" \
    --output "./results/cag_ollama_results.txt"
```

Run the RAG benchmark with Gemini:
```bash
python rag_gemini.py --dataset hotpotqa-train \
    --similarity bertscore \
    --maxKnowledge 120 \
    --maxQuestion 120 \
    --output "./results/rag_gemini_results.txt"
```

Results from different implementations can be found in:
- CAG Results: `CAG/New_Results/CAG/`
- RAG Results: `CAG/New_Results/RAG/`
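The cost comparison largely reduces to call counts: RAG pays an API call per question, while CAG pays a one-time cost to load the knowledge and then answers from the cache. A back-of-envelope estimator makes this concrete; the per-call price and call counts below are illustrative placeholders, not measured values from this repo.

```python
def api_cost(num_questions: int, calls_per_question: float,
             one_time_calls: int, price_per_call: float) -> float:
    """Total API cost = one-time ingestion calls + per-question calls."""
    return (one_time_calls + num_questions * calls_per_question) * price_per_call

PRICE = 0.001  # illustrative $ per API call, not a real provider rate

# RAG: one API call per question; CAG: one ingestion call, then cache hits.
rag_cost = api_cost(120, calls_per_question=1, one_time_calls=0, price_per_call=PRICE)
cag_cost = api_cost(120, calls_per_question=0, one_time_calls=1, price_per_call=PRICE)
print(f"RAG: ${rag_cost:.3f}  CAG: ${cag_cost:.3f}")
```

Under this simplification the gap grows linearly with the number of questions asked over a fixed corpus.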
```
CAG/
├── kvcache_gemini.py    # CAG implementation with Gemini
├── kvcache_ollama.py    # CAG implementation with Ollama
├── rag_gemini.py        # RAG implementation with Gemini
├── rag_groq.py          # RAG implementation with Groq
├── rag_ollama.py        # RAG implementation with Ollama
└── New_Results/         # Performance results and comparisons
```
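The scripts above expose a similar command-line interface (the flag names appear in the usage commands). A sketch of how such a parser could be defined with `argparse`; the defaults and help strings here are assumptions, not the scripts' actual values.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the flags used in the usage examples."""
    parser = argparse.ArgumentParser(description="RAG/CAG benchmark runner")
    parser.add_argument("--dataset", required=True, help="e.g. hotpotqa-train")
    parser.add_argument("--similarity", default="bertscore",
                        help="answer-similarity metric")
    parser.add_argument("--maxKnowledge", type=int, default=27,
                        help="number of knowledge passages to load")
    parser.add_argument("--maxQuestion", type=int, default=27,
                        help="number of questions to evaluate")
    parser.add_argument("--model", default="llama3.2", help="Ollama model tag")
    parser.add_argument("--output", required=True, help="results file path")
    return parser

args = build_parser().parse_args(
    ["--dataset", "hotpotqa-train", "--maxKnowledge", "27",
     "--output", "./results/cag_ollama_results.txt"]
)
print(args.maxKnowledge, args.model)
```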
This implementation builds upon the work presented in:
```bibtex
@misc{chan2024dontragcacheaugmentedgeneration,
      title={Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks},
      author={Brian J Chan and Chao-Ting Chen and Jui-Hung Cheng and Hen-Hsen Huang},
      year={2024},
      eprint={2412.15605},
      archivePrefix={arXiv}
}
```

This project is licensed under the MIT License.