- Document Relevancy Validation - 1 call per document (typically 1-5 calls per query)
  - Most expensive: ~400-500 tokens per validation
  - Total: ~500-2500 tokens per query
- Response Generation - 1 call per query
  - Cost: ~1000-2000 tokens (input + output)
- Reranking - 1 call per query (when docs found)
  - Cost: ~500-1000 tokens
- Query Transformation - 0-2 calls (during iterative retrieval)
  - Cost: ~200-400 tokens per call
- Embeddings - 1 call per query
  - Cost: ~100-200 tokens (relatively cheap)
Total estimated tokens per query: 2200-5400 tokens
Assuming:
- 1000 queries/day
- Average 3000 tokens/query
- Gemini pricing: $0.075 per 1M input tokens, $0.30 per 1M output tokens
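Current daily cost: up to $0.90/day ($27/month). This is an upper bound that prices all 3M daily tokens at the output rate; at the input rate the same volume would cost about $0.23/day.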
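The arithmetic behind this estimate can be checked with a short sketch. It uses only the figures listed in the assumptions above; the function name `daily_cost` and the choice to bill all tokens at a single rate are illustrative simplifications, since the plan does not give an input/output split.

```python
# Figures taken from the assumptions above; prices are USD per 1M tokens.
QUERIES_PER_DAY = 1000
TOKENS_PER_QUERY = 3000
INPUT_PRICE = 0.075
OUTPUT_PRICE = 0.30


def daily_cost(price_per_million: float) -> float:
    """Cost of one day's tokens if all of them were billed at a single rate."""
    tokens_per_day = QUERIES_PER_DAY * TOKENS_PER_QUERY
    return tokens_per_day / 1_000_000 * price_per_million


lower = daily_cost(INPUT_PRICE)   # every token billed as input
upper = daily_cost(OUTPUT_PRICE)  # every token billed as output
print(f"daily: ${lower:.3f} to ${upper:.2f}")
print(f"monthly (30 days): ${lower * 30:.2f} to ${upper * 30:.2f}")
```

The real figure lies between the two bounds and depends on how the 3000 tokens divide between prompt and completion.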
- Cache Key: response:{query_hash}:{context_hash}
- What: Full response answer
- TTL: 24 hours
- Benefit: Skip entire pipeline for repeated questions
- Expected Hit Rate: 30-50% (common questions like "how to auth", "error codes")

- Cache Key: relevancy:{query_hash}:{doc_hash}
- What: Validation result {"relevant": bool, "score": float}
- TTL: 7 days (doc relevancy is stable)
- Benefit: Skip validation for same (query, doc) pairs
- Expected Hit Rate: 40-60% (same docs get validated for similar queries)

- Cache Key: rerank:{query_hash}:{docs_hash}
- What: Reranked document indices [0, 2, 1, ...]
- TTL: 7 days
- Benefit: Skip reranking for same query+doc combinations

- Cache Key: transform:{query_hash}
- What: Transformed query string
- TTL: 7 days
- Benefit: Skip transformation for similar queries

- Cache Key: embedding:{text_hash}
- What: Embedding vector (768 floats, stored as JSON)
- TTL: 30 days (embeddings for a given text do not change unless the embedding model changes)
- Benefit: Skip embedding generation for repeated queries/texts
Without caching: ~2200-5400 tokens per query
With caching:
- First query: Same as above (cache miss)
- Cached query: ~0 tokens (100% reduction)
- Similar query (partial cache): 50-80% reduction
For a community bot with repeated questions:
- Estimated cache hit rate: 30-50%
- Token savings: 30-50% overall
- Cost savings: ~$8-13/month
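As a rough check on the savings range, the sketch below multiplies the $27/month baseline by the hit rate. It assumes a cache hit costs zero tokens and ignores partial hits, so it is a simplification; `monthly_savings` is a hypothetical helper, not part of the codebase.

```python
def monthly_savings(monthly_cost: float, hit_rate: float) -> float:
    """Savings if a fraction `hit_rate` of queries is served fully from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return monthly_cost * hit_rate


BASELINE = 27.0  # USD/month, from the estimate above
for rate in (0.30, 0.50):
    saved = monthly_savings(BASELINE, rate)
    print(f"hit rate {rate:.0%}: saves about ${saved:.2f}/month")
```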
File: src/config/settings.py
# Redis Caching
REDIS_ENABLED: bool = True
REDIS_URL: str = "redis://localhost:6379/0"
REDIS_TTL_RESPONSE: int = 86400 # 24 hours
REDIS_TTL_VALIDATION: int = 604800 # 7 days
REDIS_TTL_RERANK: int = 604800 # 7 days
REDIS_TTL_TRANSFORM: int = 604800 # 7 days
REDIS_TTL_EMBEDDING: int = 2592000 # 30 days

File: src/rag/cache.py (NEW)
- Create RedisCache class with:
  - Connection management (lazy connection, reconnect on failure)
  - Cache get/set with TTL
  - Hash generation for keys (SHA256 of normalized text)
  - Batch operations for multiple validations
  - Graceful fallback when Redis unavailable
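One way the class outlined above might look is sketched below. It covers lazy connection, get/set with TTL, and graceful fallback; hashing and batch operations are left out. The constructor signature, the optional `client` argument (for tests), and the generic `get`/`set` method names are assumptions. `redis.Redis.from_url` and `set(..., ex=...)` are redis-py calls.

```python
import json
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


class RedisCache:
    """Sketch of a cache wrapper that degrades to a no-op when Redis fails."""

    def __init__(self, url: str, enabled: bool = True, client: Any = None):
        self._url = url
        self._enabled = enabled
        self._client = client  # injected client (e.g. a test fake), else lazy

    def _connect(self) -> Optional[Any]:
        """Create the client on first use; return None if caching is unusable."""
        if not self._enabled:
            return None
        if self._client is None:
            try:
                import redis  # imported lazily so the bot still runs without it

                self._client = redis.Redis.from_url(
                    self._url, decode_responses=True
                )
            except Exception as exc:
                logger.warning("Redis unavailable, caching disabled: %s", exc)
                return None
        return self._client

    def get(self, key: str) -> Optional[Any]:
        client = self._connect()
        if client is None:
            return None
        try:
            raw = client.get(key)
            return json.loads(raw) if raw is not None else None
        except Exception as exc:  # any cache error is treated as a miss
            logger.warning("Cache get failed for %s: %s", key, exc)
            return None

    def set(self, key: str, value: Any, ttl_seconds: int) -> bool:
        client = self._connect()
        if client is None:
            return False
        try:
            client.set(key, json.dumps(value), ex=ttl_seconds)
            return True
        except Exception as exc:
            logger.warning("Cache set failed for %s: %s", key, exc)
            return False
```

Catching broad exceptions is deliberate here: a cache failure should never surface to the user, and the pipeline simply behaves as if every lookup missed.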
File: src/rag/pipeline.py
- Add cache check at start of query():

      # Check cache first
      if cache:
          cached_response = cache.get_response(question, chat_history, mcp_context)
          if cached_response:
              logger.info("Cache hit: returning cached response")
              return cached_response

- Cache result after generation:

      # Cache the response
      if cache:
          cache.set_response(question, chat_history, mcp_context, result)
File: src/rag/gemini_client.py
- Add caching to validate_document_relevancy():

      for doc in documents:
          # Check cache first
          cached = cache.get_validation(query, doc) if cache else None
          if cached is not None:
              ...  # Use cached result
          else:
              ...  # Call Gemini and cache result

- Add caching to rerank_documents() and transform_query()
File: src/rag/vector_store.py
- Add caching to _get_embedding():

      cached = cache.get_embedding(text) if cache else None
      if cached is not None:
          return cached
      # Generate and cache
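The get-or-compute pattern for embeddings could be sketched as follows. To keep the example self-contained it uses a plain dict in place of Redis and a caller-supplied `embed` function in place of the real embedding call; both, along with the name `cached_embedding`, are stand-ins. The key here hashes the exact text without normalization, which is also an assumption.

```python
import hashlib
import json
from typing import Callable, Dict, List


def cached_embedding(
    text: str,
    store: Dict[str, str],
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Return the embedding for `text`, computing it only on a cache miss."""
    key = "embedding:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    raw = store.get(key)
    if raw is not None:
        return json.loads(raw)  # hit: decode the stored JSON array
    vector = embed(text)  # miss: call the (expensive) embedding function
    store[key] = json.dumps(vector)
    return vector


# Usage with a fake embedder that counts how often it is called.
calls = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [0.1, 0.2, 0.3]


store: Dict[str, str] = {}
first = cached_embedding("how to auth", store, fake_embed)
second = cached_embedding("how to auth", store, fake_embed)
print(first == second, len(calls))
```

Comparing against `None` rather than relying on truthiness avoids surprises if the cached value is ever an array type whose truth value is ambiguous.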
Hash Strategy:
- Normalize: lowercase, strip whitespace, remove punctuation variations
- Use SHA256 hash of normalized query/text
- Include context hash for responses (chat history, MCP context)
Key Format:
response:{query_hash}:{context_hash}
relevancy:{query_hash}:{doc_hash}
rerank:{query_hash}:{docs_hash}
transform:{query_hash}
embedding:{text_hash}
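A minimal sketch of the hash strategy and key format above, using only the standard library. Reading "remove punctuation variations" as stripping ASCII punctuation is an assumption, and the helper names (`normalize`, `text_hash`, `response_key`, `relevancy_key`) are hypothetical.

```python
import hashlib
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


def text_hash(text: str) -> str:
    """SHA256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def response_key(query: str, context: str) -> str:
    return f"response:{text_hash(query)}:{text_hash(context)}"


def relevancy_key(query: str, doc: str) -> str:
    return f"relevancy:{text_hash(query)}:{text_hash(doc)}"


# Queries that differ only in case, spacing, or punctuation share a key.
print(text_hash("How to auth?") == text_hash("  how to  AUTH "))
```

Aggressive normalization raises the hit rate for queries but can make distinct texts collide (for example `a.b` and `ab`), so document hashes may warrant a lighter normalization than query hashes.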
- Graceful Fallback: If Redis unavailable, continue without caching (no errors)
- Cache Statistics: Track hits/misses, add to /stats command
- Logging: Log cache hits/misses for monitoring
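Hit/miss tracking for the /stats command might be as simple as the in-process counter below. The class name `CacheStats` and its methods are hypothetical, and the sketch keeps counts in memory only, so they reset when the bot restarts.

```python
from collections import Counter
from typing import Dict


class CacheStats:
    """Per-cache-type hit and miss counters."""

    def __init__(self) -> None:
        self._hits: Counter = Counter()
        self._misses: Counter = Counter()

    def record(self, kind: str, hit: bool) -> None:
        """Record one lookup for a cache type such as 'response'."""
        (self._hits if hit else self._misses)[kind] += 1

    def hit_rate(self, kind: str) -> float:
        total = self._hits[kind] + self._misses[kind]
        return self._hits[kind] / total if total else 0.0

    def summary(self) -> Dict[str, str]:
        """Hit rate per cache type, formatted for display."""
        kinds = sorted(set(self._hits) | set(self._misses))
        return {kind: f"{self.hit_rate(kind):.0%}" for kind in kinds}


stats = CacheStats()
stats.record("response", hit=True)
stats.record("response", hit=False)
stats.record("response", hit=True)
print(stats.summary())
```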
- requirements.txt - Add redis>=5.0.0
- src/config/settings.py - Add Redis config
- src/rag/cache.py (NEW) - Redis cache client
- src/rag/pipeline.py - Add response caching
- src/rag/gemini_client.py - Add validation/rerank/transform caching
- src/rag/vector_store.py - Add embedding caching
- src/bot/telegram_bot.py - Add cache stats to /stats command
- Add Redis service in Railway dashboard
- Get Redis URL from Railway environment variables
- Set REDIS_URL environment variable (or use default)
- Set REDIS_ENABLED=true (or false to disable)
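For illustration, the two variables could be read as in the sketch below. The project's settings.py may already load them through its own settings class; the function `redis_settings` and the set of accepted truthy strings are assumptions, while the defaults match the values shown in the settings above.

```python
import os
from typing import Mapping, Tuple


def redis_settings(env: Mapping[str, str] = os.environ) -> Tuple[bool, str]:
    """Return (enabled, url) from the environment, with the plan's defaults."""
    flag = env.get("REDIS_ENABLED", "true").strip().lower()
    enabled = flag in ("1", "true", "yes")
    url = env.get("REDIS_URL", "redis://localhost:6379/0")
    return enabled, url


enabled, url = redis_settings({"REDIS_ENABLED": "false"})
print(enabled, url)
```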
- Test cache hits for identical queries
- Test cache misses for new queries
- Test Redis connection failure fallback
- Test TTL expiration
- Measure token usage before/after
- Test with Railway Redis
- ✅ Redis caching reduces Gemini API calls by 30%+ for repeated queries
- ✅ Cache hit rate > 30% in production
- ✅ No performance degradation (cache lookups < 10ms)
- ✅ Graceful fallback when Redis unavailable
- ✅ Cache statistics visible in /stats command
- ✅ Monthly cost savings of $8-13
- Significant token/cost savings (30-50%)
- Faster responses for cached queries
- Reduced API rate limit issues
- Better user experience (faster responses)
- Additional infrastructure (Redis service)
- Slight complexity increase
- Cache invalidation complexity (if docs change)
- Memory usage (Redis storage)
YES, implement Redis caching - The benefits (30-50% cost savings, faster responses) outweigh the costs (Redis service, slight complexity). For a community bot with repeated questions, this is a high-value optimization.