Project Type: Advanced-Level System Design + AI Engineering
Status: Production-Ready with Observability Stack
Tech Stack: FastAPI, Prometheus, Locust, httpx, asyncio
Large Language Models are computationally expensive. When multiple users send queries simultaneously, systems face:
- High operational costs (API charges/GPU usage)
- Performance degradation (slow responses, failures)
- Resource wastage (using expensive models for simple tasks)
This project demonstrates cost-efficient scaling strategies under multi-user load.
- Smart Router: Analyzes prompt complexity
- Short/Simple → Local Model (Fast, Cheap: ~$0.000004/req)
- Long/Complex → Cloud API (Powerful, Expensive: ~$0.00014/req)
- Savings: 74% of traffic is routed to the cheaper local provider
- Circuit Breaker: Auto-failover after 3 API failures
- Async Architecture: Non-blocking I/O with httpx
- Graceful Degradation: System remains functional during outages
- In-Memory Cache: SHA-256 hash-based exact match
- Cost Reduction: 98% savings on repeated queries
- TTL: 1-hour cache expiration
- Prometheus Metrics:
  - `llm_total_cost_usd_total` - Cumulative spend per provider
  - `llm_cache_events_total` - Hit/Miss tracking
  - `http_request_duration_seconds` - Latency histograms
- Real-time Dashboard: Live cost and performance tracking
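The routing rule in the list above (short prompts go local, long prompts go to the cloud API) can be sketched as follows. This is a minimal illustration, not the project's actual `router.py`: the word-count heuristic, the `MAX_LOCAL_WORDS` threshold, and the `Route` and `choose_route` names are assumptions, since the README does not state how prompt complexity is measured. Only the two per-request costs come from this document.

```python
from dataclasses import dataclass

# Per-request costs quoted in this README.
LOCAL_COST_USD = 0.000004
CLOUD_COST_USD = 0.00014

# Hypothetical cutoff: the README does not state the real threshold.
MAX_LOCAL_WORDS = 20


@dataclass(frozen=True)
class Route:
    """Routing decision: which provider to call and its estimated cost."""
    provider: str
    estimated_cost_usd: float


def choose_route(prompt: str, max_local_words: int = MAX_LOCAL_WORDS) -> Route:
    """Send short prompts to the local model and longer ones to the cloud API."""
    word_count = len(prompt.split())
    if word_count <= max_local_words:
        return Route("local", LOCAL_COST_USD)
    return Route("cloud", CLOUD_COST_USD)


if __name__ == "__main__":
    print(choose_route("What is FastAPI?"))
    print(choose_route("Explain the trade-offs of " + "distributed caching " * 20))
```

A length-based rule is cheap to evaluate on every request, which is presumably why it is used before any model is called; a classifier-based router would replace only the body of `choose_route`.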
User Request → Cache Check → Router → [Local GPU | HuggingFace API]
↓ ↓
Prometheus ← Track Cost/Latency ←┘
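The cache check at the front of this flow can be sketched as below, assuming an in-memory dictionary keyed by the SHA-256 hash of the exact prompt with a 1-hour TTL, as described earlier. The class and function names (`ExactMatchCache`, `handle_request`) and the injectable `clock` parameter are hypothetical and are not taken from the project's `cache.py`.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 3600  # 1-hour expiration, as stated in this README


class ExactMatchCache:
    """In-memory cache keyed by the SHA-256 hash of the exact prompt text."""

    def __init__(self, ttl_seconds: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return response

    def put(self, prompt: str, response: str) -> None:
        self._entries[self._key(prompt)] = (self._clock(), response)


def handle_request(prompt: str, cache: ExactMatchCache,
                   generate: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (response, cache_hit). Only a cache miss reaches the provider."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached, True
    response = generate(prompt)
    cache.put(prompt, response)
    return response, False
```

Because the key is a hash of the exact text, two prompts that differ by a single character are treated as different entries.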
API Call Success → Reset failure counter
API Call Fail → Increment counter
Counter ≥ 3 → Open circuit (30s timeout)
Circuit Open → Skip API, use fallback
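The four state transitions above can be sketched as a small async class. This is an illustration under stated assumptions, not the code in `huggingface_provider.py`: the `CircuitBreaker` name, the `call` signature, using the fallback on every individual failure, and resetting the counter once the 30-second timeout elapses are all assumptions. Only the threshold of 3 and the 30-second timeout come from this README.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

FAILURE_THRESHOLD = 3        # open the circuit after 3 failures (from this README)
OPEN_TIMEOUT_SECONDS = 30.0  # keep it open for 30 seconds (from this README)


class CircuitBreaker:
    def __init__(self, threshold: int = FAILURE_THRESHOLD,
                 timeout: float = OPEN_TIMEOUT_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._timeout = timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while the API should be skipped."""
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._timeout:
            # Assumption: after the timeout the circuit closes and the count resets.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    async def call(self, api_call: Callable[[], Awaitable[str]],
                   fallback: Callable[[], Awaitable[str]]) -> str:
        if self.is_open():
            return await fallback()  # circuit open: skip the API entirely
        try:
            result = await api_call()
        except Exception:  # broad on purpose for this sketch
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return await fallback()  # assumption: degrade gracefully on each failure
        self._failures = 0  # success resets the failure counter
        return result


async def _demo() -> None:
    breaker = CircuitBreaker()

    async def flaky_api() -> str:
        raise ConnectionError("simulated outage")

    async def local_fallback() -> str:
        return "response from local fallback"

    for attempt in range(1, 5):
        result = await breaker.call(flaky_api, local_fallback)
        print(attempt, breaker.is_open(), result)


if __name__ == "__main__":
    asyncio.run(_demo())
```

While the circuit is open, callers pay only the cost of the fallback, so a failing upstream API does not add its timeout to every request.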
| Metric | Baseline | Advanced | Improvement |
|---|---|---|---|
| Median Latency | 20,000ms | 3ms | 99.98% ↓ |
| Throughput | 1.38 req/s | 18.26 req/s | 13.2x ↑ |
| Cost (cached) | $0.0058 | $0.00007 | 98% ↓ |
| Failure Rate | 0% | 0% | Stable ✓ |
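The Improvement column can be re-derived from the Baseline and Advanced columns. The short check below uses only the figures in the table; the helper names `percent_reduction` and `speedup` are illustrative.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative drop from baseline to improved, in percent."""
    return (baseline - improved) / baseline * 100


def speedup(baseline: float, improved: float) -> float:
    """How many times larger the improved value is than the baseline."""
    return improved / baseline


latency_drop = percent_reduction(20_000, 3)      # median latency in ms
throughput_gain = speedup(1.38, 18.26)           # requests per second
cost_drop = percent_reduction(0.0058, 0.00007)   # USD, cached scenario

print(f"Latency reduction: {latency_drop:.3f}%")
print(f"Throughput gain:   {throughput_gain:.1f}x")
print(f"Cost reduction:    {cost_drop:.1f}%")
```

The results come to roughly 99.985%, 13.2x, and 98.8%, so the table's 99.98% and 98% are the computed values rounded down.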
- Python 3.10+
- Virtual environment
.venv\Scripts\activate
pip install -r requirements.txt

Create a .env file:
HF_API_TOKEN=your_huggingface_token_here
uvicorn src.api.main:app --host 0.0.0.0 --port 8000

- Web UI: http://localhost:8000
- Prometheus Metrics: http://localhost:8000/metrics
- Health Check: http://localhost:8000/health
locust -f src/simulation/locustfile.py --headless -u 50 -r 5 --run-time 1m --host http://localhost:8000 --csv results

Scenarios Tested:
- 50 concurrent users
- 75% short prompts (local routing)
- 25% complex prompts (cloud routing)
- 1-minute duration
OptiRoute AI/
├── src/
│   ├── api/
│   │   ├── main.py                  # FastAPI app
│   │   ├── endpoints.py             # /generate, /health routes
│   │   └── middleware.py            # Prometheus metrics
│   ├── llm_engine/
│   │   ├── base.py                  # Abstract provider interface
│   │   ├── local_provider.py        # Mock GPU simulation
│   │   ├── huggingface_provider.py  # Async HF API + Circuit Breaker
│   │   ├── router.py                # Dynamic routing logic
│   │   └── cache.py                 # Response cache
│   ├── simulation/
│   │   └── locustfile.py            # Load test scenarios
│   └── frontend/
│       ├── index.html               # Chat UI
│       ├── app.js                   # Frontend logic
│       └── style.css                # Glassmorphism design
├── requirements.txt
└── .env
- Async I/O: `httpx` + `asyncio.sleep()` for non-blocking operations
- Circuit Breaker: Prevents cascading failures
- Caching: Zero-cost responses for repeated queries
- Observability: Prometheus-ready metrics
- Cost Tracking: Per-model granular billing
- Semantic Caching: Embeddings-based similarity matching
- ML-Based Router: Classifier for intelligent routing
- Redis Cache: Distributed caching for multi-node deployment
- Grafana Dashboard: Beautiful real-time visualizations
- Kubernetes: Auto-scaling with HPA
- Real Local LLM: llama-cpp-python integration
- Backend Engineering: FastAPI, async programming
- DevOps: Load testing, monitoring, metrics
- AI Systems: LLM provider abstraction, cost optimization
- Design Patterns: Circuit breaker, caching, dependency injection
- Performance Engineering: Bottleneck identification, 99.98% latency reduction
Built to demonstrate expert-level engineering practices in LLM cost optimization and production observability.
Key Achievements:
- 99.98% latency reduction through async architecture
- 98% cost savings via intelligent caching
- Production-grade monitoring with Prometheus metrics, ready for a Grafana dashboard
MIT License - Open source and free to use
- GitHub: @prem85642
- LinkedIn: Prem Kumar Tiwari
Built with: FastAPI • Python • Prometheus • Grafana • HuggingFace API
⭐ Star this repo if you found it helpful!