Skip to content

prem85642/-OptiRoute-AI

Repository files navigation

OptiRoute AI: Cost-Aware Scaling in Multi-User LLM Systems

Project Type: Advanced-Level System Design + AI Engineering
Status: Production-Ready with Observability Stack
Tech Stack: FastAPI, Prometheus, Locust, httpx, asyncio


Problem Statement

Large Language Models are computationally expensive. When multiple users send queries simultaneously, systems face:

  • High operational costs (API charges/GPU usage)
  • Performance degradation (slow responses, failures)
  • Resource wastage (using expensive models for simple tasks)

This project demonstrates cost-efficient scaling strategies under multi-user load.


Key Features

1. Dynamic Cost-Aware Routing

  • Smart Router: Analyzes prompt complexity
    • Short/Simple → Local Model (Fast, Cheap: ~$0.000004/req)
    • Long/Complex → Cloud API (Powerful, Expensive: ~$0.00014/req)
  • Savings: 74% of traffic routed to cheaper provider

2. Advanced Resilience Patterns

  • Circuit Breaker: Auto-failover after 3 API failures
  • Async Architecture: Non-blocking I/O with httpx
  • Graceful Degradation: System remains functional during outages

3. Response Caching

  • In-Memory Cache: SHA-256 hash-based exact match
  • Cost Reduction: 98% savings on repeated queries
  • TTL: 1-hour cache expiration

4. Production-Grade Observability

  • Prometheus Metrics:
    • llm_total_cost_usd_total - Cumulative spend per provider
    • llm_cache_events_total - Hit/Miss tracking
    • http_request_duration_seconds - Latency histograms
  • Real-time Dashboard: Live cost and performance tracking

Architecture

User Request → Cache Check → Router → [Local GPU | HuggingFace API]
                   ↓                            ↓
              Prometheus ← Track Cost/Latency ←┘

Circuit Breaker Flow

API Call Success → Reset failure counter
API Call Fail    → Increment counter
Counter ≥ 3      → Open circuit (30s timeout)
Circuit Open     → Skip API, use fallback

Performance Results

Metric Baseline Advanced Improvement
Median Latency 20,000ms 3ms 99.98% ↓
Throughput 1.38 req/s 18.26 req/s 13.2x ↑
Cost (cached) $0.0058 $0.00007 98% ↓
Failure Rate 0% 0% Stable ✓

Installation & Setup

Prerequisites

  • Python 3.10+
  • Virtual environment

Install Dependencies

.venv\Scripts\activate
pip install -r requirements.txt

Environment Variables

Create .env file:

HF_API_TOKEN=your_huggingface_token_here

Run Server

uvicorn src.api.main:app --host 0.0.0.0 --port 8000

Access Application


Load Testing

Run Locust Simulation

locust -f src/simulation/locustfile.py --headless -u 50 -r 5 --run-time 1m --host http://localhost:8000 --csv results

Scenarios Tested:

  • 50 concurrent users
  • 75% short prompts (local routing)
  • 25% complex prompts (cloud routing)
  • 1-minute duration

Project Structure

OptiRoute AI/
├── src/
│   ├── api/
│   │   ├── main.py              # FastAPI app
│   │   ├── endpoints.py         # /generate, /health routes
│   │   └── middleware.py        # Prometheus metrics
│   ├── llm_engine/
│   │   ├── base.py              # Abstract provider interface
│   │   ├── local_provider.py   # Mock GPU simulation
│   │   ├── huggingface_provider.py  # Async HF API + Circuit Breaker
│   │   ├── router.py            # Dynamic routing logic
│   │   └── cache.py             # Response cache
│   ├── simulation/
│   │   └── locustfile.py        # Load test scenarios
│   └── frontend/
│       ├── index.html           # Chat UI
│       ├── app.js               # Frontend logic
│       └── style.css            # Glassmorphism design
├── requirements.txt
└── .env

Advanced Patterns Demonstrated

  1. Async I/O: httpx + asyncio.sleep() for non-blocking operations
  2. Circuit Breaker: Prevents cascading failures
  3. Caching: Zero-cost responses for repeated queries
  4. Observability: Prometheus-ready metrics
  5. Cost Tracking: Per-model granular billing

Future Enhancements (Expert Tier)

  • Semantic Caching: Embeddings-based similarity matching
  • ML-Based Router: Classifier for intelligent routing
  • Redis Cache: Distributed caching for multi-node deployment
  • Grafana Dashboard: Beautiful real-time visualizations
  • Kubernetes: Auto-scaling with HPA
  • Real Local LLM: llama-cpp-python integration

Skills Demonstrated

  • Backend Engineering: FastAPI, async programming
  • DevOps: Load testing, monitoring, metrics
  • AI Systems: LLM provider abstraction, cost optimization
  • Design Patterns: Circuit breaker, caching, dependency injection
  • Performance Engineering: Bottleneck identification, 99.98% latency reduction

About This Project

Built to demonstrate expert-level engineering practices in LLM cost optimization and production observability.

Key Achievements:

  • 99.98% latency reduction through async architecture
  • 98% cost savings via intelligent caching
  • Production-grade monitoring with Prometheus/Grafana

License

MIT License - Open source and free to use


Contact & Links

Built with: FastAPI • Python • Prometheus • Grafana • HuggingFace API


⭐ Star this repo if you found it helpful!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors