A simple, Go-based alternative to the litellm proxy, without all the extra stuff you don't need! A modular reverse proxy that forwards requests to various LLM providers (OpenAI, Anthropic, Gemini) using Go and the Gorilla web toolkit.
- Multi-provider support: Full support for OpenAI, Anthropic, and Gemini
- Streaming Support: Native streaming support for all providers
- OpenAI Integration: Complete OpenAI API compatibility with `/openai` prefix
- Anthropic Integration: Claude API support with `/anthropic` prefix
- Gemini Integration: Google Gemini API support with `/gemini` prefix
- Comprehensive Logging: Request/response monitoring with streaming detection
- CORS Support: Browser-based application compatibility
- Health Check: Detailed health status for all providers
- Configurable Port: Environment variable configuration (default: 9002)
- Rate Limiting (experimental): Optional request/token-based limits per user/API key/model/provider
- Circuit Breaker: Opt-in provider health tracking that classifies upstream failures, retries transient / rate-limit errors, and emits a dedicated degraded-signal response so clients can fall back to another provider during an outage
```bash
# Get help on available commands
make help

# Install dependencies and build
make install build

# Run the proxy
make run

# Or run in development mode
make dev
```

Once the proxy is running, you can make requests to LLM providers through the proxy:

```bash
# Health check (shows all provider statuses)
curl http://localhost:9002/health
# OpenAI Chat completions (replace YOUR_API_KEY with your actual OpenAI API key)
curl -X POST http://localhost:9002/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello, world!"}],
"max_tokens": 50
}'
# OpenAI Streaming
curl -X POST http://localhost:9002/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true,
"stream_options": {"include_usage": true}
}'
# Anthropic Messages
curl -X POST http://localhost:9002/anthropic/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-3-sonnet-20240229",
"max_tokens": 100,
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Gemini Generate Content
curl -X POST "http://localhost:9002/gemini/v1/models/gemini-pro:generateContent?key=YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents": [{"parts": [{"text": "Hello!"}]}]
}'
```

The project includes comprehensive integration tests for all providers:

```bash
# Run all tests
make test-all
# Run tests for specific providers
make test-openai
make test-anthropic
make test-gemini
# Run health check tests only
make test-health
# Check environment variables
make env-check
```

To run integration tests, you need to set up environment variables:

```bash
export OPENAI_API_KEY=your_openai_key
export ANTHROPIC_API_KEY=your_anthropic_key
export GEMINI_API_KEY=your_gemini_key
```

`PORT`: Environment variable to set the server port (default: 9002)
- Disabled by default. Enable via config: see `configs/base.yml` and `configs/dev.yml`.
- Supports provisional token estimation with post-response reconciliation using `X-LLM-Input-Tokens` (input tokens only).
- Returns `429 Too Many Requests` with `Retry-After` and `X-RateLimit-*` headers when throttled (see the client sketch below).
- Redis backend is currently not supported; only the in-process memory backend is available.
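A rough sketch of how a client can react to those throttle responses: wait for the advertised `Retry-After`, then retry once. The helper below is illustrative only (it is not part of the proxy, and auth headers are omitted for brevity); check an actual 429 response for the exact `X-RateLimit-*` header names.

```go
package client

import (
	"bytes"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// postWithThrottle POSTs body to url and, on 429, sleeps for the advertised
// Retry-After before a single retry. The request is rebuilt from the byte
// slice so the body can be sent again.
func postWithThrottle(c *http.Client, url string, body []byte) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json") // auth headers omitted for brevity

		resp, err := c.Do(req)
		if err != nil || resp.StatusCode != http.StatusTooManyRequests || attempt > 0 {
			return resp, err
		}

		wait := time.Second // fallback when Retry-After is missing
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, convErr := strconv.Atoi(s); convErr == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		fmt.Printf("throttled by proxy, retrying in %s\n", wait)
		resp.Body.Close()
		time.Sleep(wait)
	}
}
```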
Minimal dev example (see `configs/dev.yml` for a full setup):

```yaml
features:
  rate_limiting:
    enabled: true
    backend: "memory"        # single instance only
    estimation:
      max_sample_bytes: 20000
      bytes_per_token: 4     # Fallback to request size (Content-Length based)
      chars_per_token: 4     # Default for message-based estimation
      # Optional per-provider overrides (recommended)
      provider_chars_per_token:
        openai: 5            # ~185–190 tokens per 1k chars (from scripts/token_estimation.py)
        anthropic: 3         # ~290–315 tokens per 1k chars (from scripts/token_estimation.py)
    limits:
      requests_per_minute: 0 # 0 = unlimited (dev defaults)
      tokens_per_minute: 0
```

- We currently account for and reconcile only input tokens. Output tokens are not yet considered for rate limits/credits.
- For small JSON requests (size controlled by `max_sample_bytes`), the proxy extracts textual message content via provider-specific parsers and estimates tokens by character count using `chars_per_token` (with per-provider overrides); see the sketch below.
- Default per-provider values come from benchmarks produced by `scripts/token_estimation.py`. You can run the script to generate your own table and override the values in config.
- Non-text modalities (images/videos) are not supported for estimation at this time and will essentially fall back to credit-based-only behavior via `max_sample_bytes`.
- Optimistic first request: to avoid estimation blocking initial traffic, the first token-bearing request in a window (when the current token count is zero) is allowed even if token limits would otherwise apply. Subsequent requests are enforced normally.
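As a rough illustration of that heuristic (this is not the proxy's actual parser; the function name and the OpenAI-style request shape below are assumptions):

```go
package estimate

import "encoding/json"

// chatMessage mirrors an OpenAI-style request shape; other providers use
// their own parsers in the real proxy.
type chatMessage struct {
	Content string `json:"content"`
}

type chatRequest struct {
	Messages []chatMessage `json:"messages"`
}

// EstimateInputTokens applies the documented heuristic: for small JSON bodies,
// count characters of textual message content and divide by charsPerToken;
// for anything larger than maxSampleBytes (or unparsable), fall back to
// body size divided by bytesPerToken.
func EstimateInputTokens(body []byte, maxSampleBytes, charsPerToken, bytesPerToken int) int {
	if len(body) <= maxSampleBytes {
		var req chatRequest
		if err := json.Unmarshal(body, &req); err == nil && len(req.Messages) > 0 {
			chars := 0
			for _, m := range req.Messages {
				chars += len(m.Content)
			}
			return (chars + charsPerToken - 1) / charsPerToken
		}
	}
	return (len(body) + bytesPerToken - 1) / bytesPerToken
}
```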
Disabled by default. Enable it when you want the proxy to detect provider outages (as opposed to individual request failures) and broadcast that state to clients so they can switch to a fallback provider instead of retry-storming a dead upstream.
- Classifies every upstream failure as one of: local rate-limit, global rate-limit, or provider-degraded. Provider-specific rules apply; for example, Anthropic `529 Overloaded` → degraded, Gemini `429 RESOURCE_EXHAUSTED` → local rate-limit, and OpenAI 429 with both `x-ratelimit-remaining-requests` and `x-ratelimit-remaining-tokens` at zero → global rate-limit.
- Retries transient failures with jittered exponential backoff (configurable attempts). Rate-limit retries honour `Retry-After`. Sustained global rate limits escalate to degraded after a configurable window.
- Opens the circuit once a provider accumulates enough terminal failures inside a sliding window (see the sketch after this list). While open, every request is fast-failed locally without touching the network. After a cooldown the proxy issues one half-open probe; success closes the circuit, failure re-opens it for another cooldown.
- Surfaces state on `GET /health` per provider: `circuit_state` (closed/open/half_open), `circuit_failures`, `circuit_cooldown_until`.
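Conceptually, the closed → open → half-open cycle behaves like the sketch below. This is an illustration of the mechanism, not the proxy's implementation; the type and field names are assumptions.

```go
package breaker

import (
	"sync"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// Breaker counts terminal failures inside a sliding window, fast-fails while
// open, and treats the first request after the cooldown as a probe.
type Breaker struct {
	mu            sync.Mutex
	st            state
	failures      []time.Time
	threshold     int           // e.g. failure_threshold: 5
	window        time.Duration // e.g. window_seconds: 120
	cooldown      time.Duration // e.g. cooldown_seconds: 300
	cooldownUntil time.Time
}

// Allow reports whether a request may be sent to the provider right now.
func (b *Breaker) Allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.st == open {
		if now.Before(b.cooldownUntil) {
			return false // fast-fail locally, no network call
		}
		b.st = halfOpen // cooldown over: let a probe through
	}
	return true
}

// Record updates the breaker once a request has finished.
func (b *Breaker) Record(now time.Time, terminalFailure bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !terminalFailure {
		if b.st == halfOpen {
			b.st, b.failures = closed, nil // probe succeeded: close the circuit
		}
		return
	}
	if b.st == halfOpen {
		b.st, b.cooldownUntil = open, now.Add(b.cooldown) // probe failed: re-open
		return
	}
	// Drop failures that fell out of the sliding window, then record this one.
	kept := b.failures[:0]
	for _, t := range b.failures {
		if now.Sub(t) <= b.window {
			kept = append(kept, t)
		}
	}
	b.failures = append(kept, now)
	if len(b.failures) >= b.threshold {
		b.st, b.cooldownUntil = open, now.Add(b.cooldown)
	}
}
```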
When a provider is degraded (terminal retry exhaustion, open-circuit fast-fail, or half-open probe failure), the proxy returns a synthetic response:
- HTTP 503 status
- `X-Llm-Proxy-Error-Class: provider_degraded` response header
- JSON body containing a configurable marker substring (`[LLM_PROXY_PROVIDER_DEGRADED]` by default):

```json
{"error":{"message":"[LLM_PROXY_PROVIDER_DEGRADED] Provider openai is currently degraded or unavailable. Please try again later.","type":"provider_degraded","code":"provider_degraded"}}
```
Clients detect the degraded condition by looking for that substring anywhere in `str(exception)` (or the response body) and can then fall back to a different provider / model.
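In Go, for example, that check reduces to a header comparison plus a substring match on the body; a minimal sketch (the helper name is ours, not part of the proxy):

```go
package client

import (
	"io"
	"net/http"
	"strings"
)

const degradedMarker = "[LLM_PROXY_PROVIDER_DEGRADED]"

// providerDegraded reports whether the proxy synthesised a degraded response.
// The body-marker check is what survives SDK wrapping, so it runs even when
// the header is missing. Note that reading the body consumes it; real code
// would buffer and restore it.
func providerDegraded(resp *http.Response) bool {
	if resp.Header.Get("X-Llm-Proxy-Error-Class") == "provider_degraded" {
		return true
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	return strings.Contains(string(body), degradedMarker)
}
```

On a hit, the caller can switch to a fallback provider or model instead of retrying the same upstream.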
The 503 status and `X-Llm-Proxy-Error-Class` header are always set, but on their own they are not enough:
- 5xx is ambiguous. The proxy streams real provider 5xx responses straight through (Anthropic 529, OpenAI 500/502/503/504, Gemini 500/503, …). A client that only looks at the status code cannot tell a passthrough upstream 503 from a proxy-synthesised "circuit open" 503, and the two have very different retry/fallback semantics.
- 4xx is wrong. The caller did nothing wrong; the upstream is degraded. 4xx would break SDK retry logic that (correctly) refuses to retry 4xx.
- A novel status code (e.g. 599) is hostile. Many reverse proxies, CDNs, and HTTP clients coerce unknown codes to 500 or strip them entirely. The OpenAI / Anthropic / Google SDKs all map any ≥500 response into a generic `APIError`/`ServerError` class, so a custom code buys you nothing downstream.
- Custom headers get stripped by SDK exception wrappers. By the time an HTTP error propagates up through e.g. the OpenAI Python SDK or LangChain, the caller typically only sees `str(exception)` (the body). Response headers are usually only accessible if the caller catches a specific, provider-native exception type before any framework wraps it. A body substring survives every wrapping layer.
So the contract is 503 + header + body marker, and clients can key off any of them. The body marker is the most reliable because exception message text is the lowest common denominator across every SDK stack.
Config lives under `features.circuit_breaker` in YAML. All values shown below are defaults (apart from `enabled`, which is off by default):
```yaml
features:
  circuit_breaker:
    enabled: true                           # gate the whole feature
    backend: "memory"                       # "memory" (single instance) or "redis" (multi-instance)
    failure_threshold: 5                    # terminal failures in the window that trip the circuit
    window_seconds: 120                     # sliding-window TTL for the failure counter
    cooldown_seconds: 300                   # how long the circuit stays open before a probe
    max_transient_retries: 2                # retries for degraded-class failures
    max_rate_limit_retries: 2               # retries for rate-limit failures
    retry_contribution_mode: "log"          # "off" | "log" | "on": whether retried failures count
    global_rate_limit_escalation_window: 60 # seconds of sustained global 429s → degraded
    degraded_signal: ""                     # override the body marker; empty → "[LLM_PROXY_PROVIDER_DEGRADED]"
    test_mode_enabled: false                # prod: leave off. Enables X-LLM-Proxy-Test-Mode header.
    # redis:                                # required when backend == "redis"
    #   address: "localhost:6379"
    #   password: ""
    #   db: 0
```

The `degraded_signal` field lets you embed a project- or company-specific tag in the response body (e.g. `"[MY_COMPANY_UPSTREAM_DOWN]"`). Clients then pattern-match on that tag instead of the default. Leaving it empty keeps the default, which is the right choice for most deployments.
When `test_mode_enabled: true`, the proxy honours an `X-LLM-Proxy-Test-Mode` request header (or `llm_proxy_test_mode` query parameter, for SDKs like the Google Gemini client that don't let you pass custom headers):
- `force_degraded`: return the synthesised 503 / degraded body immediately, as if the circuit were open.
- `force_transient_recover`: fail the first attempt with a degraded error and succeed on the retry, so you can exercise the retry loop without the circuit tripping.
This is intended strictly for integration tests; leave `test_mode_enabled` off in production.
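For example, an integration test can force the degraded path without touching a real provider. A hedged sketch, assuming the proxy is running locally with `test_mode_enabled: true`:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	body := strings.NewReader(`{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hi"}]}`)
	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:9002/openai/v1/chat/completions", body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer YOUR_API_KEY")
	// Ask the proxy to behave as if the circuit were already open.
	req.Header.Set("X-LLM-Proxy-Test-Mode", "force_degraded")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// Expect: 503 and X-Llm-Proxy-Error-Class: provider_degraded.
	fmt.Println(resp.StatusCode, resp.Header.Get("X-Llm-Proxy-Error-Class"))
}
```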
- `GET /health` - Health check endpoint for all providers. When the circuit breaker is enabled the response also includes `circuit_state`, `circuit_failures`, and `circuit_cooldown_until` per provider.
- `POST /openai/v1/chat/completions` - OpenAI chat completions endpoint (streaming supported)
- `POST /openai/v1/completions` - OpenAI completions endpoint (streaming supported)
- `* /openai/v1/*` - All other OpenAI API endpoints
- `POST /anthropic/v1/messages` - Anthropic messages endpoint (streaming supported)
- `* /anthropic/v1/*` - All other Anthropic API endpoints
- `POST /gemini/v1/models/{model}:generateContent` - Gemini content generation (streaming supported)
- `POST /gemini/v1/models/{model}:streamGenerateContent` - Explicit streaming endpoint
- `* /gemini/v1/*` - All other Gemini API endpoints
The proxy is built with a modular architecture:
- `main.go`: Core server setup, middleware, and provider registration
- `providers/openai.go`: OpenAI-specific proxy implementation with streaming support
- `providers/anthropic.go`: Anthropic proxy implementation with streaming support
- `providers/gemini.go`: Gemini proxy implementation with streaming support
- `providers/provider.go`: Common interfaces and provider management
Each provider implements its own:
- Route registration
- Request/response handling with streaming support
- Error handling
- Health status reporting
- Response metadata parsing
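The real `Provider` interface is defined in `providers/provider.go`; as a conceptual sketch only (method names and signatures below are illustrative, not the actual ones), it covers the responsibilities listed above:

```go
package providers

import (
	"net/http"

	"github.com/gorilla/mux"
)

// Provider sketches what each provider module supplies. Method names are
// illustrative; see providers/provider.go for the actual interface.
type Provider interface {
	// Route registration: mount /openai/v1/*, /anthropic/v1/*, ... on the router.
	RegisterRoutes(r *mux.Router)
	// Request/response handling, including streaming passthrough.
	Proxy(w http.ResponseWriter, r *http.Request)
	// Health status reported under GET /health.
	Health() map[string]interface{}
	// Response metadata parsing (e.g. token usage) for logging and rate limiting.
	ParseMetadata(body []byte) (map[string]interface{}, error)
}
```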
```bash
# Get help on all available commands
make help
# Code quality
make check # Run all code quality checks
make fmt # Format Go code
make vet # Run go vet
make lint # Run golint
# Building
make build # Build the binary
make clean # Clean build artifacts
make install # Install dependencies
# Running
make run # Run the built binary
make dev # Run in development mode
# Testing
make test # Run unit tests
make test-all # Run all tests including integration
make test-openai # Run OpenAI tests only
make test-anthropic # Run Anthropic tests only
make test-gemini # Run Gemini tests only
```

Tests are organized by provider:
- `openai_test.go`: OpenAI integration tests (streaming and non-streaming)
- `anthropic_test.go`: Anthropic integration tests (streaming and non-streaming)
- `gemini_test.go`: Gemini integration tests (streaming and non-streaming)
- `common_test.go`: Health check and environment variable tests
- `test_helpers.go`: Shared test utilities
- Logging: Logs all incoming requests with streaming detection
- CORS: Adds CORS headers for browser compatibility
- Streaming: Optimized handling for streaming responses
- Error Handling: Provider-specific error handling
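For orientation, the logging and CORS behaviour described above boils down to standard `net/http` middleware. A simplified sketch (not the proxy's actual code; the streaming heuristic here is intentionally crude):

```go
package middleware

import (
	"log"
	"net/http"
	"strings"
)

// Logging logs each request and flags likely streaming calls.
func Logging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		streaming := strings.Contains(r.URL.Path, ":streamGenerateContent") ||
			r.URL.Query().Get("stream") == "true" // real detection also inspects the JSON body
		log.Printf("%s %s streaming=%v", r.Method, r.URL.Path, streaming)
		next.ServeHTTP(w, r)
	})
}

// CORS adds permissive headers so browser-based clients can call the proxy.
func CORS(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		w.Header().Set("Access-Control-Allow-Headers", "Authorization, Content-Type, x-api-key, anthropic-version")
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```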
To add a new provider:
- Create a new file (e.g., `newprovider.go`)
- Implement the `Provider` interface
- Add streaming detection logic
- Add response metadata parsing
- Create a corresponding test file
- Register the provider in `main.go` (a rough skeleton is sketched below)
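A rough skeleton of such a provider (everything here is hypothetical: the upstream URL, the type name, and the method set should be adapted to the real `Provider` interface in `providers/provider.go`):

```go
package providers

import (
	"net/http"
	"net/http/httputil"
	"net/url"

	"github.com/gorilla/mux"
)

// NewProviderProxy is a hypothetical provider skeleton: it reverse-proxies
// /newprovider/v1/* to an upstream base URL.
type NewProviderProxy struct {
	upstream *url.URL
}

func NewNewProviderProxy(base string) (*NewProviderProxy, error) {
	u, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	return &NewProviderProxy{upstream: u}, nil
}

// RegisterRoutes mounts the provider under its own prefix, mirroring the
// /openai, /anthropic, and /gemini providers.
func (p *NewProviderProxy) RegisterRoutes(r *mux.Router) {
	proxy := httputil.NewSingleHostReverseProxy(p.upstream)
	r.PathPrefix("/newprovider/v1/").Handler(
		http.StripPrefix("/newprovider", proxy))
}
```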
- Gorilla Mux - HTTP router and URL matcher
The binary includes build-time information:
- Git commit hash
- Build timestamp
- Go version
View build info with:

```bash
make version
```
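Build-time values like these are typically injected via Go's `-ldflags "-X ..."` mechanism; a minimal, hypothetical sketch (the actual variable names and Makefile wiring may differ):

```go
package main

import (
	"fmt"
	"runtime"
)

// Overridden at build time, e.g.:
//   go build -ldflags "-X main.gitCommit=$(git rev-parse --short HEAD) -X main.buildTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
var (
	gitCommit = "unknown"
	buildTime = "unknown"
)

func main() {
	fmt.Printf("commit=%s built=%s go=%s\n", gitCommit, buildTime, runtime.Version())
}
```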