Module: github.com/agentuity/llmproxy
A pluggable, composable library for proxying requests to upstream LLM providers. The core lifecycle follows a five-stage pipeline:
Parse --> Resolve --> Enrich --> Forward --> Extract
Callers construct a Proxy with a Provider and optional Interceptor chain, then call Forward(ctx, req) to proxy a single request through the full lifecycle.
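For example, a minimal setup might look like the following sketch; the provider, logger, context, and incoming request are assumed to be constructed elsewhere:

```go
// Sketch: build a Proxy with one interceptor and a custom HTTP client,
// then forward a single request through the lifecycle.
proxy := llmproxy.NewProxy(provider,
	llmproxy.WithInterceptor(interceptors.NewLogging(logger)),
	llmproxy.WithHTTPClient(&http.Client{Timeout: 60 * time.Second}),
)

resp, meta, err := proxy.Forward(ctx, req)
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
log.Printf("model=%s total_tokens=%d", meta.Model, meta.Usage.TotalTokens)
```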
BodyParser — Extracts BodyMetadata from a raw HTTP request body. Returns structured metadata (model name, messages, max tokens, stream flag, custom fields) alongside the raw body bytes. The raw bytes are returned because http.Request.Body is a ReadCloser that can only be read once — downstream stages need access to the original payload.
RequestEnricher — Modifies the outgoing upstream request with provider-specific headers. Receives the parsed BodyMetadata and raw body bytes. For most providers this sets Authorization: Bearer <key>. AWS Bedrock computes an AWS Signature V4 instead.
URLResolver — Determines the upstream URL from metadata. Each provider maps to its own endpoint scheme:
| Provider | URL Pattern |
|---|---|
| OpenAI | https://api.openai.com/v1/chat/completions |
| Bedrock | https://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/converse |
| Google AI | https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent |
ResponseExtractor — Parses the upstream HTTP response into ResponseMetadata (ID, model, token usage, choices). Reads and returns the raw response body bytes so they can be re-attached to the response, preserving any custom JSON fields the provider may include.
Provider — Composes the four interfaces above into a single unit:
Name() string
BodyParser() BodyParser
RequestEnricher() RequestEnricher
ResponseExtractor() ResponseExtractor
URLResolver() URLResolver
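Expressed as a Go interface, the method list above corresponds to roughly the following shape (provider.go holds the authoritative definition):

```go
type Provider interface {
	Name() string
	BodyParser() BodyParser
	RequestEnricher() RequestEnricher
	ResponseExtractor() ResponseExtractor
	URLResolver() URLResolver
}
```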
BaseProvider supplies a configurable default implementation via functional options. Providers that share the OpenAI chat completions format embed the openai_compatible base and only override what differs (name, URL, auth).
Interceptor — Wraps the request/response cycle for cross-cutting concerns. Signature:
Intercept(req, meta, rawBody, next) -> (resp, respMeta, rawRespBody, err)
The chain wraps in reverse order so that given interceptors [A, B, C], execution flows as:
A -> B -> C -> upstream -> C -> B -> A
Each interceptor can inspect/modify the request before calling next, and inspect/modify the response after next returns.
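The wrapping order can be illustrated with a small, self-contained example that uses plain functions rather than the library's actual Interceptor types:

```go
// Standalone illustration of onion-order wrapping; not the library's API.
package main

import "fmt"

type handler func(req string) string

func wrap(name string, next handler) handler {
	return func(req string) string {
		fmt.Println(name, "pre")  // inspect/modify request before next
		resp := next(req)
		fmt.Println(name, "post") // inspect/modify response after next
		return resp
	}
}

func main() {
	chain := handler(func(req string) string { return "upstream response" })
	// Wrap in reverse so that A ends up outermost: A -> B -> C -> upstream.
	for _, name := range []string{"C", "B", "A"} {
		chain = wrap(name, chain)
	}
	chain("request") // prints A, B, C pre; then C, B, A post
}
```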
Registry — Manages a collection of named providers. MapRegistry provides a thread-safe, map-based lookup by provider name.
Proxy — The main entry point. Forward(ctx, req) orchestrates the full request lifecycle. Configurable with:
- `WithInterceptor(i)` — adds an interceptor to the chain
- `WithHTTPClient(c)` — sets a custom `*http.Client` for upstream calls
AutoRouter — An HTTP handler that provides automatic provider and API type detection from a single endpoint. Implements http.Handler for easy integration.
Forward(ctx, req) -> (resp, meta, err)
ServeHTTP(w, r)
Detection Flow:
1. Parse body - Extract the model name and request structure
2. Detect provider - From the `X-Provider` header, model prefix (`openai/gpt-4`), or model pattern (`gpt-*`)
3. Strip provider prefix - If the model has a known provider prefix, strip it before forwarding
4. Detect API type - From the path (`/v1/messages`) or body+provider (`input` → Responses)
5. Route to provider - Forward to the detected provider with the correct endpoint
Configuration options:
- `WithAutoRouterRegistry(r)` — Use a custom registry
- `WithAutoRouterDetector(d)` — Custom provider detection logic
- `WithAutoRouterModelProviderLookup(lookup)` — Hook for model→provider mapping (e.g., models.dev-backed detection); called when model pattern detection fails
- `WithAutoRouterInterceptor(i)` — Add an interceptor to the chain
- `WithAutoRouterHTTPClient(c)` — Custom HTTP client
- `WithAutoRouterFallbackProvider(p)` — Provider used when detection fails
- `WithAutoRouterWebSocket(upgrader, dialer)` — Enable WebSocket mode (opt-in, see WebSocket Mode)
- `WithAutoRouterWSBillingCallback(cb)` — Per-turn billing callback for WebSocket connections
Example:
// Basic setup
router := llmproxy.NewAutoRouter(
llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
llmproxy.WithAutoRouterInterceptor(interceptors.NewLogging(logger)),
)
router.RegisterProvider(openaiProvider)
router.RegisterProvider(anthropicProvider)
http.Handle("/", router)// With models.dev-backed provider detection
adapter, _ := modelsdev.LoadFromURL()
router := llmproxy.NewAutoRouter(
llmproxy.WithAutoRouterModelProviderLookup(adapter.FindProviderForModel),
llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
)

BodyMetadata fields:

| Field | Type | Description |
|---|---|---|
| model | string | Target model identifier |
| messages | []Message | Conversation messages |
| maxTokens | int | Token generation limit |
| stream | bool | Whether to stream the response |
| custom | map[string]any | Provider-specific fields |
ResponseMetadata fields:

| Field | Type | Description |
|---|---|---|
| id | string | Response identifier |
| object | string | Object type |
| model | string | Model that generated the response |
| usage | Usage | Token counts |
| choices | []Choice | Response choices |
| custom | map[string]any | Provider-specific fields |
Message fields:

| Field | Type | Description |
|---|---|---|
| role | string | system, user, assistant, tool |
| content | string | Message content |
Usage fields:

| Field | Type | Description |
|---|---|---|
| promptTokens | int | Input tokens consumed |
| completionTokens | int | Output tokens generated |
| totalTokens | int | Sum of prompt + completion |
CacheUsage fields:

| Field | Type | Description |
|---|---|---|
| cachedTokens | int | Tokens served from cache (OpenAI, Azure) |
| cacheCreationInputTokens | int | Tokens written to cache (Anthropic) |
| cacheReadInputTokens | int | Tokens read from cache (Anthropic) |
| ephemeral5mInputTokens | int | 5-minute cache write tokens (Anthropic) |
| ephemeral1hInputTokens | int | 1-hour cache write tokens (Anthropic) |
| cacheWriteTokens | int | Tokens written to cache (Bedrock) |
| cacheDetails | []CacheDetail | TTL-based cache write breakdown (Bedrock) |
CacheDetail fields:

| Field | Type | Description |
|---|---|---|
| ttl | string | Time-to-live for cache entry (e.g., "5m", "1h") |
| cacheWriteTokens | int | Tokens written to cache at this TTL |
Choice fields:

| Field | Type | Description |
|---|---|---|
| index | int | Choice index |
| message | Message | Non-streaming response content |
| delta | Message | Streaming response content |
| finishReason | string | Why generation stopped |
CostInfo — per-model pricing in USD per 1M tokens:
| Field | Description |
|---|---|
| input | Input token cost |
| output | Output token cost |
| cacheRead | Cached input read cost |
| cacheWrite | Cached input write cost |
BillingResult fields:

| Field | Description |
|---|---|
| provider | Provider name |
| model | Model name |
| tokens | Usage breakdown |
| costs | Computed costs |
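Since prices are per 1M tokens, the cost arithmetic is a simple scaling. The sketch below uses made-up prices and token counts purely for illustration:

```go
// cost = tokens / 1_000_000 * pricePerMillionTokens
func exampleCost() float64 {
	const (
		inputPricePerM  = 2.50  // USD per 1M input tokens (example value)
		outputPricePerM = 10.00 // USD per 1M output tokens (example value)
	)
	promptTokens, completionTokens := 1200, 350

	inputCost := float64(promptTokens) / 1_000_000 * inputPricePerM       // $0.0030
	outputCost := float64(completionTokens) / 1_000_000 * outputPricePerM // $0.0035
	return inputCost + outputCost                                         // $0.0065
}
```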
The full flow through Proxy.Forward:
+------------------+
| Incoming HTTP |
| Request |
+--------+---------+
|
1. Read request body
|
+--------v---------+
| BodyParser |
| Parse -> Meta |
| + raw body bytes|
+--------+---------+
|
2. Resolve upstream URL
|
+--------v---------+
| URLResolver |
+--------+---------+
|
3. Create upstream request,
copy headers
|
4. Enrich request
|
+--------v---------+
| RequestEnricher |
| (auth headers) |
+--------+---------+
|
5. Store metadata in
request context
|
6. Execute interceptor chain
|
+--------------v--------------+
| Interceptor A |
| +- Interceptor B |
| | +- Interceptor C |
| | | +- HTTP call ---->| upstream
| | | +- response <----+|
| | +- post-process |
| +- post-process |
+- post-process |
+-----------------------------+
|
7. Extract ResponseMetadata
|
+--------v---------+
| ResponseExtractor|
+--------+---------+
|
8. Re-attach raw body
to response
|
9. Return response
+ metadata
Steps in detail:
1. Read and parse the request body into `BodyMetadata`. The raw bytes are preserved for later use.
2. Resolve the upstream URL from the metadata (model name, provider config).
3. Create a new HTTP request for the upstream provider and copy relevant headers from the original request.
4. Enrich the upstream request with provider-specific authentication (Bearer token, API key, AWS Signature V4).
5. Store `BodyMetadata` in the request context so interceptors can access it.
6. Execute the request through the interceptor chain. Each interceptor wraps the next, forming an onion-style pipeline.
7. Extract `ResponseMetadata` from the upstream response.
8. Re-attach the raw response body so custom JSON fields from the provider are preserved.
9. Return the response and metadata to the caller.
The AutoRouter enables automatic provider and API detection from a single endpoint. POST to / with any LLM request and routing happens automatically.
Detection happens in two phases:
Phase 1: Path-based detection
| Path | API Type |
|---|---|
| `/v1/chat/completions` | Chat Completions |
| `/v1/responses` | Responses |
| `/v1/completions` | Legacy Completions |
| `/v1/messages` | Anthropic Messages |
| `:generateContent` | Gemini GenerateContent |
| `/converse` | Bedrock Converse |
Phase 2: Body + Provider detection (when path is / or unknown)
| Body Field | Provider | API Type |
|---|---|---|
| `input` | any | Responses |
| `prompt` | any | Completions |
| `contents` | any | GenerateContent |
| `messages` | anthropic | Messages |
| `messages` | other | Chat Completions |
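The table translates into a small decision function; the sketch below is a simplified illustration, not the library's actual detection code (see apitype.go and detection.go):

```go
// detectAPIType mirrors the body+provider rules in the table above.
func detectAPIType(body map[string]any, provider string) string {
	switch {
	case body["input"] != nil:
		return "responses"
	case body["prompt"] != nil:
		return "completions"
	case body["contents"] != nil:
		return "generate_content"
	case body["messages"] != nil && provider == "anthropic":
		return "messages"
	case body["messages"] != nil:
		return "chat_completions"
	default:
		return "unknown"
	}
}
```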
Provider is detected in priority order:
1. `X-Provider` header — Explicit override

   curl -X POST http://localhost:8080/ \
     -H 'X-Provider: anthropic' \
     -d '{"model":"claude-3-opus",...}'

2. Model prefix — Provider prefix in the model name (stripped before forwarding)

   # Model "openai/gpt-4" routes to OpenAI, forwards "gpt-4"
   curl -X POST http://localhost:8080/ \
     -d '{"model":"anthropic/claude-3-opus",...}'

3. Model pattern — Match against known patterns

   | Pattern | Provider |
   |---|---|
   | `gpt-*`, `o1-*`, `o3-*`, `chatgpt-*` | OpenAI |
   | `claude-*` | Anthropic |
   | `gemini-*`, `gemma-*` | Google AI |
   | `grok-*` | x.AI |
   | `accounts/fireworks/*` | Fireworks |
   | `sonar*` | Perplexity |
   | `anthropic.claude-*`, `amazon.*` | Bedrock |
Only known provider prefixes are stripped:
// Stripped (known providers)
"openai/gpt-4" -> "gpt-4"
"anthropic/claude-3" -> "claude-3"
"fireworks/models/llama" -> "models/llama"
// Preserved (unknown or model-native paths)
"accounts/fireworks/models/llama" -> "accounts/fireworks/models/llama"
"some-unknown/model" -> "some-unknown/model"# Auto-detect everything - POST to /
curl -X POST http://localhost:8080/ \
-H 'Content-Type: application/json' \
-d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'
# Auto-detect Anthropic from model name
curl -X POST http://localhost:8080/ \
-H 'Content-Type: application/json' \
-d '{"model":"claude-3-opus","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'
# Auto-detect Responses API from input field
curl -X POST http://localhost:8080/ \
-H 'Content-Type: application/json' \
-d '{"model":"gpt-4o","input":"Hello"}'
# Traditional path-based routing still works
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'The proxy fully supports SSE (Server-Sent Events) streaming with efficient token usage extraction for billing.
Extends ResponseExtractor for streaming responses:
type StreamingResponseExtractor interface {
ResponseExtractor
ExtractStreamingWithController(resp, w, rc) -> (ResponseMetadata, error)
IsStreamingResponse(resp) -> bool
}
All built-in providers implement this interface for streaming support.
+------------------+
| Incoming Request |
| stream: true |
+--------+---------+
|
Parse body, detect
stream: true flag
|
+--------v---------+
| AutoRouter |
| ForwardStreaming |
+--------+---------+
|
Auto-inject
stream_options:
{include_usage:true}
|
+--------v---------+
| HTTP Request |
| to upstream |
+--------+---------+
|
+--------v---------+
| Upstream Response|
| text/event-stream|
+--------+---------+
|
+--------v---------+
| StreamingExtractor|
| Parse SSE events |
| Extract usage |
| Flush each chunk |
+--------+---------+
|
+--------v---------+
| BillingCalculator|
| Calculate cost |
+--------+---------+
|
+--------v---------+
| HTTP Response |
| to client |
+------------------+
OpenAI Chat Completions: Usage is sent in the final chunk before [DONE]:
data: {"id":"...","usage":{"prompt_tokens":100,"completion_tokens":50,"total_tokens":150}}
data: [DONE]

OpenAI Responses API: Usage is in the response.completed event. The StreamingMultiAPIExtractor automatically detects the API type from the request context and dispatches to the correct streaming extractor:
data: {"type":"response.created","response":{"id":"resp_123","model":"gpt-4o"}}
data: {"type":"response.output_text.delta","delta":"Hello"}
data: {"type":"response.completed","response":{"usage":{"input_tokens":10,"output_tokens":5,"total_tokens":15}}}
data: [DONE]

No stream_options.include_usage is needed for the Responses API — usage is always included in response.completed. The proxy automatically skips stream_options injection for Responses API requests.
Anthropic: Usage is sent in message_start and message_delta events:
data: {"type":"message_start","message":{"usage":{"input_tokens":100}}}
...
data: {"type":"message_delta","usage":{"output_tokens":50}}
data: {"type":"message_stop"}When BillingCalculator is configured and the request has stream: true, the proxy automatically injects stream_options.include_usage for OpenAI-compatible Chat Completions endpoints only:
{
"stream": true,
"stream_options": { "include_usage": true }
}

This ensures providers return token usage in their streaming responses for billing calculation.
The following are excluded from this injection because they already include usage natively:
- Responses API — usage is always present in the `response.completed` event
- Anthropic — usage is sent in `message_start` and `message_delta` events
- Bedrock and Google AI — usage is included in their streaming event formats
Uses http.ResponseController for optimal streaming:
rc := http.NewResponseController(w)
for each SSE event {
w.Write(event)
rc.Flush() // Immediate flush after each chunk
}

Non-streaming responses also use chunked read/write/flush with a 512KB buffer for better performance.
adapter, _ := modelsdev.LoadFromURL()
billingCallback := func(r llmproxy.BillingResult) {
log.Printf("Cost: $%.6f", r.TotalCost)
}
router := llmproxy.NewAutoRouter(
llmproxy.WithAutoRouterBillingCalculator(
llmproxy.NewBillingCalculator(adapter.GetCostLookup(), billingCallback),
),
)

After the stream completes, the billing callback is invoked with the extracted token usage.
The proxy supports the OpenAI Responses API WebSocket mode for persistent, multi-turn connections. This is useful for tool-call-heavy workflows where multiple response.create / response.completed cycles happen on a single connection.
The library defines abstract WebSocket interfaces — no WebSocket library is vendored. Consumers bring their own implementation (gorilla/websocket, nhooyr.io/websocket, etc.) and wire it in via thin adapters.
// WSConn abstracts a WebSocket connection.
// gorilla/websocket's *Conn satisfies this directly.
type WSConn interface {
ReadMessage() (messageType int, p []byte, err error)
WriteMessage(messageType int, data []byte) error
Close() error
}
// WSUpgrader upgrades an HTTP request to a WebSocket connection.
type WSUpgrader interface {
Upgrade(w http.ResponseWriter, r *http.Request, responseHeader http.Header) (WSConn, error)
}
// WSDialer dials a WebSocket connection to an upstream server.
type WSDialer interface {
DialContext(ctx context.Context, urlStr string, requestHeader http.Header) (WSConn, *http.Response, error)
}
// WebSocketCapableProvider is implemented by providers that support WebSocket mode.
type WebSocketCapableProvider interface {
Provider
WebSocketURL(meta BodyMetadata) (*url.URL, error)
}

The OpenAI provider implements WebSocketCapableProvider. Other providers can opt in by implementing the same interface.
gorilla's *websocket.Conn already satisfies WSConn — no wrapper needed. You only need thin adapters for Upgrader and Dialer because their return types differ (*websocket.Conn vs WSConn):
package myadapter
import (
"context"
"net/http"
"github.com/agentuity/llmproxy"
"github.com/gorilla/websocket"
)
// GorillaUpgrader wraps gorilla's Upgrader to satisfy llmproxy.WSUpgrader.
type GorillaUpgrader struct {
Upgrader websocket.Upgrader
}
func (u *GorillaUpgrader) Upgrade(w http.ResponseWriter, r *http.Request, h http.Header) (llmproxy.WSConn, error) {
conn, err := u.Upgrader.Upgrade(w, r, h)
if err != nil {
return nil, err
}
return conn, nil // *websocket.Conn satisfies WSConn
}
// GorillaDialer wraps gorilla's Dialer to satisfy llmproxy.WSDialer.
type GorillaDialer struct {
Dialer websocket.Dialer
}
func (d *GorillaDialer) DialContext(ctx context.Context, urlStr string, h http.Header) (llmproxy.WSConn, *http.Response, error) {
conn, resp, err := d.Dialer.DialContext(ctx, urlStr, h)
if err != nil {
return nil, resp, err
}
return conn, resp, nil // *websocket.Conn satisfies WSConn
}

router := llmproxy.NewAutoRouter(
llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
// In production, configure CheckOrigin to validate against trusted origins.
llmproxy.WithAutoRouterWebSocket(
&myadapter.GorillaUpgrader{
Upgrader: websocket.Upgrader{
CheckOrigin: func(r *http.Request) bool {
// Validate r.Header.Get("Origin") against a whitelist
return isAllowedOrigin(r)
},
},
},
&myadapter.GorillaDialer{
Dialer: websocket.Dialer{},
},
),
llmproxy.WithAutoRouterWSBillingCallback(func(turn int, meta llmproxy.ResponseMetadata, billing *llmproxy.BillingResult) {
log.Printf("Turn %d: model=%s prompt=%d completion=%d",
turn, meta.Model, meta.Usage.PromptTokens, meta.Usage.CompletionTokens)
if billing != nil {
log.Printf(" Cost: $%.6f", billing.TotalCost)
}
}),
)
router.RegisterProvider(openaiProvider)
http.Handle("/", router)WebSocket support is opt-in — if WithAutoRouterWebSocket is not called, WebSocket upgrade requests are rejected and normal HTTP handling is unchanged.
The OpenAI Responses API WebSocket mode operates at wss://api.openai.com/v1/responses:
| Aspect | Detail |
|---|---|
| Client sends | {"type":"response.create","model":"gpt-4o","input":[...]} |
| Server sends | Same events as SSE: response.created, response.output_text.delta, response.completed, etc. |
| Multi-turn | Client sends new response.create with previous_response_id on same connection |
| Usage | In response.completed → response.usage.{input_tokens, output_tokens, total_tokens} |
| Frames | JSON text frames — no data: prefix (unlike SSE) |
+------------------+ +------------------+ +------------------+
| Client | | AutoRouter | | Upstream |
| | | | | (OpenAI) |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
1. WS Upgrade --------> 2. Accept upgrade |
| 3. Read first message |
| (response.create) |
| 4. Detect provider/model |
| 5. Strip model prefix |
| 6. Enrich headers (auth) |
| 7. Dial upstream WS --------> 8. Accept
| 9. Forward first msg --------> |
| | |
| === Bidirectional Relay === |
| | |
10. response.create -----> Strip prefix ------------> response.create
| | |
| | <----------- response.created
response.created <------ | |
| | <----------- response.output_text.delta
text delta <---------------- | |
| | <----------- response.completed
response.completed <---- Extract usage, (includes usage)
| compute billing |
| | |
11. response.create -----> Strip prefix ------------> (next turn)
| ... ...
| | |
12. Close ----------------> Close both ------------> Close
Billing is calculated per turn — each response.completed event triggers:
- Usage extraction (`input_tokens`, `output_tokens`, `total_tokens`, cache details)
- Cost calculation via `BillingCalculator` (if configured)
- `WSBillingCallback` invocation with the turn number, response metadata, and billing result
For multi-turn connections, each turn is billed independently. The callback receives an incrementing turn counter so consumers can aggregate if needed.
Works the same as HTTP mode — openai/gpt-4o is stripped to gpt-4o in all response.create messages. Non-response.create messages are forwarded byte-for-byte without modification.
- Both relay goroutines use a `sync.Once`-guarded close — when either side closes, the other is closed immediately
- WebSocket close errors (`io.EOF`, "connection closed") are treated as normal termination, not errors
- The `ForwardWebSocket` method blocks until both relay goroutines complete
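A sketch of that shutdown pattern; `copyMessages` and `relayBoth` are illustrative names rather than the library's internals, but they use the `WSConn` interface shown earlier:

```go
package relay

import (
	"sync"

	"github.com/agentuity/llmproxy"
)

// copyMessages pumps frames from src to dst until either side fails.
// Read/write errors (including io.EOF on close) end the loop quietly.
func copyMessages(src, dst llmproxy.WSConn) {
	for {
		mt, msg, err := src.ReadMessage()
		if err != nil {
			return
		}
		if err := dst.WriteMessage(mt, msg); err != nil {
			return
		}
	}
}

// relayBoth mirrors the behavior described above: a sync.Once-guarded close so
// that when either side terminates both connections are closed, and the call
// blocks until both relay goroutines finish.
func relayBoth(client, upstream llmproxy.WSConn) {
	var closeOnce sync.Once
	closeBoth := func() {
		closeOnce.Do(func() {
			client.Close()
			upstream.Close()
		})
	}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); defer closeBoth(); copyMessages(client, upstream) }()
	go func() { defer wg.Done(); defer closeBoth(); copyMessages(upstream, client) }()
	wg.Wait()
}
```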
Nine providers are included. Six share the OpenAI-compatible base; three have fully custom implementations.
providers/openai_compatible implements BodyParser, ResponseExtractor, URLResolver, and RequestEnricher for the OpenAI chat completions format. Providers that speak this protocol embed the base and override only what differs (name, base URL, auth configuration).
The OpenAI provider also supports the Responses API (/v1/responses) with automatic detection based on the input field in the request body.
| Provider | Package | Auth | API Format | Notes |
|---|---|---|---|---|
| OpenAI | providers/openai | Bearer token | Chat completions, Responses | Supports both APIs with auto-detection |
| Anthropic | providers/anthropic | x-api-key header + anthropic-version | Anthropic Messages API | Custom parser/extractor |
| Groq | providers/groq | Bearer token | OpenAI-compatible | Wraps openai_compatible |
| Fireworks | providers/fireworks | Bearer token | OpenAI-compatible | Wraps openai_compatible |
| x.AI | providers/xai | Bearer token | OpenAI-compatible | Wraps openai_compatible |
| Google AI | providers/googleai | API key in URL query param | Gemini generateContent | Custom parser/extractor/resolver |
| AWS Bedrock | providers/bedrock | AWS Signature V4 | Converse API | Custom everything; supports both Converse and InvokeModel |
| Azure OpenAI | providers/azure | api-key header or Azure AD Bearer token | OpenAI chat completions | Uses deployments instead of models |
| OpenAI-compatible | providers/openai_compatible | Bearer token | OpenAI chat completions | Reusable base for OpenAI-compatible providers |
OpenAI — Wraps openai_compatible with support for multiple APIs:
- Chat Completions (`/v1/chat/completions`) — Standard messages-based API
- Responses (`/v1/responses`) — Newer API with `input` field and built-in tools support. Supports both HTTP (SSE streaming) and WebSocket modes.
- Legacy Completions (`/v1/completions`) — Older prompt-based API

The provider auto-detects the API type from the request body:

- `input` field → Responses API
- `prompt` field → Completions API
- `messages` field → Chat Completions API
The OpenAI provider implements WebSocketCapableProvider, enabling persistent WebSocket connections for multi-turn Responses API workflows when WithAutoRouterWebSocket is configured.
Anthropic — Custom body parser translates between the proxy's canonical format and Anthropic's Messages API. Custom extractor maps Anthropic's response shape (content blocks, stop_reason) back to ResponseMetadata. Auth uses the x-api-key header alongside an anthropic-version header.
Groq, Fireworks, x.AI — Each wraps openai_compatible with its own base URL and provider name. No custom parsing or extraction logic needed.
Google AI — Custom body parser converts to the Gemini generateContent format (contents/parts). Custom URL resolver appends the API key as a query parameter rather than using a header. Custom extractor maps Gemini's response (candidates/content/parts) back to ResponseMetadata.
AWS Bedrock — Fully custom implementation. The body parser converts to the Bedrock Converse API format. The enricher computes AWS Signature V4 signing using region, access key, and secret key. The URL resolver constructs region-specific endpoints. Supports both the Converse and InvokeModel API paths.
Azure OpenAI — Uses deployments instead of direct model access. Two construction modes:
- `New(resourceName, deploymentID, apiVersion, opts...)` — Fixed deployment; uses the configured deployment ID for all requests
- `NewWithDynamicDeployment(resourceName, apiVersion, opts...)` — Dynamic deployment; uses the model field from each request as the deployment name
Authentication via functional options:
- `WithAPIKey(apiKey)` — Sets the `api-key` header
- `WithAzureADToken(token)` — Sets the `Authorization: Bearer` header
- `WithAzureADTokenRefresher(fn)` — Token refresh callback for expiring Azure AD tokens
URL format: https://{resource}.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version={version}
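A construction sketch for both modes; whether these constructors also return an error is not shown above, so that is omitted here, and the key and token values are placeholders:

```go
// Fixed deployment: every request targets the "gpt-4o" deployment.
fixed := azure.New("my-resource", "gpt-4o", "2024-06-01",
	azure.WithAPIKey(os.Getenv("AZURE_OPENAI_API_KEY")),
)
fixedProxy := llmproxy.NewProxy(fixed)

// Dynamic deployment: the request's model field selects the deployment.
dynamic := azure.NewWithDynamicDeployment("my-resource", "2024-06-01",
	azure.WithAzureADToken(azureADToken), // azureADToken obtained elsewhere
)
dynamicProxy := llmproxy.NewProxy(dynamic)
```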
Eight built-in interceptors are provided in the interceptors/ package.
NewLogging(logger) — Logs each request/response cycle with:
- Model name
- HTTP method and URL
- Response status
- Latency
- Token usage (prompt, completion, total)
Accepts an llmproxy.Logger interface, which is compatible with github.com/agentuity/go-common/logger without requiring the dependency.
NewMetrics(m) — Tracks aggregate statistics:
| Field | Description |
|---|---|
| TotalRequests | Number of requests processed |
| TotalTokens | Total tokens consumed |
| TotalPromptTokens | Total input tokens |
| TotalCompletionTokens | Total output tokens |
| TotalLatency | Cumulative request duration |
| Errors | Number of failed requests |
All fields use sync/atomic operations for thread safety. The Metrics struct can be read concurrently from monitoring goroutines.
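A sketch of wiring the metrics interceptor and polling it from a monitoring goroutine; whether the counters are plain int64 fields read via sync/atomic (as assumed below) or atomic.Int64 values depends on the actual struct definition:

```go
m := &interceptors.Metrics{}
proxy := llmproxy.NewProxy(provider,
	llmproxy.WithInterceptor(interceptors.NewMetrics(m)),
)

// Poll counters concurrently; adjust the reads to the real field types.
go func() {
	for range time.Tick(30 * time.Second) {
		log.Printf("requests=%d errors=%d total_tokens=%d",
			atomic.LoadInt64(&m.TotalRequests),
			atomic.LoadInt64(&m.Errors),
			atomic.LoadInt64(&m.TotalTokens),
		)
	}
}()
```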
NewRetry(maxAttempts, delay) — Retries failed requests under specific conditions:
- Retries on: HTTP 429 (rate limit) and 5xx (server errors)
- Does NOT retry: `context.Canceled`, `context.DeadlineExceeded`
- Body handling: Reconstructs the request body from raw bytes on each retry attempt
- Custom predicate: `NewRetryWithPredicate(maxAttempts, delay, predicate)` allows callers to supply a custom function that decides whether a given response should be retried (see the sketch below)
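A sketch of the custom-predicate form; the exact predicate signature is not documented here, so the (resp, err) shape below is an assumption:

```go
// Retry on 429 and 5xx responses and on transport-level errors.
shouldRetry := func(resp *http.Response, err error) bool {
	if err != nil {
		return true
	}
	return resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500
}

retry := interceptors.NewRetryWithPredicate(3, 500*time.Millisecond, shouldRetry)
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(retry))
```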
NewRetryWithRateLimitHeaders(maxAttempts, defaultDelay) — Retries with rate limit header support:
- Retry-After header: Parses both seconds (integer) and HTTP date formats
- X-RateLimit-Reset header: Fallback if Retry-After not present
- Max delay: Values over 24 hours are ignored (fallback to defaultDelay); exactly 24 hours is accepted
- Precedence: Retry-After takes precedence over X-RateLimit-Reset
Example:
// Use rate limit headers from provider
retry := interceptors.NewRetryWithRateLimitHeaders(3, time.Second)
// If provider returns 429 with Retry-After: 30, waits 30s instead of 1s

NewBilling(lookup, onResult) — Calculates the cost of each request:
- Uses a `CostLookup` function to retrieve per-model pricing
- Detects the provider from the model name
- Computes input/output/cache costs based on token usage
- Calls the `onResult` callback with a `BillingResult` after each request
- Stores the `BillingResult` in `ResponseMetadata.Custom["billing_result"]` for downstream access
When using AutoRouter, billing results are automatically added as response headers:
- `X-Gateway-Cost` — Total cost in USD
- `X-Gateway-Prompt-Tokens` — Input token count
- `X-Gateway-Completion-Tokens` — Output token count
NewTracing(extractor) — Propagates OpenTelemetry trace context:
- Upstream headers:
  - `X-Request-ID`: the trace ID (32 hex chars)
  - `traceparent`: W3C Trace Context format (`00-{traceid}-{spanid}-{flags}`)
- Response header: `X-Request-ID` for correlation
- Extractor function: Pulls trace info from the request context
The extractor function signature:
func(ctx context.Context) TraceInfo
For OpenTelemetry:
func otelExtractor(ctx context.Context) interceptors.TraceInfo {
span := trace.SpanFromContext(ctx)
if !span.SpanContext().IsValid() {
return interceptors.TraceInfo{}
}
return interceptors.TraceInfo{
TraceID: span.SpanContext().TraceID(),
SpanID: span.SpanContext().SpanID(),
Sampled: span.SpanContext().IsSampled(),
}
}

Use NewTracingWithHeader(extractor, headerName) to customize the response header name.
NewHeaderBan(requestHeaders, responseHeaders) — Strips specified headers:
- Request headers: Removed before forwarding upstream
- Response headers: Removed before returning to caller
- Case-insensitive: HTTP header matching is case-insensitive
Convenience constructors:
- `NewResponseHeaderBan(headers...)` — Strip only response headers
- `NewRequestHeaderBan(headers...)` — Strip only request headers
Example:
ban := interceptors.NewResponseHeaderBan("Openai-Organization", "Openai-Project")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(ban))

NewAddHeader(requestHeaders, responseHeaders) — Adds custom headers to requests and responses:
- Request headers: Added before forwarding upstream
- Response headers: Added before returning to caller
Convenience constructors:
- `NewAddResponseHeader(headers...)` — Add only response headers
- `NewAddRequestHeader(headers...)` — Add only request headers
Example:
add := interceptors.NewAddResponseHeader(
interceptors.NewHeader("X-Gateway-Version", "1.0"),
interceptors.NewHeader("X-Served-By", "llmproxy"),
)
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(add))

Provider-specific prompt caching interceptors for Anthropic, OpenAI, xAI, Fireworks, and AWS Bedrock.
- Cache-Control header: If the incoming request has `Cache-Control: no-cache`, the interceptor skips entirely — letting clients disable caching per-request
- Provider detection: Only applies to matching models:
  - Anthropic: `claude-*`
  - OpenAI: `gpt-*`, `o1-*`, `o3-*`, `o4-*`, `chatgpt-*`
  - xAI: `grok-*`
  - Fireworks: `accounts/fireworks/*`, `fireworks*`
  - Bedrock: `anthropic.claude-*`, `amazon.nova-*`, `amazon.titan-*`
- Cache usage tracking: Response metadata includes `CacheUsage` in `Custom["cache_usage"]`
NewAnthropicPromptCaching(retention) — Enables Anthropic prompt caching:
- Automatic caching: Adds `cache_control` at the top level of requests
- Retention options:
  - `CacheRetentionDefault` (default, 5 min) — no TTL field, free, auto-refreshed on use
  - `CacheRetention1h` — adds `ttl: "1h"`, costs more, longer cache lifetime
- User-controlled caching: If the request already has `cache_control`, the interceptor skips entirely — letting you control caching explicitly via block-level breakpoints
Example:
// Enable prompt caching for Anthropic (default 5 min, free)
caching := interceptors.NewAnthropicPromptCaching(interceptors.CacheRetentionDefault)
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))
// With 1h retention (costs more) and cache usage callback
caching := interceptors.NewAnthropicPromptCachingWithResult(interceptors.CacheRetention1h, func(u llmproxy.CacheUsage) {
log.Printf("Cache read: %d tokens, Cache write: %d tokens", u.CacheReadInputTokens, u.CacheCreationInputTokens)
})

NewOpenAIPromptCaching(retention, cacheKey) — Enables OpenAI prompt caching:
- Automatic caching: OpenAI caches prompts ≥ 1024 tokens automatically
- Cache routing: Adds `prompt_cache_key` to improve cache hit rates for requests with common prefixes
- Retention options:
  - `CacheRetentionDefault` (default, in-memory, 5-10 min) — no retention field
  - `CacheRetention24h` — adds `prompt_cache_retention: "24h"` for GPT-5.x and GPT-4.1
- Cache key sources (in priority order):
  1. `X-Cache-Key` header from the incoming request
  2. Configured `CacheKey` in `PromptCachingConfig`
  3. Auto-derived from the static content prefix via `DeriveCacheKeyFromPrefix()`
- Tenant namespacing: Cache keys are automatically prefixed with the org/tenant ID from:
  1. Custom `OrgIDExtractor` function
  2. `OrgID` in `MetaContextValue` stored in the request context
  3. `X-Org-ID` header
  4. `org_id` in `BodyMetadata.Custom`
  5. Configured `Namespace` fallback
Example:
// Enable prompt caching for OpenAI with a cache key (default retention)
caching := interceptors.NewOpenAIPromptCaching(interceptors.CacheRetentionDefault, "my-app-session-123")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))
// With 24h retention and cache usage callback
caching := interceptors.NewOpenAIPromptCachingWithResult(interceptors.CacheRetention24h, "my-key", func(u llmproxy.CacheUsage) {
log.Printf("Cached tokens: %d", u.CachedTokens)
})
// Auto-derive cache key from static content, namespace by tenant
caching := interceptors.NewOpenAIPromptCachingAuto("tenant-123", interceptors.CacheRetentionDefault)
// Custom org ID extractor (e.g., from auth context)
caching := interceptors.NewOpenAIPromptCachingWithOrgExtractor(
interceptors.CacheRetentionDefault,
"my-key",
func(ctx context.Context, req *http.Request, meta llmproxy.BodyMetadata) string {
return getOrgFromAuthContext(ctx)
},
)

NewXAIPromptCaching(convID) — Enables xAI/Grok prompt caching:
- Automatic prefix caching: xAI caches from the start of the messages array automatically
- Cache routing: Adds the `x-grok-conv-id` HTTP header to route requests to the same server where the cache lives
- Conversation ID: Use a stable value (conversation ID, session ID, or deterministic hash of static content)
- Key rule: Never reorder or modify earlier messages — only append
Example:
// Enable prompt caching for xAI with a conversation ID
caching := interceptors.NewXAIPromptCaching("conv-abc123-tenant456")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))
// With cache usage callback
caching := interceptors.NewXAIPromptCachingWithResult("my-conv-id", func(u llmproxy.CacheUsage) {
log.Printf("Cached tokens: %d", u.CachedTokens)
})

NewFireworksPromptCaching(sessionID) — Enables Fireworks prompt caching:
- Automatic caching: Fireworks caches prompts with shared prefixes automatically (enabled by default)
- Cache routing: Adds the `x-session-affinity` HTTP header to route requests to the same replica
- Tenant isolation: Adds the `x-prompt-cache-isolation-key` header set to the org/tenant ID for multi-tenant isolation
- Cache usage: Reads the `fireworks-cached-prompt-tokens` response header for cache hit tracking
Example:
// Enable prompt caching for Fireworks with session affinity
caching := interceptors.NewFireworksPromptCaching("session-abc123")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))
// With org ID extractor for tenant isolation
caching := interceptors.NewFireworksPromptCachingWithOrgExtractor("session-abc123", func(ctx context.Context, req *http.Request, meta llmproxy.BodyMetadata) string {
return getOrgFromAuthContext(ctx)
})
// With cache usage callback
caching := interceptors.NewFireworksPromptCachingWithResult("session-abc123", func(u llmproxy.CacheUsage) {
log.Printf("Cached tokens: %d", u.CachedTokens)
})

NewBedrockPromptCaching(retention) — Enables AWS Bedrock prompt caching via the Converse API:
- Cache checkpoints: Adds `cachePoint` objects to system, messages, and toolConfig
- Retention options:
  - `CacheRetentionDefault` (default, 5 min) — no TTL field
  - `CacheRetention1h` — adds `ttl: "1h"` for Claude Opus 4.5, Haiku 4.5, and Sonnet 4.5
- Minimum tokens: 1,024 tokens per cache checkpoint (varies by model)
- Maximum checkpoints: 4 per request
- Supported models: Claude models (`anthropic.claude-*`), Nova models (`amazon.nova-*`), Titan models (`amazon.titan-*`)
- Cache usage: Reads `cacheReadInputTokens`, `cacheWriteInputTokens`, and `cacheDetails` from the response
Example:
// Enable prompt caching for Bedrock (default 5 min)
caching := interceptors.NewBedrockPromptCaching(interceptors.CacheRetentionDefault)
proxy := llmproxy.NewProxy(bedrockProvider, llmproxy.WithInterceptor(caching))
// With 1h retention for Claude Opus 4.5
caching := interceptors.NewBedrockPromptCaching(interceptors.CacheRetention1h)
// With cache usage callback
caching := interceptors.NewBedrockPromptCachingWithResult(interceptors.CacheRetentionDefault, func(u llmproxy.CacheUsage) {
log.Printf("Cache read: %d tokens, Cache write: %d tokens", u.CachedTokens, u.CacheWriteTokens)
for _, detail := range u.CacheDetails {
log.Printf(" TTL %s: %d tokens written", detail.TTL, detail.CacheWriteTokens)
}
})

Azure OpenAI uses the same prompt_cache_key body parameter as OpenAI. Use the OpenAI interceptor for Azure OpenAI:
// Azure OpenAI prompt caching uses the OpenAI interceptor
caching := interceptors.NewOpenAIPromptCaching(interceptors.CacheRetentionDefault, "my-cache-key")
proxy := llmproxy.NewProxy(azureProvider, llmproxy.WithInterceptor(caching))

Note: Azure OpenAI caches prompts ≥ 1,024 tokens automatically. The prompt_cache_key parameter is combined with the prefix hash to improve cache hit rates. Cache hits appear as cached_tokens in prompt_tokens_details in the response.
NewPromptCaching(provider, config) — Creates a caching interceptor for any provider:
// Anthropic with 1h retention
caching := interceptors.NewPromptCaching("anthropic", interceptors.PromptCachingConfig{
Enabled: true,
Retention: interceptors.CacheRetention1h,
})
// OpenAI with 24h retention
caching := interceptors.NewPromptCaching("openai", interceptors.PromptCachingConfig{
Enabled: true,
Retention: interceptors.CacheRetention24h,
CacheKey: "my-cache-key",
})
// xAI with conversation ID
caching := interceptors.NewPromptCaching("xai", interceptors.PromptCachingConfig{
Enabled: true,
CacheKey: "my-conv-id",
})
// Fireworks with session ID and org extractor
caching := interceptors.NewPromptCaching("fireworks", interceptors.PromptCachingConfig{
Enabled: true,
CacheKey: "my-session-id",
OrgIDExtractor: interceptors.DefaultOrgIDExtractor,
})
// Bedrock with 1h retention
caching := interceptors.NewPromptCaching("bedrock", interceptors.PromptCachingConfig{
Enabled: true,
Retention: interceptors.CacheRetention1h,
})

pricing/modelsdev/ provides an adapter that loads pricing data from models.dev.
Three source options:
- `LoadFromFile(path)` — Load from a local JSON file
- `LoadFromURL()` — Load from the default models.dev URL
- Custom URL via the `WithURL(url)` option

Additional options:

- `WithMarkup(multiplier)` — Apply a markup for reselling (e.g., `1.2` = 20% markup on all prices)
adapter, _ := modelsdev.LoadFromURL()
lookup := adapter.GetCostLookup()
billing := interceptors.NewBilling(lookup, func(result llmproxy.BillingResult) {
// handle billing result
})
GetCostLookup() returns a CostLookup function suitable for the billing interceptor.
The adapter supports TTL-based refresh so pricing data stays current without restarting the process.
type Logger interface {
Debug(msg string, args ...interface{})
Info(msg string, args ...interface{})
Warn(msg string, args ...interface{})
Error(msg string, args ...interface{})
}

Matches the signature of github.com/agentuity/go-common/logger without requiring it as a dependency. Any logger implementing these four methods works.
LoggerFunc is an adapter that wraps a plain function as a Logger, useful for testing or simple setups.
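For example, a minimal Logger backed by the standard library log package might look like the sketch below; whether the variadic args are printf-style is an assumption here:

```go
// stdLogger satisfies llmproxy.Logger with the standard library logger.
type stdLogger struct{}

func (stdLogger) Debug(msg string, args ...interface{}) { log.Printf("DEBUG "+msg, args...) }
func (stdLogger) Info(msg string, args ...interface{})  { log.Printf("INFO "+msg, args...) }
func (stdLogger) Warn(msg string, args ...interface{})  { log.Printf("WARN "+msg, args...) }
func (stdLogger) Error(msg string, args ...interface{}) { log.Printf("ERROR "+msg, args...) }

// Usable anywhere a Logger is accepted, e.g. interceptors.NewLogging(stdLogger{}).
```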
- Small, focused interfaces — Each interface (`BodyParser`, `RequestEnricher`, `URLResolver`, `ResponseExtractor`) does exactly one thing. This keeps implementations simple and testable.
- Composition over inheritance — Providers compose the four core interfaces rather than inheriting from a base class. `BaseProvider` wires them together via functional options. OpenAI-compatible providers embed a shared base and override only what differs.
- Raw body preservation — Both request and response bodies are preserved as raw bytes throughout the lifecycle. This avoids losing custom JSON fields that providers may include but that the library's typed structs don't model.
- Function-based lookup — `CostLookup` is a function type, not an interface with a concrete implementation. This allows callers to manage pricing data externally (database, config file, remote service) without coupling to a specific source.
- Interceptor chain — Cross-cutting concerns (logging, metrics, retry, billing) wrap the request lifecycle without modifying providers. Interceptors compose independently and execute in a predictable onion order.
llmproxy/
├── apitype.go # API type detection and constants
├── autorouter.go # AutoRouter, provider/API auto-detection, streaming
├── autorouter_websocket.go # WebSocket forwarding, bidirectional relay
├── billing.go # CostInfo, CostLookup, BillingResult, CalculateCost
├── billing_calculator.go # BillingCalculator for streaming/non-streaming
├── detection.go # Provider detection from model/header
├── enricher.go # RequestEnricher interface
├── extractor.go # ResponseExtractor, StreamingResponseExtractor interface
├── interceptor.go # Interceptor, InterceptorChain, RoundTripFunc
├── internal/
│ └── fastjson/
│ └── extractor.go # Fast JSON parsing with simdjson-go
├── logger.go # Logger interface, LoggerFunc adapter
├── metadata.go # BodyMetadata, ResponseMetadata, Message, Usage, Choice
├── parser.go # BodyParser interface
├── provider.go # Provider interface, BaseProvider
├── proxy.go # Proxy struct, Forward method
├── registry.go # Registry interface, MapRegistry
├── resolver.go # URLResolver interface
├── streaming.go # SSE parser, streaming types, usage extraction
├── websocket.go # WebSocket interfaces, message parsing, usage extraction
├── interceptors/
│ ├── addheader.go # AddHeaderInterceptor
│ ├── billing.go # BillingInterceptor
│ ├── headerban.go # HeaderBanInterceptor
│ ├── logging.go # LoggingInterceptor
│ ├── metrics.go # MetricsInterceptor, Metrics
│ ├── promptcaching.go # PromptCachingInterceptor
│ ├── retry.go # RetryInterceptor
│ └── tracing.go # TracingInterceptor
├── pricing/
│ └── modelsdev/
│ └── adapter.go # models.dev pricing adapter
├── providers/
│ ├── anthropic/ # Anthropic Messages API
│ │ └── streaming_extractor.go # Anthropic SSE streaming
│ ├── azure/ # Azure OpenAI
│ ├── bedrock/ # AWS Bedrock Converse API
│ │ └── streaming_extractor.go # Bedrock streaming
│ ├── fireworks/ # Fireworks (OpenAI-compatible)
│ ├── googleai/ # Google AI Gemini
│ │ └── streaming_extractor.go # Google AI streaming
│ ├── groq/ # Groq (OpenAI-compatible)
│ ├── openai/ # OpenAI (Chat Completions + Responses + WebSocket)
│ ├── openai_compatible/ # Base for OpenAI-compatible providers
│ │ ├── multiapi.go # Multi-API parser/extractor dispatch
│ │ ├── streaming_extractor.go # Chat Completions SSE streaming
│ │ ├── responses_parser.go # Responses API parser
│ │ ├── responses_extractor.go # Responses API extractor
│ │ ├── responses_streaming_extractor.go # Responses API SSE streaming
│ │ └── websocket.go # WebSocket URL resolver
│ └── xai/ # x.AI (OpenAI-compatible)
└── examples/
└── basic/ # Multi-provider proxy server example (uses AutoRouter)