
DESIGN

Module: github.com/agentuity/llmproxy

A pluggable, composable library for proxying requests to upstream LLM providers. The core lifecycle follows a five-stage pipeline:

Parse --> Enrich --> Resolve --> Forward --> Extract

Callers construct a Proxy with a Provider and optional Interceptor chain, then call Forward(ctx, req) to proxy a single request through the full lifecycle.


Architecture

Core Interfaces

BodyParser

Extracts BodyMetadata from a raw HTTP request body. Returns structured metadata (model name, messages, max tokens, stream flag, custom fields) alongside the raw body bytes. The raw bytes are returned because http.Request.Body is a ReadCloser that can only be read once — downstream stages need access to the original payload.

RequestEnricher

Modifies the outgoing upstream request with provider-specific headers. Receives the parsed BodyMetadata and raw body bytes. For most providers this sets Authorization: Bearer <key>. AWS Bedrock computes an AWS Signature V4 instead.

URLResolver

Determines the upstream URL from metadata. Each provider maps to its own endpoint scheme:

Provider URL Pattern
OpenAI https://api.openai.com/v1/chat/completions
Bedrock https://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/converse
Google AI https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent

ResponseExtractor

Parses the upstream HTTP response into ResponseMetadata (ID, model, token usage, choices). Reads and returns the raw response body bytes so they can be re-attached to the response, preserving any custom JSON fields the provider may include.

Provider

Composes the four interfaces above into a single unit:

Name() string
BodyParser() BodyParser
RequestEnricher() RequestEnricher
ResponseExtractor() ResponseExtractor
URLResolver() URLResolver

BaseProvider supplies a configurable default implementation via functional options. Providers that share the OpenAI chat completions format embed the openai_compatible base and only override what differs (name, URL, auth).

Interceptor

Wraps the request/response cycle for cross-cutting concerns. Signature:

Intercept(req, meta, rawBody, next) -> (resp, respMeta, rawRespBody, err)

The chain wraps in reverse order so that given interceptors [A, B, C], execution flows as:

A -> B -> C -> upstream -> C -> B -> A

Each interceptor can inspect/modify the request before calling next, and inspect/modify the response after next returns.

Registry

Manages a collection of named providers. MapRegistry provides a thread-safe, map-based lookup by provider name.

Proxy

The main entry point. Forward(ctx, req) orchestrates the full request lifecycle. Configurable with:

  • WithInterceptor(i) — adds an interceptor to the chain
  • WithHTTPClient(c) — sets a custom *http.Client for upstream calls

AutoRouter

An HTTP handler that provides automatic provider and API type detection from a single endpoint. Implements http.Handler for easy integration.

Forward(ctx, req) -> (resp, meta, err)
ServeHTTP(w, r)

Detection Flow:

  1. Parse body - Extract model name and request structure
  2. Detect provider - From X-Provider header, model prefix (openai/gpt-4), or model pattern (gpt-*)
  3. Strip provider prefix - If model has known provider prefix, strip before forwarding
  4. Detect API type - From path (/v1/messages) or body+provider (input → Responses)
  5. Route to provider - Forward to detected provider with correct endpoint

Configuration options:

  • WithAutoRouterRegistry(r) — Use custom registry
  • WithAutoRouterDetector(d) — Custom provider detection logic
  • WithAutoRouterModelProviderLookup(lookup) — Hook for model→provider mapping (e.g., models.dev-backed detection); called when model pattern detection fails
  • WithAutoRouterInterceptor(i) — Add interceptor to chain
  • WithAutoRouterHTTPClient(c) — Custom HTTP client
  • WithAutoRouterFallbackProvider(p) — Provider when detection fails
  • WithAutoRouterWebSocket(upgrader, dialer) — Enable WebSocket mode (opt-in, see WebSocket Mode)
  • WithAutoRouterWSBillingCallback(cb) — Per-turn billing callback for WebSocket connections

Example:

// Basic setup
router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
    llmproxy.WithAutoRouterInterceptor(interceptors.NewLogging(logger)),
)
router.RegisterProvider(openaiProvider)
router.RegisterProvider(anthropicProvider)

http.Handle("/", router)

// With models.dev-backed provider detection
adapter, _ := modelsdev.LoadFromURL()
router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterModelProviderLookup(adapter.FindProviderForModel),
    llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
)

Data Types

BodyMetadata

Field Type Description
model string Target model identifier
messages []Message Conversation messages
maxTokens int Token generation limit
stream bool Whether to stream the response
custom map[string]any Provider-specific fields

ResponseMetadata

Field Type Description
id string Response identifier
object string Object type
model string Model that generated the response
usage Usage Token counts
choices []Choice Response choices
custom map[string]any Provider-specific fields

Message

Field Type Description
role string system, user, assistant, tool
content string Message content

Usage

Field Type Description
promptTokens int Input tokens consumed
completionTokens int Output tokens generated
totalTokens int Sum of prompt + completion

CacheUsage

Field Type Description
cachedTokens int Tokens served from cache (OpenAI, Azure)
cacheCreationInputTokens int Tokens written to cache (Anthropic)
cacheReadInputTokens int Tokens read from cache (Anthropic)
ephemeral5mInputTokens int 5-minute cache write tokens (Anthropic)
ephemeral1hInputTokens int 1-hour cache write tokens (Anthropic)
cacheWriteTokens int Tokens written to cache (Bedrock)
cacheDetails []CacheDetail TTL-based cache write breakdown (Bedrock)

CacheDetail

Field Type Description
ttl string Time-to-live for cache entry (e.g., "5m", "1h")
cacheWriteTokens int Tokens written to cache at this TTL

Choice

Field Type Description
index int Choice index
message Message Non-streaming response content
delta Message Streaming response content
finishReason string Why generation stopped

CostInfo

Per-model pricing in USD per 1M tokens:

Field Description
input Input token cost
output Output token cost
cacheRead Cached input read cost
cacheWrite Cached input write cost
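A worked example of how per-1M-token pricing translates to a per-request cost. The struct fields mirror the table above; the exported names and the prices are illustrative assumptions:

```go
package main

import "fmt"

// CostInfo mirrors the table above: USD per 1M tokens.
// Field names here are assumed exports of the lowercase fields listed.
type CostInfo struct {
	Input, Output, CacheRead, CacheWrite float64
}

// cost computes the dollar cost of one request from its token counts.
func cost(c CostInfo, promptTokens, completionTokens int) float64 {
	return float64(promptTokens)*c.Input/1e6 +
		float64(completionTokens)*c.Output/1e6
}

func main() {
	// Illustrative pricing, not real provider rates.
	pricing := CostInfo{Input: 2.50, Output: 10.00}
	fmt.Printf("$%.6f\n", cost(pricing, 1000, 500))
	// → $0.007500
}
```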

BillingResult

Field Description
provider Provider name
model Model name
tokens Usage breakdown
costs Computed costs

Request Lifecycle

The full flow through Proxy.Forward:

                    +------------------+
                    |  Incoming HTTP   |
                    |    Request       |
                    +--------+---------+
                             |
                    1. Read request body
                             |
                    +--------v---------+
                    |   BodyParser     |
                    |  Parse -> Meta   |
                    |  + raw body bytes|
                    +--------+---------+
                             |
                    2. Resolve upstream URL
                             |
                    +--------v---------+
                    |   URLResolver    |
                    +--------+---------+
                             |
                    3. Create upstream request,
                       copy headers
                             |
                    4. Enrich request
                             |
                    +--------v---------+
                    |  RequestEnricher |
                    |  (auth headers)  |
                    +--------+---------+
                             |
                    5. Store metadata in
                       request context
                             |
                    6. Execute interceptor chain
                             |
              +--------------v--------------+
              |  Interceptor A              |
              |   +- Interceptor B          |
              |   |   +- Interceptor C      |
              |   |   |   +- HTTP call ---->| upstream
              |   |   |   +- response <----+|
              |   |   +- post-process       |
              |   +- post-process           |
| +- post-process             |
              +-----------------------------+
                             |
                    7. Extract ResponseMetadata
                             |
                    +--------v---------+
                    | ResponseExtractor|
                    +--------+---------+
                             |
                    8. Re-attach raw body
                       to response
                             |
                    9. Return response
                       + metadata

Steps in detail:

  1. Read and parse the request body into BodyMetadata. The raw bytes are preserved for later use.
  2. Resolve the upstream URL from the metadata (model name, provider config).
  3. Create a new HTTP request for the upstream provider and copy relevant headers from the original request.
  4. Enrich the upstream request with provider-specific authentication (Bearer token, API key, AWS Signature V4).
  5. Store BodyMetadata in the request context so interceptors can access it.
  6. Execute the request through the interceptor chain. Each interceptor wraps the next, forming an onion-style pipeline.
  7. Extract ResponseMetadata from the upstream response.
  8. Re-attach the raw response body so custom JSON fields from the provider are preserved.
  9. Return the response and metadata to the caller.

Auto-Routing

The AutoRouter enables automatic provider and API detection from a single endpoint. POST to / with any LLM request and routing happens automatically.

API Type Detection

Detection happens in two phases:

Phase 1: Path-based detection

Path API Type
/v1/chat/completions Chat Completions
/v1/responses Responses
/v1/completions Legacy Completions
/v1/messages Anthropic Messages
:generateContent Gemini GenerateContent
/converse Bedrock Converse

Phase 2: Body + Provider detection (when path is / or unknown)

Body Field Provider API Type
input any Responses
prompt any Completions
contents any GenerateContent
messages anthropic Messages
messages other Chat Completions
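The phase-2 table can be sketched as a simple switch over the parsed body fields plus the detected provider name. This is a simplified stand-in; the library's real detection covers more cases:

```go
package main

import "fmt"

// detectAPIType mirrors the phase-2 table above: body fields first,
// then provider name for the ambiguous "messages" case.
func detectAPIType(body map[string]any, provider string) string {
	switch {
	case body["input"] != nil:
		return "Responses"
	case body["prompt"] != nil:
		return "Completions"
	case body["contents"] != nil:
		return "GenerateContent"
	case body["messages"] != nil && provider == "anthropic":
		return "Messages"
	case body["messages"] != nil:
		return "Chat Completions"
	}
	return "unknown"
}

func main() {
	fmt.Println(detectAPIType(map[string]any{"input": "Hello"}, "openai"))       // Responses
	fmt.Println(detectAPIType(map[string]any{"messages": []any{}}, "anthropic")) // Messages
}
```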

Provider Detection

Provider is detected in priority order:

  1. X-Provider header — Explicit override

    curl -X POST http://localhost:8080/ \
      -H 'X-Provider: anthropic' \
      -d '{"model":"claude-3-opus",...}'
  2. Model prefix — Provider prefix in model name (stripped before forwarding)

    # Model "openai/gpt-4" routes to OpenAI, forwards "gpt-4"
    curl -X POST http://localhost:8080/ \
      -d '{"model":"anthropic/claude-3-opus",...}'
  3. Model pattern — Match against known patterns

    Pattern Provider
    gpt-*, o1-*, o3-*, chatgpt-* OpenAI
    claude-* Anthropic
    gemini-*, gemma-* Google AI
    grok-* x.AI
    accounts/fireworks/* Fireworks
    sonar* Perplexity
    anthropic.claude-*, amazon.* Bedrock

Provider Prefix Stripping

Only known provider prefixes are stripped:

// Stripped (known providers)
"openai/gpt-4"           -> "gpt-4"
"anthropic/claude-3"     -> "claude-3"
"fireworks/models/llama" -> "models/llama"

// Preserved (unknown or model-native paths)
"accounts/fireworks/models/llama" -> "accounts/fireworks/models/llama"
"some-unknown/model"              -> "some-unknown/model"
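The stripping rule reduces to: cut at the first `/` and strip only when the prefix is a registered provider name. A minimal sketch (the provider set here is illustrative; the library consults its registry):

```go
package main

import (
	"fmt"
	"strings"
)

// knownProviders is an illustrative set; the library keeps its own registry.
var knownProviders = map[string]bool{
	"openai": true, "anthropic": true, "fireworks": true,
}

// stripProviderPrefix removes a leading "provider/" segment only when the
// prefix names a known provider; anything else is preserved verbatim.
func stripProviderPrefix(model string) string {
	prefix, rest, ok := strings.Cut(model, "/")
	if ok && knownProviders[prefix] {
		return rest
	}
	return model
}

func main() {
	fmt.Println(stripProviderPrefix("openai/gpt-4"))                    // gpt-4
	fmt.Println(stripProviderPrefix("accounts/fireworks/models/llama")) // unchanged
}
```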

Usage Examples

# Auto-detect everything - POST to /
curl -X POST http://localhost:8080/ \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

# Auto-detect Anthropic from model name
curl -X POST http://localhost:8080/ \
  -H 'Content-Type: application/json' \
  -d '{"model":"claude-3-opus","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'

# Auto-detect Responses API from input field
curl -X POST http://localhost:8080/ \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4o","input":"Hello"}'

# Traditional path-based routing still works
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

Streaming

The proxy fully supports SSE (Server-Sent Events) streaming with efficient token usage extraction for billing.

StreamingResponseExtractor

Extends ResponseExtractor for streaming responses:

type StreamingResponseExtractor interface {
    ResponseExtractor
    ExtractStreamingWithController(resp, w, rc) -> (ResponseMetadata, error)
    IsStreamingResponse(resp) -> bool
}

All built-in providers implement this interface for streaming support.

Streaming Flow

+------------------+
| Incoming Request |
|  stream: true    |
+--------+---------+
         |
   Parse body, detect
   stream: true flag
         |
+--------v---------+
|   AutoRouter     |
| ForwardStreaming |
+--------+---------+
         |
   Auto-inject
   stream_options:
   {include_usage:true}
         |
+--------v---------+
|   HTTP Request   |
|   to upstream    |
+--------+---------+
         |
+--------v---------+
| Upstream Response|
| text/event-stream|
+--------+---------+
         |
+--------v-----------+
| StreamingExtractor |
| Parse SSE events   |
| Extract usage      |
| Flush each chunk   |
+--------+-----------+
         |
+--------v---------+
| BillingCalculator|
| Calculate cost   |
+--------+---------+
         |
+--------v---------+
|   HTTP Response  |
|   to client      |
+------------------+

Usage Extraction

OpenAI Chat Completions: Usage is sent in the final chunk before [DONE]:

data: {"id":"...","usage":{"prompt_tokens":100,"completion_tokens":50,"total_tokens":150}}
data: [DONE]

OpenAI Responses API: Usage is in the response.completed event. The StreamingMultiAPIExtractor automatically detects the API type from the request context and dispatches to the correct streaming extractor:

data: {"type":"response.created","response":{"id":"resp_123","model":"gpt-4o"}}
data: {"type":"response.output_text.delta","delta":"Hello"}
data: {"type":"response.completed","response":{"usage":{"input_tokens":10,"output_tokens":5,"total_tokens":15}}}
data: [DONE]

No stream_options.include_usage is needed for the Responses API — usage is always included in response.completed. The proxy automatically skips stream_options injection for Responses API requests.

Anthropic: Usage is sent in message_start and message_delta events:

data: {"type":"message_start","message":{"usage":{"input_tokens":100}}}
...
data: {"type":"message_delta","usage":{"output_tokens":50}}
data: {"type":"message_stop"}

Auto stream_options Injection

When BillingCalculator is configured and the request has stream: true, the proxy automatically injects stream_options.include_usage for OpenAI-compatible Chat Completions endpoints only:

{
  "stream": true,
  "stream_options": { "include_usage": true }
}

This ensures providers return token usage in their streaming responses for billing calculation.

The following are excluded from this injection because they already include usage natively:

  • Responses API — usage is always present in the response.completed event
  • Anthropic — usage is sent in message_start and message_delta events
  • Bedrock and Google AI — usage is included in their streaming event formats

Efficient Flushing

Uses http.ResponseController for optimal streaming:

rc := http.NewResponseController(w)

for each SSE event {
    w.Write(event)
    rc.Flush()  // Immediate flush after each chunk
}

Non-streaming responses also use chunked read/write/flush with a 512KB buffer for better performance.

Billing with Streaming

adapter, _ := modelsdev.LoadFromURL()

billingCallback := func(r llmproxy.BillingResult) {
    log.Printf("Cost: $%.6f", r.TotalCost)
}

router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterBillingCalculator(
        llmproxy.NewBillingCalculator(adapter.GetCostLookup(), billingCallback),
    ),
)

After the stream completes, the billing callback is invoked with the extracted token usage.


WebSocket Mode

The proxy supports the OpenAI Responses API WebSocket mode for persistent, multi-turn connections. This is useful for tool-call-heavy workflows where multiple response.create / response.completed cycles happen on a single connection.

Adapter Pattern (Zero Dependencies)

The library defines abstract WebSocket interfaces — no WebSocket library is vendored. Consumers bring their own implementation (gorilla/websocket, nhooyr.io/websocket, etc.) and wire it in via thin adapters.

Interfaces

// WSConn abstracts a WebSocket connection.
// gorilla/websocket's *Conn satisfies this directly.
type WSConn interface {
    ReadMessage() (messageType int, p []byte, err error)
    WriteMessage(messageType int, data []byte) error
    Close() error
}

// WSUpgrader upgrades an HTTP request to a WebSocket connection.
type WSUpgrader interface {
    Upgrade(w http.ResponseWriter, r *http.Request, responseHeader http.Header) (WSConn, error)
}

// WSDialer dials a WebSocket connection to an upstream server.
type WSDialer interface {
    DialContext(ctx context.Context, urlStr string, requestHeader http.Header) (WSConn, *http.Response, error)
}

// WebSocketCapableProvider is implemented by providers that support WebSocket mode.
type WebSocketCapableProvider interface {
    Provider
    WebSocketURL(meta BodyMetadata) (*url.URL, error)
}

The OpenAI provider implements WebSocketCapableProvider. Other providers can opt in by implementing the same interface.

gorilla/websocket Example

gorilla's *websocket.Conn already satisfies WSConn — no wrapper needed. You only need thin adapters for Upgrader and Dialer because their return types differ (*websocket.Conn vs WSConn):

package myadapter

import (
    "context"
    "net/http"

    "github.com/agentuity/llmproxy"
    "github.com/gorilla/websocket"
)

// GorillaUpgrader wraps gorilla's Upgrader to satisfy llmproxy.WSUpgrader.
type GorillaUpgrader struct {
    Upgrader websocket.Upgrader
}

func (u *GorillaUpgrader) Upgrade(w http.ResponseWriter, r *http.Request, h http.Header) (llmproxy.WSConn, error) {
    conn, err := u.Upgrader.Upgrade(w, r, h)
    if err != nil {
        return nil, err
    }
    return conn, nil // *websocket.Conn satisfies WSConn
}

// GorillaDialer wraps gorilla's Dialer to satisfy llmproxy.WSDialer.
type GorillaDialer struct {
    Dialer websocket.Dialer
}

func (d *GorillaDialer) DialContext(ctx context.Context, urlStr string, h http.Header) (llmproxy.WSConn, *http.Response, error) {
    conn, resp, err := d.Dialer.DialContext(ctx, urlStr, h)
    if err != nil {
        return nil, resp, err
    }
    return conn, resp, nil // *websocket.Conn satisfies WSConn
}

Wiring It Together

router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
    // In production, configure CheckOrigin to validate against trusted origins.
    llmproxy.WithAutoRouterWebSocket(
        &myadapter.GorillaUpgrader{
            Upgrader: websocket.Upgrader{
                CheckOrigin: func(r *http.Request) bool {
                    // Validate r.Header.Get("Origin") against a whitelist
                    return isAllowedOrigin(r)
                },
            },
        },
        &myadapter.GorillaDialer{
            Dialer: websocket.Dialer{},
        },
    ),
    llmproxy.WithAutoRouterWSBillingCallback(func(turn int, meta llmproxy.ResponseMetadata, billing *llmproxy.BillingResult) {
        log.Printf("Turn %d: model=%s prompt=%d completion=%d",
            turn, meta.Model, meta.Usage.PromptTokens, meta.Usage.CompletionTokens)
        if billing != nil {
            log.Printf("  Cost: $%.6f", billing.TotalCost)
        }
    }),
)
router.RegisterProvider(openaiProvider)

http.Handle("/", router)

WebSocket support is opt-in — if WithAutoRouterWebSocket is not called, WebSocket upgrade requests are rejected and normal HTTP handling is unchanged.

WebSocket Protocol

The OpenAI Responses API WebSocket mode operates at wss://api.openai.com/v1/responses:

Aspect Detail
Client sends {"type":"response.create","model":"gpt-4o","input":[...]}
Server sends Same events as SSE: response.created, response.output_text.delta, response.completed, etc.
Multi-turn Client sends new response.create with previous_response_id on same connection
Usage In response.completed under response.usage.{input_tokens, output_tokens, total_tokens}
Frames JSON text frames — no data: prefix (unlike SSE)

WebSocket Flow

+------------------+        +------------------+        +------------------+
|     Client       |        |    AutoRouter    |        |    Upstream      |
|                  |        |                  |        |   (OpenAI)       |
+--------+---------+        +--------+---------+        +--------+---------+
         |                           |                           |
   1. WS Upgrade  -------->  2. Accept upgrade                  |
         |                   3. Read first message               |
         |                      (response.create)                |
         |                   4. Detect provider/model            |
         |                   5. Strip model prefix               |
         |                   6. Enrich headers (auth)            |
         |                   7. Dial upstream WS  -------->  8. Accept
         |                   9. Forward first msg  -------->     |
         |                           |                           |
         |                  === Bidirectional Relay ===          |
         |                           |                           |
  10. response.create ----->  Strip prefix  ------------>  response.create
         |                           |                           |
         |                           |  <-----------  response.created
  response.created  <------          |                           |
         |                           |  <-----------  response.output_text.delta
  text delta  <----------------      |                           |
         |                           |  <-----------  response.completed
  response.completed  <----  Extract usage,            (includes usage)
         |                   compute billing                     |
         |                           |                           |
  11. response.create ----->  Strip prefix  ------------>  (next turn)
         |                     ...                        ...
         |                           |                           |
  12. Close  ----------------> Close both  ------------>  Close

Billing

Billing is calculated per turn — each response.completed event triggers:

  1. Usage extraction (input_tokens, output_tokens, total_tokens, cache details)
  2. Cost calculation via BillingCalculator (if configured)
  3. WSBillingCallback invocation with the turn number, response metadata, and billing result

For multi-turn connections, each turn is billed independently. The callback receives an incrementing turn counter so consumers can aggregate if needed.
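One way to aggregate per-turn costs is a closure-based callback. The `BillingResult` struct here is a local stand-in for the library's type (only `TotalCost` is assumed), and the mutex reflects that relay goroutines may invoke the callback:

```go
package main

import (
	"fmt"
	"sync"
)

// BillingResult is a minimal stand-in for llmproxy.BillingResult.
type BillingResult struct{ TotalCost float64 }

// newAggregator returns a per-turn callback and a total() accessor.
// The callback tolerates nil results (billing not configured).
func newAggregator() (cb func(turn int, b *BillingResult), total func() float64) {
	var mu sync.Mutex
	var sum float64
	cb = func(turn int, b *BillingResult) {
		if b == nil {
			return
		}
		mu.Lock()
		sum += b.TotalCost
		mu.Unlock()
	}
	total = func() float64 {
		mu.Lock()
		defer mu.Unlock()
		return sum
	}
	return cb, total
}

func main() {
	cb, total := newAggregator()
	cb(1, &BillingResult{TotalCost: 0.002})
	cb(2, &BillingResult{TotalCost: 0.003})
	cb(3, nil) // turn with billing disabled
	fmt.Printf("$%.3f\n", total())
	// → $0.005
}
```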

Model Prefix Stripping

Works the same as HTTP mode — openai/gpt-4o is stripped to gpt-4o in all response.create messages. Non-response.create messages are forwarded byte-for-byte without modification.

Connection Lifecycle

  • Both relay goroutines use a sync.Once-guarded close — when either side closes, the other is closed immediately
  • WebSocket close errors (io.EOF, "connection closed") are treated as normal termination, not errors
  • The ForwardWebSocket method blocks until both relay goroutines complete

Providers

Nine providers are included. Six share the OpenAI-compatible base; three have fully custom implementations.

OpenAI-Compatible Base

providers/openai_compatible implements BodyParser, ResponseExtractor, URLResolver, and RequestEnricher for the OpenAI chat completions format. Providers that speak this protocol embed the base and override only what differs (name, base URL, auth configuration).

The OpenAI provider also supports the Responses API (/v1/responses) with automatic detection based on the input field in the request body.

Provider Table

Provider Package Auth API Format Notes
OpenAI providers/openai Bearer token Chat completions, Responses Supports both APIs with auto-detection
Anthropic providers/anthropic x-api-key header + anthropic-version Anthropic Messages API Custom parser/extractor
Groq providers/groq Bearer token OpenAI-compatible Wraps openai_compatible
Fireworks providers/fireworks Bearer token OpenAI-compatible Wraps openai_compatible
x.AI providers/xai Bearer token OpenAI-compatible Wraps openai_compatible
Google AI providers/googleai API key in URL query param Gemini generateContent Custom parser/extractor/resolver
AWS Bedrock providers/bedrock AWS Signature V4 Converse API Custom everything; supports both Converse and InvokeModel
Azure OpenAI providers/azure api-key header or Azure AD Bearer token OpenAI chat completions Uses deployments instead of models
OpenAI-compatible providers/openai_compatible Bearer token OpenAI chat completions Reusable base for OpenAI-compatible providers

Provider Details

OpenAI — Wraps openai_compatible with support for multiple APIs:

  • Chat Completions (/v1/chat/completions) — Standard messages-based API
  • Responses (/v1/responses) — Newer API with input field, built-in tools support. Supports both HTTP (SSE streaming) and WebSocket modes.
  • Legacy Completions (/v1/completions) — Older prompt-based API

The provider auto-detects the API type from the request body:

  • input field → Responses API
  • prompt field → Completions API
  • messages field → Chat Completions API

The OpenAI provider implements WebSocketCapableProvider, enabling persistent WebSocket connections for multi-turn Responses API workflows when WithAutoRouterWebSocket is configured.

Anthropic — Custom body parser translates between the proxy's canonical format and Anthropic's Messages API. Custom extractor maps Anthropic's response shape (content blocks, stop_reason) back to ResponseMetadata. Auth uses the x-api-key header alongside an anthropic-version header.

Groq, Fireworks, x.AI — Each wraps openai_compatible with its own base URL and provider name. No custom parsing or extraction logic needed.

Google AI — Custom body parser converts to the Gemini generateContent format (contents/parts). Custom URL resolver appends the API key as a query parameter rather than using a header. Custom extractor maps Gemini's response (candidates/content/parts) back to ResponseMetadata.

AWS Bedrock — Fully custom implementation. The body parser converts to the Bedrock Converse API format. The enricher computes AWS Signature V4 signing using region, access key, and secret key. The URL resolver constructs region-specific endpoints. Supports both the Converse and InvokeModel API paths.

Azure OpenAI — Uses deployments instead of direct model access. Two construction modes:

  • New(resourceName, deploymentID, apiVersion, opts...) — Fixed deployment, uses configured deploymentID for all requests
  • NewWithDynamicDeployment(resourceName, apiVersion, opts...) — Dynamic deployment, uses the model field from each request as the deployment name

Authentication via functional options:

  • WithAPIKey(apiKey) — Sets api-key header
  • WithAzureADToken(token) — Sets Authorization: Bearer header
  • WithAzureADTokenRefresher(fn) — Token refresh callback for expiring Azure AD tokens

URL format: https://{resource}.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version={version}


Interceptors

Eight built-in interceptors are provided in the interceptors/ package.

Logging

NewLogging(logger) — Logs each request/response cycle with:

  • Model name
  • HTTP method and URL
  • Response status
  • Latency
  • Token usage (prompt, completion, total)

Accepts an llmproxy.Logger interface, which is compatible with github.com/agentuity/go-common/logger without requiring the dependency.

Metrics

NewMetrics(m) — Tracks aggregate statistics:

Field Description
TotalRequests Number of requests processed
TotalTokens Total tokens consumed
TotalPromptTokens Total input tokens
TotalCompletionTokens Total output tokens
TotalLatency Cumulative request duration
Errors Number of failed requests

All fields use sync/atomic operations for thread safety. The Metrics struct can be read concurrently from monitoring goroutines.

Retry

NewRetry(maxAttempts, delay) — Retries failed requests under specific conditions:

  • Retries on: HTTP 429 (rate limit) and 5xx (server errors)
  • Does NOT retry: context.Canceled, context.DeadlineExceeded
  • Body handling: Reconstructs the request body from raw bytes on each retry attempt
  • Custom predicate: NewRetryWithPredicate(maxAttempts, delay, predicate) allows callers to supply a custom function that decides whether a given response should be retried

NewRetryWithRateLimitHeaders(maxAttempts, defaultDelay) — Retries with rate limit header support:

  • Retry-After header: Parses both seconds (integer) and HTTP date formats
  • X-RateLimit-Reset header: Fallback if Retry-After not present
  • Max delay: Values over 24 hours are ignored (fallback to defaultDelay); exactly 24 hours is accepted
  • Precedence: Retry-After takes precedence over X-RateLimit-Reset

Example:

// Use rate limit headers from provider
retry := interceptors.NewRetryWithRateLimitHeaders(3, time.Second)

// If provider returns 429 with Retry-After: 30, waits 30s instead of 1s

Billing

NewBilling(lookup, onResult) — Calculates the cost of each request:

  • Uses a CostLookup function to retrieve per-model pricing
  • Detects the provider from the model name
  • Computes input/output/cache costs based on token usage
  • Calls the onResult callback with a BillingResult after each request
  • Stores BillingResult in ResponseMetadata.Custom["billing_result"] for downstream access

When using AutoRouter, billing results are automatically added as response headers:

  • X-Gateway-Cost — Total cost in USD
  • X-Gateway-Prompt-Tokens — Input token count
  • X-Gateway-Completion-Tokens — Output token count

Tracing

NewTracing(extractor) — Propagates OpenTelemetry trace context:

  • Upstream headers:
    • X-Request-ID: the trace ID (32 hex chars)
    • traceparent: W3C Trace Context format (00-{traceid}-{spanid}-{flags})
  • Response header: X-Request-ID for correlation
  • Extractor function: Pulls trace info from request context

The extractor function signature:

func(ctx context.Context) TraceInfo

For OpenTelemetry:

func otelExtractor(ctx context.Context) interceptors.TraceInfo {
    span := trace.SpanFromContext(ctx)
    if !span.SpanContext().IsValid() {
        return interceptors.TraceInfo{}
    }
    return interceptors.TraceInfo{
        TraceID: span.SpanContext().TraceID(),
        SpanID:  span.SpanContext().SpanID(),
        Sampled: span.SpanContext().IsSampled(),
    }
}

Use NewTracingWithHeader(extractor, headerName) to customize the response header name.

HeaderBan

NewHeaderBan(requestHeaders, responseHeaders) — Strips specified headers:

  • Request headers: Removed before forwarding upstream
  • Response headers: Removed before returning to caller
  • Case-insensitive: HTTP header matching is case-insensitive

Convenience constructors:

  • NewResponseHeaderBan(headers...) — Strip only response headers
  • NewRequestHeaderBan(headers...) — Strip only request headers

Example:

ban := interceptors.NewResponseHeaderBan("Openai-Organization", "Openai-Project")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(ban))

AddHeader

NewAddHeader(requestHeaders, responseHeaders) — Adds custom headers to requests and responses:

  • Request headers: Added before forwarding upstream
  • Response headers: Added before returning to caller

Convenience constructors:

  • NewAddResponseHeader(headers...) — Add only response headers
  • NewAddRequestHeader(headers...) — Add only request headers

Example:

add := interceptors.NewAddResponseHeader(
    interceptors.NewHeader("X-Gateway-Version", "1.0"),
    interceptors.NewHeader("X-Served-By", "llmproxy"),
)
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(add))

PromptCaching

Provider-specific prompt caching interceptors for Anthropic, OpenAI, xAI, Fireworks, and AWS Bedrock.

Common Behavior

  • Cache-Control header: If the incoming request has Cache-Control: no-cache, the interceptor skips entirely — letting clients disable caching per-request
  • Provider detection: Only applies to matching models:
    • Anthropic: claude-*
    • OpenAI: gpt-*, o1-*, o3-*, o4-*, chatgpt-*
    • xAI: grok-*
    • Fireworks: accounts/fireworks/*, fireworks*
    • Bedrock: anthropic.claude-*, amazon.nova-*, amazon.titan-*
  • Cache usage tracking: Response metadata includes CacheUsage in Custom["cache_usage"]

Anthropic

NewAnthropicPromptCaching(retention) — Enables Anthropic prompt caching:

  • Automatic caching: Adds cache_control at the top level of requests
  • Retention options:
    • CacheRetentionDefault (default, 5 min) — no TTL field, free, auto-refreshed on use
    • CacheRetention1h — adds ttl: "1h", costs more, longer cache lifetime
  • User-controlled caching: If request already has cache_control, the interceptor skips entirely — letting you control caching explicitly via block-level breakpoints

Example:

// Enable prompt caching for Anthropic (default 5 min, free)
caching := interceptors.NewAnthropicPromptCaching(interceptors.CacheRetentionDefault)
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With 1h retention (costs more) and cache usage callback
caching := interceptors.NewAnthropicPromptCachingWithResult(interceptors.CacheRetention1h, func(u llmproxy.CacheUsage) {
    log.Printf("Cache read: %d tokens, Cache write: %d tokens", u.CacheReadInputTokens, u.CacheCreationInputTokens)
})

OpenAI

NewOpenAIPromptCaching(retention, cacheKey) — Enables OpenAI prompt caching:

  • Automatic caching: OpenAI caches prompts ≥ 1024 tokens automatically
  • Cache routing: Adds prompt_cache_key to improve cache hit rates for requests with common prefixes
  • Retention options:
    • CacheRetentionDefault (default, in-memory, 5-10 min) — no retention field
    • CacheRetention24h — adds prompt_cache_retention: "24h" for GPT-5.x and GPT-4.1
  • Cache key sources (in priority order):
    1. X-Cache-Key header from incoming request
    2. Configured CacheKey in PromptCachingConfig
    3. Auto-derived from static content prefix via DeriveCacheKeyFromPrefix()
  • Tenant namespacing: Cache keys are automatically prefixed with org/tenant ID from:
    1. Custom OrgIDExtractor function
    2. OrgID in MetaContextValue stored in request context
    3. X-Org-ID header
    4. org_id in BodyMetadata.Custom
    5. Configured Namespace fallback

Example:

// Enable prompt caching for OpenAI with a cache key (default retention)
caching := interceptors.NewOpenAIPromptCaching(interceptors.CacheRetentionDefault, "my-app-session-123")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With 24h retention and cache usage callback
caching := interceptors.NewOpenAIPromptCachingWithResult(interceptors.CacheRetention24h, "my-key", func(u llmproxy.CacheUsage) {
    log.Printf("Cached tokens: %d", u.CachedTokens)
})

// Auto-derive cache key from static content, namespace by tenant
caching := interceptors.NewOpenAIPromptCachingAuto("tenant-123", interceptors.CacheRetentionDefault)

// Custom org ID extractor (e.g., from auth context)
caching := interceptors.NewOpenAIPromptCachingWithOrgExtractor(
    interceptors.CacheRetentionDefault, 
    "my-key",
    func(ctx context.Context, req *http.Request, meta llmproxy.BodyMetadata) string {
        return getOrgFromAuthContext(ctx)
    },
)

xAI (Grok)

NewXAIPromptCaching(convID) — Enables xAI/Grok prompt caching:

  • Automatic prefix caching: xAI caches from the start of the messages array automatically
  • Cache routing: Adds x-grok-conv-id HTTP header to route requests to the same server where cache lives
  • Conversation ID: Use a stable value (conversation ID, session ID, or deterministic hash of static content)
  • Key rule: Never reorder or modify earlier messages — only append

Example:

// Enable prompt caching for xAI with a conversation ID
caching := interceptors.NewXAIPromptCaching("conv-abc123-tenant456")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With cache usage callback
caching := interceptors.NewXAIPromptCachingWithResult("my-conv-id", func(u llmproxy.CacheUsage) {
    log.Printf("Cached tokens: %d", u.CachedTokens)
})

Fireworks

NewFireworksPromptCaching(sessionID) — Enables Fireworks prompt caching:

  • Automatic caching: Fireworks caches prompts with shared prefixes automatically (enabled by default)
  • Cache routing: Adds x-session-affinity HTTP header to route requests to the same replica
  • Tenant isolation: Adds x-prompt-cache-isolation-key header set to org/tenant ID for multi-tenant isolation
  • Cache usage: Reads fireworks-cached-prompt-tokens response header for cache hit tracking

Example:

// Enable prompt caching for Fireworks with session affinity
caching := interceptors.NewFireworksPromptCaching("session-abc123")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With org ID extractor for tenant isolation
caching := interceptors.NewFireworksPromptCachingWithOrgExtractor("session-abc123", func(ctx context.Context, req *http.Request, meta llmproxy.BodyMetadata) string {
    return getOrgFromAuthContext(ctx)
})

// With cache usage callback
caching := interceptors.NewFireworksPromptCachingWithResult("session-abc123", func(u llmproxy.CacheUsage) {
    log.Printf("Cached tokens: %d", u.CachedTokens)
})

AWS Bedrock

NewBedrockPromptCaching(retention) — Enables AWS Bedrock prompt caching via the Converse API:

  • Cache checkpoints: Adds cachePoint objects to system, messages, and toolConfig
  • Retention options:
    • CacheRetentionDefault (default, 5 min) — no TTL field
    • CacheRetention1h — adds ttl: "1h" for Claude Opus 4.5, Haiku 4.5, and Sonnet 4.5
  • Minimum tokens: 1,024 tokens per cache checkpoint (varies by model)
  • Maximum checkpoints: 4 per request
  • Supported models: Claude models (anthropic.claude-*), Nova models (amazon.nova-*), Titan models (amazon.titan-*)
  • Cache usage: Reads cacheReadInputTokens, cacheWriteInputTokens, and cacheDetails from response

Example:

// Enable prompt caching for Bedrock (default 5 min)
caching := interceptors.NewBedrockPromptCaching(interceptors.CacheRetentionDefault)
proxy := llmproxy.NewProxy(bedrockProvider, llmproxy.WithInterceptor(caching))

// With 1h retention for Claude Opus 4.5
caching := interceptors.NewBedrockPromptCaching(interceptors.CacheRetention1h)

// With cache usage callback
caching := interceptors.NewBedrockPromptCachingWithResult(interceptors.CacheRetentionDefault, func(u llmproxy.CacheUsage) {
    log.Printf("Cache read: %d tokens, Cache write: %d tokens", u.CachedTokens, u.CacheWriteTokens)
    for _, detail := range u.CacheDetails {
        log.Printf("  TTL %s: %d tokens written", detail.TTL, detail.CacheWriteTokens)
    }
})

Azure OpenAI

Azure OpenAI uses the same prompt_cache_key body parameter as OpenAI. Use the OpenAI interceptor for Azure OpenAI:

// Azure OpenAI prompt caching uses the OpenAI interceptor
caching := interceptors.NewOpenAIPromptCaching(interceptors.CacheRetentionDefault, "my-cache-key")
proxy := llmproxy.NewProxy(azureProvider, llmproxy.WithInterceptor(caching))

Note: Azure OpenAI caches prompts ≥ 1,024 tokens automatically. The prompt_cache_key parameter is combined with the prefix hash to improve cache hit rates. Cache hits appear as cached_tokens in prompt_tokens_details in the response.

Generic constructor

NewPromptCaching(provider, config) — Creates a caching interceptor for any provider:

// Anthropic with 1h retention
caching := interceptors.NewPromptCaching("anthropic", interceptors.PromptCachingConfig{
    Enabled:   true,
    Retention: interceptors.CacheRetention1h,
})

// OpenAI with 24h retention
caching := interceptors.NewPromptCaching("openai", interceptors.PromptCachingConfig{
    Enabled:   true,
    Retention: interceptors.CacheRetention24h,
    CacheKey:  "my-cache-key",
})

// xAI with conversation ID
caching := interceptors.NewPromptCaching("xai", interceptors.PromptCachingConfig{
    Enabled:  true,
    CacheKey: "my-conv-id",
})

// Fireworks with session ID and org extractor
// Fireworks with session ID and org extractor
caching := interceptors.NewPromptCaching("fireworks", interceptors.PromptCachingConfig{
    Enabled:        true,
    CacheKey:       "my-session-id",
    OrgIDExtractor: interceptors.DefaultOrgIDExtractor,
})

// Bedrock with 1h retention
caching := interceptors.NewPromptCaching("bedrock", interceptors.PromptCachingConfig{
    Enabled:   true,
    Retention: interceptors.CacheRetention1h,
})

Pricing System

pricing/modelsdev/ provides an adapter that loads pricing data from models.dev.

Loading

Three source options:

  • LoadFromFile(path) — Load from a local JSON file
  • LoadFromURL() — Load from the default models.dev URL
  • Custom URL via WithURL(url) option

Options

  • WithMarkup(multiplier) — Apply a markup for reselling (e.g., 1.2 = 20% markup on all prices)

Integration

adapter := modelsdev.LoadFromURL()
lookup := adapter.GetCostLookup()
billing := interceptors.NewBilling(lookup, func(result llmproxy.BillingResult) {
    // handle billing result
})

GetCostLookup() returns a CostLookup function suitable for the billing interceptor.

Caching

The adapter supports TTL-based refresh so pricing data stays current without restarting the process.


Logger Interface

type Logger interface {
    Debug(msg string, args ...interface{})
    Info(msg string, args ...interface{})
    Warn(msg string, args ...interface{})
    Error(msg string, args ...interface{})
}

Matches the signature of github.com/agentuity/go-common/logger without requiring it as a dependency. Any logger implementing these four methods works.

LoggerFunc is an adapter that wraps a plain function as a Logger, useful for testing or simple setups.


Design Principles

  1. Small, focused interfaces — Each interface (BodyParser, RequestEnricher, URLResolver, ResponseExtractor) does exactly one thing. This keeps implementations simple and testable.

  2. Composition over inheritance — Providers compose the four core interfaces rather than inheriting from a base class. BaseProvider wires them together via functional options. OpenAI-compatible providers embed a shared base and override only what differs.

  3. Raw body preservation — Both request and response bodies are preserved as raw bytes throughout the lifecycle. This avoids losing custom JSON fields that providers may include but that the library's typed structs don't model.

  4. Function-based lookup — CostLookup is a function type, not an interface with a concrete implementation. This allows callers to manage pricing data externally (database, config file, remote service) without coupling to a specific source.

  5. Interceptor chain — Cross-cutting concerns (logging, metrics, retry, billing) wrap the request lifecycle without modifying providers. Interceptors compose independently and execute in a predictable onion order.


Directory Structure

llmproxy/
├── apitype.go              # API type detection and constants
├── autorouter.go           # AutoRouter, provider/API auto-detection, streaming
├── autorouter_websocket.go # WebSocket forwarding, bidirectional relay
├── billing.go              # CostInfo, CostLookup, BillingResult, CalculateCost
├── billing_calculator.go   # BillingCalculator for streaming/non-streaming
├── detection.go            # Provider detection from model/header
├── enricher.go             # RequestEnricher interface
├── extractor.go            # ResponseExtractor, StreamingResponseExtractor interface
├── interceptor.go          # Interceptor, InterceptorChain, RoundTripFunc
├── internal/
│   └── fastjson/
│       └── extractor.go    # Fast JSON parsing with simdjson-go
├── logger.go               # Logger interface, LoggerFunc adapter
├── metadata.go             # BodyMetadata, ResponseMetadata, Message, Usage, Choice
├── parser.go               # BodyParser interface
├── provider.go             # Provider interface, BaseProvider
├── proxy.go                # Proxy struct, Forward method
├── registry.go             # Registry interface, MapRegistry
├── resolver.go             # URLResolver interface
├── streaming.go            # SSE parser, streaming types, usage extraction
├── websocket.go            # WebSocket interfaces, message parsing, usage extraction
├── interceptors/
│   ├── addheader.go        # AddHeaderInterceptor
│   ├── billing.go          # BillingInterceptor
│   ├── headerban.go        # HeaderBanInterceptor
│   ├── logging.go          # LoggingInterceptor
│   ├── metrics.go          # MetricsInterceptor, Metrics
│   ├── promptcaching.go    # PromptCachingInterceptor
│   ├── retry.go            # RetryInterceptor
│   └── tracing.go          # TracingInterceptor
├── pricing/
│   └── modelsdev/
│       └── adapter.go      # models.dev pricing adapter
├── providers/
│   ├── anthropic/          # Anthropic Messages API
│   │   └── streaming_extractor.go  # Anthropic SSE streaming
│   ├── azure/              # Azure OpenAI
│   ├── bedrock/            # AWS Bedrock Converse API
│   │   └── streaming_extractor.go  # Bedrock streaming
│   ├── fireworks/          # Fireworks (OpenAI-compatible)
│   ├── googleai/           # Google AI Gemini
│   │   └── streaming_extractor.go  # Google AI streaming
│   ├── groq/               # Groq (OpenAI-compatible)
│   ├── openai/             # OpenAI (Chat Completions + Responses + WebSocket)
│   ├── openai_compatible/  # Base for OpenAI-compatible providers
│   │   ├── multiapi.go                      # Multi-API parser/extractor dispatch
│   │   ├── streaming_extractor.go           # Chat Completions SSE streaming
│   │   ├── responses_parser.go              # Responses API parser
│   │   ├── responses_extractor.go           # Responses API extractor
│   │   ├── responses_streaming_extractor.go # Responses API SSE streaming
│   │   └── websocket.go                     # WebSocket URL resolver
│   └── xai/                # x.AI (OpenAI-compatible)
└── examples/
    └── basic/              # Multi-provider proxy server example (uses AutoRouter)