
DESIGN

Module: github.com/agentuity/llmproxy

A pluggable, composable library for proxying requests to upstream LLM providers. The core lifecycle follows a five-stage pipeline:

Parse --> Enrich --> Resolve --> Forward --> Extract

Callers construct a Proxy with a Provider and optional Interceptor chain, then call Forward(ctx, req) to proxy a single request through the full lifecycle.


Architecture

Core Interfaces

BodyParser

Extracts BodyMetadata from a raw HTTP request body. Returns structured metadata (model name, messages, max tokens, stream flag, custom fields) alongside the raw body bytes. The raw bytes are returned because http.Request.Body is a ReadCloser that can only be read once — downstream stages need access to the original payload.

RequestEnricher

Modifies the outgoing upstream request with provider-specific headers. Receives the parsed BodyMetadata and raw body bytes. For most providers this sets Authorization: Bearer <key>. AWS Bedrock computes an AWS Signature V4 instead.

URLResolver

Determines the upstream URL from metadata. Each provider maps to its own endpoint scheme:

Provider URL Pattern
OpenAI https://api.openai.com/v1/chat/completions
Bedrock https://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/converse
Google AI https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent

ResponseExtractor

Parses the upstream HTTP response into ResponseMetadata (ID, model, token usage, choices). Reads and returns the raw response body bytes so they can be re-attached to the response, preserving any custom JSON fields the provider may include.

Provider

Composes the four interfaces above into a single unit:

Name() string
BodyParser() BodyParser
RequestEnricher() RequestEnricher
ResponseExtractor() ResponseExtractor
URLResolver() URLResolver

BaseProvider supplies a configurable default implementation via functional options. Providers that share the OpenAI chat completions format embed the openai_compatible base and only override what differs (name, URL, auth).

Interceptor

Wraps the request/response cycle for cross-cutting concerns. Signature:

Intercept(req, meta, rawBody, next) -> (resp, respMeta, rawRespBody, err)

The chain wraps in reverse order so that given interceptors [A, B, C], execution flows as:

A -> B -> C -> upstream -> C -> B -> A

Each interceptor can inspect/modify the request before calling next, and inspect/modify the response after next returns.

Registry

Manages a collection of named providers. MapRegistry provides a thread-safe, map-based lookup by provider name.

Proxy

The main entry point. Forward(ctx, req) orchestrates the full request lifecycle. Configurable with:

  • WithInterceptor(i) — adds an interceptor to the chain
  • WithHTTPClient(c) — sets a custom *http.Client for upstream calls

AutoRouter

An HTTP handler that provides automatic provider and API type detection from a single endpoint. Implements http.Handler for easy integration.

Forward(ctx, req) -> (resp, meta, err)
ServeHTTP(w, r)

Detection Flow:

  1. Parse body - Extract model name and request structure
  2. Detect provider - From X-Provider header, model prefix (openai/gpt-4), or model pattern (gpt-*)
  3. Strip provider prefix - If model has known provider prefix, strip before forwarding
  4. Detect API type - From path (/v1/messages) or body+provider (input → Responses)
  5. Route to provider - Forward to detected provider with correct endpoint

Configuration options:

  • WithAutoRouterRegistry(r) — Use custom registry
  • WithAutoRouterDetector(d) — Custom provider detection logic
  • WithAutoRouterModelProviderLookup(lookup) — Hook for model→provider mapping (e.g., models.dev-backed detection); called when model pattern detection fails
  • WithAutoRouterInterceptor(i) — Add interceptor to chain
  • WithAutoRouterHTTPClient(c) — Custom HTTP client
  • WithAutoRouterFallbackProvider(p) — Provider when detection fails
  • WithAutoRouterWebSocket(upgrader, dialer) — Enable WebSocket mode (opt-in, see WebSocket Mode)
  • WithAutoRouterWSBillingCallback(cb) — Per-turn billing callback for WebSocket connections

Example:

// Basic setup
router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
    llmproxy.WithAutoRouterInterceptor(interceptors.NewLogging(logger)),
)
router.RegisterProvider(openaiProvider)
router.RegisterProvider(anthropicProvider)

http.Handle("/", router)

// With models.dev-backed provider detection
adapter, _ := modelsdev.LoadFromURL()
router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterModelProviderLookup(adapter.FindProviderForModel),
    llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
)

Data Types

BodyMetadata

Field Type Description
model string Target model identifier
messages []Message Conversation messages
maxTokens int Token generation limit
stream bool Whether to stream the response
custom map[string]any Provider-specific fields

ResponseMetadata

Field Type Description
id string Response identifier
object string Object type
model string Model that generated the response
usage Usage Token counts
choices []Choice Response choices
custom map[string]any Provider-specific fields

Message

Field Type Description
role string system, user, assistant, tool
content string Message content

Usage

Field Type Description
promptTokens int Input tokens consumed
completionTokens int Output tokens generated
totalTokens int Sum of prompt + completion

CacheUsage

Field Type Description
cachedTokens int Tokens served from cache (OpenAI, Azure)
cacheCreationInputTokens int Tokens written to cache (Anthropic)
cacheReadInputTokens int Tokens read from cache (Anthropic)
ephemeral5mInputTokens int 5-minute cache write tokens (Anthropic)
ephemeral1hInputTokens int 1-hour cache write tokens (Anthropic)
cacheWriteTokens int Tokens written to cache (Bedrock)
cacheDetails []CacheDetail TTL-based cache write breakdown (Bedrock)

CacheDetail

Field Type Description
ttl string Time-to-live for cache entry (e.g., "5m", "1h")
cacheWriteTokens int Tokens written to cache at this TTL

Choice

Field Type Description
index int Choice index
message Message Non-streaming response content
delta Message Streaming response content
finishReason string Why generation stopped

CostInfo

Per-model pricing in USD per 1M tokens:

Field Description
input Input token cost
output Output token cost
cacheRead Cached input read cost
cacheWrite Cached input write cost
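A worked example of how per-1M-token pricing translates to a per-request cost. The struct fields mirror the table above; the exported names and the prices are illustrative assumptions:

```go
package main

import "fmt"

// CostInfo mirrors the table above: USD per 1M tokens.
// Field names here are assumed exports of the lowercase fields listed.
type CostInfo struct {
	Input, Output, CacheRead, CacheWrite float64
}

// cost computes the dollar cost of one request from its token counts.
func cost(c CostInfo, promptTokens, completionTokens int) float64 {
	return float64(promptTokens)*c.Input/1e6 +
		float64(completionTokens)*c.Output/1e6
}

func main() {
	// Illustrative pricing, not real provider rates.
	pricing := CostInfo{Input: 2.50, Output: 10.00}
	fmt.Printf("$%.6f\n", cost(pricing, 1000, 500))
	// → $0.007500
}
```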

BillingResult

Field Description
provider Provider name
model Model name
tokens Usage breakdown
costs Computed costs

Request Lifecycle

The full flow through Proxy.Forward:

                    +------------------+
                    |  Incoming HTTP   |
                    |    Request       |
                    +--------+---------+
                             |
                    1. Read request body
                             |
                    +--------v---------+
                    |   BodyParser     |
                    |  Parse -> Meta   |
                    |  + raw body bytes|
                    +--------+---------+
                             |
                    2. Resolve upstream URL
                             |
                    +--------v---------+
                    |   URLResolver    |
                    +--------+---------+
                             |
                    3. Create upstream request,
                       copy headers
                             |
                    4. Enrich request
                             |
                    +--------v---------+
                    |  RequestEnricher |
                    |  (auth headers)  |
                    +--------+---------+
                             |
                    5. Store metadata in
                       request context
                             |
                    6. Execute interceptor chain
                             |
              +--------------v--------------+
              |  Interceptor A              |
              |   +- Interceptor B          |
              |   |   +- Interceptor C      |
              |   |   |   +- HTTP call ---->| upstream
              |   |   |   +- response <----+|
              |   |   +- post-process       |
              |   +- post-process           |
| +- post-process             |
              +-----------------------------+
                             |
                    7. Extract ResponseMetadata
                             |
                    +--------v---------+
                    | ResponseExtractor|
                    +--------+---------+
                             |
                    8. Re-attach raw body
                       to response
                             |
                    9. Return response
                       + metadata

Steps in detail:

  1. Read and parse the request body into BodyMetadata. The raw bytes are preserved for later use.
  2. Resolve the upstream URL from the metadata (model name, provider config).
  3. Create a new HTTP request for the upstream provider and copy relevant headers from the original request.
  4. Enrich the upstream request with provider-specific authentication (Bearer token, API key, AWS Signature V4).
  5. Store BodyMetadata in the request context so interceptors can access it.
  6. Execute the request through the interceptor chain. Each interceptor wraps the next, forming an onion-style pipeline.
  7. Extract ResponseMetadata from the upstream response.
  8. Re-attach the raw response body so custom JSON fields from the provider are preserved.
  9. Return the response and metadata to the caller.

Auto-Routing

The AutoRouter enables automatic provider and API detection from a single endpoint. POST to / with any LLM request and routing happens automatically.

API Type Detection

Detection happens in two phases:

Phase 1: Path-based detection

Path API Type
/v1/chat/completions Chat Completions
/v1/responses Responses
/v1/completions Legacy Completions
/v1/messages Anthropic Messages
:generateContent Gemini GenerateContent
/converse Bedrock Converse

Phase 2: Body + Provider detection (when path is / or unknown)

Body Field Provider API Type
input any Responses
prompt any Completions
contents any GenerateContent
messages anthropic Messages
messages other Chat Completions
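The phase-2 table can be sketched as a simple switch over the parsed body fields plus the detected provider name. This is a simplified stand-in; the library's real detection covers more cases:

```go
package main

import "fmt"

// detectAPIType mirrors the phase-2 table above: body fields first,
// then provider name for the ambiguous "messages" case.
func detectAPIType(body map[string]any, provider string) string {
	switch {
	case body["input"] != nil:
		return "Responses"
	case body["prompt"] != nil:
		return "Completions"
	case body["contents"] != nil:
		return "GenerateContent"
	case body["messages"] != nil && provider == "anthropic":
		return "Messages"
	case body["messages"] != nil:
		return "Chat Completions"
	}
	return "unknown"
}

func main() {
	fmt.Println(detectAPIType(map[string]any{"input": "Hello"}, "openai"))       // Responses
	fmt.Println(detectAPIType(map[string]any{"messages": []any{}}, "anthropic")) // Messages
}
```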

Provider Detection

Provider is detected in priority order:

  1. X-Provider header — Explicit override

    curl -X POST http://localhost:8080/ \
      -H 'X-Provider: anthropic' \
      -d '{"model":"claude-3-opus",...}'
  2. Model prefix — Provider prefix in model name (stripped before forwarding)

    # Model "openai/gpt-4" routes to OpenAI, forwards "gpt-4"
    curl -X POST http://localhost:8080/ \
      -d '{"model":"anthropic/claude-3-opus",...}'
  3. Model pattern — Match against known patterns

    Pattern Provider
    gpt-*, o1-*, o3-*, chatgpt-* OpenAI
    claude-* Anthropic
    gemini-*, gemma-* Google AI
    grok-* x.AI
    accounts/fireworks/* Fireworks
    sonar* Perplexity
    anthropic.claude-*, amazon.* Bedrock

Provider Prefix Stripping

Only known provider prefixes are stripped:

// Stripped (known providers)
"openai/gpt-4"           -> "gpt-4"
"anthropic/claude-3"     -> "claude-3"
"fireworks/models/llama" -> "models/llama"

// Preserved (unknown or model-native paths)
"accounts/fireworks/models/llama" -> "accounts/fireworks/models/llama"
"some-unknown/model"              -> "some-unknown/model"
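The stripping rule reduces to: cut at the first `/` and strip only when the prefix is a registered provider name. A minimal sketch (the provider set here is illustrative; the library consults its registry):

```go
package main

import (
	"fmt"
	"strings"
)

// knownProviders is an illustrative set; the library keeps its own registry.
var knownProviders = map[string]bool{
	"openai": true, "anthropic": true, "fireworks": true,
}

// stripProviderPrefix removes a leading "provider/" segment only when the
// prefix names a known provider; anything else is preserved verbatim.
func stripProviderPrefix(model string) string {
	prefix, rest, ok := strings.Cut(model, "/")
	if ok && knownProviders[prefix] {
		return rest
	}
	return model
}

func main() {
	fmt.Println(stripProviderPrefix("openai/gpt-4"))                    // gpt-4
	fmt.Println(stripProviderPrefix("accounts/fireworks/models/llama")) // unchanged
}
```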

Usage Examples

# Auto-detect everything - POST to /
curl -X POST http://localhost:8080/ \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

# Auto-detect Anthropic from model name
curl -X POST http://localhost:8080/ \
  -H 'Content-Type: application/json' \
  -d '{"model":"claude-3-opus","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'

# Auto-detect Responses API from input field
curl -X POST http://localhost:8080/ \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4o","input":"Hello"}'

# Traditional path-based routing still works
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'

Streaming

The proxy fully supports SSE (Server-Sent Events) streaming with efficient token usage extraction for billing.

StreamingResponseExtractor

Extends ResponseExtractor for streaming responses:

type StreamingResponseExtractor interface {
    ResponseExtractor
    ExtractStreamingWithController(resp, w, rc) -> (ResponseMetadata, error)
    IsStreamingResponse(resp) -> bool
}

All built-in providers implement this interface for streaming support.

Streaming Flow

+------------------+
| Incoming Request |
|  stream: true    |
+--------+---------+
         |
   Parse body, detect
   stream: true flag
         |
+--------v---------+
|   AutoRouter     |
| ForwardStreaming |
+--------+---------+
         |
   Auto-inject
   stream_options:
   {include_usage:true}
         |
+--------v---------+
|   HTTP Request   |
|   to upstream    |
+--------+---------+
         |
+--------v---------+
| Upstream Response|
| text/event-stream|
+--------+---------+
         |
+--------v-----------+
| StreamingExtractor |
| Parse SSE events   |
| Extract usage      |
| Flush each chunk   |
+--------+-----------+
         |
+--------v---------+
| BillingCalculator|
| Calculate cost   |
+--------+---------+
         |
+--------v---------+
|   HTTP Response  |
|   to client      |
+------------------+

Usage Extraction

OpenAI Chat Completions: Usage is sent in the final chunk before [DONE]:

data: {"id":"...","usage":{"prompt_tokens":100,"completion_tokens":50,"total_tokens":150}}
data: [DONE]

OpenAI Responses API: Usage is in the response.completed event. The StreamingMultiAPIExtractor automatically detects the API type from the request context and dispatches to the correct streaming extractor:

data: {"type":"response.created","response":{"id":"resp_123","model":"gpt-4o"}}
data: {"type":"response.output_text.delta","delta":"Hello"}
data: {"type":"response.completed","response":{"usage":{"input_tokens":10,"output_tokens":5,"total_tokens":15}}}
data: [DONE]

No stream_options.include_usage is needed for the Responses API — usage is always included in response.completed. The proxy automatically skips stream_options injection for Responses API requests.

Anthropic: Usage is sent in message_start and message_delta events:

data: {"type":"message_start","message":{"usage":{"input_tokens":100}}}
...
data: {"type":"message_delta","usage":{"output_tokens":50}}
data: {"type":"message_stop"}

Auto stream_options Injection

When BillingCalculator is configured and the request has stream: true, the proxy automatically injects stream_options.include_usage for OpenAI-compatible Chat Completions endpoints only:

{
  "stream": true,
  "stream_options": { "include_usage": true }
}

This ensures providers return token usage in their streaming responses for billing calculation.

The following are excluded from this injection because they already include usage natively:

  • Responses API — usage is always present in the response.completed event
  • Anthropic — usage is sent in message_start and message_delta events
  • Bedrock and Google AI — usage is included in their streaming event formats

Efficient Flushing

Uses http.ResponseController for optimal streaming:

rc := http.NewResponseController(w)

for each SSE event {
    w.Write(event)
    rc.Flush()  // Immediate flush after each chunk
}

Non-streaming responses also use chunked read/write/flush with a 512KB buffer for better performance.

Billing with Streaming

adapter, _ := modelsdev.LoadFromURL()

billingCallback := func(r llmproxy.BillingResult) {
    log.Printf("Cost: $%.6f", r.TotalCost)
}

router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterBillingCalculator(
        llmproxy.NewBillingCalculator(adapter.GetCostLookup(), billingCallback),
    ),
)

After the stream completes, the billing callback is invoked with the extracted token usage.


WebSocket Mode

The proxy supports the OpenAI Responses API WebSocket mode for persistent, multi-turn connections. This is useful for tool-call-heavy workflows where multiple response.create / response.completed cycles happen on a single connection.

Adapter Pattern (Zero Dependencies)

The library defines abstract WebSocket interfaces — no WebSocket library is vendored. Consumers bring their own implementation (gorilla/websocket, nhooyr.io/websocket, etc.) and wire it in via thin adapters.

Interfaces

// WSConn abstracts a WebSocket connection.
// gorilla/websocket's *Conn satisfies this directly.
type WSConn interface {
    ReadMessage() (messageType int, p []byte, err error)
    WriteMessage(messageType int, data []byte) error
    Close() error
}

// WSUpgrader upgrades an HTTP request to a WebSocket connection.
type WSUpgrader interface {
    Upgrade(w http.ResponseWriter, r *http.Request, responseHeader http.Header) (WSConn, error)
}

// WSDialer dials a WebSocket connection to an upstream server.
type WSDialer interface {
    DialContext(ctx context.Context, urlStr string, requestHeader http.Header) (WSConn, *http.Response, error)
}

// WebSocketCapableProvider is implemented by providers that support WebSocket mode.
type WebSocketCapableProvider interface {
    Provider
    WebSocketURL(meta BodyMetadata) (*url.URL, error)
}

The OpenAI provider implements WebSocketCapableProvider. Other providers can opt in by implementing the same interface.

gorilla/websocket Example

gorilla's *websocket.Conn already satisfies WSConn — no wrapper needed. You only need thin adapters for Upgrader and Dialer because their return types differ (*websocket.Conn vs WSConn):

package myadapter

import (
    "context"
    "net/http"

    "github.com/agentuity/llmproxy"
    "github.com/gorilla/websocket"
)

// GorillaUpgrader wraps gorilla's Upgrader to satisfy llmproxy.WSUpgrader.
type GorillaUpgrader struct {
    Upgrader websocket.Upgrader
}

func (u *GorillaUpgrader) Upgrade(w http.ResponseWriter, r *http.Request, h http.Header) (llmproxy.WSConn, error) {
    conn, err := u.Upgrader.Upgrade(w, r, h)
    if err != nil {
        return nil, err
    }
    return conn, nil // *websocket.Conn satisfies WSConn
}

// GorillaDialer wraps gorilla's Dialer to satisfy llmproxy.WSDialer.
type GorillaDialer struct {
    Dialer websocket.Dialer
}

func (d *GorillaDialer) DialContext(ctx context.Context, urlStr string, h http.Header) (llmproxy.WSConn, *http.Response, error) {
    conn, resp, err := d.Dialer.DialContext(ctx, urlStr, h)
    if err != nil {
        return nil, resp, err
    }
    return conn, resp, nil // *websocket.Conn satisfies WSConn
}

Wiring It Together

router := llmproxy.NewAutoRouter(
    llmproxy.WithAutoRouterFallbackProvider(openaiProvider),
    // In production, configure CheckOrigin to validate against trusted origins.
    llmproxy.WithAutoRouterWebSocket(
        &myadapter.GorillaUpgrader{
            Upgrader: websocket.Upgrader{
                CheckOrigin: func(r *http.Request) bool {
                    // Validate r.Header.Get("Origin") against a whitelist
                    return isAllowedOrigin(r)
                },
            },
        },
        &myadapter.GorillaDialer{
            Dialer: websocket.Dialer{},
        },
    ),
    llmproxy.WithAutoRouterWSBillingCallback(func(turn int, meta llmproxy.ResponseMetadata, billing *llmproxy.BillingResult) {
        log.Printf("Turn %d: model=%s prompt=%d completion=%d",
            turn, meta.Model, meta.Usage.PromptTokens, meta.Usage.CompletionTokens)
        if billing != nil {
            log.Printf("  Cost: $%.6f", billing.TotalCost)
        }
    }),
)
router.RegisterProvider(openaiProvider)

http.Handle("/", router)

WebSocket support is opt-in — if WithAutoRouterWebSocket is not called, WebSocket upgrade requests are rejected and normal HTTP handling is unchanged.

WebSocket Protocol

The OpenAI Responses API WebSocket mode operates at wss://api.openai.com/v1/responses:

Aspect Detail
Client sends {"type":"response.create","model":"gpt-4o","input":[...]}
Server sends Same events as SSE: response.created, response.output_text.delta, response.completed, etc.
Multi-turn Client sends new response.create with previous_response_id on same connection
Usage In response.completed under response.usage.{input_tokens, output_tokens, total_tokens}
Frames JSON text frames — no data: prefix (unlike SSE)

WebSocket Flow

+------------------+        +------------------+        +------------------+
|     Client       |        |    AutoRouter    |        |    Upstream      |
|                  |        |                  |        |   (OpenAI)       |
+--------+---------+        +--------+---------+        +--------+---------+
         |                           |                           |
   1. WS Upgrade  -------->  2. Accept upgrade                  |
         |                   3. Read first message               |
         |                      (response.create)                |
         |                   4. Detect provider/model            |
         |                   5. Strip model prefix               |
         |                   6. Enrich headers (auth)            |
         |                   7. Dial upstream WS  -------->  8. Accept
         |                   9. Forward first msg  -------->     |
         |                           |                           |
         |                  === Bidirectional Relay ===          |
         |                           |                           |
  10. response.create ----->  Strip prefix  ------------>  response.create
         |                           |                           |
         |                           |  <-----------  response.created
  response.created  <------          |                           |
         |                           |  <-----------  response.output_text.delta
  text delta  <----------------      |                           |
         |                           |  <-----------  response.completed
  response.completed  <----  Extract usage,            (includes usage)
         |                   compute billing                     |
         |                           |                           |
  11. response.create ----->  Strip prefix  ------------>  (next turn)
         |                     ...                        ...
         |                           |                           |
  12. Close  ----------------> Close both  ------------>  Close

Billing

Billing is calculated per turn — each response.completed event triggers:

  1. Usage extraction (input_tokens, output_tokens, total_tokens, cache details)
  2. Cost calculation via BillingCalculator (if configured)
  3. WSBillingCallback invocation with the turn number, response metadata, and billing result

For multi-turn connections, each turn is billed independently. The callback receives an incrementing turn counter so consumers can aggregate if needed.
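One way to aggregate per-turn costs is a closure-based callback. The `BillingResult` struct here is a local stand-in for the library's type (only `TotalCost` is assumed), and the mutex reflects that relay goroutines may invoke the callback:

```go
package main

import (
	"fmt"
	"sync"
)

// BillingResult is a minimal stand-in for llmproxy.BillingResult.
type BillingResult struct{ TotalCost float64 }

// newAggregator returns a per-turn callback and a total() accessor.
// The callback tolerates nil results (billing not configured).
func newAggregator() (cb func(turn int, b *BillingResult), total func() float64) {
	var mu sync.Mutex
	var sum float64
	cb = func(turn int, b *BillingResult) {
		if b == nil {
			return
		}
		mu.Lock()
		sum += b.TotalCost
		mu.Unlock()
	}
	total = func() float64 {
		mu.Lock()
		defer mu.Unlock()
		return sum
	}
	return cb, total
}

func main() {
	cb, total := newAggregator()
	cb(1, &BillingResult{TotalCost: 0.002})
	cb(2, &BillingResult{TotalCost: 0.003})
	cb(3, nil) // turn with billing disabled
	fmt.Printf("$%.3f\n", total())
	// → $0.005
}
```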

Model Prefix Stripping

Works the same as HTTP mode — openai/gpt-4o is stripped to gpt-4o in all response.create messages. Non-response.create messages are forwarded byte-for-byte without modification.

Connection Lifecycle

  • Both relay goroutines use a sync.Once-guarded close — when either side closes, the other is closed immediately
  • WebSocket close errors (io.EOF, "connection closed") are treated as normal termination, not errors
  • The ForwardWebSocket method blocks until both relay goroutines complete

Providers

Nine providers are included. Six share the OpenAI-compatible base; three have fully custom implementations.

OpenAI-Compatible Base

providers/openai_compatible implements BodyParser, ResponseExtractor, URLResolver, and RequestEnricher for the OpenAI chat completions format. Providers that speak this protocol embed the base and override only what differs (name, base URL, auth configuration).

The OpenAI provider also supports the Responses API (/v1/responses) with automatic detection based on the input field in the request body.

Provider Table

Provider Package Auth API Format Notes
OpenAI providers/openai Bearer token Chat completions, Responses Supports both APIs with auto-detection
Anthropic providers/anthropic x-api-key header + anthropic-version Anthropic Messages API Custom parser/extractor
Groq providers/groq Bearer token OpenAI-compatible Wraps openai_compatible
Fireworks providers/fireworks Bearer token OpenAI-compatible Wraps openai_compatible
x.AI providers/xai Bearer token OpenAI-compatible Wraps openai_compatible
Google AI providers/googleai API key in URL query param Gemini generateContent Custom parser/extractor/resolver
AWS Bedrock providers/bedrock AWS Signature V4 Converse API Custom everything; supports both Converse and InvokeModel
Azure OpenAI providers/azure api-key header or Azure AD Bearer token OpenAI chat completions Uses deployments instead of models
OpenAI-compatible providers/openai_compatible Bearer token OpenAI chat completions Reusable base for OpenAI-compatible providers

Provider Details

OpenAI — Wraps openai_compatible with support for multiple APIs:

  • Chat Completions (/v1/chat/completions) — Standard messages-based API
  • Responses (/v1/responses) — Newer API with input field, built-in tools support. Supports both HTTP (SSE streaming) and WebSocket modes.
  • Legacy Completions (/v1/completions) — Older prompt-based API

The provider auto-detects the API type from the request body:

  • input field → Responses API
  • prompt field → Completions API
  • messages field → Chat Completions API

The OpenAI provider implements WebSocketCapableProvider, enabling persistent WebSocket connections for multi-turn Responses API workflows when WithAutoRouterWebSocket is configured.

Anthropic — Custom body parser translates between the proxy's canonical format and Anthropic's Messages API. Custom extractor maps Anthropic's response shape (content blocks, stop_reason) back to ResponseMetadata. Auth uses the x-api-key header alongside an anthropic-version header.

Groq, Fireworks, x.AI — Each wraps openai_compatible with its own base URL and provider name. No custom parsing or extraction logic needed.

Google AI — Custom body parser converts to the Gemini generateContent format (contents/parts). Custom URL resolver appends the API key as a query parameter rather than using a header. Custom extractor maps Gemini's response (candidates/content/parts) back to ResponseMetadata.

AWS Bedrock — Fully custom implementation. The body parser converts to the Bedrock Converse API format. The enricher computes AWS Signature V4 signing using region, access key, and secret key. The URL resolver constructs region-specific endpoints. Supports both the Converse and InvokeModel API paths.

Azure OpenAI — Uses deployments instead of direct model access. Two construction modes:

  • New(resourceName, deploymentID, apiVersion, opts...) — Fixed deployment, uses configured deploymentID for all requests
  • NewWithDynamicDeployment(resourceName, apiVersion, opts...) — Dynamic deployment, uses the model field from each request as the deployment name

Authentication via functional options:

  • WithAPIKey(apiKey) — Sets api-key header
  • WithAzureADToken(token) — Sets Authorization: Bearer header
  • WithAzureADTokenRefresher(fn) — Token refresh callback for expiring Azure AD tokens

URL format: https://{resource}.openai.azure.com/openai/deployments/{deployment}/chat/completions?api-version={version}


Interceptors

Eight built-in interceptors are provided in the interceptors/ package.

Logging

NewLogging(logger) — Logs each request/response cycle with:

  • Model name
  • HTTP method and URL
  • Response status
  • Latency
  • Token usage (prompt, completion, total)

Accepts an llmproxy.Logger interface, which is compatible with github.com/agentuity/go-common/logger without requiring the dependency.

Metrics

NewMetrics(m) — Tracks aggregate statistics:

Field Description
TotalRequests Number of requests processed
TotalTokens Total tokens consumed
TotalPromptTokens Total input tokens
TotalCompletionTokens Total output tokens
TotalLatency Cumulative request duration
Errors Number of failed requests

All fields use sync/atomic operations for thread safety. The Metrics struct can be read concurrently from monitoring goroutines.

Retry

NewRetry(maxAttempts, delay) — Retries failed requests under specific conditions:

  • Retries on: HTTP 429 (rate limit) and 5xx (server errors)
  • Does NOT retry: context.Canceled, context.DeadlineExceeded
  • Body handling: Reconstructs the request body from raw bytes on each retry attempt
  • Custom predicate: NewRetryWithPredicate(maxAttempts, delay, predicate) allows callers to supply a custom function that decides whether a given response should be retried

NewRetryWithRateLimitHeaders(maxAttempts, defaultDelay) — Retries with rate limit header support:

  • Retry-After header: Parses both seconds (integer) and HTTP date formats
  • X-RateLimit-Reset header: Fallback if Retry-After not present
  • Max delay: Values over 24 hours are ignored (fallback to defaultDelay); exactly 24 hours is accepted
  • Precedence: Retry-After takes precedence over X-RateLimit-Reset

Example:

// Use rate limit headers from provider
retry := interceptors.NewRetryWithRateLimitHeaders(3, time.Second)

// If provider returns 429 with Retry-After: 30, waits 30s instead of 1s

Billing

NewBilling(lookup, onResult) — Calculates the cost of each request:

  • Uses a CostLookup function to retrieve per-model pricing
  • Detects the provider from the model name
  • Computes input/output/cache costs based on token usage
  • Calls the onResult callback with a BillingResult after each request
  • Stores BillingResult in ResponseMetadata.Custom["billing_result"] for downstream access

When using AutoRouter, billing results are automatically added as response headers:

  • X-Gateway-Cost — Total cost in USD
  • X-Gateway-Prompt-Tokens — Input token count
  • X-Gateway-Completion-Tokens — Output token count

Tracing

NewTracing(extractor) — Propagates OpenTelemetry trace context:

  • Upstream headers:
    • X-Request-ID: the trace ID (32 hex chars)
    • traceparent: W3C Trace Context format (00-{traceid}-{spanid}-{flags})
  • Response header: X-Request-ID for correlation
  • Extractor function: Pulls trace info from request context

The extractor function signature:

func(ctx context.Context) TraceInfo

For OpenTelemetry:

func otelExtractor(ctx context.Context) interceptors.TraceInfo {
    span := trace.SpanFromContext(ctx)
    if !span.SpanContext().IsValid() {
        return interceptors.TraceInfo{}
    }
    return interceptors.TraceInfo{
        TraceID: span.SpanContext().TraceID(),
        SpanID:  span.SpanContext().SpanID(),
        Sampled: span.SpanContext().IsSampled(),
    }
}

Use NewTracingWithHeader(extractor, headerName) to customize the response header name.

HeaderBan

NewHeaderBan(requestHeaders, responseHeaders) — Strips specified headers:

  • Request headers: Removed before forwarding upstream
  • Response headers: Removed before returning to caller
  • Case-insensitive: HTTP header matching is case-insensitive

Convenience constructors:

  • NewResponseHeaderBan(headers...) — Strip only response headers
  • NewRequestHeaderBan(headers...) — Strip only request headers

Example:

ban := interceptors.NewResponseHeaderBan("Openai-Organization", "Openai-Project")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(ban))

AddHeader

NewAddHeader(requestHeaders, responseHeaders) — Adds custom headers to requests and responses:

  • Request headers: Added before forwarding upstream
  • Response headers: Added before returning to caller

Convenience constructors:

  • NewAddResponseHeader(headers...) — Add only response headers
  • NewAddRequestHeader(headers...) — Add only request headers

Example:

add := interceptors.NewAddResponseHeader(
    interceptors.NewHeader("X-Gateway-Version", "1.0"),
    interceptors.NewHeader("X-Served-By", "llmproxy"),
)
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(add))

PromptCaching

Provider-specific prompt caching interceptors for Anthropic, OpenAI, xAI, Fireworks, and AWS Bedrock.

Common Behavior

  • Cache-Control header: If the incoming request has Cache-Control: no-cache, the interceptor skips entirely — letting clients disable caching per-request
  • Provider detection: Only applies to matching models:
    • Anthropic: claude-*
    • OpenAI: gpt-*, o1-*, o3-*, o4-*, chatgpt-*
    • xAI: grok-*
    • Fireworks: accounts/fireworks/*, fireworks*
    • Bedrock: anthropic.claude-*, amazon.nova-*, amazon.titan-*
  • Cache usage tracking: Response metadata includes CacheUsage in Custom["cache_usage"]

Anthropic

NewAnthropicPromptCaching(retention) — Enables Anthropic prompt caching:

  • Automatic caching: Adds cache_control at the top level of requests
  • Retention options:
    • CacheRetentionDefault (default, 5 min) — no TTL field, free, auto-refreshed on use
    • CacheRetention1h — adds ttl: "1h", costs more, longer cache lifetime
  • User-controlled caching: If request already has cache_control, the interceptor skips entirely — letting you control caching explicitly via block-level breakpoints

Example:

// Enable prompt caching for Anthropic (default 5 min, free)
caching := interceptors.NewAnthropicPromptCaching(interceptors.CacheRetentionDefault)
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With 1h retention (costs more) and cache usage callback
caching := interceptors.NewAnthropicPromptCachingWithResult(interceptors.CacheRetention1h, func(u llmproxy.CacheUsage) {
    log.Printf("Cache read: %d tokens, Cache write: %d tokens", u.CacheReadInputTokens, u.CacheCreationInputTokens)
})

OpenAI

NewOpenAIPromptCaching(retention, cacheKey) — Enables OpenAI prompt caching:

  • Automatic caching: OpenAI caches prompts ≥ 1024 tokens automatically
  • Cache routing: Adds prompt_cache_key to improve cache hit rates for requests with common prefixes
  • Retention options:
    • CacheRetentionDefault (default, in-memory, 5-10 min) — no retention field
    • CacheRetention24h — adds prompt_cache_retention: "24h" for GPT-5.x and GPT-4.1
  • Cache key sources (in priority order):
    1. X-Cache-Key header from incoming request
    2. Configured CacheKey in PromptCachingConfig
    3. Auto-derived from static content prefix via DeriveCacheKeyFromPrefix()
  • Tenant namespacing: Cache keys are automatically prefixed with org/tenant ID from:
    1. Custom OrgIDExtractor function
    2. OrgID in MetaContextValue stored in request context
    3. X-Org-ID header
    4. org_id in BodyMetadata.Custom
    5. Configured Namespace fallback

Example:

// Enable prompt caching for OpenAI with a cache key (default retention)
caching := interceptors.NewOpenAIPromptCaching(interceptors.CacheRetentionDefault, "my-app-session-123")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With 24h retention and cache usage callback
caching := interceptors.NewOpenAIPromptCachingWithResult(interceptors.CacheRetention24h, "my-key", func(u llmproxy.CacheUsage) {
    log.Printf("Cached tokens: %d", u.CachedTokens)
})

// Auto-derive cache key from static content, namespace by tenant
caching := interceptors.NewOpenAIPromptCachingAuto("tenant-123", interceptors.CacheRetentionDefault)

// Custom org ID extractor (e.g., from auth context)
caching := interceptors.NewOpenAIPromptCachingWithOrgExtractor(
    interceptors.CacheRetentionDefault, 
    "my-key",
    func(ctx context.Context, req *http.Request, meta llmproxy.BodyMetadata) string {
        return getOrgFromAuthContext(ctx)
    },
)

xAI (Grok)

NewXAIPromptCaching(convID) — Enables xAI/Grok prompt caching:

  • Automatic prefix caching: xAI caches from the start of the messages array automatically
  • Cache routing: Adds x-grok-conv-id HTTP header to route requests to the same server where cache lives
  • Conversation ID: Use a stable value (conversation ID, session ID, or deterministic hash of static content)
  • Key rule: Never reorder or modify earlier messages — only append

Example:

// Enable prompt caching for xAI with a conversation ID
caching := interceptors.NewXAIPromptCaching("conv-abc123-tenant456")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With cache usage callback
caching := interceptors.NewXAIPromptCachingWithResult("my-conv-id", func(u llmproxy.CacheUsage) {
    log.Printf("Cached tokens: %d", u.CachedTokens)
})

Fireworks

NewFireworksPromptCaching(sessionID) — Enables Fireworks prompt caching:

  • Automatic caching: Fireworks caches prompts with shared prefixes automatically (enabled by default)
  • Cache routing: Adds x-session-affinity HTTP header to route requests to the same replica
  • Tenant isolation: Adds x-prompt-cache-isolation-key header set to org/tenant ID for multi-tenant isolation
  • Cache usage: Reads fireworks-cached-prompt-tokens response header for cache hit tracking

Example:

// Enable prompt caching for Fireworks with session affinity
caching := interceptors.NewFireworksPromptCaching("session-abc123")
proxy := llmproxy.NewProxy(provider, llmproxy.WithInterceptor(caching))

// With org ID extractor for tenant isolation
caching := interceptors.NewFireworksPromptCachingWithOrgExtractor("session-abc123", func(ctx context.Context, req *http.Request, meta llmproxy.BodyMetadata) string {
    return getOrgFromAuthContext(ctx)
})

// With cache usage callback
caching := interceptors.NewFireworksPromptCachingWithResult("session-abc123", func(u llmproxy.CacheUsage) {
    log.Printf("Cached tokens: %d", u.CachedTokens)
})

AWS Bedrock

NewBedrockPromptCaching(retention) — Enables AWS Bedrock prompt caching via the Converse API:

  • Cache checkpoints: Adds cachePoint objects to system, messages, and toolConfig
  • Retention options:
    • CacheRetentionDefault (default, 5 min) — no TTL field
    • CacheRetention1h — adds ttl: "1h" for Claude Opus 4.5, Haiku 4.5, and Sonnet 4.5
  • Minimum tokens: 1,024 tokens per cache checkpoint (varies by model)
  • Maximum checkpoints: 4 per request
  • Supported models: Claude models (anthropic.claude-*), Nova models (amazon.nova-*), Titan models (amazon.titan-*)
  • Cache usage: Reads cacheReadInputTokens, cacheWriteInputTokens, and cacheDetails from response

Example:

// Enable prompt caching for Bedrock (default 5 min)
caching := interceptors.NewBedrockPromptCaching(interceptors.CacheRetentionDefault)
proxy := llmproxy.NewProxy(bedrockProvider, llmproxy.WithInterceptor(caching))

// With 1h retention for Claude Opus 4.5
caching := interceptors.NewBedrockPromptCaching(interceptors.CacheRetention1h)

// With cache usage callback
caching := interceptors.NewBedrockPromptCachingWithResult(interceptors.CacheRetentionDefault, func(u llmproxy.CacheUsage) {
    log.Printf("Cache read: %d tokens, Cache write: %d tokens", u.CachedTokens, u.CacheWriteTokens)
    for _, detail := range u.CacheDetails {
        log.Printf("  TTL %s: %d tokens written", detail.TTL, detail.CacheWriteTokens)
    }
})

Azure OpenAI

Azure OpenAI uses the same prompt_cache_key body parameter as OpenAI. Use the OpenAI interceptor for Azure OpenAI:

// Azure OpenAI prompt caching uses the OpenAI interceptor
caching := interceptors.NewOpenAIPromptCaching(interceptors.CacheRetentionDefault, "my-cache-key")
proxy := llmproxy.NewProxy(azureProvider, llmproxy.WithInterceptor(caching))

Note: Azure OpenAI caches prompts ≥ 1,024 tokens automatically. The prompt_cache_key parameter is combined with the prefix hash to improve cache hit rates. Cache hits appear as cached_tokens in prompt_tokens_details in the response.

Generic constructor

NewPromptCaching(provider, config) — Creates a caching interceptor for any provider:

// Anthropic with 1h retention
caching := interceptors.NewPromptCaching("anthropic", interceptors.PromptCachingConfig{
    Enabled:   true,
    Retention: interceptors.CacheRetention1h,
})

// OpenAI with 24h retention
caching := interceptors.NewPromptCaching("openai", interceptors.PromptCachingConfig{
    Enabled:   true,
    Retention: interceptors.CacheRetention24h,
    CacheKey:  "my-cache-key",
})

// xAI with conversation ID
caching := interceptors.NewPromptCaching("xai", interceptors.PromptCachingConfig{
    Enabled:  true,
    CacheKey: "my-conv-id",
})

// Fireworks with session ID and org extractor
// Fireworks with session ID and org extractor
caching := interceptors.NewPromptCaching("fireworks", interceptors.PromptCachingConfig{
    Enabled:        true,
    CacheKey:       "my-session-id",
    OrgIDExtractor: interceptors.DefaultOrgIDExtractor,
})

// Bedrock with 1h retention
caching := interceptors.NewPromptCaching("bedrock", interceptors.PromptCachingConfig{
    Enabled:   true,
    Retention: interceptors.CacheRetention1h,
})

Pricing System

pricing/modelsdev/ provides an adapter that loads pricing data from models.dev.

Loading

Three source options:

  • LoadFromFile(path) — Load from a local JSON file
  • LoadFromURL() — Load from the default models.dev URL
  • Custom URL via WithURL(url) option

Options

  • WithMarkup(multiplier) — Apply a markup for reselling (e.g., 1.2 = 20% markup on all prices)

Integration

adapter := modelsdev.LoadFromURL()
lookup := adapter.GetCostLookup()
billing := interceptors.NewBilling(lookup, func(result llmproxy.BillingResult) {
    // handle billing result
})

GetCostLookup() returns a CostLookup function suitable for the billing interceptor.

Caching

The adapter supports TTL-based refresh so pricing data stays current without restarting the process.


Logger Interface

type Logger interface {
    Debug(msg string, args ...interface{})
    Info(msg string, args ...interface{})
    Warn(msg string, args ...interface{})
    Error(msg string, args ...interface{})
}

Matches the signature of github.com/agentuity/go-common/logger without requiring it as a dependency. Any logger implementing these four methods works.

LoggerFunc is an adapter that wraps a plain function as a Logger, useful for testing or simple setups.


Design Principles

  1. Small, focused interfaces — Each interface (BodyParser, RequestEnricher, URLResolver, ResponseExtractor) does exactly one thing. This keeps implementations simple and testable.

  2. Composition over inheritance — Providers compose the four core interfaces rather than inheriting from a base class. BaseProvider wires them together via functional options. OpenAI-compatible providers embed a shared base and override only what differs.

  3. Raw body preservation — Both request and response bodies are preserved as raw bytes throughout the lifecycle. This avoids losing custom JSON fields that providers may include but that the library's typed structs don't model.

  4. Function-based lookup — CostLookup is a function type, not an interface with a concrete implementation. This allows callers to manage pricing data externally (database, config file, remote service) without coupling to a specific source.

  5. Interceptor chain — Cross-cutting concerns (logging, metrics, retry, billing) wrap the request lifecycle without modifying providers. Interceptors compose independently and execute in a predictable onion order.


Directory Structure

llmproxy/
├── apitype.go              # API type detection and constants
├── autorouter.go           # AutoRouter, provider/API auto-detection, streaming
├── autorouter_websocket.go # WebSocket forwarding, bidirectional relay
├── billing.go              # CostInfo, CostLookup, BillingResult, CalculateCost
├── billing_calculator.go   # BillingCalculator for streaming/non-streaming
├── detection.go            # Provider detection from model/header
├── enricher.go             # RequestEnricher interface
├── extractor.go            # ResponseExtractor, StreamingResponseExtractor interface
├── interceptor.go          # Interceptor, InterceptorChain, RoundTripFunc
├── internal/
│   └── fastjson/
│       └── extractor.go    # Fast JSON parsing with simdjson-go
├── logger.go               # Logger interface, LoggerFunc adapter
├── metadata.go             # BodyMetadata, ResponseMetadata, Message, Usage, Choice
├── parser.go               # BodyParser interface
├── provider.go             # Provider interface, BaseProvider
├── proxy.go                # Proxy struct, Forward method
├── registry.go             # Registry interface, MapRegistry
├── resolver.go             # URLResolver interface
├── streaming.go            # SSE parser, streaming types, usage extraction
├── websocket.go            # WebSocket interfaces, message parsing, usage extraction
├── interceptors/
│   ├── addheader.go        # AddHeaderInterceptor
│   ├── billing.go          # BillingInterceptor
│   ├── headerban.go        # HeaderBanInterceptor
│   ├── logging.go          # LoggingInterceptor
│   ├── metrics.go          # MetricsInterceptor, Metrics
│   ├── promptcaching.go    # PromptCachingInterceptor
│   ├── retry.go            # RetryInterceptor
│   └── tracing.go          # TracingInterceptor
├── pricing/
│   └── modelsdev/
│       └── adapter.go      # models.dev pricing adapter
├── providers/
│   ├── anthropic/          # Anthropic Messages API
│   │   └── streaming_extractor.go  # Anthropic SSE streaming
│   ├── azure/              # Azure OpenAI
│   ├── bedrock/            # AWS Bedrock Converse API
│   │   └── streaming_extractor.go  # Bedrock streaming
│   ├── fireworks/          # Fireworks (OpenAI-compatible)
│   ├── googleai/           # Google AI Gemini
│   │   └── streaming_extractor.go  # Google AI streaming
│   ├── groq/               # Groq (OpenAI-compatible)
│   ├── openai/             # OpenAI (Chat Completions + Responses + WebSocket)
│   ├── openai_compatible/  # Base for OpenAI-compatible providers
│   │   ├── multiapi.go                      # Multi-API parser/extractor dispatch
│   │   ├── streaming_extractor.go           # Chat Completions SSE streaming
│   │   ├── responses_parser.go              # Responses API parser
│   │   ├── responses_extractor.go           # Responses API extractor
│   │   ├── responses_streaming_extractor.go # Responses API SSE streaming
│   │   └── websocket.go                     # WebSocket URL resolver
│   └── xai/                # x.AI (OpenAI-compatible)
└── examples/
    └── basic/              # Multi-provider proxy server example (uses AutoRouter)