Product Name: StackBilt LLM Gateway
Package Name: @stackbilt/llm-gateway
Status: Concept → MVP Definition
Primary User: Independent AI developer, local agent-tool power user, AI engineering team, platform/governance operator
Core Idea: A local-first LLM gateway that lets Claude Code, OpenAI Codex, and future agent CLIs route through a lightweight localhost proxy before reaching external model providers.
StackBilt LLM Gateway provides protocol-compatible local endpoints for AI coding tools, then delegates model selection, failover, cost optimization, cache hints, and provider abstraction to @stackbilt/llm-providers.
The gateway is not intended to be a full LiteLLM clone. It is a focused agent-dev control plane for reducing model spend, improving routing intelligence, and creating telemetry/governance hooks around local AI coding workflows.
Modern AI coding tools such as Claude Code and Codex can become expensive because they repeatedly send large context payloads, tool schemas, repository instructions, and conversational history to high-cost frontier models.
Developers increasingly want to route these tools through local or self-hosted proxies so they can:
- Use cheaper models for simple requests.
- Reserve premium models for hard reasoning and tool-heavy tasks.
- Cache repeated prompt prefixes and static context.
- Track token usage and provider costs across tools.
- Avoid being locked into one provider.
- Experiment with Cloudflare Workers AI, Groq, Cerebras, OpenAI, Anthropic, and other providers behind one local interface.
- Add governance and observability around agent behavior.
Existing proxy solutions are either too generic, too provider-specific, or too focused on raw OpenAI compatibility. StackBilt needs a gateway shaped specifically for local AI coding workflows and future governed agent orchestration.
The MVP should allow a developer to run a local server that:
- Accepts Claude Code requests through an Anthropic-compatible local endpoint.
- Accepts Codex requests through OpenAI-compatible local endpoints.
- Converts client-specific request formats into a normalized internal
LLMRequest. - Routes requests through
@stackbilt/llm-providers. - Supports streaming responses.
- Applies simple routing policy based on request type, tool usage, cost, and provider health.
- Provides local metrics for provider, model, token usage, latency, cost estimate, fallback path, and cache status.
- Provides a safe default configuration that avoids breaking coding-agent workflows.
Over time, StackBilt LLM Gateway should become:
- The local-first model-routing layer for agentic developer workflows.
- A practical bridge between Claude Code, Codex, Gemini CLI, OpenCode, Continue, and future coding agents.
- A cost-control layer for AI-assisted software development.
- A governance telemetry source for StackBilt / Digital CSA.
- A proving ground for model-routing heuristics before they are pushed into Cloudflare edge deployments.
The MVP should not attempt to:
- Become a full replacement for LiteLLM.
- Support every provider protocol on day one.
- Provide a hosted multi-tenant SaaS gateway.
- Implement advanced semantic caching before basic prefix/request caching works.
- Rewrite the model-selection logic already present in
@stackbilt/llm-providers. - Automatically downgrade tool-heavy requests to weak models without compatibility testing.
- Support production enterprise auth, RBAC, or billing in the first version.
- Depend on Cloudflare Workers for the local MVP.
A developer using Claude Code, Codex, or similar tools locally who wants lower cost and better control over provider routing.
Needs:
- Easy local install.
- One command to start the gateway.
- Drop-in config for Claude Code and Codex.
- Sensible defaults.
- Cost and token visibility.
A developer building local or repo-specific agent workflows who wants to experiment with routing policies and model compatibility.
Needs:
- Adapter APIs.
- Request/response logs.
- Configurable routing rules.
- Provider health data.
- Easy integration with custom tools.
A technical lead who wants visibility into how AI coding agents are being used across projects.
Needs:
- Structured telemetry.
- Audit-friendly event logs.
- Provider and model attribution.
- Cost estimates by project/session/tool.
- Future integration with Digital CSA governance.
As a developer, I want Claude Code to send requests to localhost first so that I can route some work to cheaper models while keeping Claude-compatible behavior.
Acceptance Criteria:
- User can set
ANTHROPIC_BASE_URL=http://localhost:8787. - User can start the gateway locally.
- Claude Code can complete a basic prompt through the gateway.
- Gateway returns Anthropic-compatible non-streaming responses.
- Gateway returns Anthropic-compatible streaming responses.
- Errors are translated into a format Claude Code can understand.
As a developer, I want Codex to use StackBilt Gateway as a custom OpenAI-compatible provider.
Acceptance Criteria:
- User can configure Codex with a custom provider pointing at
http://localhost:8787/v1. - Gateway supports either
/v1/responsesor/v1/chat/completions. - Codex can complete a basic coding prompt through the gateway.
- Streaming is supported where the client expects it.
As a developer, I want cheap/simple requests routed to low-cost providers and complex/tool-heavy requests routed to safer premium models.
Acceptance Criteria:
- Gateway classifies requests into a small set of route classes.
- Gateway can prefer Cloudflare, Groq, or Cerebras for simple requests.
- Gateway can prefer Anthropic/OpenAI for high-risk tool-heavy requests.
- Routing decision is recorded in local telemetry.
- User can override routing policy in config.
As a developer, I want failed, rate-limited, or degraded providers to fall back automatically.
Acceptance Criteria:
- Gateway uses
@stackbilt/llm-providerscircuit-breaker/fallback behavior. - Provider failures do not crash the local server.
- Fallback chain is recorded in telemetry.
- Final response includes usage/provider metadata where possible.
As a developer, I want to see which tools, providers, and models are consuming tokens.
Acceptance Criteria:
- Gateway exposes
GET /health. - Gateway exposes
GET /metrics. - Gateway records per-request events locally.
- Metrics include client, route class, provider, model, latency, token usage, cost estimate, cache hit/miss, and fallback path.
The gateway must run as a local development server.
- Provide CLI command:
npx @stackbilt/llm-gateway start --port 8787- Default port:
8787. - Runtime: Node.js 18+.
- Optional future runtime: Bun.
- Must load provider keys from environment variables.
- Must load gateway config from
gateway.config.json,stackbilt.gateway.json, or CLI flags.
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GROQ_API_KEY=gsk_...
export CEREBRAS_API_KEY=csk-...
export STACKBILT_GATEWAY_KEY=local-dev-keyThe gateway must provide an Anthropic-compatible endpoint for Claude Code.
POST /v1/messages-
Accept Anthropic Messages-style requests.
-
Normalize request into internal
LLMRequest. -
Preserve important fields:
modelmessagessystemmax_tokenstemperaturetoolstool_choicestream
-
Convert internal response back to Anthropic-compatible shape.
-
Support streaming SSE.
-
Translate provider errors into Anthropic-compatible error objects.
If a Claude Code request includes tools, the gateway must route only to models/providers marked as tool-compatible and Claude-Code-safe unless the user explicitly enables experimental routing.
The gateway should provide an OpenAI Responses-compatible endpoint for Codex.
POST /v1/responses- Accept OpenAI Responses-style requests.
- Normalize input into internal
LLMRequest. - Support streaming where needed.
- Convert internal response into Responses API-compatible output.
- Preserve tool/function call fields where possible.
The gateway should optionally provide a Chat Completions-compatible endpoint for broader compatibility.
POST /v1/chat/completions- Accept OpenAI Chat Completions-style requests.
- Normalize messages into internal
LLMRequest. - Convert response into Chat Completions-compatible shape.
- Support streaming chunks.
Each client protocol should be implemented as an adapter.
interface ClientAdapter<ClientRequest, ClientResponse> {
toLLMRequest(input: ClientRequest, context: GatewayRequestContext): LLMRequest;
fromLLMResponse(output: LLMResponse, context: GatewayRequestContext): ClientResponse;
fromLLMStream?(
stream: ReadableStream<string>,
context: GatewayRequestContext
): ReadableStream<Uint8Array>;
}anthropic-messagesopenai-responsesopenai-chat-completions
- Gemini CLI
- Continue.dev
- OpenCode
- Aider
- Custom StackBilt agent protocol
The gateway should classify requests before routing.
type RouteClass =
| 'cheap_edit'
| 'fast_code'
| 'deep_reasoning'
| 'tool_heavy'
| 'long_context'
| 'fallback_safe';| Route Class | Description | Preferred Providers |
|---|---|---|
cheap_edit |
Small edits, summaries, simple rewrites | Cloudflare, Groq, Cerebras |
fast_code |
Normal coding/refactor tasks | Groq, Cerebras, Cloudflare |
deep_reasoning |
Architecture, debugging, complex design | Anthropic, OpenAI, Cerebras |
tool_heavy |
Tool/function-heavy agent work | Anthropic, OpenAI, known-compatible Groq/Cerebras |
long_context |
Large repo/context requests | Anthropic, OpenAI |
fallback_safe |
Unknown or risky requests | Anthropic, OpenAI |
- Must support
defaultProvider: 'auto'through@stackbilt/llm-providers. - Must allow user override by route class.
- Must record route decision and provider result.
- Must allow experimental providers behind config flags.
The gateway must maintain model compatibility metadata separate from raw provider availability.
interface ModelCompatibility {
provider: string;
model: string;
streaming: boolean;
tools: boolean;
vision?: boolean;
claudeCodeSafe: boolean | 'experimental';
codexSafe: boolean | 'experimental';
notes?: string;
}- Tool-heavy requests must check compatibility before routing.
- Experimental models must be opt-in.
- Compatibility failures should trigger fallback.
- Compatibility registry should be editable without changing core provider logic.
The MVP should include local caching, but avoid unsafe overreach.
Caches stable prompt sections such as:
- System prompts.
- Tool schemas.
- Repo instructions.
- Static agent policy blocks.
Caches deterministic, read-only responses where safe.
Examples:
- Explain this file.
- Summarize this error.
- Classify this request.
Avoid caching:
- Tool execution results.
- File mutation requests.
- Commands that depend on current repo state.
- Anything with secrets.
Tracks recent provider latency, errors, rate-limit pressure, and circuit-breaker state.
Use SQLite for local MVP.
For hosted/edge version:
- D1 for structured event ledger.
- KV for coarse cache lookups.
- R2 for large trace artifacts.
- Durable Objects for per-session state.
Every request should produce a structured event.
interface GatewayRequestEvent {
id: string;
timestamp: string;
client: 'claude-code' | 'codex' | 'unknown';
protocol: 'anthropic-messages' | 'openai-responses' | 'openai-chat';
sessionId?: string;
repoPath?: string;
routeClass: RouteClass;
requestedModel?: string;
selectedProvider: string;
selectedModel: string;
fallbackChain: string[];
inputTokens?: number;
outputTokens?: number;
cachedInputTokens?: number;
costEstimateUsd?: number;
latencyMs: number;
cacheHit: boolean;
status: 'success' | 'error' | 'fallback_success';
errorClass?: string;
}GET /health
GET /metrics
GET /events/recent- Metrics must be readable locally without external services.
- Logs must avoid storing full prompts by default.
- Full prompt tracing must be opt-in.
- Secret redaction must be applied before storing traces.
{
"port": 8787,
"auth": {
"mode": "local-key",
"keys": ["local-dev-key"]
},
"routing": {
"default": "auto",
"experimentalModels": false,
"routes": {
"cheap_edit": ["cloudflare", "groq", "cerebras"],
"fast_code": ["groq", "cerebras", "cloudflare"],
"deep_reasoning": ["anthropic", "openai", "cerebras"],
"tool_heavy": ["anthropic", "openai"],
"long_context": ["anthropic", "openai"],
"fallback_safe": ["anthropic", "openai"]
}
},
"cache": {
"enabled": true,
"storage": "sqlite",
"path": ".stackbilt/gateway/cache.sqlite",
"responseCache": false,
"prefixCache": true
},
"telemetry": {
"enabled": true,
"storePrompts": false,
"redactSecrets": true,
"path": ".stackbilt/gateway/events.sqlite"
}
}Claude Code / Codex / Agent CLI
|
v
Localhost Gateway Server
|
v
Protocol Adapter
- Anthropic Messages
- OpenAI Responses
- OpenAI Chat
|
v
Request Classifier + Policy Layer
|
v
Cache Layer + Telemetry Layer
|
v
@stackbilt/llm-providers
|
v
Cloudflare Workers AI / Groq / Cerebras / Anthropic / OpenAIResponsible for:
- Provider abstraction.
- Model catalog.
- Circuit breakers.
- Retry/failover.
- Cost tracking.
- Provider-specific request/response handling.
- Streaming provider support.
Responsible for:
- Local server.
- Client protocol compatibility.
- Request classification.
- Gateway-specific routing policy.
- Local caching.
- Local telemetry.
- CLI setup and config.
- Future StackBilt governance integration.
packages/llm-gateway/
src/
cli.ts
server.ts
config.ts
auth.ts
policy/
classify.ts
routes.ts
compatibility.ts
adapters/
anthropic-messages.ts
openai-responses.ts
openai-chat.ts
cache/
sqlite-cache.ts
keys.ts
redaction.ts
telemetry/
events.ts
metrics.ts
sqlite-events.ts
errors.ts
types.ts
tests/
adapters/
policy/
streaming/
examples/
claude-code.md
codex.md
package.json
README.md- CLI start command.
- Local HTTP server.
/healthendpoint.- Anthropic Messages endpoint.
- OpenAI Responses endpoint.
- Basic OpenAI Chat Completions endpoint.
- Non-streaming and streaming support.
- Basic request classification.
- Integration with
@stackbilt/llm-providers. - Local telemetry events.
- Safe default routing policy.
- Documentation for Claude Code setup.
- Documentation for Codex setup.
- SQLite event storage.
- Prefix cache.
- Config file support.
- Simple metrics endpoint.
- Compatibility registry.
- Secret redaction.
- Fallback-chain reporting.
- Response cache.
- Local dashboard.
- Web UI for metrics.
- Repo-aware routing.
- Token estimation before routing.
- Digital CSA event export.
- Multi-user hosted service.
- Cloudflare Worker deployment.
- Full RBAC.
- Billing.
- Advanced semantic cache.
- Automatic codebase indexing.
- Full enterprise audit UI.
The MVP is successful if:
- Claude Code can operate through the gateway for basic tasks.
- Codex can operate through the gateway for basic tasks.
- At least three provider backends can be used through one local server.
- Streaming works for both Anthropic-style and OpenAI-style clients.
- Tool-heavy requests are not accidentally routed to incompatible models by default.
- Per-request provider/model/token/cost telemetry is captured.
- A developer can install and test within 10 minutes.
Longer-term success means:
- Meaningful token/cost reduction compared to direct Claude/OpenAI usage.
- Reliable fallback under provider rate limits.
- Useful local analytics by project/session/tool.
- Reusable adapter pattern for additional agent CLIs.
- StackBilt governance telemetry can consume gateway events.
Claude Code and Codex may change expected API behavior.
- Keep protocol adapters isolated.
- Add fixture-based compatibility tests.
- Version adapters independently.
- Avoid hardcoding assumptions outside adapter files.
Agent coding tools rely on precise tool/function behavior. Incompatible models may break workflows.
- Conservative default routing for tool-heavy requests.
- Model compatibility registry.
- Experimental routing opt-in.
- Fixture tests for tool-call request/response shapes.
Cheap models may produce lower-quality outputs that require retries, increasing total cost and frustration.
- Use cheap models for low-risk classes only.
- Track fallback and retry rates.
- Escalate after failure or low-confidence patterns.
- Let users pin premium models per route class.
Caching mutable or sensitive requests can create stale, incorrect, or unsafe behavior.
- Prefix cache on by default.
- Response cache off by default.
- Full prompt storage off by default.
- Secret redaction enabled by default.
- No caching for tool execution or file mutation requests.
The product expands into a generic proxy and loses its StackBilt-specific wedge.
- Keep MVP focused on local coding agents.
- Treat provider routing as delegated to
@stackbilt/llm-providers. - Prioritize observability, governance readiness, and agent compatibility.
Prove that Claude Code and Codex can hit a local gateway and receive valid responses.
- Minimal Hono or Express server.
- Hardcoded Anthropic
/v1/messagesresponse. - Hardcoded OpenAI
/v1/responsesor/v1/chat/completionsresponse. - Confirm client configuration.
Real routing through @stackbilt/llm-providers.
- Anthropic adapter.
- OpenAI Responses adapter.
- OpenAI Chat adapter.
- Streaming support.
- Basic request classifier.
- Provider routing through
LLMProviders.fromEnv(). /healthendpoint.- Basic telemetry logs.
Make the gateway economically useful.
- SQLite event store.
- Prefix cache.
- Token/cost reporting.
- Fallback-chain reporting.
- Basic metrics endpoint.
- Configurable route classes.
Make the gateway safe for real coding-agent use.
- Model compatibility registry.
- Tool-call compatibility tests.
- Client fixture tests.
- Redaction layer.
- Failure-mode tests.
- Experimental model flags.
Turn gateway telemetry into governance and product intelligence.
- Optional Digital CSA event export.
- Repo/session/project metadata.
- Governance mode tags.
- Policy violation events.
- Agent workflow trace format.
- Optional local dashboard.
Deploy a shared gateway for team or cloud use.
- Cloudflare Worker gateway mode.
- D1 event ledger.
- KV/R2-backed cache/trace storage.
- Durable Object session coordinator.
- API key auth.
- Usage limits.
- Team-level metrics.
- Which Claude Code request/response fields are strictly required for stable operation?
- Which Codex wire API should be prioritized first: Responses or Chat Completions?
- Should the first MVP support tool calls end-to-end or initially route tool-heavy requests only to Anthropic/OpenAI?
- Should
@stackbilt/llm-providersexpose additional gateway metadata hooks? - Should route classification live fully in
@stackbilt/llm-gateway, or should some classifier helpers be added to@stackbilt/llm-providers? - Should local telemetry use SQLite directly, or a pluggable storage interface from the start?
- Should this package be part of the existing monorepo or remain independent at first?
Build the first implementation as a separate package:
@stackbilt/llm-gatewayUse:
- Node.js 18+.
- Hono or Fastify.
@stackbilt/llm-providersas the routing engine.- SQLite for local event/cache storage.
- Isolated adapters for Anthropic and OpenAI protocols.
The first useful milestone should be:
Claude Code → localhost:8787/v1/messages → StackBilt Gateway → Groq/Cerebras/Anthropic fallback → valid streamed responseThe second useful milestone should be:
Codex → localhost:8787/v1/responses → StackBilt Gateway → model-router → valid streamed responseOnly after those two flows work should the project invest heavily in caching, dashboards, or governance integrations.
StackBilt LLM Gateway is a local-first AI coding gateway that routes Claude Code, Codex, and future agent CLIs across multiple LLM providers with cost controls, caching, fallback, and governance-ready telemetry.
A local model-routing gateway for serious AI coding workflows.
Most LLM proxies focus on raw provider compatibility. StackBilt LLM Gateway focuses on agentic developer workflows: coding assistants, CLI agents, repo-aware sessions, model routing, cost pressure, fallback, and governance traces. It gives independent builders and engineering teams a lightweight control plane between their coding agents and the rapidly changing model provider ecosystem.
npm install -g @stackbilt/llm-gateway
export ANTHROPIC_API_KEY=sk-ant-...
export GROQ_API_KEY=gsk_...
export CEREBRAS_API_KEY=csk-...
export STACKBILT_GATEWAY_KEY=local-dev-key
stackbilt-llm-gateway start --port 8787
export ANTHROPIC_BASE_URL=http://localhost:8787
export ANTHROPIC_API_KEY=local-dev-key
claudemodel_provider = "stackbilt"
model = "stackbilt-auto"
[model_providers.stackbilt]
name = "StackBilt Local Gateway"
base_url = "http://localhost:8787/v1"
env_key = "STACKBILT_GATEWAY_KEY"
wire_api = "responses"export STACKBILT_GATEWAY_KEY=local-dev-key
codexGET /health
GET /metrics
GET /events/recent
POST /v1/messages
POST /v1/responses
POST /v1/chat/completionsProceed with a lean local-first MVP.
The product should start as a thin protocol adapter and policy layer over @stackbilt/llm-providers, not as a new provider abstraction system. Its differentiation should be:
- Agent-client compatibility.
- Cost-aware routing.
- Safe model downgrade rules.
- Local observability.
- Cache discipline.
- Future governance telemetry.
This gives StackBilt a practical, developer-useful gateway today and a strong foundation for governed multi-agent orchestration later.