Multi-tenant LLM inference-as-a-service on NVIDIA DGX Spark. All tenant management, authentication, rate limiting, and inference routing are implemented via Kubernetes CRDs, with zero custom application code.
TokenLabs composes three open-source projects, each handling a distinct concern:
Client (Authorization: Bearer <api-key>)
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Envoy Gateway (gatewayClassName: eg) │
│ ├─ Kuadrant AuthPolicy → Authorino ① API key auth │
│ ├─ Kuadrant RateLimitPolicy → Limitador ② Rate limits │
│ │ │
│ ├─ AI Gateway ext_proc ③ Model routing │
│ │ reads "model" → sets x-ai-eg-model header │
│ ├─ AIGatewayRoute → InferencePool │
│ │ └─ model=Llama-3.1-8B → token-labs-pool (spark-01) │
│ └─ llm-d EPP ext_proc ④ Inference sched │
├──────────────────────────────────────────────────────────────┤
│ vLLM / TTS Workers │
│ └─ Response with usage.total_tokens (LLMs) │
├──────────────────────────────────────────────────────────────┤
│ Kuadrant TokenRateLimitPolicy → Limitador ⑥ Token quota │
└──────────────────────────────────────────────────────────────┘
│
▼
Client receives response
Envoy Gateway — Kubernetes-native L7 proxy that implements the Gateway API. It serves as the single entry point for all client traffic. Envoy Gateway handles TLS termination, HTTP routing, and hosts ext_proc filters for the AI Gateway controller and llm-d EPP. It was chosen over Istio because the Gateway API Inference Extension explicitly supports it, and it's lighter weight than a full service mesh.
Envoy AI Gateway — AI-native routing layer that runs on top of Envoy Gateway. Its controller runs as an ext_proc extension that automatically extracts the "model" field from the request body, sets the x-ai-eg-model header, and routes to the correct InferencePool backend via AIGatewayRoute rules. It also tracks per-request token usage via llmRequestCosts (InputToken, OutputToken, TotalToken). This replaces the need for a separate Body Based Router or custom HTTPRoute header matching.
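The model extraction step can be pictured with a minimal shell sketch, assuming `jq` is installed. It only mimics what the ext_proc filter does to the request body; `model_header` is a hypothetical helper, not part of Envoy AI Gateway.

```shell
# Illustrative only: mimic the AI Gateway's model extraction with jq.
# model_header is a hypothetical helper, not part of Envoy AI Gateway.
model_header() {
  # Read an OpenAI-style request body on stdin and print the routing header.
  local model
  model=$(jq -er '.model') || return 1   # -e: fail when "model" is missing or null
  printf 'x-ai-eg-model: %s\n' "$model"
}

body='{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
printf '%s' "$body" | model_header
```

An AIGatewayRoute rule then only needs to match on the header value, which is why no custom HTTPRoute body matching is required.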
Kuadrant — CNCF policy layer that deploys two backing services:
- Authorino: external authorization service. When the AuthPolicy CRD is applied, Authorino intercepts every request and validates the API key (stored as a Kubernetes Secret). It extracts tenant metadata (tier, user-id) from the Secret's annotations and enriches the request context so downstream policies can use it.
- Limitador: rate limiting service. Enforces request-count limits (via RateLimitPolicy) and, critically, token-based quotas (via TokenRateLimitPolicy). The token policy automatically parses usage.total_tokens from OpenAI-compatible JSON responses and counts it against the tenant's quota, with no custom middleware required. This is what makes per-tenant billing feasible without writing a proxy.
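To make the token accounting concrete, here is a small sketch of reading usage.total_tokens from an OpenAI-compatible response with `jq`. The sample response and the `tokens_used` helper are invented for illustration; in the real stack this parsing happens inside Kuadrant, not in a script.

```shell
# Sample OpenAI-compatible response; the numbers are made up for illustration.
response='{"id":"chatcmpl-1","usage":{"prompt_tokens":12,"completion_tokens":30,"total_tokens":42}}'

# tokens_used is a hypothetical helper: prints usage.total_tokens, or 0 if absent.
tokens_used() {
  jq -r '.usage.total_tokens // 0'
}

printf '%s' "$response" | tokens_used   # prints 42
```

The value printed here is the number that would be counted against the tenant's token quota for that request.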
llm-d — inference-aware request scheduler. Its Endpoint Picker (EPP) runs as an Envoy ext_proc server and scores every vLLM pod on three signals before routing the request:
- KV-cache usage — avoids pods whose GPU memory is nearly full
- Prefix-cache locality — routes similar prompts to the same pod to reuse cached KV entries
- Queue depth — prefers pods with fewer in-flight requests
This produces better tail latency and higher throughput than round-robin or least-connections load balancing.
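The three signals can be combined into a single score per pod. The sketch below is a toy model, not llm-d's actual scorer: the weights (0.4, 0.4, 0.2), the 1/(1+depth) queue term, and the `pick_pod` helper are all assumptions chosen for illustration.

```shell
# pick_pod reads "name kv_usage prefix_hit queue_depth" lines on stdin
# and prints the name of the best-scoring pod.
#   kv_usage     fraction of KV cache in use (0..1), lower is better
#   prefix_hit   fraction of the prompt already cached (0..1), higher is better
#   queue_depth  in-flight requests, lower is better
# Weights are hypothetical and not taken from llm-d.
pick_pod() {
  awk '{
    score = 0.4 * (1 - $2) + 0.4 * $3 + 0.2 / (1 + $4)
    if (NR == 1 || score > best) { best = score; name = $1 }
  } END { print name }'
}

pick_pod <<'EOF'
pod-a 0.90 0.10 4
pod-b 0.30 0.80 1
pod-c 0.10 0.00 0
EOF
```

With these inputs the scores are 0.12, 0.70, and 0.56, so pod-b wins: its warm prefix cache outweighs pod-c's emptier queue, which a least-connections balancer would have preferred.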
vLLM: high-performance LLM inference engine running on DGX Spark GB10 GPUs. Exposes an OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/models). The cluster currently serves four models across two nodes; the spark-02 entries name the serving engine used for each:
- Nemotron-3-Super-120B (spark-01): MoE model in NVFP4, served alone to utilize spark-01's full 128GB unified memory pool
- DeepSeek-R1-Distill-Qwen-7B (spark-02): reasoning model via vLLM
- Llama 3.1 8B Instruct (spark-02): general-purpose chat model via SGLang
- Qwen 2.5 7B Instruct (spark-02): chat model via TRT-LLM (PyTorch backend, batch size 8)
┌──────────────────────────────────────────────────────────────────────┐
│ MicroK8s Cluster │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ controller │ │ spark-01 │ │ spark-02 │ │
│ │ (CPU, ARM64) │ │ (GB10 GPU) │ │ (GB10 GPU) │ │
│ │ │ │ │ │ │ │
│ │ Envoy GW │ │ vLLM: │ │ vLLM: │ │
│ │ Kuadrant │ │ Nemotron 120B │ │ DeepSeek-R1 │ │
│ │ llm-d EPPs │ │ (NVFP4) │ │ Llama 3.1 8B │ │
│ │ │ │ │ │ Qwen 2.5 7B │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
The cluster has three nodes. The CPU controller runs control-plane components (Envoy Gateway proxy, Kuadrant operators, llm-d EPPs). spark-01 serves nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 alone — the 120B MoE model requires the full 128GB unified memory pool (NVFP4 quantization, --gpu-memory-utilization 0.50, tensor_parallelism=1). spark-02 serves three smaller models: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B via vLLM, meta-llama/Llama-3.1-8B-Instruct via SGLang, and Qwen/Qwen2.5-7B-Instruct via TRT-LLM.
There is no external identity provider (no Keycloak, no Auth0). Tenants are Kubernetes Secrets — Authorino validates API keys by looking up Secrets directly. No database, no restarts, no config reloads. The moment you kubectl apply a tenant Secret, the API key is live.
- Client sends a request with Authorization: Bearer <api-key>
- Authorino searches for a Secret in kuadrant-system labeled authorino.kuadrant.io/managed-by: authorino
- Compares the api_key field in each Secret against the bearer token
- On match, extracts the tenant's tier (kuadrant.io/groups) and ID (secret.kuadrant.io/user-id) from annotations
- Passes this metadata downstream: RateLimitPolicy and TokenRateLimitPolicy use it to enforce per-tenant quotas
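The lookup steps above can be approximated from the command line, which is handy for debugging a key that is unexpectedly rejected. This is a sketch that assumes `jq` 1.6 or later (for `@base64d`); `find_tenant` is a hypothetical helper and does not reflect how Authorino is implemented internally.

```shell
# find_tenant TOKEN: read a Secret list (JSON) on stdin and print
# "<tier> <user-id>" for the Secret whose api_key equals TOKEN.
# Exits nonzero when no Secret matches. Hypothetical helper.
find_tenant() {
  jq -er --arg token "$1" '
    [ .items[]
      | select(((.data.api_key // "") | @base64d) == $token)
      | .metadata.annotations as $a
      | [$a["kuadrant.io/groups"], $a["secret.kuadrant.io/user-id"]] | join(" ")
    ] | first // empty'
}

# Offline sample shaped like `kubectl get secrets -o json` output.
key_b64=$(printf '%s' 'tlabs_sk_demo' | base64)
secrets=$(printf '{"items":[{"metadata":{"annotations":{"kuadrant.io/groups":"pro","secret.kuadrant.io/user-id":"acme"}},"data":{"api_key":"%s"}}]}' "$key_b64")
printf '%s' "$secrets" | find_tenant 'tlabs_sk_demo'   # prints: pro acme

# Against a live cluster (not run here):
#   kubectl get secrets -n kuadrant-system \
#     -l authorino.kuadrant.io/managed-by=authorino -o json | find_tenant "$API_KEY"
```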
apiVersion: v1
kind: Secret
metadata:
  name: tenant-acme
  namespace: kuadrant-system
  labels:
    authorino.kuadrant.io/managed-by: authorino # Authorino discovers this Secret
    app: token-labs
  annotations:
    kuadrant.io/groups: "pro" # tier: free | pro | enterprise
    secret.kuadrant.io/user-id: "acme" # unique tenant ID (rate limit counter key)
stringData:
  api_key: "tlabs_sk_acme_..." # API key value
# 1. Generate a secure API key
API_KEY="tlabs_sk_$(openssl rand -hex 24)"
# 2. Create the tenant Secret (choose tier: free, pro, or enterprise)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: tenant-acme
  namespace: kuadrant-system
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: token-labs
  annotations:
    kuadrant.io/groups: "pro"
    secret.kuadrant.io/user-id: "acme"
stringData:
  api_key: "$API_KEY"
EOF
# 3. Share the API key with the client (securely, out-of-band)
echo "API Key: $API_KEY"
The client can use the key immediately, with no waiting and no restart:
curl https://inference.token-labs.local/v1/chat/completions \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
| Action | Command |
|---|---|
| List all tenants | kubectl get secrets -n kuadrant-system -l app=token-labs |
| Change tier | kubectl annotate secret tenant-acme -n kuadrant-system kuadrant.io/groups=enterprise --overwrite |
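| Rotate API key | kubectl patch secret tenant-acme -n kuadrant-system --type merge -p '{"stringData":{"api_key":"'"tlabs_sk_$(openssl rand -hex 24)"'"}}' (a patch keeps the labels and annotations Authorino relies on; re-applying a freshly generated Secret manifest would drop them) |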
| Revoke access | kubectl delete secret tenant-acme -n kuadrant-system |
| Tier | Requests/day | Requests/min | Tokens/day | Tokens/min |
|---|---|---|---|---|
| Free | 100 | 10 | 50,000 | 5,000 |
| Pro | 5,000 | 100 | 500,000 | 50,000 |
| Enterprise | 50,000 | 1,000 | 5,000,000 | 500,000 |
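A quick way to sanity-check the tiers is to compare the token quota with the request quota. The sketch below encodes the table in a hypothetical `tier_limits` helper and computes the average token budget per request if a tenant uses every daily request.

```shell
# tier_limits TIER: print "requests/day requests/min tokens/day tokens/min"
# using the values from the tier table. Hypothetical helper.
tier_limits() {
  case "$1" in
    free)       echo "100 10 50000 5000" ;;
    pro)        echo "5000 100 500000 50000" ;;
    enterprise) echo "50000 1000 5000000 500000" ;;
    *)          echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

for tier in free pro enterprise; do
  set -- $(tier_limits "$tier")   # $1=req/day $2=req/min $3=tok/day $4=tok/min
  echo "$tier: $(( $3 / $1 )) tokens per request on average"
done
```

This prints 500 for free and 100 for both pro and enterprise, so on the paid tiers a tenant whose requests average more than 100 tokens will exhaust the daily token quota before the daily request quota.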
See docs/ARCHITECTURE.md for the full CRD inventory, request flow details, and design decisions.
- MicroK8s cluster with GPU addon enabled on worker nodes
- kubectl v1.28+ configured for the cluster
- helm v3.12+
- helmfile v1.1+
- HuggingFace token with access to meta-llama/Llama-3.1-8B-Instruct and nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8
All scripts and commands in this guide use standard kubectl and helm. On a MicroK8s cluster, create aliases so they resolve to the MicroK8s-bundled binaries:
# Permanent system-wide aliases (recommended)
sudo snap alias microk8s.kubectl kubectl
sudo snap alias microk8s.helm helm
# Or add to ~/.bashrc / ~/.zshrc
echo 'alias kubectl="microk8s kubectl"' >> ~/.zshrc
echo 'alias helm="microk8s helm"' >> ~/.zshrc
source ~/.zshrc
Verify:
kubectl version --client
helm version
This installs the Gateway API base CRDs (Gateway, HTTPRoute, GatewayClass), the Gateway API Inference Extension CRDs (InferencePool), and the Envoy AI Gateway CRDs (AIGatewayRoute). These are the Kubernetes resource definitions that all projects build upon.
./deploy/scripts/01-install-crds.sh
What it does:
- Applies Gateway API v1.4.1 standard CRDs
- Applies Inference Extension v1.3.0 CRDs (graduated InferencePool at inference.networking.k8s.io/v1)
- Installs Envoy AI Gateway v0.5.0 CRDs (AIGatewayRoute)
Verify:
kubectl get crd gateways.gateway.networking.k8s.io
kubectl get crd inferencepools.inference.networking.k8s.io
kubectl get crd aigatewayroutes.aigateway.envoyproxy.io
Envoy Gateway is the data-plane proxy. Envoy AI Gateway adds AI-native routing on top: its controller extracts the model from the request body and routes to the correct InferencePool. Redis is required as the backend for Kuadrant's distributed rate limiting (Limitador stores counters in Redis).
./deploy/scripts/02-install-envoy-gateway.sh
What it does:
- Deploys a standalone Redis instance into redis-system
- Installs Envoy Gateway v1.6.4 Helm chart with AI Gateway values files (extension manager, rate limiting addon, InferencePool addon)
- Installs the AI Gateway controller v0.5.0 into envoy-ai-gateway-system
Verify:
kubectl get pods -n envoy-gateway-system # envoy-gateway controller running
kubectl get pods -n envoy-ai-gateway-system # ai-gateway controller running
kubectl get pods -n redis-system # redis pod running
Kuadrant is the policy layer. Installing the operator deploys the controller that watches for AuthPolicy, RateLimitPolicy, and TokenRateLimitPolicy CRDs. Creating the Kuadrant CR bootstraps the backing services (Authorino for auth, Limitador for rate limiting).
./deploy/scripts/03-install-kuadrant.sh
What it does:
- Adds the Kuadrant Helm repo and installs kuadrant-operator into kuadrant-system
- Creates a Kuadrant CR that triggers deployment of Authorino and Limitador
Verify:
kubectl get pods -n kuadrant-system # operator, authorino, limitador all running
kubectl get kuadrant -n kuadrant-system # status should show Ready
llm-d is the inference-aware scheduling layer. It uses a 5-release Helmfile pattern:
| Chart | Release | What it deploys |
|---|---|---|
| llm-d-infra v1.3.6 | llm-d-infra | CRDs and shared infrastructure. Gateway creation is disabled (gateway.create: false) since we manage the Gateway resource separately via Envoy Gateway. |
| inferencepool v1.3.0 | llm-d-inferencepool | EPP for Llama 3.1 8B: the ext_proc server that performs inference-aware routing with kvCacheAware and queueDepthAware scoring. |
| llm-d-modelservice v0.4.5 | llm-d-modelservice | vLLM worker for Llama 3.1 8B Instruct. 1 decode replica on spark-01. |
# Set your HuggingFace token first
kubectl create namespace token-labs
kubectl create secret generic hf-token \
--from-literal="HF_TOKEN=${HF_TOKEN}" \
-n token-labs
./deploy/scripts/04-deploy-llm-d.sh
What it does:
- Creates the token-labs namespace
- Runs helmfile apply, which installs all 5 releases with values from deploy/llm-d/values/
- Waits for vLLM workers to download model weights and become ready (can take several minutes on first run)
Verify:
kubectl get pods -n token-labs # 2 vLLM pods + 2 EPP pods running
kubectl get inferencepool -n token-labs # both pools should show Ready
The Envoy AI Gateway uses the AIGatewayRoute CRD for model-based routing. The AI Gateway controller automatically extracts the "model" field from the request body and sets the x-ai-eg-model header; the AIGatewayRoute matches on this header to route to the correct InferencePool backend. Token usage is tracked via llmRequestCosts.
./deploy/scripts/05-deploy-ai-gateway-route.sh
What it does:
- Applies the AIGatewayRoute resource, which defines model-to-InferencePool routing rules
- Configures llmRequestCosts for per-request token tracking (InputToken, OutputToken, TotalToken)
Verify:
kubectl get aigatewayroute -n token-labs # AIGatewayRoute listed
This step creates the actual networking and policy resources that wire everything together:
# Gateway + HTTPRoute
kubectl apply -f deploy/gateway/
# Kuadrant policies
kubectl apply -f deploy/policies/
Gateway resources (deploy/gateway/):
- namespace.yaml: creates the token-labs namespace (idempotent)
- gateway.yaml: creates a Gateway resource with gatewayClassName: eg, listening on HTTP port 80 with hostname inference.token-labs.local. Envoy Gateway sees this and provisions an Envoy proxy pod to handle traffic.
- aigatewayroute.yaml: creates an AIGatewayRoute with per-model header matching (deployed in step 5). The AI Gateway controller extracts the "model" field from the request body and sets the x-ai-eg-model header. Each rule matches on this header and routes to the correct InferencePool backend. The InferencePool is the bridge to llm-d's EPP: when Envoy receives a matching request, it invokes the EPP via ext_proc to pick the optimal vLLM pod.
Kuadrant policies (deploy/policies/):
- kuadrant.yaml: the Kuadrant CR (idempotent, already created in step 3)
- auth-policy.yaml: AuthPolicy targeting the Gateway. Configures API key authentication: Authorino validates the Authorization: Bearer <key> header by looking up Secrets labeled authorino.kuadrant.io/managed-by: authorino. On match, it extracts kuadrant.io/groups (tier) and secret.kuadrant.io/user-id (tenant ID) from annotations and passes them in the request context. An OPA policy validates that the tier is one of free, pro, or enterprise.
- rate-limit-policy.yaml: RateLimitPolicy targeting the Gateway. Defines per-tier request count limits (e.g., free = 10/min and 100/day). Uses when predicates with CEL expressions to match auth.identity.groups, and counters keyed by auth.identity.userid for tenant isolation.
- token-rate-limit-policy.yaml: TokenRateLimitPolicy targeting the Gateway for /v1/chat/completions. This is the key CRD for LLM billing. After vLLM returns a response, Kuadrant's wasm-shim parses usage.total_tokens from the JSON body and sends it to Limitador as hits_addend. Each tenant's cumulative token usage is tracked per time window.
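The token accounting described for token-rate-limit-policy.yaml can be modeled in a few lines. This is a deliberately simplified sketch of one tenant and one time window, not Limitador's implementation; `admit` and `charge` are hypothetical names, and the assumption that a request is admitted while the counter is still below the limit is a simplification.

```shell
used=0        # tokens consumed by one tenant in the current window
limit=5000    # free tier: tokens per minute

# admit: succeed while the tenant is still under its quota.
admit()  { [ "$used" -lt "$limit" ]; }
# charge TOKENS: add a response's usage.total_tokens to the counter,
# the role hits_addend plays in the description above.
charge() { used=$(( used + $1 )); }

for tokens in 2000 2500 1200 300; do
  if admit; then
    charge "$tokens"
    echo "200 used=$used/$limit"
  else
    echo "429 used=$used/$limit"
  fi
done
```

In this model the third request is admitted at 4500 tokens and pushes the counter to 5700, and only the fourth is rejected. Because the token count is known only after the response, a window can overshoot by up to one response.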
Verify:
kubectl get gateway -n token-labs # Programmed: True
kubectl get aigatewayroute -n token-labs # Listed
kubectl get authpolicy -n token-labs # Accepted: True
kubectl get ratelimitpolicy -n token-labs # Accepted: True
kubectl get tokenratelimitpolicy -n token-labs # Accepted: True
Demo tenants are provided for testing (see Tenant Model above for full onboarding instructions):
kubectl apply -f deploy/tenants/
This creates two demo tenants:
- tenant-free-demo: free tier, key tlabs_free_demo_key_change_me
- tenant-pro-demo: pro tier, key tlabs_pro_demo_key_change_me
For production tenants, follow the onboarding steps in the Tenant Model section.
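When onboarding more than a handful of tenants, the manifest can be generated by a small function that also rejects unknown tiers before anything reaches the cluster. This is a sketch: `tenant_manifest` is a hypothetical helper, and the key passed below is a placeholder.

```shell
# tenant_manifest NAME TIER API_KEY: print a tenant Secret manifest on stdout.
# Fails without printing anything when TIER is not free, pro, or enterprise.
tenant_manifest() {
  case "$2" in
    free|pro|enterprise) ;;
    *) echo "invalid tier: $2 (expected free, pro, or enterprise)" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: tenant-$1
  namespace: kuadrant-system
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: token-labs
  annotations:
    kuadrant.io/groups: "$2"
    secret.kuadrant.io/user-id: "$1"
stringData:
  api_key: "$3"
EOF
}

# Print the manifest for review (placeholder key).
tenant_manifest acme pro "tlabs_sk_example"
```

For a real tenant, generate the key with `openssl rand -hex 24` as in the onboarding steps and pipe the output to `kubectl apply -f -`.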
# Port-forward to the gateway (or use MetalLB IP)
kubectl port-forward -n envoy-gateway-system \
svc/envoy-default-token-labs-gateway 8080:80 &
# List models
curl -s http://localhost:8080/v1/models \
-H "Authorization: Bearer tlabs_pro_demo_key_change_me" | jq
# Chat completion (Llama 3.1 8B)
curl -s http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer tlabs_pro_demo_key_change_me" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "What is Kubernetes?"}],
"max_tokens": 200
}' | jq
# Chat completion (Nemotron Nano 12B v2 VL)
curl -s http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer tlabs_pro_demo_key_change_me" \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8",
"messages": [{"role": "user", "content": "Describe this image."}],
"max_tokens": 200
}' | jq
# Text-to-speech (Magpie TTS)
curl -s http://localhost:8080/v1/audio/speech \
-H "Authorization: Bearer tlabs_pro_demo_key_change_me" \
-H "Content-Type: application/json" \
-d '{
"input": "Welcome to Token Labs.",
"voice": "aria",
"language": "en"
}' --output speech.wav
# Verify rate limiting (free tier — should get 429 after 10 requests/min)
for i in $(seq 1 15); do
echo -n "Request $i: "
curl -s -o /dev/null -w "%{http_code}" \
http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer tlabs_free_demo_key_change_me" \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hi"}],"max_tokens":5}'
echo
done
The stack exposes metrics from all layers via Prometheus ServiceMonitors:
# Optional: deploy ServiceMonitors
kubectl apply -f deploy/monitoring/service-monitors.yaml
| Source | Key Metrics |
|---|---|
| Limitador | limitador_counter_hits_total — per-tenant request/token counts |
| Authorino | auth_server_response_status — auth allow/deny rates |
| vLLM | vllm:kv_cache_usage_perc, vllm:request_latency_seconds |
| EPP | Routing decisions, prefix-cache hit rates |
llm-d also provides ready-made Grafana dashboards — see docs/ARCHITECTURE.md for details.
| Metric | Prefill (Input) | Decode (Output) |
|---|---|---|
| Throughput | 3,203 tok/s | 520 tok/s |
| Cost/1M tokens | $0.006 | $0.037 |
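The table does not state how the cost figures were derived. One common way to relate the two rows is to divide an hourly running cost by the number of tokens produced per hour. The sketch below uses that formula with an assumed hourly cost of $0.0692, a value back-derived from the table rather than a measured figure; `cost_per_million` is a hypothetical helper.

```shell
# cost_per_million HOURLY_COST_USD TOKENS_PER_SECOND
# cost per 1M tokens = hourly cost / (tokens per hour / 1,000,000)
cost_per_million() {
  awk -v cost="$1" -v tps="$2" 'BEGIN { printf "%.3f\n", cost / (tps * 3600 / 1000000) }'
}

hourly_cost=0.0692   # assumption: back-derived from the table, not measured
echo "prefill: \$$(cost_per_million "$hourly_cost" 3203) per 1M tokens"
echo "decode:  \$$(cost_per_million "$hourly_cost" 520) per 1M tokens"
```

Under that assumption the output is 0.006 for prefill and 0.037 for decode, matching the table, which suggests both rows share a single hourly cost and differ only in throughput.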
Accuracy verification uses lighteval with the IFEval benchmark to check model quality across quantizations. Models are compared against the meta-llama/Llama-3.1-8B-Instruct baseline using a ±5% threshold. See baselines/README.md for details.
├── deploy/
│ ├── scripts/ # Installation scripts (run in order)
│ │ ├── 01-install-crds.sh
│ │ ├── 02-install-envoy-gateway.sh
│ │ ├── 03-install-kuadrant.sh
│ │ ├── 04-deploy-llm-d.sh
│ │ └── 05-deploy-ai-gateway-route.sh
│ ├── gateway/ # Gateway + AIGatewayRoute resources
│ ├── llm-d/ # Helmfile + values for llm-d 5-release deploy
│ │ ├── helmfile.yaml.gotmpl
│ │ └── values/
│ ├── policies/ # Kuadrant AuthPolicy, RateLimitPolicy, TokenRateLimitPolicy
│ ├── tenants/ # Tenant API key Secrets (template + demos)
├── docs/
│ ├── ARCHITECTURE.md # Full architecture deep-dive
│ ├── index.html # Live demo page
│ └── benchmark-results.* # Benchmark data
├── baselines/ # Accuracy baseline values
├── scripts/ # Benchmark and analysis scripts
├── Dockerfile # vLLM build for ARM64
└── .github/workflows/ # CI/CD pipelines
MIT