elizabetht/token-labs

Token Labs

Multi-tenant LLM inference-as-a-service on NVIDIA DGX Spark. All tenant management, authentication, rate limiting, and inference routing are implemented via Kubernetes CRDs, with zero custom application code.

Architecture

Token Labs composes five open-source projects, each handling a distinct concern:

Client (Authorization: Bearer <api-key>)
  │
  ▼
┌──────────────────────────────────────────────────────────────┐
│  Envoy Gateway  (gatewayClassName: eg)                       │
│  ├─ Kuadrant AuthPolicy → Authorino        ① API key auth   │
│  ├─ Kuadrant RateLimitPolicy → Limitador   ② Rate limits    │
│  │                                                           │
│  ├─ AI Gateway ext_proc                   ③ Model routing  │
│  │   reads "model" → sets x-ai-eg-model header              │
│  ├─ AIGatewayRoute → InferencePool                           │
│  │   └─ model=Llama-3.1-8B     → token-labs-pool (spark-01) │
│  └─ llm-d EPP ext_proc                    ④ Inference sched │
├──────────────────────────────────────────────────────────────┤
│  vLLM / TTS Workers                                          │
│  └─ Response with usage.total_tokens (LLMs)                  │
├──────────────────────────────────────────────────────────────┤
│  Kuadrant TokenRateLimitPolicy → Limitador  ⑥ Token quota   │
└──────────────────────────────────────────────────────────────┘
  │
  ▼
Client receives response

Components

Envoy Gateway — Kubernetes-native L7 proxy that implements the Gateway API. It serves as the single entry point for all client traffic. Envoy Gateway handles TLS termination, HTTP routing, and hosts ext_proc filters for the AI Gateway controller and llm-d EPP. It was chosen over Istio because the Gateway API Inference Extension explicitly supports it, and it's lighter weight than a full service mesh.

Envoy AI Gateway — AI-native routing layer that runs on top of Envoy Gateway. Its controller runs as an ext_proc extension that automatically extracts the "model" field from the request body, sets the x-ai-eg-model header, and routes to the correct InferencePool backend via AIGatewayRoute rules. It also tracks per-request token usage via llmRequestCosts (InputToken, OutputToken, TotalToken). This replaces the need for a separate Body Based Router or custom HTTPRoute header matching.
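
The model-based routing described above can be pictured with a small shell sketch. It is illustrative only: the real extraction happens inside the AI Gateway ext_proc filter, the function names are hypothetical, and the JSON handling is a naive sed match rather than a real parser. The single route shown is the one from the architecture diagram.

```shell
# Pull the "model" string out of a JSON request body (naive, single-line match).
extract_model() {
  sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Map the x-ai-eg-model header value to an InferencePool name.
# Only the Llama route appears in the diagram; anything else is unmatched here.
route_model() {
  case "$1" in
    meta-llama/Llama-3.1-8B-Instruct) echo "token-labs-pool" ;;
    *) echo "no matching rule"; return 1 ;;
  esac
}

body='{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
model=$(printf '%s\n' "$body" | extract_model)
echo "x-ai-eg-model: $model"
echo "backend: $(route_model "$model")"
```

In the deployed stack the same mapping is declared as AIGatewayRoute rules, so clients only ever set the "model" field in the request body.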

Kuadrant — CNCF policy layer that deploys two backing services:

  • Authorino — external authorization service. When the AuthPolicy CRD is applied, Authorino intercepts every request and validates the API key (stored as a Kubernetes Secret). It extracts tenant metadata (tier, user-id) from the Secret's annotations and enriches the request context so downstream policies can use it.
  • Limitador — rate limiting service. Enforces request-count limits (via RateLimitPolicy) and, critically, token-based quotas (via TokenRateLimitPolicy). The token policy automatically parses usage.total_tokens from OpenAI-compatible JSON responses and counts it against the tenant's quota — no custom middleware required. This is what makes per-tenant billing feasible without writing a proxy.

llm-d — inference-aware request scheduler. Its Endpoint Picker (EPP) runs as an Envoy ext_proc server and scores every vLLM pod on three signals before routing the request:

  1. KV-cache usage — avoids pods whose GPU memory is nearly full
  2. Prefix-cache locality — routes similar prompts to the same pod to reuse cached KV entries
  3. Queue depth — prefers pods with fewer in-flight requests

This produces better tail latency and higher throughput than round-robin or least-connections load balancing.
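
As a rough illustration of how the three signals might combine, the sketch below scores each pod and picks the highest. The weights, the input format, and the pod names are assumptions made for this example; they are not llm-d's actual scorer configuration.

```shell
# Input on stdin, one pod per line:
#   <pod> <kv_cache_used_pct> <prefix_hit_pct> <queue_depth>
# Hypothetical weighting: free KV-cache and prefix hits add to the score,
# each queued request subtracts 10 points.
pick_pod() {
  awk '
    {
      score = (100 - $2) + $3 - 10 * $4
      if (NR == 1 || score > best) { best = score; pod = $1 }
    }
    END { print pod }
  '
}

printf '%s\n' \
  "vllm-a 90 80 4" \
  "vllm-b 40 60 1" \
  "vllm-c 20 0 0" | pick_pod
```

With these sample numbers the scores are 50, 110, and 80, so the sketch prints vllm-b: a pod with moderate memory pressure and a warm prefix cache beats both the idle pod with a cold cache and the busy pod with the best cache locality.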

vLLM: high-performance LLM inference engine running on DGX Spark GB10 GPUs. Exposes an OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/models). The cluster currently serves four models across two nodes; two of the spark-02 models run on other engines (SGLang and TRT-LLM):

  • Nemotron-3-Super-120B (spark-01) — MoE model in NVFP4, served alone to utilize spark-01's full 128GB unified memory pool
  • DeepSeek-R1 7B (spark-02) — reasoning model via vLLM
  • Llama 3.1 8B Instruct (spark-02) — general-purpose chat model via SGLang
  • Qwen 2.5 7B Instruct (spark-02) — chat model via TRT-LLM (PyTorch backend, batch size 8)

Infrastructure

┌──────────────────────────────────────────────────────────────────────┐
│                         MicroK8s Cluster                             │
│                                                                      │
│  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐       │
│  │  controller     │   │  spark-01      │   │  spark-02      │       │
│  │  (CPU, ARM64)   │   │  (GB10 GPU)    │   │  (GB10 GPU)    │       │
│  │                 │   │                │   │                │       │
│  │  Envoy GW       │   │  vLLM:         │   │  vLLM:         │       │
│  │  Kuadrant       │   │  Nemotron 120B │   │  DeepSeek-R1   │       │
│  │  llm-d EPPs     │   │  (NVFP4)       │   │  Llama 3.1 8B  │       │
│  │                 │   │                │   │  Qwen 2.5 7B   │       │
│  └────────────────┘   └────────────────┘   └────────────────┘       │
└──────────────────────────────────────────────────────────────────────┘

The cluster has three nodes. The CPU controller runs control-plane components (Envoy Gateway proxy, Kuadrant operators, llm-d EPPs). spark-01 serves nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 alone — the 120B MoE model requires the full 128GB unified memory pool (NVFP4 quantization, --gpu-memory-utilization 0.50, tensor_parallelism=1). spark-02 serves three smaller models: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B via vLLM, meta-llama/Llama-3.1-8B-Instruct via SGLang, and Qwen/Qwen2.5-7B-Instruct via TRT-LLM.

Tenant Model

There is no external identity provider (no Keycloak, no Auth0). Tenants are Kubernetes Secrets — Authorino validates API keys by looking up Secrets directly. No database, no restarts, no config reloads. The moment you kubectl apply a tenant Secret, the API key is live.

How authentication works

  1. Client sends a request with Authorization: Bearer <api-key>
  2. Authorino searches for a Secret in kuadrant-system labeled authorino.kuadrant.io/managed-by: authorino
  3. Compares the api_key field in each Secret against the bearer token
  4. On match, extracts the tenant's tier (kuadrant.io/groups) and ID (secret.kuadrant.io/user-id) from annotations
  5. Passes this metadata downstream — RateLimitPolicy and TokenRateLimitPolicy use it to enforce per-tenant quotas
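
The lookup in the steps above can be mimicked locally with a short shell sketch. This is not Authorino's implementation: the tenant table is a stand-in for the labeled Secrets, the keys are made-up demo values, and the status strings are only meant to mirror the allow/deny outcome.

```shell
# Stand-in for tenant Secrets, one per line: "<api_key> <user-id> <tier>".
TENANTS='tlabs_sk_acme_demo acme pro
tlabs_sk_beta_demo beta free'

# Usage: authenticate "<Authorization header value>"
authenticate() {
  case "$1" in
    "Bearer "*) token=${1#Bearer } ;;
    *) echo "401 missing bearer token"; return 1 ;;
  esac
  # Compare the bearer token against each stored key.
  while read -r key user tier; do
    if [ "$key" = "$token" ]; then
      # On match, expose the tenant metadata to downstream policies.
      echo "200 user-id=$user groups=$tier"
      return 0
    fi
  done <<EOF
$TENANTS
EOF
  echo "401 unknown api key"
  return 1
}

authenticate "Bearer tlabs_sk_acme_demo"
authenticate "Bearer wrong_key" || true
```

The first call prints the tenant ID and tier that the rate limit policies would key on; the second is rejected because no stored key matches.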

Tenant Secret structure

apiVersion: v1
kind: Secret
metadata:
  name: tenant-acme
  namespace: kuadrant-system
  labels:
    authorino.kuadrant.io/managed-by: authorino   # Authorino discovers this Secret
    app: token-labs
  annotations:
    kuadrant.io/groups: "pro"                      # tier: free | pro | enterprise
    secret.kuadrant.io/user-id: "acme"             # unique tenant ID (rate limit counter key)
stringData:
  api_key: "tlabs_sk_acme_..."                     # API key value

Onboarding a new tenant

# 1. Generate a secure API key
API_KEY="tlabs_sk_$(openssl rand -hex 24)"

# 2. Create the tenant Secret (choose tier: free, pro, or enterprise)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: tenant-acme
  namespace: kuadrant-system
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: token-labs
  annotations:
    kuadrant.io/groups: "pro"
    secret.kuadrant.io/user-id: "acme"
stringData:
  api_key: "$API_KEY"
EOF

# 3. Share the API key with the client (securely, out-of-band)
echo "API Key: $API_KEY"

The client can use the key immediately — no waiting, no restart:

curl https://inference.token-labs.local/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

Managing tenants

| Action | Command |
|---|---|
| List all tenants | kubectl get secrets -n kuadrant-system -l app=token-labs |
| Change tier | kubectl annotate secret tenant-acme -n kuadrant-system kuadrant.io/groups=enterprise --overwrite |
| Rotate API key (keeps labels and annotations) | kubectl patch secret tenant-acme -n kuadrant-system --type merge -p "{\"stringData\":{\"api_key\":\"tlabs_sk_$(openssl rand -hex 24)\"}}" |
| Revoke access | kubectl delete secret tenant-acme -n kuadrant-system |

Rate limits by tier

| Tier | Requests/day | Requests/min | Tokens/day | Tokens/min |
|---|---|---|---|---|
| Free | 100 | 10 | 50,000 | 5,000 |
| Pro | 5,000 | 100 | 500,000 | 50,000 |
| Enterprise | 50,000 | 1,000 | 5,000,000 | 500,000 |
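
To make the tier table concrete, here is a small shell sketch that encodes the limits and simulates a burst of requests inside a single one-minute window. The function names and the fixed-window counting are assumptions for illustration; Limitador's actual counter behavior is configured through the policies, not through code like this.

```shell
# Usage: tier_limit <free|pro|enterprise> <rpd|rpm|tpd|tpm>
tier_limit() {
  case "$1:$2" in
    free:rpd) echo 100 ;;         free:rpm) echo 10 ;;
    free:tpd) echo 50000 ;;       free:tpm) echo 5000 ;;
    pro:rpd) echo 5000 ;;         pro:rpm) echo 100 ;;
    pro:tpd) echo 500000 ;;       pro:tpm) echo 50000 ;;
    enterprise:rpd) echo 50000 ;; enterprise:rpm) echo 1000 ;;
    enterprise:tpd) echo 5000000 ;; enterprise:tpm) echo 500000 ;;
    *) return 1 ;;
  esac
}

# Usage: simulate_minute <tier> <number of requests in the same minute>
# Prints the HTTP status each request would receive.
simulate_minute() {
  limit=$(tier_limit "$1" rpm) || return 1
  count=0; i=1; out=""
  while [ "$i" -le "$2" ]; do
    if [ "$count" -lt "$limit" ]; then
      count=$((count + 1)); out="$out 200"
    else
      out="$out 429"
    fi
    i=$((i + 1))
  done
  echo "${out# }"
}

simulate_minute free 12
```

For a free-tier tenant the sketch prints ten 200s followed by two 429s, matching the 10 requests/min limit.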

See docs/ARCHITECTURE.md for the full CRD inventory, request flow details, and design decisions.


Deployment Guide

Prerequisites

  • MicroK8s cluster with GPU addon enabled on worker nodes
  • kubectl v1.28+ configured for the cluster
  • helm v3.12+
  • helmfile v1.1+
  • HuggingFace token with access to meta-llama/Llama-3.1-8B-Instruct and nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8

MicroK8s CLI aliases

All scripts and commands in this guide use standard kubectl and helm. On a MicroK8s cluster, create aliases so they resolve to the MicroK8s-bundled binaries:

# Permanent system-wide aliases (recommended)
sudo snap alias microk8s.kubectl kubectl
sudo snap alias microk8s.helm helm

# Or add to ~/.bashrc / ~/.zshrc
echo 'alias kubectl="microk8s kubectl"' >> ~/.zshrc
echo 'alias helm="microk8s helm"' >> ~/.zshrc
source ~/.zshrc

Verify:

kubectl version --client
helm version

Step 1: Install Gateway API CRDs

This installs the Gateway API base CRDs (Gateway, HTTPRoute, GatewayClass), the Gateway API Inference Extension CRDs (InferencePool), and the Envoy AI Gateway CRDs (AIGatewayRoute). These are the Kubernetes resource definitions that all projects build upon.

./deploy/scripts/01-install-crds.sh

What it does:

  • Applies Gateway API v1.4.1 standard CRDs
  • Applies Inference Extension v1.3.0 CRDs (graduated InferencePool at inference.networking.k8s.io/v1)
  • Installs Envoy AI Gateway v0.5.0 CRDs (AIGatewayRoute)

Verify:

kubectl get crd gateways.gateway.networking.k8s.io
kubectl get crd inferencepools.inference.networking.k8s.io
kubectl get crd aigatewayroutes.aigateway.envoyproxy.io

Step 2: Install Envoy Gateway + AI Gateway + Redis

Envoy Gateway is the data-plane proxy. Envoy AI Gateway adds AI-native routing on top — its controller extracts the model from the request body and routes to the correct InferencePool. Redis is required as the backend for Kuadrant's distributed rate limiting (Limitador stores counters in Redis).

./deploy/scripts/02-install-envoy-gateway.sh

What it does:

  1. Deploys a standalone Redis instance into redis-system
  2. Installs Envoy Gateway v1.6.4 Helm chart with AI Gateway values files (extension manager, rate limiting addon, InferencePool addon)
  3. Installs the AI Gateway controller v0.5.0 into envoy-ai-gateway-system

Verify:

kubectl get pods -n envoy-gateway-system       # envoy-gateway controller running
kubectl get pods -n envoy-ai-gateway-system    # ai-gateway controller running
kubectl get pods -n redis-system               # redis pod running

Step 3: Install Kuadrant

Kuadrant is the policy layer. Installing the operator deploys the controller that watches for AuthPolicy, RateLimitPolicy, and TokenRateLimitPolicy CRDs. Creating the Kuadrant CR bootstraps the backing services (Authorino for auth, Limitador for rate limiting).

./deploy/scripts/03-install-kuadrant.sh

What it does:

  1. Adds the Kuadrant Helm repo and installs kuadrant-operator into kuadrant-system
  2. Creates a Kuadrant CR that triggers deployment of Authorino and Limitador

Verify:

kubectl get pods -n kuadrant-system   # operator, authorino, limitador all running
kubectl get kuadrant -n kuadrant-system  # status should show Ready

Step 4: Deploy llm-d (inference stack)

llm-d is the inference-aware scheduling layer. It uses a 5-release Helmfile pattern:

| Chart | Release | What it deploys |
|---|---|---|
| llm-d-infra v1.3.6 | llm-d-infra | CRDs and shared infrastructure. Gateway creation is disabled (gateway.create: false) since we manage the Gateway resource separately via Envoy Gateway. |
| inferencepool v1.3.0 | llm-d-inferencepool | EPP for Llama 3.1 8B: the ext_proc server that performs inference-aware routing with kvCacheAware and queueDepthAware scoring. |
| llm-d-modelservice v0.4.5 | llm-d-modelservice | vLLM worker for Llama 3.1 8B Instruct. 1 decode replica on spark-01. |

# Set your HuggingFace token first
kubectl create namespace token-labs
kubectl create secret generic hf-token \
  --from-literal="HF_TOKEN=${HF_TOKEN}" \
  -n token-labs

./deploy/scripts/04-deploy-llm-d.sh

What it does:

  1. Creates the token-labs namespace
  2. Runs helmfile apply which installs all 5 releases with values from deploy/llm-d/values/
  3. Waits for vLLM workers to download model weights and become ready (can take several minutes on first run)

Verify:

kubectl get pods -n token-labs           # 2 vLLM pods + 2 EPP pods running
kubectl get inferencepool -n token-labs  # both pools should show Ready

Step 5: Deploy AI Gateway Route

The Envoy AI Gateway uses the AIGatewayRoute CRD for model-based routing. The AI Gateway controller automatically extracts the "model" field from the request body, sets the x-ai-eg-model header, and the AIGatewayRoute matches on this header to route to the correct InferencePool backend. Token usage is tracked via llmRequestCosts.

./deploy/scripts/05-deploy-ai-gateway-route.sh

What it does:

  1. Applies the AIGatewayRoute resource which defines model-to-InferencePool routing rules
  2. Configures llmRequestCosts for per-request token tracking (InputToken, OutputToken, TotalToken)

Verify:

kubectl get aigatewayroute -n token-labs    # AIGatewayRoute listed

Step 6: Apply Gateway, routes, and policies

This step creates the actual networking and policy resources that wire everything together:

# Gateway + HTTPRoute
kubectl apply -f deploy/gateway/

# Kuadrant policies
kubectl apply -f deploy/policies/

Gateway resources (deploy/gateway/):

  • namespace.yaml — creates the token-labs namespace (idempotent)
  • gateway.yaml — creates a Gateway resource with gatewayClassName: eg, listening on HTTP port 80 with hostname inference.token-labs.local. Envoy Gateway sees this and provisions an Envoy proxy pod to handle traffic.
  • aigatewayroute.yaml — creates an AIGatewayRoute with per-model header matching (deployed in step 5). The AI Gateway controller extracts the "model" field from the request body and sets the x-ai-eg-model header. Each rule matches on this header and routes to the correct InferencePool backend. The InferencePool is the bridge to llm-d's EPP — when Envoy receives a matching request, it invokes the EPP via ext_proc to pick the optimal vLLM pod.

Kuadrant policies (deploy/policies/):

  • kuadrant.yaml: the Kuadrant CR (idempotent, already created in step 3)
  • auth-policy.yaml: AuthPolicy targeting the Gateway. Configures API key authentication: Authorino validates the Authorization: Bearer <key> header by looking up Secrets labeled authorino.kuadrant.io/managed-by: authorino. On match, it extracts kuadrant.io/groups (tier) and secret.kuadrant.io/user-id (tenant ID) from annotations and passes them in the request context. An OPA policy validates the tier is one of free, pro, or enterprise.
  • rate-limit-policy.yaml: RateLimitPolicy targeting the Gateway. Defines per-tier request count limits (e.g., free = 10/min and 100/day). Uses when predicates with CEL expressions to match auth.identity.groups and counters keyed by auth.identity.userid for tenant isolation.
  • token-rate-limit-policy.yaml: TokenRateLimitPolicy targeting the Gateway for /v1/chat/completions. This is the key CRD for LLM billing. After vLLM returns a response, Kuadrant's wasm-shim parses usage.total_tokens from the JSON body and sends it to Limitador as hits_addend. Each tenant's cumulative token usage is tracked per time window.
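
The token accounting in the last bullet can be sketched in a few lines of shell. This is a simplified model under stated assumptions: a single tenant, a single window, the free-tier 5,000 tokens/min limit, and a naive sed match in place of the wasm-shim's JSON parsing. The function names are hypothetical.

```shell
quota=5000   # free-tier tokens/min from the tier table
used=0       # running total for the current window

# Extract usage.total_tokens from an OpenAI-style response body (naive match).
total_tokens() {
  sed -n 's/.*"total_tokens"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Admission is decided before the request runs; usage is only known afterwards.
admit() { [ "$used" -lt "$quota" ]; }

# Add the cost of a completed response to the counter (the hits_addend).
record() {
  n=$(printf '%s\n' "$1" | total_tokens)
  used=$((used + ${n:-0}))
}

for body in \
  '{"usage":{"prompt_tokens":1200,"completion_tokens":1800,"total_tokens":3000}}' \
  '{"usage":{"prompt_tokens":900,"completion_tokens":1600,"total_tokens":2500}}' \
  '{"usage":{"prompt_tokens":10,"completion_tokens":5,"total_tokens":15}}'
do
  if admit; then
    record "$body"
    echo "200 used=$used/$quota"
  else
    echo "429 used=$used/$quota"
  fi
done
```

In this model the second response pushes the counter past the quota because its cost is only known after generation, and the third request is the first one rejected. Whether the deployed policy overshoots in the same way depends on how Kuadrant applies the counter, so treat that detail as an assumption of the sketch.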

Verify:

kubectl get gateway -n token-labs            # Programmed: True
kubectl get aigatewayroute -n token-labs     # Listed
kubectl get authpolicy -n token-labs         # Accepted: True
kubectl get ratelimitpolicy -n token-labs    # Accepted: True
kubectl get tokenratelimitpolicy -n token-labs  # Accepted: True

Step 7: Create tenant API keys

Demo tenants are provided for testing (see Tenant Model above for full onboarding instructions):

kubectl apply -f deploy/tenants/

This creates two demo tenants:

  • tenant-free-demo — free tier, key tlabs_free_demo_key_change_me
  • tenant-pro-demo — pro tier, key tlabs_pro_demo_key_change_me

For production tenants, follow the onboarding steps in the Tenant Model section.

Test it

# Port-forward to the gateway (or use MetalLB IP)
kubectl port-forward -n envoy-gateway-system \
  svc/envoy-default-token-labs-gateway 8080:80 &

# List models
curl -s http://localhost:8080/v1/models \
  -H "Authorization: Bearer tlabs_pro_demo_key_change_me" | jq

# Chat completion (Llama 3.1 8B)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer tlabs_pro_demo_key_change_me" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 200
  }' | jq

# Chat completion (Nemotron Nano 12B VL)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer tlabs_pro_demo_key_change_me" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8",
    "messages": [{"role": "user", "content": "Describe this image."}],
    "max_tokens": 200
  }' | jq

# Text-to-speech (Magpie TTS)
curl -s http://localhost:8080/v1/audio/speech \
  -H "Authorization: Bearer tlabs_pro_demo_key_change_me" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Welcome to Token Labs.",
    "voice": "aria",
    "language": "en"
  }' --output speech.wav

# Verify rate limiting (free tier — should get 429 after 10 requests/min)
for i in $(seq 1 15); do
  echo -n "Request $i: "
  curl -s -o /dev/null -w "%{http_code}" \
    http://localhost:8080/v1/chat/completions \
    -H "Authorization: Bearer tlabs_free_demo_key_change_me" \
    -H "Content-Type: application/json" \
    -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hi"}],"max_tokens":5}'
  echo
done

Observability

The stack exposes metrics from all layers via Prometheus ServiceMonitors:

# Optional: deploy ServiceMonitors
kubectl apply -f deploy/monitoring/service-monitors.yaml

| Source | Key Metrics |
|---|---|
| Limitador | limitador_counter_hits_total (per-tenant request/token counts) |
| Authorino | auth_server_response_status (auth allow/deny rates) |
| vLLM | vllm:kv_cache_usage_perc, vllm:request_latency_seconds |
| EPP | Routing decisions, prefix-cache hit rates |

llm-d also provides ready-made Grafana dashboards — see docs/ARCHITECTURE.md for details.


Benchmarks

| Metric | Prefill (Input) | Decode (Output) |
|---|---|---|
| Throughput | 3,203 tok/s | 520 tok/s |
| Cost/1M tokens | $0.006 | $0.037 |
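
The two rows are related by simple arithmetic if the node is billed at a flat hourly rate: cost per 1M tokens = hourly cost / (tokens per second × 3600) × 1,000,000. The sketch below applies that formula. The 0.069 USD/hour rate is an assumption back-calculated from the table, not a figure the benchmark reports, and the function name is hypothetical.

```shell
# Usage: cost_per_million <usd_per_hour> <tokens_per_second>
cost_per_million() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.3f\n", c / (t * 3600) * 1000000 }'
}

cost_per_million 0.069 3203   # prefill throughput from the table
cost_per_million 0.069 520    # decode throughput from the table
```

At that assumed rate the sketch prints 0.006 and 0.037, which lines up with the table; decode costs roughly six times more per token simply because its throughput is about six times lower.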

Accuracy Testing

Model quality is checked across quantizations using lighteval with the IFEval benchmark. Models are compared against the meta-llama/Llama-3.1-8B-Instruct baseline using a ±5% threshold. See baselines/README.md for details.

Repository Structure

├── deploy/
│   ├── scripts/              # Installation scripts (run in order)
│   │   ├── 01-install-crds.sh
│   │   ├── 02-install-envoy-gateway.sh
│   │   ├── 03-install-kuadrant.sh
│   │   ├── 04-deploy-llm-d.sh
│   │   └── 05-deploy-ai-gateway-route.sh
│   ├── gateway/              # Gateway + AIGatewayRoute resources
│   ├── llm-d/                # Helmfile + values for llm-d 5-release deploy
│   │   ├── helmfile.yaml.gotmpl
│   │   └── values/
│   ├── policies/             # Kuadrant AuthPolicy, RateLimitPolicy, TokenRateLimitPolicy
│   └── tenants/              # Tenant API key Secrets (template + demos)
├── docs/
│   ├── ARCHITECTURE.md       # Full architecture deep-dive
│   ├── index.html            # Live demo page
│   └── benchmark-results.*   # Benchmark data
├── baselines/                # Accuracy baseline values
├── scripts/                  # Benchmark and analysis scripts
├── Dockerfile                # vLLM build for ARM64
└── .github/workflows/        # CI/CD pipelines

License

MIT