vLLMux is a self-hosted control plane for serving many LLMs on vLLM. Paste a `vllm serve …` command and it becomes a routable model; the router load-balances across instances; and a bundled Prometheus + Grafana stack monitors everything, all behind one Vue dashboard.

- Add a model by pasting `vllm serve …` — parsed into a form and layered on as a dynamic overlay; the router hot-reloads, no `config.yaml` edits.
- Lifecycle + self-healing — per-instance state machine (`stopped → starting → ready → failed`), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with backoff.
- Pluggable routing strategies — pick the load-balancing policy per model group or globally: `least_load` (default), `round_robin`, `random`, `least_inflight`, `p2c`, plus `session_affinity` / `prefix_affinity` for cache reuse on multi-turn chat & shared prompts. Switch it live from the dashboard; transparent failover + per-backend cooldown apply to every strategy.
- Cross-instance KV-cache sharing — toggle it per model group in the editor: instances offload/load KV blocks to a shared store (vLLM `OffloadingConnector`) so a prefix computed on one replica is reused by another (verified ~99% external prefix-cache hit, 31% lower TTFT on a warmed prompt). Traffic & topology views show which groups pool KV vs keep their own.
- Live observability — SSE status, animated system-topology & router-balancing graphs, per-model usage / latency / error stats.
- Bundled Grafana monitoring — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO thresholds & alerts.
- Playground — OpenAI-compatible chat (streaming) / completions / embeddings / reranking, with reasoning display.
- Benchmark & evaluate — evalscope load tests (concurrency, arrival-rate, SLA auto-tune) plus 30+ accuracy datasets with LLM-as-judge.
- Libraries — browse / pre-download HF model weights & datasets from the UI; tool-calling parser helper; LoRA support.
- Secure by config — admin-token-gated controls, plus mint/revoke API keys with per-key usage attribution.
See docs/features.md for the full breakdown.
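
To make the `p2c` (power-of-two-choices) strategy and the per-backend cooldown from the list above concrete, here is a minimal Python sketch. It is not vLLMux's router code: the `Backend` class, its `inflight` and `cooldown_until` fields, the `pick_p2c` function, and the example ports are all hypothetical, and the sketch assumes the in-flight request count is the load signal.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical view of one vLLM instance as seen by a router."""
    url: str
    inflight: int = 0            # requests currently being served
    cooldown_until: float = 0.0  # monotonic time before which the backend is skipped


def pick_p2c(backends, now=None, rng=random):
    """Power-of-two-choices: sample two usable backends, keep the less loaded one."""
    now = time.monotonic() if now is None else now
    usable = [b for b in backends if b.cooldown_until <= now]
    if not usable:
        raise RuntimeError("no backend available (all cooling down)")
    if len(usable) == 1:
        return usable[0]
    first, second = rng.sample(usable, 2)
    return first if first.inflight <= second.inflight else second


if __name__ == "__main__":
    pool = [
        Backend("http://localhost:8001", inflight=4),
        Backend("http://localhost:8002", inflight=1),
        # This backend recently failed, so it is skipped for 30 seconds.
        Backend("http://localhost:8003", inflight=0,
                cooldown_until=time.monotonic() + 30),
    ]
    print(pick_p2c(pool).url)
```

Sampling only two candidates keeps the choice cheap while still steering traffic away from the busiest instance; a real router would also increment and decrement `inflight` around each proxied request.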
Requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in Docker Desktop).
cp deploy/.env.example deploy/.env # set HF_TOKEN, which GPUs, the admin token
make up # build + start the whole stack
# open http://localhost:8884

`make down` stops it · `make logs` tails all services · `make ps` shows status.

curl http://localhost:8887/v1/models # router: configured model groups
curl http://localhost:5000/api/models # backend: lifecycle state of each instance
# http://localhost:8884/grafana # dashboards + alerts

Every `deploy/.env` setting — host ports (`FRONTEND_PORT` / `ROUTER_PORT` / …), auth
tokens, GPU selection, and cache locations — is documented inline in
deploy/.env.example and tabulated in
docs/deployment.md#environment-variables-deployenv.
Full topology, the shared-netns rationale, volumes, and a manual run are in the same
docs/deployment.md.
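
The router check shown earlier can also be scripted. The sketch below assumes the router's `/v1/models` response follows the usual OpenAI list shape (`{"data": [{"id": ...}]}`), which this README implies but does not spell out; `fetch_json` and `model_ids` are hypothetical helper names. The backend's `/api/models` payload is not parsed here because its shape is not described in this document.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model-group names from an OpenAI-style /v1/models payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_json(url, timeout=5.0):
    """GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the stack running, `model_ids(fetch_json("http://localhost:8887/v1/models"))` should return the configured model groups, provided the response has the assumed shape.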
Every model is reachable through a single OpenAI-compatible origin — the router on
`:8887` (or `/v1` via the dashboard's nginx). Pick the model by its `model` field and
the router load-balances across that group's instances:

| Endpoint | Purpose |
|---|---|
| `POST /v1/chat/completions` | Chat — streaming supported; load-balanced across the model group |
| `POST /v1/completions` | Text completion |
| `POST /v1/embeddings` | Embeddings — and reranking when a `query` field is present (forwarded to the embedding/rerank server) |
| `GET /v1/models` | List configured model groups |

curl http://localhost:8887/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "Qwen3-0.6B", "messages": [{"role": "user", "content": "hi"}]}'

Request/response shapes and auth details are in docs/API.md.
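
Since the chat endpoint supports streaming, a client needs to read the response incrementally. The following Python sketch assumes the router passes through OpenAI-style server-sent events (`data: {json}` lines ending with `data: [DONE]`); that framing is an assumption, not something this README confirms. `chat_request` and `iter_stream_text` are hypothetical names, and no auth header is set.

```python
import json
import urllib.request


def chat_request(base_url, model, prompt, stream=True):
    """Build a POST to /v1/chat/completions; `model` selects the model group."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines):
    """Yield text deltas from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        choices = json.loads(data).get("choices") or []
        if choices:
            text = (choices[0].get("delta") or {}).get("content")
            if text:
                yield text
```

To use it, open `chat_request("http://localhost:8887", "Qwen3-0.6B", "hi")` with `urllib.request.urlopen` and pass the response object to `iter_stream_text`; an HTTP response iterates line by line, so the text deltas arrive as they are generated.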
flowchart LR
Client([Clients])
FE["<b>frontend</b><br/>nginx · :8884<br/>single origin"]
VLLM["<b>vLLM instances</b><br/>:800x"]
GF["<b>grafana</b><br/>/grafana"]
DCGM["dcgm-exporter<br/>:9400 · GPU"]
NODE["node-exporter<br/>:9100 · host"]
subgraph netns["shared network namespace"]
BE["<b>backend</b> · :5000<br/>model lifecycle"]
RT["<b>router</b> · :8887<br/>OpenAI-compatible LB"]
PR["<b>prometheus</b> · :9090"]
end
Client --> FE
FE -->|/api| BE
FE -->|/v1| RT
FE -->|/grafana| GF
BE -->|launch on demand| VLLM
RT -->|route + balance| VLLM
PR -->|scrape /metrics| VLLM
PR --> DCGM
PR --> NODE
GF -->|query| PR
The router only routes — the backend owns model lifecycle. The frontend, router,
backend, and Grafana sit behind nginx on a single origin; backend, router, and Prometheus
share one network namespace so the spawned vLLM instances are reachable on localhost.
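
The single-origin layout in the flowchart amounts to prefix-based dispatch. The Python sketch below only mirrors the three labelled edges (`/api`, `/v1`, `/grafana`); the real dispatch is done by nginx, whose configuration is not shown in this README, and `ROUTES` and `upstream_for` are hypothetical names.

```python
# Path prefixes and the component each one is proxied to, per the flowchart.
# The component labels are descriptive names, not real hostnames.
ROUTES = [
    ("/api", "backend"),
    ("/v1", "router"),
    ("/grafana", "grafana"),
]


def upstream_for(path):
    """Return the component that would handle `path`, else the dashboard itself."""
    for prefix, upstream in ROUTES:
        # Match whole path segments so "/v1beta" is not mistaken for "/v1".
        if path == prefix or path.startswith(prefix + "/"):
            return upstream
    return "frontend"
```

For example, `upstream_for("/v1/chat/completions")` returns `"router"`, while any unmatched path falls through to the dashboard.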
| Topic | Document |
|---|---|
| Deployment & topology | docs/deployment.md |
| Configuration (`config.yaml`) | docs/configuration.md |
| Features in depth | docs/features.md |
| Monitoring (Prometheus + Grafana) | docs/monitoring.md |
| HTTP API | docs/API.md |
NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.
Tip — running multiple instances on limited RAM. Each vLLM instance runs `torch.compile` + CUDA-graph capture on startup, which is heavy on system RAM (not VRAM). On a small box (e.g. WSL2 with ~8GB RAM), launching a second instance of the same model can exhaust RAM and thrash swap, leaving the new instance stuck in `starting`. Add `--enforce-eager` to the launch command to skip compilation: startup drops from minutes to seconds and RAM/CPU pressure falls sharply, at a small inference-latency cost. RAM — not VRAM — is usually the bottleneck for multi-instance, so give WSL more memory (`.wslconfig` → `memory=12GB`, then `wsl --shutdown`) before scaling out.
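
If launch commands are generated by a script, the flag from the tip can be appended mechanically. This is a small Python sketch, not part of vLLMux: `with_enforce_eager` is a hypothetical helper, and the model name and port in the example are placeholders.

```python
import shlex


def with_enforce_eager(command):
    """Return the launch command with --enforce-eager appended if it is missing."""
    args = shlex.split(command)
    if "--enforce-eager" not in args:
        args.append("--enforce-eager")
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder command; substitute the one you would paste into the dashboard.
    print(with_enforce_eager("vllm serve Qwen/Qwen3-0.6B --port 8001"))
```

Using `shlex` rather than plain string concatenation keeps quoted arguments intact and makes the helper safe to call twice on the same command.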
MIT — see LICENSE.



