GitHub - LLMSystems/vLLMux: One-Stop LLM Model Management and Monitoring Platform

One-stop platform to deploy, route, monitor & evaluate your vLLM cluster

vLLMux is a self-hosted control plane for serving many LLMs on vLLM. Paste a vllm serve … command and it becomes a routable model; the router load-balances across instances; and a bundled Prometheus + Grafana stack monitors everything — all behind one Vue dashboard.

Highlights

Add a model by pasting vllm serve … — parsed into a form and layered on as a dynamic overlay; the router hot-reloads, no config.yaml edits.
Lifecycle + self-healing — per-instance state machine (stopped → starting → ready → failed), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with backoff.
Pluggable routing strategies — pick the load-balancing policy per model group or globally: least_load (default), round_robin, random, least_inflight, p2c, plus session_affinity / prefix_affinity for cache reuse on multi-turn chat & shared prompts. Switch it live from the dashboard; transparent failover + per-backend cooldown apply to every strategy.
Cross-instance KV-cache sharing — toggle it per model group in the editor: instances offload/load KV blocks to a shared store (vLLM OffloadingConnector) so a prefix computed on one replica is reused by another (verified ~99% external prefix-cache hit, 31% lower TTFT on a warmed prompt). Traffic & topology views show which groups pool KV vs keep their own.
Live observability — SSE status, animated system-topology & router-balancing graphs, per-model usage / latency / error stats.
Bundled Grafana monitoring — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO thresholds & alerts.
Playground — OpenAI-compatible chat (streaming) / completions / embeddings / reranking, with reasoning display.
Benchmark & evaluate — evalscope load tests (concurrency, arrival-rate, SLA auto-tune) plus 30+ accuracy datasets with LLM-as-judge.
Libraries — browse / pre-download HF model weights & datasets from the UI; tool-calling parser helper; LoRA support.
Secure by config — admin-token-gated controls, plus mint/revoke API keys with per-key usage attribution.

See docs/features.md for the full breakdown.

Quick start

Requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in Docker Desktop).

cp deploy/.env.example deploy/.env   # set HF_TOKEN, which GPUs, the admin token
make up                              # build + start the whole stack
# open http://localhost:8884

make down stops it · make logs tails all services · make ps shows status.

curl http://localhost:8887/v1/models     # router: configured model groups
curl http://localhost:5000/api/models    # backend: lifecycle state of each instance
# http://localhost:8884/grafana          # dashboards + alerts

Every deploy/.env setting — host ports (FRONTEND_PORT/ROUTER_PORT/…), auth tokens, GPU selection, and cache locations — is documented inline in deploy/.env.example and tabulated in docs/deployment.md#environment-variables-deployenv. Full topology, the shared-netns rationale, volumes, and a manual run are in the same docs/deployment.md.

One endpoint for the whole fleet

Every model is reachable through a single OpenAI-compatible origin — the router on :8887 (or /v1 via the dashboard's nginx). Pick the model by its model field and the router load-balances across that group's instances:

Endpoint	Purpose
`POST /v1/chat/completions`	Chat — streaming supported; load-balanced across the model group
`POST /v1/completions`	Text completion
`POST /v1/embeddings`	Embeddings — and reranking when a `query` field is present (forwarded to the embedding/rerank server)
`GET /v1/models`	List configured model groups

curl http://localhost:8887/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen3-0.6B", "messages": [{"role": "user", "content": "hi"}]}'

Request/response shapes and auth details are in docs/API.md.

Architecture

flowchart LR
    Client([Clients])
    FE["<b>frontend</b><br/>nginx · :8884<br/>single origin"]
    VLLM["<b>vLLM instances</b><br/>:800x"]
    GF["<b>grafana</b><br/>/grafana"]
    DCGM["dcgm-exporter<br/>:9400 · GPU"]
    NODE["node-exporter<br/>:9100 · host"]

    subgraph netns["shared network namespace"]
        BE["<b>backend</b> · :5000<br/>model lifecycle"]
        RT["<b>router</b> · :8887<br/>OpenAI-compatible LB"]
        PR["<b>prometheus</b> · :9090"]
    end

    Client --> FE
    FE -->|/api| BE
    FE -->|/v1| RT
    FE -->|/grafana| GF
    BE -->|launch on demand| VLLM
    RT -->|route + balance| VLLM
    PR -->|scrape /metrics| VLLM
    PR --> DCGM
    PR --> NODE
    GF -->|query| PR

The router only routes — the backend owns model lifecycle. The frontend, router, backend, and Grafana sit behind nginx on a single origin; backend, router, and Prometheus share one network namespace so the spawned vLLM instances are reachable on localhost.

Documentation

Topic
Deployment & topology	docs/deployment.md
Configuration (`config.yaml`)	docs/configuration.md
Features in depth	docs/features.md
Monitoring (Prometheus + Grafana)	docs/monitoring.md
HTTP API	docs/API.md

Requirements

NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.

Tip — running multiple instances on limited RAM. Each vLLM instance runs torch.compile + CUDA-graph capture on startup, which is heavy on system RAM (not VRAM). On a small box (e.g. WSL2 with ~8GB RAM), launching a second instance of the same model can exhaust RAM and thrash swap, leaving the new instance stuck in starting. Add --enforce-eager to the launch command to skip compilation: startup drops from minutes to seconds and RAM/CPU pressure falls sharply, at a small inference-latency cost. RAM — not VRAM — is usually the bottleneck for multi-instance, so give WSL more memory (.wslconfig → memory=12GB, then wsl --shutdown) before scaling out.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
.claude		.claude
.github/workflows		.github/workflows
apps		apps
assets		assets
deploy		deploy
docs		docs
packages		packages
.dockerignore		.dockerignore
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
README_zh-CN.md		README_zh-CN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Highlights

Quick start

One endpoint for the whole fleet

Architecture

Documentation

Requirements

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Highlights

Quick start

One endpoint for the whole fleet

Architecture

Documentation

Requirements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages