Merged
12 changes: 11 additions & 1 deletion README.md
@@ -28,7 +28,7 @@ becomes a routable model; the router load-balances across instances; and a bundl

- **Add a model by pasting `vllm serve …`** — parsed into a form and layered on as a dynamic overlay; the router hot-reloads, no `config.yaml` edits.
- **Lifecycle + self-healing** — per-instance state machine (`stopped → starting → ready → failed`), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with backoff.
- **Load-aware routing** — picks the least-loaded replica (running/waiting requests + KV-cache usage).
- **Pluggable routing strategies** — pick the load-balancing policy per model group or globally: `least_load` (default), `round_robin`, `random`, `least_inflight`, `p2c`, plus `session_affinity` / `prefix_affinity` for cache reuse on multi-turn chat & shared prompts. Switch it live from the dashboard; transparent failover + per-backend cooldown apply to every strategy.
- **Live observability** — SSE status, animated system-topology & router-balancing graphs, per-model usage / latency / error stats.
- **Bundled Grafana monitoring** — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO thresholds & alerts.
- **Playground** — OpenAI-compatible chat (streaming) / completions / embeddings / reranking, with reasoning display.
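
The routing bullets above name the load signals (running/waiting requests plus KV-cache usage) but not how they are combined. The following is a minimal sketch of how `least_load` and `p2c` could pick a replica, assuming `p2c` means power-of-two-choices. The `Backend` class, the `load_score` weighting, and both picker functions are hypothetical illustrations, not the project's router code.

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical snapshot of one replica's load (field names are illustrative)."""
    name: str
    running: int           # requests currently executing
    waiting: int           # requests queued
    kv_cache_usage: float  # fraction of KV cache in use, 0.0 to 1.0


def load_score(b: Backend) -> float:
    # Assumed weighting: the README lists the inputs, not the formula.
    return b.running + b.waiting + 10.0 * b.kv_cache_usage


def pick_least_load(backends: list[Backend]) -> Backend:
    # Scan every replica and keep the one with the lowest score.
    return min(backends, key=load_score)


def pick_p2c(backends: list[Backend], rng: random.Random) -> Backend:
    # Power of two choices: sample two distinct replicas, keep the lighter one.
    if len(backends) == 1:
        return backends[0]
    a, b = rng.sample(backends, 2)
    return a if load_score(a) <= load_score(b) else b
```

With at least two replicas, `pick_p2c` never returns the single most loaded one, while inspecting only two candidates per request instead of all of them.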
@@ -107,6 +107,16 @@ share one network namespace so the spawned vLLM instances are reachable on `loca

NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.

> **Tip — running multiple instances on limited RAM.** Each vLLM instance runs
> `torch.compile` + CUDA-graph capture on startup, which is heavy on **system RAM**
> (not VRAM). On a small box (e.g. WSL2 with ~8GB RAM), launching a second instance
> of the same model can exhaust RAM and thrash swap, leaving the new instance stuck
> in `starting`. Add **`--enforce-eager`** to the launch command to skip compilation:
> startup drops from minutes to seconds and RAM/CPU pressure falls sharply, at a small
> inference-latency cost. RAM — not VRAM — is usually the bottleneck for multi-instance,
> so give WSL more memory (`.wslconfig` → `memory=12GB`, then `wsl --shutdown`) before
> scaling out.
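
As a rough illustration of the tip above, a pasted launch command could be normalised so that `--enforce-eager` is always present. The helper name `with_enforce_eager` is hypothetical and not part of this project; only the `--enforce-eager` flag itself comes from the text.

```python
import shlex


def with_enforce_eager(command: str) -> str:
    """Return the launch command with --enforce-eager appended if it is missing."""
    tokens = shlex.split(command)
    if "--enforce-eager" not in tokens:
        tokens.append("--enforce-eager")
    # shlex.join re-quotes tokens only where the shell would need it.
    return shlex.join(tokens)
```

Calling it twice is harmless: the flag is added at most once.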

## License

MIT — see [LICENSE](LICENSE).
10 changes: 9 additions & 1 deletion README_zh-CN.md
@@ -29,7 +29,7 @@

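- **Add a model by pasting `vllm serve …`**: parsed into a form and layered on as a dynamic overlay; the router hot-reloads.
- **Lifecycle**: per-instance state machine (`stopped → starting → ready → failed`), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with exponential backoff.
- **Load-aware routing**: automatically picks the least-loaded replica (running/waiting requests + KV-cache usage)
- **Pluggable routing strategies**: choose the load-balancing strategy per model group or globally: `least_load` (default), `round_robin`, `random`, `least_inflight`, `p2c`, plus `session_affinity` / `prefix_affinity` (cache reuse for multi-turn chat and shared prompts). Switch it live from the dashboard; failover and per-backend cooldown apply to every strategy
- **Live observability**: SSE status, animated system-topology and router load-balancing graphs, per-model usage / latency / error stats.
- **Built-in Grafana monitoring**: Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards are embedded in-app, with SLO threshold lines and alerts.
- **Playground**: OpenAI-compatible chat (streaming) / completions / embeddings / reranking.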
@@ -107,6 +107,14 @@ namespace, so the spawned vLLM instances can reach each other on `localhost`.

NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.

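> **Tip: running multiple instances on limited RAM.** Each vLLM instance runs
> `torch.compile` + CUDA-graph capture at startup, which is very heavy on **system RAM** (not VRAM). On a small machine
> (e.g. WSL2 with only ~8GB RAM), starting a second instance of the same model can easily exhaust RAM and thrash
> swap, leaving the new instance stuck in `starting`. Add **`--enforce-eager`** to the launch command to skip
> compilation: startup drops from minutes to seconds and RAM/CPU pressure falls sharply, at the cost of slightly higher inference latency. The bottleneck
> for multiple instances is usually **RAM rather than VRAM**, so give WSL more memory first (`.wslconfig` →
> `memory=12GB`, then `wsl --shutdown`) before scaling out.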

## License

MIT; see [LICENSE](LICENSE).
8 changes: 7 additions & 1 deletion apps/backend/app/llmops/launchers.py
@@ -53,6 +53,12 @@ def _write_effective_config(config) -> str:

# Keys consumed as env vars / handled specially, not emitted as CLI flags.
_LORA_RUNTIME_KEY = "allow_runtime_lora"
# Router-only knobs that ride the shared model_config (EngineModelConfig is
# extra="allow") but belong to the router, not vLLM — never pass them to
# `vllm serve` or it errors on an unknown argument.
_ROUTER_ONLY_KEYS = frozenset({"routing_strategy"})
# Everything build_vllm_cli_args must skip (model_tag is the positional arg).
_SKIP_CLI_KEYS = frozenset({"model_tag", _LORA_RUNTIME_KEY}) | _ROUTER_ONLY_KEYS

# vLLM's --max-loras defaults to 1 (only one distinct adapter per batch, which
# serialises mixed-LoRA traffic and leaves no headroom for hot-loading more).
@@ -85,7 +91,7 @@ def build_vllm_cli_args(model_cfg: dict) -> list[str]:

     cli_args = ["serve", model_tag]
     for key, value in model_cfg.items():
-        if key == "model_tag" or key == _LORA_RUNTIME_KEY or value is None:
+        if key in _SKIP_CLI_KEYS or value is None:
             continue
         key_flag = "--" + key.replace("_", "-")
         if key == "lora_modules":
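
To see the effect of the skip set in isolation, here is a minimal, self-contained sketch. The three constants are copied from the diff; `build_cli_args_sketch` is a hypothetical, simplified stand-in for `build_vllm_cli_args` that handles scalar values only (no `lora_modules` or other special cases), and the `ValueError` on a missing `model_tag` is an assumption about the error type.

```python
_LORA_RUNTIME_KEY = "allow_runtime_lora"
_ROUTER_ONLY_KEYS = frozenset({"routing_strategy"})
_SKIP_CLI_KEYS = frozenset({"model_tag", _LORA_RUNTIME_KEY}) | _ROUTER_ONLY_KEYS


def build_cli_args_sketch(model_cfg: dict) -> list[str]:
    """Simplified stand-in: turn scalar config keys into `vllm serve` flags."""
    model_tag = model_cfg.get("model_tag")
    if not model_tag:
        # Assumed error type; the real function's exception is not shown in the diff.
        raise ValueError("model_tag is required")
    cli_args = ["serve", model_tag]
    for key, value in model_cfg.items():
        # Router-only and specially handled keys never become CLI flags.
        if key in _SKIP_CLI_KEYS or value is None:
            continue
        cli_args += ["--" + key.replace("_", "-"), str(value)]
    return cli_args
```

Keeping the exclusions in one frozenset means a future router-only knob only needs to be added to `_ROUTER_ONLY_KEYS`, rather than extending the `if` condition.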
11 changes: 11 additions & 0 deletions apps/backend/tests/unit/test_launchers.py
@@ -74,6 +74,17 @@ def test_build_vllm_cli_args_requires_model_tag():
build_vllm_cli_args({"dtype": "float16"})


def test_routing_strategy_not_passed_to_vllm():
    # routing_strategy is a router-only knob riding the shared model_config; it
    # must never reach `vllm serve` (vLLM errors on the unknown arg).
    args = build_vllm_cli_args(
        {"model_tag": "org/m", "dtype": "float16", "routing_strategy": "session_affinity"}
    )
    assert "--routing-strategy" not in args
    assert "session_affinity" not in args
    assert "--dtype" in args  # other flags still pass through


def test_build_vllm_cli_args_lora_modules_multi_value():
    args = build_vllm_cli_args(
        {