LLMSystems · milk333445 · Jun 20, 2026 · Jun 20, 2026
diff --git a/README.md b/README.md
@@ -28,7 +28,7 @@ becomes a routable model; the router load-balances across instances; and a bundl
 
 - **Add a model by pasting `vllm serve …`** — parsed into a form and layered on as a dynamic overlay; the router hot-reloads, no `config.yaml` edits.
 - **Lifecycle + self-healing** — per-instance state machine (`stopped → starting → ready → failed`), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with backoff.
-- **Load-aware routing** — picks the least-loaded replica (running/waiting requests + KV-cache usage).
+- **Pluggable routing strategies** — pick the load-balancing policy per model group or globally: `least_load` (default), `round_robin`, `random`, `least_inflight`, `p2c`, plus `session_affinity` / `prefix_affinity` for cache reuse on multi-turn chat & shared prompts. Switch it live from the dashboard; transparent failover + per-backend cooldown apply to every strategy.
 - **Live observability** — SSE status, animated system-topology & router-balancing graphs, per-model usage / latency / error stats.
 - **Bundled Grafana monitoring** — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO thresholds & alerts.
 - **Playground** — OpenAI-compatible chat (streaming) / completions / embeddings / reranking, with reasoning display.
@@ -107,6 +107,16 @@ share one network namespace so the spawned vLLM instances are reachable on `loca
 
 NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.
 
+> **Tip — running multiple instances on limited RAM.** Each vLLM instance runs
+> `torch.compile` + CUDA-graph capture on startup, which is heavy on **system RAM**
+> (not VRAM). On a small box (e.g. WSL2 with ~8GB RAM), launching a second instance
+> of the same model can exhaust RAM and thrash swap, leaving the new instance stuck
+> in `starting`. Add **`--enforce-eager`** to the launch command to skip compilation:
+> startup drops from minutes to seconds and RAM/CPU pressure falls sharply, at a small
+> inference-latency cost. RAM, not VRAM, is usually the bottleneck for multi-instance setups,
+> so give WSL more memory (`.wslconfig` → `memory=12GB`, then `wsl --shutdown`) before
+> scaling out.
+
 ## License
 
 MIT — see [LICENSE](LICENSE).
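
As a rough illustration of one strategy named in the feature list above, the sketch below assumes `p2c` means "power of two choices": sample two backends at random and route to the less loaded of the pair. The diff does not confirm that reading. The `Backend` class, its `inflight` field, and `pick_p2c` are hypothetical names, and the project's router may score load differently (the previous README wording mentioned running/waiting requests plus KV-cache usage).

```python
import random
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical backend record; `inflight` stands in for any load score."""
    name: str
    inflight: int


def pick_p2c(backends: list[Backend], rng: random.Random) -> Backend:
    """Sample two distinct backends and return the one with the lower load."""
    if not backends:
        raise ValueError("no backends available")
    if len(backends) == 1:
        return backends[0]
    first, second = rng.sample(backends, 2)
    return first if first.inflight <= second.inflight else second


pool = [Backend("a", 5), Backend("b", 1), Backend("c", 9)]
chosen = pick_p2c(pool, random.Random(0))
print(chosen.name)
```

With three backends, the most loaded one can never win a pairwise comparison, so this approach avoids the worst replica without scanning the whole pool on every request.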
diff --git a/README_zh-CN.md b/README_zh-CN.md
@@ -29,7 +29,7 @@
 
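 - **Add a model by pasting `vllm serve …`** — parsed into a form and layered on as a dynamic overlay; the router hot-reloads.
 - **Lifecycle** — per-instance state machine (`stopped → starting → ready → failed`), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with exponential backoff.
-- **Load-aware routing** — automatically picks the least-loaded replica (running/waiting requests + KV-cache usage).
+- **Pluggable routing strategies** — choose a load-balancing strategy per model group or globally: `least_load` (default), `round_robin`, `random`, `least_inflight`, `p2c`, plus `session_affinity` / `prefix_affinity` (cache reuse for multi-turn chat and shared prompts). Switch it live from the dashboard; failover and per-backend cooldown apply to every strategy.
 - **Live observability** — SSE status, animated system-topology and router load-balancing graphs, per-model usage / latency / error stats.
 - **Bundled Grafana monitoring** — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO threshold lines and alerts.
 - **Playground** — OpenAI-compatible chat (streaming) / completions / embeddings / reranking.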
@@ -107,6 +107,14 @@ namespace, so the spawned vLLM instances can reach each other on `localhost`.
 
 NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.
 
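+> **Tip — running multiple instances on limited RAM.** Each vLLM instance runs
+> `torch.compile` + CUDA-graph capture on startup, which is heavy on **system RAM** (not VRAM). On a small box
+> (e.g. WSL2 with only ~8GB RAM), launching a second instance of the same model can easily exhaust RAM and thrash
+> swap, leaving the new instance stuck in `starting`. Add **`--enforce-eager`** to the launch command to skip
+> compilation: startup drops from minutes to seconds and RAM/CPU pressure falls sharply, at the cost of slightly
+> higher inference latency. The bottleneck for multiple instances is usually **RAM, not VRAM**, so give WSL more
+> memory (`.wslconfig` → `memory=12GB`, then `wsl --shutdown`) before scaling out.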
+
 ## License
 
 MIT — see [LICENSE](LICENSE).
diff --git a/apps/backend/app/llmops/launchers.py b/apps/backend/app/llmops/launchers.py
@@ -53,6 +53,12 @@ def _write_effective_config(config) -> str:
 
 # Keys consumed as env vars / handled specially, not emitted as CLI flags.
 _LORA_RUNTIME_KEY = "allow_runtime_lora"
+# Router-only knobs that ride the shared model_config (EngineModelConfig is
+# extra="allow") but belong to the router, not vLLM — never pass them to
+# `vllm serve` or it errors on an unknown argument.
+_ROUTER_ONLY_KEYS = frozenset({"routing_strategy"})
+# Everything build_vllm_cli_args must skip (model_tag is the positional arg).
+_SKIP_CLI_KEYS = frozenset({"model_tag", _LORA_RUNTIME_KEY}) | _ROUTER_ONLY_KEYS
 
 # vLLM's --max-loras defaults to 1 (only one distinct adapter per batch, which
 # serialises mixed-LoRA traffic and leaves no headroom for hot-loading more).
@@ -85,7 +91,7 @@ def build_vllm_cli_args(model_cfg: dict) -> list[str]:
 
     cli_args = ["serve", model_tag]
     for key, value in model_cfg.items():
-        if key == "model_tag" or key == _LORA_RUNTIME_KEY or value is None:
+        if key in _SKIP_CLI_KEYS or value is None:
             continue
         key_flag = "--" + key.replace("_", "-")
         if key == "lora_modules":
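
The skip-set filter introduced in this file can be sketched in isolation as follows. This is a simplified stand-in, not the project's `build_vllm_cli_args`: the function name `sketch_cli_args` is hypothetical, the real function's special cases (such as `lora_modules`) are omitted, and rendering every remaining value as `--flag value` via `str()` is an assumption made for brevity.

```python
# Simplified sketch of the key filter; not the project's actual implementation.
_LORA_RUNTIME_KEY = "allow_runtime_lora"
_ROUTER_ONLY_KEYS = frozenset({"routing_strategy"})
_SKIP_CLI_KEYS = frozenset({"model_tag", _LORA_RUNTIME_KEY}) | _ROUTER_ONLY_KEYS


def sketch_cli_args(model_cfg: dict) -> list[str]:
    """Hypothetical helper: turn a config dict into `vllm serve` arguments."""
    cli_args = ["serve", model_cfg["model_tag"]]
    for key, value in model_cfg.items():
        # Skip the positional arg, env-only keys, router-only keys, and unset values.
        if key in _SKIP_CLI_KEYS or value is None:
            continue
        # Assumption: every remaining value is emitted as `--flag value`.
        cli_args += ["--" + key.replace("_", "-"), str(value)]
    return cli_args


example_args = sketch_cli_args(
    {"model_tag": "org/m", "dtype": "float16", "routing_strategy": "p2c"}
)
print(example_args)  # ['serve', 'org/m', '--dtype', 'float16']
```

Keeping the router-only keys in their own frozenset means a future router knob only needs to be added in one place to stay out of the `vllm serve` command line.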

diff --git a/apps/backend/tests/unit/test_launchers.py b/apps/backend/tests/unit/test_launchers.py
@@ -74,6 +74,17 @@ def test_build_vllm_cli_args_requires_model_tag():
         build_vllm_cli_args({"dtype": "float16"})
 
 
+def test_routing_strategy_not_passed_to_vllm():
+    # routing_strategy is a router-only knob riding the shared model_config; it
+    # must never reach `vllm serve` (vLLM errors on the unknown arg).
+    args = build_vllm_cli_args(
+        {"model_tag": "org/m", "dtype": "float16", "routing_strategy": "session_affinity"}
+    )
+    assert "--routing-strategy" not in args
+    assert "session_affinity" not in args
+    assert "--dtype" in args  # other flags still pass through
+
+
 def test_build_vllm_cli_args_lora_modules_multi_value():
     args = build_vllm_cli_args(
         {