LLMSystems · milk333445 · Jun 19, 2026
diff --git a/README.md b/README.md
diff --git a/README_zh-CN.md b/README_zh-CN.md
diff --git a/apps/backend/app/main.py b/apps/backend/app/main.py
@@ -145,7 +145,7 @@ async def lifespan(app: FastAPI):
 
 
 def create_app() -> FastAPI:
-    app = FastAPI(title="LLM Router Dashboard Backend", lifespan=lifespan)
+    app = FastAPI(title="vLLMux Backend", lifespan=lifespan)
     app.add_middleware(
         CORSMiddleware,
         allow_origins=["*"],

diff --git a/apps/frontend_llmops/index.html b/apps/frontend_llmops/index.html
@@ -4,7 +4,7 @@
     <meta charset="UTF-8">
     <link rel="icon" href="/favicon.ico">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Vite App</title>
+    <title>vLLMux</title>
   </head>
   <body>
     <div id="app"></div>

diff --git a/apps/frontend_llmops/src/components/layout/AppSidebar.vue b/apps/frontend_llmops/src/components/layout/AppSidebar.vue
@@ -80,7 +80,7 @@ const nav = [
         <Server class="size-4.5" />
       </div>
       <div class="leading-tight">
-        <p class="text-sm font-semibold">LLMOps</p>
+        <p class="text-sm font-semibold">vLLMux</p>
         <p class="text-[10px] uppercase tracking-widest text-muted-foreground">控制台</p>
       </div>
     </div>

diff --git a/assets/image0.png b/assets/image0.png
diff --git a/assets/image1.png b/assets/image1.png
diff --git a/assets/image2.png b/assets/image2.png
diff --git a/assets/image3.png b/assets/image3.png
diff --git a/assets/image4.png b/assets/image4.png
diff --git a/docs/REFACTOR_PLAN.md b/docs/REFACTOR_PLAN.md
@@ -1,4 +1,4 @@
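-# LLM-Router-Server-Dashboard Refactoring Plan (Monorepo)
+# vLLMux Refactoring Plan (Monorepo)
 
 > Status: **Executed (Phases 0–4 complete)**. The three sub-projects have been reorganized into a proper monorepo of `apps/` + `packages/` + `deploy/`.
 > The backend has been layered and its config consolidated, the frontend now has services/stores/router/views, and the shared config-schema is live,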

diff --git a/docs/configuration.md b/docs/configuration.md
@@ -0,0 +1,93 @@
+# Configuration
+
+> [中文](configuration_zh-CN.md)
+
+The configuration file lives at `packages/config-schema/config.yaml` — the single
+source of truth, validated by `packages/config-schema/schema.py`, and read by the
+frontend, backend, and router alike. It controls all model startup parameters.
+
+> You usually **don't** edit this by hand: add models from the UI by pasting a
+> `vllm serve …` command, which is layered on as a dynamic overlay. Edit `config.yaml`
+> only for the canonical, hand-maintained fleet.
+
+## `config.yaml` structure
+
+```yaml
+# Router server configuration
+server:
+  host: "0.0.0.0"
+  port: 8887
+  uvicorn_log_level: "info"
+
+# LLM model configuration
+LLM_engines:
+  Qwen3-0.6B:
+    instances:
+      - id: "qwen3"
+        host: "localhost"
+        port: 8002
+        cuda_device: 0
+      - id: "qwen3-2"
+        host: "localhost"
+        port: 8004
+        cuda_device: 0
+
+    model_config:
+      model_tag: "Qwen/Qwen3-0.6B"
+      dtype: "float16"
+      max_model_len: 500
+      gpu_memory_utilization: 0.35
+      tensor_parallel_size: 1
+
+# Embedding server configuration (optional)
+embedding_server:
+  host: "localhost"
+  port: 8005
+  cuda_device: 1
+
+  embedding_models:
+    m3e-base:
+      model_name: "moka-ai/m3e-base"
+      model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
+      tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+
+  reranking_models:
+    bge-reranker-large:
+      model_name: "BAAI/bge-reranker-large"
+      model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
+      tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+```
+
+## Key parameters
+
+| Parameter | Description | Recommended |
+|------|------|--------|
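+| `gpu_memory_utilization` | Fraction of GPU memory each instance may claim | 0.6–0.9 on a dedicated GPU; lower when instances share a GPU (the example uses 0.35 for two instances on GPU 0) |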
+| `max_model_len` | Maximum context length | Based on model capability |
+| `tensor_parallel_size` | Number of GPUs a single instance is sharded across | 1 on a single GPU; otherwise the number of GPUs given to that instance |
+| `dtype` | Inference precision | float16 (widest GPU support) / bfloat16 (wider dynamic range, more stable; needs an Ampere-class or newer GPU) |
+| `cuda_device` | GPU device number | 0, 1, 2… |
+
+## Running multiple models at once
+
+Multiple models can run at once, as long as they fit in GPU memory. A **VRAM pre-flight
+guard** blocks any start that would overflow the target GPU (override it per start with
+*Force start*), and instances without a pinned `cuda_device` are **auto-placed** on the
+GPU with the most free memory. On a single small GPU you'll typically run one mid-size
+model alongside a few small ones. Models are started on demand, so a large fleet can be
+configured without every model running at once.
+
+Tune the guard and the restart policy through environment variables on the backend:
+
+| Variable | Purpose |
+|---|---|
+| `LLMOPS_VRAM_GUARD` | Enable/disable the VRAM pre-flight guard |
+| `LLMOPS_AUTO_RESTART` | Auto-restart a crashed managed model |
+| `LLMOPS_MAX_RESTARTS` | Restart budget before giving up |
+| `LLMOPS_RESTART_BACKOFF` | Exponential backoff base |
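
Review note on the "Running multiple models at once" section: the guard and auto-placement rules it describes can be pictured with the minimal sketch below. It is an illustration only, not the backend's implementation. `Gpu`, `VramGuardError`, and `place_instance` are hypothetical names, and the free-memory figures are made up rather than read from a driver.

```python
from dataclasses import dataclass
from typing import Optional


class VramGuardError(RuntimeError):
    """Raised when a start would overflow the target GPU (hypothetical name)."""


@dataclass(frozen=True)
class Gpu:
    index: int     # the value that would go into `cuda_device`
    free_mib: int  # free memory in MiB; illustrative numbers below


def place_instance(
    gpus: list[Gpu],
    required_mib: int,
    cuda_device: Optional[int] = None,
    force: bool = False,
) -> int:
    """Return the GPU index to start on, or raise if the guard blocks the start."""
    if not gpus:
        raise VramGuardError("no GPUs visible")
    by_index = {g.index: g for g in gpus}
    if cuda_device is not None:
        # Pinned instance: only the requested GPU is considered.
        if cuda_device not in by_index:
            raise VramGuardError(f"cuda_device {cuda_device} not found")
        target = by_index[cuda_device]
    else:
        # Auto-placement: pick the GPU with the most free memory.
        target = max(gpus, key=lambda g: g.free_mib)
    # Pre-flight guard: refuse a start that does not fit, unless forced.
    if required_mib > target.free_mib and not force:
        raise VramGuardError(
            f"needs {required_mib} MiB but GPU {target.index} "
            f"has only {target.free_mib} MiB free"
        )
    return target.index


gpus = [Gpu(index=0, free_mib=6_000), Gpu(index=1, free_mib=20_000)]
auto = place_instance(gpus, required_mib=8_000)                      # unpinned
forced = place_instance(gpus, 8_000, cuda_device=0, force=True)      # Force start
```

Here the unpinned instance lands on GPU 1, the forced start is allowed on GPU 0 even though it does not fit, and the same pinned start without `force=True` raises `VramGuardError`.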
diff --git a/docs/configuration_zh-CN.md b/docs/configuration_zh-CN.md
@@ -0,0 +1,89 @@
+# 配置說明
+
+> [English](configuration.md)
+
+配置文件位於 `packages/config-schema/config.yaml`——單一來源，由
+`packages/config-schema/schema.py` 驗證，前端、後端與 router 都讀同一份。它控制所有
+模型的啟動參數。
+
+> 通常**不需要**手動編輯：從前端貼上 `vllm serve …` 指令新增模型，會以動態 overlay 疊加。
+> 只有要維護「正式、手寫」的模型清單時才改 `config.yaml`。
+
+## `config.yaml` 結構
+
+```yaml
+# 路由服務器配置
+server:
+  host: "0.0.0.0"
+  port: 8887
+  uvicorn_log_level: "info"
+
+# LLM 模型配置
+LLM_engines:
+  Qwen3-0.6B:
+    instances:
+      - id: "qwen3"
+        host: "localhost"
+        port: 8002
+        cuda_device: 0
+      - id: "qwen3-2"
+        host: "localhost"
+        port: 8004
+        cuda_device: 0
+
+    model_config:
+      model_tag: "Qwen/Qwen3-0.6B"
+      dtype: "float16"
+      max_model_len: 500
+      gpu_memory_utilization: 0.35
+      tensor_parallel_size: 1
+
+# Embedding 服務器配置（可選）
+embedding_server:
+  host: "localhost"
+  port: 8005
+  cuda_device: 1
+
+  embedding_models:
+    m3e-base:
+      model_name: "moka-ai/m3e-base"
+      model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
+      tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+
+  reranking_models:
+    bge-reranker-large:
+      model_name: "BAAI/bge-reranker-large"
+      model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
+      tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+```
+
+## 關鍵參數
+
+| 參數 | 說明 | 建議值 |
+|------|------|--------|
+| `gpu_memory_utilization` | GPU 記憶體使用比例 | 0.6–0.9 |
+| `max_model_len` | 最大上下文長度 | 依模型能力 |
+| `tensor_parallel_size` | 多 GPU 並行數 | GPU 數量 |
+| `dtype` | 推理精度 | float16（速度快） / bfloat16（更穩定） |
+| `cuda_device` | GPU 設備編號 | 0, 1, 2… |
+
+## 同時啟動多個模型
+
+可以——只要顯存放得下。**VRAM 預檢防呆**會擋下會撐爆目標 GPU 的啟動（可用 *Force start*
+逐次覆寫），未指定 `cuda_device` 的實例會**自動擺放**到剩餘顯存最多的 GPU。單張小卡通常
+能跑一顆中型模型加幾顆小模型；模型是按需啟動的，所以可以設定一大批而不必全部同時運行。
+
+可在後端用環境變數調整防呆／重啟策略：
+
+| 環境變數 | 用途 |
+|---|---|
+| `LLMOPS_VRAM_GUARD` | 啟用／關閉 VRAM 預檢防呆 |
+| `LLMOPS_AUTO_RESTART` | 崩潰的 managed 模型自動重啟 |
+| `LLMOPS_MAX_RESTARTS` | 放棄前的重啟次數上限 |
+| `LLMOPS_RESTART_BACKOFF` | 指數退避基數 |
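
Review note on the `config.yaml` example shown in both language versions: two instances of one engine are pinned to `cuda_device: 0`, each with `gpu_memory_utilization: 0.35`. A quick sanity check of such a layout could look like the sketch below. It assumes the YAML has already been parsed into a plain dict, and `check_engines` is a hypothetical helper, not part of `schema.py`. It ignores `tensor_parallel_size` greater than 1, where one instance spans several GPUs.

```python
from collections import defaultdict

# The LLM_engines part of the example config.yaml, written as a plain dict.
config = {
    "LLM_engines": {
        "Qwen3-0.6B": {
            "instances": [
                {"id": "qwen3", "host": "localhost", "port": 8002, "cuda_device": 0},
                {"id": "qwen3-2", "host": "localhost", "port": 8004, "cuda_device": 0},
            ],
            "model_config": {
                "model_tag": "Qwen/Qwen3-0.6B",
                "gpu_memory_utilization": 0.35,
                "tensor_parallel_size": 1,
            },
        }
    }
}


def check_engines(cfg: dict) -> tuple[dict[int, float], list[tuple[str, int]]]:
    """Return (summed memory fraction per pinned GPU, duplicated host/port pairs)."""
    per_gpu: dict[int, float] = defaultdict(float)
    seen: set[tuple[str, int]] = set()
    duplicates: list[tuple[str, int]] = []
    for engine in cfg["LLM_engines"].values():
        fraction = engine["model_config"]["gpu_memory_utilization"]
        for inst in engine["instances"]:
            endpoint = (inst["host"], inst["port"])
            if endpoint in seen:
                duplicates.append(endpoint)
            seen.add(endpoint)
            device = inst.get("cuda_device")
            if device is not None:  # unpinned instances are placed at start time
                per_gpu[device] += fraction
    return dict(per_gpu), duplicates


usage, duplicate_endpoints = check_engines(config)
# A GPU whose pinned instances claim more than 100% cannot host all of them at once.
overcommitted = [gpu for gpu, total in usage.items() if total > 1.0]
```

For the example file this reports roughly 0.70 of GPU 0 claimed, no duplicated endpoints, and no overcommitted GPU.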
diff --git a/docs/deployment.md b/docs/deployment.md
@@ -0,0 +1,115 @@
+# Deployment & Topology
+
+> [中文](deployment_zh-CN.md)
+
+The whole stack — dashboard backend, LLM router, Prometheus, Grafana, the GPU/host
+exporters, and the Vue frontend — is built and started by a single Compose file.
+It requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in
+Docker Desktop).
+
+```bash
+cp deploy/.env.example deploy/.env   # set HF_TOKEN, which GPUs, the admin token
+make up                              # docker compose -f deploy/docker-compose.yaml up -d --build
+# open http://localhost:8884
+```
+
+`make down` stops it, `make logs` tails all services, `make ps` shows status.
+
+## Services
+
+See [`deploy/docker-compose.yaml`](../deploy/docker-compose.yaml).
+
+| Service          | Image                  | Port    | Role |
+|------------------|------------------------|---------|------|
+| `backend`        | `llmops-engine` (GPU)  | 5000    | Dashboard API; spawns vLLM subprocesses on `:800x` |
+| `router`         | `llmops-engine`        | 8887    | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
+| `prometheus`     | `prom/prometheus`      | 9090    | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances |
+| `grafana`        | `grafana/grafana`      | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx |
+| `dcgm-exporter`  | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) |
+| `node-exporter`  | `prom/node-exporter`   | 9100    | Host metrics (CPU, RAM, disk, network) |
+| `frontend`       | `llmops-frontend`      | 8884    | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana |
+
+### Why one image, multiple services on one netns
+
+Only the backend needs vLLM itself, because it launches the subprocesses, but the
+router and Prometheus must reach those subprocesses on `localhost`. A single
+[`engine.Dockerfile`](../deploy/engine.Dockerfile) (based on the official
+`vllm/vllm-openai` image) therefore backs both `backend` and `router`, and `router`
+and `prometheus` join the backend's network namespace via `network_mode: service:backend`.
+
+The frontend reaches the backend, router, and Grafana through nginx on a single
+origin, so no host/port is baked into the build.
+
+### Persistence
+
+- SQLite + the dynamic-model overlay → `llmops-data` named volume
+- Prometheus TSDB → `prometheus-data`; Grafana state → `grafana-data`
+- Model **weights** are bind-mounted from the host HF cache (`HF_CACHE_DIR`, default
+  `~/.cache/huggingface`) so they're browsable locally and shared with host-side tools
+- `packages/config-schema/config.yaml` is bind-mounted too, so you can edit models
+  without rebuilding
+
+> **Model lifecycle**: the router only routes and load-balances — it never launches
+> models. vLLM instances (and the Embedding/Reranker server) are owned by the backend
+> and started on demand from the **Models** page (or `POST /api/models/{key}/start`).
+> The backend and router both merge the dynamic-model overlay at startup, so models
+> added from the UI survive restarts.
+
+### Verify
+
+```bash
+curl http://localhost:8887/v1/models     # router: configured model groups
+curl http://localhost:5000/api/models    # backend: lifecycle state of each instance
+```
+
+## Frontend (Web dashboard)
+
+The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript,
+Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the
+topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.)
+
+```bash
+cd apps/frontend_llmops
+npm install
+npm run dev          # http://localhost:5173
+npm run build        # production build → dist/
+```
+
+Configuration — `apps/frontend_llmops/.env`:
+
+```env
+VITE_API_BASE_URL=http://localhost:5000        # Dashboard backend (lifecycle, telemetry)
+VITE_ROUTER_BASE_URL=http://localhost:8887     # LLM Router (inference + /metrics + /reload)
+```
+
+### Authentication
+
+Authentication is backend-driven (not a build-time password). Set
+`LLMOPS_ADMIN_TOKEN` on the backend + router to gate every control action (start /
+stop / add / edit / remove + API-key management); the UI prompts for the token once
+and reuses it for the session. Set `LLMOPS_REQUIRE_API_KEY=true` on the router to
+require a bearer token (the admin token, or an API key minted on the **API Keys**
+page) for all `/v1/*` inference. Both default to off for local dev.
+
+## Manual / development run
+
+Run the backend and router yourself, with Python deps installed in the repo-root `.venv` (the frontend dev server is covered under *Frontend* above):
+
+```bash
+# Dashboard backend (:5000)
+cd apps/backend && pip install -r requirements.txt
+uvicorn app.main:app --host 0.0.0.0 --port 5000   # module path per apps/backend/app/main.py
+
+# LLM router (:8887), in a second terminal from the repo root; see apps/router-server/README.md for details
+cd apps/router-server && pip install -r requirements.txt
+sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py
+```
+
+Use `packages/config-schema/config.yaml` as the single source of truth so the
+frontend, backend, and router all read the same configuration.
+
+## Requirements
+
+- **GPU**: NVIDIA GPU (CUDA 13.1+ recommended)
+- **Memory**: 16GB+ RAM (depending on model size)
+- **Disk**: 50GB+ available space
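
Review note on the "Verify" and "Authentication" sections: once `LLMOPS_REQUIRE_API_KEY=true` is set, the plain `curl` checks against `/v1/*` need a bearer token. The sketch below shows one way to build such requests with the standard library. `build_request` and `fetch_json` are hypothetical helpers and `example-token` is a placeholder. Sending the admin token to the backend as a bearer header is an assumption; the doc only states the bearer scheme for the router's `/v1/*` routes.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, adding a bearer token only when one is supplied."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(url: str, token: Optional[str] = None, timeout: float = 5.0) -> Any:
    """Send the request and decode the JSON body (requires the stack to be running)."""
    with urllib.request.urlopen(build_request(url, token), timeout=timeout) as resp:
        return json.load(resp)


# Building the request does not touch the network, so it can be inspected offline.
req = build_request("http://localhost:8887/v1/models", token="example-token")
```

With the stack up, `fetch_json("http://localhost:8887/v1/models", token=...)` would correspond to the first `curl` check, and omitting `token` matches the local-dev default where both settings are off.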