diff --git a/README.md b/README.md
index 600e743..91910e5 100644
--- a/README.md
+++ b/README.md
@@ -1,297 +1,110 @@
 <div align="center">
 
-# LLM-Router-Server-Dashboard
-**One-Stop LLM Model Management and Monitoring Platform**
+# vLLMux
 
-[English](README.md) | [中文](README_zh-CN.md)
+**One-stop platform to deploy, route, monitor & evaluate your vLLM cluster**
 
-![Main Console](assets/image0.png)
-
-![Model Management](assets/image1.png)
+[English](README.md) · [中文](README_zh-CN.md)
 
+![vLLMux](https://img.shields.io/badge/vLLM-multiplexed-5b8def)
+![license](https://img.shields.io/badge/license-MIT-green)
+![stack](https://img.shields.io/badge/FastAPI%20·%20Vue%203%20·%20Grafana-informational)
 
-![Model Management](assets/image2.png)
-
-![Model Management](assets/image3.png)
-![Model Management](assets/image4.png)
+![Main Console](assets/image0.png)
+![Main Console](assets/image1.png)
+![Main Console](assets/image2.png)
 
 </div>
 
 ---
 
-## Project Overview
-
-**LLM-Router-Server-Dashboard** is a solution for large language model (LLM) deployment and management, providing an intuitive web interface to manage, monitor, and operate multiple LLM model instances.
-
-This project combines a routing server (LLM-Router-Server) with an easy-to-use management interface, enabling you to:
-- **Visual Management**: Easily manage multiple models through a web interface
-- **Dynamic Control**: Start and stop models in real-time without service restarts
-- **Real-time Monitoring**: Monitor model status, GPU utilization, and system information
-- **Configuration Management**: Flexibly manage model parameters through YAML configuration files
-
----
-
-## Key Features
-
-### Model Management
-- Multi-model, multi-instance management on vLLM (LLM, Embedding, Reranker)
-- Per-instance lifecycle (start/stop) with a live state machine (`stopped → starting → ready → failed/stopping`), driven by a reconciler that derives the true state from process liveness + `/health` probes
-- **Add models from the UI by pasting a `vllm serve …` command** — it is parsed into an editable form and layered on as a dynamic *overlay*, so the hand-maintained `config.yaml` stays untouched; the router hot-reloads (`POST /reload`) so new models are routable end-to-end
-- Load-aware routing: the router auto-selects the least-loaded instance (weighting running / waiting requests + KV-cache usage)
-
-### Reliability
-- **VRAM pre-flight guard** — blocks a start that would likely OOM, with a one-click *Force start* override
-- **GPU auto-placement** — an instance with no pinned `cuda_device` is placed on the GPU with the most free memory
-- **Auto-restart** — a managed model that crashes is restarted with exponential backoff (configurable budget, resets once healthy)
-
-### Observability
-- Real-time status via Server-Sent Events (no polling)
-- **System topology** (Vue Flow) — a live mission-control graph of Clients → Router → model groups / Embedding → GPUs, with animated traffic edges, GPU-placement edges, and a control plane; nodes are clickable drill-ins
-- **Router load-balancing view** — an animated fan showing each replica's real traffic share and the instance the router will pick next
-- **Grafana monitoring** (bundled) — Prometheus auto-discovers every running vLLM instance (file-based service discovery written by the backend as models start/stop) and scrapes its `/metrics`, alongside GPU (DCGM) and host (node-exporter) metrics. Grafana dashboards — **Overview** (single pane: health, latency SLO, capacity, GPU/host), **Scheduling & Capacity**, vLLM Performance/Query, GPU, Host — are embedded in the **Monitoring** tab, with SLO threshold lines, model-lifecycle annotations, and alert rules. See [Monitoring (Grafana)](#monitoring-grafana)
-- Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a state-transition event timeline
-- GPU / CPU / memory monitoring plus a GPU-process inventory
+**vLLMux** is a self-hosted control plane for serving many LLMs on
+[vLLM](https://github.com/vllm-project/vllm). Paste a `vllm serve …` command and it
+becomes a routable model; the router load-balances across instances; and a bundled Prometheus + Grafana stack monitors everything — all behind one Vue dashboard.
 
-### Playground
-- OpenAI-compatible **chat (streaming)**, completions, **embeddings**, and **reranking**, sent straight through the router
-- **Reasoning ("thinking") display** — when a model runs with a vLLM reasoning parser, the `reasoning` stream is shown in a collapsible *思考過程* block above the answer
+## Highlights
 
-### Benchmarking & Evaluation (evalscope)
-- **Load testing** (`/benchmark`) — concurrency sweep, arrival-rate open-loop, multi-turn, **SLA auto-tune**, plus **embedding / rerank** throughput and single-request **speed benchmark**; each run is an isolated subprocess, with live charts, run comparison, and the full evalscope HTML report
-- **Accuracy / quality evaluation** (`/eval`) — **30+ benchmark datasets** grouped by capability tier (Baseline, Knowledge, Chinese, Reasoning, Math, Multilingual, **Tool-calling**, **Long-context**, Code, and judge-scored QA): MMLU/ARC/GSM8K/IFEval, C-Eval/C-MMLU, GPQA/MMLU-Pro, AIME, HumanEval, ToolBench/General-FunctionCall, Needle-in-a-Haystack, …
-  - Per-dataset scores, a **run-to-run comparison matrix** (highlights the best per dataset), and the interactive HTML report
-  - **LLM-as-judge** for free-form QA — pick one of your own deployed models (via the router) or an external OpenAI-compatible API
-  - **Advanced `dataset_args`** — few-shot count + raw per-dataset overrides (subset selection, etc.)
-  - Sanity guards: judge-scored datasets require a judge; long-context and real tool-calling datasets warn about their model prerequisites (large `max_model_len`, vLLM tool parser)
+- **Add a model by pasting `vllm serve …`** — parsed into a form and layered on as a dynamic overlay; the router hot-reloads, no `config.yaml` edits.
+- **Lifecycle + self-healing** — per-instance state machine (`stopped → starting → ready → failed`), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with backoff.
+- **Load-aware routing** — picks the least-loaded replica (running/waiting requests + KV-cache usage).
+- **Live observability** — SSE status, animated system-topology & router-balancing graphs, per-model usage / latency / error stats.
+- **Bundled Grafana monitoring** — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO thresholds & alerts.
+- **Playground** — OpenAI-compatible chat (streaming) / completions / embeddings / reranking, with reasoning display.
+- **Benchmark & evaluate** — evalscope load tests (concurrency, arrival-rate, SLA auto-tune) plus 30+ accuracy datasets with LLM-as-judge.
+- **Libraries** — browse / pre-download HF model weights & datasets from the UI; tool-calling parser helper; LoRA support.
+- **Secure by config** — admin-token-gated controls, plus mint/revoke API keys with per-key usage attribution.
 
-### Libraries
-- **Model library** (`/library`) — scan / pre-download / delete HF model weights from the UI, with live download progress
-- **Dataset library** (`/datasets`) — pre-download load-test and evaluation datasets into the shared ModelScope cache so a run never stalls on a first-time download
-- **Tool-calling config helper** — the model editor maps model families to the right vLLM `tool_call_parser` (Qwen→`hermes`, Qwen3-Coder→`qwen3_xml`, Llama→`llama3_json`/`llama4_pythonic`, …) with one-click preset insertion (see `docs/vllm_auto_tool_整理.md`)
+See [docs/features.md](docs/features.md) for the full breakdown.
 
-### UX
-- Light / dark theme, dense "control-room" interface
-- **Admin-token-gated control** (start / stop / add / edit / remove) and
-  **API-key management** — mint/revoke keys that authenticate router inference,
-  with per-key usage attribution in the request log
+## Quick start
 
----
-
-## System Requirements
-
-### Hardware Requirements
-- **GPU**: NVIDIA GPU (CUDA 13.1+ recommended)
-- **Memory**: 16GB+ RAM (depending on model size)
-- **Disk**: 50GB+ available space
----
-
-## Quick Start
-
-### Docker Deployment (one command)
-
-The whole stack — dashboard backend, LLM router, and the Vue frontend — is built
-and started by a single Compose file. Requires Docker with the NVIDIA Container
-Toolkit (on WSL2, enable GPU support in Docker Desktop).
+Requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in
+Docker Desktop).
 
 ```bash
 cp deploy/.env.example deploy/.env   # set HF_TOKEN, which GPUs, the admin token
-make up                              # docker compose -f deploy/docker-compose.yaml up -d --build
+make up                              # build + start the whole stack
 # open http://localhost:8884
 ```
 
-`make down` stops it, `make logs` tails all services, `make ps` shows status.
-
-**Topology** (see [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)):
-
-| Service          | Image                  | Port    | Role |
-|------------------|------------------------|---------|------|
-| `backend`        | `llmops-engine` (GPU)  | 5000    | Dashboard API; spawns vLLM subprocesses on `:800x` |
-| `router`         | `llmops-engine`        | 8887    | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
-| `prometheus`     | `prom/prometheus`      | 9090    | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances |
-| `grafana`        | `grafana/grafana`      | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx |
-| `dcgm-exporter`  | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) |
-| `node-exporter`  | `prom/node-exporter`   | 9100    | Host metrics (CPU, RAM, disk, network) |
-| `frontend`       | `llmops-frontend`      | 8884    | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana |
-
-Why one image, multiple services on one netns: only the backend truly needs vLLM
-(it launches the subprocesses), and the router + Prometheus must see them on
-`localhost` — so a single [`engine.Dockerfile`](deploy/engine.Dockerfile) (based
-on the official `vllm/vllm-openai`) runs as `backend` + `router`, joined (with
-Prometheus) by `network_mode: service:backend`.
-
-The frontend reaches the backend, router, and Grafana through nginx on a single
-origin, so no host/port is baked into the build. SQLite + the dynamic-model
-overlay persist in the `llmops-data` named volume (Prometheus TSDB and Grafana
-state in `prometheus-data` / `grafana-data`); downloaded model **weights** are
-bind-mounted from the host HF cache (`HF_CACHE_DIR`, default
-`~/.cache/huggingface`) so they're browsable locally and shared with host-side
-tools. The canonical `packages/config-schema/config.yaml` is bind-mounted too, so
-you can edit models without rebuilding.
-
-> **Model lifecycle**: the router only routes and load-balances — it never
-> launches models. vLLM instances (and the Embedding/Reranker server) are owned
-> by the backend and started on demand from the **Models** page (or
-> `POST /api/models/{key}/start`). The backend and router both merge the
-> dynamic-model overlay at startup, so models added from the UI survive restarts.
-
-#### Verify
+`make down` stops it · `make logs` tails all services · `make ps` shows status.
 
 ```bash
 curl http://localhost:8887/v1/models     # router: configured model groups
 curl http://localhost:5000/api/models    # backend: lifecycle state of each instance
+# http://localhost:8884/grafana          # dashboards + alerts
 ```
 
-#### Monitoring (Grafana)
-
-The stack bundles a full **Prometheus → Grafana** pipeline, no manual setup:
-
-- The **backend** writes a Prometheus file-based service-discovery file
-  (`LLMOPS_PROMETHEUS_SD_PATH`) listing every *ready* vLLM instance, refreshed as
-  models start/stop — so a dynamic fleet is scraped with zero config edits.
-- **Prometheus** (`:9090`) scrapes those instances' `/metrics` plus
-  `dcgm-exporter` (GPU) and `node-exporter` (host).
-- **Grafana** is served single-origin at **`http://localhost:8884/grafana`**
-  (anonymous read-only; log in as `admin` / `GRAFANA_ADMIN_PASSWORD` to edit).
-  Datasource and dashboards are auto-provisioned from
-  [`deploy/grafana`](deploy/grafana): **Overview**, **vLLM Scheduling &
-  Capacity** (custom), **Performance**/**Query** (official), **GPU** (DCGM), and
-  **Host** (Node Exporter). The same dashboards are embedded in the dashboard's
-  **Monitoring** tab.
-- **Alerting**: provisioned vLLM alert rules (target down, TTFT p95, KV cache,
-  request queueing) route to a webhook contact point — set `GRAFANA_ALERT_WEBHOOK`
-  in `deploy/.env` (Slack/Discord/generic) and restart Grafana to receive them.
-
-```bash
-curl http://localhost:9090/api/v1/targets        # prometheus: scrape target health
-# open http://localhost:8884/grafana             # dashboards + alerts
+Full topology, the shared-netns rationale, volumes, and a manual run are in
+[docs/deployment.md](docs/deployment.md).
+
+## Architecture
+
+```mermaid
+flowchart LR
+    Client([Clients])
+    FE["<b>frontend</b><br/>nginx · :8884<br/>single origin"]
+    VLLM["<b>vLLM instances</b><br/>:800x"]
+    GF["<b>grafana</b><br/>/grafana"]
+    DCGM["dcgm-exporter<br/>:9400 · GPU"]
+    NODE["node-exporter<br/>:9100 · host"]
+
+    subgraph netns["shared network namespace"]
+        BE["<b>backend</b> · :5000<br/>model lifecycle"]
+        RT["<b>router</b> · :8887<br/>OpenAI-compatible LB"]
+        PR["<b>prometheus</b> · :9090"]
+    end
+
+    Client --> FE
+    FE -->|/api| BE
+    FE -->|/v1| RT
+    FE -->|/grafana| GF
+    BE -->|launch on demand| VLLM
+    RT -->|route + balance| VLLM
+    PR -->|scrape /metrics| VLLM
+    PR --> DCGM
+    PR --> NODE
+    GF -->|query| PR
 ```
 
-### Frontend (Web Dashboard)
+The **router only routes**; the **backend owns model lifecycle**. The router, backend, and
+Grafana sit behind the frontend's nginx on a single origin; backend, router, and Prometheus
+share one network namespace so the spawned vLLM instances are reachable on `localhost`.
 
-The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript, Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.)
+## Documentation
 
-#### Local development
+| Topic | |
+|---|---|
+| Deployment & topology | [docs/deployment.md](docs/deployment.md) |
+| Configuration (`config.yaml`) | [docs/configuration.md](docs/configuration.md) |
+| Features in depth | [docs/features.md](docs/features.md) |
+| Monitoring (Prometheus + Grafana) | [docs/monitoring.md](docs/monitoring.md) |
+| HTTP API | [docs/API.md](docs/API.md) |
 
-```bash
-cd apps/frontend_llmops
-npm install
-npm run dev          # http://localhost:5173
-```
-
-#### Production build
-
-```bash
-npm run build        # outputs to dist/
-```
-
-#### Configuration — `apps/frontend_llmops/.env`
-
-```env
-VITE_API_BASE_URL=http://localhost:5000        # Dashboard backend (lifecycle, telemetry)
-VITE_ROUTER_BASE_URL=http://localhost:8887     # LLM Router (inference + /metrics + /reload)
-```
-
-> **Authentication** is backend-driven (not a build-time password). Set
-> `LLMOPS_ADMIN_TOKEN` on the backend + router to gate every control action
-> (start / stop / add / edit / remove + API-key management); the UI prompts for
-> the token once and reuses it for the session. Set `LLMOPS_REQUIRE_API_KEY=true`
-> on the router to require a bearer token (the admin token, or an API key minted
-> on the **API 金鑰** page) for all `/v1/*` inference. Both default to off for
-> local dev.
-
-> **Run all three services for full functionality**: the Dashboard Backend (`:5000`), the LLM Router (`:8887`), and the model instances the backend launches on demand. The backend and router both merge the dynamic-model overlay at startup, so models added from the UI survive restarts.
-
-### Manual / development run
-
-Run the three pieces yourself (Python deps in the repo-root `.venv`):
-
-```bash
-# Dashboard backend (:5000)
-cd apps/backend && pip install -r requirements.txt
-uvicorn main:app --host 0.0.0.0 --port 5000
-
-# LLM router (:8887)  — see apps/router-server/README.md for details
-cd apps/router-server && pip install -r requirements.txt
-sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py
-```
-
-Use `packages/config-schema/config.yaml` as the single source of truth so the
-frontend, backend, and router all read the same configuration.
-
----
-
-## Configuration Guide
-
-### config.yaml Structure
-
-The configuration file is located at `packages/config-schema/config.yaml` (the single source of truth, validated by `packages/config-schema/schema.py`) and controls all model startup parameters.
-
-```yaml
-# Router server configuration
-server:
-  host: "0.0.0.0"
-  port: 8887
-  uvicorn_log_level: "info"
-
-# LLM model configuration
-LLM_engines:
-  Qwen3-0.6B:
-    instances:
-      - id: "qwen3"
-        host: "localhost"
-        port: 8002
-        cuda_device: 0
-      - id: "qwen3-2"
-        host: "localhost"
-        port: 8004
-        cuda_device: 0
-
-    model_config:
-      model_tag: "Qwen/Qwen3-0.6B"
-      dtype: "float16"
-      max_model_len: 500
-      gpu_memory_utilization: 0.35
-      tensor_parallel_size: 1
-
-# Embedding server configuration (optional)
-embedding_server:
-  host: "localhost"
-  port: 8005
-  cuda_device: 1
-  
-  embedding_models:
-    m3e-base:
-      model_name: "moka-ai/m3e-base"
-      model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
-      tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
-      max_length: 512
-      use_gpu: true
-      use_float16: true
-  
-  reranking_models:
-    bge-reranker-large:
-      model_name: "BAAI/bge-reranker-large"
-      model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
-      tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
-      max_length: 512
-      use_gpu: true
-      use_float16: true
-```
-
-### Key Parameter Descriptions
-
-| Parameter | Description | Recommended Value |
-|------|------|--------|
-| `gpu_memory_utilization` | GPU memory usage ratio | 0.6-0.9 |
-| `max_model_len` | Maximum context length | Based on model capability |
-| `tensor_parallel_size` | Multi-GPU parallelism count | Number of GPUs |
-| `dtype` | Inference precision | float16 (faster) / bfloat16 (more stable) |
-| `cuda_device` | GPU device number | 0, 1, 2... |
-
----
+## Requirements
 
-### Q4: Can I run multiple models at once?
+NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk.
 
-Yes — as long as they fit in GPU memory. A **VRAM pre-flight guard** blocks a start that would overflow the target GPU (override per-start with *Force start*), and instances without a pinned `cuda_device` are **auto-placed** on the GPU with the most free memory. On a single small GPU you'll typically run one mid-size model alongside a few small ones; models are started on demand, so a large fleet can be configured without all running at once.
+## License
 
-Tune the guard / restart policy via env on the backend: `LLMOPS_VRAM_GUARD`, `LLMOPS_AUTO_RESTART`, `LLMOPS_MAX_RESTARTS`, `LLMOPS_RESTART_BACKOFF`.
\ No newline at end of file
+MIT — see [LICENSE](LICENSE).
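
Note on the "Load-aware routing" highlight in the README hunk above: the README says the router picks the least-loaded replica from running/waiting requests plus KV-cache usage, but it does not give the scoring formula. The following is a minimal Python sketch of one way such a score could be computed. The `ReplicaLoad` type, the `load_score` and `pick_replica` functions, and the weights are all hypothetical illustrations, not the router's actual code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReplicaLoad:
    """Snapshot of one replica's load (field names are assumptions)."""
    instance_id: str
    running: int           # requests currently being executed
    waiting: int           # requests queued behind them
    kv_cache_usage: float  # fraction of KV cache in use, 0.0 to 1.0


def load_score(r: ReplicaLoad,
               w_running: float = 1.0,
               w_waiting: float = 2.0,
               w_kv: float = 10.0) -> float:
    # Weighted sum; the weights are illustrative, not taken from the project.
    return w_running * r.running + w_waiting * r.waiting + w_kv * r.kv_cache_usage


def pick_replica(replicas: Sequence[ReplicaLoad]) -> ReplicaLoad:
    """Return the replica with the lowest load score."""
    if not replicas:
        raise ValueError("no ready replicas to route to")
    return min(replicas, key=load_score)


replicas = [
    ReplicaLoad("a", running=3, waiting=0, kv_cache_usage=0.5),  # score 8.0
    ReplicaLoad("b", running=1, waiting=2, kv_cache_usage=0.2),  # score 7.0
    ReplicaLoad("c", running=0, waiting=5, kv_cache_usage=0.1),  # score 11.0
]
chosen = pick_replica(replicas)
```

With these assumed weights, a queued request counts for more than a running one, so replica `c` loses despite having nothing running. A real router would tune the weights against observed latency.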
diff --git a/README_zh-CN.md b/README_zh-CN.md
index c6f34db..733d4ea 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -1,283 +1,110 @@
 <div align="center">
 
-# LLM-Router-Server-Dashboard
-**一站式 LLM 模型管理與監控平台**
+# vLLMux
 
-![Main Console](assets/image0.png)
+**一站式部署、路由、監控與評測你的 vLLM 集群**
 
-![Model Management](assets/image1.png)
+[English](README.md) · [中文](README_zh-CN.md)
 
+![vLLMux](https://img.shields.io/badge/vLLM-multiplexed-5b8def)
+![license](https://img.shields.io/badge/license-MIT-green)
+![stack](https://img.shields.io/badge/FastAPI%20·%20Vue%203%20·%20Grafana-informational)
 
-![Model Management](assets/image2.png)
-![Model Management](assets/image3.png)
-![Model Management](assets/image4.png)
+![Main Console](assets/image0.png)
+![Main Console](assets/image1.png)
+![Main Console](assets/image2.png)
 
 </div>
 
 ---
 
-## 專案簡介
-
-**LLM-Router-Server-Dashboard** 是一個針對大型語言模型（LLM）部署與管理的解決方案，提供直觀的 Web 界面來管理、監控和操作多個 LLM 模型實例。
-
-本專案結合路由伺服器（LLM-Router-Server）與易用的管理界面，讓您能夠：
-- **視覺化管理**：透過 Web 界面輕鬆管理多個模型
-- **動態啟停**：即時啟動、停止模型，無需重啟服務
-- **即時監控**：監控模型狀態、GPU 使用率、系統資訊
-- **配置管理**：透過 YAML 配置文件靈活管理模型參數
-
----
-
-## 功能特色
-
-### 模型管理
-- 基於 vLLM 的多模型、多實例管理（LLM、Embedding、Reranker）
-- 每個實例獨立的生命週期（啟動/停止），具即時狀態機（`stopped → starting → ready → failed/stopping`），由 reconciler 從「進程存活 + `/health` 探測」推導真實狀態
-- **在前端貼上 `vllm serve …` 指令即可新增模型** — 解析成可編輯表單，以動態 *overlay* 疊加，**不動手寫的 `config.yaml`**；router 會熱重載（`POST /reload`），新模型端到端可被路由
-- 負載感知路由：router 自動選擇負載最低的實例（依運行中／等待中請求 + KV 快取使用率加權）
+**vLLMux** 是一個自架的控制平台，用於在
+[vLLM](https://github.com/vllm-project/vllm) 上服務多個 LLM。貼上 `vllm serve …` 指令，它就成為可路由的模型；router 在各實例間做負載平衡；
+內建的 Prometheus + Grafana 監控一切，全都在同一個 Vue 控制台之後。
 
-### 可靠性
-- **VRAM 預檢防呆** — 啟動前估算顯存，可能 OOM 就擋下，並提供一鍵 *Force start* 覆寫
-- **GPU 自動擺放** — 未指定 `cuda_device` 的實例會自動擺到剩餘顯存最多的 GPU
-- **失敗自動重啟** — managed 模型崩潰後以指數退避自動重啟（可設次數，恢復健康後重置）
 
-### 觀測性
-- 透過 Server-Sent Events 即時更新狀態（免輪詢）
-- **系統拓撲圖**（Vue Flow）— Clients → Router → 模型群組／Embedding → GPU 的即時 mission-control 圖，含流動的流量邊、GPU 擺放邊與控制平面；節點可點擊下鑽
-- **Router 負載平衡視圖** — 動畫扇形圖呈現每個副本的實際流量佔比，以及 router 下一個會選的實例
-- **Grafana 監控**（內建）— Prometheus 自動發現每個運行中的 vLLM 實例（後端隨模型啟停寫出 file-based service discovery）並抓取其 `/metrics`，外加 GPU（DCGM）與主機（node-exporter）指標。Grafana dashboards —— **總覽**（單一頁面：健康、延遲 SLO、容量、GPU/主機）、**排程與容量**、vLLM Performance/Query、GPU、Host —— 嵌入 **監控** 分頁，含 SLO 門檻線、模型生命週期標註與告警規則。見 [監控（Grafana）](#監控grafana)
-- 每模型用量（次數、錯誤率、p50/p95 延遲、tokens）、請求日誌、狀態轉移事件時間軸
-- GPU／CPU／記憶體監控，以及 GPU 進程清單
+## 功能亮點
 
-### Playground
-- OpenAI 相容的 **chat（串流）**、completions、**embeddings**、**reranking**，直接經由 router
-- **思考（reasoning）顯示** — 模型搭配 vLLM reasoning parser 時，`reasoning` 串流會顯示在答案上方的可摺疊「💭 思考過程」區塊
-
-### 壓測與評測（evalscope）
-- **壓測**（`/benchmark`）— 並發 sweep、到達率 open-loop、多輪、**SLA 自動調優**，以及 **embedding／rerank** 吞吐與單請求**速度基準**；每次執行為獨立子進程，含即時圖表、run 比較、完整 evalscope HTML 報告
-- **準確度／品質評測**（`/eval`）— **30+ 個基準資料集**，依能力分組（基線、知識進階、中文、推理、數學、多語言、**工具調用**、**長上下文**、程式碼、需裁判的問答）：MMLU/ARC/GSM8K/IFEval、C-Eval/C-MMLU、GPQA/MMLU-Pro、AIME、HumanEval、ToolBench/General-FunctionCall、Needle-in-a-Haystack…
-  - 每資料集分數、**run 對 run 的比較表**（每列標出最高分）、互動式 HTML 報告
-  - **裁判模型（LLM-as-judge）** 給自由問答評分 — 可選自家部署的模型（經 router）或外部 OpenAI 相容 API
-  - **進階 `dataset_args`** — few-shot 數 + 依資料集的原始覆寫（子集選擇等）
-  - 防呆：需裁判的資料集會強制設定裁判；長上下文與真實工具調用資料集會提醒模型前提（夠大的 `max_model_len`、vLLM tool parser）
-
-### 資料庫
-- **模型庫**（`/library`）— 在 UI 掃描／預下載／刪除 HF 權重，含即時下載進度
-- **資料集庫**（`/datasets`）— 預先下載壓測與評測資料集到共用 ModelScope 快取，執行時就不會卡在首次下載
-- **工具調用設定助手** — 模型編輯器把模型家族對應到正確的 vLLM `tool_call_parser`（Qwen→`hermes`、Qwen3-Coder→`qwen3_xml`、Llama→`llama3_json`/`llama4_pythonic`…），一鍵帶入（見 `docs/vllm_auto_tool_整理.md`）
-
-### 使用體驗
-- 明暗雙主題、資訊密集的「控制室」介面
-- **管理員權杖控管**控制操作（啟動／停止／新增／編輯／移除），以及 **API 金鑰管理** —
-  發行／撤銷用於 router 推理的金鑰，並在請求日誌中做 per-key 用量歸屬
-
----
-
-## 環境需求
+- **貼上 `vllm serve …` 即可新增模型** — 解析成表單、以動態 overlay 疊加；router 熱重載。
+- **生命週期** — 每實例狀態機（`stopped → starting → ready → failed`）、VRAM 預檢防呆、GPU 自動擺放、崩潰指數退避自動重啟。
+- **負載感知路由** — 自動挑負載最低的副本（運行中／等待中請求 + KV 快取使用率）。
+- **即時觀測** — SSE 狀態、動畫系統拓撲圖與 router 負載平衡圖、每模型用量／延遲／錯誤統計。
+- **內建 Grafana 監控** — Prometheus 自動發現每個運行中的實例；總覽／容量／效能／GPU／主機 dashboards 嵌入應用內，含 SLO 門檻線與告警。
+- **Playground** — OpenAI 相容的 chat（串流）／completions／embeddings／reranking。
+- **壓測與評測** — LLM 壓測（並發、到達率、SLA 自動調優）＋ 30+ 個準確度資料集與 LLM-as-judge。
+- **資料庫** — 在 UI 瀏覽／預下載 HF 權重與資料集；工具調用 parser 助手；LoRA 支援。
+- **安全性** — 管理員權杖控管操作，並可發行／撤銷帶 per-key 用量歸屬的 API 金鑰。
 
-### 硬體需求
-- **GPU**: NVIDIA GPU（建議 CUDA 13.1+）
-- **記憶體**: 16GB+ RAM（依模型大小而定）
-- **硬碟**: 50GB+ 可用空間
----
+完整說明見 [docs/features_zh-CN.md](docs/features_zh-CN.md)。
 
 ## 快速開始
 
-### Docker 一鍵部署
-
-整套服務（Dashboard 後端、LLM router、Vue 前端）由單一 Compose 檔建置並啟動。
 需要安裝 Docker 與 NVIDIA Container Toolkit（WSL2 請在 Docker Desktop 開啟 GPU 支援）。
 
 ```bash
 cp deploy/.env.example deploy/.env   # 填 HF_TOKEN、要用的 GPU、管理員權杖
-make up                              # docker compose -f deploy/docker-compose.yaml up -d --build
+make up                              # 建置並啟動整套服務
 # 瀏覽器開 http://localhost:8884
 ```
 
-`make down` 停止、`make logs` 追蹤所有服務日誌、`make ps` 看狀態。
-
-**架構**（見 [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)）：
-
-| 服務             | 映像                   | 端口  | 角色 |
-|------------------|------------------------|-------|------|
-| `backend`        | `llmops-engine`（GPU） | 5000  | Dashboard API；在 `:800x` 拉起 vLLM 子進程 |
-| `router`         | `llmops-engine`        | 8887  | OpenAI 相容路由；**共用後端的 network namespace**，才打得到那些 localhost vLLM 端口 |
-| `prometheus`     | `prom/prometheus`      | 9090  | 透過 file-based SD 抓取 vLLM 艦隊的 `/metrics`；**同樣共用後端 netns**，`localhost:800x` 才解析得到那些實例 |
-| `grafana`        | `grafana/grafana`      | （代理）| Dashboards 與告警；經前端 nginx 以單一來源代理在 `/grafana` |
-| `dcgm-exporter`  | `nvcr.io/.../dcgm-exporter`（GPU） | 9400 | NVIDIA GPU 遙測（利用率、顯存、溫度、功耗） |
-| `node-exporter`  | `prom/node-exporter`   | 9100  | 主機指標（CPU、RAM、磁碟、網路） |
-| `frontend`       | `llmops-frontend`      | 8884  | nginx 服務 SPA，並反向代理 `/api` → 後端、`/v1` → router、`/grafana` → grafana |
-
-為何一份映像、多個服務共用一個 netns：只有後端真的需要 vLLM（它負責拉起子進程），
-而 router 與 Prometheus 必須在 `localhost` 看到那些子進程——所以單一
-[`engine.Dockerfile`](deploy/engine.Dockerfile)（基於官方 `vllm/vllm-openai`）跑成
-`backend` + `router`，並（連同 Prometheus）以 `network_mode: service:backend` 串接。
-
-前端透過 nginx 以單一來源（same-origin）連到後端、router 與 Grafana，因此 build 不會
-硬編任何 host/port。SQLite 與動態模型 overlay 放在 `llmops-data` named volume（Prometheus
-TSDB 與 Grafana 狀態放在 `prometheus-data` / `grafana-data`）；下載的模型**權重**則以
-bind-mount 掛在主機 HF 快取（`HF_CACHE_DIR`，預設 `~/.cache/huggingface`），所以本機就能
-直接瀏覽、也和主機端工具共用。`packages/config-schema/config.yaml` 同樣 bind-mount 掛入，
-因此改模型不必重新 build。
-
-> **模型生命週期**：router 只負責路由與負載平衡，不會啟動模型。vLLM 實例（與
-> Embedding/Reranker 服務）由後端管理，從 **Models** 頁按需啟動（或
-> `POST /api/models/{key}/start`）。後端與 router 都會在啟動時合併動態模型 overlay，
-> 所以從前端新增的模型在重啟後仍會保留。
-
-#### 驗證
+`make down` 停止 · `make logs` 追蹤所有服務日誌 · `make ps` 看狀態。
 
 ```bash
 curl http://localhost:8887/v1/models     # router：列出設定的模型群組
 curl http://localhost:5000/api/models    # 後端：每個實例的生命週期狀態
+# http://localhost:8884/grafana          # dashboards 與告警
 ```
 
-#### 監控（Grafana）
-
-整套內建完整的 **Prometheus → Grafana** 流程，免手動設定：
-
-- **後端**寫出 Prometheus file-based service-discovery 檔（`LLMOPS_PROMETHEUS_SD_PATH`），
-  列出每個 *ready* 的 vLLM 實例，並隨模型啟停刷新——所以動態艦隊免改設定即被抓取。
-- **Prometheus**（`:9090`）抓取這些實例的 `/metrics`，外加 `dcgm-exporter`（GPU）與
-  `node-exporter`（主機）。
-- **Grafana** 以單一來源服務於 **`http://localhost:8884/grafana`**（匿名唯讀；以
-  `admin` / `GRAFANA_ADMIN_PASSWORD` 登入可編輯）。datasource 與 dashboards 由
-  [`deploy/grafana`](deploy/grafana) 自動 provision：**總覽**、**vLLM 排程與容量**（自訂）、
-  **Performance**/**Query**（官方）、**GPU**（DCGM）、**Host**（Node Exporter）。
-  同一批 dashboards 也嵌入控制台的 **監控** 分頁。
-- **告警**：已 provision 的 vLLM 告警規則（target down、TTFT p95、KV cache、請求排隊）
-  路由到一個 webhook contact point —— 在 `deploy/.env` 設 `GRAFANA_ALERT_WEBHOOK`
-  （Slack/Discord/通用）並重啟 Grafana 即可收到通知。
-
-```bash
-curl http://localhost:9090/api/v1/targets        # prometheus：scrape target 健康狀態
-# 開啟 http://localhost:8884/grafana             # dashboards 與告警
-```
-
-### 前端（Web 控制台）
-
-控制台位於 **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript、Tailwind CSS v4、shadcn-vue 元件、[Vue Flow](https://vueflow.dev)（拓撲／路由圖）、Pinia + Vue Router。（舊的 `apps/frontend` 已棄用。）
-
-#### 本地開發
-
-```bash
-cd apps/frontend_llmops
-npm install
-npm run dev          # http://localhost:5173
-```
-
-#### 生產環境建置
-
-```bash
-npm run build        # 輸出到 dist/
-```
-
-#### 設定 — `apps/frontend_llmops/.env`
-
-```env
-VITE_API_BASE_URL=http://localhost:5000        # Dashboard 後端（生命週期、遙測）
-VITE_ROUTER_BASE_URL=http://localhost:8887     # LLM Router（推理 + /metrics + /reload）
-```
-
-> **驗證**改由後端控管（不再是 build-time 密碼）。在後端與 router 設定
-> `LLMOPS_ADMIN_TOKEN` 即可鎖住所有控制操作（啟動／停止／新增／編輯／移除 +
-> 金鑰管理）；UI 會要求輸入一次 token 並在 session 內沿用。在 router 設定
-> `LLMOPS_REQUIRE_API_KEY=true` 則要求所有 `/v1/*` 推理都帶 bearer token（admin
-> token，或在 **API 金鑰** 頁建立的金鑰）。兩者預設關閉，方便本機開發。
-
-> **三個服務都要跑才完整**：Dashboard 後端（`:5000`）、LLM Router（`:8887`）、以及後端按需啟動的模型實例。後端與 router 都會在啟動時合併動態模型 overlay，所以從前端新增的模型在重啟後仍會保留。
-
-### 手動 / 開發啟動
-
-也可自行啟動三個部分（Python 依賴在 repo 根目錄的 `.venv`）：
-
-```bash
-# Dashboard 後端（:5000）
-cd apps/backend && pip install -r requirements.txt
-uvicorn main:app --host 0.0.0.0 --port 5000
-
-# LLM router（:8887）— 細節見 apps/router-server/README_zh.md
-cd apps/router-server && pip install -r requirements.txt
-sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py
+完整架構、共用 netns 的原理、volumes 與手動啟動見
+[docs/deployment_zh-CN.md](docs/deployment_zh-CN.md)。
+
+## 架構
+
+```mermaid
+flowchart LR
+    Client([Clients 用戶端])
+    FE["<b>frontend</b><br/>nginx · :8884<br/>單一來源"]
+    VLLM["<b>vLLM 實例</b><br/>:800x"]
+    GF["<b>grafana</b><br/>/grafana"]
+    DCGM["dcgm-exporter<br/>:9400 · GPU"]
+    NODE["node-exporter<br/>:9100 · 主機"]
+
+    subgraph netns["共用 network namespace"]
+        BE["<b>backend</b> · :5000<br/>模型生命週期"]
+        RT["<b>router</b> · :8887<br/>OpenAI 相容負載平衡"]
+        PR["<b>prometheus</b> · :9090"]
+    end
+
+    Client --> FE
+    FE -->|/api| BE
+    FE -->|/v1| RT
+    FE -->|/grafana| GF
+    BE -->|按需拉起| VLLM
+    RT -->|路由 + 平衡| VLLM
+    PR -->|抓取 /metrics| VLLM
+    PR --> DCGM
+    PR --> NODE
+    GF -->|查詢| PR
 ```
 
-配置文件統一使用 `packages/config-schema/config.yaml`（單一來源），確保前端、後端與
-router 讀到同一份設定。
-
----
-
-## 配置說明
-
-### config.yaml 結構
-
-配置文件位於 `packages/config-schema/config.yaml`（單一來源，由 `packages/config-schema/schema.py` 驗證），控制所有模型的啟動參數。
+**router 只負責路由**；**模型生命週期由 backend 掌管**。router、backend 與
+Grafana 都在 frontend 的 nginx 之後以單一來源對外；backend、router、Prometheus 共用一個 network
+namespace，所以被拉起的 vLLM 實例可在 `localhost` 互相連到。
 
-```yaml
-# 路由服務器配置
-server:
-  host: "0.0.0.0"
-  port: 8887
-  uvicorn_log_level: "info"
+## 文件
 
-# LLM 模型配置
-LLM_engines:
-  Qwen3-0.6B:
-    instances:
-      - id: "qwen3"
-        host: "localhost"
-        port: 8002
-        cuda_device: 0
-      - id: "qwen3-2"
-        host: "localhost"
-        port: 8004
-        cuda_device: 0
+| 主題 | |
+|---|---|
+| 部署與架構 | [docs/deployment_zh-CN.md](docs/deployment_zh-CN.md) |
+| 配置（`config.yaml`） | [docs/configuration_zh-CN.md](docs/configuration_zh-CN.md) |
+| 功能特色（詳細） | [docs/features_zh-CN.md](docs/features_zh-CN.md) |
+| 監控（Prometheus + Grafana） | [docs/monitoring_zh-CN.md](docs/monitoring_zh-CN.md) |
+| HTTP API | [docs/API.md](docs/API.md) |
 
-    model_config:
-      model_tag: "Qwen/Qwen3-0.6B"
-      dtype: "float16"
-      max_model_len: 500
-      gpu_memory_utilization: 0.35
-      tensor_parallel_size: 1
-
-# Embedding 服務器配置（可選）
-embedding_server:
-  host: "localhost"
-  port: 8005
-  cuda_device: 1
-  
-  embedding_models:
-    m3e-base:
-      model_name: "moka-ai/m3e-base"
-      model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
-      tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
-      max_length: 512
-      use_gpu: true
-      use_float16: true
-  
-  reranking_models:
-    bge-reranker-large:
-      model_name: "BAAI/bge-reranker-large"
-      model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
-      tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
-      max_length: 512
-      use_gpu: true
-      use_float16: true
-```
-
-### 關鍵參數說明
-
-| 參數 | 說明 | 建議值 |
-|------|------|--------|
-| `gpu_memory_utilization` | GPU 記憶體使用比例 | 0.6-0.9 |
-| `max_model_len` | 最大上下文長度 | 依模型能力 |
-| `tensor_parallel_size` | 多 GPU 並行數 | GPU 數量 |
-| `dtype` | 推理精度 | float16（速度快） / bfloat16（更穩定） |
-| `cuda_device` | GPU 設備編號 | 0, 1, 2... |
-
----
+## 環境需求
 
-### Q4: 可以同時啟動多個模型嗎？
+NVIDIA GPU（建議 CUDA 13.1+）· 16GB+ RAM · 50GB+ 磁碟。
 
-可以 — 只要顯存放得下。**VRAM 預檢防呆**會擋下會撐爆目標 GPU 的啟動（可用 *Force start* 逐次覆寫），未指定 `cuda_device` 的實例會**自動擺放**到剩餘顯存最多的 GPU。單張小卡通常能跑一顆中型模型加幾顆小模型；模型是按需啟動的，所以可以設定一大批而不必全部同時運行。
+## 授權
 
-可在後端用環境變數調整防呆／重啟策略：`LLMOPS_VRAM_GUARD`、`LLMOPS_AUTO_RESTART`、`LLMOPS_MAX_RESTARTS`、`LLMOPS_RESTART_BACKOFF`。
\ No newline at end of file
+MIT — 見 [LICENSE](LICENSE)。
diff --git a/apps/backend/app/main.py b/apps/backend/app/main.py
index 3d5351e..4e8ce4c 100644
--- a/apps/backend/app/main.py
+++ b/apps/backend/app/main.py
@@ -145,7 +145,7 @@ async def lifespan(app: FastAPI):
 
 
 def create_app() -> FastAPI:
-    app = FastAPI(title="LLM Router Dashboard Backend", lifespan=lifespan)
+    app = FastAPI(title="vLLMux Backend", lifespan=lifespan)
     app.add_middleware(
         CORSMiddleware,
         allow_origins=["*"],
diff --git a/apps/frontend_llmops/index.html b/apps/frontend_llmops/index.html
index 9e5fc8f..91bf23d 100644
--- a/apps/frontend_llmops/index.html
+++ b/apps/frontend_llmops/index.html
@@ -4,7 +4,7 @@
     <meta charset="UTF-8">
     <link rel="icon" href="/favicon.ico">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Vite App</title>
+    <title>vLLMux</title>
   </head>
   <body>
     <div id="app"></div>
diff --git a/apps/frontend_llmops/src/components/layout/AppSidebar.vue b/apps/frontend_llmops/src/components/layout/AppSidebar.vue
index 2822f57..6f8ebd2 100644
--- a/apps/frontend_llmops/src/components/layout/AppSidebar.vue
+++ b/apps/frontend_llmops/src/components/layout/AppSidebar.vue
@@ -80,7 +80,7 @@ const nav = [
         <Server class="size-4.5" />
       </div>
       <div class="leading-tight">
-        <p class="text-sm font-semibold">LLMOps</p>
+        <p class="text-sm font-semibold">vLLMux</p>
         <p class="text-[10px] uppercase tracking-widest text-muted-foreground">控制台</p>
       </div>
     </div>
diff --git a/assets/image0.png b/assets/image0.png
index e1b9f76..8074a84 100644
Binary files a/assets/image0.png and b/assets/image0.png differ
diff --git a/assets/image1.png b/assets/image1.png
index fc0f170..42aa1d7 100644
Binary files a/assets/image1.png and b/assets/image1.png differ
diff --git a/assets/image2.png b/assets/image2.png
index 6bf6a8c..0572a10 100644
Binary files a/assets/image2.png and b/assets/image2.png differ
diff --git a/assets/image3.png b/assets/image3.png
deleted file mode 100644
index 7b930c5..0000000
Binary files a/assets/image3.png and /dev/null differ
diff --git a/assets/image4.png b/assets/image4.png
deleted file mode 100644
index 639dc37..0000000
Binary files a/assets/image4.png and /dev/null differ
diff --git a/docs/REFACTOR_PLAN.md b/docs/REFACTOR_PLAN.md
index ea47c4b..e9660f0 100644
--- a/docs/REFACTOR_PLAN.md
+++ b/docs/REFACTOR_PLAN.md
@@ -1,4 +1,4 @@
-# LLM-Router-Server-Dashboard 重構方案（Monorepo）
+# vLLMux 重構方案（Monorepo）
 
 > 狀態：**已執行（Phase 0–4 完成）**。三個子專案已整理成 `apps/` + `packages/` + `deploy/` 的正式 Monorepo。
 > 後端已分層並收斂 config、前端已建立 services/stores/router/views、共用 config-schema 已上線，
diff --git a/docs/configuration.md b/docs/configuration.md
new file mode 100644
index 0000000..0369bb9
--- /dev/null
+++ b/docs/configuration.md
@@ -0,0 +1,93 @@
+# Configuration
+
+> [中文](configuration_zh-CN.md)
+
+The configuration file lives at `packages/config-schema/config.yaml` — the single
+source of truth, validated by `packages/config-schema/schema.py`, and read by the
+frontend, backend, and router alike. It controls all model startup parameters.
+
+> You usually **don't** edit this by hand: add models from the UI by pasting a
+> `vllm serve …` command, which is layered on as a dynamic overlay. Edit `config.yaml`
+> only for the canonical, hand-maintained fleet.
+
+## `config.yaml` structure
+
+```yaml
+# Router server configuration
+server:
+  host: "0.0.0.0"
+  port: 8887
+  uvicorn_log_level: "info"
+
+# LLM model configuration
+LLM_engines:
+  Qwen3-0.6B:
+    instances:
+      - id: "qwen3"
+        host: "localhost"
+        port: 8002
+        cuda_device: 0
+      - id: "qwen3-2"
+        host: "localhost"
+        port: 8004
+        cuda_device: 0
+
+    model_config:
+      model_tag: "Qwen/Qwen3-0.6B"
+      dtype: "float16"
+      max_model_len: 500
+      gpu_memory_utilization: 0.35
+      tensor_parallel_size: 1
+
+# Embedding server configuration (optional)
+embedding_server:
+  host: "localhost"
+  port: 8005
+  cuda_device: 1
+
+  embedding_models:
+    m3e-base:
+      model_name: "moka-ai/m3e-base"
+      model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
+      tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+
+  reranking_models:
+    bge-reranker-large:
+      model_name: "BAAI/bge-reranker-large"
+      model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
+      tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+```
+
+## Key parameters
+
+| Parameter | Description | Recommended |
+|------|------|--------|
+| `gpu_memory_utilization` | GPU memory usage ratio | 0.6–0.9 |
+| `max_model_len` | Maximum context length | Based on model capability |
+| `tensor_parallel_size` | Multi-GPU parallelism count | Number of GPUs |
+| `dtype` | Inference precision | float16 (faster) / bfloat16 (more stable) |
+| `cuda_device` | GPU device number | 0, 1, 2… |
+
+## Running multiple models at once
+
+Multiple models can run at once as long as they fit in GPU memory. A **VRAM
+pre-flight guard** blocks any start that would overflow the target GPU (override it
+per start with *Force start*), and instances without a pinned `cuda_device` are
+**auto-placed** on the GPU with the most free memory. On a single small GPU you'll
+typically run one mid-size model alongside a few small ones; models are started on
+demand, so a large fleet can be configured without all of it running at once.
+
+Tune the guard and restart policy with environment variables on the backend:
+
+| Env | Purpose |
+|---|---|
+| `LLMOPS_VRAM_GUARD` | Enable/disable the VRAM pre-flight guard |
+| `LLMOPS_AUTO_RESTART` | Auto-restart a crashed managed model |
+| `LLMOPS_MAX_RESTARTS` | Restart budget before giving up |
+| `LLMOPS_RESTART_BACKOFF` | Exponential backoff base |
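
The guard and auto-placement rules described in `docs/configuration.md` can be sketched as follows. This is a minimal illustration under stated assumptions, not the backend's actual code: `pick_gpu`, its parameters, and the MiB-based accounting are hypothetical names chosen for the example, and how the real backend estimates a model's memory need is not covered here.

```python
from typing import Optional


def pick_gpu(
    free_mib: dict[int, int],
    required_mib: int,
    pinned: Optional[int] = None,
    force: bool = False,
) -> int:
    """Return the GPU index to start on, or raise if the guard blocks the start.

    free_mib maps GPU index -> free memory in MiB (hypothetical input shape).
    """
    if not free_mib:
        raise RuntimeError("no GPUs visible")

    if pinned is not None:
        # A pinned cuda_device is respected as-is.
        if pinned not in free_mib:
            raise ValueError(f"cuda_device {pinned} is not available")
        gpu = pinned
    else:
        # Auto-placement: choose the GPU with the most free memory.
        gpu = max(free_mib, key=free_mib.get)

    # Pre-flight guard: refuse a start that would not fit, unless forced.
    if free_mib[gpu] < required_mib and not force:
        raise MemoryError(
            f"GPU {gpu} has {free_mib[gpu]} MiB free, needs {required_mib} MiB"
        )
    return gpu
```

With `free_mib={0: 4000, 1: 20000}` and a requirement of 8000 MiB, an unpinned instance lands on GPU 1, while one pinned to GPU 0 is rejected unless `force=True` is passed (the sketch's stand-in for *Force start*).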
diff --git a/docs/configuration_zh-CN.md b/docs/configuration_zh-CN.md
new file mode 100644
index 0000000..d4288c7
--- /dev/null
+++ b/docs/configuration_zh-CN.md
@@ -0,0 +1,89 @@
+# 配置說明
+
+> [English](configuration.md)
+
+配置文件位於 `packages/config-schema/config.yaml`——單一來源，由
+`packages/config-schema/schema.py` 驗證，前端、後端與 router 都讀同一份。它控制所有
+模型的啟動參數。
+
+> 通常**不需要**手動編輯：從前端貼上 `vllm serve …` 指令新增模型，會以動態 overlay 疊加。
+> 只有要維護「正式、手寫」的模型清單時才改 `config.yaml`。
+
+## `config.yaml` 結構
+
+```yaml
+# 路由服務器配置
+server:
+  host: "0.0.0.0"
+  port: 8887
+  uvicorn_log_level: "info"
+
+# LLM 模型配置
+LLM_engines:
+  Qwen3-0.6B:
+    instances:
+      - id: "qwen3"
+        host: "localhost"
+        port: 8002
+        cuda_device: 0
+      - id: "qwen3-2"
+        host: "localhost"
+        port: 8004
+        cuda_device: 0
+
+    model_config:
+      model_tag: "Qwen/Qwen3-0.6B"
+      dtype: "float16"
+      max_model_len: 500
+      gpu_memory_utilization: 0.35
+      tensor_parallel_size: 1
+
+# Embedding 服務器配置（可選）
+embedding_server:
+  host: "localhost"
+  port: 8005
+  cuda_device: 1
+
+  embedding_models:
+    m3e-base:
+      model_name: "moka-ai/m3e-base"
+      model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
+      tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+
+  reranking_models:
+    bge-reranker-large:
+      model_name: "BAAI/bge-reranker-large"
+      model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
+      tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
+      max_length: 512
+      use_gpu: true
+      use_float16: true
+```
+
+## 關鍵參數
+
+| 參數 | 說明 | 建議值 |
+|------|------|--------|
+| `gpu_memory_utilization` | GPU 記憶體使用比例 | 0.6–0.9 |
+| `max_model_len` | 最大上下文長度 | 依模型能力 |
+| `tensor_parallel_size` | 多 GPU 並行數 | GPU 數量 |
+| `dtype` | 推理精度 | float16（速度快） / bfloat16（更穩定） |
+| `cuda_device` | GPU 設備編號 | 0, 1, 2… |
+
+## 同時啟動多個模型
+
+可以——只要顯存放得下。**VRAM 預檢防呆**會擋下會撐爆目標 GPU 的啟動（可用 *Force start*
+逐次覆寫），未指定 `cuda_device` 的實例會**自動擺放**到剩餘顯存最多的 GPU。單張小卡通常
+能跑一顆中型模型加幾顆小模型；模型是按需啟動的，所以可以設定一大批而不必全部同時運行。
+
+可在後端用環境變數調整防呆／重啟策略：
+
+| 環境變數 | 用途 |
+|---|---|
+| `LLMOPS_VRAM_GUARD` | 啟用／關閉 VRAM 預檢防呆 |
+| `LLMOPS_AUTO_RESTART` | 崩潰的 managed 模型自動重啟 |
+| `LLMOPS_MAX_RESTARTS` | 放棄前的重啟次數上限 |
+| `LLMOPS_RESTART_BACKOFF` | 指數退避基數 |
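
The restart policy controlled by `LLMOPS_MAX_RESTARTS` and `LLMOPS_RESTART_BACKOFF` can be pictured with the sketch below. It assumes the backoff base is a delay in seconds that doubles on each attempt; the docs only say "exponential backoff base", so the unit, the factor of 2, and the `RestartPolicy` class itself are assumptions made for illustration.

```python
from typing import Optional


class RestartPolicy:
    """Hypothetical restart budget with exponential backoff."""

    def __init__(self, base_seconds: float = 2.0, max_restarts: int = 3) -> None:
        self.base_seconds = base_seconds
        self.max_restarts = max_restarts
        self.attempts = 0

    def next_delay(self) -> Optional[float]:
        """Return the wait before the next restart, or None once the budget is spent."""
        if self.attempts >= self.max_restarts:
            return None
        delay = self.base_seconds * (2 ** self.attempts)
        self.attempts += 1
        return delay

    def mark_healthy(self) -> None:
        """Reset the budget once the model is healthy again."""
        self.attempts = 0
```

Calling `next_delay()` repeatedly on `RestartPolicy(1.0, 3)` yields 1.0, 2.0, 4.0 and then `None`; `mark_healthy()` restores the full budget.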
diff --git a/docs/deployment.md b/docs/deployment.md
new file mode 100644
index 0000000..69aa01a
--- /dev/null
+++ b/docs/deployment.md
@@ -0,0 +1,115 @@
+# Deployment & Topology
+
+> [中文](deployment_zh-CN.md)
+
+The whole stack — dashboard backend, LLM router, Prometheus, Grafana, the GPU/host
+exporters, and the Vue frontend — is built and started by a single Compose file.
+It requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in
+Docker Desktop).
+
+```bash
+cp deploy/.env.example deploy/.env   # set HF_TOKEN, which GPUs, the admin token
+make up                              # docker compose -f deploy/docker-compose.yaml up -d --build
+# open http://localhost:8884
+```
+
+`make down` stops it, `make logs` tails all services, `make ps` shows status.
+
+## Services
+
+See [`deploy/docker-compose.yaml`](../deploy/docker-compose.yaml).
+
+| Service          | Image                  | Port    | Role |
+|------------------|------------------------|---------|------|
+| `backend`        | `llmops-engine` (GPU)  | 5000    | Dashboard API; spawns vLLM subprocesses on `:800x` |
+| `router`         | `llmops-engine`        | 8887    | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
+| `prometheus`     | `prom/prometheus`      | 9090    | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances |
+| `grafana`        | `grafana/grafana`      | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx |
+| `dcgm-exporter`  | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) |
+| `node-exporter`  | `prom/node-exporter`   | 9100    | Host metrics (CPU, RAM, disk, network) |
+| `frontend`       | `llmops-frontend`      | 8884    | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana |
+
+### Why one image, multiple services on one netns
+
+Only the backend needs vLLM itself (it launches the subprocesses), but the router and
+Prometheus must reach those subprocesses on `localhost`. A single
+[`engine.Dockerfile`](../deploy/engine.Dockerfile) (based on the official
+`vllm/vllm-openai`) therefore runs as both `backend` and `router`, and the router and
+Prometheus join the backend's network namespace via `network_mode: service:backend`.
+
+The frontend reaches the backend, router, and Grafana through nginx on a single
+origin, so no host/port is baked into the build.
+
+### Persistence
+
+- SQLite + the dynamic-model overlay → `llmops-data` named volume
+- Prometheus TSDB → `prometheus-data`; Grafana state → `grafana-data`
+- Model **weights** are bind-mounted from the host HF cache (`HF_CACHE_DIR`, default
+  `~/.cache/huggingface`) so they're browsable locally and shared with host-side tools
+- `packages/config-schema/config.yaml` is bind-mounted too, so you can edit models
+  without rebuilding
+
+> **Model lifecycle**: the router only routes and load-balances — it never launches
+> models. vLLM instances (and the Embedding/Reranker server) are owned by the backend
+> and started on demand from the **Models** page (or `POST /api/models/{key}/start`).
+> The backend and router both merge the dynamic-model overlay at startup, so models
+> added from the UI survive restarts.
+
+### Verify
+
+```bash
+curl http://localhost:8887/v1/models     # router: configured model groups
+curl http://localhost:5000/api/models    # backend: lifecycle state of each instance
+```
+
+## Frontend (Web dashboard)
+
+The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript,
+Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the
+topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.)
+
+```bash
+cd apps/frontend_llmops
+npm install
+npm run dev          # http://localhost:5173
+npm run build        # production build → dist/
+```
+
+Configuration — `apps/frontend_llmops/.env`:
+
+```env
+VITE_API_BASE_URL=http://localhost:5000        # Dashboard backend (lifecycle, telemetry)
+VITE_ROUTER_BASE_URL=http://localhost:8887     # LLM Router (inference + /metrics + /reload)
+```
+
+### Authentication
+
+Authentication is backend-driven (not a build-time password). Set
+`LLMOPS_ADMIN_TOKEN` on the backend + router to gate every control action (start /
+stop / add / edit / remove + API-key management); the UI prompts for the token once
+and reuses it for the session. Set `LLMOPS_REQUIRE_API_KEY=true` on the router to
+require a bearer token (the admin token, or an API key minted on the **API Keys**
+page) for all `/v1/*` inference. Both default to off for local dev.
+
+## Manual / development run
+
+Run the pieces yourself (Python deps in the repo-root `.venv`; the frontend dev server is covered above):
+
+```bash
+# Dashboard backend (:5000)
+cd apps/backend && pip install -r requirements.txt
+uvicorn app.main:app --host 0.0.0.0 --port 5000   # module path follows apps/backend/app/main.py
+
+# LLM router (:8887) — see apps/router-server/README.md for details
+cd apps/router-server && pip install -r requirements.txt
+sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py
+```
+
+Use `packages/config-schema/config.yaml` as the single source of truth so the
+frontend, backend, and router all read the same configuration.
+
+## Requirements
+
+- **GPU**: NVIDIA GPU (CUDA 13.1+ recommended)
+- **Memory**: 16GB+ RAM (depending on model size)
+- **Disk**: 50GB+ available space
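
The authentication rules in `docs/deployment.md` (a bearer token on `/v1/*` when `LLMOPS_REQUIRE_API_KEY=true`) can be exercised from a script. The sketch below only builds the request; `router_request` is a hypothetical helper name, and the default base URL is the router port from the services table.

```python
import urllib.request


def router_request(
    path: str,
    token: str = "",
    base_url: str = "http://localhost:8887",
) -> urllib.request.Request:
    """Build a request to the router, adding a bearer token when one is given."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(base_url.rstrip("/") + path, headers=headers)


# Usage against a running stack (not executed here):
# with urllib.request.urlopen(router_request("/v1/models", token="<your token>")) as resp:
#     print(resp.read().decode())
```

Keeping request construction separate from the network call makes the header logic easy to check without a running cluster.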
diff --git a/docs/deployment_zh-CN.md b/docs/deployment_zh-CN.md
new file mode 100644
index 0000000..8fe1e4f
--- /dev/null
+++ b/docs/deployment_zh-CN.md
@@ -0,0 +1,108 @@
+# 部署與架構
+
+> [English](deployment.md)
+
+整套服務（Dashboard 後端、LLM router、Prometheus、Grafana、GPU/主機 exporter、Vue 前端）
+由單一 Compose 檔建置並啟動。需要安裝 Docker 與 NVIDIA Container Toolkit（WSL2 請在
+Docker Desktop 開啟 GPU 支援）。
+
+```bash
+cp deploy/.env.example deploy/.env   # 填 HF_TOKEN、要用的 GPU、管理員權杖
+make up                              # docker compose -f deploy/docker-compose.yaml up -d --build
+# 瀏覽器開 http://localhost:8884
+```
+
+`make down` 停止、`make logs` 追蹤所有服務日誌、`make ps` 看狀態。
+
+## 服務
+
+見 [`deploy/docker-compose.yaml`](../deploy/docker-compose.yaml)。
+
+| 服務             | 映像                   | 端口  | 角色 |
+|------------------|------------------------|-------|------|
+| `backend`        | `llmops-engine`（GPU） | 5000  | Dashboard API；在 `:800x` 拉起 vLLM 子進程 |
+| `router`         | `llmops-engine`        | 8887  | OpenAI 相容路由；**共用後端的 network namespace**，才打得到那些 localhost vLLM 端口 |
+| `prometheus`     | `prom/prometheus`      | 9090  | 透過 file-based SD 抓取 vLLM 艦隊的 `/metrics`；**同樣共用後端 netns**，`localhost:800x` 才解析得到那些實例 |
+| `grafana`        | `grafana/grafana`      | （代理）| Dashboards 與告警；經前端 nginx 以單一來源代理在 `/grafana` |
+| `dcgm-exporter`  | `nvcr.io/.../dcgm-exporter`（GPU） | 9400 | NVIDIA GPU 遙測（利用率、顯存、溫度、功耗） |
+| `node-exporter`  | `prom/node-exporter`   | 9100  | 主機指標（CPU、RAM、磁碟、網路） |
+| `frontend`       | `llmops-frontend`      | 8884  | nginx 服務 SPA，並反向代理 `/api` → 後端、`/v1` → router、`/grafana` → grafana |
+
+### 為何一份映像、多個服務共用一個 netns
+
+只有後端真的需要 vLLM（它負責拉起子進程），而 router 與 Prometheus 必須在 `localhost`
+看到那些子進程——所以單一 [`engine.Dockerfile`](../deploy/engine.Dockerfile)（基於官方
+`vllm/vllm-openai`）跑成 `backend` + `router`，並（連同 Prometheus）以
+`network_mode: service:backend` 串接。前端透過 nginx 以單一來源（same-origin）連到後端、
+router 與 Grafana，因此 build 不會硬編任何 host/port。
+
+### 資料持久化
+
+- SQLite 與動態模型 overlay → `llmops-data` named volume
+- Prometheus TSDB → `prometheus-data`；Grafana 狀態 → `grafana-data`
+- 模型**權重**以 bind-mount 掛在主機 HF 快取（`HF_CACHE_DIR`，預設 `~/.cache/huggingface`），
+  本機就能直接瀏覽、也和主機端工具共用
+- `packages/config-schema/config.yaml` 同樣 bind-mount 掛入，因此改模型不必重新 build
+
+> **模型生命週期**：router 只負責路由與負載平衡，不會啟動模型。vLLM 實例（與
+> Embedding/Reranker 服務）由後端管理，從 **Models** 頁按需啟動（或
+> `POST /api/models/{key}/start`）。後端與 router 都會在啟動時合併動態模型 overlay，
+> 所以從前端新增的模型在重啟後仍會保留。
+
+### 驗證
+
+```bash
+curl http://localhost:8887/v1/models     # router：列出設定的模型群組
+curl http://localhost:5000/api/models    # 後端：每個實例的生命週期狀態
+```
+
+## 前端（Web 控制台）
+
+控制台位於 **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript、Tailwind CSS v4、
+shadcn-vue 元件、[Vue Flow](https://vueflow.dev)（拓撲／路由圖）、Pinia + Vue Router。
+（舊的 `apps/frontend` 已棄用。）
+
+```bash
+cd apps/frontend_llmops
+npm install
+npm run dev          # http://localhost:5173
+npm run build        # 生產環境建置 → dist/
+```
+
+設定 — `apps/frontend_llmops/.env`：
+
+```env
+VITE_API_BASE_URL=http://localhost:5000        # Dashboard 後端（生命週期、遙測）
+VITE_ROUTER_BASE_URL=http://localhost:8887     # LLM Router（推理 + /metrics + /reload）
+```
+
+### 驗證機制
+
+驗證改由後端控管（不再是 build-time 密碼）。在後端與 router 設定 `LLMOPS_ADMIN_TOKEN`
+即可鎖住所有控制操作（啟動／停止／新增／編輯／移除 + 金鑰管理）；UI 會要求輸入一次
+token 並在 session 內沿用。在 router 設定 `LLMOPS_REQUIRE_API_KEY=true` 則要求所有
+`/v1/*` 推理都帶 bearer token（admin token，或在 **API 金鑰** 頁建立的金鑰）。兩者預設
+關閉，方便本機開發。
+
+## 手動 / 開發啟動
+
+也可自行啟動三個部分（Python 依賴在 repo 根目錄的 `.venv`）：
+
+```bash
+# Dashboard 後端（:5000）
+cd apps/backend && pip install -r requirements.txt
+uvicorn app.main:app --host 0.0.0.0 --port 5000   # module path follows apps/backend/app/main.py
+
+# LLM router（:8887）— 細節見 apps/router-server/README_zh.md
+cd apps/router-server && pip install -r requirements.txt
+sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py
+```
+
+配置文件統一使用 `packages/config-schema/config.yaml`（單一來源），確保前端、後端與
+router 讀到同一份設定。
+
+## 環境需求
+
+- **GPU**：NVIDIA GPU（建議 CUDA 13.1+）
+- **記憶體**：16GB+ RAM（依模型大小而定）
+- **硬碟**：50GB+ 可用空間
diff --git a/docs/features.md b/docs/features.md
new file mode 100644
index 0000000..82cc341
--- /dev/null
+++ b/docs/features.md
@@ -0,0 +1,87 @@
+# Features in depth
+
+> [中文](features_zh-CN.md)
+
+## Model management
+
+- Multi-model, multi-instance management on vLLM (LLM, Embedding, Reranker).
+- Per-instance lifecycle (start/stop) with a live state machine
+  (`stopped → starting → ready → failed/stopping`), driven by a reconciler that
+  derives the true state from process liveness + `/health` probes.
+- **Add models from the UI by pasting a `vllm serve …` command** — it is parsed into
+  an editable form and layered on as a dynamic *overlay*, so the hand-maintained
+  `config.yaml` stays untouched; the router hot-reloads (`POST /reload`) so new models
+  are routable end-to-end.
+- Load-aware routing: the router auto-selects the least-loaded instance (weighting
+  running / waiting requests + KV-cache usage).
+
+## Reliability
+
+- **VRAM pre-flight guard** — blocks a start that would likely OOM, with a one-click
+  *Force start* override.
+- **GPU auto-placement** — an instance with no pinned `cuda_device` is placed on the
+  GPU with the most free memory.
+- **Auto-restart** — a managed model that crashes is restarted with exponential
+  backoff (configurable budget, resets once healthy).
+
+## Observability
+
+- Real-time status via Server-Sent Events (no polling).
+- **System topology** (Vue Flow) — a live mission-control graph of Clients → Router →
+  model groups / Embedding → GPUs, with animated traffic edges, GPU-placement edges,
+  and a control plane; nodes are clickable drill-ins.
+- **Router load-balancing view** — an animated fan showing each replica's real traffic
+  share and the instance the router will pick next.
+- **Grafana monitoring** (bundled) — see [monitoring.md](monitoring.md).
+- Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a
+  state-transition event timeline.
+- GPU / CPU / memory monitoring plus a GPU-process inventory.
+
+## Playground
+
+- OpenAI-compatible **chat (streaming)**, completions, **embeddings**, and
+  **reranking**, sent straight through the router.
+- **Reasoning ("thinking") display** — when a model runs with a vLLM reasoning parser,
+  the `reasoning` stream is shown in a collapsible block above the answer.
+
+## Benchmarking & evaluation (evalscope)
+
+- **Load testing** (`/benchmark`) — concurrency sweep, arrival-rate open-loop,
+  multi-turn, **SLA auto-tune**, plus **embedding / rerank** throughput and
+  single-request **speed benchmark**; each run is an isolated subprocess, with live
+  charts, run comparison, and the full evalscope HTML report.
+  See [evalscope_模型壓測整理.md](evalscope_模型壓測整理.md).
+- **Accuracy / quality evaluation** (`/eval`) — **30+ benchmark datasets** grouped by
+  capability tier (Baseline, Knowledge, Chinese, Reasoning, Math, Multilingual,
+  **Tool-calling**, **Long-context**, Code, and judge-scored QA): MMLU/ARC/GSM8K/IFEval,
+  C-Eval/C-MMLU, GPQA/MMLU-Pro, AIME, HumanEval, ToolBench/General-FunctionCall,
+  Needle-in-a-Haystack, …
+  See [evalscope_LLM評測集整理.md](evalscope_LLM評測集整理.md).
+  - Per-dataset scores, a **run-to-run comparison matrix** (highlights the best per
+    dataset), and the interactive HTML report.
+  - **LLM-as-judge** for free-form QA — pick one of your own deployed models (via the
+    router) or an external OpenAI-compatible API.
+  - **Advanced `dataset_args`** — few-shot count + raw per-dataset overrides (subset
+    selection, etc.).
+  - Sanity guards: judge-scored datasets require a judge; long-context and real
+    tool-calling datasets warn about their model prerequisites (large `max_model_len`,
+    vLLM tool parser).
+
+## Libraries
+
+- **Model library** (`/library`) — scan / pre-download / delete HF model weights from
+  the UI, with live download progress.
+- **Dataset library** (`/datasets`) — pre-download load-test and evaluation datasets
+  into the shared ModelScope cache so a run never stalls on a first-time download.
+- **Tool-calling config helper** — the model editor maps model families to the right
+  vLLM `tool_call_parser` (Qwen→`hermes`, Qwen3-Coder→`qwen3_xml`,
+  Llama→`llama3_json`/`llama4_pythonic`, …) with one-click preset insertion.
+  See [vllm_auto_tool_整理.md](vllm_auto_tool_整理.md).
+- **LoRA** — see [vLLM_LoRA_部署整理.md](vLLM_LoRA_部署整理.md).
+
+## UX & security
+
+- Light / dark theme, dense "control-room" interface.
+- **Admin-token-gated control** (start / stop / add / edit / remove) and **API-key
+  management** — mint/revoke keys that authenticate router inference, with per-key
+  usage attribution in the request log.
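
The load-aware routing described in `docs/features.md` (weighting running and waiting requests plus KV-cache usage) could look roughly like this. The weights, the `InstanceLoad` fields, and the linear score are assumptions for illustration; the router's real formula is not given in the docs.

```python
from dataclasses import dataclass


@dataclass
class InstanceLoad:
    instance_id: str
    running: int            # requests currently being served
    waiting: int            # requests queued
    kv_cache_usage: float   # fraction of KV cache in use, 0.0 to 1.0


def load_score(
    load: InstanceLoad,
    w_running: float = 1.0,
    w_waiting: float = 2.0,
    w_kv: float = 10.0,
) -> float:
    """Lower is better. The weights are illustrative, not the router's values."""
    return (
        w_running * load.running
        + w_waiting * load.waiting
        + w_kv * load.kv_cache_usage
    )


def pick_instance(loads: list[InstanceLoad]) -> InstanceLoad:
    """Select the least-loaded instance by score."""
    if not loads:
        raise ValueError("no instances to route to")
    return min(loads, key=load_score)
```

With the default weights, an instance with 4 running, 2 waiting, and 80% KV-cache usage scores 16.0, while one with 1 running, 0 waiting, and 20% usage scores 3.0, so the second is chosen.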
diff --git a/docs/features_zh-CN.md b/docs/features_zh-CN.md
new file mode 100644
index 0000000..2d7192e
--- /dev/null
+++ b/docs/features_zh-CN.md
@@ -0,0 +1,70 @@
+# 功能特色（詳細）
+
+> [English](features.md)
+
+## 模型管理
+
+- 基於 vLLM 的多模型、多實例管理（LLM、Embedding、Reranker）。
+- 每個實例獨立的生命週期（啟動/停止），具即時狀態機
+  （`stopped → starting → ready → failed/stopping`），由 reconciler 從「進程存活 +
+  `/health` 探測」推導真實狀態。
+- **在前端貼上 `vllm serve …` 指令即可新增模型** — 解析成可編輯表單，以動態 *overlay*
+  疊加，**不動手寫的 `config.yaml`**；router 會熱重載（`POST /reload`），新模型端到端
+  可被路由。
+- 負載感知路由：router 自動選擇負載最低的實例（依運行中／等待中請求 + KV 快取使用率加權）。
+
+## 可靠性
+
+- **VRAM 預檢防呆** — 啟動前估算顯存，可能 OOM 就擋下，並提供一鍵 *Force start* 覆寫。
+- **GPU 自動擺放** — 未指定 `cuda_device` 的實例會自動擺到剩餘顯存最多的 GPU。
+- **失敗自動重啟** — managed 模型崩潰後以指數退避自動重啟（可設次數，恢復健康後重置）。
+
+## 觀測性
+
+- 透過 Server-Sent Events 即時更新狀態（免輪詢）。
+- **系統拓撲圖**（Vue Flow）— Clients → Router → 模型群組／Embedding → GPU 的即時
+  mission-control 圖，含流動的流量邊、GPU 擺放邊與控制平面；節點可點擊下鑽。
+- **Router 負載平衡視圖** — 動畫扇形圖呈現每個副本的實際流量佔比，以及 router 下一個會
+  選的實例。
+- **Grafana 監控**（內建）— 見 [monitoring_zh-CN.md](monitoring_zh-CN.md)。
+- 每模型用量（次數、錯誤率、p50/p95 延遲、tokens）、請求日誌、狀態轉移事件時間軸。
+- GPU／CPU／記憶體監控，以及 GPU 進程清單。
+
+## Playground
+
+- OpenAI 相容的 **chat（串流）**、completions、**embeddings**、**reranking**，直接經由 router。
+- **思考（reasoning）顯示** — 模型搭配 vLLM reasoning parser 時，`reasoning` 串流會顯示
+  在答案上方的可摺疊「思考過程」區塊。
+
+## 壓測與評測（evalscope）
+
+- **壓測**（`/benchmark`）— 並發 sweep、到達率 open-loop、多輪、**SLA 自動調優**，以及
+  **embedding／rerank** 吞吐與單請求**速度基準**；每次執行為獨立子進程，含即時圖表、
+  run 比較、完整 evalscope HTML 報告。見 [evalscope_模型壓測整理.md](evalscope_模型壓測整理.md)。
+- **準確度／品質評測**（`/eval`）— **30+ 個基準資料集**，依能力分組（基線、知識進階、
+  中文、推理、數學、多語言、**工具調用**、**長上下文**、程式碼、需裁判的問答）：
+  MMLU/ARC/GSM8K/IFEval、C-Eval/C-MMLU、GPQA/MMLU-Pro、AIME、HumanEval、
+  ToolBench/General-FunctionCall、Needle-in-a-Haystack…
+  見 [evalscope_LLM評測集整理.md](evalscope_LLM評測集整理.md)。
+  - 每資料集分數、**run 對 run 的比較表**（每列標出最高分）、互動式 HTML 報告。
+  - **裁判模型（LLM-as-judge）** 給自由問答評分 — 可選自家部署的模型（經 router）或外部
+    OpenAI 相容 API。
+  - **進階 `dataset_args`** — few-shot 數 + 依資料集的原始覆寫（子集選擇等）。
+  - 防呆：需裁判的資料集會強制設定裁判；長上下文與真實工具調用資料集會提醒模型前提
+    （夠大的 `max_model_len`、vLLM tool parser）。
+
+## 資料庫
+
+- **模型庫**（`/library`）— 在 UI 掃描／預下載／刪除 HF 權重，含即時下載進度。
+- **資料集庫**（`/datasets`）— 預先下載壓測與評測資料集到共用 ModelScope 快取，執行時
+  就不會卡在首次下載。
+- **工具調用設定助手** — 模型編輯器把模型家族對應到正確的 vLLM `tool_call_parser`
+  （Qwen→`hermes`、Qwen3-Coder→`qwen3_xml`、Llama→`llama3_json`/`llama4_pythonic`…），
+  一鍵帶入。見 [vllm_auto_tool_整理.md](vllm_auto_tool_整理.md)。
+- **LoRA** — 見 [vLLM_LoRA_部署整理.md](vLLM_LoRA_部署整理.md)。
+
+## 使用體驗與安全
+
+- 明暗雙主題、資訊密集的「控制室」介面。
+- **管理員權杖控管**控制操作（啟動／停止／新增／編輯／移除），以及 **API 金鑰管理** —
+  發行／撤銷用於 router 推理的金鑰，並在請求日誌中做 per-key 用量歸屬。
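
The lifecycle state machine in the features docs (`stopped → starting → ready → failed/stopping`, derived from process liveness plus `/health` probes) can be approximated by a pure function. This is a simplified sketch: the `wanted` flag, the `derive_state` name, and the exact mapping are assumptions, and startup timeouts are left out.

```python
from enum import Enum


class State(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    READY = "ready"
    FAILED = "failed"
    STOPPING = "stopping"


def derive_state(wanted: bool, alive: bool, healthy: bool) -> State:
    """Derive the instance state from observations.

    wanted  - the operator asked for this instance to run (hypothetical flag)
    alive   - the vLLM process is still running
    healthy - the last /health probe succeeded
    """
    if alive and not wanted:
        return State.STOPPING      # a stop was requested, process not gone yet
    if alive:
        return State.READY if healthy else State.STARTING
    # The process is not running.
    return State.FAILED if wanted else State.STOPPED
```

Deriving the state from observations on every pass, instead of storing it, lets a reconciler recover the true state after a crash or a backend restart.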
diff --git a/docs/monitoring.md b/docs/monitoring.md
new file mode 100644
index 0000000..56ab1a7
--- /dev/null
+++ b/docs/monitoring.md
@@ -0,0 +1,33 @@
+# Monitoring (Prometheus + Grafana)
+
+> [中文](monitoring_zh-CN.md)
+
+The stack bundles a full **Prometheus → Grafana** pipeline that needs no manual setup.
+
+- The **backend** writes a Prometheus file-based service-discovery file
+  (`LLMOPS_PROMETHEUS_SD_PATH`) listing every *ready* vLLM instance, refreshed as
+  models start/stop — so a dynamic fleet is scraped with zero config edits.
+- **Prometheus** (`:9090`) scrapes those instances' `/metrics` plus `dcgm-exporter`
+  (GPU) and `node-exporter` (host).
+- **Grafana** is served single-origin at **`http://localhost:8884/grafana`**
+  (anonymous read-only; log in as `admin` / `GRAFANA_ADMIN_PASSWORD` to edit).
+  Datasource and dashboards are auto-provisioned from
+  [`deploy/grafana`](../deploy/grafana):
+  - **Overview** — single pane: health, latency SLO, capacity, GPU/host
+  - **vLLM Scheduling & Capacity** (custom)
+  - **Performance** / **Query** (official vLLM dashboards)
+  - **GPU** (DCGM) and **Host** (Node Exporter)
+
+  The same dashboards are embedded in the dashboard's **Monitoring** tab, with SLO
+  threshold lines and model-lifecycle annotations.
+- **Alerting**: provisioned vLLM alert rules (target down, TTFT p95, KV cache,
+  request queueing) route to a webhook contact point — set `GRAFANA_ALERT_WEBHOOK` in
+  `deploy/.env` (Slack/Discord/generic) and restart Grafana to receive them.
+
+```bash
+curl http://localhost:9090/api/v1/targets        # prometheus: scrape target health
+# open http://localhost:8884/grafana             # dashboards + alerts
+```
+
+For background on the metrics and the design rationale, see
+[vllm_grafana_monitoring_guide.md](vllm_grafana_monitoring_guide.md).
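
The service-discovery file mentioned in `docs/monitoring.md` follows Prometheus's file-based SD format: a JSON list of groups, each with `targets` and optional `labels`. The sketch below shows how such a file could be generated from the *ready* instances. The input dictionaries, the label names (`model`, `instance_id`), and `build_file_sd` are hypothetical; only the outer JSON shape comes from Prometheus.

```python
import json


def build_file_sd(instances: list[dict]) -> str:
    """Render a Prometheus file_sd JSON document listing only ready instances."""
    groups = [
        {
            "targets": [f"{inst['host']}:{inst['port']}"],
            "labels": {"model": inst["model"], "instance_id": inst["id"]},
        }
        for inst in instances
        if inst.get("state") == "ready"
    ]
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    fleet = [
        {"id": "qwen3", "model": "Qwen3-0.6B", "host": "localhost", "port": 8002, "state": "ready"},
        {"id": "qwen3-2", "model": "Qwen3-0.6B", "host": "localhost", "port": 8004, "state": "starting"},
    ]
    print(build_file_sd(fleet))
```

Rewriting this file whenever an instance changes state is what would let Prometheus pick up a changing fleet without edits to its own configuration.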
diff --git a/docs/monitoring_zh-CN.md b/docs/monitoring_zh-CN.md
new file mode 100644
index 0000000..bc7148b
--- /dev/null
+++ b/docs/monitoring_zh-CN.md
@@ -0,0 +1,29 @@
+# 監控（Prometheus + Grafana）
+
+> [English](monitoring.md)
+
+整套內建完整的 **Prometheus → Grafana** 流程，免手動設定。
+
+- **後端**寫出 Prometheus file-based service-discovery 檔（`LLMOPS_PROMETHEUS_SD_PATH`），
+  列出每個 *ready* 的 vLLM 實例，並隨模型啟停刷新——所以動態艦隊免改設定即被抓取。
+- **Prometheus**（`:9090`）抓取這些實例的 `/metrics`，外加 `dcgm-exporter`（GPU）與
+  `node-exporter`（主機）。
+- **Grafana** 以單一來源服務於 **`http://localhost:8884/grafana`**（匿名唯讀；以
+  `admin` / `GRAFANA_ADMIN_PASSWORD` 登入可編輯）。datasource 與 dashboards 由
+  [`deploy/grafana`](../deploy/grafana) 自動 provision：
+  - **總覽** — 單一頁面：健康、延遲 SLO、容量、GPU/主機
+  - **vLLM 排程與容量**（自訂）
+  - **Performance** / **Query**（官方 vLLM dashboards）
+  - **GPU**（DCGM）與 **Host**（Node Exporter）
+
+  同一批 dashboards 也嵌入控制台的 **監控** 分頁，含 SLO 門檻線與模型生命週期標註。
+- **告警**：已 provision 的 vLLM 告警規則（target down、TTFT p95、KV cache、請求排隊）
+  路由到一個 webhook contact point —— 在 `deploy/.env` 設 `GRAFANA_ALERT_WEBHOOK`
+  （Slack/Discord/通用）並重啟 Grafana 即可收到通知。
+
+```bash
+curl http://localhost:9090/api/v1/targets        # prometheus：scrape target 健康狀態
+# 開啟 http://localhost:8884/grafana             # dashboards 與告警
+```
+
+指標背景與設計理念見 [vllm_grafana_monitoring_guide.md](vllm_grafana_monitoring_guide.md)。