LLMSystems · milk333445 · Jun 19, 2026
diff --git a/README.md b/README.md
@@ -48,7 +48,7 @@ This project combines a routing server (LLM-Router-Server) with an easy-to-use m
 - Real-time status via Server-Sent Events (no polling)
 - **System topology** (Vue Flow) — a live mission-control graph of Clients → Router → model groups / Embedding → GPUs, with animated traffic edges, GPU-placement edges, and a control plane; nodes are clickable drill-ins
 - **Router load-balancing view** — an animated fan showing each replica's real traffic share and the instance the router will pick next
-- **Trends** — time-series charts (requests, error rate, p95 latency, tokens) over 15m–24h, aggregated from the persisted request log
+- **Grafana monitoring** (bundled): Prometheus auto-discovers every ready vLLM instance (through a file-based service-discovery file that the backend rewrites as models start/stop) and scrapes its `/metrics`, alongside GPU (DCGM) and host (node-exporter) metrics. The Grafana dashboards, **Overview** (single pane: health, latency SLO, capacity, GPU/host), **Scheduling & Capacity**, vLLM Performance/Query, GPU, and Host, are embedded in the **Monitoring** tab, with SLO threshold lines, model-lifecycle annotations, and alert rules. See [Monitoring (Grafana)](#monitoring-grafana).
 - Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a state-transition event timeline
 - GPU / CPU / memory monitoring plus a GPU-process inventory
 
@@ -103,24 +103,30 @@ make up                              # docker compose -f deploy/docker-compose.y
 
 **Topology** (see [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)):
 
-| Service    | Image                  | Port    | Role |
-|------------|------------------------|---------|------|
-| `backend`  | `llmops-engine` (GPU)  | 5000    | Dashboard API; spawns vLLM subprocesses on `:800x` |
-| `router`   | `llmops-engine`        | 8887    | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
-| `frontend` | `llmops-frontend`      | 8884    | nginx serving the SPA + reverse-proxying `/api` → backend and `/v1` → router |
-
-Why one image, two services: only the backend truly needs vLLM (it launches the
-subprocesses), and the router must see them on `localhost` — so a single
-[`engine.Dockerfile`](deploy/engine.Dockerfile) (based on the official
-`vllm/vllm-openai`) runs as two services joined by `network_mode: service:backend`.
-
-The frontend reaches the backend and router through nginx on a single origin, so
-no host/port is baked into the build. SQLite + the dynamic-model overlay persist
-in the `llmops-data` named volume; downloaded model **weights** are bind-mounted
-from the host HF cache (`HF_CACHE_DIR`, default `~/.cache/huggingface`) so they're
-browsable locally and shared with host-side tools. The canonical
-`packages/config-schema/config.yaml` is bind-mounted too, so you can edit models
-without rebuilding.
+| Service          | Image                  | Port    | Role |
+|------------------|------------------------|---------|------|
+| `backend`        | `llmops-engine` (GPU)  | 5000    | Dashboard API; spawns vLLM subprocesses on `:800x` |
+| `router`         | `llmops-engine`        | 8887    | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
+| `prometheus`     | `prom/prometheus`      | 9090    | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances |
+| `grafana`        | `grafana/grafana`      | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx |
+| `dcgm-exporter`  | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) |
+| `node-exporter`  | `prom/node-exporter`   | 9100    | Host metrics (CPU, RAM, disk, network) |
+| `frontend`       | `llmops-frontend`      | 8884    | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana |
+
+Why one engine image, several services on one netns: only the backend truly needs
+vLLM (it launches the subprocesses), while the router and Prometheus must reach them
+on `localhost`. A single [`engine.Dockerfile`](deploy/engine.Dockerfile) (based on
+the official `vllm/vllm-openai`) therefore runs as `backend` and `router`, and both
+the router and Prometheus (its own image) use `network_mode: service:backend`.
+
+The frontend reaches the backend, router, and Grafana through nginx on a single
+origin, so no host/port is baked into the build. SQLite + the dynamic-model
+overlay persist in the `llmops-data` named volume (Prometheus TSDB and Grafana
+state in `prometheus-data` / `grafana-data`); downloaded model **weights** are
+bind-mounted from the host HF cache (`HF_CACHE_DIR`, default
+`~/.cache/huggingface`) so they're browsable locally and shared with host-side
+tools. The canonical `packages/config-schema/config.yaml` is bind-mounted too, so
+you can edit models without rebuilding.
 
 > **Model lifecycle**: the router only routes and load-balances — it never
 > launches models. vLLM instances (and the Embedding/Reranker server) are owned
@@ -135,6 +141,31 @@ curl http://localhost:8887/v1/models     # router: configured model groups
 curl http://localhost:5000/api/models    # backend: lifecycle state of each instance
 ```
 
+#### Monitoring (Grafana)
+
+The stack bundles a full **Prometheus → Grafana** pipeline that needs no manual setup:
+
+- The **backend** writes a Prometheus file-based service-discovery file
+  (`LLMOPS_PROMETHEUS_SD_PATH`) listing every *ready* vLLM instance, refreshed as
+  models start/stop — so a dynamic fleet is scraped with zero config edits.
+- **Prometheus** (`:9090`) scrapes those instances' `/metrics` plus
+  `dcgm-exporter` (GPU) and `node-exporter` (host).
+- **Grafana** is served single-origin at **`http://localhost:8884/grafana`**
+  (anonymous read-only; log in as `admin` / `GRAFANA_ADMIN_PASSWORD` to edit).
+  Datasource and dashboards are auto-provisioned from
+  [`deploy/grafana`](deploy/grafana): **Overview**, **vLLM Scheduling &
+  Capacity** (custom), **Performance**/**Query** (official), **GPU** (DCGM), and
+  **Host** (Node Exporter). The same Grafana dashboards are embedded in the web
+  dashboard's **Monitoring** tab.
+- **Alerting**: provisioned vLLM alert rules (target down, TTFT p95, KV cache,
+  request queueing) route to a webhook contact point — set `GRAFANA_ALERT_WEBHOOK`
+  in `deploy/.env` (Slack/Discord/generic) and restart Grafana to receive them.
+
+```bash
+curl http://localhost:9090/api/v1/targets        # prometheus: scrape target health
+# open http://localhost:8884/grafana             # dashboards + alerts
+```
+
 ### Frontend (Web Dashboard)
 
 The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript, Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.)
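The `curl http://localhost:9090/api/v1/targets` check from the Monitoring section can also be scripted. The sketch below groups active scrape targets by health, assuming the standard Prometheus targets response shape (`data.activeTargets[*].health` and `scrapeUrl`). The function names `fetch_targets` and `summarize_targets` are hypothetical helpers for illustration, not part of this repository.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """GET /api/v1/targets and return the decoded JSON payload."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def summarize_targets(payload: dict) -> dict[str, list[str]]:
    """Group active targets' scrape URLs by health ("up", "down", "unknown")."""
    by_health: dict[str, list[str]] = defaultdict(list)
    for target in payload.get("data", {}).get("activeTargets", []):
        health = target.get("health", "unknown")
        by_health[health].append(target.get("scrapeUrl", "?"))
    # Sort URLs so repeated runs produce comparable output.
    return {health: sorted(urls) for health, urls in by_health.items()}
```

With the stack running, `summarize_targets(fetch_targets())` should list each ready vLLM instance's `/metrics` URL under `"up"`. An instance that appears under `"down"` is present in the service-discovery file but failing to scrape.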

diff --git a/README_zh-CN.md b/README_zh-CN.md
@@ -45,7 +45,7 @@
 - 透過 Server-Sent Events 即時更新狀態（免輪詢）
 - **系統拓撲圖**（Vue Flow）— Clients → Router → 模型群組／Embedding → GPU 的即時 mission-control 圖，含流動的流量邊、GPU 擺放邊與控制平面；節點可點擊下鑽
 - **Router 負載平衡視圖** — 動畫扇形圖呈現每個副本的實際流量佔比，以及 router 下一個會選的實例
-- **趨勢圖** — 請求數／錯誤率／p95 延遲／tokens 的時序圖（15m–24h），由持久化的 request log 聚合
+- **Grafana 監控**（內建）— Prometheus 自動發現每個運行中的 vLLM 實例（後端隨模型啟停寫出 file-based service discovery）並抓取其 `/metrics`，外加 GPU（DCGM）與主機（node-exporter）指標。Grafana dashboards —— **總覽**（單一頁面：健康、延遲 SLO、容量、GPU/主機）、**排程與容量**、vLLM Performance/Query、GPU、Host —— 嵌入 **監控** 分頁，含 SLO 門檻線、模型生命週期標註與告警規則。見 [監控（Grafana）](#監控grafana)
 - 每模型用量（次數、錯誤率、p50/p95 延遲、tokens）、請求日誌、狀態轉移事件時間軸
 - GPU／CPU／記憶體監控，以及 GPU 進程清單
 
@@ -98,21 +98,27 @@ make up                              # docker compose -f deploy/docker-compose.y
 
 **架構**（見 [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)）：
 
-| 服務       | 映像                   | 端口  | 角色 |
-|------------|------------------------|-------|------|
-| `backend`  | `llmops-engine`（GPU） | 5000  | Dashboard API；在 `:800x` 拉起 vLLM 子進程 |
-| `router`   | `llmops-engine`        | 8887  | OpenAI 相容路由；**共用後端的 network namespace**，才打得到那些 localhost vLLM 端口 |
-| `frontend` | `llmops-frontend`      | 8884  | nginx 服務 SPA，並反向代理 `/api` → 後端、`/v1` → router |
-
-為何一份映像、兩個服務：只有後端真的需要 vLLM（它負責拉起子進程），而 router 必須在
-`localhost` 看到那些子進程——所以單一 [`engine.Dockerfile`](deploy/engine.Dockerfile)
-（基於官方 `vllm/vllm-openai`）以 `network_mode: service:backend` 跑成兩個服務。
-
-前端透過 nginx 以單一來源（same-origin）連到後端與 router，因此 build 不會硬編任何
-host/port。SQLite 與動態模型 overlay 放在 `llmops-data` named volume；下載的模型**權重**
-則以 bind-mount 掛在主機 HF 快取（`HF_CACHE_DIR`，預設 `~/.cache/huggingface`），所以
-本機就能直接瀏覽、也和主機端工具共用。`packages/config-schema/config.yaml` 同樣 bind-mount
-掛入，因此改模型不必重新 build。
+| 服務             | 映像                   | 端口  | 角色 |
+|------------------|------------------------|-------|------|
+| `backend`        | `llmops-engine`（GPU） | 5000  | Dashboard API；在 `:800x` 拉起 vLLM 子進程 |
+| `router`         | `llmops-engine`        | 8887  | OpenAI 相容路由；**共用後端的 network namespace**，才打得到那些 localhost vLLM 端口 |
+| `prometheus`     | `prom/prometheus`      | 9090  | 透過 file-based SD 抓取 vLLM 艦隊的 `/metrics`；**同樣共用後端 netns**，`localhost:800x` 才解析得到那些實例 |
+| `grafana`        | `grafana/grafana`      | （代理）| Dashboards 與告警；經前端 nginx 以單一來源代理在 `/grafana` |
+| `dcgm-exporter`  | `nvcr.io/.../dcgm-exporter`（GPU） | 9400 | NVIDIA GPU 遙測（利用率、顯存、溫度、功耗） |
+| `node-exporter`  | `prom/node-exporter`   | 9100  | 主機指標（CPU、RAM、磁碟、網路） |
+| `frontend`       | `llmops-frontend`      | 8884  | nginx 服務 SPA，並反向代理 `/api` → 後端、`/v1` → router、`/grafana` → grafana |
+
+為何一份映像、多個服務共用一個 netns：只有後端真的需要 vLLM（它負責拉起子進程），
+而 router 與 Prometheus 必須在 `localhost` 看到那些子進程——所以單一
+[`engine.Dockerfile`](deploy/engine.Dockerfile)（基於官方 `vllm/vllm-openai`）跑成
+`backend` + `router`，並（連同 Prometheus）以 `network_mode: service:backend` 串接。
+
+前端透過 nginx 以單一來源（same-origin）連到後端、router 與 Grafana，因此 build 不會
+硬編任何 host/port。SQLite 與動態模型 overlay 放在 `llmops-data` named volume（Prometheus
+TSDB 與 Grafana 狀態放在 `prometheus-data` / `grafana-data`）；下載的模型**權重**則以
+bind-mount 掛在主機 HF 快取（`HF_CACHE_DIR`，預設 `~/.cache/huggingface`），所以本機就能
+直接瀏覽、也和主機端工具共用。`packages/config-schema/config.yaml` 同樣 bind-mount 掛入，
+因此改模型不必重新 build。
 
 > **模型生命週期**：router 只負責路由與負載平衡，不會啟動模型。vLLM 實例（與
 > Embedding/Reranker 服務）由後端管理，從 **Models** 頁按需啟動（或
@@ -126,6 +132,28 @@ curl http://localhost:8887/v1/models     # router：列出設定的模型群組
 curl http://localhost:5000/api/models    # 後端：每個實例的生命週期狀態
 ```
 
+#### 監控（Grafana）
+
+整套內建完整的 **Prometheus → Grafana** 流程，免手動設定：
+
+- **後端**寫出 Prometheus file-based service-discovery 檔（`LLMOPS_PROMETHEUS_SD_PATH`），
+  列出每個 *ready* 的 vLLM 實例，並隨模型啟停刷新——所以動態艦隊免改設定即被抓取。
+- **Prometheus**（`:9090`）抓取這些實例的 `/metrics`，外加 `dcgm-exporter`（GPU）與
+  `node-exporter`（主機）。
+- **Grafana** 以單一來源服務於 **`http://localhost:8884/grafana`**（匿名唯讀；以
+  `admin` / `GRAFANA_ADMIN_PASSWORD` 登入可編輯）。datasource 與 dashboards 由
+  [`deploy/grafana`](deploy/grafana) 自動 provision：**總覽**、**vLLM 排程與容量**（自訂）、
+  **Performance**/**Query**（官方）、**GPU**（DCGM）、**Host**（Node Exporter）。
+  同一批 dashboards 也嵌入控制台的 **監控** 分頁。
+- **告警**：已 provision 的 vLLM 告警規則（target down、TTFT p95、KV cache、請求排隊）
+  路由到一個 webhook contact point —— 在 `deploy/.env` 設 `GRAFANA_ALERT_WEBHOOK`
+  （Slack/Discord/通用）並重啟 Grafana 即可收到通知。
+
+```bash
+curl http://localhost:9090/api/v1/targets        # prometheus：scrape target 健康狀態
+# 開啟 http://localhost:8884/grafana             # dashboards 與告警
+```
+
 ### 前端（Web 控制台）
 
 控制台位於 **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript、Tailwind CSS v4、shadcn-vue 元件、[Vue Flow](https://vueflow.dev)（拓撲／路由圖）、Pinia + Vue Router。（舊的 `apps/frontend` 已棄用。）

diff --git a/apps/backend/app/api/observability.py b/apps/backend/app/api/observability.py
@@ -54,26 +54,6 @@ async def requests_log(request: Request, model_key: Optional[str] = None, limit:
     return await _store(request).recent_requests(model_key=model_key, limit=limit)
 
 
-@router.get("/metrics/timeseries")
-async def metrics_timeseries(
-    request: Request,
-    window: int = 3600,
-    bucket: int = 60,
-    model_key: Optional[str] = None,
-):
-    """Bucketed request metrics over the last `window` seconds (for trend charts).
-
-    `bucket` is the bucket width in seconds; `model_key` optionally scopes to one
-    model group. Each point: ts, count, error_count, avg/p95 latency, total_tokens.
-    """
-    import time
-
-    since = time.time() - max(60, window)
-    return await _store(request).timeseries(
-        since=since, bucket_seconds=bucket, model_key=model_key
-    )
-
-
 @router.get("/models/{key}/logs")
 async def model_logs(
     key: str, tail: int = 200, manager: ModelManager = Depends(get_manager)
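With `/metrics/timeseries` removed, trend data presumably comes from Prometheus instead. As a rough sketch, a caller that still wants a bucketed p95 latency series could build a `query_range` request that reuses the removed endpoint's `window` and `bucket` semantics. The helper name `build_range_query_url` is hypothetical, and the metric name in `P95_LATENCY` is an assumption about what the scraped vLLM instances export; check it against a live `/metrics` page before relying on it.

```python
import time
from urllib.parse import urlencode

# Assumed vLLM histogram name; verify against an instance's /metrics output.
P95_LATENCY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))"
)


def build_range_query_url(base_url, promql, window=3600, bucket=60, now=None):
    """Build a Prometheus /api/v1/query_range URL covering the last `window` seconds.

    `bucket` becomes the query step in seconds; the window is clamped to at
    least 60s, as in the removed endpoint.
    """
    end = time.time() if now is None else now
    start = end - max(60, window)
    params = {
        "query": promql,
        "start": f"{start:.0f}",
        "end": f"{end:.0f}",
        "step": str(bucket),
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"
```

Unlike the removed endpoint, this reads what vLLM itself reports rather than the persisted request log, so the numbers may not match the old Trends charts exactly.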

diff --git a/apps/backend/app/core/settings.py b/apps/backend/app/core/settings.py
@@ -50,6 +50,10 @@ class BackendSettings:
     admin_token: str = ""
     # Optional webhook URL; a JSON alert is POSTed when a model enters FAILED.
     alert_webhook: str = ""
+    # Optional path for the Prometheus file_sd targets file. The backend rewrites
+    # it whenever the set of ready vLLM instances changes, so Prometheus can
+    # scrape a dynamic fleet without config edits. Empty -> feature disabled.
+    prometheus_sd_path: str = ""
     # Total concurrency budget shared across running evals (sum of their
     # eval_batch_size). Evals run in parallel as long as the sum stays within
     # this; the rest queue. Maps to vLLM's max-num-seqs pressure. Runtime-editable
@@ -65,6 +69,7 @@ def from_env(cls) -> "BackendSettings":
         return cls(
             admin_token=os.environ.get("LLMOPS_ADMIN_TOKEN", "").strip(),
             alert_webhook=os.environ.get("LLMOPS_ALERT_WEBHOOK", "").strip(),
+            prometheus_sd_path=os.environ.get("LLMOPS_PROMETHEUS_SD_PATH", "").strip(),
             poll_interval=_env_float("LLMOPS_POLL_INTERVAL", 2.0),
             start_timeout=_env_float("LLMOPS_START_TIMEOUT", 300.0),
             stop_timeout=_env_float("LLMOPS_STOP_TIMEOUT", 10.0),

diff --git a/apps/backend/app/llmops/manager.py b/apps/backend/app/llmops/manager.py
@@ -144,6 +144,25 @@ async def trigger_router_reload(self) -> bool:
             logger.warning("Router reload POST failed (%s/reload)", self.router_url)
             return False
 
+    async def write_prometheus_targets(self) -> bool:
+        """Best-effort: refresh the Prometheus file_sd targets file to reflect the
+        currently-ready vLLM instances. No-op unless prometheus_sd_path is set.
+        Write-if-changed and never raises — monitoring discovery must never break
+        the model state machine. The (blocking) file IO runs in the executor."""
+        path = self.settings.prometheus_sd_path
+        if not path:
+            return False
+        from app.services.prometheus_targets import build_targets, write_targets_file
+
+        loop = asyncio.get_running_loop()
+        try:
+            instances = await self.registry.snapshot()
+            targets = build_targets(instances)
+            return await loop.run_in_executor(None, write_targets_file, path, targets)
+        except Exception:
+            logger.warning("Failed to write Prometheus SD file at %s", path, exc_info=True)
+            return False
+
     async def list(self) -> list[ModelInstance]:
         return await self.registry.snapshot()
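The diff does not include `app/services/prometheus_targets.py`, so the following is only a guess at what `build_targets` and `write_targets_file` might look like. It assumes the standard Prometheus `file_sd` JSON format (a list of `{"targets": [...], "labels": {...}}` groups). The `Instance` dataclass, its fields, and the `model_key` label are hypothetical stand-ins for the project's real types.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Instance:  # hypothetical stand-in for the project's ModelInstance
    key: str
    port: int
    kind: str   # e.g. "llm" or "embedding"
    state: str  # e.g. "ready", "starting"


def build_targets(instances: list[Instance]) -> list[dict]:
    """One file_sd target group per ready LLM instance, sorted for stable output."""
    ready = [i for i in instances if i.kind == "llm" and i.state == "ready"]
    return [
        {"targets": [f"localhost:{i.port}"], "labels": {"model_key": i.key}}
        for i in sorted(ready, key=lambda i: (i.key, i.port))
    ]


def write_targets_file(path: str, targets: list[dict]) -> bool:
    """Write-if-changed with an atomic rename; returns True only when rewritten."""
    payload = json.dumps(targets, indent=2, sort_keys=True) + "\n"
    try:
        with open(path, encoding="utf-8") as fh:
            if fh.read() == payload:
                return False  # unchanged: skip the write
    except FileNotFoundError:
        pass
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target
    # so a reader never observes a half-written file.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(payload)
    os.chmod(tmp, 0o644)  # mkstemp defaults to 0600; the scraper may run as another user
    os.replace(tmp, path)
    return True
```

Sorting the targets and serializing with `sort_keys=True` keeps the output deterministic, which is what makes the write-if-changed comparison meaningful.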
 

diff --git a/apps/backend/app/llmops/reconciler.py b/apps/backend/app/llmops/reconciler.py
@@ -209,6 +209,14 @@ async def reconcile_once(
         for inst, _frm, to, _detail in transitions
     ):
         await manager.trigger_router_reload()
+    # Keep the Prometheus scrape-target file in sync whenever a vLLM instance
+    # joins or leaves the ready pool (READY in either direction of a transition),
+    # so monitoring tracks the live fleet. Idempotent (write-if-changed).
+    if manager is not None and any(
+        inst.kind == ModelKind.LLM and ModelState.READY in (frm, to)
+        for inst, frm, to, _detail in transitions
+    ):
+        await manager.write_prometheus_targets()
     if manager is not None and settings.auto_restart:
         await _process_restarts(registry, settings, store, manager)
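The new condition fires only when an LLM instance enters or leaves `READY`. The self-contained sketch below isolates that predicate so its behavior can be checked in isolation. The enums are simplified, hypothetical stand-ins for the project's `ModelKind` and `ModelState`, and each transition is reduced to a `(kind, from_state, to_state)` tuple.

```python
from enum import Enum


class ModelKind(Enum):  # hypothetical stand-in for the project's enum
    LLM = "llm"
    EMBEDDING = "embedding"


class ModelState(Enum):  # hypothetical stand-in; the real enum may have more states
    STARTING = "starting"
    READY = "ready"
    FAILED = "failed"
    STOPPED = "stopped"


def touches_ready_pool(transitions) -> bool:
    """True if any LLM transition has READY as its source or destination."""
    return any(
        kind == ModelKind.LLM and ModelState.READY in (frm, to)
        for kind, frm, to in transitions
    )
```

A `STARTING -> FAILED` transition leaves the ready pool untouched and so skips the rewrite, while `READY -> FAILED` triggers one so the dead instance drops out of the scrape targets.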
 

diff --git a/apps/backend/app/main.py b/apps/backend/app/main.py
@@ -118,6 +118,10 @@ async def lifespan(app: FastAPI):
     # honest from the first response.
     await adopt_running(registry, http_client, settings, store)
 
+    # Seed the Prometheus file_sd targets file (covering adopted-ready instances)
+    # so monitoring has a valid file from t=0, before the first state transition.
+    await manager.write_prometheus_targets()
+
     tasks = [
         asyncio.create_task(reconcile_loop(registry, http_client, settings, store, manager)),
         asyncio.create_task(_gpu_poll_loop(app, settings.gpu_poll_interval)),