69 changes: 50 additions & 19 deletions README.md
@@ -48,7 +48,7 @@ This project combines a routing server (LLM-Router-Server) with an easy-to-use m
- Real-time status via Server-Sent Events (no polling)
- **System topology** (Vue Flow) — a live mission-control graph of Clients → Router → model groups / Embedding → GPUs, with animated traffic edges, GPU-placement edges, and a control plane; nodes are clickable drill-ins
- **Router load-balancing view** — an animated fan showing each replica's real traffic share and the instance the router will pick next
- **Trends** — time-series charts (requests, error rate, p95 latency, tokens) over 15m–24h, aggregated from the persisted request log
- **Grafana monitoring** (bundled) — Prometheus auto-discovers every running vLLM instance (file-based service discovery written by the backend as models start/stop) and scrapes its `/metrics`, alongside GPU (DCGM) and host (node-exporter) metrics. Grafana dashboards — **Overview** (single pane: health, latency SLO, capacity, GPU/host), **Scheduling & Capacity**, vLLM Performance/Query, GPU, Host — are embedded in the **Monitoring** tab, with SLO threshold lines, model-lifecycle annotations, and alert rules. See [Monitoring (Grafana)](#monitoring-grafana)
- Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a state-transition event timeline
- GPU / CPU / memory monitoring plus a GPU-process inventory

@@ -103,24 +103,30 @@ make up # docker compose -f deploy/docker-compose.y

**Topology** (see [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)):

| Service | Image | Port | Role |
|------------|------------------------|---------|------|
| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` |
| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA + reverse-proxying `/api` → backend and `/v1` → router |

Why one image, two services: only the backend truly needs vLLM (it launches the
subprocesses), and the router must see them on `localhost` — so a single
[`engine.Dockerfile`](deploy/engine.Dockerfile) (based on the official
`vllm/vllm-openai`) runs as two services joined by `network_mode: service:backend`.

The frontend reaches the backend and router through nginx on a single origin, so
no host/port is baked into the build. SQLite + the dynamic-model overlay persist
in the `llmops-data` named volume; downloaded model **weights** are bind-mounted
from the host HF cache (`HF_CACHE_DIR`, default `~/.cache/huggingface`) so they're
browsable locally and shared with host-side tools. The canonical
`packages/config-schema/config.yaml` is bind-mounted too, so you can edit models
without rebuilding.
| Service | Image | Port | Role |
|------------------|------------------------|---------|------|
| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` |
| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
| `prometheus` | `prom/prometheus` | 9090 | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances |
| `grafana` | `grafana/grafana` | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx |
| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) |
| `node-exporter` | `prom/node-exporter` | 9100 | Host metrics (CPU, RAM, disk, network) |
| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana |

Why one image, multiple services on one netns: only the backend truly needs vLLM
(it launches the subprocesses), and the router + Prometheus must see them on
`localhost` — so a single [`engine.Dockerfile`](deploy/engine.Dockerfile) (based
on the official `vllm/vllm-openai`) runs as `backend` + `router`, joined (with
Prometheus) by `network_mode: service:backend`.

The frontend reaches the backend, router, and Grafana through nginx on a single
origin, so no host/port is baked into the build. SQLite + the dynamic-model
overlay persist in the `llmops-data` named volume (Prometheus TSDB and Grafana
state in `prometheus-data` / `grafana-data`); downloaded model **weights** are
bind-mounted from the host HF cache (`HF_CACHE_DIR`, default
`~/.cache/huggingface`) so they're browsable locally and shared with host-side
tools. The canonical `packages/config-schema/config.yaml` is bind-mounted too, so
you can edit models without rebuilding.

> **Model lifecycle**: the router only routes and load-balances — it never
> launches models. vLLM instances (and the Embedding/Reranker server) are owned
@@ -135,6 +141,31 @@ curl http://localhost:8887/v1/models # router: configured model groups
curl http://localhost:5000/api/models # backend: lifecycle state of each instance
```

#### Monitoring (Grafana)

The stack bundles a full **Prometheus → Grafana** pipeline, no manual setup:

- The **backend** writes a Prometheus file-based service-discovery file
(`LLMOPS_PROMETHEUS_SD_PATH`) listing every *ready* vLLM instance, refreshed as
models start/stop — so a dynamic fleet is scraped with zero config edits.
- **Prometheus** (`:9090`) scrapes those instances' `/metrics` plus
`dcgm-exporter` (GPU) and `node-exporter` (host).
- **Grafana** is served single-origin at **`http://localhost:8884/grafana`**
(anonymous read-only; log in as `admin` / `GRAFANA_ADMIN_PASSWORD` to edit).
Datasource and dashboards are auto-provisioned from
[`deploy/grafana`](deploy/grafana): **Overview**, **vLLM Scheduling &
Capacity** (custom), **Performance**/**Query** (official), **GPU** (DCGM), and
**Host** (Node Exporter). The same dashboards are embedded in the dashboard's
**Monitoring** tab.
- **Alerting**: provisioned vLLM alert rules (target down, TTFT p95, KV cache,
request queueing) route to a webhook contact point — set `GRAFANA_ALERT_WEBHOOK`
in `deploy/.env` (Slack/Discord/generic) and restart Grafana to receive them.

```bash
curl http://localhost:9090/api/v1/targets # prometheus: scrape target health
# open http://localhost:8884/grafana # dashboards + alerts
```
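
The file-based service discovery described above depends on Prometheus re-reading a JSON file of target groups. Below is a minimal sketch of how such a file could be built and rewritten only when its content changes. The function names echo the `build_targets` / `write_targets_file` helpers the backend imports, but the bodies, the `(port, model_key)` input shape, and the `model` label are assumptions made for illustration, not the project's actual implementation.

```python
import json
import os
import tempfile


def build_targets(ready_instances):
    """Turn (port, model_key) pairs into Prometheus file_sd target groups.

    The input shape and the "model" label name are illustrative assumptions.
    """
    return [
        {"targets": [f"localhost:{port}"], "labels": {"model": model_key}}
        for port, model_key in sorted(ready_instances)
    ]


def write_targets_file(path, targets):
    """Write the file only if its content changed; return True when rewritten."""
    payload = json.dumps(targets, indent=2, sort_keys=True)
    try:
        with open(path, encoding="utf-8") as fh:
            if fh.read() == payload:
                return False  # unchanged: leave the file untouched
    except FileNotFoundError:
        pass
    # Write to a temp file in the same directory, then rename it into place so
    # a concurrent reader never sees a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(payload)
    # mkstemp creates the file with mode 0600; widen it so a separate
    # Prometheus process can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
    return True
```

Calling `write_targets_file` twice with the same targets rewrites the file only the first time, which is what makes it safe to call on every relevant state transition.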

### Frontend (Web Dashboard)

The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript, Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.)
60 changes: 44 additions & 16 deletions README_zh-CN.md
@@ -45,7 +45,7 @@
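- Real-time status updates via Server-Sent Events (no polling)
- **System topology** (Vue Flow): a live mission-control graph of Clients → Router → model groups / Embedding → GPUs, with animated traffic edges, GPU-placement edges, and a control plane; nodes are clickable for drill-down
- **Router load-balancing view**: an animated fan chart showing each replica's actual traffic share and the instance the router will pick next
- **Trends**: time-series charts of requests, error rate, p95 latency, and tokens (15m–24h), aggregated from the persisted request log
- **Grafana monitoring** (bundled): Prometheus auto-discovers every running vLLM instance (the backend writes file-based service discovery as models start and stop) and scrapes its `/metrics`, alongside GPU (DCGM) and host (node-exporter) metrics. The Grafana dashboards, namely **Overview** (single page: health, latency SLO, capacity, GPU/host), **Scheduling & Capacity**, vLLM Performance/Query, GPU, and Host, are embedded in the **Monitoring** tab, with SLO threshold lines, model-lifecycle annotations, and alert rules. See [Monitoring (Grafana)](#監控grafana)
- Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a state-transition event timeline
- GPU / CPU / memory monitoring, plus a GPU-process inventory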

@@ -98,21 +98,27 @@ make up # docker compose -f deploy/docker-compose.y

**Architecture** (see [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)):

| Service | Image | Port | Role |
|------------|------------------------|-------|------|
| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` |
| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it can reach those localhost vLLM ports |
| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA and reverse-proxying `/api` → backend and `/v1` → router |

Why one image, two services: only the backend truly needs vLLM (it launches the subprocesses), and the router
must see those subprocesses on `localhost`, so a single [`engine.Dockerfile`](deploy/engine.Dockerfile)
(based on the official `vllm/vllm-openai`) runs as two services joined by `network_mode: service:backend`.

The frontend reaches the backend and router through nginx on a single origin (same-origin), so the build
does not hardcode any host/port. SQLite and the dynamic-model overlay live in the `llmops-data` named volume;
downloaded model **weights** are bind-mounted from the host HF cache (`HF_CACHE_DIR`, default
`~/.cache/huggingface`), so they can be browsed locally and are shared with host-side tools.
`packages/config-schema/config.yaml` is bind-mounted as well, so changing models does not require a rebuild.
| Service | Image | Port | Role |
|------------------|------------------------|-------|------|
| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` |
| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it can reach those localhost vLLM ports |
| `prometheus` | `prom/prometheus` | 9090 | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so that `localhost:800x` resolves to those instances |
| `grafana` | `grafana/grafana` | (proxied) | Dashboards and alerting; proxied on a single origin under `/grafana` by the frontend nginx |
| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (utilization, GPU memory, temperature, power) |
| `node-exporter` | `prom/node-exporter` | 9100 | Host metrics (CPU, RAM, disk, network) |
| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA and reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana |

Why one image, multiple services sharing one netns: only the backend truly needs vLLM (it launches the
subprocesses), while the router and Prometheus must see those subprocesses on `localhost`, so a single
[`engine.Dockerfile`](deploy/engine.Dockerfile) (based on the official `vllm/vllm-openai`) runs as
`backend` + `router`, joined (together with Prometheus) by `network_mode: service:backend`.

The frontend reaches the backend, router, and Grafana through nginx on a single origin (same-origin), so the
build does not hardcode any host/port. SQLite and the dynamic-model overlay live in the `llmops-data` named
volume (Prometheus TSDB and Grafana state live in `prometheus-data` / `grafana-data`); downloaded model
**weights** are bind-mounted from the host HF cache (`HF_CACHE_DIR`, default `~/.cache/huggingface`), so they
can be browsed locally and are shared with host-side tools. `packages/config-schema/config.yaml` is
bind-mounted as well, so changing models does not require a rebuild.

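> **Model lifecycle**: the router only routes and load-balances; it does not launch models. vLLM instances (and the
> Embedding/Reranker server) are managed by the backend and started on demand from the **Models** page (or
@@ -126,6 +132,28 @@ curl http://localhost:8887/v1/models # router: lists the configured model groups
curl http://localhost:5000/api/models # backend: lifecycle state of each instance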
```

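#### Monitoring (Grafana)

The stack ships with a complete **Prometheus → Grafana** pipeline that needs no manual setup:

- The **backend** writes a Prometheus file-based service-discovery file (`LLMOPS_PROMETHEUS_SD_PATH`)
  listing every *ready* vLLM instance and refreshes it as models start and stop, so a dynamic fleet is scraped without config edits.
- **Prometheus** (`:9090`) scrapes those instances' `/metrics`, plus `dcgm-exporter` (GPU) and
  `node-exporter` (host).
- **Grafana** is served on a single origin at **`http://localhost:8884/grafana`** (anonymous read-only; log in as
  `admin` / `GRAFANA_ADMIN_PASSWORD` to edit). The datasource and dashboards are auto-provisioned from
  [`deploy/grafana`](deploy/grafana): **Overview**, **vLLM Scheduling & Capacity** (custom),
  **Performance**/**Query** (official), **GPU** (DCGM), and **Host** (Node Exporter).
  The same dashboards are also embedded in the dashboard's **Monitoring** tab.
- **Alerting**: the provisioned vLLM alert rules (target down, TTFT p95, KV cache, request queueing)
  route to a webhook contact point. Set `GRAFANA_ALERT_WEBHOOK` in `deploy/.env`
  (Slack/Discord/generic) and restart Grafana to receive notifications.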

```bash
curl http://localhost:9090/api/v1/targets # prometheus: scrape target health
# open http://localhost:8884/grafana # dashboards and alerts
```

### Frontend (Web Dashboard)

The dashboard lives in **`apps/frontend_llmops`**: Vue 3 + Vite + TypeScript, Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) (topology/router graphs), Pinia + Vue Router. (The older `apps/frontend` is deprecated.)
20 changes: 0 additions & 20 deletions apps/backend/app/api/observability.py
@@ -54,26 +54,6 @@ async def requests_log(request: Request, model_key: Optional[str] = None, limit:
    return await _store(request).recent_requests(model_key=model_key, limit=limit)


@router.get("/metrics/timeseries")
async def metrics_timeseries(
    request: Request,
    window: int = 3600,
    bucket: int = 60,
    model_key: Optional[str] = None,
):
    """Bucketed request metrics over the last `window` seconds (for trend charts).

    `bucket` is the bucket width in seconds; `model_key` optionally scopes to one
    model group. Each point: ts, count, error_count, avg/p95 latency, total_tokens.
    """
    import time

    since = time.time() - max(60, window)
    return await _store(request).timeseries(
        since=since, bucket_seconds=bucket, model_key=model_key
    )


@router.get("/models/{key}/logs")
async def model_logs(
    key: str, tail: int = 200, manager: ModelManager = Depends(get_manager)
5 changes: 5 additions & 0 deletions apps/backend/app/core/settings.py
@@ -50,6 +50,10 @@ class BackendSettings:
    admin_token: str = ""
    # Optional webhook URL; a JSON alert is POSTed when a model enters FAILED.
    alert_webhook: str = ""
    # Optional path for the Prometheus file_sd targets file. The backend rewrites
    # it whenever the set of ready vLLM instances changes, so Prometheus can
    # scrape a dynamic fleet without config edits. Empty -> feature disabled.
    prometheus_sd_path: str = ""
    # Total concurrency budget shared across running evals (sum of their
    # eval_batch_size). Evals run in parallel as long as the sum stays within
    # this; the rest queue. Maps to vLLM's max-num-seqs pressure. Runtime-editable
@@ -65,6 +69,7 @@ def from_env(cls) -> "BackendSettings":
        return cls(
            admin_token=os.environ.get("LLMOPS_ADMIN_TOKEN", "").strip(),
            alert_webhook=os.environ.get("LLMOPS_ALERT_WEBHOOK", "").strip(),
            prometheus_sd_path=os.environ.get("LLMOPS_PROMETHEUS_SD_PATH", "").strip(),
            poll_interval=_env_float("LLMOPS_POLL_INTERVAL", 2.0),
            start_timeout=_env_float("LLMOPS_START_TIMEOUT", 300.0),
            stop_timeout=_env_float("LLMOPS_STOP_TIMEOUT", 10.0),
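
The "empty string disables the feature" convention used for `prometheus_sd_path` can be sketched in isolation as follows. `MonitoringSettings`, its `env` parameter, and the `sd_enabled` property are hypothetical names introduced only for this example; the real `BackendSettings` class carries many more fields.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringSettings:
    """Hypothetical, trimmed-down stand-in for the backend settings class."""

    prometheus_sd_path: str = ""

    @classmethod
    def from_env(cls, env=None):
        # Accept an explicit mapping so the parsing can be exercised without
        # touching the real process environment.
        env = os.environ if env is None else env
        return cls(
            prometheus_sd_path=env.get("LLMOPS_PROMETHEUS_SD_PATH", "").strip()
        )

    @property
    def sd_enabled(self):
        # An unset or whitespace-only variable leaves the feature off.
        return bool(self.prometheus_sd_path)
```

Stripping the value before the truthiness check means a variable set to only whitespace is treated the same as an unset one.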
19 changes: 19 additions & 0 deletions apps/backend/app/llmops/manager.py
@@ -144,6 +144,25 @@ async def trigger_router_reload(self) -> bool:
logger.warning("Router reload POST failed (%s/reload)", self.router_url)
return False

    async def write_prometheus_targets(self) -> bool:
        """Best-effort: refresh the Prometheus file_sd targets file to reflect the
        currently-ready vLLM instances. No-op unless prometheus_sd_path is set.
        Write-if-changed and never raises: monitoring discovery must never break
        the model state machine. The (blocking) file IO runs in the executor."""
        path = self.settings.prometheus_sd_path
        if not path:
            return False
        from app.services.prometheus_targets import build_targets, write_targets_file

        instances = await self.registry.snapshot()
        targets = build_targets(instances)
        # Inside a coroutine, get_running_loop() is the recommended accessor.
        loop = asyncio.get_running_loop()
        try:
            return await loop.run_in_executor(None, write_targets_file, path, targets)
        except Exception:
            logger.warning("Failed to write Prometheus SD file at %s", path)
            return False

    async def list(self) -> list[ModelInstance]:
        return await self.registry.snapshot()
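
The pattern in `write_prometheus_targets` (push blocking IO onto the default thread pool, swallow any failure, report success as a boolean) can be tried standalone with the sketch below. `blocking_write` and `best_effort` are hypothetical names used only here, and `blocking_write` merely simulates a file write.

```python
import asyncio
import logging

logger = logging.getLogger("sketch")


def blocking_write(path, fail=False):
    """Stand-in for a blocking file write (hypothetical; writes nothing)."""
    if fail:
        raise OSError(f"cannot write {path}")
    return True


async def best_effort(func, *args):
    """Run a blocking callable in the default thread pool; never raise."""
    loop = asyncio.get_running_loop()
    try:
        return await loop.run_in_executor(None, func, *args)
    except Exception:
        # Log with the traceback, then report failure instead of propagating.
        logger.warning("best-effort call failed", exc_info=True)
        return False


async def main():
    ok = await best_effort(blocking_write, "targets.json")
    failed = await best_effort(blocking_write, "targets.json", True)
    return ok, failed


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the exception raised in the worker thread is re-raised at the `await`, a single `try`/`except` around the `run_in_executor` call is enough to keep the caller's state machine unaffected.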

8 changes: 8 additions & 0 deletions apps/backend/app/llmops/reconciler.py
@@ -209,6 +209,14 @@ async def reconcile_once(
        for inst, _frm, to, _detail in transitions
    ):
        await manager.trigger_router_reload()
    # Keep the Prometheus scrape-target file in sync whenever a vLLM instance
    # joins or leaves the ready pool (READY in either direction of a transition),
    # so monitoring tracks the live fleet. Idempotent (write-if-changed).
    if manager is not None and any(
        inst.kind == ModelKind.LLM and ModelState.READY in (frm, to)
        for inst, frm, to, _detail in transitions
    ):
        await manager.write_prometheus_targets()
    if manager is not None and settings.auto_restart:
        await _process_restarts(registry, settings, store, manager)
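
The trigger condition added above fires when an LLM instance either enters or leaves `READY`. A small self-contained sketch of that predicate follows; the enum definitions are simplified stand-ins (only `LLM`, `READY`, and `FAILED` appear in the source, the other members are placeholders), and transitions are reduced to `(kind, from_state, to_state)` tuples rather than the four-element tuples the reconciler uses.

```python
from enum import Enum


class ModelKind(Enum):
    LLM = "llm"
    EMBEDDING = "embedding"  # placeholder member for the example


class ModelState(Enum):
    STARTING = "starting"  # placeholder member for the example
    READY = "ready"
    FAILED = "failed"


def ready_pool_changed(transitions):
    """True if any LLM instance entered or left READY in this batch."""
    return any(
        kind == ModelKind.LLM and ModelState.READY in (frm, to)
        for kind, frm, to in transitions
    )
```

Checking `READY in (frm, to)` covers both directions with one test, so an instance going `READY` to `FAILED` refreshes the targets just as one becoming ready does, while transitions that never touch `READY` are ignored.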

4 changes: 4 additions & 0 deletions apps/backend/app/main.py
@@ -118,6 +118,10 @@ async def lifespan(app: FastAPI):
    # honest from the first response.
    await adopt_running(registry, http_client, settings, store)

    # Seed the Prometheus file_sd targets file (covering adopted-ready instances)
    # so monitoring has a valid file from t=0, before the first state transition.
    await manager.write_prometheus_targets()

    tasks = [
        asyncio.create_task(reconcile_loop(registry, http_client, settings, store, manager)),
        asyncio.create_task(_gpu_poll_loop(app, settings.gpu_poll_interval)),