diff --git a/README.md b/README.md index 4981571..600e743 100644 --- a/README.md +++ b/README.md @@ -48,7 +48,7 @@ This project combines a routing server (LLM-Router-Server) with an easy-to-use m - Real-time status via Server-Sent Events (no polling) - **System topology** (Vue Flow) — a live mission-control graph of Clients → Router → model groups / Embedding → GPUs, with animated traffic edges, GPU-placement edges, and a control plane; nodes are clickable drill-ins - **Router load-balancing view** — an animated fan showing each replica's real traffic share and the instance the router will pick next -- **Trends** — time-series charts (requests, error rate, p95 latency, tokens) over 15m–24h, aggregated from the persisted request log +- **Grafana monitoring** (bundled) — Prometheus auto-discovers every running vLLM instance (file-based service discovery written by the backend as models start/stop) and scrapes its `/metrics`, alongside GPU (DCGM) and host (node-exporter) metrics. Grafana dashboards — **Overview** (single pane: health, latency SLO, capacity, GPU/host), **Scheduling & Capacity**, vLLM Performance/Query, GPU, Host — are embedded in the **Monitoring** tab, with SLO threshold lines, model-lifecycle annotations, and alert rules. 
See [Monitoring (Grafana)](#monitoring-grafana) - Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a state-transition event timeline - GPU / CPU / memory monitoring plus a GPU-process inventory @@ -103,24 +103,30 @@ make up # docker compose -f deploy/docker-compose.y **Topology** (see [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)): -| Service | Image | Port | Role | -|------------|------------------------|---------|------| -| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` | -| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports | -| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA + reverse-proxying `/api` → backend and `/v1` → router | - -Why one image, two services: only the backend truly needs vLLM (it launches the -subprocesses), and the router must see them on `localhost` — so a single -[`engine.Dockerfile`](deploy/engine.Dockerfile) (based on the official -`vllm/vllm-openai`) runs as two services joined by `network_mode: service:backend`. - -The frontend reaches the backend and router through nginx on a single origin, so -no host/port is baked into the build. SQLite + the dynamic-model overlay persist -in the `llmops-data` named volume; downloaded model **weights** are bind-mounted -from the host HF cache (`HF_CACHE_DIR`, default `~/.cache/huggingface`) so they're -browsable locally and shared with host-side tools. The canonical -`packages/config-schema/config.yaml` is bind-mounted too, so you can edit models -without rebuilding. 
+| Service | Image | Port | Role | +|------------------|------------------------|---------|------| +| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` | +| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports | +| `prometheus` | `prom/prometheus` | 9090 | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances | +| `grafana` | `grafana/grafana` | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx | +| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) | +| `node-exporter` | `prom/node-exporter` | 9100 | Host metrics (CPU, RAM, disk, network) | +| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana | + +Why one image, multiple services on one netns: only the backend truly needs vLLM +(it launches the subprocesses), and the router + Prometheus must see them on +`localhost` — so a single [`engine.Dockerfile`](deploy/engine.Dockerfile) (based +on the official `vllm/vllm-openai`) runs as `backend` + `router`, joined (with +Prometheus) by `network_mode: service:backend`. + +The frontend reaches the backend, router, and Grafana through nginx on a single +origin, so no host/port is baked into the build. SQLite + the dynamic-model +overlay persist in the `llmops-data` named volume (Prometheus TSDB and Grafana +state in `prometheus-data` / `grafana-data`); downloaded model **weights** are +bind-mounted from the host HF cache (`HF_CACHE_DIR`, default +`~/.cache/huggingface`) so they're browsable locally and shared with host-side +tools. The canonical `packages/config-schema/config.yaml` is bind-mounted too, so +you can edit models without rebuilding. 
> **Model lifecycle**: the router only routes and load-balances — it never > launches models. vLLM instances (and the Embedding/Reranker server) are owned @@ -135,6 +141,31 @@ curl http://localhost:8887/v1/models # router: configured model groups curl http://localhost:5000/api/models # backend: lifecycle state of each instance ``` +#### Monitoring (Grafana) + +The stack bundles a full **Prometheus → Grafana** pipeline, no manual setup: + +- The **backend** writes a Prometheus file-based service-discovery file + (`LLMOPS_PROMETHEUS_SD_PATH`) listing every *ready* vLLM instance, refreshed as + models start/stop — so a dynamic fleet is scraped with zero config edits. +- **Prometheus** (`:9090`) scrapes those instances' `/metrics` plus + `dcgm-exporter` (GPU) and `node-exporter` (host). +- **Grafana** is served single-origin at **`http://localhost:8884/grafana`** + (anonymous read-only; log in as `admin` / `GRAFANA_ADMIN_PASSWORD` to edit). + Datasource and dashboards are auto-provisioned from + [`deploy/grafana`](deploy/grafana): **Overview**, **vLLM Scheduling & + Capacity** (custom), **Performance**/**Query** (official), **GPU** (DCGM), and + **Host** (Node Exporter). The same dashboards are embedded in the dashboard's + **Monitoring** tab. +- **Alerting**: provisioned vLLM alert rules (target down, TTFT p95, KV cache, + request queueing) route to a webhook contact point — set `GRAFANA_ALERT_WEBHOOK` + in `deploy/.env` (Slack/Discord/generic) and restart Grafana to receive them. + +```bash +curl http://localhost:9090/api/v1/targets # prometheus: scrape target health +# open http://localhost:8884/grafana # dashboards + alerts +``` + ### Frontend (Web Dashboard) The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript, Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.) 
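Reviewer note on the Monitoring (Grafana) section above: the README says the backend lists every *ready* vLLM instance in a Prometheus file-based service-discovery file. The sketch below shows roughly what that file would contain. It is a standalone approximation, assuming plain dicts in place of the real `ModelInstance` objects; the instance keys, ports, and model tag are borrowed from this change's unit-test fixtures, while the `"embedding"` key, its port `8003`, and the function name `sketch_targets` are hypothetical.

```python
import json

# Hypothetical registry snapshot. The two LLM entries mirror the unit-test
# fixtures in this change; the embedding entry (key and port) is made up.
instances = [
    {"key": "Qwen3-0.6B::qwen3-2", "host": "localhost", "port": 8004,
     "kind": "llm", "state": "ready", "model_tag": "Qwen/Qwen3-0.6B"},
    {"key": "Qwen3-0.6B::qwen3", "host": "localhost", "port": 8002,
     "kind": "llm", "state": "ready", "model_tag": "Qwen/Qwen3-0.6B"},
    {"key": "embedding", "host": "localhost", "port": 8003,
     "kind": "embedding", "state": "ready", "model_tag": None},
]


def sketch_targets(snapshot):
    """Return file_sd entries for ready LLM instances, sorted by address."""
    entries = []
    for inst in snapshot:
        # Only ready vLLM instances are scraped; everything else is skipped.
        if inst["kind"] != "llm" or inst["state"] != "ready":
            continue
        # Keys look like "<group>::<instance_id>".
        group, _, instance_id = inst["key"].partition("::")
        entries.append({
            "targets": [f"{inst['host']}:{inst['port']}"],
            "labels": {
                "group": group,
                "instance_id": instance_id,
                "model_tag": inst["model_tag"] or "",
            },
        })
    # Stable ordering keeps the serialized file identical between passes.
    entries.sort(key=lambda e: e["targets"][0])
    return entries


sd_entries = sketch_targets(instances)
print(json.dumps(sd_entries, indent=2, sort_keys=True))
```

Prometheus appends its configured metrics path to each `targets` address, so the labels are what dashboards would most likely join on rather than the volatile `host:port`.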
diff --git a/README_zh-CN.md b/README_zh-CN.md index 4f1943b..c6f34db 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -45,7 +45,7 @@ - 透過 Server-Sent Events 即時更新狀態(免輪詢) - **系統拓撲圖**(Vue Flow)— Clients → Router → 模型群組/Embedding → GPU 的即時 mission-control 圖,含流動的流量邊、GPU 擺放邊與控制平面;節點可點擊下鑽 - **Router 負載平衡視圖** — 動畫扇形圖呈現每個副本的實際流量佔比,以及 router 下一個會選的實例 -- **趨勢圖** — 請求數/錯誤率/p95 延遲/tokens 的時序圖(15m–24h),由持久化的 request log 聚合 +- **Grafana 監控**(內建)— Prometheus 自動發現每個運行中的 vLLM 實例(後端隨模型啟停寫出 file-based service discovery)並抓取其 `/metrics`,外加 GPU(DCGM)與主機(node-exporter)指標。Grafana dashboards —— **總覽**(單一頁面:健康、延遲 SLO、容量、GPU/主機)、**排程與容量**、vLLM Performance/Query、GPU、Host —— 嵌入 **監控** 分頁,含 SLO 門檻線、模型生命週期標註與告警規則。見 [監控(Grafana)](#監控grafana) - 每模型用量(次數、錯誤率、p50/p95 延遲、tokens)、請求日誌、狀態轉移事件時間軸 - GPU/CPU/記憶體監控,以及 GPU 進程清單 @@ -98,21 +98,27 @@ make up # docker compose -f deploy/docker-compose.y **架構**(見 [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)): -| 服務 | 映像 | 端口 | 角色 | -|------------|------------------------|-------|------| -| `backend` | `llmops-engine`(GPU) | 5000 | Dashboard API;在 `:800x` 拉起 vLLM 子進程 | -| `router` | `llmops-engine` | 8887 | OpenAI 相容路由;**共用後端的 network namespace**,才打得到那些 localhost vLLM 端口 | -| `frontend` | `llmops-frontend` | 8884 | nginx 服務 SPA,並反向代理 `/api` → 後端、`/v1` → router | - -為何一份映像、兩個服務:只有後端真的需要 vLLM(它負責拉起子進程),而 router 必須在 -`localhost` 看到那些子進程——所以單一 [`engine.Dockerfile`](deploy/engine.Dockerfile) -(基於官方 `vllm/vllm-openai`)以 `network_mode: service:backend` 跑成兩個服務。 - -前端透過 nginx 以單一來源(same-origin)連到後端與 router,因此 build 不會硬編任何 -host/port。SQLite 與動態模型 overlay 放在 `llmops-data` named volume;下載的模型**權重** -則以 bind-mount 掛在主機 HF 快取(`HF_CACHE_DIR`,預設 `~/.cache/huggingface`),所以 -本機就能直接瀏覽、也和主機端工具共用。`packages/config-schema/config.yaml` 同樣 bind-mount -掛入,因此改模型不必重新 build。 +| 服務 | 映像 | 端口 | 角色 | +|------------------|------------------------|-------|------| +| `backend` | `llmops-engine`(GPU) | 5000 | Dashboard API;在 `:800x` 拉起 vLLM 子進程 | +| `router` | `llmops-engine` | 
8887 | OpenAI 相容路由;**共用後端的 network namespace**,才打得到那些 localhost vLLM 端口 | +| `prometheus` | `prom/prometheus` | 9090 | 透過 file-based SD 抓取 vLLM 艦隊的 `/metrics`;**同樣共用後端 netns**,`localhost:800x` 才解析得到那些實例 | +| `grafana` | `grafana/grafana` | (代理)| Dashboards 與告警;經前端 nginx 以單一來源代理在 `/grafana` | +| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter`(GPU) | 9400 | NVIDIA GPU 遙測(利用率、顯存、溫度、功耗) | +| `node-exporter` | `prom/node-exporter` | 9100 | 主機指標(CPU、RAM、磁碟、網路) | +| `frontend` | `llmops-frontend` | 8884 | nginx 服務 SPA,並反向代理 `/api` → 後端、`/v1` → router、`/grafana` → grafana | + +為何一份映像、多個服務共用一個 netns:只有後端真的需要 vLLM(它負責拉起子進程), +而 router 與 Prometheus 必須在 `localhost` 看到那些子進程——所以單一 +[`engine.Dockerfile`](deploy/engine.Dockerfile)(基於官方 `vllm/vllm-openai`)跑成 +`backend` + `router`,並(連同 Prometheus)以 `network_mode: service:backend` 串接。 + +前端透過 nginx 以單一來源(same-origin)連到後端、router 與 Grafana,因此 build 不會 +硬編任何 host/port。SQLite 與動態模型 overlay 放在 `llmops-data` named volume(Prometheus +TSDB 與 Grafana 狀態放在 `prometheus-data` / `grafana-data`);下載的模型**權重**則以 +bind-mount 掛在主機 HF 快取(`HF_CACHE_DIR`,預設 `~/.cache/huggingface`),所以本機就能 +直接瀏覽、也和主機端工具共用。`packages/config-schema/config.yaml` 同樣 bind-mount 掛入, +因此改模型不必重新 build。 > **模型生命週期**:router 只負責路由與負載平衡,不會啟動模型。vLLM 實例(與 > Embedding/Reranker 服務)由後端管理,從 **Models** 頁按需啟動(或 @@ -126,6 +132,28 @@ curl http://localhost:8887/v1/models # router:列出設定的模型群組 curl http://localhost:5000/api/models # 後端:每個實例的生命週期狀態 ``` +#### 監控(Grafana) + +整套內建完整的 **Prometheus → Grafana** 流程,免手動設定: + +- **後端**寫出 Prometheus file-based service-discovery 檔(`LLMOPS_PROMETHEUS_SD_PATH`), + 列出每個 *ready* 的 vLLM 實例,並隨模型啟停刷新——所以動態艦隊免改設定即被抓取。 +- **Prometheus**(`:9090`)抓取這些實例的 `/metrics`,外加 `dcgm-exporter`(GPU)與 + `node-exporter`(主機)。 +- **Grafana** 以單一來源服務於 **`http://localhost:8884/grafana`**(匿名唯讀;以 + `admin` / `GRAFANA_ADMIN_PASSWORD` 登入可編輯)。datasource 與 dashboards 由 + [`deploy/grafana`](deploy/grafana) 自動 provision:**總覽**、**vLLM 排程與容量**(自訂)、 + 
**Performance**/**Query**(官方)、**GPU**(DCGM)、**Host**(Node Exporter)。 + 同一批 dashboards 也嵌入控制台的 **監控** 分頁。 +- **告警**:已 provision 的 vLLM 告警規則(target down、TTFT p95、KV cache、請求排隊) + 路由到一個 webhook contact point —— 在 `deploy/.env` 設 `GRAFANA_ALERT_WEBHOOK` + (Slack/Discord/通用)並重啟 Grafana 即可收到通知。 + +```bash +curl http://localhost:9090/api/v1/targets # prometheus:scrape target 健康狀態 +# 開啟 http://localhost:8884/grafana # dashboards 與告警 +``` + ### 前端(Web 控制台) 控制台位於 **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript、Tailwind CSS v4、shadcn-vue 元件、[Vue Flow](https://vueflow.dev)(拓撲/路由圖)、Pinia + Vue Router。(舊的 `apps/frontend` 已棄用。) diff --git a/apps/backend/app/api/observability.py b/apps/backend/app/api/observability.py index 20274b9..3d79307 100644 --- a/apps/backend/app/api/observability.py +++ b/apps/backend/app/api/observability.py @@ -54,26 +54,6 @@ async def requests_log(request: Request, model_key: Optional[str] = None, limit: return await _store(request).recent_requests(model_key=model_key, limit=limit) -@router.get("/metrics/timeseries") -async def metrics_timeseries( - request: Request, - window: int = 3600, - bucket: int = 60, - model_key: Optional[str] = None, -): - """Bucketed request metrics over the last `window` seconds (for trend charts). - - `bucket` is the bucket width in seconds; `model_key` optionally scopes to one - model group. Each point: ts, count, error_count, avg/p95 latency, total_tokens. 
- """ - import time - - since = time.time() - max(60, window) - return await _store(request).timeseries( - since=since, bucket_seconds=bucket, model_key=model_key - ) - - @router.get("/models/{key}/logs") async def model_logs( key: str, tail: int = 200, manager: ModelManager = Depends(get_manager) diff --git a/apps/backend/app/core/settings.py b/apps/backend/app/core/settings.py index 6dee065..6895f28 100644 --- a/apps/backend/app/core/settings.py +++ b/apps/backend/app/core/settings.py @@ -50,6 +50,10 @@ class BackendSettings: admin_token: str = "" # Optional webhook URL; a JSON alert is POSTed when a model enters FAILED. alert_webhook: str = "" + # Optional path for the Prometheus file_sd targets file. The backend rewrites + # it whenever the set of ready vLLM instances changes, so Prometheus can + # scrape a dynamic fleet without config edits. Empty -> feature disabled. + prometheus_sd_path: str = "" # Total concurrency budget shared across running evals (sum of their # eval_batch_size). Evals run in parallel as long as the sum stays within # this; the rest queue. Maps to vLLM's max-num-seqs pressure. 
Runtime-editable @@ -65,6 +69,7 @@ def from_env(cls) -> "BackendSettings": return cls( admin_token=os.environ.get("LLMOPS_ADMIN_TOKEN", "").strip(), alert_webhook=os.environ.get("LLMOPS_ALERT_WEBHOOK", "").strip(), + prometheus_sd_path=os.environ.get("LLMOPS_PROMETHEUS_SD_PATH", "").strip(), poll_interval=_env_float("LLMOPS_POLL_INTERVAL", 2.0), start_timeout=_env_float("LLMOPS_START_TIMEOUT", 300.0), stop_timeout=_env_float("LLMOPS_STOP_TIMEOUT", 10.0), diff --git a/apps/backend/app/llmops/manager.py b/apps/backend/app/llmops/manager.py index 9d91534..ef1fbbe 100644 --- a/apps/backend/app/llmops/manager.py +++ b/apps/backend/app/llmops/manager.py @@ -144,6 +144,25 @@ async def trigger_router_reload(self) -> bool: logger.warning("Router reload POST failed (%s/reload)", self.router_url) return False + async def write_prometheus_targets(self) -> bool: + """Best-effort: refresh the Prometheus file_sd targets file to reflect the + currently-ready vLLM instances. No-op unless prometheus_sd_path is set. + Write-if-changed and never raises — monitoring discovery must never break + the model state machine. 
The (blocking) file IO runs in the executor.""" + path = self.settings.prometheus_sd_path + if not path: + return False + from app.services.prometheus_targets import build_targets, write_targets_file + + loop = asyncio.get_running_loop() + try: + instances = await self.registry.snapshot() + targets = build_targets(instances) + return await loop.run_in_executor(None, write_targets_file, path, targets) + except Exception: + logger.warning("Failed to write Prometheus SD file at %s", path, exc_info=True) + return False + async def list(self) -> list[ModelInstance]: return await self.registry.snapshot() diff --git a/apps/backend/app/llmops/reconciler.py b/apps/backend/app/llmops/reconciler.py index 2910a0e..bb17bce 100644 --- a/apps/backend/app/llmops/reconciler.py +++ b/apps/backend/app/llmops/reconciler.py @@ -209,6 +209,14 @@ async def reconcile_once( for inst, _frm, to, _detail in transitions ): await manager.trigger_router_reload() + # Keep the Prometheus scrape-target file in sync whenever a vLLM instance + # joins or leaves the ready pool (READY in either direction of a transition), + # so monitoring tracks the live fleet. Idempotent (write-if-changed). + if manager is not None and any( + inst.kind == ModelKind.LLM and ModelState.READY in (frm, to) + for inst, frm, to, _detail in transitions + ): + await manager.write_prometheus_targets() if manager is not None and settings.auto_restart: await _process_restarts(registry, settings, store, manager) diff --git a/apps/backend/app/main.py b/apps/backend/app/main.py index 57c7db1..3d5351e 100644 --- a/apps/backend/app/main.py +++ b/apps/backend/app/main.py @@ -118,6 +118,10 @@ async def lifespan(app: FastAPI): # honest from the first response. await adopt_running(registry, http_client, settings, store) + # Seed the Prometheus file_sd targets file (covering adopted-ready instances) + # so monitoring has a valid file from t=0, before the first state transition.
+ await manager.write_prometheus_targets() + tasks = [ asyncio.create_task(reconcile_loop(registry, http_client, settings, store, manager)), asyncio.create_task(_gpu_poll_loop(app, settings.gpu_poll_interval)), diff --git a/apps/backend/app/services/prometheus_targets.py b/apps/backend/app/services/prometheus_targets.py new file mode 100644 index 0000000..48d09b3 --- /dev/null +++ b/apps/backend/app/services/prometheus_targets.py @@ -0,0 +1,84 @@ +"""Prometheus file-based service discovery for the backend-owned vLLM fleet. + +vLLM instances are spawned on demand on dynamic localhost ports (and come and go +as models are added/removed/auto-restarted), so a static Prometheus scrape config +would constantly drift. Instead the backend — which already owns the registry, the +single source of truth for which instance is on which port — writes a Prometheus +`file_sd` targets file listing every *ready* vLLM instance. Prometheus watches the +file and picks up changes within its refresh interval, no restart needed. + +Only LLM (vLLM) instances are emitted: vLLM exposes a Prometheus-format `/metrics` +on its OpenAI port, whereas the embedding/reranker server does not. + +The file lives in the shared data volume and is read by the Prometheus container +(which joins the backend's network namespace, so the `localhost:` targets +resolve to the same vLLM processes the backend spawned). +""" +from __future__ import annotations + +import json +import os +from typing import Iterable + +from app.llmops.instance import ModelInstance +from app.llmops.state import ModelKind, ModelState + + +def build_targets(instances: Iterable[ModelInstance]) -> list[dict]: + """Build the Prometheus file_sd target list from registry instances. + + One entry per ready vLLM instance. `targets` is the scrape address + (`host:port`); Prometheus appends the configured metrics_path (`/metrics`). 
+ Labels carry the group/instance identity and model tag so dashboards can + join on something meaningful instead of the volatile `host:port`. + + Sorted by address so the serialized output is stable — the writer can then + skip an identical rewrite and avoid churning the file (which would otherwise + nudge Prometheus to re-read it every reconcile pass). + """ + targets: list[dict] = [] + for inst in instances: + if inst.kind != ModelKind.LLM or inst.state != ModelState.READY: + continue + group, _, instance_id = inst.key.partition("::") + targets.append( + { + "targets": [f"{inst.host}:{inst.port}"], + "labels": { + "group": group, + "instance_id": instance_id, + "model_tag": inst.model_tag or "", + }, + } + ) + targets.sort(key=lambda t: t["targets"][0]) + return targets + + +def render(targets: list[dict]) -> str: + """Serialize the target list to the JSON Prometheus file_sd expects.""" + return json.dumps(targets, indent=2, sort_keys=True) + + +def write_targets_file(path: str, targets: list[dict]) -> bool: + """Atomically write the SD file if its content changed. Returns True if it + was (re)written, False if the on-disk content already matched. + + Write-if-changed keeps Prometheus from re-reading an identical file on every + reconcile tick. The write is atomic (temp + os.replace) so Prometheus never + observes a half-written, unparseable file. 
+ """ + payload = render(targets) + try: + with open(path, encoding="utf-8") as f: + if f.read() == payload: + return False + except (OSError, ValueError): + pass # missing/unreadable -> (re)write below + + os.makedirs(os.path.dirname(path) or ".", exist_ok=True) + tmp = f"{path}.tmp" + with open(tmp, "w", encoding="utf-8") as f: + f.write(payload) + os.replace(tmp, path) # atomic on POSIX + return True diff --git a/apps/backend/tests/unit/test_prometheus_targets.py b/apps/backend/tests/unit/test_prometheus_targets.py new file mode 100644 index 0000000..1e8732b --- /dev/null +++ b/apps/backend/tests/unit/test_prometheus_targets.py @@ -0,0 +1,99 @@ +import json + +import pytest + +from app.core.settings import BackendSettings +from app.llmops.launchers import EMBEDDING_KEY, EmbeddingLauncher, VllmLauncher +from app.llmops.manager import ModelManager, build_registry +from app.llmops.state import ModelState +from app.services.prometheus_targets import (build_targets, render, + write_targets_file) +from tests.conftest import FAKE_CONFIG, FakeHTTPClient + +pytestmark = pytest.mark.unit + +HEALTHY = "Qwen3-0.6B::qwen3" # port 8002 +OTHER = "Qwen3-0.6B::qwen3-2" # port 8004 + + +def _registry(): + return build_registry(FAKE_CONFIG, "config.yaml", [VllmLauncher(), EmbeddingLauncher()]) + + +def test_build_targets_only_includes_ready_llm(): + reg = _registry() + reg.get(HEALTHY).state = ModelState.READY + reg.get(OTHER).state = ModelState.STARTING # not ready -> excluded + + targets = build_targets(reg.values()) + + assert len(targets) == 1 + entry = targets[0] + assert entry["targets"] == ["localhost:8002"] + assert entry["labels"]["group"] == "Qwen3-0.6B" + assert entry["labels"]["instance_id"] == "qwen3" + assert entry["labels"]["model_tag"] == "Qwen/Qwen3-0.6B" + + +def test_build_targets_excludes_embedding_server(): + # The embedding/reranker server is not vLLM and exposes no Prometheus metrics. 
+ reg = _registry() + emb = reg.get(EMBEDDING_KEY) + assert emb is not None + emb.state = ModelState.READY + + assert build_targets(reg.values()) == [] + + +def test_build_targets_is_sorted_and_stable(): + reg = _registry() + reg.get(HEALTHY).state = ModelState.READY # 8002 + reg.get(OTHER).state = ModelState.READY # 8004 + + addrs = [t["targets"][0] for t in build_targets(reg.values())] + assert addrs == ["localhost:8002", "localhost:8004"] # sorted by address + + +def test_write_targets_file_writes_then_skips_unchanged(tmp_path): + path = str(tmp_path / "sub" / "targets.json") # parent created on demand + targets = [{"targets": ["localhost:8002"], "labels": {"group": "g"}}] + + assert write_targets_file(path, targets) is True # first write + assert json.loads(open(path).read()) == targets + assert write_targets_file(path, targets) is False # identical -> skip + + targets2 = targets + [{"targets": ["localhost:8004"], "labels": {"group": "g"}}] + assert write_targets_file(path, targets2) is True # changed -> rewrite + assert json.loads(open(path).read()) == targets2 + + +def test_write_targets_file_leaves_no_tmp_artifact(tmp_path): + path = tmp_path / "targets.json" + write_targets_file(str(path), []) + assert not (tmp_path / "targets.json.tmp").exists() + assert path.read_text() == render([]) + + +async def test_manager_noop_without_path_configured(): + # Default settings leave prometheus_sd_path empty -> feature disabled. 
+ reg = _registry() + mgr = ModelManager( + reg, [VllmLauncher(), EmbeddingLauncher()], FakeHTTPClient(), + FAKE_CONFIG, "config.yaml", BackendSettings(), + ) + assert await mgr.write_prometheus_targets() is False + + +async def test_manager_writes_ready_targets_when_path_set(tmp_path): + path = str(tmp_path / "targets.json") + reg = _registry() + reg.get(HEALTHY).state = ModelState.READY + settings = BackendSettings(prometheus_sd_path=path) + mgr = ModelManager( + reg, [VllmLauncher(), EmbeddingLauncher()], FakeHTTPClient(), + FAKE_CONFIG, "config.yaml", settings, + ) + + assert await mgr.write_prometheus_targets() is True + written = json.loads(open(path).read()) + assert [t["targets"][0] for t in written] == ["localhost:8002"] diff --git a/apps/backend/tests/unit/test_reconciler.py b/apps/backend/tests/unit/test_reconciler.py index 569e082..12e2a24 100644 --- a/apps/backend/tests/unit/test_reconciler.py +++ b/apps/backend/tests/unit/test_reconciler.py @@ -45,15 +45,20 @@ async def test_starting_becomes_ready_when_health_ok(): class _ReloadSpyManager: - """Minimal manager stub capturing router-reload nudges.""" + """Minimal manager stub capturing router-reload + Prometheus SD nudges.""" def __init__(self): self.reloads = 0 + self.sd_writes = 0 async def trigger_router_reload(self): self.reloads += 1 return True + async def write_prometheus_targets(self): + self.sd_writes += 1 + return True + async def test_ready_transition_nudges_router_reload(): reg = _registry() @@ -67,14 +72,32 @@ async def test_ready_transition_nudges_router_reload(): await reconcile_once(reg, FakeHTTPClient(healthy_ports={8002}), _settings(), manager=mgr) assert inst.state == ModelState.READY assert mgr.reloads == 1 + assert mgr.sd_writes == 1 # joining the ready pool refreshes scrape targets async def test_no_ready_transition_does_not_reload(): - # Steady-state pass (nothing turns READY) must not spam the router. 
+ # Steady-state pass (nothing turns READY) must not spam the router or rewrite SD. reg = _registry() mgr = _ReloadSpyManager() await reconcile_once(reg, FakeHTTPClient(healthy_ports=set()), _settings(), manager=mgr) assert mgr.reloads == 0 + assert mgr.sd_writes == 0 + + +async def test_ready_to_failed_refreshes_sd_but_not_router(): + # A ready vLLM dying leaves the pool: SD must be rewritten (drop the target), + # but the router reload only fires on instances *joining* the pool. + reg = _registry() + inst = reg.get(HEALTHY) + inst.state = ModelState.READY + inst.managed = True + inst.proc = FakeProc(returncode=139) # crashed + + mgr = _ReloadSpyManager() + await reconcile_once(reg, FakeHTTPClient(healthy_ports={8002}), _settings(), manager=mgr) + assert inst.state == ModelState.FAILED + assert mgr.sd_writes == 1 + assert mgr.reloads == 0 async def test_starting_times_out_to_failed(): diff --git a/apps/frontend_llmops/src/components/TimeChart.vue b/apps/frontend_llmops/src/components/TimeChart.vue deleted file mode 100644 index 78bec48..0000000 --- a/apps/frontend_llmops/src/components/TimeChart.vue +++ /dev/null @@ -1,81 +0,0 @@ - - - diff --git a/apps/frontend_llmops/src/components/layout/AppSidebar.vue b/apps/frontend_llmops/src/components/layout/AppSidebar.vue index e906672..2822f57 100644 --- a/apps/frontend_llmops/src/components/layout/AppSidebar.vue +++ b/apps/frontend_llmops/src/components/layout/AppSidebar.vue @@ -12,11 +12,11 @@ import { KeyRound, Layers, LayoutDashboard, + LineChart, Package, Receipt, Server, TerminalSquare, - TrendingUp, } from '@lucide/vue' import { useModelsStore } from '@/stores/models' import StatusDot from '@/components/StatusDot.vue' @@ -53,8 +53,8 @@ const nav = [ { to: '/', label: '總覽', icon: LayoutDashboard }, { to: '/models', label: '模型', icon: Server }, { to: '/traffic', label: '流量', icon: ArrowLeftRight }, - { to: '/trends', label: '趨勢', icon: TrendingUp }, { to: '/requests', label: '請求', icon: Receipt }, + { to: 
'/monitoring', label: '監控', icon: LineChart }, { to: '/playground', label: '測試台', icon: TerminalSquare }, { to: '/benchmark', label: '壓測', icon: Gauge }, { to: '/eval', label: '評測', icon: ClipboardCheck }, diff --git a/apps/frontend_llmops/src/lib/api.ts b/apps/frontend_llmops/src/lib/api.ts index 82b5ffe..a1d61e1 100644 --- a/apps/frontend_llmops/src/lib/api.ts +++ b/apps/frontend_llmops/src/lib/api.ts @@ -31,7 +31,6 @@ import type { RouterMetrics, SettingValue, StateEvent, - TimeseriesPoint, UsageRow, } from '@/types/api' @@ -167,13 +166,6 @@ export const api = { request(API_BASE, `/api/models/${enc(key)}/logs?tail=${tail}`), getModelMetrics: (key: string) => request(API_BASE, `/api/models/${enc(key)}/metrics`), - getTimeseries: (opts: { window?: number; bucket?: number; modelKey?: string } = {}) => { - const params = new URLSearchParams() - params.set('window', String(opts.window ?? 3600)) - params.set('bucket', String(opts.bucket ?? 60)) - if (opts.modelKey) params.set('model_key', opts.modelKey) - return request(API_BASE, `/api/metrics/timeseries?${params.toString()}`) - }, healthz: () => request(API_BASE, '/healthz'), // ---- LLM Router ----------------------------------------------------------- diff --git a/apps/frontend_llmops/src/router/index.ts b/apps/frontend_llmops/src/router/index.ts index bd95bb8..dc0e50d 100644 --- a/apps/frontend_llmops/src/router/index.ts +++ b/apps/frontend_llmops/src/router/index.ts @@ -21,18 +21,18 @@ const router = createRouter({ meta: { title: 'Traffic' }, component: () => import('@/views/TrafficView.vue'), }, - { - path: '/trends', - name: 'trends', - meta: { title: 'Trends' }, - component: () => import('@/views/TrendsView.vue'), - }, { path: '/requests', name: 'requests', meta: { title: 'Requests' }, component: () => import('@/views/RequestsView.vue'), }, + { + path: '/monitoring', + name: 'monitoring', + meta: { title: 'Monitoring' }, + component: () => import('@/views/MonitoringView.vue'), + }, { path: '/benchmark', name: 
'benchmark', diff --git a/apps/frontend_llmops/src/types/api.ts b/apps/frontend_llmops/src/types/api.ts index 0aaf1be..a0ebf4f 100644 --- a/apps/frontend_llmops/src/types/api.ts +++ b/apps/frontend_llmops/src/types/api.ts @@ -42,15 +42,6 @@ export interface ModelView { restart_count?: number } -export interface TimeseriesPoint { - ts: number - count: number - error_count: number - avg_latency_ms: number | null - p95_latency_ms: number | null - total_tokens: number -} - export interface MemoryInfo { total: number available: number diff --git a/apps/frontend_llmops/src/views/MonitoringView.vue b/apps/frontend_llmops/src/views/MonitoringView.vue new file mode 100644 index 0000000..2a17370 --- /dev/null +++ b/apps/frontend_llmops/src/views/MonitoringView.vue @@ -0,0 +1,100 @@ + + +