diff --git a/README.md b/README.md index 600e743..91910e5 100644 --- a/README.md +++ b/README.md @@ -1,297 +1,110 @@
-# LLM-Router-Server-Dashboard -**One-Stop LLM Model Management and Monitoring Platform** +# vLLMux -[English](README.md) | [中文](README_zh-CN.md) +**One-stop platform to deploy, route, monitor & evaluate your vLLM cluster** -![Main Console](assets/image0.png) - -![Model Management](assets/image1.png) +[English](README.md) · [中文](README_zh-CN.md) +![vLLMux](https://img.shields.io/badge/vLLM-multiplexed-5b8def) +![license](https://img.shields.io/badge/license-MIT-green) +![stack](https://img.shields.io/badge/FastAPI%20·%20Vue%203%20·%20Grafana-informational) -![Model Management](assets/image2.png) - -![Model Management](assets/image3.png) -![Model Management](assets/image4.png) +![Main Console](assets/image0.png) +![Main Console](assets/image1.png) +![Main Console](assets/image2.png)
--- -## Project Overview - -**LLM-Router-Server-Dashboard** is a solution for large language model (LLM) deployment and management, providing an intuitive web interface to manage, monitor, and operate multiple LLM model instances. - -This project combines a routing server (LLM-Router-Server) with an easy-to-use management interface, enabling you to: -- **Visual Management**: Easily manage multiple models through a web interface -- **Dynamic Control**: Start and stop models in real-time without service restarts -- **Real-time Monitoring**: Monitor model status, GPU utilization, and system information -- **Configuration Management**: Flexibly manage model parameters through YAML configuration files - ---- - -## Key Features - -### Model Management -- Multi-model, multi-instance management on vLLM (LLM, Embedding, Reranker) -- Per-instance lifecycle (start/stop) with a live state machine (`stopped → starting → ready → failed/stopping`), driven by a reconciler that derives the true state from process liveness + `/health` probes -- **Add models from the UI by pasting a `vllm serve …` command** — it is parsed into an editable form and layered on as a dynamic *overlay*, so the hand-maintained `config.yaml` stays untouched; the router hot-reloads (`POST /reload`) so new models are routable end-to-end -- Load-aware routing: the router auto-selects the least-loaded instance (weighting running / waiting requests + KV-cache usage) - -### Reliability -- **VRAM pre-flight guard** — blocks a start that would likely OOM, with a one-click *Force start* override -- **GPU auto-placement** — an instance with no pinned `cuda_device` is placed on the GPU with the most free memory -- **Auto-restart** — a managed model that crashes is restarted with exponential backoff (configurable budget, resets once healthy) - -### Observability -- Real-time status via Server-Sent Events (no polling) -- **System topology** (Vue Flow) — a live mission-control graph of Clients → Router → model groups / 
Embedding → GPUs, with animated traffic edges, GPU-placement edges, and a control plane; nodes are clickable drill-ins -- **Router load-balancing view** — an animated fan showing each replica's real traffic share and the instance the router will pick next -- **Grafana monitoring** (bundled) — Prometheus auto-discovers every running vLLM instance (file-based service discovery written by the backend as models start/stop) and scrapes its `/metrics`, alongside GPU (DCGM) and host (node-exporter) metrics. Grafana dashboards — **Overview** (single pane: health, latency SLO, capacity, GPU/host), **Scheduling & Capacity**, vLLM Performance/Query, GPU, Host — are embedded in the **Monitoring** tab, with SLO threshold lines, model-lifecycle annotations, and alert rules. See [Monitoring (Grafana)](#monitoring-grafana) -- Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a state-transition event timeline -- GPU / CPU / memory monitoring plus a GPU-process inventory +**vLLMux** is a self-hosted control plane for serving many LLMs on +[vLLM](https://github.com/vllm-project/vllm). Paste a `vllm serve …` command and it +becomes a routable model; the router load-balances across instances; and a bundled Prometheus + Grafana stack monitors everything — all behind one Vue dashboard. 
-### Playground -- OpenAI-compatible **chat (streaming)**, completions, **embeddings**, and **reranking**, sent straight through the router -- **Reasoning ("thinking") display** — when a model runs with a vLLM reasoning parser, the `reasoning` stream is shown in a collapsible *思考過程* block above the answer +## Highlights -### Benchmarking & Evaluation (evalscope) -- **Load testing** (`/benchmark`) — concurrency sweep, arrival-rate open-loop, multi-turn, **SLA auto-tune**, plus **embedding / rerank** throughput and single-request **speed benchmark**; each run is an isolated subprocess, with live charts, run comparison, and the full evalscope HTML report -- **Accuracy / quality evaluation** (`/eval`) — **30+ benchmark datasets** grouped by capability tier (Baseline, Knowledge, Chinese, Reasoning, Math, Multilingual, **Tool-calling**, **Long-context**, Code, and judge-scored QA): MMLU/ARC/GSM8K/IFEval, C-Eval/C-MMLU, GPQA/MMLU-Pro, AIME, HumanEval, ToolBench/General-FunctionCall, Needle-in-a-Haystack, … - - Per-dataset scores, a **run-to-run comparison matrix** (highlights the best per dataset), and the interactive HTML report - - **LLM-as-judge** for free-form QA — pick one of your own deployed models (via the router) or an external OpenAI-compatible API - - **Advanced `dataset_args`** — few-shot count + raw per-dataset overrides (subset selection, etc.) - - Sanity guards: judge-scored datasets require a judge; long-context and real tool-calling datasets warn about their model prerequisites (large `max_model_len`, vLLM tool parser) +- **Add a model by pasting `vllm serve …`** — parsed into a form and layered on as a dynamic overlay; the router hot-reloads, no `config.yaml` edits. +- **Lifecycle + self-healing** — per-instance state machine (`stopped → starting → ready → failed`), VRAM pre-flight guard, GPU auto-placement, crash auto-restart with backoff. +- **Load-aware routing** — picks the least-loaded replica (running/waiting requests + KV-cache usage). 
+- **Live observability** — SSE status, animated system-topology & router-balancing graphs, per-model usage / latency / error stats. +- **Bundled Grafana monitoring** — Prometheus auto-discovers every running instance; Overview / Capacity / Performance / GPU / Host dashboards embedded in-app, with SLO thresholds & alerts. +- **Playground** — OpenAI-compatible chat (streaming) / completions / embeddings / reranking, with reasoning display. +- **Benchmark & evaluate** — evalscope load tests (concurrency, arrival-rate, SLA auto-tune) plus 30+ accuracy datasets with LLM-as-judge. +- **Libraries** — browse / pre-download HF model weights & datasets from the UI; tool-calling parser helper; LoRA support. +- **Secure by config** — admin-token-gated controls, plus mint/revoke API keys with per-key usage attribution. -### Libraries -- **Model library** (`/library`) — scan / pre-download / delete HF model weights from the UI, with live download progress -- **Dataset library** (`/datasets`) — pre-download load-test and evaluation datasets into the shared ModelScope cache so a run never stalls on a first-time download -- **Tool-calling config helper** — the model editor maps model families to the right vLLM `tool_call_parser` (Qwen→`hermes`, Qwen3-Coder→`qwen3_xml`, Llama→`llama3_json`/`llama4_pythonic`, …) with one-click preset insertion (see `docs/vllm_auto_tool_整理.md`) +See [docs/features.md](docs/features.md) for the full breakdown. 
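The load-aware routing highlight above can be sketched as a small scoring function. This is a minimal illustration, not the router's actual code: the `ReplicaLoad` fields, the weights in `load_score`, and the `pick_least_loaded` helper are hypothetical names and values chosen for the example. The README only states that running requests, waiting requests, and KV-cache usage feed the decision.

```python
from dataclasses import dataclass


@dataclass
class ReplicaLoad:
    """Load snapshot for one vLLM instance (field names are illustrative)."""
    instance_id: str
    running: int           # requests currently being served
    waiting: int           # requests queued behind them
    kv_cache_usage: float  # fraction of the KV cache in use, 0.0 to 1.0


def load_score(r: ReplicaLoad, w_running=1.0, w_waiting=2.0, w_kv=10.0) -> float:
    # Hypothetical weights: queued requests count double, and a full
    # KV cache counts as much as ten running requests.
    return w_running * r.running + w_waiting * r.waiting + w_kv * r.kv_cache_usage


def pick_least_loaded(replicas):
    """Return the replica with the lowest combined load score."""
    if not replicas:
        raise ValueError("no ready replicas to route to")
    return min(replicas, key=load_score)
```

A weighted sum is only one way to combine the three signals; the real router may normalize or weight them differently.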
-### UX -- Light / dark theme, dense "control-room" interface -- **Admin-token-gated control** (start / stop / add / edit / remove) and - **API-key management** — mint/revoke keys that authenticate router inference, - with per-key usage attribution in the request log +## Quick start ---- - -## System Requirements - -### Hardware Requirements -- **GPU**: NVIDIA GPU (CUDA 13.1+ recommended) -- **Memory**: 16GB+ RAM (depending on model size) -- **Disk**: 50GB+ available space ---- - -## Quick Start - -### Docker Deployment (one command) - -The whole stack — dashboard backend, LLM router, and the Vue frontend — is built -and started by a single Compose file. Requires Docker with the NVIDIA Container -Toolkit (on WSL2, enable GPU support in Docker Desktop). +Requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in +Docker Desktop). ```bash cp deploy/.env.example deploy/.env # set HF_TOKEN, which GPUs, the admin token -make up # docker compose -f deploy/docker-compose.yaml up -d --build +make up # build + start the whole stack # open http://localhost:8884 ``` -`make down` stops it, `make logs` tails all services, `make ps` shows status. 
- -**Topology** (see [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)): - -| Service | Image | Port | Role | -|------------------|------------------------|---------|------| -| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` | -| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports | -| `prometheus` | `prom/prometheus` | 9090 | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances | -| `grafana` | `grafana/grafana` | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx | -| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) | -| `node-exporter` | `prom/node-exporter` | 9100 | Host metrics (CPU, RAM, disk, network) | -| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana | - -Why one image, multiple services on one netns: only the backend truly needs vLLM -(it launches the subprocesses), and the router + Prometheus must see them on -`localhost` — so a single [`engine.Dockerfile`](deploy/engine.Dockerfile) (based -on the official `vllm/vllm-openai`) runs as `backend` + `router`, joined (with -Prometheus) by `network_mode: service:backend`. - -The frontend reaches the backend, router, and Grafana through nginx on a single -origin, so no host/port is baked into the build. SQLite + the dynamic-model -overlay persist in the `llmops-data` named volume (Prometheus TSDB and Grafana -state in `prometheus-data` / `grafana-data`); downloaded model **weights** are -bind-mounted from the host HF cache (`HF_CACHE_DIR`, default -`~/.cache/huggingface`) so they're browsable locally and shared with host-side -tools. 
The canonical `packages/config-schema/config.yaml` is bind-mounted too, so -you can edit models without rebuilding. - -> **Model lifecycle**: the router only routes and load-balances — it never -> launches models. vLLM instances (and the Embedding/Reranker server) are owned -> by the backend and started on demand from the **Models** page (or -> `POST /api/models/{key}/start`). The backend and router both merge the -> dynamic-model overlay at startup, so models added from the UI survive restarts. - -#### Verify +`make down` stops it · `make logs` tails all services · `make ps` shows status. ```bash curl http://localhost:8887/v1/models # router: configured model groups curl http://localhost:5000/api/models # backend: lifecycle state of each instance +# http://localhost:8884/grafana # dashboards + alerts ``` -#### Monitoring (Grafana) - -The stack bundles a full **Prometheus → Grafana** pipeline, no manual setup: - -- The **backend** writes a Prometheus file-based service-discovery file - (`LLMOPS_PROMETHEUS_SD_PATH`) listing every *ready* vLLM instance, refreshed as - models start/stop — so a dynamic fleet is scraped with zero config edits. -- **Prometheus** (`:9090`) scrapes those instances' `/metrics` plus - `dcgm-exporter` (GPU) and `node-exporter` (host). -- **Grafana** is served single-origin at **`http://localhost:8884/grafana`** - (anonymous read-only; log in as `admin` / `GRAFANA_ADMIN_PASSWORD` to edit). - Datasource and dashboards are auto-provisioned from - [`deploy/grafana`](deploy/grafana): **Overview**, **vLLM Scheduling & - Capacity** (custom), **Performance**/**Query** (official), **GPU** (DCGM), and - **Host** (Node Exporter). The same dashboards are embedded in the dashboard's - **Monitoring** tab. -- **Alerting**: provisioned vLLM alert rules (target down, TTFT p95, KV cache, - request queueing) route to a webhook contact point — set `GRAFANA_ALERT_WEBHOOK` - in `deploy/.env` (Slack/Discord/generic) and restart Grafana to receive them. 
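The file-based service discovery described in the first bullet can be sketched as follows. Prometheus `file_sd` files are JSON lists of `{"targets": [...], "labels": {...}}` groups. Everything else here is an assumption for illustration: the `write_file_sd` helper, the tuple shape of `ready_instances`, and the `instance_id` label name are hypothetical, not the backend's real code.

```python
import json
import os
import tempfile


def write_file_sd(ready_instances, sd_path):
    """Write a Prometheus file_sd target list for the ready instances.

    ready_instances: iterable of (instance_id, host, port) tuples
    (an illustrative shape, not the backend's actual data model).
    """
    groups = [
        {"targets": [f"{host}:{port}"], "labels": {"instance_id": instance_id}}
        for instance_id, host, port in ready_instances
    ]
    # Write to a temp file in the same directory, then rename it into place,
    # so Prometheus never reads a half-written file.
    directory = os.path.dirname(os.path.abspath(sd_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    # mkstemp creates the file owner-only; relax it so a Prometheus
    # process running as another user can read the targets.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, sd_path)
    return groups
```

Calling this each time an instance enters or leaves the ready state would keep the scrape targets in step with the fleet, since Prometheus re-reads `file_sd` files when they change.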
- -```bash -curl http://localhost:9090/api/v1/targets # prometheus: scrape target health -# open http://localhost:8884/grafana # dashboards + alerts +Full topology, the shared-netns rationale, volumes, and a manual run are in +[docs/deployment.md](docs/deployment.md). + +## Architecture + +```mermaid +flowchart LR + Client([Clients]) + FE["frontend<br/>nginx · :8884<br/>single origin"] + VLLM["vLLM instances<br/>:800x"] + GF["grafana<br/>/grafana"] + DCGM["dcgm-exporter<br/>:9400 · GPU"] + NODE["node-exporter<br/>:9100 · host"] + + subgraph netns["shared network namespace"] + BE["backend · :5000<br/>model lifecycle"] + RT["router · :8887<br/>OpenAI-compatible LB"] + PR["prometheus · :9090"] + end + + Client --> FE + FE -->|/api| BE + FE -->|/v1| RT + FE -->|/grafana| GF + BE -->|launch on demand| VLLM + RT -->|route + balance| VLLM + PR -->|scrape /metrics| VLLM + PR --> DCGM + PR --> NODE + GF -->|query| PR ``` -### Frontend (Web Dashboard) +The **router only routes** — the **backend owns model lifecycle**. The frontend, router, +backend, and Grafana sit behind nginx on a single origin; backend, router, and Prometheus +share one network namespace so the spawned vLLM instances are reachable on `localhost`. -The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript, Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.) +## Documentation -#### Local development +| Topic | | +|---|---| +| Deployment & topology | [docs/deployment.md](docs/deployment.md) | +| Configuration (`config.yaml`) | [docs/configuration.md](docs/configuration.md) | +| Features in depth | [docs/features.md](docs/features.md) | +| Monitoring (Prometheus + Grafana) | [docs/monitoring.md](docs/monitoring.md) | +| HTTP API | [docs/API.md](docs/API.md) | -```bash -cd apps/frontend_llmops -npm install -npm run dev # http://localhost:5173 -``` - -#### Production build - -```bash -npm run build # outputs to dist/ -``` - -#### Configuration — `apps/frontend_llmops/.env` - -```env -VITE_API_BASE_URL=http://localhost:5000 # Dashboard backend (lifecycle, telemetry) -VITE_ROUTER_BASE_URL=http://localhost:8887 # LLM Router (inference + /metrics + /reload) -``` - -> **Authentication** is backend-driven (not a build-time password). Set -> `LLMOPS_ADMIN_TOKEN` on the backend + router to gate every control action -> (start / stop / add / edit / remove + API-key management); the UI prompts for -> the token once and reuses it for the session. 
Set `LLMOPS_REQUIRE_API_KEY=true` -> on the router to require a bearer token (the admin token, or an API key minted -> on the **API 金鑰** page) for all `/v1/*` inference. Both default to off for -> local dev. - -> **Run all three services for full functionality**: the Dashboard Backend (`:5000`), the LLM Router (`:8887`), and the model instances the backend launches on demand. The backend and router both merge the dynamic-model overlay at startup, so models added from the UI survive restarts. - -### Manual / development run - -Run the three pieces yourself (Python deps in the repo-root `.venv`): - -```bash -# Dashboard backend (:5000) -cd apps/backend && pip install -r requirements.txt -uvicorn main:app --host 0.0.0.0 --port 5000 - -# LLM router (:8887) — see apps/router-server/README.md for details -cd apps/router-server && pip install -r requirements.txt -sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py -``` - -Use `packages/config-schema/config.yaml` as the single source of truth so the -frontend, backend, and router all read the same configuration. - ---- - -## Configuration Guide - -### config.yaml Structure - -The configuration file is located at `packages/config-schema/config.yaml` (the single source of truth, validated by `packages/config-schema/schema.py`) and controls all model startup parameters. 
- -```yaml -# Router server configuration -server: - host: "0.0.0.0" - port: 8887 - uvicorn_log_level: "info" - -# LLM model configuration -LLM_engines: - Qwen3-0.6B: - instances: - - id: "qwen3" - host: "localhost" - port: 8002 - cuda_device: 0 - - id: "qwen3-2" - host: "localhost" - port: 8004 - cuda_device: 0 - - model_config: - model_tag: "Qwen/Qwen3-0.6B" - dtype: "float16" - max_model_len: 500 - gpu_memory_utilization: 0.35 - tensor_parallel_size: 1 - -# Embedding server configuration (optional) -embedding_server: - host: "localhost" - port: 8005 - cuda_device: 1 - - embedding_models: - m3e-base: - model_name: "moka-ai/m3e-base" - model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model" - tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer" - max_length: 512 - use_gpu: true - use_float16: true - - reranking_models: - bge-reranker-large: - model_name: "BAAI/bge-reranker-large" - model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model" - tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer" - max_length: 512 - use_gpu: true - use_float16: true -``` - -### Key Parameter Descriptions - -| Parameter | Description | Recommended Value | -|------|------|--------| -| `gpu_memory_utilization` | GPU memory usage ratio | 0.6-0.9 | -| `max_model_len` | Maximum context length | Based on model capability | -| `tensor_parallel_size` | Multi-GPU parallelism count | Number of GPUs | -| `dtype` | Inference precision | float16 (faster) / bfloat16 (more stable) | -| `cuda_device` | GPU device number | 0, 1, 2... | - ---- +## Requirements -### Q4: Can I run multiple models at once? +NVIDIA GPU (CUDA 13.1+ recommended) · 16GB+ RAM · 50GB+ disk. -Yes — as long as they fit in GPU memory. 
A **VRAM pre-flight guard** blocks a start that would overflow the target GPU (override per-start with *Force start*), and instances without a pinned `cuda_device` are **auto-placed** on the GPU with the most free memory. On a single small GPU you'll typically run one mid-size model alongside a few small ones; models are started on demand, so a large fleet can be configured without all running at once. +## License -Tune the guard / restart policy via env on the backend: `LLMOPS_VRAM_GUARD`, `LLMOPS_AUTO_RESTART`, `LLMOPS_MAX_RESTARTS`, `LLMOPS_RESTART_BACKOFF`. \ No newline at end of file +MIT — see [LICENSE](LICENSE). diff --git a/README_zh-CN.md b/README_zh-CN.md index c6f34db..733d4ea 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -1,283 +1,110 @@
-# LLM-Router-Server-Dashboard -**一站式 LLM 模型管理與監控平台** +# vLLMux -![Main Console](assets/image0.png) +**一站式部署、路由、監控與評測你的 vLLM 集群** -![Model Management](assets/image1.png) +[English](README.md) · [中文](README_zh-CN.md) +![vLLMux](https://img.shields.io/badge/vLLM-multiplexed-5b8def) +![license](https://img.shields.io/badge/license-MIT-green) +![stack](https://img.shields.io/badge/FastAPI%20·%20Vue%203%20·%20Grafana-informational) -![Model Management](assets/image2.png) -![Model Management](assets/image3.png) -![Model Management](assets/image4.png) +![Main Console](assets/image0.png) +![Main Console](assets/image1.png) +![Main Console](assets/image2.png)
--- -## 專案簡介 - -**LLM-Router-Server-Dashboard** 是一個針對大型語言模型(LLM)部署與管理的解決方案,提供直觀的 Web 界面來管理、監控和操作多個 LLM 模型實例。 - -本專案結合路由伺服器(LLM-Router-Server)與易用的管理界面,讓您能夠: -- **視覺化管理**:透過 Web 界面輕鬆管理多個模型 -- **動態啟停**:即時啟動、停止模型,無需重啟服務 -- **即時監控**:監控模型狀態、GPU 使用率、系統資訊 -- **配置管理**:透過 YAML 配置文件靈活管理模型參數 - ---- - -## 功能特色 - -### 模型管理 -- 基於 vLLM 的多模型、多實例管理(LLM、Embedding、Reranker) -- 每個實例獨立的生命週期(啟動/停止),具即時狀態機(`stopped → starting → ready → failed/stopping`),由 reconciler 從「進程存活 + `/health` 探測」推導真實狀態 -- **在前端貼上 `vllm serve …` 指令即可新增模型** — 解析成可編輯表單,以動態 *overlay* 疊加,**不動手寫的 `config.yaml`**;router 會熱重載(`POST /reload`),新模型端到端可被路由 -- 負載感知路由:router 自動選擇負載最低的實例(依運行中/等待中請求 + KV 快取使用率加權) +**vLLMux** 是一個自架的 LLM 服務控制平台,基於 +[vLLM](https://github.com/vllm-project/vllm)。 +內建的 Prometheus + Grafana 監控——全都在同一個 Vue 控制台之後。 -### 可靠性 -- **VRAM 預檢防呆** — 啟動前估算顯存,可能 OOM 就擋下,並提供一鍵 *Force start* 覆寫 -- **GPU 自動擺放** — 未指定 `cuda_device` 的實例會自動擺到剩餘顯存最多的 GPU -- **失敗自動重啟** — managed 模型崩潰後以指數退避自動重啟(可設次數,恢復健康後重置) -### 觀測性 -- 透過 Server-Sent Events 即時更新狀態(免輪詢) -- **系統拓撲圖**(Vue Flow)— Clients → Router → 模型群組/Embedding → GPU 的即時 mission-control 圖,含流動的流量邊、GPU 擺放邊與控制平面;節點可點擊下鑽 -- **Router 負載平衡視圖** — 動畫扇形圖呈現每個副本的實際流量佔比,以及 router 下一個會選的實例 -- **Grafana 監控**(內建)— Prometheus 自動發現每個運行中的 vLLM 實例(後端隨模型啟停寫出 file-based service discovery)並抓取其 `/metrics`,外加 GPU(DCGM)與主機(node-exporter)指標。Grafana dashboards —— **總覽**(單一頁面:健康、延遲 SLO、容量、GPU/主機)、**排程與容量**、vLLM Performance/Query、GPU、Host —— 嵌入 **監控** 分頁,含 SLO 門檻線、模型生命週期標註與告警規則。見 [監控(Grafana)](#監控grafana) -- 每模型用量(次數、錯誤率、p50/p95 延遲、tokens)、請求日誌、狀態轉移事件時間軸 -- GPU/CPU/記憶體監控,以及 GPU 進程清單 +## 功能亮點 -### Playground -- OpenAI 相容的 **chat(串流)**、completions、**embeddings**、**reranking**,直接經由 router -- **思考(reasoning)顯示** — 模型搭配 vLLM reasoning parser 時,`reasoning` 串流會顯示在答案上方的可摺疊「💭 思考過程」區塊 - -### 壓測與評測(evalscope) -- **壓測**(`/benchmark`)— 並發 sweep、到達率 open-loop、多輪、**SLA 自動調優**,以及 **embedding/rerank** 吞吐與單請求**速度基準**;每次執行為獨立子進程,含即時圖表、run 比較、完整 evalscope HTML 報告 -- **準確度/品質評測**(`/eval`)— **30+ 
個基準資料集**,依能力分組(基線、知識進階、中文、推理、數學、多語言、**工具調用**、**長上下文**、程式碼、需裁判的問答):MMLU/ARC/GSM8K/IFEval、C-Eval/C-MMLU、GPQA/MMLU-Pro、AIME、HumanEval、ToolBench/General-FunctionCall、Needle-in-a-Haystack… - - 每資料集分數、**run 對 run 的比較表**(每列標出最高分)、互動式 HTML 報告 - - **裁判模型(LLM-as-judge)** 給自由問答評分 — 可選自家部署的模型(經 router)或外部 OpenAI 相容 API - - **進階 `dataset_args`** — few-shot 數 + 依資料集的原始覆寫(子集選擇等) - - 防呆:需裁判的資料集會強制設定裁判;長上下文與真實工具調用資料集會提醒模型前提(夠大的 `max_model_len`、vLLM tool parser) - -### 資料庫 -- **模型庫**(`/library`)— 在 UI 掃描/預下載/刪除 HF 權重,含即時下載進度 -- **資料集庫**(`/datasets`)— 預先下載壓測與評測資料集到共用 ModelScope 快取,執行時就不會卡在首次下載 -- **工具調用設定助手** — 模型編輯器把模型家族對應到正確的 vLLM `tool_call_parser`(Qwen→`hermes`、Qwen3-Coder→`qwen3_xml`、Llama→`llama3_json`/`llama4_pythonic`…),一鍵帶入(見 `docs/vllm_auto_tool_整理.md`) - -### 使用體驗 -- 明暗雙主題、資訊密集的「控制室」介面 -- **管理員權杖控管**控制操作(啟動/停止/新增/編輯/移除),以及 **API 金鑰管理** — - 發行/撤銷用於 router 推理的金鑰,並在請求日誌中做 per-key 用量歸屬 - ---- - -## 環境需求 +- **貼上 `vllm serve …` 即可新增模型** — 解析成表單、以動態 overlay 疊加;router 熱重載。 +- **生命週期** — 每實例狀態機(`stopped → starting → ready → failed`)、VRAM 預檢防呆、GPU 自動擺放、崩潰指數退避自動重啟。 +- **負載感知路由** — 自動挑負載最低的副本(運行中/等待中請求 + KV 快取使用率)。 +- **即時觀測** — SSE 狀態、動畫系統拓撲圖與 router 負載平衡圖、每模型用量/延遲/錯誤統計。 +- **內建 Grafana 監控** — Prometheus 自動發現每個運行中的實例;總覽/容量/效能/GPU/主機 dashboards 嵌入應用內,含 SLO 門檻線與告警。 +- **Playground** — OpenAI 相容的 chat(串流)/completions/embeddings/reranking。 +- **壓測與評測** — LLM 壓測(並發、到達率、SLA 自動調優)+ 30+ 個準確度資料集與 LLM-as-judge。 +- **資料庫** — 在 UI 瀏覽/預下載 HF 權重與資料集;工具調用 parser 助手;LoRA 支援。 +- **安全性** — 管理員權杖控管操作,並可發行/撤銷帶 per-key 用量歸屬的 API 金鑰。 -### 硬體需求 -- **GPU**: NVIDIA GPU(建議 CUDA 13.1+) -- **記憶體**: 16GB+ RAM(依模型大小而定) -- **硬碟**: 50GB+ 可用空間 ---- +完整說明見 [docs/features_zh-CN.md](docs/features_zh-CN.md)。 ## 快速開始 -### Docker 一鍵部署 - -整套服務(Dashboard 後端、LLM router、Vue 前端)由單一 Compose 檔建置並啟動。 需要安裝 Docker 與 NVIDIA Container Toolkit(WSL2 請在 Docker Desktop 開啟 GPU 支援)。 ```bash cp deploy/.env.example deploy/.env # 填 HF_TOKEN、要用的 GPU、管理員權杖 -make up # docker compose -f deploy/docker-compose.yaml up -d --build +make up # 
建置並啟動整套服務 # 瀏覽器開 http://localhost:8884 ``` -`make down` 停止、`make logs` 追蹤所有服務日誌、`make ps` 看狀態。 - -**架構**(見 [`deploy/docker-compose.yaml`](deploy/docker-compose.yaml)): - -| 服務 | 映像 | 端口 | 角色 | -|------------------|------------------------|-------|------| -| `backend` | `llmops-engine`(GPU) | 5000 | Dashboard API;在 `:800x` 拉起 vLLM 子進程 | -| `router` | `llmops-engine` | 8887 | OpenAI 相容路由;**共用後端的 network namespace**,才打得到那些 localhost vLLM 端口 | -| `prometheus` | `prom/prometheus` | 9090 | 透過 file-based SD 抓取 vLLM 艦隊的 `/metrics`;**同樣共用後端 netns**,`localhost:800x` 才解析得到那些實例 | -| `grafana` | `grafana/grafana` | (代理)| Dashboards 與告警;經前端 nginx 以單一來源代理在 `/grafana` | -| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter`(GPU) | 9400 | NVIDIA GPU 遙測(利用率、顯存、溫度、功耗) | -| `node-exporter` | `prom/node-exporter` | 9100 | 主機指標(CPU、RAM、磁碟、網路) | -| `frontend` | `llmops-frontend` | 8884 | nginx 服務 SPA,並反向代理 `/api` → 後端、`/v1` → router、`/grafana` → grafana | - -為何一份映像、多個服務共用一個 netns:只有後端真的需要 vLLM(它負責拉起子進程), -而 router 與 Prometheus 必須在 `localhost` 看到那些子進程——所以單一 -[`engine.Dockerfile`](deploy/engine.Dockerfile)(基於官方 `vllm/vllm-openai`)跑成 -`backend` + `router`,並(連同 Prometheus)以 `network_mode: service:backend` 串接。 - -前端透過 nginx 以單一來源(same-origin)連到後端、router 與 Grafana,因此 build 不會 -硬編任何 host/port。SQLite 與動態模型 overlay 放在 `llmops-data` named volume(Prometheus -TSDB 與 Grafana 狀態放在 `prometheus-data` / `grafana-data`);下載的模型**權重**則以 -bind-mount 掛在主機 HF 快取(`HF_CACHE_DIR`,預設 `~/.cache/huggingface`),所以本機就能 -直接瀏覽、也和主機端工具共用。`packages/config-schema/config.yaml` 同樣 bind-mount 掛入, -因此改模型不必重新 build。 - -> **模型生命週期**:router 只負責路由與負載平衡,不會啟動模型。vLLM 實例(與 -> Embedding/Reranker 服務)由後端管理,從 **Models** 頁按需啟動(或 -> `POST /api/models/{key}/start`)。後端與 router 都會在啟動時合併動態模型 overlay, -> 所以從前端新增的模型在重啟後仍會保留。 - -#### 驗證 +`make down` 停止 · `make logs` 追蹤所有服務日誌 · `make ps` 看狀態。 ```bash curl http://localhost:8887/v1/models # router:列出設定的模型群組 curl http://localhost:5000/api/models # 後端:每個實例的生命週期狀態 +# http://localhost:8884/grafana # dashboards 
與告警 ``` -#### 監控(Grafana) - -整套內建完整的 **Prometheus → Grafana** 流程,免手動設定: - -- **後端**寫出 Prometheus file-based service-discovery 檔(`LLMOPS_PROMETHEUS_SD_PATH`), - 列出每個 *ready* 的 vLLM 實例,並隨模型啟停刷新——所以動態艦隊免改設定即被抓取。 -- **Prometheus**(`:9090`)抓取這些實例的 `/metrics`,外加 `dcgm-exporter`(GPU)與 - `node-exporter`(主機)。 -- **Grafana** 以單一來源服務於 **`http://localhost:8884/grafana`**(匿名唯讀;以 - `admin` / `GRAFANA_ADMIN_PASSWORD` 登入可編輯)。datasource 與 dashboards 由 - [`deploy/grafana`](deploy/grafana) 自動 provision:**總覽**、**vLLM 排程與容量**(自訂)、 - **Performance**/**Query**(官方)、**GPU**(DCGM)、**Host**(Node Exporter)。 - 同一批 dashboards 也嵌入控制台的 **監控** 分頁。 -- **告警**:已 provision 的 vLLM 告警規則(target down、TTFT p95、KV cache、請求排隊) - 路由到一個 webhook contact point —— 在 `deploy/.env` 設 `GRAFANA_ALERT_WEBHOOK` - (Slack/Discord/通用)並重啟 Grafana 即可收到通知。 - -```bash -curl http://localhost:9090/api/v1/targets # prometheus:scrape target 健康狀態 -# 開啟 http://localhost:8884/grafana # dashboards 與告警 -``` - -### 前端(Web 控制台) - -控制台位於 **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript、Tailwind CSS v4、shadcn-vue 元件、[Vue Flow](https://vueflow.dev)(拓撲/路由圖)、Pinia + Vue Router。(舊的 `apps/frontend` 已棄用。) - -#### 本地開發 - -```bash -cd apps/frontend_llmops -npm install -npm run dev # http://localhost:5173 -``` - -#### 生產環境建置 - -```bash -npm run build # 輸出到 dist/ -``` - -#### 設定 — `apps/frontend_llmops/.env` - -```env -VITE_API_BASE_URL=http://localhost:5000 # Dashboard 後端(生命週期、遙測) -VITE_ROUTER_BASE_URL=http://localhost:8887 # LLM Router(推理 + /metrics + /reload) -``` - -> **驗證**改由後端控管(不再是 build-time 密碼)。在後端與 router 設定 -> `LLMOPS_ADMIN_TOKEN` 即可鎖住所有控制操作(啟動/停止/新增/編輯/移除 + -> 金鑰管理);UI 會要求輸入一次 token 並在 session 內沿用。在 router 設定 -> `LLMOPS_REQUIRE_API_KEY=true` 則要求所有 `/v1/*` 推理都帶 bearer token(admin -> token,或在 **API 金鑰** 頁建立的金鑰)。兩者預設關閉,方便本機開發。 - -> **三個服務都要跑才完整**:Dashboard 後端(`:5000`)、LLM Router(`:8887`)、以及後端按需啟動的模型實例。後端與 router 都會在啟動時合併動態模型 overlay,所以從前端新增的模型在重啟後仍會保留。 - -### 手動 / 開發啟動 - -也可自行啟動三個部分(Python 依賴在 repo 根目錄的 `.venv`): - -```bash -# 
Dashboard 後端(:5000) -cd apps/backend && pip install -r requirements.txt -uvicorn main:app --host 0.0.0.0 --port 5000 - -# LLM router(:8887)— 細節見 apps/router-server/README_zh.md -cd apps/router-server && pip install -r requirements.txt -sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py +完整架構、共用 netns 的原理、volumes 與手動啟動見 +[docs/deployment_zh-CN.md](docs/deployment_zh-CN.md)。 + +## 架構 + +```mermaid +flowchart LR + Client([Clients 用戶端]) + FE["frontend<br/>nginx · :8884<br/>單一來源"] + VLLM["vLLM 實例<br/>:800x"] + GF["grafana<br/>/grafana"] + DCGM["dcgm-exporter<br/>:9400 · GPU"] + NODE["node-exporter<br/>:9100 · 主機"] + + subgraph netns["共用 network namespace"] + BE["backend · :5000<br/>模型生命週期"] + RT["router · :8887<br/>OpenAI 相容負載平衡"] + PR["prometheus · :9090"] + end + + Client --> FE + FE -->|/api| BE + FE -->|/v1| RT + FE -->|/grafana| GF + BE -->|按需拉起| VLLM + RT -->|路由 + 平衡| VLLM + PR -->|抓取 /metrics| VLLM + PR --> DCGM + PR --> NODE + GF -->|查詢| PR ``` -配置文件統一使用 `packages/config-schema/config.yaml`(單一來源),確保前端、後端與 -router 讀到同一份設定。 - ---- - -## 配置說明 - -### config.yaml 結構 - -配置文件位於 `packages/config-schema/config.yaml`(單一來源,由 `packages/config-schema/schema.py` 驗證),控制所有模型的啟動參數。 +**router 只負責路由**——**模型生命週期由 backend 掌管**。frontend、router、backend 與 +Grafana 都在 nginx 之後以單一來源對外;backend、router、Prometheus 共用一個 network +namespace,所以被拉起的 vLLM 實例可在 `localhost` 互相連到。 -```yaml -# 路由服務器配置 -server: - host: "0.0.0.0" - port: 8887 - uvicorn_log_level: "info" +## 文件 -# LLM 模型配置 -LLM_engines: - Qwen3-0.6B: - instances: - - id: "qwen3" - host: "localhost" - port: 8002 - cuda_device: 0 - - id: "qwen3-2" - host: "localhost" - port: 8004 - cuda_device: 0 +| 主題 | | +|---|---| +| 部署與架構 | [docs/deployment_zh-CN.md](docs/deployment_zh-CN.md) | +| 配置(`config.yaml`) | [docs/configuration_zh-CN.md](docs/configuration_zh-CN.md) | +| 功能特色(詳細) | [docs/features_zh-CN.md](docs/features_zh-CN.md) | +| 監控(Prometheus + Grafana) | [docs/monitoring_zh-CN.md](docs/monitoring_zh-CN.md) | +| HTTP API | [docs/API.md](docs/API.md) | - model_config: - model_tag: "Qwen/Qwen3-0.6B" - dtype: "float16" - max_model_len: 500 - gpu_memory_utilization: 0.35 - tensor_parallel_size: 1 - -# Embedding 服務器配置(可選) -embedding_server: - host: "localhost" - port: 8005 - cuda_device: 1 - - embedding_models: - m3e-base: - model_name: "moka-ai/m3e-base" - model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model" - tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer" - max_length: 512 - use_gpu: true - use_float16: true - - reranking_models: - bge-reranker-large: - model_name: "BAAI/bge-reranker-large" - model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model" - 
tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer" - max_length: 512 - use_gpu: true - use_float16: true -``` - -### 關鍵參數說明 - -| 參數 | 說明 | 建議值 | -|------|------|--------| -| `gpu_memory_utilization` | GPU 記憶體使用比例 | 0.6-0.9 | -| `max_model_len` | 最大上下文長度 | 依模型能力 | -| `tensor_parallel_size` | 多 GPU 並行數 | GPU 數量 | -| `dtype` | 推理精度 | float16(速度快) / bfloat16(更穩定) | -| `cuda_device` | GPU 設備編號 | 0, 1, 2... | - ---- +## 環境需求 -### Q4: 可以同時啟動多個模型嗎? +NVIDIA GPU(建議 CUDA 13.1+)· 16GB+ RAM · 50GB+ 磁碟。 -可以 — 只要顯存放得下。**VRAM 預檢防呆**會擋下會撐爆目標 GPU 的啟動(可用 *Force start* 逐次覆寫),未指定 `cuda_device` 的實例會**自動擺放**到剩餘顯存最多的 GPU。單張小卡通常能跑一顆中型模型加幾顆小模型;模型是按需啟動的,所以可以設定一大批而不必全部同時運行。 +## 授權 -可在後端用環境變數調整防呆/重啟策略:`LLMOPS_VRAM_GUARD`、`LLMOPS_AUTO_RESTART`、`LLMOPS_MAX_RESTARTS`、`LLMOPS_RESTART_BACKOFF`。 \ No newline at end of file +MIT — 見 [LICENSE](LICENSE)。 diff --git a/apps/backend/app/main.py b/apps/backend/app/main.py index 3d5351e..4e8ce4c 100644 --- a/apps/backend/app/main.py +++ b/apps/backend/app/main.py @@ -145,7 +145,7 @@ async def lifespan(app: FastAPI): def create_app() -> FastAPI: - app = FastAPI(title="LLM Router Dashboard Backend", lifespan=lifespan) + app = FastAPI(title="vLLMux Backend", lifespan=lifespan) app.add_middleware( CORSMiddleware, allow_origins=["*"], diff --git a/apps/frontend_llmops/index.html b/apps/frontend_llmops/index.html index 9e5fc8f..91bf23d 100644 --- a/apps/frontend_llmops/index.html +++ b/apps/frontend_llmops/index.html @@ -4,7 +4,7 @@ - Vite App + vLLMux
diff --git a/apps/frontend_llmops/src/components/layout/AppSidebar.vue b/apps/frontend_llmops/src/components/layout/AppSidebar.vue index 2822f57..6f8ebd2 100644 --- a/apps/frontend_llmops/src/components/layout/AppSidebar.vue +++ b/apps/frontend_llmops/src/components/layout/AppSidebar.vue @@ -80,7 +80,7 @@ const nav = [
- LLMOps
+ vLLMux
  控制台
diff --git a/assets/image0.png b/assets/image0.png index e1b9f76..8074a84 100644 Binary files a/assets/image0.png and b/assets/image0.png differ diff --git a/assets/image1.png b/assets/image1.png index fc0f170..42aa1d7 100644 Binary files a/assets/image1.png and b/assets/image1.png differ diff --git a/assets/image2.png b/assets/image2.png index 6bf6a8c..0572a10 100644 Binary files a/assets/image2.png and b/assets/image2.png differ diff --git a/assets/image3.png b/assets/image3.png deleted file mode 100644 index 7b930c5..0000000 Binary files a/assets/image3.png and /dev/null differ diff --git a/assets/image4.png b/assets/image4.png deleted file mode 100644 index 639dc37..0000000 Binary files a/assets/image4.png and /dev/null differ diff --git a/docs/REFACTOR_PLAN.md b/docs/REFACTOR_PLAN.md index ea47c4b..e9660f0 100644 --- a/docs/REFACTOR_PLAN.md +++ b/docs/REFACTOR_PLAN.md @@ -1,4 +1,4 @@ -# LLM-Router-Server-Dashboard 重構方案(Monorepo) +# vLLMux 重構方案(Monorepo) > 狀態:**已執行(Phase 0–4 完成)**。三個子專案已整理成 `apps/` + `packages/` + `deploy/` 的正式 Monorepo。 > 後端已分層並收斂 config、前端已建立 services/stores/router/views、共用 config-schema 已上線, diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 0000000..0369bb9 --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,93 @@ +# Configuration + +> [中文](configuration_zh-CN.md) + +The configuration file lives at `packages/config-schema/config.yaml` — the single +source of truth, validated by `packages/config-schema/schema.py`, and read by the +frontend, backend, and router alike. It controls all model startup parameters. + +> You usually **don't** edit this by hand: add models from the UI by pasting a +> `vllm serve …` command, which is layered on as a dynamic overlay. Edit `config.yaml` +> only for the canonical, hand-maintained fleet. 
+ +## `config.yaml` structure + +```yaml +# Router server configuration +server: + host: "0.0.0.0" + port: 8887 + uvicorn_log_level: "info" + +# LLM model configuration +LLM_engines: + Qwen3-0.6B: + instances: + - id: "qwen3" + host: "localhost" + port: 8002 + cuda_device: 0 + - id: "qwen3-2" + host: "localhost" + port: 8004 + cuda_device: 0 + + model_config: + model_tag: "Qwen/Qwen3-0.6B" + dtype: "float16" + max_model_len: 500 + gpu_memory_utilization: 0.35 + tensor_parallel_size: 1 + +# Embedding server configuration (optional) +embedding_server: + host: "localhost" + port: 8005 + cuda_device: 1 + + embedding_models: + m3e-base: + model_name: "moka-ai/m3e-base" + model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model" + tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer" + max_length: 512 + use_gpu: true + use_float16: true + + reranking_models: + bge-reranker-large: + model_name: "BAAI/bge-reranker-large" + model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model" + tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer" + max_length: 512 + use_gpu: true + use_float16: true +``` + +## Key parameters + +| Parameter | Description | Recommended | +|------|------|--------| +| `gpu_memory_utilization` | GPU memory usage ratio | 0.6–0.9 | +| `max_model_len` | Maximum context length | Based on model capability | +| `tensor_parallel_size` | Multi-GPU parallelism count | Number of GPUs | +| `dtype` | Inference precision | float16 (faster) / bfloat16 (more stable) | +| `cuda_device` | GPU device number | 0, 1, 2… | + +## Running multiple models at once + +Yes — as long as they fit in GPU memory. A **VRAM pre-flight guard** blocks a start +that would overflow the target GPU (override per-start with *Force start*), and +instances without a pinned `cuda_device` are **auto-placed** on the GPU with the most +free memory. 
On a single small GPU you'll typically run one mid-size model alongside +a few small ones; models are started on demand, so a large fleet can be configured +without all running at once. + +Tune the guard / restart policy via env on the backend: + +| Env | Purpose | +|---|---| +| `LLMOPS_VRAM_GUARD` | Enable/disable the VRAM pre-flight guard | +| `LLMOPS_AUTO_RESTART` | Auto-restart a crashed managed model | +| `LLMOPS_MAX_RESTARTS` | Restart budget before giving up | +| `LLMOPS_RESTART_BACKOFF` | Exponential backoff base | diff --git a/docs/configuration_zh-CN.md b/docs/configuration_zh-CN.md new file mode 100644 index 0000000..d4288c7 --- /dev/null +++ b/docs/configuration_zh-CN.md @@ -0,0 +1,89 @@ +# 配置說明 + +> [English](configuration.md) + +配置文件位於 `packages/config-schema/config.yaml`——單一來源,由 +`packages/config-schema/schema.py` 驗證,前端、後端與 router 都讀同一份。它控制所有 +模型的啟動參數。 + +> 通常**不需要**手動編輯:從前端貼上 `vllm serve …` 指令新增模型,會以動態 overlay 疊加。 +> 只有要維護「正式、手寫」的模型清單時才改 `config.yaml`。 + +## `config.yaml` 結構 + +```yaml +# 路由服務器配置 +server: + host: "0.0.0.0" + port: 8887 + uvicorn_log_level: "info" + +# LLM 模型配置 +LLM_engines: + Qwen3-0.6B: + instances: + - id: "qwen3" + host: "localhost" + port: 8002 + cuda_device: 0 + - id: "qwen3-2" + host: "localhost" + port: 8004 + cuda_device: 0 + + model_config: + model_tag: "Qwen/Qwen3-0.6B" + dtype: "float16" + max_model_len: 500 + gpu_memory_utilization: 0.35 + tensor_parallel_size: 1 + +# Embedding 服務器配置(可選) +embedding_server: + host: "localhost" + port: 8005 + cuda_device: 1 + + embedding_models: + m3e-base: + model_name: "moka-ai/m3e-base" + model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model" + tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer" + max_length: 512 + use_gpu: true + use_float16: true + + reranking_models: + bge-reranker-large: + model_name: "BAAI/bge-reranker-large" + model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model" + 
tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer" + max_length: 512 + use_gpu: true + use_float16: true +``` + +## 關鍵參數 + +| 參數 | 說明 | 建議值 | +|------|------|--------| +| `gpu_memory_utilization` | GPU 記憶體使用比例 | 0.6–0.9 | +| `max_model_len` | 最大上下文長度 | 依模型能力 | +| `tensor_parallel_size` | 多 GPU 並行數 | GPU 數量 | +| `dtype` | 推理精度 | float16(速度快) / bfloat16(更穩定) | +| `cuda_device` | GPU 設備編號 | 0, 1, 2… | + +## 同時啟動多個模型 + +可以——只要顯存放得下。**VRAM 預檢防呆**會擋下會撐爆目標 GPU 的啟動(可用 *Force start* +逐次覆寫),未指定 `cuda_device` 的實例會**自動擺放**到剩餘顯存最多的 GPU。單張小卡通常 +能跑一顆中型模型加幾顆小模型;模型是按需啟動的,所以可以設定一大批而不必全部同時運行。 + +可在後端用環境變數調整防呆/重啟策略: + +| 環境變數 | 用途 | +|---|---| +| `LLMOPS_VRAM_GUARD` | 啟用/關閉 VRAM 預檢防呆 | +| `LLMOPS_AUTO_RESTART` | 崩潰的 managed 模型自動重啟 | +| `LLMOPS_MAX_RESTARTS` | 放棄前的重啟次數上限 | +| `LLMOPS_RESTART_BACKOFF` | 指數退避基數 | diff --git a/docs/deployment.md b/docs/deployment.md new file mode 100644 index 0000000..69aa01a --- /dev/null +++ b/docs/deployment.md @@ -0,0 +1,115 @@ +# Deployment & Topology + +> [中文](deployment_zh-CN.md) + +The whole stack — dashboard backend, LLM router, Prometheus, Grafana, the GPU/host +exporters, and the Vue frontend — is built and started by a single Compose file. +Requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in +Docker Desktop). + +```bash +cp deploy/.env.example deploy/.env # set HF_TOKEN, which GPUs, the admin token +make up # docker compose -f deploy/docker-compose.yaml up -d --build +# open http://localhost:8884 +``` + +`make down` stops it, `make logs` tails all services, `make ps` shows status. + +## Services + +See [`deploy/docker-compose.yaml`](../deploy/docker-compose.yaml). 
+ +| Service | Image | Port | Role | +|------------------|------------------------|---------|------| +| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` | +| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports | +| `prometheus` | `prom/prometheus` | 9090 | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances | +| `grafana` | `grafana/grafana` | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx | +| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) | +| `node-exporter` | `prom/node-exporter` | 9100 | Host metrics (CPU, RAM, disk, network) | +| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana | + +### Why one image, multiple services on one netns + +Only the backend truly needs vLLM (it launches the subprocesses), and the router + +Prometheus must see them on `localhost` — so a single +[`engine.Dockerfile`](../deploy/engine.Dockerfile) (based on the official +`vllm/vllm-openai`) runs as `backend` + `router`, joined (with Prometheus) by +`network_mode: service:backend`. + +The frontend reaches the backend, router, and Grafana through nginx on a single +origin, so no host/port is baked into the build. 
+ +### Persistence + +- SQLite + the dynamic-model overlay → `llmops-data` named volume +- Prometheus TSDB → `prometheus-data`; Grafana state → `grafana-data` +- Model **weights** are bind-mounted from the host HF cache (`HF_CACHE_DIR`, default + `~/.cache/huggingface`) so they're browsable locally and shared with host-side tools +- `packages/config-schema/config.yaml` is bind-mounted too, so you can edit models + without rebuilding + +> **Model lifecycle**: the router only routes and load-balances — it never launches +> models. vLLM instances (and the Embedding/Reranker server) are owned by the backend +> and started on demand from the **Models** page (or `POST /api/models/{key}/start`). +> The backend and router both merge the dynamic-model overlay at startup, so models +> added from the UI survive restarts. + +### Verify + +```bash +curl http://localhost:8887/v1/models # router: configured model groups +curl http://localhost:5000/api/models # backend: lifecycle state of each instance +``` + +## Frontend (Web dashboard) + +The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript, +Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the +topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.) + +```bash +cd apps/frontend_llmops +npm install +npm run dev # http://localhost:5173 +npm run build # production build → dist/ +``` + +Configuration — `apps/frontend_llmops/.env`: + +```env +VITE_API_BASE_URL=http://localhost:5000 # Dashboard backend (lifecycle, telemetry) +VITE_ROUTER_BASE_URL=http://localhost:8887 # LLM Router (inference + /metrics + /reload) +``` + +### Authentication + +Authentication is backend-driven (not a build-time password). Set +`LLMOPS_ADMIN_TOKEN` on the backend + router to gate every control action (start / +stop / add / edit / remove + API-key management); the UI prompts for the token once +and reuses it for the session. 
Set `LLMOPS_REQUIRE_API_KEY=true` on the router to +require a bearer token (the admin token, or an API key minted on the **API Keys** +page) for all `/v1/*` inference. Both default to off for local dev. + +## Manual / development run + +Run the three pieces yourself (Python deps in the repo-root `.venv`): + +```bash +# Dashboard backend (:5000) +cd apps/backend && pip install -r requirements.txt +uvicorn main:app --host 0.0.0.0 --port 5000 + +# LLM router (:8887) — see apps/router-server/README.md for details +cd apps/router-server && pip install -r requirements.txt +sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py +``` + +Use `packages/config-schema/config.yaml` as the single source of truth so the +frontend, backend, and router all read the same configuration. + +## Requirements + +- **GPU**: NVIDIA GPU (CUDA 13.1+ recommended) +- **Memory**: 16GB+ RAM (depending on model size) +- **Disk**: 50GB+ available space diff --git a/docs/deployment_zh-CN.md b/docs/deployment_zh-CN.md new file mode 100644 index 0000000..8fe1e4f --- /dev/null +++ b/docs/deployment_zh-CN.md @@ -0,0 +1,108 @@ +# 部署與架構 + +> [English](deployment.md) + +整套服務(Dashboard 後端、LLM router、Prometheus、Grafana、GPU/主機 exporter、Vue 前端) +由單一 Compose 檔建置並啟動。需要安裝 Docker 與 NVIDIA Container Toolkit(WSL2 請在 +Docker Desktop 開啟 GPU 支援)。 + +```bash +cp deploy/.env.example deploy/.env # 填 HF_TOKEN、要用的 GPU、管理員權杖 +make up # docker compose -f deploy/docker-compose.yaml up -d --build +# 瀏覽器開 http://localhost:8884 +``` + +`make down` 停止、`make logs` 追蹤所有服務日誌、`make ps` 看狀態。 + +## 服務 + +見 [`deploy/docker-compose.yaml`](../deploy/docker-compose.yaml)。 + +| 服務 | 映像 | 端口 | 角色 | +|------------------|------------------------|-------|------| +| `backend` | `llmops-engine`(GPU) | 5000 | Dashboard API;在 `:800x` 拉起 vLLM 子進程 | +| `router` | `llmops-engine` | 8887 | OpenAI 相容路由;**共用後端的 network namespace**,才打得到那些 localhost vLLM 端口 | +| `prometheus` | `prom/prometheus` | 9090 | 透過 file-based SD 
抓取 vLLM 艦隊的 `/metrics`;**同樣共用後端 netns**,`localhost:800x` 才解析得到那些實例 | +| `grafana` | `grafana/grafana` | (代理)| Dashboards 與告警;經前端 nginx 以單一來源代理在 `/grafana` | +| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter`(GPU) | 9400 | NVIDIA GPU 遙測(利用率、顯存、溫度、功耗) | +| `node-exporter` | `prom/node-exporter` | 9100 | 主機指標(CPU、RAM、磁碟、網路) | +| `frontend` | `llmops-frontend` | 8884 | nginx 服務 SPA,並反向代理 `/api` → 後端、`/v1` → router、`/grafana` → grafana | + +### 為何一份映像、多個服務共用一個 netns + +只有後端真的需要 vLLM(它負責拉起子進程),而 router 與 Prometheus 必須在 `localhost` +看到那些子進程——所以單一 [`engine.Dockerfile`](../deploy/engine.Dockerfile)(基於官方 +`vllm/vllm-openai`)跑成 `backend` + `router`,並(連同 Prometheus)以 +`network_mode: service:backend` 串接。前端透過 nginx 以單一來源(same-origin)連到後端、 +router 與 Grafana,因此 build 不會硬編任何 host/port。 + +### 資料持久化 + +- SQLite 與動態模型 overlay → `llmops-data` named volume +- Prometheus TSDB → `prometheus-data`;Grafana 狀態 → `grafana-data` +- 模型**權重**以 bind-mount 掛在主機 HF 快取(`HF_CACHE_DIR`,預設 `~/.cache/huggingface`), + 本機就能直接瀏覽、也和主機端工具共用 +- `packages/config-schema/config.yaml` 同樣 bind-mount 掛入,因此改模型不必重新 build + +> **模型生命週期**:router 只負責路由與負載平衡,不會啟動模型。vLLM 實例(與 +> Embedding/Reranker 服務)由後端管理,從 **Models** 頁按需啟動(或 +> `POST /api/models/{key}/start`)。後端與 router 都會在啟動時合併動態模型 overlay, +> 所以從前端新增的模型在重啟後仍會保留。 + +### 驗證 + +```bash +curl http://localhost:8887/v1/models # router:列出設定的模型群組 +curl http://localhost:5000/api/models # 後端:每個實例的生命週期狀態 +``` + +## 前端(Web 控制台) + +控制台位於 **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript、Tailwind CSS v4、 +shadcn-vue 元件、[Vue Flow](https://vueflow.dev)(拓撲/路由圖)、Pinia + Vue Router。 +(舊的 `apps/frontend` 已棄用。) + +```bash +cd apps/frontend_llmops +npm install +npm run dev # http://localhost:5173 +npm run build # 生產環境建置 → dist/ +``` + +設定 — `apps/frontend_llmops/.env`: + +```env +VITE_API_BASE_URL=http://localhost:5000 # Dashboard 後端(生命週期、遙測) +VITE_ROUTER_BASE_URL=http://localhost:8887 # LLM Router(推理 + /metrics + /reload) +``` + +### 驗證機制 + +驗證改由後端控管(不再是 build-time 密碼)。在後端與 router 設定 
`LLMOPS_ADMIN_TOKEN` +即可鎖住所有控制操作(啟動/停止/新增/編輯/移除 + 金鑰管理);UI 會要求輸入一次 +token 並在 session 內沿用。在 router 設定 `LLMOPS_REQUIRE_API_KEY=true` 則要求所有 +`/v1/*` 推理都帶 bearer token(admin token,或在 **API 金鑰** 頁建立的金鑰)。兩者預設 +關閉,方便本機開發。 + +## 手動 / 開發啟動 + +也可自行啟動三個部分(Python 依賴在 repo 根目錄的 `.venv`): + +```bash +# Dashboard 後端(:5000) +cd apps/backend && pip install -r requirements.txt +uvicorn main:app --host 0.0.0.0 --port 5000 + +# LLM router(:8887)— 細節見 apps/router-server/README_zh.md +cd apps/router-server && pip install -r requirements.txt +sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py +``` + +配置文件統一使用 `packages/config-schema/config.yaml`(單一來源),確保前端、後端與 +router 讀到同一份設定。 + +## 環境需求 + +- **GPU**:NVIDIA GPU(建議 CUDA 13.1+) +- **記憶體**:16GB+ RAM(依模型大小而定) +- **硬碟**:50GB+ 可用空間 diff --git a/docs/features.md b/docs/features.md new file mode 100644 index 0000000..82cc341 --- /dev/null +++ b/docs/features.md @@ -0,0 +1,87 @@ +# Features in depth + +> [中文](features_zh-CN.md) + +## Model management + +- Multi-model, multi-instance management on vLLM (LLM, Embedding, Reranker). +- Per-instance lifecycle (start/stop) with a live state machine + (`stopped → starting → ready → failed/stopping`), driven by a reconciler that + derives the true state from process liveness + `/health` probes. +- **Add models from the UI by pasting a `vllm serve …` command** — it is parsed into + an editable form and layered on as a dynamic *overlay*, so the hand-maintained + `config.yaml` stays untouched; the router hot-reloads (`POST /reload`) so new models + are routable end-to-end. +- Load-aware routing: the router auto-selects the least-loaded instance (weighting + running / waiting requests + KV-cache usage). + +## Reliability + +- **VRAM pre-flight guard** — blocks a start that would likely OOM, with a one-click + *Force start* override. +- **GPU auto-placement** — an instance with no pinned `cuda_device` is placed on the + GPU with the most free memory. 
+- **Auto-restart** — a managed model that crashes is restarted with exponential + backoff (configurable budget, resets once healthy). + +## Observability + +- Real-time status via Server-Sent Events (no polling). +- **System topology** (Vue Flow) — a live mission-control graph of Clients → Router → + model groups / Embedding → GPUs, with animated traffic edges, GPU-placement edges, + and a control plane; nodes are clickable drill-ins. +- **Router load-balancing view** — an animated fan showing each replica's real traffic + share and the instance the router will pick next. +- **Grafana monitoring** (bundled) — see [monitoring.md](monitoring.md). +- Per-model usage (count, error rate, p50/p95 latency, tokens), request log, and a + state-transition event timeline. +- GPU / CPU / memory monitoring plus a GPU-process inventory. + +## Playground + +- OpenAI-compatible **chat (streaming)**, completions, **embeddings**, and + **reranking**, sent straight through the router. +- **Reasoning ("thinking") display** — when a model runs with a vLLM reasoning parser, + the `reasoning` stream is shown in a collapsible block above the answer. + +## Benchmarking & evaluation (evalscope) + +- **Load testing** (`/benchmark`) — concurrency sweep, arrival-rate open-loop, + multi-turn, **SLA auto-tune**, plus **embedding / rerank** throughput and + single-request **speed benchmark**; each run is an isolated subprocess, with live + charts, run comparison, and the full evalscope HTML report. + See [evalscope_模型壓測整理.md](evalscope_模型壓測整理.md). +- **Accuracy / quality evaluation** (`/eval`) — **30+ benchmark datasets** grouped by + capability tier (Baseline, Knowledge, Chinese, Reasoning, Math, Multilingual, + **Tool-calling**, **Long-context**, Code, and judge-scored QA): MMLU/ARC/GSM8K/IFEval, + C-Eval/C-MMLU, GPQA/MMLU-Pro, AIME, HumanEval, ToolBench/General-FunctionCall, + Needle-in-a-Haystack, … + See [evalscope_LLM評測集整理.md](evalscope_LLM評測集整理.md). 
+ - Per-dataset scores, a **run-to-run comparison matrix** (highlights the best per + dataset), and the interactive HTML report. + - **LLM-as-judge** for free-form QA — pick one of your own deployed models (via the + router) or an external OpenAI-compatible API. + - **Advanced `dataset_args`** — few-shot count + raw per-dataset overrides (subset + selection, etc.). + - Sanity guards: judge-scored datasets require a judge; long-context and real + tool-calling datasets warn about their model prerequisites (large `max_model_len`, + vLLM tool parser). + +## Libraries + +- **Model library** (`/library`) — scan / pre-download / delete HF model weights from + the UI, with live download progress. +- **Dataset library** (`/datasets`) — pre-download load-test and evaluation datasets + into the shared ModelScope cache so a run never stalls on a first-time download. +- **Tool-calling config helper** — the model editor maps model families to the right + vLLM `tool_call_parser` (Qwen→`hermes`, Qwen3-Coder→`qwen3_xml`, + Llama→`llama3_json`/`llama4_pythonic`, …) with one-click preset insertion. + See [vllm_auto_tool_整理.md](vllm_auto_tool_整理.md). +- **LoRA** — see [vLLM_LoRA_部署整理.md](vLLM_LoRA_部署整理.md). + +## UX & security + +- Light / dark theme, dense "control-room" interface. +- **Admin-token-gated control** (start / stop / add / edit / remove) and **API-key + management** — mint/revoke keys that authenticate router inference, with per-key + usage attribution in the request log. 
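Reviewer note on the load-aware routing bullet in `docs/features.md`: the doc says the router picks the least-loaded instance by weighting running / waiting requests plus KV-cache usage, but it does not state the formula. Below is a minimal sketch of how such a score could work. The `InstanceLoad` type, the weight values, and `pick_least_loaded` are hypothetical names chosen for illustration; none of them are taken from the router's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceLoad:
    """Hypothetical per-instance load snapshot (not the router's real type)."""
    instance_id: str
    running: int           # requests currently being served
    waiting: int           # requests queued behind them
    kv_cache_usage: float  # fraction of KV cache in use, 0.0 to 1.0


def load_score(load, w_running=1.0, w_waiting=2.0, w_kv=10.0):
    # Assumed weights: a queued request counts more than a running one,
    # and a nearly full KV cache is penalised heavily.
    return (w_running * load.running
            + w_waiting * load.waiting
            + w_kv * load.kv_cache_usage)


def pick_least_loaded(loads):
    """Return the instance with the lowest score; ties go to the first one."""
    if not loads:
        raise ValueError("no ready instances to route to")
    return min(loads, key=load_score)
```

With the two replicas from the `config.yaml` example, `pick_least_loaded([InstanceLoad("qwen3", 4, 1, 0.5), InstanceLoad("qwen3-2", 2, 0, 0.3)])` would return the `qwen3-2` entry. If the real router uses different inputs or weights, the docs would benefit from stating them next to this bullet.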
diff --git a/docs/features_zh-CN.md b/docs/features_zh-CN.md new file mode 100644 index 0000000..2d7192e --- /dev/null +++ b/docs/features_zh-CN.md @@ -0,0 +1,70 @@ +# 功能特色(詳細) + +> [English](features.md) + +## 模型管理 + +- 基於 vLLM 的多模型、多實例管理(LLM、Embedding、Reranker)。 +- 每個實例獨立的生命週期(啟動/停止),具即時狀態機 + (`stopped → starting → ready → failed/stopping`),由 reconciler 從「進程存活 + + `/health` 探測」推導真實狀態。 +- **在前端貼上 `vllm serve …` 指令即可新增模型** — 解析成可編輯表單,以動態 *overlay* + 疊加,**不動手寫的 `config.yaml`**;router 會熱重載(`POST /reload`),新模型端到端 + 可被路由。 +- 負載感知路由:router 自動選擇負載最低的實例(依運行中/等待中請求 + KV 快取使用率加權)。 + +## 可靠性 + +- **VRAM 預檢防呆** — 啟動前估算顯存,可能 OOM 就擋下,並提供一鍵 *Force start* 覆寫。 +- **GPU 自動擺放** — 未指定 `cuda_device` 的實例會自動擺到剩餘顯存最多的 GPU。 +- **失敗自動重啟** — managed 模型崩潰後以指數退避自動重啟(可設次數,恢復健康後重置)。 + +## 觀測性 + +- 透過 Server-Sent Events 即時更新狀態(免輪詢)。 +- **系統拓撲圖**(Vue Flow)— Clients → Router → 模型群組/Embedding → GPU 的即時 + mission-control 圖,含流動的流量邊、GPU 擺放邊與控制平面;節點可點擊下鑽。 +- **Router 負載平衡視圖** — 動畫扇形圖呈現每個副本的實際流量佔比,以及 router 下一個會 + 選的實例。 +- **Grafana 監控**(內建)— 見 [monitoring_zh-CN.md](monitoring_zh-CN.md)。 +- 每模型用量(次數、錯誤率、p50/p95 延遲、tokens)、請求日誌、狀態轉移事件時間軸。 +- GPU/CPU/記憶體監控,以及 GPU 進程清單。 + +## Playground + +- OpenAI 相容的 **chat(串流)**、completions、**embeddings**、**reranking**,直接經由 router。 +- **思考(reasoning)顯示** — 模型搭配 vLLM reasoning parser 時,`reasoning` 串流會顯示 + 在答案上方的可摺疊「思考過程」區塊。 + +## 壓測與評測(evalscope) + +- **壓測**(`/benchmark`)— 並發 sweep、到達率 open-loop、多輪、**SLA 自動調優**,以及 + **embedding/rerank** 吞吐與單請求**速度基準**;每次執行為獨立子進程,含即時圖表、 + run 比較、完整 evalscope HTML 報告。見 [evalscope_模型壓測整理.md](evalscope_模型壓測整理.md)。 +- **準確度/品質評測**(`/eval`)— **30+ 個基準資料集**,依能力分組(基線、知識進階、 + 中文、推理、數學、多語言、**工具調用**、**長上下文**、程式碼、需裁判的問答): + MMLU/ARC/GSM8K/IFEval、C-Eval/C-MMLU、GPQA/MMLU-Pro、AIME、HumanEval、 + ToolBench/General-FunctionCall、Needle-in-a-Haystack… + 見 [evalscope_LLM評測集整理.md](evalscope_LLM評測集整理.md)。 + - 每資料集分數、**run 對 run 的比較表**(每列標出最高分)、互動式 HTML 報告。 + - **裁判模型(LLM-as-judge)** 給自由問答評分 — 可選自家部署的模型(經 router)或外部 + OpenAI 相容 API。 + - **進階 `dataset_args`** — 
few-shot 數 + 依資料集的原始覆寫(子集選擇等)。 + - 防呆:需裁判的資料集會強制設定裁判;長上下文與真實工具調用資料集會提醒模型前提 + (夠大的 `max_model_len`、vLLM tool parser)。 + +## 資料庫 + +- **模型庫**(`/library`)— 在 UI 掃描/預下載/刪除 HF 權重,含即時下載進度。 +- **資料集庫**(`/datasets`)— 預先下載壓測與評測資料集到共用 ModelScope 快取,執行時 + 就不會卡在首次下載。 +- **工具調用設定助手** — 模型編輯器把模型家族對應到正確的 vLLM `tool_call_parser` + (Qwen→`hermes`、Qwen3-Coder→`qwen3_xml`、Llama→`llama3_json`/`llama4_pythonic`…), + 一鍵帶入。見 [vllm_auto_tool_整理.md](vllm_auto_tool_整理.md)。 +- **LoRA** — 見 [vLLM_LoRA_部署整理.md](vLLM_LoRA_部署整理.md)。 + +## 使用體驗與安全 + +- 明暗雙主題、資訊密集的「控制室」介面。 +- **管理員權杖控管**控制操作(啟動/停止/新增/編輯/移除),以及 **API 金鑰管理** — + 發行/撤銷用於 router 推理的金鑰,並在請求日誌中做 per-key 用量歸屬。 diff --git a/docs/monitoring.md b/docs/monitoring.md new file mode 100644 index 0000000..56ab1a7 --- /dev/null +++ b/docs/monitoring.md @@ -0,0 +1,33 @@ +# Monitoring (Prometheus + Grafana) + +> [中文](monitoring_zh-CN.md) + +The stack bundles a full **Prometheus → Grafana** pipeline, no manual setup. + +- The **backend** writes a Prometheus file-based service-discovery file + (`LLMOPS_PROMETHEUS_SD_PATH`) listing every *ready* vLLM instance, refreshed as + models start/stop — so a dynamic fleet is scraped with zero config edits. +- **Prometheus** (`:9090`) scrapes those instances' `/metrics` plus `dcgm-exporter` + (GPU) and `node-exporter` (host). +- **Grafana** is served single-origin at **`http://localhost:8884/grafana`** + (anonymous read-only; log in as `admin` / `GRAFANA_ADMIN_PASSWORD` to edit). + Datasource and dashboards are auto-provisioned from + [`deploy/grafana`](../deploy/grafana): + - **Overview** — single pane: health, latency SLO, capacity, GPU/host + - **vLLM Scheduling & Capacity** (custom) + - **Performance** / **Query** (official vLLM dashboards) + - **GPU** (DCGM) and **Host** (Node Exporter) + + The same dashboards are embedded in the dashboard's **Monitoring** tab, with SLO + threshold lines and model-lifecycle annotations. 
+- **Alerting**: provisioned vLLM alert rules (target down, TTFT p95, KV cache, + request queueing) route to a webhook contact point — set `GRAFANA_ALERT_WEBHOOK` in + `deploy/.env` (Slack/Discord/generic) and restart Grafana to receive them. + +```bash +curl http://localhost:9090/api/v1/targets # prometheus: scrape target health +# open http://localhost:8884/grafana # dashboards + alerts +``` + +For background on the metrics and the design rationale, see +[vllm_grafana_monitoring_guide.md](vllm_grafana_monitoring_guide.md). diff --git a/docs/monitoring_zh-CN.md b/docs/monitoring_zh-CN.md new file mode 100644 index 0000000..bc7148b --- /dev/null +++ b/docs/monitoring_zh-CN.md @@ -0,0 +1,29 @@ +# 監控(Prometheus + Grafana) + +> [English](monitoring.md) + +整套內建完整的 **Prometheus → Grafana** 流程,免手動設定。 + +- **後端**寫出 Prometheus file-based service-discovery 檔(`LLMOPS_PROMETHEUS_SD_PATH`), + 列出每個 *ready* 的 vLLM 實例,並隨模型啟停刷新——所以動態艦隊免改設定即被抓取。 +- **Prometheus**(`:9090`)抓取這些實例的 `/metrics`,外加 `dcgm-exporter`(GPU)與 + `node-exporter`(主機)。 +- **Grafana** 以單一來源服務於 **`http://localhost:8884/grafana`**(匿名唯讀;以 + `admin` / `GRAFANA_ADMIN_PASSWORD` 登入可編輯)。datasource 與 dashboards 由 + [`deploy/grafana`](../deploy/grafana) 自動 provision: + - **總覽** — 單一頁面:健康、延遲 SLO、容量、GPU/主機 + - **vLLM 排程與容量**(自訂) + - **Performance** / **Query**(官方 vLLM dashboards) + - **GPU**(DCGM)與 **Host**(Node Exporter) + + 同一批 dashboards 也嵌入控制台的 **監控** 分頁,含 SLO 門檻線與模型生命週期標註。 +- **告警**:已 provision 的 vLLM 告警規則(target down、TTFT p95、KV cache、請求排隊) + 路由到一個 webhook contact point —— 在 `deploy/.env` 設 `GRAFANA_ALERT_WEBHOOK` + (Slack/Discord/通用)並重啟 Grafana 即可收到通知。 + +```bash +curl http://localhost:9090/api/v1/targets # prometheus:scrape target 健康狀態 +# 開啟 http://localhost:8884/grafana # dashboards 與告警 +``` + +指標背景與設計理念見 [vllm_grafana_monitoring_guide.md](vllm_grafana_monitoring_guide.md)。
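Reviewer note on `docs/monitoring.md`: the backend is described as writing a Prometheus file-based service-discovery file (`LLMOPS_PROMETHEUS_SD_PATH`) that lists every *ready* vLLM instance. A minimal sketch of what such a writer could look like follows, assuming Prometheus's standard `file_sd` JSON shape (a list of objects with `targets` and `labels`). The function names, the `model` / `instance_id` label names, and the write-then-rename approach are illustrative assumptions, not the backend's actual code.

```python
import json
import os
import tempfile


def build_sd_targets(instances):
    """Build file_sd target groups from (instance_id, model, host, port, state) tuples.

    Only instances whose state is "ready" are listed.
    """
    groups = []
    for instance_id, model, host, port, state in instances:
        if state != "ready":
            continue
        groups.append({
            "targets": [f"{host}:{port}"],
            # Label names are assumptions for this sketch.
            "labels": {"model": model, "instance_id": instance_id},
        })
    return groups


def write_sd_file(path, instances):
    """Write the target list via a temp file + rename so readers never see a partial file."""
    groups = build_sd_targets(instances)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(groups, handle, indent=2)
        os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
    return groups
```

A caller would pass the path from `LLMOPS_PROMETHEUS_SD_PATH` and re-run the writer whenever an instance changes state. Prometheus only picks up `file_sd` files ending in `.json`, `.yml`, or `.yaml`, so the configured path should use one of those extensions.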