Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
335 changes: 74 additions & 261 deletions README.md

Large diffs are not rendered by default.

315 changes: 71 additions & 244 deletions README_zh-CN.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion apps/backend/app/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ async def lifespan(app: FastAPI):


def create_app() -> FastAPI:
app = FastAPI(title="LLM Router Dashboard Backend", lifespan=lifespan)
app = FastAPI(title="vLLMux Backend", lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
Expand Down
2 changes: 1 addition & 1 deletion apps/frontend_llmops/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
<meta charset="UTF-8">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Vite App</title>
<title>vLLMux</title>
</head>
<body>
<div id="app"></div>
Expand Down
2 changes: 1 addition & 1 deletion apps/frontend_llmops/src/components/layout/AppSidebar.vue
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ const nav = [
<Server class="size-4.5" />
</div>
<div class="leading-tight">
<p class="text-sm font-semibold">LLMOps</p>
<p class="text-sm font-semibold">vLLMux</p>
<p class="text-[10px] uppercase tracking-widest text-muted-foreground">控制台</p>
</div>
</div>
Expand Down
Binary file modified assets/image0.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified assets/image1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified assets/image2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file removed assets/image3.png
Binary file not shown.
Binary file removed assets/image4.png
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/REFACTOR_PLAN.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# LLM-Router-Server-Dashboard 重構方案(Monorepo)
# vLLMux 重構方案(Monorepo)

> 狀態:**已執行(Phase 0–4 完成)**。三個子專案已整理成 `apps/` + `packages/` + `deploy/` 的正式 Monorepo。
> 後端已分層並收斂 config、前端已建立 services/stores/router/views、共用 config-schema 已上線,
Expand Down
93 changes: 93 additions & 0 deletions docs/configuration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# Configuration

> [中文](configuration_zh-CN.md)

The configuration file lives at `packages/config-schema/config.yaml` — the single
source of truth, validated by `packages/config-schema/schema.py`, and read by the
frontend, backend, and router alike. It controls all model startup parameters.

> You usually **don't** edit this by hand: add models from the UI by pasting a
> `vllm serve …` command, which is layered on as a dynamic overlay. Edit `config.yaml`
> only for the canonical, hand-maintained fleet.

## `config.yaml` structure

```yaml
# Router server configuration
server:
host: "0.0.0.0"
port: 8887
uvicorn_log_level: "info"

# LLM model configuration
LLM_engines:
Qwen3-0.6B:
instances:
- id: "qwen3"
host: "localhost"
port: 8002
cuda_device: 0
- id: "qwen3-2"
host: "localhost"
port: 8004
cuda_device: 0

model_config:
model_tag: "Qwen/Qwen3-0.6B"
dtype: "float16"
max_model_len: 500
gpu_memory_utilization: 0.35
tensor_parallel_size: 1

# Embedding server configuration (optional)
embedding_server:
host: "localhost"
port: 8005
cuda_device: 1

embedding_models:
m3e-base:
model_name: "moka-ai/m3e-base"
model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
max_length: 512
use_gpu: true
use_float16: true

reranking_models:
bge-reranker-large:
model_name: "BAAI/bge-reranker-large"
model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
max_length: 512
use_gpu: true
use_float16: true
```

## Key parameters

| Parameter | Description | Recommended |
|------|------|--------|
| `gpu_memory_utilization` | GPU memory usage ratio | 0.6–0.9 |
| `max_model_len` | Maximum context length | Based on model capability |
| `tensor_parallel_size` | Multi-GPU parallelism count | Number of GPUs |
| `dtype` | Inference precision | float16 (faster) / bfloat16 (more stable) |
| `cuda_device` | GPU device number | 0, 1, 2… |

## Running multiple models at once

Yes — as long as they fit in GPU memory. A **VRAM pre-flight guard** blocks a start
that would overflow the target GPU (override per-start with *Force start*), and
instances without a pinned `cuda_device` are **auto-placed** on the GPU with the most
free memory. On a single small GPU you'll typically run one mid-size model alongside
a few small ones; models are started on demand, so a large fleet can be configured
without all running at once.

Tune the guard / restart policy via env on the backend:

| Env | Purpose |
|---|---|
| `LLMOPS_VRAM_GUARD` | Enable/disable the VRAM pre-flight guard |
| `LLMOPS_AUTO_RESTART` | Auto-restart a crashed managed model |
| `LLMOPS_MAX_RESTARTS` | Restart budget before giving up |
| `LLMOPS_RESTART_BACKOFF` | Exponential backoff base |
89 changes: 89 additions & 0 deletions docs/configuration_zh-CN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# 配置說明

> [English](configuration.md)

配置文件位於 `packages/config-schema/config.yaml`——單一來源,由
`packages/config-schema/schema.py` 驗證,前端、後端與 router 都讀同一份。它控制所有
模型的啟動參數。

> 通常**不需要**手動編輯:從前端貼上 `vllm serve …` 指令新增模型,會以動態 overlay 疊加。
> 只有要維護「正式、手寫」的模型清單時才改 `config.yaml`。

## `config.yaml` 結構

```yaml
# 路由服務器配置
server:
host: "0.0.0.0"
port: 8887
uvicorn_log_level: "info"

# LLM 模型配置
LLM_engines:
Qwen3-0.6B:
instances:
- id: "qwen3"
host: "localhost"
port: 8002
cuda_device: 0
- id: "qwen3-2"
host: "localhost"
port: 8004
cuda_device: 0

model_config:
model_tag: "Qwen/Qwen3-0.6B"
dtype: "float16"
max_model_len: 500
gpu_memory_utilization: 0.35
tensor_parallel_size: 1

# Embedding 服務器配置(可選)
embedding_server:
host: "localhost"
port: 8005
cuda_device: 1

embedding_models:
m3e-base:
model_name: "moka-ai/m3e-base"
model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
max_length: 512
use_gpu: true
use_float16: true

reranking_models:
bge-reranker-large:
model_name: "BAAI/bge-reranker-large"
model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
max_length: 512
use_gpu: true
use_float16: true
```

## 關鍵參數

| 參數 | 說明 | 建議值 |
|------|------|--------|
| `gpu_memory_utilization` | GPU 記憶體使用比例 | 0.6–0.9 |
| `max_model_len` | 最大上下文長度 | 依模型能力 |
| `tensor_parallel_size` | 多 GPU 並行數 | GPU 數量 |
| `dtype` | 推理精度 | float16(速度快) / bfloat16(更穩定) |
| `cuda_device` | GPU 設備編號 | 0, 1, 2… |

## 同時啟動多個模型

可以——只要顯存放得下。**VRAM 預檢防呆**會擋下會撐爆目標 GPU 的啟動(可用 *Force start*
逐次覆寫),未指定 `cuda_device` 的實例會**自動擺放**到剩餘顯存最多的 GPU。單張小卡通常
能跑一顆中型模型加幾顆小模型;模型是按需啟動的,所以可以設定一大批而不必全部同時運行。

可在後端用環境變數調整防呆/重啟策略:

| 環境變數 | 用途 |
|---|---|
| `LLMOPS_VRAM_GUARD` | 啟用/關閉 VRAM 預檢防呆 |
| `LLMOPS_AUTO_RESTART` | 崩潰的 managed 模型自動重啟 |
| `LLMOPS_MAX_RESTARTS` | 放棄前的重啟次數上限 |
| `LLMOPS_RESTART_BACKOFF` | 指數退避基數 |
115 changes: 115 additions & 0 deletions docs/deployment.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# Deployment & Topology

> [中文](deployment_zh-CN.md)

The whole stack — dashboard backend, LLM router, Prometheus, Grafana, the GPU/host
exporters, and the Vue frontend — is built and started by a single Compose file.
Requires Docker with the NVIDIA Container Toolkit (on WSL2, enable GPU support in
Docker Desktop).

```bash
cp deploy/.env.example deploy/.env # set HF_TOKEN, which GPUs, the admin token
make up # docker compose -f deploy/docker-compose.yaml up -d --build
# open http://localhost:8884
```

`make down` stops it, `make logs` tails all services, `make ps` shows status.

## Services

See [`deploy/docker-compose.yaml`](../deploy/docker-compose.yaml).

| Service | Image | Port | Role |
|------------------|------------------------|---------|------|
| `backend` | `llmops-engine` (GPU) | 5000 | Dashboard API; spawns vLLM subprocesses on `:800x` |
| `router` | `llmops-engine` | 8887 | OpenAI-compatible router; **shares the backend's network namespace** so it reaches those localhost vLLM ports |
| `prometheus` | `prom/prometheus` | 9090 | Scrapes the vLLM fleet's `/metrics` via file-based SD; **also shares the backend's netns** so `localhost:800x` resolves to the spawned instances |
| `grafana` | `grafana/grafana` | (proxied) | Dashboards + alerting; served single-origin under `/grafana` via the frontend nginx |
| `dcgm-exporter` | `nvcr.io/.../dcgm-exporter` (GPU) | 9400 | NVIDIA GPU telemetry (util, memory, temperature, power) |
| `node-exporter` | `prom/node-exporter` | 9100 | Host metrics (CPU, RAM, disk, network) |
| `frontend` | `llmops-frontend` | 8884 | nginx serving the SPA + reverse-proxying `/api` → backend, `/v1` → router, `/grafana` → grafana |

### Why one image, multiple services on one netns

Only the backend truly needs vLLM (it launches the subprocesses), and the router +
Prometheus must see them on `localhost` — so a single
[`engine.Dockerfile`](../deploy/engine.Dockerfile) (based on the official
`vllm/vllm-openai`) runs as `backend` + `router`, joined (with Prometheus) by
`network_mode: service:backend`.

The frontend reaches the backend, router, and Grafana through nginx on a single
origin, so no host/port is baked into the build.

### Persistence

- SQLite + the dynamic-model overlay → `llmops-data` named volume
- Prometheus TSDB → `prometheus-data`; Grafana state → `grafana-data`
- Model **weights** are bind-mounted from the host HF cache (`HF_CACHE_DIR`, default
`~/.cache/huggingface`) so they're browsable locally and shared with host-side tools
- `packages/config-schema/config.yaml` is bind-mounted too, so you can edit models
without rebuilding

> **Model lifecycle**: the router only routes and load-balances — it never launches
> models. vLLM instances (and the Embedding/Reranker server) are owned by the backend
> and started on demand from the **Models** page (or `POST /api/models/{key}/start`).
> The backend and router both merge the dynamic-model overlay at startup, so models
> added from the UI survive restarts.

### Verify

```bash
curl http://localhost:8887/v1/models # router: configured model groups
curl http://localhost:5000/api/models # backend: lifecycle state of each instance
```

## Frontend (Web dashboard)

The dashboard lives in **`apps/frontend_llmops`** — Vue 3 + Vite + TypeScript,
Tailwind CSS v4, shadcn-vue components, [Vue Flow](https://vueflow.dev) for the
topology/router graphs, Pinia + Vue Router. (The older `apps/frontend` is deprecated.)

```bash
cd apps/frontend_llmops
npm install
npm run dev # http://localhost:5173
npm run build # production build → dist/
```

Configuration — `apps/frontend_llmops/.env`:

```env
VITE_API_BASE_URL=http://localhost:5000 # Dashboard backend (lifecycle, telemetry)
VITE_ROUTER_BASE_URL=http://localhost:8887 # LLM Router (inference + /metrics + /reload)
```

### Authentication

Authentication is backend-driven (not a build-time password). Set
`LLMOPS_ADMIN_TOKEN` on the backend + router to gate every control action (start /
stop / add / edit / remove + API-key management); the UI prompts for the token once
and reuses it for the session. Set `LLMOPS_REQUIRE_API_KEY=true` on the router to
require a bearer token (the admin token, or an API key minted on the **API Keys**
page) for all `/v1/*` inference. Both default to off for local dev.

## Manual / development run

Run the three pieces yourself (Python deps in the repo-root `.venv`):

```bash
# Dashboard backend (:5000)
cd apps/backend && pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 5000

# LLM router (:8887) — see apps/router-server/README.md for details
cd apps/router-server && pip install -r requirements.txt
sh scripts/start_all.sh ../../packages/config-schema/config.yaml ./configs/gunicorn.conf.py
```

Use `packages/config-schema/config.yaml` as the single source of truth so the
frontend, backend, and router all read the same configuration.

## Requirements

- **GPU**: NVIDIA GPU (CUDA 13.1+ recommended)
- **Memory**: 16GB+ RAM (depending on model size)
- **Disk**: 50GB+ available space
Loading
Loading