
Feat/grafana monitoring #5

Merged
milk333445 merged 3 commits into main from feat/grafana-monitoring
Jun 19, 2026

Conversation

@milk333445

Contributor

No description provided.

milk333445 and others added 3 commits June 19, 2026 20:29
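Complete monitoring stack (Phases 1-4), served from a single origin throughout and automatically following model start/stop:

- backend: add a prometheus_targets service; the reconciler dynamically
  writes file_sd targets whenever a vLLM instance enters or leaves READY,
  so Prometheus discovers the fleet automatically with no config changes
  (LLMOPS_PROMETHEUS_SD_PATH; unit tests included)
- deploy: add prometheus / grafana / dcgm-exporter / node-exporter
  services; prometheus and grafana share a netns with backend; nginx
  reverse-proxies /grafana (single origin, with absolute_redirect off to
  fix port redirects)
- grafana: provision the datasource plus the official vLLM
  (Performance/Query), DCGM, and Node Exporter dashboards; add a custom
  "vLLM Scheduling & Capacity" dashboard (scheduling / capacity /
  workload, with datasource + model_name + instance variables); 4 vLLM
  alert rules + a webhook contact point (supplied via env)
- frontend: add a "Monitoring" tab embedding 5 dashboards (kiosk mode,
  theme sync)
- remove /trends, now superseded by grafana (frontend page + backend
  timeseries endpoint and store methods)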

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
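
The file_sd mechanism described in the backend bullet above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function names `render_targets` and `write_targets`, the `job`/`model` labels, and the shape of the `ready_models` mapping are all assumptions. Only the JSON layout (a list of groups with `targets` and `labels`) is Prometheus's documented file_sd format.

```python
import json
import os
import tempfile
from pathlib import Path


def render_targets(ready_models):
    """Build file_sd target groups from {model_name: "host:port"}.

    The "job" and "model" labels are hypothetical; the PR does not say
    which labels the real service attaches.
    """
    return [
        {"targets": [address], "labels": {"job": "vllm", "model": name}}
        for name, address in sorted(ready_models.items())
    ]


def write_targets(path, ready_models):
    """Atomically rewrite the file_sd JSON file at `path`."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the
    # target, so Prometheus never reads a half-written file. The ".tmp"
    # suffix keeps it out of a "*.json" file_sd glob.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(render_targets(ready_models), handle, indent=2)
        # mkstemp creates the file as 0600; relax it so a Prometheus
        # process running as another user can read it.
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reconciler could call `write_targets(os.environ["LLMOPS_PROMETHEUS_SD_PATH"], ready)` each time the set of READY instances changes. Writing an empty mapping produces `[]`, which tells Prometheus there is currently nothing to scrape.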
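- add a "vLLM Overview" dashboard (single pane of glass): a system-health
  stat row, latency/throughput, capacity, and infrastructure (GPU + host)
  condensed onto one page; TTFT/E2E/KV threshold lines, an embedded
  alert-list panel, and annotations for model (re)start events detected
  via process_start_time_seconds
- frontend: add an "Overview" tab to the "Monitoring" page and make it
  the default
- official vLLM (Performance/Query) + Node Exporter: set spanNulls=true
  on all time-series panels so lines no longer break under intermittent
  traffic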

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
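
As an illustration of the spanNulls change above, here is a minimal sketch of patching a dashboard JSON document in bulk. It assumes the dashboards use Grafana's `timeseries` panel type, where the option lives at `fieldConfig.defaults.custom.spanNulls`. The helper name `set_span_nulls` is hypothetical, and the PR may well have edited the JSON files by hand instead.

```python
import copy


def set_span_nulls(dashboard, value=True):
    """Return a copy of a dashboard dict with spanNulls set on every
    timeseries panel; the input dict is left untouched."""
    patched = copy.deepcopy(dashboard)

    def visit(panels):
        for panel in panels:
            if panel.get("type") == "timeseries":
                custom = (
                    panel.setdefault("fieldConfig", {})
                    .setdefault("defaults", {})
                    .setdefault("custom", {})
                )
                custom["spanNulls"] = value
            # Collapsed rows keep their children in a nested "panels" list.
            visit(panel.get("panels", []))

    visit(patched.get("panels", []))
    return patched
```

Loading a provisioned dashboard with `json.load`, passing it through `set_span_nulls`, and writing it back with `json.dump` would apply the setting to every time-series panel while leaving stat, table, and other panel types untouched.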
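- Observability: now lists Grafana monitoring (dynamic SD discovery,
  GPU/host metrics, embedded monitoring tab, threshold lines /
  annotations / alerts); remove the entries for the deleted Trends page
- Docker topology table: add prometheus / grafana / dcgm-exporter /
  node-exporter; add the /grafana reverse proxy to frontend; the
  description section now covers netns sharing and the
  prometheus/grafana volumes
- add a "Monitoring (Grafana)" subsection (English/Chinese); both
  versions kept in sync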

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@milk333445 milk333445 merged commit 5ca577f into main Jun 19, 2026
3 checks passed

1 participant