This document describes how to monitor Arks using Prometheus and Grafana.
Prerequisites:

- A Kubernetes cluster with the Prometheus Operator installed:

  ```bash
  # Install prometheus-operator using helm
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace
  ```

- Arks installed in your cluster
Arks provides two types of metrics:
- Runtime metrics from model serving
- Gateway metrics from the API gateway
Runtime metrics are collected from the model serving pods, including:
- Request Statistics
  - Success request count
  - Request prompt/generation length distribution
  - Request finish reasons (stop / length limit)

- Latency Metrics
  - End-to-end request latency (P50/P90/P95/P99)
  - Time to first token (P50/P90/P95/P99)
  - Time per output token (P50/P90/P95/P99)

- Throughput Metrics
  - Prompt token throughput
  - Decode token throughput
  - Number of running/waiting/swapped requests

- Cache Metrics
  - Prefix cache hit rate
  - Cache utilization (GPU/CPU)
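The latency percentiles listed above are typically computed from Prometheus histograms at query time. A sketch of the kind of queries involved, assuming vLLM-style metric names such as `vllm:e2e_request_latency_seconds_bucket` (the exact names may differ in your deployment):

```promql
# P99 end-to-end request latency over the last 5 minutes
histogram_quantile(0.99,
  sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))

# P90 time to first token
histogram_quantile(0.90,
  sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
```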
Gateway metrics are collected from the gateway service, including:
- Request Processing
  - Request counts
  - Processing duration
  - Error counts

- Token Usage
  - Input/output token counts
  - Token distribution

- Rate Limiting
  - Rate limit hits
  - Available tokens

- Quota Usage
  - Current quota usage
  - Quota limits
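Gateway counters like these are usually consumed as rates and ratios. A sketch using hypothetical metric names (`arks_gateway_requests_total` and `arks_gateway_errors_total` are illustrative assumptions, not confirmed names; check the gateway's `/metrics` endpoint for the real ones):

```promql
# Request rate per route over the last 5 minutes
sum by (route) (rate(arks_gateway_requests_total[5m]))

# Error ratio across all routes
sum(rate(arks_gateway_errors_total[5m]))
  / sum(rate(arks_gateway_requests_total[5m]))
```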
Arks automatically creates ServiceMonitor resources for both runtime and gateway services. You can find the configurations in:
```
config/prometheus/monitor-runtime.yaml   # For runtime metrics
config/prometheus/monitor-gw.yaml        # For gateway metrics
```

Arks provides a pre-configured Grafana dashboard for visualizing the metrics:

```
config/grafana/runtime-dashboard.json
```

To import the dashboard:
1. Access your Grafana UI
2. Click "+" -> "Import"
3. Upload the dashboard JSON file or paste its content
4. Select your Prometheus data source
5. Click "Import"
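As an alternative to the UI steps above, the dashboard can be imported programmatically through Grafana's HTTP API (`POST /api/dashboards/db`). A minimal sketch in Python, assuming Grafana is reachable at `localhost:3000` (e.g. via `kubectl port-forward`) and an API token is available in the `GRAFANA_API_TOKEN` environment variable; both are assumptions about your setup:

```python
import json
import os
import urllib.request

# Assumption: Grafana reachable at this URL; adjust for your cluster.
GRAFANA_URL = "http://localhost:3000"


def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap dashboard JSON in the envelope expected by POST /api/dashboards/db."""
    dashboard = dict(dashboard)
    dashboard["id"] = None  # let Grafana assign a fresh id on import
    return {"dashboard": dashboard, "overwrite": overwrite}


def import_dashboard(path: str, token: str) -> None:
    """Upload a dashboard JSON file to Grafana via its HTTP API."""
    with open(path) as f:
        payload = build_import_payload(json.load(f))
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    import_dashboard("config/grafana/runtime-dashboard.json",
                     os.environ["GRAFANA_API_TOKEN"])
```

Setting `overwrite` to true lets the script re-import an updated dashboard without first deleting the old one.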
The runtime dashboard includes several sections:
- Request Overview
  - Success request count
  - Request prompt/generation length distribution
  - Token length percentiles (P50/P90/P99)

- Latency Monitoring
  - End-to-end request latency
  - Time to first token
  - Time per output token
  - All metrics include P50/P90/P95/P99 percentiles

- Throughput Analysis
  - Prompt token processing rate
  - Decode token processing rate
  - Per-instance and total throughput

- Resource Utilization
  - Scheduler state (running/waiting/swapped requests)
  - Cache hit rates
  - Cache utilization

- Request States
  - Number of running requests per instance
  - Number of waiting requests per instance
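The per-instance and total throughput panels above are typically driven by counter rates. A sketch, again assuming vLLM-style metric names (`vllm:prompt_tokens_total`, `vllm:generation_tokens_total`), which may differ in your deployment:

```promql
# Total prompt token throughput across all instances
sum(rate(vllm:prompt_tokens_total[1m]))

# Per-instance decode (generation) token throughput
sum by (instance) (rate(vllm:generation_tokens_total[1m]))
```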
Common issues and solutions:
- Metrics not showing up
  - Check if the ServiceMonitor is properly created
  - Verify Prometheus can access the metrics endpoints
  - Check that pod labels match the ServiceMonitor selectors

- High latency
  - Monitor cache hit rates
  - Check the number of waiting requests
  - Analyze token throughput

- Low throughput
  - Check the scheduler state
  - Monitor cache utilization
  - Analyze request distribution across instances
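For the ServiceMonitor checks above, it helps to know roughly what the generated resource looks like. A hypothetical sketch of a runtime ServiceMonitor (the resource name, namespace, label selector, and port name here are assumptions; the actual definition lives in config/prometheus/monitor-runtime.yaml):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arks-runtime-monitor      # hypothetical name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: arks-runtime           # must match the runtime Service's labels
  endpoints:
    - port: metrics               # must match a named port on the Service
      path: /metrics
      interval: 15s
```

If metrics are missing, the most common mismatch is between `spec.selector.matchLabels` here and the labels on the runtime Service.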
Planned additions:

- More metrics for the gateway
- A gateway dashboard
