From c8ed6ceeba7449c3e38164f4e9ee0c821d66d414 Mon Sep 17 00:00:00 2001
From: GrammaTonic
Date: Mon, 2 Mar 2026 02:52:15 +0100
Subject: [PATCH 1/3] docs: add Prometheus monitoring documentation (Phase 5)

Create 6 new documentation files, 1 example scrape config, and update
5 existing files for the Prometheus metrics system.

New files: PROMETHEUS_SETUP.md, PROMETHEUS_USAGE.md,
PROMETHEUS_TROUBLESHOOTING.md, PROMETHEUS_ARCHITECTURE.md,
PROMETHEUS_METRICS_REFERENCE.md, PROMETHEUS_QUICKSTART.md,
monitoring/prometheus-scrape-example.yml

Updated: README.md (fix port 9090->9091, add doc links),
docs/README.md (add Prometheus section), docs/API.md (rewrite metrics
with correct names), config/runner.env.example (add metrics vars),
plan/feature-prometheus-monitoring-1.md

Implements: TASK-047 through TASK-056 (Issue #1063)
---
 README.md                                     |  39 +-
 config/runner.env.example                     |  13 +
 docs/API.md                                   |  29 +-
 docs/README.md                                |  11 +
 docs/features/PROMETHEUS_ARCHITECTURE.md      | 323 +++++++++++++
 docs/features/PROMETHEUS_METRICS_REFERENCE.md | 343 +++++++++++++
 docs/features/PROMETHEUS_QUICKSTART.md        | 125 +++++
 docs/features/PROMETHEUS_SETUP.md             | 275 +++++++++++
 docs/features/PROMETHEUS_TROUBLESHOOTING.md   | 452 ++++++++++++++++++
 docs/features/PROMETHEUS_USAGE.md             | 306 ++++++++++++
 monitoring/prometheus-scrape-example.yml      |  70 +++
 plan/feature-prometheus-monitoring-1.md       |  22 +-
 12 files changed, 1982 insertions(+), 26 deletions(-)
 create mode 100644 docs/features/PROMETHEUS_ARCHITECTURE.md
 create mode 100644 docs/features/PROMETHEUS_METRICS_REFERENCE.md
 create mode 100644 docs/features/PROMETHEUS_QUICKSTART.md
 create mode 100644 docs/features/PROMETHEUS_SETUP.md
 create mode 100644 docs/features/PROMETHEUS_TROUBLESHOOTING.md
 create mode 100644 docs/features/PROMETHEUS_USAGE.md
 create mode 100644 monitoring/prometheus-scrape-example.yml

diff --git a/README.md b/README.md
index 48c36329..b1ec02c5 100644
--- a/README.md
+++ b/README.md
@@ -402,19 +402,44 @@ docker compose -f docker/docker-compose.chrome.yml up -d
 
 ## 📊 Monitoring
 
-### Health Checks
+All runner types expose Prometheus-compatible metrics on port **9091** (container port). See the [Monitoring Quick Start](docs/features/PROMETHEUS_QUICKSTART.md) to get started in 5 minutes.
+
+### Metrics Endpoint
 
 ```bash
-# Check runner health
-curl http://localhost:8080/health
+# Standard runner metrics (host port 9091)
+curl http://localhost:9091/metrics
 
-# Prometheus metrics
-curl http://localhost:9090/metrics
+# Chrome runner metrics (host port 9092)
+curl http://localhost:9092/metrics
 
-# Grafana dashboard
-open http://localhost:3000
+# Chrome-Go runner metrics (host port 9093)
+curl http://localhost:9093/metrics
 ```
 
+### Grafana Dashboards
+
+Four pre-built dashboards are provided in `monitoring/grafana/dashboards/`:
+
+| Dashboard | File | Panels |
+|---|---|---|
+| Runner Overview | `runner-overview.json` | 12 |
+| DORA Metrics | `dora-metrics.json` | 12 |
+| Performance Trends | `performance-trends.json` | 14 |
+| Job Analysis | `job-analysis.json` | 16 |
+
+Import them into your Grafana instance or use the provisioning config for auto-loading.
+
+### Documentation
+
+- [Quick Start](docs/features/PROMETHEUS_QUICKSTART.md) - 5-minute setup
+- [Setup Guide](docs/features/PROMETHEUS_SETUP.md) - Full configuration
+- [Usage Guide](docs/features/PROMETHEUS_USAGE.md) - PromQL queries and alerts
+- [Metrics Reference](docs/features/PROMETHEUS_METRICS_REFERENCE.md) - All metric definitions
+- [Architecture](docs/features/PROMETHEUS_ARCHITECTURE.md) - System internals
+- [Troubleshooting](docs/features/PROMETHEUS_TROUBLESHOOTING.md) - Common issues
+- [API Reference](docs/API.md) - Endpoint details
+
 ## 🔧 Maintenance
 
 ### Scaling
diff --git a/config/runner.env.example b/config/runner.env.example
index 8cb65eb9..08b8db30 100644
--- a/config/runner.env.example
+++ b/config/runner.env.example
@@ -61,6 +61,19 @@ REGISTRY=ghcr.io/grammatonic
 RUNNER_IMAGE_TAG=latest
 CHROME_IMAGE_TAG=chrome-latest
 
+# ==========================================
+# OPTIONAL: Metrics & Monitoring
+# ==========================================
+
+# Runner type identifier (used in Prometheus labels)
+# RUNNER_TYPE=standard
+
+# Metrics HTTP server port (inside the container)
+# METRICS_PORT=9091
+
+# Metrics collector update interval in seconds
+# METRICS_UPDATE_INTERVAL=30
+
 # Resource Limits (uncomment to enable)
 # RUNNER_MEMORY_LIMIT=1g
 # RUNNER_CPU_LIMIT=1.0
diff --git a/docs/API.md b/docs/API.md
index db8d5988..d5ec8e9e 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -29,16 +29,29 @@ Returns the current health status of the runner (Chrome or normal).
 
 ### GET /metrics
 
-Returns Prometheus metrics for monitoring runner health and job execution.
+Returns Prometheus-formatted metrics for monitoring runner health and job execution.
 
-**Key Metrics:**
+**Port:** 9091 (container port). Host port mappings: 9091 (standard), 9092 (chrome), 9093 (chrome-go).
+
-- `github_runner_jobs_total` - Total jobs executed
-- `github_runner_jobs_duration_seconds` - Job execution time
-- `github_runner_registration_status` - Registration health (1 = registered, 0 = not registered)
-- `github_runner_last_job_timestamp` - Timestamp of last job
-- `github_runner_uptime_seconds` - Runner uptime in seconds
-- `github_runner_type` - Runner type (chrome/normal)
+**Content-Type:** `text/plain; version=0.0.4; charset=utf-8`
+
+**Metrics Exposed:**
+
+| Metric | Type | Description |
+|---|---|---|
+| `github_runner_status` | gauge | Runner status (1=online, 0=offline) |
+| `github_runner_info` | gauge | Runner metadata (name, type, version) |
+| `github_runner_uptime_seconds` | counter | Runner uptime in seconds |
+| `github_runner_jobs_total` | counter | Total jobs by status (total, success, failed) |
+| `github_runner_job_duration_seconds` | histogram | Job duration distribution (buckets: 60s–3600s) |
+| `github_runner_queue_time_seconds` | gauge | Average queue wait time (last 100 jobs) |
+| `github_runner_cache_hit_rate` | gauge | Cache hit rate by type (stubbed at 0) |
+| `github_runner_last_update_timestamp` | gauge | Unix timestamp of last metrics update |
+
+All metrics except `github_runner_last_update_timestamp` carry `runner_name` and `runner_type` labels.
+
+For full metric definitions, see [Metrics Reference](features/PROMETHEUS_METRICS_REFERENCE.md).
+For PromQL query examples, see [Usage Guide](features/PROMETHEUS_USAGE.md).
+
 ## Container Labels
diff --git a/docs/README.md b/docs/README.md
index bdd9e540..23e095cf 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -60,6 +60,17 @@ docs/
 
 - [Runner Self-Test](features/RUNNER_SELF_TEST.md) - Automated runner validation
 
+### Prometheus Monitoring
+
+- [Quick Start](features/PROMETHEUS_QUICKSTART.md) - 5-minute monitoring setup
+- [Setup Guide](features/PROMETHEUS_SETUP.md) - Full Prometheus and Grafana configuration
+- [Usage Guide](features/PROMETHEUS_USAGE.md) - PromQL queries, alerts, and dashboard customization
+- [Metrics Reference](features/PROMETHEUS_METRICS_REFERENCE.md) - Complete metric definitions
+- [Architecture](features/PROMETHEUS_ARCHITECTURE.md) - System design and data flow
+- [Troubleshooting](features/PROMETHEUS_TROUBLESHOOTING.md) - Common issues and fixes
+- [Grafana Dashboard Metrics](features/GRAFANA_DASHBOARD_METRICS.md) - Dashboard feature specification
+
+
 ### Releases
 
 - [Changelog](releases/CHANGELOG.md) - Full release history
diff --git a/docs/features/PROMETHEUS_ARCHITECTURE.md b/docs/features/PROMETHEUS_ARCHITECTURE.md
new file mode 100644
index 00000000..387c67eb
--- /dev/null
+++ b/docs/features/PROMETHEUS_ARCHITECTURE.md
@@ -0,0 +1,323 @@
+# Prometheus Monitoring Architecture
+
+## Status: ✅ Complete
+
+**Created:** 2026-03-02
+**Phase:** 5 - Documentation & User Guide
+**Task:** TASK-050
+
+---
+
+## Overview
+
+This document describes the internal architecture of the Prometheus monitoring system for GitHub Actions self-hosted runners. The system uses a pure-bash implementation (no external language runtimes) with netcat for HTTP serving.
+
+---
+
+## System Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                            Runner Container                             │
+│                                                                         │
+│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────┐       │
+│  │ GitHub Actions  │  │ metrics-         │  │ metrics-          │       │
+│  │ Runner Binary   │  │ collector.sh     │  │ server.sh         │       │
+│  │                 │  │ (background)     │  │ (background)      │       │
+│  │ Executes jobs   │  │ Updates every    │  │ Listens on        │       │
+│  │                 │  │ 30 seconds       │  │ port 9091         │       │
+│  └──────┬──────────┘  └──────┬───────────┘  └──────┬────────────┘       │
+│         │                    │                     │                    │
+│         │ Hook scripts       │ Reads + Writes      │ Reads              │
+│         │                    │                     │                    │
+│         ▼                    ▼                     ▼                    │
+│  ┌──────────────┐     ┌──────────────────┐  ┌───────────────────────┐   │
+│  │ job-started  │     │ /tmp/            │  │ HTTP Response         │   │
+│  │ .sh          │     │ runner_metrics   │  │ (Prometheus text)     │   │
+│  │              │     │ .prom            │  │                       │   │
+│  │ job-         │     │                  │  │ GET /metrics          │   │
+│  │ completed.sh │     │ (atomic writes)  │  │ → 200 OK              │   │
+│  └──────┬───────┘     └──────────────────┘  └──────────┬────────────┘   │
+│         │                       ▲                      │                │
+│         │ Appends               │ Reads                │                │
+│         ▼                       │                      │                │
+│  ┌──────────────┐               │                      │                │
+│  │ /tmp/        │───────────────┘                      │                │
+│  │ jobs.log     │                                      │                │
+│  │ (CSV)        │                                      │                │
+│  └──────────────┘                                      │                │
+│                                                        │                │
+│  Port 9091 ◄───────────────────────────────────────────┘                │
+└─────────────────────────────────────────────────────────────────────────┘
+           │
+           │ Prometheus scrapes :9091/metrics
+           ▼
+┌────────────────────┐          ┌────────────────────┐
+│ Prometheus Server  │─────────▶│ Grafana            │
+│ (User-Provided)    │ queries  │ (User-Provided)    │
+│                    │          │                    │
+│ Stores time-series │          │ 4 Pre-built        │
+│ data               │          │ Dashboards         │
+└────────────────────┘          └────────────────────┘
+```
+
+---
+
+## Component Descriptions
+
+### 1. Metrics Server (`docker/metrics-server.sh`)
+
+**Purpose:** Lightweight HTTP server that responds to Prometheus scrape requests.
+
+**Implementation:**
+
+- Uses `netcat` (`nc`) to listen on a TCP port (default: 9091).
+- On each incoming request, reads `/tmp/runner_metrics.prom` and returns it with an HTTP 200 response.
+- Returns HTTP 503 if the metrics file is missing.
+- Runs as a background process, started by the entrypoint script.
+
+**Key characteristics:**
+
+- Single-threaded (handles one request at a time).
+- Stateless - reads the metrics file on every request.
+- No request routing - all paths return the same metrics.
+- Content-Type: `text/plain; version=0.0.4; charset=utf-8` (Prometheus text format).
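The request handling described above can be sketched as a small shell function. This is a minimal illustration under stated assumptions, not the contents of `docker/metrics-server.sh`: the function name `build_response` is hypothetical, and only the 200/503 behaviour and the Content-Type value come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the HTTP response for a single scrape request.
# METRICS_FILE mirrors the documented default; build_response is an
# illustrative name, not a function from the repository.
METRICS_FILE="${METRICS_FILE:-/tmp/runner_metrics.prom}"

build_response() {
  local file="$1"
  if [ -f "$file" ]; then
    # Metrics file present: 200 with the Prometheus text-format content type.
    printf 'HTTP/1.0 200 OK\r\n'
    printf 'Content-Type: text/plain; version=0.0.4; charset=utf-8\r\n'
    printf 'Content-Length: %s\r\n\r\n' "$(wc -c < "$file" | tr -d ' ')"
    cat "$file"
  else
    # Metrics file missing: 503, as described for the real server.
    printf 'HTTP/1.0 503 Service Unavailable\r\n'
    printf 'Content-Type: text/plain\r\n\r\n'
    printf 'metrics file not found\n'
  fi
}
```

How the output is handed to `nc` depends on the netcat variant installed in the image, because listen flags differ between implementations, so the wiring to the listening socket is deliberately left out of the sketch.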
+
+**Configuration:**
+
+| Variable | Default | Description |
+|---|---|---|
+| `METRICS_PORT` | `9091` | TCP port to listen on |
+| `METRICS_FILE` | `/tmp/runner_metrics.prom` | Path to metrics file |
+
+### 2. Metrics Collector (`docker/metrics-collector.sh`)
+
+**Purpose:** Periodically reads system state and job logs to generate Prometheus-formatted metrics.
+
+**Implementation:**
+
+- Runs in an infinite loop with a configurable sleep interval (default: 30s).
+- Reads job data from `/tmp/jobs.log` (CSV format).
+- Computes counters, gauges, and histogram buckets.
+- Writes metrics atomically to `/tmp/runner_metrics.prom` (write temp → `mv`).
+
+**Metrics generated:**
+
+| Metric | Type | Source |
+|---|---|---|
+| `github_runner_status` | gauge | Always 1 while collector runs |
+| `github_runner_info` | gauge | Environment variables |
+| `github_runner_uptime_seconds` | counter | `$(date +%s) - $START_TIME` |
+| `github_runner_jobs_total` | counter | Parsed from `jobs.log` |
+| `github_runner_job_duration_seconds` | histogram | Computed from `jobs.log` durations |
+| `github_runner_queue_time_seconds` | gauge | Averaged from `jobs.log` queue times |
+| `github_runner_cache_hit_rate` | gauge | Stubbed (returns 0) |
+| `github_runner_last_update_timestamp` | gauge | `$(date +%s)` at write time |
+
+**Configuration:**
+
+| Variable | Default | Description |
+|---|---|---|
+| `METRICS_FILE` | `/tmp/runner_metrics.prom` | Output path |
+| `JOBS_LOG` | `/tmp/jobs.log` | Job log input path |
+| `UPDATE_INTERVAL` | `30` | Seconds between updates |
+| `RUNNER_NAME` | `unknown` | Runner name label |
+| `RUNNER_TYPE` | `standard` | Runner type label |
+| `RUNNER_VERSION` | `2.332.0` | Runner version label |
+
+### 3. Job Hook Scripts (`docker/job-started.sh`, `docker/job-completed.sh`)
+
+**Purpose:** Record job lifecycle events to the jobs log for metrics collection.
+
+**Implementation:**
+
+- Invoked by the GitHub Actions runner binary via environment variables:
+  - `ACTIONS_RUNNER_HOOK_JOB_STARTED` → `job-started.sh`
+  - `ACTIONS_RUNNER_HOOK_JOB_COMPLETED` → `job-completed.sh`
+- `job-started.sh` records a `running` entry and saves the start timestamp to a state file.
+- `job-completed.sh` calculates duration, determines status, and writes the final log entry.
+
+**Job Log Format** (`/tmp/jobs.log`):
+
+```
+timestamp,job_id,status,duration_seconds,queue_time_seconds
+```
+
+Example:
+
+```
+2026-03-02T10:00:00Z,12345_build,running,0,0
+2026-03-02T10:05:30Z,12345_build,success,330,12
+```
+
+**Job state directory:** `/tmp/job_state/` stores per-job start timestamps for duration calculation.
+
+### 4. Entrypoint Scripts (`docker/entrypoint.sh`, `docker/entrypoint-chrome.sh`)
+
+**Purpose:** Container initialization that starts the metrics system alongside the runner.
+
+**Startup sequence:**
+
+1. Configure and register the GitHub Actions runner.
+2. Initialize `/tmp/jobs.log` (touch).
+3. Copy hook scripts to the runner directory.
+4. Set `ACTIONS_RUNNER_HOOK_JOB_STARTED` and `ACTIONS_RUNNER_HOOK_JOB_COMPLETED`.
+5. Start `metrics-server.sh` in background.
+6. Start `metrics-collector.sh` in background.
+7. Start the GitHub Actions runner (foreground).
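The job log entries written by the hook scripts in section 3 can be sketched as follows. This is an illustration under stated assumptions rather than the actual `job-started.sh` and `job-completed.sh`: the function names, the `JOB_STATE_DIR` variable, and the `<job_id>.start` state file name are all hypothetical; only the CSV column order and the `running` placeholder entry come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the two job hooks. Function names, JOB_STATE_DIR and
# the "<job_id>.start" file name are assumptions made for illustration.
JOBS_LOG="${JOBS_LOG:-/tmp/jobs.log}"
JOB_STATE_DIR="${JOB_STATE_DIR:-/tmp/job_state}"

# ISO 8601 UTC timestamp, matching the format shown in the log example.
now_iso() { date -u +%Y-%m-%dT%H:%M:%SZ; }

record_job_start() {
  local job_id="$1"
  mkdir -p "$JOB_STATE_DIR"
  # Remember when the job started so the completion hook can compute duration.
  date +%s > "$JOB_STATE_DIR/$job_id.start"
  # Preliminary entry: status "running", duration and queue time set to 0.
  printf '%s,%s,running,0,0\n' "$(now_iso)" "$job_id" >> "$JOBS_LOG"
}

record_job_end() {
  local job_id="$1" status="$2" queue_time="${3:-0}"
  local state="$JOB_STATE_DIR/$job_id.start" start end
  end="$(date +%s)"
  # Fall back to a zero duration if the start timestamp is missing.
  start="$(cat "$state" 2>/dev/null || echo "$end")"
  printf '%s,%s,%s,%s,%s\n' \
    "$(now_iso)" "$job_id" "$status" "$((end - start))" "$queue_time" >> "$JOBS_LOG"
  rm -f "$state"
}
```

Calling `record_job_start 12345_build` followed by `record_job_end 12345_build success 12` appends one `running` line and one final line for the same job, which is the pairing the collector has to account for when it counts totals.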
+
+---
+
+## Data Flow
+
+```
+Job Execution  → job-started.sh   → /tmp/jobs.log (append "running" entry)
+                                    /tmp/job_state/.start (timestamp)
+
+Job Completion → job-completed.sh → /tmp/jobs.log (append final entry)
+                                    /tmp/job_state/.start (delete)
+
+Every 30s      → metrics-collector.sh → reads /tmp/jobs.log
+                                      → computes counters, histogram, queue time
+                                      → writes /tmp/runner_metrics.prom (atomic)
+
+On scrape      → metrics-server.sh → reads /tmp/runner_metrics.prom
+                                   → returns HTTP 200 with Prometheus text
+
+Prometheus     → scrapes :9091/metrics → stores time-series data
+
+Grafana        → queries Prometheus → renders dashboards
+```
+
+---
+
+## Design Decisions
+
+### Decision: Bash + Netcat (CON-001, CON-002)
+
+**Rationale:** The project constrains implementation to bash scripting with no additional language runtimes. Netcat is available in the base image (`ubuntu:resolute`) and is sufficient for serving simple HTTP responses. This avoids adding Python, Node.js, or Go dependencies to the runner image.
+
+**Trade-offs:**
+
+- (+) Zero additional dependencies.
+- (+) Minimal image size impact.
+- (+) Simple to debug and modify.
+- (-) Single-threaded HTTP server (one request at a time).
+- (-) No request routing (all paths return metrics).
+- (-) Limited HTTP compliance (HTTP/1.0 only).
+
+**Review:** If scrape concurrency becomes an issue, consider `socat` (multi-connection) or a lightweight Go binary.
+
+### Decision: File-Based Metrics Transfer
+
+**Rationale:** The collector writes metrics to a file; the server reads the file. This decouples the two processes and allows atomic updates via `mv`. No shared memory or IPC required.
+
+**Trade-offs:**
+
+- (+) Simple, robust, no race conditions (atomic `mv`).
+- (+) Easy to debug (`cat /tmp/runner_metrics.prom`).
+- (-) Slight latency (up to 30s stale data between updates).
+- (-) Disk I/O on each update (minimal - file is < 2KB).
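The write-temp-then-`mv` pattern behind the file-based transfer can be sketched as below. This is a minimal sketch, assuming the collector pipes its generated text into a helper; the function name `write_metrics_atomically` is hypothetical and not taken from `docker/metrics-collector.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an atomic metrics write: the new content goes to a
# temporary file first, then a rename swaps it into place in one step.
write_metrics_atomically() {
  local target="$1" tmp
  # Create the temp file next to the target so mv stays on one filesystem,
  # which is what makes the final rename atomic.
  tmp="$(mktemp "${target}.XXXXXX")" || return 1
  if cat > "$tmp"; then
    mv -f "$tmp" "$target"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

A caller would use it as `generate_metrics | write_metrics_atomically /tmp/runner_metrics.prom`, where `generate_metrics` stands in for whatever produces the text. A reader that opens the file at any moment then sees either the previous complete version or the new one, never a half-written file.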
+
+### Decision: CSV Job Log Format
+
+**Rationale:** A simple CSV format (`timestamp,job_id,status,duration,queue_time`) is easy to parse with standard shell tools (`grep`, `awk`, `read`). No external parsers needed.
+
+**Trade-offs:**
+
+- (+) Human-readable and inspectable.
+- (+) Easy to parse with bash built-ins.
+- (-) No schema enforcement.
+- (-) Unbounded growth (mitigated by reading only recent entries for queue time).
+
+### Decision: Stub Cache Metrics
+
+**Rationale:** BuildKit cache logs reside on the Docker host, not inside the runner container. APT and npm caches are internal to builds. Real cache hit rate data is not accessible from within the runner.
+
+**Trade-offs:**
+
+- (+) Metrics schema is future-proof (cache_type label ready).
+- (+) Dashboards already have cache panels.
+- (-) Currently returns 0 for all cache types.
+
+**Future:** A sidecar exporter running on the Docker host could parse BuildKit logs and expose real cache metrics.
+
+---
+
+## Multi-Runner Deployment
+
+When running multiple runner types simultaneously:
+
+```
+                  Docker Host
+┌──────────────────────────────────────────────┐
+│                                              │
+│  ┌──────────────┐   Host Port 9091           │
+│  │  Standard    │──────────────────┐         │
+│  │  Runner      │  Container: 9091 │         │
+│  └──────────────┘                  │         │
+│                                    ▼         │
+│  ┌──────────────┐   Host Port 9092 ┌───────┐ │
+│  │  Chrome      │─────────────────▶│ Prom  │ │
+│  │  Runner      │  Container: 9091 │ etheus│ │
+│  └──────────────┘                  └───────┘ │
+│                                    ▲         │
+│  ┌──────────────┐   Host Port 9093 │         │
+│  │  Chrome-Go   │──────────────────┘         │
+│  │  Runner      │  Container: 9091           │
+│  └──────────────┘                            │
+│                                              │
+└──────────────────────────────────────────────┘
+```
+
+Each runner type:
+
+- Listens on container port **9091** internally.
+- Maps to a unique **host port** (9091, 9092, 9093).
+- Has unique `runner_name` and `runner_type` labels.
+- Maintains its own `/tmp/jobs.log` and metrics files.
+
+---
+
+## Scalability Considerations
+
+| Factor | Current Limit | Mitigation |
+|---|---|---|
+| Scrape concurrency | 1 request at a time (netcat) | Prometheus retries; 15s scrape interval > response time |
+| Jobs log size | Unbounded growth | Queue time reads last 100 entries; restart resets log |
+| Metrics file size | ~2 KB per runner | Negligible disk impact |
+| CPU overhead | < 1% (bash + sleep loop) | Configurable `UPDATE_INTERVAL` |
+| Memory overhead | < 10 MB per runner | Bash processes, no JVM/runtime |
+| Number of runners | Unlimited (unique ports) | Network port planning required |
+
+For large deployments (100+ runners), consider:
+
+- Service discovery in Prometheus (file-based or DNS-based) instead of static targets.
+- A metrics aggregation proxy to reduce Prometheus scrape load.
+- Log rotation for `/tmp/jobs.log` to prevent disk exhaustion.
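The log rotation suggestion above could be approached with a simple trim that keeps only the newest entries. This is a sketch under stated assumptions, not an existing script in the repository: the function name `trim_jobs_log` and the default of 1000 retained lines are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: keep only the newest N lines of the jobs log.
# trim_jobs_log and the default of 1000 lines are illustrative choices.
trim_jobs_log() {
  local log="$1" keep="${2:-1000}" tmp
  [ -f "$log" ] || return 0
  # Write the tail to a temp file beside the log, then rename it into place.
  tmp="$(mktemp "${log}.XXXXXX")" || return 1
  tail -n "$keep" "$log" > "$tmp" && mv -f "$tmp" "$log"
}
```

Two caveats apply to a trim like this. Because the job counters are derived by counting lines in the log, dropping old lines makes them decrease, which Prometheus would interpret as a counter reset. A hook that appends while the trim is running could also lose its entry, so any real rotation would need to be coordinated with the hook scripts.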
+
+---
+
+## File Inventory
+
+| File | Purpose | Started By |
+|---|---|---|
+| `docker/metrics-server.sh` | HTTP server for `/metrics` | Entrypoint script |
+| `docker/metrics-collector.sh` | Periodic metrics generation | Entrypoint script |
+| `docker/job-started.sh` | Job start hook | Runner binary |
+| `docker/job-completed.sh` | Job completion hook | Runner binary |
+| `docker/entrypoint.sh` | Standard runner init | Docker CMD |
+| `docker/entrypoint-chrome.sh` | Chrome/Chrome-Go runner init | Docker CMD |
+| `monitoring/prometheus.yml` | Full Prometheus config example | User deploys |
+| `monitoring/prometheus-scrape-example.yml` | Minimal scrape config | User references |
+| `monitoring/grafana/dashboards/*.json` | 4 Grafana dashboards | User imports |
+| `monitoring/grafana/provisioning/dashboards/dashboards.yml` | Auto-load config | Grafana |
+
+---
+
+## Next Steps
+
+- [Setup Guide](PROMETHEUS_SETUP.md) - Deploy and configure
+- [Usage Guide](PROMETHEUS_USAGE.md) - PromQL queries and dashboards
+- [Metrics Reference](PROMETHEUS_METRICS_REFERENCE.md) - Full metric catalog
+- [Troubleshooting](PROMETHEUS_TROUBLESHOOTING.md) - Fix common issues
diff --git a/docs/features/PROMETHEUS_METRICS_REFERENCE.md b/docs/features/PROMETHEUS_METRICS_REFERENCE.md
new file mode 100644
index 00000000..eb9481e2
--- /dev/null
+++ b/docs/features/PROMETHEUS_METRICS_REFERENCE.md
@@ -0,0 +1,343 @@
+# Prometheus Metrics Reference
+
+## Status: ✅ Complete
+
+**Created:** 2026-03-02
+**Phase:** 5 - Documentation & User Guide
+**Task:** TASK-054
+
+---
+
+## Overview
+
+This document provides the complete reference for all Prometheus metrics exposed by the GitHub Actions self-hosted runner metrics endpoint on port 9091. Metrics are generated by `docker/metrics-collector.sh` and served by `docker/metrics-server.sh`.
+
+---
+
+## Common Labels
+
+All metrics carry these labels unless otherwise noted:
+
+| Label | Description | Example Values |
+|---|---|---|
+| `runner_name` | Name of the runner instance | `docker-runner`, `chrome-runner-1` |
+| `runner_type` | Type of runner | `standard`, `chrome`, `chrome-go` |
+
+---
+
+## Metric Catalog
+
+### `github_runner_status`
+
+| Property | Value |
+|---|---|
+| **Type** | Gauge |
+| **Description** | Runner online/offline status |
+| **Labels** | `runner_name`, `runner_type` |
+| **Values** | `1` = online, `0` = offline |
+| **Source** | Always `1` while the collector process is running |
+| **Update frequency** | Every 30 seconds |
+
+**Example:**
+
+```
+# HELP github_runner_status Runner status (1=online, 0=offline)
+# TYPE github_runner_status gauge
+github_runner_status{runner_name="docker-runner",runner_type="standard"} 1
+```
+
+**PromQL examples:**
+
+```promql
+# All online runners
+github_runner_status == 1
+
+# Offline runners (alert-worthy)
+github_runner_status == 0
+
+# Count of online runners by type
+count by (runner_type) (github_runner_status == 1)
+```
+
+---
+
+### `github_runner_info`
+
+| Property | Value |
+|---|---|
+| **Type** | Gauge |
+| **Description** | Runner metadata - always 1; informational labels carry the data |
+| **Labels** | `runner_name`, `runner_type`, `version` |
+| **Values** | Always `1` |
+| **Source** | Environment variables: `RUNNER_NAME`, `RUNNER_TYPE`, `RUNNER_VERSION` |
+
+**Example:**
+
+```
+# HELP github_runner_info Runner information
+# TYPE github_runner_info gauge
+github_runner_info{runner_name="docker-runner",runner_type="standard",version="2.332.0"} 1
+```
+
+**PromQL examples:**
+
+```promql
+# List all runners with their versions
+github_runner_info
+
+# Filter by version
+github_runner_info{version="2.332.0"}
+```
+
+---
+
+### `github_runner_uptime_seconds`
+
+| Property | Value |
+|---|---|
+| **Type** | Counter |
+| **Description** | Runner uptime since the metrics collector started (seconds) |
+| **Labels** | `runner_name`, `runner_type` |
+| **Values** | Monotonically increasing integer |
+| **Source** | `$(date +%s) - $START_TIME` where `START_TIME` is the collector launch epoch |
+
+**Example:**
+
+```
+# HELP github_runner_uptime_seconds Runner uptime in seconds
+# TYPE github_runner_uptime_seconds counter
+github_runner_uptime_seconds{runner_name="docker-runner",runner_type="standard"} 86400
+```
+
+**PromQL examples:**
+
+```promql
+# Uptime in hours
+github_runner_uptime_seconds / 3600
+
+# Longest uptime per runner type
+max by (runner_type) (github_runner_uptime_seconds)
+```
+
+---
+
+### `github_runner_jobs_total`
+
+| Property | Value |
+|---|---|
+| **Type** | Counter |
+| **Description** | Total number of jobs processed, segmented by status |
+| **Labels** | `runner_name`, `runner_type`, `status` |
+| **Status values** | `total`, `success`, `failed` |
+| **Source** | Parsed from `/tmp/jobs.log` - counts lines matching each status |
+
+**Example:**
+
+```
+# HELP github_runner_jobs_total Total number of jobs processed by status
+# TYPE github_runner_jobs_total counter
+github_runner_jobs_total{status="total",runner_name="docker-runner",runner_type="standard"} 50
+github_runner_jobs_total{status="success",runner_name="docker-runner",runner_type="standard"} 47
+github_runner_jobs_total{status="failed",runner_name="docker-runner",runner_type="standard"} 3
+```
+
+**PromQL examples:**
+
+```promql
+# Jobs per hour
+rate(github_runner_jobs_total{status="total"}[1h]) * 3600
+
+# Success rate (percentage); ignoring(status) lets the two sides match
+github_runner_jobs_total{status="success"}
+  / ignoring(status) github_runner_jobs_total{status="total"} * 100
+
+# Deployment frequency (successful jobs in 24h)
+sum(increase(github_runner_jobs_total{status="success"}[24h]))
+
+# Change failure rate
+sum(increase(github_runner_jobs_total{status="failed"}[24h]))
+  / sum(increase(github_runner_jobs_total{status="total"}[24h])) * 100
+```
+
+> **Note:** The `total` status count excludes entries with status `running` (preliminary entries written by `job-started.sh`).
+
+---
+
+### `github_runner_job_duration_seconds`
+
+| Property | Value |
+|---|---|
+| **Type** | Histogram |
+| **Description** | Distribution of job execution durations in seconds |
+| **Labels** | `runner_name`, `runner_type`, `le` (bucket boundary) |
+| **Bucket boundaries** | `60` (1 min), `300` (5 min), `600` (10 min), `1800` (30 min), `3600` (1 hr), `+Inf` |
+| **Sub-metrics** | `_bucket`, `_sum`, `_count` |
+| **Source** | Computed from duration field (column 4) in `/tmp/jobs.log` |
+
+**Example:**
+
+```
+# HELP github_runner_job_duration_seconds Histogram of job durations in seconds
+# TYPE github_runner_job_duration_seconds histogram
+github_runner_job_duration_seconds_bucket{le="60",runner_name="docker-runner",runner_type="standard"} 10
+github_runner_job_duration_seconds_bucket{le="300",runner_name="docker-runner",runner_type="standard"} 35
+github_runner_job_duration_seconds_bucket{le="600",runner_name="docker-runner",runner_type="standard"} 42
+github_runner_job_duration_seconds_bucket{le="1800",runner_name="docker-runner",runner_type="standard"} 48
+github_runner_job_duration_seconds_bucket{le="3600",runner_name="docker-runner",runner_type="standard"} 50
+github_runner_job_duration_seconds_bucket{le="+Inf",runner_name="docker-runner",runner_type="standard"} 50
+github_runner_job_duration_seconds_sum{runner_name="docker-runner",runner_type="standard"} 8542
+github_runner_job_duration_seconds_count{runner_name="docker-runner",runner_type="standard"} 50
+```
+
+**PromQL examples:**
+
+```promql
+# Median (p50) job duration
+histogram_quantile(0.50, rate(github_runner_job_duration_seconds_bucket[1h]))
+
+# 90th percentile
+histogram_quantile(0.90, rate(github_runner_job_duration_seconds_bucket[1h]))
+
+# 99th percentile
+histogram_quantile(0.99, rate(github_runner_job_duration_seconds_bucket[1h]))
+
+# Average job duration (Lead Time proxy)
+rate(github_runner_job_duration_seconds_sum[5m])
+  / rate(github_runner_job_duration_seconds_count[5m])
+
+# Jobs under 5 minutes
+github_runner_job_duration_seconds_bucket{le="300"}
+```
+
+> **Note:** Buckets are cumulative (each bucket includes all smaller buckets). The `+Inf` bucket equals `_count`.
+
+---
+
+### `github_runner_queue_time_seconds`
+
+| Property | Value |
+|---|---|
+| **Type** | Gauge |
+| **Description** | Average queue time in seconds (computed from last 100 completed jobs) |
+| **Labels** | `runner_name`, `runner_type` |
+| **Values** | Non-negative integer |
+| **Source** | Average of queue_time field (column 5) in `/tmp/jobs.log`, last 100 entries |
+
+**Example:**
+
+```
+# HELP github_runner_queue_time_seconds Average queue time in seconds (last 100 jobs)
+# TYPE github_runner_queue_time_seconds gauge
+github_runner_queue_time_seconds{runner_name="docker-runner",runner_type="standard"} 12
+```
+
+**PromQL examples:**
+
+```promql
+# Queue time per runner
+avg by (runner_name) (github_runner_queue_time_seconds)
+
+# Alert if queue time exceeds 5 minutes
+github_runner_queue_time_seconds > 300
+```
+
+> **Note:** Queue time is measured from job assignment to job start. A value of 0 means the job started immediately.
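The averaging described for this gauge can be sketched with `awk` and `tail`. This is an illustration under stated assumptions, not the collector's actual code: the function name `average_queue_time` is hypothetical, and the sketch assumes that "last 100 completed jobs" means the last 100 log lines whose status is not `running`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: average column 5 (queue_time_seconds) over the most
# recent completed entries. The function name and the exact definition of
# "completed" (status other than "running") are assumptions.
average_queue_time() {
  local log="$1" window="${2:-100}"
  [ -f "$log" ] || { echo 0; return 0; }
  # Drop preliminary "running" lines, keep the newest entries, then average.
  awk -F, '$3 != "running"' "$log" | tail -n "$window" |
    awk -F, '{ sum += $5; n++ } END { if (n) printf "%d\n", sum / n; else print 0 }'
}
```

The result is truncated to an integer, which matches the documented "non-negative integer" value range; a collector that wanted sub-second precision would print a float instead.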
+
+---
+
+### `github_runner_cache_hit_rate`
+
+| Property | Value |
+|---|---|
+| **Type** | Gauge |
+| **Description** | Cache hit rate by cache type (0.0 to 1.0) |
+| **Labels** | `runner_name`, `runner_type`, `cache_type` |
+| **Cache types** | `buildkit`, `apt`, `npm` |
+| **Values** | `0` (currently stubbed) |
+| **Source** | Stub function - returns 0 for all types |
+
+**Example:**
+
+```
+# HELP github_runner_cache_hit_rate Cache hit rate by type (0.0-1.0)
+# TYPE github_runner_cache_hit_rate gauge
+github_runner_cache_hit_rate{cache_type="buildkit",runner_name="docker-runner",runner_type="standard"} 0
+github_runner_cache_hit_rate{cache_type="apt",runner_name="docker-runner",runner_type="standard"} 0
+github_runner_cache_hit_rate{cache_type="npm",runner_name="docker-runner",runner_type="standard"} 0
+```
+
+> **Important:** Cache metrics are currently **stubbed** and always return 0. BuildKit cache logs exist on the Docker host, not inside the runner container. APT and npm caches are internal to build processes. Future work will add a sidecar exporter for real cache data.
+
+---
+
+### `github_runner_last_update_timestamp`
+
+| Property | Value |
+|---|---|
+| **Type** | Gauge |
+| **Description** | Unix timestamp of the last metrics update |
+| **Labels** | None |
+| **Values** | Unix epoch seconds |
+| **Source** | `$(date +%s)` at the time of metrics file generation |
+
+**Example:**
+
+```
+# HELP github_runner_last_update_timestamp Unix timestamp of last metrics update
+# TYPE github_runner_last_update_timestamp gauge
+github_runner_last_update_timestamp 1709366400
+```
+
+**PromQL examples:**
+
+```promql
+# Time since last update (useful for staleness detection)
+time() - github_runner_last_update_timestamp
+
+# Alert if metrics are stale (>2 minutes)
+time() - github_runner_last_update_timestamp > 120
+```
+
+---
+
+## Summary Table
+
+| Metric | Type | Labels | Source | Stubbed? |
+|---|---|---|---|---|
+| `github_runner_status` | gauge | name, type | Collector running | No |
+| `github_runner_info` | gauge | name, type, version | Environment | No |
+| `github_runner_uptime_seconds` | counter | name, type | Clock | No |
+| `github_runner_jobs_total` | counter | name, type, status | jobs.log | No |
+| `github_runner_job_duration_seconds` | histogram | name, type, le | jobs.log | No |
+| `github_runner_queue_time_seconds` | gauge | name, type | jobs.log | No |
+| `github_runner_cache_hit_rate` | gauge | name, type, cache_type | Stub | **Yes** |
+| `github_runner_last_update_timestamp` | gauge | - | Clock | No |
+
+---
+
+## Job Log Format
+
+The metrics collector reads data from `/tmp/jobs.log`. Each line is CSV:
+
+```
+timestamp,job_id,status,duration_seconds,queue_time_seconds
+```
+
+| Field | Description | Example |
+|---|---|---|
+| `timestamp` | ISO 8601 UTC | `2026-03-02T10:05:30Z` |
+| `job_id` | `{run_id}_{job_name}` | `12345_build` |
+| `status` | Job result | `running`, `success`, `failed` |
+| `duration_seconds` | Execution time | `330` |
+| `queue_time_seconds` | Time waiting in queue | `12` |
+
+- `running` entries are written by `job-started.sh` (preliminary, excluded from totals).
+- Final entries are written by `job-completed.sh` with actual duration and status.
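How the job log turns into the job counters and the cumulative histogram buckets can be sketched with a single `awk` pass. This is a minimal sketch under stated assumptions, not the logic of `docker/metrics-collector.sh`: the function name `summarize_jobs` and its plain `key=value` output are hypothetical; the bucket boundaries and the exclusion of `running` entries come from the reference above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: derive job counts and cumulative duration buckets from
# the CSV job log. summarize_jobs and its output format are illustrative only.
summarize_jobs() {
  local log="$1"
  awk -F, '
    BEGIN { n = split("60 300 600 1800 3600", le, " ") }
    $3 == "running" { next }   # preliminary entries are excluded from totals
    {
      total++
      sum += $4
      if ($3 == "success") success++
      if ($3 == "failed") failed++
      # Cumulative buckets: a job counts in every bucket it fits under.
      for (i = 1; i <= n; i++) if (($4 + 0) <= (le[i] + 0)) bucket[i]++
    }
    END {
      printf "total=%d success=%d failed=%d\n", total, success, failed
      for (i = 1; i <= n; i++) printf "le=%s count=%d\n", le[i], bucket[i]
      printf "le=+Inf count=%d\n", total
      printf "sum=%d\n", sum
    }' "$log"
}
```

A real collector would wrap these numbers in the `# HELP`, `# TYPE`, and labelled sample lines shown in the metric catalog; the sketch only shows where the values come from.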
+
+---
+
+## Next Steps
+
+- [Setup Guide](PROMETHEUS_SETUP.md) — Deploy and configure
+- [Usage Guide](PROMETHEUS_USAGE.md) — PromQL queries and dashboards
+- [Architecture](PROMETHEUS_ARCHITECTURE.md) — System internals
+- [Troubleshooting](PROMETHEUS_TROUBLESHOOTING.md) — Fix common issues
diff --git a/docs/features/PROMETHEUS_QUICKSTART.md b/docs/features/PROMETHEUS_QUICKSTART.md
new file mode 100644
index 00000000..f735d485
--- /dev/null
+++ b/docs/features/PROMETHEUS_QUICKSTART.md
@@ -0,0 +1,125 @@
+# Prometheus Monitoring Quick Start
+
+## Status: ✅ Complete
+
+**Created:** 2026-03-02
+**Phase:** 5 — Documentation & User Guide
+**Task:** TASK-056
+
+---
+
+## 5-Minute Setup
+
+Get runner metrics into Prometheus and Grafana in 5 steps.
+
+### Prerequisites
+
+- Docker and Docker Compose installed
+- Prometheus server running and accessible
+- Grafana server running with Prometheus datasource configured
+
+---
+
+### Step 1: Deploy a Runner
+
+```bash
+# Clone the repository
+git clone https://github.com/GrammaTonic/github-runner.git
+cd github-runner
+
+# Configure
+cp config/runner.env.example config/runner.env
+# Edit config/runner.env — set GITHUB_TOKEN and GITHUB_REPOSITORY
+
+# Start
+docker compose -f docker/docker-compose.production.yml up -d
+```
+
+### Step 2: Verify Metrics
+
+```bash
+curl http://localhost:9091/metrics
+```
+
+You should see Prometheus-formatted output with metrics like `github_runner_status`, `github_runner_uptime_seconds`, etc.
+
+### Step 3: Add Scrape Target to Prometheus
+
+Add to your `prometheus.yml` under `scrape_configs`:
+
+```yaml
+- job_name: "github-runner"
+  static_configs:
+    - targets: [":9091"]
+  scrape_interval: 15s
+  metrics_path: /metrics
+```
+
+Reload Prometheus:
+
+```bash
+curl -X POST http://localhost:9090/-/reload
+```
+
+### Step 4: Import Grafana Dashboards
+
+1. Open Grafana → **Dashboards → Import**.
+2. Upload JSON files from `monitoring/grafana/dashboards/`:
+   - `runner-overview.json` — Status and health
+   - `dora-metrics.json` — DORA metrics
+   - `job-analysis.json` — Job details
+   - `performance-trends.json` — Performance data
+3. Select your Prometheus datasource when prompted.
+
+### Step 5: Verify
+
+1. Check Prometheus: `http://localhost:9090/targets` — runner target should be `UP`.
+2. Check Grafana: Open the **Runner Overview** dashboard — panels should show live data.
+
+---
+
+## Multi-Runner Setup
+
+Deploy all three runner types:
+
+```bash
+# Standard runner (port 9091)
+docker compose -f docker/docker-compose.production.yml up -d
+
+# Chrome runner (port 9092)
+cp config/chrome-runner.env.example config/chrome-runner.env
+# Edit chrome-runner.env
+docker compose -f docker/docker-compose.chrome.yml up -d
+
+# Chrome-Go runner (port 9093)
+cp config/chrome-go-runner.env.example config/chrome-go-runner.env
+# Edit chrome-go-runner.env
+docker compose -f docker/docker-compose.chrome-go.yml up -d
+```
+
+Add all targets to Prometheus:
+
+```yaml
+scrape_configs:
+  - job_name: "github-runner-standard"
+    static_configs:
+      - targets: [":9091"]
+  - job_name: "github-runner-chrome"
+    static_configs:
+      - targets: [":9092"]
+  - job_name: "github-runner-chrome-go"
+    static_configs:
+      - targets: [":9093"]
+```
+
+---
+
+## What's Next?
+
+| Guide | Description |
+|---|---|
+| [Full Setup Guide](PROMETHEUS_SETUP.md) | Detailed configuration options and provisioning |
+| [Usage Guide](PROMETHEUS_USAGE.md) | PromQL queries, alerts, and dashboard customization |
+| [Metrics Reference](PROMETHEUS_METRICS_REFERENCE.md) | Complete metric definitions and examples |
+| [Architecture](PROMETHEUS_ARCHITECTURE.md) | How the metrics system works internally |
+| [Troubleshooting](PROMETHEUS_TROUBLESHOOTING.md) | Fix common issues |
diff --git a/docs/features/PROMETHEUS_SETUP.md b/docs/features/PROMETHEUS_SETUP.md
new file mode 100644
index 00000000..6a567e50
--- /dev/null
+++ b/docs/features/PROMETHEUS_SETUP.md
@@ -0,0 +1,275 @@
+# Prometheus Monitoring Setup Guide
+
+## Status: ✅ Complete
+
+**Created:** 2026-03-02
+**Phase:** 5 — Documentation & User Guide
+**Task:** TASK-047
+
+---
+
+## Overview
+
+This guide walks you through setting up Prometheus monitoring for GitHub Actions self-hosted runners. The runners expose custom metrics on port 9091 in Prometheus text format. You bring your own Prometheus and Grafana instances; this project provides the metrics endpoint and pre-built dashboards.
+
+---
+
+## Prerequisites
+
+Before you begin, ensure you have:
+
+| Requirement | Version | Purpose |
+|---|---|---|
+| Docker Engine | 20.10+ | Container runtime |
+| Docker Compose | v2.0+ | Orchestration |
+| Prometheus | 2.30+ | Metrics scraping and storage |
+| Grafana | 9.0+ | Dashboard visualization |
+| Network access | — | Prometheus must reach runners on port 9091 |
+
+> **Note:** Prometheus and Grafana are **user-provided** — this project does not deploy or manage them.
+
+---
+
+## Step 1: Deploy Runners with Metrics Enabled
+
+Metrics are enabled by default on all runner types. Each runner exposes metrics on container port `9091`.
+
+### Standard Runner
+
+```bash
+# Copy and configure environment
+cp config/runner.env.example config/runner.env
+# Edit config/runner.env with your GITHUB_TOKEN and GITHUB_REPOSITORY
+
+# Deploy
+docker compose -f docker/docker-compose.production.yml up -d
+```
+
+Host port mapping: `9091:9091`
+
+### Chrome Runner
+
+```bash
+cp config/chrome-runner.env.example config/chrome-runner.env
+# Edit config/chrome-runner.env
+
+docker compose -f docker/docker-compose.chrome.yml up -d
+```
+
+Host port mapping: `9092:9091`
+
+### Chrome-Go Runner
+
+```bash
+cp config/chrome-go-runner.env.example config/chrome-go-runner.env
+# Edit config/chrome-go-runner.env
+
+docker compose -f docker/docker-compose.chrome-go.yml up -d
+```
+
+Host port mapping: `9093:9091`
+
+---
+
+## Step 2: Verify Metrics Endpoint
+
+Confirm each runner is serving metrics:
+
+```bash
+# Standard runner
+curl -s http://localhost:9091/metrics | head -20
+
+# Chrome runner
+curl -s http://localhost:9092/metrics | head -20
+
+# Chrome-Go runner
+curl -s http://localhost:9093/metrics | head -20
+```
+
+You should see output in Prometheus text format:
+
+```
+# HELP github_runner_status Runner status (1=online, 0=offline)
+# TYPE github_runner_status gauge
+github_runner_status{runner_name="docker-runner",runner_type="standard"} 1
+
+# HELP github_runner_uptime_seconds Runner uptime in seconds
+# TYPE github_runner_uptime_seconds counter
+github_runner_uptime_seconds{runner_name="docker-runner",runner_type="standard"} 120
+```
+
+---
+
+## Step 3: Configure Prometheus Scrape Targets
+
+Add the runner scrape targets to your `prometheus.yml`. An example configuration is provided at [`monitoring/prometheus-scrape-example.yml`](../../monitoring/prometheus-scrape-example.yml).
+
+### Minimal Scrape Config
+
+Add these jobs to your Prometheus `scrape_configs`:
+
+```yaml
+scrape_configs:
+  # Standard runner
+  - job_name: "github-runner-standard"
+    static_configs:
+      - targets: [":9091"]
+    scrape_interval: 15s
+    metrics_path: /metrics
+    scrape_timeout: 10s
+
+  # Chrome runner
+  - job_name: "github-runner-chrome"
+    static_configs:
+      - targets: [":9092"]
+    scrape_interval: 15s
+    metrics_path: /metrics
+    scrape_timeout: 10s
+
+  # Chrome-Go runner
+  - job_name: "github-runner-chrome-go"
+    static_configs:
+      - targets: [":9093"]
+    scrape_interval: 15s
+    metrics_path: /metrics
+    scrape_timeout: 10s
+```
+
+Replace `` with your Docker host IP or hostname. If Prometheus runs on the same Docker network, use the container service names (e.g., `github-runner-main:9091`).
+
+### Docker Network Scrape Config
+
+When Prometheus is on the same Docker Compose network:
+
+```yaml
+scrape_configs:
+  - job_name: "github-runner-standard"
+    static_configs:
+      - targets: ["github-runner-main:9091"]
+    scrape_interval: 15s
+    metrics_path: /metrics
+    scrape_timeout: 10s
+```
+
+### Reload Prometheus
+
+After updating the configuration:
+
+```bash
+# Option 1: Send SIGHUP
+kill -HUP $(pidof prometheus)
+
+# Option 2: Use the reload API (if --web.enable-lifecycle is set)
+curl -X POST http://localhost:9090/-/reload
+```
+
+---
+
+## Step 4: Configure Grafana Datasource
+
+1. Open Grafana (e.g., `http://localhost:3000`).
+2. Go to **Configuration → Data Sources → Add data source**.
+3. Select **Prometheus**.
+4. Set the URL to your Prometheus server (e.g., `http://prometheus:9090`).
+5. Click **Save & Test** to verify connectivity.
+
+---
+
+## Step 5: Import Grafana Dashboards
+
+This project provides 4 pre-built dashboards in `monitoring/grafana/dashboards/`:
+
+| Dashboard | File | Panels |
+|---|---|---|
+| Runner Overview | `runner-overview.json` | 12 |
+| DORA Metrics | `dora-metrics.json` | 12 |
+| Performance Trends | `performance-trends.json` | 14 |
+| Job Analysis | `job-analysis.json` | 16 |
+
+### Manual Import
+
+1. Open Grafana → **Dashboards → Import**.
+2. Click **Upload JSON file**.
+3. Select a dashboard JSON file from `monitoring/grafana/dashboards/`.
+4. Select your Prometheus datasource when prompted.
+5. Click **Import**.
+6. Repeat for each dashboard.
+
+### Automatic Provisioning
+
+If you mount the dashboards directory into Grafana, use the provisioning config at [`monitoring/grafana/provisioning/dashboards/dashboards.yml`](../../monitoring/grafana/provisioning/dashboards/dashboards.yml):
+
+```yaml
+# docker-compose snippet for Grafana
+services:
+  grafana:
+    image: grafana/grafana:latest
+    volumes:
+      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
+      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
+    ports:
+      - "3000:3000"
+```
+
+Grafana will automatically load all dashboards on startup.
+
+---
+
+## Step 6: Verify End-to-End
+
+1. **Prometheus Targets**: Go to Prometheus → Status → Targets. Confirm runner targets show `UP`.
+2. **Test Query**: Run in Prometheus:
+   ```promql
+   github_runner_status
+   ```
+   Should return `1` for each runner.
+3. **Grafana Dashboards**: Open the Runner Overview dashboard. Panels should show live data.
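As a complement to the verification steps above, a metrics dump can be checked for the full set of expected metric names. This is a minimal sketch under stated assumptions: `missing_runner_metrics` is a hypothetical helper, and the list of names is taken from the Summary Table in the Metrics Reference rather than from the collector source.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text exposition format on stdin and
# print every expected runner metric name that has no sample line.
missing_runner_metrics() {
  local payload name
  payload="$(cat)"
  for name in \
    github_runner_status \
    github_runner_info \
    github_runner_uptime_seconds \
    github_runner_jobs_total \
    github_runner_job_duration_seconds \
    github_runner_queue_time_seconds \
    github_runner_cache_hit_rate \
    github_runner_last_update_timestamp
  do
    # Sample lines start with the metric name; histograms add _bucket/_sum/_count,
    # so a prefix match is enough. Comment lines start with "#" and never match.
    if ! printf '%s\n' "$payload" | grep -q "^${name}"; then
      printf '%s\n' "$name"
    fi
  done
}
```

For example, `curl -s http://localhost:9091/metrics | missing_runner_metrics` should print nothing when every metric is being exported.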
+
+---
+
+## Environment Variables Reference
+
+These variables control metrics behavior in runner containers:
+
+| Variable | Default | Description |
+|---|---|---|
+| `METRICS_PORT` | `9091` | Port for the metrics HTTP server |
+| `METRICS_FILE` | `/tmp/runner_metrics.prom` | Path to the generated metrics file |
+| `METRICS_UPDATE_INTERVAL` | `30` | Seconds between metrics updates |
+| `RUNNER_NAME` | `unknown` | Runner name label in metrics |
+| `RUNNER_TYPE` | `standard` | Runner type label (`standard`, `chrome`, `chrome-go`) |
+| `RUNNER_VERSION` | `2.332.0` | Runner version in `github_runner_info` |
+| `JOBS_LOG` | `/tmp/jobs.log` | Path to the job log file |
+| `JOB_STATE_DIR` | `/tmp/job_state` | Directory for per-job state files |
+
+---
+
+## Port Mapping Summary
+
+| Runner Type | Container Port | Default Host Port | Compose File |
+|---|---|---|---|
+| Standard | 9091 | 9091 | `docker-compose.production.yml` |
+| Chrome | 9091 | 9092 | `docker-compose.chrome.yml` |
+| Chrome-Go | 9091 | 9093 | `docker-compose.chrome-go.yml` |
+
+---
+
+## Troubleshooting Setup Issues
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| `curl` returns "Connection refused" | Container not running or port not mapped | Check `docker ps` and compose port mappings |
+| Prometheus target shows `DOWN` | Network connectivity issue | Ensure Prometheus can reach the runner host/port |
+| Grafana shows "No Data" | Datasource misconfigured or no scrape data yet | Verify Prometheus datasource URL and wait for first scrape |
+| Metrics file empty | Collector script not running | Check container logs: `docker logs ` |
+
+For detailed troubleshooting, see [PROMETHEUS_TROUBLESHOOTING.md](PROMETHEUS_TROUBLESHOOTING.md).
+
+---
+
+## Next Steps
+
+- [Quick Start Guide](PROMETHEUS_QUICKSTART.md) — 5-minute setup
+- [Usage Guide](PROMETHEUS_USAGE.md) — PromQL queries and dashboard customization
+- [Metrics Reference](PROMETHEUS_METRICS_REFERENCE.md) — Full metric definitions
+- [Architecture](PROMETHEUS_ARCHITECTURE.md) — How the metrics system works
diff --git a/docs/features/PROMETHEUS_TROUBLESHOOTING.md b/docs/features/PROMETHEUS_TROUBLESHOOTING.md
new file mode 100644
index 00000000..b3f6cec7
--- /dev/null
+++ b/docs/features/PROMETHEUS_TROUBLESHOOTING.md
@@ -0,0 +1,452 @@
+# Prometheus Monitoring Troubleshooting Guide
+
+## Status: ✅ Complete
+
+**Created:** 2026-03-02
+**Phase:** 5 — Documentation & User Guide
+**Task:** TASK-049
+
+---
+
+## Overview
+
+This guide covers common issues with the Prometheus monitoring system for GitHub Actions self-hosted runners and how to resolve them. Problems are organized by symptom.
+
+---
+
+## Quick Diagnostic Commands
+
+Run these first to gather information:
+
+```bash
+# Check container status
+docker ps --filter "name=github-runner" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
+
+# Check metrics endpoint
+curl -s -o /dev/null -w "%{http_code}" http://localhost:9091/metrics
+
+# View container logs (last 50 lines)
+docker logs --tail 50
+
+# Check metrics collector log
+docker exec cat /tmp/metrics-collector.log
+
+# Check metrics server log
+docker exec cat /tmp/metrics-server.log
+
+# Check if metrics file exists and has content
+docker exec wc -l /tmp/runner_metrics.prom
+
+# Check running processes inside container
+docker exec ps aux | grep -E "metrics|nc"
+```
+
+---
+
+## Problem: Metrics Endpoint Not Responding
+
+### Symptom
+
+`curl http://localhost:9091/metrics` returns "Connection refused" or times out.
+
+### Possible Causes and Fixes
+
+#### 1. Container Not Running
+
+```bash
+docker ps | grep github-runner
+```
+
+**Fix:** Start the container:
+
+```bash
+docker compose -f docker/docker-compose.production.yml up -d
+```
+
+#### 2. Port Not Mapped
+
+```bash
+docker port
+```
+
+**Fix:** Verify the compose file has the correct port mapping:
+
+```yaml
+ports:
+  - "9091:9091"  # Standard runner
+  - "9092:9091"  # Chrome runner
+  - "9093:9091"  # Chrome-Go runner
+```
+
+#### 3. Metrics Server Not Started
+
+```bash
+docker exec ps aux | grep metrics-server
+```
+
+**Fix:** The metrics server is launched by the entrypoint script. Check logs:
+
+```bash
+docker logs 2>&1 | grep -i "metrics"
+```
+
+If the server is not running, restart the container:
+
+```bash
+docker compose -f docker/docker-compose.production.yml restart
+```
+
+#### 4. Port Conflict
+
+Another service may be using port 9091 on the host.
+
+```bash
+lsof -i :9091
+# or
+ss -tlnp | grep 9091
+```
+
+**Fix:** Change the host port in the compose file:
+
+```yaml
+ports:
+  - "9094:9091"  # Use alternate host port
+```
+
+#### 5. Netcat Not Available
+
+```bash
+docker exec which nc
+```
+
+**Fix:** Netcat (`nc`) should be included in the base image. If missing, rebuild the image.
+
+---
+
+## Problem: Metrics Not Updating
+
+### Symptom
+
+`github_runner_uptime_seconds` or `github_runner_last_update_timestamp` does not change between requests.
+
+### Possible Causes and Fixes
+
+#### 1. Collector Script Not Running
+
+```bash
+docker exec ps aux | grep metrics-collector
+```
+
+**Fix:** Check the collector log for errors:
+
+```bash
+docker exec cat /tmp/metrics-collector.log
+```
+
+Restart the container if the collector crashed:
+
+```bash
+docker restart
+```
+
+#### 2. Metrics File Not Writable
+
+```bash
+docker exec ls -la /tmp/runner_metrics.prom
+```
+
+**Fix:** Ensure `/tmp` is writable (it should be by default). Check disk space:
+
+```bash
+docker exec df -h /tmp
+```
+
+#### 3. Update Interval Too Long
+
+The default update interval is 30 seconds. Wait at least 30 seconds between checks.
+
+```bash
+# Watch metrics update in real time
+watch -n 5 'curl -s http://localhost:9091/metrics | grep uptime'
+```
+
+**Fix:** Reduce the interval via environment variable:
+
+```yaml
+environment:
+  METRICS_UPDATE_INTERVAL: "15"  # Update every 15 seconds
+```
+
+---
+
+## Problem: Grafana Dashboard Shows "No Data"
+
+### Symptom
+
+Dashboard panels display "No data" or are empty.
+
+### Possible Causes and Fixes
+
+#### 1. Prometheus Datasource Not Configured
+
+In Grafana:
+
+1. Go to **Configuration → Data Sources**.
+2. Verify a Prometheus datasource exists.
+3. Click **Save & Test** to confirm connectivity.
+
+#### 2. Prometheus Not Scraping Runners
+
+Check Prometheus targets:
+
+1. Open `http://:9090/targets`.
+2. Look for `github-runner-*` jobs.
+3. Targets should show state `UP`.
+
+**Fix:** Add runner targets to your `prometheus.yml`:
+
+```yaml
+scrape_configs:
+  - job_name: "github-runner-standard"
+    static_configs:
+      - targets: [":9091"]
+```
+
+Reload Prometheus:
+
+```bash
+curl -X POST http://localhost:9090/-/reload
+```
+
+#### 3. Datasource Name Mismatch
+
+The dashboards use `${DS_PROMETHEUS}` as a datasource input variable. During import, you must select your Prometheus datasource.
+
+**Fix:** Re-import the dashboard and select the correct datasource at the import prompt.
+
+#### 4. Time Range Too Narrow
+
+If the runner was just deployed, there may not be enough data for the selected time range.
+
+**Fix:** Set the dashboard time range to "Last 15 minutes" or "Last 1 hour".
+
+#### 5. No Jobs Executed Yet
+
+Job metrics (`github_runner_jobs_total`, `github_runner_job_duration_seconds`) only populate after jobs run.
+
+**Fix:** Trigger a test workflow in your repository, or check panels that show runner status (which updates immediately).
+
+---
+
+## Problem: Prometheus Target Shows DOWN
+
+### Symptom
+
+Prometheus targets page shows the runner target with state `DOWN` and an error message.
+
+### Possible Causes and Fixes
+
+#### 1. Network Connectivity
+
+Prometheus cannot reach the runner on the configured port.
+
+```bash
+# From the Prometheus host/container, test connectivity
+curl http://:9091/metrics
+```
+
+**Fix for Docker networks:** Put Prometheus and runners on the same Docker network:
+
+```yaml
+# In your Prometheus docker-compose
+networks:
+  monitoring:
+    external: true
+
+# In runner docker-compose, add:
+networks:
+  monitoring:
+    external: true
+```
+
+#### 2. Firewall Blocking
+
+```bash
+# Check if port is open
+nc -zv 9091
+```
+
+**Fix:** Open port 9091 in your firewall rules.
+
+#### 3. Scrape Timeout
+
+The metrics endpoint must respond within the `scrape_timeout` (default 10s).
+
+```bash
+# Measure response time
+time curl -s http://localhost:9091/metrics > /dev/null
+```
+
+**Fix:** If response is slow, increase the scrape timeout:
+
+```yaml
+- job_name: "github-runner-standard"
+  scrape_timeout: 15s
+```
+
+---
+
+## Problem: Job Counts Not Incrementing
+
+### Symptom
+
+`github_runner_jobs_total` stays at 0 despite running jobs.
+
+### Possible Causes and Fixes
+
+#### 1. Job Hooks Not Configured
+
+The runner must have job hooks set via environment variables.
+
+```bash
+docker exec env | grep ACTIONS_RUNNER_HOOK
+```
+
+Expected output:
+
+```
+ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/runner/job-started.sh
+ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/runner/job-completed.sh
+```
+
+**Fix:** These are configured in the entrypoint scripts. Verify the entrypoint script sets them:
+
+```bash
+docker exec cat /home/runner/entrypoint.sh | grep HOOK
+```
+
+#### 2. Jobs Log Not Writable
+
+```bash
+docker exec ls -la /tmp/jobs.log
+docker exec cat /tmp/jobs.log
+```
+
+**Fix:** Ensure `/tmp/jobs.log` exists and is writable.
+
+#### 3. Hook Scripts Not Executable
+
+```bash
+docker exec ls -la /home/runner/job-started.sh /home/runner/job-completed.sh
+```
+
+**Fix:** Scripts should have execute permission. This is set during the Docker build.
+
+---
+
+## Problem: High Memory or CPU Usage
+
+### Symptom
+
+Runner container using more resources than expected.
+
+### Diagnostic
+
+```bash
+# Check resource usage
+docker stats --no-stream
+
+# Check metrics processes specifically
+docker exec ps aux --sort=-%mem | head -10
+```
+
+### Fixes
+
+#### Reduce Metrics Update Frequency
+
+`METRICS_UPDATE_INTERVAL` controls how often the collector regenerates the metrics file inside the container; it is independent of the Prometheus scrape interval.
+
+```yaml
+environment:
+  METRICS_UPDATE_INTERVAL: "60"  # Increase the interval from 30s to 60s
+```
+
+#### Check Jobs Log Growth
+
+```bash
+docker exec wc -l /tmp/jobs.log
+```
+
+If the log has thousands of entries, the histogram calculation may be slow.
+
+**Fix:** The collector processes recent entries (last 100 for queue time). For very long-running containers, consider restarting to reset the log.
+
+#### Resource Limits
+
+Set container resource limits in the compose file:
+
+```yaml
+deploy:
+  resources:
+    limits:
+      cpus: "2.0"
+      memory: 2G
+```
+
+---
+
+## Problem: Cache Metrics Always Zero
+
+### Symptom
+
+`github_runner_cache_hit_rate` reports 0 for all cache types.
+
+### Explanation
+
+Cache metrics are currently **stubbed** — they always return 0. This is by design:
+
+- BuildKit cache logs exist on the Docker host, not inside the runner container.
+- APT and npm caches are internal to the build process and not easily instrumented from the runner.
+
+See [PROMETHEUS_METRICS_REFERENCE.md](PROMETHEUS_METRICS_REFERENCE.md) for details.
+
+**Future work:** A sidecar container or host-side exporter could provide real cache metrics.
+
+---
+
+## Collecting Diagnostic Information
+
+If you need to file a bug report, gather this information:
+
+```bash
+# 1. Container info
+docker inspect | head -100
+
+# 2. Metrics output
+curl -s http://localhost:9091/metrics > metrics-dump.txt
+
+# 3. Container logs
+docker logs > container-logs.txt 2>&1
+
+# 4. Collector log
+docker exec cat /tmp/metrics-collector.log > collector-log.txt
+
+# 5. Server log
+docker exec cat /tmp/metrics-server.log > server-log.txt
+
+# 6. Jobs log
+docker exec cat /tmp/jobs.log > jobs-log.txt
+
+# 7. Process list
+docker exec ps aux > processes.txt
+
+# 8. Environment
+docker exec env | grep -E "RUNNER|METRICS|JOBS" > env.txt
+```
+
+---
+
+## Next Steps
+
+- [Setup Guide](PROMETHEUS_SETUP.md) — Initial configuration
+- [Usage Guide](PROMETHEUS_USAGE.md) — PromQL queries and dashboards
+- [Architecture](PROMETHEUS_ARCHITECTURE.md) — System internals
+- [Metrics Reference](PROMETHEUS_METRICS_REFERENCE.md) — Full metric definitions
diff --git a/docs/features/PROMETHEUS_USAGE.md b/docs/features/PROMETHEUS_USAGE.md
new file mode 100644
index 00000000..74c77e3e
--- /dev/null
+++ b/docs/features/PROMETHEUS_USAGE.md
@@ -0,0 +1,306 @@
+# Prometheus Monitoring Usage Guide
+
+## Status: ✅ Complete
+
+**Created:** 2026-03-02
+**Phase:** 5 — Documentation & User Guide
+**Task:** TASK-048
+
+---
+
+## Overview
+
+This guide covers day-to-day usage of the Prometheus monitoring system for GitHub Actions self-hosted runners: accessing metrics, writing PromQL queries, customizing dashboards, and best practices.
+
+For initial setup, see [PROMETHEUS_SETUP.md](PROMETHEUS_SETUP.md).
+
+---
+
+## Accessing the Metrics Endpoint
+
+Each runner container exposes metrics via HTTP:
+
+```bash
+# Raw metrics output
+curl http://localhost:9091/metrics
+
+# Filter for a specific metric
+curl -s http://localhost:9091/metrics | grep github_runner_jobs_total
+
+# Pretty-print with line numbers
+curl -s http://localhost:9091/metrics | cat -n
+```
+
+The endpoint returns plain text in [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/).
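Because the endpoint returns plain text, simple checks can be scripted without a Prometheus server. The sketch below is illustrative only: `metrics_age_seconds` is a hypothetical helper that assumes `github_runner_last_update_timestamp` is exported without labels, as the Metrics Reference describes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how old a metrics payload is, in seconds.
# Reads exposition text on stdin; $1 is the current Unix time (defaults to now).
# Exits non-zero when the timestamp metric is absent from the payload.
metrics_age_seconds() {
  local now="${1:-$(date +%s)}"
  awk -v now="$now" '
    # Sample lines look like "<name> <value>"; HELP/TYPE lines start with "#".
    $1 == "github_runner_last_update_timestamp" { print now - $2; found = 1 }
    END { if (!found) exit 1 }
  '
}
```

Piping `curl -s http://localhost:9091/metrics` into `metrics_age_seconds` mirrors the `time() - github_runner_last_update_timestamp` staleness query; with the default 30-second update interval, values well above 30 suggest the collector has stopped.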
+
+---
+
+## Understanding Metric Types
+
+The runner metrics use three Prometheus types:
+
+### Gauges (current value, can go up or down)
+
+- `github_runner_status` — Runner online/offline state
+- `github_runner_info` — Runner metadata (always 1)
+- `github_runner_queue_time_seconds` — Average queue wait time
+- `github_runner_cache_hit_rate` — Cache hit ratio per type
+- `github_runner_last_update_timestamp` — Last metrics update epoch
+
+### Counters (monotonically increasing)
+
+- `github_runner_uptime_seconds` — Total uptime since container start
+- `github_runner_jobs_total` — Cumulative job counts by status
+
+### Histograms (distribution of values)
+
+- `github_runner_job_duration_seconds` — Job duration distribution with buckets at 60s, 300s, 600s, 1800s, 3600s, +Inf
+
+For full metric definitions, see [PROMETHEUS_METRICS_REFERENCE.md](PROMETHEUS_METRICS_REFERENCE.md).
+
+---
+
+## Writing PromQL Queries
+
+### Basic Queries
+
+```promql
+# Current status of all runners
+github_runner_status
+
+# Filter by runner type
+github_runner_status{runner_type="chrome"}
+
+# Runner uptime in hours
+github_runner_uptime_seconds / 3600
+
+# Total successful jobs
+github_runner_jobs_total{status="success"}
+```
+
+### Rate and Aggregation
+
+```promql
+# Jobs per hour (success)
+rate(github_runner_jobs_total{status="success"}[1h]) * 3600
+
+# Total jobs across all runners in last 24h
+sum(increase(github_runner_jobs_total{status="total"}[24h]))
+
+# Failed job rate (percentage)
+sum(rate(github_runner_jobs_total{status="failed"}[1h]))
+  /
+sum(rate(github_runner_jobs_total{status="total"}[1h]))
+  * 100
+```
+
+### DORA Metrics
+
+```promql
+# Deployment Frequency (successful builds per day)
+sum(increase(github_runner_jobs_total{status="success"}[24h]))
+
+# Lead Time for Changes (average job duration in minutes)
+rate(github_runner_job_duration_seconds_sum[5m])
+  /
+rate(github_runner_job_duration_seconds_count[5m])
+  / 60
+
+# Change Failure Rate (%)
+sum(increase(github_runner_jobs_total{status="failed"}[24h]))
+  /
+sum(increase(github_runner_jobs_total{status="total"}[24h]))
+  * 100
+
+# Mean Time to Recovery (average duration of failed jobs in minutes)
+rate(github_runner_job_duration_seconds_sum{status="failed"}[1h])
+  /
+rate(github_runner_job_duration_seconds_count{status="failed"}[1h])
+  / 60
+```
+
+### Histogram Queries
+
+```promql
+# Median job duration (p50)
+histogram_quantile(0.50, rate(github_runner_job_duration_seconds_bucket[1h]))
+
+# 90th percentile job duration
+histogram_quantile(0.90, rate(github_runner_job_duration_seconds_bucket[1h]))
+
+# 99th percentile job duration
+histogram_quantile(0.99, rate(github_runner_job_duration_seconds_bucket[1h]))
+
+# Jobs completing under 5 minutes
+github_runner_job_duration_seconds_bucket{le="300"}
+```
+
+### Runner Comparison
+
+```promql
+# Uptime by runner type (longest-running runner of each type)
+max by (runner_type) (github_runner_uptime_seconds)
+
+# Job success rate per runner
+github_runner_jobs_total{status="success"} / github_runner_jobs_total{status="total"}
+
+# Queue time per runner
+avg by (runner_name) (github_runner_queue_time_seconds)
+```
+
+---
+
+## Customizing Dashboards
+
+### Modifying Existing Panels
+
+1. Open a dashboard in Grafana.
+2. Click the panel title → **Edit**.
+3. Modify the PromQL query in the **Query** tab.
+4. Adjust visualization options in the **Panel options** tab.
+5. Click **Apply** and then **Save dashboard**.
+
+### Adding New Panels
+
+1. Click **Add** → **Visualization** in the dashboard.
+2. Select your Prometheus datasource.
+3. Enter a PromQL query.
+4. Choose a visualization type (Time series, Stat, Gauge, Table, etc.).
+5. Configure thresholds:
+   - Green: Normal operation
+   - Yellow: Warning threshold
+   - Red: Critical threshold
+
+### Using Dashboard Variables
+
+All pre-built dashboards include two template variables:
+
+- **`runner_name`**: Multi-select filter by runner name
+- **`runner_type`**: Multi-select filter by runner type (standard, chrome, chrome-go)
+
+Use these in custom queries:
+
+```promql
+github_runner_jobs_total{runner_name=~"$runner_name", runner_type=~"$runner_type"}
+```
+
+### Exporting Customized Dashboards
+
+1. Open the dashboard → **Settings** (gear icon) → **JSON Model**.
+2. Copy the JSON.
+3. Save to `monitoring/grafana/dashboards/` for version control.
+
+---
+
+## Setting Up Alerts (Prometheus Alertmanager)
+
+> **Note:** Alertmanager deployment is user-provided. These are example alert rules.
+
+### Example Alert Rules
+
+Create a file `prometheus-rules.yml`:
+
+```yaml
+groups:
+  - name: github-runner-alerts
+    rules:
+      # Runner is offline
+      - alert: RunnerOffline
+        expr: github_runner_status == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Runner {{ $labels.runner_name }} is offline"
+          description: "Runner has been offline for more than 5 minutes."
+
+      # High failure rate
+      - alert: HighJobFailureRate
+        expr: >
+          (sum by (runner_name) (increase(github_runner_jobs_total{status="failed"}[1h]))
+          /
+          sum by (runner_name) (increase(github_runner_jobs_total{status="total"}[1h])))
+          > 0.15
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High job failure rate on {{ $labels.runner_name }}"
+          description: "Failure rate exceeds 15% over the last hour."
+
+      # Long queue times
+      - alert: HighQueueTime
+        expr: github_runner_queue_time_seconds > 300
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High queue time on {{ $labels.runner_name }}"
+          description: "Average queue time exceeds 5 minutes."
+
+      # Metrics stale (collector may have crashed)
+      # github_runner_last_update_timestamp carries no runner labels, so the
+      # summary uses the scrape target's `instance` label instead.
+      - alert: MetricsStale
+        expr: time() - github_runner_last_update_timestamp > 120
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Stale metrics from {{ $labels.instance }}"
+          description: "Metrics have not updated for over 2 minutes."
+```
+
+Add to Prometheus configuration:
+
+```yaml
+rule_files:
+  - "/etc/prometheus/rules/prometheus-rules.yml"
+```
+
+---
+
+## Best Practices
+
+### Metrics Retention
+
+- **Short-term** (1–7 days): Keep raw 15s scrape data for real-time dashboards.
+- **Medium-term** (30 days): Use Prometheus recording rules to downsample.
+- **Long-term** (90+ days): Use remote storage (Thanos, Cortex, Mimir) or export metrics.
+
+### Recording Rules for Performance
+
+Pre-compute expensive queries:
+
+```yaml
+groups:
+  - name: github-runner-recording-rules
+    rules:
+      - record: job:github_runner_jobs_total:rate1h
+        expr: sum by (runner_name, status) (rate(github_runner_jobs_total[1h]))
+
+      - record: job:github_runner_job_duration:p99_1h
+        expr: histogram_quantile(0.99, sum by (le, runner_name) (rate(github_runner_job_duration_seconds_bucket[1h])))
+```
+
+### Scrape Interval
+
+- **15s** (default): Good balance of granularity and storage.
+- **30s**: Reduces storage by ~50%, sufficient for most use cases.
+- **5s**: Only for debugging; increases storage significantly.
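The storage trade-off between scrape intervals follows from simple arithmetic on the number of raw samples each time series produces per day. The sketch below is a rough estimate only: `samples_per_day` is a hypothetical helper, and it ignores TSDB compression, which also affects the real on-disk footprint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: raw samples stored per time series per day
# for a given scrape interval in seconds (86400 seconds in a day).
samples_per_day() {
  local interval_seconds="$1"
  echo $(( 86400 / interval_seconds ))
}

samples_per_day 15   # 5760 samples/day per series
samples_per_day 30   # 2880 samples/day per series (half of the 15s figure)
samples_per_day 5    # 17280 samples/day per series (three times the 15s figure)
```

Multiplying the result by the number of exported series per runner and the number of runners gives an approximate daily sample count for capacity planning.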
+
+### Label Cardinality
+
+Keep label cardinality low to avoid Prometheus performance issues:
+
+- `runner_name`: One per runner instance (bounded by deployment size)
+- `runner_type`: Three values (`standard`, `chrome`, `chrome-go`)
+- `status`: Three values (`total`, `success`, `failed`)
+- `cache_type`: Three values (`buildkit`, `apt`, `npm`)
+
+---
+
+## Next Steps
+
+- [Metrics Reference](PROMETHEUS_METRICS_REFERENCE.md) — Full metric definitions and types
+- [Troubleshooting](PROMETHEUS_TROUBLESHOOTING.md) — Common issues and fixes
+- [Architecture](PROMETHEUS_ARCHITECTURE.md) — System internals
+- [Quick Start](PROMETHEUS_QUICKSTART.md) — 5-minute setup
diff --git a/monitoring/prometheus-scrape-example.yml b/monitoring/prometheus-scrape-example.yml
new file mode 100644
index 00000000..9e55117f
--- /dev/null
+++ b/monitoring/prometheus-scrape-example.yml
@@ -0,0 +1,70 @@
+# Prometheus Scrape Configuration Example
+# Add these jobs to your prometheus.yml under 'scrape_configs'
+#
+# This file demonstrates how to configure Prometheus to scrape
+# GitHub Actions self-hosted runner metrics endpoints.
+# +# For full setup instructions, see: +# docs/features/PROMETHEUS_SETUP.md +# docs/features/PROMETHEUS_QUICKSTART.md + +scrape_configs: + # Standard runner metrics + # Default host port: 9091 (maps to container port 9091) + - job_name: "github-runner-standard" + static_configs: + - targets: [":9091"] + labels: + runner_variant: "standard" + scrape_interval: 15s + metrics_path: /metrics + scrape_timeout: 10s + + # Chrome runner metrics + # Default host port: 9092 (maps to container port 9091) + - job_name: "github-runner-chrome" + static_configs: + - targets: [":9092"] + labels: + runner_variant: "chrome" + scrape_interval: 15s + metrics_path: /metrics + scrape_timeout: 10s + + # Chrome-Go runner metrics + # Default host port: 9093 (maps to container port 9091) + - job_name: "github-runner-chrome-go" + static_configs: + - targets: [":9093"] + labels: + runner_variant: "chrome-go" + scrape_interval: 15s + metrics_path: /metrics + scrape_timeout: 10s + +# ────────────────────────────────────────────────────────── +# Docker Network Configuration (alternative) +# Use when Prometheus runs on the same Docker network as runners +# ────────────────────────────────────────────────────────── +# +# scrape_configs: +# - job_name: "github-runner-standard" +# static_configs: +# - targets: ["github-runner-main:9091"] +# scrape_interval: 15s +# metrics_path: /metrics +# scrape_timeout: 10s +# +# - job_name: "github-runner-chrome" +# static_configs: +# - targets: ["github-runner-chrome:9091"] +# scrape_interval: 15s +# metrics_path: /metrics +# scrape_timeout: 10s +# +# - job_name: "github-runner-chrome-go" +# static_configs: +# - targets: ["github-runner-chrome-go:9091"] +# scrape_interval: 15s +# metrics_path: /metrics +# scrape_timeout: 10s diff --git a/plan/feature-prometheus-monitoring-1.md b/plan/feature-prometheus-monitoring-1.md index 3f5d923e..e890b234 100644 --- a/plan/feature-prometheus-monitoring-1.md +++ b/plan/feature-prometheus-monitoring-1.md @@ -167,22 +167,22 @@ 
This implementation plan provides a fully executable roadmap for adding Promethe ### Implementation Phase 5: Documentation & User Guide **Timeline:** Week 4-5 (2025-12-07 to 2025-12-21) -**Status:** ⏳ Planned +**Status:** ✅ Complete - **GOAL-005**: Provide comprehensive documentation for setup, usage, troubleshooting, and architecture | Task | Description | Completed | Date | |------|-------------|-----------|------| -| TASK-047 | Create `docs/features/PROMETHEUS_SETUP.md` with sections: Prerequisites (external Prometheus/Grafana), Prometheus scrape config example (scraping port 9091), Grafana datasource setup, Dashboard import instructions, Verification steps, Troubleshooting common setup issues | | | -| TASK-048 | Create `docs/features/PROMETHEUS_USAGE.md` with sections: Accessing metrics endpoint, Understanding metric types, Writing custom PromQL queries, Customizing dashboards, Setting up alerts (future), Best practices for metrics retention | | | -| TASK-049 | Create `docs/features/PROMETHEUS_TROUBLESHOOTING.md` with sections: Metrics endpoint not responding (check port exposure, container logs), Metrics not updating (check collector script, logs), Dashboard showing "No Data" (verify Prometheus scraping, datasource config), High memory usage (adjust retention, scrape interval), Performance optimization tips | | | -| TASK-050 | Create `docs/features/PROMETHEUS_ARCHITECTURE.md` with sections: System architecture diagram, Component descriptions (metrics server, collector, HTTP endpoint), Data flow (collector → file → HTTP server → Prometheus), Metric naming conventions, Design decisions (bash + netcat rationale), Scalability considerations (horizontal runner scaling) | | | -| TASK-051 | Update `README.md` with "📊 Monitoring" section linking to setup guide and architecture docs | | | -| TASK-052 | Update `docs/README.md` with links to all new Prometheus documentation files | | | -| TASK-053 | Create example Prometheus scrape configuration YAML snippet
in `monitoring/prometheus-scrape-example.yml` | | | -| TASK-054 | Document metric definitions with descriptions, types (gauge/counter/histogram), and example values in `docs/features/PROMETHEUS_METRICS_REFERENCE.md` | | | -| TASK-055 | Add metrics endpoint to API documentation in `docs/API.md` (if applicable) | | | -| TASK-056 | Create quickstart guide: `docs/features/PROMETHEUS_QUICKSTART.md` with 5-minute setup instructions | | | +| TASK-047 | Create `docs/features/PROMETHEUS_SETUP.md` with sections: Prerequisites (external Prometheus/Grafana), Prometheus scrape config example (scraping port 9091), Grafana datasource setup, Dashboard import instructions, Verification steps, Troubleshooting common setup issues | ✅ | 2026-03-02 | +| TASK-048 | Create `docs/features/PROMETHEUS_USAGE.md` with sections: Accessing metrics endpoint, Understanding metric types, Writing custom PromQL queries, Customizing dashboards, Setting up alerts (future), Best practices for metrics retention | ✅ | 2026-03-02 | +| TASK-049 | Create `docs/features/PROMETHEUS_TROUBLESHOOTING.md` with sections: Metrics endpoint not responding (check port exposure, container logs), Metrics not updating (check collector script, logs), Dashboard showing "No Data" (verify Prometheus scraping, datasource config), High memory usage (adjust retention, scrape interval), Performance optimization tips | ✅ | 2026-03-02 | +| TASK-050 | Create `docs/features/PROMETHEUS_ARCHITECTURE.md` with sections: System architecture diagram, Component descriptions (metrics server, collector, HTTP endpoint), Data flow (collector → file → HTTP server → Prometheus), Metric naming conventions, Design decisions (bash + netcat rationale), Scalability considerations (horizontal runner scaling) | ✅ | 2026-03-02 | +| TASK-051 | Update `README.md` with "📊 Monitoring" section: Fixed port from 9090→9091, added metrics endpoint examples for all 3 runner types, added Grafana dashboard table, added links to all Prometheus
documentation files | ✅ | 2026-03-02 | +| TASK-052 | Update `docs/README.md` with Prometheus Monitoring section linking to all 7 documentation files (Quick Start, Setup, Usage, Metrics Reference, Architecture, Troubleshooting, Grafana Dashboard Metrics) | ✅ | 2026-03-02 | +| TASK-053 | Create `monitoring/prometheus-scrape-example.yml` with scrape configs for all 3 runner types (standard:9091, chrome:9092, chrome-go:9093) plus Docker network alternative config | ✅ | 2026-03-02 | +| TASK-054 | Create `docs/features/PROMETHEUS_METRICS_REFERENCE.md` with complete definitions for all 8 metric families: type, description, labels, values, source, PromQL examples, stub status for cache metrics | ✅ | 2026-03-02 | +| TASK-055 | Rewrite `docs/API.md` metrics section with correct metric names, types, descriptions, port info, and links to Metrics Reference and Usage Guide | ✅ | 2026-03-02 | +| TASK-056 | Create `docs/features/PROMETHEUS_QUICKSTART.md` with 5-step, 5-minute setup instructions covering deploy, verify, scrape config, dashboard import, and multi-runner setup | ✅ | 2026-03-02 | ### Implementation Phase 6: Testing & Validation From feacc5871bfa30dbc90ce0f0c8058be5491201f5 Mon Sep 17 00:00:00 2001 From: GrammaTonic Date: Mon, 2 Mar 2026 02:55:53 +0100 Subject: [PATCH 2/3] style: fix markdownlint MD031 blank lines around fenced code block --- docs/features/PROMETHEUS_SETUP.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/features/PROMETHEUS_SETUP.md b/docs/features/PROMETHEUS_SETUP.md index 6a567e50..7f92ead0 100644 --- a/docs/features/PROMETHEUS_SETUP.md +++ b/docs/features/PROMETHEUS_SETUP.md @@ -219,9 +219,11 @@ Grafana will automatically load all dashboards on startup. 1. **Prometheus Targets**: Go to Prometheus → Status → Targets. Confirm runner targets show `UP`. 2. **Test Query**: Run in Prometheus: + ```promql github_runner_status ``` + Should return `1` for each runner. 3. **Grafana Dashboards**: Open the Runner Overview dashboard.
Panels should show live data. From cc98cdb98fd4b49ea000d674fcce4ae9fa924a83 Mon Sep 17 00:00:00 2001 From: GrammaTonic Date: Mon, 2 Mar 2026 03:08:09 +0100 Subject: [PATCH 3/3] docs: add Prometheus monitoring wiki pages and fix existing references MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Create 4 new wiki pages mirroring Phase 5 Prometheus documentation: - Monitoring-Setup.md: Quick start, port mapping, multi-runner config - Metrics-Reference.md: All 8 metrics with types, labels, and PromQL - Grafana-Dashboards.md: 4 dashboards, import/provisioning instructions - Monitoring-Troubleshooting.md: Symptom-based troubleshooting guide Update 5 existing wiki pages: - Home.md: Add Monitoring & Observability section to Table of Contents - Production-Deployment.md: Fix METRICS_PORT 9090→9091, scrape target runner:8080→runner:9091, add monitoring guide cross-link - Quick-Start.md: Restore monitoring link in What's Next section - Chrome-Runner.md: Add Prometheus metrics port 9092 info and links - Docker-Configuration.md: Add monitoring setup link below architecture --- wiki-content/Chrome-Runner.md | 11 + wiki-content/Docker-Configuration.md | 4 +- wiki-content/Grafana-Dashboards.md | 157 ++++++++++ wiki-content/Home.md | 11 +- wiki-content/Metrics-Reference.md | 214 +++++++++++++ wiki-content/Monitoring-Setup.md | 186 +++++++++++ wiki-content/Monitoring-Troubleshooting.md | 344 +++++++++++++++++++++ wiki-content/Production-Deployment.md | 10 +- wiki-content/Quick-Start.md | 3 +- 9 files changed, 930 insertions(+), 10 deletions(-) create mode 100644 wiki-content/Grafana-Dashboards.md create mode 100644 wiki-content/Metrics-Reference.md create mode 100644 wiki-content/Monitoring-Setup.md create mode 100644 wiki-content/Monitoring-Troubleshooting.md diff --git a/wiki-content/Chrome-Runner.md b/wiki-content/Chrome-Runner.md index 854024c0..b21b172c 100644 --- a/wiki-content/Chrome-Runner.md +++ b/wiki-content/Chrome-Runner.md
@@ -356,6 +356,17 @@ curl http://localhost:8080/health ## πŸ“ˆ **Monitoring & Metrics** +### **Prometheus Metrics** + +The Chrome runner exposes Prometheus metrics on host port **9092** (mapped from container port 9091): + +```bash +# Verify Chrome runner metrics +curl http://localhost:9092/metrics +``` + +See [Monitoring Setup](Monitoring-Setup.md) for full setup instructions and [Metrics Reference](Metrics-Reference.md) for all 8 available metrics. + ### **Container Metrics** ```bash diff --git a/wiki-content/Docker-Configuration.md b/wiki-content/Docker-Configuration.md index 8bb9d625..e64c928f 100644 --- a/wiki-content/Docker-Configuration.md +++ b/wiki-content/Docker-Configuration.md @@ -16,7 +16,7 @@ Complete guide to configuring Docker and Docker Compose for GitHub Actions self- β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Runner 1 β”‚ Runner 2 β”‚ Runner 3 β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ -β”‚ Monitoring Stack β”‚ +β”‚ Monitoring Stack (User-Provided) β”‚ β”‚ Prometheus β”‚ Grafana β”‚ AlertMgr β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Shared Volumes β”‚ @@ -24,6 +24,8 @@ Complete guide to configuring Docker and Docker Compose for GitHub Actions self- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` +> πŸ“– **Monitoring Stack setup:** See [Monitoring Setup](Monitoring-Setup.md) for configuring Prometheus scraping and Grafana dashboards with your runners. Each runner exposes metrics on port 9091 (standard), 9092 (chrome), or 9093 (chrome-go). 
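The three default host ports listed above can be probed in one pass. The following is a minimal sketch, assuming the runners are published on `localhost` with the default port mappings; `metrics_url` and `check_runner_metrics` are hypothetical helper names for illustration and are not scripts shipped with the project.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the project scripts.
# Assumes the default host ports: standard=9091, chrome=9092, chrome-go=9093.

# Print the metrics URL for a runner type; return 1 for unknown types.
metrics_url() {
  local runner_type="$1" host="${2:-localhost}"
  case "$runner_type" in
    standard)  echo "http://${host}:9091/metrics" ;;
    chrome)    echo "http://${host}:9092/metrics" ;;
    chrome-go) echo "http://${host}:9093/metrics" ;;
    *)         echo "unknown runner type: ${runner_type}" >&2; return 1 ;;
  esac
}

# Report the HTTP status code of each runner's metrics endpoint.
check_runner_metrics() {
  local runner_type url code
  for runner_type in standard chrome chrome-go; do
    url="$(metrics_url "$runner_type")"
    # curl reports 000 as the status code when no HTTP response was received
    code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")"
    echo "${runner_type}: ${url} -> HTTP ${code}"
  done
}
```

Calling `check_runner_metrics` after sourcing the file prints one line per runner type; a runner that is not deployed simply shows up with status `000` rather than aborting the loop.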
+ ## πŸ“ Docker Compose Configuration ### Separate Architecture diff --git a/wiki-content/Grafana-Dashboards.md b/wiki-content/Grafana-Dashboards.md new file mode 100644 index 00000000..c12e0432 --- /dev/null +++ b/wiki-content/Grafana-Dashboards.md @@ -0,0 +1,157 @@ +# Grafana Dashboards + +![Grafana](https://img.shields.io/badge/Grafana-Dashboards-F46800?style=for-the-badge&logo=grafana&logoColor=white) +![Dashboards](https://img.shields.io/badge/Dashboards-4%20Included-blue?style=for-the-badge) + +Pre-built Grafana dashboards for visualizing GitHub Actions self-hosted runner metrics. Import the JSON files into your Grafana instance β€” no custom plugin required. + +--- + +## πŸ“Š Dashboard Overview + +All dashboard JSON files are in `monitoring/grafana/dashboards/`: + +| Dashboard | File | Panels | Focus | +|---|---|---|---| +| **Runner Overview** | `runner-overview.json` | 12 | Runner status, health, uptime, queue time | +| **DORA Metrics** | `dora-metrics.json` | 12 | Deployment Frequency, Lead Time, CFR, MTTR | +| **Performance Trends** | `performance-trends.json` | 14 | Cache hit rates, build duration percentiles, queue times | +| **Job Analysis** | `job-analysis.json` | 16 | Job summary, duration histograms, status breakdown | + +**Total:** 54 panels across 4 dashboards. + +--- + +## πŸš€ Importing Dashboards + +### Option 1: Manual Import (Recommended for Quick Start) + +1. Open Grafana β†’ **Dashboards β†’ Import**. +2. Click **Upload JSON file**. +3. Select a dashboard file from `monitoring/grafana/dashboards/`. +4. Select your **Prometheus datasource** when prompted. +5. Click **Import**. +6. Repeat for each dashboard. + +### Option 2: Provisioning (Recommended for Production) + +Use the included provisioning configuration to auto-load dashboards on Grafana startup. 
+ +```yaml +# monitoring/grafana/provisioning/dashboards/dashboards.yml +apiVersion: 1 + +providers: + - name: "github-runner" + orgId: 1 + folder: "GitHub Runner" + type: file + disableDeletion: false + editable: true + options: + path: /etc/grafana/provisioning/dashboards + foldersFromFilesStructure: false +``` + +Mount the dashboards directory into your Grafana container: + +```yaml +# In your Grafana docker-compose service +volumes: + - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro + - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro +``` + +Dashboards will appear automatically in the **GitHub Runner** folder on startup. + +--- + +## ⚙️ Dashboard Variables + +All dashboards include these template variables for filtering: + +| Variable | Type | Description | +|---|---|---| +| `runner_name` | Multi-select | Filter by runner instance name | +| `runner_type` | Multi-select | Filter by runner type (`standard`, `chrome`, `chrome-go`) | + +Variables are populated from live Prometheus label data, so new runners appear automatically. + +--- + +## 📋 Dashboard Details + +### Runner Overview + +The primary operational dashboard.
Shows: + +- **Runner Status** — Online/offline indicator per runner +- **Fleet Size** — Total active runners +- **Uptime** — Current uptime per runner +- **Job Success Rate** — Percentage gauge +- **Queue Time** — Average time jobs wait before starting +- **Jobs Over Time** — Time series of job throughput +- **Quick Links** — Navigation to other dashboards + +### DORA Metrics + +Tracks the four DORA key metrics as calculated from runner data: + +- **Deployment Frequency** — Successful jobs per day +- **Lead Time for Changes** — Average job duration (proxy) +- **Change Failure Rate** — Failed jobs / total jobs (%) +- **Mean Time to Recovery** — Time between failure and next success +- **Trend Lines** — 7-day rolling averages +- **Classification** — Elite / High / Medium / Low performance bands + +### Performance Trends + +Resource utilization and build performance over time: + +- **Build Duration Percentiles** — p50, p90, p99 +- **Cache Hit Rates** — BuildKit, APT, npm (currently stubbed) +- **Queue Time Trends** — Historical queue wait times +- **Runner Comparison** — Side-by-side performance across runner types + +### Job Analysis + +Deep dive into individual job metrics: + +- **Job Summary** — Total, successful, failed counts +- **Duration Histograms** — Distribution of job execution times +- **Status Breakdown** — Pie/bar charts by status +- **Runner Comparison** — Which runners handle more/faster jobs +- **Duration by Runner Type** — Compare standard vs chrome vs chrome-go + +--- + +## 🔧 Datasource Configuration + +Dashboards use the `${DS_PROMETHEUS}` input variable for datasource portability. During import, Grafana will prompt you to map this to your Prometheus datasource. + +### Adding a Prometheus Datasource + +If you haven't configured one yet: + +1. Go to **Configuration → Data Sources → Add data source**. +2. Select **Prometheus**. +3.
Set the URL to your Prometheus server (e.g., `http://prometheus:9090`). +4. Click **Save & Test** to verify connectivity. + +--- + +## 🔗 Inter-Dashboard Navigation + +Each dashboard includes navigation links to the other dashboards. The **Runner Overview** dashboard has a **Quick Links** panel for easy cross-dashboard navigation. + +--- + +## 📊 What's Next? + +| Guide | Description | +|---|---| +| [Monitoring Setup](Monitoring-Setup.md) | Deploy runners and connect Prometheus | +| [Metrics Reference](Metrics-Reference.md) | All 8 metrics with PromQL examples | +| [Monitoring Troubleshooting](Monitoring-Troubleshooting.md) | Fix "No Data" and other dashboard issues | + +> 📖 **Full dashboard documentation:** See [GRAFANA_DASHBOARD_METRICS.md](../docs/features/GRAFANA_DASHBOARD_METRICS.md) and [PROMETHEUS_USAGE.md](../docs/features/PROMETHEUS_USAGE.md) in the main docs for PromQL query recipes, alert rule examples, and dashboard customization. diff --git a/wiki-content/Home.md b/wiki-content/Home.md index 2db18b22..2ff20ab8 100644 --- a/wiki-content/Home.md +++ b/wiki-content/Home.md @@ -63,6 +63,13 @@ Welcome to the comprehensive documentation for the GitHub Actions Self-Hosted Ru - **[Chrome Runner](Chrome-Runner.md) 🆕** - Web UI testing and browser automation - [Docker Configuration](Docker-Configuration.md) - General Docker setup +### Monitoring & Observability + +- **[Monitoring Setup](Monitoring-Setup.md) 🆕** - Prometheus metrics quick start and configuration +- [Metrics Reference](Metrics-Reference.md) - All 8 runner metrics with PromQL examples +- [Grafana Dashboards](Grafana-Dashboards.md) - 4 pre-built dashboards (54 panels) +- [Monitoring Troubleshooting](Monitoring-Troubleshooting.md) - Fix common monitoring issues + ### Configuration - [Production Deployment](Production-Deployment.md) - Production-ready deployment @@ -94,9 +101,7 @@ docker-compose up -d | **Standard Runner** | ✅ Stable | [Installation Guide](Installation-Guide.md) | |
**CI/CD Pipeline** | ✅ Passing | [Production Deployment](Production-Deployment.md) | | **Security Scanning** | ✅ Clean | [Common Issues](Common-Issues.md) | - - - +| **Monitoring** | ✅ Production Ready | [Monitoring Setup](Monitoring-Setup.md) | ## 🚀 Quick Links diff --git a/wiki-content/Metrics-Reference.md b/wiki-content/Metrics-Reference.md new file mode 100644 index 00000000..3f51d305 --- /dev/null +++ b/wiki-content/Metrics-Reference.md @@ -0,0 +1,214 @@ +# Metrics Reference + +![Prometheus](https://img.shields.io/badge/Prometheus-Metrics-E6522C?style=for-the-badge&logo=prometheus&logoColor=white) + +Complete reference for all Prometheus metrics exposed by GitHub Actions self-hosted runners on port **9091**. + +--- + +## 🏷️ Common Labels + +All metrics include these labels unless otherwise noted: + +| Label | Description | Example Values | +|---|---|---| +| `runner_name` | Runner instance name | `docker-runner`, `chrome-runner-1` | +| `runner_type` | Runner variant | `standard`, `chrome`, `chrome-go` | + +--- + +## 📊 Metrics Summary + +| Metric | Type | Labels | Stubbed?
| Description | +|---|---|---|---|---| +| `github_runner_status` | Gauge | name, type | No | Runner online/offline (1/0) | +| `github_runner_info` | Gauge | name, type, version | No | Runner metadata (always 1) | +| `github_runner_uptime_seconds` | Counter | name, type | No | Uptime since collector start | +| `github_runner_jobs_total` | Counter | name, type, status | No | Jobs by status (total/success/failed) | +| `github_runner_job_duration_seconds` | Histogram | name, type, le | No | Job duration distribution | +| `github_runner_queue_time_seconds` | Gauge | name, type | No | Average queue time (last 100 jobs) | +| `github_runner_cache_hit_rate` | Gauge | name, type, cache_type | **Yes** | Cache hit rate (stubbed at 0) | +| `github_runner_last_update_timestamp` | Gauge | — | No | Unix epoch of last update | + +--- + +## 🔍 Metric Details + +### `github_runner_status` + +**Type:** Gauge — Runner online/offline status. + +| Value | Meaning | +|---|---| +| `1` | Online (collector running) | +| `0` | Offline | + +```promql +# All online runners +github_runner_status == 1 + +# Count online runners by type +count by (runner_type) (github_runner_status == 1) + +# Alert: runner offline +github_runner_status == 0 +``` + +--- + +### `github_runner_info` + +**Type:** Gauge — Runner metadata. Always `1`; informational labels carry the data. + +Extra label: `version` (runner software version). + +```promql +# List all runners with versions +github_runner_info + +# Filter by version +github_runner_info{version="2.332.0"} +``` + +--- + +### `github_runner_uptime_seconds` + +**Type:** Counter — Seconds since the metrics collector started. + +```promql +# Uptime in hours +github_runner_uptime_seconds / 3600 + +# Alert: recent restart (uptime < 5 min) +github_runner_uptime_seconds < 300 +``` + +--- + +### `github_runner_jobs_total` + +**Type:** Counter — Total jobs processed, segmented by `status` label.
+ +| Status Value | Description | +|---|---| +| `total` | All completed jobs | +| `success` | Successful jobs | +| `failed` | Failed jobs | + +```promql +# Jobs per hour +rate(github_runner_jobs_total{status="total"}[1h]) * 3600 + +# Success rate (%) +github_runner_jobs_total{status="success"} + / ignoring(status) github_runner_jobs_total{status="total"} * 100 + +# DORA: Deployment Frequency (successful jobs/24h) +sum(increase(github_runner_jobs_total{status="success"}[24h])) + +# DORA: Change Failure Rate (%) +sum(increase(github_runner_jobs_total{status="failed"}[24h])) + / sum(increase(github_runner_jobs_total{status="total"}[24h])) * 100 +``` + +--- + +### `github_runner_job_duration_seconds` + +**Type:** Histogram — Distribution of job execution durations. + +**Bucket boundaries:** `60` (1 min), `300` (5 min), `600` (10 min), `1800` (30 min), `3600` (1 hr), `+Inf`. + +Sub-metrics: `_bucket`, `_sum`, `_count`. + +```promql +# Median (p50) job duration +histogram_quantile(0.50, rate(github_runner_job_duration_seconds_bucket[1h])) + +# 90th percentile +histogram_quantile(0.90, rate(github_runner_job_duration_seconds_bucket[1h])) + +# DORA: Lead Time (average duration in minutes) +rate(github_runner_job_duration_seconds_sum[5m]) + / rate(github_runner_job_duration_seconds_count[5m]) / 60 +``` + +> **Note:** Buckets are cumulative — each bucket includes all smaller buckets. The `+Inf` bucket equals `_count`. + +--- + +### `github_runner_queue_time_seconds` + +**Type:** Gauge — Average queue wait time in seconds (computed from last 100 completed jobs). + +```promql +# Queue time per runner +avg by (runner_name) (github_runner_queue_time_seconds) + +# Alert: queue time > 5 minutes +github_runner_queue_time_seconds > 300 +``` + +> A value of `0` means jobs started immediately with no queuing. + +--- + +### `github_runner_cache_hit_rate` + +**Type:** Gauge — Cache hit rate by `cache_type` label (0.0 to 1.0).
+ +| Cache Type | Description | +|---|---| +| `buildkit` | Docker BuildKit layer cache | +| `apt` | APT package cache | +| `npm` | npm package cache | + +> ⚠️ **Currently stubbed** — always returns `0`. BuildKit cache logs exist on the Docker host, not inside the runner container. Future work will add a sidecar exporter for real cache data. + +--- + +### `github_runner_last_update_timestamp` + +**Type:** Gauge — Unix timestamp of the last metrics collection cycle. + +```promql +# Time since last update (staleness detection) +time() - github_runner_last_update_timestamp + +# Alert: metrics stale (>2 minutes) +time() - github_runner_last_update_timestamp > 120 +``` + +--- + +## 📝 Job Log Format + +Metrics are derived from `/tmp/jobs.log` inside the container. Each line is CSV: + +``` +timestamp,job_id,status,duration_seconds,queue_time_seconds +``` + +| Field | Description | Example | +|---|---|---| +| `timestamp` | ISO 8601 UTC | `2026-03-02T10:05:30Z` | +| `job_id` | `{run_id}_{job_name}` | `12345_build` | +| `status` | Job result | `running`, `success`, `failed` | +| `duration_seconds` | Execution time | `330` | +| `queue_time_seconds` | Time waiting in queue | `12` | + +- `running` entries are written by `job-started.sh` (preliminary, excluded from totals). +- Final entries are written by `job-completed.sh` with actual duration and status. + +--- + +## 📊 What's Next? + +| Guide | Description | +|---|---| +| [Monitoring Setup](Monitoring-Setup.md) | Quick start and configuration | +| [Grafana Dashboards](Grafana-Dashboards.md) | Dashboard details, import, and customization | +| [Monitoring Troubleshooting](Monitoring-Troubleshooting.md) | Fix common monitoring issues | + +> 📖 **Full reference:** See [PROMETHEUS_METRICS_REFERENCE.md](../docs/features/PROMETHEUS_METRICS_REFERENCE.md) in the main docs for extended examples.
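To make the job log format above concrete, here is a rough sketch of how such a file could be summarized from the shell. It assumes the five-column CSV layout described in the Job Log Format section and that `running` rows are skipped; `summarize_jobs` is a hypothetical helper, and the project's own collector may aggregate differently (for example, it averages queue time over the last 100 jobs, which this sketch does not).

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the project scripts.
# Assumed columns: timestamp,job_id,status,duration_seconds,queue_time_seconds

# Print job counts and average duration/queue time for a job log file.
summarize_jobs() {
  local log_file="${1:-/tmp/jobs.log}"
  awk -F',' '
    $3 == "running" { next }      # preliminary entries are excluded from totals
    $3 == "success" { success++ }
    $3 == "failed"  { failed++ }
    { total++; duration += $4; queue += $5 }
    END {
      printf "total=%d success=%d failed=%d\n", total, success, failed
      if (total > 0) {
        printf "avg_duration_seconds=%.1f avg_queue_seconds=%.1f\n", duration / total, queue / total
      }
    }
  ' "$log_file"
}
```

Inside a runner container this could be run as `summarize_jobs /tmp/jobs.log` to cross-check what the dashboards report; an empty log prints only the first line with all counts at zero.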
diff --git a/wiki-content/Monitoring-Setup.md b/wiki-content/Monitoring-Setup.md new file mode 100644 index 00000000..5f7e2285 --- /dev/null +++ b/wiki-content/Monitoring-Setup.md @@ -0,0 +1,186 @@ +# Monitoring Setup + +![Prometheus](https://img.shields.io/badge/Prometheus-Metrics-E6522C?style=for-the-badge&logo=prometheus&logoColor=white) +![Grafana](https://img.shields.io/badge/Grafana-Dashboards-F46800?style=for-the-badge&logo=grafana&logoColor=white) +![Status](https://img.shields.io/badge/Status-Production%20Ready-success?style=for-the-badge) + +All GitHub Actions self-hosted runners expose custom Prometheus metrics on port **9091**. This guide walks you through connecting your existing Prometheus and Grafana instances to collect and visualize runner telemetry. + +--- + +## 🎯 What You Get + +- **8 custom metrics** covering runner status, job counts, duration histograms, DORA metrics, and more +- **4 pre-built Grafana dashboards** (54 panels total) for runner health, DORA metrics, performance trends, and job analysis +- **Zero dependencies** β€” pure Bash implementation, no external exporters required + +> **Note:** This project provides the metrics endpoint and dashboards. You bring your own Prometheus and Grafana. + +--- + +## ⚑ 5-Minute Quick Start + +### Step 1: Deploy a Runner + +```bash +# Clone the repository +git clone https://github.com/GrammaTonic/github-runner.git +cd github-runner + +# Configure +cp config/runner.env.example config/runner.env +# Edit config/runner.env β€” set GITHUB_TOKEN and GITHUB_REPOSITORY + +# Start +docker compose -f docker/docker-compose.production.yml up -d +``` + +### Step 2: Verify Metrics + +```bash +curl http://localhost:9091/metrics +``` + +You should see Prometheus-formatted output with metrics like `github_runner_status`, `github_runner_uptime_seconds`, etc. 
+ +### Step 3: Add Scrape Target + +Add to your `prometheus.yml` under `scrape_configs`: + +```yaml +- job_name: "github-runner" + static_configs: + - targets: [":9091"] + scrape_interval: 15s + metrics_path: /metrics +``` + +Reload Prometheus: + +```bash +curl -X POST http://localhost:9090/-/reload +``` + +### Step 4: Import Grafana Dashboards + +1. Open Grafana → **Dashboards → Import**. +2. Upload JSON files from `monitoring/grafana/dashboards/`: + - `runner-overview.json` — Status and health + - `dora-metrics.json` — DORA metrics + - `job-analysis.json` — Job details + - `performance-trends.json` — Performance data +3. Select your Prometheus datasource when prompted. + +### Step 5: Verify End-to-End + +1. **Prometheus**: Open `http://localhost:9090/targets` — runner target should show `UP`. +2. **Grafana**: Open the **Runner Overview** dashboard — panels should display live data. + +--- + +## 🐳 Runner Types and Port Mapping + +Each runner type listens on container port 9091 internally, but maps to a different host port: + +| Runner Type | Compose File | Host Port | Container Port | Verify Command | +|---|---|---|---|---| +| **Standard** | `docker-compose.production.yml` | `9091` | `9091` | `curl http://localhost:9091/metrics` | +| **Chrome** | `docker-compose.chrome.yml` | `9092` | `9091` | `curl http://localhost:9092/metrics` | +| **Chrome-Go** | `docker-compose.chrome-go.yml` | `9093` | `9091` | `curl http://localhost:9093/metrics` | + +### Multi-Runner Deployment + +Deploy all three runner types simultaneously: + +```bash +# Standard runner (host port 9091) +docker compose -f docker/docker-compose.production.yml up -d + +# Chrome runner (host port 9092) +cp config/chrome-runner.env.example config/chrome-runner.env +# Edit chrome-runner.env +docker compose -f docker/docker-compose.chrome.yml up -d + +# Chrome-Go runner (host port 9093) +cp config/chrome-go-runner.env.example config/chrome-go-runner.env +# Edit chrome-go-runner.env +docker
compose -f docker/docker-compose.chrome-go.yml up -d +``` + +Add all targets to Prometheus: + +```yaml +scrape_configs: + - job_name: "github-runner-standard" + static_configs: + - targets: [":9091"] + - job_name: "github-runner-chrome" + static_configs: + - targets: [":9092"] + - job_name: "github-runner-chrome-go" + static_configs: + - targets: [":9093"] +``` + +--- + +## βš™οΈ Environment Variables + +Configure monitoring behavior through environment variables in your runner `.env` file: + +| Variable | Default | Description | +|---|---|---| +| `RUNNER_TYPE` | `standard` | Runner type label (`standard`, `chrome`, `chrome-go`) | +| `METRICS_PORT` | `9091` | Container port for the metrics endpoint | +| `METRICS_UPDATE_INTERVAL` | `30` | Seconds between metrics collector updates | + +These are pre-configured in the compose files. Override only if needed. + +--- + +## πŸ—οΈ Architecture Overview + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Your Infrastructure (User-Provided) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Prometheus │───▢│ Grafana β”‚ β”‚ +β”‚ β”‚ scrapes :909x β”‚ β”‚ 4 dashboards β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ Runner Containers (This Project) β”‚ +β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ metrics-server β”‚ β”‚ metrics-collector β”‚ β”‚ +β”‚ β”‚ (netcat :9091) β”‚ β”‚ 
(bash, 30s loop) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ /tmp/runner_metrics.prom β”‚ β”‚ +β”‚ β”‚ (Prometheus text format) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**How it works:** + +1. `metrics-collector.sh` runs every 30 seconds, gathers runner data, and writes `/tmp/runner_metrics.prom`. +2. `metrics-server.sh` uses netcat to serve that file over HTTP on port 9091. +3. `job-started.sh` and `job-completed.sh` hook scripts log job events to `/tmp/jobs.log`. +4. Prometheus scrapes the endpoint; Grafana queries Prometheus. + +> πŸ“– **Full architecture details:** See [Prometheus Architecture](../docs/features/PROMETHEUS_ARCHITECTURE.md) in the main docs. + +--- + +## πŸ“Š What's Next? 
+ +| Guide | Description | +|---|---| +| [Metrics Reference](Metrics-Reference.md) | All 8 metrics with types, labels, and PromQL examples | +| [Grafana Dashboards](Grafana-Dashboards.md) | Dashboard details, import instructions, and customization | +| [Monitoring Troubleshooting](Monitoring-Troubleshooting.md) | Fix common monitoring issues | +| [Production Deployment](Production-Deployment.md) | Full production setup with monitoring stack | + +> 📖 **Detailed documentation:** The [docs/features/](../docs/features/) directory contains comprehensive guides for [setup](../docs/features/PROMETHEUS_SETUP.md), [usage & PromQL](../docs/features/PROMETHEUS_USAGE.md), [architecture](../docs/features/PROMETHEUS_ARCHITECTURE.md), and [troubleshooting](../docs/features/PROMETHEUS_TROUBLESHOOTING.md). diff --git a/wiki-content/Monitoring-Troubleshooting.md b/wiki-content/Monitoring-Troubleshooting.md new file mode 100644 index 00000000..fa846f39 --- /dev/null +++ b/wiki-content/Monitoring-Troubleshooting.md @@ -0,0 +1,344 @@ +# Monitoring Troubleshooting + +![Troubleshooting](https://img.shields.io/badge/Troubleshooting-Monitoring-red?style=for-the-badge) + +Common monitoring issues and their solutions. Problems are organized by symptom — find yours and follow the fix.
+
+---
+
+## 🔍 Quick Diagnostic Commands
+
+Run these first to gather information:
+
+```bash
+# Container status
+docker ps --filter "name=github-runner" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
+
+# Metrics endpoint health
+curl -s -o /dev/null -w "%{http_code}" http://localhost:9091/metrics
+
+# Container logs (last 50 lines)
+docker logs --tail 50
+
+# Metrics collector log
+docker exec cat /tmp/metrics-collector.log
+
+# Metrics server log
+docker exec cat /tmp/metrics-server.log
+
+# Metrics file size
+docker exec wc -l /tmp/runner_metrics.prom
+
+# Running processes
+docker exec ps aux | grep -E "metrics|nc"
+```
+
+---
+
+## ❌ Metrics Endpoint Not Responding
+
+**Symptom:** `curl http://localhost:9091/metrics` returns "Connection refused" or times out.
+
+### Check 1: Container Running?
+
+```bash
+docker ps | grep github-runner
+```
+
+**Fix:** Start the container:
+
+```bash
+docker compose -f docker/docker-compose.production.yml up -d
+```
+
+### Check 2: Port Mapped Correctly?
+
+```bash
+docker port
+```
+
+Expected port mappings:
+
+| Runner | Host Port | Container Port |
+|---|---|---|
+| Standard | `9091` | `9091` |
+| Chrome | `9092` | `9091` |
+| Chrome-Go | `9093` | `9091` |
+
+### Check 3: Metrics Server Running?
+
+```bash
+docker exec ps aux | grep metrics-server
+```
+
+**Fix:** Restart the container if the server is not running:
+
+```bash
+docker compose -f docker/docker-compose.production.yml restart
+```
+
+### Check 4: Port Conflict?
+
+```bash
+lsof -i :9091
+# or
+ss -tlnp | grep 9091
+```
+
+**Fix:** Change the host port in the compose file or stop the conflicting process.
+
+---
+
+## ⏸️ Metrics Not Updating
+
+**Symptom:** `github_runner_uptime_seconds` or `github_runner_last_update_timestamp` does not change between requests.
+
+### Check 1: Collector Running?
+
+```bash
+docker exec ps aux | grep metrics-collector
+```
+
+**Fix:** Check the collector log for errors:
+
+```bash
+docker exec cat /tmp/metrics-collector.log
+```
+
+Restart the container if the collector has crashed.
+
+### Check 2: Disk Space?
+
+```bash
+docker exec df -h /tmp
+```
+
+The metrics file needs `/tmp` to be writable.
+
+### Check 3: Update Interval
+
+The default update interval is **30 seconds**. Wait at least 30 seconds between checks.
+
+```bash
+# Watch metrics update in real time
+watch -n 5 'curl -s http://localhost:9091/metrics | grep uptime'
+```
+
+**Reduce interval** via environment variable:
+
+```yaml
+environment:
+  METRICS_UPDATE_INTERVAL: "15"
+```
+
+---
+
+## 📊 Grafana Dashboard Shows "No Data"
+
+**Symptom:** Dashboard panels display "No data" or are empty.
+
+### Check 1: Prometheus Datasource Configured?
+
+In Grafana → **Configuration → Data Sources** → verify a Prometheus datasource exists → click **Save & Test**.
+
+### Check 2: Prometheus Scraping Runners?
+
+Open `http://:9090/targets` and look for `github-runner-*` jobs. Targets should show state `UP`.
+
+**Fix:** Add runner targets to your `prometheus.yml`:
+
+```yaml
+scrape_configs:
+  - job_name: "github-runner-standard"
+    static_configs:
+      - targets: [":9091"]
+```
+
+Reload Prometheus:
+
+```bash
+curl -X POST http://localhost:9090/-/reload
+```
+
+### Check 3: Datasource Name Mismatch?
+
+Dashboards use `${DS_PROMETHEUS}` as a datasource input variable. During import, you must select your Prometheus datasource.
+
+**Fix:** Re-import the dashboard and select the correct datasource.
+
+### Check 4: Time Range Too Narrow?
+
+If the runner was just deployed, there may not be enough data.
+
+**Fix:** Set the dashboard time range to **Last 15 minutes** or **Last 1 hour**.
+
+### Check 5: No Jobs Executed Yet?
+
+Job metrics (`github_runner_jobs_total`, `github_runner_job_duration_seconds`) only populate after jobs run.
Runner status panels update immediately.
+
+**Fix:** Trigger a test workflow in your repository.
+
+---
+
+## 🔻 Prometheus Target Shows DOWN
+
+**Symptom:** Prometheus targets page shows the runner target with state `DOWN`.
+
+### Check 1: Network Connectivity
+
+```bash
+# From the Prometheus host, test connectivity
+curl http://:9091/metrics
+```
+
+**Fix for Docker networks:** Put Prometheus and runners on the same Docker network:
+
+```yaml
+networks:
+  monitoring:
+    external: true
+```
+
+### Check 2: Firewall
+
+```bash
+nc -zv 9091
+```
+
+**Fix:** Open port 9091 in your firewall rules.
+
+### Check 3: Scrape Timeout
+
+```bash
+time curl -s http://localhost:9091/metrics > /dev/null
+```
+
+**Fix:** If response is slow, increase the scrape timeout:
+
+```yaml
+- job_name: "github-runner-standard"
+  scrape_timeout: 15s
+```
+
+---
+
+## 🔢 Job Counts Not Incrementing
+
+**Symptom:** `github_runner_jobs_total` stays at 0 despite running jobs.
+
+### Check 1: Job Hooks Configured?
+
+```bash
+docker exec env | grep ACTIONS_RUNNER_HOOK
+```
+
+Expected:
+
+```
+ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/runner/job-started.sh
+ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/runner/job-completed.sh
+```
+
+These are set by the entrypoint scripts automatically.
+
+### Check 2: Jobs Log Exists?
+
+```bash
+docker exec ls -la /tmp/jobs.log
+docker exec cat /tmp/jobs.log
+```
+
+### Check 3: Hook Scripts Executable?
+
+```bash
+docker exec ls -la /home/runner/job-started.sh /home/runner/job-completed.sh
+```
+
+Scripts should have execute permission (set during Docker build).
+
+---
+
+## 📈 High Resource Usage
+
+**Symptom:** Runner container using more resources than expected.
+
+```bash
+docker stats --no-stream
+docker exec ps aux --sort=-%mem | head -10
+```
+
+### Fix: Reduce Collection Frequency
+
+```yaml
+environment:
+  METRICS_UPDATE_INTERVAL: "60" # Collect every 60s instead of the 30s default
+```
+
+### Fix: Check Jobs Log Growth
+
+```bash
+docker exec wc -l /tmp/jobs.log
+```
+
+For very long-running containers with thousands of log entries, restart to reset the log.
+
+### Fix: Set Resource Limits
+
+```yaml
+deploy:
+  resources:
+    limits:
+      cpus: "2.0"
+      memory: 2G
+```
+
+---
+
+## 0️⃣ Cache Metrics Always Zero
+
+**Symptom:** `github_runner_cache_hit_rate` reports 0 for all cache types.
+
+**This is expected.** Cache metrics are currently **stubbed**: they always return 0. BuildKit cache logs exist on the Docker host (not inside the runner container), and APT/npm caches are internal to build processes.
+
+Future work will add a sidecar exporter for real cache data. See [Metrics Reference](Metrics-Reference.md) for details.
+
+---
+
+## 📋 Collecting Diagnostic Info
+
+If you need to file a bug report, gather this information:
+
+```bash
+# Container info
+docker inspect | head -100
+
+# Metrics output
+curl -s http://localhost:9091/metrics > metrics-dump.txt
+
+# Container logs
+docker logs > container-logs.txt 2>&1
+
+# Collector log
+docker exec cat /tmp/metrics-collector.log > collector-log.txt
+
+# Server log
+docker exec cat /tmp/metrics-server.log > server-log.txt
+
+# Jobs log
+docker exec cat /tmp/jobs.log > jobs-log.txt
+
+# Environment
+docker exec env | grep -E "RUNNER|METRICS|JOBS" > env.txt
+```
+
+---
+
+## 📊 What's Next?
+
+| Guide | Description |
+|---|---|
+| [Monitoring Setup](Monitoring-Setup.md) | Initial configuration and deployment |
+| [Metrics Reference](Metrics-Reference.md) | All 8 metrics with types and PromQL |
+| [Grafana Dashboards](Grafana-Dashboards.md) | Dashboard import and customization |
+
+> 📖 **Full troubleshooting guide:** See [PROMETHEUS_TROUBLESHOOTING.md](../docs/features/PROMETHEUS_TROUBLESHOOTING.md) in the main docs.
diff --git a/wiki-content/Production-Deployment.md b/wiki-content/Production-Deployment.md
index fcbb730a..3e1d068e 100644
--- a/wiki-content/Production-Deployment.md
+++ b/wiki-content/Production-Deployment.md
@@ -178,7 +178,7 @@ LOG_RETENTION_DAYS=30
 # Monitoring
 ENABLE_PROMETHEUS_METRICS=true
 ENABLE_HEALTH_ENDPOINTS=true
-METRICS_PORT=9090
+METRICS_PORT=9091
 ```
 
 ### 3. Production Docker Compose
@@ -358,6 +358,8 @@ docker stack ps github-runner
 
 ## 📊 Production Monitoring
 
+> 📖 **Full monitoring guide:** See [Monitoring Setup](Monitoring-Setup.md) for Prometheus metrics configuration, port mapping for all runner types, and Grafana dashboard import.
+
 ### Health Checks
 
 ```bash
@@ -404,10 +406,10 @@ alerting:
           - alertmanager:9093
 
 scrape_configs:
-  - job_name: "github-runners"
+  - job_name: "github-runner-standard"
     static_configs:
-      - targets: ["runner:8080"]
-    scrape_interval: 30s
+      - targets: ["runner:9091"]
+    scrape_interval: 15s
     metrics_path: /metrics
 
   - job_name: "docker"
diff --git a/wiki-content/Quick-Start.md b/wiki-content/Quick-Start.md
index 68bbdef6..09a2200d 100644
--- a/wiki-content/Quick-Start.md
+++ b/wiki-content/Quick-Start.md
@@ -109,8 +109,7 @@ docker system prune -a -f
 ## 🎯 What's Next?
 
 - **[Production Setup](Production-Deployment.md)** - Scale for production use
-
-
+- **[Monitoring Setup](Monitoring-Setup.md)** - Prometheus metrics and Grafana dashboards
 - **[Troubleshooting](Common-Issues.md)** - Fix common problems
 
 ## 💡 Quick Tips