39 changes: 32 additions & 7 deletions README.md
@@ -402,19 +402,44 @@ docker compose -f docker/docker-compose.chrome.yml up -d

## 📊 Monitoring

### Health Checks
All runner types expose Prometheus-compatible metrics on port **9091** (container port). See the [Monitoring Quick Start](docs/features/PROMETHEUS_QUICKSTART.md) to get started in 5 minutes.

### Metrics Endpoint

```bash
# Check runner health
curl http://localhost:8080/health
# Standard runner metrics (host port 9091)
curl http://localhost:9091/metrics

# Prometheus metrics
curl http://localhost:9090/metrics
# Chrome runner metrics (host port 9092)
curl http://localhost:9092/metrics

# Grafana dashboard
open http://localhost:3000
# Chrome-Go runner metrics (host port 9093)
curl http://localhost:9093/metrics
```

### Grafana Dashboards

Four pre-built dashboards are provided in `monitoring/grafana/dashboards/`:

| Dashboard | File | Panels |
|---|---|---|
| Runner Overview | `runner-overview.json` | 12 |
| DORA Metrics | `dora-metrics.json` | 12 |
| Performance Trends | `performance-trends.json` | 14 |
| Job Analysis | `job-analysis.json` | 16 |

Import them into your Grafana instance or use the provisioning config for auto-loading.

### Documentation

- [Quick Start](docs/features/PROMETHEUS_QUICKSTART.md) — 5-minute setup
- [Setup Guide](docs/features/PROMETHEUS_SETUP.md) — Full configuration
- [Usage Guide](docs/features/PROMETHEUS_USAGE.md) — PromQL queries and alerts
- [Metrics Reference](docs/features/PROMETHEUS_METRICS_REFERENCE.md) — All metric definitions
- [Architecture](docs/features/PROMETHEUS_ARCHITECTURE.md) — System internals
- [Troubleshooting](docs/features/PROMETHEUS_TROUBLESHOOTING.md) — Common issues
- [API Reference](docs/API.md) — Endpoint details

## 🔧 Maintenance

### Scaling
13 changes: 13 additions & 0 deletions config/runner.env.example
@@ -61,6 +61,19 @@ REGISTRY=ghcr.io/grammatonic
RUNNER_IMAGE_TAG=latest
CHROME_IMAGE_TAG=chrome-latest

# ==========================================
# OPTIONAL: Metrics & Monitoring
# ==========================================

# Runner type identifier (used in Prometheus labels)
# RUNNER_TYPE=standard

# Metrics HTTP server port (inside the container)
# METRICS_PORT=9091

# Metrics collector update interval in seconds
# METRICS_UPDATE_INTERVAL=30

# Resource Limits (uncomment to enable)
# RUNNER_MEMORY_LIMIT=1g
# RUNNER_CPU_LIMIT=1.0
29 changes: 21 additions & 8 deletions docs/API.md
@@ -29,16 +29,29 @@ Returns the current health status of the runner (Chrome or normal).

### GET /metrics

Returns Prometheus metrics for monitoring runner health and job execution.
Returns Prometheus-formatted metrics for monitoring runner health and job execution.

**Key Metrics:**
**Port:** 9091 (container port). Host port mappings: 9091 (standard), 9092 (chrome), 9093 (chrome-go).

- `github_runner_jobs_total` - Total jobs executed
- `github_runner_jobs_duration_seconds` - Job execution time
- `github_runner_registration_status` - Registration health (1 = registered, 0 = not registered)
- `github_runner_last_job_timestamp` - Timestamp of last job
- `github_runner_uptime_seconds` - Runner uptime in seconds
- `github_runner_type` - Runner type (chrome/normal)
**Content-Type:** `text/plain; version=0.0.4; charset=utf-8`

**Metrics Exposed:**

| Metric | Type | Description |
|---|---|---|
| `github_runner_status` | gauge | Runner status (1=online, 0=offline) |
| `github_runner_info` | gauge | Runner metadata (name, type, version) |
| `github_runner_uptime_seconds` | counter | Runner uptime in seconds |
| `github_runner_jobs_total` | counter | Total jobs by status (total, success, failed) |
| `github_runner_job_duration_seconds` | histogram | Job duration distribution (buckets: 60s–3600s) |
| `github_runner_queue_time_seconds` | gauge | Average queue wait time (last 100 jobs) |
| `github_runner_cache_hit_rate` | gauge | Cache hit rate by type (stubbed at 0) |
| `github_runner_last_update_timestamp` | gauge | Unix timestamp of last metrics update |

All metrics carry `runner_name` and `runner_type` labels.

For full metric definitions, see [Metrics Reference](features/PROMETHEUS_METRICS_REFERENCE.md).
For PromQL query examples, see [Usage Guide](features/PROMETHEUS_USAGE.md).
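
As a rough illustration of consuming this endpoint from a script, the sketch below pulls a single metric value out of the text exposition format. The `sample_metrics` and `metric_value` helpers and the `example-runner` label value are hypothetical and used only for illustration; the metric names and the two labels come from the table above.

```bash
#!/usr/bin/env bash

# Hypothetical sample of /metrics output; the label values are made up.
sample_metrics() {
  cat <<'EOF'
# HELP github_runner_status Runner status (1=online, 0=offline)
# TYPE github_runner_status gauge
github_runner_status{runner_name="example-runner",runner_type="standard"} 1
# TYPE github_runner_uptime_seconds counter
github_runner_uptime_seconds{runner_name="example-runner",runner_type="standard"} 3600
EOF
}

# Read exposition text on stdin and print the value of the first sample
# whose metric name equals $1. Assumes label values contain no whitespace.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                    # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)       # strip the {label="..."} set
      if (metric == name) { print $NF; exit }
    }'
}

sample_metrics | metric_value github_runner_status
```

Against a running standard runner, the same helper could be fed from the endpoint instead: `curl -s http://localhost:9091/metrics | metric_value github_runner_status`.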

## Container Labels

11 changes: 11 additions & 0 deletions docs/README.md
@@ -60,6 +60,17 @@ docs/
- [Runner Self-Test](features/RUNNER_SELF_TEST.md) - Automated runner validation


### Prometheus Monitoring

- [Quick Start](features/PROMETHEUS_QUICKSTART.md) - 5-minute monitoring setup
- [Setup Guide](features/PROMETHEUS_SETUP.md) - Full Prometheus and Grafana configuration
- [Usage Guide](features/PROMETHEUS_USAGE.md) - PromQL queries, alerts, and dashboard customization
- [Metrics Reference](features/PROMETHEUS_METRICS_REFERENCE.md) - Complete metric definitions
- [Architecture](features/PROMETHEUS_ARCHITECTURE.md) - System design and data flow
- [Troubleshooting](features/PROMETHEUS_TROUBLESHOOTING.md) - Common issues and fixes
- [Grafana Dashboard Metrics](features/GRAFANA_DASHBOARD_METRICS.md) - Dashboard feature specification

Review comment (severity: medium):

This line links to `features/GRAFANA_DASHBOARD_METRICS.md`, but that file does not appear to be included in the pull request, so the link will be broken for users. Please either add the missing file or remove this link.
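
A rough sketch of how a link like this could be caught before merge. `check_links` is a hypothetical helper, not something in this repository; it assumes links are written as `[text](path)` and resolves each relative path against the directory of the Markdown file.

```bash
#!/usr/bin/env bash

# Print every relative Markdown link target in file $1 that does not exist on disk.
# Hypothetical helper for illustration; handles only inline [text](path) links.
check_links() {
  file="$1"
  dir="$(dirname "$file")"
  grep -oE '\]\([^)]+\)' "$file" | sed -E 's/^\]\((.*)\)$/\1/' |
  while IFS= read -r target; do
    case "$target" in
      http://*|https://*|mailto:*|'#'*) continue ;;  # skip external and in-page links
    esac
    path="${target%%#*}"                             # drop any #fragment
    [ -e "$dir/$path" ] || echo "broken: $target"
  done
}
```

Running `check_links docs/README.md` from the repository root would list each target that is missing from the working tree; empty output means every relative link resolved.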



### Releases

- [Changelog](releases/CHANGELOG.md) - Full release history