docs: add Prometheus monitoring documentation (Phase 5) #1139
GrammaTonic merged 3 commits into develop
Conversation
Create 6 new documentation files, 1 example scrape config, and update 4 existing files for the Prometheus metrics system.

New files:
- PROMETHEUS_SETUP.md
- PROMETHEUS_USAGE.md
- PROMETHEUS_TROUBLESHOOTING.md
- PROMETHEUS_ARCHITECTURE.md
- PROMETHEUS_METRICS_REFERENCE.md
- PROMETHEUS_QUICKSTART.md
- monitoring/prometheus-scrape-example.yml

Updated:
- README.md (fix port 9090 -> 9091, add doc links)
- docs/README.md (add Prometheus section)
- docs/API.md (rewrite metrics with correct names)
- config/runner.env.example (add metrics vars)
- plan/feature-prometheus-monitoring-1.md

Implements: TASK-047 through TASK-056 (Issue #1063)
Summary of Changes

This pull request enhances the Prometheus monitoring capabilities by providing a comprehensive set of documentation and user guides. It aims to streamline the process for users to set up, effectively use, and troubleshoot Prometheus monitoring for GitHub Actions self-hosted runners, ensuring clarity on metric definitions, system architecture, and operational best practices.

Highlights
Activity
Code Review
This pull request introduces a comprehensive set of documentation for the new Prometheus monitoring feature. The new guides for setup, usage, architecture, and troubleshooting are exceptionally detailed and will be a great resource for users. The updates to the README and API documentation also significantly improve discoverability and clarity.
I've left a few comments on the documentation files pointing out a broken link, a couple of potential improvements to the metrics implementation for better maintainability and observability, and a correction needed for a DORA metric query. Overall, this is an excellent contribution that greatly enhances the project's monitoring capabilities.
```promql
# Mean Time to Recovery (average duration of failed jobs in minutes)
rate(github_runner_job_duration_seconds_sum{status="failed"}[1h])
/
rate(github_runner_job_duration_seconds_count{status="failed"}[1h])
/ 60
```
The PromQL query provided for "Mean Time to Recovery" (MTTR) is incorrect. It attempts to filter the github_runner_job_duration_seconds metric by status="failed", but this metric does not have a status label. The histogram currently aggregates durations for all completed jobs, regardless of their success or failure.
To fix this, the github_runner_job_duration_seconds histogram metric needs to be updated to include a status label (e.g., status="success" or status="failed"). This would require changes to docker/metrics-collector.sh to generate separate histograms for each status. Without this change, the MTTR query as written will return no data.
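The suggested fix can be sketched as follows. This is an illustrative shell snippet, not the actual contents of `docker/metrics-collector.sh`: the function name and the per-status totals are assumptions, and a real collector would also emit the histogram bucket series. It only shows how the `_sum` and `_count` series would gain a `status` label so the MTTR query's `status="failed"` selector matches.

```shell
#!/bin/sh
# Illustrative sketch: emit separate duration sums/counts per job status
# so PromQL selectors like {status="failed"} return data.
# Function name and values are hypothetical, not from the real collector.
emit_duration_metrics() {
  # $1 = status, $2 = total duration in seconds, $3 = number of jobs
  echo "github_runner_job_duration_seconds_sum{status=\"$1\"} $2"
  echo "github_runner_job_duration_seconds_count{status=\"$1\"} $3"
}

# Example exposition output for both statuses:
emit_duration_metrics "success" "842.5" "17"
emit_duration_metrics "failed" "120.0" "2"
```

With both series labeled, the MTTR query above divides the failed-job duration sum by the failed-job count, yielding the average failed-job duration in minutes.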
- [Metrics Reference](features/PROMETHEUS_METRICS_REFERENCE.md) - Complete metric definitions
- [Architecture](features/PROMETHEUS_ARCHITECTURE.md) - System design and data flow
- [Troubleshooting](features/PROMETHEUS_TROUBLESHOOTING.md) - Common issues and fixes
- [Grafana Dashboard Metrics](features/GRAFANA_DASHBOARD_METRICS.md) - Dashboard feature specification
| Variable | Default | Description |
|---|---|---|
| `UPDATE_INTERVAL` | `30` | Seconds between updates |
| `RUNNER_NAME` | `unknown` | Runner name label |
| `RUNNER_TYPE` | `standard` | Runner type label |
| `RUNNER_VERSION` | `2.332.0` | Runner version label |
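The variables in the table above would be set in an env file in the style of `config/runner.env.example`; a minimal sketch (the values here are illustrative, only the variable names come from the table):

```shell
# Metrics collector settings (illustrative values)
UPDATE_INTERVAL=30
RUNNER_NAME=runner-01
RUNNER_TYPE=standard
RUNNER_VERSION=2.332.0
```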
The RUNNER_VERSION is documented here with a hardcoded default value of 2.332.0. This creates a maintenance issue, as the version will need to be manually updated in the script whenever the runner agent is upgraded.
To make this more robust, I recommend determining the version dynamically within the metrics-collector.sh script by querying the runner binary itself. For example:
```shell
# In metrics-collector.sh
RUNNER_VERSION=$("/actions-runner/bin/Runner.Listener" --version 2>/dev/null || echo "unknown")
```

This would ensure the github_runner_info metric always reports the correct version automatically.
| Property | Value |
|---|---|
| **Type** | Gauge |
| **Description** | Unix timestamp of the last metrics update |
| **Labels** | None |
The github_runner_last_update_timestamp metric is documented as having no labels. In a multi-runner environment, this makes it impossible to determine which specific runner's metrics have gone stale. All other metrics include runner_name and runner_type labels for this reason.
Please add these labels to github_runner_last_update_timestamp to improve its utility for alerting and diagnostics. The implementation in docker/metrics-collector.sh would need to be updated accordingly, for example:
```shell
github_runner_last_update_timestamp{runner_name="$RUNNER_NAME",runner_type="$RUNNER_TYPE"} $(date +%s)
```
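Once the metric carries these labels, a staleness alert can name the specific runner whose collector has stopped updating. A sketch of such a rule follows; the group name, alert name, and the 120-second threshold are illustrative assumptions, not part of this PR:

```yaml
# Illustrative Prometheus alerting rule; names and thresholds are assumptions.
groups:
  - name: runner-staleness
    rules:
      - alert: RunnerMetricsStale
        expr: time() - github_runner_last_update_timestamp > 120
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Metrics for runner {{ $labels.runner_name }} are stale"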
```diff
- | **Labels** | None |
+ | **Labels** | `runner_name`, `runner_type` |
```
Create 4 new wiki pages mirroring Phase 5 Prometheus documentation:
- Monitoring-Setup.md: Quick start, port mapping, multi-runner config
- Metrics-Reference.md: All 8 metrics with types, labels, and PromQL
- Grafana-Dashboards.md: 4 dashboards, import/provisioning instructions
- Monitoring-Troubleshooting.md: Symptom-based troubleshooting guide

Update 5 existing wiki pages:
- Home.md: Add Monitoring & Observability section to Table of Contents
- Production-Deployment.md: Fix METRICS_PORT 9090→9091, scrape target runner:8080→runner:9091, add monitoring guide cross-link
- Quick-Start.md: Restore monitoring link in What's Next section
- Chrome-Runner.md: Add Prometheus metrics port 9092 info and links
- Docker-Configuration.md: Add monitoring setup link below architecture
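The corrected ports above (standard runner on 9091, Chrome runner on 9092) imply a scrape configuration along these lines. This is a sketch, not the shipped `monitoring/prometheus-scrape-example.yml`; the job name, interval, and hostnames are illustrative:

```yaml
# Illustrative Prometheus scrape config matching the corrected ports.
scrape_configs:
  - job_name: github-runner
    scrape_interval: 30s
    static_configs:
      - targets:
          - runner:9091         # standard runner metrics port
          - chrome-runner:9092  # Chrome runner metrics port
```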
Summary
Phase 5 of Prometheus monitoring implementation: complete documentation and user guide.
Implements all 10 tasks (TASK-047 through TASK-056) from Issue #1063.
Type of Change
Changes Made
New Files (7)
Updated Files (5)
Testing
Related Issues