docs: add Prometheus monitoring documentation (Phase 5) #1139

Merged
GrammaTonic merged 3 commits into develop from feature/prometheus-documentation-phase5
Mar 2, 2026

@GrammaTonic
Owner

Summary

Phase 5 of Prometheus monitoring implementation: complete documentation and user guide.

Implements all 10 tasks (TASK-047 through TASK-056) from Issue #1063.

Type of Change

  • Documentation update

Changes Made

New Files (7)

| File | Task | Description |
|---|---|---|
| docs/features/PROMETHEUS_SETUP.md | TASK-047 | Full setup guide |
| docs/features/PROMETHEUS_USAGE.md | TASK-048 | Usage guide with PromQL queries |
| docs/features/PROMETHEUS_TROUBLESHOOTING.md | TASK-049 | Troubleshooting guide |
| docs/features/PROMETHEUS_ARCHITECTURE.md | TASK-050 | Architecture documentation |
| docs/features/PROMETHEUS_METRICS_REFERENCE.md | TASK-054 | Complete metrics reference |
| docs/features/PROMETHEUS_QUICKSTART.md | TASK-056 | 5-minute quick start |
| monitoring/prometheus-scrape-example.yml | TASK-053 | Example scrape config |

Updated Files (5)

| File | Task | Changes |
|---|---|---|
| README.md | TASK-051 | Fixed port 9090 to 9091, added dashboard table and doc links |
| docs/README.md | TASK-052 | Added Prometheus Monitoring section with all doc links |
| docs/API.md | TASK-055 | Rewrote metrics section with correct metric names |
| config/runner.env.example | - | Added Metrics and Monitoring env vars |
| plan/feature-prometheus-monitoring-1.md | - | Marked Phase 5 tasks complete |

Testing

  • All markdown files valid
  • YAML file valid
  • Cross-document links verified
  • Metric names match metrics-collector.sh output

Related Issues

Create 6 new documentation files, 1 example scrape config,
and update 5 existing files for the Prometheus metrics system.

New files: PROMETHEUS_SETUP.md, PROMETHEUS_USAGE.md,
PROMETHEUS_TROUBLESHOOTING.md, PROMETHEUS_ARCHITECTURE.md,
PROMETHEUS_METRICS_REFERENCE.md, PROMETHEUS_QUICKSTART.md,
monitoring/prometheus-scrape-example.yml

Updated: README.md (fix port 9090->9091, add doc links),
docs/README.md (add Prometheus section), docs/API.md (rewrite
metrics with correct names), config/runner.env.example (add
metrics vars), plan/feature-prometheus-monitoring-1.md

Implements: TASK-047 through TASK-056 (Issue #1063)
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Prometheus monitoring capabilities by providing a comprehensive set of documentation and user guides. It aims to streamline the process for users to set up, effectively utilize, and troubleshoot Prometheus monitoring for GitHub Actions self-hosted runners, ensuring clarity on metric definitions, system architecture, and operational best practices.

Highlights

  • Comprehensive Documentation: A full suite of documentation has been added for Prometheus monitoring, covering setup, usage, troubleshooting, architecture, and a quick start guide.
  • Detailed Metrics Reference: A new document provides a complete reference for all exposed Prometheus metrics, including their definitions, types, labels, and PromQL examples.
  • Updated Main Documentation: The main README.md and docs/README.md files have been updated to integrate and link to the new monitoring documentation.
  • Prometheus Scrape Configuration Example: An example Prometheus scrape configuration file has been added to assist users in setting up Prometheus to collect metrics from various runner types.
  • Revised API Documentation: The API documentation for the /metrics endpoint has been rewritten to reflect correct metric names, types, descriptions, and port information.
  • New Environment Variables: Optional environment variables for configuring metrics and monitoring, such as RUNNER_TYPE, METRICS_PORT, and METRICS_UPDATE_INTERVAL, have been introduced.
  • Feature Plan Completion: All Phase 5 documentation tasks (TASK-047 through TASK-056) for the Prometheus monitoring feature have been marked as complete.

Changelog
  • README.md
    • Updated the Monitoring section to reflect the correct Prometheus port (9091).
    • Added examples for metrics endpoints for standard, Chrome, and Chrome-Go runners.
    • Included a table detailing pre-built Grafana dashboards.
    • Added a comprehensive list of links to the new Prometheus monitoring documentation.
  • config/runner.env.example
    • Introduced new optional environment variables for configuring metrics and monitoring, including RUNNER_TYPE, METRICS_PORT, and METRICS_UPDATE_INTERVAL.
  • docs/API.md
    • Completely revised the /metrics API endpoint documentation.
    • Updated metric names, types, descriptions, and port information.
    • Added links to the new Metrics Reference and Usage Guide.
  • docs/README.md
    • Added a new "Prometheus Monitoring" section.
    • Included links to seven Prometheus documentation files (Quick Start, Setup, Usage, Metrics Reference, Architecture, Troubleshooting, Grafana Dashboard Metrics).
  • docs/features/PROMETHEUS_ARCHITECTURE.md
    • Added a new document detailing the internal architecture of the Prometheus monitoring system.
    • Included a system architecture diagram, component descriptions, data flow, design decisions (Bash + Netcat), and scalability considerations.
  • docs/features/PROMETHEUS_METRICS_REFERENCE.md
    • Added a new document providing a complete reference for all exposed Prometheus metrics.
    • Detailed each metric's type, description, labels, values, source, and PromQL examples.
    • Included a summary table and information on the job log format.
  • docs/features/PROMETHEUS_QUICKSTART.md
    • Added a new quick start guide for Prometheus monitoring.
    • Provided a 5-minute, 5-step setup process covering runner deployment, metrics verification, Prometheus scrape configuration, Grafana dashboard import, and multi-runner setup.
  • docs/features/PROMETHEUS_SETUP.md
    • Added a new comprehensive setup guide for Prometheus monitoring.
    • Covered prerequisites, deploying runners with metrics, verifying endpoints, configuring Prometheus scrape targets, Grafana datasource setup, dashboard import (manual and provisioning), and end-to-end verification.
  • docs/features/PROMETHEUS_TROUBLESHOOTING.md
    • Added a new troubleshooting guide for common Prometheus monitoring issues.
    • Addressed problems like unresponsive endpoints, stale metrics, "No Data" in Grafana, Prometheus targets showing DOWN, job counts not incrementing, high resource usage, and cache metrics always being zero.
  • monitoring/prometheus-scrape-example.yml
    • Added a new example Prometheus scrape configuration file.
    • Provided configurations for standard, Chrome, and Chrome-Go runners, including an alternative for Docker network setups.
  • plan/feature-prometheus-monitoring-1.md
    • Updated the "Implementation Phase 5: Documentation & User Guide" section.
    • Marked all associated tasks (TASK-047 through TASK-056) as complete with their completion dates.
Activity
  • The pull request introduces a significant amount of new documentation and configuration related to Prometheus monitoring.
  • All tasks for Phase 5 of the Prometheus monitoring feature have been completed.
  • No human activity (comments, reviews) has been recorded on this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive set of documentation for the new Prometheus monitoring feature. The new guides for setup, usage, architecture, and troubleshooting are exceptionally detailed and will be a great resource for users. The updates to the README and API documentation also significantly improve discoverability and clarity.

I've left a few comments on the documentation files pointing out a broken link, a couple of potential improvements to the metrics implementation for better maintainability and observability, and a correction needed for a DORA metric query. Overall, this is an excellent contribution that greatly enhances the project's monitoring capabilities.

Comment on lines +115 to +119
# Mean Time to Recovery (average duration of failed jobs in minutes)
rate(github_runner_job_duration_seconds_sum{status="failed"}[1h])
/
rate(github_runner_job_duration_seconds_count{status="failed"}[1h])
/ 60

high

The PromQL query provided for "Mean Time to Recovery" (MTTR) is incorrect. It attempts to filter the github_runner_job_duration_seconds metric by status="failed", but this metric does not have a status label. The histogram currently aggregates durations for all completed jobs, regardless of their success or failure.

To fix this, the github_runner_job_duration_seconds histogram metric needs to be updated to include a status label (e.g., status="success" or status="failed"). This would require changes to docker/metrics-collector.sh to generate separate histograms for each status. Without this change, the MTTR query as written will return no data.
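
A rough sketch of what per-status histogram output could look like, assuming the collector can read job records as one `status duration_seconds` pair per line. The function name `emit_duration_histogram`, that input format, and the bucket boundaries are all hypothetical and are not taken from docker/metrics-collector.sh.

```shell
# Hypothetical helper: reads "status duration_seconds" lines on stdin and
# prints one set of Prometheus histogram series per status value.
emit_duration_histogram() {
  awk '
    BEGIN { nb = split("60 300 900 3600", le, " ") }   # assumed bucket bounds
    NF == 2 {
      s = $1; d = $2 + 0
      seen[s] = 1
      count[s]++
      sum[s] += d
      # Buckets are cumulative: count the job in every bucket it fits under.
      for (i = 1; i <= nb; i++) if (d <= le[i] + 0) bucket[s, i]++
    }
    END {
      print "# TYPE github_runner_job_duration_seconds histogram"
      for (s in seen) {
        for (i = 1; i <= nb; i++)
          printf "github_runner_job_duration_seconds_bucket{status=\"%s\",le=\"%s\"} %d\n", s, le[i], bucket[s, i]
        printf "github_runner_job_duration_seconds_bucket{status=\"%s\",le=\"+Inf\"} %d\n", s, count[s]
        printf "github_runner_job_duration_seconds_sum{status=\"%s\"} %s\n", s, sum[s]
        printf "github_runner_job_duration_seconds_count{status=\"%s\"} %d\n", s, count[s]
      }
    }
  '
}
```

With series like these exposed, a `status="failed"` selector on the `_sum` and `_count` series would have something to match.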

- [Metrics Reference](features/PROMETHEUS_METRICS_REFERENCE.md) - Complete metric definitions
- [Architecture](features/PROMETHEUS_ARCHITECTURE.md) - System design and data flow
- [Troubleshooting](features/PROMETHEUS_TROUBLESHOOTING.md) - Common issues and fixes
- [Grafana Dashboard Metrics](features/GRAFANA_DASHBOARD_METRICS.md) - Dashboard feature specification

medium

This line links to features/GRAFANA_DASHBOARD_METRICS.md, but this file does not appear to be included in the pull request. This will result in a broken link for users. Please either add the missing file or remove this link.
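
A check along these lines could catch this kind of broken link before merge. This is a minimal sketch, assuming links are written inline as `[text](target)` and resolve relative to the file that contains them; the function name `find_broken_links` is hypothetical.

```shell
# Hypothetical helper: print relative Markdown link targets in a file
# that do not exist on disk.
find_broken_links() {
  local doc="$1" base
  base="$(dirname "$doc")"
  # Pull out the (target) part of each [text](target) link.
  grep -o ']([^)]*)' "$doc" | sed 's/^](//; s/)$//' | while IFS= read -r target; do
    case "$target" in
      http://*|https://*|mailto:*|'#'*) continue ;;  # skip external links and in-page anchors
    esac
    path="${target%%#*}"                             # drop any #fragment
    [ -e "$base/$path" ] || printf '%s\n' "$target"
  done
}
```

Run against docs/README.md from the repository root, it would print one line per missing target and nothing when every relative link resolves.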

| `UPDATE_INTERVAL` | `30` | Seconds between updates |
| `RUNNER_NAME` | `unknown` | Runner name label |
| `RUNNER_TYPE` | `standard` | Runner type label |
| `RUNNER_VERSION` | `2.332.0` | Runner version label |

medium

The RUNNER_VERSION is documented here with a hardcoded default value of 2.332.0. This creates a maintenance issue, as the version will need to be manually updated in the script whenever the runner agent is upgraded.

To make this more robust, I recommend determining the version dynamically within the metrics-collector.sh script by querying the runner binary itself. For example:

# In metrics-collector.sh
RUNNER_VERSION=$("/actions-runner/bin/Runner.Listener" --version 2>/dev/null || echo "unknown")

This would ensure the github_runner_info metric always reports the correct version automatically.

|---|---|
| **Type** | Gauge |
| **Description** | Unix timestamp of the last metrics update |
| **Labels** | None |

medium

The github_runner_last_update_timestamp metric is documented as having no labels. In a multi-runner environment, this makes it impossible to determine which specific runner's metrics have gone stale. All other metrics include runner_name and runner_type labels for this reason.

Please add these labels to github_runner_last_update_timestamp to improve its utility for alerting and diagnostics. The implementation in docker/metrics-collector.sh would need to be updated accordingly, for example:
github_runner_last_update_timestamp{runner_name="$RUNNER_NAME",runner_type="$RUNNER_TYPE"} $(date +%s)

Suggested change
- | **Labels** | None |
+ | **Labels** | `runner_name`, `runner_type` |
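
To illustrate why the labels matter, here is a small sketch of a per-runner staleness check over the metrics text, assuming the timestamp line carries a `runner_name` label as proposed above. The function name `stale_runners` and its arguments are hypothetical.

```shell
# Hypothetical helper: list runner_name values whose last update is older
# than MAX_AGE_SECONDS.
# Usage: stale_runners NOW_EPOCH MAX_AGE_SECONDS < metrics.txt
stale_runners() {
  awk -v now="$1" -v max_age="$2" '
    /^github_runner_last_update_timestamp\{/ {
      ts = $NF + 0                                    # sample value is the last field
      if (match($0, /runner_name="[^"]*"/)) {
        # Skip the 13-character prefix and the closing quote to get the name.
        name = substr($0, RSTART + 13, RLENGTH - 14)
        if (now - ts > max_age) print name
      }
    }
  '
}
```

Without the label, the same check could only report that some runner is stale, not which one.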

Create 4 new wiki pages mirroring Phase 5 Prometheus documentation:
- Monitoring-Setup.md: Quick start, port mapping, multi-runner config
- Metrics-Reference.md: All 8 metrics with types, labels, and PromQL
- Grafana-Dashboards.md: 4 dashboards, import/provisioning instructions
- Monitoring-Troubleshooting.md: Symptom-based troubleshooting guide

Update 5 existing wiki pages:
- Home.md: Add Monitoring & Observability section to Table of Contents
- Production-Deployment.md: Fix METRICS_PORT 9090→9091, scrape target
  runner:8080→runner:9091, add monitoring guide cross-link
- Quick-Start.md: Restore monitoring link in What's Next section
- Chrome-Runner.md: Add Prometheus metrics port 9092 info and links
- Docker-Configuration.md: Add monitoring setup link below architecture
@GrammaTonic GrammaTonic merged commit 58be548 into develop Mar 2, 2026
22 checks passed
@GrammaTonic GrammaTonic deleted the feature/prometheus-documentation-phase5 branch March 2, 2026 02:11