docs: add Prometheus monitoring documentation (Phase 5) by GrammaTonic · Pull Request #1139 · GrammaTonic/github-runner

GrammaTonic · 2026-03-02T01:53:10Z

Summary

Phase 5 of Prometheus monitoring implementation: complete documentation and user guide.

Implements all 10 tasks (TASK-047 through TASK-056) from Issue #1063.

Type of Change

Documentation update

Changes Made

New Files (7)

File	Task	Description
docs/features/PROMETHEUS_SETUP.md	TASK-047	Full setup guide
docs/features/PROMETHEUS_USAGE.md	TASK-048	Usage guide with PromQL queries
docs/features/PROMETHEUS_TROUBLESHOOTING.md	TASK-049	Troubleshooting guide
docs/features/PROMETHEUS_ARCHITECTURE.md	TASK-050	Architecture documentation
docs/features/PROMETHEUS_METRICS_REFERENCE.md	TASK-054	Complete metrics reference
docs/features/PROMETHEUS_QUICKSTART.md	TASK-056	5-minute quick start
monitoring/prometheus-scrape-example.yml	TASK-053	Example scrape config

Updated Files (5)

File	Task	Changes
README.md	TASK-051	Fixed port 9090 to 9091, added dashboard table and doc links
docs/README.md	TASK-052	Added Prometheus Monitoring section with all doc links
docs/API.md	TASK-055	Rewrote metrics section with correct metric names
config/runner.env.example	-	Added Metrics and Monitoring env vars
plan/feature-prometheus-monitoring-1.md	-	Marked Phase 5 tasks complete

Testing

All markdown files valid
YAML file valid
Cross-document links verified
Metric names match metrics-collector.sh output

Related Issues

Closes [Feature] Phase 5: Documentation & User Guide #1063

Create 6 new documentation files, 1 example scrape config, and update 4 existing files for the Prometheus metrics system. New files: PROMETHEUS_SETUP.md, PROMETHEUS_USAGE.md, PROMETHEUS_TROUBLESHOOTING.md, PROMETHEUS_ARCHITECTURE.md, PROMETHEUS_METRICS_REFERENCE.md, PROMETHEUS_QUICKSTART.md, monitoring/prometheus-scrape-example.yml Updated: README.md (fix port 9090->9091, add doc links), docs/README.md (add Prometheus section), docs/API.md (rewrite metrics with correct names), config/runner.env.example (add metrics vars), plan/feature-prometheus-monitoring-1.md Implements: TASK-047 through TASK-056 (Issue #1063)

gemini-code-assist · 2026-03-02T01:53:36Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Prometheus monitoring capabilities by providing a comprehensive set of documentation and user guides. It aims to streamline the process for users to set up, effectively utilize, and troubleshoot Prometheus monitoring for GitHub Actions self-hosted runners, ensuring clarity on metric definitions, system architecture, and operational best practices.

Highlights

Comprehensive Documentation: A full suite of documentation has been added for Prometheus monitoring, covering setup, usage, troubleshooting, architecture, and a quick start guide.
Detailed Metrics Reference: A new document provides a complete reference for all exposed Prometheus metrics, including their definitions, types, labels, and PromQL examples.
Updated Main Documentation: The main README.md and docs/README.md files have been updated to integrate and link to the new monitoring documentation.
Prometheus Scrape Configuration Example: An example Prometheus scrape configuration file has been added to assist users in setting up Prometheus to collect metrics from various runner types.
Revised API Documentation: The API documentation for the /metrics endpoint has been rewritten to reflect correct metric names, types, descriptions, and port information.
New Environment Variables: Optional environment variables for configuring metrics and monitoring, such as RUNNER_TYPE, METRICS_PORT, and METRICS_UPDATE_INTERVAL, have been introduced.
Feature Plan Completion: All Phase 5 documentation tasks (TASK-047 through TASK-056) for the Prometheus monitoring feature have been marked as complete.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

README.md
- Updated the Monitoring section to reflect the correct Prometheus port (9091).
- Added examples for metrics endpoints for standard, Chrome, and Chrome-Go runners.
- Included a table detailing pre-built Grafana dashboards.
- Added a comprehensive list of links to the new Prometheus monitoring documentation.
config/runner.env.example
- Introduced new optional environment variables for configuring metrics and monitoring, including RUNNER_TYPE, METRICS_PORT, and METRICS_UPDATE_INTERVAL.
docs/API.md
- Completely revised the /metrics API endpoint documentation.
- Updated metric names, types, descriptions, and port information.
- Added links to the new Metrics Reference and Usage Guide.
docs/README.md
- Added a new "Prometheus Monitoring" section.
- Included links to all seven new Prometheus documentation files (Quick Start, Setup, Usage, Metrics Reference, Architecture, Troubleshooting, Grafana Dashboard Metrics).
docs/features/PROMETHEUS_ARCHITECTURE.md
- Added a new document detailing the internal architecture of the Prometheus monitoring system.
- Included a system architecture diagram, component descriptions, data flow, design decisions (Bash + Netcat), and scalability considerations.
docs/features/PROMETHEUS_METRICS_REFERENCE.md
- Added a new document providing a complete reference for all exposed Prometheus metrics.
- Detailed each metric's type, description, labels, values, source, and PromQL examples.
- Included a summary table and information on the job log format.
docs/features/PROMETHEUS_QUICKSTART.md
- Added a new quick start guide for Prometheus monitoring.
- Provided a 5-minute, 5-step setup process covering runner deployment, metrics verification, Prometheus scrape configuration, Grafana dashboard import, and multi-runner setup.
docs/features/PROMETHEUS_SETUP.md
- Added a new comprehensive setup guide for Prometheus monitoring.
- Covered prerequisites, deploying runners with metrics, verifying endpoints, configuring Prometheus scrape targets, Grafana datasource setup, dashboard import (manual and provisioning), and end-to-end verification.
docs/features/PROMETHEUS_TROUBLESHOOTING.md
- Added a new troubleshooting guide for common Prometheus monitoring issues.
- Addressed problems like unresponsive endpoints, stale metrics, "No Data" in Grafana, Prometheus targets showing DOWN, job counts not incrementing, high resource usage, and cache metrics always being zero.
monitoring/prometheus-scrape-example.yml
- Added a new example Prometheus scrape configuration file.
- Provided configurations for standard, Chrome, and Chrome-Go runners, including an alternative for Docker network setups.
plan/feature-prometheus-monitoring-1.md
- Updated the "Implementation Phase 5: Documentation & User Guide" section.
- Marked all associated tasks (TASK-047 through TASK-056) as complete with their completion dates.

Activity

The pull request introduces a significant amount of new documentation and configuration related to Prometheus monitoring.
All tasks for Phase 5 of the Prometheus monitoring feature have been completed.
No human activity (comments, reviews) has been recorded on this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a comprehensive set of documentation for the new Prometheus monitoring feature. The new guides for setup, usage, architecture, and troubleshooting are exceptionally detailed and will be a great resource for users. The updates to the README and API documentation also significantly improve discoverability and clarity.

I've left a few comments on the documentation files pointing out a broken link, a couple of potential improvements to the metrics implementation for better maintainability and observability, and a correction needed for a DORA metric query. Overall, this is an excellent contribution that greatly enhances the project's monitoring capabilities.

gemini-code-assist · 2026-03-02T01:55:48Z

docs/features/PROMETHEUS_USAGE.md

+# Mean Time to Recovery (average duration of failed jobs in minutes)
+rate(github_runner_job_duration_seconds_sum{status="failed"}[1h])
+  /
+rate(github_runner_job_duration_seconds_count{status="failed"}[1h])
+  / 60


The PromQL query provided for "Mean Time to Recovery" (MTTR) is incorrect. It attempts to filter the github_runner_job_duration_seconds metric by status="failed", but this metric does not have a status label. The histogram currently aggregates durations for all completed jobs, regardless of their success or failure.

To fix this, the github_runner_job_duration_seconds histogram metric needs to be updated to include a status label (e.g., status="success" or status="failed"). This would require changes to docker/metrics-collector.sh to generate separate histograms for each status. Without this change, the MTTR query as written will return no data.

gemini-code-assist · 2026-03-02T01:55:48Z

docs/README.md

+- [Metrics Reference](features/PROMETHEUS_METRICS_REFERENCE.md) - Complete metric definitions
+- [Architecture](features/PROMETHEUS_ARCHITECTURE.md) - System design and data flow
+- [Troubleshooting](features/PROMETHEUS_TROUBLESHOOTING.md) - Common issues and fixes
+- [Grafana Dashboard Metrics](features/GRAFANA_DASHBOARD_METRICS.md) - Dashboard feature specification


This line links to features/GRAFANA_DASHBOARD_METRICS.md, but this file does not appear to be included in the pull request. This will result in a broken link for users. Please either add the missing file or remove this link.

gemini-code-assist · 2026-03-02T01:55:48Z

docs/features/PROMETHEUS_ARCHITECTURE.md

+| `UPDATE_INTERVAL` | `30` | Seconds between updates |
+| `RUNNER_NAME` | `unknown` | Runner name label |
+| `RUNNER_TYPE` | `standard` | Runner type label |
+| `RUNNER_VERSION` | `2.332.0` | Runner version label |


The RUNNER_VERSION is documented here with a hardcoded default value of 2.332.0. This creates a maintenance issue, as the version will need to be manually updated in the script whenever the runner agent is upgraded.

To make this more robust, I recommend determining the version dynamically within the metrics-collector.sh script by querying the runner binary itself. For example:

# In metrics-collector.sh RUNNER_VERSION=$("/actions-runner/bin/Runner.Listener" --version 2>/dev/null || echo "unknown")

This would ensure the github_runner_info metric always reports the correct version automatically.

gemini-code-assist · 2026-03-02T01:55:48Z

docs/features/PROMETHEUS_METRICS_REFERENCE.md

+|---|---|
+| **Type** | Gauge |
+| **Description** | Unix timestamp of the last metrics update |
+| **Labels** | None |


The github_runner_last_update_timestamp metric is documented as having no labels. In a multi-runner environment, this makes it impossible to determine which specific runner's metrics have gone stale. All other metrics include runner_name and runner_type labels for this reason.

Please add these labels to github_runner_last_update_timestamp to improve its utility for alerting and diagnostics. The implementation in docker/metrics-collector.sh would need to be updated accordingly, for example:
github_runner_last_update_timestamp{runner_name="$RUNNER_NAME",runner_type="$RUNNER_TYPE"} $(date +%s)

Suggested change

| **Labels** | None |

| **Labels** | `runner_name`, `runner_type` |

Create 4 new wiki pages mirroring Phase 5 Prometheus documentation: - Monitoring-Setup.md: Quick start, port mapping, multi-runner config - Metrics-Reference.md: All 8 metrics with types, labels, and PromQL - Grafana-Dashboards.md: 4 dashboards, import/provisioning instructions - Monitoring-Troubleshooting.md: Symptom-based troubleshooting guide Update 5 existing wiki pages: - Home.md: Add Monitoring & Observability section to Table of Contents - Production-Deployment.md: Fix METRICS_PORT 9090→9091, scrape target runner:8080→runner:9091, add monitoring guide cross-link - Quick-Start.md: Restore monitoring link in What's Next section - Chrome-Runner.md: Add Prometheus metrics port 9092 info and links - Docker-Configuration.md: Add monitoring setup link below architecture

gemini-code-assist bot reviewed Mar 2, 2026

View reviewed changes

GrammaTonic added 2 commits March 2, 2026 02:55

style: fix markdownlint MD031 blank lines around fenced code block

feacc58

GrammaTonic merged commit 58be548 into develop Mar 2, 2026
22 checks passed

GrammaTonic deleted the feature/prometheus-documentation-phase5 branch March 2, 2026 02:11

This was referenced Mar 2, 2026

[Feature] Phase 5: Documentation & User Guide #1063

Closed

chore: promote develop to main — release v2.6.0 #1143

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: add Prometheus monitoring documentation (Phase 5)#1139

docs: add Prometheus monitoring documentation (Phase 5)#1139
GrammaTonic merged 3 commits intodevelopfrom
feature/prometheus-documentation-phase5

GrammaTonic commented Mar 2, 2026

Uh oh!

gemini-code-assist bot commented Mar 2, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 2, 2026

Uh oh!

gemini-code-assist bot Mar 2, 2026

Uh oh!

gemini-code-assist bot Mar 2, 2026

Uh oh!

gemini-code-assist bot Mar 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	\| Labels \| None \|
	\| Labels \| `runner_name`, `runner_type` \|

Uh oh!

Conversation

GrammaTonic commented Mar 2, 2026

Summary

Type of Change

Changes Made

New Files (7)

Updated Files (5)

Testing

Related Issues

Uh oh!

gemini-code-assist bot commented Mar 2, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant