feat(monitoring): split mega-dashboard into 4 standalone Grafana dashboards #1137
…boards

Replace the combined github-runner.json (24 panels, 4 rows) with 4 focused, standalone dashboards per the Phase 4 spec (Issue #1062):

- runner-overview.json: Runner status, health, uptime, Quick Links
- dora-metrics.json: Unchanged - DF, Lead Time, CFR, MTTR, trends
- performance-trends.json: NEW - Cache, CPU/Memory, build percentiles
- job-analysis.json: Unchanged - Summary, histogram, runner comparison

Also adds Grafana dashboard provisioning config, inter-dashboard navigation links, and updated plan tracking and docs.

Resolves #1062
Summary of Changes

This pull request refactors the Grafana monitoring setup by breaking down a large, combined dashboard into several smaller, more specialized ones. This change improves clarity and usability by providing focused views for different aspects of GitHub Actions runner monitoring, such as overall status, performance trends, and DORA metrics. The introduction of automatic provisioning simplifies deployment and management of these dashboards.
Code Review
This pull request is a great improvement, refactoring a single large Grafana dashboard into four focused, standalone dashboards: Runner Overview, Performance Trends, DORA Metrics, and Job Analysis. This modular approach enhances usability and maintainability. The addition of a provisioning configuration file for auto-loading dashboards is also a welcome feature. My review focuses on the new dashboard definitions. I've identified a couple of areas for improvement in the performance-trends.json dashboard to make it more user-friendly by clarifying the nature of hardcoded memory thresholds.
{
  "id": 4,
  "title": "Memory Usage",
  "description": "Current average memory usage across runner containers (requires cAdvisor)",
The memory usage thresholds in this panel are hardcoded to specific values (2GiB, 4GiB, 6GiB). While these might be reasonable defaults for some environments, they are not universally applicable and may cause confusion. To improve clarity and maintainability, I suggest updating the panel description to mention these default values and advise users to adjust them according to their specific runner configurations.
- "description": "Current average memory usage across runner containers (requires cAdvisor)",
+ "description": "Current average memory usage across runner containers (requires cAdvisor). Thresholds are example values (2GiB, 4GiB, 6GiB) and should be adjusted to match your environment.",
{
  "id": 22,
  "title": "Memory Usage Over Time",
  "description": "Memory usage per runner container over time (requires cAdvisor)",
Similar to the "Memory Usage" stat panel, this timeseries panel contains hardcoded memory thresholds (2GiB, 6GiB) that may not be suitable for all users. To make the dashboard more user-friendly and easier to configure, please update the panel's description to clarify that these are example thresholds and should be customized.
- "description": "Memory usage per runner container over time (requires cAdvisor)",
+ "description": "Memory usage per runner container over time (requires cAdvisor). Thresholds are example values (e.g., 2GiB, 6GiB) and should be adjusted for your environment.",
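For anyone adjusting these thresholds, a minimal sketch of how the GiB limits could be turned into threshold steps, assuming the panels use a byte-based unit and the usual `thresholds` layout of the Grafana dashboard JSON model. The function name and the colors are illustrative and not taken from this PR.

```python
import json

GIB = 1024 ** 3  # bytes per GiB


def threshold_steps(limits_gib, colors=("green", "yellow", "orange", "red")):
    """Build a Grafana-style thresholds object from limits given in GiB.

    Hypothetical helper: the colors are assumptions, and values are emitted
    in bytes on the assumption that the panel unit is byte-based.
    """
    if len(limits_gib) + 1 > len(colors):
        raise ValueError("need one more color than there are limits")
    # The base step has a null value and applies below the first limit.
    steps = [{"color": colors[0], "value": None}]
    for color, limit in zip(colors[1:], sorted(limits_gib)):
        steps.append({"color": color, "value": int(limit * GIB)})
    return {"mode": "absolute", "steps": steps}


if __name__ == "__main__":
    # Example values from the review comment: 2GiB, 4GiB, 6GiB.
    print(json.dumps(threshold_steps([2, 4, 6]), indent=2))
```

Runners with a different memory limit would pass their own values, for example `threshold_steps([1, 2])` for a small container.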
Summary
Split the combined github-runner.json mega-dashboard (24 panels across 4 rows) into 4 focused, standalone Grafana dashboards, aligning with the Phase 4 specification in Issue #1062.

Type of Change
Related Issues
Changes Made
Files Created
- monitoring/grafana/dashboards/runner-overview.json - Runner Overview dashboard (12 panels, 3 rows)
- monitoring/grafana/dashboards/performance-trends.json - Performance Trends dashboard (14 panels, 4 rows)
- monitoring/grafana/provisioning/dashboards/dashboards.yml - Grafana dashboard auto-loading provisioning config

Files Deleted
- monitoring/grafana/dashboards/github-runner.json - replaced by runner-overview.json

Files Modified
- plan/feature-prometheus-monitoring-1.md - updated Phase 4 task descriptions and dates
- docs/features/GRAFANA_DASHBOARD_METRICS.md - updated dashboard file listing to 4-dashboard table

Files Unchanged
- monitoring/grafana/dashboards/dora-metrics.json - already standalone
- monitoring/grafana/dashboards/job-analysis.json - already standalone

Key Changes
Runner Overview (runner-overview.json): Slimmed to runner status, health, uptime panels. Added Runner Health timeseries row and Quick Links navigation panel.
Performance Trends (performance-trends.json): New standalone dashboard expanded from 3 to 14 panels: summary stats, cache hit rate by type/runner, CPU/memory over time (cAdvisor), job duration percentiles (p50/p90/p95/p99), queue time trend, avg duration by runner type.
Dashboard Provisioning: Added dashboards.yml provisioner config for zero-config auto-loading.
Inter-dashboard Navigation: Runner Overview includes links panel to DORA Metrics, Job Analysis, and Performance Trends.
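As an illustration of the navigation item above, here is a rough sketch of how a Quick Links panel could be generated as a Grafana text panel in markdown mode. The dashboard UIDs, the panel id, and the function name are hypothetical; the actual panel in runner-overview.json may be built differently.

```python
import json

# Hypothetical UIDs; the real values live in each dashboard's "uid" field.
LINKS = {
    "DORA Metrics": "dora-metrics",
    "Job Analysis": "job-analysis",
    "Performance Trends": "performance-trends",
}


def quick_links_panel(links, panel_id=100):
    """Return a text panel whose markdown body links to other dashboards."""
    # Grafana serves dashboards under /d/<uid>, so relative links are enough.
    lines = [f"- [{title}](/d/{uid})" for title, uid in links.items()]
    return {
        "id": panel_id,  # assumed to be unused elsewhere in the dashboard
        "type": "text",
        "title": "Quick Links",
        "options": {"mode": "markdown", "content": "\n".join(lines)},
    }


if __name__ == "__main__":
    print(json.dumps(quick_links_panel(LINKS), indent=2))
```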
Dashboard Inventory
All dashboards: schemaVersion 39, Grafana >= 9.0.0, DS_PROMETHEUS datasource, runner_name/runner_type multi-select variables.
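The shared properties listed above could be checked with a small script. The sketch below is one possible approach, not part of this PR: it assumes the variables are stored under `templating.list` with a `multi` flag, as in the standard dashboard JSON model, and it looks for DS_PROMETHEUS by plain string search because the exact datasource wiring is not shown here. The Grafana version requirement is not checked.

```python
import json
from pathlib import Path

REQUIRED_VARIABLES = ("runner_name", "runner_type")
EXPECTED_SCHEMA_VERSION = 39


def check_dashboard(dashboard):
    """Return a list of human-readable problems for one dashboard dict."""
    problems = []
    version = dashboard.get("schemaVersion")
    if version != EXPECTED_SCHEMA_VERSION:
        problems.append(f"schemaVersion is {version}, expected {EXPECTED_SCHEMA_VERSION}")
    # Index template variables by name to check presence and multi-select.
    variables = {v.get("name"): v for v in dashboard.get("templating", {}).get("list", [])}
    for name in REQUIRED_VARIABLES:
        variable = variables.get(name)
        if variable is None:
            problems.append(f"missing variable {name}")
        elif not variable.get("multi"):
            problems.append(f"variable {name} is not multi-select")
    # Loose check: the datasource name only has to appear somewhere in the JSON.
    if "DS_PROMETHEUS" not in json.dumps(dashboard):
        problems.append("no reference to DS_PROMETHEUS")
    return problems


if __name__ == "__main__":
    # Directory taken from the file list in this PR description.
    for path in sorted(Path("monitoring/grafana/dashboards").glob("*.json")):
        issues = check_dashboard(json.loads(path.read_text()))
        print(path.name, "OK" if not issues else issues)
```

Run from the repository root, this prints one line per dashboard file, which makes drift between the four dashboards easier to spot during review.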
Testing
Checklist
/cc @copilot