-
Notifications
You must be signed in to change notification settings - Fork 43
[do no merge] Add CodeRabbit configuration for runbook reviews #373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,209 @@ | ||
| language: en-US | ||
|
|
||
| reviews: | ||
| profile: assertive | ||
| high_level_summary: false | ||
| high_level_summary_in_walkthrough: false | ||
| poem: false | ||
| review_status: false | ||
| collapse_walkthrough: false | ||
| changed_files_summary: false | ||
| sequence_diagrams: false | ||
| assess_linked_issues: false | ||
| related_issues: false | ||
| related_prs: false | ||
| suggested_labels: false | ||
| suggested_reviewers: false | ||
| enable_prompt_for_ai_agents: false | ||
| auto_review: | ||
| enabled: true | ||
| drafts: false | ||
| finishing_touches: | ||
| docstrings: | ||
| enabled: false | ||
| unit_tests: | ||
| enabled: false | ||
| simplify: | ||
| enabled: false | ||
| pre_merge_checks: | ||
| title: | ||
| mode: "off" | ||
| description: | ||
| mode: "off" | ||
| issue_assessment: | ||
| mode: "off" | ||
| docstrings: | ||
| mode: "off" | ||
|
|
||
| path_instructions: | ||
| - path: "docs/runbooks/**/*.md" | ||
| instructions: | | ||
| These are KubeVirt alert runbooks. Each runbook documents a Prometheus | ||
| alert. The alert definitions live in OTHER repositories in the kubevirt | ||
| GitHub organization (linked repositories), NOT in this repo. | ||
|
|
||
| YOUR PRIMARY JOB is to use the linked repositories to find the actual | ||
| alert source code and deeply verify the runbook against it. For each | ||
| runbook file, do the following: | ||
|
|
||
| ## 1. Find the alert definition in the linked repos | ||
|
|
||
| Search ALL linked kubevirt org repositories for the **exact alert name** | ||
| (the H1 heading / filename, e.g., "VirtControllerDown"). Do NOT rely | ||
| on name prefixes to guess the repo — alert names do not always contain | ||
| the operator name. Search every linked repo until you find it. The | ||
| alert is defined as a PrometheusRule or in Go code using the | ||
| operator-observability framework. | ||
|
|
||
| IMPORTANT: The alert may not be merged yet — it may only exist in an | ||
| open pull request in the source repository. The runbook PR description | ||
| should contain a link to the alert PR. If the alert is not found on | ||
| the main branch, search open PRs in the linked repositories for the | ||
| alert name. If no alert definition can be found anywhere (neither | ||
| merged nor in an open PR), flag this clearly in the review. | ||
|
|
||
| Once found, identify: | ||
| - The **PromQL expression** (`expr` field) | ||
| - The **`for` duration** (how long the condition must hold before firing) | ||
| - The **severity** label | ||
| - The **operator_health_impact** label | ||
| - Any other labels (kubernetes_operator_part_of, etc.) | ||
|
|
||
| ## 2. Trace the PromQL expression | ||
|
|
||
| Break down the PromQL expression: | ||
| - If it references **recording rules**, find those recording rule | ||
| definitions in the same repo and trace what they compute. | ||
| - If it references **metrics**, find where those metrics are defined and | ||
| incremented/set in the Go source code. Understand what each metric | ||
| measures (e.g., counter of API errors, gauge of running pods). | ||
| - Understand the **complete chain**: metric → recording rule → alert | ||
| expression → when it fires. | ||
|
|
||
| ## 3. Print alert context at the top of every review comment | ||
|
|
||
| For EVERY runbook file reviewed, you MUST start your comment with | ||
| an alert context block. This is MANDATORY — never skip it. It must | ||
| appear at the very top of the comment before any review feedback. | ||
|
|
||
| Use this EXACT format (all fields required, do not omit any): | ||
|
|
||
| --- | ||
| **Alert source code:** `<org>/<repo>` — `<file path>` (line <N>) | ||
| **Alert expr:** `<paste the FULL PromQL expression here>` | ||
| **Alert `for`:** `<duration>` | ||
| **Alert labels:** `severity=<val>`, `operator_health_impact=<val>`, | ||
| `kubernetes_operator_part_of=<val>`, `kubernetes_operator_component=<val>` | ||
| **Recording rules:** | ||
| - `<rule name>`: `<rule expr>` — `<repo>/<file path>` | ||
| **Metrics:** | ||
| - `<metric name>` (<type>): `<repo>/<file path>` (line <N>) | ||
| --- | ||
|
|
||
| The **Alert expr** field is critical — always include the full PromQL | ||
| expression so the reviewer can verify correctness at a glance. | ||
|
|
||
| If the alert was found in an open PR (not merged), add: | ||
| **Note:** Alert found in open PR: <PR URL> | ||
|
|
||
| If the alert could not be found, state: | ||
| **Alert source code:** NOT FOUND — cannot verify this runbook | ||
|
|
||
| ## 4. Verify the runbook against the source | ||
|
|
||
| With full understanding of the alert, verify each section: | ||
|
|
||
| **Meaning section:** | ||
| - Does it accurately describe what the PromQL expression evaluates? | ||
| - Is the firing condition correct (threshold, duration, component)? | ||
| - If the expression checks `virt-controller` pods, the runbook must not | ||
| say `virt-api` (or vice versa). Flag any component mismatches. | ||
| - Is the `for` duration mentioned and correct? | ||
|
|
||
| **Impact section:** | ||
| - Based on the code where the metrics are set, what actually breaks when | ||
| this alert fires? Does the runbook accurately describe the impact? | ||
| - Are there downstream effects the runbook misses? | ||
|
|
||
| **Diagnosis section:** | ||
| - Do the `kubectl`/`oc` commands target the correct resources, labels, | ||
| and namespaces for the component the alert actually monitors? | ||
| - Are the label selectors correct? (e.g., `kubevirt.io=virt-controller` | ||
| vs `kubevirt.io=virt-api`) | ||
| - Do the commands help the user verify the condition the PromQL | ||
| expression checks? | ||
| - Are the commands syntactically valid? | ||
|
|
||
| **Mitigation section:** | ||
| - Based on the root causes visible in the source code, are the suggested | ||
| mitigations relevant and actionable? | ||
| - Are there obvious mitigations missing? | ||
|
|
||
| ## 5. Formatting rules | ||
|
|
||
| - Every runbook MUST have H2 sections in order: `## Meaning`, | ||
| `## Impact`, `## Diagnosis`, `## Mitigation`. | ||
| - H1 heading must match the filename (e.g., `VirtAPIDown.md` → | ||
| `# VirtAPIDown`). | ||
| - Fenced code blocks: use ```` ```bash ```` not ```` ``` bash ````. | ||
| No space before the language tag. | ||
| - Line length: 80 chars for prose (code blocks exempt). | ||
| - `<!--DS: ... -->` and `<!--USstart-->...<!--USend-->` comment blocks | ||
| are intentional downstream/upstream markers. Do not flag them. | ||
|
|
||
| tools: | ||
| markdownlint: | ||
| enabled: true | ||
| languagetool: | ||
| enabled: true | ||
| level: picky | ||
| shellcheck: | ||
| enabled: true | ||
|
|
||
| knowledge_base: | ||
| learnings: | ||
| scope: global | ||
| pull_requests: | ||
| scope: global | ||
| linked_repositories: | ||
| - repository: kubevirt/kubevirt | ||
| instructions: > | ||
| Core KubeVirt. Contains alert definitions, recording rules, and | ||
| metric definitions for virt-api, virt-controller, virt-handler, | ||
| virt-launcher, virt-operator. Search this repo for any alert name. | ||
| - repository: kubevirt/containerized-data-importer | ||
| instructions: > | ||
| Containerized Data Importer (CDI). Contains alert definitions, | ||
| recording rules, and metric definitions. Search this repo for | ||
| any alert name. | ||
| - repository: kubevirt/hyperconverged-cluster-operator | ||
| instructions: > | ||
| HyperConverged Cluster Operator (HCO). Contains alert definitions, | ||
| recording rules, and metric definitions. Search this repo for | ||
| any alert name. | ||
| - repository: kubevirt/ssp-operator | ||
| instructions: > | ||
| Scheduling, Scale, and Performance operator (SSP). Contains alert | ||
| definitions, recording rules, and metric definitions. Search this | ||
| repo for any alert name. | ||
| - repository: kubevirt/cluster-network-addons-operator | ||
| instructions: > | ||
| Cluster Network Addons Operator (CNAO). Contains alert definitions | ||
| and metric definitions. Search this repo for any alert name. | ||
| - repository: kubevirt/hostpath-provisioner-operator | ||
| instructions: > | ||
| HostPath Provisioner Operator (HPP). Contains alert definitions | ||
| and metric definitions. Search this repo for any alert name. | ||
| - repository: k8snetworkplumbingwg/kubemacpool | ||
| instructions: > | ||
| Kubemacpool. Contains alert definitions and metric definitions. | ||
| Search this repo for any alert name. | ||
| - repository: kubevirt/application-aware-quota | ||
| instructions: > | ||
| Application Aware Quota (AAQ). Contains alert definitions and | ||
| metric definitions. Search this repo for any alert name. | ||
| - repository: kubevirt/hostpath-provisioner | ||
| instructions: > | ||
| HostPath Provisioner (HPP). Contains metric definitions and | ||
| possibly alert definitions. Search this repo for any alert name. | ||
| Note: this is the provisioner itself, not the operator. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,115 @@ | ||
| # Runbook Review Guidelines | ||
|
|
||
| These guidelines define how to review KubeVirt alert runbooks in | ||
| `docs/runbooks/`. The key principle: **every claim in a runbook must be | ||
| verified against the actual alert source code** in the kubevirt | ||
| organization repositories. | ||
|
|
||
| ## Alert Source Code Location | ||
|
|
||
| Alert definitions do NOT live in this repo. They are defined in operator | ||
| repositories across the kubevirt GitHub organization. Alert names do NOT | ||
| always contain the operator name as a prefix — you cannot guess the repo | ||
| from the alert name. Search ALL of these repos for the exact alert name: | ||
|
|
||
| - `kubevirt/kubevirt` | ||
| - `kubevirt/containerized-data-importer` | ||
| - `kubevirt/hyperconverged-cluster-operator` | ||
| - `kubevirt/ssp-operator` | ||
| - `kubevirt/cluster-network-addons-operator` | ||
| - `kubevirt/hostpath-provisioner-operator` | ||
| - `kubevirt/hostpath-provisioner` | ||
| - `k8snetworkplumbingwg/kubemacpool` | ||
| - `kubevirt/application-aware-quota` | ||
|
|
||
| ## Alert May Only Exist in an Open PR | ||
|
|
||
| Often the alert is not yet merged in the source repository — it only | ||
| exists in an open pull request. This is the normal workflow: the alert | ||
| PR and the runbook PR are created in parallel. | ||
|
|
||
| When reviewing a runbook: | ||
| 1. First search the main branch of the source repo for the alert name | ||
| 2. If not found, search **open pull requests** in the source repo | ||
| 3. The runbook PR description should link to the corresponding alert | ||
| PR — follow that link to find the alert definition | ||
| 4. If no alert definition can be found anywhere (neither merged nor in | ||
| an open PR), flag this clearly: the runbook cannot be verified | ||
|
|
||
| ## What to Verify | ||
|
|
||
| ### 1. Find the alert definition | ||
|
|
||
| For each runbook, locate the alert by its name in the source repo | ||
| (main branch or open PR). An alert definition includes: | ||
| - **`expr`**: The PromQL expression that triggers the alert | ||
| - **`for`**: How long the condition must hold before the alert fires | ||
| - **`severity`**: critical, warning, or info | ||
| - **`operator_health_impact`**: critical, warning, or none | ||
| - **`summary`** and **`description`** annotations | ||
|
|
||
| ### 2. Trace the PromQL expression end-to-end | ||
|
|
||
| Break down the `expr` to understand what it actually evaluates: | ||
|
|
||
| - **Metrics**: Find where each metric in the expression is defined in | ||
| the Go source code. Understand what it measures — is it a counter, | ||
| gauge, histogram? What events cause it to increment or change? | ||
| - **Recording rules**: If the expression references recording rules | ||
| (metrics that don't exist as raw instrumentation), find the recording | ||
| rule definition and trace what it computes. | ||
| - **Thresholds and conditions**: What numeric thresholds or boolean | ||
| conditions trigger the alert? Over what time window? | ||
|
|
||
| The goal is to fully understand: what system state causes this alert | ||
| to fire? | ||
|
|
||
| ### 3. Verify each runbook section | ||
|
|
||
| #### `## Meaning` | ||
|
|
||
| - Must accurately describe what the PromQL expression evaluates | ||
| - Must state the correct firing condition (threshold, duration) | ||
| - Must reference the correct component — if the expression queries | ||
| `virt-controller` pods, the runbook must not say `virt-api` | ||
| - The `for` duration should match the alert definition | ||
|
|
||
| #### `## Impact` | ||
|
|
||
| - Based on where the metrics are set in source code, what functionality | ||
| is actually affected when this condition is true? | ||
| - Does the runbook capture the real user-facing impact, or is it | ||
| vague/generic? | ||
|
|
||
| #### `## Diagnosis` | ||
|
|
||
| - `kubectl`/`oc` commands must target the correct resources and labels | ||
| for the component the alert monitors | ||
| - Label selectors must match what the source code uses (e.g., | ||
| `kubevirt.io=virt-controller`) | ||
| - Commands should help verify the specific condition the PromQL | ||
| expression checks | ||
| - Commands must be syntactically valid | ||
| - Use `$NAMESPACE` variable pattern: | ||
| ```bash | ||
| $ export NAMESPACE="$(kubectl get kubevirt -A \ | ||
| -o custom-columns="":.metadata.namespace)" | ||
| ``` | ||
|
|
||
| #### `## Mitigation` | ||
|
|
||
| - Remediation steps should address the root causes that the source | ||
| code reveals | ||
| - Should be concrete and actionable, not just "investigate the issue" | ||
|
|
||
| ## Formatting Rules | ||
|
|
||
| - One runbook per file: `<AlertName>.md` | ||
| - H1 heading must match filename: `VirtAPIDown.md` → `# VirtAPIDown` | ||
| - Required H2 sections in order: `## Meaning`, `## Impact`, | ||
| `## Diagnosis`, `## Mitigation` | ||
| - Code fences: ```` ```bash ```` (no space before language tag) | ||
| - Line length: 80 characters for prose (code blocks exempt) | ||
| - Prefix example commands with `$ ` | ||
| - `<!--DS: ... -->` and `<!--USstart-->...<!--USend-->` are intentional | ||
| downstream/upstream content markers — do not flag them |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,11 +2,13 @@ | |
|
|
||
| ## Meaning | ||
|
|
||
| No running `virt-api` pod has been detected for 10 minutes. | ||
| The `virt-api` deployment has fewer than the expected number of | ||
| replicas available for 5 minutes. | ||
|
Comment on lines
+5
to
+6
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Runbook now documents a different alert than Alert source code:
Line 5-6 and Line 10-11 describe a warning-level “fewer replicas for 5 minutes” condition, but this alert is a critical “no running virt-api pods for 10 minutes” condition. Suggested doc fix-The `virt-api` deployment has fewer than the expected number of
-replicas available for 5 minutes.
+This alert fires when no `virt-api` pods are running for 10 minutes.
-This is a warning level alert. KubeVirt objects may experience degraded
-API performance.
+This is a critical alert. KubeVirt API operations can become unavailable.
- $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-controller
+ $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-apiAs per coding guidelines, “Must accurately describe what the PromQL expression evaluates” and “Do the Also applies to: 10-11, 24-24 |
||
|
|
||
| ## Impact | ||
|
|
||
| KubeVirt objects cannot send API calls. | ||
| This is a warning level alert. KubeVirt objects may experience degraded | ||
| API performance. | ||
|
|
||
| ## Diagnosis | ||
|
|
||
|
|
@@ -19,7 +21,7 @@ KubeVirt objects cannot send API calls. | |
| 2. Check the status of the `virt-api` pods: | ||
|
|
||
| ```bash | ||
| $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-api | ||
| $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-controller | ||
| ``` | ||
|
|
||
| 3. Check the status of the `virt-api` deployment: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LowReadyVirtControllersCountrunbook semantics drift from the actual alert.Alert source code:
kubevirt/kubevirt—pkg/monitoring/rules/alerts/virt-controller.go(line 490)Alert expr:
cluster:kubevirt_virt_controller_ready:sum < cluster:kubevirt_virt_controller_pods_running:count and cluster:kubevirt_virt_controller_ready:sum > 0Alert
for:10mAlert labels:
severity=warning,operator_health_impact=warning,kubernetes_operator_part_of=kubevirt,kubernetes_operator_component=kubevirtRecording rules:
cluster:kubevirt_virt_controller_ready:sum:sum(kube_pod_status_ready{pod=~'virt-controller-.*', namespace='%s', condition='true'} * on(pod, namespace) kubevirt_virt_controller_ready_status{namespace='%s'}) or vector(0)—kubevirt/kubevirt/pkg/monitoring/rules/recordingrules/virt.gocluster:kubevirt_virt_controller_pods_running:count:count(kube_pod_status_phase{pod=~'virt-controller-.*', namespace='%s', phase='Running'} == 1) or vector(0)—kubevirt/kubevirt/pkg/monitoring/rules/recordingrules/virt.goMetrics:
kubevirt_virt_controller_ready_status(gauge):kubevirt/kubevirt/pkg/monitoring/metrics/virt-controller/component_metrics.go(line 440)Line 5-6 says “zero available replicas for 5 minutes,” but this alert is “ready < running and ready > 0 for 10 minutes.”
Line 30 checks
virt-handler; the diagnosis must targetvirt-controllerreadiness/running state.Suggested doc fix
As per coding guidelines, “Must state the correct firing condition (threshold, duration)” and “
kubectl/occommands must target the correct resources and labels for the component the alert monitors”.Also applies to: 30-30