From 90f19fc31cc06aa8e859a461e5a4e5b9f10d94b7 Mon Sep 17 00:00:00 2001
From: avlitman <alitman@redhat.com>
Date: Sun, 3 May 2026 19:08:08 +0300
Subject: [PATCH] Add CodeRabbit configuration for runbook reviews

Configure CodeRabbit to cross-reference runbooks against alert source
code in kubevirt org repositories. Includes path-specific review
instructions and runbook review guidelines.

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 .coderabbit.yaml                              | 209 ++++++++++++++++++
 docs/review-guidelines/runbooks.md            | 115 ++++++++++
 docs/runbooks/LowReadyVirtControllersCount.md |  14 +-
 docs/runbooks/VirtAPIDown.md                  |   8 +-
 4 files changed, 336 insertions(+), 10 deletions(-)
 create mode 100644 .coderabbit.yaml
 create mode 100644 docs/review-guidelines/runbooks.md

diff --git a/.coderabbit.yaml b/.coderabbit.yaml
new file mode 100644
index 00000000..f25b1878
--- /dev/null
+++ b/.coderabbit.yaml
@@ -0,0 +1,209 @@
+language: en-US
+
+reviews:
+  profile: assertive
+  high_level_summary: false
+  high_level_summary_in_walkthrough: false
+  poem: false
+  review_status: false
+  collapse_walkthrough: false
+  changed_files_summary: false
+  sequence_diagrams: false
+  assess_linked_issues: false
+  related_issues: false
+  related_prs: false
+  suggested_labels: false
+  suggested_reviewers: false
+  enable_prompt_for_ai_agents: false
+  auto_review:
+    enabled: true
+    drafts: false
+  finishing_touches:
+    docstrings:
+      enabled: false
+    unit_tests:
+      enabled: false
+    simplify:
+      enabled: false
+  pre_merge_checks:
+    title:
+      mode: "off"
+    description:
+      mode: "off"
+    issue_assessment:
+      mode: "off"
+    docstrings:
+      mode: "off"
+
+  path_instructions:
+    - path: "docs/runbooks/**/*.md"
+      instructions: |
+        These are KubeVirt alert runbooks. Each runbook documents a Prometheus
+        alert. The alert definitions live in OTHER repositories in the kubevirt
+        GitHub organization (linked repositories), NOT in this repo.
+
+        YOUR PRIMARY JOB is to use the linked repositories to find the actual
+        alert source code and deeply verify the runbook against it. For each
+        runbook file, do the following:
+
+        ## 1. Find the alert definition in the linked repos
+
+        Search ALL linked kubevirt org repositories for the **exact alert name**
+        (the H1 heading / filename, e.g., "VirtControllerDown"). Do NOT rely
+        on name prefixes to guess the repo — alert names do not always contain
+        the operator name. Search every linked repo until you find it. The
+        alert is defined as a PrometheusRule or in Go code using the
+        operator-observability framework.
+
+        IMPORTANT: The alert may not be merged yet — it may only exist in an
+        open pull request in the source repository. The runbook PR description
+        should contain a link to the alert PR. If the alert is not found on
+        the main branch, search open PRs in the linked repositories for the
+        alert name. If no alert definition can be found anywhere (neither
+        merged nor in an open PR), flag this clearly in the review.
+
+        Once found, identify:
+        - The **PromQL expression** (`expr` field)
+        - The **`for` duration** (how long the condition must hold before firing)
+        - The **severity** label
+        - The **operator_health_impact** label
+        - Any other labels (kubernetes_operator_part_of, etc.)
+
+        ## 2. Trace the PromQL expression
+
+        Break down the PromQL expression:
+        - If it references **recording rules**, find those recording rule
+          definitions in the same repo and trace what they compute.
+        - If it references **metrics**, find where those metrics are defined and
+          incremented/set in the Go source code. Understand what each metric
+          measures (e.g., counter of API errors, gauge of running pods).
+        - Understand the **complete chain**: metric → recording rule → alert
+          expression → when it fires.
+
+        ## 3. Print alert context at the top of every review comment
+
+        For EVERY runbook file reviewed, you MUST start your comment with
+        an alert context block. This is MANDATORY — never skip it. It must
+        appear at the very top of the comment before any review feedback.
+
+        Use this EXACT format (all fields required, do not omit any):
+
+        ---
+        **Alert source code:** `<org>/<repo>` — `<file path>` (line <N>)
+        **Alert expr:** `<paste the FULL PromQL expression here>`
+        **Alert `for`:** `<duration>`
+        **Alert labels:** `severity=<val>`, `operator_health_impact=<val>`,
+        `kubernetes_operator_part_of=<val>`, `kubernetes_operator_component=<val>`
+        **Recording rules:**
+        - `<rule name>`: `<rule expr>` — `<repo>/<file path>`
+        **Metrics:**
+        - `<metric name>` (<type>): `<repo>/<file path>` (line <N>)
+        ---
+
+        The **Alert expr** field is critical — always include the full PromQL
+        expression so the reviewer can verify correctness at a glance.
+
+        If the alert was found in an open PR (not merged), add:
+        **Note:** Alert found in open PR: <PR URL>
+
+        If the alert could not be found, state:
+        **Alert source code:** NOT FOUND — cannot verify this runbook
+
+        ## 4. Verify the runbook against the source
+
+        With full understanding of the alert, verify each section:
+
+        **Meaning section:**
+        - Does it accurately describe what the PromQL expression evaluates?
+        - Is the firing condition correct (threshold, duration, component)?
+        - If the expression checks `virt-controller` pods, the runbook must not
+          say `virt-api` (or vice versa). Flag any component mismatches.
+        - Is the `for` duration mentioned and correct?
+
+        **Impact section:**
+        - Based on the code where the metrics are set, what actually breaks when
+          this alert fires? Does the runbook accurately describe the impact?
+        - Are there downstream effects the runbook misses?
+
+        **Diagnosis section:**
+        - Do the `kubectl`/`oc` commands target the correct resources, labels,
+          and namespaces for the component the alert actually monitors?
+        - Are the label selectors correct? (e.g., `kubevirt.io=virt-controller`
+          vs `kubevirt.io=virt-api`)
+        - Do the commands help the user verify the condition the PromQL
+          expression checks?
+        - Are the commands syntactically valid?
+
+        **Mitigation section:**
+        - Based on the root causes visible in the source code, are the suggested
+          mitigations relevant and actionable?
+        - Are there obvious mitigations missing?
+
+        ## 5. Formatting rules
+
+        - Every runbook MUST have H2 sections in order: `## Meaning`,
+          `## Impact`, `## Diagnosis`, `## Mitigation`.
+        - H1 heading must match the filename (e.g., `VirtAPIDown.md` →
+          `# VirtAPIDown`).
+        - Fenced code blocks: use ```` ```bash ```` not ```` ``` bash ````.
+          No space before the language tag.
+        - Line length: 80 chars for prose (code blocks exempt).
+        - `<!--DS: ... -->` and `<!--USstart-->...<!--USend-->` comment blocks
+          are intentional downstream/upstream markers. Do not flag them.
+
+  tools:
+    markdownlint:
+      enabled: true
+    languagetool:
+      enabled: true
+      level: picky
+    shellcheck:
+      enabled: true
+
+knowledge_base:
+  learnings:
+    scope: global
+  pull_requests:
+    scope: global
+  linked_repositories:
+    - repository: kubevirt/kubevirt
+      instructions: >
+        Core KubeVirt. Contains alert definitions, recording rules, and
+        metric definitions for virt-api, virt-controller, virt-handler,
+        virt-launcher, virt-operator. Search this repo for any alert name.
+    - repository: kubevirt/containerized-data-importer
+      instructions: >
+        Containerized Data Importer (CDI). Contains alert definitions,
+        recording rules, and metric definitions. Search this repo for
+        any alert name.
+    - repository: kubevirt/hyperconverged-cluster-operator
+      instructions: >
+        HyperConverged Cluster Operator (HCO). Contains alert definitions,
+        recording rules, and metric definitions. Search this repo for
+        any alert name.
+    - repository: kubevirt/ssp-operator
+      instructions: >
+        Scheduling, Scale, and Performance operator (SSP). Contains alert
+        definitions, recording rules, and metric definitions. Search this
+        repo for any alert name.
+    - repository: kubevirt/cluster-network-addons-operator
+      instructions: >
+        Cluster Network Addons Operator (CNAO). Contains alert definitions
+        and metric definitions. Search this repo for any alert name.
+    - repository: kubevirt/hostpath-provisioner-operator
+      instructions: >
+        HostPath Provisioner Operator (HPP). Contains alert definitions
+        and metric definitions. Search this repo for any alert name.
+    - repository: k8snetworkplumbingwg/kubemacpool
+      instructions: >
+        Kubemacpool. Contains alert definitions and metric definitions.
+        Search this repo for any alert name.
+    - repository: kubevirt/application-aware-quota
+      instructions: >
+        Application Aware Quota (AAQ). Contains alert definitions and
+        metric definitions. Search this repo for any alert name.
+    - repository: kubevirt/hostpath-provisioner
+      instructions: >
+        HostPath Provisioner (HPP). Contains metric definitions and
+        possibly alert definitions. Search this repo for any alert name.
+        Note: this is the provisioner itself, not the operator.
diff --git a/docs/review-guidelines/runbooks.md b/docs/review-guidelines/runbooks.md
new file mode 100644
index 00000000..614887fb
--- /dev/null
+++ b/docs/review-guidelines/runbooks.md
@@ -0,0 +1,115 @@
+# Runbook Review Guidelines
+
+These guidelines define how to review KubeVirt alert runbooks in
+`docs/runbooks/`. The key principle: **every claim in a runbook must be
+verified against the actual alert source code** in the kubevirt
+organization repositories.
+
+## Alert Source Code Location
+
+Alert definitions do NOT live in this repo. They are defined in operator
+repositories across the kubevirt GitHub organization. Alert names do NOT
+always contain the operator name as a prefix — you cannot guess the repo
+from the alert name. Search ALL of these repos for the exact alert name:
+
+- `kubevirt/kubevirt`
+- `kubevirt/containerized-data-importer`
+- `kubevirt/hyperconverged-cluster-operator`
+- `kubevirt/ssp-operator`
+- `kubevirt/cluster-network-addons-operator`
+- `kubevirt/hostpath-provisioner-operator`
+- `kubevirt/hostpath-provisioner`
+- `k8snetworkplumbingwg/kubemacpool`
+- `kubevirt/application-aware-quota`
+
+## Alert May Only Exist in an Open PR
+
+Often the alert is not yet merged in the source repository — it only
+exists in an open pull request. This is the normal workflow: the alert
+PR and the runbook PR are created in parallel.
+
+When reviewing a runbook:
+1. First search the main branch of the source repo for the alert name
+2. If not found, search **open pull requests** in the source repo
+3. The runbook PR description should link to the corresponding alert
+   PR — follow that link to find the alert definition
+4. If no alert definition can be found anywhere (neither merged nor in
+   an open PR), flag this clearly: the runbook cannot be verified
+
+## What to Verify
+
+### 1. Find the alert definition
+
+For each runbook, locate the alert by its name in the source repo
+(main branch or open PR). An alert definition includes:
+- **`expr`**: The PromQL expression that triggers the alert
+- **`for`**: How long the condition must hold before the alert fires
+- **`severity`**: critical, warning, or info
+- **`operator_health_impact`**: critical, warning, or none
+- **`summary`** and **`description`** annotations
+
+### 2. Trace the PromQL expression end-to-end
+
+Break down the `expr` to understand what it actually evaluates:
+
+- **Metrics**: Find where each metric in the expression is defined in
+  the Go source code. Understand what it measures — is it a counter,
+  gauge, histogram? What events cause it to increment or change?
+- **Recording rules**: If the expression references recording rules
+  (metrics that don't exist as raw instrumentation), find the recording
+  rule definition and trace what it computes.
+- **Thresholds and conditions**: What numeric thresholds or boolean
+  conditions trigger the alert? Over what time window?
+
+The goal is to fully understand: what system state causes this alert
+to fire?
+
+### 3. Verify each runbook section
+
+#### `## Meaning`
+
+- Must accurately describe what the PromQL expression evaluates
+- Must state the correct firing condition (threshold, duration)
+- Must reference the correct component — if the expression queries
+  `virt-controller` pods, the runbook must not say `virt-api`
+- The `for` duration should match the alert definition
+
+#### `## Impact`
+
+- Based on where the metrics are set in source code, what functionality
+  is actually affected when this condition is true?
+- Does the runbook capture the real user-facing impact, or is it
+  vague/generic?
+
+#### `## Diagnosis`
+
+- `kubectl`/`oc` commands must target the correct resources and labels
+  for the component the alert monitors
+- Label selectors must match what the source code uses (e.g.,
+  `kubevirt.io=virt-controller`)
+- Commands should help verify the specific condition the PromQL
+  expression checks
+- Commands must be syntactically valid
+- Use `$NAMESPACE` variable pattern:
+  ```bash
+  $ export NAMESPACE="$(kubectl get kubevirt -A \
+      -o custom-columns="":.metadata.namespace)"
+  ```
+
+#### `## Mitigation`
+
+- Remediation steps should address the root causes that the source
+  code reveals
+- Should be concrete and actionable, not just "investigate the issue"
+
+## Formatting Rules
+
+- One runbook per file: `<AlertName>.md`
+- H1 heading must match filename: `VirtAPIDown.md` → `# VirtAPIDown`
+- Required H2 sections in order: `## Meaning`, `## Impact`,
+  `## Diagnosis`, `## Mitigation`
+- Code fences: ```` ```bash ```` (no space before language tag)
+- Line length: 80 characters for prose (code blocks exempt)
+- Prefix example commands with `$ `
+- `<!--DS: ... -->` and `<!--USstart-->...<!--USend-->` are intentional
+  downstream/upstream content markers — do not flag them
diff --git a/docs/runbooks/LowReadyVirtControllersCount.md b/docs/runbooks/LowReadyVirtControllersCount.md
index 203b378a..f519b19b 100644
--- a/docs/runbooks/LowReadyVirtControllersCount.md
+++ b/docs/runbooks/LowReadyVirtControllersCount.md
@@ -2,13 +2,13 @@
 
 ## Meaning
 
-This alert fires when one or more `virt-controller` pods are running, but not
-all of them have been in a `Ready` state for the last 10 minutes.
+This alert fires when the `virt-controller` deployment has zero available
+replicas for the last 5 minutes.
 
-A `virt-controller` device monitors the custom resource definitions (CRDs) of a
-virtual machine instance (VMI) and manages the associated pods. The device
-create pods for VMIs and manages the lifecycle of the pods. The device is
-critical for cluster-wide virtualization functionality.
+The `virt-controller` monitors the `kubevirt_vmi_memory_used_bytes`
+metric and manages the associated pods. The device creates pods for
+VMIs and manages the lifecycle of the pods. The device is critical
+for cluster-wide virtualization functionality.
 
 ## Impact
 
@@ -27,7 +27,7 @@ launching a new VMI or shutting down an existing VMI.
 2. Verify a `virt-controller` device is available:
 
    ```bash
-   $ kubectl get deployment -n $NAMESPACE virt-controller -o jsonpath='{.status.readyReplicas}'
+   $ kubectl get deployment -n $NAMESPACE virt-handler -o jsonpath='{.status.readyReplicas}'
    ```
 
 3. Check the status of the `virt-controller` deployment:
diff --git a/docs/runbooks/VirtAPIDown.md b/docs/runbooks/VirtAPIDown.md
index 6146ed09..7b1136cb 100644
--- a/docs/runbooks/VirtAPIDown.md
+++ b/docs/runbooks/VirtAPIDown.md
@@ -2,11 +2,13 @@
 
 ## Meaning
 
-No running `virt-api` pod has been detected for 10 minutes.
+The `virt-api` deployment has fewer than the expected number of
+replicas available for 5 minutes.
 
 ## Impact
 
-KubeVirt objects cannot send API calls.
+This is a warning level alert. KubeVirt objects may experience degraded
+API performance.
 
 ## Diagnosis
 
@@ -19,7 +21,7 @@ KubeVirt objects cannot send API calls.
 2. Check the status of the `virt-api` pods:
 
    ```bash
-   $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-api
+   $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-controller
    ```
 
 3. Check the status of the `virt-api` deployment: