From 90f19fc31cc06aa8e859a461e5a4e5b9f10d94b7 Mon Sep 17 00:00:00 2001
From: avlitman
Date: Sun, 3 May 2026 19:08:08 +0300
Subject: [PATCH] Add CodeRabbit configuration for runbook reviews

Configure CodeRabbit to cross-reference runbooks against alert source code
in kubevirt org repositories. Includes path-specific review instructions
and runbook review guidelines.

Co-authored-by: Cursor
---
 .coderabbit.yaml                              | 209 ++++++++++++++++++
 docs/review-guidelines/runbooks.md            | 115 ++++++++++
 docs/runbooks/LowReadyVirtControllersCount.md |  14 +-
 docs/runbooks/VirtAPIDown.md                  |   8 +-
 4 files changed, 336 insertions(+), 10 deletions(-)
 create mode 100644 .coderabbit.yaml
 create mode 100644 docs/review-guidelines/runbooks.md

diff --git a/.coderabbit.yaml b/.coderabbit.yaml
new file mode 100644
index 00000000..f25b1878
--- /dev/null
+++ b/.coderabbit.yaml
@@ -0,0 +1,209 @@
+language: en-US
+
+reviews:
+  profile: assertive
+  high_level_summary: false
+  high_level_summary_in_walkthrough: false
+  poem: false
+  review_status: false
+  collapse_walkthrough: false
+  changed_files_summary: false
+  sequence_diagrams: false
+  assess_linked_issues: false
+  related_issues: false
+  related_prs: false
+  suggested_labels: false
+  suggested_reviewers: false
+  enable_prompt_for_ai_agents: false
+  auto_review:
+    enabled: true
+    drafts: false
+  finishing_touches:
+    docstrings:
+      enabled: false
+    unit_tests:
+      enabled: false
+    simplify:
+      enabled: false
+  pre_merge_checks:
+    title:
+      mode: "off"
+    description:
+      mode: "off"
+    issue_assessment:
+      mode: "off"
+    docstrings:
+      mode: "off"
+
+  path_instructions:
+    - path: "docs/runbooks/**/*.md"
+      instructions: |
+        These are KubeVirt alert runbooks. Each runbook documents a Prometheus
+        alert. The alert definitions live in OTHER repositories in the kubevirt
+        GitHub organization (linked repositories), NOT in this repo.
+
+        YOUR PRIMARY JOB is to use the linked repositories to find the actual
+        alert source code and deeply verify the runbook against it. For each
+        runbook file, do the following:
+
+        ## 1. Find the alert definition in the linked repos
+
+        Search ALL linked kubevirt org repositories for the **exact alert name**
+        (the H1 heading / filename, e.g., "VirtControllerDown"). Do NOT rely
+        on name prefixes to guess the repo — alert names do not always contain
+        the operator name. Search every linked repo until you find it. The
+        alert is defined as a PrometheusRule or in Go code using the
+        operator-observability framework.
+
+        IMPORTANT: The alert may not be merged yet — it may only exist in an
+        open pull request in the source repository. The runbook PR description
+        should contain a link to the alert PR. If the alert is not found on
+        the main branch, search open PRs in the linked repositories for the
+        alert name. If no alert definition can be found anywhere (neither
+        merged nor in an open PR), flag this clearly in the review.
+
+        Once found, identify:
+        - The **PromQL expression** (`expr` field)
+        - The **`for` duration** (how long the condition must hold before firing)
+        - The **severity** label
+        - The **operator_health_impact** label
+        - Any other labels (kubernetes_operator_part_of, etc.)
+
+        ## 2. Trace the PromQL expression
+
+        Break down the PromQL expression:
+        - If it references **recording rules**, find those recording rule
+          definitions in the same repo and trace what they compute.
+        - If it references **metrics**, find where those metrics are defined and
+          incremented/set in the Go source code. Understand what each metric
+          measures (e.g., counter of API errors, gauge of running pods).
+        - Understand the **complete chain**: metric → recording rule → alert
+          expression → when it fires.
+
+        ## 3. Print alert context at the top of every review comment
+
+        For EVERY runbook file reviewed, you MUST start your comment with
+        an alert context block. This is MANDATORY — never skip it. It must
+        appear at the very top of the comment before any review feedback.
+
+        Use this EXACT format (all fields required, do not omit any):
+
+        ---
+        **Alert source code:** `/` — `` (line )
+        **Alert expr:** ``
+        **Alert `for`:** ``
+        **Alert labels:** `severity=`, `operator_health_impact=`,
+          `kubernetes_operator_part_of=`, `kubernetes_operator_component=`
+        **Recording rules:**
+        - ``: `` — `/`
+        **Metrics:**
+        - `` (): `/` (line )
+        ---
+
+        The **Alert expr** field is critical — always include the full PromQL
+        expression so the reviewer can verify correctness at a glance.
+
+        If the alert was found in an open PR (not merged), add:
+        **Note:** Alert found in open PR:
+
+        If the alert could not be found, state:
+        **Alert source code:** NOT FOUND — cannot verify this runbook
+
+        ## 4. Verify the runbook against the source
+
+        With full understanding of the alert, verify each section:
+
+        **Meaning section:**
+        - Does it accurately describe what the PromQL expression evaluates?
+        - Is the firing condition correct (threshold, duration, component)?
+        - If the expression checks `virt-controller` pods, the runbook must not
+          say `virt-api` (or vice versa). Flag any component mismatches.
+        - Is the `for` duration mentioned and correct?
+
+        **Impact section:**
+        - Based on the code where the metrics are set, what actually breaks when
+          this alert fires? Does the runbook accurately describe the impact?
+        - Are there downstream effects the runbook misses?
+
+        **Diagnosis section:**
+        - Do the `kubectl`/`oc` commands target the correct resources, labels,
+          and namespaces for the component the alert actually monitors?
+        - Are the label selectors correct? (e.g., `kubevirt.io=virt-controller`
+          vs `kubevirt.io=virt-api`)
+        - Do the commands help the user verify the condition the PromQL
+          expression checks?
+        - Are the commands syntactically valid?
+
+        **Mitigation section:**
+        - Based on the root causes visible in the source code, are the suggested
+          mitigations relevant and actionable?
+        - Are there obvious mitigations missing?
+
+        ## 5. Formatting rules
+
+        - Every runbook MUST have H2 sections in order: `## Meaning`,
+          `## Impact`, `## Diagnosis`, `## Mitigation`.
+        - H1 heading must match the filename (e.g., `VirtAPIDown.md` →
+          `# VirtAPIDown`).
+        - Fenced code blocks: use ```` ```bash ```` not ```` ``` bash ````.
+          No space before the language tag.
+        - Line length: 80 chars for prose (code blocks exempt).
+        - `` and `...` comment blocks
+          are intentional downstream/upstream markers. Do not flag them.
+
+  tools:
+    markdownlint:
+      enabled: true
+    languagetool:
+      enabled: true
+      level: picky
+    shellcheck:
+      enabled: true
+
+knowledge_base:
+  learnings:
+    scope: global
+  pull_requests:
+    scope: global
+  linked_repositories:
+    - repository: kubevirt/kubevirt
+      instructions: >
+        Core KubeVirt. Contains alert definitions, recording rules, and
+        metric definitions for virt-api, virt-controller, virt-handler,
+        virt-launcher, virt-operator. Search this repo for any alert name.
+    - repository: kubevirt/containerized-data-importer
+      instructions: >
+        Containerized Data Importer (CDI). Contains alert definitions,
+        recording rules, and metric definitions. Search this repo for
+        any alert name.
+    - repository: kubevirt/hyperconverged-cluster-operator
+      instructions: >
+        HyperConverged Cluster Operator (HCO). Contains alert definitions,
+        recording rules, and metric definitions. Search this repo for
+        any alert name.
+    - repository: kubevirt/ssp-operator
+      instructions: >
+        Scheduling, Scale, and Performance operator (SSP). Contains alert
+        definitions, recording rules, and metric definitions. Search this
+        repo for any alert name.
+    - repository: kubevirt/cluster-network-addons-operator
+      instructions: >
+        Cluster Network Addons Operator (CNAO). Contains alert definitions
+        and metric definitions. Search this repo for any alert name.
+    - repository: kubevirt/hostpath-provisioner-operator
+      instructions: >
+        HostPath Provisioner Operator (HPP). Contains alert definitions
+        and metric definitions. Search this repo for any alert name.
+    - repository: k8snetworkplumbingwg/kubemacpool
+      instructions: >
+        Kubemacpool. Contains alert definitions and metric definitions.
+        Search this repo for any alert name.
+    - repository: kubevirt/application-aware-quota
+      instructions: >
+        Application Aware Quota (AAQ). Contains alert definitions and
+        metric definitions. Search this repo for any alert name.
+    - repository: kubevirt/hostpath-provisioner
+      instructions: >
+        HostPath Provisioner (HPP). Contains metric definitions and
+        possibly alert definitions. Search this repo for any alert name.
+        Note: this is the provisioner itself, not the operator.
diff --git a/docs/review-guidelines/runbooks.md b/docs/review-guidelines/runbooks.md
new file mode 100644
index 00000000..614887fb
--- /dev/null
+++ b/docs/review-guidelines/runbooks.md
@@ -0,0 +1,115 @@
+# Runbook Review Guidelines
+
+These guidelines define how to review KubeVirt alert runbooks in
+`docs/runbooks/`. The key principle: **every claim in a runbook must be
+verified against the actual alert source code** in the kubevirt
+organization repositories.
+
+## Alert Source Code Location
+
+Alert definitions do NOT live in this repo. They are defined in operator
+repositories across the kubevirt GitHub organization. Alert names do NOT
+always contain the operator name as a prefix — you cannot guess the repo
+from the alert name. Search ALL of these repos for the exact alert name:
+
+- `kubevirt/kubevirt`
+- `kubevirt/containerized-data-importer`
+- `kubevirt/hyperconverged-cluster-operator`
+- `kubevirt/ssp-operator`
+- `kubevirt/cluster-network-addons-operator`
+- `kubevirt/hostpath-provisioner-operator`
+- `kubevirt/hostpath-provisioner`
+- `k8snetworkplumbingwg/kubemacpool`
+- `kubevirt/application-aware-quota`
+
+## Alert May Only Exist in an Open PR
+
+Often the alert is not yet merged in the source repository — it only
+exists in an open pull request. This is the normal workflow: the alert
+PR and the runbook PR are created in parallel.
+
+When reviewing a runbook:
+1. First search the main branch of the source repo for the alert name
+2. If not found, search **open pull requests** in the source repo
+3. The runbook PR description should link to the corresponding alert
+   PR — follow that link to find the alert definition
+4. If no alert definition can be found anywhere (neither merged nor in
+   an open PR), flag this clearly: the runbook cannot be verified
+
+## What to Verify
+
+### 1. Find the alert definition
+
+For each runbook, locate the alert by its name in the source repo
+(main branch or open PR). An alert definition includes:
+- **`expr`**: The PromQL expression that triggers the alert
+- **`for`**: How long the condition must hold before the alert fires
+- **`severity`**: critical, warning, or info
+- **`operator_health_impact`**: critical, warning, or none
+- **`summary`** and **`description`** annotations
+
+### 2. Trace the PromQL expression end-to-end
+
+Break down the `expr` to understand what it actually evaluates:
+
+- **Metrics**: Find where each metric in the expression is defined in
+  the Go source code. Understand what it measures — is it a counter,
+  gauge, histogram? What events cause it to increment or change?
+- **Recording rules**: If the expression references recording rules
+  (metrics that don't exist as raw instrumentation), find the recording
+  rule definition and trace what it computes.
+- **Thresholds and conditions**: What numeric thresholds or boolean
+  conditions trigger the alert? Over what time window?
+
+The goal is to fully understand: what system state causes this alert
+to fire?
+
+### 3. Verify each runbook section
+
+#### `## Meaning`
+
+- Must accurately describe what the PromQL expression evaluates
+- Must state the correct firing condition (threshold, duration)
+- Must reference the correct component — if the expression queries
+  `virt-controller` pods, the runbook must not say `virt-api`
+- The `for` duration should match the alert definition
+
+#### `## Impact`
+
+- Based on where the metrics are set in source code, what functionality
+  is actually affected when this condition is true?
+- Does the runbook capture the real user-facing impact, or is it
+  vague/generic?
+
+#### `## Diagnosis`
+
+- `kubectl`/`oc` commands must target the correct resources and labels
+  for the component the alert monitors
+- Label selectors must match what the source code uses (e.g.,
+  `kubevirt.io=virt-controller`)
+- Commands should help verify the specific condition the PromQL
+  expression checks
+- Commands must be syntactically valid
+- Use `$NAMESPACE` variable pattern:
+  ```bash
+  $ export NAMESPACE="$(kubectl get kubevirt -A \
+    -o custom-columns="":.metadata.namespace)"
+  ```
+
+#### `## Mitigation`
+
+- Remediation steps should address the root causes that the source
+  code reveals
+- Should be concrete and actionable, not just "investigate the issue"
+
+## Formatting Rules
+
+- One runbook per file: `.md`
+- H1 heading must match filename: `VirtAPIDown.md` → `# VirtAPIDown`
+- Required H2 sections in order: `## Meaning`, `## Impact`,
+  `## Diagnosis`, `## Mitigation`
+- Code fences: ```` ```bash ```` (no space before language tag)
+- Line length: 80 characters for prose (code blocks exempt)
+- Prefix example commands with `$ `
+- `` and `...` are intentional
+  downstream/upstream content markers — do not flag them
diff --git a/docs/runbooks/LowReadyVirtControllersCount.md b/docs/runbooks/LowReadyVirtControllersCount.md
index 203b378a..f519b19b 100644
--- a/docs/runbooks/LowReadyVirtControllersCount.md
+++ b/docs/runbooks/LowReadyVirtControllersCount.md
@@ -2,13 +2,13 @@
 
 ## Meaning
 
-This alert fires when one or more `virt-controller` pods are running, but not
-all of them have been in a `Ready` state for the last 10 minutes.
+This alert fires when the `virt-controller` deployment has zero available
+replicas for the last 5 minutes.
 
-A `virt-controller` device monitors the custom resource definitions (CRDs) of a
-virtual machine instance (VMI) and manages the associated pods. The device
-create pods for VMIs and manages the lifecycle of the pods. The device is
-critical for cluster-wide virtualization functionality.
+The `virt-controller` monitors the `kubevirt_vmi_memory_used_bytes`
+metric and manages the associated pods. The device creates pods for
+VMIs and manages the lifecycle of the pods. The device is critical
+for cluster-wide virtualization functionality.
 
 ## Impact
 
@@ -27,7 +27,7 @@ launching a new VMI or shutting down an existing VMI.
 2. Verify a `virt-controller` device is available:
 
    ```bash
-   $ kubectl get deployment -n $NAMESPACE virt-controller -o jsonpath='{.status.readyReplicas}'
+   $ kubectl get deployment -n $NAMESPACE virt-handler -o jsonpath='{.status.readyReplicas}'
    ```
 
 3. Check the status of the `virt-controller` deployment:
diff --git a/docs/runbooks/VirtAPIDown.md b/docs/runbooks/VirtAPIDown.md
index 6146ed09..7b1136cb 100644
--- a/docs/runbooks/VirtAPIDown.md
+++ b/docs/runbooks/VirtAPIDown.md
@@ -2,11 +2,13 @@
 
 ## Meaning
 
-No running `virt-api` pod has been detected for 10 minutes.
+The `virt-api` deployment has fewer than the expected number of
+replicas available for 5 minutes.
 
 ## Impact
 
-KubeVirt objects cannot send API calls.
+This is a warning level alert. KubeVirt objects may experience degraded
+API performance.
 
 ## Diagnosis
 
@@ -19,7 +21,7 @@ KubeVirt objects cannot send API calls.
 2. Check the status of the `virt-api` pods:
 
    ```bash
-   $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-api
+   $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-controller
    ```
 
 3. Check the status of the `virt-api` deployment: