Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
209 changes: 209 additions & 0 deletions .coderabbit.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,209 @@
language: en-US

reviews:
profile: assertive
high_level_summary: false
high_level_summary_in_walkthrough: false
poem: false
review_status: false
collapse_walkthrough: false
changed_files_summary: false
sequence_diagrams: false
assess_linked_issues: false
related_issues: false
related_prs: false
suggested_labels: false
suggested_reviewers: false
enable_prompt_for_ai_agents: false
auto_review:
enabled: true
drafts: false
finishing_touches:
docstrings:
enabled: false
unit_tests:
enabled: false
simplify:
enabled: false
pre_merge_checks:
title:
mode: "off"
description:
mode: "off"
issue_assessment:
mode: "off"
docstrings:
mode: "off"

path_instructions:
- path: "docs/runbooks/**/*.md"
instructions: |
These are KubeVirt alert runbooks. Each runbook documents a Prometheus
alert. The alert definitions live in OTHER repositories in the kubevirt
GitHub organization (linked repositories), NOT in this repo.

YOUR PRIMARY JOB is to use the linked repositories to find the actual
alert source code and deeply verify the runbook against it. For each
runbook file, do the following:

## 1. Find the alert definition in the linked repos

Search ALL linked kubevirt org repositories for the **exact alert name**
(the H1 heading / filename, e.g., "VirtControllerDown"). Do NOT rely
on name prefixes to guess the repo — alert names do not always contain
the operator name. Search every linked repo until you find it. The
alert is defined as a PrometheusRule or in Go code using the
operator-observability framework.

IMPORTANT: The alert may not be merged yet — it may only exist in an
open pull request in the source repository. The runbook PR description
should contain a link to the alert PR. If the alert is not found on
the main branch, search open PRs in the linked repositories for the
alert name. If no alert definition can be found anywhere (neither
merged nor in an open PR), flag this clearly in the review.

Once found, identify:
- The **PromQL expression** (`expr` field)
- The **`for` duration** (how long the condition must hold before firing)
- The **severity** label
- The **operator_health_impact** label
- Any other labels (kubernetes_operator_part_of, etc.)

## 2. Trace the PromQL expression

Break down the PromQL expression:
- If it references **recording rules**, find those recording rule
definitions in the same repo and trace what they compute.
- If it references **metrics**, find where those metrics are defined and
incremented/set in the Go source code. Understand what each metric
measures (e.g., counter of API errors, gauge of running pods).
- Understand the **complete chain**: metric → recording rule → alert
expression → when it fires.

## 3. Print alert context at the top of every review comment

For EVERY runbook file reviewed, you MUST start your comment with
an alert context block. This is MANDATORY — never skip it. It must
appear at the very top of the comment before any review feedback.

Use this EXACT format (all fields required, do not omit any):

---
**Alert source code:** `<org>/<repo>` — `<file path>` (line <N>)
**Alert expr:** `<paste the FULL PromQL expression here>`
**Alert `for`:** `<duration>`
**Alert labels:** `severity=<val>`, `operator_health_impact=<val>`,
`kubernetes_operator_part_of=<val>`, `kubernetes_operator_component=<val>`
**Recording rules:**
- `<rule name>`: `<rule expr>` — `<repo>/<file path>`
**Metrics:**
- `<metric name>` (<type>): `<repo>/<file path>` (line <N>)
---

The **Alert expr** field is critical — always include the full PromQL
expression so the reviewer can verify correctness at a glance.

If the alert was found in an open PR (not merged), add:
**Note:** Alert found in open PR: <PR URL>

If the alert could not be found, state:
**Alert source code:** NOT FOUND — cannot verify this runbook

## 4. Verify the runbook against the source

With full understanding of the alert, verify each section:

**Meaning section:**
- Does it accurately describe what the PromQL expression evaluates?
- Is the firing condition correct (threshold, duration, component)?
- If the expression checks `virt-controller` pods, the runbook must not
say `virt-api` (or vice versa). Flag any component mismatches.
- Is the `for` duration mentioned and correct?

**Impact section:**
- Based on the code where the metrics are set, what actually breaks when
this alert fires? Does the runbook accurately describe the impact?
- Are there downstream effects the runbook misses?

**Diagnosis section:**
- Do the `kubectl`/`oc` commands target the correct resources, labels,
and namespaces for the component the alert actually monitors?
- Are the label selectors correct? (e.g., `kubevirt.io=virt-controller`
vs `kubevirt.io=virt-api`)
- Do the commands help the user verify the condition the PromQL
expression checks?
- Are the commands syntactically valid?

**Mitigation section:**
- Based on the root causes visible in the source code, are the suggested
mitigations relevant and actionable?
- Are there obvious mitigations missing?

## 5. Formatting rules

- Every runbook MUST have H2 sections in order: `## Meaning`,
`## Impact`, `## Diagnosis`, `## Mitigation`.
- H1 heading must match the filename (e.g., `VirtAPIDown.md` →
`# VirtAPIDown`).
- Fenced code blocks: use ```` ```bash ```` not ```` ``` bash ````.
No space before the language tag.
- Line length: 80 chars for prose (code blocks exempt).
- `<!--DS: ... -->` and `<!--USstart-->...<!--USend-->` comment blocks
are intentional downstream/upstream markers. Do not flag them.

tools:
markdownlint:
enabled: true
languagetool:
enabled: true
level: picky
shellcheck:
enabled: true

knowledge_base:
learnings:
scope: global
pull_requests:
scope: global
linked_repositories:
- repository: kubevirt/kubevirt
instructions: >
Core KubeVirt. Contains alert definitions, recording rules, and
metric definitions for virt-api, virt-controller, virt-handler,
virt-launcher, virt-operator. Search this repo for any alert name.
- repository: kubevirt/containerized-data-importer
instructions: >
Containerized Data Importer (CDI). Contains alert definitions,
recording rules, and metric definitions. Search this repo for
any alert name.
- repository: kubevirt/hyperconverged-cluster-operator
instructions: >
HyperConverged Cluster Operator (HCO). Contains alert definitions,
recording rules, and metric definitions. Search this repo for
any alert name.
- repository: kubevirt/ssp-operator
instructions: >
Scheduling, Scale, and Performance operator (SSP). Contains alert
definitions, recording rules, and metric definitions. Search this
repo for any alert name.
- repository: kubevirt/cluster-network-addons-operator
instructions: >
Cluster Network Addons Operator (CNAO). Contains alert definitions
and metric definitions. Search this repo for any alert name.
- repository: kubevirt/hostpath-provisioner-operator
instructions: >
HostPath Provisioner Operator (HPP). Contains alert definitions
and metric definitions. Search this repo for any alert name.
- repository: k8snetworkplumbingwg/kubemacpool
instructions: >
Kubemacpool. Contains alert definitions and metric definitions.
Search this repo for any alert name.
- repository: kubevirt/application-aware-quota
instructions: >
Application Aware Quota (AAQ). Contains alert definitions and
metric definitions. Search this repo for any alert name.
- repository: kubevirt/hostpath-provisioner
instructions: >
HostPath Provisioner (HPP). Contains metric definitions and
possibly alert definitions. Search this repo for any alert name.
Note: this is the provisioner itself, not the operator.
115 changes: 115 additions & 0 deletions docs/review-guidelines/runbooks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# Runbook Review Guidelines

These guidelines define how to review KubeVirt alert runbooks in
`docs/runbooks/`. The key principle: **every claim in a runbook must be
verified against the actual alert source code** in the kubevirt
organization repositories.

## Alert Source Code Location

Alert definitions do NOT live in this repo. They are defined in operator
repositories across the kubevirt GitHub organization. Alert names do NOT
always contain the operator name as a prefix — you cannot guess the repo
from the alert name. Search ALL of these repos for the exact alert name:

- `kubevirt/kubevirt`
- `kubevirt/containerized-data-importer`
- `kubevirt/hyperconverged-cluster-operator`
- `kubevirt/ssp-operator`
- `kubevirt/cluster-network-addons-operator`
- `kubevirt/hostpath-provisioner-operator`
- `kubevirt/hostpath-provisioner`
- `k8snetworkplumbingwg/kubemacpool`
- `kubevirt/application-aware-quota`

## Alert May Only Exist in an Open PR

Often the alert is not yet merged in the source repository — it only
exists in an open pull request. This is the normal workflow: the alert
PR and the runbook PR are created in parallel.

When reviewing a runbook:
1. First search the main branch of the source repo for the alert name
2. If not found, search **open pull requests** in the source repo
3. The runbook PR description should link to the corresponding alert
PR — follow that link to find the alert definition
4. If no alert definition can be found anywhere (neither merged nor in
an open PR), flag this clearly: the runbook cannot be verified

## What to Verify

### 1. Find the alert definition

For each runbook, locate the alert by its name in the source repo
(main branch or open PR). An alert definition includes:
- **`expr`**: The PromQL expression that triggers the alert
- **`for`**: How long the condition must hold before the alert fires
- **`severity`**: critical, warning, or info
- **`operator_health_impact`**: critical, warning, or none
- **`summary`** and **`description`** annotations

### 2. Trace the PromQL expression end-to-end

Break down the `expr` to understand what it actually evaluates:

- **Metrics**: Find where each metric in the expression is defined in
the Go source code. Understand what it measures — is it a counter,
gauge, histogram? What events cause it to increment or change?
- **Recording rules**: If the expression references recording rules
(metrics that don't exist as raw instrumentation), find the recording
rule definition and trace what it computes.
- **Thresholds and conditions**: What numeric thresholds or boolean
conditions trigger the alert? Over what time window?

The goal is to fully understand: what system state causes this alert
to fire?

### 3. Verify each runbook section

#### `## Meaning`

- Must accurately describe what the PromQL expression evaluates
- Must state the correct firing condition (threshold, duration)
- Must reference the correct component — if the expression queries
`virt-controller` pods, the runbook must not say `virt-api`
- The `for` duration should match the alert definition

#### `## Impact`

- Based on where the metrics are set in source code, what functionality
is actually affected when this condition is true?
- Does the runbook capture the real user-facing impact, or is it
vague/generic?

#### `## Diagnosis`

- `kubectl`/`oc` commands must target the correct resources and labels
for the component the alert monitors
- Label selectors must match what the source code uses (e.g.,
`kubevirt.io=virt-controller`)
- Commands should help verify the specific condition the PromQL
expression checks
- Commands must be syntactically valid
- Use `$NAMESPACE` variable pattern:
```bash
$ export NAMESPACE="$(kubectl get kubevirt -A \
-o custom-columns="":.metadata.namespace)"
```

#### `## Mitigation`

- Remediation steps should address the root causes that the source
code reveals
- Should be concrete and actionable, not just "investigate the issue"

## Formatting Rules

- One runbook per file: `<AlertName>.md`
- H1 heading must match filename: `VirtAPIDown.md` → `# VirtAPIDown`
- Required H2 sections in order: `## Meaning`, `## Impact`,
`## Diagnosis`, `## Mitigation`
- Code fences: ```` ```bash ```` (no space before language tag)
- Line length: 80 characters for prose (code blocks exempt)
- Prefix example commands with `$ `
- `<!--DS: ... -->` and `<!--USstart-->...<!--USend-->` are intentional
downstream/upstream content markers — do not flag them
14 changes: 7 additions & 7 deletions docs/runbooks/LowReadyVirtControllersCount.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@

## Meaning

This alert fires when one or more `virt-controller` pods are running, but not
all of them have been in a `Ready` state for the last 10 minutes.
This alert fires when the `virt-controller` deployment has zero available
replicas for the last 5 minutes.

A `virt-controller` device monitors the custom resource definitions (CRDs) of a
virtual machine instance (VMI) and manages the associated pods. The device
create pods for VMIs and manages the lifecycle of the pods. The device is
critical for cluster-wide virtualization functionality.
The `virt-controller` monitors the `kubevirt_vmi_memory_used_bytes`
metric and manages the associated pods. The device creates pods for
VMIs and manages the lifecycle of the pods. The device is critical
for cluster-wide virtualization functionality.
Comment on lines +5 to +11
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

LowReadyVirtControllersCount runbook semantics drift from the actual alert.


Alert source code: kubevirt/kubevirtpkg/monitoring/rules/alerts/virt-controller.go (line 490)
Alert expr: cluster:kubevirt_virt_controller_ready:sum < cluster:kubevirt_virt_controller_pods_running:count and cluster:kubevirt_virt_controller_ready:sum > 0
Alert for: 10m
Alert labels: severity=warning, operator_health_impact=warning,
kubernetes_operator_part_of=kubevirt, kubernetes_operator_component=kubevirt
Recording rules:

  • cluster:kubevirt_virt_controller_ready:sum: sum(kube_pod_status_ready{pod=~'virt-controller-.*', namespace='%s', condition='true'} * on(pod, namespace) kubevirt_virt_controller_ready_status{namespace='%s'}) or vector(0)kubevirt/kubevirt/pkg/monitoring/rules/recordingrules/virt.go
  • cluster:kubevirt_virt_controller_pods_running:count: count(kube_pod_status_phase{pod=~'virt-controller-.*', namespace='%s', phase='Running'} == 1) or vector(0)kubevirt/kubevirt/pkg/monitoring/rules/recordingrules/virt.go
    Metrics:
  • kubevirt_virt_controller_ready_status (gauge): kubevirt/kubevirt/pkg/monitoring/metrics/virt-controller/component_metrics.go (line 440)

Line 5-6 says “zero available replicas for 5 minutes,” but this alert is “ready < running and ready > 0 for 10 minutes.”
Line 30 checks virt-handler; the diagnosis must target virt-controller readiness/running state.

Suggested doc fix
-This alert fires when the `virt-controller` deployment has zero available
-replicas for the last 5 minutes.
+This alert fires when `virt-controller` has fewer ready pods than running
+pods for 10 minutes, while at least one pod is still ready.

-   $ kubectl get deployment -n $NAMESPACE virt-handler -o jsonpath='{.status.readyReplicas}'
+   $ kubectl get deployment -n $NAMESPACE virt-controller -o jsonpath='{.status.readyReplicas}'

As per coding guidelines, “Must state the correct firing condition (threshold, duration)” and “kubectl/oc commands must target the correct resources and labels for the component the alert monitors”.

Also applies to: 30-30


## Impact

Expand All @@ -27,7 +27,7 @@ launching a new VMI or shutting down an existing VMI.
2. Verify a `virt-controller` device is available:

```bash
$ kubectl get deployment -n $NAMESPACE virt-controller -o jsonpath='{.status.readyReplicas}'
$ kubectl get deployment -n $NAMESPACE virt-handler -o jsonpath='{.status.readyReplicas}'
```

3. Check the status of the `virt-controller` deployment:
Expand Down
8 changes: 5 additions & 3 deletions docs/runbooks/VirtAPIDown.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,13 @@

## Meaning

No running `virt-api` pod has been detected for 10 minutes.
The `virt-api` deployment has fewer than the expected number of
replicas available for 5 minutes.
Comment on lines +5 to +6
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Runbook now documents a different alert than VirtAPIDown.


Alert source code: kubevirt/kubevirtpkg/monitoring/rules/alerts/virt-api.go (line 524)
Alert expr: cluster:kubevirt_virt_api_pods_running:count == 0
Alert for: 10m
Alert labels: severity=critical, operator_health_impact=critical,
kubernetes_operator_part_of=kubevirt, kubernetes_operator_component=kubevirt
Recording rules:

  • cluster:kubevirt_virt_api_pods_running:count: count(kube_pod_status_phase{pod=~'virt-api-.*', namespace='%s', phase='Running'} == 1) or vector(0)kubevirt/kubevirt/pkg/monitoring/rules/recordingrules/virt.go
    Metrics:
  • kube_pod_status_phase (gauge): kubevirt/kubevirt/pkg/monitoring/rules/recordingrules/virt.go (line 736)

Line 5-6 and Line 10-11 describe a warning-level “fewer replicas for 5 minutes” condition, but this alert is a critical “no running virt-api pods for 10 minutes” condition.
Line 24 also queries kubevirt.io=virt-controller, which is the wrong component for this runbook.

Suggested doc fix
-The `virt-api` deployment has fewer than the expected number of
-replicas available for 5 minutes.
+This alert fires when no `virt-api` pods are running for 10 minutes.

-This is a warning level alert. KubeVirt objects may experience degraded
-API performance.
+This is a critical alert. KubeVirt API operations can become unavailable.

-   $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-controller
+   $ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-api

As per coding guidelines, “Must accurately describe what the PromQL expression evaluates” and “Do the kubectl/oc commands target the correct resources, labels, and namespaces for the component the alert actually monitors?”.

Also applies to: 10-11, 24-24


## Impact

KubeVirt objects cannot send API calls.
This is a warning level alert. KubeVirt objects may experience degraded
API performance.

## Diagnosis

Expand All @@ -19,7 +21,7 @@ KubeVirt objects cannot send API calls.
2. Check the status of the `virt-api` pods:

```bash
$ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-api
$ kubectl -n $NAMESPACE get pods -l kubevirt.io=virt-controller
```

3. Check the status of the `virt-api` deployment:
Expand Down
Loading