JOB_INSIGHT_FAILURE_HISTORY_ANALYSIS_PROMPT.md

<!-- To be used with https://github.com/myk-org/rootcoz
Complements the main analysis prompt — history-aware classification.
-->

# Pre-Classification Check: Did a Previous Test Break the Cluster?

## MANDATORY: Before classifying any failure, answer this question first:

**Did an earlier test in this job run modify cluster resources and fail to clean them up?**

If yes → the current failure is likely a side effect of that earlier test's failed
teardown, not an independent issue.

## How to Check

1. **Scan the console log for teardown failures BEFORE the current test.**
   Look for (see the sketch after this list):
- `TimeoutExpiredError` during teardown or fixture cleanup
- `ERROR` in teardown/finalizer of a preceding test
- `teardown_module`, `teardown_class`, or fixture `yield` cleanup failures

2. **Identify what the failed teardown was supposed to revert.**
Look for any test that patches, modifies, or reconfigures cluster-scoped resources:
operators, subscriptions, CRDs, node labels/taints, network configurations, or
cluster-level CRs (HyperConverged, KubeVirt, NetworkAddonsConfig, etc.).
If the teardown of such a test failed, the cluster may be left in a modified state.

3. **Check if the current failure matches the expected impact.**
For example: pods stuck in Pending, operators degraded, nodes not schedulable,
feature gates in wrong state, network policies broken — any symptom consistent
with the resource that was not reverted.
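
The step 1 scan can be mechanized. A minimal sketch, assuming a plain-text
console log; the marker strings and test-name matching are illustrative and
may need adjusting to the actual log format:

```
# Illustrative teardown-failure markers; adjust to the real log phrasing.
TEARDOWN_MARKERS = (
    "TimeoutExpiredError",
    "ERROR at teardown",
    "teardown_module",
    "teardown_class",
)

def teardown_failures_before(console_log: str, current_test: str) -> list[str]:
    """Return suspicious lines that appear BEFORE the first mention of the
    currently analyzed test."""
    lines = console_log.splitlines()
    for i, line in enumerate(lines):
        if current_test in line:
            lines = lines[:i]  # ignore everything from the current test onward
            break
    return [line for line in lines if any(m in line for m in TEARDOWN_MARKERS)]
```

Any non-empty result means the history-aware check applies.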

## Classification

- **The test whose teardown failed**: Classify based on why its teardown failed —
CODE ISSUE if the cleanup logic is wrong, PRODUCT BUG if the product blocked
the revert, INFRASTRUCTURE if an environmental issue (node outage, storage
failure, etc.) prevented cleanup.
- **All other tests that failed after it**: Use the **same classification** as the
  root-cause test. In the reason, state: "Caused by [test_name] teardown failure —
  [resource] was not reverted." (A propagation sketch follows this list.)
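
The cascade rule can be made concrete. A minimal propagation sketch; the data
shapes and helper name are illustrative, not the real report schema:

```
def propagate_classification(failures, root_test, root_class, resource):
    """Apply the cascade rule: every failure after the root-cause test
    inherits its classification and a reason pointing back at it."""
    results = {}
    seen_root = False
    for test in failures:  # failures must be in execution order
        if test == root_test:
            seen_root = True
            results[test] = (root_class, "root cause: its own teardown failed")
        elif seen_root:
            results[test] = (
                root_class,
                f"Caused by {root_test} teardown failure — {resource} was not reverted.",
            )
        else:
            results[test] = (None, "classify independently")
    return results
```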

## When This Check Does NOT Apply

- The current test is the **first failure** in the run
- No teardown errors appear before the current test in the console log
- The failure has a clearly independent root cause (e.g., wrong assertion value,
import error, syntax error)
- The failure occurs during **pytest collection** (e.g., `ModuleNotFoundError`,
`SyntaxError`, missing fixture) — collection happens before any test runs
- The cluster was already broken **before any test ran** (e.g., deployment failure,
cluster not provisioned, operators not installed)
- The failure is in a completely **unrelated area** to what the previous test modified
(e.g., previous test changed storage config, current test fails on CPU topology)
- The same failure pattern appears in **previous job runs** where no teardown failure
preceded it — this indicates a recurring issue independent of teardown cascades

In these cases, proceed with normal classification rules.
JOB_INSIGHT_PROMPT.md

## 2. Decision Procedure and Classification Rules

Your goal is to classify each failure as `CODE ISSUE`, `PRODUCT BUG`, or
`INFRASTRUCTURE` based on the available evidence. Do not promote weak, indirect, or
purely environmental signals into a confident product-defect claim.

**Allowed classification values:**

- `CODE ISSUE` - Test framework, test code, or test-owned configuration problem
- `PRODUCT BUG` - An actual defect in KubeVirt, CDI, HCO, a related product
  component, or a dependent operator
- `INFRASTRUCTURE` - Environmental blocker, lab/cluster infrastructure failure, or
external dependency outage — not a code or product defect

**Allowed confidence levels:**

- `high` - Direct causal evidence (e.g., stack trace clearly in product code,
CR status showing product error)
- `medium` - Indirect but consistent signals (e.g., pattern matches known
product issue, but logs incomplete)
- `low` - Contradictory signals or missing direct cause

### Required Decision Order

5. **Separate test-owned, product-owned, and environment-owned problems.**
Test configuration, fixture logic, assertions, and wait logic point to `CODE ISSUE`.
KubeVirt, CDI, HCO, or related component behavior producing the wrong result points
to `PRODUCT BUG`. Cluster unreachable, node failures, storage outages, or missing
operators point to `INFRASTRUCTURE`.
6. **Assign confidence based on evidence strength.**
High confidence requires a direct causal signal. Medium confidence fits consistent
but incomplete evidence. Low confidence fits contradictory signals or missing
direct cause. This applies to all classifications — `INFRASTRUCTURE` can be high
confidence when direct evidence exists (e.g., nodes `NotReady`, storage backend
unreachable).

### Expected-Failure and Derivative-Failure Handling

Indicators:

- Failures caused by dependent operators (see Section 6) behaving incorrectly with valid
CNV configuration. Distinguish between: the dependent operator itself is broken
(file against that operator, not CNV), CNV misconfigures the dependent operator
(`PRODUCT BUG` against CNV), or the operator is missing/not installed (`INFRASTRUCTURE`)

### INFRASTRUCTURE - Environmental Blockers and Ambiguous Cases

Infrastructure or lab failures are NOT `PRODUCT BUG` or `CODE ISSUE` findings.
Classify them as `INFRASTRUCTURE`.

Common environmental blockers:

- Node hardware failure, IPMI issues, or SR-IOV card malfunction
- Remote cluster mismatch or unavailable remote cluster
- Insufficient cluster resources (CPU, memory, storage) for test requirements
- Container runtime failures (e.g., OCI hook errors) affecting all pods on a node
- Operator missing or not installed (test prerequisite not met)

Guidance:

- Classify pure environmental blockers as `INFRASTRUCTURE`.
- If the evidence only shows environment instability, classify as `INFRASTRUCTURE`
and describe the environmental condition.
- Quarantined tests (`@pytest.mark.jira(..., run=False)`) are not product defects
unless the failure mode is different from the quarantined issue.

or the API rejected a valid one (`PRODUCT BUG`).
- `ResourceNotFoundError` or `NotFoundError` - Determine whether the resource was
never created (fixture issue = `CODE ISSUE`), was garbage-collected unexpectedly
(`PRODUCT BUG`), or the namespace was cleaned up (`INFRASTRUCTURE`).

Pattern guidance:

- **VM lifecycle timeout:** Too-low timeout or wrong wait target is `CODE ISSUE`; a
real stall with healthy inputs and controllers is `PRODUCT BUG`; an API or node
outage is `INFRASTRUCTURE`
- **Live migration failure:** Wrong migration policy, anti-affinity, or insufficient
target node resources in test setup is `CODE ISSUE`; valid configuration plus
`virt-controller` or `virt-handler` failure is `PRODUCT BUG`; node drain or
network partition is `INFRASTRUCTURE`
- **SSH connectivity failure:** Read the test code to determine how SSH is used.
  Wrong credentials, missing `virtctl` binary, no retry logic, or missing
  `wait_for_ssh_connectivity()` before running commands is `CODE ISSUE` (see the
  connectivity-wait sketch after this list). VM network misconfiguration after
  migration or snapshot restore where the test correctly waits and retries is
  `PRODUCT BUG`. Cluster network outage is `INFRASTRUCTURE`.
- **DataVolume/CDI failure:** Wrong source URL, bad storage class reference, or
insufficient PVC size in test is `CODE ISSUE`; valid import/upload/clone rejected
or stuck by CDI controller is `PRODUCT BUG`; storage backend outage is `INFRASTRUCTURE`
- **Post-operation validation:** Wrong expected values or stale assertions are
`CODE ISSUE`; a VM with wrong CPU topology, missing disks, or broken networking
after a valid operation is `PRODUCT BUG`
`PRODUCT BUG`
- **Resource cleanup failure:** Missing cleanup or bad fixture ownership is `CODE ISSUE`;
product finalizer or controller cleanup failure is `PRODUCT BUG`; namespace or cluster
cleanup blocked by infrastructure is `INFRASTRUCTURE`
- **Console access failure:** Wrong `pexpect` patterns or timeouts in test is
`CODE ISSUE`; `virtctl console` unable to connect to a healthy VMI is `PRODUCT BUG`
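
For the SSH item above, the decisive signal is whether the test waits and
retries before asserting. A minimal, self-contained sketch of that
wait-and-retry behavior; the repo's real `wait_for_ssh_connectivity()` helper
will differ in signature and transport, so treat this as illustrative:

```
import socket
import time

def wait_for_ssh_connectivity(host: str, port: int = 22, timeout: int = 300) -> None:
    """Poll until the VM's SSH port accepts TCP connections, or time out.
    Illustrative stand-in for the test framework's helper of the same name."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return  # port reachable; SSH is considered up
        except OSError:
            time.sleep(5)  # transient failure: retry until the deadline
    raise TimeoutError(f"SSH to {host}:{port} not reachable after {timeout}s")
```

A test that issues SSH commands without a wait like this is the `CODE ISSUE`
pattern; a test that waits this way and still fails points at the VM or the
cluster.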


## 3. Analysis Thoroughness and Required Evidence Structure

### Environment (MANDATORY — do this FIRST)

**STEP 1 — before any other analysis:** Open and read the file
`build-artifacts/run-info.json`. This is a JSON file containing version
and environment information. Extract **every** field that contains a
version, revision, image reference, or environment detail. Common keys
include (but are not limited to):

| JSON key | Label |
|---------------------|-----------------|
| `openshiftVersion` | OpenShift |
| `cnvVersion` | CNV |
| `bundleVersion` | Bundle |
| `kubevirtVersion` | KubeVirt |
| `cdiVersion` | CDI |
| `kubernetesVersion` | Kubernetes |
| `ocsVersion` | OCS |
| `networkType` | Network Type |
| `hcoImage` | HCO Image |
| `hcoIndexImage` | HCO Index Image |
| `testImage` | Test Image |

Log **every** key whose value is a version string, image reference
(`registry/...@sha256:...`), or environment identifier. Skip keys whose
values are HTML snippets or empty strings.
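
A minimal sketch of this extraction heuristic, assuming `run-info.json` is a
flat JSON object of string values; the regexes and the identifier-key
allowlist are illustrative, not a fixed schema:

```
import json
import re

VERSION_RE = re.compile(r"\d+\.\d+")            # matches "4.22.0-rc.2", "v1.65.0-2-g..."
IMAGE_RE = re.compile(r"^[\w.\-]+/\S+[@:]\S+")  # matches "registry/...@sha256:..."
IDENTIFIER_KEYS = {"networkType"}               # identifiers without version digits

def extract_environment(path="build-artifacts/run-info.json"):
    """Return every key whose value looks like a version, image reference,
    or environment identifier."""
    with open(path) as f:
        info = json.load(f)
    env = {}
    for key, value in info.items():
        if not isinstance(value, str) or not value.strip():
            continue  # skip empty and non-string values
        if value.lstrip().startswith("<"):
            continue  # skip HTML snippets
        if key in IDENTIFIER_KEYS or VERSION_RE.search(value) or IMAGE_RE.match(value):
            env[key] = value
    return env
```

Each extracted key/value pair then becomes one line in the `Environment:`
block (STEP 2) and one `artifacts_evidence` entry (STEP 3).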

**STEP 2:** Start the `details` field with an `Environment:` block listing
ALL extracted version and environment fields — do NOT filter by relevance:

```
Environment:
- OpenShift: 4.22.0-rc.2
- CNV: 4.22.0
- Bundle: v4.22.0.rhel9-149
- KubeVirt: v1.8.2-34-g9ff3b29bc2
- CDI: v1.65.0-2-ge83df1593
- Kubernetes: v1.35.3
- OCS: 4.22.0-70.stable
- Network Type: OVNKubernetes
- HCO Image: registry.redhat.io/...@sha256:...
- Test Image: quay.io/openshift-cnv/...@sha256:...

Root Cause:
The test failure is caused by...
```

If `run-info.json` is missing or a field is absent, search other artifacts
for version evidence: console logs, must-gather output, CSV names, operator
pod image tags, or log output under `build-artifacts/`.
If a version still cannot be determined, mark as `unknown`.
Do NOT skip this step — the environment block MUST appear in every `details` field.

**STEP 3:** In the `artifacts_evidence` field, ALWAYS include ALL version
and environment lines from `run-info.json` as the first entries — one line
per field, matching every key you extracted in STEP 1:

```
[build-artifacts/run-info.json]: "openshiftVersion":"4.22.0-rc.2"
[build-artifacts/run-info.json]: "cnvVersion":"4.22.0"
[build-artifacts/run-info.json]: "bundleVersion":"v4.22.0.rhel9-149"
[build-artifacts/run-info.json]: "kubevirtVersion":"v1.8.2-34-g9ff3b29bc2"
[build-artifacts/run-info.json]: "cdiVersion":"v1.65.0-2-ge83df1593"
[build-artifacts/run-info.json]: "kubernetesVersion":"v1.35.3"
[build-artifacts/run-info.json]: "ocsVersion":"4.22.0-70.stable"
[build-artifacts/run-info.json]: "networkType":"OVNKubernetes"
[build-artifacts/run-info.json]: "hcoImage":"registry.redhat.io/..."
[build-artifacts/run-info.json]: "testImage":"quay.io/openshift-cnv/..."
```

Do NOT omit any version or image field. Then continue with the
failure-specific evidence lines.

**CRITICAL: Never dismiss or skip warnings, conditions, or errors found in the data.**
Every warning, condition entry, and error message in VirtualMachine, VMI, DataVolume,
Migration, and related resource status MUST be evaluated as a potential contributing
factor.

- NodeNetworkConfigurationPolicy status:
`oc get nncp -o yaml`

### For `INFRASTRUCTURE`, suggest collecting

- Cluster node status:
`oc get nodes`

When a failure involves a dependent operator, determine ownership:

- Operator missing or not installed → **`INFRASTRUCTURE`** (test prerequisite not met)
- Operator installed but malfunctioning independently → file against that operator, not CNV
- CNV sends invalid configuration to the operator → **PRODUCT BUG** against CNV
- Operator works correctly but CNV misinterprets its status → **PRODUCT BUG** against CNV

## 7. Additional Version Sources

If `build-artifacts/run-info.json` does not contain a needed component version,
check `build-artifacts/` for version evidence (CSV names, operator pod image
tags, log output). If not found there, check `additional_repos` for the
component's source repo context.
