JOB_INSIGHT_FAILURE_HISTORY_ANALYSIS_PROMPT.md

<!-- To be used with https://github.com/myk-org/rootcoz
Complements the main analysis prompt — history-aware classification.
-->

# Pre-Classification Check: Did a Previous Test Break the Cluster?

## MANDATORY: Before classifying any failure, answer this question first:

**Did an earlier test in this job run modify cluster resources and fail to clean them up?**

If yes → the current failure is likely a side effect of that earlier test's failed
teardown, not an independent issue.

## How to Check

1. **Scan the console log for teardown failures BEFORE the current test.**
   Look for (see the sketch after this list):
- `TimeoutExpiredError` during teardown or fixture cleanup
- `ERROR` in teardown/finalizer of a preceding test
- `teardown_module`, `teardown_class`, or fixture `yield` cleanup failures

2. **Identify what the failed teardown was supposed to revert.**
Look for any test that patches, modifies, or reconfigures cluster-scoped resources:
operators, subscriptions, CRDs, node labels/taints, network configurations, or
cluster-level CRs (HyperConverged, KubeVirt, NetworkAddonsConfig, etc.).
If the teardown of such a test failed, the cluster may be left in a modified state.

3. **Check if the current failure matches the expected impact.**
For example: pods stuck in Pending, operators degraded, nodes not schedulable,
feature gates in wrong state, network policies broken — any symptom consistent
with the resource that was not reverted.
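
The step 1 scan can be mechanized. A minimal sketch, assuming a plain-text
console log; the marker strings and test-name matching are illustrative and
may need adjusting to the actual log format:

```
# Illustrative teardown-failure markers; adjust to the real log phrasing.
TEARDOWN_MARKERS = (
    "TimeoutExpiredError",
    "ERROR at teardown",
    "teardown_module",
    "teardown_class",
)

def teardown_failures_before(console_log: str, current_test: str) -> list[str]:
    """Return suspicious lines that appear BEFORE the first mention of the
    currently analyzed test."""
    lines = console_log.splitlines()
    for i, line in enumerate(lines):
        if current_test in line:
            lines = lines[:i]  # ignore everything from the current test onward
            break
    return [line for line in lines if any(m in line for m in TEARDOWN_MARKERS)]
```

Any non-empty result means the history-aware check applies.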

## Classification

- **The test whose teardown failed**: Classify based on why its teardown failed —
CODE ISSUE if the cleanup logic is wrong, PRODUCT BUG if the product blocked
the revert, INFRASTRUCTURE if an environmental issue (node outage, storage
failure, etc.) prevented cleanup.
- **All other tests that failed after it**: Use the **same classification** as the
  root-cause test. In the reason, state: "Caused by [test_name] teardown failure —
  [resource] was not reverted." (A propagation sketch follows this list.)
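
The cascade rule can be made concrete. A minimal propagation sketch; the data
shapes and helper name are illustrative, not the real report schema:

```
def propagate_classification(failures, root_test, root_class, resource):
    """Apply the cascade rule: every failure after the root-cause test
    inherits its classification and a reason pointing back at it."""
    results = {}
    seen_root = False
    for test in failures:  # failures must be in execution order
        if test == root_test:
            seen_root = True
            results[test] = (root_class, "root cause: its own teardown failed")
        elif seen_root:
            results[test] = (
                root_class,
                f"Caused by {root_test} teardown failure — {resource} was not reverted.",
            )
        else:
            results[test] = (None, "classify independently")
    return results
```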

## When This Check Does NOT Apply

- The current test is the **first failure** in the run
- No teardown errors appear before the current test in the console log
- The failure has a clearly independent root cause (e.g., wrong assertion value,
import error, syntax error)
- The failure occurs during **pytest collection** (e.g., `ModuleNotFoundError`,
`SyntaxError`, missing fixture) — collection happens before any test runs
- The cluster was already broken **before any test ran** (e.g., deployment failure,
cluster not provisioned, operators not installed)
- The failure is in a completely **unrelated area** to what the previous test modified
(e.g., previous test changed storage config, current test fails on CPU topology)
- The same failure pattern appears in **previous job runs** where no teardown failure
preceded it — this indicates a recurring issue independent of teardown cascades

In these cases, proceed with normal classification rules.
JOB_INSIGHT_PROMPT.md

## 2. Decision Procedure and Classification Rules

Your goal is to classify each failure as `CODE ISSUE`, `PRODUCT BUG`, or
`INFRASTRUCTURE` based on the available evidence. Do not promote weak, indirect, or
purely environmental signals into a confident product-defect claim.

**Allowed classification values:**

- `CODE ISSUE` - Test framework, test code, or test-owned configuration problem
- `PRODUCT BUG` - An actual defect in KubeVirt, CDI, HCO, a related product
  component, or a dependent operator
- `INFRASTRUCTURE` - Environmental blocker, lab/cluster infrastructure failure, or
external dependency outage — not a code or product defect

**Allowed confidence levels:**

- `high` - Direct causal evidence (e.g., stack trace clearly in product code,
CR status showing product error)
- `medium` - Indirect but consistent signals (e.g., pattern matches known
product issue, but logs incomplete)
- `low` - Contradictory signals or missing direct cause

### Required Decision Order

5. **Separate test-owned, product-owned, and environment-owned problems.**
Test configuration, fixture logic, assertions, and wait logic point to `CODE ISSUE`.
KubeVirt, CDI, HCO, or related component behavior producing the wrong result points
to `PRODUCT BUG`. Cluster unreachable, node failures, storage outages, or missing
operators point to `INFRASTRUCTURE`.
6. **Assign confidence based on evidence strength.**
High confidence requires a direct causal signal. Medium confidence fits consistent
but incomplete evidence. Low confidence fits contradictory signals or missing
direct cause. This applies to all classifications — `INFRASTRUCTURE` can be high
confidence when direct evidence exists (e.g., nodes `NotReady`, storage backend
unreachable).

### Expected-Failure and Derivative-Failure Handling

Indicators:

- Failures caused by dependent operators (see Section 6) behaving incorrectly with valid
CNV configuration. Distinguish between: the dependent operator itself is broken
(file against that operator, not CNV), CNV misconfigures the dependent operator
(`PRODUCT BUG` against CNV), or the operator is missing/not installed (`INFRASTRUCTURE`)

### INFRASTRUCTURE - Environmental Blockers and Ambiguous Cases

Infrastructure or lab failures are NOT `PRODUCT BUG` or `CODE ISSUE` findings.
Classify them as `INFRASTRUCTURE`.

Common environmental blockers:

- Node hardware failure, IPMI issues, or SR-IOV card malfunction
- Remote cluster mismatch or unavailable remote cluster
- Insufficient cluster resources (CPU, memory, storage) for test requirements
- Container runtime failures (e.g., OCI hook errors) affecting all pods on a node
- Operator missing or not installed (test prerequisite not met)

Guidance:

- Classify pure environmental blockers as `INFRASTRUCTURE`.
- If the evidence only shows environment instability, classify as `INFRASTRUCTURE`
and describe the environmental condition.
- Quarantined tests (`@pytest.mark.jira(..., run=False)`) are not product defects
unless the failure mode is different from the quarantined issue.

or the API rejected a valid one (`PRODUCT BUG`).
- `ResourceNotFoundError` or `NotFoundError` - Determine whether the resource was
never created (fixture issue = `CODE ISSUE`), was garbage-collected unexpectedly
(`PRODUCT BUG`), or the namespace was cleaned up (`INFRASTRUCTURE`).

Pattern guidance:

- **VM lifecycle timeout:** Too-low timeout or wrong wait target is `CODE ISSUE`; a
real stall with healthy inputs and controllers is `PRODUCT BUG`; an API or node
outage is `INFRASTRUCTURE`
- **Live migration failure:** Wrong migration policy, anti-affinity, or insufficient
target node resources in test setup is `CODE ISSUE`; valid configuration plus
`virt-controller` or `virt-handler` failure is `PRODUCT BUG`; node drain or
network partition is `INFRASTRUCTURE`
- **SSH connectivity failure:** Read the test code to determine how SSH is used.
  Wrong credentials, missing `virtctl` binary, no retry logic, or missing
  `wait_for_ssh_connectivity()` before running commands is `CODE ISSUE` (see the
  connectivity-wait sketch after this list). VM network misconfiguration after
  migration or snapshot restore where the test correctly waits and retries is
  `PRODUCT BUG`. Cluster network outage is `INFRASTRUCTURE`.
- **DataVolume/CDI failure:** Wrong source URL, bad storage class reference, or
insufficient PVC size in test is `CODE ISSUE`; valid import/upload/clone rejected
or stuck by CDI controller is `PRODUCT BUG`; storage backend outage is `INFRASTRUCTURE`
- **Post-operation validation:** Wrong expected values or stale assertions are
`CODE ISSUE`; a VM with wrong CPU topology, missing disks, or broken networking
after a valid operation is `PRODUCT BUG`
`PRODUCT BUG`
- **Resource cleanup failure:** Missing cleanup or bad fixture ownership is `CODE ISSUE`;
product finalizer or controller cleanup failure is `PRODUCT BUG`; namespace or cluster
cleanup blocked by infrastructure is `INFRASTRUCTURE`
- **Console access failure:** Wrong `pexpect` patterns or timeouts in test is
`CODE ISSUE`; `virtctl console` unable to connect to a healthy VMI is `PRODUCT BUG`
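
For the SSH item above, the decisive signal is whether the test waits and
retries before asserting. A minimal, self-contained sketch of that
wait-and-retry behavior; the repo's real `wait_for_ssh_connectivity()` helper
will differ in signature and transport, so treat this as illustrative:

```
import socket
import time

def wait_for_ssh_connectivity(host: str, port: int = 22, timeout: int = 300) -> None:
    """Poll until the VM's SSH port accepts TCP connections, or time out.
    Illustrative stand-in for the test framework's helper of the same name."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return  # port reachable; SSH is considered up
        except OSError:
            time.sleep(5)  # transient failure: retry until the deadline
    raise TimeoutError(f"SSH to {host}:{port} not reachable after {timeout}s")
```

A test that issues SSH commands without a wait like this is the `CODE ISSUE`
pattern; a test that waits this way and still fails points at the VM or the
cluster.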


## 3. Analysis Thoroughness and Required Evidence Structure

### Environment (MANDATORY — do this FIRST)

**STEP 1 — before any other analysis:** Open and read the file
`build-artifacts/run-info.json`. This is a JSON file containing version
and environment information. Extract **every** field that contains a
version, revision, image reference, or environment detail. Common keys
include (but are not limited to):

| JSON key | Label |
|---------------------|-----------------|
| `openshiftVersion` | OpenShift |
| `cnvVersion` | CNV |
| `bundleVersion` | Bundle |
| `kubevirtVersion` | KubeVirt |
| `cdiVersion` | CDI |
| `kubernetesVersion` | Kubernetes |
| `ocsVersion` | OCS |
| `networkType` | Network Type |
| `hcoImage` | HCO Image |
| `hcoIndexImage` | HCO Index Image |
| `testImage` | Test Image |

Log **every** key whose value is a version string, image reference
(`registry/...@sha256:...`), or environment identifier. Skip keys whose
values are HTML snippets or empty strings.
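
A minimal sketch of this extraction heuristic, assuming `run-info.json` is a
flat JSON object of string values; the regexes and the identifier-key
allowlist are illustrative, not a fixed schema:

```
import json
import re

VERSION_RE = re.compile(r"\d+\.\d+")            # matches "4.22.0-rc.2", "v1.65.0-2-g..."
IMAGE_RE = re.compile(r"^[\w.\-]+/\S+[@:]\S+")  # matches "registry/...@sha256:..."
IDENTIFIER_KEYS = {"networkType"}               # identifiers without version digits

def extract_environment(path="build-artifacts/run-info.json"):
    """Return every key whose value looks like a version, image reference,
    or environment identifier."""
    with open(path) as f:
        info = json.load(f)
    env = {}
    for key, value in info.items():
        if not isinstance(value, str) or not value.strip():
            continue  # skip empty and non-string values
        if value.lstrip().startswith("<"):
            continue  # skip HTML snippets
        if key in IDENTIFIER_KEYS or VERSION_RE.search(value) or IMAGE_RE.match(value):
            env[key] = value
    return env
```

Each extracted key/value pair then becomes one line in the `Environment:`
block (STEP 2) and one `artifacts_evidence` entry (STEP 3).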

**STEP 2:** Start the `details` field with an `Environment:` block listing
ALL extracted version and environment fields — do NOT filter by relevance:

```
Environment:
- OpenShift: 4.22.0-rc.2
- CNV: 4.22.0
- Bundle: v4.22.0.rhel9-149
- KubeVirt: v1.8.2-34-g9ff3b29bc2
- CDI: v1.65.0-2-ge83df1593
- Kubernetes: v1.35.3
- OCS: 4.22.0-70.stable
- Network Type: OVNKubernetes
- HCO Image: registry.redhat.io/...@sha256:...
- Test Image: quay.io/openshift-cnv/...@sha256:...

Root Cause:
The test failure is caused by...
```

If `run-info.json` is missing or a field is absent, search other artifacts
for version evidence: console logs, must-gather output, CSV names, operator
pod image tags, or log output under `build-artifacts/`.
If a version still cannot be determined, mark as `unknown`.
Do NOT skip this step — the environment block MUST appear in every `details` field.

**STEP 3:** In the `artifacts_evidence` field, ALWAYS include ALL version
and environment lines from `run-info.json` as the first entries — one line
per field, matching every key you extracted in STEP 1:

```
[build-artifacts/run-info.json]: "openshiftVersion":"4.22.0-rc.2"
[build-artifacts/run-info.json]: "cnvVersion":"4.22.0"
[build-artifacts/run-info.json]: "bundleVersion":"v4.22.0.rhel9-149"
[build-artifacts/run-info.json]: "kubevirtVersion":"v1.8.2-34-g9ff3b29bc2"
[build-artifacts/run-info.json]: "cdiVersion":"v1.65.0-2-ge83df1593"
[build-artifacts/run-info.json]: "kubernetesVersion":"v1.35.3"
[build-artifacts/run-info.json]: "ocsVersion":"4.22.0-70.stable"
[build-artifacts/run-info.json]: "networkType":"OVNKubernetes"
[build-artifacts/run-info.json]: "hcoImage":"registry.redhat.io/..."
[build-artifacts/run-info.json]: "testImage":"quay.io/openshift-cnv/..."
```

Do NOT omit any version or image field. Then continue with the
failure-specific evidence lines.

**CRITICAL: Never dismiss or skip warnings, conditions, or errors found in the data.**
Every warning, condition entry, and error message in VirtualMachine, VMI, DataVolume,
Migration, and related resource status MUST be evaluated as a potential contributing
factor.

- NodeNetworkConfigurationPolicy status:
`oc get nncp -o yaml`

### For `INFRASTRUCTURE`, suggest collecting

- Cluster node status:
`oc get nodes`

When a failure involves a dependent operator, determine ownership:

- Operator missing or not installed → **`INFRASTRUCTURE`** (test prerequisite not met)
- Operator installed but malfunctioning independently → file against that operator, not CNV
- CNV sends invalid configuration to the operator → **PRODUCT BUG** against CNV
- Operator works correctly but CNV misinterprets its status → **PRODUCT BUG** against CNV

## 7. Additional Version Sources

If `build-artifacts/run-info.json` does not contain a needed component version,
check `build-artifacts/` for version evidence (CSV names, operator pod image
tags, log output). If not found there, check `additional_repos` for the
component's source repo context.
