diff --git a/JOB_INSIGHT_FAILURE_HISTORY_ANALYSIS_PROMPT.md b/JOB_INSIGHT_FAILURE_HISTORY_ANALYSIS_PROMPT.md new file mode 100644 index 0000000000..c9f355d513 --- /dev/null +++ b/JOB_INSIGHT_FAILURE_HISTORY_ANALYSIS_PROMPT.md @@ -0,0 +1,58 @@ + + +# Pre-Classification Check: Did a Previous Test Break the Cluster? + +## MANDATORY: Before classifying any failure, answer this question first: + +**Did an earlier test in this job run modify cluster resources and fail to clean them up?** + +If yes → the current failure is likely a side effect of that earlier test's failed +teardown, not an independent issue. + +## How to Check + +1. **Scan the console log for teardown failures BEFORE the current test.** + Look for: + - `TimeoutExpiredError` during teardown or fixture cleanup + - `ERROR` in teardown/finalizer of a preceding test + - `teardown_module`, `teardown_class`, or fixture `yield` cleanup failures + +2. **Identify what the failed teardown was supposed to revert.** + Look for any test that patches, modifies, or reconfigures cluster-scoped resources: + operators, subscriptions, CRDs, node labels/taints, network configurations, or + cluster-level CRs (HyperConverged, KubeVirt, NetworkAddonsConfig, etc.). + If the teardown of such a test failed, the cluster may be left in a modified state. + +3. **Check if the current failure matches the expected impact.** + For example: pods stuck in Pending, operators degraded, nodes not schedulable, + feature gates in wrong state, network policies broken — any symptom consistent + with the resource that was not reverted. + +## Classification + +- **The test whose teardown failed**: Classify based on why its teardown failed — + CODE ISSUE if the cleanup logic is wrong, PRODUCT BUG if the product blocked + the revert, INFRASTRUCTURE if an environmental issue (node outage, storage + failure, etc.) prevented cleanup. +- **All other tests that failed after it**: Use the **same classification** as the + root-cause test. 
In the reason, state: "Caused by [test_name] teardown failure —
+  [resource] was not reverted."
+
+## When This Check Does NOT Apply
+
+- The current test is the **first failure** in the run
+- No teardown errors appear before the current test in the console log
+- The failure has a clearly independent root cause (e.g., wrong assertion value,
+  import error, syntax error)
+- The failure occurs during **pytest collection** (e.g., `ModuleNotFoundError`,
+  `SyntaxError`, missing fixture) — collection happens before any test runs
+- The cluster was already broken **before any test ran** (e.g., deployment failure,
+  cluster not provisioned, operators not installed)
+- The failure is in an area completely **unrelated** to what the previous test modified
+  (e.g., previous test changed storage config, current test fails on CPU topology)
+- The same failure pattern appears in **previous job runs** where no teardown failure
+  preceded it — this indicates a recurring issue independent of teardown cascades
+
+In these cases, proceed with normal classification rules.
diff --git a/JOB_INSIGHT_PROMPT.md b/JOB_INSIGHT_PROMPT.md
index 51c79e61fb..a366ba6281
--- a/JOB_INSIGHT_PROMPT.md
+++ b/JOB_INSIGHT_PROMPT.md
@@ -41,14 +41,16 @@ tier marker are considered tier2 by CI job selection.
 
 ## 2. Decision Procedure and Classification Rules
 
-Your goal is to classify each failure as `CODE ISSUE` or `PRODUCT BUG` only when the
-available evidence supports that conclusion. Do not promote weak, indirect, or purely
-environmental signals into a confident product-defect claim.
+Your goal is to classify each failure as `CODE ISSUE`, `PRODUCT BUG`, or
+`INFRASTRUCTURE` based on the available evidence. Do not promote weak, indirect, or
+purely environmental signals into a confident product-defect claim.
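As a rough illustration only, the decision rules in this section can be sketched as a guard against over-claiming. This is a minimal hypothetical sketch, not part of any real schema or tooling — the evidence field names are invented placeholders:

```python
def classify(evidence: dict) -> tuple[str, str]:
    """Return (classification, confidence) following the decision order:
    environment first, then teardown cascade, then test-vs-product
    attribution. Evidence keys here are hypothetical placeholders."""
    # Direct environmental signals support INFRASTRUCTURE, even at
    # high confidence (e.g., nodes NotReady, storage backend down).
    if evidence.get("nodes_not_ready") or evidence.get("storage_unreachable"):
        return ("INFRASTRUCTURE", "high")
    # A failed teardown earlier in the run makes this a derivative
    # failure: inherit the root-cause test's classification.
    if evidence.get("earlier_teardown_failed"):
        return (evidence["root_cause_classification"], "medium")
    # Test-owned problems: assertions, fixtures, wait logic.
    if evidence.get("wrong_assertion") or evidence.get("fixture_bug"):
        return ("CODE ISSUE", "high")
    # Product-owned problems: wrong result from valid configuration.
    if evidence.get("controller_error_with_valid_config"):
        return ("PRODUCT BUG", "high")
    # Weak or contradictory signals: never promote them into a
    # confident product-defect claim; keep confidence low.
    return ("CODE ISSUE", "low")
```

The point of the sketch is the ordering: environmental and derivative causes are ruled out before any confident `CODE ISSUE` or `PRODUCT BUG` claim is made.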
**Allowed classification values:** - `CODE ISSUE` - Test framework, test code, or test-owned configuration problem - `PRODUCT BUG` - Actual KubeVirt, CDI, HCO, or related product, or dependent operator defect +- `INFRASTRUCTURE` - Environmental blocker, lab/cluster infrastructure failure, or + external dependency outage — not a code or product defect **Allowed confidence levels:** @@ -56,7 +58,7 @@ environmental signals into a confident product-defect claim. CR status showing product error) - `medium` - Indirect but consistent signals (e.g., pattern matches known product issue, but logs incomplete) -- `low` - Environmental blockers, contradictory signals, or missing direct cause +- `low` - Contradictory signals or missing direct cause ### Required Decision Order @@ -81,11 +83,14 @@ environmental signals into a confident product-defect claim. 5. **Separate test-owned, product-owned, and environment-owned problems.** Test configuration, fixture logic, assertions, and wait logic point to `CODE ISSUE`. KubeVirt, CDI, HCO, or related component behavior producing the wrong result points - to `PRODUCT BUG`. + to `PRODUCT BUG`. Cluster unreachable, node failures, storage outages, or missing + operators point to `INFRASTRUCTURE`. 6. **Assign confidence based on evidence strength.** High confidence requires a direct causal signal. Medium confidence fits consistent - but incomplete evidence. Low confidence fits environmental blockers, contradictory - signals, or missing direct cause. + but incomplete evidence. Low confidence fits contradictory signals or missing + direct cause. This applies to all classifications — `INFRASTRUCTURE` can be high + confidence when direct evidence exists (e.g., nodes `NotReady`, storage backend + unreachable). ### Expected-Failure and Derivative-Failure Handling @@ -149,13 +154,12 @@ Indicators: - Failures caused by dependent operators (see Section 6) behaving incorrectly with valid CNV configuration. 
Distinguish between: the dependent operator itself is broken (file against that operator, not CNV), CNV misconfigures the dependent operator - (`PRODUCT BUG` against CNV), or the operator is missing/not installed (environmental) + (`PRODUCT BUG` against CNV), or the operator is missing/not installed (`INFRASTRUCTURE`) -### Environmental Blockers and Ambiguous Cases +### INFRASTRUCTURE - Environmental Blockers and Ambiguous Cases -Infrastructure or lab failures are NOT confirmed `PRODUCT BUG` findings. Treat them as -environmental blockers with low confidence unless there is direct evidence that a -product component caused the instability. +Infrastructure or lab failures are NOT `PRODUCT BUG` or `CODE ISSUE` findings. +Classify them as `INFRASTRUCTURE`. Common environmental blockers: @@ -165,15 +169,14 @@ Common environmental blockers: - Node hardware failure, IPMI issues, or SR-IOV card malfunction - Remote cluster mismatch or unavailable remote cluster - Insufficient cluster resources (CPU, memory, storage) for test requirements +- Container runtime failures (e.g., OCI hook errors) affecting all pods on a node +- Operator missing or not installed (test prerequisite not met) Guidance: -- Do NOT classify a pure environmental blocker as a confirmed `PRODUCT BUG`. -- If the evidence only shows environment instability, say so explicitly and keep - confidence low. -- If a binary label is required by the consuming system, make it explicit - that the issue is environmental and the binary label is only a fallback, not a - confirmed product-defect conclusion. +- Classify pure environmental blockers as `INFRASTRUCTURE`. +- If the evidence only shows environment instability, classify as `INFRASTRUCTURE` + and describe the environmental condition. - Quarantined tests (`@pytest.mark.jira(..., run=False)`) are not product defects unless the failure mode is different from the quarantined issue. @@ -199,25 +202,25 @@ themselves: or the API rejected a valid one (`PRODUCT BUG`). 
- `ResourceNotFoundError` or `NotFoundError` - Determine whether the resource was never created (fixture issue = `CODE ISSUE`), was garbage-collected unexpectedly - (`PRODUCT BUG`), or the namespace was cleaned up (environmental). + (`PRODUCT BUG`), or the namespace was cleaned up (`INFRASTRUCTURE`). Pattern guidance: - **VM lifecycle timeout:** Too-low timeout or wrong wait target is `CODE ISSUE`; a real stall with healthy inputs and controllers is `PRODUCT BUG`; an API or node - outage is environmental + outage is `INFRASTRUCTURE` - **Live migration failure:** Wrong migration policy, anti-affinity, or insufficient target node resources in test setup is `CODE ISSUE`; valid configuration plus `virt-controller` or `virt-handler` failure is `PRODUCT BUG`; node drain or - network partition is environmental + network partition is `INFRASTRUCTURE` - **SSH connectivity failure:** Read the test code to determine how SSH is used. Wrong credentials, missing `virtctl` binary, no retry logic, or missing `wait_for_ssh_connectivity()` before running commands is `CODE ISSUE`. VM network misconfiguration after migration or snapshot restore where the test - correctly waits and retries is `PRODUCT BUG`. Cluster network outage is environmental. + correctly waits and retries is `PRODUCT BUG`. Cluster network outage is `INFRASTRUCTURE`. 
- **DataVolume/CDI failure:** Wrong source URL, bad storage class reference, or insufficient PVC size in test is `CODE ISSUE`; valid import/upload/clone rejected - or stuck by CDI controller is `PRODUCT BUG`; storage backend outage is environmental + or stuck by CDI controller is `PRODUCT BUG`; storage backend outage is `INFRASTRUCTURE` - **Post-operation validation:** Wrong expected values or stale assertions are `CODE ISSUE`; a VM with wrong CPU topology, missing disks, or broken networking after a valid operation is `PRODUCT BUG` @@ -233,7 +236,7 @@ Pattern guidance: `PRODUCT BUG` - **Resource cleanup failure:** Missing cleanup or bad fixture ownership is `CODE ISSUE`; product finalizer or controller cleanup failure is `PRODUCT BUG`; namespace or cluster - cleanup blocked by infrastructure is environmental + cleanup blocked by infrastructure is `INFRASTRUCTURE` - **Console access failure:** Wrong `pexpect` patterns or timeouts in test is `CODE ISSUE`; `virtctl console` unable to connect to a healthy VMI is `PRODUCT BUG` @@ -274,6 +277,88 @@ When classifying `CODE ISSUE`, suggest a specific fix: ## 3. Analysis Thoroughness and Required Evidence Structure +### Environment (MANDATORY — do this FIRST) + +**STEP 1 — before any other analysis:** Open and read the file +`build-artifacts/run-info.json`. This is a JSON file containing version +and environment information. Extract **every** field that contains a +version, revision, image reference, or environment detail. 
Common keys
+include (but are not limited to):
+
+| JSON key | Label |
+|---------------------|-----------------|
+| `openshiftVersion` | OpenShift |
+| `cnvVersion` | CNV |
+| `bundleVersion` | Bundle |
+| `kubevirtVersion` | KubeVirt |
+| `cdiVersion` | CDI |
+| `kubernetesVersion` | Kubernetes |
+| `ocsVersion` | OCS |
+| `networkType` | Network Type |
+| `hcoImage` | HCO Image |
+| `hcoIndexImage` | HCO Index Image |
+| `testImage` | Test Image |
+
+Log **every** key whose value is a version string, image reference
+(`registry/...@sha256:...`), or environment identifier. Skip keys whose
+values are HTML snippets or empty strings.
+
+**STEP 2:** The `details` field MUST begin with EXACTLY this structure.
+The JSON schema says "detailed analysis" — the Environment block IS part
+of that analysis. Use this template:
+
+
+Environment:
+- OpenShift: 4.22.0-rc.2
+- CNV: 4.22.0
+- Bundle: v4.22.0.rhel9-149
+- KubeVirt: v1.8.2-34-g9ff3b29bc2
+- CDI: v1.65.0-2-ge83df1593
+- Kubernetes: v1.35.3
+- OCS: 4.22.0-70.stable
+- Network Type: OVNKubernetes
+- HCO Image: registry.redhat.io/...@sha256:...
+- Test Image: quay.io/openshift-cnv/...@sha256:...
+
+Root Cause:
+The test failure is caused by...
+
+
+If `run-info.json` is missing or a field is absent, search other artifacts
+for version evidence: console logs, must-gather output, CSV names, operator
+pod image tags, or log output under `build-artifacts/`.
+If a version still cannot be determined, mark as `unknown`.
+Do NOT skip this step — the environment block MUST appear in every `details` field.
+
+**STEP 3:** In the `artifacts_evidence` field, the first entries MUST be
+ALL version lines from `run-info.json` — one line per field. These version
+lines ARE evidence that supports the analysis (they establish the
+environment context).
Do NOT filter them for relevance — include ALL of +them: + +``` +[build-artifacts/run-info.json]: "openshiftVersion":"4.22.0-rc.2" +[build-artifacts/run-info.json]: "cnvVersion":"4.22.0" +[build-artifacts/run-info.json]: "bundleVersion":"v4.22.0.rhel9-149" +[build-artifacts/run-info.json]: "kubevirtVersion":"v1.8.2-34-g9ff3b29bc2" +[build-artifacts/run-info.json]: "cdiVersion":"v1.65.0-2-ge83df1593" +[build-artifacts/run-info.json]: "kubernetesVersion":"v1.35.3" +[build-artifacts/run-info.json]: "ocsVersion":"4.22.0-70.stable" +[build-artifacts/run-info.json]: "networkType":"OVNKubernetes" +[build-artifacts/run-info.json]: "hcoImage":"registry.redhat.io/..." +[build-artifacts/run-info.json]: "testImage":"quay.io/openshift-cnv/..." +``` + +Do NOT omit any version or image field. Then continue with the +failure-specific evidence lines. + +### Self-Verification (MANDATORY) + +Before submitting your JSON response, verify: +1. Does `details` start with "Environment:" on the first line? If NO, fix it. +2. Does `artifacts_evidence` contain at least 5 separate `[build-artifacts/run-info.json]` lines? If NO, fix it. +3. Are version lines in `artifacts_evidence` each on their own line (not combined)? If NO, fix it. + **CRITICAL: Never dismiss or skip warnings, conditions, or errors found in the data.** Every warning, condition entry, and error message in VirtualMachine, VMI, DataVolume, Migration, and related resource status MUST be evaluated as a potential contributing @@ -371,7 +456,7 @@ never completed. Investigate it explicitly before ruling it out. 
- NodeNetworkConfigurationPolicy status: `oc get nncp -o yaml` -### For environmental blockers, suggest collecting +### For `INFRASTRUCTURE`, suggest collecting - Cluster node status: `oc get nodes` @@ -524,11 +609,18 @@ Key product and runtime components to inspect: When a failure involves a dependent operator, determine ownership: -- Operator missing or not installed → **environmental** (test prerequisite not met) +- Operator missing or not installed → **`INFRASTRUCTURE`** (test prerequisite not met) - Operator installed but malfunctioning independently → file against that operator, not CNV - CNV sends invalid configuration to the operator → **PRODUCT BUG** against CNV - Operator works correctly but CNV misinterprets its status → **PRODUCT BUG** against CNV +## 7. Additional Version Sources + +If `build-artifacts/run-info.json` does not contain a needed component version, +check `build-artifacts/` for version evidence (CSV names, operator pod image +tags, log output). If not found there, check `additional_repos` for the +component's source repo context. + [ocp-virt-doc]: https://docs.redhat.com/en/documentation/red_hat_openshift_virtualization/ [kubevirt-repo]: https://github.com/kubevirt/kubevirt [cdi-repo]: https://github.com/kubevirt/containerized-data-importer
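The Environment extraction steps in Section 3 (STEPs 1–3) can be sketched as follows. This is an illustrative helper only, assuming the `run-info.json` keys listed in the Section 3 table; the key map is deliberately truncated and the function names are hypothetical, not part of the analysis tooling:

```python
import json  # used in the commented example below

# Mirrors part of the JSON-key-to-label table in Section 3 (truncated).
KEY_LABELS = {
    "openshiftVersion": "OpenShift",
    "cnvVersion": "CNV",
    "bundleVersion": "Bundle",
    "kubevirtVersion": "KubeVirt",
    "cdiVersion": "CDI",
    "kubernetesVersion": "Kubernetes",
}

def environment_block(run_info: dict) -> str:
    """STEP 2: render the Environment block that must open `details`."""
    lines = ["Environment:"]
    for key, label in KEY_LABELS.items():
        value = run_info.get(key, "")
        # Skip HTML snippets and empty strings (STEP 1); fall back to
        # `unknown` when a version cannot be determined (STEP 2).
        if not value or value.lstrip().startswith("<"):
            value = "unknown"
        lines.append(f"- {label}: {value}")
    return "\n".join(lines)

def evidence_lines(run_info: dict) -> list[str]:
    """STEP 3: one [build-artifacts/run-info.json] line per present field."""
    return [
        f'[build-artifacts/run-info.json]: "{key}":"{run_info[key]}"'
        for key in KEY_LABELS
        if run_info.get(key)
    ]

# Typical use (path as named in STEP 1):
# with open("build-artifacts/run-info.json") as f:
#     print(environment_block(json.load(f)))
```

The sketch makes the contract concrete: every mapped field appears in the Environment block (with `unknown` as the fallback), while only fields actually present in `run-info.json` become evidence lines.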