WIP [INTEROP-9134] Add spoke cluster pre/post-upgrade OCP health check steps for ACM Virt Scenario by sg-rh · Pull Request #81164 · openshift/release

sg-rh · 2026-06-26T19:01:55Z

Summary

Add acm-interop-p2p-spoke-upgrade-prehealthcheck and acm-interop-p2p-spoke-upgrade-healthcheck step-registry steps
Both run the same 4 OCP platform health checks as cucushift-upgrade-prehealthcheck (MCP, ClusterOperators, nodes, pods) but target ACM managed spoke clusters instead of the single CI hub
Spokes are discovered automatically from ManagedCluster resources on the hub (excluding local-cluster), or via an explicit comma-separated env var override
Admin kubeconfigs are resolved from Hive ClusterDeployment secrets, with a fast path reusing ${SHARED_DIR}/managed-cluster-kubeconfig when the spoke name matches
On failure, per-spoke diagnostics (ClusterVersion, unhealthy MCP/CO/node describe, MCO pods) are written to ${ARTIFACT_DIR}

Summary by CodeRabbit

This PR adds new ACM interop P2P CI step-registry entries for spoke cluster upgrade pre-healthchecks and post-healthchecks. The new steps are designed to validate ACM-managed spoke clusters in OpenShift CI by discovering spokes from hub-side ManagedCluster resources or optional environment overrides, resolving per-spoke kubeconfigs from Hive/CI artifacts, and running the same core OCP health checks used in existing upgrade validation flows.

Practically, this expands OpenShift CI coverage for ACM Virt scenarios with:

new pre-upgrade and post-upgrade spoke healthcheck step definitions
Bash command scripts that run MCP, cluster operator, node, and pod checks against each spoke
failure diagnostics written per spoke to CI artifacts for troubleshooting
accompanying README and ownership metadata for the new step-registry entries

…erop

coderabbitai · 2026-06-26T19:02:20Z

Walkthrough

Adds new ACM interop P2P pre- and post-upgrade spoke healthcheck step registry entries, metadata, ownership files, README docs, and Bash command scripts. The scripts discover spokes, resolve per-spoke kubeconfigs, run MCP, clusteroperator, node, and pod checks, and write failure artifacts.

Changes

Pre-upgrade Spoke Healthcheck

Layer / File(s)	Summary
Step registry and docs `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/*`	Adds the pre-upgrade step registry entry, ownership metadata, and README content.
Bootstrap and spoke selection `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`	Validates the hub kubeconfig, records failure diagnostics, discovers spokes from the env var or `managedcluster`, and writes per-spoke kubeconfigs from cache or `ClusterDeployment` secrets.
Per-spoke runner and clusteroperator checks `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`	Defines the retrying `oc` wrapper, clusteroperator validation, the consecutive-success wait loop, and the per-spoke runner that starts the spoke check sequence.
MCP stability `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`	Checks MachineConfigPool update and degradation states, then waits for consecutive stable MCP results within the computed timeout budget.
Node readiness and main flow `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`	Checks node readiness and pod status, optionally sources proxy settings, sets hub and spoke `KUBECONFIG` values, and records failed spokes.

Post-upgrade Spoke Healthcheck

Layer / File(s)	Summary
Step registry and docs `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/*`	Adds the post-upgrade step registry entry, ownership metadata, and README content.
Bootstrap and spoke selection `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`	Validates the hub kubeconfig, records failure diagnostics, discovers spokes, and resolves per-spoke kubeconfigs.
Per-spoke runner and clusteroperator checks `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`	Defines the retrying `oc` wrapper, clusteroperator validation, the consecutive-success wait loop, and the per-spoke prehealthcheck runner.
MCP stability `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`	Checks MachineConfigPool update and degradation states, then waits for consecutive stable MCP results within the computed timeout budget.
Node readiness and main flow `ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`	Checks node readiness and pod status, optionally sources proxy settings, sets hub and spoke `KUBECONFIG` values, and records failed spokes.

Sequence Diagram(s)

Pre-upgrade flow

sequenceDiagram
  participant "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"
  participant DiscoverSpokeClusters
  participant ExtractSpokeKubeconfig
  participant RunSpokeHealthcheck
  participant oc
  "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>DiscoverSpokeClusters: collect spoke names
  "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>ExtractSpokeKubeconfig: resolve spoke kubeconfig
  ExtractSpokeKubeconfig->>oc: read ClusterDeployment and Secret
  "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>RunSpokeHealthcheck: run spoke checks
  RunSpokeHealthcheck->>oc: check MCPs, clusteroperators, nodes, pods

Post-upgrade flow

sequenceDiagram
  participant "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"
  participant DiscoverSpokeClusters
  participant ExtractSpokeKubeconfig
  participant RunSpokeHealthcheck
  participant oc
  "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>DiscoverSpokeClusters: collect spoke names
  "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>ExtractSpokeKubeconfig: resolve spoke kubeconfig
  ExtractSpokeKubeconfig->>oc: read ClusterDeployment and Secret
  "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>RunSpokeHealthcheck: run spoke checks
  RunSpokeHealthcheck->>oc: check MCPs, clusteroperators, nodes, pods

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

jira/valid-reference

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
No-Sensitive-Data-In-Logs	❌ Error	Both new step scripts use global `set -x` and print `oc get pods --all-namespaces`/`oc get node -o wide`, plus source `proxy-conf.sh`, which can leak cluster metadata and creds to logs.	Drop global xtrace; keep tracing off by default, and move/redact node/pod dumps and sourced proxy config so only non-sensitive details reach logs.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (13 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the addition of ACM spoke cluster pre/post-upgrade OCP health check steps.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	The PR only adds step-registry YAML, docs, OWNERS, and shell scripts; no Ginkgo It/Describe/Context/When titles are present.
Test Structure And Quality	✅ Passed	PR only changes step-registry YAML, README, OWNERS, and shell scripts; no Ginkgo test code is touched, so the test-quality check is not applicable.
Microshift Test Compatibility	✅ Passed	No Ginkgo tests were added: the touched paths contain only step-registry YAML/SH/MD assets, and no It/Describe/Context blocks were found.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR only adds step-registry shell/YAML/README/OWNERS files, not Ginkgo e2e tests, so the SNO-specific check doesn’t apply.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR only adds step-registry refs, docs, and Bash healthcheck scripts; I found no affinity, nodeSelector, spread, or replica-topology assumptions.
Ote Binary Stdout Contract	✅ Passed	PR only adds step-registry shell/YAML files; no OTE binary or process-level Go/TestMain/setup code was changed, so the stdout contract doesn’t apply.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PASS: PR adds step-registry bash/YAML docs, not new Ginkgo e2e tests; no IPv4-only parsing or external connectivity in touched files.
No-Weak-Crypto	✅ Passed	Touched files only add kubeconfig handling and health checks; no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret/token comparisons found.
Container-Privileges	✅ Passed	The new step-registry YAMLs add only image, resources, env, and docs; no privileged/host/allowPrivilegeEscalation/securityContext settings appear in the changed manifests.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

openshift-ci · 2026-06-26T19:02:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sg-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/acm/interop-p2p/OWNERS~~ [sg-rh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot · 2026-06-26T19:05:11Z

[REHEARSALNOTIFIER]
@sg-rh: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh (1)
7-7: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Drop xtrace from this step-registry script.

This step handles kubeconfig material, so keeping -x enabled by default adds avoidable trace noise around sensitive operations. As per coding guidelines, "Default to set -euo pipefail (without -x) in step registry scripts, and only enable -x when actively debugging to avoid tracing sensitive information."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`
at line 7, Remove xtrace from the step-registry script setup so it defaults to
safe shell options only; update the script’s initial shell flags in the
spoke-upgrade-prehealthcheck command script from the current `set -euxo
pipefail; shopt -s inherit_errexit` behavior to use `set -euo pipefail` without
`-x`, keeping `inherit_errexit` if still needed.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`:
- Line 7: The step script currently enables global xtrace via the shell options
setup, which can leak sensitive kubeconfig-related expansions into logs. Update
the script’s initial shell settings to default to non-tracing error handling
only, and keep tracing disabled unless you explicitly wrap a short,
non-sensitive debug block. Preserve the existing errexit/pipefail behavior and
the inherit_errexit setting while removing the global -x from the script
entrypoint.
- Around line 330-337: The unhealthy-MCP check in the machineconfigpools filter
is too loose because the grep pattern in the unhealthy_mcp detection and the
real-time oc get pipeline matches any line containing a 0, so
DEGRADEDMACHINECOUNT values like 10 or 20 are treated as healthy. Tighten the
filter used in the oc get machineconfigpools output so it only excludes rows
where the Updating and Degraded columns are False and DEGRADEDMACHINECOUNT is
exactly 0, using anchored whitespace-aware matching in the same unhealthy_mcp
and grep pipeline logic.
- Around line 160-188: RunSpokeHealthcheck currently lets failures in
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod slip through when called under if ! ..., so the function can still
return success after a bad healthcheck. Update RunSpokeHealthcheck to handle
each step explicitly: after every healthcheck call, immediately invoke
WriteSpokeHealthcheckFailureDiagnostics on failure and return 1 so the error
propagates. Keep the existing trap setup in sync with this flow, but do not rely
on EXIT trap cleanup alone to report or surface intermediate failures.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`:
- Around line 183-184: The `check_pod` step is only listing pods and always
succeeding, so unhealthy spoke workloads never fail the prehealthcheck. Update
`check_pod` in the spoke upgrade prehealthcheck commands to actively inspect pod
phases and waiting/terminated states (for example, pending, error, crashloop,
not ready) and return a non-zero status when any unhealthy pod is found. Make
sure the same validation logic is used for the later call sites that also invoke
`check_pod`, so the step name and behavior match the health-check intent.
- Around line 160-188: The RunSpokePrehealthcheck function can return success
even after a failed health check because failures from
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod are not being surfaced. Update the control flow so these checks are
executed in a way that preserves non-zero exit codes from the function body
instead of relying on a trailing true, and make sure
SpokePrehealthcheckFailureCleanup still runs on failure while the function
returns failure correctly.

---

Nitpick comments:
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`:
- Line 7: Remove xtrace from the step-registry script setup so it defaults to
safe shell options only; update the script’s initial shell flags in the
spoke-upgrade-prehealthcheck command script from the current `set -euxo
pipefail; shopt -s inherit_errexit` behavior to use `set -euo pipefail` without
`-x`, keeping `inherit_errexit` if still needed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bc6b18ff-0e41-4145-9a98-cb6672c7203c

📥 Commits

Reviewing files that changed from the base of the PR and between 53106f6 and 8bfc0d2.

📒 Files selected for processing (10)

ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/OWNERS
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/README.md
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-ref.metadata.json
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-ref.yaml
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/OWNERS
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/README.md
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-ref.metadata.json
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-ref.yaml

coderabbitai · 2026-06-26T19:12:09Z

+# Health check logic mirrors cucushift-upgrade-healthcheck-commands.sh
+# (ci-operator/step-registry/cucushift/upgrade/healthcheck/) with spoke-specific MCP wait tuning.
+#
+set -euxo pipefail; shopt -s inherit_errexit


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Drop global xtrace for this step script.

This step reads admin kubeconfigs from hub secrets, so keeping -x enabled for the entire script violates the step-registry guidance and widens the chance of leaking sensitive command expansions into job logs. Default to set -euo pipefail here and only enable tracing around short non-sensitive debug sections if needed.

Suggested change

-set -euxo pipefail; shopt -s inherit_errexit +set -euo pipefail; shopt -s inherit_errexit

As per coding guidelines, "Default to set -euo pipefail (without -x) in step registry scripts," and based on learnings this script handles kubeconfig secret material.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

set -euxo pipefail; shopt -s inherit_errexit

set -euo pipefail; shopt -s inherit_errexit

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh` at line 7, The step script currently enables global xtrace via the shell options setup, which can leak sensitive kubeconfig-related expansions into logs. Update the script’s initial shell settings to default to non-tracing error handling only, and keep tracing disabled unless you explicitly wrap a short, non-sensitive debug block. Preserve the existing errexit/pipefail behavior and the inherit_errexit setting while removing the global -x from the script entrypoint.

Sources: Coding guidelines, Learnings

coderabbitai · 2026-06-26T19:12:09Z

+RunSpokeHealthcheck() {
+    typeset targetSpokeName="${1:?}"
+    typeset spokeKubeconfigPath="${2:?}"
+
+    spokeName="${targetSpokeName}"
+    export KUBECONFIG="${spokeKubeconfigPath}"
+    trap SpokeHealthcheckFailureCleanup EXIT
+
+    : "Post-upgrade health check for spoke '${spokeName}'"
+
+    OC="run_command_oc"
+
+    oc get machineconfig || true
+
+    : "Step #1: Make sure no degraded or updating mcp"
+    wait_mcp_continous_success
+
+    : "Step #2: check all cluster operators get stable and ready"
+    wait_clusteroperators_continous_success
+
+    : "Step #3: Make sure every machine is in 'Ready' status"
+    check_node
+
+    : "Step #4: check all pods are in status running or complete"
+    check_pod
+
+    trap - EXIT
+    : "Post-upgrade health check passed for spoke '${spokeName}'"
+    true


🎯 Functional Correctness | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash cat >/tmp/errexit-if-demo.sh <<'EOF' #!/bin/bash set -euo pipefail run() { trap 'echo EXIT-trap-fired' EXIT false echo continued-after-false trap - EXIT } if ! run; then echo caller-saw-failure else echo caller-saw-success fi EOF bash /tmp/errexit-if-demo.sh

Repository: openshift/release

Length of output: 196

🏁 Script executed:

#!/bin/bash echo "=== File header (first 50 lines) ===" head -n 50 ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | cat -n echo "" echo "=== Lines 450-470 ===" sed -n '450,470p' ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | cat -n

Repository: openshift/release

Length of output: 3603

🏁 Script executed:

#!/bin/bash echo "=== Find RunSpokeHealthcheck function definition ===" grep -n "^RunSpokeHealthcheck" ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh echo "" echo "=== Count occurrences of similar patterns (if ! function calls) ===" grep -n "if ! " ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | head -20

Repository: openshift/release

Length of output: 567

Failed spoke health checks can silently pass the step

When RunSpokeHealthcheck is invoked under if ! ... (line 456), Bash suppresses set -e inside the function. A failure in any healthcheck step (wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod) does not abort the function; execution continues to line 185 where the EXIT trap is cleared, and the function returns success. Consequently, a degraded spoke cluster may cause the CI job to pass without failure.

The fix requires explicit error handling after each healthcheck step to:

Write failure diagnostics immediately via WriteSpokeHealthcheckFailureDiagnostics

Return 1 to propagate the failure

Apply the following changes to ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh (lines 160–188):

Proposed diff

RunSpokeHealthcheck() { typeset targetSpokeName="${1:?}" typeset spokeKubeconfigPath="${2:?}" spokeName="${targetSpokeName}" export KUBECONFIG="${spokeKubeconfigPath}" - trap SpokeHealthcheckFailureCleanup EXIT : "Post-upgrade health check for spoke '${spokeName}'" OC="run_command_oc" oc get machineconfig || true : "Step `#1`: Make sure no degraded or updating mcp" - wait_mcp_continous_success + wait_mcp_continous_success || { + WriteSpokeHealthcheckFailureDiagnostics || true + return 1 + } : "Step `#2`: check all cluster operators get stable and ready" - wait_clusteroperators_continous_success + wait_clusteroperators_continous_success || { + WriteSpokeHealthcheckFailureDiagnostics || true + return 1 + } : "Step `#3`: Make sure every machine is in 'Ready' status" - check_node + check_node || { + WriteSpokeHealthcheckFailureDiagnostics || true + return 1 + } : "Step `#4`: check all pods are in status running or complete" - check_pod - - trap - EXIT + check_pod || { + WriteSpokeHealthcheckFailureDiagnostics || true + return 1 + } + : "Post-upgrade health check passed for spoke '${spokeName}'" true }

The EXIT trap mechanism is unreliable here because the suppressing if ! ... context prevents the trap from firing on intermediate step failures. Explicit inline error handling ensures diagnostics are written and failures are propagated immediately.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh` around lines 160 - 188, RunSpokeHealthcheck currently lets failures in wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod slip through when called under if ! ..., so the function can still return success after a bad healthcheck. Update RunSpokeHealthcheck to handle each step explicitly: after every healthcheck call, immediately invoke WriteSpokeHealthcheckFailureDiagnostics on failure and return 1 so the error propagates. Keep the existing trap setup in sync with this flow, but do not rely on EXIT trap cleanup alone to report or surface intermediate failures.

coderabbitai · 2026-06-26T19:12:09Z

+    oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount --no-headers > "${tmp_output}" || true
+    if [[ -s "${tmp_output}" ]]; then
+        unhealthy_mcp="$(grep -v 'False.*False.*0' "${tmp_output}" || true)"
+        if [[ -n "${unhealthy_mcp}" ]]; then
+            : "Detected unhealthy mcp"
+            echo "${unhealthy_mcp}"
+            : "Real-time detected unhealthy mcp"
+            oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -v 'False.*False.*0' || true


🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash printf '%s\n' \ 'worker rendered-worker False False 0' \ 'worker rendered-worker False False 10' | tee /tmp/mcp-sample.txt >/dev/null echo 'Current filter output:' grep -v 'False.*False.*0' /tmp/mcp-sample.txt || true echo echo 'Anchored filter output:' grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' /tmp/mcp-sample.txt || true

Repository: openshift/release

Length of output: 346

The unhealthy-MCP filter incorrectly treats DEGRADEDMACHINECOUNT=10, 20, etc. as healthy

The current grep pattern 'False.*False.*0' matches any line containing the substring 0 after two False values. This causes lines with DEGRADEDMACHINECOUNT=10, 20, 100 to be incorrectly filtered out as "healthy" since they all contain the digit 0.

Test demonstrating the issue:

# Sample input with healthy (0) and unhealthy (10) MCPs printf 'master rendered-worker False False 0\nworker rendered-worker False False 10\n' > /tmp/test.txt # Current filter - incorrectly filters out the unhealthy line with DEGRADED=10 grep -v 'False.*False.*0' /tmp/test.txt # Output: (empty - both lines filtered out) # Corrected filter - only excludes lines where DEGRADED is exactly 0 grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' /tmp/test.txt # Output: worker rendered-worker False False 10 (correctly identifies as unhealthy)

Suggested change

- unhealthy_mcp="$(grep -v 'False.*False.*0' "${tmp_output}" || true)" + unhealthy_mcp="$(grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' "${tmp_output}" || true)" @@ - oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?$@.type==\"Updating\"$].status,DEGRADED:status.conditions[?$@.type==\"Degraded\"$].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -v 'False.*False.*0' || true + oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?$@.type==\"Updating\"$].status,DEGRADED:status.conditions[?$@.type==\"Degraded\"$].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' || true

The anchored pattern [[:space:]]+ ensures we match exact spacing and *$ anchors to end of line, so only 0 (not 10, 20, etc.) matches.

🧰 Tools

🪛 Shellcheck (0.11.0)

[info] 330-330: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)

[info] 330-330: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)

[info] 337-337: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)

[info] 337-337: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh` around lines 330 - 337, The unhealthy-MCP check in the machineconfigpools filter is too loose because the grep pattern in the unhealthy_mcp detection and the real-time oc get pipeline matches any line containing a 0, so DEGRADEDMACHINECOUNT values like 10 or 20 are treated as healthy. Tighten the filter used in the oc get machineconfigpools output so it only excludes rows where the Updating and Degraded columns are False and DEGRADEDMACHINECOUNT is exactly 0, using anchored whitespace-aware matching in the same unhealthy_mcp and grep pipeline logic.

coderabbitai · 2026-06-26T19:12:09Z

+RunSpokePrehealthcheck() {
+    typeset targetSpokeName="${1:?}"
+    typeset spokeKubeconfigPath="${2:?}"
+
+    spokeName="${targetSpokeName}"
+    export KUBECONFIG="${spokeKubeconfigPath}"
+    trap SpokePrehealthcheckFailureCleanup EXIT
+
+    : "Pre-upgrade health check for spoke '${spokeName}'"
+
+    OC="run_command_oc"
+
+    oc get machineconfig
+
+    : "Step #1: Make sure no degraded or updating mcp"
+    wait_mcp_continous_success
+
+    : "Step #2: check all cluster operators get stable and ready"
+    wait_clusteroperators_continous_success
+
+    : "Step #3: Make sure every machine is in 'Ready' status"
+    check_node
+
+    : "Step #4: check all pods are in status running or complete"
+    check_pod
+
+    trap - EXIT
+    : "Pre-upgrade health check passed for spoke '${spokeName}'"
+    true


🎯 Functional Correctness | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail f() { trap 'echo "EXIT trap fired"' EXIT false echo "fell through after false" trap - EXIT true } if ! f; then echo "function returned failure" else echo "function returned success" fi

Repository: openshift/release

Length of output: 206

Failed spoke health checks may incorrectly report success

In Bash, when a function with set -euo pipefail is wrapped in if ! ..., the -e (errexit) flag is effectively disabled for the function body. If wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod fail (return non-zero), execution will not halt. Instead, it falls through to true at the end, causing the function to return success (0) despite the failure condition.

RunSpokePrehealthcheck() { typeset targetSpokeName="${1:?}" typeset spokeKubeconfigPath="${2:?}" spokeName="${targetSpokeName}" export KUBECONFIG="${spokeKubeconfigPath}" - trap SpokePrehealthcheckFailureCleanup EXIT : "Pre-upgrade health check for spoke '${spokeName}'" OC="run_command_oc" - : "Step `#1`: Make sure no degraded or updating mcp" - wait_mcp_continous_success - - : "Step `#2`: check all cluster operators get stable and ready" - wait_clusteroperators_continous_success - - : "Step `#3`: Make sure every machine is in 'Ready' status" - check_node - - : "Step `#4`: check all pods are in status running or complete" - check_pod - - trap - EXIT + if ! oc get machineconfig || \ + ! wait_mcp_continous_success || \ + ! wait_clusteroperators_continous_success || \ + ! check_node || \ + ! check_pod; then + WriteSpokePrehealthcheckFailureDiagnostics || true + return 1 + fi + : "Pre-upgrade health check passed for spoke '${spokeName}'" true }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

RunSpokePrehealthcheck() {

typeset targetSpokeName="${1:?}"

typeset spokeKubeconfigPath="${2:?}"

spokeName="${targetSpokeName}"

export KUBECONFIG="${spokeKubeconfigPath}"

trap SpokePrehealthcheckFailureCleanup EXIT

: "Pre-upgrade health check for spoke '${spokeName}'"

OC="run_command_oc"

oc get machineconfig

: "Step #1: Make sure no degraded or updating mcp"

wait_mcp_continous_success

: "Step #2: check all cluster operators get stable and ready"

wait_clusteroperators_continous_success

: "Step #3: Make sure every machine is in 'Ready' status"

check_node

: "Step #4: check all pods are in status running or complete"

check_pod

trap - EXIT

: "Pre-upgrade health check passed for spoke '${spokeName}'"

true

RunSpokePrehealthcheck() {

typeset targetSpokeName="${1:?}"

typeset spokeKubeconfigPath="${2:?}"

spokeName="${targetSpokeName}"

export KUBECONFIG="${spokeKubeconfigPath}"

: "Pre-upgrade health check for spoke '${spokeName}'"

OC="run_command_oc"

if ! oc get machineconfig || \

! wait_mcp_continous_success || \

! wait_clusteroperators_continous_success || \

! check_node || \

! check_pod; then

WriteSpokePrehealthcheckFailureDiagnostics || true

return 1

fi

: "Pre-upgrade health check passed for spoke '${spokeName}'"

true

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh` around lines 160 - 188, The RunSpokePrehealthcheck function can return success even after a failed health check because failures from wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod are not being surfaced. Update the control flow so these checks are executed in a way that preserves non-zero exit codes from the function body instead of relying on a trailing true, and make sure SpokePrehealthcheckFailureCleanup still runs on failure while the function returns failure correctly.

coderabbitai · 2026-06-26T19:12:09Z

+    : "Step #4: check all pods are in status running or complete"
+    check_pod


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

check_pod never fails unhealthy spoke workloads.

Line 184 says this step validates pod health, but check_pod only prints oc get pods --all-namespaces and then returns success. Pending, Error, or CrashLoopBackOff pods on a spoke will not fail this prehealthcheck.

💡 Suggested fix

function check_pod() { - : "Show all pods status for reference/debug" - oc get pods --all-namespaces - true + typeset unhealthy_pods + unhealthy_pods="$( + oc get pods --all-namespaces --no-headers | + awk '$4 != "Running" && $4 != "Completed" {print}' + )" + if [[ -n "${unhealthy_pods}" ]]; then + echo >&2 "Found pods not in Running/Completed state" + echo >&2 "${unhealthy_pods}" + return 1 + fi + : "All pods are Running or Completed" }

Also applies to: 413-416

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh` around lines 183 - 184, The `check_pod` step is only listing pods and always succeeding, so unhealthy spoke workloads never fail the prehealthcheck. Update `check_pod` in the spoke upgrade prehealthcheck commands to actively inspect pod phases and waiting/terminated states (for example, pending, error, crashloop, not ready) and return a non-zero status when any unhealthy pod is found. Make sure the same validation logic is used for the later call sites that also invoke `check_pod`, so the step name and behavior match the health-check intent.

openshift-ci · 2026-06-26T19:12:18Z

@sg-rh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Add spoke cluster pre/post-upgrade OCP health check steps for ACM int…

8bfc0d2

…erop

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2026

openshift-ci Bot requested review from chaclark1974 and etirta June 26, 2026 19:02

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 26, 2026

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP [INTEROP-9134] Add spoke cluster pre/post-upgrade OCP health check steps for ACM Virt Scenario#81164

WIP [INTEROP-9134] Add spoke cluster pre/post-upgrade OCP health check steps for ACM Virt Scenario#81164
sg-rh wants to merge 1 commit into
openshift:mainfrom
sg-rh:acm-interop-p2p-spoke-prehealthcheck

sg-rh commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Pre-merge checks failed

Uh oh!

openshift-ci Bot commented Jun 26, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 26, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

openshift-ci Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	set -euxo pipefail; shopt -s inherit_errexit
	set -euo pipefail; shopt -s inherit_errexit

		: "Step #4: check all pods are in status running or complete"
		check_pod

Uh oh!

Conversation

sg-rh commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Pre-upgrade flow

Post-upgrade flow

Estimated code review effort

Suggested labels

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning)

Uh oh!

openshift-ci Bot commented Jun 26, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 26, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sg-rh commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading