Skip to content

WIP [INTEROP-9134] Add spoke cluster pre/post-upgrade OCP health check steps for ACM Virt Scenario#81164

Open
sg-rh wants to merge 1 commit into
openshift:mainfrom
sg-rh:acm-interop-p2p-spoke-prehealthcheck
Open

WIP [INTEROP-9134] Add spoke cluster pre/post-upgrade OCP health check steps for ACM Virt Scenario#81164
sg-rh wants to merge 1 commit into
openshift:mainfrom
sg-rh:acm-interop-p2p-spoke-prehealthcheck

Conversation

@sg-rh

@sg-rh sg-rh commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add acm-interop-p2p-spoke-upgrade-prehealthcheck and acm-interop-p2p-spoke-upgrade-healthcheck step-registry steps
  • Both run the same 4 OCP platform health checks as cucushift-upgrade-prehealthcheck (MCP, ClusterOperators, nodes, pods) but target ACM managed spoke clusters instead of the single CI hub
  • Spokes are discovered automatically from ManagedCluster resources on the hub (excluding local-cluster), or via an explicit comma-separated env var override
  • Admin kubeconfigs are resolved from Hive ClusterDeployment secrets, with a fast path reusing ${SHARED_DIR}/managed-cluster-kubeconfig when the spoke name matches
  • On failure, per-spoke diagnostics (ClusterVersion, unhealthy MCP/CO/node describe, MCO pods) are written to ${ARTIFACT_DIR}

Summary by CodeRabbit

This PR adds new ACM interop P2P CI step-registry entries for spoke cluster upgrade pre-healthchecks and post-healthchecks. The new steps are designed to validate ACM-managed spoke clusters in OpenShift CI by discovering spokes from hub-side ManagedCluster resources or optional environment overrides, resolving per-spoke kubeconfigs from Hive/CI artifacts, and running the same core OCP health checks used in existing upgrade validation flows.

Practically, this expands OpenShift CI coverage for ACM Virt scenarios with:

  • new pre-upgrade and post-upgrade spoke healthcheck step definitions
  • Bash command scripts that run MCP, cluster operator, node, and pod checks against each spoke
  • failure diagnostics written per spoke to CI artifacts for troubleshooting
  • accompanying README and ownership metadata for the new step-registry entries

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2026
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds new ACM interop P2P pre- and post-upgrade spoke healthcheck step registry entries, metadata, ownership files, README docs, and Bash command scripts. The scripts discover spokes, resolve per-spoke kubeconfigs, run MCP, clusteroperator, node, and pod checks, and write failure artifacts.

Changes

Pre-upgrade Spoke Healthcheck

Layer / File(s) Summary
Step registry and docs
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/*
Adds the pre-upgrade step registry entry, ownership metadata, and README content.
Bootstrap and spoke selection
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh
Validates the hub kubeconfig, records failure diagnostics, discovers spokes from the env var or managedcluster, and writes per-spoke kubeconfigs from cache or ClusterDeployment secrets.
Per-spoke runner and clusteroperator checks
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh
Defines the retrying oc wrapper, clusteroperator validation, the consecutive-success wait loop, and the per-spoke runner that starts the spoke check sequence.
MCP stability
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh
Checks MachineConfigPool update and degradation states, then waits for consecutive stable MCP results within the computed timeout budget.
Node readiness and main flow
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh
Checks node readiness and pod status, optionally sources proxy settings, sets hub and spoke KUBECONFIG values, and records failed spokes.

Post-upgrade Spoke Healthcheck

Layer / File(s) Summary
Step registry and docs
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/*
Adds the post-upgrade step registry entry, ownership metadata, and README content.
Bootstrap and spoke selection
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh
Validates the hub kubeconfig, records failure diagnostics, discovers spokes, and resolves per-spoke kubeconfigs.
Per-spoke runner and clusteroperator checks
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh
Defines the retrying oc wrapper, clusteroperator validation, the consecutive-success wait loop, and the per-spoke prehealthcheck runner.
MCP stability
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh
Checks MachineConfigPool update and degradation states, then waits for consecutive stable MCP results within the computed timeout budget.
Node readiness and main flow
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh
Checks node readiness and pod status, optionally sources proxy settings, sets hub and spoke KUBECONFIG values, and records failed spokes.

Sequence Diagram(s)

Pre-upgrade flow

sequenceDiagram
  participant "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"
  participant DiscoverSpokeClusters
  participant ExtractSpokeKubeconfig
  participant RunSpokeHealthcheck
  participant oc
  "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>DiscoverSpokeClusters: collect spoke names
  "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>ExtractSpokeKubeconfig: resolve spoke kubeconfig
  ExtractSpokeKubeconfig->>oc: read ClusterDeployment and Secret
  "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>RunSpokeHealthcheck: run spoke checks
  RunSpokeHealthcheck->>oc: check MCPs, clusteroperators, nodes, pods
Loading

Post-upgrade flow

sequenceDiagram
  participant "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"
  participant DiscoverSpokeClusters
  participant ExtractSpokeKubeconfig
  participant RunSpokeHealthcheck
  participant oc
  "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>DiscoverSpokeClusters: collect spoke names
  "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>ExtractSpokeKubeconfig: resolve spoke kubeconfig
  ExtractSpokeKubeconfig->>oc: read ClusterDeployment and Secret
  "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>RunSpokeHealthcheck: run spoke checks
  RunSpokeHealthcheck->>oc: check MCPs, clusteroperators, nodes, pods
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

jira/valid-reference


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error Both new step scripts use global set -x and print oc get pods --all-namespaces/oc get node -o wide, plus source proxy-conf.sh, which can leak cluster metadata and creds to logs. Drop global xtrace; keep tracing off by default, and move/redact node/pod dumps and sourced proxy config so only non-sensitive details reach logs.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the addition of ACM spoke cluster pre/post-upgrade OCP health check steps.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR only adds step-registry YAML, docs, OWNERS, and shell scripts; no Ginkgo It/Describe/Context/When titles are present.
Test Structure And Quality ✅ Passed PR only changes step-registry YAML, README, OWNERS, and shell scripts; no Ginkgo test code is touched, so the test-quality check is not applicable.
Microshift Test Compatibility ✅ Passed No Ginkgo tests were added: the touched paths contain only step-registry YAML/SH/MD assets, and no It/Describe/Context blocks were found.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR only adds step-registry shell/YAML/README/OWNERS files, not Ginkgo e2e tests, so the SNO-specific check doesn’t apply.
Topology-Aware Scheduling Compatibility ✅ Passed The PR only adds step-registry refs, docs, and Bash healthcheck scripts; I found no affinity, nodeSelector, spread, or replica-topology assumptions.
Ote Binary Stdout Contract ✅ Passed PR only adds step-registry shell/YAML files; no OTE binary or process-level Go/TestMain/setup code was changed, so the stdout contract doesn’t apply.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PASS: PR adds step-registry bash/YAML docs, not new Ginkgo e2e tests; no IPv4-only parsing or external connectivity in touched files.
No-Weak-Crypto ✅ Passed Touched files only add kubeconfig handling and health checks; no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret/token comparisons found.
Container-Privileges ✅ Passed The new step-registry YAMLs add only image, resources, env, and docs; no privileged/host*/*allowPrivilegeEscalation/securityContext settings appear in the changed manifests.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from chaclark1974 and etirta June 26, 2026 19:02
@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sg-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026
@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 26, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

[REHEARSALNOTIFIER]
@sg-rh: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh (1)

7-7: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Drop xtrace from this step-registry script.

This step handles kubeconfig material, so keeping -x enabled by default adds avoidable trace noise around sensitive operations. As per coding guidelines, "Default to set -euo pipefail (without -x) in step registry scripts, and only enable -x when actively debugging to avoid tracing sensitive information."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`
at line 7, Remove xtrace from the step-registry script setup so it defaults to
safe shell options only; update the script’s initial shell flags in the
spoke-upgrade-prehealthcheck command script from the current `set -euxo
pipefail; shopt -s inherit_errexit` behavior to use `set -euo pipefail` without
`-x`, keeping `inherit_errexit` if still needed.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`:
- Line 7: The step script currently enables global xtrace via the shell options
setup, which can leak sensitive kubeconfig-related expansions into logs. Update
the script’s initial shell settings to default to non-tracing error handling
only, and keep tracing disabled unless you explicitly wrap a short,
non-sensitive debug block. Preserve the existing errexit/pipefail behavior and
the inherit_errexit setting while removing the global -x from the script
entrypoint.
- Around line 330-337: The unhealthy-MCP check in the machineconfigpools filter
is too loose because the grep pattern in the unhealthy_mcp detection and the
real-time oc get pipeline matches any line containing a 0, so
DEGRADEDMACHINECOUNT values like 10 or 20 are treated as healthy. Tighten the
filter used in the oc get machineconfigpools output so it only excludes rows
where the Updating and Degraded columns are False and DEGRADEDMACHINECOUNT is
exactly 0, using anchored whitespace-aware matching in the same unhealthy_mcp
and grep pipeline logic.
- Around line 160-188: RunSpokeHealthcheck currently lets failures in
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod slip through when called under if ! ..., so the function can still
return success after a bad healthcheck. Update RunSpokeHealthcheck to handle
each step explicitly: after every healthcheck call, immediately invoke
WriteSpokeHealthcheckFailureDiagnostics on failure and return 1 so the error
propagates. Keep the existing trap setup in sync with this flow, but do not rely
on EXIT trap cleanup alone to report or surface intermediate failures.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`:
- Around line 183-184: The `check_pod` step is only listing pods and always
succeeding, so unhealthy spoke workloads never fail the prehealthcheck. Update
`check_pod` in the spoke upgrade prehealthcheck commands to actively inspect pod
phases and waiting/terminated states (for example, pending, error, crashloop,
not ready) and return a non-zero status when any unhealthy pod is found. Make
sure the same validation logic is used for the later call sites that also invoke
`check_pod`, so the step name and behavior match the health-check intent.
- Around line 160-188: The RunSpokePrehealthcheck function can return success
even after a failed health check because failures from
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod are not being surfaced. Update the control flow so these checks are
executed in a way that preserves non-zero exit codes from the function body
instead of relying on a trailing true, and make sure
SpokePrehealthcheckFailureCleanup still runs on failure while the function
returns failure correctly.

---

Nitpick comments:
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`:
- Line 7: Remove xtrace from the step-registry script setup so it defaults to
safe shell options only; update the script’s initial shell flags in the
spoke-upgrade-prehealthcheck command script from the current `set -euxo
pipefail; shopt -s inherit_errexit` behavior to use `set -euo pipefail` without
`-x`, keeping `inherit_errexit` if still needed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bc6b18ff-0e41-4145-9a98-cb6672c7203c

📥 Commits

Reviewing files that changed from the base of the PR and between 53106f6 and 8bfc0d2.

📒 Files selected for processing (10)
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/OWNERS
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/README.md
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-ref.metadata.json
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-ref.yaml
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/OWNERS
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/README.md
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-ref.metadata.json
  • ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-ref.yaml

# Health check logic mirrors cucushift-upgrade-healthcheck-commands.sh
# (ci-operator/step-registry/cucushift/upgrade/healthcheck/) with spoke-specific MCP wait tuning.
#
set -euxo pipefail; shopt -s inherit_errexit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Drop global xtrace for this step script.

This step reads admin kubeconfigs from hub secrets, so keeping -x enabled for the entire script violates the step-registry guidance and widens the chance of leaking sensitive command expansions into job logs. Default to set -euo pipefail here and only enable tracing around short non-sensitive debug sections if needed.

Suggested change
-set -euxo pipefail; shopt -s inherit_errexit
+set -euo pipefail; shopt -s inherit_errexit

As per coding guidelines, "Default to set -euo pipefail (without -x) in step registry scripts," and based on learnings this script handles kubeconfig secret material.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
set -euxo pipefail; shopt -s inherit_errexit
set -euo pipefail; shopt -s inherit_errexit
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`
at line 7, The step script currently enables global xtrace via the shell options
setup, which can leak sensitive kubeconfig-related expansions into logs. Update
the script’s initial shell settings to default to non-tracing error handling
only, and keep tracing disabled unless you explicitly wrap a short,
non-sensitive debug block. Preserve the existing errexit/pipefail behavior and
the inherit_errexit setting while removing the global -x from the script
entrypoint.

Sources: Coding guidelines, Learnings

Comment on lines +160 to +188
RunSpokeHealthcheck() {
typeset targetSpokeName="${1:?}"
typeset spokeKubeconfigPath="${2:?}"

spokeName="${targetSpokeName}"
export KUBECONFIG="${spokeKubeconfigPath}"
trap SpokeHealthcheckFailureCleanup EXIT

: "Post-upgrade health check for spoke '${spokeName}'"

OC="run_command_oc"

oc get machineconfig || true

: "Step #1: Make sure no degraded or updating mcp"
wait_mcp_continous_success

: "Step #2: check all cluster operators get stable and ready"
wait_clusteroperators_continous_success

: "Step #3: Make sure every machine is in 'Ready' status"
check_node

: "Step #4: check all pods are in status running or complete"
check_pod

trap - EXIT
: "Post-upgrade health check passed for spoke '${spokeName}'"
true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat >/tmp/errexit-if-demo.sh <<'EOF'
#!/bin/bash
set -euo pipefail

run() {
  trap 'echo EXIT-trap-fired' EXIT
  false
  echo continued-after-false
  trap - EXIT
}

if ! run; then
  echo caller-saw-failure
else
  echo caller-saw-success
fi
EOF

bash /tmp/errexit-if-demo.sh

Repository: openshift/release

Length of output: 196


🏁 Script executed:

#!/bin/bash
echo "=== File header (first 50 lines) ==="
head -n 50 ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | cat -n

echo ""
echo "=== Lines 450-470 ==="
sed -n '450,470p' ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | cat -n

Repository: openshift/release

Length of output: 3603


🏁 Script executed:

#!/bin/bash
echo "=== Find RunSpokeHealthcheck function definition ==="
grep -n "^RunSpokeHealthcheck\(\)" ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh

echo ""
echo "=== Count occurrences of similar patterns (if ! function calls) ==="
grep -n "if ! " ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | head -20

Repository: openshift/release

Length of output: 567


Failed spoke health checks can silently pass the step

When RunSpokeHealthcheck is invoked under if ! ... (line 456), Bash suppresses set -e inside the function. A failure in any healthcheck step (wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod) does not abort the function; execution continues to line 185 where the EXIT trap is cleared, and the function returns success. Consequently, a degraded spoke cluster may cause the CI job to pass without failure.

The fix requires explicit error handling after each healthcheck step to:

  1. Write failure diagnostics immediately via WriteSpokeHealthcheckFailureDiagnostics
  2. Return 1 to propagate the failure

Apply the following changes to ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh (lines 160–188):

Proposed diff
 RunSpokeHealthcheck() {
     typeset targetSpokeName="${1:?}"
     typeset spokeKubeconfigPath="${2:?}"

     spokeName="${targetSpokeName}"
     export KUBECONFIG="${spokeKubeconfigPath}"
-    trap SpokeHealthcheckFailureCleanup EXIT

     : "Post-upgrade health check for spoke '${spokeName}'"

     OC="run_command_oc"

     oc get machineconfig || true

     : "Step `#1`: Make sure no degraded or updating mcp"
-    wait_mcp_continous_success
+    wait_mcp_continous_success || {
+        WriteSpokeHealthcheckFailureDiagnostics || true
+        return 1
+    }

     : "Step `#2`: check all cluster operators get stable and ready"
-    wait_clusteroperators_continous_success
+    wait_clusteroperators_continous_success || {
+        WriteSpokeHealthcheckFailureDiagnostics || true
+        return 1
+    }

     : "Step `#3`: Make sure every machine is in 'Ready' status"
-    check_node
+    check_node || {
+        WriteSpokeHealthcheckFailureDiagnostics || true
+        return 1
+    }

     : "Step `#4`: check all pods are in status running or complete"
-    check_pod
-
-    trap - EXIT
+    check_pod || {
+        WriteSpokeHealthcheckFailureDiagnostics || true
+        return 1
+    }
+
     : "Post-upgrade health check passed for spoke '${spokeName}'"
     true
 }

The EXIT trap mechanism is unreliable here because the suppressing if ! ... context prevents the trap from firing on intermediate step failures. Explicit inline error handling ensures diagnostics are written and failures are propagated immediately.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`
around lines 160 - 188, RunSpokeHealthcheck currently lets failures in
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod slip through when called under if ! ..., so the function can still
return success after a bad healthcheck. Update RunSpokeHealthcheck to handle
each step explicitly: after every healthcheck call, immediately invoke
WriteSpokeHealthcheckFailureDiagnostics on failure and return 1 so the error
propagates. Keep the existing trap setup in sync with this flow, but do not rely
on EXIT trap cleanup alone to report or surface intermediate failures.

Comment on lines +330 to +337
oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount --no-headers > "${tmp_output}" || true
if [[ -s "${tmp_output}" ]]; then
unhealthy_mcp="$(grep -v 'False.*False.*0' "${tmp_output}" || true)"
if [[ -n "${unhealthy_mcp}" ]]; then
: "Detected unhealthy mcp"
echo "${unhealthy_mcp}"
: "Real-time detected unhealthy mcp"
oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -v 'False.*False.*0' || true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
printf '%s\n' \
  'worker rendered-worker False False 0' \
  'worker rendered-worker False False 10' |
tee /tmp/mcp-sample.txt >/dev/null

echo 'Current filter output:'
grep -v 'False.*False.*0' /tmp/mcp-sample.txt || true

echo
echo 'Anchored filter output:'
grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' /tmp/mcp-sample.txt || true

Repository: openshift/release

Length of output: 346


The unhealthy-MCP filter incorrectly treats DEGRADEDMACHINECOUNT=10, 20, etc. as healthy

The current grep pattern 'False.*False.*0' matches any line containing the substring 0 after two False values. This causes lines with DEGRADEDMACHINECOUNT=10, 20, 100 to be incorrectly filtered out as "healthy" since they all contain the digit 0.

Test demonstrating the issue:

# Sample input with healthy (0) and unhealthy (10) MCPs
printf 'master rendered-worker False False 0\nworker rendered-worker False False 10\n' > /tmp/test.txt

# Current filter - incorrectly filters out the unhealthy line with DEGRADED=10
grep -v 'False.*False.*0' /tmp/test.txt
# Output: (empty - both lines filtered out)

# Corrected filter - only excludes lines where DEGRADED is exactly 0
grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' /tmp/test.txt  
# Output: worker rendered-worker False False 10 (correctly identifies as unhealthy)
Suggested change
-        unhealthy_mcp="$(grep -v 'False.*False.*0' "${tmp_output}" || true)"
+        unhealthy_mcp="$(grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' "${tmp_output}" || true)"
@@
-            oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -v 'False.*False.*0' || true
+            oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' || true

The anchored pattern [[:space:]]+ ensures we match exact spacing and *$ anchors to end of line, so only 0 (not 10, 20, etc.) matches.

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 330-330: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)


[info] 330-330: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)


[info] 337-337: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)


[info] 337-337: Ranges can only match single chars (mentioned due to duplicates).

(SC2102)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`
around lines 330 - 337, The unhealthy-MCP check in the machineconfigpools filter
is too loose because the grep pattern in the unhealthy_mcp detection and the
real-time oc get pipeline matches any line containing a 0, so
DEGRADEDMACHINECOUNT values like 10 or 20 are treated as healthy. Tighten the
filter used in the oc get machineconfigpools output so it only excludes rows
where the Updating and Degraded columns are False and DEGRADEDMACHINECOUNT is
exactly 0, using anchored whitespace-aware matching in the same unhealthy_mcp
and grep pipeline logic.

Comment on lines +160 to +188
RunSpokePrehealthcheck() {
typeset targetSpokeName="${1:?}"
typeset spokeKubeconfigPath="${2:?}"

spokeName="${targetSpokeName}"
export KUBECONFIG="${spokeKubeconfigPath}"
trap SpokePrehealthcheckFailureCleanup EXIT

: "Pre-upgrade health check for spoke '${spokeName}'"

OC="run_command_oc"

oc get machineconfig

: "Step #1: Make sure no degraded or updating mcp"
wait_mcp_continous_success

: "Step #2: check all cluster operators get stable and ready"
wait_clusteroperators_continous_success

: "Step #3: Make sure every machine is in 'Ready' status"
check_node

: "Step #4: check all pods are in status running or complete"
check_pod

trap - EXIT
: "Pre-upgrade health check passed for spoke '${spokeName}'"
true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

f() {
  trap 'echo "EXIT trap fired"' EXIT
  false
  echo "fell through after false"
  trap - EXIT
  true
}

if ! f; then
  echo "function returned failure"
else
  echo "function returned success"
fi

Repository: openshift/release

Length of output: 206


Failed spoke health checks may incorrectly report success

In Bash, when a function with set -euo pipefail is wrapped in if ! ..., the -e (errexit) flag is effectively disabled for the function body. If wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod fail (return non-zero), execution will not halt. Instead, it falls through to true at the end, causing the function to return success (0) despite the failure condition.

 RunSpokePrehealthcheck() {
     typeset targetSpokeName="${1:?}"
     typeset spokeKubeconfigPath="${2:?}"

     spokeName="${targetSpokeName}"
     export KUBECONFIG="${spokeKubeconfigPath}"
-    trap SpokePrehealthcheckFailureCleanup EXIT

     : "Pre-upgrade health check for spoke '${spokeName}'"

     OC="run_command_oc"

-    : "Step `#1`: Make sure no degraded or updating mcp"
-    wait_mcp_continous_success
-
-    : "Step `#2`: check all cluster operators get stable and ready"
-    wait_clusteroperators_continous_success
-
-    : "Step `#3`: Make sure every machine is in 'Ready' status"
-    check_node
-
-    : "Step `#4`: check all pods are in status running or complete"
-    check_pod
-
-    trap - EXIT
+    if ! oc get machineconfig || \
+       ! wait_mcp_continous_success || \
+       ! wait_clusteroperators_continous_success || \
+       ! check_node || \
+       ! check_pod; then
+        WriteSpokePrehealthcheckFailureDiagnostics || true
+        return 1
+    fi
+
     : "Pre-upgrade health check passed for spoke '${spokeName}'"
     true
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
RunSpokePrehealthcheck() {
typeset targetSpokeName="${1:?}"
typeset spokeKubeconfigPath="${2:?}"
spokeName="${targetSpokeName}"
export KUBECONFIG="${spokeKubeconfigPath}"
trap SpokePrehealthcheckFailureCleanup EXIT
: "Pre-upgrade health check for spoke '${spokeName}'"
OC="run_command_oc"
oc get machineconfig
: "Step #1: Make sure no degraded or updating mcp"
wait_mcp_continous_success
: "Step #2: check all cluster operators get stable and ready"
wait_clusteroperators_continous_success
: "Step #3: Make sure every machine is in 'Ready' status"
check_node
: "Step #4: check all pods are in status running or complete"
check_pod
trap - EXIT
: "Pre-upgrade health check passed for spoke '${spokeName}'"
true
RunSpokePrehealthcheck() {
typeset targetSpokeName="${1:?}"
typeset spokeKubeconfigPath="${2:?}"
spokeName="${targetSpokeName}"
export KUBECONFIG="${spokeKubeconfigPath}"
: "Pre-upgrade health check for spoke '${spokeName}'"
OC="run_command_oc"
if ! oc get machineconfig || \
! wait_mcp_continous_success || \
! wait_clusteroperators_continous_success || \
! check_node || \
! check_pod; then
WriteSpokePrehealthcheckFailureDiagnostics || true
return 1
fi
: "Pre-upgrade health check passed for spoke '${spokeName}'"
true
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`
around lines 160 - 188, The RunSpokePrehealthcheck function can return success
even after a failed health check because failures from
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod are not being surfaced. Update the control flow so these checks are
executed in a way that preserves non-zero exit codes from the function body
instead of relying on a trailing true, and make sure
SpokePrehealthcheckFailureCleanup still runs on failure while the function
returns failure correctly.

Comment on lines +183 to +184
: "Step #4: check all pods are in status running or complete"
check_pod

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

check_pod never fails unhealthy spoke workloads.

Line 184 says this step validates pod health, but check_pod only prints oc get pods --all-namespaces and then returns success. Pending, Error, or CrashLoopBackOff pods on a spoke will not fail this prehealthcheck.

💡 Suggested fix
 function check_pod() {
-    : "Show all pods status for reference/debug"
-    oc get pods --all-namespaces
-    true
+    typeset unhealthy_pods
+    unhealthy_pods="$(
+        oc get pods --all-namespaces --no-headers |
+        awk '$4 != "Running" && $4 != "Completed" {print}'
+    )"
+    if [[ -n "${unhealthy_pods}" ]]; then
+        echo >&2 "Found pods not in Running/Completed state"
+        echo >&2 "${unhealthy_pods}"
+        return 1
+    fi
+    : "All pods are Running or Completed"
 }

Also applies to: 413-416

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`
around lines 183 - 184, The `check_pod` step is only listing pods and always
succeeding, so unhealthy spoke workloads never fail the prehealthcheck. Update
`check_pod` in the spoke upgrade prehealthcheck commands to actively inspect pod
phases and waiting/terminated states (for example, pending, error, crashloop,
not ready) and return a non-zero status when any unhealthy pod is found. Make
sure the same validation logic is used for the later call sites that also invoke
`check_pod`, so the step name and behavior match the health-check intent.

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@sg-rh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. rehearsals-ack Signifies that rehearsal jobs have been acknowledged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant