WIP [INTEROP-9134] Add spoke cluster pre/post-upgrade OCP health check steps for ACM Virt Scenario#81164
WIP [INTEROP-9134] Add spoke cluster pre/post-upgrade OCP health check steps for ACM Virt Scenario#81164sg-rh wants to merge 1 commit into
Conversation
WalkthroughAdds new ACM interop P2P pre- and post-upgrade spoke healthcheck step registry entries, metadata, ownership files, README docs, and Bash command scripts. The scripts discover spokes, resolve per-spoke kubeconfigs, run MCP, clusteroperator, node, and pod checks, and write failure artifacts. ChangesPre-upgrade Spoke Healthcheck
Post-upgrade Spoke Healthcheck
Sequence Diagram(s)Pre-upgrade flowsequenceDiagram
participant "acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"
participant DiscoverSpokeClusters
participant ExtractSpokeKubeconfig
participant RunSpokeHealthcheck
participant oc
"acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>DiscoverSpokeClusters: collect spoke names
"acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>ExtractSpokeKubeconfig: resolve spoke kubeconfig
ExtractSpokeKubeconfig->>oc: read ClusterDeployment and Secret
"acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh"->>RunSpokeHealthcheck: run spoke checks
RunSpokeHealthcheck->>oc: check MCPs, clusteroperators, nodes, pods
Post-upgrade flowsequenceDiagram
participant "acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"
participant DiscoverSpokeClusters
participant ExtractSpokeKubeconfig
participant RunSpokeHealthcheck
participant oc
"acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>DiscoverSpokeClusters: collect spoke names
"acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>ExtractSpokeKubeconfig: resolve spoke kubeconfig
ExtractSpokeKubeconfig->>oc: read ClusterDeployment and Secret
"acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh"->>RunSpokeHealthcheck: run spoke checks
RunSpokeHealthcheck->>oc: check MCPs, clusteroperators, nodes, pods
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (13 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sg-rh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
[REHEARSALNOTIFIER] Note: If this PR includes changes to step registry files ( |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh (1)
7-7: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick winDrop xtrace from this step-registry script.
This step handles kubeconfig material, so keeping
-xenabled by default adds avoidable trace noise around sensitive operations. As per coding guidelines, "Default toset -euo pipefail(without-x) in step registry scripts, and only enable-xwhen actively debugging to avoid tracing sensitive information."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh` at line 7, Remove xtrace from the step-registry script setup so it defaults to safe shell options only; update the script’s initial shell flags in the spoke-upgrade-prehealthcheck command script from the current `set -euxo pipefail; shopt -s inherit_errexit` behavior to use `set -euo pipefail` without `-x`, keeping `inherit_errexit` if still needed.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`:
- Line 7: The step script currently enables global xtrace via the shell options
setup, which can leak sensitive kubeconfig-related expansions into logs. Update
the script’s initial shell settings to default to non-tracing error handling
only, and keep tracing disabled unless you explicitly wrap a short,
non-sensitive debug block. Preserve the existing errexit/pipefail behavior and
the inherit_errexit setting while removing the global -x from the script
entrypoint.
- Around line 330-337: The unhealthy-MCP check in the machineconfigpools filter
is too loose because the grep pattern in the unhealthy_mcp detection and the
real-time oc get pipeline matches any line containing a 0, so
DEGRADEDMACHINECOUNT values like 10 or 20 are treated as healthy. Tighten the
filter used in the oc get machineconfigpools output so it only excludes rows
where the Updating and Degraded columns are False and DEGRADEDMACHINECOUNT is
exactly 0, using anchored whitespace-aware matching in the same unhealthy_mcp
and grep pipeline logic.
- Around line 160-188: RunSpokeHealthcheck currently lets failures in
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod slip through when called under if ! ..., so the function can still
return success after a bad healthcheck. Update RunSpokeHealthcheck to handle
each step explicitly: after every healthcheck call, immediately invoke
WriteSpokeHealthcheckFailureDiagnostics on failure and return 1 so the error
propagates. Keep the existing trap setup in sync with this flow, but do not rely
on EXIT trap cleanup alone to report or surface intermediate failures.
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`:
- Around line 183-184: The `check_pod` step is only listing pods and always
succeeding, so unhealthy spoke workloads never fail the prehealthcheck. Update
`check_pod` in the spoke upgrade prehealthcheck commands to actively inspect pod
phases and waiting/terminated states (for example, pending, error, crashloop,
not ready) and return a non-zero status when any unhealthy pod is found. Make
sure the same validation logic is used for the later call sites that also invoke
`check_pod`, so the step name and behavior match the health-check intent.
- Around line 160-188: The RunSpokePrehealthcheck function can return success
even after a failed health check because failures from
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod are not being surfaced. Update the control flow so these checks are
executed in a way that preserves non-zero exit codes from the function body
instead of relying on a trailing true, and make sure
SpokePrehealthcheckFailureCleanup still runs on failure while the function
returns failure correctly.
---
Nitpick comments:
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`:
- Line 7: Remove xtrace from the step-registry script setup so it defaults to
safe shell options only; update the script’s initial shell flags in the
spoke-upgrade-prehealthcheck command script from the current `set -euxo
pipefail; shopt -s inherit_errexit` behavior to use `set -euo pipefail` without
`-x`, keeping `inherit_errexit` if still needed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bc6b18ff-0e41-4145-9a98-cb6672c7203c
📒 Files selected for processing (10)
ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/OWNERSci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/README.mdci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.shci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-ref.metadata.jsonci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-ref.yamlci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/OWNERSci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/README.mdci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.shci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-ref.metadata.jsonci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-ref.yaml
| # Health check logic mirrors cucushift-upgrade-healthcheck-commands.sh | ||
| # (ci-operator/step-registry/cucushift/upgrade/healthcheck/) with spoke-specific MCP wait tuning. | ||
| # | ||
| set -euxo pipefail; shopt -s inherit_errexit |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Drop global xtrace for this step script.
This step reads admin kubeconfigs from hub secrets, so keeping -x enabled for the entire script violates the step-registry guidance and widens the chance of leaking sensitive command expansions into job logs. Default to set -euo pipefail here and only enable tracing around short non-sensitive debug sections if needed.
Suggested change
-set -euxo pipefail; shopt -s inherit_errexit
+set -euo pipefail; shopt -s inherit_errexitAs per coding guidelines, "Default to set -euo pipefail (without -x) in step registry scripts," and based on learnings this script handles kubeconfig secret material.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| set -euxo pipefail; shopt -s inherit_errexit | |
| set -euo pipefail; shopt -s inherit_errexit |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`
at line 7, The step script currently enables global xtrace via the shell options
setup, which can leak sensitive kubeconfig-related expansions into logs. Update
the script’s initial shell settings to default to non-tracing error handling
only, and keep tracing disabled unless you explicitly wrap a short,
non-sensitive debug block. Preserve the existing errexit/pipefail behavior and
the inherit_errexit setting while removing the global -x from the script
entrypoint.
Sources: Coding guidelines, Learnings
| RunSpokeHealthcheck() { | ||
| typeset targetSpokeName="${1:?}" | ||
| typeset spokeKubeconfigPath="${2:?}" | ||
|
|
||
| spokeName="${targetSpokeName}" | ||
| export KUBECONFIG="${spokeKubeconfigPath}" | ||
| trap SpokeHealthcheckFailureCleanup EXIT | ||
|
|
||
| : "Post-upgrade health check for spoke '${spokeName}'" | ||
|
|
||
| OC="run_command_oc" | ||
|
|
||
| oc get machineconfig || true | ||
|
|
||
| : "Step #1: Make sure no degraded or updating mcp" | ||
| wait_mcp_continous_success | ||
|
|
||
| : "Step #2: check all cluster operators get stable and ready" | ||
| wait_clusteroperators_continous_success | ||
|
|
||
| : "Step #3: Make sure every machine is in 'Ready' status" | ||
| check_node | ||
|
|
||
| : "Step #4: check all pods are in status running or complete" | ||
| check_pod | ||
|
|
||
| trap - EXIT | ||
| : "Post-upgrade health check passed for spoke '${spokeName}'" | ||
| true |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
cat >/tmp/errexit-if-demo.sh <<'EOF'
#!/bin/bash
set -euo pipefail
run() {
trap 'echo EXIT-trap-fired' EXIT
false
echo continued-after-false
trap - EXIT
}
if ! run; then
echo caller-saw-failure
else
echo caller-saw-success
fi
EOF
bash /tmp/errexit-if-demo.shRepository: openshift/release
Length of output: 196
🏁 Script executed:
#!/bin/bash
echo "=== File header (first 50 lines) ==="
head -n 50 ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | cat -n
echo ""
echo "=== Lines 450-470 ==="
sed -n '450,470p' ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | cat -nRepository: openshift/release
Length of output: 3603
🏁 Script executed:
#!/bin/bash
echo "=== Find RunSpokeHealthcheck function definition ==="
grep -n "^RunSpokeHealthcheck\(\)" ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh
echo ""
echo "=== Count occurrences of similar patterns (if ! function calls) ==="
grep -n "if ! " ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh | head -20Repository: openshift/release
Length of output: 567
Failed spoke health checks can silently pass the step
When RunSpokeHealthcheck is invoked under if ! ... (line 456), Bash suppresses set -e inside the function. A failure in any healthcheck step (wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod) does not abort the function; execution continues to line 185 where the EXIT trap is cleared, and the function returns success. Consequently, a degraded spoke cluster may cause the CI job to pass without failure.
The fix requires explicit error handling after each healthcheck step to:
- Write failure diagnostics immediately via
WriteSpokeHealthcheckFailureDiagnostics - Return
1to propagate the failure
Apply the following changes to ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh (lines 160–188):
Proposed diff
RunSpokeHealthcheck() {
typeset targetSpokeName="${1:?}"
typeset spokeKubeconfigPath="${2:?}"
spokeName="${targetSpokeName}"
export KUBECONFIG="${spokeKubeconfigPath}"
- trap SpokeHealthcheckFailureCleanup EXIT
: "Post-upgrade health check for spoke '${spokeName}'"
OC="run_command_oc"
oc get machineconfig || true
: "Step `#1`: Make sure no degraded or updating mcp"
- wait_mcp_continous_success
+ wait_mcp_continous_success || {
+ WriteSpokeHealthcheckFailureDiagnostics || true
+ return 1
+ }
: "Step `#2`: check all cluster operators get stable and ready"
- wait_clusteroperators_continous_success
+ wait_clusteroperators_continous_success || {
+ WriteSpokeHealthcheckFailureDiagnostics || true
+ return 1
+ }
: "Step `#3`: Make sure every machine is in 'Ready' status"
- check_node
+ check_node || {
+ WriteSpokeHealthcheckFailureDiagnostics || true
+ return 1
+ }
: "Step `#4`: check all pods are in status running or complete"
- check_pod
-
- trap - EXIT
+ check_pod || {
+ WriteSpokeHealthcheckFailureDiagnostics || true
+ return 1
+ }
+
: "Post-upgrade health check passed for spoke '${spokeName}'"
true
}The EXIT trap mechanism is unreliable here because the suppressing if ! ... context prevents the trap from firing on intermediate step failures. Explicit inline error handling ensures diagnostics are written and failures are propagated immediately.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`
around lines 160 - 188, RunSpokeHealthcheck currently lets failures in
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod slip through when called under if ! ..., so the function can still
return success after a bad healthcheck. Update RunSpokeHealthcheck to handle
each step explicitly: after every healthcheck call, immediately invoke
WriteSpokeHealthcheckFailureDiagnostics on failure and return 1 so the error
propagates. Keep the existing trap setup in sync with this flow, but do not rely
on EXIT trap cleanup alone to report or surface intermediate failures.
| oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount --no-headers > "${tmp_output}" || true | ||
| if [[ -s "${tmp_output}" ]]; then | ||
| unhealthy_mcp="$(grep -v 'False.*False.*0' "${tmp_output}" || true)" | ||
| if [[ -n "${unhealthy_mcp}" ]]; then | ||
| : "Detected unhealthy mcp" | ||
| echo "${unhealthy_mcp}" | ||
| : "Real-time detected unhealthy mcp" | ||
| oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -v 'False.*False.*0' || true |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
printf '%s\n' \
'worker rendered-worker False False 0' \
'worker rendered-worker False False 10' |
tee /tmp/mcp-sample.txt >/dev/null
echo 'Current filter output:'
grep -v 'False.*False.*0' /tmp/mcp-sample.txt || true
echo
echo 'Anchored filter output:'
grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' /tmp/mcp-sample.txt || trueRepository: openshift/release
Length of output: 346
The unhealthy-MCP filter incorrectly treats DEGRADEDMACHINECOUNT=10, 20, etc. as healthy
The current grep pattern 'False.*False.*0' matches any line containing the substring 0 after two False values. This causes lines with DEGRADEDMACHINECOUNT=10, 20, 100 to be incorrectly filtered out as "healthy" since they all contain the digit 0.
Test demonstrating the issue:
# Sample input with healthy (0) and unhealthy (10) MCPs
printf 'master rendered-worker False False 0\nworker rendered-worker False False 10\n' > /tmp/test.txt
# Current filter - incorrectly filters out the unhealthy line with DEGRADED=10
grep -v 'False.*False.*0' /tmp/test.txt
# Output: (empty - both lines filtered out)
# Corrected filter - only excludes lines where DEGRADED is exactly 0
grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' /tmp/test.txt
# Output: worker rendered-worker False False 10 (correctly identifies as unhealthy)Suggested change
- unhealthy_mcp="$(grep -v 'False.*False.*0' "${tmp_output}" || true)"
+ unhealthy_mcp="$(grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' "${tmp_output}" || true)"
@@
- oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -v 'False.*False.*0' || true
+ oc get machineconfigpools -o custom-columns=NAME:metadata.name,CONFIG:spec.configuration.name,UPDATING:status.conditions[?\(@.type==\"Updating\"\)].status,DEGRADED:status.conditions[?\(@.type==\"Degraded\"\)].status,DEGRADEDMACHINECOUNT:status.degradedMachineCount | grep -Ev '[[:space:]]False[[:space:]]+False[[:space:]]+0[[:space:]]*$' || trueThe anchored pattern [[:space:]]+ ensures we match exact spacing and *$ anchors to end of line, so only 0 (not 10, 20, etc.) matches.
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 330-330: Ranges can only match single chars (mentioned due to duplicates).
(SC2102)
[info] 330-330: Ranges can only match single chars (mentioned due to duplicates).
(SC2102)
[info] 337-337: Ranges can only match single chars (mentioned due to duplicates).
(SC2102)
[info] 337-337: Ranges can only match single chars (mentioned due to duplicates).
(SC2102)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-healthcheck/acm-interop-p2p-spoke-upgrade-healthcheck-commands.sh`
around lines 330 - 337, The unhealthy-MCP check in the machineconfigpools filter
is too loose because the grep pattern in the unhealthy_mcp detection and the
real-time oc get pipeline matches any line containing a 0, so
DEGRADEDMACHINECOUNT values like 10 or 20 are treated as healthy. Tighten the
filter used in the oc get machineconfigpools output so it only excludes rows
where the Updating and Degraded columns are False and DEGRADEDMACHINECOUNT is
exactly 0, using anchored whitespace-aware matching in the same unhealthy_mcp
and grep pipeline logic.
| RunSpokePrehealthcheck() { | ||
| typeset targetSpokeName="${1:?}" | ||
| typeset spokeKubeconfigPath="${2:?}" | ||
|
|
||
| spokeName="${targetSpokeName}" | ||
| export KUBECONFIG="${spokeKubeconfigPath}" | ||
| trap SpokePrehealthcheckFailureCleanup EXIT | ||
|
|
||
| : "Pre-upgrade health check for spoke '${spokeName}'" | ||
|
|
||
| OC="run_command_oc" | ||
|
|
||
| oc get machineconfig | ||
|
|
||
| : "Step #1: Make sure no degraded or updating mcp" | ||
| wait_mcp_continous_success | ||
|
|
||
| : "Step #2: check all cluster operators get stable and ready" | ||
| wait_clusteroperators_continous_success | ||
|
|
||
| : "Step #3: Make sure every machine is in 'Ready' status" | ||
| check_node | ||
|
|
||
| : "Step #4: check all pods are in status running or complete" | ||
| check_pod | ||
|
|
||
| trap - EXIT | ||
| : "Pre-upgrade health check passed for spoke '${spokeName}'" | ||
| true |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
f() {
trap 'echo "EXIT trap fired"' EXIT
false
echo "fell through after false"
trap - EXIT
true
}
if ! f; then
echo "function returned failure"
else
echo "function returned success"
fiRepository: openshift/release
Length of output: 206
Failed spoke health checks may incorrectly report success
In Bash, when a function with set -euo pipefail is wrapped in if ! ..., the -e (errexit) flag is effectively disabled for the function body. If wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node, or check_pod fail (return non-zero), execution will not halt. Instead, it falls through to true at the end, causing the function to return success (0) despite the failure condition.
RunSpokePrehealthcheck() {
typeset targetSpokeName="${1:?}"
typeset spokeKubeconfigPath="${2:?}"
spokeName="${targetSpokeName}"
export KUBECONFIG="${spokeKubeconfigPath}"
- trap SpokePrehealthcheckFailureCleanup EXIT
: "Pre-upgrade health check for spoke '${spokeName}'"
OC="run_command_oc"
- : "Step `#1`: Make sure no degraded or updating mcp"
- wait_mcp_continous_success
-
- : "Step `#2`: check all cluster operators get stable and ready"
- wait_clusteroperators_continous_success
-
- : "Step `#3`: Make sure every machine is in 'Ready' status"
- check_node
-
- : "Step `#4`: check all pods are in status running or complete"
- check_pod
-
- trap - EXIT
+ if ! oc get machineconfig || \
+ ! wait_mcp_continous_success || \
+ ! wait_clusteroperators_continous_success || \
+ ! check_node || \
+ ! check_pod; then
+ WriteSpokePrehealthcheckFailureDiagnostics || true
+ return 1
+ fi
+
: "Pre-upgrade health check passed for spoke '${spokeName}'"
true
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| RunSpokePrehealthcheck() { | |
| typeset targetSpokeName="${1:?}" | |
| typeset spokeKubeconfigPath="${2:?}" | |
| spokeName="${targetSpokeName}" | |
| export KUBECONFIG="${spokeKubeconfigPath}" | |
| trap SpokePrehealthcheckFailureCleanup EXIT | |
| : "Pre-upgrade health check for spoke '${spokeName}'" | |
| OC="run_command_oc" | |
| oc get machineconfig | |
| : "Step #1: Make sure no degraded or updating mcp" | |
| wait_mcp_continous_success | |
| : "Step #2: check all cluster operators get stable and ready" | |
| wait_clusteroperators_continous_success | |
| : "Step #3: Make sure every machine is in 'Ready' status" | |
| check_node | |
| : "Step #4: check all pods are in status running or complete" | |
| check_pod | |
| trap - EXIT | |
| : "Pre-upgrade health check passed for spoke '${spokeName}'" | |
| true | |
| RunSpokePrehealthcheck() { | |
| typeset targetSpokeName="${1:?}" | |
| typeset spokeKubeconfigPath="${2:?}" | |
| spokeName="${targetSpokeName}" | |
| export KUBECONFIG="${spokeKubeconfigPath}" | |
| : "Pre-upgrade health check for spoke '${spokeName}'" | |
| OC="run_command_oc" | |
| if ! oc get machineconfig || \ | |
| ! wait_mcp_continous_success || \ | |
| ! wait_clusteroperators_continous_success || \ | |
| ! check_node || \ | |
| ! check_pod; then | |
| WriteSpokePrehealthcheckFailureDiagnostics || true | |
| return 1 | |
| fi | |
| : "Pre-upgrade health check passed for spoke '${spokeName}'" | |
| true | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`
around lines 160 - 188, The RunSpokePrehealthcheck function can return success
even after a failed health check because failures from
wait_mcp_continous_success, wait_clusteroperators_continous_success, check_node,
or check_pod are not being surfaced. Update the control flow so these checks are
executed in a way that preserves non-zero exit codes from the function body
instead of relying on a trailing true, and make sure
SpokePrehealthcheckFailureCleanup still runs on failure while the function
returns failure correctly.
| : "Step #4: check all pods are in status running or complete" | ||
| check_pod |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
check_pod never fails unhealthy spoke workloads.
Line 184 says this step validates pod health, but check_pod only prints oc get pods --all-namespaces and then returns success. Pending, Error, or CrashLoopBackOff pods on a spoke will not fail this prehealthcheck.
💡 Suggested fix
function check_pod() {
- : "Show all pods status for reference/debug"
- oc get pods --all-namespaces
- true
+ typeset unhealthy_pods
+ unhealthy_pods="$(
+ oc get pods --all-namespaces --no-headers |
+ awk '$4 != "Running" && $4 != "Completed" {print}'
+ )"
+ if [[ -n "${unhealthy_pods}" ]]; then
+ echo >&2 "Found pods not in Running/Completed state"
+ echo >&2 "${unhealthy_pods}"
+ return 1
+ fi
+ : "All pods are Running or Completed"
}Also applies to: 413-416
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/acm/interop-p2p/spoke-upgrade-prehealthcheck/acm-interop-p2p-spoke-upgrade-prehealthcheck-commands.sh`
around lines 183 - 184, The `check_pod` step is only listing pods and always
succeeding, so unhealthy spoke workloads never fail the prehealthcheck. Update
`check_pod` in the spoke upgrade prehealthcheck commands to actively inspect pod
phases and waiting/terminated states (for example, pending, error, crashloop,
not ready) and return a non-zero status when any unhealthy pod is found. Make
sure the same validation logic is used for the later call sites that also invoke
`check_pod`, so the step name and behavior match the health-check intent.
|
@sg-rh: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
acm-interop-p2p-spoke-upgrade-prehealthcheckandacm-interop-p2p-spoke-upgrade-healthcheckstep-registry stepscucushift-upgrade-prehealthcheck(MCP, ClusterOperators, nodes, pods) but target ACM managed spoke clusters instead of the single CI hubManagedClusterresources on the hub (excludinglocal-cluster), or via an explicit comma-separated env var overrideClusterDeploymentsecrets, with a fast path reusing${SHARED_DIR}/managed-cluster-kubeconfigwhen the spoke name matches${ARTIFACT_DIR}Summary by CodeRabbit
This PR adds new ACM interop P2P CI step-registry entries for spoke cluster upgrade pre-healthchecks and post-healthchecks. The new steps are designed to validate ACM-managed spoke clusters in OpenShift CI by discovering spokes from hub-side
ManagedClusterresources or optional environment overrides, resolving per-spoke kubeconfigs from Hive/CI artifacts, and running the same core OCP health checks used in existing upgrade validation flows.Practically, this expands OpenShift CI coverage for ACM Virt scenarios with: