Skip to content

Fix adm upgrade status timeouts: Optimize MachineConfig fetches and enforce client deadlines#2278

Open
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix-performance
Open

Fix adm upgrade status timeouts: Optimize MachineConfig fetches and enforce client deadlines#2278
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix-performance

Conversation

@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam YamunadeviShanmugam commented Jun 2, 2026

The oc adm upgrade status command frequently triggers context deadline exceeded in CI/CD pipelines. This PR add some fixes that improves the performance and fail faster rather than hanging indefinitely.

Fix includes modifying the command querying MachineConfigs sequentially node-by-node via multiple GET requests to single query and adding timeouts to avoid indefinite hang

Summary by CodeRabbit

  • Bug Fixes

    • Prevented status operations from hanging by introducing stricter operation timeouts and more reliable timeout handling.
    • Fixed how machine configurations are sourced so node upgrade status is reported correctly.
  • Improvements

    • More robust status assessment that yields more accurate and consistent upgrade status results.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 2, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a3b266ff-425d-4092-a649-89d6436d3343

📥 Commits

Reviewing files that changed from the base of the PR and between f46a9b9 and 0fb5d92.

📒 Files selected for processing (3)
  • pkg/cli/admin/upgrade/status/status.go
  • pkg/cli/admin/upgrade/status/workerpool.go
  • pkg/cli/admin/upgrade/status/workerpool_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/cli/admin/upgrade/status/workerpool_test.go
  • pkg/cli/admin/upgrade/status/status.go
  • pkg/cli/admin/upgrade/status/workerpool.go

Walkthrough

The upgrade status command now enforces a 5-minute command timeout and a 60-second REST client timeout. MachineConfig retrieval now lists all MachineConfig resources and builds a name→version map; workerpool logic and tests were refactored to use that map for node status assessment.

Changes

Upgrade Status Command Improvements

Layer / File(s) Summary
Timeout Configuration
pkg/cli/admin/upgrade/status/status.go
New wraps the command operation in a 5-minute timeout context, and Complete sets the REST client timeout to 60 seconds.
MachineConfig listing & version map
pkg/cli/admin/upgrade/status/status.go, pkg/cli/admin/upgrade/status/workerpool.go
Non-mocked run now lists all MachineConfig resources, then builds a machineConfigName→OpenShiftVersion map used by node assessment.
Workerpool refactor: map-based lookup & tests
pkg/cli/admin/upgrade/status/workerpool.go, pkg/cli/admin/upgrade/status/workerpool_test.go
Reworked imports, removed getMachineConfig, updated assessNodesStatus to accept the precomputed version map, replaced slice-scan lookups with map lookups, and updated tests to pass the map.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 11

❌ Failed checks (1 warning, 10 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Stable And Deterministic Test Names ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Test Structure And Quality ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Microshift Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Single Node Openshift (Sno) Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Topology-Aware Scheduling Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ote Binary Stdout Contract ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ipv6 And Disconnected Network Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Weak-Crypto ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main changes: fixing timeouts by optimizing MachineConfig fetches and enforcing client deadlines, which directly aligns with the pull request objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YamunadeviShanmugam
Once this PR has been reviewed and has the lgtm label, please assign hongkailiu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/cli/admin/upgrade/status/status.go (1)

113-119: ⚠️ Potential issue | 🟠 Major | 💤 Low value

Apply cfg.Timeout (and/or request context) to alerts Thanos GETs

  • cfg.Timeout is set in pkg/cli/admin/upgrade/status/status.go, but the alerts path only uses rest.TransportFor(cfg) and constructs client := &http.Client{Transport: roundTripper} without Timeout; checkedGet also uses http.NewRequest(..., nil) (no context). As a result, the Thanos /api/v1/alerts GETs aren’t bounded by the 60s timeout/deadline.
  • Fix by setting Timeout on that http.Client and/or creating requests with http.NewRequestWithContext(ctx, ...) so the existing deadline actually applies.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/admin/upgrade/status/status.go` around lines 113 - 119, The Thanos
alerts GETs aren't honoring cfg.Timeout because the http.Client built from
rest.TransportFor(cfg) (roundTripper) is created without a Timeout and
checkedGet builds requests with http.NewRequest (no context); update the code
that constructs client := &http.Client{Transport: roundTripper} to also set
Timeout: cfg.Timeout (or a derived duration) and modify checkedGet (or the call
site that creates requests for /api/v1/alerts) to use
http.NewRequestWithContext(ctx, ...) passing a context with the deadline so the
configured timeout on cfg and any caller context are respected.
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/status/status.go (1)

237-240: 💤 Low value

Wrap the List error with context.

Per coding guidelines, prefer fmt.Errorf("context: %w", err) so failures are attributable.

♻️ Proposed change
 		machineConfigs, err = o.MachineConfigClient.MachineconfigurationV1().MachineConfigs().List(ctx, metav1.ListOptions{})
 		if err != nil {
-			return err
+			return fmt.Errorf("failed to list machine configs: %w", err)
 		}

As per coding guidelines: "Use fmt.Errorf("context: %w", err) for error wrapping with context".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/admin/upgrade/status/status.go` around lines 237 - 240, The List call
to MachineConfigClient (MachineconfigurationV1().MachineConfigs().List) returns
an error that is currently returned raw; wrap it with fmt.Errorf to add context
before returning (e.g., fmt.Errorf("listing MachineConfigs: %w", err)) so the
failure is attributable and follows the project's error-wrapping guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 234-241: The code currently calls MachineConfigs().List(ctx,
metav1.ListOptions{}) and then repeatedly scans that full list inside
getOpenShiftVersionOfMachineConfig (invoked by assessNodesStatus), causing
O(nodes * machineConfigs) work and potential timeouts; change the flow to first
build a name->version map for only the MachineConfigs referenced by node
annotations (or filter the List by those names) before looping nodes so
getOpenShiftVersionOfMachineConfig can do O(1) lookups, and remove or wire up
the dead getMachineConfig function in workerpool.go (either delete it if unused
or use it to fetch/filter MachineConfigs). Ensure references to
MachineConfigClient.MachineconfigurationV1().MachineConfigs().List and the
functions assessNodesStatus, getOpenShiftVersionOfMachineConfig, and
getMachineConfig are updated accordingly.

---

Outside diff comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 113-119: The Thanos alerts GETs aren't honoring cfg.Timeout
because the http.Client built from rest.TransportFor(cfg) (roundTripper) is
created without a Timeout and checkedGet builds requests with http.NewRequest
(no context); update the code that constructs client := &http.Client{Transport:
roundTripper} to also set Timeout: cfg.Timeout (or a derived duration) and
modify checkedGet (or the call site that creates requests for /api/v1/alerts) to
use http.NewRequestWithContext(ctx, ...) passing a context with the deadline so
the configured timeout on cfg and any caller context are respected.

---

Nitpick comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 237-240: The List call to MachineConfigClient
(MachineconfigurationV1().MachineConfigs().List) returns an error that is
currently returned raw; wrap it with fmt.Errorf to add context before returning
(e.g., fmt.Errorf("listing MachineConfigs: %w", err)) so the failure is
attributable and follows the project's error-wrapping guideline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e6971391-3fff-44ad-98b4-554504bfd1f2

📥 Commits

Reviewing files that changed from the base of the PR and between 9557cf3 and ea59751.

📒 Files selected for processing (1)
  • pkg/cli/admin/upgrade/status/status.go

Comment thread pkg/cli/admin/upgrade/status/status.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/status/workerpool.go (1)

236-244: ⚡ Quick win

Add a focused table test for the new map builder contract.

buildMachineConfigVersionMap now owns the filtering behavior for configs missing machineconfiguration.openshift.io/release-image-version. The current tests only cover the happy path through assessNodesStatus, so a small direct table-driven test here would lock down that edge case.

As per coding guidelines, "Write unit tests for every change using table-driven test patterns and standard testing.T".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/admin/upgrade/status/workerpool.go` around lines 236 - 244, Add a
focused table-driven unit test that exercises buildMachineConfigVersionMap's
contract: ensure it returns a map only for MachineConfig objects that have the
mco.ReleaseImageVersionAnnotationKey annotation and skips ones missing that
annotation. Create test cases using mcfgv1.MachineConfig inputs (with and
without the annotation, duplicate names, empty slice) and assert the returned
map contains expected keys/values and omits configs lacking the annotation; keep
it small and independent of assessNodesStatus so the behavior is locked down.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 236-239: Wrap the error returned by
MachineConfigClient.MachineconfigurationV1().MachineConfigs().List(...) with
call-site context so timeout/cancellation reasons are clear; update the block
where machineConfigs is assigned (variable machineConfigs, call List in
status.go) to return fmt.Errorf("listing MachineConfigs: %w", err) instead of
returning the raw err, preserving the original error as the wrapped cause.

---

Nitpick comments:
In `@pkg/cli/admin/upgrade/status/workerpool.go`:
- Around line 236-244: Add a focused table-driven unit test that exercises
buildMachineConfigVersionMap's contract: ensure it returns a map only for
MachineConfig objects that have the mco.ReleaseImageVersionAnnotationKey
annotation and skips ones missing that annotation. Create test cases using
mcfgv1.MachineConfig inputs (with and without the annotation, duplicate names,
empty slice) and assert the returned map contains expected keys/values and omits
configs lacking the annotation; keep it small and independent of
assessNodesStatus so the behavior is locked down.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c2a4c4c2-692b-4b01-9cd5-bfc2f2eaaa49

📥 Commits

Reviewing files that changed from the base of the PR and between ea59751 and f46a9b9.

📒 Files selected for processing (3)
  • pkg/cli/admin/upgrade/status/status.go
  • pkg/cli/admin/upgrade/status/workerpool.go
  • pkg/cli/admin/upgrade/status/workerpool_test.go

Comment thread pkg/cli/admin/upgrade/status/status.go Outdated
  Added code optimization
@YamunadeviShanmugam YamunadeviShanmugam marked this pull request as ready for review June 2, 2026 11:14
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 2, 2026
@openshift-ci openshift-ci Bot requested review from ardaguclu and ingvagabund June 2, 2026 11:15
@YamunadeviShanmugam
Copy link
Copy Markdown
Contributor Author

/retest

@ardaguclu
Copy link
Copy Markdown
Member

This command is maintained by Update team. It is better to ask review from

oc/OWNERS_ALIASES

Lines 21 to 24 in 9557cf3

update-approvers:
- DavidHurta
- wking
- hongkailiu

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

@YamunadeviShanmugam: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn-cmd 0fb5d92 link true /test e2e-agnostic-ovn-cmd

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hongkailiu
Copy link
Copy Markdown
Member

The oc adm upgrade status command frequently triggers context deadline exceeded in CI/CD pipelines. This PR add some fixes that improves the performance and fail faster rather than hanging indefinitely.

I will start to see if it is an issue from CI (how we run the test) or from the core code of the cmd, or neither.

Thanks to @YamunadeviShanmugam for the link to a failing job.

{  oc adm upgrade status failed 1 times (of 213)
- 2026-06-01T07:59:00Z: failed to run oc adm upgrade status after 2 attempts: context deadline exceeded
recorded attempts and outputs:
Attempt #1 failed: Error running oc --kubeconfig=/tmp/kubeconfig-819642240 adm upgrade status --details=all:
StdOut>
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get clusterversions.config.openshift.io version)
StdErr>
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get clusterversions.config.openshift.io version)
exit status 1
...

In the monitor test, oc adm upgrade status has been executed 213 times and failed only once* (with retries. See details below). It is not that bad.
I do not have stat about non-100% success rate leading to a job failure. But I would imagine it is not annoying enough because the test has been there almost 1 year (and I am aware of no complaint until today). I assume /retest would fix the flakiness in most cases.

The error in the run is from https://github.com/openshift/origin/blob/2d60dffe5a572f3078f9ee8018a9d31e61eadf17/pkg/monitortests/cli/adm_upgrade/status/monitortest.go#L128-L140

The timeout 2*time.Minute (the command may have no timeout while the test does) and thus it is NOT "hanging indefinitely". ^_^

If the API server has a glitch and does not respond (see the logs ^^^ from the job) to our oc cmd, the change in the cmd (like tuning timeout in the client side) would not help much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants