Fix adm upgrade status timeouts: Optimize MachineConfig fetches and enforce client deadlines#2278
Conversation
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
WalkthroughThe upgrade status command now enforces a 5-minute command timeout and a 60-second REST client timeout. MachineConfig retrieval now lists all MachineConfig resources and builds a name→version map; workerpool logic and tests were refactored to use that map for node status assessment. ChangesUpgrade Status Command Improvements
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 11❌ Failed checks (1 warning, 10 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Tools execution failed with the following error: Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error) Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: YamunadeviShanmugam The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/cli/admin/upgrade/status/status.go (1)
113-119:⚠️ Potential issue | 🟠 Major | 💤 Low valueApply
cfg.Timeout(and/or request context) to alerts Thanos GETs
cfg.Timeoutis set inpkg/cli/admin/upgrade/status/status.go, but the alerts path only usesrest.TransportFor(cfg)and constructsclient := &http.Client{Transport: roundTripper}withoutTimeout;checkedGetalso useshttp.NewRequest(..., nil)(no context). As a result, the Thanos/api/v1/alertsGETs aren’t bounded by the 60s timeout/deadline.- Fix by setting
Timeouton thathttp.Clientand/or creating requests withhttp.NewRequestWithContext(ctx, ...)so the existing deadline actually applies.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/admin/upgrade/status/status.go` around lines 113 - 119, The Thanos alerts GETs aren't honoring cfg.Timeout because the http.Client built from rest.TransportFor(cfg) (roundTripper) is created without a Timeout and checkedGet builds requests with http.NewRequest (no context); update the code that constructs client := &http.Client{Transport: roundTripper} to also set Timeout: cfg.Timeout (or a derived duration) and modify checkedGet (or the call site that creates requests for /api/v1/alerts) to use http.NewRequestWithContext(ctx, ...) passing a context with the deadline so the configured timeout on cfg and any caller context are respected.
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/status/status.go (1)
237-240: 💤 Low valueWrap the List error with context.
Per coding guidelines, prefer
fmt.Errorf("context: %w", err)so failures are attributable.♻️ Proposed change
machineConfigs, err = o.MachineConfigClient.MachineconfigurationV1().MachineConfigs().List(ctx, metav1.ListOptions{}) if err != nil { - return err + return fmt.Errorf("failed to list machine configs: %w", err) }As per coding guidelines: "Use
fmt.Errorf("context: %w", err)for error wrapping with context".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/admin/upgrade/status/status.go` around lines 237 - 240, The List call to MachineConfigClient (MachineconfigurationV1().MachineConfigs().List) returns an error that is currently returned raw; wrap it with fmt.Errorf to add context before returning (e.g., fmt.Errorf("listing MachineConfigs: %w", err)) so the failure is attributable and follows the project's error-wrapping guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 234-241: The code currently calls MachineConfigs().List(ctx,
metav1.ListOptions{}) and then repeatedly scans that full list inside
getOpenShiftVersionOfMachineConfig (invoked by assessNodesStatus), causing
O(nodes * machineConfigs) work and potential timeouts; change the flow to first
build a name->version map for only the MachineConfigs referenced by node
annotations (or filter the List by those names) before looping nodes so
getOpenShiftVersionOfMachineConfig can do O(1) lookups, and remove or wire up
the dead getMachineConfig function in workerpool.go (either delete it if unused
or use it to fetch/filter MachineConfigs). Ensure references to
MachineConfigClient.MachineconfigurationV1().MachineConfigs().List and the
functions assessNodesStatus, getOpenShiftVersionOfMachineConfig, and
getMachineConfig are updated accordingly.
---
Outside diff comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 113-119: The Thanos alerts GETs aren't honoring cfg.Timeout
because the http.Client built from rest.TransportFor(cfg) (roundTripper) is
created without a Timeout and checkedGet builds requests with http.NewRequest
(no context); update the code that constructs client := &http.Client{Transport:
roundTripper} to also set Timeout: cfg.Timeout (or a derived duration) and
modify checkedGet (or the call site that creates requests for /api/v1/alerts) to
use http.NewRequestWithContext(ctx, ...) passing a context with the deadline so
the configured timeout on cfg and any caller context are respected.
---
Nitpick comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 237-240: The List call to MachineConfigClient
(MachineconfigurationV1().MachineConfigs().List) returns an error that is
currently returned raw; wrap it with fmt.Errorf to add context before returning
(e.g., fmt.Errorf("listing MachineConfigs: %w", err)) so the failure is
attributable and follows the project's error-wrapping guideline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e6971391-3fff-44ad-98b4-554504bfd1f2
📒 Files selected for processing (1)
pkg/cli/admin/upgrade/status/status.go
ff05c96 to
f46a9b9
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/cli/admin/upgrade/status/workerpool.go (1)
236-244: ⚡ Quick winAdd a focused table test for the new map builder contract.
buildMachineConfigVersionMapnow owns the filtering behavior for configs missingmachineconfiguration.openshift.io/release-image-version. The current tests only cover the happy path throughassessNodesStatus, so a small direct table-driven test here would lock down that edge case.As per coding guidelines, "Write unit tests for every change using table-driven test patterns and standard
testing.T".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/admin/upgrade/status/workerpool.go` around lines 236 - 244, Add a focused table-driven unit test that exercises buildMachineConfigVersionMap's contract: ensure it returns a map only for MachineConfig objects that have the mco.ReleaseImageVersionAnnotationKey annotation and skips ones missing that annotation. Create test cases using mcfgv1.MachineConfig inputs (with and without the annotation, duplicate names, empty slice) and assert the returned map contains expected keys/values and omits configs lacking the annotation; keep it small and independent of assessNodesStatus so the behavior is locked down.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/cli/admin/upgrade/status/status.go`:
- Around line 236-239: Wrap the error returned by
MachineConfigClient.MachineconfigurationV1().MachineConfigs().List(...) with
call-site context so timeout/cancellation reasons are clear; update the block
where machineConfigs is assigned (variable machineConfigs, call List in
status.go) to return fmt.Errorf("listing MachineConfigs: %w", err) instead of
returning the raw err, preserving the original error as the wrapped cause.
---
Nitpick comments:
In `@pkg/cli/admin/upgrade/status/workerpool.go`:
- Around line 236-244: Add a focused table-driven unit test that exercises
buildMachineConfigVersionMap's contract: ensure it returns a map only for
MachineConfig objects that have the mco.ReleaseImageVersionAnnotationKey
annotation and skips ones missing that annotation. Create test cases using
mcfgv1.MachineConfig inputs (with and without the annotation, duplicate names,
empty slice) and assert the returned map contains expected keys/values and omits
configs lacking the annotation; keep it small and independent of
assessNodesStatus so the behavior is locked down.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c2a4c4c2-692b-4b01-9cd5-bfc2f2eaaa49
📒 Files selected for processing (3)
pkg/cli/admin/upgrade/status/status.gopkg/cli/admin/upgrade/status/workerpool.gopkg/cli/admin/upgrade/status/workerpool_test.go
Added code optimization
f46a9b9 to
0fb5d92
Compare
|
/retest |
|
This command is maintained by Update team. It is better to ask review from Lines 21 to 24 in 9557cf3 |
|
@YamunadeviShanmugam: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
I will start to see if it is an issue from CI (how we run the test) or from the core code of the cmd, or neither. Thanks to @YamunadeviShanmugam for the link to a failing job. In the monitor test, The error in the run is from https://github.com/openshift/origin/blob/2d60dffe5a572f3078f9ee8018a9d31e61eadf17/pkg/monitortests/cli/adm_upgrade/status/monitortest.go#L128-L140 The timeout If the API server has a glitch and does not respond (see the logs ^^^ from the job) to our oc cmd, the change in the cmd (like tuning timeout in the client side) would not help much. |
The oc adm upgrade status command frequently triggers context deadline exceeded in CI/CD pipelines. This PR add some fixes that improves the performance and fail faster rather than hanging indefinitely.
Fix includes modifying the command querying MachineConfigs sequentially node-by-node via multiple GET requests to single query and adding timeouts to avoid indefinite hang
Summary by CodeRabbit
Bug Fixes
Improvements