ARO-24544: migrate admin API endpoint for control plane VM resize by tuxerrante · Pull Request #4733 · Azure/ARO-RP

tuxerrante · 2026-03-31T08:48:15Z

Which issue this PR addresses:

What this PR does / why we need it:

Migrates the control plane VM resize orchestration from the external C# Geneva Action (ResizeControlPlaneVMsOperation.cs) into the ARO-RP as a native admin API endpoint. This eliminates split-brain logic between the Geneva Action and the RP, gives the operation access to the RP's existing validation infrastructure, and makes the orchestration testable with standard Go tooling.

The new POST /admin/.../resizecontrolplane?vmSize=<sku>&deallocateVM=<bool> endpoint performs pre-flight health checks and then sequentially resizes each master node through the full lifecycle: cordon, drain (with retry), stop VM, resize VM, start VM, wait for Node Ready, uncordon, update Machine object metadata, and update Node instance-type labels.

Design choices and deviations from the original C# implementation

Synchronous admin operation. Matches the pattern used by stopvm, startvm, redeployvm. The UnplannedMaintenanceSignal middleware signals maintenance state.
deallocateVM defaults to true because Azure requires deallocation for most cross-family resizes. Callers can pass deallocateVM=false explicitly.
CPMS-active blocks the operation with 409 Conflict. CPMS-aware resize (patching the CPMS spec) is planned for a follow-up.
Reverse name-order processing (master-2 → master-1 → master-0) minimises etcd leader elections, matching the C# behaviour.
Pre-flight health validation via the shared _getPreResizeControlPlaneVMsValidation endpoint, which checks API server health (synchronous gate), etcd health, service principal validity, VM SKU/quota, and CPMS state.
API server health as a synchronous gate: if the kube-apiserver is unreachable, we fail immediately instead of spawning parallel kube-based checks that would all fail with connection errors.
Reuse of existing helpers: validateAdminMasterVMSize, getClusterMachines (via getControlPlaneMachines wrapper), and health checks from the pre-resize validation endpoint.
Machine/Node metadata updates with retry (up to 3 attempts, ctx-aware delays). The Machine update uses typed machinev1beta1.Machine objects for safety.
Explicit cordon/uncordon wrappers: cordonNode() and uncordonNode() wrap CordonNode(bool) to make intent visible at every call site.

Test plan for issue:

Unit tests (pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go):
- TestCheckCPMSNotActive — CPMS not found, Inactive, Active (blocked), empty state, non-NotFound error (fails closed), invalid JSON (fails closed)
- TestIsNodeReady — node ready, not ready, not found
- TestResizeControlPlane — all-nodes-already-at-size no-op, single-node full sequence (verifies exact call ordering via gomock.InOrder), no machines found, drain/stop/resize failure cases
- TestUpdateMachineVMSize — success, retry on conflict
- TestUpdateNodeInstanceTypeLabels — success, retry on conflict
- TestAdminResizeControlPlane — full HTTP integration: invalid VM size (400), cluster not found (404), subscription not found (400)
E2E tests (test/e2e/adminapi_resize_controlplane.go):
- Rejects unsupported VM size (400)
- Rejects missing vmSize parameter (400)
- (Full resize E2E requires a live cluster and is validated manually in dev environments)

Local integration testing against containerized RP:

Prerequisites: Docker (or Podman on Linux), env file at repo root (from env.example), secrets/ directory with valid credentials.

# Build the dev container image (first time or after Dockerfile changes)
make dev-env-build

# Start the containerized RP (first run compiles from source, ~2-5 min)
make dev-env-start

# Wait for the RP to become healthy
until curl -ksSf https://localhost:8443/healthz/ready 2>/dev/null; do sleep 5; done

# Test: invalid vmSize → 400
curl -ksS -X POST \
  "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?vmSize=Standard_Invalid_Fake&deallocateVM=true"

# Test: missing vmSize → 400
curl -ksS -X POST \
  "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?deallocateVM=true"

# Test: valid vmSize, nonexistent cluster → 404
curl -ksS -X POST \
  "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/nonexistent/resizecontrolplane?vmSize=Standard_D8s_v3&deallocateVM=true"

# Stop the containerized RP
make dev-env-stop

Test case	Expected
Invalid `vmSize`	`400 InvalidParameter`
Missing `vmSize`	`400 InvalidParameter`
Valid `vmSize`, nonexistent cluster	`404 ResourceNotFound`

Is there any documentation that needs to be updated for this PR?

The admin API endpoint documentation in docs/deploy-development-rp.md should be updated to include an example curl command for the new /resizecontrolplane endpoint. This can be done in a follow-up.

How do you know this will function as expected in production?

The operation is 1:1 with the proven C# Geneva Action sequence — same steps, same order.
Structured logging at every step provides full observability.
Failure modes are handled explicitly: drain retries (3×2s), kube update retries (3×1s), node-ready polling (30min×5s), context cancellation respected throughout.
The UnplannedMaintenanceSignal middleware, CPMS-active check, and pre-flight health checks (apiserver gate + etcd + SP + CPMS) prevent conflicting or unsafe operations.
Best-effort node recovery is planned as a separate follow-up PR for easier review and revertability.

Copilot

Pull request overview

Adds a new admin API endpoint to orchestrate control plane VM resizes directly within ARO-RP, migrating the resize workflow from the external Geneva Action into the RP so it can reuse existing validation and be unit-tested.

Changes:

Adds POST /admin/.../resizecontrolplane route guarded by UnplannedMaintenanceSignal.
Implements synchronous, sequential master resize orchestration with best-effort recovery and kube metadata/label updates.
Adds unit tests for orchestration helpers and minimal E2E validations for request rejection cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File	Description
test/e2e/adminapi_resize_controlplane.go	Adds E2E coverage for basic 400 validation cases (unsupported size, missing `vmSize`).
pkg/frontend/frontend.go	Wires the new `/resizecontrolplane` admin route with maintenance signaling middleware.
pkg/frontend/admin_openshiftcluster_resize_controlplane.go	New resize orchestration implementation (cordon/drain/stop/resize/start/wait/uncordon + metadata updates + recovery).
pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go	Adds unit tests for CPMS gating, readiness checks, drain retries, full sequence ordering, and recovery paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mociarain

LGTM but there a bunch of copilot suggestions. I'm gonna treat these like the PR is still a draft i.e. when they're resolved I'll give it a more thorough review... Is this a good or bad heuristic?

…tx-aware retries - checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from KubeGet; return 500 on other errors instead of silently proceeding. Also return errors on JSON unmarshal failure and NestedString errors instead of treating them as "CPMS absent". - doUpdateMachineVMSize: handle NestedString and SetNestedField errors for creationTimestamp sync instead of discarding them. - updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep with ctx-aware select so retries abort promptly on context cancellation. - Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases for non-NotFound KubeGet error and invalid JSON (both fail closed). Ref: #4733 Made-with: Cursor

Copilot

Pull request overview

Copilot reviewed 40 out of 44 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

pkg/util/azureclient/azuresdk/common/options.go:51

shouldRetry() no longer reads the response body correctly: var b []byte; resp.Body.Read(b) reads 0 bytes, so the retry checks for AADSTS/AuthorizationFailed will never match. It also doesn’t restore/close the body, which can break downstream SDK error handling and leak connections. Please revert to fully reading the body (e.g., io.ReadAll), close the original body, and restore resp.Body so it can be read again; keep/restore the unit tests that validated this behavior.

	// Check if the body contains the certain strings that can be retried.
	var b []byte
	_, err = resp.Body.Read(b)
	if err != nil {
		return true
	}
	body := string(b)
	return strings.Contains(body, ErrCodeInvalidClientSecretProvided) ||
		strings.Contains(body, ErrCodeMissingRequiredParameters) ||
		strings.Contains(body, AuthorizationFailed)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

bizz001 · 2026-04-08T16:00:05Z

@tuxerrante Since we are moving from various smaller individual API calls to a single long running synchronous call, should we be concerned by the possibility of connection drops or load balancer timeouts? If I am not mistaken If the connection is closed the context is instantly killed.

tuxerrante · 2026-04-09T10:55:24Z

@tuxerrante Since we are moving from various smaller individual API calls to a single long running synchronous call, should we be concerned by the possibility of connection drops or load balancer timeouts? If I am not mistaken If the connection is closed the context is instantly killed.

@bizz001 Yes, this is a valid concern.

This endpoint currently follows the same request-scoped synchronous model used by existing admin imperative APIs, rather than the ARM async-operation path:

Admin route wiring for these endpoints (redeployvm, stopvm, startvm, resize, resizecontrolplane, cordonnode, drainnode):
pkg/frontend/frontend.go#L374-L393
Existing handlers all start from request context (ctx := r.Context()), e.g.
postAdminOpenShiftClusterStopVM,
postAdminOpenShiftClusterStartVM,
postAdminOpenShiftClusterRedeployVM,
postAdminOpenShiftClusterVMResize,
postAdminOpenShiftClusterCordonNode,
postAdminOpenShiftClusterDrainNode
New endpoint uses the same model:
postAdminResizeControlPlane ->
_postAdminResizeControlPlane ->
resizeControlPlaneNode

So yes: if client/LB drops the connection, context cancellation can abort an in-flight operation. That behavior is consistent with the current admin API model; the main difference here is that this operation is longer, so the exposure window is larger.

A full async approach would require a broader architectural change (persisted progress + background worker/reconciler), closer to the async ARM pattern (operationsstatus/operationresults) here:
pkg/frontend/frontend.go#L292-L294

Given scope, this PR keeps parity with existing admin behavior; near-term hardening is idempotency/recovery so reruns are safe after interruptions (ARO-24543).

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

github-actions · 2026-04-13T13:38:16Z

Please rebase pull request.

Add a new admin API POST endpoint `resizecontrolplane` that performs sequential in-place resizing of all control plane VMs in an ARO cluster. The operation follows a safe rolling approach: for each master node (processed in reverse order starting from the highest-numbered, least critical node), it cordons, drains, stops, resizes, starts the VM, waits for the node to become Ready, uncordons, and updates the Machine object and Node labels to reflect the new VM size. Key design decisions: - Pre-flight validation via _getPreResizeControlPlaneVMsValidation checks API server health, etcd health, service principal validity, VM SKU availability, and compute quota before starting. - CPMS (ControlPlaneMachineSet) guard ensures the operator is not Active, preventing conflicts with automated machine management. - Drain uses retries with context-aware delays (moved to kubeActions interface as DrainNodeWithRetries for reusability). - Node readiness polling uses wait.PollImmediate for idiomatic k8s wait patterns. - Machine object updates use typed machinev1beta1.Machine structs instead of Unstructured for type safety. - Node instance-type labels use shared constants (nodeLabelInstanceType, nodeLabelBetaInstanceType). - Recovery logic intentionally omitted — will be added as a separate PR for easier review and revert if needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When resizeControlPlaneNode fails before the VM SKU has been changed (drain, stop, or resize step), attempt to restore the node to a schedulable state: - If the node is still running (drain/stop failed): uncordon it. - If the VM was stopped (resize failed): start VM, wait for node Ready, then uncordon. - If the node does not become Ready after recovery start, leave it cordoned per SOP — SRE should verify health before re-enabling scheduling. The original error is always returned with recovery outcome appended so SREs can see both the failure reason and the cluster state. Ref: #4723 (comment) Made-with: Cursor

…tx-aware retries - checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from KubeGet; return 500 on other errors instead of silently proceeding. Also return errors on JSON unmarshal failure and NestedString errors instead of treating them as "CPMS absent". - doUpdateMachineVMSize: handle NestedString and SetNestedField errors for creationTimestamp sync instead of discarding them. - updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep with ctx-aware select so retries abort promptly on context cancellation. - Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases for non-NotFound KubeGet error and invalid JSON (both fail closed). Ref: #4733 Made-with: Cursor

- Remove recovery code (bestEffortUncordon, bestEffortRecoverVM, resizeRecoveryError) per reviewer request to keep as separate PR - Use wait.PollImmediateUntilWithContext for waitForNodeReady - Revert doUpdateMachineVMSize to typed Machine object approach - Move checkCPMSNotActive to prevalidation pipeline - Remove recovery-related tests, add simple failure tests Made-with: Cursor

… sort cleanup - Make validateAPIServerHealth a synchronous gate before parallel checks so kube-unreachable errors are reported once instead of N times - Add cordonNode/uncordonNode wrappers around CordonNode(bool) to make intent explicit at every call site - Add getControlPlaneMachines wrapper to clarify that the returned map only contains master machines (getClusterMachines already filters) - Replace sort.Sort(sort.Reverse(sort.StringSlice(…))) with slices.SortedFunc(maps.Keys(…)) and add comment explaining why we process in reverse order - Add CPMS mock to allKubeChecksHealthyMock in pre-validation tests Made-with: Cursor

Add targeted unit tests for malformed Node readiness payloads and VM start/uncordon failures so regressions in critical resize error handling are caught earlier. Made-with: Cursor

Validate deallocateVM strictly to prevent silent behavior changes from invalid values and refactor control-plane preflight validation to reuse already-fetched docs and clients in the resize handler. Add a happy-path resize handler test to cover the shared prevalidation helper end-to-end. Made-with: Cursor

Use a shared retry helper with range-loop structure and make the max-attempts constant the single source of truth, so retry behavior is explicit and easier to reason about during review. Made-with: Cursor

Ensure Machine providerSpec metadata.creationTimestamp is synchronized with the Machine object before update so resize updates satisfy Machine API validation semantics. Made-with: Cursor

Replace marshal/unmarshal conversion with DefaultUnstructuredConverter so typed-to-unstructured translation is explicit while preserving the same update payload behavior. Made-with: Cursor

Use explicit max-attempt semantics in DrainNodeWithRetries so drain retries run a consistent number of attempts and logs/errors match actual behavior. Made-with: Cursor

Prevent the resize loop from continuing when any control plane node is NotReady or still unschedulable, even if a machine already matches the desired VM size. This closes the skip-path safety gap raised in review and adds test coverage for the new pre-loop guard. Made-with: Cursor

Align prevalidation flow with related resize work by gating kube-apiserver pod checks behind the ClusterOperator health check, reducing redundant failures and future merge conflicts. Extend unit coverage for pod-level API server validation and update resize handler mocks to reflect the new preflight behavior. Made-with: Cursor

Return RequestNotAllowed (409) via api.NewCloudError when no control plane Machines are found so adminReply preserves the intended status instead of falling back to a generic 500. Made-with: Cursor

Decode Node and ControlPlaneMachineSet with their typed APIs and set providerSpec creationTimestamp via typed fields so resize checks are safer and easier to review. This removes fragile unstructured field lookups for known resources while preserving unstructured only at the final KubeCreateOrUpdate boundary. Made-with: Cursor

Align the rebased prevalidation helper and resize handler tests with the merged readyz gate and response semantics so the resize-control-plane branch stays green after replaying onto master. Made-with: Cursor

mrWinston

Thanks for all your effort! Looks good to me now 🚀

mociarain

LGTM

Copilot AI review requested due to automatic review settings March 31, 2026 08:48

tuxerrante requested review from alcasim, bennerv, cadenmarchese, hawkowl, hlipsig, jharrington22, kevinobriendotca, kimorris27, mociarain, mrWinston, rogbas, sankur-codes, tiguelu, tsatam, ventifus, wanghaoran1988 and yjst2012 as code owners March 31, 2026 08:48

Copilot started reviewing on behalf of tuxerrante March 31, 2026 08:48 View session

Copilot AI reviewed Mar 31, 2026

View reviewed changes

mociarain reviewed Mar 31, 2026

View reviewed changes

mrWinston requested changes Mar 31, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 1, 2026 15:36

tuxerrante force-pushed the resize-cp-subagent branch from fece767 to 31e9d5e Compare April 1, 2026 15:36

Copilot started reviewing on behalf of tuxerrante April 1, 2026 15:37 View session

tuxerrante changed the title ~~feat(frontend): add admin API endpoint for control plane VM resize~~ ARO-24544: migrate admin API endpoint for control plane VM resize Apr 1, 2026

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 2, 2026 09:01

mrWinston reviewed Apr 8, 2026

View reviewed changes

Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated

Copilot AI review requested due to automatic review settings April 10, 2026 07:45

Copilot started reviewing on behalf of tuxerrante April 10, 2026 07:46 View session

Copilot AI reviewed Apr 10, 2026

View reviewed changes

Comment thread pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go

github-actions bot added the needs-rebase branch needs a rebase label Apr 13, 2026

tuxerrante and others added 16 commits April 14, 2026 09:02

test(frontend): extend resize control plane error-path coverage

5bdef61

Add targeted unit tests for malformed Node readiness payloads and VM start/uncordon failures so regressions in critical resize error handling are caught earlier. Made-with: Cursor

fix(frontend): align kube object update attempt semantics

a5b1df9

Use a shared retry helper with range-loop structure and make the max-attempts constant the single source of truth, so retry behavior is explicit and easier to reason about during review. Made-with: Cursor

fix(frontend): sync providerSpec creation timestamp on resize

ff3c580

Ensure Machine providerSpec metadata.creationTimestamp is synchronized with the Machine object before update so resize updates satisfy Machine API validation semantics. Made-with: Cursor

refactor(frontend): avoid json roundtrip for machine conversion

643f219

Replace marshal/unmarshal conversion with DefaultUnstructuredConverter so typed-to-unstructured translation is explicit while preserving the same update payload behavior. Made-with: Cursor

fix(frontend): align drain retry attempt semantics

da2359c

Use explicit max-attempt semantics in DrainNodeWithRetries so drain retries run a consistent number of attempts and logs/errors match actual behavior. Made-with: Cursor

fix(frontend): return typed conflict on empty control plane machine set

f4a0b13

Return RequestNotAllowed (409) via api.NewCloudError when no control plane Machines are found so adminReply preserves the intended status instead of falling back to a generic 500. Made-with: Cursor

fix(frontend): resolve resize prevalidation rebase fallout

b98a1c0

Align the rebased prevalidation helper and resize handler tests with the merged readyz gate and response semantics so the resize-control-plane branch stays green after replaying onto master. Made-with: Cursor

tuxerrante force-pushed the resize-cp-subagent branch from 738d622 to b98a1c0 Compare April 14, 2026 07:04

github-actions bot removed the needs-rebase branch needs a rebase label Apr 14, 2026

mrWinston approved these changes Apr 14, 2026

View reviewed changes

mociarain approved these changes Apr 15, 2026

View reviewed changes

mociarain merged commit 2b36003 into master Apr 15, 2026
31 checks passed

mociarain deleted the resize-cp-subagent branch April 15, 2026 11:48

Conversation

tuxerrante commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue this PR addresses:

What this PR does / why we need it:

Design choices and deviations from the original C# implementation

Test plan for issue:

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mociarain left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bizz001 commented Apr 8, 2026

Uh oh!

tuxerrante commented Apr 9, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

mrWinston left a comment

Choose a reason for hiding this comment

Uh oh!

mociarain left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

tuxerrante commented Mar 31, 2026 •

edited

Loading

tuxerrante commented Apr 9, 2026 •

edited by openshift-ci bot

Loading