Skip to content

ARO-24544: migrate admin API endpoint for control plane VM resize#4733

Merged
mociarain merged 16 commits intomasterfrom
resize-cp-subagent
Apr 15, 2026
Merged

ARO-24544: migrate admin API endpoint for control plane VM resize#4733
mociarain merged 16 commits intomasterfrom
resize-cp-subagent

Conversation

@tuxerrante
Copy link
Copy Markdown
Collaborator

@tuxerrante tuxerrante commented Mar 31, 2026

Which issue this PR addresses:

Fixes ARO-24544

What this PR does / why we need it:

Migrates the control plane VM resize orchestration from the external C# Geneva Action (ResizeControlPlaneVMsOperation.cs) into the ARO-RP as a native admin API endpoint. This eliminates split-brain logic between the Geneva Action and the RP, gives the operation access to the RP's existing validation infrastructure, and makes the orchestration testable with standard Go tooling.

The new POST /admin/.../resizecontrolplane?vmSize=<sku>&deallocateVM=<bool> endpoint performs pre-flight health checks and then sequentially resizes each master node through the full lifecycle: cordon, drain (with retry), stop VM, resize VM, start VM, wait for Node Ready, uncordon, update Machine object metadata, and update Node instance-type labels.

Design choices and deviations from the original C# implementation

  • Synchronous admin operation. Matches the pattern used by stopvm, startvm, redeployvm. The UnplannedMaintenanceSignal middleware signals maintenance state.
  • deallocateVM defaults to true because Azure requires deallocation for most cross-family resizes. Callers can pass deallocateVM=false explicitly.
  • CPMS-active blocks the operation with 409 Conflict. CPMS-aware resize (patching the CPMS spec) is planned for a follow-up.
  • Reverse name-order processing (master-2master-1master-0) minimises etcd leader elections, matching the C# behaviour.
  • Pre-flight health validation via the shared _getPreResizeControlPlaneVMsValidation endpoint, which checks API server health (synchronous gate), etcd health, service principal validity, VM SKU/quota, and CPMS state.
  • API server health as a synchronous gate: if the kube-apiserver is unreachable, we fail immediately instead of spawning parallel kube-based checks that would all fail with connection errors.
  • Reuse of existing helpers: validateAdminMasterVMSize, getClusterMachines (via getControlPlaneMachines wrapper), and health checks from the pre-resize validation endpoint.
  • Machine/Node metadata updates with retry (up to 3 attempts, ctx-aware delays). The Machine update uses typed machinev1beta1.Machine objects for safety.
  • Explicit cordon/uncordon wrappers: cordonNode() and uncordonNode() wrap CordonNode(bool) to make intent visible at every call site.

Test plan for issue:

  • Unit tests (pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go):

    • TestCheckCPMSNotActive — CPMS not found, Inactive, Active (blocked), empty state, non-NotFound error (fails closed), invalid JSON (fails closed)
    • TestIsNodeReady — node ready, not ready, not found
    • TestResizeControlPlane — all-nodes-already-at-size no-op, single-node full sequence (verifies exact call ordering via gomock.InOrder), no machines found, drain/stop/resize failure cases
    • TestUpdateMachineVMSize — success, retry on conflict
    • TestUpdateNodeInstanceTypeLabels — success, retry on conflict
    • TestAdminResizeControlPlane — full HTTP integration: invalid VM size (400), cluster not found (404), subscription not found (400)
  • E2E tests (test/e2e/adminapi_resize_controlplane.go):

    • Rejects unsupported VM size (400)
    • Rejects missing vmSize parameter (400)
    • (Full resize E2E requires a live cluster and is validated manually in dev environments)
  • Local integration testing against containerized RP:

    Prerequisites: Docker (or Podman on Linux), env file at repo root (from env.example), secrets/ directory with valid credentials.

    # Build the dev container image (first time or after Dockerfile changes)
    make dev-env-build
    
    # Start the containerized RP (first run compiles from source, ~2-5 min)
    make dev-env-start
    
    # Wait for the RP to become healthy
    until curl -ksSf https://localhost:8443/healthz/ready 2>/dev/null; do sleep 5; done
    
    # Test: invalid vmSize → 400
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?vmSize=Standard_Invalid_Fake&deallocateVM=true"
    
    # Test: missing vmSize → 400
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/test-cluster/resizecontrolplane?deallocateVM=true"
    
    # Test: valid vmSize, nonexistent cluster → 404
    curl -ksS -X POST \
      "https://localhost:8443/admin/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/test-rg/providers/microsoft.redhatopenshift/openshiftclusters/nonexistent/resizecontrolplane?vmSize=Standard_D8s_v3&deallocateVM=true"
    
    # Stop the containerized RP
    make dev-env-stop
    Test case Expected
    Invalid vmSize 400 InvalidParameter
    Missing vmSize 400 InvalidParameter
    Valid vmSize, nonexistent cluster 404 ResourceNotFound

Is there any documentation that needs to be updated for this PR?

The admin API endpoint documentation in docs/deploy-development-rp.md should be updated to include an example curl command for the new /resizecontrolplane endpoint. This can be done in a follow-up.

How do you know this will function as expected in production?

  • The operation is 1:1 with the proven C# Geneva Action sequence — same steps, same order.
  • Structured logging at every step provides full observability.
  • Failure modes are handled explicitly: drain retries (3×2s), kube update retries (3×1s), node-ready polling (30min×5s), context cancellation respected throughout.
  • The UnplannedMaintenanceSignal middleware, CPMS-active check, and pre-flight health checks (apiserver gate + etcd + SP + CPMS) prevent conflicting or unsafe operations.
  • Best-effort node recovery is planned as a separate follow-up PR for easier review and revertability.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new admin API endpoint to orchestrate control plane VM resizes directly within ARO-RP, migrating the resize workflow from the external Geneva Action into the RP so it can reuse existing validation and be unit-tested.

Changes:

  • Adds POST /admin/.../resizecontrolplane route guarded by UnplannedMaintenanceSignal.
  • Implements synchronous, sequential master resize orchestration with best-effort recovery and kube metadata/label updates.
  • Adds unit tests for orchestration helpers and minimal E2E validations for request rejection cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
test/e2e/adminapi_resize_controlplane.go Adds E2E coverage for basic 400 validation cases (unsupported size, missing vmSize).
pkg/frontend/frontend.go Wires the new /resizecontrolplane admin route with maintenance signaling middleware.
pkg/frontend/admin_openshiftcluster_resize_controlplane.go New resize orchestration implementation (cordon/drain/stop/resize/start/wait/uncordon + metadata updates + recovery).
pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go Adds unit tests for CPMS gating, readiness checks, drain retries, full sequence ordering, and recovery paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go
Copy link
Copy Markdown
Collaborator

@mociarain mociarain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but there a bunch of copilot suggestions. I'm gonna treat these like the PR is still a draft i.e. when they're resolved I'll give it a more thorough review... Is this a good or bad heuristic?

tuxerrante added a commit that referenced this pull request Mar 31, 2026
…tx-aware retries

- checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from
  KubeGet; return 500 on other errors instead of silently proceeding.
  Also return errors on JSON unmarshal failure and NestedString errors
  instead of treating them as "CPMS absent".
- doUpdateMachineVMSize: handle NestedString and SetNestedField errors
  for creationTimestamp sync instead of discarding them.
- updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep
  with ctx-aware select so retries abort promptly on context cancellation.
- Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases
  for non-NotFound KubeGet error and invalid JSON (both fail closed).

Ref: #4733

Made-with: Cursor
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
tuxerrante added a commit that referenced this pull request Apr 1, 2026
…tx-aware retries

- checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from
  KubeGet; return 500 on other errors instead of silently proceeding.
  Also return errors on JSON unmarshal failure and NestedString errors
  instead of treating them as "CPMS absent".
- doUpdateMachineVMSize: handle NestedString and SetNestedField errors
  for creationTimestamp sync instead of discarding them.
- updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep
  with ctx-aware select so retries abort promptly on context cancellation.
- Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases
  for non-NotFound KubeGet error and invalid JSON (both fail closed).

Ref: #4733

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 1, 2026 15:36
@tuxerrante tuxerrante force-pushed the resize-cp-subagent branch from fece767 to 31e9d5e Compare April 1, 2026 15:36
@tuxerrante tuxerrante changed the title feat(frontend): add admin API endpoint for control plane VM resize ARO-24544: migrate admin API endpoint for control plane VM resize Apr 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 40 out of 44 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

pkg/util/azureclient/azuresdk/common/options.go:51

  • shouldRetry() no longer reads the response body correctly: var b []byte; resp.Body.Read(b) reads 0 bytes, so the retry checks for AADSTS/AuthorizationFailed will never match. It also doesn’t restore/close the body, which can break downstream SDK error handling and leak connections. Please revert to fully reading the body (e.g., io.ReadAll), close the original body, and restore resp.Body so it can be read again; keep/restore the unit tests that validated this behavior.
	// Check if the body contains the certain strings that can be retried.
	var b []byte
	_, err = resp.Body.Read(b)
	if err != nil {
		return true
	}
	body := string(b)
	return strings.Contains(body, ErrCodeInvalidClientSecretProvided) ||
		strings.Contains(body, ErrCodeMissingRequiredParameters) ||
		strings.Contains(body, AuthorizationFailed)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/util/changefeed/subscriptioncache_test.go Outdated
Comment thread pkg/util/acrtoken/acrtoken.go Outdated
Comment thread pkg/util/acrtoken/acrtoken.go Outdated
Comment thread pkg/util/acrtoken/acrtoken.go Outdated
Comment thread pkg/util/acrtoken/acrtoken.go Outdated
Comment thread pkg/deploy/generator/resources_gateway.go Outdated
Comment thread pkg/util/azureclient/mgmt/containerregistry/tokens_addons.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go
Comment thread pkg/util/version/const.go Outdated
Copilot AI review requested due to automatic review settings April 2, 2026 09:01
Comment thread pkg/frontend/admin_openshiftcluster_resize_controlplane.go Outdated
@bizz001
Copy link
Copy Markdown
Collaborator

bizz001 commented Apr 8, 2026

@tuxerrante Since we are moving from various smaller individual API calls to a single long running synchronous call, should we be concerned by the possibility of connection drops or load balancer timeouts? If I am not mistaken If the connection is closed the context is instantly killed.

@tuxerrante
Copy link
Copy Markdown
Collaborator Author

tuxerrante commented Apr 9, 2026

@tuxerrante Since we are moving from various smaller individual API calls to a single long running synchronous call, should we be concerned by the possibility of connection drops or load balancer timeouts? If I am not mistaken If the connection is closed the context is instantly killed.

@bizz001 Yes, this is a valid concern.

This endpoint currently follows the same request-scoped synchronous model used by existing admin imperative APIs, rather than the ARM async-operation path:

So yes: if client/LB drops the connection, context cancellation can abort an in-flight operation. That behavior is consistent with the current admin API model; the main difference here is that this operation is longer, so the exposure window is larger.

A full async approach would require a broader architectural change (persisted progress + background worker/reconciler), closer to the async ARM pattern (operationsstatus/operationresults) here:
pkg/frontend/frontend.go#L292-L294

Given scope, this PR keeps parity with existing admin behavior; near-term hardening is idempotency/recovery so reruns are safe after interruptions (ARO-24543).

Copilot AI review requested due to automatic review settings April 10, 2026 07:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go
@github-actions github-actions bot added the needs-rebase branch needs a rebase label Apr 13, 2026
@github-actions
Copy link
Copy Markdown

Please rebase pull request.

tuxerrante and others added 16 commits April 14, 2026 09:02
Add a new admin API POST endpoint `resizecontrolplane` that performs
sequential in-place resizing of all control plane VMs in an ARO cluster.

The operation follows a safe rolling approach: for each master node
(processed in reverse order starting from the highest-numbered, least
critical node), it cordons, drains, stops, resizes, starts the VM,
waits for the node to become Ready, uncordons, and updates the Machine
object and Node labels to reflect the new VM size.

Key design decisions:
- Pre-flight validation via _getPreResizeControlPlaneVMsValidation
  checks API server health, etcd health, service principal validity,
  VM SKU availability, and compute quota before starting.
- CPMS (ControlPlaneMachineSet) guard ensures the operator is not
  Active, preventing conflicts with automated machine management.
- Drain uses retries with context-aware delays (moved to kubeActions
  interface as DrainNodeWithRetries for reusability).
- Node readiness polling uses wait.PollImmediate for idiomatic k8s
  wait patterns.
- Machine object updates use typed machinev1beta1.Machine structs
  instead of Unstructured for type safety.
- Node instance-type labels use shared constants
  (nodeLabelInstanceType, nodeLabelBetaInstanceType).
- Recovery logic intentionally omitted — will be added as a separate
  PR for easier review and revert if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When resizeControlPlaneNode fails before the VM SKU has been changed
(drain, stop, or resize step), attempt to restore the node to a
schedulable state:

- If the node is still running (drain/stop failed): uncordon it.
- If the VM was stopped (resize failed): start VM, wait for node
  Ready, then uncordon.
- If the node does not become Ready after recovery start, leave it
  cordoned per SOP — SRE should verify health before re-enabling
  scheduling.

The original error is always returned with recovery outcome appended
so SREs can see both the failure reason and the cluster state.

Ref: #4723 (comment)
Made-with: Cursor
…tx-aware retries

- checkCPMSNotActive: only ignore NotFound/CRD-not-installed errors from
  KubeGet; return 500 on other errors instead of silently proceeding.
  Also return errors on JSON unmarshal failure and NestedString errors
  instead of treating them as "CPMS absent".
- doUpdateMachineVMSize: handle NestedString and SetNestedField errors
  for creationTimestamp sync instead of discarding them.
- updateMachineVMSize, updateNodeInstanceTypeLabels: replace time.Sleep
  with ctx-aware select so retries abort promptly on context cancellation.
- Tests updated: CPMS mocks now use kerrors.NewNotFound; added test cases
  for non-NotFound KubeGet error and invalid JSON (both fail closed).

Ref: #4733

Made-with: Cursor
- Remove recovery code (bestEffortUncordon, bestEffortRecoverVM,
  resizeRecoveryError) per reviewer request to keep as separate PR
- Use wait.PollImmediateUntilWithContext for waitForNodeReady
- Revert doUpdateMachineVMSize to typed Machine object approach
- Move checkCPMSNotActive to prevalidation pipeline
- Remove recovery-related tests, add simple failure tests

Made-with: Cursor
… sort cleanup

- Make validateAPIServerHealth a synchronous gate before parallel checks
  so kube-unreachable errors are reported once instead of N times
- Add cordonNode/uncordonNode wrappers around CordonNode(bool) to make
  intent explicit at every call site
- Add getControlPlaneMachines wrapper to clarify that the returned map
  only contains master machines (getClusterMachines already filters)
- Replace sort.Sort(sort.Reverse(sort.StringSlice(…))) with
  slices.SortedFunc(maps.Keys(…)) and add comment explaining why we
  process in reverse order
- Add CPMS mock to allKubeChecksHealthyMock in pre-validation tests

Made-with: Cursor
Add targeted unit tests for malformed Node readiness payloads and VM start/uncordon failures so regressions in critical resize error handling are caught earlier.

Made-with: Cursor
Validate deallocateVM strictly to prevent silent behavior changes from invalid values and refactor control-plane preflight validation to reuse already-fetched docs and clients in the resize handler. Add a happy-path resize handler test to cover the shared prevalidation helper end-to-end.

Made-with: Cursor
Use a shared retry helper with range-loop structure and make the max-attempts constant the single source of truth, so retry behavior is explicit and easier to reason about during review.

Made-with: Cursor
Ensure Machine providerSpec metadata.creationTimestamp is synchronized with the Machine object before update so resize updates satisfy Machine API validation semantics.

Made-with: Cursor
Replace marshal/unmarshal conversion with DefaultUnstructuredConverter so typed-to-unstructured translation is explicit while preserving the same update payload behavior.

Made-with: Cursor
Use explicit max-attempt semantics in DrainNodeWithRetries so drain retries run a consistent number of attempts and logs/errors match actual behavior.

Made-with: Cursor
Prevent the resize loop from continuing when any control plane node is NotReady or still unschedulable, even if a machine already matches the desired VM size. This closes the skip-path safety gap raised in review and adds test coverage for the new pre-loop guard.

Made-with: Cursor
Align prevalidation flow with related resize work by gating kube-apiserver pod checks behind the ClusterOperator health check, reducing redundant failures and future merge conflicts. Extend unit coverage for pod-level API server validation and update resize handler mocks to reflect the new preflight behavior.

Made-with: Cursor
Return RequestNotAllowed (409) via api.NewCloudError when no control plane Machines are found so adminReply preserves the intended status instead of falling back to a generic 500.

Made-with: Cursor
Decode Node and ControlPlaneMachineSet with their typed APIs and set providerSpec creationTimestamp via typed fields so resize checks are safer and easier to review. This removes fragile unstructured field lookups for known resources while preserving unstructured only at the final KubeCreateOrUpdate boundary.

Made-with: Cursor
Align the rebased prevalidation helper and resize handler tests with the merged readyz gate and response semantics so the resize-control-plane branch stays green after replaying onto master.

Made-with: Cursor
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Apr 14, 2026
Copy link
Copy Markdown
Collaborator

@mrWinston mrWinston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your effort! Looks good to me now 🚀

Copy link
Copy Markdown
Collaborator

@mociarain mociarain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mociarain mociarain merged commit 2b36003 into master Apr 15, 2026
31 checks passed
@mociarain mociarain deleted the resize-cp-subagent branch April 15, 2026 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants