Skip to content

Adding Capacity Reservation during VM Resize#4707

Open
rh-returners wants to merge 14 commits intomasterfrom
mmancebo/ARO-23134
Open

Adding Capacity Reservation during VM Resize#4707
rh-returners wants to merge 14 commits intomasterfrom
mmancebo/ARO-23134

Conversation

@rh-returners
Copy link
Copy Markdown
Collaborator

@rh-returners rh-returners commented Mar 24, 2026

What this PR does / why we need it

This PR adds Capacity Reservation Group (CRG)-backed VM resize support to the admin VM resize endpoint, allowing operators to resize a single OpenShift cluster VM while allocating capacity from an Azure Capacity Reservation Group.
It also fixes a mock generation determinism issue that was causing CI failures.

Changes

  1. New admin VM resize parameters
  • Added useCapacityReservation=true query parameter to the POST /admin/.../vmresize endpoint
  • Added optional zone query parameter to specify the Azure availability zone
  • Invalid values return HTTP 400 with descriptive CloudError responses
  1. CRG-backed single-VM resize flow (azureactions_vmresize_capacity.go)
  • New CRGResizeSingleVM orchestration function: creates a dedicated CRG + reservation, resizes the VM, then tears down the CRG post-success
  • Zone validation: verifies the target VM is actually in the requested zone before proceeding
  • Cleanup uses a fresh context.Background()-derived timeout so teardown is not cancelled if the HTTP request ends
  • Idempotent cleanup: 404 responses during reservation/CRG deletion are treated as success
  • Zone mismatch returns HTTP 400 CloudError (not 500)
  1. New Azure SDK client wrappers (pkg/util/azureclient/azuresdk/armcompute/)
  • CapacityReservationGroupsClient — CRUD for CRGs
  • CapacityReservationsClient — CRUD for capacity reservations inside a CRG
  • VirtualMachinesClient (azuresdk) — Get and Update for VM operations
  • Addon helpers with retry logic and async-delete polling (polls Get until 404 to confirm deletion)
  1. Mock generation fix
  • Switched new armcompute interfaces to source mode (-source ./file.go) for deterministic mock generation across Go versions
  • Keeps reflection mode only for the pre-existing ResourceSKUsClient
  • Reverted armcompute.go to original single-interface format; added per-interface mock files
  1. Dependency injection / interface updates
  • Extended AzureActions interface with CRGResizeSingleVM
  • Updated generated mocks accordingly

Which issue this PR addresses

ARO-23134 — Add Capacity Reservation support during VM resize operations.

Test plan

  • Unit tests for all new CRG helpers (azureactions_vmresize_capacity_test.go, 508 lines)
  • Handler-level tests updated: invalid useCapacityReservation value → 400, zone without CRG flag → 400
  • Zone mismatch test: TestCRGResizeSingleVM_ZoneMismatch_ReturnsError
  • All existing tests pass (make unit-test-go, make lint-go)

Is there any documentation that needs to be updated for this PR?

No user-facing documentation changes required. The new parameters are admin-only and will be documented in internal runbooks.

How do you know this will function as expected in production?

  1. The CRG lifecycle (create → reserve → resize → delete) is fully tested with mocked Azure SDK clients
  2. Zone validation prevents misrouted requests from silently failing or leaving orphaned resources
  3. Cleanup is idempotent: 404 on teardown is treated as success, preventing stuck states after partial failures
  4. Async CRG deletion is polled until 404 (not fire-and-forget), avoiding name-collision races on retries
  5. Post-success cleanup uses an independent context with a fixed timeout, so it completes even if the HTTP request is cancelled

@rh-returners
Copy link
Copy Markdown
Collaborator Author

@rh-mmancebo please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

@tuxerrante
Copy link
Copy Markdown
Collaborator

@copilot review

Switched from reflection mode to source mode (-source) for the new
CapacityReservationGroups, CapacityReservations, and VirtualMachines
mock clients. This ensures deterministic mock generation across Go
versions (reflection mode requires Go 1.25 which is only available
in CI, not locally).

Changes:
- generate.go: revert to ResourceSKUsClient-only reflection mode;
  add separate source-mode generate directives for the 3 new clients
- armcompute.go: revert to original single-interface version
- Add new separate mock files per interface (source-mode generated)

Co-authored-by: tuxerrante <8364469+tuxerrante@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/ARO-RP/sessions/7f872d11-3b82-4401-bed7-8af1e5550ea9
Copy link
Copy Markdown

Copilot AI commented Mar 25, 2026

@copilot review

Reviewed and addressed the CI failure in the generate-check job (commit ee0ecbd).

Root cause: The new armcompute.go mock (with 4 interfaces) was generated using reflection mode, which requires Go 1.25 (from go.mod). Locally with Go 1.24, I couldn't regenerate it identically to what CI would produce with Go 1.25, causing the isClean.sh check to fail.

Fix: Switched to source mode (-source ./file.go) for the 3 new client mocks (CapacityReservationGroups, CapacityReservations, VirtualMachines). Source mode parses the Go AST directly and is deterministic across Go versions. Each interface now has its own separate mock file:

  • capacityreservationgroups.go
  • capacityreservations.go
  • virtualmachines.go

The original armcompute.go (ResourceSKUsClient only, reflection mode) is reverted to its pre-PR state. All tests pass.

Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go Outdated
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces capacity-reservation-aware VM resize support (via transient Capacity Reservation Groups) and addresses mock generation determinism issues by switching new interfaces to mockgen source-mode generation.

Changes:

  • Add new ARM compute wrapper clients/interfaces (VirtualMachines, CapacityReservationGroups, CapacityReservations) plus deterministic mocks.
  • Implement capacity-reservation-backed single-VM resize flow in admin AzureActions and expose it via the admin VM resize endpoint (useCapacityReservation + zone).
  • Refactor admin action preparation helpers and extend unit tests/mocks to cover the new flow.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/util/mocks/azureclient/azuresdk/armcompute/virtualmachines.go New gomock for the VirtualMachines ARM wrapper client.
pkg/util/mocks/azureclient/azuresdk/armcompute/capacityreservations.go New gomock for CapacityReservations ARM wrapper client.
pkg/util/mocks/azureclient/azuresdk/armcompute/capacityreservationgroups.go New gomock for CapacityReservationGroups ARM wrapper client.
pkg/util/mocks/azureclient/azuresdk/armcompute/armcompute.go Keeps legacy ResourceSKUsClient mock; minor regen/import ordering changes.
pkg/util/mocks/adminactions/kubeactions.go Mock regen/import ordering changes.
pkg/util/mocks/adminactions/azureactions.go Extends AzureActions mock with CRG-related methods.
pkg/util/azureclient/azuresdk/armcompute/virtualmachines.go New minimal ARM VirtualMachines client wrapper construction.
pkg/util/azureclient/azuresdk/armcompute/virtualmachines_addons.go Adds helper methods (Get/List/*AndWait) for ARM VirtualMachines operations.
pkg/util/azureclient/azuresdk/armcompute/generate.go Updates go:generate to create deterministic mocks for new interfaces via source mode.
pkg/util/azureclient/azuresdk/armcompute/capacityreservations.go New minimal ARM CapacityReservations client wrapper construction.
pkg/util/azureclient/azuresdk/armcompute/capacityreservations_addons.go Adds *AndWait helpers for ARM CapacityReservations operations.
pkg/util/azureclient/azuresdk/armcompute/capacityreservationgroups.go New minimal ARM CapacityReservationGroups client wrapper construction.
pkg/util/azureclient/azuresdk/armcompute/capacityreservationgroups_addons.go Adds helper methods for CRG create/delete behavior (including 202 handling).
pkg/frontend/adminactions/azureactions.go Extends AzureActions interface and wires in new ARM compute clients.
pkg/frontend/adminactions/azureactions_vmresize_capacity.go Implements CRG-backed single VM resize + CRG lifecycle management/retries.
pkg/frontend/adminactions/azureactions_vmresize_capacity_test.go Adds unit tests for the CRG-backed resize helpers.
pkg/frontend/admin_vm_actions.go Extracts a cluster-scoped admin actions preparer (no VM name validation).
pkg/frontend/admin_openshiftcluster_vmresize.go Adds useCapacityReservation/zone handling and routes to CRG resize path.
pkg/frontend/admin_openshiftcluster_vmresize_test.go Adds endpoint tests for CRG resize path and missing-zone validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity_test.go
Copilot AI review requested due to automatic review settings April 10, 2026 09:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/admin_openshiftcluster_vmresize.go Outdated
Copilot AI review requested due to automatic review settings April 10, 2026 12:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go Outdated
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/admin_openshiftcluster_vmresize.go
- Validate zone parameter against VM's actual zone before creating any
  Azure resources in CRGResizeSingleVM to prevent unnecessary deallocation
- Use a fresh background context with timeout for post-success CRG cleanup
  so cleanup is not skipped if the request context is cancelled/deadlined
- Add tests for invalid useCapacityReservation value (400) and zone
  provided without useCapacityReservation=true (400)
- Update existing CRGResizeSingleVM tests to expect the zone validation GET

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go
Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/admin_openshiftcluster_vmresize.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/frontend/adminactions/azureactions_vmresize_capacity.go Outdated
Comment thread pkg/frontend/admin_openshiftcluster_vmresize.go
rh-returners and others added 2 commits April 15, 2026 10:15
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 15, 2026 09:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 6 to +10
//go:generate rm -rf ../../../../util/mocks/$GOPACKAGE
//go:generate mockgen -destination=../../../../util/mocks/azureclient/azuresdk/$GOPACKAGE/$GOPACKAGE.go github.com/Azure/ARO-RP/pkg/util/azureclient/azuresdk/$GOPACKAGE ResourceSKUsClient
//go:generate mockgen -source ./capacityreservationgroups.go -destination=../../../../util/mocks/azureclient/azuresdk/$GOPACKAGE/capacityreservationgroups.go github.com/Azure/ARO-RP/pkg/util/azureclient/azuresdk/$GOPACKAGE CapacityReservationGroupsClient
//go:generate mockgen -source ./capacityreservations.go -destination=../../../../util/mocks/azureclient/azuresdk/$GOPACKAGE/capacityreservations.go github.com/Azure/ARO-RP/pkg/util/azureclient/azuresdk/$GOPACKAGE CapacityReservationsClient
//go:generate mockgen -source ./virtualmachines.go -destination=../../../../util/mocks/azureclient/azuresdk/$GOPACKAGE/virtualmachines.go github.com/Azure/ARO-RP/pkg/util/azureclient/azuresdk/$GOPACKAGE VirtualMachinesClient
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rm -rf target (../../../../util/mocks/$GOPACKAGE) doesn't match the directory where mocks are generated (../../../../util/mocks/azureclient/azuresdk/$GOPACKAGE). This can leave stale generated files behind and undermine the determinism goal. Update the cleanup path to remove the actual destination directory before regenerating.

Copilot uses AI. Check for mistakes.
)

// CapacityReservationsClientAddons contains addons for CapacityReservationsClient
type CapacityReservationsClientAddons interface {
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated mock for CapacityReservationsClient includes a Get(...) method, but the exposed interface here only declares CreateOrUpdateAndWait and DeleteAndWait. This mismatch usually indicates mockgen is not generating from the intended interface definition (or the interface is missing a method). Align the interface/mocks by either adding Get here intentionally (if it’s meant to be part of the minimal wrapper) or regenerating mocks from the correct source/args so the mock method set matches the actual interface.

Suggested change
type CapacityReservationsClientAddons interface {
type CapacityReservationsClientAddons interface {
Get(ctx context.Context, resourceGroupName, capacityReservationGroupName, capacityReservationName string, options *armcompute.CapacityReservationsClientGetOptions) (armcompute.CapacityReservationsClientGetResponse, error)

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +68
func (c *capacityReservationGroupsClient) pollCRGDeleted(ctx context.Context, resourceGroupName, capacityReservationGroupName string) error {
for {
_, err := c.Get(ctx, resourceGroupName, capacityReservationGroupName, nil)
if err != nil {
var responseErr *azcore.ResponseError
if errors.As(err, &responseErr) && responseErr.StatusCode == http.StatusNotFound {
return nil
}
return err
}
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(crgDeletePollInterval):
}
}
}
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using time.After(...) inside a loop allocates a new timer each iteration. For long-running deletes this can generate unnecessary allocations. Prefer a time.Ticker (created once, stopped via defer ticker.Stop()) or reuse a time.Timer to reduce per-iteration overhead.

Copilot uses AI. Check for mistakes.
Comment on lines +283 to +301
func (a *azureActions) deleteReservationWithRetry(ctx context.Context, clusterRG, crName string) error {
for attempt := 1; attempt <= crgMaxRetries; attempt++ {
err := a.armCapacityReservations.DeleteAndWait(ctx, clusterRG, capacityReservationGroupName, crName)
if err == nil || azureerrors.IsNotFoundError(err) {
return nil
}
if !isReferencedByVMError(err) || attempt == crgMaxRetries {
return fmt.Errorf("delete target reservation %s: %w", crName, err)
}
a.log.Infof("reservation %s still referenced by VM (Azure propagation lag), retrying in %s (attempt %d/%d)",
crName, crgRetryInterval, attempt, crgMaxRetries)
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(crgRetryInterval):
}
}
return nil
}
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final return nil is effectively unreachable because the loop either returns nil on success/404 or returns an error on the final retry. Removing dead/unreachable code simplifies reasoning and avoids suggesting a silent-success path that can’t actually happen.

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +67
// Allow enough time for all retries in CRGDelete to complete.
cleanupTimeout := time.Duration(crgMaxRetries)*crgRetryInterval + 2*time.Minute
cleanCtx, cancel := context.WithTimeout(context.Background(), cleanupTimeout)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup timeout calculation is duplicated in two places. Extracting it into a small helper (or a named const/var) reduces the risk of future edits changing one path but not the other.

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +126
cleanupTimeout := time.Duration(crgMaxRetries)*crgRetryInterval + 2*time.Minute
cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), cleanupTimeout)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup timeout calculation is duplicated in two places. Extracting it into a small helper (or a named const/var) reduces the risk of future edits changing one path but not the other.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants