CNTRLPLANE-2262: Add Azure scale-from-zero support #8337

Open
jhjaggars wants to merge 7 commits into openshift:main from jhjaggars:azure-scale-from-zero

@jhjaggars jhjaggars commented Apr 24, 2026

Extend the existing scale-from-zero autoscaling framework to support Azure by implementing an Azure instance type provider that queries the Azure Resource SKUs API for VM size specifications and writing capacity annotations on MachineDeployments.

Changes:

  • Add Azure instancetype.Provider using armcompute.ResourceSKUsClient
  • Add AzureMachineTemplate case to scale_from_zero.go type switch
  • Extend supportedScaleFromZeroPlatform() for Azure
  • Extend reconcileScaleFromZeroAnnotations() for Azure
  • Update autoscalerEnabledCondition() to accept Azure with min=0
  • Update effectiveMin guard in capi.go to allow min=0 for Azure
  • Add "azure" to supportedProviders in main.go and install.go
  • Add Azure provider initialization with credential file parsing
  • Update CRD CEL validation to allow min=0 for Azure platform
  • Add unit tests for Azure provider and extended type switches

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features
    • Scale-from-zero autoscaling now supported on Azure as well as AWS; operator/CLI accept Azure as a provider and use Azure SKU data for instance-type info.
  • Bug Fixes
    • Replica and autoscaler behavior updated so min=0 is honored for Azure where supported.
  • Documentation
    • API and CRD docs updated to reflect Azure support for scale-from-zero.
  • Tests
    • Added/updated tests covering Azure scale-from-zero, instance-type parsing, and annotation behavior.

@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci Bot commented Apr 24, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot commented Apr 24, 2026

@jhjaggars: This pull request references CNTRLPLANE-2262 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label Apr 24, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Apr 24, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/needs-area, area/api, area/cli, area/hypershift-operator, and area/platform/azure labels and removed the do-not-merge/needs-area label Apr 24, 2026

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR extends scale-from-zero support to Azure in addition to AWS. CRD XValidation for NodePoolSpec now permits autoScaling.min=0 on Azure. The operator CLI and bootstrap accept --scale-from-zero-provider=azure and initialize an Azure instancetype provider that queries and caches Azure Resource SKUs. Controller logic and tests were updated so autoscaling min 0 is honored for Azure, NodePool reconciliation reads AzureMachineTemplate VM sizes, and scale-from-zero annotation reconciliation handles Azure templates.

Changes

Scale-from-zero: Azure support + Azure instancetype provider

| Layer | File(s) | Summary |
|---|---|---|
| API / Schema | api/hypershift/v1beta1/nodepool_types.go, api/.../nodepools*.yaml | XValidation and OpenAPI docs for spec.autoScaling.min broadened: autoScaling.min=0 allowed for platform.type == Azure as well as AWS. |
| Core controller behavior | hypershift-operator/controllers/nodepool/nodepool_controller.go, hypershift-operator/controllers/nodepool/capi.go, hypershift-operator/controllers/nodepool/conditions.go, hypershift-operator/controllers/nodepool/scale_from_zero.go | Reconcile gate switched to configurable ScaleFromZeroPlatform; enforcement that bumped effective min to 1 excludes Azure; reconcileScaleFromZeroAnnotations and setScaleFromZeroAnnotationsOnObject gain Azure handling (read AzureMachineTemplate VMSize and apply annotations/taints accordingly). |
| Instantiation / Provider implementation | hypershift-operator/controllers/nodepool/instancetype/azure/provider.go | New Azure instancetype Provider: lazy-loads and caches Azure Resource SKUs (paginated), transforms SKUs into InstanceTypeInfo (vCPU, MemoryMb, GPUs, CPU arch), exposes GetInstanceTypeInfo. |
| Provider tests | hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go | New tests and mocks validating SKU transformation, capability parsing, error cases, GetInstanceTypeInfo lookup behavior, and capability helper. |
| CLI / bootstrap wiring | hypershift-operator/main.go, cmd/install/install.go, go.mod | --scale-from-zero-provider accepts azure; main reads Azure creds JSON (subscriptionId, clientId, clientSecret, tenantId, location), constructs Azure credential and ResourceSKUs client, initializes Azure instancetype provider; go.mod adds Azure SDK dependency entry. |
| Tests / Integration | hypershift-operator/controllers/nodepool/scale_from_zero_test.go, hypershift-operator/controllers/nodepool/capi_test.go, cmd/install/assets/crds/.../stable.nodepools.autoscaling.testsuite.yaml, docs, e2e import formatting | Unit tests updated to expect Azure can scale-from-zero and to cover Azure template cases; CRD test-suite and docs updated to reflect Azure support; small e2e import reformat. |

Sequence Diagram(s)

sequenceDiagram
  participant OperatorMain as Operator (main/boot)
  participant AzureSKUs as Azure Resource SKUs API
  participant InstTypeProv as Azure Instancetype Provider
  participant K8sAPI as Kubernetes API (CAPI objects)
  participant NodePoolCtrl as NodePool Controller

  OperatorMain->>AzureSKUs: Read credentials & location\ncreate SKUs client
  OperatorMain->>InstTypeProv: NewProvider(skuClient, location)
  Note over InstTypeProv: Provider initialized (cache empty)

  NodePoolCtrl->>K8sAPI: Get NodePool / AzureMachineTemplate
  K8sAPI-->>NodePoolCtrl: return AzureMachineTemplate (VMSize)
  NodePoolCtrl->>InstTypeProv: GetInstanceTypeInfo(ctx, vmSize)
  InstTypeProv->>AzureSKUs: ListPager() / Paginate SKUs
  AzureSKUs-->>InstTypeProv: SKU pages
  InstTypeProv-->>NodePoolCtrl: InstanceTypeInfo (vCPUs, Memory, GPUs)
  NodePoolCtrl->>K8sAPI: Patch node template annotations\n(scale-from-zero capacity/taints)
  K8sAPI-->>NodePoolCtrl: Patch result
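
The lazy-load-and-cache step in the diagram (GetInstanceTypeInfo triggering a paginated SKU listing) could be sketched roughly as follows. This is a minimal illustration, not the PR's code: the skuPager interface, fakePager, the newPager field, and the trimmed InstanceTypeInfo struct are all hypothetical stand-ins for the Azure SDK pager and the real provider types. The sketch fills a temporary map and assigns the cache only after the whole walk succeeds, so a failed page cannot leave partial results behind.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// InstanceTypeInfo is a trimmed-down, hypothetical stand-in for the provider's result type.
type InstanceTypeInfo struct {
	VCPU     int32
	MemoryMb int64
}

// skuPager is a hypothetical abstraction over a paginated SKU listing.
type skuPager interface {
	More() bool
	NextPage() (map[string]InstanceTypeInfo, error)
}

// Provider lazily loads SKU data and caches it only after a complete walk.
type Provider struct {
	mu       sync.Mutex
	cache    map[string]InstanceTypeInfo
	newPager func() skuPager
}

// loadAll walks every page into a temporary map; any page error discards it.
func loadAll(pager skuPager) (map[string]InstanceTypeInfo, error) {
	tmp := make(map[string]InstanceTypeInfo)
	for pager.More() {
		page, err := pager.NextPage()
		if err != nil {
			return nil, err
		}
		for name, info := range page {
			tmp[name] = info
		}
	}
	return tmp, nil
}

// GetInstanceTypeInfo loads the cache on first use and retries after a failed load.
func (p *Provider) GetInstanceTypeInfo(vmSize string) (InstanceTypeInfo, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cache == nil {
		loaded, err := loadAll(p.newPager())
		if err != nil {
			return InstanceTypeInfo{}, fmt.Errorf("loading SKUs: %w", err)
		}
		p.cache = loaded // assigned only after the full walk succeeded
	}
	info, ok := p.cache[vmSize]
	if !ok {
		return InstanceTypeInfo{}, fmt.Errorf("VM size %q not found", vmSize)
	}
	return info, nil
}

// fakePager serves canned pages and fails at page index failAt (-1 means never).
type fakePager struct {
	pages  []map[string]InstanceTypeInfo
	failAt int
	idx    int
}

func (f *fakePager) More() bool { return f.idx < len(f.pages) }

func (f *fakePager) NextPage() (map[string]InstanceTypeInfo, error) {
	if f.idx == f.failAt {
		return nil, errors.New("transient listing failure")
	}
	page := f.pages[f.idx]
	f.idx++
	return page, nil
}

func main() {
	pages := []map[string]InstanceTypeInfo{
		{"size-a": {VCPU: 2, MemoryMb: 8192}},
		{"size-b": {VCPU: 4, MemoryMb: 16384}},
	}
	calls := 0
	p := &Provider{newPager: func() skuPager {
		calls++
		failAt := -1
		if calls == 1 {
			failAt = 1 // the first load fails on its second page
		}
		return &fakePager{pages: pages, failAt: failAt}
	}}
	_, err := p.GetInstanceTypeInfo("size-a")
	fmt.Println("first lookup failed:", err != nil)
	info, err := p.GetInstanceTypeInfo("size-b")
	fmt.Println(info.VCPU, info.MemoryMb, err)
}
```

Holding the mutex for the whole load keeps the sketch simple; a real controller might prefer a different concurrency strategy.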

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'CNTRLPLANE-2262: Add Azure scale-from-zero support' accurately and concisely describes the primary change in the PR, clearly communicating the main objective of extending scale-from-zero functionality to the Azure platform. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names are static, deterministic strings. Tests use standard Go testing with t.Run() and static name literals. No dynamic content found in test titles. |
| Test Structure And Quality | ✅ Passed | Check designed for Ginkgo tests. PR contains only standard Go table-driven unit tests, no Ginkgo framework. Not applicable. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added to this PR. The PR only modifies unit tests and reorders imports in existing e2e tests. The custom check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added. PR adds only unit tests for Azure scale-from-zero implementation. SNO compatibility check does not apply. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Adds Azure scale-from-zero support. No topology-breaking scheduling constraints: no affinity rules, node selectors, topology spread, or topology-dependent replica logic. |
| Ote Binary Stdout Contract | ✅ Passed | PR adds Azure scale-from-zero support without violating OTE Binary Stdout Contract. New code uses JSON-based logging (stderr) and structured error handling only. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added. Only unit tests using standard Go testing.T were added. The e2e test file had import-only changes. Check does not apply. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/install/install.go (1)

251-263: ⚠️ Potential issue | 🟡 Minor

Expose Azure in the CLI help text as well.

Validation accepts azure here, but the --scale-from-zero-provider help string still says "Platform type for scale-from-zero autoscaling (aws)" at line 394, so hypershift install --help will still advertise AWS-only support.

✏️ Suggested follow-up
-	cmd.PersistentFlags().StringVar(&opts.ScaleFromZeroProvider, "scale-from-zero-provider", opts.ScaleFromZeroProvider, "Platform type for scale-from-zero autoscaling (aws)")
+	cmd.PersistentFlags().StringVar(&opts.ScaleFromZeroProvider, "scale-from-zero-provider", opts.ScaleFromZeroProvider, "Platform type for scale-from-zero autoscaling (aws, azure)")
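
As an alternative to hard-coding the provider names, the help string could be derived from the same list that validation uses, so the two cannot drift apart. The following is a minimal sketch under assumptions: the supportedProviders declaration shown here and the helper name scaleFromZeroProviderHelp are illustrative and not taken from install.go.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedProviders mirrors the validation list named in the review comment.
// Its exact declaration in install.go is an assumption.
var supportedProviders = []string{"aws", "azure"}

// scaleFromZeroProviderHelp (hypothetical helper) builds the flag help text
// from supportedProviders instead of a hard-coded provider list.
func scaleFromZeroProviderHelp() string {
	return fmt.Sprintf(
		"Platform type for scale-from-zero autoscaling (%s)",
		strings.Join(supportedProviders, ", "),
	)
}

func main() {
	fmt.Println(scaleFromZeroProviderHelp())
}
```

The returned string could then be passed as the usage argument of the existing StringVar call.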
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/hypershift/v1beta1/nodepool_types.go`:
- Line 109: Update the comment on the NodePoolAutoScaling.Min field to reflect
that scale-from-zero (min=0) is supported for both AWS and Azure platforms;
locate the NodePoolAutoScaling struct and the Min field comment in
nodepool_types.go (symbol: NodePoolAutoScaling.Min) and change the text that
currently mentions only AWS to mention "AWS and Azure" so the CRD/OpenAPI schema
and docs match the XValidation rule. Ensure the wording mirrors the XValidation
message: "Scale-from-zero (autoScaling.min=0) is supported for AWS and Azure
platforms."

In `@hypershift-operator/controllers/nodepool/instancetype/azure/provider.go`:
- Around line 48-50: The cache is being set before the full Azure SKU pagination
succeeds, causing partial results to persist after pager/NextPage() failures;
modify loadSKUs to populate a local temporary map (e.g., tempCache) while
walking pages and only assign it to p.cache (and any related fields) after the
entire walk succeeds, and ensure GetInstanceTypeInfo still checks p.cache==nil
to trigger reloads; apply the same pattern to the similar block around lines
63-75 so that p.cache is only updated on successful completion of the full
SKU load.

In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Line 437: The current platform check returns true for Azure unconditionally
which lets Azure NodePools enter the scale-from-zero path even when the
configured provider (scale-from-zero-provider) is AWS; update the platform gate
to require both the cluster platform and the configured InstanceTypeProvider
match: modify the function that currently returns "return platform ==
hyperv1.AWSPlatform || platform == hyperv1.AzurePlatform" to instead check the
configured provider (e.g., scaleFromZeroProvider/InstanceTypeProvider) and only
return true when platform==AWS && provider==aws OR platform==Azure &&
provider==azure (use the actual flag/field name used to hold the
--scale-from-zero-provider value and the InstanceTypeProvider symbol in the
reconciler).

In `@hypershift-operator/main.go`:
- Around line 487-499: The code parses Azure credentials into the azureCreds
struct but only validates Location; update the validation after json.Unmarshal
to ensure SubscriptionID, ClientID, ClientSecret and TenantID are non-empty
before creating the client. Specifically, in the block that defines azureCreds
and calls json.Unmarshal, add checks for azureCreds.SubscriptionID,
azureCreds.ClientID, azureCreds.ClientSecret and azureCreds.TenantID and return
descriptive fmt.Errorf errors (or a single aggregated error) if any are empty so
client creation logic (using these fields) never runs with missing values.
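
The credential check described in the last item above might look roughly like this. It is a sketch only: the JSON keys come from the walkthrough (subscriptionId, clientId, clientSecret, tenantId, location), while the struct name azureCredentials, the function parseAzureCredentials, and the aggregated error wording are assumptions rather than the code in main.go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// azureCredentials is a hypothetical struct matching the JSON keys listed in the walkthrough.
type azureCredentials struct {
	SubscriptionID string `json:"subscriptionId"`
	ClientID       string `json:"clientId"`
	ClientSecret   string `json:"clientSecret"`
	TenantID       string `json:"tenantId"`
	Location       string `json:"location"`
}

// parseAzureCredentials unmarshals the credentials file and reports every
// empty required field in one aggregated error.
func parseAzureCredentials(raw []byte) (*azureCredentials, error) {
	var creds azureCredentials
	if err := json.Unmarshal(raw, &creds); err != nil {
		return nil, fmt.Errorf("failed to parse azure credentials: %w", err)
	}
	required := []struct{ name, value string }{
		{"subscriptionId", creds.SubscriptionID},
		{"clientId", creds.ClientID},
		{"clientSecret", creds.ClientSecret},
		{"tenantId", creds.TenantID},
		{"location", creds.Location},
	}
	var missing []string
	for _, f := range required {
		if strings.TrimSpace(f.value) == "" {
			missing = append(missing, f.name)
		}
	}
	if len(missing) > 0 {
		return nil, fmt.Errorf("azure credentials missing required fields: %s", strings.Join(missing, ", "))
	}
	return &creds, nil
}

func main() {
	_, err := parseAzureCredentials([]byte(`{"subscriptionId":"s","clientId":"c","location":"eastus"}`))
	fmt.Println(err)
}
```

Aggregating the missing field names into one error lets an operator fix the credentials file in a single pass.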

---

Outside diff comments:
In `@cmd/install/install.go`:
- Around lines 251-263: The CLI help text still advertises AWS-only support for
--scale-from-zero-provider even though supportedProviders includes "azure";
update the help string of the ScaleFromZeroProvider flag (currently "Platform
type for scale-from-zero autoscaling (aws)") to list both aws and azure (or a
dynamic list built from supportedProviders) so that the help output matches the
validation in supportedProviders.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ffada2e8-18eb-4f47-9fc5-f9200533350a

📥 Commits

Reviewing files that changed from the base of the PR and between c1a8bb6 and 7c4cfe6.

📒 Files selected for processing (11)
  • api/hypershift/v1beta1/nodepool_types.go
  • cmd/install/install.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/scale_from_zero.go
  • hypershift-operator/controllers/nodepool/scale_from_zero_test.go
  • hypershift-operator/main.go

Comment thread api/hypershift/v1beta1/nodepool_types.go
Comment thread hypershift-operator/controllers/nodepool/nodepool_controller.go Outdated
Comment thread hypershift-operator/main.go

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 59.78836% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.46%. Comparing base (38f7b17) to head (ee42ce8).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| hypershift-operator/main.go | 0.00% | 62 Missing ⚠️ |
| ...erator/controllers/nodepool/nodepool_controller.go | 0.00% | 10 Missing ⚠️ |
| ...ontrollers/nodepool/instancetype/azure/provider.go | 97.16% | 2 Missing and 1 partial ⚠️ |
| ...rshift-operator/controllers/nodepool/conditions.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8337      +/-   ##
==========================================
+ Coverage   41.43%   41.46%   +0.03%     
==========================================
  Files         756      757       +1     
  Lines       93647    93820     +173     
==========================================
+ Hits        38802    38906     +104     
- Misses      52124    52192      +68     
- Partials     2721     2722       +1     
| Files with missing lines | Coverage Δ |
|---|---|
| cmd/install/install.go | 63.04% <100.00%> (-0.04%) ⬇️ |
| hypershift-operator/controllers/nodepool/capi.go | 71.77% <100.00%> (ø) |
| ...t-operator/controllers/nodepool/scale_from_zero.go | 100.00% <100.00%> (ø) |
| ...rshift-operator/controllers/nodepool/conditions.go | 54.06% <0.00%> (+0.13%) ⬆️ |
| ...ontrollers/nodepool/instancetype/azure/provider.go | 97.16% <97.16%> (ø) |
| ...erator/controllers/nodepool/nodepool_controller.go | 42.79% <0.00%> (-0.35%) ⬇️ |
| hypershift-operator/main.go | 0.00% <0.00%> (ø) |

| Flag | Coverage Δ |
|---|---|
| cmd-support | 34.87% <100.00%> (-0.01%) ⬇️ |
| cpo-hostedcontrolplane | 43.50% <ø> (ø) |
| cpo-other | 42.74% <ø> (ø) |
| hypershift-operator | 51.63% <59.35%> (+0.06%) ⬆️ |
| other | 31.64% <ø> (ø) |

@jhjaggars jhjaggars force-pushed the azure-scale-from-zero branch 2 times, most recently from 0b01bc1 to b3477bf May 7, 2026 15:31

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/capi.go`:
- Around lines 774-775: The code exempts Azure from the effectiveMin=1 floor based
solely on nodePool.Spec.Platform.Type; change this to also require that the
operator/runtime has Azure scale-from-zero support wired up. Update the
condition around effectiveMin (the place that checks nodePool.Spec.Platform.Type
and sets effectiveMin) to call the runtime-configured check (e.g., a function or
map such as supportsScaleFromZero(platform) or scaleFromZeroProviders[platform])
and only force effectiveMin=1 when the platform is not supported for
scale-from-zero or when the runtime config does not indicate Azure
scale-from-zero is enabled; apply the same guarded change to the other identical
branch referenced by the comment (the block around the other effectiveMin
handling). Ensure you reference and use the existing runtime config/provider
flag/function rather than just nodePool.Spec.Platform.Type.

In `@hypershift-operator/main.go`:
- Around line 527-537: The Azure scale-from-zero path creates credentials and a
ResourceSKUs client with nil options, ignoring AZURE_CLOUD_NAME; update the
NewClientSecretCredential and armcompute.NewResourceSKUsClient calls to use the
same cloud-specific client options used elsewhere (the resolved
cloud/environment options derived from AZURE_CLOUD_NAME) so the credential and
skuClient target the correct sovereign endpoints (use azureCreds and the
resolved azure cloud options when constructing cred and skuClient before
assigning instanceTypeProvider and scaleFromZeroPlatform and logging).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 255fb7ac-fd14-43de-b4ad-7f7fc9cd02c4

📥 Commits

Reviewing files that changed from the base of the PR and between 0b01bc1 and b3477bf.

⛔ Files ignored due to path filters (8)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/stable.nodepools.autoscaling.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (12)
  • api/hypershift/v1beta1/nodepool_types.go
  • cmd/install/install.go
  • go.mod
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/scale_from_zero.go
  • hypershift-operator/controllers/nodepool/scale_from_zero_test.go
  • hypershift-operator/main.go
✅ Files skipped from review due to trivial changes (1)
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • hypershift-operator/controllers/nodepool/scale_from_zero_test.go
  • cmd/install/install.go
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go

Comment thread hypershift-operator/controllers/nodepool/capi.go Outdated
Comment thread hypershift-operator/main.go Outdated
@jhjaggars jhjaggars force-pushed the azure-scale-from-zero branch from b3477bf to 242417f May 7, 2026 19:22
@openshift-ci openshift-ci Bot added the area/documentation label May 7, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
hypershift-operator/controllers/nodepool/capi.go (1)

774-775: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Azure min=0 guard still checks only platform type, not runtime scale-from-zero configuration.

The exemption added for hyperv1.AzurePlatform allows effectiveMin=0 for any Azure NodePool, regardless of whether the operator was started with the Azure scale-from-zero provider wired (--scale-from-zero-provider=azure). Without that provider, the scale-from-zero capacity annotations are never written, so the autoscaler receives a zero-minimum pool with no instance-type metadata and cannot scale back up, permanently stalling the pool.

The fix should key this exemption off the runtime-configured provider set rather than the static platform type. The identical issue exists in both setMachineDeploymentReplicas (Line 774) and setMachineSetReplicas (Line 1083).

#!/bin/bash
# Look for any existing runtime check or helper that exposes whether a given platform
# has scale-from-zero support configured (e.g. supportedScaleFromZeroPlatform,
# scaleFromZeroProviders, or similar).
rg -n --type=go -C4 'scaleFromZero|ScaleFromZero|scale_from_zero' \
  --glob '!*_test.go'

Also applies to: 1083-1085
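
One way to key the exemption off the runtime configuration rather than the platform type alone is sketched below. This is illustrative only: the function signature, the local PlatformType stand-in, and the convention that an empty configuredProvider means "no provider wired" are assumptions, not the code in capi.go.

```go
package main

import "fmt"

// PlatformType is a local stand-in for the hyperv1 platform type.
type PlatformType string

const (
	AWSPlatform   PlatformType = "AWS"
	AzurePlatform PlatformType = "Azure"
)

// effectiveMin (hypothetical signature) returns the minimum replica count to
// hand to the autoscaler. A requested min of 0 is honored only when the
// operator was started with a scale-from-zero provider matching the
// NodePool's platform; otherwise the floor of 1 applies.
func effectiveMin(requestedMin int32, platform, configuredProvider PlatformType) int32 {
	if requestedMin == 0 && (configuredProvider == "" || platform != configuredProvider) {
		return 1
	}
	return requestedMin
}

func main() {
	fmt.Println(effectiveMin(0, AzurePlatform, AzurePlatform)) // provider wired for Azure
	fmt.Println(effectiveMin(0, AzurePlatform, AWSPlatform))   // provider wired for AWS only
	fmt.Println(effectiveMin(0, AzurePlatform, ""))            // no provider configured
}
```

The same helper could serve both setMachineDeploymentReplicas and setMachineSetReplicas, which would keep the two branches from diverging.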

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/instancetype/azure/provider.go`:
- Around line 130-136: The code currently ignores parse errors and accepts
negative GPU counts; instead validate and fail fast: after calling
getCapabilityValue and obtaining gpuStr, attempt strconv.ParseInt(gpuStr, 10,
32) and if err != nil or the parsed value is negative, return an error (or log
and propagate) with context including gpuStr and the SKU identifier rather than
silently setting info.GPU; only assign info.GPU = int32(gpu) when parsing
succeeds and gpu >= 0. Use the existing gpuStr, getCapabilityValue,
strconv.ParseInt and info.GPU symbols to locate and implement the checks and
error propagation.
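
The fail-fast GPU parsing requested above could be sketched as follows. The helper name parseGPUCount and the choice to treat an empty capability string as zero GPUs are assumptions made for illustration; only strconv.ParseInt with base 10 and bit size 32 comes from the comment itself.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGPUCount (hypothetical helper) converts a SKU's GPU capability string
// into a count. Treating an empty string as "capability absent, zero GPUs"
// is an assumption of this sketch.
func parseGPUCount(skuName, gpuStr string) (int32, error) {
	if gpuStr == "" {
		return 0, nil
	}
	gpu, err := strconv.ParseInt(gpuStr, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("SKU %s: invalid GPU count %q: %w", skuName, gpuStr, err)
	}
	if gpu < 0 {
		return 0, fmt.Errorf("SKU %s: negative GPU count %q", skuName, gpuStr)
	}
	return int32(gpu), nil
}

func main() {
	for _, s := range []string{"4", "", "-1", "four"} {
		n, err := parseGPUCount("example-sku", s)
		fmt.Println(n, err)
	}
}
```

Because the bit size passed to ParseInt is 32, the conversion to int32 cannot overflow once parsing succeeds.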

---

Duplicate comments:
In `@hypershift-operator/controllers/nodepool/capi.go`:
- Around line 774-775: The Azure min=0 exemption currently checks
nodePool.Spec.Platform.Type directly; update the condition in both
setMachineDeploymentReplicas and setMachineSetReplicas so it verifies the
runtime-configured scale-from-zero provider set (e.g., call the existing
helper/flag that exposes configured providers such as scaleFromZeroProviders /
supportedScaleFromZeroPlatform or the operator config tied to
--scale-from-zero-provider) instead of checking hyperv1.AzurePlatform; change
the if that sets effectiveMin to 0 to require the runtime provider to include
"azure" (or the helper to return true) before allowing effectiveMin==0 so pools
only get zero-min when the operator actually supports Azure scale-from-zero.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4a1461b0-6355-45e0-b5b5-14af5c2bc3e3

📥 Commits

Reviewing files that changed from the base of the PR and between b3477bf and 242417f.

⛔ Files ignored due to path filters (10)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/stable.nodepools.autoscaling.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (12)
  • api/hypershift/v1beta1/nodepool_types.go
  • cmd/install/install.go
  • go.mod
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/scale_from_zero.go
  • hypershift-operator/controllers/nodepool/scale_from_zero_test.go
  • hypershift-operator/main.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • cmd/install/install.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
  • hypershift-operator/controllers/nodepool/scale_from_zero_test.go
  • hypershift-operator/main.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/scale_from_zero.go

@jhjaggars jhjaggars force-pushed the azure-scale-from-zero branch from 242417f to 826f26a Compare May 7, 2026 20:42
@openshift-ci openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label May 7, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
hypershift-operator/controllers/nodepool/capi.go (2)

1083-1085: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Same unguarded Azure platform-type check in setMachineSetReplicas.

Same issue as lines 774–776: effectiveMin=0 is allowed for Azure based on platform type alone, with no check for runtime-configured Azure scale-from-zero support.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/capi.go` around lines 1083 - 1085,
In setMachineSetReplicas the AzurePlatform check is unguarded, so effectiveMin
may remain 0 based solely on platform type; update the condition to also
verify the runtime-configured Azure scale-from-zero provider (the same runtime
check recommended for the guard around lines 774 - 776) before allowing
effectiveMin to remain 0. Specifically, change the if that references
nodePool.Spec.Platform.Type == hyperv1.AzurePlatform to require the
runtime scale-from-zero check (e.g., call the existing
isAzureScaleFromZeroEnabled/clusterConfig.ScaleFromZero.Azure/feature helper
used elsewhere) so Azure only gets scale-from-zero behavior when the runtime
config enables it, leaving AWS and other logic unchanged.

774-776: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

[Still unresolved from previous review] Gate Azure min=0 on runtime-configured provider, not just platform type.

This guard continues to permit effectiveMin=0 for any Azure NodePool based solely on nodePool.Spec.Platform.Type, regardless of whether the operator was started with --scale-from-zero-provider=azure. If the Azure instancetype provider is not wired up at startup, the scale-from-zero annotation path won't populate the capacity metadata the autoscaler needs—leaving the pool permanently stuck at 0 replicas with no recovery path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/capi.go` around lines 774 - 776, The
current guard raises effectiveMin to 1 only for platforms other than AWS and
Azure, keyed solely on nodePool.Spec.Platform.Type, which still allows Azure
pools to stay at 0 when
the operator wasn't started with the Azure scale-from-zero provider; update the
condition in the block that assigns effectiveMin (around the effectiveMin
variable usage) to also check the operator's runtime configuration for enabled
scale-from-zero providers (e.g., consult the operator config or the
ScaleFromZeroProviders/scaleFromZeroProvider flag mechanism used at startup) and
only permit min=0 for Azure when the Azure provider is actually enabled; in
practice change the if that references nodePool.Spec.Platform.Type and
hyperv1.AzurePlatform to require both platform==Azure and
providerEnabled("azure") (or the equivalent runtime-config boolean/collection
used by the operator) before allowing effectiveMin to remain 0.
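
The guard requested in this prompt can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in capi.go: the `platformType` type, the `scaleFromZeroAllowed` and `effectiveMin` helpers, and passing the configured provider as a plain string are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// platformType is a hypothetical stand-in for hyperv1.PlatformType.
type platformType string

const (
	awsPlatform   platformType = "AWS"
	azurePlatform platformType = "Azure"
	agentPlatform platformType = "Agent"
)

// scaleFromZeroAllowed reports whether a NodePool on the given platform may
// keep min=0. It is true only when the operator was started with a
// scale-from-zero provider that matches the NodePool's platform.
func scaleFromZeroAllowed(platform platformType, configuredProvider string) bool {
	if configuredProvider == "" {
		return false
	}
	return strings.EqualFold(string(platform), configuredProvider)
}

// effectiveMin raises a requested minimum of 0 to 1 unless scale-from-zero
// is actually wired up for this platform at runtime.
func effectiveMin(requestedMin int32, platform platformType, configuredProvider string) int32 {
	if requestedMin == 0 && !scaleFromZeroAllowed(platform, configuredProvider) {
		return 1
	}
	return requestedMin
}

func main() {
	fmt.Println(effectiveMin(0, azurePlatform, "azure")) // 0: provider matches
	fmt.Println(effectiveMin(0, azurePlatform, ""))      // 1: no provider configured
	fmt.Println(effectiveMin(0, agentPlatform, "azure")) // 1: platform mismatch
}
```

Keying the decision on the configured provider rather than on the platform type alone is what prevents an Azure pool from sitting at 0 replicas when nothing is populating its capacity annotations.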

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d92e2f97-9c70-40b5-9329-3c634ae4f483

📥 Commits

Reviewing files that changed from the base of the PR and between 242417f and 826f26a.

⛔ Files ignored due to path filters (1)
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (22)
  • api/hypershift/v1beta1/nodepool_types.go
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/stable.nodepools.autoscaling.testsuite.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml
  • cmd/install/install.go
  • docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md
  • go.mod
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/scale_from_zero.go
  • hypershift-operator/controllers/nodepool/scale_from_zero_test.go
  • hypershift-operator/main.go
  • test/e2e/nodepool_test.go
✅ Files skipped from review due to trivial changes (3)
  • test/e2e/nodepool_test.go
  • docs/content/reference/aggregated-docs.md
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml
🚧 Files skipped from review as they are similar to previous changes (9)
  • hypershift-operator/controllers/nodepool/conditions.go
  • cmd/install/install.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • go.mod
  • hypershift-operator/controllers/nodepool/scale_from_zero.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider.go
  • hypershift-operator/controllers/nodepool/instancetype/azure/provider_test.go
  • hypershift-operator/main.go

@jhjaggars
Contributor Author

/test all

@jhjaggars jhjaggars marked this pull request as ready for review May 8, 2026 20:10
@jhjaggars jhjaggars force-pushed the azure-scale-from-zero branch from a4e2f66 to 7fc9734 Compare May 13, 2026 14:22
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8337 May 13, 2026 14:30 Inactive
@celebdor celebdor removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 14, 2026
Contributor

@everettraven everettraven left a comment

Seems like a pretty straightforward API change.

Approved from an API perspective.

/approve

@openshift-ci
Contributor

openshift-ci Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: everettraven, jhjaggars
Once this PR has been reviewed and has the lgtm label, please assign csrwng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bryan-cox
Member

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2055336840939966464 | Cost: $3.6344315 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2026
@jhjaggars jhjaggars force-pushed the azure-scale-from-zero branch from 7fc9734 to f68bf6d Compare May 26, 2026 14:17
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2026
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8337 May 26, 2026 14:23 Inactive
@jhjaggars
Contributor Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@jhjaggars
Contributor Author

/test e2e-azure-v2-self-managed

@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8337 May 27, 2026 12:36 Inactive
jhjaggars and others added 7 commits June 4, 2026 17:09
Extend the existing scale-from-zero autoscaling framework to support
Azure by implementing an Azure instance type provider that queries the
Azure Resource SKUs API for VM size specifications and writing capacity
annotations on MachineDeployments.

Changes:
- Add Azure instancetype.Provider using armcompute.ResourceSKUsClient
- Add AzureMachineTemplate case to scale_from_zero.go type switch
- Extend supportedScaleFromZeroPlatform() for Azure
- Extend reconcileScaleFromZeroAnnotations() for Azure
- Update autoscalerEnabledCondition() to accept Azure with min=0
- Update effectiveMin guard in capi.go to allow min=0 for Azure
- Add "azure" to supportedProviders in main.go and install.go
- Add Azure provider initialization with credential file parsing
- Update CRD CEL validation to allow min=0 for Azure platform
- Add unit tests for Azure provider and extended type switches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Update NodePoolAutoScaling.Min field comment and CRD validation rule
  to reflect Azure support alongside AWS
- Regenerate CRD manifests with updated docs and validation
- Fix partial SKU cache on Azure pager failure: build into local map
  and assign to cache only after full walk succeeds
- Tighten platform gate: add ScaleFromZeroPlatform field so annotations
  are only set when nodepool platform matches the configured provider
- Validate all required Azure credential fields (subscriptionId,
  clientId, clientSecret, tenantId, location) upfront with a clear
  error listing missing fields
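
The partial-cache fix in this commit (build into a local map, publish only after the full walk succeeds) can be sketched like this. The `skuPager` interface, `skuCache` type, and `fakePager` are hypothetical stand-ins invented for the example; the real provider uses the Azure SDK pager, whose exact shape is not reproduced here. The VM size names are only sample data.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// skuPager is a hypothetical, simplified view of a paged SKU listing.
type skuPager interface {
	More() bool
	NextPage() ([]string, error)
}

// skuCache holds the set of known VM size names.
type skuCache struct {
	mu   sync.Mutex
	skus map[string]struct{}
}

// refresh walks every page into a local map and swaps it into the cache only
// after the whole walk succeeds, so a mid-walk failure never leaves a
// partially filled cache behind.
func (c *skuCache) refresh(p skuPager) error {
	local := make(map[string]struct{})
	for p.More() {
		page, err := p.NextPage()
		if err != nil {
			return fmt.Errorf("listing SKUs: %w", err)
		}
		for _, name := range page {
			local[name] = struct{}{}
		}
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.skus = local
	return nil
}

// fakePager serves canned pages and fails at index failAt (-1 means never).
type fakePager struct {
	pages  [][]string
	failAt int
	next   int
}

func (f *fakePager) More() bool { return f.next < len(f.pages) }

func (f *fakePager) NextPage() ([]string, error) {
	i := f.next
	f.next++
	if i == f.failAt {
		return nil, errors.New("simulated pager failure")
	}
	return f.pages[i], nil
}

func main() {
	pages := [][]string{{"Standard_D4s_v3"}, {"Standard_NC6"}}
	cache := &skuCache{}

	err := cache.refresh(&fakePager{pages: pages, failAt: 1})
	fmt.Println(err != nil, len(cache.skus)) // true 0

	err = cache.refresh(&fakePager{pages: pages, failAt: -1})
	fmt.Println(err == nil, len(cache.skus)) // true 2
}
```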

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update CRD test suite to match the updated validation rule that
allows autoScaling.min=0 on Azure platform:
- Change Azure min=0 test from expecting failure to expecting success
- Update Agent and KubeVirt error messages to include Azure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Lowercase error string for Azure scale-from-zero credentials
- Fix gci import ordering in main.go, provider_test.go,
  scale_from_zero_test.go, and nodepool_test.go

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Gate effectiveMin=0 on runtime-configured scaleFromZeroPlatform instead
  of static platform type check, preventing stalled pools when the
  scale-from-zero provider isn't wired up
- Resolve AZURE_CLOUD_NAME for credential and SKU client construction in
  scale-from-zero init, matching sovereign cloud support used elsewhere
- Return errors on invalid/negative GPU values in transformSKU instead of
  silently skipping, with VM size in error messages for debuggability
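
The GPU validation described in the last item might look roughly like the sketch below. It assumes the SKU capabilities have already been flattened into a string map with a "GPUs" key; the `parseGPUCount` helper and that map shape are illustrative assumptions, not the actual transformSKU signature.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGPUCount reads the GPU count for a VM size from its capabilities.
// A missing entry means the size has no GPUs; a malformed or negative value
// is returned as an error that names the VM size instead of being skipped.
func parseGPUCount(vmSize string, capabilities map[string]string) (int64, error) {
	raw, ok := capabilities["GPUs"]
	if !ok {
		return 0, nil
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("VM size %q: invalid GPU count %q: %w", vmSize, raw, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("VM size %q: negative GPU count %d", vmSize, n)
	}
	return n, nil
}

func main() {
	fmt.Println(parseGPUCount("Standard_NC6", map[string]string{"GPUs": "1"}))
	fmt.Println(parseGPUCount("Standard_D4s_v3", map[string]string{}))
	fmt.Println(parseGPUCount("Standard_NC6", map[string]string{"GPUs": "-1"}))
}
```

Including the VM size in each error keeps a single bad SKU entry easy to locate in operator logs.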

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update the --scale-from-zero-provider help text to list both aws and azure
as supported platforms. Regenerate vendor and docs to sync with the rebased
branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The TestNodePoolAutoscalingScaleFromZero test was hardcoded to skip on
non-AWS platforms. The test logic is already platform-agnostic (it copies
the existing NodePool spec), so the only change needed is widening the
platform gate to include Azure.

A follow-up PR in openshift/release will configure the Azure CI jobs to
install the operator with --scale-from-zero-provider=azure and the
appropriate credentials.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@jhjaggars
Contributor Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 5, 2026

I have all the evidence I need. Here is the final report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
fixture.go:340: Failed to clean up 11 remaining resources for guest cluster

Summary

The TestAutoscaling/Teardown failed because 11 AWS infrastructure resources (10 EBS volumes and 1 NLB) were not deleted within the 15-minute cleanup validation timeout after the hosted cluster was destroyed. All functional autoscaling subtests (TestAutoscaling, TestAutoscalerRespectsNodePoolPause, TestAutoscalingBalancing) passed successfully. This is a pre-existing flaky teardown issue unrelated to PR #8337, which only modifies a platform skip condition in autoscaling_test.go (+2/-2 lines) to enable Azure scale-from-zero support.

Root Cause

The root cause is a race condition between the AWS resource cleanup timeline and the test fixture's validation timeout in fixture.go:

  1. TestAutoscalingBalancing leaves the cluster at high scale: The testAutoscalingBalancing() function creates an additional nodepool (autoscaling-kjvpb-us-east-1a-additional) and scales the cluster to 6 nodes across 2 nodepools. It does not scale down or delete the additional nodepool before returning — by design, cleanup is deferred to the fixture-level teardown.

  2. Accumulated nodes from earlier subtests: Before TestAutoscalingBalancing, the TestAutoscalerRespectsNodePoolPause subtest scaled the primary nodepool to 3 nodes, then back to 1. However, the cluster autoscaler's scale-down may not have fully terminated all nodes and their associated EBS volumes before TestAutoscalingBalancing scaled up again, leaving orphaned volumes.

  3. Teardown struggles with 11 resources: When the fixture teardown destroys the hosted cluster, it must wait for AWS to release 10 EBS volumes (7 from the primary nodepool l7gdg, 3 from the additional nodepool additional-dxqv9) and 1 Network Load Balancer (router-default). AWS EBS volume deletion can take significant time when volumes are attached to instances that are still terminating.

  4. 15-minute timeout is insufficient: The validateAWSGuestResourcesDeletedFunc() in fixture.go polls AWS for tagged resources every 20 seconds for a maximum of 15 minutes. With 11 resources to clean up from a 6-node cluster (plus leftover volumes from earlier scaling), this timeout is routinely insufficient.

  5. Not caused by PR #8337: The PR changes only 2 lines in autoscaling_test.go, expanding the platform skip condition from hyperv1.AWSPlatform to include hyperv1.AzurePlatform for the TestNodePoolAutoscalingScaleFromZero test. None of the autoscaling teardown code, fixture cleanup, or AWS resource management is modified by this PR.

Recommendations
  1. Retrigger the job: This is a pre-existing flaky teardown issue unrelated to the PR changes. A rerun will likely pass if AWS resource cleanup completes within the timeout window.

  2. For the hypershift team (not this PR): Consider adding explicit scale-down and nodepool cleanup at the end of testAutoscalingBalancing() to reduce the resource count before teardown begins. Currently the test leaves 6 nodes across 2 nodepools running, creating a ~10 volume + 1 NLB cleanup burden on the fixture teardown.

  3. For the hypershift team (not this PR): The 15-minute timeout in validateAWSGuestResourcesDeletedFunc() should be increased to 20-25 minutes for autoscaling tests that leave multiple nodepools at scale.

Evidence

| Evidence | Detail |
|---|---|
| Failed test | TestAutoscaling/Teardown (1336.11s / ~22 min) |
| All functional tests | PASSED: TestAutoscaling, TestAutoscalerRespectsNodePoolPause, TestAutoscalingBalancing |
| Remaining AWS resources | 11 total: 7 EBS volumes (primary nodepool l7gdg), 3 EBS volumes (additional nodepool additional-dxqv9), 1 NLB |
| Cluster state at teardown | 6 nodes across 2 nodepools (no scale-down after TestAutoscalingBalancing) |
| PR #8337 changes to test file | +2/-2 lines; only expands platform skip condition to include Azure |
| Teardown timeout | 15-minute wait.PollUntilContextTimeout in fixture.go:303 |
| Error source | fixture.go:333: context deadline exceeded after 15-minute poll |
| Total tests | 643 run, 33 skipped, 2 failures (both from same TestAutoscaling parent) |
| PR relevance | None; teardown flake is independent of Azure scale-from-zero changes |

@jhjaggars
Contributor Author

/test e2e-aws-4-22

@openshift-ci
Contributor

openshift-ci Bot commented Jun 5, 2026

@jhjaggars: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jhjaggars
Contributor Author

/verified by @jhjaggars (on top of OCP ipi in Azure)

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 5, 2026
@openshift-ci-robot

@jhjaggars: This PR has been marked as verified by @jhjaggars (on top of OCP ipi in Azure).

Details

In response to this:

/verified by @jhjaggars (on top of OCP ipi in Azure)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Labels

  • area/api: Indicates the PR includes changes for the API
  • area/cli: Indicates the PR includes changes for CLI
  • area/documentation: Indicates the PR includes changes for documentation
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/azure: PR/issue for Azure (AzurePlatform) platform
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • verified: Signifies that the PR passed pre-merge verification criteria

5 participants