Skip to content

NO-JIRA: Add nested virtualization support for AWS EC2 NodePools#8681

Draft
jhjaggars wants to merge 5 commits into
openshift:mainfrom
jhjaggars:virt-on-new-aws-nodes
Draft

NO-JIRA: Add nested virtualization support for AWS EC2 NodePools#8681
jhjaggars wants to merge 5 commits into
openshift:mainfrom
jhjaggars:virt-on-new-aws-nodes

Conversation

@jhjaggars
Copy link
Copy Markdown
Contributor

@jhjaggars jhjaggars commented Jun 5, 2026

Adds support for enabling nested virtualization on AWS EC2 instances (C8i, M8i, R8i families) through the NodePool API. This allows running OpenShift Virtualization (KubeVirt) workloads on AWS-hosted clusters.

  • API: Add CpuOptions struct with NestedVirtualization field to AWSNodePoolPlatform
  • Vendor: Bump CAPA from v2.10.0 to v2.11.1 which includes upstream NestedVirtualization support in CPUOptions (PR kubernetes-sigs/cluster-api-provider-aws#5874)
  • CLI: Add --nested-virtualization flag to hypershift create cluster aws and hypershift create nodepool aws
  • Controller: Wire CpuOptions.NestedVirtualization from NodePool spec to CAPA AWSMachineTemplateSpec.CPUOptions
  • CPO fix: Prevent informer creation for unused platform-specific resources (e.g. SecretProviderClass on non-Azure clusters)
  • Built custom HO image and installed on management cluster
  • Created hosted cluster with --nested-virtualization enabled --instance-type c8i.2xlarge
  • Verified cpuOptions.nestedVirtualization: enabled propagates through: NodePool → AWSMachineTemplate → AWSMachine → EC2 instance
  • Confirmed EC2 CpuOptions.NestedVirtualization: enabled via AWS CLI
  • Confirmed vmx CPU flag present on guest nodes via /proc/cpuinfo
    N/A — new feature
  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

    • Added EC2 nested virtualization configuration option for AWS-backed node pools. Users can now enable or disable nested virtualization when creating or updating clusters and node pools using the new --nested-virtualization CLI flag.
  • Chores

    • Updated AWS EC2 SDK and Cluster API provider dependencies to latest versions.

jhjaggars and others added 5 commits June 5, 2026 03:54
…tform

Add CpuOptions struct to AWSNodePoolPlatform to support configuring
nested virtualization on AWS EC2 instances. Supported instance
families are C8i, M8i, and R8i with valid values of "enabled" or
"disabled".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update vendored cluster-api-provider-aws from v2.10.0 to v2.11.1
which includes upstream NestedVirtualization support in CPUOptions.
Regenerate CAPA CRDs and apply configuration clients.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ool creation

Add --nested-virtualization flag to both `hypershift create cluster aws`
and `hypershift create nodepool aws` commands. Accepts "enabled" or
"disabled" values for C8i, M8i, and R8i instance families.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nested virtualization

Map NodePool CpuOptions.NestedVirtualization to the upstream CAPA
CPUOptions type on AWSMachineTemplateSpec. The cpuOptions handling
runs before the placement nil guard so it applies even when placement
is not configured.

Also prevent informer creation for unused platform-specific resources
(e.g. SecretProviderClass on non-Azure clusters).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@jhjaggars: This pull request explicitly references no jira issue.

Details

In response to this:

Adds support for enabling nested virtualization on AWS EC2 instances (C8i, M8i, R8i families) through the NodePool API. This allows running OpenShift Virtualization (KubeVirt) workloads on AWS-hosted clusters.

  • API: Add CpuOptions struct with NestedVirtualization field to AWSNodePoolPlatform
  • Vendor: Bump CAPA from v2.10.0 to v2.11.1 which includes upstream NestedVirtualization support in CPUOptions (PR kubernetes-sigs/cluster-api-provider-aws#5874)
  • CLI: Add --nested-virtualization flag to hypershift create cluster aws and hypershift create nodepool aws
  • Controller: Wire CpuOptions.NestedVirtualization from NodePool spec to CAPA AWSMachineTemplateSpec.CPUOptions
  • CPO fix: Prevent informer creation for unused platform-specific resources (e.g. SecretProviderClass on non-Azure clusters)
  • Built custom HO image and installed on management cluster
  • Created hosted cluster with --nested-virtualization enabled --instance-type c8i.2xlarge
  • Verified cpuOptions.nestedVirtualization: enabled propagates through: NodePool → AWSMachineTemplate → AWSMachine → EC2 instance
  • Confirmed EC2 CpuOptions.NestedVirtualization: enabled via AWS CLI
  • Confirmed vmx CPU flag present on guest nodes via /proc/cpuinfo
    N/A — new feature
  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

📝 Walkthrough

Walkthrough

This PR adds two independent features. First, it introduces AWS EC2 nested virtualization configuration by defining a new CpuOptions API type and wiring a --nested-virtualization CLI flag through cluster and nodepool create commands into the NodePool controller, which applies the setting to generated EC2 machine templates. Second, it implements apply-state tracking in generic adapter and control plane component reconciliation, recording whether resources have been successfully applied and preventing deletion of resources that were never applied in the first place, then resetting the flag on actual deletion.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning No tests added for new nested virtualization feature or hasBeenApplied flag. Reviewers flagged critical issues: hasBeenApplied set on errors, volatile process state deletion logic. Add Ginkgo tests for: nested virtualization validation, hasBeenApplied error cases, predicate-false deletion scenarios, and controller CpuOptions propagation.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main objective of the pull request—adding nested virtualization support for AWS EC2 NodePools. It is concise, specific, and directly reflects the primary changes across the API, CLI, and controller components.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not add or modify any Ginkgo test files; only changes non-test source files. Per PR objectives: "Docs and unit tests are not included."
Topology-Aware Scheduling Compatibility ✅ Passed PR adds nested virtualization support via API types, CLI flags, and controller logic. No pod scheduling constraints, deployment manifests, affinity rules, or topology-aware assumptions introduced.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests; changes are API, CLI, controller, and dependency updates. PR explicitly states "no unit tests included."
No-Weak-Crypto ✅ Passed PR introduces no weak cryptography. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB usage; no custom crypto implementations; no insecure secret comparisons found in modified files.
Container-Privileges ✅ Passed No privileged container settings (privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation: true, runAsUser: 0) found in PR changes.
No-Sensitive-Data-In-Logs ✅ Passed No logging statements expose sensitive data. NestedVirtualization and CpuOptions values are not logged, printed, or formatted in any output statements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jhjaggars
Once this PR has been reviewed and has the lgtm label, please assign csrwng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels Jun 5, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.43%. Comparing base (e25a87a) to head (3668271).
⚠️ Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
hypershift-operator/controllers/nodepool/aws.go 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8681      +/-   ##
==========================================
- Coverage   41.44%   41.43%   -0.01%     
==========================================
  Files         756      756              
  Lines       93644    93648       +4     
==========================================
- Hits        38807    38802       -5     
- Misses      52114    52124      +10     
+ Partials     2723     2722       -1     
Files with missing lines Coverage Δ
cmd/cluster/aws/create.go 41.59% <ø> (ø)
cmd/nodepool/aws/create.go 73.52% <ø> (ø)
...t/controlplane-component/controlplane-component.go 44.31% <ø> (+0.52%) ⬆️
support/controlplane-component/generic-adapter.go 57.40% <ø> (ø)
hypershift-operator/controllers/nodepool/aws.go 69.25% <0.00%> (-0.61%) ⬇️

... and 4 files with indirect coverage changes

Flag Coverage Δ
cpo-hostedcontrolplane 43.50% <ø> (ø)
cpo-other 42.74% <ø> (-0.05%) ⬇️
hypershift-operator 51.56% <0.00%> (-0.01%) ⬇️
other 31.64% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/cluster/aws/create.go`:
- Around line 410-414: Validate o.NestedVirtualization at the CLI boundary by
checking its value against the documented allow-list and return an error
immediately if it's not allowed; then compute the effective instance type (use
o.InstanceType if set else the code's default m5.large/m6g.large) and verify
that this instance type's family is one of the supported families for nested
virtualization, returning a clear validation error if incompatible; update the
block that assigns nodePool.Spec.Platform.AWS.CpuOptions (where
o.NestedVirtualization is used) to perform both checks and fail fast instead of
setting an invalid value.

In `@cmd/nodepool/aws/create.go`:
- Around line 108-109: Add early validation for the nested virtualization flag
before the NodePool spec is written. Implement allow-list based validation that
checks two things: first, verify that the nested virtualization value is only
"enabled" or "disabled" (reject any arbitrary values), and second, validate that
the selected instance family is compatible with nested virtualization (only C8i,
M8i, and R8i families are supported, so reject incompatible defaults like
m5.large and m6g.large). This validation should be added at the trust boundary
where the flag is processed and should reject invalid combinations before any
NodePool spec is created. Apply the same validation logic to the other affected
code sections around lines 176-180.

In `@support/controlplane-component/controlplane-component.go`:
- Around line 208-211: The deletion logic currently returns early based only on
the volatile field c.hasBeenApplied; update the delete path in the component's
delete handler (the method using c.hasBeenApplied) to avoid relying on in-memory
state by first checking persistent cluster state or resource markers
instead—e.g., query the API/server for the component resource or its
finalizer/annotation via the controller client (use the existing client/get
method used elsewhere) and only skip deletion when the resource truly does not
exist or when a persistent “never-delete” marker is present; remove the early
return that depends solely on c.hasBeenApplied and replace it with a
cluster-existence check and/or persistent state check before deciding not to
delete.
- Around line 186-190: The code marks c.hasBeenApplied = true regardless of
whether c.update(cpContext) returned an error; change the logic so
c.hasBeenApplied is set only when the update succeeded: call reconcilationError
= c.update(cpContext) and then set c.hasBeenApplied = true only if
reconcilationError == nil (i.e., after a successful return from c.update),
leaving it false when c.update fails so downstream delete/cleanup behavior
remains correct.

In `@support/controlplane-component/generic-adapter.go`:
- Around line 54-57: The short-circuit using the process-local flag
ga.hasBeenApplied causes deletion to be skipped after a controller restart;
remove that reliance and instead detect resource existence/ownership at runtime
before skipping deletion: in the delete path (where ga.hasBeenApplied is
checked) call the appropriate lookup/delete logic (e.g., use the adapter's
client or ga.get/ga.getResource method) to query the cluster for the resource or
its ownership/annotation, attempt the deletion if the resource is present, and
only ignore errors that indicate "not found"; keep the operation idempotent and
remove the ga.hasBeenApplied-only return so cleanup runs across restarts.
- Around line 21-22: The hasBeenApplied boolean on genericAdapter is mutated on
a value copy (so updates are lost) and is only in-memory (lost on restart);
change manifestsAdapters from map[string]genericAdapter to
map[string]*genericAdapter and update update() and any retrieval sites to use
pointer semantics so genericAdapter.reconcile() sets the real struct's
hasBeenApplied; also remove sole reliance on the in-memory flag by deriving
applied state from the cluster (e.g., check the target resource
existence/owner/annotation) or persist the flag into the control-plane
workload's Status (add a field in the CR status and read/write it in
update()/reconcile()) so the predicate-based cleanup survives restarts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fea52243-7df4-4f72-a4be-ab67d3321aaa

📥 Commits

Reviewing files that changed from the base of the PR and between f13c62d and 3668271.

⛔ Files ignored due to path filters (167)
  • api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !**/zz_generated*.go, !**/zz_generated*
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/AAA_ungated.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/GCPPlatform.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OpenStack.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • client/applyconfiguration/hypershift/v1beta1/awsnodepoolplatform.go is excluded by !client/**
  • client/applyconfiguration/hypershift/v1beta1/cpuoptions.go is excluded by !client/**
  • client/applyconfiguration/utils.go is excluded by !client/**
  • cmd/install/assets/crds/cluster-api-provider-aws/infrastructure.cluster.x-k8s.io_awsclusters.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/cluster-api-provider-aws/infrastructure.cluster.x-k8s.io_awsclustertemplates.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/cluster-api-provider-aws/infrastructure.cluster.x-k8s.io_awsmachinepools.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/cluster-api-provider-aws/infrastructure.cluster.x-k8s.io_awsmachines.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/cluster-api-provider-aws/infrastructure.cluster.x-k8s.io_awsmachinetemplates.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-Default.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/CHANGELOG.md is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_AttachVolume.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_CreateSecondaryNetwork.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_CreateSecondarySubnet.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_DeleteSecondaryNetwork.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_DeleteSecondarySubnet.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_DescribeSecondaryInterfaces.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_DescribeSecondaryNetworks.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_DescribeSecondarySubnets.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_DetachVolume.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_ModifyInstanceCpuOptions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_ModifyInstanceMetadataDefaults.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_ModifyInstanceMetadataOptions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_RunInstances.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/api_op_SearchTransitGatewayRoutes.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/deserializers.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/generated.json is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/go_module_metadata.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/serializers.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/types/enums.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/types/types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/ec2/validators.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/CHANGELOG.md is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/LICENSE.txt is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_client.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_AssociateAccessPolicy.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_AssociateEncryptionConfig.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_AssociateIdentityProviderConfig.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreateAccessEntry.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreateAddon.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreateCapability.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreateCluster.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreateEksAnywhereSubscription.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreateFargateProfile.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreateNodegroup.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_CreatePodIdentityAssociation.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeleteAccessEntry.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeleteAddon.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeleteCapability.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeleteCluster.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeleteEksAnywhereSubscription.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeleteFargateProfile.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeleteNodegroup.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeletePodIdentityAssociation.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DeregisterCluster.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeAccessEntry.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeAddon.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeAddonConfiguration.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeAddonVersions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeCapability.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeCluster.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeClusterVersions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeEksAnywhereSubscription.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeFargateProfile.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeIdentityProviderConfig.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeInsight.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeInsightsRefresh.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeNodegroup.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribePodIdentityAssociation.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DescribeUpdate.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_DisassociateAccessPolicy.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListAccessEntries.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListAccessPolicies.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListAddons.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListAssociatedAccessPolicies.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListCapabilities.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListClusters.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListEksAnywhereSubscriptions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListFargateProfiles.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListIdentityProviderConfigs.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListInsights.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListNodegroups.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListPodIdentityAssociations.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListTagsForResource.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_ListUpdates.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_RegisterCluster.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_StartInsightsRefresh.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_TagResource.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UntagResource.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdateAccessEntry.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdateAddon.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdateCapability.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdateClusterConfig.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdateClusterVersion.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdateNodegroupConfig.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdateNodegroupVersion.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/api_op_UpdatePodIdentityAssociation.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/auth.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/deserializers.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/doc.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/endpoints.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/generated.json is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/go_module_metadata.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/internal/endpoints/endpoints.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/options.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/serializers.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/types/enums.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/types/errors.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/types/types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/aws/aws-sdk-go-v2/service/eks/validators.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift-online/ocm-common/pkg/resource/validations/kms_arn_regex_validation.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/aws.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
  • vendor/golang.org/x/crypto/blake2b/blake2b.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/blake2bAVX2_amd64.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/blake2bAVX2_amd64.s is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/blake2b_amd64.s is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/blake2b_generic.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/blake2b_ref.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/blake2x.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/go125.go is excluded by !vendor/**, !**/vendor/**
  • vendor/golang.org/x/crypto/blake2b/register.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta1/awscluster_conversion.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta1/awsmachine_conversion.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta1/network_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta1/types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta1/zz_generated.conversion.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awscluster_defaults.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awscluster_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awscluster_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsclustercontrolleridentity_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsclusterroleidentity_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsclusterstaticidentity_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsclustertemplate_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsmachine_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsmachine_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsmachinetemplate_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/bastion.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/defaults.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/network_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controlplane/rosa/api/v1beta2/rosacontrolplane_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controlplane/rosa/api/v1beta2/rosacontrolplane_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controlplane/rosa/api/v1beta2/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/OWNERS is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/awsfargateprofile_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/awsmachinepool_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/awsmanagedmachinepool_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/rosamachinepool_defaults.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/rosamachinepool_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/rosamachinepool_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/rosanetwork_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/rosaroleconfig_webhook.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/validation.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**, !**/zz_generated*.go, !**/zz_generated*
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/pkg/eks/eks.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/cluster-api-provider-aws/v2/pkg/hash/base36.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (7)
  • api/hypershift/v1beta1/aws.go
  • cmd/cluster/aws/create.go
  • cmd/nodepool/aws/create.go
  • go.mod
  • hypershift-operator/controllers/nodepool/aws.go
  • support/controlplane-component/controlplane-component.go
  • support/controlplane-component/generic-adapter.go

Comment thread cmd/cluster/aws/create.go
Comment on lines +410 to +414
if o.NestedVirtualization != "" {
nodePool.Spec.Platform.AWS.CpuOptions = &hyperv1.CpuOptions{
NestedVirtualization: o.NestedVirtualization,
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on invalid --nested-virtualization values and unsupported effective instance types.

The flag is currently free-form, and if users set it without --instance-type, the default m5.large/m6g.large path is incompatible with the documented supported families. This will fail late instead of at CLI validation time.

Proposed fix
 import (
 	"context"
 	"encoding/base64"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"os"
+	"strings"
@@
 func validateAWSOptions(_ context.Context, opts *core.CreateOptions, awsOpts *RawCreateOptions) error {
 	if err := ValidateCreateCredentialInfo(awsOpts.Credentials, awsOpts.CredentialSecretName, opts.Namespace, opts.PullSecretFile); err != nil {
 		return err
 	}
@@
 	if err := awsutil.ValidateVPCCIDR(awsOpts.VPCCIDR); err != nil {
 		return err
 	}
+	if err := validateNestedVirtualization(awsOpts.NestedVirtualization, awsOpts.InstanceType, opts.Arch); err != nil {
+		return err
+	}
 	return nil
 }
+
+func validateNestedVirtualization(value, instanceType, arch string) error {
+	if value == "" {
+		return nil
+	}
+	if value != "enabled" && value != "disabled" {
+		return fmt.Errorf("nested-virtualization must be either enabled or disabled")
+	}
+	effectiveInstanceType := instanceType
+	if effectiveInstanceType == "" {
+		switch arch {
+		case hyperv1.ArchitectureAMD64:
+			effectiveInstanceType = "m5.large"
+		case hyperv1.ArchitectureARM64:
+			effectiveInstanceType = "m6g.large"
+		}
+	}
+	if !(strings.HasPrefix(effectiveInstanceType, "c8i.") ||
+		strings.HasPrefix(effectiveInstanceType, "m8i.") ||
+		strings.HasPrefix(effectiveInstanceType, "r8i.")) {
+		return fmt.Errorf("nested-virtualization is only supported on C8i, M8i, and R8i instance families")
+	}
+	return nil
+}

As per coding guidelines: "Validate at trust boundaries with allow-lists, not deny-lists".

Also applies to: 514-514

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/cluster/aws/create.go` around lines 410 - 414, Validate
o.NestedVirtualization at the CLI boundary by checking its value against the
documented allow-list and return an error immediately if it's not allowed; then
compute the effective instance type (use o.InstanceType if set else the code's
default m5.large/m6g.large) and verify that this instance type's family is one
of the supported families for nested virtualization, returning a clear
validation error if incompatible; update the block that assigns
nodePool.Spec.Platform.AWS.CpuOptions (where o.NestedVirtualization is used) to
perform both checks and fail fast instead of setting an invalid value.

Comment on lines +108 to 109
cmd.Flags().StringVar(&platformOpts.NestedVirtualization, "nested-virtualization", platformOpts.NestedVirtualization, "Enable nested virtualization on EC2 instances (enabled or disabled). Supported on C8i, M8i, and R8i instance families.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add early validation for nested virtualization value and instance family compatibility.

This path also accepts arbitrary values and allows incompatible defaults (m5.large/m6g.large) when nested virtualization is requested, which should be rejected before writing the NodePool spec.

Proposed fix
 import (
 	"context"
 	"fmt"
+	"strings"
@@
 func (o *RawAWSPlatformCreateOptions) Validate(_ context.Context, _ *core.CreateNodePoolOptions) (core.NodePoolPlatformCompleter, error) {
@@
 	if o.RootVolumeSize < 8 {
 		return nil, fmt.Errorf("root volume size must be at least 8 GB, got %d", o.RootVolumeSize)
 	}
+	if o.NestedVirtualization != "" && o.NestedVirtualization != "enabled" && o.NestedVirtualization != "disabled" {
+		return nil, fmt.Errorf("nested-virtualization must be either enabled or disabled")
+	}
@@
 	if o.InstanceType != "" {
 		instanceType = o.InstanceType
 	} else {
@@
 	}
+	if o.NestedVirtualization != "" &&
+		!(strings.HasPrefix(instanceType, "c8i.") || strings.HasPrefix(instanceType, "m8i.") || strings.HasPrefix(instanceType, "r8i.")) {
+		return fmt.Errorf("nested-virtualization is only supported on C8i, M8i, and R8i instance families")
+	}

As per coding guidelines: "Validate at trust boundaries with allow-lists, not deny-lists".

Also applies to: 176-180

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/nodepool/aws/create.go` around lines 108 - 109, Add early validation for
the nested virtualization flag before the NodePool spec is written. Implement
allow-list based validation that checks two things: first, verify that the
nested virtualization value is only "enabled" or "disabled" (reject any
arbitrary values), and second, validate that the selected instance family is
compatible with nested virtualization (only C8i, M8i, and R8i families are
supported, so reject incompatible defaults like m5.large and m6g.large). This
validation should be added at the trust boundary where the flag is processed and
should reject invalid combinations before any NodePool spec is created. Apply
the same validation logic to the other affected code sections around lines
176-180.

Comment on lines 186 to 190
if len(unavailableDependencies) == 0 {
// reconcile only when all dependencies are available, and don't return error immediately so it can be included in the status condition first.
reconcilationError = c.update(cpContext)
c.hasBeenApplied = true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mark component applied only when update succeeds.

Line 189 sets c.hasBeenApplied = true even when reconcilationError != nil. This breaks the “applied successfully” contract and can trigger incorrect delete behavior later.

Suggested fix
  reconcilationError = c.update(cpContext)
- c.hasBeenApplied = true
+ if reconcilationError == nil {
+   c.hasBeenApplied = true
+ }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/controlplane-component/controlplane-component.go` around lines 186 -
190, The code marks c.hasBeenApplied = true regardless of whether
c.update(cpContext) returned an error; change the logic so c.hasBeenApplied is
set only when the update succeeded: call reconcilationError =
c.update(cpContext) and then set c.hasBeenApplied = true only if
reconcilationError == nil (i.e., after a successful return from c.update),
leaving it false when c.update fails so downstream delete/cleanup behavior
remains correct.

Comment on lines +208 to +211
if !c.hasBeenApplied {
// if the component has not been applied, it doesn't exist, so there's nothing to delete.
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deletion skip is tied to volatile process state.

Line 208 returns early based only on c.hasBeenApplied. After controller restarts, this flag resets and can leave previously created component resources undeleted when predicate becomes false.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/controlplane-component/controlplane-component.go` around lines 208 -
211, The deletion logic currently returns early based only on the volatile field
c.hasBeenApplied; update the delete path in the component's delete handler (the
method using c.hasBeenApplied) to avoid relying on in-memory state by first
checking persistent cluster state or resource markers instead—e.g., query the
API/server for the component resource or its finalizer/annotation via the
controller client (use the existing client/get method used elsewhere) and only
skip deletion when the resource truly does not exist or when a persistent
“never-delete” marker is present; remove the early return that depends solely on
c.hasBeenApplied and replace it with a cluster-existence check and/or persistent
state check before deciding not to delete.

Comment on lines +21 to +22
// tracks whether the resource has been applied successfully.
hasBeenApplied bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify map value type and copy-based retrieval path
rg -n 'manifestsAdapters\s+map\[string\]genericAdapter|adapter,\s*exist\s*:=\s*c\.manifestsAdapters\[manifestName\]|return adapter\.reconcile' \
  support/controlplane-component/controlplane-component.go

Repository: openshift/hypershift

Length of output: 219


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== generic-adapter.go (relevant sections) ==="
sed -n '1,180p' support/controlplane-component/generic-adapter.go

echo "=== controlplane-component.go (relevant sections) ==="
sed -n '120,360p' support/controlplane-component/controlplane-component.go

Repository: openshift/hypershift

Length of output: 11107


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== generic-adapter.go (relevant sections) ==="
sed -n '1,220p' support/controlplane-component/generic-adapter.go

echo "=== controlplane-component.go (relevant sections) ==="
sed -n '120,360p' support/controlplane-component/controlplane-component.go

Repository: openshift/hypershift

Length of output: 11107


generic-adapter.hasBeenApplied doesn’t persist (breaks predicate-based cleanup)

controlPlaneWorkload.manifestsAdapters is map[string]genericAdapter, and update() loads adapter by value (adapter, exist := c.manifestsAdapters[manifestName]). genericAdapter.reconcile() mutates ga.hasBeenApplied, but those mutations apply to the local copy, so subsequent reconciles won’t see it—meaning the predicate == false branch will often early-return and skip deletion.

Additionally, hasBeenApplied is purely in-memory, so an operator restart will reset it and the predicate-false path may skip cleanup even when the object still exists.

Suggested fix
- manifestsAdapters map[string]genericAdapter
+ manifestsAdapters map[string]*genericAdapter
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/controlplane-component/generic-adapter.go` around lines 21 - 22, The
hasBeenApplied boolean on genericAdapter is mutated on a value copy (so updates
are lost) and is only in-memory (lost on restart); change manifestsAdapters from
map[string]genericAdapter to map[string]*genericAdapter and update update() and
any retrieval sites to use pointer semantics so genericAdapter.reconcile() sets
the real struct's hasBeenApplied; also remove sole reliance on the in-memory
flag by deriving applied state from the cluster (e.g., check the target resource
existence/owner/annotation) or persist the flag into the control-plane
workload's Status (add a field in the CR status and read/write it in
update()/reconcile()) so the predicate-based cleanup survives restarts.

Comment on lines +54 to +57
if !ga.hasBeenApplied {
// if the resource has not been applied, it doesn't exist, so there's nothing to delete.
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

In-memory gate can skip required cleanup after controller restart.

Line 54 short-circuits deletion solely from ga.hasBeenApplied. That flag is process-local, so after restart previously managed resources are treated as “never applied” and cleanup can be skipped.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/controlplane-component/generic-adapter.go` around lines 54 - 57, The
short-circuit using the process-local flag ga.hasBeenApplied causes deletion to
be skipped after a controller restart; remove that reliance and instead detect
resource existence/ownership at runtime before skipping deletion: in the delete
path (where ga.hasBeenApplied is checked) call the appropriate lookup/delete
logic (e.g., use the adapter's client or ga.get/ga.getResource method) to query
the cluster for the resource or its ownership/annotation, attempt the deletion
if the resource is present, and only ignore errors that indicate "not found";
keep the operation idempotent and remove the ga.hasBeenApplied-only return so
cleanup runs across restarts.

@hypershift-jira-solve-ci
Copy link
Copy Markdown

Now I have the complete root cause analysis. Here is my report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

--- FAIL: TestGenericAdapterReconcile (0.01s)
    --- FAIL: TestGenericAdapterReconcile/When_predicate_is_false_and_GVK_is_inaccessible_it_should_skip_without_error (0.00s)
        generic-adapter_test.go:81: Expected <int32>: 0 to equal <int32>: 1
    --- FAIL: TestGenericAdapterReconcile/When_predicate_is_false_and_GVK_is_accessible_it_should_attempt_cleanup (0.00s)
        generic-adapter_test.go:105: Expected <int32>: 0 to equal <int32>: 1
    --- FAIL: TestGenericAdapterReconcile/When_predicate_is_false_and_GVK_probe_returns_transient_error_it_should_propagate_error (0.00s)
        generic-adapter_test.go:149: Expected an error to have occurred. Got: <nil>: nil
    --- FAIL: TestGenericAdapterReconcile/When_predicate_is_false_and_accessible_resource_has_HCP_owner_ref_it_should_delete_it (0.00s)
        generic-adapter_test.go:258: Expected <bool>: false to be true
FAIL	github.com/openshift/hypershift/support/controlplane-component	3.124s

Summary

PR #8681 adds a hasBeenApplied boolean field to genericAdapter (and controlPlaneWorkload) to track whether a resource has ever been successfully applied. When the predicate is false and hasBeenApplied is false, the new code returns nil immediately — short-circuiting the entire predicate-false cleanup path (GVK access check, existing resource lookup, owner-ref-guarded deletion). All four failing tests create a fresh genericAdapter{} where hasBeenApplied defaults to false, then call reconcile() with a false predicate. The new early-return causes the GVK probe to never be called (reader.callCount stays 0), errors to never be propagated, and deletion to never occur — breaking every test that exercises the predicate-false path.

Root Cause

The PR added an optimization to generic-adapter.go at line 54–57: when the predicate is false and hasBeenApplied is false, the method returns nil immediately. The intent is to skip cleanup for resources that were never created. However, this change is incompatible with the existing tests because:

  1. hasBeenApplied defaults to false — In Go, a new genericAdapter{} struct has hasBeenApplied = false. The tests construct adapters directly via &genericAdapter{predicate: ...} without first calling a successful apply, so hasBeenApplied is always false.

  2. The early return short-circuits ALL predicate-false logic — The new guard at line 54 (if !ga.hasBeenApplied { return nil }) fires before the GVK access check, before the Client.Get() of the existing resource, and before the owner-ref-guarded deletion. This means:

    • Test 1 (GVK inaccessible): Expects reader.callCount == 1 but the GVK probe is never reached → callCount stays 0.
    • Test 2 (GVK accessible): Same issue — GVK probe never reached → callCount stays 0.
    • Test 3 (transient error): Expects the connection-refused error to be returned, but the early return produces nil instead.
    • Test 4 (delete with owner ref): Expects the ConfigMap to be deleted, but the early return skips the deletion entirely → the ConfigMap still exists.
  3. The tests are testing valid real-world scenarios — A component may need to clean up a resource that was created by a previous controller instance (before the current process started). In that case, hasBeenApplied would be false in-memory, but the resource exists on the cluster. The early-return optimization incorrectly assumes that hasBeenApplied == false means the resource doesn't exist — but it only tracks in-memory state for the current process lifetime, not cluster state.

Recommendations
  1. Fix the early-return guard to not skip external state checks: The hasBeenApplied optimization should not bypass the GVK access check and cleanup path when there could be pre-existing resources on the cluster. Consider either:

    • Removing the hasBeenApplied guard from genericAdapter.reconcile() entirely (the controlPlaneWorkload.delete() guard may still be valid since it controls a different code path), or
    • Moving the hasBeenApplied check to occur after the GVK accessibility probe so that inaccessible GVKs can still be skipped efficiently, while accessible resources are always checked for cleanup.
  2. Update tests to set hasBeenApplied = true before testing the predicate-false cleanup path, if the intent is that cleanup should only run when the resource was previously applied in the current process. However, this would mask a real bug — pre-existing resources from prior controller lifetimes would never be cleaned up.

  3. Consider persisting applied-state externally (e.g., via an annotation or status field) if the optimization is needed, rather than relying on in-memory state that is lost across process restarts.

Evidence
Evidence Detail
Failing package github.com/openshift/hypershift/support/controlplane-component
Failing test TestGenericAdapterReconcile (4 subtests)
Test file support/controlplane-component/generic-adapter_test.go lines 81, 105, 149, 258
Root cause file support/controlplane-component/generic-adapter.go lines 54–57 (new code in PR)
Root cause code if !ga.hasBeenApplied { return nil } — added inside the predicate == false branch
PR diff Adds hasBeenApplied bool field and early-return guard that short-circuits the entire cleanup path
Test 1 failure reader.callCount expected 1, got 0 — GVK probe never called
Test 2 failure reader.callCount expected 1, got 0 — GVK probe never called
Test 3 failure Expected error (connection refused), got nil — error path never reached
Test 4 failure Expected IsNotFound(err) == true, got false — deletion never executed
Base branch code Does NOT have the hasBeenApplied field or early-return — all 4 tests pass on main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants