OCPSTRAT-1677: AWS spot instance support for HyperShift #1951

Open · enxebre wants to merge 1 commit into openshift:master from enxebre:hypershift-spot-instances

Conversation


@enxebre enxebre commented Mar 2, 2026

Summary

  • Add marketType (OnDemand/Spot/CapacityBlocks) and spot (SpotOptions with maxPrice) fields to PlacementOptions on AWSNodePoolPlatform
  • Add terminationHandlerQueueURL field to AWSPlatformSpec on HostedCluster for NTH SQS queue configuration
  • Include aws-node-termination-handler image in the OCP release payload
  • Add spot remediation controller in HCCO that watches NTH taints on guest Nodes and triggers CAPI Machine deletion
  • Spot MachineHealthCheck with maxUnhealthy: 100% as a safety net fallback
  • CEL validations enforcing spot/capacityReservation mutual exclusion, tenancy constraints, and spot/spotOptions co-requirement
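As a sketch, a NodePool platform stanza using the new fields might look like the following (field names follow the summary above; `placement` as the home of `PlacementOptions` and the exact value syntax are assumptions until the API merges):

```yaml
spec:
  platform:
    aws:
      placement:
        # marketType selects OnDemand (default), Spot, or CapacityBlocks.
        marketType: Spot
        # spot (SpotOptions) goes together with marketType: Spot;
        # the CEL rules above reject spot options under any other marketType.
        spot:
          # maxPrice is optional; when omitted, AWS caps the bid at the
          # on-demand price.
          maxPrice: "0.05"
```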

Test plan

  • Unit tests for isSpotEnabled(), CAPA template mapping, resource tags, CEL validation rules
  • Integration tests for MHC lifecycle, interruptible-instance label, NTH deployment reconciliation
  • E2E tests for spot NodePool creation, MHC configuration, NTH drain/replace flow

🤖 Generated with Claude Code

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 2, 2026

openshift-ci-robot commented Mar 2, 2026

@enxebre: This pull request references OCPSTRAT-1677 which is a valid jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from csrwng and derekwaynecarr March 2, 2026 12:49

openshift-ci bot commented Mar 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre enxebre force-pushed the hypershift-spot-instances branch 4 times, most recently from 6a151c5 to d66c090 on March 2, 2026 12:56

muraee commented Mar 2, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 2, 2026
@enxebre enxebre force-pushed the hypershift-spot-instances branch from d66c090 to 2348b4a on March 5, 2026 07:37
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@enxebre enxebre force-pushed the hypershift-spot-instances branch from 2348b4a to d7e5483 on March 5, 2026 08:18

openshift-ci bot commented Mar 5, 2026

@enxebre: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


muraee commented Mar 5, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026

devguyio commented Mar 9, 2026

/lgtm

- Add `Spot` as a value to the existing `MarketType` enum and add a `marketType` field to `PlacementOptions` on `AWSNodePoolPlatform`.
- Add a `SpotOptions` struct with an optional `maxPrice` field, referenced from `PlacementOptions` via a `spot` field.
- Add a `terminationHandlerQueueURL` field to `AWSPlatformSpec` on the HostedCluster to configure the NTH SQS queue as a proper API field.
- Ensure that spot instances are automatically labeled with `hypershift.openshift.io/interruptible-instance` so that workloads can use node affinity and anti-affinity rules to control placement.
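Workloads that should avoid spot capacity can then steer scheduling with that label. A hypothetical Pod affinity stanza (the label key comes from the list above; the rest is standard Kubernetes scheduling syntax):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Keep this workload off Nodes backed by spot instances.
        - key: hypershift.openshift.io/interruptible-instance
          operator: DoesNotExist
```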
nit: Nodes backed by spot instances

aws:
  region: us-east-1
  rolesRef: ...
  terminationHandlerQueueURL: "https://sqs.us-east-1.amazonaws.com/123456789012/my-nth-queue"

What kind of validation do we do here? Could a misconfigured or malicious user pointing this at an arbitrary URL cause a DoS of the NTH on the control plane?

Update: I now see validation in the API definition below.

3. CAPA creates EC2 spot instances with the specified market options.

4. The NTH (deployed by the control-plane-operator when `terminationHandlerQueueURL` is set on the HostedCluster) watches for spot interruption and rebalance recommendation events, and cordons/drains nodes before termination.
5. The spot remediation controller in the HCCO watches Nodes tainted with the `aws-node-termination-handler/` prefix and triggers Machine deletion for immediate replacement.
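The taint check at the heart of step 5 can be sketched as a prefix match over Node taints. This is an illustrative sketch, not HyperShift code: `Taint` is a minimal stand-in for `corev1.Taint`, and the `spot-itn` key suffix is an invented example (only the `aws-node-termination-handler/` prefix comes from the text above).

```go
package main

import (
	"fmt"
	"strings"
)

// nthTaintPrefix is the taint-key prefix applied by the
// aws-node-termination-handler when it cordons a node.
const nthTaintPrefix = "aws-node-termination-handler/"

// Taint is a minimal stand-in for corev1.Taint, for this sketch only.
type Taint struct {
	Key    string
	Effect string
}

// hasNTHTaint reports whether any taint key carries the NTH prefix,
// i.e. whether the spot remediation controller should delete the
// backing CAPI Machine.
func hasNTHTaint(taints []Taint) bool {
	for _, t := range taints {
		if strings.HasPrefix(t.Key, nthTaintPrefix) {
			return true
		}
	}
	return false
}

func main() {
	taints := []Taint{{Key: "aws-node-termination-handler/spot-itn", Effect: "NoSchedule"}}
	fmt.Println(hasNTHTaint(taints)) // true
}
```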

This is a new responsibility for the HCCO (RBAC?), and Machines exist in the management cluster, which the HCCO does not normally operate against.

I don't know this area well, but what prevents us from using the MHC to trigger Machine replacement in both cases, whether or not terminationHandlerQueueURL is specified, i.e. the MHC gaining the ability to replace Machines when the Node is tainted with the aws-node-termination-handler taint?

// Supports both standard and FIFO queues (FIFO queues end with .fifo suffix).
//
// +optional
// +kubebuilder:validation:Pattern=`^https://sqs\.[a-z0-9-]+\.amazonaws\.com/[0-9]{12}/[a-zA-Z0-9_-]+(\.fifo)?$`
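The kubebuilder pattern above can be exercised directly with Go's regexp package; `isValidQueueURL` below is an illustrative helper, not HyperShift code, and the sample URLs are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// queueURLPattern mirrors the kubebuilder validation pattern from the API
// definition: standard SQS URLs, with an optional .fifo suffix for FIFO queues.
var queueURLPattern = regexp.MustCompile(`^https://sqs\.[a-z0-9-]+\.amazonaws\.com/[0-9]{12}/[a-zA-Z0-9_-]+(\.fifo)?$`)

// isValidQueueURL reports whether url satisfies the pattern.
func isValidQueueURL(url string) bool {
	return queueURLPattern.MatchString(url)
}

func main() {
	for _, url := range []string{
		"https://sqs.us-east-1.amazonaws.com/123456789012/my-nth-queue",
		"https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue.fifo",
		"https://example.com/123456789012/fake-queue", // rejected: wrong host
	} {
		fmt.Printf("%s -> %v\n", url, isValidQueueURL(url))
	}
}
```

Note that the pattern pins the host to `sqs.<region>.amazonaws.com` and a 12-digit account ID, which answers the arbitrary-URL concern raised above.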

Ah ok, so there is some limitation on the URL 👍


This controller provides faster Machine replacement than the MHC alone. The MHC waits for the Machine to become `Failed`, which only happens after the instance has already been deleted. The spot remediation controller acts as soon as the NTH taint appears, before the instance is actually terminated.

The MHC remains as a fallback for cases where the spot remediation controller or NTH is unavailable.

Again, I'm out of my depth here, but it seems possible to enhance the MHC to do this and leave the HCCO unchanged. I really would like to keep the HCCO out of the business of operating against CAPI in the HCP.


#### IAM Permissions for SQS

The NTH deployment uses the NodePoolManagement IAM role credentials (via IRSA/STS) to poll and acknowledge messages from the SQS queue. This requires adding `sqs:ReceiveMessage` and `sqs:DeleteMessage` permissions to the NodePoolManagement role's IAM policy:
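The corresponding policy statement (quoted verbatim as "the current block" in the review discussion below) is:

```json
{
  "Effect": "Allow",
  "Action": [
    "sqs:DeleteMessage",
    "sqs:ReceiveMessage"
  ],
  "Resource": ["*"]
}
```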

I understand that adding a new role for NTH is heavyweight. Just want to call out that we will be unable to distinguish between AWS calls from CAPA and NTH on the AWS side if we reuse the role/serviceaccount.

@ratnam915

Hi @enxebre, as part of the work on adding `sqs:ReceiveMessage` and `sqs:DeleteMessage` to ROSANodePoolManagementPolicy (SREP-698 / OCPSTRAT-1677), we've received guidance that the SQS permissions should be scoped with an `aws:ResourceTag/red-hat: "true"` condition. This aligns with how KMS permissions are already scoped in the managed policy and strengthens the security posture for the AWS review.

Could the IAM permissions section of the enhancement be updated to reflect this? Specifically:

The current block:

{
  "Effect": "Allow",
  "Action": [
    "sqs:DeleteMessage",
    "sqs:ReceiveMessage"
  ],
  "Resource": ["*"]
}

Should become:
{
  "Sid": "NodePoolSQSActions",
  "Effect": "Allow",
  "Action": [
    "sqs:DeleteMessage",
    "sqs:ReceiveMessage"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/red-hat": "true"
    }
  }
}
