OCPSTRAT-1677: AWS spot instance support for HyperShift #1951

Open · enxebre wants to merge 1 commit into openshift:master from enxebre:hypershift-spot-instances

Conversation


@enxebre enxebre commented Mar 2, 2026

Summary

  • Add marketType (OnDemand/Spot/CapacityBlocks) and spot (SpotOptions with maxPrice) fields to PlacementOptions on AWSNodePoolPlatform
  • Add terminationHandlerQueueURL field to AWSPlatformSpec on HostedCluster for NTH SQS queue configuration
  • Include aws-node-termination-handler image in the OCP release payload
  • Add spot remediation controller in HCCO that watches NTH taints on guest Nodes and triggers CAPI Machine deletion
  • Spot MachineHealthCheck with maxUnhealthy: 100% as a safety net fallback
  • CEL validations enforcing spot/capacityReservation mutual exclusion, tenancy constraints, and spot/spotOptions co-requirement
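As a sketch, a NodePool platform stanza using the new fields might look like the following (field names follow the summary above; `placement` as the home of `PlacementOptions` and the exact value syntax are assumptions until the API merges):

```yaml
spec:
  platform:
    aws:
      placement:
        # marketType selects OnDemand (default), Spot, or CapacityBlocks.
        marketType: Spot
        # spot (SpotOptions) goes together with marketType: Spot;
        # the CEL rules above reject spot options under any other marketType.
        spot:
          # maxPrice is optional; when omitted, AWS caps the bid at the
          # on-demand price.
          maxPrice: "0.05"
```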

Test plan

  • Unit tests for isSpotEnabled(), CAPA template mapping, resource tags, CEL validation rules
  • Integration tests for MHC lifecycle, interruptible-instance label, NTH deployment reconciliation
  • E2E tests for spot NodePool creation, MHC configuration, NTH drain/replace flow

🤖 Generated with Claude Code

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 2, 2026

openshift-ci-robot commented Mar 2, 2026

@enxebre: This pull request references OCPSTRAT-1677 which is a valid jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from csrwng and derekwaynecarr March 2, 2026 12:49

openshift-ci bot commented Mar 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre enxebre force-pushed the hypershift-spot-instances branch 4 times, most recently from 6a151c5 to d66c090 on March 2, 2026 12:56

muraee commented Mar 2, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 2, 2026
@enxebre enxebre force-pushed the hypershift-spot-instances branch from d66c090 to 2348b4a on March 5, 2026 07:37
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@enxebre enxebre force-pushed the hypershift-spot-instances branch from 2348b4a to d7e5483 on March 5, 2026 08:18

openshift-ci bot commented Mar 5, 2026

@enxebre: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


muraee commented Mar 5, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026

devguyio commented Mar 9, 2026

/lgtm

- Add `Spot` as a value to the existing `MarketType` enum and add a `marketType` field to `PlacementOptions` on `AWSNodePoolPlatform`.
- Add a `SpotOptions` struct with an optional `maxPrice` field, referenced from `PlacementOptions` via a `spot` field.
- Add a `terminationHandlerQueueURL` field to `AWSPlatformSpec` on the HostedCluster to configure the NTH SQS queue as a proper API field.
- Ensure that spot instances are automatically labeled with `hypershift.openshift.io/interruptible-instance` so that workloads can use node affinity and anti-affinity rules to control placement.
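Workloads that should avoid spot capacity can then steer scheduling with that label. A hypothetical Pod affinity stanza (the label key comes from the list above; the rest is standard Kubernetes scheduling syntax):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Keep this workload off Nodes backed by spot instances.
        - key: hypershift.openshift.io/interruptible-instance
          operator: DoesNotExist
```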
nit: Nodes backed by spot instances

aws:
  region: us-east-1
  rolesRef: ...
  terminationHandlerQueueURL: "https://sqs.us-east-1.amazonaws.com/123456789012/my-nth-queue"

What kind of validation do we do here? Could a misconfigured or malicious user pointing this at an arbitrary URL cause a DoS of the NTH on the control plane?

Update: I now see validation in the API definition below.

3. CAPA creates EC2 spot instances with the specified market options.

4. The NTH (deployed by the control-plane-operator when `terminationHandlerQueueURL` is set on the HostedCluster) watches for spot interruption and rebalance recommendation events, and cordons/drains nodes before termination.
5. The spot remediation controller in the HCCO watches Nodes tainted with the `aws-node-termination-handler/` prefix and triggers Machine deletion for immediate replacement.
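The taint check at the heart of step 5 can be sketched as a prefix match over Node taints. This is an illustrative sketch, not HyperShift code: `Taint` is a minimal stand-in for `corev1.Taint`, and the `spot-itn` key suffix is an invented example (only the `aws-node-termination-handler/` prefix comes from the text above).

```go
package main

import (
	"fmt"
	"strings"
)

// nthTaintPrefix is the taint-key prefix applied by the
// aws-node-termination-handler when it cordons a node.
const nthTaintPrefix = "aws-node-termination-handler/"

// Taint is a minimal stand-in for corev1.Taint, for this sketch only.
type Taint struct {
	Key    string
	Effect string
}

// hasNTHTaint reports whether any taint key carries the NTH prefix,
// i.e. whether the spot remediation controller should delete the
// backing CAPI Machine.
func hasNTHTaint(taints []Taint) bool {
	for _, t := range taints {
		if strings.HasPrefix(t.Key, nthTaintPrefix) {
			return true
		}
	}
	return false
}

func main() {
	taints := []Taint{{Key: "aws-node-termination-handler/spot-itn", Effect: "NoSchedule"}}
	fmt.Println(hasNTHTaint(taints)) // true
}
```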

This is a new responsibility for the HCCO (RBAC?), and Machines exist in the management cluster, which the HCCO does not normally operate against.

I don't know this area well, but what prevents us from using the MHC to trigger Machine replacement in both cases, whether or not terminationHandlerQueueURL is specified, i.e. the MHC gaining the ability to replace Machines when the Node is tainted with the aws-node-termination-handler taint?

// Supports both standard and FIFO queues (FIFO queues end with .fifo suffix).
//
// +optional
// +kubebuilder:validation:Pattern=`^https://sqs\.[a-z0-9-]+\.amazonaws\.com/[0-9]{12}/[a-zA-Z0-9_-]+(\.fifo)?$`
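The kubebuilder pattern above can be exercised directly with Go's regexp package; `isValidQueueURL` below is an illustrative helper, not HyperShift code, and the sample URLs are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// queueURLPattern mirrors the kubebuilder validation pattern from the API
// definition: standard SQS URLs, with an optional .fifo suffix for FIFO queues.
var queueURLPattern = regexp.MustCompile(`^https://sqs\.[a-z0-9-]+\.amazonaws\.com/[0-9]{12}/[a-zA-Z0-9_-]+(\.fifo)?$`)

// isValidQueueURL reports whether url satisfies the pattern.
func isValidQueueURL(url string) bool {
	return queueURLPattern.MatchString(url)
}

func main() {
	for _, url := range []string{
		"https://sqs.us-east-1.amazonaws.com/123456789012/my-nth-queue",
		"https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue.fifo",
		"https://example.com/123456789012/fake-queue", // rejected: wrong host
	} {
		fmt.Printf("%s -> %v\n", url, isValidQueueURL(url))
	}
}
```

Note that the pattern pins the host to `sqs.<region>.amazonaws.com` and a 12-digit account ID, which answers the arbitrary-URL concern raised above.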

Ah ok, so there is some limitation on the URL 👍


This controller provides faster Machine replacement than the MHC alone. The MHC waits for the Machine to become `Failed`, which only happens after the instance has already been deleted. The spot remediation controller acts as soon as the NTH taint appears, before the instance is actually terminated.

The MHC remains as a fallback for cases where the spot remediation controller or NTH is unavailable.

Again, I'm out of my depth here, but it seems possible to enhance the MHC to do this and leave the HCCO unchanged. I really would like to keep the HCCO out of the business of operating against CAPI in the HCP.


#### IAM Permissions for SQS

The NTH deployment uses the NodePoolManagement IAM role credentials (via IRSA/STS) to poll and acknowledge messages from the SQS queue. This requires adding `sqs:ReceiveMessage` and `sqs:DeleteMessage` permissions to the NodePoolManagement role's IAM policy:
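The corresponding policy statement (quoted verbatim as "the current block" in the review discussion below) is:

```json
{
  "Effect": "Allow",
  "Action": [
    "sqs:DeleteMessage",
    "sqs:ReceiveMessage"
  ],
  "Resource": ["*"]
}
```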

I understand that adding a new role for NTH is heavyweight. Just want to call out that we will be unable to distinguish between AWS calls from CAPA and NTH on the AWS side if we reuse the role/serviceaccount.

@ratnam915

Hi @enxebre, as part of the work on adding `sqs:ReceiveMessage` and `sqs:DeleteMessage` to ROSANodePoolManagementPolicy (SREP-698 / OCPSTRAT-1677), we've received guidance that the SQS permissions should be scoped with an `aws:ResourceTag/red-hat: "true"` condition. This aligns with how KMS permissions are already scoped in the managed policy and strengthens the security posture for the AWS review.

Could the IAM permissions section of the enhancement be updated to reflect this? Specifically:

The current block:

{
  "Effect": "Allow",
  "Action": [
    "sqs:DeleteMessage",
    "sqs:ReceiveMessage"
  ],
  "Resource": ["*"]
}

Should become:
{
  "Sid": "NodePoolSQSActions",
  "Effect": "Allow",
  "Action": [
    "sqs:DeleteMessage",
    "sqs:ReceiveMessage"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/red-hat": "true"
    }
  }
}
