GEP-0054: Worker Capabilities for Machine Image Selection #53

Open
Roncossek wants to merge 2 commits into gardener:main from Roncossek:worker-capabilities

@Roncossek Roncossek commented Feb 26, 2026

  • How to categorize this PR:
    /area os usability
    /kind enhancement
  • One-line PR description: Adds new GEP 0054 - Worker Capabilities for Machine Image Selection
  • Other comments:

@gardener-prow

gardener-prow Bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vlerenc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added area/os Operator system related area/usability Usability related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 26, 2026
@Roncossek Roncossek changed the title from "GEP-0040: Worker Capabilities for Machine Image Selection" to "GEP-0054: Worker Capabilities for Machine Image Selection" Feb 26, 2026
The following example capability names are reserved for Gardener core use. All reserved capabilities use the `gardener-` prefix:

```go
const (
```

Member

Following https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#constants, the enums should be in pascal case, e.g. GardenerBootType instead of gardener-bootType.

Member

+1 to have consistent constant values.

I would be fine with both gardener-boot-type and GardenerBootType, but mixing the styles looks strange.

Comment thread geps/0040-worker-capabilities/README.md Outdated

## Summary

This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed.

Member

I struggle with the term "worker properties" or "worker capabilities". Can you find a better name?

Comment thread geps/0040-worker-capabilities/README.md Outdated

## Summary

This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed.

Member

Suggested change
- This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed.
+ This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker pool configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed.
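
The rule stated in the Summary, that both machine type capabilities and worker capabilities must be satisfied, can be illustrated with a small Go sketch. It is a simplification and not the GEP-0033 algorithm: `Capabilities`, `ImageFlavor`, `satisfies`, and `selectFlavor` are hypothetical names, the capability name `gardener-boot-type` and its values are made up for illustration, and the matching rule (a flavor matches when it shares at least one value with every required capability) is an assumption.

```go
package main

import "fmt"

// Capabilities maps a capability name to a set of values.
type Capabilities map[string][]string

// ImageFlavor is a hypothetical stand-in for one image variant offered in a CloudProfile.
type ImageFlavor struct {
	Name         string
	Capabilities Capabilities
}

// intersects reports whether the two slices share at least one value.
func intersects(a, b []string) bool {
	for _, x := range a {
		for _, y := range b {
			if x == y {
				return true
			}
		}
	}
	return false
}

// satisfies reports whether offered shares at least one value with required
// for every capability that required names (assumed matching rule).
func satisfies(offered, required Capabilities) bool {
	for name, wanted := range required {
		if !intersects(offered[name], wanted) {
			return false
		}
	}
	return true
}

// selectFlavor returns the first flavor that satisfies both requirement sets.
func selectFlavor(flavors []ImageFlavor, machineType, worker Capabilities) (ImageFlavor, bool) {
	for _, f := range flavors {
		if satisfies(f.Capabilities, machineType) && satisfies(f.Capabilities, worker) {
			return f, true
		}
	}
	return ImageFlavor{}, false
}

func main() {
	flavors := []ImageFlavor{
		{Name: "amd64-legacy", Capabilities: Capabilities{"architecture": {"amd64"}, "gardener-boot-type": {"legacy"}}},
		{Name: "amd64-secure", Capabilities: Capabilities{"architecture": {"amd64"}, "gardener-boot-type": {"secure"}}},
	}
	machineType := Capabilities{"architecture": {"amd64"}}
	worker := Capabilities{"gardener-boot-type": {"secure"}}

	f, ok := selectFlavor(flavors, machineType, worker)
	fmt.Println(f.Name, ok)
}
```

In this sketch the worker requirement filters out the first flavor even though it matches the machine type, which is the additional constraint the proposal describes.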

Comment thread geps/0040-worker-capabilities/README.md Outdated

## Motivation

Currently, machine image capabilities as defined in GEP-0033 only support selecting images based on hardware properties of the machine type (CPU architecture, hypervisor type, bare-metal vs virtualized). However, users increasingly need the ability to select images based on software or feature requirements that are specified in the worker configuration, such as:

Member

Isn't the real motivation the defaulting? Users (already today) know what their worker pools will need. If they need in-place updates, they can look up a proper machine image/type in the CloudProfile. Same goes for other capability requirements.

What you seem to want to achieve is that they don't have to look this up and manage it themselves, right?

Comment thread geps/0040-worker-capabilities/README.md Outdated
- **GPU Support**: Select images with specific GPU driver support (nvidia, amd, intel)
- ...

Without this feature, users must manually ensure that their worker configuration is compatible with the selected machine image, which is error-prone and limits the automation benefits provided by Gardener's maintenance operations.

Member

Aha, yes, here you seem to confirm. I would update this section to highlight this main use-case.

Comment thread geps/0040-worker-capabilities/README.md Outdated
- **Machine type capabilities**: Describe what the hardware supports (e.g., ARM64 architecture, virtualized hypervisor)
- **Worker capabilities**: Describe what features the worker configuration requests (e.g., secure boot enabled, in-place updates required)

Worker capabilities are derived from worker specification properties. To prevent naming conflicts between Gardener-reserved capabilities and custom or provider-defined capabilities, reserved capability names use a `gardener-` prefix.

Member

Can you explain/provide some examples what "Gardener-reserved capabilities" would be, in contrast to custom or provider-defined capabilities?

Is this because of NamespacedCloudProfiles where end-users might define their own images or machine types with their custom capabilities? I have to guess way too much here.

Comment thread geps/0040-worker-capabilities/README.md Outdated
### Notes/Constraints/Caveats

- Worker capabilities are **additive** to the existing capability system. Existing CloudProfiles and image selection continue to work without modification.
- The `gardener-` prefix is reserved for Gardener core capabilities. Provider extensions should use their own prefixes (e.g., `azure-`, `aws-`) for provider-specific capabilities.

Member

How do you ensure that the prefixes used by provider extensions will not be taken by users?

Member

Why not just combine the prefix, e.g. gardener-azure- and gardener-aws-? This way users can use anything but the prefix gardener-.
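
If all Gardener-owned and provider-owned capabilities live under `gardener-`, as suggested in this thread, a single prefix check would be enough to keep user-defined names out of the reserved namespace. The following Go sketch shows that idea only. `validateUserCapabilityName` is a hypothetical helper, and where such a check would run (for example when validating user-supplied capabilities) is an assumption, not something the proposal specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// reservedPrefix is the prefix discussed in this thread.
const reservedPrefix = "gardener-"

// validateUserCapabilityName is a hypothetical helper that rejects
// user-defined capability names inside the reserved namespace.
func validateUserCapabilityName(name string) error {
	if strings.HasPrefix(name, reservedPrefix) {
		return fmt.Errorf("capability name %q uses the reserved prefix %q", name, reservedPrefix)
	}
	return nil
}

func main() {
	for _, name := range []string{"gardener-azure-example", "azure-example", "my-capability"} {
		fmt.Printf("%s: %v\n", name, validateUserCapabilityName(name))
	}
}
```

With separate provider prefixes such as `azure-` or `aws-`, the same check would need a list of reserved prefixes that grows with every provider extension.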

Comment thread geps/0040-worker-capabilities/README.md Outdated
capabilities := make(map[string][]string)

// Boot type mapping
if worker.Machine.SecureBoot != nil && *worker.Machine.SecureBoot {

Member

The capabilities API in the CloudProfile is very generic without hard-coded capability names. Why would we now start to hard-code some of them in the worker pool API?

Author

Your comment is spot on and addresses a tension at the core of this GEP. Allow me to elaborate.

My main concern is that a generic capability field on the worker might be confusing for Gardener end users, as they would need to know which capabilities (and values) exist in the selected CloudProfile.

  • GEP-33 is an implementation detail of how Gardener selects the right image flavor automatically. Users specify what they want via a well-documented API, and capabilities are none of their concern. Only the Gardener operator needs to understand the capabilities mechanism.
  • Explicit fields are easier for the end user to understand.

Second, some features might need additional configuration, like the existing in-place update feature (AutoInPlaceUpdate vs ManualInPlaceUpdate). This cannot be included in a capability. In these cases this would be a kind of double maintenance, as Gardener can assume that an image supporting in-place updates is required if these values are set. But this can be handled with some nice defaulting.

If we use the generic capabilities field on the worker API, how can we provide a good UI and API documentation for the user?
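
The approach argued for in this reply, explicit worker fields from which Gardener derives capabilities, might look roughly like the Go sketch below. It assumes a flattened, hypothetical `WorkerPool` type rather than the real worker API, and the capability names and values (`gardener-boot-type`, `gardener-in-place-update`, `secure`, `true`) are placeholders, since naming is still under discussion in this PR. Only the update strategy names `AutoInPlaceUpdate` and `ManualInPlaceUpdate` are taken from the comment above.

```go
package main

import "fmt"

// Hypothetical capability names; the final naming style is still under review.
const (
	CapabilityBootType      = "gardener-boot-type"
	CapabilityInPlaceUpdate = "gardener-in-place-update"
)

// WorkerPool is a reduced, hypothetical stand-in for the worker pool spec.
type WorkerPool struct {
	SecureBoot     *bool
	UpdateStrategy string
}

// deriveCapabilities maps explicit worker fields to the capabilities
// an image would have to offer. Users never set capabilities directly.
func deriveCapabilities(w WorkerPool) map[string][]string {
	caps := make(map[string][]string)

	// Secure boot requested: require an image that supports it.
	if w.SecureBoot != nil && *w.SecureBoot {
		caps[CapabilityBootType] = []string{"secure"}
	}

	// Either in-place strategy implies an image that supports in-place updates.
	switch w.UpdateStrategy {
	case "AutoInPlaceUpdate", "ManualInPlaceUpdate":
		caps[CapabilityInPlaceUpdate] = []string{"true"}
	}

	return caps
}

func main() {
	secure := true
	caps := deriveCapabilities(WorkerPool{SecureBoot: &secure, UpdateStrategy: "AutoInPlaceUpdate"})
	fmt.Println(caps)
}
```

The distinction between the two in-place strategies stays in the worker field, while the derived capability only records that in-place support is needed, which matches the point that such extra configuration cannot be expressed as a capability.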

Comment thread geps/0040-worker-capabilities/README.md Outdated
Comment on lines +133 to +140
case "nvidia":
capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportNvidia}
case "amd":
capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportAmd}
case "intel":
capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportIntel}
default:
capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportNone}

Member

We even go beyond that and hard-code not only capability names but also concrete values (like nvidia, etc.)? Then I don't understand the whole effort that was put into GEP-33 to make the API as generic as possible.

Comment thread geps/0040-worker-capabilities/README.md Outdated

### Provider-Specific Worker Properties

Provider extensions can define additional worker properties that map to capabilities. This is achieved through the existing `WorkerConfig` extension mechanism:

Member

This sounds wrong. Shouldn't we add a dedicated capabilities field to the worker pool API, similar to how it was done for machine images and types? Why would we use the WorkerConfig for this?
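
For comparison, the alternative raised in this comment, a dedicated generic capabilities field on the worker pool, could be sketched in Go as follows. The `WorkerPool` type, its `Capabilities` field, and `validateRequested` are hypothetical, and the idea of validating requested values against capability definitions from the CloudProfile is an assumption made for illustration.

```go
package main

import "fmt"

// Capabilities maps a capability name to a set of values.
type Capabilities map[string][]string

// WorkerPool is a hypothetical worker pool with a generic capabilities field.
type WorkerPool struct {
	Name         string
	Capabilities Capabilities
}

// contains reports whether list includes s.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// validateRequested checks that every requested capability name and value
// is declared in the set of defined capabilities (assumed to come from the CloudProfile).
func validateRequested(defined, requested Capabilities) error {
	for name, values := range requested {
		allowed, ok := defined[name]
		if !ok {
			return fmt.Errorf("unknown capability %q", name)
		}
		for _, v := range values {
			if !contains(allowed, v) {
				return fmt.Errorf("capability %q does not allow value %q", name, v)
			}
		}
	}
	return nil
}

func main() {
	defined := Capabilities{"architecture": {"amd64", "arm64"}}
	pool := WorkerPool{Name: "pool-a", Capabilities: Capabilities{"architecture": {"riscv"}}}
	fmt.Println(validateRequested(defined, pool.Capabilities))
}
```

This keeps the worker API as generic as the CloudProfile, at the cost that users must know which names and values the selected CloudProfile defines.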

Member

@ScheererJ ScheererJ left a comment

Thanks for the proposal.

From my perspective, it should be adapted to be less abstract. I wondered most of the time what this is actually about until I reached the design details. It should be possible to make this more tangible and understandable by adding examples earlier and using a better name as Rafael suggested.

Comment thread geps/0040-worker-capabilities/README.md Outdated
- [Provider-Specific Worker Properties](#provider-specific-worker-properties)
- [Component Changes](#component-changes)
- [Drawbacks](#drawbacks)

Member

Nit: Let's not have double empty lines.

Suggested change

Comment thread geps/0040-worker-capabilities/README.md Outdated

## Summary

This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed.

Member

You added examples for GEP-33, e.g. CPU architecture, hypervisor type, etc. Could you please also add some examples for the new properties? Otherwise this paragraph stays very abstract.

Comment thread geps/0040-worker-capabilities/README.md Outdated

## Drawbacks

- **Increased Complexity**: Adds a third dimension to capability matching, which increases the complexity of the image selection algorithm.

Member

I understand that the image selection algorithm will be more complex, but it should get a lot less complex for the users of the service. Right?

Author

Yes, I will make that more explicit, as this is one of the main drivers for the GEP:

Shift the complexity and validation of the worker config away from the user and into our implementation.

@gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 14d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as active with /lifecycle active
  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

@gardener-prow gardener-prow Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2026
@gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 14d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as active with /lifecycle active
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close

/lifecycle rotten

@gardener-prow gardener-prow Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 11, 2026
…tation

Update GEP number from 0050 to 0054

Update README.md

incorparate initial GEP feedback

rewrite based on three party cooperation (gardener,image,infra)

simplified version

Enhance GEP-0054 README: Add Risks and Mitigations section, refine capability definitions, and update validation details
@Roncossek Roncossek force-pushed the worker-capabilities branch from ca4a401 to cc18739 Compare May 22, 2026 10:24
@Roncossek
Author

Roncossek commented May 22, 2026

Thanks for the valuable feedback so far! I've reworked the GEP substantially.

I structured the proposal around the two currently known use cases that connect worker pool fields with GEP-33 capabilities:

  • in-place node updates as a worker update strategy
  • secure boot support (planned)

This gives the proposal more direction and makes the benefits clearer.

I also narrowed the scope: the GEP now only covers capabilities owned by Gardener under the gardener- prefix. Provider-extension-owned capabilities driven by WorkerConfig were deferred to a future GEP once/if actual use cases exist that would justify such changes.

Would appreciate another look.
cc @ScheererJ @rfranzke @vpnachev

@Roncossek
Author

/lifecycle active

@gardener-prow gardener-prow Bot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 22, 2026

Labels

area/os Operator system related area/usability Usability related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants