GEP-0054: Worker Capabilities for Machine Image Selection#53
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| The following example capability names are reserved for Gardener core use. All reserved capabilities use the `gardener-` prefix: | ||
|
|
||
| ```go | ||
| const ( |
There was a problem hiding this comment.
Following https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#constants, the enums should be in pascal case, e.g. GardenerBootType instead of gardener-bootType.
There was a problem hiding this comment.
+1 to have consistent constant values.
I would be fine with both gardener-boot-type and GardenerBootType, but mixing the styles looks strange.
|
|
||
| ## Summary | ||
|
|
||
| This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed. |
There was a problem hiding this comment.
I struggle with the term "worker properties" or "worker capabilities". Can you find a better name?
|
|
||
| ## Summary | ||
|
|
||
| This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed. |
There was a problem hiding this comment.
| This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed. | |
| This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker pool configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed. |
|
|
||
| ## Motivation | ||
|
|
||
| Currently, machine image capabilities as defined in GEP-0033 only support selecting images based on hardware properties of the machine type (CPU architecture, hypervisor type, bare-metal vs virtualized). However, users increasingly need the ability to select images based on software or feature requirements that are specified in the worker configuration, such as: |
There was a problem hiding this comment.
Isn't the real motivation the defaulting? Users (already today) know what their worker pools will need. If they need in-place updates, they can look up a proper machine image/type in the CloudProfile. Same goes for other capability requirements.
What you seem to want to achieve is that they don't have to look this up and manage themselves, right?
| - **GPU Support**: Select images with specific GPU driver support (nvidia, amd, intel) | ||
| - ... | ||
|
|
||
| Without this feature, users must manually ensure that their worker configuration is compatible with the selected machine image, which is error-prone and limits the automation benefits provided by Gardener's maintenance operations. |
There was a problem hiding this comment.
Aha, yes, here you seem to confirm. I would update this section to highlight this main use-case.
| - **Machine type capabilities**: Describe what the hardware supports (e.g., ARM64 architecture, virtualized hypervisor) | ||
| - **Worker capabilities**: Describe what features the worker configuration requests (e.g., secure boot enabled, in-place updates required) | ||
|
|
||
| Worker capabilities are derived from worker specification properties. To prevent naming conflicts between Gardener-reserved capabilities and custom or provider-defined capabilities, reserved capability names use a `gardener-` prefix. |
There was a problem hiding this comment.
Can you explain/provide some examples what "Gardener-reserved capabilities" would be, in contrast to custom or provider-defined capabilities?
Is this because of NamespacedCloudProfiles where end-users might define their own images or machine types with their custom capabilities? I have to guess way too much here.
| ### Notes/Constraints/Caveats | ||
|
|
||
| - Worker capabilities are **additive** to the existing capability system. Existing CloudProfiles and image selection continue to work without modification. | ||
| - The `gardener-` prefix is reserved for Gardener core capabilities. Provider extensions should use their own prefixes (e.g., `azure-`, `aws-`) for provider-specific capabilities. |
There was a problem hiding this comment.
How do you ensure that the prefixes used by provider extensions will not be taken by users?
There was a problem hiding this comment.
Why not just combine the prefix, e.g. gardener-azure- and gardener-aws-? This was users can use anything, but the prefix gardener-.
| capabilities := make(map[string][]string) | ||
|
|
||
| // Boot type mapping | ||
| if worker.Machine.SecureBoot != nil && *worker.Machine.SecureBoot { |
There was a problem hiding this comment.
The capabilities API in the CloudProfile is very generic without hard-coded capability names. Why would we now start to hard-code some of them in the worker pool API?
There was a problem hiding this comment.
Your comment is so true and addresses a tension within the idea behind this GEP. Allow me to elaborate my thoughts here.
My main concern is that the generic capability field on the worker might be confusing to use for the end users of gardener as they need to know what capabilities (and values) exist in the selected CloudProfile.
- GEP-33 is an implementation detail of how Gardener selects the right image flavor automatically. Users specify via a well documented API what they want and capabilities are non of his concern. Only the gardener operator needs to understand the capabilities machanism.
- explicit fields are easier to understand for the end user
Second is that some features might want to define additional configuration like the existing In-Place-Update feature here: (AutoInPlaceUpdate vs ManualInPlaceUpdate). This cannot be included in a capability. In these cases this would be kind of double maintenance as gardener can assume, that if these values are set, an image that supports in place updates is required. But this can be handled with some nice defaulting.
If we use the generic capabilities field on the worker API:
How can we provide a good UI and API documentation for the user?
| case "nvidia": | ||
| capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportNvidia} | ||
| case "amd": | ||
| capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportAmd} | ||
| case "intel": | ||
| capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportIntel} | ||
| default: | ||
| capabilities[CapabilityNameGpuSupport] = []string{CapabilityValueGpuSupportNone} |
There was a problem hiding this comment.
We even go beyond and not only hard-code capability names, but also concrete values (like nvidia, etc.)? Then I don't understand the whole effort that was put in GEP-33 to make the API as generic as possible.
|
|
||
| ### Provider-Specific Worker Properties | ||
|
|
||
| Provider extensions can define additional worker properties that map to capabilities. This is achieved through the existing `WorkerConfig` extension mechanism: |
There was a problem hiding this comment.
This sounds wrong, shouldn't we add a dedicated capabilities field to the worker pool API, similar to how it was done for machine images and types? Why would we use the WokerConfig for this?
ScheererJ
left a comment
There was a problem hiding this comment.
Thanks for the proposal.
From my perspective, it should be adapted to be less abstract. I wondered most of the time what this is actually about until I reached the design details. It should be possible to make this more tangible and understandable by adding examples earlier and using a better name as Rafael suggested.
| - [Provider-Specific Worker Properties](#provider-specific-worker-properties) | ||
| - [Component Changes](#component-changes) | ||
| - [Drawbacks](#drawbacks) | ||
|
|
There was a problem hiding this comment.
Nit: Let's not have double empty lines.
|
|
||
| ## Summary | ||
|
|
||
| This GEP extends [GEP-0033 (Machine Image Capabilities)](../0033-machine-image-capabilities/README.md) to support selecting machine images based on worker-defined properties in addition to machine type characteristics. While GEP-0033 enables image selection based on hardware capabilities of machine types (CPU architecture, hypervisor type, etc.), this proposal introduces worker capabilities that describe software/feature requirements requested by the worker configuration. Both machine type capabilities and worker capabilities must be satisfied for image selection to succeed. |
There was a problem hiding this comment.
You added examples for GEP-33, i.e. CPU architecture, hypervisor type, etc.. Could you please also add some examples for the new properties? Otherwise this paragraphs stays very abstract.
| ### Notes/Constraints/Caveats | ||
|
|
||
| - Worker capabilities are **additive** to the existing capability system. Existing CloudProfiles and image selection continue to work without modification. | ||
| - The `gardener-` prefix is reserved for Gardener core capabilities. Provider extensions should use their own prefixes (e.g., `azure-`, `aws-`) for provider-specific capabilities. |
There was a problem hiding this comment.
Why not just combine the prefix, e.g. gardener-azure- and gardener-aws-? This was users can use anything, but the prefix gardener-.
| The following example capability names are reserved for Gardener core use. All reserved capabilities use the `gardener-` prefix: | ||
|
|
||
| ```go | ||
| const ( |
There was a problem hiding this comment.
+1 to have consistent constant values.
I would be fine with both gardener-boot-type and GardenerBootType, but mixing the styles looks strange.
|
|
||
| ## Drawbacks | ||
|
|
||
| - **Increased Complexity**: Adds a third dimension to capability matching, which increases the complexity of the image selection algorithm. |
There was a problem hiding this comment.
I understand that the image selection algorithm will be more complex, but it should get a lot less complex for the users of the service. Right?
There was a problem hiding this comment.
Yes I will make that more explicit, as this is one of the main drivers for the GEP:
Shift complexity and validation of worker config away from the user to our implementation.
|
The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
You can:
/lifecycle stale |
|
The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
You can:
/lifecycle rotten |
…tation Update GEP number from 0050 to 0054 Update README.md incorparate initial GEP feedback rewrite based on three party cooperation (gardener,image,infra) simplified version Enhance GEP-0054 README: Add Risks and Mitigations section, refine capability definitions, and update validation details
ca4a401 to
cc18739
Compare
… machine image selection
|
Thanks for the valuable feedback so far! I've reworked the GEP substantially. I structured the proposal around the two currently known use cases that connect worker pool fields with GEP-33 capabilities:
This gives the proposal more direction and makes the benefits clearer. I also narrowed the scope: the GEP now only covers capabilities owned by Gardener under the gardener- prefix. Provider-extension-owned capabilities driven by WorkerConfig were deferred to a future GEP once/if actual use cases exist that would justify such changes. Would appreciate another look. |
|
/lifecycle active |
/area os usability
/kind enhancement