[WIP] CNTRLPLANE-2711: introduce configuration api for vault kms plugin#1954
flavianmissi wants to merge 3 commits into openshift:master
Conversation
@flavianmissi: This pull request references CNTRLPLANE-2711, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Force-pushed: 0bfa6fd to 5b652a8
ardaguclu left a comment:
This is WIP, but I just wanted to drop some initial comments.
> mustRegister()
> Additionally, the existing `KMSEncryptionProvider` feature gate is extended to

That would mean that we will have to ask users to enable 2 feature gates for that purpose. Do we really need this? Why can't we tombstone KMSEncryptionProvider and all its types and invent something new (belonging just to KMSv2)?

I've updated the EP to also propose the removal of the KMSEncryptionProvider gate, PTAL.
> provider type in the `KMSProviderType` enum.
> FeatureGateKMSv2 = newFeatureGate("KMSv2").

nitpick - suggested change: `FeatureGateKMSv2 = newFeatureGate("KMSv2").` → `FeatureGateKMSv2 = newFeatureGate("KMSEncryptionv2").`, to emphasize that this is v2 of which feature gate.

I've renamed the gate, PTAL.
> The `KMSConfig` type is extended to include the Vault provider:
> type KMSConfig struct {

We definitely do not need to carry the burden of the previous design.

What are you suggesting? 🤔

If you will tombstone the awsconfig, ignore my comment.

> If you will tombstone the awsconfig, ignore my comment.

I wasn't planning on doing that as part of this EP, but I think that should be our plan before GA. WDYT? Do we need an EP for that?

In my opinion, this is exactly the part of this EP in tech preview v2. @benluddy for more insights.

yeah, after doing some more thinking, you're absolutely right. I'll change this EP to include the removal of awsconfig 👍🏼
> kind: Secret
> metadata:
>   name: vault-approle
>   namespace: openshift-config

ah, I see. This is a different namespace than openshift-config-managed.
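For context, a hypothetical full shape of the AppRole credentials Secret under discussion - the name and namespace come from the diff context above, and the key names follow the field docs quoted elsewhere in this review ("roleID", "secretID"); the placeholder values are illustrative only:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: vault-approle
  namespace: openshift-config  # not openshift-config-managed
type: Opaque
stringData:
  # Key names per the proposed approleSecretRef field documentation.
  roleID: <vault-approle-role-id>
  secretID: <vault-approle-secret-id>
```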
> name: cluster
> spec:
>   encryption:
>     type: KMS

This should be different. Otherwise, controllers will use v1.

Introduced a new type, PTAL.
Force-pushed: 798cdf3 to df24f00
/retitle CNTRLPLANE-2711: introduce configuration api for vault kms plugin
Force-pushed: df24f00 to 63c4117
> This enhancement introduces a new feature gate `ManagedKMSProvider` to control
> access to Vault KMS provider configuration in the APIServer API. This gate
> works in conjunction with the `KMSEncryption` feature gate from

I think ManagedKMSProvider does not work in conjunction with KMSEncryption. They manage their own encryption modes separately, and ManagedKMSProvider supersedes KMSEncryption.

Oops, I missed that, thanks for pointing it out.
> works in conjunction with the `KMSEncryption` feature gate from
> [kms-encryption-foundations.md](./kms-encryption-foundations.md).
>
> Additionally, this enhancement proposes removing the

Can we also mention the removal of AWSConfig in KMSConfig, which is managed by the KMSEncryption feature gate (which is proposed to be removed)?
> plugins based on API configuration (plugin deployment and lifecycle management
> functionality is implemented):
>
> 1. The unmanaged `KMS` encryption type will be deprecated and eventually removed

When this will happen is not obvious; will ManagedKMS be renamed to KMS at GA, pre-GA, tech preview v2, etc.?

I wasn't specific here on purpose, but we should at least say it's pre-GA. I'll update it 👍🏼
| // - "secretID": The AppRole Secret ID | ||
| // | ||
| // +required | ||
| ApproleSecretRef corev1.LocalObjectReference `json:"approleSecretRef,omitempty"` |
There was a problem hiding this comment.
First time I'm seeing LocalObjectReference and I couldn't find any example in openshift/api. Are we sure that about this?
There was a problem hiding this comment.
docs: https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/local-object-reference/
However, you're right to point this out - the reference shouldn't be local as we're telling users to place the secret in the openshift-config namespace. I think the right type for this is SecretNameReference, thanks for bringing this up 😄
There was a problem hiding this comment.
yeah, I agree with you. SecretNameReference is the correct one
> **Field Validation Limits and Error Messages:**
>
> The API enforces the following validation limits on VaultKMSConfig fields, with

Would it make sense to ask the Vault folks about these validations? Maybe they can enlighten us about the validations (to align with the future implementations).
> cannot ship closed-source software, so users must provide the container image
> reference for the plugin.
>
> **Image Reference:**

I'd expect details on when/what will store and use the image URL (e.g. registry.redhat.io/hashicorp/vault-kms-plugin, etc.) - will a digest be appended? How are we going to react to any URL changes?
> reference for the plugin.
>
> **Image Reference:**
> Users must specify the full container image reference using a digest (not a tag). For example:

Are we going to support multiple registries for the same image?

Yes, users can freely specify the registry.
> This API allows users to specify arbitrary container images that will run in
> the control plane with access to encryption keys. A malicious or compromised
> image could compromise the entire cluster. Users are responsible for ensuring

How can users ensure that images come from trusted sources if they are only allowed to set a digest?

we're proposing that users provide the entire pull spec for the plugin image, not only the digest.

How do we enforce the use of certified images and not malicious ones?

This section already answers my question:

> This API allows users to specify arbitrary container images that will run in the control plane with access to encryption keys. A malicious or compromised image could compromise the entire cluster. Users are responsible for ensuring images come from trusted sources.

Yes, some time before GA we're also going to add some safety checks, such as image signature and Red Hat certification. However, this is out of the scope of this EP.
JoelSpeed left a comment:
Please consider that you can change a tech preview API within reason between releases. There's a bunch of complexity in this document about adding a new enum value only to rename it later; I'm not sure why we need that.

Also, I'm not really sure what a user can achieve with this new API - it currently seems like we are proposing an API which, when set, achieves nothing. That is unusual.

I'm currently a hard no on shipping an API that does nothing - if I've missed what the point of this API is, please let me know.
> implementation. Eventually, the unmanaged `KMS` type will be removed and
> `ManagedKMS` will be renamed to `KMS`.

I don't think a rename is actually possible in a v1 API; you'd break clients. This could only happen if we added a v2 version of the API.
> goal across multiple enhancements. While this enhancement only adds the Vault
> API configuration, future work may add other managed provider types. The

They should come under their own gate, no? This implies to me that you'll block the value part of this feature from promoting until you then have those other pieces in place.
Better to have smaller-scoped FGs IMO.
> (hardcoded static endpoint)
>
> The `ManagedKMSProvider` feature gate (this enhancement) provides:
> - The `encryption.type: ManagedKMS` support (operator-managed KMS - Tech Preview v2)

Under ManagedKMS, will there then be another union to allow a choice of the provider implementation? For example, you currently have Vault; will there be other choices later?

Yes, we expect our customers will request other providers to be supported, and all KMS plugins of potential future providers will be managed by OCP. In other words, all KMS plugins are to be managed by OCP.
> The `ManagedKMSProvider` feature gate (this enhancement) provides:
> - The `encryption.type: ManagedKMS` support (operator-managed KMS - Tech Preview v2)
> - The `kms` field in APIServerEncryption with provider-specific configuration

Is this a union? What does this structure actually look like?
> plugins based on API configuration (plugin deployment and lifecycle management
> functionality is implemented):
>
> 1. The unmanaged `KMS` encryption type will be deprecated and eventually removed

Has KMS shipped in the Default featureset yet? The above paragraph implies no.
> Applying the above configuration will allow the API server to accept and
> validate the Vault KMS configuration, but it will not result in functional
> Vault KMS encryption. No operator will deploy or manage the Vault KMS plugin
> based on this configuration until future enhancements implement that
> functionality.

How will a user configure whether they own deploying Vault, or whether Vault should be deployed by OCP?

After TP v2, OCP will always deploy the KMS plugin. We'll eventually remove TP v1 (user-deployed plugin), so that in the end, OCP is fully responsible for it, and user-deployed plugins are not supported.

If OCP is responsible for configuring this in the long term, do we need such a broad API surface that allows users to BYO? That's not something we actually want to support in the product in the long term, right?

The API here defines what users will need to provide OCP in order for OCP to configure and run the Vault KMS plugin. We're exposing the fields which we cannot decide on behalf of the user.
I'm going to try to explain this in the EP text.
> Future work may unify Hypershift to use the KMS configuration introduced here,
> at which point this Vault KMS API would become available to Hypershift. That
> unification is out of scope for this enhancement.

What does the existing API look like? Is it roughly the same as what we have here? Is there anything required by the new OCP API that isn't required in the existing HyperShift API?
If HyperShift were the first consumer of each feature, and then it was added to OCP later, do you think this would change how you design this?
> Users can configure the API, but no operators will act on the configuration
> until future enhancements implement the operator-managed plugin deployment.

We shouldn't ever ship an API without an implementation.
What is the purpose of a user configuring this API? What happens on upgrade when there suddenly is something observing the config and it makes changes to encryption during an upgrade?

> We shouldn't ever ship an API without an implementation.

We're just hoping to ship this EP without an implementation. I explained this in more detail in a comment.

> What is the purpose of a user configuring this API?

Configuring this API should trigger our apiserver operators to update their apiserver deployments to include a KMS plugin static pod sidecar, and once up and running, to change the cluster's encryption config to use the KMS configured by the user, and encrypt certain resources in the cluster.

> What happens on upgrade when there suddenly is something observing the config and it makes changes to encryption during an upgrade?

This is an excellent question and will be addressed by future enhancements. If I remember correctly, the current idea was to not allow this situation to happen, though I don't recall the details. I think @p0lyn0mial has better memory than me and may be able to help 😅

I think I initially misunderstood your upgrade question:

> What happens on upgrade when there suddenly is something observing the config and it makes changes to encryption during an upgrade?

There won't be something suddenly observing the config after an upgrade, because we're planning on landing the API changes and the actual implementation within the same OCP release.
Even if we somehow missed the mark and were forced to deliver the API and implementation in separate OCP releases, the encryption framework in library-go will only allow encryption to begin (aka migration) once all apiserver revisions have converged.
Here's the relevant code for reference:
- https://github.com/openshift/library-go/blob/12d8376369b7c5b76f688d01089882ca28e351c3/pkg/operator/encryption/controllers/migration_controller.go#L210-L213
- https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/controllers/key_controller.go#L171-L178
- https://github.com/openshift/library-go/blob/12d8376369b7c5b76f688d01089882ca28e351c3/pkg/operator/encryption/statemachine/transition.go#L44
I'm uncertain whether the sidecar deployment would also wait for revision convergence. This is something we're currently building and will take into consideration.
> *Mitigation (Planned - Future Enhancement):*
> - **Image signature verification**: Validate images are signed by a trusted
>   authority before deployment
> - Images run in the control plane with restricted RBAC (limits blast radius)

Why do they run in the control plane vs in the data plane?

API servers communicate with KMS plugins via unix sockets. In addition to that, we're planning on running them as sidecars to each apiserver.

This feels like an incredibly highly privileged component to be leaving up to cluster admins to configure the correct image - and to not expect someone at some point to use something that ends up being malicious.

That's true. The reason we're exposing the image field in the API is that the Vault KMS plugin is not going to be open source, so we cannot distribute it ourselves. Before this feature GAs, we're going to add some safety checks, like image signature and RH certification. This will lower the chances of someone using a malicious image, but a determined person could still circumvent these checks (as is the case for many security measures). We're open to various ways of hardening this, so if you have any ideas, please do share.
> - **Registry dependency**: Requires HashiCorp to publish images to a specific
>   Red Hat registry, or Red Hat to mirror the HashiCorp image to its own registry
> - **Partnership constraint**: Depends on formal agreement with HashiCorp for
>   image publishing to Red Hat registries, or requires internal agreement
>   between distinct parts of Red Hat

If we had this route, would we even need the sha to be provided? Wouldn't we be able to hardcode that as well?

We're trying to avoid coupling OpenShift's release cycles with the KMS plugin's release cycles, specifically to reduce our burden of CVE maintenance.

But that then introduces the burden of cross-version testing and creating a support matrix - which has the lower overall cost?

I'm not sure about cost; it might be that it's cheaper for us to ship supported plugins in the release payload.
The thing is - the Vault KMS plugin is not open source. We can't distribute it 🫤

In other parts of the EP you talk about it being published to a RH registry? Surely if we can't distribute it, it can't be published to a RH registry?

Could you be referring to the "Hardcoded Registry with Digest-Only Image Field" section under the "Alternatives" header? We did consider this, and you're probably right that it might not even be viable given the proprietary nature of the Vault KMS plugin.
Right now, our idea is to sync with the team writing the Vault KMS plugin and agree on image signature verification, probably using a public key and existing signature-verifying infrastructure. We're also planning on verifying that images are certified by RH.
We need more time to explore and design this - I was hoping this would not block us from landing this EP.
Hi @JoelSpeed, thanks for the review. I'll address each of your comments separately, but before that there are a couple of things I'd like to sync on; please bear with me as I try to clarify them :)
1. Introduction of
2. Proposing an API without any implementation to back it

This is a non-exhaustive list; we might need to cover other pieces of this feature, such as key rotation and backup procedures, in their own EPs as well. I hope this clarifies it, please let me know your thoughts.
> # Mirror the image to your registry, then reference it
> vault:
>   vaultKMSPluginImage: mirror.registry.example.com/hashicorp/vault-kms-plugin@sha256:abc123...

Would it make sense to document the risk of a discrepancy between the expected arguments of the given KMS image and the arguments that are passed by plugin lifecycle?
For example, the argument set wired by plugin lifecycle supports the v1 version of the KMS plugin, but the user passes an image pointing to the v2 version of the image, which does not expect arg-1.

Yes, good thinking.
Even though these fields are tech preview, there are changes you can and can't make. This is, after all, affecting a v1 API, and those APIs have clients already. We need to think very carefully about what we want our end state of the API to look like, because some of the assumptions you make here (like renaming things) could be changes that we can't actually make.

You are not expecting folks to be able to upgrade from 4.21.z? Why can you not use the existing TP enum value in TP v2? You can change the meaning of the API since the feature hasn't shipped yet.

Generally, to be able to design an API, you have to understand the context of its usage. This split of the EPs means it's very hard for API reviewers to advise you on the best shape for your API. We need to understand both how you expect users to interact with the API and also how the data in the API is used, to best advise you.
This commit brings syntactic API definitions that are proposed in openshift/enhancements#1954. The resulting API definitions might be totally different from this commit, but this does not affect the implementation of foundational work in library-go.
I'm temporarily putting this on WIP again while I work through the comments.
/retitle [WIP] CNTRLPLANE-2711: introduce configuration api for vault kms plugin
also move tls config into a sub-section
@flavianmissi: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
After further discussion, we decided to follow the API reviewers' advice and keep all KMS-related changes in the KMS encryption foundations API. The new (WIP) PR can be found here: #1972
/close
@flavianmissi: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
No description provided.