SSCSI-254: Configurable secret rotation and WIF support for SSCSI#2012
chiragkyal wants to merge 3 commits into
Skipping CI for Draft Pull Request.
@chiragkyal: This pull request references SSCSI-254, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
/cc @mytreya-rh @dobsonj
workloads that use static secrets, so that the driver does not make unnecessary
provider API calls that may count against rate limits.

- As a cluster administrator, I want to configure the rotation polling interval to
I think the need is actually for a larger value, so the story is really about the ability to configure the rotation poll interval.
[Out of scope for this PR] We could also explore, in the upstream ecosystem, an issuer/provider-driven refresh of the secret, which could be more efficient than frequent polling.
https://redhat.atlassian.net/browse/RFE-8422
The customer manages approximately 200 secrets per cluster. Continuous polling of the Azure Key Vault for secret updates results in a high number of transactions, leading to unnecessary costs and performance overhead.
Having the ability to control secret rotation behavior would provide better cost efficiency and operational flexibility.
> I think the need is actually for a larger value, so the story is really about the ability to configure the rotation poll interval.
Thanks for the suggestion. Updated the user story.
> [Out of scope for this PR] We could also explore, in the upstream ecosystem, an issuer/provider-driven refresh of the secret, which could be more efficient than frequent polling.
> https://redhat.atlassian.net/browse/RFE-8422
> The customer manages approximately 200 secrets per cluster. Continuous polling of the Azure Key Vault for secret updates results in a high number of transactions, leading to unnecessary costs and performance overhead.
> Having the ability to control secret rotation behavior would provide better cost efficiency and operational flexibility.
Yeah, agreed! A push method is much more efficient than a pull method.
**Upgrade**: Clusters upgrading to the new operator version will see no behavior
change. The operator defaults match the previously hardcoded values
(`requiresRepublish: true`, `--enable-secret-rotation=true`,
`--rotation-poll-interval=2m`, no `tokenRequests`).
We need to read the tokenRequests that users have already set on the CSIDriver and merge them with the user settings from ClusterCSIDriver.*
We could also think of a better way to let users keep their already-configured CSIDriver settings than the suggestion above.
*This is needed because some users may already have set the audience for the Azure WIF integration.
The AWS provider was updated later, but the Azure provider has needed the audience for WIF flows for a long time.
Please note that a user's changes to the CSIDriver would not have triggered a reconcile (back to the operator's static CSIDriver manifest) because the hash value in the annotation doesn't change.
Good point. I looked into how the CSIDriver object reconciliation works in library-go's ApplyCSIDriver. The operator's manifest never included tokenRequests or requiresRepublish. So when someone manually patched tokenRequests onto the CSIDriver (for Azure WIF), the operator's desired spec hadn't changed, the hash matched, and reconciliation was a no-op.
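To make the no-op concrete, here is a minimal sketch of hash-annotation-based reconciliation. It is not library-go's actual code: `DriverSpec`, `LiveDriver`, `hashSpec`, and `needsUpdate` are hypothetical names, and the only assumption carried over from the discussion is that the operator compares a hash of its desired spec against a hash recorded on the live object.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// TokenRequest is a simplified stand-in for a CSIDriver token request.
type TokenRequest struct {
	Audience string `json:"audience"`
}

// DriverSpec is a simplified stand-in for the CSIDriver spec (hypothetical).
type DriverSpec struct {
	AttachRequired bool           `json:"attachRequired"`
	TokenRequests  []TokenRequest `json:"tokenRequests,omitempty"`
}

// LiveDriver models the object in the cluster: its current spec plus the
// hash recorded in an annotation when the operator last applied its manifest.
type LiveDriver struct {
	Spec     DriverSpec
	SpecHash string
}

// hashSpec returns a stable hash of a desired spec.
func hashSpec(s DriverSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// needsUpdate compares only the recorded hash with the hash of the desired
// spec. The live spec itself is never inspected.
func needsUpdate(desired DriverSpec, live LiveDriver) bool {
	return hashSpec(desired) != live.SpecHash
}

func main() {
	desired := DriverSpec{AttachRequired: false}
	live := LiveDriver{Spec: desired, SpecHash: hashSpec(desired)}

	// A user patches tokenRequests directly onto the live object.
	live.Spec.TokenRequests = []TokenRequest{{Audience: "api://AzureADTokenExchange"}}
	fmt.Println(needsUpdate(desired, live)) // false: the manual patch survives

	// Once the operator's desired spec gains tokenRequests, the hash changes.
	desired.TokenRequests = []TokenRequest{{Audience: "sts.amazonaws.com"}}
	fmt.Println(needsUpdate(desired, live)) // true
}
```

The second case is the one that matters for this enhancement: as soon as the desired spec changes, the live object gets replaced, so anything patched by hand has to be carried over explicitly or it is lost.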
I thought about the merge approach ("if ClusterCSIDriver.tokenRequests is empty, preserve whatever's on the CSIDriver"), but it could cause a problem:
- Before upgrade: the field doesn't exist on ClusterCSIDriver -> treated as "preserve what is on the CSIDriver".
- User sets tokenRequests via ClusterCSIDriver after upgrade (AWS WIF) -> merged with the existing, preserved tokenRequests on the CSIDriver (AWS WIF + Azure WIF).
- User later removes tokenRequests from ClusterCSIDriver -> the merge logic says "preserve existing" -> there is no way to actually clear it.
I think we should keep ClusterCSIDriver as the sole source of truth. The operator owns and manages the CSIDriver object completely, and the object is not supposed to be edited manually. For users who already have Azure WIF configured manually, we can add a release note telling them to move their tokenRequests into ClusterCSIDriver. This keeps things predictable: both adding and removing tokenRequests work as expected.
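The three steps above can be traced with a small sketch. `mergePreserve` and `soleSource` are hypothetical helpers written only for this illustration, and audiences are reduced to plain strings.

```go
package main

import "fmt"

// mergePreserve models the "preserve if empty, otherwise merge" idea:
// an empty configuration keeps whatever is already on the CSIDriver.
func mergePreserve(configured, existing []string) []string {
	if len(configured) == 0 {
		return existing
	}
	seen := make(map[string]bool)
	var merged []string
	for _, list := range [][]string{existing, configured} {
		for _, aud := range list {
			if !seen[aud] {
				seen[aud] = true
				merged = append(merged, aud)
			}
		}
	}
	return merged
}

// soleSource models ClusterCSIDriver as the only source of truth.
func soleSource(configured []string) []string {
	return configured
}

func main() {
	// Manually patched on the CSIDriver before the upgrade (Azure WIF).
	onDriver := []string{"api://AzureADTokenExchange"}

	// Step 1: upgrade, nothing configured on ClusterCSIDriver yet.
	onDriver = mergePreserve(nil, onDriver)

	// Step 2: user adds the AWS audience via ClusterCSIDriver.
	onDriver = mergePreserve([]string{"sts.amazonaws.com"}, onDriver)

	// Step 3: user removes tokenRequests from ClusterCSIDriver again.
	onDriver = mergePreserve(nil, onDriver)
	fmt.Println(onDriver) // [api://AzureADTokenExchange sts.amazonaws.com]

	// With a sole source of truth, removing the configuration clears it.
	fmt.Println(len(soleSource(nil))) // 0
}
```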
Even if we announce this in the release notes, there could be temporary secret rotation or pull failures between the time the operator is upgraded and the time the ClusterCSIDriver is configured.
There is also the risk that the release notes are overlooked, causing an outage because the ClusterCSIDriver was never configured (especially in auto-upgrade scenarios).
Can we add a parameter in the secretsStore section of ClusterCSIDriver that controls whether or not we update the CSIDriver?
The default would be to keep the existing tokenRequests on the CSIDriver.
The user can then update the ClusterCSIDriver with the needed audience and enable the parameter.
When the parameter is set to not overwrite tokenRequests and tokenRequests is not empty, we could go Degraded or set a relevant status condition to alert the user.
> Can we add a parameter in the secretsStore section of ClusterCSIDriver that controls whether or not we update the CSIDriver?
Whatever parameter we add will be part of the new OpenShift release. Operators deployed on older OpenShift releases won't have this parameter, so the apiserver will ignore this field or might error as unknown.
Do we have any data on how many clusters have manually patched the CSIDriver? TBH, users should not modify resources managed by an operator, as such changes may be reverted.
> Whatever parameter we add will be part of the new OpenShift release. Operators deployed on older OpenShift releases won't have this parameter, so the apiserver will ignore this field or might error as unknown.
Sorry, but my suggestion is NOT about pre-upgrade. It is about what happens after the upgrade.
If the API has a field, let's say syncTokenRequests, with a default value of "false", the operator can look at it and NOT overwrite the already-configured audience.
We will document that, along with populating tokenRequests, syncTokenRequests should be set to "true".
> Do we have any data on how many clusters have manually patched the CSIDriver? TBH, users should not modify resources managed by an operator, as such changes may be reverted.
We know of at least one user https://redhat-internal.slack.com/archives/C08F8UBM0F7/p1758270464495719
We have not yet surveyed how many users have configured the audience directly on the CSIDriver.
But as you know, the operator does not immediately reconcile this change.
So I think that where we can provide a smooth integration, we should, to avoid surprising our users.
Thanks for the suggestion. I've incorporated this into the proposal with the following design:
New struct: tokenRequests with policy and audiences
    secretsStore:
      tokenRequests:
        policy: Managed  # or "Unmanaged" (default)
        audiences:
          - audience: "sts.amazonaws.com"
            expirationSeconds: 3600
          - audience: "api://AzureADTokenExchange"

How it works:
policy: "Unmanaged" (default): The operator reads the existing CSIDriver.spec.tokenRequests from the cluster and includes them in the desired spec. This means any manually patched audiences are preserved.
policy: "Managed": The operator uses the audiences list from ClusterCSIDriver as the sole source of truth. This gives the user full control over adding and removing audiences.
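The two policies could be resolved roughly as follows. This is a sketch under assumptions, not the operator's implementation: `TokenRequestsConfig` and `desiredTokenRequests` are hypothetical names, and the struct fields are inferred from the YAML above rather than taken from the API PR.

```go
package main

import "fmt"

// TokenRequestsPolicy selects who owns CSIDriver.spec.tokenRequests.
type TokenRequestsPolicy string

const (
	PolicyManaged   TokenRequestsPolicy = "Managed"
	PolicyUnmanaged TokenRequestsPolicy = "Unmanaged"
)

// SecretsStoreTokenRequest is one audience entry (fields assumed from the YAML).
type SecretsStoreTokenRequest struct {
	Audience          string
	ExpirationSeconds *int64 // optional
}

// TokenRequestsConfig is a hypothetical shape for secretsStore.tokenRequests.
type TokenRequestsConfig struct {
	Policy    TokenRequestsPolicy
	Audiences []SecretsStoreTokenRequest
}

// desiredTokenRequests returns what the operator would put into the desired
// CSIDriver spec. live is the tokenRequests list read from the cluster.
func desiredTokenRequests(cfg TokenRequestsConfig, live []SecretsStoreTokenRequest) []SecretsStoreTokenRequest {
	if cfg.Policy == PolicyManaged {
		// ClusterCSIDriver is the sole source of truth, even when empty.
		return cfg.Audiences
	}
	// "Unmanaged" (or unset): carry over whatever is already on the CSIDriver.
	return live
}

func main() {
	exp := int64(3600)
	live := []SecretsStoreTokenRequest{{Audience: "api://AzureADTokenExchange"}}
	cfg := TokenRequestsConfig{
		Policy:    PolicyManaged,
		Audiences: []SecretsStoreTokenRequest{{Audience: "sts.amazonaws.com", ExpirationSeconds: &exp}},
	}
	fmt.Println(desiredTokenRequests(cfg, live)[0].Audience) // sts.amazonaws.com

	cfg.Policy = PolicyUnmanaged
	fmt.Println(desiredTokenRequests(cfg, live)[0].Audience) // api://AzureADTokenExchange
}
```

Note that under "Managed" an empty audiences list yields an empty result, which is what makes clearing possible.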
Signed-off-by: chiragkyal <ckyal@redhat.com>
Signed-off-by: chiragkyal <ckyal@redhat.com>
mytreya-rh left a comment
/lgtm
with a minor comment
// list, replacing any previously configured values.
// +default="Unmanaged"
// +optional
Policy TokenRequestsPolicy `json:"policy,omitempty"`
I think we should make it immutable once set to "Managed". Would there be any issues with such a restriction?
Thanks for the suggestion; it has been incorporated. Also added some details about API state during upgrade. Please have a look.
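For illustration only, the one-way transition could be checked like this. `validatePolicyUpdate` is a hypothetical helper; the real API may well express the rule declaratively on the type rather than in Go code.

```go
package main

import (
	"errors"
	"fmt"
)

// TokenRequestsPolicy selects who owns CSIDriver.spec.tokenRequests.
type TokenRequestsPolicy string

const (
	PolicyManaged   TokenRequestsPolicy = "Managed"
	PolicyUnmanaged TokenRequestsPolicy = "Unmanaged"
)

var errPolicyImmutable = errors.New(`tokenRequests.policy cannot be changed once set to "Managed"`)

// validatePolicyUpdate allows Unmanaged -> Managed but rejects any move away
// from Managed, including unsetting the field.
func validatePolicyUpdate(oldPolicy, newPolicy TokenRequestsPolicy) error {
	if oldPolicy == PolicyManaged && newPolicy != PolicyManaged {
		return errPolicyImmutable
	}
	return nil
}

func main() {
	fmt.Println(validatePolicyUpdate(PolicyUnmanaged, PolicyManaged)) // <nil>
	fmt.Println(validatePolicyUpdate(PolicyManaged, PolicyUnmanaged)) // prints the immutability error
}
```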
Signed-off-by: chiragkyal <ckyal@redhat.com>
New changes are detected. LGTM label has been removed.
@chiragkyal: all tests passed!
jsafrane left a comment
From the storage point of view, it looks solid. It follows existing ClusterCSIDriver API usage and also the common usage of the CSI interface in Kubernetes ephemeral volumes.
- On upgrade, `tokenRequests.policy` defaults to `"Unmanaged"`, preserving any
  existing tokenRequests on the CSIDriver.
Is this a behavior change? The current operator will overwrite any CSIDriver changes made by the cluster admin during cluster update. This enhancement suggests it won't be overwritten. I think it's a good step forward, but it should be explicitly called out and documented.
To clarify: there is no behavior change from the current operator's perspective.
Today the operator does not set tokenRequests at all (it's not in the static csidriver.yaml template), so any manually-patched tokenRequests on the CSIDriver object already survive reconciliation today.
What this enhancement does change is adding requiresRepublish and tokenRequests to the desired CSIDriver spec (which was previously unset). That changes the spec-hash and would trigger a delete+recreate, which would wipe manually patched tokenRequests if we do not explicitly preserve them.
So the "Unmanaged" default is not about any behavior change of the operator, it's about preventing the damage to user configured tokenRequests during the recreate.
See #2012 (comment) for a similar discussion.
// Only honored when policy is "Managed".
// +optional
// +listType=atomic
Audiences []SecretsStoreTokenRequest `json:"audiences,omitempty"`
How many audiences does Kubernetes support? Should there be an upper limit?
And in general, most (all?) new fields need some validation: for example, explicit enum values for all Policy fields, a lower bound for <anything>Seconds (negative numbers are probably bad), etc.
BTW, the validation can wait for the API review.
Yes, that's true. I have a dedicated API PR: openshift/api#2846
It's better to get these validations finalized there; then we can copy them over to the EP.
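Until the API review settles the rules, the kinds of checks raised in this thread might look like the sketch below. Everything numeric here is a placeholder: `maxAudiences` and `minExpirationSeconds` are hypothetical limits chosen for illustration, and `validatePolicy` and `validateAudiences` are hypothetical helper names.

```go
package main

import "fmt"

const (
	maxAudiences         = 10  // hypothetical cap on list length
	minExpirationSeconds = 600 // assumed lower bound, to be confirmed in API review
)

// TokenRequestsPolicy selects who owns CSIDriver.spec.tokenRequests.
type TokenRequestsPolicy string

// SecretsStoreTokenRequest is one audience entry (fields assumed).
type SecretsStoreTokenRequest struct {
	Audience          string
	ExpirationSeconds *int64 // optional
}

// validatePolicy accepts only the two explicit enum values.
func validatePolicy(p TokenRequestsPolicy) error {
	switch p {
	case "Managed", "Unmanaged":
		return nil
	}
	return fmt.Errorf("policy: unsupported value %q", p)
}

// validateAudiences enforces a list cap, non-empty unique audiences, and a
// lower bound on expirationSeconds when it is set.
func validateAudiences(reqs []SecretsStoreTokenRequest) error {
	if len(reqs) > maxAudiences {
		return fmt.Errorf("audiences: at most %d entries allowed, got %d", maxAudiences, len(reqs))
	}
	seen := make(map[string]struct{}, len(reqs))
	for i, r := range reqs {
		if r.Audience == "" {
			return fmt.Errorf("audiences[%d]: audience must not be empty", i)
		}
		if _, dup := seen[r.Audience]; dup {
			return fmt.Errorf("audiences[%d]: duplicate audience %q", i, r.Audience)
		}
		seen[r.Audience] = struct{}{}
		if r.ExpirationSeconds != nil && *r.ExpirationSeconds < minExpirationSeconds {
			return fmt.Errorf("audiences[%d]: expirationSeconds must be at least %d", i, minExpirationSeconds)
		}
	}
	return nil
}

func main() {
	negative := int64(-1)
	err := validateAudiences([]SecretsStoreTokenRequest{
		{Audience: "sts.amazonaws.com", ExpirationSeconds: &negative},
	})
	fmt.Println(err != nil) // true: negative durations are rejected
	fmt.Println(validatePolicy("Managed"))
}
```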
Summary
This enhancement proposal adds configurable secret rotation and workload identity federation (WIF) support to the OpenShift Secrets Store CSI Driver Operator via the ClusterCSIDriver CR.
Changes
- CSIDriverConfigSpec with a new SecretsStore discriminated union variant containing secretRotation and tokenRequests fields.
- storage.k8s.io/v1 CSIDriver object (requiresRepublish, tokenRequests)
- (--enable-secret-rotation, --rotation-poll-interval)
- rotation controller with kubelet-native requiresRepublish.
Tracking
/cc @mytreya-rh @dobsonj