SSCSI-254: Configurable secret rotation and WIF support for SSCSI #2012

Open
chiragkyal wants to merge 3 commits into openshift:master from chiragkyal:configure-secret-rotation-and-wif

@chiragkyal chiragkyal commented May 16, 2026

Summary

This enhancement proposal adds configurable secret rotation and workload identity federation (WIF) support to the OpenShift Secrets Store CSI Driver Operator via the ClusterCSIDriver CR.

Changes

  • Extends CSIDriverConfigSpec with a new SecretsStore discriminated union
    variant containing secretRotation and tokenRequests fields.
  • The operator will dynamically propagate these settings to:
    • The storage.k8s.io/v1 CSIDriver object (requiresRepublish, tokenRequests)
    • The driver DaemonSet container args (--enable-secret-rotation, --rotation-poll-interval)
  • Aligns with upstream Secrets Store CSI Driver v1.6.0, which replaced the internal
    rotation controller with kubelet-native requiresRepublish.

Tracking

/cc @mytreya-rh @dobsonj

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) May 16, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) May 16, 2026
openshift-ci Bot commented May 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot commented May 16, 2026

@chiragkyal: This pull request references SSCSI-254 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

openshift-ci Bot commented May 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign suleymanakbas91 for approval. For more information see the Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@chiragkyal chiragkyal force-pushed the configure-secret-rotation-and-wif branch from 00a6104 to a49110c May 16, 2026 20:00
@chiragkyal chiragkyal marked this pull request as ready for review May 18, 2026 06:46
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) May 18, 2026
@chiragkyal
Member Author

/cc @mytreya-rh @dobsonj

@openshift-ci openshift-ci Bot requested review from dobsonj and mytreya-rh May 18, 2026 06:54
workloads that use static secrets, so that the driver does not make unnecessary
provider API calls that may count against rate limits.

- As a cluster administrator, I want to configure the rotation polling interval to
Contributor

I think the need is actually to have a larger value. So the story, I think, is more about the ability to configure the rotation poll interval.
[out of scope for this PR] We could also explore, in the upstream ecosystem, an issuer/provider-driven refresh of the secret, which could be more efficient than frequent polling.

https://redhat.atlassian.net/browse/RFE-8422
The customer manages approximately 200 secrets per cluster. Continuous polling of the Azure Key Vault for secret updates results in a high number of transactions, leading to unnecessary costs and performance overhead.
Having the ability to control secret rotation behavior would provide better cost efficiency and operational flexibility.

Member Author

> I think the need is actually to have a larger value. So the story, I think, is more about the ability to configure the rotation poll interval.

Thanks for the suggestion. Updated the user story.

> [out of scope for this PR] We could also explore, in the upstream ecosystem, an issuer/provider-driven refresh of the secret, which could be more efficient than frequent polling.
>
> https://redhat.atlassian.net/browse/RFE-8422
> The customer manages approximately 200 secrets per cluster. Continuous polling of the Azure Key Vault for secret updates results in a high number of transactions, leading to unnecessary costs and performance overhead.
> Having the ability to control secret rotation behavior would provide better cost efficiency and operational flexibility.

Yeah, agreed! A push method is much more efficient than a pull method.

**Upgrade**: Clusters upgrading to the new operator version will see no behavior
change. The operator defaults match the previously hardcoded values
(`requiresRepublish: true`, `--enable-secret-rotation=true`,
`--rotation-poll-interval=2m`, no `tokenRequests`).
Contributor

We need to read the tokenRequests already set on the CSIDriver by users and merge them with the user settings from ClusterCSIDriver.*
We could also think of a better way than the above suggestion to let users keep their already configured CSIDriver settings.

*This is needed because some users may already have set the audience for the Azure WIF integration.
The AWS provider was updated later, but the Azure provider has needed the audience for WIF flows for a long time.
Please note that changes made to the CSIDriver by a user would not have caused a reconcile (back to the operator's static CSIDriver manifest), because the hash value in the annotation doesn't change.

Member Author

Good point. I looked into how the CSIDriver object reconciliation works in library-go's ApplyCSIDriver. The operator's manifest never included tokenRequests or requiresRepublish. So when someone manually patched tokenRequests onto the CSIDriver (for Azure WIF), the operator's desired spec hadn't changed, the hash matched, and reconciliation was a no-op.

I thought about the merge approach ("if ClusterCSIDriver.tokenRequests is empty, preserve whatever's on the CSIDriver"), but it creates a problem:

  • Before upgrade: the field doesn't exist on ClusterCSIDriver -> treated as "preserve what is on the CSIDriver".
  • After upgrade, the user sets tokenRequests via ClusterCSIDriver (AWS WIF) -> merged with the tokenRequests already on the CSIDriver (AWS WIF + Azure WIF).
  • The user later removes tokenRequests from ClusterCSIDriver -> the merge logic says "preserve existing" -> there is no way to actually clear it.

I think we should keep ClusterCSIDriver as the sole source of truth. The operator fully owns and manages the CSIDriver object, and it is not supposed to be modified manually. For users who already have Azure WIF configured manually, we can add a release note telling them to move their tokenRequests into ClusterCSIDriver. This keeps things predictable: both adding and removing tokenRequests work as expected.
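
The three steps in the list above can be replayed with a minimal sketch of the rejected merge rule. The function `mergeIfEmpty` and the use of plain audience strings are hypothetical simplifications made for illustration; they are not taken from the operator or from library-go.

```go
package main

import "fmt"

// mergeIfEmpty is a hypothetical model of the rejected rule: preserve what is
// on the CSIDriver when nothing is configured, otherwise take the union.
func mergeIfEmpty(configured, onCluster []string) []string {
	if len(configured) == 0 {
		// "Preserve existing": an empty configuration cannot clear anything.
		return append([]string(nil), onCluster...)
	}
	seen := map[string]bool{}
	var merged []string
	for _, list := range [][]string{onCluster, configured} {
		for _, audience := range list {
			if !seen[audience] {
				seen[audience] = true
				merged = append(merged, audience)
			}
		}
	}
	return merged
}

func main() {
	// Step 1: before upgrade, Azure WIF was patched manually.
	onCluster := mergeIfEmpty(nil, []string{"api://AzureADTokenExchange"})
	// Step 2: the user adds AWS WIF through ClusterCSIDriver.
	onCluster = mergeIfEmpty([]string{"sts.amazonaws.com"}, onCluster)
	// Step 3: the user removes tokenRequests again; both audiences remain.
	onCluster = mergeIfEmpty(nil, onCluster)
	fmt.Println(onCluster)
}
```

After step 3 the CSIDriver still carries both audiences, which is the "no way to clear it" dead end described in the comment.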

Contributor

Even if we notify users in the release notes, there could be temporary secret rotation or pull failures between the time the operator is upgraded and the time the ClusterCSIDriver is configured.
Another risk is that the release notes are overlooked, causing an outage because the ClusterCSIDriver was never configured (especially in auto-upgrade scenarios).

Can we add a parameter in ClusterCSIDriver in the secretsStore section which would control whether or not we will update the CSIDriver?
The default would be to keep existing tokenRequests on the CSIDriver.
The user can then update the ClusterCSIDriver with the needed audience, as well as enable the parameter.
When the parameter is set to not overwrite tokenRequests, and the tokenRequests is not empty, we could degrade or set the relevant status condition to alert the user.

Member Author

> Can we add a parameter in ClusterCSIDriver in the secretsStore section which would control whether or not we will update the CSIDriver?

Whatever parameter we add will be part of the new OpenShift release. Operators deployed on older OpenShift releases won't have this parameter, so the apiserver will either ignore the field or reject it as unknown.

Do we have any data on how many clusters have a manually patched CSIDriver? TBH, users should not modify resources managed by an Operator, as such changes may be reverted.

Contributor

> Whatever parameter we add will be part of the new OpenShift release. Operators deployed on older OpenShift releases won't have this parameter, so the apiserver will either ignore the field or reject it as unknown.

Sorry, but my suggestion is NOT about pre-upgrade. It is about what happens after the upgrade.
If the API has a field, let's say syncTokenRequests, with a default value of "false", the operator can look at it and NOT overwrite the already configured audience.
We will document that, along with populating tokenRequests, syncTokenRequests should be set to "true".

> Do we have any data on how many clusters have a manually patched CSIDriver? TBH, users should not modify resources managed by an Operator, as such changes may be reverted.

We know of at least one user: https://redhat-internal.slack.com/archives/C08F8UBM0F7/p1758270464495719
We have not yet surveyed how many users have configured the audience directly on the CSIDriver.
But as you know, the operator does not immediately reconcile this change.
So I think that when we can provide a smooth integration, it is better to do so and avoid surprising our users.

Member Author

Thanks for the suggestion. I've incorporated this into the proposal with the following design:

New struct: tokenRequests with policy and audiences

secretsStore:
  tokenRequests:
    policy: Managed        # or "Unmanaged" (default)
    audiences:
      - audience: "sts.amazonaws.com"
        expirationSeconds: 3600
      - audience: "api://AzureADTokenExchange"

How it works:

policy: "Unmanaged" (default): The operator reads the existing CSIDriver.spec.tokenRequests from the cluster and includes them in the desired spec. This means any manually patched audiences are preserved.

policy: "Managed": The operator uses the audiences list from ClusterCSIDriver as the sole source of truth. This gives the user full control over adding and removing audiences.
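
A minimal sketch of how the two policies could select the desired tokenRequests is shown below. The types and the helper `desiredTokenRequests` are hypothetical and defined locally; treating an empty policy as "Unmanaged" is an assumption based on the stated default.

```go
package main

import "fmt"

// TokenRequest is a local stand-in for a CSIDriver tokenRequests entry.
type TokenRequest struct {
	Audience          string
	ExpirationSeconds *int64
}

type TokenRequestsPolicy string

const (
	PolicyUnmanaged TokenRequestsPolicy = "Unmanaged"
	PolicyManaged   TokenRequestsPolicy = "Managed"
)

// desiredTokenRequests is a hypothetical helper: "configured" comes from
// ClusterCSIDriver, "existing" is what is currently on the CSIDriver object.
func desiredTokenRequests(policy TokenRequestsPolicy, configured, existing []TokenRequest) []TokenRequest {
	if policy == PolicyManaged {
		// Managed: ClusterCSIDriver is the sole source of truth, so an empty
		// list clears whatever was on the CSIDriver.
		return append([]TokenRequest(nil), configured...)
	}
	// Unmanaged (assumed to also cover an unset policy): carry over whatever
	// is already on the CSIDriver, including manual patches.
	return append([]TokenRequest(nil), existing...)
}

func main() {
	existing := []TokenRequest{{Audience: "api://AzureADTokenExchange"}}
	configured := []TokenRequest{{Audience: "sts.amazonaws.com"}}

	fmt.Println(desiredTokenRequests(PolicyUnmanaged, configured, existing)[0].Audience)
	fmt.Println(desiredTokenRequests(PolicyManaged, configured, existing)[0].Audience)
	fmt.Println(len(desiredTokenRequests(PolicyManaged, nil, existing)))
}
```

Unlike the earlier merge idea, the "Managed" branch lets a user remove an audience simply by deleting it from ClusterCSIDriver.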

Signed-off-by: chiragkyal <ckyal@redhat.com>
@chiragkyal chiragkyal force-pushed the configure-secret-rotation-and-wif branch from 4b00729 to 66f2f50 May 29, 2026 12:43
Signed-off-by: chiragkyal <ckyal@redhat.com>
Contributor

@mytreya-rh mytreya-rh left a comment

/lgtm

with a minor comment

// list, replacing any previously configured values.
// +default="Unmanaged"
// +optional
Policy TokenRequestsPolicy `json:"policy,omitempty"`
Contributor

I think we should make it immutable once set to "Managed". Would there be any issues with such a restriction?

Member Author

Thanks for the suggestion; it has been incorporated. Also added some details about API state during upgrade. Please have a look.
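
The immutability rule discussed in this thread can be sketched as a transition check. The function `validatePolicyTransition` is hypothetical, and how the restriction is actually enforced (for example as a validation rule on the API type) is left to the API review; this only illustrates the intended semantics.

```go
package main

import (
	"errors"
	"fmt"
)

type TokenRequestsPolicy string

const (
	PolicyUnmanaged TokenRequestsPolicy = "Unmanaged"
	PolicyManaged   TokenRequestsPolicy = "Managed"
)

var errPolicyImmutable = errors.New(`tokenRequests.policy cannot be changed once set to "Managed"`)

// validatePolicyTransition is a hypothetical check comparing the stored and
// the updated policy value on an update request.
func validatePolicyTransition(oldPolicy, newPolicy TokenRequestsPolicy) error {
	if oldPolicy == PolicyManaged && newPolicy != PolicyManaged {
		return errPolicyImmutable
	}
	return nil
}

func main() {
	fmt.Println(validatePolicyTransition(PolicyUnmanaged, PolicyManaged))
	fmt.Println(validatePolicyTransition(PolicyManaged, PolicyUnmanaged))
}
```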

@openshift-ci openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) Jun 1, 2026
Signed-off-by: chiragkyal <ckyal@redhat.com>
@openshift-ci openshift-ci Bot removed the lgtm label (Indicates that a PR is ready to be merged.) Jun 2, 2026
openshift-ci Bot commented Jun 2, 2026

New changes are detected. LGTM label has been removed.

openshift-ci Bot commented Jun 2, 2026

@chiragkyal: all tests passed!

Contributor

@jsafrane jsafrane left a comment

From the storage point of view, it looks solid. It follows existing ClusterCSIDriver API usage and also the common usage of the CSI interface in Kubernetes ephemeral volumes.

Comment on lines +176 to +177
- On upgrade, `tokenRequests.policy` defaults to `"Unmanaged"`, preserving any
existing tokenRequests on the CSIDriver.
Contributor

Is this a behavior change? The current operator will overwrite any CSIDriver changes made by the cluster admin during cluster update. This enhancement suggests it won't be overwritten. I think it's a good step forward, but it should be explicitly called out and documented.

Member Author

To clarify: there is no behavior change from the current operator's perspective.
Today the operator does not set tokenRequests at all (it's not in the static csidriver.yaml template), so any manually-patched tokenRequests on the CSIDriver object already survive reconciliation today.

What this enhancement does change is adding requiresRepublish and tokenRequests to the desired CSIDriver spec (which was previously unset). That changes the spec-hash and would trigger a delete+recreate, which would wipe manually patched tokenRequests if we do not explicitly preserve them.

So the "Unmanaged" default is not a change in the operator's behavior; it is about preventing damage to user-configured tokenRequests during the recreate.

See #2012 (comment) for a similar discussion.
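
To make the spec-hash argument concrete, the sketch below hashes a simplified desired spec before and after the new fields are added. The struct, the `specHash` function, and the choice of SHA-256 over JSON are assumptions made for illustration; this is not library-go's actual hashing code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

type tokenRequest struct {
	Audience string `json:"audience"`
}

// csiDriverSpec is a simplified, hypothetical subset of a CSIDriver spec.
type csiDriverSpec struct {
	PodInfoOnMount    *bool          `json:"podInfoOnMount,omitempty"`
	RequiresRepublish *bool          `json:"requiresRepublish,omitempty"`
	TokenRequests     []tokenRequest `json:"tokenRequests,omitempty"`
}

// specHash returns a hex digest of the desired spec. Unset fields are omitted
// from the JSON, so they do not influence the digest.
func specHash(spec csiDriverSpec) string {
	raw, err := json.Marshal(spec)
	if err != nil {
		panic(err) // not expected for this plain struct
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}

func main() {
	yes := true
	before := csiDriverSpec{PodInfoOnMount: &yes}
	after := csiDriverSpec{PodInfoOnMount: &yes, RequiresRepublish: &yes}

	// A manual patch to the live object never touches the desired spec, so the
	// "before" digest stays stable; adding a desired field changes it.
	fmt.Println("hash changed:", specHash(before) != specHash(after))
}
```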

// Only honored when policy is "Managed".
// +optional
// +listType=atomic
Audiences []SecretsStoreTokenRequest `json:"audiences,omitempty"`
Contributor

How many audiences does Kubernetes support? Should there be an upper limit?

And in general, most (all?) new fields need some validation. Like explicit enum values for all Policy fields, lower boundary for <anything>Seconds (negative numbers are probably bad) etc.
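
The kinds of checks the reviewer lists could look roughly like the sketch below. All names are hypothetical, the audience limit of 10 is an arbitrary placeholder, and the 600-second lower bound is an assumption about the minimum token lifetime Kubernetes accepts that should be confirmed during the API review.

```go
package main

import "fmt"

type TokenRequest struct {
	Audience          string
	ExpirationSeconds *int64
}

type TokenRequestsConfig struct {
	Policy    string
	Audiences []TokenRequest
}

const (
	maxAudiences         = 10  // placeholder upper limit, not from the proposal
	minExpirationSeconds = 600 // assumed lower bound; verify against Kubernetes
)

// validate is a hypothetical validator covering an explicit policy enum, a
// bounded audience list, unique audiences, and a lower bound on expiration.
func validate(cfg TokenRequestsConfig) []error {
	var errs []error
	switch cfg.Policy {
	case "", "Unmanaged", "Managed":
	default:
		errs = append(errs, fmt.Errorf("policy: unsupported value %q", cfg.Policy))
	}
	if len(cfg.Audiences) > maxAudiences {
		errs = append(errs, fmt.Errorf("audiences: at most %d entries allowed", maxAudiences))
	}
	seen := map[string]bool{}
	for i, tr := range cfg.Audiences {
		if seen[tr.Audience] {
			errs = append(errs, fmt.Errorf("audiences[%d]: duplicate audience %q", i, tr.Audience))
		}
		seen[tr.Audience] = true
		if tr.ExpirationSeconds != nil && *tr.ExpirationSeconds < minExpirationSeconds {
			errs = append(errs, fmt.Errorf("audiences[%d]: expirationSeconds must be >= %d", i, minExpirationSeconds))
		}
	}
	return errs
}

func main() {
	negative := int64(-1)
	bad := TokenRequestsConfig{
		Policy:    "Sometimes",
		Audiences: []TokenRequest{{Audience: "sts.amazonaws.com", ExpirationSeconds: &negative}},
	}
	for _, err := range validate(bad) {
		fmt.Println(err)
	}
}
```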

Contributor

BTW, the validation can wait for the API review.

Member Author

Yes, that's true. I have a dedicated API PR: openshift/api#2846

It's better to get these validations finalized there; then we can copy them over to the EP.
