Skip to content

Add application credential finalizer management#369

Open
Deydra71 wants to merge 1 commit into
openstack-k8s-operators:mainfrom
Deydra71:appcred-finalizer
Open

Add application credential finalizer management#369
Deydra71 wants to merge 1 commit into
openstack-k8s-operators:mainfrom
Deydra71:appcred-finalizer

Conversation

@Deydra71
Copy link
Copy Markdown
Contributor

@Deydra71 Deydra71 commented Apr 27, 2026

Jira: OSPRH-29269

Application Credential dev-doc: https://github.com/openstack-k8s-operators/dev-docs/blob/main/application_credentials.md

  • Tracks the active AC secret name in Status.ApplicationCredentialSecret
  • Add openstack.org/watcher-ac-consumer finalizer to the AC secret after service config is rendered
  • On AC rotation, move the finalizer from the old secret to the new one
  • On CR deletion, remove the consumer finalizer from the AC secret before cleaning up the CR

This ensures that the keystone-operator cannot revoke a rotated AC secret while Watcher is still consuming it.

2026-04-28T12:04:45Z	INFO	Controllers.Watcher	Added consumer finalizer	{"controller": "watcher", "controllerGroup": "watcher.openstack.org", "controllerKind": "Watcher", "Watcher": {"name":"watcher","namespace":"openstack"}, "namespace": "openstack", "name": "watcher", "reconcileID": "ac040495-e159-4291-bdd4-bb370a316368", "object": "ac-watcher-16b84-secret", "finalizer": "openstack.org/watcher-ac-consumer"}
2026-04-28T12:04:45Z	INFO	Controllers.Watcher	Removed consumer finalizer	{"controller": "watcher", "controllerGroup": "watcher.openstack.org", "controllerKind": "Watcher", "Watcher": {"name":"watcher","namespace":"openstack"}, "namespace": "openstack", "name": "watcher", "reconcileID": "ac040495-e159-4291-bdd4-bb370a316368", "object": "ac-watcher-b7b42-secret", "finalizer": "openstack.org/watcher-ac-consumer"}

Depends-On: openstack-k8s-operators/keystone-operator#685

Assisted-by: Claude Opus 4.6 noreply@anthropic.com

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign amoralej for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul
Copy link
Copy Markdown

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1c533ab5c6e44e719416fe0284f555ae

✔️ openstack-meta-content-provider-master SUCCESS in 3h 57m 23s
✔️ watcher-operator-validation-master SUCCESS in 2h 14m 00s
✔️ openstack-meta-content-provider-epoxy SUCCESS in 3h 41m 26s
✔️ watcher-operator-validation-epoxy SUCCESS in 1h 57m 54s
✔️ watcher-operator-validation-epoxy-ocp4-16 SUCCESS in 1h 54m 38s
✔️ noop SUCCESS in 0s
watcher-operator-kuttl FAILURE in 1h 10m 00s

Copy link
Copy Markdown
Contributor

@amoralej amoralej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my note about the workflow, please.

May you also validate the change with finalizer in the existing appcred kuttl tests? https://github.com/openstack-k8s-operators/watcher-operator/tree/main/test/kuttl/test-suites/default/appcred-tests

return ctrl.Result{}, err
}

if instance.Spec.Auth.ApplicationCredentialSecret != "" || instance.Status.ApplicationCredentialSecret != "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the finalizer of the old secret is created after the configuration for DBjobs is recreated, but the watcher services have not been reconfigured yet to use the new config. In case something fails between this finalizer removal and the reconciling of the new config in watcherapi, watcherapplier and watcherdecisionengine, I guess it may break the running services. Unless, the logic to remove revoke the ACs without finalizer checks the Ready status of watcher (I'm not sure if that's the case). Will keystone do any check on the status of watcher?

An alternative workflow may be:

  1. Add the finalizer to the new AC here (but not removing the finalizer from the old one and not updating the instance.Status.ApplicationCredentialSecret.
  2. Remove the finalizer from the old finalizer and adding the instance.Status.ApplicationCredentialSecret right before mariadbv1.DeleteUnusedMariaDBAccountFinalizers (https://github.com/Deydra71/watcher-operator/blob/685c276a76228fbd999ff0b3dbbd821477c009bc/internal/controller/watcher_controller.go#L552)

Actually, this case is quite similar to the one with the mariadbaccounts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A valid concern and keystone-operator is protecting both the current and previous AC secrets via status.previousSecretName that's saved in AC CR, independent of consumer finalizers. The consumer finalizer is a second layer of defense, but the previousSecretName check in cleanupUnusedRotatedSecrets ensures the old secret is never revoked during the rotation.

So if IIUC, the alternative splitting pattern is not necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will later add a comment on that code part in keystone-operator, so it's clear what is not revoked.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint! So my understanding of the AC rotation workflow is:

  • The previous secret is stored in status.previousSecretName and the new one in status.SecretName of the AC CR.
  • The watcher controller removes the finalizer in the old secret and adds it to the new one.
  • The keystone operator revokes all ACs for secrets which are not in status.previousSecretName or status.SecretName and has no finalizer.

This cover my concern for one rotation which is probably enough in regular terms, specially for the automatic rotations proposed in the doc.

Doing the splitting patern may cover other potential cases when doing multiple rotations. For example, a user manually rotating the ACs. Think in the case of an openstack operator wants to rotate the AC for any reason:

  • A cloud admin do the manual rotation step (i.e. Patching status.expiresAt to a timestamp in the past).
  • The AC is rotated successfully and goes to Ready state.
  • For any reason Watcher fails to deploy the new deployment and keeps running with the old AC which is still protected by both finalizer and being in the previousSecretName.
  • The cloud admin thinks... "mmm maybe something was wrong with the rotation, let's rotate it again".
  • Watcher fails to deploy again (as the issue is unrelated to AC).
  • The cloud admin does the rotation process again, the AC gets rotated and the orginal AC is removed from previousSecretName and the finalizer removed.
  • The original AC is cleaned by the keystone operator.
  • Watcher stops working.

I may be missing something or this case being too elaborate but IMHO, the other pattern, adding finalizer to the new one first and removing it from the old one at the end, is more robust in terms of service availability. it may be argued that it's less secure as a problem removing the finalizer after the deployment is done would prevent from revoking and unused AC but I'd say this case is much less likely and usually, keep the service running is a top priority.

This may not be a blocker for me, but i think it's better to discuss this now than later.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this situation can definitely happen even if manual rotation is discouraged and it's not part of normal operation, but emergency/out of order situations.

But still - I agree that keeping the service alive is the top priority. The split is worth doing in all service operators, I will implement it in watcher and other single deployment services, but for nova and telemetry I need to put more thought in to it, so it could be done as a follow up for some operators, also based on other reviewers.

Thanks for this input!

})
})

When("ApplicationCredential consumer finalizer is managed", func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! functional tests looks good wrt the impelemented workflow,

@Deydra71 Deydra71 force-pushed the appcred-finalizer branch from 685c276 to d4c4d29 Compare May 13, 2026 10:40
@centosinfra-prod-github-app
Copy link
Copy Markdown

This change depends on a change that failed to merge.

Change openstack-k8s-operators/keystone-operator#685 is needed.

@centosinfra-prod-github-app
Copy link
Copy Markdown

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/watcher-operator for 369,2dc724033ff9beaf4dcb932dc39987462b22c368

@Deydra71
Copy link
Copy Markdown
Contributor Author

I had to reorder the env test to include all subservices to be ready before checking AC finalizer, but I think there's a race condition in env test timing now happening about once in 5 test runs :(

I think it's because after the rotation changes Spec.Auth.ApplicationCredentialSecret, the Watcher controller calls ensureAPI which updates the WatcherAPI sub CR's spec (incrementing the generation) and the WatcherAPI controller needs to complete a full reconcile cycle to propagate WatcherAPIReadyCondition = True while the test simulations race with the sub CR controller's spec updates.

I might be overcomplicating it, but not sure what could help to make it pass 100% of time, @amoralej any suggestion?

@centosinfra-prod-github-app
Copy link
Copy Markdown

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/rdoproject.org/buildset/f75725d6839e491e9f43a931fe13c76c

✔️ openstack-meta-content-provider-master SUCCESS in 3h 07m 55s
✔️ watcher-operator-validation-master SUCCESS in 2h 13m 06s
✔️ openstack-meta-content-provider-epoxy SUCCESS in 2h 39m 26s
✔️ watcher-operator-validation-epoxy SUCCESS in 2h 00m 50s
✔️ watcher-operator-validation-epoxy-ocp4-18 SUCCESS in 2h 02m 04s
✔️ noop SUCCESS in 0s
watcher-operator-kuttl FAILURE in 1h 19m 05s

@amoralej
Copy link
Copy Markdown
Contributor

I had to reorder the env test to include all subservices to be ready before checking AC finalizer, but I think there's a race condition in env test timing now happening about once in 5 test runs :(

I think it's because after the rotation changes Spec.Auth.ApplicationCredentialSecret, the Watcher controller calls ensureAPI which updates the WatcherAPI sub CR's spec (incrementing the generation) and the WatcherAPI controller needs to complete a full reconcile cycle to propagate WatcherAPIReadyCondition = True while the test simulations race with the sub CR controller's spec updates.

I might be overcomplicating it, but not sure what could help to make it pass 100% of time, @amoralej any suggestion?

I think the best option to avoid race conditions is to wait for a full Reconcile loop to finish successfully before updating the Spec. See my inline note.

})
g.Expect(secret.Finalizers).To(
ContainElement(watcher.ACConsumerFinalizer))
}, timeout, interval).Should(Succeed())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid race conditions, you should wait to get the Watcher Ready before updating it:

Suggested change
}, timeout, interval).Should(Succeed())
}, timeout, interval).Should(Succeed())
// Simulate all watcher services deploying successfully
th.SimulateStatefulSetReplicaReady(watcherTest.WatcherAPIStatefulSet)
keystone.SimulateKeystoneEndpointReady(watcherTest.WatcherKeystoneEndpointName)
th.SimulateStatefulSetReplicaReady(watcherTest.WatcherApplierStatefulSet)
th.SimulateStatefulSetReplicaReady(watcherTest.WatcherDecisionEngineStatefulSet)
Eventually(func(g Gomega) {
w := GetWatcher(watcherTest.Instance)
g.Expect(w.Status.ApplicationCredentialSecret).To(Equal(acSecretName))
}, timeout, interval).Should(Succeed())
th.ExpectCondition(
watcherTest.Instance,
ConditionGetterFunc(WatcherConditionGetter),
condition.ReadyCondition,
corev1.ConditionTrue,
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, that did it, now it passed locally in multiple runs, thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kuttl tests is failing here with:

2026-05-14 10:19:30.070672 | controller | - ' logger.go:42: 10:09:23 | appcred-tests/2-deploy-appcred | Error from server
2026-05-14 10:19:30.070676 | controller | (NotFound): secrets "ac-watcher-test-secret" not found'

Any idea?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't realize this before. Right now we are hardcoding the secret name, but with the changes in keystone-operator we are using immutable secrets in the format: ac-<serviceName>-<first5charsOfACID>-secret, so we need to update the assertions to take into consideration dynamic secret names.

@Deydra71 Deydra71 force-pushed the appcred-finalizer branch from 1822be7 to 3fdfdd3 Compare May 19, 2026 10:30
@centosinfra-prod-github-app
Copy link
Copy Markdown

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/rdoproject.org/buildset/c3f5a886dea04d358747ba7999e2eb5d

✔️ openstack-meta-content-provider-master SUCCESS in 3h 07m 55s
watcher-operator-validation-master FAILURE in 1h 44m 19s
✔️ openstack-meta-content-provider-epoxy SUCCESS in 3h 01m 26s
✔️ watcher-operator-validation-epoxy SUCCESS in 1h 56m 53s
✔️ watcher-operator-validation-epoxy-ocp4-18 SUCCESS in 1h 57m 52s
✔️ noop SUCCESS in 0s
watcher-operator-kuttl FAILURE in 1h 21m 15s

@Deydra71 Deydra71 force-pushed the appcred-finalizer branch from 3fdfdd3 to 89c799b Compare May 19, 2026 16:22
@centosinfra-prod-github-app
Copy link
Copy Markdown

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/rdoproject.org/buildset/790f87245e644a8181a9b4329f5843ff

✔️ openstack-meta-content-provider-master SUCCESS in 2h 54m 35s
watcher-operator-validation-master FAILURE in 1h 45m 36s
✔️ openstack-meta-content-provider-epoxy SUCCESS in 2h 56m 11s
✔️ watcher-operator-validation-epoxy SUCCESS in 1h 55m 15s
✔️ watcher-operator-validation-epoxy-ocp4-18 SUCCESS in 1h 56m 57s
✔️ noop SUCCESS in 0s
watcher-operator-kuttl FAILURE in 56m 12s

@centosinfra-prod-github-app
Copy link
Copy Markdown

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://gateway-cloud-softwarefactory.apps.ocp.cloud.ci.centos.org/zuul/t/rdoproject.org/buildset/fc28940316cf4b16804a3e99ba977c16

✔️ openstack-meta-content-provider-master SUCCESS in 1h 03m 26s
watcher-operator-validation-master FAILURE in 24m 19s
openstack-meta-content-provider-epoxy FAILURE in 4m 11s
⚠️ watcher-operator-validation-epoxy SKIPPED Skipped due to failed job openstack-meta-content-provider-epoxy
⚠️ watcher-operator-validation-epoxy-ocp4-18 SKIPPED Skipped due to failed job openstack-meta-content-provider-epoxy
✔️ noop SUCCESS in 0s
watcher-operator-kuttl RETRY_LIMIT in 17m 16s

Signed-off-by: Veronika Fisarova <vfisarov@redhat.com>
@Deydra71 Deydra71 force-pushed the appcred-finalizer branch from ef6e9ae to 65aba86 Compare May 20, 2026 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants