Skip to content

NO-JIRA: simplify waiting in GetLastKeyMeta#2219

Open
tjungblu wants to merge 1 commit into
openshift:masterfrom
tjungblu:laskeymeta
Open

NO-JIRA: simplify waiting in GetLastKeyMeta#2219
tjungblu wants to merge 1 commit into
openshift:masterfrom
tjungblu:laskeymeta

Conversation

@tjungblu
Copy link
Copy Markdown
Contributor

@tjungblu tjungblu commented May 12, 2026

Simplifies the changes made in #1188 to just do a dumb 5m context timeout wait with 30s polling in between.

Summary by CodeRabbit

  • Tests
    • Improved test utilities for retrieving encryption key metadata to be more resilient to transient listing errors.
    • Polling now retries periodically over a longer window and logs interim failures instead of aborting early.
    • Enhances reliability of key discovery during test runs.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 12, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@tjungblu: This pull request explicitly references no jira issue.

Details

In response to this:

Simplifies the changes made in #1188 to just do a dumb 5m context timeout wait with 30s polling in between.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from dgrisonnet and p0lyn0mial May 12, 2026 15:06
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tjungblu
Once this PR has been reviewed and has the lgtm label, please assign dgrisonnet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e9abfd03-912e-481b-9003-b239dfddba30

📥 Commits

Reviewing files that changed from the base of the PR and between 0b2d3b4 and 16a8dfe.

📒 Files selected for processing (1)
  • test/library/encryption/helpers.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/library/encryption/helpers.go

Walkthrough

GetLastKeyMeta in the test helper switches secret-listing retries to wait.PollUntilContextTimeout(t.Context(), 30*time.Second, 5*time.Minute, ...). List errors are logged and the poll continues; when listing succeeds the function proceeds with the same selection, parsing, sorting, and annotation-reading logic to build EncryptionKeyMeta.

Changes

Polling mechanism and list handling

Layer / File(s) Summary
Polling implementation
test/library/encryption/helpers.go
Replaced previous backoff/onErrorWithTimeout secret-listing with wait.PollUntilContextTimeout(t.Context(), 30*time.Second, 5*time.Minute, ...) to repeatedly attempt listing labeled secrets.
Error handling inside poll
test/library/encryption/helpers.go
On list errors the poll callback logs the error and returns false to continue polling instead of returning the error to the poller.
Success path and downstream processing
test/library/encryption/helpers.go
When listing succeeds the code assigns selectedSecrets and continues unchanged: copy selected secrets, sort by key ID, unmarshal migrated-resources annotation, read mode annotation, and build/return EncryptionKeyMeta.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the main change: simplifying the waiting mechanism in the GetLastKeyMeta function, which aligns with the core purpose of refactoring the polling logic from a more complex pattern to a straightforward 5-minute timeout with 30-second polling intervals.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR modifies a helper utility library with no Ginkgo test definitions. The custom check applies only to Ginkgo test names (It, Describe, Context, When), which are absent here.
Test Structure And Quality ✅ Passed The modified file is a helper module with no Ginkgo test blocks. The custom check applies to test code, not helper functions, so it is not applicable to this PR.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are being added. Changes only modify a test helper function. MicroShift compatibility check not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR does not add any new Ginkgo e2e tests. It only modifies polling logic in a helper library function (test/library/encryption/helpers.go). The SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed Modifies only test helper file with polling logic. No deployment manifests, operator code, controllers, or scheduling constraints. Not applicable to this check.
Ote Binary Stdout Contract ✅ Passed Test helper library with no process-level stdout writes. No main/init/TestMain functions. All logging uses t.Logf() (test-intercepted). No stdout-corrupting calls.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR modifies helper library file, not e2e tests. No new Ginkgo test definitions were added. The IPv6/disconnected network check applies only to new Ginkgo e2e tests, which are absent here.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/library/encryption/helpers.go`:
- Around line 204-213: In the wait.PollUntilContextTimeout call's condition
function (the anonymous func passed to wait.PollUntilContextTimeout), don't
return (false, err) on transient List errors because that immediately stops
polling; instead log the error and return (false, nil) so polling continues
until the overall timeout, only allowing the PollUntilContextTimeout to surface
an error when the timeout is exceeded. Specifically update the error handling in
the function that calls secretsClient.List (inside wait.PollUntilContextTimeout)
to log the List error and return false, nil rather than false, err.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9c9fef02-14d9-4aed-8ad7-9bcefa092530

📥 Commits

Reviewing files that changed from the base of the PR and between 431f183 and 1470ba6.

📒 Files selected for processing (1)
  • test/library/encryption/helpers.go

Comment thread test/library/encryption/helpers.go Outdated
Comment thread test/library/encryption/helpers.go Outdated
if err != nil {
t.Logf("failed to list secrets, err = %v", err.Error())
t.Logf("GetLastKeyMeta: Failed to list secrets: %v", err)
return false, err
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return err only when isTransientAPIError true? and the rest we should return false, nil?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's actually the thing that drove me to change it in the first place, isTransientAPIError doesn't cover everything. For example, auth can be offline for those said 5 minutes, but 401s are not covered. I also don't want to change the function because this is used in too many places.

Copy link
Copy Markdown
Member

@ardaguclu ardaguclu May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. The error types not listed in isTransientAPIError does not make sense

Just one note: test/library/encryption/errors.go file won' be used anymore.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KMS changes are disruptive for the cluster. After this comment, I think we should retry for such errors. Because for instance, on SNO, there can be a period of time that kube-apiserver is not functioning?, we should retry. Otherwise, test will flake.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is used in several places, ForceKeyRotation for example

backOff := retry.DefaultBackoff
backOff.Steps = 9999

// in theory the max time we tolerate disruption on an SNO cluster is 60 seconds
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to preserve this comment ? (it explains why 5m)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is still there

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, ok.

selectedSecrets, err = secretsClient.List(context.TODO(), metav1.ListOptions{LabelSelector: labelSelector})
err := wait.PollUntilContextTimeout(t.Context(), 30*time.Second, time.Minute*5, true, func(ctx context.Context) (bool, error) {
labelSecrets, err := secretsClient.List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
if err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use transientAPIError - makes the test less brittle

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned earlier on slack and on the other thread with the ai rabbit: it doesn't cover all transient errors

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but right now we have a mix. Some tests use transientAPIError and some don’t.

Are there any known transient errors we’re currently missing? Would it make sense to add them to transientAPIError?

In general, do we currently have issues with transient errors?

I think it would be better not to hide any errors. At the moment, we treat all errors as transient. Using transientAPIError instead would raise the bar for the entire platform, and I think we’re already doing a good job since the tests are generally stable :)

WDYT?

Comment thread test/library/encryption/helpers.go Outdated
}
return

selectedSecrets = labelSecrets
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this ?

Comment thread test/library/encryption/helpers.go Outdated
}, func() (err error) {
selectedSecrets, err = secretsClient.List(context.TODO(), metav1.ListOptions{LabelSelector: labelSelector})
err := wait.PollUntilContextTimeout(t.Context(), 30*time.Second, time.Minute*5, true, func(ctx context.Context) (bool, error) {
labelSecrets, err := secretsClient.List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selectedSecrets ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it' s compilation error due to the scoping of the function.

Unused variable 'selectedSecrets'

I gave the return type a variable name and removed the assignment.

This replaces the existing logic with a simpler wait.PollUntilContextTimeout.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 13, 2026

@tjungblu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants