Skip to content

🐛 fix: allow reconciliation of deadline-exceeded ClusterObjectSets#2643

Merged
openshift-merge-bot[bot] merged 7 commits into
operator-framework:mainfrom
joelanford:fix/cos-deadline-exceeded-archival
Apr 29, 2026
Merged

🐛 fix: allow reconciliation of deadline-exceeded ClusterObjectSets#2643
openshift-merge-bot[bot] merged 7 commits into
operator-framework:mainfrom
joelanford:fix/cos-deadline-exceeded-archival

Conversation

@joelanford
Copy link
Copy Markdown
Member

@joelanford joelanford commented Apr 11, 2026

Description

The skipProgressDeadlineExceededPredicate blocked all update events for COS objects
with ProgressDeadlineExceeded, which prevented archival of stuck revisions — the
lifecycle state patch was silently dropped.

This PR:

  • Removes the predicate so all COS events are fully reconciled
  • Updates markAsProgressing to set ProgressDeadlineExceeded instead of
    RollingOut/Retrying when the deadline has been exceeded, preventing the reconcile
    loop the predicate was masking
  • Continues reconciling after ProgressDeadlineExceeded rather than clearing the
    error and stopping requeue. This allows revisions to recover if a transient error
    resolves itself, even after the deadline was exceeded
  • Extracts durationUntilDeadline as a shared helper for deadline computation
  • Adds a deadlineAwareRateLimiter that caps exponential backoff at the deadline so
    ProgressDeadlineExceeded is set promptly even during error retries
  • Moves deadline requeue logic into reconcile, using the remaining duration from
    durationUntilDeadline directly for RequeueAfter
  • Adds unit tests for durationUntilDeadline and recovery from
    ProgressDeadlineExceeded to Succeeded
  • Adds e2e tests for ProgressDeadlineExceeded archival/cleanup and for recovery
    from ProgressDeadlineExceeded to Succeeded

Addresses feedback from:
#2610 (comment)

Reviewer Checklist

  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Copilot AI review requested due to automatic review settings April 11, 2026 17:08
@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 11, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit feedd23
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/69f0ffa53d13860008beb0c6
😎 Deploy Preview https://deploy-preview-2643--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a reconciliation dead-end where ClusterObjectSet (COS) updates were being dropped once a revision hit ProgressDeadlineExceeded, preventing stuck revisions from being archived and cleaned up. It removes the update-blocking predicate, makes progress-deadline handling “sticky” in status updates, and adds a deadline-aware rate limiter plus an E2E scenario to validate archival cleanup.

Changes:

  • Remove the ProgressDeadlineExceeded-skipping watch predicate and introduce a controller RateLimiter that caps exponential backoff at the progress deadline.
  • Refactor progress-deadline computation into a shared durationUntilDeadline helper and adjust progressing/retrying status updates to prefer ProgressDeadlineExceeded once exceeded.
  • Add an E2E scenario (and step helper) that archives a deadline-exceeded COS and verifies resource cleanup.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
internal/operator-controller/controllers/clusterobjectset_controller.go Removes the skip predicate, adds deadline-aware rate limiting, refactors deadline computation, and changes progressing/retrying condition behavior when the deadline is exceeded.
test/e2e/steps/steps.go Adds a new Godog step to patch a COS lifecycle state to Archived.
test/e2e/features/revision.feature Adds an E2E scenario that forces ProgressDeadlineExceeded, archives the COS, and asserts resources are removed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/operator-controller/controllers/clusterobjectset_controller.go Outdated
Comment thread internal/operator-controller/controllers/clusterobjectset_controller.go Outdated
When ClusterObjectSet "${COS_NAME}" lifecycle is set to "Archived"
Then ClusterObjectSet "${COS_NAME}" is archived
And resource "configmap/test-configmap" is eventually not found
And resource "deployment/test-deployment" is eventually not found
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not seems to be testing the same scenario @joelanford
Could we ensure the same scenario here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. User installs a ClusterExtension. The CE controller creates COS-rev-1.
  2. COS-rev-1 gets stuck (e.g. a Deployment never becomes ready). After ProgressDeadlineMinutes, the reconciler sets Progressing=False/ProgressDeadlineExceeded.
  3. User updates the ClusterExtension. The CE controller creates COS-rev-2.
  4. COS-rev-2 rolls out successfully. It patches COS-rev-1 with lifecycleState: Archived so the old revision gets cleaned up.
  5. The watch predicate sees COS-rev-1 has ProgressDeadlineExceeded and drops the event.
  6. COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not the same? Making a CE that stamps out the COS-1 and COS-2 such that the CE reconciler eventually tries to set COS-1 as archived is the same thing, but just more ceremony around a standalone COS that exceeds the deadline and then is directly archived by the test code.

I'll verify manually to make sure. Is there something else happening with the CE/COS interaction that would make the COS-only test in the PR different?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not upgrading, checking it stuck and then, checking that is possible to change the resource and the changes will be reconciled. WDY about we add a test with?

@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from a586063 to 46dcf54 Compare April 11, 2026 17:39
Copilot AI review requested due to automatic review settings April 12, 2026 11:59
@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from 46dcf54 to aea3d72 Compare April 12, 2026 11:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts ClusterObjectSet (COS) reconciliation behavior so revisions that hit ProgressDeadlineExceeded still reconcile (enabling archival/cleanup), while making the deadline-exceeded state “sticky” to avoid the previous reconcile loop. It also adds a deadline-capped rate limiter and an e2e scenario to verify archival cleans up resources after a progress deadline is exceeded.

Changes:

  • Remove the watch predicate that dropped COS update events when ProgressDeadlineExceeded was set, allowing spec patches like archival to be reconciled.
  • Refactor progress-deadline handling (shared durationUntilDeadline, sticky ProgressDeadlineExceeded, requeue-at-deadline behavior) and add a deadline-aware rate limiter to cap backoff until the deadline.
  • Add an e2e scenario and a new step to patch COS lifecycle to Archived, plus improve e2e cleanup synchronization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/e2e/steps/steps.go Adds a Godog step to patch a ClusterObjectSet’s spec.lifecycleState.
test/e2e/steps/hooks.go Makes scenario cleanup wait for concurrent deletions to finish.
test/e2e/features/revision.feature Adds an e2e scenario covering archival + resource cleanup after ProgressDeadlineExceeded.
internal/operator-controller/controllers/clusterobjectset_controller.go Removes the skip predicate; makes deadline handling sticky; adds deadline-aware requeue + custom rate limiter.
internal/operator-controller/controllers/clusterobjectset_controller_test.go Updates the progress-deadline requeue expectation to match the new scheduling behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/e2e/steps/hooks.go Outdated
Comment on lines +370 to +371
cos := &ocv1.ClusterObjectSet{}
if err := r.client.Get(context.Background(), item.NamespacedName, cos); err != nil {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our use of the rate limiter uses the informer cache which is in-memory and does not block.

@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from aea3d72 to e1b56c2 Compare April 12, 2026 12:13
Copilot AI review requested due to automatic review settings April 12, 2026 12:20
@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from e1b56c2 to 00c3c75 Compare April 12, 2026 12:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a reconciliation dead-end for ClusterObjectSet (COS) revisions that hit ProgressDeadlineExceeded, ensuring they can still reconcile (e.g., to process archival) and improving deadline-related requeue behavior so deadline status is set promptly.

Changes:

  • Removes the update predicate that dropped COS updates when ProgressDeadlineExceeded was set.
  • Refactors progress-deadline handling into shared helpers (durationUntilDeadline, requeueForDeadline) and adds a deadline-aware rate limiter to cap exponential backoff at the deadline.
  • Extends e2e coverage with a new scenario that archives a deadline-exceeded COS and verifies resource cleanup; improves scenario cleanup to wait for deletions to complete.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/e2e/steps/steps.go Adds an e2e step to patch a COS lifecycle state (used for archival in scenarios).
test/e2e/steps/hooks.go Ensures scenario cleanup waits for concurrent deletion commands to finish.
test/e2e/features/revision.feature Adds an e2e scenario that drives a COS into ProgressDeadlineExceeded, archives it, and asserts resources are removed.
internal/operator-controller/controllers/clusterobjectset_controller.go Removes the deadline-exceeded predicate; adds deadline-aware requeue/rate-limiting and updates Progressing condition handling.
internal/operator-controller/controllers/clusterobjectset_controller_test.go Updates expected requeue timing behavior for progress deadline handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +127
remaining, hasDeadline := c.durationUntilDeadline(cos)
isDeadlineExceeded := hasDeadline && remaining <= 0

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. requeueForDeadline should not perform the calculation again with a different "now" timestamp. Perhaps instead we could simply pass the result of the original calculation into requeueForDeadline.

Comment on lines +350 to +354
// deadlineAwareRateLimiter wraps a delegate rate limiter and caps the backoff
// duration to the time remaining until the COS progress deadline (+2s), ensuring
// that ProgressDeadlineExceeded is set promptly even during exponential backoff.
type deadlineAwareRateLimiter struct {
delegate workqueue.TypedRateLimiter[ctrl.Request]
@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch 2 times, most recently from 9768528 to ef08a54 Compare April 12, 2026 14:43
Copilot AI review requested due to automatic review settings April 12, 2026 14:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a reconciliation dead-end where ClusterObjectSet (COS) revisions marked ProgressDeadlineExceeded would no longer reconcile, preventing actions like archival/cleanup from being processed.

Changes:

  • Removes the update predicate that filtered out COS updates after ProgressDeadlineExceeded, and adjusts progressing/deadline logic to avoid the prior status flip-flop loop.
  • Introduces shared deadline calculation (durationUntilDeadline), deadline-driven requeueing (requeueForDeadline), and a deadline-capped rate limiter to ensure the deadline condition is set promptly even under backoff.
  • Extends E2E coverage with a scenario that forces ProgressDeadlineExceeded, archives the COS, and asserts underlying resources are removed; adds a step to patch COS lifecycle state.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
internal/operator-controller/controllers/clusterobjectset_controller.go Refactors progress-deadline handling, removes the predicate, adds deadline-aware requeue and rate limiting.
internal/operator-controller/controllers/clusterobjectset_controller_test.go Updates expected requeue timing to match new deadline requeue logic.
test/e2e/steps/steps.go Adds a new step to patch ClusterObjectSet.spec.lifecycleState.
test/e2e/steps/hooks.go Adjusts scenario cleanup deletion behavior.
test/e2e/features/revision.feature Adds an E2E scenario covering archival/cleanup of a deadline-exceeded COS.
Comments suppressed due to low confidence (1)

test/e2e/steps/hooks.go:205

  • ScenarioCleanup adds --wait=false to kubectl delete, but the PR description says cleanup now waits for resource deletions to complete. As written, deletions are still fire-and-forget goroutines with explicit non-waiting, so scenarios can finish while resources remain terminating (and any follow-on checks can observe leftovers). If the intent is to wait, run deletes synchronously (or collect goroutines with a WaitGroup) and/or add an explicit kubectl wait --for=delete phase.
	for _, r := range forDeletion {
		go func(res resource) {
			args := []string{"delete", res.kind, res.name, "--ignore-not-found=true", "--wait=false"}
			if res.namespace != "" {
				args = append(args, "-n", res.namespace)
			}
			if _, err := k8sClient(args...); err != nil {
				logger.Info("Error deleting resource", "name", res.name, "namespace", res.namespace, "stderr", stderrOutput(err))

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +679 to +687
pd := cos.Spec.ProgressDeadlineMinutes
if pd <= 0 {
return -1, false
}
// Succeeded is a latch — once set, it's never cleared. A revision that
// has already succeeded should not be blocked by the deadline, even if
// it temporarily goes back to InTransition (e.g., recovery after drift).
if meta.IsStatusConditionTrue(cos.Status.Conditions, ocv1.ClusterObjectSetTypeSucceeded) {
return -1, false
Comment on lines +279 to +284
// requeueForDeadline returns a Result that requeues at the progress deadline
// if one is configured and has not yet been exceeded. This ensures that
// ProgressDeadlineExceeded is set promptly even when no object events occur.
func (c *ClusterObjectSetReconciler) requeueForDeadline(cos *ocv1.ClusterObjectSet) ctrl.Result {
if remaining, hasDeadline := c.durationUntilDeadline(cos); hasDeadline && remaining > 0 {
return ctrl.Result{RequeueAfter: remaining}
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 92.85714% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.06%. Comparing base (cadb7a7) to head (feedd23).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...troller/controllers/clusterobjectset_controller.go 91.66% 2 Missing and 1 partial ⚠️
...erator-controller/controllers/progress_deadline.go 94.11% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2643      +/-   ##
==========================================
+ Coverage   67.98%   68.06%   +0.08%     
==========================================
  Files         144      145       +1     
  Lines       10595    10623      +28     
==========================================
+ Hits         7203     7231      +28     
- Misses       2870     2871       +1     
+ Partials      522      521       -1     
Flag Coverage Δ
e2e 37.34% <0.00%> (-0.10%) ⬇️
experimental-e2e 52.99% <78.57%> (+0.07%) ⬆️
unit 53.64% <82.85%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch 2 times, most recently from 393ff45 to ad0859e Compare April 12, 2026 16:39
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2026
@camilamacedo86
Copy link
Copy Markdown
Contributor

@joelanford can you please rebase this one and address the Copilot comments so that we can review and see if we can go with.

@joelanford
Copy link
Copy Markdown
Member Author

Ran out of time to get back to this today. Hopefully tomorrow!

This PR has some flaws around recomputation of the remaining time until deadline which can cause funky things to happen if we're reconciling just before the deadline and then cross over the deadline threshold prior to the rate limiter logic running, which can cause requeues not to happen right at the deadline.

Might need a bit more logic in the recomputation code before it's ready. Hopefully that extra code isn't too brittle. Maybe it will be fine as a stopgap if the plan is to eventually move the progress deadline stuff to be a concern only of the ClusterExtension (leaving ClusterObjectSet with just a single "Ready" or similar condition)

@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2026
@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from 4cedf95 to 14acf56 Compare April 27, 2026 16:04
Copilot AI review requested due to automatic review settings April 27, 2026 16:04
@camilamacedo86 camilamacedo86 self-requested a review April 27, 2026 16:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a reconciliation dead-end for ClusterObjectSet revisions that have hit ProgressDeadlineExceeded, ensuring they can still be reconciled (notably to process archival/spec updates) and that the deadline condition is set promptly even during error backoff.

Changes:

  • Removes the update-event predicate that previously blocked reconciliation for ProgressDeadlineExceeded revisions.
  • Makes progress-deadline handling “sticky” via markAsProgressing(..., isDeadlineExceeded) and introduces a shared durationUntilDeadline helper.
  • Adds a deadline-aware controller rate limiter and extends E2E coverage with a scenario validating archival cleans up resources after ProgressDeadlineExceeded.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
internal/operator-controller/controllers/clusterobjectset_controller.go Removes the blocking predicate, wires a deadline-aware rate limiter, and updates progressing/retrying logic to respect deadline-exceeded state while continuing reconciliation.
internal/operator-controller/controllers/clusterobjectset_controller_test.go Adjusts progress-deadline requeue expectation to match new deadline calculation behavior.
internal/operator-controller/controllers/progress_deadline.go Adds deadlineAwareRateLimiter and the shared durationUntilDeadline helper.
internal/operator-controller/controllers/progress_deadline_test.go Adds unit tests covering the new deadline-aware rate limiter behavior.
test/e2e/steps/steps.go Adds a new E2E step to patch a ClusterObjectSet lifecycle state.
test/e2e/features/revision.feature Adds an E2E scenario validating that archiving a ProgressDeadlineExceeded COS cleans up its resources.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@joelanford
Copy link
Copy Markdown
Member Author

joelanford commented Apr 27, 2026

experimental e2e is timing out. I guess I need to bump e2e timeout again?

Never mind: just needed to pick up the latest changes from main

@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from 14acf56 to 7e46425 Compare April 27, 2026 19:54
Comment thread internal/operator-controller/controllers/clusterobjectset_controller.go Outdated
Copy link
Copy Markdown
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review comments — focused on test coverage gaps, behavioral subtleties, and edge cases.

Comment thread internal/operator-controller/controllers/progress_deadline.go
Comment thread internal/operator-controller/controllers/clusterobjectset_controller.go Outdated
Copilot AI review requested due to automatic review settings April 28, 2026 15:37
@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch 2 times, most recently from 6229df1 to 81daf9c Compare April 28, 2026 16:26
Copy link
Copy Markdown
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up comments on the updated PR — one test coverage suggestion and two minor nits.

Comment thread internal/operator-controller/controllers/clusterobjectset_controller.go Outdated
Comment thread internal/operator-controller/controllers/clusterobjectset_controller.go Outdated
Previously, a watch predicate dropped update events for ClusterObjectSets
with ProgressDeadlineExceeded, preventing them from being archived or
cleaned up.

Remove the predicate and replace the inline deadline logic in Reconcile
with an enforceProgressDeadline() helper and a progressDeadline()
function that computes the absolute deadline from spec and metadata.

Add a deadline-aware rate limiter that caps exponential backoff at the
deadline and allows one immediate requeue when it passes, without
inspecting status conditions.

Add an e2e scenario that forces ProgressDeadlineExceeded, archives the
COS, and verifies resource cleanup.
@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from 81daf9c to feedd23 Compare April 28, 2026 18:42
Copy link
Copy Markdown
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 28, 2026
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: camilamacedo86, pedjak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [camilamacedo86,pedjak]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 74e39f5 into operator-framework:main Apr 29, 2026
29 checks passed
@joelanford joelanford deleted the fix/cos-deadline-exceeded-archival branch May 11, 2026 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants