-
Notifications
You must be signed in to change notification settings - Fork 216
OCPSTRAT-2485: Introduce feature gate based inclusion/exclusion of manifests #1273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@JoelSpeed: This pull request references OCPSTRAT-2485 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the feature to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThread manifest-level enabled feature gates from startup through the Operator and sync worker into payload loading and rendering; update APIs and data structures to carry enabled gates; adjust manifest filtering to honor gates; add tests and testdata for feature-gate scenarios. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes ✨ Finishing touches
Comment |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: JoelSpeed The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test e2e-aws-ovn-techpreview As a smoke test for generally not breaking existing stuff |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/payload/render.go (1)
67-95: Empty enabledFeatureGates at render time causes negatively-gated manifests to always be included, creating divergence from runtime behaviorIn render.go lines 155–172,
Includeis called with an emptysets.Set[string]{}for enabledFeatureGates. The integration tests confirm thatrelease.openshift.io/feature-gatessupports two semantics:
gate: included only if the gate is enabled-gate: included if the gate is NOT enabledWith an empty enabledFeatureGates set passed to
Include, this creates a mismatch:
- Positive gates always exclude manifests (gate never enabled)
- Negative gates always include manifests (gate never seen as enabled)
At runtime, the real CVO uses actual enabledFeatureGates. For manifests annotated with
-gate, the CVO may exclude them if that gate becomes enabled, but bootstrap has already applied them. CVO cannot retroactively remove applied resources.To align render-time and steady-state behavior, either:
- Skip all manifests bearing
release.openshift.io/feature-gatesannotations during render, deferring them to the real CVO, or- Compute
enabledFeatureGatesduring render to match what the runtime CVO will usepkg/cvo/sync_worker.go (1)
799-803: MissingEnabledFeatureGatespropagation incalculateNextFrom.When copying desired state to the work object,
EnabledFeatureGatesis not propagated. This means the sync loop's work object would retain stale feature gates even after a change is detected.if desired != nil { w.Desired = desired.Desired w.Overrides = desired.Overrides w.Capabilities = desired.Capabilities + w.EnabledFeatureGates = desired.EnabledFeatureGates }
🧹 Nitpick comments (7)
pkg/cvo/cvo_scenarios_test.go (1)
117-381: Consider centralizing or relaxingEnabledFeatureGatesexpectations in scenario testsAll these scenarios now hard-code
EnabledFeatureGates: sets.New[string]()inSyncWorkerStatusexpectations. As the real CVO starts populating non-empty enabled-gate sets, these broad scenario tests will become fragile and need frequent updates.You may want to:
- Either ignore
EnabledFeatureGatesin generic scenario comparisons (e.g., via a helper that zeroes it beforeDeepEqual), and- Add focused tests that explicitly assert on the gate set where that behavior matters.
This keeps scenario tests stable while still validating gate plumbing where it’s semantically important.
Also applies to: 438-545, 676-791, 851-909, 1053-1199, 1261-1372, 1451-1475, 1713-1737, 2114-2135, 2370-2393, 2451-2481, 2635-2707, 2875-2898, 3193-3390, 3503-3615, 3813-3874
pkg/start/start.go (1)
245-289: Startup hard-fails if FeatureGate informer cache doesn’t sync within 30s
processInitialFeatureGatereturns an error onWaitForCacheSynctimeout, causingOptions.Runand the CVO to exit if the FeatureGate informer can’t sync within 30 seconds (e.g., API server slow/unavailable on boot). Previously, CVO tended to start and let controllers block on unsynced caches.Consider instead:
- Logging the timeout,
- Falling back to a safe default feature set /
DefaultCvoGates, and- Letting the
ChangeStoppercontroller react once FeatureGates are readable,so transient API issues don’t prevent the operator from starting at all.
pkg/cvo/cvo_featuregates_test.go (1)
172-185: Consider adding a test with a mock lister.The comment acknowledges this limitation. The current test only validates the nil-lister fallback path. Consider adding a test with a populated lister to verify the happy path initialization.
Would you like me to help generate a test case with a mock lister to cover the initialization path when the lister is available?
pkg/cvo/cvo.go (2)
1147-1154: Consider usingClone()for defensive copy.The
sets.Settype has a built-inClone()method that is more idiomatic:func (optr *Operator) getEnabledFeatureGates() sets.Set[string] { - // Return a copy to prevent external modification - result := sets.Set[string]{} - for gate := range optr.enabledManifestFeatureGates { - result.Insert(gate) - } - return result + // Return a copy to prevent external modification + return optr.enabledManifestFeatureGates.Clone() }
1101-1114: Delete handler is missing.The
featureGateEventHandlerhandles Add and Update but not Delete. While the "cluster" FeatureGate object is typically not deleted, consider adding a DeleteFunc for completeness to clear the gates if the object is removed:func (optr *Operator) featureGateEventHandler() cache.ResourceEventHandler { workQueueKey := optr.queueKey() return cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) { optr.updateEnabledFeatureGates(obj) optr.queue.Add(workQueueKey) }, UpdateFunc: func(old, new interface{}) { optr.updateEnabledFeatureGates(new) optr.queue.Add(workQueueKey) }, + DeleteFunc: func(obj interface{}) { + optr.enabledManifestFeatureGates = sets.Set[string]{} + optr.queue.Add(workQueueKey) + }, } }pkg/cvo/sync_worker.go (2)
140-147: Minor comment inaccuracy.The comment says "since this is a list" but
EnabledFeatureGatesis asets.Set[string], not a list. Consider updating the comment:- // Provide a proper deep copy for feature gates since this is a list. + // Provide a proper deep copy for feature gates since sets.Set requires explicit cloning.
473-484: Consider cloningenabledFeatureGatesduringSyncWorkconstruction.The
enabledFeatureGatesparameter is assigned directly towork.EnabledFeatureGateswithout cloning. Sincesets.Setis a reference type, external mutations to the caller's set could unexpectedly affect the sync work. For defensive programming, consider cloning:work := &SyncWork{ Generation: generation, Desired: desired, Overrides: config.Spec.Overrides, - EnabledFeatureGates: enabledFeatureGates, + EnabledFeatureGates: enabledFeatureGates.Clone(), }Alternatively, if the caller guarantees immutability, document this expectation.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
⛔ Files ignored due to path filters (4)
go.sumis excluded by!**/*.sumvendor/github.com/openshift/library-go/pkg/crypto/crypto.gois excluded by!vendor/**,!**/vendor/**vendor/github.com/openshift/library-go/pkg/manifest/manifest.gois excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (14)
go.mod(1 hunks)hack/cluster-version-util/task_graph.go(2 hunks)lib/manifest/manifest.go(3 hunks)pkg/cvo/cvo.go(10 hunks)pkg/cvo/cvo_featuregates_test.go(1 hunks)pkg/cvo/cvo_scenarios_test.go(64 hunks)pkg/cvo/featuregate_integration_test.go(1 hunks)pkg/cvo/sync_test.go(2 hunks)pkg/cvo/sync_worker.go(18 hunks)pkg/payload/payload.go(3 hunks)pkg/payload/payload_test.go(3 hunks)pkg/payload/render.go(1 hunks)pkg/payload/task_graph_test.go(2 hunks)pkg/start/start.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
go.modpkg/cvo/cvo_featuregates_test.gopkg/cvo/sync_test.gopkg/cvo/sync_worker.gopkg/payload/payload_test.gopkg/payload/task_graph_test.gohack/cluster-version-util/task_graph.gopkg/cvo/cvo.golib/manifest/manifest.gopkg/start/start.gopkg/cvo/cvo_scenarios_test.gopkg/payload/payload.gopkg/cvo/featuregate_integration_test.gopkg/payload/render.go
🧬 Code graph analysis (7)
pkg/cvo/sync_test.go (1)
pkg/cvo/sync_worker.go (1)
SyncWorkerStatus(109-137)
pkg/cvo/sync_worker.go (2)
pkg/payload/payload.go (2)
Update(104-117)LoadUpdate(138-240)lib/capability/capability.go (2)
SortedList(28-33)GetCapabilitiesStatus(81-86)
pkg/payload/payload_test.go (2)
pkg/payload/payload.go (3)
LoadUpdate(138-240)DefaultClusterProfile(100-100)GetImplicitlyEnabledCapabilities(245-255)lib/manifest/manifest.go (1)
GetImplicitlyEnabledCapabilities(46-96)
pkg/payload/task_graph_test.go (1)
pkg/payload/payload.go (2)
LoadUpdate(138-240)DefaultClusterProfile(100-100)
hack/cluster-version-util/task_graph.go (1)
pkg/payload/payload.go (2)
LoadUpdate(138-240)DefaultClusterProfile(100-100)
pkg/cvo/cvo_scenarios_test.go (2)
pkg/cvo/cvo.go (1)
New(208-310)lib/resourcebuilder/interface.go (1)
New(83-89)
pkg/payload/payload.go (2)
lib/capability/capability.go (1)
ClusterCapabilities(12-16)lib/manifest/manifest.go (2)
GetImplicitlyEnabledCapabilities(46-96)InclusionConfiguration(18-36)
🔇 Additional comments (31)
pkg/payload/task_graph_test.go (1)
16-20: LoadUpdate signature update in tests looks correctUsing
sets.Set[string]{}as theenabledFeatureGatesargument cleanly adapts the test to the newLoadUpdatesignature without changing behavior.Also applies to: 486-495
pkg/start/start.go (1)
553-573: FeatureGate informer and ChangeStopper wiring into CVO looks consistentPassing
configInformerFactory.Config().V1().FeatureGates()intocvo.New, storingStopOnFeatureGateChangein theContext, and starting it alongside the main operator under leader election aligns cleanly with the new feature-gate model. The lifetime is correctly tied torunContext/runCancel, and informer factories are shared appropriately.Also applies to: 577-624, 336-389
hack/cluster-version-util/task_graph.go (1)
3-12: Task graph CLI adaptation to LoadUpdate signature is fineThe additional
sets.Set[string]{}argument topayload.LoadUpdatecorrectly reflects the newenabledFeatureGatesparameter, and importingsetshere is appropriate for that use.Also applies to: 32-35
pkg/cvo/sync_test.go (1)
18-25: fakeSyncRecorder updated to match new Update signatureAdding the
enabledFeatureGates sets.Set[string]parameter tofakeSyncRecorder.Updatekeeps the test double aligned with the production recorder interface while preserving existing behavior (only tracksdesired). No further changes needed here.Also applies to: 474-500
pkg/payload/payload_test.go (1)
35-143: Tests correctly adapted to enabledFeatureGates parametersPassing
sets.Set[string]{}intoLoadUpdateandGetImplicitlyEnabledCapabilitiesmatches the updated APIs and keeps these tests exercising the same behavior (no feature gates in play). This is a straightforward and appropriate adaptation.Also applies to: 145-218, 352-374
lib/manifest/manifest.go (1)
46-52: LGTM - Clean parameter addition.The
enabledFeatureGatesparameter is correctly added to the function signature and consistently propagated to bothIncludeAllowUnknownCapabilitiescalls.pkg/cvo/cvo_featuregates_test.go (3)
10-88: LGTM - Good test coverage for extractEnabledGates.Tests cover the key scenarios: matching version extraction, non-matching version handling, and empty enabled gates. The table-driven approach is clean.
90-109: Good verification of copy semantics.The test correctly validates that
getEnabledFeatureGates()returns a defensive copy, preventing external mutations from affecting internal state.
111-170: LGTM - Covers type safety and nil handling.Tests validate behavior for valid FeatureGate objects, invalid types, and nil objects appropriately.
pkg/cvo/cvo.go (1)
284-302: LGTM - Feature gate informer wiring.The event handler registration, lister assignment, cache sync setup, and initialization call are properly integrated following the existing patterns for other informers.
pkg/cvo/featuregate_integration_test.go (3)
120-160: LGTM - Good integration test for SyncWork pipeline.Validates that EnabledFeatureGates flows correctly through SyncWork and affects manifest filtering decisions.
203-277: LGTM - Useful real-world examples.The descriptive test cases serve as documentation for common usage patterns (TechPreview gating, alternative implementations, legacy fallbacks).
12-118: LGTM - Comprehensive feature gate filtering tests with well-documented scenarios.Tests cover positive gates, negative (exclusion) gates, AND logic combinations, and no-annotation defaults. The expected error messages document the library-go contract clearly, though these assertions depend on an external dependency (JoelSpeed/library-go) that cannot be verified from this repository. The production code gracefully handles these errors by logging them rather than asserting on exact message format, mitigating brittleness risk.
pkg/payload/payload.go (2)
138-139: LGTM - API signature update.The
enabledFeatureGatesparameter is correctly added toLoadUpdateand threaded through to the manifest filtering logic.
242-255: LGTM - Consistent parameter propagation.
GetImplicitlyEnabledCapabilitiescorrectly accepts and forwardsenabledFeatureGatesto the underlyinglocalmanifest.GetImplicitlyEnabledCapabilitiescall.pkg/cvo/sync_worker.go (16)
34-43: Interface signature change looks correct.The addition of
enabledFeatureGates sets.Set[string]to theUpdatemethod signature is consistent with the feature gate propagation design. The parameter type is appropriate for a set of feature gate names.
81-85: Field addition is well-documented.The
EnabledFeatureGatesfield is appropriately typed and documented. The placement withinSyncWorkis logical alongside other configuration fields likeCapabilities.
133-137: LGTM!The
EnabledFeatureGatesfield inSyncWorkerStatusappropriately mirrors the field inSyncWorkfor status visibility.
355-356: LGTM!The
EnabledFeatureGatesis correctly propagated topayload.LoadUpdatefor manifest filtering during payload loading.
429-432: LGTM!The
EnabledFeatureGatesis correctly propagated topayload.GetImplicitlyEnabledCapabilitiesto ensure feature gate state is considered when computing implicitly enabled capabilities.
507-513: LGTM!The equality check correctly considers feature gates alongside version, overrides, and capabilities. The logic properly treats work as unchanged only when all four aspects are equal.
537-538: LGTM!Feature gates are correctly initialized in status using
Clone()to ensure independent copies.
555-561: LGTM!The error path correctly preserves override, capability, and feature gate changes even when payload loading fails. The use of
Clone()for status ensures independent copies.
578-580: LGTM!Status is correctly updated with feature gates after successful payload load.
785-786: LGTM!The work calculation correctly considers feature gate changes alongside other change types.
841-876: LGTM!The
equalSyncWorkfunction is correctly extended to handle feature gate comparison. The use ofsets.Set.Equalis appropriate, and the logging of changes aids debugging.
1047-1050: LGTM!Feature gates are correctly propagated to
Manifest.Includeduring precreation to filter ClusterOperator manifests.
1067-1070: LGTM!Feature gates are correctly propagated to
Manifest.Includeduring main manifest processing.
1144-1148: LGTM!The switch to
DeepCopy()is necessary to properly handle theEnabledFeatureGatesset in status reporting.
1158-1165: LGTM!Consistent use of
DeepCopy()in the error reporting path.
1182-1189: LGTM!Consistent use of
DeepCopy()in the completion reporting path.
53b2f28 to
ed06ae4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This file is not to be merged, this is just for testing purposes
| func runTaskGraphCmd(cmd *cobra.Command, args []string) error { | ||
| manifestDir := args[0] | ||
| release, err := payload.LoadUpdate(manifestDir, "", "", "", payload.DefaultClusterProfile, nil) | ||
| release, err := payload.LoadUpdate(manifestDir, "", "", "", payload.DefaultClusterProfile, nil, sets.Set[string]{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't handle feature sets today, does it need to handle feature gates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think so. It would be nice, but as you have stated, the file is not being kept up-to-date with new functionalities, so I believe it's okay not to implement it for now.
| name: "include manifest with matching feature gate", | ||
| enabledGates: sets.New[string]("TechPreviewFeatureGate"), | ||
| manifestAnnotations: map[string]string{ | ||
| "release.openshift.io/feature-gates": "TechPreviewFeatureGate", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I look at this, the more I'm thinking we should remove the s, s/feature-gates/feature-gate to match what we have for feature-sets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My take, for what it's worth. Anything that felt more natural when creating tests or when testing manifests 🙇 The developers will be the users, and both options can be understood. I also like the idea of removing the s. However, that's just my personal opinion after reviewing the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've adjusted the library-go PR to account for the dropped S, and will update this one (and clean up the commit history what a mess Joel) once that has merged
|
/cc |
pkg/payload/render.go
Outdated
| if strings.Contains(file.Name(), "CustomNoUpgrade") || strings.Contains(file.Name(), "TechPreviewNoUpgrade") || strings.Contains(file.Name(), "DevPreviewNoUpgrade") { | ||
| // CustomNoUpgrade, TechPreviewNoUpgrade and DevPreviewNoUpgrade may add features to manifests like the ClusterVersion CRD, | ||
| // but we do not need those features during bootstrap-render time. In those clusters, the production | ||
| // CVO will be along shortly to update the manifests and deliver the gated features. | ||
| // fixme: now that we have requiredFeatureSet, use it to do Manifest.Include() filtering here instead of making filename assumptions | ||
| continue | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should now all be handled in the manifest.Include, since I've adapted so that every manifest rendered runs through manifest.Include now
| } | ||
| filteredManifests := make([]string, 0, len(manifests)) | ||
| for _, manifest := range manifests { | ||
| if len(filterGroupKind) > 0 && !filterGroupKind.Has(manifest.GVK.GroupKind()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change from before here is that previously we only did any filtering when there was a filter on groups. Now we run manifest.Include on every manifest, and the group/kind exclusion only when there is a specified group/kind filter
| var featureGate configv1.FeatureGate | ||
| if err := runtime.DefaultUnstructuredConverter.FromUnstructured(featureGateManifest.Obj.Object, &featureGate); err != nil { | ||
| return nil, nil, fmt.Errorf("failed to convert FeatureGate manifest %s to structured object: %w", featureGateManifest.OriginalFilename, err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is different to before, switching to a structured resource to allow fetching fields within to be easier. We already had a dep on configv1 so this didn't seem to make much sense not to use a structured object
|
Have been testing this on a cluster and locally applying the rendering and I'm confident this is now doing as it's supposed to. I think this is ready for review @DavidHurta I have yet to look at how this affects HyperShift, that's next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/cvo/cvo.go (1)
1130-1154: Critical race condition in equality check.The equality check at line 1141 reads
enabledManifestFeatureGateswithout holding the lock, but the lock is acquired only at line 1142. This creates a TOCTOU (time-of-check-time-of-use) race where concurrent calls toupdateEnabledFeatureGatesorgetEnabledFeatureGatescan interleave, causing stale reads and incorrect change detection.Acquire the lock before reading
enabledManifestFeatureGates:🔎 Proposed fix
func (optr *Operator) updateEnabledFeatureGates(obj interface{}) bool { featureGate, ok := obj.(*configv1.FeatureGate) if !ok { klog.Warningf("Expected FeatureGate object but got %T", obj) return false } newGates := optr.extractEnabledGates(featureGate) + optr.featureGatesMutex.Lock() + defer optr.featureGatesMutex.Unlock() + // Check if gates actually changed to avoid unnecessary work if !optr.enabledManifestFeatureGates.Equal(newGates) { - optr.featureGatesMutex.Lock() - defer optr.featureGatesMutex.Unlock() - klog.V(2).Infof("Cluster feature gates changed from %v to %v", sets.List(optr.enabledManifestFeatureGates), sets.List(newGates)) optr.enabledManifestFeatureGates = newGates return true } return false }
🧹 Nitpick comments (5)
install/0000_90_cluster-version-operator_03_configmaps.yaml (1)
1-32: Consider adding a test case for the missing feature-gate combination.The test fixture covers three of the four possible combinations of two feature gates:
- Both enabled (dev-preview-created)
- First enabled, second disabled (tech-preview-created)
- Both disabled (default-created)
For completeness, consider adding a fourth ConfigMap testing the remaining combination:
-Example,Example2(first disabled, second enabled).pkg/cvo/featuregate_integration_test.go (2)
13-118: Consider loosening assertions on exact error strings from manifest filteringSeveral table rows compare
err.Error()to a hard‑coded full string. That tightly couples these tests to the exact wording from the filtering layer, so even minor phrasing tweaks will break them without changing behavior. Using substring/strings.Containschecks or error types would keep the tests focused on behavior (include vs. exclude and which gate failed) while being more resilient to implementation detail changes.Also applies to: 207-280
120-160: SyncWorkIntegration test doesn’t exercise the real SyncWorker plumbing
TestSyncWorkIntegrationcurrently constructs aSyncWorkby hand and then calls manifest filtering directly withwork.EnabledFeatureGates, without driving the actual worker to emit aSyncWorkerStatus. That still validates the annotation semantics, but it doesn’t prove that realSyncWorkinstances produced by the worker carry the correct gate set. If you want coverage of the wiring, consider a follow‑up that runs the worker and inspectsSyncWorkerStatus.EnabledFeatureGatesinstead.pkg/payload/render.go (1)
188-239: parseFeatureGateManifest’s strict shape assumptions may be brittle over time
parseFeatureGateManifestcurrently (a) errors if the file contains anything other than a single manifest and (b) errors ifstatus.featureGateshas more than one entry. That matches today’s expectations for the CVO’s feature‑gate manifest, but it also meansRenderwill fail hard if the manifest ever grows multiple version blocks or extra docs. If you foresee richer manifests, consider selecting the entry that matches the CVO’s desired version (or documenting that only a single entry is supported) instead of treating “>1” as fatal.pkg/cvo/cvo.go (1)
258-260: Optional refactor: queue initialization modernization.The queue initialization has been refactored to use explicit configuration. While this change is unrelated to the PR's main feature-gate objective, it appears to be a code modernization.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
⛔ Files ignored due to path filters (3)
go.sumis excluded by!**/*.sumvendor/github.com/openshift/library-go/pkg/manifest/manifest.gois excluded by!vendor/**,!**/vendor/**vendor/modules.txtis excluded by!vendor/**,!**/vendor/**
📒 Files selected for processing (17)
go.mod(1 hunks)hack/cluster-version-util/task_graph.go(2 hunks)install/0000_90_cluster-version-operator_03_configmaps.yaml(1 hunks)lib/manifest/manifest.go(3 hunks)pkg/cvo/cvo.go(10 hunks)pkg/cvo/cvo_featuregates_test.go(1 hunks)pkg/cvo/cvo_scenarios_test.go(64 hunks)pkg/cvo/featuregate_integration_test.go(1 hunks)pkg/cvo/status_test.go(2 hunks)pkg/cvo/sync_test.go(2 hunks)pkg/cvo/sync_worker.go(20 hunks)pkg/featuregates/featuregates.go(2 hunks)pkg/payload/payload.go(3 hunks)pkg/payload/payload_test.go(3 hunks)pkg/payload/render.go(6 hunks)pkg/payload/task_graph_test.go(2 hunks)pkg/start/start.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- pkg/cvo/cvo_featuregates_test.go
- lib/manifest/manifest.go
- go.mod
- pkg/payload/payload.go
- pkg/start/start.go
- pkg/payload/task_graph_test.go
- pkg/cvo/sync_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
pkg/featuregates/featuregates.gopkg/cvo/cvo_scenarios_test.gopkg/cvo/status_test.gopkg/payload/payload_test.goinstall/0000_90_cluster-version-operator_03_configmaps.yamlpkg/cvo/sync_worker.gopkg/cvo/featuregate_integration_test.gohack/cluster-version-util/task_graph.gopkg/cvo/cvo.gopkg/payload/render.go
🧬 Code graph analysis (3)
pkg/cvo/sync_worker.go (1)
pkg/payload/payload.go (2)
Update(104-117)LoadUpdate(138-240)
pkg/cvo/featuregate_integration_test.go (2)
pkg/cvo/cvo.go (1)
Operator(83-207)pkg/cvo/sync_worker.go (1)
SyncWork(68-85)
hack/cluster-version-util/task_graph.go (1)
pkg/payload/payload.go (2)
LoadUpdate(138-240)DefaultClusterProfile(100-100)
🔇 Additional comments (13)
hack/cluster-version-util/task_graph.go (2)
9-9: LGTM: Import correctly added for feature gate support.The sets package import is necessary for the updated
LoadUpdatesignature.
34-34: Empty feature gate set is acceptable for this utility, but consider configurability if needed.The empty set
sets.Set[string]{}aligns with this tool's pattern of using default values (empty strings for other parameters). Since this is a development utility for visualizing task graphs, the empty set is reasonable.However, the past review comment raises a valid point: if feature gate filtering affects task graph structure or manifest inclusion during debugging, you may want to add a CLI flag to specify enabled feature gates, similar to how
--parallelis configurable.For now, the empty set is acceptable unless feature gate visualization is a requirement.
pkg/cvo/status_test.go (1)
203-224: fakeRiFlags correctly updated to satisfy the new gate-checker APIAdding
desiredVersionandDesiredVersion()keepsfakeRiFlagscompatible with the extended gate-checker interface while preserving existing behavior for tests that don’t care about version. Looks good.pkg/cvo/cvo_scenarios_test.go (1)
21-38: Explicit EnabledFeatureGates in setup and status snapshots look correctInitializing
Operator.enabledFeatureGateswithDefaultCvoGates("version")and populatingSyncWorkerStatus.EnabledFeatureGateswithsets.New[string]()in the expectations keeps the status surface in sync with the new field while avoiding nil/non‑nil map mismatches in equality checks. No functional or maintainability issues from these changes.Also applies to: 117-149, 249-381, 436-547, 676-708, 854-908, 1089-1171, 1415-1593, 1713-1775, 1817-1855, 2114-2134, 2451-2479, 2634-2707, 3192-3295, 3311-3416, 3503-3530, 3554-3578, 3635-3667, 3815-3874
pkg/payload/payload_test.go (1)
129-143: Tests correctly adapted to the new enabledFeatureGates parameterSupplying an empty
sets.Set[string]{}toLoadUpdateandGetImplicitlyEnabledCapabilitieskeeps these tests’ behavior unchanged while matching the expanded function signatures. This is the right way to thread the new parameter in where no feature gates are under test.Also applies to: 205-218, 352-372
pkg/payload/render.go (1)
38-83: New renderDir filtering path is coherent and preserves behaviorThreading
requiredFeatureSetandenabledFeatureGatesintorenderDirand doing per‑manifest filtering (with group/kind checks first, then gate/profile checks) keeps the existing traversal and error‑aggregation semantics while making the gating behavior explicit and well‑logged. Re‑emitting only accepted manifest docs viafilteredManifestsis straightforward and doesn’t introduce obvious correctness or performance issues.Also applies to: 85-166
pkg/featuregates/featuregates.go (1)
18-42: CvoGateChecker/CvoGates wiring for DesiredVersion is internally consistentThe interface extension and
CvoGates.DesiredVersion()implementation are coherent;DefaultCvoGatescorrectly initializesdesiredVersion, and all internal usages remain consistent.CvoGateCheckeris exported, which technically constitutes a breaking change for external implementers requiring the new method. Repository-wide search found no out-of-tree users depending on the old interface shape.pkg/cvo/cvo.go (2)
194-196: LGTM!The mutex and field declaration are correct. The RWMutex allows concurrent reads while protecting writes.
1169-1192: No thread-safety issue exists.The
DesiredVersion()method only reads a string field from theCvoGatesstruct using a value receiver. Reading primitive types like strings is atomic in Go. Additionally, theenabledFeatureGatesfield is assigned once during operator initialization and never modified thereafter, making concurrent reads ofDesiredVersion()completely safe.Likely an incorrect or invalid review comment.
pkg/cvo/sync_worker.go (4)
82-84: LGTM!The
EnabledFeatureGatesfield addition toSyncWorkis well-documented and correctly typed.
141-146: LGTM!The
DeepCopyimplementation correctly usesClone()to deep-copy theEnabledFeatureGatesset, preventing external modification.
652-667: LGTM! Correct handling of feature gate changes.When feature gates change, the payload must be reloaded because manifest filtering criteria have changed. Clearing
w.payloadand callingloadUpdatedPayloadensures that manifests are re-evaluated with the new set of enabled feature gates. The approach correctly handles both inclusion of newly matching manifests and exclusion of manifests that no longer match.
863-895: LGTM!The
equalSyncWorkfunction correctly extends equality checking to include feature gates comparison usingEqual(), and properly returns all four equality flags. The logging of detected changes aids in debugging.
|
I am reviewing the PR. Should be done today 🙇 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, Joel 👋 🙇The PR looks solid, I appreciate you taking the time to implement it. A few comments. I am mainly a bit concerned regarding the potential non-deterministic application of manifests, which was not apparent to me initially (see that wall of text comment). Apologies for taking a little bit longer. The PR is quite large, and some things were not clear to me from the commit history. Feel free to disagree with me and/or verify my statements (the CVO can sometimes be more complex than it maybe should be).
pkg/cvo/sync_worker.go
Outdated
| klog.V(2).Infof("Feature gates changed, loading updated payload") | ||
|
|
||
| // Clear the payload to force a reload. | ||
| w.payload = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nil dereference of w.payload in a following w.apply call seems possible if the following loadUpdatedPayload returns an error and does not set the w.payload 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Start waits before Update finishes at least once (see line 610). The w.payload seems to be only populated by Update before the proposed changes. Thus, Start may be taking some assumptions into consideration, such as the w.payload being always set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that the nil assignment is to force payload reload in syncPayload, which is used by Update. I need to have a closer look at the code base for a suggestion. Maybe some changes in the syncPayload would be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without clearing the payload, the loadUpdatedPayload wasn't actually doing anything, which was problematic as it meant that the previous round of filtering stood, even though the feature gates had changed and the files needed to be refiltered
I'll have a look and see if I can determine where this might break... Or we restore the original payload on error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of the Start seems to be mainly to apply the current payload via the apply method here. By missing nil checks, it looks like that the Start assumes for a payload to be always present, and the responsibility of the Start is to apply the currently loaded payload.
We might consider the restoration of a payload on an error. We might also consider modifying the syncPayload logic. Currently, there are two main branches in the syncPayload method:
- We have already loaded the payload into the memory AND its image is the same as the desired image
- We are loading the payload into the memory for the first time OR its image is not the same as the desired image (an update was requested)
I am thinking of modifying the latter branch to also reload the payload if the enabled gates change. We would be ensuring the desired outcome of w.payload = nil; however, closer to the source. By restarting the CVO to pick up the newly enabled gates, we would also execute this branch (by de facto setting the payload to nil).
The Update seems to be checking whether an update to the current work is in order. If yes, it notifies the Start to apply the new work.
The currently proposed changes in Update are ensuring that loadUpdatedPayload is executed on changes to the enabled gates. However, loadUpdatedPayload calls syncPayload, which does not have to update the payload due to caching optimization. Thus, we could make the optimization aware of enabled gates being changed to reload the payload?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although let me refresh my memory on the whole “apply payload" area.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering how the capabilities work in the CVO, as they are quite similar in what we are trying to achieve (something that can be changed dynamically, and it's something that filters out manifests).
If I understand it correctly. They function quite differently from what I have proposed as an idea.
When the CVO loads the payload into the memory, it acts as if all known capabilities are enabled:
cluster-version-operator/pkg/payload/payload.go
Lines 158 to 164 in 8163e7d
| if knownCapabilities != nil { | |
| // We only pass known capabilities at payload load time to avoid doing any capability | |
| // enable filtering which only occurs during apply. | |
| onlyKnownCaps = &configv1.ClusterVersionCapabilitiesStatus{ | |
| EnabledCapabilities: knownCapabilities, | |
| KnownCapabilities: knownCapabilities} | |
| } |
| if err := manifest.Include(&excludeIdentifier, &requiredFeatureSet, &profile, onlyKnownCaps, nil, enabledFeatureGates); err != nil { |
This was done in #754 to "disallow disabling, add enabling capabilities":
This PR adds support for the following per the. Post-install, an admin can change a capability to enable it but cannot disable a previously enabled capability. This PR changes CVO to override any admin attempt to disable a previously enabled capability and reports any such action through a newly added condition showing these "tainted-in" capabilities. These "tainted-in" capabilities will remain enabled and will be reported as such in the status.
Supporting this new logic requires that capability filtering be moved from payload load time to payload apply time in a similar fashion to how overrides are handled.
Only when the CVO is actually trying to apply the manifest in the cluster, it checks whether it should depending on the actually enabled capabilities in:
cluster-version-operator/pkg/cvo/sync_worker.go
Line 1087 in f46468a
| if err := task.Manifest.Include(nil, nil, nil, &capabilities, work.Overrides, work.EnabledFeatureGates); err != nil { |
Thus, an equivalent of it for feature gates could be:
- When loading a payload into the memory, set the
enabledFeatureGateargument of theIncludefunction from library-go to nil to avoid excluding based on the filter (?) - Before applying a manifest, check whether the CVO should apply the manifest depending on the actually enabled gates
Pros:
- It is consistent with an existing pattern.
- It should work.
Cons:
- Not very efficient.
The feature set is quite different. It was initially implemented using the ChangeStopper in c323882 with a commit:
Add controller to shutdown the binary when the techpreview feature gate
changesThis value changes infrequently and can never be disabled. This simple
mechanism restarts the CVO so that when the CVO starts again it will
start creating using techpreview manifests
So the idea was - "let's just restart the CVO, this happens rarely". Simple and straightforward.
Given that we wouldn’t like to restart the CVO on every change in enabled feature gates, which may be more frequent, the proposed solution is awesome. Then it may depend on whether we want to make the CVO more memory-efficient and only load valid manifests into memory and execute the "loading payload into memory" operation more frequently, or the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm suggesting lets move the feature gate checking closer by moving the clearing into syncPayload, does that make sense? I'm not quite following which way you wanted me to go in this thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite following which way you wanted me to go in this thread
Yeah, I was intentionally trying to leave the possibilities open, but I could have been more clear.
Instead of clearing the payload, I am proposing to change syncPayload to something like:
// cache the payload until the release image changes
validPayload := w.payload
previousEnabledFeatureGates := w.status.EnabledFeatureGates
if validPayload != nil && validPayload.Release.Image == desired.Image && work.EnabledFeatureGates.Equal(previousEnabledFeatureGates) {
// no change
}
if validPayload == nil {
klog.V(2).Info("Loading payload for the first time")
} else if !equalUpdate(configv1.Update{Image: validPayload.Release.Image}, configv1.Update{Image: desired.Image}) {
klog.V(2).Info("Loading payload due to desired image not being equal to the current one.")
} else if !work.EnabledFeatureGates.Equal(previousEnabledFeatureGates) {
// When the feature gates change, we must reload the payload.
// Loading the payload filters out files that didn't match the previous set of feature gates,
// this means now, additional files may match the new set of feature gates and need to be included.
// Some files in the current payload may no longer match the new set of feature gates and need to be excluded,
// though these ones are already excluded when apply calls Include on the manifests.
klog.V(2).Infof("Enabled feature gates changed from %v to %v, forcing a payload refresh", previousEnabledFeatureGates, work.EnabledFeatureGates)
} else {
return implicitlyEnabledCaps, nil
}
// The existing logic continuesand remove the following code from Start():
- if featureGatesChanged {
- // When the feature gates change, we must reload the payload.
- // Loading the payload filters out files that didn't match the previous set of feature gates,
- // this means now, additional files may match the new set of feature gates and need to be included.
- // Some files in the current payload may no longer match the new set of feature gates and need to be excluded,
- // though these ones are already excluded when apply calls Include on the manifests.
- klog.V(2).Infof("Feature gates changed, loading updated payload")
-
- _, err := w.loadUpdatedPayload(ctx, w.work)
- if err != nil {
- klog.Warningf("Error when attempting to load updated payload: %v.", err)
- }
- }I do have some concerns about clearing the payload directly, that is, w.payload = nil. The existing code assumes the w.payload to be always set after a specific point in its execution. This means that nil dereferences are possible without further modifications. The reason for the removal of the w.loadUpdatedPayload call from Start() is that the goroutine running Update() is the sole expected entity to update the w.payload at the moment. That is, Start() only applies a payload. Update loads an updated payload.
- When enabled gates change, an item is added to a queue, which results in the
Updateto run. Updatecan load manifests with new options.UpdateinstructsStartvia a channel that new work is available.Startsapplies the new loaded payload.
This follows the existing patterns without larger deviations. We would just reload the local payload more often, saving the memory, and we would not interfere too much with existing logic. The end outcome is the same. I am only concerned about making larger deviations from existing flows without extensive testing and a clear value being gained. WDYT?
This is based on my current understanding of the code from static analysis.
pkg/cvo/cvo.go
Outdated
|
|
||
| // featureGatesMutex protects access to enabledManifestFeatureGates | ||
| featureGatesMutex sync.RWMutex | ||
| enabledManifestFeatureGates sets.Set[string] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of the enabledFeatureGates and enabledManifestFeatureGates fields is now not so distinct at first glance. Although they do make sense in the sense of the struct.
Could we add a description for the enabledManifestFeatureGates field to make the distinction clearer for developers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I can add that. Perhaps we want to rename the enabledFeatureGates field to enabledCVOFeatureGates as well to further distinguish, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be even better!
pkg/cvo/cvo.go
Outdated
| // updateEnabledFeatureGates updates the cluster feature gates based on a FeatureGate object | ||
| func (optr *Operator) updateEnabledFeatureGates(obj interface{}) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please add a description for the meaning of the return value?
pkg/cvo/cvo.go
Outdated
| // If no matching version found, log a warning but continue with empty set | ||
| if enabledGates.Len() == 0 { | ||
| klog.V(2).Infof("No feature gates found for current version %s, using empty set", currentVersion) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states log a warning, but the utilized klog uses Infof and not Warningf. Noting in case you wanted to propose an actual warning in the logs. I am personally fine with klog.V(2).Infof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably leave this as info personally
pkg/cvo/cvo.go
Outdated
|
|
||
| // initializeFeatureGates initializes the cluster feature gates from the current FeatureGate object | ||
| func (optr *Operator) initializeFeatureGates() { | ||
| // Try to load initial state from the cluster FeatureGate object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could take an extra consideration for apierrors.IsNotFound case to differentiate between "oh, everything is working, but the object is not there - we are defaulting - this is intended mechanic" and "something is not working - should we fail or default?".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we already have this covered in pkg/start/start.go which fetches the initial gates and feature set. If the object doesn't exist, we assume nothing, and I'm adding a comment about why that is ok for feature gates. If there's a genuine error it returns and the startup will fail and CVO will exit by the looks of it.
So by the time we get here, we either didn't see the FeatureGate because it didn't exist, or there's a genuine error, in which case, given this is backed by an informer, would have come up in start.go and we would have exited there.
I think this is ok as is given the earlier protection.
pkg/cvo/sync_worker.go
Outdated
|
|
||
| if featureGatesChanged { | ||
| // When the feature gates change, we must reload the payload. | ||
| // Loading the payload fiters out files that didn't match the previous set of feature gates, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Loading the payload fiters out files that didn't match the previous set of feature gates, | |
| // Loading the payload filters out files that didn't match the previous set of feature gates, |
pkg/cvo/cvo.go
Outdated
| if optr.featureGateLister != nil { | ||
| if featureGate, err := optr.featureGateLister.Get("cluster"); err == nil { | ||
| optr.updateEnabledFeatureGates(featureGate) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are silently ignoring the error. This will result in the default enabled gates being set. This fact may not logged anywhere. We are logging the fact when no enabled gates are found when extracting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've ended up reworking the feature gate initialization so that it resembles more closely the other feature gate initialization for cvo gates and the featureset
|
/testwith openshift/cluster-version-operator/main/e2e-agnostic-ovn #1300 |
|
multi-pr testing (#1273 (comment)) does not work because the code conflicts. Please ignore the cmd there. The reason I came here is to test if the issue OTA-1833 to CVO bought by the lastest o/api can be fixed by this pull. $ git --no-pager show --name-only 1ccf976dd7cc4946752e0f7355b39228a057c062
commit 1ccf976dd7cc4946752e0f7355b39228a057c062 (hongkailiu/bump-o-api, bump-o-api)
Author: Hongkai Liu <hongkailiu@users.noreply.github.com>
Date: Tue Jan 20 10:12:06 2026 +0800
testing: remove OKD manifests
Generated by
```console
$ git --no-pager show --name-only | grep OKD | xargs rm
```
vendor/github.com/openshift/api/config/v1/zz_generated.crd-manifests/0000_00_cluster-version-operator_01_clusterversions-OKD.crd.yaml
vendor/github.com/openshift/api/config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_apiservers-OKD.crd.yaml
vendor/github.com/openshift/api/config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_authentications-OKD.crd.yaml
vendor/github.com/openshift/api/config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_infrastructures-OKD.crd.yaml
vendor/github.com/openshift/api/config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_nodes-OKD.crd.yaml
vendor/github.com/openshift/api/config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_schedulers-SelfManagedHA-OKD.crd.yamlThose manifests are blamed because CI is recovered by removing them. Edit on Jan 23: We figured out another way. |
f723c0d to
e8ca53a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@pkg/cvo/cvo.go`:
- Around line 1134-1150: The read-to-write lock upgrade creates a TOCTOU race
around optr.enabledManifestFeatureGates; change the pattern in the code that
compares newGates to optr.enabledManifestFeatureGates to either (a) take
optr.featureGatesMutex.Lock() up front, perform the Equal check and assignment
while holding the write lock, then Unlock, or (b) keep the read lock, and if
Equal returns false then release the read lock, acquire the write lock and
re-check optr.enabledManifestFeatureGates.Equal(newGates) before assigning and
logging; update the code paths that reference featureGatesMutex,
enabledManifestFeatureGates, Equal, and newGates accordingly so the
compare-and-set is atomic.
In `@pkg/cvo/sync_worker.go`:
- Around line 662-676: When feature gates change and loadUpdatedPayload (called
from the featureGatesChanged branch) fails, w.payload can be nil and apply()
(function apply, which uses payloadUpdate / w.payload and checks
len(payloadUpdate.Manifests)) will panic; add a nil-check and early-return to
avoid dereferencing a nil payload: either in the caller skip calling apply()
when loadUpdatedPayload returned an error (propagate that error or set
changed=false), or inside apply() check that payloadUpdate (or w.payload) is
non-nil before accessing payloadUpdate.Manifests and return a handled error if
nil; update syncPayload/loadUpdatedPayload callers accordingly to ensure apply()
never runs with a nil w.payload.
pkg/cvo/cvo.go
Outdated
| optr.featureGatesMutex.RLock() | ||
|
|
||
| // Check if gates actually changed to avoid unnecessary work | ||
| if !optr.enabledManifestFeatureGates.Equal(newGates) { | ||
| optr.featureGatesMutex.RUnlock() | ||
| optr.featureGatesMutex.Lock() | ||
| defer optr.featureGatesMutex.Unlock() | ||
|
|
||
| klog.V(2).Infof("Cluster feature gates changed from %v to %v", | ||
| sets.List(optr.enabledManifestFeatureGates), sets.List(newGates)) | ||
|
|
||
| optr.enabledManifestFeatureGates = newGates | ||
| return true | ||
| } | ||
|
|
||
| optr.featureGatesMutex.RUnlock() | ||
| return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TOCTOU race condition in lock upgrade pattern.
The current implementation has a race window between releasing the read lock and acquiring the write lock (lines 1138-1140). Another goroutine could modify enabledManifestFeatureGates during this gap, causing the comparison at line 1137 to be stale by the time the write occurs.
Consider using a write lock from the start since the operation is conditional but the check itself requires consistency with the subsequent write:
🔒 Proposed fix
func (optr *Operator) updateEnabledFeatureGates(obj interface{}) bool {
featureGate, ok := obj.(*configv1.FeatureGate)
if !ok {
klog.Warningf("Expected FeatureGate object but got %T", obj)
return false
}
newGates := optr.extractEnabledGates(featureGate)
- optr.featureGatesMutex.RLock()
-
- // Check if gates actually changed to avoid unnecessary work
- if !optr.enabledManifestFeatureGates.Equal(newGates) {
- optr.featureGatesMutex.RUnlock()
- optr.featureGatesMutex.Lock()
- defer optr.featureGatesMutex.Unlock()
-
- klog.V(2).Infof("Cluster feature gates changed from %v to %v",
- sets.List(optr.enabledManifestFeatureGates), sets.List(newGates))
-
- optr.enabledManifestFeatureGates = newGates
- return true
- }
-
- optr.featureGatesMutex.RUnlock()
- return false
+ optr.featureGatesMutex.Lock()
+ defer optr.featureGatesMutex.Unlock()
+
+ // Check if gates actually changed to avoid unnecessary work
+ if optr.enabledManifestFeatureGates.Equal(newGates) {
+ return false
+ }
+
+ klog.V(2).Infof("Cluster feature gates changed from %v to %v",
+ sets.List(optr.enabledManifestFeatureGates), sets.List(newGates))
+
+ optr.enabledManifestFeatureGates = newGates
+ return true
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| optr.featureGatesMutex.RLock() | |
| // Check if gates actually changed to avoid unnecessary work | |
| if !optr.enabledManifestFeatureGates.Equal(newGates) { | |
| optr.featureGatesMutex.RUnlock() | |
| optr.featureGatesMutex.Lock() | |
| defer optr.featureGatesMutex.Unlock() | |
| klog.V(2).Infof("Cluster feature gates changed from %v to %v", | |
| sets.List(optr.enabledManifestFeatureGates), sets.List(newGates)) | |
| optr.enabledManifestFeatureGates = newGates | |
| return true | |
| } | |
| optr.featureGatesMutex.RUnlock() | |
| return false | |
| optr.featureGatesMutex.Lock() | |
| defer optr.featureGatesMutex.Unlock() | |
| // Check if gates actually changed to avoid unnecessary work | |
| if optr.enabledManifestFeatureGates.Equal(newGates) { | |
| return false | |
| } | |
| klog.V(2).Infof("Cluster feature gates changed from %v to %v", | |
| sets.List(optr.enabledManifestFeatureGates), sets.List(newGates)) | |
| optr.enabledManifestFeatureGates = newGates | |
| return true |
🤖 Prompt for AI Agents
In `@pkg/cvo/cvo.go` around lines 1134 - 1150, The read-to-write lock upgrade
creates a TOCTOU race around optr.enabledManifestFeatureGates; change the
pattern in the code that compares newGates to optr.enabledManifestFeatureGates
to either (a) take optr.featureGatesMutex.Lock() up front, perform the Equal
check and assignment while holding the write lock, then Unlock, or (b) keep the
read lock, and if Equal returns false then release the read lock, acquire the
write lock and re-check optr.enabledManifestFeatureGates.Equal(newGates) before
assigning and logging; update the code paths that reference featureGatesMutex,
enabledManifestFeatureGates, Equal, and newGates accordingly so the
compare-and-set is atomic.
e8ca53a to
28a8781
Compare
|
Ahh, I updated the library-go to remove the trailing s, but have not updated the tests to reflect that, my bad |
| // Initialize feature gates in status | ||
| w.status.EnabledFeatureGates = work.EnabledFeatureGates.Clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I was scratching my head a little bit about why we are initializing the feature gates here. Could we modify the comment to also explain the why and not only the what, if possible? The purpose was not apparent to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same as the rest of this block AFAICT. If you look at L545, it says we have to initilize the status for the first sync, so this just follows the pattern of everything else adding an initial value for w.status.X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, if I delete this line, all of the unit tests still pass, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, if I delete this line, all of the unit tests still pass, WDYT?
I have finally checked the test coverage, and I think that the CVO would benefit from a test scenario where we exercise the concurrency of the two main processes. That is Start and Update in combination with the new Feature Gate logic.
I have quickly drafted a CVO test scenario with Claude to test the concurrency here. It highlighted that we have missed a line to update the SyncWork status (mentioned in a different comment), but otherwise, everything looks promising? I would propose adding a similar test to ensure that the concurrency is working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CVO logs from the test output nicely demonstrates the flow of the logic. An excerpt:
I0126 19:47:29.187242 35895 cvo.go:1142] Cluster feature gates changed from [ExperimentalFeature] to [ExperimentalFeature LegacyFeature]
I0126 19:47:29.187246 35895 cvo.go:734] Started syncing cluster version "test/version", spec changes, status, and payload
I0126 19:47:29.187262 35895 cvo.go:774] Desired version from operator is v1.Update{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", Force:false}
I0126 19:47:29.187274 35895 sync_worker.go:909] Detected while considering cluster version generation 0: enabled feature gates changed (from [ExperimentalFeature] to [ExperimentalFeature LegacyFeature])
I0126 19:47:29.187278 35895 sync_worker.go:294] syncPayload: 1.0.0-abc (force=false)
I0126 19:47:29.187285 35895 sync_worker.go:310] Enabled feature gates changed from map[ExperimentalFeature:{}] to map[ExperimentalFeature:{} LegacyFeature:{}], forcing a payload refresh
I0126 19:47:29.187291 35895 payload.go:140] Loading updatepayload from "testdata/featuregatetest"
I0126 19:47:29.187323 35895 cincinnati.go:480] parsed metadata: URL "https://example.com/v1.0.0-abc", architecture "", channels [], errors []
I0126 19:47:29.187326 35895 payload.go:285] Architecture from release-metadata (1.0.0-abc) retrieved from runtime: "amd64"
I0126 19:47:29.187502 35895 payload.go:215] excluding Filename: "0000_30_legacy-excluded.yaml" Group: "" Kind: "ConfigMap" Namespace: "default" Name: "legacy-excluded": feature gate LegacyFeature is enabled but manifest requires it to be disabled
I0126 19:47:29.187509 35895 sync_worker.go:413] No preconditions configured.
I0126 19:47:29.187519 35895 sync_worker.go:466] Payload loaded from image/image:1 with hash ge54Uoy7v5o=, architecture amd64
I0126 19:47:29.187527 35895 sync_worker.go:603] Cancel the sync worker's current loop
I0126 19:47:29.187534 35895 sync_worker.go:610] Notify the sync worker that new work is available
I0126 19:47:29.187541 35895 status.go:96] merge into existing history completed=true desired=v1.Release{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", URL:"https://example.com/v1.0.0-abc", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Completed", StartedTime:time.Date(2026, time.January, 26, 19, 47, 29, 186155471, time.Local), CompletionTime:time.Date(2026, time.January, 26, 19, 47, 29, 186914051, time.Local), Version:"1.0.0-abc", Image:"image/image:1", Verified:false, AcceptedRisks:""}
I0126 19:47:29.187563 35895 cvo.go:736] Finished syncing cluster version "test/version" (317.595µs)
I0126 19:47:29.187573 35895 sync_worker.go:662] new work is available
I0126 19:47:29.187579 35895 sync_worker.go:909] Detected while calculating next work: enabled feature gates changed (from [ExperimentalFeature] to [ExperimentalFeature LegacyFeature])
I0126 19:47:29.187581 35895 sync_worker.go:829] Work changed, transitioning from Reconciling to Updating
I0126 19:47:29.187584 35895 sync_worker.go:728] Previous sync status: &cvo.SyncWorkerStatus{Generation:0, Failure:error(nil), Done:3, Total:3, Completed:2, Reconciling:true, Initial:false, VersionHash:"yrh5CWG1KPI=", Architecture:"amd64", LastProgress:time.Date(2026, time.January, 26, 19, 47, 29, 187224011, time.Local), Actual:v1.Release{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", URL:"https://example.com/v1.0.0-abc", Channels:[]string(nil)}, Verified:false, loadPayloadStatus:cvo.LoadPayloadStatus{Step:"PayloadLoaded", Message:"Payload loaded version=\"1.0.0-abc\" image=\"image/image:1\" architecture=\"amd64\"", AcceptedRisks:"", Failure:error(nil), Update:v1.Update{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", Force:false}, Verified:false, Local:false, LastTransitionTime:time.Time{wall:0xc2560b904b2d40ba, ext:14813617, loc:(*time.Location)(0x4dd4760)}}, CapabilitiesStatus:cvo.CapabilityStatus{Status:v1.ClusterVersionCapabilitiesStatus{EnabledCapabilities:[]v1.ClusterVersionCapability{"Build", "CSISnapshot", "CloudControllerManager", "CloudCredential", "Console", "DeploymentConfig", "ImageRegistry", "Ingress", "Insights", "MachineAPI", "NodeTuning", "OperatorLifecycleManager", "OperatorLifecycleManagerV1", "Storage", "baremetal", "marketplace", "openshift-samples"}, KnownCapabilities:[]v1.ClusterVersionCapability{"Build", "CSISnapshot", "CloudControllerManager", "CloudCredential", "Console", "DeploymentConfig", "ImageRegistry", "Ingress", "Insights", "MachineAPI", "NodeTuning", "OperatorLifecycleManager", "OperatorLifecycleManagerV1", "Storage", "baremetal", "marketplace", "openshift-samples"}}, ImplicitlyEnabledCaps:[]v1.ClusterVersionCapability{}}, EnabledFeatureGates:sets.Set[string]{"ExperimentalFeature":sets.Empty{}, "LegacyFeature":sets.Empty{}}}
I0126 19:47:29.187598 35895 sync_worker.go:986] apply: 1.0.0-abc on generation 0 in state Updating at attempt 0
I0126 19:47:29.187603 35895 task_graph.go:42] Unable to parse run level and component for "0000_10_always-included.yaml"
I0126 19:47:29.187606 35895 task_graph.go:42] Unable to parse run level and component for "0000_20_experimental-feature.yaml"
I0126 19:47:29.187616 35895 task_graph.go:497] Worker 0: Running job 0 (with 2 tasks)
I0126 19:47:29.187624 35895 sync_worker.go:1102] Running sync for configmap "default/always-included" (1 of 2)
I0126 19:47:29.187658 35895 sync_worker.go:1117] Done syncing for configmap "default/always-included" (1 of 2)
I0126 19:47:29.187662 35895 sync_worker.go:1102] Running sync for configmap "default/experimental-feature" (2 of 2)
I0126 19:47:29.187688 35895 sync_worker.go:1117] Done syncing for configmap "default/experimental-feature" (2 of 2)
I0126 19:47:29.187691 35895 task_graph.go:488] Worker 0: Received cancel signal
--- PASS: TestCVO_FeatureGateManifestInclusion (0.00s)
pkg/cvo/sync_worker.go
Outdated
| klog.V(2).Infof("Feature gates changed, loading updated payload") | ||
|
|
||
| // Clear the payload to force a reload. | ||
| w.payload = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite following which way you wanted me to go in this thread
Yeah, I was intentionally trying to leave the possibilities open, but I could have been more clear.
Instead of clearing the payload, I am proposing to change syncPayload to something like:
// cache the payload until the release image changes
validPayload := w.payload
previousEnabledFeatureGates := w.status.EnabledFeatureGates
if validPayload != nil && validPayload.Release.Image == desired.Image && work.EnabledFeatureGates.Equal(previousEnabledFeatureGates) {
// no change
}
if validPayload == nil {
klog.V(2).Info("Loading payload for the first time")
} else if !equalUpdate(configv1.Update{Image: validPayload.Release.Image}, configv1.Update{Image: desired.Image}) {
klog.V(2).Info("Loading payload due to desired image not being equal to the current one.")
} else if !work.EnabledFeatureGates.Equal(previousEnabledFeatureGates) {
// When the feature gates change, we must reload the payload.
// Loading the payload filters out files that didn't match the previous set of feature gates,
// this means now, additional files may match the new set of feature gates and need to be included.
// Some files in the current payload may no longer match the new set of feature gates and need to be excluded,
// though these ones are already excluded when apply calls Include on the manifests.
klog.V(2).Infof("Enabled feature gates changed from %v to %v, forcing a payload refresh", previousEnabledFeatureGates, work.EnabledFeatureGates)
} else {
return implicitlyEnabledCaps, nil
}
// The existing logic continuesand remove the following code from Start():
- if featureGatesChanged {
- // When the feature gates change, we must reload the payload.
- // Loading the payload filters out files that didn't match the previous set of feature gates,
- // this means now, additional files may match the new set of feature gates and need to be included.
- // Some files in the current payload may no longer match the new set of feature gates and need to be excluded,
- // though these ones are already excluded when apply calls Include on the manifests.
- klog.V(2).Infof("Feature gates changed, loading updated payload")
-
- _, err := w.loadUpdatedPayload(ctx, w.work)
- if err != nil {
- klog.Warningf("Error when attempting to load updated payload: %v.", err)
- }
- }I do have some concerns about clearing the payload directly, that is, w.payload = nil. The existing code assumes the w.payload to be always set after a specific point in its execution. This means that nil dereferences are possible without further modifications. The reason for the removal of the w.loadUpdatedPayload call from Start() is that the goroutine running Update() is the sole expected entity to update the w.payload at the moment. That is, Start() only applies a payload. Update loads an updated payload.
- When enabled gates change, an item is added to a queue, which results in the
Updateto run. Updatecan load manifests with new options.UpdateinstructsStartvia a channel that new work is available.Startsapplies the new loaded payload.
This follows the existing patterns without larger deviations. We would just reload the local payload more often, saving the memory, and we would not interfere too much with existing logic. The end outcome is the same. I am only concerned about making larger deviations from existing flows without extensive testing and a clear value being gained. WDYT?
This is based on my current understanding of the code from static analysis.
pkg/cvo/sync_worker.go
Outdated
| // determine whether we need to do work | ||
| w.lock.Lock() | ||
| changed := work.calculateNextFrom(w.work) | ||
| changed, featureGatesChanged := work.calculateNextFrom(w.work) | ||
|
|
||
| if featureGatesChanged { | ||
| // When the feature gates change, we must reload the payload. | ||
| // Loading the payload filters out files that didn't match the previous set of feature gates, | ||
| // this means now, additional files may match the new set of feature gates and need to be included. | ||
| // Some files in the current payload may no longer match the new set of feature gates and need to be excluded, | ||
| // though these ones are already excluded when apply calls Include on the manifests. | ||
| klog.V(2).Infof("Feature gates changed, loading updated payload") | ||
|
|
||
| _, err := w.loadUpdatedPayload(ctx, w.work) | ||
| if err != nil { | ||
| klog.Warningf("Error when attempting to load updated payload: %v.", err) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should now be able to revert the new changes and leave the existing logic. No need to call loadUpdatedPayload here directly. The Update() takes care of reloading the payload, and the Start() gets notified of new work to be made and proceeds to apply the new manifests.
| // determine whether we need to do work | |
| w.lock.Lock() | |
| changed := work.calculateNextFrom(w.work) | |
| changed, featureGatesChanged := work.calculateNextFrom(w.work) | |
| if featureGatesChanged { | |
| // When the feature gates change, we must reload the payload. | |
| // Loading the payload filters out files that didn't match the previous set of feature gates, | |
| // this means now, additional files may match the new set of feature gates and need to be included. | |
| // Some files in the current payload may no longer match the new set of feature gates and need to be excluded, | |
| // though these ones are already excluded when apply calls Include on the manifests. | |
| klog.V(2).Infof("Feature gates changed, loading updated payload") | |
| _, err := w.loadUpdatedPayload(ctx, w.work) | |
| if err != nil { | |
| klog.Warningf("Error when attempting to load updated payload: %v.", err) | |
| } | |
| } | |
| // determine whether we need to do work | |
| w.lock.Lock() | |
| changed := work.calculateNextFrom(w.work) | |
| w.lock.Unlock() | |
| if !changed && waitingToReconcile { | |
| klog.V(2).Infof("No change, waiting") | |
| continue | |
| } |
lib/manifest/manifest.go
Outdated
| currentPayloadManifests []manifest.Manifest, | ||
| manifestInclusionConfiguration InclusionConfiguration, | ||
| currentImplicitlyEnabled sets.Set[configv1.ClusterVersionCapability], | ||
| enabledFeatureGates sets.Set[string], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we could add this to the InclusionConfiguration struct type and pass it along other manifest inclusion configurations in manifestInclusionConfiguration.
The InclusionConfiguration is described as:
// InclusionConfiguration configures manifest inclusion, so
// callers can opt in to new filtering options instead of having to
// update existing call-sites, even if they do not need a new
// filtering option.
type InclusionConfiguration struct {| // do not overwrite these status values which are not managed by apply | ||
| update.loadPayloadStatus = w.status.loadPayloadStatus | ||
| update.CapabilitiesStatus = w.status.CapabilitiesStatus | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new testing (see the comment here) found that we have missed a line here. Otherwise, the EnabledFeatureGates status gets overwritten.
| update.EnabledFeatureGates = w.status.EnabledFeatureGates.Clone() | |
| // Initialize feature gates in status | ||
| w.status.EnabledFeatureGates = work.EnabledFeatureGates.Clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, if I delete this line, all of the unit tests still pass, WDYT?
I have finally checked the test coverage, and I think that the CVO would benefit from a test scenario where we exercise the concurrency of the two main processes. That is Start and Update in combination with the new Feature Gate logic.
I have quickly drafted a CVO test scenario with Claude to test the concurrency here. It highlighted that we have missed a line to update the SyncWork status (mentioned in a different comment), but otherwise, everything looks promising? I would propose adding a similar test to ensure that the concurrency is working as expected.
| // Initialize feature gates in status | ||
| w.status.EnabledFeatureGates = work.EnabledFeatureGates.Clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CVO logs from the test output nicely demonstrates the flow of the logic. An excerpt:
I0126 19:47:29.187242 35895 cvo.go:1142] Cluster feature gates changed from [ExperimentalFeature] to [ExperimentalFeature LegacyFeature]
I0126 19:47:29.187246 35895 cvo.go:734] Started syncing cluster version "test/version", spec changes, status, and payload
I0126 19:47:29.187262 35895 cvo.go:774] Desired version from operator is v1.Update{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", Force:false}
I0126 19:47:29.187274 35895 sync_worker.go:909] Detected while considering cluster version generation 0: enabled feature gates changed (from [ExperimentalFeature] to [ExperimentalFeature LegacyFeature])
I0126 19:47:29.187278 35895 sync_worker.go:294] syncPayload: 1.0.0-abc (force=false)
I0126 19:47:29.187285 35895 sync_worker.go:310] Enabled feature gates changed from map[ExperimentalFeature:{}] to map[ExperimentalFeature:{} LegacyFeature:{}], forcing a payload refresh
I0126 19:47:29.187291 35895 payload.go:140] Loading updatepayload from "testdata/featuregatetest"
I0126 19:47:29.187323 35895 cincinnati.go:480] parsed metadata: URL "https://example.com/v1.0.0-abc", architecture "", channels [], errors []
I0126 19:47:29.187326 35895 payload.go:285] Architecture from release-metadata (1.0.0-abc) retrieved from runtime: "amd64"
I0126 19:47:29.187502 35895 payload.go:215] excluding Filename: "0000_30_legacy-excluded.yaml" Group: "" Kind: "ConfigMap" Namespace: "default" Name: "legacy-excluded": feature gate LegacyFeature is enabled but manifest requires it to be disabled
I0126 19:47:29.187509 35895 sync_worker.go:413] No preconditions configured.
I0126 19:47:29.187519 35895 sync_worker.go:466] Payload loaded from image/image:1 with hash ge54Uoy7v5o=, architecture amd64
I0126 19:47:29.187527 35895 sync_worker.go:603] Cancel the sync worker's current loop
I0126 19:47:29.187534 35895 sync_worker.go:610] Notify the sync worker that new work is available
I0126 19:47:29.187541 35895 status.go:96] merge into existing history completed=true desired=v1.Release{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", URL:"https://example.com/v1.0.0-abc", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Completed", StartedTime:time.Date(2026, time.January, 26, 19, 47, 29, 186155471, time.Local), CompletionTime:time.Date(2026, time.January, 26, 19, 47, 29, 186914051, time.Local), Version:"1.0.0-abc", Image:"image/image:1", Verified:false, AcceptedRisks:""}
I0126 19:47:29.187563 35895 cvo.go:736] Finished syncing cluster version "test/version" (317.595µs)
I0126 19:47:29.187573 35895 sync_worker.go:662] new work is available
I0126 19:47:29.187579 35895 sync_worker.go:909] Detected while calculating next work: enabled feature gates changed (from [ExperimentalFeature] to [ExperimentalFeature LegacyFeature])
I0126 19:47:29.187581 35895 sync_worker.go:829] Work changed, transitioning from Reconciling to Updating
I0126 19:47:29.187584 35895 sync_worker.go:728] Previous sync status: &cvo.SyncWorkerStatus{Generation:0, Failure:error(nil), Done:3, Total:3, Completed:2, Reconciling:true, Initial:false, VersionHash:"yrh5CWG1KPI=", Architecture:"amd64", LastProgress:time.Date(2026, time.January, 26, 19, 47, 29, 187224011, time.Local), Actual:v1.Release{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", URL:"https://example.com/v1.0.0-abc", Channels:[]string(nil)}, Verified:false, loadPayloadStatus:cvo.LoadPayloadStatus{Step:"PayloadLoaded", Message:"Payload loaded version=\"1.0.0-abc\" image=\"image/image:1\" architecture=\"amd64\"", AcceptedRisks:"", Failure:error(nil), Update:v1.Update{Architecture:"", Version:"1.0.0-abc", Image:"image/image:1", Force:false}, Verified:false, Local:false, LastTransitionTime:time.Time{wall:0xc2560b904b2d40ba, ext:14813617, loc:(*time.Location)(0x4dd4760)}}, CapabilitiesStatus:cvo.CapabilityStatus{Status:v1.ClusterVersionCapabilitiesStatus{EnabledCapabilities:[]v1.ClusterVersionCapability{"Build", "CSISnapshot", "CloudControllerManager", "CloudCredential", "Console", "DeploymentConfig", "ImageRegistry", "Ingress", "Insights", "MachineAPI", "NodeTuning", "OperatorLifecycleManager", "OperatorLifecycleManagerV1", "Storage", "baremetal", "marketplace", "openshift-samples"}, KnownCapabilities:[]v1.ClusterVersionCapability{"Build", "CSISnapshot", "CloudControllerManager", "CloudCredential", "Console", "DeploymentConfig", "ImageRegistry", "Ingress", "Insights", "MachineAPI", "NodeTuning", "OperatorLifecycleManager", "OperatorLifecycleManagerV1", "Storage", "baremetal", "marketplace", "openshift-samples"}}, ImplicitlyEnabledCaps:[]v1.ClusterVersionCapability{}}, EnabledFeatureGates:sets.Set[string]{"ExperimentalFeature":sets.Empty{}, "LegacyFeature":sets.Empty{}}}
I0126 19:47:29.187598 35895 sync_worker.go:986] apply: 1.0.0-abc on generation 0 in state Updating at attempt 0
I0126 19:47:29.187603 35895 task_graph.go:42] Unable to parse run level and component for "0000_10_always-included.yaml"
I0126 19:47:29.187606 35895 task_graph.go:42] Unable to parse run level and component for "0000_20_experimental-feature.yaml"
I0126 19:47:29.187616 35895 task_graph.go:497] Worker 0: Running job 0 (with 2 tasks)
I0126 19:47:29.187624 35895 sync_worker.go:1102] Running sync for configmap "default/always-included" (1 of 2)
I0126 19:47:29.187658 35895 sync_worker.go:1117] Done syncing for configmap "default/always-included" (1 of 2)
I0126 19:47:29.187662 35895 sync_worker.go:1102] Running sync for configmap "default/experimental-feature" (2 of 2)
I0126 19:47:29.187688 35895 sync_worker.go:1117] Done syncing for configmap "default/experimental-feature" (2 of 2)
I0126 19:47:29.187691 35895 task_graph.go:488] Worker 0: Received cancel signal
--- PASS: TestCVO_FeatureGateManifestInclusion (0.00s)
|
if possible, could we rebase this PR? I want to setup a cluster to test it (although the development process is still ongoing) |
3bc4f6f to
3c9beba
Compare
|
Thanks for the review again @DavidHurta, I've rebased and squashed a few of the smaller commits down. The last 5 commits are based on your most recent round of feedback, but I'll squash those into some of the larger commits logically once you are happy with those @JianLi-RH should be able to test now, no conflicts at the moment |
3c9beba to
33f38a9
Compare
By implementing full manifest filtering in the render step, we now need to make sure any manifest that we render includes a cluster profile. In this case, we just render it out with the cluster profile that the cluster is on.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
33f38a9 to
4ca50a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/cvo/sync_worker.go (1)
571-580: Retry payload reload when a feature‑gate refresh fails.Line 577 and Line 579 update the stored gates even when
loadUpdatedPayloadfails. That makes subsequent Update calls treat the gates as already applied, so the payload reload won’t be retried and newly gated manifests can stay missing. Keep the previous gates (or track “pending” separately) until reload succeeds.🐛 Suggested fix
- w.work.EnabledFeatureGates = work.EnabledFeatureGates.Clone() - w.status.EnabledFeatureGates = work.EnabledFeatureGates.Clone()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/cvo/sync_worker.go (1)
571-580: Retry payload reload after feature‑gate change failures.
Line 571 and Line 579: whenloadUpdatedPayloadfails, the code updatesw.work.EnabledFeatureGatesandw.status.EnabledFeatureGates, which makes subsequent Update calls treat gates as in-sync and skip further reload attempts. This can leave the payload filtered by the old gate set indefinitely after a transient failure. Keep the old gates until a reload succeeds or track a pending-reload flag.💡 Proposed fix
if err != nil { // save override and capability changes if not first time through if w.work != nil { w.work.Overrides = config.Spec.Overrides w.work.Capabilities = work.Capabilities - w.work.EnabledFeatureGates = work.EnabledFeatureGates.Clone() w.status.CapabilitiesStatus.Status = capability.GetCapabilitiesStatus(w.work.Capabilities) - w.status.EnabledFeatureGates = work.EnabledFeatureGates.Clone() } return w.status.DeepCopy() }
🤖 Fix all issues with AI agents
In `@pkg/cvo/cvo.go`:
- Around line 1105-1120: featureGateEventHandler is missing a DeleteFunc so
deleting a FeatureGate leaves stale gates; add a DeleteFunc in the returned
cache.ResourceEventHandlerFuncs that clears optr.enabledManifestFeatureGates
(set to an empty set/map/slice consistent with its type) and calls
optr.queue.Add(workQueueKey) to queue a sync (use the existing workQueueKey
local and the optr.queue.Add call pattern shown in AddFunc/UpdateFunc); ensure
you perform any required locking consistent with how enabledManifestFeatureGates
is accessed elsewhere in the Operator.
In
`@pkg/cvo/testdata/featuregatetest/release-manifests/0000_20_experimental-feature.yaml`:
- Around line 6-8: The annotation uses the old singular key
release.openshift.io/feature-gate which won't match the new gating logic; update
the annotations block to use the plural key release.openshift.io/feature-gates
(preserving the same value "ExperimentalFeature") so tests exercise the new
feature-gates handling.
In
`@pkg/cvo/testdata/featuregatetest/release-manifests/0000_30_legacy-excluded.yaml`:
- Around line 6-8: The annotation key is using the old singular form
`release.openshift.io/feature-gate`; update it to the new plural key
`release.openshift.io/feature-gates` so tests exercise the new gating
logic—locate the annotations block containing
`release.openshift.io/feature-gate` in the manifest (the annotations map in
0000_30_legacy-excluded.yaml) and replace the key name while keeping the same
value `"-LegacyFeature"`.
| // featureGateEventHandler handles changes to FeatureGate objects and updates the cluster feature gates | ||
| func (optr *Operator) featureGateEventHandler() cache.ResourceEventHandler { | ||
| workQueueKey := optr.queueKey() | ||
| return cache.ResourceEventHandlerFuncs{ | ||
| AddFunc: func(obj interface{}) { | ||
| if optr.updateEnabledFeatureGates(obj) { | ||
| optr.queue.Add(workQueueKey) | ||
| } | ||
| }, | ||
| UpdateFunc: func(old, new interface{}) { | ||
| if optr.updateEnabledFeatureGates(new) { | ||
| optr.queue.Add(workQueueKey) | ||
| } | ||
| }, | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the exact code at the specified lines
sed -n '1105,1120p' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 589
🏁 Script executed:
# Get more context around the featureGateEventHandler function
sed -n '1100,1125p' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 970
🏁 Script executed:
# Search for the Operator struct definition to find the fields mentioned
rg -A 30 'type Operator struct'Repository: openshift/cluster-version-operator
Length of output: 3877
🏁 Script executed:
# Search for featureGatesMutex and enabledManifestFeatureGates fields
rg 'featureGatesMutex|enabledManifestFeatureGates'Repository: openshift/cluster-version-operator
Length of output: 1182
🏁 Script executed:
# Look for other uses of featureGateEventHandler to understand context
rg -B 3 -A 3 'featureGateEventHandler'Repository: openshift/cluster-version-operator
Length of output: 858
🏁 Script executed:
# Search for how FeatureGate objects are used and created
rg -B 3 -A 3 'FeatureGate' pkg/cvo/cvo.go | head -60Repository: openshift/cluster-version-operator
Length of output: 3193
🏁 Script executed:
# Look at the featureGateEventHandler tests
rg -B 5 -A 20 'featureGateEventHandler' pkg/cvo/cvo_featuregates_test.goRepository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Check if there are any tests for DeleteFunc in ResourceEventHandler scenarios
rg -B 2 -A 5 'DeleteFunc' pkg/cvo/Repository: openshift/cluster-version-operator
Length of output: 2181
🏁 Script executed:
# Search for updateEnabledFeatureGates to understand what happens on update
rg -B 2 -A 15 'func.*updateEnabledFeatureGates' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 753
🏁 Script executed:
# Check how many FeatureGate objects are typically managed
rg 'FeatureGate' pkg/cvo/ -A 2 -B 2 | grep -i 'single\|multiple\|list\|watch'Repository: openshift/cluster-version-operator
Length of output: 1703
🏁 Script executed:
# Look at the featureGateEventHandler and surrounding code to understand the full context
sed -n '1090,1140p' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 1713
🏁 Script executed:
# Search for any tests around feature gate deletion
rg -B 5 -A 10 'featureGate.*[Dd]elete\|[Dd]elete.*featureGate' pkg/cvo/Repository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Check if there's documentation or comments about the FeatureGate lifecycle
rg -B 3 -A 5 'FeatureGate.*delete\|delete.*FeatureGate'Repository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Look at how clusterOperatorEventHandler implements DeleteFunc for comparison
rg -B 5 -A 10 'clusterOperatorEventHandler' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 1189
🏁 Script executed:
# Check if enabledManifestFeatureGates is ever reset or cleared anywhere
rg 'enabledManifestFeatureGates.*=' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 123
🏁 Script executed:
# Check how FeatureGate is configured/managed in the cluster - is deletion a realistic scenario?
rg -B 3 -A 3 'FeatureGate' pkg/cvo/sync_worker.go | head -40Repository: openshift/cluster-version-operator
Length of output: 1664
🏁 Script executed:
# Look for tests related to feature gate deletion or resets
rg -B 5 -A 10 'Delete\|Reset' pkg/cvo/cvo_featuregates_test.goRepository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Check the featuregate_integration_test to see scenarios
cat pkg/cvo/featuregate_integration_test.go | head -100Repository: openshift/cluster-version-operator
Length of output: 3275
🏁 Script executed:
# Verify the initialization and understand what "starting" gates mean
rg -B 10 'startingEnabledManifestFeatureGates' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 1074
🏁 Script executed:
# Check how extractEnabledGates works to understand the semantics better
rg -B 3 -A 15 'func.*extractEnabledGates' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 1066
🏁 Script executed:
# Look for any comments or documentation about what happens if FeatureGate is missing/deleted
rg -B 5 -A 5 'FeatureGate.*missing\|FeatureGate.*delete\|missing.*FeatureGate' pkg/cvo/Repository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Check if there's a default/empty state for feature gates
rg -B 3 -A 3 'sets\.New\[string\]\(\)' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Look at what happens in the main sync loop regarding feature gates
rg -B 5 -A 5 'enabledManifestFeatureGates' pkg/cvo/cvo.go | tail -50Repository: openshift/cluster-version-operator
Length of output: 1982
🏁 Script executed:
# Verify if FeatureGate is a cluster-scoped singleton or if there can be multiple
rg -B 3 -A 3 'featureGateLister\|FeatureGateLister' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Check the configv1.FeatureGate type to understand its scope
fd -e go -path '*/api/*' | xargs grep -l 'type FeatureGate struct' | head -3Repository: openshift/cluster-version-operator
Length of output: 246
🏁 Script executed:
# Look for any existing issue handling or comment about FeatureGate deletion
rg -B 5 -A 5 'FeatureGate' pkg/cvo/sync_worker.go | tail -30Repository: openshift/cluster-version-operator
Length of output: 1310
🏁 Script executed:
# Check what happens in the sync loop when enabledManifestFeatureGates is used
rg -B 3 -A 3 'getEnabledFeatureGates\|enabledManifestFeatureGates' pkg/cvo/cvo.go | grep -A 5 -B 5 'func (optr'Repository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Look at how other informers handle deletion in the codebase to understand patterns
rg -B 2 'DeleteFunc.*func' pkg/cvo/cvo.go | head -20Repository: openshift/cluster-version-operator
Length of output: 161
🏁 Script executed:
# Verify that FeatureGate is indeed cluster-scoped (singular) by checking the lister usage
rg 'featureGateLister\.' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 60
🏁 Script executed:
# Search for where getEnabledFeatureGates is called in the main sync loop
rg -B 5 -A 5 'getEnabledFeatureGates' pkg/cvo/Repository: openshift/cluster-version-operator
Length of output: 5793
🏁 Script executed:
# Look at the clusterOperatorEventHandler to understand the pattern better
sed -n '1050,1090p' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 1299
🏁 Script executed:
# Check if there are any other ResourceEventHandlers that handle DeleteFunc as a pattern
rg -B 3 -A 8 'ResourceEventHandlerFuncs{' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 193
🏁 Script executed:
# Verify that the proposed fix uses the correct approach by looking at similar patterns
rg -B 2 -A 6 'featureGatesMutex.Lock' pkg/cvo/cvo.goRepository: openshift/cluster-version-operator
Length of output: 384
Handle FeatureGate deletion to prevent stale manifest gates.
The featureGateEventHandler lacks a DeleteFunc, so if the FeatureGate resource is deleted, stale enabled gates remain in place and no sync is queued. This causes manifests to be incorrectly filtered using outdated gates. Add a DeleteFunc that resets enabledManifestFeatureGates to an empty set and queues a sync:
Suggested fix
func (optr *Operator) featureGateEventHandler() cache.ResourceEventHandler {
workQueueKey := optr.queueKey()
return cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
if optr.updateEnabledFeatureGates(obj) {
optr.queue.Add(workQueueKey)
}
},
UpdateFunc: func(old, new interface{}) {
if optr.updateEnabledFeatureGates(new) {
optr.queue.Add(workQueueKey)
}
},
+ DeleteFunc: func(_ interface{}) {
+ optr.featureGatesMutex.Lock()
+ optr.enabledManifestFeatureGates = sets.New[string]()
+ optr.featureGatesMutex.Unlock()
+ optr.queue.Add(workQueueKey)
+ },
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // featureGateEventHandler handles changes to FeatureGate objects and updates the cluster feature gates | |
| func (optr *Operator) featureGateEventHandler() cache.ResourceEventHandler { | |
| workQueueKey := optr.queueKey() | |
| return cache.ResourceEventHandlerFuncs{ | |
| AddFunc: func(obj interface{}) { | |
| if optr.updateEnabledFeatureGates(obj) { | |
| optr.queue.Add(workQueueKey) | |
| } | |
| }, | |
| UpdateFunc: func(old, new interface{}) { | |
| if optr.updateEnabledFeatureGates(new) { | |
| optr.queue.Add(workQueueKey) | |
| } | |
| }, | |
| } | |
| } | |
| // featureGateEventHandler handles changes to FeatureGate objects and updates the cluster feature gates | |
| func (optr *Operator) featureGateEventHandler() cache.ResourceEventHandler { | |
| workQueueKey := optr.queueKey() | |
| return cache.ResourceEventHandlerFuncs{ | |
| AddFunc: func(obj interface{}) { | |
| if optr.updateEnabledFeatureGates(obj) { | |
| optr.queue.Add(workQueueKey) | |
| } | |
| }, | |
| UpdateFunc: func(old, new interface{}) { | |
| if optr.updateEnabledFeatureGates(new) { | |
| optr.queue.Add(workQueueKey) | |
| } | |
| }, | |
| DeleteFunc: func(_ interface{}) { | |
| optr.featureGatesMutex.Lock() | |
| optr.enabledManifestFeatureGates = sets.New[string]() | |
| optr.featureGatesMutex.Unlock() | |
| optr.queue.Add(workQueueKey) | |
| }, | |
| } | |
| } |
🤖 Prompt for AI Agents
In `@pkg/cvo/cvo.go` around lines 1105 - 1120, featureGateEventHandler is missing
a DeleteFunc so deleting a FeatureGate leaves stale gates; add a DeleteFunc in
the returned cache.ResourceEventHandlerFuncs that clears
optr.enabledManifestFeatureGates (set to an empty set/map/slice consistent with
its type) and calls optr.queue.Add(workQueueKey) to queue a sync (use the
existing workQueueKey local and the optr.queue.Add call pattern shown in
AddFunc/UpdateFunc); ensure you perform any required locking consistent with how
enabledManifestFeatureGates is accessed elsewhere in the Operator.
| annotations: | ||
| include.release.openshift.io/self-managed-high-availability: "true" | ||
| release.openshift.io/feature-gate: "ExperimentalFeature" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the new plural feature-gates annotation key.
Line 8 uses release.openshift.io/feature-gate, but the PR contract introduces release.openshift.io/feature-gates. Using the singular key may bypass the new gating logic in tests.
🔧 Proposed fix
- release.openshift.io/feature-gate: "ExperimentalFeature"
+ release.openshift.io/feature-gates: "ExperimentalFeature"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| annotations: | |
| include.release.openshift.io/self-managed-high-availability: "true" | |
| release.openshift.io/feature-gate: "ExperimentalFeature" | |
| annotations: | |
| include.release.openshift.io/self-managed-high-availability: "true" | |
| release.openshift.io/feature-gates: "ExperimentalFeature" |
🤖 Prompt for AI Agents
In
`@pkg/cvo/testdata/featuregatetest/release-manifests/0000_20_experimental-feature.yaml`
around lines 6 - 8, The annotation uses the old singular key
release.openshift.io/feature-gate which won't match the new gating logic; update
the annotations block to use the plural key release.openshift.io/feature-gates
(preserving the same value "ExperimentalFeature") so tests exercise the new
feature-gates handling.
| annotations: | ||
| include.release.openshift.io/self-managed-high-availability: "true" | ||
| release.openshift.io/feature-gate: "-LegacyFeature" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the new plural feature-gates annotation key.
Line 8 uses release.openshift.io/feature-gate, but the PR contract introduces release.openshift.io/feature-gates. Using the singular key may bypass the new gating logic in tests.
🔧 Proposed fix
- release.openshift.io/feature-gate: "-LegacyFeature"
+ release.openshift.io/feature-gates: "-LegacyFeature"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| annotations: | |
| include.release.openshift.io/self-managed-high-availability: "true" | |
| release.openshift.io/feature-gate: "-LegacyFeature" | |
| annotations: | |
| include.release.openshift.io/self-managed-high-availability: "true" | |
| release.openshift.io/feature-gates: "-LegacyFeature" |
🤖 Prompt for AI Agents
In
`@pkg/cvo/testdata/featuregatetest/release-manifests/0000_30_legacy-excluded.yaml`
around lines 6 - 8, The annotation key is using the old singular form
`release.openshift.io/feature-gate`; update it to the new plural key
`release.openshift.io/feature-gates` so tests exercise the new gating
logic—locate the annotations block containing
`release.openshift.io/feature-gate` in the manifest (the annotations map in
0000_30_legacy-excluded.yaml) and replace the key name while keeping the same
value `"-LegacyFeature"`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/cvo/sync_worker.go (1)
571-580: Avoid marking feature gates as applied when payload reload fails.
IfloadUpdatedPayloadfails, updatingw.work.EnabledFeatureGatesandw.status.EnabledFeatureGatesmakes subsequentUpdatecalls treat the change as already applied, so the payload reload won’t be retried and newly enabled manifests can remain missing.🛠️ Suggested fix
if w.work != nil { w.work.Overrides = config.Spec.Overrides w.work.Capabilities = work.Capabilities - w.work.EnabledFeatureGates = work.EnabledFeatureGates.Clone() w.status.CapabilitiesStatus.Status = capability.GetCapabilitiesStatus(w.work.Capabilities) - w.status.EnabledFeatureGates = work.EnabledFeatureGates.Clone() }
|
@JoelSpeed: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR introduces a new annotation
release.openshift.io/feature-gatesthat accepts a comma separated list of feature gates which must be enabled to consider a manifest for inclusion within the cluster, optionally preceeded by a-to negate that (ie don't include me if this gate is enabled).The PR adds a feature gate informer to the core operator which keeps a synchronised list of enabled feature gates. This is then passed through to sync workers, added to the sync worker status, and finally passed into the library-go manifest inclusion/exclusion code for decisions on whether the manifest should/shouldn't be included.
I need to test this and explore what the render and task graph parts are about. I see some todo's in the render already about proper feature-setting so it looks like maybe we are ok to ignore that for now.
Note, claude was used to assist in the development of this, I have reviewed every line carefully so I think this is all good and follows existing CVO patterns, but I'm not a CVO expert so please point out where Claude and I have misunderstood 🙏
Draft for now as this is based on a library-go PR that I need to raise and merge before this can go in.