✨ WIP: Cluster provider and cluster-aware controllers by embik · Pull Request #3019 · kubernetes-sigs/controller-runtime

embik · 2024-11-22T12:18:03Z

This is a continuation of #2726 and #2207, a prototype for a multi-cluster support in controller-runtime. A design proposal is at #2746 which will need updating depending on the feedback for this PR. I've reorganized code changes into different commits than the previous attempts, but tried to keep authorship of changes faithful to the commits this work is based on.

The main change from the previous PR (#2726) is the update to typed generics support and adjustment to have a BYO request type. The implications of that is a) that you will need to bring your own typed EventHandler and b) that a wrapper function is needed to inject information (mostly the cluster name) when establishing a handler.

Apart from that, the core design of this is the cluster provider that can be plugged into a manager. This unlocks different ways to discover the Kubernetes "fleet" that is being reconciled.

sprsquish · 2024-11-25T21:58:52Z

+	// Name returns the name of the cluster. It identifies the cluster in the
+	// manager if that is attached to a cluster provider. The value is usually
+	// empty for the default cluster of a manager.
+	Name() string


We're internally working on a mechanism for dealing with multiple clusters, too. It's been helpful for the cluster identifier to be more flexible than a simple string. We've made it comparable.

Something like this:

type Cluster[ID comparable] interface { Identifier() ID ... }

I understand this suggestion would cause a much larger change requiring type parameters in many places. Future work could be done to migrate.

Mostly, I wanted to raise this as something to consider and that calling the method Identifier would provide room to migrate in the future.

I'm definitely open to that, I think renaming it to Identifier() should be easy enough! Maybe some maintainers can leave a note if they consider this a future endeavour and this would be a worthwhile change to prepare that.

We were there before, had a logical cluster abstraction, and then went back to string as it just simplifies everything.

@sprsquish Am curious why a string as identifier and some lookup table on the cluster provider side does not work?

Also, adding Name() string to the interface has no impact on the consumers of the Cluster type. But adding a type parameter forces them to pass e.g. string or whatever everywhere including the manager. I don't think that's a good idea. IMO, we should not do it.

@sttts

Am curious why a string as identifier and some lookup table on the cluster provider side does not work?

Strings work fine.. We like to work with more structure in our identifiers. For instance, working with multiple clusters across cloud providers it's nice to have a structure like this:

type ClusterID struct { Provider string Region string Account string Name string }

It can be marshaled into a string, of course, it's just nice not to have to.

I understand the need for a simpler approach and appreciate that this has been discussed already. I think there's a way to migrate over time that follows what's been happening with handlers by introducing TypedCluster[ID comparable] and an alias for the common use case type Cluster = TypedCluster[string] (Manager would need the same treatment).

The only change to the current proposal would be to use the more generic Identifier() rather than Name(). That leaves the door open for future discussions regarding generic identifiers without having to do all the work up front.

If I understand your use-case correctly, this pattern should help you elegantly: https://goplay.space/#wRWyxv-GUrj

Or in other words: the type only matters in code that is ClusterID aware. Turning the string back into the rich format is one type conversion away.

sttts · 2025-01-06T14:15:34Z

+type Provider interface {
+	// Get returns a cluster for the given identifying cluster name. Get
+	// returns an existing cluster if it has been created before.
+	Get(ctx context.Context, clusterName string) (Cluster, error)


is the default cluster a thing here already? I.e. can I always pass empty string?

Should we specify what is returned for an unknown cluster?

The default cluster is not a thing here, that is in Manager.GetCluster. The cluster provider doesn't have a concept of a default cluster.

Should we specify what is returned for an unknown cluster?

Probably. I can add a sentence to clarify this.

Added a note on what should happen if the cluster is unknown.

sttts · 2025-01-06T14:17:18Z

+// Provider defines methods to retrieve clusters by name. The provider is
+// responsible for discovering and managing the lifecycle of each cluster.
+//


"defines methods" is redundant for an interface.

Suggested change

// Provider defines methods to retrieve clusters by name. The provider is

// responsible for discovering and managing the lifecycle of each cluster.

//

// Provider allows to retrieve clusters by name. The provider is

// responsible for discovering and managing the lifecycle of each cluster.

//

It's kind of unclear what managing means in a getter interface.

Maybe be concrete saying:

Suggested change

// Provider defines methods to retrieve clusters by name. The provider is

// responsible for discovering and managing the lifecycle of each cluster.

//

// Provider allows to retrieve clusters by name, e.g. to reconcilers.

// The provider is responsible for discovering and managing the lifecycle

// of each cluster, calling `Engage` and `Disengage` on the manager

// it is run against, whenever a new cluster is discovered or a cluster

// is unregistered.

+1, I've added that.

k8s-ci-robot · 2025-01-07T11:58:32Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: embik
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Co-authored-by: Iván Álvarez <ivanape@inditex.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

On-behalf-of: SAP <marvin.beckers@sap.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

k8s-ci-robot · 2025-01-20T08:00:05Z

@embik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-controller-runtime-apidiff	`e93979e`	link	false	`/test pull-controller-runtime-apidiff`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

varshaprasad96

In the process of reviewing the PR in detail, for now most of my comments are either nits or questions.

This is a really helpful feature in controller-runtime, that can benefit other internal kubernetes-sigs projects.
cc: @sbueringer @joelanford Could we have some eyes on this and the proposal please. Thank you!

varshaprasad96 · 2025-02-07T10:25:16Z

-			return fmt.Errorf("For() can only be used with reconcile.Request, got %T", *new(request))
+		var hdler handler.TypedEventHandler[client.Object, request]
+		switch reflect.TypeFor[request]() {
+		case reflect.TypeOf(reconcile.Request{}):


I'm not sure if it matters, I'll leave it up to you - the former is comparing parameter type at runtime and the other is checking for its type at during compile time. Isn't it better to be consistent?

Suggested change

case reflect.TypeOf(reconcile.Request{}):

case reflect.TypeFor[reconcile.Request]():

Good point, but I tried to just adapt the previous if condition. I'm not opposed to changing it, I just wanted to not fiddle with the condition checked.

varshaprasad96 · 2025-02-07T10:31:59Z

+		case reflect.TypeOf(reconcile.Request{}):
+			reflect.ValueOf(&hdler).Elem().Set(reflect.ValueOf(handler.WithLowPriorityWhenUnchanged(&handler.EnqueueRequestForObject{})))
+		case reflect.TypeOf(reconcile.ClusterAwareRequest{}):
+			reflect.ValueOf(&hdler).Elem().Set(reflect.ValueOf(handler.WithLowPriorityWhenUnchanged(&handler.EnqueueClusterAwareRequestForObject{ClusterName: cl.Name()})))


Just curious on why we need to use reflect and set the value instead of doing a direct assignment? If we are using reflect.TypeOf previously, aren't we sure of the type at runtime?
Just to make reading of the code easier.

I've asked myself the same question when I stumbled over this part of the code (see below, it's how the code today sets the handler implementation as well). When I try to change it to something like:

case reflect.TypeOf(reconcile.ClusterAwareRequest{}): hdler = handler.WithLowPriorityWhenUnchanged(&handler.EnqueueClusterAwareRequestForObject{ClusterName: cl.Name()})

it fails with:

../../pkg/builder/controller.go:335:12: cannot use handler.WithLowPriorityWhenUnchanged(&handler.EnqueueClusterAwareRequestForObject{…}) (value of type handler.TypedEventHandler[client.Object, reconcile.ClusterAwareRequest]) as handler.TypedEventHandler[client.Object, request] value in assignment: handler.TypedEventHandler[client.Object, reconcile.ClusterAwareRequest] does not implement handler.TypedEventHandler[client.Object, request] (wrong type for method Create) have Create(context.Context, event.TypedCreateEvent[client.Object], workqueue.TypedRateLimitingInterface[reconcile.ClusterAwareRequest]) want Create(context.Context, event.TypedCreateEvent[client.Object], workqueue.TypedRateLimitingInterface[request])

My understanding of generics here is a bit limited, but I think this is because we expect TypedRateLimitingInterface to take whatever generic request we give into the function, but we are setting it to a concrete type. I suppose this only works with the reflect magic we currently have.

Ah, thanks for explaining @embik!

No problem! This is at least how I understand this line. It's a bit unclear to me too.

varshaprasad96 · 2025-02-07T11:24:01Z

+
+	// engagedCluster is a map of engaged clusters. The can come and go as the manager is running.
+	engagedClustersLock   sync.RWMutex
+	engagedClusters       map[string]cluster.Cluster


It would be super helpful if we can expose the set of engagedClusters from the manager. In Kueue for a multi cluster implementation, we have a list of active and missing clusters based on available quota for admitting workloads.

For my understanding, you'd be looking for a list of the keys for that map, is that correct (i.e. the cluster identifiers, not the cluster.Cluster objects directly)?

Yes, a unique identifier for the cluster, so that we can fetch the relevant client and submit workloads in that cluster.

varshaprasad96 · 2025-02-07T11:40:33Z

+
+	var errs []error
+	for _, r := range cm.clusterAwareRunnables {
+		if err := r.Disengage(ctx, cl); err != nil {


Please correct me if I'm wrong - While calling Engage we are starting a SyncingSource from multicluster cache (

controller-runtime/pkg/builder/controller.go

Line 397 in e93979e

src := &ctxBoundedSyncingSource[request]{ctx: ctx, src: source.TypedKind(cl.GetCache(), projected, handler.WithLowPriorityWhenUnchanged(h), allPredicates...)}

), but in disengage we aren't stopping the context anywhere in here:

controller-runtime/pkg/controller/multicluster.go

Line 126 in e93979e

func (c *typedMultiClusterController[request]) Disengage(ctx context.Context, cl cluster.Cluster) error {

. Is this expected?

The way this works (but I wonder if it's confusing and should be changed to be honest) is that the implementation expects the cluster provider to cancel the clusterCtx used for Engage (and subsequently, the ctxBoundedSyncingSource) when shutting down. Compare the fleet example here canceling the stored context when figuring out that a cluster has to be removed: https://github.com/embik/controller-runtime/blob/e93979ec5cdd656809f5fbb91b858665979cc040/examples/fleet/main.go#L280

// disengage manager if mgr != nil { if err := mgr.Disengage(ctx, k.clusters[name]); err != nil { k.log.Error(err, "failed to disengage manager") } } // stop and forget k.lock.Lock() k.cancelFns[name]() // <- HERE delete(k.clusters, name) delete(k.cancelFns, name) k.lock.Unlock()

The ctx passed to Disengage is a different context, one that is currently unused but could be used when disengaging should be cancelable.

I'm open to input here though. I was wondering if the manager should even have a Disengage (and instead clean up its internal cluster list when the respective context gets canceled), but then the manager itself wouldn't be a cluster.Aware.

sttts · 2025-02-10T18:38:11Z

+	// This is an experimental feature and is subject to change.
+	EngageWithDefaultCluster *bool
+
+	// EngageWithProvidedClusters indicates whether the controller should engage


Suggested change

// EngageWithProvidedClusters indicates whether the controller should engage

// EngageWithProviderClusters indicates whether the controller should engage

embik · 2025-02-18T09:28:39Z

As per #2746 (comment), we're focussing on getting the design presented in this PR up and running in https://github.com/multicluster-runtime/multicluster-runtime. Feel free to follow along on the new project if you are interested.

k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 22, 2024

k8s-ci-robot requested review from inteon and joelanford November 22, 2024 12:18

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 22, 2024

embik force-pushed the embik-typed-cluster-support branch from 9584ab6 to 9b3f243 Compare November 22, 2024 12:44

k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Nov 22, 2024

embik commented Nov 22, 2024

View reviewed changes

Comment thread pkg/builder/controller.go Outdated

sprsquish reviewed Nov 25, 2024

View reviewed changes

embik force-pushed the embik-typed-cluster-support branch from 9b3f243 to f87db1f Compare December 4, 2024 10:58

embik mentioned this pull request Dec 4, 2024

📖 Add designs/multi-cluster.md #2746

Closed

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2024

embik force-pushed the embik-typed-cluster-support branch from f87db1f to 95a2c06 Compare December 9, 2024 11:20

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2024

embik force-pushed the embik-typed-cluster-support branch from 95a2c06 to 5c285f6 Compare December 9, 2024 15:56

sttts reviewed Jan 6, 2025

View reviewed changes

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2025

embik force-pushed the embik-typed-cluster-support branch from 83360c4 to 1dcba5e Compare January 8, 2025 08:18

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2025

embik force-pushed the embik-typed-cluster-support branch 2 times, most recently from b604946 to a362992 Compare January 8, 2025 09:39

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2025

embik and others added 6 commits January 20, 2025 08:54

Add cluster.Aware interface to support dynamic cluster engagement

d343d4d

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Add support for additional default indexers in cache

ef4760d

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Add experimental support for pluggable cluster providers to manager

6bfec22

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Add support to builder for cluster-aware controllers

c441fd9

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

embik and others added 2 commits January 20, 2025 08:54

Add fleet-namespace example for pseudo-multi-cluster support

01ff225

On-behalf-of: SAP marvin.beckers@sap.com Co-authored-by: Vince Prignano <vincepri@redhat.com> Co-authored-by: Dr. Stefan Schimanski <stefan.schimanski@gmail.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

Add reconcile.ClusterAwareRequest and matching EventHandlers

e93979e

On-behalf-of: SAP <marvin.beckers@sap.com> Signed-off-by: Marvin Beckers <marvin@kubermatic.com>

embik force-pushed the embik-typed-cluster-support branch from a362992 to e93979e Compare January 20, 2025 07:54

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 20, 2025

sttts mentioned this pull request Feb 3, 2025

✨ WIP: Cluster Provider and cluster-aware controllers #2726

Closed

varshaprasad96 reviewed Feb 7, 2025

View reviewed changes

sttts reviewed Feb 10, 2025

View reviewed changes

embik closed this Feb 18, 2025

-// Provider defines methods to retrieve clusters by name. The provider is
-// responsible for discovering and managing the lifecycle of each cluster.
-//
+// Provider allows to retrieve clusters by name, e.g. to reconcilers.
+// The provider is responsible for discovering and managing the lifecycle
+// of each cluster, calling `Engage` and `Disengage` on the manager
+// it is run against, whenever a new cluster is discovered or a cluster
+// is unregistered.

	case reflect.TypeOf(reconcile.Request{}):
	case reflect.TypeFor[reconcile.Request]():

	// EngageWithProvidedClusters indicates whether the controller should engage
	// EngageWithProviderClusters indicates whether the controller should engage

Conversation

embik commented Nov 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Jan 7, 2025

Uh oh!

k8s-ci-robot commented Jan 20, 2025

Uh oh!

varshaprasad96 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

varshaprasad96 Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

embik Feb 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

embik commented Feb 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

embik commented Nov 22, 2024 •

edited

Loading

varshaprasad96 Feb 7, 2025 •

edited

Loading

embik Feb 8, 2025 •

edited

Loading