
fix(operator): Scope MPI ConfigMap and Secret watches to owned objects#3377

Open
beep-boopp wants to merge 1 commit into kubeflow:master from beep-boopp:master

Conversation

@beep-boopp

What this PR does / why we need it:
Currently, the MPI plugin registers cluster-wide watches on ConfigMap and Secret using full object structs. While EnqueueRequestForOwner correctly filters reconcile events, the underlying controller-runtime informer cache still loads the full Data payloads of every ConfigMap in the cluster into memory, causing the operator to be OOM-killed on large clusters.

This PR scopes the informer cache and watches to MPI-owned objects only, and restricts the cache to storing metadata:

  • Label Stamping: Introduces constants.LabelMPIJobName (trainer.kubeflow.org/trainjob-name) and updates buildHostFileConfigMap and buildSSHAuthSecret to apply this label at creation time.
  • Metadata Watches & Predicates: Swaps b.Watches with b.WatchesMetadata in ReconcilerBuilders() and adds a label-selector predicate (builder.WithPredicates). The cache now completely bypasses massive Data payloads and only indexes MPI-owned objects.
  • Optimized Get Calls: Updates the m.client.Get call for SSH secrets to use metav1.PartialObjectMetadata instead of corev1.Secret to prevent full-object retrieval into memory.
  • Testing: make test passes locally (the pre-existing go: no such tool "covdata" errors also occur on master and are unrelated to this change). The ConfigMapWrapper and SecretWrapper testing utilities and the relevant MPI unit tests were updated to expect the new labels.
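The label-stamping idea above can be sketched in plain Go. This is a toy illustration, not the actual buildHostFileConfigMap/buildSSHAuthSecret code; only the label key matches the constant described in the PR.

```go
package main

import "fmt"

// labelMPIJobName mirrors constants.LabelMPIJobName from the PR description.
const labelMPIJobName = "trainer.kubeflow.org/trainjob-name"

// stampJobLabel returns a copy of an object's labels with the ownership
// label applied, so a label-based watch predicate can match it later.
func stampJobLabel(existing map[string]string, trainJobName string) map[string]string {
	out := make(map[string]string, len(existing)+1)
	for k, v := range existing {
		out[k] = v
	}
	out[labelMPIJobName] = trainJobName
	return out
}

func main() {
	labels := stampJobLabel(map[string]string{"app": "mpi"}, "my-trainjob")
	fmt.Println(labels[labelMPIJobName]) // prints my-trainjob
}
```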

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3374

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 23, 2026 13:33
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses operator OOM risk in large clusters by preventing controller-runtime’s cache from loading full ConfigMap/Secret payloads cluster-wide for the MPI plugin, switching those watches to metadata-only and scoping reconcile triggers to MPI-owned objects via labeling.

Changes:

  • Add an MPI ownership label (constants.LabelMPIJobName) and stamp it onto the MPI hostfile ConfigMap and SSH auth Secret at creation time.
  • Switch MPI plugin watches for ConfigMap/Secret from Watches to WatchesMetadata and filter events via a label-based predicate.
  • Optimize the SSH secret existence check to use metav1.PartialObjectMetadata (metadata-only Get) and update related test utilities/expectations.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • pkg/constants/constants.go: Introduces LabelMPIJobName used to identify MPI-owned resources.
  • pkg/runtime/framework/plugins/mpi/mpi.go: Uses metadata-only watches with a label predicate; stamps labels on created resources; uses PartialObjectMetadata for the secret Get.
  • pkg/runtime/framework/plugins/mpi/mpi_test.go: Updates MPI unit tests to expect the new label and adjusts the mocked Get interception type.
  • pkg/util/testing/wrapper.go: Adds WithLabels helpers for ConfigMapWrapper/SecretWrapper to support updated expectations.
  • pkg/runtime/framework/core/framework_test.go: Updates framework tests to expect labeled MPI resources.
  • pkg/runtime/core/trainingruntime_test.go: Updates runtime tests to expect labeled MPI resources.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pkg/runtime/framework/plugins/mpi/mpi.go:269

  • If the SSH auth Secret already exists, this code never stamps the new LabelMPIJobName onto it (it only creates on NotFound), which means the Secret watch predicate will keep ignoring that Secret’s events; consider patching the label onto the existing Secret’s metadata (or widening the predicate to accept TrainJob-owned objects).
	// SSHAuthSecret is immutable.
	partialSecret := &metav1.PartialObjectMetadata{}
	partialSecret.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Secret"))
	if err := m.client.Get(ctx, client.ObjectKey{Name: sshAuthSecretName(trainJob.Name), Namespace: trainJob.Namespace}, partialSecret); err != nil {
		if client.IgnoreNotFound(err) != nil {
			return nil, err
		}
		secret, err := m.buildSSHAuthSecret(trainJob)
		if err != nil {
			return nil, fmt.Errorf("failed to build SSH Auth secret: %w", err)
		}
		objects = append(objects, secret)
	}
	return append(objects, m.buildHostFileConfigMap(info, trainJob)), nil

Comment on lines +223 to +243
			return b.WatchesMetadata(
				&corev1.ConfigMap{},
				handler.EnqueueRequestForOwner(
					m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(),
				),
				builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool {
					_, ok := obj.GetLabels()[constants.LabelMPIJobName]
					return ok
				})),
			)
		},
		func(b *builder.Builder, cl client.Client, cache cache.Cache) *builder.Builder {
			return b.WatchesMetadata(
				&corev1.Secret{},
				handler.EnqueueRequestForOwner(
					m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(),
				),
				builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool {
					_, ok := obj.GetLabels()[constants.LabelMPIJobName]
					return ok
				})),

Copilot AI Mar 23, 2026


The label-only predicate will drop events for ConfigMaps/Secrets that are still TrainJob-owned but were created before this label existed, so deletes/updates of those owned objects won’t enqueue a reconcile after an upgrade.

Contributor

@astefanutti astefanutti left a comment


Thanks @beep-boopp, I think caching should be configured in the controller-runtime manager cache options:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), options)

@beep-boopp
Author

Thanks @beep-boopp, I think caching should be configured in the controller-runtime manager cache options:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), options)

Thanks, I looked into configuring this via cache.Options.
I’m planning to scope the cache using ByObject with a label selector for MPI-owned ConfigMaps and Secrets (in main.go), instead of filtering at the controller level.
Just to confirm — is it acceptable to scope the cache globally this way, given that currently only the MPI plugin watches these resources?

@astefanutti
Contributor

Just to confirm — is it acceptable to scope the cache globally this way, given that currently only the MPI plugin watches these resources?

Yes I think it is acceptable and actually recommended to make sure the issue is fixed holistically.

We "just" need to make sure it does not impact anything.

@beep-boopp
Author

Just to confirm — is it acceptable to scope the cache globally this way, given that currently only the MPI plugin watches these resources?

Yes I think it is acceptable and actually recommended to make sure the issue is fixed holistically.

We "just" need to make sure it does not impact anything.

Cool, thanks! I’ll work on moving this to the manager cache configuration and make sure it doesn’t impact existing behavior.

@beep-boopp
Author

Hey @astefanutti I have updated the implementation to configure cache scoping via options.Cache.ByObject in main.go as suggested, and removed the controller-level predicate filter. Also removed the predicate import from mpi.go since it's no longer needed. Would appreciate another review when you get a chance!

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

pkg/runtime/framework/plugins/mpi/mpi.go:258

  • With the manager cache filtered to only Secrets carrying LabelMPIJobName, this cached Get can return NotFound for pre-existing (unlabeled) SSH auth Secrets created by older operator versions, causing the plugin to generate a new immutable Secret and potentially fail with AlreadyExists/immutable update errors. Use an uncached reader (APIReader) for the existence check and/or add a label-only patch path for existing Secrets so upgrades don’t break.
	// SSHAuthSecret is immutable.
	partialSecret := &metav1.PartialObjectMetadata{}
	partialSecret.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Secret"))
	if err := m.client.Get(ctx, client.ObjectKey{Name: sshAuthSecretName(trainJob.Name), Namespace: trainJob.Namespace}, partialSecret); err != nil {
		if client.IgnoreNotFound(err) != nil {
			return nil, err
		}
		secret, err := m.buildSSHAuthSecret(trainJob)
		if err != nil {
			return nil, fmt.Errorf("failed to build SSH Auth secret: %w", err)
		}
		objects = append(objects, secret)
	}

pkg/runtime/framework/plugins/mpi/mpi.go:235

  • The PR description mentions adding a label-selector predicate (builder.WithPredicates) to scope Secret/ConfigMap watches, but the code here only switches to WatchesMetadata and relies on external cache configuration; either add the predicate as described or update the description to match the implementation.
func (m *MPI) ReconcilerBuilders() []runtime.ReconcilerBuilder {
	return []runtime.ReconcilerBuilder{
		func(b *builder.Builder, cl client.Client, cache cache.Cache) *builder.Builder {
			return b.WatchesMetadata(
				&corev1.ConfigMap{},
				handler.EnqueueRequestForOwner(
					m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(),
				))
		},
		func(b *builder.Builder, cl client.Client, cache cache.Cache) *builder.Builder {
			return b.WatchesMetadata(
				&corev1.Secret{},
				handler.EnqueueRequestForOwner(
					m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(),
				),
			)
		},

Comment on lines +127 to +140
	// Scope the informer cache to only MPI-owned ConfigMaps and Secrets
	// to prevent excessive memory consumption on clusters with many unrelated objects.
	req, err := labels.NewRequirement(constants.LabelMPIJobName, selection.Exists, nil)
	if err != nil {
		setupLog.Error(err, "unable to create MPI label requirement")
		os.Exit(1)
	}
	mpiSelector := labels.NewSelector().Add(*req)
	if options.Cache.ByObject == nil {
		options.Cache.ByObject = make(map[client.Object]cache.ByObject)
	}
	options.Cache.ByObject[&corev1.ConfigMap{}] = cache.ByObject{LabelSelector: mpiSelector}
	options.Cache.ByObject[&corev1.Secret{}] = cache.ByObject{LabelSelector: mpiSelector}


Copilot AI Mar 27, 2026


Setting options.Cache.ByObject for corev1.ConfigMap{} and corev1.Secret{} globally will hide all ConfigMaps/Secrets that don’t have constants.LabelMPIJobName, which breaks non-MPI components that read/watch Secrets (e.g., webhook cert rotation and the TrainJobStatus plugin’s lookup of the webhook TLS Secret). Scope this optimization to the MPI controller only (or remove the global selector and rely on metadata-only watches) so other controllers can still access their Secrets/ConfigMaps.


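To make the failure mode in that comment concrete, here is a toy model in plain Go (not controller-runtime) of what a label-scoped cache does: objects without the selector label never enter the store, so a cached Get on them behaves like NotFound. The webhook cert secret name is taken from the discussion; everything else is illustrative.

```go
package main

import "fmt"

const jobLabel = "trainer.kubeflow.org/trainjob-name"

type obj struct {
	name   string
	labels map[string]string
}

// buildScopedCache models an informer cache configured with a label
// selector: objects missing the label are never listed or watched,
// so they never reach the store at all.
func buildScopedCache(objs []obj) map[string]obj {
	store := make(map[string]obj)
	for _, o := range objs {
		if _, ok := o.labels[jobLabel]; ok {
			store[o.name] = o
		}
	}
	return store
}

func main() {
	cache := buildScopedCache([]obj{
		{name: "mpi-hostfile", labels: map[string]string{jobLabel: "job-1"}},
		{name: "kubeflow-trainer-webhook-cert"}, // no job label
	})
	_, found := cache["kubeflow-trainer-webhook-cert"]
	fmt.Println(found) // prints false: cached reads cannot see the cert secret
}
```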
LabelSupport string = "trainer.kubeflow.org/support"

// LabelMPIJobName is the label to identify MPI-owned ConfigMap and Secret resources.
LabelMPIJobName string = "trainer.kubeflow.org/trainjob-name"
Contributor


Should the label key just be LabelJobName? I believe this shouldn't be specific to MPI.

Signed-off-by: Prajwal <percy38621@gmail.com>
@beep-boopp
Author

Hey @astefanutti and @akshaychitneni — addressing both threads.

Label rename: Done — renamed to LabelJobName.

Cache scoping: I implemented the manager-level ByObject approach and found it impacts other plugins:

  • trainjobstatus reads the webhook TLS secret (kubeflow-trainer-webhook-cert) via client.Get — this secret doesn't carry the job label and becomes invisible with global cache scoping
  • flux plugin registers watches on ConfigMap/Secret for its own resources — same issue

Given this, I scoped the optimization to the MPI controller level using WatchesMetadata + a label predicate instead. This keeps the memory improvement (metadata-only cache + filtering to labeled objects) without affecting other plugins.

Happy to revisit if there's a preferred way to scope this at the manager level without impacting other controllers.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

pkg/runtime/framework/plugins/mpi/mpi.go:268

  • Because the SSH auth Secret is only created when it’s missing, existing Secrets from older versions will never get the new LabelJobName stamped, which means the new watch predicate will permanently ignore events for them (e.g., a delete won’t trigger reconciliation). Consider patching the label onto existing Secrets (metadata-only) when they’re found, or relaxing the predicate to also accept TrainJob ownerReferences.
	// SSHAuthSecret is immutable.
	partialSecret := &metav1.PartialObjectMetadata{}
	partialSecret.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Secret"))
	if err := m.client.Get(ctx, client.ObjectKey{Name: sshAuthSecretName(trainJob.Name), Namespace: trainJob.Namespace}, partialSecret); err != nil {
		if client.IgnoreNotFound(err) != nil {
			return nil, err
		}
		secret, err := m.buildSSHAuthSecret(trainJob)
		if err != nil {
			return nil, fmt.Errorf("failed to build SSH Auth secret: %w", err)
		}
		objects = append(objects, secret)
	}

Comment on lines +228 to +231
				builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool {
					_, ok := obj.GetLabels()[constants.LabelJobName]
					return ok
				})),

Copilot AI Mar 29, 2026


The new label-based predicate will drop events for existing MPI-owned Secrets/ConfigMaps created before this change (they have owner refs but won’t have the new label), so deletions/updates of those objects may no longer trigger TrainJob reconciliation after an upgrade.

Comment on lines +223 to 232
			return b.WatchesMetadata(
				&corev1.ConfigMap{},
				handler.EnqueueRequestForOwner(
					m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(),
				),
				builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool {
					_, ok := obj.GetLabels()[constants.LabelJobName]
					return ok
				})),
			)

Copilot AI Mar 29, 2026


builder.WithPredicates only filters which events enqueue reconciles; it does not apply a label selector to the underlying informer list/watch, so the cache will still store metadata for all Secrets/ConfigMaps cluster-wide (just without Data). If the intent is to scope the cache itself to MPI-owned objects, this needs a cache-level selector (e.g., manager cache ByObject/label selector) rather than a predicate alone.
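The distinction this comment draws can be modeled in a few lines of plain Go (illustrative only, not controller-runtime): every watched object still enters the cache, and the predicate only gates which events are enqueued for reconciliation.

```go
package main

import "fmt"

// cacheAndEnqueue models a metadata watch with an event predicate:
// the informer caches every object it lists, while the predicate
// only decides which events trigger a reconcile.
func cacheAndEnqueue(objLabels []map[string]string, pred func(map[string]string) bool) (cached, enqueued int) {
	for _, l := range objLabels {
		cached++ // every object is stored regardless of the predicate
		if pred(l) {
			enqueued++ // the predicate filters reconcile events only
		}
	}
	return
}

func main() {
	hasJobLabel := func(l map[string]string) bool {
		_, ok := l["trainer.kubeflow.org/trainjob-name"]
		return ok
	}
	cached, enqueued := cacheAndEnqueue([]map[string]string{
		{"trainer.kubeflow.org/trainjob-name": "job-1"},
		{"app": "unrelated"},
		{},
	}, hasJobLabel)
	fmt.Println(cached, enqueued) // prints 3 1: all cached, one enqueued
}
```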


Successfully merging this pull request may close these issues.

Trainer registers cluster-wide ConfigMap and Secret causing excessive memory consumption

4 participants