fix(operator): Scope MPI ConfigMap and Secret watches to owned objects#3377
fix(operator): Scope MPI ConfigMap and Secret watches to owned objects#3377beep-boopp wants to merge 1 commit intokubeflow:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
This PR addresses operator OOM risk in large clusters by preventing controller-runtime’s cache from loading full ConfigMap/Secret payloads cluster-wide for the MPI plugin, switching those watches to metadata-only and scoping reconcile triggers to MPI-owned objects via labeling.
Changes:
- Add an MPI ownership label (
constants.LabelMPIJobName) and stamp it onto the MPI hostfileConfigMapand SSH authSecretat creation time. - Switch MPI plugin watches for
ConfigMap/SecretfromWatchestoWatchesMetadataand filter events via a label-based predicate. - Optimize the SSH secret existence check to use
metav1.PartialObjectMetadata(metadata-onlyGet) and update related test utilities/expectations.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/constants/constants.go | Introduces LabelMPIJobName used to identify MPI-owned resources. |
| pkg/runtime/framework/plugins/mpi/mpi.go | Uses metadata-only watches with label predicate; stamps labels on created resources; uses PartialObjectMetadata for secret Get. |
| pkg/runtime/framework/plugins/mpi/mpi_test.go | Updates MPI unit tests to expect the new label and adjusts the mocked Get interception type. |
| pkg/util/testing/wrapper.go | Adds WithLabels helpers for ConfigMapWrapper/SecretWrapper to support updated expectations. |
| pkg/runtime/framework/core/framework_test.go | Updates framework tests to expect labeled MPI resources. |
| pkg/runtime/core/trainingruntime_test.go | Updates runtime tests to expect labeled MPI resources. |
4f92bd2 to
7c278d1
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
pkg/runtime/framework/plugins/mpi/mpi.go:269
- If the SSH auth Secret already exists, this code never stamps the new LabelMPIJobName onto it (it only creates on NotFound), which means the Secret watch predicate will keep ignoring that Secret’s events; consider patching the label onto the existing Secret’s metadata (or widening the predicate to accept TrainJob-owned objects).
// SSHAuthSecret is immutable.
partialSecret := &metav1.PartialObjectMetadata{}
partialSecret.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Secret"))
if err := m.client.Get(ctx, client.ObjectKey{Name: sshAuthSecretName(trainJob.Name), Namespace: trainJob.Namespace}, partialSecret); err != nil {
if client.IgnoreNotFound(err) != nil {
return nil, err
}
secret, err := m.buildSSHAuthSecret(trainJob)
if err != nil {
return nil, fmt.Errorf("failed to build SSH Auth secret: %w", err)
}
objects = append(objects, secret)
}
return append(objects, m.buildHostFileConfigMap(info, trainJob)), nil
| return b.WatchesMetadata( | ||
| &corev1.ConfigMap{}, | ||
| handler.EnqueueRequestForOwner( | ||
| m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(), | ||
| ), | ||
| builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool { | ||
| _, ok := obj.GetLabels()[constants.LabelMPIJobName] | ||
| return ok | ||
| })), | ||
| ) | ||
| }, | ||
| func(b *builder.Builder, cl client.Client, cache cache.Cache) *builder.Builder { | ||
| return b.Watches( | ||
| return b.WatchesMetadata( | ||
| &corev1.Secret{}, | ||
| handler.EnqueueRequestForOwner( | ||
| m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(), | ||
| ), | ||
| builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool { | ||
| _, ok := obj.GetLabels()[constants.LabelMPIJobName] | ||
| return ok | ||
| })), |
There was a problem hiding this comment.
The label-only predicate will drop events for ConfigMaps/Secrets that are still TrainJob-owned but were created before this label existed, so deletes/updates of those owned objects won’t enqueue a reconcile after an upgrade.
astefanutti
left a comment
There was a problem hiding this comment.
Thanks @beep-boopp, I think caching should be configured in the controller-runtime manager cache options:
Thanks, I looked into configuring this via |
Yes I think it is acceptable and actually recommended to make sure the issue is fixed holistically. We "just" need to make sure it does not impact anything. |
Cool, thanks! I’ll work on moving this to the manager cache configuration and make sure it doesn’t impact existing behavior. |
|
Hey @astefanutti I have updated the implementation to configure cache scoping via options.Cache.ByObject in main.go as suggested, and removed the controller-level predicate filter. Also removed the predicate import from mpi.go since it's no longer needed. Would appreciate another review when you get a chance! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
pkg/runtime/framework/plugins/mpi/mpi.go:258
- With the manager cache filtered to only Secrets carrying
LabelMPIJobName, this cachedGetcan return NotFound for pre-existing (unlabeled) SSH auth Secrets created by older operator versions, causing the plugin to generate a new immutable Secret and potentially fail with AlreadyExists/immutable update errors. Use an uncached reader (APIReader) for the existence check and/or add a label-only patch path for existing Secrets so upgrades don’t break.
// SSHAuthSecret is immutable.
partialSecret := &metav1.PartialObjectMetadata{}
partialSecret.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Secret"))
if err := m.client.Get(ctx, client.ObjectKey{Name: sshAuthSecretName(trainJob.Name), Namespace: trainJob.Namespace}, partialSecret); err != nil {
if client.IgnoreNotFound(err) != nil {
return nil, err
}
secret, err := m.buildSSHAuthSecret(trainJob)
if err != nil {
return nil, fmt.Errorf("failed to build SSH Auth secret: %w", err)
}
objects = append(objects, secret)
}
pkg/runtime/framework/plugins/mpi/mpi.go:235
- The PR description mentions adding a label-selector predicate (
builder.WithPredicates) to scope Secret/ConfigMap watches, but the code here only switches toWatchesMetadataand relies on external cache configuration; either add the predicate as described or update the description to match the implementation.
func (m *MPI) ReconcilerBuilders() []runtime.ReconcilerBuilder {
return []runtime.ReconcilerBuilder{
func(b *builder.Builder, cl client.Client, cache cache.Cache) *builder.Builder {
return b.WatchesMetadata(
&corev1.ConfigMap{},
handler.EnqueueRequestForOwner(
m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(),
))
},
func(b *builder.Builder, cl client.Client, cache cache.Cache) *builder.Builder {
return b.WatchesMetadata(
&corev1.Secret{},
handler.EnqueueRequestForOwner(
m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(),
),
)
},
| // Scope the informer cache to only MPI-owned ConfigMaps and Secrets | ||
| // to prevent excessive memory consumption on clusters with many unrelated objects. | ||
| req, err := labels.NewRequirement(constants.LabelMPIJobName, selection.Exists, nil) | ||
| if err != nil { | ||
| setupLog.Error(err, "unable to create MPI label requirement") | ||
| os.Exit(1) | ||
| } | ||
| mpiSelector := labels.NewSelector().Add(*req) | ||
| if options.Cache.ByObject == nil { | ||
| options.Cache.ByObject = make(map[client.Object]cache.ByObject) | ||
| } | ||
| options.Cache.ByObject[&corev1.ConfigMap{}] = cache.ByObject{LabelSelector: mpiSelector} | ||
| options.Cache.ByObject[&corev1.Secret{}] = cache.ByObject{LabelSelector: mpiSelector} | ||
|
|
There was a problem hiding this comment.
Setting options.Cache.ByObject for corev1.ConfigMap{} and corev1.Secret{} globally will hide all ConfigMaps/Secrets that don’t have constants.LabelMPIJobName, which breaks non-MPI components that read/watch Secrets (e.g., webhook cert rotation and the TrainJobStatus plugin’s lookup of the webhook TLS Secret). Scope this optimization to the MPI controller only (or remove the global selector and rely on metadata-only watches) so other controllers can still access their Secrets/ConfigMaps.
| // Scope the informer cache to only MPI-owned ConfigMaps and Secrets | |
| // to prevent excessive memory consumption on clusters with many unrelated objects. | |
| req, err := labels.NewRequirement(constants.LabelMPIJobName, selection.Exists, nil) | |
| if err != nil { | |
| setupLog.Error(err, "unable to create MPI label requirement") | |
| os.Exit(1) | |
| } | |
| mpiSelector := labels.NewSelector().Add(*req) | |
| if options.Cache.ByObject == nil { | |
| options.Cache.ByObject = make(map[client.Object]cache.ByObject) | |
| } | |
| options.Cache.ByObject[&corev1.ConfigMap{}] = cache.ByObject{LabelSelector: mpiSelector} | |
| options.Cache.ByObject[&corev1.Secret{}] = cache.ByObject{LabelSelector: mpiSelector} |
pkg/constants/constants.go
Outdated
| LabelSupport string = "trainer.kubeflow.org/support" | ||
|
|
||
| // LabelMPIJobName is the label to identify MPI-owned ConfigMap and Secret resources. | ||
| LabelMPIJobName string = "trainer.kubeflow.org/trainjob-name" |
There was a problem hiding this comment.
Should the label key be just be LabelJobName? I believe this shouldnt be specific to mpi
Signed-off-by: Prajwal <percy38621@gmail.com>
|
Hey @astefanutti and @akshaychitneni — addressing both threads.
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
pkg/runtime/framework/plugins/mpi/mpi.go:268
- Because the SSH auth Secret is only created when it’s missing, existing Secrets from older versions will never get the new LabelJobName stamped, which means the new watch predicate will permanently ignore events for them (e.g., a delete won’t trigger reconciliation). Consider patching the label onto existing Secrets (metadata-only) when they’re found, or relaxing the predicate to also accept TrainJob ownerReferences.
// SSHAuthSecret is immutable.
partialSecret := &metav1.PartialObjectMetadata{}
partialSecret.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Secret"))
if err := m.client.Get(ctx, client.ObjectKey{Name: sshAuthSecretName(trainJob.Name), Namespace: trainJob.Namespace}, partialSecret); err != nil {
if client.IgnoreNotFound(err) != nil {
return nil, err
}
secret, err := m.buildSSHAuthSecret(trainJob)
if err != nil {
return nil, fmt.Errorf("failed to build SSH Auth secret: %w", err)
}
objects = append(objects, secret)
}
| builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool { | ||
| _, ok := obj.GetLabels()[constants.LabelJobName] | ||
| return ok | ||
| })), |
There was a problem hiding this comment.
The new label-based predicate will drop events for existing MPI-owned Secrets/ConfigMaps created before this change (they have owner refs but won’t have the new label), so deletions/updates of those objects may no longer trigger TrainJob reconciliation after an upgrade.
| return b.WatchesMetadata( | ||
| &corev1.ConfigMap{}, | ||
| handler.EnqueueRequestForOwner( | ||
| m.client.Scheme(), m.client.RESTMapper(), &trainer.TrainJob{}, handler.OnlyControllerOwner(), | ||
| ), | ||
| builder.WithPredicates(predicate.NewPredicateFuncs(func(obj client.Object) bool { | ||
| _, ok := obj.GetLabels()[constants.LabelJobName] | ||
| return ok | ||
| })), | ||
| ) |
There was a problem hiding this comment.
builder.WithPredicates only filters which events enqueue reconciles; it does not apply a label selector to the underlying informer list/watch, so the cache will still store metadata for all Secrets/ConfigMaps cluster-wide (just without Data). If the intent is to scope the cache itself to MPI-owned objects, this needs a cache-level selector (e.g., manager cache ByObject/label selector) rather than a predicate alone.
What this PR does / why we need it:
Currently, the MPI plugin registers cluster-wide watches on
ConfigMapandSecretusing full object structs. WhileEnqueueRequestForOwnercorrectly filters reconcile events, the underlyingcontroller-runtimeInformer cache blindly loads the fullDatapayloads of all cluster ConfigMaps into memory, causing the operator to OOM kill on massive clusters.This PR scopes the Informer cache and watches to only care about MPI-owned objects, and restricts the cache to only store metadata:
constants.LabelMPIJobName(trainer.kubeflow.org/trainjob-name) and updatesbuildHostFileConfigMapandbuildSSHAuthSecretto apply this label at creation time.b.Watcheswithb.WatchesMetadatainReconcilerBuilders()and adds a label-selector predicate (builder.WithPredicates). The cache now completely bypasses massiveDatapayloads and only indexes MPI-owned objects.GetCalls: Updates them.client.Getcall for SSH secrets to usemetav1.PartialObjectMetadatainstead ofcorev1.Secretto prevent full-object retrieval into memory.make testpasses locally (pre-existinggo: no such tool "covdata"errors exist on master unrelated to this change).ConfigMapWrapperandSecretWrappertesting utilities and relevantmpiunit tests were updated to expect the new labels.Which issue(s) this PR fixes (optional, in
Fixes #<issue number>,#<issue number>, ... format, will close the issue(s) when PR gets merged):Fixes #3374
Checklist: