What happened?
The MPI plugin's ReconcilerBuilders() in pkg/runtime/framework/plugins/mpi/mpi.go registers cluster-wide watches on ConfigMap and Secret with no
label selector or namespace filter. EnqueueRequestForOwner limits reconcile events to owner-referenced objects, but the underlying informer
cache still lists and watches every ConfigMap and Secret in the cluster, so controller memory grows with the number and size of those objects, including ones unrelated to Trainer.
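The distinction between filtering reconcile events and filtering what the cache stores can be illustrated with a dependency-free toy model. This is only a sketch of the behaviour described above, not controller-runtime code: the `object` and `toyInformer` types, the `onAdd` method, and the 200×1024-byte payload size are all hypothetical names and assumptions chosen for illustration.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a ConfigMap or Secret.
type object struct {
	name     string
	ownerUID string // empty when the object has no owner reference
	data     []byte
}

// toyInformer models the two stages described in the report:
// stage 1 stores every listed object, stage 2 filters reconcile events.
type toyInformer struct {
	store    map[string]object
	enqueued []string
}

func newToyInformer() *toyInformer {
	return &toyInformer{store: map[string]object{}}
}

// onAdd is called for every object the watch delivers.
func (i *toyInformer) onAdd(o object) {
	// Stage 1: the cache keeps the full object, owned or not.
	i.store[o.name] = o
	// Stage 2: the owner filter only decides whether to enqueue a request.
	if o.ownerUID != "" {
		i.enqueued = append(i.enqueued, o.ownerUID)
	}
}

// cachedBytes sums the payload bytes held by the cache.
func (i *toyInformer) cachedBytes() int {
	total := 0
	for _, o := range i.store {
		total += len(o.data)
	}
	return total
}

func main() {
	inf := newToyInformer()
	// 700 unrelated objects, assuming 200KB means 200*1024 bytes.
	for n := 0; n < 700; n++ {
		inf.onAdd(object{name: fmt.Sprintf("unrelated-%d", n), data: make([]byte, 200*1024)})
	}
	// One small object that carries an owner reference.
	inf.onAdd(object{name: "owned", ownerUID: "trainjob-1", data: make([]byte, 1024)})

	fmt.Printf("cached objects: %d, cached bytes: %d, reconcile requests: %d\n",
		len(inf.store), inf.cachedBytes(), len(inf.enqueued))
}
```

In this model only one reconcile request is produced, yet all 701 payloads stay resident, which mirrors why the owner filter alone does not bound memory.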
Reproducer:
- Install trainer v2.2.0 on KinD
- Create 700 large ConfigMaps (200KB each)
- Set a memory limit on the Trainer operator and observe the OOMKill
The controller OOMKills within seconds and restarts repeatedly. Without a memory limit, memory consumption reaches 415Mi with 700×200KB ConfigMaps in a fresh KinD cluster.
What did you expect to happen?
The controller memory consumption should not grow with the number of unrelated ConfigMaps and Secrets in the cluster.
Environment
Kubernetes version: v1.32.2 (KinD)
Kubeflow Trainer version: v2.2.0
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.