Trainer registers cluster-wide ConfigMap and Secret watches, causing excessive memory consumption #3374

@sutaakar

What happened?

The MPI plugin's ReconcilerBuilders() in pkg/runtime/framework/plugins/mpi/mpi.go registers cluster-wide watches on ConfigMap and Secret with no
label selector or namespace filter. While EnqueueRequestForOwner filters reconcile events to owner-referenced objects, the underlying informer
cache still lists and watches all ConfigMaps and Secrets cluster-wide, causing excessive memory consumption on clusters with many ConfigMaps.

Reproducer:

  1. Install trainer v2.2.0 on KinD
  2. Create 700 large ConfigMaps (200KB each)
  3. Set a memory limit on the Trainer controller and observe the OOMKill

The controller is OOMKilled within seconds and restarts repeatedly. Without a memory limit, its memory consumption reaches 415Mi with 700 × 200KB ConfigMaps in a fresh KinD cluster.
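
Step 2 of the reproducer can be scripted. The following is a minimal sketch, not the exact script used for this report: it emits a `v1` `List` of filler ConfigMaps as JSON on stdout. The `filler-NNNN` names, the `default` namespace, and reading "200KB" as 200 × 1024 bytes are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// metadata and configMap model only the manifest fields the reproducer needs.
type metadata struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type configMap struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   metadata          `json:"metadata"`
	Data       map[string]string `json:"data"`
}

// list wraps the ConfigMaps in a v1 List so they can be created in one kubectl call.
type list struct {
	APIVersion string      `json:"apiVersion"`
	Kind       string      `json:"kind"`
	Items      []configMap `json:"items"`
}

// buildConfigMaps returns count ConfigMaps, each holding payloadBytes of filler data.
// The "filler-NNNN" naming scheme is hypothetical.
func buildConfigMaps(namespace string, count, payloadBytes int) []configMap {
	payload := strings.Repeat("x", payloadBytes)
	cms := make([]configMap, 0, count)
	for i := 0; i < count; i++ {
		cms = append(cms, configMap{
			APIVersion: "v1",
			Kind:       "ConfigMap",
			Metadata:   metadata{Name: fmt.Sprintf("filler-%04d", i), Namespace: namespace},
			Data:       map[string]string{"payload": payload},
		})
	}
	return cms
}

// totalPayloadBytes sums the length of every data value across all ConfigMaps.
func totalPayloadBytes(cms []configMap) int {
	total := 0
	for _, cm := range cms {
		for _, v := range cm.Data {
			total += len(v)
		}
	}
	return total
}

func main() {
	// Assumption: 700 ConfigMaps of 200 * 1024 bytes each, in the "default" namespace.
	cms := buildConfigMaps("default", 700, 200*1024)
	out := list{APIVersion: "v1", Kind: "List", Items: cms}
	if err := json.NewEncoder(os.Stdout).Encode(out); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Report the raw payload size on stderr so it does not pollute the manifest.
	fmt.Fprintf(os.Stderr, "raw payload: %.1f MiB\n", float64(totalPayloadBytes(cms))/(1<<20))
}
```

Saved as, say, `gen.go`, the output can be piped into the cluster with something like `go run gen.go | kubectl create -f -`. `kubectl create` is used here rather than `kubectl apply` so that the payload is not duplicated into the last-applied-configuration annotation. Under these assumptions the raw payload alone is roughly 137 MiB, which the informer cache has to hold even though none of these ConfigMaps is owned by a Trainer resource.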

What did you expect to happen?

The controller memory consumption should not grow with the number of unrelated ConfigMaps and Secrets in the cluster.

Environment

Kubernetes version: v1.32.2 (KinD)
Kubeflow Trainer version: v2.2.0

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.
