feat: customize R generation to include GPUs #3325
google-oss-prow[bot] merged 3 commits into kubeflow:master from
Conversation
Pull request overview
Updates the Flux runtime plugin to generate a Flux resource file (flux R encode) based on derived per-node CPU/GPU topology and introduces a new in-memory EmptyDir volume intended for shared memory.
Changes:
- Parameterize `templates/entrypoint.sh` so `flux R encode` can receive a computed resource spec (cores/GPU ranges).
- Refactor Flux entrypoint generation to build explicit Flux flags (`-N`/`-n` and optional `-g`) and derive an `Rspec` string for resource encoding.
- Add a new `shared-memory` volume constant and include a memory-backed EmptyDir volume in the Flux view volumes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/runtime/framework/plugins/flux/templates/entrypoint.sh | Accepts a formatted resource spec argument for flux R encode. |
| pkg/runtime/framework/plugins/flux/flux.go | Builds Rspec/flags for Flux execution and adds a memory-backed EmptyDir volume to the pod spec. |
| pkg/constants/constants.go | Introduces FluxMemoryVolumeName constant for the new shared-memory volume. |
```go
memoryVolumeAC := corev1ac.Volume().
	WithName(constants.FluxMemoryVolumeName).
	WithEmptyDir(corev1ac.EmptyDirVolumeSource().
		WithMedium(corev1.StorageMediumMemory))
fluxVolumeAC := corev1ac.Volume().
```
```go
} else {
	tasks = fmt.Sprintf("-N %d -n %d", nodes, *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode*nodes)
	tasks = *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode
}
```
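The flag construction above can be sketched as a small standalone function. This is an illustrative sketch, not the plugin's actual helper: the function name and the `gpusPerNode` parameter are assumptions, and `-g` is shown as the optional per-task GPU flag mentioned in the review overview.

```go
package main

import "fmt"

// buildFluxFlags derives explicit Flux submission flags from per-node
// counts: -N for nodes, -n for total tasks (procs per node times nodes),
// and an optional -g when GPUs are requested. Hypothetical helper for
// illustration only.
func buildFluxFlags(nodes, numProcPerNode, gpusPerNode int32) string {
	flags := fmt.Sprintf("-N %d -n %d", nodes, numProcPerNode*nodes)
	if gpusPerNode > 0 {
		flags += fmt.Sprintf(" -g %d", gpusPerNode)
	}
	return flags
}

func main() {
	// 4 nodes, 8 procs per node, 2 GPUs per task.
	fmt.Println(buildFluxFlags(4, 8, 2))
}
```

With 4 nodes and 8 procs per node this yields `-N 4 -n 32 -g 2`, matching the `-N`/`-n` pattern in the diff above.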
I think we should enforce kubebuilder validation for numProcPerNode >= 1 here: https://github.com/converged-computing/trainer/blob/764a765a8f6a462bc84234a3f84270fe720acf09/pkg/apis/trainer/v1alpha1/trainjob_types.go#L266
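For context, such a kubebuilder validation is expressed as a marker comment on the API field. The sketch below is hypothetical: the struct name and field shape are stand-ins for the real type in `pkg/apis/trainer/v1alpha1/trainjob_types.go`, shown only to illustrate where a `Minimum=1` marker would go.

```go
package main

import "fmt"

// FluxMLPolicySource is a hypothetical stand-in for the real API type.
type FluxMLPolicySource struct {
	// NumProcPerNode is the number of processes per node.
	// A marker like the following would make the API server reject
	// values below 1 at admission time:
	// +kubebuilder:validation:Minimum=1
	// +optional
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`
}

func main() {
	n := int32(4)
	p := FluxMLPolicySource{NumProcPerNode: &n}
	fmt.Println(*p.NumProcPerNode)
}
```

The marker is consumed by controller-gen when generating the CRD schema; the Go code itself carries no runtime check.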
@vsoch Do you want to add validation for numProcPerNode in the followup PR?
I think I had it before (in the first PR) but was asked to remove it because it’s part of the API as an annotation?
What we do need to think about is a case that we have for the Flux Operator (that may not be dealt with well here). Often there is a desire to specify the number of nodes but just say "discover and use all of the cores that are found", and then you would do something like:

flux submit -N 16 --exclusive

Note that I don't have any -n specified. For the Flux Operator, I allowed this case when the user set tasks to 0. Is there any way we can support something similar? Is the only way some special env var that flags and triggers the condition (regardless of what the -n is), since the validation of >= 1 is in the spec?
> I think I had it before (in the first PR) but was asked to remove it because it’s part of the API as an annotation?
Please open a dedicated PR to introduce this kubebuilder validation; we need to ensure that numProcPerNode >= 1.
> Note that I don't have any -n specified. For the Flux Operator, I allowed this case when the user set tasks to 0. Is there any way we can support something similar?
If Flux can dynamically discover all available devices on the node, we can rely on this functionality when numProcPerNode is omitted.
Excellent! I did not know that was an option. Thank you!
```go
// Path for Flux curve path
FluxCurveVolumePath = "/curve"

// Ensure MPI has full memory of the host
```
```diff
 # Generate host resources
 hosts=$(cat ${configroot}/etc/flux/system/hostlist)
-flux R encode --hosts=${hosts} --local > /tmp/R
+flux R encode --hosts=${hosts} %s > /tmp/R
```
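The `%s` placeholder above is filled in by the Go plugin with the computed resource spec. A minimal sketch of how that string might be built follows; the function name is hypothetical, and `--cores`/`--gpus` are assumed to be the `flux R encode` ID-range options the spec maps onto.

```go
package main

import "fmt"

// buildRspec formats a resource spec fragment for `flux R encode`,
// assuming zero-based core and GPU ID ranges derived from per-node
// counts. Illustrative sketch, not the plugin's actual code.
func buildRspec(coresPerNode, gpusPerNode int) string {
	spec := fmt.Sprintf("--cores=0-%d", coresPerNode-1)
	if gpusPerNode > 0 {
		spec += fmt.Sprintf(" --gpus=0-%d", gpusPerNode-1)
	}
	return spec
}

func main() {
	// A node with 16 cores and 4 GPUs.
	fmt.Println(buildRspec(16, 4))
}
```

Substituted into the template, this would produce a command like `flux R encode --hosts=${hosts} --cores=0-15 --gpus=0-3 > /tmp/R`, sidestepping hwloc-based GPU detection as the PR description explains.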
@andreyvelich I'm going for a quick run and will test with GPU when I am back! Would you like an example added to the examples/flux directory that uses GPUs? My plan is to test on AWS, LAMMPS with GPU.

Sure, let's add another example in the Flux subdirectory.
Flux detection of GPUs depends on hwloc plugins, and a newer version. To get around any edge cases where the dependencies are missing, we can easily generate the R from the expected resource spec (cores and gpus). We can also add a shared memory mount to ensure the job MPI gets all available shared memory of the host. The current default is 64M (automatic from container runtime) and it can have implications for MPI performance. Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Also update volume to be added to container. This still is pending testing on AWS! Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@andreyvelich can you give me any insight as to why that is failing? Is it a flaky test or something else?
Yes, we have some flaky e2e tests: #3366
/retest
Don't merge this yet - with the force push I am not convinced it is still working. I need to test again locally.
I think my container is different and that's the issue. I'm going to add interactive mode to this PR. It's going to be essential for helping users, I know we will eventually need it, and I need it now, so it makes sense.
Oh, I think I'm being dumb - I installed the current deployed master kubeflow image (which has flux now!) and not my PR image here. Derp! Trying again.
Yep, that was it - still works great! Phew.
@andreyvelich this is ready for review and eventual merge - apologies for my confusing posts! I was testing locally with the wrong container image (I had updated it back to kubeflow for the PR here, and needed to use my custom build). I just finished the demo for the Kubecon booth, and I'll be releasing the longer variant (about 12 minutes) that does an intro and two examples - LAMMPS with the elastic fabric adapter and LAMMPS with GPU, both on AWS! The second requires the update here. It's a fun and engaging short presentation that introduces Flux (2.5 minutes) followed by the demos. I'm going to wait for the release (and hopefully this paired merge) before posting that. Very excited! 🥳
Let’s get this merged today! 🎉
Sure, happy to unblock. @vsoch can you address this issue in the followup PR: #3325 (comment)
[APPROVAL NOTIFIER] This PR is APPROVED

This pull request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Absolutely - I can address what you'd like for a followup! I'd like to add interactive mode for easy learning and debugging, so I will most definitely follow up soon.

Thank you! 🙏

What this PR does / why we need it:
Flux detection of GPUs depends on hwloc plugins, and a newer version. To get around any edge cases where the dependencies are missing, we can easily generate the R from the expected resource spec (cores and gpus). We can also add a shared memory mount to ensure the job MPI gets all available shared memory of the host. The current default is 64M (automatic from container runtime) and it can have implications for MPI performance.
This will close #3321
I would like to test this with GPUs on AWS today before we do any kind of merge. Thank you!
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #3321
Checklist: