feat: customize R generation to include GPUs#3325

Merged
google-oss-prow[bot] merged 3 commits into kubeflow:master from converged-computing:add-r-generation
Mar 23, 2026

Conversation

@vsoch
Contributor

@vsoch vsoch commented Mar 13, 2026

What this PR does / why we need it:

Flux detection of GPUs depends on hwloc plugins and a newer hwloc version. To get around edge cases where those dependencies are missing, we can easily generate R from the expected resource spec (cores and GPUs). We can also add a shared memory mount to ensure the job's MPI gets all available shared memory of the host. The current default is 64M (set automatically by the container runtime), which can hurt MPI performance.

This will close #3321

I would like to test this with GPUs on AWS today before we do any kind of merge. Thank you!

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3321

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 13, 2026 11:13
Contributor

Copilot AI left a comment


Pull request overview

Updates the Flux runtime plugin to generate a Flux resource file (flux R encode) based on derived per-node CPU/GPU topology and introduces a new in-memory EmptyDir volume intended for shared memory.

Changes:

  • Parameterize templates/entrypoint.sh so flux R encode can receive a computed resource spec (cores/GPU ranges).
  • Refactor Flux entrypoint generation to build explicit Flux flags (-N/-n and optional -g) and derive an Rspec string for resource encoding.
  • Add a new shared-memory volume constant and include a memory-backed EmptyDir volume in Flux view volumes.
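For illustration, a memory-backed EmptyDir in a rendered pod spec might look like the fragment below. This is a sketch, not the PR's exact output; the volume name, container name, and mount path are assumptions:

```yaml
# Hypothetical rendered fragment; names are illustrative.
volumes:
  - name: flux-memory        # e.g. the value of constants.FluxMemoryVolumeName
    emptyDir:
      medium: Memory         # tmpfs-backed, so capacity follows host memory, not node disk
containers:
  - name: node
    volumeMounts:
      - name: flux-memory
        mountPath: /dev/shm  # replaces the container runtime's 64M shm default
```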

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
pkg/runtime/framework/plugins/flux/templates/entrypoint.sh Accepts a formatted resource spec argument for flux R encode.
pkg/runtime/framework/plugins/flux/flux.go Builds Rspec/flags for Flux execution and adds a memory-backed EmptyDir volume to the pod spec.
pkg/constants/constants.go Introduces FluxMemoryVolumeName constant for the new shared-memory volume.


Comment on lines +298 to 302
	memoryVolumeAC := corev1ac.Volume().
		WithName(constants.FluxMemoryVolumeName).
		WithEmptyDir(corev1ac.EmptyDirVolumeSource().
			WithMedium(corev1.StorageMediumMemory))
	fluxVolumeAC := corev1ac.Volume().

	} else {
		tasks = fmt.Sprintf("-N %d -n %d", nodes, *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode*nodes)
		tasks = *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode
	}
Member

@vsoch Do you want to add validation for numProcPerNode in the followup PR?

Contributor Author


I think I had it before (in the first PR) but was asked to remove it because it’s part of the API as an annotation?

Contributor Author


What we do need to think about is a case that we have for the Flux Operator (and that may not be dealt with well here). Often there is a desire to specify the number of nodes but say "discover and use all of the cores that are found," in which case you would run something like:

flux submit -N 16 --exclusive

Note that I don't have any -n specified. For the Flux Operator, I allowed this case when the user set tasks to 0. Is there any way we can support something similar? Or is the only way some special envar that flags and triggers the condition (regardless of what -n is), since the validation of >1 is in the spec?

Member


> I think I had it before (in the first PR) but was asked to remove it because it’s part of the API as an annotation?

Please open a dedicated PR to introduce this kubebuilder validation; we need to ensure that numProcPerNode >= 1.

> Note that I don't have any -n specified. For the Flux Operator, I allowed this case when the user put tasks as 0. Is there any way we can support something similar?

If Flux can dynamically discover all available devices on the node, we can rely on this functionality when numProcPerNode is omitted.
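A rough sketch of how that fallback could look in the generated entrypoint. The variable names here are illustrative, not the plugin's actual ones:

```shell
#!/bin/sh
# Illustrative sketch of the flag-building logic discussed above.
nodes=16
num_proc_per_node=4   # comment this line out to exercise the fallback
if [ -n "${num_proc_per_node:-}" ]; then
  # Explicit count: total tasks = procs per node * nodes
  flags="-N ${nodes} -n $((num_proc_per_node * nodes))"
else
  # Count omitted: let Flux discover and use all cores on each node
  flags="-N ${nodes} --exclusive"
fi
echo "${flags}"   # -> -N 16 -n 64
```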

Contributor Author


Excellent! I did not know that was an option. Thank you!

pkg/constants/constants.go:

	// Path for Flux curve path
	FluxCurveVolumePath = "/curve"

	// Ensure MPI has full memory of the host

templates/entrypoint.sh (the --local flag is replaced by a formatted resource spec):

	# Generate host resources
	hosts=$(cat ${configroot}/etc/flux/system/hostlist)
	-flux R encode --hosts=${hosts} --local > /tmp/R
	+flux R encode --hosts=${hosts} %s > /tmp/R
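The %s placeholder is filled in by the Go plugin with a computed resource spec. As a hedged sketch (the flag names are assumed from flux R encode's idset-style options, and the counts are made up), deriving that spec from expected per-node cores and GPUs might look like:

```shell
#!/bin/sh
# Sketch: build idset ranges for flux R encode from expected counts (illustrative).
cores=8
gpus=2
rspec="--cores=0-$((cores - 1))"
if [ "${gpus}" -gt 0 ]; then
  # Only add a GPU range when GPUs are actually requested
  rspec="${rspec} --gpus=0-$((gpus - 1))"
fi
echo "${rspec}"   # -> --cores=0-7 --gpus=0-1
```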
@vsoch
Contributor Author

vsoch commented Mar 13, 2026

@andreyvelich I'm going for a quick run and will test with GPU when I am back! Would you like an example added to the examples/flux directory that uses GPUs? My plan is to test on AWS, lammps with GPU.

@andreyvelich
Member

> @andreyvelich I'm going for a quick run and will test with GPU when I am back! Would you like an example added to the examples/flux directory that uses GPUs? My plan is to test on AWS, lammps with GPU.

Sure, let's add another example in Flux subdirectory.

@vsoch
Contributor Author

vsoch commented Mar 13, 2026

GPU working is a go!


I have some changes I will push shortly, and it looks like I need to look into those failing tests.

@vsoch vsoch force-pushed the add-r-generation branch from 378d38a to 764a765 Compare March 13, 2026 22:58
vsoch added 3 commits March 20, 2026 12:02
Flux detection of GPUs depends on hwloc plugins, and a newer
version. To get around any edge cases where the dependencies are missing,
we can easily generate the R from the expected resource spec (cores
and gpus). We can also add a shared memory mount to ensure the job
MPI gets all available shared memory of the host. The current
default is 64M (automatic from container runtime) and it can have
implications for MPI performance.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Also update volume to be added to container. This still
is pending testing on AWS!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the add-r-generation branch from 764a765 to dea9a08 Compare March 20, 2026 19:02
@vsoch
Contributor Author

vsoch commented Mar 20, 2026

@andreyvelich can you give me any insight as to why that is failing? Is it a flaky test or something else?

@andreyvelich
Member

Yes, we have some flaky e2e test: #3366

@andreyvelich
Member

/retest

@vsoch
Contributor Author

vsoch commented Mar 20, 2026

Don't merge this yet - with the force push I am not convinced it is still working. I need to test again locally.

@vsoch
Contributor Author

vsoch commented Mar 20, 2026

I think my container is different and that's the issue. I'm going to add interactive mode to this PR. It's going to be essential for helping users, I know we will eventually need it, and I need it now, so it makes sense.

@vsoch
Contributor Author

vsoch commented Mar 20, 2026

Oh, I think I'm being dumb - I installed the current deployed master kubeflow image (which has flux now!) and not my PR image here. Derp! Trying again.

@vsoch
Contributor Author

vsoch commented Mar 21, 2026

yep that was it - still works great! Phew.

@vsoch vsoch requested a review from andreyvelich March 21, 2026 03:22
@vsoch
Contributor Author

vsoch commented Mar 21, 2026

@andreyvelich this is ready for review and eventual merge - apologies for my confusing posts! I was testing locally with the wrong container image (I had updated it back to kubeflow for the PR here, and needed to use my custom build).

I just finished the demo for the Kubecon booth, and I'll be releasing the longer variant (about 12 minutes) that does an intro and two examples - LAMMPS with the elastic fabric adapter and LAMMPS with GPU, both on AWS! The second requires the update here. It's a fun and engaging short presentation that introduces Flux (2.5 minutes) followed by the demos. I'm going to wait for the release (and hopefully this paired merge) before posting that. Very excited! 🥳

@vsoch
Contributor Author

vsoch commented Mar 23, 2026

Let’s get this merged today! 🎉

@andreyvelich
Member

Sure, happy to unblock. @vsoch can you address this issue in the followup PR: #3325 (comment)
/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vsoch
Contributor Author

vsoch commented Mar 23, 2026

> Sure, happy to unblock. @vsoch can you address this issue in the followup PR: #3325 (comment)

Absolutely - I can address what you'd like for a followup! I'd like to add interactive mode for easy learning and debugging, so I will most definitely follow up soon.

@google-oss-prow google-oss-prow bot merged commit 63c52aa into kubeflow:master Mar 23, 2026
35 of 39 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.2 milestone Mar 23, 2026
@andreyvelich andreyvelich modified the milestones: v2.2, v2.3 Mar 23, 2026
@vsoch
Contributor Author

vsoch commented Mar 23, 2026

Thank you! 🙏


Development

Successfully merging this pull request may close these issues.

[feat] optimizations for flux

3 participants