Improve torchrun-hpc device init#56

Merged
bvanessen merged 1 commit into LBANN:main from bvanessen:improve_torch_launch
Oct 10, 2025

Conversation

@bvanessen
Collaborator

Added the device_id initialization to the init_process_group call in
the torchrun-hpc trampoline.
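
For readers unfamiliar with the option, here is a minimal sketch of what passing device_id to init_process_group can look like. It is not the torchrun-hpc trampoline code: the helper names pick_local_device_index and init_distributed, the environment variables consulted, and the wrap-around when there are more local ranks than devices are all assumptions made for illustration. It assumes one device per process and a PyTorch release whose init_process_group accepts device_id.

```python
import os


def pick_local_device_index(env, device_count):
    """Return a device index for this process, assuming one device per process.

    Hypothetical helper: the variable names below are assumptions about what
    the launcher exports, not something taken from torchrun-hpc.
    """
    if device_count <= 0:
        return None  # no accelerator visible to this process
    for name in ("LOCAL_RANK", "SLURM_LOCALID"):
        value = env.get(name)
        if value is not None:
            # Wrap around if there are more local ranks than devices.
            return int(value) % device_count
    return 0


def init_distributed():
    """Initialize the default process group and bind it to a single device.

    Relies on the usual env:// rendezvous variables (RANK, WORLD_SIZE,
    MASTER_ADDR, MASTER_PORT) having been set by the launcher.
    """
    # Imported lazily so the helper above can be used without PyTorch.
    import torch
    import torch.distributed as dist

    index = pick_local_device_index(os.environ, torch.cuda.device_count())
    if index is None:
        # CPU-only fallback: no device to bind, so device_id is omitted.
        dist.init_process_group(backend="gloo")
        return None

    device = torch.device("cuda", index)
    torch.cuda.set_device(device)
    # device_id ties the process group to this one device at init time.
    dist.init_process_group(backend="nccl", device_id=device)
    return device
```

With LOCAL_RANK=3 and four visible devices, the helper selects index 3, and the process group is then created against that device.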

Collaborator

@tbennun tbennun left a comment

This assumes we will always use one device per process?

@szaman19

@tbennun Yeah, we were discussing that this would be a good default setting. If anyone would like to use multiple GPUs per process, they could set that up after initialization.

@bvanessen bvanessen merged commit 4c14b52 into LBANN:main Oct 10, 2025
2 checks passed

3 participants