
Provide Toggle for PyTorch Distributed Initialization#62

Closed
michaelmckinsey1 wants to merge 1 commit into LBANN:main from michaelmckinsey1:no-dist-init

Conversation

@michaelmckinsey1
Contributor

@michaelmckinsey1 michaelmckinsey1 commented Feb 26, 2026

Description

Provide a flag to skip invoking torch.distributed.init_process_group() when the user needs to invoke it themselves in their application.
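A toggle like the one described might look like the sketch below. The flag name `skip_dist_init`, the environment-variable spelling, and the backend default are all assumptions for illustration; the PR's actual diff is not shown here.

```python
import os


def setup_distributed(skip_dist_init: bool = False, backend: str = "nccl") -> bool:
    """Initialize Torch Distributed unless the caller opts out.

    Returns True if init_process_group() was invoked here, or False if
    the application is expected to invoke it itself later.
    """
    # Hypothetical flag/env-var names, not taken from the PR.
    if skip_dist_init or os.environ.get("LBANN_SKIP_DIST_INIT") == "1":
        # The application will call
        # torch.distributed.init_process_group() on its own schedule.
        return False

    # Imported lazily so the opt-out path works even without torch.
    import torch.distributed as dist
    dist.init_process_group(backend=backend)
    return True
```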

Use case

With LBANN/ScaFFold#17, importing mpi4py after Torch Distributed has been initialized causes a segmentation fault. Disabling dist.init_process_group() in the trampoline lets ScaFFold use mpi4py before initializing Torch Distributed within the benchmark. In this case, the flag exists to keep mpi4py and torch.distributed initialization separate.
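With such a toggle in place, the intended ordering inside the benchmark would be roughly as follows. This is a sketch only: the function name and the backend choice are assumptions, not taken from ScaFFold's code.

```python
def benchmark_main() -> int:
    """Illustrative ordering that avoids the segfault described above."""
    # 1. mpi4py is imported and used first, while Torch Distributed is
    #    still uninitialized; importing it after initialization is what
    #    segfaults (LBANN/ScaFFold#17).
    from mpi4py import MPI
    comm = MPI.COMM_WORLD

    # 2. Only afterwards does the benchmark initialize Torch Distributed
    #    itself, which the trampoline's toggle makes possible.
    import torch.distributed as dist
    dist.init_process_group(backend="mpi")
    return comm.Get_rank()
```

Imports are deliberately deferred into the function body so that merely loading the module does not trigger either initialization.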

@bvanessen
Collaborator

@michaelmckinsey1 @PatrickRMiles Hmmm, not sure that this is the path that I would suggest. Let's plan to talk about it tomorrow.

@michaelmckinsey1 michaelmckinsey1 marked this pull request as draft February 26, 2026 18:44
@michaelmckinsey1
Contributor Author

We are quite certain that we won't merge this, so I can leave it as a draft until I find a better solution.

@michaelmckinsey1
Contributor Author

michaelmckinsey1 commented Feb 27, 2026

The original issue that made this feature necessary is fixed by using cray-mpich/9.1.0, so this change is no longer needed.

