fix: disable mpi4py auto-init to prevent spawn-worker MPI_Init abort #22
Merged
The existing comment in utility.py already documented this issue:
"MPI cannot be initialized automatically, or read_thread spawn/forkserver
child processes will abort trying to open a non-existant PMI_fd file."
However, the rc flags that actually disable mpi4py's auto-initialization
were never set, so a bare "from mpi4py import MPI" still triggered
MPI_Init_thread() at module import time.
When PyTorch DataLoader workers are created with the default 'spawn'
multiprocessing context, each child re-imports dlio_benchmark modules.
Outside mpirun's PMIX namespace, the auto MPI_Init aborts with:
    orte_ess_init failed
    --> Returned value No permission (-17) instead of ORTE_SUCCESS
    *** An error occurred in MPI_Init_thread on a NULL communicator
Setting mpi4py.rc.initialize = False and mpi4py.rc.finalize = False
before any "from mpi4py import MPI" prevents the auto-init. Main MPI
ranks are still initialized explicitly via DLIOMPI.initialize(), which
calls MPI.Init() when MPI.Is_initialized() is False.
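For readers unfamiliar with the mechanism, the pattern described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the helper names `configure_mpi4py_no_autoinit` and `ensure_mpi_initialized` are hypothetical, and the sketch assumes nothing has imported `mpi4py.MPI` before the rc flags are set.

```python
import importlib.util


def configure_mpi4py_no_autoinit():
    """Turn off mpi4py's import-time MPI init/finalize (hypothetical helper).

    Must run before the first `from mpi4py import MPI` in the process.
    Returns True if mpi4py is installed and was configured, else False.
    """
    if importlib.util.find_spec("mpi4py") is None:
        return False
    import mpi4py  # importing the package alone does not initialize MPI

    mpi4py.rc.initialize = False  # no automatic MPI init at import of mpi4py.MPI
    mpi4py.rc.finalize = False    # no automatic MPI_Finalize at interpreter exit
    return True


def ensure_mpi_initialized():
    """Initialize MPI explicitly, only where it is wanted (hypothetical helper).

    Intended for main ranks launched under mpirun; spawn-context worker
    processes would simply never call this.
    """
    from mpi4py import MPI

    if not MPI.Is_initialized():
        MPI.Init()
    return MPI


# Safe to run at import time in any process, including spawned workers.
configure_mpi4py_no_autoinit()
```

With automatic finalization disabled, the process that called `MPI.Init()` is presumably also responsible for calling `MPI.Finalize()` itself before it exits.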
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Author
I am evaluating the impact of this change. So far it seems to have no impact, but I am still trying to cross-validate.
FileSystemGuy approved these changes Jun 12, 2026
4 tasks
FileSystemGuy added a commit to mlcommons/storage that referenced this pull request Jun 12, 2026
PyTorch DataLoader workers were aborting with "MPI_Init_thread on a NULL communicator" whenever a DLIO workload used reader.read_threads > 0 (e.g. retinanet_b200, unet3d_*).
Root cause is in dlio_benchmark: the top-level `from mpi4py import MPI` in statscounter.py triggers MPI_Init_thread() in every spawn-context DataLoader child, which has no PMIX/ORTE environment because it was not launched by mpirun. The upstream comment in dlio_benchmark/utils/utility.py:176 already documented the hazard, but the mpi4py.rc.initialize / rc.finalize flags that actually disable the auto-init were never set.
mlcommons/DLIO_local_changes#22 adds those two lines. Pin to that PR's head commit (60fd3b8e) so storage picks up the fix immediately. Revert to branch = "main" once #22 merges upstream.