fix: disable mpi4py auto-init to prevent spawn-worker MPI_Init abort by wolfgang-desalvador · Pull Request #22 · mlcommons/DLIO_local_changes

wolfgang-desalvador · 2026-06-09T13:31:45Z

The existing comment in utility.py already documented this issue: "MPI cannot be initialized automatically, or read_thread spawn/forkserver child processes will abort trying to open a non-existant PMI_fd file."

However, the rc flags that actually disable mpi4py's auto-initialization were never set, so a bare "from mpi4py import MPI" still triggered MPI_Init_thread() at module import time.

When PyTorch DataLoader workers are created with the default 'spawn' multiprocessing context, each child re-imports dlio_benchmark modules. Outside mpirun's PMIX namespace, the auto MPI_Init aborts with:

orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
*** An error occurred in MPI_Init_thread on a NULL communicator

Setting mpi4py.rc.initialize = False and mpi4py.rc.finalize = False before any "from mpi4py import MPI" prevents the auto-init. Main MPI ranks are still initialized explicitly via DLIOMPI.initialize(), which calls MPI.Init() when MPI.Is_initialized() is False.
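A minimal sketch of the pattern described above, for illustration only. The helper names load_mpi_without_autoinit and ensure_mpi_initialized are hypothetical and are not taken from dlio_benchmark; the second one only approximates what the description says DLIOMPI.initialize() does. The sketch assumes the rc flags are set before mpi4py.MPI is imported for the first time in the process, since they have no effect afterwards.

```python
def load_mpi_without_autoinit():
    """Import mpi4py.MPI with automatic MPI_Init/MPI_Finalize disabled.

    Hypothetical helper. Returns the MPI module, or None when mpi4py
    (or the underlying MPI library) cannot be imported.
    """
    try:
        import mpi4py

        # Must run before the first "from mpi4py import MPI" in this
        # process; once mpi4py.MPI is loaded the flags are ignored.
        mpi4py.rc.initialize = False
        mpi4py.rc.finalize = False

        from mpi4py import MPI
    except ImportError:
        return None
    return MPI


def ensure_mpi_initialized(MPI):
    """Explicitly initialize MPI if it has not been initialized yet.

    Hypothetical stand-in for the explicit initialization described
    above. Intended to be called only in the main MPI ranks, never in
    DataLoader worker processes.
    """
    if not MPI.Is_initialized():
        MPI.Init()


if __name__ == "__main__":
    MPI = load_mpi_without_autoinit()
    if MPI is None:
        print("mpi4py is not available")
    else:
        # Importing the module alone should not have started MPI.
        print("MPI initialized after import:", MPI.Is_initialized())
```

With rc.finalize set to False, mpi4py no longer finalizes MPI at interpreter exit, so a process that calls MPI.Init() explicitly is also responsible for calling MPI.Finalize() before it exits.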

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wolfgang-desalvador · 2026-06-09T13:36:55Z

I am evaluating the impact of the change. So far it appears to have no impact, but I am still cross-validating.

PyTorch DataLoader workers were aborting with "MPI_Init_thread on a NULL communicator" whenever a DLIO workload used reader.read_threads > 0 (e.g. retinanet_b200, unet3d_*).

Root cause is in dlio_benchmark: the top-level `from mpi4py import MPI` in statscounter.py triggers MPI_Init_thread() in every spawn-context DataLoader child, which has no PMIX/ORTE environment because it was not launched by mpirun. The upstream comment in dlio_benchmark/utils/utility.py:176 already documented the hazard, but the mpi4py.rc.initialize / rc.finalize flags that actually disable the auto-init were never set.

mlcommons/DLIO_local_changes#22 adds those two lines. Pin to that PR's head commit (60fd3b8e) so storage picks up the fix immediately. Revert to branch = "main" once #22 merges upstream.

wolfgang-desalvador requested review from a team and russfellows June 9, 2026 13:31

wolfgang-desalvador mentioned this pull request Jun 9, 2026

DLIO PyTorch DataLoader workers abort with MPI_Init "No permission" when read_threads > 0 mlcommons/storage#415

Closed

wolfgang-desalvador marked this pull request as draft June 9, 2026 13:36

FileSystemGuy approved these changes Jun 12, 2026

FileSystemGuy marked this pull request as ready for review June 12, 2026 18:30

FileSystemGuy removed the request for review from russfellows June 12, 2026 18:31

FileSystemGuy merged commit d7c3825 into main Jun 12, 2026
7 checks passed

FileSystemGuy mentioned this pull request Jun 12, 2026

fix: pin dlio-benchmark to PR #22 head to resolve issue #415 mlcommons/storage#436

Merged

4 tasks

FileSystemGuy merged 1 commit into main from wolfgang/fix-MPI-initialization-fork
