DLIO PyTorch DataLoader workers abort with MPI_Init "No permission" when read_threads > 0 #415

@wolfgang-desalvador

Summary

Any DLIO workload that uses the PyTorch DataLoader with reader.read_threads > 0 (e.g. retinanet_b200) aborts at the start of epoch 1 with:

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort)
  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
[host--...:NNNN] Local abort before MPI_INIT completed completed successfully ...

The abort is reported once per spawned DataLoader worker on every main MPI rank (e.g. 8 workers × 8 ranks = 64 aborts), then the run dies.

Reproduction

Cluster: 2 nodes, OpenMPI via mpirun under SLURM. Fresh uv sync install of mlcommons/storage main (3.0.3) and bundled dlio_benchmark.

mlpstorage training run \
  --model=retinanet --exec-type=mpi -g b200 -na 8 -cm 8 \
  --num-client-hosts=2 --hosts host1 host2 \
  --data-dir=/data/mlps_uv/retinanet \
  --results-dir=/data/mlperf_storage_results \
  --open --allow-invalid-params --allow-run-as-root --oversubscribe

Root cause

DLIO's PyTorch DataLoader uses spawn as its default multiprocessing_context (since dlio commit 39449df). Spawned worker subprocesses re-import every parent module, including dlio_benchmark/utils/statscounter.py, which executes from mpi4py import MPI at module top level. With mpi4py.rc.initialize == True (the default), this import implicitly invokes MPI_Init_thread() in the child. The child was not launched by mpirun and has no PMIX/ORTE environment, so the init aborts.
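
The mechanism can be illustrated without MPI or PyTorch. The sketch below is not DLIO code: it writes a hypothetical module, fake_stats.py, whose top-level statement has a side effect (a print standing in for the implicit MPI_Init_thread()), and imports it in a fresh interpreter started with subprocess. A spawn worker also starts from a fresh interpreter, so any import it performs runs top-level statements again in the same way.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for a module such as statscounter.py. The print
# plays the role of the side effect triggered by "from mpi4py import MPI".
FAKE_MODULE = textwrap.dedent("""
    import os
    print(f"top-level side effect in pid {os.getpid()}")
""")


def import_in_fresh_interpreter(module_dir: str, module_name: str) -> str:
    """Import module_name in a brand-new Python process; return its stdout."""
    env = dict(os.environ, PYTHONPATH=module_dir)
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def demo() -> str:
    """Write the fake module to a temp dir and import it in a child process."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "fake_stats.py"), "w") as fh:
            fh.write(FAKE_MODULE)
        return import_in_fresh_interpreter(tmp, "fake_stats")


if __name__ == "__main__":
    # The child prints its own pid, showing the side effect ran there.
    print(demo(), end="")
```

The child never asked for the side effect; it only imported the module. That is why the fix has to change what the import does, rather than what the workers call.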

Interestingly, the bug is already half-acknowledged in dlio_benchmark/utils/utility.py:176-178:

# MPI cannot be initialized automatically, or read_thread spawn/forkserver
# child processes will abort trying to open a non-existant PMI_fd file.
import mpi4py

…but the mpi4py.rc.initialize/finalize flags that actually disable auto-init were never set.

Proposed fix

In dlio_benchmark/utils/utility.py, set the rc flags before the first from mpi4py import MPI anywhere in the package:

import mpi4py
mpi4py.rc.initialize = False
mpi4py.rc.finalize = False
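
The ordering matters because mpi4py reads the rc options when mpi4py.MPI is first imported in a process; setting them afterwards has no effect. A minimal sketch of a guard that makes an ordering mistake visible is shown below. The helper names are hypothetical and not part of dlio_benchmark or mpi4py; the check relies only on sys.modules.

```python
import sys


def mpi_module_loaded(modules=None) -> bool:
    """Return True if mpi4py.MPI has already been imported in this process."""
    modules = sys.modules if modules is None else modules
    return "mpi4py.MPI" in modules


def check_rc_flags_can_apply(modules=None) -> None:
    """Raise if it is already too late for mpi4py.rc settings to take effect.

    Hypothetical helper: call it just before setting mpi4py.rc.initialize
    and mpi4py.rc.finalize, so a stray early "from mpi4py import MPI"
    elsewhere in the package fails loudly instead of silently re-enabling
    automatic initialization.
    """
    if mpi_module_loaded(modules):
        raise RuntimeError(
            "mpi4py.MPI was imported before the rc flags were set; "
            "automatic MPI initialization may already have run"
        )


if __name__ == "__main__":
    check_rc_flags_can_apply()
    print("safe to set mpi4py.rc flags")
```

The modules parameter exists only so the check can be exercised with a plain dict in tests; in normal use it defaults to sys.modules.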

The main MPI ranks still get initialized via the existing explicit path in DLIOMPI.initialize():

from mpi4py import MPI
if not MPI.Is_initialized():
    MPI.Init()

…and finalized via DLIOMPI.finalize() (MPI.Finalize() at line 351). The atexit auto-finalize is redundant for the main ranks and actively harmful in spawn workers.
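
With both rc flags disabled, initialization and finalization become fully explicit. The sketch below shows that lifecycle under stated assumptions: it is not the DLIOMPI implementation, the helper names are hypothetical, and FakeMPI is a stand-in that mimics only the four mpi4py.MPI calls used (Is_initialized, Is_finalized, Init, Finalize), so the example runs without an MPI installation.

```python
class FakeMPI:
    """Hypothetical stand-in for the mpi4py.MPI module (four calls only)."""

    def __init__(self):
        self.initialized = False
        self.finalized = False

    def Is_initialized(self):
        return self.initialized

    def Is_finalized(self):
        return self.finalized

    def Init(self):
        self.initialized = True

    def Finalize(self):
        self.finalized = True


def ensure_initialized(mpi) -> bool:
    """Initialize MPI once; return True only if this call did the init."""
    if mpi.Is_initialized():
        return False
    mpi.Init()
    return True


def ensure_finalized(mpi) -> bool:
    """Finalize MPI only if it was initialized and not yet finalized."""
    if mpi.Is_initialized() and not mpi.Is_finalized():
        mpi.Finalize()
        return True
    return False


if __name__ == "__main__":
    worker = FakeMPI()                 # spawn worker: never calls Init
    print(ensure_finalized(worker))    # False, nothing to finalize

    rank = FakeMPI()                   # main rank: explicit lifecycle
    print(ensure_initialized(rank))    # True
    print(ensure_finalized(rank))      # True
```

In a worker that imports the module but never initializes MPI, the finalize guard is a no-op, which is the behavior the disabled atexit hook is meant to guarantee.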

PR for dlio_benchmark proposed on a fork: wolfgang-desalvador/DLIO_local_changes branch wolfgang/fix-MPI-initialization-fork (single commit, +2 lines).

Environment

  • mlpstorage 3.0.3 (mlcommons/storage@44eee09)
  • dlio_benchmark (bundled via uv sync)
  • OpenMPI, SLURM, 2 × 192-core nodes
  • Python 3.12
