fix: restrict TorchIterableDatasetSimple to S3/AISTORE; gate s3dlio Parquet gen on storage type (fixes #391, #385)#21
Conversation
…arquet gen on storage type Two LOCAL_FS bug fixes: 1. Training hang on LOCAL_FS (#391) TorchIterableDatasetSimple was selected for all NPZ/NPY/JPEG/PNG formats regardless of storage type. With read_threads>1, DataLoader forks worker processes after _local_fs_iterable_mixin.py imports a module-level ThreadPoolExecutor. The executor is not fork-safe; child processes deadlock silently with no error output. Fix: add 'and self._args.storage_type in _s3_types' to the use_simple_iterable_dataset guard. LOCAL_FS falls back to map-style TorchDataset which does not fork. 2. Parquet datagen RuntimeError on LOCAL_FS (#385) ParquetGenerator unconditionally called s3dlio.generate_and_write_parquet_schema_streaming() when parquet.use_s3dlio_gen=true, including for local paths. s3dlio requires an s3:// URI and raises RuntimeError: URI must start with s3://. Fix: add StorageType import and guard the s3dlio path on storage_type in (S3, AISTORE). LOCAL_FS falls through to the PyArrow local write path. Fixes #391 Fixes #385
FileSystemGuy
left a comment
There was a problem hiding this comment.
Please review and approve: @idevasena @dslik
|
@russfellows one blocking issue when testing (I pushed the change in a commit to this PR itself):
Before: After the fix: |
|
Tested integration with The changes look good. |
idevasena
left a comment
There was a problem hiding this comment.
Approving this PR. But with a note for mlcommons/storage:
A separate PR needed to update pyproject.toml in mlcommons/storage repo here https://github.com/mlcommons/storage/blob/main/pyproject.toml#L95:
It points to Russ fork of DLIO and needs to be updated to https://github.com/mlcommons/DLIO_local_changes
|
Devasena,
As always, thanks for your diligence in testing, and finding that existing or new bug and fixing it. Not sure if I introduced it or not, but regardless, glad you found it and squashed it.
But yes, to your point the last thing that needs to happen is to update pyproject.toml in mlc-storage to point to the correct head.
Regards,
—Russ
… On Jun 2, 2026, at 2:38 PM, Devasena I ***@***.***> wrote:
@idevasena approved this pull request.
Approving this PR. But with a note for mlcommons/storage:
A separate PR needed to update pyproject.toml in mlcommons/storage repo here https://github.com/mlcommons/storage/blob/main/pyproject.toml#L95:
It points to Russ fork of DLIO and needs to be updated to https://github.com/mlcommons/DLIO_local_changes
—
Reply to this email directly, view it on GitHub <#21?email_source=notifications&email_token=AF64UJ33NESQQNIDWWGFE6L4543MHA5CNFSNUABKM5UWIORPF5TWS5BNNB2WEL2QOVWGYUTFOF2WK43UKJSXM2LFO4XTINBRGM2DKMJVGIZ2M4TFMFZW63VHNVSW45DJN5XKKZLWMVXHJLDGN5XXIZLSL5RWY2LDNM#pullrequestreview-4413451523>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AF64UJY3QLW6MIB7NDSJBRD4543MHAVCNFSM6AAAAACZUZZ24SVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHM2DIMJTGQ2TCNJSGM>.
You are receiving this because you were mentioned.
|
Summary
This PR fixes two independent bugs that affect
LOCAL_FSworkloads:mlpstorage training runhangs silently withLOCAL_FSstorage (unet3d, NPZ/NPY/JPEG/PNG formats,read_threads > 1)RuntimeError: URI must start with s3://whenstorage_type = local_fsandparquet.use_s3dlio_gen: trueBug 1: Training hang on LOCAL_FS (#391)
Root Cause
TorchIterableDatasetSimplewas selected for all NPZ/NPY/JPEG/PNG formatsregardless of storage type. With
read_threads=4(unet3d default), PyTorchDataLoadercreates 4 worker processes viaos.fork(). The fork occurs after_local_fs_iterable_mixin.pyis imported at module level, which creates aThreadPoolExecutor(_PREFETCH_POOL). In the forked child processes theexecutor's internal state is inconsistent (the background management thread from
the parent is not replicated), causing the workers to deadlock. CPU and I/O drop
to zero indefinitely with no error message.
Fix
dlio_benchmark/data_loader/torch_data_loader.py— addand self._args.storage_type in _s3_typesto theuse_simple_iterable_datasetguard so
TorchIterableDatasetSimpleis only used for S3/AISTORE, where theprefetch benefit is most significant and
os.fork()is not involved:LOCAL_FSnow falls through to the original map-styleTorchDatasetpath,which does not fork and has no executor state issue.
Bug 2: Parquet datagen crashes on LOCAL_FS with
use_s3dlio_gen: true(#385)Root Cause
ParquetGenerator.generate()unconditionally calleds3dlio.generate_and_write_parquet_schema_streaming()wheneverparquet.use_s3dlio_gen: truewas set in the workload config. Thes3dliolibrary requires a
s3://-scheme URI and immediately raisesRuntimeError: URI must start with s3://when given a local path. Thedlrm_datagen.yamlconfig ships withuse_s3dlio_gen: trueglobally, so anyuser running the DLRM datagen against local storage hits this crash.
Fix
dlio_benchmark/data_generator/parquet_generator.py— importStorageTypeandadd a storage-type guard so the
s3dliofast-path is only taken for S3/AISTORE:For
LOCAL_FS, execution falls through to the existing PyArrow-based localwrite path, which works correctly without any URI scheme.
Issues Fixed
RuntimeError: URI must start with s3://on LOCAL_FSTesting
uv run python -m pytest tests/test_fast_ci.py -q)LOCAL_FS+NPZ → use_simple_iterable_dataset=False(wasTrue),S3+NPZ → True(unchanged)LOCAL_FS+use_s3dlio_gen=True → skip s3dlio path,S3+use_s3dlio_gen=True → use s3dlio pathFiles Changed
dlio_benchmark/data_loader/torch_data_loader.pyTorchIterableDatasetSimpleto S3/AISTORE onlydlio_benchmark/data_generator/parquet_generator.pyStorageTypeimportuv.lockFollow-up Required
After this PR merges,
mlcommons/storagewill need a second PR to updatepyproject.toml(rev pointing to the new commit hash here) and regenerateuv.lockso thatmlpstorageusers pick up these fixes.