fix: restrict TorchIterableDatasetSimple to S3/AISTORE; gate s3dlio Parquet gen on storage type (fixes #391, #385) by russfellows · Pull Request #21 · mlcommons/DLIO_local_changes

russfellows · 2026-05-31T23:51:21Z

Summary

This PR fixes two independent bugs that affect LOCAL_FS workloads:

#391 — mlpstorage training run hangs silently with LOCAL_FS storage (unet3d, NPZ/NPY/JPEG/PNG formats, read_threads > 1)
#385 — Parquet datagen crashes with RuntimeError: URI must start with s3:// when storage_type = local_fs and parquet.use_s3dlio_gen: true

Bug 1: Training hang on LOCAL_FS (#391)

Root Cause

TorchIterableDatasetSimple was selected for all NPZ/NPY/JPEG/PNG formats
regardless of storage type. With read_threads=4 (the unet3d default), the PyTorch
DataLoader creates 4 worker processes via os.fork(). The fork occurs after
_local_fs_iterable_mixin.py has been imported, and that module creates a
ThreadPoolExecutor (_PREFETCH_POOL) at module level. In the forked child
processes the executor's internal state is inconsistent (the parent's background
threads are not replicated in the child), causing the workers to deadlock. CPU
and I/O drop to zero indefinitely with no error message.
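
The mechanism can be illustrated in isolation. The following is a minimal sketch, not DLIO code and not a reproduction of the hang itself: it only shows that a thread started before os.fork() does not exist in the child process, which is the condition described above. The helper name thread_counts_across_fork is hypothetical, and the sketch assumes a POSIX platform where os.fork() is available.

```python
import os
import threading
import warnings


def thread_counts_across_fork():
    """Return (parent_threads, child_threads) observed around os.fork()."""
    stop = threading.Event()
    # Stand-in for a pool's background thread (assumption, not DLIO code).
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()

    read_fd, write_fd = os.pipe()
    parent_threads = threading.active_count()
    with warnings.catch_warnings():
        # Python 3.12+ warns when a multi-threaded process calls fork().
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() survives.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent: read the child's thread count, then clean up.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_threads = int(pipe.read())
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_threads, child_threads
```

The child reports a single thread while the parent reports at least two. Any object copied into the child that still assumes the missing thread is running holds stale state, which is the kind of inconsistency the root cause points to.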

Fix

dlio_benchmark/data_loader/torch_data_loader.py: add
`and self._args.storage_type in _s3_types` to the use_simple_iterable_dataset
guard so TorchIterableDatasetSimple is only used for S3/AISTORE, where the
prefetch benefit is most significant and os.fork() is not involved:

+        # TorchIterableDatasetSimple uses DataLoader(num_workers>0) which forks
+        # worker processes via os.fork(). On LOCAL_FS, this fork-after-module-import
+        # pattern causes a ThreadPoolExecutor deadlock (the executor's background
+        # thread is not fork-safe). Restrict the iterable path to object storage
+        # (S3/AISTORE) only where the prefetch benefit is most significant and
+        # the fork issue does not apply. LOCAL_FS falls through to map-style TorchDataset.
         use_simple_iterable_dataset = (
             self.format_type in _simple_iterable_formats
             and not use_rg_iterable_dataset
+            and self._args.storage_type in _s3_types
         )

LOCAL_FS now falls through to the original map-style TorchDataset path,
which does not fork and has no executor state issue.
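
As a rough illustration of the guard's intended behaviour, here is a standalone sketch. The StorageType enum below is a stand-in for the one in dlio_benchmark.common.enumerations (the "s3" and "aistore" values are assumptions), the format strings are illustrative, and the function wrapper is hypothetical; the real check lives inline in torch_data_loader.py.

```python
from enum import Enum


class StorageType(Enum):
    # Stand-in enum; values other than "local_fs" are assumptions.
    LOCAL_FS = "local_fs"
    S3 = "s3"
    AISTORE = "aistore"


# Defined before use, so the guard below can reference them safely.
_S3_TYPES = (StorageType.S3, StorageType.AISTORE)
_SIMPLE_ITERABLE_FORMATS = {"npz", "npy", "jpeg", "png"}


def use_simple_iterable_dataset(format_type, storage_type,
                                use_rg_iterable_dataset=False):
    """Return True only for simple formats on object storage."""
    return (
        format_type in _SIMPLE_ITERABLE_FORMATS
        and not use_rg_iterable_dataset
        and storage_type in _S3_TYPES
    )
```

With this guard, LOCAL_FS with NPZ evaluates to False and S3 with NPZ evaluates to True, matching the cases listed under Testing.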

Bug 2: Parquet datagen crashes on LOCAL_FS with `use_s3dlio_gen: true` (#385)

Root Cause

ParquetGenerator.generate() unconditionally called
s3dlio.generate_and_write_parquet_schema_streaming() whenever
parquet.use_s3dlio_gen: true was set in the workload config. The s3dlio
library requires a s3://-scheme URI and immediately raises
RuntimeError: URI must start with s3:// when given a local path. The
dlrm_datagen.yaml config ships with use_s3dlio_gen: true globally, so any
user running the DLRM datagen against local storage hits this crash.

Fix

dlio_benchmark/data_generator/parquet_generator.py — import StorageType and
add a storage-type guard so the s3dlio fast-path is only taken for S3/AISTORE:

-from dlio_benchmark.common.enumerations import Compression
+from dlio_benchmark.common.enumerations import Compression, StorageType

 ...

+            # Restricted to object storage (S3/AISTORE): s3dlio requires an
+            # s3:// URI and raises RuntimeError for local paths (issue #385).
+            _s3_storage = (StorageType.S3, StorageType.AISTORE)
-            if self.use_s3dlio_gen and self.parquet_columns:
+            if self.use_s3dlio_gen and self.parquet_columns and self._args.storage_type in _s3_storage:

For LOCAL_FS, execution falls through to the existing PyArrow-based local
write path, which works correctly without any URI scheme.
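
The gating decision can be sketched on its own as follows. This is an assumption-laden illustration rather than the generator's code: the StorageType enum is a stand-in (only the "local_fs" value is confirmed by the logs in this thread), and parquet_write_path is a hypothetical helper that just names which branch would be taken.

```python
from enum import Enum


class StorageType(Enum):
    # Stand-in enum; values other than "local_fs" are assumptions.
    LOCAL_FS = "local_fs"
    S3 = "s3"
    AISTORE = "aistore"


_S3_STORAGE = (StorageType.S3, StorageType.AISTORE)


def parquet_write_path(use_s3dlio_gen, parquet_columns, storage_type):
    """Name the write path the guarded condition would select."""
    if use_s3dlio_gen and parquet_columns and storage_type in _S3_STORAGE:
        return "s3dlio"   # fast path, needs an s3:// URI
    return "pyarrow"      # existing local write path
```

A config that sets use_s3dlio_gen: true globally therefore no longer forces the s3dlio path on local storage; the flag only takes effect when the storage type is S3 or AISTORE.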

Issues Fixed

Fixes #391 — Training hangs indefinitely with LOCAL_FS (unet3d)
Fixes #385 — Parquet datagen fails with RuntimeError: URI must start with s3:// on LOCAL_FS

Testing

85 fast CI tests pass (uv run python -m pytest tests/test_fast_ci.py -q)
Logic verified: LOCAL_FS+NPZ → use_simple_iterable_dataset=False (was True), S3+NPZ → True (unchanged)
Logic verified: LOCAL_FS+use_s3dlio_gen=True → skip s3dlio path, S3+use_s3dlio_gen=True → use s3dlio path

Files Changed

| File | Change |
| --- | --- |
| `dlio_benchmark/data_loader/torch_data_loader.py` | Restrict `TorchIterableDatasetSimple` to S3/AISTORE only |
| `dlio_benchmark/data_generator/parquet_generator.py` | Gate s3dlio Parquet gen on S3/AISTORE storage type; add `StorageType` import |
| `uv.lock` | Reflects the version bump (3.0.1 → 3.0.2) |

Follow-up Required

After this PR merges, mlcommons/storage will need a second PR to update
pyproject.toml (rev pointing to the new commit hash here) and regenerate
uv.lock so that mlpstorage users pick up these fixes.

…arquet gen on storage type

Two LOCAL_FS bug fixes:

1. Training hang on LOCAL_FS (#391)
TorchIterableDatasetSimple was selected for all NPZ/NPY/JPEG/PNG formats regardless of storage type. With read_threads>1, DataLoader forks worker processes after _local_fs_iterable_mixin.py imports a module-level ThreadPoolExecutor. The executor is not fork-safe; child processes deadlock silently with no error output.
Fix: add 'and self._args.storage_type in _s3_types' to the use_simple_iterable_dataset guard. LOCAL_FS falls back to map-style TorchDataset which does not fork.

2. Parquet datagen RuntimeError on LOCAL_FS (#385)
ParquetGenerator unconditionally called s3dlio.generate_and_write_parquet_schema_streaming() when parquet.use_s3dlio_gen=true, including for local paths. s3dlio requires an s3:// URI and raises RuntimeError: URI must start with s3://.
Fix: add StorageType import and guard the s3dlio path on storage_type in (S3, AISTORE). LOCAL_FS falls through to the PyArrow local write path.

Fixes #391
Fixes #385

FileSystemGuy

Please review and approve: @idevasena @dslik

idevasena · 2026-06-02T19:55:06Z

@russfellows one blocking issue when testing (I pushed the change in a commit to this PR itself):

torch_data_loader.py L477: _s3_types is referenced before it's defined (L484), raising UnboundLocalError (a subclass of NameError) on every TorchDataLoader init, for all storage types and not just LOCAL_FS. This turns the #391 hang into an unconditional crash. Move the _s3_types = (StorageType.S3, StorageType.AISTORE) assignment above the use_simple_iterable_dataset block. Reproduces deterministically; see isolated repro.
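
For readers unfamiliar with this failure mode, the pattern can be shown in a few lines. This is a generic sketch with hypothetical function names and plain strings in place of StorageType, not the code from torch_data_loader.py: because the name is assigned later in the same function, Python treats it as a local, and reading it before the assignment fails.

```python
def read_buggy(storage_type):
    # s3_types is assigned further down in this function, so it is a
    # local name that has no value yet at this point.
    use_iterable = storage_type in s3_types
    s3_types = ("s3", "aistore")
    return use_iterable


def read_fixed(storage_type):
    # Bind the tuple first, then use it.
    s3_types = ("s3", "aistore")
    return storage_type in s3_types
```

Calling read_buggy with any argument raises UnboundLocalError, which is a subclass of NameError, so the failure does not depend on the storage type. Moving the assignment above its first use, as in read_fixed, resolves it.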

Before:

smrc@dskbd029:~/DLIO_local_changes_new$ uv run dlio_benchmark workload=unet3d_a100 \
  ++workload.framework=pytorch \
  ++workload.reader.data_loader=pytorch \
  ++workload.dataset.format=npz \
  ++workload.dataset.storage_type=local_fs \
  ++workload.reader.read_threads=4 \
  ++workload.workflow.generate_data=True ++workload.workflow.train=True \
  ++workload.dataset.num_files_train=2 ++workload.dataset.num_samples_per_file=2
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = 'data/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 2
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-02T19:44:31.437379 Running DLIO [Generating data & Training & Checkpointing] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-06-02T19:44:31.521799 Starting data generation
[OUTPUT] 2026-06-02T19:44:31.934390 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-02T19:44:31.958015 Model size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:44:31.958093 Total checkpoint size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:44:31.958160 Max steps per epoch: 0 = 2 * 2 / 7 / 1 (samples per file * num files / batch size / comm size)
Error executing job with overrides: ['workload=unet3d_a100', '++workload.framework=pytorch', '++workload.reader.data_loader=pytorch', 
'++workload.dataset.format=npz', '++workload.dataset.storage_type=local_fs', '++workload.reader.read_threads=4', 
'++workload.workflow.generate_data=True', '++workload.workflow.train=True', '++workload.dataset.num_files_train=2', 
'++workload.dataset.num_samples_per_file=2']
Traceback (most recent call last):
  File "/home/smrc/DLIO_local_changes_new/dlio_benchmark/main.py", line 518, in run_benchmark
    benchmark.run()
  File "/home/smrc/DLIO_local_changes_new/.venv/lib/python3.12/site-packages/dftracer/python/ai_common.py", line 170, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/DLIO_local_changes_new/dlio_benchmark/main.py", line 443, in run
    self.framework.get_loader(dataset_type=DatasetType.TRAIN).read()
  File "/home/smrc/DLIO_local_changes_new/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/DLIO_local_changes_new/dlio_benchmark/data_loader/torch_data_loader.py", line 477, in read
    and self._args.storage_type in _s3_types
                                   ^^^^^^^^^
UnboundLocalError: cannot access local variable '_s3_types' where it is not associated with a value

After the fix:

smrc@dskbd029:~/DLIO_local_changes_new$ uv run dlio_benchmark workload=unet3d_a100 \
  ++workload.framework=pytorch ++workload.reader.data_loader=pytorch \
  ++workload.dataset.format=npz ++workload.dataset.storage_type=local_fs \
  ++workload.reader.read_threads=4 \
  ++workload.workflow.generate_data=True ++workload.workflow.train=True \
  ++workload.dataset.num_files_train=2 ++workload.dataset.num_samples_per_file=2
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = 'data/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 2
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-02T19:48:13.171717 Running DLIO [Generating data & Training & Checkpointing] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-06-02T19:48:13.209262 Starting data generation
[OUTPUT] 2026-06-02T19:48:16.576561 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-02T19:48:16.607583 Model size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:48:16.607689 Total checkpoint size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:48:16.607821 Max steps per epoch: 0 = 2 * 2 / 7 / 1 (samples per file * num files / batch size / comm size)
[DATALOADER] format=npz storage=local_fs library=none
[DATALOADER]   torch_dataset=TorchDataset(map-style, 4 workers)
[DATALOADER]   reader=unknown
[DATALOADER]   sample_access=read_index (on-demand)
[OUTPUT] 2026-06-02T19:48:29.446146 Starting epoch 1: 0 steps expected
[OUTPUT] 2026-06-02T19:48:29.446944 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.072282 Ending block 1 - 0 steps completed in 10.63 s
[OUTPUT] 2026-06-02T19:48:40.084163 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.084322 Epoch 1 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.084416 Epoch 1 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.084639 Ending epoch 1 - 0 steps completed in 10.64 s
[OUTPUT] 2026-06-02T19:48:40.202745 Starting epoch 2: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.203399 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.206422 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.207701 Epoch 2 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.207758 Epoch 2 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.207802 Epoch 2 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.207887 Ending epoch 2 - 0 steps completed in 0.01 s
[OUTPUT] 2026-06-02T19:48:40.333407 Starting epoch 3: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.333984 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.335524 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.338844 Epoch 3 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.338985 Epoch 3 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.339094 Epoch 3 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.339317 Ending epoch 3 - 0 steps completed in 0.01 s
[OUTPUT] 2026-06-02T19:48:40.464225 Starting epoch 4: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.464772 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.466150 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.477473 Epoch 4 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.477687 Epoch 4 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.477871 Epoch 4 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.478347 Ending epoch 4 - 0 steps completed in 0.01 s
[OUTPUT] 2026-06-02T19:48:40.610231 Starting epoch 5: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.611079 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.614196 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.617610 Epoch 5 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.617767 Epoch 5 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.617883 Epoch 5 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.618055 Starting saving checkpoint 1 after total step 0 for epoch 5
[OUTPUT] 2026-06-02T19:48:40.643249 Finished saving checkpoint 1 for epoch 5 in 0.0252 s; Throughput: 0.0004 GiB/s
[OUTPUT] 2026-06-02T19:48:40.645826 Ending epoch 5 - 0 steps completed in 0.04 s
[OUTPUT] 2026-06-02T19:48:40.651649 Saved outputs in /home/smrc/DLIO_local_changes_new/hydra_log/unet3d/2026-06-02-19-48-13
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1 
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MiB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================

[OUTPUT] 2026-06-02T19:48:40.653198 outputs saved in RANKID_output.json

idevasena · 2026-06-02T20:33:52Z

I tested the integration with mlcommons/storage after updating pyproject.toml in the storage repo to point dlio_benchmark at the latest SHA in this PR, i.e.:

dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark.git", rev = "05b7d91d94f55d80d3fccb9c8d135a3fe8502813" }

The changes look good.

smrc@dskbd029:~/Storage_Repo_Tests/storage_June2$ ./mlpstorage training datagen \
  --model unet3d \
  --num-processes 4 \
  --params dataset.data_folder=/mnt/drives/nvme3n1/unet3d_data
2026-06-02 20:29:09|INFO: Environment validation passed
2026-06-02 20:29:09|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/unet3d/datagen/20260602_202909
2026-06-02 20:29:09|WARNING: Results directory not specified. Writing results to the system temp directory: /tmp/mlperf_storage_results. These results will NOT persist across a reboot. Use --results-dir <path> or set the MLPERF_RESULTS_DIR environment variable to save results permanently.
2026-06-02 20:29:09|INFO: MPI BTL transport: auto (OpenMPI default selection)
2026-06-02 20:29:09|STATUS: Running benchmark command:: mpirun -n 4 -host 127.0.0.1:4 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage_June2/.venv/bin/dlio_benchmark workload=unet3d_datagen ++hydra.run.dir=/tmp/mlperf_storage_results/training/unet3d/datagen/20260602_202909 ++hydra.output_subdir=dlio_config ++workload.dataset.data_folder=/mnt/drives/nvme3n1/unet3d_data --config-dir=/home/smrc/Storage_Repo_Tests/storage_June2/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/mnt/drives/nvme3n1/unet3d_data'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 168
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = False
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 1
[OUTPUT] 2026-06-02T20:29:15.943477 Running DLIO [Generating data] with 4 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-02T20:29:15.995917 Starting data generation
[OUTPUT] 2026-06-02T20:29:17.061737 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
2026-06-02 20:29:18|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/training/unet3d/datagen/20260602_202909/training_20260602_202909_metadata.json

idevasena

Approving this PR, with a note for mlcommons/storage:

A separate PR is needed to update pyproject.toml in the mlcommons/storage repo (https://github.com/mlcommons/storage/blob/main/pyproject.toml#L95). It currently points to Russ's fork of DLIO and needs to be updated to https://github.com/mlcommons/DLIO_local_changes.

russfellows · 2026-06-02T20:58:50Z

Devasena, As always, thanks for your diligence in testing, and finding that existing or new bug and fixing it. Not sure if I introduced it or not, but regardless, glad you found it and squashed it. But yes, to your point the last thing that needs to happen is to update pyproject.toml in mlc-storage to point to the correct head. Regards, —Russ


russfellows requested a review from a team May 31, 2026 23:51

russfellows mentioned this pull request Jun 2, 2026

fix: add EXIT_CODE.INTERRUPTED to resolve AttributeError on SIGTERM (fixes #392, #393) mlcommons/storage#400

Merged

FileSystemGuy approved these changes Jun 2, 2026


fix for NameError on _s3_types

05b7d91

idevasena approved these changes Jun 2, 2026


idevasena merged commit e4c9b7a into main Jun 2, 2026
7 checks passed

idevasena merged 2 commits into main from fix/local-fs-hang-and-parquet-gen
