fix: restrict TorchIterableDatasetSimple to S3/AISTORE; gate s3dlio Parquet gen on storage type (fixes #391, #385) #21

Merged: idevasena merged 2 commits into main from fix/local-fs-hang-and-parquet-gen on Jun 2, 2026

@russfellows


Summary

This PR fixes two independent bugs that affect LOCAL_FS workloads:

  1. #391: mlpstorage training run hangs silently with LOCAL_FS storage (unet3d, NPZ/NPY/JPEG/PNG formats, read_threads > 1)
  2. #385: Parquet datagen crashes with RuntimeError: URI must start with s3:// when storage_type = local_fs and parquet.use_s3dlio_gen: true

Bug 1: Training hang on LOCAL_FS (#391)

Root Cause

TorchIterableDatasetSimple was selected for all NPZ/NPY/JPEG/PNG formats
regardless of storage type. With read_threads=4 (unet3d default), PyTorch
DataLoader creates 4 worker processes via os.fork(). The fork occurs after
_local_fs_iterable_mixin.py is imported at module level, which creates a
ThreadPoolExecutor (_PREFETCH_POOL). In the forked child processes the
executor's internal state is inconsistent (the background management thread from
the parent is not replicated), causing the workers to deadlock. CPU and I/O drop
to zero indefinitely with no error message.
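
The general hazard described above can be illustrated in isolation. The following is a minimal sketch, not DLIO code: it assumes a POSIX system where os.fork() is available, and it uses a plain thread as a stand-in for an executor's background thread. The function name is hypothetical.

```python
import os
import threading
import warnings


def thread_counts_across_fork():
    """Return (parent_thread_count, child_thread_count) around an os.fork()."""
    stop = threading.Event()
    # Stand-in for an executor's background thread (assumption for illustration).
    helper = threading.Thread(target=stop.wait, daemon=True)
    helper.start()

    parent_count = threading.active_count()
    read_fd, write_fd = os.pipe()

    with warnings.catch_warnings():
        # Newer Python versions warn when a multi-threaded process forks.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: only the thread that called fork() was carried over.
        try:
            os.close(read_fd)
            os.write(write_fd, str(threading.active_count()).encode())
        finally:
            os._exit(0)

    # Parent: read the child's thread count, then clean up.
    os.close(write_fd)
    child_count = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    helper.join()
    return parent_count, child_count
```

The child sees a single thread even though the parent had at least two. Any object in the child that still expects the parent's helper thread to be running (for example, to drain a work queue) can wait forever, which is consistent with the silent hang reported in #391.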

Fix

dlio_benchmark/data_loader/torch_data_loader.py — add
and self._args.storage_type in _s3_types to the use_simple_iterable_dataset
guard so TorchIterableDatasetSimple is only used for S3/AISTORE, where the
prefetch benefit is most significant and os.fork() is not involved:

+        # TorchIterableDatasetSimple uses DataLoader(num_workers>0) which forks
+        # worker processes via os.fork(). On LOCAL_FS, this fork-after-module-import
+        # pattern causes a ThreadPoolExecutor deadlock (the executor's background
+        # thread is not fork-safe). Restrict the iterable path to object storage
+        # (S3/AISTORE) only where the prefetch benefit is most significant and
+        # the fork issue does not apply. LOCAL_FS falls through to map-style TorchDataset.
         use_simple_iterable_dataset = (
             self.format_type in _simple_iterable_formats
             and not use_rg_iterable_dataset
+            and self._args.storage_type in _s3_types
         )

LOCAL_FS now falls through to the original map-style TorchDataset path,
which does not fork and has no executor state issue.
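
The effect of the added condition can be sketched as a small standalone function. This is an illustration only: the enum below is a stand-in for dlio_benchmark.common.enumerations.StorageType (the 's3' and 'aistore' string values are assumptions), and the format set and function name are hypothetical.

```python
from enum import Enum


class StorageType(Enum):
    # Stand-in enum; only the 'local_fs' value appears in the logs of this PR.
    LOCAL_FS = "local_fs"
    S3 = "s3"
    AISTORE = "aistore"


# Hypothetical format set; the real module defines its own list.
SIMPLE_ITERABLE_FORMATS = {"npz", "npy", "jpeg", "png"}
S3_TYPES = (StorageType.S3, StorageType.AISTORE)


def use_simple_iterable(format_type, storage_type, use_rg_iterable=False):
    """Mirror the guard in the diff: iterable path only for object storage."""
    return (
        format_type in SIMPLE_ITERABLE_FORMATS
        and not use_rg_iterable
        and storage_type in S3_TYPES
    )
```

With this guard, use_simple_iterable("npz", StorageType.LOCAL_FS) is False and use_simple_iterable("npz", StorageType.S3) is True, matching the behaviour listed under Testing.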


Bug 2: Parquet datagen crashes on LOCAL_FS with use_s3dlio_gen: true (#385)

Root Cause

ParquetGenerator.generate() unconditionally called
s3dlio.generate_and_write_parquet_schema_streaming() whenever
parquet.use_s3dlio_gen: true was set in the workload config. The s3dlio
library requires a s3://-scheme URI and immediately raises
RuntimeError: URI must start with s3:// when given a local path. The
dlrm_datagen.yaml config ships with use_s3dlio_gen: true globally, so any
user running the DLRM datagen against local storage hits this crash.

Fix

dlio_benchmark/data_generator/parquet_generator.py — import StorageType and
add a storage-type guard so the s3dlio fast-path is only taken for S3/AISTORE:

-from dlio_benchmark.common.enumerations import Compression
+from dlio_benchmark.common.enumerations import Compression, StorageType

 ...

+            # Restricted to object storage (S3/AISTORE): s3dlio requires an
+            # s3:// URI and raises RuntimeError for local paths (issue #385).
+            _s3_storage = (StorageType.S3, StorageType.AISTORE)
-            if self.use_s3dlio_gen and self.parquet_columns:
+            if self.use_s3dlio_gen and self.parquet_columns and self._args.storage_type in _s3_storage:

For LOCAL_FS, execution falls through to the existing PyArrow-based local
write path, which works correctly without any URI scheme.
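
The gating in the diff can be sketched as a pure function that names which write path would be taken. This is a simplified illustration, not the ParquetGenerator code: the enum is a stand-in for the real StorageType (string values other than 'local_fs' are assumptions), and the function name and return labels are hypothetical.

```python
from enum import Enum


class StorageType(Enum):
    # Stand-in for dlio_benchmark.common.enumerations.StorageType.
    LOCAL_FS = "local_fs"
    S3 = "s3"
    AISTORE = "aistore"


def parquet_write_path(storage_type, use_s3dlio_gen, parquet_columns):
    """Return 's3dlio' only when all three conditions from the diff hold."""
    s3_storage = (StorageType.S3, StorageType.AISTORE)
    if use_s3dlio_gen and parquet_columns and storage_type in s3_storage:
        return "s3dlio"
    # Everything else, including LOCAL_FS, uses the PyArrow local write path.
    return "pyarrow"
```

Because the storage type is checked before the s3dlio call, a config that ships with use_s3dlio_gen: true no longer needs to be edited for local runs.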


Issues Fixed

  • Fixes #391 — Training hangs indefinitely with LOCAL_FS (unet3d)
  • Fixes #385 — Parquet datagen fails with RuntimeError: URI must start with s3:// on LOCAL_FS

Testing

  • 85 fast CI tests pass (uv run python -m pytest tests/test_fast_ci.py -q)
  • Logic verified: LOCAL_FS+NPZ → use_simple_iterable_dataset=False (was True), S3+NPZ → True (unchanged)
  • Logic verified: LOCAL_FS+use_s3dlio_gen=True → skip s3dlio path, S3+use_s3dlio_gen=True → use s3dlio path

Files Changed

| File | Change |
| --- | --- |
| dlio_benchmark/data_loader/torch_data_loader.py | Restrict TorchIterableDatasetSimple to S3/AISTORE only |
| dlio_benchmark/data_generator/parquet_generator.py | Gate s3dlio Parquet gen on S3/AISTORE storage type; add StorageType import |
| uv.lock | Reflect version change (3.0.1 → 3.0.2) |

Follow-up Required

After this PR merges, mlcommons/storage will need a second PR to update
pyproject.toml (rev pointing to the new commit hash here) and regenerate
uv.lock so that mlpstorage users pick up these fixes.

…arquet gen on storage type

Two LOCAL_FS bug fixes:

1. Training hang on LOCAL_FS (#391)
   TorchIterableDatasetSimple was selected for all NPZ/NPY/JPEG/PNG formats
   regardless of storage type. With read_threads>1, DataLoader forks worker
   processes after _local_fs_iterable_mixin.py imports a module-level
   ThreadPoolExecutor. The executor is not fork-safe; child processes deadlock
   silently with no error output.
   Fix: add 'and self._args.storage_type in _s3_types' to the
   use_simple_iterable_dataset guard. LOCAL_FS falls back to map-style
   TorchDataset which does not fork.

2. Parquet datagen RuntimeError on LOCAL_FS (#385)
   ParquetGenerator unconditionally called
   s3dlio.generate_and_write_parquet_schema_streaming() when
   parquet.use_s3dlio_gen=true, including for local paths. s3dlio requires
   an s3:// URI and raises RuntimeError: URI must start with s3://.
   Fix: add StorageType import and guard the s3dlio path on
   storage_type in (S3, AISTORE). LOCAL_FS falls through to the PyArrow
   local write path.

Fixes #391
Fixes #385

@FileSystemGuy FileSystemGuy left a comment


Please review and approve: @idevasena @dslik

@idevasena


@russfellows one blocking issue when testing (I pushed the change in a commit to this PR itself):

torch_data_loader.py L477: _s3_types is referenced before it is assigned (L484), raising UnboundLocalError (a subclass of NameError) on every TorchDataLoader.read() call, for all storage types, not just LOCAL_FS. This turns the #391 hang into an unconditional crash. Move the _s3_types = (StorageType.S3, StorageType.AISTORE) assignment above the use_simple_iterable_dataset block. Reproduces deterministically; see isolated repro.
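
For readers unfamiliar with this failure mode, here is a minimal standalone sketch of it. It is not the DLIO code; the function names and tuple contents are hypothetical. In Python, an assignment anywhere in a function body makes that name local to the whole function, so reading it on an earlier line fails at call time.

```python
def read_before_assignment():
    # `_s3_types` is assigned later in this function, so it is a local name
    # here and has no value yet when this line runs.
    use_iterable = "s3" in _s3_types
    _s3_types = ("s3", "aistore")
    return use_iterable


def read_after_assignment():
    # Same logic with the assignment moved above its first use.
    _s3_types = ("s3", "aistore")
    return "s3" in _s3_types


error = None
try:
    read_before_assignment()
except UnboundLocalError as exc:
    error = exc
```

The first function raises on every call regardless of its inputs, which is why the crash affected all storage types; moving the assignment up, as in the second function, is the whole fix.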

Before:

smrc@dskbd029:~/DLIO_local_changes_new$ uv run dlio_benchmark workload=unet3d_a100 \
  ++workload.framework=pytorch \
  ++workload.reader.data_loader=pytorch \
  ++workload.dataset.format=npz \
  ++workload.dataset.storage_type=local_fs \
  ++workload.reader.read_threads=4 \
  ++workload.workflow.generate_data=True ++workload.workflow.train=True \
  ++workload.dataset.num_files_train=2 ++workload.dataset.num_samples_per_file=2
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = 'data/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 2
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-02T19:44:31.437379 Running DLIO [Generating data & Training & Checkpointing] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-06-02T19:44:31.521799 Starting data generation
[OUTPUT] 2026-06-02T19:44:31.934390 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-02T19:44:31.958015 Model size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:44:31.958093 Total checkpoint size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:44:31.958160 Max steps per epoch: 0 = 2 * 2 / 7 / 1 (samples per file * num files / batch size / comm size)
Error executing job with overrides: ['workload=unet3d_a100', '++workload.framework=pytorch', '++workload.reader.data_loader=pytorch', 
'++workload.dataset.format=npz', '++workload.dataset.storage_type=local_fs', '++workload.reader.read_threads=4', 
'++workload.workflow.generate_data=True', '++workload.workflow.train=True', '++workload.dataset.num_files_train=2', 
'++workload.dataset.num_samples_per_file=2']
Traceback (most recent call last):
  File "/home/smrc/DLIO_local_changes_new/dlio_benchmark/main.py", line 518, in run_benchmark
    benchmark.run()
  File "/home/smrc/DLIO_local_changes_new/.venv/lib/python3.12/site-packages/dftracer/python/ai_common.py", line 170, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/DLIO_local_changes_new/dlio_benchmark/main.py", line 443, in run
    self.framework.get_loader(dataset_type=DatasetType.TRAIN).read()
  File "/home/smrc/DLIO_local_changes_new/.venv/lib/python3.12/site-packages/dftracer/python/common.py", line 504, in wrapper
    x = f(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^
  File "/home/smrc/DLIO_local_changes_new/dlio_benchmark/data_loader/torch_data_loader.py", line 477, in read
    and self._args.storage_type in _s3_types
                                   ^^^^^^^^^
UnboundLocalError: cannot access local variable '_s3_types' where it is not associated with a value

After the fix:

smrc@dskbd029:~/DLIO_local_changes_new$ uv run dlio_benchmark workload=unet3d_a100 \
  ++workload.framework=pytorch ++workload.reader.data_loader=pytorch \
  ++workload.dataset.format=npz ++workload.dataset.storage_type=local_fs \
  ++workload.reader.read_threads=4 \
  ++workload.workflow.generate_data=True ++workload.workflow.train=True \
  ++workload.dataset.num_files_train=2 ++workload.dataset.num_samples_per_file=2
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = 'data/unet3d'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 2
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = True
[OUTPUT]   epochs         = 5
[OUTPUT]   batch_size     = 7
[OUTPUT] 2026-06-02T19:48:13.171717 Running DLIO [Generating data & Training & Checkpointing] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-06-02T19:48:13.209262 Starting data generation
[OUTPUT] 2026-06-02T19:48:16.576561 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-02T19:48:16.607583 Model size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:48:16.607689 Total checkpoint size: 0.000010 GiB 
[OUTPUT] 2026-06-02T19:48:16.607821 Max steps per epoch: 0 = 2 * 2 / 7 / 1 (samples per file * num files / batch size / comm size)
[DATALOADER] format=npz storage=local_fs library=none
[DATALOADER]   torch_dataset=TorchDataset(map-style, 4 workers)
[DATALOADER]   reader=unknown
[DATALOADER]   sample_access=read_index (on-demand)
[OUTPUT] 2026-06-02T19:48:29.446146 Starting epoch 1: 0 steps expected
[OUTPUT] 2026-06-02T19:48:29.446944 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.072282 Ending block 1 - 0 steps completed in 10.63 s
[OUTPUT] 2026-06-02T19:48:40.084163 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.084322 Epoch 1 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.084416 Epoch 1 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.084639 Ending epoch 1 - 0 steps completed in 10.64 s
[OUTPUT] 2026-06-02T19:48:40.202745 Starting epoch 2: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.203399 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.206422 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.207701 Epoch 2 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.207758 Epoch 2 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.207802 Epoch 2 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.207887 Ending epoch 2 - 0 steps completed in 0.01 s
[OUTPUT] 2026-06-02T19:48:40.333407 Starting epoch 3: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.333984 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.335524 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.338844 Epoch 3 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.338985 Epoch 3 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.339094 Epoch 3 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.339317 Ending epoch 3 - 0 steps completed in 0.01 s
[OUTPUT] 2026-06-02T19:48:40.464225 Starting epoch 4: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.464772 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.466150 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.477473 Epoch 4 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.477687 Epoch 4 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.477871 Epoch 4 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.478347 Ending epoch 4 - 0 steps completed in 0.01 s
[OUTPUT] 2026-06-02T19:48:40.610231 Starting epoch 5: 0 steps expected
[OUTPUT] 2026-06-02T19:48:40.611079 Starting block 1
[OUTPUT] 2026-06-02T19:48:40.614196 Ending block 1 - 0 steps completed in 0.00 s
[OUTPUT] 2026-06-02T19:48:40.617610 Epoch 5 - Block 1 [Training] Accelerator Utilization [AU] (%): 0.0000
[OUTPUT] 2026-06-02T19:48:40.617767 Epoch 5 - Block 1 [Training] Throughput (samples/second): 0.0000
[OUTPUT] 2026-06-02T19:48:40.617883 Epoch 5 - Block 1 [Training] Computation time per step (second): n/a+/-n/a (metric window empty — too few steps) (set value: {'mean': 0.636})
[OUTPUT] 2026-06-02T19:48:40.618055 Starting saving checkpoint 1 after total step 0 for epoch 5
[OUTPUT] 2026-06-02T19:48:40.643249 Finished saving checkpoint 1 for epoch 5 in 0.0252 s; Throughput: 0.0004 GiB/s
[OUTPUT] 2026-06-02T19:48:40.645826 Ending epoch 5 - 0 steps completed in 0.04 s
[OUTPUT] 2026-06-02T19:48:40.651649 Saved outputs in /home/smrc/DLIO_local_changes_new/hydra_log/unet3d/2026-06-02-19-48-13
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1 
[METRIC] Training Accelerator Utilization [AU] (%): 0.0000 (0.0000)
[METRIC] Training Throughput (samples/second): 0.0000 (0.0000)
[METRIC] Training I/O Throughput (MiB/second): 0.0000 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================

[OUTPUT] 2026-06-02T19:48:40.653198 outputs saved in RANKID_output.json

@idevasena


Tested the integration with mlcommons/storage after updating pyproject.toml in the storage repo so that dlio_benchmark points to the latest SHA in this PR, i.e.:

dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark.git", rev = "05b7d91d94f55d80d3fccb9c8d135a3fe8502813" }

The changes look good.

 smrc@dskbd029:~/Storage_Repo_Tests/storage_June2$ ./mlpstorage training datagen \
  --model unet3d \
  --num-processes 4 \
  --params dataset.data_folder=/mnt/drives/nvme3n1/unet3d_data
2026-06-02 20:29:09|INFO: Environment validation passed
2026-06-02 20:29:09|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/unet3d/datagen/20260602_202909
2026-06-02 20:29:09|WARNING: Results directory not specified. Writing results to the system temp directory: /tmp/mlperf_storage_results. These results will NOT persist across a reboot. Use --results-dir <path> or set the MLPERF_RESULTS_DIR environment variable to save results permanently.
2026-06-02 20:29:09|INFO: MPI BTL transport: auto (OpenMPI default selection)
2026-06-02 20:29:09|STATUS: Running benchmark command:: mpirun -n 4 -host 127.0.0.1:4 --bind-to none --map-by socket /home/smrc/Storage_Repo_Tests/storage_June2/.venv/bin/dlio_benchmark workload=unet3d_datagen ++hydra.run.dir=/tmp/mlperf_storage_results/training/unet3d/datagen/20260602_202909 ++hydra.output_subdir=dlio_config ++workload.dataset.data_folder=/mnt/drives/nvme3n1/unet3d_data --config-dir=/home/smrc/Storage_Repo_Tests/storage_June2/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/mnt/drives/nvme3n1/unet3d_data'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 168
[OUTPUT]   record_length  = 146600628
[OUTPUT]   generate_data  = True
[OUTPUT]   do_train       = False
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 1
[OUTPUT] 2026-06-02T20:29:15.943477 Running DLIO [Generating data] with 4 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-06-02T20:29:15.995917 Starting data generation
[interleaved "Generating NPZ Data" progress bars from the 4 processes omitted; 168 files total]
[OUTPUT] 2026-06-02T20:29:17.061737 Generation done
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MiB overhead
[OUTPUT] ================================================================================
2026-06-02 20:29:18|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/training/unet3d/datagen/20260602_202909/training_20260602_202909_metadata.json

@idevasena idevasena left a comment


Approving this PR, with a note for mlcommons/storage:

A separate PR is needed to update pyproject.toml in the mlcommons/storage repo (https://github.com/mlcommons/storage/blob/main/pyproject.toml#L95).
It currently points to Russ's fork of DLIO and needs to be updated to https://github.com/mlcommons/DLIO_local_changes

@russfellows

russfellows commented Jun 2, 2026 via email


@idevasena idevasena merged commit e4c9b7a into main Jun 2, 2026
7 checks passed