mlpstorage training run --model=flux has considerably lower I/O Throughput causing train_au_meet_expectation to fail #330

Description

@ddn-kums

Hi,

A mlpstorage training run --model=flux --accelerator-type b200 .. job, even with a single accelerator, reports a very low I/O throughput of about 0.8 MB/s, which results in train_au_meet_expectation: fail.

The low throughput of mlpstorage training run --model=flux --accelerator-type b200 .. has been observed on two high-performance file systems built ONLY from NVMe SSD drives.

Performance profiling during the flux training run shows most of the time being spent in PyUnicode_FromFormatV and Parquet routines. All 8 pt_data_worker processes are 100% CPU busy, with MINIMAL I/O to the underlying storage systems hosting the training parquet files.
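
One way to confirm that a single read step is CPU-bound rather than waiting on storage is to compare CPU time with wall-clock time for that step. The sketch below is only an illustration under stated assumptions: cpu_fraction and busy are hypothetical helper names, not part of mlpstorage or DLIO, and the function you would actually time (for example, one that decodes a single generated .parquet file) has to be supplied by you.

```python
import time


def cpu_fraction(fn, *args, **kwargs):
    """Run fn once and return (cpu_seconds / wall_seconds, result).

    A ratio near 1.0 suggests the call is CPU-bound; a ratio near 0.0
    suggests it mostly waits (e.g. on storage). time.process_time()
    counts CPU time of every thread in this process, so a multi-threaded
    decoder can legitimately yield a ratio above 1.0.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return (cpu / wall if wall > 0 else 0.0), result


def busy(n):
    """Stand-in for a CPU-heavy decode step (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Demonstration with two synthetic workloads: one computes, one waits.
frac_compute, _ = cpu_fraction(busy, 1_000_000)
frac_wait, _ = cpu_fraction(time.sleep, 0.2)
print(f"compute-like call: cpu/wall = {frac_compute:.2f}")
print(f"wait-like call:    cpu/wall = {frac_wait:.2f}")
```

If the ratio for a real file read stays close to 1.0, that would be consistent with the top output below, where every pt_data_worker sits at about 100% CPU.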

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                            
 371212 nodeadm+  20   0 6769292 963864  32760 R 101.6   0.4 250:37.83 pt_data_worker                                                                                                                                                     
 371210 nodeadm+  20   0 6769268 962760  33144 R 101.3   0.4 251:53.42 pt_data_worker                                                                                                                                                     
 371211 nodeadm+  20   0 6834544 995.6m  33928 R 101.3   0.4 251:40.72 pt_data_worker                                                                                                                                                     
 371213 nodeadm+  20   0 6769304 965116  33136 R 101.3   0.4 251:20.06 pt_data_worker                                                                                                                                                     
 371206 nodeadm+  20   0 6990404   1.0g  33144 S 101.0   0.4 251:41.42 pt_data_worker                                                                                                                                                     
 371207 nodeadm+  20   0 6834496   1.0g  33928 R 101.0   0.4 252:42.40 pt_data_worker                                                                                                                                                     
 371208 nodeadm+  20   0 6834508   1.0g  33520 R 101.0   0.4 250:56.91 pt_data_worker                                                                                                                                                     
 371209 nodeadm+  20   0 6802672 991216  33136 R 101.0   0.4 250:17.03 pt_data_worker

Details

- Generate the --model=flux dataset

$ mlpstorage training datagen --hosts=srt017-e0 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --num-processes=1 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/flux_b200
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
⠙ Validating environment... 0:00:002026-04-10 21:54:34|INFO: Environment validation passed
2026-04-10 21:54:34|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/datagen/20260410_215434
2026-04-10 21:54:34|INFO: Creating data directory: /mnt/redfs/mlstorage_dd/flux_b200/flux...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/train...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/valid...
2026-04-10 21:54:34|INFO: Creating directory: /mnt/redfs/mlstorage_dd/flux_b200/flux/test...
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:002026-04-10 21:54:35|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=flux_datagen ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/flux/datagen/20260410_215434 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=2126 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/flux_b200/flux --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-10T21:54:41.987119 Running DLIO [Generating data] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================

- Verify the generated dataset

$ ls -1 *.parquet | wc -l
2126

$ ls -lh *.parquet | tail -5
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2121_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2122_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2123_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2124_of_2126.parquet
-rw-rw---- 1 nodeadmin nodeadmin 17M Apr 11 12:41 img_2125_of_2126.parquet

$ du -sh *.parquet | tail -5
17M	img_2121_of_2126.parquet
17M	img_2122_of_2126.parquet
17M	img_2123_of_2126.parquet
17M	img_2124_of_2126.parquet
17M	img_2125_of_2126.parquet

- File System 1 - Parallel File System across 72 x NVMe drives - Training I/O Throughput (MB/second): 0.8269

$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/flux_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
⠙ Validating environment... 0:00:002026-04-10 22:15:30|INFO: Environment validation passed
2026-04-10 22:15:30|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529
2026-04-10 22:15:30|INFO: Created benchmark run: training_run_flux_20260410_221529
2026-04-10 22:15:30|STATUS: Verifying benchmark run for training_run_flux_20260410_221529
..
..
⠙ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:04:33
⠹ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:04:33
[OUTPUT] 2026-04-11T11:07:18.208178 Ending block 1 - 12756 steps completed in 46301.55 s
[OUTPUT] 2026-04-11T11:07:18.219430 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 37.2174
[OUTPUT] 2026-04-11T11:07:18.219543 Epoch 1 - Block 1 [Training] Throughput (samples/second): 13.2307
[OUTPUT] 2026-04-11T11:07:18.219621 Epoch 1 - Block 1 [Training] Computation time per step (second): 1.3501+/-0.0000 (set value: {'mean': 1.35})
[OUTPUT] 2026-04-11T11:07:18.224527 Ending epoch 1 - 12756 steps completed in 46301.56 s
[OUTPUT] 2026-04-11T11:07:18.935511 Saved outputs in /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1
[METRIC] Training Accelerator Utilization [AU] (%): 37.2174 (0.0000)
[METRIC] Training Throughput (samples/second): 13.2307 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.8269 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================

[OUTPUT] 2026-04-11T11:07:18.980509 outputs saved in RANKID_output.json
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
  storage_root   = './'
  storage_options= None
  data_folder    = '/mnt/redfs/mlstorage_dd/flux_b200/flux'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 2126
  record_length  = 65536
  generate_data  = False
  do_train       = True
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 48
2026-04-11 11:07:26|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/flux/run/20260410_221529/training_20260410_221529_metadata.json
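
As a cross-check on the numbers in this log, the short calculation below reproduces the reported metrics from the other logged values. It assumes, without having confirmed it in the DLIO source, that the I/O throughput figure is samples/second multiplied by record_length and expressed in MiB/s, and that AU is the configured compute time divided by elapsed time. The function names are hypothetical.

```python
MIB = 1024 * 1024


def io_throughput_mib_s(samples_per_s, record_length):
    """Assumed relation: throughput metric = samples/s * bytes per sample."""
    return samples_per_s * record_length / MIB


def accelerator_utilization_pct(steps, compute_s_per_step, elapsed_s):
    """Assumed relation: AU = time spent in emulated compute / total time."""
    return 100.0 * steps * compute_s_per_step / elapsed_s


# Values copied from the File System 1 log above.
io = io_throughput_mib_s(samples_per_s=13.2307, record_length=65536)
au = accelerator_utilization_pct(steps=12756, compute_s_per_step=1.35,
                                 elapsed_s=46301.55)

print(f"derived I/O throughput: {io:.4f} MiB/s (log reports 0.8269)")
print(f"derived AU:             {au:.2f} %     (log reports 37.2174)")
```

The derived throughput matches the logged 0.8269 to four decimals, and the derived AU lands within a few hundredths of a percent of the logged value. This suggests the reported MB/second figure follows from sample throughput and record_length = 65536, and is not a direct measurement of bytes read from storage.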

- File System 2 - Local File System (zfs) across 12 x NVMe drives - Training I/O Throughput (MB/second): 0.8598

$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=flux --exec-type=mpi --param dataset.num_files_train=2126 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/zfs-fs1/mlstorage_dd/flux_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
⠙ Validating environment... 0:00:002026-04-11 12:47:38|INFO: Environment validation passed
2026-04-11 12:47:38|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738
2026-04-11 12:47:39|INFO: Created benchmark run: training_run_flux_20260411_124738
2026-04-11 12:47:39|STATUS: Verifying benchmark run for training_run_flux_20260411_124738
..
..
[OUTPUT] 2026-04-12T01:09:54.675389 Ending block 1 - 12756 steps completed in 44529.72 s
[OUTPUT] 2026-04-12T01:09:54.683906 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 38.6982
[OUTPUT] 2026-04-12T01:09:54.684024 Epoch 1 - Block 1 [Training] Throughput (samples/second): 13.7572
[OUTPUT] 2026-04-12T01:09:54.684086 Epoch 1 - Block 1 [Training] Computation time per step (second): 1.3501+/-0.0000 (set value: {'mean': 1.35})
[OUTPUT] 2026-04-12T01:09:54.688416 Ending epoch 1 - 12756 steps completed in 44529.73 s
[OUTPUT] 2026-04-12T01:09:55.396901 Saved outputs in /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738
[OUTPUT] Averaged metric over all steps/epochs
[METRIC] ==========================================================
[METRIC] Number of Simulated Accelerators: 1
[METRIC] Training Accelerator Utilization [AU] (%): 38.6982 (0.0000)
[METRIC] Training Throughput (samples/second): 13.7572 (0.0000)
[METRIC] Training I/O Throughput (MB/second): 0.8598 (0.0000)
[METRIC] train_au_meet_expectation: fail
[METRIC] ==========================================================
[OUTPUT] 2026-04-12T01:09:55.440256 outputs saved in RANKID_output.json
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
  storage_root   = './'
  storage_options= None
  data_folder    = '/zfs-fs1/mlstorage_dd/flux_b200/flux'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 2126
  record_length  = 65536
  generate_data  = False
  do_train       = True
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 48
2026-04-12 01:10:02|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/flux/run/20260411_124738/training_20260411_124738_metadata.json
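
To put the two runs side by side, the sketch below compares their sample throughput and estimates the ceiling this configuration could reach with one accelerator. It assumes a step can never finish faster than the configured compute time (1.35 s) and that the throughput metric equals samples/second times record_length; neither assumption has been verified against the DLIO source, and the names used are hypothetical.

```python
MIB = 1024 * 1024
RECORD_LENGTH = 65536        # bytes, from the logged config
BATCH_SIZE = 48              # from the logged config
COMPUTE_S_PER_STEP = 1.35    # "set value" in the log

# Sample throughput copied from the two [METRIC] blocks above.
runs = {
    "parallel FS, 72 x NVMe": 13.2307,
    "zfs, 12 x NVMe": 13.7572,
}


def ceiling_samples_per_s(batch_size, compute_s_per_step):
    """Best case with zero data-loading stall: one batch per compute step."""
    return batch_size / compute_s_per_step


ceiling = ceiling_samples_per_s(BATCH_SIZE, COMPUTE_S_PER_STEP)
ceiling_mib_s = ceiling * RECORD_LENGTH / MIB

print(f"ceiling: {ceiling:.2f} samples/s, {ceiling_mib_s:.2f} MiB/s")
for name, samples_per_s in runs.items():
    print(f"{name}: {samples_per_s / ceiling:.1%} of ceiling")

values = list(runs.values())
rel_diff = (max(values) - min(values)) / min(values)
print(f"relative difference between the two runs: {rel_diff:.1%}")
```

Under these assumptions the two file systems differ by only about 4%, even though their drive counts differ by 6x, which is consistent with a bottleneck in the data workers and not in storage. The same arithmetic suggests that even at 100% AU a single accelerator would report only a few MB/s by this metric.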

Labels: bug