Possible issue with dlrm_datagen.yaml #275

@lou-lydiksen-purestorage

Description

I can't get dlrm_datagen.yaml to work.
Datagen for unet3d/a100 completes fine, so I am fairly sure that my basic setup is correct.
However, with either dlrm_b200_datagen or dlrm_mi355_datagen, the run starts and creates the directories, but then it simply keeps running until it crashes with an Out-Of-Memory error. The directory listings below show that no files had been written while the processes were running.
Could you please let me know if you see anything obvious that I am doing incorrectly?

Thank you

==========================================================================================================================================================================

HOSTS=172.16.4.101

HOSTMEMGB=1024

NUMHOSTS=1

PROCS_PER_HOST=64

NPROC=64

GPU=a100

MODEL=unet3d

DATADIR=/mnt/mlperf/unet3d

RESULTS=/mnt/mlperf/results

N=37500

SUBF=4

(.venv) ptc2 ir@init-ptc-rb07-u28:~/mlperf/storage 102# /home/ir/mlperf/storage/.venv/bin/mlpstorage training datagen \
--hosts $HOSTS \
--model $MODEL \
--num-processes $NPROC \
--data-dir $DATADIR \
--results-dir $RESULTS \
--param dataset.num_files_train=${N} \
--param dataset.num_subfolders_train=${SUBF} \
--mpi-params "-genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug"
Hosts is: ['172.16.4.101']
Hosts is: ['172.16.4.101']
2026-03-17 10:38:56|INFO: Environment validation passed
2026-03-17 10:38:56|STATUS: Benchmark results directory: /mnt/mlperf/results/training/unet3d/datagen/20260317_103856
2026-03-17 10:38:56|INFO: Creating directory: /mnt/mlperf/unet3d/train...
2026-03-17 10:38:56|INFO: Creating directory: /mnt/mlperf/unet3d/valid...
2026-03-17 10:38:56|INFO: Creating directory: /mnt/mlperf/unet3d/test...
2026-03-17 10:38:56|STATUS: Running benchmark command:: mpirun -n 64 -host 172.16.4.101:64 --bind-to none --map-by socket -genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=unet3d_datagen ++hydra.run.dir=/mnt/mlperf/results/training/unet3d/datagen/20260317_103856 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=37500 ++workload.dataset.num_subfolders_train=4 ++workload.dataset.data_folder=/mnt/mlperf/unet3d --config-dir=/home/ir/mlperf/storage/configs/dlio
[OUTPUT] 2026-03-17T10:39:04.038650 Running DLIO [Generating data] with 64 process(es) [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T10:39:04.042688 Metric calculation will exclude the beginning 1 and end 0 steps, only includes 584 steps. [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py:91]
[OUTPUT] 2026-03-17T10:39:04.055989 Starting data generation [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T10:39:04.063643 Generating dataset in /mnt/mlperf/unet3d/train and /mnt/mlperf/unet3d/valid [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2026-03-17T10:39:04.063696 Number of files for training dataset: 37500 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:79]
[INFO] 2026-03-17T10:39:04.063726 Number of files for validation dataset: 0 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:80]
[INFO] 2026-03-17T11:33:06.388319 Generating NPZ Data: [============================================================>] 99.8% 37441 of 37500 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:311]
[OUTPUT] 2026-03-17T11:33:28.938898 Generation done [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
2026-03-17 11:33:31|STATUS: Writing metadata for benchmark to: /mnt/mlperf/results/training/unet3d/datagen/20260317_103856/training_20260317_103856_metadata.json

==========================================================================================================================================================================

HOSTS=172.16.4.101

HOSTMEMGB=1024

NUMHOSTS=1

PROCS_PER_HOST=2

NPROC=2

GPU=b200

MODEL=dlrm

DATADIR=/mnt/mlperf/dlrm

RESULTS=/mnt/mlperf/results

N=1530

SUBF=1

(.venv) ptc2 ir@init-ptc-rb07-u28:~/mlperf/storage 102# /home/ir/mlperf/storage/.venv/bin/mlpstorage training datagen \
--hosts $HOSTS \
--model $MODEL \
--num-processes $NPROC \
--data-dir $DATADIR \
--results-dir $RESULTS \
--param dataset.num_files_train=${N} \
--param dataset.num_subfolders_train=${SUBF} \
--mpi-params "-genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug" 2>&1 | tee ${RESULTS}/mlperf_training_datagen_${MODEL}${GPU}${N}files_${SUBF}_getnow.out
2026-03-17 12:12:11|STATUS: Validating environment......
2026-03-17 12:12:11|INFO: Environment validation passed
2026-03-17 12:12:11|STATUS: Benchmark results directory: /mnt/mlperf/results/training/dlrm/datagen/20260317_121211
2026-03-17 12:12:11|INFO: Creating directory: /mnt/mlperf/dlrm/train...
2026-03-17 12:12:11|INFO: Creating directory: /mnt/mlperf/dlrm/valid...
2026-03-17 12:12:11|INFO: Creating directory: /mnt/mlperf/dlrm/test...
2026-03-17 12:12:11|STATUS: Stage 1/4: Validating environment......
2026-03-17 12:12:11|STATUS: Stage 2/4: Collecting cluster info......
2026-03-17 12:12:11|STATUS: Stage 3/4: Running benchmark......
2026-03-17 12:12:11|STATUS: Running benchmark command:: mpirun -n 2 -host 172.16.4.101:2 --bind-to none --map-by socket -genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_121211 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
[OUTPUT] 2026-03-17T12:12:17.258755 Running DLIO [Generating data] with 2 process(es) [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T12:12:17.277080 Metric calculation will exclude the beginning 1 and end 0 steps, only includes 3208642559 steps. [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py:91]
[OUTPUT] 2026-03-17T12:12:17.279555 Starting data generation [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T12:12:17.285033 Generating dataset in /mnt/mlperf/dlrm/train and /mnt/mlperf/dlrm/valid [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2026-03-17T12:12:17.285071 Number of files for training dataset: 1530 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:79]
[INFO] 2026-03-17T12:12:17.285096 Number of files for validation dataset: 0 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:80]
[INFO]

====

ir@init-ptc-rb07-u30:/mnt/mlperf/dlrm 101# ps -eaf | grep dlio
ir 13186 13185 99 12:12 ? 00:05:36 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_121211 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir 13187 13185 99 12:12 ? 00:05:37 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_121211 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir 13382 6936 0 12:17 pts/0 00:00:00 grep --color=auto dlio
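
The ps output above shows both ranks at 99% CPU but says nothing about memory. Since the run ends in an Out-Of-Memory crash, it may help to watch resident memory grow while datagen is running. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names are hypothetical and this is not part of mlpstorage or DLIO.

```python
import os


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (KiB) from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def dlio_rss(proc_root="/proc", needle="dlio_benchmark"):
    """Map PID -> resident set size (KiB) for processes whose command line contains needle."""
    usage = {}
    if not os.path.isdir(proc_root):
        return usage
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            # cmdline is NUL-separated; join the arguments with spaces
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if needle not in cmdline:
                continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                rss = parse_vmrss_kib(f.read())
        except OSError:
            continue  # process exited or is not readable
        if rss is not None:
            usage[int(entry)] = rss
    return usage


if __name__ == "__main__":
    for pid, kib in sorted(dlio_rss().items()):
        print(f"{pid}: {kib / 1024**2:.2f} GiB resident")
```

Running this every few seconds on the host would show whether each rank's memory climbs steadily toward the 1024 GB limit before any file appears in the train directory.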

====

ir@init-ptc-rb07-u30:/mnt/mlperf/dlrm 101# ls -laR /mnt/mlperf/dlrm
/mnt/mlperf/dlrm:
total 0
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 .
drwxrwxrwx 6 root root 0 Mar 17 12:12 ..
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 test
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 train
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 valid

/mnt/mlperf/dlrm/test:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 ..

/mnt/mlperf/dlrm/train:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 ..

/mnt/mlperf/dlrm/valid:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 ..
ir@init-ptc-rb07-u30:/mnt/mlperf/dlrm 101#

==========================================================================================================================================================================

HOSTS=172.16.4.101

HOSTMEMGB=1024

NUMHOSTS=1

PROCS_PER_HOST=2

NPROC=2

GPU=mi355

MODEL=dlrm

DATADIR=/mnt/mlperf/dlrm

RESULTS=/mnt/mlperf/results

N=1530

SUBF=1

(.venv) ptc2 ir@init-ptc-rb07-u28:~/mlperf/storage 102# /home/ir/mlperf/storage/.venv/bin/mlpstorage training datagen \
--hosts $HOSTS \
--model $MODEL \
--num-processes $NPROC \
--data-dir $DATADIR \
--results-dir $RESULTS \
--param dataset.num_files_train=${N} \
--param dataset.num_subfolders_train=${SUBF} \
--mpi-params "-genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug" 2>&1 | tee ${RESULTS}/mlperf_training_datagen_${MODEL}${GPU}${N}files_${SUBF}directories_getnow.out
2026-03-17 13:12:36|STATUS: Validating environment......
2026-03-17 13:12:37|INFO: Environment validation passed
2026-03-17 13:12:37|STATUS: Benchmark results directory: /mnt/mlperf/results/training/dlrm/datagen/20260317_131236
2026-03-17 13:12:37|INFO: Creating directory: /mnt/mlperf/dlrm/train...
2026-03-17 13:12:37|INFO: Creating directory: /mnt/mlperf/dlrm/valid...
2026-03-17 13:12:37|INFO: Creating directory: /mnt/mlperf/dlrm/test...
2026-03-17 13:12:37|STATUS: Stage 1/4: Validating environment......
2026-03-17 13:12:37|STATUS: Stage 2/4: Collecting cluster info......
2026-03-17 13:12:37|STATUS: Stage 3/4: Running benchmark......
2026-03-17 13:12:37|STATUS: Running benchmark command:: mpirun -n 2 -host 172.16.4.101:2 --bind-to none --map-by socket -genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_131236 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
[OUTPUT] 2026-03-17T13:12:42.543042 Running DLIO [Generating data] with 2 process(es) [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T13:12:42.551877 Metric calculation will exclude the beginning 1 and end 0 steps, only includes 3208642559 steps. [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py:91]
[OUTPUT] 2026-03-17T13:12:42.595172 Starting data generation [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T13:12:42.600840 Generating dataset in /mnt/mlperf/dlrm/train and /mnt/mlperf/dlrm/valid [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2026-03-17T13:12:42.600906 Number of files for training dataset: 1530 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:79]
[INFO] 2026-03-17T13:12:42.600934 Number of files for validation dataset: 0 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:80]
[INFO]

====

ir@init-ptc-rb07-u30:/mnt/mlperf 101# mon dlio
ir 13639 13638 99 13:12 ? 00:00:37 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_131236 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir 13640 13638 99 13:12 ? 00:00:37 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_131236 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir@init-ptc-rb07-u30:/mnt/mlperf 101#

====

ptc2 ir@init-ptc-rb07-u28:/mnt/mlperf/dlrm 102# ls -laR
.:
total 0
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 .
drwxrwxrwx 6 root root 0 Mar 17 13:12 ..
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 test
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 train
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 valid

./test:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 ..

./train:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 ..

./valid:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 ..
ptc2 ir@init-ptc-rb07-u28:/mnt/mlperf/dlrm 102#
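
One detail in the logs above may be a clue: the statscounter line reports 584 steps for unet3d but 3208642559 steps for dlrm, even though dlrm was given far fewer files. The sketch below checks a guessed formula against both numbers. The formula itself, the batch size of 1, and the samples-per-file values are assumptions chosen because they reproduce the logged step counts; they were not read from dlrm_datagen.yaml or from the DLIO source.

```python
def steps_reported(num_files, samples_per_file, nproc, batch_size=1, excluded=1):
    """Assumed formula: total samples split across ranks and batches,
    minus the steps the log says are excluded (beginning 1, end 0)."""
    return (num_files * samples_per_file) // (batch_size * nproc) - excluded


# unet3d run: 37500 files, 64 processes, assuming 1 sample per file
unet3d_steps = steps_reported(num_files=37500, samples_per_file=1, nproc=64)

# dlrm run: 1530 files, 2 processes, assuming 2**22 = 4194304 samples per file
dlrm_steps = steps_reported(num_files=1530, samples_per_file=2**22, nproc=2)

print(unet3d_steps)  # 584, matches the unet3d log
print(dlrm_steps)    # 3208642559, matches both dlrm logs
```

If the assumption holds, each dlrm file would contain about 4.2 million samples. A generator that builds a whole file in memory before writing it could then use a lot of RAM per process without any file appearing on disk, which would be consistent with the empty directories and the eventual Out-Of-Memory crash.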

==========================================================================================================================================================================
