Description
I can't seem to get dlrm_datagen.yaml to work.
unet3d/a100 works fine, so I am pretty sure that my basic setup is correct.
However, when dlrm_b200_datagen or dlrm_mi355_datagen starts, it creates the directories but then simply runs, without writing any files, until it crashes with an Out-Of-Memory error.
Could you please let me know if you see something obvious that I am doing incorrectly?
Thank you
==========================================================================================================================================================================
HOSTS=172.16.4.101
HOSTMEMGB=1024
NUMHOSTS=1
PROCS_PER_HOST=64
NPROC=64
GPU=a100
MODEL=unet3d
DATADIR=/mnt/mlperf/unet3d
RESULTS=/mnt/mlperf/results
N=37500
SUBF=4
(.venv) ptc2 ir@init-ptc-rb07-u28:~/mlperf/storage 102# /home/ir/mlperf/storage/.venv/bin/mlpstorage training datagen \
--hosts $HOSTS \
--model $MODEL \
--num-processes $NPROC \
--data-dir $DATADIR \
--results-dir $RESULTS \
--param dataset.num_files_train=${N} \
--param dataset.num_subfolders_train=${SUBF} \
--mpi-params "-genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug"
Hosts is: ['172.16.4.101']
Hosts is: ['172.16.4.101']
⠴ Validating environment... 0:00:00
2026-03-17 10:38:56|INFO: Environment validation passed
2026-03-17 10:38:56|STATUS: Benchmark results directory: /mnt/mlperf/results/training/unet3d/datagen/20260317_103856
2026-03-17 10:38:56|INFO: Creating directory: /mnt/mlperf/unet3d/train...
2026-03-17 10:38:56|INFO: Creating directory: /mnt/mlperf/unet3d/valid...
2026-03-17 10:38:56|INFO: Creating directory: /mnt/mlperf/unet3d/test...
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:00
2026-03-17 10:38:56|STATUS: Running benchmark command:: mpirun -n 64 -host 172.16.4.101:64 --bind-to none --map-by socket -genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=unet3d_datagen ++hydra.run.dir=/mnt/mlperf/results/training/unet3d/datagen/20260317_103856 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=37500 ++workload.dataset.num_subfolders_train=4 ++workload.dataset.data_folder=/mnt/mlperf/unet3d --config-dir=/home/ir/mlperf/storage/configs/dlio
[OUTPUT] 2026-03-17T10:39:04.038650 Running DLIO [Generating data] with 64 process(es) [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T10:39:04.042688 Metric calculation will exclude the beginning 1 and end 0 steps, only includes 584 steps. [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py:91]
[OUTPUT] 2026-03-17T10:39:04.055989 Starting data generation [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T10:39:04.063643 Generating dataset in /mnt/mlperf/unet3d/train and /mnt/mlperf/unet3d/valid [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2026-03-17T10:39:04.063696 Number of files for training dataset: 37500 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:79]
[INFO] 2026-03-17T10:39:04.063726 Number of files for validation dataset: 0 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:80]
[INFO] 2026-03-17T11:33:06.388319 Generating NPZ Data: [============================================================>] 99.8% 37441 of 37500 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:311]
[OUTPUT] 2026-03-17T11:33:28.938898 Generation done [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
2026-03-17 11:33:31|STATUS: Writing metadata for benchmark to: /mnt/mlperf/results/training/unet3d/datagen/20260317_103856/training_20260317_103856_metadata.json
==========================================================================================================================================================================
HOSTS=172.16.4.101
HOSTMEMGB=1024
NUMHOSTS=1
PROCS_PER_HOST=2
NPROC=2
GPU=b200
MODEL=dlrm
DATADIR=/mnt/mlperf/dlrm
RESULTS=/mnt/mlperf/results
N=1530
SUBF=1
(.venv) ptc2 ir@init-ptc-rb07-u28:~/mlperf/storage 102# /home/ir/mlperf/storage/.venv/bin/mlpstorage training datagen \
--hosts $HOSTS \
--model $MODEL \
--num-processes $NPROC \
--data-dir $DATADIR \
--results-dir $RESULTS \
--param dataset.num_files_train=${N} \
--param dataset.num_subfolders_train=${SUBF} \
--mpi-params "-genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug" 2>&1 | tee ${RESULTS}/mlperf_training_datagen_${MODEL}${GPU}${N}files_${SUBF}_getnow.out
2026-03-17 12:12:11|STATUS: Validating environment......
2026-03-17 12:12:11|INFO: Environment validation passed
2026-03-17 12:12:11|STATUS: Benchmark results directory: /mnt/mlperf/results/training/dlrm/datagen/20260317_121211
2026-03-17 12:12:11|INFO: Creating directory: /mnt/mlperf/dlrm/train...
2026-03-17 12:12:11|INFO: Creating directory: /mnt/mlperf/dlrm/valid...
2026-03-17 12:12:11|INFO: Creating directory: /mnt/mlperf/dlrm/test...
2026-03-17 12:12:11|STATUS: Stage 1/4: Validating environment......
2026-03-17 12:12:11|STATUS: Stage 2/4: Collecting cluster info......
2026-03-17 12:12:11|STATUS: Stage 3/4: Running benchmark......
2026-03-17 12:12:11|STATUS: Running benchmark command:: mpirun -n 2 -host 172.16.4.101:2 --bind-to none --map-by socket -genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_121211 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
[OUTPUT] 2026-03-17T12:12:17.258755 Running DLIO [Generating data] with 2 process(es) [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T12:12:17.277080 Metric calculation will exclude the beginning 1 and end 0 steps, only includes 3208642559 steps. [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py:91]
[OUTPUT] 2026-03-17T12:12:17.279555 Starting data generation [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T12:12:17.285033 Generating dataset in /mnt/mlperf/dlrm/train and /mnt/mlperf/dlrm/valid [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2026-03-17T12:12:17.285071 Number of files for training dataset: 1530 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:79]
[INFO] 2026-03-17T12:12:17.285096 Number of files for validation dataset: 0 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:80]
[INFO]
====
ir@init-ptc-rb07-u30:/mnt/mlperf/dlrm 101# ps -eaf | grep dlio
ir 13186 13185 99 12:12 ? 00:05:36 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_121211 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir 13187 13185 99 12:12 ? 00:05:37 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_121211 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir 13382 6936 0 12:17 pts/0 00:00:00 grep --color=auto dlio
====
ir@init-ptc-rb07-u30:/mnt/mlperf/dlrm 101# ls -laR /mnt/mlperf/dlrm
/mnt/mlperf/dlrm:
total 0
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 .
drwxrwxrwx 6 root root 0 Mar 17 12:12 ..
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 test
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 train
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 valid
/mnt/mlperf/dlrm/test:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 ..
/mnt/mlperf/dlrm/train:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 ..
/mnt/mlperf/dlrm/valid:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 12:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 12:12 ..
ir@init-ptc-rb07-u30:/mnt/mlperf/dlrm 101#
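The two "Metric calculation" log lines differ by several orders of magnitude: 584 steps for unet3d against 3208642559 steps for dlrm. The sketch below is only an arithmetic check of those two numbers. The formula is an assumption reverse-engineered from the logs, not taken from the DLIO source, and the ratio `samples_per_file / batch_size` (hypothetical parameter `samples_over_batch`) is simply the value that reproduces each logged count. If the assumption holds, the dlrm workload implies 2**22 samples per file per unit of batch size, which may or may not be related to the memory growth.

```python
def logged_steps(num_files, samples_over_batch, num_procs, excluded_steps=1):
    """Steps per rank as they appear in the log line.

    ASSUMPTION: steps = floor(num_files * samples_per_file / (batch_size * num_procs))
    minus the excluded warm-up steps. This is inferred from the logs only.
    """
    total_per_rank = (num_files * samples_over_batch) // num_procs
    return total_per_rank - excluded_steps


# unet3d run: 37500 files, 64 processes, ratio 1 reproduces the logged value.
print(logged_steps(37500, 1, 64))      # 584

# dlrm run: 1530 files, 2 processes, ratio 2**22 reproduces the logged value.
print(logged_steps(1530, 2**22, 2))    # 3208642559
```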
==========================================================================================================================================================================
HOSTS=172.16.4.101
HOSTMEMGB=1024
NUMHOSTS=1
PROCS_PER_HOST=2
NPROC=2
GPU=mi355
MODEL=dlrm
DATADIR=/mnt/mlperf/dlrm
RESULTS=/mnt/mlperf/results
N=1530
SUBF=1
(.venv) ptc2 ir@init-ptc-rb07-u28:~/mlperf/storage 102# /home/ir/mlperf/storage/.venv/bin/mlpstorage training datagen \
--hosts $HOSTS \
--model $MODEL \
--num-processes $NPROC \
--data-dir $DATADIR \
--results-dir $RESULTS \
--param dataset.num_files_train=${N} \
--param dataset.num_subfolders_train=${SUBF} \
--mpi-params "-genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug" 2>&1 | tee ${RESULTS}/mlperf_training_datagen_${MODEL}${GPU}${N}files_${SUBF}directories_getnow.out
2026-03-17 13:12:36|STATUS: Validating environment......
2026-03-17 13:12:37|INFO: Environment validation passed
2026-03-17 13:12:37|STATUS: Benchmark results directory: /mnt/mlperf/results/training/dlrm/datagen/20260317_131236
2026-03-17 13:12:37|INFO: Creating directory: /mnt/mlperf/dlrm/train...
2026-03-17 13:12:37|INFO: Creating directory: /mnt/mlperf/dlrm/valid...
2026-03-17 13:12:37|INFO: Creating directory: /mnt/mlperf/dlrm/test...
2026-03-17 13:12:37|STATUS: Stage 1/4: Validating environment......
2026-03-17 13:12:37|STATUS: Stage 2/4: Collecting cluster info......
2026-03-17 13:12:37|STATUS: Stage 3/4: Running benchmark......
2026-03-17 13:12:37|STATUS: Running benchmark command:: mpirun -n 2 -host 172.16.4.101:2 --bind-to none --map-by socket -genv PMI_VERSION=2 -genv FI_PROVIDER=tcp -genv FI_TCP_IFACE=ens8f0np0 -genv TF_ENABLE_ONEDNN_OPTS=0 -genv DLIO_LOG_LEVEL debug /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_131236 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
[OUTPUT] 2026-03-17T13:12:42.543042 Running DLIO [Generating data] with 2 process(es) [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T13:12:42.551877 Metric calculation will exclude the beginning 1 and end 0 steps, only includes 3208642559 steps. [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/statscounter.py:91]
[OUTPUT] 2026-03-17T13:12:42.595172 Starting data generation [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/utils/utility.py:85]
[INFO] 2026-03-17T13:12:42.600840 Generating dataset in /mnt/mlperf/dlrm/train and /mnt/mlperf/dlrm/valid [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2026-03-17T13:12:42.600906 Number of files for training dataset: 1530 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:79]
[INFO] 2026-03-17T13:12:42.600934 Number of files for validation dataset: 0 [/home/ir/mlperf/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_generator/data_generator.py:80]
[INFO]
====
ir@init-ptc-rb07-u30:/mnt/mlperf 101# mon dlio
ir 13639 13638 99 13:12 ? 00:00:37 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_131236 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir 13640 13638 99 13:12 ? 00:00:37 /home/ir/mlperf/storage/.venv/bin/python /home/ir/mlperf/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/mnt/mlperf/results/training/dlrm/datagen/20260317_131236 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=1530 ++workload.dataset.num_subfolders_train=1 ++workload.dataset.data_folder=/mnt/mlperf/dlrm --config-dir=/home/ir/mlperf/storage/configs/dlio
ir@init-ptc-rb07-u30:/mnt/mlperf 101#
====
ptc2 ir@init-ptc-rb07-u28:/mnt/mlperf/dlrm 102# ls -laR
.:
total 0
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 .
drwxrwxrwx 6 root root 0 Mar 17 13:12 ..
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 test
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 train
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 valid
./test:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 ..
./train:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 ..
./valid:
total 0
drwxrwxr-x 2 ir ir 0 Mar 17 13:12 .
drwxrwxr-x 5 ir ir 0 Mar 17 13:12 ..
ptc2 ir@init-ptc-rb07-u28:/mnt/mlperf/dlrm 102#
==========================================================================================================================================================================
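Since both dlrm runs sit at 99% CPU with empty output directories until the OOM, it may help to record how the resident memory of the two dlio_benchmark processes grows over time. The following is a minimal sketch for Linux that polls /proc; the function names, the default pattern, and the sampling interval are illustrative choices, not part of mlpstorage or DLIO.

```python
import os
import time


def parse_vmrss_kib(status_text):
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads and exited processes have no VmRSS line


def find_pids(pattern):
    """Return PIDs whose command line contains `pattern` (Linux /proc only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit() or int(entry) == os.getpid():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if pattern in cmdline:
            pids.append(int(entry))
    return pids


def sample_rss(pattern="dlio_benchmark", samples=10, interval_s=30.0):
    """Print the resident set size of matching processes every `interval_s` seconds."""
    for _ in range(samples):
        for pid in find_pids(pattern):
            try:
                with open(f"/proc/{pid}/status") as fh:
                    rss_kib = parse_vmrss_kib(fh.read())
            except OSError:
                continue  # process ended between the scan and the read
            if rss_kib is not None:
                stamp = time.strftime("%H:%M:%S")
                print(f"{stamp} pid={pid} rss={rss_kib / 2**20:.2f} GiB")
        time.sleep(interval_s)
```

Calling `sample_rss()` on the host while datagen is running would print one line per process per interval; a steadily rising value before any file appears in the train directory would suggest the generator is building data in memory before its first write.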