
mlpstorage training run --model=dlrm fails with job being aborted errors (signal 9) #329

Description

@ddn-kums

Hi, I generated the dlrm dataset successfully; however, the training run against the --file backend fails with signal 9 (Killed) errors.

Details

  • dlrm datagen completes successfully, and the generated dataset has been verified
2026-04-10 14:02:07|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/dlrm/datasize/20260410_140205/training_20260410_140205_metadata.json
(mlpstorage) nodeadmin@srt017:/work/kums/mlstorage_v3/storage$ mlpstorage training datagen --hosts=srt017-e0 --model=dlrm --exec-type=mpi --param dataset.num_files_train=369 --num-processes=1 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/dlrm_b200
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-10 14:03:32|INFO: Environment validation passed
2026-04-10 14:03:32|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/dlrm/datagen/20260410_140331
2026-04-10 14:03:32|INFO: Creating data directory: /mnt/redfs/mlstorage_dd/dlrm_b200/dlrm...
2026-04-10 14:03:32|INFO: Creating directory: /mnt/redfs/mlstorage_dd/dlrm_b200/dlrm/train...
2026-04-10 14:03:32|INFO: Creating directory: /mnt/redfs/mlstorage_dd/dlrm_b200/dlrm/valid...
2026-04-10 14:03:32|INFO: Creating directory: /mnt/redfs/mlstorage_dd/dlrm_b200/dlrm/test...
2026-04-10 14:03:32|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=dlrm_datagen ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/dlrm/datagen/20260410_140331 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=369 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-10T14:03:35.208122 Running DLIO [Generating data] with 1 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
[OUTPUT] 2026-04-10T14:03:35.264010 Starting data generation   
[OUTPUT] 2026-04-10T20:49:09.941069 Generation done
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT] ================================================================================
  storage_root   = './'
  storage_options= None
  data_folder    = '/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 369
  record_length  = 761
  generate_data  = True
  do_train       = False
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 1
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
2026-04-10 20:49:10|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/dlrm/datagen/20260410_140331/training_20260410_140331_metadata.json
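For reference, the datagen wall time can be read off the two `[OUTPUT]` timestamps in the log above. A minimal sketch of that arithmetic, assuming the "Starting data generation" and "Generation done" lines bracket the whole generation phase:

```python
from datetime import datetime

# Timestamps copied from the "[OUTPUT]" lines of the datagen log above.
start = datetime.fromisoformat("2026-04-10T14:03:35.264010")
end = datetime.fromisoformat("2026-04-10T20:49:09.941069")

elapsed = end - start
num_files = 369  # dataset.num_files_train from the datagen command line

print(f"datagen wall time: {elapsed}")
print(f"average per file:  {elapsed.total_seconds() / num_files:.1f} s")
```

The per-file average of roughly one minute lines up with the file modification times in the `ls` listing in this report.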
  • The generated dlrm dataset has been checked and looks good
$ pwd
/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm/train

# Num parquet files
$ ls -lh *.parquet | wc -l
369

# Dataset size
$ du -s .
1113379272	.

# Size of each parquet file
$ ls -lh *.parquet | tail -5
-rw-rw---- 1 nodeadmin nodeadmin 2.9G Apr 10 20:44 img_364_of_369.parquet
-rw-rw---- 1 nodeadmin nodeadmin 2.9G Apr 10 20:45 img_365_of_369.parquet
-rw-rw---- 1 nodeadmin nodeadmin 2.9G Apr 10 20:46 img_366_of_369.parquet
-rw-rw---- 1 nodeadmin nodeadmin 2.9G Apr 10 20:48 img_367_of_369.parquet
-rw-rw---- 1 nodeadmin nodeadmin 2.9G Apr 10 20:49 img_368_of_369.parquet
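As a cross-check of the numbers above, here is a small sketch that converts the `du -s` total and compares it with the per-file size and with the declared host memory. It assumes `du` reported 1 KiB blocks (the GNU default) and that the `247` passed to `--client-host-memory-in-gb` in the training commands of this report means 10**9-byte gigabytes; neither is confirmed by the logs.

```python
# Values copied from the shell output above.
du_kib = 1_113_379_272  # `du -s .`, assuming GNU du's default 1 KiB units
num_files = 369         # `ls -lh *.parquet | wc -l`

# From --client-host-memory-in-gb in the training commands of this report;
# treating "GB" as 10**9 bytes is an assumption.
host_mem_gb = 247

total_bytes = du_kib * 1024
total_gib = total_bytes / 2**30
per_file_gib = total_gib / num_files
dataset_to_ram = total_bytes / (host_mem_gb * 1e9)

print(f"dataset size:  {total_gib:.1f} GiB ({total_bytes / 1e12:.2f} TB)")
print(f"per file:      {per_file_gib:.2f} GiB")
print(f"dataset / RAM: {dataset_to_ram:.1f}x")
```

Under those assumptions the average file size agrees with the 2.9G shown by `ls -lh`, and the dataset is several times larger than the declared host memory.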
  • mlpstorage training run --model=dlrm with a single MPI process fails with the error: exited on signal 9 (Killed)

Expt 1 - --data-dir=/mnt/redfs/mlstorage_dd/dlrm_b200

$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=dlrm --exec-type=mpi --param dataset.num_files_train=369 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/dlrm_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-10 20:58:15|INFO: Environment validation passed
2026-04-10 20:58:15|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/dlrm/run/20260410_205815
2026-04-10 20:58:16|INFO: Created benchmark run: training_run_dlrm_20260410_205815
2026-04-10 20:58:16|STATUS: Verifying benchmark run for training_run_dlrm_20260410_205815
2026-04-10 20:58:16|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-10 20:58:16|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 369 (Parameter: Overrode Parameters)
2026-04-10 20:58:16|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='dlrm', run_datetime='20260410_205815')])
2026-04-10 20:58:16|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2026-04-10 20:58:17|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=dlrm_b200 ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/dlrm/run/20260410_205815 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=369 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-10T20:58:20.007953 Running DLIO [Training] with 1 process(es)
[OUTPUT] 2026-04-10T20:58:21.648996 Max steps per epoch: 141696 = 4718592 * 369 / 12288 / 1 (samples per file * num files / batch size / comm size)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
mpirun noticed that process rank 0 with PID 0 on node srt017 exited on signal 9 (Killed).
  storage_root   = './'
  storage_options= None
  data_folder    = '/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 369
  record_length  = 761
  generate_data  = False
  do_train       = True
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 12288
--------------------------------------------------------------------------
2026-04-10 21:04:31|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/dlrm/run/20260410_205815/training_20260410_205815_metadata.json
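The "Max steps per epoch" line in the log above spells out its own formula. A minimal sketch reproducing it from the logged values, assuming integer division is what DLIO applies (the log does not say how a remainder would be handled, and here there is none):

```python
# Values copied from the "Max steps per epoch" log line and the config dump.
samples_per_file = 4_718_592
num_files_train = 369
batch_size = 12_288
comm_size = 1  # one MPI process (mpirun -n 1)

total_samples = samples_per_file * num_files_train
steps_per_epoch = total_samples // batch_size // comm_size

print(f"total samples:   {total_samples:,}")
print(f"steps per epoch: {steps_per_epoch:,}")
```

The result matches the 141696 steps reported by DLIO, so the file count and batch size were picked up as intended before the process was killed.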

Expt 2 - Adjusting --data-dir to include the dlrm subdirectory: --data-dir=/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm

$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=dlrm --exec-type=mpi --param dataset.num_files_train=369 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-10 21:08:08|INFO: Environment validation passed
2026-04-10 21:08:08|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/dlrm/run/20260410_210808
2026-04-10 21:08:09|INFO: Created benchmark run: training_run_dlrm_20260410_210808
2026-04-10 21:08:09|STATUS: Verifying benchmark run for training_run_dlrm_20260410_210808
2026-04-10 21:08:09|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-10 21:08:09|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 369 (Parameter: Overrode Parameters)
2026-04-10 21:08:09|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='dlrm', run_datetime='20260410_210808')])
2026-04-10 21:08:09|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2026-04-10 21:08:09|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=dlrm_b200 ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/dlrm/run/20260410_210808 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=369 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-10T21:08:17.042394 Running DLIO [Training] with 1 process(es)
[OUTPUT] 2026-04-10T21:08:17.066619 Max steps per epoch: 141696 = 4718592 * 369 / 12288 / 1 (samples per file * num files / batch size / comm size)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
  storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
mpirun noticed that process rank 0 with PID 0 on node srt017 exited on signal 9 (Killed).
  storage_root   = './'
  storage_options= None
  data_folder    = '/mnt/redfs/mlstorage_dd/dlrm_b200/dlrm'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 369
  record_length  = 761
  generate_data  = False
  do_train       = True
  do_checkpoint  = False
  epochs         = 1
  batch_size     = 12288
--------------------------------------------------------------------------
2026-04-10 21:14:08|STATUS: Writing metadata for benchmark to: /work/kums/mlstorage_v3/results/training/dlrm/run/20260410_210808/training_20260410_210808_metadata.json
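Both runs end with signal 9, which on Linux is SIGKILL. The logs above do not show what sent it. One common source is the kernel OOM killer, and a possible way to check is to search the kernel log for its entries. The sketch below is only an illustration of that check: the regular expression assumes the "Out of memory: Killed process" wording used by recent Linux kernels, and the sample line is hypothetical, not taken from this system.

```python
import re
import signal

# Signal number reported by mpirun in the logs above (Linux numbering).
sig_name = signal.Signals(9).name
print(sig_name)

# Assumed message format of OOM-killer entries in recent Linux kernel logs.
OOM_RE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str) -> list[tuple[int, str]]:
    """Return (pid, command) pairs for OOM-killer entries in kernel log text."""
    return [(int(pid), comm) for pid, comm in OOM_RE.findall(kernel_log)]


# Hypothetical sample line for illustration only, not from this report.
sample = "[12345.678] Out of memory: Killed process 4242 (dlio_benchmark) total-vm:1kB"
print(find_oom_kills(sample))
```

In practice the text passed to `find_oom_kills` would come from `dmesg` or `journalctl -k` on srt017, captured right after a failed run. An empty result would point away from the OOM killer and toward some other sender of SIGKILL.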

Labels: bug (Something isn't working)
