Issue with running QCG-PilotJob with parallel tasks on ARCHER2 #324

@djgroen

Describe the bug

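When running an ensemble of MPI-parallelized jobs with QCG-PJ, SLURM on the ARCHER2 headnode rejects the job at submission: sbatch reports "CPU count per node can not be satisfied" and "Requested node configuration is not available".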

Executed steps

Steps to reproduce the behavior:

  1. fa2 homecoming_ensemble:rv25,label=n1,cores=640,cpuspertask=32,job_wall_time=1:00:00,pj_type=qcg,venv=true

Expected behavior

Running an ensemble of 10 jobs using 32 cores each.
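
For reference, the arithmetic behind this expectation can be sketched as follows. This is a rough sizing calculation, not FabSim3 or QCG-PilotJob code: the function name nodes_needed is hypothetical, the figure of 128 cores per node is taken from the configuration printed in the log, and the sketch assumes a single task is never split across nodes.

```python
import math


def nodes_needed(n_tasks: int, cores_per_task: int, cores_per_node: int) -> int:
    """Hypothetical helper: nodes required if tasks are never split across nodes."""
    tasks_per_node = cores_per_node // cores_per_task
    if tasks_per_node == 0:
        raise ValueError("a single task does not fit on one node")
    return math.ceil(n_tasks / tasks_per_node)


# Values from this report: 10 ensemble members, 32 cores each, 128 cores per node.
total_cores = 10 * 32
nodes = nodes_needed(n_tasks=10, cores_per_task=32, cores_per_node=128)
print(f"cores needed by the ensemble: {total_cores}")
print(f"nodes needed by the ensemble: {nodes}")
```

Under these assumptions the ensemble itself needs 320 cores on 3 nodes, whereas the command above requests cores=640, which the log shows being translated into 5 nodes.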

Relevant logs and/or media (optional)

csstddg@LIN-LAPTOP-DG:~/homecoming/homecoming$ fa2 homecoming_ensemble:rv25,label=n1,cores=640,cpuspertask=32,job_wall_time=1:00:00,pj_type=qcg,venv=true
/home/csstddg/.local/lib/python3.12/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
  "cipher": algorithms.TripleDES,
/home/csstddg/.local/lib/python3.12/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
  "class": algorithms.TripleDES,
[loading plugin] FabMD ...
[loading plugin] FabFlee ...
[loading plugin] FabParticleDA ...
[loading plugin] FabDummy ...
[loading plugin] FabUQCampaign ...
[loading plugin] fabmogp ...
[loading plugin] FabCovid19 ...
[loading plugin] FabNEPTUNE ...
[loading plugin] FUMEplot ...
[loading plugin] FabHomecoming ...
[Executing task] homecoming_ensemble
calling task homecoming_ensemble from plugin FabHomecoming
╭─ New/Updated environment variables from FabHomecoming plugin ─╮
│ env :                                                         │
│   +++ HOMECOMING_TYPE_CHECK is a new added key                │
│ env :                                                         │
│   +++ hc_location is a new added key                          │
╰───────────────────────────────────────────────────────────────╯
local config file path at: /home/csstddg/FabSim3/plugins/FabHomecoming/config_files/rv25
adding label:  n1
[INFO] sweepdir_items: ['3', '4', '10', '7', '9', '8', '5', '1', '2', '6']
[INFO] replica_counts: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[INFO] replicas: 1
[local] ssh -Y -p 22 dgroe@archer2 'mkdir -p /work/d202/d202/dgroe/FabSim/config_files; mkdir -p /work/d202/d202/dgroe/FabSim/results; mkdir -p /work/d202/d202/dgroe/FabSim/scripts; mkdir -p /work/d202/d202/dgroe/FabSim/config_files/rv25'
[local] rsync -pthrvz /home/csstddg/FabSim3/plugins/FabHomecoming/config_files/rv25/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/config_files/rv25/
sending incremental file list

sent 319 bytes  received 23 bytes  228.00 bytes/sec
total size is 5.88K  speedup is 17.20
[MultiProcessingPool] Starting Process child : Name = ForkPoolWorker-1 Process-1:1 , PID = 19163,  parentPID = 19092 Max PoolSize = 7 requested PoolSize = 1
╭──────── job preparation phase ────────╮
│ tmp_work_path = /tmp/p4yzzwlv/FabSim3 │
╰───────────────────────────────────────╯
Submit tasks to multiprocessingPool : start ...
Submit tasks to multiprocessingPool : done ...
Generating job scripts... | Waiting for tasks to be completed ...
Job preparation complete!                   
╭───────── job transmission phase ─────────╮
│ Copy all generated files/folder from     │
│ tmp_work_path = /tmp/p4yzzwlv/FabSim3    │
│ to                                       │
│ work_path = /work/d202/d202/dgroe/FabSim │
╰──────────────────────────────────────────╯
[rsync_project] rsync    -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/scripts/
[local] rsync    -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/scripts/
sending incremental file list
./
rv25_archer2_640_n1_1.sh
rv25_archer2_640_n1_10.sh
rv25_archer2_640_n1_2.sh
rv25_archer2_640_n1_3.sh
rv25_archer2_640_n1_4.sh
rv25_archer2_640_n1_5.sh
rv25_archer2_640_n1_6.sh
rv25_archer2_640_n1_7.sh
rv25_archer2_640_n1_8.sh
rv25_archer2_640_n1_9.sh

sent 620 bytes  received 329 bytes  632.67 bytes/sec
total size is 12.42K  speedup is 13.09
[rsync_project] rsync    -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/results/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/
[local] rsync    -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/results/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/
sending incremental file list
./
rv25_archer2_640_n1/
rv25_archer2_640_n1/RUNS/
rv25_archer2_640_n1/RUNS/1/
rv25_archer2_640_n1/RUNS/1/env.yml
rv25_archer2_640_n1/RUNS/1/rv25_archer2_640_n1_1.sh
rv25_archer2_640_n1/RUNS/10/
rv25_archer2_640_n1/RUNS/10/env.yml
rv25_archer2_640_n1/RUNS/10/rv25_archer2_640_n1_10.sh
rv25_archer2_640_n1/RUNS/2/
rv25_archer2_640_n1/RUNS/2/env.yml
rv25_archer2_640_n1/RUNS/2/rv25_archer2_640_n1_2.sh
rv25_archer2_640_n1/RUNS/3/
rv25_archer2_640_n1/RUNS/3/env.yml
rv25_archer2_640_n1/RUNS/3/rv25_archer2_640_n1_3.sh
rv25_archer2_640_n1/RUNS/4/
rv25_archer2_640_n1/RUNS/4/env.yml
rv25_archer2_640_n1/RUNS/4/rv25_archer2_640_n1_4.sh
rv25_archer2_640_n1/RUNS/5/
rv25_archer2_640_n1/RUNS/5/env.yml
rv25_archer2_640_n1/RUNS/5/rv25_archer2_640_n1_5.sh
rv25_archer2_640_n1/RUNS/6/
rv25_archer2_640_n1/RUNS/6/env.yml
rv25_archer2_640_n1/RUNS/6/rv25_archer2_640_n1_6.sh
rv25_archer2_640_n1/RUNS/7/
rv25_archer2_640_n1/RUNS/7/env.yml
rv25_archer2_640_n1/RUNS/7/rv25_archer2_640_n1_7.sh
rv25_archer2_640_n1/RUNS/8/
rv25_archer2_640_n1/RUNS/8/env.yml
rv25_archer2_640_n1/RUNS/8/rv25_archer2_640_n1_8.sh
rv25_archer2_640_n1/RUNS/9/
rv25_archer2_640_n1/RUNS/9/env.yml
rv25_archer2_640_n1/RUNS/9/rv25_archer2_640_n1_9.sh

sent 2.94K bytes  received 2.58K bytes  3.68K bytes/sec
total size is 243.23K  speedup is 44.05
[INFO] Using PilotJob mode: qcg
╭────── PJ job submission phase ───────╮
│ NOW, we are submitting QCG-PilotJobs │
╰──────────────────────────────────────╯
╭─ QCG-PilotJob Configuration ─╮
│ [SLURM Resources]            │
│ Nodes: 5                     │
│ Cores: 640                   │
│ Cores per node: 128          │
│ CPUs per task: 32            │
│ Tasks per node: 128          │
│ Total cores: 640             │
│                              │
│ [QCG-PJ Resources]           │
│ QCG_PJ_NODES: 5              │
│ QCG_PJ_CORES_PER_NODE: 128   │
│ QCG_PJ_TOTAL_CORES: 640      │
╰──────────────────────────────╯
[INFO] Created 10 task descriptions
[local] ssh -Y -p 22 dgroe@archer2 'mkdir -p /work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG'
[rsync_project] rsync    -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/QCG/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/
[local] rsync    -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/QCG/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/
sending incremental file list
./
qcg_manager_rv25_archer2_640_n1.py
qcg_submit_rv25_archer2_640_n1.sh

sent 222 bytes  received 159 bytes  108.86 bytes/sec
total size is 11.07K  speedup is 29.06
[local] ssh -Y -p 22 dgroe@archer2 'sbatch /work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/qcg_submit_rv25_archer2_640_n1.sh'
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
  File "/home/csstddg/.local/bin/fabsim", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/csstddg/FabSim3/fabsim/base/fabsim_main.py", line 162, in main
    env.exec_func(*env.task_args, **env.task_kwargs)
  File "/home/csstddg/FabSim3/fabsim/base/decorators.py", line 75, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/csstddg/FabSim3/fabsim/base/decorators.py", line 128, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/csstddg/FabSim3/plugins/FabHomecoming/FabHomecoming.py", line 137, in homecoming_ensemble
    run_ensemble(
  File "<@beartype(fabsim.base.fab.run_ensemble) at 0x7f6ca8c3cae0>", line 133, in run_ensemble
  File "/home/csstddg/FabSim3/fabsim/base/fab.py", line 1364, in run_ensemble
    pilot_job_fn()
  File "/home/csstddg/FabSim3/fabsim/base/fab.py", line 1665, in run_qcg
    job_submission(dict(job_script=env.qcg_remote_sh))
  File "/home/csstddg/FabSim3/fabsim/base/fab.py", line 1092, in job_submission
    run(
  File "<@beartype(fabsim.base.networks.run) at 0x7f6caa79ad40>", line 77, in run
  File "/home/csstddg/FabSim3/fabsim/base/networks.py", line 153, in run
    return manual(cmd, cd=cd, capture=capture)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<@beartype(fabsim.base.networks.manual) at 0x7f6ca8dab600>", line 77, in manual
  File "/home/csstddg/FabSim3/fabsim/base/networks.py", line 216, in manual
    return local(pre_cmd + "'" + manual_command + "'", capture=capture)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<@beartype(fabsim.base.networks.local) at 0x7f6caa6565c0>", line 77, in local
  File "/home/csstddg/FabSim3/fabsim/base/networks.py", line 62, in local
    raise RuntimeError(
RuntimeError: 
local() encountered an error (return code 1)while executing 'ssh -Y -p 22 dgroe@archer2 'sbatch /work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/qcg_submit_rv25_archer2_640_n1.sh''
csstddg@LIN-LAPTOP-DG:~/homecoming/homecoming$ 
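
A possible reading of the sbatch error, based only on the "QCG-PilotJob Configuration" box in the log: it lists "Tasks per node: 128" together with "CPUs per task: 32" on nodes with 128 cores. The generated qcg_submit script is not shown here, so it is an assumption that these two values end up as the sbatch options --ntasks-per-node and --cpus-per-task. If they do, the request would exceed the cores available on a node. The helper names below are hypothetical and only illustrate the check.

```python
def cpus_requested_per_node(tasks_per_node: int, cpus_per_task: int) -> int:
    """CPUs a node must provide to host the requested tasks."""
    return tasks_per_node * cpus_per_task


def max_tasks_per_node(cores_per_node: int, cpus_per_task: int) -> int:
    """Largest tasks-per-node value that still fits on one node."""
    return cores_per_node // cpus_per_task


# Values copied from the configuration box in the log above.
cores_per_node = 128
cpus_per_task = 32
tasks_per_node = 128

requested = cpus_requested_per_node(tasks_per_node, cpus_per_task)
fits = requested <= cores_per_node
print(f"requested CPUs per node: {requested} (available: {cores_per_node})")
print(f"request fits on a node: {fits}")
print(f"largest consistent tasks per node: {max_tasks_per_node(cores_per_node, cpus_per_task)}")
```

With the values from the log, the product is 4096 CPUs per node against 128 available, and the largest consistent tasks-per-node value for 32 CPUs per task would be 4. Whether this is the actual cause needs to be confirmed against the generated submit script.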

Platform details (optional)

Ubuntu 24.04
