csstddg@LIN-LAPTOP-DG:~/homecoming/homecoming$ fa2 homecoming_ensemble:rv25,label=n1,cores=640,cpuspertask=32,job_wall_time=1:00:00,pj_type=qcg,venv=true
/home/csstddg/.local/lib/python3.12/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
"cipher": algorithms.TripleDES,
/home/csstddg/.local/lib/python3.12/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
"class": algorithms.TripleDES,
[loading plugin] FabMD ...
[loading plugin] FabFlee ...
[loading plugin] FabParticleDA ...
[loading plugin] FabDummy ...
[loading plugin] FabUQCampaign ...
[loading plugin] fabmogp ...
[loading plugin] FabCovid19 ...
[loading plugin] FabNEPTUNE ...
[loading plugin] FUMEplot ...
[loading plugin] FabHomecoming ...
[Executing task] homecoming_ensemble
calling task homecoming_ensemble from plugin FabHomecoming
╭─ New/Updated environment variables from FabHomecoming plugin ─╮
│ env : │
│ +++ HOMECOMING_TYPE_CHECK is a new added key │
│ env : │
│ +++ hc_location is a new added key │
╰───────────────────────────────────────────────────────────────╯
local config file path at: /home/csstddg/FabSim3/plugins/FabHomecoming/config_files/rv25
adding label: n1
[INFO] sweepdir_items: ['3', '4', '10', '7', '9', '8', '5', '1', '2', '6']
[INFO] replica_counts: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[INFO] replicas: 1
[local] ssh -Y -p 22 dgroe@archer2 'mkdir -p /work/d202/d202/dgroe/FabSim/config_files; mkdir -p /work/d202/d202/dgroe/FabSim/results; mkdir -p /work/d202/d202/dgroe/FabSim/scripts; mkdir -p /work/d202/d202/dgroe/FabSim/config_files/rv25'
[local] rsync -pthrvz /home/csstddg/FabSim3/plugins/FabHomecoming/config_files/rv25/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/config_files/rv25/
sending incremental file list
sent 319 bytes received 23 bytes 228.00 bytes/sec
total size is 5.88K speedup is 17.20
[MultiProcessingPool] Starting Process child : Name = ForkPoolWorker-1 Process-1:1 , PID = 19163, parentPID = 19092 Max PoolSize = 7 requested PoolSize = 1
╭──────── job preparation phase ────────╮
│ tmp_work_path = /tmp/p4yzzwlv/FabSim3 │
╰───────────────────────────────────────╯
Submit tasks to multiprocessingPool : start ...
Submit tasks to multiprocessingPool : done ...
Generating job scripts... | Waiting for tasks to be completed ...
Job preparation complete!
╭───────── job transmission phase ─────────╮
│ Copy all generated files/folder from │
│ tmp_work_path = /tmp/p4yzzwlv/FabSim3 │
│ to │
│ work_path = /work/d202/d202/dgroe/FabSim │
╰──────────────────────────────────────────╯
[rsync_project] rsync -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/scripts/
[local] rsync -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/scripts/
sending incremental file list
./
rv25_archer2_640_n1_1.sh
rv25_archer2_640_n1_10.sh
rv25_archer2_640_n1_2.sh
rv25_archer2_640_n1_3.sh
rv25_archer2_640_n1_4.sh
rv25_archer2_640_n1_5.sh
rv25_archer2_640_n1_6.sh
rv25_archer2_640_n1_7.sh
rv25_archer2_640_n1_8.sh
rv25_archer2_640_n1_9.sh
sent 620 bytes received 329 bytes 632.67 bytes/sec
total size is 12.42K speedup is 13.09
[rsync_project] rsync -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/results/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/
[local] rsync -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/results/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/
sending incremental file list
./
rv25_archer2_640_n1/
rv25_archer2_640_n1/RUNS/
rv25_archer2_640_n1/RUNS/1/
rv25_archer2_640_n1/RUNS/1/env.yml
rv25_archer2_640_n1/RUNS/1/rv25_archer2_640_n1_1.sh
rv25_archer2_640_n1/RUNS/10/
rv25_archer2_640_n1/RUNS/10/env.yml
rv25_archer2_640_n1/RUNS/10/rv25_archer2_640_n1_10.sh
rv25_archer2_640_n1/RUNS/2/
rv25_archer2_640_n1/RUNS/2/env.yml
rv25_archer2_640_n1/RUNS/2/rv25_archer2_640_n1_2.sh
rv25_archer2_640_n1/RUNS/3/
rv25_archer2_640_n1/RUNS/3/env.yml
rv25_archer2_640_n1/RUNS/3/rv25_archer2_640_n1_3.sh
rv25_archer2_640_n1/RUNS/4/
rv25_archer2_640_n1/RUNS/4/env.yml
rv25_archer2_640_n1/RUNS/4/rv25_archer2_640_n1_4.sh
rv25_archer2_640_n1/RUNS/5/
rv25_archer2_640_n1/RUNS/5/env.yml
rv25_archer2_640_n1/RUNS/5/rv25_archer2_640_n1_5.sh
rv25_archer2_640_n1/RUNS/6/
rv25_archer2_640_n1/RUNS/6/env.yml
rv25_archer2_640_n1/RUNS/6/rv25_archer2_640_n1_6.sh
rv25_archer2_640_n1/RUNS/7/
rv25_archer2_640_n1/RUNS/7/env.yml
rv25_archer2_640_n1/RUNS/7/rv25_archer2_640_n1_7.sh
rv25_archer2_640_n1/RUNS/8/
rv25_archer2_640_n1/RUNS/8/env.yml
rv25_archer2_640_n1/RUNS/8/rv25_archer2_640_n1_8.sh
rv25_archer2_640_n1/RUNS/9/
rv25_archer2_640_n1/RUNS/9/env.yml
rv25_archer2_640_n1/RUNS/9/rv25_archer2_640_n1_9.sh
sent 2.94K bytes received 2.58K bytes 3.68K bytes/sec
total size is 243.23K speedup is 44.05
[INFO] Using PilotJob mode: qcg
╭────── PJ job submission phase ───────╮
│ NOW, we are submitting QCG-PilotJobs │
╰──────────────────────────────────────╯
╭─ QCG-PilotJob Configuration ─╮
│ [SLURM Resources] │
│ Nodes: 5 │
│ Cores: 640 │
│ Cores per node: 128 │
│ CPUs per task: 32 │
│ Tasks per node: 128 │
│ Total cores: 640 │
│ │
│ [QCG-PJ Resources] │
│ QCG_PJ_NODES: 5 │
│ QCG_PJ_CORES_PER_NODE: 128 │
│ QCG_PJ_TOTAL_CORES: 640 │
╰──────────────────────────────╯
[INFO] Created 10 task descriptions
[local] ssh -Y -p 22 dgroe@archer2 'mkdir -p /work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG'
[rsync_project] rsync -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/QCG/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/
[local] rsync -pthrvz --rsh='ssh -p 22 ' /tmp/p4yzzwlv/FabSim3/scripts/QCG/ dgroe@archer2:/work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/
sending incremental file list
./
qcg_manager_rv25_archer2_640_n1.py
qcg_submit_rv25_archer2_640_n1.sh
sent 222 bytes received 159 bytes 108.86 bytes/sec
total size is 11.07K speedup is 29.06
[local] ssh -Y -p 22 dgroe@archer2 'sbatch /work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/qcg_submit_rv25_archer2_640_n1.sh'
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
File "/home/csstddg/.local/bin/fabsim", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/csstddg/FabSim3/fabsim/base/fabsim_main.py", line 162, in main
env.exec_func(*env.task_args, **env.task_kwargs)
File "/home/csstddg/FabSim3/fabsim/base/decorators.py", line 75, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/csstddg/FabSim3/fabsim/base/decorators.py", line 128, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/csstddg/FabSim3/plugins/FabHomecoming/FabHomecoming.py", line 137, in homecoming_ensemble
run_ensemble(
File "<@beartype(fabsim.base.fab.run_ensemble) at 0x7f6ca8c3cae0>", line 133, in run_ensemble
File "/home/csstddg/FabSim3/fabsim/base/fab.py", line 1364, in run_ensemble
pilot_job_fn()
File "/home/csstddg/FabSim3/fabsim/base/fab.py", line 1665, in run_qcg
job_submission(dict(job_script=env.qcg_remote_sh))
File "/home/csstddg/FabSim3/fabsim/base/fab.py", line 1092, in job_submission
run(
File "<@beartype(fabsim.base.networks.run) at 0x7f6caa79ad40>", line 77, in run
File "/home/csstddg/FabSim3/fabsim/base/networks.py", line 153, in run
return manual(cmd, cd=cd, capture=capture)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<@beartype(fabsim.base.networks.manual) at 0x7f6ca8dab600>", line 77, in manual
File "/home/csstddg/FabSim3/fabsim/base/networks.py", line 216, in manual
return local(pre_cmd + "'" + manual_command + "'", capture=capture)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<@beartype(fabsim.base.networks.local) at 0x7f6caa6565c0>", line 77, in local
File "/home/csstddg/FabSim3/fabsim/base/networks.py", line 62, in local
raise RuntimeError(
RuntimeError:
local() encountered an error (return code 1)while executing 'ssh -Y -p 22 dgroe@archer2 'sbatch /work/d202/d202/dgroe/FabSim/results/rv25_archer2_640_n1/QCG/qcg_submit_rv25_archer2_640_n1.sh''
csstddg@LIN-LAPTOP-DG:~/homecoming/homecoming$
Describe the bug
When running an ensemble of MPI-parallelized jobs with QCG-PJ, SLURM on the ARCHER2 head node rejects the submission with "CPU count per node can not be satisfied" (full log above).

Executed steps
Steps to reproduce the behavior:
fa2 homecoming_ensemble:rv25,label=n1,cores=640,cpuspertask=32,job_wall_time=1:00:00,pj_type=qcg,venv=true

Expected behavior
An ensemble of 10 jobs runs to completion, each job using 32 cores.
Relevant logs and/or media (optional)
See the full terminal transcript above.
Platform details (optional)
Ubuntu 24.04
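
Additional context (optional)
The sbatch rejection looks consistent with the resource arithmetic in the "QCG-PilotJob Configuration" box of the log: with CPUs per task = 32, the printed value of Tasks per node = 128 would require 128 × 32 = 4096 CPUs on an ARCHER2 node that has 128 cores. The sketch below is only an illustration of that arithmetic; the variable names are mine, not FabSim3 identifiers, and the diagnosis is a hypothesis, not confirmed:

```python
# Sanity-check the per-node resource request that SLURM rejected.
# Values mirror the "QCG-PilotJob Configuration" box in the log above.
cores_per_node = 128   # physical cores on one ARCHER2 compute node
cpus_per_task = 32     # cpuspertask=32 from the fa2 command line
tasks_per_node = 128   # as printed in the log; the suspect value

# What the submitted configuration implies per node:
requested_cpus = tasks_per_node * cpus_per_task
print(f"requested CPUs per node: {requested_cpus}")   # 4096, far above 128

# The largest tasks-per-node value SLURM could actually satisfy:
max_tasks_per_node = cores_per_node // cpus_per_task
print(f"max tasks per node: {max_tasks_per_node}")    # 4
```

If this reading is right, a consistent request would use at most 4 tasks per node (5 nodes × 4 tasks × 32 CPUs = 640 cores, matching the requested total).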