Detect MPI with Singularity #2216

Open
kinow wants to merge 1 commit into main from detect-mpi-with-singularity

Conversation


@kinow kinow commented Mar 6, 2026

Related to #2208 (not sure if this is enough to close that issue)

@mr-c, I started some tests yesterday and completed them today after work. Once I managed to run the workflow on my machine, reproducing the error we see on HPCs, debugging showed that the required changes were simple.

I managed to keep the --contain, but in order to do that I had to include /dev/shm. I couldn't find any files on /tmp or in the temporary directory created by cwltool. And actually, cwltool worked without having to remove that temporary directory (i.e. I was wrong: the process manager wasn't using the temp dir to sync with processes, but apparently shared memory).

I kept the --cleanenv for when the job isn't using the cwltool MPIRequirement. When MPIRequirement is declared as a requirement (not a hint) and is used in the job, I do not set the --cleanenv.

I also tried keeping --cleanenv in the MPI case, and even tested the MPI config file using the variables from the MPICH documentation.

runner: mpirun.mpich
extra_flags: [
  # "likwid-perfctr",
  # "-C", "L:N:0",
  # "-g", "FLOPS_DP",
  # "-o", "/output/path/likwid_%j_%h_%r.json"
]
nproc_flag: -n
env_pass_regex: [
  "MPIEXEC_.*",
  "HYDRA_.*",
  "SMPD_.*",
  "MPICH_.*",
  "DCMF_.*",
  "MPIO_.*",
  "RMS_.*",
  "MPITEST_.*",
  "PMI_*",
  "MPI_*"
  # "SLURM_.*"
]

(The MPI config file was needed on my laptop as I have OpenMPI & MPICH, so I need to override the runner as mpirun defaults to OpenMPI on my machine, and the container I'm using has MPICH -- I also confirmed with the debugger that the MPI config file values were added)

The issue with the environment variables, @mr-c, is that when cwltool runs, I don't have any environment variables yet.

When we create the CWL job, the runtime config, and even when we pass through the env vars from mpi.py, there are no MPI variables loaded yet.

The Edinburgh paper discusses how to use some variables, but I believe those are only variables that are already available. For instance, when you have a Slurm allocation, you have SLURM_JOB_ID, SLURM_MEM_PER_CPU, etc.

When the CWL job is executed, already with the runtime (command + args) created, and Python launches mpirun.mpich, that's when the MPICH (or Hydra) process manager creates the environment variables.
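To illustrate the timing problem: env_pass_regex-style filtering only ever sees the environment of the cwltool process itself. Here is a minimal sketch of that kind of regex filtering (a hypothetical helper, not cwltool's actual code) showing why nothing matches at command-build time:

```python
import re


def passed_env_names(env, pass_regex):
    """Return the names in ``env`` matched by any of the regex patterns.

    Hypothetical sketch of env_pass_regex-style filtering. At the time
    cwltool builds the command line, ``env`` has no MPI variables yet,
    so nothing matches.
    """
    compiled = [re.compile(pattern) for pattern in pass_regex]
    return [name for name in env if any(r.fullmatch(name) for r in compiled)]


# Before mpirun runs, a typical environment has nothing to match:
print(passed_env_names({"PATH": "/usr/bin", "HOME": "/home/kinow"},
                       ["PMI_.*", "MPICH_.*"]))
# Inside a process launched by mpirun, the same filter would match:
print(passed_env_names({"PMI_RANK": "0", "PMI_SIZE": "1", "PATH": "/usr/bin"},
                       ["PMI_.*", "MPICH_.*"]))
```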

(venv) kinow@ranma:~/Development/python/workspace/cwltool$ env | grep MPI
(venv) kinow@ranma:~/Development/python/workspace/cwltool$ mpirun.mpich -n 1 env | grep MPI
MPIR_CVAR_CH3_INTERFACE_HOSTNAME=ranma
MPI_LOCALNRANKS=1
MPI_LOCALRANKID=0
(venv) kinow@ranma:~/Development/python/workspace/cwltool$ mpirun.mpich -n 1 env | grep PMI
PMI_RANK=0
PMI_FD=9
PMI_SIZE=1

I tried setting these variables in the command line of singularity without success. This is what I did:

$ mpirun.mpich -n 1 env | sort > /tmp/mpirun.txt
$ env | sort > /tmp/run.txt
$ meld /tmp/run.txt /tmp/mpirun.txt

Got the list of variables that appeared in the diff. There were 8 extra variables:

  • GFORTRAN_UNBUFFERED_PRECONNECTED=y
  • HYDI_CONTROL_FD=7
  • MPIR_CVAR_CH3_INTERFACE_HOSTNAME=ranma
  • MPI_LOCALNRANKS=1
  • MPI_LOCALRANKID=0
  • PMI_FD=9
  • PMI_RANK=0
  • PMI_SIZE=1

I added them to the Singularity command with just their names to see if it would pass the host env vars to the container (as it wouldn't make sense to try to define random file descriptors, ranks, etc.).

diff --git a/cwltool/singularity.py b/cwltool/singularity.py
index 8e391d64..128df6e2 100644
--- a/cwltool/singularity.py
+++ b/cwltool/singularity.py
@@ -601,6 +601,12 @@ class SingularityCommandLineJob(ContainerCommandLineJob):
         if not mpi_req or not is_req:
             runtime.append("--cleanenv")
         else:
+            runtime.append("--cleanenv")
+            for var in ["PMI_RANK", "PMI_FD", "PMI_SIZE",
+                        "MPI_LOCALNRANKS", "MPI_LOCALRANKID", "MPIR_CVAR_CH3_INTERFACE_HOSTNAME",
+                        "HYDI_CONTROL_FD", "GFORTRAN_UNBUFFERED_PRECONNECTED"]:
+                runtime.append("--env")
+                runtime.append(var)
             self.append_volume(
                 runtime,
                 runtime_context.create_tmpdir(),

The resulting command has the --cleanenv and all the environment variables from the diff:

INFO [job run] /tmp/hib1sp8z$ mpirun.mpich \
    -n \
    2 \
    singularity \
    --quiet \
    run \
    --ipc \
    --contain \
    --cleanenv \
    --env \
    PMI_RANK \
    --env \
    PMI_FD \
    --env \
    PMI_SIZE \
    --env \
    MPI_LOCALNRANKS \
    --env \
    MPI_LOCALRANKID \
    --env \
    MPIR_CVAR_CH3_INTERFACE_HOSTNAME \
    --env \
    HYDI_CONTROL_FD \
    --env \
    GFORTRAN_UNBUFFERED_PRECONNECTED \
    --mount=type=bind,source=/tmp/g7lo1zku,target=/dev/shm \
    --no-eval \
    --userns \
    --home \
    /tmp/hib1sp8z:/PCIKJh \
    --mount=type=bind,source=/tmp/ub89clkk,target=/tmp \
    --mount=type=bind,source=/tmp/h8j2gy1j/a.out,target=/var/lib/cwl/stgfb9c3552-0c10-424c-82ba-284742e69287/a.out,readonly \
    --pwd \
    /PCIKJh \
    --net \
    --network \
    none \
    /home/kinow/Development/python/workspace/cwl-mpi/examples/mpich-sr/cwltool/mfisherman_mpich:4.3.2.sif \
    /var/lib/cwl/stgfb9c3552-0c10-424c-82ba-284742e69287/a.out \
    0 \
    1 > /tmp/hib1sp8z/sr.out 2> /tmp/hib1sp8z/sr.err
WARNING [job run] exited with status: 1
WARNING [job run] completed permanentFail
WARNING [step run] completed permanentFail

That failed again. And a smaller test shows that only key=value is valid:

$ ARUBA=111 mpirun.mpich -n 2 singularity run --cleanenv --env ARUBA mfisherman_mpich\:4.3.2.sif env | grep ARUBA
Error for command "run": invalid argument "ARUBA" for "--env" flag: ARUBA must be formatted as key=value

Error for command "run": invalid argument "ARUBA" for "--env" flag: ARUBA must be formatted as key=value

But since I don't know the file descriptor, rank, etc., I can't see a way to define PMI_RANK, PMI_SIZE, PMI_FD, and other variables for MPICH.
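Since Singularity's --env only accepts key=value, the flags would have to be built with the values resolved up front, along these lines (a hypothetical helper, not cwltool code; it cannot work here precisely because mpirun only sets these values in its child processes):

```python
def env_flags(env, names):
    """Build Singularity ``--env KEY=VALUE`` arguments for the given names.

    Hypothetical helper: it needs the values to already be known, which
    is exactly what is missing, since PMI_RANK, PMI_FD, etc. only exist
    in the processes that mpirun launches.
    """
    args = []
    for name in names:
        if name in env:  # skip variables that are not set on the host
            args += ["--env", f"{name}={env[name]}"]
    return args


# Only PMI_RANK is set in this environment, so PMI_FD is skipped:
print(env_flags({"PMI_RANK": "0"}, ["PMI_RANK", "PMI_FD"]))
```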

@mr-c I did not write tests as I wanted to check with you first if this is going in the right direction. I'll run a couple of tests on MN5 and CESGA FT3. My main concern is --contain and InfiniBand. Even without --cleanenv, I think --contain may force the container to use ethernet instead of InfiniBand.

@kinow kinow force-pushed the detect-mpi-with-singularity branch from 1967d7d to d2b1e14 on March 6, 2026 at 20:27
@kinow kinow force-pushed the detect-mpi-with-singularity branch from d2b1e14 to bff8585 on March 6, 2026 at 20:31
codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 20.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.30%. Comparing base (2bd2c8a) to head (bff8585).

Files with missing lines | Patch % | Lines
cwltool/singularity.py   | 20.00%  | 4 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (2bd2c8a) and HEAD (bff8585): HEAD has 5 uploads less than BASE (BASE: 17, HEAD: 12).
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2216      +/-   ##
==========================================
- Coverage   85.06%   77.30%   -7.76%     
==========================================
  Files          46       46              
  Lines        8521     8526       +5     
  Branches     1988     1989       +1     
==========================================
- Hits         7248     6591     -657     
- Misses        805     1415     +610     
- Partials      468      520      +52     



kinow commented Mar 6, 2026

Test successful on BSC MN5! 🎉

  • ✅ Running cwltool on the login node with Singularity
  • ✅ Running cwltool in a Slurm job with 2 CPUs, 1 node with Singularity
  • ✅ Running cwltool in a Slurm job with 4 CPUs, 2 per node, 2 nodes with Singularity
$ cwltool --singularity --enable-ext --enable-dev sr-workflow.cwl sr-workflow-job.yml
...
INFO Using local copy of Singularity image mfisherman_mpich:4.3.2.sif found in /gpfs/home/bsc/$USER/cwl/cwl-mpi/examples/mpich-sr/cwltool
INFO [job run] /scratch/tmp/x8ha_22l$ mpirun \
    -n \
    2 \
    singularity \
    --quiet \
    run \
    --ipc \
    --contain \
    --mount=type=bind,source=/scratch/tmp/r96ss6xn,target=/dev/shm \
    --no-eval \
    --userns \
    --home \
    /scratch/tmp/x8ha_22l:/hjBIvx \
    --mount=type=bind,source=/scratch/tmp/uia0ia52,target=/tmp \
    --mount=type=bind,source=/scratch/tmp/zbn2ofqt/a.out,target=/var/lib/cwl/stg29c22cda-58aa-4d8e-bfc8-d69f33e05729/a.out,readonly \
    --pwd \
    /hjBIvx \
    --net \
    --network \
    none \
    /gpfs/home/bsc/$USER/cwl/cwl-mpi/examples/mpich-sr/cwltool/mfisherman_mpich:4.3.2.sif \
    /var/lib/cwl/stg29c22cda-58aa-4d8e-bfc8-d69f33e05729/a.out \
    0 \
    1 > /scratch/tmp/x8ha_22l/sr.out 2> /scratch/tmp/x8ha_22l/sr.err
...
...
    }
}
INFO Final process status is success

For posterity, if you run cwltool on an HPC using multiple nodes, but they are not sharing the same temp directory, you'll get something similar to:

slurmstepd: error: couldn't chdir to `/scratch/tmp/37435333/l0q96kcm': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/scratch/tmp/37435333/l0q96kcm': No such file or directory: going to /tmp instead
WARNING: skipping mount of /scratch/tmp/37435333/1db6cool: stat /scratch/tmp/37435333/1db6cool: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/e9hzplsn/a.out: stat /scratch/tmp/37435333/e9hzplsn/a.out: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/1db6cool: stat /scratch/tmp/37435333/1db6cool: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/e9hzplsn/a.out: stat /scratch/tmp/37435333/e9hzplsn/a.out: no such file or directory
FATAL:   container creation failed: mount /scratch/tmp/37435333/1db6cool->/dev/shm error: while mounting /scratch/tmp/37435333/1db6cool: mount source /scratch/tmp/37435333/1db6cool doesn't exist
FATAL:   container creation failed: mount /scratch/tmp/37435333/1db6cool->/dev/shm error: while mounting /scratch/tmp/37435333/1db6cool: mount source /scratch/tmp/37435333/1db6cool doesn't exist

And if you run without network access:

Abort(1017768207) on node 2: 
Fatal error in internal_Init: 
Other MPI error, error stack: 
internal_Init(70)....................: 
MPI_Init(argc=0x7ffd98ef954c, argv=0x7ffd98ef9540) failed MPII_Init_thread(282)................: 
MPIR_init_comm_world(34).............: 
MPIR_Comm_commit(794)................: 
MPIR_Comm_commit_internal(579).......: 
MPID_Comm_commit_pre_hook(151).......: 
MPIDI_world_pre_init(669)............: 
MPIDI_OFI_init_world(805)............: 
MPIDI_OFI_addr_exchange_root_ctx(143): 
MPIDU_bc_allgather(112)..............: 
MPIR_Allgatherv_intra_brucks(80).....: 
MPIC_Sendrecv(259)...................: 
MPID_Isend(60).......................: 
MPIDI_isend(32)......................: 
MPIDI_NM_mpi_isend(780)..............: 
MPIDI_OFI_send_fallback(483).........: 
OFI call tsendv failed (default nic=lo: 
No such file or directory) Abort(1017768207) on node 0: 
Fatal error in internal_Init: 
Other MPI error, error stack: 
internal_Init(70)....................: 
MPI_Init(argc=0x7ffe4702f11c, argv=0x7ffe4702f110) failed MPII_Init_thread(282)................: 
MPIR_init_comm_world(34).............: 
MPIR_Comm_commit(794)................: 
MPIR_Comm_commit_internal(579).......: 
MPID_Comm_commit_pre_hook(151).......: 
MPIDI_world_pre_init(669)............: 
MPIDI_OFI_init_world(805)............: 
MPIDI_OFI_addr_exchange_root_ctx(143): 
MPIDU_bc_allgather(112)..............: 
MPIR_Allgatherv_intra_brucks(80).....: 
MPIC_Sendrecv(259)...................: 
MPID_Isend(60).......................: 
MPIDI_isend(32)......................: 
MPIDI_NM_mpi_isend(780)..............: 
MPIDI_OFI_send_fallback(483).........: 
OFI call tsendv failed (default nic=lo: 
No such file or directory)

Using a temporary directory pointing to the parallel filesystem (GPFS in the case of MN5) and adding the NetworkAccess requirement with networkAccess: true solved both problems.
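For reference, the second fix amounts to adding a requirement along these lines to the tool or workflow (CWL v1.1+; the exact placement in my workflow may differ):

```yaml
requirements:
  NetworkAccess:
    networkAccess: true
```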

@kinow kinow self-assigned this Mar 6, 2026
@kinow kinow requested a review from mr-c March 6, 2026 22:13

kinow commented Mar 6, 2026

Test successful on CESGA FT3! 🎉

  • ✅ Running cwltool on the login node with Singularity
  • ✅ Running cwltool in a Slurm job with 2 CPUs, 1 node with Singularity
  • ✅ Running cwltool in a Slurm job with 4 CPUs, 2 per node, 2 nodes with Singularity

I used a folder on the Lustre parallel filesystem, $STORE/cwl/temp, as temporary storage, loaded the singularity module, and the tests executed fine.

Comment on lines +601 to +602
if not mpi_req or not is_req:
runtime.append("--cleanenv")
Member

Thank you for this PR! Let's always use --cleanenv; to cope with the environment variable sets, I recommend creating a shell script to be executed instead of the main command line.

The environment variable set can be derived either by parsing env_pass and env_pass_regex from the --mpi-config-file, or by passing in a list of the pre-existing environment variable names and forwarding only those newly set by the MPI runner.
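The second option described above — snapshotting the environment variable names before launch and forwarding only what the MPI runner added — could be sketched like this (a hypothetical helper, not part of cwltool):

```python
def newly_set(before_names, current_env):
    """Variables set since the snapshot: present now, absent before.

    Hypothetical helper: ``before_names`` is the set of environment
    variable names recorded before mpirun started; ``current_env`` is
    the environment seen inside the wrapper script that mpirun actually
    executes. The result is what should be forwarded into the container.
    """
    return {name: value for name, value in current_env.items()
            if name not in before_names}


before = {"PATH", "HOME"}
inside_mpirun = {"PATH": "/usr/bin", "HOME": "/home/kinow",
                 "PMI_RANK": "0", "PMI_SIZE": "2"}
print(newly_set(before, inside_mpirun))
```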

Comment on lines +603 to +608
else:
self.append_volume(
runtime,
runtime_context.create_tmpdir(),
"/dev/shm",
writable=True,
Member

Neat trick; is this universal for all known MPI implementations? Or do some not use the shared memory device for communication between MPI processes?
