High metadata IOPS from Python MPI processes causing Lustre MDS overload #21

### Describe the bug

## Bug Report: Python MPI startup causes Lustre Metadata Server overload

### Problem
When running ensemble jobs, each MPI process independently loads the full Python environment on startup, generating up to **30,000 metadata operations/sec per job**. Dmitri identified this while monitoring SYMFLUENCE on ARC (U. of Calgary) during the calibration of SUMMA + mizuRoute with the ASYNC-DDS optimization algorithm (1 basin, 50 cores, 100 batches).

### Profiled Execution Stages
Three distinct patterns were observed:

| Stage | Metadata IOPS | Notes |
|---|---|---|
| `mizuRoute.exe` | ~0 on shared NFS | Clean — no shared filesystem traffic (modified mizuRoute) |
| Python startup (launching `summa_sundials.exe`) | **up to 30,000 ops/sec** | Problematic — each MPI process searches for modules/libraries simultaneously |
| `summa_sundials.exe` | ~1,000 ops/sec | Moderate — dynamic library loads, clears quickly |

The Python stage is the critical bottleneck. A Python installation contains many files, and resolving each import means searching through many directories. Every MPI process does this independently, and because they all start at approximately the same time, the lookups reach the metadata server in a single burst.
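A minimal sketch of the search behaviour described above, assuming a standard CPython import setup. It asks the path-based finder about one `sys.path` entry at a time and reports how many entries are probed before a top-level module is found. The function name `probe_count` is illustrative and not part of SYMFLUENCE, and the count is only a rough proxy for metadata load: each probed directory typically costs at least a stat, plus a directory listing the first time it is searched.

```python
import importlib.machinery
import sys


def probe_count(module_name, search_path=None):
    """Return (entries_probed, found) for a top-level module.

    Walks the search path in order, asking the path-based finder about one
    entry at a time, and stops at the first entry that provides the module.
    Built-in and frozen modules are not located this way and report found=False.
    """
    entries = list(sys.path if search_path is None else search_path)
    for probed, entry in enumerate(entries, start=1):
        spec = importlib.machinery.PathFinder.find_spec(module_name, [entry])
        if spec is not None:
            return probed, True
    return len(entries), False


if __name__ == "__main__":
    # "numpy" is only an example of a third-party package; it may not be installed.
    for name in ("json", "numpy"):
        probed, found = probe_count(name)
        print(f"{name}: probed {probed} of {len(sys.path)} entries, found={found}")
```

Running this inside the job's venv on a compute node shows how deep into `sys.path` common imports have to search. For a per-module timing view, the interpreter's own `python -X importtime` option is a complementary tool.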

### Proposed Solutions
- **Short-term (quick fix):** Copy the Python venv to `/tmp` (local node storage) before job startup, as is already done for file I/O operations.
- **Long-term (preferred):** Persist MPI workers and start them in batches to avoid simultaneous environment loading across processes.
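The two proposals above could be sketched roughly as follows. This is an illustration under stated assumptions, not the SYMFLUENCE implementation: `stage_tree`, `startup_delay`, and `detect_rank` are hypothetical helper names, the batch size and interval are placeholder values, and the rank is read from whichever launcher variable happens to be set (`SLURM_PROCID`, `OMPI_COMM_WORLD_RANK`, or `PMI_RANK`). The sketch covers only the "start in batches" half of the long-term proposal; keeping workers alive between batches is a separate change.

```python
import os
import shutil
import time
from pathlib import Path

# Environment variables set by common launchers; which one is present depends
# on the site (Slurm srun, Open MPI mpirun, MPICH/Intel MPI).
RANK_VARS = ("SLURM_PROCID", "OMPI_COMM_WORLD_RANK", "PMI_RANK")


def detect_rank(environ=os.environ):
    """Return the global rank from the launcher environment, or 0 if unknown."""
    for var in RANK_VARS:
        if var in environ:
            return int(environ[var])
    return 0


def startup_delay(rank, batch_size, batch_interval_s):
    """Seconds a rank should wait so that ranks start in batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (rank // batch_size) * batch_interval_s


def stage_tree(src, dst):
    """Copy a directory tree to node-local storage once; return the local path.

    Not safe against concurrent callers: the assumption is that a single
    process per node (for example the one with SLURM_LOCALID == 0) calls it.
    """
    src, dst = Path(src), Path(dst)
    marker = dst / ".staged"
    if not marker.exists():
        shutil.copytree(src, dst, symlinks=True, dirs_exist_ok=True)
        marker.touch()
    return dst


if __name__ == "__main__":
    # Hypothetical values: 10 ranks per batch, 2 s between batches.
    time.sleep(startup_delay(detect_rank(), batch_size=10, batch_interval_s=2.0))
    # Heavy imports would go here, after the delay.
```

Two caveats apply. A venv is not designed to be relocated, so scripts inside a copied venv may still point at the original interpreter path and need checking. The delay also only helps for imports that happen after it, since the interpreter's own startup imports run before any user code.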

### Additional Notes
- Only strictly needed Python modules should be loaded — modules loaded for convenience but never used contribute to the problem.
- This was diagnosed by Dimitri using filesystem profiling tools.
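As a starting point for the module audit suggested above, the following sketch lists which already-loaded modules come from a given directory prefix, such as the venv on the shared filesystem. The helper name `modules_under` is hypothetical. It only reports what was loaded; deciding which of those modules are never used still needs manual review.

```python
import os
import sys


def modules_under(prefix, modules=None):
    """Return sorted names of loaded modules whose file lives under prefix."""
    root = os.path.realpath(prefix).rstrip(os.sep) + os.sep
    modules = sys.modules if modules is None else modules
    names = []
    # Copy the items first: sys.modules can change while we iterate.
    for name, module in list(modules.items()):
        path = getattr(module, "__file__", None)
        if path and os.path.realpath(path).startswith(root):
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # sys.prefix is the active environment; call this after the application's imports.
    loaded = modules_under(sys.prefix)
    print(f"{len(loaded)} modules loaded from {sys.prefix}")
```

Calling it once right after the application's imports and comparing the result against the modules the calibration loop actually touches gives a first list of candidates to drop or import lazily.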

### References
Internal email thread between Darri, Martyn, Nicolás, and Dimitri (March 2026). Suggestions made by Dmitri. 

### SYMFLUENCE version

_No response_

### Steps to reproduce

On ARC (U. of Calgary):
`cd /work/comphyd_lab/users/SYMFLUENCE`
`sbatch run_confluence_calib.sh`

The configuration (YAML) file for the basin is at:
`/work/comphyd_lab/users/SYMFLUENCE/0_config_files/century_basins_exp7_ASYNC-DDS/config_CAN_01BJ010_meso.yaml`

### Relevant logs or screenshots

_No response_
