Enable multiple simulations to share a single GPU using NVIDIA MPS for small-to-medium systems that underutilize GPU compute.
Motivation
For small systems, one simulation leaves a single GPU underutilized. MPS lets multiple processes share a GPU, with near-linear scaling until the GPU's compute capacity is saturated. Running 8-16 replicas via MPS can deliver 8-12x the aggregate throughput of a single simulation on the same GPU.
Related to, but not strictly dependent on, #9.
Scope
mdfactory/performance/mps.py
- MPS daemon lifecycle: start (nvidia-cuda-mps-control -d), stop (echo quit | nvidia-cuda-mps-control), health check
- Two packing strategies:
  - -multidir approach: a single gmx_mpi mdrun -multidir dir1 dir2 ... dirN call. Requires all systems to use the same .mdp parameters, which is natural for high-throughput (HT) batches.
  - Independent srun approach: each simulation gets its own srun, with MPS arbitrating GPU access. More flexible (a different .mdp per system is possible).
- gpu_replicas field in run_schedules.yaml
- Error handling: MPS daemon failure, GPU OOM detection
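The lifecycle functions listed above could look roughly like the following. This is a minimal sketch, not the planned implementation: the function names, the MPSUnavailableError type, and the injectable run/which parameters are hypothetical, and the health check assumes that the control CLI exits non-zero when it cannot reach a daemon.

```python
import shutil
import subprocess

MPS_CONTROL = "nvidia-cuda-mps-control"  # CLI named in this issue


class MPSUnavailableError(RuntimeError):
    """Hypothetical error type for 'MPS cannot be used on this node'."""


def start_mps(run=subprocess.run, which=shutil.which):
    """Start the MPS control daemon (same as `nvidia-cuda-mps-control -d`)."""
    if which(MPS_CONTROL) is None:
        raise MPSUnavailableError(f"{MPS_CONTROL} not found on PATH")
    result = run([MPS_CONTROL, "-d"], capture_output=True, text=True)
    if result.returncode != 0:
        raise MPSUnavailableError(
            f"MPS daemon failed to start: {result.stderr.strip()}"
        )


def stop_mps(run=subprocess.run):
    """Stop the daemon (same as `echo quit | nvidia-cuda-mps-control`)."""
    result = run([MPS_CONTROL], input="quit\n", capture_output=True, text=True)
    return result.returncode == 0


def mps_is_healthy(run=subprocess.run):
    """Ask the daemon for its server list.

    Assumption: a non-zero exit code means no daemon is reachable.
    """
    result = run(
        [MPS_CONTROL], input="get_server_list\n", capture_output=True, text=True
    )
    return result.returncode == 0
```

Passing run and which as parameters is one way to make the functions testable with a mock subprocess, without patching the module globally.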
MPS lifecycle pattern
# Start
nvidia-cuda-mps-control -d
# Run N simulations sharing the GPU
# Assumes SIM_DIRS is a bash array holding one directory per replica
for i in $(seq 0 $((NUM_REPLICAS - 1))); do
  srun --exact -n 1 --gpus=1 --cpus-per-task=${CORES} \
    --cpu-bind=verbose,cores --distribution=block:block \
    --chdir="${SIM_DIRS[$i]}" \
    gmx_mpi mdrun -s topol.tpr -nb gpu -pme gpu &
done
wait
# Stop
echo quit | nvidia-cuda-mps-control
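On the Python side, the two packing strategies could be expressed as small command builders that return argument lists for subprocess. The sketch below is illustrative only: build_srun_commands and build_multidir_command are hypothetical names, the srun flags are copied from the shell pattern above, and the MPI launcher that a -multidir run needs is deliberately left out because this issue does not specify it.

```python
import shlex


def build_srun_commands(sim_dirs, cores_per_task):
    """Independent srun approach: one command per simulation directory."""
    commands = []
    for sim_dir in sim_dirs:
        commands.append([
            "srun", "--exact", "-n", "1", "--gpus=1",
            f"--cpus-per-task={cores_per_task}",
            "--cpu-bind=verbose,cores", "--distribution=block:block",
            f"--chdir={sim_dir}",
            "gmx_mpi", "mdrun", "-s", "topol.tpr", "-nb", "gpu", "-pme", "gpu",
        ])
    return commands


def build_multidir_command(sim_dirs):
    """-multidir approach: a single mdrun call covering all directories.

    The MPI launcher prefix is omitted here (not specified in the issue).
    """
    if len(sim_dirs) < 2:
        raise ValueError("-multidir packing needs at least two directories")
    return ["gmx_mpi", "mdrun", "-multidir", *sim_dirs]


if __name__ == "__main__":
    # Print the commands as they would appear in a shell script.
    for cmd in build_srun_commands(["sim_a", "sim_b"], cores_per_task=4):
        print(shlex.join(cmd), "&")
    print(shlex.join(build_multidir_command(["sim_a", "sim_b"])))
```

Returning lists rather than strings avoids shell quoting problems when directory names contain spaces.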
Acceptance criteria
- MPS start/stop functions work correctly (tested with mock subprocess)
- GPU replica count derived from benchmark results
- Combines with CPU affinity for CPU-side binding
- Graceful error when MPS unavailable (driver not configured)
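One possible way to derive the replica count from benchmark results is to pick the smallest count whose aggregate throughput is close to the best observed value. The sketch below assumes benchmark results arrive as a mapping from replica count to aggregate ns/day; that format, the function name pick_gpu_replicas, and the 5% tolerance are assumptions for illustration, not part of this issue.

```python
def pick_gpu_replicas(aggregate_ns_per_day, tolerance=0.05):
    """Return the smallest replica count within `tolerance` of the best throughput.

    aggregate_ns_per_day: dict mapping replica count -> aggregate ns/day
    (hypothetical benchmark format).
    """
    if not aggregate_ns_per_day:
        raise ValueError("no benchmark results available")
    best = max(aggregate_ns_per_day.values())
    threshold = best * (1.0 - tolerance)
    # Prefer fewer replicas when extra ones add little throughput,
    # since each replica also costs GPU memory and CPU cores.
    return min(n for n, t in aggregate_ns_per_day.items() if t >= threshold)


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measurements.
    example = {1: 100.0, 4: 380.0, 8: 700.0, 12: 720.0, 16: 715.0}
    print(pick_gpu_replicas(example))
```

The result could then be written to the gpu_replicas field in run_schedules.yaml.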