Implement GPU sharing via NVIDIA MPS for small-system throughput #10

@gregorweiss

Description

Enable multiple simulations to share a single GPU using NVIDIA MPS for small-to-medium systems that underutilize GPU compute.

Motivation

For small systems, a single simulation does not fully utilize a GPU. MPS lets multiple processes share one GPU, with aggregate throughput scaling near-linearly until the GPU's compute capacity is saturated. Running 8-16 replicas via MPS can deliver 8-12x the aggregate throughput of a single simulation.

Related to, but not strictly dependent on, #9.

Scope

  • mdfactory/performance/mps.py
  • MPS daemon lifecycle: start (nvidia-cuda-mps-control -d), stop (echo quit | nvidia-cuda-mps-control), health check
  • Two packing strategies:
    • -multidir approach: single gmx_mpi mdrun -multidir dir1 dir2 ... dirN call. Requires all systems to use the same .mdp parameters — natural for HT batches.
    • Independent srun approach: each simulation gets its own srun with MPS arbitrating GPU access. More flexible (different .mdp per system).
  • gpu_replicas field in run_schedules.yaml
  • Error handling: MPS daemon failure, GPU OOM detection
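The daemon lifecycle listed above could be wrapped roughly as follows. This is a minimal sketch, not the planned mdfactory API: the function names, the `MPSUnavailableError` class, and the injectable `run`/`which` parameters are hypothetical, and using the `get_server_list` control command as a health check is an assumption about what "health check" should mean here.

```python
import shutil
import subprocess

MPS_CONTROL = "nvidia-cuda-mps-control"


class MPSUnavailableError(RuntimeError):
    """Hypothetical error for a missing binary or a daemon that fails to start."""


def _send(command, run):
    # Control commands are piped to the daemon on stdin,
    # mirroring `echo quit | nvidia-cuda-mps-control`.
    return run([MPS_CONTROL], input=command + "\n",
               capture_output=True, text=True)


def start_mps(run=subprocess.run, which=shutil.which):
    """Start the MPS control daemon in the background (-d)."""
    if which(MPS_CONTROL) is None:
        raise MPSUnavailableError(f"{MPS_CONTROL} not found on PATH")
    result = run([MPS_CONTROL, "-d"], capture_output=True, text=True)
    if result.returncode != 0:
        raise MPSUnavailableError(result.stderr.strip() or "MPS daemon failed to start")


def stop_mps(run=subprocess.run):
    """Ask the daemon to quit; True if the command was accepted."""
    return _send("quit", run).returncode == 0


def mps_is_healthy(run=subprocess.run):
    """Assumed health check: the daemon answers a control query."""
    return _send("get_server_list", run).returncode == 0
```

Passing `run` and `which` as parameters keeps the functions testable with a fake subprocess runner, so no GPU or driver is needed in CI.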

MPS lifecycle pattern

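# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Run N simulations sharing the GPU, one working directory per replica
# (the ${SIM_DIR}/replica_${i} layout is an assumed convention)
for i in $(seq 0 $((NUM_REPLICAS - 1))); do
    srun --exact -n 1 --gpus=1 --cpus-per-task=${CORES} \
        --cpu-bind=verbose,cores --distribution=block:block \
        --chdir="${SIM_DIR}/replica_${i}" \
        gmx_mpi mdrun -s topol.tpr -nb gpu -pme gpu &
done
wait

# Stop the daemon once all replicas have finished
echo quit | nvidia-cuda-mps-control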
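For the Python side, the two packing strategies might be expressed as command builders like the ones below. This is a sketch under assumptions: `build_srun_commands` and `build_multidir_command` are hypothetical names, the flags are copied from the shell pattern above, and the MPI launcher prefix that a -multidir run normally needs is deliberately left out.

```python
from pathlib import Path

# mdrun arguments taken from the shell pattern in this issue
MDRUN_ARGS = ["gmx_mpi", "mdrun", "-s", "topol.tpr", "-nb", "gpu", "-pme", "gpu"]


def build_srun_commands(sim_dirs, cores_per_task):
    """Independent srun approach: one argv list per simulation directory."""
    commands = []
    for sim_dir in sim_dirs:
        commands.append([
            "srun", "--exact", "-n", "1", "--gpus=1",
            f"--cpus-per-task={cores_per_task}",
            "--cpu-bind=verbose,cores", "--distribution=block:block",
            f"--chdir={Path(sim_dir)}",
            *MDRUN_ARGS,
        ])
    return commands


def build_multidir_command(sim_dirs):
    """-multidir approach: a single mdrun call covering all directories."""
    dirs = [str(Path(d)) for d in sim_dirs]
    return ["gmx_mpi", "mdrun", "-multidir", *dirs, "-nb", "gpu", "-pme", "gpu"]
```

Returning argv lists rather than shell strings lets the caller hand each command to `subprocess.Popen` without quoting issues, and makes the output easy to assert on in tests.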

Acceptance criteria

  • MPS start/stop functions work correctly (tested with mock subprocess)
  • GPU replica count derived from benchmark results
  • Combines with CPU affinity for CPU-side binding
  • Graceful error when MPS unavailable (driver not configured)
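One possible way to derive the GPU replica count from benchmark results is sketched below. The selection rule (keep increasing the count while aggregate throughput improves by at least `min_gain`), the function name `choose_gpu_replicas`, and the input shape are all assumptions; the issue does not specify how benchmarks are stored or evaluated.

```python
def choose_gpu_replicas(benchmarks, min_gain=0.05):
    """Pick a replica count from benchmark data.

    benchmarks: dict mapping replica count -> aggregate throughput (e.g. ns/day).
    min_gain:   minimum relative improvement required to accept a larger count.
    """
    if not benchmarks:
        raise ValueError("no benchmark results available")
    counts = sorted(benchmarks)
    best = counts[0]
    for n in counts[1:]:
        # Accept a larger count only if it clearly beats the current best.
        if benchmarks[n] >= benchmarks[best] * (1 + min_gain):
            best = n
    return best


# Illustrative, made-up numbers: throughput saturates between 8 and 16 replicas.
example = {1: 100.0, 2: 190.0, 4: 350.0, 8: 600.0, 16: 610.0}
print(choose_gpu_replicas(example))
```

With these made-up numbers the function settles on 8 replicas, since going to 16 adds less than 5% aggregate throughput. The result could then populate the gpu_replicas field in run_schedules.yaml.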

Labels

enhancement (New feature or request)