Streamline Your ML Workflows in HPC: A Hands-on Container Workshop
A comprehensive workshop on containerizing machine learning workloads for High-Performance Computing environments using Apptainer (formerly Singularity) and Charliecloud.
| Aspect | Details |
|---|---|
| Duration | 2 hours (1 hour presentation, 1 hour hands-on) |
| Level | Intermediate |
| Instructor | Parmanand Sinha, Computational Scientist (GIS+HPC) |
| Institution | University of Chicago Research Computing Center |
- Basic programming knowledge (Python recommended)
- Linux CLI familiarity
- Active RCC/cluster account
- GPU node access (for GPU exercises; CPU alternatives provided)
# Clone the workshop repository
git clone https://github.com/rcc-uchicago/hpc-ml-containers-workshop.git
cd hpc-ml-containers-workshop
# Set up environment variables
export IMAGES=$SCRATCH/$USER/sif
export WORKDIR=$SCRATCH/$USER/ml_work
export CH_WORKDIR=$SCRATCH/$USER/charliecloud
mkdir -p "$IMAGES" "$WORKDIR" "$CH_WORKDIR"
# Load Apptainer module
module load apptainer
# Or load Charliecloud module (alternative)
module load spack.modules
module load charliecloud/0.35-gcc-12.2.0-6ifrorq

| File | Description |
|---|---|
| presentation_slides.md | Main presentation slides (Marp format) |
| hands_on_exercises.md | Step-by-step lab exercises |
| uchicago-rcc.css | Custom Marp theme for UChicago branding |
| marp.config.js | Marp configuration with Kroki diagram support |

| Directory | Description |
|---|---|
| examples/ | Standalone Python scripts and SLURM job files |
| examples/slurm/ | Apptainer SLURM job scripts |
| examples/charliecloud/ | Charliecloud examples and SLURM scripts |
# Install Marp CLI (if not already installed)
npm install -g @marp-team/marp-cli
# Generate PDF from markdown
marp presentation_slides.md --pdf --theme uchicago-rcc.css
# Or with the config file
marp presentation_slides.md --config-file marp.config.js --pdf

- Introduction to Containerization in HPC
  - Why containers for ML?
  - Containers vs Virtual Machines
  - Key components and terminology
- Apptainer (formerly Singularity)
  - Getting started
  - Basic commands
  - GPU support and data binding
- Charliecloud Overview
  - Comparison with Apptainer
  - Security model
  - Docker compatibility
- Practical ML Deployment
  - TensorFlow examples
  - PyTorch examples
  - Multi-GPU training
- SLURM Integration
  - Job lifecycle
  - Single-node and multi-node jobs
- Best Practices
  - Container management
  - Performance optimization
  - Security considerations
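
As a rough illustration of the SLURM integration and GPU/bind topics in the outline, the sketch below generates a single-node batch script that runs a training script inside an Apptainer image. Everything specific here is an assumption for illustration: the `write_job_script` helper, the image name `pytorch.sif`, the script `train.py`, and the resource values. Partition and account options are site-specific and deliberately left out. The workshop's own job scripts live in `examples/slurm/` and may differ.

```shell
#!/usr/bin/env bash
# Sketch: write a single-node SLURM job script that runs Python inside an
# Apptainer image. All names and resource values below are hypothetical.

write_job_script() {
    # $1 = output path, $2 = SIF image, $3 = script to run inside the container
    out=$1; image=$2; script=$3
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=ml-container
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=%x-%j.out

module load apptainer

# --nv exposes the host NVIDIA driver; --bind maps \$WORKDIR to /work in the container
apptainer exec --nv --bind "\$WORKDIR:/work" "$image" python "/work/$script"
EOF
}

# Write into a temporary directory so the sketch has no side effects.
outdir=$(mktemp -d)
write_job_script "$outdir/job.sbatch" pytorch.sif train.py
cat "$outdir/job.sbatch"
```

On a cluster, the generated file would be submitted with `sbatch`. `$WORKDIR` is left unexpanded in the generated script, so it is resolved from the environment of the running job.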
- Environment sanity check
- Basic container operations
- File binding patterns
- Interactive ML training
- Batch (SLURM) workflow
- Real-world CNN with CIFAR-10
- Hyper-parameter sweeps with job arrays
- Distributed training (DDP/Horovod)
- Security and image hardening
- CI/CD integration
- Charliecloud setup and basics
- Building custom images with Dockerfiles
- SquashFS for performance
- SLURM integration
- MPI workloads with --join flag
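
The Charliecloud exercises listed above (pull, convert to SquashFS, run, and run with `--join`) can be sketched as a dry run that only prints the commands. This is an assumption-laden outline, not the workshop's exact steps: the image reference `pytorch/pytorch:latest`, the output name `pytorch.sqfs`, the script path `/opt/train_mpi.py`, and the `run` and `charliecloud_workflow` helpers are all hypothetical. Running a `.sqfs` file directly with `ch-run` also assumes the site's Charliecloud build includes SquashFUSE support.

```shell
#!/usr/bin/env bash
# Sketch: print (or, with DRY_RUN=0, execute) a typical Charliecloud sequence.

run() {
    # Print the command when DRY_RUN is 1 (the default); otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

CH_WORKDIR=${CH_WORKDIR:-/tmp/charliecloud}       # set earlier in the workshop setup
IMAGE_REF=${IMAGE_REF:-pytorch/pytorch:latest}    # hypothetical image choice
SQFS="$CH_WORKDIR/pytorch.sqfs"                   # hypothetical output file

charliecloud_workflow() {
    # 1. Pull the image into ch-image's storage.
    run ch-image pull "$IMAGE_REF"
    # 2. Convert it to a single SquashFS file (format inferred from the extension).
    run ch-convert "$IMAGE_REF" "$SQFS"
    # 3. Run a quick check inside the container.
    run ch-run "$SQFS" -- python -c "import torch; print(torch.__version__)"
    # 4. MPI-style launch: --join lets peer ch-run processes on a node share namespaces.
    run srun ch-run --join "$SQFS" -- python /opt/train_mpi.py
}

charliecloud_workflow
```

Keeping the default dry run makes it easy to review the sequence on a login node before setting `DRY_RUN=0` inside a job.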
┌─────────────────────────────────────────────────────────────┐
│ Workshop Topics │
├─────────────────────────────────────────────────────────────┤
│ • Container fundamentals (images, runtimes, bind mounts) │
│ • Apptainer: pull, shell, exec, run commands │
│ • Charliecloud: ch-image, ch-run, ch-convert │
│ • GPU access with --nv (Apptainer) and CUDA (Charliecloud) │
│ • SLURM batch job submission │
│ • Multi-GPU and distributed training │
│ • Hyper-parameter sweeps with job arrays │
│ • SquashFS images for HPC performance │
│ • MPI workloads with --join flag (Charliecloud) │
│ • Security best practices │
│ • CI/CD for containerized ML pipelines │
└─────────────────────────────────────────────────────────────┘
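
To make the "hyper-parameter sweeps with job arrays" topic concrete, here is a minimal sketch of how one array index can be mapped onto a small parameter grid. The grid values, the helper functions, and the `--lr` and `--batch-size` options of `train.py` are assumptions for illustration; the workshop exercises may organize the sweep differently.

```shell
#!/usr/bin/env bash
# Sketch: map SLURM_ARRAY_TASK_ID onto a (learning rate, batch size) grid.
# A 3 x 2 grid has 6 combinations, so the job would use: #SBATCH --array=0-5

LRS="0.1 0.01 0.001"   # hypothetical learning rates
BATCHES="32 64"        # hypothetical batch sizes

count() {
    # Print the number of arguments.
    echo $#
}

pick() {
    # pick INDEX WORD... prints the INDEX-th word (0-based).
    idx=$1; shift
    shift "$idx"
    echo "$1"
}

params_for_task() {
    # Row-major decoding: the learning rate changes every n_batches tasks.
    task=$1
    n_batches=$(count $BATCHES)
    lr=$(pick $((task / n_batches)) $LRS)
    bs=$(pick $((task % n_batches)) $BATCHES)
    echo "--lr $lr --batch-size $bs"
}

# Outside SLURM the variable is unset, so fall back to task 0 for local testing.
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}
echo "task $TASK_ID: python train.py $(params_for_task "$TASK_ID")"
```

With this grid, task 3 decodes to `--lr 0.01 --batch-size 64`. Inside a batch script, the final line would typically be wrapped in the container launch command instead of `echo`.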
- Docker Hub - pytorch/pytorch, tensorflow/tensorflow
- NVIDIA NGC - Optimized GPU containers
| Error | Cause | Solution |
|---|---|---|
| FATAL: container creation failed | Missing or corrupted SIF file | Re-run apptainer pull |
| CUDA not available | Missing --nv flag | Add --nv to apptainer command |
| Permission denied | Wrong bind path permissions | Check path exists and is readable |
| module: command not found | Not on login node or module not loaded | Source environment or load module |
| No space left on device | Quota exceeded | Clean cache or use scratch space |

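
Several of the Apptainer failures in the table above can be caught before launching anything. The sketch below is one possible pre-flight helper, not part of the workshop materials: `build_exec_cmd` is a hypothetical function, the `/work` mount point is an assumption, and the GPU test (presence of `nvidia-smi` on the host) is only a heuristic. The file check confirms that the image exists and is non-empty; it cannot detect a corrupted SIF.

```shell
#!/usr/bin/env bash
# Sketch: validate inputs, then print an apptainer exec command line.

build_exec_cmd() {
    # $1 = SIF image, $2 = host directory to bind at /work, rest = command to run
    image=$1; bind_src=$2; shift 2

    # "container creation failed": the image must exist and be non-empty.
    if [ ! -s "$image" ]; then
        echo "error: $image is missing or empty; re-run apptainer pull" >&2
        return 1
    fi

    # "Permission denied": the bind source must be a readable directory.
    if [ ! -d "$bind_src" ] || [ ! -r "$bind_src" ]; then
        echo "error: bind path $bind_src is not a readable directory" >&2
        return 1
    fi

    # "CUDA not available": add --nv only when the host appears to have a GPU driver.
    gpu_flag=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        gpu_flag="--nv "
    fi

    echo "apptainer exec ${gpu_flag}--bind $bind_src:/work $image $*"
}
```

A call such as `build_exec_cmd "$IMAGES/pytorch.sif" "$WORKDIR" python /work/train.py` prints the command for review; the image and script names in that call are placeholders.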
| Error | Cause | Solution |
|---|---|---|
| namespace unavailable | User namespaces disabled | Contact admin or check kernel config |
| no such image | Image not pulled/built | Run ch-image pull or ch-image build |
| ch-image: command not found | Module not loaded | module load spack.modules; module load charliecloud |
| MPI hangs | Missing --join flag | Add --join to ch-run command |
| Cache quota exceeded | Home directory limit | Move cache with mvln ~/.charliecloud $SCRATCH/$USER |
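
For the "namespace unavailable" row, a quick way to check the kernel configuration on many Linux systems is to read `/proc/sys/user/max_user_namespaces`, where a value of 0 means user namespaces are disabled. The sketch below wraps that check; `userns_status` is a hypothetical helper, and some distributions gate unprivileged user namespaces through additional settings that this check does not cover, so treat the result as a first hint and not a definitive diagnosis.

```shell
#!/usr/bin/env bash
# Sketch: report whether the kernel allows user namespaces, which Charliecloud relies on.

userns_status() {
    # $1 = optional path to the kernel setting (parameterized so it can be tested)
    f=${1:-/proc/sys/user/max_user_namespaces}

    if [ ! -r "$f" ]; then
        echo "unknown"
        return 2
    fi

    max=$(cat "$f")
    # A positive limit means user namespaces can be created.
    if [ "$max" -gt 0 ] 2>/dev/null; then
        echo "enabled"
    else
        echo "disabled"
        return 1
    fi
}

echo "user namespaces: $(userns_status)"
```

If the result is `disabled` or `unknown` on a compute node, that is the point at which to contact the cluster administrators, as the table suggests.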
Found a bug or have suggestions? Please open an issue or pull request on GitHub.
This workshop material is provided for educational purposes. See individual container images for their respective licenses.
Parmanand Sinha
Computational Scientist (GIS+HPC)
Research Computing Center
University of Chicago
For questions about RCC resources: RCC Support