HPC-ML-Containers-Workshop

Streamline Your ML Workflows in HPC: A Hands-on Container Workshop

A comprehensive workshop on containerizing machine learning workloads for High-Performance Computing environments using Apptainer (formerly Singularity) and Charliecloud.


Workshop Overview

| Aspect | Details |
| --- | --- |
| Duration | 2 hours (1 hour presentation, 1 hour hands-on) |
| Level | Intermediate |
| Instructor | Parmanand Sinha, Computational Scientist (GIS+HPC) |
| Institution | University of Chicago Research Computing Center |

Quick Start

Prerequisites

  • Basic programming knowledge (Python recommended)
  • Linux CLI familiarity
  • Active RCC/cluster account
  • GPU node access (for GPU exercises; CPU alternatives provided)

Setup

# Clone the workshop repository
git clone https://github.com/rcc-uchicago/hpc-ml-containers-workshop.git
cd hpc-ml-containers-workshop

# Set up environment variables
export IMAGES="$SCRATCH/$USER/sif"
export WORKDIR="$SCRATCH/$USER/ml_work"
export CH_WORKDIR="$SCRATCH/$USER/charliecloud"
mkdir -p "$IMAGES" "$WORKDIR" "$CH_WORKDIR"

# Load Apptainer module
module load apptainer

# Or load Charliecloud module (alternative)
module load spack.modules
module load charliecloud/0.35-gcc-12.2.0-6ifrorq
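
Before starting the exercises, it can help to confirm that the setup steps took effect. The following is a minimal sketch, not part of the repository: `check_workshop_env` is a hypothetical helper that checks the three directories created above and reports whether `apptainer` or `ch-run` is on the `PATH`.

```shell
# Sketch: verify the workshop directories and report available runtimes.
# check_workshop_env is a hypothetical helper, not shipped with the workshop.
check_workshop_env() {
    rc=0
    for var in IMAGES WORKDIR CH_WORKDIR; do
        # Look up the value of the variable whose name is stored in $var.
        eval "dir=\${$var:-}"
        if [ -z "$dir" ]; then
            echo "MISSING  $var is not set"
            rc=1
        elif [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
            echo "MISSING  $var=$dir is not a writable directory"
            rc=1
        else
            echo "OK       $var=$dir"
        fi
    done
    # Report which container runtimes are on PATH (informational only).
    for tool in apptainer ch-run; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "FOUND    $tool"
        else
            echo "ABSENT   $tool (load its module first)"
        fi
    done
    return "$rc"
}
```

Running `check_workshop_env` after the setup commands should print one `OK` line per directory; a non-zero return status means at least one directory is unset or not writable.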

Workshop Materials

| File | Description |
| --- | --- |
| presentation_slides.md | Main presentation slides (Marp format) |
| hands_on_exercises.md | Step-by-step lab exercises |
| uchicago-rcc.css | Custom Marp theme for UChicago branding |
| marp.config.js | Marp configuration with Kroki diagram support |

Example Scripts

| Directory | Description |
| --- | --- |
| examples/ | Standalone Python scripts and SLURM job files |
| examples/slurm/ | Apptainer SLURM job scripts |
| examples/charliecloud/ | Charliecloud examples and SLURM scripts |

Generating Slides

# Install Marp CLI (if not already installed)
npm install -g @marp-team/marp-cli

# Generate PDF from markdown
marp presentation_slides.md --pdf --theme uchicago-rcc.css

# Or with the config file
marp presentation_slides.md --config-file marp.config.js --pdf

Workshop Outline

Part 1: Presentation (60 min)

  1. Introduction to Containerization in HPC

    • Why containers for ML?
    • Containers vs Virtual Machines
    • Key components and terminology
  2. Apptainer (formerly Singularity)

    • Getting started
    • Basic commands
    • GPU support and data binding
  3. Charliecloud Overview

    • Comparison with Apptainer
    • Security model
    • Docker compatibility
  4. Practical ML Deployment

    • TensorFlow examples
    • PyTorch examples
    • Multi-GPU training
  5. SLURM Integration

    • Job lifecycle
    • Single-node and multi-node jobs
  6. Best Practices

    • Container management
    • Performance optimization
    • Security considerations

Part 2: Hands-on Exercises (60-90 min per part)

Part A: Core Apptainer Lab (60 min)

  • Environment sanity check
  • Basic container operations
  • File binding patterns
  • Interactive ML training
  • Batch (SLURM) workflow
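
The batch (SLURM) workflow item above can be sketched as a small generator for a single-GPU job script. This is an illustration under stated assumptions, not the workshop's own script: `write_job_script` is a hypothetical helper, and the partition name `gpu`, the time limit, and the `/work` mount point are placeholders to adapt to your cluster.

```shell
# Sketch: write a single-GPU SLURM job script that runs a command in a SIF image.
# write_job_script, the partition name "gpu", and /work are hypothetical placeholders.
write_job_script() {
    image=$1      # path to the .sif image
    workdir=$2    # host directory bound to /work inside the container
    out=$3        # job script file to create
    shift 3       # remaining arguments: the command to run in the container
    # Unquoted heredoc: $workdir, $image and $* are expanded when the file is written.
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=ml-container
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=%x-%j.out

module load apptainer
# --nv exposes the host NVIDIA driver; --bind maps the work directory.
apptainer exec --nv --bind "$workdir":/work "$image" $*
EOF
}
```

For example, `write_job_script "$IMAGES/pytorch.sif" "$WORKDIR" train.sbatch python /work/train.py` followed by `sbatch train.sbatch` would submit the job; the image and script names here are made up for the example.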

Part B: Advanced Apptainer Lab (60-90 min, optional)

  • Real-world CNN with CIFAR-10
  • Hyper-parameter sweeps with job arrays
  • Distributed training (DDP/Horovod)
  • Security and image hardening
  • CI/CD integration

Part C: Charliecloud Lab (60 min)

  • Charliecloud setup and basics
  • Building custom images with Dockerfiles
  • SquashFS for performance
  • SLURM integration
  • MPI workloads with --join flag

Key Topics Covered

┌─────────────────────────────────────────────────────────────┐
│                    Workshop Topics                          │
├─────────────────────────────────────────────────────────────┤
│  • Container fundamentals (images, runtimes, bind mounts)   │
│  • Apptainer: pull, shell, exec, run commands               │
│  • Charliecloud: ch-image, ch-run, ch-convert               │
│  • GPU access with --nv (Apptainer) and CUDA (Charliecloud) │
│  • SLURM batch job submission                               │
│  • Multi-GPU and distributed training                       │
│  • Hyper-parameter sweeps with job arrays                   │
│  • SquashFS images for HPC performance                      │
│  • MPI workloads with --join flag (Charliecloud)            │
│  • Security best practices                                  │
│  • CI/CD for containerized ML pipelines                     │
└─────────────────────────────────────────────────────────────┘
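
To illustrate how `--nv` and bind mounts combine in an `apptainer exec` call, the sketch below builds the command line and adds `--nv` only when a GPU driver appears to be present, so the same call can serve the CPU alternatives. `apptainer_cmd` is a hypothetical helper, `/work` is an assumed mount point, and the presence of `nvidia-smi` is only a rough proxy for being on a GPU node.

```shell
# Sketch: build an "apptainer exec" command line, adding --nv only when a GPU
# driver tool is visible on the host. apptainer_cmd is a hypothetical helper.
apptainer_cmd() {
    image=$1
    shift     # remaining arguments: the command to run in the container
    gpu_flag=""
    # nvidia-smi on PATH is used here as a rough proxy for "GPU node".
    if command -v nvidia-smi >/dev/null 2>&1; then
        gpu_flag="--nv "
    fi
    # Print the command instead of running it so it can be inspected first.
    echo "apptainer exec ${gpu_flag}--bind $WORKDIR:/work $image $*"
}
```

The printed line can be pasted into a shell or a job script; on a node without `nvidia-smi` the `--nv` flag is simply omitted.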

Resources

Documentation

Container Registries

RCC Resources


Troubleshooting Quick Reference

Apptainer

| Error | Cause | Solution |
| --- | --- | --- |
| FATAL: container creation failed | Missing or corrupted SIF file | Re-run apptainer pull |
| CUDA not available | Missing --nv flag | Add --nv to apptainer command |
| Permission denied | Wrong bind path permissions | Check path exists and is readable |
| module: command not found | Not on login node or module not loaded | Source environment or load module |
| No space left on device | Quota exceeded | Clean cache or use scratch space |
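
For the "No space left on device" case, one way to use scratch space is to redirect Apptainer's cache and temporary directories through the `APPTAINER_CACHEDIR` and `APPTAINER_TMPDIR` environment variables. The sketch below assumes a directory layout of its own choosing; `use_scratch_cache` is a hypothetical helper.

```shell
# Sketch: point Apptainer's cache and temp directories at scratch space.
# use_scratch_cache and the apptainer/cache, apptainer/tmp layout are assumptions.
use_scratch_cache() {
    base=$1   # a directory with enough quota, e.g. under $SCRATCH
    mkdir -p "$base/apptainer/cache" "$base/apptainer/tmp" || return 1
    export APPTAINER_CACHEDIR="$base/apptainer/cache"
    export APPTAINER_TMPDIR="$base/apptainer/tmp"
    echo "cache: $APPTAINER_CACHEDIR"
    echo "tmp:   $APPTAINER_TMPDIR"
}
```

After `use_scratch_cache "$SCRATCH/$USER"`, later `apptainer pull` calls in the same shell should write to the new locations; `apptainer cache list` and `apptainer cache clean` can then be used to inspect and clear what was stored.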

Charliecloud

| Error | Cause | Solution |
| --- | --- | --- |
| namespace unavailable | User namespaces disabled | Contact admin or check kernel config |
| no such image | Image not pulled/built | Run ch-image pull or ch-image build |
| ch-image: command not found | Module not loaded | module load spack.modules; module load charliecloud |
| MPI hangs | Missing --join flag | Add --join to ch-run command |
| Cache quota exceeded | Home directory limit | Move cache with mvln ~/.charliecloud $SCRATCH/$USER |
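
For the "namespace unavailable" row, a quick first check on many Linux systems is the per-user namespace limit exposed in `/proc/sys/user/max_user_namespaces`. This is only one of several possible controls (some distributions use additional settings), so treat the sketch as a starting point; `userns_enabled` is a hypothetical helper.

```shell
# Sketch: report whether the user-namespace limit allows unprivileged containers.
# userns_enabled is a hypothetical helper; other kernel settings may also apply.
userns_enabled() {
    # Optional argument lets the check be pointed at another file for testing.
    limit_file=${1:-/proc/sys/user/max_user_namespaces}
    if [ ! -r "$limit_file" ]; then
        echo "cannot read $limit_file" >&2
        return 2
    fi
    limit=$(cat "$limit_file")
    if [ "$limit" -gt 0 ]; then
        echo "user namespaces enabled (limit $limit)"
    else
        echo "user namespaces disabled (limit 0)"
        return 1
    fi
}
```

If the function reports a limit of 0 on a compute node, the setting has to be changed by a system administrator.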

Contributing

Found a bug or have suggestions? Please open an issue or pull request on GitHub.


License

This workshop material is provided for educational purposes. See individual container images for their respective licenses.


Contact

Parmanand Sinha
Computational Scientist (GIS+HPC)
Research Computing Center
University of Chicago

For questions about RCC resources: RCC Support
