FrontierAI

AI-related notes on the Frontier supercomputer at OLCF

Contents

FlashAttention

Installation

FA2 is supported on Frontier, and the upstream repo can be pip-installed:

module load PrgEnv-gnu
module load rocm
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $WRKSPC/miniconda
export PATH=$WRKSPC/miniconda/bin:$PATH
conda create --prefix $WRKSPC/miniconda/envs/fa2-env -y
source $WRKSPC/miniconda/etc/profile.d/conda.sh
conda activate $WRKSPC/miniconda/envs/fa2-env
git clone https://github.com/Dao-AILab/flash-attention
pushd flash-attention
git checkout v2.8.3
pip install -e .
popd

For the latest development version, try AMD's fork.
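
A quick smoke test of the install (a minimal sketch; the shapes and dtype below are illustrative, not taken from the repo's tests):

    # Minimal flash_attn smoke test on one GPU (illustrative shapes/dtype).
    import torch
    from flash_attn import flash_attn_func

    # flash_attn_func expects q/k/v laid out as (batch, seqlen, nheads, headdim)
    q = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.bfloat16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    out = flash_attn_func(q, k, v, causal=True)
    print(out.shape)  # torch.Size([2, 1024, 16, 64])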

Backend

Standalone FA:

  • CK (default)
  • Triton

PyTorch scaled_dot_product_attention (SDPA); backend selection is shown in the sketch after this list:

  • Math
  • FA
  • Efficient
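
To pin SDPA to one of these backends, recent PyTorch exposes the sdpa_kernel context manager (a minimal sketch, assuming PyTorch 2.3+; older releases use torch.backends.cuda.sdp_kernel instead):

    # Force a specific SDPA backend for a region of code.
    import torch
    import torch.nn.functional as F
    from torch.nn.attention import sdpa_kernel, SDPBackend

    # SDPA expects q/k/v laid out as (batch, nheads, seqlen, headdim)
    q = torch.randn(2, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    # SDPBackend.MATH / FLASH_ATTENTION / EFFICIENT_ATTENTION correspond to the
    # Math / FA / Efficient backends listed above.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)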

Performance

  • For standalone FA, use the latest ROCm. The build against rocm/6.3 is 1.5x faster than the rocm/6.1 build for certain inputs (see FA2).

  • For PyTorch SDPA, use the FA or Efficient backend (see SDPA); a rough timing sketch follows below.
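
A rough way to compare the backends yourself (a hedged micro-benchmark sketch; shapes, iteration count, and the timing method are illustrative only):

    # Time scaled_dot_product_attention under each backend on one GCD.
    import time
    import torch
    import torch.nn.functional as F
    from torch.nn.attention import sdpa_kernel, SDPBackend

    q = torch.randn(8, 16, 4096, 128, device="cuda", dtype=torch.bfloat16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    for name, backend in [("math", SDPBackend.MATH),
                          ("flash", SDPBackend.FLASH_ATTENTION),
                          ("efficient", SDPBackend.EFFICIENT_ATTENTION)]:
        with sdpa_kernel(backend):
            F.scaled_dot_product_attention(q, k, v)   # warm-up
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(10):
                F.scaled_dot_product_attention(q, k, v)
            torch.cuda.synchronize()
            print(f"{name}: {(time.perf_counter() - t0) / 10 * 1e3:.2f} ms")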

X-MoE

Official X-MoE Code and Documentation

https://github.com/Supercomputing-System-AI-Lab/X-MoE

Preparing Conda Environment

module reset
module load cpe/24.11
module load PrgEnv-gnu/8.6.0
module load rocm/6.3.1
module load craype-accel-amd-gfx90a
module load miniforge3/23.11.0-0


# cd to your directory of choice,
# I recommend using /lustre/orion directories as these packages require lots of space.

conda create -p $PWD/TORCH-ROCM6.3.1_env python=3.11 -c conda-forge -y
source activate $PWD/TORCH-ROCM6.3.1_env

Installing Torch

# ROCM 6.3.1
pip3 install --pre torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/rocm6.3
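
A quick check that the ROCm wheel is what actually got installed (a minimal sketch; run inside the activated environment on a node with a visible GPU):

    # Verify the PyTorch build is a ROCm/HIP build and sees a GPU.
    import torch

    print(torch.__version__)         # nightly tag should include a +rocm suffix
    print(torch.version.hip)         # HIP version string; None on a CUDA-only build
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))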

Installing APEX

git clone https://github.com/ROCm/apex.git
cd apex
pip install -r requirements.txt
python setup.py install --cpp_ext --cuda_ext
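
A hedged post-install check that the C++/HIP extensions actually built; FusedLayerNorm needs the compiled fused kernels at run time, and the shape below is illustrative:

    # Check that apex imports and its fused layer-norm extension loads on the GPU.
    import torch
    from apex.normalization import FusedLayerNorm

    ln = FusedLayerNorm(1024).to("cuda")
    out = ln(torch.randn(4, 1024, device="cuda"))
    print(out.shape)  # torch.Size([4, 1024])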

Installing FlashAttention

Didn't work with pip install, so tried to install from source instead.

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/
python setup.py install
pytest -q -s tests/test_flash_attn.py

Installing X-MoE

Worked as is.

cd ~
git clone https://github.com/Supercomputing-System-AI-Lab/X-MoE
cd X-MoE
git submodule update --init --recursive --remote

pip install -e .
cd Megatron-DeepSpeed-X-MoE && pip install -e .

Data Preparation

Recommend removing   from X-MoE's parent directory. Worked as is.

Maybe make the amount of data a variable so that we don't have to go through the entire dataset.

Training on Single Node

Didn't work on login node. Looks like this depends on having a host list.

    scontrol: error: host list is empty
    first=
    ssh: Could not resolve hostname : Name or service not known
    MASTER_ADDR=

Getting an Interactive Node

    salloc -A PROJID -J RunSim123 -t 0:30:00 -p batch -N 1

Running the Training

    ./X-MoE-Small-node-1.sh 8 1

There is a PyTorch cpp_extension-related problem on Frontier: the "ninja -v" invocation needs to be replaced with

    ninja --version
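
To locate the file that needs this edit (a small sketch; the ['ninja', '-v'] call lives in torch/utils/cpp_extension.py in recent PyTorch releases, but verify against your install):

    # Print the path of the cpp_extension module so the "ninja -v" call can be edited.
    import torch.utils.cpp_extension as cpp_extension
    print(cpp_extension.__file__)
    # In that file, replace ['ninja', '-v'] with ['ninja', '--version'].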

Need to install mpi4py

    MPICC="cc -shared" pip install --no-cache-dir --no-binary=mpi4py mpi4py
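
A minimal sanity check that the Cray-wrapped mpi4py build works (a sketch; the srun invocation in the comment is illustrative):

    # check_mpi.py -- run e.g. with: srun -N 2 --ntasks-per-node=1 python check_mpi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")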

Replaced torchrun with srun

    time srun -u -N 1 -n${NUM_GPUS} -c2 --ntasks-per-node=8 \
        --gpus-per-node=8 --gpu-bind=closest python pretrain_gpt_deepspeed.py ...
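
When srun replaces torchrun, the rank/world-size environment that torchrun normally sets has to come from Slurm instead. A hedged sketch of what the training script (or a wrapper) ends up doing; the variable mapping and init call are assumptions, not the exact X-MoE logic:

    # Map Slurm's per-task environment onto the variables torch.distributed expects.
    import os
    import torch
    import torch.distributed as dist

    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    os.environ.setdefault("RANK", str(rank))
    os.environ.setdefault("WORLD_SIZE", str(world_size))
    os.environ.setdefault("LOCAL_RANK", str(local_rank))
    # MASTER_ADDR / MASTER_PORT must still be exported by the batch script,
    # e.g. from the first host in $SLURM_NODELIST; the empty MASTER_ADDR in the
    # "Training on Single Node" error above is what happens when they are missing.

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)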

Error Logs

    ./X-MoE-Small-node-1-srun.sh 8 1

srun_logs_with_jit_error.log

     ImportError: /lustre/orion/stf218/world-shared/sajal/testing_xmoe/
     X-MoE/Megatron-DeepSpeed-X-MoE/
     megatron/fused_kernels/build/scaled_upper_triang_masked_softmax_cuda.so:
     cannot open shared object file: No such file or directory

Using Correct Versions of Compiler

    export CXX=/opt/cray/pe/gcc-native/13/bin/g++
    export CC=/opt/cray/pe/gcc-native/13/bin/gcc
    export PATH=/opt/cray/pe/gcc-native/13/bin:$PATH