
L46 Lab 2: Native PyTorch and HuggingFace optimisation for training and inference

Important: Students must not use the AI assistant from Lightning AI.

Overview

This lab is divided in two parts: training optimisation via native PyTorch functionalities, and GPU / CPU inference. You will learn to:

  • Understand attention kernels
  • Use TensorBoard to profile PyTorch training loops and spot native PyTorch kernels
  • Practise graph compilation for faster training
  • Serve a 4B LLM on GPU and CPU, with and without quantization
  • Understand the limits of serving a model in Python

Listen

Simply listen to the instructions! No code, no web browsing!

Setup

1. Lightning AI Account and Session

  1. From the top-left profile switcher, make sure you are in your own space (switch to your user if needed).
  2. Create a new Studio (do not reuse your lab 1 Studio!).
  3. Start a 24GB GPU session (verify you are not running an expensive instance!).
  4. Connect to your Studio and open a terminal (top-left → Terminal).
  5. In the terminal, clone the repository:
    git clone https://github.com/camlsys/L46_lab2_compile_gpu_cpu.git
  6. Move into the repository:
    cd L46_lab2_compile_gpu_cpu

2. Running the Scripts

Installing flash-attention can become tricky due to dependencies and compilation. Please run the following commands.

# Install dependencies
pip install -r requirements.txt

mkdir -p ~/.tmp
TMPDIR=$HOME/.tmp python -m pip install --no-cache-dir flash-attn --no-build-isolation

python -m pip install bitsandbytes

All the important scripts (train.py, attention_toy_exemple.py, and infer.py) can be run with python script.py. However, you will need to read them carefully to understand their different arguments.

2.5 TensorBoard Documentation

You should be a bit more familiar with the profiler by now, but in case you need it, you can find everything about it here.

3. Viewing TensorBoard Profiling

After the script finishes, a tb_profiler/ directory containing profiling traces will be created.

  1. In your Studio, open the Port Viewer from the right-side panel.
  2. Click + New Port (top right).
  3. Set:
    • Name: tensorboard
    • Port: 6006
  4. Click Display, then return to your Studio (VS Code icon).
  5. Start TensorBoard in the terminal:
    tensorboard --logdir tb_profiler
  6. Go back to the Port Viewer and open the port you created in a new browser tab.
  7. Verify that everything appears as expected; you should see 3 steps by default in the trace.

Some notes:

  • When switching between TensorBoard views (e.g., Overview, Trace, GPU), it may take a few seconds for the page to fully load — this is normal.
  • If TensorBoard does not load after ~30 seconds, stop the command and restart it.

Part 1: Optimising Training on GPU (45 minutes)

Objectives and Deliverables

Each step must be documented in the report, so remember to take screenshots! You can choose your preferred metrics, but note that train.py reports the number of tokens processed per second during training.

  1. The warm-up for this lab is to read attention_toy_exemple.py and run it. Comment in the report on what you see and, complementing with the documentation, explain in your own words what this code is about and what the different backends are.
  2. The second step is to look for a TODO in the code of train.py. This warms you up based on what you did in the previous lab.
  3. Answer and document the following question: is flash-attention used during training here? Hint: in the trace, zoom into the forward of LlamaAttention_9.
  4. The last step is to experiment with torch.compile (see the documentation and tutorial). You will need to play with different configurations of torch.compile and report their differences. You will also need to describe in your own words what each configuration you tried is doing, i.e. do not simply report results for a set of parameters without commenting.
  5. Conclude by summarizing the observed gains from the first version to your best one.

Part 2: GPU and CPU inference (45 minutes)

This part touches on the limits of fast inference when sticking to a language like Python and to dynamic models. We will see that, effectively, this is not a viable way of serving a model.

Instead, students will use Ollama, an open‑source LLM runtime that serves models as lightweight, self‑contained services. Ollama runs outside of Python, offering faster inference, lower overhead, and easier deployment on CPUs or GPUs, making it a more suitable choice for serving large models in this lab.
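Since CPU results depend heavily on parallelism, the sketch below shows how PyTorch's intra-op thread count is controlled; the matmul loop is just a stand-in for a model's workload, and the thread count of 4 is an arbitrary example:

```python
import time
import torch

torch.set_num_threads(4)  # cap intra-op parallelism; try 1, 2, 4, 8, ...
x = torch.randn(512, 512)

start = time.perf_counter()
for _ in range(50):
    y = x @ x  # crude stand-in for the matmuls dominating inference
elapsed = time.perf_counter() - start

print(f"{elapsed:.3f}s with {torch.get_num_threads()} threads")
```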

Objectives and Deliverables

Every step except 0 must be reported with tables or screenshots as well as comments.

  1. The infer.py script does a few things; your first step should be to read it carefully and understand the different parameters.
  2. Run the script with GPU decoding and note your baseline performance. Try to decorate the generate function with some profiler capabilities for extra metrics!
  3. Think about reusing previously tested optimization techniques to improve the metrics.
  4. Switch to CPU and do the same. On CPU, you can also observe the impact of the number of threads. Try to draw a parallel with on-device deployment: would this work as-is in a constrained scenario?
  5. A first naïve solution for lowering the RAM/VRAM consumption and increasing the speed would be to use [bitsandbytes](https://huggingface.co/docs/transformers/v4.28.0/main_classes/quantization) for quantizing the model. What type of quantization is that? Experiment with 4-bit and 8-bit quantization. What do you observe in your metrics? Explain in your own words any unexpected performance.

Part 3: Report Writing (15 minutes)

Take some time to structure your report properly and clean it up.

Side Quest: Serving a model properly with llama.cpp

It is hopefully clear by now that serving models with Python is not the best solution. This makes sense as, in practice, most models are compiled to target some hardware of interest. This part of the lab takes you one step further by using Ollama as a middle ground. Ollama is a wrapper around the well-known llama.cpp runtime, enabling the serving of quantized models with a C++ backend. Ollama hosts most LLMs so that checkpoints can be pulled and managed a bit like Docker images. In practice, it is somewhat limited, but it will be sufficient to illustrate the gain in performance. The repository of Ollama can be reached here.


  1. Download and install Ollama at the system level for serving models:
    curl -fsSL https://ollama.com/install.sh | sh
  2. Install the ollama Python client via pip:
    pip install ollama
  3. Get the Qwen3:4B model:
    ollama pull qwen3:4b
  4. Serve it:
    ollama serve
  5. Run the script to get some numbers:
    python side_quest.py
  6. Your turn now! If you arrive at this point, here are a few questions that you should try to answer in your report:
  • Is the inference using CPU or GPU here?
  • Try the other hardware accelerator to make a comparison. Do you observe a speed-up compared to Python? You may need multiple runs, as Ollama is sometimes non-deterministic.
  • What is the qwen3:4b checkpoint exactly? Are we running in half precision? Is it quantized? Hint: ollama show.
  • Try everything again with qwen3:14b and compare with the infer.py Python inference.

Tips for Using TensorBoard

  • Trace View: Use the timeline to see when operations occur
  • Operator View: See which operations take the most time
  • GPU Kernel View: Check Tensor Core usage and kernel efficiency
  • Memory View: Analyze VRAM usage patterns
  • Zoom and Pan: Use these to focus on specific time ranges

Files in This Repository

  • attention_toy_exemple.py: script for experimenting with attention kernels.
  • train.py: main training script with profiling integration.
  • infer.py: main script for the inference exercises.
  • utils.py: dataset loading and utility functions.
  • side_quest.py: inference script using Ollama.
  • requirements.txt: Python dependencies.

Getting Help

  • Ask your teacher for hints (not solutions!) if you're stuck. Do not stay stuck; the lab is very short!
  • Focus on understanding why optimizations work, not just implementing them
  • Use TensorBoard's documentation if needed

Good luck with the lab!

About

Repository for the second lab of L46 on torch.compile and CPU / GPU inference
