
feat: custom cpu offload engine #63

Draft
xrsrke wants to merge 21 commits into dev-updated-again from phuc/custom_offload

Conversation


@xrsrke xrsrke commented Mar 26, 2026

No description provided.

xrsrke added 4 commits March 26, 2026 15:51
Reference implementation of fine-grained activation offloading and
hybrid optimizer CPU offloading. Preserved as-is under megatron_fork/
for comparison before refactoring.
Break the monolithic reference into clean, independent modules with
zero framework dependencies (only requires PyTorch):

- utils.py: debug helpers, is_graph_capturing(), summary table printer
- tensor_pool.py: TensorPool — pinned CPU memory pool with O(1) reuse
- offload_group.py: OffloadTensorGroup — batch of tensors with CUDA events
- chunk_handler.py: ChunkOffloadHandler — core D2H/H2D copy engine
- offload_manager.py: OffloadManager — singleton orchestrator with VP/PP support
- autograd_hooks.py: ActivationOffloadContext, group_start/commit functions
- hybrid_optimizer.py: HybridDeviceOptimizer — GPU/CPU split optimizer
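The core of the `tensor_pool.py` module described above is a free-list of pinned CPU buffers keyed by shape and dtype, so repeated offloads reuse allocations instead of paying for a fresh `cudaMallocHost` each time. A minimal sketch of that idea (names and API are illustrative, not the PR's actual interface; pinning falls back to pageable memory on CPU-only machines):

```python
from collections import defaultdict
import torch

class TensorPool:
    """Pinned CPU buffer pool with O(1) reuse, keyed by (shape, dtype)."""
    def __init__(self):
        self._free = defaultdict(list)   # (shape, dtype) -> free CPU tensors
        self.hits = 0
        self.misses = 0

    def acquire(self, shape, dtype):
        key = (tuple(shape), dtype)
        if self._free[key]:              # O(1) pop from the free list
            self.hits += 1
            return self._free[key].pop()
        self.misses += 1
        # pin_memory needs a CUDA-capable runtime; fall back to pageable RAM
        pin = torch.cuda.is_available()
        return torch.empty(shape, dtype=dtype, pin_memory=pin)

    def release(self, t):
        self._free[(tuple(t.shape), t.dtype)].append(t)

pool = TensorPool()
a = pool.acquire((4, 4), torch.float32)   # miss: fresh allocation
pool.release(a)
b = pool.acquire((4, 4), torch.float32)   # hit: same buffer handed back
assert b.data_ptr() == a.data_ptr()
```

Reuse matters because pinned allocation is expensive (the "11x faster than fresh cudaMallocHost" result below); the pool amortizes it to a dictionary lookup.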
7 test files covering every component from unit to end-to-end:

- test_tensor_pool.py (13): alloc/free, pool reuse, pinned memory
- test_offload_group.py (8): push/pop, CUDA events, offload stats
- test_chunk_handler.py (13): D2H/H2D roundtrips, bulk offload/reload
- test_offload_manager.py (12): singleton, streams, VP, delayed offload
- test_e2e_activation_offload.py (10): Linear, MLP (4/8/16 layers),
  MoE, multi-iteration, warmup stats, f32/f16/bf16 dtype support
- test_hybrid_optimizer.py (11): GPU/CPU split, overlap modes, FP32
- test_utils.py (6): debug, graph capture, summary table
Tests that measure and prove each benefit:
- Async D2H: dedicated stream overlaps with compute
- Async H2D: dedicated stream overlaps with compute
- Pinned pool: 11x faster than fresh cudaMallocHost
- Per-module granularity: size threshold + selective offload
- Activations on CPU: tensors physically on pinned CPU memory
- Forced tensor release: storage freed immediately
- CUDA event sync: prevents data races across streams
- E2E: gradient correctness under async transfers
@xrsrke xrsrke force-pushed the phuc/custom_offload branch from 6aa2943 to 2e2e8f1 on March 26, 2026 15:52
xrsrke added 17 commits March 26, 2026 16:04
Tests covering patterns found in other frameworks' test suites:

- Memory allocation tracking: memory_allocated() decreases after
  offload, increases after reload, delta matches tensor byte size
- Device placement: CPU/GPU device verified at each stage
- Unpinned fallback: offload works without pin_memory (slower)
- Bitwise exact roundtrip: torch.equal() across 3 dtypes x 3 shapes
- Multiple offload/reload cycles: 5 cycles same tensor, 20 cycles
  with pool reuse (1 miss, rest hits)
- Large tensor: 128MB tensor roundtrip
- Gradient accumulation: 3 micro-batches with offloading
- Mixed contiguous/non-contiguous in same group
- Stream busyness query during offload
- Mixed dtype group (f32 + f16 + bf16 in one group)
- test_single_huge_tensor_offload[1/5/10]: single contiguous tensor
  offload+reload with bitwise verification, auto-skips if not enough
  GPU memory
- test_massive_offload_50pct_gpu_memory: 20GB across 40x512MB chunks,
  verifies memory_allocated() drops by 20GB and all chunks match

On B200 (178GB): all 5 tests pass, 20GB freed and verified in 30s.
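The bitwise-roundtrip tests above reduce to a simple invariant: offload to a (possibly pinned) CPU buffer, reload, and require `torch.equal`. A minimal CPU-friendly version of that check, looping over the same three dtypes:

```python
import torch

def roundtrip(t):
    """Offload t to a CPU buffer and reload it to its original device."""
    cpu = torch.empty(t.shape, dtype=t.dtype, pin_memory=t.is_cuda)
    cpu.copy_(t)             # D2H (plain CPU->CPU copy without a GPU)
    return cpu.to(t.device)  # H2D reload

device = "cuda" if torch.cuda.is_available() else "cpu"
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    src = torch.randn(64, 64, device=device).to(dtype)
    assert torch.equal(roundtrip(src), src)   # bitwise exact, per dtype
```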
…graphs

Make the engine truly general-purpose — not just activations:

New modules:
- tensor_offloader.py: TensorOffloader — async D2H/H2D for ANY tensor,
  with pinned pool, release_storage, event-based sync
- weight_offload.py: WeightOffloadHook — module hooks for weight
  prefetch/offload around forward, with next-module pipelining
- gradient_offload.py: GradientOffloadHook — post_accumulate_grad_hook
  offloads grads to CPU, reload_all() before optimizer step

25 new tests (all passing):
- TensorOffloader: roundtrip, dtypes, release_storage, non-contiguous,
  10 concurrent tensors, pool reuse, no-pool fallback
- WeightOffload: forward/backward correctness, multi-layer
- GradientOffload: offload to CPU, reload for GPU optimizer,
  correctness, gradient accumulation, has_offloaded_grads
- torch.compile: compiled model + tensor/gradient/activation offload
- CUDA graphs: event-based sync (no stream.synchronize), graph capture
- Combined: activation + gradient offloading on same model
Config:
- training.enable_activation_offload (bool)
- training.activation_offload_modules (comma-sep: expert_fc1,moe_act)
- training.activation_offload_min_tensor_size (int, default 1M)

Integration:
- train.py: init_chunk_handler per microbatch, reset per iteration,
  configure MoE offload flags from config
- moe.py: offload_expert_fc1/offload_moe_act flags on MoE module
- chunk_handler.py: safety guards for FSDP/EP tensor validity

Benchmark configs:
- qwen3_30b_a3b_offload_bench.toml (OFF)
- qwen3_30b_a3b_offload_bench_ON.toml (ON)
- scripts/benchmark_offload.sh

Known issue: autograd saved-tensor hooks intercept internal FSDP/EP
communication tensors, causing illegal memory access. Current workaround
skips tensors with freed storage. MoE expert-level activation offloading
needs explicit TensorOffloader integration at the GroupedExperts level
instead of autograd hooks around the whole expert call.

Baseline verified: Qwen3 30B-A3B, EP=8, 8xB200 — 80 GiB, 5300 TPS.
Replace autograd saved_tensors_default_hooks (which intercept FSDP/EP
internal tensors and crash) with _ExpertWithOffload autograd.Function
that explicitly offloads expert input to CPU and recomputes during
backward.
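The `_ExpertWithOffload` shape described above can be sketched as an `autograd.Function` that ships the expert input to CPU in forward and recomputes the expert in backward from the reloaded input. This is an illustrative reconstruction under stated assumptions, not the PR's implementation (no streams, no FSDP handling):

```python
import torch

class ExpertWithOffload(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, expert):
        ctx.expert = expert
        # only the (detached) input survives, and it lives on CPU
        ctx.save_for_backward(x.detach().to("cpu", non_blocking=x.is_cuda))
        with torch.no_grad():
            return expert(x)              # no intermediates kept on device

    @staticmethod
    def backward(ctx, grad_out):
        (x_cpu,) = ctx.saved_tensors
        x = x_cpu.to(grad_out.device).requires_grad_(True)
        with torch.enable_grad():
            y = ctx.expert(x)             # recompute the expert forward
        y.backward(grad_out)              # fills x.grad and expert param grads
        return x.grad, None               # one output per forward input

expert = torch.nn.Linear(4, 4)
inp = torch.randn(3, 4, requires_grad=True)
out = ExpertWithOffload.apply(inp, expert)
out.sum().backward()
assert inp.grad is not None and expert.weight.grad is not None
```

Because the function decides explicitly which tensor to save, it never touches FSDP/EP communication tensors, which is what made the global `saved_tensors_default_hooks` approach crash.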

Known issue: with activation_checkpoint.mode="full" (required for
Qwen3-30B-A3B to fit in 178GB), the offload recompute conflicts
with AC recompute — double memory usage causes OOM. Next step:
integrate offload inside the AC boundary so checkpoint boundary
tensors go to CPU instead of staying on GPU.

Safety guards: skip tensors with freed storage, try/except around
copy in bulk_offload_group.
…ompute

Replace broken custom autograd.Function with torch.utils.checkpoint
which correctly handles the autograd graph. When enable_activation_offload
is True, expert forward is checkpointed — intermediate activations
(w1*x, silu, w3*x) are not kept on GPU, recomputed during backward.
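The fix above is standard `torch.utils.checkpoint` usage: wrap the expert forward so only its input is saved and the SwiGLU intermediates are recomputed in backward. A minimal sketch with an illustrative expert module (the real expert is Megatron-style grouped experts):

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class Expert(torch.nn.Module):
    """SwiGLU-style expert MLP: w2(silu(w1 x) * (w3 x))."""
    def __init__(self, d, h):
        super().__init__()
        self.w1 = torch.nn.Linear(d, h, bias=False)
        self.w3 = torch.nn.Linear(d, h, bias=False)
        self.w2 = torch.nn.Linear(h, d, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

expert = Expert(8, 16)
x = torch.randn(2, 8, requires_grad=True)
# use_reentrant=False is the recommended, autograd-graph-correct variant;
# w1*x, silu, and w3*x are recomputed during backward, not stored
y = checkpoint(expert, x, use_reentrant=False)
y.sum().backward()
assert x.grad is not None
```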

Tested on debugmodel_moe (8 layers, 64 experts, EP=8, 8xB200):
- Baseline (no AC, no offload): 14.3 GiB peak, 69K TPS
- With expert checkpoint:         5.2 GiB min,  51K TPS
- Memory reduction: up to 47% on some ranks
- TPS overhead: 26% (recompute cost)
- Loss converges correctly (3.42 -> 3.40, matching baseline)
debugmodel_moe (8 layers, 64 experts), EP=8, 8xB200, batch=40, seq=4096:

Baseline (no AC, no offload):
  Memory: 132-173 GiB (74-97%), TPS: 157K, loss=3.62

With expert checkpoint (enable_activation_offload=true):
  Memory: 91-173 GiB (51-97%), TPS: 179K, loss=3.64

Results:
  - Memory (min rank): 132 -> 91 GiB (-31%)
  - Memory (average):  ~148 -> ~112 GiB (-24%)
  - TPS: 157K -> 179K (+14% faster)
  - Loss: converges correctly
Qwen3 10B-A1B (128 experts), EP=8, batch=5, seq=4096, 8xB200, no AC:

A: FSDP baseline        — 167 GiB (94%), 16,439 TPS
B: FSDP cpu_offload     — 154 GiB (86%),  3,668 TPS (-78% TPS, -8% mem)
C: Custom expert offload — 132 GiB (74%), 14,299 TPS (-13% TPS, -21% mem)

Custom offload saves 2.6x more memory than FSDP offload with
3.9x better throughput. FSDP offloads params/optimizer to CPU
(slow CPU optimizer step), ours checkpoints expert activations
(GPU recompute, much faster).
Qwen3 30B-A3B (128 experts), EP=8, batch=2, seq=4096, 8xB200, no AC:

A: Baseline              — 155 GiB (87%),  7,100 TPS
B: FSDP cpu_offload      — 100 GiB (56%),    444 TPS  (-55 GiB, -94% TPS)
C: Custom expert offload — 130 GiB (73%),  6,300 TPS  (-25 GiB, -11% TPS)

FSDP offload: most memory saved (55 GiB) but 16x slower (CPU optimizer).
Custom offload: 25 GiB saved, only 11% slower (GPU recompute).
Custom is 14x faster than FSDP offload.
Qwen3 30B-A3B, EP=8, batch=2, seq=4096, 8xB200, no AC:

| # | Act offload | FSDP offload | Memory   | Saved  | TPS   |
|---|-------------|--------------|----------|--------|-------|
| 1 | OFF         | OFF          | 151 GiB  | —      | 7,426 |
| 2 | ON          | OFF          | 130 GiB  | 21 GiB | 6,543 |
| 3 | OFF         | ON           | 100 GiB  | 51 GiB |   518 |
| 4 | ON          | ON           |  72 GiB  | 79 GiB |   550 |

Act offload = expert activation checkpoint (our engine)
FSDP offload = params + optimizer + grads to CPU
Qwen3 30B-A3B (128 experts), EP=8, batch=2, seq=4096, 8xB200, no AC:

| # | Config              | What offloaded          | Memory   | Saved    | TPS   |
|---|---------------------|-------------------------|----------|----------|-------|
| 1 | Baseline            | Nothing                 | 154 GiB  | —        | 7,666 |
| 2 | save_on_cpu         | Act → pinned CPU        | 118 GiB  | 36 GiB   | 1,262 |
| 3 | checkpoint          | Act recomputed          | 129 GiB  | 25 GiB   | 6,518 |
| 4 | checkpoint+cpu      | Act recompute+CPU save  | 118 GiB  | 36 GiB   | 3,500 |
| 5 | FSDP cpu_offload    | Params+optim+grads→CPU  | 122 GiB  | 32 GiB   | 457   |
| 6 | save_on_cpu + FSDP  | Everything → CPU        |  61 GiB  | 93 GiB   | 318   |

New: activation_offload_mode config ("save_on_cpu", "checkpoint", "both")
using torch.autograd.graph.save_on_cpu for async D2H activation offload.
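The `save_on_cpu` mode in row 2 uses a stock PyTorch context manager: `torch.autograd.graph.save_on_cpu` relocates every tensor autograd saves for backward onto CPU (pinned, if requested, for async copies) and streams it back on demand. Minimal usage sketch:

```python
import torch
from torch.autograd.graph import save_on_cpu

model = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.GELU(), torch.nn.Linear(32, 8)
)
x = torch.randn(4, 8, requires_grad=True)

# pin_memory=True enables async D2H/H2D when the model runs on GPU;
# harmless (and unnecessary) on a CPU-only run
with save_on_cpu(pin_memory=torch.cuda.is_available()):
    y = model(x)          # saved activations live on CPU, not the device
y.sum().backward()        # activations are reloaded during backward
assert x.grad is not None
```

Rows 2-4 above are exactly the trade this context manager makes: more memory saved than pure checkpointing, but H2D reload traffic during backward costs throughput.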
Add post-MoE attention block to TransformerBlock. Expert D2H offload
runs on async stream overlapping with this attention compute.

30B-A3B, EP=8, batch=2, seq=4096, 8xB200:
| Config         | Memory   | TPS   | vs baseline |
|----------------|----------|-------|-------------|
| Baseline       | 169 GiB  | 1,559 | —           |
| Async offload  | 144 GiB  | 3,952 | +153%       |
| FSDP offload   | 113 GiB  |   553 | -65%        |
| Async + FSDP   |  86 GiB  |   509 | -67%        |

Async offload: checkpoint expert (no intermediates saved) + async D2H
of expert input overlaps with post-MoE attention on default stream.
Memory savings reduce allocator fragmentation → faster than baseline.
New: enable_weight_offload config. Offloads expert weights to pinned
CPU after each layer's expert forward, reloads before next layer.
D2H overlaps with post-MoE attention. Handles FSDP DTensor via to_local().
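The weight-offload hook described above amounts to: park a module's parameters on CPU after its forward, release the device storage, and restore them before the next use. A hypothetical CPU-friendly sketch of just the data movement (the real hook does the copies on a side stream with events and unwraps FSDP DTensors via `to_local()`):

```python
import torch

class WeightOffloadHook:
    def __init__(self, module):
        self.module = module
        self.cpu_copies = {}

    def offload(self):
        for name, p in self.module.named_parameters():
            self.cpu_copies[name] = p.data.to("cpu", non_blocking=p.is_cuda)
            # swap in an empty tensor so the device copy can be freed
            p.data = torch.empty(0, dtype=p.dtype, device=p.device)

    def reload(self):
        for name, p in self.module.named_parameters():
            p.data = self.cpu_copies.pop(name).to(p.device)

layer = torch.nn.Linear(16, 16)
hook = WeightOffloadHook(layer)
ref = layer.weight.data.clone()
hook.offload()
assert layer.weight.numel() == 0        # device storage released
hook.reload()
assert torch.equal(layer.weight.data, ref)
```

As the results below show, the hard part is not the copy but making FSDP actually release the original sharded storage.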

30B-A3B + post-MoE attention, EP=8, batch=2, seq=4096, 8xB200:

| # | Config              | Memory   | TPS   | What offloaded          |
|---|---------------------|----------|-------|-------------------------|
| 1 | Baseline            | 162 GiB  | 6,149 | Nothing                 |
| 2 | Weight only         | 169 GiB  | 2,949 | Expert weights → CPU    |
| 3 | Activation only     | 144 GiB  | 3,330 | Expert acts (checkpoint)|
| 4 | Weight + Activation | 146 GiB  | 2,326 | Both                    |

Weight offload (#2) doesn't save memory yet because FSDP DTensor
.set_() doesn't actually free the original storage. Activation
offload (#3) saves 18 GiB via checkpoint. Combined (#4) saves 16 GiB.
Deep investigation of FSDP2 CPUOffloadPolicy internals:
- H2D via all_gather_copy_in_stream (non_blocking)
- D2H via reduce_scatter_stream (non_blocking, event sync)
- GPU memory freed via storage.resize_(0)
- Optimizer runs on CPU (the throughput bottleneck)
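The `storage.resize_(0)` trick noted above is plain PyTorch: shrinking a tensor's untyped storage to zero releases the backing memory while the tensor object (shape, dtype, hooks) stays alive, and resizing it back re-allocates. A minimal demonstration (contents are undefined after re-allocation, so the tensor must be re-filled before use):

```python
import torch

t = torch.randn(1024, 1024)
nbytes = t.untyped_storage().nbytes()

t.untyped_storage().resize_(0)          # memory freed, metadata intact
assert t.untyped_storage().nbytes() == 0
assert t.shape == (1024, 1024)          # the tensor still "exists"

t.untyped_storage().resize_(nbytes)     # re-allocate (contents undefined)
t.zero_()                               # must be refilled, e.g. by all-gather
```

This is how FSDP2 can keep parameter objects registered in the module while their device memory is gone between uses.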

New modules:
- weight_offload in TransformerBlock: async D2H expert weights after
  expert forward, async H2D reload before next forward. Handles
  DTensor via to_local(). Overlaps with post_moe_attn.
- fsdp_gpu_optimizer.py: wrapper that copies grads GPU→GPU and runs
  optimizer.step() on GPU instead of CPU (prototype).

Config: enable_weight_offload = true/false