
feat: custom cpu offload engine #63

Draft
xrsrke wants to merge 21 commits into dev-updated-again from phuc/custom_offload

Conversation


@xrsrke xrsrke commented Mar 26, 2026

No description provided.

xrsrke added 4 commits March 26, 2026 15:51
Reference implementation of fine-grained activation offloading and
hybrid optimizer CPU offloading. Preserved as-is under megatron_fork/
for comparison before refactoring.
Break the monolithic reference into clean, independent modules with
zero framework dependencies (only requires PyTorch):

- utils.py: debug helpers, is_graph_capturing(), summary table printer
- tensor_pool.py: TensorPool — pinned CPU memory pool with O(1) reuse
- offload_group.py: OffloadTensorGroup — batch of tensors with CUDA events
- chunk_handler.py: ChunkOffloadHandler — core D2H/H2D copy engine
- offload_manager.py: OffloadManager — singleton orchestrator with VP/PP support
- autograd_hooks.py: ActivationOffloadContext, group_start/commit functions
- hybrid_optimizer.py: HybridDeviceOptimizer — GPU/CPU split optimizer
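The core of the `tensor_pool.py` module described above is a free-list of pinned CPU buffers keyed by shape and dtype, so repeated offloads reuse allocations instead of paying for a fresh `cudaMallocHost` each time. A minimal sketch of that idea (names and API are illustrative, not the PR's actual interface; pinning falls back to pageable memory on CPU-only machines):

```python
from collections import defaultdict
import torch

class TensorPool:
    """Pinned CPU buffer pool with O(1) reuse, keyed by (shape, dtype)."""
    def __init__(self):
        self._free = defaultdict(list)   # (shape, dtype) -> free CPU tensors
        self.hits = 0
        self.misses = 0

    def acquire(self, shape, dtype):
        key = (tuple(shape), dtype)
        if self._free[key]:              # O(1) pop from the free list
            self.hits += 1
            return self._free[key].pop()
        self.misses += 1
        # pin_memory needs a CUDA-capable runtime; fall back to pageable RAM
        pin = torch.cuda.is_available()
        return torch.empty(shape, dtype=dtype, pin_memory=pin)

    def release(self, t):
        self._free[(tuple(t.shape), t.dtype)].append(t)

pool = TensorPool()
a = pool.acquire((4, 4), torch.float32)   # miss: fresh allocation
pool.release(a)
b = pool.acquire((4, 4), torch.float32)   # hit: same buffer handed back
assert b.data_ptr() == a.data_ptr()
```

Reuse matters because pinned allocation is expensive (the "11x faster than fresh cudaMallocHost" result below); the pool amortizes it to a dictionary lookup.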
7 test files covering every component from unit to end-to-end:

- test_tensor_pool.py (13): alloc/free, pool reuse, pinned memory
- test_offload_group.py (8): push/pop, CUDA events, offload stats
- test_chunk_handler.py (13): D2H/H2D roundtrips, bulk offload/reload
- test_offload_manager.py (12): singleton, streams, VP, delayed offload
- test_e2e_activation_offload.py (10): Linear, MLP (4/8/16 layers),
  MoE, multi-iteration, warmup stats, f32/f16/bf16 dtype support
- test_hybrid_optimizer.py (11): GPU/CPU split, overlap modes, FP32
- test_utils.py (6): debug, graph capture, summary table
Tests that measure and prove each benefit:
- Async D2H: dedicated stream overlaps with compute
- Async H2D: dedicated stream overlaps with compute
- Pinned pool: 11x faster than fresh cudaMallocHost
- Per-module granularity: size threshold + selective offload
- Activations on CPU: tensors physically on pinned CPU memory
- Forced tensor release: storage freed immediately
- CUDA event sync: prevents data races across streams
- E2E: gradient correctness under async transfers
@xrsrke xrsrke force-pushed the phuc/custom_offload branch from 6aa2943 to 2e2e8f1 on March 26, 2026 15:52
xrsrke added 17 commits March 26, 2026 16:04
Tests covering patterns found in other frameworks' test suites:

- Memory allocation tracking: memory_allocated() decreases after
  offload, increases after reload, delta matches tensor byte size
- Device placement: CPU/GPU device verified at each stage
- Unpinned fallback: offload works without pin_memory (slower)
- Bitwise exact roundtrip: torch.equal() across 3 dtypes x 3 shapes
- Multiple offload/reload cycles: 5 cycles same tensor, 20 cycles
  with pool reuse (1 miss, rest hits)
- Large tensor: 128MB tensor roundtrip
- Gradient accumulation: 3 micro-batches with offloading
- Mixed contiguous/non-contiguous in same group
- Stream busyness query during offload
- Mixed dtype group (f32 + f16 + bf16 in one group)
- test_single_huge_tensor_offload[1/5/10]: single contiguous tensor
  offload+reload with bitwise verification, auto-skips if not enough
  GPU memory
- test_massive_offload_50pct_gpu_memory: 20GB across 40x512MB chunks,
  verifies memory_allocated() drops by 20GB and all chunks match

On B200 (178GB): all 5 tests pass, 20GB freed and verified in 30s.
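The bitwise-roundtrip tests above reduce to a simple invariant: offload to a (possibly pinned) CPU buffer, reload, and require `torch.equal`. A minimal CPU-friendly version of that check, looping over the same three dtypes:

```python
import torch

def roundtrip(t):
    """Offload t to a CPU buffer and reload it to its original device."""
    cpu = torch.empty(t.shape, dtype=t.dtype, pin_memory=t.is_cuda)
    cpu.copy_(t)             # D2H (plain CPU->CPU copy without a GPU)
    return cpu.to(t.device)  # H2D reload

device = "cuda" if torch.cuda.is_available() else "cpu"
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    src = torch.randn(64, 64, device=device).to(dtype)
    assert torch.equal(roundtrip(src), src)   # bitwise exact, per dtype
```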
…graphs

Make the engine truly general-purpose — not just activations:

New modules:
- tensor_offloader.py: TensorOffloader — async D2H/H2D for ANY tensor,
  with pinned pool, release_storage, event-based sync
- weight_offload.py: WeightOffloadHook — module hooks for weight
  prefetch/offload around forward, with next-module pipelining
- gradient_offload.py: GradientOffloadHook — post_accumulate_grad_hook
  offloads grads to CPU, reload_all() before optimizer step

25 new tests (all passing):
- TensorOffloader: roundtrip, dtypes, release_storage, non-contiguous,
  10 concurrent tensors, pool reuse, no-pool fallback
- WeightOffload: forward/backward correctness, multi-layer
- GradientOffload: offload to CPU, reload for GPU optimizer,
  correctness, gradient accumulation, has_offloaded_grads
- torch.compile: compiled model + tensor/gradient/activation offload
- CUDA graphs: event-based sync (no stream.synchronize), graph capture
- Combined: activation + gradient offloading on same model
Config:
- training.enable_activation_offload (bool)
- training.activation_offload_modules (comma-sep: expert_fc1,moe_act)
- training.activation_offload_min_tensor_size (int, default 1M)

Integration:
- train.py: init_chunk_handler per microbatch, reset per iteration,
  configure MoE offload flags from config
- moe.py: offload_expert_fc1/offload_moe_act flags on MoE module
- chunk_handler.py: safety guards for FSDP/EP tensor validity

Benchmark configs:
- qwen3_30b_a3b_offload_bench.toml (OFF)
- qwen3_30b_a3b_offload_bench_ON.toml (ON)
- scripts/benchmark_offload.sh

Known issue: autograd saved-tensor hooks intercept internal FSDP/EP
communication tensors, causing illegal memory access. Current workaround
skips tensors with freed storage. MoE expert-level activation offloading
needs explicit TensorOffloader integration at the GroupedExperts level
instead of autograd hooks around the whole expert call.

Baseline verified: Qwen3 30B-A3B, EP=8, 8xB200 — 80 GiB, 5300 TPS.
Replace autograd saved_tensors_default_hooks (which intercept FSDP/EP
internal tensors and crash) with _ExpertWithOffload autograd.Function
that explicitly offloads expert input to CPU and recomputes during
backward.
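The `_ExpertWithOffload` shape described above can be sketched as an `autograd.Function` that ships the expert input to CPU in forward and recomputes the expert in backward from the reloaded input. This is an illustrative reconstruction under stated assumptions, not the PR's implementation (no streams, no FSDP handling):

```python
import torch

class ExpertWithOffload(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, expert):
        ctx.expert = expert
        # only the (detached) input survives, and it lives on CPU
        ctx.save_for_backward(x.detach().to("cpu", non_blocking=x.is_cuda))
        with torch.no_grad():
            return expert(x)              # no intermediates kept on device

    @staticmethod
    def backward(ctx, grad_out):
        (x_cpu,) = ctx.saved_tensors
        x = x_cpu.to(grad_out.device).requires_grad_(True)
        with torch.enable_grad():
            y = ctx.expert(x)             # recompute the expert forward
        y.backward(grad_out)              # fills x.grad and expert param grads
        return x.grad, None               # one output per forward input

expert = torch.nn.Linear(4, 4)
inp = torch.randn(3, 4, requires_grad=True)
out = ExpertWithOffload.apply(inp, expert)
out.sum().backward()
assert inp.grad is not None and expert.weight.grad is not None
```

Because the function decides explicitly which tensor to save, it never touches FSDP/EP communication tensors, which is what made the global `saved_tensors_default_hooks` approach crash.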

Known issue: with activation_checkpoint.mode="full" (required for
Qwen3-30B-A3B to fit in 178GB), the offload recompute conflicts
with AC recompute — double memory usage causes OOM. Next step:
integrate offload inside the AC boundary so checkpoint boundary
tensors go to CPU instead of staying on GPU.

Safety guards: skip tensors with freed storage, try/except around
copy in bulk_offload_group.
…ompute

Replace broken custom autograd.Function with torch.utils.checkpoint
which correctly handles the autograd graph. When enable_activation_offload
is True, expert forward is checkpointed — intermediate activations
(w1*x, silu, w3*x) are not kept on GPU, recomputed during backward.
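The fix above is standard `torch.utils.checkpoint` usage: wrap the expert forward so only its input is saved and the SwiGLU intermediates are recomputed in backward. A minimal sketch with an illustrative expert module (the real expert is Megatron-style grouped experts):

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class Expert(torch.nn.Module):
    """SwiGLU-style expert MLP: w2(silu(w1 x) * (w3 x))."""
    def __init__(self, d, h):
        super().__init__()
        self.w1 = torch.nn.Linear(d, h, bias=False)
        self.w3 = torch.nn.Linear(d, h, bias=False)
        self.w2 = torch.nn.Linear(h, d, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

expert = Expert(8, 16)
x = torch.randn(2, 8, requires_grad=True)
# use_reentrant=False is the recommended, autograd-graph-correct variant;
# w1*x, silu, and w3*x are recomputed during backward, not stored
y = checkpoint(expert, x, use_reentrant=False)
y.sum().backward()
assert x.grad is not None
```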

Tested on debugmodel_moe (8 layers, 64 experts, EP=8, 8xB200):
- Baseline (no AC, no offload): 14.3 GiB peak, 69K TPS
- With expert checkpoint:         5.2 GiB min,  51K TPS
- Memory reduction: up to 47% on some ranks
- TPS overhead: 26% (recompute cost)
- Loss converges correctly (3.42 -> 3.40, matching baseline)
debugmodel_moe (8 layers, 64 experts), EP=8, 8xB200, batch=40, seq=4096:

Baseline (no AC, no offload):
  Memory: 132-173 GiB (74-97%), TPS: 157K, loss=3.62

With expert checkpoint (enable_activation_offload=true):
  Memory: 91-173 GiB (51-97%), TPS: 179K, loss=3.64

Results:
  - Memory (min rank): 132 -> 91 GiB (-31%)
  - Memory (average):  ~148 -> ~112 GiB (-24%)
  - TPS: 157K -> 179K (+14% faster)
  - Loss: converges correctly
Qwen3 10B-A1B (128 experts), EP=8, batch=5, seq=4096, 8xB200, no AC:

A: FSDP baseline        — 167 GiB (94%), 16,439 TPS
B: FSDP cpu_offload     — 154 GiB (86%),  3,668 TPS (-78% TPS, -8% mem)
C: Custom expert offload — 132 GiB (74%), 14,299 TPS (-13% TPS, -21% mem)

Custom offload saves 2.6x more memory than FSDP offload with
3.9x better throughput. FSDP offloads params/optimizer to CPU
(slow CPU optimizer step), ours checkpoints expert activations
(GPU recompute, much faster).
Qwen3 30B-A3B (128 experts), EP=8, batch=2, seq=4096, 8xB200, no AC:

A: Baseline              — 155 GiB (87%),  7,100 TPS
B: FSDP cpu_offload      — 100 GiB (56%),    444 TPS  (-55 GiB, -94% TPS)
C: Custom expert offload — 130 GiB (73%),  6,300 TPS  (-25 GiB, -11% TPS)

FSDP offload: most memory saved (55 GiB) but 16x slower (CPU optimizer).
Custom offload: 25 GiB saved, only 11% slower (GPU recompute).
Custom is 14x faster than FSDP offload.
Qwen3 30B-A3B, EP=8, batch=2, seq=4096, 8xB200, no AC:

| # | Act offload | FSDP offload | Memory   | Saved  | TPS   |
|---|-------------|--------------|----------|--------|-------|
| 1 | OFF         | OFF          | 151 GiB  | —      | 7,426 |
| 2 | ON          | OFF          | 130 GiB  | 21 GiB | 6,543 |
| 3 | OFF         | ON           | 100 GiB  | 51 GiB |   518 |
| 4 | ON          | ON           |  72 GiB  | 79 GiB |   550 |

Act offload = expert activation checkpoint (our engine)
FSDP offload = params + optimizer + grads to CPU
Qwen3 30B-A3B (128 experts), EP=8, batch=2, seq=4096, 8xB200, no AC:

| # | Config              | What offloaded          | Memory   | Saved    | TPS   |
|---|---------------------|-------------------------|----------|----------|-------|
| 1 | Baseline            | Nothing                 | 154 GiB  | —        | 7,666 |
| 2 | save_on_cpu         | Act → pinned CPU        | 118 GiB  | 36 GiB   | 1,262 |
| 3 | checkpoint          | Act recomputed          | 129 GiB  | 25 GiB   | 6,518 |
| 4 | checkpoint+cpu      | Act recompute+CPU save  | 118 GiB  | 36 GiB   | 3,500 |
| 5 | FSDP cpu_offload    | Params+optim+grads→CPU  | 122 GiB  | 32 GiB   | 457   |
| 6 | save_on_cpu + FSDP  | Everything → CPU        |  61 GiB  | 93 GiB   | 318   |

New: activation_offload_mode config ("save_on_cpu", "checkpoint", "both")
using torch.autograd.graph.save_on_cpu for async D2H activation offload.
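The `save_on_cpu` mode in row 2 uses a stock PyTorch context manager: `torch.autograd.graph.save_on_cpu` relocates every tensor autograd saves for backward onto CPU (pinned, if requested, for async copies) and streams it back on demand. Minimal usage sketch:

```python
import torch
from torch.autograd.graph import save_on_cpu

model = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.GELU(), torch.nn.Linear(32, 8)
)
x = torch.randn(4, 8, requires_grad=True)

# pin_memory=True enables async D2H/H2D when the model runs on GPU;
# harmless (and unnecessary) on a CPU-only run
with save_on_cpu(pin_memory=torch.cuda.is_available()):
    y = model(x)          # saved activations live on CPU, not the device
y.sum().backward()        # activations are reloaded during backward
assert x.grad is not None
```

Rows 2-4 above are exactly the trade this context manager makes: more memory saved than pure checkpointing, but H2D reload traffic during backward costs throughput.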
Add post-MoE attention block to TransformerBlock. Expert D2H offload
runs on async stream overlapping with this attention compute.

30B-A3B, EP=8, batch=2, seq=4096, 8xB200:
| Config         | Memory   | TPS   | vs baseline |
|----------------|----------|-------|-------------|
| Baseline       | 169 GiB  | 1,559 | —           |
| Async offload  | 144 GiB  | 3,952 | +153%       |
| FSDP offload   | 113 GiB  |   553 | -65%        |
| Async + FSDP   |  86 GiB  |   509 | -67%        |

Async offload: checkpoint expert (no intermediates saved) + async D2H
of expert input overlaps with post-MoE attention on default stream.
Memory savings reduce allocator fragmentation → faster than baseline.
New: enable_weight_offload config. Offloads expert weights to pinned
CPU after each layer's expert forward, reloads before next layer.
D2H overlaps with post-MoE attention. Handles FSDP DTensor via to_local().
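The weight-offload hook described above amounts to: park a module's parameters on CPU after its forward, release the device storage, and restore them before the next use. A hypothetical CPU-friendly sketch of just the data movement (the real hook does the copies on a side stream with events and unwraps FSDP DTensors via `to_local()`):

```python
import torch

class WeightOffloadHook:
    def __init__(self, module):
        self.module = module
        self.cpu_copies = {}

    def offload(self):
        for name, p in self.module.named_parameters():
            self.cpu_copies[name] = p.data.to("cpu", non_blocking=p.is_cuda)
            # swap in an empty tensor so the device copy can be freed
            p.data = torch.empty(0, dtype=p.dtype, device=p.device)

    def reload(self):
        for name, p in self.module.named_parameters():
            p.data = self.cpu_copies.pop(name).to(p.device)

layer = torch.nn.Linear(16, 16)
hook = WeightOffloadHook(layer)
ref = layer.weight.data.clone()
hook.offload()
assert layer.weight.numel() == 0        # device storage released
hook.reload()
assert torch.equal(layer.weight.data, ref)
```

As the results below show, the hard part is not the copy but making FSDP actually release the original sharded storage.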

30B-A3B + post-MoE attention, EP=8, batch=2, seq=4096, 8xB200:

| # | Config              | Memory   | TPS   | What offloaded          |
|---|---------------------|----------|-------|-------------------------|
| 1 | Baseline            | 162 GiB  | 6,149 | Nothing                 |
| 2 | Weight only         | 169 GiB  | 2,949 | Expert weights → CPU    |
| 3 | Activation only     | 144 GiB  | 3,330 | Expert acts (checkpoint)|
| 4 | Weight + Activation | 146 GiB  | 2,326 | Both                    |

Weight offload (#2) doesn't save memory yet because FSDP DTensor
.set_() doesn't actually free the original storage. Activation
offload (#3) saves 18 GiB via checkpoint. Combined (#4) saves 16 GiB.
Deep investigation of FSDP2 CPUOffloadPolicy internals:
- H2D via all_gather_copy_in_stream (non_blocking)
- D2H via reduce_scatter_stream (non_blocking, event sync)
- GPU memory freed via storage.resize_(0)
- Optimizer runs on CPU (the throughput bottleneck)
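The `storage.resize_(0)` trick noted above is plain PyTorch: shrinking a tensor's untyped storage to zero releases the backing memory while the tensor object (shape, dtype, hooks) stays alive, and resizing it back re-allocates. A minimal demonstration (contents are undefined after re-allocation, so the tensor must be re-filled before use):

```python
import torch

t = torch.randn(1024, 1024)
nbytes = t.untyped_storage().nbytes()

t.untyped_storage().resize_(0)          # memory freed, metadata intact
assert t.untyped_storage().nbytes() == 0
assert t.shape == (1024, 1024)          # the tensor still "exists"

t.untyped_storage().resize_(nbytes)     # re-allocate (contents undefined)
t.zero_()                               # must be refilled, e.g. by all-gather
```

This is how FSDP2 can keep parameter objects registered in the module while their device memory is gone between uses.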

New modules:
- weight_offload in TransformerBlock: async D2H expert weights after
  expert forward, async H2D reload before next forward. Handles
  DTensor via to_local(). Overlaps with post_moe_attn.
- fsdp_gpu_optimizer.py: wrapper that copies grads GPU→GPU and runs
  optimizer.step() on GPU instead of CPU (prototype).

Config: enable_weight_offload = true/false