
feat: autotune deepep #62

Draft
xrsrke wants to merge 17 commits into dev-updated-again from phuc/multinode_deepep_autotune

Conversation

@xrsrke

@xrsrke xrsrke commented Mar 19, 2026

No description provided.

xrsrke added 17 commits March 18, 2026 14:32
Grid-searches DeepEP kernel parameters (num_sms, nvl_chunk, rdma_chunk) at
training startup by benchmarking buffer.dispatch()/buffer.combine() with
synthetic data. Stores optimal Config objects as globals passed to every
dispatch/combine call during training.

Key changes:
- job_config.py: Add autotune, autotune_warmup, autotune_repeat,
  autotune_verbose, num_sms, nvl_buffer_size, rdma_buffer_size fields
- deepep.py: Tuned config globals, config= kwargs on all dispatch/combine
  calls (forward + backward), full autotune grid search with internode
  support (joint sms tuning, RDMA chunk search)
- parallelize.py: Wire autotune into qwen3, upgrade MoE to DeepEPMoE
- TOML configs for 1-node (EP=8) and 2-node (EP=16) debug runs
- sbatch script with NVSHMEM env vars for internode RDMA

Internode safety: restrict the sms sweep to a single validated value to prevent
DeepEP dispatch timeouts that fatally corrupt CUDA state.

Tested: 1-node (8x B200) and 2-node (16x B200) with decreasing loss.
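The grid search described above can be sketched as a plain Cartesian sweep that keeps the fastest configuration. This is a minimal illustration, not the PR's implementation: `bench` stands in for whatever callable times a real `buffer.dispatch()`/`buffer.combine()` run under the candidate parameters.

```python
import itertools

def grid_search(bench, num_sms_candidates, nvl_chunk_candidates, rdma_chunk_candidates):
    """Return the (num_sms, nvl_chunk, rdma_chunk) tuple with the lowest
    benchmarked latency. `bench` is assumed to execute the kernel under the
    candidate config and return a time in milliseconds."""
    best_cfg, best_ms = None, float("inf")
    for cfg in itertools.product(
        num_sms_candidates, nvl_chunk_candidates, rdma_chunk_candidates
    ):
        ms = bench(*cfg)
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```

The winning tuple would then be wrapped in a DeepEP `Config` object and stored for every subsequent dispatch/combine call.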
- Replace 6 bare module globals with DeepEPState singleton class
  (_buffer, _handle_cache, _handle_counter, _pending_combine_event,
  _tuned_dispatch_config, _tuned_combine_config -> _state)
- Add _create_uniform_routing() for deterministic round-robin routing
  in autotune benchmarks, replacing random scores + torch.topk
- Add setup_deepep() to centralize SAC registration, MoE->DeepEPMoE
  upgrade, and autotune into a single call
- Simplify qwen3 parallelize.py from a 35-line block to a one-liner
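The deterministic round-robin routing mentioned above might look like the following sketch (the exact signature of `_create_uniform_routing` is an assumption): token i's k-th expert is `(i * topk + k) % num_experts`, so every expert receives an equal token load and repeated autotune runs see identical traffic.

```python
import torch

def create_uniform_routing(num_tokens, num_experts, topk):
    """Deterministic round-robin routing for autotune benchmarks: no RNG,
    perfectly balanced expert load when num_tokens * topk is a multiple
    of num_experts."""
    flat = torch.arange(num_tokens * topk, dtype=torch.int64) % num_experts
    topk_idx = flat.view(num_tokens, topk)
    # Uniform weights so combine reduces to a simple average.
    topk_weights = torch.full((num_tokens, topk), 1.0 / topk, dtype=torch.float32)
    return topk_idx, topk_weights
```

Compared with random scores plus `torch.topk`, this removes run-to-run variance from the benchmark signal.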
…ebug configs

- Apply setup_deepep() to llama4 and deepseek_v3 parallelize.py
  (replaces 22-line DeepEP blocks with a single call)
- Add EP/ETP validation checks to setup_deepep()
- Document autotune in deepep/README.md (config options, usage, example output)
- Remove sbatch script and debug TOML configs from tracking
Restore the pre-existing ep_enabled/etp_enabled checks and SAC
registration in llama4 and deepseek_v3. Remove duplicate validation
from setup_deepep() since callers handle it.
Keep these files identical to the base branch to minimize
merge conflicts when merging upstream changes.
Restore original bare globals and code structure. Only add:
- _create_uniform_routing(): deterministic round-robin for autotune
- setup_deepep(): centralized SAC + MoE upgrade + autotune setup
Keep original inline logic in parallelize.py. The only change
to deepep.py is replacing random synthetic data with uniform
round-robin routing (_create_uniform_routing) for autotune.
Eliminate all `global` statements by grouping mutable process state
into a simple _State class. Functions remain as free functions,
just access _state.xxx instead of bare globals.
24 tests covering:
- _create_uniform_routing: shapes, dtypes, round-robin, balanced load
- _State config management: get/set/overwrite tuned configs
- _bench_fn: timing, warmup/repeat counts, exception propagation
- _detect_internode: intranode vs internode topology detection
- _get_gpu_sm_range: GPU-specific SM ranges, fallback behavior
- run_deepep_autotune_if_enabled: default config fallback paths
Run autotune at beginning of training (alongside LLEP autotune)
instead of during model parallelization. This avoids adding
autotune code to each model's parallelize.py file.
…nt double-run

- Use DeepEP's pre-tuned nvl_buffer_size per rank count (256→288→480→720)
  instead of hardcoded 256. The built-in values are optimized per topology.
- When autotune=false, use Buffer.get_dispatch/combine_config() directly
  instead of hardcoded defaults that don't match DeepEP's tuned values.
- Expand internode search space: nvl_dispatch [2,48], nvl_combine [1,16],
  rdma [4,36]. Covers all DeepEP built-in defaults so autotune converges
  to them if they're already optimal.
- Guard train.py autotune call to skip if configs already set by
  parallelize.py, preventing double-autotune.
Phase 0 searches nvl_buffer_size [128,256,288,384,480,512,560,720]
and rdma_buffer_size [64,128,256] before chunk tuning. Includes
DeepEP's built-in per-rank value in candidates so autotune can
converge to it if optimal.
Smaller buffer sizes cause CUDA illegal memory access in internode
dispatch. Filter candidates to only include values at or above
DeepEP's recommended per-rank-count value.
Replace phased sequential tuning with full Cartesian search over
(num_sms, nvl_chunk, nvl_buffer_size, rdma_chunk, rdma_buffer_size)
for both dispatch and combine independently.

Based on https://nousresearch.com/moe-scaling-field-notes/:
- num_sms range extended to 128 (2.3-2.6x speedup over 24)
- nvl_buffer_size up to 1024 (blog found optimal at 1024)
- All params searched jointly, no greedy phase decomposition
Sweeping nvl_buffer_size at runtime causes unrecoverable CUDA crashes
(illegal memory access) when values are too small. Buffer sizes are
hardware/topology-dependent and DeepEP's per-rank-count defaults are
already well-tuned. Only sweep chunk sizes and num_sms (intranode).
