
feat: autotune deepep #62

Draft
xrsrke wants to merge 17 commits into dev-updated-again from phuc/multinode_deepep_autotune

Conversation

@xrsrke

@xrsrke xrsrke commented Mar 19, 2026

No description provided.

xrsrke added 17 commits March 18, 2026 14:32
Grid-searches DeepEP kernel parameters (num_sms, nvl_chunk, rdma_chunk) at
training startup by benchmarking buffer.dispatch()/buffer.combine() with
synthetic data. Stores optimal Config objects as globals passed to every
dispatch/combine call during training.

Key changes:
- job_config.py: Add autotune, autotune_warmup, autotune_repeat,
  autotune_verbose, num_sms, nvl_buffer_size, rdma_buffer_size fields
- deepep.py: Tuned config globals, config= kwargs on all dispatch/combine
  calls (forward + backward), full autotune grid search with internode
  support (joint sms tuning, RDMA chunk search)
- parallelize.py: Wire autotune into qwen3, upgrade MoE to DeepEPMoE
- TOML configs for 1-node (EP=8) and 2-node (EP=16) debug runs
- sbatch script with NVSHMEM env vars for internode RDMA

Internode safety: restrict the sms sweep to a single validated value to prevent
DeepEP dispatch timeouts that fatally corrupt CUDA state.

Tested: 1-node (8x B200) and 2-node (16x B200) with decreasing loss.
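The grid search described above can be sketched as a plain Cartesian sweep that keeps the fastest configuration. This is a minimal illustration, not the PR's implementation: `bench` stands in for whatever callable times a real `buffer.dispatch()`/`buffer.combine()` run under the candidate parameters.

```python
import itertools

def grid_search(bench, num_sms_candidates, nvl_chunk_candidates, rdma_chunk_candidates):
    """Return the (num_sms, nvl_chunk, rdma_chunk) tuple with the lowest
    benchmarked latency. `bench` is assumed to execute the kernel under the
    candidate config and return a time in milliseconds."""
    best_cfg, best_ms = None, float("inf")
    for cfg in itertools.product(
        num_sms_candidates, nvl_chunk_candidates, rdma_chunk_candidates
    ):
        ms = bench(*cfg)
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```

The winning tuple would then be wrapped in a DeepEP `Config` object and stored for every subsequent dispatch/combine call.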
- Replace 6 bare module globals with DeepEPState singleton class
  (_buffer, _handle_cache, _handle_counter, _pending_combine_event,
  _tuned_dispatch_config, _tuned_combine_config -> _state)
- Add _create_uniform_routing() for deterministic round-robin routing
  in autotune benchmarks, replacing random scores + torch.topk
- Add setup_deepep() to centralize SAC registration, MoE->DeepEPMoE
  upgrade, and autotune into a single call
- Simplify qwen3 parallelize.py from a 35-line block to a one-liner
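The deterministic round-robin routing mentioned above might look like the following sketch (the exact signature of `_create_uniform_routing` is an assumption): token i's k-th expert is `(i * topk + k) % num_experts`, so every expert receives an equal token load and repeated autotune runs see identical traffic.

```python
import torch

def create_uniform_routing(num_tokens, num_experts, topk):
    """Deterministic round-robin routing for autotune benchmarks: no RNG,
    perfectly balanced expert load when num_tokens * topk is a multiple
    of num_experts."""
    flat = torch.arange(num_tokens * topk, dtype=torch.int64) % num_experts
    topk_idx = flat.view(num_tokens, topk)
    # Uniform weights so combine reduces to a simple average.
    topk_weights = torch.full((num_tokens, topk), 1.0 / topk, dtype=torch.float32)
    return topk_idx, topk_weights
```

Compared with random scores plus `torch.topk`, this removes run-to-run variance from the benchmark signal.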
…ebug configs

- Apply setup_deepep() to llama4 and deepseek_v3 parallelize.py
  (replaces 22-line DeepEP blocks with a single call)
- Add EP/ETP validation checks to setup_deepep()
- Document autotune in deepep/README.md (config options, usage, example output)
- Remove sbatch script and debug TOML configs from tracking
Restore the pre-existing ep_enabled/etp_enabled checks and SAC
registration in llama4 and deepseek_v3. Remove duplicate validation
from setup_deepep() since callers handle it.
Keep these files identical to the base branch to minimize
merge conflicts when merging upstream changes.
Restore original bare globals and code structure. Only add:
- _create_uniform_routing(): deterministic round-robin for autotune
- setup_deepep(): centralized SAC + MoE upgrade + autotune setup
Keep original inline logic in parallelize.py. The only change
to deepep.py is replacing random synthetic data with uniform
round-robin routing (_create_uniform_routing) for autotune.
Eliminate all `global` statements by grouping mutable process state
into a simple _State class. Functions remain as free functions,
just access _state.xxx instead of bare globals.
24 tests covering:
- _create_uniform_routing: shapes, dtypes, round-robin, balanced load
- _State config management: get/set/overwrite tuned configs
- _bench_fn: timing, warmup/repeat counts, exception propagation
- _detect_internode: intranode vs internode topology detection
- _get_gpu_sm_range: GPU-specific SM ranges, fallback behavior
- run_deepep_autotune_if_enabled: default config fallback paths
Run autotune at beginning of training (alongside LLEP autotune)
instead of during model parallelization. This avoids adding
autotune code to each model's parallelize.py file.
…nt double-run

- Use DeepEP's pre-tuned nvl_buffer_size per rank count (256→288→480→720)
  instead of hardcoded 256. The built-in values are optimized per topology.
- When autotune=false, use Buffer.get_dispatch/combine_config() directly
  instead of hardcoded defaults that don't match DeepEP's tuned values.
- Expand internode search space: nvl_dispatch [2,48], nvl_combine [1,16],
  rdma [4,36]. Covers all DeepEP built-in defaults so autotune converges
  to them if they're already optimal.
- Guard train.py autotune call to skip if configs already set by
  parallelize.py, preventing double-autotune.
Phase 0 searches nvl_buffer_size [128,256,288,384,480,512,560,720]
and rdma_buffer_size [64,128,256] before chunk tuning. Includes
DeepEP's built-in per-rank value in candidates so autotune can
converge to it if optimal.
Smaller buffer sizes cause CUDA illegal memory access in internode
dispatch. Filter candidates to only include values at or above
DeepEP's recommended per-rank-count value.
Replace phased sequential tuning with full Cartesian search over
(num_sms, nvl_chunk, nvl_buffer_size, rdma_chunk, rdma_buffer_size)
for both dispatch and combine independently.

Based on https://nousresearch.com/moe-scaling-field-notes/:
- num_sms range extended to 128 (2.3-2.6x speedup over 24)
- nvl_buffer_size up to 1024 (blog found optimal at 1024)
- All params searched jointly, no greedy phase decomposition
Sweeping nvl_buffer_size at runtime causes unrecoverable CUDA crashes
(illegal memory access) when values are too small. Buffer sizes are
hardware/topology-dependent and DeepEP's per-rank-count defaults are
already well-tuned. Only sweep chunk sizes and num_sms (intranode).
