[feat] Add DeepSeek-V4 support #147
Open
Meirtz wants to merge 1 commit into
Adds DeepseekV4Bridge, which translates HuggingFace deepseek_v4 configs to the experimental MLATransformerConfig. The DSv4 hybrid attention schedule (window / CSA / DSA), mHC hyper-connections, hash-routed MoE, MTP, and ClampedSwiGLU are all wired through HF config fields (compress_ratios, hc_mult, index_*, num_hash_layers, swiglu_limit, ...) and verified end-to-end on a toy DeepSeek-V4 config.

Includes a build-only smoke test under example/deepseek_v4/ that verifies the 17 DSv4-specific config fields are present, the model builds, and a single forward pass produces a finite loss.

Notes:
* Requires a Megatron-LM commit that ships the dsv4_hybrid attention variant (CSA, DSA, mHC, hash routing, ClampedSwiGLU). ImportError is suppressed in mbridge/models/__init__.py so users on a main-only MCore can still import mbridge.
* dsa_indexer_loss_coeff defaults to 0.0 because MLATransformerConfig defaults the field to None (which would raise a TypeError inside csa.py); the bridge provides a safe float default. Production training paths must override it via set_extra_args (e.g. 0.01 for the DSA loss to contribute).
* FP8/MXFP4 weight import is not implemented in this bridge; for real DeepSeek-V4-Flash checkpoint import / export use the NeMo Megatron-Bridge DSv4 path. This PR covers the model-build and random-init forward path that downstream RL frameworks need.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
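The `dsa_indexer_loss_coeff` note above can be illustrated with a minimal sketch. The helper names `resolve_loss_coeff` and `weighted_indexer_loss` are hypothetical and not part of mbridge or Megatron-LM, and the multiplication is only an assumption about how a `None` coefficient could end up raising a `TypeError`; the actual code in `csa.py` is not shown in this PR.

```python
from typing import Optional


def resolve_loss_coeff(value: Optional[float], default: float = 0.0) -> float:
    """Hypothetical helper: substitute a float default when the field is None."""
    return default if value is None else float(value)


def weighted_indexer_loss(indexer_loss: float, coeff: Optional[float]) -> float:
    """Assumed usage: scale an auxiliary loss by the coefficient.

    If coeff is None, Python raises TypeError on the multiplication.
    """
    return coeff * indexer_loss


# With the float default the auxiliary term is simply zero.
safe = weighted_indexer_loss(0.5, resolve_loss_coeff(None))

# An explicit override (e.g. 0.01) makes the term contribute.
active = weighted_indexer_loss(0.5, resolve_loss_coeff(0.01))
```

A coefficient of 0.0 keeps the build and forward path working, but the scaled term then adds nothing to the total loss, which is why the note asks production training runs to override it.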
Adds `DeepseekV4Bridge` for the DeepSeek-V4 model family. Translates HF `deepseek_v4` configs to `MLATransformerConfig` and wires up the hybrid attention schedule (window / CSA / DSA via `compress_ratios`), mHC hyper-connections, hash-routed MoE (`num_hash_layers`, `sqrtsoftplus` scoring), MTP, and ClampedSwiGLU.

Requirements
* A Megatron-LM commit that ships the `dsv4_hybrid` attention variant (CSA, DSA, mHC, hash routing, ClampedSwiGLU). `mbridge/models/__init__.py` suppresses `ImportError` so a main-only MCore install still works.
* `fast_hadamard_transform` for the DSA indexer at `compress_ratio>0` layers. Install it from the Dao-AILab git repo (the PyPI sdist is incomplete).

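The `ImportError` suppression mentioned above follows a common optional-dependency pattern. The sketch below is a generic illustration, not the code in `mbridge/models/__init__.py`; the helper name `optional_import` is hypothetical, and the missing module name is a placeholder chosen so the import fails.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import a module if it is available; return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package (or a missing submodule of an older install) lands here.
        return None


# A module that exists is returned as usual.
json_module = optional_import("json")

# A placeholder name that does not exist yields None rather than an exception.
missing_module = optional_import("hypothetical_missing_module_for_dsv4_sketch")
```

The trade-off of this pattern is that a missing backend only surfaces when the DSv4 bridge is actually requested, so a caller may want to check for `None` and raise a clear error at that point.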
Notes
* `dsa_indexer_loss_coeff` defaults to `0.0` in the bridge because the upstream `MLATransformerConfig` field defaults to `None`, which raises a `TypeError` inside `csa.py`. Production training paths must override it to a non-zero value (e.g. `0.01`) so the indexer's parameters get gradients.
* FP8/MXFP4 weight import is not implemented in this bridge; for real `DeepSeek-V4-Flash` checkpoint conversion use NeMo Megatron-Bridge.

Verification
Toy DSv4 config (4 layers, `compress_ratios=[0,4,128,4]`, 1 MTP layer, 8 routed experts) on GB200, bf16 random init:
* Build smoke (`example/deepseek_v4/test_dsv4_build.py`, in this PR): 17 DSv4 config fields present, model builds (69M params), forward pass `loss=0.000144` finite.
* verl forward-only smoke (script kept out of mbridge, since verl is a downstream consumer): the bridge resolves through `verl.models.mcore.mbridge.AutoBridge`, `bridge.get_model()` returns the DSv4 model, forward `loss=0.000144` finite.
* verl full training step (script kept out of mbridge): forward + backward + `optimizer.step()` via `verl.utils.megatron.optimizer.get_megatron_optimizer` (bf16 Adam), full architecture including the DSA indexer at the `compress_ratio=128` layer. Result: `loss=0.000170`, `grad_norm=0.135386`, `update_successful=True`.

The verl-side integration scripts and a small `transformer_impl.py` fix (mirror of verl PR #6473 for the `vanilla_mbridge=True` path) will be submitted as a separate PR to `verl-project/verl`.

AI-assisted; author has reviewed every changed line.
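The two cheapest checks in the build smoke described under Verification (required config fields present, loss finite) can be sketched in plain Python. This is not the content of `test_dsv4_build.py`: the helper names are hypothetical, only four of the 17 fields are listed because the PR text does not enumerate the rest, and every value in `toy_config` other than `compress_ratios` is a placeholder.

```python
import math
from typing import Iterable, List, Mapping


def missing_fields(config: Mapping[str, object], required: Iterable[str]) -> List[str]:
    """Return the required field names that are absent from the config."""
    return [name for name in required if name not in config]


def loss_is_finite(loss: float) -> bool:
    """True when the loss is neither NaN nor +/- infinity."""
    return math.isfinite(loss)


# Subset of DSv4 field names that appear in this PR's description.
REQUIRED_FIELDS = ["compress_ratios", "hc_mult", "num_hash_layers", "swiglu_limit"]

toy_config = {
    "compress_ratios": [0, 4, 128, 4],  # from the toy config in this PR
    "hc_mult": 1,            # placeholder value
    "num_hash_layers": 1,    # placeholder value
    "swiglu_limit": 1.0,     # placeholder value
}

absent = missing_fields(toy_config, REQUIRED_FIELDS)
finite = loss_is_finite(0.000144)  # loss value reported in this PR
```

Returning the list of missing names, rather than a bare boolean, lets a failing smoke run report exactly which config fields were not translated.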