[feat] Add DeepSeek-V4 support by Meirtz · Pull Request #147 · ISEEKYAN/mbridge

Meirtz · 2026-05-27T17:57:28Z

Adds DeepseekV4Bridge for the DeepSeek-V4 model family.

Translates HF deepseek_v4 configs to MLATransformerConfig and wires
up the hybrid attention schedule (window / CSA / DSA via
compress_ratios), mHC hyper-connections, hash-routed MoE
(num_hash_layers, sqrtsoftplus scoring), MTP, and ClampedSwiGLU.
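
As a rough illustration of how a per-layer compress_ratios list can drive the schedule, the sketch below splits layers by whether their ratio is zero. It relies only on the statement under Requirements that the DSA indexer runs at compress_ratio>0 layers. The helper name `layers_needing_indexer` is hypothetical and is not part of mbridge or Megatron-LM, and the sketch does not try to decide which non-zero ratio selects CSA versus DSA.

```python
from typing import List


def layers_needing_indexer(compress_ratios: List[int]) -> List[int]:
    """Return the indices of layers whose compress_ratio is greater than zero.

    Hypothetical helper for illustration only; the real bridge may derive
    the schedule differently.
    """
    if any(r < 0 for r in compress_ratios):
        raise ValueError("compress_ratios must be non-negative")
    return [i for i, r in enumerate(compress_ratios) if r > 0]


# Toy schedule taken from the Verification section of this PR.
toy_ratios = [0, 4, 128, 4]
print(layers_needing_indexer(toy_ratios))  # [1, 2, 3]
```

In the toy schedule, layer 0 has ratio 0 and every later layer has a non-zero ratio.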

Requirements

* A Megatron-LM commit that ships the dsv4_hybrid attention variant
  (CSA, DSA, mHC, hash routing, ClampedSwiGLU). mbridge/models/__init__.py
  suppresses ImportError so that mbridge still imports on a main-only
  MCore install.
* fast_hadamard_transform for the DSA indexer at compress_ratio>0
  layers. Install it from the Dao-AILab git repo (the PyPI sdist is incomplete).
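
The ImportError suppression described above can be sketched with the standard library alone. This is a minimal illustration of the pattern, not the actual contents of mbridge/models/__init__.py; the helper name `optional_import` and the flag `HAS_FAST_HADAMARD` are hypothetical.

```python
import contextlib
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import a module, returning None instead of raising ImportError."""
    with contextlib.suppress(ImportError):
        return importlib.import_module(module_name)
    return None


# Probe for the optional dependency without importing it.
# find_spec returns None when a top-level module cannot be located.
HAS_FAST_HADAMARD = importlib.util.find_spec("fast_hadamard_transform") is not None

print(optional_import("json") is not None)  # True: stdlib module imports fine
print(optional_import("no_such_module_for_this_sketch"))  # None
```

A package that registers optional model bridges this way stays importable when one of them depends on code that the installed Megatron-LM does not ship.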

Notes

* dsa_indexer_loss_coeff defaults to 0.0 in the bridge because the
  upstream MLATransformerConfig field defaults to None, which
  raises a TypeError inside csa.py. Production training paths must
  override it with a non-zero value (e.g. 0.01) so the indexer's
  parameters get gradients.
* FP8 / MXFP4 weight import is not implemented; for real
  DeepSeek-V4-Flash checkpoint conversion use NeMo Megatron-Bridge.
* TP > 1 is asserted upstream in MCore.
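
The first note can be illustrated with a toy stand-in. The sketch below assumes the coefficient is used in a "coeff * loss" style expression; the actual expression inside csa.py is not shown in this PR, and `ToyConfig` and `weighted_indexer_loss` are hypothetical names rather than Megatron-LM APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    # Stand-in for the single field discussed above; not the real MLATransformerConfig.
    dsa_indexer_loss_coeff: Optional[float] = None


def weighted_indexer_loss(cfg: ToyConfig, indexer_loss: float) -> float:
    # Assumed usage pattern: scale the auxiliary loss by the coefficient.
    return cfg.dsa_indexer_loss_coeff * indexer_loss


# A None coefficient cannot be multiplied with a float.
try:
    weighted_indexer_loss(ToyConfig(), 0.5)
except TypeError as exc:
    print("None coefficient fails:", type(exc).__name__)

# The bridge default of 0.0 is safe but silences the auxiliary term.
print(weighted_indexer_loss(ToyConfig(dsa_indexer_loss_coeff=0.0), 0.5))  # 0.0

# A non-zero override lets the term contribute.
print(weighted_indexer_loss(ToyConfig(dsa_indexer_loss_coeff=0.01), 0.5))  # 0.005
```

Under this assumption, a coefficient of 0.0 avoids the crash but contributes nothing to the loss, which is why the note asks production runs to override it.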

Verification

Toy DSv4 config (4 layers, compress_ratios=[0,4,128,4], 1 MTP layer,
8 routed experts) on GB200, bf16 random init:

* Build smoke (example/deepseek_v4/test_dsv4_build.py, in this PR):
  17 DSv4 config fields present, model builds (69M params), forward
  pass loss=0.000144 (finite).
* verl forward-only smoke (script kept out of mbridge because verl is a
  downstream consumer): the bridge resolves through
  verl.models.mcore.mbridge.AutoBridge, bridge.get_model() returns
  the DSv4 model, forward loss=0.000144 (finite).
* verl full training step (script kept out of mbridge): forward +
  backward + optimizer.step() via
  verl.utils.megatron.optimizer.get_megatron_optimizer (bf16 Adam),
  full architecture including the DSA indexer at the
  compress_ratio=128 layer. Result: loss=0.000170,
  grad_norm=0.135386, update_successful=True.
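
A smoke check of this kind usually reduces to confirming that the reported scalars are finite. The sketch below shows one way to express that; `step_looks_healthy` is a hypothetical helper, and how the actual scripts compute update_successful is not described in this PR.

```python
import math


def step_looks_healthy(loss: float, grad_norm: float) -> bool:
    """Return True when both scalars are finite and the gradient norm is positive."""
    return math.isfinite(loss) and math.isfinite(grad_norm) and grad_norm > 0.0


# Values reported for the verl full training step above.
print(step_looks_healthy(0.000170, 0.135386))  # True

# A NaN loss fails the check.
print(step_looks_healthy(float("nan"), 0.135386))  # False
```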

The verl-side integration scripts and a small transformer_impl.py
fix (mirror of verl PR #6473 for the vanilla_mbridge=True path) will
be submitted as a separate PR to verl-project/verl.

AI-assisted; author has reviewed every changed line.

Adds DeepseekV4Bridge translating HuggingFace deepseek_v4 configs to the experimental MLATransformerConfig. The DSv4 hybrid attention schedule (window / CSA / DSA), mHC hyper-connections, hash-routed MoE, MTP, and ClampedSwiGLU are all wired through HF config fields (compress_ratios, hc_mult, index_*, num_hash_layers, swiglu_limit, ...) and verified on a toy DeepSeek-V4 config end-to-end.

Includes a build-only smoke under example/deepseek_v4/ that verifies the 17 DSv4-specific config fields are present, the model builds, and a single forward pass produces a finite loss.

Notes:

* Requires a Megatron-LM commit that ships the dsv4_hybrid attention variant (CSA, DSA, mHC, hash routing, ClampedSwiGLU). ImportError is suppressed in mbridge/models/__init__.py so users on a main-only MCore can still import mbridge.
* dsa_indexer_loss_coeff defaults to 0.0 because MLATransformerConfig defaults the field to None (which would TypeError inside csa.py); the bridge provides a safe float default. Production training paths must override via set_extra_args (e.g. 0.01 for DSA loss to contribute).
* FP8/MXFP4 weight import is not implemented in this bridge; for real DeepSeek-V4-Flash checkpoint import / export use the NeMo Megatron-Bridge DSv4 path. This PR covers the model-build and random-init forward path that downstream RL frameworks need.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>

Meirtz force-pushed the feat/deepseek-v4 branch from d3712b4 to 3c89b5f on May 27, 2026 18:29

Meirtz wants to merge 1 commit into ISEEKYAN:main from Meirtz:feat/deepseek-v4
