[feat] Add DeepSeek-V4 support #147

Open
Meirtz wants to merge 1 commit into ISEEKYAN:main from Meirtz:feat/deepseek-v4


Meirtz commented May 27, 2026

Adds DeepseekV4Bridge for the DeepSeek-V4 model family.

Translates HF deepseek_v4 configs to MLATransformerConfig and wires
up the hybrid attention schedule (window / CSA / DSA via
compress_ratios), mHC hyper-connections, hash-routed MoE
(num_hash_layers, sqrtsoftplus scoring), MTP, and ClampedSwiGLU.
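
As a rough illustration of what the "sqrtsoftplus" scoring name suggests, the sketch below computes sqrt(softplus(logit)) for each expert and normalizes the top-k scores. The reading of the name, the helper names, and the top-k normalization step are all assumptions made for illustration; none of this is taken from the bridge or from MCore.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sqrtsoftplus_scores(logits):
    # Assumed meaning of "sqrtsoftplus": sqrt(softplus(logit)) per expert.
    return [math.sqrt(softplus(z)) for z in logits]


def top_k_normalized(logits, k):
    # Hypothetical routing step: keep the k best experts and
    # renormalize their scores so the weights sum to 1.
    scores = sqrtsoftplus_scores(logits)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


weights = top_k_normalized([0.0, 2.0, -1.0, 1.0], k=2)
print(weights)
```

Because softplus is strictly positive, the square root is always defined and every expert gets a non-zero score before top-k selection.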

Requirements

  • A Megatron-LM commit that ships the dsv4_hybrid attention variant
    (CSA, DSA, mHC, hash routing, ClampedSwiGLU). mbridge/models/__init__.py
    suppresses ImportError so a main-only MCore install still works.
  • fast_hadamard_transform for the DSA indexer at compress_ratio>0
    layers — install from the Dao-AILab git repo (PyPI sdist is incomplete).
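
The optional-import behaviour described in the first requirement can be sketched with the standard library alone. The module names below are placeholders chosen for the demo, not the names used in mbridge/models/__init__.py, and the actual file may be structured differently.

```python
import contextlib
import importlib

# Placeholder module names: one that exists everywhere and one that
# stands in for a bridge whose MCore dependency is not installed.
OPTIONAL_MODULES = ("json", "module_that_is_not_installed_dsv4_demo")

loaded = {}
for name in OPTIONAL_MODULES:
    # suppress(ImportError) skips a module whose import fails, so the
    # package as a whole stays importable.
    with contextlib.suppress(ImportError):
        loaded[name] = importlib.import_module(name)

available = sorted(loaded)
print(available)
```

The trade-off of this pattern is that a missing dependency fails silently at import time and only surfaces later, when the unregistered bridge is requested.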

Notes

  • dsa_indexer_loss_coeff defaults to 0.0 in the bridge because the
    upstream MLATransformerConfig field defaults to None, which
    raises a TypeError inside csa.py. Production training paths must
    override it with a non-zero value (e.g. 0.01) so the indexer's
    parameters get gradients.
  • FP8 / MXFP4 weight import is not implemented; for real
    DeepSeek-V4-Flash checkpoint conversion use NeMo Megatron-Bridge.
  • TP > 1 is asserted upstream in MCore.
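
The dsa_indexer_loss_coeff note can be illustrated with a toy stand-in. ToyConfig, scaled_indexer_loss, and build_config are hypothetical names; the sketch only shows one way a None coefficient can raise a TypeError and how a float default with an explicit override avoids it. It does not reproduce the code in csa.py or the bridge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    # Stand-in for the upstream field; None mirrors the described default.
    dsa_indexer_loss_coeff: Optional[float] = None


def scaled_indexer_loss(cfg: ToyConfig, raw_loss: float) -> float:
    # Multiplying None by a float raises TypeError.
    return cfg.dsa_indexer_loss_coeff * raw_loss


def build_config(**overrides) -> ToyConfig:
    # Bridge-style safe default that callers can still override.
    kwargs = {"dsa_indexer_loss_coeff": 0.0}
    kwargs.update(overrides)
    return ToyConfig(**kwargs)


try:
    scaled_indexer_loss(ToyConfig(), 2.0)
    none_failed = False
except TypeError:
    none_failed = True

safe = scaled_indexer_loss(build_config(), 2.0)
trained = scaled_indexer_loss(build_config(dsa_indexer_loss_coeff=0.01), 2.0)
print(none_failed, safe, trained)
```

With the 0.0 default the indexer loss term vanishes, which is why a training run needs the explicit non-zero override.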

Verification

Toy DSv4 config (4 layers, compress_ratios=[0,4,128,4], 1 MTP layer,
8 routed experts) on GB200, bf16 random init:

  1. Build smoke (example/deepseek_v4/test_dsv4_build.py, in this PR):
    all 17 DSv4 config fields are present, the model builds (69M params),
    and a forward pass yields a finite loss (loss=0.000144).

  2. verl forward-only smoke (script kept out of mbridge — verl is a
    downstream consumer): the bridge resolves through
    verl.models.mcore.mbridge.AutoBridge, bridge.get_model() returns
    the DSv4 model, forward loss=0.000144 finite.

  3. verl full training step (script kept out of mbridge): forward +
    backward + optimizer.step() via verl.utils.megatron.optimizer.get_megatron_optimizer
    (bf16 Adam), full architecture including the DSA indexer at the
    compress_ratio=128 layer. Result: loss=0.000170,
    grad_norm=0.135386, update_successful=True.
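
For orientation, the toy setup and the "finite loss" check used in the steps above might look like the sketch below. Only compress_ratios and its values come from this PR; the other key names are assumed HF-style names, and the rule that compress_ratios has one entry per layer is an assumption that happens to match the toy config.

```python
import math

# Toy config from the description; key names other than
# compress_ratios are assumptions for illustration.
toy_config = {
    "num_hidden_layers": 4,
    "compress_ratios": [0, 4, 128, 4],
    "num_nextn_predict_layers": 1,  # assumed key for the single MTP layer
    "n_routed_experts": 8,          # assumed key for routed experts
}


def check_toy_config(cfg: dict) -> None:
    # Assumed invariant: one compress ratio per transformer layer.
    if len(cfg["compress_ratios"]) != cfg["num_hidden_layers"]:
        raise ValueError("compress_ratios must have one entry per layer")
    if any(r < 0 for r in cfg["compress_ratios"]):
        raise ValueError("compress ratios must be non-negative")


def is_finite_loss(loss: float) -> bool:
    # The smoke tests only require that the loss is neither NaN nor inf.
    return math.isfinite(loss)


check_toy_config(toy_config)
print(is_finite_loss(0.000144), is_finite_loss(float("nan")))
```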

The verl-side integration scripts and a small transformer_impl.py
fix (mirror of verl PR #6473 for the vanilla_mbridge=True path) will
be submitted as a separate PR to verl-project/verl.

AI-assisted; author has reviewed every changed line.

Adds DeepseekV4Bridge translating HuggingFace deepseek_v4 configs to the
experimental MLATransformerConfig. The DSv4 hybrid attention schedule
(window / CSA / DSA), mHC hyper-connections, hash-routed MoE, MTP, and
ClampedSwiGLU are all wired through HF config fields (compress_ratios,
hc_mult, index_*, num_hash_layers, swiglu_limit, ...) and verified on a
toy DeepSeek-V4 config end-to-end.

Includes a build-only smoke under example/deepseek_v4/ that verifies the
17 DSv4-specific config fields are present, the model builds, and a
single forward pass produces a finite loss.

Notes:
* Requires a Megatron-LM commit that ships the dsv4_hybrid attention
  variant (CSA, DSA, mHC, hash routing, ClampedSwiGLU). ImportError is
  suppressed in mbridge/models/__init__.py so users on a main-only MCore
  can still import mbridge.
* dsa_indexer_loss_coeff defaults to 0.0 because MLATransformerConfig
  defaults the field to None (which would raise a TypeError inside
  csa.py); the bridge provides a safe float default. Production training
  paths must override it via set_extra_args (e.g. 0.01) so that the DSA
  loss contributes.
* FP8/MXFP4 weight import is not implemented in this bridge; for real
  DeepSeek-V4-Flash checkpoint import / export use the NeMo
  Megatron-Bridge DSv4 path. This PR covers the model-build and
  random-init forward path that downstream RL frameworks need.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
Meirtz force-pushed the feat/deepseek-v4 branch from d3712b4 to 3c89b5f on May 27, 2026 18:29