[WIP] Initial DeepSeek reference implementation#861

Merged
ShriyaRishab merged 53 commits into mlcommons:master from denys-fridman:dfridman/deepseek-reference-implementation
Feb 27, 2026

Conversation

@denys-fridman
Contributor

No description provided.

@denys-fridman denys-fridman requested a review from a team as a code owner January 14, 2026 11:21
@github-actions

github-actions Bot commented Jan 14, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@denys-fridman denys-fridman force-pushed the dfridman/deepseek-reference-implementation branch from 7e03fe1 to 18410e3 on January 14, 2026 14:16
@denys-fridman denys-fridman force-pushed the dfridman/deepseek-reference-implementation branch from 44a6723 to 38e318c on January 21, 2026 09:50
export GBS=1024
# Dataloader: Micro batch size
export MBS=1
export MAX_LR="2e-4"

Update to the final tuned hyperparameters (HPs)?

grad_accumulation_steps = mini_batch_size // args.mbs
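The quoted line computes gradient-accumulation steps from the mini-batch and micro-batch sizes. The relationship between those quantities can be sketched as follows (a minimal illustration; names such as `data_parallel_size` are assumptions for the sketch, not taken from the PR):

```python
# Sketch: how gradient-accumulation steps relate to the batch-size knobs.
# GBS (global batch size) is split across data-parallel replicas; each replica
# processes MBS samples per forward pass and accumulates gradients until its
# per-replica (mini) batch is consumed.

def grad_accumulation_steps(global_batch_size: int, micro_batch_size: int,
                            data_parallel_size: int) -> int:
    """Number of micro-batches accumulated before each optimizer step."""
    mini_batch_size = global_batch_size // data_parallel_size  # per-replica batch
    assert mini_batch_size % micro_batch_size == 0, "MBS must divide the mini batch"
    return mini_batch_size // micro_batch_size

# Example: GBS=1024, MBS=1 on 256 data-parallel replicas -> 4 accumulation steps.
print(grad_accumulation_steps(1024, 1, 256))  # -> 4
```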

logging_configs = {
mllogger.constants.SEED: args.seed,

We need more mllog events, e.g.:

self.mllogger.event(
    key=constants.SUBMISSION_BENCHMARK,
    value=self.submission_info["submission_benchmark"],
)

Detailed list: https://github.com/mlcommons/training/blob/master/llama2_70b_lora/scripts/mlperf_logging_utils.py
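For context, MLPerf compliance logs are line-oriented records with a fixed prefix. The sketch below is a stand-in for the real `mlperf_logging.mllog` helper, not the library itself: it only illustrates the general shape of an emitted event line, and the exact field set of the real logger may differ.

```python
# Minimal stand-in for an mllog-style event emitter (illustrative only; the
# real implementation lives in the mlperf_logging package).
import json
import time

def log_event(key: str, value) -> str:
    """Emit one MLLOG-style event line and return it."""
    record = {
        "key": key,
        "value": value,
        "time_ms": int(time.time() * 1000),   # wall-clock timestamp in ms
        "event_type": "POINT_IN_TIME",        # a single event, not a start/end span
    }
    line = ":::MLLOG " + json.dumps(record)
    print(line)
    return line

# e.g. the submission-benchmark event discussed in this thread:
log_event("submission_benchmark", "deepseek_v3")
```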

Contributor Author


Added the missing events (all of them, I think). SUBMISSION_BENCHMARK is logged above via mllogger.mlperf_submission_log(bmark).

Comment thread moe_pretraining/nemo/requirements.txt Outdated
@@ -0,0 +1,16 @@
git+https://github.com/denys-fridman/logging.git@dfridman/deepseek-v3 # TODO(dfridman): revert to main repo once merged

I think we need this reverted before merging. There is another TODO in the PR.

Contributor Author


Right. I'll update this after mlcommons/logging#445 is merged.

Comment thread moe_pretraining/nemo/Dockerfile Outdated
pip install -e .

## 2. Megatron-bridge and megatron-core
ARG MBRIDGE_REVISION=main
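A pinned variant of this build argument (as is already done for NEMORUN_REVISION elsewhere in the Dockerfile) might look like the fragment below; the SHA is a placeholder, not an actual Megatron-Bridge revision:

```dockerfile
## 2. Megatron-bridge and megatron-core
# Pin to a fixed commit rather than a moving branch so image builds are
# reproducible. The SHA below is a placeholder, not a real revision.
ARG MBRIDGE_REVISION=<commit-sha>
```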


Can we pin this like NEMORUN_REVISION?

Contributor Author


Done.

Comment thread moe_pretraining/nemo/README.md Outdated

#### Run model conversion

Assuming that we have downloaded the HuggingFace checkpoint to a `<SRC_PATH>` directory, the checkpoint must be converted to Megatron-Bridge format before training. After conversion is done, set `MODEL_CKPT=<DST_PATH>` when launching the job.
Contributor


Add a section on the expected typical runtime, and share the reference hardware used along with the number of nodes.

Comment thread moe_pretraining/nemo/README.md Outdated
@@ -0,0 +1,606 @@
# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
Contributor


Move everything to the llm_moe_pretraining folder.

Updated instructions for using the repository and downloading checkpoints.
Updated README to clarify GBS requirements and evaluation process.
Contributor

@ShriyaRishab ShriyaRishab left a comment


LGTM

@ShriyaRishab ShriyaRishab merged commit f0e0607 into mlcommons:master Feb 27, 2026
1 check passed
@github-actions github-actions Bot locked and limited conversation to collaborators Feb 27, 2026
