-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[docs] update deepseek_v4 vllm docs #9597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Jintao-Huang
merged 7 commits into
modelscope:main
from
Jintao-Huang:update_deepseek_v4_vllm_docs
Jun 24, 2026
+76
−0
Merged
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
718bf68
update deepseek_v
Jintao-Huang eba5306
update
Jintao-Huang 38946ba
update
Jintao-Huang 92bc0cb
Merge branch 'main' into update_deepseek_v4_vllm_docs
Jintao-Huang 4239a81
Merge branch 'main' into update_deepseek_v4_vllm_docs
Jintao-Huang 0eaa5dd
update
Jintao-Huang cef2008
update
Jintao-Huang File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -214,3 +214,41 @@ swift infer \ | |
| 推理结果: | ||
|
|
||
|  | ||
|
|
||
| 跑通vLLM推理: | ||
|
|
||
| - 如果要使用vllm推理,你可以参考[这里的文档](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash)。你需要FP4/FP8精度的权重。 | ||
| - 此外你需要copy原始的'config.json'文件,并修改'expert_dtype'(与训练后的config.json一致)。因为,使用transformers的`config.save_pretrained`保存的文件与原始文件不同,vllm不兼容保存后的文件。 | ||
| - 如果遇到tilelang问题,可以查看[这个issue](https://github.com/modelscope/ms-swift/issues/9494)。 | ||
| - mcore-bridge DeepSeek-V4 Fp8修复:[PR](https://github.com/modelscope/mcore-bridge/pull/133)。 | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
|
|
||
| 这里先做量化(这里的量化会导致LoRA增量信息丢失,这里只作为例子,建议使用FP8全参数训练并导出FP8权重): | ||
|
|
||
| ```shell | ||
| CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ | ||
| NPROC_PER_NODE=8 \ | ||
| megatron export \ | ||
| --model megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged \ | ||
| --output_dir megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \ | ||
| --to_hf true \ | ||
| --fp8_recipe blockwise \ | ||
| --fp8_format e4m3 \ | ||
| --fp8_param_gather true \ | ||
| --mtp_num_layers 1 \ | ||
| --expert_model_parallel_size 8 | ||
| ``` | ||
|
|
||
| vLLM启动命令: | ||
| ```shell | ||
| vllm serve megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \ | ||
| --trust-remote-code \ | ||
| --kv-cache-dtype fp8 \ | ||
| --block-size 256 \ | ||
| --enable-expert-parallel \ | ||
| --tensor-parallel-size 8 \ | ||
| --max-model-len 8192 \ | ||
| --tokenizer-mode deepseek_v4 \ | ||
| --tool-call-parser deepseek_v4 \ | ||
| --enable-auto-tool-choice \ | ||
| --reasoning-parser deepseek_v4 | ||
| ``` | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -214,3 +214,41 @@ swift infer \ | |||||||||
| Inference result: | ||||||||||
|
|
||||||||||
|  | ||||||||||
|
|
||||||||||
| Running vLLM inference: | ||||||||||
|
|
||||||||||
| - If you want to use vLLM for inference, you can refer to [this documentation](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash). You need FP4/FP8 precision weights. | ||||||||||
| - Additionally, you need to copy the original 'config.json' file and modify 'expert_dtype' (consistent with the config.json after training). This is because the file saved by transformers' `config.save_pretrained` differs from the original file, and vLLM is not compatible with the saved file. | ||||||||||
|
Comment on lines
+220
to
+221
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For consistency and better markdown formatting, please use backticks for file names and configuration keys (e.g.,
Suggested change
|
||||||||||
| - If you encounter tilelang issues, you can check [this issue](https://github.com/modelscope/ms-swift/issues/9494). | ||||||||||
| - mcore-bridge DeepSeek-V4 FP8 fix: [PR](https://github.com/modelscope/mcore-bridge/pull/133). | ||||||||||
|
|
||||||||||
| First perform quantization (note: this quantization will cause LoRA incremental information loss; this is only an example. It is recommended to use FP8 full-parameter training and export FP8 weights): | ||||||||||
|
|
||||||||||
| ```shell | ||||||||||
| CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ | ||||||||||
| NPROC_PER_NODE=8 \ | ||||||||||
| megatron export \ | ||||||||||
| --model megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged \ | ||||||||||
| --output_dir megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \ | ||||||||||
| --to_hf true \ | ||||||||||
| --fp8_recipe blockwise \ | ||||||||||
| --fp8_format e4m3 \ | ||||||||||
| --fp8_param_gather true \ | ||||||||||
| --mtp_num_layers 1 \ | ||||||||||
| --expert_model_parallel_size 8 | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| vLLM launch command: | ||||||||||
| ```shell | ||||||||||
| vllm serve megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \ | ||||||||||
| --trust-remote-code \ | ||||||||||
| --kv-cache-dtype fp8 \ | ||||||||||
| --block-size 256 \ | ||||||||||
| --enable-expert-parallel \ | ||||||||||
| --tensor-parallel-size 8 \ | ||||||||||
| --max-model-len 8192 \ | ||||||||||
| --tokenizer-mode deepseek_v4 \ | ||||||||||
| --tool-call-parser deepseek_v4 \ | ||||||||||
| --enable-auto-tool-choice \ | ||||||||||
| --reasoning-parser deepseek_v4 | ||||||||||
| ``` | ||||||||||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency and better readability, please use
vLLMinstead ofvllm, translatecopyto复制, and use backticks for file names and configuration keys (e.g.,`config.json`and`expert_dtype`).