Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions docs/source/BestPractices/deepseek-v4.md
Original file line number Diff line number Diff line change
Expand Up @@ -214,3 +214,41 @@ swift infer \
推理结果:

![result](../../resources/deepseek_v4/infer_result.png)

跑通vLLM推理:

- 如果要使用vllm推理,你可以参考[这里的文档](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash)。你需要FP4/FP8精度的权重。
- 此外你需要copy原始的'config.json'文件,并修改'expert_dtype'(与训练后的config.json一致)。因为,使用transformers的`config.save_pretrained`保存的文件与原始文件不同,vllm不兼容保存后的文件。
Comment on lines +220 to +221

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency and better readability, please use vLLM instead of vllm, translate copy to 复制, and use backticks for file names and configuration keys (e.g., `config.json` and `expert_dtype`).

Suggested change
- 如果要使用vllm推理,你可以参考[这里的文档](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash)。你需要FP4/FP8精度的权重。
- 此外你需要copy原始的'config.json'文件,并修改'expert_dtype'(与训练后的config.json一致)。因为,使用transformers的`config.save_pretrained`保存的文件与原始文件不同,vllm不兼容保存后的文件
- 如果要使用vLLM推理,你可以参考[这里的文档](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash)。你需要FP4/FP8精度的权重。
- 此外你需要复制原始的`config.json`文件,并修改`expert_dtype`(与训练后的`config.json`一致)。因为,使用transformers的`config.save_pretrained`保存的文件与原始文件不同,vLLM不兼容保存后的文件

- 如果遇到tilelang问题,可以查看[这个issue](https://github.com/modelscope/ms-swift/issues/9494)。
- mcore-bridge DeepSeek-V4 Fp8修复:[PR](https://github.com/modelscope/mcore-bridge/pull/133)。

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use uppercase FP8 instead of Fp8 for consistency with other parts of the document.

Suggested change
- mcore-bridge DeepSeek-V4 Fp8修复[PR](https://github.com/modelscope/mcore-bridge/pull/133)
- mcore-bridge DeepSeek-V4 FP8修复[PR](https://github.com/modelscope/mcore-bridge/pull/133)


这里先做量化(这里的量化会导致LoRA增量信息丢失,这里只作为例子,建议使用FP8全参数训练并导出FP8权重):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
megatron export \
--model megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged \
--output_dir megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \
--to_hf true \
--fp8_recipe blockwise \
--fp8_format e4m3 \
--fp8_param_gather true \
--mtp_num_layers 1 \
--expert_model_parallel_size 8
```

vLLM启动命令:
```shell
vllm serve megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4
```
38 changes: 38 additions & 0 deletions docs/source_en/BestPractices/deepseek-v4.md
Original file line number Diff line number Diff line change
Expand Up @@ -214,3 +214,41 @@ swift infer \
Inference result:

![result](../../resources/deepseek_v4/infer_result.png)

Running vLLM inference:

- If you want to use vLLM for inference, you can refer to [this documentation](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash). You need FP4/FP8 precision weights.
- Additionally, you need to copy the original 'config.json' file and modify 'expert_dtype' (consistent with the config.json after training). This is because the file saved by transformers' `config.save_pretrained` differs from the original file, and vLLM is not compatible with the saved file.
Comment on lines +220 to +221

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency and better markdown formatting, please use backticks for file names and configuration keys (e.g., `config.json` and `expert_dtype`).

Suggested change
- If you want to use vLLM for inference, you can refer to [this documentation](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash). You need FP4/FP8 precision weights.
- Additionally, you need to copy the original 'config.json' file and modify 'expert_dtype' (consistent with the config.json after training). This is because the file saved by transformers' `config.save_pretrained` differs from the original file, and vLLM is not compatible with the saved file.
- If you want to use vLLM for inference, you can refer to [this documentation](https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash). You need FP4/FP8 precision weights.
- Additionally, you need to copy the original `config.json` file and modify `expert_dtype` (consistent with the `config.json` after training). This is because the file saved by transformers' `config.save_pretrained` differs from the original file, and vLLM is not compatible with the saved file.

- If you encounter tilelang issues, you can check [this issue](https://github.com/modelscope/ms-swift/issues/9494).
- mcore-bridge DeepSeek-V4 FP8 fix: [PR](https://github.com/modelscope/mcore-bridge/pull/133).

First perform quantization (note: this quantization will cause LoRA incremental information loss; this is only an example. It is recommended to use FP8 full-parameter training and export FP8 weights):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
megatron export \
--model megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged \
--output_dir megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \
--to_hf true \
--fp8_recipe blockwise \
--fp8_format e4m3 \
--fp8_param_gather true \
--mtp_num_layers 1 \
--expert_model_parallel_size 8
```

vLLM launch command:
```shell
vllm serve megatron_output/DeepSeek-V4-Flash/vx-xxx/checkpoint-xxx-merged-FP8 \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4
```
Loading