number of parameters on (tensor, pipeline) model parallel rank (0, 0): 41583674112
sharded_state_dict metadata loaded from the checkpoint: {'distrib_optim_sharding_type': 'fully_sharded_model_space'}
loading distributed checkpoint from /ssd4/nietianyu/workspace/ms-swift/model/Qwen3-Next-80B-A3B-Instruct-mcore at iteration 1
/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py:915: FutureWarning: load_state_dict is deprecated and will be removed in future versions. Please use load instead.
checkpoint.load_state_dict(
/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/checkpoint/planner_helpers.py:316: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
device = getattr(value, "device", None)
/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/checkpoint/default_planner.py:362: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
and md.size != obj.size()
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:21<00:00, 2851.07it/s]
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:22<00:00, 2813.90it/s]
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:22<00:00, 2800.74it/s]
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:22<00:00, 2790.16it/s]
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:22<00:00, 2749.19it/s]
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:22<00:00, 2775.16it/s]
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:22<00:00, 2725.45it/s]
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████| 62320/62320 [00:24<00:00, 2524.19it/s]
could not find arguments in the checkpoint ...
checkpoint version 3.0
WARNING:megatron.core.rerun_state_machine:RerunStateMachine disabled via CLI, ignoring machine state saved in checkpoint
successfully loaded checkpoint from /ssd4/nietianyu/workspace/ms-swift/model/Qwen3-Next-80B-A3B-Instruct-mcore [ t 1/1, p 1/1 ] at iteration 0
(min, max) time across ranks (ms):
load-checkpoint ................................: (77218.99, 77219.23)
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-10-31 20:17:08
building train, validation, and test datasets ...
[after dataloaders are built] datetime: 2025-10-31 20:17:08
done with setup ...
training ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (96759.80, 96769.29)
train/valid/test-data-iterators-setup ..........: (0.78, 0.84)
Setting rerun_state_machine.current_iteration to 0...
[before the start of training step] datetime: 2025-10-31 20:17:08
[INFO:swift] The training of Epoch 0 starts...
WARNING:DotProductAttention:flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 3ba6f82 && git submodule update --init && cd hopper/ && python setup.py install
(3) python_path=python -c "import site; print(site.getsitepackages()[0])"
(4) mkdir -p $python_path/flash_attn_3
(5) cp flash_attn_interface.py $python_path/flash_attn_3/flash_attn_interface.py
[INFO:swift] images_dir: /ssd4/nietianyu/workspace/ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct/v27-20251031-201257/images
[rank0]: Traceback (most recent call last):
[rank0]: File "/ssd4/nietianyu/workspace/ms-swift/swift/cli/_megatron/sft.py", line 5, in <module>
[rank0]: megatron_sft_main()
[rank0]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/train/sft.py", line 79, in megatron_sft_main
[rank0]: return MegatronSft(args).main()
[rank0]: File "/ssd4/nietianyu/workspace/ms-swift/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/train/sft.py", line 69, in run
[rank0]: self.trainer.train(train_dataset, val_dataset, data_collator)
[rank0]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/trainers/base.py", line 774, in train
[rank0]: pretrain(
[rank0]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 864, in pretrain
[rank0]: iteration, num_floating_point_operations_so_far = train(
[rank0]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 2279, in train
[rank0]: ) = train_step(
[rank0]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/trainers/base.py", line 327, in train_step
[rank0]: return self._origin_train_step(forward_step_func, new_data_iterator, model, optimizer, opt_param_scheduler,
[rank0]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1422, in train_step
[rank0]: update_successful, grad_norm, num_zeros_in_grad = optimizer.step()
[rank0]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/optimizer/optimizer.py", line 1213, in step
[rank0]: grad_norm = self.get_grad_norm()
[rank0]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/optimizer/optimizer.py", line 1176, in get_grad_norm
[rank0]: grad_norm = get_grad_norm_fp32(
[rank0]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/optimizer/clip_grads.py", line 130, in get_grad_norm_fp32
[rank0]: torch.distributed.all_reduce(
[rank0]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2806, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Failed to CUDA calloc async 136 bytes
[rank0]:[W1031 20:18:24.506717314 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1031 20:18:27.319586 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164470 closing signal SIGTERM
W1031 20:18:27.323579 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164471 closing signal SIGTERM
W1031 20:18:27.328172 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164472 closing signal SIGTERM
W1031 20:18:27.333950 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164473 closing signal SIGTERM
W1031 20:18:27.337446 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164474 closing signal SIGTERM
W1031 20:18:27.355330 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164475 closing signal SIGTERM
W1031 20:18:27.358980 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 164476 closing signal SIGTERM
E1031 20:18:32.038630 164386 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 164469) of binary: /ssd4/nietianyu/.conda/envs/ms_swift/bin/python3.10
Traceback (most recent call last):
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/ssd4/nietianyu/workspace/ms-swift/swift/cli/_megatron/sft.py FAILED
Describe the bug
LoRA fine-tuning of Qwen3-Next-80B-A3B-Instruct on 8 H20 GPUs fails with an OOM error.
Command:
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --load /model/Qwen3-Next-80B-A3B-Instruct-mcore \
    --dataset 'high_quality_data_0101_1022_265w_converted.json' \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 2 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 1 \
    --global_batch_size 8 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 3 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save /megatron_output/Qwen3-Next-80B-A3B-Instruct \
    --save_interval 82976 \
    --max_length 512 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash \
    --model_author swift \
    --model_name swift-robot

Error:
`[INFO:swift] [rank0] model_parameter_info: PeftModelForCausalLM: 41583.6741M Params (563.9885M Trainable [1.3563%]), 0.0000M Buffers.
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-10-31_20:18:27
host : gajl-ime-h20-hpn001-0002.gajl.baidu.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 164469)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================`
Your hardware and system info
absl-py 2.3.1 accelerate 1.11.0 addict 2.4.0 aiofiles 24.1.0 aiohappyeyeballs 2.6.1 aiohttp 3.13.1 aiosignal 1.4.0 aliyun-python-sdk-core 2.16.0 aliyun-python-sdk-kms 2.16.5 annotated-doc 0.0.3 annotated-types 0.7.0 anyio 4.11.0 anykeystore 0.2 apex 0.1 async-timeout 5.0.1 attrdict 2.0.1 attrs 25.4.0 binpacking 1.5.2 Brotli 1.1.0 certifi 2025.10.5 cffi 2.0.0 charset-normalizer 3.4.4 click 8.3.0 cmake 4.1.2 contourpy 1.3.2 cpm-kernels 1.0.11 crcmod 1.7 cryptacular 1.6.2 cryptography 46.0.3 cycler 0.12.1 dacite 1.9.2 datasets 3.6.0 defusedxml 0.7.1 dill 0.3.8 distro 1.9.0 einops 0.8.1 exceptiongroup 1.3.0 fastapi 0.120.1 ffmpy 0.6.4 filelock 3.20.0 flash_attn 2.8.1 fonttools 4.60.1 frozenlist 1.8.0 fsspec 2025.10.0 future 1.0.0 gradio 5.49.1 gradio_client 1.13.3 greenlet 3.2.4 groovy 0.1.2 grpcio 1.76.0 h11 0.16.0 hf-xet 1.2.0 httpcore 1.0.9 httpx 0.28.1 huggingface-hub 0.36.0 hupper 1.12.1 idna 3.11 importlib_metadata 8.7.0 jieba 0.42.1 Jinja2 3.1.6 jiter 0.11.1 jmespath 0.10.0 joblib 1.5.2 json_repair 0.52.3 kiwisolver 1.4.9 Markdown 3.9 markdown-it-py 4.0.0 MarkupSafe 3.0.3 matplotlib 3.10.7 mdurl 0.1.2 megatron-core 0.13.2 ml_dtypes 0.5.3 modelscope 1.31.0 mpmath 1.3.0 ms_swift 3.10.0.dev0 msgspec 0.19.0 multidict 6.7.0 multiprocess 0.70.16 networkx 3.4.2 ninja 1.13.0 nltk 3.9.2 numpy 1.26.4 nvidia-cublas-cu12 12.6.4.1 nvidia-cuda-cupti-cu12 12.6.80 nvidia-cuda-nvrtc-cu12 12.6.77 nvidia-cuda-runtime-cu12 12.6.77 nvidia-cudnn-cu12 9.5.1.17 nvidia-cufft-cu12 11.3.0.4 nvidia-cufile-cu12 1.13.1.3 nvidia-curand-cu12 10.3.7.77 nvidia-cusolver-cu12 11.7.1.2 nvidia-cusparse-cu12 12.5.4.2 nvidia-cusparselt-cu12 0.6.3 nvidia-nccl-cu12 2.21.5 nvidia-nvjitlink-cu12 12.6.85 nvidia-nvshmem-cu12 3.3.20 nvidia-nvtx-cu12 12.6.77 oauthlib 3.3.1 onnx 1.19.1 onnx-ir 0.1.12 onnxscript 0.5.4 openai 2.6.1 orjson 3.11.4 oss2 2.19.1 packaging 25.0 pandas 2.3.3 PasteDeploy 3.1.0 pbkdf2 1.3 peft 0.17.1 pillow 11.3.0 pip 25.3 plaster 1.1.2 plaster-pastedeploy 1.0.1 propcache 0.4.1 protobuf 
6.33.0 psutil 7.1.2 pyarrow 20.0.0 pybind11 3.0.1 pycparser 2.23 pycryptodome 3.23.0 pydantic 2.11.10 pydantic_core 2.33.2 pydub 0.25.1 Pygments 2.19.2 pyparsing 3.2.5 pyramid 2.0.2 pyramid-mailer 0.15.1 python-dateutil 2.9.0.post0 python-multipart 0.0.20 python3-openid 3.2.0 pytz 2025.2 PyYAML 6.0.3 regex 2025.10.23 repoze.sendmail 4.4.1 requests 2.32.5 requests-oauthlib 2.0.0 rich 14.2.0 rouge 1.0.1 ruff 0.14.2 safehttpx 0.1.7 safetensors 0.6.2 scipy 1.15.3 semantic-version 2.10.0 sentencepiece 0.2.0 setuptools 78.1.1 shellingham 1.5.4 simplejson 3.20.2 six 1.17.0 sniffio 1.3.1 some-package 0.1 sortedcontainers 2.4.0 SQLAlchemy 2.0.44 starlette 0.48.0 sympy 1.13.1 tensorboard 2.20.0 tensorboard-data-server 0.7.2 tiktoken 0.11.0 tokenizers 0.22.1 tomlkit 0.13.3 torch 2.6.0+cu126 torchaudio 2.6.0+cu124 torchvision 0.21.0+cu126 tqdm 4.67.1 transaction 5.0 transformer_engine 2.8.0 transformer_engine_cu12 2.8.0 transformer_engine_torch 2.8.0 transformers 4.57.1 transformers-stream-generator 0.0.5 translationstring 1.4 triton 3.1.0 trl 0.23.1 typer 0.20.0 typing_extensions 4.15.0 typing-inspection 0.4.2 tzdata 2025.2 urllib3 2.5.0 uvicorn 0.38.0 velruse 1.1.1 venusian 3.1.1 WebOb 1.8.9 websockets 15.0.1 Werkzeug 3.1.3 wheel 0.45.1 WTForms 3.2.1 wtforms-recaptcha 0.3.2 xxhash 3.6.0 yarl 1.22.0 zipp 3.23.0 zope.deprecation 6.0 zope.interface 8.0.1 zope.sqlalchemy 4.0 zstandard 0.25.0
Additional context
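According to the official PR (https://github.com/modelscope/ms-swift/pull/5764), this model can be trained on 8*60G. I have 8*96G and have also reduced the batch size. Earlier runs failed with OOM; now the error is "Failed to CUDA calloc async 136 bytes", which still looks like the GPUs are running out of memory. LoRA on the HF-format checkpoint does run, but it is far too slow to be usable, which is why I am trying to train from the converted MCore checkpoint instead.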
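As a rough sanity check on the memory suspicion, here is a back-of-the-envelope estimate built only from numbers in the log and command above. It is a sketch under stated assumptions, not a measurement: I assume the frozen weights are held in bf16 (2 bytes per parameter; the log does not state the dtype), that the 41,583,674,112 parameters reported for rank (0, 0) all live on one GPU, that "96G" means 96 GiB, and that the data-parallel size is 8 (8 GPUs with `t 1/1, p 1/1`). The helper names are hypothetical, not part of ms-swift or Megatron-LM.

```python
# Back-of-the-envelope per-GPU memory estimate from the log above.
# Every constant marked "assumption" is NOT stated in the log.

GIB = 1024 ** 3

PARAMS_ON_RANK = 41_583_674_112  # log: "number of parameters on ... rank (0, 0)"
BYTES_PER_PARAM = 2              # assumption: bf16 weights
GPU_MEMORY_GIB = 96              # "8*96G", assumption: read as GiB


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


def num_microbatches(global_batch_size: int,
                     micro_batch_size: int,
                     data_parallel_size: int) -> int:
    """Microbatches each data-parallel rank runs per optimizer step."""
    return global_batch_size // (micro_batch_size * data_parallel_size)


weights_gib = weight_memory_gib(PARAMS_ON_RANK, BYTES_PER_PARAM)
headroom_gib = GPU_MEMORY_GIB - weights_gib

# --global_batch_size 8, --micro_batch_size 1, assumed data-parallel size 8
microbatches = num_microbatches(8, 1, 8)

print(f"weights per GPU : {weights_gib:.2f} GiB")
print(f"headroom per GPU: {headroom_gib:.2f} GiB")
print(f"microbatches per step: {microbatches}")
```

Under these assumptions the weights alone take roughly 77 GiB per GPU, leaving under 20 GiB for activations, LoRA gradients and optimizer state, NCCL buffers, and allocator overhead. With one microbatch of size 1 per step, the batch size cannot be lowered any further.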