[INFO|trainer.py:749] 2025-11-18 21:49:36,652 >> Using auto half precision backend
[WARNING|trainer.py:982] 2025-11-18 21:49:36,653 >> The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 151645, 'pad_token_id': 151643}.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/autodl-tmp/SGL-main/internvl/train/internvl_chat_pretrain.py", line 892, in <module>
[rank0]:     main()
[rank0]:   File "/root/autodl-tmp/SGL-main/internvl/train/internvl_chat_pretrain.py", line 877, in main
[rank0]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.12/site-packages/transformers/trainer.py", line 2325, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.12/site-packages/transformers/trainer.py", line 2375, in _inner_training_loop
[rank0]:     train_dataloader = self.get_train_dataloader()
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.12/site-packages/transformers/trainer.py", line 1140, in get_train_dataloader
[rank0]:     return self._get_dataloader(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.12/site-packages/transformers/trainer.py", line 1109, in _get_dataloader
[rank0]:     dataloader_params["sampler"] = sampler_fn(dataset)
[rank0]:                                    ^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: _get_train_sampler() takes 1 positional argument but 2 were given
[rank0]:[W1118 21:49:37.788983626 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E1118 21:49:39.131000 5322 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5342) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
internvl/train/internvl_chat_pretrain.py FAILED
Failures:
  <NO_OTHER_FAILURES>
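
The TypeError above is an API mismatch between the training code and the installed transformers, not a data problem: at transformers/trainer.py:1109 the library calls sampler_fn(dataset), i.e. it now passes the train dataset into _get_train_sampler, while the override used by internvl_chat_pretrain.py's trainer evidently still has the older zero-argument signature _get_train_sampler(self). Below is a minimal sketch of a signature-compatible override, assuming the script subclasses (or monkey-patches) transformers.Trainer; the class name and sampler body are placeholders, not InternVL's actual implementation. Downgrading transformers to the release the repo pins (one predating the _get_dataloader refactor visible in the traceback) is the other common workaround.

    from typing import Optional

    from torch.utils.data import Dataset, RandomSampler, Sampler
    from transformers import Trainer

    class CompatTrainer(Trainer):
        # Accept the dataset both ways: newer transformers passes it as an
        # argument (trainer.py:1109, sampler_fn(dataset)); older releases
        # call this method with no arguments at all.
        def _get_train_sampler(self, dataset: Optional[Dataset] = None) -> Optional[Sampler]:
            if dataset is None:
                dataset = self.train_dataset
            if dataset is None:
                return None
            # Placeholder: substitute the project's custom sampler here.
            return RandomSampler(dataset)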
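
The ProcessGroupNCCL warning is a side effect of the crash: destroy_process_group() was never reached before the process exited. Once the sampler issue is fixed, a cleanup call at the end of main() (or in a finally block) silences it; a minimal sketch:

    import torch.distributed as dist

    # Tear down the NCCL process group on each rank before the interpreter
    # exits, as the warning's linked docs recommend.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()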