I am getting the following error during Zipformer training.
I initially hit exactly the same error with an older version of CUDA/drivers and older versions of k2 and icefall. I re-ran after upgrading everything (including PyTorch) and still got the same error. It does not happen consistently at the same point in training. Any pointers would be greatly appreciated.
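In case it helps anyone reproduce or localize this, below is a minimal sketch of what I would run next to force synchronous CUDA error reporting, so the Python traceback points at the op that actually faulted rather than at a later CUDA call. The checked_forward helper and its arguments are hypothetical, not icefall's real training loop; only CUDA_LAUNCH_BLOCKING and torch.cuda.synchronize() are standard.

    # Minimal sketch: make CUDA errors surface at the failing op.
    # Assumption: the env var must be set before CUDA is initialized.
    import os

    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # kernel launches become synchronous

    import torch  # imported after setting the env var, before any CUDA work

    def checked_forward(model, features):
        """Hypothetical wrapper: run one forward pass, then force pending
        kernels to complete, so an illegal access raises a Python
        RuntimeError here instead of killing the NCCL watchdog thread."""
        out = model(features)
        torch.cuda.synchronize()
        return out

Here is the full training log: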
2024-09-30 11:09:40,620 INFO [train.py:1190] (0/2) Training started
2024-09-30 11:09:40,620 INFO [train.py:1200] (0/2) Device: cuda:0
2024-09-30 11:09:40,621 INFO [train.py:1231] (0/2) Using dtype=torch.float16
2024-09-30 11:09:40,621 INFO [train.py:1232] (0/2) Use AMP=True
2024-09-30 11:09:40,621 INFO [train.py:1234] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '21302dae6cdbaa25c5b851f35329e592f5bf12d5', 'k2-git-date': 'Sat Sep 7 05:29:18 2024', 'lhotse-version': '1.28.0.dev+git.bc2c0a29.clean', 'torch-version': '2.6.0a0+gitc9653bf', 'torch-cuda-available': True, 'torch-cuda-version': '12.6', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-clean', 'icefall-git-date': 'Fri Sep 20 00:38:52 2024', 'icefall-path': '/mnt/dsk1/home/ngoel/icefall', 'k2-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/k2-1.24.4.dev20240930+cpu.torch2.6.0a0-py3.10-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/lhotse-1.28.0.dev0+git.bc2c0a29.clean-py3.10.egg/lhotse/__init__.py', 'hostname': 'rahim', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 8000, 'exp_dir': PosixPath('exp/zipformer/v6'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.025, 'lr_batches': 5000.0, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 200, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 200, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-09-30 11:09:40,621 INFO [train.py:1236] (0/2) About to create model
....
2024-09-30 11:15:42,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=5666.666666666667, ans=0.234375
2024-09-30 11:15:43,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=5666.666666666667, ans=0.00963768115942029
2024-09-30 11:15:43,257 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.37 vs. limit=9.625
2024-09-30 11:15:44,083 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=9.625
2024-09-30 11:15:47,973 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.01 vs. limit=11.754999999999999
[F] /home/ngoel/k2/k2/csrc/eval.h:147:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(torch::autograd::AutogradContext*, at::Tensor, float), k2::SwooshFunction<k2::SwooshRConstants>::forward, void, 1>, const float*, float, float, float, float, const float*, float*, const float*, unsigned char*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.
[rank1]:[E930 11:15:51.984368626 ProcessGroupNCCL.cpp:1598] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /mnt/dsk1/home/ngoel/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x1125d62 (0x7fb2f0f2fd62 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdaffe4 (0x7fb2f0bb9fe4 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
W0930 11:15:52.358000 1791209 /mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1791247 via signal SIGTERM
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1642, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
(icefall-sep-24) ngoel@rahim:~/icefall/egs/multien/ASR13$ /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 28 leaked semaphore objects to clean up at shutdown
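Since the crash is nondeterministic and shows up both in a k2 kernel (the SwooshR forward above) and in the NCCL watchdog, my next step would be an isolation run: a single GPU takes DDP/NCCL out of the picture, and fp32 tests whether AMP/GradScaler is involved. A sketch, using the recipe's standard flags with the values from my config dump above (whether an fp32 single-GPU run fits in memory at this --max-duration is an assumption):

    # Sketch of a single-GPU, fp32 debug run of the same recipe.
    import os
    import subprocess

    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"   # synchronous launches for exact stack traces
    env["CUDA_VISIBLE_DEVICES"] = "0"   # one visible GPU: no NCCL/DDP involved

    subprocess.run(
        [
            "python", "./zipformer/train.py",
            "--world-size", "1",      # no process group, no watchdog thread
            "--use-fp16", "false",    # fp32: checks whether AMP is implicated
            "--max-duration", "200",
        ],
        env=env,
        check=True,
    )

If the single-GPU fp32 run still hits the illegal access, that would point at the kernel or the data rather than at NCCL or mixed precision.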