
Illegal memory access during zipformer training #1764

@ngoel17

Description


I am getting the following error during Zipformer training.
Initially, I got exactly the same error with an older version of CUDA/drivers and older versions of k2 and icefall. I re-ran after upgrading everything (including PyTorch) and still got the same error. The error does not occur consistently at the same point. Any pointers would be greatly appreciated.
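Since the crash is reported asynchronously (by the NCCL watchdog, well after the faulting kernel), a common way to localize this kind of "illegal memory access" is to force synchronous kernel launches so the traceback points at the op that actually faulted. A debugging sketch, not a fix; `CUDA_LAUNCH_BLOCKING` is a standard CUDA runtime variable, and `compute-sanitizer` ships with the CUDA toolkit (the `train.py` invocation below is illustrative and would need the same arguments as the failing run):

```shell
# Debug-only: make every CUDA kernel launch synchronous so the error
# surfaces at the real call site instead of in the watchdog (much slower).
export CUDA_LAUNCH_BLOCKING=1

# Optionally, run a single-GPU repro under compute-sanitizer to get the
# exact out-of-bounds access (illustrative command; adjust arguments):
# compute-sanitizer --tool memcheck python ./zipformer/train.py --world-size 1
```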

2024-09-30 11:09:40,620 INFO [train.py:1190] (0/2) Training started
2024-09-30 11:09:40,620 INFO [train.py:1200] (0/2) Device: cuda:0
2024-09-30 11:09:40,621 INFO [train.py:1231] (0/2) Using dtype=torch.float16
2024-09-30 11:09:40,621 INFO [train.py:1232] (0/2) Use AMP=True
2024-09-30 11:09:40,621 INFO [train.py:1234] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '21302dae6cdbaa25c5b851f35329e592f5bf12d5', 'k2-git-date': 'Sat Sep 7 05:29:18 2024', 'lhotse-version': '1.28.0.dev+git.bc2c0a29.clean', 'torch-version': '2.6.0a0+gitc9653bf', 'torch-cuda-available': True, 'torch-cuda-version': '12.6', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-clean', 'icefall-git-date': 'Fri Sep 20 00:38:52 2024', 'icefall-path': '/mnt/dsk1/home/ngoel/icefall', 'k2-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/k2-1.24.4.dev20240930+cpu.torch2.6.0a0-py3.10-linux-x86_64.egg/k2/init.py', 'lhotse-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/lhotse-1.28.0.dev0+git.bc2c0a29.clean-py3.10.egg/lhotse/init.py', 'hostname': 'rahim', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 8000, 'exp_dir': PosixPath('exp/zipformer/v6'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.025, 'lr_batches': 5000.0, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 200, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': 
'4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 200, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-09-30 11:09:40,621 INFO [train.py:1236] (0/2) About to create model

....

2024-09-30 11:15:42,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=5666.666666666667, ans=0.234375
2024-09-30 11:15:43,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=5666.666666666667, ans=0.00963768115942029
2024-09-30 11:15:43,257 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.37 vs. limit=9.625
2024-09-30 11:15:44,083 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=9.625
2024-09-30 11:15:47,973 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.01 vs. limit=11.754999999999999
[F] /home/ngoel/k2/k2/csrc/eval.h:147:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor ()(torch::autograd::AutogradContext, at::Tensor, float), k2::SwooshFunctionk2::SwooshRConstants::forward, void, 1>, const float*, float, float, float, float, const float*, float*, const float*, unsigned char*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.
[rank1]:[E930 11:15:51.984368626 ProcessGroupNCCL.cpp:1598] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /mnt/dsk1/home/ngoel/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x1125d62 (0x7fb2f0f2fd62 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdaffe4 (0x7fb2f0bb9fe4 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0930 11:15:52.358000 1791209 /mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1791247 via signal SIGTERM
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1642, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
(icefall-sep-24) ngoel@rahim:~/icefall/egs/multien/ASR13$ /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 28 leaked semaphore objects to clean up at shutdown
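The check failure at `eval.h:147` points into k2's fused CUDA forward kernel for the Swoosh-R activation (`SwooshFunction`/`SwooshRConstants::forward`). For orientation, Swoosh-R as described in the Zipformer paper can be sketched in plain Python; this is a hypothetical CPU reference for sanity-checking values, not the k2 CUDA code:

```python
import math

def swoosh_r(x: float) -> float:
    """Swoosh-R per the Zipformer paper: log(1 + e^(x-1)) - 0.08*x - 0.313261687."""
    t = x - 1.0
    # Numerically stable softplus: for large t, log(1 + e^t) ~= t + log1p(e^-t).
    softplus = t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))
    return softplus - 0.08 * x - 0.313261687

# The constant 0.313261687 ~= log(1 + e^-1), chosen so that swoosh_r(0) == 0.
```

The fused k2 kernel also writes a packed uint8 buffer for the backward pass (visible in the lambda signature above), so the crash need not be in the math itself; it could equally be an out-of-bounds index on one of the kernel's input or output buffers.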
