
Illegal memory access during zipformer training #1764

@ngoel17

Description


I am getting the following error during Zipformer training.
Initially, I got exactly the same error with an older version of CUDA/drivers and older versions of k2 and icefall. I re-ran after upgrading everything (including PyTorch) and still got the same error. The error does not occur consistently at the same point. Any pointers would be greatly appreciated.
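Since the crash is reported asynchronously (by the NCCL watchdog, well after the faulting kernel), a common way to localize this kind of "illegal memory access" is to force synchronous kernel launches so the traceback points at the op that actually faulted. A debugging sketch, not a fix; `CUDA_LAUNCH_BLOCKING` is a standard CUDA runtime variable, and `compute-sanitizer` ships with the CUDA toolkit (the `train.py` invocation below is illustrative and would need the same arguments as the failing run):

```shell
# Debug-only: make every CUDA kernel launch synchronous so the error
# surfaces at the real call site instead of in the watchdog (much slower).
export CUDA_LAUNCH_BLOCKING=1

# Optionally, run a single-GPU repro under compute-sanitizer to get the
# exact out-of-bounds access (illustrative command; adjust arguments):
# compute-sanitizer --tool memcheck python ./zipformer/train.py --world-size 1
```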

2024-09-30 11:09:40,620 INFO [train.py:1190] (0/2) Training started
2024-09-30 11:09:40,620 INFO [train.py:1200] (0/2) Device: cuda:0
2024-09-30 11:09:40,621 INFO [train.py:1231] (0/2) Using dtype=torch.float16
2024-09-30 11:09:40,621 INFO [train.py:1232] (0/2) Use AMP=True
2024-09-30 11:09:40,621 INFO [train.py:1234] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '21302dae6cdbaa25c5b851f35329e592f5bf12d5', 'k2-git-date': 'Sat Sep 7 05:29:18 2024', 'lhotse-version': '1.28.0.dev+git.bc2c0a29.clean', 'torch-version': '2.6.0a0+gitc9653bf', 'torch-cuda-available': True, 'torch-cuda-version': '12.6', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-clean', 'icefall-git-date': 'Fri Sep 20 00:38:52 2024', 'icefall-path': '/mnt/dsk1/home/ngoel/icefall', 'k2-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/k2-1.24.4.dev20240930+cpu.torch2.6.0a0-py3.10-linux-x86_64.egg/k2/init.py', 'lhotse-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/lhotse-1.28.0.dev0+git.bc2c0a29.clean-py3.10.egg/lhotse/init.py', 'hostname': 'rahim', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 8000, 'exp_dir': PosixPath('exp/zipformer/v6'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.025, 'lr_batches': 5000.0, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 200, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': 
'4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 200, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-09-30 11:09:40,621 INFO [train.py:1236] (0/2) About to create model

....

2024-09-30 11:15:42,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=5666.666666666667, ans=0.234375
2024-09-30 11:15:43,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=5666.666666666667, ans=0.00963768115942029
2024-09-30 11:15:43,257 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.37 vs. limit=9.625
2024-09-30 11:15:44,083 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=9.625
2024-09-30 11:15:47,973 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.01 vs. limit=11.754999999999999
[F] /home/ngoel/k2/k2/csrc/eval.h:147:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor ()(torch::autograd::AutogradContext, at::Tensor, float), k2::SwooshFunctionk2::SwooshRConstants::forward, void, 1>, const float*, float, float, float, float, const float*, float*, const float*, unsigned char*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.
[rank1]:[E930 11:15:51.984368626 ProcessGroupNCCL.cpp:1598] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /mnt/dsk1/home/ngoel/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x1125d62 (0x7fb2f0f2fd62 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdaffe4 (0x7fb2f0bb9fe4 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0930 11:15:52.358000 1791209 /mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1791247 via signal SIGTERM
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1642, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
(icefall-sep-24) ngoel@rahim:~/icefall/egs/multien/ASR13$ /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 28 leaked semaphore objects to clean up at shutdown
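The check failure at `eval.h:147` points into k2's fused CUDA forward kernel for the Swoosh-R activation (`SwooshFunction`/`SwooshRConstants::forward`). For orientation, Swoosh-R as described in the Zipformer paper can be sketched in plain Python; this is a hypothetical CPU reference for sanity-checking values, not the k2 CUDA code:

```python
import math

def swoosh_r(x: float) -> float:
    """Swoosh-R per the Zipformer paper: log(1 + e^(x-1)) - 0.08*x - 0.313261687."""
    t = x - 1.0
    # Numerically stable softplus: for large t, log(1 + e^t) ~= t + log1p(e^-t).
    softplus = t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))
    return softplus - 0.08 * x - 0.313261687

# The constant 0.313261687 ~= log(1 + e^-1), chosen so that swoosh_r(0) == 0.
```

The fused k2 kernel also writes a packed uint8 buffer for the backward pass (visible in the lambda signature above), so the crash need not be in the math itself; it could equally be an out-of-bounds index on one of the kernel's input or output buffers.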
