Zipformer training crash: 'cannot set number of interop threads ...' #1395

@iggygeek

Training a zipformer with a recent icefall/k2 install results in a crash:

2023-11-29 13:02:22,614 INFO [train.py:1138] About to create model
2023-11-29 13:02:22,996 INFO [train.py:1142] Number of model parameters: 65549011
terminate called after throwing an instance of 'c10::Error'
what(): Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x154f981a5617 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x154f98160a56 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x1826cbf (0x154f59d7acbf in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x70c26a (0x154f7028526a in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #4: python3() [0x52422b]

frame #7: python3() [0x5c82ce]

My env:
'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'd12eec7521aaa26f49ca0c11c94ea42879a8e71d', 'k2-git-date': 'Mon Oct 23 11:54:42 2023', 'lhotse-version': '1.17.0.dev+git.3c0574f.clean', 'torch-version': '2.1.0+cu121', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'ae67f75-clean', 'icefall-git-date': 'Sun Nov 26 03:04:15 2023', 'icefall-path': '/home/user/git_projects/icefall1', 'k2-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/k2/init.py', 'lhotse-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/lhotse/init.py', 'hostname': 'gpu3', 'IP address': '127.0.1.1'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'trnmanifest': PosixPath('data/fbank/cuts_trn.jsonl.gz'), 'devmanifest': PosixPath('data/fbank/cuts_dev.jsonl.gz'), 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 
'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 500, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
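
The error message above says the inter-op thread count cannot be set once parallel work has started or once set_num_interop_threads has already been called. A minimal sketch of one defensive pattern follows, assuming the duplicate call comes from Python code you control. The helper name set_interop_threads_once is hypothetical and is not part of icefall, k2, or PyTorch; it only makes sure a setter runs at most once per process.

```python
import threading
from typing import Callable

# Module-level state: remembers whether the setter has already run in this process.
_lock = threading.Lock()
_already_set = False


def set_interop_threads_once(setter: Callable[[int], None], num_threads: int) -> bool:
    """Call `setter(num_threads)` at most once per process.

    Returns True if the setter was invoked by this call, and False if an
    earlier call through this helper already invoked it.
    """
    global _already_set
    with _lock:
        if _already_set:
            # A second call would hit the "set_num_interop_threads called" check.
            return False
        setter(num_threads)
        _already_set = True
        return True


# Hypothetical usage, assuming PyTorch is installed and this runs at the very
# start of the script, before any model or data loader is created:
#   import torch
#   set_interop_threads_once(torch.set_num_interop_threads, 1)
```

This guard only covers repeated calls routed through the helper. It cannot help if parallel work has already started, or if another library sets the value directly, so it is a way to narrow down the cause rather than a confirmed fix for this crash.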
