RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED/subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u #5

xbsdsongnan · 2021-01-22T08:34:10Z

Traceback (most recent call last):
File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in
model.cuda(args.local_rank)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in
main()
File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '526264', '--num_epochs', '8', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '8', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '6268', '--padding_tgt', '6268', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '15000', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.

danpovey · 2021-01-22T12:28:03Z

I've seen reports that this error occurs fairly commonly, and seemingly at random, when using LSTMs with PyTorch, particularly with some Anaconda distributions. I've also seen reports that it can actually mask an out-of-memory condition. Regardless, I doubt it is repeatable.


cweng6 · 2021-01-22T12:44:03Z

Thanks, Dan.

@xbsdsongnan I believe this is most likely related to GPU OOM. Could you try lowering the 'TU_limit' value to reduce GPU memory usage? BTW, you might need to adjust some of your options, such as '--padding_tgt' and '--num_batches_per_epoch', instead of using the default values.

xbsdsongnan · 2021-01-25T08:47:24Z

@cweng6 @danpovey Can PyTorch 1.1.0 and CUDA 10.0 work together normally?

cweng6 · 2021-01-25T09:25:03Z

I believe so. There are stable wheels available for installation here: https://download.pytorch.org/whl/torch_stable.html

xbsdsongnan · 2021-01-26T01:58:11Z

@cweng6
Traceback (most recent call last):
File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in
model.cuda(args.local_rank)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in
main()
File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '10', '--num_epochs', '2', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '2', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '1', '--padding_tgt', '1', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '1', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.

xbsdsongnan · 2021-01-26T02:05:24Z

@cweng6
I've adjusted a lot of parameters; the command above is just one of the combinations I tried. No matter how I modify the parameters, I can't get past this error. Do you have the configuration parameter settings for the basic demo?

cweng6 · 2021-01-26T02:13:18Z

We could run with the config in the release example. Setting TU_limit to 1 will not load any utterances for training. Anyway, could you describe your environment: Python/PyTorch/CUDA versions, number and spec of GPUs, etc.?

csukuangfj · 2021-01-26T02:18:49Z

The output of the following command should be helpful for describing the environment.

$ python3 -m torch.utils.collect_env

xbsdsongnan · 2021-01-26T02:41:35Z

@cweng6 @csukuangfj
Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
[conda] torchvision 0.3.0

xbsdsongnan · 2021-01-26T02:44:36Z

python3.6 cuda==10.0 torch==1.1.0 gpu==1

cweng6 · 2021-01-26T02:51:13Z

Thanks, Fangjun.

@xbsdsongnan, it looks like the version of CUDA used to build PyTorch (9.0.176) doesn't match the one used at runtime (10.0.130).

Also, I am not sure the example script can run with one GPU. We will release an example using one GPU later on.
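The build/runtime mismatch can also be spotted mechanically from the `collect_env` report. The following is only a minimal sketch: the helper names `cuda_versions` and `same_major_minor` are hypothetical, the report text is abbreviated from the output posted earlier in this thread, and comparing just the major.minor part is an assumption about what counts as a match.

```python
import re


def cuda_versions(report):
    """Return (build, runtime) CUDA versions found in collect_env output."""
    build = re.search(r"^CUDA used to build PyTorch:\s*(\S+)", report, re.MULTILINE)
    runtime = re.search(r"^CUDA runtime version:\s*(\S+)", report, re.MULTILINE)
    return (build.group(1) if build else None,
            runtime.group(1) if runtime else None)


def same_major_minor(a, b):
    """Compare only the major.minor part, e.g. '9.0' for '9.0.176'."""
    return a.split(".")[:2] == b.split(".")[:2]


# Abbreviated from the report posted in this thread.
report = """\
PyTorch version: 1.1.0
CUDA used to build PyTorch: 9.0.176
CUDA runtime version: 10.0.130
"""

build, runtime = cuda_versions(report)
if build and runtime and not same_major_minor(build, runtime):
    print(f"Mismatch: PyTorch was built with CUDA {build}, "
          f"but the reported runtime version is {runtime}")
```

In practice the report text would come from saving the output of `python3 -m torch.utils.collect_env` to a file and reading it in.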

xbsdsongnan · 2021-01-26T02:55:48Z

@cweng6
Thanks, Wengchao.
I'm learning from you.

xbsdsongnan · 2021-01-26T02:58:43Z

@cweng6
I have eight GPUs on my server, but I really want to run on one GPU.

xbsdsongnan · 2021-01-27T06:00:15Z

Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.14.3
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
[conda] torchvision 0.3.0

xbsdsongnan · 2021-01-28T01:11:35Z

@cweng6
FileNotFoundError: [Errno 2] No such file or directory: '/home/pika/egs/arks/train.0.2.mrk.0'

cweng6 · 2021-01-28T01:16:01Z

Can you locate the needed mrk file? If not, there must be something wrong with the data preparation step.

xbsdsongnan · 2021-01-28T01:54:52Z

label.txt:
BAC009S0764W0121 中国实现民族复兴
wav.scp:
BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav

xbsdsongnan · 2021-01-28T01:56:40Z

@cweng6
This is my data preparation sample. Is there a problem with it?

xbsdsongnan · 2021-01-28T06:41:27Z

@cweng6

Why can't I run the demo you released on four GPUs? Which parameters of your demo need to be modified? What versions does your environment use?

cweng6 · 2021-01-28T06:55:18Z

label.txt:
BAC009S0764W0121 中国实现民族复兴
wav.scp:
BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav

Your label.txt doesn't look right. Check our project README:

label.txt: label text file. The format is "uttid sequence-of-integers", where each integer is a one-based index of the mapped label; note that zero is reserved for blank. E.g., utt_id_1 3 5 7 10 23

You will need to map each character in the transcription to an integer when preparing label.txt.
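The character-to-integer mapping described above can be sketched as follows. This is only an illustration: `build_vocab` and `to_label_line` are hypothetical helpers, not part of the project, and the ids are assigned in order of first appearance purely for the example. A real setup would presumably use one fixed vocabulary shared by all data sets and consistent with the model's `--output_dim`.

```python
def build_vocab(transcripts):
    """Give each distinct character a one-based id; 0 stays reserved for blank."""
    vocab = {}
    for text in transcripts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 1
    return vocab


def to_label_line(utt_id, text, vocab):
    """Format one label.txt line: the utterance id followed by integer labels."""
    ids = " ".join(str(vocab[ch]) for ch in text)
    return f"{utt_id} {ids}"


# The sample utterance posted earlier in this thread.
transcripts = {"BAC009S0764W0121": "中国实现民族复兴"}

vocab = build_vocab(transcripts.values())
lines = [to_label_line(utt_id, text, vocab) for utt_id, text in transcripts.items()]
print("\n".join(lines))  # BAC009S0764W0121 1 2 3 4 5 6 7 8
```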
