RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED/subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u #5

Open
xbsdsongnan opened this issue Jan 22, 2021 · 20 comments

Comments

@xbsdsongnan

Traceback (most recent call last):
  File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in <module>
    model.cuda(args.local_rank)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '526264', '--num_epochs', '8', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '8', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '6268', '--padding_tgt', '6268', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '15000', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.

@danpovey

danpovey commented Jan 22, 2021 via email

@cweng6

cweng6 commented Jan 22, 2021

Thanks, Dan.

@xbsdsongnan I believe this is most likely related to the GPU running out of memory (OOM). Could you try lowering the 'TU_limit' value to reduce GPU memory usage? BTW, you might need to adjust some of your options, such as '--padding_tgt' and '--num_batches_per_epoch', instead of using the default values.

@xbsdsongnan
Author

Do PyTorch 1.1.0 and CUDA 10.0 work together normally? @cweng6 @danpovey

@cweng6

cweng6 commented Jan 25, 2021

I believe so. I saw some stable wheels available for installation here: https://download.pytorch.org/whl/torch_stable.html

@xbsdsongnan
Author

@cweng6
Traceback (most recent call last):
  File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in <module>
    model.cuda(args.local_rank)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '10', '--num_epochs', '2', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '2', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '1', '--padding_tgt', '1', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '1', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.

@xbsdsongnan
Author

@cweng6
I've tried adjusting many parameters; the run above is just one attempt. No matter how I change them, it still fails. Do you have the configuration parameters for the basic demo?

@cweng6

cweng6 commented Jan 26, 2021

We can run with the config in the release example. Setting TU_limit to 1 will not load any utterances for training. In any case, could you describe your environment: Python/PyTorch/CUDA versions, number and spec of GPUs, etc.?

@csukuangfj

The output of the following command should be helpful for describing the environment.

$ python3 -m torch.utils.collect_env

@xbsdsongnan
Author

@cweng6 @csukuangfj
Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
[conda] torchvision 0.3.0

@xbsdsongnan
Author

Python 3.6, CUDA 10.0, torch 1.1.0, 1 GPU

@cweng6

cweng6 commented Jan 26, 2021

Thanks, Fangjun.

@xbsdsongnan, it looks like the CUDA version used to build PyTorch doesn't match the one used at runtime.

Also, I am not sure the example script can run with one GPU. We will release a one-GPU example later on.
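
The mismatch mentioned above can be read straight off the `torch.utils.collect_env` report. The following is a minimal sketch that compares the two CUDA lines by major and minor version. The helper names (`cuda_versions`, `major_minor`, `cuda_mismatch`) are hypothetical, and the sample text is a shortened copy of the report posted in this thread. It only flags the difference; it does not establish that the difference causes the cuDNN error, and it assumes both lines carry a numeric version.

```python
import re


def cuda_versions(report: str) -> dict:
    """Extract the build-time and runtime CUDA versions from a collect_env report."""
    patterns = {
        "build": r"^CUDA used to build PyTorch:\s*(\S+)",
        "runtime": r"^CUDA runtime version:\s*(\S+)",
    }
    found = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, report, flags=re.MULTILINE)
        found[key] = match.group(1) if match else None
    return found


def major_minor(version: str) -> tuple:
    """Reduce a version such as '10.0.130' to the tuple (10, 0)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def cuda_mismatch(report: str) -> bool:
    """Return True when the build and runtime CUDA versions differ in major.minor."""
    versions = cuda_versions(report)
    if versions["build"] is None or versions["runtime"] is None:
        raise ValueError("report lacks one of the CUDA version lines")
    return major_minor(versions["build"]) != major_minor(versions["runtime"])


# Shortened copy of the first environment report in this thread.
first_report = """PyTorch version: 1.1.0
CUDA used to build PyTorch: 9.0.176
CUDA runtime version: 10.0.130
"""

print(cuda_mismatch(first_report))
```

In practice the report text could come from saving the output of `python3 -m torch.utils.collect_env` to a file and reading it back.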

@xbsdsongnan
Author

@cweng6
Thanks, Wengchao.
I'm learning from you.

@xbsdsongnan
Author

@cweng6
I have eight GPUs on my server, but I really want to run on just one GPU.
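
A general, project-independent way to expose only one GPU to a process is the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below builds such an environment for a child process; `single_gpu_env` is a hypothetical helper name. Whether the BMUF training script itself works with a single worker is not confirmed anywhere in this thread, so this only covers restricting device visibility.

```python
import os


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment in which only one GPU is visible to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA applications started with this environment see only this device,
    # and it appears to them as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


env = single_gpu_env(0)
print(env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training command.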

@xbsdsongnan
Author

Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce RTX 2080 with Max-Q Design
Nvidia driver version: 430.34
cuDNN version: /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip3] numpy==1.14.3
[pip3] numpydoc==0.8.0
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2018.0.2 1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.1 py36h3010b51_0
[conda] mkl_random 1.0.1 py36h629b387_0
[conda] torch 1.1.0
[conda] torchvision 0.3.0

@xbsdsongnan
Author

@cweng6
FileNotFoundError: [Errno 2] No such file or directory: '/home/pika/egs/arks/train.0.2.mrk.0'

@cweng6

cweng6 commented Jan 28, 2021

Can you locate the needed mrk file? If not, something must have gone wrong in the data preparation step.
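
One quick way to check is to list whatever mrk files data preparation actually produced. This is a minimal sketch, assuming the files sit somewhere under the `arks` directory named in the error message and have `.mrk` in their file names; `find_mrk_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def find_mrk_files(arks_dir):
    """Return the sorted files under arks_dir whose names contain '.mrk'."""
    root = Path(arks_dir)
    if not root.is_dir():
        # A missing directory already suggests data preparation did not run here.
        return []
    return sorted(p for p in root.rglob("*") if p.is_file() and ".mrk" in p.name)


# Directory taken from the error message above.
for path in find_mrk_files("/home/pika/egs/arks"):
    print(path)
```

An empty listing, or one that lacks `train.0.2.mrk.0`, points back to the data preparation step.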

@xbsdsongnan
Author

label.txt:
BAC009S0764W0121 中国 实现 民族 复兴
wav.scp:
BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav

@xbsdsongnan
Author

@cweng6
This is a sample of my prepared data.
Is there a problem with it?

@xbsdsongnan
Author

@cweng6

Why can't I run the demo you released on four GPUs? Which parameters of your demo need to be modified? What environment and versions does it require?

@cweng6

cweng6 commented Jan 28, 2021

label.txt:
BAC009S0764W0121 中国 实现 民族 复兴
wav.scp:
BAC009S0764W0121 /home/pika/data/test/S0764/BAC009S0764W0121.wav

Your label.txt doesn't look right. Please check our project README:

label.txt: the label text file. The format is "uttid sequence-of-integers", where each integer is a one-based index of the mapped label (zero is reserved for blank), e.g., utt_id_1 3 5 7 10 23

You will need to map each character in the transcription to an integer when preparing label.txt.
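
The mapping described above could be sketched as follows. This is only an illustration under stated assumptions: the ids are assigned by sorting the distinct characters, spaces between words are dropped, and the helper names `build_vocab` and `to_label_line` are hypothetical. The project may expect a specific vocabulary file or ordering that this thread does not describe.

```python
def build_vocab(transcripts):
    """Assign each distinct character a one-based id; 0 stays reserved for blank."""
    chars = sorted({ch for text in transcripts for ch in text if not ch.isspace()})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def to_label_line(utt_id, text, vocab):
    """Format one label.txt line: the utterance id followed by integer labels."""
    ids = [str(vocab[ch]) for ch in text if not ch.isspace()]
    return f"{utt_id} {' '.join(ids)}"


# Sample transcription taken from the label.txt posted in this thread.
transcripts = {"BAC009S0764W0121": "中国 实现 民族 复兴"}
vocab = build_vocab(transcripts.values())

for utt_id, text in transcripts.items():
    print(to_label_line(utt_id, text, vocab))
```

The same vocabulary has to be reused for every data split, so in a real setup it would be built once over the full training text and saved alongside the data.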
