RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED/subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u #5
Comments
I've seen it said that this error shows up fairly commonly, and seemingly
at random, when you use LSTMs with PyTorch, particularly with some Anaconda
distributions... but I've also seen it said that it can actually mask an
out-of-memory condition. Regardless, I doubt it is repeatable.
…On Fri, Jan 22, 2021 at 4:34 PM xbsdsongnan ***@***.***> wrote:
Traceback (most recent call last):
  File "/home/pika-main/trainer/train_transducer_bmuf_otfaug.py", line 305, in <module>
    model.cuda(args.local_rank)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-u', '/home/pika-main/trainer/train_transducer_bmuf_otfaug.py', '--local_rank=0', '--verbose', '--optim', 'sgd', '--initial_lr', '0.003', '--final_lr', '0.0001', '--grad_clip', '3.0', '--num_batches_per_epoch', '526264', '--num_epochs', '8', '--momentum', '0.9', '--block_momentum', '0.9', '--sync_period', '5', '--feats_dim', '80', '--cuda', '--batch_size', '8', '--encoder_type', 'transformer', '--enc_layers', '9', '--decoder_type', 'rnn', '--dec_layers', '2', '--rnn_type', 'LSTM', '--rnn_size', '1024', '--embd_dim', '100', '--dropout', '0.2', '--brnn', '--padding_idx', '6268', '--padding_tgt', '6268', '--stride', '1', '--queue_size', '8', '--loader', 'otf_utt', '--batch_first', '--cmn', '--cmvn_stats', '/home/pika/pika-main/egs/global_cmvn.stats', '--output_dim', '6268', '--num_workers', '1', '--sample_rate', '16000', '--feat_config', '/home/pika/pika-main/egs/fbank.conf', '--TU_limit', '15000', '--gain_range', '50,10', '--speed_rate', '0.9,1.0,1.1', '--log_per_n_frames', '131072', '--max_len', '1600', '--lctx', '1', '--rctx', '1', '--model_lctx', '21', '--model_rctx', '21', '--model_stride', '4', 'transducer', '/home/pika/pika-main/egs/lst/data.0.WORKER-ID.lst', '/home/pika/pika-main/egs/logs.baseline/train_transducer.0.WORKER-ID.log', '/home/pika/pika-main/egs/output/baseline.0']' returned non-zero exit status 1.
Thanks, Dan. @xbsdsongnan I believe this is most likely related to GPU OOM. Could you try lowering the 'TU_limit' value to reduce GPU memory usage? BTW, you might need to adjust some of your options, such as '--padding_tgt' and '--num_batches_per_epoch', rather than relying on the default values. |
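One way to check the OOM hypothesis is to look at how much GPU memory is already in use before and during the launch. The following is a minimal sketch, not part of the project: the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration, and it assumes `nvidia-smi` is on the PATH with its CSV query output (memory in MiB).

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage; requires an NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_gpu_memory(result.stdout)


# Usage on a machine with a GPU (hypothetical; not run here):
#   for idx, (used, total) in enumerate(query_gpu_memory()):
#       print("GPU %d: %d/%d MiB used" % (idx, used, total))
```

If a GPU is already close to full before training starts, or fills up right before the crash, that would support the OOM explanation; if memory is mostly free, the cuDNN error probably has another cause.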
Do PyTorch 1.1.0 and CUDA 10.0 work together normally? @cweng6 @danpovey |
I believe so. There are stable wheels available for installation here: https://download.pytorch.org/whl/torch_stable.html |
@cweng6 |
@cweng6 |
We were able to run with the config in the release example. Setting TU_limit to 1 will not load any utterances for training. Anyway, could you describe your environment: Python/PyTorch/CUDA versions, number and spec of GPUs, etc.? |
The output of the following command should help describe the environment:
$ python3 -m torch.utils.collect_env |
@cweng6 @csukuangfj
OS: Ubuntu 16.04.6 LTS
Python version: 3.6
Versions of relevant libraries: |
python3.6 cuda==10.0 torch==1.1.0 gpu==1 |
Thanks, Fangjun. @xbsdsongnan, it looks like the version of CUDA used to build PyTorch doesn't match the one used at runtime. Also, I am not sure the example script can run with one GPU. We will release an example using one GPU later on. |
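A quick way to confirm a build/runtime CUDA mismatch is to compare the two version strings at the major.minor level. This is only a sketch: `major_minor` and `cuda_versions_match` are hypothetical helpers, and the version strings are supplied by hand here. In practice the build version can be read from `torch.version.cuda` (which is `None` for CPU-only builds) and the runtime version from the environment report.

```python
def major_minor(version):
    """Return (major, minor) from a version string such as '10.0' or '10.0.130'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def cuda_versions_match(build_version, runtime_version):
    """True when both versions are known and share the same major.minor release."""
    if build_version is None or runtime_version is None:
        return False  # e.g. a CPU-only PyTorch build reports no CUDA version
    return major_minor(build_version) == major_minor(runtime_version)


# Usage (assumes PyTorch is installed; not run here):
#   import torch
#   print(cuda_versions_match(torch.version.cuda, "10.0.130"))
```

Comparing only major.minor is a deliberate simplification, since patch-level differences within the same toolkit release are usually not the problem.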
@cweng6 |
@cweng6 |
Collecting environment information...
OS: Ubuntu 16.04.6 LTS
Python version: 3.6
Versions of relevant libraries: |
@cweng6 |
Can you locate the needed mrk file? If not, there must be something wrong with the data preparation step. |
label.txt: |
@cweng6 |
@cweng6 Why can't I run the demo you released on four GPUs? Which parameters of your demo need to be modified? What versions and environment configuration did you use?
Your label.txt doesn't look right. Check our project README: label.txt is the label text file, and its format is "uttid sequence-of-integer", where each integer is the one-based index of the mapped label; note that zero is reserved for blank. E.g.,
utt_id_1 3 5 7 10 23
You will need to map each character in the transcription to an integer when preparing label.txt.
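The character-to-integer mapping described above can be sketched as follows. This is an illustrative example rather than the project's own preparation script: the helper names and the choice to drop whitespace characters are assumptions, and the only constraints taken from the README are one-based label indices with zero reserved for blank.

```python
def build_char_map(transcripts):
    """Give each distinct character a one-based id; 0 stays reserved for blank."""
    chars = sorted(
        {ch for text in transcripts.values() for ch in text if not ch.isspace()}
    )
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def to_label_line(utt_id, text, char_map):
    """Format one label.txt line: the utterance id followed by integer labels."""
    ids = [str(char_map[ch]) for ch in text if not ch.isspace()]
    return utt_id + " " + " ".join(ids)


# Hypothetical transcripts keyed by utterance id.
transcripts = {"utt_id_1": "ab a", "utt_id_2": "cb"}
char_map = build_char_map(transcripts)
for utt_id, text in transcripts.items():
    print(to_label_line(utt_id, text, char_map))
```

The same character map has to be reused for every data split so that a given integer always refers to the same character.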