UserWarning: Plan failed with a CuDNNError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #124

Open
gregbugaj opened this issue Oct 3, 2024 · 0 comments

Describe the bug
While running classification on marie-3.0.30, the run fails with the following exception:

UserWarning: Plan failed with a CuDNNError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from run_conv_plan at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:374 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8b260d0897 in /opt/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf1dcb (0x7f8ad5ec0dcb in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x106b1d7 (0x7f8ad613a1d7 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x106b84b (0x7f8ad613a84b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x104e2f2 (0x7f8ad611d2f2 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x53f (0x7f8ad611dcbf in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x32c1b6e (0x7f8ad8390b6e in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x32d9321 (0x7f8ad83a8321 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) + 0x2bb (0x7f8b0f7b6c2b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::_convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x13cb (0x7f8b0e9f180b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2e0089f (0x7f8b0fb7f89f in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2e071fc (0x7f8b0fb861fc in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) + 0x344 (0x7f8b0f2c86f4 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::native::convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x3b8 (0x7f8b0e9e4e88 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2e0013c (0x7f8b0fb7f13c in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x2e07068 (0x7f8b0fb86068 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt) + 0x2d4 (0x7f8b0f2c74f4 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x19bd900 (0x7f8b0e73c900 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::native::conv2d_symint(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt) + 0x16b (0x7f8b0e9e876b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x2ff96c3 (0x7f8b0fd786c3 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0x2ff995d (0x7f8b0fd7895d in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #21: at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt) + 0x26e (0x7f8b0f8eb95e in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x68541d (0x7f8b24d7e41d in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #23: <unknown function> + 0x15a10e (0x56325ed7c10e in /opt/venv/bin/python3)
frame #24: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #26: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #28: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #29: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #30: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #31: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #32: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #33: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #34: _PyEval_EvalFrameDefault + 0x64e6 (0x56325ed6b096 in /opt/venv/bin/python3)
frame #35: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #36: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #37: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #38: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #39: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #40: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #41: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #42: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #43: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #44: <unknown function> + 0x1687f1 (0x56325ed8a7f1 in /opt/venv/bin/python3)
frame #45: _PyEval_EvalFrameDefault + 0x614a (0x56325ed6acfa in /opt/venv/bin/python3)
frame #46: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #47: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #48: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #49: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #50: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #51: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #52: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #53: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #55: <unknown function> + 0x1687f1 (0x56325ed8a7f1 in /opt/venv/bin/python3)
frame #56: _PyEval_EvalFrameDefault + 0x198c (0x56325ed6653c in /opt/venv/bin/python3)
frame #57: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #58: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #59: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #60: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #61: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #62: _PyFunction_Vectorcall + 0x7c (0x56325ed7c9fc in /opt/venv/bin/python3)
frame #63: _PyEval_EvalFrameDefault + 0x8ac (0x56325ed6545c in /opt/venv/bin/python3)
 (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:855.) (raised from /opt/venv/lib/python3.10/site-packages/detectron2/layers/wrappers.py:142)
ERROR  marie@37 Error in PSM_SPARSE_STEP : FIND was unable to find an engine to execute this computation after trying 6 plans.                                   [10/03/24 16:28:30]
ERROR  marie@37 Error in PSM_SPARSE_STEP : CUDA error: unspecified launch failure                                                                                                   
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                      
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                       
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                          
                                                                                                                                                                                    
INFO   marie@37 Refinement step 1 : No change in image                                                                                                                              
ERROR  marie@37 CUDA error: unspecified launch failure                                                                                                           [10/03/24 16:28:30]
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                      
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                       
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                          
                                                                                                                                                                                    
ERROR  marie@37 Extract error                                                                                                                                                       
       Traceback (most recent call last):                                                                                                                                           
         File "/opt/venv/lib/python3.10/site-packages/marie/boxes/dit/ulim_dit_box_processor.py", line 811, in extract_bounding_boxes                                               
           raise ex                                                                                                                                                                 
         File "/opt/venv/lib/python3.10/site-packages/marie/boxes/dit/ulim_dit_box_processor.py", line 692, in extract_bounding_boxes                                               
           bboxes, polys, scores, lines_bboxes, classes = self.psm_sparse(                                                                                                          
         File "/opt/venv/lib/python3.10/site-packages/marie/boxes/dit/ulim_dit_box_processor.py", line 640, in psm_sparse                                                           
           torch_gc()                                                                                                                                                               
         File "/opt/venv/lib/python3.10/site-packages/marie/models/utils.py", line 106, in torch_gc                                                                                 
           torch.cuda.empty_cache()                                                                                                                                                 
         File "/opt/venv/lib/python3.10/site-packages/torch/cuda/memory.py", line 162, in empty_cache                                                                               
           torch._C._cuda_emptyCache()                                                                                                                                              
       RuntimeError: CUDA error: unspecified launch failure                                                                                                                         
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                      
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                       
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                          
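
The log above suggests passing CUDA_LAUNCH_BLOCKING=1, since the traceback currently ends in `torch.cuda.empty_cache()`, which is probably not where the kernel actually failed. A minimal sketch of re-running a reproduction under that setting follows. The helper names `blocking_cuda_env` and `rerun_with_blocking` are hypothetical, as is any reproduction script passed to them; only the environment variable itself comes from the log.

```python
import os
import subprocess
import sys


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Named in the log above: kernel launches become synchronous, so the
    # Python traceback should point at the call that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking(argv):
    """Run a command (e.g. a hypothetical reproduction script) under that environment."""
    # The variable must be set before the child process initializes CUDA,
    # which is why it is passed via the environment rather than set later.
    return subprocess.run(argv, env=blocking_cuda_env(), check=False).returncode


if __name__ == "__main__":
    # Placeholder usage: python rerun.py python classify_repro.py
    if len(sys.argv) > 1:
        sys.exit(rerun_with_blocking(sys.argv[1:]))
```

Running with synchronous launches is slower, so this is intended for a one-off debugging run rather than for the normal service configuration.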
                                                                                                           