UserWarning: Plan failed with a CuDNNError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #124

gregbugaj · 2024-10-03T16:31:47Z

Describe the bug
While running classification on marie-3.0.30, we fail with the following exception:

UserWarning: Plan failed with a CuDNNError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from run_conv_plan at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:374 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8b260d0897 in /opt/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf1dcb (0x7f8ad5ec0dcb in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x106b1d7 (0x7f8ad613a1d7 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x106b84b (0x7f8ad613a84b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x104e2f2 (0x7f8ad611d2f2 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x53f (0x7f8ad611dcbf in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x32c1b6e (0x7f8ad8390b6e in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x32d9321 (0x7f8ad83a8321 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool) + 0x2bb (0x7f8b0f7b6c2b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::_convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x13cb (0x7f8b0e9f180b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2e0089f (0x7f8b0fb7f89f in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2e071fc (0x7f8b0fb861fc in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt, bool, bool, bool, bool) + 0x344 (0x7f8b0f2c86f4 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::native::convolution(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x3b8 (0x7f8b0e9e4e88 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2e0013c (0x7f8b0fb7f13c in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x2e07068 (0x7f8b0fb86068 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, bool, c10::ArrayRef<c10::SymInt>, c10::SymInt) + 0x2d4 (0x7f8b0f2c74f4 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x19bd900 (0x7f8b0e73c900 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::native::conv2d_symint(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt) + 0x16b (0x7f8b0e9e876b in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x2ff96c3 (0x7f8b0fd786c3 in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0x2ff995d (0x7f8b0fd7895d in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #21: at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::SymInt) + 0x26e (0x7f8b0f8eb95e in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x68541d (0x7f8b24d7e41d in /opt/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #23: <unknown function> + 0x15a10e (0x56325ed7c10e in /opt/venv/bin/python3)
frame #24: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #26: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #27: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #28: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #29: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #30: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #31: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #32: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #33: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #34: _PyEval_EvalFrameDefault + 0x64e6 (0x56325ed6b096 in /opt/venv/bin/python3)
frame #35: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #36: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #37: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #38: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #39: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #40: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #41: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #42: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #43: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #44: <unknown function> + 0x1687f1 (0x56325ed8a7f1 in /opt/venv/bin/python3)
frame #45: _PyEval_EvalFrameDefault + 0x614a (0x56325ed6acfa in /opt/venv/bin/python3)
frame #46: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #47: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #48: <unknown function> + 0x16893e (0x56325ed8a93e in /opt/venv/bin/python3)
frame #49: _PyEval_EvalFrameDefault + 0x2a27 (0x56325ed675d7 in /opt/venv/bin/python3)
frame #50: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #51: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #52: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #53: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #55: <unknown function> + 0x1687f1 (0x56325ed8a7f1 in /opt/venv/bin/python3)
frame #56: _PyEval_EvalFrameDefault + 0x198c (0x56325ed6653c in /opt/venv/bin/python3)
frame #57: _PyObject_FastCallDictTstate + 0xc4 (0x56325ed71c14 in /opt/venv/bin/python3)
frame #58: _PyObject_Call_Prepend + 0x5c (0x56325ed8786c in /opt/venv/bin/python3)
frame #59: <unknown function> + 0x280700 (0x56325eea2700 in /opt/venv/bin/python3)
frame #60: _PyObject_MakeTpCall + 0x25b (0x56325ed72a7b in /opt/venv/bin/python3)
frame #61: _PyEval_EvalFrameDefault + 0x6a79 (0x56325ed6b629 in /opt/venv/bin/python3)
frame #62: _PyFunction_Vectorcall + 0x7c (0x56325ed7c9fc in /opt/venv/bin/python3)
frame #63: _PyEval_EvalFrameDefault + 0x8ac (0x56325ed6545c in /opt/venv/bin/python3)
 (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:855.) (raised from /opt/venv/lib/python3.10/site-packages/detectron2/layers/wrappers.py:142)
ERROR  marie@37 Error in PSM_SPARSE_STEP : FIND was unable to find an engine to execute this computation after trying 6 plans.                                   [10/03/24 16:28:30]
ERROR  marie@37 Error in PSM_SPARSE_STEP : CUDA error: unspecified launch failure                                                                                                   
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                      
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                       
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                          
                                                                                                                                                                                    
INFO   marie@37 Refinement step 1 : No change in image                                                                                                                              
ERROR  marie@37 CUDA error: unspecified launch failure                                                                                                           [10/03/24 16:28:30]
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                      
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                       
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                          
                                                                                                                                                                                    
ERROR  marie@37 Extract error                                                                                                                                                       
       Traceback (most recent call last):                                                                                                                                           
         File "/opt/venv/lib/python3.10/site-packages/marie/boxes/dit/ulim_dit_box_processor.py", line 811, in extract_bounding_boxes                                               
           raise ex                                                                                                                                                                 
         File "/opt/venv/lib/python3.10/site-packages/marie/boxes/dit/ulim_dit_box_processor.py", line 692, in extract_bounding_boxes                                               
           bboxes, polys, scores, lines_bboxes, classes = self.psm_sparse(                                                                                                          
         File "/opt/venv/lib/python3.10/site-packages/marie/boxes/dit/ulim_dit_box_processor.py", line 640, in psm_sparse                                                           
           torch_gc()                                                                                                                                                               
         File "/opt/venv/lib/python3.10/site-packages/marie/models/utils.py", line 106, in torch_gc                                                                                 
           torch.cuda.empty_cache()                                                                                                                                                 
         File "/opt/venv/lib/python3.10/site-packages/torch/cuda/memory.py", line 162, in empty_cache                                                                               
           torch._C._cuda_emptyCache()                                                                                                                                              
       RuntimeError: CUDA error: unspecified launch failure                                                                                                                         
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                      
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                       
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
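
The log above suggests `CUDA_LAUNCH_BLOCKING=1` for debugging. A minimal sketch of setting it from Python follows, assuming the variable has to be in place before CUDA is initialized (in practice, before the first `import torch` in the process). The helper name `enable_blocking_cuda_launches` is hypothetical and not part of marie or PyTorch.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    With this set, CUDA kernel launches run synchronously, so an error
    is reported at the call that caused it rather than at a later,
    unrelated API call (such as the empty_cache() call in the traceback).
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Assumption: this runs at the very top of the entry point, before torch
# (or anything that imports torch) is loaded, so the CUDA runtime sees it.
enable_blocking_cuda_launches()
```

Exporting the variable in the shell or container environment before launching the process has the same effect. Synchronous launches slow execution noticeably, so this is only meant for reproducing the failure.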
