Segmentation fault (core dumped) even with Cuda-9.0 #5
I am very sorry to hear that my code is giving you trouble. I struggled with this for a long time when running the code with CUDA 8.0; the bug does not occur every time: sometimes the code runs fine, sometimes it crashes, and worse, I could not pin down exactly where the bug is (I rewrote or commented out several parts I suspected, but the bug persisted). The code ran well once I switched to CUDA 9.0, so I assumed it was a CUDA 8.0 bug and gave up debugging, which means I still do not know where the bug actually comes from. But I think it is my responsibility to fix it. Could you share the exact command you use and the logs from when the code crashes? Then I think we can work together to fix the bug.
Error information: CUDA version: 9.0.176. Training always core-dumps at epoch 6 when running "python train_script/train_final.py --checkpoint_cg runs/test/model/test_00076_01-22-20-11-37.ckp --alias wsdec", but runs without issue during pre-training.
Yes, exactly the same situation: the error only occurs when training the final model (I believe it shows up in the first few epochs after switching from pre-training). When using CUDA 8.0, I tried another pretrained model, which sometimes ran fine. Have you tried another pretrained model? Or try our pretrained model, which reproduces exactly the results in the paper and has not hit this bug (at least in our case).
Hi, are you still there?
Hi! Thanks for the response! Since reproducing the bug ties up a GPU, I don't have a spare GPU to try it on right now. I will be working on the CVPR rebuttal for the next two weeks and may explore the code further after that. Thanks!
Hi, thank you for your wonderful work! I am also experiencing this exact same bug. It always crashes with a segfault at epoch 6, batch 260 using CUDA 9.0.176. Here are the current scores for the latest checkpoint:
Let me know if any progress can be made in fixing this bug. I would like to train the model further! |
I have not made any progress on this project recently. I may return to it next month and will let you know if anything changes. By the way, I compared your scores with the ones in my paper, and your model looks reasonably well trained. In my experiments the best scores are always obtained in the first few epochs, so I suggest checking the results from those early epochs. Thanks for your comment, and please keep in touch under this issue.
Hi, I wanted to let you know that the training proceeded past epoch 6 after I decreased batch size from the default of 32 to 16. Obvious fix in retrospect! |
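For anyone wanting to try the same workaround, here is a minimal sketch of where the batch size typically enters a PyTorch training loop, assuming the trainer builds its batches with torch.utils.data.DataLoader; the dataset and names below are illustrative placeholders, not the repo's actual code.

```python
# Minimal sketch of the workaround above (batch size 32 -> 16), assuming a
# standard torch.utils.data.DataLoader pipeline; dataset and names are
# illustrative placeholders, not the repo's actual code.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real video/caption features.
dataset = TensorDataset(torch.randn(1024, 10), torch.zeros(1024).long())

# Halving the batch size lowers peak GPU memory per step, which is the
# change that let training proceed past epoch 6 in the comment above.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)

for features, labels in loader:
    pass  # forward / backward pass would go here
```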
@aemrey, See, this is really a strange bug. |
Thanks for sharing the code!
I ran the training code with CUDA 9.0 under Pytorch-0.3.1-cuda90, but I still hit the bug. Can you tell me which part of the code leads to the bug? I would like to try to address it.
Thanks.
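One way to narrow down where the segfault happens, sketched below under the assumption that you can edit train_script/train_final.py and are on Python 3 (or have the faulthandler backport installed on Python 2): enabling Python's faulthandler makes the interpreter print the Python-level traceback when the process receives SIGSEGV, which at least shows which call was active when the native crash occurred.

```python
# Sketch: enable faulthandler near the top of train_script/train_final.py
# (assumes Python 3, or the "faulthandler" backport package on Python 2).
import sys
import faulthandler

# On a fatal signal such as SIGSEGV, dump the Python traceback of every
# thread to stderr, so the "Segmentation fault (core dumped)" at epoch 6
# at least reveals which Python call (CUDA op, DataLoader worker, etc.)
# was running when the native crash happened.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... rest of the training script runs unchanged ...
```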