Segmentation fault (core dumped) even with Cuda-9.0 #5

Open
YapengTian opened this issue Jan 24, 2019 · 9 comments
Labels
bug Something isn't working

Comments

@YapengTian

Thanks for sharing the code!

I ran the training code with CUDA 9.0 under PyTorch 0.3.1 (cuda90), but I still hit the bug. Can you tell me which part of the code triggers it? I would like to try to fix it.

Thanks.

@XgDuan
Owner

XgDuan commented Jan 25, 2019

I am very sorry about my buggy code. I was frustrated for a long time when running the code with CUDA 8.0, because the bug does not occur every time: sometimes the code runs well and sometimes it fails. Even worse, I could not pinpoint where the bug is (I rewrote or commented out several parts that I suspected, but the bug persisted). The code ran well once I switched to CUDA 9.0, so I came to believe this was a CUDA 8.0 bug and gave up debugging. As a result, I do not actually know where the bug comes from.

But I think it's my responsibility to fix the bug. Could you share the exact command you used and the logs from when the code fails? Then I think we can work together to fix it.

@XgDuan added the bug label Jan 25, 2019
@YapengTian
Author

Error information:
[wsdec(0), 20]: train: epoch[00006], batch[0260/0312], elapsed time=2.8320s, loss: 41.171696, 0.000101
Segmentation fault (core dumped)

Cuda version: CUDA Version 9.0.176

It always dumps core at epoch 6 when running "python train_script/train_final.py --checkpoint_cg runs/test/model/test_00076_01-22-20-11-37.ckp --alias wsdec". Pre-training runs without any issue.
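
One way to narrow down which part of the code triggers the crash is Python's standard `faulthandler` module, which writes the Python-level stack of every thread when the interpreter receives a fatal signal such as SIGSEGV. The following is a minimal sketch, assuming a Python 3 interpreter and that the helper is called once at the top of the training script. The helper name `enable_crash_traceback` and the log file name are hypothetical and not part of this repository.

```python
import faulthandler


def enable_crash_traceback(log_path="segfault_traceback.log"):
    """Write the Python stack of all threads to log_path on a fatal signal.

    The returned file object must stay open for as long as the handler is
    enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    # all_threads=True also dumps worker threads, not only the main thread.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Without touching the code, running the script with `python -X faulthandler` has a similar effect and prints the traceback to stderr. The traceback only shows the Python frame that was active at the time of the crash; the faulting native code itself would still need a tool such as gdb.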

@XgDuan
Owner

XgDuan commented Jan 26, 2019

Yes, exactly the same situation: the error only occurs when training the final model. I believe it occurs in the first several epochs after switching from train_sl to train_cg, because of the Java METEOR scoring package that gets loaded, perhaps due to a subprocess terminating.

When using CUDA 8.0, I tried another pretrained model, which sometimes ran well. Have you tried another pretrained model? Or you could try our pretrained model, which reproduces exactly the results in the paper and does not hit any bugs (at least in our case).
BTW, what's the current score? I notice that the results are sometimes very good in the first several epochs.
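
If the subprocess hypothesis above is worth ruling out, one option is to check whether the scorer's child process is still alive before each request and restart it when it has exited. The sketch below only illustrates that pattern with `subprocess.Popen.poll()`. It is not the repository's METEOR wrapper: the class name `LineScorer` is hypothetical, and `ECHO_CMD` is a stand-in child process (a small Python loop that upper-cases each line) used in place of the real Java command.

```python
import subprocess
import sys

# Stand-in for the real scorer command (an assumption for illustration):
# a child that reads one line at a time and answers with its upper-case form.
ECHO_CMD = [
    sys.executable, "-c",
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line:\n"
    "        break\n"
    "    print(line.strip().upper(), flush=True)\n",
]


class LineScorer:
    """Line-oriented child process that is restarted if it has exited."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.restarts = 0
        self._start()

    def _start(self):
        self.proc = subprocess.Popen(
            self.cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            universal_newlines=True, bufsize=1)

    def ask(self, line):
        # poll() returns None while the child is still running.
        if self.proc.poll() is not None:
            self.restarts += 1
            self._start()
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().strip()

    def close(self):
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()
```

Logging the `restarts` counter next to the epoch and batch number would show whether the child process dies shortly before the crash. If it never restarts, the subprocess is probably not the cause.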

@XgDuan
Owner

XgDuan commented Jan 29, 2019

Hi, are you still there?

@YapengTian
Author

Hi!

Thanks for the response! Because the bug blocks one GPU, I have not had a GPU available to try this recently. I will be working on a CVPR rebuttal for the next two weeks and may explore the code further afterwards. Thanks!

@aemrey

aemrey commented Nov 12, 2019

Hi, thank you for your wonderful work!

I am also experiencing this exact same bug. It always crashes with a segfault at epoch 6, batch 260 using CUDA 9.0.176. Here are the current scores for the latest checkpoint:

Average across all tIoUs
--------------------------------------------------------------------------------
| CIDEr: 17.6971
| Bleu_4: 1.0493
| Bleu_3: 2.2116
| Bleu_2: 4.9127
| Bleu_1: 11.7354
| Precision: 60.4421
| ROUGE_L: 12.2123
| METEOR: 6.0898
| Recall: 33.7766

Let me know if any progress can be made in fixing this bug. I would like to train the model further!

@XgDuan
Owner

XgDuan commented Nov 13, 2019

I have not made any progress on this project so far. I may return to it next month and will let you know whether there is any progress. BTW, I compared your scores with those in my paper, and I think your model is trained reasonably well. In my experiments, the best scores are always obtained in the first several epochs, so I suggest checking the results after the first few epochs.

And thanks for your comment; please keep in touch under this issue.

@aemrey

aemrey commented Nov 15, 2019

Hi, I wanted to let you know that the training proceeded past epoch 6 after I decreased batch size from the default of 32 to 16. Obvious fix in retrospect!

@XgDuan
Owner

XgDuan commented Dec 13, 2019

@aemrey, see, this really is a strange bug.
