Segmentation fault (core dumped) even with Cuda-9.0 #5
I am very sorry to hear that my code is giving you trouble. I struggled with this for a long time when running the code with CUDA 8.0; the bug does not occur every time: sometimes the code runs fine, sometimes it crashes, and worse, I could not pin down exactly where the bug is (I rewrote or commented out several parts I suspected, but the bug persisted). The code ran well once I switched to CUDA 9.0, so I assumed it was a CUDA 8.0 bug and gave up debugging, which means I still do not know where the bug actually comes from. But I think it is my responsibility to fix it. Could you share the exact command you use and the logs from when the code crashes? Then I think we can work together to fix the bug.
Error information: CUDA version: 9.0.176. Training always core-dumps at epoch 6 when running "python train_script/train_final.py --checkpoint_cg runs/test/model/test_00076_01-22-20-11-37.ckp --alias wsdec", but runs without issue during pre-training.
Yes, exactly the same situation: the error only occurs when training the final model (I believe it shows up in the first few epochs after switching from pre-training). When using CUDA 8.0, I tried another pretrained model, which sometimes ran fine. Have you tried another pretrained model? Or try our pretrained model, which reproduces exactly the results in the paper and has not hit this bug (at least in our case).
Hi, are you still there?
Hi! Thanks for the response! Since reproducing the bug ties up a GPU, I don't have a spare GPU to try it on right now. I will be working on the CVPR rebuttal for the next two weeks and may explore the code further after that. Thanks!
Hi, thank you for your wonderful work! I am also experiencing this exact same bug. It always crashes with a segfault at epoch 6, batch 260 using CUDA 9.0.176. Here are the current scores for the latest checkpoint:
Let me know if any progress can be made in fixing this bug. I would like to train the model further! |
I have not made any progress on this project recently. I may return to it next month and will let you know if anything changes. By the way, I compared your scores with the ones in my paper, and your model looks reasonably well trained. In my experiments the best scores are always obtained in the first few epochs, so I suggest checking the results from those early epochs. Thanks for your comment, and please keep in touch under this issue.
Hi, I wanted to let you know that the training proceeded past epoch 6 after I decreased batch size from the default of 32 to 16. Obvious fix in retrospect! |
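For anyone wanting to try the same workaround, here is a minimal sketch of where the batch size typically enters a PyTorch training loop, assuming the trainer builds its batches with torch.utils.data.DataLoader; the dataset and names below are illustrative placeholders, not the repo's actual code.

```python
# Minimal sketch of the workaround above (batch size 32 -> 16), assuming a
# standard torch.utils.data.DataLoader pipeline; dataset and names are
# illustrative placeholders, not the repo's actual code.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real video/caption features.
dataset = TensorDataset(torch.randn(1024, 10), torch.zeros(1024).long())

# Halving the batch size lowers peak GPU memory per step, which is the
# change that let training proceed past epoch 6 in the comment above.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)

for features, labels in loader:
    pass  # forward / backward pass would go here
```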
@aemrey, See, this is really a strange bug. |
Thanks for sharing the code!
I ran the training code with CUDA 9.0 under Pytorch-0.3.1-cuda90, but I still hit the bug. Can you tell me which part of the code leads to the bug? I would like to try to address it.
Thanks.
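One way to narrow down where the segfault happens, sketched below under the assumption that you can edit train_script/train_final.py and are on Python 3 (or have the faulthandler backport installed on Python 2): enabling Python's faulthandler makes the interpreter print the Python-level traceback when the process receives SIGSEGV, which at least shows which call was active when the native crash occurred.

```python
# Sketch: enable faulthandler near the top of train_script/train_final.py
# (assumes Python 3, or the "faulthandler" backport package on Python 2).
import sys
import faulthandler

# On a fatal signal such as SIGSEGV, dump the Python traceback of every
# thread to stderr, so the "Segmentation fault (core dumped)" at epoch 6
# at least reveals which Python call (CUDA op, DataLoader worker, etc.)
# was running when the native crash happened.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... rest of the training script runs unchanged ...
```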