RuntimeError: CUDA error: device-side assert triggered #17
On running again, I get this in the error stack trace:
I have the same issue. Did you find any solutions?
I have found the problem. The code uses a small trick here. For a training triple (h, r, t), the trick masks, in the last step, any action leading to an entity e_1, e_2, e_3, ... for which (h, r, e_1), (h, r, e_2), (h, r, e_3), ... are also in the dataset. When every entity in the action space meets this condition, i.e., every action leads to a correct answer, the trick causes a problem: all actions are masked and the model has no action left to select. The trick is therefore most likely to fail on a dense knowledge graph with some SPECIAL 1-N triples, i.e., one where a large proportion of entities appear as the tail entity of (h, r, ?). Some work may be needed to adapt the code to other knowledge graphs. @todpole3 The actual trigger for the error should be here, and the exception is
Here is the code that uses the trick:

```python
def get_false_negative_mask(self, e_space, e_s, q, e_t, kg):
    answer_mask = self.get_answer_mask(e_space, e_s, q, kg)
    # This is a trick applied during training where we convert a multi-answer prediction problem into several
    # single-answer prediction problems. By masking out the other answers in the training set, we are forcing
    # the agent to walk towards a particular answer.
    # This trick does not affect inference on the test set: at inference time the ground truth answer will not
    # appear in the answer mask. This can be checked by uncommenting the following assertion statement.
    # Note that the assertion statement can trigger in the last batch if you're using a batch_size > 1 since
    # we append dummy examples to the last batch to make it the required batch size.
    # The assertion statement will also trigger in the dev set inference of NELL-995 since we randomly
    # sampled the dev set from the training data.
    # assert(float((answer_mask * (e_space == e_t.unsqueeze(1)).long()).sum()) == 0)
    false_negative_mask = (answer_mask * (e_space != e_t.unsqueeze(1)).long()).float()
    return false_negative_mask
```
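For readers hitting the same assert, here is a minimal sketch (not the repository's code; tensor shapes and variable names are illustrative) of why an all-masked action space can surface as this CUDA error if the masked distribution is passed to `torch.multinomial`:

```python
import torch

# Hypothetical toy case: every candidate action leads to another known answer
# of (e_s, q, ?), so the false-negative mask covers the entire distribution.
action_dist = torch.tensor([[0.2, 0.3, 0.5]])          # policy scores over 3 candidate actions
false_negative_mask = torch.tensor([[1.0, 1.0, 1.0]])  # every candidate is a known answer != e_t

masked_dist = action_dist * (1 - false_negative_mask)  # -> [[0., 0., 0.]]

# Sampling from an all-zero row fails: on CPU this raises a RuntimeError about an
# invalid multinomial distribution; on GPU the same condition shows up as
# "CUDA error: device-side assert triggered".
idx = torch.multinomial(masked_dist, 1)
```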
@davidlvxin Thanks for helping with the troubleshooting. Unfortunately
Okay, I realized that this argument is wrong if
Also, did you encounter the error during the training cycle or the inference cycle?
Yeah, I have printed some vectors and found that only after the function
By the way, I encountered the error during training.
@davidlvxin Thanks and sorry about the confusion. I realized the issue myself shortly after making the comment. I believe the reason we did not find this to be an issue in our paper is that we had augmented the graph so that each node has a self-edge. Our training data does not contain examples of self relations, hence the self-edge is always the fall-back solution. So the density of the graph should not be a problem, but if you have triples of the form
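For context, here is a minimal sketch of the self-edge augmentation described above, assuming a plain triple list; the relation name `NO_OP` and the helper `add_self_edges` are illustrative, not the repository's identifiers:

```python
SELF_LOOP_RELATION = "NO_OP"  # illustrative reserved relation for self-edges

def add_self_edges(triples, entities):
    """Give every entity an (e, NO_OP, e) edge so the agent always keeps one
    fall-back action that the false-negative mask cannot remove, as long as the
    training data never contains (e, r, e) answers."""
    augmented = list(triples)
    augmented.extend((e, SELF_LOOP_RELATION, e) for e in entities)
    return augmented
```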
Also, I'm perplexed that this line throws the exception for you. Are you setting
I have printed the tensor
The reason why they are all zeros is that I don't think
Suppose we have (A, r, B), (A, r, C), (A, r, D), (A, r, F), (A, r_1, E), (E, r_2, F), (F, r, C), (F, r, D), (F, r, F) in a small KG. The training triple is (A, r, B). We start from entity A, and the maximum hop step is 3. Our model's search path is A->r_1->E->r_2->F, and this is the last step. F has three actions, i.e., (r, C), (r, D), (r, F). But C, D and F are not the target entity B, while all of them are answers to (A, r, ?), so every action in the last step gets masked.
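The toy example can be reproduced in a few lines of plain Python (names are illustrative; this is not the repository's data pipeline):

```python
# Toy KG from the example above.
kg = {("A", "r", "B"), ("A", "r", "C"), ("A", "r", "D"), ("A", "r", "F"),
      ("A", "r_1", "E"), ("E", "r_2", "F"),
      ("F", "r", "C"), ("F", "r", "D"), ("F", "r", "F")}

head, rel, target = "A", "r", "B"                              # training triple (A, r, B)
answers = {t for h, r, t in kg if h == head and r == rel}      # {"B", "C", "D", "F"}

# Last step of the rollout A -> r_1 -> E -> r_2 -> F: actions available from F.
actions_from_F = [(r, t) for h, r, t in kg if h == "F"]

# The false-negative mask removes every action whose tail is a known answer != target.
surviving = [(r, t) for r, t in actions_from_F if not (t in answers and t != target)]
print(surviving)   # [] -> no action left to select
```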
Got it. The reason we design
A possible fix here is to add EPSILON to
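A sketch of the direction suggested above, under the assumption that the fix adds a small EPSILON to the masked action distribution before sampling so the row can never sum to zero; this is not the actual patch:

```python
import torch

EPSILON = 1e-10  # illustrative value

def mask_with_epsilon(action_dist, false_negative_mask):
    # Zero out actions leading to other known answers, then add EPSILON so the row
    # still sums to a positive number even when every action was masked.
    masked = action_dist * (1 - false_negative_mask) + EPSILON
    return masked / masked.sum(dim=1, keepdim=True)

action_dist = torch.tensor([[0.2, 0.3, 0.5]])
false_negative_mask = torch.tensor([[1.0, 1.0, 1.0]])
safe_dist = mask_with_epsilon(action_dist, false_negative_mask)
idx = torch.multinomial(safe_dist, 1)  # no assert; degenerates to a uniform choice
```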
Yeah, I think it is OK :). The above trick only fails with a very small probability, and in most cases it works fine.
Cool. I'll keep this issue open and push a fix at some point. Thanks again for identifying it.
Following the above discussion, I want to confirm the impact of the false-negative mask during the inference cycle. Does the false-negative mask make the model get the right results only under the filtered metric? If I want results under the raw metric, should the false-negative mask be disabled?
"Is the false-negative mask actually make the model only get the right results under the filter metric???" In our implementation we made sure to only include "false-negative" examples from the training KG (given triples). Surprisingly many datasets have query overlap between train/dev/test. |
Thanks very much for your reply! The false-negative mask filters out the other answers in train_objects/train_subjects, while the filter operation filters out the other answers in all_objects/all_subjects. That is the only difference between the two operations in the inference cycle.
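To make the distinction concrete, here is a small sketch of raw vs. filtered ranking, with the train-only answer set (the false-negative mask's view) shown alongside; the helper and the scores are made up for illustration:

```python
def rank(scores, target, exclude=frozenset()):
    """Rank of `target` among all scored entities, ignoring other entities in `exclude`."""
    target_score = scores[target]
    higher = sum(1 for e, s in scores.items()
                 if e != target and e not in exclude and s > target_score)
    return higher + 1

scores = {"B": 0.90, "C": 0.95, "D": 0.80}   # hypothetical model scores for (A, r, ?)
target = "B"
train_answers = {"C"}       # other answers seen only in the training KG (mask's view)
all_answers = {"C", "D"}    # other answers across train/dev/test (filtered metric's view)

raw_rank = rank(scores, target)                      # raw metric: nothing excluded -> 2
mask_view_rank = rank(scores, target, train_answers) # train-only exclusion -> 1
filtered_rank = rank(scores, target, all_answers)    # filtered metric -> 1
```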
Greetings from 2024: has this problem been solved? I have the same problem, using triples from my own dataset.
Epoch 0
I was trying to run the model on my custom data of KG triples to compare its performance; however, I encountered a problem.
Upon running the training command for the policy gradient model:
./experiment.sh configs/<model>.sh --train 0
I encountered the following error:
RuntimeError: CUDA error: device-side assert triggered
Full stack trace:
Kindly help me debug this; in particular, possible error sources and how to resolve them.
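As a general debugging aid (not specific to this repository), device-side asserts are reported asynchronously, so the Python stack trace often points away from the failing kernel; forcing synchronous launches usually makes the trace land on the real culprit. A minimal sketch:

```python
import os

# Must be set before the first CUDA operation (or exported in the shell that launches
# ./experiment.sh) so kernel launches become synchronous and the stack trace points
# at the kernel that actually triggered the assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```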