Improper use of CUDA Graph #6

Open
Judith-PAS opened this issue Oct 15, 2023 · 1 comment
Improper use of CUDA Graph in TC-GNN

Hello,

I wanted to bring to your attention a potential issue regarding the use of CUDA Graph in TC-GNN. After reviewing the PyTorch documentation and tutorial, it appears that CUDA Graph is intended to capture and replay GPU kernels on a specified stream. However, I noticed that in TC-GNN the manually implemented kernels (e.g., TCGNN_conv/TCGNN_kernel.cu) are launched without setting the stream. As a result, these kernels are neither captured nor replayed (executed) within the TC-GNN framework.
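
For reference, here is a minimal sketch (not TC-GNN code; the tensor names and shapes are made up) of how PyTorch's CUDA Graph API is meant to be used: only kernels issued on the stream that is active during capture are recorded, so a custom extension that hard-codes a different stream would not end up in the captured graph.

```python
# Minimal sketch of PyTorch CUDA Graph capture; tensor names/shapes are illustrative.
import torch

assert torch.cuda.is_available()

# CUDA Graphs replay fixed memory addresses, so inputs live in static buffers
# that are overwritten (copy_) before each replay.
static_x = torch.randn(1024, 64, device="cuda")
weight = torch.randn(64, 64, device="cuda")

g = torch.cuda.CUDAGraph()

# Warm-up on a side stream, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = static_x @ weight
torch.cuda.current_stream().wait_stream(s)

# Capture: torch.cuda.graph() makes its own capture stream current; a kernel
# launched on some other stream (e.g. by a custom C++/CUDA extension that
# ignores the current stream) is not recorded into the graph, so replaying
# the graph would not execute that kernel's work.
with torch.cuda.graph(g):
    static_y = static_x @ weight

# Replay re-executes exactly the captured kernels on the static buffers.
static_x.copy_(torch.randn(1024, 64, device="cuda"))
g.replay()
torch.cuda.synchronize()
```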

It seems that the speedup attributed to CUDA Graph is actually a consequence of these kernels being skipped during execution. In fact, in my profiling the CUDA Graph run was over five times faster than the non-CUDA-Graph test that actually runs the kernels. The lower test accuracy reported in #5 also supports this observation.

I kindly suggest correcting the reported results that rely on CUDA Graph.

Thank you for your attention to this matter.

YukeWang96 (Owner) commented Oct 15, 2023

Hi, thanks for reaching out!

Thanks for bringing this to our attention. Our current observation is that CUDA Graph in PyTorch seems to have problems supporting kernels with dynamic arrays (e.g., edge_list or row_ptr) in GNN settings.
Here are the results we recently obtained without CUDA Graph, compared against DGL on an RTX 3090.
The original conclusion of a performance advantage over DGL still holds.

| GCN model | DGL | TC-GNN (w/o CUDA Graph) | Speedup (x) |
| --- | --- | --- | --- |
| citeseer | 7.27 | 3.75 | 1.94 |
| cora | 7.05 | 3.68 | 1.92 |
| pubmed | 7.34 | 3.74 | 1.96 |
| ppi | 7.56 | 4.46 | 1.70 |
| PROTEINS_full | 7.48 | 3.75 | 2.00 |
| OVCAR-8H | 69.45 | 66.95 | 1.04 |
| Yeast | 63.67 | 61.07 | 1.04 |
| DD | 13.35 | 10.53 | 1.27 |
| YeastH | 114.87 | 111.48 | 1.03 |
| amazon0505 | 20.58 | 22.70 | 0.91 |
| artist | 7.45 | 4.50 | 1.66 |
| com-amazon | 16.70 | 16.69 | 1.00 |
| soc-BlogCatalog | 7.56 | 9.41 | 0.80 |
| amazon0601 | 19.58 | 19.55 | 1.00 |
| Average | | | 1.38 |
| AGNN model | DGL | TC-GNN (w/o CUDA Graph) | Speedup (x) |
| --- | --- | --- | --- |
| citeseer | 31.25 | 10.31 | 3.03 |
| cora | 31.08 | 10.34 | 3.01 |
| pubmed | 31.38 | 10.59 | 2.96 |
| ppi | 40.28 | 19.89 | 2.03 |
| PROTEINS_full | 31.47 | 10.48 | 3.00 |
| OVCAR-8H | 143.94 | 112.05 | 1.28 |
| Yeast | 131.67 | 100.85 | 1.31 |
| DD | 44.31 | 23.29 | 1.90 |
| YeastH | 231.63 | 184.51 | 1.26 |
| amazon0505 | 69.63 | 118.42 | 0.59 |
| artist | 40.40 | 38.71 | 1.04 |
| com-amazon | 50.67 | 41.60 | 1.22 |
| soc-BlogCatalog | 50.72 | 81.73 | 0.62 |
| amazon0601 | 61.05 | 47.42 | 1.29 |
| Average | | | 1.75 |

We will soon update our current code repo to fix this error.
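
For completeness, the usual workaround for dynamic graph inputs under CUDA Graph is to capture once with pre-allocated, fixed-size buffers and copy each input's row_ptr / edge_list into them before replay. The sketch below is only an illustration of that pattern; the buffer names, sizes, and the forward() stand-in are hypothetical and not taken from this repository.

```python
# Sketch of the static-buffer workaround for CUDA Graph with variable-size
# graph inputs (hypothetical names, not TC-GNN code).
import torch

MAX_NODES, MAX_EDGES, DIM = 10_000, 200_000, 64  # assumed upper bounds

static_row_ptr = torch.zeros(MAX_NODES + 1, dtype=torch.int32, device="cuda")
static_edge_list = torch.zeros(MAX_EDGES, dtype=torch.int32, device="cuda")
static_feat = torch.zeros(MAX_NODES, DIM, device="cuda")
weight = torch.randn(DIM, DIM, device="cuda")

def forward():
    # Stand-in for the aggregation + update kernels; a real SpMM kernel would
    # read static_row_ptr / static_edge_list while being captured.
    return static_feat @ weight

g = torch.cuda.CUDAGraph()
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        forward()                       # warm-up
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    static_out = forward()              # capture once with fixed shapes

def run(row_ptr, edge_list, feat):
    # Copy the (smaller-or-equal) dynamic inputs into the static buffers,
    # then replay the captured kernels.
    static_row_ptr[: row_ptr.numel()].copy_(row_ptr)
    static_edge_list[: edge_list.numel()].copy_(edge_list)
    static_feat[: feat.size(0)].copy_(feat)
    g.replay()
    return static_out
```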
