I wanted to bring to your attention a potential issue with the usage of CUDA Graphs in TC-GNN. According to the PyTorch documentation and tutorials, a CUDA Graph captures and replays GPU kernels issued on a specified stream. However, in TC-GNN the manually implemented kernels (e.g., TCGNN_conv/TCGNN_kernel.cu) are launched without setting that stream. As a result, these kernels are neither captured nor replayed (i.e., never executed) by the graph.
It seems that the speedup attributed to CUDA Graphs is a consequence of these kernels being skipped entirely. In my profiling, the CUDA Graph run was over five times faster than the non-CUDA-Graph test that actually executes the kernels. The lower test accuracy reported in #5 also supports this observation.
I kindly suggest correcting the published results that were measured with CUDA Graphs enabled.
Thank you for your attention to this matter.
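For illustration, a minimal sketch of the kind of change described above: a PyTorch C++/CUDA extension kernel must be launched on the stream PyTorch is currently using, or `torch.cuda.graph()` capture will not see it. The kernel and function names here are hypothetical placeholders, not the actual TC-GNN code.

```cuda
// Hypothetical sketch (names are placeholders, not TC-GNN's real kernels).
// Launching with <<<grid, block>>> alone uses the legacy default stream,
// which is NOT the capture stream during torch.cuda.graph(); the launch
// below passes the current stream explicitly so capture works.
#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

__global__ void spmm_kernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i];  // placeholder body
}

void spmm_forward(torch::Tensor x, torch::Tensor y) {
    const int n = static_cast<int>(x.numel());
    // Fetch the stream PyTorch is currently using; during graph capture
    // this is the capture stream, so the launch is recorded in the graph.
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    spmm_kernel<<<blocks, threads, 0, stream>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), n);
}
```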
Thanks for bringing this to our attention. Our current observation is that CUDA Graphs in PyTorch seem to have trouble supporting kernels with dynamic arrays (e.g., edge_list or row_ptr) in GNN workloads.
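To illustrate the dynamic-array constraint: a CUDA Graph freezes kernel launch parameters and argument pointers at capture time, so variable-size inputs such as edge_list or row_ptr would have to live in fixed, maximum-size buffers that are updated in place before each replay. The sketch below uses the raw CUDA stream-capture API; `my_kernel` and the buffer names are hypothetical.

```cuda
// Hedged sketch of stream capture with the raw CUDA runtime API.
// All launch shapes and device pointers are frozen when the graph is
// captured; replays re-run the identical launches, so dynamic inputs
// must be pre-allocated at a fixed max size and refilled before replay.
#include <cuda_runtime.h>

__global__ void my_kernel(const int* row_ptr, const int* edge_list,
                          float* out, int max_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < max_n) out[i] = 0.0f;  // placeholder body
}

void capture_and_replay(int* d_row_ptr, int* d_edge_list,
                        float* d_out, int max_n) {
    cudaStream_t s;
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamCreate(&s);

    // Everything issued on `s` between Begin/EndCapture goes into the graph.
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    my_kernel<<<(max_n + 255) / 256, 256, 0, s>>>(
        d_row_ptr, d_edge_list, d_out, max_n);
    cudaStreamEndCapture(s, &graph);

    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    // Refill d_edge_list/d_row_ptr in place, then replay the fixed launches.
    cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);
}
```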
Here are results we recently measured without CUDA Graphs, compared against DGL on an RTX 3090.
The original conclusion for performance advantage over DGL still holds.
| GCN-model | DGL | TC-GNN (w/o CUDA Graph) | Speedup (x) |
|---|---|---|---|
| citeseer | 7.27 | 3.75 | 1.94 |
| cora | 7.05 | 3.68 | 1.92 |
| pubmed | 7.34 | 3.74 | 1.96 |
| ppi | 7.56 | 4.46 | 1.70 |
| PROTEINS_full | 7.48 | 3.75 | 2.00 |
| OVCAR-8H | 69.45 | 66.95 | 1.04 |
| Yeast | 63.67 | 61.07 | 1.04 |
| DD | 13.35 | 10.53 | 1.27 |
| YeastH | 114.87 | 111.48 | 1.03 |
| amazon0505 | 20.58 | 22.70 | 0.91 |
| artist | 7.45 | 4.50 | 1.66 |
| com-amazon | 16.70 | 16.69 | 1.00 |
| soc-BlogCatalog | 7.56 | 9.41 | 0.80 |
| amazon0601 | 19.58 | 19.55 | 1.00 |
| **Average** | | | **1.38** |
| AGNN-model | DGL | TC-GNN (w/o CUDA Graph) | Speedup (x) |
|---|---|---|---|
| citeseer | 31.25 | 10.31 | 3.03 |
| cora | 31.08 | 10.34 | 3.01 |
| pubmed | 31.38 | 10.59 | 2.96 |
| ppi | 40.28 | 19.89 | 2.03 |
| PROTEINS_full | 31.47 | 10.48 | 3.00 |
| OVCAR-8H | 143.94 | 112.05 | 1.28 |
| Yeast | 131.67 | 100.85 | 1.31 |
| DD | 44.31 | 23.29 | 1.90 |
| YeastH | 231.63 | 184.51 | 1.26 |
| amazon0505 | 69.63 | 118.42 | 0.59 |
| artist | 40.40 | 38.71 | 1.04 |
| com-amazon | 50.67 | 41.60 | 1.22 |
| soc-BlogCatalog | 50.72 | 81.73 | 0.62 |
| amazon0601 | 61.05 | 47.42 | 1.29 |
| **Average** | | | **1.75** |
We will soon update our current code repo to fix this error.
Improper use of CUDA Graph in TC-GNN