[QST]: How to run multi GPU cugraph leiden for a large graph #4884
Comments
"I think you're going to need a bigger boat." :-) I don't have exact measurements here, so I'm going to do some hand-wavy math.

We did scale testing of Louvain (which Leiden builds upon) from C++, without the Python dask/cuDF overhead. During that scale testing on A100s with 80 GB of memory, we needed 8 GPUs to first create the graph and then execute the algorithm. Leiden adds some additional memory pressure, but I have not measured that at scale. Let's assume an additional 10% for the memory to compute the refinement and to store both the Leiden and Louvain clustering results during intermediate processing. That pushes us to a hand-wavy 9 GPUs.

When you run this from Python, as you are above, the dask/cudf and dask/cugraph layers also use GPU memory. The Python approach above uses GPU memory while creating the persisted DataFrame and then holds that memory in the DataFrame itself. cugraph is not allowed to delete the DataFrame memory when creating the graph, so this adds at least 70 GB to the required footprint.

Again, some hand-wavy math: I would think you need a minimum of 10 GPUs; 12 GPUs would be safer; 16 GPUs would give you more margin for error in my hand-waviness and slightly better performance, because you'd have more balance across the nodes. If I'm off in my projections, it's probably because I've forgotten something, so more GPUs is probably better than fewer. I suspect that if you push to 6 or 8 GPUs you'll be able to create the graph but will fail in Leiden.

We don't currently have a memory-spilling capability within cugraph. Generally, graph algorithms perform poorly with data stored outside of GPU memory, because their memory accesses have poor locality. We have a number of options we are pursuing in the long run, but at the moment, if you run out of memory, you need to run with more GPUs.
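The hand-wavy arithmetic above can be written out explicitly. A minimal sketch, where the 8-GPU baseline, the 10% Leiden overhead, and the 70 GB DataFrame footprint are taken from the comment and everything else is an assumption:

```python
import math

GPU_MEM_GB = 80      # A100 with 80 GB of memory
GRAPH_ALG_GPUS = 8   # GPUs needed for graph creation + Louvain in the C++ scale test
LEIDEN_EXTRA = 0.10  # assumed extra for refinement and storing both clustering results
DASK_DF_GB = 70      # persisted dask_cudf DataFrame that cugraph is not allowed to free

graph_and_alg_gb = GRAPH_ALG_GPUS * GPU_MEM_GB                 # 640 GB
total_gb = graph_and_alg_gb * (1 + LEIDEN_EXTRA) + DASK_DF_GB  # 774 GB
min_gpus = math.ceil(total_gb / GPU_MEM_GB)

print(min_gpus)  # → 10, matching the "minimum of 10 GPUs" estimate
```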
Using something like managed memory (which is frequently only a linear slowdown for applications with better memory access patterns) typically results in terrible thrashing of the memory system here. Specifically, with managed memory the system has to migrate an entire page from host memory to GPU memory, and we might access only one 8-byte value from that page before it gets evicted. Because of that, I don't really have a better recommendation.
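The page-granularity waste described above is easy to quantify. A toy calculation, assuming a 2 MiB migration granularity (the actual unified-memory page/migration size varies by system and driver):

```python
PAGE_BYTES = 2 * 1024 * 1024  # assumed migration granularity; real UVM systems vary
VALUE_BYTES = 8               # the single 8-byte value actually read from the page

useful_fraction = VALUE_BYTES / PAGE_BYTES
print(f"useful bytes per migrated page: {useful_fraction:.2e}")
```

Under these assumptions, fewer than four millionths of the migrated bytes are ever used, which is why scattered graph accesses thrash a managed-memory system.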
Thanks @ChuckHastings, that's what I thought. But realistically, with the advances in single-cell sequencing and spatial transcriptomics, the problem is only going to get worse. Right now we have a 120-million-cell dataset that's already sequenced, and we cannot possibly run the standard pipelines without an insanely expensive GPU node with 50-100 GPUs, based on what you suggested above. The dataset above, with 3.7 billion edges, was a KNN graph of only 50 million cells. Since the field relies heavily on graph-based clustering algorithms, can we make it a priority to scale these algorithms to a reasonable amount of resources (4-8 A100s)? As a matter of fact, even that is out of reach for many biomedical institutions that rely on grant funding.
We will consider this as we identify our priorities. We meet in March to discuss our priorities for the next year; I can update the issue after we have that discussion.
What is your question?
Hey, I am trying to run multi-GPU Leiden clustering on a large graph with 3.7 billion edges. I have 4 NVIDIA A100 GPUs with 80 GB of VRAM each. I am running into memory issues, and I was wondering if you have any suggestions on how to handle such a large graph. Here's my code:
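(The original snippet was not preserved in this extract. As a hedged sketch only: a multi-GPU Leiden run on cugraph's dask API typically has roughly this shape. The Parquet path and the src/dst/wgt column names are hypothetical, and the exact function names and signatures should be checked against the cugraph documentation for your version.)

```python
def run_mg_leiden(edge_list_path):
    """Sketch of a multi-GPU cugraph Leiden run; NOT the original code.

    Imports live inside the function because they require a GPU/RAPIDS
    environment to succeed.
    """
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask_cudf
    import cugraph
    import cugraph.dask as dask_cugraph
    from cugraph.dask.comms import comms as Comms

    # One Dask worker per visible GPU, plus cugraph's NCCL comms.
    cluster = LocalCUDACluster()
    client = Client(cluster)
    Comms.initialize(p2p=True)

    # Hypothetical edge list with columns src, dst, wgt.
    ddf = dask_cudf.read_parquet(edge_list_path)

    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst", edge_attr="wgt")

    parts, modularity = dask_cugraph.leiden(G)

    Comms.destroy()
    client.close()
    cluster.close()
    return parts, modularity
```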
Here's the error message I get: