-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Support gather operation in NCCL backend #1061
[Enhancement] Support gather operation in NCCL backend #1061
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, Thanks for your contribution! We should update the unit test here to verify the modification works as expected.
Besides, PyTorch has already supported gather
in nccl since version 1.11, and we should also take it into account.
add pytorch version condition
mmengine/dist/dist.py
Outdated
torch_dist.gather(data, gather_list, dst, group) | ||
else: | ||
if get_rank(group) == dst: | ||
gather_list = torch.cuda.comm.gather(data, dst, group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, torch.cuda.comm.gather
only supports single-node. Can we use all_gather
to implement it as a workaround?
Hi, @sh0622-kim, you can use |
gather_list = all_gather_list | ||
else: | ||
gather_list = [] | ||
gather_list = all_gather(data, group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all_gather
should be called at all ranks otherwise the program will be blocked. We should only return the gathered list at the main rank, and return an empty list at other ranks.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1061 +/- ##
=======================================
Coverage ? 77.88%
=======================================
Files ? 139
Lines ? 11301
Branches ? 2281
=======================================
Hits ? 8802
Misses ? 2104
Partials ? 395
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
I wanted to help with the work in #916.
Modification
Supports gather operation for NCCL backend.
Checklist