TL/UCP: Add Sliding Window allreduce implementation #958
Can one of the admins verify this patch?
ok to test
@Sergei-Lebedev It seems like the same ucc test is failing as the active_set. However, this one fails because of wrong CUDA versions:
@artemry-nv It seems we have CI issues on this PR that are unrelated to its changes.
@B-a-S please take a look
I've rerun the job. Take a look: http://hpc-master.lab.mtl.com:8080/job/ucc/3282/
The implementation looks nice and readable, thanks! I left some comments, most importantly about memory leaks.
The ucc gtest failed on
I updated the PR. The get/reduce/put phase and the barrier part of the algorithm are now run via a schedule. I left the allgather phase as it was, inside the get/reduce/put phase, because once Ferrol's PR goes in I will remove the allgather and use that for key exchange instead. Also, I moved some code to two new files,
Please wait to review; there are some failures I should fix first.
bot:retest
@samnordmann The PR is ready for review, thank you. Please note that the allgather task is still part of the algorithm. Once Ferrol's PR goes in, I will convert the code to use that for the import/allgather phase of the algorithm. I also left the reduction as part of the algorithm, since separating it would involve splitting the gets and puts into tasks of their own as well. There would be thousands of these tasks for large message sizes.
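To put a rough number on "thousands of these tasks", here is a back-of-the-envelope sketch. It assumes (this is not stated in the thread) that each rank reduces a 1/team_size slice of the message in fixed-size chunks, and that every chunk needs one get from each peer and one put to each peer. The function name `sw_ops_per_rank` and all parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Estimate how many get/put operations one rank would issue if every
 * chunk transfer became its own task. Hypothetical model, see assumptions
 * in the text above: slice ownership per rank, fixed-size chunks, and one
 * get plus one put per peer per chunk. */
size_t sw_ops_per_rank(size_t msg_bytes, size_t team_size, size_t chunk_bytes)
{
    size_t slice, chunks;

    assert(team_size > 0 && chunk_bytes > 0);
    /* bytes this rank is responsible for reducing (rounded up) */
    slice  = (msg_bytes + team_size - 1) / team_size;
    /* number of chunks needed to cover that slice (rounded up) */
    chunks = (slice + chunk_bytes - 1) / chunk_bytes;
    /* one get and one put per peer for every chunk */
    return chunks * team_size * 2;
}
```

Under these assumptions, a 1 GiB message on 8 ranks with 1 MiB chunks gives 128 chunks per rank and 2048 get/put operations per rank, which is in line with the order of magnitude mentioned above.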
LGTM! Thanks for addressing the comments. I have left a couple of new minor comments.
Follow-up changes:
- Use single allocation instead of multiple mallocs
- Get rid of explicit ucp request objects and use callbacks
- Optimize ucp_worker calls
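As an illustration of the first item, a minimal sketch of carving several arrays out of one allocation, so that cleanup is a single free() and no partial-failure path can leak. The struct `sw_state_t`, its fields, and `sw_state_create` are hypothetical names for illustration, not taken from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-task state; field names are illustrative only. */
typedef struct sw_state {
    size_t   n_slots;   /* number of staging buffers */
    size_t   slot_size; /* size of each staging buffer in bytes */
    void   **bufs;      /* bufs[i] points at the i-th staging buffer */
    size_t  *counts;    /* per-slot element counts, zero-initialized */
} sw_state_t;

/* Round off up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/* Allocate the struct, both arrays, and the buffer storage with one
 * calloc() call. Returns NULL on failure; release with one free(). */
sw_state_t *sw_state_create(size_t n_slots, size_t slot_size)
{
    size_t bufs_off   = align_up(sizeof(sw_state_t), _Alignof(void *));
    size_t counts_off = align_up(bufs_off + n_slots * sizeof(void *),
                                 _Alignof(size_t));
    size_t data_off   = align_up(counts_off + n_slots * sizeof(size_t),
                                 _Alignof(max_align_t));
    uint8_t    *base;
    sw_state_t *s;
    size_t      i;

    assert(n_slots > 0);
    base = calloc(1, data_off + n_slots * slot_size);
    if (base == NULL) {
        return NULL;
    }
    s            = (sw_state_t *)base;
    s->n_slots   = n_slots;
    s->slot_size = slot_size;
    s->bufs      = (void **)(base + bufs_off);
    s->counts    = (size_t *)(base + counts_off);
    for (i = 0; i < n_slots; i++) {
        /* slots are contiguous; only slot 0 is guaranteed max-aligned
         * unless slot_size is itself a multiple of that alignment */
        s->bufs[i] = base + data_off + i * slot_size;
    }
    return s;
}
```

A caller would pair `sw_state_create(n, size)` with a single `free(state)`. The sketch omits overflow checks on the size arithmetic, which real code would want.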
@Sergei-Lebedev When you say follow-up changes, do you mean a separate PR after this one is merged? If so, I'll open an issue and tag this PR so we don't forget.
@Sergei-Lebedev @samnordmann The PR is ready to be merged.
This PR is the second in a series of two that add Sliding Window allreduce to UCC. It implements the function stubs left by the first PR (#902).
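For readers unfamiliar with the get/reduce/put phase discussed in the thread, here is a single-process sketch of the data flow. It is a simulation under assumptions, not the PR's code: ranks are rows of a flat array, memcpy() stands in for one-sided get and put, each rank is assumed to own a contiguous slice of the vector, and the loop is sequential, so the "window" collapses to one reused staging buffer (a real implementation would keep several gets in flight). The name `sim_sw_allreduce` and the constant `SIM_MAX_CHUNK` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAX_CHUNK 64 /* capacity of the staging buffers, in elements */

/* Simulated sum-allreduce over n_ranks buffers of count ints each.
 * src + r * count is rank r's source buffer; dst + r * count is its
 * destination buffer. memcpy() stands in for one-sided get/put. */
void sim_sw_allreduce(size_t n_ranks, size_t count, size_t chunk,
                      const int *src, int *dst)
{
    int    get_buf[SIM_MAX_CHUNK]; /* staging buffer filled by each "get" */
    int    acc[SIM_MAX_CHUNK];     /* running sum for the current chunk */
    size_t per_rank, owner, peer, begin, end, off, n, i;

    assert(n_ranks > 0 && chunk > 0 && chunk <= SIM_MAX_CHUNK);
    per_rank = (count + n_ranks - 1) / n_ranks;

    for (owner = 0; owner < n_ranks; owner++) {
        /* slice of the vector that this rank is assumed to reduce */
        begin = owner * per_rank;
        end   = (begin + per_rank < count) ? begin + per_rank : count;
        for (off = begin; off < end; off += chunk) {
            n = (end - off < chunk) ? end - off : chunk;
            memset(acc, 0, n * sizeof(int));
            for (peer = 0; peer < n_ranks; peer++) {
                /* get: fetch the chunk from the peer's source buffer */
                memcpy(get_buf, src + peer * count + off, n * sizeof(int));
                /* reduce: accumulate into the running sum */
                for (i = 0; i < n; i++) {
                    acc[i] += get_buf[i];
                }
            }
            for (peer = 0; peer < n_ranks; peer++) {
                /* put: write the reduced chunk to every destination */
                memcpy(dst + peer * count + off, acc, n * sizeof(int));
            }
        }
    }
}
```

After the call, every rank's destination row holds the element-wise sum of all source rows. The key exchange (allgather) and barrier steps mentioned in the conversation are outside the scope of this sketch.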