TL/MLX5: adding mcast allgather staging based algo #994
I haven't had time to review everything yet, but here is a first quick round of review. I see several natural ways to make the PR smaller. More than 1k lines is a large PR, so please try to break it down wherever possible.
The CI has relevant issues.
Hi @samnordmann, thanks for the constructive comments. I have addressed them in a new commit (the second commit). Please take a look and let me know your thoughts.
src/components/tl/mlx5/mcast/tl_mlx5_mcast_one_sided_progress.c
Thanks for addressing the comment! Here is another round.
Thanks @samnordmann for today's comments. I have added a new, separate commit. Please take a look. Note that I have not yet cut anything from this PR to make it shorter.
Thanks for addressing the comments.
#994 (comment)
711f798 to 61b99b6
13e6845 to 75576ee
Thank you Sergey and Sam. All the comments have been resolved.
@Sergei-Lebedev Can you also look at this PR and let me know if you have more comments? Thank you, Sergey!
@Sergei-Lebedev Can you please let me know if you have more comments on this, so that we can continue implementing the rest of the design? This PR is a blocker for future PRs. Thank you!
Ping @wfaderhold21. Can you please review?
@wfaderhold21 Can we get this PR reviewed?
Looks good. Minor comments...
src/components/tl/mlx5/mcast/tl_mlx5_mcast_one_sided_progress.c
Thanks @wfaderhold21 and @Sergei-Lebedev for the comments. I have pushed the changes.
e7a7bcd to 39f3069
39f3069 to 30749be
edb71ae to c584457
c584457 to 867f828
Hi @Sergei-Lebedev, thank you for your constructive comments. I have addressed all of them.
867f828 to 6e85c33
Thanks @MamziB
What
Add the algorithm for multicast-based (mcast) allgather.
Why
Scalability and performance improvement over software-based allgather.
How?
Allgather is realized as N concurrent Bcast operations, where N is the team size and every process acts as a root. The implementation reuses the previously merged one-sided design for reliability.
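The idea of building an allgather out of N broadcasts can be sketched in plain C as follows. This is only an illustrative single-process simulation, not the code in this PR: the function names `sim_mcast_bcast` and `sim_allgather_as_bcasts` are hypothetical, the broadcasts run one after another rather than concurrently, and the staging buffers and one-sided reliability protocol are not modeled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one multicast Bcast: the root's block is
 * delivered into slot `root` of every rank's receive buffer. */
static void sim_mcast_bcast(const char *root_block, size_t block_len,
                            int root, char **recv_bufs, int team_size)
{
    for (int rank = 0; rank < team_size; rank++) {
        memcpy(recv_bufs[rank] + (size_t)root * block_len,
               root_block, block_len);
    }
}

/* Allgather expressed as team_size broadcasts, one per root.
 * In a real implementation these would be in flight concurrently;
 * here they are issued sequentially for simplicity. */
static void sim_allgather_as_bcasts(char **send_bufs, char **recv_bufs,
                                    size_t block_len, int team_size)
{
    for (int root = 0; root < team_size; root++) {
        sim_mcast_bcast(send_bufs[root], block_len, root,
                        recv_bufs, team_size);
    }
}
```

After the call, every rank's receive buffer holds the blocks of all ranks in rank order, which is exactly the allgather result.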