TL/UCP: Add support for active_set broadcast with knomial and dbt for size greater than 2 #926
Can one of the admins verify this patch?
looks great to me! thanks
ok to test
@nsarka What is the use case for enabling this? If the restriction is being removed, why limit it to broadcast only?
This was requested by Almog from the cuBLAS team. I'm not sure of the details beyond that. @Sergei-Lebedev do you know what Almog wanted with this?
The use case is to be able to perform a "broadcast" on row/col comms without having to create these comms.
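To make the row/col use case concrete, here is a minimal sketch of how the rows and columns of a process grid could be described as strided subsets of an existing team, so that no separate communicator is needed. It assumes a row-major rank layout and a subset described by a start rank, a stride, and a size. The type `rank_subset_t` and the helpers `grid_row`, `grid_col`, and `subset_member` are hypothetical names used for illustration only; they are not part of the UCC API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor (illustration only): a strided subset of team ranks. */
typedef struct {
    uint64_t start;  /* first team rank in the subset */
    int64_t  stride; /* distance between consecutive members */
    uint64_t size;   /* number of members */
} rank_subset_t;

/* Row `row` of a grid with `cols` columns, assuming row-major rank layout. */
static rank_subset_t grid_row(uint64_t row, uint64_t cols)
{
    rank_subset_t s = { row * cols, 1, cols };
    return s;
}

/* Column `col` of a rows x cols grid, assuming row-major rank layout. */
static rank_subset_t grid_col(uint64_t col, uint64_t rows, uint64_t cols)
{
    assert(col < cols);
    rank_subset_t s = { col, (int64_t)cols, rows };
    return s;
}

/* Team rank of the i-th member of the subset. */
static uint64_t subset_member(rank_subset_t s, uint64_t i)
{
    assert(i < s.size);
    return (uint64_t)((int64_t)s.start + (int64_t)i * s.stride);
}
```

For a 3 x 4 grid under these assumptions, row 1 covers team ranks 4 through 7 and column 2 covers team ranks 2, 6, and 10.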
@manjugv @Sergei-Lebedev @samnordmann
bot:retest
bot:retest
@nsarka can you please check why gtest failed in CI?
@Sergei-Lebedev it seems like the gtest passed. However, the ucc test failed with
@Sergei-Lebedev, it seems we're facing the same issue (something with containers) on several PRs
@janjust Note that the test_context.global test hangs.
Hi Sergey, in the other PR (#958), @B-a-S fixed and reran the failing pipeline. He posted a link to the new log (http://hpc-master.lab.mtl.com:8080/job/ucc/3282/) which is hanging in the same
@Sergei-Lebedev I ran the hanging gtest manually with my nsarka/active-set-broadcast branch and it passed, which suggests this is a CI issue:
@Sergei-Lebedev Hey Sergey, I updated the PR so that the root is interpreted in terms of the ucc_team only if the active_set flag is set on coll_args. This fixed the hang in tl/mlx5: the ucc_service_coll_bcast used for broadcasting protection domain details hardcodes the root to 0 in terms of the node-level subset, rather than whatever that rank is in the ucc team. tl/sharp and, I think, some others do this too.
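The root handling described above can be sketched as follows. This is an illustration of the idea only, not the code in the PR: `rank_subset_t` and `resolve_root` are hypothetical names, and the subset is assumed to be a strided range of team ranks. When the active-set flag is set, the root is taken as a team rank and mapped to its position inside the subset; otherwise the root is assumed to already be subset-local, which is the case for callers that hardcode root 0 on a node-level subset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor (illustration only): a strided subset of team ranks. */
typedef struct {
    uint64_t start;  /* first team rank in the subset */
    int64_t  stride; /* distance between consecutive members */
    uint64_t size;   /* number of members */
} rank_subset_t;

/*
 * Returns the subset-local index of the root.
 * active_set_flag != 0: `root` is a team rank and must be a subset member.
 * active_set_flag == 0: `root` is already a subset-local index.
 */
static uint64_t resolve_root(uint64_t root, int active_set_flag,
                             rank_subset_t s)
{
    if (!active_set_flag) {
        assert(root < s.size);
        return root;
    }
    int64_t off = (int64_t)root - (int64_t)s.start;
    /* The root must lie exactly on the stride grid of the subset. */
    assert(s.stride != 0 && off % s.stride == 0);
    int64_t idx = off / s.stride;
    assert(idx >= 0 && (uint64_t)idx < s.size);
    return (uint64_t)idx;
}
```

With a subset of team ranks 4, 6, and 8, a root of 8 with the flag set resolves to local index 2, while a root of 0 without the flag stays 0 and refers to team rank 4.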
Active set is a sort of lightweight subcommunicator for TL/UCP and TL/NCCL. It was originally used as a hack to enable point-to-point communication. This PR: