TL/MLX5: complete the mcast init #900
Force-pushed from 68f1a1a to db07f9b.
Looks good! About the service bcast: I'm afraid it will use the full team from the ctx and not the actual team. Did you try it with a subteam? I'm afraid it could hang.
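To illustrate the concern, here is a toy simulation, not UCC code: a collective sized for the full context team stalls when only the ranks of a subteam enter it. The function name `run_collective` and the use of a thread barrier as a stand-in for the service bcast are assumptions made purely for illustration.

```python
import threading


def run_collective(expected_ranks, participating_ranks, timeout=0.5):
    """Toy model of a collective: returns True if every participant completes.

    expected_ranks      -- ranks the collective is sized for (e.g. full ctx team)
    participating_ranks -- ranks that actually call it (e.g. the subteam)
    """
    barrier = threading.Barrier(len(expected_ranks))
    completed = []
    lock = threading.Lock()

    def participant(rank):
        try:
            # Blocks until len(expected_ranks) callers arrive or the timeout expires.
            barrier.wait(timeout=timeout)
            with lock:
                completed.append(rank)
        except threading.BrokenBarrierError:
            # The timeout stands in for what would be a hang in a real run.
            pass

    threads = [threading.Thread(target=participant, args=(r,))
               for r in participating_ranks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(completed) == sorted(participating_ranks)


if __name__ == "__main__":
    full_team = [0, 1, 2, 3]
    subteam = [0, 1]
    print("sized for full team:", run_collective(full_team, subteam))
    print("sized for subteam:  ", run_collective(subteam, subteam))
```

In this model the collective completes only when it is sized for the ranks that actually call it, which is why testing with a subteam matters.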
Force-pushed from db07f9b to 04b4f8d.
Thanks, @samnordmann, for the comments. They are all resolved. @Sergei-Lebedev, do you have any comments on this PR?
Force-pushed from 04b4f8d to e18769c.
@samnordmann @Sergei-Lebedev Thank you, guys, for the comments. I resolved all of them.
Looks very good, thanks!
Force-pushed from 6f8b89b to 918c2f0.
@Sergei-Lebedev @samnordmann Thanks, guys, for the comments. Can you please merge this PR? It has some important fixes that we need for the hpcx release. Thanks.
Force-pushed from f391233 to 7fe2eb1.
Force-pushed from ebb6e15 to a663b2c.
@Sergei-Lebedev @samnordmann I have addressed the rest of the comments. Please let me know if you have more comments.
Thanks for addressing the comments. Here are some new ones.
I am also trying to run this branch, using two ray nodes and the following command line:
mpirun -x UCC_TL_MLX5_MCAST_ENABLE=1 -x UCC_TL_MLX5_TUNE=inf --map-by ppr:2:node -np 4 test/mpi/ucc_test_mpi -c alltoall -O 1
The tests pass but I am getting the following errors:
tl_mlx5_mcast_coll.c:41 TL_MLX5 ERROR mcast_coll_do_bcast failed:-1
ucc_schedule.h:198 UCC ERROR failure in task 0x22bea00, Operation is not supported
and
tl_mlx5_mcast_context.c:277 TL_MLX5 ERROR ibv_dealloc_pd failed errno 16
Do you have any idea what it could be? Could you look into it?
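As a starting point for the second error: on Linux, errno 16 is `EBUSY`, and the libibverbs documentation notes that `ibv_dealloc_pd` may fail if another resource is still associated with the PD being freed. A minimal sketch for decoding the value follows; the helper name `describe_errno` is hypothetical, and the message text depends on the platform's C library.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 16 is the value reported next to ibv_dealloc_pd in the log above.
    name, message = describe_errno(16)
    print(f"errno 16: {name} ({message})")
```

If the value is indeed `EBUSY`, one plausible direction is to check whether some object created on that PD is still alive at context cleanup; this is a guess from the error code, not a confirmed diagnosis.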
The recently merged PR #921 introduced a small bug here
The line cleaning up
Force-pushed from a663b2c to b52f83c.
@samnordmann @Sergei-Lebedev I did not see any Finalize issue after disabling the mcast flag, so the PR is now ready. FYI, I tested it on HPCAC on multiple nodes with the mcast flag ON and OFF, and it passed.
Force-pushed from b52f83c to d249b54.
Force-pushed from d249b54 to 5c79b72.
@Sergei-Lebedev I resolved the merge conflict with master.
Force-pushed from 5c79b72 to f679e46.
adding the states regarding the team creation and mcast group join