Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: complete the mcast init #900

Merged
merged 1 commit into from
Feb 16, 2024

Conversation

MamziB
Copy link
Collaborator

@MamziB MamziB commented Jan 17, 2024

adding the states regarding the team creation and mcast group join

@MamziB MamziB requested review from samnordmann and Sergei-Lebedev and removed request for samnordmann January 17, 2024 17:28
@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch from 68f1a1a to db07f9b Compare January 17, 2024 17:55
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! about the service bcast, I'm afraid it will use the full team from ctx and not the actual team. Did you try it with a subteam? Im afraid it could hang

src/components/tl/mlx5/tl_mlx5_team.c Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_team.c Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch from db07f9b to 04b4f8d Compare January 24, 2024 18:52
@MamziB
Copy link
Collaborator Author

MamziB commented Jan 24, 2024

Thanks, @samnordmann for the comments. They are all resolved. @Sergei-Lebedev do you have any comments on this PR?

@samnordmann samnordmann self-requested a review January 29, 2024 09:21
@MamziB MamziB self-assigned this Jan 29, 2024
@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch from 04b4f8d to e18769c Compare January 31, 2024 18:02
@MamziB
Copy link
Collaborator Author

MamziB commented Jan 31, 2024

@samnordmann @Sergei-Lebedev Thank you, guys, for the comments. I resolved all of them.

@samnordmann samnordmann self-requested a review February 1, 2024 11:17
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good, thanks!

src/components/tl/mlx5/tl_mlx5_team.c Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch 2 times, most recently from 6f8b89b to 918c2f0 Compare February 2, 2024 01:49
@MamziB
Copy link
Collaborator Author

MamziB commented Feb 2, 2024

@Sergei-Lebedev @samnordmann Thanks guys for the commets. Can you please merge this PR? It has some important fixes that we need for the hpcx release. Thanks

@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch 3 times, most recently from f391233 to 7fe2eb1 Compare February 2, 2024 04:00
@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch 3 times, most recently from ebb6e15 to a663b2c Compare February 4, 2024 23:49
@MamziB
Copy link
Collaborator Author

MamziB commented Feb 4, 2024

@Sergei-Lebedev @samnordmann I have addressed the rest of the comments. Please let me know if you have more comments.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments. Here are some new ones.

I am also trying to run this branch, using two ray nodes and the following command line:

mpirun -x UCC_TL_MLX5_MCAST_ENABLE=1 -x UCC_TL_MLX5_TUNE=inf --map-by ppr:2:node -np 4 test/mpi/ucc_test_mpi -c alltoall -O 1

The tests pass but I am getting the following errors:

tl_mlx5_mcast_coll.c:41   TL_MLX5 ERROR mcast_coll_do_bcast failed:-1
ucc_schedule.h:198  UCC  ERROR failure in task 0x22bea00, Operation is not supported

and

tl_mlx5_mcast_context.c:277  TL_MLX5 ERROR ibv_dealloc_pd failed errno 16

Do you have an idea of what it could be ? Could you look into it?

@samnordmann
Copy link
Collaborator

The recently merged PR #921 introduced a small bug here

ucc_rcache_destroy(self->rcache);

The line cleaning up rcache should be removed. Can you fix that please?

@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch from a663b2c to b52f83c Compare February 15, 2024 18:43
@MamziB
Copy link
Collaborator Author

MamziB commented Feb 15, 2024

@samnordmann @Sergei-Lebedev I did not see any Finalize issue after disabling the mcast flag. So, the PR is now ready. FYI, I tested it on HPCAC on multiple node with mcast flags ON/OFF and it was passed.

@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch from b52f83c to d249b54 Compare February 15, 2024 18:53
@MamziB MamziB force-pushed the mamzi/mcast-merge-6-4 branch from d249b54 to 5c79b72 Compare February 15, 2024 18:55
@MamziB
Copy link
Collaborator Author

MamziB commented Feb 15, 2024

@Sergei-Lebedev I resolved the merge conflict with master

@Sergei-Lebedev Sergei-Lebedev enabled auto-merge (squash) February 16, 2024 11:10
@Sergei-Lebedev Sergei-Lebedev merged commit e13d962 into openucx:master Feb 16, 2024
8 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants