Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/UCP: Add linear alltoall and allgather algorithms based on xgvmi ucp get #992

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

nsarka
Copy link
Collaborator

@nsarka nsarka commented Jun 24, 2024

This PR is a follow up to allreduce sliding window. It adds linear alltoall and allgather algorithms based on XGVMI. They will post ucp gets from host to host in a round robin fashion.

@nsarka nsarka force-pushed the nsarka/allgather_xgvmi branch from 3b8cbf2 to ecbd9e1 Compare June 24, 2024 18:40
@nsarka nsarka force-pushed the nsarka/allgather_xgvmi branch from ecbd9e1 to a362467 Compare June 25, 2024 15:48
@nsarka nsarka marked this pull request as draft June 25, 2024 21:31
@nsarka nsarka changed the title TL/UCP: Add xgvmi allgather TL/UCP: Add linear alltoall and allgather algorithms based on xgvmi ucp get Jun 26, 2024
@nsarka nsarka marked this pull request as ready for review June 26, 2024 17:11
Copy link
Collaborator

@janjust janjust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went over the code with Nick, LGTM

@janjust
Copy link
Collaborator

janjust commented Jun 26, 2024

I don't think we should do this now, but these algorithms, including sliding-window AR will not need the allgather in the init function when #909 is merged

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, thanks! I still need to go voer the main file tl_ucp_dpu_offload.c.

Just a first round of minor review in the meantime

src/components/tl/ucp/allgather/allgather.h Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_xgvmi.c Outdated Show resolved Hide resolved

req_param.op_attr_mask |= UCP_OP_ATTR_FIELD_MEMH;

for (i = *posted; i < host_team_size; i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that posted is not 0 when entering this function?
Would it make sense to put this first loop in a "start" function instead?

Copy link
Collaborator Author

@nsarka nsarka Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the first entry *posted will be 0, but after that it will be equal to host_team_size. It's possible to make a standalone start function, but I thought it would be less code to reuse ucc_tl_ucp_dpu_xgvmi_rdma_task_post for both alltoall and allgather, even though it isn't that long (10 lines)


ucp_worker_progress(tl_ctx->worker.ucp_worker);

for (i = *completed; i < *posted; i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here *posted is necessarily equal to host_team_size right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, when it gets to this point all the gets will be posted. I think *posted might be clearer, just because we want *posted == *completed at the end. What do you think?

src/components/tl/ucp/alltoall/alltoall.h Show resolved Hide resolved
src/components/tl/ucp/tl_ucp_coll.h Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp_dpu_offload.h Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp_dpu_offload.h Outdated Show resolved Hide resolved
@samnordmann
Copy link
Collaborator

Can you update the tests as well?

@nsarka nsarka force-pushed the nsarka/allgather_xgvmi branch from 8e17a30 to 419a128 Compare July 8, 2024 16:52
@nsarka
Copy link
Collaborator Author

nsarka commented Jul 8, 2024

Can you update the tests as well?

I updated the gtests to test the linear allgather/alltoall

@nsarka nsarka force-pushed the nsarka/allgather_xgvmi branch from 873b576 to ea42acf Compare July 29, 2024 14:07
@nsarka nsarka force-pushed the nsarka/allgather_xgvmi branch from ea42acf to 925c3bb Compare August 20, 2024 16:21
@swx-jenkins3
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants