Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP RMA: Fix put-based plugin #167

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Conversation

tvegas1
Copy link
Collaborator

@tvegas1 tvegas1 commented Oct 4, 2024

Fix UCP RMA plugin.

Copy link
Collaborator

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first bunch of comments

Comment on lines 26 to 27
NCCL_PARAM(UCXAckDelay, "UCXPUT_ACK_DELAY", 1);
NCCL_PARAM(UCXAckSkip, "UCXPUT_ACK_SKIP", 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not

Suggested change
NCCL_PARAM(UCXAckDelay, "UCXPUT_ACK_DELAY", 1);
NCCL_PARAM(UCXAckSkip, "UCXPUT_ACK_SKIP", 0);
NCCL_PARAM(UCXAckDelay, "UCX_PUT_ACK_DELAY", 1);
NCCL_PARAM(UCXAckSkip, "UCX_PUT_ACK_SKIP", 0);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

char inflight; /* Chunk still being sent */
char reqs; /* Count request alive */
int sizes[NCCL_UCP_MAX_RECV];
unsigned short id; /* Id of the origin RTR again */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need two ids?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, to check if data has fully arrived (although in this case the struct occupies only 1 cacheline)

}
static ncclResult_t nccl_ucx_rma_connect(int dev, void *listen_handle,
void **send_comm,
ncclNetDeviceHandle_t **sendDevComm) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not ncclNetDeviceHandle_v7_t?

Copy link
Collaborator Author

@tvegas1 tvegas1 Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep main function with latest struct definition (also matches implementation in the ib_plugin.c file).

anyways we have in net_device.h:

 27 typedef ncclNetDeviceHandle_v7_t ncclNetDeviceHandle_v8_t;
 28 typedef ncclNetDeviceHandle_v8_t ncclNetDeviceHandle_t;

w->address = malloc(attr.address_length);
if (w->address == NULL) {
ucp_worker_release_address(w->ucp_worker, attr.address);
return ncclSystemError;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to destroy worker

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

UCXCHECK(ucp_worker_query(w->ucp_worker, &attr));

if (attr.thread_mode != UCS_THREAD_MODE_MULTI) {
INFO(NCCL_NET, "Thread mode multi is not supported");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

req->comm = comm;
req->type = NCCL_UCP_TYPE_ISEND;
req->rtr_id = rtr->id_start;
req->inflight = (UCS_PTR_STATUS(status_ptr) == UCS_INPROGRESS);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
req->inflight = (UCS_PTR_STATUS(status_ptr) == UCS_INPROGRESS);
req->inflight = (UCS_PTR_IS_PTR(status_ptr));

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if (UCS_PTR_IS_ERR(req->st)) {
WARN("NET/UCX_RMA: am_send_nb failed");
return ncclInternalError;
assert(tag != INT_MAX);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use this value to mark entries as consumed later

comm->rem_fifo.tail++;
out:
if ((*request == NULL) && (comm->total == 0)) {
ucp_worker_progress(comm->worker->ucp_worker);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on the transport, it could be that we need to progress to receive the first remote puts.
(could there also be timeouts to service?).

Comment on lines 1035 to 1036
assert((UCS_PTR_STATUS(status_ptr) == UCS_INPROGRESS) ||
(UCS_PTR_STATUS(status_ptr) == UCS_OK));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is already checked 2 lines above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

status = nccl_ucp_shared_put(comm, atp, sizeof(*atp), remote,
&req->inflight);
req->inflight += (status == UCS_INPROGRESS);
rtr->avail -= (status == UCS_INPROGRESS) || (status == UCS_OK);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rtr->avail -= (status == UCS_INPROGRESS) || (status == UCS_OK);
rtr->avail -= !UCS_STATUS_IS_ERR(status);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants