Completion error on UD server with SRQ #308

hginjgerx opened this issue Jan 16, 2025 · 0 comments
Hi all,
I was running a simple UD test using SRQ on our hns device:

ib_send_bw -d hns_1 -c UD -s 4096 --use-srq
ib_send_bw -d hns_1 -c UD -s 4096 --use-srq 127.0.0.1

But a completion error occurred on the server side:
image

I looked into this error and added some debug prints as follows:

diff --git a/src/perftest_resources.c b/src/perftest_resources.c
index 814ffdb..b1d016d 100755
--- a/src/perftest_resources.c
+++ b/src/perftest_resources.c
@@ -1705,6 +1705,7 @@ int create_single_mr(struct pingpong_context *ctx, struct perftest_parameters *u
 #endif
        {
                ctx->mr[qp_index] = ibv_reg_mr(ctx->pd, ctx->buf[qp_index], ctx->buff_size, flags);
+               printf("buf addr - %p, size - %lu\n", ctx->buf[qp_index], ctx->buff_size);
                if (!ctx->mr[qp_index]) {
                        fprintf(stderr, "Couldn't allocate MR\n");
                        return FAILURE;
@@ -3249,6 +3250,9 @@ int ctx_set_recv_wqes(struct pingpong_context *ctx,struct perftest_parameters *u
 
                        ctx->rwr[i * user_param->recv_post_list + j].sg_list = &ctx->recv_sge_list[i * user_param->recv_post_list + j];
                        ctx->rwr[i * user_param->recv_post_list + j].num_sge = MAX_RECV_SGE;
+                       printf("recv sge addr - %p, sge len - %lu\n",
+                               ctx->rwr[i * user_param->recv_post_list + j].sg_list->addr,
+                               ctx->rwr[i * user_param->recv_post_list + j].sg_list->length);
                        ctx->rwr[i * user_param->recv_post_list + j].wr_id   = i;
 
                        if (j == (user_param->recv_post_list - 1))

And the result was:
image

The output shows that the SGE length was expanded from 4096 + 40 = 4136 bytes (the size I specified in the ib_send_bw command plus the 40-byte GRH) to 8192 bytes. As a result, the end of the SGE's address range (0x39a4c018 + 8192 = 0x39a4e018) lies beyond the end of the MR (0x39a4b000 + 8256 = 0x39a4d040), which leads to the completion error.
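The range arithmetic above can be reproduced with a small stand-alone check. This is only a sketch: `sge_within_mr`, `sge_overrun` and `check_reported_values` are hypothetical helpers written for illustration, not part of perftest or libibverbs, and the constants are simply the values quoted in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a perftest function): true when the byte range
 * [addr, addr + len) lies entirely inside [mr_addr, mr_addr + mr_len). */
static bool sge_within_mr(uint64_t addr, uint64_t len,
                          uint64_t mr_addr, uint64_t mr_len)
{
    if (addr < mr_addr)
        return false;
    uint64_t off = addr - mr_addr;          /* offset of the SGE inside the MR */
    return off <= mr_len && len <= mr_len - off;
}

/* Hypothetical helper: number of bytes by which the SGE runs past the end
 * of the MR, or 0 when it fits. Assumes addr >= mr_addr. */
static uint64_t sge_overrun(uint64_t addr, uint64_t len,
                            uint64_t mr_addr, uint64_t mr_len)
{
    if (sge_within_mr(addr, len, mr_addr, mr_len))
        return 0;
    return (addr - mr_addr) + len - mr_len;
}

/* Plugs in the addresses and sizes quoted in this report. */
static void check_reported_values(void)
{
    const uint64_t mr_addr  = 0x39a4b000ULL;  /* buffer address passed to ibv_reg_mr */
    const uint64_t mr_len   = 8256;           /* registered size */
    const uint64_t sge_addr = 0x39a4c018ULL;  /* receive SGE address */

    /* The expanded 8192-byte SGE does not fit inside the MR. */
    assert(!sge_within_mr(sge_addr, 8192, mr_addr, mr_len));
    /* It ends 0xfd8 (4056) bytes past the end of the MR. */
    assert(sge_overrun(sge_addr, 8192, mr_addr, mr_len) == 0xfd8);
    /* The unexpanded length of 4096 + 40 bytes does fit. */
    assert(sge_within_mr(sge_addr, 4096 + 40, mr_addr, mr_len));
}
```

With these numbers the SGE starts 0x1018 (4120) bytes into the MR, so the unexpanded 4136-byte length ends exactly at the MR boundary, while the 8192-byte length overruns it by 4056 bytes.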

I found that the expansion of the SGE length was introduced by commit f226ca2, which specifically targets UD receive WRs when SRQ is in use. It seems this commit breaks this UD-SRQ test case.
