TL/UCP: Allgather Bruck algorithm #898

Merged: 14 commits, Feb 26, 2024


ikryukov
Collaborator

What

Implementation of the Bruck algorithm for the allgather collective.

Why ?

This algorithm completes in O(log(N)) communication steps and performs well on small (1-2 KB) messages (according to research: https://arxiv.org/pdf/2109.08751.pdf)
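
For readers unfamiliar with the algorithm, here is a minimal single-process sketch of the textbook Bruck allgather. It is not the code from this PR: the function name `bruck_allgather` and the list-based "ranks" are assumptions made for illustration only. In each round, every rank receives blocks from the rank `dist` positions ahead of it, `dist` doubles, and a final local rotation puts the blocks in rank order.

```python
def bruck_allgather(blocks):
    """Simulate a textbook Bruck allgather; blocks[r] is rank r's input.

    Returns (per-rank gathered lists, number of communication rounds).
    Illustrative sketch only, not the UCC TL/UCP implementation.
    """
    n = len(blocks)
    # Each rank starts with only its own block at local position 0.
    bufs = [[blocks[r]] for r in range(n)]
    rounds = 0
    dist = 1
    while dist < n:
        # The last round may need fewer than `dist` blocks.
        count = min(dist, n - dist)
        # Snapshot the outgoing data so all exchanges in a round are simultaneous.
        outgoing = [bufs[r][:count] for r in range(n)]
        for r in range(n):
            src = (r + dist) % n  # rank r receives from rank r + dist
            bufs[r].extend(outgoing[src])
        dist *= 2
        rounds += 1
    # Rank r now holds blocks r, r+1, ..., r+n-1 (mod n); rotate into rank order.
    result = [buf[n - r:] + buf[:n - r] for r, buf in enumerate(bufs)]
    return result, rounds


if __name__ == "__main__":
    gathered, rounds = bruck_allgather(list(range(13)))
    print(rounds, gathered[5])
```

With 13 ranks the loop runs for distances 1, 2, 4 and 8, so the number of rounds grows as the base-2 logarithm of the rank count, rounded up, rather than linearly as in a ring allgather.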

@swx-jenkins3

Can one of the admins verify this patch?

@ikryukov
Collaborator Author

ikryukov commented Feb 3, 2024

Test command:
mpirun -x UCC_TL_UCP_TUNE=allgather:0-inf:@3 --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/home/ikryukov/work/ucc/install/lib:/home/ikryukov/work/ucx/install/lib:$LD_LIBRARY_PATH -x UCC_CLS=basic -x UCC_TLS=ucp -x UCC_CL_BASIC_TLS=ucp -x UCC_LOG_LEVEL=info -np 16 ./install/bin/ucc_test_mpi -c allgather -O 0 -v
Perf test:
mpirun -x UCC_TL_UCP_TUNE=allgather:0-inf:@3 --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/home/ikryukov/work/ucc/install/lib:/home/ikryukov/work/ucx/install/lib:$LD_LIBRARY_PATH -x UCC_CLS=basic -x UCC_TLS=ucp -x UCC_CL_BASIC_TLS=ucp -np 13 ./install/bin/ucc_perftest -c allgather -F -b 1 -e 4k

@Sergei-Lebedev
Contributor

ok to test

@Sergei-Lebedev
Contributor

bot:retest

@ikryukov ikryukov marked this pull request as ready for review February 6, 2024 11:35
@Sergei-Lebedev
Contributor

CI issue seems to be relevant

13:40:50  [ RUN      ] test_allgather_alg.alg/int8_Cuda_count_1_inplace_1_bruck
13:40:50  [swx-clx01:196  :0:196] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc5b7600000)
13:40:50  ==== backtrace (tid:    196) ====
13:40:50   0  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc5fe76b564]
13:40:50   1  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7fc5fe76b75f]
13:40:50   2  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7fc5fe76ba46]
13:40:50   3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc5fdfee520]
13:40:50   4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a7e94) [0x7fc5fe153e94]
13:40:50   5  /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_mc_cpu.so(+0x137d) [0x7fc5fc1cc37d]
13:40:50   6  /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allgather_bruck_progress+0x10fe) [0x7fc5eea08a5e]
13:40:50   7  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(+0x12bfb) [0x7fc5fe710bfb]
13:40:50   8  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(ucc_context_progress+0x3e) [0x7fc5fe70b3fe]
13:40:50   9  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x556f80) [0x56533c3ecf80]
13:40:50  10  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x8d46db) [0x56533c76a6db]
13:40:50  11  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5549d1) [0x56533c3ea9d1]
13:40:50  12  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x547caa) [0x56533c3ddcaa]
13:40:50  13  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x548322) [0x56533c3de322]
13:40:50  14  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5485ae) [0x56533c3de5ae]
13:40:50  15  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549689) [0x56533c3df689]
13:40:50  16  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549b88) [0x56533c3dfb88]
13:40:50  17  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x50fe65) [0x56533c3a5e65]
13:40:50  18  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc5fdfd5d90]
13:40:50  19  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc5fdfd5e40]
13:40:50  20  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5216d5) [0x56533c3b76d5]
13:40:50  =================================
13:40:50  make[1]: Leaving directory '/opt/nvidia/src/ucc/build/test/gtest'
13:40:50  make[1]: *** [Makefile:1960: test] Segmentation fault (core dumped)
13:40:50  make: *** [Makefile:995: gtest] Error 2

@Sergei-Lebedev
Contributor

bot:retest

@Sergei-Lebedev
Contributor

22:16:15  [ RUN      ] test_allgather_alg.alg/int8_Cuda_count_1_inplace_1_bruck
22:16:15  [swx-clx01:402  :0:402] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f2716800002)
22:16:15  ==== backtrace (tid:    402) ====
22:16:15   0  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f275f5d3564]
22:16:15   1  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7f275f5d375f]
22:16:15   2  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7f275f5d3a46]
22:16:15   3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f275edee520]
22:16:15   4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a7e94) [0x7f275ef53e94]
22:16:15   5  /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allgather_bruck_progress+0x1133) [0x7f275cfe8a93]
22:16:15   6  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(+0x12bfb) [0x7f275f578bfb]
22:16:15   7  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(ucc_context_progress+0x3e) [0x7f275f5733fe]
22:16:15   8  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x556f80) [0x564795aa4f80]
22:16:15   9  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x8d46db) [0x564795e226db]
22:16:15  10  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5549d1) [0x564795aa29d1]
22:16:15  11  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x547caa) [0x564795a95caa]
22:16:15  12  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x548322) [0x564795a96322]
22:16:15  13  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5485ae) [0x564795a965ae]
22:16:15  14  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549689) [0x564795a97689]
22:16:15  15  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549b88) [0x564795a97b88]
22:16:15  16  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x50fe65) [0x564795a5de65]
22:16:15  17  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f275edd5d90]
22:16:15  18  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f275edd5e40]
22:16:15  19  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5216d5) [0x564795a6f6d5]

@ikryukov ikryukov force-pushed the allgather_bruck branch 2 times, most recently from 76b3936 to b7fb9f1 Compare February 12, 2024 11:16
@ikryukov
Collaborator Author

bot:retest

@Sergei-Lebedev
Contributor

bot:retest

Collaborator

@samnordmann samnordmann left a comment

Looks good to me! Thanks for the very clean PR

@Sergei-Lebedev Sergei-Lebedev merged commit 7930478 into openucx:master Feb 26, 2024
11 checks passed