
[RVV] add qs8-dwconv support for risc-v #7638

Open · wants to merge 2 commits into master
Conversation

@ken-unger (Author)

  • Added support for qs8-dwconv and qs8-qc8w-dwconv for RVV
  • The generator script can also support qu8, but I've left those kernels out given past comments that qu8 is being deprecated.
  • Tested on qemu and Spacemit K1.


google-cla bot commented Jan 3, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

vint32m${LMUL}_t vacc = __riscv_vle32_v_i32m${LMUL}(w, vl);
w = (const void*) ((uintptr_t) w + vlmax * sizeof(int32_t));

for (int k=0; k<${KERNEL_TILE}; k++) {
Contributor


Code style: there should be spaces around binary operators. Check that all operators have spaces around them.

for (int k = 0; k < ${KERNEL_TILE}; k++) {

Typically we use size_t for most integer counters; I'd check whether other kernels use size_t or int.
Personal nit: ++k is preferred.

for (size_t k = 0; k < ${KERNEL_TILE}; ++k) {

Author


Sounds good, will do.

Contributor


XNNPACK has a .clang-format file, would it suffice just to follow its suggestions?

@fbarchard (Contributor) left a comment

Overall OK. Can you quote a performance number compared to scalar?

I'd prefer to see variations for LMUL so we can test a few against the exact RISC-V code and choose the best.

Although qu8 is being discouraged, it hasn't gone away.
In general QD8 is the preferred format... but we don't have a dwconv for that.
5x5 seems to be rare, appearing only in MobileNet v3; 3x3 is far more common.
On many CPUs, 5x5 would be an issue for register pressure.

@ken-unger (Author)

In terms of performance, are you referring to the qs8-dwconv-bench results? Below is a snippet comparing the mobilenet_v1 subset on a K1 (vlen=256). It is around 10x faster for this subset.

Running ./qs8-dwconv-bench
Run on (8 X 1600 MHz CPU s)
CPU Caches:
L1 Instruction 32 KiB (x8)
L1 Data 32 KiB (x8)
L2 Unified 512 KiB (x2)
Load Average: 2.00, 2.10, 2.07

Benchmark Time CPU Iterations UserCounters...

qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:32/real_time 18568354 ns 18559505 ns 38 OPS=389.121M/s bytes=43.2581M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:64/real_time 9099572 ns 9091252 ns 76 OPS=397.016M/s bytes=110.374M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:128/real_time 17784443 ns 17786151 ns 39 OPS=406.273M/s bytes=45.235M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:128/real_time 4580208 ns 4581827 ns 153 OPS=394.379M/s bytes=109.913M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:256/real_time 8914670 ns 8915580 ns 78 OPS=405.25M/s bytes=45.4011M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:256/real_time 2236056 ns 2236370 ns 313 OPS=403.911M/s bytes=113.686M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:512/real_time 4444486 ns 4444885 ns 158 OPS=406.422M/s bytes=46.6556M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:512/real_time 1118630 ns 1118777 ns 625 OPS=403.694M/s bytes=118.087M/s cpufreq=1.6G

qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:32/real_time 3234607 ns 3238992 ns 216 OPS=2.23376G/s bytes=248.324M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:64/real_time 799176 ns 795711 ns 876 OPS=4.5205G/s bytes=1.25673G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:128/real_time 1479186 ns 1479508 ns 475 OPS=4.88468G/s bytes=543.867M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:128/real_time 408674 ns 411487 ns 1715 OPS=4.42G/s bytes=1.23185G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:256/real_time 743153 ns 743798 ns 940 OPS=4.86127G/s bytes=544.62M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:256/real_time 199850 ns 199890 ns 3501 OPS=4.51924G/s bytes=1.272G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:512/real_time 351208 ns 351501 ns 1991 OPS=5.14321G/s bytes=590.42M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:512/real_time 106398 ns 106365 ns 6580 OPS=4.2443G/s bytes=1.24153G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:7/W:7/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:1024/real_time 210250 ns 210241 ns 3330 OPS=4.29569G/s bytes=540.614M/s cpufreq=1.6G
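The "around 10x" figure can be checked directly from the real_time columns quoted above. The snippet below (my own arithmetic, not part of the PR) pairs each scalar shape with its RVV counterpart; the final RVV row (G:1024) has no scalar counterpart in the snippet and is excluded:

```python
import math

# real_time (ns) from the benchmark output above: scalar vs. RVV,
# for the eight mobilenet_v1 shapes that appear in both lists.
scalar_ns = [18568354, 9099572, 17784443, 4580208,
             8914670, 2236056, 4444486, 1118630]
rvv_ns = [3234607, 799176, 1479186, 408674,
          743153, 199850, 351208, 106398]

# Per-shape speedup of the RVV kernel over scalar.
speedups = [s / r for s, r in zip(scalar_ns, rvv_ns)]
print([round(x, 1) for x in speedups])

# Geometric mean as a single summary figure; it comes out roughly 10x,
# consistent with the claim above.
geomean = math.exp(sum(map(math.log, speedups)) / len(speedups))
print(round(geomean, 1))
```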

@ken-unger (Author)

In general, my experience has been that a higher LMUL is always preferred. I wasn't completely sure about the philosophy on adding the non-production kernels, so I committed just the production ones.

I've also been running within tflite (benchmark_model) to see the net result. For MobileNet v1 on the same K1 platform, running single-threaded, we were previously at 105 ms and are now at 10 ms.

[Node type] [count] [avg ms]
Convolution (NHWC, QC8) DWConv 13 105.745 (old)
Convolution (NHWC, QC8) DWConv 13 10.168 (new)

@ken-unger (Author)

@fbarchard I'll do the coding style fix and also add qu8 support to this PR since that touches mostly the same files.

@ken-unger (Author)

New commit added:

  • fixed the coding style noted.
  • added qu8-dwconv support, so we now have the full suite of qs8-dwconv, qs8-qc8w-dwconv, and qu8-dwconv supported for RVV.
  • removed qs8_dwconv_25p8vc__rvv from bench/qs8-dwconv.cc, as I was encountering heap corruption with that test. However, adding a benchmark case for scalar 25p4c exhibits the same apparent heap corruption. I've filed "Possible heap corruption in qs8-dwconv-bench with primary_tile=25" (#7657) for this issue.
