
[RVV] add qs8-dwconv support for risc-v #7638

Open · wants to merge 2 commits into master
Conversation

@ken-unger (Author)

  • Added support for qs8-dwconv and qs8-qc8w-dwconv for RVV
  • The generator script can also support qu8, but I've left those kernels out given past comments that qu8 is being deprecated.
  • Tested on qemu and Spacemit K1.


google-cla bot commented Jan 3, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

vint32m${LMUL}_t vacc = __riscv_vle32_v_i32m${LMUL}(w, vl);
w = (const void*) ((uintptr_t) w + vlmax * sizeof(int32_t));

for (int k=0; k<${KERNEL_TILE}; k++) {
Contributor


Code style: there should be spaces around binary operators. Check that all operators have spaces around them.

for (int k = 0; k < ${KERNEL_TILE}; k++) {

Typically we use size_t for most integer counters; I'd check whether other kernels use size_t or int.
Personal nit: ++k is preferred.

for (size_t k = 0; k < ${KERNEL_TILE}; ++k) {

Author


Sounds good, will do.

Contributor


XNNPACK has a .clang-format file, would it suffice just to follow its suggestions?

@fbarchard (Contributor) left a comment

Overall OK. Can you quote a performance number compared to scalar?

I'd prefer to see variations for LMUL so we can test a few against the exact RISC-V code and choose the best.

Although qu8 is being discouraged, it hasn't gone away.
In general QD8 is the preferred format... but we don't have a dwconv for that.
5x5 seems to be rare, appearing only in MobileNet v3; 3x3 is far more common.
On many CPUs, 5x5 would be an issue for register pressure.

@ken-unger (Author)

In terms of performance, are you referring to the qs8-dwconv-bench results? Below is a snippet comparing the mobilenet_v1 subset on a K1 (vlen=256). It is around 10x faster for this subset.

Running ./qs8-dwconv-bench
Run on (8 X 1600 MHz CPU s)
CPU Caches:
L1 Instruction 32 KiB (x8)
L1 Data 32 KiB (x8)
L2 Unified 512 KiB (x2)
Load Average: 2.00, 2.10, 2.07

Benchmark Time CPU Iterations UserCounters...

qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:32/real_time 18568354 ns 18559505 ns 38 OPS=389.121M/s bytes=43.2581M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:64/real_time 9099572 ns 9091252 ns 76 OPS=397.016M/s bytes=110.374M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:128/real_time 17784443 ns 17786151 ns 39 OPS=406.273M/s bytes=45.235M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:128/real_time 4580208 ns 4581827 ns 153 OPS=394.379M/s bytes=109.913M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:256/real_time 8914670 ns 8915580 ns 78 OPS=405.25M/s bytes=45.4011M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:256/real_time 2236056 ns 2236370 ns 313 OPS=403.911M/s bytes=113.686M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:512/real_time 4444486 ns 4444885 ns 158 OPS=406.422M/s bytes=46.6556M/s cpufreq=1.6G
qs8_dwconv_9p4c__scalar_lrintf/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:512/real_time 1118630 ns 1118777 ns 625 OPS=403.694M/s bytes=118.087M/s cpufreq=1.6G

qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:32/real_time 3234607 ns 3238992 ns 216 OPS=2.23376G/s bytes=248.324M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:64/real_time 799176 ns 795711 ns 876 OPS=4.5205G/s bytes=1.25673G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:128/real_time 1479186 ns 1479508 ns 475 OPS=4.88468G/s bytes=543.867M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:56/W:56/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:128/real_time 408674 ns 411487 ns 1715 OPS=4.42G/s bytes=1.23185G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:256/real_time 743153 ns 743798 ns 940 OPS=4.86127G/s bytes=544.62M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:28/W:28/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:256/real_time 199850 ns 199890 ns 3501 OPS=4.51924G/s bytes=1.272G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:512/real_time 351208 ns 351501 ns 1991 OPS=5.14321G/s bytes=590.42M/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:14/W:14/KH:3/KW:3/PH:2/PW:2/S:2/D:1/G:512/real_time 106398 ns 106365 ns 6580 OPS=4.2443G/s bytes=1.24153G/s cpufreq=1.6G
qs8_dwconv_9p8vc__rvv/mobilenet_v1/H:7/W:7/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:1024/real_time 210250 ns 210241 ns 3330 OPS=4.29569G/s bytes=540.614M/s cpufreq=1.6G
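The "around 10x" figure can be checked directly from the real_time columns quoted above. The snippet below (my own arithmetic, not part of the PR) pairs each scalar shape with its RVV counterpart; the final RVV row (G:1024) has no scalar counterpart in the snippet and is excluded:

```python
import math

# real_time (ns) from the benchmark output above: scalar vs. RVV,
# for the eight mobilenet_v1 shapes that appear in both lists.
scalar_ns = [18568354, 9099572, 17784443, 4580208,
             8914670, 2236056, 4444486, 1118630]
rvv_ns = [3234607, 799176, 1479186, 408674,
          743153, 199850, 351208, 106398]

# Per-shape speedup of the RVV kernel over scalar.
speedups = [s / r for s, r in zip(scalar_ns, rvv_ns)]
print([round(x, 1) for x in speedups])

# Geometric mean as a single summary figure; it comes out roughly 10x,
# consistent with the claim above.
geomean = math.exp(sum(map(math.log, speedups)) / len(speedups))
print(round(geomean, 1))
```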

@ken-unger (Author)

In general, my experience has been that a higher LMUL is always preferred. I wasn't completely sure about the philosophy on adding the non-production kernels, so I committed just the production ones.

I've also been running within tflite (benchmark_model) to see the net result. For MobileNet v1 on the same K1 platform, running single-threaded, we were previously at 105 ms and are now at 10 ms.

[Node type] [count] [avg ms]
Convolution (NHWC, QC8) DWConv 13 105.745 (old)
Convolution (NHWC, QC8) DWConv 13 10.168 (new)

@ken-unger (Author)

@fbarchard I'll do the coding style fix and also add qu8 support to this PR since that touches mostly the same files.

@ken-unger (Author)

New commit added:

  • fixed the coding style noted.
  • added qu8-dwconv support, so we now have the full suite of qs8-dwconv, qs8-qc8w-dwconv, and qu8-dwconv supported for RVV.
  • removed qs8_dwconv_25p8vc__rvv from bench/qs8-dwconv.cc, as I was encountering heap corruption with that test. However, adding a benchmark case for scalar 25p4c exhibits the same apparent heap corruption. I've filed "Possible heap corruption in qs8-dwconv-bench with primary_tile=25" (#7657) for this issue.
