bc7enc: Optimize "find approximate selector" branch chains #24

abbriggs · 2024-09-24T21:19:47Z

Description

Several BC7 code paths have branch chains which sequentially compare a value against an array of thresholds. These chains are long enough that compilers have trouble converting them to branchless operations.

In all of these code paths, the value produced by the branch chain is a direct dependency of the subsequent code. This often results in a pipeline stall, because the branches can't be easily predicted.

To improve this, convert each branch chain to a branchless loop. Compiler optimizations will inline and unroll the loop, significantly improving codegen and making room for further compiler optimizations (such as auto-vectorization).
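For readers unfamiliar with the pattern, here is a minimal sketch of the general idea. It is not the actual bc7enc code: the threshold table, its values, and the function names are hypothetical, and the table is assumed to be sorted in ascending order. The first function is the kind of sequential branch chain described above; the second counts how many thresholds the value exceeds, which gives the same selector without any data-dependent branches.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical thresholds, sorted ascending (not the encoder's real values).
static const float kThresholds[7] = {
    0.0625f, 0.1875f, 0.3125f, 0.4375f, 0.5625f, 0.6875f, 0.8125f
};

// Branch chain: returns the index of the first threshold that v does not exceed.
static inline uint32_t select_branchy(float v)
{
    if (v <= kThresholds[0]) return 0;
    else if (v <= kThresholds[1]) return 1;
    else if (v <= kThresholds[2]) return 2;
    else if (v <= kThresholds[3]) return 3;
    else if (v <= kThresholds[4]) return 4;
    else if (v <= kThresholds[5]) return 5;
    else if (v <= kThresholds[6]) return 6;
    return 7;
}

// Branchless loop: counts the thresholds that v exceeds. Because the table
// is sorted, that count equals the index the branch chain would return.
// The trip count is a compile-time constant, so the loop can be unrolled.
static inline uint32_t select_branchless(float v)
{
    uint32_t sel = 0;
    for (size_t i = 0; i < 7; ++i)
        sel += (v > kThresholds[i]) ? 1u : 0u;
    return sel;
}
```

The two functions agree for every non-NaN input. For NaN they differ (the chain falls through to 7, the loop returns 0), so this sketch assumes the input is never NaN.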

Results

I compiled bc7enc.exe using Clang-CL 17 on Windows and ran it on an AMD Ryzen 9 7950X system for this data. I also spot-checked MSVC, and it appears to receive similar performance benefits.

Before changes:

Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.197000 secs
Total processing time: 0.206000 secs

Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.757000 secs
Total processing time: 4.772000 secs

After changes:

Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.186000 secs
Total processing time: 0.195000 secs

Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.429000 secs
Total processing time: 4.445000 secs

If needed, I can provide some images that show the difference in x86 codegen before and after the changes.
