bc7enc: Optimize "find approximate selector" branch chains #24
Description
Several BC7 code paths have branch chains which sequentially compare a value against an array of thresholds. These chains are long enough that compilers have trouble converting them to branchless operations.
In all of these code paths, the value produced by the branch chain is a direct dependency of the subsequent code. Because the branches can't be easily predicted, this often results in pipeline stalls from mispredictions.
To improve this, convert each branch chain to a branchless loop. Compiler optimizations will inline and unroll the loop, significantly improving codegen and making room for further compiler optimizations (such as auto-vectorization).
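To illustrate the transformation, here is a minimal sketch rather than the exact code from this PR. The function names `select_chain` and `select_branchless`, the threshold count of 7, and the `float` types are assumptions made for the example. It relies on the thresholds being sorted in ascending order and on the input not being NaN; under those assumptions, the selector equals the number of thresholds that are less than or equal to the value.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "before": a sequential branch chain over 7 ascending thresholds.
// Each comparison is a data-dependent branch that is hard to predict.
static inline uint32_t select_chain(float d, const float (&t)[7])
{
    if (d < t[0]) return 0;
    else if (d < t[1]) return 1;
    else if (d < t[2]) return 2;
    else if (d < t[3]) return 3;
    else if (d < t[4]) return 4;
    else if (d < t[5]) return 5;
    else if (d < t[6]) return 6;
    return 7;
}

// Hypothetical "after": a fixed-trip-count loop with no early exit.
// For ascending thresholds and non-NaN input, counting the thresholds
// that d has reached gives the same selector as the chain above.
template <size_t N>
static inline uint32_t select_branchless(float d, const float (&t)[N])
{
    uint32_t sel = 0;
    for (size_t i = 0; i < N; ++i)
        sel += (d >= t[i]) ? 1u : 0u; // typically lowers to compare + add, not a jump
    return sel;
}
```

Since the trip count is a compile-time constant and the loop body has no control-flow dependency on earlier iterations, an optimizing compiler is free to unroll it fully; whether it also vectorizes depends on the compiler and the surrounding code.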
Results
I compiled bc7enc.exe using Clang-CL 17 on Windows and ran it on an AMD Ryzen 9 7950X system for this data. I also spot-checked MSVC, and it appears to receive similar performance benefits.
Before changes:
After changes:
If needed, I can provide some images that show the difference in x86 codegen before and after the changes.