Releases: WolframRhodium/VapourSynth-BM3DCUDA
R2.14
- Added support for Intel GPUs via SYCL.
  - SYCL at present does not support the runtime compilation required for a counterpart to `bm3dcuda_rtc`.
  - Pre-compiled binaries for pre-gen12lp devices will be dropped starting from the next release.
  - TODO: runtime sub-group size selection. (Xe2 is 16 wide)
- Preliminary benchmark
  - Intel Arc A770 Graphics, ACM-G10, Xe-HPG, driver 1.3.26690, Linux kernel 6.2.8, PCIe 4.0 x16, sub-group size 8
  - Intel Data Center GPU Max 1100, Xe-HPC, driver 1.3.26516, Linux kernel 5.15.0, PCIe 5.0 x16, large GRF mode, sub-group size 16
input: 1920x1080
`chroma=False`: GrayS
`chroma=True`: YUV444PS
backend: level zero
data format: fps
radius | chroma | final | Arc A770 | Max 1100 |
---|---|---|---|---|
0 | False | False | 252.46 | 323.51 |
0 | False | True | 205.89 | 264.46 |
0 | True | False | 103.46 | 103.51 |
0 | True | True | 78.51 | 80.76 |
1 | False | False | 83.37 | 46.41 |
1 | False | True | 67.31 | 42.15 |
1 | True | False | 27.15 | 15.75 |
1 | True | True | 22.09 | 13.90 |
2 | False | False | 51.40 | 29.11 |
2 | False | True | 41.54 | 24.51 |
2 | True | False | 16.35 | 8.17 |
2 | True | True | 13.40 | 7.40 |
R2.13
- Added support for AMD GPUs via HIP (supported GPUs).
  - Output frames may be broken. (#23)
  - The code is designed for discrete RDNA GPUs (with wavefront size 32 and a separate address space), and may not work on GCN and CDNA GPUs.
  - Since AMD's current implementation of HIP does not provide a backward-compatible virtual ISA like `PTX` in CUDA, the `bm3dhip` binary will not be able to run on future AMD GPUs or on GPUs that are not a current compilation target. This could be modified here.
  - Only `bm3dhip` is available at present. `bm3dhip_rtc`, the hiprtc-based counterpart to `bm3dcuda_rtc`, has to wait at least until ROCm 6.1.0 because some required features are not yet supported.
Benchmark
- NVIDIA T4 (`bm3dcuda`)
  - AWS g4dn.2xlarge, Linux 6.2.0-1014-aws, driver version: 545.23.06, CUDA Toolkit 12.3
- AMD Radeon™ Pro V520 (`bm3dhip`)
  - AWS g4ad.2xlarge, Linux 6.2.0-1014-aws, driver version: 6.2.4, ROCm 5.7.1
- Hygon C86 7390 (`bm3dcpu`)
  - 32C @ 2.70GHz, L1i: 32 x 64 KB, L1d: 32 x 32 KB, L2: 32 x 512 KB, L3: 8 x 8 MB
  - Windows Server 2022
- Intel Sapphire Rapids (`bm3dcpu`)
  - 32C @ 3.4GHz
  - Windows Server 2022
- AMD EPYC Zen4 (`bm3dcpu`)
  - 32C @ 3.4GHz
  - Windows Server 2022
- VapourSynth `R65-RC1-6-g3dcc6a35`
input: 1920x1080
`chroma=False`: GrayS
`chroma=True`: YUV444PS
data format: fps
radius | chroma | final | NVIDIA T4 | AMD Radeon™ Pro V520 | Hygon 7390 | Intel Sapphire Rapids | AMD EPYC Zen4 |
---|---|---|---|---|---|---|---|
0 | False | False | 342.23 | 152.20 | 207.90 | 598.43 | 674.37 |
0 | False | True | 262.98 | 134.08 | 180.39 | 514.53 | 577.75 |
0 | True | False | 121.35 | 66.44 | 122.40 | 311.64 | 375.23 |
0 | True | True | 96.79 | 53.84 | 100.85 | 134.40 | 142.46 |
1 | False | False | 60.80 | 63.70 | 110.63 | 162.40 | 180.68 |
1 | False | True | 52.59 | 53.36 | 53.55 | 136.40 | 152.13 |
1 | True | False | 21.35 | 24.41 | 25.60 | 58.01 | 70.08 |
1 | True | True | 18.15 | 19.91 | 21.50 | 49.50 | 59.87 |
2 | False | False | 37.22 | 41.24 | 39.32 | 103.15 | 111.87 |
2 | False | True | 31.68 | 33.25 | 34.01 | 89.75 | 99.14 |
2 | True | False | 12.64 | 14.70 | 17.05 | 37.54 | 45.69 |
2 | True | True | 10.88 | 12.55 | 14.35 | 33.01 | 39.35 |
R2.13.test
- Added experimental support for AMD GPUs via HIP (supported SKUs).
  - Only tested on Ubuntu 22.04 with an RDNA1 GPU (Radeon Pro V520).
  - The code is designed for RDNA GPUs (with wavefront size 32), and may not work on GCN and CDNA GPUs.
  - No backward compatibility for current HIP implementation.
R2.12-cuda118
- This release is built with cuda 11.8.0 and requires driver 450 or higher.
- Fix incorrect temporal denoising results on 900 and 10 series gpus (b66572c). This issue was introduced in R2.6 because of uninitialized variables and cuda compiler >= 11.5.
- Fix a bug in `BM3Dv2` that produces invalid output in `BM3Dv2(yuv, radius=1, sigma=[3,0])`.
- Fix data races in `bm3dcuda{_rtc}`.
R2.12
- This release is built with cuda 12.0.0 and requires driver 525 or higher.
- Fix incorrect temporal denoising results on 900 and 10 series gpus (b66572c). This issue was introduced in R2.6 because of uninitialized variables and cuda compiler >= 11.5.
- Fix a bug in `BM3Dv2` that produces invalid output in `BM3Dv2(yuv, radius=1, sigma=[3,0])`.
- Fix data races in `bm3dcuda{_rtc}`.
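The two R2.12 builds differ only in CUDA toolkit and minimum driver version. As a rough sketch of that rule, the helper below maps an NVIDIA driver major version to a build name. The function name `pick_r2_12_build` is hypothetical and not part of the plugin, and passing the driver version as an integer major number is an assumption made for illustration.

```python
def pick_r2_12_build(driver_major):
    """Return the R2.12 build name usable with the given driver major version.

    Thresholds are taken from the release notes:
    R2.12 needs driver 525 or higher, R2.12-cuda118 needs 450 or higher.
    Returns None when neither build's driver requirement is met.
    """
    if driver_major >= 525:
        return "R2.12"          # built with cuda 12.0.0
    if driver_major >= 450:
        return "R2.12-cuda118"  # built with cuda 11.8.0
    return None


if __name__ == "__main__":
    for version in (545, 470, 440):
        print(version, pick_r2_12_build(version))
```

The sketch prefers the newer toolkit build whenever the driver allows it; a system on driver 525 or higher could equally run the cuda 11.8.0 build.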
R2.11
- Upgrade to cuda 11.8.0, clang 15.0.0
- Added initial tuning for the Ada architecture
Known issue
- `bm3dcuda{_rtc}` produces incorrect temporal denoising results on 900 and 10 series gpus. This issue was introduced in R2.6 because of cuda compiler >= 11.5. Users of these gpus should use the special build.
R2.10
- Fix temporal padding for built-in `VAggregate()`.

  This function differs from `bm3d.VAggregate()` in its handling of edge frames. The former uses replication padding, whereas the latter uses zero padding.

  Related issue: WolframRhodium/VapourSynth-WNNM#4
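To make the two edge-frame conventions concrete, here is a minimal sketch that lists which source frames a temporal window of `2 * radius + 1` frames would draw from near the clip boundaries. It only illustrates replication versus zero padding on frame indices; it is not the plugin's code, and the name `temporal_window` and the use of `None` to stand for an all-zero frame are assumptions made for illustration.

```python
def temporal_window(n, radius, num_frames, mode):
    """Return the source frame indices for the window centred on frame n.

    mode="replicate": out-of-range positions reuse the nearest valid frame.
    mode="zero": out-of-range positions are None, standing for a zero frame.
    """
    if mode not in ("replicate", "zero"):
        raise ValueError(f"unknown mode: {mode!r}")

    window = []
    for i in range(n - radius, n + radius + 1):
        if 0 <= i < num_frames:
            window.append(i)
        elif mode == "replicate":
            # Clamp to the first or last frame of the clip.
            window.append(min(max(i, 0), num_frames - 1))
        else:
            window.append(None)
    return window


if __name__ == "__main__":
    print(temporal_window(0, 2, 10, "replicate"))
    print(temporal_window(0, 2, 10, "zero"))
```

With replication padding every window position contributes real image data, while zero padding leaves empty positions at the first and last `radius` frames.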
R2.9
- Fix zero-initialization for `bm3dcpu`, introduced in R2.7.
R2.8
- Fix performance degradation introduced in R2.6 on RTX 2000/3000 gpus.
- Improve performance of `VAggregate()` and `BM3Dv2()` for temporal denoising.
  - This `VAggregate()` implementation is measured to be ~40% faster than the original implementation, resulting in 0 ~ 5% speedup overall.
  - `BM3Dv2()` is stable now.
- Upgrade cuda to 11.7.0, clang to 14.0.5.
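A ~40% faster `VAggregate()` yields only a 0 ~ 5% overall gain because aggregation is a small part of the whole temporal pipeline. The sketch below applies Amdahl's law to show this. It assumes that "~40% faster" means a 1.4x throughput increase for that stage, and the runtime shares in the loop are illustrative values, not measurements.

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: whole-pipeline speedup when a stage that takes
    `fraction` of the runtime becomes `component_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


if __name__ == "__main__":
    # Hypothetical shares of total runtime spent in aggregation.
    for fraction in (0.0, 0.05, 0.10, 0.15):
        gain = (overall_speedup(fraction, 1.4) - 1.0) * 100.0
        print(f"aggregation share {fraction:.0%}: overall {gain:.1f}% faster")
```

Under these assumptions, a 5% overall speedup corresponds to aggregation taking about one sixth of the total runtime before the change.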
Known issue
- Mixing `bm3dcuda(_rtc)` with other cuda plugins in the same script could create full-frame artifacts under rare conditions. All releases seem to be affected. Please check the output carefully. Related issue: #17
R2.7
- Fix potential data races.
- Unprocessed data is now zero initialized by default.
- Add experimental `BM3Dv2()` that calls `VAggregate()` internally.
- Upgrade to cuda 11.6.2 and clang 14.0.
Known issues
- There could be data corruption when using (`bm3dcuda` / `bm3dcuda_rtc`) and (`vs-ort_cuda` or `vs-trt`) in the same script.
- Users of the cuda related implementations with RTX 2000/3000 gpus may experience up to 3.5x performance degradation due to a compiler transformation. This is fixed in the R2.8 release. Please use R2.8 or newer.