
Releases: WolframRhodium/VapourSynth-BM3DCUDA

R2.14

11 Nov 07:13
  • Added support for Intel GPUs via SYCL.
    • SYCL does not currently support the runtime compilation required for a counterpart to bm3dcuda_rtc.
    • Pre-compiled binaries for pre-gen12lp devices will be dropped starting from the next release.
    • TODO: runtime sub-group size selection (Xe2 is 16 wide).

Preliminary benchmark

  1. Intel Arc A770 Graphics, ACM-G10, Xe-HPG, driver 1.3.26690, linux kernel 6.2.8, PCIe 4.0 x16, sub-group size 8
  2. Intel Data Center GPU Max 1100, Xe-HPC, driver 1.3.26516, linux kernel 5.15.0, PCIe 5.0 x16, large GRF mode, sub-group size 16

input: 1920x1080

  • chroma=False: GrayS
  • chroma=True: YUV444PS

backend: level zero

data format: fps

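| radius | chroma | final | Arc A770 | Max 1100 |
|--------|--------|-------|----------|----------|
| 0      | False  | False | 252.46   | 323.51   |
| 0      | False  | True  | 205.89   | 264.46   |
| 0      | True   | False | 103.46   | 103.51   |
| 0      | True   | True  | 78.51    | 80.76    |
| 1      | False  | False | 83.37    | 46.41    |
| 1      | False  | True  | 67.31    | 42.15    |
| 1      | True   | False | 27.15    | 15.75    |
| 1      | True   | True  | 22.09    | 13.90    |
| 2      | False  | False | 51.40    | 29.11    |
| 2      | False  | True  | 41.54    | 24.51    |
| 2      | True   | False | 16.35    | 8.17     |
| 2      | True   | True  | 13.40    | 7.40     |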
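
To compare the two devices more directly, the fps figures can be turned into throughput ratios. The snippet below is only a sketch: the three rows are copied from the table above (chroma=False, final=False), and the names `FPS` and `a770_over_max1100` are hypothetical helpers, not part of the plugin.

```python
# (radius, chroma, final) -> (Arc A770 fps, Max 1100 fps), copied from the table above.
FPS = {
    (0, False, False): (252.46, 323.51),
    (1, False, False): (83.37, 46.41),
    (2, False, False): (51.40, 29.11),
}


def a770_over_max1100(radius, chroma=False, final=False):
    """Return the Arc A770 fps divided by the Max 1100 fps for one table row."""
    a770, max1100 = FPS[(radius, chroma, final)]
    return a770 / max1100


for radius in (0, 1, 2):
    ratio = a770_over_max1100(radius)
    print(f"radius={radius}: A770 / Max 1100 = {ratio:.2f}")
```

For these rows the ratio is below 1 at radius 0 (the Max 1100 is faster) and above 1 at radius 1 and 2 (the Arc A770 is faster). The table alone does not say why the ordering flips.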

R2.13

25 Oct 23:26
  • Added support for AMD GPUs via HIP (supported GPUs).
    • Output frames may be broken. (#23)
    • The code is designed for discrete RDNA GPUs (with wavefront size 32 and a separate address space), and may not work on GCN and CDNA GPUs.
    • Since AMD's current implementation of HIP does not provide a backward-compatible virtual ISA like PTX in CUDA, the bm3dhip binary will not run on future AMD GPUs or on GPUs that are not among the current compilation targets. This could be modified here.
    • Only bm3dhip is available at present. bm3dhip_rtc, the hiprtc-based counterpart to bm3dcuda_rtc, has to wait at least until ROCm 6.1.0 because some required features are not yet supported.

Benchmark

  • NVIDIA T4 (bm3dcuda)
    • AWS g4dn.2xlarge, Linux 6.2.0-1014-aws, driver version: 545.23.06, CUDA Toolkit 12.3
  • AMD Radeon™ Pro V520 (bm3dhip)
    • AWS g4ad.2xlarge, Linux 6.2.0-1014-aws, driver version: 6.2.4, ROCm 5.7.1
  • Hygon C86 7390 (bm3dcpu)
    • 32C @ 2.70GHz, L1i: 32 x 64 KB, L1d: 32 x 32 KB, L2: 32 x 512 KB, L3: 8 x 8 MB
    • Windows Server 2022
  • Intel Sapphire Rapids (bm3dcpu)
    • 32C @ 3.4GHz
    • Windows Server 2022
  • AMD EPYC Zen4 (bm3dcpu)
    • 32C @ 3.4GHz
    • Windows Server 2022
  • VapourSynth R65-RC1-6-g3dcc6a35

input: 1920x1080

  • chroma=False: GrayS
  • chroma=True: YUV444PS

data format: fps

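| radius | chroma | final | NVIDIA T4 | AMD Radeon™ Pro V520 | Hygon 7390 | Intel Sapphire Rapids | AMD EPYC Zen4 |
|--------|--------|-------|-----------|----------------------|------------|-----------------------|---------------|
| 0      | False  | False | 342.23    | 152.20               | 207.90     | 598.43                | 674.37        |
| 0      | False  | True  | 262.98    | 134.08               | 180.39     | 514.53                | 577.75        |
| 0      | True   | False | 121.35    | 66.44                | 122.40     | 311.64                | 375.23        |
| 0      | True   | True  | 96.79     | 53.84                | 100.85     | 134.40                | 142.46        |
| 1      | False  | False | 60.80     | 63.70                | 110.63     | 162.40                | 180.68        |
| 1      | False  | True  | 52.59     | 53.36                | 53.55      | 136.40                | 152.13        |
| 1      | True   | False | 21.35     | 24.41                | 25.60      | 58.01                 | 70.08         |
| 1      | True   | True  | 18.15     | 19.91                | 21.50      | 49.50                 | 59.87         |
| 2      | False  | False | 37.22     | 41.24                | 39.32      | 103.15                | 111.87        |
| 2      | False  | True  | 31.68     | 33.25                | 34.01      | 89.75                 | 99.14         |
| 2      | True   | False | 12.64     | 14.70                | 17.05      | 37.54                 | 45.69         |
| 2      | True   | True  | 10.88     | 12.55                | 14.35      | 33.01                 | 39.35         |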

R2.13.test

30 Jul 11:49
Pre-release
  • Added experimental support for AMD GPUs via HIP (supported SKUs).
    • Only tested on Ubuntu 22.04 with an RDNA1 GPU (Radeon Pro V520).
    • The code is designed for RDNA GPUs (with wavefront size 32), and may not work on GCN and CDNA GPUs.
    • No backward compatibility for current HIP implementation.

R2.12-cuda118

21 Jan 03:46
  • This release is built with cuda 11.8.0 and requires driver 450 or higher.
  • Fix incorrect temporal denoising results on 900 and 10 series GPUs (b66572c). This issue was introduced in R2.6 by uninitialized variables combined with cuda compiler versions >= 11.5.
  • Fix a bug in BM3Dv2 that produces invalid output in BM3Dv2(yuv, radius=1, sigma=[3,0]).
  • Fix data races in bm3dcuda{_rtc}.

R2.12

13 Dec 11:40
  • This release is built with cuda 12.0.0 and requires driver 525 or higher.
  • Fix incorrect temporal denoising results on 900 and 10 series GPUs (b66572c). This issue was introduced in R2.6 by uninitialized variables combined with cuda compiler versions >= 11.5.
  • Fix a bug in BM3Dv2 that produces invalid output in BM3Dv2(yuv, radius=1, sigma=[3,0]).
  • Fix data races in bm3dcuda{_rtc}.

R2.11

04 Oct 01:59
  • Upgrade to cuda 11.8.0, clang 15.0.0
  • Added initial tuning for the Ada architecture

Known issue

  • bm3dcuda{_rtc} produces incorrect temporal denoising results on 900 and 10 series GPUs. This issue was introduced in R2.6 and occurs with cuda compiler versions >= 11.5. Users of these GPUs should use the special build.

R2.10

15 Jul 05:53
  • Fix temporal padding for built-in VAggregate().

    This function differs from bm3d.VAggregate() in its handling of edge frames. The former uses replication padding, whereas the latter uses zero padding.

    Related issue: WolframRhodium/VapourSynth-WNNM#4
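
The difference between the two padding modes can be illustrated with a small index calculation. This is a generic sketch of replication versus zero padding along the temporal axis, not the plugin's actual code. The function name `temporal_window` and its parameters are hypothetical.

```python
def temporal_window(num_frames, n, radius, mode):
    """List the source frame index used for each slot of the temporal window
    centered on frame n. None marks a slot that is filled with zeros."""
    if mode not in ("replicate", "zero"):
        raise ValueError(f"unknown padding mode: {mode!r}")
    window = []
    for offset in range(-radius, radius + 1):
        idx = n + offset
        if 0 <= idx < num_frames:
            # The neighbour exists, so both modes use it directly.
            window.append(idx)
        elif mode == "replicate":
            # Clamp to the nearest valid frame (first or last frame is repeated).
            window.append(min(max(idx, 0), num_frames - 1))
        else:
            # Zero padding: the missing neighbour contributes an all-zero frame.
            window.append(None)
    return window


# First frame of a 5-frame clip with radius=1:
print(temporal_window(5, 0, 1, "replicate"))  # [0, 0, 1]
print(temporal_window(5, 0, 1, "zero"))       # [None, 0, 1]
```

Away from the clip boundaries the two modes produce the same window. They differ only for the first and last `radius` frames.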

R2.9

12 Jul 14:00
  • Fix zero-initialization for bm3dcpu, introduced in R2.7.

R2.8

17 Jun 09:48
  • Fix performance degradation introduced in R2.6 on RTX 2000/3000 gpus.
  • Improve performance of VAggregate() and BM3Dv2() for temporal denoising.
    • This VAggregate() implementation is measured to be ~40% faster than the original implementation, resulting in a 0-5% overall speedup.
  • BM3Dv2() is stable now.
  • Upgrade cuda to 11.7.0, clang to 14.0.5.
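
One way to see how a ~40% faster VAggregate() yields only a 0-5% overall speedup is Amdahl's law. The sketch below rests on two assumptions that the release notes do not state: "~40% faster" is read as a 1.4x speedup of the aggregation stage, and the stages run one after another so that Amdahl's law applies. The function names are hypothetical.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: overall speedup when a stage that takes `fraction`
    of the total runtime is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def fraction_for(overall, local_speedup):
    """Invert Amdahl's law: the runtime fraction the accelerated stage
    must have had to produce the given overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)


# Assumption: "~40% faster" means the stage runs 1.4x as fast.
share = fraction_for(1.05, 1.4)
print(f"VAggregate share of runtime for a 5% overall gain: {share:.3f}")
print(f"Round trip: {overall_speedup(share, 1.4):.3f}")
```

Under these assumptions, a 5% overall gain corresponds to aggregation taking about one sixth of the total runtime, and a gain near 0% to aggregation being a negligible share.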

Known issue

  • Mixing bm3dcuda(_rtc) with other cuda plugins in the same script could create full-frame artifacts under rare conditions. All releases seem to be affected. Please check the output carefully. Related issue: #17

R2.7

10 Apr 09:20
  • Fix potential data races.

  • Unprocessed data is now zero initialized by default.

  • Add experimental BM3Dv2() that calls VAggregate() internally.

  • Upgrade to cuda 11.6.2 and clang 14.0.

Known issues

  • There could be data corruption when using (bm3dcuda / bm3dcuda_rtc) and (vs-ort_cuda or vs-trt) in the same script.

  • Users of the cuda-based implementations on RTX 2000/3000 GPUs may experience up to 3.5x performance degradation due to a compiler transformation. This is fixed in the main branch and in the R2.8 release; please use R2.8 or newer.