Releases · WolframRhodium/VapourSynth-BM3DCUDA

11 Nov 07:13

github-actions

R2.14

6cda4e5

R2.14 Latest

Added support for Intel GPUs via SYCL.
- SYCL does not currently support the runtime compilation that a counterpart to bm3dcuda_rtc would require.
- Pre-compiled binaries for pre-gen12lp devices will be dropped starting with the next release.
- TODO: runtime sub-group size selection (Xe2 is 16 wide).

Preliminary benchmark

Intel Arc A770 Graphics, ACM-G10, Xe-HPG, driver 1.3.26690, Linux kernel 6.2.8, PCIe 4.0 x16, sub-group size 8
Intel Data Center GPU Max 1100, Xe-HPC, driver 1.3.26516, Linux kernel 5.15.0, PCIe 5.0 x16, large GRF mode, sub-group size 16

input: 1920x1080

chroma=False: GrayS
chroma=True: YUV444PS

backend: level zero

data format: fps

radius	chroma	final	Arc A770	Max 1100
0	False	False	252.46	323.51
0	False	True	205.89	264.46
0	True	False	103.46	103.51
0	True	True	78.51	80.76
1	False	False	83.37	46.41
1	False	True	67.31	42.15
1	True	False	27.15	15.75
1	True	True	22.09	13.90
2	False	False	51.40	29.11
2	False	True	41.54	24.51
2	True	False	16.35	8.17
2	True	True	13.40	7.40
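
The fps figures above can be easier to compare as per-frame times and device ratios. The following is a minimal sketch of that arithmetic, using only three rows copied from the table (chroma=False, final=False); the helper names `ms_per_frame` and `summarize` are illustrative and are not part of the plugin.

```python
# Rows copied from the R2.14 table above:
# (radius, chroma, final, Arc A770 fps, Max 1100 fps)
ROWS = [
    (0, False, False, 252.46, 323.51),
    (1, False, False, 83.37, 46.41),
    (2, False, False, 51.40, 29.11),
]


def ms_per_frame(fps):
    """Convert a throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


def summarize(rows):
    """Return per-frame times and the A770 / Max 1100 throughput ratio per row."""
    summary = []
    for radius, chroma, final, a770_fps, max1100_fps in rows:
        summary.append({
            "radius": radius,
            "a770_ms": ms_per_frame(a770_fps),
            "max1100_ms": ms_per_frame(max1100_fps),
            "a770_over_max1100": a770_fps / max1100_fps,
        })
    return summary


summary = summarize(ROWS)
for row in summary:
    print(
        f"radius={row['radius']}: "
        f"A770 {row['a770_ms']:.2f} ms, "
        f"Max 1100 {row['max1100_ms']:.2f} ms, "
        f"ratio {row['a770_over_max1100']:.2f}"
    )
```

A ratio below 1 means the Max 1100 is faster for that row. In the table above, the Max 1100 leads at radius=0 and the Arc A770 leads at radius 1 and 2.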

25 Oct 23:26

github-actions

R2.13

ca0f091

R2.13

Added support for AMD GPUs via HIP (see the list of supported GPUs).
- Output frames may be broken. (#23)
- The code is designed for discrete RDNA GPUs (with wavefront size 32 and a separate address space), and may not work on GCN and CDNA GPUs.
- Since AMD's current implementation of HIP does not provide a backward-compatible virtual ISA like PTX in CUDA, the bm3dhip binary cannot run on future AMD GPUs or on GPUs that are not among the current compilation targets. The compilation targets can be modified here.
- Only bm3dhip is available at present. bm3dhip_rtc, the hiprtc-based counterpart to bm3dcuda_rtc, has to wait at least until ROCm 6.1.0 because some required features are missing.

Benchmark

NVIDIA T4 (bm3dcuda)
- AWS g4dn.2xlarge, Linux 6.2.0-1014-aws, driver version: 545.23.06, CUDA Toolkit 12.3
AMD Radeon™ Pro V520 (bm3dhip)
- AWS g4ad.2xlarge, Linux 6.2.0-1014-aws, driver version: 6.2.4, ROCm 5.7.1
Hygon C86 7390 (bm3dcpu)
- 32C @ 2.70GHz, L1i: 32 x 64 KB, L1d: 32 x 32 KB, L2: 32 x 512 KB, L3: 8 x 8 MB
- Windows Server 2022
Intel Sapphire Rapids (bm3dcpu)
- 32C @ 3.4GHz
- Windows Server 2022
AMD EPYC Zen4 (bm3dcpu)
- 32C @ 3.4GHz
- Windows Server 2022
VapourSynth R65-RC1-6-g3dcc6a35

input: 1920x1080

chroma=False: GrayS
chroma=True: YUV444PS

data format: fps

radius	chroma	final	NVIDIA T4	AMD Radeon™ Pro V520	Hygon 7390	Intel Sapphire Rapids	AMD EPYC Zen4
0	False	False	342.23	152.20	207.90	598.43	674.37
0	False	True	262.98	134.08	180.39	514.53	577.75
0	True	False	121.35	66.44	122.40	311.64	375.23
0	True	True	96.79	53.84	100.85	134.40	142.46
1	False	False	60.80	63.70	110.63	162.40	180.68
1	False	True	52.59	53.36	53.55	136.40	152.13
1	True	False	21.35	24.41	25.60	58.01	70.08
1	True	True	18.15	19.91	21.50	49.50	59.87
2	False	False	37.22	41.24	39.32	103.15	111.87
2	False	True	31.68	33.25	34.01	89.75	99.14
2	True	False	12.64	14.70	17.05	37.54	45.69
2	True	True	10.88	12.55	14.35	33.01	39.35

30 Jul 11:49

github-actions

R2.13.test

d073d98

R2.13.test Pre-release

Added experimental support for AMD GPUs via HIP (see the list of supported SKUs).
- Only tested on Ubuntu 22.04 with an RDNA1 GPU (Radeon Pro V520).
- The code is designed for RDNA GPUs (with wavefront size 32), and may not work on GCN and CDNA GPUs.
- The current HIP implementation provides no backward compatibility.

21 Jan 03:46

github-actions

R2.12-cuda118

d119330

R2.12-cuda118

This release is built with CUDA 11.8.0 and requires driver 450 or higher.
Fix incorrect temporal denoising results on 900 and 10 series GPUs (b66572c). This issue was introduced in R2.6 because of uninitialized variables and CUDA compiler >= 11.5.
Fix a bug in BM3Dv2 that produces invalid output in BM3Dv2(yuv, radius=1, sigma=[3,0]).
Fix data races in bm3dcuda{_rtc}.

13 Dec 11:40

github-actions

R2.12

b66572c

R2.12

This release is built with CUDA 12.0.0 and requires driver 525 or higher.
Fix incorrect temporal denoising results on 900 and 10 series GPUs (b66572c). This issue was introduced in R2.6 because of uninitialized variables and CUDA compiler >= 11.5.
Fix a bug in BM3Dv2 that produces invalid output in BM3Dv2(yuv, radius=1, sigma=[3,0]).
Fix data races in bm3dcuda{_rtc}.

04 Oct 01:59

github-actions

R2.11

636a633

R2.11

Upgrade to cuda 11.8.0, clang 15.0.0
Added initial tuning for the Ada architecture

Known issue

bm3dcuda{_rtc} produces incorrect temporal denoising results on 900 and 10 series GPUs. This issue was introduced in R2.6 because of CUDA compiler >= 11.5. Users of these GPUs should use the special build.

15 Jul 05:53

github-actions

R2.10

3adfbe6

R2.10

Fix temporal padding for built-in VAggregate().

This function differs from bm3d.VAggregate() in its handling of edge frames: the built-in VAggregate() uses replication padding, whereas bm3d.VAggregate() uses zero padding.

Related issue: WolframRhodium/VapourSynth-WNNM#4
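
The difference between the two temporal padding modes can be illustrated with a small sketch. It only shows which values a temporal window of a given radius would see at the first frame of a clip; it is not the plugin's aggregation code, and the function name `temporal_window` and the scalar "frames" are assumptions made for illustration.

```python
def temporal_window(frames, n, radius, mode):
    """Return the 2 * radius + 1 values centred on index n.

    Out-of-range neighbours are filled according to `mode`:
    "replicate" repeats the nearest valid frame, "zero" substitutes 0.0.
    """
    last = len(frames) - 1
    window = []
    for i in range(n - radius, n + radius + 1):
        if 0 <= i <= last:
            window.append(frames[i])
        elif mode == "replicate":
            # Clamp the index to the valid range, repeating the edge frame.
            window.append(frames[min(max(i, 0), last)])
        elif mode == "zero":
            window.append(0.0)
        else:
            raise ValueError(f"unknown padding mode: {mode!r}")
    return window


# Each "frame" is a single number standing in for a whole image.
frames = [10.0, 20.0, 30.0, 40.0]

replicated = temporal_window(frames, n=0, radius=1, mode="replicate")
zeroed = temporal_window(frames, n=0, radius=1, mode="zero")
print(replicated)
print(zeroed)
```

With replication padding the missing neighbour before the first frame is a copy of that frame, while zero padding feeds a blank value into the window, so the two modes can only differ on frames within `radius` of either end of the clip.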

12 Jul 14:00

github-actions

R2.9

b489dcb

R2.9

Fix zero-initialization for bm3dcpu, introduced in R2.7.

17 Jun 09:48

WolframRhodium

R2.8

d2325b5

R2.8

Fix performance degradation introduced in R2.6 on RTX 2000/3000 GPUs.
Improve performance of VAggregate() and BM3Dv2() for temporal denoising.
- This VAggregate() implementation is measured to be ~40% faster than the original implementation, resulting in a 0-5% speedup overall.
BM3Dv2() is stable now.
Upgrade CUDA to 11.7.0 and clang to 14.0.5.
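
The gap between a ~40% faster VAggregate() and an overall speedup of at most 5% is what Amdahl's law predicts when the accelerated stage is a small part of the pipeline. The sketch below works through that arithmetic under two assumptions that are not stated in the release notes: "~40% faster" is read as the stage running at 1.4x its previous throughput, and the rest of the pipeline is unchanged. The function names are illustrative.

```python
def overall_speedup(fraction, stage_speedup):
    """Amdahl's law: total speedup when `fraction` of the runtime
    is accelerated by a factor of `stage_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


def fraction_for(overall, stage_speedup):
    """Invert Amdahl's law: the runtime fraction the stage must occupy
    for a given stage speedup to yield the given overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / stage_speedup)


STAGE_SPEEDUP = 1.4  # assumed reading of "~40% faster"

# Runtime share implied by the upper end of the reported 0-5% range.
implied_fraction = fraction_for(1.05, STAGE_SPEEDUP)
print(f"implied VAggregate() share of runtime: {implied_fraction:.3f}")

# Overall speedup for a few hypothetical runtime shares.
for share in (0.0, 0.05, 0.10, implied_fraction):
    print(f"share={share:.3f} -> overall {overall_speedup(share, STAGE_SPEEDUP):.3f}x")
```

Under this model a 5% overall gain corresponds to the aggregation stage taking about one sixth of the runtime before the change. That share is implied by the model and is not a measurement from the release notes.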

Known issue

Mixing bm3dcuda(_rtc) with other cuda plugins in the same script could create full-frame artifacts under rare conditions. All releases seem to be affected. Please check the output carefully. Related issue: #17

10 Apr 09:20

WolframRhodium

R2.7

91b6e5d

R2.7

Fix potential data races.
Unprocessed data is now zero initialized by default.
Add experimental BM3Dv2() that calls VAggregate() internally.
Upgrade to cuda 11.6.2 and clang 14.0.

Known issues

There could be data corruption when using (bm3dcuda / bm3dcuda_rtc) and (vs-ort_cuda or vs-trt) in the same script.
Users of the CUDA-related implementations with RTX 2000/3000 GPUs may experience up to 3.5x performance degradation due to a compiler transformation. This is fixed in the main branch and in the R2.8 release; please use R2.8 or newer.
