Version 0.12.0
This release contains new features, bug fixes, and build improvements. Please see the RAJA user guide for more information about items in this release.
Please download the RAJAPerf-v0.12.0.tar.gz file below. The auto-generated source archives will not work due to the way RAJAPerf uses git submodules.
Notable changes include:
New features / API changes:
- Add command line options to exclude individual kernels and/or variants, as well as kernels that use specified RAJA features.
- Add a command line option to output the min, max, and/or average of kernel timing data over the number of passes through the Suite. Please use the '-h' option to see all available options and what they do.
- Add basic MPI support, which enables the code to run on multiple MPI ranks simultaneously. This makes analysis of node performance more realistic because it mimics the way real applications exercise shared resources, such as memory bandwidth.
- Add a new checksum calculation for verifying the correctness of results generated by kernel variants. The algorithm uses a weighting scheme that reduces bias toward later elements in the result arrays, and employs a Kahan sum to reduce rounding error when summing many terms (see the checksum sketch after this list).
- Add support for running multiple GPU block size "tunings" of kernels, so experiments can be run to assess how kernel performance depends on block size for different programming models and hardware architectures. By default, the Suite runs all tunings; a subset of tunings may be chosen at runtime via command line arguments.
- Add DIFFUSION3DPA kernel, which is a high-order FEM kernel that stresses shared memory usage.
- Add NODAL_ACCUMULATION_3D and DAXPY_ATOMIC kernels, which exercise atomic operations in cases where collisions are few or unlikely (see the DAXPY_ATOMIC sketch after this list).
- Add REDUCE_STRUCT kernel, which tests a compiler's ability to optimize load operations when data arrays are accessed through pointer members of a struct (see the REDUCE_STRUCT sketch after this list).
- Add REDUCE_SUM kernel so we can more easily compare reduction implementations.
- Add SCAN, INDEXLIST, and INDEXLIST_3LOOP kernels. These include scan operations and operations that build lists of indices at which elements of a vector satisfy a condition, a common type of operation in mesh-based physics codes (see the index list sketch after this list).
- Following improvements in RAJA, remove unused execution policies from the RAJA "Teams" kernels DIFFUSION3DPA, MASS3DPA, and MAT_MAT_SHARED. Kernel implementations are unchanged.
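A minimal sketch of the compensated (Kahan) summation the new checksum employs. The per-element weight below is hypothetical; the suite's actual weighting scheme is only characterized above, not reproduced here.

```cpp
#include <cstddef>

// Kahan (compensated) summation: a running compensation term recovers
// low-order bits that a naive sum would discard, so the checksum stays
// stable when many terms of differing magnitude are accumulated.
double checksum(const double* a, std::size_t len)
{
  double sum = 0.0;
  double c   = 0.0;                      // running compensation term
  for (std::size_t i = 0; i < len; ++i) {
    double term = (i % 7 + 1) * a[i];    // hypothetical per-element weight
    double y = term - c;                 // apply correction from last step
    double t = sum + y;                  // low-order bits of y may be lost...
    c = (t - sum) - y;                   // ...and are captured here
    sum = t;
  }
  return sum;
}
```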
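A minimal sketch, not the suite's code, of the pattern DAXPY_ATOMIC exercises, assuming a RAJA build with OpenMP enabled; the function name and the choice of policies are illustrative.

```cpp
#include "RAJA/RAJA.hpp"

// DAXPY performed through an atomic add. Each index is updated exactly
// once here, so the loop measures atomic overhead when collisions are
// absent or unlikely, rather than contention itself.
void daxpy_atomic(double* y, const double* x, double a, RAJA::Index_type len)
{
  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, len),
    [=](RAJA::Index_type i) {
      RAJA::atomicAdd<RAJA::omp_atomic>(&y[i], a * x[i]);
    });
}
```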
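A schematic of what REDUCE_STRUCT stresses; the struct and function names are hypothetical.

```cpp
// The data arrays are reached through pointer members of a struct, so the
// compiler must prove the pointer loads are loop-invariant (no aliasing)
// before it can hoist them out of the reduction loop.
struct Points {
  double* x;
  double* y;
};

double reduce_struct_like(const Points& p, int len)
{
  double xsum = 0.0, ysum = 0.0;
  for (int i = 0; i < len; ++i) {
    xsum += p.x[i];  // loads of p.x and p.y should be hoisted...
    ysum += p.y[i];  // ...if the compiler can rule out aliasing
  }
  return xsum + ysum;
}
```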
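A sketch of the three-pass index list pattern suggested by the INDEXLIST_3LOOP name: flag the matching elements, exclusive-scan the flags to compute output positions, then scatter the indices. The predicate x[i] < 0.0 is hypothetical.

```cpp
#include <cstddef>
#include <vector>

std::vector<std::size_t> make_index_list(const double* x, std::size_t len)
{
  std::vector<int> flag(len);
  for (std::size_t i = 0; i < len; ++i) {        // pass 1: flag matches
    flag[i] = (x[i] < 0.0) ? 1 : 0;
  }

  std::vector<std::size_t> pos(len + 1, 0);
  for (std::size_t i = 0; i < len; ++i) {        // pass 2: exclusive scan
    pos[i + 1] = pos[i] + flag[i];
  }

  std::vector<std::size_t> list(pos[len]);
  for (std::size_t i = 0; i < len; ++i) {        // pass 3: scatter indices
    if (flag[i]) { list[pos[i]] = i; }
  }
  return list;
}
```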
Build changes / improvements:
- Updated versions of the RAJA and BLT submodules; see the release documentation for those libraries for details:
  - RAJA is at SHA-1 commit 87a5cac, which is a few commits ahead of the v2022.03.0 release. The post-release changes are used here for CI testing improvements.
  - BLT is at v0.5.0.
- With this release, the RAJA Perf Suite requires C++14 (due to use of RAJA v2022.03.0).
- With this release, the RAJA Perf Suite requires CMake 3.14.5 or newer.
- BLT v0.5.0 includes improved support for ROCm/HIP builds. Although the CMAKE_HIP_ARCHITECTURES option for specifying the HIP target architecture is not available natively until CMake 3.21, the option is supported in the new BLT version and works with all versions of CMake.
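For illustration (assuming the BLT ENABLE_HIP option; the gfx90a architecture value is hypothetical and machine-specific), a HIP build can set the option on the CMake command line:

```
cmake -DENABLE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a ..
```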
Bug fixes / improvements:
- Fixed index ordering in GPU variants of the HEAT_3D kernel, which was preventing coalesced memory accesses (see the sketch after this list).
- Squashed warnings related to unused variables.
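For background, a schematic CUDA kernel, not the actual HEAT_3D implementation and with an illustrative stencil body, showing why index ordering matters: with data stored so that i is the unit-stride dimension, consecutive threads in a warp must map to consecutive i values for loads and stores to coalesce.

```cpp
__global__ void heat3d_like(double* out, const double* in, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // unit-stride dimension
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  int k = blockIdx.z * blockDim.z + threadIdx.z;
  if (i > 0 && i < N-1 && j > 0 && j < N-1 && k > 0 && k < N-1) {
    int idx = i + j*N + k*N*N;                    // i varies fastest in memory
    // consecutive threads (threadIdx.x) touch consecutive addresses: coalesced
    out[idx] = 0.25 * (in[idx] + in[idx-1] + in[idx+1] + in[idx-N]);
  }
}
```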