This version uses RAJA 0.7.0 for both the OpenMP and CUDA builds. Compile and run instructions are in the README.md file. Some sample run times are listed below.
Run times for gaussianHill-rev.in on 1 node of Lassen (Power 9 x 2 + Volta 100 x 4):
Devices | SW4Lite-RAJA (s) | SW4Lite-CUDA (s) | SW4 (s) |
---|---|---|---|
1 | 8.4 | 4.88 | 6.42 |
4 | 3.9 | 3.3 | 4.07 |
Run times on 1 node of Summit (Power 9 x 2 + Volta 100 x 6):
Case | SW4Lite-RAJA, 1 device (s) | SW4Lite-RAJA, 6 devices (s) |
---|---|---|
LOH2h=100 | 21 | 7.83 |
gaussianHill-rev | 9.54 | 4.08 |
Run time for the LOH2h=100 case on a Quartz node (Intel Xeon E5-2695v4) with 4 ranks and 9 threads per rank: ~120 s
NOTES:
Certain performance-critical loops have been ported using custom lambda offloads to avoid the register usage penalty caused by reduction support in the corresponding RAJA constructs. This is the default option in the Makefile, but it can be turned off by removing the -DUNRAJA=1 option from the list of compile flags.
sw4lite-cuda vs sw4lite-RAJA
The CUDA version of sw4lite contains the following optimizations that are not in the RAJA or CPU versions:
- A few small routines (corr, pred) have been merged into larger ones (rhs4center and similar).
- MPI communication is overlapped with computation.
- Data transfers between GPU and CPU are optimized by sending only the points needed for the interprocessor halo exchange.
- The Cartesian grid discretization (procedures rhs4center and similar) has been optimized by use of shared memory.
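
The halo-only transfer in the third item can be illustrated with a minimal host-side C++ sketch. It is not taken from sw4lite: the `Field3D` type, the `pack_i_planes` and `unpack_i_planes` helpers, and the memory layout (index i fastest) are all assumptions made for illustration. The idea is that only the few planes a neighbor needs are copied into a contiguous buffer, so the buffer, rather than the whole array, is what would cross the GPU/CPU boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3D field stored contiguously with i as the fastest index.
// This layout is an assumption, not necessarily the one used in sw4lite.
struct Field3D {
  std::size_t ni, nj, nk;
  std::vector<double> data;

  Field3D(std::size_t ni_, std::size_t nj_, std::size_t nk_)
      : ni(ni_), nj(nj_), nk(nk_), data(ni_ * nj_ * nk_, 0.0) {}

  double& at(std::size_t i, std::size_t j, std::size_t k) {
    return data[i + ni * (j + nj * k)];
  }
  const double& at(std::size_t i, std::size_t j, std::size_t k) const {
    return data[i + ni * (j + nj * k)];
  }
};

// Copy `width` planes, starting at i = i_begin, into a contiguous buffer.
// Only this buffer would need to be transferred for the halo exchange.
std::vector<double> pack_i_planes(const Field3D& f, std::size_t i_begin,
                                  std::size_t width) {
  assert(i_begin + width <= f.ni);
  std::vector<double> buf;
  buf.reserve(width * f.nj * f.nk);
  for (std::size_t k = 0; k < f.nk; ++k)
    for (std::size_t j = 0; j < f.nj; ++j)
      for (std::size_t i = i_begin; i < i_begin + width; ++i)
        buf.push_back(f.at(i, j, k));
  return buf;
}

// Write a received buffer into `width` planes starting at i = i_begin,
// using the same traversal order as pack_i_planes.
void unpack_i_planes(Field3D& f, std::size_t i_begin, std::size_t width,
                     const std::vector<double>& buf) {
  assert(i_begin + width <= f.ni);
  assert(buf.size() == width * f.nj * f.nk);
  std::size_t n = 0;
  for (std::size_t k = 0; k < f.nk; ++k)
    for (std::size_t j = 0; j < f.nj; ++j)
      for (std::size_t i = i_begin; i < i_begin + width; ++i)
        f.at(i, j, k) = buf[n++];
}
```

For an 8 x 4 x 3 field and a halo width of 2, the packed buffer holds 24 values instead of the full 96. In a GPU code the packing loop would presumably run as a device kernel, but that part is outside this sketch.
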
Benchmarking the CUDA version of rhs4center has shown that it can achieve up to 40% of peak double-precision performance on Volta and Pascal.
Benchmarking in 2018 showed CUDA sw4lite running about 2x faster than RAJA sw4lite.
Note that optimizations 1 and 2 (merging small routines and overlapping MPI communication with computation) could have been done in the RAJA version as well. However, implementing some of these optimizations is very time consuming. For example, overlapping communication with computation is very intrusive, since the stencil update has to be split into two routines: the first updates the points needed for communication, and the second performs the rest of the update while the boundary points are exchanged.
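
The two-routine split described above can be sketched on a 1D three-point stencil. This is a simplified single-process C++ illustration, not the sw4lite implementation: all function names are hypothetical, and `exchange_halo` is a periodic-wrap stand-in for the real exchange. In an MPI code, nonblocking calls such as `MPI_Isend`/`MPI_Irecv` would be posted where `exchange_halo` is called, and a wait such as `MPI_Waitall` would follow the interior update.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Arrays hold n interior points at indices 1..n plus one ghost point at
// each end (indices 0 and n+1).

// Three-point stencil applied at index i.
inline double stencil(const std::vector<double>& u, std::size_t i) {
  return u[i - 1] - 2.0 * u[i] + u[i + 1];
}

// Reference version: the whole update in a single routine.
void update_all(const std::vector<double>& u, std::vector<double>& out) {
  const std::size_t n = u.size() - 2;
  for (std::size_t i = 1; i <= n; ++i) out[i] = stencil(u, i);
}

// Routine 1: update only the points that neighbors need.
void update_boundary(const std::vector<double>& u, std::vector<double>& out) {
  const std::size_t n = u.size() - 2;
  out[1] = stencil(u, 1);
  out[n] = stencil(u, n);
}

// Routine 2: update the remaining points, which the exchange never touches.
void update_interior(const std::vector<double>& u, std::vector<double>& out) {
  const std::size_t n = u.size() - 2;
  for (std::size_t i = 2; i < n; ++i) out[i] = stencil(u, i);
}

// Stand-in for the halo exchange: a periodic wrap on one process.
// It reads out[1] and out[n] and writes only the ghost points.
void exchange_halo(std::vector<double>& out) {
  const std::size_t n = out.size() - 2;
  out[0] = out[n];
  out[n + 1] = out[1];
}

// Split time step: boundary first, then exchange, then interior.
void step_split(const std::vector<double>& u, std::vector<double>& out) {
  assert(u.size() == out.size() && u.size() >= 3);
  update_boundary(u, out);
  exchange_halo(out);       // a nonblocking exchange would start here
  update_interior(u, out);  // independent of the exchanged points
  // a nonblocking exchange would be completed (waited on) here
}
```

Because `update_interior` neither reads nor writes the points involved in the exchange, the two can proceed concurrently. The cost is that every stencil routine needs a boundary variant and an interior variant, which is what makes the change intrusive.
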
For the same reasons, these optimizations were not done in SW4. In SW4 the time step loop is more complicated than in sw4lite because a viscoelastic model is integrated along with the elastic wave equation, and because there are grid refinement boundaries where special interface conditions are imposed. These additional features would have made it very time consuming to develop a fully optimized CUDA version of SW4.