Author: Matt Norman
Table of Contents
- Getting Started
- Compiling the Code
- Running the Code and Viewing the Results
- Adding OpenACC
- Playing with the code
- Other Resources
First, log in with:
ssh -Y [username]@bwbay.ncsa.illinois.edu
ORNL by default blocks outgoing SSH, including to Blue Waters. To get around this, you have two options:
- Use "corkscrew" (
apt-get intsall corkscrew
on Ubuntu), and use the ssh command
ssh -o ProxyCommand="corkscrew snowman.ornl.gov 3128 %h %p" -Y [username]@bwbay.ncsa.illinois.edu
- ssh into
home.ccs.ornl.gov
using 2-factor authentication first, and then use the normal ssh command above.
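If you use the corkscrew route regularly, you can put the proxy settings in your ~/.ssh/config instead of retyping the ProxyCommand every time. A minimal sketch, using the same proxy host and port as the command above (the "bluewaters" alias is just an example name):
Host bluewaters
    HostName bwbay.ncsa.illinois.edu
    User [username]
    ForwardX11 yes
    ProxyCommand corkscrew snowman.ornl.gov 3128 %h %p
With this in place, ssh bluewaters behaves like the full command above.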
Then, clone this repo:
git clone git@github.com:mrnorman/miniWeather.git
cd miniWeather
Choose either the c
or fortran
directory to work with, depending on your language preference.
All documentation is in documentation/miniWeather_documentation.pdf.
This code can be used to learn three different parallel programming aspects:
- MPI task-level parallelism
- OpenMP loop-level parallelism
- OpenACC accelerator parallelism
For this particular context, we are focusing on the OpenACC parallelism, so the goal is to add your own OpenACC directives to miniWeather_mpi.F90
or miniWeather_mpi.cpp
. There are examples below of how to do this with a loop from the code, in both C and Fortran. You'll see there already exists a miniWeather_mpi_openacc.F90
file, where the OpenACC directives have already been introduced. Please use it as a resource, but be aware that the data movement in that file has already been optimized, so the individual kernels no longer have data statements. Your code will probably look different as you add OpenACC directives yourself.
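To give a sense of what "optimized data movement" means: instead of each kernel carrying its own copyin()/copyout() clauses, the arrays are kept resident on the GPU across many kernels inside a single data region. A rough sketch in C (the array names and *_len sizes here are placeholders, not the actual contents of that file):
#pragma acc data copyin(state[0:state_len]) create(flux[0:flux_len], tend[0:tend_len])
{
  // Many parallel loops execute here with no per-kernel data clauses;
  // the arrays stay on the GPU for the whole block.
  #pragma acc parallel loop
  for (int i = 0; i < nx; i++) {
    /* ... kernel body ... */
  }
}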
To compile the code, you first need to create the correct environment with:
module swap PrgEnv-cray PrgEnv-pgi
module swap pgi pgi/18.7.0
module load cudatoolkit cray-parallel-netcdf
Use the Makefile.bw
to compile the code:
cd miniWeather/[c | fortran]
make -f Makefile.bw
To start an interactive 1-node job for an hour that contains a K20x GPU:
qsub -I -X -A bayr -lwalltime=01:00:00,nodes=1:ppn=16:xk
The -X
flag is important if you want to view the NetCDF output file or the Nvidia profiler output (see further down) because it enables X11 forwarding for the interactive job.
Next, set up the correct environment:
module swap PrgEnv-cray PrgEnv-pgi
module swap pgi pgi/18.7.0
module load cudatoolkit cray-parallel-netcdf ncview
Run the code:
aprun -n 1 ./miniWeather_mpi_openacc
OR
aprun -n 1 ./miniWeather_mpi
View the results
ncview output.nc
IMPORTANT!: Do not run ./miniWeather*
without aprun
, as this will run it either on the login node or a service node. Those are shared nodes with limited resources, and you risk keeping others from compiling their code, or even killing other people's jobs if you cause an OOM on a service node. So please, be a good neighbor, and always use aprun
for anything computationally intensive.
Again, for a fuller description, please see the OpenACC presentation and the miniWeather documentation. The code guides you where to put OpenACC directives with the following comment headers in miniWeather_mpi.[F90 | cpp].
Fortran:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! TODO: THREAD ME
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
C:
/////////////////////////////////////////////////
// TODO: THREAD ME
/////////////////////////////////////////////////
Here is an example of how to add directives to a loop with data statements.
Fortran:
do ll = 1 , NUM_VARS
do k = 1 , nz
do i = 1 , nx
tend(i,k,ll) = -( flux(i+1,k,ll) - flux(i,k,ll) ) / dx
enddo
enddo
enddo
Becomes
!$acc parallel loop collapse(3) copyin(flux) copyout(tend)
do ll = 1 , NUM_VARS
do k = 1 , nz
do i = 1 , nx
tend(i,k,ll) = -( flux(i+1,k,ll) - flux(i,k,ll) ) / dx
enddo
enddo
enddo
C:
for (ll=0; ll<NUM_VARS; ll++) {
for (k=0; k<nz; k++) {
for (i=0; i<nx; i++) {
indt = ll* nz * nx + k* nx + i ;
indf1 = ll*(nz+1)*(nx+1) + k*(nx+1) + i ;
indf2 = ll*(nz+1)*(nx+1) + k*(nx+1) + i+1;
tend[indt] = -( flux[indf2] - flux[indf1] ) / dx;
}
}
}
Becomes
#pragma acc parallel loop collapse(3) copyin(flux[0:NUM_VARS*(nz+1)*(nx+1)]) copyout(tend[0:NUM_VARS*nz*nx])
for (ll=0; ll<NUM_VARS; ll++) {
for (k=0; k<nz; k++) {
for (i=0; i<nx; i++) {
indt = ll* nz * nx + k* nx + i ;
indf1 = ll*(nz+1)*(nx+1) + k*(nx+1) + i ;
indf2 = ll*(nz+1)*(nx+1) + k*(nx+1) + i+1;
tend[indt] = -( flux[indf2] - flux[indf1] ) / dx;
}
}
}
If you want to make things easier on yourself, especially in C, where the compiler does not generate data statements for you, I recommend using the PGI compiler with Managed Memory. The reason is that you no longer need to add your own data statements (e.g., you don't need any copy()
, copyin()
, or copyout()
statements) because the CUDA runtime will move the data back and forth between CPU and GPU memory for you transparently under the hood. This means you can just focus on the parallel loop
directive.
See the Managed Memory section for more information on this.
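For example, with Managed Memory enabled (see the flag below), the C loop from the example above needs only the loop directive and no data clauses; this is a sketch, not the repo's version:
#pragma acc parallel loop collapse(3)  // no copyin()/copyout() needed with managed memory
for (ll=0; ll<NUM_VARS; ll++) {
  for (k=0; k<nz; k++) {
    for (i=0; i<nx; i++) {
      // same loop body as in the earlier example
      indt  = ll* nz   * nx    + k* nx    + i  ;
      indf1 = ll*(nz+1)*(nx+1) + k*(nx+1) + i  ;
      indf2 = ll*(nz+1)*(nx+1) + k*(nx+1) + i+1;
      tend[indt] = -( flux[indf2] - flux[indf1] ) / dx;
    }
  }
}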
Find the BEGIN USER-CONFIGURABLE PARAMETERS
in the source file, and change the number of grid points and the data specification according to the options in miniWeather_documentation.pdf
in this repo. One interesting thing you can do is see how the OpenACC GPU efficiency changes as you change the problem size in terms of number of grid cells. The smaller the workload becomes, the worse the GPU performance should get because you begin to have too few threads to keep the device busy.
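For reference, the block you're editing looks roughly like the following. This is only a sketch: the names and default values shown are illustrative and may differ between the C and Fortran versions, so go by what your copy of the source actually defines.
// BEGIN USER-CONFIGURABLE PARAMETERS
// (illustrative names and values only)
const int    nx_glob       = 400;                // total number of cells in the x-direction
const int    nz_glob       = 200;                // total number of cells in the z-direction
const double sim_time      = 1000;               // total simulated time in seconds
const int    data_spec_int = DATA_SPEC_THERMAL;  // which initial data specification to use
// END USER-CONFIGURABLE PARAMETERS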
You can profile the code with:
aprun -n 1 nvprof --profile-child-processes -o %h.%p.nvvp ./miniWeather_mpi_openacc
and view the profile with:
nvvp [filename].nvvp
You will need X11 forwarding for this, though. To view a reduced text-only profile, use:
aprun -n 1 nvprof --profile-child-processes --print-gpu-summary ./miniWeather_mpi_openacc
The PGI compiler allows you to use Managed Memory instead of explicit data statements via the -ta=nvidia,managed
flag. Try editing the Makefile to use Managed Memory, and see how the performance of the code changes. To do this, change the ACCFLAGS
in Makefile.bw
to:
ACCFLAGS := -ta=tesla,pinned,cc35,managed,ptxinfo -Minfo=accel
The PGI compiler also has a neat tool called PCAST, which automatically compares variables from redundantly executed CPU and GPU versions of the OpenACC code every time you move data from GPU memory to CPU memory (e.g., update host(...)
or copyout(...)
or the end of a block that has copy(...)
). You can also control the size of the absolute or relative differences that will trigger terminal output. Change your ACCFLAGS
in Makefile.bw
to:
ACCFLAGS := -ta=tesla,pinned,cc35,autocompare,ptxinfo -Minfo=accel
And recompile the code. For more options regarding the PCAST tool, see: https://www.pgroup.com/blogs/posts/pcast.htm
Nvidia has a valgrind
-esque tool called cuda-memcheck
, which checks for invalid memory address errors in your GPU code, and can be quite handy in finding bugs. To use it, use the original ACCFLAGS
specified in the repo's version of Makefile.bw
, compile the code, and run:
aprun -n 1 cuda-memcheck ./miniWeather_mpi_openacc
valgrind
is a very useful CPU memory tool that checks for memory leaks and invalid accesses (among other things). To debug with valgrind on Blue Waters, you'll need to first load the module:
module load valgrind
Then, add the -g
option to the CFLAGS
in Makefile.bw
so that valgrind can see the symbols and give you useful output. Then, simply run:
aprun -n 1 valgrind ./miniWeather_mpi
to see if you're committing any memory sins in your code. Keep in mind, this is for CPU code, not GPU code. In fact, with PGI 18.7, you'll find what appears to be a compiler bug in the C version, where valgrind complains about invalid reads in the main time stepping loop with the PGI compiler but does not with the GNU compiler.
I have a fairly involved document about the process you should expect when porting a new code to OpenACC from scratch. It's geared toward climate models, but I think you'll find it helpful even if you're in a different domain: