Update Halo class to allow halo exchanges for device arrays #163
Conversation
This PR updates the Halo class to allow halo exchanges of arrays allocated in device memory space as well as host memory space. With this update, Omega can take advantage of GPU-aware MPI implementations.

Changes include:

- Replace the packBuffer overloaded functions with specialized function templates for the supported Kokkos array ranks, allowing buffer packing of device arrays as well as host arrays (see the sketch below).
- Replace the unpackBuffer overloaded functions with specialized function templates for the supported Kokkos array ranks, allowing buffer unpacking into device arrays as well as host arrays.
- Allow MPI communication of buffer arrays on either host or device.
- Remove the device-to-host and host-to-device deep copies that are no longer needed for halo exchanges.
- Remove unnecessary deep copies and add a Halo destructor.
- Add an OMEGA_MPI_ON_DEVICE build flag and machine-specific build configs for Frontier and Perlmutter.

Successfully built and passed unit tests with OMEGA_MPI_ON_DEVICE both on and off on Chrysalis (intel), Perlmutter CPU (intel) & GPU (nvidiagpu), and Frontier CPU (crayclang) & GPU (crayclanggpu).
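As a rough illustration of the first two items, here is a minimal sketch of rank-specialized packing templates, assuming a Kokkos::View-based array type; the names (packBuffer, SendList, Buffer) and signatures are illustrative guesses, not the actual Omega API:

```cpp
// Illustrative sketch only, not the actual Omega code. Assumes Kokkos is
// initialized; the loop runs in the array's own execution space, so one
// template packs host or device arrays without intermediate deep copies.
#include <Kokkos_Core.hpp>
#include <type_traits>

// Rank-1 arrays: gather the elements listed in SendList into a flat buffer.
template <class ViewType, std::enable_if_t<ViewType::rank == 1, int> = 0>
void packBuffer(
    const ViewType &Array,
    const Kokkos::View<const int *, typename ViewType::memory_space> &SendList,
    const Kokkos::View<double *, typename ViewType::memory_space> &Buffer) {
   using ExecSpace = typename ViewType::execution_space;
   Kokkos::parallel_for(
       "packBuffer1", Kokkos::RangePolicy<ExecSpace>(0, SendList.extent(0)),
       KOKKOS_LAMBDA(const int IExch) { Buffer(IExch) = Array(SendList(IExch)); });
}

// Rank-2 arrays: pack one row per listed index, flattened into the buffer.
template <class ViewType, std::enable_if_t<ViewType::rank == 2, int> = 0>
void packBuffer(
    const ViewType &Array,
    const Kokkos::View<const int *, typename ViewType::memory_space> &SendList,
    const Kokkos::View<double *, typename ViewType::memory_space> &Buffer) {
   using ExecSpace = typename ViewType::execution_space;
   const int NList = static_cast<int>(SendList.extent(0));
   const int NCols = static_cast<int>(Array.extent(1));
   Kokkos::parallel_for(
       "packBuffer2",
       Kokkos::MDRangePolicy<ExecSpace, Kokkos::Rank<2>>({0, 0}, {NList, NCols}),
       KOKKOS_LAMBDA(const int IExch, const int J) {
          Buffer(IExch * NCols + J) = Array(SendList(IExch), J);
       });
}
```

Because the execution space is deduced from the array type, host and device arrays share one code path, which is what makes the old host staging copies unnecessary.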
Curious why you test Perlmutter GPU with nvidia instead of gnu (which is what E3SM and SCREAM test with).

@rljacob, I think we want Omega to support both the NVIDIA and GNU compilers on Perlmutter GPU nodes. Since Perlmutter uses NVIDIA GPUs, we typically test Omega with the NVIDIA compiler first and the GNU compiler next. However, if E3SM and SCREAM consider GNU the primary compiler on Perlmutter, we may adopt the same compiler preference.

We have not actually done a performance comparison between nvidia, gnu, and intel on Perlmutter GPUs, but we have seen nvidia have trouble with some of the Fortran code. gnu is preferred unless there is evidence another compiler is better.

Confirmed that this PR passes unit tests on Frontier with CPU and GPU. The unit tests show this is working correctly because they create arrays with unique values per cell, do a halo exchange, and then compute the error. Awaiting timing tests from Kieran Ringel for a performance comparison between this new halo exchange on device and the previous halo exchange on host (GPU versus CPU).
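A hedged sketch of that verification pattern; the function and variable names below are hypothetical stand-ins for whatever the unit tests actually use:

```cpp
// Sketch of the test idea described above: fill an array with a value
// derived from each cell's global ID, exchange halos, then check that every
// halo cell holds the value its owner computed. All names are hypothetical.
#include <Kokkos_Core.hpp>

template <class ViewType, class IDViewType>
double haloExchangeError(const ViewType &Array, const IDViewType &CellID,
                         const int NCellsAll) {
   using ExecSpace = typename ViewType::execution_space;
   // After the exchange, Array(ICell) should equal a value uniquely
   // determined by CellID(ICell) for owned and halo cells alike.
   double ErrSum = 0.0;
   Kokkos::parallel_reduce(
       "checkHalo", Kokkos::RangePolicy<ExecSpace>(0, NCellsAll),
       KOKKOS_LAMBDA(const int ICell, double &Err) {
          const double Expected = 2.0 * CellID(ICell); // unique per cell
          Err += Kokkos::fabs(Array(ICell) - Expected);
       },
       ErrSum);
   return ErrSum; // zero only if every halo cell received the right value
}
```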
Please see my comments on each item.
I found one correctness bug, and I think some generic Kokkos utilities could be moved to OmegaKokkos.h, but the changes in general look good to me. I performed some initial scaling experiments on Frontier and Perlmutter. In general, the performance improved, although the results on Frontier for large GPU counts are a bit surprising (the original code is slightly faster). I also wasn't able to get GPU-aware MPI to work on Perlmutter.
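For illustration, the kind of generic rank utility that could move into a shared header might look like the sketch below; this is a guess at the ArrayRank struct referenced in a later commit, not its actual definition:

```cpp
// Hypothetical sketch; the real ArrayRank in OmegaKokkos.h may differ.
#include <Kokkos_Core.hpp>

// Compile-time rank of a Kokkos array type, usable for template dispatch
// such as the rank-specialized pack/unpack functions above.
template <class ViewType> struct ArrayRank {
   static constexpr int Value = static_cast<int>(ViewType::rank);
};

static_assert(ArrayRank<Kokkos::View<double **>>::Value == 2,
              "rank is deduced at compile time");
```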
@kieran-ringel Note that, depending on what exactly you measured, this bug (#163 (comment)) might have affected these results, since it causes state and tracer halo exchanges to exchange host arrays only.
Follow-up updates:

- Unify similar functions and enums from Halo.h and Field.h and move them to OmegaKokkos.h, along with the ArrayRank struct.
- Remove the system-specific check for disabling cudaMallocAsync in the Kokkos build. Instead, set Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF by default for OMEGA_ARCH=CUDA, and add the ability to enable it via the OMEGA_CUDA_MALLOC_ASYNC command-line CMake option.
- Remove the system-specific check and instead append `export MPICH_GPU_SUPPORT_ENABLED=1` to omega_env.sh whenever Omega is built with MPICH and both OMEGA_TARGET_DEVICE and OMEGA_MPI_ON_DEVICE are true (see the sketch below).
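For context, a minimal sketch of the host/device selection such a flag typically controls; only the OMEGA_MPI_ON_DEVICE name comes from this PR, and the surrounding function and buffer names are hypothetical:

```cpp
// Hypothetical illustration; only OMEGA_MPI_ON_DEVICE is from this PR.
#include <Kokkos_Core.hpp>
#include <mpi.h>

void startRecv(Kokkos::View<double *> RecvBuffer,               // device buffer
               Kokkos::View<double *>::HostMirror RecvBufferH,  // host mirror
               int Src, int Tag, MPI_Comm Comm, MPI_Request *Req) {
#ifdef OMEGA_MPI_ON_DEVICE
   // GPU-aware MPI: post the receive directly on the device buffer, so no
   // host staging copy is needed.
   MPI_Irecv(RecvBuffer.data(), static_cast<int>(RecvBuffer.extent(0)),
             MPI_DOUBLE, Src, Tag, Comm, Req);
#else
   // Host-staged path: receive into the host mirror; the caller deep-copies
   // it to the device after MPI_Wait.
   MPI_Irecv(RecvBufferH.data(), static_cast<int>(RecvBufferH.extent(0)),
             MPI_DOUBLE, Src, Tag, Comm, Req);
#endif
}
```

On Cray MPICH, the device branch only works when MPICH_GPU_SUPPORT_ENABLED=1 is set in the environment, which is why the build appends that export to omega_env.sh.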
Approved. Tests passed on Frontier, Perlmutter, and Chrysalis using various compilers. The only exception is a segmentation fault in TIMESTEPPING_TEST on Perlmutter-GPU with the gnugpu compiler. This issue appears to be isolated to this specific case and could be addressed in a separate task.
@grnydawn I had run the tests successfully with gnugpu after the discussion above and didn't see any issue, but I built in debug mode. I'm seeing this seg fault now when building in release. I'll take a look and see if I can track it down.
@brian-oneill, it is strange that I encountered the failure while running Omega built in "Debug" mode. However, when I retried the TIMESTEPPER_TEST this morning, I did not experience any failures in either "Debug" or "Release" mode. FYI, yesterday I was able to locate the source lines where the failure occurs: it segfaulted in Halo.h near line 850.
Thank you @brian-oneill! Approving based on the testing above, particularly the plot of halo exchange speed-up by @kieran-ringel. The timing results on Perlmutter are confusing, but I think that is due to the architecture and not an issue with this PR.

Merging. We will keep an eye out for the failure @grnydawn mentions above, but since it was not repeatable, I don't want to hold up this merge.