
Update Halo class to allow halo exchanges for device arrays #163

Conversation

brian-oneill

This PR updates the Halo class to allow halo exchanges of arrays allocated in device memory space as well as host memory space. With this update, Omega can take advantage of GPU-aware MPI implementations.
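For context, "GPU-aware" means the MPI library can read send buffers and write receive buffers that live in device memory, removing the host staging copies. A minimal sketch of the pattern, with hypothetical names rather than Omega's actual Halo API:

    #include <mpi.h>
    #include <Kokkos_Core.hpp>

    // Illustrative sketch only: with a GPU-aware MPI, the raw pointer of a
    // device-resident Kokkos View can be handed straight to
    // MPI_Isend/MPI_Irecv, with no deep copy to a host mirror.
    void exchangeWithNeighbor(Kokkos::View<double *> SendBuf, // device memory
                              Kokkos::View<double *> RecvBuf, // device memory
                              int NghbrRank, MPI_Comm Comm) {
       MPI_Request Reqs[2];
       MPI_Irecv(RecvBuf.data(), static_cast<int>(RecvBuf.size()), MPI_DOUBLE,
                 NghbrRank, 0, Comm, &Reqs[0]);
       Kokkos::fence(); // ensure packing kernels finished before the send
       MPI_Isend(SendBuf.data(), static_cast<int>(SendBuf.size()), MPI_DOUBLE,
                 NghbrRank, 0, Comm, &Reqs[1]);
       MPI_Waitall(2, Reqs, MPI_STATUSES_IGNORE);
    }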

Changes include:

  • Replace SendBuffer, RecvBuffer, and certain other member vectors of the Neighbor and ExchList classes with arrays on both host and device
  • Replace packBuffer and unpackBuffer overloaded functions with specialized function templates for supported array ranks (see the sketch after this list)
  • Remove deepCopy calls before and after halo exchanges in classes and unit tests where they are no longer necessary
  • Add device array tests to the Halo unit test
  • Add an OMEGA_MPI_ON_DEVICE build flag (on by default)
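As a rough illustration of the rank-templating item above (hypothetical code, not Omega's actual packBuffer): a single function template can branch on the View rank at compile time and run in the default execution space, so it packs device arrays directly.

    #include <Kokkos_Core.hpp>

    // Hedged sketch: one template covers every supported rank, replacing a
    // separate overload per array type. Indices in ExchList select which
    // entries of Array are copied into the flat buffer.
    template <class ArrayType>
    void packBuffer(const Kokkos::View<double *> &Buffer,
                    const Kokkos::View<const int *> &ExchList,
                    const ArrayType &Array) {
       const int NList = static_cast<int>(ExchList.extent(0));
       if constexpr (ArrayType::rank == 1) {
          Kokkos::parallel_for(
              NList, KOKKOS_LAMBDA(int IExch) {
                 Buffer(IExch) = Array(ExchList(IExch));
              });
       } else if constexpr (ArrayType::rank == 2) {
          const int NJ = static_cast<int>(Array.extent(1));
          Kokkos::parallel_for(
              Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {NList, NJ}),
              KOKKOS_LAMBDA(int IExch, int J) {
                 Buffer(IExch * NJ + J) = Array(ExchList(IExch), J);
              });
       }
    }

The same shape works for unpackBuffer, with the assignment direction reversed.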

Successfully built and passed unit tests with OMEGA_MPI_ON_DEVICE both on and off on Chrysalis (intel), Perlmutter CPU (intel) and GPU (nvidiagpu), and Frontier CPU (crayclang) and GPU (crayclanggpu).

Checklist

  • Documentation:
    • User's Guide has been updated
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.

  • Replace packBuffer overloaded functions with specialized function templates for supported Kokkos array ranks; allow buffer packing of device arrays as well as host arrays.
  • Replace unpackBuffer overloaded functions with specialized function templates for supported Kokkos array ranks; allow buffer unpacking into device arrays as well as host arrays.
  • Allow MPI communication of buffer arrays on either host or device.
  • Remove device->host and host->device deep copies no longer needed for halo exchanges.
  • Remove unnecessary deep copies and add a Halo destructor.
  • Add OMEGA_MPI_ON_DEVICE build flag and machine-specific build configs for Frontier and Perlmutter.
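A hedged illustration of how a flag like OMEGA_MPI_ON_DEVICE might be consumed on the C++ side (the alias names here are invented; the real wiring lives in OmegaBuild.cmake and the Halo sources):

    #include <Kokkos_Core.hpp>

    #ifdef OMEGA_MPI_ON_DEVICE
    // GPU-aware MPI: buffers stay in device memory and MPI reads/writes
    // them directly.
    using HaloBuffMemSpace = Kokkos::DefaultExecutionSpace::memory_space;
    #else
    // Fallback: buffers live on the host, so packed data must be
    // deep-copied from device before each send (and back after each recv).
    using HaloBuffMemSpace = Kokkos::HostSpace;
    #endif

    using HaloBuff = Kokkos::View<double *, HaloBuffMemSpace>;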
@rljacob

rljacob commented Nov 18, 2024

Curious why you test Perlmutter GPU with nvidia instead of gnu (which is what E3SM and SCREAM test with).

@grnydawn

> Curious why you test Perlmutter GPU with nvidia instead of gnu (which is what E3SM and SCREAM test with).

@rljacob , I think we want Omega to support both the NVIDIA and GNU compilers on Perlmutter GPU nodes. Since Perlmutter uses NVIDIA GPUs, we typically test Omega with the NVIDIA compiler first and the GNU compiler next. However, if E3SM and SCREAM consider GNU the primary compiler on Perlmutter, we may adopt the same compiler preference.

@rljacob

rljacob commented Nov 18, 2024

We have not actually done a performance comparison between nvidia, gnu, and intel on Perlmutter GPUs. But we have seen nvidia have trouble with some of the Fortran code; gnu is preferred unless there is evidence another compiler is better.

@mark-petersen

Confirmed that this PR passes unit tests on Frontier with CPU and GPU. The unit tests show this is working correctly because they create arrays with unique values per cell, do a halo exchange, and then compute the error.
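A hedged sketch of that verification pattern, with the halo exchange passed in as a callable so no specific Omega API is assumed:

    #include <Kokkos_Core.hpp>

    // Fill every cell with a globally unique value, exchange halos on the
    // device array, then count halo cells whose received value does not
    // match what the owning rank assigned.
    template <class ExchangeFn>
    int countHaloErrors(ExchangeFn ExchangeHalo,
                        const Kokkos::View<const int *> &CellIDGlobal,
                        int NCellsAll) {
       Kokkos::View<double *> Test("Test", NCellsAll);
       Kokkos::parallel_for(
           NCellsAll, KOKKOS_LAMBDA(int Cell) {
              Test(Cell) = static_cast<double>(CellIDGlobal(Cell));
           });
       ExchangeHalo(Test); // device array, no host deep copies
       int NErrs = 0;
       Kokkos::parallel_reduce(
           NCellsAll,
           KOKKOS_LAMBDA(int Cell, int &Err) {
              if (Test(Cell) != static_cast<double>(CellIDGlobal(Cell)))
                 ++Err;
           },
           NErrs);
       return NErrs;
    }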

Awaiting timing tests from Kieran Ringel for a performance comparison between this new halo exchange on device and the previous halo exchange on host (GPU versus CPU).

@grnydawn left a comment

Please see my comments on each item.

@mwarusz left a comment

I found one correctness bug, and I think some generic Kokkos utilities could be moved to OmegaKokkos.h, but the changes in general look good to me. I performed some initial scaling experiments on Frontier and Perlmutter. In general, the performance improved, although the results on Frontier for large GPU counts are a bit surprising (the original code is slightly faster). I also wasn't able to get GPU-aware MPI to work on Perlmutter.
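A hedged sketch of the kind of generic utility that could live in OmegaKokkos.h, e.g. a rank trait shared by the pack/unpack templates in Halo.h and Field.h (illustrative, not the actual Omega code):

    #include <Kokkos_Core.hpp>
    #include <cstddef>
    #include <type_traits>

    // One shared definition instead of a private copy per header: expose
    // the rank of any supported array type for compile-time dispatch.
    template <class ArrayType>
    struct ArrayRank
        : std::integral_constant<std::size_t,
                                 static_cast<std::size_t>(ArrayType::rank)> {};

    static_assert(ArrayRank<Kokkos::View<double **>>::value == 2,
                  "rank trait sanity check");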


@mwarusz

mwarusz commented Nov 25, 2024

Timing results for this PR with OMEGA_MPI_ON_DEVICE turned on and off (indicated in the second half of each name in the legend). [timing plot not reproduced here]

@kieran-ringel Note that, depending on what exactly you measured, this bug (#163 (comment)) might have affected these results, since it causes the state and tracer halo exchanges to exchange host arrays only.

@kieran-ringel

kieran-ringel commented Nov 25, 2024

Updated timing after fixing the State and Tracer exchangeHalo functions to exchange device arrays. [timing plot not reproduced here; note that the x axis is actually number of nodes, not GPUs]

  • Unify similar functions and enums from Halo.h and Field.h and move them to OmegaKokkos.h, along with the ArrayRank struct.
  • Remove the system-specific check for disabling cudaMallocAsync in the Kokkos build. Instead, set Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF by default for OMEGA_ARCH=CUDA, and add the ability to enable it via the OMEGA_CUDA_MALLOC_ASYNC command-line CMake option.
  • Remove the system-specific check, and instead append `export MPICH_GPU_SUPPORT_ENABLED=1` to omega_env.sh whenever Omega is built with MPICH and both OMEGA_TARGET_DEVICE and OMEGA_MPI_ON_DEVICE are true.
@grnydawn left a comment

Approved. Tests passed on Frontier, Perlmutter, and Chrysalis using various compilers. The only exception is a segmentation fault failure in TIMESTEPPING_TEST on Perlmutter-GPU with the gnugpu compiler. This issue appears to be isolated to this specific case and could be addressed in a separate task.

@brian-oneill
Author

@grnydawn I had run the tests successfully with gnugpu after the discussion above and didn't see any issue, but I built in debug mode. I'm seeing this seg fault now when building in release. I'll take a look and see if I can track it down.

@kieran-ringel

[timing plot not reproduced here] I ran the same performance tests as above on Perlmutter. Approving the PR based on the performance results on these two machines.

@grnydawn

grnydawn commented Dec 4, 2024

@brian-oneill , it is strange, but I encountered the failure while running Omega built in "Debug" mode. However, when I retried the TIMESTEPPER_TEST this morning, I did not experience any failures in either "Debug" or "Release" mode.

FYI, yesterday I was able to locate the source lines where the failure occurs. It segfaulted in Halo.h near line 850:

         parallelFor(
             {NK, NTotList, NJ}, KOKKOS_LAMBDA(int K, int IExch, int J) {
                auto Val       = Array(K, LocIndex(IExch), J);
                const R8 RVal  = reinterpret_cast<R8 &>(Val);
                const I4 IBuff = (K * NTotList + IExch) * NJ + J;
                LocBuff(IBuff) = RVal;
             });
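For context on the quoted loop: it flattens one slice of a rank-3 array into the contiguous R8 buffer, bit-copying each element into an R8 slot. A hedged host-side illustration of that bit-copy step (this is not the segfault fix, and not Omega's code):

    #include <cstdint>
    #include <cstring>

    using I4 = std::int32_t;
    using R8 = double;

    // Each value, whatever its scalar type, is stored bit-for-bit inside
    // an 8-byte R8 slot; std::memcpy is the strictly well-defined way to
    // express the bit copy that the quoted kernel performs with
    // reinterpret_cast.
    inline R8 toBufferSlot(I4 Val) {
       R8 Slot = 0.0;                         // zero-fill unused high bytes
       std::memcpy(&Slot, &Val, sizeof(Val)); // copy the 4 value bytes
       return Slot;
    }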

@mark-petersen left a comment

Thank you @brian-oneill! Approving based on the testing above, particularly the plot of halo exchange speed-up by @kieran-ringel. The timing results on Perlmutter are confusing, but I think that is due to the architecture and not an issue with this PR.

@mark-petersen

Merging. We will keep an eye out for the failure @grnydawn mentions above, but since it was not repeatable, I don't want to hold up this merge.

@mark-petersen merged commit 217a69b into E3SM-Project:develop on Dec 5, 2024. 2 checks passed.