Enable building and testing Omega in single precision #147

Merged: 11 commits, Nov 6, 2024

@mwarusz mwarusz commented Oct 16, 2024

This PR enables configuring Omega to use single precision and adds a test (reusing the tendency terms test) checking that we can build and run in single precision. The main changes are:

  • Changing R8 to Real where appropriate
  • Reading mesh and state arrays into temporary R8 arrays before storing them into the class members
  • Always using R8 buffers for halo exchanges
  • CMake logic to build the single precision library when needed
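
The first two changes can be illustrated with a minimal sketch. It assumes a compile-time macro selects the working precision and that file data always arrives as 8-byte values. The macro `EXAMPLE_SINGLE_PRECISION` and the helpers `fakeReadR8` and `readIntoReal` are hypothetical names for illustration only, not Omega's actual identifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using R8 = double;

// Hypothetical macro name; the real build flag may differ.
#ifdef EXAMPLE_SINGLE_PRECISION
using Real = float;
#else
using Real = double;
#endif

// Stand-in for a file read that always delivers 8-byte values.
std::vector<R8> fakeReadR8(std::size_t n) {
    std::vector<R8> tmp(n);
    for (std::size_t i = 0; i < n; ++i) {
        tmp[i] = 0.5 * static_cast<R8>(i);
    }
    return tmp;
}

// Read into a temporary R8 buffer, then convert element by element
// into Real storage (a plain copy when Real is double).
std::vector<Real> readIntoReal(std::size_t n) {
    const std::vector<R8> tmp = fakeReadR8(n);
    assert(tmp.size() == n);
    std::vector<Real> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<Real>(tmp[i]);
    }
    return out;
}
```

With this layout the file reader only ever has to handle one on-disk type, and the narrowing to `Real` happens in a single, explicit place.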

The halo buffers change is not optimal for performance in single precision. It would need to be optimized if we decide to pursue this option seriously in the future.
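
A rough sketch of what packing through R8 buffers implies, assuming a simple index-list halo; `packHaloR8` and `unpackHaloR8` are hypothetical helpers and not the code in Halo.h. When the field is `float`, each exchanged value is widened to 8 bytes, so the message is twice as large as a native single-precision buffer would be, which is presumably part of the performance cost mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using R8 = double;

// Pack selected entries of a field (float or double) into an R8 send buffer.
template <typename T>
std::vector<R8> packHaloR8(const std::vector<T>& field,
                           const std::vector<std::size_t>& sendIdx) {
    std::vector<R8> buf(sendIdx.size());
    for (std::size_t i = 0; i < sendIdx.size(); ++i) {
        assert(sendIdx[i] < field.size());
        buf[i] = static_cast<R8>(field[sendIdx[i]]);
    }
    return buf;
}

// Unpack an R8 receive buffer into a field, converting back to its type.
template <typename T>
void unpackHaloR8(const std::vector<R8>& buf,
                  const std::vector<std::size_t>& recvIdx,
                  std::vector<T>& field) {
    assert(buf.size() == recvIdx.size());
    for (std::size_t i = 0; i < recvIdx.size(); ++i) {
        assert(recvIdx[i] < field.size());
        field[recvIdx[i]] = static_cast<T>(buf[i]);
    }
}
```

The appeal of this design is that the exchange code needs only one buffer type; the trade-off is the extra conversion and the larger messages in single precision.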

Checklist

  • Documentation:
    • User's Guide has been updated
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • CTest unit tests for new features have been added per the approved design.
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.

@mark-petersen mark-petersen changed the title Enable builing and testing Omega in single precision Enable building and testing Omega in single precision Oct 21, 2024

@sbrus89 sbrus89 left a comment

This looks good to me and includes some nice clean-ups as well. I just had a couple questions related to the MPI calls and _Real suffix.

Comment on lines +480 to +483
// Read mesh cell coordinates
readCellArray(XCellH, "xCell");
readCellArray(YCellH, "yCell");
readCellArray(ZCellH, "zCell");

These function calls are a nice improvement, thanks.

components/omega/src/base/Halo.h (resolved)
@mwarusz mwarusz force-pushed the mwarusz/omega/real-precision branch 2 times, most recently from 42e8b11 to 3c6a1b2 Compare October 30, 2024 14:28

@grnydawn grnydawn left a comment

Both the double-precision and single-precision versions of the Omega unit tests were profiled using Nsight profilers, and the results indicate that this PR correctly generates the corresponding GPU kernels. However, I have not verified the validity of the algorithms.

The profiling results indicate that the unit tests in this PR are insufficient to determine whether single-precision improves performance, as the kernel sizes are too small and do not appear to be representative of typical climate algorithms.

I approve this PR, assuming it properly handles merging commits from external libraries as I noted in another comment.

cime (outdated, resolved)

I profiled the PLANE_TEST unit test with Nsight profilers; the results are summarized below. Further details are at https://acme-climate.atlassian.net/l/cp/YSFk1Zra.

  • The PR correctly generates the single-precision version of the Omega TEND_PLANE unit test.
  • The elapsed time for both the single-precision and double-precision versions of the unit test is nearly the same, at around 155 ms, excluding initialization and finalization routines.
  • GPU resources are underutilized in both cases. For the double-precision version, Compute (SM) Throughput is 4.15%, and Memory Throughput is 1.39%. For single-precision, Compute (SM) Throughput is 7.27%, and Memory Throughput is 0.81%.
  • It appears that the kernels are too small to effectively compare the performance characteristics of different floating-point precisions. The longest kernel runs for about 27 µs, but most kernels run in under 20 µs.
  • The arithmetic intensity (AI) of the kernels also appears to be too high, meaning these kernels might not accurately represent the performance characteristics of the full Omega model. The typical AI of climate algorithms ranges between 0.1 and 1 FLOPs/byte, but the AI of these unit test kernels reaches up to 460 FLOPs/byte.
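
For context on the AI figures, here is a small worked sketch of how arithmetic intensity is counted, using a textbook `y[i] = a * x[i] + y[i]` update as a stand-in for a memory-bound kernel. The kernel choice and the helper names are illustrative assumptions and do not come from the profiled tests. The count is idealized: it ignores caches and assumes every operand moves through memory once.

```cpp
#include <cassert>
#include <cstddef>

// FLOPs per byte of memory traffic.
inline double arithmeticIntensity(double flops, std::size_t bytes) {
    assert(bytes > 0);
    return flops / static_cast<double>(bytes);
}

// y[i] = a * x[i] + y[i]: 2 FLOPs per element (one multiply, one add),
// with x read, y read, and y written, i.e. 3 values of type T moved.
template <typename T>
double axpyIntensity() {
    return arithmeticIntensity(2.0, 3 * sizeof(T));
}
```

Under these assumptions `axpyIntensity<double>()` is 2/24, about 0.083 FLOPs/byte, and `axpyIntensity<float>()` is 2/12, about 0.167 FLOPs/byte. Halving the operand size doubles the intensity of a memory-bound kernel like this one, which is the regime where single precision would be expected to pay off; kernels at hundreds of FLOPs/byte are far from that regime.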

@mwarusz mwarusz force-pushed the mwarusz/omega/real-precision branch from 3c6a1b2 to 9b403ad Compare November 1, 2024 01:44
@brian-oneill

After the rebase, I was able to build and run the tests successfully on Chrysalis and Perlmutter CPU & GPU. Everything looks good, approving.

@sbrus89 sbrus89 left a comment

Approving based on my inspection and testing by @brian-oneill and @grnydawn.

@sbrus89 sbrus89 merged commit 801ceaa into E3SM-Project:develop Nov 6, 2024
2 checks passed
@sbrus89 sbrus89 self-assigned this Nov 6, 2024
@sbrus89 sbrus89 mentioned this pull request Nov 7, 2024
8 tasks