
Add Mesh class #48

Merged: 38 commits into E3SM-Project:develop, Feb 6, 2024
Conversation

@sbrus89 commented Jan 19, 2024

This PR introduces the Mesh class for the local mesh variables on an MPI rank. It includes unit tests that check mesh variables against various metrics to ensure the values have been read correctly.

Checklist

  • Documentation:
    • Design document has been generated and added to the docs
    • User's Guide has been updated
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • A comment in the PR documents testing used to verify the changes including any tests that are added/modified/impacted.
    • CTest unit tests for new features have been added per the approved design.
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.

@sbrus89 added the Omega label Jan 19, 2024
@sbrus89 self-assigned this Jan 19, 2024
@sbrus89 (Author) commented Jan 19, 2024

I still need to add documentation, but have opened a PR since the code portion is nearly complete.

Comment on lines +82 to +84
Array2DI4 CellsOnCell; ///< Indx of cells that neighbor each cell
ArrayHost2DI4 CellsOnCellH; ///< Indx of cells that neighbor each cell

This pattern of declaring two arrays (host and device) per mesh variable will double the amount of memory needed to store the mesh when the device is a CPU. Additionally, redundant copies between the two arrays will be performed. It is probably fine for now, but long-term we might need a different solution.
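
For illustration, a minimal sketch of the pattern under discussion, using the type aliases from the snippet above and a hypothetical deepCopy helper (this is not the actual Omega code): each mesh variable is declared twice, once for the device and once for the host, and initialization copies from one to the other.

class HorzMesh {
 public:
   Array2DI4 CellsOnCell;      ///< device copy of the connectivity
   ArrayHost2DI4 CellsOnCellH; ///< host copy, filled by parallel I/O

   void copyToDevice() {
      // On a GPU machine this is a host-to-device transfer; in a CPU-only
      // build it duplicates the same data in host memory, which is the
      // memory-doubling and redundant-copy concern raised above.
      deepCopy(CellsOnCell, CellsOnCellH); // hypothetical copy helper
   }
};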

@sbrus89 requested a review from brian-oneill on January 22, 2024
@sbrus89 force-pushed the sbrus89/omega/mesh-class branch 2 times, most recently from f0503a5 to 5ed7ba3, on January 22, 2024
@brian-oneill left a comment

Still reviewing the code, but with the needed change to the ParMETIS location variable in the CMake configuration, I built and successfully ran the test on Chrysalis and Frontier.

Review comments on components/omega/test/CMakeLists.txt (resolved)
@mwarusz (Member) commented Jan 25, 2024

Just noting that for implementing differential operators it would be good for the HorzMesh class to compute and store EdgeSignOnCell and EdgeSignOnVertex.
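
For context, a sketch of how these sign arrays are typically computed in MPAS-style codes (the array names and 0-based indexing here are illustrative, not the code this PR adds). EdgeSignOnCell records the orientation of each edge normal relative to a cell, under the MPAS convention that the normal points from the first to the second cell on the edge:

for (int Cell = 0; Cell < NCellsAll; ++Cell) {
   for (int J = 0; J < NEdgesOnCell(Cell); ++J) {
      int Edge = EdgesOnCell(Cell, J);
      // -1 where the edge normal points away from this cell, +1 otherwise
      EdgeSignOnCell(Cell, J) = (CellsOnEdge(Edge, 0) == Cell) ? -1 : 1;
   }
}

EdgeSignOnVertex follows the same idea with VerticesOnEdge in place of CellsOnEdge.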

@sbrus89 (Author) commented Jan 25, 2024

Thanks @mwarusz, yes, I realized that and am adding those computations.

@philipwjones mentioned this pull request Jan 25, 2024
@philipwjones left a comment

This looks mostly good. A few questions/comments and the specific change requested in a separate comment.

Any reason not to read all the quantities at once with a single read routine? That way we don't need to retain the IO decompositions and other IO stuff in the class.

It would be good to flesh out a little more detail in the DevGuide on how you would actually use this class throughout the code.

Given that the instance will be created in an Omega init routine, how would you access it from the run method? We won't be able to pass the instance up through the coupler. This is one reason the other classes have been storing all instances for later retrieval. Another option is to have some big Omega header with extern static instances for everything that needs to be saved from one phase to another.
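
A minimal sketch of the store-and-retrieve pattern described above, with hypothetical names (the existing Omega classes each define their own variant; requires <map>, <memory>, and <string>): instances created during init are kept in a static map so the run phase can look them up later by name.

class HorzMesh {
 public:
   static HorzMesh *create(const std::string &Name /*, mesh inputs... */) {
      auto Mesh = std::make_unique<HorzMesh>();
      HorzMesh *Ptr = Mesh.get();
      AllMeshes[Name] = std::move(Mesh);
      return Ptr;
   }

   // Called from the run phase to recover the instance created during init,
   // since the instance cannot be passed up through the coupler.
   static HorzMesh *get(const std::string &Name) {
      auto It = AllMeshes.find(Name);
      return (It != AllMeshes.end()) ? It->second.get() : nullptr;
   }

 private:
   static std::map<std::string, std::unique_ptr<HorzMesh>> AllMeshes;
};

std::map<std::string, std::unique_ptr<HorzMesh>> HorzMesh::AllMeshes;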

Review comment on components/omega/test/CMakeLists.txt (resolved)
@sbrus89 (Author) commented Jan 26, 2024

@philipwjones, thanks for the comments. I'll make edits to have a single read routine and instance storage/retrieval.

@sbrus89 force-pushed the sbrus89/omega/mesh-class branch from aa61e7f to 7a010ff on January 26, 2024
@philipwjones commented
@sbrus89 Thanks for the explanation - I guess we can leave the multiple reads for now and consolidate later once we've determined what we're reading vs computing.

@sbrus89 (Author) commented Jan 30, 2024

Thanks again for your comments @philipwjones, @brian-oneill, and @mwarusz. I think I have addressed the changes suggested so far. Please let me know if anything else comes to mind.

@sbrus89 (Author) commented Jan 30, 2024

I've run into some build issues, so I'm sorting those out.

@philipwjones commented
@sbrus89 Just a couple (hopefully final) things. First, Omega is following the MPAS convention of adding boundary conditions in the NCellsAll+1 location, so all arrays should be dimensioned NCellsSize (and NEdgesSize, etc.) to accommodate the extra entry, since the connectivity arrays will point to that location for boundary Cells/Edges/Vertices. Second, can you verify that the halo cells are filled correctly? Since the CellID, etc. arrays do have the IDs in the halo regions, the parallel IO may be filling them correctly, but I have not checked that - you could do a simple before/after Halo update comparison to see.
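
A sketch of the before/after comparison being suggested, with hypothetical helper names: save the halo values as filled by the parallel I/O, run a full halo exchange, and verify that nothing changes.

// If the parallel I/O filled the halo correctly, the exchange is a no-op.
ArrayHost1DI4 CellIDBefore("CellIDBefore", NCellsSize); // hypothetical names
deepCopy(CellIDBefore, CellIDH);        // save values as read from the file
MyHalo.exchangeFullArrayHalo(CellIDH);  // hypothetical halo-exchange call
int ErrCount = 0;
for (int Cell = 0; Cell < NCellsAll; ++Cell) {
   if (CellIDH(Cell) != CellIDBefore(Cell))
      ++ErrCount; // a changed halo value means the I/O fill was wrong
}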

Built the docs and they look fine and verified unit tests pass, but will recheck after the above changes.

@sbrus89 (Author) commented Feb 2, 2024

@philipwjones, thanks for noticing that. I fixed the array sizes and added a test that verifies the read-in values for the halos based on CellID etc. are correct by checking them against a full halo exchange.

@philipwjones left a comment

Thanks for the changes, @sbrus89. This looks fine to me now and still passes unit tests.

@mark-petersen commented
I rebased locally on the develop branch because master was just merged in. There were no conflicts. I tested as follows:


######### Frontier
cd /ccs/home/mpetersen/repos/omega/pr
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd components/omega/
module load cmake
rm -rf build
mkdir build
cd build
export PARMETIS_ROOT=$PROJWORK/cli115/pwjones/frontierlibs-cray/parmetis
cmake \
   -DOMEGA_CIME_COMPILER=crayclang \
   -DOMEGA_CIME_MACHINE=frontier \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S .. -B .
./omega_build.sh
ln -isf /ccs/home/mpetersen/meshes/ocean.QU.240km.151209.nc test/OmegaMesh.nc

salloc -A cli115 -J inter -t 2:00:00 -q debug -N 1 -S 0
cd /ccs/home/mpetersen/repos/omega/pr/components/omega/build
./omega_ctest.sh

######### perlmutter
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd components/omega/
module load cmake
mkdir build
cd build
export PARMETIS_ROOT=/global/cfs/cdirs/e3sm/software/polaris/pm-cpu/spack/dev_polaris_0_3_0_gnu_mpich/var/spack/environments/dev_polaris_0_3_0_gnu_mpich/.spack-env/view
cmake \
   -DOMEGA_CIME_COMPILER=intel \
   -DOMEGA_CIME_MACHINE=pm-cpu \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S .. -B .
ln -s /global/cfs/cdirs/e3sm/inputdata/ocn/mpas-o/oQU240/ocean.QU.240km.151209.nc test/OmegaMesh.nc
./omega_build.sh

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu --account=e3sm
./omega_ctest.sh

######### chrysalis
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd components/omega/
module load cmake
mkdir build
cd build
export PARMETIS_ROOT=/lcrc/group/e3sm/ac.boneill/intel-openmpi-libs
cmake \
   -DOMEGA_CIME_COMPILER=intel \
   -DOMEGA_CIME_MACHINE=chrysalis \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S .. -B .
ln -isf /home/ac.mpetersen/inputdata/ocn/mpas-o/oQU240/ocean.QU.240km.151209.nc test/OmegaMesh.nc
./omega_build.sh

srun -p debug -N 1 -t 1:00:00 --pty bash
./omega_ctest.sh
  1. chrysalis: All tests pass, including the new HORZMESH_TEST.
  2. frontier: All tests pass except IO_TEST due to a memory high water mark. That is unrelated to this PR, but I show it here for reference:
error:


Test project /ccs/home/mpetersen/repos/omega/pr/components/omega/build
    Start 8: IO_TEST
1/1 Test #8: IO_TEST ..........................***Failed    3.50 sec
Using memory pool. Initial size: 62.8379GB ;  Grow size: 62.8379GB.
[2024-02-05 13:52:28.894] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
[2024-02-05 13:52:28.895] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
[2024-02-05 13:52:28.895] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
[2024-02-05 13:52:28.896] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
[2024-02-05 13:52:28.897] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
[2024-02-05 13:52:28.898] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
[2024-02-05 13:52:28.899] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
[2024-02-05 13:52:28.899] [info] [IOTest.cpp:76] IOTest: Default decomp retrieval PASS
PIO: FATAL ERROR: Aborting... FATAL ERROR: Permission denied (file = IOTest.nc) (/ccs/home/mpetersen/repos/omega/pr/externals/scorpio/src/clib/pioc_support.c: 3429)
Obtained 9 stack frames.
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x387aa4]
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x387c92]
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x388368]
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x38e395]
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x385f12]
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x2f9448]
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x2a7448]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fffe8baf29d]
/autofs/nccs-svm1_home1/mpetersen/repos/omega/pr/components/omega/build/test/./testIO.exe() [0x2a4d0a]
MPICH ERROR [Rank 0] [job id 1631852.8] [Mon Feb  5 13:52:29 2024] [frontier08681] - Abort(-1) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Pool Memory High Water Mark:       445952
Pool Memory High Water Efficiency: 6.60947e-06
ERROR: For the pool allocator labeled "YAKL's primary memory pool":
ERROR: Trying to free an invalid pointer
This means you have either already freed the pointer, or its address has been corrupted somehow.
terminate called after throwing an instance of 'std::runtime_error'
  what():  This means you have either already freed the pointer, or its address has been corrupted somehow.
srun: error: frontier08681: task 0: Aborted
srun: Terminating StepId=1631852.8
slurmstepd: error: *** STEP 1631852.8 ON frontier08681 CANCELLED AT 2024-02-05T13:52:31 ***
srun: error: frontier08681: tasks 1-3: Terminated
srun: error: frontier08688: tasks 4-7: Terminated
srun: Force Terminated StepId=1631852.8
  3. perlmutter: I can't get cmake to work for this PR or for develop, so this is also unrelated to this PR.
error:

cmake \
   -DOMEGA_CIME_COMPILER=intel \
   -DOMEGA_CIME_MACHINE=pm-cpu \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S .. -B .

...

-- ===== Configuring GPTL library... =====
-- Could NOT find MPI_C (missing: MPI_C_WORKS)
-- Could NOT find MPI_Fortran (missing: MPI_Fortran_WORKS)
CMake Error at /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_C_FOUND MPI_Fortran_FOUND C Fortran)
Call Stack (most recent call first):
  /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindMPI.cmake:1830 (find_package_handle_standard_args)
  /global/homes/m/mpeterse/repos/omega/pr/externals/scorpio/src/gptl/CMakeLists.txt:140 (find_package)

@brian-oneill commented
Looks good to me; it builds and the unit tests pass successfully on Chrysalis and Frontier.

@philipwjones commented Feb 5, 2024

@mark-petersen The error you're getting on Frontier is because IOTest is trying to write a test output file and doesn't have permission. You'll have to run this from the /lustre/orion scratch spaces on Frontier (e.g. ${MEMBERWORK}/cli115); the compute nodes don't have write access to home directory spaces.

The memory high water mark is just a diagnostic and wasn't the actual error.

@mark-petersen commented
Thanks @philipwjones and @grnydawn for the help. I made two mistakes. On Perlmutter and Frontier, I have to run the tests on the scratch file system for PIO. On Perlmutter, I had to rm -rf the build directory to start a clean build and avoid that cmake error. Everything passes now. Here are my notes for reference.

Instructions (change to your paths):


######### Frontier
cd /ccs/home/mpetersen/repos/omega/pr
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd /lustre/orion/cli115/scratch/mpetersen/runs/240205_omega
module load cmake
rm -rf build
mkdir build
cd build
export PARMETIS_ROOT=$PROJWORK/cli115/pwjones/frontierlibs-cray/parmetis
cmake \
   -DOMEGA_CIME_COMPILER=crayclang \
   -DOMEGA_CIME_MACHINE=frontier \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S /ccs/home/mpetersen/repos/omega/pr/components/omega -B .
./omega_build.sh
ln -isf /ccs/home/mpetersen/meshes/ocean.QU.240km.151209.nc test/OmegaMesh.nc

salloc -A cli115 -J inter -t 2:00:00 -q debug -N 1 -S 0
cd /lustre/orion/cli115/scratch/mpetersen/runs/240205_omega
./omega_ctest.sh

######### perlmutter
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd components/omega/
module load cmake
cd /pscratch/sd/m/mpeterse/runs/240205_omega
rm -rf build
mkdir build
cd build
export PARMETIS_ROOT=/global/cfs/cdirs/e3sm/software/polaris/pm-cpu/spack/dev_polaris_0_3_0_gnu_mpich/var/spack/environments/dev_polaris_0_3_0_gnu_mpich/.spack-env/view
cmake \
   -DOMEGA_CIME_COMPILER=gnu  \
   -DOMEGA_CIME_MACHINE=pm-cpu \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S /global/homes/m/mpeterse/repos/omega/pr/components/omega -B .
ln -s /global/cfs/cdirs/e3sm/inputdata/ocn/mpas-o/oQU240/ocean.QU.240km.151209.nc test/OmegaMesh.nc
./omega_build.sh

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu --account=e3sm
cd /pscratch/sd/m/mpeterse/runs/240205_omega
./omega_ctest.sh

######### chrysalis
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd components/omega/
module load cmake
mkdir build
cd build
export PARMETIS_ROOT=/lcrc/group/e3sm/ac.boneill/intel-openmpi-libs
cmake \
   -DOMEGA_CIME_COMPILER=intel \
   -DOMEGA_CIME_MACHINE=chrysalis \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S .. -B .
ln -isf /home/ac.mpetersen/inputdata/ocn/mpas-o/oQU240/ocean.QU.240km.151209.nc test/OmegaMesh.nc
./omega_build.sh

srun -p debug -N 1 -t 1:00:00 --pty bash
./omega_ctest.sh


@mark-petersen left a comment

Approving based on testing and review by Phil and Brian. Thanks!

@sbrus89 (Author) commented Feb 6, 2024

@mwarusz, let me know if you have any other comments. Otherwise I'll plan on merging this soon.

@mwarusz (Member) left a comment

I have one small comment regarding the dev docs, but it should be an easy change and everything else looks good, so I am approving it.

Review comment on components/omega/doc/devGuide/HorzMesh.md (resolved)
@sbrus89 merged commit 57c0258 into E3SM-Project:develop Feb 6, 2024
2 checks passed