Optional point weight file (pnt_wght.ww3.nc) for unstructured grid to speed up initialization #1333
and only write the file if it's processor 1
Code review
Pass
Testing
Pass
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (12 files differ)
mww3_test_03/./work_PR1_MPI_d2 (12 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (17 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (16 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (15 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (6 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.17/./work_a (0 files differ)
ww3_tp2.17/./work_c (0 files differ)
ww3_tp2.17/./work_b (0 files differ)
ww3_tp2.19/./work_1B_a (0 files differ)
ww3_tp2.19/./work_1A_a (0 files differ)
ww3_tp2.19/./work_1C_a (0 files differ)
ww3_tp2.21/./work_b_metis (0 files differ)
ww3_tp2.21/./work_a (0 files differ)
ww3_tp2.21/./work_b (0 files differ)
ww3_tp2.6/./work_ST0 (0 files differ)
ww3_tp2.6/./work_ST4 (0 files differ)
ww3_tp2.6/./work_pdlib (0 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_tp2.7/./work_ST0 (0 files differ)
ww3_ts4/./work_ug_MPI (0 files differ)
ww3_ufs1.1/./work_unstr_b (0 files differ)
ww3_ufs1.1/./work_unstr_a (0 files differ)
ww3_ufs1.1/./work_unstr_c (0 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
Thanks @JessicaMeixner-NOAA, I'm glad to see this fix had such a big impact on performance!
@JessicaMeixner-NOAA and @MatthewMasarik-NOAA, I was able to replicate my tests, and all commits worked until I got to this PR. Something in this PR appears to be causing the hang on tp2.6. Additionally, on my other HPC I am seeing an error that is new to me (below).
Rank 220 [Mon Jan 13 18:09:49 2025] [c1-0c1s12n1] Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
I looked through the PR, and there seems to be a decent amount to unravel that could be causing the issues.
@thesser1 @JessicaMeixner-NOAA I can also confirm that this PR is causing the stalling in tp2.6. Going back to commit 488e3c solves the issue. @JessicaMeixner-NOAA @MatthewMasarik-NOAA please let me know how to proceed.
@sbanihash We need to create an issue and try to fix the problem. @thesser1 - does this work for you?
Yes, on our side, we will likely roll back a commit and continue working with our development until this is resolved. It is not ideal, but we have some boundary bugs that need testing before we lose momentum.
Pull Request Summary
This PR adds the ability to create and then reuse a point weight file, which speeds up initialization when using an unstructured grid. This is particularly needed for large unstructured grids with many output points.
Description
For unstructured grids, if a pnt_wght.ww3.nc file does not exist, the model writes one. (Note: ww3 is replaced with the grid name for multi-grid runs.) On a subsequent run, if this file exists, the model reads the point weights from it instead of performing a search for the points. In a test with a global unstructured grid at 15 km resolution in the UFS weather model, this reduced initialization time from 297.8026 s to 18.3455 s (roughly a 16x speedup).
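The read-or-create behaviour described above can be sketched in a few lines of Python. This is only an illustration of the caching pattern, not the WW3 implementation: the function names `search_weights` and `get_point_weights`, the `is_writer` flag, and the JSON file format are all assumptions made for this sketch (the real file is netCDF and the real code is Fortran).

```python
import json
import os


def search_weights(points):
    """Hypothetical stand-in for the expensive per-point search on the mesh."""
    # Made-up result: three node indices and three weights per output point.
    return [
        {"nodes": [i, i + 1, i + 2], "weights": [0.5, 0.25, 0.25]}
        for i, _ in enumerate(points)
    ]


def get_point_weights(path, points, is_writer=True, search=search_weights):
    """Return (weights, source), where source is "read" or "computed".

    If the weight file exists, it is read and the search is skipped.
    Otherwise the search runs and, on the writing process only,
    the result is saved for subsequent runs.
    """
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), "read"

    weights = search(points)
    if is_writer:  # assumption: exactly one process writes the file
        with open(path, "w") as f:
            json.dump(weights, f)
    return weights, "computed"
```

On the first call the file is missing, so the search runs and the file is written; on later calls with the same path the search is skipped entirely, which is where the initialization speed-up would come from.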
There are two dependencies to this PR, which are both satisfied as of 12/17/24
Issue(s) addressed
Fixes #1179
Commit Message
Optional point weight file (pnt_wght.ww3.nc) for unstructured grid to speed up initialization
Check list
Testing
First, I confirmed that no answers were changed (/scratch1/NCEPDEV/climate/Jessica.Meixner/PR_WW3/pointssavelist02/regtests). Then I saved off weight files for all unstructured grid tests, copied those into the work directories (/scratch1/NCEPDEV/climate/Jessica.Meixner/PR_WW3/pointssavelist04/regtests/ copyfiles.sh), and ensured that all answers replicated: /scratch1/NCEPDEV/climate/Jessica.Meixner/PR_WW3/pointssavelist04/regtests . Then I added a weight file to the tar file for one regtest for future testing and tested against that. Those answers are what is shown below. I also merged these changes into dev/ufs-weather-model and confirmed that we do get the desired initialization speed-up.
These have the expected non-b4b differences. Cases listed with "0 files differ" were added because of the new point weight output, and the two tp2.21 cases will no longer appear in the comparison once #1325 is merged.
matrixCompSummary.txt
matrixCompFull.txt