
Address diffs v2.12.1 to v3 #907

Merged (26 commits) on Jan 15, 2025

Conversation

@chengzhuzhang (Contributor) commented Dec 11, 2024

Description

Bug Fixes:

  • Fixed incorrect unit-conversion logic in the cf_units.Unit code that replaced genutil.udunits, which previously produced wrong results.

Performance Improvements:

  • Improved subsetting logic by performing a time slice before loading time series datasets into memory, enhancing performance by reducing the size of the datasets
  • Optimized variable derivation by subsetting and loading datasets into memory with .load(scheduler="sync") first, as the convert_units() function requires in-memory access for cf_units.Unit operations -- affected MERRA2 datasets
  • Addressed an issue with single-point time dimensions in climatology datasets (e.g., OMI-MLS). xCDAT now squeezes the time dimension to remove time coordinates and bounds before loading the dataset, avoiding errors like "Non-integer years and months are ambiguous and not currently supported." -- affected OMI-MLS datasets
  • Updated _get_slice_from_time_bounds() to load existing time bounds into memory with .load(scheduler="sync") to avoid hanging -- affected streamflow datasets
  • Added the missing unit conversion for H2OLNZ via Convert H2OLNZ units to ppm by volume #874 to address large differences
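The "time slice before loading" improvement above can be illustrated with a minimal, stdlib-only sketch. Here `lazy_read` is a hypothetical stand-in for a Dask-backed read of one time step (the real code uses xarray subsetting followed by `.load(scheduler="sync")`); the counter shows that only the requested window is ever materialized:

```python
# Hypothetical sketch: "slice first, then load". lazy_read stands in
# for a lazily-backed time series; the counter proves only the
# requested window is ever read.
loads = 0

def lazy_read(step):
    """Pretend to read one time step from disk."""
    global loads
    loads += 1
    return float(step)

n_steps = 120            # e.g. 120 monthly time steps on disk
window = range(24, 36)   # the requested time slice (12 steps)

# Slicing before loading means only 12 of the 120 steps are read.
values = [lazy_read(i) for i in window]
print(len(values), loads)  # 12 12 -- not 120
```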

Todo

  • Complete run with v3 data
    • (note: must run without arm_diags due to the chunking issue with the dataset described above)
    • Regression testing
      • Regression testing notebook
      • 553/590 match
      • 4/590 nan mismatch (okay, known diff with ccb subsetting)
      • 33/590 not equal (okay, very small number of elements are different with reasonable abs/rel diffs)
  • Complete run with v2 data -- IN PROGRESS
    • (note: must run without arm_diags due to the chunking issue with the dataset described above)
    • Confirm full run works with convert_units() change
    • Regression testing
      • Figure out why there are more variables with diffs now -- the new diffs are not actually an issue (mostly a minimal number of differing elements, none large); likely related to the convert_units() change.
      • Address Q variable diffs -- the unit conversion changed from g/kg to g/kg by volume in 4cda0a8, and the zonal_mean_2d_stratosphere complete run was not re-run to update the main results. This is good to go.
  • Test arm_diags with v2 and v3 data with correct chunking
    • Alternative solution: Call open_dataset() when only a single file is needed.
  • Missing streamflow plots
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/annual_map.png',
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/annual_scatter.png',
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/seasonality_map.png',
    • Issue with Opening time series files: ['/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/time-series/rgr/RIVER_DISCHARGE_OVER_LAND_LIQ_005101_006012.nc']
    • Root cause: the time slice was calculated from the dataset's existing time_bounds variable without calling .load(scheduler="sync") first.
    • Fix: call .load(scheduler="sync") if the time bounds' underlying array is a Dask Array object.
  • Fix integration test failures on Python 3.10+
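The streamflow fix above boils down to a type guard before the eager load. A schematic, stdlib-only sketch follows; `LazyBounds` is a hypothetical stand-in for an xarray variable backed by `dask.array.Array`:

```python
class LazyBounds:
    """Hypothetical stand-in for time bounds backed by a dask.array.Array."""
    def load(self, scheduler=None):
        # The real fix passes scheduler="sync" to avoid hanging.
        return [0.0, 1.0]

def get_time_bounds(bnds):
    # Real code checks: isinstance(bnds.data, dask.array.Array)
    if isinstance(bnds, LazyBounds):
        return bnds.load(scheduler="sync")
    return bnds  # already in memory: return unchanged

print(get_time_bounds(LazyBounds()))  # [0.0, 1.0]
print(get_time_bounds([2.0, 3.0]))    # [2.0, 3.0]
```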

Regression testing with v2 data and arm_diags

Link here: #907 (comment)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder (Collaborator) commented Dec 11, 2024

Regression testing was performed in #903; see the related comment.

Copying here:

Here is the regression notebook that compares latest main vs. v2.12.1.

  • Matching number of .nc files produced (590)
  • 501/590 are matching
  • 4/590 are mismatching (nan positions)
  • 85/590 are not equal

The 85/590 files that are not equal were compared at rtol=1e-4. These variables will need further investigation.

['/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/COREv2_Flux/COREv2_Flux-PminusE-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/CRU_IPCC/CRU-TREFHT-ANN-land_60S90N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/CRU_IPCC/CRU-TREFHT-ANN-land_60S90N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/Cloud MISR/MISRCOSP-CLDTOT_TAU1.3_9.4_MISR-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/Cloud MISR/MISRCOSP-CLDTOT_TAU1.3_MISR-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-NET_FLUX_SRF-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-200-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-500-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-land_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-U-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/GPCP_OAFLux/GPCP_OAFLux-PminusE-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-NET_FLUX_SRF-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-200-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-500-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-land_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_CL_HadISST/HadISST_CL-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_HadISST/HadISST-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_PD_HadISST/HadISST_PD-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_PI_HadISST/HadISST_PI-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/CRU_IPCC/CRU-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/CRU_IPCC/CRU-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-U-850-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-U-850-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_CL_HadISST/HadISST_CL-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_CL_HadISST/HadISST_CL-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PD_HadISST/HadISST_PD-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PD_HadISST/HadISST_PD-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PI_HadISST/HadISST_PI-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PI_HadISST/HadISST_PI-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d/ERA5/ERA5-OMEGA-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/ERA5/ERA5-H2OLNZ-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/ERA5/ERA5-H2OLNZ-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/ERA5/ERA5-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/ERA5/ERA5-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_CL_HadISST/HadISST_CL-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_PD_HadISST/HadISST_PD-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_PI_HadISST/HadISST_PI-SST-ANN-global_test.nc']

@tomvothecoder (Collaborator) commented:

I will pull my regression testing notebooks over from PR #903 to this PR and work on debugging in this branch with you.

That way #903 can be dedicated to adapting the regression tests.

@chengzhuzhang (Contributor, Author) commented:

Thank you @tomvothecoder for summarizing the cases to investigate. It looks like I'm running into an environment issue while debugging with VS Code (#906 (comment)). Let me know if the env works for you when debugging.

@tomvothecoder (Collaborator) commented:

> Thank you @tomvothecoder for summarizing the cases to investigate. It looks like I'm running into an environment issue while debugging with VS Code (#906 (comment)). Let me know if the env works for you when debugging.

I just replied here. The dev env works fine for me.

@tomvothecoder (Collaborator) commented:

I think I found the root cause of the large diffs here and in your comment here, specifically with TREFHT and ANN.

v2.12.1 code uses genutil.udunits

In v2.12.1, the convert_units() function for derived vars includes a portion of logic that calls genutil.udunits:

```python
temp = udunits(1.0, var.units)
coeff, offset = temp.how(target_units)
var = coeff * var + offset
var.units = target_units
```

Latest main uses cf_units.Unit with new implementation logic

I replaced genutil.udunits with cf_units.Unit. It looks like the conversion is incorrect for TREFHT from K to C.

```python
temp = cf_units.Unit(var.attrs["units"])
target = cf_units.Unit(target_units)
coeff, offset = temp.convert(1, target), temp.convert(0, target)

# Keep all of the attributes except the units.
with xr.set_options(keep_attrs=True):
    var = coeff * var + offset

var.attrs["units"] = target_units
```

The implementation might not be able to handle all variables and/or the logic is incorrect, which leads to the massive diff found in your comment here.

Next steps

Investigate the cf_units.Unit code block in convert_units(). Fix as needed.
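One plausible reading of the bug above: for a linear unit map f(x) = coeff·x + offset, the offset is f(0) and the slope is f(1) − f(0), not f(1) itself. A stdlib-only sketch of the difference, where `k_to_c` is a stand-in for `cf_units.Unit("K").convert(x, "degC")`:

```python
def k_to_c(x):
    """Stand-in for cf_units.Unit("K").convert(x, "degC")."""
    return x - 273.15

# Buggy pairing from the snippet above: (coeff, offset) = (f(1), f(0)).
bad_coeff, bad_offset = k_to_c(1.0), k_to_c(0.0)

# Correct extraction of the linear map f(x) = coeff * x + offset:
offset = k_to_c(0.0)          # -273.15
coeff = k_to_c(1.0) - offset  # 1.0

var = 300.0  # 300 K
print(round(bad_coeff * var + bad_offset, 2))  # -81918.15 (a massive diff)
print(round(coeff * var + offset, 2))          # 26.85
```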

@tomvothecoder (Collaborator) commented Dec 12, 2024

> Investigate the cf_units.Unit code block in convert_units(). Fix as needed.

I just pushed ee009c1 (#907) to address this issue.

Next steps

  • Add this fix to PR EAMxx variables #880 and re-run the plot for TREFHT-ANN-land to compare results
  • Perform another regression test with the complete run to compare against v2.12.1
    • 01/06/25 - I keep running into an issue where it hangs during arm_diags after the PS variable, resulting in missing files (I think). I'm going to try running the script with dask=2024.11.2 to see if that helps.

    • Figure out why complete run script hangs on arm_diags after PS variable -- due to sub-optimal chunking for specific datasets (more info here)

      Files only in /global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/v2.12.1v2/arm_diags:
      armdiags-CLOUD-ANNUALCYCLE-nsac1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-nsac1-test.png
      armdiags-CLOUD-ANNUALCYCLE-sgpc1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-sgpc1-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc1-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc2-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc2-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc3-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc3-test.png
      armdiags-FLUS-ANNUALCYCLE-sgpc1.png
      armdiags-PRECT-DJF-sgpc1-diurnal-cycle.png
      armdiags-PRECT-JJA-sgpc1-diurnal-cycle.png
      armdiags-PRECT-MAM-sgpc1-diurnal-cycle.png
      armdiags-PRECT-SON-sgpc1-diurnal-cycle.png
      armdiags-aerosol-activation-enac1-ccn02-ref.png
      armdiags-aerosol-activation-enac1-ccn02-test.png
      armdiags-aerosol-activation-enac1-ccn05-ref.png
      armdiags-aerosol-activation-enac1-ccn05-test.png
      armdiags-aerosol-activation-sgpc1-ccn02-ref.png
      armdiags-aerosol-activation-sgpc1-ccn02-test.png
      armdiags-aerosol-activation-sgpc1-ccn05-ref.png
      armdiags-aerosol-activation-sgpc1-ccn05-test.png
      armdiags-convection-onset-twpc1.png
      armdiags-convection-onset-twpc2.png
      armdiags-convection-onset-twpc3.png
    • Figure out why these MERRA2 datasets aren't being generated. The log says "resource temporarily unavailable"; maybe this is related.

      ['/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-200-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-PSL-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TAUXY-ANN-ocean_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TMQ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFHT-ANN-land_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-Z3-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-Q-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-Q-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-Q-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-Q-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TMQ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TMQ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-Z3-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-Z3-500-ANN-global_test.nc']

      For the missing MERRA2 datasets in the comment above, I have identified a timeout issue when attempting to derive a climatology variable that is a Dask Array via convert_units().

```python
ds = self._get_dataset_with_derived_climo_var(ds)
```

In convert_units(), loading the variable into memory via .values causes a deadlock for some reason.

```python
var.values = original_udunit.convert(var.values, target_udunit)
```

      This does not happen when isolating the code outside of E3SM Diags. Further investigation is needed.

```python
import cf_units
import xcdat as xc

filepath = "/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2/MERRA2_ANN_198001_201612_climo.nc"

args = {
    "paths": filepath,
    "decode_times": True,
    "add_bounds": ["X", "Y", "Z"],
    "coords": "minimal",
    "compat": "override",
}

ds = xc.open_mfdataset(**args)

# Load data into memory and view values
ds.wap.values
```

```text
<xarray.DataArray 'wap' (time: 1, plev: 42, lat: 361, lon: 576)> Size: 35MB
dask.array<open_dataset-wap, shape=(1, 42, 361, 576), dtype=float32, chunksize=(1, 42, 361, 576), chunktype=numpy.ndarray>
Coordinates:
    height   float64 8B ...
  * lat      (lat) float64 3kB -90.0 -89.5 -89.0 -88.5 ... 88.5 89.0 89.5 90.0
  * lon      (lon) float64 5kB 0.0 0.625 1.25 1.875 ... 357.5 358.1 358.8 359.4
  * plev     (plev) float64 336B 1e+05 9.75e+04 9.5e+04 ... 40.0 30.0 10.0
  * time     (time) object 8B 1980-01-15 00:00:00
Attributes:
    standard_name:     lagrangian_tendency_of_air_pressure
    long_name:         omega (=dp/dt)
    units:             Pa s-1
    comment:           commonly referred to as "omega", this represents the...
    original_name:     omega
    original_units:    Pa s-1
    history:           2015-10-11T21:02:10Z altered by CMOR: Converted units ...
    cell_methods:      time: mean
    cell_measures:     area: areacella
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
```
      • The root cause is that the Dask array is being loaded into memory without the correct scheduler (e.g., .load(scheduler="sync")):

```python
original_udunit = cf_units.Unit(var.attrs["units"])
target_udunit = cf_units.Unit(target_units)
var.values = original_udunit.convert(var.values, target_udunit)
var.attrs["units"] = target_units
```
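The fix ordering discussed above (load synchronously, then convert) can be sketched schematically. `LazyVar` here is a hypothetical stand-in for a Dask-backed xarray DataArray, and `convert` stands in for the cf_units conversion; this is an illustration of the control flow, not the real implementation:

```python
class LazyVar:
    """Hypothetical stand-in for a Dask-backed xarray variable."""
    def __init__(self, values, units):
        self.values = list(values)
        self.attrs = {"units": units}
        self.loaded = False

    def load(self, scheduler=None):
        # Real code: var.load(scheduler="sync") computes the Dask graph.
        self.loaded = True
        return self

def convert_units(var, convert, target_units):
    # Load into memory first (real check: the data is still a dask Array),
    # so the unit conversion only ever sees concrete values.
    if not var.loaded:
        var = var.load(scheduler="sync")
    var.values = [convert(v) for v in var.values]
    var.attrs["units"] = target_units
    return var

v = convert_units(LazyVar([273.15, 300.0], "K"), lambda x: x - 273.15, "degC")
print(v.values, v.attrs["units"])  # values in memory, units now "degC"
```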

@tomvothecoder tomvothecoder mentioned this pull request Dec 12, 2024
- Addresses performance bottleneck associated with attempting to load large datasets into memory. Time slicing reduces the size before loading into memory
- Constrain `dask <2024.12.0`
- Subset and load the climo dataset with the source variables before deriving the target variable to use the appropriate Dask scheduler
@tomvothecoder (Collaborator) left a review comment:

@chengzhuzhang FYI if we are still experiencing slowness or hanging with a complete run in #880, I will carry this fix over to that PR as well and re-run.

For cases where deriving a climo variable occurs, this fix will first subset the dataset with the source variables and load it into memory before performing the actual derivation.

This fix is required because the convert_units() function will try to load the variables into memory using the wrong scheduler by calling .values instead of .load(scheduler="sync"), which causes hanging/crashing (code below).

original_udunit = cf_units.Unit(var.attrs["units"])
target_udunit = cf_units.Unit(target_units)
var.values = original_udunit.convert(var.values, target_udunit)
var.attrs["units"] = target_units
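A minimal sketch of the fix, with a hypothetical Dask-backed variable and a plain scale factor standing in for `cf_units.Unit.convert()` so the example stays self-contained: load with the synchronous scheduler first, then convert in memory.

```python
import dask.array as da
import xarray as xr

# Hypothetical Dask-backed variable with units metadata.
var = xr.DataArray(da.ones(8, chunks=2), dims="time", attrs={"units": "m"})

# Load with the synchronous scheduler *before* assigning to .values, so
# the conversion below never triggers an implicit Dask compute that can
# hang under the wrong scheduler.
var.load(scheduler="sync")

# Stand-in for cf_units.Unit("m").convert(...): meters -> kilometers.
var.values = var.values / 1000.0
var.attrs["units"] = "km"
```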

@chengzhuzhang
Contributor Author

@tomvothecoder hey, Tom. Just a heads-up. I think #880 is in good shape and ready to merge.
I'm happy to test the scripts from #880 again when this PR is finalized. Let me know if there's anything I can help with for this PR...

- Avoids the issue where time bounds are generated for problematic singleton coords (e.g., OMI files)
- Update `squeeze_time_dim()` to catch cases where time_dim does not exist in the dataset
- Add logger info messages to print out climatology and time series filepaths and derived variable filepaths
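The singleton-time squeeze described above might look roughly like this. This is a sketch, not the actual `squeeze_time_dim()` implementation from the codebase:

```python
import numpy as np
import xarray as xr

def squeeze_time_dim(ds: xr.Dataset) -> xr.Dataset:
    """Drop a singleton time dimension plus its coordinate and bounds."""
    if "time" in ds.dims and ds.sizes["time"] == 1:
        ds = ds.squeeze(dim="time")
        ds = ds.drop_vars(["time", "time_bnds"], errors="ignore")
    return ds

# Climatology-like dataset with a single time point (e.g., OMI-MLS style).
ds = xr.Dataset(
    {"var": (("time", "lat"), np.ones((1, 3)))},
    coords={"time": ("time", [0]), "lat": ("lat", [-45.0, 0.0, 45.0])},
)
ds["time_bnds"] = (("time", "bnds"), np.array([[0, 1]]))

out = squeeze_time_dim(ds)
```

Dropping the coordinate and bounds entirely avoids the "Non-integer years and months are ambiguous" error when the dataset is later loaded.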
@tomvothecoder
Collaborator

@tomvothecoder hey, Tom. Just a heads-up. I think #880 is in good shape and ready to merge.

Great, I'll perform a final review before merging. Thanks!

I'm happy to test the scripts from #880 again when this PR is finalized. Let me know if there's anything I can help with for this PR...

In this PR, I’m debugging the final variables that fail to generate for arm_diags. The issue stems from sub-optimal/incorrect chunk sizes in the native v3 datasets, causing the Xarray .load(scheduler="sync") call to hang due to excessive chunks.

Here’s a script from af046a2 (#907) (pasted below for convenience) detailing the affected datasets and their chunk size issues. Any thoughts on why these chunk sizes were set this way?

# %%
"""
Issue 1 - Sub-optimal `CLOUD` and `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/CLOUD_twpc3_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `CLOUD` variable has chunks of (1, 80), resulting in 131400 chunks in 2 graph layers. (very bad, slow loading)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers. (very bad, slow loading)
"""

import xcdat as xc

ds = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/CLOUD_twpc3_200001_201412.nc"
    ]
)

print(ds.CLOUD.data)
#               Array	        Chunk
# Bytes	        40.10 MiB	    320 B
# Shape	        (131400, 80)	(1, 80)
# Dask graph	131400 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds.time_bnds.data)
# Array	Chunk
# Bytes	2.01 MiB	16 B
# Shape	(131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	object numpy.ndarray


# %%
"""
Issue 2 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/FLDS_sgpc1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `FLDS` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds2 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/FLDS_sgpc1_200001_201412.nc"
    ]
)

print(ds2.FLDS.data)
#               Array	Chunk
# Bytes	        513.28 kiB	3.98 kiB
# Shape	        (131400,)	(1019,)
# Dask graph	129 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds2.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray

# %%
"""
Issue 3 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_sgpc1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `PRECT` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds3 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_sgpc1_200001_201412.nc"
    ]
)

print(ds3.PRECT.data)
#               Array	Chunk
# Bytes	        513.28 kiB	3.98 kiB
# Shape	        (131400,)	(1019,)
# Dask graph	129 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds3.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray

# %%
"""
Issue 4 - Sub-optimal `num_a1` and `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/num_a1_enac1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `num_a1` variable has chunks of (1, 80), resulting in 131400 chunks in 2 graph layers. (very bad, slow loading)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds4 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/num_a1_enac1_200001_201412.nc"
    ]
)

print(ds4.num_a1.data)
#               Array	Chunk
# Bytes	        40.10 MiB	320 B
# Shape	        (131400, 80)	(1, 80)
# Dask graph	131400 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds4.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray


# %%
"""
Issue 5 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_twpc3_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `PRECT` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds5 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_twpc3_200001_201412.nc"
    ]
)

print(ds5.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray
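For reference, one read-time workaround for such files (without rewriting them) is to override the encoded chunking when opening. This sketch uses a small synthetic file rather than the datasets above:

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Synthetic stand-in for a file written with pathological per-timestep chunks.
path = os.path.join(tempfile.gettempdir(), "cloud_chunk_demo.nc")
xr.Dataset(
    {"CLOUD": (("time", "lev"), np.random.rand(1000, 80))}
).to_netcdf(path)

# chunks=-1 tells Dask to keep each variable as a single chunk,
# overriding whatever chunking is encoded on disk.
ds_fixed = xr.open_mfdataset([path], chunks=-1)
print(ds_fixed["CLOUD"].data.numblocks)
```

With one chunk per variable, `.load(scheduler="sync")` issues a single read instead of 131400 tiny ones.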

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jan 9, 2025

@tomvothecoder I'm trying to understand the newly emerging issues. I feel we didn't see these when we tested the refactor branch. Another difference is that we are now testing a new dataset as well: v3, as compared to the v2 data that was used for testing during the refactor.

It looks like, firstly, genutil.udunits was changed to cf_units.Unit, and then this change requires using .load(scheduler="sync") instead of .values, and the latter caused the weird chunking? Did I get the story right? I'm wondering if we would get the same behavior if we re-test arm_diags with the v2 data we have been working with?

I'm now testing with the v2 dataset here: /global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20221103.v2.LR.amip.NGD_v3atm.chrysalis/arm-diags-data/. The time record is longer in this v2 dataset, which spans 1985-2014.

@chengzhuzhang
Contributor Author

I updated the pre-processing script for single site output, and created correctly chunked datasets: /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/

The datasets load fast, but I ran into an issue with the diurnal_cycle_zt subset (error message as follows). I will further investigate whether a fix is needed for the newly generated data or for the code base.

15:05:14,345 [ERROR]: core_parameter.py(_run_diag:343) >> Error in e3sm_diags.driver.arm_diags_driver
Traceback (most recent call last):
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/parameter/core_parameter.py", line 340, in _run_diag
    single_result = module.run_diag(self)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/arm_diags_driver.py", line 61, in run_diag
    return _run_diag_diurnal_cycle_zt(parameter)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/arm_diags_driver.py", line 183, in _run_diag_diurnal_cycle_zt
    test_diurnal, lst = composite_diurnal_cycle(  # type: ignore
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/utils/diurnal_cycle_xr.py", line 86, in composite_diurnal_cycle
    lst[it, :, ilon] = (itime + start_time + lon[ilon] / 360 * 24) % 24  # type: ignore
    ~~~^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (131400,) into shape (1,)

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jan 11, 2025

313a641 should address the data problem reported in #907 (comment). The lat and lon values had time as their dimension, which broke the diurnal cycle routine. I'm generating a new set of datasets and will replace the files under /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/ for testing.

The old version of site data is now located at /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site_cdat/
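For illustration, collapsing lat/lon coordinates that were mistakenly written with a time dimension down to scalars might look like this (synthetic data; this is not the actual pre-processing script):

```python
import numpy as np
import xarray as xr

# Site dataset where lat/lon were mistakenly written with a time dimension.
ds = xr.Dataset(
    {"PRECT": ("time", np.random.rand(10))},
    coords={
        "time": ("time", np.arange(10)),
        "lat": ("time", np.full(10, 36.6)),
        "lon": ("time", np.full(10, 262.5)),
    },
)

# Collapse the constant lat/lon down to the scalar coordinates that a
# site-based diurnal-cycle routine would expect.
for name in ("lat", "lon"):
    ds = ds.assign_coords({name: ds[name].isel(time=0, drop=True)})
```

With scalar `lon`, the `lon[ilon] / 360 * 24` term in composite_diurnal_cycle broadcasts cleanly instead of producing a (131400,)-shaped array.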

@tomvothecoder
Collaborator

313a641 should address the data problem reported in #907 (comment). The lat and lon values had time as their dimension, which broke the diurnal cycle routine. I'm generating a new set of datasets and will replace the files under /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/ for testing.

The old version of site data is now located at /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site_cdat/

Thanks Jill. I will do an arm_diags only run with v3 data on Tues.

This PR looks like it's almost done. I'm wrapping up regression testing (v3 is good to go; v2 is almost done, just the Q var is off). I'd say we're on track to merge next Wed/Thurs.

@chengzhuzhang
Contributor Author

I can confirm that the performance issue is resolved by the new dataset in place. The full arm set completed in 2 mins in my test.

@chengzhuzhang
Contributor Author

I checked the metadata for v2 vs v3. The difference is that the v2 data has twice the time steps of v3; v2 has 72 vertical levels and v3 has 80. Is it possible that in the original refactored code, data was loaded in without chunking, while in the new code base, data are loaded as a Dask array and chunked in a weird way?

There are a few possibilities here (and any combination of them together):

1. We did regression testing on v2 data with `open_dataset()`, which avoids Dask issues

2. We incorporated `open_mfdataset()` and `.load()` later on (e.g., [Feature]: CDAT Migration: Add support for opening multiple time series datasets #861)

3. The v2 data does not have weird chunking like the v3 data

To recap this issue: the root cause is that the original set of site-based output was pre-processed with a script based on ncrcat. For these datasets with var(time, lev) or var(time) dimensions, chunking over time created too many small chunks and caused an I/O bottleneck. The problem did not show up during the refactor because the data were read in with open_dataset(), which avoids the chunking issue. When we introduced open_mfdataset() and .load() later, the chunk sizes were not compatible and caused the I/O issue. We have now rewritten the pre-processing script with xarray to generate correctly chunked time-series files (e.g., chunksize = variable dimension). The performance is now good.
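A sketch of the "chunksize = variable dimension" idea with xarray (synthetic data; the actual pre-processing script is not reproduced here, and the netCDF4 output backend is an assumption):

```python
import numpy as np
import xarray as xr

# Synthetic site time series (shapes are illustrative).
ds = xr.Dataset(
    {"CLOUD": (("time", "lev"), np.random.rand(500, 80).astype("float32"))}
)

# In-memory view of the target layout: one chunk spanning the whole
# variable, i.e. chunksize == variable shape.
chunked = ds.chunk({"time": -1, "lev": -1})

# Matching on-disk layout for the netCDF4 engine (assumption: netCDF4 is
# the backend used when writing the time-series files):
encoding = {"CLOUD": {"chunksizes": ds["CLOUD"].shape}}
# chunked.to_netcdf("CLOUD_site.nc", encoding=encoding)
```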

@tomvothecoder
Collaborator

tomvothecoder commented Jan 14, 2025

@chengzhuzhang Great, thank you for summarizing the arm_diags I/O issue and confirming the performance with the new e3sm_diags is now good!

I will complete regression testing with v2 data, fix the integration tests, and do a final review before merging this PR.

@chengzhuzhang
Contributor Author

@chengzhuzhang Great, thank you for summarizing the arm_diags I/O issue and confirming the performance with the new e3sm_diags is now good!

I will complete regression testing with v2 data, fix the integration tests, and do a final review before merging this PR.

yeah, I'm glad we don't need to update the code base to work around this data issue. I haven't re-produced the v2 site data, so the complete v2 run still has to exclude the arm_diags set.

Collaborator

@chengzhuzhang We need to replace cwt from scipy with PyWavelets (#912). I'm trying to figure out where to pass the deg = 6 arg here. If you can try taking a look too, that'd be great.


Contributor Author

@whannah1 hey Walter, as we are preparing for the e3sm_diags v3 release, @tomvothecoder found that scipy.signal.cwt and scipy.signal.morlet2 are removed from the latest scipy, and the suggestion was to use PyWavelets instead. It seems the transition to the PyWavelets API is not straightforward, and not much documentation was found to help with the transition. I'm wondering if you are familiar with this tool and could help re-write the _get_psd_from_wavelet function?

Collaborator

@tomvothecoder tomvothecoder Jan 15, 2025

@chengzhuzhang I suggest that we:

  1. Not rush to finish this issue for v3.0.0rc1 due to high risk of side-effects and bugs with new, unfamiliar API
  2. Move this work to a separate PR and try to get it done during E3SM Unified RC testing phase
  3. Constrain scipy < 1.15 for now -- FYI @xylar for E3SM Unified (will update E3SM Confluence page)

Let me know thoughts.

UPDATE: Just saw Xylar's comment here related to 3.

Contributor Author

All are reasonable! I was thinking along the same line.

Contributor

Constrain scipy < 1.15 for now -- FYI @xylar for E3SM Unified (will update E3SM Confluence page)

I want to make sure we're on the same page. This constraint would not be in E3SM-Unified; it would be in e3sm_diags. But it would be important that E3SM-Unified or its other dependencies don't contradict this constraint.

Collaborator

Constrain scipy < 1.15 for now -- FYI @xylar for E3SM Unified (will update E3SM Confluence page)

I want to make sure we're on the same page. This constraint would not be in E3SM-Unified; it would be in e3sm_diags. But it would be important that E3SM-Unified or its other dependencies don't contradict this constraint.

Yup we're on the same page. Thanks for confirming.

Contributor Author

@tomvothecoder, see the comment from @whannah1 here. We can either implement this as-is and test and fine-tune (if needed) during the testing period, or pin scipy and update to use PyWavelets during testing.

Collaborator

@chengzhuzhang I'm going to pin scipy here and address #912 in a separate PR.

@tomvothecoder tomvothecoder marked this pull request as ready for review January 15, 2025 20:26
@tomvothecoder tomvothecoder added the cdat-migration-fy24 CDAT Migration FY24 Task label Jan 15, 2025
Collaborator

@tomvothecoder tomvothecoder left a comment

@chengzhuzhang Here is my final self-review. All of the code changes look good to me. I will merge if you agree with the testing results below.

v2 data complete run results (notebook):

Results:

We are good to go here. The differences are expected and explained in the comments below.

                stat_name  value       pct
0    matching_files_count   1093  0.875801
1      missing_vars_count      0  0.000000
2   mismatch_errors_count     10  0.008013
3  not_equal_errors_count    143  0.114583
4        key_errors_count      0  0.000000
5     missing_files_count      2  0.001603
  • 1093/1246 matching files
  • 10/1246 mismatch errors due to ccb regional subsetting differences
  • 143/1246 not equal files -- the number of differing elements is very small and the stats (min, max, mean, sum) are similar, most likely related to the convert_units() change that replaced genutil.udunits with cf_units.Unit.
    • The exception is the Q variable, whose differences are explained by not updating the Q results on the main branch with the unit conversion changes. Non-issue.
  • 2/1246 are missing because AOD_550 was retired.

v3 data complete run results (notebook)

We are good to go here. The differences are expected and explained in the comments below.

                stat_name  value       pct
0    matching_files_count    553  0.937288
1      missing_vars_count      0  0.000000
2   mismatch_errors_count      4  0.006780
3  not_equal_errors_count     33  0.055932
4        key_errors_count      0  0.000000
5     missing_files_count      0  0.000000
  • 553/590 match within an rtol of 1e-5, which is awesome.
  • 4 mismatch errors are known issues due to regional subsetting differences with the "ccb" flag
  • 33 not equal errors are not a concern because they affect a very small number of elements in the datasets. The stats (min, max, mean, and sum) of the datasets between branches are close.

@chengzhuzhang
Contributor Author

@tomvothecoder thank you for the regression tests! The metrics data look good for releasing the first RC. To avoid delaying the e3sm-unified testing period, I suggest we merge. I will go through the diffs, and we can resolve any remaining issues, if needed, during the testing period.

@chengzhuzhang
Contributor Author

@tomvothecoder somehow, I can't find 25-01-15-branch-907 under https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/

@tomvothecoder
Collaborator

@tomvothecoder thank you for the regression tests! the metrics data look good for releasing the first rc. To not to delay the testing period of e3sm-unified, I suggest to merge. I will go through the diffs and we can resolve remaining issues, if needed, during testing period.

Sounds good!

@tomvothecoder somehow, I can't find 25-01-15-branch-907 under https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/

I forgot to make some of the directories o=rx. I will do that once perlmutter is back online.

@tomvothecoder tomvothecoder merged commit 70ecf94 into main Jan 15, 2025
6 checks passed
@tomvothecoder
Collaborator

@tomvothecoder somehow, I can't find 25-01-15-branch-907 under https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/

The directory is now available here: https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/25-01-15-branch-907/
