
Address diffs v2.12.1 to v3 #907

Merged (26 commits) on Jan 15, 2025

Conversation

@chengzhuzhang (Contributor) commented Dec 11, 2024

Description

Bug Fixes:

  • Fixed incorrect unit-conversion logic in the cf_units.Unit code that replaced genutil.udunits, which previously produced wrong results.

Performance Improvements:

  • Improved subsetting logic by performing a time slice before loading time series datasets into memory, enhancing performance by reducing the size of the datasets
  • Optimized variable derivation by subsetting and loading datasets into memory with .load(scheduler="sync") first, as the convert_units() function requires in-memory access for cf_units.Unit operations -- affected MERRA2 datasets
  • Addressed an issue with single-point time dimensions in climatology datasets (e.g., OMI-MLS). xCDAT now squeezes the time dimension to remove time coordinates and bounds before loading the dataset, avoiding errors like "Non-integer years and months are ambiguous and not currently supported." -- affected OMI-MLS datasets
  • Updated _get_slice_from_time_bounds() to load existing time bounds into memory with .load(scheduler="sync") to avoid hanging -- affected streamflow datasets
  • Added the missing unit conversion for H2OLNZ via Convert H2OLNZ units to ppm by volume #874 to address large differences
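The "time slice before loading" improvement above can be illustrated with a minimal, stdlib-only sketch. Here `lazy_read` is a hypothetical stand-in for a Dask-backed read of one time step (the real code uses xarray subsetting followed by `.load(scheduler="sync")`); the counter shows that only the requested window is ever materialized:

```python
# Hypothetical sketch: "slice first, then load". lazy_read stands in
# for a lazily-backed time series; the counter proves only the
# requested window is ever read.
loads = 0

def lazy_read(step):
    """Pretend to read one time step from disk."""
    global loads
    loads += 1
    return float(step)

n_steps = 120            # e.g. 120 monthly time steps on disk
window = range(24, 36)   # the requested time slice (12 steps)

# Slicing before loading means only 12 of the 120 steps are read.
values = [lazy_read(i) for i in window]
print(len(values), loads)  # 12 12 -- not 120
```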

Todo

  • Complete run with v3 data
    • (note: must run without arm_diags due to the chunking issue with the dataset described above)
    • Regression testing
      • Regression testing notebook
      • 553/590 match
      • 4/590 nan mismatch (okay, known diff with ccb subsetting)
      • 33/590 not equal (okay, very small number of elements are different with reasonable abs/rel diffs)
  • Complete run with v2 data -- IN PROGRESS
    • (note: must run without arm_diags due to the chunking issue with the dataset described above)
    • Confirm full run works with convert_units() change
    • Regression testing
      • Figure out why there are more variables with diffs now -- the new diffs are not actually an issue (mostly a minimal number of differing elements, none large); likely related to the convert_units() change.
      • Address Q variable diffs -- the unit conversion changed from g/kg to g/kg by volume in 4cda0a8, and the zonal_mean_2d_stratosphere complete run was not re-run to update the main results. This is good to go.
  • Test arm_diags with v2 and v3 data with correct chunking
    • Alternative solution: Call open_dataset() when only a single file is needed.
  • Missing streamflow plots
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/annual_map.png',
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/annual_scatter.png',
      '/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/25-01-10-branch-907-v2-data/streamflow/RIVER_DISCHARGE_OVER_LAND_LIQ_GSIM/seasonality_map.png',
    • Issue with Opening time series files: ['/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/time-series/rgr/RIVER_DISCHARGE_OVER_LAND_LIQ_005101_006012.nc']
    • Root cause: the time slice was calculated from the dataset's existing time_bounds variable without calling .load(scheduler="sync") first.
    • Fix: call .load(scheduler="sync") if the time bounds' underlying array is a Dask Array object.
  • Fix integration test failures on Python 3.10+
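The streamflow fix above boils down to a type guard before the eager load. A schematic, stdlib-only sketch follows; `LazyBounds` is a hypothetical stand-in for an xarray variable backed by `dask.array.Array`:

```python
class LazyBounds:
    """Hypothetical stand-in for time bounds backed by a dask.array.Array."""
    def load(self, scheduler=None):
        # The real fix passes scheduler="sync" to avoid hanging.
        return [0.0, 1.0]

def get_time_bounds(bnds):
    # Real code checks: isinstance(bnds.data, dask.array.Array)
    if isinstance(bnds, LazyBounds):
        return bnds.load(scheduler="sync")
    return bnds  # already in memory: return unchanged

print(get_time_bounds(LazyBounds()))  # [0.0, 1.0]
print(get_time_bounds([2.0, 3.0]))    # [2.0, 3.0]
```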

Regression testing with v2 data and arm_diags

Link here: #907 (comment)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder (Collaborator) commented Dec 11, 2024

Regression testing was performed in #903; see the related comment.

Copying here:

Here is the regression notebook that compares latest main vs. v2.12.1.

  • Matching number of .nc files produced (590)
  • 501/590 are matching
  • 4/590 are mismatching (nan positions)
  • 85/590 are not equal

The 85/590 files that are not equal were compared at rtol=1e-4. These variables will need further investigation.

['/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/COREv2_Flux/COREv2_Flux-PminusE-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/CRU_IPCC/CRU-TREFHT-ANN-land_60S90N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/CRU_IPCC/CRU-TREFHT-ANN-land_60S90N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/Cloud MISR/MISRCOSP-CLDTOT_TAU1.3_9.4_MISR-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/Cloud MISR/MISRCOSP-CLDTOT_TAU1.3_MISR-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-NET_FLUX_SRF-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-200-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-500-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-OMEGA-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-TREFHT-ANN-land_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/ERA5/ERA5-U-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/GPCP_OAFLux/GPCP_OAFLux-PminusE-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-NET_FLUX_SRF-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-200-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-500-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFHT-ANN-land_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_CL_HadISST/HadISST_CL-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_HadISST/HadISST-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_PD_HadISST/HadISST_PD-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/lat_lon/SST_PI_HadISST/HadISST_PI-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/CRU_IPCC/CRU-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/CRU_IPCC/CRU-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-TREFHT-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-U-850-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/ERA5/ERA5-U-850-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_CL_HadISST/HadISST_CL-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_CL_HadISST/HadISST_CL-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PD_HadISST/HadISST_PD-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PD_HadISST/HadISST_PD-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PI_HadISST/HadISST_PI-SST-ANN-polar_N_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/polar/SST_PI_HadISST/HadISST_PI-SST-ANN-polar_S_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d/ERA5/ERA5-OMEGA-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/ERA5/ERA5-H2OLNZ-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/ERA5/ERA5-H2OLNZ-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/ERA5/ERA5-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/ERA5/ERA5-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_CL_HadISST/HadISST_CL-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_PD_HadISST/HadISST_PD-SST-ANN-global_test.nc',
 '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/24-12-09-main/zonal_mean_xy/SST_PI_HadISST/HadISST_PI-SST-ANN-global_test.nc']

@tomvothecoder (Collaborator) commented:

I will pull my regression testing notebooks over from PR #903 to this PR and work on debugging in this branch with you.

That way #903 can be dedicated to adapting the regression tests.

@chengzhuzhang (Contributor, Author) commented:

Thank you @tomvothecoder for summarizing the cases to investigate. It looks like I'm running into an environment issue while debugging with VS Code (#906 (comment)). Let me know if the env works for you when debugging.

@tomvothecoder (Collaborator) commented:

> Thank you @tomvothecoder for summarizing the cases to investigate. It looks like I'm running into an environment issue while debugging with VS Code (#906 (comment)). Let me know if the env works for you when debugging.

I just replied here. The dev env works fine for me.

@tomvothecoder (Collaborator) commented:

I think I found the root cause of the large diffs here and in your comment here, specifically with TREFHT and ANN.

v2.12.1 code uses genutil.udunits

In v2.12.1, the convert_units() function for derived vars includes a portion of logic that calls genutil.udunits:

```python
temp = udunits(1.0, var.units)
coeff, offset = temp.how(target_units)
var = coeff * var + offset
var.units = target_units
```

Latest main uses cf_units.Unit with new implementation logic

I replaced genutil.udunits with cf_units.Unit. It looks like the conversion is incorrect for TREFHT from K to C.

```python
temp = cf_units.Unit(var.attrs["units"])
target = cf_units.Unit(target_units)
coeff, offset = temp.convert(1, target), temp.convert(0, target)

# Keep all of the attributes except the units.
with xr.set_options(keep_attrs=True):
    var = coeff * var + offset

var.attrs["units"] = target_units
```

The implementation might not be able to handle all variables and/or the logic is incorrect, which leads to the massive diff found in your comment here.

Next steps

Investigate the cf_units.Unit code block in convert_units(). Fix as needed.
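One plausible reading of the bug above: for a linear unit map f(x) = coeff·x + offset, the offset is f(0) and the slope is f(1) − f(0), not f(1) itself. A stdlib-only sketch of the difference, where `k_to_c` is a stand-in for `cf_units.Unit("K").convert(x, "degC")`:

```python
def k_to_c(x):
    """Stand-in for cf_units.Unit("K").convert(x, "degC")."""
    return x - 273.15

# Buggy pairing from the snippet above: (coeff, offset) = (f(1), f(0)).
bad_coeff, bad_offset = k_to_c(1.0), k_to_c(0.0)

# Correct extraction of the linear map f(x) = coeff * x + offset:
offset = k_to_c(0.0)          # -273.15
coeff = k_to_c(1.0) - offset  # 1.0

var = 300.0  # 300 K
print(round(bad_coeff * var + bad_offset, 2))  # -81918.15 (a massive diff)
print(round(coeff * var + offset, 2))          # 26.85
```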

@tomvothecoder (Collaborator) commented Dec 12, 2024

> Investigate the cf_units.Unit code block in convert_units(). Fix as needed.

I just pushed ee009c1 (#907) to address this issue.

Next steps

  • Add this fix to PR EAMxx variables #880 and re-run the plot for TREFHT-ANN-land to compare results
  • Perform another regression test with the complete run to compare against v2.12.1
    • 01/06/25 - I keep running into an issue where it hangs during arm_diags after the PS variable, resulting in missing files (I think). I'm going to try running the script with dask=2024.11.2 to see if that helps.

    • Figure out why complete run script hangs on arm_diags after PS variable -- due to sub-optimal chunking for specific datasets (more info here)

      Files only in /global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/v2.12.1v2/arm_diags:
      armdiags-CLOUD-ANNUALCYCLE-nsac1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-nsac1-test.png
      armdiags-CLOUD-ANNUALCYCLE-sgpc1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-sgpc1-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc1-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc1-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc2-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc2-test.png
      armdiags-CLOUD-ANNUALCYCLE-twpc3-ref.png
      armdiags-CLOUD-ANNUALCYCLE-twpc3-test.png
      armdiags-FLUS-ANNUALCYCLE-sgpc1.png
      armdiags-PRECT-DJF-sgpc1-diurnal-cycle.png
      armdiags-PRECT-JJA-sgpc1-diurnal-cycle.png
      armdiags-PRECT-MAM-sgpc1-diurnal-cycle.png
      armdiags-PRECT-SON-sgpc1-diurnal-cycle.png
      armdiags-aerosol-activation-enac1-ccn02-ref.png
      armdiags-aerosol-activation-enac1-ccn02-test.png
      armdiags-aerosol-activation-enac1-ccn05-ref.png
      armdiags-aerosol-activation-enac1-ccn05-test.png
      armdiags-aerosol-activation-sgpc1-ccn02-ref.png
      armdiags-aerosol-activation-sgpc1-ccn02-test.png
      armdiags-aerosol-activation-sgpc1-ccn05-ref.png
      armdiags-aerosol-activation-sgpc1-ccn05-test.png
      armdiags-convection-onset-twpc1.png
      armdiags-convection-onset-twpc2.png
      armdiags-convection-onset-twpc3.png
    • Figure out why these MERRA2 datasets aren't being generated. The log says "resource temporarily unavailable"; maybe this is related.

      ['/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-200-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-OMEGA-850-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-PSL-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TAUXY-ANN-ocean_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TMQ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFHT-ANN-land_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/lat_lon/MERRA2/MERRA2-Z3-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/meridional_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-PSL-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TAUXY-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TMQ-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFHT-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMNAV-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-TREFMXAV-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_N_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_N_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_S_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/polar/MERRA2/MERRA2-Z3-500-ANN-polar_S_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-Q-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-Q-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-H2OLNZ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-OMEGA-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-OMEGA-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-Q-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-Q-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-RELHUM-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_2d_stratosphere/MERRA2/MERRA2-RELHUM-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TMQ-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TMQ-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFHT-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMNAV-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-TREFMXAV-ANN-global_test.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-Z3-500-ANN-global_ref.nc',
       '/global/cfs/cdirs/e3sm/www/e3sm_diags/complete_run/25-01-06-branch-907-no-arm-diags/zonal_mean_xy/MERRA2/MERRA2-Z3-500-ANN-global_test.nc']

      For the missing MERRA2 datasets in the comment above, I have identified a timeout issue when attempting to derive a climatology variable that is a Dask Array via convert_units().

```python
ds = self._get_dataset_with_derived_climo_var(ds)
```

In convert_units(), loading the variable into memory via .values causes a deadlock for some reason.

```python
var.values = original_udunit.convert(var.values, target_udunit)
```

      This does not happen when isolating the code outside of E3SM Diags. Further investigation is needed.

```python
import cf_units
import xcdat as xc

filepath = "/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2/MERRA2_ANN_198001_201612_climo.nc"

args = {
    "paths": filepath,
    "decode_times": True,
    "add_bounds": ["X", "Y", "Z"],
    "coords": "minimal",
    "compat": "override",
}

ds = xc.open_mfdataset(**args)

# Load data into memory and view values
ds.wap.values
```

```text
<xarray.DataArray 'wap' (time: 1, plev: 42, lat: 361, lon: 576)> Size: 35MB
dask.array<open_dataset-wap, shape=(1, 42, 361, 576), dtype=float32, chunksize=(1, 42, 361, 576), chunktype=numpy.ndarray>
Coordinates:
    height   float64 8B ...
  * lat      (lat) float64 3kB -90.0 -89.5 -89.0 -88.5 ... 88.5 89.0 89.5 90.0
  * lon      (lon) float64 5kB 0.0 0.625 1.25 1.875 ... 357.5 358.1 358.8 359.4
  * plev     (plev) float64 336B 1e+05 9.75e+04 9.5e+04 ... 40.0 30.0 10.0
  * time     (time) object 8B 1980-01-15 00:00:00
Attributes:
    standard_name:     lagrangian_tendency_of_air_pressure
    long_name:         omega (=dp/dt)
    units:             Pa s-1
    comment:           commonly referred to as "omega", this represents the...
    original_name:     omega
    original_units:    Pa s-1
    history:           2015-10-11T21:02:10Z altered by CMOR: Converted units ...
    cell_methods:      time: mean
    cell_measures:     area: areacella
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
```
      • The root cause is that the Dask array is being loaded into memory without the correct scheduler (e.g., .load(scheduler="sync")):

```python
original_udunit = cf_units.Unit(var.attrs["units"])
target_udunit = cf_units.Unit(target_units)
var.values = original_udunit.convert(var.values, target_udunit)
var.attrs["units"] = target_units
```
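The fix ordering discussed above (load synchronously, then convert) can be sketched schematically. `LazyVar` here is a hypothetical stand-in for a Dask-backed xarray DataArray, and `convert` stands in for the cf_units conversion; this is an illustration of the control flow, not the real implementation:

```python
class LazyVar:
    """Hypothetical stand-in for a Dask-backed xarray variable."""
    def __init__(self, values, units):
        self.values = list(values)
        self.attrs = {"units": units}
        self.loaded = False

    def load(self, scheduler=None):
        # Real code: var.load(scheduler="sync") computes the Dask graph.
        self.loaded = True
        return self

def convert_units(var, convert, target_units):
    # Load into memory first (real check: the data is still a dask Array),
    # so the unit conversion only ever sees concrete values.
    if not var.loaded:
        var = var.load(scheduler="sync")
    var.values = [convert(v) for v in var.values]
    var.attrs["units"] = target_units
    return var

v = convert_units(LazyVar([273.15, 300.0], "K"), lambda x: x - 273.15, "degC")
print(v.values, v.attrs["units"])  # values in memory, units now "degC"
```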

@tomvothecoder tomvothecoder mentioned this pull request Dec 12, 2024
- Addresses performance bottleneck associated with attempting to load large datasets into memory. Time slicing reduces the size before loading into memory
- Constrain `dask <2024.12.0`
- Subset and load the climo dataset with the source variables before deriving the target variable to use the appropriate Dask scheduler
@tomvothecoder (Collaborator) left a review comment:

@chengzhuzhang FYI if we are still experiencing slowness or hanging with a complete run in #880, I will carry this fix over to that PR as well and re-run.

For cases where deriving a climo variable occurs, this fix will first subset the dataset with the source variables and load it into memory before performing the actual derivation.

This fix is required because the convert_units() function will try to load the variables into memory using the wrong scheduler by calling .values instead of .load(scheduler="sync"), which causes hanging/crashing (code below).

original_udunit = cf_units.Unit(var.attrs["units"])
target_udunit = cf_units.Unit(target_units)
var.values = original_udunit.convert(var.values, target_udunit)
var.attrs["units"] = target_units
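A minimal sketch of the fix, with a hypothetical Dask-backed variable and a plain scale factor standing in for `cf_units.Unit.convert()` so the example stays self-contained: load with the synchronous scheduler first, then convert in memory.

```python
import dask.array as da
import xarray as xr

# Hypothetical Dask-backed variable with units metadata.
var = xr.DataArray(da.ones(8, chunks=2), dims="time", attrs={"units": "m"})

# Load with the synchronous scheduler *before* assigning to .values, so
# the conversion below never triggers an implicit Dask compute that can
# hang under the wrong scheduler.
var.load(scheduler="sync")

# Stand-in for cf_units.Unit("m").convert(...): meters -> kilometers.
var.values = var.values / 1000.0
var.attrs["units"] = "km"
```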

@chengzhuzhang
Contributor Author

@tomvothecoder hey, Tom. Just a heads-up. I think #880 is in good shape and ready to merge.
I'm happy to test the scripts from #880 again when this PR is finalized. Let me know if there's anything I can help with for this PR...

- Avoids the issue where time bounds are generated for problematic singleton coords (e.g., OMI files)
- Update `squeeze_time_dim()` to catch cases where time_dim does not exist in the dataset
- Add logger info messages to print out climatology and time series filepaths and derived variable filepaths
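The singleton-time squeeze described above might look roughly like this. This is a sketch, not the actual `squeeze_time_dim()` implementation from the codebase:

```python
import numpy as np
import xarray as xr

def squeeze_time_dim(ds: xr.Dataset) -> xr.Dataset:
    """Drop a singleton time dimension plus its coordinate and bounds."""
    if "time" in ds.dims and ds.sizes["time"] == 1:
        ds = ds.squeeze(dim="time")
        ds = ds.drop_vars(["time", "time_bnds"], errors="ignore")
    return ds

# Climatology-like dataset with a single time point (e.g., OMI-MLS style).
ds = xr.Dataset(
    {"var": (("time", "lat"), np.ones((1, 3)))},
    coords={"time": ("time", [0]), "lat": ("lat", [-45.0, 0.0, 45.0])},
)
ds["time_bnds"] = (("time", "bnds"), np.array([[0, 1]]))

out = squeeze_time_dim(ds)
```

Dropping the coordinate and bounds entirely avoids the "Non-integer years and months are ambiguous" error when the dataset is later loaded.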
@tomvothecoder
Collaborator

@tomvothecoder hey, Tom. Just a heads-up. I think #880 is in good shape and ready to merge.

Great, I'll perform a final review before merging. Thanks!

I'm happy to test the scripts from #880 again when this PR is finalized. Let me know if there's anything I can help with for this PR...

In this PR, I’m debugging the final variables that fail to generate for arm_diags. The issue stems from sub-optimal/incorrect chunk sizes in the native v3 datasets, causing the Xarray .load(scheduler="sync") call to hang due to excessive chunks.

Here’s a script from af046a2 (#907) (pasted below for convenience) detailing the affected datasets and their chunk size issues. Any thoughts on why these chunk sizes were set this way?

# %%
"""
Issue 1 - Sub-optimal `CLOUD` and `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/CLOUD_twpc3_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `CLOUD` variable has chunks of (1, 80), resulting in 131400 chunks in 2 graph layers. (very bad, slow loading)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers. (very bad, slow loading)
"""

import xcdat as xc

ds = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/CLOUD_twpc3_200001_201412.nc"
    ]
)

print(ds.CLOUD.data)
#               Array	        Chunk
# Bytes	        40.10 MiB	    320 B
# Shape	        (131400, 80)	(1, 80)
# Dask graph	131400 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds.time_bnds.data)
# Array	Chunk
# Bytes	2.01 MiB	16 B
# Shape	(131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	object numpy.ndarray


# %%
"""
Issue 2 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/FLDS_sgpc1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `FLDS` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds2 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/FLDS_sgpc1_200001_201412.nc"
    ]
)

print(ds2.FLDS.data)
#               Array	Chunk
# Bytes	        513.28 kiB	3.98 kiB
# Shape	        (131400,)	(1019,)
# Dask graph	129 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds2.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray

# %%
"""
Issue 3 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_sgpc1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `PRECT` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds3 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_sgpc1_200001_201412.nc"
    ]
)

print(ds3.PRECT.data)
#               Array	Chunk
# Bytes	        513.28 kiB	3.98 kiB
# Shape	        (131400,)	(1019,)
# Dask graph	129 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds3.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray

# %%
"""
Issue 4 - Sub-optimal `num_a1` and `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/num_a1_enac1_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `num_a1` variable has chunks of (1, 80), resulting in 131400 chunks in 2 graph layers. (very bad, slow loading)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds4 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/num_a1_enac1_200001_201412.nc"
    ]
)

print(ds4.num_a1.data)
#               Array	Chunk
# Bytes	        40.10 MiB	320 B
# Shape	        (131400, 80)	(1, 80)
# Dask graph	131400 chunks in 2 graph layers
# Data type	    float32 numpy.ndarray

print(ds4.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray


# %%
"""
Issue 5 - Sub-optimal `time_bnds` chunking
  * Related dataset: "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_twpc3_200001_201412.nc"
  * Dataset shape is (time: 131400, bound: 2, lev: 80)
  * `PRECT` variable has chunks of (1019,), resulting in 129 chunks in 2 graph layers (okay)
  * `time_bnds` has chunks of (1, 2), resulting in 131400 chunks in 3 graph layers (very bad, slow loading)
"""

ds5 = xc.open_mfdataset(
    [
        "/global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/PRECT_twpc3_200001_201412.nc"
    ]
)

print(ds5.time_bnds.data)
#               Array	Chunk
# Bytes	        2.01 MiB	16 B
# Shape	        (131400, 2)	(1, 2)
# Dask graph	131400 chunks in 3 graph layers
# Data type	    object numpy.ndarray
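For reference, one read-time workaround for such files (without rewriting them) is to override the encoded chunking when opening. This sketch uses a small synthetic file rather than the datasets above:

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Synthetic stand-in for a file written with pathological per-timestep chunks.
path = os.path.join(tempfile.gettempdir(), "cloud_chunk_demo.nc")
xr.Dataset(
    {"CLOUD": (("time", "lev"), np.random.rand(1000, 80))}
).to_netcdf(path)

# chunks=-1 tells Dask to keep each variable as a single chunk,
# overriding whatever chunking is encoded on disk.
ds_fixed = xr.open_mfdataset([path], chunks=-1)
print(ds_fixed["CLOUD"].data.numblocks)
```

With one chunk per variable, `.load(scheduler="sync")` issues a single read instead of 131400 tiny ones.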

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jan 9, 2025

@tomvothecoder I'm trying to understand the newly emerging issues. I feel we didn't see these when we tested the refactor branch. Another difference is that we are now testing a new dataset as well: v3, as compared to the v2 data that was used for testing during the refactor.

It looks like, firstly, genutil.udunits was changed to cf_units.Unit, and then this change requires using .load(scheduler="sync") instead of .values, and the latter caused the weird chunking? Did I get the story right? I'm wondering if we would get the same behavior if we re-test arm_diags with the v2 data we have been working with?

I'm now testing with the v2 dataset here: /global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20221103.v2.LR.amip.NGD_v3atm.chrysalis/arm-diags-data/. The time record is longer in this v2 dataset, which spans 1985-2014.

@chengzhuzhang
Contributor Author

I updated the pre-processing script for single site output, and created correctly chunked datasets: /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/

The datasets load fast, but I ran into an issue with the diurnal_cycle_zt subset (error message as follows). I will further investigate whether a fix is needed for the newly generated data or for the code base.

15:05:14,345 [ERROR]: core_parameter.py(_run_diag:343) >> Error in e3sm_diags.driver.arm_diags_driver
Traceback (most recent call last):
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/parameter/core_parameter.py", line 340, in _run_diag
    single_result = module.run_diag(self)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/arm_diags_driver.py", line 61, in run_diag
    return _run_diag_diurnal_cycle_zt(parameter)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/arm_diags_driver.py", line 183, in _run_diag_diurnal_cycle_zt
    test_diurnal, lst = composite_diurnal_cycle(  # type: ignore
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/edv3/lib/python3.12/site-packages/e3sm_diags/driver/utils/diurnal_cycle_xr.py", line 86, in composite_diurnal_cycle
    lst[it, :, ilon] = (itime + start_time + lon[ilon] / 360 * 24) % 24  # type: ignore
    ~~~^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (131400,) into shape (1,)

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jan 11, 2025

313a641 should address the data problem reported in #907 (comment). The lat and lon values had time as their dimension, which broke the diurnal cycle routine. I'm generating a new set of datasets and will replace the files under /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/ for testing.

The old version of site data is now located at /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site_cdat/
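For illustration, collapsing lat/lon coordinates that were mistakenly written with a time dimension down to scalars might look like this (synthetic data; this is not the actual pre-processing script):

```python
import numpy as np
import xarray as xr

# Site dataset where lat/lon were mistakenly written with a time dimension.
ds = xr.Dataset(
    {"PRECT": ("time", np.random.rand(10))},
    coords={
        "time": ("time", np.arange(10)),
        "lat": ("time", np.full(10, 36.6)),
        "lon": ("time", np.full(10, 262.5)),
    },
)

# Collapse the constant lat/lon down to the scalar coordinates that a
# site-based diurnal-cycle routine would expect.
for name in ("lat", "lon"):
    ds = ds.assign_coords({name: ds[name].isel(time=0, drop=True)})
```

With scalar `lon`, the `lon[ilon] / 360 * 24` term in composite_diurnal_cycle broadcasts cleanly instead of producing a (131400,)-shaped array.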

@tomvothecoder
Collaborator

313a641 should address the data problem reported in #907 (comment). The lat and lon values had time as their dimension, which broke the diurnal cycle routine. I'm generating a new set of datasets and will replace the files under /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site/ for testing.

The old version of site data is now located at /global/cfs/cdirs/e3sm/chengzhu/tutorial2024/v3.LR.historical_0101/post/atm/site_cdat/

Thanks Jill. I will do an arm_diags only run with v3 data on Tues.

This PR looks like it's almost done. I'm wrapping up regression testing (v3 is good to go; v2 is almost done, just the Q var is off). I'd say we're on track to merge next Wed/Thurs.

@chengzhuzhang
Contributor Author

I can confirm that the performance issue is resolved by the new dataset in place. The full arm set completed in 2 mins in my test.

@chengzhuzhang
Contributor Author

I checked the metadata for v2 vs v3. The difference is that the v2 data has twice the time steps of v3; v2 has 72 vertical levels and v3 has 80. Is it possible that in the original refactored code, data was loaded in without chunking, while in the new code base, data are loaded as a Dask array and chunked in a weird way?

There are a few possibilities here (and any combination of them together):

1. We did regression testing on v2 data with `open_dataset()`, which avoids Dask issues

2. We incorporated `open_mfdataset()` and `.load()` later on (e.g., [Feature]: CDAT Migration: Add support for opening multiple time series datasets #861)

3. The v2 data does not have weird chunking like the v3 data

To recap this issue: the root cause is that the original set of site-based output was pre-processed with a script based on ncrcat. For these datasets with var(time, lev) or var(time) dimensions, chunking over time created too many small chunks and caused an I/O bottleneck. The problem did not show up during the refactor because the data were read in with open_dataset(), which avoids the chunking issue. When we introduced open_mfdataset() and .load() later, the chunk sizes were not compatible and caused the I/O issue. We have now rewritten the pre-processing script with xarray to generate correctly chunked time-series files (e.g., chunksize = variable dimension). The performance is now good.
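A sketch of the "chunksize = variable dimension" idea with xarray (synthetic data; the actual pre-processing script is not reproduced here, and the netCDF4 output backend is an assumption):

```python
import numpy as np
import xarray as xr

# Synthetic site time series (shapes are illustrative).
ds = xr.Dataset(
    {"CLOUD": (("time", "lev"), np.random.rand(500, 80).astype("float32"))}
)

# In-memory view of the target layout: one chunk spanning the whole
# variable, i.e. chunksize == variable shape.
chunked = ds.chunk({"time": -1, "lev": -1})

# Matching on-disk layout for the netCDF4 engine (assumption: netCDF4 is
# the backend used when writing the time-series files):
encoding = {"CLOUD": {"chunksizes": ds["CLOUD"].shape}}
# chunked.to_netcdf("CLOUD_site.nc", encoding=encoding)
```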

@tomvothecoder
Collaborator

tomvothecoder commented Jan 14, 2025

@chengzhuzhang Great, thank you for summarizing the arm_diags I/O issue and confirming the performance with the new e3sm_diags is now good!

I will complete regression testing with v2 data, fix the integration tests, and do a final review before merging this PR.

@chengzhuzhang
Contributor Author

@chengzhuzhang Great, thank you for summarizing the arm_diags I/O issue and confirming the performance with the new e3sm_diags is now good!

I will complete regression testing with v2 data, fix the integration tests, and do a final review before merging this PR.

yeah, I'm glad we don't need to update the code base to work around this data issue. I haven't re-produced the v2 site data, so the complete v2 run still has to exclude the arm_diags set.

Collaborator

@chengzhuzhang We need to replace cwt from scipy with PyWavelets (#912). I'm trying to figure out where to pass the deg = 6 arg here. If you can try taking a look too, that'd be great.


Contributor Author

@whannah1 hey Walter, as we are preparing for the e3sm_diags v3 release, @tomvothecoder found that scipy.signal.cwt and scipy.signal.morlet2 are removed from the latest scipy, and the suggestion was to use PyWavelets instead. It seems the transition to the PyWavelets API is not straightforward, and not much documentation was found to help with the transition. I'm wondering if you are familiar with this tool and could help re-write the _get_psd_from_wavelet function?

Collaborator

@tomvothecoder tomvothecoder Jan 15, 2025

@chengzhuzhang I suggest that we:

  1. Not rush to finish this issue for v3.0.0rc1 due to high risk of side-effects and bugs with new, unfamiliar API
  2. Move this work to a separate PR and try to get it done during E3SM Unified RC testing phase
  3. Constrain scipy < 1.15 for now -- FYI @xylar for E3SM Unified (will update E3SM Confluence page)

Let me know thoughts.

UPDATE: Just saw Xylar's comment here related to 3.

Contributor Author

All are reasonable! I was thinking along the same line.

Contributor

Constrain scipy < 1.15 for now -- FYI @xylar for E3SM Unified (will update E3SM Confluence page)

I want to make sure we're on the same page. This constraint would not be in E3SM-Unified; it would be in e3sm_diags. But it would be important that E3SM-Unified or its other dependencies don't contradict this constraint.

Collaborator

Constrain scipy < 1.15 for now -- FYI @xylar for E3SM Unified (will update E3SM Confluence page)

I want to make sure we're on the same page. This constraint would not be in E3SM-Unified; it would be in e3sm_diags. But it would be important that E3SM-Unified or its other dependencies don't contradict this constraint.

Yup we're on the same page. Thanks for confirming.

Contributor Author

@tomvothecoder, see the comment from @whannah1 here. We can either implement this as-is and test and fine-tune (if needed) during the testing period, or pin scipy and update to use PyWavelets during testing.

Collaborator

@chengzhuzhang I'm going to pin scipy here and address #912 in a separate PR.

@tomvothecoder tomvothecoder marked this pull request as ready for review January 15, 2025 20:26
@tomvothecoder tomvothecoder added the cdat-migration-fy24 CDAT Migration FY24 Task label Jan 15, 2025
Collaborator

@tomvothecoder tomvothecoder left a comment

@chengzhuzhang Here is my final self-review. All of the code changes look good to me. I will merge if you agree with the testing results below.

v2 data complete run results (notebook):

Results:

We are good to go here. The differences are expected and explained in the comments below.

                stat_name  value       pct
0    matching_files_count   1093  0.875801
1      missing_vars_count      0  0.000000
2   mismatch_errors_count     10  0.008013
3  not_equal_errors_count    143  0.114583
4        key_errors_count      0  0.000000
5     missing_files_count      2  0.001603
  • 1093/1246 matching files
  • 10/1246 mismatch errors due to ccb regional subsetting differences
  • 143/1246 not equal files -- the number of differing elements is very small and the stats (min, max, mean, sum) are similar, most likely related to the convert_units() change that replaced genutil.udunits with cf_units.Unit.
    • The exception is the Q variable, whose differences are explained by not updating the Q results on the main branch with the unit conversion changes. Non-issue.
  • 2/1246 are missing because AOD_550 was retired.

v3 data complete run results (notebook)

We are good to go here. The differences are expected and explained in the comments below.

                stat_name  value       pct
0    matching_files_count    553  0.937288
1      missing_vars_count      0  0.000000
2   mismatch_errors_count      4  0.006780
3  not_equal_errors_count     33  0.055932
4        key_errors_count      0  0.000000
5     missing_files_count      0  0.000000
  • 553/590 match within an rtol of 1e-5, which is awesome.
  • 4 mismatch errors are known issues due to regional subsetting differences with the "ccb" flag
  • 33 not equal errors are not a concern because they affect a very small number of elements in the datasets. The stats (min, max, mean, and sum) of the datasets between branches are close.

@chengzhuzhang
Contributor Author

@tomvothecoder thank you for the regression tests! The metrics data look good for releasing the first RC. To avoid delaying the e3sm-unified testing period, I suggest we merge. I will go through the diffs, and we can resolve any remaining issues, if needed, during the testing period.

@chengzhuzhang
Contributor Author

@tomvothecoder somehow, I can't find 25-01-15-branch-907 under https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/

@tomvothecoder
Collaborator

@tomvothecoder thank you for the regression tests! the metrics data look good for releasing the first rc. To not to delay the testing period of e3sm-unified, I suggest to merge. I will go through the diffs and we can resolve remaining issues, if needed, during testing period.

Sounds good!

@tomvothecoder somehow, I can't find 25-01-15-branch-907 under https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/

I forgot to make some of the directories o=rx. I will do that once perlmutter is back online.

@tomvothecoder tomvothecoder merged commit 70ecf94 into main Jan 15, 2025
6 checks passed
@tomvothecoder
Collaborator

@tomvothecoder somehow, I can't find 25-01-15-branch-907 under https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/

The directory is now available here: https://portal.nersc.gov/cfs/e3sm/e3sm_diags/complete_run/25-01-15-branch-907/
