My process for setting up reproduction scripts #39

forsyth2 · 2024-05-30T23:12:17Z

forsyth2
May 30, 2024
Maintainer

Note that https://github.com/E3SM-Project/e3sm_data_docs/blob/main/utils/README.md provides a more general overview. This is a step-by-step guide to my process, with more explanation provided.

(1) Creating reproduction scripts

The scripts used to actually run the v2 simulations can be found in https://github.com/E3SM-Project/e3sm_data_docs/tree/main/run_scripts/v2/original. Starting from these original scripts, I run:

$ cd /home/ac.forsyth2/ez/e3sm_data_docs/utils
$ emacs update_reproduction_scripts.bash
# Edit the `for case_name` lines to contain only the relevant cases (replace `hist-GHG_0151`):
# for case_name in hist-GHG_0151; do
$ ./update_reproduction_scripts.bash

That will run ./generate_reproduction_script.bash for each case specified and place the generated reproduction scripts in run_scripts/v2/reproduce (i.e., https://github.com/E3SM-Project/e3sm_data_docs/tree/main/run_scripts/v2/reproduce will show them after they are merged into main). The generate_reproduction_script.bash script does a few things:

1. It uses the patch command to apply diff_patch to the original run script (which is in ../run_scripts/v2/original/). diff_patch is the set of changes known to be needed to turn the original run script into the reproduction script.
2. The patch process is not perfect: some lines won't match up exactly. So, python patch_helper.py is run to apply the changes that patch may have rejected.
3. The rejects from the patch command are displayed. In most cases, the previous step should have addressed all of them.

Caution

It's still possible that not all patches were properly applied at this point. That would mean the reproduction script may be missing an important change from the original script. (One downside of using ./update_reproduction_scripts.bash is that it can run on many cases; to avoid prompting for user input each time, it simply assumes that no changes from diff_patch were missed.)

Tip

If the reproduction script isn't in run_scripts/v2/reproduce, then ./generate_reproduction_script.bash must have failed for that script.
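
When many cases are generated at once, the check in the tip above can be automated. This is a minimal sketch, not part of the repo: the function name missing_reproduction_scripts is hypothetical, and it assumes the generated files follow the run.<simulation name>.sh naming used elsewhere in this post (e.g. run.v2.LR.hist-GHG_0151.sh). The demonstration uses a throwaway directory as a stand-in for run_scripts/v2/reproduce.

```python
import tempfile
from pathlib import Path


def missing_reproduction_scripts(reproduce_dir, simulation_names):
    """Return the simulations with no run.<simulation>.sh file in reproduce_dir."""
    reproduce_dir = Path(reproduce_dir)
    return [
        name
        for name in simulation_names
        if not (reproduce_dir / f"run.{name}.sh").is_file()
    ]


# Demonstration on a temporary directory (stand-in for run_scripts/v2/reproduce).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run.v2.LR.hist-GHG_0151.sh").write_text("#!/bin/bash\n")
    missing = missing_reproduction_scripts(
        tmp, ["v2.LR.hist-GHG_0151", "v2.LR.hist-aer_0151"]
    )

print(missing)
```

Any name that is reported here is a case for which ./generate_reproduction_script.bash most likely failed.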

(2) Creating test output to compare with the original test output

$ cd /home/ac.forsyth2/ez/e3sm_data_docs/utils
$ emacs test_reproduction_scripts.bash
# Edit the `for simulation_name` lines to contain only the relevant cases (replace `hist-GHG_0151`):
# for simulation_name in hist-GHG_0151; do
$ sbatch test_reproduction_scripts.bash

Important

This can take a long time to run. The allowed wall time is 24 hours. If it is still running after that time, it will be stopped and some cases won't be finished.

For each case, that script does the following:

It copies the reproduction script from run_scripts/v2/reproduce (and not from the GitHub repo, since I have $use_wget = false because the reproduction script hasn't been merged to main yet). This copy will be visible in /home/ac.forsyth2/E3SMv2_test/scripts.

Tip

If the copied reproduction script isn't in /home/ac.forsyth2/E3SMv2_test/scripts, then there must have been nothing to copy from run_scripts/v2/reproduce. (i.e., ./generate_reproduction_script.bash must have failed for that script.)

Initial conditions are retrieved from NERSC HPSS using zstash. They are placed in /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test, specifically in the <case-name including the v2.>/init subdirectory.

Tip

If that init directory doesn't exist or is empty, then there must not have been any initial conditions archived on NERSC HPSS for that simulation.

From /home/ac.forsyth2/E3SMv2_test/scripts, the reproduction script is run. The script is set up to run the XS_1x10_ndays test. This will generate a /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/<case-name including the v2.>/tests subdirectory.

Tip

If that tests directory doesn't exist or is empty, then the reproduction script failed to produce output.
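
The three tips in this section amount to a small checklist per case, which can be sketched as follows. This is an illustration only: diagnose_case and is_nonempty_dir are hypothetical names, and the sketch assumes the copied script is named run.<simulation name>.sh and that the output lives in <output root>/<simulation name>/init and .../tests, as described above.

```python
import tempfile
from pathlib import Path


def is_nonempty_dir(path):
    """True if path is a directory containing at least one entry."""
    path = Path(path)
    return path.is_dir() and any(path.iterdir())


def diagnose_case(scripts_dir, output_root, simulation_name):
    """Return one message per failed check; an empty list means all three passed."""
    problems = []
    if not (Path(scripts_dir) / f"run.{simulation_name}.sh").is_file():
        problems.append("no copied reproduction script (generation probably failed)")
    case_dir = Path(output_root) / simulation_name
    if not is_nonempty_dir(case_dir / "init"):
        problems.append("init missing or empty (no initial conditions archived?)")
    if not is_nonempty_dir(case_dir / "tests"):
        problems.append("tests missing or empty (reproduction script produced no output)")
    return problems


# Demonstration: the script and initial conditions exist, but no test output does.
with tempfile.TemporaryDirectory() as scripts, tempfile.TemporaryDirectory() as root:
    name = "v2.LR.hist-GHG_0151"
    (Path(scripts) / f"run.{name}.sh").write_text("#!/bin/bash\n")
    init_dir = Path(root) / name / "init"
    init_dir.mkdir(parents=True)
    (init_dir / "placeholder.nc").write_text("")
    problems = diagnose_case(scripts, root, name)

print(problems)
```

On Chrysalis, scripts_dir would be /home/ac.forsyth2/E3SMv2_test/scripts and output_root would be /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test.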

(3) Finding the expected checksum

This will be the checksum from the original script. How do I get that though?

1. See if it's listed in the "10 day checksum" column of https://docs.e3sm.org/e3sm_data_docs/_build/html/v2/reproducing_simulations.html already.
2. If not, see if it's listed on the original simulation page (specifically the "sanity checks" section), which is linked from https://acme-climate.atlassian.net/wiki/spaces/ED/pages/2766340117/V2+Simulation+Planning. (The corresponding v3 page is linked on https://docs.e3sm.org/running-e3sm-guide/guide-long-term-archiving/#4-document).
3. If not, we might be able to use the tests/ subdirectory archived on NERSC HPSS. E.g., for v2.NARRM.historical_0151:
a. Log into Globus. Authenticate for both the NERSC HPSS endpoint and the LCRC endpoint.
b. Run:

$ cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/zstash_extractions
$ mkdir v2.NARRM.historical_0151
$ cd v2.NARRM.historical_0151
$ source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
$ rm ~/.globus-native-apps.cfg # Not doing this can cause Globus issues.
$ zstash extract --hpss=globus://nersc/home/projects/e3sm/www/WaterCycle/E3SMv2/NARRM/v2.NARRM.historical_0151 "tests/*" # Quoted so the shell passes the pattern to zstash unexpanded.

c. If there is no tests/ directory to extract, skip to step 4 of this list. Otherwise, compute the checksum:

# Now, we've extracted just the `tests/` subdirectory.
$ cd tests
# We're now in /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/zstash_extractions/v2.NARRM.historical_0151/tests
$ for test in *_*_ndays
do
  gunzip -c ${test}/run/atm.log.*.gz | grep '^ nstep, te ' | uniq > atm_${test}.txt
done
$ md5sum atm_*_ndays.txt
668fb58e3da9070640cf1ec907ac66c0  atm_XL2_1x5_ndays.txt

d. If none of the listed tests is a 10-day test (10_ndays in the name), as in this case where only XL2_1x5_ndays is present, then go to step 4 of this list. Otherwise, I have the expected checksum.
4. At this point, the original script's test output will have to be re-generated from scratch. E.g., for run.v2.LR.hist-GHG_0151.sh:

$ cd /home/ac.forsyth2/E3SMv2_test/data_docs_scripts
$ cp /home/ac.forsyth2/ez/e3sm_data_docs/run_scripts/v2/original/run.v2.LR.hist-GHG_0151.sh run.v2.LR.hist-GHG_0151.sh
# I need to change a few lines in this copied original script:
# readonly RUN_REFDIR="/lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/v2.LR.hist-GHG_0151/init" # Changed
# readonly run='XS_1x10_ndays' # Changed (we want this to match the reproduction script's test length of 10 days!)
# do_fetch_code=true # Changed
# do_case_build=true # Changed
$ ./run.v2.LR.hist-GHG_0151.sh

Important

This will take an hour or more to run, because do_fetch_code=true and do_case_build=true mean the code is fetched and the case is built from scratch.
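
I make the four edits to the copied original script by hand in an editor, but they could also be applied programmatically. The following is only a sketch under assumptions: apply_overrides is a hypothetical helper, it assumes each setting appears as a plain NAME=value assignment (optionally prefixed by readonly) at the start of a line, and the sample text is a made-up stand-in for the original script, not its real contents.

```python
import re

# The four settings listed in the comments above, with their new values.
OVERRIDES = {
    "RUN_REFDIR": '"/lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/v2.LR.hist-GHG_0151/init"',
    "run": "'XS_1x10_ndays'",
    "do_fetch_code": "true",
    "do_case_build": "true",
}


def apply_overrides(script_text, overrides):
    """Rewrite `NAME=value` assignments, keeping any leading `readonly`."""
    for name, value in overrides.items():
        pattern = re.compile(
            rf"^(\s*(?:readonly\s+)?{re.escape(name)}=).*$", re.MULTILINE
        )
        # A function replacement avoids backslash/group escapes in the new value.
        script_text, count = pattern.subn(
            lambda match: match.group(1) + value + " # Changed", script_text
        )
        if count == 0:
            raise ValueError(f"no assignment to {name} found")
    return script_text


# Made-up stand-in for the relevant lines of an original run script.
sample = (
    'readonly RUN_REFDIR="/some/original/init"\n'
    "readonly run='production'\n"
    "do_fetch_code=false\n"
    "do_case_build=false\n"
    "do_case_submit=true\n"
)
edited = apply_overrides(sample, OVERRIDES)
print(edited)
```

Raising an error when a setting is not found is deliberate: a silently skipped edit would produce test output that does not correspond to the 10-day test.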

Tip

The RUN_REFDIR directory may be deleted if sbatch test_reproduction_scripts.bash is running, because that script clears the directory space to start fresh. To keep a separate copy of the initial conditions, using v2.LR.hist-aer_0151 as an example:

$ cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2
$ mkdir -p v2.LR.hist-aer_0151
$ cd v2.LR.hist-aer_0151
$ source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
$ rm ~/.globus-native-apps.cfg
$ zstash extract -v --hpss=globus://nersc/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.hist-aer_0151 "init/*"
# May need to enter the auth code twice.
# If so, run `rm -rf zstash/` and rerun the zstash command above.
$ rm -rf zstash # We only need the init directory

$ cd /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.NARRM.amip_0101/tests
# Once the script finishes, check on the submitted job. I use my `sq` alias for:
# squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10r %.10M %.10l %j" --sort=P,-t,-p -u ${USER}
# Once the job finishes, I can get the checksum.
$ for test in *_*_ndays
do
   gunzip -c ${test}/run/atm.log.*.gz | grep '^ nstep, te ' | uniq > atm_${test}.txt
done
$ md5sum atm_*_ndays.txt
c9aff4fd826f18d0872135b845090a6b  atm_XS_1x10_ndays.txt
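
For reference, the gunzip | grep | uniq | md5sum pipeline used above can be approximated in Python. This is a sketch, not something in the repo: atm_log_checksum is a hypothetical name, and it assumes the log files are read in sorted name order (the shell glob uses the locale's collation, which could differ for unusual file names). The demonstration writes a tiny made-up log rather than real model output.

```python
import glob
import gzip
import hashlib
import os
import tempfile

PREFIX = b" nstep, te "  # Same fixed prefix as grep '^ nstep, te '.


def atm_log_checksum(run_dir):
    """MD5 of the ' nstep, te ' lines in run_dir/atm.log.*.gz.

    Adjacent duplicate lines are collapsed, like `uniq`.
    """
    digest = hashlib.md5()
    previous = None
    for log_path in sorted(glob.glob(os.path.join(run_dir, "atm.log.*.gz"))):
        with gzip.open(log_path, "rb") as log_file:
            for raw_line in log_file:
                line = raw_line.rstrip(b"\n")
                if not line.startswith(PREFIX) or line == previous:
                    continue
                digest.update(line + b"\n")
                previous = line
    return digest.hexdigest()


# Demonstration with a made-up log: one unrelated line and one repeated line.
with tempfile.TemporaryDirectory() as run_dir:
    with gzip.open(os.path.join(run_dir, "atm.log.0001.gz"), "wb") as out:
        out.write(
            b"some other output\n"
            b" nstep, te 1 0.1\n"
            b" nstep, te 1 0.1\n"
            b" nstep, te 2 0.2\n"
        )
    checksum = atm_log_checksum(run_dir)

expected = hashlib.md5(b" nstep, te 1 0.1\n nstep, te 2 0.2\n").hexdigest()
print(checksum == expected)
```

For a real test, run_dir would be the <test name>/run directory, e.g. XS_1x10_ndays/run under tests/.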

(4) Actually comparing that test output with the original test output

$ cd /home/ac.forsyth2/ez/e3sm_data_docs/utils
$ emacs check_results.bash
# Add a check line for each new script, e.g. for v2.NARRM.historical_0101:
# check_test_results E3SMv2_test NARRM historical_0101 <checksum from original script's test>
$ ./check_results.bash

The script should not show up in the output. If it does, the output will display Failed line count and/or Failed checksum, which means the reproduction script's test output does not match that of the original script. In that case, the reproduction script needs to be fixed: return to Step (1) above.
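
To make the two failure messages concrete, here is a rough sketch of what such a check could look like. It is not the actual logic of check_results.bash, which isn't reproduced in this post: check_test_output is a hypothetical function, and expected_line_count is a hypothetical parameter standing in for however the real script decides how many lines to expect.

```python
import hashlib
import tempfile
from pathlib import Path


def check_test_output(atm_txt_path, expected_md5, expected_line_count):
    """Return the failure messages for one atm_<test>.txt file (empty list = pass)."""
    data = Path(atm_txt_path).read_bytes()
    failures = []
    if len(data.splitlines()) != expected_line_count:
        failures.append("Failed line count")
    if hashlib.md5(data).hexdigest() != expected_md5:
        failures.append("Failed checksum")
    return failures


# Demonstration with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    atm_txt = Path(tmp) / "atm_XS_1x10_ndays.txt"
    content = b" nstep, te 1 0.1\n nstep, te 2 0.2\n"
    atm_txt.write_bytes(content)
    good_md5 = hashlib.md5(content).hexdigest()
    passing = check_test_output(atm_txt, good_md5, 2)
    failing = check_test_output(atm_txt, "0" * 32, 3)

print(passing, failing)
```

An empty list corresponds to the script not showing up in the output at all.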

Assuming the script passed the test, it is ready to be added officially (i.e., on https://github.com/E3SM-Project/e3sm_data_docs/tree/main/run_scripts/v2/reproduce). Merge the script to main.

(5) Adding reproduction scripts to the official table

I do this step on Perlmutter rather than Chrysalis (this is because there is a direct hsi call in the script).

Once a reproduction script has been added to https://github.com/E3SM-Project/e3sm_data_docs/tree/main/run_scripts/v2/reproduce, I need to link it in the "Script" column of https://docs.e3sm.org/e3sm_data_docs/_build/html/v2/reproducing_simulations.html.

Now that the reproduction script exists, the link is generated automatically by this code block in generate_tables.py (https://github.com/E3SM-Project/e3sm_data_docs/blob/main/utils/generate_tables.py):

        run_script_reproduction = f"https://raw.githubusercontent.com/E3SM-Project/e3sm_data_docs/main/run_scripts/v2/reproduce/run.{name}.sh"
        response = requests.get(run_script_reproduction).status_code
        if response == 200:
            self.run_script_reproduction = f"`{name} <{run_script_reproduction}>`_"
        else:
            self.run_script_reproduction = ""
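
The excerpt above is part of a class, so it can't be run on its own. As a self-contained illustration of the same idea, here is a sketch in which the HTTP status lookup is passed in as a function. The names reproduction_link, get_status, and fake_status are hypothetical and not from generate_tables.py; the fake lookup only exists so the example runs without network access.

```python
REPRODUCE_URL = (
    "https://raw.githubusercontent.com/E3SM-Project/e3sm_data_docs/"
    "main/run_scripts/v2/reproduce/run.{name}.sh"
)


def reproduction_link(name, get_status):
    """Return an RST link to the reproduction script, or "" if it isn't on main."""
    url = REPRODUCE_URL.format(name=name)
    if get_status(url) == 200:
        return f"`{name} <{url}>`_"
    return ""


def fake_status(url):
    """Stand-in for an HTTP request: pretend only hist-GHG_0151 has been merged."""
    return 200 if "hist-GHG_0151" in url else 404


merged = reproduction_link("v2.LR.hist-GHG_0151", fake_status)
unmerged = reproduction_link("v2.LR.hist-aer_0151", fake_status)
print(merged)
print(repr(unmerged))
```

With the requests library installed, passing lambda url: requests.get(url).status_code as get_status would perform the same lookup as the excerpt. Either way, the "Script" column stays empty until the reproduction script has been merged to main.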

I add the checksum to the simulation's tuple in generate_tables.py, e.g.:

("v2.LR.hist-GHG_0201", "LR", "chrysalis", "<checksum>", "hist-GHG", 3),

Then, I just follow the directions at https://github.com/E3SM-Project/e3sm_data_docs/tree/main/utils#generating-tables and merge the updated tables to main.
