Merge pull request #593 from kevinstratford/ks-issue-472
Remove reference to non-existent optfile
juanfrh authored Apr 8, 2024
2 parents c45af6b + b6b515f, commit 28360b0
Showing 1 changed file, docs/research-software/mitgcm.md, with 37 additions and 45 deletions.
…flow of both the atmosphere and ocean.

MITgcm is not available via a module on ARCHER2 as users will build
their own executables specific to the problem they are working on.


You can obtain the MITgcm source code from the developers by cloning
from the GitHub repository with the command

    git clone https://github.com/MITgcm/MITgcm.git

You should then copy the ARCHER2 optfile into the MITgcm directories.

!!! warning
    A current ARCHER2 optfile is not available at the present time. Please
    contact `support@archer2.ac.uk` for help.


You should also set the following environment variables.
`MITGCM_ROOTDIR` is used to locate the source code and should point to
the top MITgcm directory. Optionally, adding the MITgcm tools directory …

… running

    genmake2 -help

Finally, you may then build your executable by running

    make depend
    make
…each for up to one hour.
```bash
#SBATCH --cpus-per-task=1
# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard
```
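
Only a few lines of this script are visible in the diff. For orientation, a complete script of the same shape might look like the sketch below; the job name, node count and walltime are illustrative and should be sized to your own domain decomposition.

```bash
#!/bin/bash
#SBATCH --job-name=mitgcm          # illustrative job name
#SBATCH --nodes=1                  # illustrative; match your domain decomposition
#SBATCH --ntasks-per-node=128      # ARCHER2 compute nodes have 128 cores
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard

srun --distribution=block:block --hint=nomultithread ./mitgcmuv
```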
…This can also sometimes lead to performance increases.
```bash
#SBATCH --cpus-per-task=4
# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard
```
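
With `--cpus-per-task=4`, each MPI process is allocated four cores, so a 128-core node carries only 32 processes. A sketch of the matching launch lines, assuming a pure-MPI build (the `SRUN_CPUS_PER_TASK` export mirrors the full ECCO script later on this page):

```bash
export OMP_NUM_THREADS=1                         # pure-MPI run: one thread per process
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK   # pass --cpus-per-task through to srun

srun --distribution=block:block --hint=nomultithread ./mitgcmuv
```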
…those requested in the job submission script.

## Reproducing the ECCO version 4 (release 4) state estimate on ARCHER2

The ECCO version 4 state estimate (ECCOv4-r4) is an observationally constrained numerical solution produced by the ECCO group at JPL. If you would like to reproduce the state estimate on ARCHER2 in order to create customised runs and experiments, follow the instructions below; they are the JPL instructions, slightly modified for ARCHER2.

For more information, see the ECCOv4-r4 website: <https://ecco-group.org/products-ECCO-V4r4.htm>.

First, navigate to your directory on the ``/work`` filesystem in order to get access…

    mkdir MYECCO
    cd MYECCO
In order to reproduce ECCOv4-r4, we need a specific checkpoint of the MITgcm source code.

    git clone https://github.com/MITgcm/MITgcm.git -b checkpoint66g

Next, get the ECCOv4-r4 specific code from GitHub:

    cd MITgcm
    # … (some intermediate lines are collapsed in this diff) …
    git clone https://github.com/ECCO-GROUP/ECCO-v4-Configurations.git
    mv ECCO-v4-Configurations/ECCOv4\ Release\ 4/code .
    rm -rf ECCO-v4-Configurations

### Get the ECCOv4-r4 forcing files

The surface forcing and other input files that are too large to be stored on GitHub are available via NASA data servers. In total, these files are about 200 GB in size. You must register for an Earthdata account and connect to a WebDAV server in order to access these files. For more detailed instructions, read the help page <https://ecco.jpl.nasa.gov/drive/help>.
Next, acquire your WebDAV credentials: <https://ecco.jpl.nasa.gov/drive> (second…
Now, you can use wget to download the required forcing and input files:

    wget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_forcing
    wget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_init
    wget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_ecco

After using `wget`, you will notice that the `input*` directories are, by default, several levels deep in the directory structure. Use the `mv` command to move the `input*` directories to the directory where you executed the `wget` command. Specifically,

…
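
Alternatively (an optional sketch using standard `wget` flags, not part of the official JPL instructions) you can strip the leading path components at download time so that the `input*` directories land directly in the current directory:

```bash
# -nH drops the host directory; --cut-dirs=4 strips drive/files/Version4/Release4
wget -r --no-parent -nH --cut-dirs=4 --user YOURUSERNAME --ask-password \
    https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_forcing
```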
If you haven't already, set your environment variables:
    export MITGCM_ROOTDIR=../../../../MITgcm
    export PATH=$MITGCM_ROOTDIR/tools:$PATH
    export MITGCM_OPT=$MITGCM_ROOTDIR/tools/build_options/dev_linux_amd64_cray_archer2

Next, compile the executable:

    genmake2 -mods ../code -mpi -optfile $MITGCM_OPT
    make depend
    make
Once you have compiled the model, you will have the `mitgcmuv` executable for ECCOv4-r4.

#### Create run directory and link files

In order to run the model, you need to create a run directory and link/copy the appropriate files. First, navigate to your directory on the ``/work`` filesystem. From the ``MITgcm/ECCOV4/release4`` directory:

    mkdir run
    cd run

    # link the data files
    ln -s ../input_init/NAMELIST/* .
    ln -s ../input_init/error_weight/ctrl_weight/* .
    # … (further ln -s lines are collapsed in this diff) …
    ln -s ../input_forcing/eccov4r4* .

    python mkdir_subdir_diags.py

    # manually copy the mitgcmuv executable
    cp -p ../build/mitgcmuv .

For a short test run, edit the ``nTimeSteps`` variable in the file ``data``. Comment out the default value and uncomment the line reading ``nTimeSteps=8``. This is a useful test to make sure that the model can at least start up.
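
The relevant lines in ``data`` (the ``PARM03`` namelist) look something like the sketch below; the commented-out default step count shown here is illustrative rather than the exact value in the file. Lines beginning with ``#`` are treated as comments in MITgcm namelist files.

```
 &PARM03
# nTimeSteps=227903,
 nTimeSteps=8,
 ...
 &
```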

To run on ARCHER2, submit a batch script to the Slurm scheduler. Here is an example submission script:

```bash
# … (the top of the script is collapsed in this diff) …
#SBATCH --cpus-per-task=1
# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard

# …
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun --distribution=block:block --hint=nomultithread ./mitgcmuv
```

This configuration uses 96 MPI processes at 12 MPI processes per node (eight nodes in total). Once the run has finished, check the end of one of the standard output files to confirm that the run completed successfully.

    tail STDOUT.0000

It should read

    PROGRAM MAIN: Execution ended Normally
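
To check every rank at once rather than just `STDOUT.0000`, a one-liner such as the following (a simple sketch using standard tools) counts how many processes finished cleanly; for the 96-process configuration above it should print 96:

```bash
grep -l "Execution ended Normally" STDOUT.* | wc -l
```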
The files named `STDOUT.*` contain diagnostic information that you can use to check your results. As a first pass, check the printed statistics for any clear signs of trouble (e.g. NaN values, extremely large values).

#### ECCOv4-r4 in adjoint mode

If you have access to the commercial TAF software produced by <http://FastOpt.de>…
    cd ..
    mkdir build_ad
    cd build_ad

In this instance, the ``code_ad`` and ``code`` directories are identical, although this does not have to be the case. Make sure that you have the ``staf`` script in your path or in the ``build_ad`` directory itself. To make sure that you have the most up-to-date script, run:

    ./staf -get staf

To test your connection to the FastOpt servers, try:

    ./staf -test

You should receive the following message:

    Your access to the TAF server is enabled.

The compilation commands are similar to those used to build the forward case.

    # load relevant modules
    # … (some intermediate lines are collapsed in this diff) …
    make depend
    make adtaf
    make adall

The source code will be packaged and forwarded to the FastOpt servers, where it will undergo source-to-source translation via the TAF algorithmic differentiation software. If the compilation is successful, you will have an executable named ``mitgcmuv_ad``. This will run the ECCOv4-r4 configuration of MITgcm in adjoint mode. As before, create a run directory and copy in the relevant files. The procedure is the same as for the forward model, with the following modifications:

    cd ..
    mkdir run_ad
    cd run_ad
    # manually copy the mitgcmuv_ad executable
    cp -p ../build_ad/mitgcmuv_ad .
To run the model, change the name of the executable in the Slurm submission script; everything else should be the same as in the forward case. As above, at the end of the run you should have a set of `STDOUT.*` files that you can examine for any obvious problems.


##### Compile-time errors
Expand All @@ -361,14 +353,14 @@ relink with --no-relax` then add the following line to the FFLAGS options: `-Wl,

##### Checkpointing for adjoint runs

In an adjoint run, there is a balance between storage (i.e. saving the model state to disk) and recomputation (i.e. integrating the model forward from a stored state). Changing the `nchklev` parameters in the `tamc.h` file at compile time is how you control the relative balance between storage and recomputation.

A suggested strategy that has been used on a variety of HPC platforms is as follows:

1. Set `nchklev_1` as large as possible, up to the size allowed by memory on your machine. Use the `size` command to estimate the memory per process; this should be just a little below the per-process maximum, which on ARCHER2 is 2 GB on standard nodes and 4 GB on high-memory nodes.
2. Next, set `nchklev_2` and `nchklev_3` to be large enough to accommodate the entire run. A common strategy is to set `nchklev_2 = nchklev_3 = sqrt(numsteps/nchklev_1) + 1` (see the worked example below).
3. If the `nchklev_2` files get too big, then you may have to add a fourth level (i.e. `nchklev_4`), but this is unlikely.
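
As a concrete illustration, with made-up numbers: for a run of `numsteps = 72000` with `nchklev_1 = 30`, the rule gives `nchklev_2 = nchklev_3 = sqrt(72000/30) + 1 = sqrt(2400) + 1 ≈ 50`, and `30 * 50 * 50 = 75000 >= 72000`, so the three checkpoint levels together cover the whole run.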

This strategy allows you to keep as much in memory as possible, minimising the I/O requirements for the disk. This is useful, as I/O is often the bottleneck for MITgcm runs on HPC.

Another way to adjust performance is to change how tape-level I/O is handled. The following settings perform well for most configurations:

…
