Merge pull request #551 from aturner-epcc/aturner-epcc/optimised-mpiio
Improves description of optimising parallel IO on ARCHER2
aturner-epcc authored Nov 15, 2023
2 parents 8ba3402 + a4788a0 commit c286b80
Showing 1 changed file with 99 additions and 23 deletions.
122 changes: 99 additions & 23 deletions docs/user-guide/io.md
Information on the file systems, directory layouts, quotas,
archiving and transferring data can be found in the
[Data management and transfer section](data.md).

The advice here is targeted at use of the parallel file
systems available on the compute nodes on ARCHER2 (i.e. **Not**
the home and RDFaaS file systems).

This section provides information on getting the best performance out of
the parallel `/work` file systems on ARCHER2 when writing data,
particularly using parallel I/O patterns.

### Lustre technology

The ARCHER2 `/work` file systems use Lustre as a parallel file system
technology. It has many disk units (called Object Storage Targets or
system provides POSIX semantics (changes on one node are immediately
visible on other nodes) and can support very high data rates for
appropriate I/O patterns.

In order to achieve good performance on the ARCHER2 Lustre file systems,
you need to make sure your I/O is configured correctly for the type
of I/O you want to do. In the following sections we describe how to
do this.

### Summary: achieving best I/O performance

The configuration you should use depends on the type of I/O you are
performing. Here, we summarise the settings for two of the I/O patterns
described above: File-Per-Process (FPP, including when using ADIOS2) and
Single Shared File with collective writes (SSF).

The following sections describe the settings in more detail.

#### File-Per-Process (FPP)

- Optimal file system: NVMe-based Lustre
- Stripe count: single stripe (`-c 1`); this is the default on ARCHER2, as shown in the sketch after this list
- Stripe size: little impact on performance - default (1 MiB) is fine
- Environment variables: none required
- MPI transport protocol choice: N/A
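
In practice no special setup is needed for FPP. As a minimal sketch (the
`resdir` output directory is illustrative, and the output shown assumes the
ARCHER2 default of a single stripe), you can confirm the stripe count and set
it explicitly if the directory has inherited different settings:

    auser@ln03:~> lfs getstripe -c resdir/
    1
    auser@ln03:~> lfs setstripe -c 1 resdir/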

#### Single Shared File with collective writes (SSF)

- Optimal file system: no difference between NVMe and HDD-based file systems
- Stripe count: Maximum striping (`-c -1`); see the sketch after this list
- Stripe size: little impact on performance - default (1 MiB) is fine
- Environment variables:
    * `export FI_OFI_RXM_SAR_LIMIT=64K`
    * `export MPICH_MPIIO_HINTS="*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4"`
- MPI transport protocol choice: UCX, if possible
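
A minimal sketch of applying these settings (the `resdir` output directory is
illustrative; the environment variables would normally be set in your batch
script, as described later in this section):

    auser@ln03:~> lfs setstripe -c -1 resdir/
    auser@ln03:~> export FI_OFI_RXM_SAR_LIMIT=64K
    auser@ln03:~> export MPICH_MPIIO_HINTS="*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4"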

### Summary: typical I/O performance on ARCHER2

#### File-Per-Process (FPP)

We regularly run tests of FPP write performance on ARCHER2 `/work` Lustre file
systems using the [benchio](https://github.com/EPCCed/epcc-reframe/tree/main/tests/synth/benchio) software in the following configuration:

- Number of MPI processes writing: 2048 (16 nodes each with 128 processes)
- Amount of data written: 65,536 MiB (32 MiB per process)
- Striping: stripe count = 1; stripe size = 1 MiB
- Environment variables: none
- MPI transport protocol: OFI

Typical write performance:

- NVMe-based file system:
    - Maximum: 200 GiB/s
    - Upper quartile: 180 GiB/s
    - Median: 150 GiB/s
    - Lower quartile: 135 GiB/s
- HDD-based file system:
    - Maximum: 170 GiB/s
    - Upper quartile: 160 GiB/s
    - Median: 90 GiB/s
    - Lower quartile: 30 GiB/s

#### Single Shared File with collective writes (SSF)

We regularly run tests of SSF write performance on ARCHER2 `/work` Lustre file
systems using the [benchio](https://github.com/EPCCed/epcc-reframe/tree/main/tests/synth/benchio) software in the following configuration:

- Number of MPI processes writing: 2048 (16 nodes each with 128 processes)
- Amount of data written: 65,536 MiB (32 MiB per process)
- Striping: stripe count = -1 (maximum striping); stripe size = 1 MiB
- Environment variables: `FI_OFI_RXM_SAR_LIMIT=64K`, `MPICH_MPIIO_HINTS="*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4"`
- MPI transport protocol: UCX

Typical write performance:

- NVMe-based file system:
    - Maximum: 20 GiB/s
    - Upper quartile: 11 GiB/s
    - Median: 10 GiB/s
    - Lower quartile: 9 GiB/s
- HDD-based file system:
    - Maximum: 22 GiB/s
    - Upper quartile: 12 GiB/s
    - Median: 10 GiB/s
    - Lower quartile: 8 GiB/s

### Striping

One of the main factors leading to the high performance of Lustre file
created directory `resdir`:

    resdir
    stripe_count: 1 stripe_size: 1048576 stripe_offset: -1

#### Setting custom striping configurations

Users can set stripe settings for a directory (or file) using the
`lfs setstripe` command. The options for `lfs setstripe` are:
For example, to set a stripe size of 4 MiB for the existing directory

    auser@ln03:~> lfs setstripe -S 4m -c -1 resdir/
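
To confirm that the new settings have taken effect, you can query the
directory again with `lfs getstripe` and check the reported `stripe_count`
and `stripe_size`:

    auser@ln03:~> lfs getstripe resdir/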

### Environment variables

The following environment variables typically only have an impact when you
are using Single Shared Files with collective communications.
As mentioned above, it is very important to use collective calls when
doing parallel I/O to a single shared file.

However, with the default settings, parallel I/O on multiple nodes can
currently give poor performance. We recommend always setting these
environment variables in your SLURM batch script when
you are using the SSF I/O pattern:

    export FI_OFI_RXM_SAR_LIMIT=64K
    export MPICH_MPIIO_HINTS="*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4"

<!-- TODO: add description of what these settings do -->
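
As an illustrative sketch only, a batch script using the SSF settings might
look like the following (the account code, job size, time limit and
application name are placeholders, and the shared output directory is assumed
to have been fully striped with `lfs setstripe -c -1` beforehand):

    #!/bin/bash
    #SBATCH --job-name=ssf_io
    #SBATCH --nodes=16
    #SBATCH --ntasks-per-node=128
    #SBATCH --cpus-per-task=1
    #SBATCH --time=00:20:00
    #SBATCH --partition=standard
    #SBATCH --qos=standard
    #SBATCH --account=t01

    # Recommended settings for Single Shared File I/O with collective writes
    export FI_OFI_RXM_SAR_LIMIT=64K
    export MPICH_MPIIO_HINTS="*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4"

    srun --distribution=block:block --hint=nomultithread ./my_app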

### MPI transport protocol


Setting the environment variables described above can improve
the performance of MPI collectives when handling large amounts of
data, which in turn can improve collective file I/O. An alternative is
to use the non-default UCX implementation of the MPI library as an
To switch library version see the Application Development Environment section
of the User Guide. You should compare the I/O performance of your application
before and after the switch. It is possible that other functions
may run slower even if the I/O performance improves.
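
As a sketch, switching to the UCX implementation typically involves swapping
the network and MPICH modules before recompiling and running (the module
names below are an assumption; check the Application Development Environment
section for the current instructions):

    module swap craype-network-ofi craype-network-ucx
    module swap cray-mpich cray-mpich-ucx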

## I/O profiling

If you are concerned about your I/O performance, you should quantify
your transfer rates in terms of GB/s of data read or written to
Expand All @@ -228,9 +304,9 @@ variable in your Slurm script
Amongst other things, this will give you information on how many
independent and collective I/O operations were issued. If you see a
large number of independent operations compared to collectives, this
indicates that you have inefficient I/O patterns and you should check
that you are calling your parallel I/O library correctly.

Although this information comes from the MPI library, it is still
useful for users of higher-level libraries such as HDF5 as they all
call MPI-IO at the lowest level.
