From 776d2b8dc245b304f25f3cb6529bf0c05ca7bb25 Mon Sep 17 00:00:00 2001 From: <> Date: Thu, 12 Dec 2024 11:31:26 +0000 Subject: [PATCH] Deployed be9de762 with MkDocs version: 1.6.1 --- search/search_index.json | 2 +- user-guide/machine-learning/index.html | 46 +++++++++++++++++++++++++- 2 files changed, 46 insertions(+), 2 deletions(-) diff --git a/search/search_index.json b/search/search_index.json index 7bac833fb..fea7f8a71 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"ARCHER2 User Documentation","text":"
ARCHER2 is the next generation UK National Supercomputing Service. You can find more information on the service and the research it supports on the ARCHER2 website.
The ARCHER2 Service is a world class advanced computing resource for UK researchers. ARCHER2 is provided by UKRI, EPCC, Cray (an HPE company) and the University of Edinburgh.
"},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"This is the documentation for the ARCHER2 service and includes:
Quick Start Guide The ARCHER2 quick start guide provides the minimum information for new users.
ARCHER2 User and Best Practice Guide Covers all aspects of use of the ARCHER2 supercomputing service. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of ARCHER2, and other advanced technical topics.
Research Software Information on each of the centrally-installed research software packages.
Software Libraries Information on the centrally-installed software libraries. Most libraries work as expected, so no additional notes are required; however, a small number require specific documentation.
Data Analysis and Tools Information on data analysis tools and other useful utilities.
Other Software Useful information on software that is not officially supported by the ARCHER2 service but that will be useful to users of that software.
Essential Skills This section provides information and links on essential skills required to use ARCHER2 efficiently: e.g. using Linux command line, accessing help and documentation.
ARCHER2 and publications This section describes how to acknowledge the use of ARCHER2 in your published work and how to use the ARCHER2 publications database.
The source for this documentation is publicly available in the ARCHER2 documentation GitHub repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or the addition of Issues providing suggestions for how it can be improved.
Full details of how to contribute can be found in the README.md
file of the repository.
This documentation draws on the Cirrus Tier-2 HPC Documentation, Sheffield Iceberg Documentation and the ARCHER National Supercomputing Service Documentation.
"},{"location":"archer-migration/","title":"ARCHER to ARCHER2 migration","text":"This section of the documentation is a guide for users migrating from ARCHER to ARCHER2.
It covers:
Tip
If you need help or have questions on ARCHER to ARCHER2 migration, please contact the ARCHER2 service desk
"},{"location":"archer-migration/account-migration/","title":"Migrating your account from ARCHER to ARCHER2","text":"This section covers the following questions:
Tip
If you need help or have questions on ARCHER to ARCHER2 migration, please contact the ARCHER2 service desk
"},{"location":"archer-migration/account-migration/#when-will-i-be-able-to-access-archer2","title":"When will I be able to access ARCHER2?","text":"We anticipate that users will have access during the week beginning 11th January 2021. Notification of activation of ARCHER2 projects will be sent to the project leaders/PIs and the project users.
"},{"location":"archer-migration/account-migration/#has-my-project-been-migrated-to-archer2","title":"Has my project been migrated to ARCHER2?","text":"If you have an active ARCHER allocation at the end of the ARCHER service then your project will very likely be migrated to ARCHER2. If your project is migrated to ARCHER2 then it will have the same project code as it had on ARCHER.
Some further information that may be useful:
The unit of allocation on ARCHER2 is called the ARCHER2 Compute Unit (CU) and, in general, 1 CU will be worth 1 ARCHER2 node hour.
UKRI have determined the conversion rates which will be used to transfer existing ARCHER allocations onto ARCHER2. These will be:
In identifying these conversion rates UKRI has endeavoured to ensure that no user will be disadvantaged by the transfer of their allocation from ARCHER to ARCHER2.
A nominal allocation will be provided to all projects during the initial no-charging period. Users will be notified before the no-charging period ends.
When the ARCHER service ends, any unused ARCHER allocation in kAUs will be converted to ARCHER2 CUs and transferred to ARCHER2 project allocation.
"},{"location":"archer-migration/account-migration/#how-do-i-set-up-an-archer2-account","title":"How do I set up an ARCHER2 account?","text":"Once you have been notified that you can go ahead and set up an ARCHER2 account, you will do this through SAFE. Note that you should use the new unified SAFE interface rather than the ARCHER SAFE. The correct URL for the new SAFE is:
Your access details for this SAFE are the same as those for the ARCHER SAFE. You should log in in exactly the same way as you did on the ARCHER SAFE.
Important
You should make sure you request the same account name in your project on ARCHER2 as you have on ARCHER. This is to ensure that you have seamless access to your ARCHER /home data on ARCHER2. See the ARCHER to ARCHER2 Data Migration page for details on data transfer from ARCHER to ARCHER2.
Once you have logged into SAFE, you will need to complete the following steps before you can log into ARCHER2 for the first time:
The ARCHER2 documentation covers logging in to ARCHER2 from a variety of operating systems:
This section provides an overview of the main differences between ARCHER and ARCHER2 along with links to more information where appropriate.
"},{"location":"archer-migration/archer2-differences/#for-all-users","title":"For all users","text":"srun
rather than aprun
This short guide explains how to move data from the ARCHER service to the ARCHER2 service.
We have also created a walkthrough video to guide you.
Note
This section assumes that you have an active ARCHER and ARCHER2 account, and that you have successfully logged in to both accounts.
Tip
Unlike normal access, ARCHER to ARCHER2 transfer has been set up to require only one form of authentication. You will not need to generate a new SSH key pair to transfer data from ARCHER to ARCHER2 as your password will suffice.
First, log in to ARCHER (making sure to change auser
to your username):
ssh auser@login.archer.ac.uk\n
Then, combine important research data into a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt directory1/\n
Please be selective -- the more data you want to transfer, the more time it will take.
From ARCHER in particular, in order to get the best transfer performance, we need to access a newer version of the SSH program. We do this by loading the openssh
module:
module load openssh\n
"},{"location":"archer-migration/data-migration/#transferring-data-using-rsync-recommended","title":"Transferring data using rsync
(recommended)","text":"Begin the data transfer from ARCHER to ARCHER2 using rsync
:
rsync -Pv -e\"ssh -c aes128-gcm@openssh.com\" \\\n ./all_my_files.tar.gz a2user@transfer.dyn.archer2.ac.uk:/work/t01/t01/a2user\n
Important
Notice that the hostname for data transfer from ARCHER to ARCHER2 is not the usual login address. Instead, you use transfer.dyn.archer2.ac.uk
. This address has been configured to allow higher performance data transfer and to allow access to ARCHER2 with password only with no SSH key required.
When running this command, you will be prompted to enter your ARCHER2 password. Enter it and the data transfer will begin. Also, remember to replace a2user
with your ARCHER2 username, and t01
with the budget associated with that username.
The -P
flag allows partial transfer -- the same command could be used to restart the transfer after a loss of connection. The -e
flag allows specification of the ssh command - we have used this to pass the cipher option to ssh. The -c
option specifies the cipher to be used as aes128-gcm
which has been found to increase performance. Unfortunately the ~
shortcut is not correctly expanded, so we have specified the full path. We move our research archive to our project work directory on ARCHER2.
If you are unconcerned about being able to restart an interrupted transfer, you could instead use the scp
command,
scp -c aes128-gcm@openssh.com all_my_files.tar.gz \\\n a2user@transfer.dyn.archer2.ac.uk:/work/t01/t01/a2user/\n
but rsync
is recommended for larger transfers.
This section of the documentation is a guide for users migrating from the ARCHER2 4-cabinet system to the ARCHER2 full system.
It covers:
Tip
If you need help or have questions on ARCHER2 4-cab to full ARCHER2 migration please contact the ARCHER2 service desk
"},{"location":"archer2-migration/account-migration/","title":"Accessing the ARCHER2 full system","text":"This section covers the following questions:
Tip
If you need help or have questions on using ARCHER2 4-cabinet system and ARCHER2 full system please contact the ARCHER2 service desk
"},{"location":"archer2-migration/account-migration/#when-will-i-be-able-to-access-archer2-full-system","title":"When will I be able to access ARCHER2 full system?","text":"We anticipate that users will have access from mid-late November. Users will have access to both the ARCHER2 4-cabinet system and ARCHER2 full system for at least 30 days. UKRI will confirm the dates and these will be communicated to you as they are confirmed. There will be at least 14 days notice before access to the ARCHER2 4-Cabinet system is removed.
"},{"location":"archer2-migration/account-migration/#has-my-project-been-enabled-on-archer2-full-system","title":"Has my project been enabled on ARCHER2 full system?","text":"If you have an active ARCHER2 4-cabinet system allocation on 1st October 2021 then your project will be enabled on the ARCHER2 full system. The project code is the same on the full service as it is on ARCHER2 4-cabinet system.
Some further information that may be useful:
The unit of allocation on ARCHER2 is called the ARCHER2 Compute Unit (CU) and 1 CU is equivalent to 1 ARCHER2 node hour. Your time budget will be shared on both systems. This means that any existing allocation available to your project on the 4-cabinet system will also be available on the full system.
There will be a period of at least 30 days where users will have access to both the 4-cabinet system and the full system. During this time, use on the full system will be uncharged (though users must still have access to a valid, positive budget to be able to submit jobs) and use on the 4-cabinet system will be charged in the usual way. Users will be notified before the no-charging period ends.
"},{"location":"archer2-migration/account-migration/#how-do-i-set-up-an-account-on-the-full-system","title":"How do I set up an account on the full system?","text":"You will keep the same usernames, passwords and SSH keys that you use on the 4-cabinet system on the full system.
You do not need to do anything to enable your account; it will be made available automatically once access to the full system is available.
You will connect to the full system in the same way as you connect to the 4-cabinet system except for switching the ordering of the credentials:
The ARCHER2 documentation covers logging in to ARCHER2 from a variety of operating systems:
Logging in to ARCHER2 from macOS/Linux
Logging in to ARCHER2 from Windows
Login addresses:
Tip
When logging into the ARCHER2 full system for the first time, you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11).
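As a minimal sketch of that deletion, the script below removes a single numbered line from a known_hosts-style file while keeping a backup. It works on a throwaway file with placeholder entries (the host names and keys are invented for illustration), so your real ~/.ssh/known_hosts is not touched; the line number 11 is taken from the example warning above and will differ in practice.

```shell
#!/bin/bash
# Build a throwaway known_hosts-style file with 12 placeholder entries.
# These host names and keys are illustrative only.
demo_hosts=$(mktemp)
for i in $(seq 1 12); do
    echo "host${i}.example.org ssh-ed25519 AAAA-placeholder-key-${i}"
done > "$demo_hosts"

# Line number reported by SSH in the "Offending key for IP in ...:11" message.
offending_line=11

# Delete only that line, keeping a .bak copy of the original file.
sed -i.bak "${offending_line}d" "$demo_hosts"

remaining=$(wc -l < "$demo_hosts")
echo "Entries remaining: ${remaining}"
```

To apply the same idea to your own file, point the sed command at ~/.ssh/known_hosts instead of the temporary file. Alternatively, ssh-keygen -R login.archer2.ac.uk removes all stored keys for that host name without needing the line number.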
There are three file systems associated with the ARCHER2 Service:
"},{"location":"archer2-migration/account-migration/#home-file-systems","title":"home file systems","text":"The home file systems will be mounted on both the 4-cabinet system and the full system; so users\u2019 directories are shared across the two systems. Users will be able to access the home file systems from both systems and no action is required to move data. The home file systems will be read and writeable on both services during the transition period.
"},{"location":"archer2-migration/account-migration/#work-file-systems","title":"work file systems","text":"There are different work file systems for the 4-cabinet system and the full system.
The work file system on the 4-cabinet system (labelled \u201carcher2-4c-work\u201d in SAFE) will remain available on the 4-cabinet system during the transition period.
There will be new work file systems on the full system and you will have new directories on the new work file systems. Your initial quotas will typically be double your quotas for the 4-cabinet work file system.
Important: you are responsible for transferring any required data from the 4-cabinet work file systems to your new directories on the work file systems on the full system.
The work file system on the 4-cabinet system will be available for you to transfer your data from for at least 30 days from the start of the ARCHER2 full system access and 14 days notice will be given before the 4-cabinet work file system is removed.
"},{"location":"archer2-migration/account-migration/#rdfaas-file-systems","title":"RDFaaS file systems","text":"For users who have access to the RDFaaS, your RDFaaS data will be available on both the 4-cabinet system and the full system during the transition period and will be readable and writeable on both systems.
"},{"location":"archer2-migration/archer2-differences/","title":"Main differences between ARCHER2 4-cabinet system and ARCHER2 full system","text":"This section provides an overview of the main differences between the ARCHER2 4-cabinet system that all users have been using up until now and the full ARCHER2 system along with links to more information where appropriate.
"},{"location":"archer2-migration/archer2-differences/#for-all-users","title":"For all users","text":"--reservation=shortqos
when using the short
QoS.reservation
QoS.module load epcc-job-env
command to job submission scripts.cray-netcdf
or cray-netcdf-hdf5parallel
modules until you have loaded the appropriate cray-hdf5
or cray-hdf5-parallel
modules). You can use the module spider
command to see all available modules, including hidden ones. This short guide explains how to move data from the work file system on the ARCHER2 4-cabinet system to the ARCHER2 full system. Your space on the home file system is shared between the ARCHER2 4-cabinet system and the ARCHER2 full system so everything from your home directory is already effectively transferred.
Note
This section assumes that you have an active ARCHER2 4-cabinet system and ARCHER2 full system account, and that you have successfully logged in to both accounts.
Tip
Unlike normal access, ARCHER2 4-cabinet system to ARCHER2 full system transfer has been set up to require only one form of authentication. You will only need one factor to authenticate from the 4-cab to the full system or vice versa. This factor can be either an SSH key (that has been registered against your account in SAFE) or your password. If you have a large amount of data to transfer you may want to set up a passphrase-less SSH key on ARCHER2 full system and use the data analysis nodes to run transfers via a Slurm job.
"},{"location":"archer2-migration/data-migration/#transferring-data-interactively-from-the-4-cabinet-system-to-the-full-system","title":"Transferring data interactively from the 4-cabinet system to the full system","text":"First, log in to the ARCHER2 4-cabinet system (making sure to change auser
to your username):
ssh auser@login-4c.archer2.ac.uk\n
Then, combine important research data into a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt directory1/\n
Please be selective -- the more data you want to transfer, the more time it will take.
Unpack the archive file in the destination directory
tar -xzf all_my_files.tar.gz\n
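Before starting a long transfer it can be worth checking that the archive is readable and recording a checksum, so the copy can be verified at the destination. The following is a sketch only: it builds a small sample archive in a temporary directory (the file contents are placeholders) and assumes the sha256sum utility is available, as it is on most Linux systems.

```shell
#!/bin/bash
# Create a small sample data set in a temporary directory (placeholder content).
workdir=$(mktemp -d)
cd "$workdir"
mkdir directory1
echo 'sample data' > file1.txt
echo 'more data' > directory1/file2.txt

# Pack the files, as in the tar command above.
tar -czf all_my_files.tar.gz file1.txt directory1/

# List the archive contents without unpacking, to confirm it is readable.
tar -tzf all_my_files.tar.gz

# Record a checksum alongside the archive.
sha256sum all_my_files.tar.gz > all_my_files.tar.gz.sha256

# After copying both files, the same check can be run at the destination.
sha256sum -c all_my_files.tar.gz.sha256
```

If the final check reports anything other than OK on the destination system, the archive was probably truncated or corrupted in transit and should be transferred again.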
"},{"location":"archer2-migration/data-migration/#transferring-data-using-rsync-recommended","title":"Transferring data using rsync
(recommended)","text":"Begin the data transfer from the ARCHER2 4-cabinet system to the ARCHER2 full system using rsync
:
rsync -Pv all_my_files.tar.gz a2user@login.archer2.ac.uk:/work/t01/t01/a2user\n
When running this command, you will be prompted to enter your ARCHER2 password -- this is the same password for the ARCHER2 4-cabinet system and the ARCHER2 full system. Enter it and the data transfer will begin. Remember to replace a2user
with your ARCHER2 username, and t01
with the budget associated with that username.
We use the -P
flag to allow partial transfer -- the same command could be used to restart the transfer after a loss of connection. We move our research archive to our project work directory on the ARCHER2 full system.
If you are unconcerned about being able to restart an interrupted transfer, you could instead use the scp
command,
scp all_my_files.tar.gz a2user@login.archer2.ac.uk:/work/t01/t01/a2user/\n
but rsync
is recommended for larger transfers.
It may be convenient to submit long data transfers to the serial queue. In this case, a number of simple preparatory steps are required to authenticate:
ssh/scp
commands in the serial queue to authenticate. As it has been arranged that only one of ssh key/password is required between the serial nodes and the 4-cabinet system, this is sufficient. An example serial queue script using rsync
might be:
#!/bin/bash\n\n# Slurm job options (job-name, job time)\n\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n#SBATCH --time=02:00:00\n#SBATCH --ntasks=1\n\n# Replace [budget code] below with your budget code\n\n#SBATCH --account=[budget code] \n\n# Issue appropriate rsync command\n\nrsync -av --stats --progress --rsh=\"ssh -i ${HOME}/.ssh/id_rsa_batch\" \\\n user-01@login-4c.archer2.ac.uk:/work/proj01/proj01/user-01/src \\\n /work/proj01/proj01/user-01/destination\n
where ${HOME}/.ssh/id_rsa_batch
is the new ssh key. Note that the ${HOME}
directory is visible from the serial nodes on the full system, so ssh key pairs in ${HOME}/.ssh
are available. "},{"location":"archer2-migration/porting/","title":"Porting applications to full ARCHER2 system","text":"Porting applications to the full ARCHER2 system has generally proven straightforward if they are running successfully on the ARCHER2 4-cabinet system. You should be able to use the same (or very similar) compile processes on the full system as you used on ARCHER2.
During testing of the ARCHER2 full system, the CSE team at EPCC have seen that application binaries compiled on the 4-cabinet system can usually be copied over to the full system and work well and give good performance. However, if you run into issues with executables taken from the 4-cabinet system on the full system you should recompile in the first instance.
Information on compiling applications on the full system can be found in the Application Development Environment section of the User and Best Practice Guide.
"},{"location":"data-tools/","title":"Data Analysis and Tools","text":"This section provides information on each of the centrally installed data analysis software and other software tools.
The tools currently available in this section are (software that is installed or maintained by third parties rather than the ARCHER2 service is marked with *):
AMD \u03bcProf (\u201cMICRO-prof\u201d) is a software profiling analysis tool for x86 applications running on Windows, Linux and FreeBSD operating systems and provides event information unique to the AMD \u201cZen\u201d-based processors and AMD INSTINCT\u2122 MI Series accelerators. AMD uProf enables the developer to better understand the limiters of application performance and evaluate improvements.
"},{"location":"data-tools/amd-uprof/#accessing-amd-prof-on-archer2","title":"Accessing AMD \u03bcProf on ARCHER2","text":"To gain access to the AMD \u03bcProf tools on ARCHER2, you must load the module:
module load amd-uprof\n
"},{"location":"data-tools/amd-uprof/#using-amd-prof","title":"Using AMD \u03bcProf","text":"Please see the AMD documentation for information on how to use \u03bcProf:
Blender is a 3D rendering and package tool primarily used for 3D animation and VFX but increasingly also for scientific visualisation. Because it is an artist's tool first rather than a scientific visualisation package, it allows great versatility and complete control over every aspect of the final image.
"},{"location":"data-tools/blender/#useful-links","title":"Useful links","text":"Blender is available through the blender
module.
module load blender\n
Once the module has been loaded, the Blender executable will be available.
"},{"location":"data-tools/blender/#running-blender-jobs","title":"Running blender jobs","text":"Even though blender is single node only, each frame being independent makes the render of animations an embarrassingly parallel problem. Here is an example job for running blender to export frames 1 to 100 from the blend file scene.blend
. Submitting another job with a different frame range will use a second node, and so on.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=example_blender_job\n#SBATCH --time=0:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load blender\n\nexport BLENDER_USER_RESOURCES=${HOME/home/work}/.blender\n\nblender -b scene.blend --render-output //render_ -noaudio -f 1-100 -- --cycles-device CPU\n
The full list of command line arguments can be found in Blender's documentation. Note that with blender, the order of the arguments is important.
To automate this process, add-ons like the one available in blender4science are helpful, as they allow you to submit multiple identical jobs and handle the parallelisation so that each frame is rendered only once.
Note
Blender doesn't work on ARCHER2 GPU nodes at the moment due to incompatibilities with the ROCm version available.
"},{"location":"data-tools/cray-r/","title":"R","text":""},{"location":"data-tools/cray-r/#r-for-statistical-computing","title":"R for statistical computing","text":"R is a software environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time-series analysis, classification, clustering, and so on).
Note
When you log onto ARCHER2, no R module is loaded by default. You need to load the cray-R
module to access the functionality described below.
The recommended version of R to use on ARCHER2 is the HPE Cray R distribution, which can be loaded using:
module load cray-R\n
The HPE Cray R distribution includes a range of common R packages, including all of the base packages, plus a few others.
To see what packages are available, run the R command
library()\n
--from the R command prompt.
At the time of writing, the HPE Cray R distribution included the following packages:
Full SystemPackages in library \u2018/opt/R/4.0.3.0/lib64/R/library\u2019:\n\nbase The R Base Package\nboot Bootstrap Functions (Originally by Angelo Canty\n for S)\nclass Functions for Classification\ncluster \"Finding Groups in Data\": Cluster Analysis\n Extended Rousseeuw et al.\ncodetools Code Analysis Tools for R\ncompiler The R Compiler Package\ndatasets The R Datasets Package\nforeign Read Data Stored by 'Minitab', 'S', 'SAS',\n 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...\ngraphics The R Graphics Package\ngrDevices The R Graphics Devices and Support for Colours\n and Fonts\ngrid The Grid Graphics Package\nKernSmooth Functions for Kernel Smoothing Supporting Wand\n & Jones (1995)\nlattice Trellis Graphics for R\nMASS Support Functions and Datasets for Venables and\n Ripley's MASS\nMatrix Sparse and Dense Matrix Classes and Methods\nmethods Formal Methods and Classes\nmgcv Mixed GAM Computation Vehicle with Automatic\n Smoothness Estimation\nnlme Linear and Nonlinear Mixed Effects Models\nnnet Feed-Forward Neural Networks and Multinomial\n Log-Linear Models\nparallel Support for Parallel computation in R\nrpart Recursive Partitioning and Regression Trees\nspatial Functions for Kriging and Point Pattern\n Analysis\nsplines Regression Spline Functions and Classes\nstats The R Stats Package\nstats4 Statistical Functions using S4 Classes\nsurvival Survival Analysis\ntcltk Tcl/Tk Interface\ntools Tools for Package Development\nutils The R Utils Package\n
4-cabinet system Packages in library \u2018/opt/R/4.0.2.0/lib64/R/library\u2019:\n\nbase The R Base Package\nboot Bootstrap Functions (Originally by Angelo Canty\n for S)\nclass Functions for Classification\ncluster \"Finding Groups in Data\": Cluster Analysis\n Extended Rousseeuw et al.\ncodetools Code Analysis Tools for R\ncompiler The R Compiler Package\ndatasets The R Datasets Package\nforeign Read Data Stored by 'Minitab', 'S', 'SAS',\n 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...\ngraphics The R Graphics Package\ngrDevices The R Graphics Devices and Support for Colours\n and Fonts\ngrid The Grid Graphics Package\nKernSmooth Functions for Kernel Smoothing Supporting Wand\n & Jones (1995)\nlattice Trellis Graphics for R\nMASS Support Functions and Datasets for Venables and\n Ripley's MASS\nMatrix Sparse and Dense Matrix Classes and Methods\nmethods Formal Methods and Classes\nmgcv Mixed GAM Computation Vehicle with Automatic\n Smoothness Estimation\nnlme Linear and Nonlinear Mixed Effects Models\nnnet Feed-Forward Neural Networks and Multinomial\n Log-Linear Models\nparallel Support for Parallel computation in R\nrpart Recursive Partitioning and Regression Trees\nspatial Functions for Kriging and Point Pattern\n Analysis\nsplines Regression Spline Functions and Classes\nstats The R Stats Package\nstats4 Statistical Functions using S4 Classes\nsurvival Survival Analysis\ntcltk Tcl/Tk Interface\ntools Tools for Package Development\nutils The R Utils Package\n
"},{"location":"data-tools/cray-r/#running-r-on-the-compute-nodes","title":"Running R on the compute nodes","text":"In this section, we provide an example R job submission script for using R on the ARCHER2 compute nodes.
"},{"location":"data-tools/cray-r/#serial-r-submission-script","title":"Serial R submission script","text":"#!/bin/bash --login\n\n#SBATCH --job-name=r_test\n#SBATCH --ntasks=1\n#SBATCH --time=00:10:00\n\n# Replace [budget code] below with your project code (e.g., t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n# Load the R module\nmodule load cray-R\n\n# Run your R programme\nRscript serial_test.R\n
On completion, the output of the R script will be available in the job output file.
"},{"location":"data-tools/darshan/","title":"Darshan","text":"Darshan is a scalable HPC I/O characterization tool. Darshan is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with minimum overhead. The name is taken from a Sanskrit word for \"sight\" or \"vision\".
Darshan is developed at the Argonne Leadership Computing Facility (ALCF)
Useful links:
Using Darshan generally consists of two stages:
To collect IO profile data you add the command:
module load darshan\n
to your job submission script as the last module
command before you run your program. As Darshan does not distinguish between different software run in your job submission script, we typically recommend that you use a structure like:
module load darshan\nsrun ...usual software launch options...\nmodule remove darshan\n
This will avoid Darshan profiling IO for operations that are not part of your main parallel program.
Tip
There may be some periods when Darshan monitoring is enabled by default for all users. During these periods, you can disable Darshan monitoring by adding the command module remove darshan
to your job submission script. Periods of Darshan monitoring will be noted on the ARCHER2 Service Status page.
Important
The darshan
module is dependent on the compiler environment you are using and you should ensure that you load the darshan
module that matches the compiler environment you used to compile the program you are analysing. For example, if your software was compiled using PrgEnv-gnu
, then you would need to activate the GCC compiler environment before loading the darshan
module to ensure you get the GCC version of Darshan. This means loading the correct PrgEnv-
module before you load the darshan
module:
module load PrgEnv-gnu\nmodule load darshan\nsrun ...usual software launch options...\nmodule remove darshan\n
"},{"location":"data-tools/darshan/#location-of-darshan-profile-logs","title":"Location of Darshan profile logs","text":"Darshan writes all profile logs to a shared location on the ARCHER2 NVMe Lustre file system. You can find your profile logs at:
/mnt/lustre/a2fs-nvme/system/darshan/YYYY/MM/DD\n
where YYYY/MM/DD
correspond to the date on which your job ran.
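As a small illustration of how the dated directory is formed, the sketch below builds the log directory path for a given job date. The date used here is a hypothetical example, and the suggestion that log file names begin with the submitting username is an assumption based on Darshan's usual naming scheme rather than something stated in this documentation.

```shell
#!/bin/bash
# Shared Darshan log location on ARCHER2, as given above.
darshan_root=/mnt/lustre/a2fs-nvme/system/darshan

# Hypothetical date on which the job ran, in YYYY-MM-DD form.
job_date=2024-12-12

# Replace each '-' with '/' to obtain the YYYY/MM/DD directory layout.
log_dir="${darshan_root}/${job_date//-//}"
echo "$log_dir"

# On ARCHER2 you could then list your own logs with something like:
#   ls "$log_dir" | grep "^${USER}_"
# (assumes log file names start with the username; not confirmed here)
```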
The simplest way to analyse the profile log files is to use the darshan-parser
utility on the ARCHER2 login nodes. You make the Darshan analysis utilities available with the command:
module load darshan-util\n
Once this is loaded, you can produce an IO performance summary from a profile log file with:
darshan-parser --perf /path/to/darshan/log/file.darshan\n
You can get a dump of all data in the Darshan profile log by omitting the --perf
option, e.g.:
darshan-parser /path/to/darshan/log/file.darshan\n
Tip
The darshan-job-summary.pl
and darshan-summary-per-file.sh
utilities do not work on ARCHER2 as the required graphical packages are not currently available.
Documentation on the Darshan analysis utilities is available at:
Linaro Forge provides debugging and profiling tools for MPI parallel applications, and OpenMP or pthreads multi-threaded applications (and also hybrid MPI/OpenMP). Forge DDT is the debugger and MAP is the profiler.
"},{"location":"data-tools/forge/#user-interface","title":"User interface","text":"There are two ways of running the Forge user interface. If you have a good internet connection to ARCHER2, the GUI can be run on the front-end (with an X-connection). Alternatively, one can download a copy of the Forge remote client to your laptop or desktop, and run it locally. The remote client should be used if at all possible.
To download the remote client, see the Forge download pages. Version 24.0 is known to work at the time of writing. A section further down this page explains how to use the remote client, see Connecting with the remote client.
"},{"location":"data-tools/forge/#licensing","title":"Licensing","text":"ARCHER2 has a licence for up to 2080 tokens, where a token represents an MPI parallel process. Running Forge DDT/MAP to debug/profile a code running across 16 nodes using 128 MPI ranks per node would require 2048 tokens. If you wish to run on more nodes, say 32, then it will be necessary to reduce the number of tasks per node so as to fall below the maximum number of tokens allowed.
Please note, Forge licence tokens are shared by all ARCHER2 (and Cirrus) users.
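The token arithmetic above can be sketched in a few lines of Bash. This is only an illustration: the function names are hypothetical, and the limit of 2080 is the figure quoted in this section for the 12-month licence, so check the licence status page for the current value.

```bash
#!/bin/bash
# Token limit quoted in the text above (an assumption that it is still current).
LICENCE_TOKENS=2080

# Tokens needed for a run: one token per MPI process.
forge_tokens() {
    local nodes=$1 ranks_per_node=$2
    echo $(( nodes * ranks_per_node ))
}

# Largest number of ranks per node that stays within the licence
# for a given node count (integer division rounds down).
forge_max_ranks_per_node() {
    local nodes=$1
    echo $(( LICENCE_TOKENS / nodes ))
}

forge_tokens 16 128            # prints 2048
forge_max_ranks_per_node 32    # prints 65
```

So a 32-node debugging run would need at most 65 MPI ranks per node to fit within the 2080 tokens, and fewer if other users already hold tokens.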
To see how many tokens are in use, you can view the licence server status page by first setting up an SSH tunnel to the node hosting the licence server.
ssh <username>@login.archer2.ac.uk -L 4241:dvn04:4241\n
You can now view the status page from within a local browser, see http://localhost:4241/status.html.
Note
The licence status page may contain multiple licences, indicated by a row of buttons (one per licence) near the top of the page. The details of the 12-month licence described above can be accessed by clicking on the first button in the row. Additional buttons may appear at various times for boosted licences: once a quarter, ARCHER2 will have a boosted 7-day licence offering 8192 tokens, sufficient for 64 nodes running 128 MPI ranks per node. Please contact the Service Desk if you have a specific requirement that exceeds the current Forge licence provision.
Note
The licence status page refers to the Arm Licence Server. Arm is the name of the company that originally developed Forge before it was acquired by Linaro.
"},{"location":"data-tools/forge/#one-time-set-up-for-using-forge","title":"One time set-up for using Forge","text":"A preliminary step is required to set up the necessary Forge configuration files that allow DDT and MAP to initialise its environment correctly so that it can, for example, interact with the Slurm queue system. These steps should be performed in the /work
file system on ARCHER2.
It is recommended that these commands are performed in the top-level work file system directory for the user account, i.e., ${HOME/home/work}
.
module load forge\ncd ${HOME/home/work}\nsource ${FORGE_DIR}/config-init\n
Running the source
command will create a directory ${HOME/home/work}/.forge
that contains the following files.
system.config user.config\n
Warning
The config-init
script may output, Warning: failed to read system config
. Please ignore as subsequent messages should indicate that the new configuration files have been created.
Within the system.config
file you should find that shared directory
is set to the equivalent of ${HOME/home/work}/.forge
. That directory will also store other relevant files when Forge is run.
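The ${HOME/home/work} expressions used in this section rely on Bash pattern substitution, which replaces the first occurrence of home in the value of HOME with work. A minimal sketch, using a hypothetical home directory in place of the real HOME so that it behaves the same on any machine:

```bash
#!/bin/bash
# Hypothetical home directory standing in for $HOME (no spaces in the path).
sample_home=/home/t01/t01/auser

# ${var/pattern/replacement} replaces the first match of pattern only.
work_dir=${sample_home/home/work}

# Append .forge outside the braces; inside them it would become
# part of the replacement text instead of a subdirectory.
forge_config_dir=${sample_home/home/work}/.forge

echo $work_dir            # prints /work/t01/t01/auser
echo $forge_config_dir    # prints /work/t01/t01/auser/.forge
```

This only works because ARCHER2 home and work directories share the same layout below the top-level directory.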
DDT (Distributed Debugging Tool) provides an easy-to-use graphical interface for source-level debugging of compiled C/C++ or Fortran codes. It can be used for non-interactive debugging, and there is also some limited support for python debugging.
"},{"location":"data-tools/forge/#preparation","title":"Preparation","text":"To prepare your program for debugging, compile and link in the normal way but remember to include the -g
compiler option to retain symbolic information in the executable. For some programs, it may be necessary to reduce the optimisation to -O0
to obtain full and consistent information. However, this in itself can change the behaviour of bugs, so some experimentation may be necessary.
A non-interactive method of debugging is available which allows information to be obtained on the state of the execution at the point of failure in a batch job.
Such a job can be submitted to the batch system in the usual way. The relevant command to start the executable is as follows.
# ... Slurm batch commands as usual ...\n\nmodule load forge\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nddt --verbose --offline --mpi=slurm --np 8 \\\n --mem-debug=fast --check-bounds=before \\\n ./my_executable\n
The parallel launch is delegated to ddt
and the --mpi=slurm
option indicates to ddt
that the relevant queue system is Slurm (there is no explicit srun
). It will also be necessary to state explicitly to ddt
the number of processes required (here --np 8
). For other options see, e.g., ddt --help
.
Note that higher levels of memory debugging can result in extremely slow execution. The example given above uses the default --mem-debug=fast
which should be a reasonable first choice.
Execution will produce a .html
format report which can be used to examine the state of execution at the point of failure.
You can also start the client interactively (for details of remote launch, see Connecting with the remote client).
module load forge\nddt\n
This should start a window as shown below. Click on the DDT panel on the left, and then on the Run and debug a program option. This will bring up the Run dialogue as shown.
Note:
One can start either DDT or MAP by clicking the appropriate panel on the left-hand side;
If the licence has connected successfully, a serial number will be shown in small text at the lower left.
In the Application sub panel of the Run dialog box, details of the executable, command line arguments or data files, the working directory and so on should be entered.
Click the MPI checkbox and specify the MPI implementation. This is done by clicking the Details button and then the Change button. Choose the SLURM (generic) implementation from the drop-down menu and click OK. You can then specify the required number of nodes/processes and so on.
Click the OpenMP checkbox and select the relevant number of threads (if there is no OpenMP in the application itself, select 1 thread).
Click the Submit to Queue checkbox and then the associated Configure button. A new set of options will appear such as Submission template file, where you can enter ${FORGE_DIR}/templates/archer2.qtf
and click OK. This template file provides many of the options required for a standard batch job. You will then need to click on the Queue Parameters button in the same section and specify the relevant project budget, see the Account entry.
The default queue template file configuration uses the short QoS with the standard time limit of 20 minutes. If something different is required, one can edit the settings. Alternatively, one can copy the archer2.qtf
file (to ${HOME/home/work}/.forge
) and make the relevant changes. This new template file can then be specified in the dialog window.
There may be a short delay while the sbatch job starts. Debugging should then proceed as described in the Linaro Forge documentation.
"},{"location":"data-tools/forge/#using-map","title":"Using MAP","text":"Load the forge
module:
module load forge\n
"},{"location":"data-tools/forge/#linking","title":"Linking","text":"MAP uses two small libraries to collect data from your program. These are called map-sampler
and map-sampler-pmpi
. On ARCHER2, the linking of these libraries is usually done automatically via the LD_PRELOAD mechanism, but only if your program is dynamically linked. Otherwise, you will need to link the MAP libraries manually by providing explicit link options.
The library paths specified in the link options will depend on the programming environment you are using as well as the Cray Programming Environment (CPE) release. Here are the paths for each of the compiler environments consistent with CPE 22.12 using the default OFI as the low-level comms protocol:
PrgEnv-cray
: ${FORGE_DIR}/map/libs/default/cray/ofi
PrgEnv-gnu
: ${FORGE_DIR}/map/libs/default/gnu/ofi
PrgEnv-aocc
: ${FORGE_DIR}/map/libs/default/aocc/ofi
For example, for PrgEnv-gnu
the additional options required at link time are given below.
-L${FORGE_DIR}/map/libs/default/gnu/ofi \\\n-lmap-sampler-pmpi -lmap-sampler \\\n-Wl,--eh-frame-hdr -Wl,-rpath=${FORGE_DIR}/map/libs/default/gnu/ofi\n
The MAP libraries for other Cray programming releases can be found under ${FORGE_DIR}/map/libs
. If you require MAP libraries built for the UCX comms protocol, simply replace ofi
with ucx
in the library path.
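The path and link-option rules above can be gathered into a small helper. This is a sketch only: the function names are hypothetical, the directory layout is the one listed in this section for the default libraries, and the fallback value for FORGE_DIR is a placeholder so that the snippet runs away from ARCHER2 (on ARCHER2 the variable is set by the forge module).

```bash
#!/bin/bash
# FORGE_DIR is normally set by module load forge; placeholder otherwise.
FORGE_DIR=${FORGE_DIR:-/path/to/forge}

# Print the MAP library directory for a compiler environment
# (cray, gnu or aocc) and a comms protocol (ofi by default, or ucx).
map_lib_dir() {
    local compiler=$1 comms=${2:-ofi}
    case $compiler in
        cray|gnu|aocc) echo ${FORGE_DIR}/map/libs/default/${compiler}/${comms} ;;
        *) echo unknown compiler environment: $compiler >&2; return 1 ;;
    esac
}

# Print the additional link options for that directory.
map_link_flags() {
    local dir
    dir=$(map_lib_dir $1 $2) || return 1
    echo -L${dir} -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=${dir}
}

map_link_flags gnu ofi
```

The output of map_link_flags could then be appended to the link line of a statically linked build made with the matching PrgEnv module loaded.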
Submit a batch job in the usual way, and include the lines:
# ... Slurm batch commands as usual ...\n\nmodule load forge\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nmap -n <number of MPI processes> --mpi=slurm --mpiargs=\"--hint=nomultithread --distribution=block:block\" --profile ./my_executable\n
Successful execution will generate a file with a .map
extension.
This .map
file may be viewed via the GUI (start with either map
or forge
) by selecting the Load a profile data file from a previous run option. The resulting file selection dialog box can then be used to locate the .map
file.
If one starts the Forge client on e.g., a laptop, one should see the main window as shown above. Select Remote Launch and then Configure from the drop-down menu. In the Configure Remote Connections dialog box click Add. The following window should be displayed. Fill in the fields as shown. The Connection Name is just a tag for convenience (useful if a number of different accounts are in use). The Host Name should be as shown with the appropriate username. The Remote Installation Directory should be exactly as shown. The Remote Script is needed to execute additional environment commands on connection. A default script is provided in the location shown.
/work/y07/shared/utils/core/forge/latest/remote-init\n
Other settings can be as shown. Remember to click OK when done.
From the Remote Launch menu you should now see the new Connection Name. Select this, and enter the relevant ssh passphrase and machine password to connect. A remote connection will allow you to debug, or view a profile, as discussed above.
If different commands are required on connection, a copy of the remote-init
script can be placed in, e.g., ${HOME/home/work}/.forge
and edited as necessary. The full path of the new script should then be specified in the remote launch settings dialog box. Note that the script changes the directory to the /work/
file system so that batch submissions via sbatch
will not be rejected.
Finally, note that ssh
may need to be configured so that it picks up the correct local public key file. This may be done, e.g., via the local .ssh/config
configuration file.
Navigate to https://app.globus.org
Log in with your Globus identity (this could be a globusid.org or other identity)
In File Manager, use the search tool to search for \u201cArcher2 file systems\u201d. Select it.
In the transfer pane, you are told that Authentication/Consent is required. Click Continue.
Click on the ARCHER2 Safe (safe.epcc.ed.ac.uk) link
Select the correct User account (if you have more than one)
Click Accept
Now confirm your Globus credentials \u2013 click Continue
Click on the SAFE id you selected previously
Make sure the correct User account is selected and Accept again
Your ARCHER2 /home directory will be shown
You can switch to viewing e.g. your /work directory by editing the path, or using the \"up one folder\" and selecting folders to move down the tree, as required
"},{"location":"data-tools/globus/#setting-up-the-other-end-of-the-transfer","title":"Setting up the other end of the transfer","text":"Make sure you select two-panel view mode
"},{"location":"data-tools/globus/#laptop","title":"Laptop","text":"If you wish to transfer data to/from your personal laptop or other device, click on the Collection Search in the right-hand panel
Use the link to \u201cGet Globus Connect Personal\u201d to create a Collection for your local drive.
"},{"location":"data-tools/globus/#other-server-eg-jasmin","title":"Other server e.g. JASMIN","text":"If you wish to connect to another server, you will need to search for the Collection e.g. JASMIN Default Collection and authenticate
Please see the JASMIN Globus page for more information
"},{"location":"data-tools/globus/#setting-up-and-initiating-the-transfer","title":"Setting up and initiating the transfer","text":"Once you are connected to both the Source and Destination Collections, you can use the File Manager to select the files to be transferred, and then click the Start button to initiate the transfer
A pop-up will appear once the Transfer request has been submitted successfully
Clicking on the \u201cView Details\u201d will show the progress and final status of the transfer
"},{"location":"data-tools/globus/#using-a-different-archer2-account","title":"Using a different ARCHER2 account","text":"If you want to use Globus with a different account on ARCHER2, you will have to go to Settings
Manage Identities
And Unlink the current ARCHER2 safe identity, then repeat the link process with the other ARCHER2 account
"},{"location":"data-tools/julia/","title":"Julia","text":"Julia is a general purpose software used widely in datascience and for data visualisation.
Important
This documentation is provided by an external party (i.e. not by the ARCHER2 service itself). Julia is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"data-tools/julia/#first-time-installation","title":"First time installation","text":"Note
There is no centrally installed version of Julia, so you will have to manually install it and any packages you may need. The following guide was tested on julia-1.6.6.
You will first need to download Julia into your work directory and untar the folder. You should then add the folder to your system path so you can use the julia
executable. Finally, you need to tell Julia to install any packages in the work directory as opposed to the default home directory, which can only be accessed from the login nodes. This can be done with the following code
export WORK=/work/t01/t01/auser\ncd $WORK\n\nwget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.6-linux-x86_64.tar.gz\ntar zxvf julia-1.6.6-linux-x86_64.tar.gz\nrm ./julia-1.6.6-linux-x86_64.tar.gz\n\nexport PATH=\"$PATH:$WORK/julia-1.6.6/bin\"\n\nmkdir ./.julia\nexport JULIA_DEPOT_PATH=\"$WORK/.julia\"\nexport PATH=\"$PATH:$JULIA_DEPOT_PATH/bin\"\n
At this point you should have a working installation of Julia! The environment variables will however be cleared when you log out of the terminal. You can set them in the .bashrc
file so that they're automatically defined every time you log in by adding the following lines to the end of the file ~/.bashrc
export WORK=\"/work/t01/t01/auser\"\nexport JULIA_DEPOT_PATH=\"$WORK/.julia\"\nexport PATH=\"$PATH:$WORK/julia-1.6.6/bin\"\nexport PATH=\"$PATH:$JULIA_DEPOT_PATH/bin\"\n
"},{"location":"data-tools/julia/#installing-packages-and-using-environments","title":"Installing packages and using environments","text":"Julia has a built in package manager which can be used to install registered packages quickly and easily. Like with many other high level programming languages we can make use of environments to control dependencies etc.
To make an environment, first navigate to where you want your environment to be (ideally a subfolder of your /work/
directory) and create an empty folder to store the environment in. Then launch Julia with the --project flag.
cd $WORK\nmkdir ./MyTestEnv\njulia --project=$WORK/MyTestEnv\n
This launches Julia in the MyTestEnv
environment. You can then install packages as usual using the normal commands in the Julia terminal. E.g.
using Pkg\nPkg.add(\"Oceananigans\")\n
"},{"location":"data-tools/julia/#configuring-mpijl","title":"Configuring MPI.jl","text":"The MPI.jl
package doesn't use the system MPICH implementation by default. You can set it up to do this by following the steps below. First you will need to load the cray-mpich
module and define some environment variables (see here for further details). Then you can launch Julia in an environment of your choice, ready to build.
module load cray-mpich/8.1.23\nexport JULIA_MPI_BINARY=\"system\"\nexport JULIA_MPI_PATH=\"\"\nexport JULIA_MPI_LIBRARY=\"/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi.so\"\nexport JULIA_MPIEXEC=\"srun\"\n\njulia --project=<<path to environment>>\n
Once in the Julia terminal you can build the MPI.jl
package using the following code. The final line installs the mpiexecjl
command which should be used instead of srun
to launch MPI processes.
using Pkg\nPkg.build(\"MPI\"; verbose=true)\nMPI.install_mpiexecjl(command = \"mpiexecjl\", force = false, verbose = true)\n
The mpiexecjl
command will be installed in the directory that JULIA_DEPOT_PATH
points to.
Note
You only need to do this once per environment.
"},{"location":"data-tools/julia/#running-julia-on-the-compute-nodes","title":"Running Julia on the compute nodes","text":"Below is an example script for running Julia with mpi on the compute nodes
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=<<job-name>>\n#SBATCH --time=00:19:00\n\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=24\n#SBATCH --cpus-per-task=1\n\n#SBATCH --qos=short\n#SBATCH --reservation=shortqos\n\n#SBATCH --account=<<your account>>\n#SBATCH --partition=standard\n\n# Set up the job environment (this module needs to be loaded before any other modules)\nmodule load PrgEnv-cray\nmodule load cray-mpich/8.1.23\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\nexport JULIA_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Define some paths\nexport WORK=/work/t01/t01/auser\n\nexport JULIA=\"$WORK/julia-1.6.6/bin/julia\" # The julia executable\nexport PATH=\"$PATH:$WORK/julia-1.6.6/bin\" # The folder of the julia executable\nexport JULIA_DEPOT_PATH=\"$WORK/.julia\"\nexport MPIEXECJL=\"$JULIA_DEPOT_PATH/bin/mpiexecjl\" # The path to the mpiexecjl executable\n\n$MPIEXECJL --project=$WORK/MyTestEnv -n 24 $JULIA ./MyMpiJuliaScript.jl\n
The above script uses MPI but you can also use multithreading instead by setting the JULIA_NUM_THREADS
environment variable.
The Performance Application Programming Interface (PAPI) is an API that facilitates the reading of performance counter data without needing to know the details of the underlying hardware.
For convenience, we have developed an MPI-based wrapper for PAPI, called papi_mpi_lib
, which can be found via the link below.
https://github.com/cresta-eu/papi_mpi_lib
The PAPI MPI Library makes it possible to monitor a user-defined set of hardware performance counters during the execution of an MPI code running across multiple compute nodes. The library is lightweight, containing just four functions, and is intended to be straightforward to use. Once you've decided where in your code you wish to record counter values, you can control which counters are read at runtime by setting the PAT_RT_PERFCTR
environment variable in the job submission script. As your code executes, the specified counters will be read at various points. After each reading, the counter values are summed by rank 0 (via an MPI reduction) before being output to a log file.
You can discover which counters are available on ARCHER2 compute nodes by submitting the following single node job.
#!/bin/bash --login\n\n#SBATCH -J papi\n#SBATCH --time=00:20:00\n#SBATCH --exclusive\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=1\n#SBATCH --cpus-per-task=1\n#SBATCH --account=<budget code>\n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --export=none\n\nfunction papi_query() {\n export LD_LIBRARY_PATH=/opt/cray/pe/papi/$2/lib64:/opt/cray/libfabric/$3/lib64\n module -q restore\n\n module -q load cpe/$1\n module -q load papi/$2\n\n mkdir -p $1\n papi_component_avail -d &> $1/papi_component_avail.txt\n papi_native_avail -c &> $1/papi_native_avail.txt\n papi_avail -c -d &> $1/papi_avail.txt\n}\n\npapi_query 22.12 6.0.0.17 1.12.1.2.2.0.0\n
The job runs various papi
commands with the output being directed to specific text files. Please consult the text files to see which counters are available. Note, counters that are not available may still be listed in the file, but with a label such as <NA>
.
As of July 2023, the Cray Programming Environment (CPE), PAPI and libfabric versions on ARCHER2 were 22.12
, 6.0.0.17
and 1.12.1.2.2.0.0
respectively; these versions may change in the future.
Alternatively, you can run pat_help counters rome
from a login node to check the availability of individual counters.
Further information on papi_mpi_lib
along with test harnesses and example scripts can be found by reading the PAPI MPI Library readme file.
ParaView is a data visualisation and analysis package. Whilst ARCHER2 compute or login nodes do not have graphics cards installed in them, ParaView is installed so the visualisation libraries and applications can be used to post-process simulation data. The ParaView server (pvserver
), batch application (pvbatch
), and the Python interface (pvpython
) are all available. Users are able to run the server on the compute nodes and connect to a local ParaView client running on their own computer.
ParaView is available through the paraview
module.
module load paraview\n
Once the module has been added, the ParaView executables, tools, and libraries will be available.
"},{"location":"data-tools/paraview/#connecting-to-pvserver-on-archer2","title":"Connecting to pvserver on ARCHER2","text":"For doing visualisation, you should connect to pvserver
from a local ParaView client running on your own computer.
Note
You should make sure the version of ParaView you have installed locally is the same as the one on ARCHER2 (version 5.10.1).
The following instructions are for running pvserver in an interactive job. Start an interactive job using:
srun --nodes=1 --exclusive --time=00:20:00 \\\n --partition=standard --qos=short --pty /bin/bash\n
Once the job starts the command prompt will change to show you are now on the compute node, e.g.:
auser@nid001023:/work/t01/t01/auser> \n
Then load the ParaView module and start pvserver
with the srun
command,
auser@nid001023:/work/t01/t01/auser> module load paraview\nauser@nid001023:/work/t01/t01/auser> srun --overlap --oversubscribe -n 4 \\\n> pvserver --mpi --force-offscreen-rendering\nWaiting for client...\nConnection URL: cs://nid001023:11111\nAccepting connection(s): nid001023:11111\n
Note
The previous example uses 4 compute cores to run pvserver
. You can increase the number of cores in case the visualisation does not run smoothly. Please bear in mind that, depending on the testcase, a large number of compute cores can lead to an out-of-memory runtime error.
In a separate terminal you can now set up an SSH tunnel with the node ID and port number which the pvserver
is using, e.g.:
ssh -L 11111:nid001023:11111 auser@login.archer2.ac.uk \n
Enter your password and passphrase as usual.
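The tunnel command can be derived from the Connection URL that pvserver prints. The helper below is a sketch with a hypothetical function name; the URL and username are the example values used on this page, and it assumes the URL has the cs://node:port form shown above.

```bash
#!/bin/bash
# Build the ssh tunnel command from a pvserver Connection URL.
pvserver_tunnel_cmd() {
    local url=$1 user=$2
    local hostport=${url#cs://}   # strip the cs:// scheme
    local node=${hostport%%:*}    # text before the colon (compute node ID)
    local port=${hostport##*:}    # text after the colon (port number)
    echo ssh -L ${port}:${node}:${port} ${user}@login.archer2.ac.uk
}

pvserver_tunnel_cmd cs://nid001023:11111 auser
# prints: ssh -L 11111:nid001023:11111 auser@login.archer2.ac.uk
```

The printed command is run on your local machine, not on ARCHER2, and the same port number is then used in the client connection settings.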
You can then connect from your local client using the following connection settings:
Name: archer2 \nServer Type: Client/Server \nHost: localhost \nPort: 11111\n
Note
The Host from the local client should be set to \"localhost\" when using the SSH tunnel. The \"Name\" field can be set to a name of your choosing. 11111 is the default port for pvserver
.
If it has connected correctly, you should see the following:
Waiting for client...\nConnection URL: cs://nid001023:11111\nAccepting connection(s): nid001023:11111\nClient connected.\n
"},{"location":"data-tools/paraview/#using-batch-mode-pvbatch","title":"Using batch-mode (pvbatch)","text":"A pvbatch
script can be run in a standard job script. For example the following will run on a single node:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=example_paraview_job\n#SBATCH --time=0:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load paraview\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread pvbatch pvbatchscript.py\n
"},{"location":"data-tools/paraview/#compiling-paraview","title":"Compiling ParaView","text":"The latest instructions for building ParaView on ARCHER2 may be found in the GitHub repository of build instructions:
The ARCHER2 compute nodes each have a set of so-called Power Management (PM) counters. These cover point-in-time power readings for the whole node, and for the CPU and memory domains. The accumulated energy use is also recorded at the same level of detail. Further, there are two temperature counters, one for each socket/processor on the node. The counters are read ten times per second and the data written to a set of files stored within node memory (located at /sys/cray/pm_counters/
).
For convenience, we have developed an MPI-based wrapper, called pm_mpi_lib
that facilitates the reading of the PM counter files, see the link below.
https://github.com/cresta-eu/pm_mpi_lib
The PM MPI Library makes it possible to monitor the Power Management counters during the execution of an MPI code running across multiple compute nodes. The library is lightweight, containing just three functions, and is intended to be straightforward to use. You simply decide which parts of your code you wish to profile as regards energy usage and/or power consumption.
As your code executes, the PM counters will be read at various points by a single designated monitor rank on each node assigned to the job. These readings are then written to a log file, which, after the job completes, will contain one set of time-stamped readings per node for every call to the pm_mpi_record
function made from within your code. The readings can then be aggregated according to preference.
Further information along with test harnesses and example scripts can be found by reading the PM MPI Library readme file.
"},{"location":"data-tools/spack/","title":"Spack","text":"Spack is a package manager, a tool to assist with building and installing software as well as determining what dependencies are required and installing those. It was originally designed for use on HPC systems, where several variations of a given package may be installed alongside one another for different use cases -- for example different versions, built with different compilers, using MPI or hybrid MPI+OpenMP. Spack is principally written in Python but has a component written in Answer Set Programming (ASP) which is used to determine the required dependencies for a given package installation.
Users are welcome to install Spack themselves in their own directories, but we are making an experimental installation tailored for ARCHER2 available centrally. This page provides documentation on how to activate and install packages using the central installation on ARCHER2. For more in-depth information on using Spack itself please see the developers' documentation.
Important
As ARCHER2's central Spack installation is still in an experimental stage please be aware that we cannot guarantee that it will work with full functionality and we may not be able to provide support. The centrally-provided configuration and Spack-installed software may be subject to change.
"},{"location":"data-tools/spack/#activating-spack","title":"Activating Spack","text":"As it is still in an experimental stage, the Spack module is not made available to users by default. You must firstly load the other-software
module:
auser@ln01:~> module load other-software\n
Several modules with spack
in their name will become visible to you. You should load the spack
module:
auser@ln01:~> module load spack\n
This configures Spack to place its cache in, and install software to, a directory called .spack
in your base work directory, e.g. at /work/t01/t01/auser/.spack
.
At this point Spack is available to you via the spack
command. You can get started with spack help
, reading the Spack documentation, or by testing a package's installation.
At its simplest, Spack installs software with the spack install
command:
auser@ln01:~> spack install gromacs\n
This very simple gromacs
installation specification, or spec, would install GROMACS using the default options given by the Spack gromacs
package. The spec can be expanded to include which options you like. For example, the command
auser@ln01:~> spack install gromacs@2024.2%gcc+mpi\n
would use the GCC compiler to install an MPI-enabled version of GROMACS version 2024.2.
Tip
Spack needs to bootstrap the installation of some extra software in order to function, principally clingo
which is used to solve the dependencies required for an installation. The first time you ask Spack to concretise a spec into a precise set of requirements, it will take extra time as it downloads this software and extracts it into a local directory for Spack's use.
You can find information about any Spack package and the options available to use with the spack info
command:
auser@ln01:~> spack info gromacs\n
Tip
The Spack developers also provide a website at https://packages.spack.io/ where you can search for and examine packages, including all information on options, versions and dependencies.
When installing a package, Spack will determine what dependencies are required to support it. If they are not already available to Spack, either as packages that it has installed beforehand or as external dependencies, then Spack will also install those, marking them as implicitly installed, as opposed to the explicit installation of the package you requested. If you want to see the dependencies of a package before you install it, you can use spack spec
to see the full concretised set of packages:
auser@ln01:~> spack spec gromacs@2024.2%gcc+mpi\n
Tip
Spack on ARCHER2 has been configured to use as much of the HPE Cray Programming Environment as possible. For example, this means that Cray LibSci will be used to provide the BLAS, LAPACK and ScaLAPACK dependencies and Cray MPICH will provide MPI. It is also configured to allow it to re-use as dependencies any packages that the ARCHER2 CSE team has spack install
ed centrally, potentially helping to save you build time and storage quota.
Spack provides a module-like way of making software that you have installed available to use. If you have a GROMACS installation, you can make it available to use with spack load
:
auser@ln01:~> spack load gromacs\n
At this point you should be able to use the software as normal. You can then remove it once again from the environment with spack unload
:
auser@ln01:~> spack unload gromacs\n
If you have multiple variants of the same package installed, you can use the spec to distinguish between them. You can always check what packages have been installed using the spack find
command. If no other arguments are given it will simply list all installed packages, or you can give a package name to narrow it down:
auser@ln01:~> spack find gromacs\n
You can see your packages' install locations using spack find --paths
or spack find -p
.
In any Spack command that requires as an argument a reference to an installed package, you can provide a hash reference to it rather than its spec. You can see the first part of the hash by running spack find -l
, or the full hash with spack find -L
. Then use the hash in a command by prefixing it with a forward slash, e.g. wjy5dus
becomes /wjy5dus
.
If you have two packages installed which appear identical in spack find
apart from their hash, you can differentiate them with spack diff
:
auser@ln01:~> spack diff /wjy5dus /bleelvs\n
You can uninstall your packages with spack uninstall
:
auser@ln01:~> spack uninstall gromacs@2024.2\n
and of course, to be absolutely certain that you are uninstalling the correct package, you can provide the hash:
auser@ln01:~> spack uninstall /wjy5dus\n
Uninstalling a package will leave behind any implicitly installed packages that were installed to support it. Spack may have also installed build-time dependencies that aren't actually needed any more -- these are often packages like autoconf
, cmake
and m4
. You can run the garbage collection command to uninstall any build dependencies and implicit dependencies that are no longer required:
auser@ln01:~> spack gc\n
If you commonly use a set of Spack packages together you may want to consider using a Spack environment to assist you in their installation and management. Please see the Spack documentation for more information.
"},{"location":"data-tools/spack/#custom-configuration","title":"Custom configuration","text":"Spack is configured using YAML files. The central installation on ARCHER2 made available to users is configured to use the HPE Cray Programming Environment and to allow you to start installing software to your /work
directories right away, but if you wish to make any changes you can provide your own overriding userspace configuration.
Your own configuration should fit in the user level scope. On ARCHER2 Spack is configured to, by default, place and look for your configuration files in your work directory at e.g. /work/t01/t01/auser/.spack
. You can however override this to have Spack use any directory you choose by setting the SPACK_USER_CONFIG_PATH
environment variable, for example:
auser@ln01:~> export SPACK_USER_CONFIG_PATH=/work/t01/t01/auser/spack-config\n
Of course this will need to be a directory where you have write permissions, such as in your home or work directories, or in one of your project's shared
directories.
You can edit the configuration files directly in a text editor or by running, for example:
auser@ln01:~> spack config edit repos\n
which would open your repos.yaml
in vim
.
Tip
If you would rather not use vim
, you can change which editor is used by Spack by setting the SPACK_EDITOR
environment variable.
The final configuration used by Spack is a compound of several scopes, from the Spack defaults which are overridden by the ARCHER2 system configuration files, which can then be overridden in turn by your own configurations. You can see what options are in use at any point by running, for example:
auser@ln01:~> spack config get config\n
which goes through any and all config.yaml
files known to Spack and sets the options according to those files' level of precedence. You can also get more information on which files are responsible for which lines in the final active configuration by running, for example to check packages.yaml
:
auser@ln01:~> spack config blame packages\n
Unless you have already written a packages.yaml
of your own, this will show a mix of options originating from the Spack defaults and also from an archer2-user
directory which is where we have told Spack how to use packages from the HPE Cray Programming Environment.
If there is some behaviour in Spack that you want to change, looking at the output of spack config get
and spack config blame
may help to show what you would need to do. You can then write your own user scope configuration file to set the behaviour you want, which will override the option as set by the lower-level scopes.
Please see the Spack documentation to find out more about writing configuration files.
"},{"location":"data-tools/spack/#writing-new-packages","title":"Writing new packages","text":"A Spack package is at its core a Python package.py
file which provides instructions to Spack on how to obtain source code and compile it. A very simple package will allow it to build just one version with one compiler and one set of options. A more fully-featured package will list more versions and include logic to build them with different compilers and different options, and to also pick its dependencies correctly according to what is chosen.
Spack provides several thousand packages in its builtin
repository. You may be able to use these with no issues on ARCHER2 by simply running spack install
as described above, but if you do run into problems in the interaction between Spack and the CPE compilers and libraries then you may wish to write your own. Where the ARCHER2 CSE service has encountered problems with packages we have provided our own in a repository located at $SPACK_ROOT/var/spack/repos/archer2
.
A package repository is a directory containing a repo.yaml
configuration file and another directory called packages
. Directories within the latter are named for the package they provide, for example cp2k
, and contain in turn a package.py
. You can create a repository from scratch with the command
auser@ln01:~> spack repo create dirname\n
where dirname
is the name of the directory holding the repository. This command will create the directory in your current working directory, but you can choose to instead provide a path to its location. You can then make the new repository available to Spack by running:
auser@ln01:~> spack repo add dirname\n
This adds the path to dirname
to the repos.yaml
file in your user scope configuration directory as described above. If your repos.yaml
doesn't yet exist, it will be created.
A Spack repository can similarly be removed from the config using:
auser@ln01:~> spack repo rm dirname\n
"},{"location":"data-tools/spack/#namespaces-and-repository-priority","title":"Namespaces and repository priority","text":"A package can exist in several repositories. For example, the Quantum Espresso package is provided by both the builtin
repository provided with Spack and also by the archer2
repository; the latter has been patched to work on ARCHER2.
To distinguish between these packages, each repository's packages exist within that repository's namespace. By default the namespace is the same as the name of the directory it was created in, but Spack does allow it to be different. Both builtin
and archer2
use a namespace identical to their directory name.
Tip
If you want your repository namespace to be different from the name of the directory, you can change it either by editing the repository's repo.yaml
or by providing an extra argument to spack repo create
:
auser@ln01:~> spack repo create dirname namespace\n
Running spack find -N
will return the list of installed packages with their namespace. You'll see that they are then prefixed with the repository namespace, for example builtin.bison@3.8.2
and archer2.quantum-espresso@7.2
. In order to avoid ambiguity when managing package installation you can always prefix a spec with a repository namespace.
If you don't include the repository in a spec, Spack will search in order all the repositories it has been configured to use until it finds a matching package, which it will then use. The earlier in the list of repositories, the higher the priority. You can check this with:
auser@ln01:~> spack repo list\n
If you run this without having added any repositories of your own, you will see that the two available repositories are archer2
and builtin
, in this order. This means that archer2
has higher priority. Because of this, running spack install quantum-espresso
would install archer2.quantum-espresso
, but you could still choose to install from the other repository with spack install builtin.quantum-espresso
.
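The search order described above can be sketched in a few lines of Python. This is an illustrative model only and not Spack's internal code: the resolve function is hypothetical, and the package sets are made up for the example, apart from the quantum-espresso and bison entries named above.

```python
# Illustrative model only: this is not Spack's internal code.
# Repositories are searched in order; the first one providing the package wins.
# The package sets below are hypothetical and exist only for this example.
REPOS = [
    ('archer2', {'quantum-espresso', 'cp2k'}),
    ('builtin', {'quantum-espresso', 'bison', 'gromacs'}),
]


def resolve(spec, repos=REPOS):
    '''Return the namespace-qualified package name for a spec name.'''
    if '.' in spec:
        # An explicit namespace prefix bypasses the priority search.
        namespace, name = spec.split('.', 1)
        for repo_namespace, packages in repos:
            if repo_namespace == namespace and name in packages:
                return spec
        raise LookupError('no package ' + name + ' in repository ' + namespace)
    for repo_namespace, packages in repos:
        if spec in packages:
            return repo_namespace + '.' + spec
    raise LookupError('no repository provides ' + spec)


print(resolve('quantum-espresso'))          # archer2.quantum-espresso
print(resolve('builtin.quantum-espresso'))  # builtin.quantum-espresso
print(resolve('bison'))                     # builtin.bison
```

Adding a repository of your own ahead of archer2 in the list would, in this model, make its packages take precedence whenever no namespace is given.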
Once you have a repository of your own in place, you can create new packages to store within it. Spack has a spack create
command which will do the initial setup and create a boilerplate package.py
. To create an empty package called packagename
you would run:
auser@ln01:~> spack create --name packagename\n
However, it will very often be more efficient if you instead provide a download URL for your software as the argument. For example, the Code_Saturne 8.0.3 source is obtained from https://www.code-saturne.org/releases/code_saturne-8.0.3.tar.gz
, so you can run:
auser@ln01:~> spack create https://www.code-saturne.org/releases/code_saturne-8.0.3.tar.gz\n
Spack will determine from this the package name and the download URLs for all versions X.Y.Z matching the https://www.code-saturne.org/releases/code_saturne-X.Y.Z.tar.gz
pattern. It will then ask you interactively which of these you want to use. Finally, it will download the .tar.gz
archives for those versions and calculate their checksums, then place all this information in the initial version of the package for you. This takes away a lot of the initial work!
At this point you can get to work on the package. You can edit an existing package by running
auser@ln01:~> spack edit packagename\n
or by directly opening packagename/package.py
within the repository with a text editor.
The boilerplate code will note several sections for you to fill out. If you did provide a source code download URL, you'll also see listed the versions you chose and their checksums.
A package is implemented as a Python class. You'll see that by default it will inherit from the AutotoolsPackage
class which defines how a package following the common configure
> make
> make install
process should be built. You can change this to another build system, for example CMakePackage
. If you want, you can have the class inherit from several different types of build system classes and choose between them at install time.
Options must be provided to the build. For an AutotoolsPackage
package, you can write a configure_args
method which very simply returns a list of the command line arguments you would give to configure
if you were building the code yourself. There is an equivalent cmake_args
method for CMakePackage
packages.
Finally, you will need to provide your package's dependencies. In the main body of your package class you should add calls to the depends_on()
function. For example, if your package needs MPI, add depends_on(\"mpi\")
. As the argument to the function is a full Spack spec, you can provide any necessary versioning or options, so, for example, if you need PETSc 3.18.0 or newer with Fortran support, you can call depends_on(\"petsc+fortran@3.18.0:\")
.
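The trailing colon in a spec such as @3.18.0: denotes an open-ended version range. As a rough illustration of what 3.18.0 or newer means, the sketch below compares versions component by component. It is a simplified model with hypothetical helper names, handles only purely numeric three-part versions, and is not how Spack itself parses or compares versions.

```python
# Simplified model of an open-ended version range; not Spack's own logic.
# Both helper names are hypothetical.
def parse_version(text):
    '''Turn a dotted version such as 3.18.0 into a tuple of integers.'''
    return tuple(int(part) for part in text.split('.'))


def satisfies_open_range(version, lower_bound):
    '''Return True if version is at or above lower_bound.'''
    return parse_version(version) >= parse_version(lower_bound)


for candidate in ('3.9.4', '3.18.0', '3.20.1'):
    print(candidate, satisfies_open_range(candidate, '3.18.0'))
```

Comparing integer tuples rather than strings matters here: as text, 3.9.4 would sort after 3.18.0, whereas numerically it is the older release.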
If you know that you will only ever want to build a package one way, then providing the build options and dependencies should be all that you need to do. However, if you want to allow for different options as part of the install spec, patch the source code or perform post-install fixes, or take more manual control of the build process, it can become much more complex. Thankfully the Spack developers have provided excellent documentation covering the whole process, and there are many existing packages you can look at to see how it's done.
"},{"location":"data-tools/spack/#tips-when-writing-packages-for-archer2","title":"Tips when writing packages for ARCHER2","text":"Here are some useful pointers when writing packages for use with the HPE Cray Programming Environment on ARCHER2.
"},{"location":"data-tools/spack/#cray-compiler-wrappers","title":"Cray compiler wrappers","text":"An important point of note is that Spack does not use the Cray compiler wrappers cc
, CC
and ftn
when compiling code. Instead, it uses the underlying compilers themselves. Remember that the wrappers automate the use of Cray LibSci, Cray FFTW, Cray HDF5 and Cray NetCDF. Without this being done for you, you may need to take extra care to ensure that the options needed to use those libraries are correctly set.
Cray LibSci provides optimised implementations of BLAS, BLACS, LAPACK and ScaLAPACK on ARCHER2. These are bundled together into single libraries named for variants on libsci_cray.so
. Although Spack itself knows about LibSci, many applications don't and it can sometimes be tricky to get them to use these libraries when they are instead looking for libblas.so
and the like.
The configure
or cmake
or equivalent step for your software will hopefully allow you to manually point it to the correct library. For example, Code_Saturne's configure
can take the options --with-blas-lib
and --with-blas-libs
which respectively tell it the location to search and the libraries to use in order to build against BLAS.
Spack can provide the correct BLAS library search and link flags to be passed on to configure
via self.spec[\"blas\"].libs
, a LibraryList
object. So, the Code_Saturne package uses the following configure_args()
method:
def configure_args(self):\n blas = self.spec[\"blas\"].libs\n args = [\"--with-blas-lib={0}\".format(blas.search_flags),\n \"--with-blas-libs={0}\".format(blas.link_flags)]\n return args\n
Here the blas.search_flags
attribute is resolved to a -L
library search flag using the path to the correct LibSci directory, taking into account whether the libraries for the Cray, GCC or AOCC compilers should be used. blas.link_flags
similarly gives a -l
flag for the correct LibSci library. Depending on what you need, the LibraryList
has other attributes which can help you pass the options needed to get configure
to find and use the correct library.
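To make the shape of these flags concrete, the following standalone sketch mimics what search_flags and link_flags produce for a list of library paths. It does not use Spack at all: both functions are hypothetical stand-ins for the LibraryList attributes, and the library path is invented for the example.

```python
import posixpath


def search_flags(libraries):
    '''Build -L flags from the unique directories holding the libraries.'''
    directories = []
    for path in libraries:
        directory = posixpath.dirname(path)
        if directory not in directories:
            directories.append(directory)
    return ' '.join('-L' + d for d in directories)


def link_flags(libraries):
    '''Build -l flags from the library file names (libfoo.so gives -lfoo).'''
    names = []
    for path in libraries:
        stem = posixpath.basename(path).split('.')[0]
        names.append(stem[3:] if stem.startswith('lib') else stem)
    return ' '.join('-l' + n for n in names)


# Hypothetical install location, used only for illustration.
libs = ['/hypothetical/libsci/lib/libsci_cray.so']
args = ['--with-blas-lib={0}'.format(search_flags(libs)),
        '--with-blas-libs={0}'.format(link_flags(libs))]
print(args)
```

With the invented path above, the two arguments come out as --with-blas-lib=-L/hypothetical/libsci/lib and --with-blas-libs=-lsci_cray, which is the form configure expects.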
If you develop a package for use on ARCHER2 please do consider opening a pull request to the GitHub repository.
"},{"location":"data-tools/visidata/","title":"VisiData","text":"VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.
"},{"location":"data-tools/visidata/#useful-links","title":"Useful links","text":"You can access VisiData on ARCHER2 by loading the visidata
module:
module load visidata\n
Once the module has been loaded, VisiData is available via the vd
command.
VisiData can also be used in scripts by saving a command log and replaying it. See the VisiData documentation on saving and restoring VisiData sessions.
"},{"location":"data-tools/vmd/","title":"VMD","text":"VMD is a visualisation program for displaying, animating, and analysing molecular systems using 3D graphics and built-in Tcl/Tk scripting.
"},{"location":"data-tools/vmd/#useful-links","title":"Useful links","text":"VMD is available through the vmd
module.
module load vmd\n
Once the module has been loaded, the VMD executables, tools, and libraries will be made available.
Without anything else, this allows you to run VMD in \"text-only\" mode with:
vmd -dispdev text\n
If you want to launch VMD with a GUI, see the requirements in the next section.
"},{"location":"data-tools/vmd/#launching-vmd-with-a-gui","title":"Launching VMD with a GUI","text":"To be able to launch VMD with its graphical interface, your local machine needs to support X11 (the X Window System). Most Linux and *NIX systems support this by default. If you're using Windows (through WSL, for example), you will need an X11 display server; we recommend XMing. For macOS, we recommend XQuartz, but please be aware that some extra configuration is needed, as described in the section on using VMD from macOS.
To launch VMD with a GUI, once you have a running X11 display server on your local machine, you'll need to connect to ARCHER2 with X11 forwarding enabled; please follow the instructions in the logging in section. Once you're connected to ARCHER2, load the VMD module with:
module load vmd\n
and launch VMD with:
vmd\n
"},{"location":"data-tools/vmd/#using-vmd-from-macos","title":"Using VMD from macOS","text":"If you're using macOS and XQuartz, before you're able to launch VMD with a GUI, you will need to change the XQuartz configuration. On a local terminal (that is, not connected to ARCHER2), run the following command:
defaults write org.xquartz.X11 enable_iglx -bool true\n
then, restart XQuartz. You will now be able to launch VMD's GUI without a segmentation fault
.
The latest instructions for building VMD on ARCHER2 may be found in the GitHub repository of build instructions.
"},{"location":"essentials/","title":"Essential Skills","text":"This section provides information and links on essential skills required to use ARCHER2 efficiently: e.g. using Linux command line, accessing help and documentation.
"},{"location":"essentials/#terminal","title":"Terminal","text":"In order to access HPC machines such as ARCHER2 you will need to use a Linux command line terminal window.
Options for Linux, macOS and Windows are described under our Connecting to ARCHER2 guide.
"},{"location":"essentials/#linux-command-line","title":"Linux Command Line","text":"A guide to using the Unix Shell for complete novices
For those already familiar with the basics there is also a lesson on shell extras
"},{"location":"essentials/#basic-slurm-commands","title":"Basic Slurm commands","text":"Slurm is the scheduler used on ARCHER2 and we provide a guide to using the basic Slurm commands including how to find out:
The following text editors are available on ARCHER2
Name Description Examples emacs A widely used editor with a focus on extensibility. emacs -nw sharpen.pbs
CTRL+X CTRL+C
quits CTRL+X CTRL+S
saves nano A small, free editor with a focus on user friendliness. nano sharpen.pbs
CTRL+X
quits CTRL+O
saves vi A mode based editor with a focus on aiding code development. vi cfd.f90
:q
in command mode quits :q!
in command mode quits without saving :w
in command mode saves i
in command mode switches to insert mode ESC
in insert mode switches to command mode If you are using MobaXterm on Windows you can use the inbuilt MobaTextEditor text file editor.
You can edit on your local machine using your preferred text editor, and then upload the file to ARCHER2. Make sure you save the file using Linux line endings. Notepad, for example, supports Unix/Linux line endings (LF), Macintosh line endings (CR), and Windows line endings (CRLF).
"},{"location":"essentials/#quick-reference-sheet","title":"Quick Reference Sheet","text":"We have produced this Quick Reference Sheet which you may find useful.
"},{"location":"faq/","title":"ARCHER2 Frequently Asked Questions","text":"This section documents some of the questions raised to the Service Desk on ARCHER2, and the advice and solutions.
"},{"location":"faq/#user-accounts","title":"User accounts","text":""},{"location":"faq/#username-already-in-use","title":"Username already in use","text":"Q. I created a machine account on ARCHER2 for a training course, but now I want to use that machine username for my main ARCHER2 project, and the system will not let me, saying \"that name is already in use\". How can I re-use that username?
A. Send an email to the service desk, letting us know the username and project that you set up previously, and asking for that account and any associated data to be deleted. Once deleted, you can then re-use that username to request an account in your main ARCHER2 project.
"},{"location":"faq/#data","title":"Data","text":""},{"location":"faq/#undeleteable-file-nfsxxxxxxxxxxx","title":"Undeleteable file .nfsXXXXXXXXXXX","text":"Q. I have a file called .nfsXXXXXXXXXXX (where XXXXXXXXXXX is a long hexadecimal string) in my /home folder but I can't delete it.
A. This file will have been created during a file copy which failed. Trying to delete it will give an error \"Device or resource busy\", even though the copy has ended and no active task is locking it.
echo -n >.nfsXXXXXXXXXXX
will remove it.
"},{"location":"faq/#running-on-archer2","title":"Running on ARCHER2","text":""},{"location":"faq/#oom-error-on-archer2","title":"OOM error on ARCHER2","text":"Q. Why is my code failing on ARCHER2 with an out of memory (OOM) error?
A. You are requesting too much memory per process. We recommend that you try running the same job on underpopulated nodes. This can be done by reducing the value of --ntasks-per-node
in your Slurm submission script. Please lower it to half of its value when it fails (so if you have --ntasks-per-node=128
, reduce it to --ntasks-per-node=64
).
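The effect of underpopulating nodes is simple arithmetic: halving the number of tasks per node doubles the memory available to each task. The sketch below assumes a node with 256 GiB of memory, a figure that is not stated in this answer and should be checked against the hardware documentation for the nodes you are using; the function name is likewise just for illustration.

```python
# Assumed node memory in GiB; check the hardware documentation for your nodes.
NODE_MEMORY_GIB = 256


def memory_per_task_gib(node_memory_gib, ntasks_per_node):
    '''Memory share of each task when tasks split a node's memory evenly.'''
    return node_memory_gib / ntasks_per_node


for ntasks in (128, 64, 32):
    print(ntasks, memory_per_task_gib(NODE_MEMORY_GIB, ntasks))
```

Under that assumption, 128 tasks per node leaves 2 GiB per task, 64 leaves 4 GiB and 32 leaves 8 GiB, before allowing for memory used by the operating system.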
Q. How can I check which budget code(s) I can use?
A. You can check in SAFE by selecting Login accounts
from the menu and then selecting the login account you want to query.
Under Login account details
you will see each of the budget codes you have access to listed e.g. e123 resources
and then under Resource Pool to the right of this, a note of the remaining budget.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=user,Account%12,MaxTRESMins,QOS%40\n
This will list all the budget codes that you have access to (but not the amount of budget available) e.g.
User Account MaxTRESMins QOS\n-------- ------------ ------------ -----------------------------------\n userx e123-test largescale,long,short,standard\n userx e123 cpu=0 largescale,long,short,standard\n
This shows that userx
is a member of budgets e123-test
and e123
. However, the cpu=0
indicates that the e123
budget is empty or disabled. This user can submit jobs using the e123-test
budget.
You can only check the amount of available budget via SAFE - see above.
"},{"location":"faq/#estimated-start-time-of-queued-jobs","title":"Estimated start time of queued jobs","text":"Q. I\u2019ve checked the estimated start time for my queued jobs using \u201csqueue -u $USER --start\u201d. Why does the estimated start time keep changing?
A. ARCHER2 uses the Slurm scheduler to queue jobs for the compute nodes. Slurm attempts to find a better schedule as jobs complete and new jobs are added to the queue. This helps to maximise the use of resources by minimising the number of idle compute nodes, in turn reducing your wait time in the queue.
However, if you periodically check the estimated start time of your queued jobs, you may notice that the estimate changes or even disappears. This is because Slurm only assigns the top entries in the queue with an estimated start time. As the schedule changes, your jobs could move in and out of this top region and thus gain or lose an estimated start time.
"},{"location":"faq/upgrade-2023/","title":"ARCHER2 Upgrade: 2023","text":"During the first half of 2023 ARCHER2 went through a major software upgrade.
On this page we describe the background to the changes, the impact the changes have had for users, any action you should expect to take following the upgrade, and the versions of updated software.
If you have any questions or concerns, please contact the ARCHER2 Service Desk.
"},{"location":"faq/upgrade-2023/#why-did-the-upgrade-happen","title":"Why did the upgrade happen?","text":"There are a number of reasons why ARCHER2 needed to go through this major software upgrade. All of these reasons are related to the fact that the previous system software setup was out of date; due to this, maintenance of the service was very difficult and updating software within the current framework was not possible. Some specific issues were:
This major software upgrade involved a complete re-install of system software followed by a reinstatement of local configurations (e.g. Slurm, authentication services, SAFE integration). Unfortunately, this major work required a long period of downtime but this was planned with all service partners to minimise the outage and give as much notice to users as possible so that they could plan accordingly.
The outage dates were:
The allocation periods (where appropriate) were extended for the outage period. The changes were in place when the service was returned.
After the upgrade process there are a number of changes that may require action from users.
"},{"location":"faq/upgrade-2023/#updated-login-node-host-keys","title":"Updated login node host keys","text":"If you previously logged into the ARCHER2 system before the upgrade you may see an error from SSH that looks like:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts file (in the example above the offending line is line #11).
The current login node host keys are always documented in the User Guide.
"},{"location":"faq/upgrade-2023/#recompile-and-test-software","title":"Recompile and test software","text":"As the new system is based on a new OS version and new versions of compilers and libraries we strongly recommend that all users recompile and test all software on the service. The ARCHER2 CSE service recompiled all centrally installed software.
"},{"location":"faq/upgrade-2023/#no-python-2-installation","title":"No Python 2 installation","text":"There is no Python 2 installation available as part of supported software following the upgrade. Python 3 continues to be fully-supported.
"},{"location":"faq/upgrade-2023/#impact-on-data-on-the-service","title":"Impact on data on the service","text":"srun
","text":"Change in Slurm behaviour. The setting from the --cpus-per-task
option to sbatch/salloc is no longer propagated by default to srun
commands in the job script.
This can lead to very poor performance due to oversubscription of cores with processes/threads if job submission scripts are not updated. The simplest workaround is to add the command:
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n
before any srun commands in the script. You can also explicitly use the --cpus-per-task
option to srun if you prefer.
This change only affects users who use a placement scheme where placement of processes on sockets is cyclic (e.g. --distribution=block:cyclic
). The Slurm definition of a \u201csocket\u201d has changed. The previous setting on ARCHER2 was that a socket = 16 cores (all share a DRAM memory controller). On the updated ARCHER2, a socket = 4 cores (corresponding to a CCX - Core CompleX). Each CCX shares 16 MB L3 Cache.
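To see why the socket size matters for cyclic placement, the sketch below models where the first few tasks on a node land under each definition. It is a deliberately simplified model and not Slurm's actual placement algorithm: it assumes 128 cores per node, one core per task, sockets visited in index order, and each socket handing out its own cores in order. The function name is hypothetical.

```python
CORES_PER_NODE = 128  # assumption: a fully populated node runs 128 tasks


def cyclic_placement(ntasks, cores_per_socket):
    '''Return (socket, core) for each task under round-robin socket placement.

    Simplified model: one core per task, sockets visited in index order and
    each socket handing out its own cores in order.
    '''
    nsockets = CORES_PER_NODE // cores_per_socket
    placement = []
    for task in range(ntasks):
        socket = task % nsockets
        core = socket * cores_per_socket + task // nsockets
        placement.append((socket, core))
    return placement


old = cyclic_placement(10, 16)  # previous definition: 8 sockets of 16 cores
new = cyclic_placement(10, 4)   # updated definition: 32 sockets of 4 cores
print([core for _, core in old])
print([core for _, core in new])
```

In this model the previous definition spreads consecutive tasks 16 cores apart and wraps round after 8 tasks, whereas the updated definition spreads them 4 cores apart and only wraps after 32 tasks, so the same option can produce a noticeably different layout.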
The paths you need to bind and the LD_LIBRARY_PATH
settings required to use Cray MPICH with MPI in Singularity containers have changed. The updated settings are documented in the Containers section of the User and Best Practice Guide. This also includes updated information on building containers with MPI to use on ARCHER2.
The AMD \u03bcProf tool is not available on the upgraded system yet. We are working to get this fixed as soon as possible.
"},{"location":"faq/upgrade-2023/#what-software-versions-will-be-available-after-the-upgrade","title":"What software versions will be available after the upgrade?","text":"System software:
Compilers:
Communication libraries:
Numerical libraries:
IO Libraries:
Tools:
For full information, see CPE 22.12 Release Notes
CCE 15
C++ applications built using CCE 13 or earlier should be recompiled due to the significant changes that were necessary to implement C++17. This is expected to be a one-time requirement.
Some non-standard Cray Fortran extensions supporting shorthand notation for logical operations will be removed in a future release. CCE 15 will issue warning messages when these are encountered, providing time to adapt the application to use standard Fortran.
HPE Cray MPICH 8.1.23
Cray MPICH 8.1.23 can support only ~2040 simultaneous MPI communicators.
"},{"location":"faq/upgrade-2023/#cse-supported-software","title":"CSE supported software","text":"Default version in italics
Software Versions CASTEP 22.11, 23.11 Code_Saturne 7.0.1 ChemShell/PyChemShell 3.7.1/21.0.3 CP2K 2023.1 FHI-aims 221103 GROMACS 2022.4 LAMMPS 17_FEB_2023 NAMD 2.14 Nektar++ 5.2.0 NWChem 7.0.2 ONETEP 6.9.1.0 OpenFOAM v10.20230119 (.org), v2212 (.com) Quantum Espresso 6.8, 7.1 VASP 5.4.4.pl2, 6.3.2, 6.4.1-vtst, 6.4.1 Software Versions AOCL 3.1, 4.0 Boost 1.81.0 GSL 2.7 HYPRE 2.18.0, 2.25.0 METIS/ParMETIS 5.1.0/4.0.3 MUMPS 5.3.5, 5.5.1 PETSc 13.14.2, 13.18.5 PT/Scotch 6.1.0, 07.0.3 SLEPC 13.14.1, 13.18.3 SuperLU/SuperLU_Dist 5.2.2 / 6.4.0, 8.1.2 Trilinos 12.18.1"},{"location":"known-issues/","title":"ARCHER2 Known Issues","text":"This section highlights known issues on ARCHER2, their potential impacts and any known workarounds. Many of these issues are under active investigation by HPE Cray and the wider service.
Info
This page was last reviewed on 9 November 2023
"},{"location":"known-issues/#open-issues","title":"Open Issues","text":""},{"location":"known-issues/#atp-module-tries-to-write-to-home-from-compute-nodes-added-2024-04-29","title":"ATP Module tries to write to /home from compute nodes (Added: 2024-04-29)","text":"The ATP Module tries to execute a mkdir
command in the /home
filesystem. When running the ATP module on the compute nodes, this will lead to an error, as the compute nodes cannot access the /home
filesystem.
To circumvent the error, add the line:
export HOME=${HOME/home/work}\n
in the Slurm script, so that the ATP module will write to /work
instead.
For situations where users are close to user or project quotas on work (Lustre) file systems we have seen cases of the following behaviour:
If you see these symptoms: slower than expected performance, data corruption; then you should check if you are close to your storage quota (either user or project quota). If you are, you may be experiencing this issue. Either remove data to free up space or request more storage quota.
"},{"location":"known-issues/#e-mail-alerts-from-slurm-do-not-work-added-2023-11-09","title":"e-mail alerts from Slurm do not work (Added: 2023-11-09)","text":"Email alerts from Slurm (--mail-type
and --mail-user
options) do not produce emails to users. We are investigating with University of Edinburgh Information Services to enable this Slurm feature in the future.
We have seen cases when using the (non-default) UCX communications protocol where the peak in memory use is much higher than would be expected. This leads to jobs failing unexpectedly with an OOM (Out Of Memory) error. The workaround is to use the Open Fabrics (OFI) communication protocol instead. OFI is the default protocol on ARCHER2 and so does not usually need to be explicitly loaded; but if you have UCX loaded, you can switch to OFI by adding the following lines to your submission script before you run your application:
module load craype-network-ofi\nmodule load cray-mpich\n
It can be very useful to track the memory usage of your job as it runs, for example to see whether there is high usage on all nodes, or a single node, if usage increases gradually or rapidly etc.
Here are instructions on how to do this using a couple of small scripts.
"},{"location":"known-issues/#slurm-cpu-freqx-option-is-not-respected-when-used-with-sbatch-added-2023-01-18","title":"Slurm --cpu-freq=X
option is not respected when used with sbatch
(Added: 2023-01-18)","text":"If you specify the CPU frequency using the --cpu-freq
option with the sbatch
command (either using the script #SBATCH --cpu-freq=X
method or the --cpu-freq=X
option directly) then this option will not be respected as the default setting for ARCHER2 (2.0 GHz) will override the option. You should specify the --cpu-freq
option to srun
directly instead within the job submission script. i.e.:
srun --cpu-freq=2250000 ...\n
You can find more information on setting the CPU frequency in the User Guide.
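The value passed to --cpu-freq in the example above is in kHz (2250000 kHz is 2.25 GHz). The following sketch converts a GHz figure to that form and prints the resulting srun line rather than running it; the helper name ghz_to_khz is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: convert a frequency in GHz to the kHz value
# used with --cpu-freq (e.g. 2.25 GHz -> 2250000).
ghz_to_khz() {
    awk -v ghz="$1" 'BEGIN { printf "%.0f\n", ghz * 1000000 }'
}

freq=$(ghz_to_khz 2.25)
# Print the command rather than run it, so the sketch works anywhere
echo "srun --cpu-freq=${freq} ..."
```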
"},{"location":"known-issues/#research-software","title":"Research Software","text":"There are several outstanding issues for the centrally installed Research Software:
Users should also check individual software pages for known limitations and caveats on the use of software on the Cray EX platform and Cray Linux Environment.
"},{"location":"known-issues/#issues-with-rpath-for-non-default-library-versions","title":"Issues with RPATH for non-default library versions","text":"When you compile applications against non-default versions of libraries within the HPE Cray software stack and use the environment variable CRAY_ADD_RPATH=yes
to try to encode the paths to these libraries within the binary, this will not be respected at runtime and the binaries will use the default versions instead.
The workaround for this issue is to ensure that you set:
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
at both compile time and runtime. For more details on using non-default versions of libraries, see the description in the User and Best Practice Guide.
"},{"location":"known-issues/#mpi-ucx-error-ivb_reg_mr","title":"MPI UCX ERROR: ibv_reg_mr
","text":"If you are using the UCX layer for MPI communication you may see an error such as:
[1613401128.440695] [nid001192:11838:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0xabcf12c0, length=26400, access=0xf) failed: Cannot allocate memory\n[1613401128.440768] [nid001192:11838:0] ucp_mm.c:137 UCX ERROR failed to register address 0xabcf12c0 mem_type bit 0x1 length 26400 on md[4]=mlx5_0: Input/output error (md reg_mem_types 0x15)\n[1613401128.440773] [nid001192:11838:0] ucp_request.c:269 UCX ERROR failed to register user buffer datatype 0x8 address 0xabcf12c0 len 26400: Input/output error\nMPICH ERROR [Rank 1534] [job id 114930.0] [Mon Feb 15 14:58:48 2021] [unknown] [nid001192] - Abort(672797967) (rank 1534 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:\nPMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed\nMPID_Isend(416)......:\nMPID_isend_unsafe(92):\nMPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)\naborting job:\nFatal error in PMPI_Isend: Other MPI error, error stack:\nPMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed\nMPID_Isend(416)......:\nMPID_isend_unsafe(92):\nMPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)\n[1613401128.457254] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e09 apid 200002e3e is not released, refcount 1\n[1613401128.457261] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e08 apid 100002e3e is not released, refcount 1\n
You can add the following line to your job submission script before the srun
command to try to work around this error:
export UCX_IB_REG_METHODS=direct\n
Note
Setting this flag may have an impact on code performance.
"},{"location":"known-issues/#aocc-compiler-fails-to-compile-with-netcdf-added-2021-11-18","title":"AOCC compiler fails to compile with NetCDF (Added: 2021-11-18)","text":"There is currently a problem with the module file which means cray-netcdf-hdf5parallel will not operate correctly in PrgEnv-aocc. An example of the error seen is:
F90-F-0004-Corrupt or Old Module file /opt/cray/pe/netcdf-hdf5parallel/4.7.4.3/crayclang/9.1/include/netcdf.mod (netcdf.F90: 8)\n
The current workaround for this is to load the epcc-netcdf-hdf5parallel module instead if PrgEnv-aocc is required.
"},{"location":"known-issues/#slurm-export-option-does-not-work-in-job-submission-script","title":"Slurm --export
option does not work in job submission script","text":"The option --export=ALL
propagates all the environment variables from the login node to the compute node. If you include the option in the job submission script, it is wrongly ignored by Slurm. The current workaround is to include the option when the job submission script is launched. For instance:
sbatch --export=ALL myjob.slurm\n
"},{"location":"known-issues/#recently-resolved-issues","title":"Recently Resolved Issues","text":""},{"location":"other-software/","title":"Software provided by external parties","text":"This section describes software that has been installed on ARCHER2 by external parties (i.e. not by the ARCHER2 service itself) for general use by ARCHER2 users or provides useful notes on software that is not installed centrally.
Important
While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"other-software/#research-software","title":"Research Software","text":"This page has moved
"},{"location":"other-software/cesm-further-examples/","title":"Cesm further examples","text":"This page has moved
"},{"location":"other-software/cesm213/","title":"Cesm213","text":"This page has moved
"},{"location":"other-software/cesm213_run/","title":"Cesm213 run","text":"This page has moved
"},{"location":"other-software/cesm213_setup/","title":"Cesm213 setup","text":"This page has moved
"},{"location":"other-software/crystal/","title":"Crystal","text":"This page has moved
"},{"location":"publish/","title":"ARCHER2 and publications","text":"This section provides information on how to acknowledge the use of ARCHER2 in your published work and how to register your work on ARCHER2 into the ARCHER2 publications database via SAFE.
"},{"location":"publish/#acknowledging-archer2","title":"Acknowledging ARCHER2","text":"We will shortly be publishing a description of the ARCHER2 service with a DOI that you can cite in your published work that arises from the use of ARCHER2. Until that time, please add the following words to any work you publish that arises from your use of ARCHER2:
This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk).
You should also tag outputs with the keyword \"ARCHER2\" whenever possible.
"},{"location":"publish/#archer2-publication-database","title":"ARCHER2 publication database","text":"The ARCHER2 service maintains a publication database of works that have arisen from ARCHER2 and links them to project IDs that have ARCHER2 access. We ask all users of ARCHER2 to register any publications in the database - all you need is your publication's DOI.
Registering your publications in SAFE has a number of advantages:
You will need a DOI for the publication you wish to register. A DOI has the form of a set of ID strings separated by slashes, for example 10.7488/ds/1505. You should not include the web host address which provides a link to the DOI.
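If you only have the DOI as a web link, a small sketch like the following strips the host address and leaves the bare DOI. The function name bare_doi is hypothetical, and the sketch relies on DOIs starting with the prefix 10.

```shell
#!/bin/bash
# Hypothetical helper: strip a leading resolver address such as
# https://doi.org/ so that only the bare DOI remains.
bare_doi() {
    local doi="$1"
    # Remove the URL scheme if present
    doi="${doi#http://}"
    doi="${doi#https://}"
    case "$doi" in
        10.*)
            # Already a bare DOI, nothing to do
            ;;
        */10.*)
            # Remove the host part up to and including the first slash
            doi="${doi#*/}"
            ;;
    esac
    echo "$doi"
}

bare_doi "https://doi.org/10.7488/ds/1505"   # prints 10.7488/ds/1505
```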
Login to SAFE. Then:
At the moment we support exporting lists of DOIs to comma-separated values (CSV) files. This does not export all the metadata, just the DOIs themselves, with a maximum of 25 DOIs per line. This format is primarily useful for importing into ResearchFish (where you can paste in the comma-separated lists to import publications). We plan to add further export formats in the future.
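If you would rather have one DOI per line (for example, to count or sort them), the exported file can be flattened with standard tools. This is a sketch that assumes the export contains only comma-separated DOIs as described above; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: flatten a comma-separated DOI export into
# one DOI per line, trimming whitespace and dropping empty entries.
dois_one_per_line() {
    tr ',' '\n' < "$1" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -v '^$'
}

# Example usage (file name is a placeholder):
# dois_one_per_line exported_dois.csv | sort -u
```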
Login to SAFE. Then:
The ARCHER2 quickstart guides provide the minimum information for new users or users transferring from ARCHER. There are two sections available which are meant to be followed in sequence.
This guide aims to quickly enable developers to work on ARCHER2. It assumes that you are familiar with the material in the Quickstart for users section.
"},{"location":"quick-start/quickstart-developers/#compiler-wrappers","title":"Compiler wrappers","text":"When compiling code on ARCHER2, you should make use of the HPE Cray compiler wrappers. These ensure that the correct libraries and headers (for example, MPI or HPE LibSci) will be used during the compilation and linking stages. These wrappers should be accessed by providing the following compiler names.
| Language | Wrapper name |
| --- | --- |
| C | cc |
| C++ | CC |
| Fortran | ftn |

This means that you should use the wrapper names whether on the command line, in build scripts, or in configure options. It could be helpful to set some or all of the following environment variables before running a build to ensure that the build tool is aware of the wrappers.
export CC=cc\nexport CXX=CC\nexport FC=ftn\nexport F77=ftn\nexport F90=ftn\n
man
pages are available for each wrapper. You can also see the full set of compiler and linker options being used by passing the -craype-verbose
option to the wrapper.
Tip
The HPE Cray compiler wrappers should be used instead of the MPI compiler wrappers such as mpicc
, mpicxx
and mpif90
that you may have used on other HPC systems.
On login to ARCHER2, the PrgEnv-cray
compiler environment will be loaded, as will a cce
module. The latter makes available the Cray compilers from the Cray Compiling Environment (CCE), while the former provides the correct wrappers and support to use them. The GNU Compiler Collection (GCC) and the AMD compiler environment (AOCC) are also available.
To make use of any particular compiler environment, you load the correct PrgEnv
module. After doing so the compiler wrappers (cc
, CC
and ftn
) will correctly call the compilers from the new suite. The default version of the corresponding compiler suite will also be loaded, but you may swap to another available version if you wish.
The following table summarises the suites and associated compiler environments.
| Suite name | Module | Programming environment collection |
| --- | --- | --- |
| CCE | cce | PrgEnv-cray |
| GCC | gcc | PrgEnv-gnu |
| AOCC | aocc | PrgEnv-aocc |
As an example, after logging in you may wish to use GCC as your compiler suite. Running module load PrgEnv-gnu
will replace the default CCE (Cray) environment with the GNU environment. It will also unload the cce
module and load the default version of the gcc
module; at the time of writing, this is GCC 11.2.0. If you need to use a different version of GCC, for example 10.3.0, you would follow up with module load gcc/10.3.0
. At this point you may invoke the compiler wrappers and they will correctly use the HPE libraries and tools in conjunction with GCC 10.3.0.
When choosing the compiler environment, a big factor will likely be which compilers you have previously used for your code's development. The Cray Fortran compiler is similar to the compiler you may be familiar with from ARCHER, while the Cray C and C++ compilers provided on ARCHER2 are new versions that are now derived from Clang. The GCC suite provides gcc/g++ and gfortran. The AOCC suite provides AMD Clang/Clang++ and AMD Flang.
Note
The Intel compilers are not available on ARCHER2.
"},{"location":"quick-start/quickstart-developers/#useful-compiler-options","title":"Useful compiler options","text":"The compiler options you use will depend on both the software you are building and also on the current stage of development. The following flags should be a good starting point for reasonable performance.
| Compilers | Optimisation flags |
| --- | --- |
| Cray C/C++ | -O2 -funroll-loops -ffast-math |
| Cray Fortran | Default options |
| GCC | -O2 -ftree-vectorize -funroll-loops -ffast-math |
Tip
If you want to use GCC version 10 or greater to compile MPI Fortran code, you must add the -fallow-argument-mismatch
option when compiling; otherwise you will see compile errors associated with MPI functions.
When you are happy with your code's performance you may wish to enable more aggressive optimisations; in this case you could start using the following flags. Please note, however, that these optimisations may lead to deviations from IEEE/ISO specifications. If your code relies on strict adherence then using these flags may cause incorrect output.
| Compilers | Optimisation flags |
| --- | --- |
| Cray C/C++ | -Ofast -funroll-loops |
| Cray Fortran | -O3 -hfp3 |
| GCC | -Ofast -funroll-loops |
Vectorisation is enabled by the Cray Fortran compiler at -O1
and above, by Cray C and C++ at -O2
and above or when using -ftree-vectorize
, and by the GCC compilers at -O3
and above or when using -ftree-vectorize
.
You may wish to promote default real
and integer
types in Fortran codes from 4 to 8 bytes. In this case, the following flags may be used.
| Compiler | real and integer promotion flags |
| --- | --- |
| Cray Fortran | -s real64 -s integer64 |
| gfortran | -freal-4-real-8 -finteger-4-integer-8 |
More documentation on the compilers is available through man
. The pages to read are accessed as follows.
| Compiler suite | C | C++ | Fortran |
| --- | --- | --- | --- |
| Cray | man craycc | man crayCC | man crayftn |
| GNU | man gcc | man g++ | man gfortran |
Tip
There are no man
pages for the AOCC compilers at the moment.
Executables on ARCHER2 link dynamically, and the Cray Programming Environment does not currently support static linking. This is in contrast to ARCHER where the default was to build statically.
If you attempt to link statically, you will see errors similar to:
/usr/bin/ld: cannot find -lpmi\n/usr/bin/ld: cannot find -lpmi2\ncollect2: error: ld returned 1 exit status\n
The compiler wrapper scripts on ARCHER2 link in runtime libraries using the RUNPATH
by default. This means that the paths to the runtime libraries are encoded into the executable so you do not need to load the compiler environment in your job submission scripts.
The default behaviour of a dynamically linked executable will be to allow the linker to provide the libraries it needs at runtime by searching the paths in the LD_LIBRARY_PATH
environment variable and then by searching the paths in the RUNPATH
setting of the binary. This is flexible in that it allows an executable to use newly installed library versions without rebuilding, but in some cases you may prefer to bake the paths to specific libraries into the executable RUNPATH
, keeping them constant. While the libraries are still dynamically loaded at run time, from the end user's point of view the resulting behaviour will be similar to that of a statically compiled executable in that they will not need to concern themselves with ensuring the linker will be able to find the libraries.
This is achieved by providing additional paths to add to RUNPATH
to the compiler as options. To set the compiler wrappers to do this, you can set the following environment variable.
export CRAY_ADD_RPATH=yes\n
"},{"location":"quick-start/quickstart-developers/#using-rpaths-to-link","title":"Using RPATHs to link","text":"RPATH
differs from RUNPATH
in that the dynamic linker searches RPATH directories for libraries before searching the paths in LD_LIBRARY_PATH
so they cannot be overridden in the same way at runtime.
You can provide RPATHs directly to the compilers using the -Wl,-rpath=<path-to-directory>
flag, where the provided path is to the directory containing the libraries which are themselves typically specified with flags of the type -l<library-name>
.
The following debugging tools are available on ARCHER2; each is made available by loading its module:
module load gdb4hpc
module load valgrind4hpc
module load cray-stat
To get started debugging on ARCHER2, you might like to use gdb4hpc. You should first of all compile your code using the -g
flag to enable debugging symbols. Once compiled, load the gdb4hpc module and start it:
module load gdb4hpc\ngdb4hpc\n
Once inside gdb4hpc, you can start your program's execution with the launch
command:
dbg all> launch $my_prog{128} ./prog\n
In this example, a job called my_prog
will be launched to run the executable file prog
over 128 cores on a compute node. If you run squeue
in another terminal you will be able to see it running. Inside gdb4hpc you may then step
through the code's execution, continue
to breakpoints that you set with break
, print
the values of variables at these points, and perform a backtrace
on the stack if the program crashes. Debugging jobs will end when you exit gdb4hpc, or you can end them yourself by running, in this example, release $my_prog
.
For more information on debugging parallel codes, see the documentation in the Debugging section of the ARCHER2 User and Best Practice Guide.
"},{"location":"quick-start/quickstart-developers/#profiling-tools","title":"Profiling tools","text":"Profiling on ARCHER2 is provided through the Cray Performance Measurement and Analysis Tools (CrayPAT). This has a number of different components:
pat_build, the utility used to instrument programs;
the CrayPAT run time environment, which collects the specified performance data during program execution;
pat_report, the first-level data analysis tool, used to produce text reports or export data for more sophisticated analysis.
The above tools are made available for use by firstly loading the perftools-base
module followed by either perftools
(for CrayPAT, Reveal and Apprentice2) or one of the perftools-lite
modules.
The simplest way to get started profiling your code is with CrayPAT-lite. For example, to sample a run of a code you would load the perftools-base
and perftools-lite
modules, and then compile (you will receive a message that the executable is being instrumented). Performing a batch run as usual with this executable will produce a directory such as my_prog+74653-2s
which can be passed to pat_report
to view the results. In this example,
pat_report -O calltree+src my_prog+74653-2s\n
will produce a report containing the call tree. You can view available report keywords to be provided to the -O
option by running pat_report -O -h
. The available perftools-lite
modules are:
perftools-lite, instrumenting a basic sampling experiment.
perftools-lite-events, instrumenting a tracing experiment.
perftools-lite-gpu, instrumenting OpenACC and OpenMP 4 use of GPUs.
perftools-lite-hbm, instrumenting for memory bandwidth usage.
perftools-lite-loops, instrumenting a loop work estimate experiment.
Tip
For more information on profiling parallel codes, see the documentation in the Profiling section of the ARCHER2 User and Best Practice Guide.
"},{"location":"quick-start/quickstart-developers/#useful-links","title":"Useful Links","text":"Links to other documentation you may find useful:
Once you have set up your machine account and logged on, run a job or two and possibly updated and compiled your code: what next?
There is still loads of support and advice available to you:
Getting Started on ARCHER2 gives an overview of some of this help.
There is advice on how to Get Access through different funding routes and, if your chosen route requires you to complete a Technical Assessment, we have advice on How to prepare a successful TA.
And we also have a comprehensive Training Programme for all levels of experience and a wide range of different uses. All our training is free for UK Academics and we have a list of upcoming training and also all the materials and resources from previous training events.
"},{"location":"quick-start/quickstart-users-totp/","title":"Quickstart for users","text":"This guide aims to quickly enable new users to get up and running on ARCHER2. It covers the process of getting an ARCHER2 account, logging in and running your first job.
"},{"location":"quick-start/quickstart-users-totp/#request-an-account-on-archer2","title":"Request an account on ARCHER2","text":"Important
You need to use both a password and a passphrase-protected SSH key pair to log into ARCHER2. You get the password from SAFE, but you will also need to set up your own SSH key pair and add the public part to your account via SAFE before you will be able to log in. We cover the authentication steps below.
"},{"location":"quick-start/quickstart-users-totp/#obtain-an-account-on-the-safe-website","title":"Obtain an account on the SAFE website","text":"Warning
We have seen issues with Gmail blocking emails from SAFE so we recommend that users use their institutional/work email address rather than Gmail addresses to register for SAFE accounts.
The first step is to sign up for an account on the ARCHER2 SAFE website. The SAFE account is used to manage all of your login accounts, allowing you to report on your usage and quotas. To do this:
You are now registered. Your SAFE password will be emailed to the email address you provided. You can then login with that email address and password. (You can change your initial SAFE password whenever you want by selecting the Change SAFE password option from the Your details menu.)
"},{"location":"quick-start/quickstart-users-totp/#request-an-archer2-login-account","title":"Request an ARCHER2 login account","text":"Once you have a SAFE account and an SSH key you will need to request a user account on ARCHER2 itself. To do this you will require a Project Code; you usually obtain this from the Principal Investigator (PI) or project manager for the project you will be working on. Once you have the Project Code:
The PI or project manager of the project will be asked to approve your request. After your request has been approved the account will be created and when this has been done you will receive an email. You can then come back to SAFE and pick up the initial single-use password for your new account.
Note
ARCHER2 account passwords are also sometimes referred to as LDAP passwords by the system.
"},{"location":"quick-start/quickstart-users-totp/#generating-and-adding-an-ssh-key-pair","title":"Generating and adding an SSH key pair","text":"How you generate your SSH key pair depends on which operating system you use and which SSH client you use to connect to ARCHER2. We will not cover the details on generating an SSH key pair here, but detailed information on this topic is available in the ARCHER2 User and Best Practice Guide.
After generating your SSH key pair, add the public part to your login account using SAFE:
Once you have done this, your SSH key will be added to your ARCHER2 account.
Remember, you will need to use both an SSH key and password to log into ARCHER2 so you will also need to collect your initial password before you can log into ARCHER2 for the first time. We cover this next.
Note
If you want to connect to ARCHER2 from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an SSH key on each machine, and add each of the public keys into SAFE.
"},{"location":"quick-start/quickstart-users-totp/#login-to-archer2","title":"Login to ARCHER2","text":"To log into ARCHER2 you should use the address:
ssh [userID]@login.archer2.ac.uk
The order in which you are asked for credentials depends on the system you are accessing:
You will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you previously logged into the ARCHER2 system before the major upgrade in May/June 2023 with your account you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11).
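One way to do this without opening an editor is sketched below: it deletes a single numbered line and keeps a backup copy. The function name is hypothetical. OpenSSH also provides ssh-keygen -R login.archer2.ac.uk, which removes all stored keys for that host name instead.

```shell
#!/bin/bash
# Hypothetical helper: delete one numbered line from a known_hosts
# file, keeping a backup copy alongside it as <file>.bak.
remove_known_host_line() {
    local file="$1" line="$2"
    cp "$file" "$file.bak" || return 1
    # Write every line except the offending one back to the file
    sed "${line}d" "$file.bak" > "$file"
}

# Example for the message above (offending line 11):
# remove_known_host_line ~/.ssh/known_hosts 11
```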
Tip
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_archer2
you would use the command ssh -i keys/id_rsa_archer2 username@login.archer2.ac.uk
to log in.
Remember, you will need to use both an SSH key and Time-based one-time password to log into ARCHER2 so you will also need to set up your TOTP before you can log into ARCHER2.
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will not use your password when logging on to ARCHER2 after the initial logon.
Hint
More information on connecting to ARCHER2 is available in the Connecting to ARCHER2 section of the User Guide.
"},{"location":"quick-start/quickstart-users-totp/#file-systems-and-manipulating-data","title":"File systems and manipulating data","text":"ARCHER2 has a number of different file systems and understanding the difference between them is crucial to being able to use the system. In particular, transferring and moving data often requires a bit of thought in advance to ensure that the data is secure and in a useful form.
ARCHER2 file systems are:
All users have a directory on one of the home file systems and on one of the work file systems. The directories are located at:
/home/[project ID]/[project ID]/[user ID]
(this is also set as your home directory)/work/[project ID]/[project ID]/[user ID]
Top tips for managing data on ARCHER2:
tar
or zip
).tar
or rsync
between file systems mounted on ARCHER2 avoid the use of compression options as these can slow performance (time saved by transferring smaller compressed files is usually less than the overhead added by having to compress files on the fly).
Hint
Information on the file systems and best practice in managing your data is available in the Data management and transfer section of the User and Best Practice Guide.
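As a sketch of the tar advice above, the following bundles a directory into a single archive without any compression option. The function name and the paths in the usage comment are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: bundle a directory into an uncompressed tar
# archive (-c create, -f archive name; no -z/-j/-J, so no compression).
bundle_dir() {
    local src_dir="$1" archive="$2"
    # -C changes into the parent directory so the archive stores
    # a relative path rather than the full absolute path
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Example usage (paths are placeholders):
# bundle_dir /work/t01/t01/auser/results /work/t01/t01/auser/results.tar
# tar -tf /work/t01/t01/auser/results.tar   # list the contents to check
```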
"},{"location":"quick-start/quickstart-users-totp/#accessing-software","title":"Accessing software","text":"Software on ARCHER2 is principally accessed through modules. These load and unload the desired applications, compilers, tools and libraries through the module
command and its subcommands. Some modules will be loaded by default on login, providing a default working environment; many more will be available for use but initially unloaded, allowing you to set up the environment to suit your needs.
At any stage you can check which modules have been loaded by running
module list\n
Running the following command will display all environment modules available on ARCHER2, whether loaded or unloaded
module avail\n
The search field for this command may be narrowed by providing the first few characters of the module name being queried. For example, all available versions and variants of VASP may be found by running
module avail vasp\n
You will see that different versions are available for many modules. For example, vasp/5/5.4.4.pl2
and vasp/6/6.3.2
are two available versions of VASP on the full system. Furthermore, a default version may be specified; this is used if no version is provided by the user.
Important
VASP is licensed software, as are other software packages on ARCHER2. You must have a valid licence to use licensed software on ARCHER2. Often you will need to request access through the SAFE. More on this below.
The module load
command loads a module for use. Following the above,
module load vasp/6\n
would load the default version of VASP 6, while
module load vasp/6/6.3.2\n
would specifically load version 6.3.2
. A loaded module may be unloaded through the identical module remove
command, e.g.
module unload vasp\n
The above unloads whichever version of VASP is currently in the environment. Rather than issuing separate unload and load commands, versions of a module may be swapped as follows:
module swap vasp vasp/5/5.4.4.pl2\n
Other helpful commands are:
module help <modulename>, which provides a short description of the module
module show <modulename>, which displays the contents of the modulefile
module restore, which returns you to the default module setup as if you had just logged in
Tip
You should not use the module purge
command on ARCHER2 as this will cause issues for the HPE Cray programming environment. If you wish to reset your modules, you should use the module restore
command instead.
Points to be aware of include:
module show
) should reveal the cause of the conflict and how to resolve it.module show
.
More information on modules and the software environment on ARCHER2 can be found in the Software environment section of the User and Best Practice Guide.
"},{"location":"quick-start/quickstart-users-totp/#requesting-access-to-licensed-software","title":"Requesting access to licensed software","text":"Some of the software installed on ARCHER2 requires a user to have a valid licence agreed with the software owners/developers to be able to use it (for example, VASP). Although you will be able to load this software on ARCHER2, you will be barred from actually using it until your licence has been verified.
You request access to licensed software through the SAFE (the web administration tool you used to apply for your account and retrieve your initial password) by being added to the appropriate Package Group. To request access to licensed software:
Your request will then be processed by the ARCHER2 Service Desk who will confirm your licence with the software owners/developers before enabling your access to the software on ARCHER2. This can take several days (depending on how quickly the software owners/developers respond) but you will be advised once this has been done.
"},{"location":"quick-start/quickstart-users-totp/#create-a-job-submission-script","title":"Create a job submission script","text":"To run a program on the ARCHER2 compute nodes you need to write a job submission script that tells the system how many compute nodes you want to reserve and for how long. You also need to use the srun
command to launch your parallel executable.
Hint
For more details on the Slurm scheduler on ARCHER2 and writing job submission scripts see the Running jobs on ARCHER2 section of the User and Best Practice Guide.
Important
Parallel jobs on ARCHER2 should be run from the work file systems as the home file systems are not available on the compute nodes - you will see a chdir
or file not found error if you try to access data on the home file system within a parallel job running on the compute nodes.
Create a job submission script called submit.slurm
in your space on the work file systems using your favourite text editor. For example, using vim
:
auser@ln01:~> cd /work/t01/t01/auser\nauser@ln01:/work/t01/t01/auser> vim submit.slurm\n
Tip
You will need to use your project code and username to get to the correct directory. i.e. replace the t01
above with your project code and replace the username auser
with your ARCHER2 username.
Paste the following text into your job submission script, replacing [budget code] with your budget code (e.g. e99-ham). The script requests the standard partition and the standard quality of service (QoS); change the --partition and --qos values if you wish to run elsewhere.
#!/bin/bash --login\n\n#SBATCH --job-name=test_job\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:5:0\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the xthi module to get access to the xthi program\nmodule load xthi\n\n# Recommended environment settings\n# Stop unintentional multi-threading within software libraries\nexport OMP_NUM_THREADS=1\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# srun launches the parallel program based on the SBATCH options\nsrun --distribution=block:block --hint=nomultithread xthi_mpi\n
"},{"location":"quick-start/quickstart-users-totp/#submit-your-job-to-the-queue","title":"Submit your job to the queue","text":"You submit your job to the queues using the sbatch
command:
auser@ln01:/work/t01/t01/auser> sbatch submit.slurm\nSubmitted batch job 23996\n\nThe value returned is your *Job ID*.\n
"},{"location":"quick-start/quickstart-users-totp/#monitoring-your-job","title":"Monitoring your job","text":"You use the squeue
command to examine jobs in the queue. To list all the jobs you have in the queue, use:
auser@ln01:/work/t01/t01/auser> squeue -u $USER\n
squeue
on its own lists all jobs in the queue from all users.
The job submission script above should write the output to a file called slurm-<jobID>.out
(i.e. if the Job ID was 23996, the file would be slurm-23996.out
). You can check the contents of this file with the cat
command. If the job was successful you should see output that looks something like:
auser@ln01:/work/t01/t01/auser> cat slurm-23996.out\nNode 0, hostname nid001020\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode 0, rank 8, thread 0, (affinity = 8)\nNode 0, rank 9, thread 0, (affinity = 9)\nNode 0, rank 10, thread 0, (affinity = 10)\nNode 0, rank 11, thread 0, (affinity = 11)\nNode 0, rank 12, thread 0, (affinity = 12)\nNode 0, rank 13, thread 0, (affinity = 13)\nNode 0, rank 14, thread 0, (affinity = 14)\nNode 0, rank 15, thread 0, (affinity = 15)\nNode 0, rank 16, thread 0, (affinity = 16)\nNode 0, rank 17, thread 0, (affinity = 17)\nNode 0, rank 18, thread 0, (affinity = 18)\nNode 0, rank 19, thread 0, (affinity = 19)\nNode 0, rank 20, thread 0, (affinity = 20)\nNode 0, rank 21, thread 0, (affinity = 21)\n... output trimmed ...\n
If something has gone wrong, you will find any error messages in the file instead of the expected output.
"},{"location":"quick-start/quickstart-users-totp/#acknowledging-archer2","title":"Acknowledging ARCHER2","text":"You should use the following phrase to acknowledge ARCHER2 for all research outputs that were generated using the ARCHER2 service:
This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk).
You should also tag outputs with the keyword \"ARCHER2\" whenever possible.
"},{"location":"quick-start/quickstart-users-totp/#useful-links","title":"Useful Links","text":"If you plan to compile your own programs on ARCHER2, you may also want to look at Quickstart for developers.
Other documentation you may find useful:
This guide aims to quickly enable new users to get up and running on ARCHER2. It covers the process of getting an ARCHER2 account, logging in and running your first job.
"},{"location":"quick-start/quickstart-users/#request-an-account-on-archer2","title":"Request an account on ARCHER2","text":"Important
To access ARCHER2, you need to use two sets of credentials: your SSH key pair protected by a passphrase and a Time-based one-time password (TOTP). Additionally, the first time you ever log into an account on ARCHER2, you will need to use a single use password you retrieve from SAFE.
"},{"location":"quick-start/quickstart-users/#obtain-an-account-on-the-safe-website","title":"Obtain an account on the SAFE website","text":"Warning
We have seen issues with Gmail blocking emails from SAFE so we recommend that users use their institutional/work email address rather than Gmail addresses to register for SAFE accounts.
The first step is to sign up for an account on the ARCHER2 SAFE website. The SAFE account is used to manage all of your login accounts, allowing you to report on your usage and quotas. To do this:
You are now registered. Your SAFE password will be emailed to the email address you provided. You can then log in with that email address and password. (You can change your initial SAFE password whenever you want by selecting the Change SAFE password option from the Your details menu.)
"},{"location":"quick-start/quickstart-users/#request-an-archer2-login-account","title":"Request an ARCHER2 login account","text":"Once you have a SAFE account and an SSH key you will need to request a user account on ARCHER2 itself. To do this you will require a Project Code; you usually obtain this from the Principal Investigator (PI) or project manager for the project you will be working on. Once you have the Project Code:
The PI or project manager of the project will be asked to approve your request. Once your request has been approved, the account will be created and you will receive an email to confirm this. You can then come back to SAFE and pick up the initial single-use password for your new account.
Note
ARCHER2 account passwords are also sometimes referred to as LDAP passwords by the system.
"},{"location":"quick-start/quickstart-users/#generating-and-adding-an-ssh-key-pair","title":"Generating and adding an SSH key pair","text":"How you generate your SSH key pair depends on which operating system you use and which SSH client you use to connect to ARCHER2. We will not cover the details on generating an SSH key pair here, but detailed information on this topic is available in the ARCHER2 User and Best Practice Guide.
After generating your SSH key pair, add the public part to your login account using SAFE:
Once you have done this, your SSH key will be added to your ARCHER2 account.
Remember, you will need to use both an SSH key and password to log into ARCHER2 so you will also need to collect your initial password before you can log into ARCHER2 for the first time. We cover this next.
Note
If you want to connect to ARCHER2 from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an SSH key on each machine and add each of the public keys into SAFE.
"},{"location":"quick-start/quickstart-users/#login-to-archer2","title":"Login to ARCHER2","text":"To log into ARCHER2 you should use the address:
ssh [userID]@login.archer2.ac.uk
The order in which you are asked for credentials depends on the system you are accessing:
You will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you logged into ARCHER2 with your account before the major upgrade in May/June 2023, you may see an error from SSH that looks like:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above, the offending line is line #11).
Tip
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_archer2
you would use the command ssh -i keys/id_rsa_archer2 username@login.archer2.ac.uk
to log in.
Remember, you will need to use both an SSH key and Time-based one-time password to log into ARCHER2 so you will also need to set up your TOTP before you can log into ARCHER2.
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will not use your password when logging on to ARCHER2 after the initial logon.
Hint
More information on connecting to ARCHER2 is available in the Connecting to ARCHER2 section of the User Guide.
"},{"location":"quick-start/quickstart-users/#file-systems-and-manipulating-data","title":"File systems and manipulating data","text":"ARCHER2 has a number of different file systems and understanding the difference between them is crucial to being able to use the system. In particular, transferring and moving data often requires a bit of thought in advance to ensure that the data is secure and in a useful form.
ARCHER2 file systems are:
All users have a directory on one of the home file systems and on one of the work file systems. The directories are located at:
/home/[project ID]/[project ID]/[user ID]
(this is also set as your home directory)
/work/[project ID]/[project ID]/[user ID]
Top tips for managing data on ARCHER2:
tar
or zip
).tar
or rsync
between file systems mounted on ARCHER2 avoid the use of compression options as these can slow performance (time saved by transferring smaller compressed files is usually less than the overhead added by having to compress files on the fly).
Hint
Information on the file systems and best practice in managing your data is available in the Data management and transfer section of the User and Best Practice Guide.
"},{"location":"quick-start/quickstart-users/#accessing-software","title":"Accessing software","text":"Software on ARCHER2 is principally accessed through modules. These load and unload the desired applications, compilers, tools and libraries through the module
command and its subcommands. Some modules will be loaded by default on login, providing a default working environment; many more will be available for use but initially unloaded, allowing you to set up the environment to suit your needs.
At any stage you can check which modules have been loaded by running
module list\n
Running the following command will display all environment modules available on ARCHER2, whether loaded or unloaded
module avail\n
The search field for this command may be narrowed by providing the first few characters of the module name being queried. For example, all available versions and variants of VASP may be found by running
module avail vasp\n
You will see that different versions are available for many modules. For example, vasp/5/5.4.4.pl2
and vasp/6/6.3.2
are two available versions of VASP on the full system. Furthermore, a default version may be specified; this is used if no version is provided by the user.
Important
VASP is licensed software, as are other software packages on ARCHER2. You must have a valid licence to use licensed software on ARCHER2. Often you will need to request access through the SAFE. More on this below.
The module load
command loads a module for use. Following the above,
module load vasp/6\n
would load the default version of VASP 6, while
module load vasp/6/6.3.2\n
would specifically load version 6.3.2
. A loaded module may be unloaded with the module unload
command (module remove
is an equivalent alias), e.g.
module unload vasp\n
The above unloads whichever version of VASP is currently in the environment. Rather than issuing separate unload and load commands, versions of a module may be swapped as follows:
module swap vasp vasp/5/5.4.4.pl2\n
Other helpful commands are:
module help <modulename>
which provides a short description of the module
module show <modulename>
which displays the contents of the modulefile
module restore
which returns you to the default module setup as if you had just logged in
Tip
You should not use the module purge
command on ARCHER2 as this will cause issues for the HPE Cray programming environment. If you wish to reset your modules, you should use the module restore
command instead.
Points to be aware of include:
module show
) should reveal the cause of the conflict and how to resolve it.
module show
.
More information on modules and the software environment on ARCHER2 can be found in the Software environment section of the User and Best Practice Guide.
"},{"location":"quick-start/quickstart-users/#requesting-access-to-licensed-software","title":"Requesting access to licensed software","text":"Some of the software installed on ARCHER2 requires a user to have a valid licence agreed with the software owners/developers to be able to use it (for example, VASP). Although you will be able to load this software on ARCHER2, you will be barred from actually using it until your licence has been verified.
You request access to licensed software through the SAFE (the web administration tool you used to apply for your account and retrieve your initial password) by being added to the appropriate Package Group. To request access to licensed software:
Your request will then be processed by the ARCHER2 Service Desk, who will confirm your licence with the software owners/developers before enabling your access to the software on ARCHER2. This can take several days (depending on how quickly the software owners/developers respond), but you will be advised once this has been done.
"},{"location":"quick-start/quickstart-users/#create-a-job-submission-script","title":"Create a job submission script","text":"To run a program on the ARCHER2 compute nodes you need to write a job submission script that tells the system how many compute nodes you want to reserve and for how long. You also need to use the srun
command to launch your parallel executable.
Hint
For more details on the Slurm scheduler on ARCHER2 and writing job submission scripts, see the Running jobs on ARCHER2 section of the User and Best Practice Guide.
Important
Parallel jobs on ARCHER2 should be run from the work file systems as the home file systems are not available on the compute nodes - you will see a chdir
or file not found error if you try to access data on the home file system within a parallel job running on the compute nodes.
Create a job submission script called submit.slurm
in your space on the work file systems using your favourite text editor. For example, using vim
:
auser@ln01:~> cd /work/t01/t01/auser\nauser@ln01:/work/t01/t01/auser> vim submit.slurm\n
Tip
You will need to use your project code and username to get to the correct directory, i.e. replace the t01
above with your project code and replace the username auser
with your ARCHER2 username.
Paste the following text into your job submission script, replacing [budget code]
with your budget code (e.g. e99-ham
). The script requests the standard
partition and the standard
quality of service (QoS); edit the --partition
and --qos
lines if you wish to use different ones.
#!/bin/bash --login\n\n#SBATCH --job-name=test_job\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:5:0\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the xthi module to get access to the xthi program\nmodule load xthi\n\n# Recommended environment settings\n# Stop unintentional multi-threading within software libraries\nexport OMP_NUM_THREADS=1\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# srun launches the parallel program based on the SBATCH options\nsrun --distribution=block:block --hint=nomultithread xthi_mpi\n
"},{"location":"quick-start/quickstart-users/#submit-your-job-to-the-queue","title":"Submit your job to the queue","text":"You submit your job to the queues using the sbatch
command:
auser@ln01:/work/t01/t01/auser> sbatch submit.slurm\nSubmitted batch job 23996\n\nThe value returned is your *Job ID*.\n
"},{"location":"quick-start/quickstart-users/#monitoring-your-job","title":"Monitoring your job","text":"You use the squeue
command to examine jobs in the queue. To list all the jobs you have in the queue, use:
auser@ln01:/work/t01/t01/auser> squeue -u $USER\n
squeue
on its own lists all jobs in the queue from all users.
The job submission script above should write the output to a file called slurm-<jobID>.out
(i.e. if the Job ID was 23996, the file would be slurm-23996.out
). You can check the contents of this file with the cat
command. If the job was successful you should see output that looks something like:
auser@ln01:/work/t01/t01/auser> cat slurm-23996.out\nNode 0, hostname nid001020\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode 0, rank 8, thread 0, (affinity = 8)\nNode 0, rank 9, thread 0, (affinity = 9)\nNode 0, rank 10, thread 0, (affinity = 10)\nNode 0, rank 11, thread 0, (affinity = 11)\nNode 0, rank 12, thread 0, (affinity = 12)\nNode 0, rank 13, thread 0, (affinity = 13)\nNode 0, rank 14, thread 0, (affinity = 14)\nNode 0, rank 15, thread 0, (affinity = 15)\nNode 0, rank 16, thread 0, (affinity = 16)\nNode 0, rank 17, thread 0, (affinity = 17)\nNode 0, rank 18, thread 0, (affinity = 18)\nNode 0, rank 19, thread 0, (affinity = 19)\nNode 0, rank 20, thread 0, (affinity = 20)\nNode 0, rank 21, thread 0, (affinity = 21)\n... output trimmed ...\n
If something has gone wrong, you will find any error messages in the file instead of the expected output.
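As a minimal sketch of how the Job ID maps to the output file name, the shell snippet below extracts the ID from the message printed by sbatch and builds the file name described above. The sbatch_output variable is a hypothetical stand-in that holds the example message from this guide rather than the result of a real submission.

```bash
# Stand-in for the line printed by sbatch (example Job ID from this guide)
sbatch_output='Submitted batch job 23996'

# The Job ID is the last space-separated word of the message
job_id=${sbatch_output##* }

# Output file name written by the job submission script above
output_file=slurm-${job_id}.out

echo ${output_file}
```

You could then pass the resulting file name to cat once the job has finished.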
"},{"location":"quick-start/quickstart-users/#acknowledging-archer2","title":"Acknowledging ARCHER2","text":"You should use the following phrase to acknowledge ARCHER2 for all research outputs that were generated using the ARCHER2 service:
This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk).
You should also tag outputs with the keyword \"ARCHER2\" whenever possible.
"},{"location":"quick-start/quickstart-users/#useful-links","title":"Useful Links","text":"If you plan to compile your own programs on ARCHER2, you may also want to look at Quickstart for developers.
Other documentation you may find useful:
ARCHER2 provides a number of research software packages as centrally supported packages. Many of these packages are free to use, but others require a licence (which you, or your research group, need to supply).
This section also contains information on research software contributed and/or supported by third parties (marked with a * in the list below).
For centrally supported packages, the version available will usually be the current stable release, to include major releases and significant updates. We will usually not maintain older versions and versions no longer supported by the developers of the package.
The following sections provide details on access to each of the centrally installed packages (software that is not part of the fully-supported software stack is marked with *):
If the software you are interested in is not in the above list, we may still be able to help you install your own version, either individually, or as a project. Please contact the Service Desk.
"},{"location":"research-software/casino/","title":"CASINO","text":"Note
CASINO is not available as a central install/module on ARCHER2 at this time. This page provides tips on using CASINO on ARCHER2 for users who have obtained their own copy of the code.
Important
CASINO is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
CASINO is a computer program system for performing quantum Monte Carlo (QMC) electronic structure calculations that has been developed by a group of researchers initially working in the Theory of Condensed Matter group in the Cambridge University physics department, and their collaborators, over more than 20 years. It is capable of calculating incredibly accurate solutions to the Schr\u00f6dinger equation of quantum mechanics for realistic systems built from atoms.
"},{"location":"research-software/casino/#useful-links","title":"Useful Links","text":"You should use the linuxpc-gcc-slurm-parallel.archer2
configuration that is supplied along with the CASINO source code to build on ARCHER2 and ensure that you build the \"Shm\" (System-V shared memory) version of the code.
Bug
The linuxpc-cray-slurm-parallel.archer2
configuration produces a binary that crashes with a segfault and should not be used.
The performance of CASINO on ARCHER2 is critically dependent on three things:
Next, we show how to make sure that the MPI transport layer is set to UCX, how to set the number of cores sharing the System-V shared memory segments and how to pin MPI processes sequentially to cores.
Finally, we provide a job submission script that demonstrates all these options together.
"},{"location":"research-software/casino/#setting-the-mpi-transport-layer-to-ucx","title":"Setting the MPI transport layer to UCX","text":"In your job submission script that runs CASINO you switch to using UCX as the MPI transport layer by including the following lines before you run CASINO (i.e. before the srun
command that launches the CASINO executable):
module load PrgEnv-gnu\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\n
"},{"location":"research-software/casino/#setting-the-number-of-cores-sharing-memory","title":"Setting the number of cores sharing memory","text":"In your job submission script you set the number of cores sharing memory segments by setting the CASINO_NUMABLK
environment variable before you run CASINO. For example, to specify that there should be shared memory segments each shared between 16 cores, you would use:
export CASINO_NUMABLK=16\n
Tip
If you do not set CASINO_NUMABLK
then CASINO will use the default of all cores on a node (the equivalent of setting it to 128), which gives very poor performance, so you should always set this environment variable. Setting CASINO_NUMABLK
to 8 or 16 cores gives the best performance. 32 cores is acceptable if you want to maximise memory efficiency. Using 64 and 128 gives poor performance.
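To illustrate the arithmetic, the sketch below lists how many shared memory segments a fully populated 128-core node would hold for each block size mentioned above. It assumes that the number of segments is simply the core count divided by CASINO_NUMABLK; cores_per_node is a hypothetical variable name used only for this illustration.

```bash
# Assumption: segments per node = cores per node / CASINO_NUMABLK
cores_per_node=128

for numablk in 8 16 32 64 128; do
  segments=$(( cores_per_node / numablk ))
  echo CASINO_NUMABLK=${numablk}: ${segments} segments per node
done
```

Under this assumption, the recommended values of 8 and 16 correspond to 16 and 8 segments per node respectively.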
For shared memory segments to work efficiently MPI processes must be pinned sequentially to cores on compute nodes (so that cores sharing memory are close in the node memory hierarchy). To do this, you add the following options to the srun
command in your job script that runs the CASINO executable:
--distribution=block:block --hint=nomultithread\n
"},{"location":"research-software/casino/#example-casino-job-submission-script","title":"Example CASINO job submission script","text":"The following script will run a CASINO job using 16 nodes (2048 cores).
#!/bin/bash\n\n# Request 16 nodes with 128 MPI tasks per node for 20 minutes\n#SBATCH --job-name=CASINO\n#SBATCH --nodes=16\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Ensure we are using UCX as the MPI transport layer\nmodule load PrgEnv-gnu\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\n\n# Set CASINO to share memory across 16 core blocks\nexport CASINO_NUMABLK=16\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the location of the CASINO executable - this must be on /work\n# Replace this with the path to your compiled CASINO binary\nCASINO_EXE=/work/t01/t01/auser/CASINO/bin_qmc/linuxpc-gcc-slurm-parallel.archer2/Shm/opt/casino\n\n# Launch CASINO with MPI processes pinned to cores in a sequential order\nsrun --distribution=block:block --hint=nomultithread ${CASINO_EXE}\n
"},{"location":"research-software/casino/#casino-performance-on-archer2","title":"CASINO performance on ARCHER2","text":"We have run the benzene_dimer benchmark on ARCHER2 with the following configuration:
linuxpc-gcc-slurm-parallel.archer2
, \"Shm\" version
Timings are reported as the time taken for 100 equilibration steps in a DMC calculation.
"},{"location":"research-software/casino/#casino_numablk8","title":"CASINO_NUMABLK=8","text":"Nodes | Time taken (s) | Speedup
1 | 289.90 | 1.0
2 | 154.93 | 1.9
4 | 81.06 | 3.6
8 | 41.44 | 7.0
16 | 23.16 | 12.5
"},{"location":"research-software/castep/","title":"CASTEP","text":"CASTEP is a leading code for calculating the properties of materials from first principles. Using density functional theory, it can simulate a wide range of materials properties, including energetics, structure at the atomic level, vibrational properties and electronic response properties. In particular, it has a wide range of spectroscopic features that link directly to experiment, such as infra-red and Raman spectroscopies, NMR, and core level spectra.
"},{"location":"research-software/castep/#useful-links","title":"Useful Links","text":"CASTEP is only available to users who have a valid CASTEP licence.
If you have a CASTEP licence and wish to have access to CASTEP on ARCHER2, please make a request via the SAFE, see:
Please have your licence details to hand.
"},{"location":"research-software/castep/#note-on-using-relativistic-j-dependent-pseudopotentials","title":"Note on using Relativistic J-dependent pseudopotentials","text":"These pseudopotentials cannot be generated on the fly by CASTEP and so are available in the following directory on ARCHER2:
/work/y07/shared/apps/core/castep/pseudopotentials\n
"},{"location":"research-software/castep/#running-parallel-castep-jobs","title":"Running parallel CASTEP jobs","text":"The following script will run a CASTEP job using 2 nodes (256 cores). It assumes that the input files have the file stem test_calc
.
#!/bin/bash\n\n# Request 2 nodes with 128 MPI tasks per node for 20 minutes\n#SBATCH --job-name=CASTEP\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Load the CASTEP module, avoid any unintentional OpenMP threading by\n# setting OMP_NUM_THREADS, and launch the code.\nmodule load castep\nexport OMP_NUM_THREADS=1\nsrun --distribution=block:block --hint=nomultithread castep.mpi test_calc\n
"},{"location":"research-software/castep/#using-serial-castep-tools","title":"Using serial CASTEP tools","text":"Serial CASTEP tools are available in the standard CASTEP module.
"},{"location":"research-software/castep/#compiling-castep","title":"Compiling CASTEP","text":"The latest instructions for building CASTEP on ARCHER2 may be found in the GitHub repository of build instructions:
In the process of porting CESM 2.1.3 to ARCHER2, a set of four long runs was carried out. This page contains the four example cases that were validated with these runs. The cases vary in the number of cores or threads used; the PE layouts used in the validation runs are included here and can be used as a guide for other runs. While only these four compsets and grids have been validated, CESM2 is not bound to just these cases. Links to the UCAR/NCAR pages on configurations, compsets and grids are in the useful links section of the CESM2.1.3 on ARCHER2 page, which can be used to find many of the defined compsets for CESM2.1.3.
"},{"location":"research-software/cesm-further-examples/#atmosphere-only-f2000climo","title":"Atmosphere-only / F2000climo","text":"This compset uses the F09 grid which is roughly equivalent to a 1 degree resolution. On ARCHER2 with four nodes this configuration should give a throughput of around 7.8 simulated years per (wallclock) day (SYPD). The commands to set up and run the case are as follows:
${CIMEROOT}/scripts/create_newcase --case [case name] --compset F2000climo --res f09_f09_mg17 --walltime [enough time] --project [project code]\ncd [case directory]\n./xmlchange NTASKS=512,NTASKS_ESP=1\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
"},{"location":"research-software/cesm-further-examples/#slab-ocean-etest","title":"Slab Ocean / ETEST","text":"The slab ocean case is similar to the atmosphere-only case in terms of resources needed, as the slab ocean is inexpensive to simulate in comparison to the atmosphere. The setup detailed below uses two OMP threads and more tasks than the F2000climo case, so a throughput of around 20 SYPD can be expected. Unlike F2000climo, but like most compsets, this is unsupported (meaning it has not been scientifically verified by NCAR personnel) and as such an extra argument is required when creating the case. The ROOTPE settings guard against a poor automatic choice of resource layout.
${CIMEROOT}/scripts/create_newcase --case [case name] --compset ETEST --res f09_g17 --walltime [enough time] --project [project code] --run-unsupported\ncd [case directory]\n./xmlchange NTASKS=1024,NTASKS_ESP=1\n./xmlchange NTHRDS=2\n./xmlchange ROOTPE_ICE=0,ROOTPE_OCN=0\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
"},{"location":"research-software/cesm-further-examples/#coupled-ocean-b1850","title":"Coupled Ocean / B1850","text":"Compsets with the B
prefix are fully coupled, and actively simulate all components. As such, this case is more expensive to run, especially the ocean component. This case can be set up to run on dedicated nodes by changing the $ROOTPE
variables (run the ./pelayout command to check that you have things as you wish). This should give a throughput of just over 10 SYPD.
${CIMEROOT}/scripts/create_newcase --case [case name] --compset B1850 --res f09_g17 --walltime [enough time] --project [project name]\ncd [case directory]\n./xmlchange NTASKS_CPL=1024,NTASKS_ICE=256,NTASKS_LND=256,NTASKS_GLC=128,NTASKS_ROF=128,NTASKS_WAV=256,NTASKS_OCN=512,NTASKS_ATM=1024\n./xmlchange ROOTPE_CPL=0,ROOTPE_ICE=0,ROOTPE_LND=256,ROOTPE_GLC=512,ROOTPE_ROF=640,ROOTPE_WAV=768,ROOTPE_OCN=1024,ROOTPE_ATM=0\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
You can also define the PE layout in terms of full nodes by using negative values. For example, with $MAX_MPITASKS_PER_NODE=128
and $MAX_TASKS_PER_NODE=128
, the following is equivalent to the layout above:
${CIMEROOT}/scripts/create_newcase --case [case name] --compset B1850 --res f09_g17 --walltime [enough time] --project [project name]\ncd [case directory]\n./xmlchange NTASKS_CPL=-8,NTASKS_ICE=-2,NTASKS_LND=-2,NTASKS_GLC=-1,NTASKS_ROF=-1,NTASKS_WAV=-2,NTASKS_OCN=-4,NTASKS_ATM=-8\n./xmlchange ROOTPE_CPL=0,ROOTPE_ICE=0,ROOTPE_LND=-2,ROOTPE_GLC=-4,ROOTPE_ROF=-5,ROOTPE_WAV=-6,ROOTPE_OCN=-8,ROOTPE_ATM=0\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
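The correspondence between the two layouts can be checked with a little shell arithmetic. The sketch below uses a hypothetical helper, to_tasks, which multiplies the magnitude of a negative (whole-node) value by the 128 tasks per node given above and leaves non-negative values unchanged; it is an illustration only and is not part of CIME.

```bash
# Tasks per node, matching the MAX_TASKS_PER_NODE value quoted above
tasks_per_node=128

# Hypothetical helper: convert a whole-node (negative) value to a task count
to_tasks() {
  value=$1
  if [ ${value} -lt 0 ]; then
    echo $(( (0 - ${value}) * ${tasks_per_node} ))
  else
    echo ${value}
  fi
}

to_tasks -8   # NTASKS_ATM=-8
to_tasks -4   # NTASKS_OCN=-4
to_tasks -6   # ROOTPE_WAV=-6
```

The three calls correspond to NTASKS_ATM=1024, NTASKS_OCN=512 and ROOTPE_WAV=768 in the explicit layout.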
"},{"location":"research-software/cesm-further-examples/#waccm-x-fxhist","title":"WACCM-X / FXHIST","text":"The WACCM-X case needs care during the set up and running for a couple of reasons. Firstly, as mentioned in the known issues section on archiving errors the short-term archiver can sometimes move too many files and thus create problems with resubmissions. Secondly, it can pick up other files in the cesm_inputdata directory, causing issues when running. WACCM-X is also comparatively very expensive, and so only has an expected throughput of a little over 1.5 SYPD, and that when on a coarser grid than above. The setup for running a WACCM-X case with approximately 2 degree resolution and no short-term archiving is
${CIMEROOT}/scripts/create_newcase --case [case name] --compset FXHIST --res f19_f19_mg16 --walltime [enough time] --project [project name] --run-unsupported\ncd [case directory]\n./xmlchange NTASKS=512,NTASKS_ESP=1\n./xmlchange NTHRDS=2\n./xmlchange DOUT_S=FALSE\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
"},{"location":"research-software/cesm/","title":"Community Earth System Model (CESM2)","text":"CESM2 is a fully-coupled, community, global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states. It has seven different components: atmosphere, ocean, river run off, sea ice, land ice, waves and adaptive river transport.
Important
CESM is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/cesm/#cesm-213","title":"CESM 2.1.3","text":"At the time of writing, CESM 2.1.3 is the latest scientifically verified version of the model.
"},{"location":"research-software/cesm/#setting-up-cesm-213-on-archer2","title":"Setting up CESM 2.1.3 on ARCHER2","text":"Due to the nature of CESM2, there is not a centrally installed version of the program available on ARCHER2. Instead, users download their own copy of the program and make use of ARCHER2-specific configurations that have been rigorously tested.
The setup process has been streamlined on ARCHER2 and can be carried out by following the instructions on the ARCHER2 CESM2.1.3 setup page
"},{"location":"research-software/cesm/#using-cesm-213-on-archer2","title":"Using CESM 2.1.3 on ARCHER2","text":"A quickstart guide for running a simple coupled case of CESM 2.1.3 on ARCHER2 can be found here. It should be noted that this is only a quickstart guide with a focus on the way that CESM 2.1.3 should be run specifically on ARCHER2, and is not intended to replace the larger CESM or CIME documentation linked to below.
"},{"location":"research-software/cesm/#useful-links","title":"Useful Links","text":""},{"location":"research-software/cesm/#documentation","title":"Documentation","text":"If this is your first time running CESM2, it is highly recommended that you consult both the CIME documentation and the NCAR CESM pages for the version used in CESM 2.1.3, paying particular attention to the pages on Basic Usage of CIME which gives detailed description of the basic commands needed to get a model running.
"},{"location":"research-software/cesm/#compsets-and-configurations","title":"Compsets and Configurations","text":"CESM2 allows simulations to be carried out using a very wide range of configurations. If you are new to CESM2 it is highly recommended that, unless you are running a case you are already familiar with, you consult the CESM2.1 Configurations page. You can also see a list of the defined compsets already available on the component set definitions page. More information about configurations, grids and compsets can be found on the CESM2 Configurations and Grids page, which includes links to the configuration settings of the different components.
"},{"location":"research-software/cesm213_run/","title":"Quick Start: CESM Model Workflow (CESM 2.1.3)","text":"This is the procedure for quickly setting up and running a simple CESM2 case on ARCHER2. This document is based on the general quickstart guide for CESM 2.1, with modifications to give instructions specific to ARCHER2. For more expansive instructions on running CESM 2.1, please consult the NCAR CESM pages
Before following these instructions, ensure you have completed the setup procedure (see Setting up CESM2 on ARCHER2).
For your target case, the first step is to select a component set, and a resolution for your case. For the purposes of this guide, we will be looking at a simple coupled case using the B1850
compset and the f19_g17
resolution.
The current configuration of CESM 2.1.3 on ARCHER2 has been validated with the F2000 (atmosphere only), ETEST (slab ocean), B1850 (fully coupled) and FX2000 (WACCM-X) compsets. Instructions for these are here: CESM2.1.3 further examples.
Details of available component sets and resolutions are available from the query_config tool located in the my_cesm_sandbox/cime/scripts
directory
cd my_cesm_sandbox/cime/scripts\n./query_config --help\n
See the supported component sets, supported model resolutions and supported machines for a complete list of CESM2 supported component sets, grids and computational platforms.
Note: Variables presented as $VAR
in this guide typically refer to variables in XML files in a CESM case. From within a case directory, you can determine the value of such a variable with ./xmlquery VAR
. In some instances, $VAR
refers to a shell variable or some other variable; we try to make these exceptions clear.
There are three stages to preparing the case: create, setup and build. Here you can find information on each of these steps
"},{"location":"research-software/cesm213_run/#1-create-a-case","title":"1. Create a case","text":"The create_newcase command creates a case directory containing the scripts and XML files to configure a case (see below) for the requested resolution, component set, and machine. create_newcase has three required arguments: --case
, --compset
and --res
(invoke create_newcase --help for help).
On machines where a project or account code is needed (including ARCHER2), you must either specify the --project
argument to create_newcase or set the $PROJECT
variable in your shell environment.
If running on a supported machine, that machine will normally be recognized automatically and therefore it is not required to specify the --machine
argument to create_newcase. For CESM 2.1.3, ARCHER2 is classed as an unsupported machine; however, the configurations for ARCHER2 are included in the version of cime downloaded in the setup process, and so adding the --machine
flag should not be necessary.
Invoke create_newcase as follows:
./create_newcase --case CASENAME --compset COMPSET --res GRID --project PROJECT\n
where:
CASENAME
defines the name of your case (stored in the $CASE
XML variable). This is a very important piece of metadata that will be used in filenames, internal metadata and directory paths. create_newcase will create the case directory with the same name as the CASENAME
. If CASENAME
is simply a name (not a path), the case directory is created in the directory where you executed create_newcase. If CASENAME
is a relative or absolute path, the case directory is created there, and the name of the case will be the last component of the path. The full path to the case directory will be stored in the $CASEROOT
XML variable. See CESM2 Experiment Casenames for details regarding CESM experiment case naming conventions.COMPSET
is the component set.GRID
is the model resolution.PROJECT
is your project code on ARCHER2. Here is an example on ARCHER2 with the CESM2 module loaded:
$CIMEROOT/scripts/create_newcase --case $CESM_ROOT/runs/b.e20.B1850.f19_g17.test --compset B1850 --res f19_g17 --project n02\n
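When CASENAME is given as a path, the case name is the last component of that path. In shell terms this is the same rule that basename applies, as the small sketch below illustrates. The relative path is only an example, and create_newcase derives the name itself, so this is not something you need to run:

```shell
# Illustrative path only; create_newcase derives the case name itself.
case_path="runs/b.e20.B1850.f19_g17.test"

# The case name is the last component of the path.
case_name=$(basename "$case_path")
echo "$case_name"
```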
"},{"location":"research-software/cesm213_run/#2-setting-up-the-case-run-script","title":"2. Setting up the case run script","text":"Issuing the case.setup command creates scripts needed to run the model along with namelist user_nl_xxx
files, where xxx denotes the set of components for the given case configuration. Before invoking case.setup, modify the env_mach_pes.xml
file in the case directory using the xmlchange command as needed for the experiment.
cd to the case directory. Following the example from above:
cd $CESM_ROOT/runs/b.e20.B1850.f19_g17.test\n
Invoke the case.setup command.
./case.setup\n
If any changes are made to the case, case.setup can be re-run using
./case.setup --reset\n
"},{"location":"research-software/cesm213_run/#3-build-the-executable-using-the-casebuild-command","title":"3. Build the executable using the case.build command","text":"Run the build script.
./case.build\n
The build may take a while to run, and there may be periods where the build process doesn't seem to be doing anything. You should only cancel the build if there has been no activity by the build script after 15 minutes.
The CESM executable will appear in the directory given by the XML variable $EXEROOT
, which can be queried using:
./xmlquery EXEROOT\n
By default, this will be the bld
directory in your case directory.
If any changes are made to xml parameters that would necessitate rebuilding (see the Making Changes section below), then you can apply these by running
./case.setup --reset\n./case.build --clean-all\n./case.build\n
"},{"location":"research-software/cesm213_run/#input-data","title":"Input Data","text":"Each case of CESM will require input data, which is downloaded from UCAR servers. Input data from similar compsets is often reused, so running two similar cases may not require downloading any additional input data for the second case.
You can check to see if the required input data is already in your input data directory using
./check_input_data\n
If it is not present you can download the input data for the case prior to running the case using
./check_input_data --download\n
This can be useful for cases where a large amount of data is needed, as you can write a simple slurm script to run this download on the serial queue. Information on creating job submission scripts can be found on the ARCHER2 page on Running Jobs.
Downloading the case input data at this stage is optional. If skipped, the data will be downloaded on the login node when you run the case.submit script, which may cause case.submit to take a long time to complete.
An important thing to note is that your input data will be stored in your /work area, and will contribute to your storage allocation. These input files can sometimes take up a large amount of space, and so it is recommended that you do not keep any input data that is no longer needed.
"},{"location":"research-software/cesm213_run/#making-changes-to-a-case","title":"Making changes to a case","text":"After creating a new case, the CIME functions can be used to make changes to the case setup, such as changing the wallclock time, number of cores etc.
You can query settings using the xmlquery script from your case directory:
./xmlquery <name_of_setting>\n
Adding the -p
flag allows you to look up partial names, for example
$ ./xmlquery -p JOB\n\nOutput:\nResults in group case.run\n JOB_QUEUE: standard\n JOB_WALLCLOCK_TIME: 01:30:00\n\nResults in group case.st_archive\n JOB_QUEUE: short\n JOB_WALLCLOCK_TIME: 0:20:00\n
Here all parameters that match the JOB
pattern are returned. It is worth noting that the parameters JOB_QUEUE
and JOB_WALLCLOCK_TIME
are present for both the case.run job and the case.st_archive job. To view just one of these, you can use the --subgroup
flag:
$ ./xmlquery -p JOB --subgroup case.run\n\nOutput:\nResults in group case.run\n JOB_QUEUE: standard\n JOB_WALLCLOCK_TIME: 01:30:00\n
When you know which setting you want to change, you can do so using the xmlchange command
./xmlchange <name_of_setting>=<new_value>\n
For example to change the wallclock time for the case.run job to 30 minutes, without knowing the exact name, you could do
$ ./xmlquery -p WALLCLOCK\n\nOutput:\nResults in group case.run\n JOB_WALLCLOCK_TIME: 24:00:00\n\nResults in group case.st_archive\n JOB_WALLCLOCK_TIME: 0:20:00\n\n$ ./xmlchange JOB_WALLCLOCK_TIME=00:30:00 --subgroup case.run\n\n$ ./xmlquery JOB_WALLCLOCK_TIME\n\nOutput:\nResults in group case.run\n JOB_WALLCLOCK_TIME: 00:30:00\n\nResults in group case.st_archive\n JOB_WALLCLOCK_TIME: 0:20:00\n
Note: If you try to set a parameter equal to a value that is not known to the program, it might suggest using a --force
flag. This may be useful, for example, in the case of using a queue that has not been configured yet, but use with care!
Some changes to the case must be done before calling ./case.setup
or ./case.build
, otherwise the case will need to be reset or cleaned, using ./case.setup --reset
and ./case.build --clean-all
. These are as follows:
Before calling ./case.setup
, changes to NTASKS
, NTHRDS
, ROOTPE
, PSTRID
and NINST
must be made, as well as any changes to the env_mach_specific.xml
file, which contains some configuration for the module environment and environment variables.
Before calling ./case.build
, ./case.setup
must have been called and any changes to env_build.xml
and Macros.make
must have been made. This includes whether you have edited the file directly, or used ./xmlchange
to alter the variables.
Many of the namelist variables can be changed just before calling ./case.submit
.
Modify runtime settings in env_run.xml
(optional). At this point you may want to change the running parameters of your case, such as run length. By default, the model is set to run for 5 days based on the $STOP_N
and $STOP_OPTION
variables:
./xmlquery STOP_OPTION,STOP_N\n
These default settings can be useful in troubleshooting runtime problems before submitting for a longer time, but will not allow the model to run long enough to produce monthly history climatology files. In order to produce history files, increase the run length to a month or longer:
./xmlchange STOP_OPTION=nmonths,STOP_N=1\n
If you want a longer run, for example 30 years, this cannot be done in a single job as the amount of wallclock time required would be considerably longer than the maximum allowed by the ARCHER2 queue system. To do this, you would split the simulation into appropriate chunks, such as 6 chunks of 5 years (assuming a simulated years per day (SYPD) of greater than 5 - some values for SYPD on ARCHER2 are given in the further examples page). Using the $RESUBMIT
xml variable and setting the values of the $STOP_OPTION
and $STOP_N
variables accordingly you can then chain the running of these chunks:
./xmlchange RESUBMIT=5,STOP_OPTION=nyears,STOP_N=5\n
This would run the initial job followed by 5 resubmissions, giving 6 chunks of 5 years in total, with each new job picking up where the previous job stopped. For more information about this, see the user guide page on running a case.
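To choose a chunk length, it can help to estimate how much wallclock time one chunk needs from the expected throughput. A rough sketch, where chunk_hours is a hypothetical helper and the 10 SYPD figure is only an example value; compare the result with the maximum walltime of the queue you intend to use:

```shell
# chunk_hours is a hypothetical helper, not a CIME command.
# $1: simulated years in one chunk
# $2: expected throughput in simulated years per day (SYPD)
chunk_hours() {
    awk -v years="$1" -v sypd="$2" 'BEGIN { printf "%.1f\n", years / sypd * 24 }'
}

# Estimated wallclock hours for a 5-year chunk at 10 SYPD.
chunk_hours 5 10
```

With these example values a chunk needs about 12 hours. The estimate ignores queue time, initialisation and archiving, so leave some margin when setting JOB_WALLCLOCK_TIME.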
Once you have set your job to run for the correct length of time, it is a good idea to check the correct amount of resource is available for the job. You can quickly check the job submission parameters by running
./preview_run\n
which will show you at a glance the wallclock times, job queues and the list of jobs to be submitted, as well as other parameters such as the number of MPI tasks, number of OpenMP threads.
Submit the job to the batch queue using the case.submit command.
./case.submit\n
The case.submit script will submit a job called .case.run, and if $DOUT_S
is set to TRUE
it will also submit a short-term archiving job. By default, the queue these jobs are submitted to is the standard
queue. For information on the resources available on each queue, see the QOS guide.
Note: There is a small possibility that your job may initially fail with the error message ERROR: Undefined env var 'CESM_ROOT'
. This could have two causes: 1. You do not have the CESM2/2.1.3 module loaded. This module needs to be loaded when running the case as well as when building the case. Try running again after having run module load CESM2/2.1.3
2. This could also be due to a known issue with ARCHER2 where adding the SBATCH directive export=ALL
to a slurm script will not work (see the ARCHER2 known issues entry on the subject). The ARCHER2 configuration included in the version of cime that was downloaded during setup should apply a work-around to this, and so you should not see this error in this case. It may still occur in some corner cases however. To avoid this, ensure that the environment from which you are submitting your case has the CESM2/2.1.3 module loaded and run the case.submit script with the following command
./case.submit -a=--export=ALL\n
When the job is complete, most output will not be written under the case directory, but under other directories. Review the following directories and files, whose locations can be found with xmlquery (note: xmlquery can be run with a list of comma separated names and no spaces):
./xmlquery RUNDIR,CASE,CASEROOT,DOUT_S,DOUT_S_ROOT\n
$RUNDIR
This directory is set in the env_run.xml
file. This is the location where CESM2 was run. There should be log files there for every component (i.e. of the form cpl.log.yymmdd-hhmmss) if $DOUT_S == FALSE
. Each component writes its own log file. Also see whether any restart or history files were written. To check that a run completed successfully, check the last several lines of the cpl.log file for the string \\\"SUCCESSFUL TERMINATION OF CPL7-cesm\\\".
$DOUT_S_ROOT/$CASE
$DOUT_S_ROOT
refers to the short-term archive path location on local disk. This path is used by the case.st_archive script when $DOUT_S = TRUE
. See CESM Model Output File Locations for details regarding the component model output filenames and locations.
$DOUT_S_ROOT/$CASE
is the short-term archive directory for this case. If $DOUT_S
is FALSE, then no archive directory should exist. If $DOUT_S
is TRUE, then log, history, and restart files should have been copied into a directory tree here.
$DOUT_S_ROOT/$CASE/logs
The log files should have been copied into this directory if the run completed successfully and the short-term archiver is turned on with $DOUT_S = TRUE
. Otherwise, the log files are in the $RUNDIR
.
$CASEROOT
There could be standard out and/or standard error files output from the batch system.
$CASEROOT/CaseDocs
The case namelist files are copied into this directory from the $RUNDIR
.
$CASEROOT/timing
There should be two timing files there that summarize the model performance.
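The success check described for $RUNDIR above can be scripted. A minimal sketch, where run_succeeded is a hypothetical helper and the choice of the last 20 lines is an assumption standing in for "the last several lines" of the coupler log:

```shell
# run_succeeded is a hypothetical helper, not a CIME command.
# $1: path to a coupler log file (cpl.log.*)
# Returns 0 if the success message appears near the end of the log.
run_succeeded() {
    tail -n 20 "$1" | grep -q "SUCCESSFUL TERMINATION OF CPL7-cesm"
}

# Example use (the log path is a placeholder):
#   if run_succeeded "[path to cpl.log file]"; then echo "run completed"; fi
```

If the short-term archiver is enabled, point the helper at the copy of the log under $DOUT_S_ROOT/$CASE/logs instead; if that copy has been compressed, decompress it first.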
As CESM jobs are submitted to the ARCHER2 batch system, they can be monitored in the same way as other jobs, using the command
squeue -u $USER\n
You can get more details about the batch scheduler by consulting the ARCHER2 scheduling guide.
"},{"location":"research-software/cesm213_run/#archiving","title":"Archiving","text":"The CIME framework allows for short-term and long-term archiving of model output. This is particularly useful when the model is configured to output to a small storage space and large files may need to be moved during larger simulations. On ARCHER2, the model is configured to use short-term archiving, but not yet configured for long-term archiving.
Short-term archiving is on by default and can be switched on or off by setting the DOUT_S parameter to TRUE or FALSE with the xmlchange script:
./xmlchange DOUT_S=FALSE\n
When DOUT_S=TRUE
, calling ./case.submit will automatically submit a \u201cst_archive\u201d job to the batch system that will be held in the queue until the main job is complete. This can be configured in the same way as the main job for a different queue, wallclock time, etc. One change that may be advisable to make would be to change the queue your st_archive job is submitted to, as archiving does not require a large amount of resources and the short and serial queues on ARCHER2 do not use your project allowance. This would be done using the xmlchange script almost the same as for the case.run job. Note that the main job and the archiving job share some parameter names such as JOB_QUEUE
, and so a flag (--subgroup) specifying which you want to change should be used, as below:
./xmlchange JOB_QUEUE=short --subgroup case.st_archive\n
If the --subgroup
flag is not used, then the JOB_QUEUE
value for both the case.run and case.st_archive jobs will be changed. You can verify that they are different by running
./xmlquery JOB_QUEUE\n
which will show the value of this parameter for both jobs.
The archive is set up to move .nc
files and logs from $CESM_ROOT/runs/$CASE
to $CESM_ROOT/archive/$CASE
. As such, your /work
storage quota is used whether archiving is switched on or off, so it is recommended that data you wish to retain be moved to another service such as a group workspace on JASMIN. See the Data Management and Transfer guide for more information on archiving data from ARCHER2. If you want to archive your files directly to a different location than the default, this can be set using the $DOUT_S_ROOT
parameter.
If a run fails, the first place to check is the run submission output file, usually located at
$CASEROOT/run.$CASE\n
so, for the example job run in this guide, the output file will be at
$CESM_ROOT/runs/b.e20.B1850.f19_g17.test/run.b.e20.B1850.f19_g17.test\n
If any errors have occurred, the location of the relevant log in which you can examine this error will be printed towards the end of this output file. The log will usually be located at
$CASEROOT/run/cesm.log.*\n
so in this case, the path would be
$CESM_ROOT/runs/b.e20.B1850.f19_g17.test/run/cesm.log.*\n
"},{"location":"research-software/cesm213_run/#known-issues-and-common-problems","title":"Known Issues and Common Problems","text":""},{"location":"research-software/cesm213_run/#input-data-errors","title":"Input data errors","text":"Occasionally, the input data for a case is not downloaded correctly. Unfortunately, in these cases the checksum test run by the check_input_data
script will not catch the corrupted fields in the file. The error message displayed can vary somewhat, but a common error message is
ERROR timeaddmonths(): MM out of range\n
You can often spot these errors by examining the log as described above, as the error will occur shortly after a file has been read. If this happens, delete the file in question from your cesm_inputdata
directory and rerun
./check_input_data --download\n
to ensure that the data is downloaded correctly."},{"location":"research-software/cesm213_run/#sigfpe-errors","title":"SIGFPE errors","text":"If running a case with the DEBUG flag enabled, you may see some SIGFPE errors. In this case, the traceback shown in the logs will show the error as originating in one of three places:
This problem is caused by 'short-circuit' logic in the affected files, where there may be a conditional of the form
if (A .and. B) then....\n
where B cannot be properly evaluated if A fails, for example if ( x /= 0 .and. y/x > c ) then....\n
which would result in a divide-by-zero error if the second condition were evaluated after the first condition had already failed. In standard simulations the second condition is skipped in these cases; however, if the user has set
./xmlchange DEBUG=TRUE\n
then the second condition will not be skipped and a SIGFPE error will occur.
If encountering these errors, a user can do one of two things. The simplest solution is to turn off the DEBUG flag with
./xmlchange DEBUG=FALSE\n
If this option is not possible however, and your simulation absolutely needs to be run in DEBUG mode, then the conditional can be modified in the program code. THIS IS DONE AT YOUR OWN RISK!!! The fix that has been applied for the WW3 component can be seen here. It is recommended that if you are making any changes to the code for this reason, that you revert your changes back once you no longer need to run your case in DEBUG mode."},{"location":"research-software/cesm213_run/#sigsegv-errors","title":"SIGSEGV errors","text":"Sometimes an error will occur where a run is ended prematurely and gives an error of the form
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.\n
This can often be solved by increasing the amount of available memory per task, either by changing the maximum number of MPI tasks per node by using
./xmlchange MAX_TASKS_PER_NODE=64\n
or by increasing the number of threads used by using
./xmlchange NTHRDS=2\n
Either change doubles the amount of memory available to each MPI task.
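The effect of reducing the number of tasks per node can be estimated with simple arithmetic. In the sketch below, mem_per_task_gb is a hypothetical helper and the figure of 256 GB per node is an assumption; check the ARCHER2 hardware documentation for the nodes you are using:

```shell
# mem_per_task_gb is a hypothetical helper for a rough estimate.
# $1: memory per node in GB (256 is assumed below)
# $2: MPI tasks placed on each node
mem_per_task_gb() {
    echo $(( $1 / $2 ))
}

mem_per_task_gb 256 128   # fully populated node
mem_per_task_gb 256 64    # after ./xmlchange MAX_TASKS_PER_NODE=64
```

Under these assumptions the estimate rises from 2 GB to 4 GB per task. Keep in mind that running fewer tasks per node leaves cores idle unless threads are used, so the same number of tasks occupies more nodes.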
"},{"location":"research-software/cesm213_run/#archiving-errors","title":"Archiving Errors","text":"When running WACCM-X cases (compsets starting FX*), there can sometimes be problems when running restart jobs. This is caused by the short-term archiving job mistakenly moving files needed for restarts to the archive. To ensure this does not happen, it can be a good idea when running WACCM-X simulations to turn off the short-term archiver using
./xmlchange DOUT_S=FALSE\n
While this behaviour has so far only been observed for WACCM-X jobs, it is possible that it can occur with other compsets.
"},{"location":"research-software/cesm213_run/#job-failing-instantly-with-undefined-environment-variable","title":"Job Failing instantly with undefined environment variable","text":"There is a small possibility that your job may initially fail with the error message
ERROR: Undefined env var 'CESM_ROOT'\n
This could have two causes: 1. You do not have the CESM2/2.1.3 module loaded. This module needs to be loaded when running the case as well as when building the case. Try running again after having run module load CESM2/2.1.3
2. This could also be due to a known issue with ARCHER2 where adding the SBATCH directive export=ALL
to a slurm script will not work (see the ARCHER2 known issues entry on the subject). The ARCHER2 configuration included in the version of cime that was downloaded during setup should apply a work-around to this, and so you should not see this error in this case. It may still occur in some corner cases however. To avoid this, ensure that the environment from which you are submitting your case has the CESM2/2.1.3 module loaded and run the case.submit script with the following command ./case.submit -a=--export=ALL\n
"},{"location":"research-software/cesm213_setup/","title":"First-Time setup of CESM 2.1.3","text":"Important
These instructions are intended for users of the n02
project. Downloads may be incomplete if you are not a member of n02
.
Due to the nature of the CESM program, a centrally installed version of the code is not provided on ARCHER2. Instead, a user needs to download and set up the program themselves in their /work
area. The installation is done in three steps:
After setup, CESM is ready to run a simple case.
"},{"location":"research-software/cesm213_setup/#downloading-cesm-213-and-setting-up-the-directory-structure","title":"Downloading CESM 2.1.3 And Setting Up The Directory Structure","text":"For ease of use, a setup script has been created which downloads CESM 2.1.3, creates the directory structure needed for running CESM2 cases and creates a hidden file in your home directory containing environment variables needed by CESM.
To execute this script, run the following in an ARCHER2 terminal:
module load cray-python\nsource /work/n02/shared/CESM2/setup_cesm213.sh\n
This script will create a directory, defaulting to /work/$GROUP/$GROUP/$USER/cesm/CESM2.1.3
, where $GROUP
is your default group, for example n02, and populate it with the following subdirectories: * archive
- short-term archiving for completed runs, * ccsm_baselines
- baseline files, * cesm_inputdata
- input data downloaded and used when running cases, * runs
- location of the case files used when running a case, * cesm directory - location of the cesm source code and the various components. Defaults to my_cesm_sandbox
The default locations for the CESM root directory and the CESM location can be overridden during installation either by entering new paths at runtime when prompted or by providing them as command line arguments, for example
source /work/n02/shared/CESM2/setup_cesm213.sh -p /work/n03/n03/$USER/CESM213 -l cesm_prog\n
"},{"location":"research-software/cesm213_setup/#manual-setup-instructions","title":"Manual setup instructions","text":"If you have trouble with running the setup script, you can install manually by running the following commands:
PREFIX=\"path/to/your/desired/cesm/root/location\"\nCESM_DIR_LOC=\"name_of_install_directory_for_cesm\"\n\nmkdir -p $PREFIX\ncd $PREFIX\nmkdir -p archive\nmkdir -p ccsm_baselines\nmkdir -p cesm_inputdata\nmkdir -p runs\n\nCESM_LOC=$PREFIX/$CESM_DIR_LOC\n\ngit clone -b release-cesm2.1.3 https://github.com/ESCOMP/CESM.git $CESM_LOC\ncd $CESM_LOC\ngit checkout release-cesm2.1.3\n\ntee ${HOME}/.cesm213 <<EOF > /dev/null\n### CESM 2.1.3 on ARCHER2 Path File\n### Do Not Edit This File Unless You Know What You Are Doing\nCIME_MODEL=cesm\nCESM_ROOT=$PREFIX\nCESM_LOC=$PREFIX/$CESM_DIR_LOC\nCIMEROOT=$PREFIX/$CESM_DIR_LOC/cime\nEOF\n\necho \"module use /work/n02/shared/CESM2/module\" >> ~/.bashrc\nmodule use /work/n02/shared/CESM2/module\nmodule load CESM2/2.1.3\n
"},{"location":"research-software/cesm213_setup/#linking-and-downloading-components","title":"Linking And Downloading Components","text":"CESM utilises multiple components, including CAM (atmosphere), CICE (sea ice), CISM (ice sheets), CTSM (land), MOSART (adaptive river transport), POP2 (ocean), RTM (river transport) and WW3 (waves), all of which are connected using the Common Infrastructure for Modelling the Earth (CIME). These components are hosted on github, and during the setup process they are downloaded.
Before downloading the external components, you must first modify the file $CESM_LOC/Externals.cfg
. This will change the version of CIME from the default cime 5.6.32 to the maintained cime 5.6 branch. This is done by modifying the file so that the cime section goes from
[cime]\ntag = cime5.6.32\nprotocol = git\nrepo_url = https://github.com/ESMCI/cime\nlocal_path = cime\nrequired = True\n
to
[cime]\nbranch = maint-5.6\nprotocol = git\nrepo_url = https://github.com/ESMCI/cime\nlocal_path = cime\nexternals = Externals_cime.cfg\nrequired = True\n
In the same $CESM_LOC/Externals.cfg
file, also update the version of CAM:
[cam]\ntag = cam_cesm2_1_rel_41\nprotocol = git\nrepo_url = https://github.com/ESCOMP/CAM\nlocal_path = components/cam\nexternals = Externals_CAM.cfg\nrequired = True\n
to
[cam]\ntag = cam_cesm2_1_rel\nprotocol = git\nrepo_url = https://github.com/ESCOMP/CAM\nlocal_path = components/cam\nexternals = Externals_CAM.cfg\nrequired = True\n
Making these changes brings in the configurations for ARCHER2 along with some bug fixes.
Once this has been done you are free to download the external components by executing the commands
cd $CESM_LOC\n./manage_externals/checkout_externals\n
The first time you run the checkout_externals script, you may be asked to accept a certificate, and you may also get an error of the form
svn: E120108: Error running context: The server unexpectedly closed the connection.\n
If this happens, rerun the checkout_externals script and it should download the external components correctly."},{"location":"research-software/cesm213_setup/#building-cprnc","title":"Building cprnc","text":"cprnc is a generic tool for analyzing a netcdf file or comparing two netcdf files. It is used in various places by CESM and the source is included with cime.
To build, execute the following commands
module load CESM2/2.1.3\ncd $CIMEROOT/tools/cprnc\ncmake . -DNetCDF_Fortran_LIBRARIES=libnetcdff.so -DNetCDF_C_LIBRARIES=libnetcdf.so\nmake\n
You are now ready to run a simple test case!
"},{"location":"research-software/chemshell/","title":"ChemShell","text":"ChemShell is a script-based chemistry code focusing on hybrid QM/MM calculations with support for standard quantum chemical or force field calculations. There are two versions: an older Tcl-based version Tcl-ChemShell and a more recent python-based version Py-ChemShell.
The advice from https://www.chemshell.org/licence on the difference is:
We consider Py-ChemShell 23.0 to be suitable for production calculations on both materials systems and biomolecules, and recommend that new ChemShell users should use the Python-based version.
We continue to maintain the original Tcl-based version of ChemShell and distribute it on request. Tcl-ChemShell currently contains some features that are not yet available in Py-ChemShell (but will be soon!) including a QM/MM MD driver and multiple electronic state calculations. At the present time if you need this functionality you will need to obtain a licence for Tcl-Chemshell.
"},{"location":"research-software/chemshell/#useful-links","title":"Useful Links","text":"The python-based version of ChemShell is open-source and is freely available to all users on ARCHER2. The version of Py-ChemShell pre-installed on ARCHER2 is compiled with NWChem and GULP as libraries.
Warning
Py-ChemShell on ARCHER2 is compiled with GULP 6.0. This is a licenced software that is free to use for academics. If you are not an academic user (or if you are using Py-ChemShell for non-academic work), please ensure that you have the correct GULP licence before using GULP functionalities in py-ChemShell or make sure that you are not using any of the GULP functionalities in your code (i.e., do not set theory=GULP in your calculations).
"},{"location":"research-software/chemshell/#running-parallel-py-chemshell-jobs","title":"Running parallel Py-ChemShell jobs","text":"Unlike most other ARCHER2 software packages, the Py-ChemShell module is built in such a way as to enable users to create and submit jobs to the compute nodes by running a chemsh
script from the login node rather than by creating and submitting a Slurm submission script. Below is an example command for submitting a pure MPI Py-ChemShell job running on 8 nodes (128x8 cores) with the chemsh
command:
# Run this from the login node\n module load py-chemshell\n\n # Replace [budget code] below with your project code (e.g. t01)\n chemsh --submit \\\n --jobname pychmsh \\\n --account [budget code] \\\n --partition standard \\\n --qos standard \\\n --walltime 0:10:0 \\\n --nnodes 8 \\\n --nprocs 1024 \\\n py-chemshell-job.py\n
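The value given to --nprocs in the command above is the node count multiplied by the 128 cores per node. A minimal sketch of that arithmetic in plain bash follows; the variable names are illustrative only and are not part of chemsh.

```shell
#!/bin/bash
# Illustrative variable names; only the numbers come from the example above.
nnodes=8
cores_per_node=128

# One MPI process per core for a pure MPI job.
nprocs=$(( nnodes * cores_per_node ))
echo chemsh options: --nnodes $nnodes --nprocs $nprocs
```

Changing nnodes and re-running the snippet gives the matching --nprocs value for a different job size.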
"},{"location":"research-software/chemshell/#using-tcl-chemshell-on-archer2","title":"Using Tcl-ChemShell on ARCHER2","text":"The older version of Tcl-based ChemShell requires a license. Users with a valid license should request access via the ARCHER2 SAFE.
"},{"location":"research-software/chemshell/#running-parallel-tcl-chemshell-jobs","title":"Running parallel Tcl-ChemShell jobs","text":"The following script will run a pure MPI Tcl-based ChemShell job using 8 nodes (128x8 cores).
#!/bin/bash\n\n#SBATCH --job-name=chemshell_test\n#SBATCH --nodes=8\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load tcl-chemshell/3.7.1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread chemsh.x input.chm\n
"},{"location":"research-software/code-saturne/","title":"Code_Saturne","text":"Code_Saturne solves the Navier-Stokes equations for 2D, 2D-axisymmetric and 3D flows, steady or unsteady, laminar or turbulent, incompressible or weakly dilatable, isothermal or not, with scalar transport if required. Several turbulence models are available, from Reynolds-averaged models to large-eddy simulation (LES) models. In addition, a number of specific physical models are also available as \"modules\": gas, coal and heavy-fuel oil combustion, semi-transparent radiative transfer, particle-tracking with Lagrangian modeling, Joule effect, electrics arcs, weakly compressible flows, atmospheric flows, rotor/stator interaction for hydraulic machines.
"},{"location":"research-software/code-saturne/#useful-links","title":"Useful Links","text":"Code_Saturne is released under the GNU General Public Licence v2 and so is freely available to all users on ARCHER2.
You can load the default GCC build of Code_Saturne for use by running the following command:
module load code_saturne\n
This will load the default code_saturne/7.0.1-gcc11
module. A build using the CCE compilers, code_saturne/7.0.1-cce12
, has also been made optionally available to users on the full ARCHER2 system as testing indicates that this may provide improved performance over the GCC build.
After setting up a case it should be initialized by running the following command from the case directory, where setup.xml is the input file:
code_saturne run --initialize --param setup.xml\n
This will create a directory named for the current date and time (e.g. 20201019-1636) inside the RESU directory. Inside the new directory will be a script named run_solver. You may alter this to resemble the script below, or you may wish to simply create a new one with the contents shown.
If you wish to alter the existing run_solver script you will need to add all the #SBATCH
options shown to set the job name, size and so on. You should also add the two module
commands, and srun --distribution=block:block --hint=nomultithread
as well as the --mpi
option to the line executing ./cs_solver
to ensure parallel execution on the compute nodes. The export LD_LIBRARY_PATH=...
and cd
commands are redundant and may be retained or removed.
This script will run an MPI-only Code_Saturne job using the default GCC build and UCX over 4 nodes (128 x 4 = 512 cores) for a maximum of 20 minutes.
#!/bin/bash\n#SBATCH --export=none\n#SBATCH --job-name=CSExample\n#SBATCH --time=0:20:0\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the GCC build of Code_Saturne 7.0.1\nmodule load cpe/21.09\nmodule load PrgEnv-gnu\nmodule load code_saturne\n\n# Switch to mpich-ucx implementation (see info note below)\nmodule swap craype-network-ofi craype-network-ucx\nmodule swap cray-mpich cray-mpich-ucx\n\n# Prevent threading.\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Run solver.\nsrun --distribution=block:block --hint=nomultithread ./cs_solver --mpi $@\n
The script can then be submitted to the batch system with sbatch
.
Info
There is a known issue with the default MPI collectives which is causing performance issues on Code_Saturne. The suggested workaround is to switch to the mpich-ucx implementation. For this to link correctly on the full system, the extra cpe/21.09
and PrgEnv-gnu
modules also have to be explicitly loaded.
The latest instructions for building Code_Saturne on ARCHER2 may be found in the GitHub repository of build instructions:
CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. CP2K provides a general framework for different modelling methods such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO), and classical force fields (AMBER, CHARMM). CP2K can do simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or dimer method.
"},{"location":"research-software/cp2k/#useful-links","title":"Useful links","text":"CP2K is available through the cp2k
module. MPI only cp2k.popt
and MPI/OpenMP Hybrid cp2k.psmp
binaries are available.
For ARCHER2, CP2K has been compiled with the following optional features: FFTW
for fast Fourier transforms, libint
to enable methods including Hartree-Fock exchange, libxc
to provide a wider choice of exchange-correlation functionals, ELPA
for improved performance of matrix diagonalisation, PLUMED
to allow enhanced sampling methods.
See CP2K compile instructions for a full list of optional features.
If there is an optional feature not available, and which you would like, please contact the Service Desk. Experts may also wish to compile their own versions of the code (see below for instructions).
"},{"location":"research-software/cp2k/#running-parallel-cp2k-jobs","title":"Running parallel CP2K jobs","text":""},{"location":"research-software/cp2k/#mpi-only-jobs","title":"MPI only jobs","text":"To run CP2K using MPI only, load the cp2k
module and use the cp2k.psmp
executable.
For example, the following script will run a CP2K job using 4 nodes (128x4 cores):
#!/bin/bash\n\n# Request 4 nodes using 128 cores per node for 128 MPI tasks per node.\n\n#SBATCH --job-name=CP2K_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant CP2K module\nmodule load cp2k\n\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --hint=nomultithread --distribution=block:block cp2k.psmp -i MYINPUT.inp\n
"},{"location":"research-software/cp2k/#mpiopenmp-hybrid-jobs","title":"MPI/OpenMP hybrid jobs","text":"To run CP2K using MPI and OpenMP, load the cp2k
module and use the cp2k.psmp
executable.
#!/bin/bash\n\n# Request 4 nodes with 16 MPI tasks per node each using 8 threads;\n# note this means 64 MPI tasks in total.\n# Remember to replace [budget code] below with your account code,\n# e.g. '--account=t01'.\n\n#SBATCH --job-name=CP2K_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=16\n#SBATCH --cpus-per-task=8\n#SBATCH --time=00:20:00\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant CP2K module\nmodule load cp2k\n\n# Ensure OMP_NUM_THREADS is consistent with cpus-per-task above\nexport OMP_NUM_THREADS=8\nexport OMP_PLACES=cores\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --hint=nomultithread --distribution=block:block cp2k.psmp -i MYINPUT.inp\n
"},{"location":"research-software/cp2k/#compiling-cp2k","title":"Compiling CP2K","text":"The latest instructions for building CP2K on ARCHER2 may be found in the GitHub repository of build instructions:
CRYSTAL is a general-purpose program for the study of crystalline solids. The CRYSTAL program computes the electronic structure of periodic systems within Hartree Fock, density functional or various hybrid approximations (global, range-separated and double-hybrids). The Bloch functions of the periodic systems are expanded as linear combinations of atom centred Gaussian functions. Powerful screening techniques are used to exploit real space locality. Restricted (Closed Shell) and Unrestricted (Spin-polarized) calculations can be performed with all-electron and valence-only basis sets with effective core pseudo-potentials. The current release is CRYSTAL23.
Important
CRYSTAL is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/crystal/#useful-links","title":"Useful Links","text":"CRYSTAL is only available to users who have a valid CRYSTAL license. You request access through SAFE:
Please have your license details to hand.
"},{"location":"research-software/crystal/#running-parallel-crystal-jobs","title":"Running parallel CRYSTAL jobs","text":"The following script will run CRYSTAL using pure MPI for parallelisation using 256 MPI processes, 1 per core across 2 nodes. It assumes that the input file is tio2.d12
#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --time=0:20:00\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load crystal/23-1.0.1-2\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Change this to the name of your input file\ncp tio2.d12 INPUT\n\nsrun --hint=nomultithread --distribution=block:block MPPcrystal\n
An equivalent 2 node job using MPI+OpenMP parallelism with 4 threads per MPI process, 64 MPI processes, 1 thread per core across 2 nodes would be:
#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --time=0:20:00\n#SBATCH --ntasks-per-node=32\n#SBATCH --cpus-per-task=4\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load crystal/23-1.0.1-2\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Change this to the name of your input file\ncp tio2.d12 INPUT\n\nexport OMP_NUM_THREADS=4\nexport OMP_PLACES=cores\nexport OMP_STACKSIZE=16M\n\nsrun --hint=nomultithread --distribution=block:block MPPcrystalOMP\n
"},{"location":"research-software/crystal/#tips-and-known-issues","title":"Tips and known issues","text":""},{"location":"research-software/crystal/#cpu-frequency","title":"CPU frequency","text":"You should run some short (1 or 2 SCF cycles) jobs to test the scaling of your job so you can decide on the balance between cost to your budget and the time it takes to get a result. You now should include a few tests at different clock rates as part of this process.
Based on a few simple tests we have run it is likely that jobs dominated by building the Kohn-Sham matrix (SHELLX+MONMO3+NUMDFT in the output) will see minimal energy savings and better performance at 2.25GHz. Jobs dominated by the ScaLapack calls (MPP_DIAG in the output) may show useful energy savings at 2.0GHz.
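One way to organise the clock-rate tests is to submit the same short job once per frequency. The sketch below only prints the submission commands so they can be reviewed first. It assumes a job script called crystal_test.slurm (a hypothetical name) and uses the sbatch --cpu-freq option, which takes a value in kHz; print_freq_tests is an illustrative helper, not an ARCHER2 tool.

```shell
#!/bin/bash
# Print, rather than run, one submission command per clock rate.
# crystal_test.slurm is a hypothetical job script name.
print_freq_tests() {
    local script=$1
    local freq
    for freq in 2000000 2250000; do   # 2.0 GHz and 2.25 GHz, in kHz
        echo sbatch --cpu-freq=$freq $script
    done
}

print_freq_tests crystal_test.slurm
```

Removing the echo would submit the jobs directly; comparing their run times and energy use then shows which clock rate suits the calculation.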
"},{"location":"research-software/crystal/#out-of-memory-errors","title":"Out-of-memory errors","text":"Long-running jobs may encounter unexpected errors of the form
slurmstepd: error: Detected 1 oom-kill event(s) in step 411502.0 cgroup.\n
These are related to a memory leak in the underlying libfabric communication layer, which will be fixed in a future release. In the meantime, it should be possible to work around the problem by adding export FI_MR_CACHE_MAX_COUNT=0 \n
to the SLURM submission script."},{"location":"research-software/fhi-aims/","title":"FHI-aims","text":"FHI-aims is an all-electron electronic structure code based on numeric atom-centered orbitals. It enables first-principles simulations with very high numerical accuracy for production calculations, with excellent scalability up to very large system sizes (thousands of atoms) and up to very large, massively parallel supercomputers (ten thousand CPU cores).
"},{"location":"research-software/fhi-aims/#useful-links","title":"Useful Links","text":"FHI-aims is only available to users who have a valid FHI-aims licence.
If you have a FHI-aims licence and wish to have access to FHI-aims on ARCHER2, please make a request via the SAFE, see:
Please have your license details to hand.
"},{"location":"research-software/fhi-aims/#running-parallel-fhi-aims-jobs","title":"Running parallel FHI-aims jobs","text":"The following script will run a FHI-aims job using 8 nodes (1024 cores). The script assumes that the input have the default names control.in
and geometry.in
.
#!/bin/bash\n\n# Request 8 nodes with 128 MPI tasks per node for 20 minutes\n#SBATCH --job-name=FHI-aims\n#SBATCH --nodes=8\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the FHI-aims module, avoid any unintentional OpenMP threading by\n# setting OMP_NUM_THREADS, and launch the code.\nmodule load fhiaims\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport OMP_NUM_THREADS=1\nsrun --distribution=block:block --hint=nomultithread aims.mpi.x\n
"},{"location":"research-software/fhi-aims/#compiling-fhi-aims","title":"Compiling FHI-aims","text":"The latest instructions for building FHI-aims on ARCHER2 may be found in the GitHub repository of build instructions:
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.
"},{"location":"research-software/gromacs/#useful-links","title":"Useful Links","text":"GROMACS is Open Source software and is freely available to all users. Three executable versions are available on the normal (CPU-only) modules:
gmx_mpi
gmx_mpi_d
gmx
We also provide a GPU version of GROMACS that will run on the MI210 GPU nodes. The module is named gromacs/2022.4-GPU
and can be loaded with
module load gromacs/2022.4-GPU\n
Important
The gromacs
modules reset the CPU frequency to the highest possible value (2.25 GHz) as this generally achieves the best balance of performance to energy use. You can change this setting by following the instructions in the Energy use section of the User Guide.
The following script will run a GROMACS MD job using 4 nodes (128x4 cores) with pure MPI.
#!/bin/bash\n\n#SBATCH --job-name=mdrun_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Setup the environment\nmodule load gromacs\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport OMP_NUM_THREADS=1 \nsrun --distribution=block:block --hint=nomultithread gmx_mpi mdrun -s test_calc.tpr\n
"},{"location":"research-software/gromacs/#running-hybrid-mpiopenmp-jobs","title":"Running hybrid MPI/OpenMP jobs","text":"The following script will run a GROMACS MD job using 4 nodes (128x4 cores) with 6 MPI processes per node (24 MPI processes in total) and 6 OpenMP threads per MPI process.
#!/bin/bash\n#SBATCH --job-name=mdrun_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=16\n#SBATCH --cpus-per-task=8\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Setup the environment\nmodule load gromacs\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport OMP_NUM_THREADS=8\nsrun --distribution=block:block --hint=nomultithread gmx_mpi mdrun -s test_calc.tpr\n
"},{"location":"research-software/gromacs/#running-gromacs-on-the-amd-mi210-gpus","title":"Running GROMACS on the AMD MI210 GPUs","text":"The following script will run a GROMACS MD job using 1 GPU with 1 MPI process 8 OpenMP threads per MPI process.
#!/bin/bash\n#SBATCH --job-name=mdrun_gpu\n#SBATCH --gpus=1\n#SBATCH --time=00:20:00\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-shd # or gpu-exc\n\n# Setup the environment\nmodule load gromacs/2022.4-GPU\n\nexport OMP_NUM_THREADS=8\nsrun --ntasks=1 --cpus-per-task=8 gmx_mpi mdrun -ntomp 8 --noconfout -s calc.tpr\n
"},{"location":"research-software/gromacs/#compiling-gromacs","title":"Compiling Gromacs","text":"The latest instructions for building GROMACS on ARCHER2 may be found in the GitHub repository of build instructions:
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code. LAMMPS has potentials for solid-state materials (metals, semiconductors) and soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, mesoscopic, or continuum scale.
"},{"location":"research-software/lammps/#useful-links","title":"Useful Links","text":"LAMMPS is freely available to all ARCHER2 users.
The centrally installed version of LAMMPS is compiled with all the standard packages included: ASPHERE
, BODY
, CLASS2
, COLLOID
, COMPRESS
, CORESHELL
, DIPOLE
, GRANULAR
, KSPACE
, MANYBODY
, MC
, MISC
, MOLECULE
, OPT
, PERI
, QEQ
, REPLICA
, RIGID
, SHOCK
, SNAP
, SRD
.
We do not install any USER
packages. If you are interested in a USER
package, we would encourage you to try to compile your own version and we can help out if necessary (see below).
Important
The lammps
modules reset the CPU frequency to the highest possible value (2.25 GHz) as this generally achieves the best balance of performance to energy use. You can change this setting by following the instructions in the Energy use section of the User Guide.
LAMMPS can exploit multiple nodes on ARCHER2 and will generally be run in exclusive mode using more than one node.
For example, the following script will run a LAMMPS MD job using 4 nodes (128x4 cores) with MPI only.
#!/bin/bash\n\n#SBATCH --job-name=lammps_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load lammps\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread lmp -i in.test -l out.test\n
"},{"location":"research-software/lammps/#compiling-lammps","title":"Compiling LAMMPS","text":"The large range of optional packages available for LAMMPS, and opportunity for extensibility, may mean that it is convenient for users to compile their own copy. In practice, LAMMPS is relatively easy to compile, so we encourage users to have a go.
Compilation instructions for LAMMPS on ARCHER2 can be found on GitHub:
The Massachusetts Institute of Technology General Circulation Model (MITgcm) is a numerical model designed for study of the atmosphere, ocean, and climate. MITgcm's flexible non-hydrostatic formulation enables it to simulate fluid phenomena over a wide range of scales; its adjoint capabilities enable it to be applied to sensitivity questions and to parameter and state estimation problems. By employing fluid equation isomorphisms, a single dynamical kernel can be used to simulate flow of both the atmosphere and ocean.
"},{"location":"research-software/mitgcm/#useful-links","title":"Useful Links","text":"MITgcm is not available via a module on ARCHER2 as users will build their own executables specific to the problem they are working on.
You can obtain the MITgcm source code from the developers by cloning from the GitHub repository with the command
git clone https://github.com/MITgcm/MITgcm.git\n
You should then copy the ARCHER2 optfile into the MITgcm directories.
Warning
A current ARCHER2 optfile is not available at the present time. Please contact support@archer2.ac.uk
for help.
You should also set the following environment variables. MITGCM_ROOTDIR
is used to locate the source code and should point to the top MITgcm directory. Optionally, adding the MITgcm tools directory to your PATH
environment variable makes it easier to use tools such as genmake2
, and the MITGCM_OPT
environment variable makes it easier to pass the optfile to genmake2
.
export MITGCM_ROOTDIR=/path/to/MITgcm\nexport PATH=$MITGCM_ROOTDIR/tools:$PATH\nexport MITGCM_OPT=$MITGCM_ROOTDIR/tools/build_options/dev_linux_amd64_cray_archer2\n
When using genmake2
to create the Makefile, you will need to specify the optfile to use. Other commonly used options might be to use extra source code with the -mods
option, to enable MPI with -mpi
, and to enable OpenMP with -omp
. You might then run a command that resembles the following:
genmake2 -mods /path/to/additional/source -mpi -optfile $MITGCM_OPT\n
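Because genmake2 relies on the optfile, it can help to confirm that MITGCM_OPT points at a readable file before generating the Makefile. A minimal sketch follows; have_optfile is a hypothetical helper name and is not part of MITgcm.

```shell
#!/bin/bash
# have_optfile is a hypothetical helper, not an MITgcm tool.
# It succeeds only if the given path is a readable regular file.
have_optfile() {
    [[ -f $1 && -r $1 ]]
}

if have_optfile ${MITGCM_OPT:-unset}; then
    echo optfile found: $MITGCM_OPT
else
    echo optfile not found: ${MITGCM_OPT:-MITGCM_OPT is not set}
fi
```

If the optfile is reported as missing, the genmake2 command above would fail at the -optfile step, so it is worth fixing the path first.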
You can read about the full set of options available to genmake2
by running
genmake2 -help\n
Finally, you may then build your executable by running
make depend\nmake\n
"},{"location":"research-software/mitgcm/#running-mitgcm-on-archer2","title":"Running MITgcm on ARCHER2","text":""},{"location":"research-software/mitgcm/#pure-mpi","title":"Pure MPI","text":"Once you have built your executable you can write a script like the following which will allow it to run on the ARCHER2 compute nodes. This example would run a pure MPI MITgcm simulation over 2 nodes of 128 cores each for up to one hour.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MITgcm-simulation\n#SBATCH --time=1:0:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 256 MPI processes and 128 MPI processes per node\n# srun picks up the distribution from the sbatch options\nsrun --distribution=block:block --hint=nomultithread ./mitgcmuv\n
"},{"location":"research-software/mitgcm/#hybrid-openmp-mpi","title":"Hybrid OpenMP & MPI","text":"Warning
Running the model in hybrid mode may lead to performance decreases as well as increases. You should be sure to profile your code both as a pure MPI application and as a hybrid OpenMP-MPI application to ensure you are making efficient use of resources. Be sure to read both the ARCHER2 advice on OpenMP and the MITgcm documentation first.
Note
Early versions of the ARCHER2 MITgcm optfile do not contain an OMPFLAG
. Please ensure you have an up to date copy of the optfile before attempting to compile OpenMP enabled codes.
Depending upon your model setup, you may wish to run the MITgcm code as a hybrid OpenMP-MPI application. In terms of compiling the model, this is as simple as using the flag -omp
when calling genmake2
, and updating your SIZE.h
file to have multiple tiles per process.
The model can be run using a Slurm job submission script similar to that shown below. This example will run MITgcm across 2 nodes, with each node using 16 MPI processes, and each process using 4 threads. Note that this would underpopulate the nodes, i.e. we will only be using 128 of the 256 cores available to us. This can also sometimes lead to performance increases.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MITgcm-hybrid-simulation\n#SBATCH --time=1:0:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=16\n#SBATCH --cpus-per-task=4\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Set the number of OpenMP threads per MPI process\n# This should match the cpus-per-task option above.\nexport OMP_NUM_THREADS=4 # Set to number of threads per process\nexport OMP_PLACES=\"cores(128)\" # Set to total number of threads\nexport OMP_PROC_BIND=true # Required if we want to underpopulate nodes\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 32 MPI processes and 16 MPI processes per node\n# srun picks up the distribution from the sbatch options\nsrun --distribution=block:block --hint=nomultithread ./mitgcmuv\n
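The core counts for this layout follow from nodes x tasks per node x threads per task. A small sketch of that arithmetic is given here, using illustrative variable names and the 128 cores per node quoted throughout this page:

```shell
#!/bin/bash
# Illustrative names; the values mirror the #SBATCH options above.
nodes=2
tasks_per_node=16
cpus_per_task=4
cores_per_node=128

total_tasks=$(( nodes * tasks_per_node ))
cores_used=$(( total_tasks * cpus_per_task ))
cores_available=$(( nodes * cores_per_node ))

echo MPI processes: $total_tasks
echo cores used: $cores_used of $cores_available

# A layout cannot use more cores than a node provides.
if (( tasks_per_node * cpus_per_task > cores_per_node )); then
    echo layout oversubscribes the node
fi
```

For the values shown this reports 32 MPI processes using 128 of 256 cores, which is the underpopulated layout described above.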
Finally, remember to update the eedata
file in the model's run directory so that the number of threads requested there matches the number requested in the job submission script.
The ECCO version 4 state estimate (ECCOv4-r4) is an observationally-constrained numerical solution produced by the ECCO group at JPL. If you would like to reproduce the state estimate on ARCHER2 in order to create customised runs and experiments, follow the instructions below. They have been slightly modified from the JPL instructions for ARCHER2.
For more information, see the ECCOv4-r4 website https://ecco-group.org/products-ECCO-V4r4.htm
"},{"location":"research-software/mitgcm/#get-the-eccov4-r4-source-code","title":"Get the ECCOv4-r4 source code","text":"First, navigate to your directory on the /work
filesystem, as this is the filesystem that is available on the compute nodes. Next, create a working directory, for example MYECCO, and navigate into this working directory:
mkdir MYECCO\ncd MYECCO\n
In order to reproduce ECCOv4-r4, we need a specific checkpoint of the MITgcm source code.
git clone https://github.com/MITgcm/MITgcm.git -b checkpoint66g\n
Next, get the ECCOv4-r4 specific code from GitHub:
cd MITgcm\nmkdir -p ECCOV4/release4\ncd ECCOV4/release4\ngit clone https://github.com/ECCO-GROUP/ECCO-v4-Configurations.git\nmv ECCO-v4-Configurations/ECCOv4\\ Release\\ 4/code .\nrm -rf ECCO-v4-Configurations\n
"},{"location":"research-software/mitgcm/#get-the-eccov4-r4-forcing-files","title":"Get the ECCOv4-r4 forcing files","text":"The surface forcing and other input files that are too large to be stored on GitHub are available via NASA data servers. In total, these files are about 200 GB in size. You must register for an Earthdata account and connect to a WebDAV server in order to access these files. For more detailed instructions, read the help page https://ecco.jpl.nasa.gov/drive/help.
First, apply for an Earthdata account: https://urs.earthdata.nasa.gov/users/new
Next, acquire your WebDAV credentials: https://ecco.jpl.nasa.gov/drive (second box from the top)
Now, you can use wget to download the required forcing and input files:
wget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_forcing\nwget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_init\nwget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_ecco\n
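The three downloads differ only in the final directory name, so they can also be written as a loop. The sketch below prints the commands instead of running them; YOURUSERNAME is the placeholder used in the text, and print_downloads is a hypothetical helper name.

```shell
#!/bin/bash
# Print the wget command for each ECCOv4-r4 input directory.
# Remove the leading echo inside the loop to run the downloads for real.
base_url=https://ecco.jpl.nasa.gov/drive/files/Version4/Release4

print_downloads() {
    local user=$1
    local dir
    for dir in input_forcing input_init input_ecco; do
        echo wget -r --no-parent --user $user --ask-password $base_url/$dir
    done
}

print_downloads YOURUSERNAME
```

Printing the commands first makes it easy to check the URLs before starting a transfer of roughly 200 GB.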
After using wget
, you will notice that the input*
directories are, by default, several levels deep in the directory structure. Use the mv
command to move the input*
directories to the directory where you executed the wget
command. Specifically,
mv ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_forcing/ .\nmv ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_init/ .\nmv ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_ecco/ .\nrm -rf ecco.jpl.nasa.gov\n
"},{"location":"research-software/mitgcm/#compiling-and-running-eccov4-r4","title":"Compiling and running ECCOv4-r4","text":"The steps for building the ECCOv4-r4 instance of MITgcm are very similar to those for other build cases. First, wou will need to create a build directory:
cd MITgcm/ECCOV4/release4\nmkdir build\ncd build\n
Load the NetCDF modules:
module load cray-hdf5\nmodule load cray-netcdf\n
If you haven't already, set your environment variables:
export MITGCM_ROOTDIR=../../../../MITgcm\nexport PATH=$MITGCM_ROOTDIR/tools:$PATH\nexport MITGCM_OPT=$MITGCM_ROOTDIR/tools/build_options/dev_linux_amd64_cray_archer2\n
Next, compile the executable:
genmake2 -mods ../code -mpi -optfile $MITGCM_OPT\nmake depend\nmake\n
Once you have compiled the model, you will have the mitgcmuv executable for ECCOv4-r4.
"},{"location":"research-software/mitgcm/#create-run-directory-and-link-files","title":"Create run directory and link files","text":"In order to run the model, you need to create a run directory and link/copy the appropriate files. First, navigate to your directory on the work
filesystem. From the MITgcm/ECCOV4/release4
directory:
mkdir run\ncd run\n\n# link the data files\nln -s ../input_init/NAMELIST/* .\nln -s ../input_init/error_weight/ctrl_weight/* .\nln -s ../input_init/error_weight/data_error/* .\nln -s ../input_init/* .\nln -s ../input_init/tools/* .\nln -s ../input_ecco/*/* .\nln -s ../input_forcing/eccov4r4* .\n\npython mkdir_subdir_diags.py\n\n# manually copy the mitgcmuv executable\ncp -p ../build/mitgcmuv .\n
For a short test run, edit the nTimeSteps
variable in the file data
. Comment out the default value and uncomment the line reading nTimeSteps=8
. This is a useful test to make sure that the model can at least start up.
To run on ARCHER2, submit a batch script to the Slurm scheduler. Here is an example submission script:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=ECCOv4r4-test\n#SBATCH --time=1:0:0\n#SBATCH --nodes=8\n#SBATCH --ntasks-per-node=12\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# For adjoint runs the default cpu-freq is a lot slower\n#SBATCH --cpu-freq=2250000\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 96 MPI processes and 12 MPI processes per node\n# srun picks up the distribution from the sbatch options\nsrun --distribution=block:block --hint=nomultithread ./mitgcmuv\n
This configuration uses 96 MPI processes at 12 MPI processes per node. Once the run has finished, check the end of one of the standard output files to confirm that it completed successfully.
tail STDOUT.0000\n
It should read
PROGRAM MAIN: Execution ended Normally\n
The files named STDOUT.*
contain diagnostic information that you can use to check your results. As a first pass, check the printed statistics for any clear signs of trouble (e.g. NaN values, extremely large values).
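The two checks described above (the final status line and a first-pass scan for NaN values) can be scripted. The following is a minimal sketch rather than an official tool: the marker string is taken from the expected output shown above, while the function names and the choice to inspect only the last ten lines are assumptions made for illustration.

```python
# Sketch only: function names are hypothetical, not part of MITgcm.
import glob

SUCCESS_MARKER = 'PROGRAM MAIN: Execution ended Normally'


def run_completed(text, tail_lines=10):
    # Look for the success marker near the end of the file, as tail would show it.
    return any(SUCCESS_MARKER in line for line in text.splitlines()[-tail_lines:])


def has_nan(text):
    # Crude first-pass check: any whitespace-separated token that is exactly NaN.
    for line in text.splitlines():
        for token in line.split():
            if token.strip('=,').lower() == 'nan':
                return True
    return False


if __name__ == '__main__':
    # Report on every STDOUT.* file in the current run directory.
    for path in sorted(glob.glob('STDOUT.*')):
        with open(path) as handle:
            contents = handle.read()
        print(path, 'completed:', run_completed(contents), 'NaN found:', has_nan(contents))
```

Run it from the run directory after the job has finished. A file that reports NaN values still needs inspecting by hand to find where the problem first appears.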
If you have access to the commercial TAF software produced by http://FastOpt.de, then you can compile and run the ECCOv4-r4 instance of MITgcm in adjoint mode. This mode is useful for comprehensive sensitivity studies and for constructing state estimates. From the MITgcm/ECCOV4/release4
directory, create a new code directory and a new build directory:
mkdir code_ad\ncd code_ad\nln -s ../code/* .\ncd ..\nmkdir build_ad\ncd build_ad\n
In this instance, the code_ad
and code
directories are identical, although this does not have to be the case. Make sure that you have the staf
script in your path or in the build_ad
directory itself. To make sure that you have the most up-to-date script, run:
./staf -get staf\n
To test your connection to the FastOpt servers, try:
./staf -test\n
You should receive the following message:
Your access to the TAF server is enabled.\n
The compilation commands are similar to those used to build the forward case.
# load relevant modules\nmodule load cray-netcdf-hdf5parallel\nmodule load cray-hdf5-parallel\n\n# compile adjoint model\n../../../MITgcm/tools/genmake2 -ieee -mpi -mods=../code_ad -of=(PATH_TO_OPTFILE)\nmake depend\nmake adtaf\nmake adall\n
The source code will be packaged and forwarded to the FastOpt servers, where it will undergo source-to-source translation via the TAF algorithmic differentiation software. If the compilation is successful, you will have an executable named mitgcmuv_ad
. This will run the ECCOv4-r4 configuration of MITgcm in adjoint mode. As before, create a run directory and copy in the relevant files. The procedure is the same as for the forward model, with the following modifications:
cd ..\nmkdir run_ad\ncd run_ad\n# manually copy the mitgcmuv_ad executable\ncp -p ../build_ad/mitgcmuv_ad .\n
To run the model, change the name of the executable in the Slurm submission script; everything else should be the same as in the forward case. As above, at the end of the run you should have a set of STDOUT.*
files that you can examine for any obvious problems.
If TAF compilation fails with an error like failed to convert GOTPCREL relocation; relink with --no-relax
then add the following line to the FFLAGS options: -Wl,--no-relax
.
In an adjoint run, there is a balance between storage (i.e. saving the model state to disk) and recomputation (i.e. integrating the model forward from a stored state). Changing the nchklev
parameters in the tamc.h
file at compile time is how you control the relative balance between storage and recomputation.
A suggested strategy that has been used on a variety of HPC platforms is as follows: 1. Set nchklev_1
as large as possible, up to the size allowed by memory on your machine. (Use the size
command to estimate the memory per process. This should be just a little bit less than the maximum allowed on the machine. On ARCHER2 this is 2 GB (standard) and 4 GB (high memory)). 2. Next, set nchklev_2
and nchklev_3
to be large enough to accommodate the entire run. A common strategy is to set nchklev_2 = nchklev_3 = sqrt(numsteps/nchklev_1) + 1
. 3. If the nchklev_2
files get too big, then you may have to add a fourth level (i.e. nchklev_4
), but this is unlikely.
This strategy allows you to keep as much in memory as possible, minimising the I/O requirements for the disk. This is useful, as I/O is often the bottleneck for MITgcm runs on HPC.
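The rule of thumb in step 2 can be worked through numerically. The sketch below assumes that the product of the three checkpoint levels must be at least the number of time steps, and it truncates the square root to an integer before adding one. The function names and the example values are illustrative and are not taken from an ECCOv4-r4 configuration.

```python
# Sketch only: helper names and example numbers are hypothetical.
import math


def suggest_nchklev(numsteps, nchklev_1):
    # nchklev_2 = nchklev_3 = sqrt(numsteps / nchklev_1) + 1, truncated to an integer.
    level = int(math.sqrt(numsteps / nchklev_1)) + 1
    return level, level


def covers_run(numsteps, nchklev_1, nchklev_2, nchklev_3):
    # Assumption: the three levels together must span every time step of the run.
    return nchklev_1 * nchklev_2 * nchklev_3 >= numsteps


if __name__ == '__main__':
    numsteps = 1000   # illustrative run length in time steps
    nchklev_1 = 10    # illustrative innermost level, limited by memory
    nchklev_2, nchklev_3 = suggest_nchklev(numsteps, nchklev_1)
    print('nchklev_2 =', nchklev_2, 'nchklev_3 =', nchklev_3)
    print('covers run:', covers_run(numsteps, nchklev_1, nchklev_2, nchklev_3))
```

The resulting values still have to be written into tamc.h by hand before compiling.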
Another way to tune performance is to change how tape-level I/O is handled. This strategy performs well for most configurations:
C o tape settings\n#define ALLOW_AUTODIFF_WHTAPEIO\n#define AUTODIFF_USE_OLDSTORE_2D\n#define AUTODIFF_USE_OLDSTORE_3D\n#define EXCLUDE_WHIO_GLOBUFF_2D\n#define ALLOW_INIT_WHTAPEIO\n
"},{"location":"research-software/mo-unified-model/","title":"Met Office Unified Model","text":"The Met Office Unified Model (\"the UM\") is a numerical model of the atmosphere used for both weather and climate applications. It is often coupled to the NEMO ocean model using the OASIS coupling framework to provide a full Earth system model.
"},{"location":"research-software/mo-unified-model/#useful-links","title":"Useful Links","text":"Information on using the UM is provided by the NCAS Computational Modelling Service (CMS).
"},{"location":"research-software/namd/","title":"NAMD","text":"NAMD is an award-winning parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
"},{"location":"research-software/namd/#useful-links","title":"Useful Links","text":"NAMD is freely available to all ARCHER2 users.
ARCHER2 has two versions of NAMD available: no-SMP (namd/2.14-nosmp
) or SMP (namd/2.14
). The SMP (Shared Memory Parallelism) build of NAMD introduces threaded parallelism to address memory limitations. The no-SMP build will typically provide the best performance but most users will require SMP in order to cope with high memory requirements.
Important
The namd
modules reset the CPU frequency to the highest possible value (2.25 GHz) as this generally achieves the best balance of performance to energy use. You can change this setting by following the instructions in the Energy use section of the User Guide.
Using no-SMP NAMD will run jobs with only MPI processes and will not introduce additional threaded parallelism. This is the simplest approach to running NAMD jobs and is likely to give the best performance unless simulations are limited by high memory requirements.
The following script will run a pure MPI NAMD MD job using 4 nodes (i.e. 128x4 = 512 MPI parallel processes).
#!/bin/bash\n\n# Request four nodes to run a job of 512 MPI tasks with 128 MPI\n# tasks per node, here for maximum time 20 minutes.\n\n#SBATCH --job-name=namd-nosmp\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load namd/2.14-nosmp\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread namd2 input.namd\n
"},{"location":"research-software/namd/#running-smp-namd-jobs","title":"Running SMP NAMD jobs","text":"If your jobs runs out of memory, then using the SMP version of NAMD will reduce the memory requirements. This involves launching a combination of MPI processes for communication and worker threads which perform computation.
The following script will run an SMP NAMD MD job using 4 nodes with 32 MPI communication processes per node and 4 cores per communication process (3 worker threads plus one core reserved for communication), i.e. fully occupied nodes with all 512 cores in use.
#!/bin/bash\n#SBATCH --job-name=namd-smp\n#SBATCH --ntasks-per-node=32\n#SBATCH --cpus-per-task=4\n#SBATCH --nodes=4\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant modules\nmodule load namd\n\n# Set procs per node (PPN) & OMP_NUM_THREADS\nexport PPN=$(($SLURM_CPUS_PER_TASK-1))\nexport OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK\nexport OMP_PLACES=cores\n\n# Record PPN in the output file\necho \"Number of worker threads PPN = $PPN\"\n\n# Run NAMD\nsrun --distribution=block:block --hint=nomultithread namd2 +setcpuaffinity +ppn $PPN input.namd\n
Important
Please do not set SRUN_CPUS_PER_TASK
when running the SMP version of NAMD. Otherwise, Charm++ will be unable to pin processes to CPUs, causing NAMD to abort with errors such as Couldn't bind to cpuset 0x00000010,,,0x0: Invalid argument
.
How do I choose the optimal number of MPI processes and worker threads for my simulations? The optimal choice for the numbers of MPI processes and worker threads per node depends on the data set and the number of compute nodes. Before running large production jobs, it is worth experimenting with these parameters to find the optimal configuration for your simulation.
We recommend that users match the ARCHER2 NUMA architecture to find the optimal balance of thread and process parallelism. The NUMA levels on ARCHER2 compute nodes are: 4 cores per CCX, 8 cores per CCD, 16 cores per memory controller, 64 cores per socket. For example, the above submission script specifies 32 MPI communication processes per node and 4 worker threads per communication process which places 1 MPI process per CCX on each node.
Note
To ensure fully occupied nodes with the SMP build of NAMD and match the NUMA layout, the optimal values of (tasks-per-node
, cpus-per-task
) are likely to be (32,4), (16,8) or (8,16).
How do I choose a value for the +ppn flag? The number of workers per communication process is specified by the +ppn argument to NAMD, which is set here to equal cpus-per-task - 1, to leave a CPU-core free for the associated MPI process.
We recommend that users reserve a thread per process to improve the scalability. Reserving this thread on a many-cores-per-node architecture like ARCHER2 will reduce the communication between threads and improve the scalability.
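The recommended layouts and the matching +ppn values can be tabulated with a few lines of arithmetic. This is a minimal sketch assuming 128 cores per node and the three cpus-per-task values listed in the note above. The function name is hypothetical.

```python
# Sketch only: enumerates the NUMA-aligned layouts suggested above.
CORES_PER_NODE = 128  # assumption: 128 cores per ARCHER2 compute node


def smp_layouts(cores_per_node=CORES_PER_NODE, cpus_per_task_options=(4, 8, 16)):
    layouts = []
    for cpus_per_task in cpus_per_task_options:
        # Fill the node completely with communication processes of this width.
        tasks_per_node = cores_per_node // cpus_per_task
        # Leave one core free for the MPI communication process.
        ppn = cpus_per_task - 1
        layouts.append((tasks_per_node, cpus_per_task, ppn))
    return layouts


if __name__ == '__main__':
    for tasks_per_node, cpus_per_task, ppn in smp_layouts():
        print('--ntasks-per-node=%d --cpus-per-task=%d +ppn %d'
              % (tasks_per_node, cpus_per_task, ppn))
```

Each printed row corresponds to one candidate to benchmark. Which one performs best depends on the simulation, as noted above.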
"},{"location":"research-software/namd/#compiling-namd","title":"Compiling NAMD","text":"The latest instructions for building NAMD on ARCHER2 may be found in the GitHub repository of build instructions.
ARCHER2 Full System
"},{"location":"research-software/nektarplusplus/","title":"Nektar++","text":"Nektar++ is a tensor product based finite element package designed to allow one to construct efficient classical low polynomial order h-type solvers (where h is the size of the finite element) as well as higher p-order piecewise polynomial order solvers.
The Nektar++ framework comes with a number of solvers and also allows one to construct a variety of new solvers. Users can therefore use Nektar++ just to run simulations, or to extend and/or develop new functionality.
"},{"location":"research-software/nektarplusplus/#useful-links","title":"Useful Links","text":"Nektar++ is released under an MIT license and is available to all users on the ARCHER2 full system.
"},{"location":"research-software/nektarplusplus/#where-can-i-get-help","title":"Where can I get help?","text":"Specific issues with Nektar++ itself might be submitted to the issue tracker at the Nektar++ gitlab repository (see link above). More general questions might also be directed to the Nektar-users mailing list. Issues specific to the use or behaviour of Nektar++ on ARCHER2 should be sent to the Service Desk.
"},{"location":"research-software/nektarplusplus/#running-parallel-nektar-jobs","title":"Running parallel Nektar++ jobs","text":"Below is the submission script for running the Taylor-Green Vortex, one of the Nektar++ tutorials, see https://doc.nektar.info/tutorials/latest/incns/taylor-green-vortex/incns-taylor-green-vortex.html#incns-taylor-green-vortexch4.html .
You first need to download the archive linked on the tutorial page.
cd /path/to/work/dir\nwget https://doc.nektar.info/tutorials/latest/incns/taylor-green-vortex/incns-taylor-green-vortex.tar.gz\ntar -xvzf incns-taylor-green-vortex.tar.gz\n
#!/bin/bash\n#SBATCH --job-name=nektar\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=32\n#SBATCH --cpus-per-task=1\n#SBATCH --time=02:00:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load nektar\n\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nNEK_INPUT_PATH=/path/to/work/dir/incns-taylor-green-vortex/completed/solver64\n\nsrun --distribution=block:cyclic --hint=nomultithread \\\n ${NEK_DIR}/bin/IncNavierStokesSolver \\\n ${NEK_INPUT_PATH}/TGV64_mesh.xml \\\n ${NEK_INPUT_PATH}/TGV64_conditions.xml\n
"},{"location":"research-software/nektarplusplus/#compiling-nektar","title":"Compiling Nektar++","text":"Instructions for building Nektar++ on ARCHER2 may be found in the GitHub repository of build instructions:
The Nektar++ team have themselves also provided detailed instructions on the build process, updated following the mid-2023 system update, on the Nektar++ website:
This page also provides instructions on how to run jobs using your local installation.
"},{"location":"research-software/nemo/","title":"NEMO","text":"NEMO (Nucleus for European Modelling of the Ocean) is a state-of-the-art framework for research activities and forecasting services in ocean and climate sciences, developed in a sustainable way by a European consortium.
"},{"location":"research-software/nemo/#useful-links","title":"Useful Links","text":"NEMO is released under a CeCILL license and is freely available to all users on ARCHER2.
"},{"location":"research-software/nemo/#compiling-nemo","title":"Compiling NEMO","text":"A central install of NEMO is not appropriate for most users of ARCHER2 since many configurations will want to add bespoke code changes.
The latest instructions for building NEMO on ARCHER2 are found in the GitHub repository of build instructions:
Typical NEMO production runs perform significant I/O management to handle the very large volumes of data associated with ocean modelling. To address this, NEMO ocean clients are interfaced with XIOS I/O servers. XIOS is a library which manages NetCDF outputs for climate models. NEMO uses XIOS to simplify the I/O management and introduce dedicated processors to manage large volumes of data.
Users can choose to run NEMO in attached or detached mode:
- In attached mode each processor acts as an ocean client and I/O-server process.
- In detached mode ocean clients and external XIOS I/O-server processors are separately defined.
Running NEMO in attached mode can be done with a simple submission script specifying both the NEMO and XIOS executable to srun
. However, typical production runs of NEMO will perform significant I/O management and will be unable to run in attached mode.
Detached mode introduces external XIOS I/O-servers to help manage the large volumes of data. This requires users to specify the placement of clients and servers on different cores throughout the node using the --cpu-bind=map_cpu:<cpu map>
srun option to define a CPU map or mask. It is tedious to construct these maps by hand. Instead, Andrew Coward provides a tool to aid users in the construction of submission scripts:
/work/n01/shared/nemo/mkslurm_hetjob\n/work/n01/shared/nemo/mkslurm_hetjob_Gnu\n
Usage of the script:
usage: mkslurm_hetjob [-h] [-S S] [-s S] [-m M] [-C C] [-g G] [-N N] [-t T]\n [-a A] [-j J] [-v]\n\nPython version of mkslurm_alt by Andrew Coward using HetJob. Server placement\nand spacing remains as mkslurm but clients are always tightly packed with a\ngap left every \"NC_GAP\" cores where NC_GAP can be given by the -g argument.\nvalues of 4, 8 or 16 are recommended.\n\noptional arguments:\n -h, --help show this help message and exit\n -S S num_servers (default: 4)\n -s S server_spacing (default: 8)\n -m M max_servers_per_node (default: 2)\n -C C num_clients (default: 28)\n -g G client_gap_interval (default: 4)\n -N N ncores_per_node (default: 128)\n -t T time_limit (default: 00:10:00)\n -a A account (default: n01)\n -j J job_name (default: nemo_test)\n -v show human readable hetjobs (default: False)\n
Note
We recommend that you retain your own copy of this script as it is not directly provided by the ARCHER2 CSE team and subject to change. Once obtained, you can set your own defaults for options in the script.
For example, to run with 4 XIOS I/O-servers (a maximum of 2 per node), each with sole occupancy of a 16-core NUMA region, and 96 ocean cores with an idle core between each, use:
./mkslurm_hetjob -S 4 -s 16 -m 2 -C 96 -g 2 > myscript.slurm\n\nINFO:root:Running mkslurm_hetjob -S 4 -s 16 -m 2 -C 96 -g 2 -N 128 -t 00:10:00 -a n01 -j nemo_test -v False\nINFO:root:nodes needed= 2 (256)\nINFO:root:cores to be used= 100 (256)\n
This has reported that 2 nodes are needed with 100 active cores spread over 256 cores. This will also have produced a submission script \"myscript.slurm\":
#!/bin/bash\n#SBATCH --job-name=nemo_test\n#SBATCH --time=00:10:00\n#SBATCH --account=n01\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-core=1\n\n# Created by: mkslurm_hetjob -S 4 -s 16 -m 2 -C 96 -g 2 -N 128 -t 00:10:00 -a n01 -j nemo_test -v False\nmodule swap craype-network-ofi craype-network-ucx\nmodule swap cray-mpich cray-mpich-ucx\nmodule load cray-hdf5-parallel/1.12.0.7\nmodule load cray-netcdf-hdf5parallel/4.7.4.7\nexport OMP_NUM_THREADS=1\n\ncat > myscript_wrapper.sh << EOFB\n#!/bin/ksh\n#\nset -A map ./xios_server.exe ./nemo\nexec_map=( 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 )\n#\nexec \\${map[\\${exec_map[\\$SLURM_PROCID]}]}\n##\nEOFB\nchmod u+x ./myscript_wrapper.sh\n\nsrun --mem-bind=local \\\n--ntasks=100 --ntasks-per-node=50 --cpu-bind=v,mask_cpu:0x1,0x10000,0x100000000,0x400000000,0x1000000000,0x4000000000,0x10000000000,0x40000000000,0x100000000000,0x400000000000,0x1000000000000,0x4000000000000,0x10000000000000,0x40000000000000,0x100000000000000,0x400000000000000,0x1000000000000000,0x4000000000000000,0x10000000000000000,0x40000000000000000,0x100000000000000000,0x400000000000000000,0x1000000000000000000,0x4000000000000000000,0x10000000000000000000,0x40000000000000000000,0x100000000000000000000,0x400000000000000000000,0x1000000000000000000000,0x4000000000000000000000,0x10000000000000000000000,0x40000000000000000000000,0x100000000000000000000000,0x400000000000000000000000,0x1000000000000000000000000,0x4000000000000000000000000,0x10000000000000000000000000,0x40000000000000000000000000,0x100000000000000000000000000,0x400000000000000000000000000,0x1000000000000000000000000000,0x4000000000000000000000000000,0x10000000000000000000000000000,0x40000000000000000000000000000,0x100000000000000000000000000000,0x400000000000000000000000000000,0x10000000000
00000000000000000000,0x4000000000000000000000000000000,0x10000000000000000000000000000000,0x40000000000000000000000000000000 ./myscript_wrapper.sh\n
Submitting this script in a directory with the nemo and xios_server.exe executables will run the desired MPMD job. The exec_map array shows the position of each executable in the rank list (0 = xios_server.exe, 1 = nemo). For larger core counts the cpu_map can be limited to a single node map which will be cycled through as many times as necessary.
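To see how the exec_map and the CPU masks fit together, each single-bit mask can be decoded to a core index and paired with the executable assigned to that rank. The sketch below uses only the first five entries of the exec_map and mask list from the generated script above. The function name is hypothetical, and the sketch assumes every mask selects exactly one core.

```python
# Sketch only: decodes single-core masks from the generated srun line.
EXECUTABLES = ('xios_server.exe', 'nemo')  # 0 = server, 1 = ocean client


def core_from_mask(mask_hex):
    mask = int(mask_hex, 16)
    # A single-core mask has exactly one bit set.
    if mask <= 0 or mask & (mask - 1):
        raise ValueError('mask does not select exactly one core: ' + mask_hex)
    return mask.bit_length() - 1


if __name__ == '__main__':
    # First five ranks on the first node, copied from the script above.
    exec_map = [0, 0, 1, 1, 1]
    masks = ['0x1', '0x10000', '0x100000000', '0x400000000', '0x1000000000']
    for rank, (role, mask_hex) in enumerate(zip(exec_map, masks)):
        print('rank', rank, EXECUTABLES[role], 'on core', core_from_mask(mask_hex))
```

In this excerpt the two servers land on cores 0 and 16, one per 16-core region, and the ocean clients start at core 32 on every second core, which matches the -s 16 and -g 2 options.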
"},{"location":"research-software/nemo/#how-to-optimise-the-performance-of-nemo","title":"How to optimise the performance of NEMO","text":"Note
Our optimisation advice is based on the ARCHER2 4-cabinet preview system with the same node architecture as the current ARCHER2 service but a total of 1,024 compute nodes. During these investigations we used NEMO-4.0.6 and XIOS-2.5.
Through testing with idealised test cases to optimise the computational performance (i.e. without the demanding I/O management that is typical of NEMO production runs), we have found that drastically under-populating the nodes does not affect the performance of the computation. This indicates that users can reserve large portions of the nodes without a performance detriment. Users can run larger simulations by reserving up to 75% of the node for I/O management (i.e. XIOS I/O-servers).
XIOS I/O-servers can be more lightly packed than ocean clients and should be evenly distributed amongst the nodes i.e. not concentrated on a specific node. We found that placing 1 XIOS I/O-server per node with 4, 8, and 16 dedicated cores did not affect the performance. However, the performance was affected when allocating dedicated I/O-server cores outside of a 16-core NUMA region. Thus, users should confine XIOS I/O-servers to NUMA regions to improve performance and benefit from the memory hierarchy.
"},{"location":"research-software/nemo/#a-performance-investigation","title":"A performance investigation","text":"Note
These results were collated during early user testing of the ARCHER2 service by Andrew Coward and are subject to change.
This table shows some preliminary results of a repeated 60 day simulation of the ORCA2_ICE_PISCES, SETTE configuration using various core counts and packing strategies:
Note
These results used the mkslurm script, now hosted in /work/n01/shared/nemo/old_scripts/mkslurm
It is clear from the previous results that fully populating an ARCHER2 node is unlikely to provide the optimal performance for any codes with moderate memory bandwidth requirements. The explored regular packing strategy does not allow experimentation with less wasteful packing strategies than half-population though.
There may be a case, for example, for just leaving every 1 in 4 cores idle, or every 1 in 8, or even fewer idle cores per node. The mkslurm_alt script (/work/n01/shared/nemo/old_scripts/mkslurm_alt) provided a method of generating cpu-bind maps for exploring these strategies. The script assumed no change in the packing strategy for the servers, but the core spacing argument (-c) for the ocean cores was replaced by a -g option representing the frequency of a gap in the otherwise tightly-packed ocean cores.
Preliminary tests have been conducted with the ORCA2_ICE_PISCES SETTE test case. This is a relatively small test case that will fit onto a single node. It is also small enough to perform well in attached mode. First some baseline tests in attached mode.
Previous tests used 4 I/O servers, each occupying a single NUMA region. For a model of this size, 2 servers occupying half a NUMA region each will suffice. That leaves 112 cores with which to try different packing strategies. Is it possible to match or better this elapsed time on a single node including external I/O servers? Yes, but not with an obvious gap frequency:
And activating land suppression can reduce times further:
The optimal two-node solution is also shown (this is quicker but the one node solution is cheaper).
This leads us to the current iteration of the mkslurm script: mkslurm_hetjob. Note that a tightly-packed placement with no gaps amongst the ocean processes can be generated using a client gap interval greater than the number of clients. This script has been used to explore the different placement strategies with a larger configuration based on eORCA025. In all cases, 8 XIOS servers were used, each with sole occupancy of a 16-core NUMA region and a maximum of 2 servers per node. The rest of the initial 4 nodes (and any subsequent ocean core-only nodes) were filled with ocean cores at various packing densities (from tightly packed to half-populated). A summary of the results is shown below.
The limit of scalability for this problem size lies around 1500 cores. One interesting aspect is that the cost, in terms of node hours, remains fairly flat up to a thousand processes and the choice of gap placement makes much less difference as the individual domains shrink. It looks as if, so long as you avoid inappropriately high numbers of processors, choosing the wrong placement won't waste your allocation but may waste your time.
"},{"location":"research-software/nwchem/","title":"NWChem","text":"NWChem aims to provide its users with computational chemistry tools that are scalable both in their ability to treat large scientific computational chemistry problems efficiently, and in their use of available parallel computing resources from high-performance parallel supercomputers to conventional workstation clusters. The NWChem software can handle: biomolecules, nanostructures, and solid-state system; from quantum to classical, and all combinations; Gaussian basis functions or plane-waves; scaling from one to thousands of processors; properties and relativity.
"},{"location":"research-software/nwchem/#useful-links","title":"Useful Links","text":"NWChem is released under an Educational Community License (ECL 2.0) and is freely available to all users on ARCHER2.
"},{"location":"research-software/nwchem/#where-can-i-get-help","title":"Where can I get help?","text":"If you have problems accessing or running NWChem on ARCHER2, please contact the Service Desk. General questions on the use of NWChem might also be directed to the [NWChem forum][1]. More experienced users with detailed technical issues on NWChem should consider submitting them to the NWChem GitHub issue tracker.
"},{"location":"research-software/nwchem/#running-nwchem-jobs","title":"Running NWChem jobs","text":"The following script will run a NWChem job using 2 nodes (256 cores) in the standard partition. It assumes that the input file is called test_calc.nw
.
#!/bin/bash\n\n# Request 2 nodes with 128 MPI tasks per node for 20 minutes\n\n#SBATCH --job-name=NWChem_test\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the NWChem module, avoid any unintentional OpenMP threading by\n# setting OMP_NUM_THREADS, and launch the code.\nmodule load nwchem\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread nwchem test_calc\n
"},{"location":"research-software/nwchem/#compiling-nwchem","title":"Compiling NWChem","text":"The latest instructions for building NWChem on ARCHER2 may be found in the GitHub repository of build instructions:
ONETEP (Order-N Electronic Total Energy Package) is a linear-scaling code for quantum-mechanical calculations based on density-functional theory.
"},{"location":"research-software/onetep/#useful-links","title":"Useful Links","text":"ONETEP is only available to users who have a valid ONETEP licence.
If you have a ONETEP licence and wish to have access to ONETEP on ARCHER2, please make a request via the SAFE, see:
Please have your licence details to hand.
"},{"location":"research-software/onetep/#running-parallel-onetep-jobs","title":"Running parallel ONETEP jobs","text":"The following script, supplied by the ONETEP developers, will run a ONETEP job using 2 nodes (256 cores) with 16 MPI processes per node and 8 OpenMP threads per MPI process. It assumes that there is a single calculation options file with the .dat
extension in the working directory.
#!/bin/bash\n\n# --------------------------------------------------------------------------\n# A SLURM submission script for ONETEP on ARCHER2 (full 23-cabinet system).\n# Central install, Cray compiler version.\n# Supports hybrid (MPI/OMP) parallelism.\n#\n# 2022.06 Jacek Dziedzic, J.Dziedzic@soton.ac.uk\n# University of Southampton\n# Lennart Gundelach, L.Gundelach@soton.ac.uk\n# University of Southampton\n# Tom Demeyere, T.Demeyere@soton.ac.uk\n# University of Southampton\n# --------------------------------------------------------------------------\n\n# v1.00 (2022.06.04) jd: Adapted from the user-compiled Cray compiler version.\n\n# ==========================================================================================================\n# Edit the following lines to your liking.\n#\n#SBATCH --job-name=mine # Name of the job.\n#SBATCH --nodes=2 # Number of nodes in job.\n#SBATCH --ntasks-per-node=16 # Number of MPI processes per node.\n#SBATCH --cpus-per-task=8 # Number of OMP threads spawned from each MPI process.\n#SBATCH --time=5:00:00 # Max time for your job (hh:mm:ss).\n#SBATCH --partition=standard # Partition: standard memory CPU nodes with AMD EPYC 7742 64-core processor\n#SBATCH --account=t01 # Replace 't01' with your budget code.\n#SBATCH --qos=standard # Requested Quality of Service (QoS), See ARCHER2 documentation\n\nexport OMP_NUM_THREADS=8 # Repeat the value from 'cpus-per-task' here.\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set up the job environment, loading the ONETEP module.\n# The module automatically sets OMP_PLACES, OMP_PROC_BIND and FI_MR_CACHE_MAX_COUNT.\n# To use a different binary, replace this line with either (drop the leading '#')\n# module load onetep/6.1.9.0-GCC-LibSci\n# to use the GCC-libsci binary, or with\n# module load onetep/6.1.9.0-GCC-MKL\n# to use the GCC-MKL binary.\n\nmodule load onetep/6.1.9.0-CCE-LibSci\n\n# 
==========================================================================================================\n# !!! You should not need to modify anything below this line.\n# ==========================================================================================================\n\nworkdir=`pwd`\necho \"--- This is the submission script, the time is `date`.\"\n\n# Figure out ONETEP executable\nonetep_exe=`which onetep.archer2`\necho \"--- ONETEP executable is $onetep_exe.\"\n\nonetep_launcher=`echo $onetep_exe | sed -r \"s/onetep.archer2/onetep_launcher/\"`\n\necho \"--- workdir is '$workdir'.\"\necho \"--- onetep_launcher is '$onetep_launcher'.\"\n\n# Ensure exactly 1 .dat file in there.\nndats=`ls -l *dat | wc -l`\n\nif [ \"$ndats\" == \"0\" ]; then\n echo \"!!! There is no .dat file in the current directory. Aborting.\" >&2\n touch \"%NO_DAT_FILE\"\n exit 2\nfi\n\nif [ \"$ndats\" == \"1\" ]; then\n true\nelse\n echo \"!!! More than one .dat file in the current directory, that's too many. Aborting.\" >&2\n touch \"%MORE_THAN_ONE_DAT_FILE\"\n exit 3\nfi\n\nrootname=`echo *.dat | sed -r \"s/\\.dat\\$//\"`\nrootname_dat=$rootname\".dat\"\nrootname_out=$rootname\".out\"\nrootname_err=$rootname\".err\"\n\necho \"--- The input file is $rootname_dat, the output goes to $rootname_out and errors go to $rootname_err.\"\n\n# Ensure ONETEP executable is there and is indeed executable.\nif [ ! -x \"$onetep_exe\" ]; then\n echo \"!!! $onetep_exe does not exist or is not executable. Aborting!\" >&2\n touch \"%ONETEP_EXE_MISSING\"\n exit 4\nfi\n\n# Ensure onetep_launcher is there and is indeed executable.\nif [ ! -x \"$onetep_launcher\" ]; then\n echo \"!!! $onetep_launcher does not exist or is not executable. 
Aborting!\" >&2\n touch \"%ONETEP_LAUNCHER_MISSING\"\n exit 5\nfi\n\n# Dump the module list to a file.\nmodule list >\\$modules_loaded 2>&1\n\nldd $onetep_exe >\\$ldd\n\n# Report details\necho \"--- Number of nodes as reported by SLURM: $SLURM_JOB_NUM_NODES.\"\necho \"--- Number of tasks as reported by SLURM: $SLURM_NTASKS.\"\necho \"--- Using this srun executable: \"`which srun`\necho \"--- Executing ONETEP via $onetep_launcher.\"\n\n\n# Actually run ONETEP\n# Additional srun options to pin one thread per physical core\n########################################################################################################################################################\nsrun --hint=nomultithread --distribution=block:block -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS $onetep_launcher -e $onetep_exe -t $OMP_NUM_THREADS $rootname_dat >$rootname_out 2>$rootname_err\nresult=$? # Save srun's exit code before any other command overwrites it.\n########################################################################################################################################################\n\necho \"--- srun finished at `date`.\"\n\n# Check for error conditions\nif [ $result -ne 0 ]; then\n echo \"!!! srun reported a non-zero exit code $result. Aborting!\" >&2\n touch \"%SRUN_ERROR\"\n exit 6\nfi\n\nif [ -r $rootname.error_message ]; then\n echo \"!!! ONETEP left an error message file. Aborting!\" >&2\n touch \"%ONETEP_ERROR_DETECTED\"\n exit 7\nfi\n\ntail $rootname.out | grep completed >/dev/null 2>/dev/null\nresult=$?\nif [ $result -ne 0 ]; then\n echo \"!!! ONETEP calculation likely did not complete. Aborting!\" >&2\n touch \"%ONETEP_DID_NOT_COMPLETE\"\n exit 8\nfi\n\necho \"--- Looks like everything went fine. Praise be.\"\ntouch \"%DONE\"\n\necho \"--- Finished successfully at `date`.\"\n
"},{"location":"research-software/onetep/#hints-and-tips","title":"Hints and Tips","text":"See the information in the ONETEP documentation.
"},{"location":"research-software/onetep/#compiling-onetep","title":"Compiling ONETEP","text":"The latest instructions for building ONETEP on ARCHER2 may be found in the GitHub repository of build instructions:
OpenFOAM is an open-source toolbox for computational fluid dynamics. OpenFOAM consists of generic tools to simulate complex physics for a variety of fields of interest, from fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics, electromagnetism and the pricing of financial options.
The core technology of OpenFOAM is a flexible set of modules written in C++. These are used to build solvers and utilities to perform pre-processing and post-processing tasks ranging from simple data manipulation to visualisation and mesh processing.
There are a number of different flavours of the OpenFOAM package with slightly different histories, and slightly different features. The two most common are distributed by openfoam.org and openfoam.com.
"},{"location":"research-software/openfoam/#useful-links","title":"Useful Links","text":"OpenFOAM is released under a GPL v3 license and is freely available to all users on ARCHER2.
Upgrade 2023:
auser@ln01> module avail openfoam\n--------------- /work/y07/shared/archer2-lmod/apps/core -----------------\nopenfoam/com/v2106 openfoam/org/v9.20210903\nopenfoam/com/v2212 (D) openfoam/org/v10.20230119 (D)\n
Note: the older versions were recompiled under PE22.12 in April 2023.
Full system:
auser@ln01> module avail openfoam\n--------------- /work/y07/shared/archer2-lmod/apps/core -----------------\nopenfoam/com/v2106 openfoam/org/v9.20210903 (D)\nopenfoam/org/v8.20200901\n
Versions from openfoam.org are typically v8.0 etc and there is typically one release per year (in June; with a patch release in September). Versions from openfoam.com are e.g., v2106 (to be read as 2021 June) and there are typically two releases a year (one in June, and one in December).
To use OpenFOAM on ARCHER2 you should first load an OpenFOAM module, e.g.
user@ln01:> module load PrgEnv-gnu\nuser@ln01:> module load openfoam/com/v2106\n
(Note that the openfoam
module will automatically load PrgEnv-gnu
if it is not already active.) The module defines only the base installation directory via the environment variable FOAM_INSTALL_DIR
. After loading the module you need to source the etc/bashrc
file provided by OpenFOAM, e.g.
source ${FOAM_INSTALL_DIR}/etc/bashrc\n
You should then be able to use OpenFOAM. The above commands will also need to be added to any job/batch submission scripts you want to use to run OpenFOAM. Note that all the centrally installed versions of OpenFOAM are compiled under PrgEnv-gnu
.
Note there are no default module versions specified. It is recommended to use a fully qualified module name (with the exact version, as in the example above).
"},{"location":"research-software/openfoam/#extensions-to-openfoam","title":"Extensions to OpenFOAM","text":"Many packages extend the central OpenFOAM functionality in some way. However, there is no completely standardised way in which this works. Some packages assume they have write access to the main OpenFOAM installation. If this is the case, you must install your own version before continuing. This can be done on an individual basis, or a per-project basis using the project shared directories.
Some packages are installed in the OpenFOAM user directory, by default this is set to $HOME/OpenFOAM/$USER-[openfoam-version]
. This can be changed (e.g. to the work filesystem) by adding WM_PROJECT_USER_DIR=/work/a01/a01/auser/OpenFOAM/auser-[openfoam-version]
as an argument to source ${FOAM_INSTALL_DIR}/etc/bashrc
. For example:
source ${FOAM_INSTALL_DIR}/etc/bashrc WM_PROJECT_USER_DIR=/work/a01/a01/auser/OpenFOAM/auser-v2106\n
"},{"location":"research-software/openfoam/#compiling-openfoam","title":"Compiling OpenFOAM","text":"If you want to compile your own version of OpenFOAM, instructions are available for ARCHER2 at:
While it is possible to run limited OpenFOAM pre-processing and post-processing activities on the front end, we request all significant work is submitted to the queue system. Please remember that the front end is a shared resource.
A typical SLURM job submission script for OpenFOAM is given here. This would request 4 nodes to run with 128 MPI tasks per node (a total of 512 MPI tasks). Each MPI task is allocated one core (--cpus-per-task=1
).
#!/bin/bash\n\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --distribution=block:block\n#SBATCH --hint=nomultithread\n#SBATCH --time=00:10:00\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Load the appropriate module and source the OpenFOAM bashrc file\n\nmodule load openfoam/org/v10.20230119\n\nsource ${FOAM_INSTALL_DIR}/etc/bashrc\n\n# Run OpenFOAM work, e.g.,\n\nsrun interFoam -parallel\n
#!/bin/bash\n\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --distribution=block:block\n#SBATCH --hint=nomultithread\n#SBATCH --time=00:10:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the appropriate module and source the OpenFOAM bashrc file\n\nmodule load openfoam/org/v8.20200901\n\nsource ${FOAM_INSTALL_DIR}/etc/bashrc\n\n# Run OpenFOAM work, e.g.,\n\nsrun interFoam -parallel\n
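When a solver is run with -parallel, OpenFOAM expects the case to have been decomposed beforehand (for example with decomposePar) into as many subdomains as there are MPI tasks. The sketch below only works out that number from the node and task counts requested above; the variable names are illustrative and are not set by SLURM or OpenFOAM.

```shell
# Illustrative variables mirroring the #SBATCH options in the scripts above.
nodes=4
tasks_per_node=128

# Total number of MPI tasks launched by srun.
total_tasks=$((nodes * tasks_per_node))

# The decomposition of the case should use the same number of subdomains.
echo numberOfSubdomains should be ${total_tasks}
```

If the node count or tasks per node are changed, the case would need to be decomposed again to match.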
"},{"location":"research-software/openfoam/#module-version-history","title":"Module version history","text":"The following centrally installed versions are available.
"},{"location":"research-software/openfoam/#upgrade-2023","title":"Upgrade 2023","text":"Module openfoam/com/v2212
installed as default April 2023 (PE 22.12). This is version v2212 (December 2022). See the OpenFOAM.com v2212 release announcement
Module openfoam/com/v2106
was recompiled April 2023 (PE 22.12). This is version v2106 (June 2021). See the OpenFOAM.com v2106 release announcement
Module openfoam/org/v10.20230119
installed as default April 2023 (PE 22.12). This is version 10 patch release 19th January 2023. See version 10 patch news
Module openfoam/org/v9.20210903
was recompiled April 2023 (PE 22.12). This is version 9 patch release 3rd September 2021. See version 9 patch release news.
Module openfoam/com/v2106
installed October 2021 (Cray PE 21.04). Version v2106 (June 2021). See OpenFOAM.com website
Module openfoam/org/v9.20210903
installed October 2021 (Cray PE 21.09). Version 9 patch release 3rd September 2021. See OpenFOAM.org website
Module openfoam/org/v8.20200901
installed October 2021 (Cray PE 21.09). Version 8 patch release 1st September 2020. See OpenFOAM.org website
ORCA is an ab initio quantum chemistry program package that contains modern electronic structure methods including density functional theory, many-body perturbation, coupled cluster, multireference methods, and semi-empirical quantum chemistry methods. Its main field of application is larger molecules, transition metal complexes, and their spectroscopic properties. ORCA is developed in the research group of Frank Neese. The free version is available only for academic use at academic institutions.
Important
ORCA is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/orca/#useful-links","title":"Useful Links","text":"ORCA is available for academic use on ARCHER2 only. If you wish to use ORCA for commercial applications, you must contact the ORCA developers.
"},{"location":"research-software/orca/#running-parallel-orca-jobs","title":"Running parallel ORCA jobs","text":"The following script will run an ORCA job on the ARCHER2 system using 256 MPI processes across 2 nodes, with each MPI process placed on a separate physical core. It assumes that the input file is my_calc.inp
#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:20:00\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load orca\n\n# Launch the ORCA calculation\n# * You must use \"$ORCADIR/orca\" so the application has the full executable path\n# * Do not use \"srun\" to launch parallel ORCA jobs as they use OpenMPI rather than Cray MPICH\n# * Remember to change the name of the input file to match your file name\n$ORCADIR/orca my_calc.inp\n
"},{"location":"research-software/qchem/","title":"QChem","text":"QChem is an ab initio quantum chemistry software package for fast and accurate simulations of molecular systems, including electronic and molecular structure, reactivities, properties, and spectra.
Important
QChem is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/qchem/#useful-links","title":"Useful Links","text":"ARCHER2 has a site licence for QChem.
"},{"location":"research-software/qchem/#running-parallel-qchem-jobs","title":"Running parallel QChem jobs","text":"Important
QChem parallelisation is only available on ARCHER2 by using multiple threads within a single compute node. Multi-process and multi-node parallelisation will not work on ARCHER2.
The following script will run QChem using 16 OpenMP threads using the input in hf3c.in
.
#!/bin/bash\n#SBATCH --nodes=1\n#SBATCH --time=1:0:0\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=16\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load qchem\n\nexport OMP_PLACES=cores\nexport OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport SLURM_HINT=\"nomultithread\"\nexport SLURM_DISTRIBUTION=\"block:block\"\n\nqchem -slurm -nt $OMP_NUM_THREADS hf3c.in hf3c.out\n
"},{"location":"research-software/qe/","title":"Quantum Espresso","text":"Quantum Espresso (QE) is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.
"},{"location":"research-software/qe/#useful-links","title":"Useful Links","text":"QE is released under a GPL v2 license and is freely available to all ARCHER2 users.
"},{"location":"research-software/qe/#running-parallel-qe-jobs","title":"Running parallel QE jobs","text":"For example, the following script will run a QE pw.x
job using 4 nodes (128x4 cores).
#!/bin/bash\n\n# Request 4 nodes to run a 512 MPI task job with 128 MPI tasks per node.\n# The maximum walltime limit is set to be 20 minutes.\n\n#SBATCH --job-name=qe_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant Quantum Espresso module\nmodule load quantum_espresso\n\n# Set number of OpenMP threads to 1 to prevent multithreading by libraries\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --hint=nomultithread --distribution=block:block pw.x < test_calc.in\n
"},{"location":"research-software/qe/#hints-and-tips","title":"Hints and tips","text":"The QE module is set to load up the default QE-provided pseudo-potentials. If you wish to use non-default pseudo-potentials, you will need to change the ESPRESSO_PSEUDO
variable to point to the directory you wish. This can be done by adding the following line after the module is loaded
export ESPRESSO_PSEUDO=/path/to/pseudo_potentials\n
"},{"location":"research-software/qe/#compiling-qe","title":"Compiling QE","text":"The latest instructions for building QE on ARCHER2 can be found in the GitHub repository of build instructions:
The Vienna Ab initio Simulation Package (VASP) is a computer program for atomic scale materials modelling, e.g. electronic structure calculations and quantum-mechanical molecular dynamics, from first principles.
VASP computes an approximate solution to the many-body Schr\u00f6dinger equation, either within density functional theory (DFT), solving the Kohn-Sham equations, or within the Hartree-Fock (HF) approximation, solving the Roothaan equations. Hybrid functionals that mix the Hartree-Fock approach with density functional theory are implemented as well. Furthermore, Green's functions methods (GW quasiparticles, and ACFDT-RPA) and many-body perturbation theory (2nd-order M\u00f8ller-Plesset) are available in VASP.
In VASP, central quantities, like the one-electron orbitals, the electronic charge density, and the local potential are expressed in plane wave basis sets. The interactions between the electrons and ions are described using norm-conserving or ultrasoft pseudopotentials, or the projector-augmented-wave method.
To determine the electronic ground state, VASP makes use of efficient iterative matrix diagonalisation techniques, like the residual minimisation method with direct inversion of the iterative subspace (RMM-DIIS) or blocked Davidson algorithms. These are coupled to highly efficient Broyden and Pulay density mixing schemes to speed up the self-consistency cycle.
"},{"location":"research-software/vasp/#useful-links","title":"Useful Links","text":"VASP is only available to users who have a valid VASP licence.
If you have a VASP 5 or 6 licence and wish to have access to VASP on ARCHER2, please make a request via the SAFE, see:
Please have your licence details to hand.
Note
Both VASP 5 and VASP 6 are available on ARCHER2. You generally need a different licence for each of these versions.
"},{"location":"research-software/vasp/#running-parallel-vasp-jobs","title":"Running parallel VASP jobs","text":"To access VASP you should load the appropriate vasp
module in your job submission scripts.
To load the default version of VASP, you would use:
module load vasp\n
Tip
VASP 6.4.3 and above have all been compiled to include Wannier90 functionality. Older versions of VASP on ARCHER2 do not include Wannier90.
Once loaded, the executables are called:
vasp_std - Multiple k-point version
vasp_gam - GAMMA-point only version
vasp_ncl - Non-collinear version
Once the module has been loaded, you can access the LDA and PBE pseudopotentials for VASP on ARCHER2 at:
$VASP_PSPOT_DIR\n
Tip
VASP 6 can make use of OpenMP threads in addition to running with pure MPI. We will add notes on performance and use of threading in VASP as information becomes available.
Example VASP submission script
#!/bin/bash\n\n# Request 16 nodes (2048 MPI tasks at 128 tasks per node) for 20 minutes. \n\n#SBATCH --job-name=VASP_test\n#SBATCH --nodes=16\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the VASP module\nmodule load vasp/6\n\n# Avoid any unintentional OpenMP threading by setting OMP_NUM_THREADS\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the code - the distribution and hint options are important for performance\nsrun --distribution=block:block --hint=nomultithread vasp_std\n
"},{"location":"research-software/vasp/#vasp-transition-state-tools-vtst","title":"VASP Transition State Tools (VTST)","text":"As well as the standard VASP 5 modules, we provide versions of VASP 5 with the VASP Transition State Tools (VTST) from the University of Texas added. The VTST version adds various functionality to VASP and provides additional scripts to use with VASP. Additional functionality includes:
Full details of these methods and the provided scripts can be found on the VTST website.
On ARCHER2, the VTST version of VASP 5 can be accessed by loading the modules with VTST
in the module name, for example:
module load vasp/6/6.4.1-vtst\n
"},{"location":"research-software/vasp/#compiling-vasp-on-archer2","title":"Compiling VASP on ARCHER2","text":"If you wish to compile your own version of VASP on ARCHER2 (either VASP 5 or VASP 6) you can find information on how we compiled the central versions in the build instructions GitHub repository. See:
The VASP modules are set up to use the OpenFabrics MPI transport protocol as testing has shown that this passes all the regression tests and gives the most reliable operation on ARCHER2. However, there may be cases where using UCX can give better performance than OpenFabrics.
If you want to try the UCX transport protocol then you can do this by loading additional modules after you have loaded the VASP modules. For example, for VASP 6, you would use:
module load vasp/6\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\n
"},{"location":"research-software/vasp/#increasing-the-cpu-frequency-and-enabling-turbo-boost","title":"Increasing the CPU frequency and enabling turbo-boost","text":"The default CPU frequency is currently set to 2 GHz on ARCHER2. While many VASP calculations are memory or MPI bound, some calculations can be CPU bound. For those cases, you may see a significant difference in performance by increasing the CPU frequency and enabling turbo-boost (though you will almost certainly also be less energy efficient).
You can do this by adding the line:
export SLURM_CPU_FREQ_REQ=2250000\n
in your job submission script before the srun command.
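As a quick check of the units: SLURM takes the requested CPU frequency in kHz, so the value 2250000 used above corresponds to 2.25 GHz. The sketch below just performs that conversion before exporting the variable; the names freq_khz and freq_mhz are illustrative and not part of SLURM or the ARCHER2 setup.

```shell
# Illustrative variable names; only SLURM_CPU_FREQ_REQ is read by SLURM.
freq_khz=2250000

# Convert to MHz for a human-readable message in the job log.
freq_mhz=$((freq_khz / 1000))
echo Requesting a CPU frequency of ${freq_mhz} MHz

export SLURM_CPU_FREQ_REQ=${freq_khz}
```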
"},{"location":"research-software/vasp/#performance-tips","title":"Performance tips","text":"The performance of VASP depends on the version of VASP used, the performance of MPI collective operations, the choice of VASP parallelisation parameters (NCORE
/NPAR
and KPAR
) and how many MPI processes per node are used.
KPAR: You should always use the maximum value of KPAR
that your calculation allows within the available memory.
NCORE/NPAR: We have found that the optimal values of NCORE
(and hence NPAR
) depend on both the type of calculation you are performing (e.g. pure DFT, hybrid functional, \u0393-point, non-collinear) and the number of nodes/cores you are using for your calculation. In practice, this means that you should experiment with different values to find the best choice for your calculation. There is information below on the best choices for the benchmarks we have run on ARCHER2 that may serve as a useful starting point. The performance difference from choosing different values can vary by up to 100% so it is worth spending time investigating this.
MPI processes per node: For the benchmarks used, we found that it is sometimes beneficial to performance to use fewer MPI processes per node than the total number of cores per node.
OpenMP threads: Using multiple OpenMP threads per MPI process can be beneficial to performance. 4 OpenMP threads per MPI process typically sees the best performance in the tests we have performed.
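To make the arithmetic behind the OpenMP tip concrete: the pure MPI example script uses 128 tasks per node, so with 4 threads per MPI process and one thread per core, each node would hold 32 MPI processes. This is a minimal sketch with illustrative variable names, not a tested recommendation for any particular calculation.

```shell
# Illustrative values: 128 cores per node, 4 OpenMP threads per MPI process.
cores_per_node=128
omp_threads=4

# Number of MPI processes that fit on one node with one thread per core.
tasks_per_node=$((cores_per_node / omp_threads))

# Print the SLURM options that would correspond to this layout.
echo --ntasks-per-node=${tasks_per_node} --cpus-per-task=${omp_threads}

export OMP_NUM_THREADS=${omp_threads}
```

These values would take the place of --ntasks-per-node=128, --cpus-per-task=1 and OMP_NUM_THREADS=1 in the example submission script. Other threading settings may also be needed and are not covered here.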
"},{"location":"research-software/vasp/#vasp-performance-data-on-archer2","title":"VASP performance data on ARCHER2","text":"VASP performance data on ARCHER2 is currently available for two different benchmark systems:
Basic information:
vasp_ncl
NELM = 6
Performance summary:
vasp/6/6.4.2-mkl19
modules)NCORE
:NCORE = 16
KPAR = 2
is maximum that can be used on standard memory nodes Setup details: - vasp/6/6.4.2-mkl19
module - GCC 11.2.0 - MKL 19.5 for BLAS/LAPACK/ScaLAPACK and FFTW - OFI for MPI transport layer
This page has moved
"},{"location":"research-software/chemshell/chemshell/","title":"Chemshell","text":"This page has moved
"},{"location":"research-software/code-saturne/code-saturne/","title":"Code saturne","text":"This page has moved
"},{"location":"research-software/cp2k/cp2k/","title":"Cp2k","text":"This page has moved
"},{"location":"research-software/fhi-aims/fhi-aims/","title":"Fhi aims","text":"This page has moved
"},{"location":"research-software/gromacs/gromacs/","title":"Gromacs","text":"This page has moved
"},{"location":"research-software/lammps/lammps/","title":"Lammps","text":"This page has moved
"},{"location":"research-software/mitgcm/mitgcm/","title":"Mitgcm","text":"This page has moved
"},{"location":"research-software/mo-unified-model/mo-unified-model/","title":"Mo unified model","text":"This page has moved
"},{"location":"research-software/namd/namd/","title":"Namd","text":"This page has moved
"},{"location":"research-software/nektarplusplus/nektarplusplus/","title":"Nektarplusplus","text":"This page has moved
"},{"location":"research-software/nemo/nemo/","title":"Nemo","text":"This page has moved
"},{"location":"research-software/nwchem/nwchem/","title":"Nwchem","text":"This page has moved
"},{"location":"research-software/onetep/onetep/","title":"Onetep","text":"This page has moved
"},{"location":"research-software/openfoam/openfoam/","title":"Openfoam","text":"This page has moved
"},{"location":"research-software/qe/qe/","title":"Qe","text":"This page has moved
"},{"location":"research-software/vasp/vasp/","title":"Vasp","text":"This page has moved
"},{"location":"software-libraries/","title":"Software Libraries","text":"This section provides information on centrally-installed software libraries and library-based packages. These provide significant functionality that is of interest to both users and developers of applications.
Libraries are made available via the module system, and fall into a number of distinct groups.
"},{"location":"software-libraries/#libraries-via-modules-cray-","title":"Libraries via modules cray-*
","text":"The following libraries are available as modules prefixed by cray-
and may be of direct interest to developers and users. The modules are provided by HPE Cray to be optimised for performance on the ARCHER2 hardware, and should be used where possible. The relevant modules are:
cray-fftw ...details for module load cray-fftw...
FFTW (Fastest Fourier Transform in the West) is a standard package for discrete Fourier transforms. See the FFTW home page
cray-hdf5 and cray-hdf5-parallel ...details for hdf5...
Hierarchical Data Format (HDF5) is a high-performance and portable data format and data model. These modules provide serial and parallel variants of HDF5. See the HDF5 home page
cray-libsci ...details for cray-libsci...
BLAS, LAPACK, BLACS, and SCALAPACK provide basic linear algebra functionality such as vector-vector, matrix-vector, and matrix-matrix multiplication. Module cray-libsci
is loaded by default in all programming environments.
cray-netcdf ...details for cray-netcdf...
Serial version of Network Common Data Form (NetCDF), a widely used and portable data format. See the NetCDF website
cray-netcdf-hdf5parallel
A serial NetCDF built against parallel HDF5. Load module cray-hdf5-parallel
first.
cray-parallel-netcdf ...details for Parallel NetCDF...
A parallel NetCDF implementation (sometimes referred to as \"Pnetcdf\").
All libraries provided by modules prefixed cray-
integrate with the compiler environment, and so appropriate compiler and link stage options are injected when using the standard compiler wrappers cc
, CC
and ftn
.
The following libraries are also made available by the ARCHER2 CSE team:
ADIOS2 ...details for ADIOS2 on ARCHER2...
ADIOS2 parallel IO library.
AOCL ...details for AOCL on ARCHER2...
AOCL (AMD Optimizing CPU Libraries) provides a set of numerical libraries optimised for AMD \"Zen\"-based processors.
ARPACK-NG ...details for ARPACK-NG on ARCHER2...
ARPACK-NG (Arnoldi Package) computes eigenvalues and eigenvectors of large sparse matrices.
Boost ...details for Boost on ARCHER2...
Boost is a portable C++ library providing reference implementations of many common containers, operations and algorithms.
Eigen ...details for Eigen on ARCHER2...
Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
GLM ...details for GLM on ARCHER2...
GLM (GL Math library) is a C++ header-only library for performing operations commonly encountered in graphics applications.
Hypre ...details for HYPRE on ARCHER2...
HYPRE provides pre-conditioners and solvers for sparse linear algebra problems.
Metis and Parmetis ...details for Metis and Parmetis...
METIS is a set of (serial) routines for partitioning graphs and meshes, and computing reduced-fill orderings of sparse matrices. It is commonly used e.g., to compute decompositions for finite element problems. Parmetis is the distributed memory counterpart.
Mumps ...details for MUMPS on ARCHER2...
MUMPS provides parallel direct solution of large sparse matrix problems.
PETSc ...details for PETSc on ARCHER2...
PETSc is a general package with functionality related to the solution of a wide range of problems described by partial differential equations.
Scotch ...details for Scotch and PT-Scotch on ARCHER2...
Scotch (and its parallel partner PT-Scotch) is a graph partitioning library.
SLEPc ...details for SLEPc on ARCHER2...
SLEPc is a package for large eigenvalue problems based on PETSc.
SuperLU and SuperLU_DIST ...details for SuperLU on ARCHER2...
SuperLU provides solutions to large non-symmetric sparse systems. SuperLU_DIST is the distributed memory version.
Trilinos ...details for Trilinos on ARCHER2...
Trilinos is a large collection of packages for the solution of complex scientific and engineering problems.
Again, all the libraries listed above are supported by all programming environments via the module system. Additional compile and link time flags should not be required.
"},{"location":"software-libraries/#building-your-own-library-versions","title":"Building your own library versions","text":"For the libraries listed in this section, a set of build and installation scripts is available at the ARCHER2 GitHub repository.
Follow the instructions to build the relevant package (note this is the cse-develop
branch of the repository). See also the individual library pages in the list above for further details.
The scripts available from this repository should work in all three programming environments.
"},{"location":"software-libraries/adios/","title":"ADIOS","text":"The Adaptable I/O System (ADIOS) is developed at Oak Ridge National Laboratory and is freely available under a BSD license. The current development is ADIOS2.
"},{"location":"software-libraries/adios/#version-history","title":"Version history","text":"Current: Versions of ADIOS2 for different programming environments are available. See, e.g.:
user@ln01:> module load other-software\nuser@ln01:> module avail adios2\n
Please load the appropriate module for the current programming environment.
Upgrade 2023: The central installation of ADIOS (version 1) has been removed as it is no longer actively developed.
Full system: adios/1.13.1 installed October 2021 (PE 21.04)
4-cabinet system: adios/1.13.1 installed January 2021
Configuration details for ADIOS2 are obtained via the utility adios2-config
which should be available in the PATH
once ADIOS is installed. For example, to recover the compiler options required to provide serial C include files, issue:
$ adios2-config -s -c\n
Use adios2-config --help
for a summary of options. To compile and link an application, such statements can be embedded in a Makefile via, e.g.,
ADIOS_INC := $(shell adios2-config -s -c)\nADIOS_CLIB := $(shell adios2-config -s -l)\n
See the ADIOS2 user manual for further details and examples."},{"location":"software-libraries/adios/#compile-your-own-version","title":"Compile your own version","text":"Details for ADIOS2 are pending.
"},{"location":"software-libraries/adios/#resources","title":"Resources","text":"The ADIOS2 user manual
The ADIOS2 GitHub repository
"},{"location":"software-libraries/aocl/","title":"AMD Optimizing CPU Libraries (AOCL)","text":"AMD Optimizing CPU Libraries (AOCL) are a set of numerical libraries optimized for AMD \u201cZen\u201d-based processors, including EPYC, Ryzen Threadripper PRO, and Ryzen.
AOCL comprises eight libraries:
- BLIS (BLAS Library)
- libFLAME (LAPACK)
- AMD-FFTW
- LibM (AMD Core Math Library)
- ScaLAPACK
- AMD Random Number Generator (RNG)
- AMD Secure RNG
- AOCL-Sparse
Tip
AOCL 3.1
and 4.0
are available. 3.1
is default.
Important
AOCL does not currently support the Cray programming environment and is currently unavailable with PrgEnv-cray
loaded.
Important
The cray-libsci
module is loaded by default for all users and this module also contains definitions of BLAS, LAPACK and ScaLAPACK routines that conflict with those in AOCL. The aocl
module automatically unloads cray-libsci
.
AOCL 3.1
and 4.0
are available for all versions of the GCC compilers: gcc/11.2.0
and gcc/10.3.0
module load PrgEnv-gnu\nmodule load aocl\n
"},{"location":"software-libraries/aocl/#aocc-programming-environment","title":"AOCC Programming Environment","text":"AOCL 3.1
and 4.0
are available for all versions of the AOCC compilers: aocc/3.2.0
.
module load PrgEnv-aocc\nmodule load aocl\n
"},{"location":"software-libraries/aocl/#resources","title":"Resources","text":"For more information on AOCL, please see: https://developer.amd.com/amd-aocl/#documentation
"},{"location":"software-libraries/aocl/#version-history","title":"Version history","text":"Current modules:
aocl/3.1
installed June 2023aocl/4.0
installed June 2023The Arnoldi Package (ARPACK) was designed to compute eigenvalues and eigenvectors of large sparse matrices. Originally from Rice University, an open source version (ARPACK-NG) is available under a BSD license and is made available here.
"},{"location":"software-libraries/arpack/#compiling-and-linking-with-arpack","title":"Compiling and linking with ARPACK","text":"module load arpack-ng
To compile an application against the ARPACK-NG libraries, load the arpack-ng
module and use the compiler wrappers cc
, CC
, and ftn
in the usual way.
The arpack-ng
module defines ARPACK_NG_DIR
which locates the root of the installation for the current programming environment.
arpack-ng/3.8.0
installed October 2021 (PE 21.04)The current supported version of ARPACK-NG on ARCHER2 can be compiled using a script available from the ARCHER2 GitHub repository.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/arpack-ng.sh --prefix=/path/to/install/location\n
where the --prefix
specifies a suitable location. See the Archer2 github repository for further options and details. Note that the build process runs the tests, for which an salloc
allocation is required to allow the parallel tests to run correctly."},{"location":"software-libraries/arpack/#resources","title":"Resources","text":"ARPACK-NG github site
"},{"location":"software-libraries/boost/","title":"Boost","text":"Boost provides portable C++ libraries useful in a broad range of contexts. The libraries are freely available under the terms of the Boost Software license.
"},{"location":"software-libraries/boost/#compiling-and-linking","title":"Compiling and linking","text":"module load boost
The C++ compiler wrapper CC
will introduce the appropriate options to compile an application against the Boost libraries. The other compiler wrappers (cc
and ftn
) do not introduce these options.
To check exactly what options are introduced type, e.g.,
$ CC --cray-print-opts\n
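As a hedged illustration, the include directories added by the wrapper can be listed with a small shell helper. The function name list_include_dirs is hypothetical (it is not part of the Cray wrappers), and the sketch assumes the options appear in the -Idir form with no spaces inside a path.

```shell
# Hypothetical helper (not part of the Cray wrappers): read compiler-wrapper
# options on stdin and print each include directory on its own line.
# Assumes include paths are given as -Idir and contain no whitespace.
list_include_dirs() {
  for tok in $(cat); do
    case $tok in
      -I*) echo ${tok#-I} ;;
    esac
  done
}
```

For example, CC --cray-print-opts | list_include_dirs prints one directory per line, which makes it easier to confirm that the Boost include path is present.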
The boost
module also defines the environment variable BOOST_DIR
as the root of the installation for the current programming environment if this information is needed.
boost/1.81.0
installed May 2023 (PE 22.12)boost/1.72.0
recompiled May 2023 (PE 22.12)boost/1.72
installed October 2021 (PE 21.04)boost/1.72.0
installed January 2021The following libraries are installed: atomic chrono container context contract coroutine date_time exception fiber filesystem graph_parallel graph iostreams locale log math mpi program_options random regex serialization stacktrace system test thread timer type_erasure wave
The ARCHER2 Github repository contains a recipe for compiling Boost for the different programming environments.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout cse-develop\n$ ./sh/boost.sh --prefix=/path/to/install/location\n
where the --prefix
determines the install location. The list of libraries compiled is specified in the boost.sh
script. See the ARCHER2 Github repository for further information."},{"location":"software-libraries/boost/#resources","title":"Resources","text":"Boost home page.
Documentation (HTML) for the current version.
Boost GitHub repository.
Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
"},{"location":"software-libraries/eigen/#compiling-with-eigen","title":"Compiling with Eigen","text":"module load eigen
To compile an application with the Eigen header files, load the eigen
module and use the compiler wrappers cc
, CC
, or ftn
in the usual way. The relevant header files will be introduced automatically.
The header files are located in /work/y07/shared/libs/core/eigen/3.4.0/
, and can be included manually at compilation without loading the module if required.
eigen/3.4.0
installed October 2021The current supported version on Archer2 can be built using the following script
$ wget https://gitlab.com/libeigen/eigen/-/archive/3.4.0/eigen-3.4.0.tar.gz\n$ tar xvf eigen-3.4.0.tar.gz\n$ cmake eigen-3.4.0/ -DCMAKE_INSTALL_PREFIX=/path/to/install/location\n$ make install\n
where the -DCMAKE_INSTALL_PREFIX
option determines the install directory. Installing in this way will also build the Eigen documentation and unit-tests."},{"location":"software-libraries/eigen/#resources","title":"Resources","text":"Eigen home page
Getting Started guide
"},{"location":"software-libraries/fftw/","title":"FFTW","text":"module load cray-fftw
FFTW is a C subroutine library (which includes a Fortran interface) for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).
Only the version 3 interface is available on ARCHER2.
"},{"location":"software-libraries/glm/","title":"GLM","text":"OpenGL Mathematics (GLM) is a header-only C++ library which performs operations typically encountered in graphics applications, but can also be relevant to scientific applications. GLM is freely available under an MIT license.
"},{"location":"software-libraries/glm/#compiling-with-glm","title":"Compiling with GLM","text":"module load glm
The compiler wrapper CC
will automatically locate the required include directory when the module is loaded.
The glm
module also defines the environment variable GLM_DIR
which carries the root of the installation, if needed.
glm/0.9.9.6
installed October 2021 (PE 21.04)glm/0.9.9.6
installed January 2021One can follow the instructions used to install the current version on ARCHER2 via the ARCHER2 Github repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2021-10\n$ ./sh/glm.sh --prefix=/path/to/install/location\n
where the --prefix
option sets the install location. See the ARCHER2 Github repository for further details."},{"location":"software-libraries/glm/#resources","title":"Resources","text":"The GLM Github repository.
"},{"location":"software-libraries/hdf5/","title":"HDF5","text":"The Hierarchical Data Format HDF5 (and its parallel manifestation HDF5 parallel) is a standard library and data format developed and supported by The HDF Group, and is released under a BSD-like license.
Both serial and parallel versions are available on ARCHER2 as standard modules:
module load cray-hdf5
(serial version)module load cray-hdf5-parallel
(MPI parallel version)Use module help
to locate cray-
specific release notes on a particular version.
Known issues:
Upgrade 2023Full system4-cabinet systemcray-hdf5-parallel
will not operate correctly in PrgEnv-aocc
. One can load module epcc-cray-hdf5-parallel
instead as a work-around if PrgEnv-aocc
is required.Some general comments and information on serial and parallel I/O to ARCHER2 are given in the section on I/O and file systems.
"},{"location":"software-libraries/hdf5/#compiling-applications-against-hdf5","title":"Compiling applications against HDF5","text":"If the appropriate programming environment and HDF5 modules are loaded, compiling applications against the HDF5 libraries should be straightforward. You should use the compiler wrappers cc
, CC
, and/or ftn
. See, e.g., cc --cray-print-opts
for the full list of include paths and library paths and options added by the compiler wrapper.
The HDF5 support website includes general documentation.
For parallel HDF5, some tutorials and presentations are available.
"},{"location":"software-libraries/hypre/","title":"HYPRE","text":"HYPRE is a library of linear solvers for structured and unstructured problems with a particular emphasis on multigrid. It is a product of the Lawrence Livermore National Laboratory and is distributed under either the MIT license or the Apache license.
"},{"location":"software-libraries/hypre/#compiling-and-linking-with-hypre","title":"Compiling and linking with HYPRE","text":"module load hypre
To compile and link an application with the HYPRE libraries, load the hypre
module and use the compiler wrappers cc
, CC
, or ftn
in the usual way. The relevant include files and libraries will be introduced automatically.
Two versions of HYPRE are included: one with, and one without, OpenMP. The relevant version will be selected if e.g., -fopenmp
is included in the compile or link stage.
The hypre
module defines the environment variable HYPRE_DIR
which will show the root of the installation for the current programming environment if required.
hypre/2.25.0
installed as default May 2023 (PE 22.12)hypre/2.18.0
recompiled and installed May 2023 (PE 22.12)hypre/2.18.0
installed October 2021 (PE 21.04)hypre/2.18.0
installed January 2021The current supported version on Archer2 can be built using the script from the Archer2 repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/hypre.sh --prefix=/path/to/install/location\n
where the --prefix
option determines the install directory. See the Archer2 github repository for more information."},{"location":"software-libraries/hypre/#resources","title":"Resources","text":"HYPRE home page
The latest HYPRE user manual (HTML)
An older pdf version
HYPRE github repository
"},{"location":"software-libraries/libsci/","title":"HPE Cray LibSci","text":"module load cray-libsci
(note: loaded by default for all users) Cray scientific libraries, available for all compiler choices, provide access to the Fortran BLAS and LAPACK interfaces for basic linear algebra, the corresponding C interfaces CBLAS and LAPACKE, and BLACS and ScaLAPACK for parallel linear algebra. Type man intro_libsci
for further details.
Additionally there is GPU support available via the cray-libsci_acc
module. More information can be found here.
Matio is a library which allows reading and writing matrices in MATLAB MAT format. It is an open source development released under a BSD license.
"},{"location":"software-libraries/matio/#compiling-and-linking-against-matio","title":"Compiling and linking against Matio","text":"module load matio
Load the matio
module and use the standard compiler wrappers cc
, CC
, or ftn
in the usual way. The appropriate header files and libraries will be included automatically via the compiler wrappers.
The matio
module sets the PATH
variable so that the stand-alone utility matdump
can be used. The module also defines MATIO_PATH
which gives the root of the installation if this is needed.
matio/1.5.23
installed May 2023 (PE 22.12)matio/1.5.18
is removed.matio/1.5.18
installed October 2021 (PE 21.04)matio/1.5.18
installed January 2021A version of Matio as currently installed on Archer2 can be compiled using the script available from the Archer2 github repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/matio.sh --prefix=/path/to/install/location\n
where --prefix
defines the location of the installation."},{"location":"software-libraries/matio/#resources","title":"Resources","text":"Matio github repository
"},{"location":"software-libraries/mesa/","title":"Mesa","text":"Mesa is an open-source implementation of OpenGL, Vulkan, and other graphics APIs, which it translates to vendor-specific hardware drivers.
"},{"location":"software-libraries/mesa/#compiling-with-mesa","title":"Compiling with Mesa","text":"module load mesa
To compile an application with the mesa header files, load the mesa
module and use the compiler wrappers in the usual way. The relevant header files will be introduced automatically.
The header files are located in /work/y07/shared/libs/core/mesa/21.0.1/
, and can be included manually at compilation without loading the module if required.
mesa/21.0.1
installed June 2023The build recipe for this module can be found at the HPC-UK github repo
"},{"location":"software-libraries/mesa/#resources","title":"Resources","text":"Mesa home page
"},{"location":"software-libraries/metis/","title":"Metis and Parmetis","text":"The University of Minnesota provides a family of libraries for partitioning graphs and meshes, and computing fill-reducing orderings of sparse matrices. These libraries come broadly under the label of \"Metis\". They are free to use for educational and research purposes.
"},{"location":"software-libraries/metis/#metis","title":"Metis","text":"module load metis
Metis is the sequential library for partitioning problems; it also supplies a number of simple stand-alone utility programs to access the Metis API for graph and mesh partitioning, and graph and mesh manipulation. The stand alone programs typically read a graph or mesh from file which must be in \"metis\" format.
"},{"location":"software-libraries/metis/#compiling-and-linking-with-metis","title":"Compiling and linking with Metis","text":"The Metis library available via module load metis
comes both with and without support for OpenMP. When using the compiler wrappers cc
, CC
, and ftn
, the appropriate version will be selected based on the presence or absence of, e.g., -fopenmp
in the compile or link invocation.
Use, e.g.,
$ cc --cray-print-opts\n
or $ cc -fopenmp --cray-print-opts\n
to see exactly what options are being issued by the compiler wrapper when the metis
module is loaded. Metis is currently provided as static libraries, so it should not be necessary to re-load the metis
module at run time.
The serial utilities (e.g. gpmetis
for graph partitioning) are supplied without OpenMP. These may then be run on the front end for small problems if the metis
module is loaded.
The metis
module defines the environment variable METIS_DIR
which indicates the current location of the Metis installation.
Note the metis
and parmetis
libraries (and dependent modules) have been compiled with the default 32-bit integer indexing, and 4-byte floating point options.
module load parmetis
Parmetis is the distributed memory incarnation of the Metis functionality. As for the metis
module, Parmetis is integrated with use of the compiler wrappers cc
, CC
, and ftn
.
Parmetis depends on the metis
module, which is loaded automatically by the parmetis
module.
The parmetis
module defines the environment variable PARMETIS_DIR
which holds the current location of the Parmetis installation. This variable may not respond to a change of compiler version within a given programming environment. If you wish to use PARMETIS_DIR
in such a context, you may need to (re-)load the parmetis
module after the change of compiler version.
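A minimal sketch of the re-load step is given below. The function name reload_module is hypothetical; it simply wraps the standard module unload and module load commands and assumes the module command is available in the current shell.

```shell
# Hypothetical helper: unload and then re-load a module so that variables
# it defines (such as PARMETIS_DIR) are set again for the compiler
# version that is now active. $1 is the module name.
reload_module() {
  module unload $1 && module load $1
}
```

After changing compiler version, reload_module parmetis would then refresh PARMETIS_DIR.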
metis/5.1.0
recompiled and installed May 2023 (PE22.12)parmetis/4.0.3
recompiled and installed May 2023 (PE22.12)metis/5.1.0
installed October 2021 (PE21.04)parmetis/4.0.3
installed January 2021 (PE21.04)metis/5.1.0
installed January 2021parmetis/4.0.3
installed January 2021The build procedure used for the Metis and Parmetis libraries on Archer2 is available via github.
"},{"location":"software-libraries/metis/#metis_1","title":"Metis","text":"The latest Archer2 version of Metis can be installed as follows:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n
where --prefix
determines the install location. This will download and install the default version for the current programming environment.
Parmetis can be installed via the same mechanism as Metis:
$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n
The Metis package should be installed first (as above) using the same location. See the Archer2 repository for further details and options."},{"location":"software-libraries/metis/#resources","title":"Resources","text":"Metis and Parmetis at github
"},{"location":"software-libraries/mkl/","title":"Intel Math Kernel Library (MKL)","text":"The Intel Math Kernel Library (MKL) contains a variety of optimised numerical libraries including BLAS, LAPACK, ScaLAPACK and FFTW. In general, the exact commands required to build against MKL depend on the details of compiler, environment, requirements for parallelism, and so on. The Intel MKL link line advisor should be consulted.
Some examples are given below. Note that loading the mkl
module will provide the environment variable MKLROOT
which holds the location of the various MKL components.
Warning
The ARCHER2 CSE team have seen that using MKL on ARCHER2 for some software leads to failed regression tests due to numerical differences between reference results and those produced with software using MKL.
We strongly recommend that you use the HPE Cray LibSci and HPE Cray FFTW libraries for software if at all possible rather than MKL. If you do decide to use MKL on ARCHER2, then you should carefully validate results from your software to ensure that it is giving the expected results.
Important
The cray-libsci
module is loaded by default for all users and this module also contains definitions of BLAS, LAPACK and ScaLAPACK routines that conflict with those in MKL. The mkl
module automatically unloads cray-libsci
.
Important
The mkl
module needs to be loaded both at compile time and at runtime (usually in your job submission script).
Tip
MKL only supports the GCC programming environment (PrgEnv-gnu
). Other programming environments may work but this is untested and unsupported on ARCHER2.
Swap modules:
module load PrgEnv-gnu\nmodule load mkl\n
Language Compile options Link options Fortran -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
C/C++ -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
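The two sets of link options in the table differ only in the interface library, so they can be assembled by a small shell helper. This is a sketch only: the function name mkl_sequential_link_flags is hypothetical, and it assumes MKLROOT has been set by loading the mkl module.

```shell
# Hypothetical helper: print the sequential MKL link options for GCC.
# $1 is the interface library: mkl_gf_lp64 (Fortran) or mkl_intel_lp64 (C/C++).
# Assumes MKLROOT is set, e.g. by module load mkl.
mkl_sequential_link_flags() {
  echo -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -l$1 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
}
```

A Fortran build might then append $(mkl_sequential_link_flags mkl_gf_lp64) to the ftn link command, and a C or C++ build would pass mkl_intel_lp64 instead.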
"},{"location":"software-libraries/mkl/#threaded-mkl-with-gcc","title":"Threaded MKL with GCC","text":"Swap modules:
module load PrgEnv-gnu\nmodule load mkl\n
Language Compile options Link options Fortran -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
C/C++ -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
"},{"location":"software-libraries/mkl/#mkl-parallel-scalapack-with-gcc","title":"MKL parallel ScaLAPACK with GCC","text":"Swap modules:
module load PrgEnv-gnu\nmodule load mkl\n
Language Compile options Link options Fortran -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm -ldl
C/C++ -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm -ldl
"},{"location":"software-libraries/mumps/","title":"MUMPS","text":"MUMPS is a parallel solver for large sparse systems which features a 'multifrontal' method. It is developed largely at CERFACS, ENS Lyon, IRIT Toulouse, INRIA, and the University of Bordeaux. It is provided free of charge and is largely under a CeCILL-C license.
"},{"location":"software-libraries/mumps/#compiling-and-linking-with-mumps","title":"Compiling and linking with MUMPS","text":"module load mumps
To compile an application against the MUMPS libraries, load the mumps
module and use the compiler wrappers cc
, CC
, and ftn
in the usual way.
MUMPS is configured to allow Pord, Metis, Parmetis, and Scotch orderings.
Two versions of MUMPS are provided: one with, and one without, OpenMP. The relevant version will be selected if the relevant option is included at the compile stage.
The mumps
module defines MUMPS_DIR
which locates the root of the installation for the current programming environment.
mumps/5.5.1
installed as default May 2023 (PE 22.12)mumps/5.3.5
recompiled May 2023 (PE 22.12)Note: mumps/5.5.1
uses scotch/7.0.3
while mumps/5.3.5
uses scotch/6.1.0
.
mumps/5.3.5
installed October 2021 (PE 21.04)mumps/5.2.1
installed January 2021Known issues: The OpenMP version in PrgEnv-aocc
is not available at the moment.
The current supported version of MUMPS on ARCHER2 can be compiled using a script available from the ARCHER2 GitHub repository.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/scotchv7.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/mumps.sh --prefix=/path/to/install/location\n
where the --prefix
option should be the same for MUMPS and the three dependencies (Metis, Parmetis, and Scotch Version 7). See the Archer2 github repository for further options and details."},{"location":"software-libraries/mumps/#resources","title":"Resources","text":"The MUMPS home page
MUMPS user manual (Version 5.6, pdf)
"},{"location":"software-libraries/netcdf/","title":"NetCDF","text":"The Network Common Data Form NetCDF (and its parallel manifestation NetCDF parallel) is a standard library and data format developed and supported by UCAR, and is released under a BSD-like license.
Both serial and parallel versions are available on ARCHER2 as standard modules:
module load cray-netcdf
(serial version)module load cray-netcdf-hdf5parallel
(MPI parallel version)Note that one should first load the relevant HDF module file, e.g.,
$ module load cray-hdf5\n$ module load cray-netcdf\n
for the serial version. Use module spider
to locate available versions, and use module help
to locate cray-
specific release notes on a particular version.
Known issues:
Upgrade 2023Full system4-cabinet systemcray-netcdf-hdf5parallel
will not operate correctly in PrgEnv-aocc
. One can load module epcc-netcdf-hdf5parallel
instead as a work-around if PrgEnv-aocc
is required.Some general comments and information on serial and parallel I/O to ARCHER2 are given in the section on I/O and file systems.
"},{"location":"software-libraries/netcdf/#resources","title":"Resources","text":"The NetCDF home page.
"},{"location":"software-libraries/petsc/","title":"PETSc","text":"PETSc is a suite of parallel tools for solution of partial differential equations. PETSc is developed at Argonne National Laboratory and is freely available under a BSD 2-clause license.
"},{"location":"software-libraries/petsc/#build","title":"Build","text":"module load petsc
Applications may be linked against PETSc by loading the petsc
module and using the compiler wrappers cc
, CC
, and ftn
in the usual way. Details of options introduced by the compiler wrappers can be examined via, e.g.,
$ cc --cray-print-opts\n
PETSc is configured with Metis, Parmetis, and Scotch orderings, and to support HYPRE, MUMPS, SuperLU, and SuperLU-DIST. PETSc is compiled without OpenMP.
The petsc
module defines the environment variable PETSC_DIR
as the root of the installation if this is required.
petsc/3.18.5
installed as default May 2023 (PE 22.12)petsc/3.14.2
recompiled May 2023 (PE 22.12)Note: PETSc has a number of dependencies; where applicable, the newer version of PETSc depends on the newer module version of each relevant dependency. Check module list
to be sure.
petsc/3.14.2
installed October 2021 (PE 21.04)petsc/3.13.3
installed January 2021Known issues: PETSc is not currently available for PrgEnv-aocc
. There is no HYPRE support in this version.
It is possible to follow the steps used to build the current version on Archer2. These steps are codified at the Archer2 github repository and include a number of dependencies to be built in the correct order:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/hypre.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/scotchv7.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/mumps.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/superlu.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/superlu-dist.sh --prefix=/path/to/install/location\n\n$ module load cray-hdf5\n$ ./sh/petsc.sh --prefix=/path/to/install/location\n
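The build order above can be summarised by a dry-run helper that only prints the commands, which may help when checking that a single install prefix is used at every step. The function name print_petsc_build_steps is hypothetical; the script names are those listed above, and nothing is executed.

```shell
# Hypothetical dry-run helper: print (do not execute) the build commands
# in dependency order, using one install prefix ($1) for every step.
print_petsc_build_steps() {
  for script in metis parmetis hypre scotchv7 mumps superlu superlu-dist; do
    echo ./sh/tpsl/${script}.sh --prefix=$1
  done
  echo module load cray-hdf5
  echo ./sh/petsc.sh --prefix=$1
}
```

Running print_petsc_build_steps /path/to/install/location from the pe-scripts directory lists the nine commands in order, so they can be reviewed before being run by hand.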
The --prefix
option indicating the install directory should be the same in all cases. See the Archer2 github repository for further details (and options). This will compile version 3.18.5 against the latest module versions of each dependency."},{"location":"software-libraries/petsc/#resources","title":"Resources","text":"PETSc home page
Current PETSc documentation (HTML)
"},{"location":"software-libraries/scotch/","title":"Scotch and PT-Scotch","text":"Scotch and its parallel version PT-Scotch are provided by Labri at the University of Bordeaux and INRIA Bordeaux South-West. They are used for graph partitioning and ordering problems. The libraries are freely available for scientific use under a license similar to the LGPL license.
"},{"location":"software-libraries/scotch/#scotch-and-pt-scotch_1","title":"Scotch and PT-Scotch","text":"module load scotch
The scotch
module provides access to both the Scotch and PT-Scotch libraries via the compiler system. A number of stand-alone utilities are also provided as part of the package.
If the scotch
module is loaded, then applications may be automatically compiled and linked against the libraries for the current programming environment. Check, e.g.,
$ cc --cray-print-opts\n
if you wish to see exactly what options are generated by the compiler wrappers. Scotch and PT-Scotch libraries are provided as static archives only. The compiler wrappers do not give access to the libraries libscotcherrexit.a
or libptscotcherrexit.a
. If you wish to perform your own error handling these libraries must be linked manually.
The scotch
module defines the environment variable SCOTCH_DIR
which holds the root of the installation for a given programming environment. Libraries are present in ${SCOTCH_DIR}/lib
.
Stand-alone applications are also available. See the Scotch and PT-Scotch user manuals for further details.
"},{"location":"software-libraries/scotch/#module-version-history","title":"Module version history","text":"Upgrade 2023Full system4-cabinet systemscotch/7.0.3
installed May 2023 (PE 22.12)scotch/6.1.0
recompiled May 2023 (PE 22.12)Note: scotch/7.0.3
has disabled a number of features including the Metis compatibility layer, and threads, to allow all tests to pass.
Module scotch/6.1.0 installed October 2021 (PE 21.04)
Known issue: a small number of the standard PT-Scotch tests are failing (all programming environments). Symptoms include truncated MPI_Recvs
. This is currently being investigated.
Module scotch/6.0.10
installed January 2021
Known issue: a small number of the standard PT-Scotch tests are failing (all programming environments). Symptoms include truncated MPI_Recvs
. This is currently being investigated.
The build procedure for the Scotch package on Archer2 is available via github.
"},{"location":"software-libraries/scotch/#scotch-and-pt-scotch_2","title":"Scotch and PT-Scotch","text":"The latest Scotch and PT-Scotch libraries are installed on ARCHER2 using the following mechanism:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/scotchv7.sh --prefix=/path/to/install/location\n
where the --prefix
option defines the destination for the install. This script will download, compile and install version 7.0.3. A separate script (scotch.sh
) in the same location is used for version 6."},{"location":"software-libraries/scotch/#resources","title":"Resources","text":"The Scotch home page
Scotch user manual (pdf)
PT-Scotch user manual (pdf)
"},{"location":"software-libraries/slepc/","title":"SLEPC","text":"The Scalable Library for Eigenvalue Problem Computations (SLEPc) is an extension of PETSc developed at the Universitat Politecnica de Valencia. SLEPc is freely available under a 2-clause BSD license.
"},{"location":"software-libraries/slepc/#compiling-and-linking-with-slepc","title":"Compiling and linking with SLEPc","text":"module load slepc
To compile an application against the SLEPc libraries, load the slepc
module and use the compiler wrappers cc
, CC
, and ftn
in the usual way. Static libraries are available so no module is required at run time.
The SLEPc module defines SLEPC_DIR
which locates the root of the installation.
slepc/3.18.3
installed as default May 2023 (PE 22.12)slepc/3.14.1
recompiled May 2023 (PE 22.12)Note: each SLEPc module depends on a PETSc module with the same minor version number.
slepc/3.14.1
installed October 2021 (PE 21.04)slepc/3.13.2
installed January 2021The version of SLEPc currently available on ARCHER2 can be compiled using a script available from the ARCHER2 github repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/slepc.sh --prefix=/path/to/install/location\n
The dependencies (including PETSc) can be built in the same way, or taken from the existing modules. See the ARCHER2 github repository for further information."},{"location":"software-libraries/slepc/#resources","title":"Resources","text":"SLEPc home page
Latest release version of SLEPc user manual (PDF)
SLEPc Gitlab repository
"},{"location":"software-libraries/superlu/","title":"SuperLU and SuperLU_DIST","text":"SuperLU and SuperLU_DIST are libraries for the direct solution of large sparse non-symmetric systems of linear equations, typically by factorisation and back-substitution. The libraries are provided by Lawrence Berkeley National Laboratory and are freely available under a slightly modified BSD-style license.
Two separate modules are provided for SuperLU and SuperLU_DIST.
"},{"location":"software-libraries/superlu/#superlu","title":"SuperLU","text":"module load superlu
This module provides the serial library SuperLU.
"},{"location":"software-libraries/superlu/#compiling-and-linking-with-superlu","title":"Compiling and linking with SuperLU","text":"Compiling and linking SuperLU applications requires no special action beyond module load superlu
and using the standard compiler wrappers cc
, CC
, or ftn
. The exact options issued by the compiler wrapper can be examined via, e.g.,
$ cc --cray-print-opts\n
while the module is loaded. The module defines the environment variable SUPERLU_DIR
as the root location of the installation for a given programming environment.
superlu/5.2.2
recompiled May 2023 (PE 22.12)superlu/5.2.2
installed October 2021 (PE 21.04)superlu/5.2.1
installed January 2021module load superlu-dist
This module provides the distributed memory parallel library SuperLU_DIST both with and without OpenMP.
"},{"location":"software-libraries/superlu/#compiling-and-linking-superlu_dist","title":"Compiling and linking SuperLU_DIST","text":"Use the standard compiler wrappers:
$ cc my_superlu_dist_application.c\n
or $ cc -fopenmp my_superlu_dist_application.c\n
to compile and link against the appropriate libraries. The superlu-dist
module defines the environment variable SUPERLU_DIST_DIR
as the root of the installation for the current programming environment.
superlu-dist/8.1.2
installed as default May 2023 (PE 22.12)superlu-dist/6.4.0
recompiled May 2023 (PE 22.12)superlu-dist/6.4.0
installed October 2021 (PE 21.04)superlu-dist/6.1.1
installed January 2021The build used for Archer2 can be replicated by using the scripts provided at the Archer2 repository.
"},{"location":"software-libraries/superlu/#superlu_1","title":"SuperLU","text":"The current Archer2 supported version may be built via
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/superlu.sh --prefix=/path/to/install/location\n
where the --prefix
option controls the install destination."},{"location":"software-libraries/superlu/#superlu_dist_1","title":"SuperLU_DIST","text":"SuperLU_DIST is configured using Metis and Parmetis, so these should be installed first:
$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/superlu_dist.sh --prefix=/path/to/install/location\n
will download, compile, and install the relevant libraries. The install location should be the same for all three packages. See the Archer2 github repository for further options and details."},{"location":"software-libraries/superlu/#resources","title":"Resources","text":"The Supernodal LU project home page
The SuperLU User guide (pdf). This describes both SuperLU and SuperLU_DIST.
The SuperLU github repository
The SuperLU_DIST github repository
"},{"location":"software-libraries/trilinos/","title":"Trilinos","text":"Trilinos is a large collection of packages with software components that can be used for scientific and engineering problems. Most of the package are released under a BSD license (and some under LGPL).
"},{"location":"software-libraries/trilinos/#compiling-and-linking-against-trilinos","title":"Compiling and linking against Trilinos","text":"module load trilinos
Applications may be built against the module version of Trilinos by using the compiler wrappers CC
or ftn
in the normal way. The appropriate include files and library paths will be inserted automatically. Trilinos is built with OpenMP enabled.
The trilinos
module defines the environment variable TRILINOS_DIR
as the root of the installation for the current programming environment.
Trilinos also provides a small number of stand-alone executables which are available via the standard PATH
mechanism while the module is loaded.
trilinos/12.18.1
recompiled May 2023 (PE 22.12)Note that Trilinos is not currently available for PrgEnv-aocc
.
trilinos/12.18.1
installed October 2021 (PE 21.04)If using AMD compilers, module version aocc/3.0.0
is required.
module trilinos/12.18.1
installed January 2021Known issue
Trilinos is not available in PrgEnv-aocc
at the moment.
Known issue
The ForTrilinos
package is not available in this version.
Packages enabled are: Amesos, Amesos2, Anasazi, AztecOO, Belos, Epetra, EpetraExt, FEI, Galeri, GlobiPack, Ifpack, Ifpack2, Intrepid, Isorropia, Kokkos, Komplex, Mesquite, ML, Moertel, MueLu, NOX, OptiPack, Pamgen, Phalanx, Piro, Pliris, ROL, RTOp, Rythmos, Sacado, Shards, ShyLU, STK, STKSearch, STKTopology, STKUtil, Stratimikos, Teko, Teuchos, Thyra, Tpetra, TrilinosCouplings, Triutils, Xpetra, Zoltan, Zoltan2.
A script which has details of the relevant configuration options for Trilinos is available at the ARCHER2 Github repository. The script will build a static-only version of the libraries.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ...\n$ ./sh/trilinos.sh --prefix=/path/to/install/location\n
where --prefix
sets the installation location. The ellipsis ...
stands for the dependencies used to build Trilinos, which here are: metis, parmetis, superlu, superlu-dist, scotch, mumps, glm, boost
. These packages should be built as described in their corresponding pages linked in the menu on the left. See the ARCHER2 Github repository for further details.
Note that Trilinos may take up to one hour to compile on its own, and so the compilation is best performed as a batch job.
"},{"location":"software-libraries/trilinos/#resources","title":"Resources","text":"Trilinos home page
Trilinos Github repository
The ARCHER2 User and Best Practice Guide covers all aspects of use of the ARCHER2 service. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of ARCHER2 and more technical topics.
The User and Best Practice Guide contains the following sections:
As well as being used for scientific simulations, ARCHER2 can also be used for data pre-/post-processing and analysis. This page provides an overview of the different options for doing so.
"},{"location":"user-guide/analysis/#using-the-login-nodes","title":"Using the login nodes","text":"The easiest way to run non-computationally intensive data analysis is to run directly on the login nodes. However, please remember that the login nodes are a shared resource and should not be used for long-running tasks.
"},{"location":"user-guide/analysis/#example-running-an-r-script-on-a-login-node","title":"Example: Running an R script on a login node","text":"module load cray-R\nRscript example.R\n
"},{"location":"user-guide/analysis/#using-the-compute-nodes","title":"Using the compute nodes","text":"If running on the login nodes is not feasible (e.g. due to memory requirements or computationally intensive analysis), the compute nodes can also be used for data analysis.
Important
This is a more expensive option, as you will be charged for using the entire node, even though your analysis may only be using one core.
"},{"location":"user-guide/analysis/#example-running-an-r-script-on-a-compute-node","title":"Example: Running an R script on a compute node","text":"#!/bin/bash\n#SBATCH --job-name=data_analysis\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load cray-R\n\nRscript example.R\n
An advantage of this method is that you can use Job chaining to automate the process of analysing your output data once your compute job has finished.
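As a minimal sketch of such a chain, the helper below submits a compute job and then an analysis job that starts only if the first one completes successfully. It relies on the standard Slurm options `sbatch --parsable` and `--dependency=afterok:<jobid>`. The function name `submit_chain` and both script names are placeholders, not part of the ARCHER2 documentation.

```shell
#!/bin/bash
# Sketch only: submit_chain and the script names used with it are hypothetical.
submit_chain() {
    local compute_script="$1"
    local analysis_script="$2"
    local compute_id

    # --parsable makes sbatch print only the job ID
    # (optionally followed by ";cluster-name").
    compute_id=$(sbatch --parsable "$compute_script") || return 1
    compute_id=${compute_id%%;*}

    # The analysis job is held until the compute job finishes with exit code 0.
    sbatch --dependency=afterok:"$compute_id" "$analysis_script"
}

# Example (placeholder script names):
#   submit_chain compute_job.slurm analysis_job.slurm
```

If the compute job fails, the dependent analysis job will not start and may need to be cancelled manually with `scancel`.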
"},{"location":"user-guide/analysis/#using-interactive-jobs","title":"Using interactive jobs","text":"For more interactive analysis, it may be useful to use salloc
to reserve a compute node on which to do your analysis. This allows you to run jobs directly on the compute nodes from the command line without using a job submission script. More information on interactive jobs can be found here.
auser@ln01:> salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 \\\n --time=00:20:00 --partition=standard --qos=short \\\n --account=[budget code]\n
Note
If you want to run for longer than 20 minutes, you will need to use a different QoS as the maximum runtime for the short
QoS is 20 mins.
The data analysis nodes on the ARCHER2 system are designed for large compilations, post-calculation analysis and data manipulation. They should be used for jobs which are too small to require a whole compute node, but which would have an adverse impact on the operation of the login nodes if they were run interactively.
Unlike compute nodes, the data analysis nodes are able to access the home, work, and the RDFaaS file systems. They can also be used to transfer data from a remote system to ARCHER2 and vice versa (using e.g. scp
or rsync
). This can be useful when transferring large amounts of data that might take hours to complete.
The ARCHER2 data analysis nodes can be reached by using the serial
partition and the serial
QoS. Unlike other nodes on ARCHER2, you may only request part of a single node and you will likely be sharing the node with other users.
The data analysis nodes are set up such that you can specify the number of cores you want to use (up to 32 physical cores) and the amount of memory you want for your job (up to 125 GB). You can have multiple jobs running on the data analysis nodes at the same time, but the total number of cores used by those jobs cannot exceed 32, and the total memory used by jobs currently running from a single user cannot exceed 125 GB -- any jobs above this limit will remain pending until your previous jobs are finished.
You do not need to specify both number of cores and memory for jobs on the data analysis nodes. By default, you will get 1984 MiB of memory per core (which is a little less than 2 GB), when specifying cores only, and 1 core when specifying the memory only.
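To illustrate the default allocation, the short sketch below multiplies a core count by the 1984 MiB per-core default quoted above. The function name `default_mem_mib` is an assumption made for this example.

```shell
#!/bin/bash
# Sketch only: default_mem_mib is a hypothetical helper.
# It estimates the default memory (in MiB) for a cores-only request,
# using the 1984 MiB per-core default quoted in the text.
default_mem_mib() {
    local cores="$1"
    echo $(( cores * 1984 ))
}

default_mem_mib 4    # prints 7936
default_mem_mib 32   # prints 63488
```

If the result is smaller than your job needs, request memory explicitly with `--mem`.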
Note
Each data analysis node is fitted with 512 GB of memory. However, a small amount of this memory is needed for system processes, which is why we set an upper limit of 125 GB per user (a user is limited to one quarter of the RAM on a node). This is also why the per-core default memory allocation is slightly less than 2 GB.
Note
When running on the data analysis nodes, you must always specify either the number of cores you want, the amount of memory you want, or both. The examples shown below specify the number of cores with the --ntasks
flag and the memory with the --mem
flag. If you are only wanting to specify one of the two, please remember to delete the other one.
A Slurm batch script for the data analysis nodes looks very similar to one for the compute nodes. The main differences are that you need to use --partition=serial
and --qos=serial
, specify the number of tasks (rather than the number of nodes) and/or specify the amount of memory you want. For example, to use a single core and 4 GB of memory, you would use something like:
#!/bin/bash\n\n# Slurm job options (job-name, job time)\n#SBATCH --job-name=data_analysis\n#SBATCH --time=0:20:0\n#SBATCH --ntasks=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n# Define memory required for this job. By default, you would\n# get just under 2 GB, but you can ask for up to 125 GB.\n#SBATCH --mem=4G\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\nmodule load cray-python\n\npython my_analysis_script.py\n
"},{"location":"user-guide/analysis/#interactive-session-on-the-data-analysis-nodes","title":"Interactive session on the data analysis nodes","text":"There are two ways to start an interactive session on the data analysis nodes: you can either use salloc
to reserve a part of a data analysis node for interactive jobs; or, you can use srun
to open a terminal on the node and run things on the node directly. You can find out more information on the advantages and disadvantages of both of these methods in the Running jobs on ARCHER2 section of the User and Best Practice Guide.
salloc
for interactive access","text":"You can reserve resources on a data analysis node using salloc
. For example, to request 1 core and 4 GB of memory for 20 minutes, you would use:
auser@ln01:~> salloc --time=00:20:00 --partition=serial --qos=serial \\\n --account=[budget code] --ntasks=1 \\\n --mem=4G\n
When you submit this job, your terminal will display something like:
salloc: Pending job allocation 523113\nsalloc: job 523113 queued and waiting for resources\nsalloc: job 523113 has been allocated resources\nsalloc: Granted job allocation 523113\nsalloc: Waiting for resource configuration\nsalloc: Nodes dvn01 are ready for job\n\nauser@ln01:~>\n
It may take some time for your interactive job to start. Once it runs you will enter a standard interactive terminal session (a new shell). Note that this shell is still on the front end (the prompt has not changed). Whilst the interactive session lasts you will be able to run jobs on the data analysis nodes by issuing the srun
command directly at your command prompt. The maximum number of cores and memory you can use is limited by resources requested in the salloc
command (or by the defaults if you did not explicitly ask for particular amounts of resource).
Your session will end when you hit the requested walltime. If you wish to finish before this you should use the exit
command - this will return you to your prompt before you issued the salloc
command.
srun
for interactive access","text":"You can get a command prompt directly on the data analysis nodes by using the srun
command directly. For example, to reserve 1 core and 8 GB of memory, you would use:
auser@ln01:~> srun --time=00:20:00 --partition=serial --qos=serial \\\n --account=[budget code] \\\n --ntasks=1 --mem=8G \\\n --pty /bin/bash\n
The --pty /bin/bash
will cause a new shell to be started on the data analysis node. (This is perhaps closer to what many people consider an 'interactive' job than the salloc
method described above.)
One can now issue shell commands in the usual way.
When finished, type exit
to relinquish the allocation and control will be returned to the front end.
By default, the interactive shell will retain the environment of the parent. If you want a clean shell, remember to specify the --export=none
option to the srun
command.
You can view data on the data analysis nodes by starting an interactive srun
session with the --x11
flag to export the X display back to your local system. For 1 core with 8 GB of memory:
auser@ln01:~> srun --time=00:20:00 --partition=serial --qos=serial \\\n --hint=nomultithread --account=[budget code] \\\n --ntasks=1 --mem=8G --x11 --pty /bin/bash\n
Tip
Data visualisation on ARCHER2 is only possible if you used the -X
or -Y
flag to the ssh
command when logging in to the system.
Singularity can be useful for data analysis, as sites such as DockerHub or SingularityHub contain many pre-built images of data analysis tools that can be simply downloaded and used on ARCHER2. More information about Singularity on ARCHER2 can be found in the Containers section of the User and Best Practice Guide.
"},{"location":"user-guide/analysis/#data-analysis-tools","title":"Data analysis tools","text":"Useful tools for data analysis can be found on the Data Analysis and Tools page.
"},{"location":"user-guide/connecting-totp/","title":"Connecting to ARCHER2","text":"This section covers the basic connection methods.
On the ARCHER2 system, interactive access is achieved using SSH, either directly from a command-line terminal or using an SSH client. In addition, data can be transferred to and from the ARCHER2 system using scp
from the command line or by using a file-transfer client.
Before following the process below, we assume you have set up an account on ARCHER2 through the EPCC SAFE. Documentation on how to do this can be found at:
Linux distributions include a terminal application that can be used for SSH access to the ARCHER2 login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g., GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.
"},{"location":"user-guide/connecting-totp/#macos","title":"MacOS","text":"MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.
"},{"location":"user-guide/connecting-totp/#windows","title":"Windows","text":"A typical Windows installation will not include a terminal client, though there are various clients available. We recommend Windows users download and install MobaXterm to access ARCHER2. It is very easy to use and includes an integrated X Server, which allows you to run graphical applications on ARCHER2.
You can download MobaXterm Home Edition (Installer Edition) from the following link:
Double-click the downloaded Microsoft Installer file (.msi) and follow the instructions from the Windows Installation Wizard. Note, you might need to have administrator rights to install on some versions of Windows. Also, make sure to check whether Windows Firewall has blocked any features of this program after installation (Windows will warn you if the built-in firewall blocks an action, and gives you the opportunity to override the behaviour).
Once installed, start MobaXterm and then click \"Start local terminal\".
Tips
If you download the .zip file rather than the .msi, make sure you unzip it before attempting to run the installer.
If you do not have administrator rights, you can use the Portable edition of MobaXterm.
If this is your first time using MobaXterm, you should check that a permanent /home directory has been set up (otherwise, all saved info will be lost from session to session). Go to \"Settings\" -> \"Configuration\" and check that a path is set in the field marked \"Persistent home directory\". If prompted, make sure path is set as \"private\".
Any SSH key generated in MobaXterm will, by default, be stored in the permanent /home directory (see above). That is, if your /home directory is _MyDocuments_\\MobaXterm\\home
then within that folder you will find a folder named _MyDocuments_\\MobaXterm\\home\\.ssh
containing your keys. This folder will be 'hidden' by default, so you may need to tick 'Hidden items' under 'View' in Windows Explorer to see it.
MobaXterm also allows you to set up pre-configured SSH sessions with the username, login host and key details saved. You are welcome to use this, rather than using the \"Local terminal\", but we are not able to assist with debugging connection issues if you choose this method.
To access ARCHER2, you need to use two sets of credentials: your SSH key pair protected by a passphrase and a Time-based one-time password. You can find more detailed instructions on how to set up your credentials to access ARCHER2 from Windows, MacOS and Linux below.
"},{"location":"user-guide/connecting-totp/#ssh-key-pairs","title":"SSH Key Pairs","text":"You will need to generate an SSH key pair protected by a passphrase to access ARCHER2.
Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:
$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n| . ...+o++++. |\n| . . . =o.. |\n|+ . . .......o o |\n|oE . . |\n|o = . S |\n|. +.+ . |\n|. oo |\n|. . |\n| .. |\n+-----------------+\n
(remember to replace \"your@email.com\" with your e-mail address).
"},{"location":"user-guide/connecting-totp/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:
Login to SAFE.
Then:
Once you have done this, your SSH key will be added to your ARCHER2 account.
"},{"location":"user-guide/connecting-totp/#mfa-time-based-one-time-passcode-totp-code","title":"MFA Time-based one-time passcode (TOTP code)","text":"Remember, you will need to use both an SSH key and time-based one-time passcode to log into ARCHER2 so you will also need to set up a method for generating a TOTP code before you can log into ARCHER2.
"},{"location":"user-guide/connecting-totp/#first-login-password-required","title":"First login: password required","text":"Important
You will not use your password when logging on to ARCHER2 after the first login for a new account.
As an additional security measure, you will also need to use a password from SAFE for your first login to ARCHER2 with a new account. When you log into ARCHER2 for the first time with a new account, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will no longer need this password to log into ARCHER2 from this point forwards; instead, you will use your SSH key and TOTP code as described above.
"},{"location":"user-guide/connecting-totp/#ssh-clients","title":"SSH Clients","text":"As noted above, you interact with ARCHER2, over an encrypted communication channel (specifically, Secure Shell version 2 (SSH-2)). This allows command-line access to one of the login nodes of ARCHER2, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers, when used in conjunction with an X Server.
"},{"location":"user-guide/connecting-totp/#logging-in","title":"Logging in","text":"The login addresses for ARCHER2 are:
You can use the following command from the terminal window to log in to ARCHER2:
Full systemssh username@login.archer2.ac.uk\n
The order in which you are asked for credentials depends on the system you are accessing:
Full systemYou will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you logged into ARCHER2 with your account before the major upgrade in May/June 2023 you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11)
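One way to remove the stale entries without editing the file by hand is `ssh-keygen -R`, which deletes every key recorded for a given host name or IP address from `~/.ssh/known_hosts`. The wrapper below is a sketch, and the function name `forget_host_keys` is an assumption. The IP address in the usage line is the one from the example warning above, so yours may differ.

```shell
#!/bin/bash
# Sketch only: forget_host_keys is a hypothetical convenience wrapper.
# It removes the known_hosts entries for each host name or IP address given.
forget_host_keys() {
    local name
    for name in "$@"; do
        # -R removes all keys belonging to the named host from known_hosts
        ssh-keygen -R "$name"
    done
}

# Example, using the host name and IP address from the warning above:
#   forget_host_keys login.archer2.ac.uk 193.62.216.43
```

After removing the entries, compare the key offered at your next login with the host keys published in this documentation before accepting it.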
Warning
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_ARCHER2
you would use the command ssh -i keys/id_rsa_ARCHER2 username@login.archer2.ac.uk
to log in (or the equivalent for the 4-cabinet system).
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three-step process:
Your password will now have been changed
To allow remote programs, especially graphical applications, to control your local display, such as for a debugger, use:
Full systemssh -X username@login.archer2.ac.uk\n
Some sites recommend using the -Y
flag. While this can fix some compatibility issues, the -X
flag is more secure.
Current MacOS systems do not have an X window system. Users should install the XQuartz package to allow for SSH with X11 forwarding on MacOS systems:
Adding the host keys to your SSH configuration file provides an extra level of security for your connections to ARCHER2. The host keys are checked against the login nodes when you login to ARCHER2 and if the remote server key does not match the one in the configuration file, the connection will be refused. This provides protection against potential malicious servers masquerading as the ARCHER2 login nodes.
"},{"location":"user-guide/connecting-totp/#loginarcher2acuk","title":"login.archer2.ac.uk","text":"login.archer2.ac.uk ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBANu9BQJ1UFr4nwy8X5seIPgCnBl1TKc8XBq2YVY65qS53QcpzjZAH53/CtvyWkyGcmY8/PWsJo9sXHqzXVSkzk=\n\nlogin.archer2.ac.ukssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDFGGByIrskPayB5xRm3vkWoEc5bVtTCi0oTGslD8m+M1Sc/v2IV6FxaEVXGwO9ErQwrtFQRj0KameLS3Jn0LwQ13Tw+vTXV0bsKyGgEu2wW+BSDijGpbxRZXZrg30TltZXd4VkTuWiE6kyhJ6qiIIR0nwfDblijGy3u079gM5Om/Q2wydwh0iAASRzkqldL5bKDb14Vliy7tCT3TJXI49+qIagWUhNEzyN1j2oK/2n3JdflT4/anQ4jUywVG4D1Tor/evEeSa3h5++gbtgAXZaCtlQbBxwckmTetXqnlI+pvkF0AAuS18Bh+hdmvT1+xW0XLv7CMA64HfR93XgQIIuPqFAS1p+HuJkmk4xFAdwrzjnpYAiU5Apkq+vx3W957/LULzZkeiFQY2Y3CY9oPVR8WBmGKXOOBifhl2Hvd51fH1wd0Lw7Zph53NcVSQQhdDUVhgsPJA3M/+UlqoAMEB/V6ESE2z6yrXVfNjDNbbgA1K548EYpyNR8z4eRtZOoi0=\n\nlogin.archer2.ac.uk ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINyptPmidGmIBYHPcTwzgXknVPrMyHptwBgSbMcoZgh5\n
Host key verification can fail if this key is out of date, a problem which can be fixed by removing the offending entry in ~/.ssh/known_hosts
and replacing it with the new key published here. We recommend users should check this page for any key updates and not just accept a new key from the server without confirmation.
Typing in the full command to log in or transfer data to ARCHER2 can become tedious as it often has to be repeated several times. You can use the SSH configuration file, usually located on your local machine at .ssh/config
to make the process more convenient.
Each remote site (or group of sites) can have an entry in this file, which may look something like:
Full systemHost archer2\n HostName login.archer2.ac.uk\n User username\n
(remember to replace username
with your actual username!).
Taking the full-system example: the Host
line defines a short name for the entry. In this case, instead of typing ssh username@login.archer2.ac.uk
to access the ARCHER2 login nodes, you could use ssh archer2
instead. The remaining lines define the options for the host.
Hostname login.archer2.ac.uk
--- defines the full address of the hostUser username
--- defines the username to use by default for this host (replace username
with your own username on the remote host)Now you can use SSH to access ARCHER2 without needing to enter your username or the full hostname every time:
ssh archer2\n
You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config manual page (or man ssh_config
on any machine with SSH installed) for a description of the SSH configuration file. For example, you may find the IdentityFile
option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
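As an illustration, the entry from earlier could be extended with `IdentityFile` as sketched below. The key path `~/.ssh/id_rsa_ARCHER2` is a placeholder chosen for this example, so substitute the location of your own private key.

```
Host archer2
    HostName login.archer2.ac.uk
    User username
    # Placeholder path: point this at the private key you use for ARCHER2
    IdentityFile ~/.ssh/id_rsa_ARCHER2
```

With this entry in place, `ssh archer2` should offer the named key without needing the `-i` option.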
Bug
There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512)
, you will need to either specify the path to your ssh key in the command line (using the -i
option as described above) or add that path to your SSH config file by using the IdentityFile
option.
If you find you are unable to connect to ARCHER2, there are some simple checks you may use to diagnose the issue, which are described below. If you are having difficulties connecting, we suggest trying these before contacting the ARCHER2 Service Desk.
"},{"location":"user-guide/connecting-totp/#use-the-userloginarcher2acuk-syntax-rather-than-l-user-loginarcher2acuk","title":"Use theuser@login.archer2.ac.uk
syntax rather than -l user login.archer2.ac.uk
","text":"We have seen a number of instances where people using the syntax
ssh -l user login.archer2.ac.uk\n
have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:
ssh user@login.archer2.ac.uk\n
works more reliably.
"},{"location":"user-guide/connecting-totp/#can-you-connect-to-the-login-node","title":"Can you connect to the login node?","text":"Try the command ping -c 3 login.archer2.ac.uk
, on Linux or MacOS, or ping -n 3 login.archer2.ac.uk
on Windows. If you successfully connect to the login node, the output should include:
--- login.archer2.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n
(the ping time '38ms' is not important). If not all packets are received there could be a problem with your Internet connection, or the login node could be unavailable.
"},{"location":"user-guide/connecting-totp/#ssh-key","title":"SSH key","text":"If you get the error message Permission denied (publickey)
, this may indicate a problem with your SSH key. Some things to check:
Have you uploaded the key to SAFE? Please note that if the same key is re-uploaded, SAFE will not map the \"new\" key to ARCHER2. If for some reason this is required, please delete the key first, then re-upload.
Is SSH using the correct key? You can check which keys are being found and offered by SSH using ssh -vvv
. If your private key has a non-default name, you should use the -i
option to provide it to ssh. For example, ssh -i path/to/key username@login.archer2.ac.uk
.
Are you entering the passphrase correctly? You will be asked for your private key's passphrase first. If you enter it incorrectly you will usually be asked to enter it again (usually you will get three chances, after which SSH will fail with Permission denied (publickey)
). If you would like to confirm your passphrase without attempting to connect, you can use ssh-keygen -y -f /path/to/private/key
. If successful, this command will print the corresponding public key. You can also use this to check that you have uploaded the correct public key to SAFE.
Are permissions correct on the SSH key? One common issue is that the permissions are set incorrectly on either the key files or the directory they are contained in. On Linux and MacOS, if your private keys are held in ~/.ssh/
you can check this with ls -al ~/.ssh
. This should give something similar to the following output:
$ ls -al ~/.ssh/\n drwx------. 2 user group 48 Jul 15 20:24 .\n drwx------. 12 user group 4096 Oct 13 12:11 ..\n -rw-------. 1 user group 113 Jul 15 20:23 authorized_keys\n -rw-------. 1 user group 12686 Jul 15 20:23 id_rsa\n -rw-r--r--. 1 user group 2785 Jul 15 20:23 id_rsa.pub\n -rw-r--r--. 1 user group 1967 Oct 13 14:11 known_hosts\n
The important section here is the string of letters and dashes at the start, for the lines ending in .
, id_rsa
, and id_rsa.pub
, which indicate permissions on the containing directory, private key, and public key, respectively. If your permissions are not correct, they can be set with chmod
. Consult the table below for the relevant chmod
command.
chmod
Code Directory drwx------
700 Private Key -rw-------
600 Public Key -rw-r--r--
644 chmod
can be used to set permissions on the target in the following way: chmod <code> <target>
. So for example to set correct permissions on the private key file id_rsa_ARCHER2
, use the command chmod 600 id_rsa_ARCHER2
.
On Windows, permissions are handled differently but can be set by right-clicking on the file and selecting Properties > Security > Advanced. The user, SYSTEM, and Administrators should have Full control
, and no other permissions should exist for both the public and private key files, as well as the containing folder.
Tip
Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -
, or directory d
. The next three characters indicate the owning user's permissions. The first character is r
if they have read permission, -
if they don't, the second character is w
if they have write permission, -
if they don't, the third character is x
if they have execute permission, -
if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r--
indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod
codes are constructed by treating the user, group, and other permission strings as three-bit binary numbers, then converting each to a decimal digit. For example, the permission string -rwx------
becomes 111 000 000
-> 700
.
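The conversion described in the tip above can be sketched as a small Bash function. The name `perm_to_octal` is an assumption made for this example. It reads a permission string as printed by `ls -l` and builds the three-digit `chmod` code by adding 4 for read, 2 for write, and 1 for execute in each triplet.

```shell
#!/bin/bash
# Sketch only: perm_to_octal is a hypothetical helper.
# It converts a permission string such as "-rw-r--r--" into a chmod code.
perm_to_octal() {
    # Skip the leading file-type character and keep the nine permission
    # characters (this also ignores any trailing "." printed by ls).
    local perms="${1:1:9}"
    local code="" i triplet digit

    for i in 0 3 6; do
        triplet="${perms:i:3}"
        digit=0
        if [[ "${triplet:0:1}" == "r" ]]; then digit=$(( digit + 4 )); fi
        if [[ "${triplet:1:1}" == "w" ]]; then digit=$(( digit + 2 )); fi
        if [[ "${triplet:2:1}" == "x" ]]; then digit=$(( digit + 1 )); fi
        code+="$digit"
    done

    echo "$code"
}

perm_to_octal "drwx------"   # prints 700
perm_to_octal "-rw-------"   # prints 600
perm_to_octal "-rw-r--r--"   # prints 644
```

The sketch only handles plain `r`, `w`, and `x` characters; special bits such as setuid or the sticky bit are not considered.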
If your TOTP passcode is being consistently rejected, you can remove MFA from your account and then re-enable it.
"},{"location":"user-guide/connecting-totp/#ssh-verbose-output","title":"SSH verbose output","text":"The verbose-debugging output from ssh
can be very useful for diagnosing issues. In particular, it can be used to distinguish between problems with the SSH key and password. To enable verbose output, add the -vvv
flag to your SSH command. For example:
ssh -vvv username@login.archer2.ac.uk\n
The output is lengthy, but somewhere in there you should see lines similar to the following:
debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: password, keyboard-interactive\n
In the text above, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authenticated with partial success
indicates that the SSH key has been accepted. By default SSH will go through a list of standard private-key files, as well as any you have specified with -i
or a config file. To succeed, one of these private keys needs to match the public key uploaded to SAFE.
If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey)
. If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.
You should next see something similar to:
debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (keyboard-interactive).\n
If you do not see the Password:
prompt, you may have connection issues, or there could be a problem with the ARCHER2 login nodes. If you do not see Authentication succeeded (keyboard-interactive)
after entering your password, it means your password was not accepted. You will be asked to re-enter your password, usually two more times, before the connection is rejected. Consider the suggestions under Password above. If you do see Authentication succeeded (keyboard-interactive)
, it means your password was accepted after your SSH key, and the login is complete.
The equivalent information can be obtained in PuTTY by enabling All Logging in settings.
"},{"location":"user-guide/connecting-totp/#related-software","title":"Related Software","text":""},{"location":"user-guide/connecting-totp/#tmux","title":"tmux","text":"tmux is a multiplexer application available on the ARCHER2 login nodes. It allows for multiple sessions to be open concurrently and these sessions can be detached and run in the background. Furthermore, sessions will continue to run after a user logs off and can be reattached to upon logging in again. It is particularly useful if you are connecting to ARCHER2 on an unstable Internet connection or if you wish to keep an arrangement of terminal applications running while you disconnect your client from the Internet -- for example, when moving between your home and workplace.
"},{"location":"user-guide/connecting/","title":"Connecting to ARCHER2","text":"This section covers the basic connection methods.
On the ARCHER2 system, interactive access is achieved using SSH, either directly from a command-line terminal or using an SSH client. In addition, data can be transferred to and from the ARCHER2 system using scp
from the command line or by using a file-transfer client.
Before following the process below, we assume you have set up an account on ARCHER2 through the EPCC SAFE. Documentation on how to do this can be found at:
Linux distributions include a terminal application that can be used for SSH access to the ARCHER2 login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g., GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.
"},{"location":"user-guide/connecting/#macos","title":"MacOS","text":"MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.
"},{"location":"user-guide/connecting/#windows","title":"Windows","text":"A typical Windows installation will not include a terminal client, though there are various clients available. We recommend Windows users download and install MobaXterm to access ARCHER2. It is very easy to use and includes an integrated X Server, which allows you to run graphical applications on ARCHER2.
You can download MobaXterm Home Edition (Installer Edition) from the following link:
Double-click the downloaded Microsoft Installer file (.msi) and follow the instructions from the Windows Installation Wizard. Note, you might need to have administrator rights to install on some versions of Windows. Also, make sure to check whether Windows Firewall has blocked any features of this program after installation (Windows will warn you if the built-in firewall blocks an action, and gives you the opportunity to override the behaviour).
Once installed, start MobaXterm and then click \"Start local terminal\".
Tips
If you download the .zip file rather than the .msi, make sure you unzip it before attempting to run the installer.
If you do not have administrator rights, you can use the Portable edition of MobaXterm.
If this is your first time using MobaXterm, you should check that a permanent /home directory has been set up (otherwise, all saved info will be lost from session to session). Go to \"Settings\" -> \"Configuration\" and check that a path is set in the field marked \"Persistent home directory\". If prompted, make sure the path is set as \"private\".
Any SSH key generated in MobaXterm will, by default, be stored in the permanent /home directory (see above). That is, if your /home directory is _MyDocuments_\\MobaXterm\\home
then within that folder you will find a folder named _MyDocuments_\\MobaXterm\\home\\.ssh
containing your keys. This folder will be 'hidden' by default, so you may need to tick 'Hidden items' under 'View' in Windows Explorer to see it.
MobaXterm also allows you to set up pre-configured SSH sessions with the username, login host and key details saved. You are welcome to use this, rather than using the \"Local terminal\", but we are not able to assist with debugging connection issues if you choose this method.
To access ARCHER2, you need to use two sets of credentials: your SSH key pair protected by a passphrase and a Time-based one-time password. You can find more detailed instructions on how to set up your credentials to access ARCHER2 from Windows, MacOS and Linux below.
"},{"location":"user-guide/connecting/#ssh-key-pairs","title":"SSH Key Pairs","text":"You will need to generate an SSH key pair protected by a passphrase to access ARCHER2.
Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:
$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n| . ...+o++++. |\n| . . . =o.. |\n|+ . . .......o o |\n|oE . . |\n|o = . S |\n|. +.+ . |\n|. oo |\n|. . |\n| .. |\n+-----------------+\n
(remember to replace \"your@email.com\" with your e-mail address).
"},{"location":"user-guide/connecting/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:
Log in to SAFE.
Then:
Once you have done this, your SSH key will be added to your ARCHER2 account.
"},{"location":"user-guide/connecting/#mfa-time-based-one-time-passcode-totp-code","title":"MFA Time-based one-time passcode (TOTP code)","text":"Remember, you will need to use both an SSH key and time-based one-time passcode to log into ARCHER2 so you will also need to set up a method for generating a TOTP code before you can log into ARCHER2.
"},{"location":"user-guide/connecting/#first-login-password-required","title":"First login: password required","text":"Important
You will not use your password when logging on to ARCHER2 after the first login for a new account.
As an additional security measure, you will also need to use a password from SAFE for your first login to ARCHER2 with a new account. When you log into ARCHER2 for the first time with a new account, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. From this point forwards you will no longer need this password to log into ARCHER2; instead, you will use your SSH key and TOTP code as described above.
"},{"location":"user-guide/connecting/#ssh-clients","title":"SSH Clients","text":"As noted above, you interact with ARCHER2 over an encrypted communication channel (specifically, Secure Shell version 2 (SSH-2)). This allows command-line access to one of the login nodes of ARCHER2, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers, when used in conjunction with an X Server.
"},{"location":"user-guide/connecting/#logging-in","title":"Logging in","text":"The login addresses for ARCHER2 are:
You can use the following command from the terminal window to log in to ARCHER2:
Full system
ssh username@login.archer2.ac.uk\n
The order in which you are asked for credentials depends on the system you are accessing:
Full system
You will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you logged into ARCHER2 with your account before the major upgrade in May/June 2023 you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11).
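As a minimal sketch of removing a single numbered entry, the commands below delete line 11 from a scratch file. The file name demo_known_hosts and its placeholder entries are assumptions made for illustration; on a real system the target would be ~/.ssh/known_hosts and the line number would be the one reported in the warning. OpenSSH also provides ssh-keygen -R login.archer2.ac.uk, which removes all stored keys for that host.

```shell
# demo_known_hosts is a scratch file standing in for ~/.ssh/known_hosts.
seq 1 12 | sed 's/^/placeholder-entry-/' > demo_known_hosts

# Delete line 11 in place, keeping the original as demo_known_hosts.bak.
sed -i.bak '11d' demo_known_hosts

# Count the remaining entries: 11 of the original 12 are left.
grep -c 'placeholder-entry' demo_known_hosts
```

Keeping the .bak copy means the original file can be restored if the wrong line is removed.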
Warning
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_ARCHER2
you would use the command ssh -i keys/id_rsa_ARCHER2 username@login.archer2.ac.uk
to log in (or the equivalent for the 4-cabinet system).
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three-step process:
Your password will now have been changed.
To allow remote programs, especially graphical applications, to control your local display, such as for a debugger, use:
Full system
ssh -X username@login.archer2.ac.uk\n
Some sites recommend using the -Y
flag. While this can fix some compatibility issues, the -X
flag is more secure.
Current MacOS systems do not have an X window system. Users should install the XQuartz package to allow for SSH with X11 forwarding on MacOS systems:
Adding the host keys to your SSH configuration file provides an extra level of security for your connections to ARCHER2. The host keys are checked against the login nodes when you log in to ARCHER2 and, if the remote server key does not match the one in the configuration file, the connection will be refused. This provides protection against potential malicious servers masquerading as the ARCHER2 login nodes.
"},{"location":"user-guide/connecting/#loginarcher2acuk","title":"login.archer2.ac.uk","text":"login.archer2.ac.uk ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBANu9BQJ1UFr4nwy8X5seIPgCnBl1TKc8XBq2YVY65qS53QcpzjZAH53/CtvyWkyGcmY8/PWsJo9sXHqzXVSkzk=\n\nlogin.archer2.ac.uk ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDFGGByIrskPayB5xRm3vkWoEc5bVtTCi0oTGslD8m+M1Sc/v2IV6FxaEVXGwO9ErQwrtFQRj0KameLS3Jn0LwQ13Tw+vTXV0bsKyGgEu2wW+BSDijGpbxRZXZrg30TltZXd4VkTuWiE6kyhJ6qiIIR0nwfDblijGy3u079gM5Om/Q2wydwh0iAASRzkqldL5bKDb14Vliy7tCT3TJXI49+qIagWUhNEzyN1j2oK/2n3JdflT4/anQ4jUywVG4D1Tor/evEeSa3h5++gbtgAXZaCtlQbBxwckmTetXqnlI+pvkF0AAuS18Bh+hdmvT1+xW0XLv7CMA64HfR93XgQIIuPqFAS1p+HuJkmk4xFAdwrzjnpYAiU5Apkq+vx3W957/LULzZkeiFQY2Y3CY9oPVR8WBmGKXOOBifhl2Hvd51fH1wd0Lw7Zph53NcVSQQhdDUVhgsPJA3M/+UlqoAMEB/V6ESE2z6yrXVfNjDNbbgA1K548EYpyNR8z4eRtZOoi0=\n\nlogin.archer2.ac.uk ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINyptPmidGmIBYHPcTwzgXknVPrMyHptwBgSbMcoZgh5\n
Host key verification can fail if this key is out of date, a problem which can be fixed by removing the offending entry in ~/.ssh/known_hosts
and replacing it with the new key published here. We recommend that users check this page for any key updates rather than accepting a new key from the server without confirmation.
Typing in the full command to log in or transfer data to ARCHER2 can become tedious as it often has to be repeated several times. You can use the SSH configuration file, usually located on your local machine at ~/.ssh/config
to make the process more convenient.
Each remote site (or group of sites) can have an entry in this file, which may look something like:
Full system
Host archer2\n HostName login.archer2.ac.uk\n User username\n
(remember to replace username
with your actual username!).
Taking the full-system example: the Host
line defines a short name for the entry. In this case, instead of typing ssh username@login.archer2.ac.uk
to access the ARCHER2 login nodes, you could use ssh archer2
instead. The remaining lines define the options for the host.
Hostname login.archer2.ac.uk
--- defines the full address of the host
User username
--- defines the username to use by default for this host (replace username
with your own username on the remote host)
Now you can use SSH to access ARCHER2 without needing to enter your username or the full hostname every time:
ssh archer2\n
You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config manual page (or man ssh_config
on any machine with SSH installed) for a description of the SSH configuration file. For example, you may find the IdentityFile
option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
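As a sketch of how the IdentityFile option fits into such an entry, the fragment below extends the archer2 example. The key path ~/.ssh/id_rsa_ARCHER2 is an assumption for illustration; replace it with the location of your own private key.

```
# Entry for ARCHER2 in the local SSH configuration file
Host archer2
    HostName login.archer2.ac.uk
    User username
    # Assumed key location; point this at your own private key
    IdentityFile ~/.ssh/id_rsa_ARCHER2
```

With an entry like this in place, ssh archer2 would offer the named key without the -i option being given on the command line.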
Bug
There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512)
, you will need to either specify the path to your ssh key in the command line (using the -i
option as described above) or add that path to your SSH config file by using the IdentityFile
option.
If you find you are unable to connect to ARCHER2, there are some simple checks you may use to diagnose the issue, which are described below. If you are having difficulties connecting, we suggest trying these before contacting the ARCHER2 Service Desk.
"},{"location":"user-guide/connecting/#use-the-userloginarcher2acuk-syntax-rather-than-l-user-loginarcher2acuk","title":"Use the user@login.archer2.ac.uk
syntax rather than -l user login.archer2.ac.uk
","text":"We have seen a number of instances where people using the syntax
ssh -l user login.archer2.ac.uk\n
have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:
ssh user@login.archer2.ac.uk\n
works more reliably.
"},{"location":"user-guide/connecting/#can-you-connect-to-the-login-node","title":"Can you connect to the login node?","text":"Try the command ping -c 3 login.archer2.ac.uk
, on Linux or MacOS, or ping -n 3 login.archer2.ac.uk
on Windows. If you successfully connect to the login node, the output should include:
--- login.archer2.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n
(the ping time '38ms' is not important). If not all packets are received there could be a problem with your Internet connection, or the login node could be unavailable.
"},{"location":"user-guide/connecting/#ssh-key","title":"SSH key","text":"If you get the error message Permission denied (publickey)
, this may indicate a problem with your SSH key. Some things to check:
Have you uploaded the key to SAFE? Please note that if the same key is re-uploaded, SAFE will not map the \"new\" key to ARCHER2. If for some reason this is required, please delete the key first, then re-upload.
Is SSH using the correct key? You can check which keys are being found and offered by SSH using ssh -vvv
. If your private key has a non-default name, you should use the -i
option to provide it to ssh. For example, ssh -i path/to/key username@login.archer2.ac.uk
.
Are you entering the passphrase correctly? You will be asked for your private key's passphrase first. If you enter it incorrectly you will usually be asked to enter it again (usually you will get three chances, after which SSH will fail with Permission denied (publickey)
). If you would like to confirm your passphrase without attempting to connect, you can use ssh-keygen -y -f /path/to/private/key
. If successful, this command will print the corresponding public key. You can also use this to check that you have uploaded the correct public key to SAFE.
Are permissions correct on the SSH key? One common issue is that the permissions are set incorrectly on either the key files or the directory they are contained in. On Linux and MacOS, if your private keys are held in ~/.ssh/
you can check this with ls -al ~/.ssh
. This should give something similar to the following output:
$ ls -al ~/.ssh/\n drwx------. 2 user group 48 Jul 15 20:24 .\n drwx------. 12 user group 4096 Oct 13 12:11 ..\n -rw-------. 1 user group 113 Jul 15 20:23 authorized_keys\n -rw-------. 1 user group 12686 Jul 15 20:23 id_rsa\n -rw-r--r--. 1 user group 2785 Jul 15 20:23 id_rsa.pub\n -rw-r--r--. 1 user group 1967 Oct 13 14:11 known_hosts\n
The important section here is the string of letters and dashes at the start, for the lines ending in .
, id_rsa
, and id_rsa.pub
, which indicate permissions on the containing directory, private key, and public key, respectively. If your permissions are not correct, they can be set with chmod
. Consult the table below for the relevant chmod
command.
Target | Permissions | chmod Code
Directory | drwx------ | 700
Private Key | -rw------- | 600
Public Key | -rw-r--r-- | 644
chmod
can be used to set permissions on the target in the following way: chmod <code> <target>
. So for example to set correct permissions on the private key file id_rsa_ARCHER2
, use the command chmod 600 id_rsa_ARCHER2
.
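The three codes can be tried out safely on scratch files before touching real keys. In this sketch the directory demo_ssh and the empty files inside it are placeholders created only for illustration; they are not real keys.

```shell
# Scratch copies only: demo_ssh and its files are placeholder names.
mkdir -p demo_ssh
touch demo_ssh/id_rsa_ARCHER2 demo_ssh/id_rsa_ARCHER2.pub

chmod 700 demo_ssh                       # containing directory
chmod 600 demo_ssh/id_rsa_ARCHER2        # private key
chmod 644 demo_ssh/id_rsa_ARCHER2.pub    # public key

# List the results; the permission strings appear in the first column.
ls -al demo_ssh
```

The listing should show drwx------ for the directory entry, -rw------- for the private key, and -rw-r--r-- for the public key.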
On Windows, permissions are handled differently but can be set by right-clicking on the file and selecting Properties > Security > Advanced. The user, SYSTEM, and Administrators should have Full control
, and no other permissions should exist for both the public and private key files, as well as the containing folder.
Tip
Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -
, or directory d
. The next three characters indicate the owning user's permissions. The first character is r
if they have read permission, -
if they don't, the second character is w
if they have write permission, -
if they don't, the third character is x
if they have execute permission, -
if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r--
indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod
codes are constructed by treating each of the user, group, and other permission strings as a three-bit binary number, then converting each one to a single octal digit (0 to 7). For example, the permission string -rwx------
becomes 111 000 000
-> 700
.
If your TOTP passcode is being consistently rejected, you can remove MFA from your account and then re-enable it.
"},{"location":"user-guide/connecting/#ssh-verbose-output","title":"SSH verbose output","text":"The verbose-debugging output from ssh
can be very useful for diagnosing issues. In particular, it can be used to distinguish between problems with the SSH key and password. To enable verbose output, add the -vvv
flag to your SSH command. For example:
ssh -vvv username@login.archer2.ac.uk\n
The output is lengthy, but somewhere in there you should see lines similar to the following:
debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: password, keyboard-interactive\n
In the text above, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authenticated with partial success
indicates that the SSH key has been accepted. By default SSH will go through a list of standard private-key files, as well as any you have specified with -i
or a config file. To succeed, one of these private keys needs to match the public key uploaded to SAFE.
If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey)
. If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.
You should next see something similar to:
debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (keyboard-interactive).\n
If you do not see the Password:
prompt, you may have connection issues, or there could be a problem with the ARCHER2 login nodes. If you do not see Authentication succeeded (keyboard-interactive)
after entering your password, it means your password was not accepted. You will be asked to re-enter your password, usually two more times, before the connection is rejected. Consider the suggestions under Password above. If you do see Authentication succeeded (keyboard-interactive)
, it means your password was accepted after your SSH key, and the login is complete.
The equivalent information can be obtained in PuTTY by enabling All Logging in settings.
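Because the verbose output is long, it can help to save it to a file and then show only the authentication milestones. The sketch below assumes the debug output was captured with a command such as ssh -vvv username@login.archer2.ac.uk 2> ssh_debug.log (ssh writes its debug messages to standard error). The helper name show_auth_steps and the log file name are illustrative, and the sample lines are taken from the output shown above.

```shell
# show_auth_steps is an illustrative helper: it reads an ssh debug log on
# standard input and prints only the lines that mark authentication progress.
show_auth_steps() {
    grep -E 'Next authentication method|Offering public key|Server accepts key|Authenticated with partial success|Authentication succeeded|Permission denied'
}

# Demonstration on a few sample lines; with a saved log you would instead run:
#   show_auth_steps < ssh_debug.log
show_auth_steps << 'EOF'
debug1: Next authentication method: publickey
debug3: send packet: type 50
debug1: Server accepts key: pkalg rsa-sha2-512 blen 2071
Authenticated with partial success.
debug1: Next authentication method: keyboard-interactive
debug1: Authentication succeeded (keyboard-interactive).
EOF
```

In the demonstration every line except the send packet line is printed. In a real log, a missing Server accepts key line points to a problem with the SSH key, while a missing Authentication succeeded line points to the second credential.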
"},{"location":"user-guide/connecting/#related-software","title":"Related Software","text":""},{"location":"user-guide/connecting/#tmux","title":"tmux","text":"tmux is a multiplexer application available on the ARCHER2 login nodes. It allows for multiple sessions to be open concurrently and these sessions can be detached and run in the background. Furthermore, sessions will continue to run after a user logs off and can be reattached to upon logging in again. It is particularly useful if you are connecting to ARCHER2 on an unstable Internet connection or if you wish to keep an arrangement of terminal applications running while you disconnect your client from the Internet -- for example, when moving between your home and workplace.
"},{"location":"user-guide/containers/","title":"Containers","text":"This page was originally based on the documentation at the University of Sheffield HPC service.
Designed around the notion of mobility of compute and reproducible science, Singularity enables users to have full control of their operating system environment. This means that a non-privileged user can \"swap out\" the Linux operating system and environment on the host for a Linux OS and environment that they control. So if the host system is running CentOS Linux but your application runs in Ubuntu Linux with a particular software stack, you can create an Ubuntu image, install your software into that image, copy the image to another host (e.g. ARCHER2), and run your application on that host in its native Ubuntu environment.
Singularity also allows you to leverage the resources of whatever host you are on. This includes high-speed interconnects (e.g. Slingshot on ARCHER2), file systems (e.g. /home and /work on ARCHER2) and potentially other resources.
Note
Singularity only supports Linux containers. You cannot create images that use Windows or macOS (this is a restriction of the containerisation model rather than Singularity).
"},{"location":"user-guide/containers/#useful-links","title":"Useful Links","text":"Similar to Docker, a Singularity container is a self-contained software stack. As Singularity does not require a root-level daemon to run its containers (as is required by Docker) it is suitable for use on multi-user HPC systems such as ARCHER2. Within the container, you have exactly the same permissions as you do in a standard login session on the system.
In practice, this means that a container image created on your local machine with all your research software installed for local development will also run on ARCHER2.
Pre-built container images (such as those on DockerHub or the SingularityHub archive) can simply be downloaded and used on ARCHER2 (or anywhere else Singularity is installed).
Creating and modifying container images requires root permission and so must be done on a system where you have such access (in practice, this is usually within a virtual machine on your laptop/workstation).
Note
SingularityHub was a publicly available cloud service for Singularity container images active from 2016 to 2021. It built container recipes from GitHub repositories on Google Cloud, and container images were available via the command line Singularity or sregistry software. These container images are still available now in the SingularityHub Archive.
"},{"location":"user-guide/containers/#using-singularity-images-on-archer2","title":"Using Singularity Images on ARCHER2","text":"Singularity containers can be used on ARCHER2 in a number of ways, including:
We provide information on each of these scenarios below. First, we describe briefly how to get existing container images onto ARCHER2 so that you can launch containers based on them.
"},{"location":"user-guide/containers/#getting-existing-container-images-onto-archer2","title":"Getting existing container images onto ARCHER2","text":"Singularity container images are files, so, if you already have a container image, you can use scp
to copy the file to ARCHER2 as you would with any other file.
If you wish to get a file from one of the container image repositories, then Singularity allows you to do this from ARCHER2 itself.
For example, to retrieve a container image from SingularityHub on ARCHER2 we can simply issue a Singularity command to pull the image.
auser@ln03:~> singularity pull hello-world.sif shub://vsoch/hello-world\n
The container image located at the shub
URI is written to a Singularity Image File (SIF) called hello-world.sif
.
Once you have a container image file, launching a container based on the container image on the login nodes in an interactive way is extremely simple: you use the singularity shell
command. Using the container image we pulled in the example above:
auser@ln03:~> singularity shell hello-world.sif\nSingularity>\n
Within a Singularity container your home directory will be available.
Once you have finished using your container, you can return to the ARCHER2 login node prompt with the exit
command:
Singularity> exit\nexit\nauser@ln03:~>\n
"},{"location":"user-guide/containers/#interactive-use-on-the-compute-nodes","title":"Interactive use on the compute nodes","text":"The process for using a container interactively on the compute nodes is very similar to that for the login nodes. The only difference is that you first have to submit an interactive serial job (from a location on /work
) in order to get interactive access to the compute node.
For example, to reserve a full node for you to work on interactively you would use:
auser@ln03:/work/t01/t01/auser> srun --nodes=1 --exclusive --time=00:20:00 \\\n --account=[budget code] \\\n --partition=standard --qos=standard \\\n --pty /bin/bash\n\n...wait until job starts...\n\nauser@nid00001:/work/t01/t01/auser>\n
Note that the prompt has changed to show you are on a compute node. Now you can launch a container in the same way as on the login node.
auser@nid00001:/work/t01/t01/auser> singularity shell hello-world.sif\nSingularity> exit\nexit\nauser@nid00001:/work/t01/t01/auser> exit\nauser@ln03:/work/t01/t01/auser>\n
Note
We used exit
to leave the interactive container shell and then exit
again to leave the interactive job on the compute node.
You can also use Singularity containers within a non-interactive batch script as you would any other command. If your container image contains a runscript then you can use singularity run
to execute the runscript in the job. You can also use singularity exec
to execute arbitrary commands (or scripts) within the container.
An example job submission script to run a serial job that executes the runscript within a container based on the container image in the hello-world.sif
file that we downloaded previously to an ARCHER2 login node would be as follows.
#!/bin/bash --login\n\n# Slurm job options (name, compute nodes, job time)\n\n#SBATCH --job-name=helloworld\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:10:00\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Run the serial executable\nsingularity run $SLURM_SUBMIT_DIR/hello-world.sif\n
You submit this in the usual way and the standard output and error should be written to slurm-...
, where the output filename ends with the job number.
Running a Singularity container in parallel across a number of compute nodes requires some preparation. In general though, Singularity can be run within the parallel job launcher (srun
).
srun <options> \\\n singularity <options> /path/to/image/file \\\n app <options>\n
The code snippet above shows the launch command as having three nested parts, srun
, the singularity environment and the containerised application.
The Singularity container image must be compatible with the MPI environment on the host: either the containerised app has been built against the appropriate MPI libraries, or the container itself contains an MPI library that is compatible with the host MPI. The latter situation is known as the hybrid model; this is the approach taken in the sections that follow.
"},{"location":"user-guide/containers/#creating-your-own-singularity-container-images","title":"Creating Your Own Singularity Container Images","text":"As we saw above, you can create Singularity container images by importing from DockerHub or Singularity Hub on ARCHER2 itself. If you wish to create your own custom container image to use with Singularity then you must use a system where you have root (or administrator) privileges - often your own laptop or workstation.
There are a number of different options to create container images on your local system to use with Singularity on ARCHER2. We are going to use Docker on our local system to create the container image, push the new container image to Docker Hub and then use Singularity on ARCHER2 to convert the Docker container image to a Singularity container image SIF file.
For macOS and Windows users we recommend installing Docker Desktop. For Linux users, we recommend installing Docker directly on your local system. See the Docker documentation for full details on how to install Docker Desktop/Docker.
"},{"location":"user-guide/containers/#building-container-images-using-docker","title":"Building container images using Docker","text":"Note
We assume that you are familiar with using Docker in these instructions. You can find an introduction to Docker at Reproducible Computational Environments Using Containers: Introduction to Docker
As usual, you can build container images with a command similar to:
docker build --platform linux/amd64 -t <username>/<image name>:<version> .\n
Where:
<username>
is your Docker Hub username; <image name>
is the name of the container image you wish to create; <version>
specifies the version of the image you are creating (e.g. \"latest\", \"v1\"); and .
is the build context - in this example it is the location of the Dockerfile. Note that you should use the --platform linux/amd64
option to ensure that the container image is compatible with the processor architecture on ARCHER2.
MPI on ARCHER2 is provided by the Cray MPICH libraries with the interface to the high-performance Slingshot interconnect provided via the OFI interface. Therefore, as per the Singularity MPI Hybrid model, we will build our container image such that it contains a version of the MPICH MPI library compiled with support for OFI. Below, we provide instructions on creating a container image with a version of MPICH compiled in this way. We then provide an example of how to run a Singularity container with MPI over multiple ARCHER2 compute nodes.
"},{"location":"user-guide/containers/#building-an-image-with-mpi-from-scratch","title":"Building an image with MPI from scratch","text":"Warning
Remember, all these steps should be executed on your local system where you have administrator privileges and Docker installed, not on ARCHER2.
We will illustrate the process of building a Singularity image with MPI from scratch by building an image that contains MPI provided by MPICH and the OSU MPI benchmarks. As part of the container image creation we need to download the source code for both MPICH and the OSU benchmarks. At the time of writing, the stable MPICH release is 3.4.2 and the stable OSU benchmark release is 5.8, although this may have changed by the time you follow these instructions. Note that the Dockerfile below downloads OSU benchmark release 5.4.1.
First, create a Dockerfile that describes how to build the image:
FROM ubuntu:20.04\n\nENV DEBIAN_FRONTEND=noninteractive\n\n# Install the necessary packages (from repo)\nRUN apt-get update && apt-get install -y --no-install-recommends \\\n apt-utils \\\n build-essential \\\n curl \\\n libcurl4-openssl-dev \\\n libzmq3-dev \\\n pkg-config \\\n software-properties-common\nRUN apt-get clean\nRUN apt-get install -y dkms\nRUN apt-get install -y autoconf automake build-essential numactl libnuma-dev autoconf automake gcc g++ git libtool\n\n# Download and build an ABI compatible MPICH\nRUN curl -sSLO http://www.mpich.org/static/downloads/3.4.2/mpich-3.4.2.tar.gz \\\n && tar -xzf mpich-3.4.2.tar.gz -C /root \\\n && cd /root/mpich-3.4.2 \\\n && ./configure --prefix=/usr --with-device=ch4:ofi --disable-fortran \\\n && make -j8 install \\\n && rm -rf /root/mpich-3.4.2 \\\n && rm /mpich-3.4.2.tar.gz\n\n# OSU benchmarks\nRUN curl -sSLO http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.4.1.tar.gz \\\n && tar -xzf osu-micro-benchmarks-5.4.1.tar.gz -C /root \\\n && cd /root/osu-micro-benchmarks-5.4.1 \\\n && ./configure --prefix=/usr/local CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx \\\n && cd mpi \\\n && make -j8 install \\\n && rm -rf /root/osu-micro-benchmarks-5.4.1 \\\n && rm /osu-micro-benchmarks-5.4.1.tar.gz\n\n# Add the OSU benchmark executables to the PATH\nENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt:$PATH\nENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/collective:$PATH\n\n# path to mlx libraries in Ubuntu\nENV LD_LIBRARY_PATH=/usr/lib/libibverbs:$LD_LIBRARY_PATH\n
A quick overview of what the above Dockerfile is doing:
The image is based on the ubuntu:20.04
Docker image. RUN
sections with apt-get
commands: install the base packages required from the Ubuntu package repos. ENV
sections: add the OSU benchmark executables to the PATH so they can be executed in the container without specifying the full path; set the correct paths to the network libraries within the container.
Now we can go ahead and build the container image using Docker (this assumes that you issue the command in the same directory as the Dockerfile you created based on the specification above):
docker build --platform linux/amd64 -t auser/osu-benchmarks:5.4.1 .\n
(Remember to change auser
to your Docker Hub username.)
Once you have successfully built your container image, you should push it to Docker Hub:
docker push auser/osu-benchmarks:5.4.1\n
Finally, you need to use Singularity on ARCHER2 to convert the Docker container image to a Singularity container image file. Log into ARCHER2, move to the work file system and then use a command like:
auser@ln01:/work/t01/t01/auser> singularity build osu-benchmarks_5.4.1.sif docker://auser/osu-benchmarks:5.4.1\n
Tip
You can find a copy of the osu-benchmarks_5.4.1.sif
image on ARCHER2 in the directory $EPCC_SINGULARITY_DIR
if you do not want to build it yourself but still want to test.
Tip
These instructions assume you have built a Singularity container image file on ARCHER2 that includes MPI provided by MPICH with the OFI interface. See the sections above for how to build such container images.
Once you have built your Singularity container image file that includes MPICH built with OFI for ARCHER2, you can use it to run parallel jobs in a similar way to non-Singularity jobs. The example job submission script below uses the container image file we built above with MPICH and the OSU benchmarks to run the Allreduce benchmark on two nodes where all 128 cores on each node are used for MPI processes (so, 256 MPI processes in total).
#!/bin/bash\n\n# Slurm job options (name, compute nodes, job time)\n#SBATCH --job-name=singularity_parallel\n#SBATCH --time=0:10:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n#SBATCH --account=[budget code]\n\n# Load the module to make the Cray MPICH ABI available\nmodule load cray-mpich-abi\n\nexport OMP_NUM_THREADS=1\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n#\u00a0Set the LD_LIBRARY_PATH environment variable within the Singularity container\n# to ensure that it used the correct MPI libraries.\nexport SINGULARITYENV_LD_LIBRARY_PATH=\"/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib:/opt/cray/libfabric/1.12.1.2.2.0.0/lib64:/opt/cray/pe/gcc-libs:/opt/cray/pe/gcc-libs:/opt/cray/pe/lib64:/opt/cray/pe/lib64:/opt/cray/xpmem/default/lib64:/usr/lib64/libibverbs:/usr/lib64:/usr/lib64\"\n\n# This makes sure HPE Cray Slingshot interconnect libraries are available\n# from inside the container.\nexport 
SINGULARITY_BIND=\"/opt/cray,/var/spool,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib,/etc/host.conf,/etc/libibverbs.d/mlx5.driver,/etc/libnl/classid,/etc/resolv.conf,/opt/cray/libfabric/1.12.1.2.2.0.0/lib64/libfabric.so.1,/opt/cray/pe/gcc-libs/libatomic.so.1,/opt/cray/pe/gcc-libs/libgcc_s.so.1,/opt/cray/pe/gcc-libs/libgfortran.so.5,/opt/cray/pe/gcc-libs/libquadmath.so.0,/opt/cray/pe/lib64/libpals.so.0,/opt/cray/pe/lib64/libpmi2.so.0,/opt/cray/pe/lib64/libpmi.so.0,/opt/cray/xpmem/default/lib64/libxpmem.so.0,/run/munge/munge.socket.2,/usr/lib64/libibverbs/libmlx5-rdmav34.so,/usr/lib64/libibverbs.so.1,/usr/lib64/libkeyutils.so.1,/usr/lib64/liblnetconfig.so.4,/usr/lib64/liblustreapi.so,/usr/lib64/libmunge.so.2,/usr/lib64/libnl-3.so.200,/usr/lib64/libnl-genl-3.so.200,/usr/lib64/libnl-route-3.so.200,/usr/lib64/librdmacm.so.1,/usr/lib64/libyaml-0.so.2\"\n\n# Launch the parallel job.\nsrun --hint=nomultithread --distribution=block:block \\\n singularity run osu-benchmarks_5.4.1.sif \\\n osu_allreduce\n
The only changes from a standard submission script are:
SINGULARITYENV_LD_LIBRARY_PATH
is set to ensure that the executable can find the correct MPI libraries from within the container. SINGULARITY_BIND
is set to ensure that the correct libraries are available within the container to be able to use the HPE Cray Slingshot interconnect. srun
calls the singularity
software with the container image file we created rather than the parallel program directly.
Important
Remember that the image file must be located on /work
to run jobs on the compute nodes.
If the job runs correctly, you should see output similar to the following in your slurm-*.out
file:
Lmod is automatically replacing \"cray-mpich/8.1.23\" with\n\"cray-mpich-abi/8.1.23\".\n\n\n# OSU MPI Allreduce Latency Test v5.4.1\n# Size Avg Latency(us)\n4 7.93\n8 7.93\n16 8.13\n32 8.69\n64 9.54\n128 13.75\n256 17.04\n512 25.94\n1024 29.43\n2048 43.53\n4096 46.53\n8192 46.20\n16384 55.85\n32768 83.11\n65536 136.90\n131072 257.13\n262144 486.50\n524288 1025.87\n1048576 2173.25\n
"},{"location":"user-guide/containers/#using-containerised-hpe-cray-programming-environments","title":"Using Containerised HPE Cray Programming Environments","text":"An experimental containerised CPE module has been setup on ARCHER2. The module is not available by default but can be made accessible by running module use
with the right path.
module use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n
The purpose of the ccpe
module(s) is to allow developers to check that their code compiles with the latest Cray Programming Environment (CPE) releases. The CPE release installed on ARCHER2 (currently CPE 22.12) will typically be older than the latest available. A more recent containerised CPE therefore gives developers the opportunity to try out the latest compilers and libraries before the ARCHER2 CPE is upgraded.
Note
The Containerised CPEs support CCE and GCC compilers, but not AOCC compilers.
The ccpe/23.12
module then provides access to CPE 23.12 via a Singularity image file, located at /work/y07/shared/utils/dev/ccpe/23.12/cpe_23.12.sif
. Singularity containers can be run such that locations on the host file system are still visible. This means source code stored on /work
can be compiled from inside the CPE container. And any output resulting from the compilation, such as object files, libraries and executables, can be written to /work
also. This ability to bind to locations on the host is necessary as the container is immutable, i.e., you cannot write files to the container itself.
Any executable resulting from a containerised CPE build can be run from within the container, allowing the developer to test the performance of the containerised libraries, e.g., libmpi_cray
, libpmi2
, libfabric
.
We'll now show how to build and run a simple Hello World MPI example using a containerised CPE.
First, cd
to the directory containing the Hello World MPI source, makefile and build script. Examples of these files are given below.
#!/bin/bash\n\nmake clean\nmake\n\necho -e \"\\n\\nldd helloworld\"\nldd helloworld\n
MF= Makefile\n\nFC= ftn\nFFLAGS= -O3\nLFLAGS= -lmpichf90\n\nEXE= helloworld\nFSRC= helloworld.f90\n\n#\n# No need to edit below this line\n#\n\n.SUFFIXES:\n.SUFFIXES: .f90 .o\n\nOBJ= $(FSRC:.f90=.o)\n\n.f90.o:\n $(FC) $(FFLAGS) -c $<\n\nall: $(EXE)\n\n$(EXE): $(OBJ)\n $(FC) $(FFLAGS) -o $@ $(OBJ) $(LFLAGS)\n\nclean:\n rm -f $(OBJ) $(EXE) core\n
!\n! Prints 'Hello World' from rank 0 and\n! prints what processor it is out of the total number of processors from\n! all ranks\n!\n\nprogram helloworld\n use mpi\n\n implicit none\n\n integer :: comm, rank, size, ierr\n integer :: last_arg\n\n comm = MPI_COMM_WORLD\n\n call MPI_INIT(ierr)\n\n call MPI_COMM_RANK(comm, rank, ierr)\n call MPI_COMM_SIZE(comm, size, ierr)\n\n ! Each process prints out its rank\n write(*,*) 'I am ', rank, 'out of ', size,' processors.'\n\n call sleep(1)\n\n call MPI_FINALIZE(ierr)\n\nend program helloworld\n
The ldd
command at the end of the build script is simply there to confirm that the code is indeed linked to containerised libraries that form part of the CPE 23.12 release.
The next step is to launch a job (via sbatch
) on a serial node that instantiates the containerised CPE 23.12 image and builds the Hello World MPI code.
#!/bin/bash\n\n#SBATCH --job-name=ccpe-build\n#SBATCH --ntasks=8\n#SBATCH --time=00:10:00\n#SBATCH --account=<budget code>\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n#SBATCH --export=none\n\nexport OMP_NUM_THREADS=1\n\nmodule use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n\nBUILD_CMD=\"${CCPE_BUILDER} ${SLURM_SUBMIT_DIR}/build.sh\"\n\nsingularity exec --cleanenv \\\n --bind ${CCPE_BIND_ARGS},${SLURM_SUBMIT_DIR} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \\\n ${CCPE_IMAGE_FILE} ${BUILD_CMD}\n
The CCPE
environment variables shown above (e.g., CCPE_BUILDER
and CCPE_IMAGE_FILE
) are set by the loading of the ccpe/23.12
module. The CCPE_BUILDER
variable holds the path to the script that prepares the containerised environment prior to running the build.sh
script. You can run cat ${CCPE_BUILDER}
to take a closer look at what is going on.
Note
Passing the ${SLURM_SUBMIT_DIR}
path to Singularity via the --bind
option allows the CPE container to access the source code and write out the executable using locations on the host.
Running the newly-built code is similarly straightforward; this time the containerised CPE is launched on the compute nodes using the srun
command.
#!/bin/bash\n\n#SBATCH --job-name=helloworld\n#SBATCH --nodes=2\n#SBATCH --tasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n#SBATCH --account=<budget code>\n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --export=none\n\nexport OMP_NUM_THREADS=1\n\nmodule use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n\nRUN_CMD=\"${SLURM_SUBMIT_DIR}/helloworld\"\n\nsrun --distribution=block:block --hint=nomultithread --chdir=${SLURM_SUBMIT_DIR} \\\n singularity exec --bind ${CCPE_BIND_ARGS},${SLURM_SUBMIT_DIR} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \\\n ${CCPE_IMAGE_FILE} ${RUN_CMD}\n
If you wish you can at runtime replace a containerised library with its host equivalent. You may for example decide to do this for a low-level communications library such as libfabric
or libpmi
. This can be done by adding (before the srun
command) something like the following line to the submit-run.slurm
file.
source ${CCPE_SET_HOST_PATH} \"/opt/cray/pe/pmi\" \"6.1.8\" \"lib\"\n
As of April 2024, the version of PMI available on ARCHER2 is 6.1.8 (CPE 22.12), and so the command above would allow you to isolate the impact of the containerised PMI library, which for CPE 23.12 is PMI 6.1.13. To see how the setting of the host library is done, simply run cat ${CCPE_SET_HOST_PATH}
after loading the ccpe
module.
An MPI code that just prints a message from each rank is obviously very simple. Real-world codes such as CP2K or GROMACS will often require additional software for compilation, e.g., Intel MKL libraries or tools that control the build process such as CMake
. The way round this sort of problem is to point the CCPE container at the locations on the host where the software is installed.
#!/bin/bash\n\n#SBATCH --job-name=ccpe-build\n#SBATCH --ntasks=8\n#SBATCH --time=00:10:00\n#SBATCH --account=<budget code>\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n#SBATCH --export=none\n\nexport OMP_NUM_THREADS=1\n\nmodule use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n\nCMAKE_DIR=\"/work/y07/shared/utils/core/cmake/3.21.3\"\n\nBUILD_CMD=\"${CCPE_BUILDER} ${SLURM_SUBMIT_DIR}/build.sh\"\n\nsingularity exec --cleanenv \\\n --bind ${CCPE_BIND_ARGS},${CMAKE_DIR},${SLURM_SUBMIT_DIR} \\\n --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \\\n ${CCPE_IMAGE_FILE} ${BUILD_CMD}\n
The submit-cmake-build.slurm
script shows how the --bind
option can be used to make the CMake
installation on ARCHER2 accessible from within the container. The build.sh
script can then call the cmake
command directly (once the CMake
bin directory has been added to the PATH
environment variable).
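A minimal sketch of what such a build.sh might begin with is given below. It assumes that the CMake executables sit in a bin subdirectory of the installation path bound into the container; the availability check and its message are illustrative only and are not part of the ARCHER2 scripts.

```shell
#!/bin/bash
# Sketch of the start of a build.sh for a CMake-based project (assumed layout).
# CMAKE_DIR matches the path bound into the container in the script above.
CMAKE_DIR="/work/y07/shared/utils/core/cmake/3.21.3"

# Put the CMake bin directory at the front of PATH so that cmake can be
# called directly from the rest of the build script.
export PATH="${CMAKE_DIR}/bin:${PATH}"

if command -v cmake > /dev/null; then
    cmake --version
else
    echo "cmake not found; check that ${CMAKE_DIR} is bound into the container" >&2
fi
```

CMAKE_DIR is defined inside the script because the container is started with --cleanenv, so variables set in the submission script are not necessarily passed through.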
This content has been moved to archer-migration/data-migration
"},{"location":"user-guide/data/","title":"Data management and transfer","text":"This section covers best practice and tools for data management on ARCHER2 along with a description of the different storage available on the service.
The IO section has information on achieving good performance for reading and writing data to the ARCHER2 storage along with information and advice on different IO patterns.
Information
If you have any questions on data management and transfer please do not hesitate to contact the ARCHER2 service desk at support@archer2.ac.uk.
"},{"location":"user-guide/data/#useful-resources-and-links","title":"Useful resources and links","text":"We strongly recommend that you give some thought to how you use the various data storage facilities that are part of the ARCHER2 service. This will not only allow you to use the machine more effectively but also to ensure that your valuable data is protected.
Here are the main points you should consider:
rsync
, tar
, zip
) and generally encourage you to use them to reduce data volumes. However, in some cases, compressing the data can take longer than transferring the uncompressed data, particularly when transferring data between two locations that both have large data transfer bandwidth available. With scp
(and rsync
over scp
) your data will be encrypted, introducing a static overhead per file. This issue can be minimised by reducing the number of files to be transferred by creating archives. You can also switch to a faster encryption algorithm. The fastest performing cipher that is commonly available in SSH at the moment is generally aes128-ctr
as most common processors provide a hardware implementation.
The ARCHER2 service, like many HPC systems, has a complex structure. There are a number of different data storage types available to users:
/epsrc
and /general
)Each type of storage has different characteristics and policies, and is suitable for different types of use.
Important
All users have a directory on one of the home file systems and on one of the work file systems. The directories are located at:
/home/[project ID]/[project ID]/[user ID]
(this is also set as your home directory)/work/[project ID]/[project ID]/[user ID]
There are also three different types of node available to users:
Each type of node sees a different combination of the storage types. The following table shows which storage options are available on different node types:
Storage | Login nodes | Compute nodes | Data analysis nodes | Notes
/home | yes | no | yes | Incremental backup
/work | yes | yes | yes | No backup, high performance
Solid state (NVMe) | yes | yes | yes | No backup, high performance
RDFaaS | yes | no | yes | Disaster recovery backup
Important
Only the work file systems and the solid state (NVMe) file system are visible on the compute nodes. This means that all data required by calculations at runtime (input data, application binaries, software libraries, etc.) must be placed on one of these file systems.
You may see \"file not found\" errors if you try to access data on the /home or RDFaaS file systems when running on the compute nodes.
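One way to catch this before submitting a job is a simple path check. The sketch below is a hypothetical helper (on_compute_visible_fs is not an ARCHER2 command) that only compares path prefixes against the work and solid state locations named in this section; it does not resolve symbolic links or relative paths. The input file names are made up for illustration.

```shell
# on_compute_visible_fs PATH: succeed if PATH starts with a location that is
# visible on the compute nodes (the work or solid state file systems).
# This is a plain prefix check on the string, nothing more.
on_compute_visible_fs() {
    case "$1" in
        /work/*|/mnt/lustre/a2fs-nvme/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check a list of (hypothetical) input files before submission.
for f in /work/t01/t01/auser/input.dat /home/t01/t01/auser/input.dat; do
    if on_compute_visible_fs "$f"; then
        echo "visible on compute nodes: $f"
    else
        echo "NOT visible on compute nodes: $f"
    fi
done
```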
"},{"location":"user-guide/data/#home-file-systems","title":"Home file systems","text":"There are four independent home file-systems. Every project has an allocation on one of the four. You do not need to know which one your project uses as your projects space can always be accessed via the path /home/[project ID]
with your personal directory at /home/[project ID]/[project ID]/[user ID]
. Each home file-system is approximately 100 TB in size and is implemented using standard Network Attached Storage (NAS) technology. This means that these disks are not particularly high performance but are well suited to standard operations like compilation and file editing. These file systems are visible from the ARCHER2 login nodes.
The home file systems are fully backed up. The home file systems retain snapshots which can be used to recover past versions of files. Snapshots are taken weekly (for each of the past two weeks), daily (for each of the past two days) and hourly (for each of the last 6 hours). You can access the snapshots at .snapshot
from any given directory on the home file systems. Note that the .snapshot
directory will not show up under any version of \u201cls\u201d and will not tab complete.
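As an illustration of recovering a past version, the sketch below copies a file out of a snapshot without overwriting the current copy. It assumes that each snapshot appears as a subdirectory of .snapshot holding the directory contents at that time; the helper name and the snapshot name passed to it are hypothetical, so list the .snapshot directory first to see the names actually available.

```shell
# restore_from_snapshot SNAPSHOT FILE: copy FILE from the named snapshot of
# the current directory to FILE.restored, leaving the current version alone.
# SNAPSHOT is whatever name "ls .snapshot" reports (names are system-specific).
restore_from_snapshot() {
    local snapshot="$1" file="$2"
    local snap_file=".snapshot/${snapshot}/${file}"
    if [ ! -f "$snap_file" ]; then
        echo "No such file in snapshot: $snap_file" >&2
        return 1
    fi
    # -p preserves the timestamps and mode of the snapshot copy.
    cp -p "$snap_file" "${file}.restored"
}
```

Writing to a separate .restored file lets you compare the two versions before replacing anything.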
These file systems are a good location to keep source code, copies of scripts and compiled binaries. Small amounts of important data can also be copied here for safe keeping though the file systems are not fast enough to manipulate large datasets effectively.
"},{"location":"user-guide/data/#quotas-on-home-file-systems","title":"Quotas on home file systems","text":"All projects are assigned a quota on the home file systems. The project PI or manager can split this quota up between users or groups of users if they wish.
You can view any home file system quotas that apply to your account by logging into SAFE and navigating to the page for your ARCHER2 login account.
Tip
Quota and usage data on SAFE is updated twice daily so may not be exactly up to date with the situation on the systems themselves.
"},{"location":"user-guide/data/#work-file-systems","title":"Work file systems","text":"There are currently three work file systems on the full ARCHER2 service. Each of these file systems is 3.4 PB and a portion of one of these file systems is available to each project. You do not usually need to know which one your project uses as your projects space can always be accessed via the path /work/[project ID]
with your personal directory at /work/[project ID]/[project ID]/[user ID]
.
All of these are high-performance, Lustre parallel file systems. They are designed to support data in large files. Performance is likely to be worse for data stored in large numbers of small files.
These file systems are available on the compute nodes and are the default location users should use for data required at runtime on the compute nodes.
Warning
There are no backups of any data on the work file systems. You should not rely on these file systems for long term storage.
Ideally, these file systems should only contain data that is:
In practice it may be convenient to keep copies of datasets on the work file systems that you know will be needed at a later date. However, make sure that important data is always backed up elsewhere and that your work would not be significantly impacted if the data on the work file systems was lost.
Large data sets can be moved to the RDFaaS storage or transferred off the ARCHER2 service entirely.
If you have data on the work file systems that you are not going to need in the future please delete it.
"},{"location":"user-guide/data/#quotas-on-the-work-file-systems","title":"Quotas on the work file systems","text":"As for the home file systems, all projects are assigned a quota on the work file systems. The project PI or manager can split this quota up between users or groups of users if they wish.
You can view any work file system quotas that apply to your account by logging into SAFE and navigating to the page for your ARCHER2 login account.
Tip
Quota and usage data on SAFE is updated twice daily so may not be exactly up to date with the situation on the systems themselves.
You can also examine up to date quotas and usage on the ARCHER2 systems themselves using the lfs quota
command. To do this:
auser
in project t01
then I would:cd /work/t01/t01/auser\n
auser@ln03:/work/t01/t01/auser> lfs quota -hu auser .\nDisk quotas for usr auser (uid 5496):\n Filesystem used quota limit grace files quota limit grace\n . 1.366G 0k 0k - 5486 0 0 -\nuid 5496 is using default block quota setting\nuid 5496 is using default file quota setting\n
The quota
and limit
of 0k
here indicate that no user quota is set for this user
auser@ln03:/work/t01/t01/auser> lfs quota -hp $(id -g) .\nDisk quotas for prj 1009 (pid 1009):\n Filesystem used quota limit grace files quota limit grace\n . 2.905G 0k 0k - 25300 0 0 -\npid 1009 is using default block quota setting\npid 1009 is using default file quota setting\n
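If you want to use these figures in a script, the values can be picked out of the command output. The sketch below is a hypothetical helper (quota_summary is not part of Lustre) and assumes the single-row layout shown above, where the values appear on the line directly after the Filesystem header; real output can wrap when the path is long, in which case this would need adjusting.

```shell
# quota_summary: read lfs quota output on stdin and print the used space
# and file count from the row that follows the Filesystem header line.
quota_summary() {
    awk '$1 == "Filesystem" { if ((getline) > 0) print "used=" $2, "files=" $6 }'
}

# Example use on ARCHER2 (not run here):
#   lfs quota -hp "$(id -g)" . | quota_summary
```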
"},{"location":"user-guide/data/#solid-state-nvme-file-system-scratch-storage","title":"Solid state (NVMe) file system - scratch storage","text":"Important
The solid state storage system is configured as scratch storage with all files that have not been accessed in the last 28 days being automatically deleted. This implementation starts on 28 Feb 2024, i.e. any files not accessed since 1 Feb 2024 will be automatically removed on 28 Feb 2024.
The solid state storage file system is a 1 PB high performance parallel Lustre file system similar to the work file systems. However, unlike the work file systems, all of the disks are based on solid state storage (NVMe) technology. This changes the performance characteristics of the file system compared to the work file systems. Testing by the ARCHER2 CSE team at EPCC has shown that you may see I/O performance improvements from the solid state storage compared to the standard work Lustre file systems on ARCHER2 if your I/O model has the following characteristics or similar:
Data on the solid state (NVMe) file system is visible on the compute nodes
Important
If you use MPI-IO approaches to reading/writing data - this includes parallel HDF5 and parallel NetCDF - then you are very unlikely to see any performance improvements from using the solid state storage over the standard parallel Lustre file systems on ARCHER2.
Warning
There are no backups of any data on the solid state (NVMe) file system. You should not rely on this file system for long term storage.
"},{"location":"user-guide/data/#access-to-the-solid-state-file-system","title":"Access to the solid state file system","text":"Projects do not have access to the solid state file system by default. If your project does not yet have access and you want access for your project, please contact the Service Desk to request access.
"},{"location":"user-guide/data/#location-of-directories","title":"Location of directories","text":"You can find your directory on the file system at:
/mnt/lustre/a2fs-nvme/work/<project code>/<project code>/<username>\n
For example, if my username is auser
and I am in project t01
, I could find my solid state storage directory at:
/mnt/lustre/a2fs-nvme/work/t01/t01/auser\n
"},{"location":"user-guide/data/#quotas-on-solid-state-file-system","title":"Quotas on solid state file system","text":"Important
All projects have the same, large quota of 250,000 GiB on the solid state file system to allow them to use it as a scratch file system. Remember, any files that have not been accessed in the last 28 days will be automatically deleted.
You query quotas for the solid state file system in the same way as quotas on the work file systems.
Bug
Usage and quotas of the solid state file system are not yet available in SAFE - you should use commands such as lfs quota -hp $(id -g) .
to query quotas on the solid state file system.
You can identify which files you own that are candidates for deletion at the next scratch file system purge using the find
command in the following format:
find /mnt/lustre/a2fs-nvme/work/<project code> -atime +28 -type f -print\n
For example, if my account is in project t01
, I would use:
find /mnt/lustre/a2fs-nvme/work/t01 -atime +28 -type f -print\n
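To get a quick overview rather than a full file listing, the same test can be wrapped in a small helper. The sketch below is hypothetical (summarise_stale is not an ARCHER2 tool) and assumes GNU find and awk; it counts the files that have not been accessed for more than a given number of days and adds up their sizes.

```shell
# summarise_stale DIR [DAYS]: count files under DIR that have not been
# accessed for more than DAYS days (default 28) and report their combined
# size in bytes. Relies on the GNU find -printf extension.
summarise_stale() {
    local dir="$1" days="${2:-28}"
    find "$dir" -type f -atime +"$days" -printf '%s\n' |
        awk '{ n++; bytes += $1 } END { printf "%d files, %d bytes\n", n, bytes }'
}

# Example use on ARCHER2 (not run here):
#   summarise_stale /mnt/lustre/a2fs-nvme/work/t01
```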
"},{"location":"user-guide/data/#rdfaas-file-systems","title":"RDFaaS file systems","text":"The RDFaaS file systems provide additional capacity for projects to store data that is not currently required on the compute nodes but which is too large for the Home file systems.
Warning
The RDFaaS file systems are backed up for disaster recovery purposes only (e.g. loss of the whole file system) so it is not possible to recover individual files if they are deleted by mistake or otherwise lost.
Tip
Not all projects on ARCHER2 have access to RDFaaS. If you do have access, this will show up in the login account page on SAFE for your ARCHER2 login account.
If you have access to RDFaaS, you will have a directory in one of two file systems: either /epsrc
or /general
.
For example, if your username is auser
and you are in the e05
project, then your RDFaaS directory will be at:
/epsrc/e05/e05/auser\n
The RDFaaS file systems are not available on the ARCHER2 compute nodes.
Tip
If you are having issues accessing data on the RDFaaS file system then please contact the ARCHER2 Service Desk
"},{"location":"user-guide/data/#copying-data-from-rdfaas-to-work-file-systems","title":"Copying data from RDFaaS to Work file systems","text":"You should use the standard Linux cp
command to copy data from the RDFaaS file system to other ARCHER2 file systems (usually /work
). For example, to transfer the file important-data.tar.gz
from the RDFaaS file system to /work
you would use the following command (assuming you are user auser
in project e05
):
cp /epsrc/e05/e05/auser/important-data.tar.gz /work/e05/e05/auser/\n
(Remember to replace the project code and username with your own. You may also need to use /general
if your data is stored there on the RDFaaS file systems.)
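For large or important files you may want to confirm that the copy is intact before relying on it. The sketch below is one possible way to do this, not an ARCHER2-provided tool: copy_and_verify is a hypothetical helper that assumes sha256sum is available and compares checksums of the source and the copy.

```shell
# copy_and_verify SRC DEST_DIR: copy a file into DEST_DIR and compare the
# SHA-256 checksums of the source and the copy afterwards.
copy_and_verify() {
    local src="$1" dest_dir="$2"
    local dest="${dest_dir%/}/$(basename "$src")"
    cp "$src" "$dest" || return 1
    local src_sum dest_sum
    src_sum=$(sha256sum < "$src") || return 1
    dest_sum=$(sha256sum < "$dest") || return 1
    if [ "$src_sum" = "$dest_sum" ]; then
        echo "OK: $dest"
    else
        echo "MISMATCH: $dest" >&2
        return 1
    fi
}

# Example use on ARCHER2 (not run here):
#   copy_and_verify /epsrc/e05/e05/auser/important-data.tar.gz /work/e05/e05/auser/
```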
Some large projects may choose to split their resources into multiple subprojects. These subprojects will have identifiers appended to the main project ID. For example, the rse
subgroup of the z19
project would have the ID z19-rse
. If the main project has allocated storage quotas to the subproject the directories for this storage will be found at, for example:
/home/z19/z19-rse/auser\n
Your Linux home directory will generally not be changed when you are made a member of a subproject so you must change directories manually (or change the ownership of files) to make use of this different storage quota allocation.
"},{"location":"user-guide/data/#sharing-data-with-other-archer2-users","title":"Sharing data with other ARCHER2 users","text":"How you share data with other ARCHER2 users depends on whether or not they belong to the same project as you. Each project has two shared folders that can be used for sharing data.
"},{"location":"user-guide/data/#sharing-data-with-archer2-users-in-your-project","title":"Sharing data with ARCHER2 users in your project","text":"Each project has an inner shared folder.
/work/[project code]/[project code]/shared\n
This folder has read/write permissions for all project members. You can place any data you wish to share with other project members in this directory. For example, if your project code is x01 the inner shared folder would be located at /work/x01/x01/shared
.
Some projects have subprojects (also often referred to as 'project groups' or sub-budgets), e.g. project e123 might have a project group e123-fred for a sub-group of researchers working with Fred.
Often project groups do not have a disk quota set, but if the project PI does set up a group disk quota e.g. for /work then additional directories are created:
/work/e123/e123-fred\n/work/e123/e123-fred/shared\n/work/e123/e123-fred/<user> (for every user in the group)\n
and all members of the e123-fred
group will be able to use the /work/e123/e123-fred/shared
directory to share their files.
Note
If files are copied from their usual directories they will keep the original ownership. To grant ownership to the group:
chown -R $USER:e123-fred /work/e123/e123-fred/ ...
Each project also has an outer shared folder:
/work/[project code]/shared\n
It is writable by all project members and readable by any user on the system. You can place any data you wish to share with other ARCHER2 users who are not members of your project in this directory. For example, if your project code is x01 the outer shared folder would be located at /work/x01/shared
.
You should check the permissions of any files that you place in the shared area, especially if those files were created in your own ARCHER2 account. Files of the latter type are likely to be readable by you only.
The chmod
command below shows how to make sure that a file placed in the outer shared folder is also readable by all ARCHER2 users.
chmod a+r /work/x01/shared/your-shared-file.txt\n
Similarly, for the inner shared folder, chmod
can be called such that read permission is granted to all users within the x01 project.
chmod g+r /work/x01/x01/shared/your-shared-file.txt\n
If you're sharing a set of files stored within a folder hierarchy the chmod
command is slightly more complicated.
chmod -R a+Xr /work/x01/shared/my-shared-folder\nchmod -R g+Xr /work/x01/x01/shared/my-shared-folder\n
The -R
option ensures that the read permission is enabled recursively and the +X
guarantees that the user(s) you're sharing the folder with can access the subdirectories below my-shared-folder
.
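As a rough illustration of what -R and +X do, the following sketch applies the same chmod call to a scratch directory created with mktemp (a stand-in assumed here in place of /work/x01/shared, so the commands can be tried without touching project data). The stat -c %A calls assume GNU coreutils, as found on most Linux systems.

```shell
# Illustrative only: a scratch directory stands in for /work/x01/shared.
demo=$(mktemp -d)
mkdir -p $demo/my-shared-folder/sub
echo data > $demo/my-shared-folder/sub/file.txt
# Start from owner-only permissions, as files created in your own account often have.
chmod -R go-rwx $demo/my-shared-folder
# Grant read to everyone; X adds execute (search) permission only to
# directories and to files that are already executable.
chmod -R a+rX $demo/my-shared-folder
dir_mode=$(stat -c %A $demo/my-shared-folder/sub)
file_mode=$(stat -c %A $demo/my-shared-folder/sub/file.txt)
echo directory: $dir_mode
echo file: $file_mode
```

After the final chmod the subdirectory should be searchable by everyone, while the plain file gains read permission only.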
Every file has an owner group that specifies access permissions for users belonging to that group. It's usually the case that the group id is synonymous with the project code. Somewhat confusingly however, projects can contain groups of their own, called subprojects, which can be assigned disk space quotas distinct from the project.
chown -R $USER:x01-subproject /work/x01/x01-subproject/$USER/my-folder\n
The chown
command above changes the owning group for all the files within my-folder
to the x01-subproject
group. This might be necessary if previously those files were owned by the x01 group and thereby using some of the x01 disk quota.
Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.
The method you use to transfer data to/from ARCHER2 will depend on how much you want to transfer and where to. The methods we cover in this guide are:
Before discussing specific data transfer methods, we cover archiving, which is an essential process for transferring data efficiently.
"},{"location":"user-guide/data/#archiving","title":"Archiving","text":"If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger \"archive\" file for ease of transfer and manipulation. A single large file makes more efficient use of the file system and is easier to move, copy and transfer because significantly fewer metadata operations are required. Archive files can be created using tools like tar
and zip
.
The tar
command packs files into a \"tape archive\" format. The command has general form:
tar [options] [file(s)]\n
Common options include:
-c: create a new archive
-v: verbosely list files processed
-W: verify the archive after writing
-l: confirm all file hard links are included in the archive
-f: use an archive file (for historical reasons, tar writes its output to stdout by default rather than a file)
-b 2048: use a 1 MiB block size (better performance and less contention on Lustre compared to the default block size)
Putting these together:
tar -cvWlf mydata.tar mydata\n
will create and verify an archive.
To extract files from a tar file, the option -x
is used. For example:
tar -b 2048 -xf mydata.tar\n
will recover the contents of mydata.tar
to the current working directory (using a block size of 1 MiB to improve Lustre performance and reduce contention).
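The 1 MiB figure follows from how tar counts its blocking factor: the value given to -b is a number of 512-byte records. A minimal sketch of the arithmetic (the variable names are chosen here purely for illustration):

```shell
# tar counts its blocking factor (-b) in 512-byte records.
record_bytes=512
blocking_factor=2048
block_bytes=$((record_bytes * blocking_factor))
echo $block_bytes bytes
# For comparison, 1 MiB is 1024 * 1024 bytes.
mib_bytes=$((1024 * 1024))
echo $mib_bytes bytes
```

Both lines print the same number, so -b 2048 corresponds to a 1 MiB block.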
To verify an existing tar file against a set of data, the -d
(diff) option can be used. By default, no output will be given if a verification succeeds and an example of a failed verification follows:
$> tar -df mydata.tar mydata/*\nmydata/damaged_file: Mod time differs\nmydata/damaged_file: Size differs\n
Note
tar files do not store checksums with their data, requiring the original data to be present during verification.
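Because tar itself stores no checksums, one possible workaround is to record a checksum of the archive file next to it, so that the archive can later be checked without the original data. The following is only a sketch, assuming sha256sum from GNU coreutils is available; the scratch directory and the file name mydata.tar.sha256 are arbitrary choices for illustration.

```shell
# Illustrative only: a scratch directory stands in for your data directory.
work=$(mktemp -d)
cd $work
mkdir mydata
echo example > mydata/file1.txt
tar -cf mydata.tar mydata
# Record a checksum next to the archive (the file name is an arbitrary choice).
sha256sum mydata.tar > mydata.tar.sha256
# Later, for example after a transfer, check the archive without the original data.
check_result=$(sha256sum -c mydata.tar.sha256)
echo $check_result
```

Note that this checks the archive file as a whole; it does not replace tar -d when you want to compare the archive against the original files.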
Tip
Further information on using tar
can be found in the tar
manual (accessed via man tar
or at man tar).
The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:
zip [options] mydata.zip [file(s)]\n
Common options are:
-r: used to zip up a directory
-#: where \"#\" represents a digit ranging from 0 to 9 to specify compression level, 0 being the least and 9 the most. Default compression is -6 but we recommend using -0 to speed up the archiving process.
Together:
zip -0r mydata.zip mydata\n
will create an archive.
Note
Unlike tar, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system layout.
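For comparison, the tar side of this behaviour can be tried with the sketch below, which assumes GNU tar and GNU stat and uses a scratch directory with hypothetical file names. It archives a 1 MiB file together with a hard link to it and reports the archive size.

```shell
# Illustrative only: scratch directory and file names are hypothetical.
work=$(mktemp -d)
cd $work
mkdir mydata
# Create a 1 MiB file and a hard link to it.
head -c 1048576 /dev/zero > mydata/original.bin
ln mydata/original.bin mydata/hardlink.bin
tar -cf mydata.tar mydata
# Size of the archive in bytes.
archive_bytes=$(stat -c %s mydata.tar)
echo archive size: $archive_bytes bytes
```

With GNU tar the archive should be only a little over 1 MiB, since the second name is stored as a link rather than as a second copy of the data.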
The corresponding unzip
command is used to extract data from the archive. The simplest use case is:
unzip mydata.zip\n
which recovers the contents of the archive to the current working directory.
Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip
provides options for verifying this checksum against the stored files. The relevant flag is -t
and is used as follows:
$> unzip -t mydata.zip\nArchive: mydata.zip\n testing: mydata/ OK\n testing: mydata/file OK\nNo errors detected in compressed data of mydata.zip.\n
Tip
Further information on using zip
can be found in the zip
manual (accessed via man zip
or at man zip).
The easiest way of transferring data to/from ARCHER2 is to use one of the standard programs based on the SSH protocol such as scp
, sftp
or rsync
. These all use the same underlying mechanism (SSH) as you normally use to log in to ARCHER2. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine (ARCHER2 in this case).
To avoid having to type in your password multiple times you can set up an SSH key pair and use an SSH agent as documented in the User Guide at connecting
.
The SSH protocol encrypts all traffic it sends. This means that file transfer using SSH consumes a relatively large amount of CPU time at both ends of the transfer (for encryption and decryption). The ARCHER2 login nodes have fairly fast processors that can sustain about 100 MB/s transfer. The encryption algorithm used is negotiated between the SSH client and the SSH server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The aes128-ctr
or aes256-ctr
algorithms are well supported and fast as they are implemented in hardware. These are not usually the default choice when using scp
so you will need to manually specify them.
A single SSH based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce metadata interactions it is a good idea to overlap transfers of files from different directories.
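One way to run several transfers at once is to start each one as a background task and then wait for all of them. The sketch below is a dry run under stated assumptions: the directory names run1 to run3, the project t01 and the username user are hypothetical, and DRY_RUN=echo makes the loop print each rsync command to a log file instead of running it.

```shell
# Hypothetical directory names and destination; replace with your own.
dest=user@login.archer2.ac.uk:/work/t01/t01/user/
# DRY_RUN=echo prints each command instead of running it.
# Set DRY_RUN to an empty value to perform the transfers.
DRY_RUN=echo
log=$(mktemp)
for d in run1 run2 run3; do
  # Start one rsync per directory in the background.
  $DRY_RUN rsync -a -e ssh $d $dest >> $log &
done
# Wait for all background tasks to finish.
wait
# In dry-run mode the log holds one line per planned transfer.
planned=$(wc -l < $log)
echo planned transfers: $planned
cat $log
```

Each directory gets its own rsync process, which matches the advice above to overlap transfers of files from different directories.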
In addition, you should consider the following when transferring data:
gzip
.
The scp
command creates a copy of a file, or if given the -r
flag, a directory either from a local machine onto a remote machine or from a remote machine onto a local machine.
For example, to transfer files to ARCHER2 from a local machine:
scp [options] source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
In the above example, the [destination]
is optional, as when left out scp
will copy the source into your home directory. Also, the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
If you want to request a different encryption algorithm add the -c [algorithm-name]
flag to the scp
options. For example, to use the (usually faster) aes128-ctr encryption algorithm you would use:
scp [options] -c aes128-ctr source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
The rsync
command can also transfer data between hosts using a ssh
connection. It creates a copy of a file or, if given the -r
flag, a directory at the given destination, similar to scp
above.
Given the -a
option, rsync can also make exact copies (including permissions); this is referred to as mirroring. In this case the rsync
command is executed with ssh
to create the copy on a remote machine.
To transfer files to ARCHER2 using rsync
with ssh
the command has the form:
rsync [options] -e ssh source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
In the above example, the [destination]
is optional, as when left out rsync will copy the source into your home directory. Also, the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
Additional flags can be specified for the underlying ssh
command by using a quoted string as the argument of the -e
flag, e.g.
rsync [options] -e \"ssh -c aes128-ctr\" source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
Tip
Further information on using rsync
can be found in the rsync
manual (accessed via man rsync
or at man rsync).
The ARCHER2 file systems have a Globus Collection (formerly known as an endpoint) with the name \"Archer2 file systems\". See the full step-by-step guide for using Globus to transfer files to/from ARCHER2.
"},{"location":"user-guide/data/#data-transfer-via-gridftp","title":"Data transfer via GridFTP","text":"ARCHER2 provides a module for grid computing, gct/6.2
, otherwise known as the Globus Grid Community Toolkit v6.2.20201212. This toolkit provides a command line interface for moving data to and from GridFTP servers.
Data transfers are managed by the globus-url-copy
command. Full details concerning this command's use can be found in the GCT 6.2 GridFTP User's Guide.
Info
Further information on using GridFTP on ARCHER2 to transfer data to the JASMIN facility can be found in the JASMIN user documentation.
"},{"location":"user-guide/data/#data-transfer-using-rclone","title":"Data transfer using rclone
","text":"Rclone is a command-line program to manage files on cloud storage. You can transfer files directly to/from cloud storage services, such as MS OneDrive and Dropbox. The program preserves timestamps and verifies checksums at all times.
First of all, you must download and unzip rclone
on ARCHER2:
wget https://downloads.rclone.org/v1.62.2/rclone-v1.62.2-linux-amd64.zip\nunzip rclone-v1.62.2-linux-amd64.zip\ncd rclone-v1.62.2-linux-amd64/\n
The previous code snippet uses rclone v1.62.2, which was the latest version when these instructions were written.
Configure rclone using ./rclone config
. This will guide you through an interactive setup process where you can make a new remote (called remote
). See the following for detailed instructions for:
Please note that a token is required to connect from ARCHER2 to the cloud service. You need a web browser to get the token. The recommendation is to run rclone on your laptop using rclone authorize
, get the token, and then copy the token from your laptop to ARCHER2. The rclone website contains further instructions on configuring rclone on a remote machine without a web browser.
Once all the above is done, you're ready to go. If you want to copy a directory, please use:
rclone copy <archer2_directory> remote:<cloud_directory>
Please note that \"remote\" is the name that you have chosen when running rclone config
. To copy files, please use:
rclone copyto <archer2_file> remote:<cloud_file>
Note
If the session times out while the data transfer takes place, adding the -vv
flag to an rclone transfer forces rclone to output to the terminal and therefore avoids triggering the timeout process.
Here we have a short example demonstrating transfer of data directly from a laptop/workstation to ARCHER2.
Note
This guide assumes you are using a command line interface to transfer data. This means the terminal on Linux or macOS, the MobaXterm local terminal on Windows, or PowerShell.
Before we can transfer data to ARCHER2 we need to make sure we have an SSH key set up to access ARCHER2 from the system we are transferring data from. If you are using the same system that you use to log into ARCHER2 then you should be all set. If you want to use a different system you will need to generate a new SSH key there (or use SSH key forwarding) to allow you to connect to ARCHER2.
Tip
Remember that you will need to use both a key and your password to transfer data to ARCHER2.
Once we know our keys are set up correctly, we are ready to transfer data directly between the two machines. We begin by combining our important research data into a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt file3.txt\n
We then initiate the data transfer from our system to ARCHER2, here using rsync
to allow the transfer to be recommenced without needing to start again, in the event of a loss of connection or other failure. For example, using the SSH key in the file ~/.ssh/id_RSA_A2
on our local system:
rsync -Pv -e\"ssh -c aes128-ctr -i $HOME/.ssh/id_RSA_A2\" ./all_my_files.tar.gz otbz19@login.archer2.ac.uk:/work/z19/z19/otbz19/\n
Note the use of the -P
flag to allow partial transfer -- the same command could be used to restart the transfer after a loss of connection. The -e
flag allows specification of the ssh command - we have used this to add the location of the identity file. The -c
option specifies the cipher to be used as aes128-ctr
which has been found to increase performance. Unfortunately the ~
shortcut is not correctly expanded, so we have specified the full path. We move our research archive to our project work directory on ARCHER2.
Note
Remember to replace otbz19
with your username on ARCHER2.
If we were unconcerned about being able to restart an interrupted transfer, we could instead use the scp
command,
scp -c aes128-ctr -i ~/.ssh/id_RSA_A2 all_my_files.tar.gz otbz19@login.archer2.ac.uk:/work/z19/z19/otbz19/\n
but rsync
is recommended for larger transfers.
The following debugging tools are available on ARCHER2:
The Linaro Forge tool provides the DDT parallel debugger. See:
The GNU Debugger for HPC (gdb4hpc) is a GDB-based debugger used to debug applications compiled with CCE, PGI, GNU, and Intel Fortran, C and C++ compilers. It allows programmers to either launch an application within it or to attach to an already-running application. Attaching to an already-running and hanging application is a quick way of understanding why the application is hanging, whereas launching an application through gdb4hpc will allow you to see your application running step-by-step, output the values of variables, and check whether the application runs as expected.
Tip
For your executable to be compatible with gdb4hpc, it will need to be coded with MPI. You will also need to compile your code with the debugging flag -g
(e.g. cc -g my_program.c -o my_exe
).
Launch gdb4hpc
:
module load gdb4hpc\ngdb4hpc\n
You will get some information about this version of the program and, eventually, you will get a command prompt:
gdb4hpc 4.5 - Cray Line Mode Parallel Debugger\nWith Cray Comparative Debugging Technology.\nCopyright 2007-2019 Cray Inc. All Rights Reserved.\nCopyright 1996-2016 University of Queensland. All Rights Reserved.\nType \"help\" for a list of commands.\nType \"help <cmd>\" for detailed help about a command.\ndbg all>\n
We will use launch
to begin a multi-process application within gdb4hpc. Consider that we are wanting to test an application called my_exe
, and that we want this to be launched across all 256 processes in two nodes. We would launch this in gdb4hpc by running:
dbg all> launch --launcher-args=\"--account=[budget code] --partition=standard --qos=standard --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --exclusive --export=ALL\" $my_prog{256} ./my_exe\n
Make sure to replace the --account
input with your budget code (e.g. if you are using budget t01, that part should look like --account=t01
).
The default launcher is srun
and the --launcher-args=\"...\"
allows you to set launcher flags for srun
. The variable $my_prog
is a dummy name for the program being launched and you could use whatever name you want for it -- this will be the name of the srun
job that will be run. The number in the brackets {256}
is the number of processes over which the program will be executed; it is 256 here, but you could use any number. You should try to run this on as few processors as possible -- the more you use, the longer it will take for gdb4hpc to load the program.
Once the program is launched, gdb4hpc will load up the program and begin to run it. You will get output to screen that looks something like:
Starting application, please wait...\nCreating MRNet communication network...\nWaiting for debug servers to attach to MRNet communications network...\nTimeout in 400 seconds. Please wait for the attach to complete.\nNumber of dbgsrvs connected: [0]; Timeout Counter: [1]\nNumber of dbgsrvs connected: [0]; Timeout Counter: [2]\nNumber of dbgsrvs connected: [0]; Timeout Counter: [3]\nNumber of dbgsrvs connected: [1]; Timeout Counter: [0]\nNumber of dbgsrvs connected: [1]; Timeout Counter: [1]\nNumber of dbgsrvs connected: [2]; Timeout Counter: [0]\nFinalizing setup...\nLaunch complete.\nmy_prog{0..255}: Initial breakpoint, main at /PATH/TO/my_program.c:34\n
The line number at which the initial breakpoint is made (in the above example, line 34) corresponds to the line number at which MPI is initialised. You will not be able to see any parts of the code outside of the MPI region of a code with gdb4hpc.
Once the code is loaded, you can use various commands to move through your code. The following lists and describes some of the most useful ones:
help -- Lists all gdb4hpc commands. You can run help COMMAND_NAME to learn more about a specific command (e.g. help launch will tell you about the launch command).
list -- Will show the current line of code and the 9 lines following. Repeated use of list will move you down the code in ten-line chunks.
next -- Will jump to the next step in the program for each process and output which line of code each process is on. It will not enter subroutines. Note that there is no reverse-step in gdb4hpc.
step -- Like next, but this will step into subroutines.
up -- Go up one level in the program (e.g. from a subroutine back to main).
print var -- Prints the value of variable var at this point in the code.
watch var -- Like print, but will print whenever a variable changes value.
quit -- Exits gdb4hpc.
Remember to exit the interactive session once you are done debugging.
"},{"location":"user-guide/debug/#attaching-with-gdb4hpc","title":"Attaching with gdb4hpc","text":"Attaching to a hanging job using gdb4hpc is a great way of seeing which state each processor is in. However, this does not produce the most visually appealing results. For a more easy-to-read program, please take a look at the STAT tool.
In your interactive session, launch your executable as a background task (by adding an &
at the end of the command). For example, if you are running an executable called my_exe
using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \\\n --account=[budget code] --partition=standard --qos=standard ./my_exe &\n
Make sure to replace the --account
input with your budget code (e.g. if you are using budget t01, that part should look like --account=t01
).
You will need to get the full job ID of the job you have just launched. To do this, run:
squeue -u $USER\n
and find the job ID associated with this interactive session -- this will be the one with the jobname bash
. In this example:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n1050 workq my_mpi_j jsindt R 0:16 1 nid000001\n1051 workq bash jsindt R 0:12 1 nid000002\n
the appropriate job id is 1051. Next, you will need to run sstat
on this job id:
sstat 1051\n
This will output a large amount of information about this specific job. We are looking for the first number of this output, which should look like JOB_ID.##
-- the number after the job ID is the number of slurm tasks performed in this interactive session. For our example (where srun
is the first slurm task performed), the number is 1051.0.
Launch gdb4hpc
:
module load gdb4hpc\ngdb4hpc\n
You will get some information about this version of the program and, eventually, you will get a command prompt:
gdb4hpc 4.5 - Cray Line Mode Parallel Debugger\nWith Cray Comparative Debugging Technology.\nCopyright 2007-2019 Cray Inc. All Rights Reserved.\nCopyright 1996-2016 University of Queensland. All Rights Reserved.\nType \"help\" for a list of commands.\nType \"help <cmd>\" for detailed help about a command.\ndbg all>\n
We will be using the attach
command to attach to our program that hangs. This is done by writing:
dbg all> attach $my_prog JOB_ID.##\n
where JOB_ID.##
is the full job ID found using sstat
(in our example, this would be 1051.0). The name $my_prog
is a dummy-name -- it could be whatever name you like.
As it is attaching, gdb4hpc will output text to screen that looks like:
Attaching to application, please wait...\nCreating MRNet communication network...\nWaiting for debug servers to attach to MRNet communications network...\nTimeout in 400 seconds. Please wait for the attach to complete.\nNumber of dbgsrvs connected: [0]; Timeout Counter: [1]\n\n...\n\nFinalizing setup...\nAttach complete.\nCurrent rank location:\n
After this, you will get an output that, among other things, tells you which line of your code each process is on, and what each process is doing. This can be helpful to see where the hang-up is.
If you accidentally attached to the wrong job, you can detach by running:
dbg all> release $my_prog\n
and re-attach with the correct job ID. You will need to change your dummy name from $my_prog
to something else.
When you are finished using gdb4hpc
, simply run:
dbg all> quit\n
Do not forget to exit your interactive session.
"},{"location":"user-guide/debug/#valgrind4hpc","title":"valgrind4hpc","text":"valgrind4hpc is a Valgrind-based debugging tool to aid in the detection of memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior. Valgrind4hpc manages starting and redirecting output from many copies of Valgrind, as well as recombining and filtering Valgrind messages. If your program can be debugged with Valgrind, it can be debugged with valgrind4hpc.
The valgrind4hpc module enables the use of standard valgrind as well as the valgrind4hpc version, which is more suitable for parallel programs.
"},{"location":"user-guide/debug/#using-valgrind-with-serial-programs","title":"Using Valgrind with serial programs","text":"Load valgrind4hpc
:
module load valgrind4hpc\n
Next, run your executable through valgrind:
valgrind --tool=memcheck --leak-check=yes my_executable\n
The log outputs to screen. The ERROR SUMMARY
will tell you whether, and how many, memory errors there are in your program. Furthermore, if you compile your code using the -g
debugging flag (e.g. gcc -g my_program.c -o my_executable
), the log will point out the code lines where the error occurs.
Valgrind also includes a tool called Massif that can be used to give insight into the memory usage of your program. It takes regular snapshots and outputs this data into a single file, which can be visualised to show the total amount of memory used as a function of time. This shows when peaks and bottlenecks occur and allows you to identify which data structures in your code are responsible for the largest memory usage of your program.
Documentation explaining how to use Massif is available at the official Massif manual. In short, you should run your executable as follows:
valgrind --tool=massif my_executable\n
The memory profiling data will be output into a file called massif.out.pid
, where pid is the runtime process ID of your program. A custom filename can be chosen using the --massif-out-file option
, as follows:
valgrind --tool=massif --massif-out-file=optional_filename.out my_executable\n
The output file contains raw profiling statistics. To view a summary including a graphical plot of memory usage over time, use the ms_print
command as follows:
ms_print massif.out.12345\n
or, to save to a file:
ms_print massif.out.12345 > massif.analysis.12345\n
This will show total memory usage over time as well as a breakdown of the top data structures contributing to memory usage at each snapshot where there has been a significant allocation or deallocation of memory.
"},{"location":"user-guide/debug/#using-valgrind4hpc-with-parallel-programs","title":"Using Valgrind4hpc with parallel programs","text":"First, load valgrind4hpc
:
module load valgrind4hpc\n
To run valgrind4hpc, first reserve the resources you will use with salloc
. The following reservation request is for 2 nodes (256 physical cores) for 20 minutes on the short queue:
auser@uan01:> salloc --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 \\\n --time=00:20:00 --partition=standard --qos=short \\\n --hint=nomultithread \\\n --distribution=block:block --account=[budget code]\n
Once your allocation is ready, use valgrind4hpc to run and profile your executable. To test an executable called my_executable
that requires two arguments arg1
and arg2
on 2 nodes and 256 processes, run:
valgrind4hpc --tool=memcheck --num-ranks=256 my_executable -- arg1 arg2\n
In particular, note the --
separating the executable from the arguments (this is not necessary if your executable takes no arguments).
Valgrind4hpc only supports certain tools found in valgrind. These are: memcheck, helgrind, exp-sgcheck, or drd. The --valgrind-args=\"arguments\"
allows users to use valgrind options not supported in valgrind4hpc (e.g. --leak-check
) -- note, however, that some of these options might interfere with valgrind4hpc.
More information on valgrind4hpc can be found in the manual (man valgrind4hpc
).
The Stack Trace Analysis Tool (STAT) is a cross-platform debugging tool from the University of Wisconsin-Madison. ATP is based on the same technology as STAT; both are designed to gather and merge stack traces from a running application's parallel processes. The STAT tool can be useful when an application seems to be deadlocked or stuck, i.e. it does not crash but does not progress as expected, and it has been designed to scale to a very large number of processes. Full information on STAT, including use cases, is available at the STAT website.
STAT will attach to a running program and query that program to find out where all the processes in that program currently are. It will then process that data and produce a graph displaying the unique process locations (i.e. where all the processes in the running program currently are). To make this easily understandable it collates together all processes that are in the same place providing only unique program locations for display.
"},{"location":"user-guide/debug/#using-stat-on-archer2","title":"Using STAT on ARCHER2","text":"On the login node, load the cray-stat
module:
module load cray-stat\n
Then, launch your job using srun
as a background task (by adding an &
at the end of the command). For example, if you are running an executable called my_exe
using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \\\n --account=[budget code] --partition=standard --qos=standard ./my_exe &\n
Note
This example has set the job time limit to 1 hour -- if you need longer, change the --time
option.
You will need the process ID (PID) of the job you have just launched -- the PID is printed to screen upon launch, or you can get it by running:
ps -u $USER\n
This will present you with a set of text that looks like this:
PID TTY TIME CMD\n154296 ? 00:00:00 systemd\n154297 ? 00:00:00 (sd-pam)\n154302 ? 00:00:00 sshd\n154303 pts/8 00:00:00 bash\n157150 pts/8 00:00:00 salloc\n157152 pts/8 00:00:00 bash\n157183 pts/8 00:00:00 srun\n157185 pts/8 00:00:00 srun\n157191 pts/8 00:00:00 ps\n
Once your application has reached the point where it hangs, issue the following command (replacing PID with the ID of the first srun task -- in the above example, you would replace PID with 157183):
stat-cl -i PID\n
You will get an output that looks like this:
STAT started at 2020-07-22-13:31:35\nAttaching to job launcher (null):157565 and launching tool daemons...\nTool daemons launched and connected!\nAttaching to application...\nAttached!\nApplication already paused... ignoring request to pause\nSampling traces...\nTraces sampled!\nResuming the application...\nResumed!\nPausing the application...\nPaused!\n\n...\n\nDetaching from application...\nDetached!\n\nResults written to $PATH_TO_RUN_DIRECTORY/stat_results/my_exe.0000\n
Once STAT is finished, you can kill the srun job using scancel
(replacing JID with the job ID of the job you just launched):
scancel JID\n
You can view the results that STAT has produced using the following command (note that \"my_exe\" will need to be replaced with the name of the executable you ran):
stat-view stat_results/my_exe.0000/00_my_exe.0000.3D.dot\n
This produces a graph displaying all the different places within the program that the parallel processes were when you queried them.
Note
To see the graph, you will need to have exported your X display when logging in.
Larger jobs may spend significant time queueing, requiring submission as a batch job. In this case, a slightly different invocation is illustrated as follows:
#!/bin/bash --login\n\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=02:00:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load additional modules\nmodule load cray-stat\n\nexport OMP_NUM_THREADS=1\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# This environment variable is required\nexport CTI_SLURM_OVERRIDE_MC=1\n\n# Request that stat sleeps for 3600 seconds before attaching\n# to our executable which we launch with command introduced\n# with -C:\n\nstat-cl -s 3600 -C srun --unbuffered ./my_exe\n
If the job is hanging it will continue to run until the wall clock exceeds the requested time. Use the stat-view
utility to inspect the results, as discussed above.
To enable ATP you should load the atp module and set the ATP_ENABLED
environment variable to 1 on the login node:
module load atp\nexport ATP_ENABLED=1\n# Fix for a known issue:\nexport HOME=${HOME/home/work}\n
Then, launch your job using srun
as a background task (by adding an &
at the end of the command). For example, if you are running an executable called my_exe
using 256 processes, you would run:
srun --ntasks=256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \\\n --account=[budget code] --partition=standard --qos=standard ./my_exe &\n
Note
This example has set the job time limit to 1 hour -- if you need longer, change the --time
option.
Once the job has finished running, load the cray-stat
module to view the results:
module load cray-stat\n
and view the merged stack trace using:
stat-view atpMergedBT.dot\n
Note
To see the graph, you will need to have exported your X display when logging in.
"},{"location":"user-guide/dev-environment-4cab/","title":"Application development environment: 4-cabinet system","text":"Important
This section covers the application development environment on the initial, 4-cabinet ARCHER2 system. For documentation on the application development environment on the full ARCHER2 system, please see Application development environment: full system.
"},{"location":"user-guide/dev-environment-4cab/#whats-available","title":"What's available","text":"ARCHER2 runs on the Cray Linux Environment (a version of SUSE Linux), and provides a development environment which includes:
Access to particular software, and particular versions, is managed by a standard TCL module framework. Most software is available via standard software modules and the different programming environments are available via module collections.
You can see what programming environments are available with:
auser@uan01:~> module savelist\nNamed collection list:\n 1) PrgEnv-aocc 2) PrgEnv-cray 3) PrgEnv-gnu\n
Other software modules can be listed with
auser@uan01:~> module avail\n------------------------------- /opt/cray/pe/perftools/20.09.0/modulefiles --------------------------------\nperftools perftools-lite-events perftools-lite-hbm perftools-nwpc \nperftools-lite perftools-lite-gpu perftools-lite-loops perftools-preload \n\n---------------------------------- /opt/cray/pe/craype/2.7.0/modulefiles ----------------------------------\ncraype-hugepages1G craype-hugepages8M craype-hugepages128M craype-network-ofi \ncraype-hugepages2G craype-hugepages16M craype-hugepages256M craype-network-slingshot10 \ncraype-hugepages2M craype-hugepages32M craype-hugepages512M craype-x86-rome \ncraype-hugepages4M craype-hugepages64M craype-network-none \n\n------------------------------------- /usr/local/Modules/modulefiles --------------------------------------\ndot module-git module-info modules null use.own \n\n-------------------------------------- /opt/cray/pe/cpe-prgenv/7.0.0 --------------------------------------\ncpe-aocc cpe-cray cpe-gnu \n\n-------------------------------------------- /opt/modulefiles ---------------------------------------------\naocc/2.1.0.3(default) cray-R/4.0.2.0(default) gcc/8.1.0 gcc/9.3.0 gcc/10.1.0(default) \n\n\n---------------------------------------- /opt/cray/pe/modulefiles -----------------------------------------\natp/3.7.4(default) cray-mpich-abi/8.0.15 craype-dl-plugin-py3/20.06.1(default) \ncce/10.0.3(default) cray-mpich-ucx/8.0.15 craype/2.7.0(default) \ncray-ccdb/4.7.1(default) cray-mpich/8.0.15(default) craypkg-gen/1.3.10(default) \ncray-cti/2.7.3(default) cray-netcdf-hdf5parallel/4.7.4.0 gdb4hpc/4.7.3(default) \ncray-dsmml/0.1.2(default) cray-netcdf/4.7.4.0 iobuf/2.0.10(default) \ncray-fftw/3.3.8.7(default) cray-openshmemx/11.1.1(default) papi/6.0.0.2(default) \ncray-ga/5.7.0.3 cray-parallel-netcdf/1.12.1.0 perftools-base/20.09.0(default) \ncray-hdf5-parallel/1.12.0.0 cray-pmi-lib/6.0.6(default) valgrind4hpc/2.7.2(default) \ncray-hdf5/1.12.0.0 cray-pmi/6.0.6(default) 
\ncray-libsci/20.08.1.2(default) cray-python/3.8.5.0(default) \n
A full discussion of the module system is available in the Software environment section.
A consistent set of modules is loaded on login to the machine (currently PrgEnv-cray
, see below). Developing applications then means selecting and loading the appropriate set of modules before starting work.
This section is aimed at code developers and will concentrate on the compilation environment and building libraries and executables, and specifically parallel executables. Other topics such as Python and Containers are covered in more detail in separate sections of the documentation.
"},{"location":"user-guide/dev-environment-4cab/#managing-development","title":"Managing development","text":"ARCHER2 supports common revision control software such as git
.
Standard GNU autoconf tools are available, along with make
(which is GNU Make). Versions of cmake
are available.
Note
Some of these tools are part of the system software, and typically reside in /usr/bin
, while others are provided as part of the module system. Some tools may be available in different versions via both /usr/bin
and via the module system.
There are three different compiler environments available on ARCHER2: AMD (AOCC), Cray (CCE), and GNU (GCC). The current compiler suite is selected via the programming environment, while the specific compiler versions are determined by the relevant compiler module. A summary is:
Suite name Module Programming environment collection CCE cce
PrgEnv-cray
GCC gcc
PrgEnv-gnu
AOCC aocc
PrgEnv-aocc
For example, at login, the default set of modules are:
Currently Loaded Modulefiles:\n1) cpe-cray 7) cray-dsmml/0.1.2(default) \n2) cce/10.0.3(default) 8) perftools-base/20.09.0(default) \n3) craype/2.7.0(default) 9) xpmem/2.2.35-7.0.1.0_1.3__gd50fabf.shasta(default) \n4) craype-x86-rome 10) cray-mpich/8.0.15(default) \n5) libfabric/1.11.0.0.233(default) 11) cray-libsci/20.08.1.2(default) \n6) craype-network-ofi \n
from which we see the default programming environment is Cray (indicated by cpe-cray
, at 1 in the list above) and the default compiler module is cce/10.0.3
(at 2 in the list above). The programming environment gives access to a consistent set of compiler, MPI library (cray-mpich
, at 10) and other libraries such as cray-libsci
(at 11 in the list above).
Within a given programming environment, it is possible to swap to a different compiler version by swapping the relevant compiler module.
To ensure consistent behaviour, compilation of C, C++, and Fortran source code should then take place using the appropriate compiler wrapper: cc
, CC
, and ftn
, respectively. The wrapper will automatically call the relevant underlying compiler and add the appropriate include directories and library locations to the invocation. This typically eliminates the need to specify this additional information explicitly in the configuration stage. To see the details of the exact compiler invocation use the -craype-verbose
flag to the compiler wrapper.
The default link time behaviour is also related to the current programming environment. See the section below on Linking and libraries.
Users should not, in general, invoke specific compilers at compile/link stages. In particular, gcc
, which may default to /usr/bin/gcc
, should not be used. The compiler wrappers cc
, CC
, and ftn
should be used via the appropriate module. Other common MPI compiler wrappers e.g., mpicc
should also be replaced by the relevant wrapper cc
(mpicc
etc are not available).
Important
Always use the compiler wrappers cc
, CC
, and/or ftn
and not a specific compiler invocation. This will ensure consistent compile/link time behaviour.
Further information on both the compiler wrappers and the individual compilers themselves is available via the command line, and via standard man
pages. The man
page for the compiler wrappers is common to all programming environments, while the man
page for individual compilers depends on the currently loaded programming environment. The following table summarises options for obtaining information on the compiler and compile options:
man craycc
man crayCC
man crayftn
GNU man gcc
man g++
man gfortran
Wrappers man cc
man CC
man ftn
Tip
You can also pass the --help
option to any of the compilers or wrappers to get a summary of how to use them. The Cray Fortran compiler uses ftn --craype-help
to access the help options.
Tip
There are no man
pages for the AOCC compilers at the moment.
Tip
Cray C/C++ is based on Clang and therefore supports similar options to clang/gcc (man clang
is in fact equivalent to man craycc
). clang --help
will produce a full summary of options with Cray-specific options marked \"Cray\". The craycc
man page concentrates on these Cray extensions to the clang
front end and does not provide an exhaustive description of all clang
options. Cray Fortran is not based on Flang and so takes different options from flang/gfortran.
Executables on ARCHER2 link dynamically, and the Cray Programming Environment does not currently support static linking. This is in contrast to ARCHER where the default was to build statically.
If you attempt to link statically, you will see errors similar to:
/usr/bin/ld: cannot find -lpmi\n/usr/bin/ld: cannot find -lpmi2\ncollect2: error: ld returned 1 exit status\n
The compiler wrapper scripts on ARCHER2 link in runtime libraries using the runpath
by default. This means that the paths to the runtime libraries are encoded into the executable so you do not need to load the compiler environment in your job submission scripts.
If you are unsure which compiler you should choose, we suggest the starting point should be the GNU compiler collection (GCC, PrgEnv-gnu
); this is perhaps the most commonly used by code developers, particularly in the open source software domain. A portable, standard-conforming code should (in principle) compile in any of the three programming environments.
For users requiring specific compiler features, such as co-array Fortran, the recommended starting point would be Cray. The following sections provide further details of the different programming environments.
Warning
Intel compilers are not available on ARCHER2.
"},{"location":"user-guide/dev-environment-4cab/#amd-optimizing-cc-compiler-aocc","title":"AMD Optimizing C/C++ Compiler (AOCC)","text":"The AMD Optimizing C/++ Compiler (AOCC) is a clang-based optimising compiler. AOCC (despite its name) includes a flang-based Fortran compiler.
Switch to the AOCC programming environment via
$ module restore PrgEnv-aocc\n
Note
Further details on AOCC will appear here as they become available.
"},{"location":"user-guide/dev-environment-4cab/#aocc-reference-material","title":"AOCC reference material","text":"The Cray compiler environment (CCE) is the default compiler at the point of login. CCE supports C/C++ (along with unified parallel C UPC), and Fortran (including co-array Fortran). Support for OpenMP parallelism is available for both C/C++ and Fortran (currently OpenMP 4.5, with a number of exceptions).
The Cray C/C++ compiler is based on a clang front end, and so compiler options are similar to those for gcc/clang. However, the Fortran compiler remains based around Cray-specific options. Be sure to separate C/C++ compiler options and Fortran compiler options (typically CFLAGS
and FFLAGS
) if compiling mixed C/Fortran applications.
Switch to the Cray programming environment via
$ module restore PrgEnv-cray\n
"},{"location":"user-guide/dev-environment-4cab/#useful-cce-cc-options","title":"Useful CCE C/C++ options","text":"When using the compiler wrappers cc
or CC
, some of the following options may be useful:
Language, warning, Debugging options:
Option Comment -std=<standard>
Default is -std=gnu11
(gnu++14
for C++) [1] Performance options:
Option Comment -Ofast
Optimisation levels: -O0, -O1, -O2, -O3, -Ofast -ffp=level
Floating point maths optimisations levels 0-4 [2] -flto
Link time optimisation Miscellaneous options:
Option Comment -fopenmp
Compile OpenMP (default is off) -v
Display verbose output from compiler stages Notes
-std=gnu11
gives c11
plus GNU extensions (likewise c++14
plus GNU extensions). See https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/C-Extensions.html -ffp=3
is implied by -Ofast
or -ffast-math
Language, Warning, Debugging options:
Option Comment -m <level>
Message level (default -m 3
errors and warnings) Performance options:
Option Comment -O <level>
Optimisation levels: -O0 to -O3 (default -O2) -h fp<level>
Floating point maths optimisations levels 0-3 -h ipa
Inter-procedural analysis Miscellaneous options:
Option Comment -h omp
Compile OpenMP (default is -hnoomp
) -v
Display verbose output from compiler stages"},{"location":"user-guide/dev-environment-4cab/#gnu-compiler-collection-gcc","title":"GNU compiler collection (GCC)","text":"The commonly used open source GNU compiler collection is available and provides C/C++ and Fortran compilers.
The GNU compiler collection is loaded by switching to the GNU programming environment:
$ module restore PrgEnv-gnu\n
Bug
The gcc/8.1.0
module is available on ARCHER2 but cannot be used as the supporting scientific and system libraries are not available. You should not use this version of GCC.
Warning
If you want to use GCC version 10 or greater to compile Fortran code, with the old MPI interfaces (i.e. use mpi
or INCLUDE 'mpif.h'
) you must add the -fallow-argument-mismatch
option (or equivalent) when compiling otherwise you will see compile errors associated with MPI functions. The reason for this is that past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines using the old MPI interfaces where arrays of different types are passed to, for example, MPI_Send()
. This will now generate an error as not standard conforming. The -fallow-argument-mismatch
option is used to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
If you use the Fortran 2008 MPI interface (i.e. use mpi_f08
) then you should not need to add this option.
Fortran language MPI bindings are described in more detail in the MPI Standard documentation.
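The version rule in the warning above can be expressed as a small shell check. This is only a sketch: the function name legacy_mpi_fflags is hypothetical, and it assumes you pass it a GCC version string such as the one printed by gfortran -dumpversion.

```shell
# Hypothetical helper (not part of ARCHER2): print the extra flag needed
# for the older MPI interfaces, given a GCC version string.
legacy_mpi_fflags() {
    local major=${1%%.*}    # keep only the major version, e.g. 10 from 10.1.0
    if [ "$major" -ge 10 ]; then
        echo "-fallow-argument-mismatch"
    fi
}

legacy_mpi_fflags 10.1.0   # GCC 10 or greater: the flag is printed
legacy_mpi_fflags 9.3.0    # older GCC: nothing is printed
```

In a build script the result could be appended to FFLAGS before calling ftn. It is not needed if the code uses the mpi_f08 interface.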
"},{"location":"user-guide/dev-environment-4cab/#useful-gnu-fortran-options","title":"Useful Gnu Fortran options","text":"Option Comment-std=<standard>
Default is gnu -fallow-argument-mismatch
Allow mismatched procedure arguments. This argument is required for compiling MPI Fortran code with GCC version 10 or greater if you are using the older MPI interfaces (see warning above) -fbounds-check
Use runtime checking of array indices -fopenmp
Compile OpenMP (default is no OpenMP) -v
Display verbose output from compiler stages Tip
The standard
in -std
may be one of f95
, f2003
, f2008
or f2018
. The default option -std=gnu
is the latest Fortran standard plus gnu extensions.
Warning
Past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines where arrays of different types are passed to MPI_Send()
and so on. This will now generate an error as not standard conforming. Use -fallow-argument-mismatch
to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
C/C++ documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/
Fortran documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gfortran/
HPE Cray provide, as standard, an MPICH implementation of the message passing interface which is specifically optimised for the ARCHER2 network. The current implementation supports MPI standard version 3.1.
The HPE Cray MPICH implementation is linked into software by default when compiling using the standard wrapper scripts: cc
, CC
and ftn
.
MPI standard documents: https://www.mpi-forum.org/docs/
"},{"location":"user-guide/dev-environment-4cab/#linking-and-libraries","title":"Linking and libraries","text":"Linking to libraries is performed dynamically on ARCHER2. One can use the -craype-verbose
flag to the compiler wrapper to check exactly what linker arguments are invoked. The compiler wrapper scripts encode the paths to the programming environment system libraries using RUNPATH. This ensures that the executable can find the correct runtime libraries without the matching software modules loaded.
The library RUNPATH associated with an executable can be inspected via, e.g.,
$ readelf -d ./a.out\n
(swap a.out
for the name of the executable you are querying).
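The readelf output lists every dynamic section entry, so it can help to filter it down to the RUNPATH line. The following is a minimal sketch: the function name runpath_entries is hypothetical, and the input shown is illustrative sample text rather than real output from an ARCHER2 executable.

```shell
# Hypothetical helper: read readelf -d output on stdin and keep only the
# RUNPATH (or legacy RPATH) entries.
runpath_entries() {
    grep -E '\((RUNPATH|RPATH)\)' || true
}

# Illustrative input only. On ARCHER2 you would instead run:
#   readelf -d ./a.out | runpath_entries
runpath_entries <<'EOF'
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/cray/pe/libsci/20.08.1.2/CRAY/9.0/x86_64/lib]
EOF
```

If nothing is printed for a real executable, no RUNPATH was encoded and the runtime libraries will be located by other means (for example LD_LIBRARY_PATH).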
Modules with names prefixed by cray-
are provided by HPE Cray, and are supported to be consistent with any of the programming environments and associated compilers. These modules should be the first choice for access to software libraries if available.
Tip
More information on the different software libraries on ARCHER2 can be found in the Software libraries section of the user guide.
"},{"location":"user-guide/dev-environment-4cab/#switching-to-a-different-hpe-cray-programming-environment-release","title":"Switching to a different HPE Cray Programming Environment release","text":"Important
See the section below on using non-default versions of HPE Cray libraries, as this process will generally need to be followed when using software from non-default PE installs.
Access to non-default PE environments is controlled by the use of the cpe
modules. These modules are typically loaded after you have restored a PrgEnv and loaded all the other modules you need and will set your compile environment to match that in the other PE release. This means:
For example, if you have a code that uses the Gnu programming environment, FFTW and NetCDF parallel libraries and you want to compile in the (non-default) 21.03 programming environment, you would do the following:
First, restore the Gnu programming environment and load the required library modules (FFTW and NetCDF HDF5 parallel). The loaded module list shows they are the versions from the default (20.10) programming environment:
auser@uan02:/work/t01/t01/auser> module restore -s PrgEnv-gnu\nauser@uan02:/work/t01/t01/auser> module load cray-fftw\nauser@uan02:/work/t01/t01/auser> module load cray-netcdf\nauser@uan02:/work/t01/t01/auser> module load cray-netcdf-hdf5parallel\nauser@uan02:/work/t01/t01/auser> module list\nCurrently Loaded Modulefiles:\n 1) cpe-gnu 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default) \n 2) gcc/10.1.0(default) 10) cray-mpich/8.0.16(default) \n 3) craype/2.7.2(default) 11) cray-libsci/20.10.1.2(default) \n 4) craype-x86-rome 12) bolt/0.7 \n 5) libfabric/1.11.0.0.233(default) 13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env \n 6) craype-network-ofi 14) /usr/local/share/epcc-module/epcc-module-loader \n 7) cray-dsmml/0.1.2(default) 15) cray-fftw/3.3.8.8(default) \n 8) perftools-base/20.10.0(default) 16) cray-netcdf-hdf5parallel/4.7.4.2(default) \n
Now, load the cpe/21.03
programming environment module to switch all the currently loaded HPE Cray modules from the default (20.10) programming environment version to the 21.03 programming environment versions:
auser@uan02:/work/t01/t01/auser> module load cpe/21.03\nSwitching to cray-dsmml/0.1.3.\nSwitching to cray-fftw/3.3.8.9.\nSwitching to cray-libsci/21.03.1.1.\nSwitching to cray-mpich/8.1.3.\nSwitching to cray-netcdf-hdf5parallel/4.7.4.3.\nSwitching to craype/2.7.5.\nSwitching to gcc/9.3.0.\nSwitching to perftools-base/21.02.0.\n\nLoading cpe/21.03\n Unloading conflict: cray-dsmml/0.1.2 cray-fftw/3.3.8.8 cray-libsci/20.10.1.2 cray-mpich/8.0.16 cray-netcdf-hdf5parallel/4.7.4.2\n craype/2.7.2 gcc/10.1.0 perftools-base/20.10.0\n Loading requirement: cray-dsmml/0.1.3 cray-fftw/3.3.8.9 cray-libsci/21.03.1.1 cray-mpich/8.1.3 cray-netcdf-hdf5parallel/4.7.4.3\n craype/2.7.5 gcc/9.3.0 perftools-base/21.02.0\nauser@uan02:/work/t01/t01/auser> module list\nCurrently Loaded Modulefiles:\n 1) cpe-gnu 9) cray-dsmml/0.1.3 17) cpe/21.03(default) \n 2) craype-x86-rome 10) cray-fftw/3.3.8.9 \n 3) libfabric/1.11.0.0.233(default) 11) cray-libsci/21.03.1.1 \n 4) craype-network-ofi 12) cray-mpich/8.1.3 \n 5) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default) 13) cray-netcdf-hdf5parallel/4.7.4.3 \n 6) bolt/0.7 14) craype/2.7.5 \n 7) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env 15) gcc/9.3.0 \n 8) /usr/local/share/epcc-module/epcc-module-loader 16) perftools-base/21.02.0 \n
Finally (as noted above), you will need to modify the value of LD_LIBRARY_PATH
before you compile your software to ensure it picks up the non-default versions of libraries:
auser@uan02:/work/t01/t01/auser> export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
Now you can go ahead and compile your software with the new programming environment.
Important
The cpe
modules only change the versions of software modules provided as part of the HPE Cray programming environments. Any modules provided by the ARCHER2 service will need to be loaded manually after you have completed the process described above.
Note
Unloading the cpe
module does not restore the original programming environment release. To restore the default programming environment release you should log out and then log back in to ARCHER2.
Bug
The cpe/21.03
module has a known issue with PrgEnv-gnu
where it loads an old version of GCC (9.3.0) rather than the correct, newer version (10.2.0). You can resolve this by using the sequence:
module restore -s PrgEnv-gnu\n...load any other modules you need...\nmodule load cpe/21.03\nmodule unload cpe/21.03\nmodule swap gcc gcc/10.2.0\n
"},{"location":"user-guide/dev-environment-4cab/#available-hpe-cray-programming-environment-releases-on-archer2","title":"Available HPE Cray Programming Environment releases on ARCHER2","text":"ARCHER2 currently has the following HPE Cray Programming Environment releases available:
cpe
module cpe/21.03
module
Tip
You can see which programming environment release you currently have loaded by using module list
and looking at the version number of the cray-libsci
module you have loaded. The first two numbers indicate the version of the PE you have loaded. For example, if you have cray-libsci/20.10.1.2
loaded then you are using the 20.10 PE release.
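The rule in this tip can be sketched as a short shell function. The name pe_release_from_libsci is hypothetical; it assumes the input is a module name of the form cray-libsci/A.B.C.D, as printed by module list.

```shell
# Hypothetical helper: derive the PE release from a cray-libsci module name.
pe_release_from_libsci() {
    local version=${1#*/}        # strip the module name, e.g. 20.10.1.2
    local major=${version%%.*}   # first number, e.g. 20
    local rest=${version#*.}     # remaining numbers, e.g. 10.1.2
    local minor=${rest%%.*}      # second number, e.g. 10
    echo "${major}.${minor}"
}

pe_release_from_libsci cray-libsci/20.10.1.2   # prints 20.10
```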
If you wish to make use of non-default versions of libraries provided by HPE Cray (usually because they are part of a non-default PE release: either old or new) then you need to make changes at both compile and runtime. In summary, you need to load the correct module and also make changes to the LD_LIBRARY_PATH
environment variable.
At compile time you need to load the version of the library module before you compile and set the LD_LIBRARY_PATH environment variable to include the contents of $CRAY_LD_LIBRARY_PATH
as the first entry. For example, to use the non-default 20.08.1.2 version of HPE Cray LibSci in the default programming environment (Cray Compiler Environment, CCE) you would first set up the environment to compile with:
auser@uan01:~/test/libsci> module swap cray-libsci cray-libsci/20.08.1.2 \nauser@uan01:~/test/libsci> export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
The order is important here: every time you change a module, you will need to reset the value of LD_LIBRARY_PATH
for the process to work (it will not be updated automatically).
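Because LD_LIBRARY_PATH is not updated automatically, a build script may want to verify that it still begins with the current contents of CRAY_LD_LIBRARY_PATH. The following is a sketch only: the function name path_list_starts_with and the sample paths are assumptions made for illustration.

```shell
# Hypothetical helper: succeed if the path list in $1 begins with the
# path list in $2 (compared as whole colon-separated entries).
path_list_starts_with() {
    [ -n "$2" ] || return 1
    case "$1" in
        "$2"|"$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Illustrative values. On ARCHER2 you would pass "$LD_LIBRARY_PATH" and
# "$CRAY_LD_LIBRARY_PATH" after your last module command.
cray_paths=/opt/cray/pe/libsci/20.08.1.2/CRAY/9.0/x86_64/lib
if path_list_starts_with "$cray_paths:/usr/lib64" "$cray_paths"; then
    echo "LD_LIBRARY_PATH is up to date"
else
    echo "LD_LIBRARY_PATH is stale: reset it after your last module command"
fi
```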
Now you can compile your code. You can check that the executable is using the correct version of LibSci with the ldd
command: look for the line beginning libsci_cray.so.5
; you should see the version in the path to the library file:
auser@uan01:~/test/libsci> ldd dgemv.x \n linux-vdso.so.1 (0x00007ffe4a7d2000)\n libsci_cray.so.5 => /opt/cray/pe/libsci/20.08.1.2/CRAY/9.0/x86_64/lib/libsci_cray.so.5 (0x00007fafd6a43000)\n libdl.so.2 => /lib64/libdl.so.2 (0x00007fafd683f000)\n libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007fafd663c000)\n libquadmath.so.0 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libquadmath.so.0 (0x00007fafd63fc000)\n libmodules.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libmodules.so.1 (0x00007fafd61e0000)\n libfi.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libfi.so.1 (0x00007fafd5abe000)\n libcraymath.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libcraymath.so.1 (0x00007fafd57e2000)\n libf.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libf.so.1 (0x00007fafd554f000)\n libu.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libu.so.1 (0x00007fafd523b000)\n libcsup.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libcsup.so.1 (0x00007fafd5035000)\n libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00007fafd4c62000)\n libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fafd4a43000)\n libc.so.6 => /lib64/libc.so.6 (0x00007fafd4688000)\n libm.so.6 => /lib64/libm.so.6 (0x00007fafd4350000)\n /lib64/ld-linux-x86-64.so.2 (0x00007fafda988000)\n librt.so.1 => /lib64/librt.so.1 (0x00007fafd4148000)\n libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00007fafd3c92000)\n libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00007fafd3a7a000)\n
Tip
If any of the libraries point to versions in the /opt/cray/pe/lib64
directory then these are using the default versions of the libraries rather than the specific versions. This happens at compile time if you have forgotten to load the right module and set $LD_LIBRARY_PATH
afterwards.
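The check described in this tip can be automated by filtering the ldd output. This is a minimal sketch: the function name default_pe_libs is hypothetical, and the input shown is made-up sample text, not real output.

```shell
# Hypothetical helper: read ldd output on stdin and print any libraries
# resolved from the default /opt/cray/pe/lib64 directory.
default_pe_libs() {
    grep -F '=> /opt/cray/pe/lib64/' || true
}

# Illustrative input only. On ARCHER2 you would instead run:
#   ldd dgemv.x | default_pe_libs
default_pe_libs <<'EOF'
	libsci_cray.so.5 => /opt/cray/pe/lib64/libsci_cray.so.5 (0x00007fafd6a43000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fafd4688000)
EOF
```

Any line printed indicates a library that is still resolving to the default version. No output means every library was found elsewhere.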
At run time (typically in your job script) you need to repeat the environment setup steps (you can also use the ldd
command in your job submission script to check the library is pointing to the correct version). For example, a job submission script to run our dgemv.x
executable with the non-default version of LibSci could look like:
#!/bin/bash\n#SBATCH --job-name=dgemv\n#SBATCH --time=0:20:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --reservation=shortqos\n\n# Load the standard environment module\nmodule load epcc-job-env\n\n# Set up the environment to use the non-default version of LibSci\n# We use \"module swap\" as the \"cray-libsci\" is loaded by default.\n# This must be done after loading the \"epcc-job-env\" module\nmodule swap cray-libsci cray-libsci/20.08.1.2\nexport LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n\n# Check which library versions the executable is pointing to\nldd dgemv.x\n\nexport OMP_NUM_THREADS=1\n\nsrun --hint=nomultithread --distribution=block:block dgemv.x\n
Tip
As when compiling, the order of commands matters. Setting the value of LD_LIBRARY_PATH
must happen after you have finished all your module
commands for it to have the correct effect.
Important
You must set up the environment at both compile and run time, otherwise you will end up using the default version of the library.
"},{"location":"user-guide/dev-environment-4cab/#compiling-in-compute-nodes","title":"Compiling in compute nodes","text":"Sometimes you may wish to compile in a batch job. For example, the compile process may take a long time or the compile process is part of the research workflow and can be coupled to the production job. Unlike login nodes, the /home
file system is not available on the compute nodes.
An example job submission script for a compile job using make
(assuming the Makefile is in the same directory as the job submission script) would be:
#!/bin/bash\n\n#SBATCH --job-name=compile\n#SBATCH --time=00:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the compilation environment (cray, gnu or aocc)\nmodule restore /etc/cray-pe.d/PrgEnv-cray\n\nmake clean\n\nmake\n
Warning
Do not forget to include the full path when the compilation environment is restored. For instance:
module restore /etc/cray-pe.d/PrgEnv-cray
You can also use a compute node in an interactive way using salloc
. Please see Section Using salloc to reserve resources for further details. Once your interactive session is ready, you can load the compilation environment and compile the code.
The ARCHER2 CSE team at EPCC and other contributors provide build configurations and instructions for a range of research software, software libraries and tools on a variety of HPC systems (including ARCHER2) in a public Github repository. See:
The repository always welcomes contributions from the ARCHER2 user community.
"},{"location":"user-guide/dev-environment-4cab/#support-for-building-software-on-archer2","title":"Support for building software on ARCHER2","text":"If you run into issues building software on ARCHER2 or the software you require is not available then please contact the ARCHER2 Service Desk with any questions you have.
"},{"location":"user-guide/dev-environment/","title":"Application development environment","text":""},{"location":"user-guide/dev-environment/#whats-available","title":"What's available","text":"ARCHER2 runs the HPE Cray Linux Environment (a version of SUSE Linux), and provides a development environment which includes:
Access to particular software, and particular versions, is managed by an Lmod module framework. Most software is available by loading modules, including the different compiler environments.
You can see what compiler environments are available with:
auser@uan01:~> module avail PrgEnv\n\n--------------------------------------- /opt/cray/pe/lmod/modulefiles/core ----------------------------------------\n PrgEnv-aocc/8.3.3 PrgEnv-cray/8.3.3 (L) PrgEnv-gnu/8.3.3\n\n Where:\n L: Module is loaded\n\nModule defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.\nSee https://lmod.readthedocs.io/en/latest/060_locating.html for details.\n\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
Other software modules can be searched using the module spider
command:
auser@uan01:~> module spider\n\n---------------------------------------------------------------------------------------------------------------\nThe following is a list of the modules and extensions currently available:\n---------------------------------------------------------------------------------------------------------------\n PrgEnv-aocc: PrgEnv-aocc/8.3.3\n\n PrgEnv-cray: PrgEnv-cray/8.3.3\n\n PrgEnv-gnu: PrgEnv-gnu/8.3.3\n\n amd-uprof: amd-uprof/3.6.449\n\n aocc: aocc/3.2.0\n\n aocc-mixed: aocc-mixed/3.2.0\n\n aocl: aocl/3.1, aocl/4.0\n\n forge: forge/24.0\n\n atp: atp/3.14.16\n\n bolt: bolt/0.7, bolt/0.8\n\n boost: boost/1.72.0, boost/1.81.0\n\n castep: castep/22.11\n\n cce: cce/15.0.0\n\n...output trimmed...\n
A full discussion of the module system is available in the Software environment section.
A consistent set of modules is loaded on login to the machine (currently PrgEnv-cray
, see below). Developing applications then means selecting and loading the appropriate set of modules before starting work.
This section is aimed at code developers and will concentrate on the compilation environment, building libraries and executables, specifically parallel executables. Other topics such as Python and Containers are covered in more detail in separate sections of the documentation.
Tip
If you want to get back to the login module state without having to logout and back in again, you can just use:
module restore\n
This is also handy for build scripts to ensure you are starting from a known state."},{"location":"user-guide/dev-environment/#compiler-environments","title":"Compiler environments","text":"There are three different compiler environments available on ARCHER2:
The current compiler suite is selected via the PrgEnv
module, while the specific compiler versions are determined by the relevant compiler module. A summary is:
CCE PrgEnv-cray
cce
GCC PrgEnv-gnu
gcc
AOCC PrgEnv-aocc
aocc
For example, at login, the default set of modules are:
auser@ln03:~> module list\n\n 1) craype-x86-rome 6) cce/15.0.0 11) PrgEnv-cray/8.3.3\n 2) libfabric/1.12.1.2.2.0.0 7) craype/2.7.19 12) bolt/0.8\n 3) craype-network-ofi 8) cray-dsmml/0.2.2 13) epcc-setup-env\n 4) perftools-base/22.12.0 9) cray-mpich/8.1.23 14) load-epcc-module\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) cray-libsci/22.12.1.1\n
from which we see that the default compiler environment is Cray, indicated by PrgEnv-cray
(at 11 in the list above), and the default compiler module is cce/15.0.0
(at 6 in the list above). The compiler environment gives access to a consistent set of compilers, an MPI library via cray-mpich
(at 9), and other libraries, e.g. cray-libsci
(at 10 in the list above).
Switching between different compiler environments is achieved using the module load
command. For example, to switch from the default HPE Cray (CCE) compiler environment to the GCC environment, you would use:
auser@ln03:~> module load PrgEnv-gnu\n\nLmod is automatically replacing \"cce/15.0.0\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-gnu/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n
If you then use the module list
command, you will see that your environment has been changed to the GCC environment:
auser@ln03:~> module list\n\nCurrently Loaded Modules:\n 1) craype-x86-rome 6) bolt/0.8 11) cray-dsmml/0.2.2\n 2) libfabric/1.12.1.2.2.0.0 7) epcc-setup-env 12) cray-mpich/8.1.23\n 3) craype-network-ofi 8) load-epcc-module 13) cray-libsci/22.12.1.1\n 4) perftools-base/22.12.0 9) gcc/11.2.0 14) PrgEnv-gnu/8.3.3\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) craype/2.7.19\n
"},{"location":"user-guide/dev-environment/#switching-between-compiler-versions","title":"Switching between compiler versions","text":"Within a given compiler environment, it is possible to swap to a different compiler version by swapping the relevant compiler module. To switch to the GNU compiler environment from the default HPE Cray compiler environment and then swap the version of GCC from the 11.2.0 default to the older 10.3.0 version, you would use:
auser@ln03:~> module load PrgEnv-gnu\n\nLmod is automatically replacing \"cce/15.0.0\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-gnu/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n\nauser@ln03:~> module load gcc/10.3.0\n\nThe following have been reloaded with a version change:\n 1) gcc/11.2.0 => gcc/10.3.0\n
The first command switches to the GNU compiler environment and the second switches to the older version of GCC. As before, module list
will show that your environment has been changed:
auser@ln03:~> module list\n\nCurrently Loaded Modules:\n 1) craype-x86-rome 6) bolt/0.8 11) cray-libsci/22.12.1.1\n 2) libfabric/1.12.1.2.2.0.0 7) epcc-setup-env 12) PrgEnv-gnu/8.3.3\n 3) craype-network-ofi 8) load-epcc-module 13) gcc/10.3.0\n 4) perftools-base/22.12.0 9) craype/2.7.19 14) cray-mpich/8.1.23\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) cray-dsmml/0.2.2\n
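When a build script needs to confirm which compiler environment is active, the module list output can be filtered for the PrgEnv entry. The following is a minimal sketch only: the function name active_prgenv is hypothetical, and it assumes the listing has the format shown above.

```shell
# Hypothetical helper: print the PrgEnv module found in `module list` output
# read from standard input (for example "PrgEnv-gnu/8.3.3").
active_prgenv() {
    grep -o 'PrgEnv-[a-z]*/[0-9.]*' | head -n 1
}

# Example using one line of the listing shown above.
printf '%s\n' ' 2) libfabric/1.12.1.2.2.0.0 7) epcc-setup-env 12) PrgEnv-gnu/8.3.3' | active_prgenv
```

On ARCHER2 this might be used as `module list 2>&1 | active_prgenv`; the redirection is included because Lmod normally writes the listing to standard error.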
"},{"location":"user-guide/dev-environment/#compiler-wrapper-scripts-cc-cc-ftn","title":"Compiler wrapper scripts: cc
, CC
, ftn
","text":"To ensure consistent behaviour, compilation of C, C++, and Fortran source code should then take place using the appropriate compiler wrapper: cc
, CC
, and ftn
, respectively. The wrapper will automatically call the relevant underlying compiler and add the appropriate include directories and library locations to the invocation. This typically eliminates the need to specify this additional information explicitly in the configuration stage. To see the details of the exact compiler invocation use the -craype-verbose
flag to the compiler wrapper.
The default link time behaviour is also related to the current programming environment. See the section below on Linking and libraries.
Users should not, in general, invoke specific compilers at compile/link stages. In particular, gcc
, which may default to /usr/bin/gcc
, should not be used. The compiler wrappers cc
, CC
, and ftn
should be used (with the underlying compiler type and version set by the module system). Other common MPI compiler wrappers e.g., mpicc
, should also be replaced by the relevant wrapper, e.g. cc
(commands such as mpicc
are not available on ARCHER2).
Important
Always use the compiler wrappers cc
, CC
, and/or ftn
and not a specific compiler invocation. This will ensure consistent compile/link time behaviour.
Tip
If you are using a build system such as Make or CMake then you will need to replace all occurrences of mpicc
with cc
, mpicxx
/mpic++
with CC
and mpif90
with ftn
.
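The substitutions in the tip above can be written down as a small lookup. This is an illustrative sketch: the function name archer2_wrapper is hypothetical and only the wrapper names mentioned in this section are handled.

```shell
# Hypothetical helper: map a generic MPI compiler wrapper name to the
# corresponding ARCHER2 compiler wrapper named in the tip above.
archer2_wrapper() {
    case "$1" in
        mpicc)          echo cc ;;
        mpicxx|mpic++)  echo CC ;;
        mpif90)         echo ftn ;;
        *)              echo "$1" ;;   # leave anything else unchanged
    esac
}

archer2_wrapper mpicc
archer2_wrapper mpic++
archer2_wrapper mpif90
```

Many build systems also accept the compilers through variables rather than edits to the build files, for example `make CC=cc CXX=CC FC=ftn` or `CC=cc CXX=CC FC=ftn cmake ..`. Whether a particular Makefile honours these variable names is an assumption you should check.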
Further information on both the compiler wrappers and the individual compilers is available via the command line and via standard man
pages. The man
page for the compiler wrappers is common to all programming environments, while the man
page for individual compilers depends on the currently loaded programming environment. The following table summarises options for obtaining information on the compiler and compile options:
CCE man clang
man clang++
man crayftn
GNU man gcc
man g++
man gfortran
Wrappers man cc
man CC
man ftn
Tip
You can also pass the --help
option to any of the compilers or wrappers to get a summary of how to use them. The Cray Fortran compiler uses ftn --craype-help
to access the help options.
Tip
There are no man
pages for the AOCC compilers at the moment.
Tip
Cray C/C++ is based on Clang and therefore supports similar options to clang/gcc. clang --help
will produce a full summary of options with Cray-specific options marked \"Cray\". The clang
man page on ARCHER2 concentrates on these Cray extensions to the clang
front end and does not provide an exhaustive description of all clang
options. Cray Fortran is not based on Flang and so takes different options from flang/gfortran.
If you are unsure which compiler you should choose, we suggest the starting point should be the GNU compiler collection (GCC, PrgEnv-gnu
); this is perhaps the most commonly used by code developers, particularly in the open source software domain. A portable, standard-conforming code should (in principle) compile in any of the three compiler environments.
For users requiring specific compiler features, such as coarray Fortran, the recommended starting point would be Cray. The following sections provide further details of the different compiler environments.
Warning
Intel compilers are not currently available on ARCHER2.
"},{"location":"user-guide/dev-environment/#gnu-compiler-collection-gcc","title":"GNU compiler collection (GCC)","text":"The commonly used open source GNU compiler collection is available and provides C/C++ and Fortran compilers.
Switch to the GCC compiler environment from the default CCE (cray) compiler environment via:
auser@ln03:~> module load PrgEnv-gnu\n\nLmod is automatically replacing \"cce/15.0.0\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-gnu/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n
Warning
If you want to use GCC version 10 or greater to compile Fortran code, with the old MPI interfaces (i.e. use mpi
or INCLUDE 'mpif.h'
) you must add the -fallow-argument-mismatch
option (or equivalent) when compiling otherwise you will see compile errors associated with MPI functions. The reason for this is that past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines using the old MPI interfaces where arrays of different types are passed to, for example, MPI_Send()
. This will now generate an error as not standard conforming. The -fallow-argument-mismatch
option is used to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
If you use the Fortran 2008 MPI interface (i.e. use mpi_f08
) then you should not need to add this option.
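The version rule in the warning above can be expressed as a small helper for build scripts. This is a sketch under stated assumptions: the function name legacy_mpi_fflags is hypothetical, and the GCC major version is passed in explicitly rather than detected.

```shell
# Hypothetical helper: print the extra gfortran option needed for the older
# MPI interfaces, given the GCC major version as the first argument.
legacy_mpi_fflags() {
    if [ "$1" -ge 10 ]; then
        echo "-fallow-argument-mismatch"
    fi
}

legacy_mpi_fflags 11   # prints -fallow-argument-mismatch
legacy_mpi_fflags 9    # prints nothing
```

The result could then be passed to the wrapper, e.g. `ftn $(legacy_mpi_fflags 11) -c my_mpi_code.f90`, where my_mpi_code.f90 is a placeholder file name.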
Fortran language MPI bindings are described in more detail in the MPI Standard documentation.
"},{"location":"user-guide/dev-environment/#useful-gnu-fortran-options","title":"Useful Gnu Fortran options","text":"Option Comment-O<level>
Optimisation levels: -O0
, -O1
, -O2
, -O3
, -Ofast
. -Ofast
is not recommended without careful regression testing on numerical output. -std=<standard>
Default is gnu -fallow-argument-mismatch
Allow mismatched procedure arguments. This argument is required for compiling MPI Fortran code with GCC version 10 or greater if you are using the older MPI interfaces (see warning above) -fbounds-check
Use runtime checking of array indices -fopenmp
Compile OpenMP (default is no OpenMP) -v
Display verbose output from compiler stages Tip
The standard
in -std
may be one of f95
, f2003
, f2008
or f2018
. The default option -std=gnu
is the latest Fortran standard plus gnu extensions.
Warning
Past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines where arrays of different types are passed to MPI_Send()
and so on. This will now generate an error as not standard conforming. Use -fallow-argument-mismatch
to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
GCC 12.x compilers are available on ARCHER2 for users who wish to access newer features (particularly C++ features).
Testing by the CSE service has identified that some software regression tests produce different results from the reference values when using software compiled with gfortran from GCC 12.x so we do not recommend its general use by users. Users should carefully check results from software built using compilers from GCC 12.x before using it for their research projects.
You can access GCC 12.x by using the commands:
module load extra-compilers\nmodule load PrgEnv-gnu\n
"},{"location":"user-guide/dev-environment/#reference-material","title":"Reference material","text":"C/C++ documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/
Fortran documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gfortran/
The Cray Compiling Environment (CCE) is the default compiler at the point of login. CCE supports C/C++ (along with Unified Parallel C, UPC) and Fortran (including coarray Fortran). Support for OpenMP parallelism is available for both C/C++ and Fortran (currently OpenMP 4.5, with a number of exceptions).
The Cray C/C++ compiler is based on a clang front end, and so compiler options are similar to those for gcc/clang. However, the Fortran compiler remains based around Cray-specific options. Be sure to separate C/C++ compiler options and Fortran compiler options (typically CFLAGS
and FFLAGS
) if compiling mixed C/Fortran applications.
As CCE is the default compiler environment on ARCHER2, you do not usually need to issue any commands to enable CCE.
Note
The CCE Clang compiler uses a GCC 8 toolchain so only C++ standard library features available in GCC 8 will be available in CCE Clang. You can add the compile option --gcc-toolchain=/opt/gcc/11.2.0/snos
to use a more recent version of the C++ standard library if you wish.
When using the compiler wrappers cc
or CC
, some of the following options may be useful:
Language, warning, Debugging options:
Option Comment-std=<standard>
Default is -std=gnu11
(gnu++14
for C++) [1] --gcc-toolchain=/opt/cray/pe/gcc/12.2.0/snos
Use the GCC 12.2.0 toolchain instead of the default 11.2.0 version packaged with CCE Performance options:
Option Comment-Ofast
Optimisation levels: -O0
, -O1
, -O2
, -O3
, -Ofast
. -Ofast
is not recommended without careful regression testing on numerical output. -ffp=level
Floating point maths optimisations levels 0-4 [2] -flto
Link time optimisation Miscellaneous options:
Option Comment-fopenmp
Compile OpenMP (default is off) -v
Display verbose output from compiler stages Notes
[1] -std=gnu11
gives c11
plus GNU extensions (likewise c++14
plus GNU extensions). See https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/C-Extensions.html
[2] -ffp=3
is implied by -Ofast
or -ffast-math
Language, Warning, Debugging options:
Option Comment-m <level>
Message level (default -m 3
errors and warnings) Performance options:
Option Comment-O <level>
Optimisation levels: -O0 to -O3 (default -O2) -h fp<level>
Floating point maths optimisations levels 0-3 -h ipa
Inter-procedural analysis Miscellaneous options:
Option Comment-h omp
Compile OpenMP (default is -hnoomp
) -v
Display verbose output from compiler stages"},{"location":"user-guide/dev-environment/#cce-reference-documentation","title":"CCE Reference Documentation","text":"man clang
once the CCE compiler environment is loaded. The AMD Optimizing Compiler Collection (AOCC) is a clang-based optimising compiler. AOCC also includes a flang-based Fortran compiler.
Load the AOCC compiler environment from the default CCE (cray) compiler environment via:
auser@ln03:~> module load PrgEnv-aocc\n\nLmod is automatically replacing \"cce/15.0.0\" with \"aocc/3.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-aocc/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n
"},{"location":"user-guide/dev-environment/#aocc-reference-material","title":"AOCC reference material","text":"HPE Cray provide, as standard, an MPICH implementation of the message passing interface which is specifically optimised for the ARCHER2 interconnect. The current implementation supports MPI standard version 3.1.
The HPE Cray MPICH implementation is linked into software by default when compiling using the standard wrapper scripts: cc
, CC
and ftn
.
You do not need to do anything to make HPE Cray MPICH available when you log into ARCHER2; it is available by default to all users.
"},{"location":"user-guide/dev-environment/#switching-to-alternative-ucx-mpi-implementation","title":"Switching to alternative UCX MPI implementation","text":"HPE Cray MPICH can use two different low-level protocols to transfer data across the network. The default is the Open Fabrics Interface (OFI), but you can switch to the UCX protocol from Mellanox.
Which performs better will be application-dependent, but our experience is that UCX is often faster for programs that send a lot of data collectively between many processes, e.g. all-to-all communications patterns such as occur in parallel FFTs.
Note
You do not need to recompile your program - you simply load different modules in your Slurm script.
module load craype-network-ucx \nmodule load cray-mpich-ucx \n
Important
If your software was compiled using a compiler environment other than CCE, you will need to load that compiler environment as well as the UCX modules. For example, if you compiled using PrgEnv-gnu
you would need to:
module load PrgEnv-gnu\nmodule load craype-network-ucx \nmodule load cray-mpich-ucx \n
The performance benefits will also vary depending on the number of processes, so it is important to benchmark your application at the scale used in full production runs.
"},{"location":"user-guide/dev-environment/#mpi-reference-material","title":"MPI reference material","text":"MPI standard documents: https://www.mpi-forum.org/docs/
"},{"location":"user-guide/dev-environment/#linking-and-libraries","title":"Linking and libraries","text":"Linking to libraries is performed dynamically on ARCHER2.
Important
Static linking is not supported on ARCHER2. If you attempt to link statically, you will see errors similar to:
/usr/bin/ld: cannot find -lpmi\n/usr/bin/ld: cannot find -lpmi2\ncollect2: error: ld returned 1 exit status\n
One can use the -craype-verbose
flag to the compiler wrapper to check exactly what linker arguments are invoked. The compiler wrapper scripts encode the paths to the programming environment system libraries using RUNPATH. This ensures that the executable can find the correct runtime libraries without the matching software modules loaded.
The library RUNPATH associated with an executable can be inspected via, e.g.,
$ readelf -d ./a.out\n
(swap a.out
for the name of the executable you are querying).
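The readelf output contains many entries, so it can help to pull out just the RUNPATH directories. The following is a minimal sketch: the function name runpath_entries is hypothetical, and the sample line is made up to resemble a typical dynamic section entry.

```shell
# Hypothetical helper: read `readelf -d` output on standard input and print
# each RUNPATH directory on its own line.
runpath_entries() {
    sed -n 's/.*(RUNPATH).*\[\(.*\)\].*/\1/p' | tr ':' '\n'
}

# Example using a made-up dynamic section entry.
printf '%s\n' ' 0x000000000000001d (RUNPATH)  Library runpath: [/opt/example/a/lib:/opt/example/b/lib]' | runpath_entries
```

For a real executable this would be used as `readelf -d ./a.out | runpath_entries`.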
Modules with names prefixed by cray-
are provided by HPE Cray and work with any of the compiler environments. These modules should be the first choice for access to software libraries if available.
Tip
More information on the different software libraries on ARCHER2 can be found in the Software libraries section of the user guide.
"},{"location":"user-guide/dev-environment/#hpe-cray-programming-environment-cpe-releases","title":"HPE Cray Programming Environment (CPE) releases","text":""},{"location":"user-guide/dev-environment/#available-hpe-cray-programming-environment-cpe-releases","title":"Available HPE Cray Programming Environment (CPE) releases","text":"ARCHER2 currently has the following HPE Cray Programming Environment (CPE) releases available:
You can find information, notes, and lists of changes for current and upcoming ARCHER2 HPE Cray programming environments in the HPE Cray Programming Environment GitHub repository.
Tip
We recommend that users use the most recent version of the PE available to get the latest improvements and bug fixes.
Later PE releases may sometimes be available via a containerised form. This allows developers to check that their code compiles and runs using CPE releases that have not yet been installed on ARCHER2.
CPE 23.12 is currently available as a Singularity container, see Using Containerised HPE Cray Programming Environments for further details.
"},{"location":"user-guide/dev-environment/#switching-to-a-different-hpe-cray-programming-environment-cpe-release","title":"Switching to a different HPE Cray Programming Environment (CPE) release","text":"Important
See the section below on using non-default versions of HPE Cray libraries as this process will generally need to be followed when using software from non-default PE installs.
Access to non-default PE environments is controlled by the use of the cpe
modules. Loading a cpe
module will do the following:
For example, if you have a code that uses the GNU compiler environment, FFTW and NetCDF parallel libraries and you want to compile in the (non-default) 23.09 programming environment, you would do the following:
First, load the cpe/23.09
module to switch all the defaults to the versions from the 23.09 PE. Then, swap to the GNU compiler environment and load the required library modules (FFTW, hdf5-parallel and NetCDF HDF5 parallel). The loaded module list shows they are the versions from the 23.09 PE:
module load cpe/23.09\n
Output:
The following have been reloaded with a version change:\n 1) PrgEnv-cray/8.3.3 => PrgEnv-cray/8.4.0 4) cray-mpich/8.1.23 => cray-mpich/8.1.27\n 2) cce/15.0.0 => cce/16.0.1 5) craype/2.7.19 => craype/2.7.23\n 3) cray-libsci/22.12.1.1 => cray-libsci/23.09.1.1 6) perftools-base/22.12.0 => perftools-base/23.09.0\n
module load PrgEnv-gnu\n
Output: Lmod is automatically replacing \"cce/16.0.1\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.4.0\" with \"PrgEnv-gnu/8.4.0\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.27\n
module load cray-fftw\nmodule load cray-hdf5-parallel\nmodule load cray-netcdf-hdf5parallel\nmodule list\n
Output:
Currently Loaded Modules:\n 1) craype-x86-rome 6) epcc-setup-env 11) craype/2.7.23 16) cray-fftw/3.3.10.5\n 2) libfabric/1.12.1.2.2.0.0 7) load-epcc-module 12) cray-dsmml/0.2.2 17) cray-hdf5-parallel/1.12.2.7\n 3) craype-network-ofi 8) perftools-base/23.09.0 13) cray-mpich/8.1.27 18) cray-netcdf-hdf5parallel/4.9.0.7\n 4) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 9) cpe/23.09 14) cray-libsci/23.09.1.1\n 5) bolt/0.8 10) gcc/11.2.0 15) PrgEnv-gnu/8.4.0\n
Now you can go ahead and compile your software with the new programming environment.
Important
The cpe
modules only change the versions of software modules provided as part of the HPE Cray programming environments. Any modules provided by the ARCHER2 service will need to be loaded manually after you have completed the process described above.
Note
Unloading the cpe
module does not restore the original programming environment release. To restore the default programming environment release you should log out and then log back in to ARCHER2.
If you wish to make use of non-default versions of libraries provided by HPE Cray (usually because they are part of a non-default PE release: either old or new) then you need to make changes at both compile and runtime. In summary, you need to load the correct module and also make changes to the LD_LIBRARY_PATH
environment variable.
At compile time you need to load the version of the library module before you compile and set the LD_LIBRARY_PATH environment variable to include the contents of $CRAY_LD_LIBRARY_PATH
as the first entry. For example, to use the non-default 23.09.1.1 version of HPE Cray LibSci in the default programming environment (Cray Compiler Environment, CCE), you would first set up the environment to compile with:
module load cray-libsci/23.09.1.1\nexport LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
The order is important here: every time you change a module, you will need to reset the value of LD_LIBRARY_PATH
for the process to work (it will not be updated automatically).
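Because the export has to be repeated after every module change, it can be convenient to wrap it in a function. This is a sketch only: the function name use_cray_library_paths is hypothetical, and the demonstration uses made-up paths in place of the value that the loaded modules are expected to provide in CRAY_LD_LIBRARY_PATH.

```shell
# Hypothetical helper: put $CRAY_LD_LIBRARY_PATH first in LD_LIBRARY_PATH.
# The ${VAR:+...} form avoids a trailing ":" when LD_LIBRARY_PATH is empty.
use_cray_library_paths() {
    export LD_LIBRARY_PATH="${CRAY_LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Demonstration with made-up values.
CRAY_LD_LIBRARY_PATH=/opt/example/libsci/lib
LD_LIBRARY_PATH=/usr/example/lib
use_cray_library_paths
echo "$LD_LIBRARY_PATH"
```

In a real session you would call the function once, after the last module command.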
Now you can compile your code. To check that the executable is using the correct version of LibSci, run the ldd
command and look for the line beginning libsci_cray.so.5
; the version should appear in the path to the library file:
ldd dgemv.x \n
Output:
linux-vdso.so.1 (0x00007ffc7fff5000)\n libm.so.6 => /lib64/libm.so.6 (0x00007fd6a6361000)\n libsci_cray.so.5 => /opt/cray/pe/libsci/23.09.1.1/CRAY/12.0/x86_64/lib/libsci_cray.so.5 (0x00007fd6a2419000)\n libdl.so.2 => /lib64/libdl.so.2 (0x00007fd6a2215000)\n libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007fd6a68b3000)\n libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00007fd6a1fce000)\n libmodules.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libmodules.so.1 (0x00007fd6a689a000)\n libfi.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libfi.so.1 (0x00007fd6a1a29000)\n libcraymath.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymath.so.1 (0x00007fd6a67b3000)\n libf.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libf.so.1 (0x00007fd6a6720000)\n libu.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1 (0x00007fd6a1920000)\n libcsup.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcsup.so.1 (0x00007fd6a6715000)\n libc.so.6 => /lib64/libc.so.6 (0x00007fd6a152b000)\n /lib64/ld-linux-x86-64.so.2 (0x00007fd6a66ac000)\n libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd6a1308000)\n librt.so.1 => /lib64/librt.so.1 (0x00007fd6a10ff000)\n libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00007fd6a0c53000)\n libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00007fd6a0841000)\n libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00007fd6a0628000)\n
Tip
If any of the libraries point to versions in the /opt/cray/pe/lib64
directory then these are using the default versions of the libraries rather than the specific versions. This happens at compile time if you have forgotten to load the right module and set $LD_LIBRARY_PATH
afterwards.
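The check described in the tip above can be automated by filtering the ldd output. The following is a minimal sketch: the function name default_cray_libs is hypothetical, and the sample lines are made up for illustration.

```shell
# Hypothetical helper: read `ldd` output on standard input and print any
# libraries resolved from the default /opt/cray/pe/lib64 directory.
# The exit status is 0 if at least one such library is found.
default_cray_libs() {
    grep '=> /opt/cray/pe/lib64/'
}

# Example with made-up ldd lines: only the first one is reported.
printf '%s\n' \
    'libsci_cray.so.5 => /opt/cray/pe/lib64/libsci_cray.so.5 (0x0)' \
    'libm.so.6 => /lib64/libm.so.6 (0x0)' | default_cray_libs
```

For a real executable this would be used as `ldd dgemv.x | default_cray_libs`; any output suggests that the module load or LD_LIBRARY_PATH step was missed.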
At run time (typically in your job script) you need to repeat the environment setup steps (you can also use the ldd
command in your job submission script to check the library is pointing to the correct version). For example, a job submission script to run our dgemv.x
executable with the non-default version of LibSci could look like:
#!/bin/bash\n#SBATCH --job-name=dgemv\n#SBATCH --time=0:20:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --reservation=shortqos\n\n# Set up the environment to use the non-default version of LibSci\nmodule load cray-libsci/23.09.1.1\nexport LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n\n# Check which library versions the executable is pointing to\nldd dgemv.x\n\nexport OMP_NUM_THREADS=1\n\nsrun --hint=nomultithread --distribution=block:block dgemv.x\n
Tip
As when compiling, the order of commands matters. Setting the value of LD_LIBRARY_PATH
must happen after you have finished all your module
commands for it to have the correct effect.
Important
You must set up the environment at both compile and run time otherwise you will end up using the default version of the library.
"},{"location":"user-guide/dev-environment/#compiling-on-compute-nodes","title":"Compiling on compute nodes","text":"Sometimes you may wish to compile in a batch job. For example, the compile process may take a long time or the compile process is part of the research workflow and can be coupled to the production job. Unlike on login nodes, the /home
file system is not available on compute nodes.
An example job submission script for a compile job using make
(assuming the Makefile is in the same directory as the job submission script) would be:
#!/bin/bash\n\n#SBATCH --job-name=compile\n#SBATCH --time=00:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n\nmake clean\n\nmake\n
Note
If you want to use a compiler environment other than the default then you will need to add the module load
command before the make
command, e.g. to use the GCC compiler environment:
module load PrgEnv-gnu\n
You can also use a compute node in an interactive way using salloc
. Please see Section Using salloc to reserve resources for further details. Once your interactive session is ready, you can load the compilation environment and compile the code.
The compiler wrappers link with a number of HPE-provided libraries automatically. It is possible to compile codes in serial with the compiler wrappers to take advantage of the HPE libraries.
To set up your environment for serial compilation, you will need to run:
module load craype-network-none\nmodule remove cray-mpich\n
Once this is done, you can use the compiler wrappers (cc
for C, CC
for C++, and ftn
for Fortran) to compile your code in serial.
ARCHER2 supports common revision control software such as git
.
Standard GNU autoconf tools are available, along with make
(which is GNU Make). Versions of cmake
are available.
Tip
Some of these tools are part of the system software, and typically reside in /usr/bin
, while others are provided as part of the module system. Some tools may be available in different versions via both /usr/bin
and via the module system. If you find the default version is too old, then look in the module system for a more recent version.
The ARCHER2 CSE team at EPCC and other contributors provide build configurations and instructions for a range of research software, software libraries and tools on a variety of HPC systems (including ARCHER2) in a public GitHub repository. See:
The repository always welcomes contributions from the ARCHER2 user community.
"},{"location":"user-guide/dev-environment/#support-for-building-software-on-archer2","title":"Support for building software on ARCHER2","text":"If you run into issues building software on ARCHER2 or the software you require is not available then please contact the ARCHER2 Service Desk with any questions you have.
"},{"location":"user-guide/energy/","title":"Energy use and emissions","text":"This section covers energy use and greenhouse gas (GHG) emissions from ARCHER2.
The emissions section describes how to estimate emissions from your use of ARCHER2 and the methodology we have used to produce emissions estimates for the service.
The energy section describes how to monitor energy use for your jobs on ARCHER2 and how to control the CPU frequency which allows some control over how much energy is consumed by jobs.
Important
The default CPU frequency cap on ARCHER2 compute nodes for jobs launched using srun
is currently set to 2.0 GHz. Information below describes how to control the CPU frequency cap using Slurm.
The Slurm accounting database stores the total energy consumed by a job and you can also directly access the counters on compute nodes which capture instantaneous power and energy data broken down by different hardware components.
"},{"location":"user-guide/energy/#using-sacct-to-get-energy-usage-for-individual-jobs","title":"Using sacct to get energy usage for individual jobs","text":"Energy usage for a particular job may be obtained using the sacct
command. For instance
sacct -j 2658300 --format=JobID,Elapsed,ReqCPUFreq,ConsumedEnergy\n
will provide the elapsed time and consumed energy in joules for the job(s) specified with -j
. The output of this command is:
JobID Elapsed ReqCPUFreq ConsumedEnergy \n------------ ---------- ---------- -------------- \n2658300 02:19:48 Unknown 4.58M \n2658300.bat+ 02:19:48 0 4.58M \n2658300.ext+ 02:19:48 0 4.58M \n2658300.0 02:19:09 Unknown 4.57M \n
In this case we can see that the job consumed 4.58 MJ for a run lasting 2 hours, 19 minutes and 48 seconds with the CPU frequency unset. To convert the energy to kWh we can multiply the energy in joules by 2.78e-7, in this case resulting in 1.27 kWh.
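The conversion above can be scripted. One kWh is 3.6e6 J, which is where the factor of 2.78e-7 comes from. The following is a minimal sketch: the function name sacct_energy_to_kwh is hypothetical, and treating the K, M and G suffixes as factors of 1e3, 1e6 and 1e9 is an assumption based on the 4.58M value being read as 4.58 MJ above.

```shell
# Hypothetical helper: convert a ConsumedEnergy value in joules, optionally
# carrying a K, M or G suffix, to kWh with two decimal places.
sacct_energy_to_kwh() {
    awk -v e="$1" 'BEGIN {
        scale = 1
        suffix = substr(e, length(e), 1)
        if (suffix == "K") scale = 1e3
        else if (suffix == "M") scale = 1e6
        else if (suffix == "G") scale = 1e9
        if (scale != 1) e = substr(e, 1, length(e) - 1)
        printf "%.2f\n", (e * scale) / 3.6e6   # 1 kWh = 3.6e6 J
    }'
}

sacct_energy_to_kwh 4.58M   # prints 1.27
```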
The Slurm database may be cleaned without notice so you should gather any data you want as soon as possible after the job completes - you can even add the sacct
command to the end of your job script to ensure this data is captured.
In addition to energy statistics sacct
provides a number of other statistics that can be specified to the --format
option, the full list of which can be viewed with
sacct --helpformat\n
or using the man
pages.
Note
The counters are available on each compute node and record data only for that compute node. If you are running multi-node jobs, you will need to combine data from multiple nodes to get data for the whole job.
On compute nodes, the raw energy counters and instantaneous power draw data are available at:
/sys/cray/pm_counters\n
There are a number of files in this directory. All the counter files include the current value and a timestamp.
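As an illustration of how a counter reading might be processed, the sketch below extracts the value from a reading and computes the energy used between two readings. It assumes each reading has the form "<value> <unit> <timestamp> us"; that layout, the function name pm_counter_value and the sample readings are all assumptions for illustration, so check the files on a compute node before relying on them.

```shell
# Hypothetical helper: print the numeric value from one counter reading
# supplied on standard input.
# Assumes the reading is "<value> <unit> <timestamp> us".
pm_counter_value() {
    awk '{ print $1 }'
}

# Energy used between two made-up readings of an energy counter.
start=$(printf '%s\n' '1000000 J 1700000000000000 us' | pm_counter_value)
end=$(printf '%s\n' '1004500 J 1700000010000000 us' | pm_counter_value)
echo "$((end - start)) J"
```

On a compute node the input would instead come from a counter file, e.g. `pm_counter_value < /sys/cray/pm_counters/energy`, assuming a file of that name is present.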
This documentation is from the official HPE documentation:
Tip
The overall power
and energy
counters include all on-node systems. The major components are the CPU (processor), memory and Slingshot network interface controller (NIC).
Note
There exists an MPI-based wrapper library that can gather the pm
counter values at runtime via a simple set of function calls. See the link below for details.
You can request specific CPU frequency caps (in kHz) for compute nodes through srun
options or environment variables. The available frequency caps on the ARCHER2 processors, along with the corresponding options and environment variables, are:
Frequency | srun option | Slurm environment variable | Turbo boost enabled?
2.25 GHz | --cpu-freq=2250000 | export SLURM_CPU_FREQ_REQ=2250000 | Yes
2.00 GHz | --cpu-freq=2000000 | export SLURM_CPU_FREQ_REQ=2000000 | No
1.50 GHz | --cpu-freq=1500000 | export SLURM_CPU_FREQ_REQ=1500000 | No
The only frequency caps available on the processors on ARCHER2 are 1.5 GHz, 2.0 GHz and 2.25 GHz with turbo boost.
Important
Setting the CPU frequency cap in this way sets the maximum frequency that the processors can use. In practice, the individual cores may select different frequencies up to the value you have set depending on the workload on the processor.
Important
When you select the highest frequency value (2.25 GHz), you also enable turbo boost and so the processor is free to set the CPU frequency to values above 2.25 GHz if possible within the power and thermal limits of the processor. We see that, with turbo boost enabled, the processors typically boost to around 2.8 GHz even when performing compute-intensive work.
For example, you can add the following option to srun
commands in your job submission scripts to set the CPU frequency to 2.25 GHz (and also enable turbo boost):
srun --cpu-freq=2250000 ...usual srun options and arguments...\n
Alternatively, you could add the following line to your job submission script before you use srun
to launch the application:
export SLURM_CPU_FREQ_REQ=2250000\n
Tip
Testing by the ARCHER2 CSE team has shown that most software is most energy efficient when 2.0 GHz is selected as the CPU frequency.
Important
The CPU frequency settings only affect applications launched using the srun
command.
Priority of frequency settings:
- The default SLURM_CPU_FREQ_REQ setting set by the ARCHER2 service applies if no other mechanism is used to set the CPU frequency.
- Setting the SLURM_CPU_FREQ_REQ environment variable in a job script overrides the default environment variable setting for any subsequent srun commands in the job script.
- Adding the --cpu-freq=<freq in kHz> option to the srun launch command itself overrides all other options.
Tip
Adding the --cpu-freq=<freq in kHz>
option to sbatch
(e.g. using #SBATCH --cpu-freq=<freq in kHz>)
will not change the CPU frequency of srun
commands used in the job as the default setting for ARCHER2 will override the sbatch
option when the script runs.
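The priority order described above can be sketched as a small Python function. This is only an illustration of the precedence rules, not how Slurm implements them; the function name `effective_cpu_freq` and its parameters are hypothetical, and all values are in kHz.

```python
def effective_cpu_freq(service_default, job_env=None, srun_option=None):
    # Highest priority: --cpu-freq given on the srun command itself.
    if srun_option is not None:
        return srun_option
    # Next: SLURM_CPU_FREQ_REQ exported in the job script before srun.
    if job_env is not None:
        return job_env
    # Otherwise the default set by the service applies.
    return service_default
```

For example, `effective_cpu_freq(default, job_env=1500000, srun_option=2250000)` returns 2250000 because the srun option overrides everything else. An sbatch option does not appear as an input at all, which reflects the tip above.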
If you do not specify a CPU frequency then you will get the default setting for the ARCHER2 service when you launch an application using srun
. The table below lists the history of default CPU frequency settings on the ARCHER2 service.
Most centrally installed research software (available via module load
commands) uses the same default Slurm CPU frequency as set globally for all ARCHER2 users (see above for this value). However, the performance of a small number of software packages is significantly degraded by lower frequency settings, so the modules for these packages reset the CPU frequency to the highest value (2.25 GHz). The packages that currently do this are:
Important
If you specify the Slurm CPU frequency in your job scripts using one of the mechanisms described above after you have loaded the module, you will override the setting from the module.
"},{"location":"user-guide/energy/#emissions","title":"Emissions","text":"In this section we provide a brief overview of greenhouse gas (GHG) emissions sources relevant to ARCHER2, show how we have estimated the emissions associated with the service and describe how users can estimate emissions associated with their use of ARCHER2.
"},{"location":"user-guide/energy/#impact-on-reducing-emissions","title":"Impact on reducing emissions","text":"As well as a producer of GHG emissions, HPC systems like ARCHER2 also contribute to reducing emissions. The main source of reduced emissions from services such as ARCHER2 is in the research that leads to new technology, policies and approaches to reducing emissions. Some examples include:
As well as the research activities on the service leading to reductions in emissions, there are other activities that HPC services can potentially take. For example:
The emissions from ARCHER2 potentially fall into two categories (wording inspired by the Green Software Practitioner course linked below):
The other class of emissions (Scope 1) are not relevant for the ARCHER2 service:
If you want to learn more about GHG emissions in the area of software and digital infrastructure then you may want to look at the Green Software Foundation Green Software Practitioner online course.
"},{"location":"user-guide/energy/#archer2-emissions","title":"ARCHER2 emissions","text":"Important
All ARCHER2 emissions are estimated and you should understand that there is the potential for significant variation from the current values as understanding of emissions values and sources improves.
"},{"location":"user-guide/energy/#scope-3-emissions","title":"Scope 3 emissions","text":"Scope 3 emissions from the ARCHER2 hardware have been estimated from a subset of the components that are expected to make up the majority of the emissions. Note that there is a large amount of uncertainty for Scope 3 emissions due to lack of high quality Scope 3 emissions data from vendors. In particular, the number used for the compute node emissions is at the high end of estimated values and the actual value could be as much as 15% lower at around 900 kgCO2e/node.
| Component | Count | Estimated kgCO2e per unit | Estimated kgCO2e | % Total Scope 3 | References |
|-----------|-------|---------------------------|------------------|-----------------|------------|
| Compute nodes | 5,860 nodes | 1,100 | 6,400,000 | 84% | (1) |
| Interconnect switches | 768 switches | 280 | 150,000 | 2% | (2) |
| Lustre HDD | 19,759,200 GB | 0.02 | 400,000 | 6% | (3) |
| Lustre SSD | 1,900,800 GB | 0.16 | 300,000 | 4% | (3) |
| NFS HDD | 3,240,000 GB | 0.02 | 70,000 | 1% | (3) |
| Total | | | 7,320,000 | 100% | |
We then estimate the per-CU (nodeh) Scope 3 emissions by assuming a service lifetime of 6 years and 100% availability:
7,320,000 kgCO2e / (5,860 nodes * 6 years * 365 days * 24 hours) = 0.023 kgCO2e/CU\n
Tools use a value of 0.023 kgCO2e/CU for ARCHER2.
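The arithmetic above can be reproduced with a short Python sketch. The constants come from the table and calculation above; the helper `scope3_for_job` is a hypothetical name used for illustration. Note that the unrounded quotient comes out slightly above the 0.023 kgCO2e/CU value that the tools use.

```python
TOTAL_SCOPE3_KGCO2E = 7_320_000   # total from the table above
NODES = 5_860
LIFETIME_YEARS = 6                # assumed service lifetime, 100% availability

# Total node hours (CU) over the assumed lifetime.
node_hours = NODES * LIFETIME_YEARS * 365 * 24

# Unrounded per-CU Scope 3 emissions in kgCO2e.
per_cu = TOTAL_SCOPE3_KGCO2E / node_hours


def scope3_for_job(cu, rate=0.023):
    # Hypothetical helper: Scope 3 estimate for a job, using the
    # per-CU value quoted for the tools by default.
    return cu * rate
```

With the quoted rate, a job that used 1800 CU is attributed about 41.4 kgCO2e of Scope 3 emissions.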
References:
Scope 2 emissions from ARCHER2 are zero as the service is supplied by 100% certified renewable energy. For information purposes we can calculate what the Scope 2 emissions would have been if the energy was not 100% renewable energy using the methodology described below.
We are aware that there is ongoing discussion in the sustainability community about the impact and effectiveness of certified renewable energy contracts that are supplied through UK National Grid connections. We are monitoring these discussions and taking advice from sustainability professionals on how we report and estimate ARCHER2 emissions.
UK National Grid based Scope 2 emissions are calculated using the compute node energy use for particular jobs along with the carbon intensity of the South Scotland region of the UK National Grid at the start time of the job. The carbon intensity is retrieved from the carbonintensity.org.uk web API.
If the energy use of a job is not available (which happens occasionally due to, e.g. counter failures) then the mean per node power draw from 1 Jan 2024 - 30 Jun 2024 on ARCHER2 is used to compute the energy consumption. This corresponds to a value of 0.41 kW per node.
Estimates of power draw of individual components of ARCHER2 suggest that the compute node power draw makes up around 85% of the system power draw so to estimate energy use by additional components we add 15% of the measured compute node energy.
| Component | Count | Loaded power draw per unit (kW) | Loaded power draw (kW) | % Total | Notes |
|-----------|-------|---------------------------------|------------------------|---------|-------|
| Compute nodes | 5,860 nodes | 0.41 | 2,400 | 85% | Measured by on system counters |
| Interconnect switches | 768 switches | 0.24 | 240 | 9% | Measured by on system counters |
| Lustre storage | 5 file systems | 8 | 40 | 1% | Estimate from vendor |
| NFS storage | 4 file systems | 8 | 32 | 1% | Estimate from vendor |
| Coolant distribution units | 6 CDU | 16 | 96 | 3% | Estimate from vendor |
| Total | | | 2,808 | 99% | |
Current Scope 2 grid-based emission estimates do not include overheads from the electrical and cooling plant. These overheads vary with outside weather conditions at the data centre but are typically less than 10%. As a conservative estimate, we add an additional 10% energy use to the total to account for plant overheads.
The final energy calculation for a job is therefore:
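A minimal Python sketch of this calculation, based on the percentages described above (15% of compute node energy for other hardware, then 10% for plant overheads), is given here. It assumes the 10% overhead is applied to the sum of compute node and other hardware energy, which is consistent with the example `jobemissions` output in this documentation; the function names are hypothetical.

```python
def job_energy_kwh(compute_node_kwh):
    # Other hardware (interconnect, storage, CDUs): 15% of compute node energy.
    other_hardware = 0.15 * compute_node_kwh
    # Plant overhead: assumed to be 10% of compute node plus other hardware energy.
    overhead = 0.10 * (compute_node_kwh + other_hardware)
    total = compute_node_kwh + other_hardware + overhead
    return other_hardware, overhead, total


def grid_scope2_kgco2e(total_kwh, intensity_g_per_kwh):
    # Indicative grid-based Scope 2 emissions, converted from g to kg.
    return total_kwh * intensity_g_per_kwh / 1000.0
```

For a job with 448.973 kWh of compute node energy this gives roughly 67.3 kWh for other hardware, 51.6 kWh of overhead and 568 kWh in total.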
To help you estimate GHG emissions from your use of ARCHER2 and place them in the context of other sources of GHG emissions, we are developing a number of tools. We will add more information on these tools to this section of the documentation as they become available.
At the moment, the following tools are available:
jobemissions
- a command line tool on ARCHER2 that reports estimated emissions for a specified, completed job. It can also provide comparisons to other GHG emissions sourcesjobemissions
tool","text":"The jobemissions
tool is available by default to all ARCHER2 users from the command line. You supply a Slurm job ID for a completed job and the tool provides an estimate of the GHG emissions associated with that job (based on the estimation methodologies described above). For example, to provide an estimate for the completed job with Job ID 7654321, you would use:
jobemissions 7654321\n
Typical output from the tool would look like:
Job details:\n Job ID: 7654321\n Start: 2024-11-11T20:51:25\n Budget: t01\n Nodes: 20\n Runtime: 324000 s\n CU: 1800.000\n Compute node energy use: 448.973 kWh\n Other hardware energy use: 67.346 kWh (estimated)\n Overhead energy use: 51.632 kWh (estimated)\n Total energy use: 567.951 kWh (estimated)\n\n Emissions estimates:\n Scope 2: 0.000 kgCO2e (ARCHER2 is on 100% certified\n renewable energy contract so scope 2 emissions are zero)\n Scope 3: 41.400 kgCO2e (23.0 gCO2e/CU)\n Total: 41.400 kgCO2e\n\n Indicative emissions estimates for UK national grid energy mix\n in S. Scotland at start of job if ARCHER2 was not using\n renewable energy\n Scope 2: 9.655 kgCO2e (567.951 kWh, 17.0 gCO2e/kWh)\n Scope 3: 41.400 kgCO2e (23.0 gCO2e/CU)\n Total: 51.055 kgCO2e\n\n Scope 2 carbon intensity values from carbonintensity.org.uk\n
If you add the flag --comparison food,other
the tool will add comparisons of the GHG emissions for the job to other sources. For example, for the same job as above, it would add the following section to the end of the output:
Emissions from job approximately equivalent to following food consumption:\n | Food | Emissions (kgCO2e/100g) | Equivalent to (g) |\n |-----------|--------------------------|-------------------|\n | Beef | 12.47 | 332.00 |\n | Chicken | 1.43 | 2895.10 |\n | Avocado | 0.18 | 23000.00 |\n | Chickpeas | 0.04 | 103500.00 |\n\n Emissions from job approximately equivalent to:\n Daily emissions from 329.2 houses' electricity use (in S. Scotland)\n Emissions from flying 0.083 times across the Atlantic (500.00 kgCO2e/person)\n Emissions from driving 153.9 miles (0.27 kgCO2e/mile, average UK car, petrol and diesel very similar)\n
You can add the --json
flag to obtain the emissions data from the tool in a machine-readable format.
The ARCHER2 CU Calculator on the ARCHER2 website is used by potential users to estimate the number and cost of resources for potential applications to use ARCHER2. This tool has been augmented to include an estimate of GHG emissions from the proposed use of ARCHER2. In this tool, we include the Scope 3 emissions calculated as per the methodology above and note that Scope 2 emissions are zero due to the 100% renewable energy contract used to power ARCHER2.
"},{"location":"user-guide/functional-accounts/","title":"Functional accounts on ARCHER2","text":"Functional accounts are used to enable persistent services, controlled by users running on ARCHER2. For example, running a licence server to allow jobs on compute nodes to check out a licence for restricted software.
There are a number of steps involved in setting up functional accounts:
dvn04
) and the functional accountdvn04
)
We cover these steps in detail below, using the concrete example of setting up a licence server with the FlexLM software; the process should generalise to other persistent services.
Note
If you have any questions about functional accounts and persistent services on ARCHER2 please contact the ARCHER2 Service Desk.
"},{"location":"user-guide/functional-accounts/#submit-a-request-to-service-desk","title":"Submit a request to service desk","text":"If you wish to have access to a functional account for persistent services on ARCHER2 you should email the ARCHER2 Service Desk with a case for why you want to have this functionality. You should include the following information in your email:
If your request for a functional account is approved then the ARCHER2 user administration team will set up the account and enable access for the standard user accounts named in the application. They will then inform you of the functional account name.
"},{"location":"user-guide/functional-accounts/#test-access-to-functional-account","title":"Test access to functional account","text":"The process for accessing the functional account is:
dvn04
)dvn04
)sudo
to access the functional account
Log into ARCHER2 in the usual way using a normal user account that has been given access to manage the functional account.
"},{"location":"user-guide/functional-accounts/#setup-ssh-key-pair-for-dvn04-access","title":"Setup SSH key pair fordvn04
access","text":"You can create a passphrase-less SSH key pair to use for access to the persistent service node using the ssh-keygen
command. As long as you place the public and private key parts in the default location, you will not need any additional SSH options to access dvn04
from the ARCHER2 login nodes. Just hit enter when prompted for a passphrase to create a key with no passphrase.
Once the key pair has been created, you add the public part to the $HOME/.ssh/authorized_keys
file on ARCHER2 to make it valid for login to dvn04
using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
.
Example commands to set up an SSH key pair:
auser@ln04:~> ssh-keygen -t rsa\n\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/home/t01/t01/auser/.ssh/id_rsa): \nEnter passphrase (empty for no passphrase): \nEnter same passphrase again: \nYour identification has been saved in /home/t01/t01/auser/.ssh/id_rsa\nYour public key has been saved in /home/t01/t01/auser/.ssh/id_rsa.pub\nThe key fingerprint is:\nSHA256:wX2bgNElbsPaT8HXKIflNmqnjSfg7a8BPM1R56b4/60 auser@ln02\nThe key's randomart image is:\n+---[RSA 3072]----+\n| ..... o .|\n| . *.o = = |\n| + B B B +|\n| * * % + |\n| S * X o |\n| . O * |\n| . B + |\n| . + ..|\n| ooE.=|\n+----[SHA256]-----+\n\nauser@ln04:~> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys\n
"},{"location":"user-guide/functional-accounts/#login-to-the-persistent-service-node-dvn04","title":"Login to the persistent service node (dvn04
)","text":"Once you are logged into an ARCHER2 login node, and assuming the SSH key is in the default location, you can now login to dvn04
:
auser@ln04:~> ssh dvn04\n
Note
You will need to enter the TOTP for your ARCHER2 account to login to dvn04
unless you have logged in to the node recently.
Once you are logged into dvn04
, you use sudo
to access the functional account.
Important
You must use the normal user account password to use the sudo
command. This password was set on your first ever login to ARCHER2 (and not used subsequently). If you have forgotten this password, you can reset it in SAFE.
For example, if the functional account is called testlm
, you would access it (on dvn04
) with:
auser@dvn04:~> sudo -iu testlm\n
To exit the functional account, you use the exit
command which will return you to your normal user account on dvn04
.
You should use systemctl
to manage your persistent service on dvn04
. In order to use the systemctl
command, you need to add the following lines to the ~/.bashrc
for the functional account:
export XDG_RUNTIME_DIR=/run/user/$UID\nexport DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$UID/bus\n
Next, create a service definition file for the persistent service and save it to a plain text file. Here is the example used for the QChem licence server:
[Unit]\nDescription=Licence manger for QChem\nAfter=network.target\nConditionHost=dvn04\n\n[Service]\nType=forking\nExecStart=/work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmgrd -l +/work/y07/shared/apps/core/qchem/6.1/var/log/qchemlm.log -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/\nExecStop=/work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmutil lmdown -all -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/\nSuccessExitStatus=15\nRestart=always\nRestartSec=30\n\n[Install]\nWantedBy=default.target\n
Enable the licence server service, e.g. for the QChem licence server service:
testlm@dvn04:~> systemctl --user enable /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/qchem-lm.service\n\nCreated symlink /home/y07/y07/testlm/.config/systemd/user/default.target.wants/qchem-lm.service \u2192 /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/qchem-lm.service.\nCreated symlink /home/y07/y07/testlm/.config/systemd/user/qchem-lm.service \u2192 /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/qchem-lm.service.\n
Once it has been enabled, you can start the licence server service, e.g. for the QChem licence server service:
testlm@dvn04:~> systemctl --user start qchem-lm.service\n
Check the status to make sure it is running:
testlm@dvn04:~> systemctl --user status qchem-lm\n\u25cf qchem-lm.service - Licence manger for QChem\n Loaded: loaded (/home/y07/y07/testlm/.config/systemd/user/qchem-lm.service; enabled; vendor preset: disabled)\n Active: active (running) since Thu 2024-05-16 15:33:59 BST; 8s ago\n Process: 174248 ExecStart=/work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmgrd -l +/work/y07/shared/apps/core/qchem/6.1/var/log/qchemlm.log -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/ (code=exited, status=0/SUCCESS)\n Main PID: 174249 (lmgrd)\n Tasks: 8 (limit: 39321)\n Memory: 5.6M\n CPU: 18ms\n CGroup: /user.slice/user-35153.slice/user@35153.service/app.slice/qchem-lm.service\n \u251c\u2500 174249 /work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmgrd -l +/work/y07/shared/apps/core/qchem/6.1/var/log/qchemlm.log -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/\n \u2514\u2500 174253 qchemlm -T 10.252.1.77 11.19 10 -c :/work/y07/shared/apps/core/qchem/6.1/etc/flexnet/: -lmgrd_port 6979 -srv mdSVdgushTnAjHX1s1PTj0ppCjHJw1Uk9ylvs1j13zkaUzhDBFlbv4thnqEIAXV --lmgrd_start 66461957 -vdrestart 0 -l /work/y07/shar>\n
"},{"location":"user-guide/gpu/","title":"AMD GPU Development Platform","text":"In early 2024, ARCHER2 users gained access to a small GPU system integrated into ARCHER2 which is designed to allow users to test and develop software using AMD GPUs.
Important
The GPU component is very small and so is aimed at software development and testing rather than to be used for production research.
"},{"location":"user-guide/gpu/#hardware-available","title":"Hardware available","text":"The GPU Development Platform consists of 4 compute nodes each with: - 1x AMD EPYC 7543P (Milan) processor, 32 core, 2.8 GHz - 4x AMD Instinct MI210 accelerator - 512 GiB host memory - 2\u00d7 100 Gb/s Slingshot interfaces per node
The AMD Instinct\u2122 MI210 Accelerators feature: - Architecture: CDNA 2 - Compute Units: 104 - Memory: 64 GB HBM2e
A comprehensive list of features is available on the AMD website.
"},{"location":"user-guide/gpu/#accessing-the-gpu-compute-nodes","title":"Accessing the GPU compute nodes","text":"The GPU nodes can be accessed through the Slurm job submission system from the standard ARCHER2 login nodes. Details of the scheduler limits and configuration and example job submission scripts are provided below.
"},{"location":"user-guide/gpu/#compiling-software-for-the-gpu-compute-nodes","title":"Compiling software for the GPU compute nodes","text":""},{"location":"user-guide/gpu/#overview","title":"Overview","text":"As a quick summary, the recommended procedure for compiling code that offloads to the AMD GPUs is as follows:
module load PrgEnv-xxx
module load rocm
module load craype-accel-amd-gfx90a
module load craype-x86-milan
ftn
, cc
, or CC
For details and alternative approaches, see below.
"},{"location":"user-guide/gpu/#programming-environments","title":"Programming Environments","text":"The following programming environments and compilers are available to compile code for the AMD GPUs on ARCHER2 using the usual compiler wrappers (ftn
, cc
, CC
), which is the recommended approach:
ftn
, cc
, CC
PrgEnv-amd
AMD LLVM compilers amdflang
, amdclang
, amdclang++
PrgEnv-cray
Cray compilers crayftn
, craycc
, crayCC
PrgEnv-gnu
GNU compilers gfortran
, gcc
, g++
PrgEnv-gnu-amd
hybrid gfortran
, amdclang
, amdclang++
PrgEnv-cray-amd
hybrid crayftn
, amdclang
, amdclang++
To decide which compiler(s) to use to compile offload code for the AMD GPUs, you may find it useful to consult the Compilation Strategies for GPU Offloading section below.
The hybrid environments PrgEnv-gnu-amd
and PrgEnv-cray-amd
are provided as a convenient way to mitigate less mature OpenMP offload support in the AMD LLVM Fortran compiler. In these hybrid environments ftn
therefore calls gfortran
or crayftn
instead of amdflang
.
Details about the underlying compiler being called by a compiler wrapper can be checked using the --version
flag, for example:
> module load PrgEnv-amd\n> cc --version\nAMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.3 22324 d6c88e5a78066d5d7a1e8db6c5e3e9884c6ad10e)\nTarget: x86_64-unknown-linux-gnu\nThread model: posix\nInstalledDir: /opt/rocm-5.2.3/llvm/bin\n
"},{"location":"user-guide/gpu/#rocm","title":"ROCm","text":"Access to AMD's ROCm software stack is provided through the rocm
module:
module load rocm\n
With the rocm
module loaded the AMD LLVM compilers amdflang
, amdclang
, and amdclang++
become available to use directly or through AMD's compiler driver utility hipcc
. Neither approach is recommended as a first choice for most users, as considerable care needs to be taken to pass suitable flags to the compiler or to hipcc
. With PrgEnv-amd
loaded the compiler wrappers ftn
, cc
, CC
, which bypass hipcc
and call amdflang
, amdclang
, or amdclang++
directly, take care of passing suitable compilation flags, which is why using these wrappers is the recommended approach for most users, at least initially.
Note: the rocm
module should be loaded whenever you are compiling for the AMD GPUs, even if you are not using the AMD LLVM compilers (amdflang
, amdclang
, amdclang++
).
The rocm
module also provides access to other AMD tools, such as HIPIFY (hipify-clang
or hipify-perl
command), which enables translation of CUDA to HIP code. See also the section below on HIPIFY.
Regardless of what approach you use, you will need to tell the underlying GPU compiler which GPU hardware to target. When using the compiler wrappers ftn
, cc
, or CC
, as recommended, this can be done by ensuring the appropriate GPU target module is loaded:
module load craype-accel-amd-gfx90a\n
"},{"location":"user-guide/gpu/#cpu-target","title":"CPU target","text":"The AMD GPU nodes are equipped with AMD EPYC Milan CPUs instead of the AMD EPYC Rome CPUs present on the regular CPU-only ARCHER2 compute nodes. Though the difference between these processors is small, when using the compiler wrappers ftn
, cc
, or CC
, as recommended, we should load the appropriate CPU target module:
module load craype-x86-milan\n
"},{"location":"user-guide/gpu/#compilation-strategies-for-gpu-offloading","title":"Compilation Strategies for GPU Offloading","text":"Compiler support on ARCHER2 for various programming models that enable offloading to AMD GPUs can be summarised at a glance in the following table:
| PrgEnv | Actual compiler | OpenMP Offload | HIP | OpenACC |
|--------|-----------------|----------------|-----|---------|
| PrgEnv-amd | amdflang | \u2705 | \u274c | \u274c |
| PrgEnv-amd | amdclang | \u2705 | \u274c | \u274c |
| PrgEnv-amd | amdclang++ | \u2705 | \u2705 | \u274c |
| PrgEnv-cray | crayftn | \u2705 | \u274c | \u2705 |
| PrgEnv-cray | craycc | \u2705 | \u274c | \u274c |
| PrgEnv-cray | crayCC | \u2705 | \u2705 | \u274c |
| PrgEnv-gnu | gfortran | \u274c | \u274c | \u274c |
| PrgEnv-gnu | gcc | \u274c | \u274c | \u274c |
| PrgEnv-gnu | g++ | \u274c | \u274c | \u274c |
It is generally recommended to do the following:
module load PrgEnv-xxx\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
And then to use the ftn
, cc
and/or CC
wrapper to compile as appropriate for the programming model in question. Specific guidance on how to do this for different programming models is provided in the subsections below.
When deviating from this procedure and using underlying compilers directly, or when debugging a problematic build using the wrappers, it may be useful to check what flags the compiler wrappers are passing to the underlying compiler. This can be done by using the -craype-verbose
option with a wrapper when compiling a file. Optionally piping the resulting output to the command tr \" \" \"\\n\"
so that flags are split over lines may be convenient for visual parsing. For example:
> CC -craype-verbose source.cpp | tr \" \" \"\\n\"\n
"},{"location":"user-guide/gpu/#openmp-offload","title":"OpenMP Offload","text":"To use the compiler wrappers to compile code that offloads to GPU with OpenMP directives, first load the desired PrgEnv module and other necessary modules:
module load PrgEnv-xxx\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
Then use the appropriate compiler wrapper and pass the -fopenmp
option to the wrapper when compiling. For example:
ftn -fopenmp source.f90\n
This should work under PrgEnv-amd
and PrgEnv-cray
, but not under PrgEnv-gnu as GCC 11.2.0 is the most recent version of GCC available on ARCHER2 and OpenMP offload to AMD MI200 series GPUs is only supported by GCC 13 and later.
You may find that offload directives introduced in more recent versions of the OpenMP standard, e.g. versions later than OpenMP 4.5, fail to compile with some compilers. Under PrgEnv-cray
an explicit description of supported OpenMP features can be viewed using the command man intro_openmp
.
To compile C or C++ code that uses HIP written specifically to offload to AMD GPUs, first load the desired PrgEnv module (either PrgEnv-amd
or PrgEnv-cray
) and other necessary modules:
module load PrgEnv-xxx\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
Then compile using the CC
compiler wrapper as follows:
CC -x hip -std=c++11 -D__HIP_ROCclr__ --rocm-path=${ROCM_PATH} source.cpp\n
Alternatively, you may use hipcc
to drive the AMD LLVM compiler amdclang(++)
to compile HIP code. In that case you will need to take care to explicitly pass all required offload flags to hipcc
, such as:
-D__HIP_PLATFORM_AMD__ --offload-arch=gfx90a\n
To see what hipcc
passes to the compiler, you can pass the --verbose
option. If you are compiling MPI-parallel HIP code with hipcc
, please see additional guidance under HIPCC and MPI.
hipcc
can compile both HIP code for device (GPU) execution and non-HIP code for host (CPU) execution and will default to using the AMD LLVM compiler amdclang(++)
to do so. If your software consists of separate compilation units - typically separate files - containing HIP code and non-HIP code, it is possible to use a different compiler than hipcc
to compile the non-HIP code. To do this:
hipcc
CC
and a different PrgEnv than PrgEnv-amd
loaded.o
files) together using the compiler wrapper
Offloading using OpenACC directives on ARCHER2 is only supported by the Cray Fortran compiler. You should therefore load the following:
module load PrgEnv-cray\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
OpenACC Fortran code can then be compiled using the -hacc
flag, as follows:
ftn -hacc source.f90\n
Details on what OpenACC standard and features are supported under PrgEnv-cray
can be viewed using the command man intro_openacc
.
Code may use OpenMP for multithreaded execution on the host CPU in combination with target directives to offload work to GPU. Both uses of OpenMP can coexist in a single compilation unit, which should be compiled using the relevant compiler wrapper and the -fopenmp
flag.
Using both OpenMP and HIP to offload to GPU is possible, but only if the two programming models are not mixed in the same compilation unit. Two or more separate compilation units - typically separate source files - should be compiled as recommended individually for HIP and OpenMP offload code in the respective sections above. The resulting code objects (.o
files) should then be linked together using a compiler wrapper with the -fopenmp
flag, but without the -x hip
flag.
Code in a single compilation unit, such as a single source file, can use HIP to offload to GPU as well as OpenMP for multithreaded execution on the host CPU. Compilation should be done using the relevant compiler wrapper and the flags -fopenmp
and -x hip
- in that order - as well as the flags for HIP compilation specified above:
CC -fopenmp -x hip -std=c++11 -D__HIP_ROCclr__ --rocm-path=${ROCM_PATH} source.cpp\n
"},{"location":"user-guide/gpu/#hipcc-and-mpi","title":"HIPCC and MPI","text":"When compiling an MPI-parallel code with hipcc
instead of a compiler wrapper, the path to the Cray MPI library include directory should be passed explicitly, or set as part of the CXXFLAGS
environment variable, as:
-I${CRAY_MPICH_DIR}/include\n
MPI library directories should also be passed to hipcc
, or set as part of the LDFLAGS
environment variable prior to compiling, as:
-L${CRAY_MPICH_DIR}/lib ${PE_MPICH_GTL_DIR_amd_gfx90a}\n
Finally the MPI library should be linked explicitly, or set as part of the LIBS
environment variable prior to linking, as:
-lmpi ${PE_MPICH_GTL_LIBS_amd_gfx90a}\n
"},{"location":"user-guide/gpu/#cmake","title":"Cmake","text":"Documentation about integrating rocm with cmake can be found here.
"},{"location":"user-guide/gpu/#gpu-aware-mpi","title":"GPU-aware MPI","text":"Need to set an environment variable to enable GPU support in cray-mpich
:
export MPICH_GPU_SUPPORT_ENABLED=1
No additional or alternative MPI modules need to be loaded instead of the default cray-mpich
module.
This supports GPU-GPU transfers:
Be aware that there are only two PCIe network cards in each of these nodes and they may not be in the same memory region as a given GPU. NUMA effects are therefore to be expected in multi-node communication. More detail on this is provided below.
"},{"location":"user-guide/gpu/#libraries","title":"Libraries","text":"In order to access the GPU-accelerated version of Cray's LibSci maths libraries, a new module has been provided:
cray-libsci_acc
With this module loaded, documentation can be viewed using the command man intro_libsci_acc
.
Additionally a number of libraries are provided as part of the rocm
module.
The cray-python
module can be used as normal for the GPU partition with the mpi4py
package that is installed by default. mpi4py
uses cray-mpich
under the hood and in the same way as the CPU compute nodes.
However, unless they have been specifically compiled for GPU-GPU communication, certain Python packages/frameworks that try to take advantage of the fast links between GPUs by calling MPI on GPU pointers may have issues. To set the environment correctly for a given Python program, the following snippet can be added to load the required libmpi_gtl_hsa
library:
from os import environ\nif environ.get(\"MPICH_GPU_SUPPORT_ENABLED\", False):\n from ctypes import CDLL, RTLD_GLOBAL\n CDLL(f\"{environ.get('CRAY_MPICH_ROOTDIR')}/gtl/lib/libmpi_gtl_hsa.so\", mode=RTLD_GLOBAL)\n\nfrom mpi4py import MPI\n
"},{"location":"user-guide/gpu/#supported-software","title":"Supported software","text":"The ARCHER2 GPU development platform is intended for code development, testing and experimentation and will not have supported centrally installed versions of codes as is the case for the standard ARCHER2 CPU compute nodes. However some builds are being made available to users by members of CSE to under a best effort approach to support the community.
Codes that have modules targeting GPUs are:
Note
Will be filled out as applications are compiled and made available.
"},{"location":"user-guide/gpu/#running-jobs-on-the-gpu-nodes","title":"Running jobs on the GPU nodes","text":"To run a GPU job, you must specify a GPU partition and a quality of service (QoS) as well as the number of GPUs required. You specify the number of GPU cards you want per node using the --gpus=N
option, where N
is typically 1, 2 or 4.
Note
As there are 4 GPUs per node, each GPU is associated with 1/4 of the resources of the node, i.e., 8 of 32 physical cores and roughly 128 GiB of the total 512 GiB host memory.
Allocations of host resources are made pro-rata. For example, if 2 GPUs are requested, sbatch
will allocate 16 cores and around 256 GiB of host memory (in addition to 2 GPUs). Any attempt to use more than the allocated resources will result in an error.
This automatic allocation by Slurm for GPU jobs means that the submission script should not specify options such as --ntasks
and --cpus-per-task
. Such a job submission will be rejected. See below for some examples of how to use host resources and how to launch MPI applications.
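The pro-rata rule above can be written out as a short Python sketch. The node totals (4 GPUs, 32 cores, 512 GiB) come from the text; the function name host_share is hypothetical, and the memory figure is approximate because Slurm allocates only roughly a quarter of host memory per GPU.

```python
# Sketch of the pro-rata host allocation described above.
# Node totals are taken from the text; host_share is a hypothetical helper.
GPUS_PER_NODE = 4
CORES_PER_NODE = 32
HOST_MEM_GIB_PER_NODE = 512


def host_share(gpus):
    # Return (cores, approximate host memory in GiB) that accompany
    # a request for `gpus` GPUs on a single node.
    if not 1 <= gpus <= GPUS_PER_NODE:
        raise ValueError('a node provides between 1 and 4 GPUs')
    cores = CORES_PER_NODE * gpus // GPUS_PER_NODE
    mem_gib = HOST_MEM_GIB_PER_NODE * gpus // GPUS_PER_NODE
    return cores, mem_gib


for n in (1, 2, 4):
    cores, mem = host_share(n)
    print(f'--gpus={n}: {cores} cores, about {mem} GiB host memory')
```

This is only an illustration of the arithmetic; the allocation itself is made by Slurm.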
Warning
In order to run jobs on the GPU nodes your ARCHER2 budget must have positive CU hours associated with it. However, your budget will not be charged for any GPU jobs you run.
"},{"location":"user-guide/gpu/#slurm-partitions","title":"Slurm Partitions","text":"Your job script must specify a partition. The following table has a list of relevant GPU partition(s) on ARCHER2.
Partition | Description | Max nodes available
gpu | GPU nodes with AMD EPYC 32-core processor, 512 GB memory, 4\u00d7AMD Instinct MI210 GPU | 4"},{"location":"user-guide/gpu/#slurm-quality-of-service-qos","title":"Slurm Quality of Service (QoS)","text":"Your job script must specify a QoS relevant for the GPU nodes. Available QoS specifications are as follows.
QoS | Max Nodes Per Job | Max Walltime | Jobs Queued | Jobs Running | Partition(s) | Notes
gpu-shd | 1 | 12 hr | 2 | 1 | gpu | Nodes potentially shared with other users
gpu-exc | 2 | 12 hr | 2 | 1 | gpu | Exclusive node access"},{"location":"user-guide/gpu/#example-job-submission-scripts","title":"Example job submission scripts","text":"Here is a series of example jobs for various patterns of running on the ARCHER2 GPU nodes. They cover the following scenarios:
This example requests a single GPU on a potentially shared node and launches the program using a single CPU process with offload to a single GPU.
#!/bin/bash\n\n#SBATCH --job-name=single-GPU\n#SBATCH --gpus=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-shd\n\n# Check assigned GPU\nsrun --ntasks=1 rocm-smi\n\nsrun --ntasks=1 --cpus-per-task=1 ./my_gpu_program.x\n
"},{"location":"user-guide/gpu/#multiple-gpu-on-a-single-node-shared-node-access-max-2-gpu","title":"Multiple GPU on a single node - shared node access (max. 2 GPU)","text":"This example requests two GPUs on a potentially shared node and launch using two MPI processes (one per GPU) with one MPI process per CPU NUMA region.
We use the --cpus-per-task=8
option to srun
to set the stride between the two MPI processes to 8 physical cores. This places the MPI processes on separate NUMA regions to ensure they are associated with the correct GPU that is closest to them on the compute node architecture.
#!/bin/bash\n\n#SBATCH --job-name=multi-GPU\n#SBATCH --gpus=2\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-shd\n\n# Enable GPU-aware MPI\nexport MPICH_GPU_SUPPORT_ENABLED=1\n\n# Check assigned GPU\nsrun --ntasks=1 rocm-smi\n\n# Check process/thread pinning\nmodule load xthi\nsrun --ntasks=2 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n xthi\n\nsrun --ntasks=2 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n ./my_gpu_program.x\n
"},{"location":"user-guide/gpu/#multiple-gpu-on-a-single-node-exclusive-node-access-max-4-gpu","title":"Multiple GPU on a single node - exclusive node access (max. 4 GPU)","text":"This example requests four GPUs on a single node and launches the program using four MPI processes (one per GPU) with one MPI process per CPU NUMA region.
We use the --cpus-per-task=8
option to srun
to set the stride between the MPI processes to 8 physical cores. This places the MPI processes on separate NUMA regions to ensure they are associated with the correct GPU that is closest to them on the compute node architecture.
#!/bin/bash\n\n#SBATCH --job-name=multi-GPU\n#SBATCH --gpus=4\n#SBATCH --nodes=1\n#SBATCH --exclusive\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-exc\n\n# Check assigned GPU\nsrun --ntasks=1 rocm-smi\n\n# Check process/thread pinning\nmodule load xthi\nsrun --ntasks=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n xthi\n\n# Enable GPU-aware MPI\nexport MPICH_GPU_SUPPORT_ENABLED=1\n\nsrun --ntasks=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n ./my_gpu_program.x\n
Note
When you use the --qos=gpu-exc
QoS you must also add the --exclusive
flag and then specify the number of nodes you want with --nodes=1
.
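The effect of the --cpus-per-task=8 stride in the four-GPU example can be sketched in Python. This sketch assumes that block distribution places ranks on consecutive cores starting at core 0 and that each NUMA region is a run of 8 consecutive cores; both are assumptions made for illustration, and the xthi output from the job is the authoritative record of the actual placement. The function name rank_placement is hypothetical.

```python
# Sketch: which cores and NUMA regions each MPI rank would occupy,
# assuming contiguous block placement starting at core 0.
CORES_PER_NUMA = 8  # from the text: 8 of the 32 physical cores per GPU


def rank_placement(ntasks, cpus_per_task=8):
    placement = []
    for rank in range(ntasks):
        first = rank * cpus_per_task
        cores = list(range(first, first + cpus_per_task))
        # NUMA regions touched by this rank's cores
        numa_regions = sorted({c // CORES_PER_NUMA for c in cores})
        placement.append((rank, cores[0], cores[-1], numa_regions))
    return placement


for rank, first, last, numa in rank_placement(4):
    print(f'rank {rank}: cores {first}-{last}, NUMA region(s) {numa}')
```

With a stride smaller than 8, several ranks would land in the same NUMA region under these assumptions, which is what the stride of 8 is chosen to avoid.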
This example requests eight GPUs across two nodes and launches the program using eight MPI processes (one per GPU) with one MPI process per CPU NUMA region.
We use the --cpus-per-task=8
option to srun
to set the stride between the MPI processes to 8 physical cores. This places the MPI processes on separate NUMA regions to ensure they are associated with the correct GPU that is closest to them on the compute node architecture.
#!/bin/bash\n\n#SBATCH --job-name=multi-GPU\n#SBATCH --gpus=4\n#SBATCH --nodes=2\n#SBATCH --exclusive\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-exc\n\n# Check assigned GPU\nnodelist=$(scontrol show hostname $SLURM_JOB_NODELIST)\nfor nodeid in $nodelist\ndo\n echo $nodeid\n srun --ntasks=1 --gpus=4 --nodes=1 --ntasks-per-node=1 --nodelist=$nodeid rocm-smi\ndone\n\n# Check process/thread pinning\nmodule load xthi\nsrun --ntasks-per-node=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n xthi\n\n# Enable GPU-aware MPI\nexport MPICH_GPU_SUPPORT_ENABLED=1\n\nsrun --ntasks-per-node=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n ./my_gpu_program.x\n
Note
When you use the --qos=gpu-exc
QoS you must also add the --exclusive
flag and then specify the number of nodes you want with, for example, --nodes=2
.
salloc
","text":"Tip
This method does not give you an interactive shell on a GPU compute node. If you want an interactive shell on the GPU compute nodes, see the srun
method described below.
If you wish to have a terminal to perform interactive testing, you can use the salloc
command to reserve the resources so you can use srun
commands interactively. For example, to request 1 GPU for 20 minutes you would use (remember to replace t01
with your budget code):
auser@ln04:/work/t01/t01/auser> salloc --gpus=1 --time=00:20:00 --partition=gpu --qos=gpu-shd --account=t01\nsalloc: Pending job allocation 5335731\nsalloc: job 5335731 queued and waiting for resources\nsalloc: job 5335731 has been allocated resources\nsalloc: Granted job allocation 5335731\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid200001 are ready for job\n\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> srun rocm-smi\n\n\n======================= ROCm System Management Interface =======================\n================================= Concise Info =================================\nGPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%\n0 31.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n================================================================================\n============================= End of ROCm SMI Log ==============================\n\n\nsrun: error: nid200001: tasks 0: Exited with exit code 2\nsrun: launch/slurm: _step_signal: Terminating StepId=5335731.0\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> srun --ntasks=1 --cpus-per-task=8 --hint=nomultithread xthi\nNode summary for 1 nodes:\nNode 0, hostname nid200001, mpi 1, omp 1, executable xthi\nMPI summary: 1 ranks\nNode 0, rank 0, thread 0, (affinity = 0-7)\n
"},{"location":"user-guide/gpu/#using-srun","title":"Using srun
","text":"If you want an interactive terminal on a GPU node then you can use the srun
command to achieve this. For example, to request 1 GPU for 20 minutes with an interactive terminal on a GPU compute node you would use (remember to replace t01
with your budget code):
auser@ln04:/work/t01/t01/auser> srun --gpus=1 --time=00:20:00 --partition=gpu --qos=gpu-shd --account=t01 --pty /bin/bash\nsrun: job 5335771 queued and waiting for resources\nsrun: job 5335771 has been allocated resources\nauser@nid200001:/work/t01/t01/auser>\n
Note that the command prompt has changed to indicate we are now on a GPU compute node. You can now directly run commands that interact with the GPU devices, e.g.:
auser@nid200001:/work/t01/t01/auser> rocm-smi\n\n======================= ROCm System Management Interface =======================\n================================= Concise Info =================================\nGPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%\n0 29.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
Warning
Launching parallel jobs on GPU nodes from an interactive shell on a GPU node is not straightforward so you should either use job submission scripts or the salloc
method of interactive use described above.
A list of device indices or UUIDs that will be exposed to applications
Runtime : ROCm Platform Runtime. Applies to all applications using the user mode ROCm software stack.
export ROCR_VISIBLE_DEVICES=\"0,GPU-DEADBEEFDEADBEEF\"
https://rocm.docs.amd.com/projects/HIP/en/docs-5.2.3/how_to_guides/debugging.html#summary-of-environment-variables-in-hip
"},{"location":"user-guide/gpu/#amd_log_level","title":"AMD_LOG_LEVEL","text":"Enable HIP log on different Level.
export AMD_LOG_LEVEL=1
Select which categories of HIP log messages are enabled, using a bit mask.
export AMD_LOG_MASK=0x1
Default: 0x7FFFFFFF\n\n0x1: Log API calls.\n0x2: Kernel and Copy Commands and Barriers.\n0x4: Synchronization and waiting for commands to finish.\n0x8: Enable log on information and below levels.\n0x20: Queue commands and queue contents.\n0x40: Signal creation, allocation, pool.\n0x80: Locks and thread-safety code.\n0x100: Copy debug.\n0x200: Detailed copy debug.\n0x400: Resource allocation, performance-impacting events.\n0x800: Initialization and shutdown.\n0x1000: Misc debug, not yet classified.\n0x2000: Show raw bytes of AQL packet.\n0x4000: Show code creation debug.\n0x8000: More detailed command info, including barrier commands.\n0x10000: Log message location.\n0xFFFFFFFF: Log always, even if the mask flag is zero.\n
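Because the mask is a set of bit flags, several categories can be combined with a bitwise OR. The Python sketch below builds a mask from three of the values listed above; the constant and function names are hypothetical and exist only for this illustration.

```python
# Sketch: combining AMD_LOG_MASK bits taken from the list above.
# The names below are hypothetical labels, not part of ROCm.
LOG_API_CALLS = 0x1
LOG_KERNEL_AND_COPY = 0x2
LOG_COPY_DEBUG = 0x100


def combine_mask(*bits):
    # OR the selected category bits into a single mask value
    mask = 0
    for bit in bits:
        mask |= bit
    return mask


mask = combine_mask(LOG_API_CALLS, LOG_KERNEL_AND_COPY, LOG_COPY_DEBUG)
print(f'export AMD_LOG_MASK={mask:#x}')
```

The printed line can be pasted into a job script; which categories are useful depends on the problem being investigated.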
"},{"location":"user-guide/gpu/#hip_visible_devices","title":"HIP_VISIBLE_DEVICES:","text":"For system with multiple devices, it\u2019s possible to make only certain device(s) visible to HIP via setting environment variable, HIP_VISIBLE_DEVICES(or CUDA_VISIBLE_DEVICES on Nvidia platform), only devices whose index is present in the sequence are visible to HIP.
Runtime : HIP Runtime. Applies only to applications using HIP on the AMD platform.
export HIP_VISIBLE_DEVICES=0,1
To serialize the kernel enqueuing set the following variable,
export AMD_SERIALIZE_KERNEL=1
To serialize the copies set,
export AMD_SERIALIZE_COPY=1
Sets whether memory is coherent in hipHostMalloc.
export HIP_HOST_COHERENT=1
If the value is 1
, memory is coherent with host; if 0
, memory is not coherent between host and GPU.
https://rocm.docs.amd.com/en/docs-5.2.3/reference/openmp/openmp.html#environment-variables
"},{"location":"user-guide/gpu/#omp_default_device","title":"OMP_DEFAULT_DEVICE","text":"Default device used for OpenMP target offloading.
Runtime : OpenMP Runtime. Applies only to applications using OpenMP offloading.
export OMP_DEFAULT_DEVICE=\"2\"
sets the default device to the 3rd device on the node.
"},{"location":"user-guide/gpu/#omp_num_teams","title":"OMP_NUM_TEAMS","text":"Users can choose the number of teams used for kernel launch by setting,
export OMP_NUM_TEAMS
this can be tuned to optimise performance.
"},{"location":"user-guide/gpu/#gpu_max_hw_queues","title":"GPU_MAX_HW_QUEUES","text":"To set the number of HSA queues used in the OpenMP runtime set,
export GPU_MAX_HW_QUEUES
Activates GPU aware MPI in Cray MPICH:
export MPICH_GPU_SUPPORT_ENABLED=1
If this is not set, MPI calls that attempt to send messages from buffers in GPU-attached memory will crash or hang.
"},{"location":"user-guide/gpu/#hsa_enable_sdma","title":"HSA_ENABLE_SDMA","text":"export HSA_ENABLE_SDMA=0
Forces host-to-device and device-to-host copies to use compute shader blit kernels rather than the dedicated DMA copy engines.
The impact is reduced bandwidth, but this setting is recommended when isolating issues with the hardware copy engines.
"},{"location":"user-guide/gpu/#mpich_ofi_nic_policy","title":"MPICH_OFI_NIC_POLICY","text":"For GPU-enabled parallel applications that involve MPI operations that access application arrays that are resident on GPU-attached memory regions users can set,
export MPICH_OFI_NIC_POLICY=GPU
In this case, for each MPI process, Cray MPI aims to select a NIC device that is closest to the GPU device being used.
"},{"location":"user-guide/gpu/#mpich_ofi_nic_verbose","title":"MPICH_OFI_NIC_VERBOSE","text":"To display information pertaining to NIC selection set,
export MPICH_OFI_NIC_VERBOSE=2
Note
Work in progress
Documentation for rocgdb can be found in the following locations:
https://rocm.docs.amd.com/projects/ROCgdb/en/docs-5.2.3/index.html
https://docs.amd.com/projects/HIP/en/docs-5.2.3/how_to_guides/debugging.html#using-rocgdb
"},{"location":"user-guide/gpu/#profiling","title":"Profiling","text":"An initial profiling capability is provided via rocprof
which is part of the rocm
module.
For example in an interactive session where resources have already been allocated you can call,
srun -n 2 --exclusive --nodes=1 --time=00:20:00 --partition=gpu --qos=gpu-exc --gpus=2 rocprof --stats ./myprog_exe\n
to profile your application. More detail on the use of rocprof can be found here.
"},{"location":"user-guide/gpu/#performance-tuning","title":"Performance tuning","text":"AMD provides some documentation on performance tuning here not all options will be available to users, so be aware that mileage may vary.
"},{"location":"user-guide/gpu/#hardware-details","title":"Hardware details","text":"The specifications of the GPU hardware can be found here.
Additionally you can use the command,
rocminfo
in a job on a GPU node to print information about the GPUs and CPU on the node. This command is provided as part of the rocm
module.
Using rocm-smi --showtopo
we can learn about the connections between the GPUs in a node and how memory regions between the GPU and CPU are connected.
======================= ROCm System Management Interface =======================\n=========================== Weight between two GPUs ============================\n GPU0 GPU1 GPU2 GPU3\nGPU0 0 15 15 15\nGPU1 15 0 15 15\nGPU2 15 15 0 15\nGPU3 15 15 15 0\n\n============================ Hops between two GPUs =============================\n GPU0 GPU1 GPU2 GPU3\nGPU0 0 1 1 1\nGPU1 1 0 1 1\nGPU2 1 1 0 1\nGPU3 1 1 1 0\n\n========================== Link Type between two GPUs ==========================\n GPU0 GPU1 GPU2 GPU3\nGPU0 0 XGMI XGMI XGMI\nGPU1 XGMI 0 XGMI XGMI\nGPU2 XGMI XGMI 0 XGMI\nGPU3 XGMI XGMI XGMI 0\n\n================================== Numa Nodes ==================================\nGPU 0 : (Topology) Numa Node: 0\nGPU 0 : (Topology) Numa Affinity: 0\nGPU 1 : (Topology) Numa Node: 1\nGPU 1 : (Topology) Numa Affinity: 1\nGPU 2 : (Topology) Numa Node: 2\nGPU 2 : (Topology) Numa Affinity: 2\nGPU 3 : (Topology) Numa Node: 3\nGPU 3 : (Topology) Numa Affinity: 3\n============================= End of ROCm SMI Log ==============================\n
To quote the rocm documentation:
- The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the \u201cdistance\u201d data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.\n\n- The second block has a matrix named \u201cHops between two GPUs\u201d, where 1 means the two GPUs are directly connected with XGMI, 2 means both GPUs are linked to the same CPU socket and GPU communications will go through the CPU, and 3 means both GPUs are linked to different CPU sockets so communications will go through both CPU sockets. This number is one for all GPUs in this case since they are all connected to each other through the Infinity Fabric links.\n\n- The third block outputs the link types between the GPUs. This can either be \u201cXGMI\u201d for AMD Infinity Fabric links or \u201cPCIE\u201d for PCIe Gen4 links.\n\n- The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC processors.\n
"},{"location":"user-guide/gpu/#rocm-bandwidth-test","title":"rocm-bandwidth-test","text":"As part of the rocm
module the rocm-bandwidth-test
is provided that can be used to measure the performance of communications between the hardware in a node.
In addition to rocm-smi
this is a bandwidth test that can be useful in understanding the composition and performance limitations of a GPU node. Here is example output from a GPU node on ARCHER2.
Device: 0, AMD EPYC 7543P 32-Core Processor\nDevice: 1, AMD EPYC 7543P 32-Core Processor\nDevice: 2, AMD EPYC 7543P 32-Core Processor\nDevice: 3, AMD EPYC 7543P 32-Core Processor\nDevice: 4, , GPU-ab43b63dec8adaf3, c9:0.0\nDevice: 5, , GPU-0b953cf8e6d4184a, 87:0.0\nDevice: 6, , GPU-b0266df54d0dd2e1, 49:0.0\nDevice: 7, , GPU-790a09bfbf673859, 09:0.0\n\nInter-Device Access\n\nD/D 0 1 2 3 4 5 6 7\n\n0 1 1 1 1 1 1 1 1\n\n1 1 1 1 1 1 1 1 1\n\n2 1 1 1 1 1 1 1 1\n\n3 1 1 1 1 1 1 1 1\n\n4 1 1 1 1 1 1 1 1\n\n5 1 1 1 1 1 1 1 1\n\n6 1 1 1 1 1 1 1 1\n\n7 1 1 1 1 1 1 1 1\n\n\nInter-Device Numa Distance\n\nD/D 0 1 2 3 4 5 6 7\n\n0 0 12 12 12 20 32 32 32\n\n1 12 0 12 12 32 20 32 32\n\n2 12 12 0 12 32 32 20 32\n\n3 12 12 12 0 32 32 32 20\n\n4 20 32 32 32 0 15 15 15\n\n5 32 20 32 32 15 0 15 15\n\n6 32 32 20 32 15 15 0 15\n\n7 32 32 32 20 15 15 15 0\n\n\nUnidirectional copy peak bandwidth GB/s\n\nD/D 0 1 2 3 4 5 6 7\n\n0 N/A N/A N/A N/A 26.977 26.977 26.977 26.977\n\n1 N/A N/A N/A N/A 26.977 26.975 26.975 26.975\n\n2 N/A N/A N/A N/A 26.977 26.977 26.975 26.975\n\n3 N/A N/A N/A N/A 26.975 26.977 26.975 26.977\n\n4 28.169 28.171 28.169 28.169 1033.080 42.239 42.112 42.264\n\n5 28.169 28.169 28.169 28.169 42.243 1033.088 42.294 42.286\n\n6 28.169 28.171 28.167 28.169 42.158 42.281 1043.367 42.277\n\n7 28.171 28.169 28.169 28.169 42.226 42.264 42.264 1051.212\n\n\nBidirectional copy peak bandwidth GB/s\n\nD/D 0 1 2 3 4 5 6 7\n\n0 N/A N/A N/A N/A 40.480 42.528 42.059 42.173\n\n1 N/A N/A N/A N/A 41.604 41.826 41.903 41.417\n\n2 N/A N/A N/A N/A 41.008 41.499 41.258 41.338\n\n3 N/A N/A N/A N/A 40.968 41.273 40.982 41.450\n\n4 40.480 41.604 41.008 40.968 N/A 80.946 80.631 80.888\n\n5 42.528 41.826 41.499 41.273 80.946 N/A 80.944 80.940\n\n6 42.059 41.903 41.258 40.982 80.631 80.944 N/A 80.896\n\n7 42.173 41.417 41.338 41.450 80.888 80.940 80.896 N/A\n
"},{"location":"user-guide/gpu/#tools","title":"Tools","text":""},{"location":"user-guide/gpu/#rocm-smi","title":"rocm-smi","text":"If you load the rocm module on the system you will have access to the rocm-smi
utility. This utility allows users to report information about the GPUs on a node and can be very useful for better understanding the setup of the hardware you are working with and for monitoring GPU metrics during job execution.
Here are some useful commands to get you started:
rocm-smi --alldevices
device status
======================= ROCm System Management Interface =======================\n================================= Concise Info =================================\nGPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%\n0 28.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n1 30.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n2 33.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n3 33.0c 41.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
This shows you the current state of the hardware while an application is running. Focusing on the GPU activity can be useful to understand when your code is active on the GPUs:
rocm-smi --showuse
GPU activity
======================= ROCm System Management Interface =======================\n============================== % time GPU is busy ==============================\nGPU[0] : GPU use (%): 0\nGPU[0] : GFX Activity: 705759841\nGPU[1] : GPU use (%): 0\nGPU[1] : GFX Activity: 664257322\nGPU[2] : GPU use (%): 0\nGPU[2] : GFX Activity: 660987914\nGPU[3] : GPU use (%): 0\nGPU[3] : GFX Activity: 665049119\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
Additionally you can focus on the memory use of the GPUs:
rocm-smi --showmemuse
GPU memory currently consumed
======================= ROCm System Management Interface =======================\n============================== Current Memory Use ==============================\nGPU[0] : GPU memory use (%): 0\nGPU[0] : Memory Activity: 323631375\nGPU[1] : GPU memory use (%): 0\nGPU[1] : Memory Activity: 319196585\nGPU[2] : GPU memory use (%): 0\nGPU[2] : Memory Activity: 318641690\nGPU[3] : GPU memory use (%): 0\nGPU[3] : Memory Activity: 319854295\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
More commands can be found by running,
rocm-smi --help
This command will also run on the login nodes and gives more information about probing the GPUs.
More detail can be found here.
"},{"location":"user-guide/gpu/#hipify","title":"HIPIFY","text":"HIPIFY is a CUDA to HIP source translator tool that can allow CUDA source code to be translated into HIP source code, easing the transition between the two hardware targets.
The tool is available on ARCHER2 by loading the rocm
module.
The github repository for HIPIFY can be found here.
The documentation for HIPIFY is found here.
"},{"location":"user-guide/gpu/#notes-and-useful-links","title":"Notes and useful links","text":"You should expect the software development environment to be similar to that available on the Frontier exascale system:
Note
Some of the material in this section is closely based on information provided by NASA as part of the documentation for the Aitken HPC system.
"},{"location":"user-guide/hardware/#system-overview","title":"System overview","text":"ARCHER2 is a HPE Cray EX supercomputing system which has a total of 5,860 compute nodes. Each compute node has 128 cores (dual AMD EPYC 7742 64-core 2.25GHz processors) giving a total of 750,080 cores. Compute nodes are connected together by a HPE Slingshot interconnect.
There are additional User Access Nodes (UAN, also called login nodes), which provide access to the system, and data-analysis nodes, which are well-suited for preparation of job inputs and analysis of job outputs.
Compute nodes are only accessible via the Slurm job scheduling system.
There are two storage types: home and work. Home is available on login nodes and data-analysis nodes. Work is available on login, data-analysis nodes and compute nodes (see I/O and file systems).
This is shown in the ARCHER2 architecture diagram:
The home file system is provided by dual NetApp FAS8200A systems (one primary and one disaster recovery) with a capacity of 1 PB each.
The work file system consists of four separate HPE Cray L300 storage systems, each with a capacity of 3.6 PB. The interconnect uses a dragonfly topology, and has a bandwidth of 100 Gbps.
The system also includes 1.1 PB burst buffer NVMe storage, provided by an HPE Cray E1000.
"},{"location":"user-guide/hardware/#compute-node-overview","title":"Compute node overview","text":"The compute nodes each have 128 cores. They are dual socket nodes with two 64-core AMD EPYC 7742 processors. There are 5,276 standard memory nodes and 584 high memory nodes.
Note
Due to Simultaneous Multi-Threading (SMT), each core has 2 threads, so a node has 128 cores / 256 threads. Most users will not want to use SMT; see Launching parallel jobs.
Component | Details
Processor | 2x AMD Zen2 (Rome) EPYC 7742, 64-core, 2.25 GHz
Cores per node | 128
NUMA structure | 8 NUMA regions per node (16 cores per NUMA region)
Memory per node | 256 GB (standard), 512 GB (high memory)
Memory per core | 2 GB (standard), 4 GB (high memory)
L1 cache | 32 kB/core
L2 cache | 512 kB/core
L3 cache | 16 MB/4-cores
Vector support | AVX2
Network connection | 2x 100 Gb/s injection ports per node
Each socket contains eight Core Complex Dies (CCDs) and one I/O die (IOD). Each CCD contains two Core Complexes (CCXs). Each CCX has 4 cores and 16 MB of L3 cache. Thus, there are 64 cores per socket and 128 cores per node.
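The core and cache counts quoted above follow directly from the die layout. The Python sketch below reproduces the arithmetic using only figures from the text; the function name node_summary is hypothetical.

```python
# Sketch: deriving per-socket and per-node totals from the CCD/CCX layout.
CCDS_PER_SOCKET = 8
CCXS_PER_CCD = 2
CORES_PER_CCX = 4
L3_MB_PER_CCX = 16
SOCKETS_PER_NODE = 2


def node_summary():
    ccxs_per_socket = CCDS_PER_SOCKET * CCXS_PER_CCD
    cores_per_socket = ccxs_per_socket * CORES_PER_CCX
    return {
        'cores_per_socket': cores_per_socket,
        'cores_per_node': cores_per_socket * SOCKETS_PER_NODE,
        'l3_mb_per_socket': ccxs_per_socket * L3_MB_PER_CCX,
    }


for key, value in node_summary().items():
    print(f'{key}: {value}')
```

The L3 total is the sum over all CCXs in a socket; each slice is shared only by the four cores of its own CCX.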
More information on the architecture of the AMD EPYC Zen2 processors:
The AMD EPYC 7742 Rome processor has a base CPU clock of 2.25 GHz and a maximum boost clock of 3.4 GHz. There are eight processor dies (CCDs) with a total of 64 cores per socket.
Tip
The processors can only access their boost frequencies if the CPU frequency is set to 2.25 GHz. See the documentation on setting CPU frequency for information on how to select the correct CPU frequency.
Note
When all 128 compute cores on a node are loaded with computationally intensive work, we typically see the processor clock frequency boost to around 2.8 GHz.
Hybrid multi-die design:
Within each socket, the eight processor dies are fabricated on a 7 nanometer (nm) process, while the I/O die is fabricated on a 14 nm process. This design decision was made because the processor dies need the leading edge (and more expensive) 7 nm technology in order to reduce the amount of power and space needed to double the number of cores, and to add more cache, compared to the first-generation EPYC processors. The I/O die retains the less expensive, older 14 nm technology.
2nd-generation Infinity Fabric technology:
Infinity Fabric technology is used for communication among different components throughout the node: within cores, between cores, between core complexes (CCX) in a core complex die (CCD), among CCDs in a socket, to the main memory and PCIe, and between the two sockets. The Rome processors are the first x86 systems to support 4th-generation PCIe, which delivers twice the I/O performance (to the Slingshot interconnect, storage, NVMe SSD, etc.) compared to 3rd-generation PCIe.
"},{"location":"user-guide/hardware/#processor-hierarchy","title":"Processor hierarchy","text":"The Zen2 processor hierarchy is as follows:
CPU core
AMD 7742 is a 64-bit x86 server microprocessor. A partial list of instructions and features supported in Rome includes SSE, SSE2, SSE3, SSSE3, SSE4a, SSE4.1, SSE4.2, AES, FMA, AVX, AVX2 (256 bit), Integrated x87 FPU (FPU), Multi-Precision Add-Carry (ADX), 16-bit Floating Point Conversion (F16C), and No-eXecute (NX). For a complete list, run cat /proc/cpuinfo
on the ARCHER2 login nodes.
Each core:
The cache hierarchy is as follows:
op cache (OC): 4K ops, private to each core; 64 sets; 64 bytes/line; 8-way. OC holds instructions that have already been decoded into micro-operations (micro-ops). This is useful when the CPU repeatedly executes a loop of code. Using OC improves:
L1 instruction cache: 32 KB, private to each core; 64 bytes/line; 8-way. The processor fetches instructions from the instruction cache in 32-byte naturally aligned blocks.
Note
With the write-back policy, data is updated in the current level cache first. The update in the next level storage is done later when the cache line is ready to be replaced.
Note
If a core misses in its local L2 and also in the L3, the shadow tags are consulted. If the shadow tag indicates that the data resides in another L2 within the CCX, a cache-to-cache transfer is initiated. 1 x 256 bits/cycle load bandwidth to L2 of each core; 1 x 256 bits/cycle store bandwidth from L2 of each core; write-back policy; populated by L2 victims.
"},{"location":"user-guide/hardware/#intra-socket-interconnect","title":"Intra-socket interconnect","text":"The Infinity Fabric, evolved from AMD's previous generation HyperTransport interconnect, is a software-defined, scalable, coherent, and high-performance fabric. It uses sensors embedded in each die to scale control (Scalable Control Fabric, or SCF) and data flow (Scalable Data Fabric, or SDF).
Two EPYC 7742 SoCs are interconnected via Socket to Socket Global Memory Interconnect (xGMI) links, part of the Infinity Fabric that connects all the components of the SoC together. On ARCHER2 compute nodes there are 3 xGMI links using a total of 48 PCIe lanes. With the xGMI link speed set at 16 GT/s, the theoretical throughput for each direction is 96 GB/s (3 links x 16 GT/s x 2 bytes/transfer) without factoring in the encoding for xGMI, since there is no publication from AMD available. However, the expected efficiencies are 66\u201375%, so the sustained bandwidth per direction will be 63.5\u201372 GB/s. xGMI Dynamic Link Width Management saves power during periods of low socket-to-socket data traffic by reducing the number of active xGMI lanes per link from 16 to 8.
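The xGMI throughput figures above can be checked with a few lines of Python. The link count, transfer rate, bytes per transfer and efficiency range are taken from the text; the function name is hypothetical.

```python
# Sketch: theoretical and expected sustained xGMI bandwidth per direction.
LINKS = 3
GIGATRANSFERS_PER_S = 16
BYTES_PER_TRANSFER = 2


def xgmi_bandwidth(efficiency_low=0.66, efficiency_high=0.75):
    # Peak in GB/s, then the sustained range at the quoted efficiencies
    peak = LINKS * GIGATRANSFERS_PER_S * BYTES_PER_TRANSFER
    return peak, peak * efficiency_low, peak * efficiency_high


peak, low, high = xgmi_bandwidth()
print(f'peak {peak} GB/s, sustained {low:.1f} to {high:.1f} GB/s per direction')
```

At 66% efficiency this gives about 63.4 GB/s, slightly below the 63.5 GB/s lower bound quoted above; the difference is down to rounding of the efficiency figure.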
"},{"location":"user-guide/hardware/#memory-subsystem","title":"Memory subsystem","text":"The Zen 2 microarchitecture places eight unified memory controllers in the centralized I/O die. The memory channels can be split into one, two, or four Non-Uniform Memory Access (NUMA) Nodes per Socket (NPS1, NPS2, and NPS4). ARCHER2 compute nodes are configured as NPS4, which is the highest memory bandwidth configuration geared toward HPC applications.
With eight 3,200 MHz memory channels, an 8-byte read or write operation taking place per cycle per channel results in a maximum total memory bandwidth of 204.8 GB/s per socket.
Each memory channel can be connected with up to two Double Data Rate (DDR) fourth-generation Dual In-line Memory Modules (DIMMs). On ARCHER2 standard memory nodes, each channel is connected to a single 16 GB DDR4 registered DIMM (RDIMM) with error correcting code (ECC) support leading to 128 GB per socket and 256 GB per node. For the high memory nodes, each channel is connected to a single 32 GB DDR4 registered DIMM (RDIMM) with error correcting code (ECC) support leading to 256 GB per socket and 512 GB per node.
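The memory bandwidth and capacity figures in this section can be reproduced with the following Python sketch. All inputs come from the text; the function names are hypothetical, and the bandwidth is a theoretical peak rather than a measured value.

```python
# Sketch: peak memory bandwidth and DIMM capacity per socket and per node.
CHANNELS_PER_SOCKET = 8
MEGATRANSFERS_PER_S = 3200   # per channel
BYTES_PER_TRANSFER = 8
SOCKETS_PER_NODE = 2


def peak_bandwidth_gb_s():
    # 8 channels x 3200 million transfers/s x 8 bytes, expressed in GB/s
    mb_per_s = CHANNELS_PER_SOCKET * MEGATRANSFERS_PER_S * BYTES_PER_TRANSFER
    return mb_per_s / 1000


def capacity_gb(dimm_gb):
    # One DIMM per channel, as on ARCHER2 compute nodes
    per_socket = CHANNELS_PER_SOCKET * dimm_gb
    return per_socket, per_socket * SOCKETS_PER_NODE


print(f'peak bandwidth: {peak_bandwidth_gb_s()} GB/s per socket')
print(f'standard nodes: {capacity_gb(16)} GB (socket, node)')
print(f'high memory nodes: {capacity_gb(32)} GB (socket, node)')
```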
"},{"location":"user-guide/hardware/#interconnect-details","title":"Interconnect details","text":"ARCHER2 has a HPE Slingshot interconnect with 200 Gb/s signalling per node. It uses a dragonfly topology:
Nodes are organized into groups.
All-to-all connection between groups using optical links.
Information on the ARCHER2 parallel Lustre file systems and how to get best performance is available in the IO section.
"},{"location":"user-guide/io/","title":"I/O performance and tuning","text":"This section describes common IO patterns, best practice for I/O and how to get good performance on the ARCHER2 storage.
Information on the file systems, directory layouts, quotas, archiving and transferring data can be found in the Data management and transfer section.
The advice here is targeted at use of the parallel file systems available on the compute nodes on ARCHER2 (i.e. not the home and RDFaaS file systems).
"},{"location":"user-guide/io/#common-io-patterns","title":"Common I/O patterns","text":"There are number of I/O patterns that are frequently used in parallel applications:
"},{"location":"user-guide/io/#single-file-single-writer-serial-io","title":"Single file, single writer (Serial I/O)","text":"A common approach is to funnel all the I/O through one controller process (e.g. rank 0 in an MPI program). Although this has the advantage of producing a single file, the fact that only one client is doing all the I/O means that it gains little benefit from the parallel file system. In practice this severely limits the I/O rates, e.g. when writing large files the speed is not likely to significantly exceed 1 GB/s.
"},{"location":"user-guide/io/#file-per-process-fpp","title":"File-per-process (FPP)","text":"One of the first parallel strategies people use for I/O is for each parallel process to write to its own file. This is a simple scheme to implement and understand and can achieve high bandwidth as, with many I/O clients active at once, it benefits from the parallel Lustre filesystem. However, it has the distinct disadvantage that the data is spread across many different files and may therefore be very difficult to use for further analysis without a data reconstruction stage to recombine potentially thousands of small files.
In addition, having thousands of files open at once can overload the filesystem and lead to poor performance.
Tip
The ARCHER2 solid state file system can give very high performance when using this model of I/O.
The ADIOS 2 I/O library uses an approach similar to file-per-process and so can achieve very good performance on modern parallel file systems.
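The file-per-process pattern and the reconstruction stage it usually requires can be sketched in a few lines of Python. This is only an illustration run serially in one process: the ranks are emulated by a loop, and the file naming scheme and the helper names `write_rank_file` and `recombine` are assumptions, not part of any ARCHER2 tool.

```python
import os
import tempfile


def write_rank_file(directory, rank, payload):
    '''Each (emulated) rank writes its own file, named by rank.'''
    path = os.path.join(directory, f'data.{rank:06d}.bin')
    with open(path, 'wb') as handle:
        handle.write(payload)
    return path


def recombine(directory, num_ranks, combined_name='combined.bin'):
    '''Reconstruction stage: concatenate the per-rank files in rank order.'''
    combined_path = os.path.join(directory, combined_name)
    with open(combined_path, 'wb') as out:
        for rank in range(num_ranks):
            part_path = os.path.join(directory, f'data.{rank:06d}.bin')
            with open(part_path, 'rb') as part:
                out.write(part.read())
    return combined_path


def demo(num_ranks=4):
    '''Write three bytes per rank, recombine, and return the combined bytes.'''
    with tempfile.TemporaryDirectory() as tmp:
        for rank in range(num_ranks):
            write_rank_file(tmp, rank, bytes([rank]) * 3)
        with open(recombine(tmp, num_ranks), 'rb') as handle:
            return handle.read()


if __name__ == '__main__':
    print(list(demo()))
```

In a real parallel job each rank would call something like `write_rank_file` concurrently, and the recombination step would be a separate post-processing task whose cost grows with the number of files.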
"},{"location":"user-guide/io/#file-per-node-fpn","title":"File-per-node (FPN)","text":"A simple way to reduce the sheer number of files is to write a file per node rather than a file per process; as ARCHER2 has 128 CPU-cores per node, this can reduce the number of files by more than a factor of 100 and should not significantly affect the I/O rates. However, it still produces multiple files which can be hard to work with in practice.
"},{"location":"user-guide/io/#single-file-multiple-writers-without-collective-operations","title":"Single file, multiple writers without collective operations","text":"All aspects of data management are simpler if your parallel program produces a single file in the same format as a serial code, e.g. analysis or program restart are much more straightforward.
There are a number of ways to achieve this. For example, many processes can open the same file but access different parts by skipping some initial offset, although this is problematic when writing as locking may be needed to ensure consistency. Parallel I/O libraries such as MPI-IO, HDF5 and NetCDF allow for this form of access and will implement locking automatically.
The problem is that, with many clients all individually accessing the same file, there can be a lot of contention for file system resources, leading to poor I/O rates. When writing, file locking can effectively serialise the access and there is no benefit from the parallel filesystem.
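As a minimal sketch of the offset-based access described above, the following Python code has several emulated writers place fixed-size records into one shared file at offsets derived from their writer index. It uses only the standard library and runs serially, so it shows the layout idea but none of the locking or contention effects; the record size and the helper names are assumptions for illustration.

```python
import os
import tempfile

RECORD_SIZE = 8  # bytes written by each hypothetical writer


def write_block(path, writer_id, payload):
    '''Write one fixed-size record at the offset owned by writer_id.'''
    assert len(payload) == RECORD_SIZE
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # pwrite takes an explicit offset, so writers do not share a file pointer
        os.pwrite(fd, payload, writer_id * RECORD_SIZE)
    finally:
        os.close(fd)


def demo(num_writers=4):
    '''Emulate several writers in one process and return the file contents.'''
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'shared.dat')
        # Write in reverse order to show the layout does not depend on ordering
        for writer_id in reversed(range(num_writers)):
            write_block(path, writer_id, f'rank{writer_id:04d}'.encode())
        with open(path, 'rb') as handle:
            return handle.read()


if __name__ == '__main__':
    print(demo())
```

Libraries such as MPI-IO, HDF5 and NetCDF hide this offset bookkeeping behind their own interfaces; the sketch is not a substitute for them.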
"},{"location":"user-guide/io/#single-shared-file-with-collective-writes-ssf","title":"Single Shared File with collective writes (SSF)","text":"The problem with having many clients performing I/O at the same time is that the I/O library may have to restrict access to one client at a time by locking. However if I/O is done collectively, where the library knows that all clients are doing I/O at the same time, then reads and writes can be explicitly coordinated to avoid clashes and no locking is required.
It is only through collective I/O that the full bandwidth of the file system can be realised while accessing a single file. Whatever I/O library you are using, it is essential to use collective forms of the read and write calls to achieve good performance.
"},{"location":"user-guide/io/#achieving-efficient-io","title":"Achieving efficient I/O","text":"This section provides information on getting the best performance out of the parallel /work
file systems on ARCHER2 when writing data, particularly using parallel I/O patterns.
The ARCHER2 /work
file systems use Lustre as a parallel file system technology. It has many disk units (called Object Storage Targets or OSTs), all under the control of a single Meta Data Server (MDS) so that it appears to the user as a single file system. The Lustre file system provides POSIX semantics (changes on one node are immediately visible on other nodes) and can support very high data rates for appropriate I/O patterns.
In order to achieve good performance on the ARCHER2 Lustre file systems, you need to make sure your I/O is configured correctly for the type of I/O you want to do. In the following sections we describe how to do this.
"},{"location":"user-guide/io/#summary-achieving-best-io-performance","title":"Summary: achieving best I/O performance","text":"The configuration you should use depends on the type of I/O you are performing. Here, we summarise the settings for two of the I/O patterns described above: File-Per-Process (FPP, including using ADIOS2) and Single Shared File with collective writes (SSF).
The following sections describe the settings in more detail.
"},{"location":"user-guide/io/#file-per-process-fpp_1","title":"File-Per-Process (FPP)","text":"-c 1
), this is the default on ARCHER2
-c -1
)
export FI_OFI_RXM_SAR_LIMIT=64K
export MPICH_MPIIO_HINTS=\"*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4\"
We regularly run tests of FPP write performance on ARCHER2 /work Lustre file systems using the benchio software in the following configuration:
Typical write performance:
We regularly run tests of SSF write performance on ARCHER2 /work Lustre file systems using the benchio software in the following configuration:
FI_OFI_RXM_SAR_LIMIT=64K
, MPICH_MPIIO_HINTS=\"*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4\"
Typical write performance:
One of the main factors leading to the high performance of Lustre file systems is the ability to store data on multiple OSTs. For many small files, this is achieved by storing different files on different OSTs; large files must be striped across multiple OSTs to benefit from the parallel nature of Lustre.
When a file is striped it is split into chunks and stored across multiple OSTs in a round-robin fashion. Striping can improve the I/O performance because it increases the available bandwidth: multiple processes can read and write the same file simultaneously by accessing different OSTs. However, striping can also increase the overhead. Choosing the right striping configuration is key to obtaining high performance.
Users have control of a number of striping settings on Lustre file systems. Although these parameters can be set on a per-file basis they are usually set on the directory where your output files will be written so that all output files inherit the same settings.
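The round-robin placement described above can be modelled with simple integer arithmetic. The sketch below is a simplified model, not a description of Lustre internals: it assumes a fixed stripe size and stripe count, numbers the OSTs 0 to stripe_count - 1 within the file's own layout, and uses hypothetical helper names.

```python
def ost_for_offset(offset, stripe_size, stripe_count, first_ost=0):
    '''Return the index (within the file layout) of the OST holding a byte offset.'''
    stripe_number = offset // stripe_size
    return (first_ost + stripe_number) % stripe_count


def bytes_per_ost(file_size, stripe_size, stripe_count):
    '''Total bytes that land on each OST for a file of file_size bytes.'''
    totals = [0] * stripe_count
    offset = 0
    while offset < file_size:
        chunk = min(stripe_size, file_size - offset)
        totals[ost_for_offset(offset, stripe_size, stripe_count)] += chunk
        offset += chunk
    return totals


if __name__ == '__main__':
    MIB = 1024 * 1024
    # Successive 1 MiB chunks of a file striped over 4 OSTs
    print([ost_for_offset(i * MIB, MIB, 4) for i in range(6)])
    # A 10 MiB file with 1 MiB stripes over 4 OSTs
    print(bytes_per_ost(10 * MIB, MIB, 4))
```

With a stripe count of 1 every chunk maps to the same OST, which is why a large unstriped file cannot benefit from the aggregate bandwidth of several OSTs.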
"},{"location":"user-guide/io/#default-configuration","title":"Default configuration","text":"The /work
file systems on ARCHER2 have the same default stripe settings:
These settings have been chosen to provide a good compromise for the wide variety of I/O patterns that are seen on the system but are unlikely to be optimal for any one particular scenario. The Lustre command to query the stripe settings for a directory (or file) is lfs getstripe
. For example, to query the stripe settings of an already created directory resdir
:
auser@ln03:~> lfs getstripe resdir/\nresdir\nstripe_count: 1 stripe_size: 1048576 stripe_offset: -1\n
"},{"location":"user-guide/io/#setting-custom-striping-configurations","title":"Setting custom striping configurations","text":"Users can set stripe settings for a directory (or file) using the lfs setstripe
command. The options for lfs setstripe
are:
[--stripe-count|-c]
to set the stripe count; 0 means use the system default (usually 1) and -1 means stripe over all available OSTs.
[--stripe-size|-S]
to set the stripe size; 0 means use the system default (usually 1 MB), otherwise use k, m or g for KB, MB or GB respectively.
[--stripe-index|-i]
to set the OST index (starting at 0) on which to start striping for this file. An index of -1 allows the MDS to choose the starting index and is strongly recommended, as this allows space and load balancing to be done by the MDS as needed.
For example, to set a stripe size of 4 MiB for the existing directory resdir
, along with the maximum stripe count, you would use:
auser@ln03:~> lfs setstripe -S 4m -c -1 resdir/\n
"},{"location":"user-guide/io/#environment-variables","title":"Environment variables","text":"The following environment variables typically only have an impact when you are using Single Shared Files with collective communications. As mentioned above, it is very important to use collective calls when doing parallel I/O to a single shared file.
However, with the default settings, parallel I/O on multiple nodes can currently give poor performance. We recommend always setting these environment variables in your Slurm batch script when you are using the SSF I/O pattern:
export FI_OFI_RXM_SAR_LIMIT=64K\nexport MPICH_MPIIO_HINTS=\"*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4\"\n
"},{"location":"user-guide/io/#mpi-transport-protocol","title":"MPI transport protocol","text":"Setting the environment variables described above can improve the performance of MPI collectives when handling large amounts of data, which in turn can improve collective file I/O. An alternative is to use the non-default UCX implementation of the MPI library instead of the default OFI version.
To switch library version see the Application Development Environment section of the User Guide.
Note
This will affect all your MPI calls, not just those related to I/O, so you should check the overall performance of your program before and after the switch. It is possible that other functions may run slower even if the I/O performance improves.
"},{"location":"user-guide/io/#io-profiling","title":"I/O profiling","text":"If you are concerned about your I/O performance, you should quantify your transfer rates in terms of GB/s of data read or written to disk. Small files can achieve very high I/O rates due to data caching in Lustre. However, for large files you should be able to achieve a maximum of around 1 GB/s for an unstriped file, or up to 10 GB/s for a fully striped file (across all 12 OSTs).
Warning
You share /work
with all other users so I/O rates can be very variable, especially if the machine is heavily loaded.
If your I/O rates are poor then you can get useful summary information about how the parallel libraries are performing by setting this variable in your Slurm script:
export MPICH_MPIIO_STATS=1\n
Amongst other things, this will give you information on how many independent and collective I/O operations were issued. If you see a large number of independent operations compared to collectives, this indicates that you have inefficient I/O patterns and you should check that you are calling your parallel I/O library correctly.
Although this information comes from the MPI library, it is still useful for users of higher-level libraries such as HDF5 as they all call MPI-IO at the lowest level.
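To quantify a transfer rate in GB/s as suggested above, you need the number of bytes moved and the elapsed wall-clock time. The following Python sketch times a single serial write and converts the result; the helper names are hypothetical, the file is written to a temporary directory purely for illustration, and a file this small will be dominated by caching effects, so treat the printed number as a demonstration of the calculation rather than a measurement of any file system.

```python
import os
import tempfile
import time


def bandwidth_gb_per_s(num_bytes, seconds):
    '''Convert a byte count and elapsed time to GB/s (1 GB = 1e9 bytes).'''
    return num_bytes / seconds / 1e9


def timed_write(path, num_bytes, block_size=1024 * 1024):
    '''Write num_bytes of zeros to path and return the achieved GB/s.'''
    block = bytes(block_size)
    written = 0
    start = time.perf_counter()
    with open(path, 'wb') as handle:
        while written < num_bytes:
            chunk = block[: min(block_size, num_bytes - written)]
            handle.write(chunk)
            written += len(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # include the time taken to flush to storage
    elapsed = time.perf_counter() - start
    return bandwidth_gb_per_s(written, elapsed)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        rate = timed_write(os.path.join(tmp, 'probe.dat'), 64 * 1024 * 1024)
        print(f'{rate:.2f} GB/s')
```

For a parallel application the same arithmetic applies, but the byte count should be the total written by all processes and the time should be the slowest process's elapsed time.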
"},{"location":"user-guide/io/#tips-and-advice-for-io","title":"Tips and advice for I/O","text":""},{"location":"user-guide/io/#set-an-optimum-blocksize-when-untaring-data","title":"Set an optimum blocksize when untar'ing data","text":"When you are expanding a large tar archive file to the Lustre file systems you should specify the -b 2048
option to ensure that tar writes out data in blocks of 1 MiB. This will improve the performance of your tar command and reduce the impact of writing the data to Lustre on other users.
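The 1 MiB figure follows from the tar blocking factor, which counts 512-byte records. A command of the form `tar -b 2048 -xf archive.tar` (where archive.tar is a hypothetical archive name) therefore uses blocks of 2048 x 512 bytes. The short Python check below just confirms the arithmetic.

```python
TAR_RECORD_BYTES = 512  # size of one tar record


def tar_block_bytes(blocking_factor):
    '''Bytes per block for a given tar -b blocking factor.'''
    return blocking_factor * TAR_RECORD_BYTES


if __name__ == '__main__':
    print(tar_block_bytes(2048))  # 1048576 bytes, i.e. 1 MiB
    print(tar_block_bytes(20))    # 10240 bytes, the GNU tar default factor of 20
```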
Two Machine Learning (ML) frameworks are supported on ARCHER2, PyTorch and TensorFlow.
For each framework, we'll show how to run a particular MLCommons HPC benchmark. We start with PyTorch.
"},{"location":"user-guide/machine-learning/#pytorch","title":"PyTorch","text":"On ARCHER2, PyTorch is supported for use on both the CPU and GPU nodes.
We'll demonstrate the use of PyTorch with DeepCam, a deep learning climate segmentation benchmark. It involves training a neural network to recognise large-scale weather phenomena (e.g., tropical cyclones, atmospheric rivers) in the output generated by ensembles of weather simulations; see the link below for more details.
Exascale Deep Learning for Climate Analytics
There are two DeepCam training datasets available on ARCHER2. A 62 GB mini dataset (/work/z19/shared/mlperf-hpc/deepcam/mini
), and a much larger 8.9 TB dataset (/work/z19/shared/mlperf-hpc/deepcam/full
).
A binary install of PyTorch 1.13.1 suitable for ROCm 5.2.3 has been installed according to the instructions linked below.
https://github.com/hpc-uk/build-instructions/blob/main/pyenvs/pytorch/build_pytorch_1.13.1_archer2_gpu.md
This install can be accessed by loading the pytorch/1.13.1-gpu
module.
As DeepCam is an MLPerf benchmark, you may wish to base a local python environment on pytorch/1.13.1-gpu
so that you have the opportunity to install additional python packages that support MLPerf logging, as well as extra features pertinent to DeepCam (e.g., dynamic learning rates).
The following instructions show how to create such an environment.
#!/bin/bash\n\nmodule -q load pytorch/1.13.1-gpu\n\nPYTHON_TAG=python`echo ${CRAY_PYTHON_LEVEL} | cut -d. -f1-2`\n\nPRFX=${HOME/home/work}/pyenvs\nPYVENV_ROOT=${PRFX}/mlperf-pt-gpu\nPYVENV_SITEPKGS=${PYVENV_ROOT}/lib/${PYTHON_TAG}/site-packages\n\nmkdir -p ${PYVENV_ROOT}\ncd ${PYVENV_ROOT}\n\n\npython -m venv --system-site-packages ${PYVENV_ROOT}\n\nextend-venv-activate ${PYVENV_ROOT}\n\nsource ${PYVENV_ROOT}/bin/activate\n\n\nmkdir -p ${PYVENV_ROOT}/repos\ncd ${PYVENV_ROOT}/repos\n\ngit clone -b hpc-1.0-branch https://github.com/mlcommons/logging mlperf-logging\npython -m pip install -e mlperf-logging\n\nrm ${PYVENV_SITEPKGS}/mlperf-logging.egg-link\nmv ./mlperf-logging/mlperf_logging ${PYVENV_SITEPKGS}/\nmv ./mlperf-logging/mlperf_logging.egg-info ${PYVENV_SITEPKGS}/\n\npython -m pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git\n\ndeactivate\n
In order to run a DeepCam training job, you must first clone the MLCommons HPC GitHub repo.
mkdir -p ${HOME/home/work}/tests\ncd ${HOME/home/work}/tests\n\ngit clone https://github.com/mlcommons/hpc.git mlperf-hpc\n\ncd ./mlperf-hpc/deepcam/src/deepCam\n
You are now ready to run the following DeepCam submission script via the sbatch
command.
#!/bin/bash\n\n#SBATCH --job-name=deepcam\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-exc\n#SBATCH --nodes=2\n#SBATCH --gpus=8\n#SBATCH --time=01:00:00\n#SBATCH --exclusive\n\n\nJOB_OUTPUT_PATH=./results/${SLURM_JOB_ID}\nmkdir -p ${JOB_OUTPUT_PATH}/logs\n\nsource ${HOME/home/work}/pyenvs/mlperf-pt-gpu/bin/activate\n\nexport OMP_NUM_THREADS=1\nexport HOME=${HOME/home/work}\n\nsrun --ntasks=8 --tasks-per-node=4 \\\n --cpu-bind=verbose,map_cpu:0,8,16,24 --hint=nomultithread \\\n python train.py \\\n --run_tag test \\\n --data_dir_prefix /work/z19/shared/mlperf-hpc/deepcam/mini \\\n --output_dir ${JOB_OUTPUT_PATH} \\\n --wireup_method nccl-slurm \\\n --max_epochs 64 \\\n --local_batch_size 1\n\nmv slurm-${SLURM_JOB_ID}.out ${JOB_OUTPUT_PATH}/slurm.out\n
The job submission script activates the Python environment that was set up earlier, but that particular command (source ${HOME/home/work}/pyenvs/mlperf-pt-gpu/bin/activate
) could be replaced by module -q load pytorch/1.13.1-gpu
if you are not running DeepCam and have no need for additional Python packages such as mlperf-logging
and warmup-scheduler
.
In the script above, we specify four tasks per node, one for each GPU. These tasks are evenly spaced across the node so as to maximise the communications bandwidth between the host and the GPU devices. Note, PyTorch is not using Cray MPICH for inter-task communications, which is instead being handled by the ROCm Collective Communications Library (RCCL), hence the --wireup_method nccl-slurm
option (nccl-slurm
works as an alias for rccl-slurm in this context).
The above job should achieve convergence \u2014 an Intersection over Union (IoU) of 0.82 \u2014 after 35 epochs or so. Runtime should be around 20-30 minutes.
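For readers unfamiliar with the metric, Intersection over Union is the number of elements that are positive in both the prediction and the target, divided by the number that are positive in either. The sketch below computes it for a flat binary mask in plain Python. It shows only the basic definition; how the DeepCam benchmark code aggregates the metric over segmentation classes and samples is not covered here, and the function name is an assumption.

```python
def intersection_over_union(predicted, target):
    '''IoU of two equal-length binary masks given as sequences of 0/1.'''
    if len(predicted) != len(target):
        raise ValueError('masks must have the same length')
    intersection = sum(1 for p, t in zip(predicted, target) if p and t)
    union = sum(1 for p, t in zip(predicted, target) if p or t)
    # Two empty masks agree perfectly, so define their IoU as 1.0
    return intersection / union if union else 1.0


if __name__ == '__main__':
    predicted = [1, 1, 0, 0, 1, 0]
    target = [1, 0, 0, 1, 1, 0]
    # Two positions overlap out of four that are set in either mask
    print(intersection_over_union(predicted, target))
```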
We can also modify the DeepCam train.py
script so that the accuracy and loss are logged using TensorBoard.
The following lines must be added to the DeepCam train.py
script.
import os\n...\n\nfrom torch.utils.tensorboard import SummaryWriter\n\n...\n\ndef main(pargs):\n\n #init distributed training\n comm_local_group = comm.init(pargs.wireup_method, pargs.batchnorm_group_size)\n comm_rank = comm.get_rank()\n ...\n\n #set up logging\n pargs.logging_frequency = max([pargs.logging_frequency, 0])\n log_file = os.path.normpath(os.path.join(pargs.output_dir, \"logs\", pargs.run_tag + \".log\"))\n ...\n\n writer = SummaryWriter()\n\n #set seed\n ...\n\n ...\n\n #training loop\n while True:\n ...\n\n #training\n step = train_epoch(pargs, comm_rank, comm_size,\n ...\n logger, writer)\n\n ...\n
The train_epoch
function is defined in ./driver/trainer.py
and so that file must be amended like so.
...\n\ndef train_epoch(pargs, comm_rank, comm_size,\n ...,\n logger, writer):\n\n ...\n\n writer.add_scalar(\"Accuracy/train\", iou_avg_train, epoch+1)\n writer.add_scalar(\"Loss/train\", loss_avg_train, epoch+1)\n\n return step\n
"},{"location":"user-guide/machine-learning/#deepcam-on-cpu","title":"DeepCam on CPU","text":"PyTorch can also be run on the ARCHER2 CPU nodes. However, since DeepCam uses the torch.distributed
module, we cannot use Horovod to handle (via MPI) inter-task communications. We must instead build PyTorch from source so that we can link torch.distributed
to the correct Cray MPICH libraries.
The instructions for doing such a build can be found here, https://github.com/hpc-uk/build-instructions/blob/main/pyenvs/pytorch/build_pytorch_1.13.0a0_from_source_archer2_cpu.md.
This install can be accessed by loading the pytorch/1.13.0a0
module. Please note, PyTorch source version 1.13.0a0
corresponds to PyTorch package version 1.13.1
.
Once again, as we are running the DeepCam benchmark, we'll need to set up a local Python environment for installing the MLPerf logging package. This time the local environment is based on the pytorch/1.13.0a0
module.
#!/bin/bash\n\nmodule -q load pytorch/1.13.0a0\n\nPYTHON_TAG=python`echo ${CRAY_PYTHON_LEVEL} | cut -d. -f1-2`\n\nPRFX=${HOME/home/work}/pyenvs\nPYVENV_ROOT=${PRFX}/mlperf-pt\nPYVENV_SITEPKGS=${PYVENV_ROOT}/lib/${PYTHON_TAG}/site-packages\n\nmkdir -p ${PYVENV_ROOT}\ncd ${PYVENV_ROOT}\n\n\npython -m venv --system-site-packages ${PYVENV_ROOT}\n\nextend-venv-activate ${PYVENV_ROOT}\n\nsource ${PYVENV_ROOT}/bin/activate\n\n\nmkdir -p ${PYVENV_ROOT}/repos\ncd ${PYVENV_ROOT}/repos\n\ngit clone -b hpc-1.0-branch https://github.com/mlcommons/logging mlperf-logging\npython -m pip install -e mlperf-logging\n\nrm ${PYVENV_SITEPKGS}/mlperf-logging.egg-link\nmv ./mlperf-logging/mlperf_logging ${PYVENV_SITEPKGS}/\nmv ./mlperf-logging/mlperf_logging.egg-info ${PYVENV_SITEPKGS}/\n\npython -m pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git\n\ndeactivate\n
DeepCam can now be run on the CPU nodes using a submission script like the one below.
#!/bin/bash\n\n#SBATCH --job-name=deepcam\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n#SBATCH --nodes=32\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=128\n#SBATCH --time=10:00:00\n#SBATCH --exclusive\n\n\nJOB_OUTPUT_PATH=./results/${SLURM_JOB_ID}\nmkdir -p ${JOB_OUTPUT_PATH}/logs\n\nsource ${HOME/home/work}/pyenvs/mlperf-pt/bin/activate\n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\nexport OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}\n\nsrun --hint=nomultithread \\\n python train.py \\\n --run_tag test \\\n --data_dir_prefix /work/z19/shared/mlperf-hpc/deepcam/mini \\\n --output_dir ${JOB_OUTPUT_PATH} \\\n --wireup_method mpi \\\n --max_inter_threads ${SLURM_CPUS_PER_TASK} \\\n --max_epochs 64 \\\n --local_batch_size 1\n\nmv slurm-${SLURM_JOB_ID}.out ${JOB_OUTPUT_PATH}/slurm.out\n
The script above activates the local Python environment so that the mlperf-logging
package is available; this is needed by the logger
object declared in the DeepCam train.py
script. Notice also that the --wireup_method
parameter is now set to mpi
and that a new parameter has been added, --max_inter_threads
, for specifying the maximum number of concurrent readers.
DeepCam runs much more slowly on the CPU nodes than on the GPU nodes. Running on 32 CPU nodes, as shown above, will take around 6 hours to complete 35 epochs. This assumes you're using the default hyperparameter settings for DeepCam.
"},{"location":"user-guide/machine-learning/#tensorflow","title":"TensorFlow","text":"On ARCHER2, TensorFlow is supported for use on the CPU nodes only.
We'll demonstrate the use of TensorFlow with the CosmoFlow benchmark. It involves training a neural network to recognise cosmological parameter values from the output generated by 3D dark matter simulations; see the link below for more details.
CosmoFlow: using deep learning to learn the universe at scale
There are two CosmoFlow training datasets available on ARCHER2. A 5.6 GB mini dataset (/work/z19/shared/mlperf-hpc/cosmoflow/mini
), and a much larger 1.7 TB dataset (/work/z19/shared/mlperf-hpc/cosmoflow/full
).
In order to run a CosmoFlow training job, you must first clone the MLCommons HPC GitHub repo.
mkdir -p ${HOME/home/work}/tests\ncd ${HOME/home/work}/tests\n\ngit clone https://github.com/mlcommons/hpc.git mlperf-hpc\n\ncd ./mlperf-hpc/cosmoflow\n
You are now ready to run the following CosmoFlow submission script via the sbatch
command.
#!/bin/bash\n\n#SBATCH --job-name=cosmoflow\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n#SBATCH --nodes=32\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=01:00:00\n#SBATCH --exclusive\n\nmodule -q load tensorflow/2.13.0\n\nexport UCX_MEMTYPE_CACHE=n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\nexport MPICH_DPM_DIR=${SLURM_SUBMIT_DIR}/dpmdir\n\nexport OMP_NUM_THREADS=16\nexport TF_ENABLE_ONEDNN_OPTS=1\n\nsrun --hint=nomultithread --distribution=block:block --cpu-freq=2250000 \\\n python train.py \\\n --distributed --omp-num-threads ${OMP_NUM_THREADS} \\\n --inter-threads 0 --intra-threads 0 \\\n --n-epochs 2048 --n-train 1024 --n-valid 1024 \\\n --data-dir /work/z19/shared/mlperf-hpc/cosmoflow/mini/cosmoUniverse_2019_05_4parE_tf_v2_mini\n
The CosmoFlow job runs eight MPI tasks per node (one per NUMA region) with sixteen threads per task, and so, each node is fully populated. The TF_ENABLE_ONEDNN_OPTS
variable refers to Intel's oneAPI Deep Neural Network library. Within the TensorFlow source there are #ifdef
guards that are activated when oneDNN is enabled. It turns out that having TF_ENABLE_ONEDNN_OPTS=1
also improves performance (by a factor of 12) on AMD processors.
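The statement that each node is fully populated can be checked by multiplying tasks per node by CPUs per task and comparing against the 128 cores of an ARCHER2 CPU node. The helper below is a hypothetical convenience for sanity-checking such layouts before submitting a job; the figure of eight NUMA regions per node is taken from the description above.

```python
CORES_PER_NODE = 128  # physical cores on an ARCHER2 CPU node
NUMA_REGIONS = 8      # NUMA regions per node, as described above


def layout(ntasks_per_node, cpus_per_task):
    '''Summarise how a tasks/threads choice fills one node.'''
    used = ntasks_per_node * cpus_per_task
    return {
        'cores_used': used,
        'fully_populated': used == CORES_PER_NODE,
        'tasks_per_numa_region': ntasks_per_node / NUMA_REGIONS,
    }


if __name__ == '__main__':
    # CosmoFlow job above: 8 tasks per node with 16 threads each
    print(layout(ntasks_per_node=8, cpus_per_task=16))
    # DeepCam CPU job: 1 task per node with 128 threads
    print(layout(ntasks_per_node=1, cpus_per_task=128))
```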
The inter/intra thread training parameters allow one to exploit any parallelism implied by the TensorFlow (TF) DNN graph. For example, if a node in the TF graph can be parallelised, the number of threads assigned will be the value of --intra-threads
; and, if there are separate nodes in the TF graph that can be run concurrently, the available thread count for such an activity is the value of --inter-threads
. Of course, the optimum values for these parameters will depend on the DNN graph. The job script above tells TensorFlow to choose the values by setting both parameters to zero.
You will note that only a few hyperparameters are specified for the CosmoFlow training job (e.g., --n-epochs
, --n-train
and --n-valid
). Those settings in fact override the values assigned to those same parameters within the ./configs/cosmo.yaml
file. However, that file contains settings for many other hyperparameters that are not overwritten.
The CosmoFlow job specified above should take around 140 minutes to complete 2048 epochs, which should be sufficient to achieve a mean absolute error of 0.23.
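Reading the error figure quoted above as a mean absolute error (an assumption about the metric, not something taken from the CosmoFlow code), the quantity is simply the average of the absolute differences between predicted and true parameter values. A plain-Python sketch with made-up numbers:

```python
def mean_absolute_error(predictions, targets):
    '''Average absolute difference between two equal-length sequences.'''
    if len(predictions) != len(targets):
        raise ValueError('sequences must have the same length')
    if not predictions:
        raise ValueError('sequences must not be empty')
    total = sum(abs(p - t) for p, t in zip(predictions, targets))
    return total / len(predictions)


if __name__ == '__main__':
    # Illustrative values only, not CosmoFlow output
    predictions = [0.5, 1.0, 1.5]
    targets = [0.25, 1.0, 2.0]
    print(mean_absolute_error(predictions, targets))
```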
"},{"location":"user-guide/profile/","title":"Profiling","text":"There are a number of different ways to access profiling data on ARCHER2. In this section, we discuss the HPE Cray profiling tools, CrayPat-lite and CrayPat. We also show how to get usage data on currently running jobs from Slurm batch system.
You can also use the Linaro Forge tool to profile applications on ARCHER2.
If you are specifically interested in profiling IO, then you may want to look at the Darshan IO profiling tool.
"},{"location":"user-guide/profile/#craypat-lite","title":"CrayPat-lite","text":"CrayPat-lite is a simplified and easy-to-use version of the Cray Performance Measurement and Analysis Tool (CrayPat). CrayPat-lite provides basic performance analysis information automatically, with a minimum of user interaction, and yet offers information useful to users wishing to explore a program's behaviour further using the full CrayPat suite.
"},{"location":"user-guide/profile/#how-to-use-craypat-lite","title":"How to use CrayPat-lite","text":"Ensure the perftools-base
module is loaded.
module list
Load the perftools-lite
module.
module load perftools-lite
Compile your application normally. An informational message from CrayPat-lite will appear indicating that the executable has been instrumented.
cc -h std=c99 -o myapplication.x myapplication.c\n
INFO: creating the CrayPat-instrumented executable 'myapplication.x' (lite-samples) ...OK \n
Run the generated executable normally by submitting a job.
#!/bin/bash\n\n#SBATCH --job-name=CrayPat_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nexport OMP_NUM_THREADS=1\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the instrumented parallel program built in the previous step\nsrun --hint=nomultithread --distribution=block:block myapplication.x\n
Analyse the data.
After the job finishes executing, CrayPat-lite output should be printed to stdout (i.e. at the end of the job's output file). A new directory will also be created containing .rpt
and .ap2
files. The .rpt
files are text files that contain the same information printed in the job's output file and the .ap2
files can be used to obtain more detailed information, which can be visualized using the Cray Apprentice2 tool.
The Cray Performance Analysis Tool (CrayPat) is a powerful framework for analysing a parallel application\u2019s performance on Cray supercomputers. It can provide very detailed information about the timing and performance of individual application procedures.
CrayPat can perform two types of performance analysis, sampling experiments and tracing experiments. A sampling experiment probes the code at a predefined interval and produces a report based on the data collected. A tracing experiment explicitly monitors the code performance within named routines. Typically, the overhead associated with a tracing experiment is higher than that associated with a sampling experiment but provides much more detailed information. The key to getting useful data out of a sampling experiment is to run your profiling for a representative length of time.
"},{"location":"user-guide/profile/#sampling-analysis","title":"Sampling analysis","text":"Ensure the perftools-base
module is loaded.
module list
Load the perftools
module.
module load perftools
Compile your code in the standard way, always using the Cray compiler wrappers (ftn, cc and CC). Object files need to be made available to CrayPat to correctly build an instrumented executable for profiling or tracing; this means that the compile and link stages should be separated by using the -c
compile flag.
auser@ln01:/work/t01/t01/auser> cc -h std=c99 -c jacobi.c\nauser@ln01:/work/t01/t01/auser> cc jacobi.o -o jacobi\n
To instrument the binary, run the pat_build
command. This will generate a new binary with +pat
appended to the end (e.g. jacobi+pat
).
auser@ln:/work/t01/t01/auser> pat_build jacobi
Run the new executable with +pat
appended as you would with the regular executable. Each run will produce its own 'experiment directory' containing the performance data as .xf
files inside a subdirectory called xf-files
(e.g. running the jacobi+pat
instrumented executable might produce jacobi+pat+12265-1573s/xf-files
).
pat_report
. The .xf
files contain the raw sampling data from the run and need to be post-processed to produce useful results. This is done using the pat_report
tool which converts all the raw data into a summarised and readable form. You should provide the name of the experiment directory as the argument to pat_report
.
auser@ln:/work/t01/t01/auser> pat_report jacobi+pat+12265-1573s\n\nTable 1: Profile by Function (limited entries shown)\n\nSamp% | Samp | Imb. | Imb. | Group\n | | Samp | Samp% | Function\n | | | | PE=HIDE\n100.0% | 849.5 | -- | -- | Total\n|--------------------------------------------------\n| 56.7% | 481.4 | -- | -- | MPI\n||-------------------------------------------------\n|| 48.7% | 414.1 | 50.9 | 11.0% | MPI_Allreduce\n|| 4.4% | 37.5 | 118.5 | 76.6% | MPI_Waitall\n|| 3.0% | 25.2 | 44.8 | 64.5% | MPI_Isend\n||=================================================\n| 29.9% | 253.9 | 55.1 | 18.0% | USER\n||-------------------------------------------------\n|| 29.9% | 253.9 | 55.1 | 18.0% | main\n||=================================================\n| 13.4% | 114.1 | -- | -- | ETC\n||-------------------------------------------------\n|| 13.4% | 113.9 | 26.1 | 18.8% | __cray_memcpy_SNB\n|==================================================\n
This report will generate more files with the extension .ap2
in the experiment directory. These hold the same data as the .xf
files but in the post-processed form. Another file produced has an .apa
extension and is a text file with a suggested configuration for generating a traced experiment.
The .ap2
files generated are used to view performance data graphically with the Cray Apprentice2 tool.
The pat_report
command is able to produce many different profile reports from the profiling data. You can select a predefined report with the -O
flag to pat_report
. A selection of the most generally useful predefined report types is listed below.
Example output:
auser@ln01:/work/t01/t01/auser> pat_report -O ca+src,load_balance jacobi+pat+12265-1573s\n\nTable 1: Profile by Function and Callers, with Line Numbers (limited entries shown)\n\nSamp% | Samp | Imb. | Imb. | Group\n | | Samp | Samp% | Function\n | | | | PE=HIDE\n100.0% | 849.5 | -- | -- | Total\n|--------------------------------------------------\n|--------------------------------------\n| 56.7% | 481.4 | MPI\n||-------------------------------------\n|| 48.7% | 414.1 | MPI_Allreduce\n3| | | main:jacobi.c:line.80\n|| 4.4% | 37.5 | MPI_Waitall\n3| | | main:jacobi.c:line.73\n|| 3.0% | 25.2 | MPI_Isend\n|||------------------------------------\n3|| 1.6% | 13.2 | main:jacobi.c:line.65\n3|| 1.4% | 12.0 | main:jacobi.c:line.69\n||=====================================\n| 29.9% | 253.9 | USER\n||-------------------------------------\n|| 29.9% | 253.9 | main\n|||------------------------------------\n3|| 18.7% | 159.0 | main:jacobi.c:line.76\n3|| 9.1% | 76.9 | main:jacobi.c:line.84\n|||====================================\n||=====================================\n| 13.4% | 114.1 | ETC\n||-------------------------------------\n|| 13.4% | 113.9 | __cray_memcpy_SNB\n3| | | __cray_memcpy_SNB\n|======================================\n
"},{"location":"user-guide/profile/#tracing-analysis","title":"Tracing analysis","text":""},{"location":"user-guide/profile/#automatic-program-analysis-apa","title":"Automatic Program Analysis (APA)","text":"We can produce a focused tracing experiment based on the results from the sampling experiment using pat_build
with the .apa
file produced during the sampling.
auser@ln01:/work/t01/t01/auser> pat_build -O jacobi+pat+12265-1573s/build-options.apa\n
This will produce a third binary with extension +apa
. This binary should once again be run on the compute nodes, with the name of the executable in your job script changed to jacobi+apa
. As with the sampling analysis, a report can be produced using pat_report
. For example:
auser@ln01:/work/t01/t01/auser> pat_report jacobi+apa+13955-1573t\n\nTable 1: Profile by Function Group and Function (limited entries shown)\n\nTime% | Time | Imb. | Imb. | Calls | Group\n | | Time | Time% | | Function\n | | | | | PE=HIDE\n\n100.0% | 12.987762 | -- | -- | 1,387,544.9 | Total\n|-------------------------------------------------------------------------\n| 44.9% | 5.831320 | -- | -- | 2.0 | USER\n||------------------------------------------------------------------------\n|| 44.9% | 5.831229 | 0.398671 | 6.4% | 1.0 | main\n||========================================================================\n| 29.2% | 3.789904 | -- | -- | 199,111.0 | MPI_SYNC\n||------------------------------------------------------------------------\n|| 29.2% | 3.789115 | 1.792050 | 47.3% | 199,109.0 | MPI_Allreduce(sync)\n||========================================================================\n| 25.9% | 3.366537 | -- | -- | 1,188,431.9 | MPI\n||------------------------------------------------------------------------\n|| 18.0% | 2.334765 | 0.164646 | 6.6% | 199,109.0 | MPI_Allreduce\n|| 3.7% | 0.486714 | 0.882654 | 65.0% | 199,108.0 | MPI_Waitall\n|| 3.3% | 0.428731 | 0.557342 | 57.0% | 395,104.9 | MPI_Isend\n|=========================================================================\n
"},{"location":"user-guide/profile/#manual-program-analysis","title":"Manual Program Analysis","text":"CrayPat allows you to manually choose your profiling preference. This is particularly useful if the APA mode does not meet your tracing analysis requirements.
The entire program can be traced as a whole using -w
:
auser@ln01:/work/t01/t01/auser> pat_build -w jacobi\n
Using -g
, a program can be instrumented to trace all function entry point references belonging to the trace function group (mpi, libsci, lapack, scalapack, heap, etc):
auser@ln01:/work/t01/t01/auser> pat_build -w -g mpi jacobi\n
"},{"location":"user-guide/profile/#dynamically-linked-binaries","title":"Dynamically-linked binaries","text":"CrayPat allows you to profile un-instrumented, dynamically linked binaries with the pat_run
utility. pat_run
delivers profiling information for codes that cannot easily be rebuilt. To use pat_run
:
Load the perftools-base
module if it is not already loaded.
module load perftools-base
Run your application normally including the pat_run
command right after your srun
options.
srun [srun-options] pat_run [pat_run-options] program [program-options]
Use pat_report
to examine any data collected during the execution of your application.
auser@ln01:/work/t01/t01/auser> pat_report jacobi+pat+12265-1573s
Some useful pat_run
options are as follows.
-w
Collect data by tracing.
-g
Trace functions belonging to group names. See the -g option in pat_build(1) for a list of valid tracegroup values.
-r
Generate a text report upon successful execution.
Cray Apprentice2 is an optional GUI tool that is used to visualize and manipulate the performance analysis data captured during program execution. Cray Apprentice2 can display a wide variety of reports and graphs, depending on the type of program being analyzed, the way in which the program was instrumented for data capture, and the data that was collected during program execution.
You will need to use CrayPat to first instrument your program and capture performance analysis data, and then pat_report
to generate the .ap2
files from the results. You may then use Cray Apprentice2 to visualize and explore those files.
The number and appearance of the reports that can be generated using Cray Apprentice2 is determined by the kind and quantity of data captured during program execution, which in turn is determined by the way in which the program was instrumented and the environment variables in effect at the time of program execution. For example, changing the PAT_RT_SUMMARY environment variable to 0 before executing the instrumented program nearly doubles the number of reports available when analyzing the resulting data in Cray Apprentice2.
export PAT_RT_SUMMARY=0\n
To use Cray Apprentice2 (app2
), load the perftools-base
module if it is not already loaded.
module load perftools-base\n
Next, open the experiment directory generated during the instrumentation phase with Apprentice2.
auser@ln01:/work/t01/t01/auser> app2 jacobi+pat+12265-1573s\n
"},{"location":"user-guide/profile/#hardware-performance-counters","title":"Hardware Performance Counters","text":"Hardware performance counters can be used to monitor CPU and power events on ARCHER2 compute nodes. The monitoring and reporting of hardware counter events is integrated with CrayPat - users should use CrayPat as described earlier in this section to run profiling experiments to gather data from hardware counter events and to analyse the data.
"},{"location":"user-guide/profile/#counters-and-counter-groups-available","title":"Counters and counter groups available","text":"You can explore which event counters are available on compute nodes by running the following commands (replace t01
with a valid budget code for your account):
module load perftools\nsrun --ntasks=1 --partition=standard --qos=short --account=t01 papi_avail\n
For convenience, the CrayPat tool provides predetermined groups of hardware event counters. You can get more information on the hardware event counters available through CrayPat with the following commands (on a login or compute node):
module load perftools\npat_help counters rome groups\n
If you want information on which hardware event counters are included in a group you can type the group name at the prompt you get after running the command above. Once you have finished browsing the help, type .
to quit back to the command line.
You can also access counters on power/energy consumption. To list the counters available to monitor power/energy use you can use the command (replace t01
with a valid budget code for your account):
module load perftools\nsrun --ntasks=1 --partition=standard --qos=short --account=t01 papi_native_avail -i cray_pm\n
"},{"location":"user-guide/profile/#enabling-hardware-counter-data-collection","title":"Enabling hardware counter data collection","text":"You enable the collection of hardware event counter data as part of a CrayPat experiment by setting the environment variable PAT_RT_PERFCTR
to a comma separated list of the groups/counters that you wish to measure.
For example, you could set (usually in your job submission script):
export PAT_RT_PERFCTR=1\n
to use the 1
counter group (summary with branch activity).
If you enabled collection of hardware event counters when running your profiling experiment, you will automatically get a report on the data when you use the pat_report
command to analyse the profile experiment data file.
You will see information similar to the following in the output from CrayPat for different sections of your code (this example is for the case where export PAT_RT_PERFCTR=1
, counter group: summary with branch activity, was set in the job submission script):
==============================================================================\n USER / main\n------------------------------------------------------------------------------\n Time% 88.3% \n Time 446.113787 secs\n Imb. Time 33.094417 secs\n Imb. Time% 6.9% \n Calls 0.002 /sec 1.0 calls\n PAPI_BR_TKN 0.240G/sec 106,855,535,005.863 branch\n PAPI_TOT_INS 5.679G/sec 2,533,386,435,314.367 instr\n PAPI_BR_INS 0.509G/sec 227,125,246,394.008 branch\n PAPI_TOT_CYC 1,243,344,265,012.828 cycles\n Instr per cycle 2.04 inst/cycle\n MIPS 1,453,770.20M/sec \n Average Time per Call 446.113787 secs\n CrayPat Overhead : Time 0.2% \n
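The derived metrics in this report follow directly from the raw counter values. As a minimal sketch (plain Python, not part of CrayPat; the variable names are our own), the reported instructions per cycle and instruction rate can be reproduced from the counters in the example report, and a taken-branch fraction can be derived in the same way:

```python
# Raw values copied from the example report for USER / main.
papi_tot_ins = 2_533_386_435_314.367  # instructions
papi_tot_cyc = 1_243_344_265_012.828  # cycles
papi_br_tkn = 106_855_535_005.863     # branches taken
papi_br_ins = 227_125_246_394.008     # branch instructions
time_secs = 446.113787                # time spent in main

# Derived metrics: each is a simple ratio of two reported values.
instr_per_cycle = papi_tot_ins / papi_tot_cyc
instr_rate_g = papi_tot_ins / time_secs / 1e9
branch_taken_fraction = papi_br_tkn / papi_br_ins

print(f'Instr per cycle:       {instr_per_cycle:.2f}')
print(f'PAPI_TOT_INS rate:     {instr_rate_g:.3f} G/sec')
print(f'Branch taken fraction: {branch_taken_fraction:.2f}')
```

The first two values agree with the 2.04 inst/cycle and 5.679G/sec shown in the report; the taken-branch fraction is not printed by CrayPat in this counter group and is included only as an illustration.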
"},{"location":"user-guide/profile/#using-the-craypat-api-to-gather-hardware-counter-data","title":"Using the CrayPAT API to gather hardware counter data","text":"The CrayPAT API features a particular function, PAT_counters
, that allows you to obtain the values of specific hardware counters at specific points within your code.
For convenience, we have developed an MPI-based wrapper for this aspect of the CrayPAT API, called pat_mpi_lib
, which can be found via the link below.
https://github.com/cresta-eu/pat_mpi_lib
The PAT MPI Library makes it possible to monitor a user-defined set of hardware performance counters during the execution of an MPI code running across multiple compute nodes. The library is lightweight, containing just four functions, and is intended to be straightforward to use. Once you've defined the hooks in your code for recording counter values, you can control which counters are read at runtime by setting the PAT_RT_PERFCTR
environment variable in the job submission script. As your code executes, the defined set of counters will be read at various points. After each reading, the counter values are summed by rank 0 (via an MPI reduction) before being output to a log file.
Further information along with test harnesses and example scripts can be found by reading the PAT MPI Library readme file.
"},{"location":"user-guide/profile/#more-information-on-hardware-counters","title":"More information on hardware counters","text":"More information on using hardware counters can be found in the appropriate section of the HPE documentation:
Also available are two MPI-based wrapper libraries, one for Power Management (PM) counters that cover such properties as point-in-time power, cumulative energy use and temperature; and one that provides access to PAPI counters. See the links below for further details.
Slurm commands on the login nodes can be used to quickly and simply retrieve information about memory usage for currently running and completed jobs.
There are three commands you can use on ARCHER2 to query job data from Slurm: two are standard Slurm commands and one is a script that provides information on running jobs:
sstat
command is used to display status information of a running job or job step.
sacct
command is used to display accounting data for all finished jobs and job steps within the Slurm job database.
archer2jobload
command is used to show CPU and memory usage information for running jobs. (This script is based on one originally written for the COSMA HPC facility at the University of Durham.)
We provide examples of the use of these three commands below.
For the sacct
and sstat
commands, the memory properties we print out below are:
AveRSS
- The mean memory use per process over the length of the job
MaxRSS
- The maximum memory use by an individual process measured during the job
MaxRSSTask
- The process ID associated with the maximum memory use measured during the job
MaxRSSNode
- The node ID associated with the maximum memory use measured during the job
TRESUsageInTot
- Totals of various properties for the job. For example, the total memory use of the job is available in the mem=
property
Tip
Slurm polls for the memory use in a job, which means that short-term changes in memory use may not be captured in the Slurm data.
"},{"location":"user-guide/profile/#example-1-sstat-for-running-jobs","title":"Example 1:sstat
for running jobs","text":"To display the current memory use of a running job with the ID 123456:
sstat --format=JobID,AveCPU,AveRSS,MaxRSS,MaxRSSTask,MaxRSSNode,TRESUsageInTot%150 -j 123456\n
"},{"location":"user-guide/profile/#example-2-sacct-for-finished-jobs","title":"Example 2: sacct
for finished jobs","text":"To display the memory use of a completed job with the ID 123456:
sacct --format=JobID,JobName,AveRSS,MaxRSS,MaxRSSTask,MaxRSSNode,TRESUsageInTot%150 -j 123456\n
Another usage of sacct
is to display when a job was submitted, started running and ended for a particular user:
sacct --format=JobID,Submit,Start,End -u auser\n
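The Submit, Start and End columns can be combined to see how long a job queued and how long it ran. The following is a minimal sketch, assuming the timestamps are printed in the YYYY-MM-DDTHH:MM:SS form; the function name and the example values are hypothetical, and entries for jobs that have not yet started or finished are not handled.

```python
from datetime import datetime


def wait_and_run_seconds(submit, start, end):
    '''Return (queue wait, run time) in seconds from three timestamps.'''
    t_submit = datetime.fromisoformat(submit)
    t_start = datetime.fromisoformat(start)
    t_end = datetime.fromisoformat(end)
    return ((t_start - t_submit).total_seconds(),
            (t_end - t_start).total_seconds())


# Hypothetical values, not taken from a real job.
wait, run = wait_and_run_seconds('2024-12-12T11:31:26',
                                 '2024-12-12T11:45:26',
                                 '2024-12-12T12:15:56')
print(f'queued for {wait:.0f} s, ran for {run:.0f} s')
```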
"},{"location":"user-guide/profile/#example-3-archer2jobload-for-running-jobs","title":"Example 3: archer2jobload
for running jobs","text":"Using the archer2jobload
command on its own with no options will show the current CPU and memory use across compute nodes for all running jobs.
More usefully, you can provide a job ID to archer2jobload
and it will show a summary of the CPU and memory use for a specific job. For example, to get the usage data for job 123456, you would use:
auser@ln01:~> archer2jobload 123456\n# JOB: 123456\nCPU_LOAD MEMORY ALLOCMEM FREE_MEM TMP_DISK NODELIST \n127.35-127.86 256000 239872 169686-208172 0 nid[001481,001638-00\n
This shows the minimum CPU load on a compute node is 127.35 (close to the limit of 128 cores) with the maximum load 127.86 (indicating all the nodes are being used evenly). The minimum free memory is 169686 MB and the maximum free memory is 208172 MB.
If you add the -l
option, you will see a breakdown per node:
auser@ln01:~> archer2jobload -l 123456\n# JOB: 123456\nNODELIST CPU_LOAD MEMORY ALLOCMEM FREE_MEM TMP_DISK \nnid001481 127.86 256000 239872 169686 0 \nnid001638 127.60 256000 239872 171060 0 \nnid001639 127.64 256000 239872 171253 0 \nnid001677 127.85 256000 239872 173820 0 \nnid001678 127.75 256000 239872 173170 0 \nnid001891 127.63 256000 239872 173316 0 \nnid001921 127.65 256000 239872 207562 0 \nnid001922 127.35 256000 239872 208172 0 \n
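The ranges in the summary view are simply the minimum and maximum of the per-node values. As a minimal sketch, assuming the per-node output is whitespace-separated with the columns in the order shown above (the function name is our own and is not part of archer2jobload), the ranges can be recomputed as follows:

```python
def summarise_nodes(report):
    '''Return (min load, max load, min free mem, max free mem).'''
    loads, free_mem = [], []
    for line in report.splitlines():
        fields = line.split()
        # Data rows start with a node name such as nid001481;
        # the header and job ID lines are skipped.
        if len(fields) >= 5 and fields[0].startswith('nid'):
            loads.append(float(fields[1]))   # CPU_LOAD column
            free_mem.append(int(fields[4]))  # FREE_MEM column
    return min(loads), max(loads), min(free_mem), max(free_mem)


report = '''
NODELIST CPU_LOAD MEMORY ALLOCMEM FREE_MEM TMP_DISK
nid001481 127.86 256000 239872 169686 0
nid001638 127.60 256000 239872 171060 0
nid001639 127.64 256000 239872 171253 0
nid001677 127.85 256000 239872 173820 0
nid001678 127.75 256000 239872 173170 0
nid001891 127.63 256000 239872 173316 0
nid001921 127.65 256000 239872 207562 0
nid001922 127.35 256000 239872 208172 0
'''
print(summarise_nodes(report))
```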
"},{"location":"user-guide/profile/#further-help-with-slurm","title":"Further help with Slurm","text":"The definitions of any variables discussed here and more usage information can be found in the man pages of sstat
and sacct
.
The AMD \u03bcProf tool provides capabilities for low-level profiling on AMD processors, see:
The Linaro Forge tool also provides profiling capabilities. See:
The Darshan lightweight IO profiling tool provides a quick way to profile the IO part of your software:
Python is supported on ARCHER2 both for running intensive parallel jobs and also as an analysis tool. This section describes how to use Python in either of these scenarios.
The Python installations on ARCHER2 contain some of the most commonly used packages. If you wish to install additional Python packages, we recommend that you use the pip
command; see the section entitled Installing your own Python packages (with pip).
Important
Python 2 is not supported on ARCHER2 as it has been deprecated since the start of 2020.
Note
When you log onto ARCHER2, no Python module is loaded by default. You will generally need to load the cray-python
module to access the functionality described below.
The recommended way to use Python on ARCHER2 is to use the HPE Cray Python distribution.
The HPE Cray distribution provides Python 3 along with some of the most common packages used for scientific computation and data analysis. These include:
The HPE Cray Python distribution can be loaded (either on the front-end or in a submission script) using:
module load cray-python\n
Tip
The HPE Cray Python distribution is built using GCC compilers. If you wish to compile your own Python, C/C++ or Fortran code to use with HPE Cray Python, you should ensure that you compile using PrgEnv-gnu
to make sure they are compatible.
Sometimes, you may need to set up a local custom Python environment that extends a centrally-installed cray-python
module. By extend, we mean being able to install packages locally that are not provided by cray-python
. This is necessary because some Python packages such as mpi4py
must be built specifically for the ARCHER2 system and so are best provided centrally.
You can do this by creating a lightweight virtual environment where the local packages can be installed. This environment is created on top of an existing Python installation, known as the environment's base Python.
First, load the PrgEnv-gnu
environment.
auser@ln01:~> module load PrgEnv-gnu\n
This first step is necessary because subsequent pip
installs may involve source code compilation and it is better that this be done using the GCC compilers to maintain consistency with how some base Python packages have been built.
Second, select the base Python by loading the cray-python
module that you wish to extend.
auser@ln01:~> module load cray-python\n
Next, create the virtual environment within a designated folder.
python -m venv --system-site-packages /work/t01/t01/auser/myvenv\n
In our example, the environment is created within a myvenv
folder located on /work
, which means the environment will be accessible from the compute nodes. The --system-site-packages
option ensures this environment is based on the currently loaded cray-python
module. See https://docs.python.org/3/library/venv.html for more details.
You're now ready to activate your environment.
source /work/t01/t01/auser/myvenv/bin/activate\n
Tip
The myvenv
path uses a fictitious project code, t01
, and username, auser
. Please remember to replace those values with your actual project code and username. Alternatively, you could enter ${HOME/home/work}
in place of /work/t01/t01/auser
. That command fragment expands ${HOME}
and then replaces the home
part with work
.
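To make the substitution concrete, the sketch below mimics it in Python: the first occurrence of home in the path is replaced with work. The function name is hypothetical and the path uses the same fictitious project code and username as above.

```python
def work_dir_from_home(home):
    '''Mimic ${HOME/home/work}: replace the first home with work.'''
    return home.replace('home', 'work', 1)


print(work_dir_from_home('/home/t01/t01/auser'))
```

Only the first match is replaced, so a username that itself contains the string home is left unchanged.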
Installing packages to your local environment can now be done as follows.
(myvenv) auser@ln01:~> python -m pip install <package name>\n
Running pip
directly as in pip install <package name>
will also work, but we show the python -m
approach as this is consistent with the way the virtual environment was created. Further, if the package installation will require code compilation, you should amend the command to ensure use of the ARCHER2 compiler wrappers.
(myvenv) auser@ln01:~> CC=cc CXX=CC FC=ftn python -m pip install <package name>\n
And when you have finished installing packages, you can deactivate the environment by running the deactivate
command.
(myvenv) auser@ln01:~> deactivate\nauser@ln01:~>\n
The packages you have installed will only be available once the local environment has been activated. So, when running code that requires these packages, you must first activate the environment, by adding the activation command to the submission script, as shown below.
#!/bin/bash --login\n\n#SBATCH --job-name=myvenv\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=64\n#SBATCH --cpus-per-task=2\n#SBATCH --time=00:10:00\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nsource /work/t01/t01/auser/myvenv/bin/activate\n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\n\nsrun --distribution=block:block --hint=nomultithread python myvenv-script.py\n
Tip
If you find that a module you've installed to a virtual environment on /work
isn't found when running a job, it may be that it was previously installed to the default location of $HOME/.local
which is not mounted on the compute nodes. This can be an issue as pip
will reuse any modules found at this default location rather than reinstall them into a virtual environment. Thus, even if the virtual environment is on /work
, a module you've asked for may actually be located on /home
.
You can check a module's install location and its dependencies with pip show
, for example pip show matplotlib
. You may then run pip uninstall matplotlib
while no virtual environment is active to uninstall it from $HOME/.local
, and then re-run pip install matplotlib
while your virtual environment on /work
is active to reinstall it there. You will need to do this for any modules installed on /home
that you will use, either directly or indirectly. Remember you can check all your installed modules with pip list
.
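If many packages are involved, checking them one at a time with pip show becomes tedious. The following is a minimal sketch, using only the standard importlib.metadata module, that lists the installed packages located under a given directory; the function name packages_under is our own, and the check is only as reliable as the metadata of each installed package.

```python
import importlib.metadata
import os


def packages_under(prefix, dists=None):
    '''Return sorted names of installed packages located under prefix.'''
    if dists is None:
        dists = importlib.metadata.distributions()
    prefix = os.path.abspath(prefix)
    names = []
    for dist in dists:
        # locate_file('') gives the directory the package was installed into.
        location = os.path.abspath(str(dist.locate_file('')))
        name = dist.metadata['Name']
        if name and os.path.commonpath([prefix, location]) == prefix:
            names.append(name)
    return sorted(names)


if __name__ == '__main__':
    # Packages listed here live under $HOME/.local, not in the virtual environment.
    print(packages_under(os.path.expanduser('~/.local')))
```

Run this with your virtual environment active; any package it reports is being picked up from your home directory and is a candidate for the uninstall and reinstall procedure described above.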
The environment being extended does not have to come from one of the centrally-installed cray-python
modules. You can also create a local virtual environment based on one of the Machine Learning (ML) modules, e.g., tensorflow
or pytorch
. One extra command is required; it is issued immediately after the python -m venv ...
command.
extend-venv-activate /work/t01/t01/auser/myvenv\n
The extend-venv-activate
command merely adds some extra commands to the virtual environment's activate
script, ensuring that the Python packages will be gathered from the local virtual environment, the ML module and from the cray-python
base module. All this means you would avoid having to install ML packages within your local area.
Note
The extend-venv-activate
command becomes available (i.e., its location is placed on the path) only when the ML module is loaded. The ML modules are themselves based on cray-python
. For example, tensorflow/2.12.0
is based on the cray-python/3.9.13.1
module.
Conda-based Python distributions (e.g. Anaconda, Mamba, Miniconda) are an extremely popular way of installing and accessing software on many systems, including ARCHER2. Although conda-based distributions can be used on ARCHER2, care is needed in how they are installed and configured so that the installation does not adversely affect your use of ARCHER2. In particular, you should be careful of:
.bashrc
We cover each of these points in more detail below.
"},{"location":"user-guide/python/#conda-install-location","title":"Conda install location","text":"If you only need to use the files and executables from your conda installation on the login and data analysis nodes (via the serial
QoS) then the best place to install conda is in your home directory structure - this will usually be the default install location provided by the installation script.
If you need to access the files and executables from conda on the compute nodes then you will need to install to a different location as the home file systems are not available on the compute nodes. The work file systems are not well suited to hosting Python software natively due to the way in which file access works, particularly during Python startup. There are two main options for using conda from ARCHER2 compute nodes:
You can pull official conda-based container images from Dockerhub that you can use if you want just the standard set of Python modules that come with the distribution. For example, to get the latest Anaconda distribution as a Singularity container image on the ARCHER2 work file system, you would use (on an ARCHER2 login node, from the directory on the work file system where you want to store the container image):
singularity build anaconda3.sif docker://continuumio/anaconda3\n
Once you have the container image, you can run scripts in it with a command like:
singularity exec -B $PWD anaconda3.sif python my_script.py\n
As the container image is a single large file, you end up doing a single large read from the work file system rather than lots of small reads of individual Python files; this improves the performance of Python and reduces the detrimental impact on the wider file system performance for all users.
We have pre-built a Singularity container image with the Anaconda distribution on ARCHER2. Users can access it at $EPCC_SINGULARITY_DIR/anaconda3.sif
. To run a Python script with the centrally-installed image, you can use:
singularity exec -B $PWD $EPCC_SINGULARITY_DIR/anaconda3.sif python my_script.py\n
If you want additional packages that are not available in the standard container images then you will need to build your own container images. If you need help to do this, then please contact the ARCHER2 Service Desk.
"},{"location":"user-guide/python/#conda-addtions-to-shell-configuration-files","title":"Conda addtions to shell configuration files","text":"During the install process most conda-based distributions will ask a question like:
Do you wish the installer to initialize Miniconda3 by running conda init?
If you are installing to the ARCHER2 work directories or the solid state storage, you should answer \"no\" to this question.
Adding the initialisation to shell startup scripts (typically .bashrc
) means that every time you log in to ARCHER2, the conda environment will try to initialise by reading lots of files within the conda installation. This approach was designed for the case where a user has installed conda on their personal device and so is the only user of the file system. For shared file systems such as those on ARCHER2, this places a large load on the file system and will lead to you seeing slow login times and slow response from your command line on ARCHER2. It will also lead to degraded read/write performance from the work file systems for you and other users, so it should be avoided at all costs.
If you have previously installed a conda distribution and answered \"yes\" to the question about adding the initialisation to shell configuration files, you should edit your ~/.bashrc
file to remove the conda initialisation entries. This means deleting the lines that look something like:
# >>> conda initialize >>>\n# !! Contents within this block are managed by 'conda init' !!\n__conda_setup=\"$('/work/t01/t01/auser/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)\"\nif [ $? -eq 0 ]; then\neval \"$__conda_setup\"\nelse\nif [ -f \"/work/t01/t01/auser/miniconda3/etc/profile.d/conda.sh\" ]; then\n. \"/work/t01/t01/auser/miniconda3/etc/profile.d/conda.sh\"\nelse\nexport PATH=\"/work/t01/t01/auser/miniconda3/bin:$PATH\"\nfi\nfi\nunset __conda_setup\n# <<< conda initialize <<<\n
"},{"location":"user-guide/python/#running-python","title":"Running Python","text":""},{"location":"user-guide/python/#example-serial-python-submission-script","title":"Example serial Python submission script","text":"#!/bin/bash --login\n\n#SBATCH --job-name=python_test\n#SBATCH --ntasks=1\n#SBATCH --time=00:10:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n# Load the Python module, ...\nmodule load cray-python\n\n# ..., or, if using local virtual environment\nsource <<path to virtual environment>>/bin/activate\n\n# Run your Python program\npython python_test.py\n
"},{"location":"user-guide/python/#example-mpi4py-job-submission-script","title":"Example mpi4py job submission script","text":"Programs that have been parallelised with mpi4py can be run on the ARCHER2 compute nodes. Unlike the serial Python submission script however, we must launch the Python interpreter using srun
. Failing to do so will result in Python running a single MPI rank only.
#!/bin/bash --login\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=mpi4py_test\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:10:0\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the Python module, ...\nmodule load cray-python\n\n# ..., or, if using local virtual environment\nsource <<path to virtual environment>>/bin/activate\n\n# Pass cpus-per-task setting to srun\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\n\n# Run your Python program\n# Note that srun MUST be used to wrap the call to python,\n# otherwise your code will run serially\nsrun --distribution=block:block --hint=nomultithread python mpi4py_test.py\n
Tip
If you have installed your own packages you will need to activate your local Python environment within your job submission script as shown at the end of Installing your own Python packages (with pip).
By default, mpi4py will use the Cray MPICH OFI library. If you wish to use UCX instead, you must first, within the submission script, load PrgEnv-gnu
before loading the UCX modules, as shown below.
module load PrgEnv-gnu\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\nmodule load cray-python\n
"},{"location":"user-guide/python/#running-python-at-scale","title":"Running Python at scale","text":"The file system metadata server may become overloaded when running a parallel Python script over many fully populated nodes (i.e., 128 MPI ranks per node). Performance degrades due to the IO operations that accompany a high volume of Python import statements. Typically, each import will first require the module or library to be located by searching a number of file paths before the module is loaded into memory. Such a workload scales as Np x Nlib x Npath , where Np is the number of parallel processes, Nlib is the number of libraries imported and Npath the number of file paths searched. And so, in this way much time can be lost during the initial phase of a large Python job, not to mention the fact that the IO contention will be impacting other users of the system. Spindle is a tool for improving the library-loading performance of dynamically linked HPC applications. It provides a mechanism for\u00a0scalable loading of shared libraries, executables and Python\u00a0files from a shared file system at scale without turning the file system into a bottleneck. This is achieved by caching libraries or their locations within node memory. Spindle takes a\u00a0pure user-space\u00a0approach: users do not need to configure new file systems, load particular OS kernels or build special system components. The tool operates on existing binaries \u2014\u00a0no application modification or special build flags\u00a0are required. The script below shows how to run Spindle with your Python code. The Note The It is possible to view and run Jupyter notebooks from both login nodes and compute nodes on ARCHER2. Note You can test such notebooks on the login nodes, but please do not attempt to run any computationally intensive work. Jobs may get killed once they hit a CPU limit on login nodes. Please follow these steps. Install JupyterLab in your work directory. 
#!/bin/bash --login\n\n#SBATCH --nodes=256\n#SBATCH --ntasks-per-node=128\n...\n\nmodule load cray-python\nmodule load spindle/0.13\n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\n\nspindle --slurm --python-prefix=/opt/cray/pe/python/${CRAY_PYTHON_LEVEL} \\ \n srun --overlap --distribution=block:block --hint=nomultithread \\\n python mpi4py_script.py\n
--python-prefix
argument can be set to a list of colon-separated paths if necessary. In the example above, the CRAY_PYTHON_LEVEL
environment variable is set as a conseqeunce of loading cray-python
.srun --overlap
option is required for Spindle as the version of Slurm on ARCHER2 is newer than 20.11.
module load cray-python\nexport PYTHONUSERBASE=/work/t01/t01/auser/.local\nexport PATH=$PYTHONUSERBASE/bin:$PATH\n# source <<path to virtual environment>>/bin/activate # If using a virtual environment, uncomment this line and remove the --user flag from the pip command below\n\npip install --user jupyterlab\n
If you want to test JupyterLab on the login node please go straight to step 3. To run your Jupyter notebook on a compute node, you first need to run an interactive session.
srun --nodes=1 --exclusive --time=00:20:00 --account=<your_budget> \\\n --partition=standard --qos=short --reservation=shortqos \\\n --pty /bin/bash\n
Your prompt will change to something like below.
auser@nid001015:/tmp>\n
In this case, the node id is nid001015
. Now execute the following on the compute node.
cd /work/t01/t01/auser # Update the path to your work directory\nexport PYTHONUSERBASE=$(pwd)/.local\nexport PATH=$PYTHONUSERBASE/bin:$PATH\nexport HOME=$(pwd)\nmodule load cray-python\n# source <<path to virtual environment>>/bin/activate # If using a virtual environment, uncomment this line\n
Run the JupyterLab server.
export JUPYTER_RUNTIME_DIR=$(pwd)\njupyter lab --ip=0.0.0.0 --no-browser\n
Once it's started, you will see a URL printed in the terminal window of the form http://127.0.0.1:<port_number>/lab?token=<string>
; we'll need this URL for step 6.
Please skip this step if you are connecting from a machine running Windows. Open a new terminal window on your laptop and run the following command.
ssh <username>@login.archer2.ac.uk -L<port_number>:<node_id>:<port_number>\n
where <username>
is your username, and <node_id>
is the id of the node you're currently on (for a login node, this will be ln01
, or similar; on a compute node, it will be a mix of numbers and letters). In our example, <node_id>
is nid001015
. Please use the same port number as that shown in the URL of step 3. This number may vary; likely values are 8888 or 8889.
Please skip this step if you are connecting from Linux or macOS. If you are connecting from Windows, you should use MobaXterm to configure an SSH tunnel as follows.
Tunnelling
button above the MobaXterm terminal. Create a new tunnel by clicking on New SSH tunnel
in the window that opens.
Local port forwarding
radio button is selected.
forwarded port
text box on the left under My computer with MobaXterm
, enter the port number indicated in the JupyterLab server output (e.g., 8888 or 8890).
SSH server
enter login.archer2.ac.uk
, your ARCHER2 username and then 22
.
Remote server
, enter the id of the login or compute node running the JupyterLab server and the associated port number.
Save
button.
.ppk
private key that you normally use when connecting to ARCHER2.
Now, if you open a browser window locally, you should be able to navigate to the URL from step 3, and this should display the JupyterLab server. If JupyterLab is running on a compute node, the notebook will be available for the length of the interactive session you have requested.
Warning
Please do not use the other http address given by the JupyterLab output, the one formatted http://<node_id>:<port_number>/lab?token=<string>
. Your local browser will not recognise the <node_id>
part of the address.
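The port-forwarding command used on Linux or macOS can be assembled from the URL printed by JupyterLab. A minimal sketch, assuming the URL has the form shown in step 3 (the function name tunnel_command and the example token are hypothetical):

```python
from urllib.parse import urlparse


def tunnel_command(jupyter_url, node_id, username):
    '''Build the ssh port-forwarding command for a JupyterLab URL.'''
    port = urlparse(jupyter_url).port
    if port is None:
        raise ValueError('no port number found in URL')
    # The same port is used locally and on the node running JupyterLab.
    return f'ssh {username}@login.archer2.ac.uk -L{port}:{node_id}:{port}'


print(tunnel_command('http://127.0.0.1:8888/lab?token=abc123',
                     'nid001015', 'auser'))
```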
The Dask-jobqueue project makes it easy to deploy Dask on ARCHER2. You can find more information in the Dask Job-Queue documentation.
Please follow these steps:
module load cray-python\nexport PYTHONUSERBASE=/work/t01/t01/auser/.local\nexport PATH=$PYTHONUSERBASE/bin:$PATH\n\npip install --user dask-jobqueue --upgrade\n
Dask-jobqueue creates a Dask Scheduler in the Python process where the cluster object is instantiated. A script for running Dask jobs on ARCHER2 might look something like this:
from dask_jobqueue import SLURMCluster\ncluster = SLURMCluster(cores=128, \n processes=16,\n memory='256GB',\n queue='standard',\n header_skip=['--mem'],\n job_extra=['--qos=\"standard\"'],\n python='srun python',\n project='z19',\n walltime=\"01:00:00\",\n shebang=\"#!/bin/bash --login\",\n local_directory='$PWD',\n interface='hsn0',\n env_extra=['module load cray-python',\n 'export PYTHONUSERBASE=/work/t01/t01/auser/.local/',\n 'export PATH=$PYTHONUSERBASE/bin:$PATH',\n 'export PYTHONPATH=$PYTHONUSERBASE/lib/python3.8/site-packages:$PYTHONPATH'])\n\n\n\ncluster.scale(jobs=2) # Deploy two single-node jobs\n\nfrom dask.distributed import Client\nclient = Client(cluster) # Connect this local process to remote workers\n\n# wait for jobs to arrive, depending on the queue, this may take some time\nimport dask.array as da\nx = \u2026 # Dask commands now use these distributed resources\n
This script can be run on the login nodes and it submits the Dask jobs to the job queue. Users should ensure that the computationally intensive work is done with the Dask commands which run on the compute nodes.
The cluster object parameters specify the characteristics for running on a single compute node. Dask requires you to supply the memory option, but ARCHER2 jobs run on exclusive nodes, where you should not specify memory requirements. The header_skip option is therefore required to keep the --mem line out of the generated job script.
Jobs are deployed with the cluster.scale command, where the jobs option sets the number of single-node jobs requested. Job scripts are generated (from the cluster object) and submitted to the queue to begin running once the resources are available. You can check the status of the jobs by running squeue -u $USER
in a separate terminal.
If you wish to see the generated job script you can use:
print(cluster.job_script())\n
"},{"location":"user-guide/scheduler/","title":"Running jobs on ARCHER2","text":"As with most HPC services, ARCHER2 uses a scheduler to manage access to resources and ensure that the thousands of different users of the system are able to share it and all get access to the resources they require. ARCHER2 uses the Slurm software to schedule jobs.
Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.
Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.
Hint
If you have any questions on how to run jobs on ARCHER2 do not hesitate to contact the ARCHER2 Service Desk.
You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.
"},{"location":"user-guide/scheduler/#resources","title":"Resources","text":""},{"location":"user-guide/scheduler/#cus","title":"CUs","text":"Time used on ARCHER2 is measured in CUs. 1 CU = 1 Node Hour for a standard 128 core node.
The CU calculator will help you to calculate the CU cost for your jobs.
"},{"location":"user-guide/scheduler/#checking-available-budget","title":"Checking available budget","text":"You can check in SAFE by selecting Login accounts
from the menu and then selecting the login account you want to query.
Under Login account details
you will see each of the budget codes you have access to listed e.g. e123 resources
and then under Resource Pool to the right of this, a note of the remaining budget in CUs.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins\n
This will list all the budget codes that you have access to e.g.
Account User MaxTRESMins\n---------- ---------- -------------\n e123 userx cpu=0\n e123-test userx\n
This shows that userx
is a member of budgets e123
and e123-test
. However, the cpu=0
indicates that the e123
budget is empty or disabled. This user can submit jobs using the e123-test
budget.
To see the number of CUs remaining you must check in SAFE.
"},{"location":"user-guide/scheduler/#charging","title":"Charging","text":"Jobs run on ARCHER2 are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).
Jobs are charged for the full number of nodes which are requested, even if they are not all used.
Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.
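As a rough illustration of the charging rules above, the following Python sketch estimates the CU charge for a job. The function name cu_charge and its arguments are hypothetical, and the calculation assumes the standard rate of 1 CU per node hour stated earlier; SAFE and the CU calculator remain the authoritative sources.

```python
def cu_charge(nodes_requested, elapsed_seconds, cu_per_node_hour=1.0):
    '''Estimate the CU charge for a job (illustrative sketch only).

    Assumes the job is charged for every requested node over the elapsed
    run time (not the requested wall time) at 1 CU per node hour.
    '''
    if nodes_requested < 1 or elapsed_seconds < 0:
        raise ValueError('nodes must be >= 1 and elapsed time must be >= 0')
    return nodes_requested * (elapsed_seconds / 3600.0) * cu_per_node_hour


# A 4-node job that ran for 90 minutes, whatever wall time was requested
print(cu_charge(4, 90 * 60))  # prints 6.0
```

Note that the node count is the number requested, so a job that leaves some of its nodes idle costs the same as one that uses them all.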
"},{"location":"user-guide/scheduler/#basic-slurm-commands","title":"Basic Slurm commands","text":"There are four key commands used to interact with Slurm on the command line:
sinfo
- Get information on the partitions and resources available
sbatch jobscript.slurm
- Submit a job submission script (in this case called: jobscript.slurm
) to the scheduler
squeue
- Get the current status of jobs submitted to the scheduler
scancel 12345
- Cancel a job (in this case with the job ID 12345
)
We cover each of these commands in more detail below.
"},{"location":"user-guide/scheduler/#sinfo-information-on-resources","title":"sinfo
: information on resources","text":"sinfo
is used to query information about available resources and partitions. Without any options, sinfo
lists the status of all resources and partitions, e.g.
auser@ln01:~> sinfo\n\nPARTITION AVAIL TIMELIMIT NODES STATE NODELIST\nstandard up 1-00:00:00 105 down* nid[001006,...,002014]\nstandard up 1-00:00:00 12 drain nid[001016,...,001969]\nstandard up 1-00:00:00 5 resv nid[001000,001002-001004,001114]\nstandard up 1-00:00:00 683 alloc nid[001001,...,001970-001991]\nstandard up 1-00:00:00 214 idle nid[001022-001023,...,002015-002023]\nstandard up 1-00:00:00 2 down nid[001021,001050]\n
Here we see the number of nodes in different states. For example, 683 nodes are allocated (running jobs), and 214 are idle (available to run jobs).
Note
that long lists of node IDs have been abbreviated with ...
.
sbatch
: submitting jobs","text":"sbatch
is used to submit a job script to the job submission system. The script will typically contain one or more srun
commands to launch parallel tasks.
When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.
auser@ln01:~> sbatch test-job.slurm\nSubmitted batch job 12345\n
"},{"location":"user-guide/scheduler/#squeue-monitoring-jobs","title":"squeue
: monitoring jobs","text":"squeue
without any options or arguments shows the current status of all jobs known to the scheduler. For example:
auser@ln01:~> squeue\n
will list all jobs on ARCHER2.
The output of this is often overwhelmingly large. You can restrict the output to just your jobs by adding the -u $USER
option:
auser@ln01:~> squeue -u $USER\n
"},{"location":"user-guide/scheduler/#scancel-deleting-jobs","title":"scancel
: deleting jobs","text":"scancel
is used to delete a job from the scheduler. If the job is waiting to run, it is simply cancelled; if it is running, it is stopped immediately.
If you only want to cancel a specific job you need to provide the job ID of the job you wish to cancel/stop. For example:
auser@ln01:~> scancel 12345\n
will cancel (if waiting) or stop (if running) the job with ID 12345
.
scancel
can take other options. For example, if you want to cancel all your pending (queued) jobs but leave the running jobs running, you could use:
auser@ln01:~> scancel --state=PENDING --user=$USER\n
"},{"location":"user-guide/scheduler/#resource-limits","title":"Resource Limits","text":"The ARCHER2 resource limits for any given job are covered by three separate attributes.
The primary resource you can request for your job is the compute node.
Information
The --exclusive
option is enforced on ARCHER2 which means you will always have access to all of the memory on the compute node regardless of how many processes are actually running on the node.
Note
You will not generally have access to the full amount of memory on the node, as some is retained for running the operating system and other system processes.
"},{"location":"user-guide/scheduler/#partitions","title":"Partitions","text":"On ARCHER2, compute nodes are grouped into partitions. You will have to specify a partition using the --partition
option in your Slurm submission script. The following table has a list of active partitions on ARCHER2.
Note
The standard
partition includes both the standard memory and high memory nodes but standard memory nodes are preferentially chosen for jobs where possible. To guarantee access to high memory nodes you should specify the highmem
partition.
On ARCHER2, job limits are defined by the requested Quality of Service (QoS), as specified by the --qos
Slurm directive. The following table lists the active QoS on ARCHER2.
You can find out the QoS that you can use by running the following command:
auser@ln01:~> sacctmgr show assoc user=$USER cluster=archer2 format=cluster,account,user,qos%50\n
Hint
If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.
"},{"location":"user-guide/scheduler/#e-mail-notifications","title":"E-mail notifications","text":"E-mail notifications from the scheduler are not currently available on ARCHER2.
"},{"location":"user-guide/scheduler/#priority","title":"Priority","text":"Job priority on ARCHER2 depends on a number of different factors:
Each of these factors is normalised to a value between 0 and 1 and multiplied by a weight, and the resulting values are combined to produce a priority for the job. The current job priority formula on ARCHER2 is:
Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)] + [100 * P(size)]\n
The priority factors are:
lowpriority
QoS has a raw priority of 1.
You can view the priorities for currently queued jobs on the system with the sprio
command:
auser@ln04:~> sprio -l\n JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE QOS\n 828764 standard 1049 0 45 0 4 1000\n 828765 standard 1049 0 45 0 4 1000\n 828770 standard 1049 0 45 0 4 1000\n 828771 standard 1012 0 8 0 4 1000\n 828773 standard 1012 0 8 0 4 1000\n 828791 standard 1012 0 8 0 4 1000\n 828797 standard 1118 0 115 0 4 1000\n 828800 standard 1154 0 150 0 4 1000\n 828801 standard 1154 0 150 0 4 1000\n 828805 standard 1118 0 115 0 4 1000\n 828806 standard 1154 0 150 0 4 1000\n
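The arithmetic of the priority formula above can be sketched in Python as follows. This is only an illustration: the function name job_priority is hypothetical, the factor values in the example are made up, and Slurm computes the real normalised factors internally.

```python
# Weights taken from the priority formula above
WEIGHTS = {'qos': 10000, 'age': 500, 'fairshare': 300, 'size': 100}


def job_priority(factors):
    '''Combine normalised factors (each between 0 and 1) into a priority.'''
    for name, value in factors.items():
        if name not in WEIGHTS:
            raise KeyError('unknown priority factor: ' + name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(name + ' must lie between 0 and 1')
    # Factors that are not supplied are treated as 0
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)


# Hypothetical factor values, chosen only to illustrate the arithmetic
example = {'qos': 0.1, 'age': 0.5, 'fairshare': 0.0, 'size': 0.04}
print(job_priority(example))
```

Because the QoS weight is much larger than the others, the QoS factor dominates the result unless the other factors differ substantially between jobs.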
"},{"location":"user-guide/scheduler/#troubleshooting","title":"Troubleshooting","text":""},{"location":"user-guide/scheduler/#slurm-error-messages","title":"Slurm error messages","text":"An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:
sbatch: unrecognized option <text>
One of your options is invalid or has a typo. See man sbatch
for help.
error: Batch job submission failed: No partition specified or system default partition
A --partition=
option is missing. You must specify the partition (see the list above). This is most often --partition=standard
.
error: invalid partition specified: <partition>
error: Batch job submission failed: Invalid partition name specified
Check the partition exists and check the spelling is correct.
error: Batch job submission failed: Invalid account or account/partition combination specified
This probably means an invalid account has been given. Check the --account=
options against valid accounts in SAFE.
error: Batch job submission failed: Invalid qos specification
A QoS option is either missing or invalid. Check the script has a --qos=
option and that the option is a valid one from the table above. (Check the spelling of the QoS is correct.)
error: Your job has no time specification (--time=)...
Add an option of the form --time=hours:minutes:seconds
to the submission script. E.g., --time=01:30:00
gives a time limit of 90 minutes.
error: QOSMaxWallDurationPerJobLimit
error: Batch job submission failed: Job violates accounting/QOS policy
(job submit limit, user's size and/or time limits)
The script has probably specified a time limit which is too long for the corresponding QoS. E.g., the time limit for the short QoS is 20 minutes.
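To make the --time format from the error list above concrete, here is a small Python sketch that converts an hours:minutes:seconds value into minutes and compares it with a QoS limit. The helper name time_limit_minutes is hypothetical, and the sketch handles only the hours:minutes:seconds form used in this guide, not the other time formats that Slurm accepts.

```python
SHORT_QOS_LIMIT_MINUTES = 20  # limit for the short QoS quoted above


def time_limit_minutes(time_option):
    '''Convert an hours:minutes:seconds string into minutes.'''
    hours, minutes, seconds = (int(part) for part in time_option.split(':'))
    return hours * 60 + minutes + seconds / 60


print(time_limit_minutes('01:30:00'))  # 90.0, the 90 minutes quoted above
# A 30 minute request is too long for the short QoS
print(time_limit_minutes('0:30:0') > SHORT_QOS_LIMIT_MINUTES)  # True
```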
The squeue
command allows users to view information for jobs managed by Slurm. Jobs typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED. The first table provides a description of some job state codes. The second table provides a description of the reasons that cause a job to be in a state.
For a full list, see Job State Codes.
"},{"location":"user-guide/scheduler/#slurm-queued-reasons","title":"Slurm queued reasons","text":"
| Reason | Description |
| --- | --- |
| Priority | One or more higher priority jobs exist for this partition or advanced reservation. |
| Resources | The job is waiting for resources to become available. |
| BadConstraints | The job's constraints cannot be satisfied. |
| BeginTime | The job's earliest start time has not yet been reached. |
| Dependency | This job is waiting for a dependent job to complete. |
| Licenses | The job is waiting for a license. |
| WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. |
| Prolog | The job's PrologSlurmctld program is still running. |
| JobHeldAdmin | The job is held by a system administrator. |
| JobHeldUser | The job is held by the user. |
| JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
| NonZeroExitCode | The job terminated with a non-zero exit code. |
| InvalidAccount | The job's account is invalid. |
| InvalidQOS | The job's QOS is invalid. |
| QOSUsageThreshold | Required QOS threshold has been breached. |
| QOSJobLimit | The job's QOS has reached its maximum job count. |
| QOSResourceLimit | The job's QOS has reached some resource limit. |
| QOSTimeLimit | The job's QOS has reached its time limit. |
| NodeDown | A node required by the job is down. |
| TimeLimit | The job exhausted its time limit. |
| ReqNodeNotAvail | Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's \"reason\" field as \"UnavailableNodes\". Such nodes will typically require the intervention of a system administrator to make available. |

For a full list, see Job Reasons.
"},{"location":"user-guide/scheduler/#output-from-slurm-jobs","title":"Output from Slurm jobs","text":"Slurm places standard output (STDOUT) and standard error (STDERR) for each job in the file slurm-<JobID>.out
. This file appears in the job's working directory once your job starts running.
Hint
Output may be buffered - to enable live output, e.g. for monitoring job status, add --unbuffered
to the srun
command in your Slurm script.
You specify the resources you require for your job with directives at the top of your job submission script, on lines that start with #SBATCH
.
Hint
Most options provided using #SBATCH
directives can also be specified as command line options to srun
.
If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget that they wish to charge the job to with the option:
--account=<budgetID>
your budget ID is usually something like t01
or t01-test
. You can see which budget codes you can charge to in SAFE.
Important
You must specify an account code for your job otherwise it will fail to submit with the error: sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
. (This error can also mean that you have specified a budget that has run out of resources.)
Other common options that are used are:
--time=<hh:mm:ss>
the maximum walltime for your job. For example, for a 6.5 hour walltime, you would use --time=6:30:0
.
--job-name=<jobname>
set a name for the job to help identify it in Slurm.
To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, the option
--export=none
prevents the user environment from being exported to the batch system.
Using the --export=none
option means that the behaviour of batch submissions should be repeatable. We strongly recommend its use.
Note
When submitting your job, the scheduler will check that the requested resources are available e.g. that your account is a member of the requested budget, that the requested QoS exists. If things change before the job starts and e.g. your account has been removed from the requested budget or the requested QoS has been deleted then the job will not be able to start. In such cases, the job will be removed from the pending queue by our systems team, as it will no longer be eligible to run.
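The common options described above can be collected into the header of a job script. The Python sketch below only assembles the text of the #SBATCH lines; the function name sbatch_header is hypothetical, t01 is the placeholder budget code used in this guide, and the standard partition and QoS defaults are assumptions that may not suit your job.

```python
def sbatch_header(account, time, job_name, partition='standard', qos='standard'):
    '''Return the #SBATCH lines for the common options (sketch only).'''
    options = {
        'job-name': job_name,
        'time': time,
        'account': account,
        'partition': partition,
        'qos': qos,
        'export': 'none',  # keep the submission environment out of the job
    }
    return ['#SBATCH --' + key + '=' + value for key, value in options.items()]


# t01 is a placeholder budget code; replace it with your own
for line in sbatch_header('t01', '6:30:0', 'my_job'):
    print(line)
```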
"},{"location":"user-guide/scheduler/#additional-options-for-parallel-jobs","title":"Additional options for parallel jobs","text":"Note
For parallel jobs, ARCHER2 operates in a node exclusive way. This means that you are assigned resources in the units of full compute nodes for your jobs (i.e. 128 cores) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (or 128 cores).
In addition, parallel jobs will also need to specify how many nodes, parallel processes and threads they require.
--nodes=<nodes>
the number of nodes to use for the job.
--ntasks-per-node=<processes per node>
the number of parallel processes (e.g. MPI ranks) per node.
--cpus-per-task=1
if you are using parallel processes only, with no threading, and you want to use all 128 cores on the node, then you should set the number of CPUs (cores) per parallel process to 1. Important: if you are using threading (e.g. with OpenMP) or you want to use fewer than 128 cores per node (e.g. to access more memory or memory bandwidth per core) then you will need to change this option as described below.
--cpu-freq=<freq. in kHz>
set the CPU frequency for the compute nodes. Valid values are 2250000
(2.25 GHz), 2000000
(2.0 GHz), 1500000
(1.5 GHz). For more information on CPU frequency settings and energy use see the Energy use section.
For parallel jobs that use threading (e.g. OpenMP) or when you want to use fewer than 128 cores per node (e.g. to access more memory or memory bandwidth per core), you will also need to change the --cpus-per-task
option.
For jobs using threading:
--cpus-per-task=<threads per task>
the number of threads per parallel process (e.g. number of OpenMP threads per MPI task for hybrid MPI/OpenMP jobs). Important: you must also set the OMP_NUM_THREADS
environment variable if using OpenMP in your job.
For jobs using fewer than 128 cores per node:
--cpus-per-task=<stride between placement of processes>
the stride between the parallel processes. For example, if you want to double the memory and memory bandwidth per process on an ARCHER2 compute node, you would place 64 processes per node and leave an empty core between each pair of processes by setting --cpus-per-task=2
and --ntasks-per-node=64
.
Important
You must also add export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
to your job submission script to pass the --cpus-per-task
setting from the job script to the srun
command. (Alternatively, you could use the --cpus-per-task
option in the srun command itself.) If you do not do this then the placement of processes/threads will be incorrect and you will likely see poor performance of your application.
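The choice of --cpus-per-task described above can be sketched as a small calculation. The helper cpus_per_task below is hypothetical and assumes that you want tasks spread evenly across all 128 cores of a node; in that case the stride 128 // ntasks_per_node equals the thread count whenever tasks multiplied by threads fills the node.

```python
CORES_PER_NODE = 128  # physical cores on a standard ARCHER2 compute node


def cpus_per_task(ntasks_per_node, threads_per_task=1):
    '''Suggest a --cpus-per-task value (hypothetical helper).

    The value is the stride between tasks when they are spread evenly
    over the node, which leaves room for the threads of each task.
    '''
    if ntasks_per_node < 1 or threads_per_task < 1:
        raise ValueError('task and thread counts must be positive')
    if ntasks_per_node * threads_per_task > CORES_PER_NODE:
        raise ValueError('more threads requested than physical cores')
    return CORES_PER_NODE // ntasks_per_node


print(cpus_per_task(128))    # 1: fully populated, no threading
print(cpus_per_task(64))     # 2: one empty core between processes
print(cpus_per_task(8, 16))  # 16: hybrid MPI/OpenMP layout
```

Whatever value you choose, remember that it only reaches srun if SRUN_CPUS_PER_TASK is exported or the option is repeated on the srun command line, as noted above.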
The data analysis nodes are shared between all users and can be used to run jobs that require small numbers of cores and/or access to an external network to transfer data. These jobs are often serial jobs that only require a single core.
To run jobs on the data analysis nodes you require the following options:
--partition=serial
to select the data analysis nodes
--qos=serial
to select the data analysis QoS (see above for QoS limits)
--ntasks=<number of cores>
to select the number of cores you want to use in this job (up to the maximum defined in the QoS)
--mem=<amount of memory>
to select the amount of memory you require (up to the maximum defined in the QoS).
More information on using the data analysis nodes (including example job submission scripts) can be found in the Data Analysis section of the User and Best Practice Guide.
"},{"location":"user-guide/scheduler/#srun-launching-parallel-jobs","title":"srun
: Launching parallel jobs","text":"If you are running parallel jobs, your job submission script should contain one or more srun
commands to launch the parallel executable across the compute nodes. In most cases you will want to add the options --distribution=block:block
and --hint=nomultithread
to your srun
command to ensure you get the correct pinning of processes to cores on a compute node.
Warning
If you do not add the --distribution=block:block
and --hint=nomultithread
options to your srun
command the default process placement may lead to a drop in performance for your jobs on ARCHER2.
A brief explanation of these options:
--hint=nomultithread
- do not use hyperthreads/SMT
--distribution=block:block
- the first block
means use a block distribution of processes across nodes (i.e. fill nodes before moving onto the next one) and the second block
means use a block distribution of processes across \"sockets\" within a node (i.e. fill a \"socket\" before moving on to the next one).
Important
The Slurm definition of a \"socket\" does not correspond to a physical CPU socket. On ARCHER2 it corresponds to a 4-core CCX (Core CompleX).
"},{"location":"user-guide/scheduler/#slurm-definition-of-a-socket","title":"Slurm definition of a \"socket\"","text":"On ARCHER2, Slurm is configured with the following setting:
SlurmdParameters=l3cache_as_socket\n
The effect of this setting is to define a Slurm socket as a unit that has a shared L3 cache. On ARCHER2, this means that each Slurm \"socket\" corresponds to a 4-core CCX (Core CompleX). For a more detailed discussion on the hardware and the memory/cache layout see the Hardware section.
The effect of this setting can be illustrated by using the xthi
program to report placement when we select a cyclic distribution of processes across sockets from srun (--distribution=block:cyclic
). As you can see from the output from xthi
included below, the cyclic
per-socket distribution results in sequential MPI processes being placed on every 4th core (i.e. cyclic placement across CCX).
Node summary for 1 nodes:\nNode 0, hostname nid000006, mpi 128, omp 1, executable xthi_mpi\nMPI summary: 128 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 4)\nNode 0, rank 2, thread 0, (affinity = 8)\nNode 0, rank 3, thread 0, (affinity = 12)\nNode 0, rank 4, thread 0, (affinity = 16)\nNode 0, rank 5, thread 0, (affinity = 20)\nNode 0, rank 6, thread 0, (affinity = 24)\nNode 0, rank 7, thread 0, (affinity = 28)\nNode 0, rank 8, thread 0, (affinity = 32)\nNode 0, rank 9, thread 0, (affinity = 36)\nNode 0, rank 10, thread 0, (affinity = 40)\nNode 0, rank 11, thread 0, (affinity = 44)\nNode 0, rank 12, thread 0, (affinity = 48)\nNode 0, rank 13, thread 0, (affinity = 52)\nNode 0, rank 14, thread 0, (affinity = 56)\nNode 0, rank 15, thread 0, (affinity = 60)\nNode 0, rank 16, thread 0, (affinity = 64)\nNode 0, rank 17, thread 0, (affinity = 68)\nNode 0, rank 18, thread 0, (affinity = 72)\nNode 0, rank 19, thread 0, (affinity = 76)\nNode 0, rank 20, thread 0, (affinity = 80)\nNode 0, rank 21, thread 0, (affinity = 84)\nNode 0, rank 22, thread 0, (affinity = 88)\nNode 0, rank 23, thread 0, (affinity = 92)\nNode 0, rank 24, thread 0, (affinity = 96)\nNode 0, rank 25, thread 0, (affinity = 100)\nNode 0, rank 26, thread 0, (affinity = 104)\nNode 0, rank 27, thread 0, (affinity = 108)\nNode 0, rank 28, thread 0, (affinity = 112)\nNode 0, rank 29, thread 0, (affinity = 116)\nNode 0, rank 30, thread 0, (affinity = 120)\nNode 0, rank 31, thread 0, (affinity = 124)\nNode 0, rank 32, thread 0, (affinity = 1)\nNode 0, rank 33, thread 0, (affinity = 5)\nNode 0, rank 34, thread 0, (affinity = 9)\nNode 0, rank 35, thread 0, (affinity = 13)\nNode 0, rank 36, thread 0, (affinity = 17)\nNode 0, rank 37, thread 0, (affinity = 21)\nNode 0, rank 38, thread 0, (affinity = 25)\n\n...output trimmed...\n
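The pattern in the xthi output above can be reproduced with a short Python sketch. It assumes a single node with 128 ranks and one thread per rank, as in the example, and the function name cyclic_core is hypothetical: rank r is placed on Slurm socket r modulo 32, in slot r // 32 within that socket.

```python
CORES_PER_NODE = 128
CORES_PER_SLURM_SOCKET = 4  # one CCX, as described above


def cyclic_core(rank):
    '''Core expected for an MPI rank under a cyclic per-socket distribution.'''
    sockets = CORES_PER_NODE // CORES_PER_SLURM_SOCKET  # 32 Slurm sockets
    socket, slot = rank % sockets, rank // sockets
    return socket * CORES_PER_SLURM_SOCKET + slot


# Matches the affinities reported by xthi for these ranks
print([cyclic_core(rank) for rank in (0, 1, 2, 31, 32, 33)])
# prints [0, 4, 8, 124, 1, 5]
```

With --distribution=block:block the mapping would instead be the identity, with rank r on core r, which keeps neighbouring ranks within the same CCX.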
"},{"location":"user-guide/scheduler/#bolt-job-submission-script-creation-tool","title":"bolt: Job submission script creation tool","text":"The bolt job submission script creation tool has been written by EPCC to simplify the process of writing job submission scripts for modern multicore architectures. Based on the options you supply, bolt will generate a job submission script that uses ARCHER2 in a reasonable way.
MPI, OpenMP and hybrid MPI/OpenMP jobs are supported.
Warning
The tool will allow you to generate scripts for jobs that use the long
QoS but you will need to manually modify the resulting script to change the QoS to long
.
If there are problems or errors in your job parameter specifications then bolt will print warnings or errors. However, bolt cannot detect all problems.
"},{"location":"user-guide/scheduler/#basic-usage","title":"Basic Usage","text":"The basic syntax for using bolt is:
bolt -n [parallel tasks] -N [parallel tasks per node] -d [number of threads per task] \\\n -t [wallclock time (h:m:s)] -o [script name] -j [job name] -A [project code] [arguments...]\n
Example 1: to generate a job script to run an executable called my_prog.x
for 24 hours using 8192 parallel (MPI) processes and 128 (MPI) processes per compute node you would use something like:
bolt -n 8192 -N 128 -t 24:0:0 -o my_job.bolt -j my_job -A z01-budget my_prog.x arg1 arg2\n
(remember to substitute z01-budget
for your actual budget code.)
Example 2: to generate a job script to run an executable called my_prog.x
for 3 hours using 2048 parallel (MPI) processes and 64 (MPI) processes per compute node (i.e. using half of the cores on a compute node), you would use:
bolt -n 2048 -N 64 -t 3:0:0 -o my_job.bolt -j my_job -A z01-budget my_prog.x arg1 arg2\n
These examples generate the job script my_job.bolt
with the correct options to run my_prog.x
with command line arguments arg1
and arg2
. The project code against which the job will be charged is specified with the ' -A ' option. As usual, the job script is submitted as follows:
sbatch my_job.bolt\n
Hint
If you do not specify the script name with the '-o' option then your script will be a file called a.bolt
.
Hint
If you do not specify the number of parallel tasks then bolt will try to generate a serial job submission script (and throw an error on the ARCHER2 4 cabinet system as serial jobs are not supported).
Hint
If you do not specify a project code, bolt will use your default project code (set by your login account).
Hint
If you do not specify a job name, bolt will use either bolt_ser_job
(for serial jobs) or bolt_par_job
(for parallel jobs).
You can access further help on using bolt on ARCHER2 with the ' -h ' option:
bolt -h\n
A selection of other useful options are:
-s
Write and submit the job script rather than just writing the job script.
-p
Force the job to be parallel even if it only uses a single parallel task.
The checkScript tool has been written to allow users to validate their job submission scripts before submitting their jobs. The tool will read your job submission script and try to identify errors, problems or inconsistencies.
An example of the sort of output the tool can give would be:
auser@ln01:/work/t01/t01/auser> checkScript submit.slurm\n\n===========================================================================\ncheckScript\n---------------------------------------------------------------------------\nCopyright 2011-2020 EPCC, The University of Edinburgh\nThis program comes with ABSOLUTELY NO WARRANTY.\nThis is free software, and you are welcome to redistribute it\nunder certain conditions.\n===========================================================================\n\nScript details\n---------------\n User: auser\nScript file: submit.slurm\n Directory: /work/t01/t01/auser (ok)\n Job name: test (ok)\n Partition: standard (ok)\n QoS: standard (ok)\nCombination: (ok)\n\nRequested resources\n-------------------\n nodes = 3 (ok)\ntasks per node = 16\n cpus per task = 8\ncores per node = 128 (ok)\nOpenMP defined = True (ok)\n walltime = 1:0:0 (ok)\n\nCU Usage Estimate (if full job time used)\n------------------------------------------\n CU = 3.000\n\n\n\ncheckScript finished: 0 warning(s) and 0 error(s).\n
"},{"location":"user-guide/scheduler/#checking-scripts-and-estimating-start-time-with-test-only","title":"Checking scripts and estimating start time with --test-only
","text":"sbatch --test-only
validates the batch script and returns an estimate of when the job would be scheduled to run given the current scheduler state. Please note that this is only an estimate: the scheduler state may have changed by the time the job is actually submitted, and will continue to change afterwards, so the actual start time may differ. The job is not actually submitted.
auser@ln01:~> sbatch --test-only submit.slurm\nsbatch: Job 1039497 to start at 2022-02-01T23:20:51 using 256 processors on nodes nid002836\nin partition standard\n
"},{"location":"user-guide/scheduler/#estimated-start-time-for-queued-jobs","title":"Estimated start time for queued jobs","text":"You can use the squeue
command to show the current estimated start time for a job. Please note that this is only an estimate; the actual start time may differ because of subsequent changes to the scheduler state. To return the estimated start time for a job, specify the job ID with the --jobs=<jobid>
and --Format=StartTime
options.
For example, to show the estimated start time for job 123456
, you would use:
squeue --jobs=123456 --Format=StartTime\n
The output from this command would look like:
START_TIME\n2024-09-25T13:07:00\n
"},{"location":"user-guide/scheduler/#example-job-submission-scripts","title":"Example job submission scripts","text":"A subset of example job submission scripts are included in full below. Examples are provided for both the full system and the 4-cabinet system.
"},{"location":"user-guide/scheduler/#example-job-submission-script-for-mpi-parallel-job","title":"Example: job submission script for MPI parallel job","text":"A simple MPI job submission script to submit a job using 4 compute nodes and 128 MPI ranks per node for 20 minutes would look like:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_MPI_Job\n#SBATCH --time=0:20:0\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 512 MPI processes and 128 MPI processes per node\n# srun picks up the distribution from the sbatch options\n\nsrun --distribution=block:block --hint=nomultithread ./my_mpi_executable.x\n
This will run your executable \"my_mpi_executable.x\" in parallel on 512 MPI processes using 4 nodes (128 cores per node, i.e. not using hyper-threading). Slurm will allocate 4 nodes to your job and srun will place 128 MPI processes on each node (one per physical core).
See above for a more detailed discussion of the different sbatch
options.
Mixed mode codes that use both MPI (or another distributed memory parallel model) and OpenMP should take care to ensure that the shared memory portion of the process/thread placement does not span more than one NUMA region. Nodes on ARCHER2 are made up of two sockets each containing 4 NUMA regions of 16 cores, i.e. there are 8 NUMA regions in total. Therefore the number of OpenMP threads per MPI process should ideally be no greater than 16 and should be a factor of 16. Sensible choices for the number of threads are therefore 1 (single-threaded), 2, 4, 8, and 16. More information about using OpenMP and MPI+OpenMP can be found in the Tuning chapter.
To ensure correct placement of MPI processes the number of cpus-per-task needs to match the number of OpenMP threads, and the number of tasks-per-node should be set to ensure the entire node is filled with MPI tasks.
In the example below, we are using 4 nodes for 20 minutes. There are 32 MPI processes in total (8 MPI processes per node) and 16 OpenMP threads per MPI process. This results in all 128 physical cores per node being used.
Hint
Note the use of the export OMP_PLACES=cores
environment option to generate the correct thread pinning.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_MPI_Job\n#SBATCH --time=0:20:0\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the number of threads to 16 and specify placement\n# There are 16 OpenMP threads per MPI process\n# We want one thread per physical core\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\n\n# Launch the parallel job\n# Using 32 MPI processes\n# 8 MPI processes per node\n# 16 OpenMP threads per MPI process\n# Additional srun options to pin one thread per physical core\nsrun --hint=nomultithread --distribution=block:block ./my_mixed_executable.x arg1 arg2\n
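The relationship between threads per process and processes per node described above can be checked with a short shell sketch. It assumes the node layout stated in this section (128 physical cores per node, 16 cores per NUMA region); the variable names are illustrative only and are not part of any ARCHER2 tooling.

```shell
#!/bin/bash
# Assumed node layout (taken from the text above): 128 physical cores per
# node, arranged as 8 NUMA regions of 16 cores each.
cores_per_node=128
numa_cores=16

layouts=''
for threads in 1 2 4 8 16
do
  # The thread count must divide the NUMA region size so that no MPI
  # process spans two NUMA regions
  if [ $((numa_cores % threads)) -eq 0 ]
  then
    # Fill the node: tasks per node multiplied by threads equals 128
    tasks=$((cores_per_node / threads))
    layouts=${layouts}${threads}x${tasks}' '
    echo --ntasks-per-node=${tasks} --cpus-per-task=${threads}
  fi
done
```

Each printed pair fills all 128 physical cores; the last one (8 tasks per node, 16 threads) matches the script above.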
"},{"location":"user-guide/scheduler/#job-arrays","title":"Job arrays","text":"The Slurm job scheduling system offers the job array concept, for running collections of almost-identical jobs. For example, running the same program several times with different arguments or input data.
Each job in a job array is called a subjob. The subjobs of a job array can be submitted and queried as a unit, making it easier and cleaner to handle the full set, compared to individual jobs.
All subjobs in a job array are started by running the same job script. The job script also contains information on the number of jobs to be started, and Slurm provides a subjob index which can be passed to the individual subjobs or used to select the input data per subjob.
"},{"location":"user-guide/scheduler/#job-script-for-a-job-array","title":"Job script for a job array","text":"As an example, the following script runs 56 subjobs, with the subjob index as the only argument to the executable. Each subjob requests a single node and uses all 128 cores on the node by placing 1 MPI process per core and specifies 4 hours maximum runtime per subjob:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Array_Job\n#SBATCH --time=04:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --array=0-55\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\nsrun --distribution=block:block --hint=nomultithread /path/to/exe $SLURM_ARRAY_TASK_ID\n
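The subjob index can also be used to select input data rather than being passed directly to the executable. The sketch below assumes a hypothetical naming scheme (input_000.dat to input_055.dat); the function name and file names are illustrative and not part of the original example.

```shell
#!/bin/bash
# Map a subjob index to a zero-padded input file name.
# The naming scheme input_NNN.dat is a hypothetical example.
input_for_task() {
  printf 'input_%03d.dat' $1
}

# Slurm sets SLURM_ARRAY_TASK_ID inside each subjob; default to 0 here so
# that the sketch also runs outside a job.
task_id=${SLURM_ARRAY_TASK_ID:-0}
input_file=$(input_for_task ${task_id})

echo subjob ${task_id} would read ${input_file}
```

In a job script, the resulting file name would be passed to the executable in place of the bare index.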
"},{"location":"user-guide/scheduler/#submitting-a-job-array","title":"Submitting a job array","text":"Job arrays are submitted using sbatch
in the same way as for standard jobs:
sbatch job_script.pbs\n
"},{"location":"user-guide/scheduler/#expressing-dependencies-between-jobs","title":"Expressing dependencies between jobs","text":"SLURM allows one to express dependencies between jobs using the --dependency
(or -d
) option. This allows the start of execution of the dependent job to be delayed until some condition involving a current or previous job, or set of jobs, has been satisfied. A simple example might be:
$ sbatch --dependency=4394150 myscript.sh\nSubmitted batch job 4394325\n
This states that the execution of the new batch job should not start until job 4394150 has completed/terminated. Here, completion/termination is the only condition. The new job 4394325 should appear in the pending state with reason (Dependency)
assuming 4394150 is still running. A dependency may be of a different type, of which there are a number of relevant possibilities. If we explicitly include the default type afterany
in the example above, we would have
$ sbatch --dependency=afterany:4394150 myscript.sh\nSubmitted batch job 4394325\n
This emphasises that the first job may complete with any exit code, and still satisfy the dependency. If we wanted a dependent job which would only become eligible for execution following successful completion of the dependency, we would use afterok
: $ sbatch --dependency=afterok:4394150 myscript.sh\nSubmitted batch job 4394325\n
This means that, should the dependency fail with a non-zero exit code, the dependent job will be left in a state where it can never run. This may appear in squeue
as (DependencyNeverSatisfied)
as the reason. Such jobs will need to be cancelled. The general form of the dependency list is <type:job_id[:job_id] [,type:job_id ...]>
where a dependency may include one or more jobs, with one or more types. If a list is comma-separated, all the dependencies must be satisfied before the dependent job becomes eligible. The use of ?
as the list separator implies that any of the dependencies is sufficient.
Useful type options include afterany
, afterok
, and afternotok
. For the last case, the dependency is only satisfied if there is a non-zero exit code (the opposite of afterok
). See the current SLURM documentation for a full list of possibilities.
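As an illustration of the list syntax described above, the following sketch builds a dependency list from a set of job IDs. The helper name join_deps and the job IDs are hypothetical; only the separators (comma for all, ? for any) and the type names come from the text.

```shell
#!/bin/bash
# Disable filename globbing so that ? in a dependency list is kept literal
set -f

# Build a dependency list from a dependency type, a separator and job IDs.
# A comma means all dependencies must be satisfied; ? means any one suffices.
join_deps() {
  dep_type=$1
  sep=$2
  shift 2
  list=''
  for id in $@
  do
    # Prepend the separator only when the list is already non-empty
    list=${list:+${list}${sep}}${dep_type}:${id}
  done
  echo ${list}
}

# Hypothetical job IDs, for illustration only
all_of=$(join_deps afterok , 4394150 4394151)
any_of=$(join_deps afterok '?' 4394150 4394151)

# Print the commands rather than running them
echo sbatch --dependency=${all_of} myscript.sh
echo sbatch --dependency=${any_of} myscript.sh
```

The sketch only prints the sbatch commands, so it can be run safely on a login node to check the generated option before submitting anything.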
Job dependencies can be used to construct complex pipelines or chain together long simulations requiring multiple steps.
For example, if we have just two jobs, the following shell script extract will submit the second dependent on the first, irrespective of actual job ID:
jobid=$(sbatch --parsable first_job.sh)\nsbatch --dependency=afterok:${jobid} second_job.sh\n
where we have used the --parsable
option to sbatch
to return just the new job ID (without the Submitted batch job
). This can be extended to a longer chain as required. E.g.:
jobid1=$(sbatch --parsable first_job.sh)\njobid2=$(sbatch --parsable --dependency=afterok:${jobid1} second_job.sh)\njobid3=$(sbatch --parsable --dependency=afterok:${jobid1} third_job.sh)\nsbatch --dependency=afterok:${jobid2},afterok:${jobid3} last_job.sh\n
Note jobs 2 and 3 are dependent on job 1 (only), but the final job is dependent on both jobs 2 and 3. This allows quite general workflows to be constructed."},{"location":"user-guide/scheduler/#number-of-jobs-not-known-in-advance","title":"Number of jobs not known in advance","text":"This automation may be taken a step further to a case where a submission script propagates itself. E.g., a script might include, schematically,
#SBATCH ...\n\n# submit new job here ...\nsbatch --dependency=afterok:${SLURM_JOB_ID} thisscript.sh\n\n# perform work here...\nsrun ...\n
where the original submission of the script will submit a new instance of itself dependent on its own successful completion. This is done via the SLURM environment variable SLURM_JOB_ID
which holds the id of the current job. One could defer the sbatch
until the end of the script to avoid the dependency never being satisfied if the work associated with the srun
fails. This approach can be useful in situations where, e.g., simulations with checkpoint/restart need to continue until some criterion is met. Some care may be required to ensure the script logic is correct in determining the criterion for stopping: it is best to start with a small/short test example. Incorrect logic and/or errors may lead to a rapid proliferation of submitted jobs. Termination of such chains needs to be arranged either via appropriate logic in the script, or manual intervention to cancel pending jobs when no longer required.
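One way to guard against runaway chains is to make the stopping logic explicit and cap the chain length. The sketch below is an assumption about how this might be arranged, not the documented ARCHER2 method: the marker file simulation.done, the variable CHAIN_LINK and the cap of 10 are all hypothetical, and the real sbatch call is shown only as a comment.

```shell
#!/bin/bash
# Hypothetical cap on the number of jobs in the chain
max_links=10
# Hypothetical counter passed from one job to the next; 1 for the first job
link=${CHAIN_LINK:-1}

# Return success only if another job should be submitted
should_resubmit() {
  # Stop if the simulation has written its (hypothetical) completion marker
  if [ -e simulation.done ]
  then
    return 1
  fi
  # Stop if the chain has reached its maximum length
  if [ $1 -ge $2 ]
  then
    return 1
  fi
  return 0
}

if should_resubmit ${link} ${max_links}
then
  next=$((link + 1))
  # In a real job script the next line would be something like:
  #   sbatch --export=ALL,CHAIN_LINK=${next} --dependency=afterok:${SLURM_JOB_ID} thisscript.sh
  echo would resubmit as link ${next}
fi
```

Testing the decision function on its own, before attaching it to a real sbatch call, follows the advice above to start with a small test example.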
"},{"location":"user-guide/scheduler/#using-multiple-srun-commands-in-a-single-job-script","title":"Using multiplesrun
commands in a single job script","text":"You can use multiple srun
commands within a Slurm job submission script to use the requested resources more flexibly. For example, you could run a collection of smaller jobs within the requested resources or you could even subdivide nodes if your individual calculations do not scale up to use all 128 cores on a node.
In this guide we will cover two scenarios:
When subdividing a larger job into smaller subjobs you typically need to override the --nodes
option to srun
and add the --ntasks
option to ensure that each subjob runs on the correct number of nodes and that subjobs are placed correctly onto separate nodes.
For example, we will show how to request 100 nodes and then run 100 separate 1-node jobs, each of which uses 128 MPI processes and runs on a different compute node. We start by showing the job script that would achieve this and then explain how this works and the options used. In our case, we will run 100 copies of the xthi
program that prints the process placement on the node it is running on.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=multi_xthi\n#SBATCH --time=0:20:0\n#SBATCH --nodes=100\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the xthi module\nmodule load xthi\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Loop over 100 subjobs starting each of them on a separate node\nfor i in $(seq 1 100)\ndo\n# Launch this subjob on 1 node, note nodes and ntasks options and & to place subjob in the background\n srun --nodes=1 --ntasks=128 --distribution=block:block --hint=nomultithread xthi > placement${i}.txt &\ndone\n# Wait for all background subjobs to finish\nwait\n
Key points from the example job script:
#SBATCH
options select 100 full nodes in the usual way.srun
command sets the following:--nodes=1
We need to override this setting from the main job so that each subjob only uses 1 node. --ntasks=128
For normal jobs, the number of parallel tasks (MPI processes) is calculated from the number of nodes you request and the number of tasks per node. We need to explicitly tell srun
how many we require for this subjob.--distribution=block:block --hint=nomultithread
These options ensure correct placement of processes within the compute nodes.&
Each subjob srun
command ends with an ampersand to place the process in the background and move on to the next loop iteration (and subjob submission). Without this, the script would wait for this subjob to complete before moving on to submit the next.wait
command to tell the script to wait for all the background subjobs to complete before exiting. If we did not have this in place, the script would exit as soon as the last subjob was submitted and kill all running subjobs. As the ARCHER2 nodes contain a large number of cores (128 per node) it may sometimes be useful to be able to run multiple executables on a single node. For example, you may want to run 128 copies of a serial executable or Python script; or, you may want to run multiple copies of parallel executables that use fewer than 128 cores each. This use model is possible using multiple srun
commands in a job script on ARCHER2.
Note
You can never share a compute node with another user. Although you can use srun
to place multiple copies of an executable or script on a compute node, you still have exclusive use of that node. The minimum amount of resources you can reserve for your use on ARCHER2 is a single node.
When using srun
to place multiple executables or scripts on a compute node you must be aware of a few things:
srun
command must specify any Slurm options that differ in value from those specified to sbatch
. This typically means that you need to specify the --nodes
, --ntasks
and --ntasks-per-node
options to srun
.--exact
flag to your srun
command. With this flag on, Slurm will ensure that the resources you request are assigned to your subjob. Furthermore, if the resources are not currently available, Slurm will output a message letting you know that this is the case and stall the launch of this subjob until enough of your previous subjobs have completed to free up the resources for this subjob.--mem=<amount of memory>
flag. The amount of memory is given in MiB by default but other units can be specified. If you do not know how much memory to specify, we recommend that you specify 1500M (1,500 MiB) per core being used.srun
command into the background and then use the wait
command at the end of the submission script to make sure it does not exit before the commands are complete.srun
per node (e.g. 256 single core processes across 2 nodes) then you need to pass the node ID to the srun
commands otherwise Slurm will oversubscribe cores on the first node. Below, we provide four examples of running multiple subjobs on a node: one that runs 128 serial processes across a single node; one that runs 8 subjobs each of which uses 8 MPI processes with 2 OpenMP threads per MPI process; one that runs four inhomogeneous jobs, each of which requires a different number of MPI processes and OpenMP threads per process; and one that runs 256 serial processes across two nodes.
"},{"location":"user-guide/scheduler/#example-1-128-serial-tasks-running-on-a-single-node","title":"Example 1: 128 serial tasks running on a single node","text":"For our first example, we will run 128 single-core copies of the xthi
program (which prints process/thread placement) on a single ARCHER2 compute node with each copy of xthi
pinned to a different core. The job submission script for this example would look like:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiSerialOnCompute\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Loop over 128 subjobs pinning each to a different core\nfor i in $(seq 1 128)\ndo\n# Launch subjob overriding job settings as required and in the background\n# Make sure to change the amount specified by the `--mem=` flag to the amount\n# of memory required. The amount of memory is given in MiB by default but other\n# units can be specified. If you do not know how much memory to specify, we\n# recommend that you specify `--mem=1500M` (1,500 MiB).\nsrun --nodes=1 --ntasks=1 --ntasks-per-node=1 \\\n --exact --mem=1500M xthi > placement${i}.txt &\ndone\n\n# Wait for all subjobs to finish\nwait\n
"},{"location":"user-guide/scheduler/#example-2-8-subjobs-on-1-node-each-with-8-mpi-processes-and-2-openmp-threads-per-process","title":"Example 2: 8 subjobs on 1 node each with 8 MPI processes and 2 OpenMP threads per process","text":"For our second example, we will run 8 subjobs, each running the xthi
program (which prints process/thread placement) across 1 node. Each subjob will use 8 MPI processes and 2 OpenMP threads per process. The job submission script for this example would look like:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiParallelOnCompute\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=64\n#SBATCH --cpus-per-task=2\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to 2 as required by all subjobs\nexport OMP_NUM_THREADS=2\n\n# Loop over 8 subjobs\nfor i in $(seq 1 8)\ndo\n echo $i\n # Launch subjob overriding job settings as required and in the background\n # Make sure to change the amount specified by the `--mem=` flag to the amount\n # of memory required. The amount of memory is given in MiB by default but other\n # units can be specified. If you do not know how much memory to specify, we\n # recommend that you specify `--mem=12500M` (12,500 MiB).\n srun --nodes=1 --ntasks=8 --ntasks-per-node=8 --cpus-per-task=2 \\\n --exact --mem=12500M xthi > placement${i}.txt &\ndone\n\n# Wait for all subjobs to finish\nwait\n
"},{"location":"user-guide/scheduler/#example-3-running-inhomogeneous-subjobs-on-one-node","title":"Example 3: Running inhomogeneous subjobs on one node","text":"For our third example, we will run 4 subjobs, each running the xthi
program (which prints process/thread placement) across 1 node. Our subjobs will each run with a different number of MPI processes and OpenMP threads. We will run: one job with 64 MPI processes and 1 OpenMP thread per process; one job with 16 MPI processes and 2 OpenMP threads per process; one job with 4 MPI processes and 4 OpenMP threads per process; and one job with 1 MPI process and 16 OpenMP threads per process.
To be able to change the number of MPI processes and OpenMP threads per process, we will need to forgo using the #SBATCH --ntasks-per-node
and the #SBATCH --cpus-per-task
options. If you set these, Slurm will not let you alter the OMP_NUM_THREADS
variable and you will not be able to change the number of OpenMP threads per process between each job.
Before each srun
command, you will need to define the number of OpenMP threads per process you want by changing the OMP_NUM_THREADS
variable. Furthermore, for each srun
command, you will need to set the --ntasks
flag to equal the number of MPI processes you want to use. You will also need to set the --cpus-per-task
flag to equal the number of OpenMP threads per process you want to use.
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiParallelOnCompute\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to the value required by the first job\nexport OMP_NUM_THREADS=1\nsrun --ntasks=64 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Set the number of threads to the value required by the second job\nexport OMP_NUM_THREADS=2\nsrun --ntasks=16 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Set the number of threads to the value required by the third job\nexport OMP_NUM_THREADS=4\nsrun --ntasks=4 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Set the number of threads to the value required by the fourth job\nexport OMP_NUM_THREADS=16\nsrun --ntasks=1 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Wait for all subjobs to finish\nwait\n
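Before submitting an inhomogeneous layout like the one in Example 3, it can help to check that the subjobs together fit on one node. The sketch below adds up tasks multiplied by cpus-per-task for the four srun commands above, assuming 128 physical cores per node; the ntasks:cpus notation is just a convenience for this sketch.

```shell
#!/bin/bash
# Physical cores available on one node (from the text above)
cores_per_node=128

# Each entry is ntasks:cpus-per-task for one srun command in Example 3
subjobs='64:1 16:2 4:4 1:16'

total=0
for spec in $subjobs
do
  ntasks=${spec%%:*}   # text before the colon
  cpus=${spec##*:}     # text after the colon
  total=$((total + ntasks * cpus))
done

if [ $total -le $cores_per_node ]
then
  echo layout fits: ${total} of ${cores_per_node} cores used
else
  echo layout oversubscribes the node: ${total} cores requested
fi
```

For the four subjobs above the total is 64 + 32 + 16 + 16 = 128 cores, so the node is exactly filled.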
"},{"location":"user-guide/scheduler/#example-4-256-serial-tasks-running-across-two-nodes","title":"Example 4: 256 serial tasks running across two nodes","text":"For our fourth example, we will run 256 single-core copies of the xthi
program (which prints process/thread placement) across two ARCHER2 compute nodes with each copy of xthi
pinned to a different core. We will illustrate a mechanism for getting the node IDs to pass to srun
as this is required to ensure that the individual subjobs are assigned to the correct node. This mechanism uses the scontrol
command to turn the nodelist from sbatch
into a format we can use as input to srun
. The job submission script for this example would look like:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiSerialOnComputes\n#SBATCH --time=0:10:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Get a list of the nodes assigned to this job in a format we can use.\n# scontrol converts the condensed node IDs in the sbatch environment\n# variable into a list of full node IDs that we can use with srun to\n# ensure the subjobs are placed on the correct node. e.g. this converts\n# \"nid[001234,002345]\" to \"nid001234 nid002345\"\nnodelist=$(scontrol show hostnames $SLURM_JOB_NODELIST)\n\n# Loop over the nodes assigned to the job\nfor nodeid in $nodelist\ndo\n # Loop over 128 subjobs on each node pinning each to a different core\n for i in $(seq 1 128)\n do\n # Launch subjob overriding job settings as required and in the background\n # Make sure to change the amount specified by the `--mem=` flag to the amount\n # of memory required. The amount of memory is given in MiB by default but other\n # units can be specified. 
If you do not know how much memory to specify, we\n # recommend that you specify `--mem=1500M` (1,500 MiB).\n srun --nodelist=${nodeid} --nodes=1 --ntasks=1 --ntasks-per-node=1 \\\n --exact --mem=1500M xthi > placement_${nodeid}_${i}.txt &\n done\ndone\n\n# Wait for all subjobs to finish\nwait\n
"},{"location":"user-guide/scheduler/#process-placement","title":"Process placement","text":"There are many occasions where you may want to control (usually, MPI) process placement and change it from the default, for example:
There are a number of different methods for defining process placement; below we cover two different options: using Slurm options and using the MPICH_RANK_REORDER_METHOD
environment variable. Most users will likely use the Slurm options approach.
The standard approach recommended on ARCHER2 is to place processes sequentially on nodes until the maximum number of tasks is reached. You can use the xthi
program to verify this for MPI process placement:
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=128 \\\n --cpus-per-task=1 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170365\nsalloc: job 1170365 queued and waiting for resources\nsalloc: job 1170365 has been allocated resources\nsalloc: Granted job allocation 1170365\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002526-002527] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\nauser@ln04:/work/t01/t01/auser> srun --distribution=block:block --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002526, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002527, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\n\n...output trimmed...\n\nNode 0, rank 124, thread 0, (affinity = 124)\nNode 0, rank 125, thread 0, (affinity = 125)\nNode 0, rank 126, thread 0, (affinity = 126)\nNode 0, rank 127, thread 0, (affinity = 127)\nNode 1, rank 128, thread 0, (affinity = 0)\nNode 1, rank 129, thread 0, (affinity = 1)\nNode 1, rank 130, thread 0, (affinity = 2)\nNode 1, rank 131, thread 0, (affinity = 3)\n\n...output trimmed...\n
Note
For MPI programs on ARCHER2, each rank corresponds to a process.
Important
To get good performance out of MPI collective operations, MPI processes should be placed sequentially on cores as in the standard placement described above.
"},{"location":"user-guide/scheduler/#setting-process-placement-using-slurm-options","title":"Setting process placement using Slurm options","text":""},{"location":"user-guide/scheduler/#for-underpopulation-of-nodes-with-processes","title":"For underpopulation of nodes with processes","text":"When you are using fewer processes than cores on compute nodes (i.e. < 128 processes per node) the basic Slurm options (usually supplied in your script as options to sbatch
) for process placement are:
--ntasks-per-node=X
Place X processes on each node. --cpus-per-task=Y
Set a stride of Y cores between each placed process. If you specify this option in a job submission script (queued using sbatch
) or via salloc
then you will also need to set export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
to ensure the setting is passed to srun
commands in the script or allocation. In addition, the following options are added to your srun
commands in your job submission script:
--hint=nomultithread
Only use physical cores (avoids use of SMT/hyperthreads)--distribution=block:block
Allocate processes to cores in a sequential fashion. For example, to place 32 processes per node and have 1 process per 4-core block (corresponding to a CCX, Core CompleX, that shares an L3 cache), you would set:
--ntasks-per-node=32
Place 32 processes on each node. --cpus-per-task=4
Set a stride of 4 cores between each placed process. Here is the output from xthi
:
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=32 \\\n --cpus-per-task=4 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170383\nsalloc: job 1170383 queued and waiting for resources\nsalloc: job 1170383 has been allocated resources\nsalloc: Granted job allocation 1170383\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002526-002527] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\nauser@ln04:/work/t01/t01/auser> srun --distribution=block:block --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002526, mpi 32, omp 1, executable xthi\nNode 1, hostname nid002527, mpi 32, omp 1, executable xthi\nMPI summary: 64 ranks\nNode 0, rank 0, thread 0, (affinity = 0-3)\nNode 0, rank 1, thread 0, (affinity = 4-7)\nNode 0, rank 2, thread 0, (affinity = 8-11)\nNode 0, rank 3, thread 0, (affinity = 12-15)\nNode 0, rank 4, thread 0, (affinity = 16-19)\nNode 0, rank 5, thread 0, (affinity = 20-23)\nNode 0, rank 6, thread 0, (affinity = 24-27)\nNode 0, rank 7, thread 0, (affinity = 28-31)\nNode 0, rank 8, thread 0, (affinity = 32-35)\nNode 0, rank 9, thread 0, (affinity = 36-39)\nNode 0, rank 10, thread 0, (affinity = 40-43)\nNode 0, rank 11, thread 0, (affinity = 44-47)\nNode 0, rank 12, thread 0, (affinity = 48-51)\nNode 0, rank 13, thread 0, (affinity = 52-55)\nNode 0, rank 14, thread 0, (affinity = 56-59)\nNode 0, rank 15, thread 0, (affinity = 60-63)\nNode 0, rank 16, thread 0, (affinity = 64-67)\nNode 0, rank 17, thread 0, (affinity = 68-71)\nNode 0, rank 18, thread 0, (affinity = 72-75)\nNode 0, rank 19, thread 0, (affinity = 76-79)\nNode 0, rank 20, thread 0, (affinity = 80-83)\nNode 0, rank 21, thread 0, (affinity = 84-87)\nNode 0, rank 22, thread 0, (affinity = 88-91)\nNode 0, rank 23, thread 0, 
(affinity = 92-95)\nNode 0, rank 24, thread 0, (affinity = 96-99)\nNode 0, rank 25, thread 0, (affinity = 100-103)\nNode 0, rank 26, thread 0, (affinity = 104-107)\nNode 0, rank 27, thread 0, (affinity = 108-111)\nNode 0, rank 28, thread 0, (affinity = 112-115)\nNode 0, rank 29, thread 0, (affinity = 116-119)\nNode 0, rank 30, thread 0, (affinity = 120-123)\nNode 0, rank 31, thread 0, (affinity = 124-127)\nNode 1, rank 32, thread 0, (affinity = 0-3)\nNode 1, rank 33, thread 0, (affinity = 4-7)\nNode 1, rank 34, thread 0, (affinity = 8-11)\nNode 1, rank 35, thread 0, (affinity = 12-15)\nNode 1, rank 36, thread 0, (affinity = 16-19)\nNode 1, rank 37, thread 0, (affinity = 20-23)\nNode 1, rank 38, thread 0, (affinity = 24-27)\nNode 1, rank 39, thread 0, (affinity = 28-31)\nNode 1, rank 40, thread 0, (affinity = 32-35)\nNode 1, rank 41, thread 0, (affinity = 36-39)\nNode 1, rank 42, thread 0, (affinity = 40-43)\nNode 1, rank 43, thread 0, (affinity = 44-47)\nNode 1, rank 44, thread 0, (affinity = 48-51)\nNode 1, rank 45, thread 0, (affinity = 52-55)\nNode 1, rank 46, thread 0, (affinity = 56-59)\nNode 1, rank 47, thread 0, (affinity = 60-63)\nNode 1, rank 48, thread 0, (affinity = 64-67)\nNode 1, rank 49, thread 0, (affinity = 68-71)\nNode 1, rank 50, thread 0, (affinity = 72-75)\nNode 1, rank 51, thread 0, (affinity = 76-79)\nNode 1, rank 52, thread 0, (affinity = 80-83)\nNode 1, rank 53, thread 0, (affinity = 84-87)\nNode 1, rank 54, thread 0, (affinity = 88-91)\nNode 1, rank 55, thread 0, (affinity = 92-95)\nNode 1, rank 56, thread 0, (affinity = 96-99)\nNode 1, rank 57, thread 0, (affinity = 100-103)\nNode 1, rank 58, thread 0, (affinity = 104-107)\nNode 1, rank 59, thread 0, (affinity = 108-111)\nNode 1, rank 60, thread 0, (affinity = 112-115)\nNode 1, rank 61, thread 0, (affinity = 116-119)\nNode 1, rank 62, thread 0, (affinity = 120-123)\nNode 1, rank 63, thread 0, (affinity = 124-127)\n
Tip
You usually only want to use physical cores on ARCHER2, so (ntasks-per-node
) \u00d7 (cpus-per-task
) should generally be equal to 128.
To change the order in which processes are placed on nodes and cores using Slurm options, use the --distribution
option to srun
.
For example, to place processes sequentially on nodes but round-robin on the 16-core NUMA regions in a single node, you would use the --distribution=block:cyclic
option to srun
. This type of process placement can be beneficial when a code is memory bound.
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=128 \\\n --cpus-per-task=1 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170594\nsalloc: job 1170594 queued and waiting for resources\nsalloc: job 1170594 has been allocated resources\nsalloc: Granted job allocation 1170594\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002616,002621] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> srun --distribution=block:cyclic --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002616, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002621, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 16)\nNode 0, rank 2, thread 0, (affinity = 32)\nNode 0, rank 3, thread 0, (affinity = 48)\nNode 0, rank 4, thread 0, (affinity = 64)\nNode 0, rank 5, thread 0, (affinity = 80)\nNode 0, rank 6, thread 0, (affinity = 96)\nNode 0, rank 7, thread 0, (affinity = 112)\nNode 0, rank 8, thread 0, (affinity = 1)\nNode 0, rank 9, thread 0, (affinity = 17)\nNode 0, rank 10, thread 0, (affinity = 33)\nNode 0, rank 11, thread 0, (affinity = 49)\nNode 0, rank 12, thread 0, (affinity = 65)\nNode 0, rank 13, thread 0, (affinity = 81)\nNode 0, rank 14, thread 0, (affinity = 97)\nNode 0, rank 15, thread 0, (affinity = 113)\n\n...output trimmed...\n\nNode 0, rank 120, thread 0, (affinity = 15)\nNode 0, rank 121, thread 0, (affinity = 31)\nNode 0, rank 122, thread 0, (affinity = 47)\nNode 0, rank 123, thread 0, (affinity = 63)\nNode 0, rank 124, thread 0, (affinity = 79)\nNode 0, rank 125, thread 0, (affinity = 95)\nNode 0, rank 126, thread 0, (affinity = 111)\nNode 0, rank 127, thread 0, (affinity = 127)\nNode 1, rank 128, thread 0, (affinity = 0)\nNode 1, rank 129, thread 0, (affinity = 
16)\nNode 1, rank 130, thread 0, (affinity = 32)\nNode 1, rank 131, thread 0, (affinity = 48)\nNode 1, rank 132, thread 0, (affinity = 64)\nNode 1, rank 133, thread 0, (affinity = 80)\nNode 1, rank 134, thread 0, (affinity = 96)\nNode 1, rank 135, thread 0, (affinity = 112)\n\n...output trimmed...\n
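The pattern in the listing above can be reproduced with a little arithmetic. The following bash sketch assumes, as the sample output suggests, 128 cores per node arranged as 8 regions of 16 cores, with ranks counted from 0 within each node. The function name block_cyclic_core is hypothetical, and the formula is inferred from the sample output rather than taken from the Slurm documentation.

```shell
# Hypothetical helper: predicts the core a rank lands on within its node
# for --distribution=block:cyclic, assuming 8 regions of 16 cores.
block_cyclic_core() {
    local local_rank=$1
    local regions=8 cores_per_region=16
    # Consecutive ranks step through the regions; every 8th rank moves
    # one core further along inside each region.
    echo $(( (local_rank % regions) * cores_per_region + local_rank / regions ))
}

block_cyclic_core 9      # rank 9 on node 0
block_cyclic_core 120    # rank 120 on node 0
```

For the second node, subtract the 128 ranks placed on the first node before calling the function (global rank 129 is local rank 1).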
If you wish to place processes round-robin on both nodes and 16-core regions (cores that share access to a single DRAM memory controller) within a node, you would use --distribution=cyclic:cyclic
:
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=128 \\\n --cpus-per-task=1 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170594\nsalloc: job 1170594 queued and waiting for resources\nsalloc: job 1170594 has been allocated resources\nsalloc: Granted job allocation 1170594\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002616,002621] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\nauser@ln04:/work/t01/t01/auser> srun --distribution=cyclic:cyclic --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002616, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002621, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 2, thread 0, (affinity = 16)\nNode 0, rank 4, thread 0, (affinity = 32)\nNode 0, rank 6, thread 0, (affinity = 48)\nNode 0, rank 8, thread 0, (affinity = 64)\nNode 0, rank 10, thread 0, (affinity = 80)\nNode 0, rank 12, thread 0, (affinity = 96)\nNode 0, rank 14, thread 0, (affinity = 112)\nNode 0, rank 16, thread 0, (affinity = 1)\nNode 0, rank 18, thread 0, (affinity = 17)\nNode 0, rank 20, thread 0, (affinity = 33)\nNode 0, rank 22, thread 0, (affinity = 49)\nNode 0, rank 24, thread 0, (affinity = 65)\nNode 0, rank 26, thread 0, (affinity = 81)\nNode 0, rank 28, thread 0, (affinity = 97)\nNode 0, rank 30, thread 0, (affinity = 113)\n\n...output trimmed...\n\nNode 1, rank 1, thread 0, (affinity = 0)\nNode 1, rank 3, thread 0, (affinity = 16)\nNode 1, rank 5, thread 0, (affinity = 32)\nNode 1, rank 7, thread 0, (affinity = 48)\nNode 1, rank 9, thread 0, (affinity = 64)\nNode 1, rank 11, thread 0, (affinity = 80)\nNode 1, rank 13, thread 0, (affinity = 96)\nNode 1, rank 15, thread 0, (affinity = 112)\nNode 1, rank 
17, thread 0, (affinity = 1)\nNode 1, rank 19, thread 0, (affinity = 17)\nNode 1, rank 21, thread 0, (affinity = 33)\nNode 1, rank 23, thread 0, (affinity = 49)\nNode 1, rank 25, thread 0, (affinity = 65)\nNode 1, rank 27, thread 0, (affinity = 81)\nNode 1, rank 29, thread 0, (affinity = 97)\nNode 1, rank 31, thread 0, (affinity = 113)\n\n...output trimmed...\n
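The same arithmetic extends to the round-robin case shown above. This bash sketch assumes the two-node allocation from the example and 8 regions of 16 cores per node; the function name cyclic_cyclic_place is hypothetical and the mapping is inferred from the sample output, not from the Slurm documentation.

```shell
# Hypothetical helper: predicts node and core for a global MPI rank under
# --distribution=cyclic:cyclic on 2 nodes with 8 regions of 16 cores each.
cyclic_cyclic_place() {
    local rank=$1 nodes=2
    local regions=8 cores_per_region=16
    # Ranks alternate between nodes first ...
    local node=$(( rank % nodes ))
    local local_rank=$(( rank / nodes ))
    # ... then step through the regions within the node.
    local core=$(( (local_rank % regions) * cores_per_region + local_rank / regions ))
    echo $node $core
}

cyclic_cyclic_place 16   # node and core for rank 16
cyclic_cyclic_place 31   # node and core for rank 31
```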
Remember, MPI collective performance is generally much worse if processes are not placed sequentially on a node (so adjacent MPI ranks are as close to each other as possible). This is the reason that the default recommended placement on ARCHER2 is sequential rather than round-robin.
"},{"location":"user-guide/scheduler/#mpich_rank_reorder_method-for-mpi-process-placement","title":"MPICH_RANK_REORDER_METHOD
for MPI process placement","text":"The MPICH_RANK_REORDER_METHOD
environment variable can also be used to specify other types of MPI task placement. For example, setting it to \"0\" results in a round-robin placement on both nodes and NUMA regions in a node (equivalent to the --distribution=cyclic:cyclic
option to srun
). Note, we do not specify the --distribution
option to srun
in this case as the environment variable is controlling placement:
salloc --nodes=8 --ntasks-per-node=2 --cpus-per-task=1 --time=0:10:0 --account=t01\n\nsalloc: Granted job allocation 24236\nsalloc: Waiting for resource configuration\nsalloc: Nodes cn13 are ready for job\n\nmodule load xthi\nexport OMP_NUM_THREADS=1\nexport MPICH_RANK_REORDER_METHOD=0\nsrun --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002616, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002621, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 2, thread 0, (affinity = 16)\nNode 0, rank 4, thread 0, (affinity = 32)\nNode 0, rank 6, thread 0, (affinity = 48)\nNode 0, rank 8, thread 0, (affinity = 64)\nNode 0, rank 10, thread 0, (affinity = 80)\nNode 0, rank 12, thread 0, (affinity = 96)\nNode 0, rank 14, thread 0, (affinity = 112)\nNode 0, rank 16, thread 0, (affinity = 1)\nNode 0, rank 18, thread 0, (affinity = 17)\nNode 0, rank 20, thread 0, (affinity = 33)\nNode 0, rank 22, thread 0, (affinity = 49)\nNode 0, rank 24, thread 0, (affinity = 65)\nNode 0, rank 26, thread 0, (affinity = 81)\nNode 0, rank 28, thread 0, (affinity = 97)\nNode 0, rank 30, thread 0, (affinity = 113)\n\n...output trimmed...\n
There are other modes available with the MPICH_RANK_REORDER_METHOD
environment variable, including one which lets the user provide a file called MPICH_RANK_ORDER
which contains a list of each task's placement on each node. These options are described in detail in the intro_mpi
man page.
For MPI applications which perform a large amount of nearest-neighbor communication, e.g., stencil-based applications on structured grids, HPE provide a tool in the perftools-base
module (loaded by default for all users) called grid_order
which can generate an MPICH_RANK_ORDER
file automatically by taking as parameters the dimensions of the grid, core count, etc. For example, to place 256 MPI processes in row-major order on a Cartesian grid of size $(8, 8, 4)$, using 128 cores per node:
grid_order -R -c 128 -g 8,8,4\n\n# grid_order -R -Z -c 128 -g 8,8,4\n# Region 3: 0,0,1 (0..255)\n0,1,2,3,32,33,34,35,64,65,66,67,96,97,98,99,128,129,130,131,160,161,162,163,192,193,194,195,224,225,226,227,4,5,6,7,36,37,38,39,68,69,70,71,100,101,102,103,132,133,134,135,164,165,166,167,196,197,198,199,228,229,230,231,8,9,10,11,40,41,42,43,72,73,74,75,104,105,106,107,136,137,138,139,168,169,170,171,200,201,202,203,232,233,234,235,12,13,14,15,44,45,46,47,76,77,78,79,108,109,110,111,140,141,142,143,172,173,174,175,204,205,206,207,236,237,238,239\n16,17,18,19,48,49,50,51,80,81,82,83,112,113,114,115,144,145,146,147,176,177,178,179,208,209,210,211,240,241,242,243,20,21,22,23,52,53,54,55,84,85,86,87,116,117,118,119,148,149,150,151,180,181,182,183,212,213,214,215,244,245,246,247,24,25,26,27,56,57,58,59,88,89,90,91,120,121,122,123,152,153,154,155,184,185,186,187,216,217,218,219,248,249,250,251,28,29,30,31,60,61,62,63,92,93,94,95,124,125,126,127,156,157,158,159,188,189,190,191,220,221,222,223,252,253,254,255\n
One can then save this output to a file called MPICH_RANK_ORDER
and then set MPICH_RANK_REORDER_METHOD=3
before running the job, which tells Cray MPI to read the MPICH_RANK_ORDER
file to set the MPI task placement. For more information, please see the man page man grid_order
.
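Put together, the steps described above might look like the following in a job script or interactive session. This is a sketch only: the guard around grid_order is an addition so that the fragment is harmless on systems without the tool, and the executable name in the final comment is a placeholder.

```shell
# Generate the rank order file only where grid_order is available
# (on ARCHER2 it is provided by the perftools-base module).
if command -v grid_order > /dev/null 2>&1; then
    grid_order -R -c 128 -g 8,8,4 > MPICH_RANK_ORDER
fi

# Tell Cray MPI to take the task placement from the MPICH_RANK_ORDER file.
export MPICH_RANK_REORDER_METHOD=3

# Then launch as usual, without --distribution, for example:
# srun --hint=nomultithread ./my_mpi_executable.x
```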
salloc
to reserve resources","text":"When you are developing or debugging code you often want to run many short jobs with a small amount of code editing between runs. This can be achieved by using the login nodes to run MPI, but you may want to test on the compute nodes (e.g. you may want to test running on multiple nodes across the high performance interconnect). One of the best ways to achieve this on ARCHER2 is to use interactive jobs.
An interactive job allows you to issue srun
commands directly from the command line without using a job submission script, and to see the output from your program directly in the terminal.
You use the salloc
command to reserve compute nodes for interactive jobs.
To submit a request for an interactive job reserving 8 nodes (1024 physical cores) for 20 minutes on the short QoS you would issue the following command from the command line:
auser@ln01:> salloc --nodes=8 --ntasks-per-node=128 --cpus-per-task=1 \\\n --time=00:20:00 --partition=standard --qos=short \\\n --account=[budget code]\n
When you submit this job your terminal will display something like:
salloc: Granted job allocation 24236\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid000002 are ready for job\nauser@ln01:>\n
It may take some time for your interactive job to start. Once it runs you will enter a standard interactive terminal session (a new shell). Note that this shell is still on the front end (the prompt has not changed). Whilst the interactive session lasts you will be able to run parallel jobs on the compute nodes by issuing the srun --distribution=block:block --hint=nomultithread
command directly at your command prompt using the same syntax as you would inside a job script. The maximum number of nodes you can use is limited by resources requested in the salloc
command.
Important
If you wish the cpus-per-task
option to salloc
to propagate to srun
commands in the allocation, you will need to use the command export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
before you issue any srun
commands.
If you know you will be doing a lot of intensive debugging you may find it useful to request an interactive session lasting the expected length of your working session, say a full day.
Your session will end when you hit the requested walltime. If you wish to finish before this you should use the exit
command - this will return you to your prompt before you issued the salloc
command.
srun
directly","text":"A second way to run an interactive job is to use srun
directly in the following way (here using the short
QoS):
auser@ln01:/work/t01/t01/auser> srun --nodes=1 --exclusive --time=00:20:00 \\\n --partition=standard --qos=short --account=[budget code] \\\n --pty /bin/bash\nauser@nid001261:/work/t01/t01/auser> hostname\nnid001261\n
The --pty /bin/bash
will cause a new shell to be started on the first node of a new allocation. This is perhaps closer to what many people consider an 'interactive' job than the method using salloc
appears.
One can now issue shell commands in the usual way. A further invocation of srun
is required to launch a parallel job in the allocation.
Note
When using srun
within an interactive srun
session, you will need to include both the --overlap
and --oversubscribe
flags, and specify the number of cores you want to use:
auser@nid001261:/work/t01/t01/auser> srun --overlap --oversubscribe --distribution=block:block \\\n --hint=nomultithread --ntasks=128 ./my_mpi_executable.x\n
Without --overlap
the second srun
will block until the first one has completed. Since your interactive session was launched with srun
this means it will never actually start -- you will get repeated warnings that \"Requested nodes are busy\".
When finished, type exit
to relinquish the allocation and control will be returned to the front end.
By default, the interactive shell will retain the environment of the parent. If you want a clean shell, remember to specify --export=none
.
Most of the Slurm submissions discussed above involve running a single executable. However, there are situations where two or more distinct executables are coupled and need to be run at the same time, potentially using the same MPI communicator. This is most easily handled via the Slurm heterogeneous job mechanism.
Two common cases are discussed below: first, a client server model in which client and server each have a different MPI_COMM_WORLD
, and second the case where two or more executables share MPI_COMM_WORLD
.
MPI_COMM_WORLDs
","text":"The essential feature of a heterogeneous job here is to create a single batch submission which specifies the resource requirements for the individual components. Schematically, we would use
#!/bin/bash\n\n# Slurm specifications for the first component\n\n#SBATCH --partition=standard\n\n...\n\n#SBATCH hetjob\n\n# Slurm specifications for the second component\n\n#SBATCH --partition=standard\n\n...\n
where each new component beyond the first is introduced by the special token #SBATCH hetjob
(note this is not a normal option and is not --hetjob
). Each component must specify a partition. Such a job will appear in the scheduler as, e.g.,
50098+0 standard qscript- user PD 0:00 1 (None)\n 50098+1 standard qscript- user PD 0:00 2 (None)\n
and counts as (in this case) two separate jobs from the point of view of QoS limits. Consider a case where we have two executables which may both be parallel (in that they use MPI), both run at the same time, and communicate with each other using MPI or by some other means. In the following example, we run two different executables, xthi-a
and xthi-b
, both of which must finish before the job completes.
#!/bin/bash\n\n#SBATCH --time=00:20:00\n#SBATCH --exclusive\n#SBATCH --export=none\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=8\n\n#SBATCH hetjob\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=4\n\n# Run two executables with separate MPI_COMM_WORLD\n\nsrun --distribution=block:block --hint=nomultithread --het-group=0 ./xthi-a &\nsrun --distribution=block:block --hint=nomultithread --het-group=1 ./xthi-b &\nwait\n
In this case, each executable is launched with a separate call to srun
but specifies a different heterogeneous group via the --het-group
option. The first group is --het-group=0
. Both are run in the background with &
and the wait
is required to ensure both executables have completed before the job submission exits. The above is a rather artificial example using two executables which are in fact just symbolic links in the job directory to xthi
, used without loading the module. You can test this script yourself by creating symbolic links to the original executable before submitting the job:
auser@ln04:/work/t01/t01/auser/job-dir> module load xthi\nauser@ln04:/work/t01/t01/auser/job-dir> which xthi\n/work/y07/shared/utils/core/xthi/1.2/CRAYCLANG/11.0/bin/xthi\nauser@ln04:/work/t01/t01/auser/job-dir> ln -s /work/y07/shared/utils/core/xthi/1.2/CRAYCLANG/11.0/bin/xthi xthi-a\nauser@ln04:/work/t01/t01/auser/job-dir> ln -s /work/y07/shared/utils/core/xthi/1.2/CRAYCLANG/11.0/bin/xthi xthi-b\n
The example job will produce two reports showing the placement of the MPI tasks from the two instances of xthi
running in each of the heterogeneous groups. For example, the output might be
Node summary for 1 nodes:\nNode 0, hostname nid002400, mpi 8, omp 1, executable xthi-a\nMPI summary: 8 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode summary for 2 nodes:\nNode 0, hostname nid002146, mpi 4, omp 1, executable xthi-b\nNode 1, hostname nid002149, mpi 4, omp 1, executable xthi-b\nMPI summary: 8 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 1, rank 4, thread 0, (affinity = 0)\nNode 1, rank 5, thread 0, (affinity = 1)\nNode 1, rank 6, thread 0, (affinity = 2)\nNode 1, rank 7, thread 0, (affinity = 3)\n
Here we have the first executable running on one node with a communicator size 8 (ranks 0-7). The second executable runs on two nodes also with communicator size 8 (ranks 0-7, 4 ranks per node). Further examples of placement for heterogeneous jobs are given below. Finally, if your workflow requires the different heterogeneous jobs to communicate via MPI, but without sharing their MPI_COMM_WORLD
, you will need to export two new variables before your srun
commands as defined below:
export PMI_UNIVERSE_SIZE=3\nexport MPICH_SINGLE_HOST_ENABLED=0\n
"},{"location":"user-guide/scheduler/#heterogeneous-jobs-for-a-shared-mpi_com_world","title":"Heterogeneous jobs for a shared MPI_COM_WORLD
","text":"Note
The directive #SBATCH hetjob
can no longer be used for jobs requiring a shared MPI_COMM_WORLD
Note
In this approach, each hetjob
component must be on its own set of nodes. You cannot use this approach to place different hetjob
components on the same node.
If two or more heterogeneous components need to share a unique MPI_COMM_WORLD
, a single srun
invocation with the different components separated by a colon :
should be used. Arguments to the individual components of the srun
control the placement of the tasks and threads for each component. For example, running the same xthi-a
and xthi-b
executables as above but now in a shared communicator, we might run:
#!/bin/bash\n\n#SBATCH --time=00:20:00\n#SBATCH --export=none\n#SBATCH --account=[...]\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# We must specify correctly the total number of nodes required.\n#SBATCH --nodes=3\n\nSHARED_ARGS=\"--distribution=block:block --hint=nomultithread\"\n\nsrun --het-group=0 --nodes=1 --ntasks-per-node=8 ${SHARED_ARGS} ./xthi-a : \\\n --het-group=1 --nodes=2 --ntasks-per-node=4 ${SHARED_ARGS} ./xthi-b\n
The output should confirm we have a single MPI_COMM_WORLD
with a total of three nodes, xthi-a
running on one and xthi-b
on two, with ranks 0-15 extending across both executables.
Node summary for 3 nodes:\nNode 0, hostname nid002668, mpi 8, omp 1, executable xthi-a\nNode 1, hostname nid002669, mpi 4, omp 1, executable xthi-b\nNode 2, hostname nid002670, mpi 4, omp 1, executable xthi-b\nMPI summary: 16 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode 1, rank 8, thread 0, (affinity = 0)\nNode 1, rank 9, thread 0, (affinity = 1)\nNode 1, rank 10, thread 0, (affinity = 2)\nNode 1, rank 11, thread 0, (affinity = 3)\nNode 2, rank 12, thread 0, (affinity = 0)\nNode 2, rank 13, thread 0, (affinity = 1)\nNode 2, rank 14, thread 0, (affinity = 2)\nNode 2, rank 15, thread 0, (affinity = 3)\n
"},{"location":"user-guide/scheduler/#heterogeneous-placement-for-mixed-mpiopenmp-work","title":"Heterogeneous placement for mixed MPI/OpenMP work","text":"Some care may be required for placement of tasks/threads in heterogeneous jobs in which the number of threads needs to be specified differently for different components.
In the following we have two components, again using xthi-a
and xthi-b
as our two separate executables. The first component runs 8 MPI tasks each with 16 OpenMP threads on one node. The second component runs 8 MPI tasks with one task per NUMA region on a second node; each task has one thread. An appropriate Slurm submission might be:
#!/bin/bash\n\n#SBATCH --time=00:20:00\n#SBATCH --export=none\n#SBATCH --account=[...]\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=2\n\nSHARED_ARGS=\"--distribution=block:block --hint=nomultithread \\\n --nodes=1 --ntasks-per-node=8 --cpus-per-task=16\"\n\n# Do not set OMP_NUM_THREADS in the calling environment\n\nunset OMP_NUM_THREADS\nexport OMP_PROC_BIND=spread\n\nsrun --het-group=0 ${SHARED_ARGS} --export=all,OMP_NUM_THREADS=16 ./xthi-a : \\\n --het-group=1 ${SHARED_ARGS} --export=all,OMP_NUM_THREADS=1 ./xthi-b\n
The important point here is that OMP_NUM_THREADS
must not be set in the environment that calls srun
in order that the different specifications for the separate groups via --export
on the srun
command line take effect. If OMP_NUM_THREADS
is set in the calling environment, then that value takes precedence, and each component will see the same value of OMP_NUM_THREADS
.
The output might then be:
Node 0, hostname nid001111, mpi 8, omp 16, executable xthi-a\nNode 1, hostname nid001126, mpi 8, omp 1, executable xthi-b\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 0, thread 1, (affinity = 1)\nNode 0, rank 0, thread 2, (affinity = 2)\nNode 0, rank 0, thread 3, (affinity = 3)\nNode 0, rank 0, thread 4, (affinity = 4)\nNode 0, rank 0, thread 5, (affinity = 5)\nNode 0, rank 0, thread 6, (affinity = 6)\nNode 0, rank 0, thread 7, (affinity = 7)\nNode 0, rank 0, thread 8, (affinity = 8)\nNode 0, rank 0, thread 9, (affinity = 9)\nNode 0, rank 0, thread 10, (affinity = 10)\nNode 0, rank 0, thread 11, (affinity = 11)\nNode 0, rank 0, thread 12, (affinity = 12)\nNode 0, rank 0, thread 13, (affinity = 13)\nNode 0, rank 0, thread 14, (affinity = 14)\nNode 0, rank 0, thread 15, (affinity = 15)\nNode 0, rank 1, thread 0, (affinity = 16)\nNode 0, rank 1, thread 1, (affinity = 17)\n...\nNode 0, rank 7, thread 14, (affinity = 126)\nNode 0, rank 7, thread 15, (affinity = 127)\nNode 1, rank 8, thread 0, (affinity = 0)\nNode 1, rank 9, thread 0, (affinity = 16)\nNode 1, rank 10, thread 0, (affinity = 32)\nNode 1, rank 11, thread 0, (affinity = 48)\nNode 1, rank 12, thread 0, (affinity = 64)\nNode 1, rank 13, thread 0, (affinity = 80)\nNode 1, rank 14, thread 0, (affinity = 96)\nNode 1, rank 15, thread 0, (affinity = 112)\n
Here we can see the eight MPI tasks from xthi-a
each running with sixteen OpenMP threads. Then the 8 MPI tasks with no threading from xthi-b
are spaced across the cores on the second node, one per NUMA region.
Low priority jobs are not charged against your allocation but will only run when other, higher-priority, jobs cannot be run. Although low priority jobs are not charged, you do need a valid, positive budget to be able to submit and run low priority jobs, i.e. you need at least 1 CU in your budget.
Low priority access is always available and has the following limits:
You submit a low priority job on ARCHER2 by using the lowpriority
QoS. For example, you would usually have the following line in your job submission script sbatch options:
#SBATCH --qos=lowpriority\n
"},{"location":"user-guide/scheduler/#reservations","title":"Reservations","text":"Reservations are available on ARCHER2. These allow users to reserve a number of nodes for a specified length of time starting at a particular time on the system.
Reservations require justification. They will only be approved if the request could not be fulfilled with the normal QoSs. For instance, you may require a job or jobs to run at a particular time, e.g. for a demonstration or course.
Note
Reservation requests must be submitted at least 60 hours in advance of the reservation start time. If requesting a reservation for a Monday at 18:00, please ensure this is received by the Friday at 12:00 at the latest. The same applies over Service Holidays.
Note
Reservations are only valid for standard compute nodes; high memory compute nodes and PP nodes cannot be included in reservations.
Reservations will be charged at 1.5 times the usual CU rate and our policy is that they will be charged the full rate for the entire reservation at the time of booking, whether or not you use the nodes for the full time. In addition, you will not be refunded the CUs if you fail to use them due to a job issue unless this issue is due to a system failure.
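As a rough illustration of the charging rule above, the following bash sketch estimates the cost of a reservation. It assumes a usual rate of 1 CU per node hour, which is an assumption not stated in this section, and the function name reservation_cus is hypothetical.

```shell
# Hypothetical helper: estimated charge for a reservation, assuming the
# usual rate is 1 CU per node hour and reservations cost 1.5 times that.
reservation_cus() {
    local nodes=$1 hours=$2
    # 1.5 x nodes x hours, kept in integer arithmetic as halves of a CU
    local halves=$(( nodes * hours * 3 ))
    echo $(( halves / 2 )).$(( halves % 2 * 5 ))
}

reservation_cus 4 6    # 4 nodes for 6 hours, prints 36.0
```

The full amount is charged at the time of booking, so the estimate applies to the whole reserved period and not to the time actually used.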
To request a reservation you complete a form on SAFE:
On the first page, you need to provide the following:
On the second page, you will need to specify which username you wish the reservation to be charged against and, once the username has been selected, the budget you want to charge the reservation to. (The selected username will be charged for the reservation but the reservation can be used by all members of the selected budget.)
Your request will be checked by the ARCHER2 User Administration team and, if approved, you will be provided a reservation ID which can be used on the system. To submit jobs to a reservation, you need to add --reservation=<reservation ID>
and --qos=reservation
options to your job submission script or command.
Important
You must have at least 1 CU in the budget to submit a job on ARCHER2, even to a pre-paid reservation.
Tip
You can submit jobs to a reservation as soon as the reservation has been set up; jobs will remain queued until the reservation starts.
"},{"location":"user-guide/scheduler/#capability-days","title":"Capability Days","text":"Important
The next Capability Days session has not been scheduled yet.
ARCHER2 Capability Days are a mechanism to allow users to run large scale (512 node or more) tests on the system free of charge. The motivations behind Capability Days are:
To enable this, a period will be made available regularly where users can run jobs at large scale free of charge.
Capability Days are made up of different parts:
pre-Capability Day session (pre-capabilityday QoS) to allow users to test scaling and job setup ahead of full Capability Day
NERC Capability reservation (NERCcapability reservation) to allow NERC users to test at large scale
Capability Day session (capabilityday QoS)
Tip
Any jobs left in the queues when Capability Days finish will be deleted.
"},{"location":"user-guide/scheduler/#pre-capability-day-session","title":"pre-Capability Day session","text":"The pre-Capability Day session is typically available directly before the full Capability Day session and allows short test jobs to prepare for Capability Day.
Submit to the pre-capabilityday
QoS. Jobs can be submitted ahead of time and will start when the pre-Capability Day session starts.
pre-capabilityday
QoS limits:
srun
commands) within job scripts should also be a minimum of 256 nodes
#!/bin/bash\n#SBATCH --job-name=test_capability_job\n#SBATCH --nodes=256\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=1:0:0\n#SBATCH --partition=standard\n#SBATCH --qos=pre-capabilityday\n#SBATCH --account=t01\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Check process/thread placement\nmodule load xthi\nsrun --hint=multithread --distribution=block:block xthi > placement-${SLURM_JOBID}.out\n\nsrun --hint=multithread --distribution=block:block my_app.x\n
"},{"location":"user-guide/scheduler/#nerc-capability-reservation","title":"NERC Capability reservation","text":"The NERC Capability reservation is typically available directly before the full Capability Day session and allows short test jobs to prepare for Capability Day.
Submit to the NERCcapability
reservation. Jobs can be submitted ahead of time and will start when the NERC Capability reservation starts.
NERCcapability
reservation limits:
#!/bin/bash\n#SBATCH --job-name=NERC_capability_job\n#SBATCH --nodes=256\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=1:0:0\n#SBATCH --partition=standard\n#SBATCH --reservation=NERCcapability\n#SBATCH --qos=reservation\n#SBATCH --account=t01\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Check process/thread placement\nmodule load xthi\nsrun --hint=multithread --distribution=block:block xthi > placement-${SLURM_JOBID}.out\n\nsrun --hint=multithread --distribution=block:block my_app.x\n
"},{"location":"user-guide/scheduler/#capability-day-session","title":"Capability Day session","text":"The Capability Day session is typically available directly after the pre-Capability Day session.
Submit to the capabilityday
QoS. Jobs can be submitted ahead of time and will start when the Capability Day session starts.
capabilityday
QoS limits:
srun
commands) within job scripts should also be a minimum of 512 nodes
#!/bin/bash\n#SBATCH --job-name=capability_job\n#SBATCH --nodes=1024\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=1:0:0\n#SBATCH --partition=standard\n#SBATCH --qos=capabilityday\n#SBATCH --account=t01\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Check process/thread placement\nmodule load xthi\nsrun --hint=multithread --distribution=block:block xthi > placement-${SLURM_JOBID}.out\n\nsrun --hint=multithread --distribution=block:block my_app.x\n
"},{"location":"user-guide/scheduler/#capability-day-tips","title":"Capability Day tips","text":"You can run serial jobs on the shared data analysis nodes. More information on using the data analysis nodes (including example job submission scripts) can be found in the Data Analysis section of the User and Best Practice Guide.
"},{"location":"user-guide/scheduler/#gpu-jobs","title":"GPU jobs","text":"You can run on the ARCHER2 GPU nodes and full guidance can be found on the GPU development platform page
"},{"location":"user-guide/scheduler/#best-practices-for-job-submission","title":"Best practices for job submission","text":"This guidance is adapted from the advice provided by NERSC
"},{"location":"user-guide/scheduler/#time-limits","title":"Time Limits","text":"Due to backfill scheduling, short and variable-length jobs generally start quickly resulting in much better job throughput. You can specify a minimum time for your job with the --time-min
option to sbatch:
#SBATCH --time-min=<lower_bound>\n#SBATCH --time=<upper_bound>\n
Within your job script, you can get the time remaining in the job with squeue -h -j ${SLURM_JOBID} -o %L
to allow you to deal with potentially varying runtimes when using this option.
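To act on the remaining time inside a job script, it usually helps to convert it to seconds first. The following bash sketch assumes the value is printed as [days-]hours:minutes:seconds or minutes:seconds and does not handle other values such as UNLIMITED; the function name to_seconds is hypothetical.

```shell
# Hypothetical helper: converts a Slurm time string such as 1-02:03:04,
# 02:03:04 or 59:30 into a number of seconds (bash only, uses 10# prefix).
to_seconds() {
    local t=$1 days=0 h=0 m=0 s=0
    # Split off an optional leading days- part.
    case $t in
        *-*) days=${t%%-*}; t=${t#*-} ;;
    esac
    # Split the rest on colons into positional parameters.
    local IFS=:
    set -- $t
    case $# in
        3) h=$1; m=$2; s=$3 ;;
        2) m=$1; s=$2 ;;
        1) s=$1 ;;
    esac
    # 10# forces base 10 so that values like 08 are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

to_seconds 1-02:03:04    # prints 93784

# Inside a job this could be combined with squeue, for example:
# left=$(to_seconds $(squeue -h -j $SLURM_JOBID -o %L))
```

A job could compare this value against the time it needs to write a checkpoint and stop cleanly when the margin becomes too small.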
Simulations which must run for a long period of time achieve the best throughput when composed of many small jobs using a checkpoint and restart method chained together (see above for how to chain jobs together). However, this method does incur a startup and shutdown overhead for each job as the state is saved and loaded, so you should experiment to find the best balance between runtime (long runtimes minimise the checkpoint/restart overheads) and throughput (short runtimes maximise throughput).
"},{"location":"user-guide/scheduler/#interconnect-locality","title":"Interconnect locality","text":"For jobs which are sensitive to interconnect (MPI) performance and utilise 128 nodes or less it is possible to request that all nodes are in a single Slingshot dragonfly group. The maximum number of nodes in a group on ARCHER2 is 128.
Slurm has a concept of \"switches\" which on ARCHER2 are configured to map to Slingshot electrical groups, within which all compute nodes have all-to-all electrical connections, minimising latency. Since this places an additional constraint on the scheduler, a maximum time to wait for the requested topology can be specified; after this time, the job will be placed without the constraint.
For example, to specify that all requested nodes should come from one electrical group and to wait for up to 6 hours (360 minutes) for that placement, you would use the following option in your job:
#SBATCH --switches=1@360\n
You can request multiple groups using this option if you are using more nodes than are in a single group, to maximise the number of nodes that share electrical connections in the job. For example, to request 4 groups (maximum of 512 nodes) and have this as an absolute constraint with no timeout, you would use:
#SBATCH --switches=4\n
Danger
When specifying the number of groups take care to request enough groups to satisfy the requested number of nodes. If the number is too low then an unnecessary delay will be added due to the unsatisfiable request.
A useful heuristic is to check that the total number of nodes requested is less than or equal to the number of groups multiplied by 128.
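The heuristic above amounts to rounding the node count up to a whole number of 128-node groups. A minimal sketch (the function name min_groups is an illustrative assumption; 128 is the group size quoted above):

```bash
#!/bin/bash
# Smallest number of groups that can hold a given number of nodes,
# assuming at most 128 nodes per group (integer ceiling division).
min_groups() {
    local nodes="$1" group_size=128
    echo $(( (nodes + group_size - 1) / group_size ))
}

min_groups 512    # prints 4
min_groups 129    # prints 2
```

The result is the smallest value that satisfies the heuristic for that node count when passed to --switches.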
"},{"location":"user-guide/scheduler/#large-jobs","title":"Large Jobs","text":"Large jobs may take longer to start up. The sbcast
command is recommended for large jobs requesting over 1500 MPI tasks. By default, Slurm reads the executable on the allocated compute nodes from the location where it is installed; this may take a long time when the file system (where the executable resides) is slow or busy. With the sbcast
command, the executable can be copied to the /tmp
directory on each of the compute nodes. Since /tmp
is part of the memory on the compute nodes, it can speed up the job startup time.
sbcast --compress=none /path/to/exe /tmp/exe\nsrun /tmp/exe\n
"},{"location":"user-guide/scheduler/#huge-pages","title":"Huge pages","text":"Huge pages are virtual memory pages which are bigger than the default page size of 4K bytes. Huge pages can improve memory performance for common access patterns on large data sets since it helps to reduce the number of virtual to physical address translations when compared to using the default 4KB.
To use huge pages for an application (with the 2 MB huge pages as an example):
module load craype-hugepages2M\ncc -o mycode.exe mycode.c\n
You must also load the same huge pages module at runtime.
Warning
Due to the huge pages memory fragmentation issue, applications may get Cannot allocate memory warnings or errors when there are not enough hugepages on the compute node, such as:
libhugetlbfs [nid0000xx:xxxxx]: WARNING: New heap segment map at 0x10000000 failed: Cannot allocate memory
By default, the libhugetlbfs verbosity level HUGETLB_VERBOSE
is set to 0
on ARCHER2 to suppress debugging messages. Users can adjust this value to obtain more information on huge pages use.
HUGETLB_RESTRICT_EXE
can be used to specify the subset of programs that use huge pages.
Important
This section covers the software environment on the initial, 4-cabinet ARCHER2 system. For documentation on the software environment on the full ARCHER2 system, please see Software environment: full system.
The software environment on ARCHER2 is primarily controlled through the module
command. By loading and switching software modules you control which software and versions are available to you.
Information
A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
By default, all users on ARCHER2 start with the default software environment loaded.
Software modules on ARCHER2 are provided by both HPE Cray (usually known as the Cray Development Environment, CDE) and by EPCC, who provide the Service Provision, and Computational Science and Engineering services.
In this section, we provide:
module
commandmodule
command manipulates your environmentmodule
command","text":"We only cover basic usage of the module
command here. For full documentation please see the Linux manual page on modules
The module
command takes a subcommand to indicate what operation you wish to perform. Common subcommands are:
module list [name] - List modules currently loaded in your environment, optionally filtered by [name]
module avail [name] - List modules available, optionally filtered by [name]
module savelist - List module collections available (usually used for accessing different programming environments)
module restore name - Restore the module collection called name (usually used for setting up a programming environment)
module load name - Load the module called name into your environment
module remove name - Remove the module called name from your environment
module swap old new - Swap module new for module old in your environment
module help name - Show help information on module name
module show name - List what module name actually does to your environment
These are described in more detail below.
"},{"location":"user-guide/sw-environment-4cab/#information-on-the-available-modules","title":"Information on the available modules","text":"The module list
command will give the names of the modules and their versions you have presently loaded in your environment:
auser@uan01:~> module list\nCurrently Loaded Modulefiles:\n1) cpe-aocc 7) cray-dsmml/0.1.2(default)\n2) aocc/2.1.0.3(default) 8) perftools-base/20.09.0(default)\n3) craype/2.7.0(default) 9) xpmem/2.2.35-7.0.1.0_1.3__gd50fabf.shasta(default)\n4) craype-x86-rome 10) cray-mpich/8.0.15(default)\n5) libfabric/1.11.0.0.233(default) 11) cray-libsci/20.08.1.2(default)\n6) craype-network-ofi\n
Finding out which software modules are available on the system is performed using the module avail
command. To list all software modules available, use:
auser@uan01:~> module avail\n------------------------------- /opt/cray/pe/perftools/20.09.0/modulefiles --------------------------------\nperftools perftools-lite-events perftools-lite-hbm perftools-nwpc \nperftools-lite perftools-lite-gpu perftools-lite-loops perftools-preload \n\n---------------------------------- /opt/cray/pe/craype/2.7.0/modulefiles ----------------------------------\ncraype-hugepages1G craype-hugepages8M craype-hugepages128M craype-network-ofi \ncraype-hugepages2G craype-hugepages16M craype-hugepages256M craype-network-slingshot10 \ncraype-hugepages2M craype-hugepages32M craype-hugepages512M craype-x86-rome \ncraype-hugepages4M craype-hugepages64M craype-network-none \n\n------------------------------------- /usr/local/Modules/modulefiles --------------------------------------\ndot module-git module-info modules null use.own \n\n-------------------------------------- /opt/cray/pe/cpe-prgenv/7.0.0 --------------------------------------\ncpe-aocc cpe-cray cpe-gnu \n\n-------------------------------------------- /opt/modulefiles ---------------------------------------------\naocc/2.1.0.3(default) cray-R/4.0.2.0(default) gcc/8.1.0 gcc/9.3.0 gcc/10.1.0(default) \n\n\n---------------------------------------- /opt/cray/pe/modulefiles -----------------------------------------\natp/3.7.4(default) cray-mpich-abi/8.0.15 craype-dl-plugin-py3/20.06.1(default) \ncce/10.0.3(default) cray-mpich-ucx/8.0.15 craype/2.7.0(default) \ncray-ccdb/4.7.1(default) cray-mpich/8.0.15(default) craypkg-gen/1.3.10(default) \ncray-cti/2.7.3(default) cray-netcdf-hdf5parallel/4.7.4.0 gdb4hpc/4.7.3(default) \ncray-dsmml/0.1.2(default) cray-netcdf/4.7.4.0 iobuf/2.0.10(default) \ncray-fftw/3.3.8.7(default) cray-openshmemx/11.1.1(default) papi/6.0.0.2(default) \ncray-ga/5.7.0.3 cray-parallel-netcdf/1.12.1.0 perftools-base/20.09.0(default) \ncray-hdf5-parallel/1.12.0.0 cray-pmi-lib/6.0.6(default) valgrind4hpc/2.7.2(default) \ncray-hdf5/1.12.0.0 cray-pmi/6.0.6(default) 
\ncray-libsci/20.08.1.2(default) cray-python/3.8.5.0(default) \n
This will list all the names and versions of the modules available on the service. Not all of them may work in your account, however, due to, for example, licensing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops the default version will change and old versions of software may be deleted.
You can list all the modules of a particular type by providing an argument to the module avail
command. For example, to list all available versions of the HPE Cray FFTW library, use:
auser@uan01:~> module avail cray-fftw\n\n---------------------------------------- /opt/cray/pe/modulefiles -----------------------------------------\ncray-fftw/3.3.8.7(default) \n
If you want more info on any of the modules, you can use the module help
command:
auser@uan01:~> module help cray-fftw\n\n-------------------------------------------------------------------\nModule Specific Help for /opt/cray/pe/modulefiles/cray-fftw/3.3.8.7:\n\n\n===================================================================\nFFTW 3.3.8.7\n============\n Release Date:\n -------------\n June 2020\n\n\n Purpose:\n --------\n This Cray FFTW 3.3.8.7 release is supported on Cray Shasta Systems. \n FFTW is supported on the host CPU but not on the accelerator of Cray systems.\n\n The Cray FFTW 3.3.8.7 release provides the following:\n - Optimizations for AMD Rome CPUs.\n See the Product and OS Dependencies section for details\n\n[...]\n
The module show
command reveals what operations the module actually performs to change your environment when it is loaded. We give a brief overview of the significance of these settings below. For example, for the default FFTW module:
auser@uan01:~> module show cray-fftw\n-------------------------------------------------------------------\n/opt/cray/pe/modulefiles/cray-fftw/3.3.8.7:\n\nconflict cray-fftw\nconflict fftw\nsetenv FFTW_VERSION 3.3.8.7\nsetenv CRAY_FFTW_VERSION 3.3.8.7\nsetenv CRAY_FFTW_PREFIX /opt/cray/pe/fftw/3.3.8.7/x86_rome\nsetenv FFTW_ROOT /opt/cray/pe/fftw/3.3.8.7/x86_rome\nsetenv FFTW_DIR /opt/cray/pe/fftw/3.3.8.7/x86_rome/lib\nsetenv FFTW_INC /opt/cray/pe/fftw/3.3.8.7/x86_rome/include\nprepend-path PATH /opt/cray/pe/fftw/3.3.8.7/x86_rome/bin\nprepend-path MANPATH /opt/cray/pe/fftw/3.3.8.7/share/man\nprepend-path CRAY_LD_LIBRARY_PATH /opt/cray/pe/fftw/3.3.8.7/x86_rome/lib\nprepend-path PE_PKGCONFIG_PRODUCTS PE_FFTW\nsetenv PE_FFTW_TARGET_x86_skylake x86_skylake\nsetenv PE_FFTW_TARGET_x86_rome x86_rome\nsetenv PE_FFTW_TARGET_x86_cascadelake x86_cascadelake\nsetenv PE_FFTW_TARGET_x86_64 x86_64\nsetenv PE_FFTW_TARGET_share share\nsetenv PE_FFTW_TARGET_sandybridge sandybridge\nsetenv PE_FFTW_TARGET_mic_knl mic_knl\nsetenv PE_FFTW_TARGET_ivybridge ivybridge\nsetenv PE_FFTW_TARGET_haswell haswell\nsetenv PE_FFTW_TARGET_broadwell broadwell\nsetenv PE_FFTW_VOLATILE_PKGCONFIG_PATH /opt/cray/pe/fftw/3.3.8.7/@PE_FFTW_TARGET@/lib/pkgconfig\nsetenv PE_FFTW_PKGCONFIG_VARIABLES PE_FFTW_OMP_REQUIRES_@openmp@\nsetenv PE_FFTW_OMP_REQUIRES { }\nsetenv PE_FFTW_OMP_REQUIRES_openmp _mp\nsetenv PE_FFTW_PKGCONFIG_LIBS fftw3_mpi:libfftw3_threads:fftw3:fftw3f_mpi:libfftw3f_threads:fftw3f\nmodule-whatis {FFTW 3.3.8.7 - Fastest Fourier Transform in the West}\n [...]\n
"},{"location":"user-guide/sw-environment-4cab/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"To load a module to use the module load
command. For example, to load the default version of HPE Cray FFTW into your environment, use:
auser@uan01:~> module load cray-fftw\n
Once you have done this, your environment will be set up to use the HPE Cray FFTW library. The above command will load the default version of HPE Cray FFTW. If you need a specific version of the software, you can add more information:
auser@uan01:~> module load cray-fftw/3.3.8.7\n
will load HPE Cray FFTW version 3.3.8.7 into your environment, regardless of the default.
If you want to remove software from your environment, module remove
will remove a loaded module:
auser@uan01:~> module remove cray-fftw\n
will unload whatever version of cray-fftw
(even if it is not the default) you might have loaded.
There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule
.
Suppose you have loaded version 3.3.8.7 of cray-fftw
, the following command will change to version 3.3.8.5:
auser@uan01:~> module swap cray-fftw cray-fftw/3.3.8.5\n
You did not need to specify the version of the loaded module in your current environment as this can be inferred as it will be the only one you have loaded.
"},{"location":"user-guide/sw-environment-4cab/#changing-programming-environment","title":"Changing Programming Environment","text":"The three programming environments PrgEnv-aocc
, PrgEnv-cray
, PrgEnv-gnu
are implemented as module collections. The correct way to change programming environment, that is, change the collection of modules, is therefore via module restore
. For example:
auser@uan01:~> module restore PrgEnv-gnu\n
Note that there is only one argument, which is the collection to be restored. The command module restore
will output a list of modules in the outgoing collection as they are unloaded, and the modules in the incoming collection as they are loaded. If you prefer not to have messages
auser@uan1:~> module -s restore PrgEnv-gnu\n
will suppress the messages. An attempt to restore a collection which is already loaded will result in no operation.
Module collections are stored in a user's home directory ${HOME}/.module
. However, as the home directory is not available to the back end, module restore
may fail for batch jobs. In this case, it is possible to restore one of the three standard programming environments via, e.g.,
module restore /etc/cray-pe.d/PrgEnv-gnu\n
"},{"location":"user-guide/sw-environment-4cab/#capturing-your-environment-for-reuse","title":"Capturing your environment for reuse","text":"Sometimes it is useful to save the module environment that you are using to compile a piece of code or execute a piece of software. This is saved as a module collection. You can save a collection from your current environment by executing:
auser@uan01:~> module save [collection_name]\n
Note
If you do not specify the environment name, it is called default
.
You can find the list of saved module environments by executing:
auser@uan01:~> module savelist\nNamed collection list:\n 1) default 2) PrgEnv-aocc 3) PrgEnv-cray 4) PrgEnv-gnu \n
To list the modules in a collection, you can execute, e.g.,:
auser@uan01:~> module saveshow PrgEnv-gnu\n-------------------------------------------------------------------\n/home/t01/t01/auser/.module/default:\nmodule use --append /opt/cray/pe/perftools/20.09.0/modulefiles\nmodule use --append /opt/cray/pe/craype/2.7.0/modulefiles\nmodule use --append /usr/local/Modules/modulefiles\nmodule use --append /opt/cray/pe/cpe-prgenv/7.0.0\nmodule use --append /opt/modulefiles\nmodule use --append /opt/cray/modulefiles\nmodule use --append /opt/cray/pe/modulefiles\nmodule use --append /opt/cray/pe/craype-targets/default/modulefiles\nmodule load cpe-gnu\nmodule load gcc\nmodule load craype\nmodule load craype-x86-rome\nmodule load --notuasked libfabric\nmodule load craype-network-ofi\nmodule load cray-dsmml\nmodule load perftools-base\nmodule load xpmem\nmodule load cray-mpich\nmodule load cray-libsci\nmodule load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env\n
Note again that the details of the collection have been saved to the home directory (the first line of output above). It is possible to save a module collection with a fully qualified path, e.g.,
auser@uan1:~> module save /work/t01/z01/auser/.module/PrgEnv-gnu\n
which would make it available from the batch system.
To delete a module environment, you can execute:
auser@uan01:~> module saverm <environment_name>\n
"},{"location":"user-guide/sw-environment-4cab/#shell-environment-overview","title":"Shell environment overview","text":"When you log in to ARCHER2, you are using the bash shell by default. As any other software, the bash shell has loaded a set of environment variables that can be listed by executing printenv
or export
.
These environment variables define the behaviour of the software you run. For instance, OMP_NUM_THREADS
defines the number of OpenMP threads an application uses.
To define an environment variable, you need to execute:
export OMP_NUM_THREADS=4\n
Please note there are no blanks between the variable name, the assignment operator (=), and the value. If the value contains spaces, enclose it in double quotation marks.
You can show the value of a specific environment variable if you print it:
echo $OMP_NUM_THREADS\n
Do not forget the dollar symbol. To remove an environment variable, just execute:
unset OMP_NUM_THREADS\n
"},{"location":"user-guide/sw-environment/","title":"Software environment","text":"The software environment on ARCHER2 is managed using the Lmod software. Selecting which software is available in your environment is primarily controlled through the module
command. By loading and switching software modules you control which software and versions are available to you.
Information
A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
By default, all users on ARCHER2 start with the default software environment loaded.
Software modules on ARCHER2 are provided by both HPE (usually known as the HPE Cray Programming Environment, CPE) and by EPCC, who provide the Service Provision, and Computational Science and Engineering services.
In this section, we provide:
module
commandmodule
command manipulates your environmentmodule
command","text":"We only cover basic usage of the Lmod module
command here. For full documentation please see the Lmod documentation
The module
command takes a subcommand to indicate what operation you wish to perform. Common subcommands are:
module restore - Restore the default module setup (i.e. as if you had logged out and back in again)
module list [name] - List modules currently loaded in your environment, optionally filtered by [name]
module avail [name] - List modules available, optionally filtered by [name]
module spider [name][/version] - Search available modules (including hidden modules) and provide information on modules
module load name - Load the module called name into your environment
module remove name - Remove the module called name from your environment
module help name - Show help information on module name
module show name - List what module name actually does to your environment
These are described in more detail below.
Tip
Lmod allows you to use the ml
shortcut command. Without any arguments, ml
behaves like module list
; when a module name is specified to ml
, ml
behaves like module load
.
Note
You will often have to include module
commands in any job submission scripts to set up the software to use in your jobs. Generally, if you load modules in interactive sessions, these loaded modules do not carry over into any job submission scripts.
Important
You should not use the module purge
command on ARCHER2 as this will cause issues for the HPE Cray programming environment. If you wish to reset your modules, you should use the module restore
command instead.
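As an illustration of how module commands might appear in a job submission script, the sketch below resets the environment with module restore and then loads the xthi module used elsewhere in this guide. The job name, resource requests, QoS and account code are placeholder assumptions; substitute your own values.

```bash
#!/bin/bash
#SBATCH --job-name=module_example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --account=t01

# Reset to the default module setup rather than using "module purge".
module restore

# Do not rely on modules from your login session: load what the job needs.
module load xthi

srun xthi
```

Keeping the module commands in the script itself records exactly which software the job used.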
The key commands for getting information on modules are covered in more detail below. They are:
module list
module avail
module spider
module help
module show
module list
","text":"The module list
command will give the names of the modules and their versions you have presently loaded in your environment:
auser@ln03:~> module list\n\nCurrently Loaded Modules:\n 1) craype-x86-rome 6) cce/15.0.0 11) PrgEnv-cray/8.3.3\n 2) libfabric/1.12.1.2.2.0.0 7) craype/2.7.19 12) bolt/0.8\n 3) craype-network-ofi 8) cray-dsmml/0.2.2 13) epcc-setup-env\n 4) perftools-base/22.12.0 9) cray-mpich/8.1.23 14) load-epcc-module\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) cray-libsci/22.12.1.1\n
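In scripts it can be useful to test whether a module is loaded without parsing the human-readable listing. A minimal sketch, assuming the colon-separated LOADEDMODULES environment variable maintained by the module system reflects the loaded modules (the function name module_is_loaded is illustrative, not an Lmod command):

```bash
#!/bin/bash
# Return success if the named module appears in LOADEDMODULES.
# Accepts either "name" or "name/version".
module_is_loaded() {
    local wanted="$1" entry
    local IFS=:
    for entry in ${LOADEDMODULES:-}; do
        # Match the full name/version or the bare name before the first "/".
        if [[ "$entry" == "$wanted" || "${entry%%/*}" == "$wanted" ]]; then
            return 0
        fi
    done
    return 1
}

if module_is_loaded cray-mpich; then
    echo "cray-mpich is loaded"
fi
```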
All users start with a default set of modules loaded corresponding to:
module avail
","text":"Finding out which software modules are currently available to load on the system is performed using the module avail
command. To list all software modules currently available to load, use:
auser@uan01:~> module avail\n\n--------------------------- /work/y07/shared/archer2-lmod/utils/compiler/crayclang/10.0 ---------------------------\n darshan/3.3.1\n\n------------------------------------ /work/y07/shared/archer2-lmod/python/core ------------------------------------\n matplotlib/3.4.3 netcdf4/1.5.7 pytorch/1.10.0 scons/4.3.0 seaborn/0.11.2 tensorflow/2.7.0\n\n------------------------------------- /work/y07/shared/archer2-lmod/libs/core -------------------------------------\n aocl/3.1 (D) gmp/6.2.1 matio/1.5.23 parmetis/4.0.3 slepc/3.14.1\n aocl/4.0 gsl/2.7 metis/5.1.0 petsc/3.14.2 slepc/3.18.3 (D)\n boost/1.72.0 hypre/2.18.0 mkl/2023.0.0 petsc/3.18.5 (D) superlu-dist/6.4.0\n boost/1.81.0 (D) hypre/2.25.0 (D) mumps/5.3.5 scotch/6.1.0 superlu-dist/8.1.2 (D)\n eigen/3.4.0 libxml2/2.9.7 mumps/5.5.1 (D) scotch/7.0.3 (D) superlu/5.2.2\n\n------------------------------------- /work/y07/shared/archer2-lmod/apps/core -------------------------------------\n castep/22.11 namd/2.14 (D) py-chemshell/21.0.3\n code_saturne/7.0.1-cce15 nektar/5.2.0 quantum_espresso/6.8 (D)\n code_saturne/7.0.1-gcc11 (D) nwchem/7.0.2 quantum_espresso/7.1\n cp2k/cp2k-2023.1 onetep/6.1.9.0-CCE-LibSci (D) tcl-chemshell/3.7.1\n elk/elk-7.2.42 onetep/6.1.9.0-GCC-LibSci vasp/5/5.4.4.pl2-vtst\n fhiaims/210716.3 onetep/6.1.9.0-GCC-MKL vasp/5/5.4.4.pl2\n gromacs/2022.4+plumed openfoam/com/v2106 vasp/6/6.3.2-vtst\n gromacs/2022.4 (D) openfoam/com/v2212 (D) vasp/6/6.3.2 (D)\n lammps/17Feb2023 openfoam/org/v9.20210903\n namd/2.14-nosmp openfoam/org/v10.20230119 (D)\n\n------------------------------------ /work/y07/shared/archer2-lmod/utils/core -------------------------------------\n amd-uprof/3.6.449 darshan-util/3.3.1 imagemagick/7.1.0 reframe/4.1.0\n forge/24.0 epcc-reframe/0.2 ncl/6.6.2 tcl/8.6.13\n bolt/0.7 epcc-setup-env (L) nco/5.0.3 (D) tk/8.6.13\n bolt/0.8 (L,D) gct/v6.2.20201212 nco/5.0.5 usage-analysis/1.2\n cdo/1.9.9rc1 genmaskcpu/1.0 ncview/2.1.7 visidata/2.1\n cdo/2.1.1 (D) 
gnuplot/5.4.2-simg other-software/1.0 vmd/1.9.3-gcc10\n cmake/3.18.4 gnuplot/5.4.2 (D) paraview/5.9.1 (D) xthi/1.3\n cmake/3.21.3 (D) gnuplot/5.4.3 paraview/5.10.1\n\n--------------------- /opt/cray/pe/lmod/modulefiles/mpi/crayclang/14.0/ofi/1.0/cray-mpich/8.0 ---------------------\n cray-hdf5-parallel/1.12.2.1 cray-mpixlate/1.0.0.6 cray-parallel-netcdf/1.12.3.1\n\n--------------------------- /opt/cray/pe/lmod/modulefiles/comnet/crayclang/14.0/ofi/1.0 ---------------------------\n cray-mpich-abi/8.1.23 cray-mpich/8.1.23 (L)\n\n...output trimmed...\n
This will list all the names and versions of the modules that you can currently load. Note that other modules may be defined but not available to you as they depend on modules you do not have loaded. Lmod only shows modules that you can currently load, not all those that are defined. You can search for modules that are not currently visible to you using the module spider
command - we cover this in more detail below.
Note also that not all modules may work in your account due to, for example, licensing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops the default version will change and old versions of software may be deleted.
You can list all the modules of a particular type by providing an argument to the module avail
command. For example, to list all available versions of the HPE Cray FFTW library, use:
auser@ln03:~> module avail cray-fftw\n\n--------------------------------- /opt/cray/pe/lmod/modulefiles/cpu/x86-rome/1.0 ----------------------------------\n cray-fftw/3.3.10.3\n\nModule defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.\nSee https://lmod.readthedocs.io/en/latest/060_locating.html for details.\n\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
"},{"location":"user-guide/sw-environment/#module-spider","title":"module spider
","text":"The module spider
command is used to find out which modules are defined on the system. Unlike module avail
, this includes modules that cannot currently be loaded because you have not yet loaded the dependencies that make them directly available.
module spider
takes 3 forms:
module spider without any arguments lists all modules defined on the system
module spider <module> shows information on which versions of <module> are defined on the system
module spider <module>/<version> shows information on the specific version of the module defined on the system, including dependencies that must be loaded before this module can be loaded (if any)
If you cannot find a module that you expect to be on the system using module avail, you can use module spider to find out which dependencies you need to load to make the module available.
For example, the module cray-netcdf-hdf5parallel
is installed on ARCHER2 but it will not be found by module avail
:
auser@ln03:~> module avail cray-netcdf-hdf5parallel\nNo module(s) or extension(s) found!\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
We can use module spider
without any arguments to verify it exists and list the versions available:
auser@ln03:~> module spider\n\n-----------------------------------------------------------------------------------------------\nThe following is a list of the modules and extensions currently available:\n-----------------------------------------------------------------------------------------------\n\n...output trimmed...\n\n cray-mpich-abi: cray-mpich-abi/8.1.23\n\n cray-mpixlate: cray-mpixlate/1.0.0.6\n\n cray-mrnet: cray-mrnet/5.0.4\n\n cray-netcdf: cray-netcdf/4.9.0.1\n\n cray-netcdf-hdf5parallel: cray-netcdf-hdf5parallel/4.9.0.1\n\n cray-openshmemx: cray-openshmemx/11.5.7\n\n...output trimmed...\n
Now we know which versions are available, we can use module spider cray-netcdf-hdf5parallel/4.9.0.1
to find out how we can make it available:
auser@ln03:~> module spider cray-netcdf-hdf5parallel/4.9.0.1\n\n---------------------------------------------------------------------------------------------------------------\n cray-netcdf-hdf5parallel: cray-netcdf-hdf5parallel/4.9.0.1\n---------------------------------------------------------------------------------------------------------------\n\n You will need to load all module(s) on any one of the lines below before the \"cray-netcdf-hdf5parallel/4.9.0.1\" module is available to load.\n\n aocc/3.2.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n cce/15.0.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n craype-network-none cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n craype-network-ofi cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n craype-network-ucx cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n gcc/10.3.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n gcc/11.2.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n\n Help:\n Release info: /opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/release_info\n
There is a lot of information here, but what the output is essentially telling us is that in order to have cray-netcdf-hdf5parallel/4.9.0.1
available to load we need to have loaded a compiler (any version of CCE, GCC or AOCC), an MPI library (any version of cray-mpich) and cray-hdf5-parallel
. As we always have a compiler and MPI library loaded, we can satisfy all of the dependencies by loading cray-hdf5-parallel
, and then we can use module avail cray-netcdf-hdf5parallel
again to show that the module is now available to load:
auser@ln03:~> module load cray-hdf5-parallel\nauser@ln03:~> module avail cray-netcdf-hdf5parallel\n\n--- /opt/cray/pe/lmod/modulefiles/hdf5-parallel/crayclang/14.0/ofi/1.0/cray-mpich/8.0/cray-hdf5-parallel/1.12.2 ---\n cray-netcdf-hdf5parallel/4.9.0.1\n\nModule defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.\nSee https://lmod.readthedocs.io/en/latest/060_locating.html for details.\n\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
"},{"location":"user-guide/sw-environment/#module-help","title":"module help
","text":"If you want more info on any of the modules, you can use the module help
command:
auser@ln03:~> module help gromacs\n
"},{"location":"user-guide/sw-environment/#module-show","title":"module show
","text":"The module show
command reveals what operations the module actually performs to change your environment when it is loaded. For example, for the default GROMACS module:
auser@ln03:~> module show gromacs\n\n [...]\n
"},{"location":"user-guide/sw-environment/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"To change your environment and make different software available you use the following commands which we cover in more detail below.
module load
module remove
module swap
module load
","text":"To load a module to use the module load
command. For example, to load the default version of GROMACS into your environment, use:
auser@ln03:~> module load gromacs\n
Once you have done this, your environment will be set up to use GROMACS. The above command will load the default version of GROMACS. If you need a specific version of the software, you can add more information:
auser@uan01:~> module load gromacs/2022.4 \n
will load GROMACS version 2022.4 into your environment, regardless of the default.
"},{"location":"user-guide/sw-environment/#module-remove","title":"module remove
","text":"If you want to remove software from your environment, module remove
will remove a loaded module:
auser@uan01:~> module remove gromacs\n
will unload whatever version of gromacs
you might have loaded (even if it is not the default).
module swap
","text":"There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule
.
For example, to swap from the default CCE (cray) compiler environment to the GCC (gnu) compiler environment, you would use:
auser@ln03:~> module swap PrgEnv-cray PrgEnv-gnu\n
You did not need to specify the version of the loaded module in your current environment as this can be inferred as it will be the only one you have loaded.
"},{"location":"user-guide/sw-environment/#shell-environment-overview","title":"Shell environment overview","text":"When you log in to ARCHER2, you are using the bash shell by default. As with any software, the bash shell has loaded a set of environment variables that can be listed by executing printenv
or export
.
These environment variables are used to define the behaviour of the software you run. For instance, OMP_NUM_THREADS
defines the number of OpenMP threads.
To define an environment variable, you need to execute:
export OMP_NUM_THREADS=4\n
Please note there are no blanks between the variable name, the assignment symbol (=), and the value. If the value is a string, enclose the string in double quotation marks.
You can show the value of a specific environment variable if you print it:
echo $OMP_NUM_THREADS\n
Do not forget the dollar symbol. To remove an environment variable, just execute:
unset OMP_NUM_THREADS\n
Note that the dollar symbol is not included when you use the unset
command.
Note that it is not possible for a single user to monopolise the resources on a login node as this is controlled by cgroups. This means that a user cannot slow down the response time for other users.
"},{"location":"user-guide/tds/","title":"ARCHER2 Test and Development System (TDS) user notes","text":"The ARCHER2 Test and Development System (TDS) is a small system used for testing changes before they are rolled out onto the full ARCHER2 system. This page contains useful information for people using the TDS on its configuration and what they can expect from the system.
Important
The TDS is used for testing on a day to day basis. This means that nodes and the entire system may be made unavailable or rebooted with little or no warning.
"},{"location":"user-guide/tds/#tds-system-details","title":"TDS system details","text":"Compute nodes: 8 compute nodes in total
Slingshot interconnect
Storage:
You can only log into the TDS from an ARCHER2 login node. You should create an SSH key pair on an ARCHER2 login node and add the public part to your ARCHER2 account in SAFE in the usual way.
Once your new key pair is set up, you can then log in to the TDS (from an ARCHER2 login node) with
ssh login-tds.archer2.ac.uk\n
You will require your SSH key passphrase (for the new key pair you generated) and your usual ARCHER2 account password to log in to the TDS.
"},{"location":"user-guide/tds/#slurm-scheduler-configuration","title":"Slurm scheduler configuration","text":"standard
: includes all compute nodeshighmem
: includes high memory compute nodesstandard
: same limits as on ARCHER2 main systemhighmem
: same limits as on ARCHER2 main systemSoftware modules
/work
file system - i.e. you may be able to load a module but the software it points to may not be available. Check if the software is actually installed before trying to use it.
GCC 12.2.0 (gcc/g++/gfortran) compiler has been shown to give incorrect numerical results for a number of software packages (VASP, CASTEP, CP2K). If you want to use this compiler version we recommend checking output carefully. We may remove this version from the PE software stack installed on the full system as part of the software upgrade.
Singularity + MPI does not currently work - MPI executable in the Singularity container segfaults.
Energy use data is not available from TDS compute nodes.
Change of behaviour of the --cpus-per-task
Slurm option. If you set --cpus-per-task
greater than 1
in your job submission script (e.g. using #SBATCH
directives) then this option is not inherited by srun
commands in the job script. You need to either set something like export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
or repeat the option explicitly in the srun
command (e.g. srun --cpus-per-task=$SLURM_CPUS_PER_TASK --hint=nomultithread --distribution=block:block
).
Change in definition of a Slurm NUMA region. On the TDS, a Slurm NUMA region is 4 cores (corresponding to a Core CompleX, or CCX, in the AMD EPYC Zen2 architecture). This means cyclic process placements on NUMA regions (e.g. --distribution=block:cyclic
) will cycle over 4-core CCX. (On the main system, a Slurm NUMA region is 16 cores).
The vast majority of parallel scientific applications use the MPI library as the main way to implement parallelism; it is used so universally that the Cray compiler wrappers on ARCHER2 link to the Cray MPI library by default. Unlike other clusters you may have used, there is no choice of MPI library on ARCHER2: regardless of what compiler you are using, your program will use Cray MPI. This is because the Slingshot network on ARCHER2 is Cray-specific and significant effort has been put in by Cray software engineers to optimise the MPI performance on their Shasta systems.
Here we list a number of suggestions for improving the performance of your MPI programs on ARCHER2. Although MPI programs are capable of scaling very well due to the bespoke communications hardware and software, the details of how a program calls MPI can have significant effects on achieved performance.
Note
Many of these tips are actually quite generic and should be beneficial to any MPI program; however, they all become much more important when running on very large numbers of processes on a machine the size of ARCHER2.
"},{"location":"user-guide/tuning/#mpi-environment-variables","title":"MPI environment variables","text":"There are a number of environment variables available to control aspects of MPI behavour on ARCHER2, the set of options can be displayed by running,
man intro_mpi\n
on the ARCHER2 login nodes. A couple of specific variables to highlight are MPICH_OFI_STARTUP_CONNECT and MPICH_OFI_RMA_STARTUP_CONNECT.
When using the default OFI transport layer, the connections between ranks are set up as they are required. This allows for good performance while reducing memory requirements. However, for jobs using all-to-all communication it might be better to generate these connections in a coordinated way at the start of the application. To enable this, set the following environment variable:
export MPICH_OFI_STARTUP_CONNECT=1 \n
Additionally, for RMA jobs requiring an all-to-all communication pattern on a node, it may be beneficial to set up the connections between processes on a node in a coordinated fashion:
export MPICH_OFI_RMA_STARTUP_CONNECT=1\n
This option automatically enables MPICH_OFI_STARTUP_CONNECT.
"},{"location":"user-guide/tuning/#synchronous-vs-asynchronous-communications","title":"Synchronous vs asynchronous communications","text":""},{"location":"user-guide/tuning/#mpi_send","title":"MPI_Send","text":"A standard way to send data in MPI is using MPI_Send
(aptly called standard send). Somewhat confusingly, MPI is allowed to choose how to implement this in two different ways:
Synchronously The sending process waits until a matching receive has been posted, i.e. it operates like MPI_Ssend
. This clearly has the risk of deadlock if no receive is ever issued.
Asynchronously MPI makes a copy of the message into an internal buffer and returns straight away without waiting for a matching receive; the message may actually be delivered later on. This is like the behaviour of the buffered send routine MPI_Bsend
.
The rationale is that MPI, rather than the user, should decide how best to send a message.
In practice, what typically happens is that MPI tries to use an asynchronous approach via the eager protocol: the message is sent directly to a preallocated buffer on the receiver and the routine returns immediately afterwards. Clearly there is a limit on how much space can be reserved for this, so messages smaller than some threshold are sent asynchronously and messages larger than the threshold are sent synchronously.
The threshold is often termed the eager limit which is fixed for the entire run of your program. It will have some default setting which varies from system to system, but might be around 8K bytes.
"},{"location":"user-guide/tuning/#implications","title":"Implications","text":"MPI_Send
is implemented asynchronously using the eager protocol since synchronisation between sender and receiver is much reduced.MPI_Send
buffers your message, so if you have concerns about deadlock you will need to use the non-blocking variant MPI_Isend
to guarantee that the send routine returns control to you immediately even if there is no matching receive.MPI_Send
/ MPI_Isend
with MPI_Ssend
/ MPI_Issend
. A correct MPI program should still run correctly when all references to standard send are replaced by synchronous send (since MPI is allowed to implement standard send as synchronous send).With most MPI libraries you should be able to alter the default value of the eager limit at runtime, perhaps via an environment variable or a command-line argument to mpirun
.
The advice for tuning the performance of MPI_Send
is
MPI_Send
is (a profiling tool may be useful here);MPI_Isend
as well: even in the non-blocking form, which can help to weaken synchronisation between sender and receiver, the amount of hand-shaking required is much reduced if the eager protocol is used;Note
It cannot be stressed strongly enough that although the performance may be affected by the value of the eager limit, the functionality of your program should be unaffected. If changing the eager limit affects the correctness of your program (e.g. whether or not it deadlocks) then you have an incorrect MPI program.
"},{"location":"user-guide/tuning/#setting-the-eager-limit-on-archer2","title":"Setting the eager limit on ARCHER2","text":"On ARCHER2, things are a little more complicated. Although the eager limit defaults to 16KiB, messages up to 256KiB are sent asynchronously because they are actually sent as a number of smaller messages.
To send even larger messages asynchronously, alter the value of FI_OFI_RXM_SAR_LIMIT
in your job submission script, e.g. to set to 512KiB:
export FI_OFI_RXM_SAR_LIMIT=524288\n
You can also control the size of the smaller messages by altering the value of FI_OFI_RXM_BUFFER_SIZE
in your job submission script, e.g. to set to 128KiB:
export FI_OFI_RXM_BUFFER_SIZE=131072\n
A different protocol is used for messages between two processes on the same node. The default eager limit for these is 8K. Although the performance of on-node messages is unlikely to be a limiting factor for your program you can change this value, e.g. to set to 16KiB:
export MPICH_SMP_SINGLE_COPY_SIZE=16384\n
"},{"location":"user-guide/tuning/#collective-operations","title":"Collective operations","text":"Many of the collective operations that are commonly required by parallel scientific programs, i.e. operations that involve a group of processes, are already implemented in MPI. The canonical operation is perhaps adding up a double precision number across all MPI processes, which is best achieved by a reduction operation:
MPI_Allreduce(&x, &xsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);\n
This will be implemented using an efficient algorithm, for example based on a binary tree. Using such divide-and-conquer approaches typically results in an algorithm whose execution time on P processes scales as log_2(P); compare this to a naive approach where every process sends its input to rank 0 where the time will scale as P. This might not be significant on your laptop, but even on as few as 1000 processes the tree-based algorithm will already be around 100 times faster.
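The scaling argument above can be checked with a little arithmetic. The following C sketch is purely illustrative and does not call MPI: the helper names tree_steps and linear_steps are hypothetical, and they simply count communication steps for a binary-tree reduction and for the naive approach where rank 0 receives from every other process in turn.

```c
/* Illustrative only: counts communication steps, does not call MPI.
 * tree_steps and linear_steps are hypothetical helper names. */

/* Binary-tree reduction: the number of processes still holding a
 * partial result roughly halves each round, giving ceil(log2(P)) rounds. */
int tree_steps(int nproc)
{
    int steps = 0;
    int active = nproc;
    while (active > 1) {
        active = (active + 1) / 2;  /* pairs combine; an odd one carries over */
        steps++;
    }
    return steps;
}

/* Naive approach: rank 0 receives one message from each other process. */
int linear_steps(int nproc)
{
    return nproc - 1;
}
```

For 1000 processes, tree_steps gives 10 and linear_steps gives 999, which is consistent with the factor of around 100 quoted above.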
So, the basic advice is always use a collective routine to implement your communications pattern if at all possible.
In real MPI applications, collective operations are often called on a small amount of data, for example a global reduction of a single variable. In these cases, the time taken will be dominated by message latency and the first port of call when looking at performance optimisation is to call them as infrequently as possible!
Sometimes, the collective routines available may not appear to do exactly what you want. However, they can sometimes be used with a small amount of additional programming work:
To operate on a subset of processes, create sub-communicators containing the relevant subset(s) and use these communicators instead of MPI_COMM_WORLD
. Useful functions for communicator management include:
MPI_Comm_split
is the most general routine;MPI_Comm_split_type
can be used to create a separate communicator for each shared-memory node with split type = MPI_COMM_TYPE_SHARED
;MPI_Cart_sub
can divide a Cartesian communicator into regular slices.If the communication pattern is what you want, but the data on each process is not arranged in the required layout, consider using MPI derived data types for the input and/or output. This can be useful, for example, if you want to communicate non-contiguous data such as a subsection of a multidimensional array although care must be taken in defining these types to ensure they have the correct extents.
Another example would be using MPI_Allreduce
to add up an integer and a double-precision variable using a single call by putting them together into a C struct
and defining a matching MPI datatype using MPI_Type_create_struct
. Here you would also have to provide MPI with a custom reduction operation using MPI_Op_create
.
Many MPI programs call MPI_Barrier
to explicitly synchronise all the processes. Although this can be useful for getting reliable performance timings, it is rare in practice to find a program where the call is actually needed for correctness. For example, you may see:
// Ensure the input x is available on all processes\nMPI_Barrier(MPI_COMM_WORLD);\n// Perform a global reduction operation\nMPI_Allreduce(&x, &xsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);\n// Ensure the result xsum is available on all processes\nMPI_Barrier(MPI_COMM_WORLD);\n
Neither of these barriers is needed as the reduction operation performs all the required synchronisation.
If removing a barrier from your MPI code makes it run incorrectly, then this should ring alarm bells -- it is often a symptom of an underlying bug that is simply being masked by the barrier.
For example, if you use non-blocking calls such as MPI_Irecv
then it is the programmer's responsibility to ensure that these are completed at some later point, for example by calling MPI_Wait
on the returned request object. A common bug is to forget to do this, in which case you might be reading the contents of the receive buffer before the incoming message has arrived (e.g. if the sender is running late).
Calling a barrier may mask this bug as it will make all the processes wait for each other, perhaps allowing the late sender to catch up. However, this is not guaranteed so the real solution is to call the non-blocking communications correctly.
One of the few times when a barrier may be required is if processes are communicating with each other via some other non-MPI method, e.g. via the file system. If you want processes to sequentially open, append to, then close the same file then barriers are a simple way to achieve this:
for (i=0; i < size; i++)\n{\n if (rank == i) append_data_to_file(data, filename);\n MPI_Barrier(comm);\n}\n
but this is really something of a special case.
Global synchronisation may be required if you are using more advanced techniques such as hybrid MPI/OpenMP or single-sided MPI communication with put and get, but typically you should be using specialised routines such as MPI_Win_fence
rather than MPI_Barrier
.
Tip
If you run a performance profiler on your code and it shows a lot of time being spent in a collective operation such as MPI_Allreduce
, this is not necessarily a sign that the reduction operation itself is the bottleneck. This is often a symptom of load imbalance: even if a reduction operation is efficiently implemented, it may take a long time to complete if the MPI processes do not all call it at the same time. MPI_Allreduce
synchronises across processes so will have to wait for all the processes to call it before it can complete. A single slow process will therefore adversely impact the performance of your entire parallel program.
There are a variety of possible issues that can result in poor performance of OpenMP programs. These include:
"},{"location":"user-guide/tuning/#sequential-code","title":"Sequential code","text":"Code outside of parallel regions is executed sequentially by the master thread.
"},{"location":"user-guide/tuning/#idle-threads","title":"Idle threads","text":"If different threads have different amounts of computation to do, then threads may be idle whenever a barrier is encountered, for example at the end of parallel regions or the end of worksharing loops. For worksharing loops, choosing a suitable schedule kind may help. For more irregular computation patterns, using OpenMP tasks might offer a solution: the runtime will try to load balance tasks across the threads in the team.
Synchronisation mechanisms that enforce mutual exclusion, such as critical regions, atomic statements and locks can also result in idle threads if there is contention - threads have to wait their turn for access.
"},{"location":"user-guide/tuning/#synchronisation","title":"Synchronisation","text":"The act of synchronising threads comes at some cost, even if the threads are never idle. In OpenMP, the most common source of synchronisation overheads is the implicit barriers at the end of parallel regions and worksharing loops. The overhead of these barriers depends on the OpenMP implementation being used as well as on the number of threads, but is typically in the range of a few microseconds. This means that for a simple parallel loop such as
#pragma omp parallel for reduction(+:sum)\nfor (i=0;i<n;i++){\n sum += a[i];\n}\n
the number of iterations required to make parallel execution worthwhile may be of the order of 100,000. On ARCHER2, benchmarking has shown that for the AOCC compiler, OpenMP barriers have significantly higher overhead than for either the Cray or GNU compilers.
It is possible to suppress the implicit barrier at the end of a worksharing loop using a nowait
clause, taking care that this does not introduce any race conditions.
Atomic statements are designed to be capable of more efficient implementation than the equivalent critical region or lock/unlock pair, so should be used where applicable.
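As a minimal sketch of the atomic advice above, the following C function protects a shared counter with an atomic statement rather than a critical region. The function name count_matches and its arguments are hypothetical and chosen only for illustration.

```c
/* Count how many entries of a[] equal target.
 * count_matches is a hypothetical name used only for this sketch. */
long count_matches(const int *a, int n, int target)
{
    long hits = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (a[i] == target) {
            /* A single scalar update: atomic is sufficient here and is
             * typically cheaper than a critical region or a lock. */
            #pragma omp atomic
            hits++;
        }
    }
    return hits;
}
```

If the updates are very frequent, a reduction(+:hits) clause would probably be a better choice still, since it avoids contention on the shared counter altogether.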
"},{"location":"user-guide/tuning/#scheduling","title":"Scheduling","text":"Whenever we rely on the OpenMP runtime to dynamically assign computation to threads (e.g. dynamic or guided loop schedules, tasks), there is some overhead incurred (some of this cost may actually be internal synchronisation in the runtime). It is often necessary to adjust the granularity of the computation to find a compromise between too many small units (and high scheduling cost) and too few large units (where load imbalance may dominate). For example, we can choose a non-default chunksize for the dynamic schedule, or adjust the amount of computation within each OpenMP task construct.
"},{"location":"user-guide/tuning/#communication","title":"Communication","text":"Communication between threads in OpenMP takes place via the cache coherency mechanism. In brief, whenever a thread writes a memory location, all copies of this location which are in a cache belonging to a different core have to be marked as invalid. Subsequent accesses to this location by other threads will result in the up-to-date value being retrieved from the cache where the last write occurred (or possibly from main memory).
Due to the fine granularity of memory accesses, these overheads are difficult to analyse or monitor. To minimise communication, we need to write code with good data affinity - i.e. each thread should access the same subset of program data as much as possible.
"},{"location":"user-guide/tuning/#numa-effects","title":"NUMA effects","text":"On modern CPU nodes, main memory is often organised in NUMA regions - sections of main memory associated with a subset of the cores on a node. On ARCHER2 nodes, there are 8 NUMA regions per node, each associated with 16 CPU cores. On such systems the location of data in main memory with respect to the cores that are accessing it can be important. The default OS policy is to place data in the NUMA region which first accesses it (first touch policy). For OpenMP programs this can be the worst possible option: if the data is initialised by the master thread, it is all allocated one NUMA region and having all threads accessing data becomes a bandwidth bottleneck.
This default policy can be changed using the numactl
command, but it is probably better to make use of the first touch policy by explicitly parallelising the data initialisation in the application code. This may be straightforward for large multidimensional arrays, but more challenging for irregular data structures.
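A minimal sketch of parallel initialisation under the first touch policy is given below. It assumes that the later compute loop uses the same static schedule as the initialisation loop, so that each thread mostly accesses the pages it touched first. The function names alloc_first_touch and sum_array are hypothetical.

```c
#include <stdlib.h>

/* Allocate an array and initialise it in parallel so that, under the
 * first touch policy, each page is placed in the NUMA region of the
 * thread that will later work on it. Names are hypothetical. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL) return NULL;

    /* Same static schedule as the compute loop below. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        a[i] = 0.0;
    }
    return a;
}

/* Compute loop using the same iteration-to-thread mapping. */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        sum += a[i];
    }
    return sum;
}
```

The important design choice is that the initialisation is not left to the master thread: if it were, all pages would be placed in a single NUMA region.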
The cache coherency mechanism described above operates on units of data corresponding to the size of cache lines - for ARCHER2 CPUs this is 64 bytes. This means that if different threads are accessing neighbouring words in memory, and at least some of the accesses are writes, then communication may be happening even if no individual word is actually being accessed by more than one thread. This means that patterns such as
#pragma omp parallel shared(count) private(myid) \n{\n myid = omp_get_thread_num();\n ....\n count[myid]++;\n ....\n}\n
may give poor performance if the updates to the count
array are sufficiently frequent.
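One common way to avoid this effect is to pad each counter so that it occupies its own cache line. The sketch below assumes the 64 byte cache line size quoted above; the names padded_count_t, MAX_SLOTS and count_events are hypothetical, and each loop iteration stands in for one thread updating its own counter.

```c
#define CACHE_LINE 64   /* cache line size in bytes, as quoted above */
#define MAX_SLOTS 128   /* assumed upper bound on counters for this sketch */

/* One counter per cache line: the padding keeps neighbouring counters
 * from sharing a line. */
typedef struct {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} padded_count_t;

/* Each slot is updated by exactly one iteration, so there is no race;
 * the padding only removes the cache line sharing between slots. */
long count_events(int nslots, int events_per_slot)
{
    static _Alignas(CACHE_LINE) padded_count_t count[MAX_SLOTS];
    long total = 0;

    if (nslots > MAX_SLOTS) nslots = MAX_SLOTS;
    for (int t = 0; t < nslots; t++) count[t].value = 0;

    #pragma omp parallel for
    for (int t = 0; t < nslots; t++) {
        for (int e = 0; e < events_per_slot; e++) {
            count[t].value++;
        }
    }

    for (int t = 0; t < nslots; t++) total += count[t].value;
    return total;
}
```

An alternative that often works just as well is to accumulate into a private local variable inside the parallel region and write it to the shared array once at the end.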
Whenever there are multiple threads (or processes) executing inside a node, they may contend for some hardware resources. The most important of these for many HPC applications is memory bandwidth. This effect is very evident on ARCHER2 CPUs - it is possible for just 2 threads to almost saturate the available memory bandwidth in a NUMA region which has 16 cores associated with it. For very bandwidth-intensive applications, running more than 2 threads per NUMA region may gain little additional performance. If an OpenMP code is not using all the cores on a node, by default Slurm will spread the threads out across NUMA regions to maximise the available bandwidth.
Another resource that threads may contend for is space in shared caches. On ARCHER2, every set of 4 cores shares 16MB of L3 cache.
"},{"location":"user-guide/tuning/#compiler-non-optimisation","title":"Compiler non-optimisation","text":"In rare cases, adding OpenMP directives can adversely affect the compiler's optimisation process. The symptom of this is that the OpenMP code running on 1 thread is slower than the same code compiled without the OpenMP flag. It can be difficult to find a workaround - using the compiler's diagnostic flags to find out which optimisation (e.g. vectorisation, loop unrolling) is being affected and adding compiler-specific directives may help.
"},{"location":"user-guide/tuning/#hybrid-mpi-and-openmp","title":"Hybrid MPI and OpenMP","text":"There are two main motivations for using both MPI and OpenMP in the same application code: reducing memory requirements and improving performance. At low core counts, where the pure MPI version of the code is still scaling well, adding OpenMP is unlikely to improve performance. In fact, it can introduce some additional overheads which make performance worse! The benefit is likely to come in the regime where the pure MPI version starts to lose scalability - here adding OpenMP can reduce communication costs, make load balancing easier, or be an effective way of exploiting additional parallelism without excessive code re-writing.
An important performance consideration for MPI + OpenMP applications is the choice of the number of OpenMP threads per MPI process. The optimum value will depend on the application, the input data, the number of nodes requested and the choice of compiler, and is hard to predict without experimentation. However, there are some considerations that apply to ARCHER2:
Due to NUMA effects, it is likely that running at least one MPI process per NUMA region (i.e. at least 8 MPI processes per node) will be beneficial.
The number of MPI processes per node should be a power of 2, so that all OpenMP threads run in the same NUMA region as their parent MPI process.
For applications where each process has a small memory footprint (e.g. some molecular dynamics codes), running no more than 4 OpenMP threads per MPI process may be beneficial, so that all the threads in a process share a single L3 cache.
ARCHER2 is the next generation UK National Supercomputing Service. You can find more information on the service and the research it supports on the ARCHER2 website.
The ARCHER2 Service is a world class advanced computing resource for UK researchers. ARCHER2 is provided by UKRI, EPCC, Cray (an HPE company) and the University of Edinburgh.
"},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"This is the documentation for the ARCHER2 service and includes:
Quick Start Guide The ARCHER2 quick start guide provides the minimum information for new users.
ARCHER2 User and Best Practice Guide Covers all aspects of use of the ARCHER2 supercomputing service. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of ARCHER2, and other advanced technical topics.
Research Software Information on each of the centrally-installed research software packages.
Software Libraries Information on the centrally-installed software libraries. Most libraries work as expected so no additional notes are required however a small number require specific documentation
Data Analysis and Tools Information on data analysis tools and other useful utilities.
Other Software Useful information on software that is not officially supported by the ARCHER2 service but that will be useful to users of that software.
Essential Skills This section provides information and links on essential skills required to use ARCHER2 efficiently: e.g. using Linux command line, accessing help and documentation.
ARCHER2 and publications This section describes how to acknowledge the use of ARCHER2 in your published work and how to use the ARCHER2 publications database.
The source for this documentation is publicly available in the ARCHER2 documentation GitHub repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or the addition of Issues providing suggestions for how it can be improved.
Full details of how to contribute can be found in the README.md
file of the repository.
This documentation draws on the Cirrus Tier-2 HPC Documentation, Sheffield Iceberg Documentation and the ARCHER National Supercomputing Service Documentation.
"},{"location":"archer-migration/","title":"ARCHER to ARCHER2 migration","text":"This section of the documentation is a guide for user migrating from ARCHER to ARCHER2.
It covers:
Tip
If you need help or have questions on ARCHER to ARCHER2 migration, please contact the ARCHER2 service desk
"},{"location":"archer-migration/account-migration/","title":"Migrating your account from ARCHER to ARCHER2","text":"This section covers the following questions:
Tip
If you need help or have questions on ARCHER to ARCHER2 migration, please contact the ARCHER2 service desk
"},{"location":"archer-migration/account-migration/#when-will-i-be-able-to-access-archer2","title":"When will I be able to access ARCHER2?","text":"We anticipate that users will have access during the week beginning 11th January 2021. Notification of activation of ARCHER2 projects will be sent to the project leaders/PIs and the project users.
"},{"location":"archer-migration/account-migration/#has-my-project-been-migrated-to-archer2","title":"Has my project been migrated to ARCHER2?","text":"If you have an active ARCHER allocation at the end of the ARCHER service then your project will very likely be migrated to ARCHER2. If your project is migrated to ARCHER2 then it will have the same project code as it had on ARCHER.
Some further information that may be useful:
The unit of allocation on ARCHER2 is called the ARCHER2 Compute Unit (CU) and, in general, 1 CU will be worth 1 ARCHER2 node hour.
UKRI have determined the conversion rates which will be used to transfer existing ARCHER allocations onto ARCHER2. These will be:
In identifying these conversion rates UKRI has endeavoured to ensure that no user will be disadvantaged by the transfer of their allocation from ARCHER to ARCHER2.
A nominal allocation will be provided to all projects during the initial no-charging period. Users will be notified before the no-charging period ends.
When the ARCHER service ends, any unused ARCHER allocation in kAUs will be converted to ARCHER2 CUs and transferred to ARCHER2 project allocation.
"},{"location":"archer-migration/account-migration/#how-do-i-set-up-an-archer2-account","title":"How do I set up an ARCHER2 account?","text":"Once you have been notified that you can go ahead and setup an ARCHER2 account you will do this through SAFE. Note that you should use the new unified SAFE interface rather than the ARCHER SAFE. The correct URL for the new SAFE is:
Your access details for this SAFE are the same as those for the ARCHER SAFE. You should log in in exactly the same way as you did on the ARCHER SAFE.
Important
You should make sure you request the same account name in your project on ARCHER2 as you have on ARCHER. This is to ensure that you have seamless access to your ARCHER /home data on ARCHER2. See the ARCHER to ARCHER2 Data Migration page for details on data transfer from ARCHER to ARCHER2
Once you have logged into SAFE, you will need to complete the following steps before you can log into ARCHER2 for the first time:
The ARCHER2 documentation covers logging in to ARCHER2 from a variety of operating systems:
This section provides an overview of the main differences between ARCHER and ARCHER2 along with links to more information where appropriate.
"},{"location":"archer-migration/archer2-differences/#for-all-users","title":"For all users","text":"srun
rather than aprun
This short guide explains how to move data from the ARCHER service to the ARCHER2 service.
We have also created a walkthrough video to guide you.
Note
This section assumes that you have an active ARCHER and ARCHER2 account, and that you have successfully logged in to both accounts.
Tip
Unlike normal access, ARCHER to ARCHER2 transfer has been set up to require only one form of authentication. You will not need to generate a new SSH key pair to transfer data from ARCHER to ARCHER2 as your password will suffice.
First, log in to ARCHER (making sure to change auser
to your username):
ssh auser@login.archer.ac.uk\n
Then, combine important research data into a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt directory1/\n
Please be selective -- the more data you want to transfer, the more time it will take.
From ARCHER in particular, in order to get the best transfer performance, we need to access a newer version of the SSH program. We do this by loading the openssh
module:
module load openssh\n
"},{"location":"archer-migration/data-migration/#transferring-data-using-rsync-recommended","title":"Transferring data using rsync
(recommended)","text":"Begin the data transfer from ARCHER to ARCHER2 using rsync
:
rsync -Pv -e\"ssh -c aes128-gcm@openssh.com\" \\\n ./all_my_files.tar.gz a2user@transfer.dyn.archer2.ac.uk:/work/t01/t01/a2user\n
Important
Notice that the hostname for data transfer from ARCHER to ARCHER2 is not the usual login address. Instead, you use transfer.dyn.archer2.ac.uk
. This address has been configured to allow higher performance data transfer and to allow access to ARCHER with password only with no SSH key required.
When running this command, you will be prompted to enter your ARCHER2 password. Enter it and the data transfer will begin. Also, remember to replace a2user
with your ARCHER2 username, and t01
with the budget associated with that username.
The -P
flag allows partial transfer -- the same command could be used to restart the transfer after a loss of connection. The -e
flag allows specification of the ssh command - we have used this to pass options to ssh. The -c
option specifies the cipher to be used as aes128-gcm
which has been found to increase performance. Unfortunately the ~
shortcut is not correctly expanded, so we have specified the full path. We move our research archive to our project work directory on ARCHER2.
If you are unconcerned about being able to restart an interrupted transfer, you could instead use the scp
command,
scp -c aes128-gcm@openssh.com all_my_files.tar.gz \\\n a2user@transfer.dyn.archer2.ac.uk:/work/t01/t01/a2user/\n
but rsync
is recommended for larger transfers.
This section of the documentation is a guide for users migrating from the ARCHER2 4-cabinet system to the ARCHER2 full system.
It covers:
Tip
If you need help or have questions on ARCHER2 4-cab to full ARCHER2 migration please contact the ARCHER2 service desk
"},{"location":"archer2-migration/account-migration/","title":"Accessing the ARCHER2 full system","text":"This section covers the following questions:
Tip
If you need help or have questions on using ARCHER2 4-cabinet system and ARCHER2 full system please contact the ARCHER2 service desk
"},{"location":"archer2-migration/account-migration/#when-will-i-be-able-to-access-archer2-full-system","title":"When will I be able to access ARCHER2 full system?","text":"We anticipate that users will have access from mid-late November. Users will have access to both the ARCHER2 4-cabinet system and ARCHER2 full system for at least 30 days. UKRI will confirm the dates and these will be communicated to you as they are confirmed. There will be at least 14 days notice before access to the ARCHER2 4-Cabinet system is removed.
"},{"location":"archer2-migration/account-migration/#has-my-project-been-enabled-on-archer2-full-system","title":"Has my project been enabled on ARCHER2 full system?","text":"If you have an active ARCHER2 4-cabinet system allocation on 1st October 2021 then your project will be enabled on the ARCHER2 full system. The project code is the same on the full service as it is on ARCHER2 4-cabinet system.
Some further information that may be useful:
The unit of allocation on ARCHER2 is called the ARCHER2 Compute Unit (CU) and 1 CU is equivalent to 1 ARCHER2 node hour. Your time budget will be shared on both systems. This means that any existing allocation available to your project on the 4-cabinet system will also be available on the full system.
There will be a period of at least 30 days where users will have access to both the 4-cabinet system and the full system. During this time, use on the full system will be uncharged (though users must still have access to a valid, positive budget to be able to submit jobs) and use on the 4-cabinet system will be charged in the usual way. Users will be notified before the no-charging period ends.
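As a rough illustration of the CU definition above (1 CU = 1 node hour), the nominal cost of a job can be estimated from its node count and run time. This is a sketch only: the helper name `job_cost_cu` is hypothetical, and it does not account for the uncharged period on the full system.

```shell
#!/bin/bash
# Hypothetical helper: nominal cost in CU for a job,
# given the number of nodes and the run time in seconds.
job_cost_cu() {
    local nodes=$1 seconds=$2
    # 1 CU = 1 node hour, so cost = nodes * hours.
    awk -v n=$nodes -v s=$seconds 'BEGIN { print n * s / 3600 }'
}

job_cost_cu 4 5400    # 4 nodes for 90 minutes: prints 6
```

The same arithmetic can be used to check that a planned job fits within the remaining budget before it is submitted.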
"},{"location":"archer2-migration/account-migration/#how-do-i-set-up-an-account-on-the-full-system","title":"How do I set up an account on the full system?","text":"You will keep the same usernames, passwords and SSH keys that you use on the 4-cabinet system on the full system.
You do not need to do anything to enable your account; it will be made available automatically once access to the full system is available.
You will connect to the full system in the same way as you connect to the 4-cabinet system except for switching the ordering of the credentials:
The ARCHER2 documentation covers logging in to ARCHER2 from a variety of operating systems: - Logging in to ARCHER2 from macOS/Linux - Logging in to ARCHER2 from Windows
Login addresses:
Tip
When logging into the ARCHER2 full system for the first time, you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11).
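One way to delete a single line from the command line is sketched below. It runs against a stand-in file with placeholder entries so that it is safe to try; for your real file you would point it at ~/.ssh/known_hosts and use the line number reported by SSH. Alternatively, `ssh-keygen -R login.archer2.ac.uk` removes the stored keys for that host name without needing the line number.

```shell
#!/bin/bash
# Stand-in for ~/.ssh/known_hosts with 12 placeholder entries.
known_hosts=$(mktemp)
seq 1 12 | sed 's/^/placeholder-entry-/' > $known_hosts

# Delete line 11 (the line number reported by SSH), keeping a .bak backup.
sed -i.bak 11d $known_hosts

wc -l < $known_hosts      # 11 lines remain
```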
There are three file systems associated with the ARCHER2 Service:
"},{"location":"archer2-migration/account-migration/#home-file-systems","title":"home file systems","text":"The home file systems will be mounted on both the 4-cabinet system and the full system; so users\u2019 directories are shared across the two systems. Users will be able to access the home file systems from both systems and no action is required to move data. The home file systems will be read and writeable on both services during the transition period.
"},{"location":"archer2-migration/account-migration/#work-file-systems","title":"work file systems","text":"There are different work file systems for the 4-cabinet system and the full system.
The work file system on the 4-cabinet system (labelled \u201carcher2-4c-work\u201d in SAFE) will remain available on the 4-cabinet system during the transition period.
There will be new work file systems on the full system and you will have new directories on the new work file systems. Your initial quotas will typically be double your quotas for the 4-cabinet work file system.
Important: you are responsible for transferring any required data from the 4-cabinet work file systems to your new directories on the work file systems on the full system.
The work file system on the 4-cabinet system will be available for you to transfer your data from for at least 30 days from the start of the ARCHER2 full system access and 14 days notice will be given before the 4-cabinet work file system is removed.
"},{"location":"archer2-migration/account-migration/#rdfaas-file-systems","title":"RDFaaS file systems","text":"For users who have access to the RDFaaS, your RDFaaS data will be available on both the 4-cabinet system and the full system during the transition period and will be readable and writeable on both systems.
"},{"location":"archer2-migration/archer2-differences/","title":"Main differences between ARCHER2 4-cabinet system and ARCHER2 full system","text":"This section provides an overview of the main differences between the ARCHER2 4-cabinet system that all users have been using up until now and the full ARCHER2 system along with links to more information where appropriate.
"},{"location":"archer2-migration/archer2-differences/#for-all-users","title":"For all users","text":"--reservation=shortqos
when using the short
QoS.reservation
QoS.module load epcc-job-env
command to job submission scripts.cray-netcdf
or cray-netcdf-hdf5parallel
modules until you have loaded the appropriate cray-hdf5
or cray-hdf5-parallel
modules). You can use the module spider
command to see all available modules, including hidden ones. This short guide explains how to move data from the work file system on the ARCHER2 4-cabinet system to the ARCHER2 full system. Your space on the home file system is shared between the ARCHER2 4-cabinet system and the ARCHER2 full system so everything from your home directory is already effectively transferred.
Note
This section assumes that you have an active ARCHER2 4-cabinet system and ARCHER2 full system account, and that you have successfully logged in to both accounts.
Tip
Unlike normal access, ARCHER2 4-cabinet system to ARCHER2 full system transfer has been set up to require only one form of authentication. You will only need one factor to authenticate from the 4-cabinet system to the full system or vice versa. This factor can be either an SSH key (that has been registered against your account in SAFE) or your password. If you have a large amount of data to transfer you may want to set up a passphrase-less SSH key on the ARCHER2 full system and use the data analysis nodes to run transfers via a Slurm job.
"},{"location":"archer2-migration/data-migration/#transferring-data-interactively-from-the-4-cabinet-system-to-the-full-system","title":"Transferring data interactively from the 4-cabinet system to the full system","text":"First, log in to the ARCHER2 4-cabinet system (making sure to change auser
to your username):
ssh auser@login-4c.archer2.ac.uk\n
Then, combine important research data into a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt directory1/\n
Please be selective -- the more data you want to transfer, the more time it will take.
Unpack the archive file in the destination directory:
tar -xzf all_my_files.tar.gz\n
"},{"location":"archer2-migration/data-migration/#transferring-data-using-rsync-recommended","title":"Transferring data using rsync
(recommended)","text":"Begin the data transfer from the ARCHER2 4-cabinet system to the ARCHER2 full system using rsync
:
rsync -Pv all_my_files.tar.gz a2user@login.archer2.ac.uk:/work/t01/t01/a2user\n
When running this command, you will be prompted to enter your ARCHER2 password -- this is the same password for the ARCHER2 4-cabinet system and the ARCHER2 full system. Enter it and the data transfer will begin. Remember to replace a2user
with your ARCHER2 username, and t01
with the budget associated with that username.
We use the -P
flag to allow partial transfer -- the same command could be used to restart the transfer after a loss of connection. We move our research archive to our project work directory on the ARCHER2 full system.
If you are unconcerned about being able to restart an interrupted transfer, you could instead use the scp
command,
scp all_my_files.tar.gz a2user@login.archer2.ac.uk:/work/t01/t01/a2user/\n
but rsync
is recommended for larger transfers.
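Whichever command is used, it can be worth checking that the archive arrived intact before deleting the original. The sketch below is self-contained and runs in a scratch directory with example file names; in practice the checksum file would be created on the 4-cabinet system, transferred alongside the archive, and checked on the full system.

```shell
#!/bin/bash
# Self-contained demonstration in a scratch directory; file names are examples.
scratch=$(mktemp -d)
cd $scratch
mkdir directory1
echo data > file1.txt
echo more > directory1/file2.txt

# Create the archive and record its checksum before transferring.
tar -czf all_my_files.tar.gz file1.txt directory1/
sha256sum all_my_files.tar.gz > all_my_files.tar.gz.sha256

# On the destination, run the same check next to the transferred archive;
# here both steps run in the same directory for illustration.
sha256sum -c all_my_files.tar.gz.sha256

# List the archive contents without unpacking.
tar -tzf all_my_files.tar.gz
```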
It may be convenient to submit long data transfers to the serial queue. In this case, a number of simple preparatory steps are required to authenticate:
ssh/scp
commands in the serial queue to authenticate. As it has been arranged that only one of SSH key or password is required between the serial nodes and the 4-cabinet system, this is sufficient. An example serial queue script using rsync
might be:
#!/bin/bash\n\n# Slurm job options (job-name, job time)\n\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n#SBATCH --time=02:00:00\n#SBATCH --ntasks=1\n\n# Replace [budget code] below with your budget code\n\n#SBATCH --account=[budget code] \n\n# Issue appropriate rsync command\n\nrsync -av --stats --progress --rsh=\"ssh -i ${HOME}/.ssh/id_rsa_batch\" \\\n user-01@login-4c.archer2.ac.uk:/work/proj01/proj01/user-01/src \\\n /work/proj01/proj01/user-01/destination\n
where ${HOME}/.ssh/id_rsa_batch
is the new ssh key. Note that the ${HOME}
directory is visible from the serial nodes on the full system, so ssh key pairs in ${HOME}/.ssh
are available. "},{"location":"archer2-migration/porting/","title":"Porting applications to full ARCHER2 system","text":"Porting applications to the full ARCHER2 system has generally proven straightforward if they are running successfully on the ARCHER2 4-cabinet system. You should be able to use the same (or very similar) compile processes on the full system as you used on the 4-cabinet system.
During testing of the ARCHER2 full system, the CSE team at EPCC have seen that application binaries compiled on the 4-cabinet system can usually be copied over to the full system and work well and give good performance. However, if you run into issues with executables taken from the 4-cabinet system on the full system you should recompile in the first instance.
Information on compiling applications on the full system can be found in the Application Development Environment section of the User and Best Practice Guide.
"},{"location":"data-tools/","title":"Data Analysis and Tools","text":"This section provides information on each of the centrally installed data analysis software and other software tools.
The tools currently available in this section are (software that is installed or maintained by third-parties rather than the ARCHER2 service are marked with *):
AMD \u03bcProf (\u201cMICRO-prof\u201d) is a software profiling analysis tool for x86 applications running on Windows, Linux and FreeBSD operating systems and provides event information unique to the AMD \u201cZen\u201d-based processors and AMD INSTINCT\u2122 MI Series accelerators. AMD uProf enables the developer to better understand the limiters of application performance and evaluate improvements.
"},{"location":"data-tools/amd-uprof/#accessing-amd-prof-on-archer2","title":"Accessing AMD \u03bcProf on ARCHER2","text":"To gain access to the AMD \u03bcProf tools on ARCHER2, you must load the module:
module load amd-uprof\n
"},{"location":"data-tools/amd-uprof/#using-amd-prof","title":"Using AMD \u03bcProf","text":"Please see the AMD documentation for information on how to use \u03bcProf:
Blender is a 3D rendering and package tool primarily used for 3D animation and VFX but increasingly also for scientific visualisation. Because it is an artist's tool first, rather than a scientific visualisation package, it offers great versatility and complete control over every aspect of the final image.
"},{"location":"data-tools/blender/#useful-links","title":"Useful links","text":"Blender is available through the blender
module.
module load blender\n
Once the module has been loaded, the Blender executable will be available.
"},{"location":"data-tools/blender/#running-blender-jobs","title":"Running blender jobs","text":"Even though blender is single node only, each frame being independent makes the render of animations an embarrassingly parallel problem. Here is an example job for running blender to export frames 1 to 100 from the blend file scene.blend
. Submitting another job with a different frame range will use a second node, and so on.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=example_blender_job\n#SBATCH --time=0:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load blender\n\nexport BLENDER_USER_RESOURCES=${HOME/home/work}/.blender\n\n# Render frames 1 to 100 (Blender expresses a frame range as first..last)\nblender -b scene.blend --render-output //render_ -noaudio -f 1..100 -- --cycles-device CPU\n
The full list of command line arguments can be found in Blender's documentation. Note that with blender, the order of the arguments is important.
To automate this process, add-ons such as the one available in blender4science are helpful: they let you submit multiple identical jobs and handle the parallelisation so that each frame is rendered only once.
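If you prefer to split the work by hand, the frame range can be divided into contiguous chunks, one per job. The following is a minimal sketch of that arithmetic only; the function name `frame_chunks` and the chunk size are assumptions for illustration, not part of Blender or of the ARCHER2 module.

```shell
#!/bin/bash
# Hypothetical helper: print one line of start and end frames per chunk.
frame_chunks() {
    local first=$1 last=$2 chunk=$3 start end
    for (( start = first; start <= last; start += chunk )); do
        end=$(( start + chunk - 1 ))
        # The final chunk may be shorter than the others.
        (( end > last )) && end=$last
        echo $start $end
    done
}

# Frames 1 to 100 in chunks of 25 give four jobs:
# 1 25, 26 50, 51 75 and 76 100.
frame_chunks 1 100 25
```

Each pair of numbers can then be used as the frame range of a separate job submission, so that every frame is rendered exactly once.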
Note
Blender does not currently work on ARCHER2 GPU nodes due to incompatibilities with the available ROCm version.
"},{"location":"data-tools/cray-r/","title":"R","text":""},{"location":"data-tools/cray-r/#r-for-statistical-computing","title":"R for statistical computing","text":"R is a software environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time-series analysis, classification, clustering, and so on).
Note
When you log onto ARCHER2, no R module is loaded by default. You need to load the cray-R
module to access the functionality described below.
The recommended version of R to use on ARCHER2 is the HPE Cray R distribution, which can be loaded using:
module load cray-R\n
The HPE Cray R distribution includes a range of common R packages, including all of the base packages, plus a few others.
To see what packages are available, run the R command
library()\n
from the R command prompt.
At the time of writing, the HPE Cray R distribution included the following packages:
Full system Packages in library \u2018/opt/R/4.0.3.0/lib64/R/library\u2019:\n\nbase The R Base Package\nboot Bootstrap Functions (Originally by Angelo Canty\n for S)\nclass Functions for Classification\ncluster \"Finding Groups in Data\": Cluster Analysis\n Extended Rousseeuw et al.\ncodetools Code Analysis Tools for R\ncompiler The R Compiler Package\ndatasets The R Datasets Package\nforeign Read Data Stored by 'Minitab', 'S', 'SAS',\n 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...\ngraphics The R Graphics Package\ngrDevices The R Graphics Devices and Support for Colours\n and Fonts\ngrid The Grid Graphics Package\nKernSmooth Functions for Kernel Smoothing Supporting Wand\n & Jones (1995)\nlattice Trellis Graphics for R\nMASS Support Functions and Datasets for Venables and\n Ripley's MASS\nMatrix Sparse and Dense Matrix Classes and Methods\nmethods Formal Methods and Classes\nmgcv Mixed GAM Computation Vehicle with Automatic\n Smoothness Estimation\nnlme Linear and Nonlinear Mixed Effects Models\nnnet Feed-Forward Neural Networks and Multinomial\n Log-Linear Models\nparallel Support for Parallel computation in R\nrpart Recursive Partitioning and Regression Trees\nspatial Functions for Kriging and Point Pattern\n Analysis\nsplines Regression Spline Functions and Classes\nstats The R Stats Package\nstats4 Statistical Functions using S4 Classes\nsurvival Survival Analysis\ntcltk Tcl/Tk Interface\ntools Tools for Package Development\nutils The R Utils Package\n
4-cabinet system Packages in library \u2018/opt/R/4.0.2.0/lib64/R/library\u2019:\n\nbase The R Base Package\nboot Bootstrap Functions (Originally by Angelo Canty\n for S)\nclass Functions for Classification\ncluster \"Finding Groups in Data\": Cluster Analysis\n Extended Rousseeuw et al.\ncodetools Code Analysis Tools for R\ncompiler The R Compiler Package\ndatasets The R Datasets Package\nforeign Read Data Stored by 'Minitab', 'S', 'SAS',\n 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...\ngraphics The R Graphics Package\ngrDevices The R Graphics Devices and Support for Colours\n and Fonts\ngrid The Grid Graphics Package\nKernSmooth Functions for Kernel Smoothing Supporting Wand\n & Jones (1995)\nlattice Trellis Graphics for R\nMASS Support Functions and Datasets for Venables and\n Ripley's MASS\nMatrix Sparse and Dense Matrix Classes and Methods\nmethods Formal Methods and Classes\nmgcv Mixed GAM Computation Vehicle with Automatic\n Smoothness Estimation\nnlme Linear and Nonlinear Mixed Effects Models\nnnet Feed-Forward Neural Networks and Multinomial\n Log-Linear Models\nparallel Support for Parallel computation in R\nrpart Recursive Partitioning and Regression Trees\nspatial Functions for Kriging and Point Pattern\n Analysis\nsplines Regression Spline Functions and Classes\nstats The R Stats Package\nstats4 Statistical Functions using S4 Classes\nsurvival Survival Analysis\ntcltk Tcl/Tk Interface\ntools Tools for Package Development\nutils The R Utils Package\n
"},{"location":"data-tools/cray-r/#running-r-on-the-compute-nodes","title":"Running R on the compute nodes","text":"In this section, we provide an example job submission script for using R on the ARCHER2 compute nodes.
"},{"location":"data-tools/cray-r/#serial-r-submission-script","title":"Serial R submission script","text":"#!/bin/bash --login\n\n#SBATCH --job-name=r_test\n#SBATCH --ntasks=1\n#SBATCH --time=00:10:00\n\n# Replace [budget code] below with your project code (e.g., t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n# Load the R module\nmodule load cray-R\n\n# Run your R programme\nRscript serial_test.R\n
On completion, the output of the R script will be available in the job output file.
"},{"location":"data-tools/darshan/","title":"Darshan","text":"Darshan is a scalable HPC I/O characterization tool. Darshan is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with minimum overhead. The name is taken from a Sanskrit word for \"sight\" or \"vision\".
Darshan is developed at the Argonne Leadership Computing Facility (ALCF).
Useful links:
Using Darshan generally consists of two stages:
To collect IO profile data you add the command:
module load darshan\n
to your job submission script as the last module
command before you run your program. As Darshan does not distinguish between different software run in your job submission script, we typically recommend that you use a structure like:
module load darshan\nsrun ...usual software launch options...\nmodule remove darshan\n
This will avoid Darshan profiling IO for operations that are not part of your main parallel program.
Tip
There may be some periods when Darshan monitoring is enabled by default for all users. During these periods, you can disable Darshan monitoring by adding the command module remove darshan
to your job submission script. Periods of Darshan monitoring will be noted on the ARCHER2 Service Status page.
Important
The darshan
module is dependent on the compiler environment you are using and you should ensure that you load the darshan
module that matches the compiler environment you used to compile the program you are analysing. For example, if your software was compiled using PrgEnv-gnu
, then you would need to activate the GCC compiler environment before loading the darshan
module to ensure you get the GCC version of Darshan. This means loading the correct PrgEnv-
module before you load the darshan
module:
module load PrgEnv-gnu\nmodule load darshan\nsrun ...usual software launch options...\nmodule remove darshan\n
"},{"location":"data-tools/darshan/#location-of-darshan-profile-logs","title":"Location of Darshan profile logs","text":"Darshan writes all profile logs to a shared location on the ARCHER2 NVMe Lustre file system. You can find your profile logs at:
/mnt/lustre/a2fs-nvme/system/darshan/YYYY/MM/DD\n
where YYYY/MM/DD
correspond to the date on which your job ran.
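As a convenience, the directory for a given date can be built with the date command. This is a minimal sketch: the function name `darshan_log_dir` is hypothetical, it relies on the -d option of GNU date, and the commented ls line assumes that log file names contain your username.

```shell
#!/bin/bash
# Root of the shared Darshan log area, as given above.
darshan_root=/mnt/lustre/a2fs-nvme/system/darshan

# Hypothetical helper: print the log directory for a date (default: today).
darshan_log_dir() {
    local day=${1:-today}
    # GNU date formats the given day as YYYY/MM/DD.
    echo $darshan_root/$(date -d $day +%Y/%m/%d)
}

darshan_log_dir 2024-12-12   # prints /mnt/lustre/a2fs-nvme/system/darshan/2024/12/12

# To look for your own logs from today (assumes names contain your username):
# ls $(darshan_log_dir) | grep $USER
```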
The simplest way to analyse the profile log files is to use the darshan-parser
utility on the ARCHER2 login nodes. You make the Darshan analysis utilities available with the command:
module load darshan-util\n
Once this is loaded, you can produce an I/O performance summary from a profile log file with:
darshan-parser --perf /path/to/darshan/log/file.darshan\n
You can get a dump of all data in the Darshan profile log by omitting the --perf
option, e.g.:
darshan-parser /path/to/darshan/log/file.darshan\n
Tip
The darshan-job-summary.pl
and darshan-summary-per-file.sh
utilities do not work on ARCHER2 as the required graphical packages are not currently available.
Documentation on the Darshan analysis utilities is available at:
Linaro Forge provides debugging and profiling tools for MPI parallel applications, and OpenMP or pthreads multi-threaded applications (and also hybrid MPI/OpenMP). Forge DDT is the debugger and MAP is the profiler.
"},{"location":"data-tools/forge/#user-interface","title":"User interface","text":"There are two ways of running the Forge user interface. If you have a good internet connection to ARCHER2, the GUI can be run on the front-end (with an X-connection). Alternatively, one can download a copy of the Forge remote client to your laptop or desktop, and run it locally. The remote client should be used if at all possible.
To download the remote client, see the Forge download pages. Version 24.0 is known to work at the time of writing. A section further down this page explains how to use the remote client, see Connecting with the remote client.
"},{"location":"data-tools/forge/#licensing","title":"Licensing","text":"ARCHER2 has a licence for up to 2080 tokens, where a token represents an MPI parallel process. Running Forge DDT/MAP to debug/profile a code running across 16 nodes using 128 MPI ranks per node would require 2048 tokens. If you wish to run on more nodes, say 32, then it will be necessary to reduce the number of tasks per node so as to fall below the maximum number of tokens allowed.
Please note, Forge licence tokens are shared by all ARCHER2 (and Cirrus) users.
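The token arithmetic above can be sketched as follows. The limit of 2080 tokens and the figure of 128 ranks per node are taken from the text; the function name `ranks_per_node_limit` is hypothetical, and the result is only an upper bound because tokens are shared with other users.

```shell
#!/bin/bash
# Licence size quoted above: one token per MPI process.
max_tokens=2080

# Hypothetical helper: most MPI ranks per node that fit within the licence.
ranks_per_node_limit() {
    local nodes=$1
    local limit=$(( max_tokens / nodes ))
    # Never suggest more than the 128 ranks per node used in the text.
    (( limit > 128 )) && limit=128
    echo $limit
}

ranks_per_node_limit 16   # prints 128 (16 x 128 = 2048 tokens)
ranks_per_node_limit 32   # prints 65  (32 x 65 = 2080 tokens)
```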
To see how many tokens are in use, you can view the licence server status page by first setting up an SSH tunnel to the node hosting the licence server.
ssh <username>@login.archer2.ac.uk -L 4241:dvn04:4241\n
You can now view the status page from within a local browser, see http://localhost:4241/status.html.
Note
The licence status page may contain multiple licences, indicated by a row of buttons (one per licence) near the top of the page. The details of the 12-month licence described above can be accessed by clicking on the first button in the row. Additional buttons may appear at various times for boosted licences: once a quarter, ARCHER2 will have a boosted 7-day licence offering 8192 tokens, sufficient for 64 nodes running 128 MPI ranks per node. Please contact the Service Desk if you have a specific requirement that exceeds the current Forge licence provision.
Note
The licence status page refers to the Arm Licence Server. Arm is the name of the company that originally developed Forge before it was acquired by Linaro.
"},{"location":"data-tools/forge/#one-time-set-up-for-using-forge","title":"One time set-up for using Forge","text":"A preliminary step is required to set up the necessary Forge configuration files that allow DDT and MAP to initialise their environment correctly so that they can, for example, interact with the Slurm queue system. These steps should be performed in the /work
file system on ARCHER2.
It is recommended that these commands are performed in the top-level work file system directory for the user account, i.e., ${HOME/home/work}
.
module load forge\ncd ${HOME/home/work}\nsource ${FORGE_DIR}/config-init\n
Running the source
command will create a directory ${HOME/home/work}/.forge
that contains the following files.
system.config user.config\n
Warning
The config-init
script may output, Warning: failed to read system config
. Please ignore this, as subsequent messages should indicate that the new configuration files have been created.
Within the system.config
file you should find that shared directory
is set to the equivalent of ${HOME/home/work}/.forge
. That directory will also store other relevant files when Forge is run.
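The `${HOME/home/work}` form used throughout this section is a bash parameter expansion that replaces the first occurrence of `home` in the value of `HOME` with `work`. The sketch below shows the effect on an example path; the path itself is an assumption for illustration, not a real account.

```shell
#!/bin/bash
# Illustration only: an example home directory path.
example_home=/home/t01/t01/auser

# ${var/pattern/replacement} replaces the first match of the pattern.
work_dir=${example_home/home/work}

echo $work_dir            # prints /work/t01/t01/auser
echo $work_dir/.forge     # prints /work/t01/t01/auser/.forge
```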
DDT (Distributed Debugging Tool) provides an easy-to-use graphical interface for source-level debugging of compiled C/C++ or Fortran codes. It can be used for non-interactive debugging, and there is also some limited support for Python debugging.
"},{"location":"data-tools/forge/#preparation","title":"Preparation","text":"To prepare your program for debugging, compile and link in the normal way but remember to include the -g
compiler option to retain symbolic information in the executable. For some programs, it may be necessary to reduce the optimisation to -O0
to obtain full and consistent information. However, this in itself can change the behaviour of bugs, so some experimentation may be necessary.
A non-interactive method of debugging is available which allows information to be obtained on the state of the execution at the point of failure in a batch job.
Such a job can be submitted to the batch system in the usual way. The relevant command to start the executable is as follows.
# ... Slurm batch commands as usual ...\n\nmodule load forge\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nddt --verbose --offline --mpi=slurm --np 8 \\\n --mem-debug=fast --check-bounds=before \\\n ./my_executable\n
The parallel launch is delegated to ddt
and the --mpi=slurm
option indicates to ddt
that the relevant queue system is Slurm (there is no explicit srun
). It will also be necessary to state explicitly to ddt
the number of processes required (here --np 8
). For other options see, e.g., ddt --help
.
Note that higher levels of memory debugging can result in extremely slow execution. The example given above uses the default --mem-debug=fast
which should be a reasonable first choice.
Execution will produce a .html
format report which can be used to examine the state of execution at the point of failure.
You can also start the client interactively (for details of remote launch, see Connecting with the remote client).
module load forge\nddt\n
This should start a window as shown below. Click on the DDT panel on the left, and then on the Run and debug a program option. This will bring up the Run dialogue as shown.
Note:
One can start either DDT or MAP by clicking the appropriate panel on the left-hand side;
If the license has connected successfully, a serial number will be shown in small text at the lower left.
In the Application sub panel of the Run dialog box, details of the executable, command line arguments or data files, the working directory and so on should be entered.
Click the MPI checkbox and specify the MPI implementation. This is done by clicking the Details button and then the Change button. Choose the SLURM (generic) implementation from the drop-down menu and click OK. You can then specify the required number of nodes/processes and so on.
Click the OpenMP checkbox and select the relevant number of threads (if there is no OpenMP in the application itself, select 1 thread).
Click the Submit to Queue checkbox and then the associated Configure button. A new set of options will appear such as Submission template file, where you can enter ${FORGE_DIR}/templates/archer2.qtf
and click OK. This template file provides many of the options required for a standard batch job. You will then need to click on the Queue Parameters button in the same section and specify the relevant project budget, see the Account entry.
The default queue template file configuration uses the short QoS with the standard time limit of 20 minutes. If something different is required, one can edit the settings. Alternatively, one can copy the archer2.qtf
file (to ${HOME/home/work}/.forge
) and make the relevant changes. This new template file can then be specified in the dialog window.
There may be a short delay while the sbatch job starts. Debugging should then proceed as described in the Linaro Forge documentation.
"},{"location":"data-tools/forge/#using-map","title":"Using MAP","text":"Load the forge
module:
module load forge\n
"},{"location":"data-tools/forge/#linking","title":"Linking","text":"MAP uses two small libraries to collect data from your program. These are called map-sampler
and map-sampler-pmpi
. On ARCHER2, the linking of these libraries is usually done automatically via the LD_PRELOAD mechanism, but only if your program is dynamically linked. Otherwise, you will need to link the MAP libraries manually by providing explicit link options.
The library paths specified in the link options will depend on the programming environment you are using as well as the Cray programming release. Here are the paths for each of the compiler environments consistent with the Cray Programming Release (CPE) 22.12 using the default OFI as the low-level comms protocol:
PrgEnv-cray
: ${FORGE_DIR}/map/libs/default/cray/ofi
PrgEnv-gnu
: ${FORGE_DIR}/map/libs/default/gnu/ofi
PrgEnv-aocc
: ${FORGE_DIR}/map/libs/default/aocc/ofi
For example, for PrgEnv-gnu
the additional options required at link time are given below.
-L${FORGE_DIR}/map/libs/default/gnu/ofi \\\n-lmap-sampler-pmpi -lmap-sampler \\\n-Wl,--eh-frame-hdr -Wl,-rpath=${FORGE_DIR}/map/libs/default/gnu/ofi\n
The MAP libraries for other Cray programming releases can be found under ${FORGE_DIR}/map/libs
. If you require MAP libraries built for the UCX comms protocol, simply replace ofi
with ucx
in the library path.
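The path rules above can be collected into a small helper. This is only a sketch: the function name `map_link_flags` is hypothetical, and it assumes `FORGE_DIR` has already been set by loading the forge module and that the `default` library directory is the one you want.

```shell
# Print the MAP link options for a compiler environment and comms protocol.
# Usage: map_link_flags <cray|gnu|aocc> [ofi|ucx]
# The function body runs in a subshell so its variables stay private.
map_link_flags() (
    compiler="$1"
    comms="${2:-ofi}"    # OFI is the default protocol
    case "$compiler" in
        cray|gnu|aocc) ;;
        *) echo "unknown compiler environment: $compiler" >&2; exit 1 ;;
    esac
    case "$comms" in
        ofi|ucx) ;;
        *) echo "unknown comms protocol: $comms" >&2; exit 1 ;;
    esac
    libdir="${FORGE_DIR}/map/libs/default/${compiler}/${comms}"
    printf '%s\n' "-L${libdir} -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=${libdir}"
)
```

The printed string could then be appended to your usual link command, for example through command substitution in a Makefile or build script.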
Submit a batch job in the usual way, and include the lines:
# ... Slurm batch commands as usual ...\n\nmodule load forge\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nmap -n <number of MPI processes> --mpi=slurm --mpiargs=\"--hint=nomultithread --distribution=block:block\" --profile ./my_executable\n
Successful execution will generate a file with a .map
extension.
This .map
file may be viewed via the GUI (start with either map
or forge
) by selecting the Load a profile data file from a previous run option. The resulting file selection dialog box can then be used to locate the .map
file.
If one starts the Forge client on e.g., a laptop, one should see the main window as shown above. Select Remote Launch and then Configure from the drop-down menu. In the Configure Remote Connections dialog box click Add. The following window should be displayed. Fill in the fields as shown. The Connection Name is just a tag for convenience (useful if a number of different accounts are in use). The Host Name should be as shown with the appropriate username. The Remote Installation Directory should be exactly as shown. The Remote Script is needed to execute additional environment commands on connection. A default script is provided in the location shown.
/work/y07/shared/utils/core/forge/latest/remote-init\n
Other settings can be as shown. Remember to click OK when done.
From the Remote Launch menu you should now see the new Connection Name. Select this, and enter the relevant ssh passphrase and machine password to connect. A remote connection will allow you to debug, or view a profile, as discussed above.
If different commands are required on connection, a copy of the remote-init
script can be placed in, e.g., ${HOME/home/work}/.forge
and edited as necessary. The full path of the new script should then be specified in the remote launch settings dialog box. Note that the script changes the directory to the /work/
file system so that batch submissions via sbatch
will not be rejected.
Finally, note that ssh
may need to be configured so that it picks up the correct local public key file. This may be done, e.g., via the local .ssh/config
configuration file.
Navigate to https://app.globus.org
Log in with your Globus identity (this could be a globusid.org or other identity)
In File Manager, use the search tool to search for \u201cArcher2 file systems\u201d. Select it.
In the transfer pane, you are told that Authentication/Consent is required. Click Continue.
Click on the ARCHER2 Safe (safe.epcc.ed.ac.uk) link
Select the correct User account (if you have more than one)
Click Accept
Now confirm your Globus credentials \u2013 click Continue
Click on the SAFE id you selected previously
Make sure the correct User account is selected and Accept again
Your ARCHER2 /home directory will be shown
You can switch to viewing e.g. your /work directory by editing the path, or using the \"up one folder\" and selecting folders to move down the tree, as required
"},{"location":"data-tools/globus/#setting-up-the-other-end-of-the-transfer","title":"Setting up the other end of the transfer","text":"Make sure you select two-panel view mode
"},{"location":"data-tools/globus/#laptop","title":"Laptop","text":"If you wish to transfer data to/from your personal laptop or other device, click on the Collection Search in the right-hand panel
Use the link to \u201cGet Globus Connect Personal\u201d to create a Collection for your local drive.
"},{"location":"data-tools/globus/#other-server-eg-jasmin","title":"Other server e.g. JASMIN","text":"If you wish to connect to another server, you will need to search for the Collection e.g. JASMIN Default Collection and authenticate
Please see the JASMIN Globus page for more information
"},{"location":"data-tools/globus/#setting-up-and-initiating-the-transfer","title":"Setting up and initiating the transfer","text":"Once you are connected to both the Source and Destination Collections, you can use the File Manager to select the files to be transferred, and then click the Start button to initiate the transfer
A pop-up will appear once the Transfer request has been submitted successfully
Clicking on the \u201cView Details\u201d will show the progress and final status of the transfer
"},{"location":"data-tools/globus/#using-a-different-archer2-account","title":"Using a different ARCHER2 account","text":"If you want to use Globus with a different account on ARCHER2, you will have to go to Settings
Manage Identities
And Unlink the current ARCHER2 safe identity, then repeat the link process with the other ARCHER2 account
"},{"location":"data-tools/julia/","title":"Julia","text":"Julia is a general purpose software used widely in datascience and for data visualisation.
Important
This documentation is provided by an external party (i.e. not by the ARCHER2 service itself). Julia is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"data-tools/julia/#first-time-installation","title":"First time installation","text":"Note
There is no centrally installed version of Julia, so you will have to manually install it and any packages you may need. The following guide was tested on julia-1.6.6.
You will first need to download Julia into your work directory and untar the folder. You should then add the folder to your system path so you can use the julia
executable. Finally, you need to tell Julia to install any packages in the work directory as opposed to the default home directory, which can only be accessed from the login nodes. This can be done with the following code
export WORK=/work/t01/t01/auser\ncd $WORK\n\nwget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.6-linux-x86_64.tar.gz\ntar zxvf julia-1.6.6-linux-x86_64.tar.gz\nrm ./julia-1.6.6-linux-x86_64.tar.gz\n\nexport PATH=\"$PATH:$WORK/julia-1.6.6/bin\"\n\nmkdir ./.julia\nexport JULIA_DEPOT_PATH=\"$WORK/.julia\"\nexport PATH=\"$PATH:$JULIA_DEPOT_PATH/bin\"\n
At this point you should have a working installation of Julia! The environment variables will however be cleared when you log out of the terminal. You can set them in the .bashrc
file so that they're automatically defined every time you log in by adding the following lines to the end of the file ~/.bashrc
export WORK=\"/work/t01/t01/auser\"\nexport JULIA_DEPOT_PATH=\"$WORK/.julia\"\nexport PATH=\"$PATH:$WORK/julia-1.6.6/bin\"\nexport PATH=\"$PATH:$JULIA_DEPOT_PATH/bin\"\n
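If `~/.bashrc` is sourced more than once in a session, plain `export PATH=...` lines keep appending the same directories. A possible variant, assuming the same example paths as above and a hypothetical helper named `path_append`, adds each directory only when it is missing:

```shell
# Append a directory to PATH only if it is not already present.
path_append() {
    case ":${PATH}:" in
        *":$1:"*) ;;                # already on PATH, nothing to do
        *) PATH="${PATH}:$1" ;;
    esac
}

export WORK="/work/t01/t01/auser"
export JULIA_DEPOT_PATH="$WORK/.julia"
path_append "$WORK/julia-1.6.6/bin"
path_append "$JULIA_DEPOT_PATH/bin"
export PATH
```

The behaviour is otherwise the same as the four-line version; only repeated entries are avoided.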
"},{"location":"data-tools/julia/#installing-packages-and-using-environments","title":"Installing packages and using environments","text":"Julia has a built in package manager which can be used to install registered packages quickly and easily. Like with many other high level programming languages we can make use of environments to control dependencies etc.
To make an environment, first navigate to where you want your environment to be (ideally a subfolder of your /work/
directory) and create an empty folder to store the environment in. Then launch Julia with the --project flag.
cd $WORK\nmkdir ./MyTestEnv\njulia --project=$WORK/MyTestEnv\n
This launches Julia in the MyTestEnv
environment. You can then install packages as usual using the normal commands in the Julia terminal. E.g.
using Pkg\nPkg.add(\"Oceananigans\")\n
"},{"location":"data-tools/julia/#configuring-mpijl","title":"Configuring MPI.jl","text":"The MPI.jl
package doesn't use the system MPICH implementation by default. You can set it up to do this by following the steps below. First you will need to load the cray-mpich
module and define some environment variables (see here for further details). Then you can launch Julia in an environment of your choice, ready to build.
module load cray-mpich/8.1.23\nexport JULIA_MPI_BINARY=\"system\"\nexport JULIA_MPI_PATH=\"\"\nexport JULIA_MPI_LIBRARY=\"/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi.so\"\nexport JULIA_MPIEXEC=\"srun\"\n\njulia --project=<<path to environment>>\n
Once in the Julia terminal you can build the MPI.jl
package using the following code. The final line installs the mpiexecjl
command which should be used instead of srun
to launch mpi processes.
using Pkg\nPkg.build(\"MPI\"; verbose=true)\nMPI.install_mpiexecjl(command = \"mpiexecjl\", force = false, verbose = true)\n
The mpiexecjl
command will be installed in the directory that JULIA_DEPOT_PATH
points to. Note
You only need to do this once per environment.
"},{"location":"data-tools/julia/#running-julia-on-the-compute-nodes","title":"Running Julia on the compute nodes","text":"Below is an example script for running Julia with mpi on the compute nodes
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=<<job-name>>\n#SBATCH --time=00:19:00\n\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=24\n#SBATCH --cpus-per-task=1\n\n#SBATCH --qos=short\n#SBATCH --reservation=shortqos\n\n#SBATCH --account=<<your account>>\n#SBATCH --partition=standard\n\n# Set up the job environment (this module needs to be loaded before any other modules)\nmodule load PrgEnv-cray\nmodule load cray-mpich/8.1.23\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\nexport JULIA_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Define some paths\nexport WORK=/work/t01/t01/auser\n\nexport JULIA=\"$WORK/julia-1.6.6/bin/julia\" # The julia executable\nexport PATH=\"$PATH:$WORK/julia-1.6.6/bin\" # The folder of the julia executable\nexport JULIA_DEPOT_PATH=\"$WORK/.julia\"\nexport MPIEXECJL=\"$JULIA_DEPOT_PATH/bin/mpiexecjl\" # The path to the mpiexecjl executable\n\n$MPIEXECJL --project=$WORK/MyTestEnv -n 24 $JULIA ./MyMpiJuliaScript.jl\n
The above script uses MPI, but you can use multithreading instead by setting the JULIA_NUM_THREADS
environment variable.
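For a threaded run, one option is to derive the thread count from the Slurm allocation rather than hard-coding it. The sketch below assumes the job requests its cores with `--cpus-per-task`; the helper name `julia_threads_from_slurm` is hypothetical.

```shell
# Print the thread count Julia should use: the value of
# SLURM_CPUS_PER_TASK when set, otherwise 1 (e.g. outside a batch job).
julia_threads_from_slurm() {
    printf '%s\n' "${SLURM_CPUS_PER_TASK:-1}"
}

JULIA_NUM_THREADS="$(julia_threads_from_slurm)"
export JULIA_NUM_THREADS
```

In a job script this would replace the fixed `export JULIA_NUM_THREADS=1` line, placed after the `#SBATCH` options.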
The Performance Application Programming Interface (PAPI) is an API that facilitates the reading of performance counter data without needing to know the details of the underlying hardware.
For convenience, we have developed an MPI-based wrapper for PAPI, called papi_mpi_lib
, which can be found via the link below.
https://github.com/cresta-eu/papi_mpi_lib
The PAPI MPI Library makes it possible to monitor a user-defined set of hardware performance counters during the execution of an MPI code running across multiple compute nodes. The library is lightweight, containing just four functions, and is intended to be straightforward to use. Once you've decided where in your code you wish to record counter values, you can control which counters are read at runtime by setting the PAT_RT_PERFCTR
environment variable in the job submission script. As your code executes, the specified counters will be read at various points. After each reading, the counter values are summed by rank 0 (via an MPI reduction) before being output to a log file.
You can discover which counters are available on ARCHER2 compute nodes by submitting the following single node job.
#!/bin/bash --login\n\n#SBATCH -J papi\n#SBATCH --time=00:20:00\n#SBATCH --exclusive\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=1\n#SBATCH --cpus-per-task=1\n#SBATCH --account=<budget code>\n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --export=none\n\nfunction papi_query() {\n export LD_LIBRARY_PATH=/opt/cray/pe/papi/$2/lib64:/opt/cray/libfabric/$3/lib64\n module -q restore\n\n module -q load cpe/$1\n module -q load papi/$2\n\n mkdir -p $1\n papi_component_avail -d &> $1/papi_component_avail.txt\n papi_native_avail -c &> $1/papi_native_avail.txt\n papi_avail -c -d &> $1/papi_avail.txt\n}\n\npapi_query 22.12 6.0.0.17 1.12.1.2.2.0.0\n
The job runs various papi
commands with the output being directed to specific text files. Please consult the text files to see which counters are available. Note, counters that are not available may still be listed in the file, but with a label such as <NA>
.
As of July 2023, the Cray Programming Environment (CPE), PAPI and libfabric versions on ARCHER2 were 22.12
, 6.0.0.17
and 1.12.1.2.2.0.0
respectively; these versions may change in the future.
Alternatively, you can run pat_help counters rome
from a login node to check the availability of individual counters.
Further information on papi_mpi_lib
along with test harnesses and example scripts can be found by reading the PAPI MPI Library readme file.
ParaView is a data visualisation and analysis package. Whilst ARCHER2 compute or login nodes do not have graphics cards installed in them, ParaView is installed so the visualisation libraries and applications can be used to post-process simulation data. The ParaView server (pvserver
), batch application (pvbatch
), and the Python interface (pvpython
) are all available. Users are able to run the server on the compute nodes and connect to a local ParaView client running on their own computer.
ParaView is available through the paraview
module.
module load paraview\n
Once the module has been added, the ParaView executables, tools, and libraries will be available.
"},{"location":"data-tools/paraview/#connecting-to-pvserver-on-archer2","title":"Connecting to pvserver on ARCHER2","text":"For doing visualisation, you should connect to pvserver
from a local ParaView client running on your own computer.
Note
You should make sure the version of ParaView you have installed locally is the same as the one on ARCHER2 (version 5.10.1).
The following instructions are for running pvserver in an interactive job. Start an interactive job using:
srun --nodes=1 --exclusive --time=00:20:00 \\\n --partition=standard --qos=short --pty /bin/bash\n
Once the job starts the command prompt will change to show you are now on the compute node, e.g.:
auser@nid001023:/work/t01/t01/auser> \n
Then load the ParaView module and start pvserver
with the srun
command,
auser@nid001023:/work/t01/t01/auser> module load paraview\nauser@nid001023:/work/t01/t01/auser> srun --overlap --oversubscribe -n 4 \\\n> pvserver --mpi --force-offscreen-rendering\nWaiting for client...\nConnection URL: cs://nid001023:11111\nAccepting connection(s): nid001023:11111\n
Note
The previous example uses 4 compute cores to run pvserver
. You can increase the number of cores in case the visualisation does not run smoothly. Please bear in mind that, depending on the testcase, a large number of compute cores can lead to an out-of-memory runtime error.
In a separate terminal you can now set up an SSH tunnel with the node ID and port number which the pvserver
is using, e.g.:
ssh -L 11111:nid001023:11111 auser@login.archer2.ac.uk \n
Enter your password and passphrase as usual.
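Because the node ID changes from job to job, it can be convenient to build the tunnel command from the Connection URL that pvserver prints. A minimal sketch follows, with a hypothetical helper name `pv_tunnel_cmd` and the assumption that the URL always has the form `cs://<node>:<port>` shown above:

```shell
# Build the SSH tunnel command from the Connection URL printed by pvserver.
# Usage: pv_tunnel_cmd <connection-url> <username>
# The function body runs in a subshell so its variables stay private.
pv_tunnel_cmd() (
    url="$1"
    user="$2"
    hostport="${url#cs://}"     # strip the scheme, leaving node:port
    node="${hostport%%:*}"      # text before the first colon
    port="${hostport##*:}"      # text after the last colon
    printf 'ssh -L %s:%s:%s %s@login.archer2.ac.uk\n' "$port" "$node" "$port" "$user"
)
```

The helper only prints the command, so it can be checked before being run on your local machine.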
You can then connect from your local client using the following connection settings:
Name: archer2 \nServer Type: Client/Server \nHost: localhost \nPort: 11111\n
Note
The Host from the local client should be set to \"localhost\" when using the SSH tunnel. The \"Name\" field can be set to a name of your choosing. 11111 is the default port for pvserver
.
If it has connected correctly, you should see the following:
Waiting for client...\nConnection URL: cs://nid001023:11111\nAccepting connection(s): nid001023:11111\nClient connected.\n
"},{"location":"data-tools/paraview/#using-batch-mode-pvbatch","title":"Using batch-mode (pvbatch)","text":"A pvbatch
script can be run in a standard job script. For example the following will run on a single node:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=example_paraview_job\n#SBATCH --time=0:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load paraview\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread pvbatch pvbatchscript.py\n
"},{"location":"data-tools/paraview/#compiling-paraview","title":"Compiling ParaView","text":"The latest instructions for building ParaView on ARCHER2 may be found in the GitHub repository of build instructions:
The ARCHER2 compute nodes each have a set of so-called Power Management (PM) counters. These cover point-in-time power readings for the whole node, and for the CPU and memory domains. The accumulated energy use is also recorded at the same level of detail. Further, there are two temperature counters, one for each socket/processor on the node. The counters are read ten times per second and the data written to a set of files stored within node memory (located at /sys/cray/pm_counters/
).
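For a quick look at the raw counters without any library, the files can simply be printed from within a job running on a compute node. The sketch below makes no assumption about the individual file names or their contents; the helper name `pm_dump` is hypothetical.

```shell
# Print every counter file in a PM counter directory as "name: contents".
# Usage: pm_dump [directory]   (defaults to /sys/cray/pm_counters)
# The function body runs in a subshell so its variables stay private.
pm_dump() (
    dir="${1:-/sys/cray/pm_counters}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        exit 1
    fi
    for f in "$dir"/*; do
        [ -f "$f" ] || continue    # skip anything that is not a regular file
        printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    done
)
```

Calling `pm_dump` before and after a region of interest gives a rough manual comparison of the readings.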
For convenience, we have developed an MPI-based wrapper, called pm_mpi_lib
that facilitates the reading of the PM counter files, see the link below.
https://github.com/cresta-eu/pm_mpi_lib
The PM MPI Library makes it possible to monitor the Power Management counters during the execution of an MPI code running across multiple compute nodes. The library is lightweight, containing just three functions, and is intended to be straightforward to use. You simply decide which parts of your code you wish to profile as regards energy usage and/or power consumption.
As your code executes, the PM counters will be read at various points by a single designated monitor rank on each node assigned to the job. These readings are then written to a log file, which, after the job completes, will contain one set of time-stamped readings per node for every call to the pm_mpi_record
function made from within your code. The readings can then be aggregated according to preference.
Further information along with test harnesses and example scripts can be found by reading the PM MPI Library readme file.
"},{"location":"data-tools/spack/","title":"Spack","text":"Spack is a package manager, a tool to assist with building and installing software as well as determining what dependencies are required and installing those. It was originally designed for use on HPC systems, where several variations of a given package may be installed alongside one another for different use cases -- for example different versions, built with different compilers, using MPI or hybrid MPI+OpenMP. Spack is principally written in Python but has a component written in Answer Set Programming (ASP) which is used to determine the required dependencies for a given package installation.
Users are welcome to install Spack themselves in their own directories, but we are making an experimental installation tailored for ARCHER2 available centrally. This page provides documentation on how to activate and install packages using the central installation on ARCHER2. For more in-depth information on using Spack itself please see the developers' documentation.
Important
As ARCHER2's central Spack installation is still in an experimental stage please be aware that we cannot guarantee that it will work with full functionality and we may not be able to provide support. The centrally-provided configuration and Spack-installed software may be subject to change.
"},{"location":"data-tools/spack/#activating-spack","title":"Activating Spack","text":"As it is still in an experimental stage, the Spack module is not made available to users by default. You must firstly load the other-software
module:
auser@ln01:~> module load other-software\n
Several modules with spack
in their name will become visible to you. You should load the spack
module:
auser@ln01:~> module load spack\n
This configures Spack to place its cache on and install software to a directory called .spack
in your base work directory, e.g. at /work/t01/t01/auser/.spack
.
At this point Spack is available to you via the spack
command. You can get started with spack help
, reading the Spack documentation, or by testing a package's installation.
At its simplest, Spack installs software with the spack install
command:
auser@ln01:~> spack install gromacs\n
This very simple gromacs
installation specification, or spec, would install GROMACS using the default options given by the Spack gromacs
package. The spec can be expanded to include which options you like. For example, the command
auser@ln01:~> spack install gromacs@2024.2%gcc+mpi\n
would use the GCC compiler to install an MPI-enabled version of GROMACS version 2024.2.
Tip
Spack needs to bootstrap the installation of some extra software in order to function, principally clingo
which is used to solve the dependencies required for an installation. The first time you ask Spack to concretise a spec into a precise set of requirements, it will take extra time as it downloads this software and extracts it into a local directory for Spack's use.
You can find information about any Spack package and the options available to use with the spack info
command:
auser@ln01:~> spack info gromacs\n
Tip
The Spack developers also provide a website at https://packages.spack.io/ where you can search for and examine packages, including all information on options, versions and dependencies.
When installing a package, Spack will determine what dependencies are required to support it. If they are not already available to Spack, either as packages that it has installed beforehand or as external dependencies, then Spack will also install those, marking them as implicitly installed, as opposed to the explicit installation of the package you requested. If you want to see the dependencies of a package before you install it, you can use spack spec
to see the full concretised set of packages:
auser@ln01:~> spack spec gromacs@2024.2%gcc+mpi\n
Tip
Spack on ARCHER2 has been configured to use as much of the HPE Cray Programming Environment as possible. For example, this means that Cray LibSci will be used to provide the BLAS, LAPACK and ScaLAPACK dependencies and Cray MPICH will provide MPI. It is also configured to allow it to re-use as dependencies any packages that the ARCHER2 CSE team has spack install
ed centrally, potentially helping to save you build time and storage quota.
Spack provides a module-like way of making software that you have installed available to use. If you have a GROMACS installation, you can make it available to use with spack load
:
auser@ln01:~> spack load gromacs\n
At this point you should be able to use the software as normal. You can then remove it once again from the environment with spack unload
:
auser@ln01:~> spack unload gromacs\n
If you have multiple variants of the same package installed, you can use the spec to distinguish between them. You can always check what packages have been installed using the spack find
command. If no other arguments are given it will simply list all installed packages, or you can give a package name to narrow it down:
auser@ln01:~> spack find gromacs\n
You can see your packages' install locations using spack find --paths
or spack find -p
.
In any Spack command that requires as an argument a reference to an installed package, you can provide a hash reference to it rather than its spec. You can see the first part of the hash by running spack find -l
, or the full hash with spack find -L
. Then use the hash in a command by prefixing it with a forward slash, e.g. wjy5dus
becomes /wjy5dus
.
If you have two packages installed which appear identical in spack find
apart from their hash, you can differentiate them with spack diff
:
auser@ln01:~> spack diff /wjy5dus /bleelvs\n
You can uninstall your packages with spack uninstall
:
auser@ln01:~> spack uninstall gromacs@2024.2\n
and of course, to be absolutely certain that you are uninstalling the correct package, you can provide the hash:
auser@ln01:~> spack uninstall /wjy5dus\n
Uninstalling a package will leave behind any implicitly installed packages that were installed to support it. Spack may have also installed build-time dependencies that aren't actually needed any more -- these are often packages like autoconf
, cmake
and m4
. You can run the garbage collection command to uninstall any build dependencies and implicit dependencies that are no longer required:
auser@ln01:~> spack gc\n
If you commonly use a set of Spack packages together you may want to consider using a Spack environment to assist you in their installation and management. Please see the Spack documentation for more information.
"},{"location":"data-tools/spack/#custom-configuration","title":"Custom configuration","text":"Spack is configured using YAML files. The central installation on ARCHER2 made available to users is configured to use the HPE Cray Programming Environment and to allow you to start installing software to your /work
directories right away, but if you wish to make any changes you can provide your own overriding userspace configuration.
Your own configuration should fit in the user level scope. On ARCHER2 Spack is configured to, by default, place and look for your configuration files in your work directory at e.g. /work/t01/t01/auser/.spack
. You can however override this to have Spack use any directory you choose by setting the SPACK_USER_CONFIG_PATH
environment variable, for example:
auser@ln01:~> export SPACK_USER_CONFIG_PATH=/work/t01/t01/auser/spack-config\n
Of course this will need to be a directory where you have write permissions, such as in your home or work directories, or in one of your project's shared
directories.
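The write-permission requirement can be checked at the point where the variable is set. The following is a sketch only; the function name `use_spack_config_dir` is hypothetical, and it creates the directory if it does not already exist.

```shell
# Point Spack at a user configuration directory, creating it if needed.
# Usage: use_spack_config_dir <directory>
use_spack_config_dir() {
    mkdir -p "$1" || return 1
    if [ ! -w "$1" ]; then
        echo "cannot write to $1" >&2
        return 1
    fi
    export SPACK_USER_CONFIG_PATH="$1"
}
```

For example, `use_spack_config_dir /work/t01/t01/auser/spack-config` would have the same effect as the export shown above, but fails with a message if the directory is not writable.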
You can edit the configuration files directly in a text editor or by running, for example:
auser@ln01:~> spack config edit repos\n
which would open your repos.yaml
in vim
.
Tip
If you would rather not use vim
, you can change which editor is used by Spack by setting the SPACK_EDITOR
environment variable.
The final configuration used by Spack is a compound of several scopes, from the Spack defaults which are overridden by the ARCHER2 system configuration files, which can then be overridden in turn by your own configurations. You can see what options are in use at any point by running, for example:
auser@ln01:~> spack config get config\n
which goes through any and all config.yaml
files known to Spack and sets the options according to those files' level of precedence. You can also get more information on which files are responsible for which lines in the final active configuration by running, for example to check packages.yaml
:
auser@ln01:~> spack config blame packages\n
Unless you have already written a packages.yaml
of your own, this will show a mix of options originating from the Spack defaults and also from an archer2-user
directory which is where we have told Spack how to use packages from the HPE Cray Programming Environment.
If there is some behaviour in Spack that you want to change, looking at the output of spack config get
and spack config blame
may help to show what you would need to do. You can then write your own user scope configuration file to set the behaviour you want, which will override the option as set by the lower-level scopes.
Please see the Spack documentation to find out more about writing configuration files.
"},{"location":"data-tools/spack/#writing-new-packages","title":"Writing new packages","text":"A Spack package is at its core a Python package.py
file which provides instructions to Spack on how to obtain source code and compile it. A very simple package will allow it to build just one version with one compiler and one set of options. A more fully-featured package will list more versions and include logic to build them with different compilers and different options, and to also pick its dependencies correctly according to what is chosen.
Spack provides several thousand packages in its builtin
repository. You may be able to use these with no issues on ARCHER2 by simply running spack install
as described above, but if you do run into problems in the interaction between Spack and the CPE compilers and libraries then you may wish to write your own. Where the ARCHER2 CSE service has encountered problems with packages we have provided our own in a repository located at $SPACK_ROOT/var/spack/repos/archer2
.
A package repository is a directory containing a repo.yaml
configuration file and another directory called packages
. Directories within the latter are named for the package they provide, for example cp2k
, and contain in turn a package.py
. You can create a repository from scratch with the command
auser@ln01:~> spack repo create dirname\n
where dirname
is the name of the directory holding the repository. This command will create the directory in your current working directory, but you can choose to instead provide a path to its location. You can then make the new repository available to Spack by running:
auser@ln01:~> spack repo add dirname\n
This adds the path to dirname
to the repos.yaml
file in your user scope configuration directory as described above. If your repos.yaml
doesn't yet exist, it will be created.
A Spack repository can similarly be removed from the config using:
auser@ln01:~> spack repo rm dirname\n
"},{"location":"data-tools/spack/#namespaces-and-repository-priority","title":"Namespaces and repository priority","text":"A package can exist in several repositories. For example, the Quantum Espresso package is provided by both the builtin
repository provided with Spack and also by the archer2
repository; the latter has been patched to work on ARCHER2.
To distinguish between these packages, each repository's packages exist within that repository's namespace. By default the namespace is the same as the name of the directory it was created in, but Spack does allow it to be different. Both builtin
and archer2
use the same directory name and namespace.
Tip
If you want your repository namespace to be different from the name of the directory, you can change it either by editing the repository's repo.yaml
or by providing an extra argument to spack repo create
:
auser@ln01:~> spack repo create dirname namespace\n
Running spack find -N
will return the list of installed packages with their namespace. You'll see that they are then prefixed with the repository namespace, for example builtin.bison@3.8.2
and archer2.quantum-espresso@7.2
. In order to avoid ambiguity when managing package installation you can always prefix a spec with a repository namespace.
If you don't include the repository in a spec, Spack will search in order all the repositories it has been configured to use until it finds a matching package, which it will then use. The earlier in the list of repositories, the higher the priority. You can check this with:
auser@ln01:~> spack repo list\n
If you run this without having added any repositories of your own, you will see that the two available repositories are archer2
and builtin
, in this order. This means that archer2
has higher priority. Because of this, running spack install quantum-espresso
would install archer2.quantum-espresso
, but you could still choose to install from the other repository with spack install builtin.quantum-espresso
.
Once you have a repository of your own in place, you can create new packages to store within it. Spack has a spack create
command which will do the initial setup and create a boilerplate package.py
. To create an empty package called packagename
you would run:
auser@ln01:~> spack create --name packagename\n
However, it will very often be more efficient if you instead provide a download URL for your software as the argument. For example, the Code_Saturne 8.0.3 source is obtained from https://www.code-saturne.org/releases/code_saturne-8.0.3.tar.gz
, so you can run:
auser@ln01:~> spack create https://www.code-saturne.org/releases/code_saturne-8.0.3.tar.gz\n
Spack will determine from this the package name and the download URLs for all versions X.Y.Z matching the https://www.code-saturne.org/releases/code_saturne-X.Y.Z.tar.gz
pattern. It will then ask you interactively which of these you want to use. Finally, it will download the .tar.gz
archives for those versions and calculate their checksums, then place all this information in the initial version of the package for you. This takes away a lot of the initial work!
At this point you can get to work on the package. You can edit an existing package by running
auser@ln01:~> spack edit packagename\n
or by directly opening packagename/package.py
within the repository with a text editor.
The boilerplate code will note several sections for you to fill out. If you did provide a source code download URL, you'll also see listed the versions you chose and their checksums.
A package is implemented as a Python class. You'll see that by default it will inherit from the AutotoolsPackage
class which defines how a package following the common configure
> make
> make install
process should be built. You can change this to another build system, for example CMakePackage
. If you want, you can have the class inherit from several different types of build system classes and choose between them at install time.
Options must be provided to the build. For an AutotoolsPackage
package, you can write a configure_args
method which very simply returns a list of the command line arguments you would give to configure
if you were building the code yourself. There is an equivalent cmake_args
method for CMakePackage
packages.
Finally, you will need to provide your package's dependencies. In the main body of your package class you should add calls to the depends_on()
function. For example, if your package needs MPI, add depends_on(\"mpi\")
. As the argument to the function is a full Spack spec, you can provide any necessary versioning or options, so, for example, if you need PETSc 3.18.0 or newer with Fortran support, you can call depends_on(\"petsc+fortran@3.18.0:\")
.
If you know that you will only ever want to build a package one way, then providing the build options and dependencies should be all that you need to do. However, if you want to allow for different options as part of the install spec, patch the source code or perform post-install fixes, or take more manual control of the build process, it can become much more complex. Thankfully the Spack developers have provided excellent documentation covering the whole process, and there are many existing packages you can look at to see how it's done.
"},{"location":"data-tools/spack/#tips-when-writing-packages-for-archer2","title":"Tips when writing packages for ARCHER2","text":"Here are some useful pointers when writing packages for use with the HPE Cray Programming Environment on ARCHER2.
"},{"location":"data-tools/spack/#cray-compiler-wrappers","title":"Cray compiler wrappers","text":"An important point of note is that Spack does not use the Cray compiler wrappers cc
, CC
and ftn
when compiling code. Instead, it uses the underlying compilers themselves. Remember that the wrappers automate the use of Cray LibSci, Cray FFTW, Cray HDF5 and Cray NetCDF. Without this being done for you, you may need to take extra care to ensure that the options needed to use those libraries are correctly set.
Cray LibSci provides optimised implementations of BLAS, BLACS, LAPACK and ScaLAPACK on ARCHER2. These are bundled together into single libraries named for variants on libsci_cray.so
. Although Spack itself knows about LibSci, many applications don't and it can sometimes be tricky to get them to use these libraries when they are instead looking for libblas.so
and the like.
The configure
or cmake
or equivalent step for your software will hopefully allow you to manually point it to the correct library. For example, Code_Saturne's configure
can take the options --with-blas-lib
and --with-blas-libs
which respectively tell it the location to search and the libraries to use in order to build against BLAS.
Spack can provide the correct BLAS library search and link flags to be passed on to configure
via self.spec[\"blas\"].libs
, a LibraryList
object. So, the Code_Saturne package uses the following configure_args()
method:
def configure_args(self):\n    blas = self.spec[\"blas\"].libs\n    args = [\"--with-blas-lib={0}\".format(blas.search_flags),\n            \"--with-blas-libs={0}\".format(blas.link_flags)]\n    return args\n
Here the blas.search_flags
attribute is resolved to a -L
library search flag using the path to the correct LibSci directory, taking into account whether the libraries for the Cray, GCC or AOCC compilers should be used. blas.link_flags
similarly gives a -l
flag for the correct LibSci library. Depending on what you need, the LibraryList
has other attributes which can help you pass the options needed to get configure
to find and use the correct library.
If you develop a package for use on ARCHER2 please do consider opening a pull request to the GitHub repository.
"},{"location":"data-tools/visidata/","title":"VisiData","text":"VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.
"},{"location":"data-tools/visidata/#useful-links","title":"Useful links","text":"You can access VisiData on ARCHER2 by loading the visidata
module:
module load visidata\n
Once the module has been loaded, VisiData is available via the vd
command.
VisiData can also be used in scripts by saving a command log and replaying it. See the VisiData documentation on saving and restoring VisiData sessions.
"},{"location":"data-tools/vmd/","title":"VMD","text":"VMD is a visualisation program for displaying, animating, and analysing molecular systems using 3D graphics and built-in Tcl/Tk scripting.
"},{"location":"data-tools/vmd/#useful-links","title":"Useful links","text":"VMD is available through the vmd
module.
module load vmd\n
Once the module has been loaded, the VMD executables, tools, and libraries will be available.
Without anything else, this allows you to run VMD in \"text-only\" mode with:
vmd -dispdev text\n
If you want to launch VMD with a GUI, see the requirements in the next section.
"},{"location":"data-tools/vmd/#launching-vmd-with-a-gui","title":"Launching VMD with a GUI","text":"To launch VMD with its graphical interface, your machine needs to support the X11 \"X Window System\". Most Linux and *NIX systems support this by default. If you're using Windows (through WSL, for example), you will need an X11 display server; we recommend XMing. For macOS, we recommend XQuartz, but please be aware that some extra configuration is needed; please see the next section.
To launch VMD with a GUI, once you have a running X11 display server on your local machine, you'll need to connect to ARCHER2 with X11 forwarding enabled; please follow the instructions in the logging in section. Once you're connected to ARCHER2, load the VMD module with:
module load vmd\n
and launch VMD with:
vmd\n
"},{"location":"data-tools/vmd/#using-vmd-from-macos","title":"Using VMD from macOS","text":"If you're using macOS and XQuartz, before you're able to launch VMD with a GUI, you will need to change the XQuartz configuration. On a local terminal (that is, not connected to ARCHER2), run the following command:
defaults write org.xquartz.X11 enable_iglx -bool true\n
Then restart XQuartz. You will now be able to launch VMD's GUI without a segmentation fault
.
The latest instructions for building VMD on ARCHER2 may be found in the GitHub repository of build instructions.
"},{"location":"essentials/","title":"Essential Skills","text":"This section provides information and links on essential skills required to use ARCHER2 efficiently: e.g. using Linux command line, accessing help and documentation.
"},{"location":"essentials/#terminal","title":"Terminal","text":"In order to access HPC machines such as ARCHER2 you will need to use a Linux command line terminal window
Options for Linux, macOS and Windows are described in our Connecting to ARCHER2 guide.
"},{"location":"essentials/#linux-command-line","title":"Linux Command Line","text":"A guide to using the Unix Shell for complete novices
For those already familiar with the basics there is also a lesson on shell extras
"},{"location":"essentials/#basic-slurm-commands","title":"Basic Slurm commands","text":"Slurm is the scheduler used on ARCHER2 and we provide a guide to using the basic Slurm commands including how to find out:
The following text editors are available on ARCHER2
Name Description Examples emacs A widely used editor with a focus on extensibility. emacs -nw sharpen.pbs
CTRL+X CTRL+C
quits CTRL+X CTRL+S
saves nano A small, free editor with a focus on user friendliness. nano sharpen.pbs
CTRL+X
quits CTRL+O
saves vi A mode based editor with a focus on aiding code development. vi cfd.f90
:q
in command mode quits :q!
in command mode quits without saving :w
in command mode saves i
in command mode switches to insert mode ESC
in insert mode switches to command mode If you are using MobaXterm on Windows you can use the inbuilt MobaTextEditor text file editor.
You can edit on your local machine using your preferred text editor, and then upload the file to ARCHER2. Make sure you save the file using Linux line endings. Notepad, for example, supports Unix/Linux line endings (LF), Macintosh line endings (CR), and Windows line endings (CRLF).
"},{"location":"essentials/#quick-reference-sheet","title":"Quick Reference Sheet","text":"We have produced this Quick Reference Sheet which you may find useful.
"},{"location":"faq/","title":"ARCHER2 Frequently Asked Questions","text":"This section documents some of the questions raised to the Service Desk on ARCHER2, and the advice and solutions.
"},{"location":"faq/#user-accounts","title":"User accounts","text":""},{"location":"faq/#username-already-in-use","title":"Username already in use","text":"Q. I created a machine account on ARCHER2 for a training course, but now I want to use that machine username for my main ARCHER2 project, and the system will not let me, saying \"that name is already in use\". How can I re-use that username.
A. Send an email to the service desk, letting us know the username and project that you set up previously, and asking for that account and any associated data to be deleted. Once deleted, you can then re-use that username to request an account in your main ARCHER2 project.
"},{"location":"faq/#data","title":"Data","text":""},{"location":"faq/#undeleteable-file-nfsxxxxxxxxxxx","title":"Undeleteable file .nfsXXXXXXXXXXX","text":"Q. I have a file called .nfsXXXXXXXXXXX (where XXXXXXXXXXX is a long hexadecimal string) in my /home folder but I can't delete it.
A. This file will have been created during a file copy which failed. Trying to delete it will give an error \"Device or resource busy\", even though the copy has ended and no active task is locking it.
echo -n >.nfsXXXXXXXXXXX
will remove it.
"},{"location":"faq/#running-on-archer2","title":"Running on ARCHER2","text":""},{"location":"faq/#oom-error-on-archer2","title":"OOM error on ARCHER2","text":"Q. Why is my code failing on ARCHER2 with an out of memory (OOM) error?
A. You are requesting too much memory per process. We recommend that you try running the same job on underpopulated nodes. This can be done by reducing the value of --ntasks-per-node
in your Slurm submission script. Please lower it to half of its value when it fails (so if you have --ntasks-per-node=128
, reduce it to --ntasks-per-node=64
).
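To see why halving the number of tasks per node helps, the arithmetic can be sketched as follows. The 256 GB node memory figure is an assumption (it corresponds to a standard-memory compute node; check the hardware documentation for the node type you use), and the variable names are purely illustrative.

```shell
#!/bin/bash
# Assumed node memory in GB (standard-memory node); adjust for your node type.
node_mem_gb=256

# Approximate memory available to each task on a fully populated node.
mem_full=$(( node_mem_gb / 128 ))

# Approximate memory available to each task after halving --ntasks-per-node.
mem_half=$(( node_mem_gb / 64 ))

echo 128 tasks per node: about $mem_full GB per task
echo 64 tasks per node: about $mem_half GB per task
```

These are upper bounds, since the operating system also uses some of the node memory. If the job still fails, the same reasoning applies to a further halving.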
Q. How can I check which budget code(s) I can use?
A. You can check in SAFE by selecting Login accounts
from the menu and then selecting the login account you want to query.
Under Login account details
you will see each of the budget codes you have access to listed e.g. e123 resources
and then under Resource Pool to the right of this, a note of the remaining budget.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=user,Account%12,MaxTRESMins,QOS%40\n
This will list all the budget codes that you have access to (but not the amount of budget available) e.g.
User Account MaxTRESMins QOS\n-------- ------------ ------------ -----------------------------------\n userx e123-test largescale,long,short,standard\n userx e123 cpu=0 largescale,long,short,standard\n
This shows that userx
is a member of budgets e123-test
and e123
. However, the cpu=0
indicates that the e123
budget is empty or disabled. This user can submit jobs using the e123-test
budget.
You can only check the amount of available budget via SAFE - see above.
"},{"location":"faq/#estimated-start-time-of-queued-jobs","title":"Estimated start time of queued jobs","text":"Q. I\u2019ve checked the estimated start time for my queued jobs using \u201csqueue -u $USER --start\u201d. Why does the estimated start time keep changing?
A. ARCHER2 uses the Slurm scheduler to queue jobs for the compute nodes. Slurm attempts to find a better schedule as jobs complete and new jobs are added to the queue. This helps to maximise the use of resources by minimising the number of idle compute nodes, in turn reducing your wait time in the queue.
However, if you periodically check the estimated start time of your queued jobs, you may notice that the estimate changes or even disappears. This is because Slurm assigns an estimated start time only to the top entries in the queue. As the schedule changes, your jobs could move in and out of this top region and thus gain or lose an estimated start time.
"},{"location":"faq/upgrade-2023/","title":"ARCHER2 Upgrade: 2023","text":"During the first half of 2023 ARCHER2 went through a major software upgrade.
On this page we describe the background to the changes, the impact the changes have had on users, any action you should expect to take following the upgrade, and the versions of updated software.
If you have any questions or concerns, please contact the ARCHER2 Service Desk.
"},{"location":"faq/upgrade-2023/#why-did-the-upgrade-happen","title":"Why did the upgrade happen?","text":"There are a number of reasons why ARCHER2 needed to go through this major software upgrade. All of these reasons are related to the fact that the previous system software setup was out of date; due to this, maintenance of the service was very difficult and updating software within the current framework was not possible. Some specific issues were:
This major software upgrade involved a complete re-install of system software followed by a reinstatement of local configurations (e.g. Slurm, authentication services, SAFE integration). Unfortunately, this major work required a long period of downtime but this was planned with all service partners to minimise the outage and give as much notice to users as possible so that they could plan accordingly.
The outage dates were:
The allocation periods (where appropriate) were extended for the outage period. The changes were in place when the service was returned.
After the upgrade process there are a number of changes that may require action from users
"},{"location":"faq/upgrade-2023/#updated-login-node-host-keys","title":"Updated login node host keys","text":"If you previously logged into the ARCHER2 system before the upgrade you may see an error from SSH that looks like:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts file (in the example above the offending line is line #11).
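One way to delete a single line by number is with sed. The sketch below works on a throwaway file with made-up contents so that it can be tried safely; the file name, host names and key strings are all hypothetical, and 11 is simply the line number reported in the example warning above.

```shell
#!/bin/bash
# Work on a throwaway file so the sketch never touches the real known_hosts.
demo_file=$(mktemp)
for i in $(seq 1 12); do
    echo host$i ssh-ed25519 examplekey$i
done > $demo_file

# Line number reported in the SSH warning (11 in the example above).
offending_line=11

# Delete that line in place, keeping a backup copy with a .bak suffix.
sed -i.bak ${offending_line}d $demo_file

remaining=$(wc -l < $demo_file)
echo $remaining lines remain
```

On your own machine the same sed command would be applied to ~/.ssh/known_hosts instead of the temporary file. Alternatively, ssh-keygen -R login.archer2.ac.uk removes the stored keys for that host name; any entry stored under the IP address would need to be removed separately.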
The current login node host keys are always documented in the User Guide
"},{"location":"faq/upgrade-2023/#recompile-and-test-software","title":"Recompile and test software","text":"As the new system is based on a new OS version and new versions of compilers and libraries we strongly recommend that all users recompile and test all software on the service. The ARCHER2 CSE service recompiled all centrally installed software.
"},{"location":"faq/upgrade-2023/#no-python-2-installation","title":"No Python 2 installation","text":"There is no Python 2 installation available as part of supported software following the upgrade. Python 3 continues to be fully-supported.
"},{"location":"faq/upgrade-2023/#impact-on-data-on-the-service","title":"Impact on data on the service","text":"srun
","text":"Change in Slurm behaviour. The setting from the --cpus-per-task
option to sbatch/salloc is no longer propagated by default to srun
commands in the job script.
This can lead to very poor performance due to oversubscription of cores with processes/threads if job submission scripts are not updated. The simplest workaround is to add the command:
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n
before any srun commands in the script. You can also explicitly use the --cpus-per-task
option to srun if you prefer.
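A sketch of how the workaround might look near the top of a job script. The fallback value of 8 is an assumption so that the fragment can be tried outside a job; inside a real job, Slurm sets SLURM_CPUS_PER_TASK when the --cpus-per-task option is given. Setting OMP_NUM_THREADS is only relevant to OpenMP codes and is included as an illustration.

```shell
#!/bin/bash
# Outside a job, fall back to an assumed value so the fragment still runs.
SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-8}

# Propagate the sbatch/salloc setting to every later srun command.
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

# Illustration only: match the OpenMP thread count to the cores per task.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

echo srun will use $SRUN_CPUS_PER_TASK cpus per task
```

The export lines need to appear before the first srun command in the script for the setting to take effect.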
This change only affects users who use a placement scheme where placement of processes on sockets is cyclic (e.g. --distribution=block:cyclic
). The Slurm definition of a \u201csocket\u201d has changed. The previous setting on ARCHER2 was that a socket = 16 cores (all sharing a DRAM memory controller). On the updated ARCHER2, the setting is that a socket = 4 cores (corresponding to a CCX, or Core CompleX). Each CCX shares 16 MB of L3 cache.
The paths you need to bind and the LD_LIBRARY_PATH
settings required to use Cray MPICH with MPI in Singularity containers have changed. The updated settings are documented in the Containers section of the User and Best Practice Guide. This also includes updated information on building containers with MPI to use on ARCHER2.
The AMD \u03bcProf tool is not available on the upgraded system yet. We are working to get this fixed as soon as possible.
"},{"location":"faq/upgrade-2023/#what-software-versions-will-be-available-after-the-upgrade","title":"What software versions will be available after the upgrade?","text":"System software:
Compilers:
Communication libraries:
Numerical libraries:
IO Libraries:
Tools:
For full information, see CPE 22.12 Release Notes
CCE 15
C++ applications built using CCE 13 or earlier should be recompiled due to the significant changes that were necessary to implement C++17. This is expected to be a one-time requirement.
Some non-standard Cray Fortran extensions supporting shorthand notation for logical operations will be removed in a future release. CCE 15 will issue warning messages when these are encountered, providing time to adapt the application to use standard Fortran.
HPE Cray MPICH 8.1.23
Cray MPICH 8.1.23 can support only ~2040 simultaneous MPI communicators.
"},{"location":"faq/upgrade-2023/#cse-supported-software","title":"CSE supported software","text":"Default version in italics
Software Versions CASTEP 22.11, 23.11 Code_Saturne 7.0.1 ChemShell/PyChemShell 3.7.1/21.0.3 CP2K 2023.1 FHI-aims 221103 GROMACS 2022.4 LAMMPS 17_FEB_2023 NAMD 2.14 Nektar++ 5.2.0 NWChem 7.0.2 ONETEP 6.9.1.0 OpenFOAM v10.20230119 (.org), v2212 (.com) Quantum Espresso 6.8, 7.1 VASP 5.4.4.pl2, 6.3.2, 6.4.1-vtst, 6.4.1 Software Versions AOCL 3.1, 4.0 Boost 1.81.0 GSL 2.7 HYPRE 2.18.0, 2.25.0 METIS/ParMETIS 5.1.0/4.0.3 MUMPS 5.3.5, 5.5.1 PETSc 13.14.2, 13.18.5 PT/Scotch 6.1.0, 07.0.3 SLEPC 13.14.1, 13.18.3 SuperLU/SuperLU_Dist 5.2.2 / 6.4.0, 8.1.2 Trilinos 12.18.1"},{"location":"known-issues/","title":"ARCHER2 Known Issues","text":"This section highlights known issues on ARCHER2, their potential impacts and any known workarounds. Many of these issues are under active investigation by HPE Cray and the wider service.
Info
This page was last reviewed on 9 November 2023
"},{"location":"known-issues/#open-issues","title":"Open Issues","text":""},{"location":"known-issues/#atp-module-tries-to-write-to-home-from-compute-nodes-added-2024-04-29","title":"ATP Module tries to write to /home from compute nodes (Added: 2024-04-29)","text":"The ATP Module tries to execute a mkdir
command in the /home
filesystem. When running the ATP module on the compute nodes, this will lead to an error, as the compute nodes cannot access the /home
filesystem.
To circumvent the error, add the line:
export HOME=${HOME/home/work}\n
in your Slurm job script, so that the ATP module will write to /work
instead.
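The export line above relies on a bash parameter expansion that replaces the first occurrence of home in the value of HOME with work. A minimal sketch of its effect, using a hypothetical home directory path rather than a real account:

```shell
#!/bin/bash
# Hypothetical home directory path, for illustration only.
example_home=/home/e123/e123/auser

# Replace the first occurrence of home with work (bash pattern substitution).
example_work=${example_home/home/work}

echo $example_work
```

This form of substitution is a bash feature and is not available in a plain POSIX sh, so the job script needs to be run with bash.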
For situations where users are close to user or project quotas on work (Lustre) file systems we have seen cases of the following behaviour:
If you see these symptoms (slower than expected performance or data corruption), you should check whether you are close to your storage quota (either user or project quota). If you are, you may be experiencing this issue. Either remove data to free up space or request more storage quota.
"},{"location":"known-issues/#e-mail-alerts-from-slurm-do-not-work-added-2023-11-09","title":"e-mail alerts from Slurm do not work (Added: 2023-11-09)","text":"Email alerts from Slurm (--mail-type
and --mail-user
options) do not produce emails to users. We are investigating with University of Edinburgh Information Services to enable this Slurm feature in the future.
We have seen cases when using the (non-default) UCX communications protocol where the peak in memory use is much higher than would be expected. This leads to jobs failing unexpectedly with an OOM (Out Of Memory) error. The workaround is to use the Open Fabrics (OFI) communication protocol instead. OFI is the default protocol on ARCHER2 and so does not usually need to be explicitly loaded, but if you have UCX loaded you can switch to OFI by adding the following lines to your submission script before you run your application:
module load craype-network-ofi\nmodule load cray-mpich\n
It can be very useful to track the memory usage of your job as it runs, for example to see whether usage is high on all nodes or on a single node, or whether it increases gradually or rapidly.
Here are instructions on how to do this using a couple of small scripts.
"},{"location":"known-issues/#slurm-cpu-freqx-option-is-not-respected-when-used-with-sbatch-added-2023-01-18","title":"Slurm--cpu-freq=X
option is not respected when used with sbatch
(Added: 2023-01-18)","text":"If you specify the CPU frequency using the --cpu-freq
option with the sbatch
command (either using the script #SBATCH --cpu-freq=X
method or the --cpu-freq=X
option directly) then this option will not be respected as the default setting for ARCHER2 (2.0 GHz) will override the option. You should specify the --cpu-freq
option to srun
directly within the job submission script instead, i.e.:
srun --cpu-freq=2250000 ...\n
You can find more information on setting the CPU frequency in the User Guide.
"},{"location":"known-issues/#research-software","title":"Research Software","text":"There are several outstanding issues for the centrally installed Research Software:
Users should also check individual software pages for known limitations and caveats relating to the use of software on the Cray EX platform and Cray Linux Environment.
"},{"location":"known-issues/#issues-with-rpath-for-non-default-library-versions","title":"Issues with RPATH for non-default library versions","text":"When you compile applications against non-default versions of libraries within the HPE Cray software stack and use the environment variable CRAY_ADD_RPATH=yes
to try to encode the paths to these libraries within the binary, this will not be respected at runtime and the binaries will use the default versions instead.
The workaround for this issue is to ensure that you set:
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
at both compile and runtime. For more details on using non-default versions of libraries, see the description in the User and Best Practice Guide.
"},{"location":"known-issues/#mpi-ucx-error-ivb_reg_mr","title":"MPIUCX ERROR: ivb_reg_mr
","text":"If you are using the UCX layer for MPI communication you may see an error such as:
[1613401128.440695] [nid001192:11838:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0xabcf12c0, length=26400, access=0xf) failed: Cannot allocate memory\n[1613401128.440768] [nid001192:11838:0] ucp_mm.c:137 UCX ERROR failed to register address 0xabcf12c0 mem_type bit 0x1 length 26400 on md[4]=mlx5_0: Input/output error (md reg_mem_types 0x15)\n[1613401128.440773] [nid001192:11838:0] ucp_request.c:269 UCX ERROR failed to register user buffer datatype 0x8 address 0xabcf12c0 len 26400: Input/output error\nMPICH ERROR [Rank 1534] [job id 114930.0] [Mon Feb 15 14:58:48 2021] [unknown] [nid001192] - Abort(672797967) (rank 1534 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:\nPMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed\nMPID_Isend(416)......:\nMPID_isend_unsafe(92):\nMPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)\naborting job:\nFatal error in PMPI_Isend: Other MPI error, error stack:\nPMPI_Isend(160)......: MPI_Isend(buf=0xabcf12c0, count=3300, MPI_DOUBLE_PRECISION, dest=1612, tag=4, comm=0x84000004, request=0x7fffb38fa0fc) failed\nMPID_Isend(416)......:\nMPID_isend_unsafe(92):\nMPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)\n[1613401128.457254] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e09 apid 200002e3e is not released, refcount 1\n[1613401128.457261] [nid001192:11838:0] mm_xpmem.c:82 UCX WARN remote segment id 200002e08 apid 100002e3e is not released, refcount 1\n
You can add the following line to your job submission script before the srun
command to try to work around this error:
export UCX_IB_REG_METHODS=direct\n
Note
Setting this flag may have an impact on code performance.
"},{"location":"known-issues/#aocc-compiler-fails-to-compile-with-netcdf-added-2021-11-18","title":"AOCC compiler fails to compile with NetCDF (Added: 2021-11-18)","text":"There is currently a problem with the module file which means cray-netcdf-hdf5parallel will not operate correctly in PrgEnv-aocc. An example of the error seen is:
F90-F-0004-Corrupt or Old Module file /opt/cray/pe/netcdf-hdf5parallel/4.7.4.3/crayclang/9.1/include/netcdf.mod (netcdf.F90: 8)\n
The current workaround for this is to load module epcc-netcdf-hdf5parallel instead if PrgEnv-aocc is required.
"},{"location":"known-issues/#slurm-export-option-does-not-work-in-job-submission-script","title":"Slurm--export
option does not work in job submission script","text":"The option --export=ALL
propagates all the environment variables from the login node to the compute node. If you include the option in the job submission script, it is wrongly ignored by Slurm. The current workaround is to include the option when the job submission script is launched. For instance:
sbatch --export=ALL myjob.slurm\n
"},{"location":"known-issues/#recently-resolved-issues","title":"Recently Resolved Issues","text":""},{"location":"other-software/","title":"Software provided by external parties","text":"This section describes software that has been installed on ARCHER2 by external parties (i.e. not by the ARCHER2 service itself) for general use by ARCHER2 users or provides useful notes on software that is not installed centrally.
Important
While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"other-software/#research-software","title":"Research Software","text":"This page has moved
"},{"location":"other-software/cesm-further-examples/","title":"Cesm further examples","text":"This page has moved
"},{"location":"other-software/cesm213/","title":"Cesm213","text":"This page has moved
"},{"location":"other-software/cesm213_run/","title":"Cesm213 run","text":"This page has moved
"},{"location":"other-software/cesm213_setup/","title":"Cesm213 setup","text":"This page has moved
"},{"location":"other-software/crystal/","title":"Crystal","text":"This page has moved
"},{"location":"publish/","title":"ARCHER2 and publications","text":"This section provides information on how to acknowledge the use of ARCHER2 in your published work and how to register your work on ARCHER2 into the ARCHER2 publications database via SAFE.
"},{"location":"publish/#acknowledging-archer2","title":"Acknowledging ARCHER2","text":"We will shortly be publishing a description of the ARCHER2 service with a DOI that you can cite in your published work that arises from the use of ARCHER2. Until that time, please add the following words to any work you publish that arises from your use of ARCHER2:
This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk).
You should also tag outputs with the keyword \"ARCHER2\" whenever possible.
"},{"location":"publish/#archer2-publication-database","title":"ARCHER2 publication database","text":"The ARCHER2 service maintains a publication database of works that have arisen from ARCHER2 and links them to project IDs that have ARCHER2 access. We ask all users of ARCHER2 to register any publications in the database - all you need is your publication's DOI.
Registering your publications in SAFE has a number of advantages:
You will need a DOI for the publication you wish to register. A DOI has the form of a set of ID strings separated by slashes, for example 10.7488/ds/1505
. You should not include the web host address that provides a link to the DOI.
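If you only have a full link to hand, the DOI can be recovered by stripping the web host prefix. The sketch below assumes the link uses the common https://doi.org/ resolver address; the variable names are illustrative.

```shell
#!/bin/bash
# Hypothetical pasted link; only the DOI itself should be registered.
doi_link=https://doi.org/10.7488/ds/1505

# Strip a leading https://doi.org/ prefix if present (shell prefix removal).
doi=${doi_link#https://doi.org/}

echo $doi
```

If the link uses a different resolver address, the prefix in the expansion would need to be changed to match.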
Login to SAFE. Then:
At the moment we support exporting lists of DOIs to comma-separated values (CSV) files. This does not export all the metadata, just the DOIs themselves, with a maximum of 25 DOIs per line. This format is primarily useful for importing into ResearchFish (where you can paste in the comma-separated lists to import publications). We plan to add further export formats in the future.
Login to SAFE. Then:
The ARCHER2 quickstart guides provide the minimum information for new users or users transferring from ARCHER. There are two sections available which are meant to be followed in sequence.
This guide aims to quickly enable developers to work on ARCHER2. It assumes that you are familiar with the material in the Quickstart for users section.
"},{"location":"quick-start/quickstart-developers/#compiler-wrappers","title":"Compiler wrappers","text":"When compiling code on ARCHER2, you should make use of the HPE Cray compiler wrappers. These ensure that the correct libraries and headers (for example, MPI or HPE LibSci) will be used during the compilation and linking stages. These wrappers should be accessed by providing the following compiler names.
| Language | Wrapper name |
|---|---|
| C | cc |
| C++ | CC |
| Fortran | ftn |

This means that you should use the wrapper names whether on the command line, in build scripts, or in configure options. It could be helpful to set some or all of the following environment variables before running a build to ensure that the build tool is aware of the wrappers.
export CC=cc\nexport CXX=CC\nexport FC=ftn\nexport F77=ftn\nexport F90=ftn\n
man
pages are available for each wrapper. You can also see the full set of compiler and linker options being used by passing the -craype-verbose
option to the wrapper.
Tip
The HPE Cray compiler wrappers should be used instead of the MPI compiler wrappers such as mpicc
, mpicxx
and mpif90
that you may have used on other HPC systems.
On login to ARCHER2, the PrgEnv-cray
compiler environment will be loaded, as will a cce
module. The latter makes available the Cray compilers from the Cray Compiling Environment (CCE), while the former provides the correct wrappers and support to use them. The GNU Compiler Collection (GCC) and the AMD compiler environment (AOCC) are also available.
To make use of any particular compiler environment, you load the correct PrgEnv
module. After doing so the compiler wrappers (cc
, CC
and ftn
) will correctly call the compilers from the new suite. The default version of the corresponding compiler suite will also be loaded, but you may swap to another available version if you wish.
The following table summarises the suites and associated compiler environments.
| Suite name | Module | Programming environment collection |
|---|---|---|
| CCE | cce | PrgEnv-cray |
| GCC | gcc | PrgEnv-gnu |
| AOCC | aocc | PrgEnv-aocc |

As an example, after logging in you may wish to use GCC as your compiler suite. Running module load PrgEnv-gnu
will replace the default CCE (Cray) environment with the GNU environment. It will also unload the cce
module and load the default version of the gcc
module; at the time of writing, this is GCC 11.2.0. If you need to use a different version of GCC, for example 10.3.0, you would follow up with module load gcc/10.3.0
. At this point you may invoke the compiler wrappers and they will correctly use the HPE libraries and tools in conjunction with GCC 10.3.0.
When choosing the compiler environment, a big factor will likely be which compilers you have previously used for your code's development. The Cray Fortran compiler is similar to the compiler you may be familiar with from ARCHER, while the Cray C and C++ compilers provided on ARCHER2 are new versions that are now derived from Clang. The GCC suite provides gcc/g++ and gfortran. The AOCC suite provides AMD Clang/Clang++ and AMD Flang.
Note
The Intel compilers are not available on ARCHER2.
"},{"location":"quick-start/quickstart-developers/#useful-compiler-options","title":"Useful compiler options","text":"The compiler options you use will depend on both the software you are building and also on the current stage of development. The following flags should be a good starting point for reasonable performance.
| Compilers | Optimisation flags |
|---|---|
| Cray C/C++ | -O2 -funroll-loops -ffast-math |
| Cray Fortran | Default options |
| GCC | -O2 -ftree-vectorize -funroll-loops -ffast-math |

Tip
If you want to use GCC version 10 or greater to compile MPI Fortran code, you must add the -fallow-argument-mismatch
option when compiling otherwise you will see compile errors associated with MPI functions.
When you are happy with your code's performance you may wish to enable more aggressive optimisations; in this case you could start using the following flags. Please note, however, that these optimisations may lead to deviations from IEEE/ISO specifications. If your code relies on strict adherence then using these flags may cause incorrect output.
| Compilers | Optimisation flags |
|---|---|
| Cray C/C++ | -Ofast -funroll-loops |
| Cray Fortran | -O3 -hfp3 |
| GCC | -Ofast -funroll-loops |

Vectorisation is enabled by the Cray Fortran compiler at -O1
and above, by Cray C and C++ at -O2
and above or when using -ftree-vectorize
, and by the GCC compilers at -O3
and above or when using -ftree-vectorize
.
You may wish to promote default real
and integer
types in Fortran codes from 4 to 8 bytes. In this case, the following flags may be used.
| Compiler | real and integer promotion flags |
|---|---|
| Cray Fortran | -s real64 -s integer64 |
| gfortran | -freal-4-real-8 -finteger-4-integer-8 |

More documentation on the compilers is available through man
. The pages to read are accessed as follows.

| Compiler suite | C | C++ | Fortran |
|---|---|---|---|
| Cray | man craycc | man crayCC | man crayftn |
| GNU | man gcc | man g++ | man gfortran |

Tip
There are no man
pages for the AOCC compilers at the moment.
Executables on ARCHER2 link dynamically, and the Cray Programming Environment does not currently support static linking. This is in contrast to ARCHER where the default was to build statically.
If you attempt to link statically, you will see errors similar to:
/usr/bin/ld: cannot find -lpmi\n/usr/bin/ld: cannot find -lpmi2\ncollect2: error: ld returned 1 exit status\n
The compiler wrapper scripts on ARCHER2 link in runtime libraries using the RUNPATH
by default. This means that the paths to the runtime libraries are encoded into the executable so you do not need to load the compiler environment in your job submission scripts.
The default behaviour of a dynamically linked executable is to let the dynamic linker find the libraries it needs at runtime by searching first the paths in the LD_LIBRARY_PATH
environment variable and then the paths in the RUNPATH
setting of the binary. This is flexible in that it allows an executable to use newly installed library versions without rebuilding, but in some cases you may prefer to bake the paths to specific libraries into the executable RUNPATH
, keeping them constant. While the libraries are still dynamically loaded at run time, from the end user's point of view the resulting behaviour will be similar to that of a statically compiled executable in that they will not need to concern themselves with ensuring the linker will be able to find the libraries.
This is achieved by providing additional paths to add to RUNPATH
to the compiler as options. To set the compiler wrappers to do this, you can set the following environment variable.
export CRAY_ADD_RPATH=yes\n
"},{"location":"quick-start/quickstart-developers/#using-rpaths-to-link","title":"Using RPATHs to link","text":"RPATH
differs from RUNPATH
in that the dynamic linker searches RPATH directories for libraries before searching the paths in LD_LIBRARY_PATH
so they cannot be overridden in the same way at runtime.
You can provide RPATHs directly to the compilers using the -Wl,-rpath=<path-to-directory>
flag, where the provided path is to the directory containing the libraries which are themselves typically specified with flags of the type -l<library-name>
.
The following debugging tools are available on ARCHER2, each accessed by loading its module:
- module load gdb4hpc
- module load valgrind4hpc
- module load cray-stat

To get started debugging on ARCHER2, you might like to use gdb4hpc. You should first of all compile your code using the -g
flag to enable debugging symbols. Once compiled, load the gdb4hpc module and start it:
module load gdb4hpc\ngdb4hpc\n
Once inside gdb4hpc, you can start your program's execution with the launch
command:
dbg all> launch $my_prog{128} ./prog\n
In this example, a job called my_prog
will be launched to run the executable file prog
over 128 cores on a compute node. If you run squeue
in another terminal you will be able to see it running. Inside gdb4hpc you may then step
through the code's execution, continue
to breakpoints that you set with break
, print
the values of variables at these points, and perform a backtrace
on the stack if the program crashes. Debugging jobs will end when you exit gdb4hpc, or you can end them yourself by running, in this example, release $my_prog
.
For more information on debugging parallel codes, see the documentation in the Debugging section of the ARCHER2 User and Best Practice Guide.
"},{"location":"quick-start/quickstart-developers/#profiling-tools","title":"Profiling tools","text":"Profiling on ARCHER2 is provided through the Cray Performance Measurement and Analysis Tools (CrayPAT). This has a number of different components:
- pat_build, the utility used to instrument programs;
- the CrayPat run time environment, which collects the specified performance data during program execution; and
- pat_report, the first-level data analysis tool, used to produce text reports or export data for more sophisticated analysis.

The above tools are made available for use by first loading the perftools-base module followed by either perftools (for CrayPAT, Reveal and Apprentice2) or one of the perftools-lite modules.
The simplest way to get started profiling your code is with CrayPAT-lite. For example, to sample a run of a code you would load the perftools-base
and perftools-lite
modules, and then compile (you will receive a message that the executable is being instrumented). Performing a batch run as usual with this executable will produce a directory such as my_prog+74653-2s
which can be passed to pat_report
to view the results. In this example,
pat_report -O calltree+src my_prog+74653-2s\n
will produce a report containing the call tree. You can view available report keywords to be provided to the -O
option by running pat_report -O -h
. The available perftools-lite
modules are:
- perftools-lite, instrumenting a basic sampling experiment.
- perftools-lite-events, instrumenting a tracing experiment.
- perftools-lite-gpu, instrumenting OpenACC and OpenMP 4 use of GPUs.
- perftools-lite-hbm, instrumenting for memory bandwidth usage.
- perftools-lite-loops, instrumenting a loop work estimate experiment.

Tip
For more information on profiling parallel codes, see the documentation in the Profiling section of the ARCHER2 User and Best Practice Guide.
"},{"location":"quick-start/quickstart-developers/#useful-links","title":"Useful Links","text":"Links to other documentation you may find useful:
Once you have set up your machine account, logged on, run a job or two, and possibly updated and compiled your code: what next?
There is still plenty of support and advice available to you:
Getting Started on ARCHER2 gives an overview of some of this help.
We have advice on how to Get Access through different funding routes and, if your chosen route requires you to complete a Technical Assessment, on How to prepare a successful TA.
And we also have a comprehensive Training Programme for all levels of experience and a wide range of different uses. All our training is free for UK Academics and we have a list of upcoming training and also all the materials and resources from previous training events.
"},{"location":"quick-start/quickstart-users-totp/","title":"Quickstart for users","text":"This guide aims to quickly enable new users to get up and running on ARCHER2. It covers the process of getting an ARCHER2 account, logging in and running your first job.
"},{"location":"quick-start/quickstart-users-totp/#request-an-account-on-archer2","title":"Request an account on ARCHER2","text":"Important
You need to use both a password and a passphrase-protected SSH key pair to log into ARCHER2. You get the password from SAFE, but you will also need to set up your own SSH key pair and add the public part to your account via SAFE before you will be able to log in. We cover the authentication steps below.
"},{"location":"quick-start/quickstart-users-totp/#obtain-an-account-on-the-safe-website","title":"Obtain an account on the SAFE website","text":"Warning
We have seen issues with Gmail blocking emails from SAFE so we recommend that users use their institutional/work email address rather than Gmail addresses to register for SAFE accounts.
The first step is to sign up for an account on the ARCHER2 SAFE website. The SAFE account is used to manage all of your login accounts, allowing you to report on your usage and quotas. To do this:
You are now registered. Your SAFE password will be emailed to the email address you provided. You can then log in with that email address and password. (You can change your initial SAFE password whenever you want by selecting the Change SAFE password option from the Your details menu.)
"},{"location":"quick-start/quickstart-users-totp/#request-an-archer2-login-account","title":"Request an ARCHER2 login account","text":"Once you have a SAFE account and an SSH key you will need to request a user account on ARCHER2 itself. To do this you will require a Project Code; you usually obtain this from the Principal Investigator (PI) or project manager for the project you will be working on. Once you have the Project Code:
The PI or project manager of the project will be asked to approve your request. After your request has been approved, the account will be created and you will receive an email. You can then come back to SAFE and pick up the initial single-use password for your new account.
Note
ARCHER2 account passwords are also sometimes referred to as LDAP passwords by the system.
"},{"location":"quick-start/quickstart-users-totp/#generating-and-adding-an-ssh-key-pair","title":"Generating and adding an SSH key pair","text":"How you generate your SSH key pair depends on which operating system you use and which SSH client you use to connect to ARCHER2. We will not cover the details on generating an SSH key pair here, but detailed information on this topic is available in the ARCHER2 User and Best Practice Guide.
After generating your SSH key pair, add the public part to your login account using SAFE:
Once you have done this, your SSH key will be added to your ARCHER2 account.
Remember, you will need to use both an SSH key and password to log into ARCHER2 so you will also need to collect your initial password before you can log into ARCHER2 for the first time. We cover this next.
Note
If you want to connect to ARCHER2 from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an ssh key on each machine, and add each of the public keys into SAFE.
"},{"location":"quick-start/quickstart-users-totp/#login-to-archer2","title":"Login to ARCHER2","text":"To log into ARCHER2 you should use the address:
ssh [userID]@login.archer2.ac.uk
The order in which you are asked for credentials depends on the system you are accessing:
You will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you previously logged into the ARCHER2 system before the major upgrade in May/June 2023 with your account you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11).
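One way to remove that line is sketched below. The helper name remove_host_key_line is hypothetical; the file path and line number are taken from the "Offending key for IP in" part of the warning. As an alternative, OpenSSH's own ssh-keygen -R login.archer2.ac.uk removes all stored keys for that host name.

```shell
# Delete one numbered line from a known_hosts-style file, keeping a backup
# copy alongside it. remove_host_key_line is a hypothetical helper name.
remove_host_key_line() {
    local file="$1" line="$2"
    cp "$file" "$file.bak"                 # keep the original as <file>.bak
    sed "${line}d" "$file.bak" > "$file"   # rewrite the file without that line
}

# Example for the warning shown above (line 11 of ~/.ssh/known_hosts):
# remove_host_key_line "$HOME/.ssh/known_hosts" 11
```

The example call is left commented out so that nothing is changed until you have checked the line number reported in your own warning.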
Tip
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_archer2
you would use the command ssh -i keys/id_rsa_archer2 username@login.archer2.ac.uk
to log in.
Remember, you will need to use both an SSH key and a time-based one-time password (TOTP) to log into ARCHER2, so you will also need to set up your TOTP before you can log in.
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will not use your password when logging on to ARCHER2 after the initial logon.
Hint
More information on connecting to ARCHER2 is available in the Connecting to ARCHER2 section of the User Guide.
"},{"location":"quick-start/quickstart-users-totp/#file-systems-and-manipulating-data","title":"File systems and manipulating data","text":"ARCHER2 has a number of different file systems and understanding the difference between them is crucial to being able to use the system. In particular, transferring and moving data often requires a bit of thought in advance to ensure that the data is secure and in a useful form.
ARCHER2 file systems are:
All users have a directory on one of the home file systems and on one of the work file systems. The directories are located at:
- /home/[project ID]/[project ID]/[user ID] (this is also set as your home directory)
- /work/[project ID]/[project ID]/[user ID]
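The directory layout above can be sketched as a small shell helper. The name archer2_dir is hypothetical, and the sketch assumes that the project ID appears twice in the path exactly as shown above; the values t01 and auser are the placeholder project and user used elsewhere in this guide.

```shell
# Build a home or work directory path from a project ID and a user ID,
# following the layout described above. archer2_dir is a hypothetical helper.
archer2_dir() {
    local fs="$1" project="$2" user="$3"
    printf '/%s/%s/%s/%s\n' "$fs" "$project" "$project" "$user"
}

archer2_dir work t01 auser   # prints /work/t01/t01/auser
archer2_dir home t01 auser   # prints /home/t01/t01/auser
```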
Top tips for managing data on ARCHER2:
tar
or zip
).
tar
or rsync
between file systems mounted on ARCHER2 avoid the use of compression options as these can slow performance (time saved by transferring smaller compressed files is usually less than the overhead added by having to compress files on the fly).
Hint
Information on the file systems and best practice in managing your data is available in the Data management and transfer section of the User and Best Practice Guide.
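The advice on bundling files without compression can be sketched as follows. This is a minimal illustration run in a temporary directory; the directory and file names are placeholders, and no compression option (such as -z) is passed to tar.

```shell
# Bundle a directory of small files into one uncompressed archive.
# The names below are placeholders created in a temporary directory.
workdir=$(mktemp -d)
mkdir "$workdir/results"
for i in 1 2 3; do
    echo "sample $i" > "$workdir/results/file_$i.txt"
done

# -c creates the archive, -f names it; no -z/-j/-J, so nothing is compressed
tar -cf "$workdir/results.tar" -C "$workdir" results

# List the archive contents to check what was stored
tar -tf "$workdir/results.tar"
```

Moving the single results.tar file then replaces moving many small files individually.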
"},{"location":"quick-start/quickstart-users-totp/#accessing-software","title":"Accessing software","text":"Software on ARCHER2 is principally accessed through modules. These load and unload the desired applications, compilers, tools and libraries through the module
command and its subcommands. Some modules will be loaded by default on login, providing a default working environment; many more will be available for use but initially unloaded, allowing you to set up the environment to suit your needs.
At any stage you can check which modules have been loaded by running
module list\n
Running the following command will display all environment modules available on ARCHER2, whether loaded or unloaded
module avail\n
The search field for this command may be narrowed by providing the first few characters of the module name being queried. For example, all available versions and variants of VASP may be found by running
module avail vasp\n
You will see that different versions are available for many modules. For example, vasp/5/5.4.4.pl2
and vasp/6/6.3.2
are two available versions of VASP on the full system. Furthermore, a default version may be specified; this is used if no version is provided by the user.
Important
VASP is licensed software, as are other software packages on ARCHER2. You must have a valid licence to use licensed software on ARCHER2. Often you will need to request access through the SAFE. More on this below.
The module load
command loads a module for use. Following the above,
module load vasp/6\n
would load the default version of VASP 6, while
module load vasp/6/6.3.2\n
would specifically load version 6.3.2
. A loaded module may be unloaded through the module unload
command (module remove is identical), e.g.
module unload vasp\n
The above unloads whichever version of VASP is currently in the environment. Rather than issuing separate unload and load commands, versions of a module may be swapped as follows:
module swap vasp vasp/5/5.4.4.pl2\n
Other helpful commands are:
- module help <modulename>, which provides a short description of the module
- module show <modulename>, which displays the contents of the modulefile
- module restore, which returns you to the default module setup as if you had just logged in

Tip
You should not use the module purge
command on ARCHER2 as this will cause issues for the HPE Cray programming environment. If you wish to reset your modules, you should use the module restore
command instead.
Points to be aware of include:
module show
) should reveal the cause of the conflict and how to resolve it.
module show
.
More information on modules and the software environment on ARCHER2 can be found in the Software environment section of the User and Best Practice Guide.
"},{"location":"quick-start/quickstart-users-totp/#requesting-access-to-licensed-software","title":"Requesting access to licensed software","text":"Some of the software installed on ARCHER2 requires a user to have a valid licence agreed with the software owners/developers to be able to use it (for example, VASP). Although you will be able to load this software on ARCHER2, you will be barred from actually using it until your licence has been verified.
You request access to licensed software through the SAFE (the web administration tool you used to apply for your account and retrieve your initial password) by being added to the appropriate Package Group. To request access to licensed software:
Your request will then be processed by the ARCHER2 Service Desk who will confirm your licence with the software owners/developers before enabling your access to the software on ARCHER2. This can take several days (depending on how quickly the software owners/developers respond) but you will be advised once this has been done.
"},{"location":"quick-start/quickstart-users-totp/#create-a-job-submission-script","title":"Create a job submission script","text":"To run a program on the ARCHER2 compute nodes you need to write a job submission script that tells the system how many compute nodes you want to reserve and for how long. You also need to use the srun
command to launch your parallel executable.
Hint
For more details on the Slurm scheduler on ARCHER2 and writing job submission scripts see the Running jobs on ARCHER2 section of the User and Best Practice Guide.
Important
Parallel jobs on ARCHER2 should be run from the work file systems as the home file systems are not available on the compute nodes - you will see a chdir
or file not found error if you try to access data on the home file system within a parallel job running on the compute nodes.
Create a job submission script called submit.slurm
in your space on the work file systems using your favourite text editor. For example, using vim
:
auser@ln01:~> cd /work/t01/t01/auser\nauser@ln01:/work/t01/t01/auser> vim submit.slurm\n
Tip
You will need to use your project code and username to get to the correct directory. i.e. replace the t01
above with your project code and replace the username auser
with your ARCHER2 username.
Paste the following text into your job submission script, replacing [budget code] with your budget code (e.g. e99-ham). The script as written uses the standard partition and the standard quality of service (QoS); change the --partition and --qos values if you wish to run elsewhere.
#!/bin/bash --login\n\n#SBATCH --job-name=test_job\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:5:0\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the xthi module to get access to the xthi program\nmodule load xthi\n\n# Recommended environment settings\n# Stop unintentional multi-threading within software libraries\nexport OMP_NUM_THREADS=1\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# srun launches the parallel program based on the SBATCH options\nsrun --distribution=block:block --hint=nomultithread xthi_mpi\n
"},{"location":"quick-start/quickstart-users-totp/#submit-your-job-to-the-queue","title":"Submit your job to the queue","text":"You submit your job to the queues using the sbatch
command:
auser@ln01:/work/t01/t01/auser> sbatch submit.slurm\nSubmitted batch job 23996\n
The value returned is your Job ID.
"},{"location":"quick-start/quickstart-users-totp/#monitoring-your-job","title":"Monitoring your job","text":"You use the squeue
command to examine jobs in the queue. To list all the jobs you have in the queue, use:
auser@ln01:/work/t01/t01/auser> squeue -u $USER\n
squeue
on its own lists all jobs in the queue from all users.
The job submission script above should write the output to a file called slurm-<jobID>.out (for example, if the Job ID was 23996, the file would be slurm-23996.out). You can check the contents of this file with the cat command. If the job was successful you should see output that looks something like:
auser@ln01:/work/t01/t01/auser> cat slurm-23996.out\nNode 0, hostname nid001020\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode 0, rank 8, thread 0, (affinity = 8)\nNode 0, rank 9, thread 0, (affinity = 9)\nNode 0, rank 10, thread 0, (affinity = 10)\nNode 0, rank 11, thread 0, (affinity = 11)\nNode 0, rank 12, thread 0, (affinity = 12)\nNode 0, rank 13, thread 0, (affinity = 13)\nNode 0, rank 14, thread 0, (affinity = 14)\nNode 0, rank 15, thread 0, (affinity = 15)\nNode 0, rank 16, thread 0, (affinity = 16)\nNode 0, rank 17, thread 0, (affinity = 17)\nNode 0, rank 18, thread 0, (affinity = 18)\nNode 0, rank 19, thread 0, (affinity = 19)\nNode 0, rank 20, thread 0, (affinity = 20)\nNode 0, rank 21, thread 0, (affinity = 21)\n... output trimmed ...\n
If something has gone wrong, you will find any error messages in the file instead of the expected output.
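The link between the Job ID and the output file name can be sketched in shell as below. The helper name job_id_from_sbatch is hypothetical, and the message is the sample sbatch output from this guide rather than the result of a real submission.

```shell
# Extract the Job ID from the line printed by sbatch and derive the name of
# the default Slurm output file. job_id_from_sbatch is a hypothetical helper.
job_id_from_sbatch() {
    # The Job ID is the last whitespace-separated field of the message
    printf '%s\n' "$1" | awk '{ print $NF }'
}

message="Submitted batch job 23996"   # sample sbatch output from this guide
job_id=$(job_id_from_sbatch "$message")
echo "slurm-${job_id}.out"            # prints slurm-23996.out
```

In a real session the message would come from the sbatch command itself; sbatch also has a --parsable option that prints the Job ID without the surrounding text.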
"},{"location":"quick-start/quickstart-users-totp/#acknowledging-archer2","title":"Acknowledging ARCHER2","text":"You should use the following phrase to acknowledge ARCHER2 for all research outputs that were generated using the ARCHER2 service:
This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk).
You should also tag outputs with the keyword \"ARCHER2\" whenever possible.
"},{"location":"quick-start/quickstart-users-totp/#useful-links","title":"Useful Links","text":"If you plan to compile your own programs on ARCHER2, you may also want to look at Quickstart for developers.
Other documentation you may find useful:
This guide aims to quickly enable new users to get up and running on ARCHER2. It covers the process of getting an ARCHER2 account, logging in and running your first job.
"},{"location":"quick-start/quickstart-users/#request-an-account-on-archer2","title":"Request an account on ARCHER2","text":"Important
To access ARCHER2, you need to use two sets of credentials: your SSH key pair protected by a passphrase and a Time-based one-time password (TOTP). Additionally, the first time you ever log into an account on ARCHER2, you will need to use a single use password you retrieve from SAFE.
"},{"location":"quick-start/quickstart-users/#obtain-an-account-on-the-safe-website","title":"Obtain an account on the SAFE website","text":"Warning
We have seen issues with Gmail blocking emails from SAFE so we recommend that users use their institutional/work email address rather than Gmail addresses to register for SAFE accounts.
The first step is to sign up for an account on the ARCHER2 SAFE website. The SAFE account is used to manage all of your login accounts, allowing you to report on your usage and quotas. To do this:
You are now registered. Your SAFE password will be emailed to the email address you provided. You can then log in with that email address and password. (You can change your initial SAFE password whenever you want by selecting the Change SAFE password option from the Your details menu.)
"},{"location":"quick-start/quickstart-users/#request-an-archer2-login-account","title":"Request an ARCHER2 login account","text":"Once you have a SAFE account and an SSH key you will need to request a user account on ARCHER2 itself. To do this you will require a Project Code; you usually obtain this from the Principal Investigator (PI) or project manager for the project you will be working on. Once you have the Project Code:
The PI or project manager of the project will be asked to approve your request. After your request has been approved, the account will be created and you will receive an email. You can then come back to SAFE and pick up the initial single-use password for your new account.
Note
ARCHER2 account passwords are also sometimes referred to as LDAP passwords by the system.
"},{"location":"quick-start/quickstart-users/#generating-and-adding-an-ssh-key-pair","title":"Generating and adding an SSH key pair","text":"How you generate your SSH key pair depends on which operating system you use and which SSH client you use to connect to ARCHER2. We will not cover the details on generating an SSH key pair here, but detailed information on this topic is available in the ARCHER2 User and Best Practice Guide.
After generating your SSH key pair, add the public part to your login account using SAFE:
Once you have done this, your SSH key will be added to your ARCHER2 account.
Remember, you will need to use both an SSH key and password to log into ARCHER2 so you will also need to collect your initial password before you can log into ARCHER2 for the first time. We cover this next.
Note
If you want to connect to ARCHER2 from more than one machine, e.g. from your home laptop as well as your work laptop, you should generate an ssh key on each machine, and add each of the public keys into SAFE.
"},{"location":"quick-start/quickstart-users/#login-to-archer2","title":"Login to ARCHER2","text":"To log into ARCHER2 you should use the address:
ssh [userID]@login.archer2.ac.uk
The order in which you are asked for credentials depends on the system you are accessing:
You will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you previously logged into the ARCHER2 system before the major upgrade in May/June 2023 with your account you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11).
Tip
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_archer2
you would use the command ssh -i keys/id_rsa_archer2 username@login.archer2.ac.uk
to log in.
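If you log in often, the -i option can be replaced by an entry in your local OpenSSH client configuration. The following is a minimal sketch, not an official recommendation: the host alias archer2 is a hypothetical name chosen for illustration, and username and the key path are the placeholders from the example above, assumed here to live under your home directory.

```shell
# Sketch only: append a host entry to the local OpenSSH client configuration.
# "archer2" is a hypothetical alias; replace username and the key path with your own.
SSH_CONFIG="${SSH_CONFIG:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$SSH_CONFIG")"
cat >> "$SSH_CONFIG" <<'EOF'
Host archer2
    HostName login.archer2.ac.uk
    User username
    IdentityFile ~/keys/id_rsa_archer2
EOF
# Keep the configuration file private to your user
chmod 600 "$SSH_CONFIG"
```

With such an entry in place, ssh archer2 should behave like the longer command above.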
Remember, you will need to use both an SSH key and Time-based one-time password to log into ARCHER2 so you will also need to set up your TOTP before you can log into ARCHER2.
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will not use your password when logging on to ARCHER2 after the initial logon.
Hint
More information on connecting to ARCHER2 is available in the Connecting to ARCHER2 section of the User Guide.
"},{"location":"quick-start/quickstart-users/#file-systems-and-manipulating-data","title":"File systems and manipulating data","text":"ARCHER2 has a number of different file systems and understanding the difference between them is crucial to being able to use the system. In particular, transferring and moving data often requires a bit of thought in advance to ensure that the data is secure and in a useful form.
ARCHER2 file systems are:
All users have a directory on one of the home file systems and on one of the work file systems. The directories are located at:
/home/[project ID]/[project ID]/[user ID]
(this is also set as your home directory)/work/[project ID]/[project ID]/[user ID]
Top tips for managing data on ARCHER2:
tar
or zip
).tar
or rsync
between file systems mounted on ARCHER2 avoid the use of compression options as these can slow performance (time saved by transferring smaller compressed files is usually less than the overhead added by having to compress files on the fly).
Hint
Information on the file systems and best practice in managing your data is available in the Data management and transfer section of the User and Best Practice Guide.
"},{"location":"quick-start/quickstart-users/#accessing-software","title":"Accessing software","text":"Software on ARCHER2 is principally accessed through modules. These load and unload the desired applications, compilers, tools and libraries through the module
command and its subcommands. Some modules will be loaded by default on login, providing a default working environment; many more will be available for use but initially unloaded, allowing you to set up the environment to suit your needs.
At any stage you can check which modules have been loaded by running
module list\n
Running the following command will display all environment modules available on ARCHER2, whether loaded or unloaded
module avail\n
The search field for this command may be narrowed by providing the first few characters of the module name being queried. For example, all available versions and variants of VASP may be found by running
module avail vasp\n
You will see that different versions are available for many modules. For example, vasp/5/5.4.4.pl2
and vasp/6/6.3.2
are two available versions of VASP on the full system. Furthermore, a default version may be specified; this is used if no version is provided by the user.
Important
VASP is licensed software, as are other software packages on ARCHER2. You must have a valid licence to use licensed software on ARCHER2. Often you will need to request access through the SAFE. More on this below.
The module load
command loads a module for use. Following the above,
module load vasp/6\n
would load the default version of VASP 6, while
module load vasp/6/6.3.2\n
would specifically load version 6.3.2
. A loaded module may be unloaded through the analogous module unload
command, e.g.
module unload vasp\n
The above unloads whichever version of VASP is currently in the environment. Rather than issuing separate unload and load commands, versions of a module may be swapped as follows:
module swap vasp vasp/5/5.4.4.pl2\n
Other helpful commands are:
module help <modulename>
which provides a short description of the module
module show <modulename>
which displays the contents of the modulefile
module restore
which returns you to the default module setup as if you had just logged in
Tip
You should not use the module purge
command on ARCHER2 as this will cause issues for the HPE Cray programming environment. If you wish to reset your modules, you should use the module restore
command instead.
Points to be aware of include:
module show
) should reveal the cause of the conflict and how to resolve it.module show
.
More information on modules and the software environment on ARCHER2 can be found in the Software environment section of the User and Best Practice Guide.
"},{"location":"quick-start/quickstart-users/#requesting-access-to-licensed-software","title":"Requesting access to licensed software","text":"Some of the software installed on ARCHER2 requires a user to have a valid licence agreed with the software owners/developers to be able to use it (for example, VASP). Although you will be able to load this software on ARCHER2, you will be barred from actually using it until your licence has been verified.
You request access to licensed software through the SAFE (the web administration tool you used to apply for your account and retrieve your initial password) by being added to the appropriate Package Group. To request access to licensed software:
Your request will then be processed by the ARCHER2 Service Desk who will confirm your licence with the software owners/developers before enabling your access to the software on ARCHER2. This can take several days (depending on how quickly the software owners/developers respond) but you will be advised once this has been done.
"},{"location":"quick-start/quickstart-users/#create-a-job-submission-script","title":"Create a job submission script","text":"To run a program on the ARCHER2 compute nodes you need to write a job submission script that tells the system how many compute nodes you want to reserve and for how long. You also need to use the srun
command to launch your parallel executable.
Hint
For more details on the Slurm scheduler on ARCHER2 and writing job submission scripts see the Running jobs on ARCHER2 section of the User and Best Practice Guide.
Important
Parallel jobs on ARCHER2 should be run from the work file systems as the home file systems are not available on the compute nodes - you will see a chdir
or file not found error if you try to access data on the home file system within a parallel job running on the compute nodes.
Create a job submission script called submit.slurm
in your space on the work file systems using your favourite text editor. For example, using vim
:
auser@ln01:~> cd /work/t01/t01/auser\nauser@ln01:/work/t01/t01/auser> vim submit.slurm\n
Tip
You will need to use your project code and username to get to the correct directory. i.e. replace the t01
above with your project code and replace the username auser
with your ARCHER2 username.
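As a small illustration of how the work directory path is put together, the sketch below builds it from a project code and a username. The values t01 and auser are the placeholders used on this page, and the variable names are chosen only for this example.

```shell
# Placeholder values taken from the example above; substitute your own.
project=t01
user=auser

# The project code appears twice in the path, followed by the username.
work_dir="/work/${project}/${project}/${user}"
echo "$work_dir"
```

On ARCHER2 you could then change to this directory with cd "$work_dir" before creating the job script.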
Paste the following text into your job submission script, replacing [budget code]
with your budget code (e.g. e99-ham
). The script requests the standard
partition and the standard
quality of service; edit the --partition
and --qos
lines if you need different values.
#!/bin/bash --login\n\n#SBATCH --job-name=test_job\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:5:0\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the xthi module to get access to the xthi program\nmodule load xthi\n\n# Recommended environment settings\n# Stop unintentional multi-threading within software libraries\nexport OMP_NUM_THREADS=1\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# srun launches the parallel program based on the SBATCH options\nsrun --distribution=block:block --hint=nomultithread xthi_mpi\n
"},{"location":"quick-start/quickstart-users/#submit-your-job-to-the-queue","title":"Submit your job to the queue","text":"You submit your job to the queues using the sbatch
command:
auser@ln01:/work/t01/t01/auser> sbatch submit.slurm\nSubmitted batch job 23996\n\nThe value returned is your *Job ID*.\n
"},{"location":"quick-start/quickstart-users/#monitoring-your-job","title":"Monitoring your job","text":"You use the squeue
command to examine jobs in the queue. To list all the jobs you have in the queue, use:
auser@ln01:/work/t01/t01/auser> squeue -u $USER\n
squeue
on its own lists all jobs in the queue from all users.
The job submission script above should write the output to a file called slurm-<jobID>.out
(i.e. if the Job ID was 23996, the file would be slurm-23996.out
). You can check the contents of this file with the cat
command. If the job was successful you should see output that looks something like:
auser@ln01:/work/t01/t01/auser> cat slurm-23996.out\nNode 0, hostname nid001020\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode 0, rank 8, thread 0, (affinity = 8)\nNode 0, rank 9, thread 0, (affinity = 9)\nNode 0, rank 10, thread 0, (affinity = 10)\nNode 0, rank 11, thread 0, (affinity = 11)\nNode 0, rank 12, thread 0, (affinity = 12)\nNode 0, rank 13, thread 0, (affinity = 13)\nNode 0, rank 14, thread 0, (affinity = 14)\nNode 0, rank 15, thread 0, (affinity = 15)\nNode 0, rank 16, thread 0, (affinity = 16)\nNode 0, rank 17, thread 0, (affinity = 17)\nNode 0, rank 18, thread 0, (affinity = 18)\nNode 0, rank 19, thread 0, (affinity = 19)\nNode 0, rank 20, thread 0, (affinity = 20)\nNode 0, rank 21, thread 0, (affinity = 21)\n... output trimmed ...\n
If something has gone wrong, you will find any error messages in the file instead of the expected output.
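The link between the Job ID and the output file name can be scripted. The sketch below is one possible approach rather than an ARCHER2-provided tool: extract_job_id is a hypothetical helper name, and the sample input is the sbatch line shown earlier on this page.

```shell
# Pull the Job ID out of the line that sbatch prints on submission.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# The sample line below is the sbatch output shown earlier on this page.
job_id=$(echo "Submitted batch job 23996" | extract_job_id)

# Default Slurm output file name for this job
output_file="slurm-${job_id}.out"
echo "$output_file"
```

In practice the helper would read the real output instead, for example sbatch submit.slurm | extract_job_id.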
"},{"location":"quick-start/quickstart-users/#acknowledging-archer2","title":"Acknowledging ARCHER2","text":"You should use the following phrase to acknowledge ARCHER2 for all research outputs that were generated using the ARCHER2 service:
This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk).
You should also tag outputs with the keyword \"ARCHER2\" whenever possible.
"},{"location":"quick-start/quickstart-users/#useful-links","title":"Useful Links","text":"If you plan to compile your own programs on ARCHER2, you may also want to look at Quickstart for developers.
Other documentation you may find useful:
ARCHER2 provides a number of research software packages as centrally supported packages. Many of these packages are free to use, but others require a license (which you, or your research group, need to supply).
This section also contains information on research software contributed and/or supported by third parties (marked with a * in the list below).
For centrally supported packages, the version available will usually be the current stable release, to include major releases and significant updates. We will usually not maintain older versions and versions no longer supported by the developers of the package.
The following sections provide details on access to each of the centrally installed packages (software that is not part of the fully-supported software stack is marked with *):
If the software you are interested in is not in the above list, we may still be able to help you install your own version, either individually, or as a project. Please contact the Service Desk.
"},{"location":"research-software/casino/","title":"CASINO","text":"Note
CASINO is not available as a central install/module on ARCHER2 at this time. This page provides tips on using CASINO on ARCHER2 for users who have obtained their own copy of the code.
Important
CASINO is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
CASINO is a computer program system for performing quantum Monte Carlo (QMC) electronic structure calculations that has been developed by a group of researchers initially working in the Theory of Condensed Matter group in the Cambridge University physics department, and their collaborators, over more than 20 years. It is capable of calculating incredibly accurate solutions to the Schr\u00f6dinger equation of quantum mechanics for realistic systems built from atoms.
"},{"location":"research-software/casino/#useful-links","title":"Useful Links","text":"You should use the linuxpc-gcc-slurm-parallel.archer2
configuration that is supplied along with the CASINO source code to build on ARCHER2 and ensure that you build the \"Shm\" (System-V shared memory) version of the code.
Bug
The linuxpc-cray-slurm-parallel.archer2
configuration produces a binary that crashes with a segfault and should not be used.
The performance of CASINO on ARCHER2 is critically dependent on three things:
Next, we show how to make sure that the MPI transport layer is set to UCX, how to set the number of cores sharing the System-V shared memory segments and how to pin MPI processes sequentially to cores.
Finally, we provide a job submission script that demonstrates all these options together.
"},{"location":"research-software/casino/#setting-the-mpi-transport-layer-to-ucx","title":"Setting the MPI transport layer to UCX","text":"In your job submission script that runs CASINO you switch to using UCX as the MPI transport layer by including the following lines before you run CASINO (i.e. before the srun
command that launches the CASINO executable):
module load PrgEnv-gnu\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\n
"},{"location":"research-software/casino/#setting-the-number-of-cores-sharing-memory","title":"Setting the number of cores sharing memory","text":"In your job submission script you set the number of cores sharing memory segments by setting the CASINO_NUMABLK
environment variable before you run CASINO. For example, to specify that there should be shared memory segments each shared between 16 cores, you would use:
export CASINO_NUMABLK=16\n
Tip
If you do not set CASINO_NUMABLK
then CASINO will use the default of all cores on a node (the equivalent of setting it to 128) which will give very poor performance so you should always set this environment variable. Setting CASINO_NUMABLK
to 8 or 16 cores gives the best performance. 32 cores is acceptable if you want to maximise memory efficiency. Using 64 and 128 gives poor performance.
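To see what a given block size means on a node, the sketch below counts how many groups of cores would share a memory segment. It assumes 128 cores per node, as in the job scripts on this page, and segments_per_node is a hypothetical helper name used only here.

```shell
# Assumes 128 cores per node, as in the job scripts on this page.
cores_per_node=128

# Number of groups of cores sharing memory on one node for a given block size.
segments_per_node() {
    local blk=$1
    echo $(( cores_per_node / blk ))
}

# Block sizes that the text above describes as best (8, 16) or acceptable (32)
for blk in 8 16 32; do
    echo "CASINO_NUMABLK=${blk}: $(segments_per_node "$blk") groups per node"
done
```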
For shared memory segments to work efficiently MPI processes must be pinned sequentially to cores on compute nodes (so that cores sharing memory are close in the node memory hierarchy). To do this, you add the following options to the srun
command in your job script that runs the CASINO executable:
--distribution=block:block --hint=nomultithread\n
"},{"location":"research-software/casino/#example-casino-job-submission-script","title":"Example CASINO job submission script","text":"The following script will run a CASINO job using 16 nodes (2048 cores).
#!/bin/bash\n\n# Request 16 nodes with 128 MPI tasks per node for 20 minutes\n#SBATCH --job-name=CASINO\n#SBATCH --nodes=16\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Ensure we are using UCX as the MPI transport layer\nmodule load PrgEnv-gnu\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\n\n# Set CASINO to share memory across 16 core blocks\nexport CASINO_NUMABLK=16\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the location of the CASINO executable - this must be on /work\n# Replace this with the path to your compiled CASINO binary\nCASINO_EXE=/work/t01/t01/auser/CASINO/bin_qmc/linuxpc-gcc-slurm-parallel.archer2/Shm/opt/casino\n\n# Launch CASINO with MPI processes pinned to cores in a sequential order\nsrun --distribution=block:block --hint=nomultithread ${CASINO_EXE}\n
"},{"location":"research-software/casino/#casino-performance-on-archer2","title":"CASINO performance on ARCHER2","text":"We have run the benzene_dimer benchmark on ARCHER2 with the following configuration:
linuxpc-gcc-slurm-parallel.archer2
, \"Shm\" versionTimings are reported as time taken for 100 equilibration steps in DMC calculation.
"},{"location":"research-software/casino/#casino_numablk8","title":"CASINO_NUMABLK=8","text":"Nodes  Time taken (s)  Speedup
1      289.90          1.0
2      154.93          1.9
4      81.06           3.6
8      41.44           7.0
16     23.16           12.5"},{"location":"research-software/castep/","title":"CASTEP","text":"CASTEP is a leading code for calculating the properties of materials from first principles. Using density functional theory, it can simulate a wide range of material properties including energetics, structure at the atomic level, vibrational properties, electronic response properties etc. In particular it has a wide range of spectroscopic features that link directly to experiment, such as infra-red and Raman spectroscopies, NMR, and core level spectra.
"},{"location":"research-software/castep/#useful-links","title":"Useful Links","text":"CASTEP is only available to users who have a valid CASTEP licence.
If you have a CASTEP licence and wish to have access to CASTEP on ARCHER2, please make a request via the SAFE, see:
Please have your licence details to hand.
"},{"location":"research-software/castep/#note-on-using-relativistic-j-dependent-pseudopotentials","title":"Note on using Relativistic J-dependent pseudopotentials","text":"These pseudopotentials cannot be generated on the fly by CASTEP and so are available in the following directory on ARCHER2:
/work/y07/shared/apps/core/castep/pseudopotentials\n
"},{"location":"research-software/castep/#running-parallel-castep-jobs","title":"Running parallel CASTEP jobs","text":"The following script will run a CASTEP job using 2 nodes (256 cores). It assumes that the input files have the file stem test_calc
.
#!/bin/bash\n\n# Request 2 nodes with 128 MPI tasks per node for 20 minutes\n#SBATCH --job-name=CASTEP\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Load the CASTEP module, avoid any unintentional OpenMP threading by\n# setting OMP_NUM_THREADS, and launch the code.\nmodule load castep\nexport OMP_NUM_THREADS=1\nsrun --distribution=block:block --hint=nomultithread castep.mpi test_calc\n
"},{"location":"research-software/castep/#using-serial-castep-tools","title":"Using serial CASTEP tools","text":"Serial CASTEP tools are available in the standard CASTEP module.
"},{"location":"research-software/castep/#compiling-castep","title":"Compiling CASTEP","text":"The latest instructions for building CASTEP on ARCHER2 may be found in the GitHub repository of build instructions:
In the process of porting CESM 2.1.3 to ARCHER2, a set of four long validation runs was carried out. This page contains the four example cases validated by those runs. They vary in the numbers of cores or threads used, and the PE layouts used in the validation runs are included here as a guide for other runs. While only these four compsets and grids have been validated, CESM2 is not bound to just these cases. Links to the UCAR/NCAR pages on configurations, compsets and grids are in the useful links section of the CESM2.1.3 on ARCHER2 page, which can be used to find many of the defined compsets for CESM2.1.3.
"},{"location":"research-software/cesm-further-examples/#atmosphere-only-f2000climo","title":"Atmosphere-only / F2000climo","text":"This compset uses the F09 grid which is roughly equivalent to a 1 degree resolution. On ARCHER2 with four nodes this configuration should give a throughput of around 7.8 simulated years per (wallclock) day (SYPD). The commands to set up and run the case are as follows:
${CIMEROOT}/scripts/create_newcase --case [case name] --compset F2000climo --res f09_f09_mg17 --walltime [enough time] --project [project code]\ncd [case directory]\n./xmlchange NTASKS=512,NTASKS_ESP=1\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
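When choosing a value for --walltime, the quoted throughput gives a rough starting point. The sketch below converts a number of simulated years and a throughput in SYPD into wallclock hours. wallclock_hours is a hypothetical helper, the 10-year run length is only an example, and real runs will need some margin on top of the estimate.

```shell
# Estimate wallclock hours for a run, given simulated years and throughput in SYPD.
wallclock_hours() {
    awk -v years="$1" -v sypd="$2" 'BEGIN { printf "%.1f\n", years / sypd * 24 }'
}

# Example: 10 simulated years at the 7.8 SYPD quoted above
wallclock_hours 10 7.8
```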
"},{"location":"research-software/cesm-further-examples/#slab-ocean-etest","title":"Slab Ocean / ETEST","text":"The slab ocean case is similar to the atmosphere-only case in terms of resources needed, as the slab ocean is inexpensive to simulate in comparison to the atmosphere. The setup detailed below uses two OMP threads, and more tasks than were used by the F2000climo case, and so a throughput of around 20 SYPD can be expected. Unlike F2000climo, but like most compsets, this is unsupported (meaning it has not been scientifically verified by NCAR personnel) and as such an extra argument is required when creating the case. The arguments for ROOTPE are to guard against poor decisions being automatically chosen with respect to resources.
${CIMEROOT}/scripts/create_newcase --case [case name] --compset ETEST --res f09_g17 --walltime [enough time] --project [project code] --run-unsupported\ncd [case directory]\n./xmlchange NTASKS=1024,NTASKS_ESP=1\n./xmlchange NTHRDS=2\n./xmlchange ROOTPE_ICE=0,ROOTPE_OCN=0\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
"},{"location":"research-software/cesm-further-examples/#coupled-ocean-b1850","title":"Coupled Ocean / B1850","text":"Compsets with the B
prefix are fully coupled, and actively simulate all components. As such, this case is more expensive to run, the ocean component especially. This case can be set up to run on dedicated nodes by changing the $ROOTPE
variables (run the ./pelayout command to check that you have things as you wish). This should give a throughput of just over 10 SYPD.
${CIMEROOT}/scripts/create_newcase --case [case name] --compset B1850 --res f09_g17 --walltime [enough time] --project [project name]\ncd [case directory]\n./xmlchange NTASKS_CPL=1024,NTASKS_ICE=256,NTASKS_LND=256,NTASKS_GLC=128,NTASKS_ROF=128,NTASKS_WAV=256,NTASKS_OCN=512,NTASKS_ATM=1024\n./xmlchange ROOTPE_CPL=0,ROOTPE_ICE=0,ROOTPE_LND=256,ROOTPE_GLC=512,ROOTPE_ROF=640,ROOTPE_WAV=768,ROOTPE_OCN=1024,ROOTPE_ATM=0\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
You can also define the PE layout in terms of full nodes by using negative values. As such, for a $MAX_MPITASKS_PER_NODE=128
and $MAX_TASKS_PER_NODE=128
, the below is equivalent to the above:
${CIMEROOT}/scripts/create_newcase --case [case name] --compset B1850 --res f09_g17 --walltime [enough time] --project [project name]\ncd [case directory]\n./xmlchange NTASKS_CPL=-8,NTASKS_ICE=-2,NTASKS_LND=-2,NTASKS_GLC=-1,NTASKS_ROF=-1,NTASKS_WAV=-2,NTASKS_OCN=-4,NTASKS_ATM=-8\n./xmlchange ROOTPE_CPL=0,ROOTPE_ICE=0,ROOTPE_LND=-2,ROOTPE_GLC=-4,ROOTPE_ROF=-5,ROOTPE_WAV=-6,ROOTPE_OCN=-8,ROOTPE_ATM=0\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
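The equivalence between the two layouts is simple arithmetic: a negative value counts whole nodes and is multiplied by the number of tasks per node. The sketch below illustrates this with 128 tasks per node, as stated above. nodes_to_tasks is a hypothetical helper written for this page and is not part of CIME.

```shell
# Tasks per node, matching MAX_MPITASKS_PER_NODE=128 in the text above.
tasks_per_node=128

# Convert a negative (whole-node) value to an explicit task count or root PE;
# non-negative values are already explicit and are returned unchanged.
nodes_to_tasks() {
    local value=$1
    if [ "$value" -lt 0 ]; then
        echo $(( -value * tasks_per_node ))
    else
        echo "$value"
    fi
}

nodes_to_tasks -8   # NTASKS_ATM=-8  corresponds to NTASKS_ATM=1024
nodes_to_tasks -5   # ROOTPE_ROF=-5  corresponds to ROOTPE_ROF=640
```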
"},{"location":"research-software/cesm-further-examples/#waccm-x-fxhist","title":"WACCM-X / FXHIST","text":"The WACCM-X case needs care during the set up and running for a couple of reasons. Firstly, as mentioned in the known issues section on archiving errors the short-term archiver can sometimes move too many files and thus create problems with resubmissions. Secondly, it can pick up other files in the cesm_inputdata directory, causing issues when running. WACCM-X is also comparatively very expensive, and so only has an expected throughput of a little over 1.5 SYPD, and that when on a coarser grid than above. The setup for running a WACCM-X case with approximately 2 degree resolution and no short-term archiving is
${CIMEROOT}/scripts/create_newcase --case [case name] --compset FXHIST --res f19_f19_mg16 --walltime [enough time] --project [project name] --run-unsupported\ncd [case directory]\n./xmlchange NTASKS=512,NTASKS_ESP=1\n./xmlchange NTHRDS=2\n./xmlchange DOUT_S=FALSE\n[Any other changes e.g. run length or resubmissions]\n./case.setup\n./case.build\n./case.submit\n
"},{"location":"research-software/cesm/","title":"Community Earth System Model (CESM2)","text":"CESM2 is a fully-coupled, community, global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states. It has seven different components: atmosphere, ocean, river run off, sea ice, land ice, waves and adaptive river transport.
Important
CESM is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/cesm/#cesm-213","title":"CESM 2.1.3","text":"At the time of writing, CESM 2.1.3 is the latest scientifically verified version of the model.
"},{"location":"research-software/cesm/#setting-up-cesm-213-on-archer2","title":"Setting up CESM 2.1.3 on ARCHER2","text":"Due to the nature of CESM2, there is not a centrally installed version of the program available on ARCHER2. Instead, users download their own copy of the program and make use of ARCHER2-specific configurations that have been rigorously tested.
The setup process has been streamlined on ARCHER2 and can be carried out by following the instructions on the ARCHER2 CESM2.1.3 setup page
"},{"location":"research-software/cesm/#using-cesm-213-on-archer2","title":"Using CESM 2.1.3 on ARCHER2","text":"A quickstart guide for running a simple coupled case of CESM 2.1.3 on ARCHER2 can be found here. It should be noted that this is only a quickstart guide with a focus on the way that CESM 2.1.3 should be run specifically on ARCHER2, and is not intended to replace the larger CESM or CIME documentation linked to below.
"},{"location":"research-software/cesm/#useful-links","title":"Useful Links","text":""},{"location":"research-software/cesm/#documentation","title":"Documentation","text":"If this is your first time running CESM2, it is highly recommended that you consult both the CIME documentation and the NCAR CESM pages for the version used in CESM 2.1.3, paying particular attention to the pages on Basic Usage of CIME which gives detailed description of the basic commands needed to get a model running.
"},{"location":"research-software/cesm/#compsets-and-configurations","title":"Compsets and Configurations","text":"CESM2 allows simulations to be carried out using a very wide range of configurations. If you are new to CESM2 it is highly recommended that, unless you are running a case you are already familiar with, you consult the CESM2.1 Configurations page. You can also see a list of the defined compsets already available on the component set definitions page. More information about configurations, grids and compsets can be found on the CESM2 Configurations and Grids page, which includes links to the configuration settings of the different components.
"},{"location":"research-software/cesm213_run/","title":"Quick Start: CESM Model Workflow (CESM 2.1.3)","text":"This is the procedure for quickly setting up and running a simple CESM2 case on ARCHER2. This document is based on the general quickstart guide for CESM 2.1, with modifications to give instructions specific to ARCHER2. For more expansive instructions on running CESM 2.1, please consult the NCAR CESM pages
Before following these instructions, ensure you have completed the setup procedure (see Setting up CESM2 on ARCHER2).
For your target case, the first step is to select a component set, and a resolution for your case. For the purposes of this guide, we will be looking at a simple coupled case using the B1850
compset and the f19_g17
resolution.
The current configuration of CESM 2.1.3 on ARCHER2 has been validated with the F2000 (atmosphere only), ETEST (slab ocean), B1850 (fully coupled) and FX2000 (WACCM-X) compsets. Instructions for these are here: CESM2.1.3 further examples.
Details of available component sets and resolutions are available from the query_config tool located in the my_cesm_sandbox/cime/scripts
directory
cd my_cesm_sandbox/cime/scripts\n./query_config --help\n
See the supported component sets, supported model resolutions and supported machines for a complete list of CESM2 supported component sets, grids and computational platforms.
Note: Variables presented as $VAR
in this guide typically refer to variables in XML files in a CESM case. From within a case directory, you can determine the value of such a variable with ./xmlquery VAR
. In some instances, $VAR
refers to a shell variable or some other variable; we try to make these exceptions clear.
There are three stages to preparing the case: create, setup and build. Here you can find information on each of these steps.
"},{"location":"research-software/cesm213_run/#1-create-a-case","title":"1. Create a case","text":"The create_newcase command creates a case directory containing the scripts and XML files to configure a case (see below) for the requested resolution, component set, and machine. create_newcase has three required arguments: --case
, --compset
and --res
(invoke create_newcase --help for help).
On machines where a project or account code is needed (including ARCHER2), you must either specify the --project
argument to create_newcase or set the $PROJECT
variable in your shell environment.
If running on a supported machine, that machine will normally be recognized automatically and therefore it is not required to specify the --machine
argument to create_newcase. For CESM 2.1.3, ARCHER2 is classed as an unsupported machine; however, the configurations for ARCHER2 are included in the version of cime downloaded in the setup process, and so adding the --machine
flag should not be necessary.
Invoke create_newcase as follows:
./create_newcase --case CASENAME --compset COMPSET --res GRID --project PROJECT\n
where:
CASENAME
defines the name of your case (stored in the $CASE
XML variable). This is a very important piece of metadata that will be used in filenames, internal metadata and directory paths. create_newcase will create the case directory with the same name as the CASENAME
. If CASENAME
is simply a name (not a path), the case directory is created in the directory where you executed create_newcase. If CASENAME
is a relative or absolute path, the case directory is created there, and the name of the case will be the last component of the path. The full path to the case directory will be stored in the $CASEROOT
XML variable. See CESM2 Experiment Casenames for details regarding CESM experiment case naming conventions.
COMPSET
is the component set.
GRID
is the model resolution.
PROJECT
is your project code on ARCHER2.
Here is an example on ARCHER2 with the CESM2 module loaded:
$CIMEROOT/scripts/create_newcase --case $CESM_ROOT/runs/b.e20.B1850.f19_g17.test --compset B1850 --res f19_g17 --project n02\n
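As noted above, when the --case argument is a path the case name is its last component. The sketch below shows this with basename. The variable example_root and its value are placeholders for illustration only and stand in for $CESM_ROOT.

```shell
# Placeholder root directory used only for illustration (stands in for $CESM_ROOT).
example_root=/work/t01/t01/auser/cesm
case_path="$example_root/runs/b.e20.B1850.f19_g17.test"

# The case name is the last component of the path given to --case.
case_name=$(basename "$case_path")
echo "$case_name"
```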
"},{"location":"research-software/cesm213_run/#2-setting-up-the-case-run-script","title":"2. Setting up the case run script","text":"Issuing the case.setup command creates scripts needed to run the model along with namelist user_nl_xxx
files, where xxx denotes the set of components for the given case configuration. Before invoking case.setup, modify the env_mach_pes.xml
file in the case directory using the xmlchange command as needed for the experiment.
cd to the case directory. Following the example from above:
cd $CESM_ROOT/runs/b.e20.B1850.f19_g17.test\n
Invoke the case.setup command.
./case.setup\n
If any changes are made to the case, case.setup can be re-run using
./case.setup --reset\n
"},{"location":"research-software/cesm213_run/#3-build-the-executable-using-the-casebuild-command","title":"3. Build the executable using the case.build command","text":"Run the build script.
./case.build\n
This build may take a while to run, and there may be periods where the build process does not seem to be doing anything. You should only cancel the build if there has been no activity by the build script after 15 minutes.
The CESM executable will appear in the directory given by the XML variable $EXEROOT
, which can be queried using:
./xmlquery EXEROOT\n
by default, this will be the bld
directory in your case directory.
If any changes are made to xml parameters that would necessitate rebuilding (see the Making Changes section below), then you can apply these by running
./case.setup --reset\n./case.build --clean-all\n./case.build\n
"},{"location":"research-software/cesm213_run/#input-data","title":"Input Data","text":"Each case of CESM will require input data, which is downloaded from UCAR servers. Input data from similar compsets is often reused, so running two similar cases may not require downloading any additional input data for the second case.
You can check to see if the required input data is already in your input data directory using
./check_input_data\n
If it is not present you can download the input data for the case prior to running the case using
./check_input_data --download\n
This can be useful for cases where a large amount of data is needed, as you can write a simple slurm script to run this download on the serial queue. Information on creating job submission scripts can be found on the ARCHER2 page on Running Jobs.
Downloading the case input data at this stage is optional, and if skipped the data will be downloaded using the login node when you run the case.submit script. This may cause the case.submit script to take a long time to complete.
An important thing to note is that your input data will be stored in your /work area, and will contribute to your storage allocation. These input files can sometimes take up a large amount of space, and so it is recommended that you do not keep any input data that is no longer needed.
"},{"location":"research-software/cesm213_run/#making-changes-to-a-case","title":"Making changes to a case","text":"After creating a new case, the CIME functions can be used to make changes to the case setup, such as changing the wallclock time, number of cores etc.
You can query settings using the xmlquery script from your case directory:
./xmlquery <name_of_setting>\n
Adding the -p
flag allows you to look up partial names, for example
$ ./xmlquery -p JOB\n\nOutput:\nResults in group case.run\n JOB_QUEUE: standard\n JOB_WALLCLOCK_TIME: 01:30:00\n\nResults in group case.st_archive\n JOB_QUEUE: short\n JOB_WALLCLOCK_TIME: 0:20:00\n
Here all parameters that match the JOB
pattern are returned. It is worth noting that the parameters JOB_QUEUE
and JOB_WALLCLOCK_TIME
are present for both the case.run job and the case.st_archive job. To view just one of these, you can use the --subgroup
flag:
$ ./xmlquery -p JOB --subgroup case.run\n\nOutput:\nResults in group case.run\n JOB_QUEUE: standard\n JOB_WALLCLOCK_TIME: 01:30:00\n
When you know which setting you want to change, you can do so using the xmlchange command
./xmlchange <name_of_setting>=<new_value>\n
For example to change the wallclock time for the case.run job to 30 minutes, without knowing the exact name, you could do
$ ./xmlquery -p WALLCLOCK\n\nOutput:\nResults in group case.run\n JOB_WALLCLOCK_TIME: 24:00:00\n\nResults in group case.st_archive\n JOB_WALLCLOCK_TIME: 0:20:00\n\n$ ./xmlchange JOB_WALLCLOCK_TIME=00:30:00 --subgroup case.run\n\n$ ./xmlquery JOB_WALLCLOCK_TIME\n\nOutput:\nResults in group case.run\n JOB_WALLCLOCK_TIME: 00:30:00\n\nResults in group case.st_archive\n JOB_WALLCLOCK_TIME: 0:20:00\n
Note: If you try to set a parameter equal to a value that is not known to the program, it might suggest using a --force
flag. This may be useful, for example, in the case of using a queue that has not been configured yet, but use with care!
Some changes to the case must be done before calling ./case.setup
or ./case.build
, otherwise the case will need to be reset or cleaned, using ./case.setup --reset
and ./case.build --clean-all
. These are as follows:
Before calling ./case.setup
, changes to NTASKS
, NTHRDS
, ROOTPE
, PSTRID
and NINST
must be made, as well as any changes to the env_mach_specific.xml
file, which contains some configuration for the module environment and environment variables.
Before calling ./case.build
, ./case.setup
must have been called and any changes to env_build.xml
and Macros.make
must have been made. This includes whether you have edited the file directly, or used ./xmlchange
to alter the variables.
Many of the namelist variables can be changed just before calling ./case.submit
.
Modify runtime settings in env_run.xml
(optional). At this point you may want to change the running parameters of your case, such as run length. By default, the model is set to run for 5 days based on the $STOP_N
and $STOP_OPTION
variables:
./xmlquery STOP_OPTION,STOP_N\n
These default settings can be useful in troubleshooting runtime problems before submitting for a longer time, but will not allow the model to run long enough to produce monthly history climatology files. In order to produce history files, increase the run length to a month or longer:
./xmlchange STOP_OPTION=nmonths,STOP_N=1\n
If you want a longer run, for example 30 years, this cannot be done in a single job as the amount of wallclock time required would be considerably longer than the maximum allowed by the ARCHER2 queue system. To do this, you would split the simulation into appropriate chunks, such as 6 chunks of 5 years (assuming a simulated years per day (SYPD) of greater than 5 - some values for SYPD on ARCHER2 are given in the further examples page). Using the $RESUBMIT
xml variable and setting the values of the $STOP_OPTION
and $STOP_N
variables accordingly you can then chain the running of these chunks:
./xmlchange RESUBMIT=6,STOP_OPTION=nyears,STOP_N=5\n
This would then run 6 resubmissions, each new job picking back up where the previous job had stopped. For more information about this, see the user guide page on running a case.
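The chunking arithmetic above can be sketched in a few lines of shell. This is only an illustration: the SYPD value of 6 is an assumed figure, and the variable names are hypothetical rather than CIME settings. The resulting hours per chunk should be compared against the maximum wallclock time of the queue you intend to use.

```shell
#!/bin/bash
# Assumed values for illustration only; measure SYPD for your own case.
total_years=30      # total simulated years wanted
chunk_years=5       # simulated years per job (STOP_N with STOP_OPTION=nyears)
sypd=6              # assumed simulated years per wallclock day

# Number of jobs needed, rounding up so a partial final chunk is counted.
chunks=$(( (total_years + chunk_years - 1) / chunk_years ))

# Wallclock hours needed for one chunk, rounded up to a whole hour.
hours_per_chunk=$(( (chunk_years * 24 + sypd - 1) / sypd ))

echo chunks=$chunks hours_per_chunk=$hours_per_chunk
```

With these assumed values each 5-year chunk needs about 20 hours of wallclock time, so JOB_WALLCLOCK_TIME for the case.run job would need to be set at least that high.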
Once you have set your job to run for the correct length of time, it is a good idea to check the correct amount of resource is available for the job. You can quickly check the job submission parameters by running
./preview_run\n
which will show you at a glance the wallclock times, job queues and the list of jobs to be submitted, as well as other parameters such as the number of MPI tasks and the number of OpenMP threads.
Submit the job to the batch queue using the case.submit command.
./case.submit\n
The case.submit script will submit a job called .case.run, and if $DOUT_S
is set to TRUE
it will also submit a short-term archiving job. By default, the queue these jobs are submitted to is the standard
queue. For information on the resources available on each queue, see the QOS guide.
Note: There is a small possibility that your job may initially fail with the error message ERROR: Undefined env var 'CESM_ROOT'
. This could have two causes: 1. You do not have the CESM2/2.1.3 module loaded. This module needs to be loaded when running the case as well as when building the case. Try running again after having run module load CESM2/2.1.3
2. This could also be due to a known issue with ARCHER2 where adding the SBATCH directive export=ALL
to a slurm script will not work (see the ARCHER2 known issues entry on the subject). The ARCHER2 configuration included in the version of cime that was downloaded during setup should apply a work-around to this, and so you should not see this error in this case. It may still occur in some corner cases however. To avoid this, ensure that the environment from which you are submitting your case has the CESM2/2.1.3 module loaded and run the case.submit script with the following command
./case.submit -a=--export=ALL\n
When the job is complete, most output will be written not under the case directory but under other directories. Review the following directories and files, whose locations can be found with xmlquery (note: xmlquery can be run with a list of comma-separated names and no spaces):
./xmlquery RUNDIR,CASE,CASEROOT,DOUT_S,DOUT_S_ROOT\n
$RUNDIR
This directory is set in the env_run.xml
file. This is the location where CESM2 was run. There should be log files there for every component (i.e. of the form cpl.log.yymmdd-hhmmss) if $DOUT_S == FALSE
. Each component writes its own log file. Also see whether any restart or history files were written. To check that a run completed successfully, check the last several lines of the cpl.log file for the string \"SUCCESSFUL TERMINATION OF CPL7-cesm\".
$DOUT_S_ROOT/$CASE
$DOUT_S_ROOT
refers to the short-term archive path location on local disk. This path is used by the case.st_archive script when $DOUT_S = TRUE
. See CESM Model Output File Locations for details regarding the component model output filenames and locations.
$DOUT_S_ROOT/$CASE
is the short-term archive directory for this case. If $DOUT_S
is FALSE, then no archive directory should exist. If $DOUT_S
is TRUE, then log, history, and restart files should have been copied into a directory tree here.
$DOUT_S_ROOT/$CASE/logs
The log files should have been copied into this directory if the run completed successfully and the short-term archiver is turned on with $DOUT_S = TRUE
. Otherwise, the log files are in the $RUNDIR
.
$CASEROOT
There could be standard out and/or standard error files output from the batch system.
$CASEROOT/CaseDocs
The case namelist files are copied into this directory from the $RUNDIR
.
$CASEROOT/timing
There should be two timing files there that summarize the model performance.
As CESM jobs are submitted to the ARCHER2 batch system, they can be monitored in the same way as other jobs, using the command
squeue -u $USER\n
You can get more details about the batch scheduler by consulting the ARCHER2 scheduling guide.
"},{"location":"research-software/cesm213_run/#archiving","title":"Archiving","text":"The CIME framework allows for short-term and long-term archiving of model output. This is particularly useful when the model is configured to output to a small storage space and large files may need to be moved during larger simulations. On ARCHER2, the model is configured to use short-term archiving, but not yet configured for long-term archiving.
Short-term archiving is on by default for compsets and can be toggled on and off by setting the DOUT_S parameter to TRUE or FALSE using the xmlchange script:
./xmlchange DOUT_S=FALSE\n
When DOUT_S=TRUE
, calling ./case.submit will automatically submit a \u201cst_archive\u201d job to the batch system that will be held in the queue until the main job is complete. This can be configured in the same way as the main job for a different queue, wallclock time, etc. One change that may be advisable to make would be to change the queue your st_archive job is submitted to, as archiving does not require a large amount of resources and the short and serial queues on ARCHER2 do not use your project allowance. This would be done using the xmlchange script almost the same as for the case.run job. Note that the main job and the archiving job share some parameter names such as JOB_QUEUE
, and so a flag (--subgroup) specifying which you want to change should be used, as below:
./xmlchange JOB_QUEUE=short --subgroup case.st_archive\n
If the --subgroup
flag is not used, then the JOB_QUEUE
value for both the case.run and case.st_archive jobs will be changed. You can verify that they are different by running
./xmlquery JOB_QUEUE\n
which will show the value of this parameter for both jobs.
The archive is set up to move .nc
files and logs from $CESM_ROOT/runs/$CASE
to $CESM_ROOT/archive/$CASE
. As such, your /work
storage quota is being used whether archiving is switched on or off, and so it would be recommended that data you wish to retain be moved to another service such as a group workspace on JASMIN. See the Data Management and Transfer guide for more information on archiving data from ARCHER2. If you want to archive your files directly to a different location than the default, this can be set using the $DOUT_S_ROOT
parameter.
If a run fails, the first place to check is the run submission output file, usually located at
$CASEROOT/run.$CASE\n
so, for the example job run in this guide, the output file will be at
$CESM_ROOT/runs/b.e20.B1850.f19_g17.test/run.b.e20.B1850.f19_g17.test\n
If any errors have occurred, the location of the relevant log in which you can examine this error will be printed towards the end of this output file. The log will usually be located at
$CASEROOT/run/cesm.log.*\n
so in this case, the path would be
$CESM_ROOT/runs/b.e20.B1850.f19_g17.test/run/cesm.log.*\n
"},{"location":"research-software/cesm213_run/#known-issues-and-common-problems","title":"Known Issues and Common Problems","text":""},{"location":"research-software/cesm213_run/#input-data-errors","title":"Input data errors","text":"Occasionally, the input data for a case is not downloaded correctly. Unfortunately, in these cases the checksum test run by the check_input_data
script will not catch the corrupted fields in the file. The error message displayed can vary somewhat, but a common error message is
ERROR timeaddmonths(): MM out of range\n
You can often spot these errors by examining the log as described above, as the error will occur shortly after a file has been read. If this happens, delete the file in question from your cesm_inputdata
directory and rerun
./check_input_data --download\n
to ensure that the data is downloaded correctly."},{"location":"research-software/cesm213_run/#sigfpe-errors","title":"SIGFPE errors","text":"If running a case with the DEBUG flag enabled, you may see some SIGFPE errors. In this case, the traceback shown in the logs will show the error as originating in one of three places:
This problem is caused by 'short-circuit' logic in the affected files, where there may be a conditional of the form
if (A .and. B) then....\n
where B cannot be properly evaluated if A fails, for example if ( x /= 0 .and. y/x > c ) then....\n
which would result in a divide-by-zero error if the second condition was evaluated after the first condition had already failed. In standard simulations, the second condition would be skipped in these cases however if the user has set
./xmlchange DEBUG=TRUE\n
then the second condition will not be skipped and a SIGFPE error will occur.
If encountering these errors, a user can do one of two things. The simplest solution is to turn off the DEBUG flag with
./xmlchange DEBUG=FALSE\n
If this option is not possible however, and your simulation absolutely needs to be run in DEBUG mode, then the conditional can be modified in the program code. THIS IS DONE AT YOUR OWN RISK!!! The fix that has been applied for the WW3 component can be seen here. It is recommended that if you are making any changes to the code for this reason, that you revert your changes back once you no longer need to run your case in DEBUG mode."},{"location":"research-software/cesm213_run/#sigsegv-errors","title":"SIGSEGV errors","text":"Sometimes an error will occur where a run is ended prematurely and gives an error of the form
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.\n
This can often be solved by increasing the amount of available memory per task, either by changing the maximum number of MPI tasks per node by using
./xmlchange MAX_TASKS_PER_NODE=64\n
or by increasing the number of threads used by using
./xmlchange NTHRDS=2\n
Either change doubles the amount of memory available to each MPI task.
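As a rough worked example of why fewer tasks per node helps, the sketch below divides node memory evenly between MPI tasks. It assumes a standard ARCHER2 compute node with 256 GB of memory, and the function name mem_per_task_gb is a hypothetical helper, not part of CIME.

```shell
#!/bin/bash
# Assumption: a standard ARCHER2 compute node with 256 GB of memory.
node_mem_gb=256

# Hypothetical helper: memory per MPI task, in GB, for a given number
# of tasks per node (integer division).
mem_per_task_gb() {
    echo $(( node_mem_gb / $1 ))
}

for tasks_per_node in 128 64; do
    echo tasks_per_node=$tasks_per_node mem_per_task_gb=$(mem_per_task_gb $tasks_per_node)
done
```

Under this assumption, going from 128 to 64 tasks per node raises the memory available to each task from about 2 GB to about 4 GB.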
"},{"location":"research-software/cesm213_run/#archiving-errors","title":"Archiving Errors","text":"When running WACCM-X cases (compsets starting FX*), there can sometimes be problems when running restart jobs. This is caused by the short-term archiving job mistakenly moving files needed for restarts to the archive. To ensure this does not happen, it can be a good idea when running WACCM-X simulations to turn off the short-term archiver using
./xmlchange DOUT_S=FALSE\n
While this behaviour has so far only been observed for WACCM-X jobs, it is possible that it can occur with other compsets.
"},{"location":"research-software/cesm213_run/#job-failing-instantly-with-undefined-environment-variable","title":"Job Failing instantly with undefined environment variable","text":"There is a small possibility that your job may initially fail with the error message
ERROR: Undefined env var 'CESM_ROOT'\n
This could have two causes: 1. You do not have the CESM2/2.1.3 module loaded. This module needs to be loaded when running the case as well as when building the case. Try running again after having run module load CESM2/2.1.3
2. This could also be due to a known issue with ARCHER2 where adding the SBATCH directive export=ALL
to a slurm script will not work (see the ARCHER2 known issues entry on the subject). The ARCHER2 configuration included in the version of cime that was downloaded during setup should apply a work-around to this, and so you should not see this error in this case. It may still occur in some corner cases however. To avoid this, ensure that the environment from which you are submitting your case has the CESM2/2.1.3 module loaded and run the case.submit script with the following command ./case.submit -a=--export=ALL\n
"},{"location":"research-software/cesm213_setup/","title":"First-Time setup of CESM 2.1.3","text":"Important
These instructions are intended for users of the n02
project. Downloads may be incomplete if you are not a member of n02
.
Due to the nature of the CESM program, a centrally installed version of the code is not provided on ARCHER2. Instead, a user needs to download and set up the program themselves in their /work
area. The installation is done in three steps:
After setup, CESM is ready to run a simple case.
"},{"location":"research-software/cesm213_setup/#downloading-cesm-213-and-setting-up-the-directory-structure","title":"Downloading CESM 2.1.3 And Setting Up The Directory Structure","text":"For ease of use, a setup script has been created which downloads CESM 2.1.3, creates the directory structure needed for running CESM2 cases and creates a hidden file in your home directory containing environment variables needed by CESM.
To execute this script, run the following in an ARCHER2 terminal:
module load cray-python\nsource /work/n02/shared/CESM2/setup_cesm213.sh\n
This script will create a directory, defaulting to /work/$GROUP/$GROUP/$USER/cesm/CESM2.1.3
, where $GROUP
is your default group, for example n02, and populate it with the following subdirectories: * archive
- short-term archiving for completed runs, * ccsm_baselines
- baseline files, * cesm_inputdata
- input data downloaded and used when running cases, * runs
- location of the case files used when running a case, * cesm directory - location of the cesm source code and the various components. Defaults to my_cesm_sandbox
The default locations for the CESM root directory and the CESM location can be overridden during installation either by entering new paths at runtime when prompted or by providing them as command line arguments, for example
source /work/n02/shared/CESM2/setup_cesm213.sh -p /work/n03/n03/$USER/CESM213 -l cesm_prog\n
"},{"location":"research-software/cesm213_setup/#manual-setup-instructions","title":"Manual setup instructions","text":"If you have trouble with running the setup script, you can install manually by running the following commands:
PREFIX=\"path/to/your/desired/cesm/root/location\"\nCESM_DIR_LOC=\"name_of_install_directory_for_cesm\"\n\nmkdir -p $PREFIX\ncd $PREFIX\nmkdir -p archive\nmkdir -p ccsm_baselines\nmkdir -p cesm_inputdata\nmkdir -p runs\n\nCESM_LOC=$PREFIX/$CESM_DIR_LOC\n\ngit clone -b release-cesm2.1.3 https://github.com/ESCOMP/CESM.git $CESM_LOC\ncd $CESM_LOC\ngit checkout release-cesm2.1.3\n\ntee ${HOME}/.cesm213 <<EOF > /dev/null\n### CESM 2.1.3 on ARCHER2 Path File\n### Do Not Edit This File Unless You Know What You Are Doing\nCIME_MODEL=cesm\nCESM_ROOT=$PREFIX\nCESM_LOC=$PREFIX/$CESM_DIR_LOC\nCIMEROOT=$PREFIX/$CESM_DIR_LOC/cime\nEOF\n\necho \"module use /work/n02/shared/CESM2/module\" >> ~/.bashrc\nmodule use /work/n02/shared/CESM2/module\nmodule load CESM2/2.1.3\n
"},{"location":"research-software/cesm213_setup/#linking-and-downloading-components","title":"Linking And Downloading Components","text":"CESM utilises multiple components, including CAM (atmosphere), CICE (sea ice), CISM (ice sheets), CTSM (land), MOSART (adaptive river transport), POP2 (ocean), RTM (river transport) and WW3 (waves), all of which are connected using the Common Infrastructure for Modelling the Earth (CIME). These components are hosted on github, and during the setup process they are downloaded.
Before downloading the external components, you must first modify the file $CESM_LOC/Externals.cfg
. This will change the version of CIME from the default cime 5.6.32 to the maintained cime 5.6 branch. This is done by modifying the file so that the cime section goes from
[cime]\ntag = cime5.6.32\nprotocol = git\nrepo_url = https://github.com/ESMCI/cime\nlocal_path = cime\nrequired = True\n
to
[cime]\nbranch = maint-5.6\nprotocol = git\nrepo_url = https://github.com/ESMCI/cime\nlocal_path = cime\nexternals = Externals_cime.cfg\nrequired = True\n
In the same $CESM_LOC/Externals.cfg
file, also update the version of CAM:
[cam]\ntag = cam_cesm2_1_rel_41\nprotocol = git\nrepo_url = https://github.com/ESCOMP/CAM\nlocal_path = components/cam\nexternals = Externals_CAM.cfg\nrequired = True\n
to
[cam]\ntag = cam_cesm2_1_rel\nprotocol = git\nrepo_url = https://github.com/ESCOMP/CAM\nlocal_path = components/cam\nexternals = Externals_CAM.cfg\nrequired = True\n
By making these changes, the configurations for ARCHER2 are brought in along with some bug fixes.
Once this has been done you are free to download the external components by executing the commands
cd $CESM_LOC\n./manage_externals/checkout_externals\n
The first time you run the checkout_externals script, you may be asked to accept a certificate, and you may also get an error of the form
svn: E120108: Error running context: The server unexpectedly closed the connection.\n
If this happens, rerun the checkout_externals script and it should download the external components correctly."},{"location":"research-software/cesm213_setup/#building-cprnc","title":"Building cprnc","text":"cprnc is a generic tool for analyzing a netcdf file or comparing two netcdf files. It is used in various places by CESM and the source is included with cime.
To build, execute the following commands
module load CESM2/2.1.3\ncd $CIMEROOT/tools/cprnc\ncmake . -DNetCDF_Fortran_LIBRARIES=libnetcdff.so -DNetCDF_C_LIBRARIES=libnetcdf.so\nmake\n
You are now ready to run a simple test case!
"},{"location":"research-software/chemshell/","title":"ChemShell","text":"ChemShell is a script-based chemistry code focusing on hybrid QM/MM calculations with support for standard quantum chemical or force field calculations. There are two versions: an older Tcl-based version Tcl-ChemShell and a more recent python-based version Py-ChemShell.
The advice from https://www.chemshell.org/licence on the difference is:
We consider Py-ChemShell 23.0 to be suitable for production calculations on both materials systems and biomolecules, and recommend that new ChemShell users should use the Python-based version.
We continue to maintain the original Tcl-based version of ChemShell and distribute it on request. Tcl-ChemShell currently contains some features that are not yet available in Py-ChemShell (but will be soon!) including a QM/MM MD driver and multiple electronic state calculations. At the present time if you need this functionality you will need to obtain a licence for Tcl-Chemshell.
"},{"location":"research-software/chemshell/#useful-links","title":"Useful Links","text":"The python-based version of ChemShell is open-source and is freely available to all users on ARCHER2. The version of Py-ChemShell pre-installed on ARCHER2 is compiled with NWChem and GULP as libraries.
Warning
Py-ChemShell on ARCHER2 is compiled with GULP 6.0. This is a licenced software that is free to use for academics. If you are not an academic user (or if you are using Py-ChemShell for non-academic work), please ensure that you have the correct GULP licence before using GULP functionalities in py-ChemShell or make sure that you are not using any of the GULP functionalities in your code (i.e., do not set theory=GULP in your calculations).
"},{"location":"research-software/chemshell/#running-parallel-py-chemshell-jobs","title":"Running parallel Py-ChemShell jobs","text":"Unlike most other ARCHER2 software packages, the Py-ChemShell module is built in such a way as to enable users to create and submit jobs to the compute nodes by running a chemsh
script from the login node rather than by creating and submitting a Slurm submission script. Below is an example command for submitting a pure MPI Py-ChemShell job running on 8 nodes (128x8 cores) with the chemsh
command:
# Run this from the login node\n module load py-chemshell\n\n # Replace [budget code] below with your project code (e.g. t01)\n chemsh --submit \\\n --jobname pychmsh \\\n --account [budget code] \\\n --partition standard \\\n --qos standard \\\n --walltime 0:10:0 \\\n --nnodes 8 \\\n --nprocs 1024 \\\n py-chemshell-job.py\n
"},{"location":"research-software/chemshell/#using-tcl-chemshell-on-archer2","title":"Using Tcl-ChemShell on ARCHER2","text":"The older version of Tcl-based ChemShell requires a license. Users with a valid license should request access via the ARCHER2 SAFE.
"},{"location":"research-software/chemshell/#running-parallel-tcl-chemshell-jobs","title":"Running parallel Tcl-ChemShell jobs","text":"The following script will run a pure MPI Tcl-based ChemShell job using 8 nodes (128x8 cores).
#!/bin/bash\n\n#SBATCH --job-name=chemshell_test\n#SBATCH --nodes=8\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load tcl-chemshell/3.7.1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread chemsh.x input.chm\n
"},{"location":"research-software/code-saturne/","title":"Code_Saturne","text":"Code_Saturne solves the Navier-Stokes equations for 2D, 2D-axisymmetric and 3D flows, steady or unsteady, laminar or turbulent, incompressible or weakly dilatable, isothermal or not, with scalar transport if required. Several turbulence models are available, from Reynolds-averaged models to large-eddy simulation (LES) models. In addition, a number of specific physical models are also available as \"modules\": gas, coal and heavy-fuel oil combustion, semi-transparent radiative transfer, particle-tracking with Lagrangian modeling, Joule effect, electrics arcs, weakly compressible flows, atmospheric flows, rotor/stator interaction for hydraulic machines.
"},{"location":"research-software/code-saturne/#useful-links","title":"Useful Links","text":"Code_Saturne is released under the GNU General Public Licence v2 and so is freely available to all users on ARCHER2.
You can load the default GCC build of Code_Saturne for use by running the following command:
module load code_saturne\n
This will load the default code_saturne/7.0.1-gcc11
module. A build using the CCE compilers, code_saturne/7.0.1-cce12
, has also been made optionally available to users on the full ARCHER2 system as testing indicates that this may provide improved performance over the GCC build.
After setting up a case it should be initialized by running the following command from the case directory, where setup.xml is the input file:
code_saturne run --initialize --param setup.xml\n
This will create a directory named for the current date and time (e.g. 20201019-1636) inside the RESU directory. Inside the new directory will be a script named run_solver. You may alter this to resemble the script below, or you may wish to simply create a new one with the contents shown.
If you wish to alter the existing run_solver script you will need to add all the #SBATCH
options shown to set the job name, size and so on. You should also add the two module
commands, and srun --distribution=block:block --hint=nomultithread
as well as the --mpi
option to the line executing ./cs_solver
to ensure parallel execution on the compute nodes. The export LD_LIBRARY_PATH=...
and cd
commands are redundant and may be retained or removed.
This script will run an MPI-only Code_Saturne job using the default GCC build and UCX over 4 nodes (128 x 4 = 512 cores) for a maximum of 20 minutes.
#!/bin/bash\n#SBATCH --export=none\n#SBATCH --job-name=CSExample\n#SBATCH --time=0:20:0\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the GCC build of Code_Saturne 7.0.1\nmodule load cpe/21.09\nmodule load PrgEnv-gnu\nmodule load code_saturne\n\n# Switch to mpich-ucx implementation (see info note below)\nmodule swap craype-network-ofi craype-network-ucx\nmodule swap cray-mpich cray-mpich-ucx\n\n# Prevent threading.\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Run solver.\nsrun --distribution=block:block --hint=nomultithread ./cs_solver --mpi $@\n
The script can then be submitted to the batch system with sbatch
.
Info
There is a known issue with the default MPI collectives which is causing performance issues on Code_Saturne. The suggested workaround is to switch to the mpich-ucx implementation. For this to link correctly on the full system, the extra cpe/21.09
and PrgEnv-gnu
modules also have to be explicitly loaded.
The latest instructions for building Code_Saturne on ARCHER2 may be found in the GitHub repository of build instructions:
CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. CP2K provides a general framework for different modelling methods such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO), and classical force fields (AMBER, CHARMM). CP2K can do simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or dimer method.
"},{"location":"research-software/cp2k/#useful-links","title":"Useful links","text":"CP2K is available through the cp2k
module. MPI only cp2k.popt
and MPI/OpenMP Hybrid cp2k.psmp
binaries are available.
For ARCHER2, CP2K has been compiled with the following optional features: FFTW
for fast Fourier transforms, libint
to enable methods including Hartree-Fock exchange, libxc
to provide a wider choice of exchange-correlation functionals, ELPA
for improved performance of matrix diagonalisation, PLUMED
to allow enhanced sampling methods.
See CP2K compile instructions for a full list of optional features.
If there is an optional feature not available, and which you would like, please contact the Service Desk. Experts may also wish to compile their own versions of the code (see below for instructions).
"},{"location":"research-software/cp2k/#running-parallel-cp2k-jobs","title":"Running parallel CP2K jobs","text":""},{"location":"research-software/cp2k/#mpi-only-jobs","title":"MPI only jobs","text":"To run CP2K using MPI only, load the cp2k
module and use the cp2k.psmp
executable.
For example, the following script will run a CP2K job using 4 nodes (128x4 cores):
#!/bin/bash\n\n# Request 4 nodes using 128 cores per node for 128 MPI tasks per node.\n\n#SBATCH --job-name=CP2K_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant CP2K module\nmodule load cp2k\n\n# Prevent threading so that cp2k.psmp runs as MPI only\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --hint=nomultithread --distribution=block:block cp2k.psmp -i MYINPUT.inp\n
"},{"location":"research-software/cp2k/#mpiopenmp-hybrid-jobs","title":"MPI/OpenMP hybrid jobs","text":"To run CP2K using MPI and OpenMP, load the cp2k
module and use the cp2k.psmp
executable.
#!/bin/bash\n\n# Request 4 nodes with 16 MPI tasks per node each using 8 threads;\n# note this means 64 MPI tasks in total (512 cores).\n# Remember to replace [budget code] below with your account code,\n# e.g. '--account=t01'.\n\n#SBATCH --job-name=CP2K_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=16\n#SBATCH --cpus-per-task=8\n#SBATCH --time=00:20:00\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant CP2K module\nmodule load cp2k\n\n# Ensure OMP_NUM_THREADS is consistent with cpus-per-task above\nexport OMP_NUM_THREADS=8\nexport OMP_PLACES=cores\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --hint=nomultithread --distribution=block:block cp2k.psmp -i MYINPUT.inp\n
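As a quick sanity check on a hybrid layout, the arithmetic behind the `#SBATCH` options can be sketched in shell. The function name `layout` is illustrative only and is not part of any ARCHER2 tooling; the values passed are the ones used in the script above (4 nodes, 16 MPI tasks per node, 8 threads per task).

```shell
#!/bin/bash
# Print total MPI processes and cores used for a given job layout.
# Arguments: nodes, MPI tasks per node, CPUs (OpenMP threads) per task.
# "layout" is a hypothetical helper name used only for this sketch.
layout() {
    local nodes=$1 tasks_per_node=$2 cpus_per_task=$3
    local ranks=$(( nodes * tasks_per_node ))
    local cores=$(( ranks * cpus_per_task ))
    local cores_per_node=$(( tasks_per_node * cpus_per_task ))
    echo "ranks=${ranks} cores=${cores} cores_per_node=${cores_per_node}"
}

layout 4 16 8
```

A fully populated ARCHER2 node should report 128 cores per node; a smaller number means the nodes are underpopulated.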
"},{"location":"research-software/cp2k/#compiling-cp2k","title":"Compiling CP2K","text":"The latest instructions for building CP2K on ARCHER2 may be found in the GitHub repository of build instructions:
CRYSTAL is a general-purpose program for the study of crystalline solids. The CRYSTAL program computes the electronic structure of periodic systems within Hartree Fock, density functional or various hybrid approximations (global, range-separated and double-hybrids). The Bloch functions of the periodic systems are expanded as linear combinations of atom centred Gaussian functions. Powerful screening techniques are used to exploit real space locality. Restricted (Closed Shell) and Unrestricted (Spin-polarized) calculations can be performed with all-electron and valence-only basis sets with effective core pseudo-potentials. The current release is CRYSTAL23.
Important
CRYSTAL is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/crystal/#useful-links","title":"Useful Links","text":"CRYSTAL is only available to users who have a valid CRYSTAL license. You request access through SAFE:
Please have your license details to hand.
"},{"location":"research-software/crystal/#running-parallel-crystal-jobs","title":"Running parallel CRYSTAL jobs","text":"The following script will run CRYSTAL using pure MPI for parallelisation using 256 MPI processes, 1 per core across 2 nodes. It assumes that the input file is tio2.d12
#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --time=0:20:00\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load crystal/23-1.0.1-2\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Change this to the name of your input file\ncp tio2.d12 INPUT\n\nsrun --hint=nomultithread --distribution=block:block MPPcrystal\n
An equivalent 2 node job using MPI+OpenMP parallelism with 4 threads per MPI process, 64 MPI processes, 1 thread per core across 2 nodes would be:
#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --time=0:20:00\n#SBATCH --ntasks-per-node=32\n#SBATCH --cpus-per-task=4\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load crystal/23-1.0.1-2\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Change this to the name of your input file\ncp tio2.d12 INPUT\n\nexport OMP_NUM_THREADS=4\nexport OMP_PLACES=cores\nexport OMP_STACKSIZE=16M\n\nsrun --hint=nomultithread --distribution=block:block MPPcrystalOMP\n
"},{"location":"research-software/crystal/#tips-and-known-issues","title":"Tips and known issues","text":""},{"location":"research-software/crystal/#cpu-frequency","title":"CPU frequency","text":"You should run some short (1 or 2 SCF cycles) jobs to test the scaling of your job so you can decide on the balance between cost to your budget and the time it takes to get a result. You should now include a few tests at different clock rates as part of this process.
Based on a few simple tests we have run, it is likely that jobs dominated by building the Kohn-Sham matrix (SHELLX+MONMO3+NUMDFT in the output) will see minimal energy savings and better performance at 2.25GHz. Jobs dominated by the ScaLAPACK calls (MPP_DIAG in the output) may show useful energy savings at 2.0GHz.
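The clock-rate tests could be scripted along the lines of the dry-run sketch below. The job script name `crystal_test.slurm` is hypothetical, and the two frequencies are simply the ones mentioned in this section, written in kHz as the Slurm `--cpu-freq` option expects. The sketch only prints the commands; remove the `echo` to submit them.

```shell
#!/bin/bash
# Hypothetical job script name; replace with your own short test job.
job_script="crystal_test.slurm"

# Print one sbatch command per clock rate (in kHz): 2.0 GHz and 2.25 GHz.
print_freq_tests() {
    local freq
    for freq in 2000000 2250000; do
        echo "sbatch --cpu-freq=${freq} ${job_script}"
    done
}

print_freq_tests
```

Comparing the run time and energy use reported for each job then shows which clock rate suits the calculation.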
"},{"location":"research-software/crystal/#out-of-memory-errors","title":"Out-of-memory errors","text":"Long-running jobs may encounter unexpected errors of the form
slurmstepd: error: Detected 1 oom-kill event(s) in step 411502.0 cgroup.\n
These are related to a memory leak in the underlying libfabric communication layer, which will be fixed in a future release. In the meantime, it should be possible to work around the problem by adding export FI_MR_CACHE_MAX_COUNT=0 \n
to the SLURM submission script."},{"location":"research-software/fhi-aims/","title":"FHI-aims","text":"FHI-aims is an all-electron electronic structure code based on numeric atom-centered orbitals. It enables first-principles simulations with very high numerical accuracy for production calculations, with excellent scalability up to very large system sizes (thousands of atoms) and up to very large, massively parallel supercomputers (ten thousand CPU cores).
"},{"location":"research-software/fhi-aims/#useful-links","title":"Useful Links","text":"FHI-aims is only available to users who have a valid FHI-aims licence.
If you have a FHI-aims licence and wish to have access to FHI-aims on ARCHER2, please make a request via the SAFE, see:
Please have your licence details to hand.
"},{"location":"research-software/fhi-aims/#running-parallel-fhi-aims-jobs","title":"Running parallel FHI-aims jobs","text":"The following script will run a FHI-aims job using 8 nodes (1024 cores). The script assumes that the input files have the default names control.in
and geometry.in
.
#!/bin/bash\n\n# Request 8 nodes with 128 MPI tasks per node for 20 minutes\n#SBATCH --job-name=FHI-aims\n#SBATCH --nodes=8\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the FHI-aims module, avoid any unintentional OpenMP threading by\n# setting OMP_NUM_THREADS, and launch the code.\nmodule load fhiaims\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport OMP_NUM_THREADS=1\nsrun --distribution=block:block --hint=nomultithread aims.mpi.x\n
"},{"location":"research-software/fhi-aims/#compiling-fhi-aims","title":"Compiling FHI-aims","text":"The latest instructions for building FHI-aims on ARCHER2 may be found in the GitHub repository of build instructions:
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.
"},{"location":"research-software/gromacs/#useful-links","title":"Useful Links","text":"GROMACS is Open Source software and is freely available to all users. Three executable versions are available on the normal (CPU-only) modules:
gmx_mpi
gmx_mpi_d
gmx
We also provide a GPU version of GROMACS that runs on the MI210 GPU nodes. The module is named gromacs/2022.4-GPU
and can be loaded with
module load gromacs/2022.4-GPU\n
Important
The gromacs
modules reset the CPU frequency to the highest possible value (2.25 GHz) as this generally achieves the best balance of performance to energy use. You can change this setting by following the instructions in the Energy use section of the User Guide.
The following script will run a GROMACS MD job using 4 nodes (128x4 cores) with pure MPI.
#!/bin/bash\n\n#SBATCH --job-name=mdrun_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Setup the environment\nmodule load gromacs\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport OMP_NUM_THREADS=1 \nsrun --distribution=block:block --hint=nomultithread gmx_mpi mdrun -s test_calc.tpr\n
"},{"location":"research-software/gromacs/#running-hybrid-mpiopenmp-jobs","title":"Running hybrid MPI/OpenMP jobs","text":"The following script will run a GROMACS MD job using 4 nodes (128x4 cores) with 16 MPI processes per node (64 MPI processes in total) and 8 OpenMP threads per MPI process.
#!/bin/bash\n#SBATCH --job-name=mdrun_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=16\n#SBATCH --cpus-per-task=8\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Setup the environment\nmodule load gromacs\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport OMP_NUM_THREADS=8\nsrun --distribution=block:block --hint=nomultithread gmx_mpi mdrun -s test_calc.tpr\n
"},{"location":"research-software/gromacs/#running-gromacs-on-the-amd-mi210-gpus","title":"Running GROMACS on the AMD MI210 GPUs","text":"The following script will run a GROMACS MD job using 1 GPU with 1 MPI process and 8 OpenMP threads per MPI process.
#!/bin/bash\n#SBATCH --job-name=mdrun_gpu\n#SBATCH --gpus=1\n#SBATCH --time=00:20:00\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-shd # or gpu-exc\n\n# Setup the environment\nmodule load gromacs/2022.4-GPU\n\nexport OMP_NUM_THREADS=8\nsrun --ntasks=1 --cpus-per-task=8 gmx_mpi mdrun -ntomp 8 --noconfout -s calc.tpr\n
"},{"location":"research-software/gromacs/#compiling-gromacs","title":"Compiling Gromacs","text":"The latest instructions for building GROMACS on ARCHER2 may be found in the GitHub repository of build instructions:
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics code. LAMMPS has potentials for solid-state materials (metals, semiconductors) and soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, mesoscopic, or continuum scale.
"},{"location":"research-software/lammps/#useful-links","title":"Useful Links","text":"LAMMPS is freely available to all ARCHER2 users.
The centrally installed version of LAMMPS is compiled with all the standard packages included: ASPHERE
, BODY
, CLASS2
, COLLOID
, COMPRESS
, CORESHELL
, DIPOLE
, GRANULAR
, KSPACE
, MANYBODY
, MC
, MISC
, MOLECULE
, OPT
, PERI
, QEQ
, REPLICA
, RIGID
, SHOCK
, SNAP
, SRD
.
We do not install any USER
packages. If you are interested in a USER
package, we would encourage you to try to compile your own version and we can help out if necessary (see below).
Important
The lammps
modules reset the CPU frequency to the highest possible value (2.25 GHz) as this generally achieves the best balance of performance to energy use. You can change this setting by following the instructions in the Energy use section of the User Guide.
LAMMPS can exploit multiple nodes on ARCHER2 and will generally be run in exclusive mode using more than one node.
For example, the following script will run a LAMMPS MD job using 4 nodes (128x4 cores) with MPI only.
#!/bin/bash\n\n#SBATCH --job-name=lammps_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load lammps\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread lmp -i in.test -l out.test\n
"},{"location":"research-software/lammps/#compiling-lammps","title":"Compiling LAMMPS","text":"The large range of optional packages available for LAMMPS, and opportunity for extensibility, may mean that it is convenient for users to compile their own copy. In practice, LAMMPS is relatively easy to compile, so we encourage users to have a go.
Compilation instructions for LAMMPS on ARCHER2 can be found on GitHub:
The Massachusetts Institute of Technology General Circulation Model (MITgcm) is a numerical model designed for study of the atmosphere, ocean, and climate. MITgcm's flexible non-hydrostatic formulation enables it to simulate fluid phenomena over a wide range of scales; its adjoint capabilities enable it to be applied to sensitivity questions and to parameter and state estimation problems. By employing fluid equation isomorphisms, a single dynamical kernel can be used to simulate flow of both the atmosphere and ocean.
"},{"location":"research-software/mitgcm/#useful-links","title":"Useful Links","text":"MITgcm is not available via a module on ARCHER2 as users will build their own executables specific to the problem they are working on.
You can obtain the MITgcm source code from the developers by cloning from the GitHub repository with the command
git clone https://github.com/MITgcm/MITgcm.git\n
You should then copy the ARCHER2 optfile into the MITgcm directories.
Warning
A current ARCHER2 optfile is not available at the present time. Please contact support@archer2.ac.uk
for help.
You should also set the following environment variables. MITGCM_ROOTDIR
is used to locate the source code and should point to the top MITgcm directory. Optionally, adding the MITgcm tools directory to your PATH
environment variable makes it easier to use tools such as genmake2
, and the MITGCM_OPT
environment variable makes it easier to pass the optfile to genmake2
.
export MITGCM_ROOTDIR=/path/to/MITgcm\nexport PATH=$MITGCM_ROOTDIR/tools:$PATH\nexport MITGCM_OPT=$MITGCM_ROOTDIR/tools/build_options/dev_linux_amd64_cray_archer2\n
When using genmake2
to create the Makefile, you will need to specify the optfile to use. Other commonly used options might be to use extra source code with the -mods
option, to enable MPI with -mpi
, and to enable OpenMP with -omp
. You might then run a command that resembles the following:
genmake2 -mods /path/to/additional/source -mpi -optfile $MITGCM_OPT\n
You can read about the full set of options available to genmake2
by running
genmake2 -help\n
Finally, you may then build your executable by running
make depend\nmake\n
"},{"location":"research-software/mitgcm/#running-mitgcm-on-archer2","title":"Running MITgcm on ARCHER2","text":""},{"location":"research-software/mitgcm/#pure-mpi","title":"Pure MPI","text":"Once you have built your executable you can write a script like the following which will allow it to run on the ARCHER2 compute nodes. This example would run a pure MPI MITgcm simulation over 2 nodes of 128 cores each for up to one hour.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MITgcm-simulation\n#SBATCH --time=1:0:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 256 MPI processes and 128 MPI processes per node\n# srun picks up the distribution from the sbatch options\nsrun --distribution=block:block --hint=nomultithread ./mitgcmuv\n
"},{"location":"research-software/mitgcm/#hybrid-openmp-mpi","title":"Hybrid OpenMP & MPI","text":"Warning
Running the model in hybrid mode may lead to performance decreases as well as increases. You should be sure to profile your code both as a pure MPI application and as a hybrid OpenMP-MPI application to ensure you are making efficient use of resources. Be sure to read both the ARCHER2 advice on OpenMP and the MITgcm documentation first.
Note
Early versions of the ARCHER2 MITgcm optfile do not contain an OMPFLAG
. Please ensure you have an up to date copy of the optfile before attempting to compile OpenMP enabled codes.
Depending upon your model setup, you may wish to run the MITgcm code as a hybrid OpenMP-MPI application. In terms of compiling the model, this is as simple as using the flag -omp
when calling genmake2
, and updating your SIZE.h
file to have multiple tiles per process.
The model can be run using a slurm job submission script similar to that shown below. This example will run MITgcm across 2 nodes, with each node using 16 MPI processes, and each process using 4 threads. Note that this would underpopulate the nodes \u2014 i.e. we will only be using 128 of the 256 cores available to us. This can also sometimes lead to performance increases.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MITgcm-hybrid-simulation\n#SBATCH --time=1:0:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=16\n#SBATCH --cpus-per-task=4\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Set the number of OpenMP threads per MPI process\n# (this must match --cpus-per-task above).\nexport OMP_NUM_THREADS=4 # Set to number of threads per process\nexport OMP_PLACES=\"cores(128)\" # Set to total number of threads\nexport OMP_PROC_BIND=true # Required if we want to underpopulate nodes\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 32 MPI processes and 16 MPI processes per node, 4 threads each\n# srun picks up the distribution from the sbatch options\nsrun --distribution=block:block --hint=nomultithread ./mitgcmuv\n
Finally, remember to update the eedata
file in the model's run directory so that the number of threads requested there matches the number requested in the job submission script.
The ECCO version 4 state estimate (ECCOv4-r4) is an observationally-constrained numerical solution produced by the ECCO group at JPL. If you would like to reproduce the state estimate on ARCHER2 in order to create customised runs and experiments, follow the instructions below. They have been slightly modified from the JPL instructions for ARCHER2.
For more information, see the ECCOv4-r4 website https://ecco-group.org/products-ECCO-V4r4.htm
"},{"location":"research-software/mitgcm/#get-the-eccov4-r4-source-code","title":"Get the ECCOv4-r4 source code","text":"First, navigate to your directory on the /work
filesystem in order to get access to the compute nodes. Next, create a working directory, perhaps MYECCO, and navigate into this working directory:
mkdir MYECCO\ncd MYECCO\n
In order to reproduce ECCOv4-r4, we need a specific checkpoint of the MITgcm source code.
git clone https://github.com/MITgcm/MITgcm.git -b checkpoint66g\n
Next, get the ECCOv4-r4 specific code from GitHub:
cd MITgcm\nmkdir -p ECCOV4/release4\ncd ECCOV4/release4\ngit clone https://github.com/ECCO-GROUP/ECCO-v4-Configurations.git\nmv ECCO-v4-Configurations/ECCOv4\\ Release\\ 4/code .\nrm -rf ECCO-v4-Configurations\n
"},{"location":"research-software/mitgcm/#get-the-eccov4-r4-forcing-files","title":"Get the ECCOv4-r4 forcing files","text":"The surface forcing and other input files that are too large to be stored on GitHub are available via NASA data servers. In total, these files are about 200 GB in size. You must register for an Earthdata account and connect to a WebDAV server in order to access these files. For more detailed instructions, read the help page https://ecco.jpl.nasa.gov/drive/help.
First, apply for an Earthdata account: https://urs.earthdata.nasa.gov/users/new
Next, acquire your WebDAV credentials: https://ecco.jpl.nasa.gov/drive (second box from the top)
Now, you can use wget to download the required forcing and input files:
wget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_forcing\nwget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_init\nwget -r --no-parent --user YOURUSERNAME --ask-password https://ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_ecco\n
After using wget
, you will notice that the input*
directories are, by default, several levels deep in the directory structure. Use the mv
command to move the input*
directories to the directory where you executed the wget
command. Specifically,
mv ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_forcing/ .\nmv ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_init/ .\nmv ecco.jpl.nasa.gov/drive/files/Version4/Release4/input_ecco/ .\nrm -rf ecco.jpl.nasa.gov\n
"},{"location":"research-software/mitgcm/#compiling-and-running-eccov4-r4","title":"Compiling and running ECCOv4-r4","text":"The steps for building the ECCOv4-r4 instance of MITgcm are very similar to those for other build cases. First, you will need to create a build directory:
cd MITgcm/ECCOV4/release4\nmkdir build\ncd build\n
Load the NetCDF modules:
module load cray-hdf5\nmodule load cray-netcdf\n
If you haven't already, set your environment variables:
export MITGCM_ROOTDIR=../../../../MITgcm\nexport PATH=$MITGCM_ROOTDIR/tools:$PATH\nexport MITGCM_OPT=$MITGCM_ROOTDIR/tools/build_options/dev_linux_amd64_cray_archer2\n
Next, compile the executable:
genmake2 -mods ../code -mpi -optfile $MITGCM_OPT\nmake depend\nmake\n
Once you have compiled the model, you will have the mitgcmuv executable for ECCOv4-r4.
"},{"location":"research-software/mitgcm/#create-run-directory-and-link-files","title":"Create run directory and link files","text":"In order to run the model, you need to create a run directory and link/copy the appropriate files. First, navigate to your directory on the work
filesystem. From the MITgcm/ECCOV4/release4
directory:
mkdir run\ncd run\n\n# link the data files\nln -s ../input_init/NAMELIST/* .\nln -s ../input_init/error_weight/ctrl_weight/* .\nln -s ../input_init/error_weight/data_error/* .\nln -s ../input_init/* .\nln -s ../input_init/tools/* .\nln -s ../input_ecco/*/* .\nln -s ../input_forcing/eccov4r4* .\n\npython mkdir_subdir_diags.py\n\n# manually copy the mitgcmuv executable\ncp -p ../build/mitgcmuv .\n
For a short test run, edit the nTimeSteps
variable in the file data
. Comment out the default value and uncomment the line reading nTimeSteps=8
. This is a useful test to make sure that the model can at least start up.
To run on ARCHER2, submit a batch script to the Slurm scheduler. Here is an example submission script:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=ECCOv4r4-test\n#SBATCH --time=1:0:0\n#SBATCH --nodes=8\n#SBATCH --ntasks-per-node=12\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# For adjoint runs the default cpu-freq is a lot slower\n#SBATCH --cpu-freq=2250000\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 96 MPI processes and 12 MPI processes per node\n# srun picks up the distribution from the sbatch options\nsrun --distribution=block:block --hint=nomultithread ./mitgcmuv\n
This configuration uses 96 MPI processes at 12 MPI processes per node. Once the run has finished, in order to check that the run has successfully completed, check the end of one of the standard output files.
tail STDOUT.0000\n
It should read
PROGRAM MAIN: Execution ended Normally\n
The files named STDOUT.*
contain diagnostic information that you can use to check your results. As a first pass, check the printed statistics for any clear signs of trouble (e.g. NaN values, extremely large values).
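The first-pass checks described above can be sketched as a small shell helper. The function name `check_stdout` is illustrative only; the sketch assumes that NaN values appear literally as `NaN` in the output file and, following the text, checks only `STDOUT.0000` by default.

```shell
#!/bin/bash
# Return 0 if an MITgcm standard output file reports a normal end and
# contains no NaN values; return 1 otherwise.
# "check_stdout" is a hypothetical helper name used only for this sketch.
check_stdout() {
    local file=$1
    if ! grep -q "PROGRAM MAIN: Execution ended Normally" "$file"; then
        echo "$file: run did not end normally"
        return 1
    fi
    if grep -q "NaN" "$file"; then
        echo "$file: NaN values found"
        return 1
    fi
    echo "$file: OK"
}

# Check the first standard output file if it exists in this directory.
if [ -e STDOUT.0000 ]; then
    check_stdout STDOUT.0000
fi
```

This does not detect extremely large values; those still need to be checked by eye in the printed statistics.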
If you have access to the commercial TAF software produced by http://FastOpt.de, then you can compile and run the ECCOv4-r4 instance of MITgcm in adjoint mode. This mode is useful for comprehensive sensitivity studies and for constructing state estimates. From the MITgcm/ECCOV4/release4
directory, create a new code directory and a new build directory:
mkdir code_ad\ncd code_ad\nln -s ../code/* .\ncd ..\nmkdir build_ad\ncd build_ad\n
In this instance, the code_ad
and code
directories are identical, although this does not have to be the case. Make sure that you have the staf
script in your path or in the build_ad
directory itself. To make sure that you have the most up-to-date script, run:
./staf -get staf\n
To test your connection to the FastOpt servers, try:
./staf -test\n
You should receive the following message:
Your access to the TAF server is enabled.\n
The compilation commands are similar to those used to build the forward case.
# load relevant modules\nmodule load cray-netcdf-hdf5parallel\nmodule load cray-hdf5-parallel\n\n# compile adjoint model\n# (replace (PATH_TO_OPTFILE) with the path to your optfile)\n../../../../MITgcm/tools/genmake2 -ieee -mpi -mods=../code_ad -of=(PATH_TO_OPTFILE)\nmake depend\nmake adtaf\nmake adall\n
The source code will be packaged and forwarded to the FastOpt servers, where it will undergo source-to-source translation via the TAF algorithmic differentiation software. If the compilation is successful, you will have an executable named mitgcmuv_ad
. This will run the ECCOv4-r4 configuration of MITgcm in adjoint mode. As before, create a run directory and copy in the relevant files. The procedure is the same as for the forward model, with the following modifications:
cd ..\nmkdir run_ad\ncd run_ad\n# manually copy the mitgcmuv_ad executable\ncp -p ../build_ad/mitgcmuv_ad .\n
To run the model, change the name of the executable in the Slurm submission script; everything else should be the same as in the forward case. As above, at the end of the run you should have a set of STDOUT.*
files that you can examine for any obvious problems.
If TAF compilation fails with an error like failed to convert GOTPCREL relocation; relink with --no-relax
then add the following line to the FFLAGS options: -Wl,--no-relax
.
In an adjoint run, there is a balance between storage (i.e. saving the model state to disk) and recomputation (i.e. integrating the model forward from a stored state). Changing the nchklev
parameters in the tamc.h
file at compile time is how you control the relative balance between storage and recomputation.
A suggested strategy that has been used on a variety of HPC platforms is as follows: 1. Set nchklev_1
as large as possible, up to the size allowed by memory on your machine. (Use the size
command to estimate the memory per process. This should be just a little bit less than the maximum allowed on the machine. On ARCHER2 this is 2 GB (standard) and 4 GB (high memory)). 2. Next, set nchklev_2
and nchklev_3
to be large enough to accommodate the entire run. A common strategy is to set nchklev_2 = nchklev_3 = sqrt(numsteps/nchklev_1) + 1
. 3. If the nchklev_2
files get too big, then you may have to add a fourth level (i.e. nchklev_4
), but this is unlikely.
This strategy allows you to keep as much in memory as possible, minimising the I/O requirements for the disk. This is useful, as I/O is often the bottleneck for MITgcm runs on HPC.
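The rule of thumb in step 2 can be worked through with a small shell sketch. The function name and the example values (10000 time steps, `nchklev_1` of 100) are purely illustrative, and truncating the square root to an integer before adding 1 is an assumption about how the formula is meant to be rounded.

```shell
#!/bin/bash
# Suggested value for nchklev_2 and nchklev_3:
#   int(sqrt(numsteps / nchklev_1)) + 1
# Arguments: total number of time steps, nchklev_1.
# "nchklev_23" is a hypothetical helper name used only for this sketch.
nchklev_23() {
    awk -v n="$1" -v k="$2" 'BEGIN { print int(sqrt(n / k)) + 1 }'
}

numsteps=10000   # hypothetical run length in time steps
nchklev_1=100    # hypothetical first-level size
n23=$(nchklev_23 "$numsteps" "$nchklev_1")
echo "nchklev_2 = nchklev_3 = ${n23}"

# The three levels together must be able to cover the whole run.
if [ $(( nchklev_1 * n23 * n23 )) -ge "$numsteps" ]; then
    echo "checkpoint levels cover all ${numsteps} steps"
fi
```

The resulting values are then set in `tamc.h` before compiling.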
Another way to adjust performance is to adjust how tapelevel I/O is handled. This strategy performs well for most configurations:
C o tape settings\n#define ALLOW_AUTODIFF_WHTAPEIO\n#define AUTODIFF_USE_OLDSTORE_2D\n#define AUTODIFF_USE_OLDSTORE_3D\n#define EXCLUDE_WHIO_GLOBUFF_2D\n#define ALLOW_INIT_WHTAPEIO\n
"},{"location":"research-software/mo-unified-model/","title":"Met Office Unified Model","text":"The Met Office Unified Model (\"the UM\") is a numerical model of the atmosphere used for both weather and climate applications. It is often coupled to the NEMO ocean model using the OASIS coupling framework to provide a full Earth system model.
"},{"location":"research-software/mo-unified-model/#useful-links","title":"Useful Links","text":"Information on using the UM is provided by the NCAS Computational Modelling Service (CMS).
"},{"location":"research-software/namd/","title":"NAMD","text":"NAMD is an award-winning parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
"},{"location":"research-software/namd/#useful-links","title":"Useful Links","text":"NAMD is freely available to all ARCHER2 users.
ARCHER2 has two versions of NAMD available: no-SMP (namd/2.14-nosmp
) or SMP (namd/2.14
). The SMP (Shared Memory Parallelism) build of NAMD introduces threaded parallelism to address memory limitations. The no-SMP build will typically provide the best performance but most users will require SMP in order to cope with high memory requirements.
Important
The namd
modules reset the CPU frequency to the highest possible value (2.25 GHz) as this generally achieves the best balance of performance to energy use. You can change this setting by following the instructions in the Energy use section of the User Guide.
Using no-SMP NAMD will run jobs with only MPI processes and will not introduce additional threaded parallelism. This is the simplest approach to running NAMD jobs and is likely to give the best performance unless simulations are limited by high memory requirements.
The following script will run a pure MPI NAMD MD job using 4 nodes (i.e. 128x4 = 512 MPI parallel processes).
#!/bin/bash\n\n# Request four nodes to run a job of 512 MPI tasks with 128 MPI\n# tasks per node, here for maximum time 20 minutes.\n\n#SBATCH --job-name=namd-nosmp\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load namd/2.14-nosmp\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread namd2 input.namd\n
"},{"location":"research-software/namd/#running-smp-namd-jobs","title":"Running SMP NAMD jobs","text":"If your job runs out of memory, then using the SMP version of NAMD will reduce the memory requirements. This involves launching a combination of MPI processes for communication and worker threads which perform computation.
The following script will run an SMP NAMD MD job using 4 nodes with 32 MPI communication processes per node and 4 cores per communication process, of which 3 run worker threads (i.e. fully occupied nodes with all 512 cores in use).
#!/bin/bash\n#SBATCH --job-name=namd-smp\n#SBATCH --ntasks-per-node=32\n#SBATCH --cpus-per-task=4\n#SBATCH --nodes=4\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant modules\nmodule load namd\n\n# Set procs per node (PPN) & OMP_NUM_THREADS\nexport PPN=$(($SLURM_CPUS_PER_TASK-1))\nexport OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK\nexport OMP_PLACES=cores\n\n# Record PPN in the output file\necho \"Number of worker threads PPN = $PPN\"\n\n# Run NAMD\nsrun --distribution=block:block --hint=nomultithread namd2 +setcpuaffinity +ppn $PPN input.namd\n
Important
Please do not set SRUN_CPUS_PER_TASK
when running the SMP version of NAMD. Otherwise, Charm++ will be unable to pin processes to CPUs, causing NAMD to abort with errors such as Couldn't bind to cpuset 0x00000010,,,0x0: Invalid argument
.
How do I choose the optimal numbers of MPI processes and worker threads for my simulations? The optimal choice for the numbers of MPI processes and worker threads per node depends on the data set and the number of compute nodes. Before running large production jobs, it is worth experimenting with these parameters to find the optimal configuration for your simulation.
We recommend that users match the ARCHER2 NUMA architecture to find the optimal balance of thread and process parallelism. The NUMA levels on ARCHER2 compute nodes are: 4 cores per CCX, 8 cores per CCD, 16 cores per memory controller, 64 cores per socket. For example, the above submission script specifies 32 MPI communication processes per node and 4 cores per communication process, which places 1 MPI process per CCX on each node.
Note
To ensure fully occupied nodes with the SMP build of NAMD and match the NUMA layout, the optimal values of (tasks-per-node
, cpus-per-task
) are likely to be (32,4), (16,8) or (8,16).
How do I choose a value for the +ppn flag? The number of workers per communication process is specified by the +ppn argument to NAMD, which is set here to equal cpus-per-task - 1, to leave a CPU-core free for the associated MPI process.
We recommend that users reserve a thread per process to improve the scalability. Reserving this thread on a many-cores-per-node architecture like ARCHER2 will reduce the communication between threads and improve the scalability.
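The NUMA-matched layouts and the +ppn rule described above can be tabulated with a few lines of shell. This is only a sketch: the function name namd_smp_layout and the variable CORES_PER_NODE are illustrative and are not provided by NAMD or by the namd modules; the figure of 128 cores per node is taken from the submission scripts above.

```shell
#!/bin/bash
# Illustrative helper only (hypothetical name, not part of NAMD or ARCHER2).
CORES_PER_NODE=128   # cores per ARCHER2 compute node, as used in the scripts above

# Print "tasks-per-node cpus-per-task ppn" for a given cpus-per-task value.
namd_smp_layout() {
    local cpus_per_task=$1
    local tasks_per_node=$(( CORES_PER_NODE / cpus_per_task ))
    local ppn=$(( cpus_per_task - 1 ))   # one core is left for the communication thread
    echo "${tasks_per_node} ${cpus_per_task} ${ppn}"
}

# One process per CCX (4 cores), per CCD (8) or per memory controller (16).
for c in 4 8 16; do
    namd_smp_layout "$c"
done
```

Each output line gives the values to use for --ntasks-per-node, --cpus-per-task and +ppn respectively; which of the three layouts performs best still has to be found by experiment.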
"},{"location":"research-software/namd/#compiling-namd","title":"Compiling NAMD","text":"The latest instructions for building NAMD on ARCHER2 may be found in the GitHub repository of build instructions.
ARCHER2 Full System
"},{"location":"research-software/nektarplusplus/","title":"Nektar++","text":"Nektar++ is a tensor product based finite element package designed to allow one to construct efficient classical low polynomial order h-type solvers (where h is the size of the finite element) as well as higher p-order piecewise polynomial order solvers.
The Nektar++ framework comes with a number of solvers and also allows one to construct a variety of new solvers. Users can therefore use Nektar++ just to run simulations, or to extend and/or develop new functionality.
"},{"location":"research-software/nektarplusplus/#useful-links","title":"Useful Links","text":"Nektar++ is released under an MIT license and is available to all users on the ARCHER2 full system.
"},{"location":"research-software/nektarplusplus/#where-can-i-get-help","title":"Where can I get help?","text":"Specific issues with Nektar++ itself might be submitted to the issue tracker at the Nektar++ gitlab repository (see link above). More general questions might also be directed to the Nektar-users mailing list. Issues specific to the use or behaviour of Nektar++ on ARCHER2 should be sent to the Service Desk.
"},{"location":"research-software/nektarplusplus/#running-parallel-nektar-jobs","title":"Running parallel Nektar++ jobs","text":"Below is the submission script for running the Taylor-Green Vortex, one of the Nektar++ tutorials, see https://doc.nektar.info/tutorials/latest/incns/taylor-green-vortex/incns-taylor-green-vortex.html#incns-taylor-green-vortexch4.html .
You first need to download the archive linked on the tutorial page.
cd /path/to/work/dir\nwget https://doc.nektar.info/tutorials/latest/incns/taylor-green-vortex/incns-taylor-green-vortex.tar.gz\ntar -xvzf incns-taylor-green-vortex.tar.gz\n
#!/bin/bash\n#SBATCH --job-name=nektar\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=32\n#SBATCH --cpus-per-task=1\n#SBATCH --time=02:00:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load nektar\n\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nNEK_INPUT_PATH=/path/to/work/dir/incns-taylor-green-vortex/completed/solver64\n\nsrun --distribution=block:cyclic --hint=nomultithread \\\n ${NEK_DIR}/bin/IncNavierStokesSolver \\\n ${NEK_INPUT_PATH}/TGV64_mesh.xml \\\n ${NEK_INPUT_PATH}/TGV64_conditions.xml\n
"},{"location":"research-software/nektarplusplus/#compiling-nektar","title":"Compiling Nektar++","text":"Instructions for building Nektar++ on ARCHER2 may be found in the GitHub repository of build instructions:
The Nektar++ team have themselves also provided detailed instructions on the build process, updated following the mid-2023 system update, on the Nektar++ website:
This page also provides instructions on how to run jobs using your local installation.
"},{"location":"research-software/nemo/","title":"NEMO","text":"NEMO (Nucleus for European Modelling of the Ocean) is a state-of-the-art framework for research activities and forecasting services in ocean and climate sciences, developed in a sustainable way by a European consortium.
"},{"location":"research-software/nemo/#useful-links","title":"Useful Links","text":"NEMO is released under a CeCILL license and is freely available to all users on ARCHER2.
"},{"location":"research-software/nemo/#compiling-nemo","title":"Compiling NEMO","text":"A central install of NEMO is not appropriate for most users of ARCHER2 since many configurations will want to add bespoke code changes.
The latest instructions for building NEMO on ARCHER2 are found in the Github repository of build instructions:
Typical NEMO production runs perform significant I/O management to handle the very large volumes of data associated with ocean modelling. To address this, NEMO ocean clients are interfaced with XIOS I/O servers. XIOS is a library which manages NetCDF outputs for climate models. NEMO uses XIOS to simplify the I/O management and introduce dedicated processors to manage large volumes of data.
Users can choose to run NEMO in attached or detached mode:
- In attached mode each processor acts as both an ocean client and an I/O-server process.
- In detached mode ocean clients and external XIOS I/O-server processors are separately defined.
Running NEMO in attached mode can be done with a simple submission script specifying both the NEMO and XIOS executable to srun
. However, typical production runs of NEMO will perform significant I/O management and will be unable to run in attached mode.
Detached mode introduces external XIOS I/O-servers to help manage the large volumes of data. This requires users to specify the placement of clients and servers on different cores throughout the node using the --cpu-bind=map_cpu:<cpu map>
srun option to define a CPU map or mask. It is tedious to construct these maps by hand. Instead, Andrew Coward provides a tool to aid users in the construction of submission scripts:
/work/n01/shared/nemo/mkslurm_hetjob\n/work/n01/shared/nemo/mkslurm_hetjob_Gnu\n
Usage of the script:
usage: mkslurm_hetjob [-h] [-S S] [-s S] [-m M] [-C C] [-g G] [-N N] [-t T]\n [-a A] [-j J] [-v]\n\nPython version of mkslurm_alt by Andrew Coward using HetJob. Server placement\nand spacing remains as mkslurm but clients are always tightly packed with a\ngap left every \"NC_GAP\" cores where NC_GAP can be given by the -g argument.\nvalues of 4, 8 or 16 are recommended.\n\noptional arguments:\n -h, --help show this help message and exit\n -S S num_servers (default: 4)\n -s S server_spacing (default: 8)\n -m M max_servers_per_node (default: 2)\n -C C num_clients (default: 28)\n -g G client_gap_interval (default: 4)\n -N N ncores_per_node (default: 128)\n -t T time_limit (default: 00:10:00)\n -a A account (default: n01)\n -j J job_name (default: nemo_test)\n -v show human readable hetjobs (default: False)\n
Note
We recommend that you retain your own copy of this script as it is not directly provided by the ARCHER2 CSE team and is subject to change. Once obtained, you can set your own defaults for options in the script.
For example, to run with 4 XIOS I/O-servers (a maximum of 2 per node), each with sole occupancy of a 16-core NUMA region, and 96 ocean cores spaced with an idle core between each, use:
./mkslurm_hetjob -S 4 -s 16 -m 2 -C 96 -g 2 > myscript.slurm\n\nINFO:root:Running mkslurm_hetjob -S 4 -s 16 -m 2 -C 96 -g 2 -N 128 -t 00:10:00 -a n01 -j nemo_test -v False\nINFO:root:nodes needed= 2 (256)\nINFO:root:cores to be used= 100 (256)\n
This has reported that 2 nodes are needed with 100 active cores spread over 256 cores. This will also have produced a submission script \"myscript.slurm\":
#!/bin/bash\n#SBATCH --job-name=nemo_test\n#SBATCH --time=00:10:00\n#SBATCH --account=n01\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-core=1\n\n# Created by: mkslurm_hetjob -S 4 -s 16 -m 2 -C 96 -g 2 -N 128 -t 00:10:00 -a n01 -j nemo_test -v False\nmodule swap craype-network-ofi craype-network-ucx\nmodule swap cray-mpich cray-mpich-ucx\nmodule load cray-hdf5-parallel/1.12.0.7\nmodule load cray-netcdf-hdf5parallel/4.7.4.7\nexport OMP_NUM_THREADS=1\n\ncat > myscript_wrapper.sh << EOFB\n#!/bin/ksh\n#\nset -A map ./xios_server.exe ./nemo\nexec_map=( 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 )\n#\nexec \\${map[\\${exec_map[\\$SLURM_PROCID]}]}\n##\nEOFB\nchmod u+x ./myscript_wrapper.sh\n\nsrun --mem-bind=local \\\n--ntasks=100 --ntasks-per-node=50 --cpu-bind=v,mask_cpu:0x1,0x10000,0x100000000,0x400000000,0x1000000000,0x4000000000,0x10000000000,0x40000000000,0x100000000000,0x400000000000,0x1000000000000,0x4000000000000,0x10000000000000,0x40000000000000,0x100000000000000,0x400000000000000,0x1000000000000000,0x4000000000000000,0x10000000000000000,0x40000000000000000,0x100000000000000000,0x400000000000000000,0x1000000000000000000,0x4000000000000000000,0x10000000000000000000,0x40000000000000000000,0x100000000000000000000,0x400000000000000000000,0x1000000000000000000000,0x4000000000000000000000,0x10000000000000000000000,0x40000000000000000000000,0x100000000000000000000000,0x400000000000000000000000,0x1000000000000000000000000,0x4000000000000000000000000,0x10000000000000000000000000,0x40000000000000000000000000,0x100000000000000000000000000,0x400000000000000000000000000,0x1000000000000000000000000000,0x4000000000000000000000000000,0x10000000000000000000000000000,0x40000000000000000000000000000,0x100000000000000000000000000000,0x400000000000000000000000000000,0x10000000000
00000000000000000000,0x4000000000000000000000000000000,0x10000000000000000000000000000000,0x40000000000000000000000000000000 ./myscript_wrapper.sh\n
Submitting this script in a directory with the nemo and xios_server.exe executables will run the desired MPMD job. The exec_map array shows the position of each executable in the rank list (0 = xios_server.exe, 1 = nemo). For larger core counts the cpu_map can be limited to a single node map which will be cycled through as many times as necessary.
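To make the mask_cpu list in the generated script easier to read, the sketch below shows how a single-core affinity mask can be written in hexadecimal. The helper names core_mask and mask_list are hypothetical and are not part of mkslurm_hetjob. The mask is assembled as a string because masks for cores 64 and above do not fit in the shell's 64-bit integers.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of mkslurm_hetjob.

# Print the hex mask that has only the bit for core number $1 set.
core_mask() {
    local core=$1
    local digit=$(( 1 << (core % 4) ))   # leading hex digit: 1, 2, 4 or 8
    local zeros=""
    local i=0
    # Each trailing hex zero accounts for four lower-numbered cores.
    while [ "$i" -lt $(( core / 4 )) ]; do
        zeros="${zeros}0"
        i=$(( i + 1 ))
    done
    echo "0x${digit}${zeros}"
}

# Join the masks for a list of core numbers with commas.
mask_list() {
    local out=""
    local core
    for core in "$@"; do
        out="${out}${out:+,}$(core_mask "$core")"
    done
    echo "$out"
}

mask_list 0 16 32 34
```

This prints 0x1,0x10000,0x100000000,0x400000000, which matches the first four masks in the script above: they select cores 0, 16, 32 and 34.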
"},{"location":"research-software/nemo/#how-to-optimise-the-performance-of-nemo","title":"How to optimise the performance of NEMO","text":"Note
Our optimisation advice is based on the ARCHER2 4-cabinet preview system with the same node architecture as the current ARCHER2 service but a total of 1,024 compute nodes. During these investigations we used NEMO-4.0.6 and XIOS-2.5.
Through testing with idealised test cases to optimise the computational performance (i.e. without the demanding I/O management that is typical of NEMO production runs), we have found that drastically under-populating the nodes does not affect the performance of the computation. This indicates that users can reserve large portions of the nodes without a performance detriment. Users can therefore run larger simulations by reserving up to 75% of each node for I/O management (i.e. XIOS I/O-servers).
XIOS I/O-servers can be more lightly packed than ocean clients and should be evenly distributed amongst the nodes i.e. not concentrated on a specific node. We found that placing 1 XIOS I/O-server per node with 4, 8, and 16 dedicated cores did not affect the performance. However, the performance was affected when allocating dedicated I/O-server cores outside of a 16-core NUMA region. Thus, users should confine XIOS I/O-servers to NUMA regions to improve performance and benefit from the memory hierarchy.
"},{"location":"research-software/nemo/#a-performance-investigation","title":"A performance investigation","text":"Note
These results were collated during early user testing of the ARCHER2 service by Andrew Coward and are subject to change.
This table shows some preliminary results of a repeated 60 day simulation of the ORCA2_ICE_PISCES, SETTE configuration using various core counts and packing strategies:
Note
These results used the mkslurm script, now hosted in /work/n01/shared/nemo/old_scripts/mkslurm
It is clear from the previous results that fully populating an ARCHER2 node is unlikely to provide the optimal performance for any codes with moderate memory bandwidth requirements. The explored regular packing strategy does not allow experimentation with less wasteful packing strategies than half-population though.
There may be a case, for example, for just leaving every 1 in 4 cores idle, or every 1 in 8, or even fewer idle cores per node. The mkslurm_alt script (/work/n01/shared/nemo/old_scripts/mkslurm_alt) provided a method of generating cpu-bind maps for exploring these strategies. The script assumed no change in the packing strategy for the servers but the core spacing argument (-c) for the ocean cores is replaced by a -g option representing the frequency of a gap in the, otherwise tightly-packed, ocean cores.
Preliminary tests have been conducted with the ORCA2_ICE_PISCES SETTE test case. This is a relatively small test case that will fit onto a single node. It is also small enough to perform well in attached mode. First some baseline tests in attached mode.
Previous tests used 4 I/O servers each occupying a single NUMA. For this size model, 2 servers occupying half a NUMA each will suffice. That leaves 112 cores with which to try different packing strategies. Is it possible to match or better this elapsed time on a single node including external I/O servers? Yes, but not with an obvious gap frequency:
And activating land suppression can reduce times further:
The optimal two-node solution is also shown (this is quicker but the one node solution is cheaper).
This leads us to the current iteration of the mkslurm script - mkslurm_hetjob. Note a tightly-packed placement with no gaps amongst the ocean processes can be generated using a client gap interval greater than the number of clients. This script has been used to explore the different placement strategies with a larger configuration based on eORCA025. In all cases, 8 XIOS servers were used, each with sole occupancy of a 16-core NUMA and a maximum of 2 servers per node. The rest of the initial 4 nodes (and any subsequent ocean core-only nodes) were filled with ocean cores at various packing densities (from tightly packed to half-populated). A summary of the results is shown below.
The limit of scalability for this problem size lies around 1500 cores. One interesting aspect is that the cost, in terms of node hours, remains fairly flat up to a thousand processes and the choice of gap placement makes much less difference as the individual domains shrink. It looks as if, so long as you avoid inappropriately high numbers of processors, choosing the wrong placement won't waste your allocation but may waste your time.
"},{"location":"research-software/nwchem/","title":"NWChem","text":"NWChem aims to provide its users with computational chemistry tools that are scalable both in their ability to treat large scientific computational chemistry problems efficiently, and in their use of available parallel computing resources from high-performance parallel supercomputers to conventional workstation clusters. The NWChem software can handle: biomolecules, nanostructures, and solid-state systems; from quantum to classical, and all combinations; Gaussian basis functions or plane-waves; scaling from one to thousands of processors; properties and relativity.
"},{"location":"research-software/nwchem/#useful-links","title":"Useful Links","text":"NWChem is released under an Educational Community License (ECL 2.0) and is freely available to all users on ARCHER2.
"},{"location":"research-software/nwchem/#where-can-i-get-help","title":"Where can I get help?","text":"If you have problems accessing or running NWChem on ARCHER2, please contact the Service Desk. General questions on the use of NWChem might also be directed to the NWChem forum. More experienced users with detailed technical issues on NWChem should consider submitting them to the NWChem GitHub issue tracker.
"},{"location":"research-software/nwchem/#running-nwchem-jobs","title":"Running NWChem jobs","text":"The following script will run a NWChem job using 2 nodes (256 cores) in the standard partition. It assumes that the input file is called test_calc.nw
.
#!/bin/bash\n\n# Request 2 nodes with 128 MPI tasks per node for 20 minutes\n\n#SBATCH --job-name=NWChem_test\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the NWChem module, avoid any unintentional OpenMP threading by\n# setting OMP_NUM_THREADS, and launch the code.\nmodule load nwchem\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --distribution=block:block --hint=nomultithread nwchem test_calc\n
"},{"location":"research-software/nwchem/#compiling-nwchem","title":"Compiling NWChem","text":"The latest instructions for building NWChem on ARCHER2 may be found in the GitHub repository of build instructions:
ONETEP (Order-N Electronic Total Energy Package) is a linear-scaling code for quantum-mechanical calculations based on density-functional theory.
"},{"location":"research-software/onetep/#useful-links","title":"Useful Links","text":"ONETEP is only available to users who have a valid ONETEP licence.
If you have a ONETEP licence and wish to have access to ONETEP on ARCHER2, please make a request via the SAFE, see:
Please have your licence details to hand.
"},{"location":"research-software/onetep/#running-parallel-onetep-jobs","title":"Running parallel ONETEP jobs","text":"The following script, supplied by the ONETEP developers, will run a ONETEP job using 2 nodes (256 cores) with 16 MPI processes per node and 8 OpenMP threads per MPI process. It assumes that there is a single calculation options file with the .dat
extension in the working directory.
#!/bin/bash\n\n# --------------------------------------------------------------------------\n# A SLURM submission script for ONETEP on ARCHER2 (full 23-cabinet system).\n# Central install, Cray compiler version.\n# Supports hybrid (MPI/OMP) parallelism.\n#\n# 2022.06 Jacek Dziedzic, J.Dziedzic@soton.ac.uk\n# University of Southampton\n# Lennart Gundelach, L.Gundelach@soton.ac.uk\n# University of Southampton\n# Tom Demeyere, T.Demeyere@soton.ac.uk\n# University of Southampton\n# --------------------------------------------------------------------------\n\n# v1.00 (2022.06.04) jd: Adapted from the user-compiled Cray compiler version.\n\n# ==========================================================================================================\n# Edit the following lines to your liking.\n#\n#SBATCH --job-name=mine # Name of the job.\n#SBATCH --nodes=2 # Number of nodes in job.\n#SBATCH --ntasks-per-node=16 # Number of MPI processes per node.\n#SBATCH --cpus-per-task=8 # Number of OMP threads spawned from each MPI process.\n#SBATCH --time=5:00:00 # Max time for your job (hh:mm:ss).\n#SBATCH --partition=standard # Partition: standard memory CPU nodes with AMD EPYC 7742 64-core processor\n#SBATCH --account=t01 # Replace 't01' with your budget code.\n#SBATCH --qos=standard # Requested Quality of Service (QoS), See ARCHER2 documentation\n\nexport OMP_NUM_THREADS=8 # Repeat the value from 'cpus-per-task' here.\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set up the job environment, loading the ONETEP module.\n# The module automatically sets OMP_PLACES, OMP_PROC_BIND and FI_MR_CACHE_MAX_COUNT.\n# To use a different binary, replace this line with either (drop the leading '#')\n# module load onetep/6.1.9.0-GCC-LibSci\n# to use the GCC-libsci binary, or with\n# module load onetep/6.1.9.0-GCC-MKL\n# to use the GCC-MKL binary.\n\nmodule load onetep/6.1.9.0-CCE-LibSci\n\n# 
==========================================================================================================\n# !!! You should not need to modify anything below this line.\n# ==========================================================================================================\n\nworkdir=`pwd`\necho \"--- This is the submission script, the time is `date`.\"\n\n# Figure out ONETEP executable\nonetep_exe=`which onetep.archer2`\necho \"--- ONETEP executable is $onetep_exe.\"\n\nonetep_launcher=`echo $onetep_exe | sed -r \"s/onetep.archer2/onetep_launcher/\"`\n\necho \"--- workdir is '$workdir'.\"\necho \"--- onetep_launcher is '$onetep_launcher'.\"\n\n# Ensure exactly 1 .dat file in there.\nndats=`ls -l *dat | wc -l`\n\nif [ \"$ndats\" == \"0\" ]; then\n echo \"!!! There is no .dat file in the current directory. Aborting.\" >&2\n touch \"%NO_DAT_FILE\"\n exit 2\nfi\n\nif [ \"$ndats\" == \"1\" ]; then\n true\nelse\n echo \"!!! More than one .dat file in the current directory, that's too many. Aborting.\" >&2\n touch \"%MORE_THAN_ONE_DAT_FILE\"\n exit 3\nfi\n\nrootname=`echo *.dat | sed -r \"s/\\.dat\\$//\"`\nrootname_dat=$rootname\".dat\"\nrootname_out=$rootname\".out\"\nrootname_err=$rootname\".err\"\n\necho \"--- The input file is $rootname_dat, the output goes to $rootname_out and errors go to $rootname_err.\"\n\n# Ensure ONETEP executable is there and is indeed executable.\nif [ ! -x \"$onetep_exe\" ]; then\n echo \"!!! $onetep_exe does not exist or is not executable. Aborting!\" >&2\n touch \"%ONETEP_EXE_MISSING\"\n exit 4\nfi\n\n# Ensure onetep_launcher is there and is indeed executable.\nif [ ! -x \"$onetep_launcher\" ]; then\n echo \"!!! $onetep_launcher does not exist or is not executable. 
Aborting!\" >&2\n touch \"%ONETEP_LAUNCHER_MISSING\"\n exit 5\nfi\n\n# Dump the module list to a file.\nmodule list >\\$modules_loaded 2>&1\n\nldd $onetep_exe >\\$ldd\n\n# Report details\necho \"--- Number of nodes as reported by SLURM: $SLURM_JOB_NUM_NODES.\"\necho \"--- Number of tasks as reported by SLURM: $SLURM_NTASKS.\"\necho \"--- Using this srun executable: \"`which srun`\necho \"--- Executing ONETEP via $onetep_launcher.\"\n\n\n# Actually run ONETEP\n# Additional srun options to pin one thread per physical core\n########################################################################################################################################################\nsrun --hint=nomultithread --distribution=block:block -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS $onetep_launcher -e $onetep_exe -t $OMP_NUM_THREADS $rootname_dat >$rootname_out 2>$rootname_err\n# Capture the exit code of srun here, before any later command overwrites it.\nresult=$?\n########################################################################################################################################################\n\necho \"--- srun finished at `date`.\"\n\n# Check for error conditions\nif [ $result -ne 0 ]; then\n echo \"!!! srun reported a non-zero exit code $result. Aborting!\" >&2\n touch \"%SRUN_ERROR\"\n exit 6\nfi\n\nif [ -r $rootname.error_message ]; then\n echo \"!!! ONETEP left an error message file. Aborting!\" >&2\n touch \"%ONETEP_ERROR_DETECTED\"\n exit 7\nfi\n\ntail $rootname.out | grep completed >/dev/null 2>/dev/null\nresult=$?\nif [ $result -ne 0 ]; then\n echo \"!!! ONETEP calculation likely did not complete. Aborting!\" >&2\n touch \"%ONETEP_DID_NOT_COMPLETE\"\n exit 8\nfi\n\necho \"--- Looks like everything went fine. Praise be.\"\ntouch \"%DONE\"\n\necho \"--- Finished successfully at `date`.\"\n
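The script above relies on $? to detect a failed srun. When adapting it, bear in mind that $? always refers to the most recently executed command, so it has to be read before anything else runs, including a logging echo. The following sketch illustrates the pattern with a hypothetical wrapper named run_and_report; it is not part of the ONETEP-supplied script.

```shell
#!/bin/bash
# Hypothetical wrapper for illustration; not part of the ONETEP script.
run_and_report() {
    "$@"
    local rc=$?                                   # read the exit code immediately
    echo "--- command finished at $(date)." >&2   # logging no longer affects rc
    return "$rc"
}

# 'sh -c "exit 3"' stands in for a launcher that fails with exit code 3.
run_and_report sh -c 'exit 3' || echo "command failed with exit code $?"
```

Had the echo been placed before the exit code was saved, the reported value would have been the exit code of echo, which is normally 0.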
"},{"location":"research-software/onetep/#hints-and-tips","title":"Hints and Tips","text":"See the information in the ONETEP documentation.
"},{"location":"research-software/onetep/#compiling-onetep","title":"Compiling ONETEP","text":"The latest instructions for building ONETEP on ARCHER2 may be found in the GitHub repository of build instructions:
OpenFOAM is an open-source toolbox for computational fluid dynamics. OpenFOAM consists of generic tools to simulate complex physics for a variety of fields of interest, from fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics, electromagnetism and the pricing of financial options.
The core technology of OpenFOAM is a flexible set of modules written in C++. These are used to build solvers and utilities to perform pre-processing and post-processing tasks ranging from simple data manipulation to visualisation and mesh processing.
There are a number of different flavours of the OpenFOAM package with slightly different histories, and slightly different features. The two most common are distributed by openfoam.org and openfoam.com.
"},{"location":"research-software/openfoam/#useful-links","title":"Useful Links","text":"OpenFOAM is released under a GPL v3 license and is freely available to all users on ARCHER2.
Upgrade 2023
auser@ln01> module avail openfoam\n--------------- /work/y07/shared/archer2-lmod/apps/core -----------------\nopenfoam/com/v2106 openfoam/org/v9.20210903\nopenfoam/com/v2212 (D) openfoam/org/v10.20230119 (D)\n
Note: the older versions were recompiled under PE22.12 in April 2023.
Full system
auser@ln01> module avail openfoam\n--------------- /work/y07/shared/archer2-lmod/apps/core -----------------\nopenfoam/com/v2106 openfoam/org/v9.20210903 (D)\nopenfoam/org/v8.20200901\n
Versions from openfoam.org are typically v8.0 etc and there is typically one release per year (in June; with a patch release in September). Versions from openfoam.com are e.g., v2106 (to be read as 2021 June) and there are typically two releases a year (one in June, and one in December).
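As a small illustration of the openfoam.com naming scheme described above, the sketch below decodes a tag such as v2106 into a year and month. The function name decode_com_version is hypothetical, and the code assumes the tag always has the form v followed by a two-digit year and a two-digit month.

```shell
#!/bin/bash
# Hypothetical helper; assumes tags of the form vYYMM (e.g. v2106).
decode_com_version() {
    local tag=${1#v}      # strip the leading "v"
    local yy=${tag%??}    # drop the last two characters, leaving the year
    local mm=${tag#??}    # drop the first two characters, leaving the month
    echo "20${yy}-${mm}"
}

decode_com_version v2106
decode_com_version v2212
```

The two calls print 2021-06 and 2022-12, matching the June 2021 and December 2022 releases listed for the openfoam/com modules.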
To use OpenFOAM on ARCHER2 you should first load an OpenFOAM module, e.g.
user@ln01:> module load PrgEnv-gnu\nuser@ln01:> module load openfoam/com/v2106\n
(Note that the openfoam
module will automatically load PrgEnv-gnu
if it is not already active.) The module defines only the base installation directory via the environment variable FOAM_INSTALL_DIR
. After loading the module you need to source the etc/bashrc
file provided by OpenFOAM, e.g.
source ${FOAM_INSTALL_DIR}/etc/bashrc\n
You should then be able to use OpenFOAM. The above commands will also need to be added to any job/batch submission scripts you want to use to run OpenFOAM. Note that all the centrally installed versions of OpenFOAM are compiled under PrgEnv-gnu
.
Note there are no default module versions specified. It is recommended to use a fully qualified module name (with the exact version, as in the example above).
"},{"location":"research-software/openfoam/#extensions-to-openfoam","title":"Extensions to OpenFOAM","text":"Many packages extend the central OpenFOAM functionality in some way. However, there is no completely standardised way in which this works. Some packages assume they have write access to the main OpenFOAM installation. If this is the case, you must install your own version before continuing. This can be done on an individual basis, or a per-project basis using the project shared directories.
Some packages are installed in the OpenFOAM user directory, by default this is set to $HOME/OpenFOAM/$USER-[openfoam-version]
. This can be changed (e.g. to the work filesystem) by adding WM_PROJECT_USER_DIR=/work/a01/a01/auser/OpenFOAM/auser-[openfoam-version]
as an argument to source ${FOAM_INSTALL_DIR}/etc/bashrc
. For example:
source ${FOAM_INSTALL_DIR}/etc/bashrc WM_PROJECT_USER_DIR=/work/a01/a01/auser/OpenFOAM/auser-v2106\n
"},{"location":"research-software/openfoam/#compiling-openfoam","title":"Compiling OpenFOAM","text":"If you want to compile your own version of OpenFOAM, instructions are available for ARCHER2 at:
While it is possible to run limited OpenFOAM pre-processing and post-processing activities on the front end, we request all significant work is submitted to the queue system. Please remember that the front end is a shared resource.
A typical SLURM job submission script for OpenFOAM is given here. This would request 4 nodes to run with 128 MPI tasks per node (a total of 512 MPI tasks). Each MPI task is allocated one core (--cpus-per-task=1
).
#!/bin/bash\n\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --distribution=block:block\n#SBATCH --hint=nomultithread\n#SBATCH --time=00:10:00\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Load the appropriate module and source the OpenFOAM bashrc file\n\nmodule load openfoam/org/v10.20230119\n\nsource ${FOAM_INSTALL_DIR}/etc/bashrc\n\n# Run OpenFOAM work, e.g.,\n\nsrun interFoam -parallel\n
#!/bin/bash\n\n#SBATCH --nodes=4\n#SBATCH --tasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --distribution=block:block\n#SBATCH --hint=nomultithread\n#SBATCH --time=00:10:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the appropriate module and source the OpenFOAM bashrc file\n\nmodule load openfoam/org/v8.20210901\n\nsource ${FOAM_INSTALL_DIR}/etc/bashrc\n\n# Run OpenFOAM work, e.g.,\n\nsrun interFoam -parallel\n
"},{"location":"research-software/openfoam/#module-version-history","title":"Module version history","text":"The following centrally installed versions are available.
"},{"location":"research-software/openfoam/#upgrade-2023","title":"Upgrade 2023","text":"Module openfoam/com/v2212
installed as default April 2023 (PE 22.12). This is version v2212 (December 2022). See the OpenFOAM.com v2212 release announcement
Module openfoam/com/v2106
was recompiled April 2023 (PE 22.12). This is version v2106 (June 2021). See the OpenFOAM.com v2106 release announcement
Module openfoam/org/v10.20230119
installed as default April 2023 (PE 22.12). This is version 10 patch release 19th January 2023. See version 10 patch news
Module openfoam/org/v9.20210903
was recompiled April 2023 (PE 22.12). This is version 9 patch release 3rd September 2021. See version 9 patch release news.
Module openfoam/com/v2106
installed October 2021 (Cray PE 21.04). Version v2106 (June 2021). See OpenFOAM.com website
Module openfoam/org/v9.20200903
installed October 2021 (Cray PE 21.09). Version 9 patch release 3rd September 2021. See OpenFOAM.org website
Module openfoam/org/v8.20200901
installed October 2021 (Cray PE 21.09). Version 8 patch release 1st September 2020. See OpenFOAM.org website
ORCA is an ab initio quantum chemistry program package that contains modern electronic structure methods including density functional theory, many-body perturbation, coupled cluster, multireference methods, and semi-empirical quantum chemistry methods. Its main field of application is larger molecules, transition metal complexes, and their spectroscopic properties. ORCA is developed in the research group of Frank Neese. The free version is available only for academic use at academic institutions.
Important
ORCA is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/orca/#useful-links","title":"Useful Links","text":"ORCA is available for academic use on ARCHER2 only. If you wish to use ORCA for commercial applications, you must contact the ORCA developers.
"},{"location":"research-software/orca/#running-parallel-orca-jobs","title":"Running parallel ORCA jobs","text":"The following script will run an ORCA job on the ARCHER2 system using 256 MPI processes across 2 nodes, each MPI process will be placed on a separate physical core. It assumes that the input file is my_calc.inp
#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:20:00\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load orca\n\n# Launch the ORCA calculation\n# * You must use \"$ORCADIR/orca\" so the application has the full executable path\n# * Do not use \"srun\" to launch parallel ORCA jobs as they use OpenMPI rather than Cray MPICH\n# * Remember to change the name of the input file to match your file name\n$ORCADIR/orca my_calc.inp\n
"},{"location":"research-software/qchem/","title":"QChem","text":"QChem is an ab initio quantum chemistry software package for fast and accurate simulations of molecular systems, including electronic and molecular structure, reactivities, properties, and spectra.
Important
QChem is not part of the officially supported software on ARCHER2. While the ARCHER2 service desk is able to provide support for basic use of this software (e.g. access to software, writing job submission scripts) it does not generally provide detailed technical support for the software and you may be directed to seek support from other places if the service desk cannot answer the questions.
"},{"location":"research-software/qchem/#useful-links","title":"Useful Links","text":"ARCHER2 has a site licence for QChem.
"},{"location":"research-software/qchem/#running-parallel-qchem-jobs","title":"Running parallel QChem jobs","text":"Important
QChem parallelisation is only available on ARCHER2 by using multiple threads within a single compute node. Multi-process and multi-node parallelisation will not work on ARCHER2.
The following script will run QChem with 16 OpenMP threads, reading its input from hf3c.in
.
#!/bin/bash\n#SBATCH --nodes=1\n#SBATCH --time=1:0:0\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=16\n\n# Replace [budget code] below with your project code (e.g. e05)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load other-software\nmodule load qchem\n\nexport OMP_PLACES=cores\nexport OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nexport SLURM_HINT=\"nomultithread\"\nexport SLURM_DISTRIBUTION=\"block:block\"\n\nqchem -slurm -nt $OMP_NUM_THREADS hf3c.in hf3c.out\n
"},{"location":"research-software/qe/","title":"Quantum Espresso","text":"Quantum Espresso (QE) is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.
"},{"location":"research-software/qe/#useful-links","title":"Useful Links","text":"QE is released under a GPL v2 license and is freely available to all ARCHER2 users.
"},{"location":"research-software/qe/#running-parallel-qe-jobs","title":"Running parallel QE jobs","text":"For example, the following script will run a QE pw.x
job using 4 nodes (128x4 cores).
#!/bin/bash\n\n# Request 4 nodes to run a 512 MPI task job with 128 MPI tasks per node.\n# The maximum walltime limit is set to be 20 minutes.\n\n#SBATCH --job-name=qe_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the relevant Quantum Espresso module\nmodule load quantum_espresso\n\n#\u00a0Set number of OpenMP threads to 1 to prevent multithreading by libraries\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\nsrun --hint=nomultithread --distribution=block:block pw.x < test_calc.in\n
"},{"location":"research-software/qe/#hints-and-tips","title":"Hints and tips","text":"The QE module is set to load up the default QE-provided pseudo-potentials. If you wish to use non-default pseudo-potentials, you will need to change the ESPRESSO_PSEUDO
variable to point to the directory you wish. This can be done by adding the following line after the module is loaded
export ESPRESSO_PSEUDO=/path/to/pseudo_potentials\n
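A mistyped path here is easy to miss, so it can help to check that the directory exists before exporting the variable. The following is a minimal sketch only; the helper name set_pseudo_dir is hypothetical and is not provided by the QE module.

```shell
# Hypothetical helper (not part of the quantum_espresso module):
# export ESPRESSO_PSEUDO only if the given directory actually exists.
set_pseudo_dir() {
    local dir="$1"
    if [ -d "$dir" ]; then
        export ESPRESSO_PSEUDO="$dir"
    else
        # Report the problem and leave ESPRESSO_PSEUDO unchanged.
        echo "Pseudopotential directory not found: $dir" >&2
        return 1
    fi
}
```

In a job script this would be called after the module is loaded, for example as set_pseudo_dir /path/to/pseudo_potentials, where the path is a placeholder for your own directory.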
"},{"location":"research-software/qe/#compiling-qe","title":"Compiling QE","text":"The latest instructions for building QE on ARCHER2 can be found in the GitHub repository of build instructions:
The Vienna Ab initio Simulation Package (VASP) is a computer program for atomic scale materials modelling, e.g. electronic structure calculations and quantum-mechanical molecular dynamics, from first principles.
VASP computes an approximate solution to the many-body Schr\u00f6dinger equation, either within density functional theory (DFT), solving the Kohn-Sham equations, or within the Hartree-Fock (HF) approximation, solving the Roothaan equations. Hybrid functionals that mix the Hartree-Fock approach with density functional theory are implemented as well. Furthermore, Green's functions methods (GW quasiparticles, and ACFDT-RPA) and many-body perturbation theory (2nd-order M\u00f8ller-Plesset) are available in VASP.
In VASP, central quantities, like the one-electron orbitals, the electronic charge density, and the local potential are expressed in plane wave basis sets. The interactions between the electrons and ions are described using norm-conserving or ultrasoft pseudopotentials, or the projector-augmented-wave method.
To determine the electronic ground state, VASP makes use of efficient iterative matrix diagonalisation techniques, like the residual minimisation method with direct inversion of the iterative subspace (RMM-DIIS) or blocked Davidson algorithms. These are coupled to highly efficient Broyden and Pulay density mixing schemes to speed up the self-consistency cycle.
"},{"location":"research-software/vasp/#useful-links","title":"Useful Links","text":"VASP is only available to users who have a valid VASP licence.
If you have a VASP 5 or 6 licence and wish to have access to VASP on ARCHER2, please make a request via the SAFE, see:
Please have your license details to hand.
Note
Both VASP 5 and VASP 6 are available on ARCHER2. You generally need a different licence for each of these versions.
"},{"location":"research-software/vasp/#running-parallel-vasp-jobs","title":"Running parallel VASP jobs","text":"To access VASP you should load the appropriate vasp
module in your job submission scripts.
To load the default version of VASP, you would use:
module load vasp\n
Tip
VASP 6.4.3 and above have all been compiled to include Wannier90 functionality. Older versions of VASP on ARCHER2 do not include Wannier90.
Once loaded, the executables are called:
vasp_std
- Multiple k-point versionvasp_gam
- GAMMA-point only versionvasp_ncl
- Non-collinear versionOnce the module has been loaded, you can access the LDA and PBE pseudopotentials for VASP on ARCHER2 at:
$VASP_PSPOT_DIR\n
Tip
VASP 6 can make use of OpenMP threads in addition to running with pure MPI. We will add notes on performance and use of threading in VASP as information becomes available.
Example VASP submission script
#!/bin/bash\n\n# Request 16 nodes (2048 MPI tasks at 128 tasks per node) for 20 minutes. \n\n#SBATCH --job-name=VASP_test\n#SBATCH --nodes=16\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code] \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the VASP module\nmodule load vasp/6\n\n# Avoid any unintentional OpenMP threading by setting OMP_NUM_THREADS\nexport OMP_NUM_THREADS=1\n\n# Ensure the cpus-per-task option is propagated to srun commands\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the code - the distribution and hint options are important for performance\nsrun --distribution=block:block --hint=nomultithread vasp_std\n
"},{"location":"research-software/vasp/#vasp-transition-state-tools-vtst","title":"VASP Transition State Tools (VTST)","text":"As well as the standard VASP 5 modules, we provide versions of VASP 5 with the VASP Transition State Tools (VTST) from the University of Texas added. The VTST version adds various functionality to VASP and provides additional scripts to use with VASP. Additional functionality includes:
Full details of these methods and the provided scripts can be found on the VTST website.
On ARCHER2, the VTST version of VASP 5 can be accessed by loading the modules with VTST
in the module name, for example:
module load vasp/6/6.4.1-vtst\n
"},{"location":"research-software/vasp/#compiling-vasp-on-archer2","title":"Compiling VASP on ARCHER2","text":"If you wish to compile your own version of VASP on ARCHER2 (either VASP 5 or VASP 6) you can find information on how we compiled the central versions in the build instructions GitHub repository. See:
The VASP modules are set up to use the OpenFabrics MPI transport protocol as testing has shown that this passes all the regression tests and gives the most reliable operation on ARCHER2. However, there may be cases where using UCX can give better performance than OpenFabrics.
If you want to try the UCX transport protocol then you can do this by loading additional modules after you have loaded the VASP modules. For example, for VASP 6, you would use:
module load vasp/6\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\n
"},{"location":"research-software/vasp/#increasing-the-cpu-frequency-and-enabling-turbo-boost","title":"Increasing the CPU frequency and enabling turbo-boost","text":"The default CPU frequency is currently set to 2 GHz on ARCHER2. While many VASP calculations are memory or MPI bound, some calculations can be CPU bound. For those cases, you may see a significant difference in performance by increasing the CPU frequency and enabling turbo-boost (though you will almost certainly also be less energy efficient).
You can do this by adding the line:
export SLURM_CPU_FREQ_REQ=2250000\n
in your job submission script before the srun command.
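As a small sketch of how the value is read: Slurm interprets a numeric SLURM_CPU_FREQ_REQ as a frequency in kHz, so 2250000 corresponds to 2.25 GHz. The variable freq_mhz below is illustrative only and is not used by Slurm.

```shell
# Request 2.25 GHz; a numeric value is interpreted by Slurm in kHz.
export SLURM_CPU_FREQ_REQ=2250000

# Illustrative conversion to MHz so the request can be logged in the job output.
freq_mhz=$(( SLURM_CPU_FREQ_REQ / 1000 ))
echo "Requested CPU frequency: ${freq_mhz} MHz"
```

Echoing the request near the top of the job output makes it easier to compare timings between runs at the default and the increased frequency.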
"},{"location":"research-software/vasp/#performance-tips","title":"Performance tips","text":"The performance of VASP depends on the version of VASP used, the performance of MPI collective operations, the choice of VASP parallelisation parameters (NCORE
/NPAR
and KPAR
) and how many MPI processes per node are used.
KPAR: You should always use the maximum value of KPAR
that your calculation allows within the available memory.
NCORE/NPAR: We have found that the optimal values of NCORE
(and hence NPAR
) depend on both the type of calculation you are performing (e.g. pure DFT, hybrid functional, \u0393-point, non-collinear) and the number of nodes/cores you are using for your calculation. In practice, this means that you should experiment with different values to find the best choice for your calculation. There is information below on the best choices for the benchmarks we have run on ARCHER2 that may serve as a useful starting point. The performance difference from choosing different values can vary by up to 100% so it is worth spending time investigating this.
MPI processes per node We found that, for some of the benchmarks used, running fewer MPI processes per node than the total number of cores per node can improve performance.
OpenMP threads Using multiple OpenMP threads per MPI process can be beneficial to performance. 4 OpenMP threads per MPI process typically sees the best performance in the tests we have performed.
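To illustrate the last two tips, the sketch below derives a hybrid MPI and OpenMP layout for one standard node, assuming 128 physical cores per node (as in the example scripts above) and 4 OpenMP threads per MPI process. The variable names are illustrative and the printed lines are a starting point to adapt, not a tested recommendation for any particular calculation.

```shell
#!/bin/bash
# Sketch only: derive a hybrid MPI+OpenMP layout for one standard node.
cores_per_node=128     # physical cores per node, as used in the scripts above
threads_per_task=4     # OpenMP threads per MPI process

# Integer division gives the number of MPI processes per node
# that uses every physical core exactly once.
tasks_per_node=$(( cores_per_node / threads_per_task ))

# Print the settings you would place in a job submission script.
echo "#SBATCH --ntasks-per-node=${tasks_per_node}"
echo "#SBATCH --cpus-per-task=${threads_per_task}"
echo "export OMP_NUM_THREADS=${threads_per_task}"
echo "export OMP_PLACES=cores"
```

With these values the script prints 32 MPI processes per node. Choose a thread count that divides the number of cores per node so that no cores are left idle.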
"},{"location":"research-software/vasp/#vasp-performance-data-on-archer2","title":"VASP performance data on ARCHER2","text":"VASP performance data on ARCHER2 is currently available for two different benchmark systems:
Basic information:
vasp_ncl
NELM = 6
Performance summary:
vasp/6/6.4.2-mkl19
modules)NCORE
:NCORE = 16
KPAR = 2
is the maximum that can be used on standard memory nodes. Setup details: - vasp/6/6.4.2-mkl19
module - GCC 11.2.0 - MKL 19.5 for BLAS/LAPACK/ScaLAPACK and FFTW - OFI for MPI transport layer
This page has moved
"},{"location":"research-software/chemshell/chemshell/","title":"Chemshell","text":"This page has moved
"},{"location":"research-software/code-saturne/code-saturne/","title":"Code saturne","text":"This page has moved
"},{"location":"research-software/cp2k/cp2k/","title":"Cp2k","text":"This page has moved
"},{"location":"research-software/fhi-aims/fhi-aims/","title":"Fhi aims","text":"This page has moved
"},{"location":"research-software/gromacs/gromacs/","title":"Gromacs","text":"This page has moved
"},{"location":"research-software/lammps/lammps/","title":"Lammps","text":"This page has moved
"},{"location":"research-software/mitgcm/mitgcm/","title":"Mitgcm","text":"This page has moved
"},{"location":"research-software/mo-unified-model/mo-unified-model/","title":"Mo unified model","text":"This page has moved
"},{"location":"research-software/namd/namd/","title":"Namd","text":"This page has moved
"},{"location":"research-software/nektarplusplus/nektarplusplus/","title":"Nektarplusplus","text":"This page has moved
"},{"location":"research-software/nemo/nemo/","title":"Nemo","text":"This page has moved
"},{"location":"research-software/nwchem/nwchem/","title":"Nwchem","text":"This page has moved
"},{"location":"research-software/onetep/onetep/","title":"Onetep","text":"This page has moved
"},{"location":"research-software/openfoam/openfoam/","title":"Openfoam","text":"This page has moved
"},{"location":"research-software/qe/qe/","title":"Qe","text":"This page has moved
"},{"location":"research-software/vasp/vasp/","title":"Vasp","text":"This page has moved
"},{"location":"software-libraries/","title":"Software Libraries","text":"This section provides information on centrally-installed software libraries and library-based packages. These provide significant functionality that is of interest to both users and developers of applications.
Libraries are made available via the module system, and fall into a number of distinct groups.
"},{"location":"software-libraries/#libraries-via-modules-cray-","title":"Libraries via modulescray-*
","text":"The following libraries are available as modules prefixed by cray-
and may be of direct interest to developers and users. The modules are provided by HPE Cray to be optimised for performance on the ARCHER2 hardware, and should be used where possible. The relevant modules are:
cray-fftw ...details for module load cray-fftw...
FFTW (Fastest Fourier Transform in the West) is a standard package for discrete Fourier transforms. See the FFTW home page
cray-hdf5 and cray-hdf5-parallel ...details for hdf5...
Hierarchical Data Format (HDF5) is a high-performance and portable data format and data model. These modules provide serial and parallel variants of HDF5. See the HDF5 home page
cray-libsci ...details for cray-libsci...
BLAS, LAPACK, BLACS, and SCALAPACK provide basic linear algebra functionality such as vector-vector, matrix-vector, and matrix-matrix multiplication. Module cray-libsci
is loaded by default in all programming environments.
cray-netcdf ...details for cray-netcdf...
Serial version of Network Common Data Form (NetCDF), a widely used and portable data format. See the NETCDF website
cray-netcdf-hdf5parallel
A serial NetCDF built against parallel HDF5. Load module cray-hdf5-parallel
first.
cray-parallel-netcdf ...details for Parallel NetCDF...
A parallel NetCDF implementation (sometimes referred to as \"Pnetcdf\").
All libraries provided by modules prefixed cray-
integrate with the compiler environment, and so appropriate compiler and link stage options are injected when using the standard compiler wrappers cc
, CC
and ftn
.
The following libraries are also made available by the ARCHER2 CSE team:
ADIOS2 ...details for ADIOS2 on ARCHER2...
ADIOS2 parallel IO library.
AOCL ...details for AOCL on ARCHER2...
AOCL (AMD Optimizing CPU Libraries) provides a set of numerical libraries optimised for AMD \"Zen\"-based processors.
ARPACK-NG ...details for ARPACK-NG on ARCHER2...
ARPACK-NG (Arnoldi Package) computes eigenvalues and eigenvectors of large sparse matrices.
Boost ...details for Boost on ARCHER2...
Boost is a portable C++ library providing reference implementations of many common containers, operations and algorithms.
Eigen ...details for Eigen on ARCHER2...
Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
GLM ...details for GLM on ARCHER2...
GLM (GL Math library) is a C++ header-only library for performing operations commonly encountered in graphics applications.
Hypre ...details for HYPRE on ARCHER2...
HYPRE provides pre-conditioners and solvers for sparse linear algebra problems.
Metis and Parmetis ...details for Metis and Parmetis...
METIS is a set of (serial) routines for partitioning graphs and meshes, and computing reduced-fill orderings of sparse matrices. It is commonly used e.g., to compute decompositions for finite element problems. Parmetis is the distributed memory counterpart.
Mumps ...details for MUMPS on ARCHER2...
MUMPS provides parallel direct solution of large sparse matrix problems.
PETSc ...details for PETSc on ARCHER2...
PETSc is a general package with functionality related to the solution of a wide range of problems described by partial differential equations.
Scotch ...details for Scotch and PT-Scotch on ARCHER2...
Scotch (and its parallel partner PT-Scotch) is a graph partitioning library.
SLEPc ...details for SLEPc on ARCHER2...
SLEPc is a package for large eigenvalue problems based on PETSc.
SuperLU and SuperLU_DIST ...details for SuperLU on ARCHER2...
SuperLU provides solutions to large non-symmetric sparse systems. SuperLU_DIST is the distributed memory version.
Trilinos ...details for Trilinos on ARCHER2...
Trilinos is a large collection of packages for the solution of complex scientific and engineering problems.
Again, all the libraries listed above are supported by all programming environments via the module system. Additional compile and link time flags should not be required.
"},{"location":"software-libraries/#building-your-own-library-versions","title":"Building your own library versions","text":"For the libraries listed in this section, a set of build and installation scripts are available at the ARCHER2 Github repository.
Follow the instructions to build the relevant package (note this is the cse-develop
branch of the repository). See also individual libraries pages in the list above for further details.
The scripts available from this repository should work in all three programming environments.
"},{"location":"software-libraries/adios/","title":"ADIOS","text":"The Adaptable I/O System (ADIOS) is developed at Oak Ridge National Laboratory and is freely available under a BSD license. The current development is ADIOS2.
"},{"location":"software-libraries/adios/#version-history","title":"Version history","text":"CurrentVersions of ADIOS2 for different programming environments are available. See, e.g.:
user@ln01:> module load other-software\nuser@ln01:> module avail adios2\n
Please load the appropriate module for the current programming environment. Upgrade 2023 The central installation of ADIOS (version 1) has been removed as it is no longer actively developed.
Full system4-cabinet systemadios/1.13.1
installed October 2021 (PE 21.04)adios/1.13.1
installed January 2021Configuration details for ADIOS2 are obtained via the utility adios2-config
which should be available in the PATH
once ADIOS is installed. For example, to recover the compiler options required to provide serial C include files, issue:
$ adios2-config -s -c\n
Use adios2-config --help
for a summary of options. To compile and link an application, such statements can be embedded in a Makefile via, e.g.,
ADIOS_INC := $(shell adios2-config -s -c)\nADIOS_CLIB := $(shell adios2-config -s -l)\n
See the ADIOS2 user manual for further details and examples."},{"location":"software-libraries/adios/#compile-your-own-version","title":"Compile your own version","text":"Details for ADIOS2 are pending.
"},{"location":"software-libraries/adios/#resources","title":"Resources","text":"The ADIOS2 user manual
The ADIOS2 github repository
"},{"location":"software-libraries/aocl/","title":"AMD Optimizing CPU Libraries (AOCL)","text":"AMD Optimizing CPU Libraries (AOCL) are a set of numerical libraries optimized for AMD \u201cZen\u201d-based processors, including EPYC, Ryzen Threadripper PRO, and Ryzen.
AOCL is comprised of eight libraries: - BLIS (BLAS Library) - libFLAME (LAPACK) - AMD-FFTW - LibM (AMD Core Math Library) - ScaLAPACK - AMD Random Number Generator (RNG) - AMD Secure RNG - AOCL-Sparse
Tip
AOCL 3.1
and 4.0
are available. 3.1
is default.
Important
AOCL does not currently support the Cray programming environment and is currently unavailable with PrgEnv-cray
loaded.
Important
The cray-libsci
module is loaded by default for all users and this module also contains definitions of BLAS, LAPACK and ScaLAPACK routines that conflict with those in AOCL. The aocl
module automatically unloads cray-libsci
.
AOCL 3.1
and 4.0
are available for all versions of the GCC compilers: gcc/11.2.0
and gcc/10.3.0
module load PrgEnv-gnu\nmodule load aocl\n
"},{"location":"software-libraries/aocl/#aocc-programming-environment","title":"AOCC Programming Environment","text":"AOCL 3.1
and 4.0
are available for all versions of the AOCC compilers: aocc/3.2.0
.
module load PrgEnv-aocc\nmodule load aocl\n
"},{"location":"software-libraries/aocl/#resources","title":"Resources","text":"For more information on AOCL, please see: https://developer.amd.com/amd-aocl/#documentation
"},{"location":"software-libraries/aocl/#version-history","title":"Version history","text":"Current modules:
aocl/3.1
installed June 2023aocl/4.0
installed June 2023The Arnoldi Package (ARPACK) was designed to compute eigenvalues and eigenvectors of large sparse matrices. Originally from Rice University, an open source version (ARPACK-NG) is available under a BSD license and is made available here.
"},{"location":"software-libraries/arpack/#compiling-and-linking-with-arpack","title":"Compiling and linking with ARPACK","text":"module load arpack-ng
To compile an application against the ARPACK-NG libraries, load the arpack-ng
module and use the compiler wrappers cc
, CC
, and ftn
in the usual way.
The arpack-ng
module defines ARPACK_NG_DIR
which locates the root of the installation for the current programming environment.
arpack-ng/3.8.0
installed October 2021 (PE 21.04)The current supported version of ARPACK-NG on ARCHER2 can be compiled using a script available from the ARCHER2 GitHub repository.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/arpack-ng.sh --prefix=/path/to/install/location\n
where the --prefix
specifies a suitable location. See the Archer2 github repository for further options and details. Note that the build process runs the tests, for which an salloc
allocation is required to allow the parallel tests to run correctly."},{"location":"software-libraries/arpack/#resources","title":"Resources","text":"ARPACK-NG github site
"},{"location":"software-libraries/boost/","title":"Boost","text":"Boost provides portable C++ libraries useful in a broad range of contexts. The libraries are freely available under the terms of the Boost Software license.
"},{"location":"software-libraries/boost/#compiling-and-linking","title":"Compiling and linking","text":"module load boost
The C++ compiler wrapper CC
will introduce the appropriate options to compile an application against the Boost libraries. The other compiler wrappers (cc
and ftn
) do not introduce these options.
To check exactly what options are introduced type, e.g.,
$ CC --cray-print-opts\n
The boost
module also defines the environment variable BOOST_DIR
as the root of the installation for the current programming environment if this information is needed.
boost/1.81.0
installed May 2023 (PE 22.12)boost/1.72.0
recompiled May 2023 (PE 22.12)boost/1.72
installed October 2021 (PE 21.04)boost/1.72.0
installed January 2021The following libraries are installed: atomic chrono container context contract coroutine date_time exception fiber filesystem graph_parallel graph iostreams locale log math mpi program_options random regex serialization stacktrace system test thread timer type_erasure wave
The ARCHER2 Github repository contains a recipe for compiling Boost for the different programming environments.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout cse-develop\n$ ./sh/boost.sh --prefix=/path/to/install/location\n
where the --prefix
determines the install location. The list of libraries compiled is specified in the boost.sh
script. See the ARCHER2 Github repository for further information."},{"location":"software-libraries/boost/#resources","title":"Resources","text":"Boost home page.
Documentation (HTML) for the current version.
Boost GitHub repository.
Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
"},{"location":"software-libraries/eigen/#compiling-with-eigen","title":"Compiling with Eigen","text":"module load eigen
To compile an application with the Eigen header files, load the eigen
module and use the compiler wrappers cc
, CC
, or ftn
in the usual way. The relevant header files will be introduced automatically.
The header files are located in /work/y07/shared/libs/core/eigen/3.4.0/
, and can be included manually at compilation without loading the module if required.
eigen/3.4.0
installed October 2021The current supported version on Archer2 can be built using the following script
$ wget https://gitlab.com/libeigen/eigen/-/archive/3.4.0/eigen-3.4.0.tar.gz\n$ tar xvf eigen-3.4.0.tar.gz\n$ cmake eigen-3.4.0/ -DCMAKE_INSTALL_PREFIX=/path/to/install/location\n$ make install\n
where the -DCMAKE_INSTALL_PREFIX
option determines the install directory. Installing in this way will also build the Eigen documentation and unit-tests."},{"location":"software-libraries/eigen/#resources","title":"Resources","text":"Eigen home page
Getting Started guide
"},{"location":"software-libraries/fftw/","title":"FFTW","text":"module load cray-fftw
FFTW is a C subroutine library (which includes a Fortran interface) for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).
Only the version 3 interface is available on ARCHER2.
"},{"location":"software-libraries/glm/","title":"GLM","text":"OpenGL Mathematics (GLM) is a header-only C++ library which performs operations typically encountered in graphics applications, but can also be relevant to scientific applications. GLM is freely available under an MIT license.
"},{"location":"software-libraries/glm/#compiling-with-glm","title":"Compiling with GLM","text":"module load glm
The compiler wrapper CC
will automatically locate the required include directory when the module is loaded.
The glm
module also defines the environment variable GLM_DIR
which carries the root of the installation, if needed.
glm/0.9.9.6
installed October 2021 (PE 21.04)glm/0.9.9.6
installed January 2021One can follow the instructions used to install the current version on ARCHER2 via the ARCHER2 Github repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2021-10\n$ ./sh/glm.sh --prefix=/path/to/install/location\n
where the --prefix
option sets the install location. See the ARCHER2 Github repository for further details."},{"location":"software-libraries/glm/#resources","title":"Resources","text":"The GLM Github repository.
"},{"location":"software-libraries/hdf5/","title":"HDF5","text":"The Hierarchical Data Format HDF5 (and its parallel manifestation HDF5 parallel) is a standard library and data format developed and supported by The HDF Group, and is released under a BSD-like license.
Both serial and parallel versions are available on ARCHER2 as standard modules:
module load cray-hdf5
(serial version)module load cray-hdf5-parallel
(MPI parallel version)Use module help
to locate cray-
specific release notes on a particular version.
Known issues:
Upgrade 2023Full system4-cabinet systemcray-hdf5-parallel
will not operate correctly in PrgEnv-aocc
. One can load module epcc-cray-hdf5-parallel
instead as a work-around if PrgEnv-aocc
is required.Some general comments and information on serial and parallel I/O to ARCHER2 are given in the section on I/O and file systems.
"},{"location":"software-libraries/hdf5/#compiling-applications-against-hdf5","title":"Compiling applications against HDF5","text":"If the appropriate programming environment and HDF5 modules are loaded, compiling applications against the HDF5 libraries should be straightforward. You should use the compiler wrappers cc
, CC
, and/or ftn
. See, e.g., cc --cray-print-opts
for the full list of include paths and library paths and options added by the compiler wrapper.
The HDF5 support website includes general documentation.
For parallel HDF5, some tutorials and presentations are available.
"},{"location":"software-libraries/hypre/","title":"HYPRE","text":"HYPRE is a library of linear solvers for structured and unstructured problems with a particular emphasis on multigrid. It is a product of the Lawrence Livermore National Laboratory and is distributed under either the MIT license or the Apache license.
"},{"location":"software-libraries/hypre/#compiling-and-linking-with-hypre","title":"Compiling and linking with HYPRE","text":"module load hypre
To compile and link an application with the HYPRE libraries, load the hypre
module and use the compiler wrappers cc
, CC
, or ftn
in the usual way. The relevant include files and libraries will be introduced automatically.
Two versions of HYPRE are included: one with, and one without, OpenMP. The relevant version will be selected if e.g., -fopenmp
is included in the compile or link stage.
The hypre
module defines the environment variable HYPRE_DIR
which will show the root of the installation for the current programming environment if required.
hypre/2.25.0
installed as default May 2023 (PE 22.12)hypre/2.18.0
recompiled and installed May 2023 (PE 22.12)hypre/2.18.0
installed October 2021 (PE 21.04)hypre/2.18.0
installed January 2021The current supported version on Archer2 can be built using the script from the Archer2 repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/hypre.sh --prefix=/path/to/install/location\n
where the --prefix
option determines the install directory. See the Archer2 github repository for more information."},{"location":"software-libraries/hypre/#resources","title":"Resources","text":"HYPRE home page
The latest HYPRE user manual (HTML)
An older pdf version
HYPRE github repository
"},{"location":"software-libraries/libsci/","title":"HPE Cray LibSci","text":"module load cray-libsci
(note: loaded by default for all users) Cray scientific libraries, available for all compiler choices, provide access to the Fortran BLAS and LAPACK interfaces for basic linear algebra, the corresponding C interfaces CBLAS and LAPACKE, and BLACS and ScaLAPACK for parallel linear algebra. Type man intro_libsci
for further details.
Additionally there is GPU support available via the cray-libsci_acc
module. More information can be found here.
Matio is a library which allows reading and writing matrices in MATLAB MAT format. It is an open source development released under a BSD license.
"},{"location":"software-libraries/matio/#compiling-and-linking-against-matio","title":"Compiling and linking against Matio","text":"module load matio
Load the matio
module and use the standard compiler wrappers cc
, CC
, or ftn
in the usual way. The appropriate header files and libraries will be included automatically via the compiler wrappers.
The matio
module sets the PATH
variable so that the stand-alone utility matdump
can be used. The module also defines MATIO_PATH
which gives the root of the installation if this is needed.
matio/1.5.23
installed May 2023 (PE 22.12)matio/1.5.18
is removed.matio/1.5.18
installed October 2021 (PE 21.04)matio/1.5.18
installed January 2021A version of Matio as currently installed on Archer2 can be compiled using the script available from the Archer2 github repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/matio.sh --prefix=/path/to/install/location\n
where --prefix
defines the location of the installation."},{"location":"software-libraries/matio/#resources","title":"Resources","text":"Matio github repository
"},{"location":"software-libraries/mesa/","title":"Mesa","text":"Mesa is an open-source implementation of OpenGL, Vulkan, and other graphics API specifications, translating them to vendor-specific hardware drivers.
"},{"location":"software-libraries/mesa/#compiling-with-mesa","title":"Compiling with Mesa","text":"module load mesa
To compile an application with the mesa header files, load the mesa
module and use the compiler wrappers in the usual way. The relevant header files will be introduced automatically.
The header files are located in /work/y07/shared/libs/core/mesa/21.0.1/
, and can be included manually at compilation without loading the module if required.
mesa/21.0.1
installed June 2023Build recipe for this module can be found at the HPC-UK github repo
"},{"location":"software-libraries/mesa/#resources","title":"Resources","text":"Mesa home page
"},{"location":"software-libraries/metis/","title":"Metis and Parmetis","text":"The University of Minnesota provide a family of libraries for partitioning graphs and meshes, and computing fill-reducing orderings of sparse matrices. These libraries come broadly under the label of \"Metis\". They are free to use for educational and research purposes.
"},{"location":"software-libraries/metis/#metis","title":"Metis","text":"module load metis
Metis is the sequential library for partitioning problems; it also supplies a number of simple stand-alone utility programs to access the Metis API for graph and mesh partitioning, and graph and mesh manipulation. The stand alone programs typically read a graph or mesh from file which must be in \"metis\" format.
"},{"location":"software-libraries/metis/#compiling-and-linking-with-metis","title":"Compiling and linking with Metis","text":"The Metis library available via module load metis
comes both with and without support for OpenMP. When using the compiler wrappers cc
, CC
, and ftn
, the appropriate version will be selected based on the presence or absence of, e.g., -fopenmp
in the compile or link invocation.
Use, e.g.,
$ cc --cray-print-opts\n
or $ cc -fopenmp --cray-print-opts\n
to see exactly what options are being issued by the compiler wrapper when the metis
module is loaded. Metis is currently provided as static libraries, so it should not be necessary to re-load the metis
module at run time.
The serial utilities (e.g. gpmetis
for graph partitioning) are supplied without OpenMP. These may then be run on the front end for small problems if the metis
module is loaded.
The metis
module defines the environment variable METIS_DIR
which indicates the current location of the Metis installation.
Note the metis
and parmetis
libraries (and dependent modules) have been compiled with the default 32-bit integer indexing, and 4-byte floating point options.
module load parmetis
Parmetis is the distributed memory incarnation of the Metis functionality. As for the metis
module, Parmetis is integrated with use of the compiler wrappers cc
, CC
, and ftn
.
Parmetis depends on the metis
module, which is loaded automatically by the parmetis
module.
The parmetis
module defines the environment variable PARMETIS_DIR
which holds the current location of the Parmetis installation. This variable may not respond to a change of compiler version within a given programming environment. If you wish to use PARMETIS_DIR
in such a context, you may need to (re-)load the parmetis
module after the change of compiler version.
metis/5.1.0
recompiled and installed May 2023 (PE22.12)parmetis/4.0.3
recompiled and installed May 2023 (PE22.12)metis/5.1.0
installed October 2021 (PE21.04)parmetis/4.0.3
installed January 2021 (PE21.04)metis/5.1.0
installed January 2021parmetis/4.0.3
installed January 2021The build procedure used for the Metis and Parmetis libraries on Archer2 is available via github.
"},{"location":"software-libraries/metis/#metis_1","title":"Metis","text":"The latest Archer2 version of Metis can be installed
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n
where --prefix
determines the install location. This will download and install the default version for the current programming environment.
Parmetis can be installed via the same mechanism as Metis:
$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n
The Metis package should be installed first (as above) using the same location. See the Archer2 repository for further details and options."},{"location":"software-libraries/metis/#resources","title":"Resources","text":"-- Metis and Parmetis at github
"},{"location":"software-libraries/mkl/","title":"Intel Math Kernel Library (MKL)","text":"The Intel Math Kernel Library (MKL) contains a variety of optimised numerical libraries including BLAS, LAPACK, ScaLAPACK and FFTW. In general, the exact commands required to build against MKL depend on the details of compiler, environment, requirements for parallelism, and so on. The Intel MKL link line advisor should be consulted.
Some examples are given below. Note that loading the mkl
module will provide the environment variable MKLROOT
which holds the location of the various MKL components.
Warning
The ARCHER2 CSE team have seen that using MKL on ARCHER2 for some software leads to failed regression tests due to numerical differences between reference results and those produced with software using MKL.
We strongly recommend that you use the HPE Cray LibSci and HPE Cray FFTW libraries for software if at all possible rather than MKL. If you do decide to use MKL on ARCHER2, then you should carefully validate results from your software to ensure that it is giving the expected results.
Important
The cray-libsci
module is loaded by default for all users and this module also contains definitions of BLAS, LAPACK and ScaLAPACK routines that conflict with those in MKL. The mkl
module automatically unloads cray-libsci
.
Important
The mkl
module needs to be loaded both at compile time and at runtime (usually in your job submission script).
Tip
MKL only supports the GCC programming environment (PrgEnv-gnu
). Other programming environments may work but this is untested and unsupported on ARCHER2.
Swap modules:
module load PrgEnv-gnu\nmodule load mkl\n
Language Compile options Link options Fortran -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
C/C++ -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
"},{"location":"software-libraries/mkl/#threaded-mkl-with-gcc","title":"Threaded MKL with GCC","text":"Swap modules:
module load PrgEnv-gnu\nmodule load mkl\n
Language Compile options Link options Fortran -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
C/C++ -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
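The sequential and threaded link lines listed in the tables differ only in the threading layer. As a minimal sketch (not an official ARCHER2 recipe), the C/C++ link options can be assembled in a shell variable. The names MKL_THREADING, MKL_LAYER and MKL_LINK, and the fallback path used when MKLROOT is unset, are assumptions made purely for illustration.

```shell
# Hypothetical fallback so the sketch runs without the mkl module loaded;
# on ARCHER2, MKLROOT is set by "module load mkl".
MKLROOT="${MKLROOT:-/path/to/mkl}"

# Choose the threading layer: "sequential" or "threaded" (illustrative name).
MKL_THREADING="${MKL_THREADING:-sequential}"

if [ "$MKL_THREADING" = "threaded" ]; then
    # GNU OpenMP threading layer, as in the threaded table.
    MKL_LAYER="-lmkl_gnu_thread -lmkl_core -lgomp"
else
    # Sequential layer, as in the serial table.
    MKL_LAYER="-lmkl_sequential -lmkl_core"
fi

# C/C++ link options for PrgEnv-gnu, assembled from the pieces above.
MKL_LINK="-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64 ${MKL_LAYER} -lpthread -lm -ldl"

echo "${MKL_LINK}"
```

Setting MKL_THREADING=threaded before running the snippet selects the GNU OpenMP threading layer instead of the sequential one. For Fortran, -lmkl_intel_lp64 would be replaced by -lmkl_gf_lp64, as in the tables.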
"},{"location":"software-libraries/mkl/#mkl-parallel-scalapack-with-gcc","title":"MKL parallel ScaLAPACK with GCC","text":"Swap modules:
module load PrgEnv-gnu\nmodule load mkl\n
Language Compile options Link options Fortran -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm -ldl
C/C++ -m64 -I\"${MKLROOT}/include\"
-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm -ldl
"},{"location":"software-libraries/mumps/","title":"MUMPS","text":"MUMPS is a parallel solver for large sparse systems featuring a 'multifrontal' method. It is developed largely at CERFACS, ENS Lyon, IRIT Toulouse, INRIA, and the University of Bordeaux. It is provided free of charge and is largely under a CeCILL-C license.
"},{"location":"software-libraries/mumps/#compiling-and-linking-with-mumps","title":"Compiling and linking with MUMPS","text":"module load mumps
To compile an application against the MUMPS libraries, load the mumps
module and use the compiler wrappers cc
, CC
, and ftn
in the usual way.
MUMPS is configured to allow Pord, Metis, Parmetis, and Scotch orderings.
Two versions of MUMPS are provided: one with, and one without, OpenMP. The relevant version will be selected if the relevant option is included at the compile stage.
The mumps
module defines MUMPS_DIR
which locates the root of the installation for the current programming environment.
mumps/5.5.1
installed as default May 2023 (PE 22.12)mumps/5.3.5
recompiled May 2023 (PE 22.12)Note: mumps/5.5.1
uses scotch/7.0.3
while mumps/5.3.5
uses scotch/6.1.0
.
mumps/5.3.5
installed October 2021 (PE 21.04)mumps/5.2.1
installed January 2021Known issues: The OpenMP version in PrgEnv-aocc
is not available at the moment.
The current supported version of MUMPS on ARCHER2 can be compiled using a script available from the ARCHER2 GitHub repository.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/scotchv7.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/mumps.sh --prefix=/path/to/install/location\n
where the --prefix
option should be the same for MUMPS and the three dependencies (Metis, Parmetis, and Scotch Version 7). See the Archer2 github repository for further options and details."},{"location":"software-libraries/mumps/#resources","title":"Resources","text":"The MUMPS home page
MUMPS user manual (Version 5.6, pdf)
"},{"location":"software-libraries/netcdf/","title":"NetCDF","text":"The Network Common Data Form NetCDF (and its parallel manifestation NetCDF parallel) is a standard library and data format developed and supported by UCAR, and is released under a BSD-like license.
Both serial and parallel versions are available on ARCHER2 as standard modules:
module load cray-netcdf
(serial version)module load cray-netcdf-hdf5parallel
(MPI parallel version)Note that one should first load the relevant HDF module file, e.g.,
$ module load cray-hdf5\n$ module load cray-netcdf\n
for the serial version. Use module spider
to locate available versions, and use module help
to locate cray-
specific release notes on a particular version.
Known issues:
Upgrade 2023Full system4-cabinet systemcray-netcdf-hdf5parallel
will not operate correctly in PrgEnv-aocc
. One can load module epcc-netcdf-hdf5parallel
instead as a work-around if PrgEnv-aocc
is required.Some general comments and information on serial and parallel I/O to ARCHER2 are given in the section on I/O and file systems.
"},{"location":"software-libraries/netcdf/#resources","title":"Resources","text":"The NetCDF home page.
"},{"location":"software-libraries/petsc/","title":"PETSc","text":"PETSc is a suite of parallel tools for solution of partial differential equations. PETSc is developed at Argonne National Laboratory and is freely available under a BSD 2-clause license.
"},{"location":"software-libraries/petsc/#build","title":"Build","text":"module load petsc
Applications may be linked against PETSc by loading the petsc
module and using the compiler wrappers cc
, CC
, and ftn
in the usual way. Details of options introduced by the compiler wrappers can be examined via, e.g.,
$ cc --cray-print-opts\n
PETSc is configured with Metis, Parmetis, and Scotch orderings, and to support HYPRE, MUMPS, SuperLU, and SuperLU-DIST. PETSc is compiled without OpenMP.
The petsc
module defines the environment variable PETSC_DIR
as the root of the installation if this is required.
petsc/3.18.5
installed as default May 2023 (PE 22.12)petsc/3.14.2
recompiled May 2023 (PE 22.12)Note: PETSc has a number of dependencies; where applicable, the newer version of PETSc depends on the newer module version of each relevant dependency. Check module list
to be sure.
petsc/3.14.2
installed October 2021 (PE 21.04)petsc/3.13.3
installed January 2021Known issues: PETSc is not currently available for PrgEnv-aocc
. There is no HYPRE support in this version.
It is possible to follow the steps used to build the current version on Archer2. These steps are codified at the Archer2 github repository and include a number of dependencies to be built in the correct order:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/hypre.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/scotchv7.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/mumps.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/superlu.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/superlu-dist.sh --prefix=/path/to/install/location\n\n$ module load cray-hdf5\n$ ./sh/petsc.sh --prefix=/path/to/install/location\n
The --prefix
option indicating the install directory should be the same in all cases. See the Archer2 github repository for further details (and options). This will compile version 3.18.5 against the latest module versions of each dependency."},{"location":"software-libraries/petsc/#resources","title":"Resources","text":"PETSc home page
Current PETSc documentation (HTML)
"},{"location":"software-libraries/scotch/","title":"Scotch and PT-Scotch","text":"Scotch and its parallel version PT-Scotch are provided by Labri at the University of Bordeaux and INRIA Bordeaux South-West. They are used for graph partitioning and ordering problems. The libraries are freely available for scientific use under a license similar to the LGPL license.
"},{"location":"software-libraries/scotch/#scotch-and-pt-scotch_1","title":"Scotch and PT-Scotch","text":"module load scotch
The scotch
module provides access to both the Scotch and PT-Scotch libraries via the compiler system. A number of stand-alone utilities are also provided as part of the package.
If the scotch
module is loaded, then applications may be automatically compiled and linked against the libraries for the current programming environment. Check, e.g.,
$ cc --cray-print-opts\n
if you wish to see exactly what options are generated by the compiler wrappers. Scotch and PT-Scotch libraries are provided as static archives only. The compiler wrappers do not give access to the libraries libscotcherrexit.a
or libptscotcherrexit.a
. If you wish to perform your own error handling these libraries must be linked manually.
The scotch
module defines the environment variable SCOTCH_DIR
which holds the root of the installation for a given programming environment. Libraries are present in ${SCOTCH_DIR}/lib
.
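The error-handling libraries mentioned above are not added by the compiler wrappers, so if they are needed they have to be named explicitly at link time. A minimal sketch, assuming the scotch module is loaded so that SCOTCH_DIR is set. The fallback path, the variable name SCOTCH_ERR_LINK, and the file names my_app.c and my_app are hypothetical, and the command is only printed, not run.

```shell
# Hypothetical fallback so the sketch runs without the scotch module loaded;
# on ARCHER2, SCOTCH_DIR is set by "module load scotch".
SCOTCH_DIR="${SCOTCH_DIR:-/path/to/scotch}"

# libscotcherrexit.a is found in ${SCOTCH_DIR}/lib and is not added by the wrappers.
SCOTCH_ERR_LINK="-L${SCOTCH_DIR}/lib -lscotcherrexit"

# Print the compile-and-link command for inspection rather than executing it.
echo "cc my_app.c -o my_app ${SCOTCH_ERR_LINK}"
```

For a PT-Scotch application, -lptscotcherrexit would take the place of -lscotcherrexit.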
Stand-alone applications are also available. See the Scotch and PT-Scotch user manuals for further details.
"},{"location":"software-libraries/scotch/#module-version-history","title":"Module version history","text":"Upgrade 2023Full system4-cabinet systemscotch/7.0.3
installed May 2023 (PE 22.12)scotch/6.1.0
recompiled May 2023 (PE 22.12)Note: scotch/7.0.3
has disabled a number of features including the Metis compatibility layer, and threads, to allow all tests to pass.
Module scotch/6.1.0 installed October 2021 (PE 21.04)
Known issue: a small number of the standard PT-Scotch tests are failing (all programming environments). Symptoms include truncated MPI_Recvs
. This is currently being investigated.
Module scotch/6.0.10
installed January 2021
Known issue: a small number of the standard PT-Scotch tests are failing (all programming environments). Symptoms include truncated MPI_Recvs
. This is currently being investigated.
The build procedure for the Scotch package on Archer2 is available via github.
"},{"location":"software-libraries/scotch/#scotch-and-pt-scotch_2","title":"Scotch and PT-Scotch","text":"The latest Scotch and PT-Scotch libraries are installed on ARCHER2 using the following mechanism:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/scotchv7.sh --prefix=/path/to/install/location\n
where the --prefix
option defines the destination for the install. This script will download, compile and install version 7.0.3. A separate script (scotch.sh
) in the same location is used for version 6."},{"location":"software-libraries/scotch/#resources","title":"Resources","text":"The Scotch home page
Scotch user manual (pdf)
PT-Scotch user manual (pdf)
"},{"location":"software-libraries/slepc/","title":"SLEPC","text":"The Scalable Library for Eigenvalue Problem Computations (SLEPc) is an extension of PETSc developed at the Universitat Politecnica de Valencia. SLEPc is freely available under a 2-clause BSD license.
"},{"location":"software-libraries/slepc/#compiling-and-linking-with-slepc","title":"Compiling and linking with SLEPc","text":"module load slepc
To compile an application against the SLEPc libraries, load the slepc
module and use the compiler wrappers cc
, CC
, and ftn
in the usual way. Static libraries are available so no module is required at run time.
The SLEPc module defines SLEPC_DIR
which locates the root of the installation.
slepc/3.18.3
installed as default May 2023 (PE 22.12)slepc/3.14.1
recompiled May 2023 (PE 22.12)Note: each SLEPc module depends on a PETSc module with the same minor version number.
slepc/3.14.1
installed October 2021 (PE 21.04)slepc/3.13.2
installed January 2021The version of SLEPc currently available on ARCHER2 can be compiled using a script available from the ARCHER2 github repository:
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/slepc.sh --prefix=/path/to/install/location\n
The dependencies (including PETSc) can be built in the same way, or taken from the existing modules. See the ARCHER2 github repository for further information."},{"location":"software-libraries/slepc/#resources","title":"Resources","text":"SLEPc home page
Latest release version of SLEPc user manual (PDF)
SLEPc Gitlab repository
"},{"location":"software-libraries/superlu/","title":"SuperLU and SuperLU_DIST","text":"SuperLU and SuperLU_DIST are libraries for the direct solution of large sparse non-symmetric systems of linear equations, typically by factorisation and back-substitution. The libraries are provided by Lawrence Berkeley National Laboratory and are freely available under a slightly modified BSD-style license.
Two separate modules are provided for SuperLU and SuperLU_DIST.
"},{"location":"software-libraries/superlu/#superlu","title":"SuperLU","text":"module load superlu
This module provides the serial library SuperLU.
"},{"location":"software-libraries/superlu/#compiling-and-linking-with-superlu","title":"Compiling and linking with SuperLU","text":"Compiling and linking SuperLU applications requires no special action beyond module load superlu
and using the standard compiler wrappers cc
, CC
, or ftn
. The exact options issued by the compiler wrapper can be examined via, e.g.,
$ cc --cray-print-opts\n
while the module is loaded. The module defines the environment variable SUPERLU_DIR
as the root location of the installation for a given programming environment.
superlu/5.2.2
recompiled May 2023 (PE 22.12)superlu/5.2.2
installed October 2021 (PE 21.04)superlu/5.2.1
installed January 2021module load superlu-dist
This module provides the distributed memory parallel library SuperLU_DIST both with and without OpenMP.
"},{"location":"software-libraries/superlu/#compiling-and-linking-superlu_dist","title":"Compiling and linking SuperLU_DIST","text":"Use the standard compiler wrappers:
$ cc my_superlu_dist_application.c\n
or $ cc -fopenmp my_superlu_dist_application.c\n
to compile and link against the appropriate libraries. The superlu-dist
module defines the environment variable SUPERLU_DIST_DIR
as the root of the installation for the current programming environment.
superlu-dist/8.1.2
installed as default May 2023 (PE 22.12)superlu-dist/6.4.0
recompiled May 2023 (PE 22.12)superlu-dist/6.4.0
installed October 2021 (PE 21.04)superlu-dist/6.1.1
installed January 2021The build used for Archer2 can be replicated by using the scripts provided at the Archer2 repository.
"},{"location":"software-libraries/superlu/#superlu_1","title":"SuperLU","text":"The current Archer2 supported version may be built via
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ./sh/tpsl/superlu.sh --prefix=/path/to/install/location\n
where the --prefix
option controls the install destination."},{"location":"software-libraries/superlu/#superlu_dist_1","title":"SuperLU_DIST","text":"SuperLU_DIST is configured using Metis and Parmetis, so these should be installed first:
$ ./sh/tpsl/metis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/parmetis.sh --prefix=/path/to/install/location\n$ ./sh/tpsl/superlu_dist.sh --prefix=/path/to/install/location\n
will download, compile, and install the relevant libraries. The install location should be the same for all three packages. See the Archer2 github repository for further options and details."},{"location":"software-libraries/superlu/#resources","title":"Resources","text":"The Supernodal LU project home page
The SuperLU User guide (pdf). This describes both SuperLU and SuperLU_DIST.
The SuperLU github repository
The SuperLU_DIST github repository
"},{"location":"software-libraries/trilinos/","title":"Trilinos","text":"Trilinos is a large collection of packages with software components that can be used for scientific and engineering problems. Most of the packages are released under a BSD license (and some under LGPL).
"},{"location":"software-libraries/trilinos/#compiling-and-linking-against-trilinos","title":"Compiling and linking against Trilinos","text":"module load trilinos
Applications may be built against the module version of Trilinos by using the compiler wrappers CC
or ftn
in the normal way. The appropriate include files and library paths will be inserted automatically. Trilinos is built with OpenMP enabled.
The trilinos
module defines the environment variable TRILINOS_DIR
as the root of the installation for the current programming environment.
Trilinos also provides a small number of stand-alone executables which are available via the standard PATH
mechanism while the module is loaded.
trilinos/12.18.1
recompiled May 2023 (PE 22.12)Note that Trilinos is not currently available for PrgEnv-aocc
.
trilinos/12.18.1
installed October 2021 (PE 21.04)If using AMD compilers, module version aocc/3.0.0
is required.
module trilinos/12.18.1
installed January 2021Known issue
Trilinos is not available in PrgEnv-aocc
at the moment.
Known issue
The ForTrilinos
package is not available in this version.
Packages enabled are: Amesos, Amesos2, Anasazi, AztecOO, Belos, Epetra, EpetraExt, FEI, Galeri, GlobiPack, Ifpack, Ifpack2, Intrepid, Isorropia, Kokkos, Komplex, Mesquite, ML, Moertel, MueLu, NOX, OptiPack, Pamgen, Phalanx, Piro, Pliris, ROL, RTOp, Rythmos, Sacado, Shards, ShyLU, STK, STKSearch, STKTopology, STKUtil, Stratimikos, Teko, Teuchos, Thyra, Tpetra, TrilinosCouplings, Triutils, Xpetra, Zoltan, Zoltan2.
A script which has details of the relevant configuration options for Trilinos is available at the ARCHER2 Github repository. The script will build a static-only version of the libraries.
$ git clone https://github.com/ARCHER2-HPC/pe-scripts.git\n$ cd pe-scripts\n$ git checkout modules-2022-12\n$ ...\n$ ./sh/trilinos.sh --prefix=/path/to/install/location\n
where --prefix
sets the installation location. The ellipsis ...
stands for the dependencies used to build Trilinos, which here are: metis, parmetis, superlu, superlu-dist, scotch, mumps, glm, boost
. These packages should be built as described in their corresponding pages linked in the menu on the left. See the ARCHER2 Github repository for further details.
Note that Trilinos may take up to one hour to compile on its own, and so the compilation is best performed as a batch job.
"},{"location":"software-libraries/trilinos/#resources","title":"Resources","text":"Trilinos home page
Trilinos Github repository
The ARCHER2 User and Best Practice Guide covers all aspects of use of the ARCHER2 service. This includes fundamentals (required by all users to use the system effectively), best practice for getting the most out of ARCHER2 and more technical topics.
The User and Best Practice Guide contains the following sections:
As well as being used for scientific simulations, ARCHER2 can also be used for data pre-/post-processing and analysis. This page provides an overview of the different options for doing so.
"},{"location":"user-guide/analysis/#using-the-login-nodes","title":"Using the login nodes","text":"The easiest way to run non-computationally intensive data analysis is to run directly on the login nodes. However, please remember that the login nodes are a shared resource and should not be used for long-running tasks.
"},{"location":"user-guide/analysis/#example-running-an-r-script-on-a-login-node","title":"Example: Running an R script on a login node","text":"module load cray-R\nRscript example.R\n
"},{"location":"user-guide/analysis/#using-the-compute-nodes","title":"Using the compute nodes","text":"If running on the login nodes is not feasible (e.g. due to memory requirements or computationally intensive analysis), the compute nodes can also be used for data analysis.
Important
This is a more expensive option, as you will be charged for using the entire node, even though your analysis may only be using one core.
"},{"location":"user-guide/analysis/#example-running-an-r-script-on-a-compute-node","title":"Example: Running an R script on a compute node","text":"#!/bin/bash\n#SBATCH --job-name=data_analysis\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nmodule load cray-R\n\nRscript example.R\n
An advantage of this method is that you can use Job chaining to automate the process of analysing your output data once your compute job has finished.
"},{"location":"user-guide/analysis/#using-interactive-jobs","title":"Using interactive jobs","text":"For more interactive analysis, it may be useful to use salloc
to reserve a compute node on which to do your analysis. This allows you to run jobs directly on the compute nodes from the command line without using a job submission script. More information on interactive jobs can be found here.
auser@ln01:> salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 \\\n --time=00:20:00 --partition=standard --qos=short \\\n --account=[budget code]\n
Note
If you want to run for longer than 20 minutes, you will need to use a different QoS as the maximum runtime for the short
QoS is 20 mins.
The data analysis nodes on the ARCHER2 system are designed for large compilations, post-calculation analysis and data manipulation. They should be used for jobs which are too small to require a whole compute node, but which would have an adverse impact on the operation of the login nodes if they were run interactively.
Unlike compute nodes, the data analysis nodes are able to access the home, work, and the RDFaaS file systems. They can also be used to transfer data from a remote system to ARCHER2 and vice versa (using e.g. scp
or rsync
). This can be useful when transferring large amounts of data that might take hours to complete.
The ARCHER2 data analysis nodes can be reached by using the serial
partition and the serial
QoS. Unlike other nodes on ARCHER2, you may only request part of a single node and you will likely be sharing the node with other users.
The data analysis nodes are set up such that you can specify the number of cores you want to use (up to 32 physical cores) and the amount of memory you want for your job (up to 125 GB). You can have multiple jobs running on the data analysis nodes at the same time, but the total number of cores used by those jobs cannot exceed 32, and the total memory used by jobs currently running from a single user cannot exceed 125 GB -- any jobs above this limit will remain pending until your previous jobs are finished.
You do not need to specify both number of cores and memory for jobs on the data analysis nodes. By default, you will get 1984 MiB of memory per core (which is a little less than 2 GB), when specifying cores only, and 1 core when specifying the memory only.
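As a rough illustration of the defaults described above, the following shell sketch works out how much memory a job would receive if it requested cores only. The helper name default_mem_mib is hypothetical and exists only for this example; the 1984 MiB per-core figure is the one quoted above.

```shell
# Default memory per core on the data analysis nodes, in MiB (figure quoted above).
mib_per_core=1984

# Hypothetical helper: default memory in MiB for a job that requests only cores.
default_mem_mib() {
    echo $(( $1 * mib_per_core ))
}

default_mem_mib 1    # prints 1984
default_mem_mib 4    # prints 7936
```

So a job submitted with --ntasks=4 and no --mem flag would, under these defaults, be given 7936 MiB (7.75 GiB) in total.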
Note
Each data analysis node is fitted with 512 GB of memory. However, a small amount of this memory is needed for system processes, which is why we set an upper limit of 125 GB per user (a user is limited to one quarter of the RAM on a node). This is also why the per-core default memory allocation is slightly less than 2 GB.
Note
When running on the data analysis nodes, you must always specify either the number of cores you want, the amount of memory you want, or both. The examples shown below specify the number of cores with the --ntasks
flag and the memory with the --mem
flag. If you only want to specify one of the two, remember to delete the other.
A Slurm batch script for the data analysis nodes looks very similar to one for the compute nodes. The main differences are that you need to use --partition=serial
and --qos=serial
, specify the number of tasks (rather than the number of nodes) and/or specify the amount of memory you want. For example, to use a single core and 4 GB of memory, you would use something like:
#!/bin/bash\n\n# Slurm job options (job-name, job time)\n#SBATCH --job-name=data_analysis\n#SBATCH --time=0:20:0\n#SBATCH --ntasks=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n# Define memory required for this job. By default, you would\n# get just under 2 GB, but you can ask for up to 125 GB.\n#SBATCH --mem=4G\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\nmodule load cray-python\n\npython my_analysis_script.py\n
"},{"location":"user-guide/analysis/#interactive-session-on-the-data-analysis-nodes","title":"Interactive session on the data analysis nodes","text":"There are two ways to start an interactive session on the data analysis nodes: you can either use salloc
to reserve a part of a data analysis node for interactive jobs; or, you can use srun
to open a terminal on the node and run things on the node directly. You can find out more information on the advantages and disadvantages of both of these methods in the Running jobs on ARCHER2 section of the User and Best Practice Guide.
salloc
for interactive access","text":"You can reserve resources on a data analysis node using salloc
. For example, to request 1 core and 4 GB of memory for 20 minutes, you would use:
auser@ln01:~> salloc --time=00:20:00 --partition=serial --qos=serial \\\n --account=[budget code] --ntasks=1 \\\n --mem=4G\n
When you submit this job, your terminal will display something like:
salloc: Pending job allocation 523113\nsalloc: job 523113 queued and waiting for resources\nsalloc: job 523113 has been allocated resources\nsalloc: Granted job allocation 523113\nsalloc: Waiting for resource configuration\nsalloc: Nodes dvn01 are ready for job\n\nauser@ln01:~>\n
It may take some time for your interactive job to start. Once it runs you will enter a standard interactive terminal session (a new shell). Note that this shell is still on the front end (the prompt has not changed). Whilst the interactive session lasts you will be able to run jobs on the data analysis nodes by issuing the srun
command directly at your command prompt. The maximum number of cores and memory you can use is limited by resources requested in the salloc
command (or by the defaults if you did not explicitly ask for particular amounts of resource).
Your session will end when you hit the requested walltime. If you wish to finish before this you should use the exit
command - this will return you to your prompt before you issued the salloc
command.
srun
for interactive access","text":"You can get a command prompt directly on the data analysis nodes by using the srun
command directly. For example, to reserve 1 core and 8 GB of memory, you would use:
auser@ln01:~> srun --time=00:20:00 --partition=serial --qos=serial \\\n --account=[budget code] \\\n --ntasks=1 --mem=8G \\\n --pty /bin/bash\n
The --pty /bin/bash
will cause a new shell to be started on the data analysis node. (This is perhaps closer to what many people consider an 'interactive' job than the salloc
method described above.)
One can now issue shell commands in the usual way.
When finished, type exit
to relinquish the allocation and control will be returned to the front end.
By default, the interactive shell will retain the environment of the parent. If you want a clean shell, remember to specify the --export=none
option to the srun
command.
You can view data on the data analysis nodes by starting an interactive srun
session with the --x11
flag to export the X display back to your local system. For 1 core with 8 GB of memory:
auser@ln01:~> srun --time=00:20:00 --partition=serial --qos=serial \\\n --hint=nomultithread --account=[budget code] \\\n --ntasks=1 --mem=8G --x11 --pty /bin/bash\n
Tip
Data visualisation on ARCHER2 is only possible if you used the -X
or -Y
flag to the ssh
command when logging in to the system.
Singularity can be useful for data analysis, as sites such as DockerHub or SingularityHub contain many pre-built images of data analysis tools that can be simply downloaded and used on ARCHER2. More information about Singularity on ARCHER2 can be found in the Containers section of the User and Best Practice Guide.
"},{"location":"user-guide/analysis/#data-analysis-tools","title":"Data analysis tools","text":"Useful tools for data analysis can be found on the Data Analysis and Tools page.
"},{"location":"user-guide/connecting-totp/","title":"Connecting to ARCHER2","text":"This section covers the basic connection methods.
On the ARCHER2 system, interactive access is achieved using SSH, either directly from a command-line terminal or using an SSH client. In addition, data can be transferred to and from the ARCHER2 system using scp
from the command line or by using a file-transfer client.
Before following the process below, we assume you have set up an account on ARCHER2 through the EPCC SAFE. Documentation on how to do this can be found at:
Linux distributions include a terminal application that can be used for SSH access to the ARCHER2 login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g., GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.
"},{"location":"user-guide/connecting-totp/#macos","title":"MacOS","text":"MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.
"},{"location":"user-guide/connecting-totp/#windows","title":"Windows","text":"A typical Windows installation will not include a terminal client, though there are various clients available. We recommend Windows users download and install MobaXterm to access ARCHER2. It is very easy to use and includes an integrated X Server, which allows you to run graphical applications on ARCHER2.
You can download MobaXterm Home Edition (Installer Edition) from the following link:
Double-click the downloaded Microsoft Installer file (.msi) and follow the instructions from the Windows Installation Wizard. Note, you might need to have administrator rights to install on some versions of Windows. Also, make sure to check whether Windows Firewall has blocked any features of this program after installation (Windows will warn you if the built-in firewall blocks an action, and gives you the opportunity to override the behaviour).
Once installed, start MobaXterm and then click \"Start local terminal\".
Tips
If you download the .zip file rather than the .msi, make sure you unzip it before attempting to run the installer.
If you do not have administrator rights, you can use the Portable edition of MobaXterm.
If this is your first time using MobaXterm, you should check that a permanent /home directory has been set up (otherwise, all saved info will be lost from session to session). Go to \"Settings\" -> \"Configuration\" and check that a path is set in the field marked \"Persistent home directory\". If prompted, make sure the path is set as \"private\".
Any SSH key generated in MobaXterm will, by default, be stored in the permanent /home directory (see above). That is, if your /home directory is _MyDocuments_\\MobaXterm\\home
then within that folder you will find a folder named _MyDocuments_\\MobaXterm\\home\\.ssh
containing your keys. This folder will be 'hidden' by default, so you may need to tick 'Hidden items' under 'View' in Windows Explorer to see it.
MobaXterm also allows you to set up pre-configured SSH sessions with the username, login host and key details saved. You are welcome to use this, rather than using the \"Local terminal\", but we are not able to assist with debugging connection issues if you choose this method.
To access ARCHER2, you need to use two sets of credentials: your SSH key pair protected by a passphrase and a Time-based one-time password. You can find more detailed instructions on how to set up your credentials to access ARCHER2 from Windows, MacOS and Linux below.
"},{"location":"user-guide/connecting-totp/#ssh-key-pairs","title":"SSH Key Pairs","text":"You will need to generate an SSH key pair protected by a passphrase to access ARCHER2.
Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:
$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n| . ...+o++++. |\n| . . . =o.. |\n|+ . . .......o o |\n|oE . . |\n|o = . S |\n|. +.+ . |\n|. oo |\n|. . |\n| .. |\n+-----------------+\n
(remember to replace \"your@email.com\" with your e-mail address).
"},{"location":"user-guide/connecting-totp/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:
Login to SAFE.
Then:
Once you have done this, your SSH key will be added to your ARCHER2 account.
"},{"location":"user-guide/connecting-totp/#mfa-time-based-one-time-passcode-totp-code","title":"MFA Time-based one-time passcode (TOTP code)","text":"Remember, you will need to use both an SSH key and time-based one-time passcode to log into ARCHER2 so you will also need to set up a method for generating a TOTP code before you can log into ARCHER2.
"},{"location":"user-guide/connecting-totp/#first-login-password-required","title":"First login: password required","text":"Important
You will not use your password when logging on to ARCHER2 after the first login for a new account.
As an additional security measure, you will also need to use a password from SAFE for your first login to ARCHER2 with a new account. When you log into ARCHER2 for the first time with a new account, you will be prompted to change your initial password. This is a three-step process:
Your password has now been changed. From this point forwards you will no longer need this password to log into ARCHER2; you will use your SSH key and TOTP code as described above.
"},{"location":"user-guide/connecting-totp/#ssh-clients","title":"SSH Clients","text":"As noted above, you interact with ARCHER2, over an encrypted communication channel (specifically, Secure Shell version 2 (SSH-2)). This allows command-line access to one of the login nodes of ARCHER2, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers, when used in conjunction with an X Server.
"},{"location":"user-guide/connecting-totp/#logging-in","title":"Logging in","text":"The login addresses for ARCHER2 are:
You can use the following command from the terminal window to log in to ARCHER2:
Full system
ssh username@login.archer2.ac.uk\n
The order in which you are asked for credentials depends on the system you are accessing:
Full system
You will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you logged into ARCHER2 with your account before the major upgrade in May/June 2023 you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11)
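A minimal sketch of removing a single numbered line, run here against a throwaway sample file rather than your real ~/.ssh/known_hosts. The file contents and variable names are made up for illustration. On systems with OpenSSH, ssh-keygen -R login.archer2.ac.uk is an alternative that removes every stored key for that host.

```shell
# Build a small sample file that stands in for ~/.ssh/known_hosts (made-up contents).
sample=$(mktemp)
for n in 1 2 3 4 5 6 7 8 9 10 11 12; do
    echo host$n.example.org ssh-ed25519 KEY$n
done > $sample

# Delete line 11, the line number reported in the warning, writing the result to a new file.
cleaned=$(mktemp)
sed 11d $sample > $cleaned

# Count the remaining lines: one fewer than the twelve we started with.
wc -l < $cleaned
```

Writing to a second file keeps the original untouched until you have checked the result; the same sed expression can then be applied to a backup copy of your real file.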
Warning
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_ARCHER2
you would use the command ssh -i keys/id_rsa_ARCHER2 username@login.archer2.ac.uk
to log in (or the equivalent for the 4-cabinet system).
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three-step process:
Your password will now have been changed
To allow remote programs, especially graphical applications, to control your local display, such as for a debugger, use:
Full system
ssh -X username@login.archer2.ac.uk\n
Some sites recommend using the -Y
flag. While this can fix some compatibility issues, the -X
flag is more secure.
Current MacOS systems do not have an X window system. Users should install the XQuartz package to allow for SSH with X11 forwarding on MacOS systems:
Adding the host keys to your SSH configuration file provides an extra level of security for your connections to ARCHER2. The host keys are checked against the login nodes when you log in to ARCHER2 and, if the remote server key does not match the one in the configuration file, the connection will be refused. This provides protection against potential malicious servers masquerading as the ARCHER2 login nodes.
"},{"location":"user-guide/connecting-totp/#loginarcher2acuk","title":"login.archer2.ac.uk","text":"login.archer2.ac.uk ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBANu9BQJ1UFr4nwy8X5seIPgCnBl1TKc8XBq2YVY65qS53QcpzjZAH53/CtvyWkyGcmY8/PWsJo9sXHqzXVSkzk=\n\nlogin.archer2.ac.ukssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDFGGByIrskPayB5xRm3vkWoEc5bVtTCi0oTGslD8m+M1Sc/v2IV6FxaEVXGwO9ErQwrtFQRj0KameLS3Jn0LwQ13Tw+vTXV0bsKyGgEu2wW+BSDijGpbxRZXZrg30TltZXd4VkTuWiE6kyhJ6qiIIR0nwfDblijGy3u079gM5Om/Q2wydwh0iAASRzkqldL5bKDb14Vliy7tCT3TJXI49+qIagWUhNEzyN1j2oK/2n3JdflT4/anQ4jUywVG4D1Tor/evEeSa3h5++gbtgAXZaCtlQbBxwckmTetXqnlI+pvkF0AAuS18Bh+hdmvT1+xW0XLv7CMA64HfR93XgQIIuPqFAS1p+HuJkmk4xFAdwrzjnpYAiU5Apkq+vx3W957/LULzZkeiFQY2Y3CY9oPVR8WBmGKXOOBifhl2Hvd51fH1wd0Lw7Zph53NcVSQQhdDUVhgsPJA3M/+UlqoAMEB/V6ESE2z6yrXVfNjDNbbgA1K548EYpyNR8z4eRtZOoi0=\n\nlogin.archer2.ac.uk ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINyptPmidGmIBYHPcTwzgXknVPrMyHptwBgSbMcoZgh5\n
Host key verification can fail if this key is out of date, a problem which can be fixed by removing the offending entry in ~/.ssh/known_hosts
and replacing it with the new key published here. We recommend that users check this page for any key updates and do not simply accept a new key from the server without confirmation.
Typing in the full command to log in or transfer data to ARCHER2 can become tedious as it often has to be repeated several times. You can use the SSH configuration file, usually located on your local machine at .ssh/config
to make the process more convenient.
Each remote site (or group of sites) can have an entry in this file, which may look something like:
Full system
Host archer2\n HostName login.archer2.ac.uk\n User username\n
(remember to replace username
with your actual username!).
Taking the full-system example: the Host
line defines a short name for the entry. In this case, instead of typing ssh username@login.archer2.ac.uk
to access the ARCHER2 login nodes, you could use ssh archer2
instead. The remaining lines define the options for the host.
HostName login.archer2.ac.uk
--- defines the full address of the host
User username
--- defines the username to use by default for this host (replace username
with your own username on the remote host)
Now you can use SSH to access ARCHER2 without needing to enter your username or the full hostname every time:
ssh archer2\n
You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config manual page (or man ssh_config
on any machine with SSH installed) for a description of the SSH configuration file. For example, you may find the IdentityFile
option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
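For example, an entry that also names the key to use might look like the sketch below. It writes the entry to a scratch file so that nothing in your real ~/.ssh/config is changed; the key path ~/.ssh/id_rsa_ARCHER2 is only an illustrative name.

```shell
# Write the example entry to a scratch file instead of the real ~/.ssh/config.
scratch_config=$(mktemp)
cat > $scratch_config << 'EOF'
Host archer2
    HostName login.archer2.ac.uk
    User username
    IdentityFile ~/.ssh/id_rsa_ARCHER2
EOF

# Show the result; copy the entry into ~/.ssh/config once you are happy with it.
cat $scratch_config
```

With such an entry in place, ssh archer2 would offer that key without needing the -i option on the command line.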
Bug
There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512)
, you will need to either specify the path to your ssh key in the command line (using the -i
option as described above) or add that path to your SSH config file by using the IdentityFile
option.
If you find you are unable to connect to ARCHER2, there are some simple checks you may use to diagnose the issue, which are described below. If you are having difficulties connecting, we suggest trying these before contacting the ARCHER2 Service Desk.
"},{"location":"user-guide/connecting-totp/#use-the-userloginarcher2acuk-syntax-rather-than-l-user-loginarcher2acuk","title":"Use theuser@login.archer2.ac.uk
syntax rather than -l user login.archer2.ac.uk
","text":"We have seen a number of instances where people using the syntax
ssh -l user login.archer2.ac.uk\n
have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:
ssh user@login.archer2.ac.uk\n
works more reliably.
"},{"location":"user-guide/connecting-totp/#can-you-connect-to-the-login-node","title":"Can you connect to the login node?","text":"Try the command ping -c 3 login.archer2.ac.uk
, on Linux or MacOS, or ping -n 3 login.archer2.ac.uk
on Windows. If you successfully connect to the login node, the output should include:
--- login.archer2.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n
(the ping time '38ms' is not important). If not all packets are received there could be a problem with your Internet connection, or the login node could be unavailable.
"},{"location":"user-guide/connecting-totp/#ssh-key","title":"SSH key","text":"If you get the error message Permission denied (publickey)
, this may indicate a problem with your SSH key. Some things to check:
Have you uploaded the key to SAFE? Please note that if the same key is re-uploaded, SAFE will not map the \"new\" key to ARCHER2. If for some reason this is required, please delete the key first, then re-upload.
Is SSH using the correct key? You can check which keys are being found and offered by SSH using ssh -vvv
. If your private key has a non-default name, you should use the -i
option to provide it to ssh. For example, ssh -i path/to/key username@login.archer2.ac.uk
.
Are you entering the passphrase correctly? You will be asked for your private key's passphrase first. If you enter it incorrectly you will usually be asked to enter it again (usually you will get three chances, after which SSH will fail with Permission denied (publickey)
). If you would like to confirm your passphrase without attempting to connect, you can use ssh-keygen -y -f /path/to/private/key
. If successful, this command will print the corresponding public key. You can also use this to check that you have uploaded the correct public key to SAFE.
Are permissions correct on the SSH key? One common issue is that the permissions are set incorrectly on either the key files or the directory they are contained in. On Linux and MacOS, if your private keys are held in ~/.ssh/
you can check this with ls -al ~/.ssh
. This should give something similar to the following output:
$ ls -al ~/.ssh/\n drwx------. 2 user group 48 Jul 15 20:24 .\n drwx------. 12 user group 4096 Oct 13 12:11 ..\n -rw-------. 1 user group 113 Jul 15 20:23 authorized_keys\n -rw-------. 1 user group 12686 Jul 15 20:23 id_rsa\n -rw-r--r--. 1 user group 2785 Jul 15 20:23 id_rsa.pub\n -rw-r--r--. 1 user group 1967 Oct 13 14:11 known_hosts\n
The important section here is the string of letters and dashes at the start, for the lines ending in .
, id_rsa
, and id_rsa.pub
, which indicate permissions on the containing directory, private key, and public key, respectively. If your permissions are not correct, they can be set with chmod
. Consult the table below for the relevant chmod
command.
Target | Permissions | chmod Code
Directory | drwx------ | 700
Private Key | -rw------- | 600
Public Key | -rw-r--r-- | 644
chmod
can be used to set permissions on the target in the following way: chmod <code> <target>
. So for example to set correct permissions on the private key file id_rsa_ARCHER2
, use the command chmod 600 id_rsa_ARCHER2
.
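The following sketch applies the three codes from the table in a throwaway directory, so no real keys are touched, and then prints the resulting permission strings. The directory and variable names are only illustrative.

```shell
# Work in a throwaway directory so that no real keys are touched.
demo_dir=$(mktemp -d)
touch $demo_dir/id_rsa_ARCHER2 $demo_dir/id_rsa_ARCHER2.pub

chmod 700 $demo_dir                       # containing directory
chmod 600 $demo_dir/id_rsa_ARCHER2        # private key
chmod 644 $demo_dir/id_rsa_ARCHER2.pub    # public key

# Keep only the ten-character permission string from each listing.
dir_perms=$(ls -ld $demo_dir | cut -c1-10)
private_perms=$(ls -l $demo_dir/id_rsa_ARCHER2 | cut -c1-10)
public_perms=$(ls -l $demo_dir/id_rsa_ARCHER2.pub | cut -c1-10)
echo $dir_perms $private_perms $public_perms
```

The three strings printed should match the Directory, Private Key and Public Key rows of the table above.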
On Windows, permissions are handled differently but can be set by right-clicking on the file and selecting Properties > Security > Advanced. The user, SYSTEM, and Administrators should have Full control
, and no other permissions should exist for both the public and private key files, as well as the containing folder.
Tip
Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -
, or directory d
. The next three characters indicate the owning user's permissions. The first character is r
if they have read permission, -
if they don't, the second character is w
if they have write permission, -
if they don't, the third character is x
if they have execute permission, -
if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r--
indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod
codes are constructed by treating the user, group, and other permission strings as three-bit binary numbers, then converting each one to a decimal digit. For example the permission string -rwx------
becomes 111 000 000
-> 700
.
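The conversion described in this tip can be sketched as a small shell function. The name perm_to_code is hypothetical and the function is only meant to illustrate the arithmetic; it is not part of any ARCHER2 tooling.

```shell
# Hypothetical helper: turn a permission string such as -rw-r--r-- into its chmod code.
perm_to_code() {
    perms=$1
    # Drop the leading file-type character if the full ten-character form was given.
    if [ ${#perms} -eq 10 ]; then
        perms=${perms#?}
    fi
    code=''
    for start in 1 4 7; do
        triplet=$(echo $perms | cut -c$start-$((start + 2)))
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac   # read    = binary 100
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac   # write   = binary 010
        case $triplet in ??x) digit=$((digit + 1)) ;; esac   # execute = binary 001
        code=$code$digit
    done
    echo $code
}

perm_to_code -rwx------    # prints 700
perm_to_code -rw-r--r--    # prints 644
```

Each group of three characters contributes one digit, which is why a private key that only its owner can read and write (-rw-------) corresponds to 600.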
If your TOTP passcode is being consistently rejected, you can remove MFA from your account and then re-enable it.
"},{"location":"user-guide/connecting-totp/#ssh-verbose-output","title":"SSH verbose output","text":"The verbose-debugging output from ssh
can be very useful for diagnosing issues. In particular, it can be used to distinguish between problems with the SSH key and password. To enable verbose output, add the -vvv
flag to your SSH command. For example:
ssh -vvv username@login.archer2.ac.uk\n
The output is lengthy, but somewhere in there you should see lines similar to the following:
debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: password, keyboard-interactive\n
In the text above, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authenticated with partial success
indicates that the SSH key has been accepted. By default SSH will go through a list of standard private-key files, as well as any you have specified with -i
or a config file. To succeed, one of these private keys needs to match the public key uploaded to SAFE.
If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey)
. If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.
You should next see something similar to:
debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (keyboard-interactive).\n
If you do not see the Password:
prompt you may have connection issues, or there could be a problem with the ARCHER2 login nodes. If you do not see Authentication succeeded (keyboard-interactive)
it means your password was not accepted. You will be asked to re-enter your password, usually two more times before the connection will be rejected. Consider the suggestions under Password above. If you do see Authentication succeeded (keyboard-interactive)
, it means your password was accepted and authentication is complete.
The equivalent information can be obtained in PuTTY by enabling All Logging in settings.
"},{"location":"user-guide/connecting-totp/#related-software","title":"Related Software","text":""},{"location":"user-guide/connecting-totp/#tmux","title":"tmux","text":"tmux is a multiplexer application available on the ARCHER2 login nodes. It allows for multiple sessions to be open concurrently and these sessions can be detached and run in the background. Furthermore, sessions will continue to run after a user logs off and can be reattached to upon logging in again. It is particularly useful if you are connecting to ARCHER2 on an unstable Internet connection or if you wish to keep an arrangement of terminal applications running while you disconnect your client from the Internet -- for example, when moving between your home and workplace.
"},{"location":"user-guide/connecting/","title":"Connecting to ARCHER2","text":"This section covers the basic connection methods.
On the ARCHER2 system, interactive access is achieved using SSH, either directly from a command-line terminal or using an SSH client. In addition, data can be transferred to and from the ARCHER2 system using scp
from the command line or by using a file-transfer client.
Before following the process below, we assume you have set up an account on ARCHER2 through the EPCC SAFE. Documentation on how to do this can be found at:
Linux distributions include a terminal application that can be used for SSH access to the ARCHER2 login nodes. Linux users will have different terminals depending on their distribution and window manager (e.g., GNOME Terminal in GNOME, Konsole in KDE). Consult your Linux distribution's documentation for details on how to load a terminal.
"},{"location":"user-guide/connecting/#macos","title":"MacOS","text":"MacOS users can use the Terminal application, located in the Utilities folder within the Applications folder.
"},{"location":"user-guide/connecting/#windows","title":"Windows","text":"A typical Windows installation will not include a terminal client, though there are various clients available. We recommend Windows users download and install MobaXterm to access ARCHER2. It is very easy to use and includes an integrated X Server, which allows you to run graphical applications on ARCHER2.
You can download MobaXterm Home Edition (Installer Edition) from the following link:
Double-click the downloaded Microsoft Installer file (.msi) and follow the instructions from the Windows Installation Wizard. Note, you might need to have administrator rights to install on some versions of Windows. Also, make sure to check whether Windows Firewall has blocked any features of this program after installation (Windows will warn you if the built-in firewall blocks an action, and gives you the opportunity to override the behaviour).
Once installed, start MobaXterm and then click \"Start local terminal\".
Tips
If you download the .zip file rather than the .msi, make sure you unzip it before attempting to run the installer.
If you do not have administrator rights, you can use the Portable edition of MobaXterm.
If this is your first time using MobaXterm, you should check that a permanent /home directory has been set up (otherwise, all saved info will be lost from session to session). Go to \"Settings\" -> \"Configuration\" and check that a path is set in the field marked \"Persistent home directory\". If prompted, make sure the path is set as \"private\".
Any SSH key generated in MobaXterm will, by default, be stored in the permanent /home directory (see above). That is, if your /home directory is _MyDocuments_\\MobaXterm\\home
then within that folder you will find a folder named _MyDocuments_\\MobaXterm\\home\\.ssh
containing your keys. This folder will be 'hidden' by default, so you may need to tick 'Hidden items' under 'View' in Windows Explorer to see it.
MobaXterm also allows you to set up pre-configured SSH sessions with the username, login host and key details saved. You are welcome to use this, rather than using the \"Local terminal\", but we are not able to assist with debugging connection issues if you choose this method.
To access ARCHER2, you need to use two sets of credentials: your SSH key pair protected by a passphrase and a Time-based one-time password. You can find more detailed instructions on how to set up your credentials to access ARCHER2 from Windows, MacOS and Linux below.
"},{"location":"user-guide/connecting/#ssh-key-pairs","title":"SSH Key Pairs","text":"You will need to generate an SSH key pair protected by a passphrase to access ARCHER2.
Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:
$ ssh-keygen -t rsa -C \"your@email.com\"\n...\n-bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\nEnter passphrase (empty for no passphrase): [Passphrase]\nEnter same passphrase again: [Passphrase]\nYour identification has been saved in /Home/user/.ssh/id_rsa.\nYour public key has been saved in /Home/user/.ssh/id_rsa.pub.\nThe key fingerprint is:\n03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\nThe key's randomart image is:\n+--[ RSA 2048]----+\n| . ...+o++++. |\n| . . . =o.. |\n|+ . . .......o o |\n|oE . . |\n|o = . S |\n|. +.+ . |\n|. oo |\n|. . |\n| .. |\n+-----------------+\n
(remember to replace \"your@email.com\" with your e-mail address).
"},{"location":"user-guide/connecting/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:
Login to SAFE.
Then:
Once you have done this, your SSH key will be added to your ARCHER2 account.
"},{"location":"user-guide/connecting/#mfa-time-based-one-time-passcode-totp-code","title":"MFA Time-based one-time passcode (TOTP code)","text":"Remember, you will need to use both an SSH key and time-based one-time passcode to log into ARCHER2 so you will also need to set up a method for generating a TOTP code before you can log into ARCHER2.
"},{"location":"user-guide/connecting/#first-login-password-required","title":"First login: password required","text":"Important
You will not use your password when logging on to ARCHER2 after the first login for a new account.
As an additional security measure, you will also need to use a password from SAFE for your first login to ARCHER2 with a new account. When you log into ARCHER2 for the first time with a new account, you will be prompted to change your initial password. This is a three step process:
Your password has now been changed. You will no longer need this password to log into ARCHER2 from this point forwards; instead, you will use your SSH key and TOTP code as described above.
"},{"location":"user-guide/connecting/#ssh-clients","title":"SSH Clients","text":"As noted above, you interact with ARCHER2 over an encrypted communication channel (specifically, Secure Shell version 2 (SSH-2)). This allows command-line access to one of the login nodes of ARCHER2, from which you can run commands or use a command-line text editor to edit files. SSH can also be used to run graphical programs such as GUI text editors and debuggers, when used in conjunction with an X Server.
"},{"location":"user-guide/connecting/#logging-in","title":"Logging in","text":"The login addresses for ARCHER2 are:
You can use the following command from the terminal window to log in to ARCHER2:
Full system
ssh username@login.archer2.ac.uk\n
The order in which you are asked for credentials depends on the system you are accessing:
Full system
You will first be prompted for the passphrase associated with your SSH key pair. Once you have entered this passphrase successfully, you will then be prompted for your machine account password. You need to enter both credentials correctly to be able to access ARCHER2.
Tip
If you logged into ARCHER2 with your account before the major upgrade in May/June 2023 you may see an error from SSH that looks like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nThe ECDSA host key for login.archer2.ac.uk has changed,\nand the key for the corresponding IP address 193.62.216.43\nhas a different value. This could either mean that\nDNS SPOOFING is happening or the IP address for the host\nand its host key have changed at the same time.\nOffending key for IP in /Users/auser/.ssh/known_hosts:11\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\nIT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!\nSomeone could be eavesdropping on you right now (man-in-the-middle attack)!\nIt is also possible that a host key has just been changed.\nThe fingerprint for the ECDSA key sent by the remote host is\nSHA256:UGS+LA8I46LqnD58WiWNlaUFY3uD1WFr+V8RCG09fUg.\nPlease contact your system administrator.\n
If you see this, you should delete the offending host key from your ~/.ssh/known_hosts
file (in the example above the offending line is line #11).
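As a minimal sketch of removing one line by number, the commands below work on a throwaway file rather than your real ~/.ssh/known_hosts; the file name demo_known_hosts and its contents are made up for illustration. On systems with OpenSSH, ssh-keygen -R login.archer2.ac.uk is an alternative that removes every entry for that host.

```shell
# Work in a scratch directory so the real known_hosts file is untouched.
cd $(mktemp -d)

# Build a stand-in file with 12 numbered entries (hypothetical content).
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    echo host$i.example.org ssh-ed25519 KEY$i
done > demo_known_hosts

# Delete line 11, the line number reported as the offending key.
sed 11d demo_known_hosts > demo_known_hosts.new
mv demo_known_hosts.new demo_known_hosts

# The entry that was on line 11 is gone and 11 lines remain.
wc -l < demo_known_hosts
```

To apply the same idea to your own file, replace demo_known_hosts with the path and line number given in the SSH warning, and keep a backup copy first.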
Warning
If your SSH key pair is not stored in the default location (usually ~/.ssh/id_rsa
) on your local system, you may need to specify the path to the private part of the key with the -i
option to ssh
. For example, if your key is in a file called keys/id_rsa_ARCHER2
you would use the command ssh -i keys/id_rsa_ARCHER2 username@login.archer2.ac.uk
to log in (or the equivalent for the 4-cabinet system).
Tip
When you first log into ARCHER2, you will be prompted to change your initial password. This is a three-step process:
Your password will now have been changed
To allow remote programs, especially graphical applications, to control your local display, such as for a debugger, use:
Full system
ssh -X username@login.archer2.ac.uk\n
Some sites recommend using the -Y
flag. While this can fix some compatibility issues, the -X
flag is more secure.
Current MacOS systems do not have an X window system. Users should install the XQuartz package to allow for SSH with X11 forwarding on MacOS systems:
Adding the host keys to your SSH configuration file provides an extra level of security for your connections to ARCHER2. The host keys are checked against the login nodes when you log in to ARCHER2 and if the remote server key does not match the one in the configuration file, the connection will be refused. This provides protection against potential malicious servers masquerading as the ARCHER2 login nodes.
"},{"location":"user-guide/connecting/#loginarcher2acuk","title":"login.archer2.ac.uk","text":"login.archer2.ac.uk ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBANu9BQJ1UFr4nwy8X5seIPgCnBl1TKc8XBq2YVY65qS53QcpzjZAH53/CtvyWkyGcmY8/PWsJo9sXHqzXVSkzk=\n\nlogin.archer2.ac.uk ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDFGGByIrskPayB5xRm3vkWoEc5bVtTCi0oTGslD8m+M1Sc/v2IV6FxaEVXGwO9ErQwrtFQRj0KameLS3Jn0LwQ13Tw+vTXV0bsKyGgEu2wW+BSDijGpbxRZXZrg30TltZXd4VkTuWiE6kyhJ6qiIIR0nwfDblijGy3u079gM5Om/Q2wydwh0iAASRzkqldL5bKDb14Vliy7tCT3TJXI49+qIagWUhNEzyN1j2oK/2n3JdflT4/anQ4jUywVG4D1Tor/evEeSa3h5++gbtgAXZaCtlQbBxwckmTetXqnlI+pvkF0AAuS18Bh+hdmvT1+xW0XLv7CMA64HfR93XgQIIuPqFAS1p+HuJkmk4xFAdwrzjnpYAiU5Apkq+vx3W957/LULzZkeiFQY2Y3CY9oPVR8WBmGKXOOBifhl2Hvd51fH1wd0Lw7Zph53NcVSQQhdDUVhgsPJA3M/+UlqoAMEB/V6ESE2z6yrXVfNjDNbbgA1K548EYpyNR8z4eRtZOoi0=\n\nlogin.archer2.ac.uk ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINyptPmidGmIBYHPcTwzgXknVPrMyHptwBgSbMcoZgh5\n
Host key verification can fail if this key is out of date, a problem which can be fixed by removing the offending entry in ~/.ssh/known_hosts
and replacing it with the new key published here. We recommend users should check this page for any key updates and not just accept a new key from the server without confirmation.
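One way to confirm that a local entry matches a published key is an exact, fixed-string comparison. The sketch below does this against a scratch file (demo_known_hosts, a hypothetical name) using the ssh-ed25519 key listed above; to check your real file, point grep at ~/.ssh/known_hosts instead. It assumes host names are stored unhashed, that is, the HashKnownHosts option is not enabled.

```shell
# Scratch directory; nothing here touches your real SSH configuration.
cd $(mktemp -d)

# The ssh-ed25519 host key published above, saved as the reference copy.
cat > published_key <<'EOF'
login.archer2.ac.uk ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINyptPmidGmIBYHPcTwzgXknVPrMyHptwBgSbMcoZgh5
EOF

# Stand-in for ~/.ssh/known_hosts containing the same entry.
cp published_key demo_known_hosts

# -F: fixed string, -x: whole line must match, -f: read patterns from a file.
if grep -q -x -F -f published_key demo_known_hosts; then
    echo host key matches published value
else
    echo host key missing or different
fi
```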
Typing in the full command to log in or transfer data to ARCHER2 can become tedious as it often has to be repeated several times. You can use the SSH configuration file, usually located on your local machine at .ssh/config
to make the process more convenient.
Each remote site (or group of sites) can have an entry in this file, which may look something like:
Full system
Host archer2\n HostName login.archer2.ac.uk\n User username\n
(remember to replace username
with your actual username!).
Taking the full-system example: the Host
line defines a short name for the entry. In this case, instead of typing ssh username@login.archer2.ac.uk
to access the ARCHER2 login nodes, you could use ssh archer2
instead. The remaining lines define the options for the host.
HostName login.archer2.ac.uk
--- defines the full address of the host
User username
--- defines the username to use by default for this host (replace username
with your own username on the remote host)
Now you can use SSH to access ARCHER2 without needing to enter your username or the full hostname every time:
ssh archer2\n
You can set up as many of these entries as you need in your local configuration file. Other options are available. See the ssh_config manual page (or man ssh_config
on any machine with SSH installed) for a description of the SSH configuration file. For example, you may find the IdentityFile
option useful if you have to manage multiple SSH key pairs for different systems as this allows you to specify which SSH key to use for each system.
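For instance, an entry that also selects a specific key might look like the sketch below. The key path ~/.ssh/id_rsa_ARCHER2 is an assumed example, and the entry is written to a scratch file (demo_ssh_config, a hypothetical name) rather than your real configuration file so that you can inspect it first.

```shell
# Scratch directory so the real ~/.ssh/config is not modified.
cd $(mktemp -d)

# Write an example entry; copy it into ~/.ssh/config once you are happy with it.
cat > demo_ssh_config <<'EOF'
Host archer2
    HostName login.archer2.ac.uk
    User username
    IdentityFile ~/.ssh/id_rsa_ARCHER2
EOF

# Show the options defined for the entry.
cat demo_ssh_config
```

With an entry like this in your real configuration file, ssh archer2 would offer that key without needing the -i option. The scratch file can also be tried directly with ssh -F demo_ssh_config archer2.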
Bug
There is a known bug with Windows ssh-agent. If you get the error message: Warning: agent returned different signature type ssh-rsa (expected rsa-sha2-512)
, you will need to either specify the path to your ssh key in the command line (using the -i
option as described above) or add that path to your SSH config file by using the IdentityFile
option.
If you find you are unable to connect to ARCHER2, there are some simple checks you may use to diagnose the issue, which are described below. If you are having difficulties connecting, we suggest trying these before contacting the ARCHER2 Service Desk.
"},{"location":"user-guide/connecting/#use-the-userloginarcher2acuk-syntax-rather-than-l-user-loginarcher2acuk","title":"Use the user@login.archer2.ac.uk
syntax rather than -l user login.archer2.ac.uk
","text":"We have seen a number of instances where people using the syntax
ssh -l user login.archer2.ac.uk\n
have not been able to connect properly and get prompted for a password many times. We have found that using the alternative syntax:
ssh user@login.archer2.ac.uk\n
works more reliably.
"},{"location":"user-guide/connecting/#can-you-connect-to-the-login-node","title":"Can you connect to the login node?","text":"Try the command ping -c 3 login.archer2.ac.uk
, on Linux or MacOS, or ping -n 3 login.archer2.ac.uk
on Windows. If you successfully connect to the login node, the output should include:
--- login.archer2.ac.uk ping statistics ---\n3 packets transmitted, 3 received, 0% packet loss, time 38ms\n
(the ping time '38ms' is not important). If not all packets are received there could be a problem with your Internet connection, or the login node could be unavailable.
"},{"location":"user-guide/connecting/#ssh-key","title":"SSH key","text":"If you get the error message Permission denied (publickey)
, this may indicate a problem with your SSH key. Some things to check:
Have you uploaded the key to SAFE? Please note that if the same key is re-uploaded, SAFE will not map the \"new\" key to ARCHER2. If for some reason this is required, please delete the key first, then re-upload.
Is SSH using the correct key? You can check which keys are being found and offered by SSH using ssh -vvv
. If your private key has a non-default name, you should use the -i
option to provide it to ssh. For example, ssh -i path/to/key username@login.archer2.ac.uk
.
Are you entering the passphrase correctly? You will be asked for your private key's passphrase first. If you enter it incorrectly you will usually be asked to enter it again (usually you will get three chances, after which SSH will fail with Permission denied (publickey)
). If you would like to confirm your passphrase without attempting to connect, you can use ssh-keygen -y -f /path/to/private/key
. If successful, this command will print the corresponding public key. You can also use this to check that you have uploaded the correct public key to SAFE.
Are permissions correct on the SSH key? One common issue is that the permissions are set incorrectly on either the key files or the directory they are contained in. On Linux and MacOS, if your private keys are held in ~/.ssh/
you can check this with ls -al ~/.ssh
. This should give something similar to the following output:
$ ls -al ~/.ssh/\n drwx------. 2 user group 48 Jul 15 20:24 .\n drwx------. 12 user group 4096 Oct 13 12:11 ..\n -rw-------. 1 user group 113 Jul 15 20:23 authorized_keys\n -rw-------. 1 user group 12686 Jul 15 20:23 id_rsa\n -rw-r--r--. 1 user group 2785 Jul 15 20:23 id_rsa.pub\n -rw-r--r--. 1 user group 1967 Oct 13 14:11 known_hosts\n
The important section here is the string of letters and dashes at the start, for the lines ending in .
, id_rsa
, and id_rsa.pub
, which indicate permissions on the containing directory, private key, and public key, respectively. If your permissions are not correct, they can be set with chmod
. Consult the table below for the relevant chmod
command.
Target | Permissions | chmod Code
Directory | drwx------ | 700
Private Key | -rw------- | 600
Public Key | -rw-r--r-- | 644
chmod
can be used to set permissions on the target in the following way: chmod <code> <target>
. So for example to set correct permissions on the private key file id_rsa_ARCHER2
, use the command chmod 600 id_rsa_ARCHER2
.
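As an illustration of the three codes together, the following sketch applies them inside a scratch directory and lists the result. The directory and the empty stand-in key files are created only for this example; on a real system you would run the chmod commands against ~/.ssh and your own key files.

```shell
# Scratch directory standing in for ~/.ssh, with empty stand-in key files.
cd $(mktemp -d)
touch id_rsa_ARCHER2 id_rsa_ARCHER2.pub

chmod 700 .                      # directory: owner only
chmod 600 id_rsa_ARCHER2         # private key: owner read/write
chmod 644 id_rsa_ARCHER2.pub     # public key: readable by everyone

# The first column of the listing shows the resulting permission strings.
ls -al
```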
On Windows, permissions are handled differently but can be set by right-clicking on the file and selecting Properties > Security > Advanced. The user, SYSTEM, and Administrators should have Full control
, and no other permissions should exist for both the public and private key files, as well as the containing folder.
Tip
Unix file permissions can be understood in the following way. There are three groups that can have file permissions: (owning) users, (owning) groups, and others. The available permissions are read, write, and execute. The first character indicates whether the target is a file -
, or directory d
. The next three characters indicate the owning user's permissions. The first character is r
if they have read permission, -
if they don't, the second character is w
if they have write permission, -
if they don't, the third character is x
if they have execute permission, -
if they don't. This pattern is then repeated for group, and other permissions. For example the pattern -rw-r--r--
indicates that the owning user can read and write the file, members of the owning group can read it, and anyone else can also read it. The chmod
codes are constructed by treating the user, group, and other permission strings as three-bit binary numbers, then converting each one to a decimal digit. For example the permission string -rwx------
becomes 111 000 000
-> 700
.
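The conversion can be sketched as a small shell function. This is an illustrative helper, not a standard tool: the name perm_to_code is made up, and it assumes a plain 10-character permission string as printed by ls -l (no special bits such as setuid). Each of r, w and x contributes its binary place value (4, 2 and 1) to the digit for its triplet.

```shell
# Convert a permission string such as -rw-r--r-- to its chmod code.
perm_to_code() {
    perms=${1#?}              # drop the leading file-type character
    code=
    for start in 1 4 7; do    # user, group, other triplets
        triplet=$(echo $perms | cut -c $start-$((start + 2)))
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac
        case $triplet in ??x) digit=$((digit + 1)) ;; esac
        code=$code$digit
    done
    echo $code
}

perm_to_code -rwx------    # prints 700
perm_to_code -rw-r--r--    # prints 644
```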
If your TOTP passcode is being consistently rejected, you can remove MFA from your account and then re-enable it.
"},{"location":"user-guide/connecting/#ssh-verbose-output","title":"SSH verbose output","text":"The verbose-debugging output from ssh
can be very useful for diagnosing issues. In particular, it can be used to distinguish between problems with the SSH key and password. To enable verbose output, add the -vvv
flag to your SSH command. For example:
ssh -vvv username@login.archer2.ac.uk\n
The output is lengthy, but somewhere in there you should see lines similar to the following:
debug1: Next authentication method: publickey\ndebug1: Offering public key: RSA SHA256:<key_hash> <path_to_private_key>\ndebug3: send_pubkey_test\ndebug3: send packet: type 50\ndebug2: we sent a publickey packet, wait for reply\ndebug3: receive packet: type 60\ndebug1: Server accepts key: pkalg rsa-sha2-512 blen 2071\ndebug2: input_userauth_pk_ok: fp SHA256:<key_hash>\ndebug3: sign_and_send_pubkey: RSA SHA256:<key_hash>\nEnter passphrase for key '<path_to_private_key>':\ndebug3: send packet: type 50\ndebug3: receive packet: type 51\nAuthenticated with partial success.\ndebug1: Authentications that can continue: password, keyboard-interactive\n
In the text above, you can see which files ssh has checked for private keys, and you can see if any key is accepted. The line Authenticated with partial success
indicates that the SSH key has been accepted. By default SSH will go through a list of standard private-key files, as well as any you have specified with -i
or a config file. To succeed, one of these private keys needs to match the public key uploaded to SAFE.
If your SSH key passphrase is incorrect, you will be asked to try again up to three times in total, before being disconnected with Permission denied (publickey)
. If you enter your passphrase correctly, but still see this error message, please consider the advice under SSH key above.
You should next see something similar to:
debug1: Next authentication method: keyboard-interactive\ndebug2: userauth_kbdint\ndebug3: send packet: type 50\ndebug2: we sent a keyboard-interactive packet, wait for reply\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 1\nPassword:\ndebug3: send packet: type 61\ndebug3: receive packet: type 60\ndebug2: input_userauth_info_req\ndebug2: input_userauth_info_req: num_prompts 0\ndebug3: send packet: type 61\ndebug3: receive packet: type 52\ndebug1: Authentication succeeded (keyboard-interactive).\n
If you do not see the Password:
prompt you may have connection issues, or there could be a problem with the ARCHER2 login nodes. If you do not see Authentication succeeded (keyboard-interactive)
it means your password was not accepted. You will be asked to re-enter your password, usually two more times, before the connection is rejected. Consider the suggestions under Password above. If you do see Authentication succeeded (keyboard-interactive)
, it means your password was accepted and both authentication steps have completed.
The equivalent information can be obtained in PuTTY by enabling All Logging in settings.
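Because the verbose output is long, it can help to save it to a file and filter for the lines discussed above. The sketch below does the filtering on a small sample log built from the excerpts in this section; the file name ssh-debug.log is an assumption, and on a real system you might capture the output with ssh -vvv username@login.archer2.ac.uk 2> ssh-debug.log.

```shell
# Scratch directory for the sample log.
cd $(mktemp -d)

# Sample log assembled from the excerpts above (not real output).
cat > ssh-debug.log <<'EOF'
debug1: Next authentication method: publickey
debug1: Server accepts key: pkalg rsa-sha2-512 blen 2071
Authenticated with partial success.
debug1: Next authentication method: keyboard-interactive
debug1: Authentication succeeded (keyboard-interactive).
EOF

# Keep only the lines that show how each authentication step went.
grep -E 'Server accepts key|Authenticated with partial success|Authentication succeeded|Permission denied' ssh-debug.log
```

If the filtered output stops before the key is accepted, the SSH key checks above are the place to start; if it stops after the partial success line, the problem is more likely with the second credential.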
"},{"location":"user-guide/connecting/#related-software","title":"Related Software","text":""},{"location":"user-guide/connecting/#tmux","title":"tmux","text":"tmux is a multiplexer application available on the ARCHER2 login nodes. It allows for multiple sessions to be open concurrently and these sessions can be detached and run in the background. Furthermore, sessions will continue to run after a user logs off and can be reattached to upon logging in again. It is particularly useful if you are connecting to ARCHER2 on an unstable Internet connection or if you wish to keep an arrangement of terminal applications running while you disconnect your client from the Internet -- for example, when moving between your home and workplace.
"},{"location":"user-guide/containers/","title":"Containers","text":"This page was originally based on the documentation at the University of Sheffield HPC service
Designed around the notion of mobility of compute and reproducible science, Singularity enables users to have full control of their operating system environment. This means that a non-privileged user can \"swap out\" the Linux operating system and environment on the host for a Linux OS and environment that they control. So if the host system is running CentOS Linux but your application runs in Ubuntu Linux with a particular software stack, you can create an Ubuntu image, install your software into that image, copy the image to another host (e.g. ARCHER2), and run your application on that host in its native Ubuntu environment.
Singularity also allows you to leverage the resources of whatever host you are on. This includes high-speed interconnects (e.g. Slingshot on ARCHER2), file systems (e.g. /home and /work on ARCHER2) and potentially other resources.
Note
Singularity only supports Linux containers. You cannot create images that use Windows or macOS (this is a restriction of the containerisation model rather than Singularity).
"},{"location":"user-guide/containers/#useful-links","title":"Useful Links","text":"Similar to Docker, a Singularity container is a self-contained software stack. As Singularity does not require a root-level daemon to run its containers (as is required by Docker) it is suitable for use on multi-user HPC systems such as ARCHER2. Within the container, you have exactly the same permissions as you do in a standard login session on the system.
In practice, this means that a container image created on your local machine with all your research software installed for local development will also run on ARCHER2.
Pre-built container images (such as those on DockerHub or the SingularityHub archive) can simply be downloaded and used on ARCHER2 (or anywhere else Singularity is installed).
Creating and modifying container images requires root permission and so must be done on a system where you have such access (in practice, this is usually within a virtual machine on your laptop/workstation).
Note
SingularityHub was a publicly available cloud service for Singularity container images active from 2016 to 2021. It built container recipes from GitHub repositories on Google Cloud, and container images were available via the command line Singularity or sregistry software. These container images are still available now in the SingularityHub Archive.
"},{"location":"user-guide/containers/#using-singularity-images-on-archer2","title":"Using Singularity Images on ARCHER2","text":"Singularity containers can be used on ARCHER2 in a number of ways, including:
We provide information on each of these scenarios below. First, we describe briefly how to get existing container images onto ARCHER2 so that you can launch containers based on them.
"},{"location":"user-guide/containers/#getting-existing-container-images-onto-archer2","title":"Getting existing container images onto ARCHER2","text":"Singularity container images are files, so, if you already have a container image, you can use scp
to copy the file to ARCHER2 as you would with any other file.
If you wish to get a file from one of the container image repositories, then Singularity allows you to do this from ARCHER2 itself.
For example, to retrieve a container image from SingularityHub on ARCHER2 we can simply issue a Singularity command to pull the image.
auser@ln03:~> singularity pull hello-world.sif shub://vsoch/hello-world\n
The container image located at the shub
URI is written to a Singularity Image File (SIF) called hello-world.sif
.
Once you have a container image file, launching a container based on the container image on the login nodes in an interactive way is extremely simple: you use the singularity shell
command. Using the container image we pulled in the example above:
auser@ln03:~> singularity shell hello-world.sif\nSingularity>\n
Within a Singularity container your home directory will be available.
Once you have finished using your container, you can return to the ARCHER2 login node prompt with the exit
command:
Singularity> exit\nexit\nauser@ln03:~>\n
"},{"location":"user-guide/containers/#interactive-use-on-the-compute-nodes","title":"Interactive use on the compute nodes","text":"The process for using a container interactively on the compute nodes is very similar to that for the login nodes. The only difference is that you first have to submit an interactive serial job (from a location on /work
) in order to get interactive access to the compute node.
For example, to reserve a full node for you to work on interactively you would use:
auser@ln03:/work/t01/t01/auser> srun --nodes=1 --exclusive --time=00:20:00 \\\n --account=[budget code] \\\n --partition=standard --qos=standard \\\n --pty /bin/bash\n\n...wait until job starts...\n\nauser@nid00001:/work/t01/t01/auser>\n
Note that the prompt has changed to show you are on a compute node. Now you can launch a container in the same way as on the login node.
auser@nid00001:/work/t01/t01/auser> singularity shell hello-world.sif\nSingularity> exit\nexit\nauser@nid00001:/work/t01/t01/auser> exit\nauser@ln03:/work/t01/t01/auser>\n
Note
We used exit
to leave the interactive container shell and then exit
again to leave the interactive job on the compute node.
You can also use Singularity containers within a non-interactive batch script as you would any other command. If your container image contains a runscript then you can use singularity run
to execute the runscript in the job. You can also use singularity exec
to execute arbitrary commands (or scripts) within the container.
An example job submission script to run a serial job that executes the runscript within a container based on the container image in the hello-world.sif
file that we downloaded previously to an ARCHER2 login node would be as follows.
#!/bin/bash --login\n\n# Slurm job options (name, compute nodes, job time)\n\n#SBATCH --job-name=helloworld\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:10:00\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Run the serial executable\nsingularity run $SLURM_SUBMIT_DIR/hello-world.sif\n
You submit this in the usual way and the standard output and error should be written to slurm-...
, where the output filename ends with the job number.
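A job that runs an arbitrary command with singularity exec, rather than the runscript, could follow the same pattern. The sketch below writes such a script to a file and checks only its shell syntax with bash -n; the script name exec-job.slurm, the job name, and the command run inside the container (cat /etc/os-release) are illustrative choices, and the script would still need to be submitted with sbatch on ARCHER2.

```shell
# Scratch directory for the example script.
cd $(mktemp -d)

# Write the batch script; the quoted EOF keeps $SLURM_SUBMIT_DIR unexpanded here.
cat > exec-job.slurm <<'EOF'
#!/bin/bash --login

#SBATCH --job-name=helloexec
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00

#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard

# Run a single command inside the container instead of its runscript
singularity exec $SLURM_SUBMIT_DIR/hello-world.sif cat /etc/os-release
EOF

# Parse the script without running it to catch shell syntax errors.
bash -n exec-job.slurm && echo syntax OK
```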
Running a Singularity container in parallel across a number of compute nodes requires some preparation. In general though, Singularity can be run within the parallel job launcher (srun
).
srun <options> \\\n singularity <options> /path/to/image/file \\\n app <options>\n
The code snippet above shows the launch command as having three nested parts, srun
, the singularity environment and the containerised application.
The Singularity container image must be compatible with the MPI environment on the host; either, the containerised app has been built against the appropriate MPI libraries or the container itself contains an MPI library that is compatible with the host MPI. The latter situation is known as the hybrid model; this is the approach taken in the sections that follow.
"},{"location":"user-guide/containers/#creating-your-own-singularity-container-images","title":"Creating Your Own Singularity Container Images","text":"As we saw above, you can create Singularity container images by importing from DockerHub or Singularity Hub on ARCHER2 itself. If you wish to create your own custom container image to use with Singularity then you must use a system where you have root (or administrator) privileges - often your own laptop or workstation.
There are a number of different options to create container images on your local system to use with Singularity on ARCHER2. We are going to use Docker on our local system to create the container image, push the new container image to Docker Hub and then use Singularity on ARCHER2 to convert the Docker container image to a Singularity container image SIF file.
For macOS and Windows users we recommend installing Docker Desktop. For Linux users, we recommend installing Docker directly on your local system. See the Docker documentation for full details on how to install Docker Desktop/Docker.
"},{"location":"user-guide/containers/#building-container-images-using-docker","title":"Building container images using Docker","text":"Note
We assume that you are familiar with using Docker in these instructions. You can find an introduction to Docker at Reproducible Computational Environments Using Containers: Introduction to Docker
As usual, you can build container images with a command similar to:
docker build --platform linux/amd64 -t <username>/<image name>:<version> .\n
Where:
<username>
is your Docker Hub username
<image name>
is the name of the container image you wish to create
<version>
specifies the version of the image you are creating (e.g. \"latest\", \"v1\")
.
is the build context - in this example it is the location of the Dockerfile
Note, you should use the --platform linux/amd64
option to ensure that the container image is compatible with the processor architecture on ARCHER2.
MPI on ARCHER2 is provided by the Cray MPICH libraries with the interface to the high-performance Slingshot interconnect provided via the OFI interface. Therefore, as per the Singularity MPI Hybrid model, we will build our container image such that it contains a version of the MPICH MPI library compiled with support for OFI. Below, we provide instructions on creating a container image with a version of MPICH compiled in this way. We then provide an example of how to run a Singularity container with MPI over multiple ARCHER2 compute nodes.
"},{"location":"user-guide/containers/#building-an-image-with-mpi-from-scratch","title":"Building an image with MPI from scratch","text":"Warning
Remember, all these steps should be executed on your local system where you have administrator privileges and Docker installed, not on ARCHER2.
We will illustrate the process of building a Singularity image with MPI from scratch by building an image that contains MPI provided by MPICH and the OSU MPI benchmarks. As part of the container image creation we need to download the source code for both MPICH and the OSU benchmarks. At the time of writing, the stable MPICH release is 3.4.2 and the stable OSU benchmark release is 5.8 - this may have changed by the time you are following these instructions.
First, create a Dockerfile that describes how to build the image:
FROM ubuntu:20.04\n\nENV DEBIAN_FRONTEND=noninteractive\n\n# Install the necessary packages (from repo)\nRUN apt-get update && apt-get install -y --no-install-recommends \\\n apt-utils \\\n build-essential \\\n curl \\\n libcurl4-openssl-dev \\\n libzmq3-dev \\\n pkg-config \\\n software-properties-common\nRUN apt-get clean\nRUN apt-get install -y dkms\nRUN apt-get install -y autoconf automake build-essential numactl libnuma-dev autoconf automake gcc g++ git libtool\n\n# Download and build an ABI compatible MPICH\nRUN curl -sSLO http://www.mpich.org/static/downloads/3.4.2/mpich-3.4.2.tar.gz \\\n && tar -xzf mpich-3.4.2.tar.gz -C /root \\\n && cd /root/mpich-3.4.2 \\\n && ./configure --prefix=/usr --with-device=ch4:ofi --disable-fortran \\\n && make -j8 install \\\n && rm -rf /root/mpich-3.4.2 \\\n && rm /mpich-3.4.2.tar.gz\n\n# OSU benchmarks\nRUN curl -sSLO http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.4.1.tar.gz \\\n && tar -xzf osu-micro-benchmarks-5.4.1.tar.gz -C /root \\\n && cd /root/osu-micro-benchmarks-5.4.1 \\\n && ./configure --prefix=/usr/local CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx \\\n && cd mpi \\\n && make -j8 install \\\n && rm -rf /root/osu-micro-benchmarks-5.4.1 \\\n && rm /osu-micro-benchmarks-5.4.1.tar.gz\n\n# Add the OSU benchmark executables to the PATH\nENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt:$PATH\nENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/collective:$PATH\n\n# path to mlx libraries in Ubuntu\nENV LD_LIBRARY_PATH=/usr/lib/libibverbs:$LD_LIBRARY_PATH\n
A quick overview of what the above Dockerfile is doing:
FROM
line: bases the image on the ubuntu:20.04
Docker image.
RUN
sections with apt-get
commands: install the base packages required from the Ubuntu package repos
ENV
sections: add the OSU benchmark executables to the PATH so they can be executed in the container without specifying the full path; set the correct paths to the network libraries within the container.
Now we can go ahead and build the container image using Docker (this assumes that you issue the command in the same directory as the Dockerfile you created based on the specification above):
docker build --platform linux/amd64 -t auser/osu-benchmarks:5.4.1 .\n
(Remember to change auser
to your Dockerhub username.)
Once you have successfully built your container image, you should push it to Dockerhub:
docker push auser/osu-benchmarks:5.4.1\n
Finally, you need to use Singularity on ARCHER2 to convert the Docker container image to a Singularity container image file. Log into ARCHER2, move to the work file system and then use a command like:
auser@ln01:/work/t01/t01/auser> singularity build osu-benchmarks_5.4.1.sif docker://auser/osu-benchmarks:5.4.1\n
Tip
You can find a copy of the osu-benchmarks_5.4.1.sif
image on ARCHER2 in the directory $EPCC_SINGULARITY_DIR
if you do not want to build it yourself but still want to test.
Tip
These instructions assume you have built a Singularity container image file on ARCHER2 that includes MPI provided by MPICH with the OFI interface. See the sections above for how to build such container images.
Once you have built your Singularity container image file that includes MPICH built with OFI for ARCHER2, you can use it to run parallel jobs in a similar way to non-Singularity jobs. The example job submission script below uses the container image file we built above with MPICH and the OSU benchmarks to run the Allreduce benchmark on two nodes where all 128 cores on each node are used for MPI processes (so, 256 MPI processes in total).
#!/bin/bash\n\n# Slurm job options (name, compute nodes, job time)\n#SBATCH --job-name=singularity_parallel\n#SBATCH --time=0:10:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n#SBATCH --account=[budget code]\n\n# Load the module to make the Cray MPICH ABI available\nmodule load cray-mpich-abi\n\nexport OMP_NUM_THREADS=1\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n#\u00a0Set the LD_LIBRARY_PATH environment variable within the Singularity container\n# to ensure that it used the correct MPI libraries.\nexport SINGULARITYENV_LD_LIBRARY_PATH=\"/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib:/opt/cray/libfabric/1.12.1.2.2.0.0/lib64:/opt/cray/pe/gcc-libs:/opt/cray/pe/gcc-libs:/opt/cray/pe/lib64:/opt/cray/pe/lib64:/opt/cray/xpmem/default/lib64:/usr/lib64/libibverbs:/usr/lib64:/usr/lib64\"\n\n# This makes sure HPE Cray Slingshot interconnect libraries are available\n# from inside the container.\nexport 
SINGULARITY_BIND=\"/opt/cray,/var/spool,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib,/etc/host.conf,/etc/libibverbs.d/mlx5.driver,/etc/libnl/classid,/etc/resolv.conf,/opt/cray/libfabric/1.12.1.2.2.0.0/lib64/libfabric.so.1,/opt/cray/pe/gcc-libs/libatomic.so.1,/opt/cray/pe/gcc-libs/libgcc_s.so.1,/opt/cray/pe/gcc-libs/libgfortran.so.5,/opt/cray/pe/gcc-libs/libquadmath.so.0,/opt/cray/pe/lib64/libpals.so.0,/opt/cray/pe/lib64/libpmi2.so.0,/opt/cray/pe/lib64/libpmi.so.0,/opt/cray/xpmem/default/lib64/libxpmem.so.0,/run/munge/munge.socket.2,/usr/lib64/libibverbs/libmlx5-rdmav34.so,/usr/lib64/libibverbs.so.1,/usr/lib64/libkeyutils.so.1,/usr/lib64/liblnetconfig.so.4,/usr/lib64/liblustreapi.so,/usr/lib64/libmunge.so.2,/usr/lib64/libnl-3.so.200,/usr/lib64/libnl-genl-3.so.200,/usr/lib64/libnl-route-3.so.200,/usr/lib64/librdmacm.so.1,/usr/lib64/libyaml-0.so.2\"\n\n# Launch the parallel job.\nsrun --hint=nomultithread --distribution=block:block \\\n singularity run osu-benchmarks_5.4.1.sif \\\n osu_allreduce\n
The only changes from a standard submission script are:
SINGULARITYENV_LD_LIBRARY_PATH
is set to ensure that the executable can find the correct MPI libraries within the container. SINGULARITY_BIND
is set to ensure that the correct libraries are available within the container to be able to use the HPE Cray Slingshot interconnect. srun
calls the singularity
software with the container image file we created rather than the parallel program directly. Important
Remember that the image file must be located on /work
to run jobs on the compute nodes.
If the job runs correctly, you should see output similar to the following in your slurm-*.out
file:
Lmod is automatically replacing \"cray-mpich/8.1.23\" with\n\"cray-mpich-abi/8.1.23\".\n\n\n# OSU MPI Allreduce Latency Test v5.4.1\n# Size Avg Latency(us)\n4 7.93\n8 7.93\n16 8.13\n32 8.69\n64 9.54\n128 13.75\n256 17.04\n512 25.94\n1024 29.43\n2048 43.53\n4096 46.53\n8192 46.20\n16384 55.85\n32768 83.11\n65536 136.90\n131072 257.13\n262144 486.50\n524288 1025.87\n1048576 2173.25\n
"},{"location":"user-guide/containers/#using-containerised-hpe-cray-programming-environments","title":"Using Containerised HPE Cray Programming Environments","text":"An experimental containerised CPE module has been setup on ARCHER2. The module is not available by default but can be made accessible by running module use
with the right path.
module use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n
The purpose of the ccpe
module(s) is to allow developers to check that their code compiles with the latest Cray Programming Environment (CPE) releases. The CPE release installed on ARCHER2 (currently CPE 22.12) will typically be older than the latest available. A more recent containerised CPE therefore gives developers the opportunity to try out the latest compilers and libraries before the ARCHER2 CPE is upgraded.
Note
The Containerised CPEs support CCE and GCC compilers, but not AOCC compilers.
The ccpe/23.12
module then provides access to CPE 23.12 via a Singularity image file, located at /work/y07/shared/utils/dev/ccpe/23.12/cpe_23.12.sif
. Singularity containers can be run such that locations on the host file system are still visible. This means source code stored on /work
can be compiled from inside the CPE container, and any output resulting from the compilation, such as object files, libraries and executables, can also be written to /work
. This ability to bind to locations on the host is necessary as the container is immutable, i.e., you cannot write files to the container itself.
Any executable resulting from a containerised CPE build can be run from within the container, allowing the developer to test the performance of the containerised libraries, e.g., libmpi_cray
, libpmi2
, libfabric
.
We'll now show how to build and run a simple Hello World MPI example using a containerised CPE.
First, cd
to the directory containing the Hello World MPI source, makefile and build script. Examples of these files are given below.
#!/bin/bash\n\nmake clean\nmake\n\necho -e \"\\n\\nldd helloworld\"\nldd helloworld\n
MF= Makefile\n\nFC= ftn\nFFLAGS= -O3\nLFLAGS= -lmpichf90\n\nEXE= helloworld\nFSRC= helloworld.f90\n\n#\n# No need to edit below this line\n#\n\n.SUFFIXES:\n.SUFFIXES: .f90 .o\n\nOBJ= $(FSRC:.f90=.o)\n\n.f90.o:\n $(FC) $(FFLAGS) -c $<\n\nall: $(EXE)\n\n$(EXE): $(OBJ)\n $(FC) $(FFLAGS) -o $@ $(OBJ) $(LFLAGS)\n\nclean:\n rm -f $(OBJ) $(EXE) core\n
!\n! Prints 'Hello World' from rank 0 and\n! prints what processor it is out of the total number of processors from\n! all ranks\n!\n\nprogram helloworld\n use mpi\n\n implicit none\n\n integer :: comm, rank, size, ierr\n integer :: last_arg\n\n comm = MPI_COMM_WORLD\n\n call MPI_INIT(ierr)\n\n call MPI_COMM_RANK(comm, rank, ierr)\n call MPI_COMM_SIZE(comm, size, ierr)\n\n ! Each process prints out its rank\n write(*,*) 'I am ', rank, 'out of ', size,' processors.'\n\n call sleep(1)\n\n call MPI_FINALIZE(ierr)\n\nend program helloworld\n
The ldd
command at the end of the build script is simply there to confirm that the code is indeed linked to containerised libraries that form part of the CPE 23.12 release.
The next step is to launch a job (via sbatch
) on a serial node that instantiates the containerised CPE 23.12 image and builds the Hello World MPI code.
#!/bin/bash\n\n#SBATCH --job-name=ccpe-build\n#SBATCH --ntasks=8\n#SBATCH --time=00:10:00\n#SBATCH --account=<budget code>\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n#SBATCH --export=none\n\nexport OMP_NUM_THREADS=1\n\nmodule use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n\nBUILD_CMD=\"${CCPE_BUILDER} ${SLURM_SUBMIT_DIR}/build.sh\"\n\nsingularity exec --cleanenv \\\n --bind ${CCPE_BIND_ARGS},${SLURM_SUBMIT_DIR} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \\\n ${CCPE_IMAGE_FILE} ${BUILD_CMD}\n
The CCPE
environment variables shown above (e.g., CCPE_BUILDER
and CCPE_IMAGE_FILE
) are set by the loading of the ccpe/23.12
module. The CCPE_BUILDER
variable holds the path to the script that prepares the containerised environment prior to running the build.sh
script. You can run cat ${CCPE_BUILDER}
to take a closer look at what is going on.
Note
Passing the ${SLURM_SUBMIT_DIR}
path to Singularity via the --bind
option allows the CPE container to access the source code and write out the executable using locations on the host.
Running the newly-built code is similarly straightforward; this time the containerised CPE is launched on the compute nodes using the srun
command.
#!/bin/bash\n\n#SBATCH --job-name=helloworld\n#SBATCH --nodes=2\n#SBATCH --tasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n#SBATCH --account=<budget code>\n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --export=none\n\nexport OMP_NUM_THREADS=1\n\nmodule use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n\nRUN_CMD=\"${SLURM_SUBMIT_DIR}/helloworld\"\n\nsrun --distribution=block:block --hint=nomultithread --chdir=${SLURM_SUBMIT_DIR} \\\n singularity exec --bind ${CCPE_BIND_ARGS},${SLURM_SUBMIT_DIR} --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \\\n ${CCPE_IMAGE_FILE} ${RUN_CMD}\n
If you wish, you can replace a containerised library with its host equivalent at runtime. You may, for example, decide to do this for a low-level communications library such as libfabric
or libpmi
. This can be done by adding (before the srun
command) something like the following line to the submit-run.slurm
file.
source ${CCPE_SET_HOST_PATH} \"/opt/cray/pe/pmi\" \"6.1.8\" \"lib\"\n
As of April 2024, the version of PMI available on ARCHER2 is 6.1.8 (CPE 22.12), and so the command above would allow you to isolate the impact of the containerised PMI library, which for CPE 23.12 is PMI 6.1.13. To see how the setting of the host library is done, simply run cat ${CCPE_SET_HOST_PATH}
after loading the ccpe
module.
An MPI code that just prints a message from each rank is obviously very simple. Real-world codes such as CP2K or GROMACS will often require additional software for compilation, e.g., Intel MKL libraries or tools that control the build process such as CMake
. The way round this sort of problem is to point the CCPE container at the locations on the host where the software is installed.
#!/bin/bash\n\n#SBATCH --job-name=ccpe-build\n#SBATCH --ntasks=8\n#SBATCH --time=00:10:00\n#SBATCH --account=<budget code>\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n#SBATCH --export=none\n\nexport OMP_NUM_THREADS=1\n\nmodule use /work/y07/shared/archer2-lmod/others/dev\nmodule load ccpe/23.12\n\nCMAKE_DIR=\"/work/y07/shared/utils/core/cmake/3.21.3\"\n\nBUILD_CMD=\"${CCPE_BUILDER} ${SLURM_SUBMIT_DIR}/build.sh\"\n\nsingularity exec --cleanenv \\\n --bind ${CCPE_BIND_ARGS},${CMAKE_DIR},${SLURM_SUBMIT_DIR} \\\n --env LD_LIBRARY_PATH=${CCPE_LD_LIBRARY_PATH} \\\n ${CCPE_IMAGE_FILE} ${BUILD_CMD}\n
The submit-cmake-build.slurm
script shows how the --bind
option can be used to make the CMake
installation on ARCHER2 accessible from within the container. The build.sh
script can then call the cmake
command directly (once the CMake
bin directory has been added to the PATH
environment variable).
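As a minimal sketch of that last step, a build.sh along the following lines would add CMake to the PATH before configuring. It assumes the executables live in a bin subdirectory of the bound CMAKE_DIR path, which is an assumption about the installation layout rather than something confirmed above. The check at the end only reports which cmake, if any, will be used.

```shell
#!/bin/bash

# Same path as the CMAKE_DIR bound into the container by the submission script.
# ASSUMPTION: the cmake executable is in a "bin" subdirectory of this path.
CMAKE_DIR="/work/y07/shared/utils/core/cmake/3.21.3"
export PATH="${CMAKE_DIR}/bin:${PATH}"

# Report which cmake will be used before starting the real build
if command -v cmake > /dev/null; then
    cmake --version
else
    echo "cmake not found on PATH" >&2
fi
```

Setting PATH inside build.sh, rather than in the submission script, means the change is made within the container environment itself.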
This content has been moved to archer-migration/data-migration
"},{"location":"user-guide/data/","title":"Data management and transfer","text":"This section covers best practice and tools for data management on ARCHER2 along with a description of the different storage available on the service.
The IO section has information on achieving good performance for reading and writing data to the ARCHER2 storage along with information and advice on different IO patterns.
Information
If you have any questions on data management and transfer please do not hesitate to contact the ARCHER2 service desk at support@archer2.ac.uk.
"},{"location":"user-guide/data/#useful-resources-and-links","title":"Useful resources and links","text":"We strongly recommend that you give some thought to how you use the various data storage facilities that are part of the ARCHER2 service. This will not only allow you to use the machine more effectively but also to ensure that your valuable data is protected.
Here are the main points you should consider:
rsync
, tar
, zip
) and generally encourage you to use them to reduce data volumes. However, in some cases, compressing the data can take longer than transferring the uncompressed data, particularly when transferring data between two locations that both have large data transfer bandwidth available. scp
(and rsync
over scp
) your data will be encrypted, introducing a static overhead per file. This issue can be minimised by reducing the number of files to be transferred by creating archives. You can also change the encryption algorithm to one with lower overhead. The fastest performing cipher that is commonly available in SSH at the moment is generally aes128-ctr
as most common processors provide a hardware implementation. The ARCHER2 service, like many HPC systems, has a complex structure. There are a number of different data storage types available to users:
/epsrc
and /general
)Each type of storage has different characteristics and policies, and is suitable for different types of use.
Important
All users have a directory on one of the home file systems and on one of the work file systems. The directories are located at:
/home/[project ID]/[project ID]/[user ID]
(this is also set as your home directory)/work/[project ID]/[project ID]/[user ID]
There are also three different types of node available to users:
Each type of node sees a different combination of the storage types. The following table shows which storage options are available on different node types:
Storage | Login Nodes | Compute Nodes | Data analysis nodes | Notes
/home | yes | no | yes | Incremental backup
/work | yes | yes | yes | No backup, high performance
Solid state (NVMe) | yes | yes | yes | No backup, high performance
RDFaaS | yes | no | yes | Disaster recovery backup
Important
Only the work file systems and the solid state (NVMe) file system are visible on the compute nodes. This means that all data required by calculations at runtime (input data, application binaries, software libraries, etc.) must be placed on one of these file systems.
You may see \"file not found\" errors if you try to access data on the /home or RDFaaS file systems when running on the compute nodes.
"},{"location":"user-guide/data/#home-file-systems","title":"Home file systems","text":"There are four independent home file-systems. Every project has an allocation on one of the four. You do not need to know which one your project uses as your projects space can always be accessed via the path /home/[project ID]
with your personal directory at /home/[project ID]/[project ID]/[user ID]
. Each home file-system is approximately 100 TB in size and is implemented using standard Network Attached Storage (NAS) technology. This means that these disks are not particularly high performance but are well suited to standard operations like compilation and file editing. These file systems are visible from the ARCHER2 login nodes.
The home file systems are fully backed up. The home file systems retain snapshots which can be used to recover past versions of files. Snapshots are taken weekly (for each of the past two weeks), daily (for each of the past two days) and hourly (for each of the last 6 hours). You can access the snapshots at .snapshot
from any given directory on the home file systems. Note that the .snapshot
directory will not show up under any version of \u201cls\u201d and will not tab complete.
These file systems are a good location to keep source code, copies of scripts and compiled binaries. Small amounts of important data can also be copied here for safe keeping though the file systems are not fast enough to manipulate large datasets effectively.
"},{"location":"user-guide/data/#quotas-on-home-file-systems","title":"Quotas on home file systems","text":"All projects are assigned a quota on the home file systems. The project PI or manager can split this quota up between users or groups of users if they wish.
You can view any home file system quotas that apply to your account by logging into SAFE and navigating to the page for your ARCHER2 login account.
Tip
Quota and usage data on SAFE is updated twice daily so may not be exactly up to date with the situation on the systems themselves.
"},{"location":"user-guide/data/#work-file-systems","title":"Work file systems","text":"There are currently three work file systems on the full ARCHER2 service. Each of these file systems is 3.4 PB and a portion of one of these file systems is available to each project. You do not usually need to know which one your project uses as your projects space can always be accessed via the path /work/[project ID]
with your personal directory at /work/[project ID]/[project ID]/[user ID]
.
All of these are high-performance, Lustre parallel file systems. They are designed to support data in large files. The performance for data stored in large numbers of small files is probably not going to be as good.
These file systems are available on the compute nodes and are the default location users should use for data required at runtime on the compute nodes.
Warning
There are no backups of any data on the work file systems. You should not rely on these file systems for long term storage.
Ideally, these file systems should only contain data that is:
In practice it may be convenient to keep copies of datasets on the work file systems that you know will be needed at a later date. However, make sure that important data is always backed up elsewhere and that your work would not be significantly impacted if the data on the work file systems was lost.
Large data sets can be moved to the RDFaaS storage or transferred off the ARCHER2 service entirely.
If you have data on the work file systems that you are not going to need in the future please delete it.
"},{"location":"user-guide/data/#quotas-on-the-work-file-systems","title":"Quotas on the work file systems","text":"As for the home file systems, all projects are assigned a quota on the work file systems. The project PI or manager can split this quota up between users or groups of users if they wish.
You can view any work file system quotas that apply to your account by logging into SAFE and navigating to the page for your ARCHER2 login account.
Tip
Quota and usage data on SAFE is updated twice daily so may not be exactly up to date with the situation on the systems themselves.
You can also examine up to date quotas and usage on the ARCHER2 systems themselves using the lfs quota
command. To do this:
auser
in project t01
then I would:cd /work/t01/t01/auser\n
auser@ln03:/work/t01/t01/auser> lfs quota -hu auser .\nDisk quotas for usr auser (uid 5496):\n Filesystem used quota limit grace files quota limit grace\n . 1.366G 0k 0k - 5486 0 0 -\nuid 5496 is using default block quota setting\nuid 5496 is using default file quota setting\n
the quota
and limit
of 0k
here indicate that no user quota is set for this user
auser@ln03:/work/t01/t01/auser> lfs quota -hp $(id -g) .\nDisk quotas for prj 1009 (pid 1009):\n Filesystem used quota limit grace files quota limit grace\n . 2.905G 0k 0k - 25300 0 0 -\npid 1009 is using default block quota setting\npid 1009 is using default file quota setting\n
"},{"location":"user-guide/data/#solid-state-nvme-file-system-scratch-storage","title":"Solid state (NVMe) file system - scratch storage","text":"Important
The solid state storage system is configured as scratch storage with all files that have not been accessed in the last 28 days being automatically deleted. This implementation starts on 28 Feb 2024, i.e. any files not accessed since 1 Feb 2024 will be automatically removed on 28 Feb 2024.
The solid state storage file system is a 1 PB high performance parallel Lustre file system similar to the work file systems. However, unlike the work file systems, all of the disks are based on solid state storage (NVMe) technology. This changes the performance characteristics of the file system compared to the work file systems. Testing by the ARCHER2 CSE team at EPCC has shown that you may see I/O performance improvements from the solid state storage compared to the standard work Lustre file systems on ARCHER2 if your I/O model has the following characteristics or similar:
Data on the solid state (NVMe) file system is visible on the compute nodes
Important
If you use MPI-IO approaches to reading/writing data (this includes parallel HDF5 and parallel NetCDF) then you are very unlikely to see any performance improvements from using the solid state storage over the standard parallel Lustre file systems on ARCHER2.
Warning
There are no backups of any data on the solid state (NVMe) file system. You should not rely on this file system for long term storage.
"},{"location":"user-guide/data/#access-to-the-solid-state-file-system","title":"Access to the solid state file system","text":"Projects do not have access to the solid state file system by default. If your project does not yet have access and you want access for your project, please contact the Service Desk to request access.
"},{"location":"user-guide/data/#location-of-directories","title":"Location of directories","text":"You can find your directory on the file system at:
/mnt/lustre/a2fs-nvme/work/<project code>/<project code>/<username>\n
For example, if my username is auser
and I am in project t01
, I could find my solid state storage directory at:
/mnt/lustre/a2fs-nvme/work/t01/t01/auser\n
"},{"location":"user-guide/data/#quotas-on-solid-state-file-system","title":"Quotas on solid state file system","text":"Important
All projects have the same, large quota of 250,000 GiB on the solid state file system to allow them to use it as a scratch file system. Remember, any files that have not been accessed in the last 28 days will be automatically deleted.
You can query quotas for the solid state file system in the same way as quotas on the work file systems.
Bug
Usage and quotas of the solid state file system are not yet available in SAFE - you should use commands such as lfs quota -hp $(id -g) .
to query quotas on the solid state file system.
You can identify which files you own that are candidates for deletion at the next scratch file system purge using the find
command in the following format:
find /mnt/lustre/a2fs-nvme/work/<project code> -atime +28 -type f -print\n
For example, if my account is in project t01
, I would use:
find /mnt/lustre/a2fs-nvme/work/t01 -atime +28 -type f -print\n
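The find command above matches every file under the project directory, whoever owns it. As a sketch, the hypothetical helper below (list_purge_candidates is an illustrative name, not an ARCHER2 command) adds an ownership test so that only your own files are reported, and takes the age threshold as an optional argument.

```shell
# list_purge_candidates DIR [DAYS]
# Hypothetical helper: prints regular files under DIR that are owned by the
# current user and were last accessed more than DAYS days ago (default 28,
# matching the purge policy described above).
list_purge_candidates() {
    local dir="$1"
    local days="${2:-28}"
    # -user accepts a numeric user ID, so "id -u" avoids any name lookup
    find "$dir" -type f -user "$(id -u)" -atime "+${days}" -print
}
```

For example, list_purge_candidates /mnt/lustre/a2fs-nvme/work/t01 | wc -l would count your own files at risk in project t01. Note that find ignores fractional days when computing file age, so -atime +28 matches files last accessed at least 29 days ago.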
"},{"location":"user-guide/data/#rdfaas-file-systems","title":"RDFaaS file systems","text":"The RDFaaS file systems provide additional capacity for projects to store data that is not currently required on the compute nodes but which is too large for the Home file systems.
Warning
The RDFaaS file systems are backed up for disaster recovery purposes only (e.g. loss of the whole file system) so it is not possible to recover individual files if they are deleted by mistake or otherwise lost.
Tip
Not all projects on ARCHER2 have access to RDFaaS. If you do have access, this will show up in the login account page on SAFE for your ARCHER2 login account.
If you have access to RDFaaS, you will have a directory in one of two file systems: either /epsrc
or /general
.
For example, if your username is auser
and you are in the e05
project, then your RDFaaS directory will be at:
/epsrc/e05/e05/auser\n
The RDFaaS file systems are not available on the ARCHER2 compute nodes.
Tip
If you are having issues accessing data on the RDFaaS file system then please contact the ARCHER2 Service Desk
"},{"location":"user-guide/data/#copying-data-from-rdfaas-to-work-file-systems","title":"Copying data from RDFaaS to Work file systems","text":"You should use the standard Linux cp
command to copy data from the RDFaaS file system to other ARCHER2 file systems (usually /work
). For example, to transfer the file important-data.tar.gz
from the RDFaaS file system to /work
you would use the following command (assuming you are user auser
in project e05
):
cp /epsrc/e05/e05/auser/important-data.tar.gz /work/e05/e05/auser/\n
(Remember to replace the project code and username with your own. You may also need to use /general
if your data was there on the RDF file systems.)
Some large projects may choose to split their resources into multiple subprojects. These subprojects will have identifiers appended to the main project ID. For example, the rse
subgroup of the z19
project would have the ID z19-rse
. If the main project has allocated storage quotas to the subproject the directories for this storage will be found at, for example:
/home/z19/z19-rse/auser\n
Your Linux home directory will generally not be changed when you are made a member of a subproject so you must change directories manually (or change the ownership of files) to make use of this different storage quota allocation.
"},{"location":"user-guide/data/#sharing-data-with-other-archer2-users","title":"Sharing data with other ARCHER2 users","text":"How you share data with other ARCHER2 users depends on whether or not they belong to the same project as you. Each project has two shared folders that can be used for sharing data.
"},{"location":"user-guide/data/#sharing-data-with-archer2-users-in-your-project","title":"Sharing data with ARCHER2 users in your project","text":"Each project has an inner shared folder.
/work/[project code]/[project code]/shared\n
This folder has read/write permissions for all project members. You can place any data you wish to share with other project members in this directory. For example, if your project code is x01 the inner shared folder would be located at /work/x01/x01/shared
.
Some projects have subprojects (also often referred to as 'project groups' or sub-budgets), e.g. project e123 might have a project group e123-fred for a sub-group of researchers working with Fred.
Often project groups do not have a disk quota set, but if the project PI does set up a group disk quota e.g. for /work then additional directories are created:
/work/e123/e123-fred\n/work/e123/e123-fred/shared\n/work/e123/e123-fred/<user> (for every user in the group)\n
and all members of the e123-fred
group will be able to use the /work/e123/e123-fred/shared
directory to share their files.
Note
If files are copied from their usual directories they will keep the original ownership. To grant ownership to the group:
chown -R $USER:e123-fred /work/e123/e123-fred/ ...
Each project also has an outer shared folder:
/work/[project code]/shared\n
It is writable by all project members and readable by any user on the system. You can place any data you wish to share with other ARCHER2 users who are not members of your project in this directory. For example, if your project code is x01 the outer shared folder would be located at /work/x01/shared
.
You should check the permissions of any files that you place in the shared area, especially if those files were created in your own ARCHER2 account. Files of the latter type are likely to be readable by you only.
The chmod
command below shows how to make sure that a file placed in the outer shared folder is also readable by all ARCHER2 users.
chmod a+r /work/x01/shared/your-shared-file.txt\n
Similarly, for the inner shared folder, chmod
can be called such that read permission is granted to all users within the x01 project.
chmod g+r /work/x01/x01/shared/your-shared-file.txt\n
If you're sharing a set of files stored within a folder hierarchy the chmod
command is slightly more complicated.
chmod -R a+Xr /work/x01/shared/my-shared-folder\nchmod -R g+Xr /work/x01/x01/shared/my-shared-folder\n
The -R
option ensures that the read permission is enabled recursively and the +X
guarantees that the user(s) you're sharing the folder with can access the subdirectories below my-shared-folder
.
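The effect of +X can be checked on a small test hierarchy. The sketch below uses illustrative file names and GNU stat (as found on Linux systems) to print the resulting permissions: directories gain group read and search permission, while the plain data file gains only group read.

```shell
# Build a small private hierarchy (all names here are illustrative)
mkdir -p my-shared-folder/results
echo data > my-shared-folder/results/output.txt
chmod -R go-rwx my-shared-folder

# Grant group read recursively; X adds execute (search) permission only to
# directories and to files that already have an execute bit set
chmod -R g+Xr my-shared-folder

# Print the resulting modes for the directories and the data file
stat -c '%A %n' my-shared-folder \
    my-shared-folder/results \
    my-shared-folder/results/output.txt
```

Using a lower-case x instead would also mark every data file as executable, which is rarely what you want when sharing results.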
Every file has an owner group that specifies access permissions for users belonging to that group. It's usually the case that the group id is synonymous with the project code. Somewhat confusingly however, projects can contain groups of their own, called subprojects, which can be assigned disk space quotas distinct from the project.
chown -R $USER:x01-subproject /work/x01/x01-subproject/$USER/my-folder\n
The chown
command above changes the owning group for all the files within my-folder
to the x01-subproject
group. This might be necessary if previously those files were owned by the x01 group and thereby using some of the x01 disk quota.
Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.
The method you use to transfer data to/from ARCHER2 will depend on how much you want to transfer and where to. The methods we cover in this guide are:
Before discussing specific data transfer methods, we cover archiving which is an essential process for transferring data efficiently.
"},{"location":"user-guide/data/#archiving","title":"Archiving","text":"If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger \"archive\" file for ease of transfer and manipulation. A single large file makes more efficient use of the file system and is easier to move and copy and transfer because significantly fewer meta-data operations are required. Archive files can be created using tools like tar
and zip
.
The tar
command packs files into a \"tape archive\" format. The command has the general form:
tar [options] [file(s)]\n
Common options include:
-c
create a new archive
-v
verbosely list files processed
-W
verify the archive after writing
-l
confirm all file hard links are included in the archive
-f
use an archive file (for historical reasons, tar writes its output to stdout by default rather than a file)
-b 2048
use a 1 MiB block size (better performance and less contention on Lustre compared to the default block size)
Putting these together:
tar -cvWlf mydata.tar mydata\n
will create and verify an archive.
To extract files from a tar file, the option -x
is used. For example:
tar -b 2048 -xf mydata.tar\n
will recover the contents of mydata.tar
to the current working directory (using a block size of 1 MiB to improve Lustre performance and reduce contention).
To verify an existing tar file against a set of data, the -d
(diff) option can be used. By default, no output will be given if a verification succeeds and an example of a failed verification follows:
$> tar -df mydata.tar mydata/*\nmydata/damaged_file: Mod time differs\nmydata/damaged_file: Size differs\n
Note
tar files do not store checksums with their data, requiring the original data to be present during verification.
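Putting creation, verification and extraction together, the sketch below runs the full round trip on a small example directory. The directory and file names are illustrative, and the -C option (a standard tar option not discussed above) unpacks into a separate directory so the original data is left untouched.

```shell
# Create some example data (names are illustrative)
mkdir -p mydata
echo example > mydata/file.txt

# Pack the directory using a 1 MiB block size (2048 x 512-byte records)
tar -b 2048 -cf mydata.tar mydata

# Compare the archive with the data on disk; tar prints nothing and
# returns exit status 0 when they match
tar -df mydata.tar mydata && echo "archive matches source"

# Unpack into a separate directory, leaving the original data in place
mkdir -p restored
tar -b 2048 -xf mydata.tar -C restored
```

Because the comparison needs the original files, it is best run before the source data is deleted or moved.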
Tip
Further information on using tar
can be found in the tar
manual (accessed via man tar
or at man tar).
The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:
zip [options] mydata.zip [file(s)]\n
Common options are:
-r
used to zip up a directory
-#
where \"#\" represents a digit ranging from 0 to 9 to specify compression level, 0 being the least and 9 the most. Default compression is -6 but we recommend using -0 to speed up the archiving process.
Together:
zip -0r mydata.zip mydata\n
will create an archive.
Note
Unlike tar, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system layout.
The corresponding unzip
command is used to extract data from the archive. The simplest use case is:
unzip mydata.zip\n
which recovers the contents of the archive to the current working directory.
Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip
provides options for verifying this checksum against the stored files. The relevant flag is -t
and is used as follows:
$> unzip -t mydata.zip\nArchive: mydata.zip\n testing: mydata/ OK\n testing: mydata/file OK\nNo errors detected in compressed data of mydata.zip.\n
Tip
Further information on using zip
can be found in the zip
manual (accessed via man zip
or at man zip).
The easiest way of transferring data to/from ARCHER2 is to use one of the standard programs based on the SSH protocol such as scp
, sftp
or rsync
. These all use the same underlying mechanism (SSH) as you normally use to log in to ARCHER2. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine (ARCHER2 in this case).
To avoid having to type in your password multiple times you can set up an SSH key pair and use an SSH agent as documented in the User Guide at connecting
.
The SSH protocol encrypts all traffic it sends. This means that file transfer using SSH consumes a relatively large amount of CPU time at both ends of the transfer (for encryption and decryption). The ARCHER2 login nodes have fairly fast processors that can sustain about 100 MB/s transfer. The encryption algorithm used is negotiated between the SSH client and the SSH server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm than the default. The aes128-ctr
or aes256-ctr
algorithms are well supported and fast as they are implemented in hardware. These are not usually the default choice when using scp
so you will need to manually specify them.
A single SSH based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce metadata interactions it is a good idea to overlap transfers of files from different directories.
In addition, you should consider the following when transferring data:
gzip
.
The scp
command creates a copy of a file, or if given the -r
flag, a directory either from a local machine onto a remote machine or from a remote machine onto a local machine.
For example, to transfer files to ARCHER2 from a local machine:
scp [options] source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
In the above example, the [destination]
is optional, as when left out scp
will copy the source into your home directory. Also, the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
If you want to request a different encryption algorithm add the -c [algorithm-name]
flag to the scp
options. For example, to use the (usually faster) aes128-ctr encryption algorithm you would use:
scp [options] -c aes128-ctr source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
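Before requesting a cipher, it can help to check that your local SSH client supports it. The following is a minimal sketch, assuming an OpenSSH client (which provides the ssh -Q cipher query); the variable names are illustrative and the final line only prints the command rather than running it:

```shell
# Cipher we would like to request (taken from the text above).
wanted=aes128-ctr

# ssh -Q cipher lists the ciphers the local OpenSSH client supports.
if command -v ssh > /dev/null 2>&1 && ssh -Q cipher | grep -qx $wanted; then
  cipher_flag='-c aes128-ctr'
else
  # Fall back to the negotiated default if the cipher is not offered.
  cipher_flag=''
fi

# Print the command that would be run; replace user with your username.
echo scp $cipher_flag source user@login.archer2.ac.uk:
```

The server must also accept the cipher, so a transfer can still fall back or fail even when the local check succeeds.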
The rsync
command can also transfer data between hosts using a ssh
connection. It creates a copy of a file or, if given the -r
flag, a directory at the given destination, similar to scp
above.
Given the -a
option rsync can also make exact copies (including permissions); this is referred to as mirroring. In this case the rsync
command is executed with ssh
to create the copy on a remote machine.
To transfer files to ARCHER2 using rsync
with ssh
the command has the form:
rsync [options] -e ssh source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
In the above example, the [destination]
is optional, as when left out rsync will copy the source into your home directory. Also the source
should be the absolute path of the file/directory being copied or the command should be executed in the directory containing the source file/directory.
Additional flags can be specified for the underlying ssh
command by using a quoted string as the argument of the -e
flag. e.g.
rsync [options] -e \"ssh -c aes128-ctr\" source user@login.archer2.ac.uk:[destination]\n
(Remember to replace user
with your ARCHER2 username in the example above.)
Tip
Further information on using rsync
can be found in the rsync
manual (accessed via man rsync
or at man rsync).
The ARCHER2 filesystems have a Globus Collection (formerly known as an endpoint) with the name \"Archer2 file systems\". Full step-by-step guide for using Globus to transfer files to/from ARCHER2
"},{"location":"user-guide/data/#data-transfer-via-gridftp","title":"Data transfer via GridFTP","text":"ARCHER2 provides a module for grid computing, gct/6.2
, otherwise known as the Globus Grid Community Toolkit v6.2.20201212. This toolkit provides a command line interface for moving data to and from GridFTP servers.
Data transfers are managed by the globus-url-copy
command. Full details concerning this command's use can be found in the GCT 6.2 GridFTP User's Guide.
Info
Further information on using GridFTP on ARCHER2 to transfer data to the JASMIN facility can be found in the JASMIN user documentation.
"},{"location":"user-guide/data/#data-transfer-using-rclone","title":"Data transfer usingrclone
","text":"Rclone is a command-line program to manage files on cloud storage. You can transfer files directly to/from cloud storage services, such as MS OneDrive and Dropbox. The program preserves timestamps and verifies checksums at all times.
First of all, you must download and unzip rclone
on ARCHER2:
wget https://downloads.rclone.org/v1.62.2/rclone-v1.62.2-linux-amd64.zip\nunzip rclone-v1.62.2-linux-amd64.zip\ncd rclone-v1.62.2-linux-amd64/\n
The previous code snippet uses rclone v1.62.2, which was the latest version when these instructions were written.
Configure rclone using ./rclone config
. This will guide you through an interactive setup process where you can make a new remote (called remote
). See the following for detailed instructions for:
Please note that a token is required to connect from ARCHER2 to the cloud service. You need a web browser to get the token. The recommendation is to run rclone on your laptop using rclone authorize
, get the token, and then copy the token from your laptop to ARCHER2. The rclone website contains further instructions on configuring rclone on a remote machine without a web browser.
Once all the above is done, you're ready to go. If you want to copy a directory, please use:
rclone copy <archer2_directory> remote:<cloud_directory>
Please note that \"remote\" is the name that you have chosen when running rclone config
. To copy files, please use:
rclone copyto <archer2_file> remote:<cloud_file>
Note
If the session times out while the data transfer takes place, adding the -vv
flag to an rclone transfer forces rclone to output to the terminal and therefore avoids triggering the timeout process.
Here we have a short example demonstrating transfer of data directly from a laptop/workstation to ARCHER2.
Note
This guide assumes you are using a command line interface to transfer data. This means the terminal on Linux or macOS, MobaXterm local terminal on Windows or Powershell.
Before we can transfer data to ARCHER2 we need to make sure we have an SSH key set up to access ARCHER2 from the system we are transferring data from. If you are using the same system that you use to log into ARCHER2 then you should be all set. If you want to use a different system you will need to generate a new SSH key there (or use SSH key forwarding) to allow you to connect to ARCHER2.
Tip
Remember that you will need to use both a key and your password to transfer data to ARCHER2.
Once we know our keys are set up correctly, we are ready to transfer data directly between the two machines. We begin by combining our important research data into a single archive file using the following command:
tar -czf all_my_files.tar.gz file1.txt file2.txt file3.txt\n
We then initiate the data transfer from our system to ARCHER2, here using rsync
to allow the transfer to be recommenced without needing to start again, in the event of a loss of connection or other failure. For example, using the SSH key in the file ~/.ssh/id_RSA_A2
on our local system:
rsync -Pv -e\"ssh -c aes128-ctr -i $HOME/.ssh/id_RSA_A2\" ./all_my_files.tar.gz otbz19@login.archer2.ac.uk:/work/z19/z19/otbz19/\n
Note the use of the -P
flag to allow partial transfer -- the same command could be used to restart the transfer after a loss of connection. The -e
flag allows specification of the ssh command - we have used this to add the location of the identity file. The -c
option specifies the cipher to be used as aes128-ctr
which has been found to increase performance. Unfortunately the ~
shortcut is not expanded inside the quoted ssh command, so we have used $HOME to specify the full path. We move our research archive to our project work directory on ARCHER2.
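The behaviour of the tilde can be checked locally. The following is a minimal sketch (the variable names are illustrative) showing that a quoted tilde stays literal, while an unquoted tilde and $HOME both give the full path. Note that $HOME is also expanded inside double quotes, which is why it works in the rsync command above:

```shell
# A tilde inside quotes is kept as a literal character.
in_quotes='~/.ssh/id_RSA_A2'

# An unquoted tilde at the start of a word is expanded to the home directory.
bare=~/.ssh/id_RSA_A2

# $HOME gives the same full path.
from_home=$HOME/.ssh/id_RSA_A2

echo quoted: $in_quotes
echo unquoted: $bare
echo from HOME: $from_home
```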
Note
Remember to replace otbz19
with your username on ARCHER2.
If we were unconcerned about being able to restart an interrupted transfer, we could instead use the scp
command,
scp -c aes128-ctr -i ~/.ssh/id_RSA_A2 all_my_files.tar.gz otbz19@login.archer2.ac.uk:/work/z19/z19/otbz19/\n
but rsync
is recommended for larger transfers.
The following debugging tools are available on ARCHER2:
The Linaro Forge tool provides the DDT parallel debugger. See:
The GNU Debugger for HPC (gdb4hpc) is a GDB-based debugger used to debug applications compiled with CCE, PGI, GNU, and Intel Fortran, C and C++ compilers. It allows programmers to either launch an application within it or to attach to an already-running application. Attaching to an already-running and hanging application is a quick way of understanding why the application is hanging, whereas launching an application through gdb4hpc will allow you to see your application running step-by-step, output the values of variables, and check whether the application runs as expected.
Tip
For your executable to be compatible with gdb4hpc, it will need to be coded with MPI. You will also need to compile your code with the debugging flag -g
(e.g. cc -g my_program.c -o my_exe
).
Launch gdb4hpc
:
module load gdb4hpc\ngdb4hpc\n
You will get some information about this version of the program and, eventually, you will get a command prompt:
gdb4hpc 4.5 - Cray Line Mode Parallel Debugger\nWith Cray Comparative Debugging Technology.\nCopyright 2007-2019 Cray Inc. All Rights Reserved.\nCopyright 1996-2016 University of Queensland. All Rights Reserved.\nType \"help\" for a list of commands.\nType \"help <cmd>\" for detailed help about a command.\ndbg all>\n
We will use launch
to begin a multi-process application within gdb4hpc. Consider that we are wanting to test an application called my_exe
, and that we want this to be launched across all 256 processes in two nodes. We would launch this in gdb4hpc by running:
dbg all> launch --launcher-args=\"--account=[budget code] --partition=standard --qos=standard --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --exclusive --export=ALL\" $my_prog{256} ./my_exe\n
Make sure to replace the --account
input with your budget code (e.g. if you are using budget t01, that part should look like --account=t01
).
The default launcher is srun
and the --launcher-args=\"...\"
allows you to set launcher flags for srun
. The variable $my_prog
is a dummy name for the program being launched and you could use whatever name you want for it -- this will be the name of the srun
job that will be run. The number in the brackets {256}
is the number of processes over which the program will be executed; it is 256 here, but you could use any number. You should try to run this on as few processors as possible -- the more you use, the longer it will take for gdb4hpc to load the program.
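The process count in the braces should match the resources requested through the launcher arguments. The following is a minimal sketch of that arithmetic, using the values from the example above; the variable names are illustrative only:

```shell
# Resources requested through --launcher-args in the example above.
nodes=2
tasks_per_node=128

# Total number of processes that gdb4hpc is asked to launch.
total_ranks=$((nodes * tasks_per_node))

# Build the process-set argument; the quotes keep $my_prog literal.
proc_set='$my_prog{'$total_ranks'}'

echo launcher args: --nodes=$nodes --ntasks-per-node=$tasks_per_node
echo process set: $proc_set
```

For a quicker debugging session, reducing nodes or tasks_per_node reduces the number of processes gdb4hpc has to load.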
Once the program is launched, gdb4hpc will load up the program and begin to run it. The output to screen will look something like:
Starting application, please wait...\nCreating MRNet communication network...\nWaiting for debug servers to attach to MRNet communications network...\nTimeout in 400 seconds. Please wait for the attach to complete.\nNumber of dbgsrvs connected: [0]; Timeout Counter: [1]\nNumber of dbgsrvs connected: [0]; Timeout Counter: [2]\nNumber of dbgsrvs connected: [0]; Timeout Counter: [3]\nNumber of dbgsrvs connected: [1]; Timeout Counter: [0]\nNumber of dbgsrvs connected: [1]; Timeout Counter: [1]\nNumber of dbgsrvs connected: [2]; Timeout Counter: [0]\nFinalizing setup...\nLaunch complete.\nmy_prog{0..255}: Initial breakpoint, main at /PATH/TO/my_program.c:34\n
The line number at which the initial breakpoint is made (in the above example, line 34) corresponds to the line number at which MPI is initialised. You will not be able to see any parts of the code outside of the MPI region of a code with gdb4hpc.
Once the code is loaded, you can use various commands to move through your code. The following lists and describes some of the most useful ones:
help
-- Lists all gdb4hpc commands. You can run help COMMAND_NAME
to learn more about a specific command (e.g. help launch
will tell you about the launch command).
list
-- Will show the current line of code and the 9 lines following. Repeated use of list
will move you down the code in ten-line chunks.
next
-- Will jump to the next step in the program for each process and output which line of code each process is on. It will not enter subroutines. Note that there is no reverse-step in gdb4hpc.
step
-- Like next
, but this will step into subroutines.
up
-- Go up one level in the program (e.g. from a subroutine back to main).
print var
-- Prints the value of variable var
at this point in the code.
watch var
-- Like print, but will print whenever a variable changes value.
quit
-- Exits gdb4hpc.
Remember to exit the interactive session once you are done debugging.
"},{"location":"user-guide/debug/#attaching-with-gdb4hpc","title":"Attaching with gdb4hpc","text":"Attaching to a hanging job using gdb4hpc is a great way of seeing which state each processor is in. However, this does not produce the most visually appealing results. For a more easy-to-read program, please take a look at the STAT tool.
In your interactive session, launch your executable as a background task (by adding an &
at the end of the command). For example, if you are running an executable called my_exe
using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \\\n --account=[budget code] --partition=standard --qos=standard ./my_exe &\n
Make sure to replace the --account
input with your budget code (e.g. if you are using budget t01, that part should look like --account=t01
).
You will need to get the full job ID of the job you have just launched. To do this, run:
squeue -u $USER\n
and find the job ID associated with this interactive session -- this will be the one with the jobname bash
. In this example:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n1050 workq my_mpi_j jsindt R 0:16 1 nid000001\n1051 workq bash jsindt R 0:12 1 nid000002\n
the appropriate job id is 1051. Next, you will need to run sstat
on this job id:
sstat 1051\n
This will output a large amount of information about this specific job. We are looking for the first number of this output, which should look like JOB_ID.##
-- the number after the job ID is the number of slurm tasks performed in this interactive session. For our example (where srun
is the first slurm task performed), the number is 1051.0.
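The lookup above can be scripted. The following is a minimal sketch that extracts the job ID from the sample squeue listing shown earlier and builds the full step ID; on a live system you would pipe the output of squeue -u $USER into awk instead of using the embedded sample. The step index 0 is an assumption that holds only when srun was the first Slurm task in the session:

```shell
# Pick the job whose name is bash from the sample squeue listing.
job_id=$(awk '$3 ~ /^bash$/ { print $1 }' << 'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1050 workq my_mpi_j jsindt R 0:16 1 nid000001
1051 workq bash jsindt R 0:12 1 nid000002
EOF
)

# Assumption: srun was the first task in this session, so the step is 0.
step_id=$job_id.0

echo $step_id
```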
Launch gdb4hpc
:
module load gdb4hpc\ngdb4hpc\n
You will get some information about this version of the program and, eventually, you will get a command prompt:
gdb4hpc 4.5 - Cray Line Mode Parallel Debugger\nWith Cray Comparative Debugging Technology.\nCopyright 2007-2019 Cray Inc. All Rights Reserved.\nCopyright 1996-2016 University of Queensland. All Rights Reserved.\nType \"help\" for a list of commands.\nType \"help <cmd>\" for detailed help about a command.\ndbg all>\n
We will be using the attach
command to attach to our program that hangs. This is done by writing:
dbg all> attach $my_prog JOB_ID.##\n
where JOB_ID.##
is the full job ID found using sstat
(in our example, this would be 1051.0). The name $my_prog
is a dummy-name -- it could be whatever name you like.
As it is attaching, gdb4hpc will output text to screen that looks like:
Attaching to application, please wait...\nCreating MRNet communication network...\nWaiting for debug servers to attach to MRNet communications network...\nTimeout in 400 seconds. Please wait for the attach to complete.\nNumber of dbgsrvs connected: [0]; Timeout Counter: [1]\n\n...\n\nFinalizing setup...\nAttach complete.\nCurrent rank location:\n
After this, you will get an output that, among other things, tells you which line of your code each process is on, and what each process is doing. This can be helpful to see where the hang-up is.
If you accidentally attached to the wrong job, you can detach by running:
dbg all> release $my_prog\n
and re-attach with the correct job ID. You will need to change your dummy name from $my_prog
to something else.
When you are finished using gdb4hpc
, simply run:
dbg all> quit\n
Do not forget to exit your interactive session.
"},{"location":"user-guide/debug/#valgrind4hpc","title":"valgrind4hpc","text":"valgrind4hpc is a Valgrind-based debugging tool to aid in the detection of memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior. Valgrind4hpc manages starting and redirecting output from many copies of Valgrind, as well as recombining and filtering Valgrind messages. If your program can be debugged with Valgrind, it can be debugged with valgrind4hpc.
The valgrind4hpc module enables the use of standard valgrind as well as the valgrind4hpc version more suitable to parallel programs.
"},{"location":"user-guide/debug/#using-valgrind-with-serial-programs","title":"Using Valgrind with serial programs","text":"Launch valgrind4hpc
:
module load valgrind4hpc\n
Next, run your executable through valgrind:
valgrind --tool=memcheck --leak-check=yes my_executable\n
The log outputs to screen. The ERROR SUMMARY
will tell you whether, and how many, memory errors there are in your program. Furthermore, if you compile your code using the -g
debugging flag (e.g. gcc -g my_program.c -o my_executable
), the log will point out the code lines where the error occurs.
Valgrind also includes a tool called Massif that can be used to give insight into the memory usage of your program. It takes regular snapshots and outputs this data into a single file, which can be visualised to show the total amount of memory used as a function of time. This shows when peaks and bottlenecks occur and allows you to identify which data structures in your code are responsible for the largest memory usage of your program.
Documentation explaining how to use Massif is available at the official Massif manual. In short, you should run your executable as follows:
valgrind --tool=massif my_executable\n
The memory profiling data will be output into a file called massif.out.pid
, where pid is the runtime process ID of your program. A custom filename can be chosen using the --massif-out-file option
, as follows:
valgrind --tool=massif --massif-out-file=optional_filename.out my_executable\n
The output file contains raw profiling statistics. To view a summary including a graphical plot of memory usage over time, use the ms_print
command as follows:
ms_print massif.out.12345\n
or, to save to a file:
ms_print massif.out.12345 > massif.analysis.12345\n
This will show total memory usage over time as well as a breakdown of the top data structures contributing to memory usage at each snapshot where there has been a significant allocation or deallocation of memory.
"},{"location":"user-guide/debug/#using-valgrind4hpc-with-parallel-programs","title":"Using Valgrind4hpc with parallel programs","text":"First, load valgrind4hpc
:
module load valgrind4hpc\n
To run valgrind4hpc, first reserve the resources you will use with salloc
. The following reservation request is for 2 nodes (256 physical cores) for 20 minutes on the short queue:
auser@uan01:> salloc --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 \\\n --time=00:20:00 --partition=standard --qos=short \\\n --hint=nomultithread \\\n --distribution=block:block --account=[budget code]\n
Once your allocation is ready, use valgrind4hpc to run and profile your executable. To test an executable called my_executable
that requires two arguments arg1
and arg2
on 2 nodes and 256 processes, run:
valgrind4hpc --tool=memcheck --num-ranks=256 my_executable -- arg1 arg2\n
In particular, note the --
separating the executable from the arguments (this is not necessary if your executable takes no arguments).
Valgrind4hpc only supports certain tools found in valgrind. These are: memcheck, helgrind, exp-sgcheck, and drd. The --valgrind-args=\"arguments\"
allows users to use valgrind options not supported in valgrind4hpc (e.g. --leak-check
) -- note, however, that some of these options might interfere with valgrind4hpc.
More information on valgrind4hpc can be found in the manual (man valgrind4hpc
).
The Stack Trace Analysis Tool (STAT) is a cross-platform debugging tool from the University of Wisconsin-Madison. ATP is based on the same technology as STAT; both are designed to gather and merge stack traces from a running application's parallel processes. STAT can be useful when an application seems to be deadlocked or stuck, i.e. it does not crash but does not progress as expected, and it has been designed to scale to a very large number of processes. Full information on STAT, including use cases, is available at the STAT website.
STAT will attach to a running program and query that program to find out where all the processes in that program currently are. It will then process that data and produce a graph displaying the unique process locations (i.e. where all the processes in the running program currently are). To make this easily understandable it collates together all processes that are in the same place providing only unique program locations for display.
"},{"location":"user-guide/debug/#using-stat-on-archer2","title":"Using STAT on ARCHER2","text":"On the login node, load the cray-stat
module:
module load cray-stat\n
Then, launch your job using srun
as a background task (by adding an &
at the end of the command). For example, if you are running an executable called my_exe
using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \\\n --account=[budget code] --partition=standard --qos=standard ./my_exe &\n
Note
This example has set the job time limit to 1 hour -- if you need longer, change the --time
option.
You will need the process ID (PID) of the job you have just launched -- the PID is printed to screen upon launch, or you can get it by running:
ps -u $USER\n
This will present you with a set of text that looks like this:
PID TTY TIME CMD\n154296 ? 00:00:00 systemd\n154297 ? 00:00:00 (sd-pam)\n154302 ? 00:00:00 sshd\n154303 pts/8 00:00:00 bash\n157150 pts/8 00:00:00 salloc\n157152 pts/8 00:00:00 bash\n157183 pts/8 00:00:00 srun\n157185 pts/8 00:00:00 srun\n157191 pts/8 00:00:00 ps\n
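The PID of the first srun entry, which the next step uses, can be picked out of such a listing automatically. The following is a minimal sketch that works on the sample listing above; on a live system you would pipe the output of ps -u $USER into awk instead of using the embedded sample:

```shell
# Print the PID of the first line whose command is srun, then stop.
srun_pid=$(awk '$4 ~ /^srun$/ { print $1; exit }' << 'EOF'
PID TTY TIME CMD
154296 ? 00:00:00 systemd
154297 ? 00:00:00 (sd-pam)
154302 ? 00:00:00 sshd
154303 pts/8 00:00:00 bash
157150 pts/8 00:00:00 salloc
157152 pts/8 00:00:00 bash
157183 pts/8 00:00:00 srun
157185 pts/8 00:00:00 srun
157191 pts/8 00:00:00 ps
EOF
)

echo $srun_pid
```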
Once your application has reached the point where it hangs, issue the following command (replacing PID with the ID of the first srun task -- in the above example, you would replace PID with 157183):
stat-cl -i PID\n
You will get an output that looks like this:
STAT started at 2020-07-22-13:31:35\nAttaching to job launcher (null):157565 and launching tool daemons...\nTool daemons launched and connected!\nAttaching to application...\nAttached!\nApplication already paused... ignoring request to pause\nSampling traces...\nTraces sampled!\nResuming the application...\nResumed!\nPausing the application...\nPaused!\n\n...\n\nDetaching from application...\nDetached!\n\nResults written to $PATH_TO_RUN_DIRECTORY/stat_results/my_exe.0000\n
Once STAT is finished, you can kill the srun job using scancel
(replacing JID with the job ID of the job you just launched):
scancel JID\n
You can view the results that STAT has produced using the following command (note that \"my_exe\" will need to be replaced with the name of the executable you ran):
stat-view stat_results/my_exe.0000/00_my_exe.0000.3D.dot\n
This produces a graph displaying all the different places within the program that the parallel processes were when you queried them.
Note
To see the graph, you will need to have exported your X display when logging in.
Larger jobs may spend significant time queueing, requiring submission as a batch job. In this case, a slightly different invocation is illustrated as follows:
#!/bin/bash --login\n\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=02:00:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load additional modules\nmodule load cray-stat\n\nexport OMP_NUM_THREADS=1\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# This environment variable is required\nexport CTI_SLURM_OVERRIDE_MC=1\n\n# Request that stat sleeps for 3600 seconds before attaching\n# to our executable which we launch with command introduced\n# with -C:\n\nstat-cl -s 3600 -C srun --unbuffered ./my_exe\n
If the job is hanging it will continue to run until the wall clock exceeds the requested time. Use the stat-view
utility to inspect the results, as discussed above.
To enable ATP you should load the atp module and set the ATP_ENABLED
environment variable to 1 on the login node:
module load atp\nexport ATP_ENABLED=1\n# Fix for a known issue:\nexport HOME=${HOME/home/work}\n
Then, launch your job using srun
as a background task (by adding an &
at the end of the command). For example, if you are running an executable called my_exe
using 256 processes, you would run:
srun -n 256 --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=01:00:00 --export=ALL \\\n --account=[budget code] --partition=standard --qos=standard ./my_exe &\n
Note
This example has set the job time limit to 1 hour -- if you need longer, change the --time
option.
Once the job has finished running, load the cray-stat
module to view the results:
module load cray-stat\n
and view the merged stack trace using:
stat-view atpMergedBT.dot\n
Note
To see the graph, you will need to have exported your X display when logging in.
"},{"location":"user-guide/dev-environment-4cab/","title":"Application development environment: 4-cabinet system","text":"Important
This section covers the application development environment on the initial, 4-cabinet ARCHER2 system. For documentation on the application development environment on the full ARCHER2 system, please see Application development environment: full system.
"},{"location":"user-guide/dev-environment-4cab/#whats-available","title":"What's available","text":"ARCHER2 runs on the Cray Linux Environment (a version of SUSE Linux), and provides a development environment which includes:
Access to particular software, and particular versions, is managed by a standard TCL module framework. Most software is available via standard software modules and the different programming environments are available via module collections.
You can see what programming environments are available with:
auser@uan01:~> module savelist\nNamed collection list:\n 1) PrgEnv-aocc 2) PrgEnv-cray 3) PrgEnv-gnu\n
Other software modules can be listed with
auser@uan01:~> module avail\n------------------------------- /opt/cray/pe/perftools/20.09.0/modulefiles --------------------------------\nperftools perftools-lite-events perftools-lite-hbm perftools-nwpc \nperftools-lite perftools-lite-gpu perftools-lite-loops perftools-preload \n\n---------------------------------- /opt/cray/pe/craype/2.7.0/modulefiles ----------------------------------\ncraype-hugepages1G craype-hugepages8M craype-hugepages128M craype-network-ofi \ncraype-hugepages2G craype-hugepages16M craype-hugepages256M craype-network-slingshot10 \ncraype-hugepages2M craype-hugepages32M craype-hugepages512M craype-x86-rome \ncraype-hugepages4M craype-hugepages64M craype-network-none \n\n------------------------------------- /usr/local/Modules/modulefiles --------------------------------------\ndot module-git module-info modules null use.own \n\n-------------------------------------- /opt/cray/pe/cpe-prgenv/7.0.0 --------------------------------------\ncpe-aocc cpe-cray cpe-gnu \n\n-------------------------------------------- /opt/modulefiles ---------------------------------------------\naocc/2.1.0.3(default) cray-R/4.0.2.0(default) gcc/8.1.0 gcc/9.3.0 gcc/10.1.0(default) \n\n\n---------------------------------------- /opt/cray/pe/modulefiles -----------------------------------------\natp/3.7.4(default) cray-mpich-abi/8.0.15 craype-dl-plugin-py3/20.06.1(default) \ncce/10.0.3(default) cray-mpich-ucx/8.0.15 craype/2.7.0(default) \ncray-ccdb/4.7.1(default) cray-mpich/8.0.15(default) craypkg-gen/1.3.10(default) \ncray-cti/2.7.3(default) cray-netcdf-hdf5parallel/4.7.4.0 gdb4hpc/4.7.3(default) \ncray-dsmml/0.1.2(default) cray-netcdf/4.7.4.0 iobuf/2.0.10(default) \ncray-fftw/3.3.8.7(default) cray-openshmemx/11.1.1(default) papi/6.0.0.2(default) \ncray-ga/5.7.0.3 cray-parallel-netcdf/1.12.1.0 perftools-base/20.09.0(default) \ncray-hdf5-parallel/1.12.0.0 cray-pmi-lib/6.0.6(default) valgrind4hpc/2.7.2(default) \ncray-hdf5/1.12.0.0 cray-pmi/6.0.6(default) 
\ncray-libsci/20.08.1.2(default) cray-python/3.8.5.0(default) \n
A full discussion of the module system is available in the Software environment section.
A consistent set of modules is loaded on login to the machine (currently PrgEnv-cray
, see below). Developing applications then means selecting and loading the appropriate set of modules before starting work.
This section is aimed at code developers and will concentrate on the compilation environment and building libraries and executables, and specifically parallel executables. Other topics such as Python and Containers are covered in more detail in separate sections of the documentation.
"},{"location":"user-guide/dev-environment-4cab/#managing-development","title":"Managing development","text":"ARCHER2 supports common revision control software such as git
.
Standard GNU autoconf tools are available, along with make
(which is GNU Make). Versions of cmake
are available.
Note
Some of these tools are part of the system software, and typically reside in /usr/bin
, while others are provided as part of the module system. Some tools may be available in different versions via both /usr/bin
and via the module system.
There are three different compiler environments available on ARCHER2: AMD (AOCC), Cray (CCE), and GNU (GCC). The current compiler suite is selected via the programming environment, while the specific compiler versions are determined by the relevant compiler module. A summary is:
Suite name / Module / Programming environment collection
CCE / cce / PrgEnv-cray
GCC / gcc / PrgEnv-gnu
AOCC / aocc / PrgEnv-aocc
For example, at login, the default set of modules are:
Currently Loaded Modulefiles:\n1) cpe-cray 7) cray-dsmml/0.1.2(default) \n2) cce/10.0.3(default) 8) perftools-base/20.09.0(default) \n3) craype/2.7.0(default) 9) xpmem/2.2.35-7.0.1.0_1.3__gd50fabf.shasta(default) \n4) craype-x86-rome 10) cray-mpich/8.0.15(default) \n5) libfabric/1.11.0.0.233(default) 11) cray-libsci/20.08.1.2(default) \n6) craype-network-ofi \n
from which we see the default programming environment is Cray (indicated by cpe-cray
(at 1 in the list above) and the default compiler module is cce/10.0.3
(at 2 in the list above). The programming environment will give access to a consistent set of compiler, MPI library via cray-mpich
(at 10), and other libraries e.g., cray-libsci
(at 11 in the list above).
Within a given programming environment, it is possible to swap to a different compiler version by swapping the relevant compiler module.
To ensure consistent behaviour, compilation of C, C++, and Fortran source code should then take place using the appropriate compiler wrapper: cc
, CC
, and ftn
, respectively. The wrapper will automatically call the relevant underlying compiler and add the appropriate include directories and library locations to the invocation. This typically eliminates the need to specify this additional information explicitly in the configuration stage. To see the details of the exact compiler invocation use the -craype-verbose
flag to the compiler wrapper.
The default link time behaviour is also related to the current programming environment. See the section below on Linking and libraries.
Users should not, in general, invoke specific compilers at compile/link stages. In particular, gcc
, which may default to /usr/bin/gcc
, should not be used. The compiler wrappers cc
, CC
, and ftn
should be used via the appropriate module. Other common MPI compiler wrappers e.g., mpicc
should also be replaced by the relevant wrapper cc
(mpicc
etc are not available).
Important
Always use the compiler wrappers cc
, CC
, and/or ftn
and not a specific compiler invocation. This will ensure consistent compile/link time behaviour.
Further information on both the compiler wrappers and the individual compilers themselves is available via the command line and via standard man
pages. The man
page for the compiler wrappers is common to all programming environments, while the man
page for individual compilers depends on the currently loaded programming environment. The following table summarises options for obtaining information on the compiler and compile options:
man craycc
man crayCC
man crayftn
GNU man gcc
man g++
man gfortran
Wrappers man cc
man CC
man ftn
Tip
You can also pass the --help
option to any of the compilers or wrappers to get a summary of how to use them. The Cray Fortran compiler uses ftn --craype-help
to access the help options.
Tip
There are no man
pages for the AOCC compilers at the moment.
Tip
Cray C/C++ is based on Clang and therefore supports similar options to clang/gcc (man clang
is in fact equivalent to man craycc
). clang --help
will produce a full summary of options with Cray-specific options marked \"Cray\". The craycc
man page concentrates on these Cray extensions to the clang
front end and does not provide an exhaustive description of all clang
options. Cray Fortran is not based on Flang and so takes different options from flang/gfortran.
Executables on ARCHER2 link dynamically, and the Cray Programming Environment does not currently support static linking. This is in contrast to ARCHER where the default was to build statically.
If you attempt to link statically, you will see errors similar to:
/usr/bin/ld: cannot find -lpmi\n/usr/bin/ld: cannot find -lpmi2\ncollect2: error: ld returned 1 exit status\n
The compiler wrapper scripts on ARCHER2 link in runtime libraries using the runpath
by default. This means that the paths to the runtime libraries are encoded into the executable so you do not need to load the compiler environment in your job submission scripts.
If you are unsure which compiler you should choose, we suggest the starting point should be the GNU compiler collection (GCC, PrgEnv-gnu
); this is perhaps the most commonly used by code developers, particularly in the open source software domain. A portable, standard-conforming code should (in principle) compile in any of the three programming environments.
For users requiring specific compiler features, such as co-array Fortran, the recommended starting point would be Cray. The following sections provide further details of the different programming environments.
Warning
Intel compilers are not available on ARCHER2.
"},{"location":"user-guide/dev-environment-4cab/#amd-optimizing-cc-compiler-aocc","title":"AMD Optimizing C/C++ Compiler (AOCC)","text":"The AMD Optimizing C/C++ Compiler (AOCC) is a clang-based optimising compiler. AOCC (despite its name) includes a flang-based Fortran compiler.
Switch to the AOCC programming environment via
$ module restore PrgEnv-aocc\n
Note
Further details on AOCC will appear here as they become available.
"},{"location":"user-guide/dev-environment-4cab/#aocc-reference-material","title":"AOCC reference material","text":"The Cray compiler environment (CCE) is the default compiler at the point of login. CCE supports C/C++ (along with unified parallel C UPC), and Fortran (including co-array Fortran). Support for OpenMP parallelism is available for both C/C++ and Fortran (currently OpenMP 4.5, with a number of exceptions).
The Cray C/C++ compiler is based on a clang front end, and so compiler options are similar to those for gcc/clang. However, the Fortran compiler remains based around Cray-specific options. Be sure to separate C/C++ compiler options and Fortran compiler options (typically CFLAGS
and FFLAGS
) if compiling mixed C/Fortran applications.
Switch to the Cray programming environment via
$ module restore PrgEnv-cray\n
"},{"location":"user-guide/dev-environment-4cab/#useful-cce-cc-options","title":"Useful CCE C/C++ options","text":"When using the compiler wrappers cc
or CC
, some of the following options may be useful:
Language, warning, Debugging options:
Option Comment -std=<standard>
Default is -std=gnu11
(gnu++14
for C++) [1] Performance options:
Option Comment -Ofast
Optimisation levels: -O0, -O1, -O2, -O3, -Ofast -ffp=level
Floating point maths optimisations levels 0-4 [2] -flto
Link time optimisation Miscellaneous options:
Option Comment -fopenmp
Compile OpenMP (default is off) -v
Display verbose output from compiler stages
Notes
[1] -std=gnu11
gives c11
plus GNU extensions (likewise c++14
plus GNU extensions). See https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/C-Extensions.html
[2] -ffp=3
is implied by -Ofast
or -ffast-math
Language, Warning, Debugging options:
Option Comment -m <level>
Message level (default -m 3
errors and warnings) Performance options:
Option Comment -O <level>
Optimisation levels: -O0 to -O3 (default -O2) -h fp<level>
Floating point maths optimisations levels 0-3 -h ipa
Inter-procedural analysis Miscellaneous options:
Option Comment -h omp
Compile OpenMP (default is -hnoomp
) -v
Display verbose output from compiler stages"},{"location":"user-guide/dev-environment-4cab/#gnu-compiler-collection-gcc","title":"GNU compiler collection (GCC)","text":"The commonly used open source GNU compiler collection is available and provides C/C++ and Fortran compilers.
The GNU compiler collection is loaded by switching to the GNU programming environment:
$ module restore PrgEnv-gnu\n
Bug
The gcc/8.1.0
module is available on ARCHER2 but cannot be used as the supporting scientific and system libraries are not available. You should not use this version of GCC.
Warning
If you want to use GCC version 10 or greater to compile Fortran code, with the old MPI interfaces (i.e. use mpi
or INCLUDE 'mpif.h'
) you must add the -fallow-argument-mismatch
option (or equivalent) when compiling otherwise you will see compile errors associated with MPI functions. The reason for this is that past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines using the old MPI interfaces where arrays of different types are passed to, for example, MPI_Send()
. This will now generate an error as not standard conforming. The -fallow-argument-mismatch
option is used to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
If you use the Fortran 2008 MPI interface (i.e. use mpi_f08
) then you should not need to add this option.
Fortran language MPI bindings are described in more detail in the MPI Standard documentation.
"},{"location":"user-guide/dev-environment-4cab/#useful-gnu-fortran-options","title":"Useful Gnu Fortran options","text":"Option Comment-std=<standard>
Default is gnu -fallow-argument-mismatch
Allow mismatched procedure arguments. This argument is required for compiling MPI Fortran code with GCC version 10 or greater if you are using the older MPI interfaces (see warning above) -fbounds-check
Use runtime checking of array indices -fopenmp
Compile OpenMP (default is no OpenMP) -v
Display verbose output from compiler stages Tip
The standard
in -std
may be one of f95
, f2003
, f2008
or f2018
. The default option -std=gnu
is the latest Fortran standard plus GNU extensions.
Warning
Past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines where arrays of different types are passed to MPI_Send()
and so on. This will now generate an error as not standard conforming. Use -fallow-argument-mismatch
to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
C/C++ documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/
Fortran documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gfortran/
HPE Cray provide, as standard, an MPICH implementation of the message passing interface which is specifically optimised for the ARCHER2 network. The current implementation supports MPI standard version 3.1.
The HPE Cray MPICH implementation is linked into software by default when compiling using the standard wrapper scripts: cc
, CC
and ftn
.
MPI standard documents: https://www.mpi-forum.org/docs/
"},{"location":"user-guide/dev-environment-4cab/#linking-and-libraries","title":"Linking and libraries","text":"Linking to libraries is performed dynamically on ARCHER2. One can use the -craype-verbose
flag to the compiler wrapper to check exactly what linker arguments are invoked. The compiler wrapper scripts encode the paths to the programming environment system libraries using RUNPATH. This ensures that the executable can find the correct runtime libraries without the matching software modules loaded.
The library RUNPATH associated with an executable can be inspected via, e.g.,
$ readelf -d ./a.out\n
(swap a.out
for the name of the executable you are querying).
Modules with names prefixed by cray-
are provided by HPE Cray, and are supported to be consistent with any of the programming environments and associated compilers. These modules should be the first choice for access to software libraries if available.
Tip
More information on the different software libraries on ARCHER2 can be found in the Software libraries section of the user guide.
"},{"location":"user-guide/dev-environment-4cab/#switching-to-a-different-hpe-cray-programming-environment-release","title":"Switching to a different HPE Cray Programming Environment release","text":"Important
See the section below on using non-default versions of HPE Cray libraries, as this process will generally need to be followed when using software from non-default PE installs.
Access to non-default PE environments is controlled by the use of the cpe
modules. These modules are typically loaded after you have restored a PrgEnv and loaded all the other modules you need and will set your compile environment to match that in the other PE release. This means:
For example, if you have a code that uses the GNU programming environment and the FFTW and NetCDF parallel libraries, and you want to compile in the (non-default) 21.03 programming environment, you would do the following:
First, restore the GNU programming environment and load the required library modules (FFTW and NetCDF HDF5 parallel). The loaded module list shows they are the versions from the default (20.10) programming environment:
auser@uan02:/work/t01/t01/auser> module restore -s PrgEnv-gnu\nauser@uan02:/work/t01/t01/auser> module load cray-fftw\nauser@uan02:/work/t01/t01/auser> module load cray-netcdf\nauser@uan02:/work/t01/t01/auser> module load cray-netcdf-hdf5parallel\nauser@uan02:/work/t01/t01/auser> module list\nCurrently Loaded Modulefiles:\n 1) cpe-gnu 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default) \n 2) gcc/10.1.0(default) 10) cray-mpich/8.0.16(default) \n 3) craype/2.7.2(default) 11) cray-libsci/20.10.1.2(default) \n 4) craype-x86-rome 12) bolt/0.7 \n 5) libfabric/1.11.0.0.233(default) 13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env \n 6) craype-network-ofi 14) /usr/local/share/epcc-module/epcc-module-loader \n 7) cray-dsmml/0.1.2(default) 15) cray-fftw/3.3.8.8(default) \n 8) perftools-base/20.10.0(default) 16) cray-netcdf-hdf5parallel/4.7.4.2(default) \n
Now, load the cpe/21.03
programming environment module to switch all the currently loaded HPE Cray modules from the default (20.10) programming environment version to the 21.03 programming environment versions:
auser@uan02:/work/t01/t01/auser> module load cpe/21.03\nSwitching to cray-dsmml/0.1.3.\nSwitching to cray-fftw/3.3.8.9.\nSwitching to cray-libsci/21.03.1.1.\nSwitching to cray-mpich/8.1.3.\nSwitching to cray-netcdf-hdf5parallel/4.7.4.3.\nSwitching to craype/2.7.5.\nSwitching to gcc/9.3.0.\nSwitching to perftools-base/21.02.0.\n\nLoading cpe/21.03\n Unloading conflict: cray-dsmml/0.1.2 cray-fftw/3.3.8.8 cray-libsci/20.10.1.2 cray-mpich/8.0.16 cray-netcdf-hdf5parallel/4.7.4.2\n craype/2.7.2 gcc/10.1.0 perftools-base/20.10.0\n Loading requirement: cray-dsmml/0.1.3 cray-fftw/3.3.8.9 cray-libsci/21.03.1.1 cray-mpich/8.1.3 cray-netcdf-hdf5parallel/4.7.4.3\n craype/2.7.5 gcc/9.3.0 perftools-base/21.02.0\nauser@uan02:/work/t01/t01/auser> module list\nCurrently Loaded Modulefiles:\n 1) cpe-gnu 9) cray-dsmml/0.1.3 17) cpe/21.03(default) \n 2) craype-x86-rome 10) cray-fftw/3.3.8.9 \n 3) libfabric/1.11.0.0.233(default) 11) cray-libsci/21.03.1.1 \n 4) craype-network-ofi 12) cray-mpich/8.1.3 \n 5) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default) 13) cray-netcdf-hdf5parallel/4.7.4.3 \n 6) bolt/0.7 14) craype/2.7.5 \n 7) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env 15) gcc/9.3.0 \n 8) /usr/local/share/epcc-module/epcc-module-loader 16) perftools-base/21.02.0 \n
Finally (as noted above), you will need to modify the value of LD_LIBRARY_PATH
before you compile your software to ensure it picks up the non-default versions of libraries:
auser@uan02:/work/t01/t01/auser> export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
Now you can go ahead and compile your software with the new programming environment.
Important
The cpe
modules only change the versions of software modules provided as part of the HPE Cray programming environments. Any modules provided by the ARCHER2 service will need to be loaded manually after you have completed the process described above.
Note
Unloading the cpe
module does not restore the original programming environment release. To restore the default programming environment release you should log out and then log back in to ARCHER2.
Bug
The cpe/21.03
module has a known issue with PrgEnv-gnu
where it loads an old version of GCC (9.3.0) rather than the correct, newer version (10.2.0). You can resolve this by using the sequence:
module restore -s PrgEnv-gnu\n...load any other modules you need...\nmodule load cpe/21.03\nmodule unload cpe/21.03\nmodule swap gcc gcc/10.2.0\n
"},{"location":"user-guide/dev-environment-4cab/#available-hpe-cray-programming-environment-releases-on-archer2","title":"Available HPE Cray Programming Environment releases on ARCHER2","text":"ARCHER2 currently has the following HPE Cray Programming Environment releases available:
cpe
module cpe/21.03
module
Tip
You can see which programming environment release you currently have loaded by using module list
and looking at the version number of the cray-libsci
module you have loaded. The first two numbers indicate the version of the PE you have loaded. For example, if you have cray-libsci/20.10.1.2
loaded then you are using the 20.10 PE release.
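The rule in the tip above can be sketched as a small bash function. This is a minimal sketch, not an ARCHER2-provided tool: the function name pe_release is hypothetical, and it assumes the version string has at least two dot-separated fields.

```shell
# Hypothetical helper (not provided by ARCHER2): print the PE release,
# i.e. the first two version fields, for a cray-libsci module name
# such as cray-libsci/20.10.1.2 (a bare version such as 20.10.1.2 also works).
pe_release() {
    local version=${1#*/}        # drop the module name up to the slash
    local major=${version%%.*}   # first field, e.g. 20
    local rest=${version#*.}     # remaining fields, e.g. 10.1.2
    local release=${major}.${rest%%.*}
    echo ${release}
}

pe_release cray-libsci/20.10.1.2   # prints 20.10
```

You would pass it the cray-libsci entry reported by module list.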
If you wish to make use of non-default versions of libraries provided by HPE Cray (usually because they are part of a non-default PE release: either old or new) then you need to make changes at both compile and runtime. In summary, you need to load the correct module and also make changes to the LD_LIBRARY_PATH
environment variable.
At compile time you need to load the version of the library module before you compile and set the LD_LIBRARY_PATH environment variable to include the contents of $CRAY_LD_LIBRARY_PATH
as the first entry. For example, to use the non-default 20.08.1.2 version of HPE Cray LibSci in the default programming environment (Cray Compiler Environment, CCE) you would first set up the environment to compile with:
auser@uan01:~/test/libsci> module swap cray-libsci cray-libsci/20.08.1.2 \nauser@uan01:~/test/libsci> export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
The order is important here: every time you change a module, you will need to reset the value of LD_LIBRARY_PATH
for the process to work (it will not be updated automatically).
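Because LD_LIBRARY_PATH has to be reset after every module change, it can help to wrap the export in a function. The following is a minimal sketch; the function name refresh_ld_library_path is hypothetical and is not provided by ARCHER2 or HPE Cray.

```shell
# Hypothetical helper: prepend the contents of CRAY_LD_LIBRARY_PATH to
# LD_LIBRARY_PATH. Call it again after every module load or swap.
refresh_ld_library_path() {
    # Nothing to prepend if the Cray variable is unset or empty
    [[ -z ${CRAY_LD_LIBRARY_PATH:-} ]] && return 0
    # Only add the separating colon if LD_LIBRARY_PATH already has content
    export LD_LIBRARY_PATH=${CRAY_LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
}
```

As with the plain export command, repeated calls prepend the Cray paths again rather than removing earlier entries, so stale entries are only shadowed, not deleted.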
Now you can compile your code. You can check that the executable is using the correct version of LibSci with the ldd
command: look for the line beginning libsci_cray.so.5
, where you should see the version in the path to the library file:
auser@uan01:~/test/libsci> ldd dgemv.x \n linux-vdso.so.1 (0x00007ffe4a7d2000)\n libsci_cray.so.5 => /opt/cray/pe/libsci/20.08.1.2/CRAY/9.0/x86_64/lib/libsci_cray.so.5 (0x00007fafd6a43000)\n libdl.so.2 => /lib64/libdl.so.2 (0x00007fafd683f000)\n libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007fafd663c000)\n libquadmath.so.0 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libquadmath.so.0 (0x00007fafd63fc000)\n libmodules.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libmodules.so.1 (0x00007fafd61e0000)\n libfi.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libfi.so.1 (0x00007fafd5abe000)\n libcraymath.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libcraymath.so.1 (0x00007fafd57e2000)\n libf.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libf.so.1 (0x00007fafd554f000)\n libu.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libu.so.1 (0x00007fafd523b000)\n libcsup.so.1 => /opt/cray/pe/cce/10.0.4/cce/x86_64/lib/libcsup.so.1 (0x00007fafd5035000)\n libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00007fafd4c62000)\n libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fafd4a43000)\n libc.so.6 => /lib64/libc.so.6 (0x00007fafd4688000)\n libm.so.6 => /lib64/libm.so.6 (0x00007fafd4350000)\n /lib64/ld-linux-x86-64.so.2 (0x00007fafda988000)\n librt.so.1 => /lib64/librt.so.1 (0x00007fafd4148000)\n libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00007fafd3c92000)\n libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00007fafd3a7a000)\n
Tip
If any of the libraries point to versions in the /opt/cray/pe/lib64
directory then these are using the default versions of the libraries rather than the specific versions. This happens at compile time if you have forgotten to load the right module and set $LD_LIBRARY_PATH
afterwards.
At run time (typically in your job script) you need to repeat the environment setup steps (you can also use the ldd
command in your job submission script to check the library is pointing to the correct version). For example, a job submission script to run our dgemv.x
executable with the non-default version of LibSci could look like:
#!/bin/bash\n#SBATCH --job-name=dgemv\n#SBATCH --time=0:20:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --reservation=shortqos\n\n# Load the standard environment module\nmodule load epcc-job-env\n\n# Set up the environment to use the non-default version of LibSci\n# We use \"module swap\" as the \"cray-libsci\" module is loaded by default.\n# This must be done after loading the \"epcc-job-env\" module\nmodule swap cray-libsci cray-libsci/20.08.1.2\nexport LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n\n# Check which library versions the executable is pointing to\nldd dgemv.x\n\nexport OMP_NUM_THREADS=1\n\nsrun --hint=nomultithread --distribution=block:block dgemv.x\n
Tip
As when compiling, the order of commands matters. Setting the value of LD_LIBRARY_PATH
must happen after you have finished all your module
commands for it to have the correct effect.
Important
You must set up the environment at both compile and run time, otherwise you will end up using the default version of the library.
"},{"location":"user-guide/dev-environment-4cab/#compiling-in-compute-nodes","title":"Compiling in compute nodes","text":"Sometimes you may wish to compile in a batch job. For example, the compile process may take a long time or the compile process is part of the research workflow and can be coupled to the production job. Unlike on the login nodes, the /home
file system is not available on the compute nodes.
An example job submission script for a compile job using make
(assuming the Makefile is in the same directory as the job submission script) would be:
#!/bin/bash\n\n#SBATCH --job-name=compile\n#SBATCH --time=00:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the compilation environment (cray, gnu or aocc)\nmodule restore /etc/cray-pe.d/PrgEnv-cray\n\nmake clean\n\nmake\n
Warning
Do not forget to include the full path when the compilation environment is restored. For instance:
module restore /etc/cray-pe.d/PrgEnv-cray
You can also use a compute node in an interactive way using salloc
. Please see Section Using salloc to reserve resources for further details. Once your interactive session is ready, you can load the compilation environment and compile the code.
The ARCHER2 CSE team at EPCC and other contributors provide build configurations and instructions for a range of research software, software libraries and tools on a variety of HPC systems (including ARCHER2) in a public GitHub repository. See:
The repository always welcomes contributions from the ARCHER2 user community.
"},{"location":"user-guide/dev-environment-4cab/#support-for-building-software-on-archer2","title":"Support for building software on ARCHER2","text":"If you run into issues building software on ARCHER2 or the software you require is not available then please contact the ARCHER2 Service Desk with any questions you have.
"},{"location":"user-guide/dev-environment/","title":"Application development environment","text":""},{"location":"user-guide/dev-environment/#whats-available","title":"What's available","text":"ARCHER2 runs the HPE Cray Linux Environment (a version of SUSE Linux), and provides a development environment which includes:
Access to particular software, and particular versions, is managed by an Lmod module framework. Most software is available by loading modules, including the different compiler environments.
You can see what compiler environments are available with:
auser@uan01:~> module avail PrgEnv\n\n--------------------------------------- /opt/cray/pe/lmod/modulefiles/core ----------------------------------------\n PrgEnv-aocc/8.3.3 PrgEnv-cray/8.3.3 (L) PrgEnv-gnu/8.3.3\n\n Where:\n L: Module is loaded\n\nModule defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.\nSee https://lmod.readthedocs.io/en/latest/060_locating.html for details.\n\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
Other software modules can be searched using the module spider
command:
auser@uan01:~> module spider\n\n---------------------------------------------------------------------------------------------------------------\nThe following is a list of the modules and extensions currently available:\n---------------------------------------------------------------------------------------------------------------\n PrgEnv-aocc: PrgEnv-aocc/8.3.3\n\n PrgEnv-cray: PrgEnv-cray/8.3.3\n\n PrgEnv-gnu: PrgEnv-gnu/8.3.3\n\n amd-uprof: amd-uprof/3.6.449\n\n aocc: aocc/3.2.0\n\n aocc-mixed: aocc-mixed/3.2.0\n\n aocl: aocl/3.1, aocl/4.0\n\n forge: forge/24.0\n\n atp: atp/3.14.16\n\n bolt: bolt/0.7, bolt/0.8\n\n boost: boost/1.72.0, boost/1.81.0\n\n castep: castep/22.11\n\n cce: cce/15.0.0\n\n...output trimmed...\n
A full discussion of the module system is available in the Software environment section.
A consistent set of modules is loaded on login to the machine (currently PrgEnv-cray
, see below). Developing applications then means selecting and loading the appropriate set of modules before starting work.
This section is aimed at code developers and will concentrate on the compilation environment, building libraries and executables, specifically parallel executables. Other topics such as Python and Containers are covered in more detail in separate sections of the documentation.
Tip
If you want to return to the login module state without having to log out and back in again, you can use:
module restore\n
This is also handy for build scripts to ensure you are starting from a known state."},{"location":"user-guide/dev-environment/#compiler-environments","title":"Compiler environments","text":"There are three different compiler environments available on ARCHER2:
The current compiler suite is selected via the PrgEnv
module, while the specific compiler versions are determined by the relevant compiler module. A summary is:
PrgEnv-cray
cce
GCC PrgEnv-gnu
gcc
AOCC PrgEnv-aocc
aocc
For example, at login, the default set of modules are:
auser@ln03:~> module list\n\n 1) craype-x86-rome 6) cce/15.0.0 11) PrgEnv-cray/8.3.3\n 2) libfabric/1.12.1.2.2.0.0 7) craype/2.7.19 12) bolt/0.8\n 3) craype-network-ofi 8) cray-dsmml/0.2.2 13) epcc-setup-env\n 4) perftools-base/22.12.0 9) cray-mpich/8.1.23 14) load-epcc-module\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) cray-libsci/22.12.1.1\n
from which we see the default compiler environment is Cray (indicated by PrgEnv-cray
(at 11 in the list above)) and the default compiler module is cce/15.0.0
(at 6 in the list above). The compiler environment will give access to a consistent set of compiler, MPI library via cray-mpich
(at 9), and other libraries e.g., cray-libsci
(at 10 in the list above).
Switching between different compiler environments is achieved using the module load
command. For example, to switch from the default HPE Cray (CCE) compiler environment to the GCC environment, you would use:
auser@ln03:~> module load PrgEnv-gnu\n\nLmod is automatically replacing \"cce/15.0.0\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-gnu/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n
If you then use the module list
command, you will see that your environment has been changed to the GCC environment:
auser@ln03:~> module list\n\nCurrently Loaded Modules:\n 1) craype-x86-rome 6) bolt/0.8 11) cray-dsmml/0.2.2\n 2) libfabric/1.12.1.2.2.0.0 7) epcc-setup-env 12) cray-mpich/8.1.23\n 3) craype-network-ofi 8) load-epcc-module 13) cray-libsci/22.12.1.1\n 4) perftools-base/22.12.0 9) gcc/11.2.0 14) PrgEnv-gnu/8.3.3\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) craype/2.7.19\n
"},{"location":"user-guide/dev-environment/#switching-between-compiler-versions","title":"Switching between compiler versions","text":"Within a given compiler environment, it is possible to swap to a different compiler version by swapping the relevant compiler module. To switch to the GNU compiler environment from the default HPE Cray compiler environment and then swap the version of GCC from the 11.2.0 default to the older 10.3.0 version, you would use:
auser@ln03:~> module load PrgEnv-gnu\n\nLmod is automatically replacing \"cce/15.0.0\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-gnu/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n\nauser@ln03:~> module load gcc/10.3.0\n\nThe following have been reloaded with a version change:\n 1) gcc/11.2.0 => gcc/10.3.0\n
The first module load command moves to the GNU compiler environment and the second moves to the older version of GCC. As before, module list
will show that your environment has been changed:
auser@ln03:~> module list\n\nCurrently Loaded Modules:\n 1) craype-x86-rome 6) bolt/0.8 11) cray-libsci/22.12.1.1\n 2) libfabric/1.12.1.2.2.0.0 7) epcc-setup-env 12) PrgEnv-gnu/8.3.3\n 3) craype-network-ofi 8) load-epcc-module 13) gcc/10.3.0\n 4) perftools-base/22.12.0 9) craype/2.7.19 14) cray-mpich/8.1.23\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) cray-dsmml/0.2.2\n
"},{"location":"user-guide/dev-environment/#compiler-wrapper-scripts-cc-cc-ftn","title":"Compiler wrapper scripts: cc
, CC
, ftn
","text":"To ensure consistent behaviour, compilation of C, C++, and Fortran source code should then take place using the appropriate compiler wrapper: cc
, CC
, and ftn
, respectively. The wrapper will automatically call the relevant underlying compiler and add the appropriate include directories and library locations to the invocation. This typically eliminates the need to specify this additional information explicitly in the configuration stage. To see the details of the exact compiler invocation use the -craype-verbose
flag to the compiler wrapper.
The default link time behaviour is also related to the current programming environment. See the section below on Linking and libraries.
Users should not, in general, invoke specific compilers at compile/link stages. In particular, gcc
, which may default to /usr/bin/gcc
, should not be used. The compiler wrappers cc
, CC
, and ftn
should be used (with the underlying compiler type and version set by the module system). Other common MPI compiler wrappers e.g., mpicc
, should also be replaced by the relevant wrapper, e.g. cc
(commands such as mpicc
are not available on ARCHER2).
Important
Always use the compiler wrappers cc
, CC
, and/or ftn
and not a specific compiler invocation. This will ensure consistent compile/link time behaviour.
Tip
If you are using a build system such as Make or CMake then you will need to replace all occurrences of mpicc
with cc
, mpicxx
/mpic++
with CC
and mpif90
with ftn
.
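Many build systems take the compiler from the conventional CC, CXX and FC variables instead of hard-coding the MPI wrapper names. A minimal sketch, assuming your Makefile or CMake project honours these variables (CMake reads them from the environment on the first configure of a build directory only):

```shell
# Point the conventional build variables at the ARCHER2 compiler wrappers.
# Assumption: the build reads CC/CXX/FC rather than hard-coding mpicc etc.
export CC=cc      # C       (replaces mpicc)
export CXX=CC     # C++     (replaces mpicxx / mpic++)
export FC=ftn     # Fortran (replaces mpif90)

echo C=${CC} C++=${CXX} Fortran=${FC}
```

If the wrapper names are written directly into the build files, you will still need to edit those occurrences by hand as described in the tip above.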
Further information on both the compiler wrappers and the individual compilers themselves is available via the command line and via standard man
pages. The man
page for the compiler wrappers is common to all programming environments, while the man
page for individual compilers depends on the currently loaded programming environment. The following table summarises options for obtaining information on the compiler and compile options:
man clang
man clang++
man crayftn
GNU man gcc
man g++
man gfortran
Wrappers man cc
man CC
man ftn
Tip
You can also pass the --help
option to any of the compilers or wrappers to get a summary of how to use them. The Cray Fortran compiler uses ftn --craype-help
to access the help options.
Tip
There are no man
pages for the AOCC compilers at the moment.
Tip
Cray C/C++ is based on Clang and therefore supports similar options to clang/gcc. clang --help
will produce a full summary of options with Cray-specific options marked \"Cray\". The clang
man page on ARCHER2 concentrates on these Cray extensions to the clang
front end and does not provide an exhaustive description of all clang
options. Cray Fortran is not based on Flang and so takes different options from flang/gfortran.
If you are unsure which compiler you should choose, we suggest the starting point should be the GNU compiler collection (GCC, PrgEnv-gnu
); this is perhaps the most commonly used by code developers, particularly in the open source software domain. A portable, standard-conforming code should (in principle) compile in any of the three compiler environments.
For users requiring specific compiler features, such as coarray Fortran, the recommended starting point would be Cray. The following sections provide further details of the different compiler environments.
Warning
Intel compilers are not currently available on ARCHER2.
"},{"location":"user-guide/dev-environment/#gnu-compiler-collection-gcc","title":"GNU compiler collection (GCC)","text":"The commonly used open source GNU compiler collection is available and provides C/C++ and Fortran compilers.
Switch to the GCC compiler environment from the default CCE (cray) compiler environment via:
auser@ln03:~> module load PrgEnv-gnu\n\nLmod is automatically replacing \"cce/15.0.0\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-gnu/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n
Warning
If you want to use GCC version 10 or greater to compile Fortran code, with the old MPI interfaces (i.e. use mpi
or INCLUDE 'mpif.h'
) you must add the -fallow-argument-mismatch
option (or equivalent) when compiling otherwise you will see compile errors associated with MPI functions. The reason for this is that past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines using the old MPI interfaces where arrays of different types are passed to, for example, MPI_Send()
. This will now generate an error as not standard conforming. The -fallow-argument-mismatch
option is used to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
If you use the Fortran 2008 MPI interface (i.e. use mpi_f08
) then you should not need to add this option.
Fortran language MPI bindings are described in more detail in the MPI Standard documentation.
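The rule in the warning above can be sketched as a small shell helper that emits the extra flag only for GCC 10 or later. The function name legacy_mpi_fflags and the source file name my_mpi_code.f90 are hypothetical and used only for illustration.

```shell
# Print the extra gfortran flag needed for the old MPI interfaces.
# $1 is the GCC major version, for example 11.
legacy_mpi_fflags() {
    if [ $1 -ge 10 ]; then
        echo -fallow-argument-mismatch
    fi
}

# Hypothetical usage with the ftn wrapper under PrgEnv-gnu:
#   ftn $(legacy_mpi_fflags $(gfortran -dumpversion | cut -d. -f1)) -c my_mpi_code.f90
```

Code that uses the mpi_f08 interface does not need the flag, so the helper is only relevant for legacy sources.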
"},{"location":"user-guide/dev-environment/#useful-gnu-fortran-options","title":"Useful Gnu Fortran options","text":"Option Comment-O<level>
Optimisation levels: -O0
, -O1
, -O2
, -O3
, -Ofast
. -Ofast
is not recommended without careful regression testing on numerical output. -std=<standard>
Default is gnu -fallow-argument-mismatch
Allow mismatched procedure arguments. This argument is required for compiling MPI Fortran code with GCC version 10 or greater if you are using the older MPI interfaces (see warning above) -fbounds-check
Use runtime checking of array indices -fopenmp
Compile OpenMP (default is no OpenMP) -v
Display verbose output from compiler stages Tip
The standard
in -std
may be one of f95
, f2003
, f2008
or f2018
. The default option -std=gnu
is the latest Fortran standard plus gnu extensions.
Warning
Past versions of gfortran
have allowed mismatched arguments to external procedures (e.g., where an explicit interface is not available). This is often the case for MPI routines where arrays of different types are passed to MPI_Send()
and so on. This will now generate an error as not standard conforming. Use -fallow-argument-mismatch
to reduce the error to a warning. The same effect may be achieved via -std=legacy
.
GCC 12.x compilers are available on ARCHER2 for users who wish to access newer features (particularly C++ features).
Testing by the CSE service has identified that some software regression tests produce different results from the reference values when using software compiled with gfortran from GCC 12.x so we do not recommend its general use by users. Users should carefully check results from software built using compilers from GCC 12.x before using it for their research projects.
You can access GCC 12.x by using the commands:
module load extra-compilers\nmodule load PrgEnv-gnu\n
"},{"location":"user-guide/dev-environment/#reference-material","title":"Reference material","text":"C/C++ documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/
Fortran documentation https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gfortran/
The Cray Compiling Environment (CCE) is the default compiler at the point of login. CCE supports C/C++ (along with Unified Parallel C, UPC) and Fortran (including coarray Fortran). Support for OpenMP parallelism is available for both C/C++ and Fortran (currently OpenMP 4.5, with a number of exceptions).
The Cray C/C++ compiler is based on a clang front end, and so compiler options are similar to those for gcc/clang. However, the Fortran compiler remains based around Cray-specific options. Be sure to separate C/C++ compiler options and Fortran compiler options (typically CFLAGS
and FFLAGS
) if compiling mixed C/Fortran applications.
As CCE is the default compiler environment on ARCHER2, you do not usually need to issue any commands to enable CCE.
Note
The CCE Clang compiler uses a GCC 8 toolchain so only C++ standard library features available in GCC 8 will be available in CCE Clang. You can add the compile option --gcc-toolchain=/opt/gcc/11.2.0/snos
to use a more recent version of the C++ standard library if you wish.
When using the compiler wrappers cc
or CC
, some of the following options may be useful:
Language, warning, Debugging options:
Option Comment-std=<standard>
Default is -std=gnu11
(gnu++14
for C++) [1] --gcc-toolchain=/opt/cray/pe/gcc/12.2.0/snos
Use the GCC 12.2.0 toolchain instead of the default 11.2.0 version packaged with CCE Performance options:
Option Comment-Ofast
Optimisation levels: -O0
, -O1
, -O2
, -O3
, -Ofast
. -Ofast
is not recommended without careful regression testing on numerical output. -ffp=level
Floating point maths optimisations levels 0-4 [2] -flto
Link time optimisation Miscellaneous options:
Option Comment-fopenmp
Compile OpenMP (default is off) -v
Display verbose output from compiler stages Notes
[1] -std=gnu11
gives c11
plus GNU extensions (likewise c++14
plus GNU extensions). See https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/C-Extensions.html
[2] -ffp=3
is implied by -Ofast
or -ffast-math
Language, Warning, Debugging options:
Option Comment-m <level>
Message level (default -m 3
errors and warnings) Performance options:
Option Comment-O <level>
Optimisation levels: -O0 to -O3 (default -O2) -h fp<level>
Floating point maths optimisations levels 0-3 -h ipa
Inter-procedural analysis Miscellaneous options:
Option Comment-h omp
Compile OpenMP (default is -hnoomp
) -v
Display verbose output from compiler stages"},{"location":"user-guide/dev-environment/#cce-reference-documentation","title":"CCE Reference Documentation","text":"man clang
once the CCE compiler environment is loaded.
The AMD Optimizing Compiler Collection (AOCC) is a clang-based optimising compiler. AOCC also includes a flang-based Fortran compiler.
Load the AOCC compiler environment from the default CCE (cray) compiler environment via:
auser@ln03:~> module load PrgEnv-aocc\n\nLmod is automatically replacing \"cce/15.0.0\" with \"aocc/3.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.3.3\" with \"PrgEnv-aocc/8.3.3\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.23\n
"},{"location":"user-guide/dev-environment/#aocc-reference-material","title":"AOCC reference material","text":"HPE Cray provide, as standard, an MPICH implementation of the message passing interface which is specifically optimised for the ARCHER2 interconnect. The current implementation supports MPI standard version 3.1.
The HPE Cray MPICH implementation is linked into software by default when compiling using the standard wrapper scripts: cc
, CC
and ftn
.
You do not need to do anything to make HPE Cray MPICH available when you log into ARCHER2; it is available by default to all users.
"},{"location":"user-guide/dev-environment/#switching-to-alternative-ucx-mpi-implementation","title":"Switching to alternative UCX MPI implementation","text":"HPE Cray MPICH can use two different low-level protocols to transfer data across the network. The default is the Open Fabrics Interface (OFI), but you can switch to the UCX protocol from Mellanox.
Which performs better will be application-dependent, but our experience is that UCX is often faster for programs that send a lot of data collectively between many processes, e.g. all-to-all communications patterns such as occur in parallel FFTs.
Note
You do not need to recompile your program - you simply load different modules in your Slurm script.
module load craype-network-ucx \nmodule load cray-mpich-ucx \n
Important
If your software was compiled using a compiler environment other than CCE you will also need to load that compiler environment as well as the UCX modules. For example, if you compiled using PrgEnv-gnu
you would need to:
module load PrgEnv-gnu\nmodule load craype-network-ucx \nmodule load cray-mpich-ucx \n
The performance benefits will also vary depending on the number of processes, so it is important to benchmark your application at the scale used in full production runs.
"},{"location":"user-guide/dev-environment/#mpi-reference-material","title":"MPI reference material","text":"MPI standard documents: https://www.mpi-forum.org/docs/
"},{"location":"user-guide/dev-environment/#linking-and-libraries","title":"Linking and libraries","text":"Linking to libraries is performed dynamically on ARCHER2.
Important
Static linking is not supported on ARCHER2. If you attempt to link statically, you will see errors similar to:
/usr/bin/ld: cannot find -lpmi\n/usr/bin/ld: cannot find -lpmi2\ncollect2: error: ld returned 1 exit status\n
One can use the -craype-verbose
flag to the compiler wrapper to check exactly what linker arguments are invoked. The compiler wrapper scripts encode the paths to the programming environment system libraries using RUNPATH. This ensures that the executable can find the correct runtime libraries without the matching software modules loaded.
The library RUNPATH associated with an executable can be inspected via, e.g.,
$ readelf -d ./a.out\n
(swap a.out
for the name of the executable you are querying).
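The readelf output lists every dynamic section entry, so it can help to filter it. The following is a small sketch; the function name show_runpath is hypothetical.

```shell
# Print only the RUNPATH (or legacy RPATH) entries of an executable.
# $1 is the path to the executable to inspect.
show_runpath() {
    readelf -d $1 | grep -E 'RUNPATH|RPATH'
}

# Hypothetical usage:
#   show_runpath ./a.out
```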
Modules with names prefixed by cray-
are provided by HPE Cray, and work with any of the compiler environments. These modules should be the first choice for access to software libraries if available.
Tip
More information on the different software libraries on ARCHER2 can be found in the Software libraries section of the user guide.
"},{"location":"user-guide/dev-environment/#hpe-cray-programming-environment-cpe-releases","title":"HPE Cray Programming Environment (CPE) releases","text":""},{"location":"user-guide/dev-environment/#available-hpe-cray-programming-environment-cpe-releases","title":"Available HPE Cray Programming Environment (CPE) releases","text":"ARCHER2 currently has the following HPE Cray Programming Environment (CPE) releases available:
You can find information, notes, and lists of changes for current and upcoming ARCHER2 HPE Cray programming environments in the HPE Cray Programming Environment GitHub repository.
Tip
We recommend that users use the most recent version of the PE available to get the latest improvements and bug fixes.
Later PE releases may sometimes be available in containerised form. This allows developers to check that their code compiles and runs using CPE releases that have not yet been installed on ARCHER2.
CPE 23.12 is currently available as a Singularity container, see Using Containerised HPE Cray Programming Environments for further details.
"},{"location":"user-guide/dev-environment/#switching-to-a-different-hpe-cray-programming-environment-cpe-release","title":"Switching to a different HPE Cray Programming Environment (CPE) release","text":"Important
See the section below on using non-default versions of HPE Cray libraries as this process will generally need to be followed when using software from non-default PE installs.
Access to non-default PE environments is controlled by the use of the cpe
modules. Loading a cpe
module will do the following:
For example, if you have a code that uses the GNU compiler environment, FFTW and NetCDF parallel libraries and you want to compile in the (non-default) 23.09 programming environment, you would do the following:
First, load the cpe/23.09
module to switch all the defaults to the versions from the 23.09 PE. Then, swap to the GNU compiler environment and load the required library modules (FFTW, hdf5-parallel and NetCDF HDF5 parallel). The loaded module list shows they are the versions from the 23.09 PE:
module load cpe/23.09\n
Output:
The following have been reloaded with a version change:\n 1) PrgEnv-cray/8.3.3 => PrgEnv-cray/8.4.0 4) cray-mpich/8.1.23 => cray-mpich/8.1.27\n 2) cce/15.0.0 => cce/16.0.1 5) craype/2.7.19 => craype/2.7.23\n 3) cray-libsci/22.12.1.1 => cray-libsci/23.09.1.1 6) perftools-base/22.12.0 => perftools-base/23.09.0\n
module load PrgEnv-gnu\n
Output: Lmod is automatically replacing \"cce/16.0.1\" with \"gcc/11.2.0\".\n\n\nLmod is automatically replacing \"PrgEnv-cray/8.4.0\" with \"PrgEnv-gnu/8.4.0\".\n\n\nDue to MODULEPATH changes, the following have been reloaded:\n 1) cray-mpich/8.1.27\n
module load cray-fftw\nmodule load cray-hdf5-parallel\nmodule load cray-netcdf-hdf5parallel\nmodule list\n
Output:
Currently Loaded Modules:\n 1) craype-x86-rome 6) epcc-setup-env 11) craype/2.7.23 16) cray-fftw/3.3.10.5\n 2) libfabric/1.12.1.2.2.0.0 7) load-epcc-module 12) cray-dsmml/0.2.2 17) cray-hdf5-parallel/1.12.2.7\n 3) craype-network-ofi 8) perftools-base/23.09.0 13) cray-mpich/8.1.27 18) cray-netcdf-hdf5parallel/4.9.0.7\n 4) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 9) cpe/23.09 14) cray-libsci/23.09.1.1\n 5) bolt/0.8 10) gcc/11.2.0 15) PrgEnv-gnu/8.4.0\n
Now you can go ahead and compile your software with the new programming environment.
Important
The cpe
modules only change the versions of software modules provided as part of the HPE Cray programming environments. Any modules provided by the ARCHER2 service will need to be loaded manually after you have completed the process described above.
Note
Unloading the cpe
module does not restore the original programming environment release. To restore the default programming environment release you should log out and then log back in to ARCHER2.
If you wish to make use of non-default versions of libraries provided by HPE Cray (usually because they are part of a non-default PE release: either old or new) then you need to make changes at both compile and runtime. In summary, you need to load the correct module and also make changes to the LD_LIBRARY_PATH
environment variable.
At compile time you need to load the version of the library module before you compile and set the LD_LIBRARY_PATH environment variable to include the contents of $CRAY_LD_LIBRARY_PATH
as the first entry. For example, to use the non-default 23.09.1.1 version of HPE Cray LibSci in the default programming environment (Cray Compiler Environment, CCE) you would first set up the environment to compile with:
module load cray-libsci/23.09.1.1\nexport LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n
The order is important here: every time you change a module, you will need to reset the value of LD_LIBRARY_PATH
for the process to work (it will not be updated automatically).
Now you can compile your code. You can check that the executable is using the correct version of LibSci by running the ldd
command and looking for the line beginning libsci_cray.so.5
; the version should appear in the path to the library file:
ldd dgemv.x \n
Output:
linux-vdso.so.1 (0x00007ffc7fff5000)\n libm.so.6 => /lib64/libm.so.6 (0x00007fd6a6361000)\n libsci_cray.so.5 => /opt/cray/pe/libsci/23.09.1.1/CRAY/12.0/x86_64/lib/libsci_cray.so.5 (0x00007fd6a2419000)\n libdl.so.2 => /lib64/libdl.so.2 (0x00007fd6a2215000)\n libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007fd6a68b3000)\n libquadmath.so.0 => /opt/cray/pe/gcc-libs/libquadmath.so.0 (0x00007fd6a1fce000)\n libmodules.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libmodules.so.1 (0x00007fd6a689a000)\n libfi.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libfi.so.1 (0x00007fd6a1a29000)\n libcraymath.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcraymath.so.1 (0x00007fd6a67b3000)\n libf.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libf.so.1 (0x00007fd6a6720000)\n libu.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1 (0x00007fd6a1920000)\n libcsup.so.1 => /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libcsup.so.1 (0x00007fd6a6715000)\n libc.so.6 => /lib64/libc.so.6 (0x00007fd6a152b000)\n /lib64/ld-linux-x86-64.so.2 (0x00007fd6a66ac000)\n libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd6a1308000)\n librt.so.1 => /lib64/librt.so.1 (0x00007fd6a10ff000)\n libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x00007fd6a0c53000)\n libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x00007fd6a0841000)\n libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x00007fd6a0628000)\n
Tip
If any of the libraries point to versions in the /opt/cray/pe/lib64
directory then these are using the default versions of the libraries rather than the specific versions. This happens at compile time if you have forgotten to load the right module and set $LD_LIBRARY_PATH
afterwards.
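The check described in this tip can be automated. The following is a minimal sketch that reads ldd output on standard input and reports any library resolved from /opt/cray/pe/lib64; the function name check_default_libs is hypothetical.

```shell
# Read ldd output on standard input and print any libraries that resolve to
# the default versions in /opt/cray/pe/lib64.
# Returns 0 if none are found and 1 if at least one is found.
check_default_libs() {
    if grep -F /opt/cray/pe/lib64; then
        return 1
    fi
    return 0
}

# Hypothetical usage with the example executable from this section:
#   ldd dgemv.x | check_default_libs && echo 'No default library versions in use'
```

Running the same check in the job script gives an early indication that the run time environment does not match the compile time environment.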
At run time (typically in your job script) you need to repeat the environment setup steps (you can also use the ldd
command in your job submission script to check the library is pointing to the correct version). For example, a job submission script to run our dgemv.x
executable with the non-default version of LibSci could look like:
#!/bin/bash\n#SBATCH --job-name=dgemv\n#SBATCH --time=0:20:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=short\n#SBATCH --reservation=shortqos\n\n# Set up the environment to use the non-default version of LibSci\nmodule load cray-libsci/23.09.1.1\nexport LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH\n\n# Check which library versions the executable is pointing to\nldd dgemv.x\n\nexport OMP_NUM_THREADS=1\n\nsrun --hint=nomultithread --distribution=block:block dgemv.x\n
Tip
As when compiling, the order of commands matters. Setting the value of LD_LIBRARY_PATH
must happen after you have finished all your module
commands for it to have the correct effect.
Important
You must set up the environment at both compile and run time otherwise you will end up using the default version of the library.
"},{"location":"user-guide/dev-environment/#compiling-on-compute-nodes","title":"Compiling on compute nodes","text":"Sometimes you may wish to compile in a batch job. For example, the compile process may take a long time or the compile process is part of the research workflow and can be coupled to the production job. Unlike login nodes, the /home
file system is not available on compute nodes.
An example job submission script for a compile job using make
(assuming the Makefile is in the same directory as the job submission script) would be:
#!/bin/bash\n\n#SBATCH --job-name=compile\n#SBATCH --time=00:20:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=1\n\n# Replace the account code, partition and QoS with those you wish to use\n#SBATCH --account=t01 \n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n\nmake clean\n\nmake\n
Note
If you want to use a compiler environment other than the default then you will need to add the module load
command before the make
command, e.g. to use the GCC compiler environment:
module load PrgEnv-gnu\n
You can also use a compute node in an interactive way using salloc
. Please see Section Using salloc to reserve resources for further details. Once your interactive session is ready, you can load the compilation environment and compile the code.
The compiler wrappers link with a number of HPE-provided libraries automatically. It is possible to compile codes in serial with the compiler wrappers to take advantage of the HPE libraries.
To set up your environment for serial compilation, you will need to run:
module load craype-network-none\n module remove cray-mpich\n
Once this is done, you can use the compiler wrappers (cc
for C, CC
for C++, and ftn
for Fortran) to compile your code in serial.
ARCHER2 supports common revision control software such as git
.
Standard GNU autoconf tools are available, along with make
(which is GNU Make). Versions of cmake
are available.
Tip
Some of these tools are part of the system software, and typically reside in /usr/bin
, while others are provided as part of the module system. Some tools may be available in different versions via both /usr/bin
and via the module system. If you find the default version is too old, then look in the module system for a more recent version.
The ARCHER2 CSE team at EPCC and other contributors provide build configurations and instructions for a range of research software, software libraries and tools on a variety of HPC systems (including ARCHER2) in a public GitHub repository. See:
The repository always welcomes contributions from the ARCHER2 user community.
"},{"location":"user-guide/dev-environment/#support-for-building-software-on-archer2","title":"Support for building software on ARCHER2","text":"If you run into issues building software on ARCHER2 or the software you require is not available then please contact the ARCHER2 Service Desk with any questions you have.
"},{"location":"user-guide/energy/","title":"Energy use and emissions","text":"This section covers energy use and greenhouse gas (GHG) emissions from ARCHER2.
The emissions section describes how to estimate emissions from your use of ARCHER2 and the methodology we have used to produce emissions estimates for the service.
The energy section describes how to monitor energy use for your jobs on ARCHER2 and how to control the CPU frequency which allows some control over how much energy is consumed by jobs.
Important
The default CPU frequency cap on ARCHER2 compute nodes for jobs launched using srun
is currently set to 2.0 GHz. Information below describes how to control the CPU frequency cap using Slurm.
The Slurm accounting database stores the total energy consumed by a job and you can also directly access the counters on compute nodes which capture instantaneous power and energy data broken down by different hardware components.
"},{"location":"user-guide/energy/#using-sacct-to-get-energy-usage-for-individual-jobs","title":"Using sacct to get energy usage for individual jobs","text":"Energy usage for a particular job may be obtained using the sacct
command. For instance
sacct -j 2658300 --format=JobID,Elapsed,ReqCPUFreq,ConsumedEnergy\n
will provide the elapsed time and consumed energy in joules for the job(s) specified with -j
. The output of this command is:
JobID Elapsed ReqCPUFreq ConsumedEnergy \n------------ ---------- ---------- -------------- \n2658300 02:19:48 Unknown 4.58M \n2658300.bat+ 02:19:48 0 4.58M \n2658300.ext+ 02:19:48 0 4.58M \n2658300.0 02:19:09 Unknown 4.57M \n
In this case we can see that the job consumed 4.58 MJ for a run lasting 2 hours, 19 minutes and 48 seconds with the CPU frequency unset. To convert the energy to kWh we can multiply the energy in joules by 2.78e-7, in this case resulting in 1.27 kWh.
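The conversion above can be scripted. The following is a minimal sketch, assuming that sacct abbreviates large values with K, M or G suffixes (only the M suffix appears in the output shown here). The function name energy_to_kwh is hypothetical.

```shell
# Convert a ConsumedEnergy value as printed by sacct (joules, optionally with
# a K, M or G suffix) to kWh, using 1 kWh = 3.6e6 J.
energy_to_kwh() {
    echo $1 | awk -v fmt='%.2f' '{
        value = $1 + 0                 # leading number, e.g. 4.58 from 4.58M
        if ($1 ~ /K$/) value *= 1e3
        if ($1 ~ /M$/) value *= 1e6
        if ($1 ~ /G$/) value *= 1e9
        print sprintf(fmt, value / 3.6e6)
    }'
}

energy_to_kwh 4.58M    # prints 1.27
```

On ARCHER2 the input value could be taken from a command such as sacct -X -n -j 2658300 --format=ConsumedEnergy, which prints only the job-level energy without a header line.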
The Slurm database may be cleaned without notice so you should gather any data you want as soon as possible after the job completes - you can even add the sacct
command to the end of your job script to ensure this data is captured.
In addition to energy statistics sacct
provides a number of other statistics that can be specified to the --format
option, the full list of which can be viewed with
sacct --helpformat\n
or using the man
pages.
Note
The counters are available on each compute node and record data only for that compute node. If you are running multi-node jobs, you will need to combine data from multiple nodes to get data for the whole job.
On compute nodes, the raw energy counters and instantaneous power draw data are available at:
/sys/cray/pm_counters\n
There are a number of files in this directory. All of the counter files include the current value and a timestamp.
This documentation is from the official HPE documentation:
Tip
The overall power
and energy
counters include all on-node systems. The major components are the CPU (processor), memory and Slingshot network interface controller (NIC).
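As a minimal sketch of reading these counters from a job script, the helper below prints the current value of a named counter. It assumes that the value is the first whitespace-separated field in each counter file, which you should confirm against the HPE documentation. The function name read_counter and the PM_DIR variable are hypothetical.

```shell
# Directory holding the counters; can be overridden for testing.
PM_DIR=${PM_DIR:-/sys/cray/pm_counters}

# Print the current value of a counter file such as energy or power.
# Assumes the value is the first field on the first line of the file.
read_counter() {
    awk '{ print $1; exit }' $PM_DIR/$1
}

# Hypothetical usage on a compute node, around an application launch:
#   start=$(read_counter energy)
#   srun ...usual srun options and arguments...
#   end=$(read_counter energy)
#   echo Energy counter change on this node: $((end - start))
```

This only reports the node on which the script itself runs; multi-node jobs need the values from every node to be combined, as noted above.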
Note
There exists an MPI-based wrapper library that can gather the pm
counter values at runtime via a simple set of function calls. See the link below for details.
You can request specific CPU frequency caps (in kHz) for compute nodes through srun
options or environment variables. The available frequency caps on the ARCHER2 processors are listed below, along with the corresponding options and environment variables:
srun
option Slurm environment variable Turbo boost enabled? 2.25 GHz --cpu-freq=2250000
export SLURM_CPU_FREQ_REQ=2250000
Yes 2.00 GHz --cpu-freq=2000000
export SLURM_CPU_FREQ_REQ=2000000
No 1.50 GHz --cpu-freq=1500000
export SLURM_CPU_FREQ_REQ=1500000
No
The only frequency caps available on the processors on ARCHER2 are 1.5 GHz, 2.0 GHz and 2.25 GHz with turbo boost.
Important
Setting the CPU frequency cap in this way sets the maximum frequency that the processors can use. In practice, the individual cores may select different frequencies up to the value you have set depending on the workload on the processor.
Important
When you select the highest frequency value (2.25 GHz), you also enable turbo boost and so the processor is free to set the CPU frequency to values above 2.25 GHz if possible within the power and thermal limits of the processor. We see that, with turbo boost enabled, the processors typically boost to around 2.8 GHz even when performing compute-intensive work.
For example, you can add the following option to srun
commands in your job submission scripts to set the CPU frequency to 2.25 GHz (and also enable turbo boost):
srun --cpu-freq=2250000 ...usual srun options and arguments...\n
Alternatively, you could add the following line to your job submission script before you use srun
to launch the application:
export SLURM_CPU_FREQ_REQ=2250000\n
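Because the values must be given in kHz, a small helper can reduce the chance of typing errors. The following is a sketch that accepts only the three caps listed in this section; the function name cpu_freq_khz is hypothetical.

```shell
# Map a frequency cap in GHz to the kHz value expected by --cpu-freq and
# SLURM_CPU_FREQ_REQ. Only the three caps available on ARCHER2 are accepted.
cpu_freq_khz() {
    case $1 in
        1.5|1.50)  echo 1500000 ;;
        2.0|2.00)  echo 2000000 ;;
        2.25)      echo 2250000 ;;
        *)         echo unsupported frequency cap: $1 >&2; return 1 ;;
    esac
}

# Request the 2.0 GHz cap for all subsequent srun commands in this script.
export SLURM_CPU_FREQ_REQ=$(cpu_freq_khz 2.0)
```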
Tip
Testing by the ARCHER2 CSE team has shown that most software is most energy efficient when 2.0 GHz is selected as the CPU frequency.
Important
The CPU frequency settings only affect applications launched using the srun
command.
Priority of frequency settings:
The default SLURM_CPU_FREQ_REQ
setting set by the ARCHER2 service applies if no other mechanism is used to set the CPU frequency.
Setting the SLURM_CPU_FREQ_REQ
environment variable in a job script overrides the default environment variable setting for any subsequent srun
commands in the job script.
Adding the --cpu-freq=<freq in kHz>
option to the srun
launch command itself overrides all other options.
Tip
Adding the --cpu-freq=<freq in kHz>
option to sbatch
(e.g. using #SBATCH --cpu-freq=<freq in kHz>
) will not change the CPU frequency of srun
commands used in the job as the default setting for ARCHER2 will override the sbatch
option when the script runs.
If you do not specify a CPU frequency then you will get the default setting for the ARCHER2 service when you launch an application using srun
. The table below lists the history of default CPU frequency settings on the ARCHER2 service
Most centrally installed research software (available via module load
commands) uses the same default Slurm CPU frequency as set globally for all ARCHER2 users (see above for this value). However, the performance of a small number of software packages is significantly degraded by lower frequency settings and so the modules for these packages reset the CPU frequency to the highest value (2.25 GHz). The packages that currently do this are:
Important
If you specify the Slurm CPU frequency in your job scripts using one of the mechanisms described above after you have loaded the module, you will override the setting from the module.
"},{"location":"user-guide/energy/#emissions","title":"Emissions","text":"In this section we provide a brief overview of greenhouse gas (GHG) emissions sources relevant to ARCHER2, show how we have estimated the emissions associated with the service and describe how users can estimate emissions associated with their use of ARCHER2.
"},{"location":"user-guide/energy/#impact-on-reducing-emissions","title":"Impact on reducing emissions","text":"As well as a producer of GHG emissions, HPC systems like ARCHER2 also contribute to reducing emissions. The main source of reduced emissions from services such as ARCHER2 is in the research that leads to new technology, policies and approaches to reducing emissions. Some examples include:
As well as the research activities on the service leading to reductions in emissions, there are other activities that HPC services can potentially take. For example:
The emissions from ARCHER2 potentially fall into two categories (wording inspired by the Green Software Practitioner course linked below):
The other class of emissions (Scope 1) is not relevant for the ARCHER2 service:
If you want to learn more about GHG emissions in the area of software and digital infrastructure then you may want to look at the Green Software Foundation Green Software Practitioner online course.
"},{"location":"user-guide/energy/#archer2-emissions","title":"ARCHER2 emissions","text":"Important
All ARCHER2 emissions are estimated and you should understand that there is the potential for significant variation from the current values as understanding of emissions values and sources improves.
"},{"location":"user-guide/energy/#scope-3-emissions","title":"Scope 3 emissions","text":"Scope 3 emissions from the ARCHER2 hardware have been estimated from a subset of the components that are expected to make up the majority of the emissions. Note that there is a large amount of uncertainty for Scope 3 emissions due to lack of high quality Scope 3 emissions data from vendors. In particular, the number used for the compute node emissions is at the high end of estimated values and the actual value could be as much as 15% lower at around 900 kgCO2e/node.
Component Count Estimated kgCO2e per unit Estimated kgCO2e % Total Scope 3 References Compute nodes 5,860 nodes 1,100 6,400,000 84% (1) Interconnect switches 768 switches 280 150,000 2% (2) Lustre HDD 19,759,200 GB 0.02 400,000 6% (3) Lustre SSD 1,900,800 GB 0.16 300,000 4% (3) NFS HDD 3,240,000 GB 0.02 70,000 1% (3) Total 7,320,000 100%
We then estimate the per-CU (node hour) Scope 3 emissions by assuming a service lifetime of 6 years and 100% availability:
7,320,000 kgCO2e / (5,860 nodes * 6 years * 365 days * 24 hours) = 0.023 kgCO2e/CU\n
Tools use a value of 0.023 kgCO2e/CU for ARCHER2.
References:
Scope 2 emissions from ARCHER2 are zero as the service is supplied by 100% certified renewable energy. For information purposes we can calculate what the Scope 2 emissions would have been if the energy was not 100% renewable energy using the methodology described below.
We are aware that there is ongoing discussion in the sustainability community about the impact and effectiveness of certified renewable energy contracts that are supplied through UK National Grid connections. We are monitoring these discussions and taking advice from sustainability professionals on how we report and estimate ARCHER2 emissions.
UK National Grid based Scope 2 emissions are calculated using the compute node energy use for particular jobs along with the carbon intensity of the South Scotland region of the UK National Grid at the start time of the job. The carbon intensity is retrieved from the carbonintensity.org.uk web API.
If the energy use of a job is not available (which happens occasionally due to, e.g. counter failures) then the mean per node power draw from 1 Jan 2024 - 30 Jun 2024 on ARCHER2 is used to compute the energy consumption. This corresponds to a value of 0.41 kW per node.
Estimates of power draw of individual components of ARCHER2 suggest that the compute node power draw makes up around 85% of the system power draw so to estimate energy use by additional components we add 15% of the measured compute node energy.
| Component | Count | Loaded power draw per unit (kW) | Loaded power draw (kW) | % Total | Notes |
|---|---|---|---|---|---|
| Compute nodes | 5,860 nodes | 0.41 | 2,400 | 85% | Measured by on system counters |
| Interconnect switches | 768 switches | 0.24 | 240 | 9% | Measured by on system counters |
| Lustre storage | 5 file systems | 8 | 40 | 1% | Estimate from vendor |
| NFS storage | 4 file systems | 8 | 32 | 1% | Estimate from vendor |
| Coolant distribution units | 6 CDU | 16 | 96 | 3% | Estimate from vendor |
| Total | | | 2,808 | 99% | |
Current Scope 2 grid-based emission estimates do not include overheads from the electrical and cooling plant. These vary with outside weather conditions at the data centre but are typically less than 10%. As a conservative estimate, we add an additional 10% energy use to the total to account for plant overheads.
The final energy calculation for a job is therefore:
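One reading of the steps above is: total energy = compute node energy, plus 15% of that for other hardware, plus 10% of the resulting sum for plant overheads. The sketch below follows that reading, which reproduces the figures in the sample `jobemissions` output later in this section. The function and constant names are illustrative and are not taken from the tool.

```python
OTHER_HARDWARE_FRACTION = 0.15  # non-compute hardware, relative to compute node energy
PLANT_OVERHEAD_FRACTION = 0.10  # electrical and cooling plant, relative to hardware total
MEAN_NODE_POWER_KW = 0.41       # fallback when job energy counters are unavailable


def fallback_compute_energy_kwh(nodes, runtime_s):
    # Estimate compute node energy from the mean per-node power draw.
    return nodes * MEAN_NODE_POWER_KW * runtime_s / 3600.0


def job_energy_kwh(compute_kwh):
    # Break a job's energy use into the three components described above.
    other = OTHER_HARDWARE_FRACTION * compute_kwh
    overhead = PLANT_OVERHEAD_FRACTION * (compute_kwh + other)
    return {
        'compute': compute_kwh,
        'other_hardware': other,
        'overhead': overhead,
        'total': compute_kwh + other + overhead,
    }
```

For a measured compute node energy of 448.973 kWh, `job_energy_kwh(448.973)` gives a total of about 567.951 kWh.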
To help estimate GHG emissions from your use of ARCHER2 and place them in context to other sources of GHG emissions we are developing a number of tools. We will add more information on these tools in this section of the documentation as they become available.
At the moment, the following tools are available:
jobemissions
- a command line tool on ARCHER2 that reports estimated emissions for a specified, completed job. It can also provide comparisons to other GHG emissions sources
jobemissions
tool","text":"The jobemissions
tool is available by default to all ARCHER2 users from the command line. You supply a Slurm job ID for a completed job and the tool provides an estimate of the GHG emissions associated with that job (based on the estimation methodologies described above). For example, to provide an estimate for the completed job with Job ID 7654321, you would use:
jobemissions 7654321\n
Typical output from the tool would look like:
Job details:\n Job ID: 7654321\n Start: 2024-11-11T20:51:25\n Budget: t01\n Nodes: 20\n Runtime: 324000 s\n CU: 1800.000\n Compute node energy use: 448.973 kWh\n Other hardware energy use: 67.346 kWh (estimated)\n Overhead energy use: 51.632 kWh (estimated)\n Total energy use: 567.951 kWh (estimated)\n\n Emissions estimates:\n Scope 2: 0.000 kgCO2e (ARCHER2 is on 100% certified\n renewable energy contract so scope 2 emissions are zero)\n Scope 3: 41.400 kgCO2e (23.0 gCO2e/CU)\n Total: 41.400 kgCO2e\n\n Indicative emissions estimates for UK national grid energy mix\n in S. Scotland at start of job if ARCHER2 was not using\n renewable energy\n Scope 2: 9.655 kgCO2e (567.951 kWh, 17.0 gCO2e/kWh)\n Scope 3: 41.400 kgCO2e (23.0 gCO2e/CU)\n Total: 51.055 kgCO2e\n\n Scope 2 carbon intensity values from carbonintensity.org.uk\n
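The figures in the sample output are related by simple arithmetic. The sketch below recomputes them from the job's node count, runtime and total energy. The function names are illustrative, and the carbon intensity of 17.0 gCO2e/kWh is simply the value shown in the sample output.

```python
SCOPE3_G_PER_CU = 23.0  # gCO2e/CU, the ARCHER2 value quoted above


def cu_used(nodes, runtime_s):
    # One CU is one node hour.
    return nodes * runtime_s / 3600.0


def scope3_kg(cu):
    # Scope 3 emissions scale linearly with the CU consumed.
    return cu * SCOPE3_G_PER_CU / 1000.0


def indicative_scope2_kg(total_energy_kwh, intensity_g_per_kwh):
    # Grid-based estimate: total job energy times the carbon intensity
    # at the start of the job.
    return total_energy_kwh * intensity_g_per_kwh / 1000.0


if __name__ == '__main__':
    cu = cu_used(nodes=20, runtime_s=324000)
    s3 = scope3_kg(cu)
    s2 = indicative_scope2_kg(567.951, 17.0)
    print(f'CU: {cu:.3f}')
    print(f'Scope 3: {s3:.3f} kgCO2e')
    print(f'Indicative Scope 2: {s2:.3f} kgCO2e')
    print(f'Indicative total: {s2 + s3:.3f} kgCO2e')
```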
If you add the flag --comparison food,other
the tool will add comparisons of GHG emissions for the job to other sources. For example, for the same job as above, it would add the following section to the end of the output:
Emissions from job approximately equivalent to following food consumption:\n | Food | Emissions (kgCO2e/100g) | Equivalent to (g) |\n |-----------|--------------------------|-------------------|\n | Beef | 12.47 | 332.00 |\n | Chicken | 1.43 | 2895.10 |\n | Avocado | 0.18 | 23000.00 |\n | Chickpeas | 0.04 | 103500.00 |\n\n Emissions from job approximately equivalent to:\n Daily emissions from 329.2 houses' electricity use (in S. Scotland)\n Emissions from flying 0.083 times across the Atlantic (500.00 kgCO2e/person)\n Emissions from driving 153.9 miles (0.27 kgCO2e/mile, average UK car, petrol and diesel very similar)\n
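The food and flight equivalents follow from dividing the job's total emissions by the per-unit emission factor. A minimal sketch, using the factors printed in the sample output; the function names are illustrative and the tool's own implementation may differ.

```python
# Emission factors copied from the sample output above (kgCO2e per 100 g).
FOOD_KGCO2E_PER_100G = {
    'Beef': 12.47,
    'Chicken': 1.43,
    'Avocado': 0.18,
    'Chickpeas': 0.04,
}

TRANSATLANTIC_FLIGHT_KGCO2E = 500.0  # per person, from the sample output


def food_equivalent_g(job_kgco2e, factor_per_100g):
    # Mass of food (in grams) whose emissions equal those of the job.
    return job_kgco2e / factor_per_100g * 100.0


def flight_equivalent(job_kgco2e):
    # Number of transatlantic flights with the same emissions as the job.
    return job_kgco2e / TRANSATLANTIC_FLIGHT_KGCO2E


if __name__ == '__main__':
    job = 41.4  # kgCO2e, total emissions of the sample job
    for food, factor in FOOD_KGCO2E_PER_100G.items():
        print(f'{food}: {food_equivalent_g(job, factor):.2f} g')
    print(f'Flights: {flight_equivalent(job):.3f}')
```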
You can add the --json
flag to obtain the emissions data from the tool in a machine-readable format.
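The machine-readable output can be consumed from a script. The layout of the JSON is not described here, so the sketch below only decodes it and leaves field access to the caller. It assumes that `--json` may be placed before the job ID and that the tool prints nothing but JSON to standard output. The helper names are hypothetical.

```python
import json
import subprocess


def parse_emissions(raw):
    # Decode the JSON text printed by the tool into Python objects.
    return json.loads(raw)


def job_emissions(job_id):
    # Run the tool for a completed job; this only works on ARCHER2 itself.
    cmd = ['jobemissions', '--json', str(job_id)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_emissions(result.stdout)
```

Printing the decoded object for one job is a quick way to discover which fields are available before relying on them.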
The ARCHER2 CU Calculator on the ARCHER2 website is used by potential users to estimate the number and cost of resources for potential applications to use ARCHER2. This tool has been augmented to include an estimate of GHG emissions from the proposed use of ARCHER2. In this tool, we include the Scope 3 emissions calculated as per the methodology above and note that Scope 2 emissions are zero due to the 100% renewable energy contract used to power ARCHER2.
"},{"location":"user-guide/functional-accounts/","title":"Functional accounts on ARCHER2","text":"Functional accounts are used to enable persistent services, controlled by users running on ARCHER2. For example, running a licence server to allow jobs on compute nodes to check out a licence for restricted software.
There are a number of steps involved in setting up functional accounts:
dvn04
) and the functional accountdvn04
)
We cover these steps in detail below, using the concrete example of setting up a licence server with the FlexLM software, but the process should generalise to other persistent services.
Note
If you have any questions about functional accounts and persistent services on ARCHER2 please contact the ARCHER2 Service Desk.
"},{"location":"user-guide/functional-accounts/#submit-a-request-to-service-desk","title":"Submit a request to service desk","text":"If you wish to have access to a functional account for persistent services on ARCHER2 you should email the ARCHER2 Service Desk with a case for why you want to have this functionality. You should include the following information in your email:
If your request for a functional account is approved then the ARCHER2 user administration team will set up the account and enable access for the standard user accounts named in the application. They will then inform you of the functional account name.
"},{"location":"user-guide/functional-accounts/#test-access-to-functional-account","title":"Test access to functional account","text":"The process for accessing the functional account is:
dvn04
)dvn04
)sudo
to access the functional account
Log in to ARCHER2 in the usual way using a normal user account that has been given access to manage the functional account.
"},{"location":"user-guide/functional-accounts/#setup-ssh-key-pair-for-dvn04-access","title":"Setup SSH key pair fordvn04
access","text":"You can create a passphrase-less SSH key pair to use for access to the persistent service node using the ssh-keygen
command. As long as you place the public and private key parts in the default location, you will not need any additional SSH options to access dvn04
from the ARCHER2 login nodes. Just hit enter when prompted for a passphrase to create a key with no passphrase.
Once the key pair has been created, you add the public part to the $HOME/.ssh/authorized_keys
file on ARCHER2 to make it valid for login to dvn04
using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
.
Example commands to set up an SSH key pair:
auser@ln04:~> ssh-keygen -t rsa\n\nGenerating public/private rsa key pair.\nEnter file in which to save the key (/home/t01/t01/auser/.ssh/id_rsa): \nEnter passphrase (empty for no passphrase): \nEnter same passphrase again: \nYour identification has been saved in /home/t01/t01/auser/.ssh/id_rsa\nYour public key has been saved in /home/t01/t01/auser/.ssh/id_rsa.pub\nThe key fingerprint is:\nSHA256:wX2bgNElbsPaT8HXKIflNmqnjSfg7a8BPM1R56b4/60 auser@ln02\nThe key's randomart image is:\n+---[RSA 3072]----+\n| ..... o .|\n| . *.o = = |\n| + B B B +|\n| * * % + |\n| S * X o |\n| . O * |\n| . B + |\n| . + ..|\n| ooE.=|\n+----[SHA256]-----+\n\nauser@ln04:~> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys\n
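Because the `cat ... >> authorized_keys` command appends on every run, repeating it adds duplicate entries. A small check such as the following can confirm whether the public key is already present before appending. This is a sketch with a hypothetical helper name; it compares whole lines only and does not interpret any options at the start of an `authorized_keys` entry.

```python
from pathlib import Path


def key_is_authorised(pub_key_path, authorized_keys_path):
    # True if the public key line already appears in authorized_keys.
    pub_key = Path(pub_key_path).read_text().strip()
    auth_file = Path(authorized_keys_path)
    if not auth_file.exists():
        return False
    lines = [line.strip() for line in auth_file.read_text().splitlines()]
    return pub_key in lines
```

With the default key location this would be called as `key_is_authorised(Path.home() / '.ssh' / 'id_rsa.pub', Path.home() / '.ssh' / 'authorized_keys')`.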
"},{"location":"user-guide/functional-accounts/#login-to-the-persistent-service-node-dvn04","title":"Login to the persistent service node (dvn04
)","text":"Once you are logged into an ARCHER2 login node, and assuming the SSH key is in the default location, you can now login to dvn04
:
auser@ln04:~> ssh dvn04\n
Note
You will need to enter the TOTP for your ARCHER2 account to login to dvn04
unless you have logged in to the node recently.
Once you are logged into dvn04
, you use sudo
to access the functional account.
Important
You must use your normal user account password to use the sudo
command. This password was set on your first ever login to ARCHER2 (and not used subsequently). If you have forgotten this password, you can reset it in SAFE.
For example, if the functional account is called testlm
, you would access it (on dvn04
) with:
auser@dvn04:~> sudo -iu testlm\n
To exit the functional account, you use the exit
command which will return you to your normal user account on dvn04
.
You should use systemctl
to manage your persistent service on dvn04
. In order to use the systemctl
command, you need to add the following lines to the ~/.bashrc
for the functional account:
export XDG_RUNTIME_DIR=/run/user/$UID\nexport DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$UID/bus\n
Next, create a service definition file for the persistent service and save it to a plain text file. Here is the example used for the QChem licence server:
[Unit]\nDescription=Licence manger for QChem\nAfter=network.target\nConditionHost=dvn04\n\n[Service]\nType=forking\nExecStart=/work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmgrd -l +/work/y07/shared/apps/core/qchem/6.1/var/log/qchemlm.log -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/\nExecStop=/work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmutil lmdown -all -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/\nSuccessExitStatus=15\nRestart=always\nRestartSec=30\n\n[Install]\nWantedBy=default.target\n
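Before enabling a service, it can be worth confirming that the definition file contains the sections and keys the example above relies on. The sketch below uses Python's standard `configparser`, which handles simple unit files like this one but is not a full systemd parser. The `EXPECTED` list and the `missing_entries` name are illustrative assumptions, not systemd requirements.

```python
import configparser

# Keys the example above relies on; this list is an illustrative
# assumption, not a systemd requirement.
EXPECTED = {
    'Unit': ['Description', 'ConditionHost'],
    'Service': ['Type', 'ExecStart', 'ExecStop', 'Restart'],
    'Install': ['WantedBy'],
}


def missing_entries(unit_text):
    # Return the 'Section' or 'Section.Key' names absent from the unit text.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    missing = []
    for section, keys in EXPECTED.items():
        if not parser.has_section(section):
            missing.append(section)
            continue
        for key in keys:
            if not parser.has_option(section, key):
                missing.append(f'{section}.{key}')
    return missing
```

An empty list means every expected entry was found; it says nothing about whether the paths in `ExecStart` and `ExecStop` are valid.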
Enable the licence server service, e.g. for the QChem licence server service:
testlm@dvn04:~> systemctl --user enable /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/qchem-lm.service\n\nCreated symlink /home/y07/y07/testlm/.config/systemd/user/default.target.wants/qchem-lm.service \u2192 /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/qchem-lm.service.\nCreated symlink /home/y07/y07/testlm/.config/systemd/user/qchem-lm.service \u2192 /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/qchem-lm.service.\n
Once it has been enabled, you can start the licence server service, e.g. for the QChem licence server service:
testlm@dvn04:~> systemctl --user start qchem-lm.service\n
Check the status to make sure it is running:
testlm@dvn04:~> systemctl --user status qchem-lm\n\u25cf qchem-lm.service - Licence manger for QChem\n Loaded: loaded (/home/y07/y07/testlm/.config/systemd/user/qchem-lm.service; enabled; vendor preset: disabled)\n Active: active (running) since Thu 2024-05-16 15:33:59 BST; 8s ago\n Process: 174248 ExecStart=/work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmgrd -l +/work/y07/shared/apps/core/qchem/6.1/var/log/qchemlm.log -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/ (code=exited, status=0/SUCCESS)\n Main PID: 174249 (lmgrd)\n Tasks: 8 (limit: 39321)\n Memory: 5.6M\n CPU: 18ms\n CGroup: /user.slice/user-35153.slice/user@35153.service/app.slice/qchem-lm.service\n \u251c\u2500 174249 /work/y07/shared/apps/core/qchem/6.1/bin/flexnet/lmgrd -l +/work/y07/shared/apps/core/qchem/6.1/var/log/qchemlm.log -c /work/y07/shared/apps/core/qchem/6.1/etc/flexnet/\n \u2514\u2500 174253 qchemlm -T 10.252.1.77 11.19 10 -c :/work/y07/shared/apps/core/qchem/6.1/etc/flexnet/: -lmgrd_port 6979 -srv mdSVdgushTnAjHX1s1PTj0ppCjHJw1Uk9ylvs1j13zkaUzhDBFlbv4thnqEIAXV --lmgrd_start 66461957 -vdrestart 0 -l /work/y07/shar>\n
"},{"location":"user-guide/gpu/","title":"AMD GPU Development Platform","text":"In early 2024, ARCHER2 users gained access to a small GPU system integrated into ARCHER2 which is designed to allow users to test and develop software using AMD GPUs.
Important
The GPU component is very small, so it is aimed at software development and testing rather than production research.
"},{"location":"user-guide/gpu/#hardware-available","title":"Hardware available","text":"The GPU Development Platform consists of 4 compute nodes each with: - 1x AMD EPYC 7543P (Milan) processor, 32 core, 2.8 GHz - 4x AMD Instinct MI210 accelerator - 512 GiB host memory - 2\u00d7 100 Gb/s Slingshot interfaces per node
The AMD Instinct\u2122 MI210 Accelerators feature: - Architecture: CDNA 2 - Compute Units: 104 - Memory: 64 GB HBM2e
A comprehensive list of features is available on the AMD website.
"},{"location":"user-guide/gpu/#accessing-the-gpu-compute-nodes","title":"Accessing the GPU compute nodes","text":"The GPU nodes can be accessed through the Slurm job submission system from the standard ARCHER2 login nodes. Details of the scheduler limits and configuration and example job submission scripts are provided below.
"},{"location":"user-guide/gpu/#compiling-software-for-the-gpu-compute-nodes","title":"Compiling software for the GPU compute nodes","text":""},{"location":"user-guide/gpu/#overview","title":"Overview","text":"As a quick summary, the recommended procedure for compiling code that offloads to the AMD GPUs is as follows:
module load PrgEnv-xxx
module load rocm
module load craype-accel-amd-gfx90a
module load craype-x86-milan
ftn
, cc
, or CC
For details and alternative approaches, see below.
"},{"location":"user-guide/gpu/#programming-environments","title":"Programming Environments","text":"The following programming environments and compilers are available to compile code for the AMD GPUs on ARCHER2 using the usual compiler wrappers (ftn
, cc
, CC
), which is the recommended approach:
| PrgEnv | Description | Actual compilers called by ftn, cc, CC |
|---|---|---|
| PrgEnv-amd | AMD LLVM compilers | amdflang, amdclang, amdclang++ |
| PrgEnv-cray | Cray compilers | crayftn, craycc, crayCC |
| PrgEnv-gnu | GNU compilers | gfortran, gcc, g++ |
| PrgEnv-gnu-amd | hybrid | gfortran, amdclang, amdclang++ |
| PrgEnv-cray-amd | hybrid | crayftn, amdclang, amdclang++ |
To decide which compiler(s) to use to compile offload code for the AMD GPUs, you may find it useful to consult the Compilation Strategies for GPU Offloading section below.
The hybrid environments PrgEnv-gnu-amd
and PrgEnv-cray-amd
are provided as a convenient way to mitigate less mature OpenMP offload support in the AMD LLVM Fortran compiler. In these hybrid environments ftn
therefore calls gfortran
or crayftn
instead of amdflang
.
Details about the underlying compiler being called by a compiler wrapper can be checked using the --version
flag, for example:
> module load PrgEnv-amd\n> cc --version\nAMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.3 22324 d6c88e5a78066d5d7a1e8db6c5e3e9884c6ad10e)\nTarget: x86_64-unknown-linux-gnu\nThread model: posix\nInstalledDir: /opt/rocm-5.2.3/llvm/bin\n
"},{"location":"user-guide/gpu/#rocm","title":"ROCm","text":"Access to AMD's ROCm software stack is provided through the rocm
module:
module load rocm\n
With the rocm
module loaded the AMD LLVM compilers amdflang
, amdclang
, and amdclang++
become available to use directly or through AMD's compiler driver utility hipcc
. Neither approach is recommended as a first choice for most users, as considerable care needs to be taken to pass suitable flags to the compiler or to hipcc
. With PrgEnv-amd
loaded the compiler wrappers ftn
, cc
, CC
, which bypass hipcc
and call amdflang
, amdclang
, or amdclang++
directly, take care of passing suitable compilation flags, which is why using these wrappers is the recommended approach for most users, at least initially.
Note: the rocm
module should be loaded whenever you are compiling for the AMD GPUs, even if you are not using the AMD LLVM compilers (amdflang
, amdclang
, amdclang++
).
The rocm
module also provides access to other AMD tools, such as HIPIFY (hipify-clang
or hipify-perl
command), which enables translation of CUDA to HIP code. See also the section below on HIPIFY.
Regardless of what approach you use, you will need to tell the underlying GPU compiler which GPU hardware to target. When using the compiler wrappers ftn
, cc
, or CC
, as recommended, this can be done by ensuring the appropriate GPU target module is loaded:
module load craype-accel-amd-gfx90a\n
"},{"location":"user-guide/gpu/#cpu-target","title":"CPU target","text":"The AMD GPU nodes are equipped with AMD EPYC Milan CPUs instead of the AMD EPYC Rome CPUs present on the regular CPU-only ARCHER2 compute nodes. Though the difference between these processors is small, when using the compiler wrappers ftn
, cc
, or CC
, as recommended, we should load the appropriate CPU target module:
module load craype-x86-milan\n
"},{"location":"user-guide/gpu/#compilation-strategies-for-gpu-offloading","title":"Compilation Strategies for GPU Offloading","text":"Compiler support on ARCHER2 for various programming models that enable offloading to AMD GPUs can be summarised at a glance in the following table:
| PrgEnv | Actual compiler | OpenMP Offload | HIP | OpenACC |
|---|---|---|---|---|
| PrgEnv-amd | amdflang | \u2705 | \u274c | \u274c |
| PrgEnv-amd | amdclang | \u2705 | \u274c | \u274c |
| PrgEnv-amd | amdclang++ | \u2705 | \u2705 | \u274c |
| PrgEnv-cray | crayftn | \u2705 | \u274c | \u2705 |
| PrgEnv-cray | craycc | \u2705 | \u274c | \u274c |
| PrgEnv-cray | crayCC | \u2705 | \u2705 | \u274c |
| PrgEnv-gnu | gfortran | \u274c | \u274c | \u274c |
| PrgEnv-gnu | gcc | \u274c | \u274c | \u274c |
| PrgEnv-gnu | g++ | \u274c | \u274c | \u274c |
It is generally recommended to do the following:
module load PrgEnv-xxx\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
And then to use the ftn
, cc
and/or CC
wrapper to compile as appropriate for the programming model in question. Specific guidance on how to do this for different programming models is provided in the subsections below.
When deviating from this procedure and using underlying compilers directly, or when debugging a problematic build using the wrappers, it may be useful to check what flags the compiler wrappers are passing to the underlying compiler. This can be done by using the -craype-verbose
option with a wrapper when compiling a file. Optionally piping the resulting output to the command tr \" \" \"\\n\"
so that flags are split over lines may be convenient for visual parsing. For example:
> CC -craype-verbose source.cpp | tr \" \" \"\\n\"\n
"},{"location":"user-guide/gpu/#openmp-offload","title":"OpenMP Offload","text":"To use the compiler wrappers to compile code that offloads to GPU with OpenMP directives, first load the desired PrgEnv module and other necessary modules:
module load PrgEnv-xxx\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
Then use the appropriate compiler wrapper and pass the -fopenmp
option to the wrapper when compiling. For example:
ftn -fopenmp source.f90\n
This should work under PrgEnv-amd
and PrgEnv-cray
, but not under PrgEnv-gnu as GCC 11.2.0 is the most recent version of GCC available on ARCHER2 and OpenMP offload to AMD MI200 series GPUs is only supported by GCC 13 and later.
You may find that offload directives introduced in more recent versions of the OpenMP standard, e.g. versions later than OpenMP 4.5, fail to compile with some compilers. Under PrgEnv-cray
an explicit description of supported OpenMP features can be viewed using the command man intro_openmp
.
To compile C or C++ code that uses HIP written specifically to offload to AMD GPUs, first load the desired PrgEnv module (either PrgEnv-amd
or PrgEnv-cray
) and other necessary modules:
module load PrgEnv-xxx\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
Then compile using the CC
compiler wrapper as follows:
CC -x hip -std=c++11 -D__HIP_ROCclr__ --rocm-path=${ROCM_PATH} source.cpp\n
Alternatively, you may use hipcc
to drive the AMD LLVM compiler amdclang(++)
to compile HIP code. In that case you will need to take care to explicitly pass all required offload flags to hipcc
, such as:
-D__HIP_PLATFORM_AMD__ --offload-arch=gfx90a\n
To see what hipcc
passes to the compiler, you can pass the --verbose
option. If you are compiling MPI-parallel HIP code with hipcc
, please see additional guidance under HIPCC and MPI.
hipcc
can compile both HIP code for device (GPU) execution and non-HIP code for host (CPU) execution and will default to using the AMD LLVM compiler amdclang(++)
to do so. If your software consists of separate compilation units - typically separate files - containing HIP code and non-HIP code, it is possible to use a different compiler than hipcc
to compile the non-HIP code. To do this:
hipcc
CC
and a different PrgEnv than PrgEnv-amd
loaded.o
files) together using the compiler wrapper
Offloading using OpenACC directives on ARCHER2 is only supported by the Cray Fortran compiler. You should therefore load the following:
module load PrgEnv-cray\nmodule load rocm\nmodule load craype-accel-amd-gfx90a\nmodule load craype-x86-milan\n
OpenACC Fortran code can then be compiled using the -hacc
flag, as follows:
ftn -hacc source.f90\n
Details on what OpenACC standard and features are supported under PrgEnv-cray
can be viewed using the command man intro_openacc
.
Code may use OpenMP for multithreaded execution on the host CPU in combination with target directives to offload work to GPU. Both uses of OpenMP can coexist in a single compilation unit, which should be compiled using the relevant compiler wrapper and the -fopenmp
flag.
Using both OpenMP and HIP to offload to GPU is possible, but only if the two programming models are not mixed in the same compilation unit. Two or more separate compilation units - typically separate source files - should be compiled as recommended individually for HIP and OpenMP offload code in the respective sections above. The resulting code objects (.o
files) should then be linked together using a compiler wrapper with the -fopenmp
flag, but without the -x hip
flag.
Code in a single compilation unit, such as a single source file, can use HIP to offload to GPU as well as OpenMP for multithreaded execution on the host CPU. Compilation should be done using the relevant compiler wrapper and the flags -fopenmp
and -x hip
- in that order - as well as the flags for HIP compilation specified above:
CC -fopenmp -x hip -std=c++11 -D__HIP_ROCclr__ --rocm-path=${ROCM_PATH} source.cpp\n
"},{"location":"user-guide/gpu/#hipcc-and-mpi","title":"HIPCC and MPI","text":"When compiling an MPI-parallel code with hipcc
instead of a compiler wrapper, the path to the Cray MPI library include directory should be passed explicitly, or set as part of the CXXFLAGS
environment variable, as:
-I${CRAY_MPICH_DIR}/include\n
MPI library directories should also be passed to hipcc
, or set as part of the LDFLAGS
environment variable prior to compiling, as:
-L${CRAY_MPICH_DIR}/lib ${PE_MPICH_GTL_DIR_amd_gfx90a}\n
Finally the MPI library should be linked explicitly, or set as part of the LIBS
environment variable prior to linking, as:
-lmpi ${PE_MPICH_GTL_LIBS_amd_gfx90a}\n
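The three groups of flags above can be assembled into a single `hipcc` invocation. The sketch below builds the argument list from the environment variables named in this section and adds the offload flags quoted earlier for `hipcc`. The function name, the choice of a single compile-and-link step, and the argument order are assumptions made for illustration.

```python
import os
import shlex


def hipcc_mpi_command(source, output, env=None):
    # Assemble a hipcc command line for MPI-parallel HIP code from the
    # Cray environment variables named above.
    env = os.environ if env is None else env
    cmd = ['hipcc', '-D__HIP_PLATFORM_AMD__', '--offload-arch=gfx90a']
    cmd.append('-I' + env['CRAY_MPICH_DIR'] + '/include')
    cmd += [source, '-o', output]
    cmd.append('-L' + env['CRAY_MPICH_DIR'] + '/lib')
    # The GTL variables may hold several space-separated flags.
    cmd += shlex.split(env['PE_MPICH_GTL_DIR_amd_gfx90a'])
    cmd.append('-lmpi')
    cmd += shlex.split(env['PE_MPICH_GTL_LIBS_amd_gfx90a'])
    return cmd
```

The returned list can be printed for inspection or passed to `subprocess.run` once the relevant modules have been loaded so that the variables are defined.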
"},{"location":"user-guide/gpu/#cmake","title":"Cmake","text":"Documentation about integrating rocm with cmake can be found here.
"},{"location":"user-guide/gpu/#gpu-aware-mpi","title":"GPU-aware MPI","text":"Need to set an environment variable to enable GPU support in cray-mpich
:
export MPICH_GPU_SUPPORT_ENABLED=1
You do not need to load any MPI module in addition to, or in place of, the default cray-mpich
module.
This supports GPU-GPU transfers:
Be aware that each of these nodes has only two PCIe network cards, and they may not be in the same memory region as a given GPU. NUMA effects are therefore to be expected in multi-node communication. More detail on this is provided below.
"},{"location":"user-guide/gpu/#libraries","title":"Libraries","text":"In order to access the GPU-accelerated version of Cray's LibSci maths libraries, a new module has been provided:
cray-libsci_acc
With this module loaded, documentation can be viewed using the command man intro_libsci_acc
.
Additionally a number of libraries are provided as part of the rocm
module.
The cray-python
module can be used as normal on the GPU partition with the mpi4py
package that is installed by default. mpi4py
uses cray-mpich
under the hood and in the same way as the CPU compute nodes.
However, unless they have been specifically compiled for GPU-GPU communication, certain Python packages and frameworks that try to take advantage of the fast links between GPUs by calling MPI on GPU pointers may have issues. To set the environment correctly for a given Python program, the following snippet can be added to load the required libmpi_gtl_hsa
library:
from os import environ\nif environ.get(\"MPICH_GPU_SUPPORT_ENABLED\", False):\n from ctypes import CDLL, RTLD_GLOBAL\n CDLL(f\"{environ.get('CRAY_MPICH_ROOTDIR')}/gtl/lib/libmpi_gtl_hsa.so\", mode=RTLD_GLOBAL)\n\nfrom mpi4py import MPI\n
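In the snippet above, any non-empty value of `MPICH_GPU_SUPPORT_ENABLED` passes the test, including `0`, because a non-empty string is truthy in Python. If that matters for your workflow, a stricter check can be used instead. This is a sketch with a hypothetical helper name, not part of `mpi4py` or `cray-mpich`.

```python
from os import environ


def gpu_support_enabled(env=environ):
    # Only the literal value '1' counts as enabled; unset, empty and '0'
    # are all treated as disabled.
    return env.get('MPICH_GPU_SUPPORT_ENABLED', '0').strip() == '1'
```

The `if` condition in the snippet above could then call `gpu_support_enabled()` before loading the library.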
"},{"location":"user-guide/gpu/#supported-software","title":"Supported software","text":"The ARCHER2 GPU development platform is intended for code development, testing and experimentation and will not have supported centrally installed versions of codes as is the case for the standard ARCHER2 CPU compute nodes. However some builds are being made available to users by members of CSE to under a best effort approach to support the community.
Codes that have modules targeting GPUs are:
Note
Will be filled out as applications are compiled and made available.
"},{"location":"user-guide/gpu/#running-jobs-on-the-gpu-nodes","title":"Running jobs on the GPU nodes","text":"To run a GPU job, you must specify a GPU partition and a quality of service (QoS) as well as the number of GPUs required. You specify the number of GPU cards you want per node using the --gpus=N
option, where N
is typically 1, 2 or 4.
Note
As there are 4 GPUs per node, each GPU is associated with 1/4 of the resources of the node, i.e., 8 of 32 physical cores and roughly 128 GiB of the total 512 GiB host memory.
Allocations of host resources are made pro-rata. For example, if 2 GPUs are requested, sbatch
will allocate 16 cores and around 256 GiB of host memory (in addition to 2 GPUs). Any attempt to use more than the allocated resources will result in an error.
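The pro-rata rule can be written down directly. The sketch below computes the host cores and approximate host memory that accompany a GPU request on a single node, using the node figures given above. The function name is illustrative, and the memory value is nominal since the text describes it as approximate.

```python
GPUS_PER_NODE = 4
CORES_PER_NODE = 32
HOST_MEMORY_GIB = 512


def host_share(gpus):
    # Host cores and approximate memory (GiB) that accompany a request
    # for the given number of GPUs on one node.
    if not 1 <= gpus <= GPUS_PER_NODE:
        raise ValueError('a node has between 1 and 4 GPUs to request')
    cores = CORES_PER_NODE * gpus // GPUS_PER_NODE
    memory_gib = HOST_MEMORY_GIB * gpus // GPUS_PER_NODE
    return cores, memory_gib


if __name__ == '__main__':
    for n in (1, 2, 4):
        cores, mem = host_share(n)
        print(f'{n} GPU(s): {cores} cores, about {mem} GiB host memory')
```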
This automatic allocation by Slurm for GPU jobs means that the submission script should not specify options such as --ntasks
and --cpus-per-task
. Such a job submission will be rejected. See below for some examples of how to use host resources and how to launch MPI applications.
Warning
In order to run jobs on the GPU nodes your ARCHER2 budget must have positive CU hours associated with it. However, your budget will not be charged for any GPU jobs you run.
"},{"location":"user-guide/gpu/#slurm-partitions","title":"Slurm Partitions","text":"Your job script must specify a partition. The following table has a list of relevant GPU partition(s) on ARCHER2.
| Partition | Description | Max nodes available |
|---|---|---|
| gpu | GPU nodes with AMD EPYC 32-core processor, 512 GB memory, 4\u00d7AMD Instinct MI210 GPU | 4 |
"},{"location":"user-guide/gpu/#slurm-quality-of-service-qos","title":"Slurm Quality of Service (QoS)","text":"Your job script must specify a QoS relevant for the GPU nodes. Available QoS specifications are as follows.
| QoS | Max Nodes Per Job | Max Walltime | Jobs Queued | Jobs Running | Partition(s) | Notes |
|---|---|---|---|---|---|---|
| gpu-shd | 1 | 12 hr | 2 | 1 | gpu | Nodes potentially shared with other users |
| gpu-exc | 2 | 12 hr | 2 | 1 | gpu | Exclusive node access |
"},{"location":"user-guide/gpu/#example-job-submission-scripts","title":"Example job submission scripts","text":"Here is a series of example jobs for various patterns of running on the ARCHER2 GPU nodes. They cover the following scenarios:
This example requests a single GPU on a potentially shared node and launches the program using a single CPU process with offload to a single GPU.
#!/bin/bash\n\n#SBATCH --job-name=single-GPU\n#SBATCH --gpus=1\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-shd\n\n# Check assigned GPU\nsrun --ntasks=1 rocm-smi\n\nsrun --ntasks=1 --cpus-per-task=1 ./my_gpu_program.x\n
"},{"location":"user-guide/gpu/#multiple-gpu-on-a-single-node-shared-node-access-max-2-gpu","title":"Multiple GPU on a single node - shared node access (max. 2 GPU)","text":"This example requests two GPUs on a potentially shared node and launch using two MPI processes (one per GPU) with one MPI process per CPU NUMA region.
We use the --cpus-per-task=8
option to srun
to set the stride between the two MPI processes to 8 physical cores. This places the MPI processes on separate NUMA regions to ensure they are associated with the correct GPU that is closest to them on the compute node architecture.
#!/bin/bash\n\n#SBATCH --job-name=multi-GPU\n#SBATCH --gpus=2\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-shd\n\n# Enable GPU-aware MPI\nexport MPICH_GPU_SUPPORT_ENABLED=1\n\n# Check assigned GPU\nsrun --ntasks=1 rocm-smi\n\n# Check process/thread pinning\nmodule load xthi\nsrun --ntasks=2 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n xthi\n\nsrun --ntasks=2 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n ./my_gpu_program.x\n
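The effect of the stride can be illustrated by listing the cores each MPI rank would be expected to occupy. This sketch assumes block placement, one hardware thread per core, and an allocation that starts at core 0. On a shared node the actual core numbers may differ, so the `xthi` output remains the authoritative check. The function name is illustrative.

```python
def rank_core_ranges(ntasks, cpus_per_task):
    # With block placement and one hardware thread per core, rank r is
    # expected to own cores r*cpus_per_task .. (r+1)*cpus_per_task - 1.
    ranges = []
    for rank in range(ntasks):
        first = rank * cpus_per_task
        ranges.append((first, first + cpus_per_task - 1))
    return ranges


if __name__ == '__main__':
    for rank, (first, last) in enumerate(rank_core_ranges(2, 8)):
        print(f'rank {rank}: cores {first}-{last}')
```

With 8 cores per GPU, each rank's block of 8 cores lines up with the share of the node associated with one GPU.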
"},{"location":"user-guide/gpu/#multiple-gpu-on-a-single-node-exclusive-node-access-max-4-gpu","title":"Multiple GPU on a single node - exclusive node access (max. 4 GPU)","text":"This example requests four GPUs on a single node and launches the program using four MPI processes (one per GPU) with one MPI process per CPU NUMA region.
We use the --cpus-per-task=8
option to srun
to set the stride between the MPI processes to 8 physical cores. This places the MPI processes on separate NUMA regions to ensure they are associated with the correct GPU that is closest to them on the compute node architecture.
#!/bin/bash\n\n#SBATCH --job-name=multi-GPU\n#SBATCH --gpus=4\n#SBATCH --nodes=1\n#SBATCH --exclusive\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-exc\n\n# Check assigned GPU\nsrun --ntasks=1 rocm-smi\n\n# Check process/thread pinning\nmodule load xthi\nsrun --ntasks=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n xthi\n\n# Enable GPU-aware MPI\nexport MPICH_GPU_SUPPORT_ENABLED=1\n\nsrun --ntasks=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n ./my_gpu_program.x\n
Note
When you use the --qos=gpu-exc
QoS you must also add the --exclusive
flag and then specify the number of nodes you want with --nodes=1
.
This example requests eight GPUs across two nodes and launches the program using eight MPI processes (one per GPU) with one MPI process per CPU NUMA region.
We use the --cpus-per-task=8
option to srun
to set the stride between the MPI processes to 8 physical cores. This places the MPI processes on separate NUMA regions to ensure they are associated with the correct GPU that is closest to them on the compute node architecture.
#!/bin/bash\n\n#SBATCH --job-name=multi-GPU\n#SBATCH --gpus=4\n#SBATCH --nodes=2\n#SBATCH --exclusive\n#SBATCH --time=00:20:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-exc\n\n# Check assigned GPU\nnodelist=$(scontrol show hostname $SLURM_JOB_NODELIST)\nfor nodeid in $nodelist\ndo\n echo $nodeid\n srun --ntasks=1 --gpus=4 --nodes=1 --ntasks-per-node=1 --nodelist=$nodeid rocm-smi\ndone\n\n# Check process/thread pinning\nmodule load xthi\nsrun --ntasks-per-node=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n xthi\n\n# Enable GPU-aware MPI\nexport MPICH_GPU_SUPPORT_ENABLED=1\n\nsrun --ntasks-per-node=4 --cpus-per-task=8 \\\n --hint=nomultithread --distribution=block:block \\\n ./my_gpu_program.x\n
Note
When you use the --qos=gpu-exc
QoS you must also add the --exclusive
flag and then specify the number of nodes you want with, for example, --nodes=2
.
salloc
","text":"Tip
This method does not give you an interactive shell on a GPU compute node. If you want an interactive shell on the GPU compute nodes, see the srun
method described below.
If you wish to have a terminal to perform interactive testing, you can use the salloc
command to reserve the resources so you can use srun
commands interactively. For example, to request 1 GPU for 20 minutes you would use (remember to replace t01
with your budget code):
auser@ln04:/work/t01/t01/auser> salloc --gpus=1 --time=00:20:00 --partition=gpu --qos=gpu-shd --account=t01\nsalloc: Pending job allocation 5335731\nsalloc: job 5335731 queued and waiting for resources\nsalloc: job 5335731 has been allocated resources\nsalloc: Granted job allocation 5335731\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid200001 are ready for job\n\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> srun rocm-smi\n\n\n======================= ROCm System Management Interface =======================\n================================= Concise Info =================================\nGPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%\n0 31.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n================================================================================\n============================= End of ROCm SMI Log ==============================\n\n\nsrun: error: nid200001: tasks 0: Exited with exit code 2\nsrun: launch/slurm: _step_signal: Terminating StepId=5335731.0\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> srun --ntasks=1 --cpus-per-task=8 --hint=nomultithread xthi\nNode summary for 1 nodes:\nNode 0, hostname nid200001, mpi 1, omp 1, executable xthi\nMPI summary: 1 ranks\nNode 0, rank 0, thread 0, (affinity = 0-7)\n
"},{"location":"user-guide/gpu/#using-srun","title":"Using srun
","text":"If you want an interactive terminal on a GPU node then you can use the srun
command to achieve this. For example, to request 1 GPU for 20 minutes with an interactive terminal on a GPU compute node you would use (remember to replace t01
with your budget code):
auser@ln04:/work/t01/t01/auser> srun --gpus=1 --time=00:20:00 --partition=gpu --qos=gpu-shd --account=t01 --pty /bin/bash\nsrun: job 5335771 queued and waiting for resources\nsrun: job 5335771 has been allocated resources\nauser@nid200001:/work/t01/t01/auser>\n
Note that the command prompt has changed to indicate we are now on a GPU compute node. You can now directly run commands that interact with the GPU devices, e.g.:
auser@nid200001:/work/t01/t01/auser> rocm-smi\n\n======================= ROCm System Management Interface =======================\n================================= Concise Info =================================\nGPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%\n0 29.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
Warning
Launching parallel jobs on GPU nodes from an interactive shell on a GPU node is not straightforward so you should either use job submission scripts or the salloc
method of interactive use described above.
A list of device indices or UUIDs that will be exposed to applications
Runtime : ROCm Platform Runtime. Applies to all applications using the user mode ROCm software stack.
export ROCR_VISIBLE_DEVICES=\"0,GPU-DEADBEEFDEADBEEF\"
https://rocm.docs.amd.com/projects/HIP/en/docs-5.2.3/how_to_guides/debugging.html#summary-of-environment-variables-in-hip
"},{"location":"user-guide/gpu/#amd_log_level","title":"AMD_LOG_LEVEL","text":"Enable HIP log on different Level.
export AMD_LOG_LEVEL=1
Enable HIP logging for different categories of message, selected using the mask values listed below.
export AMD_LOG_MASK=0x1
Default: 0x7FFFFFFF\n\n0x1: Log API calls.\n0x02: Kernel and Copy Commands and Barriers.\n0x4: Synchronization and waiting for commands to finish.\n0x8: Enable log on information and below levels.\n0x20: Queue commands and queue contents.\n0x40: Signal creation, allocation, pool.\n0x80: Locks and thread-safety code.\n0x100: Copy debug.\n0x200: Detailed copy debug.\n0x400: Resource allocation, performance-impacting events.\n0x800: Initialization and shutdown.\n0x1000: Misc debug, not yet classified.\n0x2000: Show raw bytes of AQL packet.\n0x4000: Show code creation debug.\n0x8000: More detailed command info, including barrier commands.\n0x10000: Log message location.\n0xFFFFFFFF: Log always even mask flag is zero.\n
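As a rough sketch of how the mask values listed above could be combined, the following sets a mask that enables both API call logging and kernel/copy command logging. It assumes the values act as independent bit flags, which the default of 0x7FFFFFFF suggests but which is not stated explicitly here. The variable names api_calls and kernel_copy are purely illustrative.

```shell
#!/bin/bash
# Illustrative names for two of the mask values listed above
api_calls=0x1      # Log API calls
kernel_copy=0x2    # Kernel and Copy Commands and Barriers

# Bitwise OR the flags together (assumes the values are independent bit flags)
mask=$(( api_calls | kernel_copy ))

# Export the combined mask in hexadecimal form
export AMD_LOG_MASK=$(printf '0x%X' $mask)
echo AMD_LOG_MASK=$AMD_LOG_MASK
```

Building the mask from named flags rather than typing a raw hexadecimal number makes it easier to see which log categories a job script has switched on.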
"},{"location":"user-guide/gpu/#hip_visible_devices","title":"HIP_VISIBLE_DEVICES:","text":"For system with multiple devices, it\u2019s possible to make only certain device(s) visible to HIP via setting environment variable, HIP_VISIBLE_DEVICES(or CUDA_VISIBLE_DEVICES on Nvidia platform), only devices whose index is present in the sequence are visible to HIP.
Runtime : HIP Runtime. Applies only to applications using HIP on the AMD platform.
export HIP_VISIBLE_DEVICES=0,1
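One common use of this variable is to give each MPI process on a node its own GPU. The sketch below is a hypothetical wrapper, not a script provided on ARCHER2. It assumes that Slurm sets SLURM_LOCALID to the node-local rank of each task and that each GPU node has 4 GPUs, as in the rocm-smi output shown elsewhere in this section. The function name gpu_for_rank is made up for illustration.

```shell
#!/bin/bash
# Assumption: 4 GPUs per GPU node
gpus_per_node=4

# Hypothetical helper: map a node-local rank to a GPU index (round-robin)
gpu_for_rank() {
    echo $(( $1 % gpus_per_node ))
}

# SLURM_LOCALID is set by Slurm inside a job step; fall back to 0 elsewhere
local_rank=${SLURM_LOCALID:-0}

# Make only the selected device visible to HIP for this process
export HIP_VISIBLE_DEVICES=$(gpu_for_rank $local_rank)
echo local rank $local_rank uses GPU $HIP_VISIBLE_DEVICES
```

In a job script, a wrapper like this would typically finish by launching the real program, so that each task started by srun sees a single device.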
To serialize the kernel enqueuing set the following variable,
export AMD_SERIALIZE_KERNEL=1
To serialize the copies set,
export AMD_SERIALIZE_COPY=1
Sets whether memory is coherent in hipHostMalloc.
export HIP_HOST_COHERENT=1
If the value is 1
, memory is coherent with host; if 0
, memory is not coherent between host and GPU.
https://rocm.docs.amd.com/en/docs-5.2.3/reference/openmp/openmp.html#environment-variables
"},{"location":"user-guide/gpu/#omp_default_device","title":"OMP_DEFAULT_DEVICE","text":"Default device used for OpenMP target offloading.
Runtime : OpenMP Runtime. Applies only to applications using OpenMP offloading.
export OMP_DEFAULT_DEVICE=\"2\"
sets the default device to the 3rd device on the node.
"},{"location":"user-guide/gpu/#omp_num_teams","title":"OMP_NUM_TEAMS","text":"Users can choose the number of teams used for kernel launch by setting,
export OMP_NUM_TEAMS
this can be tuned to optimise performance.
"},{"location":"user-guide/gpu/#gpu_max_hw_queues","title":"GPU_MAX_HW_QUEUES","text":"To set the number of HSA queues used in the OpenMP runtime set,
export GPU_MAX_HW_QUEUES
Activates GPU aware MPI in Cray MPICH:
export MPICH_GPU_SUPPORT_ENABLED=1
If this is not set, MPI calls that attempt to send messages from buffers in GPU-attached memory will crash or hang.
"},{"location":"user-guide/gpu/#hsa_enable_sdma","title":"HSA_ENABLE_SDMA","text":"export HSA_ENABLE_SDMA=0
Forces host-to-device and device-to-host copies to use compute shader blit kernels rather than the dedicated DMA copy engines.
The impact is reduced bandwidth, but this setting is recommended when isolating issues with the hardware copy engines.
"},{"location":"user-guide/gpu/#mpich_ofi_nic_policy","title":"MPICH_OFI_NIC_POLICY","text":"For GPU-enabled parallel applications that involve MPI operations that access application arrays that are resident on GPU-attached memory regions users can set,
export MPICH_OFI_NIC_POLICY=GPU
In this case, for each MPI process, Cray MPI aims to select a NIC device that is closest to the GPU device being used.
"},{"location":"user-guide/gpu/#mpich_ofi_nic_verbose","title":"MPICH_OFI_NIC_VERBOSE","text":"To display information pertaining to NIC selection set,
export MPICH_OFI_NIC_VERBOSE=2
Note
Work in progress
Documentation for rocgdb can be found in the following locations:
https://rocm.docs.amd.com/projects/ROCgdb/en/docs-5.2.3/index.html
https://docs.amd.com/projects/HIP/en/docs-5.2.3/how_to_guides/debugging.html#using-rocgdb
"},{"location":"user-guide/gpu/#profiling","title":"Profiling","text":"An initial profiling capability is provided via rocprof
which is part of the rocm
module.
For example in an interactive session where resources have already been allocated you can call,
srun -n 2 --exclusive --nodes=1 --time=00:20:00 --partition=gpu --qos=gpu-exc --gpus=2 rocprof --stats ./myprog_exe\n
to profile your application. More detail on the use of rocprof can be found here.
"},{"location":"user-guide/gpu/#performance-tuning","title":"Performance tuning","text":"AMD provides some documentation on performance tuning here not all options will be available to users, so be aware that mileage may vary.
"},{"location":"user-guide/gpu/#hardware-details","title":"Hardware details","text":"The specifications of the GPU hardware can be found here.
Additionally you can use the command,
rocminfo
in a job on a GPU node to print information about the GPUs and CPU on the node. This command is provided as part of the rocm
module.
Using rocm-smi --showtopo
we can learn about the connections between the GPUs in a node and how the memory regions of the GPUs and CPU are connected.
======================= ROCm System Management Interface =======================\n=========================== Weight between two GPUs ============================\n GPU0 GPU1 GPU2 GPU3\nGPU0 0 15 15 15\nGPU1 15 0 15 15\nGPU2 15 15 0 15\nGPU3 15 15 15 0\n\n============================ Hops between two GPUs =============================\n GPU0 GPU1 GPU2 GPU3\nGPU0 0 1 1 1\nGPU1 1 0 1 1\nGPU2 1 1 0 1\nGPU3 1 1 1 0\n\n========================== Link Type between two GPUs ==========================\n GPU0 GPU1 GPU2 GPU3\nGPU0 0 XGMI XGMI XGMI\nGPU1 XGMI 0 XGMI XGMI\nGPU2 XGMI XGMI 0 XGMI\nGPU3 XGMI XGMI XGMI 0\n\n================================== Numa Nodes ==================================\nGPU 0 : (Topology) Numa Node: 0\nGPU 0 : (Topology) Numa Affinity: 0\nGPU 1 : (Topology) Numa Node: 1\nGPU 1 : (Topology) Numa Affinity: 1\nGPU 2 : (Topology) Numa Node: 2\nGPU 2 : (Topology) Numa Affinity: 2\nGPU 3 : (Topology) Numa Node: 3\nGPU 3 : (Topology) Numa Affinity: 3\n============================= End of ROCm SMI Log ==============================\n
To quote the rocm documentation:
- The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the \u201cdistance\u201d data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.\n\n- The second block has a matrix named \u201cHops between two GPUs\u201d, where 1 means the two GPUs are directly connected with XGMI, 2 means both GPUs are linked to the same CPU socket and GPU communications will go through the CPU, and 3 means both GPUs are linked to different CPU sockets so communications will go through both CPU sockets. This number is one for all GPUs in this case since they are all connected to each other through the Infinity Fabric links.\n\n- The third block outputs the link types between the GPUs. This can either be \u201cXGMI\u201d for AMD Infinity Fabric links or \u201cPCIE\u201d for PCIe Gen4 links.\n\n- The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC processors.\n
"},{"location":"user-guide/gpu/#rocm-bandwidth-test","title":"rocm-bandwidth-test","text":"As part of the rocm
module the rocm-bandwidth-test
is provided that can be used to measure the performance of communications between the hardware in a node.
In addition to rocm-smi
this is a bandwidth test that can be useful in understanding the composition and performance limitations of a GPU node. Here is example output from a GPU node on ARCHER2.
Device: 0, AMD EPYC 7543P 32-Core Processor\nDevice: 1, AMD EPYC 7543P 32-Core Processor\nDevice: 2, AMD EPYC 7543P 32-Core Processor\nDevice: 3, AMD EPYC 7543P 32-Core Processor\nDevice: 4, , GPU-ab43b63dec8adaf3, c9:0.0\nDevice: 5, , GPU-0b953cf8e6d4184a, 87:0.0\nDevice: 6, , GPU-b0266df54d0dd2e1, 49:0.0\nDevice: 7, , GPU-790a09bfbf673859, 09:0.0\n\nInter-Device Access\n\nD/D 0 1 2 3 4 5 6 7\n\n0 1 1 1 1 1 1 1 1\n\n1 1 1 1 1 1 1 1 1\n\n2 1 1 1 1 1 1 1 1\n\n3 1 1 1 1 1 1 1 1\n\n4 1 1 1 1 1 1 1 1\n\n5 1 1 1 1 1 1 1 1\n\n6 1 1 1 1 1 1 1 1\n\n7 1 1 1 1 1 1 1 1\n\n\nInter-Device Numa Distance\n\nD/D 0 1 2 3 4 5 6 7\n\n0 0 12 12 12 20 32 32 32\n\n1 12 0 12 12 32 20 32 32\n\n2 12 12 0 12 32 32 20 32\n\n3 12 12 12 0 32 32 32 20\n\n4 20 32 32 32 0 15 15 15\n\n5 32 20 32 32 15 0 15 15\n\n6 32 32 20 32 15 15 0 15\n\n7 32 32 32 20 15 15 15 0\n\n\nUnidirectional copy peak bandwidth GB/s\n\nD/D 0 1 2 3 4 5 6 7\n\n0 N/A N/A N/A N/A 26.977 26.977 26.977 26.977\n\n1 N/A N/A N/A N/A 26.977 26.975 26.975 26.975\n\n2 N/A N/A N/A N/A 26.977 26.977 26.975 26.975\n\n3 N/A N/A N/A N/A 26.975 26.977 26.975 26.977\n\n4 28.169 28.171 28.169 28.169 1033.080 42.239 42.112 42.264\n\n5 28.169 28.169 28.169 28.169 42.243 1033.088 42.294 42.286\n\n6 28.169 28.171 28.167 28.169 42.158 42.281 1043.367 42.277\n\n7 28.171 28.169 28.169 28.169 42.226 42.264 42.264 1051.212\n\n\nBidirectional copy peak bandwidth GB/s\n\nD/D 0 1 2 3 4 5 6 7\n\n0 N/A N/A N/A N/A 40.480 42.528 42.059 42.173\n\n1 N/A N/A N/A N/A 41.604 41.826 41.903 41.417\n\n2 N/A N/A N/A N/A 41.008 41.499 41.258 41.338\n\n3 N/A N/A N/A N/A 40.968 41.273 40.982 41.450\n\n4 40.480 41.604 41.008 40.968 N/A 80.946 80.631 80.888\n\n5 42.528 41.826 41.499 41.273 80.946 N/A 80.944 80.940\n\n6 42.059 41.903 41.258 40.982 80.631 80.944 N/A 80.896\n\n7 42.173 41.417 41.338 41.450 80.888 80.940 80.896 N/A\n
"},{"location":"user-guide/gpu/#tools","title":"Tools","text":""},{"location":"user-guide/gpu/#rocm-smi","title":"rocm-smi","text":"If you load the rocm module on the system you will have access to the rocm-smi
utility. This utility allows users to report information about the GPUs on a node. It can be very useful for understanding the set up of the hardware you are working with and for monitoring GPU metrics during job execution.
Here are some useful commands to get you started:
rocm-smi --alldevices
device status
======================= ROCm System Management Interface =======================\n================================= Concise Info =================================\nGPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%\n0 28.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n1 30.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n2 33.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n3 33.0c 41.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
This shows you the current state of the hardware while an application is running. Focusing on the GPU activity can be useful to understand when your code is active on the GPUs:
rocm-smi --showuse
GPU activity
======================= ROCm System Management Interface =======================\n============================== % time GPU is busy ==============================\nGPU[0] : GPU use (%): 0\nGPU[0] : GFX Activity: 705759841\nGPU[1] : GPU use (%): 0\nGPU[1] : GFX Activity: 664257322\nGPU[2] : GPU use (%): 0\nGPU[2] : GFX Activity: 660987914\nGPU[3] : GPU use (%): 0\nGPU[3] : GFX Activity: 665049119\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
Additionally you can focus on the memory use of the GPUs:
rocm-smi --showmemuse
GPU memory currently consumed
======================= ROCm System Management Interface =======================\n============================== Current Memory Use ==============================\nGPU[0] : GPU memory use (%): 0\nGPU[0] : Memory Activity: 323631375\nGPU[1] : GPU memory use (%): 0\nGPU[1] : Memory Activity: 319196585\nGPU[2] : GPU memory use (%): 0\nGPU[2] : Memory Activity: 318641690\nGPU[3] : GPU memory use (%): 0\nGPU[3] : Memory Activity: 319854295\n================================================================================\n============================= End of ROCm SMI Log ==============================\n
More commands can be found by running,
rocm-smi --help
This command also runs on the login nodes, so you can use it there to find out more about probing the GPUs.
More detail can be found here.
"},{"location":"user-guide/gpu/#hipify","title":"HIPIFY","text":"HIPIFY is a CUDA to HIP source translator tool that can allow CUDA source code to be translated into HIP source code, easing the transition between the two hardware targets.
The tool is available on ARCHER2 by loading the rocm
module.
The github repository for HIPIFY can be found here.
The documentation for HIPIFY is found here.
"},{"location":"user-guide/gpu/#notes-and-useful-links","title":"Notes and useful links","text":"You should expect the software development environment to be similar to that available on the Frontier exascale system:
Note
Some of the material in this section is closely based on information provided by NASA as part of the documentation for the Aitken HPC system.
"},{"location":"user-guide/hardware/#system-overview","title":"System overview","text":"ARCHER2 is a HPE Cray EX supercomputing system which has a total of 5,860 compute nodes. Each compute node has 128 cores (dual AMD EPYC 7742 64-core 2.25GHz processors) giving a total of 750,080 cores. Compute nodes are connected together by a HPE Slingshot interconnect.
There are additional User Access Nodes (UAN, also called login nodes), which provide access to the system, and data-analysis nodes, which are well-suited for preparation of job inputs and analysis of job outputs.
Compute nodes are only accessible via the Slurm job scheduling system.
There are two storage types: home and work. Home is available on login nodes and data-analysis nodes. Work is available on login, data-analysis nodes and compute nodes (see I/O and file systems).
This is shown in the ARCHER2 architecture diagram:
The home file system is provided by dual NetApp FAS8200A systems (one primary and one disaster recovery) with a capacity of 1 PB each.
The work file system consists of four separate HPE Cray L300 storage systems, each with a capacity of 3.6 PB. The interconnect uses a dragonfly topology, and has a bandwidth of 100 Gbps.
The system also includes 1.1 PB burst buffer NVMe storage, provided by an HPE Cray E1000.
"},{"location":"user-guide/hardware/#compute-node-overview","title":"Compute node overview","text":"The compute nodes each have 128 cores. They are dual socket nodes with two 64-core AMD EPYC 7742 processors. There are 5,276 standard memory nodes and 584 high memory nodes.
Note
Due to Simultaneous Multi-Threading (SMT), each core has 2 threads, so a node has 128 cores and 256 threads. Most users will not want to use SMT; see Launching parallel jobs.
Component: Details
Processor: 2x AMD Zen2 (Rome) EPYC 7742, 64-core, 2.25 GHz
Cores per node: 128
NUMA structure: 8 NUMA regions per node (16 cores per NUMA region)
Memory per node: 256 GB (standard), 512 GB (high memory)
Memory per core: 2 GB (standard), 4 GB (high memory)
L1 cache: 32 kB/core
L2 cache: 512 kB/core
L3 cache: 16 MB/4-cores
Vector support: AVX2
Network connection: 2x 100 Gb/s injection ports per node
Each socket contains eight Core Complex Dies (CCDs) and one I/O die (IOD). Each CCD contains two Core Complexes (CCXs). Each CCX has 4 cores and 16 MB of L3 cache. Thus, there are 64 cores per socket and 128 cores per node.
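The core counts quoted above follow directly from the CCD and CCX layout. As a quick check, the arithmetic can be sketched in a few lines of shell. The variable names are illustrative only, and the total uses the figure of 5,860 compute nodes from the system overview.

```shell
#!/bin/bash
# Layout figures taken from the text above
ccds_per_socket=8
ccxs_per_ccd=2
cores_per_ccx=4
l3_mb_per_ccx=16
sockets_per_node=2
compute_nodes=5860

# Derived quantities
cores_per_socket=$(( ccds_per_socket * ccxs_per_ccd * cores_per_ccx ))
cores_per_node=$(( cores_per_socket * sockets_per_node ))
total_cores=$(( cores_per_node * compute_nodes ))
l3_mb_per_socket=$(( ccds_per_socket * ccxs_per_ccd * l3_mb_per_ccx ))

echo cores per socket: $cores_per_socket
echo cores per node: $cores_per_node
echo total compute cores: $total_cores
echo L3 cache per socket: $l3_mb_per_socket MB
```

This gives 64 cores per socket, 128 cores per node and 750,080 cores in total, matching the figures in the text, along with 256 MB of L3 cache per socket.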
More information on the architecture of the AMD EPYC Zen2 processors:
The AMD EPYC 7742 Rome processor has a base CPU clock of 2.25 GHz and a maximum boost clock of 3.4 GHz. There are eight processor dies (CCDs) with a total of 64 cores per socket.
Tip
The processors can only access their boost frequencies if the CPU frequency is set to 2.25 GHz. See the documentation on setting CPU frequency for information on how to select the correct CPU frequency.
Note
When all 128 compute cores on a node are loaded with computationally intensive work, we typically see the processor clock frequency boost to around 2.8 GHz.
Hybrid multi-die design:
Within each socket, the eight processor dies are fabricated on a 7 nanometer (nm) process, while the I/O die is fabricated on a 14 nm process. This design decision was made because the processor dies need the leading edge (and more expensive) 7 nm technology in order to reduce the amount of power and space needed to double the number of cores, and to add more cache, compared to the first-generation EPYC processors. The I/O die retains the less expensive, older 14 nm technology.
2nd-generation Infinity Fabric technology:
Infinity Fabric technology is used for communication among different components throughout the node: within cores, between cores, between core complexes (CCX) in a core complex die (CCD), among CCDs in a socket, to the main memory and PCIe, and between the two sockets. The Rome processors are the first x86 systems to support 4th-generation PCIe, which delivers twice the I/O performance (to the Slingshot interconnect, storage, NVMe SSD, etc.) compared to 3rd-generation PCIe.
"},{"location":"user-guide/hardware/#processor-hierarchy","title":"Processor hierarchy","text":"The Zen2 processor hierarchy is as follows:
CPU core
AMD 7742 is a 64-bit x86 server microprocessor. A partial list of instructions and features supported in Rome includes SSE, SSE2, SSE3, SSSE3, SSE4a, SSE4.1, SSE4.2, AES, FMA, AVX, AVX2 (256 bit), Integrated x87 FPU (FPU), Multi-Precision Add-Carry (ADX), 16-bit Floating Point Conversion (F16C), and No-eXecute (NX). For a complete list, run cat /proc/cpuinfo
on the ARCHER2 login nodes.
Each core:
The cache hierarchy is as follows:
op cache (OC): 4K ops, private to each core; 64 sets; 64 bytes/line; 8-way. OC holds instructions that have already been decoded into micro-operations (micro-ops). This is useful when the CPU repeatedly executes a loop of code. Using OC improves:
L1 instruction cache: 32 KB, private to each core; 64 bytes/line; 8-way. The processor fetches instructions from the instruction cache in 32-byte naturally aligned blocks.
Note
With the write-back policy, data is updated in the current level cache first. The update in the next level storage is done later when the cache line is ready to be replaced.
Note
If a core misses in its local L2 and also in the L3, the shadow tags are consulted. If the shadow tag indicates that the data resides in another L2 within the CCX, a cache-to-cache transfer is initiated. 1 x 256 bits/cycle load bandwidth to L2 of each core; 1 x 256 bits/cycle store bandwidth from L2 of each core; write-back policy; populated by L2 victims.
"},{"location":"user-guide/hardware/#intra-socket-interconnect","title":"Intra-socket interconnect","text":"The Infinity Fabric, evolved from AMD's previous generation HyperTransport interconnect, is a software-defined, scalable, coherent, and high-performance fabric. It uses sensors embedded in each die to scale control (Scalable Control Fabric, or SCF) and data flow (Scalable Data Fabric, or SDF).
Two EPYC 7742 SoCs are interconnected via Socket to Socket Global Memory Interconnect (xGMI) links, part of the Infinity Fabric that connects all the components of the SoC together. On ARCHER2 compute nodes there are 3 xGMI links using a total of 48 PCIe lanes. With the xGMI link speed set at 16 GT/s, the theoretical throughput for each direction is 96 GB/s (3 links x 16 GT/s x 2 bytes/transfer) without factoring in the encoding for xGMI, since there is no publication from AMD available. However, the expected efficiencies are 66\u201375%, so the sustained bandwidth per direction will be 63.5\u201372 GB/s. xGMI Dynamic Link Width Management saves power during periods of low socket-to-socket data traffic by reducing the number of active xGMI lanes per link from 16 to 8.
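The bandwidth figures above can be reproduced with simple integer arithmetic. The sketch below uses illustrative variable names and works in hundredths of GB/s to avoid floating point. At 66% efficiency it gives roughly 63.4 GB/s, slightly below the rounded lower bound of 63.5 GB/s quoted in the text.

```shell
#!/bin/bash
# Figures taken from the text above
links=3
gt_per_s=16            # link speed in GT/s
bytes_per_transfer=2

# Theoretical peak per direction in GB/s
peak=$(( links * gt_per_s * bytes_per_transfer ))

# Sustained estimates in hundredths of GB/s at 66% and 75% efficiency
low=$(( peak * 66 ))
high=$(( peak * 75 ))

echo theoretical peak: $peak GB/s per direction
printf 'sustained estimate: %d.%02d to %d.%02d GB/s per direction' $(( low / 100 )) $(( low % 100 )) $(( high / 100 )) $(( high % 100 ))
echo
```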
"},{"location":"user-guide/hardware/#memory-subsystem","title":"Memory subsystem","text":"The Zen 2 microarchitecture places eight unified memory controllers in the centralized I/O die. The memory channels can be split into one, two, or four Non-Uniform Memory Access (NUMA) Nodes per Socket (NPS1, NPS2, and NPS4). ARCHER2 compute nodes are configured as NPS4, which is the highest memory bandwidth configuration geared toward HPC applications.
With eight 3,200 MHz memory channels, an 8-byte read or write operation taking place per cycle per channel results in a maximum total memory bandwidth of 204.8 GB/s per socket.
Each memory channel can be connected with up to two Double Data Rate (DDR) fourth-generation Dual In-line Memory Modules (DIMMs). On ARCHER2 standard memory nodes, each channel is connected to a single 16 GB DDR4 registered DIMM (RDIMM) with error correcting code (ECC) support leading to 128 GB per socket and 256 GB per node. For the high memory nodes, each channel is connected to a single 32 GB DDR4 registered DIMM (RDIMM) with error correcting code (ECC) support leading to 256 GB per socket and 512 GB per node.
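The memory bandwidth and capacity figures in the two paragraphs above follow from the channel count. A minimal sketch of the arithmetic, with illustrative variable names and taking the 3,200 MHz figure as 3,200 million transfers per second per channel, is:

```shell
#!/bin/bash
# Figures taken from the text above
channels_per_socket=8
mega_transfers_per_s=3200   # per channel
bytes_per_transfer=8
sockets_per_node=2

# Peak bandwidth per socket in MB/s, printed as GB/s with one decimal place
mb_per_s=$(( channels_per_socket * mega_transfers_per_s * bytes_per_transfer ))
printf 'peak memory bandwidth: %d.%d GB/s per socket' $(( mb_per_s / 1000 )) $(( mb_per_s % 1000 / 100 ))
echo

# Capacity with one DIMM per channel: 16 GB (standard) or 32 GB (high memory)
for dimm_gb in 16 32; do
    per_socket=$(( channels_per_socket * dimm_gb ))
    per_node=$(( per_socket * sockets_per_node ))
    echo $dimm_gb GB DIMMs: $per_socket GB per socket, $per_node GB per node
done
```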
"},{"location":"user-guide/hardware/#interconnect-details","title":"Interconnect details","text":"ARCHER2 has a HPE Slingshot interconnect with 200 Gb/s signalling per node. It uses a dragonfly topology:
Nodes are organized into groups.
All-to-all connection between groups using optical links.
Information on the ARCHER2 parallel Lustre file systems and how to get best performance is available in the IO section.
"},{"location":"user-guide/io/","title":"I/O performance and tuning","text":"This section describes common IO patterns, best practice for I/O and how to get good performance on the ARCHER2 storage.
Information on the file systems, directory layouts, quotas, archiving and transferring data can be found in the Data management and transfer section.
The advice here is targeted at use of the parallel file systems available on the compute nodes on ARCHER2 (i.e. not the home and RDFaaS file systems).
"},{"location":"user-guide/io/#common-io-patterns","title":"Common I/O patterns","text":"There are number of I/O patterns that are frequently used in parallel applications:
"},{"location":"user-guide/io/#single-file-single-writer-serial-io","title":"Single file, single writer (Serial I/O)","text":"A common approach is to funnel all the I/O through one controller process (e.g. rank 0 in an MPI program). Although this has the advantage of producing a single file, the fact that only one client is doing all the I/O means that it gains little benefit from the parallel file system. In practice this severely limits the I/O rates, e.g. when writing large files the speed is not likely to significantly exceed 1 GB/s.
"},{"location":"user-guide/io/#file-per-process-fpp","title":"File-per-process (FPP)","text":"One of the first parallel strategies people use for I/O is for each parallel process to write to its own file. This is a simple scheme to implement and understand and can achieve high bandwidth as, with many I/O clients active at once, it benefits from the parallel Lustre filesystem. However, it has the distinct disadvantage that the data is spread across many different files and may therefore be very difficult to use for further analysis without a data reconstruction stage to recombine potentially thousands of small files.
In addition, having thousands of files open at once can overload the filesystem and lead to poor performance.
Tip
The ARCHER2 solid state file system can give very high performance when using this model of I/O
The ADIOS 2 I/O library uses an approach similar to file-per-process and so can achieve very good performance on modern parallel file systems.
"},{"location":"user-guide/io/#file-per-node-fpn","title":"File-per-node (FPN)","text":"A simple way to reduce the sheer number of files is to write a file per node rather than a file per process; as ARCHER2 has 128 CPU-cores per node, this can reduce the number of files by more than a factor of 100 and should not significantly affect the I/O rates. However, it still produces multiple files which can be hard to work with in practice.
"},{"location":"user-guide/io/#single-file-multiple-writers-without-collective-operations","title":"Single file, multiple writers without collective operations","text":"All aspects of data management are simpler if your parallel program produces a single file in the same format as a serial code, e.g. analysis or program restart are much more straightforward.
There are a number of ways to achieve this. For example, many processes can open the same file but access different parts by skipping some initial offset, although this is problematic when writing as locking may be needed to ensure consistency. Parallel I/O libraries such as MPI-IO, HDF5 and NetCDF allow for this form of access and will implement locking automatically.
The problem is that, with many clients all individually accessing the same file, there can be a lot of contention for file system resources, leading to poor I/O rates. When writing, file locking can effectively serialise the access and there is no benefit from the parallel filesystem.
"},{"location":"user-guide/io/#single-shared-file-with-collective-writes-ssf","title":"Single Shared File with collective writes (SSF)","text":"The problem with having many clients performing I/O at the same time is that the I/O library may have to restrict access to one client at a time by locking. However if I/O is done collectively, where the library knows that all clients are doing I/O at the same time, then reads and writes can be explicitly coordinated to avoid clashes and no locking is required.
It is only through collective I/O that the full bandwidth of the file system can be realised while accessing a single file. Whatever I/O library you are using, it is essential to use collective forms of the read and write calls to achieve good performance.
"},{"location":"user-guide/io/#achieving-efficient-io","title":"Achieving efficient I/O","text":"This section provides information on getting the best performance out of the parallel /work
file systems on ARCHER2 when writing data, particularly using parallel I/O patterns.
The ARCHER2 /work
file systems use Lustre as a parallel file system technology. It has many disk units (called Object Storage Targets or OSTs), all under the control of a single Meta Data Server (MDS) so that it appears to the user as a single file system. The Lustre file system provides POSIX semantics (changes on one node are immediately visible on other nodes) and can support very high data rates for appropriate I/O patterns.
In order to achieve good performance on the ARCHER2 Lustre file systems, you need to make sure your IO is configured correctly for the type of I/O you want to do. In the following sections we describe how to do this.
"},{"location":"user-guide/io/#summary-achieving-best-io-performance","title":"Summary: achieving best I/O performance","text":"The configuration you should use depends on the type of I/O you are performing. Here, we summarise the settings for two of the I/O patterns described above: File-Per-Process (FPP, including using ADIOS2) and Single Share File with collective writes (SSF).
Following sections describe the settings in more detail.
"},{"location":"user-guide/io/#file-per-process-fpp_1","title":"File-Per-Process (FPP)","text":"-c 1
), this is the default on ARCHER2-c -1
)export FI_OFI_RXM_SAR_LIMIT=64K
export MPICH_MPIIO_HINTS=\"*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4\"
We regularly run tests of FPP write performance on ARCHER2 /work Lustre file systems using the benchio software in the following configuration:
Typical write performance:
We regularly run tests of SSF write performance on ARCHER2 /work Lustre file systems using the benchio software in the following configuration:
FI_OFI_RXM_SAR_LIMIT=64K
, MPICH_MPIIO_HINTS=\"*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4\"
Typical write performance:
One of the main factors leading to the high performance of Lustre file systems is the ability to store data on multiple OSTs. For many small files, this is achieved by storing different files on different OSTs; large files must be striped across multiple OSTs to benefit from the parallel nature of Lustre.
When a file is striped it is split into chunks and stored across multiple OSTs in a round-robin fashion. Striping can improve I/O performance because it increases the available bandwidth: multiple processes can read and write the same file simultaneously by accessing different OSTs. However, striping can also increase the overhead. Choosing the right striping configuration is key to obtaining high performance.
Users have control of a number of striping settings on Lustre file systems. Although these parameters can be set on a per-file basis they are usually set on the directory where your output files will be written so that all output files inherit the same settings.
"},{"location":"user-guide/io/#default-configuration","title":"Default configuration","text":"The /work
file systems on ARCHER2 have the same default stripe settings:
These settings have been chosen to provide a good compromise for the wide variety of I/O patterns that are seen on the system but are unlikely to be optimal for any one particular scenario. The Lustre command to query the stripe settings for a directory (or file) is lfs getstripe
. For example, to query the stripe settings of an already created directory resdir
:
auser@ln03:~> lfs getstripe resdir/\nresdir\nstripe_count: 1 stripe_size: 1048576 stripe_offset: -1\n
"},{"location":"user-guide/io/#setting-custom-striping-configurations","title":"Setting custom striping configurations","text":"Users can set stripe settings for a directory (or file) using the lfs setstripe
command. The options for lfs setstripe
are:
[--stripe-count|-c]
to set the stripe count; 0 means use the system default (usually 1) and -1 means stripe over all available OSTs.
[--stripe-size|-S]
to set the stripe size; 0 means use the system default (usually 1 MB), otherwise use k, m or g for KB, MB or GB respectively.
[--stripe-index|-i]
to set the OST index (starting at 0) on which to start striping for this file. An index of -1 allows the MDS to choose the starting index and is strongly recommended, as this allows space and load balancing to be done by the MDS as needed.
For example, to set a stripe size of 4 MiB for the existing directory resdir
, along with the maximum stripe count, you would use:
auser@ln03:~> lfs setstripe -S 4m -c -1 resdir/\n
"},{"location":"user-guide/io/#environment-variables","title":"Environment variables","text":"The following environment variables typically only have an impact when you are using Single Shared Files with collective communications. As mentioned above, it is very important to use collective calls when doing parallel I/O to a single shared file.
However, with the default settings, parallel I/O on multiple nodes can currently give poor performance. We recommend always setting these environment variables in your Slurm batch script when you are using the SSF I/O pattern:
export FI_OFI_RXM_SAR_LIMIT=64K\nexport MPICH_MPIIO_HINTS=\"*:cray_cb_write_lock_mode=2,*:cray_cb_nodes_multiplier=4\"\n
"},{"location":"user-guide/io/#mpi-transport-protocol","title":"MPI transport protocol","text":"Setting the environment variables described above can improve the performance of MPI collectives when handling large amounts of data, which in turn can improve collective file I/O. An alternative is to use the non-default UCX implementation of the MPI library instead of the default OFI version.
To switch library version see the Application Development Environment section of the User Guide.
Note
This will affect all your MPI calls, not just those related to I/O, so you should check the overall performance of your program before and after the switch. It is possible that other functions may run slower even if the I/O performance improves.
"},{"location":"user-guide/io/#io-profiling","title":"I/O profiling","text":"If you are concerned about your I/O performance, you should quantify your transfer rates in terms of GB/s of data read or written to disk. Small files can achieve very high I/O rates due to data caching in Lustre. However, for large files you should be able to achieve a maximum of around 1 GB/s for an unstriped file, or up to 10 GB/s for a fully striped file (across all 12 OSTs).
Warning
You share /work
with all other users so I/O rates can be very variable, especially if the machine is heavily loaded.
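One way to quantify a transfer rate in GB/s, as suggested above, is to time a write of known size. The following is a minimal sketch, not a benchmark: the helper name write_rate_gb_per_s and the probe file name are hypothetical, it measures a single process only, and it writes to a temporary directory, so you would need to point it at a directory on /work to measure Lustre.

```python
import os
import tempfile
import time


def write_rate_gb_per_s(path, total_bytes, block_bytes=1024 * 1024):
    '''Write total_bytes to path in block_bytes chunks and return the rate in GB/s.'''
    block = bytes(block_bytes)
    written = 0
    start = time.perf_counter()
    with open(path, 'wb') as f:
        while written < total_bytes:
            n = min(block_bytes, total_bytes - written)
            f.write(block[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # include the time taken to push the data to storage
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e9


with tempfile.TemporaryDirectory() as tmp:
    rate = write_rate_gb_per_s(os.path.join(tmp, 'probe.dat'), 64 * 1024 * 1024)
    print(f'{rate:.2f} GB/s')
```

Because small files can benefit from caching, a probe of this size is likely to overstate the rate you would see for a large file; repeat the measurement with a size representative of your real output.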
If your I/O rates are poor then you can get useful summary information about how the parallel libraries are performing by setting this variable in your Slurm script:
export MPICH_MPIIO_STATS=1\n
Amongst other things, this will give you information on how many independent and collective I/O operations were issued. If you see a large number of independent operations compared to collectives, this indicates that you have inefficient I/O patterns and you should check that you are calling your parallel I/O library correctly.
Although this information comes from the MPI library, it is still useful for users of higher-level libraries such as HDF5 as they all call MPI-IO at the lowest level.
"},{"location":"user-guide/io/#tips-and-advice-for-io","title":"Tips and advice for I/O","text":""},{"location":"user-guide/io/#set-an-optimum-blocksize-when-untaring-data","title":"Set an optimum blocksize when untar'ing data","text":"When you are expanding a large tar archive file to the Lustre file systems you should specify the -b 2048
option to ensure that tar writes out data in blocks of 1 MiB. This will improve the performance of your tar command and reduce the impact of writing the data to Lustre on other users.
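The 1 MiB figure follows from the way tar counts its blocking factor in 512-byte records. The arithmetic is sketched below; the helper name tar_block_bytes is purely illustrative.

```python
TAR_RECORD_BYTES = 512  # the -b option of tar counts in 512-byte records


def tar_block_bytes(blocking_factor):
    '''Return the size in bytes of the blocks tar reads or writes.'''
    return blocking_factor * TAR_RECORD_BYTES


size = tar_block_bytes(2048)
print(size, size == 1024 * 1024)
```

For comparison, the GNU tar default blocking factor of 20 gives blocks of only 10 KiB, which is far smaller than the 1 MiB default stripe size shown earlier.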
Two Machine Learning (ML) frameworks are supported on ARCHER2, PyTorch and TensorFlow.
For each framework, we'll show how to run a particular MLCommons HPC benchmark. We start with PyTorch.
"},{"location":"user-guide/machine-learning/#pytorch","title":"PyTorch","text":"On ARCHER2, PyTorch is supported for use on both the CPU and GPU nodes.
We'll demonstrate the use of PyTorch with DeepCam, a deep learning climate segmentation benchmark. It involves training a neural network to recognise large-scale weather phenomena (e.g., tropical cyclones, atmospheric rivers) in the output generated by ensembles of weather simulations; see the link below for more details.
Exascale Deep Learning for Climate Analytics
There are two DeepCam training datasets available on ARCHER2. A 62 GB mini dataset (/work/z19/shared/mlperf-hpc/deepcam/mini
), and a much larger 8.9 TB dataset (/work/z19/shared/mlperf-hpc/deepcam/full
).
A binary install of PyTorch 1.13.1 suitable for ROCm 5.2.3 has been installed according to the instructions linked below.
https://github.com/hpc-uk/build-instructions/blob/main/pyenvs/pytorch/build_pytorch_1.13.1_archer2_gpu.md
This install can be accessed by loading the pytorch/1.13.1-gpu
module.
As DeepCam is an MLPerf benchmark, you may wish to base a local python environment on pytorch/1.13.1-gpu
so that you have the opportunity to install additional python packages that support MLPerf logging, as well as extra features pertinent to DeepCam (e.g., dynamic learning rates).
The following instructions show how to create such an environment.
#!/bin/bash\n\nmodule -q load pytorch/1.13.1-gpu\n\nPYTHON_TAG=python`echo ${CRAY_PYTHON_LEVEL} | cut -d. -f1-2`\n\nPRFX=${HOME/home/work}/pyenvs\nPYVENV_ROOT=${PRFX}/mlperf-pt-gpu\nPYVENV_SITEPKGS=${PYVENV_ROOT}/lib/${PYTHON_TAG}/site-packages\n\nmkdir -p ${PYVENV_ROOT}\ncd ${PYVENV_ROOT}\n\n\npython -m venv --system-site-packages ${PYVENV_ROOT}\n\nextend-venv-activate ${PYVENV_ROOT}\n\nsource ${PYVENV_ROOT}/bin/activate\n\n\nmkdir -p ${PYVENV_ROOT}/repos\ncd ${PYVENV_ROOT}/repos\n\ngit clone -b hpc-1.0-branch https://github.com/mlcommons/logging mlperf-logging\npython -m pip install -e mlperf-logging\n\nrm ${PYVENV_SITEPKGS}/mlperf-logging.egg-link\nmv ./mlperf-logging/mlperf_logging ${PYVENV_SITEPKGS}/\nmv ./mlperf-logging/mlperf_logging.egg-info ${PYVENV_SITEPKGS}/\n\npython -m pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git\n\ndeactivate\n
In order to run a DeepCam training job, you must first clone the MLCommons HPC GitHub repo.
mkdir ${HOME/home/work}/tests\ncd ${HOME/home/work}/tests\n\ngit clone https://github.com/mlcommons/hpc.git mlperf-hpc\n\ncd ./mlperf-hpc/deepcam/src/deepCam\n
You are now ready to run the following DeepCam submission script via the sbatch
command.
#!/bin/bash\n\n#SBATCH --job-name=deepcam\n#SBATCH --account=[budget code]\n#SBATCH --partition=gpu\n#SBATCH --qos=gpu-exc\n#SBATCH --nodes=2\n#SBATCH --gpus=8\n#SBATCH --time=01:00:00\n#SBATCH --exclusive\n\n\nJOB_OUTPUT_PATH=./results/${SLURM_JOB_ID}\nmkdir -p ${JOB_OUTPUT_PATH}/logs\n\nsource ${HOME/home/work}/pyenvs/mlperf-pt-gpu/bin/activate\n\nexport OMP_NUM_THREADS=1\nexport HOME=${HOME/home/work}\n\nsrun --ntasks=8 --tasks-per-node=4 \\\n --cpu-bind=verbose,map_cpu:0,8,16,24 --hint=nomultithread \\\n python train.py \\\n --run_tag test \\\n --data_dir_prefix /work/z19/shared/mlperf-hpc/deepcam/mini \\\n --output_dir ${JOB_OUTPUT_PATH} \\\n --wireup_method nccl-slurm \\\n --max_epochs 64 \\\n --local_batch_size 1\n\nmv slurm-${SLURM_JOB_ID}.out ${JOB_OUTPUT_PATH}/slurm.out\n
The job submission script activates the Python environment that was set up earlier, but that particular command (source ${HOME/home/work}/pyenvs/mlperf-pt-gpu/bin/activate
) could be replaced by module -q load pytorch/1.13.1-gpu
if you are not running DeepCam and have no need for additional Python packages such as mlperf-logging
and warmup-scheduler
.
In the script above, we specify four tasks per node, one for each GPU. These tasks are evenly spaced across the node so as to maximise the communications bandwidth between the host and the GPU devices. Note, PyTorch is not using Cray MPICH for inter-task communications, which is instead being handled by the ROCm Collective Communications Library (RCCL), hence the --wireup_method nccl-slurm
option (nccl-slurm
works as an alias for rccl-slurm
in this context).
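The even spacing of tasks across the node can be reproduced with a short calculation. This is a sketch under an assumption: it takes the GPU node to have 32 physical cores, which is inferred from the spacing of 8 between the four entries of the map_cpu list in the script above and is not stated elsewhere in this section. The function name evenly_spaced_cores is hypothetical.

```python
def evenly_spaced_cores(tasks_per_node, cores_per_node):
    '''Return one core ID per task, spread evenly across the node.'''
    if tasks_per_node <= 0 or cores_per_node % tasks_per_node != 0:
        raise ValueError('cores_per_node must be a positive multiple of tasks_per_node')
    stride = cores_per_node // tasks_per_node
    return [task * stride for task in range(tasks_per_node)]


# Assumption: 32 cores per GPU node, four tasks per node (one per GPU).
cores = evenly_spaced_cores(4, 32)
print('map_cpu:' + ','.join(str(c) for c in cores))
```

The printed string matches the value passed to --cpu-bind in the submission script, so the same helper could be used to build the binding if you change the number of tasks per node.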
The above job should achieve convergence \u2014 an Intersection over Union (IoU) of 0.82 \u2014 after 35 epochs or so. Runtime should be around 20-30 minutes.
We can also modify the DeepCam train.py
script so that the accuracy and loss are logged using TensorBoard.
The following lines must be added to the DeepCam train.py
script.
import os\n...\n\nfrom torch.utils.tensorboard import SummaryWriter\n\n...\n\ndef main(pargs):\n\n #init distributed training\n comm_local_group = comm.init(pargs.wireup_method, pargs.batchnorm_group_size)\n comm_rank = comm.get_rank()\n ...\n\n #set up logging\n pargs.logging_frequency = max([pargs.logging_frequency, 0])\n log_file = os.path.normpath(os.path.join(pargs.output_dir, \"logs\", pargs.run_tag + \".log\"))\n ...\n\n writer = SummaryWriter()\n\n #set seed\n ...\n\n ...\n\n #training loop\n while True:\n ...\n\n #training\n step = train_epoch(pargs, comm_rank, comm_size,\n ...\n logger, writer)\n\n ...\n
The train_epoch
function is defined in ./driver/trainer.py
and so that file must be amended like so.
...\n\ndef train_epoch(pargs, comm_rank, comm_size,\n ...,\n logger, writer):\n\n ...\n\n writer.add_scalar(\"Accuracy/train\", iou_avg_train, epoch+1)\n writer.add_scalar(\"Loss/train\", loss_avg_train, epoch+1)\n\n return step\n
"},{"location":"user-guide/machine-learning/#deepcam-on-cpu","title":"DeepCam on CPU","text":"PyTorch can also be run on the ARCHER2 CPU nodes. However, since DeepCam uses the torch.distributed
module, we cannot use Horovod to handle (via MPI) inter-task communications. We must instead build PyTorch from source so that we can link torch.distributed
to the correct Cray MPICH libraries.
The instructions for doing such a build can be found here, https://github.com/hpc-uk/build-instructions/blob/main/pyenvs/pytorch/build_pytorch_1.13.0a0_from_source_archer2_cpu.md.
This install can be accessed by loading the pytorch/1.13.0a0
module. Please note, PyTorch source version 1.13.0a0
corresponds to PyTorch package version 1.13.1
.
Once again, as we are running the DeepCam benchmark, we'll need to set up a local Python environment for installing the MLPerf logging package. This time the local environment is based on the pytorch/1.13.0a0
module.
#!/bin/bash\n\nmodule -q load pytorch/1.13.0a0\n\nPYTHON_TAG=python`echo ${CRAY_PYTHON_LEVEL} | cut -d. -f1-2`\n\nPRFX=${HOME/home/work}/pyenvs\nPYVENV_ROOT=${PRFX}/mlperf-pt\nPYVENV_SITEPKGS=${PYVENV_ROOT}/lib/${PYTHON_TAG}/site-packages\n\nmkdir -p ${PYVENV_ROOT}\ncd ${PYVENV_ROOT}\n\n\npython -m venv --system-site-packages ${PYVENV_ROOT}\n\nextend-venv-activate ${PYVENV_ROOT}\n\nsource ${PYVENV_ROOT}/bin/activate\n\n\nmkdir -p ${PYVENV_ROOT}/repos\ncd ${PYVENV_ROOT}/repos\n\ngit clone -b hpc-1.0-branch https://github.com/mlcommons/logging mlperf-logging\npython -m pip install -e mlperf-logging\n\nrm ${PYVENV_SITEPKGS}/mlperf-logging.egg-link\nmv ./mlperf-logging/mlperf_logging ${PYVENV_SITEPKGS}/\nmv ./mlperf-logging/mlperf_logging.egg-info ${PYVENV_SITEPKGS}/\n\npython -m pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git\n\ndeactivate\n
In order to run a DeepCam training job, you must first clone the MLCommons HPC GitHub repo.
mkdir ${HOME/home/work}/tests\ncd ${HOME/home/work}/tests\n\ngit clone https://github.com/mlcommons/hpc.git mlperf-hpc\n\ncd ./mlperf-hpc/deepcam/src/deepCam\n
Next, we need to edit some parts of the DeepCam Python source such that DeepCam is properly integrated with Cray MPICH.
The init
function defined in ./utils/comm.py
contains an if
statement that initialises the DeepCam job according to the selected communications method. You will need to edit the mpi
branch of this if
statement as shown below.
...\n\ndef init(method, batchnorm_group_size=1):\n\n if method == \"nccl-openmpi\":\n\n ...\n\n elif method == \"mpi\":\n rank = int(os.getenv(\"SLURM_PROCID\"))\n world_size = int(os.getenv(\"SLURM_NTASKS\"))\n dist.init_process_group(backend = \"mpi\",\n rank = rank,\n world_size = world_size)\n\n else:\n raise NotImplementedError()\n\n ... \n
Second, as we're not running on a GPU platform, we'll need to comment out a statement that calls a GPU-based synchronisation method, see the synchronize
method within ./utils/bnstats.py
.
...\n\ndef synchronize(self):\n\n if dist.is_initialized():\n # sync the device before\n #torch.cuda.synchronize()\n\n with torch.no_grad():\n ...\n
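The mpi branch of init shown earlier converts SLURM_PROCID and SLURM_NTASKS directly to integers, which raises an unhelpful TypeError if the script is started outside srun. A more defensive variant might look like the sketch below. The helper name slurm_rank_and_size is hypothetical and is not part of DeepCam; only the two Slurm variable names come from the code above.

```python
import os


def slurm_rank_and_size(env=None):
    '''Return (rank, world_size) from Slurm variables, failing with a clear message.'''
    env = os.environ if env is None else env
    try:
        rank = int(env['SLURM_PROCID'])
        world_size = int(env['SLURM_NTASKS'])
    except KeyError as missing:
        name = missing.args[0]
        raise RuntimeError(f'{name} is not set; launch the script with srun') from None
    if not 0 <= rank < world_size:
        raise RuntimeError(f'rank {rank} is outside the range 0..{world_size - 1}')
    return rank, world_size


# Example with an explicit environment, as seen by task 3 of a 32-task job.
print(slurm_rank_and_size({'SLURM_PROCID': '3', 'SLURM_NTASKS': '32'}))
```

The returned pair could then be passed as the rank and world_size arguments of dist.init_process_group, exactly as in the edited branch.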
DeepCam can now be run on the CPU nodes using a submission script like the one below.
#!/bin/bash\n\n#SBATCH --job-name=deepcam\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n#SBATCH --nodes=32\n#SBATCH --ntasks-per-node=1\n#SBATCH --cpus-per-task=128\n#SBATCH --time=10:00:00\n#SBATCH --exclusive\n\n\nJOB_OUTPUT_PATH=./results/${SLURM_JOB_ID}\nmkdir -p ${JOB_OUTPUT_PATH}/logs\n\nsource ${HOME/home/work}/pyenvs/mlperf-pt/bin/activate\n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\nexport OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}\n\nsrun --hint=nomultithread \\\n python train.py \\\n --run_tag test \\\n --data_dir_prefix /work/z19/shared/mlperf-hpc/deepcam/mini \\\n --output_dir ${JOB_OUTPUT_PATH} \\\n --wireup_method mpi \\\n --max_inter_threads ${SLURM_CPUS_PER_TASK} \\\n --max_epochs 64 \\\n --local_batch_size 1\n\nmv slurm-${SLURM_JOB_ID}.out ${JOB_OUTPUT_PATH}/slurm.out\n
The script above activates the local Python environment so that the mlperf-logging
package is available; this is needed by the logger
object declared in the DeepCam train.py
script. Notice also that the --wireup_method
parameter is now set to mpi
and that a new parameter has been added, --max_inter_threads
, for specifying the maximum number of concurrent readers.
DeepCam runs much more slowly on the CPU nodes than on the GPU nodes. Running on 32 CPU nodes, as shown above, will take around 6 hours to complete 35 epochs. This assumes you're using the default hyperparameter settings for DeepCam.
"},{"location":"user-guide/machine-learning/#tensorflow","title":"TensorFlow","text":"On ARCHER2, TensorFlow is supported for use on the CPU nodes only.
We'll demonstrate the use of TensorFlow with the CosmoFlow benchmark. It involves training a neural network to recognise cosmological parameter values from the output generated by 3D dark matter simulations; see the link below for more details.
CosmoFlow: using deep learning to learn the universe at scale
There are two CosmoFlow training datasets available on ARCHER2. A 5.6 GB mini dataset (/work/z19/shared/mlperf-hpc/cosmoflow/mini
), and a much larger 1.7 TB dataset (/work/z19/shared/mlperf-hpc/cosmoflow/full
).
In order to run a CosmoFlow training job, you must first clone the MLCommons HPC GitHub repo.
mkdir ${HOME/home/work}/tests\ncd ${HOME/home/work}/tests\n\ngit clone https://github.com/mlcommons/hpc.git mlperf-hpc\n\ncd ./mlperf-hpc/cosmoflow\n
You are now ready to run the following CosmoFlow submission script via the sbatch
command.
#!/bin/bash\n\n#SBATCH --job-name=cosmoflow\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n#SBATCH --nodes=32\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=01:00:00\n#SBATCH --exclusive\n\nmodule -q load tensorflow/2.13.0\n\nexport UCX_MEMTYPE_CACHE=n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\nexport MPICH_DPM_DIR=${SLURM_SUBMIT_DIR}/dpmdir\n\nexport OMP_NUM_THREADS=16\nexport TF_ENABLE_ONEDNN_OPTS=1\n\nsrun --hint=nomultithread --distribution=block:block --cpu-freq=2250000 \\\n python train.py \\\n --distributed --omp-num-threads ${OMP_NUM_THREADS} \\\n --inter-threads 0 --intra-threads 0 \\\n --n-epochs 2048 --n-train 1024 --n-valid 1024 \\\n --data-dir /work/z19/shared/mlperf-hpc/cosmoflow/mini/cosmoUniverse_2019_05_4parE_tf_v2_mini\n
The CosmoFlow job runs eight MPI tasks per node (one per NUMA region) with sixteen threads per task, and so, each node is fully populated. The TF_ENABLE_ONEDNN_OPTS
variable refers to Intel's oneAPI Deep Neural Network library. Within the TensorFlow source there are #ifdef
guards that are activated when oneDNN is enabled. It turns out that having TF_ENABLE_ONEDNN_OPTS=1
also improves performance (by a factor of 12) on AMD processors.
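The statement that each node is fully populated can be checked by multiplying the task and thread counts from the submission script. The sketch below assumes 128 cores per CPU node, a figure taken from the --cpus-per-task=128 setting used with one task per node in the DeepCam CPU script; the function name cores_used is illustrative only.

```python
CORES_PER_NODE = 128  # assumption: taken from --cpus-per-task=128 in the DeepCam CPU script


def cores_used(ntasks_per_node, cpus_per_task):
    '''Return the number of cores occupied on each node.'''
    return ntasks_per_node * cpus_per_task


used = cores_used(8, 16)   # --ntasks-per-node=8, --cpus-per-task=16
total_tasks = 32 * 8       # --nodes=32, eight MPI tasks on each
print(used, used == CORES_PER_NODE, total_tasks)
```

If you change --ntasks-per-node, adjusting --cpus-per-task and OMP_NUM_THREADS so that the product stays at the core count keeps the node fully used.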
The inter/intra thread training parameters allow one to exploit any parallelism implied by the TensorFlow (TF) DNN graph. For example, if a node in the TF graph can be parallelised, the number of threads assigned will be the value of --intra-threads
; and, if there are separate nodes in the TF graph that can be run concurrently, the available thread count for such an activity is the value of --inter-threads
. Of course, the optimum values for these parameters will depend on the DNN graph. The job script above tells TensorFlow to choose the values by setting both parameters to zero.
You will note that only a few hyperparameters are specified for the CosmoFlow training job (e.g., --n-epochs
, --n-train
and --n-valid
). Those settings in fact override the values assigned to those same parameters within the ./configs/cosmo.yaml
file. However, that file contains settings for many other hyperparameters that are not overwritten.
The CosmoFlow job specified above should take around 140 minutes to complete 2048 epochs, which should be sufficient to achieve a mean average error of 0.23.
"},{"location":"user-guide/profile/","title":"Profiling","text":"There are a number of different ways to access profiling data on ARCHER2. In this section, we discuss the HPE Cray profiling tools, CrayPat-lite and CrayPat. We also show how to get usage data on currently running jobs from the Slurm batch system.
You can also use the Linaro Forge tool to profile applications on ARCHER2.
If you are specifically interested in profiling IO, then you may want to look at the Darshan IO profiling tool.
"},{"location":"user-guide/profile/#craypat-lite","title":"CrayPat-lite","text":"CrayPat-lite is a simplified and easy-to-use version of the Cray Performance Measurement and Analysis Tool (CrayPat). CrayPat-lite provides basic performance analysis information automatically, with a minimum of user interaction, and yet offers information useful to users wishing to explore a program's behaviour further using the full CrayPat suite.
"},{"location":"user-guide/profile/#how-to-use-craypat-lite","title":"How to use CrayPat-lite","text":"Ensure the perftools-base
module is loaded.
module list
Load the perftools-lite
module.
module load perftools-lite
Compile your application normally. An informational message from CrayPat-lite will appear indicating that the executable has been instrumented.
cc -h std=c99 -o myapplication.x myapplication.c\n
INFO: creating the CrayPat-instrumented executable 'myapplication.x' (lite-samples) ...OK \n
Run the generated executable normally by submitting a job.
#!/bin/bash\n\n#SBATCH --job-name=CrayPat_test\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=00:20:00\n\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nexport OMP_NUM_THREADS=1\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the instrumented parallel program built in the previous step\nsrun --hint=nomultithread --distribution=block:block myapplication.x\n
Analyse the data.
After the job finishes executing, CrayPat-lite output should be printed to stdout (i.e. at the end of the job's output file). A new directory will also be created containing .rpt
and .ap2
files. The .rpt
files are text files that contain the same information printed in the job's output file and the .ap2
files can be used to obtain more detailed information, which can be visualized using the Cray Apprentice2 tool.
The Cray Performance Analysis Tool (CrayPat) is a powerful framework for analysing a parallel application\u2019s performance on Cray supercomputers. It can provide very detailed information about the timing and performance of individual application procedures.
CrayPat can perform two types of performance analysis: sampling experiments and tracing experiments. A sampling experiment probes the code at a predefined interval and produces a report based on the data collected. A tracing experiment explicitly monitors the code performance within named routines. Typically, a tracing experiment has a higher overhead than a sampling experiment but provides much more detailed information. The key to getting useful data out of a sampling experiment is to run your profiling for a representative length of time.
"},{"location":"user-guide/profile/#sampling-analysis","title":"Sampling analysis","text":"Ensure the perftools-base
module is loaded.
module list
Load the perftools
module.
module load perftools
Compile your code in the standard way, always using the Cray compiler wrappers (ftn, cc and CC). Object files need to be made available to CrayPat to correctly build an instrumented executable for profiling or tracing. This means that the compile and link stages should be separated by using the -c
compile flag.
auser@ln01:/work/t01/t01/auser> cc -h std=c99 -c jacobi.c\nauser@ln01:/work/t01/t01/auser> cc jacobi.o -o jacobi\n
To instrument the binary, run the pat_build
command. This will generate a new binary with +pat
appended to the end (e.g. jacobi+pat
).
auser@ln:/work/t01/t01/auser> pat_build jacobi
Run the new executable with +pat
appended as you would with the regular executable. Each run will produce its own 'experiment directory' containing the performance data as .xf
files inside a subdirectory called xf-files
(e.g. running the jacobi+pat
instrumented executable might produce jacobi+pat+12265-1573s/xf-files
).
Process the raw data with pat_report
. The .xf
files contain the raw sampling data from the run and need to be post-processed to produce useful results. This is done using the pat_report
tool which converts all the raw data into a summarised and readable form. You should provide the name of the experiment directory as the argument to pat_report
.
auser@ln:/work/t01/t01/auser> pat_report jacobi+pat+12265-1573s\n\nTable 1: Profile by Function (limited entries shown)\n\nSamp% | Samp | Imb. | Imb. | Group\n | | Samp | Samp% | Function\n | | | | PE=HIDE\n100.0% | 849.5 | -- | -- | Total\n|--------------------------------------------------\n| 56.7% | 481.4 | -- | -- | MPI\n||-------------------------------------------------\n|| 48.7% | 414.1 | 50.9 | 11.0% | MPI_Allreduce\n|| 4.4% | 37.5 | 118.5 | 76.6% | MPI_Waitall\n|| 3.0% | 25.2 | 44.8 | 64.5% | MPI_Isend\n||=================================================\n| 29.9% | 253.9 | 55.1 | 18.0% | USER\n||-------------------------------------------------\n|| 29.9% | 253.9 | 55.1 | 18.0% | main\n||=================================================\n| 13.4% | 114.1 | -- | -- | ETC\n||-------------------------------------------------\n|| 13.4% | 113.9 | 26.1 | 18.8% | __cray_memcpy_SNB\n|==================================================\n
This report will generate more files with the extension .ap2
in the experiment directory. These hold the same data as the .xf
files but in the post-processed form. Another file produced has an .apa
extension and is a text file with a suggested configuration for generating a traced experiment.
The .ap2
files generated are used to view performance data graphically with the Cray Apprentice2 tool.
The pat_report
command is able to produce many different profile reports from the profiling data. You can select a predefined report with the -O
flag to pat_report
. A selection of the most generally useful predefined report types is listed below.
Example output:
auser@ln01:/work/t01/t01/auser> pat_report -O ca+src,load_balance jacobi+pat+12265-1573s\n\nTable 1: Profile by Function and Callers, with Line Numbers (limited entries shown)\n\nSamp% | Samp | Imb. | Imb. | Group\n | | Samp | Samp% | Function\n | | | | PE=HIDE\n100.0% | 849.5 | -- | -- | Total\n|--------------------------------------------------\n|--------------------------------------\n| 56.7% | 481.4 | MPI\n||-------------------------------------\n|| 48.7% | 414.1 | MPI_Allreduce\n3| | | main:jacobi.c:line.80\n|| 4.4% | 37.5 | MPI_Waitall\n3| | | main:jacobi.c:line.73\n|| 3.0% | 25.2 | MPI_Isend\n|||------------------------------------\n3|| 1.6% | 13.2 | main:jacobi.c:line.65\n3|| 1.4% | 12.0 | main:jacobi.c:line.69\n||=====================================\n| 29.9% | 253.9 | USER\n||-------------------------------------\n|| 29.9% | 253.9 | main\n|||------------------------------------\n3|| 18.7% | 159.0 | main:jacobi.c:line.76\n3|| 9.1% | 76.9 | main:jacobi.c:line.84\n|||====================================\n||=====================================\n| 13.4% | 114.1 | ETC\n||-------------------------------------\n|| 13.4% | 113.9 | __cray_memcpy_SNB\n3| | | __cray_memcpy_SNB\n|======================================\n
"},{"location":"user-guide/profile/#tracing-analysis","title":"Tracing analysis","text":""},{"location":"user-guide/profile/#automatic-program-analysis-apa","title":"Automatic Program Analysis (APA)","text":"We can produce a focused tracing experiment based on the results from the sampling experiment using pat_build
with the .apa
file produced during the sampling.
auser@ln01:/work/t01/t01/auser> pat_build -O jacobi+pat+12265-1573s/build-options.apa\n
This will produce a third binary with extension +apa
. This binary should once again be run on the compute nodes, with the executable name in your job script changed to jacobi+apa
. As with the sampling analysis, a report can be produced using pat_report
. For example:
auser@ln01:/work/t01/t01/auser> pat_report jacobi+apa+13955-1573t\n\nTable 1: Profile by Function Group and Function (limited entries shown)\n\nTime% | Time | Imb. | Imb. | Calls | Group\n | | Time | Time% | | Function\n | | | | | PE=HIDE\n\n100.0% | 12.987762 | -- | -- | 1,387,544.9 | Total\n|-------------------------------------------------------------------------\n| 44.9% | 5.831320 | -- | -- | 2.0 | USER\n||------------------------------------------------------------------------\n|| 44.9% | 5.831229 | 0.398671 | 6.4% | 1.0 | main\n||========================================================================\n| 29.2% | 3.789904 | -- | -- | 199,111.0 | MPI_SYNC\n||------------------------------------------------------------------------\n|| 29.2% | 3.789115 | 1.792050 | 47.3% | 199,109.0 | MPI_Allreduce(sync)\n||========================================================================\n| 25.9% | 3.366537 | -- | -- | 1,188,431.9 | MPI\n||------------------------------------------------------------------------\n|| 18.0% | 2.334765 | 0.164646 | 6.6% | 199,109.0 | MPI_Allreduce\n|| 3.7% | 0.486714 | 0.882654 | 65.0% | 199,108.0 | MPI_Waitall\n|| 3.3% | 0.428731 | 0.557342 | 57.0% | 395,104.9 | MPI_Isend\n|=========================================================================\n
"},{"location":"user-guide/profile/#manual-program-analysis","title":"Manual Program Analysis","text":"CrayPat allows you to manually choose your profiling preference. This is particularly useful if the APA mode does not meet your tracing analysis requirements.
The entire program can be traced as a whole using -w
:
auser@ln01:/work/t01/t01/auser> pat_build -w jacobi\n
Using -g
, a program can be instrumented to trace all function entry point references belonging to the trace function group (mpi, libsci, lapack, scalapack, heap, etc):
auser@ln01:/work/t01/t01/auser> pat_build -w -g mpi jacobi\n
"},{"location":"user-guide/profile/#dynamically-linked-binaries","title":"Dynamically-linked binaries","text":"CrayPat allows you to profile un-instrumented, dynamically linked binaries with the pat_run
utility. pat_run
delivers profiling information for codes that cannot easily be rebuilt. To use pat_run
:
Load the perftools-base
module if it is not already loaded.
module load perftools-base
Run your application normally including the pat_run
command right after your srun
options.
srun [srun-options] pat_run [pat_run-options] program [program-options]
Use pat_report
to examine any data collected during the execution of your application.
auser@ln01:/work/t01/t01/auser> pat_report jacobi+pat+12265-1573s
Some useful pat_run
options are as follows.
-w
Collect data by tracing.
-g
Trace functions belonging to group names. See the -g option in pat_build(1) for a list of valid tracegroup values.
-r
Generate a text report upon successful execution.
Cray Apprentice2 is an optional GUI tool that is used to visualize and manipulate the performance analysis data captured during program execution. Cray Apprentice2 can display a wide variety of reports and graphs, depending on the type of program being analyzed, the way in which the program was instrumented for data capture, and the data that was collected during program execution.
You will need to use CrayPat to first instrument your program and capture performance analysis data, and then pat_report
to generate the .ap2
files from the results. You may then use Cray Apprentice2 to visualize and explore those files.
The number and appearance of the reports that can be generated using Cray Apprentice2 is determined by the kind and quantity of data captured during program execution, which in turn is determined by the way in which the program was instrumented and the environment variables in effect at the time of program execution. For example, changing the PAT_RT_SUMMARY environment variable to 0 before executing the instrumented program nearly doubles the number of reports available when analyzing the resulting data in Cray Apprentice2.
export PAT_RT_SUMMARY=0\n
To use Cray Apprentice2 (app2
), load the perftools-base
module if it is not already loaded.
module load perftools-base\n
Next, open the experiment directory generated during the instrumentation phase with Apprentice2.
auser@ln01:/work/t01/t01/auser> app2 jacobi+pat+12265-1573s\n
"},{"location":"user-guide/profile/#hardware-performance-counters","title":"Hardware Performance Counters","text":"Hardware performance counters can be used to monitor CPU and power events on ARCHER2 compute nodes. The monitoring and reporting of hardware counter events is integrated with CrayPat - users should use CrayPat as described earlier in this section to run profiling experiments to gather data from hardware counter events and to analyse the data.
"},{"location":"user-guide/profile/#counters-and-counter-groups-available","title":"Counters and counter groups available","text":"You can explore which event counters are available on compute nodes by running the following commands (replace t01
with a valid budget code for your account):
module load perftools\nsrun --ntasks=1 --partition=standard --qos=short --account=t01 papi_avail\n
For convenience, the CrayPat tool provides predetermined groups of hardware event counters. You can get more information on the hardware event counters available through CrayPat with the following commands (on a login or compute node):
module load perftools\npat_help counters rome groups\n
If you want information on which hardware event counters are included in a group you can type the group name at the prompt you get after running the command above. Once you have finished browsing the help, type .
to quit back to the command line.
You can also access counters on power/energy consumption. To list the counters available to monitor power/energy use you can use the command (replace t01
with a valid budget code for your account):
module load perftools\nsrun --ntasks=1 --partition=standard --qos=short --account=t01 papi_native_avail -i cray_pm\n
"},{"location":"user-guide/profile/#enabling-hardware-counter-data-collection","title":"Enabling hardware counter data collection","text":"You enable the collection of hardware event counter data as part of a CrayPat experiment by setting the environment variable PAT_RT_PERFCTR
to a comma separated list of the groups/counters that you wish to measure.
For example, you could set (usually in your job submission script):
export PAT_RT_PERFCTR=1\n
to use the 1
counter group (summary with branch activity).
If you enabled collection of hardware event counters when running your profiling experiment, you will automatically get a report on the data when you use the pat_report
command to analyse the profile experiment data file.
You will see information similar to the following in the output from CrayPat for different sections of your code (this example is for the case where export PAT_RT_PERFCTR=1
, counter group: summary with branch activity, was set in the job submission script):
==============================================================================\n USER / main\n------------------------------------------------------------------------------\n Time% 88.3% \n Time 446.113787 secs\n Imb. Time 33.094417 secs\n Imb. Time% 6.9% \n Calls 0.002 /sec 1.0 calls\n PAPI_BR_TKN 0.240G/sec 106,855,535,005.863 branch\n PAPI_TOT_INS 5.679G/sec 2,533,386,435,314.367 instr\n PAPI_BR_INS 0.509G/sec 227,125,246,394.008 branch\n PAPI_TOT_CYC 1,243,344,265,012.828 cycles\n Instr per cycle 2.04 inst/cycle\n MIPS 1,453,770.20M/sec \n Average Time per Call 446.113787 secs\n CrayPat Overhead : Time 0.2% \n
"},{"location":"user-guide/profile/#using-the-craypat-api-to-gather-hardware-counter-data","title":"Using the CrayPAT API to gather hardware counter data","text":"The CrayPAT API features a particular function, PAT_counters
, that allows you to obtain the values of specific hardware counters at specific points within your code.
For convenience, we have developed an MPI-based wrapper for this aspect of the CrayPAT API, called pat_mpi_lib
, which can be found via the link below.
https://github.com/cresta-eu/pat_mpi_lib
The PAT MPI Library makes it possible to monitor a user-defined set of hardware performance counters during the execution of an MPI code running across multiple compute nodes. The library is lightweight, containing just four functions, and is intended to be straightforward to use. Once you've defined the hooks in your code for recording counter values, you can control which counters are read at runtime by setting the PAT_RT_PERFCTR
environment variable in the job submission script. As your code executes, the defined set of counters will be read at various points. After each reading, the counter values are summed by rank 0 (via an MPI reduction) before being output to a log file.
Further information along with test harnesses and example scripts can be found by reading the PAT MPI Library readme file.
"},{"location":"user-guide/profile/#more-information-on-hardware-counters","title":"More information on hardware counters","text":"More information on using hardware counters can be found in the appropriate section of the HPE documentation:
Also available are two MPI-based wrapper libraries, one for Power Management (PM) counters that cover such properties as point-in-time power, cumulative energy use and temperature; and one that provides access to PAPI counters. See the links below for further details.
Slurm commands on the login nodes can be used to quickly and simply retrieve information about memory usage for currently running and completed jobs.
There are three commands you can use on ARCHER2 to query job data from Slurm: two are standard Slurm commands and one is a script that provides information on running jobs:
sstat
command is used to display status information of a running job or job step
sacct
command is used to display accounting data for all finished jobs and job steps within the Slurm job database.
archer2jobload
command is used to show CPU and memory usage information for running jobs. (This script is based on one originally written for the COSMA HPC facility at the University of Durham.)
We provide examples of the use of these three commands below.
For the sacct
and sstat
commands, the memory properties we print out below are:
AveRSS
- The mean memory use per process over the length of the job
MaxRSS
- The maximum memory use by an individual process measured during the job
MaxRSSTask
- The process ID associated with the maximum memory use measured during the job
MaxRSSNode
- The node ID associated with the maximum memory use measured during the job
TRESUsageInTot
- Totals of various properties for the job. For example, the total memory use of the job is available in the mem=
property
Tip
Slurm polls for the memory use in a job, this means that short-term changes in memory use may not be captured in the Slurm data.
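As a minimal sketch of how the mem= property might be pulled out of a TRESUsageInTot value once it has been captured, the Python below splits the comma-separated key=value string. The sample string and the helper names parse_tres and to_bytes are illustrative assumptions rather than real job output, and the sketch assumes that a K, M, G or T suffix denotes a power of 1024 and that a bare number is already in bytes.

```python
# Sketch: extract the mem= entry from a TRESUsageInTot string.
# The sample value below is made up for illustration.
UNITS = {'K': 1024, 'M': 1024**2, 'G': 1024**3, 'T': 1024**4}

def parse_tres(tres):
    # Split 'key=value,key=value' into a dictionary of strings
    return dict(item.split('=', 1) for item in tres.split(',') if '=' in item)

def to_bytes(value):
    # Convert a value such as '2048K' to bytes (assumes 1024-based suffixes)
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(float(value))

sample = 'cpu=00:10:00,mem=2048K,pages=0'
mem_bytes = to_bytes(parse_tres(sample)['mem'])
print(mem_bytes)
```

If the units on your system differ, only the UNITS table should need adjusting.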
"},{"location":"user-guide/profile/#example-1-sstat-for-running-jobs","title":"Example 1:sstat
for running jobs","text":"To display the current memory use of a running job with the ID 123456:
sstat --format=JobID,AveCPU,AveRSS,MaxRSS,MaxRSSTask,MaxRSSNode,TRESUsageInTot%150 -j 123456\n
"},{"location":"user-guide/profile/#example-2-sacct-for-finished-jobs","title":"Example 2: sacct
for finished jobs","text":"To display the memory use of a completed job with the ID 123456:
sacct --format=JobID,JobName,AveRSS,MaxRSS,MaxRSSTask,MaxRSSNode,TRESUsageInTot%150 -j 123456\n
Another usage of sacct
is to display when a job was submitted, started running and ended for a particular user:
sacct --format=JobID,Submit,Start,End -u auser\n
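The Submit and Start columns can be combined to estimate how long a job waited in the queue. The following Python is a small sketch of that arithmetic; the function name queue_wait_seconds and the timestamps are hypothetical, and it assumes the timestamps are printed in the YYYY-MM-DDTHH:MM:SS form.

```python
from datetime import datetime

def queue_wait_seconds(submit, start):
    # Parse timestamps of the form YYYY-MM-DDTHH:MM:SS
    t_submit = datetime.fromisoformat(submit)
    t_start = datetime.fromisoformat(start)
    return (t_start - t_submit).total_seconds()

# Hypothetical values standing in for the Submit and Start columns
wait = queue_wait_seconds('2024-12-12T09:00:00', '2024-12-12T09:45:30')
print(wait)
```

A job that has not yet started will not have a parseable Start value, so a real script would need to skip such rows.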
"},{"location":"user-guide/profile/#example-3-archer2jobload-for-running-jobs","title":"Example 3: archer2jobload
for running jobs","text":"Using the archer2jobload
command on its own with no options will show the current CPU and memory use across compute nodes for all running jobs.
More usefully, you can provide a job ID to archer2jobload
and it will show a summary of the CPU and memory use for a specific job. For example, to get the usage data for job 123456, you would use:
auser@ln01:~> archer2jobload 123456\n# JOB: 123456\nCPU_LOAD MEMORY ALLOCMEM FREE_MEM TMP_DISK NODELIST \n127.35-127.86 256000 239872 169686-208172 0 nid[001481,001638-00\n
This shows the minimum CPU load on a compute node is 127.35 (close to the limit of 128 cores) with the maximum load 127.86 (indicating all the nodes are being used evenly). The minimum free memory is 169686 MB and the maximum free memory is 208172 MB.
If you add the -l
option, you will see a breakdown per node:
auser@ln01:~> archer2jobload -l 123456\n# JOB: 123456\nNODELIST CPU_LOAD MEMORY ALLOCMEM FREE_MEM TMP_DISK \nnid001481 127.86 256000 239872 169686 0 \nnid001638 127.60 256000 239872 171060 0 \nnid001639 127.64 256000 239872 171253 0 \nnid001677 127.85 256000 239872 173820 0 \nnid001678 127.75 256000 239872 173170 0 \nnid001891 127.63 256000 239872 173316 0 \nnid001921 127.65 256000 239872 207562 0 \nnid001922 127.35 256000 239872 208172 0 \n
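The per-node listing can be reduced to the same minimum and maximum figures that the summary view reports. The sketch below does this in Python for three rows copied from the listing above; the function name summarise is hypothetical, and the code assumes the six-column layout shown (NODELIST, CPU_LOAD, MEMORY, ALLOCMEM, FREE_MEM, TMP_DISK).

```python
# Sample rows copied from the per-node output above (three of the eight nodes)
listing = '''nid001481 127.86 256000 239872 169686 0
nid001638 127.60 256000 239872 171060 0
nid001922 127.35 256000 239872 208172 0'''

def summarise(text):
    loads, free = [], []
    for line in text.splitlines():
        fields = line.split()
        # Keep only data rows: six columns with a node name starting 'nid'
        if len(fields) == 6 and fields[0].startswith('nid'):
            loads.append(float(fields[1]))   # CPU_LOAD column
            free.append(int(fields[4]))      # FREE_MEM column
    return min(loads), max(loads), min(free), max(free)

summary = summarise(listing)
print(summary)
```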
"},{"location":"user-guide/profile/#further-help-with-slurm","title":"Further help with Slurm","text":"The definitions of any variables discussed here and more usage information can be found in the man pages of sstat
and sacct
.
The AMD \u03bcProf tool provides capabilities for low-level profiling on AMD processors, see:
The Linaro Forge tool also provides profiling capabilities. See:
The Darshan lightweight IO profiling tool provides a quick way to profile the IO part of your software:
Python is supported on ARCHER2 both for running intensive parallel jobs and also as an analysis tool. This section describes how to use Python in either of these scenarios.
The Python installations on ARCHER2 contain some of the most commonly used packages. If you wish to install additional Python packages, we recommend that you use the pip
command; see the section entitled Installing your own Python packages (with pip).
Important
Python 2 is not supported on ARCHER2 as it has been deprecated since the start of 2020.
Note
When you log onto ARCHER2, no Python module is loaded by default. You will generally need to load the cray-python
module to access the functionality described below.
The recommended way to use Python on ARCHER2 is to use the HPE Cray Python distribution.
The HPE Cray distribution provides Python 3 along with some of the most common packages used for scientific computation and data analysis. These include:
The HPE Cray Python distribution can be loaded (either on the front-end or in a submission script) using:
module load cray-python\n
Tip
The HPE Cray Python distribution is built using GCC compilers. If you wish to compile your own Python, C/C++ or Fortran code to use with HPE Cray Python, you should ensure that you compile using PrgEnv-gnu
to make sure they are compatible.
Sometimes, you may need to set up a local custom Python environment such that it extends a centrally-installed cray-python
module. By extend, we mean being able to install packages locally that are not provided by cray-python
. This is necessary because some Python packages such as mpi4py
must be built specifically for the ARCHER2 system and so are best provided centrally.
You can do this by creating a lightweight virtual environment where the local packages can be installed. This environment is created on top of an existing Python installation, known as the environment's base Python.
First, load the PrgEnv-gnu
environment.
auser@ln01:~> module load PrgEnv-gnu\n
This first step is necessary because subsequent pip
installs may involve source code compilation and it is better that this be done using the GCC compilers to maintain consistency with how some base Python packages have been built.
Second, select the base Python by loading the cray-python
module that you wish to extend.
auser@ln01:~> module load cray-python\n
Next, create the virtual environment within a designated folder.
python -m venv --system-site-packages /work/t01/t01/auser/myvenv\n
In our example, the environment is created within a myvenv
folder located on /work
, which means the environment will be accessible from the compute nodes. The --system-site-packages
option ensures this environment is based on the currently loaded cray-python
module. See https://docs.python.org/3/library/venv.html for more details.
You're now ready to activate your environment.
source /work/t01/t01/auser/myvenv/bin/activate\n
Tip
The myvenv
path uses a fictitious project code, t01
, and username, auser
. Please remember to replace those values with your actual project code and username. Alternatively, you could enter ${HOME/home/work}
in place of /work/t01/t01/auser
. That command fragment expands ${HOME}
and then replaces the home
part with work
.
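For readers less familiar with this form of shell substitution, the following Python sketch mirrors what the fragment does: it replaces only the first occurrence of home in the path. The function name work_path is hypothetical and the path uses the same fictitious project code and username as above.

```python
def work_path(home):
    # Replace only the first occurrence of 'home', as ${HOME/home/work} does
    return home.replace('home', 'work', 1)

print(work_path('/home/t01/t01/auser'))
```

Because only the first match is replaced, a username that happens to contain the string home is left untouched.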
Installing packages to your local environment can now be done as follows.
(myvenv) auser@ln01:~> python -m pip install <package name>\n
Running pip
directly as in pip install <package name>
will also work, but we show the python -m
approach as this is consistent with the way the virtual environment was created. Further, if the package installation will require code compilation, you should amend the command to ensure use of the ARCHER2 compiler wrappers.
(myvenv) auser@ln01:~> CC=cc CXX=CC FC=ftn python -m pip install <package name>\n
And when you have finished installing packages, you can deactivate the environment by running the deactivate
command.
(myvenv) auser@ln01:~> deactivate\nauser@ln01:~>\n
The packages you have installed will only be available once the local environment has been activated. So, when running code that requires these packages, you must first activate the environment, by adding the activation command to the submission script, as shown below.
#!/bin/bash --login\n\n#SBATCH --job-name=myvenv\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=64\n#SBATCH --cpus-per-task=2\n#SBATCH --time=00:10:00\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\nsource /work/t01/t01/auser/myvenv/bin/activate\n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\n\nsrun --distribution=block:block --hint=nomultithread python myvenv-script.py\n
Tip
If you find that a module you've installed to a virtual environment on /work
isn't found when running a job, it may be that it was previously installed to the default location of $HOME/.local
which is not mounted on the compute nodes. This can be an issue as pip
will reuse any modules found at this default location rather than reinstall them into a virtual environment. Thus, even if the virtual environment is on /work
, a module you've asked for may actually be located on /home
.
You can check a module's install location and its dependencies with pip show
, for example pip show matplotlib
. You may then run pip uninstall matplotlib
while no virtual environment is active to uninstall it from $HOME/.local
, and then re-run pip install matplotlib
while your virtual environment on /work
is active to reinstall it there. You will need to do this for any modules installed on /home
that you will use either directly or indirectly. Remember you can check all your installed modules with pip list
.
The environment being extended does not have to come from one of the centrally-installed cray-python
modules. You can also create a local virtual environment based on one of the Machine Learning (ML) modules, e.g., tensorflow
or pytorch
. One extra command is required; it is issued immediately after the python -m venv ...
command.
extend-venv-activate /work/t01/t01/auser/myvenv\n
The extend-venv-activate
command merely adds some extra commands to the virtual environment's activate
script, ensuring that the Python packages will be gathered from the local virtual environment, the ML module and from the cray-python
base module. All this means you would avoid having to install ML packages within your local area.
Note
The extend-venv-activate
command becomes available (i.e., its location is placed on the path) only when the ML module is loaded. The ML modules are themselves based on cray-python
. For example, tensorflow/2.12.0
is based on the cray-python/3.9.13.1
module.
Conda-based Python distributions (e.g. Anaconda, Mamba, Miniconda) are an extremely popular way of installing and accessing software on many systems, including ARCHER2. Although conda-based distributions can be used on ARCHER2, care is needed in how they are installed and configured so that the installation does not adversely affect your use of ARCHER2. In particular, you should be careful of:
.bashrc
We cover each of these points in more detail below.
"},{"location":"user-guide/python/#conda-install-location","title":"Conda install location","text":"If you only need to use the files and executables from your conda installation on the login and data analysis nodes (via the serial
QoS) then the best place to install conda is in your home directory structure - this will usually be the default install location provided by the installation script.
If you need to access the files and executables from conda on the compute nodes then you will need to install to a different location as the home file systems are not available on the compute nodes. The work file systems are not well suited to hosting Python software natively due to the way in which file access works, particularly during Python startup. There are two main options for using conda from ARCHER2 compute nodes:
You can pull official conda-based container images from Dockerhub that you can use if you want just the standard set of Python modules that come with the distribution. For example, to get the latest Anaconda distribution as a Singularity container image on the ARCHER2 work file system, you would use (on an ARCHER2 login node, from the directory on the work file system where you want to store the container image):
singularity build anaconda3.sif docker://continuumio/anaconda3\n
Once you have the container image, you can run scripts in it with a command like:
singularity exec -B $PWD anaconda3.sif python my_script.py\n
As the container image is a single large file, you end up doing a single large read from the work file system rather than lots of small reads of individual Python files; this improves the performance of Python and reduces the detrimental impact on the wider file system performance for all users.
We provide a pre-built Singularity container image with the Anaconda distribution on ARCHER2. Users can access it at $EPCC_SINGULARITY_DIR/anaconda3.sif
. To run a Python script with the centrally-installed image, you can use:
singularity exec -B $PWD $EPCC_SINGULARITY_DIR/anaconda3.sif python my_script.py\n
If you want additional packages that are not available in the standard container images then you will need to build your own container images. If you need help to do this, then please contact the ARCHER2 Service Desk.
"},{"location":"user-guide/python/#conda-addtions-to-shell-configuration-files","title":"Conda additions to shell configuration files","text":"During the install process most conda-based distributions will ask a question like:
Do you wish the installer to initialize Miniconda3 by running conda init?
If you are installing to the ARCHER2 work directories or the solid state storage, you should answer \"no\" to this question.
Adding the initialisation to shell startup scripts (typically .bashrc
) means that every time you log in to ARCHER2, the conda environment will try to initialise by reading lots of files within the conda installation. This approach was designed for the case where a user has installed conda on their personal device and so is the only user of the file system. For shared file systems such as those on ARCHER2, this places a large load on the file system and will lead to you seeing slow login times and slow response from your command line on ARCHER2. It will also lead to degraded read/write performance from the work file systems for you and other users, and so should be avoided at all costs.
If you have previously installed a conda distribution and answered \"yes\" to the question about adding the initialisation to shell configuration files, you should edit your ~/.bashrc
file to remove the conda initialisation entries. This means deleting the lines that look something like:
# >>> conda initialize >>>\n# !! Contents within this block are managed by 'conda init' !!\n__conda_setup=\"$('/work/t01/t01/auser/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)\"\nif [ $? -eq 0 ]; then\neval \"$__conda_setup\"\nelse\nif [ -f \"/work/t01/t01/auser/miniconda3/etc/profile.d/conda.sh\" ]; then\n. \"/work/t01/t01/auser/miniconda3/etc/profile.d/conda.sh\"\nelse\nexport PATH=\"/work/t01/t01/auser/miniconda3/bin:$PATH\"\nfi\nfi\nunset __conda_setup\n# <<< conda initialize <<<\n
"},{"location":"user-guide/python/#running-python","title":"Running Python","text":""},{"location":"user-guide/python/#example-serial-python-submission-script","title":"Example serial Python submission script","text":"#!/bin/bash --login\n\n#SBATCH --job-name=python_test\n#SBATCH --ntasks=1\n#SBATCH --time=00:10:00\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=serial\n#SBATCH --qos=serial\n\n# Load the Python module, ...\nmodule load cray-python\n\n# ..., or, if using local virtual environment\nsource <<path to virtual environment>>/bin/activate\n\n# Run your Python program\npython python_test.py\n
"},{"location":"user-guide/python/#example-mpi4py-job-submission-script","title":"Example mpi4py job submission script","text":"Programs that have been parallelised with mpi4py can be run on the ARCHER2 compute nodes. Unlike the serial Python submission script however, we must launch the Python interpreter using srun
. Failing to do so will result in Python running a single MPI rank only.
#!/bin/bash --login\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=mpi4py_test\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --time=0:10:0\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the Python module, ...\nmodule load cray-python\n\n# ..., or, if using local virtual environment\nsource <<path to virtual environment>>/bin/activate\n\n# Pass cpus-per-task setting to srun\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\n\n# Run your Python program\n# Note that srun MUST be used to wrap the call to python,\n# otherwise your code will run serially\nsrun --distribution=block:block --hint=nomultithread python mpi4py_test.py\n
Tip
If you have installed your own packages you will need to activate your local Python environment within your job submission script as shown at the end of Installing your own Python packages (with pip).
By default, mpi4py will use the Cray MPICH OFI library. If you wish to use UCX instead, you must first, within the submission script, load PrgEnv-gnu
before loading the UCX modules, as shown below.
module load PrgEnv-gnu\nmodule load craype-network-ucx\nmodule load cray-mpich-ucx\nmodule load cray-python\n
"},{"location":"user-guide/python/#running-python-at-scale","title":"Running Python at scale","text":"The file system metadata server may become overloaded when running a parallel Python script over many fully populated nodes (i.e., 128 MPI ranks per node). Performance degrades due to the IO operations that accompany a high volume of Python import statements. Typically, each import will first require the module or library to be located by searching a number of file paths before the module is loaded into memory. Such a workload scales as Np x Nlib x Npath, where Np is the number of parallel processes, Nlib is the number of libraries imported and Npath the number of file paths searched. As a result, much time can be lost during the initial phase of a large Python job, and the IO contention will also impact other users of the system.
Spindle is a tool for improving the library-loading performance of dynamically linked HPC applications. It provides a mechanism for\u00a0scalable loading of shared libraries, executables and Python\u00a0files from a shared file system at scale without turning the file system into a bottleneck. This is achieved by caching libraries or their locations within node memory. Spindle takes a\u00a0pure user-space\u00a0approach: users do not need to configure new file systems, load particular OS kernels or build special system components. The tool operates on existing binaries \u2014\u00a0no application modification or special build flags\u00a0are required.
The script below shows how to run Spindle with your Python code.
#!/bin/bash --login\n\n#SBATCH --nodes=256\n#SBATCH --ntasks-per-node=128\n...\n\nmodule load cray-python\nmodule load spindle/0.13\n\nexport SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}\n\nspindle --slurm --python-prefix=/opt/cray/pe/python/${CRAY_PYTHON_LEVEL} \\\n srun --overlap --distribution=block:block --hint=nomultithread \\\n python mpi4py_script.py\n
The --python-prefix
argument can be set to a list of colon-separated paths if necessary. In the example above, the CRAY_PYTHON_LEVEL
environment variable is set as a consequence of loading cray-python
.
Note
The srun --overlap
option is required for Spindle as the version of Slurm on ARCHER2 is newer than 20.11.
It is possible to view and run Jupyter notebooks from both login nodes and compute nodes on ARCHER2.
Note
You can test such notebooks on the login nodes, but please do not attempt to run any computationally intensive work. Jobs may get killed once they hit a CPU limit on login nodes.
Please follow these steps.
Install JupyterLab in your work directory.
module load cray-python\nexport PYTHONUSERBASE=/work/t01/t01/auser/.local\nexport PATH=$PYTHONUSERBASE/bin:$PATH\n# source <<path to virtual environment>>/bin/activate # If using a virtual environment, uncomment this line and remove the --user flag from the pip command below\n\npip install --user jupyterlab\n
If you want to test JupyterLab on the login node please go straight to step 3. To run your Jupyter notebook on a compute node, you first need to run an interactive session.
srun --nodes=1 --exclusive --time=00:20:00 --account=<your_budget> \\\n --partition=standard --qos=short --reservation=shortqos \\\n --pty /bin/bash\n
Your prompt will change to something like below.
auser@nid001015:/tmp>\n
In this case, the node id is nid001015
. Now execute the following on the compute node.
cd /work/t01/t01/auser # Update the path to your work directory\nexport PYTHONUSERBASE=$(pwd)/.local\nexport PATH=$PYTHONUSERBASE/bin:$PATH\nexport HOME=$(pwd)\nmodule load cray-python\n# source <<path to virtual environment>>/bin/activate # If using a virtual environment, uncomment this line\n
Run the JupyterLab server.
export JUPYTER_RUNTIME_DIR=$(pwd)\njupyter lab --ip=0.0.0.0 --no-browser\n
Once it's started, you will see a URL printed in the terminal window of the form http://127.0.0.1:<port_number>/lab?token=<string>
; we'll need this URL for step 6.
Please skip this step if you are connecting from a machine running Windows. Open a new terminal window on your laptop and run the following command.
ssh <username>@login.archer2.ac.uk -L<port_number>:<node_id>:<port_number>\n
where <username>
is your username, and <node_id>
is the id of the node you're currently on (for a login node, this will be ln01
, or similar; on a compute node, it will be a mix of numbers and letters). In our example, <node_id>
is nid001015
. Note, please use the same port number as that shown in the URL of step 3. This number may vary; likely values are 8888 or 8889.
Please skip this step if you are connecting from Linux or macOS. If you are connecting from Windows, you should use MobaXterm to configure an SSH tunnel as follows.
Click the Tunnelling
button above the MobaXterm terminal. Create a new tunnel by clicking on New SSH tunnel
in the window that opens.
Ensure the Local port forwarding
radio button is selected.
In the forwarded port
text box on the left under My computer with MobaXterm
, enter the port number indicated in the JupyterLab server output (e.g., 8888 or 8890).
Under SSH server
enter login.archer2.ac.uk
, your ARCHER2 username and then 22
.
Under Remote server
, enter the id of the login or compute node running the JupyterLab server and the associated port number.
Click the Save
button.
Select the .ppk
private key that you normally use when connecting to ARCHER2.
Now, if you open a browser window locally, you should be able to navigate to the URL from step 3, and this should display the JupyterLab server. If JupyterLab is running on a compute node, the notebook will be available for the length of the interactive session you have requested.
Warning
Please do not use the other http address given by the JupyterLab output, the one formatted http://<node_id>:<port_number>/lab?token=<string>
. Your local browser will not recognise the <node_id>
part of the address.
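To make the relationship between the JupyterLab URL and the SSH tunnel explicit, the Python sketch below extracts the port from a URL of the form shown in step 3 and assembles the matching ssh command. The function name tunnel_command, the token and the username are hypothetical; nid001015 is the node id used in the example above.

```python
from urllib.parse import urlsplit

def tunnel_command(url, username, node_id):
    # The port is taken from the URL printed by the JupyterLab server
    port = urlsplit(url).port
    return f'ssh {username}@login.archer2.ac.uk -L{port}:{node_id}:{port}'

# Hypothetical URL and username for illustration only
cmd = tunnel_command('http://127.0.0.1:8888/lab?token=abc123', 'auser', 'nid001015')
print(cmd)
```

The same port number appears on both sides of the forwarding rule, which is why the 127.0.0.1 URL works unchanged in your local browser.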
The Dask-jobqueue project makes it easy to deploy Dask on ARCHER2. You can find more information in the Dask Job-Queue documentation.
Please follow these steps:
module load cray-python\nexport PYTHONUSERBASE=/work/t01/t01/auser/.local\nexport PATH=$PYTHONUSERBASE/bin:$PATH\n\npip install --user dask-jobqueue --upgrade\n
Dask-jobqueue creates a Dask Scheduler in the Python process where the cluster object is instantiated. A script for running dask jobs on ARCHER2 might look something like this:
from dask_jobqueue import SLURMCluster\ncluster = SLURMCluster(cores=128, \n processes=16,\n memory='256GB',\n queue='standard',\n header_skip=['--mem'],\n job_extra=['--qos=\"standard\"'],\n python='srun python',\n project='z19',\n walltime=\"01:00:00\",\n shebang=\"#!/bin/bash --login\",\n local_directory='$PWD',\n interface='hsn0',\n env_extra=['module load cray-python',\n 'export PYTHONUSERBASE=/work/t01/t01/auser/.local/',\n 'export PATH=$PYTHONUSERBASE/bin:$PATH',\n 'export PYTHONPATH=$PYTHONUSERBASE/lib/python3.8/site-packages:$PYTHONPATH'])\n\n\n\ncluster.scale(jobs=2) # Deploy two single-node jobs\n\nfrom dask.distributed import Client\nclient = Client(cluster) # Connect this local process to remote workers\n\n# wait for jobs to arrive, depending on the queue, this may take some time\nimport dask.array as da\nx = \u2026 # Dask commands now use these distributed resources\n
This script can be run on the login nodes and it submits the Dask jobs to the job queue. Users should ensure that the computationally intensive work is done with the Dask commands which run on the compute nodes.
The cluster object parameters specify the characteristics for running on a single compute node. The header_skip option is required because we are running on exclusive nodes, where you should not specify the memory requirements; however, Dask requires you to supply the memory option.
Jobs are deployed with the cluster.scale command, where the jobs option sets the number of single node jobs requested. Job scripts are generated (from the cluster object) and these are submitted to the queue to begin running once the resources are available. You can check the status of the jobs by running squeue -u $USER
in a separate terminal.
If you wish to see the generated job script you can use:
print(cluster.job_script())\n
"},{"location":"user-guide/scheduler/","title":"Running jobs on ARCHER2","text":"As with most HPC services, ARCHER2 uses a scheduler to manage access to resources and ensure that the thousands of different users of the system are able to share the system and all get access to the resources they require. ARCHER2 uses the Slurm software to schedule jobs.
Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.
Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.
Hint
If you have any questions on how to run jobs on ARCHER2 do not hesitate to contact the ARCHER2 Service Desk.
You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.
"},{"location":"user-guide/scheduler/#resources","title":"Resources","text":""},{"location":"user-guide/scheduler/#cus","title":"CUs","text":"Time used on ARCHER2 is measured in CUs. 1 CU = 1 Node Hour for a standard 128 core node.
The CU calculator will help you to calculate the CU cost for your jobs.
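As a rough illustration of the definition above, the following Python sketch computes a CU cost from a node count and a run time. It is a back-of-envelope estimate only: the function name cu_cost is hypothetical, and the sketch assumes a simple nodes times hours product with no rounding and no adjustment for the type of job.

```python
def cu_cost(nodes, runtime_seconds):
    # 1 CU = 1 node hour on a standard 128 core node
    return nodes * runtime_seconds / 3600.0

# A 2-node job that runs for 6 minutes
cost = cu_cost(2, 6 * 60)
print(cost)
```

Use the CU calculator for authoritative figures.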
"},{"location":"user-guide/scheduler/#checking-available-budget","title":"Checking available budget","text":"You can check in SAFE by selecting Login accounts
from the menu and then selecting the login account you want to query.
Under Login account details
you will see each of the budget codes you have access to listed e.g. e123 resources
and then under Resource Pool to the right of this, a note of the remaining budget in CUs.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins\n
This will list all the budget codes that you have access to e.g.
Account User MaxTRESMins\n---------- ---------- -------------\n e123 userx cpu=0\n e123-test userx\n
This shows that userx
is a member of budgets e123
and e123-test
. However, the cpu=0
indicates that the e123
budget is empty or disabled. This user can submit jobs using the e123-test
budget.
To see the number of CUs remaining you must check in SAFE.
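If you want to pick out the usable budgets programmatically, the listing can be filtered on the MaxTRESMins column. The Python below is a minimal sketch using the example output above; the function name usable_budgets is hypothetical, and it assumes that an empty or disabled budget is always shown as exactly cpu=0.

```python
# Output copied from the example above
listing = '''   Account       User   MaxTRESMins
---------- ---------- -------------
      e123      userx         cpu=0
 e123-test      userx'''

def usable_budgets(text):
    usable = []
    for line in text.splitlines()[2:]:      # skip the two header lines
        fields = line.split()
        if not fields:
            continue
        limit = fields[2] if len(fields) > 2 else ''
        # cpu=0 marks a budget that is empty or disabled
        if limit != 'cpu=0':
            usable.append(fields[0])
    return usable

budgets = usable_budgets(listing)
print(budgets)
```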
"},{"location":"user-guide/scheduler/#charging","title":"Charging","text":"Jobs run on ARCHER2 are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).
Jobs are charged for the full number of nodes which are requested, even if they are not all used.
Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.
"},{"location":"user-guide/scheduler/#basic-slurm-commands","title":"Basic Slurm commands","text":"There are four key commands used to interact with Slurm on the command line:
sinfo
- Get information on the partitions and resources available
sbatch jobscript.slurm
- Submit a job submission script (in this case called: jobscript.slurm
) to the scheduler
squeue
- Get the current status of jobs submitted to the scheduler
scancel 12345
- Cancel a job (in this case with the job ID 12345
)
We cover each of these commands in more detail below.
"},{"location":"user-guide/scheduler/#sinfo-information-on-resources","title":"sinfo
: information on resources","text":"sinfo
is used to query information about available resources and partitions. Without any options, sinfo
lists the status of all resources and partitions, e.g.
auser@ln01:~> sinfo\n\nPARTITION AVAIL TIMELIMIT NODES STATE NODELIST\nstandard up 1-00:00:00 105 down* nid[001006,...,002014]\nstandard up 1-00:00:00 12 drain nid[001016,...,001969]\nstandard up 1-00:00:00 5 resv nid[001000,001002-001004,001114]\nstandard up 1-00:00:00 683 alloc nid[001001,...,001970-001991]\nstandard up 1-00:00:00 214 idle nid[001022-001023,...,002015-002023]\nstandard up 1-00:00:00 2 down nid[001021,001050]\n
Here we see the number of nodes in different states. For example, 683 nodes are allocated (running jobs), and 214 are idle (available to run jobs).
Note
that long lists of node IDs have been abbreviated with ...
.
sbatch
: submitting jobs","text":"sbatch
is used to submit a job script to the job submission system. The script will typically contain one or more srun
commands to launch parallel tasks.
When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.
auser@ln01:~> sbatch test-job.slurm\nSubmitted batch job 12345\n
"},{"location":"user-guide/scheduler/#squeue-monitoring-jobs","title":"squeue
: monitoring jobs","text":"squeue
without any options or arguments shows the current status of all jobs known to the scheduler. For example:
auser@ln01:~> squeue\n
will list all jobs on ARCHER2.
The output of this is often overwhelmingly large. You can restrict the output to just your jobs by adding the -u $USER
option:
auser@ln01:~> squeue -u $USER\n
"},{"location":"user-guide/scheduler/#scancel-deleting-jobs","title":"scancel
: deleting jobs","text":"scancel
is used to delete jobs from the scheduler. If the job is waiting to run, it is simply cancelled; if it is running, it is stopped immediately.
If you only want to cancel a specific job you need to provide the job ID of the job you wish to cancel/stop. For example:
auser@ln01:~> scancel 12345\n
will cancel (if waiting) or stop (if running) the job with ID 12345
.
scancel
can take other options. For example, if you want to cancel all your pending (queued) jobs but leave the running jobs running, you could use:
auser@ln01:~> scancel --state=PENDING --user=$USER\n
"},{"location":"user-guide/scheduler/#resource-limits","title":"Resource Limits","text":"The ARCHER2 resource limits for any given job are covered by three separate attributes.
The primary resource you can request for your job is the compute node.
Information
The --exclusive
option is enforced on ARCHER2 which means you will always have access to all of the memory on the compute node regardless of how many processes are actually running on the node.
Note
You will not generally have access to the full amount of memory on the node as some is retained for running the operating system and other system processes.
"},{"location":"user-guide/scheduler/#partitions","title":"Partitions","text":"On ARCHER2, compute nodes are grouped into partitions. You will have to specify a partition using the --partition
option in your Slurm submission script. The following table has a list of active partitions on ARCHER2.
Note
The standard
partition includes both the standard memory and high memory nodes but standard memory nodes are preferentially chosen for jobs where possible. To guarantee access to high memory nodes you should specify the highmem
partition.
On ARCHER2, job limits are defined by the requested Quality of Service (QoS), as specified by the --qos
Slurm directive. The following table lists the active QoS on ARCHER2.
You can find out the QoS that you can use by running the following command:
auser@ln01:~> sacctmgr show assoc user=$USER cluster=archer2 format=cluster,account,user,qos%50\n
Hint
If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.
"},{"location":"user-guide/scheduler/#e-mail-notifications","title":"E-mail notifications","text":"E-mail notifications from the scheduler are not currently available on ARCHER2.
"},{"location":"user-guide/scheduler/#priority","title":"Priority","text":"Job priority on ARCHER2 depends on a number of different factors:
Each of these factors is normalised to a value between 0 and 1 and multiplied by a weight, and the resulting values are combined to produce a priority for the job. The current job priority formula on ARCHER2 is:
Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)] + [100 * P(size)]\n
The priority factors are:
lowpriority
QoS has a raw priority of 1.
You can view the priorities for currently queued jobs on the system with the sprio
command:
auser@ln04:~> sprio -l\n JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE QOS\n 828764 standard 1049 0 45 0 4 1000\n 828765 standard 1049 0 45 0 4 1000\n 828770 standard 1049 0 45 0 4 1000\n 828771 standard 1012 0 8 0 4 1000\n 828773 standard 1012 0 8 0 4 1000\n 828791 standard 1012 0 8 0 4 1000\n 828797 standard 1118 0 115 0 4 1000\n 828800 standard 1154 0 150 0 4 1000\n 828801 standard 1154 0 150 0 4 1000\n 828805 standard 1118 0 115 0 4 1000\n 828806 standard 1154 0 150 0 4 1000\n
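As a worked example of the priority formula above, the sketch below combines four normalised factors using the documented weights. The factor values passed in are hypothetical: they were chosen so that the weighted terms match the first row of the sprio output (QOS 1000, AGE 45, FAIRSHARE 0, JOBSIZE 4), on the assumption that sprio reports each factor after its weight has been applied. The function name priority is illustrative only.

```shell
# Combine normalised priority factors using the weights from the formula above.
# Hypothetical helper for illustration; the scheduler computes this internally.
priority() {
    # $1 = P(QoS), $2 = P(Age), $3 = P(Fairshare), $4 = P(size), each between 0 and 1
    awk -v q=$1 -v a=$2 -v f=$3 -v s=$4 'BEGIN { print 10000*q + 500*a + 300*f + 100*s }'
}

# 10000*0.1 + 500*0.09 + 300*0 + 100*0.04
priority 0.1 0.09 0 0.04
```

With these inputs the QoS term dominates, which reflects its much larger weight relative to age, fairshare and job size.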
"},{"location":"user-guide/scheduler/#troubleshooting","title":"Troubleshooting","text":""},{"location":"user-guide/scheduler/#slurm-error-messages","title":"Slurm error messages","text":"An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:
sbatch: unrecognized option <text>
One of your options is invalid or has a typo. Consult man sbatch
for help.
error: Batch job submission failed: No partition specified or system default partition
A --partition=
option is missing. You must specify the partition (see the list above). This is most often --partition=standard
.
error: invalid partition specified: <partition>
error: Batch job submission failed: Invalid partition name specified
Check the partition exists and check the spelling is correct.
error: Batch job submission failed: Invalid account or account/partition combination specified
This probably means an invalid account has been given. Check the --account=
option against the valid accounts in SAFE.
error: Batch job submission failed: Invalid qos specification
A QoS option is either missing or invalid. Check the script has a --qos=
option and that the option is a valid one from the table above. (Check the spelling of the QoS is correct.)
error: Your job has no time specification (--time=)...
Add an option of the form --time=hours:minutes:seconds
to the submission script. E.g., --time=01:30:00
gives a time limit of 90 minutes.
error: QOSMaxWallDurationPerJobLimit
error: Batch job submission failed: Job violates accounting/QOS policy
(job submit limit, user's size and/or time limits)
The script has probably specified a time limit which is too long for the corresponding QoS. E.g., the time limit for the short QoS is 20 minutes.
The squeue
command allows users to view information for jobs managed by Slurm. Jobs typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED. The first table provides a description of some job state codes. The second table provides a description of the reasons that cause a job to be in a state.
For a full list, see Job State Codes.
"},{"location":"user-guide/scheduler/#slurm-queued-reasons","title":"Slurm queued reasons","text":"Reason Description Priority One or more higher priority jobs exist for this partition or advanced reservation. Resources The job is waiting for resources to become available. BadConstraints The job's constraints can not be satisfied. BeginTime The job's earliest start time has not yet been reached. Dependency This job is waiting for a dependent job to complete. Licenses The job is waiting for a license. WaitingForScheduling No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. Prolog Its PrologSlurmctld program is still running. JobHeldAdmin The job is held by a system administrator. JobHeldUser The job is held by the user. JobLaunchFailure The job could not be launched. This may be due to a file system problem, invalid program name, etc. NonZeroExitCode The job terminated with a non-zero exit code. InvalidAccount The job's account is invalid. InvalidQOS The job's QOS is invalid. QOSUsageThreshold Required QOS threshold has been breached. QOSJobLimit The job's QOS has reached its maximum job count. QOSResourceLimit The job's QOS has reached some resource limit. QOSTimeLimit The job's QOS has reached its time limit. NodeDown A node required by the job is down. TimeLimit The job exhausted its time limit. ReqNodeNotAvail Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's \"reason\" field as \"UnavailableNodes\". Such nodes will typically require the intervention of a system administrator to make available.For a full list of see Job Reasons.
"},{"location":"user-guide/scheduler/#output-from-slurm-jobs","title":"Output from Slurm jobs","text":"Slurm places standard output (STDOUT) and standard error (STDERR) for each job in the file slurm_<JobID>.out
. This file appears in the job's working directory once your job starts running.
Hint
Output may be buffered - to enable live output, e.g. for monitoring job status, add --unbuffered
to the srun
command in your Slurm script.
You specify the resources you require for your job with directives at the top of your job submission script, on lines that start with #SBATCH
.
Hint
Most options provided using #SBATCH
directives can also be specified as command line options to srun
.
If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget to which the job should be charged, using the option:
--account=<budgetID>
your budget ID is usually something like t01
or t01-test
. You can see which budget codes you can charge to in SAFE.
Important
You must specify an account code for your job otherwise it will fail to submit with the error: sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
. (This error can also mean that you have specified a budget that has run out of resources.)
Other common options that are used are:
--time=<hh:mm:ss>
the maximum walltime for your job, e.g. for a 6.5 hour walltime, you would use --time=6:30:0
.
--job-name=<jobname>
set a name for the job to help identify it in Slurm.
To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, the option
--export=none
prevents the user environment from being exported to the batch system.
Using --export=none
means that the behaviour of batch submissions should be repeatable. We strongly recommend its use.
Note
When submitting your job, the scheduler will check that the requested resources are available e.g. that your account is a member of the requested budget, that the requested QoS exists. If things change before the job starts and e.g. your account has been removed from the requested budget or the requested QoS has been deleted then the job will not be able to start. In such cases, the job will be removed from the pending queue by our systems team, as it will no longer be eligible to run.
"},{"location":"user-guide/scheduler/#additional-options-for-parallel-jobs","title":"Additional options for parallel jobs","text":"Note
For parallel jobs, ARCHER2 operates in a node exclusive way. This means that you are assigned resources in the units of full compute nodes for your jobs (i.e. 128 cores) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (or 128 cores).
In addition, parallel jobs will also need to specify how many nodes, parallel processes and threads they require.
--nodes=<nodes>
the number of nodes to use for the job.
--ntasks-per-node=<processes per node>
the number of parallel processes (e.g. MPI ranks) per node.
--cpus-per-task=1
if you are using parallel processes only with no threading and you want to use all 128 cores on the node then you should set the number of CPUs (cores) per parallel process to 1. Important: if you are using threading (e.g. with OpenMP) or you want to use fewer than 128 cores per node (e.g. to access more memory or memory bandwidth per core) then you will need to change this option as described below.
--cpu-freq=<freq. in kHz>
set the CPU frequency for the compute nodes. Valid values are 2250000
(2.25 GHz), 2000000
(2.0 GHz), 1500000
(1.5 GHz). For more information on CPU frequency settings and energy use see the Energy use section.
For parallel jobs that use threading (e.g. OpenMP) or when you want to use fewer than 128 cores per node (e.g. to access more memory or memory bandwidth per core), you will also need to change the --cpus-per-task
option.
For jobs using threading:
--cpus-per-task=<threads per task>
the number of threads per parallel process (e.g. number of OpenMP threads per MPI task for hybrid MPI/OpenMP jobs). Important: you must also set the OMP_NUM_THREADS
environment variable if using OpenMP in your job.
For jobs using fewer than 128 cores per node:
--cpus-per-task=<stride between placement of processes>
the stride between the parallel processes. For example, to double the memory and memory bandwidth per process on an ARCHER2 compute node, you would place 64 processes per node and leave an empty core between each process by setting --cpus-per-task=2
and --ntasks-per-node=64
.
Important
You must also add export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
to your job submission script to pass the --cpus-per-task
setting from the job script to the srun
command. (Alternatively, you could use the --cpus-per-task
option in the srun command itself.) If you do not do this then the placement of processes/threads will be incorrect and you will likely see poor performance of your application.
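The relationship between the options above can be checked with simple arithmetic: the number of processes per node multiplied by the number of CPUs per task should not exceed the 128 cores on a node. The following is a minimal sketch of such a check; the function name check_layout is hypothetical and the script is not part of the ARCHER2 tooling.

```shell
# Check that a per-node layout fits on a 128-core ARCHER2 compute node.
# Hypothetical helper for illustration only.
cores_per_node=128

check_layout() {
    # $1 = value of --ntasks-per-node, $2 = value of --cpus-per-task
    reserved=$(( $1 * $2 ))
    if [ $reserved -le $cores_per_node ]; then
        echo ok: $reserved of $cores_per_node cores reserved
    else
        echo error: $reserved cores reserved but only $cores_per_node available
        return 1
    fi
}

# 64 processes with a stride of 2 (the underpopulated example above)
check_layout 64 2
# 8 MPI processes with 16 OpenMP threads each
check_layout 8 16
```

Both layouts reserve the full node: the first leaves every second core idle to give each process more memory bandwidth, while the second fills all cores with threads.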
The data analysis nodes are shared between all users and can be used to run jobs that require small numbers of cores and/or access to an external network to transfer data. These jobs are often serial jobs that only require a single core.
To run jobs on the data analysis nodes you require the following options:
--partition=serial
to select the data analysis nodes
--qos=serial
to select the data analysis QoS (see above for QoS limits)
--ntasks=<number of cores>
to select the number of cores you want to use in this job (up to the maximum defined in the QoS)
--mem=<amount of memory>
to select the amount of memory you require (up to the maximum defined in the QoS).
More information on using the data analysis nodes (including example job submission scripts) can be found in the Data Analysis section of the User and Best Practice Guide.
"},{"location":"user-guide/scheduler/#srun-launching-parallel-jobs","title":"srun
: Launching parallel jobs","text":"If you are running parallel jobs, your job submission script should contain one or more srun
commands to launch the parallel executable across the compute nodes. In most cases you will want to add the options --distribution=block:block
and --hint=nomultithread
to your srun
command to ensure you get the correct pinning of processes to cores on a compute node.
Warning
If you do not add the --distribution=block:block
and --hint=nomultithread
options to your srun
command the default process placement may lead to a drop in performance for your jobs on ARCHER2.
A brief explanation of these options:
--hint=nomultithread
- do not use hyperthreads/SMT
--distribution=block:block
- the first block
means use a block distribution of processes across nodes (i.e. fill nodes before moving onto the next one) and the second block
means use a block distribution of processes across \"sockets\" within a node (i.e. fill a \"socket\" before moving on to the next one).
Important
The Slurm definition of a \"socket\" does not correspond to a physical CPU socket. On ARCHER2 it corresponds to a 4-core CCX (Core CompleX).
"},{"location":"user-guide/scheduler/#slurm-definition-of-a-socket","title":"Slurm definition of a \"socket\"","text":"On ARCHER2, Slurm is configured with the following setting:
SlurmdParameters=l3cache_as_socket\n
The effect of this setting is to define a Slurm socket as a unit that has a shared L3 cache. On ARCHER2, this means that each Slurm \"socket\" corresponds to a 4-core CCX (Core CompleX). For a more detailed discussion on the hardware and the memory/cache layout see the Hardware section.
The effect of this setting can be illustrated by using the xthi
program to report placement when we select a cyclic distribution of processes across sockets from srun (--distribution=block:cyclic
). As you can see from the output from xthi
included below, the cyclic
per-socket distribution results in sequential MPI processes being placed on every 4th core (i.e. cyclic placement across CCX).
Node summary for 1 nodes:\nNode 0, hostname nid000006, mpi 128, omp 1, executable xthi_mpi\nMPI summary: 128 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 4)\nNode 0, rank 2, thread 0, (affinity = 8)\nNode 0, rank 3, thread 0, (affinity = 12)\nNode 0, rank 4, thread 0, (affinity = 16)\nNode 0, rank 5, thread 0, (affinity = 20)\nNode 0, rank 6, thread 0, (affinity = 24)\nNode 0, rank 7, thread 0, (affinity = 28)\nNode 0, rank 8, thread 0, (affinity = 32)\nNode 0, rank 9, thread 0, (affinity = 36)\nNode 0, rank 10, thread 0, (affinity = 40)\nNode 0, rank 11, thread 0, (affinity = 44)\nNode 0, rank 12, thread 0, (affinity = 48)\nNode 0, rank 13, thread 0, (affinity = 52)\nNode 0, rank 14, thread 0, (affinity = 56)\nNode 0, rank 15, thread 0, (affinity = 60)\nNode 0, rank 16, thread 0, (affinity = 64)\nNode 0, rank 17, thread 0, (affinity = 68)\nNode 0, rank 18, thread 0, (affinity = 72)\nNode 0, rank 19, thread 0, (affinity = 76)\nNode 0, rank 20, thread 0, (affinity = 80)\nNode 0, rank 21, thread 0, (affinity = 84)\nNode 0, rank 22, thread 0, (affinity = 88)\nNode 0, rank 23, thread 0, (affinity = 92)\nNode 0, rank 24, thread 0, (affinity = 96)\nNode 0, rank 25, thread 0, (affinity = 100)\nNode 0, rank 26, thread 0, (affinity = 104)\nNode 0, rank 27, thread 0, (affinity = 108)\nNode 0, rank 28, thread 0, (affinity = 112)\nNode 0, rank 29, thread 0, (affinity = 116)\nNode 0, rank 30, thread 0, (affinity = 120)\nNode 0, rank 31, thread 0, (affinity = 124)\nNode 0, rank 32, thread 0, (affinity = 1)\nNode 0, rank 33, thread 0, (affinity = 5)\nNode 0, rank 34, thread 0, (affinity = 9)\nNode 0, rank 35, thread 0, (affinity = 13)\nNode 0, rank 36, thread 0, (affinity = 17)\nNode 0, rank 37, thread 0, (affinity = 21)\nNode 0, rank 38, thread 0, (affinity = 25)\n\n...output trimmed...\n
"},{"location":"user-guide/scheduler/#bolt-job-submission-script-creation-tool","title":"bolt: Job submission script creation tool","text":"The bolt job submission script creation tool has been written by EPCC to simplify the process of writing job submission scripts for modern multicore architectures. Based on the options you supply, bolt will generate a job submission script that uses ARCHER2 in a reasonable way.
MPI, OpenMP and hybrid MPI/OpenMP jobs are supported.
Warning
The tool will allow you to generate scripts for jobs that use the long
QoS but you will need to manually modify the resulting script to change the QoS to long
.
If there are problems or errors in your job parameter specifications then bolt will print warnings or errors. However, bolt cannot detect all problems.
"},{"location":"user-guide/scheduler/#basic-usage","title":"Basic Usage","text":"The basic syntax for using bolt is:
bolt -n [parallel tasks] -N [parallel tasks per node] -d [number of threads per task] \\\n -t [wallclock time (h:m:s)] -o [script name] -j [job name] -A [project code] [arguments...]\n
Example 1: to generate a job script to run an executable called my_prog.x
for 24 hours using 8192 parallel (MPI) processes and 128 (MPI) processes per compute node you would use something like:
bolt -n 8192 -N 128 -t 24:0:0 -o my_job.bolt -j my_job -A z01-budget my_prog.x arg1 arg2\n
(remember to substitute z01-budget
for your actual budget code.)
Example 2: to generate a job script to run an executable called my_prog.x
for 3 hours using 2048 parallel (MPI) processes and 64 (MPI) processes per compute node (i.e. using half of the cores on a compute node), you would use:
bolt -n 2048 -N 64 -t 3:0:0 -o my_job.bolt -j my_job -A z01-budget my_prog.x arg1 arg2\n
These examples generate the job script my_job.bolt
with the correct options to run my_prog.x
with command line arguments arg1
and arg2
. The project code against which the job will be charged is specified with the ' -A ' option. As usual, the job script is submitted as follows:
sbatch my_job.bolt\n
Hint
If you do not specify the script name with the '-o' option then your script will be a file called a.bolt
.
Hint
If you do not specify the number of parallel tasks then bolt will try to generate a serial job submission script (and throw an error on the ARCHER2 4 cabinet system as serial jobs are not supported).
Hint
If you do not specify a project code, bolt will use your default project code (set by your login account).
Hint
If you do not specify a job name, bolt will use either bolt_ser_job
(for serial jobs) or bolt_par_job
(for parallel jobs).
You can access further help on using bolt on ARCHER2 with the ' -h ' option:
bolt -h\n
A selection of other useful options are:
-s
Write and submit the job script rather than just writing the job script.
-p
Force the job to be parallel even if it only uses a single parallel task.
The checkScript tool has been written to allow users to validate their job submission scripts before submitting their jobs. The tool will read your job submission script and try to identify errors, problems or inconsistencies.
An example of the sort of output the tool can give would be:
auser@ln01:/work/t01/t01/auser> checkScript submit.slurm\n\n===========================================================================\ncheckScript\n---------------------------------------------------------------------------\nCopyright 2011-2020 EPCC, The University of Edinburgh\nThis program comes with ABSOLUTELY NO WARRANTY.\nThis is free software, and you are welcome to redistribute it\nunder certain conditions.\n===========================================================================\n\nScript details\n---------------\n User: auser\nScript file: submit.slurm\n Directory: /work/t01/t01/auser (ok)\n Job name: test (ok)\n Partition: standard (ok)\n QoS: standard (ok)\nCombination: (ok)\n\nRequested resources\n-------------------\n nodes = 3 (ok)\ntasks per node = 16\n cpus per task = 8\ncores per node = 128 (ok)\nOpenMP defined = True (ok)\n walltime = 1:0:0 (ok)\n\nCU Usage Estimate (if full job time used)\n------------------------------------------\n CU = 3.000\n\n\n\ncheckScript finished: 0 warning(s) and 0 error(s).\n
"},{"location":"user-guide/scheduler/#checking-scripts-and-estimating-start-time-with-test-only","title":"Checking scripts and estimating start time with --test-only
","text":"sbatch --test-only
validates the batch script and returns an estimate of when the job would be scheduled to run given the current scheduler state. Please note that this is only an estimate: the actual start time may differ because the scheduler state may have changed by the time the job is actually submitted, and can change again afterwards. The job is not actually submitted.
auser@ln01:~> sbatch --test-only submit.slurm\nsbatch: Job 1039497 to start at 2022-02-01T23:20:51 using 256 processors on nodes nid002836\nin partition standard\n
"},{"location":"user-guide/scheduler/#estimated-start-time-for-queued-jobs","title":"Estimated start time for queued jobs","text":"You can use the squeue
command to show the current estimated start time for a job. Please note that this is only an estimate: the actual start time may differ because the scheduler state can change after the estimate is made. To return the estimated start time for a job, specify the job ID with the --jobs=<jobid>
and --Format=StartTime
options.
For example, to show the estimated start time for job 123456
, you would use:
squeue --jobs=123456 --Format=StartTime\n
The output from this command would look like:
START_TIME\n2024-09-25T13:07:00\n
"},{"location":"user-guide/scheduler/#example-job-submission-scripts","title":"Example job submission scripts","text":"A subset of example job submission scripts are included in full below. Examples are provided for both the full system and the 4-cabinet system.
"},{"location":"user-guide/scheduler/#example-job-submission-script-for-mpi-parallel-job","title":"Example: job submission script for MPI parallel job","text":"A simple MPI job submission script to submit a job using 4 compute nodes and 128 MPI ranks per node for 20 minutes would look like:
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_MPI_Job\n#SBATCH --time=0:20:0\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Launch the parallel job\n# Using 512 MPI processes and 128 MPI processes per node\n# srun picks up the distribution from the sbatch options\n\nsrun --distribution=block:block --hint=nomultithread ./my_mpi_executable.x\n
This will run your executable \"my_mpi_executable.x\" in parallel on 512 MPI processes using 4 nodes (128 cores per node, i.e. not using hyper-threading). Slurm will allocate 4 nodes to your job and srun will place 128 MPI processes on each node (one per physical core).
See above for a more detailed discussion of the different sbatch
options.
Mixed mode codes that use both MPI (or another distributed memory parallel model) and OpenMP should take care to ensure that the shared memory portion of the process/thread placement does not span more than one NUMA region. Nodes on ARCHER2 are made up of two sockets each containing 4 NUMA regions of 16 cores, i.e. there are 8 NUMA regions in total. Therefore the number of threads per MPI process should ideally not be greater than 16, and should also be a factor of 16. Sensible choices for the number of threads are therefore 1 (single-threaded), 2, 4, 8, and 16. More information about using OpenMP and MPI+OpenMP can be found in the Tuning chapter.
To ensure correct placement of MPI processes the number of cpus-per-task needs to match the number of OpenMP threads, and the number of tasks-per-node should be set to ensure the entire node is filled with MPI tasks.
In the example below, we are using 4 nodes for 20 minutes. There are 32 MPI processes in total (8 MPI processes per node) and 16 OpenMP threads per MPI process. This results in all 128 physical cores per node being used.
Hint
Note the use of the export OMP_PLACES=cores
environment variable to generate the correct thread pinning.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_MPI_Job\n#SBATCH --time=0:20:0\n#SBATCH --nodes=4\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n\n# Replace [budget code] below with your project code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the number of threads to 16 and specify placement\n# There are 16 OpenMP threads per MPI process\n# We want one thread per physical core\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\n\n# Launch the parallel job\n# Using 32 MPI processes\n# 8 MPI processes per node\n# 16 OpenMP threads per MPI process\n# Additional srun options to pin one thread per physical core\nsrun --hint=nomultithread --distribution=block:block ./my_mixed_executable.x arg1 arg2\n
"},{"location":"user-guide/scheduler/#job-arrays","title":"Job arrays","text":"The Slurm job scheduling system offers the job array concept, for running collections of almost-identical jobs. For example, running the same program several times with different arguments or input data.
Each job in a job array is called a subjob. The subjobs of a job array can be submitted and queried as a unit, making it easier and cleaner to handle the full set, compared to individual jobs.
All subjobs in a job array are started by running the same job script. The job script also contains information on the number of jobs to be started, and Slurm provides a subjob index which can be passed to the individual subjobs or used to select the input data per subjob.
"},{"location":"user-guide/scheduler/#job-script-for-a-job-array","title":"Job script for a job array","text":"As an example, the following script runs 56 subjobs, with the subjob index as the only argument to the executable. Each subjob requests a single node and uses all 128 cores on the node by placing 1 MPI process per core and specifies 4 hours maximum runtime per subjob:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=Example_Array_Job\n#SBATCH --time=04:00:00\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --array=0-55\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\nsrun --distribution=block:block --hint=nomultithread /path/to/exe $SLURM_ARRAY_TASK_ID\n
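The subjob index can also be used to select input data rather than being passed directly to the executable. The following minimal sketch maps the index to an input file name. It assumes a hypothetical naming scheme of input_000.dat, input_001.dat and so on; the function name input_for_subjob is likewise illustrative.

```shell
# Map a subjob index to an input file name.
# Assumption: inputs are named input_000.dat, input_001.dat, ... (hypothetical scheme).
input_for_subjob() {
    # $1 = subjob index, e.g. the value of SLURM_ARRAY_TASK_ID
    printf 'input_%03d.dat' $1
}

# Outside a job array SLURM_ARRAY_TASK_ID is unset, so fall back to 0 for testing
task_id=${SLURM_ARRAY_TASK_ID:-0}
input_file=$(input_for_subjob $task_id)
echo Subjob $task_id reads $input_file
```

In a job script, the value of input_file would then replace $SLURM_ARRAY_TASK_ID as the argument on the srun line.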
"},{"location":"user-guide/scheduler/#submitting-a-job-array","title":"Submitting a job array","text":"Job arrays are submitted using sbatch
in the same way as for standard jobs:
sbatch job_script.slurm\n
"},{"location":"user-guide/scheduler/#expressing-dependencies-between-jobs","title":"Expressing dependencies between jobs","text":"SLURM allows one to express dependencies between jobs using the --dependency
(or -d
) option. This allows the start of execution of the dependent job to be delayed until some condition involving a current or previous job, or set of jobs, has been satisfied. A simple example might be:
$ sbatch --dependency=4394150 myscript.sh\nSubmitted batch job 4394325\n
This states that the execution of the new batch job should not start until job 4394150 has completed/terminated. Here, completion/termination is the only condition. The new job 4394325 should appear in the pending state with reason (Dependency)
assuming 4394150 is still running. A dependency may be of a different type, of which there are a number of relevant possibilities. If we explicitly include the default type afterany
in the example above, we would have
$ sbatch --dependency=afterany:4394150 myscript.sh\nSubmitted batch job 4394325\n
This emphasises that the first job may complete with any exit code, and still satisfy the dependency. If we wanted a dependent job which would only become eligible for execution following successful completion of the dependency, we would use afterok
: $ sbatch --dependency=afterok:4394150 myscript.sh\nSubmitted batch job 4394325\n
This means that if the dependency fails with a non-zero exit code, the dependent job will be left in a state where it will never run. This may appear in squeue
as (DependencyNeverSatisfied)
as the reason. Such jobs will need to be cancelled. The general form of the dependency list is <type:job_id[:job_id] [,type:job_id ...]>
where a dependency may include one or more jobs, with one or more types. If a list is comma-separated, all the dependencies must be satisfied before the dependent job becomes eligible. The use of ?
as the list separator implies that any of the dependencies is sufficient.
Useful type options include afterany
, afterok
, and afternotok
. For the last case, the dependency is only satisfied if the job finishes with a non-zero exit code (the opposite of afterok
). See the current SLURM documentation for a full list of possibilities.
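The dependency list form described above can be assembled programmatically when the job IDs come from earlier sbatch calls. The following is a minimal sketch only: build_dependency and join_terms are hypothetical helper names, not Slurm commands, and the job IDs 4394151 and 4394152 are made up for illustration.

```shell
#!/bin/bash
# Build one dependency term of the form type:job_id[:job_id ...]
# (build_dependency is a hypothetical helper, not part of Slurm).
build_dependency() {
    local type=$1
    shift
    local term=$type
    local id
    for id in "$@"; do
        term="${term}:${id}"
    done
    printf '%s\n' "$term"
}

# Join terms with "," (all must be satisfied) or "?" (any one is sufficient).
# "$*" joins its arguments using the first character of IFS.
join_terms() {
    local sep=$1
    shift
    local IFS=$sep
    printf '%s\n' "$*"
}

# Illustrative job IDs only
all_ok=$(build_dependency afterok 4394150 4394151)
on_fail=$(build_dependency afternotok 4394152)

join_terms , "$all_ok" "$on_fail"
join_terms '?' "$all_ok" "$on_fail"
```

The resulting string would be passed to sbatch via the --dependency option, for example sbatch --dependency="$(join_terms , "$all_ok" "$on_fail")" myscript.sh.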
Job dependencies can be used to construct complex pipelines or chain together long simulations requiring multiple steps.
For example, if we have just two jobs, the following shell script extract will submit the second dependent on the first, irrespective of actual job ID:
jobid=$(sbatch --parsable first_job.sh)\nsbatch --dependency=afterok:${jobid} second_job.sh\n
where we have used the --parsable
option to sbatch
to return just the new job ID (without the Submitted batch job
). This can be extended to a longer chain as required. E.g.:
jobid1=$(sbatch --parsable first_job.sh)\njobid2=$(sbatch --parsable --dependency=afterok:${jobid1} second_job.sh)\njobid3=$(sbatch --parsable --dependency=afterok:${jobid1} third_job.sh)\nsbatch --dependency=afterok:${jobid2},afterok:${jobid3} last_job.sh\n
Note jobs 2 and 3 are dependent on job 1 (only), but the final job is dependent on both jobs 2 and 3. This allows quite general workflows to be constructed."},{"location":"user-guide/scheduler/#number-of-jobs-not-known-in-advance","title":"Number of jobs not known in advance","text":"This automation may be taken a step further to a case where a submission script propagates itself. E.g., a script might include, schematically,
#SBATCH ...\n\n# submit new job here ...\nsbatch --dependency=afterok:${SLURM_JOB_ID} thisscript.sh\n\n# perform work here...\nsrun ...\n
where the original submission of the script will submit a new instance of itself dependent on its own successful completion. This is done via the SLURM environment variable SLURM_JOB_ID
which holds the id of the current job. One could defer the sbatch
until the end of the script to avoid the dependency never being satisfied if the work associated with the srun
fails. This approach can be useful in situations where, e.g., simulations with checkpoint/restart need to continue until some criterion is met. Some care may be required to ensure the script logic is correct in determining the criterion for stopping: it is best to start with a small/short test example. Incorrect logic and/or errors may lead to a rapid proliferation of submitted jobs. Termination of such chains needs to be arranged either via appropriate logic in the script, or manual intervention to cancel pending jobs when no longer required.
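As an illustration of the stopping logic discussed above, the following sketch defers the sbatch call to the end of the script and only resubmits while two conditions hold. It is a sketch under stated assumptions, not a recommended implementation: the should_resubmit function, the limit of 10 links, the STOP file convention and passing the link counter as a script argument are all choices made for this example.

```shell
#!/bin/bash
# Decide whether to submit another link in the chain.
# The link counter, the limit and the STOP file are hypothetical
# conventions chosen for this sketch; they are not Slurm features.
should_resubmit() {
    local link=$1        # index of the current link in the chain
    local max_links=$2   # hard upper limit on the chain length
    local stop_file=$3   # the simulation creates this file when finished
    [ "$link" -lt "$max_links" ] && [ ! -e "$stop_file" ]
}

# The link index is passed as the first script argument (default 1)
link=${1:-1}

# ... perform work here (srun ...) ...

if should_resubmit "$link" 10 STOP; then
    # Guard the call so the sketch can be tried where sbatch is absent
    if command -v sbatch > /dev/null 2>&1; then
        sbatch thisscript.sh $((link + 1))
    else
        echo "would submit link $((link + 1))"
    fi
fi
```

The hard limit on the chain length is a safeguard against the rapid proliferation of jobs mentioned above if the stopping criterion is never met.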
"},{"location":"user-guide/scheduler/#using-multiple-srun-commands-in-a-single-job-script","title":"Using multiplesrun
commands in a single job script","text":"You can use multiple srun
commands within a Slurm job submission script to use the requested resources more flexibly. For example, you could run a collection of smaller jobs within the requested resources or you could even subdivide nodes if your individual calculations do not scale up to use all 128 cores on a node.
In this guide we will cover two scenarios:
When subdividing a larger job into smaller subjobs you typically need to override the --nodes
option to srun
and add the --ntasks
option to ensure that each subjob runs on the correct number of nodes and that subjobs are placed correctly onto separate nodes.
For example, we will show how to request 100 nodes and then run 100 separate 1-node jobs, each of which uses 128 MPI processes and runs on a different compute node. We start by showing the job script that would achieve this and then explain how it works and the options used. In our case, we will run 100 copies of the xthi
program that prints the process placement on the node it is running on.
#!/bin/bash\n\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=multi_xthi\n#SBATCH --time=0:20:0\n#SBATCH --nodes=100\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Load the xthi module\nmodule load xthi\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Loop over 100 subjobs starting each of them on a separate node\nfor i in $(seq 1 100)\ndo\n# Launch this subjob on 1 node, note nodes and ntasks options and & to place subjob in the background\n srun --nodes=1 --ntasks=128 --distribution=block:block --hint=nomultithread xthi > placement${i}.txt &\ndone\n# Wait for all background subjobs to finish\nwait\n
Key points from the example job script:
#SBATCH
options select 100 full nodes in the usual way.srun
command sets the following:--nodes=1
We need to override this setting from the main job so that each subjob only uses 1 node--ntasks=128
For normal jobs, the number of parallel tasks (MPI processes) is calculated from the number of nodes you request and the number of tasks per node. We need to explicitly tell srun
how many we require for this subjob.--distribution=block:block --hint=nomultithread
These options ensure correct placement of processes within the compute nodes.&
Each subjob srun
command ends with an ampersand to place the process in the background and move on to the next loop iteration (and subjob submission). Without this, the script would wait for this subjob to complete before moving on to submit the next.wait
command to tell the script to wait for all the background subjobs to complete before exiting. If we did not have this in place, the script would exit as soon as the last subjob was submitted and kill all running subjobs.As the ARCHER2 nodes contain a large number of cores (128 per node) it may sometimes be useful to be able to run multiple executables on a single node. For example, you may want to run 128 copies of a serial executable or Python script; or, you may want to run multiple copies of parallel executables that use fewer than 128 cores each. This use model is possible using multiple srun
commands in a job script on ARCHER2.
Note
You can never share a compute node with another user. Although you can use srun
to place multiple copies of an executable or script on a compute node, you still have exclusive use of that node. The minimum amount of resources you can reserve for your use on ARCHER2 is a single node.
When using srun
to place multiple executables or scripts on a compute node you must be aware of a few things:
srun
command must specify any Slurm options that differ in value from those specified to sbatch
. This typically means that you need to specify the --nodes
, --ntasks
and --ntasks-per-node
options to srun
.--exact
flag to your srun
command. With this flag on, Slurm will ensure that the resources you request are assigned to your subjob. Furthermore, if the resources are not currently available, Slurm will output a message letting you know that this is the case and stall the launch of this subjob until enough of your previous subjobs have completed to free up the resources for this subjob.--mem=<amount of memory>
flag. The amount of memory is given in MiB by default but other units can be specified. If you do not know how much memory to specify, we recommend that you specify 1500M (1,500 MiB) per core being used.srun
command into the background and then use the wait
command at the end of the submission script to make sure it does not exit before the commands are complete.srun
per node (e.g. 256 single core processes across 2 nodes) then you need to pass the node ID to the srun
commands, otherwise Slurm will oversubscribe cores on the first node. Below, we provide four examples of running multiple subjobs in a node: one that runs 128 serial processes across a single node; one that runs 8 subjobs, each of which uses 8 MPI processes with 2 OpenMP threads per MPI process; one that runs four inhomogeneous jobs, each of which requires a different number of MPI processes and OpenMP threads per process; and one that runs 256 serial processes across two nodes.
"},{"location":"user-guide/scheduler/#example-1-128-serial-tasks-running-on-a-single-node","title":"Example 1: 128 serial tasks running on a single node","text":"For our first example, we will run 128 single-core copies of the xthi
program (which prints process/thread placement) on a single ARCHER2 compute node with each copy of xthi
pinned to a different core. The job submission script for this example would look like:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiSerialOnCompute\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Loop over 128 subjobs pinning each to a different core\nfor i in $(seq 1 128)\ndo\n# Launch subjob overriding job settings as required and in the background\n# Make sure to change the amount specified by the `--mem=` flag to the amount\n# of memory required. The amount of memory is given in MiB by default but other\n# units can be specified. If you do not know how much memory to specify, we\n# recommend that you specify `--mem=1500M` (1,500 MiB).\nsrun --nodes=1 --ntasks=1 --ntasks-per-node=1 \\\n --exact --mem=1500M xthi > placement${i}.txt &\ndone\n\n# Wait for all subjobs to finish\nwait\n
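The bare wait at the end of the script above returns once every background subjob has finished, but it does not tell you which subjobs failed. One possible way to track this, sketched below, is to record the process ID of each background command and wait on them individually. Here run_subjob is a hypothetical stand-in for the srun line (it deliberately fails for subjob 3), so the sketch can be tried without Slurm.

```shell
#!/bin/bash
# run_subjob is a stand-in for the srun command in a real job script.
# For illustration it fails (non-zero exit code) for subjob 3 only.
run_subjob() {
    [ "$1" -ne 3 ]
}

# Launch each subjob in the background and remember "pid:index" pairs
pids=""
for i in $(seq 1 4)
do
    run_subjob "$i" &
    pids="$pids $!:$i"
done

# Wait on each subjob in turn; wait returns that subjob's exit code
failed=0
for entry in $pids
do
    pid=${entry%%:*}
    idx=${entry##*:}
    if ! wait "$pid"; then
        echo "subjob $idx failed"
        failed=$((failed + 1))
    fi
done

echo "$failed subjob(s) failed"
```

In a real job script the count of failed subjobs could be used to set the exit code of the script, which in turn affects afterok and afternotok dependencies on the job.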
"},{"location":"user-guide/scheduler/#example-2-8-subjobs-on-1-node-each-with-8-mpi-processes-and-2-openmp-threads-per-process","title":"Example 2: 8 subjobs on 1 node each with 8 MPI processes and 2 OpenMP threads per process","text":"For our second example, we will run 8 subjobs, each running the xthi
program (which prints process/thread placement) across 1 node. Each subjob will use 8 MPI processes and 2 OpenMP threads per process. The job submission script for this example would look like:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiParallelOnCompute\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=64\n#SBATCH --cpus-per-task=2\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to 2 as required by all subjobs\nexport OMP_NUM_THREADS=2\n\n# Loop over 8 subjobs\nfor i in $(seq 1 8)\ndo\n echo $i\n # Launch subjob overriding job settings as required and in the background\n # Make sure to change the amount specified by the `--mem=` flag to the amount\n # of memory required. The amount of memory is given in MiB by default but other\n # units can be specified. If you do not know how much memory to specify, we\n # recommend that you specify `--mem=12500M` (12,500 MiB).\n srun --nodes=1 --ntasks=8 --ntasks-per-node=8 --cpus-per-task=2 \\\n --exact --mem=12500M xthi > placement${i}.txt &\ndone\n\n# Wait for all subjobs to finish\nwait\n
"},{"location":"user-guide/scheduler/#example-3-running-inhomogeneous-subjobs-on-one-node","title":"Example 3: Running inhomogeneous subjobs on one node","text":"For our third example, we will run 4 subjobs, each running the xthi
program (which prints process/thread placement) across 1 node. Our subjobs will each run with a different number of MPI processes and OpenMP threads. We will run: one job with 64 MPI processes and 1 OpenMP thread per process; one job with 16 MPI processes and 2 OpenMP threads per process; one job with 4 MPI processes and 4 OpenMP threads per process; and one job with 1 MPI process and 16 OpenMP threads per process.
To be able to change the number of MPI processes and OpenMP threads per process, we will need to forgo using the #SBATCH --ntasks-per-node
and the #SBATCH --cpus-per-task
options. If you set these, Slurm will not let you alter the OMP_NUM_THREADS
variable and you will not be able to change the number of OpenMP threads per process between each job.
Before each srun
command, you will need to define the number of OpenMP threads per process you want by changing the OMP_NUM_THREADS
variable. Furthermore, for each srun
command, you will need to set the --ntasks
flag to equal the number of MPI processes you want to use. You will also need to set the --cpus-per-task
flag to equal the number of OpenMP threads per process you want to use.
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiParallelOnCompute\n#SBATCH --time=0:10:0\n#SBATCH --nodes=1\n#SBATCH --hint=nomultithread\n#SBATCH --distribution=block:block\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to the value required by the first job\nexport OMP_NUM_THREADS=1\nsrun --ntasks=64 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Set the number of threads to the value required by the second job\nexport OMP_NUM_THREADS=2\nsrun --ntasks=16 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Set the number of threads to the value required by the third job\nexport OMP_NUM_THREADS=4\nsrun --ntasks=4 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Set the number of threads to the value required by the fourth job\nexport OMP_NUM_THREADS=16\nsrun --ntasks=1 --cpus-per-task=${OMP_NUM_THREADS} \\\n --exact --mem=12500M xthi > placement${OMP_NUM_THREADS}.txt &\n\n# Wait for all subjobs to finish\nwait\n
"},{"location":"user-guide/scheduler/#example-4-256-serial-tasks-running-across-two-nodes","title":"Example 4: 256 serial tasks running across two nodes","text":"For our fourth example, we will run 256 single-core copies of the xthi
program (which prints process/thread placement) across two ARCHER2 compute nodes with each copy of xthi
pinned to a different core. We will illustrate a mechanism for getting the node IDs to pass to srun
as this is required to ensure that the individual subjobs are assigned to the correct node. This mechanism uses the scontrol
command to turn the nodelist from sbatch
into a format we can use as input to srun
. The job submission script for this example would look like:
#!/bin/bash\n# Slurm job options (job-name, compute nodes, job time)\n#SBATCH --job-name=MultiSerialOnComputes\n#SBATCH --time=0:10:0\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=128\n#SBATCH --cpus-per-task=1\n\n# Replace [budget code] below with your budget code (e.g. t01)\n#SBATCH --account=[budget code]\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# Make xthi available\nmodule load xthi\n\n# Set the number of threads to 1\n# This prevents any threaded system libraries from automatically\n# using threading.\nexport OMP_NUM_THREADS=1\n\n# Propagate the cpus-per-task setting from script to srun commands\n# By default, Slurm does not propagate this setting from the sbatch\n# options to srun commands in the job script. If this is not done,\n# process/thread pinning may be incorrect leading to poor performance\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Get a list of the nodes assigned to this job in a format we can use.\n# scontrol converts the condensed node IDs in the sbatch environment\n# variable into a list of full node IDs that we can use with srun to\n# ensure the subjobs are placed on the correct node. e.g. this converts\n# \"nid[001234,002345]\" to \"nid001234 nid002345\"\nnodelist=$(scontrol show hostnames $SLURM_JOB_NODELIST)\n\n# Loop over the nodes assigned to the job\nfor nodeid in $nodelist\ndo\n # Loop over 128 subjobs on each node pinning each to a different core\n for i in $(seq 1 128)\n do\n # Launch subjob overriding job settings as required and in the background\n # Make sure to change the amount specified by the `--mem=` flag to the amount\n # of memory required. The amount of memory is given in MiB by default but other\n # units can be specified. 
If you do not know how much memory to specify, we\n # recommend that you specify `--mem=1500M` (1,500 MiB).\n srun --nodelist=${nodeid} --nodes=1 --ntasks=1 --ntasks-per-node=1 \\\n --exact --mem=1500M xthi > placement_${nodeid}_${i}.txt &\n done\ndone\n\n# Wait for all subjobs to finish\nwait\n
"},{"location":"user-guide/scheduler/#process-placement","title":"Process placement","text":"There are many occasions where you may want to control (usually, MPI) process placement and change it from the default, for example:
There are a number of different methods for defining process placement. Below we cover two of them: using Slurm options and using the MPICH_RANK_REORDER_METHOD
environment variable. Most users will likely use the Slurm options approach.
The standard approach recommended on ARCHER2 is to place processes sequentially on nodes until the maximum number of tasks is reached. You can use the xthi
program to verify this for MPI process placement:
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=128 \\\n --cpus-per-task=1 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170365\nsalloc: job 1170365 queued and waiting for resources\nsalloc: job 1170365 has been allocated resources\nsalloc: Granted job allocation 1170365\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002526-002527] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\nauser@ln04:/work/t01/t01/auser> srun --distribution=block:block --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002526, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002527, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\n\n...output trimmed...\n\nNode 0, rank 124, thread 0, (affinity = 124)\nNode 0, rank 125, thread 0, (affinity = 125)\nNode 0, rank 126, thread 0, (affinity = 126)\nNode 0, rank 127, thread 0, (affinity = 127)\nNode 1, rank 128, thread 0, (affinity = 0)\nNode 1, rank 129, thread 0, (affinity = 1)\nNode 1, rank 130, thread 0, (affinity = 2)\nNode 1, rank 131, thread 0, (affinity = 3)\n\n...output trimmed...\n
Note
For MPI programs on ARCHER2, each rank corresponds to a process.
Important
To get good performance out of MPI collective operations, MPI processes should be placed sequentially on cores as in the standard placement described above.
"},{"location":"user-guide/scheduler/#setting-process-placement-using-slurm-options","title":"Setting process placement using Slurm options","text":""},{"location":"user-guide/scheduler/#for-underpopulation-of-nodes-with-processes","title":"For underpopulation of nodes with processes","text":"When you are using fewer processes than cores on compute nodes (i.e. < 128 processes per node) the basic Slurm options (usually supplied in your script as options to sbatch
) for process placement are:
--ntasks-per-node=X
Place X processes on each node--cpus-per-task=Y
Set a stride of Y cores between each placed process. If you specify this option in a job submission script (queued using sbatch
) or via salloc
then you will also need to set export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
to ensure the setting is passed to srun
commands in the script or allocation.In addition, the following options are added to your srun
commands in your job submission script:
--hint=nomultithread
Only use physical cores (avoids use of SMT/hyperthreads)--distribution=block:block
Allocate processes to cores in a sequential fashionFor example, to place 32 processes per node and have 1 process per 4-core block (corresponding to a CCX, Core CompleX, that shares an L3 cache), you would set:
--ntasks-per-node=32
Place 32 processes on each node--cpus-per-task=4
Set a stride of 4 cores between each placed processHere is the output from xthi
:
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=32 \\\n --cpus-per-task=4 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170383\nsalloc: job 1170383 queued and waiting for resources\nsalloc: job 1170383 has been allocated resources\nsalloc: Granted job allocation 1170383\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002526-002527] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\nauser@ln04:/work/t01/t01/auser> srun --distribution=block:block --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002526, mpi 32, omp 1, executable xthi\nNode 1, hostname nid002527, mpi 32, omp 1, executable xthi\nMPI summary: 64 ranks\nNode 0, rank 0, thread 0, (affinity = 0-3)\nNode 0, rank 1, thread 0, (affinity = 4-7)\nNode 0, rank 2, thread 0, (affinity = 8-11)\nNode 0, rank 3, thread 0, (affinity = 12-15)\nNode 0, rank 4, thread 0, (affinity = 16-19)\nNode 0, rank 5, thread 0, (affinity = 20-23)\nNode 0, rank 6, thread 0, (affinity = 24-27)\nNode 0, rank 7, thread 0, (affinity = 28-31)\nNode 0, rank 8, thread 0, (affinity = 32-35)\nNode 0, rank 9, thread 0, (affinity = 36-39)\nNode 0, rank 10, thread 0, (affinity = 40-43)\nNode 0, rank 11, thread 0, (affinity = 44-47)\nNode 0, rank 12, thread 0, (affinity = 48-51)\nNode 0, rank 13, thread 0, (affinity = 52-55)\nNode 0, rank 14, thread 0, (affinity = 56-59)\nNode 0, rank 15, thread 0, (affinity = 60-63)\nNode 0, rank 16, thread 0, (affinity = 64-67)\nNode 0, rank 17, thread 0, (affinity = 68-71)\nNode 0, rank 18, thread 0, (affinity = 72-75)\nNode 0, rank 19, thread 0, (affinity = 76-79)\nNode 0, rank 20, thread 0, (affinity = 80-83)\nNode 0, rank 21, thread 0, (affinity = 84-87)\nNode 0, rank 22, thread 0, (affinity = 88-91)\nNode 0, rank 23, thread 0, 
(affinity = 92-95)\nNode 0, rank 24, thread 0, (affinity = 96-99)\nNode 0, rank 25, thread 0, (affinity = 100-103)\nNode 0, rank 26, thread 0, (affinity = 104-107)\nNode 0, rank 27, thread 0, (affinity = 108-111)\nNode 0, rank 28, thread 0, (affinity = 112-115)\nNode 0, rank 29, thread 0, (affinity = 116-119)\nNode 0, rank 30, thread 0, (affinity = 120-123)\nNode 0, rank 31, thread 0, (affinity = 124-127)\nNode 1, rank 32, thread 0, (affinity = 0-3)\nNode 1, rank 33, thread 0, (affinity = 4-7)\nNode 1, rank 34, thread 0, (affinity = 8-11)\nNode 1, rank 35, thread 0, (affinity = 12-15)\nNode 1, rank 36, thread 0, (affinity = 16-19)\nNode 1, rank 37, thread 0, (affinity = 20-23)\nNode 1, rank 38, thread 0, (affinity = 24-27)\nNode 1, rank 39, thread 0, (affinity = 28-31)\nNode 1, rank 40, thread 0, (affinity = 32-35)\nNode 1, rank 41, thread 0, (affinity = 36-39)\nNode 1, rank 42, thread 0, (affinity = 40-43)\nNode 1, rank 43, thread 0, (affinity = 44-47)\nNode 1, rank 44, thread 0, (affinity = 48-51)\nNode 1, rank 45, thread 0, (affinity = 52-55)\nNode 1, rank 46, thread 0, (affinity = 56-59)\nNode 1, rank 47, thread 0, (affinity = 60-63)\nNode 1, rank 48, thread 0, (affinity = 64-67)\nNode 1, rank 49, thread 0, (affinity = 68-71)\nNode 1, rank 50, thread 0, (affinity = 72-75)\nNode 1, rank 51, thread 0, (affinity = 76-79)\nNode 1, rank 52, thread 0, (affinity = 80-83)\nNode 1, rank 53, thread 0, (affinity = 84-87)\nNode 1, rank 54, thread 0, (affinity = 88-91)\nNode 1, rank 55, thread 0, (affinity = 92-95)\nNode 1, rank 56, thread 0, (affinity = 96-99)\nNode 1, rank 57, thread 0, (affinity = 100-103)\nNode 1, rank 58, thread 0, (affinity = 104-107)\nNode 1, rank 59, thread 0, (affinity = 108-111)\nNode 1, rank 60, thread 0, (affinity = 112-115)\nNode 1, rank 61, thread 0, (affinity = 116-119)\nNode 1, rank 62, thread 0, (affinity = 120-123)\nNode 1, rank 63, thread 0, (affinity = 124-127)\n
Tip
You usually only want to use physical cores on ARCHER2, so (ntasks-per-node
) \u00d7 (cpus-per-task
) should generally be equal to 128.
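As a worked example of this rule, the sketch below computes the --cpus-per-task stride that fills a 128-core node for a given number of tasks per node. The function name cpus_per_task_for is hypothetical and the list of task counts is only illustrative.

```shell
#!/bin/bash
# Number of physical cores on an ARCHER2 compute node
cores_per_node=128

# Print the stride (cpus-per-task) that makes
# ntasks-per-node x cpus-per-task equal to cores_per_node.
# Returns non-zero if the task count does not divide the core count.
cpus_per_task_for() {
    local ntasks_per_node=$1
    if [ $((cores_per_node % ntasks_per_node)) -ne 0 ]; then
        echo "error: $ntasks_per_node does not divide $cores_per_node" >&2
        return 1
    fi
    echo $((cores_per_node / ntasks_per_node))
}

for n in 128 64 32 16 8
do
    echo "--ntasks-per-node=$n --cpus-per-task=$(cpus_per_task_for "$n")"
done
```

For 32 tasks per node this gives a stride of 4, matching the 4-core block example shown above.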
If you want to change the order processes are placed on nodes and cores using Slurm options then you should use the --distribution
option to srun
to change this.
For example, to place processes sequentially on nodes but round-robin on the 16-core NUMA regions in a single node, you would use the --distribution=block:cyclic
option to srun
. This type of process placement can be beneficial when a code is memory bound.
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=128 \\\n --cpus-per-task=1 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170594\nsalloc: job 1170594 queued and waiting for resources\nsalloc: job 1170594 has been allocated resources\nsalloc: Granted job allocation 1170594\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002616,002621] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> srun --distribution=block:cyclic --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002616, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002621, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 16)\nNode 0, rank 2, thread 0, (affinity = 32)\nNode 0, rank 3, thread 0, (affinity = 48)\nNode 0, rank 4, thread 0, (affinity = 64)\nNode 0, rank 5, thread 0, (affinity = 80)\nNode 0, rank 6, thread 0, (affinity = 96)\nNode 0, rank 7, thread 0, (affinity = 112)\nNode 0, rank 8, thread 0, (affinity = 1)\nNode 0, rank 9, thread 0, (affinity = 17)\nNode 0, rank 10, thread 0, (affinity = 33)\nNode 0, rank 11, thread 0, (affinity = 49)\nNode 0, rank 12, thread 0, (affinity = 65)\nNode 0, rank 13, thread 0, (affinity = 81)\nNode 0, rank 14, thread 0, (affinity = 97)\nNode 0, rank 15, thread 0, (affinity = 113)\n\n...output trimmed...\n\nNode 0, rank 120, thread 0, (affinity = 15)\nNode 0, rank 121, thread 0, (affinity = 31)\nNode 0, rank 122, thread 0, (affinity = 47)\nNode 0, rank 123, thread 0, (affinity = 63)\nNode 0, rank 124, thread 0, (affinity = 79)\nNode 0, rank 125, thread 0, (affinity = 95)\nNode 0, rank 126, thread 0, (affinity = 111)\nNode 0, rank 127, thread 0, (affinity = 127)\nNode 1, rank 128, thread 0, (affinity = 0)\nNode 1, rank 129, thread 0, (affinity = 
16)\nNode 1, rank 130, thread 0, (affinity = 32)\nNode 1, rank 131, thread 0, (affinity = 48)\nNode 1, rank 132, thread 0, (affinity = 64)\nNode 1, rank 133, thread 0, (affinity = 80)\nNode 1, rank 134, thread 0, (affinity = 96)\nNode 1, rank 135, thread 0, (affinity = 112)\n\n...output trimmed...\n
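The pattern in the block:cyclic output above can be reproduced with a little arithmetic. The sketch below assumes 8 NUMA regions of 16 cores per node (8 × 16 = 128) and uses the position of a rank within its node ("local rank"); the cyclic_core function is a hypothetical helper written to illustrate the mapping, not something provided by Slurm.

```shell
#!/bin/bash
# Map a local rank (rank index within its node) to a core ID under
# cyclic placement over NUMA regions.
# Assumes 8 NUMA regions of 16 cores each, as on a 128-core node.
cyclic_core() {
    local rank=$1
    local regions=8
    local cores_per_region=16
    # Region is chosen round-robin; the offset within the region
    # increases by one each time all regions have been visited.
    echo $(( (rank % regions) * cores_per_region + rank / regions ))
}

for r in 0 1 8 9 120 127
do
    echo "local rank $r -> core $(cyclic_core "$r")"
done
```

This reproduces the affinities listed above for Node 0: local ranks 0, 1, 8, 9, 120 and 127 map to cores 0, 16, 1, 17, 15 and 127 respectively. On Node 1, rank 128 is local rank 0 and so on.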
If you wish to place processes round-robin on both nodes and 16-core regions (cores that share access to a single DRAM memory controller) within a node, you would use --distribution=cyclic:cyclic
:
auser@ln04:/work/t01/t01/auser> salloc --nodes=2 --ntasks-per-node=128 \\\n --cpus-per-task=1 --time=0:10:0 --partition=standard --qos=short \\\n --account=[your account]\n\nsalloc: Pending job allocation 1170594\nsalloc: job 1170594 queued and waiting for resources\nsalloc: job 1170594 has been allocated resources\nsalloc: Granted job allocation 1170594\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid[002616,002621] are ready for job\n\nauser@ln04:/work/t01/t01/auser> module load xthi\nauser@ln04:/work/t01/t01/auser> export OMP_NUM_THREADS=1\nauser@ln04:/work/t01/t01/auser> export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\nauser@ln04:/work/t01/t01/auser> srun --distribution=cyclic:cyclic --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002616, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002621, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 2, thread 0, (affinity = 16)\nNode 0, rank 4, thread 0, (affinity = 32)\nNode 0, rank 6, thread 0, (affinity = 48)\nNode 0, rank 8, thread 0, (affinity = 64)\nNode 0, rank 10, thread 0, (affinity = 80)\nNode 0, rank 12, thread 0, (affinity = 96)\nNode 0, rank 14, thread 0, (affinity = 112)\nNode 0, rank 16, thread 0, (affinity = 1)\nNode 0, rank 18, thread 0, (affinity = 17)\nNode 0, rank 20, thread 0, (affinity = 33)\nNode 0, rank 22, thread 0, (affinity = 49)\nNode 0, rank 24, thread 0, (affinity = 65)\nNode 0, rank 26, thread 0, (affinity = 81)\nNode 0, rank 28, thread 0, (affinity = 97)\nNode 0, rank 30, thread 0, (affinity = 113)\n\n...output trimmed...\n\nNode 1, rank 1, thread 0, (affinity = 0)\nNode 1, rank 3, thread 0, (affinity = 16)\nNode 1, rank 5, thread 0, (affinity = 32)\nNode 1, rank 7, thread 0, (affinity = 48)\nNode 1, rank 9, thread 0, (affinity = 64)\nNode 1, rank 11, thread 0, (affinity = 80)\nNode 1, rank 13, thread 0, (affinity = 96)\nNode 1, rank 15, thread 0, (affinity = 112)\nNode 1, rank 
17, thread 0, (affinity = 1)\nNode 1, rank 19, thread 0, (affinity = 17)\nNode 1, rank 21, thread 0, (affinity = 33)\nNode 1, rank 23, thread 0, (affinity = 49)\nNode 1, rank 25, thread 0, (affinity = 65)\nNode 1, rank 27, thread 0, (affinity = 81)\nNode 1, rank 29, thread 0, (affinity = 97)\nNode 1, rank 31, thread 0, (affinity = 113)\n\n...output trimmed...\n
Remember, MPI collective performance is generally much worse if processes are not placed sequentially on a node (so adjacent MPI ranks are as close to each other as possible). This is the reason that the default recommended placement on ARCHER2 is sequential rather than round-robin.
"},{"location":"user-guide/scheduler/#mpich_rank_reorder_method-for-mpi-process-placement","title":"MPICH_RANK_REORDER_METHOD
for MPI process placement","text":"The MPICH_RANK_REORDER_METHOD
environment variable can also be used to specify other types of MPI task placement. For example, setting it to \"0\" results in a round-robin placement on both nodes and NUMA regions in a node (equivalent to the --distribution=cyclic:cyclic
option to srun
). Note, we do not specify the --distribution
option to srun
in this case as the environment variable is controlling placement:
salloc --nodes=2 --ntasks-per-node=128 --cpus-per-task=1 --time=0:10:0 --account=t01\n\nsalloc: Granted job allocation 24236\nsalloc: Waiting for resource configuration\nsalloc: Nodes cn13 are ready for job\n\nmodule load xthi\nexport OMP_NUM_THREADS=1\nexport MPICH_RANK_REORDER_METHOD=0\nsrun --hint=nomultithread xthi\n\nNode summary for 2 nodes:\nNode 0, hostname nid002616, mpi 128, omp 1, executable xthi\nNode 1, hostname nid002621, mpi 128, omp 1, executable xthi\nMPI summary: 256 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 2, thread 0, (affinity = 16)\nNode 0, rank 4, thread 0, (affinity = 32)\nNode 0, rank 6, thread 0, (affinity = 48)\nNode 0, rank 8, thread 0, (affinity = 64)\nNode 0, rank 10, thread 0, (affinity = 80)\nNode 0, rank 12, thread 0, (affinity = 96)\nNode 0, rank 14, thread 0, (affinity = 112)\nNode 0, rank 16, thread 0, (affinity = 1)\nNode 0, rank 18, thread 0, (affinity = 17)\nNode 0, rank 20, thread 0, (affinity = 33)\nNode 0, rank 22, thread 0, (affinity = 49)\nNode 0, rank 24, thread 0, (affinity = 65)\nNode 0, rank 26, thread 0, (affinity = 81)\nNode 0, rank 28, thread 0, (affinity = 97)\nNode 0, rank 30, thread 0, (affinity = 113)\n\n...output trimmed...\n
There are other modes available with the MPICH_RANK_REORDER_METHOD
environment variable, including one which lets the user provide a file called MPICH_RANK_ORDER
which contains a list of each task's placement on each node. These options are described in detail in the intro_mpi
man page.
For MPI applications which perform a large amount of nearest-neighbor communication, e.g., stencil-based applications on structured grids, HPE provide a tool in the perftools-base
module (loaded by default for all users) called grid_order
which can generate an MPICH_RANK_ORDER
file automatically by taking as parameters the dimensions of the grid, core count, etc. For example, to place 256 MPI tasks in row-major order on a Cartesian grid of size $(8, 8, 4)$, using 128 cores per node:
grid_order -R -c 128 -g 8,8,4\n\n# grid_order -R -Z -c 128 -g 8,8,4\n# Region 3: 0,0,1 (0..255)\n0,1,2,3,32,33,34,35,64,65,66,67,96,97,98,99,128,129,130,131,160,161,162,163,192,193,194,195,224,225,226,227,4,5,6,7,36,37,38,39,68,69,70,71,100,101,102,103,132,133,134,135,164,165,166,167,196,197,198,199,228,229,230,231,8,9,10,11,40,41,42,43,72,73,74,75,104,105,106,107,136,137,138,139,168,169,170,171,200,201,202,203,232,233,234,235,12,13,14,15,44,45,46,47,76,77,78,79,108,109,110,111,140,141,142,143,172,173,174,175,204,205,206,207,236,237,238,239\n16,17,18,19,48,49,50,51,80,81,82,83,112,113,114,115,144,145,146,147,176,177,178,179,208,209,210,211,240,241,242,243,20,21,22,23,52,53,54,55,84,85,86,87,116,117,118,119,148,149,150,151,180,181,182,183,212,213,214,215,244,245,246,247,24,25,26,27,56,57,58,59,88,89,90,91,120,121,122,123,152,153,154,155,184,185,186,187,216,217,218,219,248,249,250,251,28,29,30,31,60,61,62,63,92,93,94,95,124,125,126,127,156,157,158,159,188,189,190,191,220,221,222,223,252,253,254,255\n
One can then save this output to a file called MPICH_RANK_ORDER
and then set MPICH_RANK_REORDER_METHOD=3
before running the job, which tells Cray MPI to read the MPICH_RANK_ORDER
file to set the MPI task placement. For more information, please see the man page man grid_order
.
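If the MPICH_RANK_ORDER file is edited by hand, it can be worth checking that it still lists every rank exactly once before launching the job. The sketch below is not part of the HPE tooling: check_rank_order is a hypothetical helper name, and it assumes the plain comma-separated format shown in the grid_order output above (comment lines starting with #, no rank ranges).

```bash
# Hypothetical helper (not an HPE or Slurm tool): succeed only if the file
# given as $1 lists each rank from 0 to $2 - 1 exactly once.
# Assumes comma-separated rank lists and comment lines starting with '#'.
check_rank_order() {
    awk -F, -v n=$2 '
        /^#/ { next }    # skip the comment lines that grid_order emits
        {
            for (i = 1; i <= NF; i++)
                if (length($i) > 0) { count[$i + 0]++; total++ }
        }
        END {
            if (total != n) exit 1
            for (r = 0; r < n; r++)
                if (count[r] != 1) exit 1
        }
    ' $1
}

# Example use, after saving the grid_order output shown above:
#   check_rank_order MPICH_RANK_ORDER 256 && echo rank order looks complete
```

The function only reports success or failure through its exit status, so it can be used directly in an if statement in a job script.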
salloc
to reserve resources","text":"When you are developing or debugging code you often want to run many short jobs with only small edits to the code between runs. This can be achieved by using the login nodes to run MPI but you may want to test on the compute nodes (e.g. you may want to test running on multiple nodes across the high performance interconnect). One of the best ways to achieve this on ARCHER2 is to use interactive jobs.
An interactive job allows you to issue srun
commands directly from the command line without using a job submission script, and to see the output from your program directly in the terminal.
You use the salloc
command to reserve compute nodes for interactive jobs.
To submit a request for an interactive job reserving 8 nodes (1024 physical cores) for 20 minutes on the short QoS you would issue the following command from the command line:
auser@ln01:> salloc --nodes=8 --ntasks-per-node=128 --cpus-per-task=1 \\\n --time=00:20:00 --partition=standard --qos=short \\\n --account=[budget code]\n
When you submit this job your terminal will display something like:
salloc: Granted job allocation 24236\nsalloc: Waiting for resource configuration\nsalloc: Nodes nid000002 are ready for job\nauser@ln01:>\n
It may take some time for your interactive job to start. Once it runs you will enter a standard interactive terminal session (a new shell). Note that this shell is still on the front end (the prompt has not changed). Whilst the interactive session lasts you will be able to run parallel jobs on the compute nodes by issuing the srun --distribution=block:block --hint=nomultithread
command directly at your command prompt using the same syntax as you would inside a job script. The maximum number of nodes you can use is limited by the resources requested in the salloc
command.
Important
If you wish the cpus-per-task
option to salloc
to propagate to srun
commands in the allocation, you will need to use the command export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
before you issue any srun
commands.
If you know you will be doing a lot of intensive debugging you may find it useful to request an interactive session lasting the expected length of your working session, say a full day.
Your session will end when you hit the requested walltime. If you wish to finish before this you should use the exit
command, which returns you to the shell you were using before you issued the salloc
command.
srun
directly","text":"A second way to run an interactive job is to use srun
directly in the following way (here using the short
QoS):
auser@ln01:/work/t01/t01/auser> srun --nodes=1 --exclusive --time=00:20:00 \\\n --partition=standard --qos=short --account=[budget code] \\\n --pty /bin/bash\nauser@nid001261:/work/t01/t01/auser> hostname\nnid001261\n
The --pty /bin/bash
will cause a new shell to be started on the first node of a new allocation. This is perhaps closer to what many people consider an 'interactive' job than the method using salloc
appears.
One can now issue shell commands in the usual way. A further invocation of srun
is required to launch a parallel job in the allocation.
Note
When using srun
within an interactive srun
session, you will need to include both the --overlap
and --oversubscribe
flags, and specify the number of cores you want to use:
auser@nid001261:/work/t01/t01/auser> srun --overlap --oversubscribe --distribution=block:block \\\n --hint=nomultithread --ntasks=128 ./my_mpi_executable.x\n
Without --overlap
the second srun
will block until the first one has completed. Since your interactive session was launched with srun
this means it will never actually start -- you will get repeated warnings that \"Requested nodes are busy\".
When finished, type exit
to relinquish the allocation and control will be returned to the front end.
By default, the interactive shell will retain the environment of the parent. If you want a clean shell, remember to specify --export=none
.
Most of the Slurm submissions discussed above involve running a single executable. However, there are situations where two or more distinct executables are coupled and need to be run at the same time, potentially using the same MPI communicator. This is most easily handled via the Slurm heterogeneous job mechanism.
Two common cases are discussed below: first, a client-server model in which client and server each have a different MPI_COMM_WORLD
, and second, the case where two or more executables share MPI_COMM_WORLD
.
MPI_COMM_WORLDs
","text":"The essential feature of a heterogeneous job here is to create a single batch submission which specifies the resource requirements for the individual components. Schematically, we would use
#!/bin/bash\n\n# Slurm specifications for the first component\n\n#SBATCH --partition=standard\n\n...\n\n#SBATCH hetjob\n\n# Slurm specifications for the second component\n\n#SBATCH --partition=standard\n\n...\n
where each new component beyond the first is introduced by the special token #SBATCH hetjob
(note this is not a normal option and is not --hetjob
). Each component must specify a partition. Such a job will appear in the scheduler as, e.g.,
50098+0 standard qscript- user PD 0:00 1 (None)\n 50098+1 standard qscript- user PD 0:00 2 (None)\n
and counts as (in this case) two separate jobs from the point of view of QoS limits. Consider a case where we have two executables which may both be parallel (in that they use MPI), both run at the same time, and communicate with each other using MPI or by some other means. In the following example, we run two different executables, xthi-a
and xthi-b
, both of which must finish before the job completes.
#!/bin/bash\n\n#SBATCH --time=00:20:00\n#SBATCH --exclusive\n#SBATCH --export=none\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=8\n\n#SBATCH hetjob\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=4\n\n# Run two executables with separate MPI_COMM_WORLD\n\nsrun --distribution=block:block --hint=nomultithread --het-group=0 ./xthi-a &\nsrun --distribution=block:block --hint=nomultithread --het-group=1 ./xthi-b &\nwait\n
In this case, each executable is launched with a separate call to srun
but specifies a different heterogeneous group via the --het-group
option. The first group is --het-group=0
. Both are run in the background with &
and the wait
is required to ensure both executables have completed before the job submission exits. The above is a rather artificial example using two executables which are in fact just symbolic links in the job directory to xthi
, used without loading the module. You can test this script yourself by creating symbolic links to the original executable before submitting the job:
auser@ln04:/work/t01/t01/auser/job-dir> module load xthi\nauser@ln04:/work/t01/t01/auser/job-dir> which xthi\n/work/y07/shared/utils/core/xthi/1.2/CRAYCLANG/11.0/bin/xthi\nauser@ln04:/work/t01/t01/auser/job-dir> ln -s /work/y07/shared/utils/core/xthi/1.2/CRAYCLANG/11.0/bin/xthi xthi-a\nauser@ln04:/work/t01/t01/auser/job-dir> ln -s /work/y07/shared/utils/core/xthi/1.2/CRAYCLANG/11.0/bin/xthi xthi-b\n
The example job will produce two reports showing the placement of the MPI tasks from the two instances of xthi
running in each of the heterogeneous groups. For example, the output might be
Node summary for 1 nodes:\nNode 0, hostname nid002400, mpi 8, omp 1, executable xthi-a\nMPI summary: 8 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode summary for 2 nodes:\nNode 0, hostname nid002146, mpi 4, omp 1, executable xthi-b\nNode 1, hostname nid002149, mpi 4, omp 1, executable xthi-b\nMPI summary: 8 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 1, rank 4, thread 0, (affinity = 0)\nNode 1, rank 5, thread 0, (affinity = 1)\nNode 1, rank 6, thread 0, (affinity = 2)\nNode 1, rank 7, thread 0, (affinity = 3)\n
Here we have the first executable running on one node with a communicator size 8 (ranks 0-7). The second executable runs on two nodes also with communicator size 8 (ranks 0-7, 4 ranks per node). Further examples of placement for heterogeneous jobs are given below. Finally, if your workflow requires the different heterogeneous jobs to communicate via MPI, but without sharing their MPI_COMM_WORLD
, you will need to export two new variables before your srun
commands as defined below:
export PMI_UNIVERSE_SIZE=3\nexport MPICH_SINGLE_HOST_ENABLED=0\n
"},{"location":"user-guide/scheduler/#heterogeneous-jobs-for-a-shared-mpi_com_world","title":"Heterogeneous jobs for a shared MPI_COMM_WORLD
","text":"Note
The directive SBATCH hetjob
can no longer be used for jobs requiring a shared MPI_COMM_WORLD
Note
In this approach, each hetjob
component must be on its own set of nodes. You cannot use this approach to place different hetjob
components on the same node.
If two or more heterogeneous components need to share a unique MPI_COMM_WORLD
, a single srun
invocation with the different components separated by a colon :
should be used. Arguments to the individual components of the srun
control the placement of the tasks and threads for each component. For example, running the same xthi-a
and xthi-b
executables as above but now in a shared communicator, we might run:
#!/bin/bash\n\n#SBATCH --time=00:20:00\n#SBATCH --export=none\n#SBATCH --account=[...]\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n# We must specify correctly the total number of nodes required.\n#SBATCH --nodes=3\n\nSHARED_ARGS=\"--distribution=block:block --hint=nomultithread\"\n\nsrun --het-group=0 --nodes=1 --ntasks-per-node=8 ${SHARED_ARGS} ./xthi-a : \\\n --het-group=1 --nodes=2 --ntasks-per-node=4 ${SHARED_ARGS} ./xthi-b\n
The output should confirm we have a single MPI_COMM_WORLD
with a total of three nodes, xthi-a
running on one and xthi-b
on two, with ranks 0-15 extending across both executables.
Node summary for 3 nodes:\nNode 0, hostname nid002668, mpi 8, omp 1, executable xthi-a\nNode 1, hostname nid002669, mpi 4, omp 1, executable xthi-b\nNode 2, hostname nid002670, mpi 4, omp 1, executable xthi-b\nMPI summary: 16 ranks\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 1, thread 0, (affinity = 1)\nNode 0, rank 2, thread 0, (affinity = 2)\nNode 0, rank 3, thread 0, (affinity = 3)\nNode 0, rank 4, thread 0, (affinity = 4)\nNode 0, rank 5, thread 0, (affinity = 5)\nNode 0, rank 6, thread 0, (affinity = 6)\nNode 0, rank 7, thread 0, (affinity = 7)\nNode 1, rank 8, thread 0, (affinity = 0)\nNode 1, rank 9, thread 0, (affinity = 1)\nNode 1, rank 10, thread 0, (affinity = 2)\nNode 1, rank 11, thread 0, (affinity = 3)\nNode 2, rank 12, thread 0, (affinity = 0)\nNode 2, rank 13, thread 0, (affinity = 1)\nNode 2, rank 14, thread 0, (affinity = 2)\nNode 2, rank 15, thread 0, (affinity = 3)\n
"},{"location":"user-guide/scheduler/#heterogeneous-placement-for-mixed-mpiopenmp-work","title":"Heterogeneous placement for mixed MPI/OpenMP work","text":"Some care may be required for placement of tasks/threads in heterogeneous jobs in which the number of threads needs to be specified differently for different components.
In the following we have two components, again using xthi-a
and xthi-b
as our two separate executables. The first component runs 8 MPI tasks each with 16 OpenMP threads on one node. The second component runs 8 MPI tasks with one task per NUMA region on a second node; each task has one thread. An appropriate Slurm submission might be:
#!/bin/bash\n\n#SBATCH --time=00:20:00\n#SBATCH --export=none\n#SBATCH --account=[...]\n\n#SBATCH --partition=standard\n#SBATCH --qos=standard\n\n#SBATCH --nodes=2\n\nSHARED_ARGS=\"--distribution=block:block --hint=nomultithread \\\n --nodes=1 --ntasks-per-node=8 --cpus-per-task=16\"\n\n# Do not set OMP_NUM_THREADS in the calling environment\n\nunset OMP_NUM_THREADS\nexport OMP_PROC_BIND=spread\n\nsrun --het-group=0 ${SHARED_ARGS} --export=all,OMP_NUM_THREADS=16 ./xthi-a : \\\n --het-group=1 ${SHARED_ARGS} --export=all,OMP_NUM_THREADS=1 ./xthi-b\n
The important point here is that OMP_NUM_THREADS
must not be set in the environment that calls srun
in order that the different specifications for the separate groups via --export
on the srun
command line take effect. If OMP_NUM_THREADS
is set in the calling environment, then that value takes precedence, and each component will see the same value of OMP_NUM_THREADS
.
The output might then be:
Node 0, hostname nid001111, mpi 8, omp 16, executable xthi-a\nNode 1, hostname nid001126, mpi 8, omp 1, executable xthi-b\nNode 0, rank 0, thread 0, (affinity = 0)\nNode 0, rank 0, thread 1, (affinity = 1)\nNode 0, rank 0, thread 2, (affinity = 2)\nNode 0, rank 0, thread 3, (affinity = 3)\nNode 0, rank 0, thread 4, (affinity = 4)\nNode 0, rank 0, thread 5, (affinity = 5)\nNode 0, rank 0, thread 6, (affinity = 6)\nNode 0, rank 0, thread 7, (affinity = 7)\nNode 0, rank 0, thread 8, (affinity = 8)\nNode 0, rank 0, thread 9, (affinity = 9)\nNode 0, rank 0, thread 10, (affinity = 10)\nNode 0, rank 0, thread 11, (affinity = 11)\nNode 0, rank 0, thread 12, (affinity = 12)\nNode 0, rank 0, thread 13, (affinity = 13)\nNode 0, rank 0, thread 14, (affinity = 14)\nNode 0, rank 0, thread 15, (affinity = 15)\nNode 0, rank 1, thread 0, (affinity = 16)\nNode 0, rank 1, thread 1, (affinity = 17)\n...\nNode 0, rank 7, thread 14, (affinity = 126)\nNode 0, rank 7, thread 15, (affinity = 127)\nNode 1, rank 8, thread 0, (affinity = 0)\nNode 1, rank 9, thread 0, (affinity = 16)\nNode 1, rank 10, thread 0, (affinity = 32)\nNode 1, rank 11, thread 0, (affinity = 48)\nNode 1, rank 12, thread 0, (affinity = 64)\nNode 1, rank 13, thread 0, (affinity = 80)\nNode 1, rank 14, thread 0, (affinity = 96)\nNode 1, rank 15, thread 0, (affinity = 112)\n
Here we can see the eight MPI tasks from xthi-a
each running with sixteen OpenMP threads. Then the 8 MPI tasks with no threading from xthi-b
are spaced across the cores on the second node, one per NUMA region.
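The core numbers reported for xthi-b follow from simple arithmetic: with block placement and --cpus-per-task=16, each task on a node is given the next block of 16 cores, so task k on the node starts at core 16k. The sketch below reproduces that calculation. first_core is a hypothetical helper name, and the rule is an assumption that matches the block:block, --hint=nomultithread settings and output shown above rather than a general statement about Slurm.

```bash
# Hypothetical helper: first core assigned to a task, given its position on
# the node ($1, counted from 0) and the --cpus-per-task value ($2).
# Assumes block placement with one hardware thread per core.
first_core() {
    local local_task=$1 cpus_per_task=$2
    echo $(( local_task * cpus_per_task ))
}

# Eight tasks with 16 cores each: prints 0, 16, 32, ..., 112, one per line
for k in 0 1 2 3 4 5 6 7; do
    first_core $k 16
done
```

These values match the affinities listed for ranks 8 to 15 on the second node in the output above.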
Low priority jobs are not charged against your allocation but will only run when other, higher-priority, jobs cannot be run. Although low priority jobs are not charged, you do need a valid, positive budget to be able to submit and run low priority jobs, i.e. you need at least 1 CU in your budget.
Low priority access is always available and has the following limits:
You submit a low priority job on ARCHER2 by using the lowpriority
QoS. For example, you would usually have the following line in your job submission script sbatch options:
#SBATCH --qos=lowpriority\n
"},{"location":"user-guide/scheduler/#reservations","title":"Reservations","text":"Reservations are available on ARCHER2. These allow users to reserve a number of nodes for a specified length of time starting at a particular time on the system.
Reservations require justification. They will only be approved if the request could not be fulfilled with the normal QoSs. For instance, you may require a job or jobs to run at a particular time, e.g. for a demonstration or course.
Note
Reservation requests must be submitted at least 60 hours in advance of the reservation start time. If requesting a reservation for a Monday at 18:00, please ensure this is received by the Friday at 12:00 at the latest. The same applies over Service Holidays.
Note
Reservations are only valid for standard compute nodes; high memory compute nodes and/or PP nodes cannot be included in reservations.
Reservations will be charged at 1.5 times the usual CU rate and our policy is that they will be charged the full rate for the entire reservation at the time of booking, whether or not you use the nodes for the full time. In addition, you will not be refunded the CUs if you fail to use them due to a job issue unless this issue is due to a system failure.
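As a rough guide to what a reservation will cost, the sketch below applies the 1.5 multiplier to a node count and a duration. It assumes that the usual rate is 1 CU per node hour and that the duration is a whole number of hours; reservation_cost_cu is a hypothetical helper name, and the result should be treated as an estimate only.

```bash
# Hypothetical helper: estimated reservation charge in CU.
# $1 = number of nodes, $2 = duration in whole hours.
# Assumes a usual rate of 1 CU per node hour, charged at 1.5 times that rate.
reservation_cost_cu() {
    local nodes=$1 hours=$2
    local tenths=$(( nodes * hours * 15 ))   # cost in tenths of a CU
    echo $(( tenths / 10 )).$(( tenths % 10 ))
}

# 4 nodes for 10 hours: prints 60.0
reservation_cost_cu 4 10
```

Working in tenths of a CU keeps the calculation in integer shell arithmetic, which avoids depending on an external calculator.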
To request a reservation you complete a form on SAFE:
On the first page, you need to provide the following:
On the second page, you will need to specify which username you wish the reservation to be charged against and, once the username has been selected, the budget you want to charge the reservation to. (The selected username will be charged for the reservation but the reservation can be used by all members of the selected budget.)
Your request will be checked by the ARCHER2 User Administration team and, if approved, you will be provided a reservation ID which can be used on the system. To submit jobs to a reservation, you need to add --reservation=<reservation ID>
and --qos=reservation
options to your job submission script or command.
Important
You must have at least 1 CU in the budget to submit a job on ARCHER2, even to a pre-paid reservation.
Tip
You can submit jobs to a reservation as soon as the reservation has been set up; jobs will remain queued until the reservation starts.
"},{"location":"user-guide/scheduler/#capability-days","title":"Capability Days","text":"Important
The next Capability Days session has not been scheduled yet
ARCHER2 Capability Days are a mechanism to allow users to run large scale (512 node or more) tests on the system free of charge. The motivations behind Capability Days are:
To enable this, a period will be made available regularly where users can run jobs at large scale free of charge.
Capability Days are made up of different parts:
pre-capabilityday
QoS) to allow users to test scaling and job setup ahead of full Capability DayNERCcapability
reservation) to allow NERC users to test at large scalecapabilityday
QoS)Tip
Any jobs left in the queues when Capability Days finish will be deleted.
"},{"location":"user-guide/scheduler/#pre-capability-day-session","title":"pre-Capability Day session","text":"The pre-Capability Day session is typically available directly before the full Capability Day session and allows short test jobs to prepare for Capability Day.
Submit to the pre-capabilityday
QoS. Jobs can be submitted ahead of time and will start when the pre-Capability Day session starts.
pre-capabilityday
QoS limits:
srun
commands) within job scripts should also be a minimum of 256 nodes
#!/bin/bash\n#SBATCH --job-name=test_capability_job\n#SBATCH --nodes=256\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=1:0:0\n#SBATCH --partition=standard\n#SBATCH --qos=pre-capabilityday\n#SBATCH --account=t01\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Check process/thread placement\nmodule load xthi\nsrun --hint=multithread --distribution=block:block xthi > placement-${SLURM_JOBID}.out\n\nsrun --hint=multithread --distribution=block:block my_app.x\n
"},{"location":"user-guide/scheduler/#nerc-capability-reservation","title":"NERC Capability reservation","text":"The NERC Capability reservation is typically available directly before the full Capability Day session and allows short test jobs to prepare for Capability Day.
Submit to the NERCcapability
reservation. Jobs can be submitted ahead of time and will start when the NERC Capability reservation starts.
NERCcapability
reservation limits:
#!/bin/bash\n#SBATCH --job-name=NERC_capability_job\n#SBATCH --nodes=256\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=1:0:0\n#SBATCH --partition=standard\n#SBATCH --reservation=NERCcapability\n#SBATCH --qos=reservation\n#SBATCH --account=t01\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Check process/thread placement\nmodule load xthi\nsrun --hint=multithread --distribution=block:block xthi > placement-${SLURM_JOBID}.out\n\nsrun --hint=multithread --distribution=block:block my_app.x\n
"},{"location":"user-guide/scheduler/#capability-day-session","title":"Capability Day session","text":"The Capability Day session is typically available directly after the pre-Capability Day session.
Submit to the capabilityday
QoS. Jobs can be submitted ahead of time and will start when the Capability Day session starts.
capabilityday
QoS limits:
srun
commands) within job scripts should also be a minimum of 512 nodes
#!/bin/bash\n#SBATCH --job-name=capability_job\n#SBATCH --nodes=1024\n#SBATCH --ntasks-per-node=8\n#SBATCH --cpus-per-task=16\n#SBATCH --time=1:0:0\n#SBATCH --partition=standard\n#SBATCH --qos=capabilityday\n#SBATCH --account=t01\n\nexport OMP_NUM_THREADS=16\nexport OMP_PLACES=cores\nexport SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK\n\n# Check process/thread placement\nmodule load xthi\nsrun --hint=multithread --distribution=block:block xthi > placement-${SLURM_JOBID}.out\n\nsrun --hint=multithread --distribution=block:block my_app.x\n
"},{"location":"user-guide/scheduler/#capability-day-tips","title":"Capability Day tips","text":"You can run serial jobs on the shared data analysis nodes. More information on using the data analysis nodes (including example job submission scripts) can be found in the Data Analysis section of the User and Best Practice Guide.
"},{"location":"user-guide/scheduler/#gpu-jobs","title":"GPU jobs","text":"You can run on the ARCHER2 GPU nodes and full guidance can be found on the GPU development platform page
"},{"location":"user-guide/scheduler/#best-practices-for-job-submission","title":"Best practices for job submission","text":"This guidance is adapted from the advice provided by NERSC
"},{"location":"user-guide/scheduler/#time-limits","title":"Time Limits","text":"Due to backfill scheduling, short and variable-length jobs generally start quickly resulting in much better job throughput. You can specify a minimum time for your job with the --time-min
option to sbatch:
#SBATCH --time-min=<lower_bound>\n#SBATCH --time=<upper_bound>\n
Within your job script, you can get the time remaining in the job with squeue -h -j ${SLURM_JOBID} -o %L
to allow you to deal with potentially varying runtimes when using this option.
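The value printed by the %L format is a time string rather than a number of seconds, so a script usually has to convert it before comparing it with a threshold. The following is a minimal sketch of such a conversion. slurm_time_to_seconds is a hypothetical helper name, and it assumes the value has the form [days-][hours:]minutes:seconds; it does not handle non-numeric values such as UNLIMITED.

```bash
# Hypothetical helper: convert a Slurm time string such as 1-02:03:04,
# 02:03:04 or 03:04 into a number of seconds.
slurm_time_to_seconds() {
    local t=$1 days=0 h=0 m=0 s=0
    case $t in
        *-*) days=${t%%-*}; t=${t#*-} ;;   # split off the optional day count
    esac
    local IFS=:
    set -- $t                              # split the remainder on colons
    case $# in
        3) h=$1; m=$2; s=$3 ;;
        2) m=$1; s=$2 ;;
        1) s=$1 ;;
    esac
    # The 10# prefix stops values such as 08 being read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# 1 day, 2 hours, 3 minutes and 4 seconds: prints 93784
slurm_time_to_seconds 1-02:03:04

# Example use inside a job script (SLURM_JOBID is set by Slurm):
#   left=$(slurm_time_to_seconds $(squeue -h -j $SLURM_JOBID -o %L))
```

A job could compare the result with the time it needs to write a checkpoint and stop cleanly when too little time remains.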
Simulations which must run for a long period of time achieve the best throughput when composed of many small jobs using a checkpoint and restart method chained together (see above for how to chain jobs together). However, this method does incur a startup and shutdown overhead for each job as the state is saved and loaded, so you should experiment to find the best balance between runtime (long runtimes minimise the checkpoint/restart overheads) and throughput (short runtimes maximise throughput).
"},{"location":"user-guide/scheduler/#interconnect-locality","title":"Interconnect locality","text":"For jobs which are sensitive to interconnect (MPI) performance and utilise 128 nodes or fewer it is possible to request that all nodes are in a single Slingshot dragonfly group. The maximum number of nodes in a group on ARCHER2 is 128.
Slurm has a concept of \"switches\" which on ARCHER2 are configured to map to Slingshot electrical groups, within which all compute nodes have all-to-all electrical connections, minimising latency. Since this places an additional constraint on the scheduler, a maximum time to wait for the requested topology can be specified; after this time, the job will be placed without the constraint.
For example, to specify that all requested nodes should come from one electrical group and to wait for up to 6 hours (360 minutes) for that placement, you would use the following option in your job:
#SBATCH --switches=1@360\n
You can request multiple groups using this option if you are using more nodes than are in a single group, to maximise the number of nodes that share electrical connections in the job. For example, to request 4 groups (maximum of 512 nodes) and have this as an absolute constraint with no timeout, you would use:
#SBATCH --switches=4\n
Danger
When specifying the number of groups take care to request enough groups to satisfy the requested number of nodes. If the number is too low then an unnecessary delay will be added due to the unsatisfiable request.
A useful heuristic to ensure this is the case is to ensure that the total nodes requested is less than or equal to the number of groups multiplied by 128.
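The heuristic can be turned around to give the smallest number of groups worth requesting for a given node count. The sketch below does this by rounding the node count up to a whole number of 128-node groups; min_switch_groups is a hypothetical helper name, and the calculation only reflects the 128 nodes per group figure quoted above.

```bash
# Hypothetical helper: smallest number of Slingshot groups that can hold
# the requested node count, assuming at most 128 nodes per group.
min_switch_groups() {
    local nodes=$1
    local per_group=128
    echo $(( (nodes + per_group - 1) / per_group ))
}

# 512 nodes need at least 4 groups: prints 4
min_switch_groups 512
```

Requesting fewer groups than this value gives a constraint that can never be met, which is the situation the warning above describes.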
"},{"location":"user-guide/scheduler/#large-jobs","title":"Large Jobs","text":"Large jobs may take longer to start up. The sbcast
command is recommended for large jobs requesting over 1500 MPI tasks. By default, Slurm reads the executable on the allocated compute nodes from the location where it is installed; this may take a long time when the file system (where the executable resides) is slow or busy. With the sbcast
command, the executable can be copied to the /tmp
directory on each of the compute nodes. Since /tmp
is part of the memory on the compute nodes, it can speed up the job startup time.
sbcast --compress=none /path/to/exe /tmp/exe\nsrun /tmp/exe\n
"},{"location":"user-guide/scheduler/#huge-pages","title":"Huge pages","text":"Huge pages are virtual memory pages which are bigger than the default page size of 4 KB. Huge pages can improve memory performance for common access patterns on large data sets since they help to reduce the number of virtual to physical address translations when compared to using the default 4 KB pages.
To use huge pages for an application (with the 2 MB huge pages as an example):
module load craype-hugepages2M\ncc -o mycode.exe mycode.c\n
And also load the same huge pages module at runtime.
Warning
Due to the huge pages memory fragmentation issue, applications may get Cannot allocate memory warnings or errors when there are not enough hugepages on the compute node, such as:
libhugetlbfs [nid0000xx:xxxxx]: WARNING: New heap segment map at 0x10000000 failed: Cannot allocate memory
By default, the verbosity level of libhugetlbfs HUGETLB_VERBOSE
is set to 0
on ARCHER2 to suppress debugging messages. Users can adjust this value to obtain more information on huge pages use.
HUGETLB_RESTRICT_EXE
can be used to specify the subset of programs that use huge pages.
Important
This section covers the software environment on the initial, 4-cabinet ARCHER2 system. For documentation on the software environment on the full ARCHER2 system, please see Software environment: full system.
The software environment on ARCHER2 is primarily controlled through the module
command. By loading and switching software modules you control which software and versions are available to you.
Information
A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
By default, all users on ARCHER2 start with the default software environment loaded.
Software modules on ARCHER2 are provided by both HPE Cray (usually known as the Cray Development Environment, CDE) and by EPCC, who provide the Service Provision, and Computational Science and Engineering services.
In this section, we provide:
module
commandmodule
command manipulates your environmentmodule
command","text":"We only cover basic usage of the module
command here. For full documentation please see the Linux manual page on modules.
The module
command takes a subcommand to indicate what operation you wish to perform. Common subcommands are:
module list [name]
- List modules currently loaded in your environment, optionally filtered by [name]
module avail [name]
- List modules available, optionally filtered by [name]
module savelist
- List module collections available (usually used for accessing different programming environments)
module restore name
- Restore the module collection called name
(usually used for setting up a programming environment)
module load name
- Load the module called name
into your environment
module remove name
- Remove the module called name
from your environment
module swap old new
- Swap module new
for module old
in your environment
module help name
- Show help information on module name
module show name
- List what module name
actually does to your environment
These are described in more detail below.
"},{"location":"user-guide/sw-environment-4cab/#information-on-the-available-modules","title":"Information on the available modules","text":"The module list
command will give the names of the modules and their versions you have presently loaded in your environment:
auser@uan01:~> module list\nCurrently Loaded Modulefiles:\n1) cpe-aocc 7) cray-dsmml/0.1.2(default)\n2) aocc/2.1.0.3(default) 8) perftools-base/20.09.0(default)\n3) craype/2.7.0(default) 9) xpmem/2.2.35-7.0.1.0_1.3__gd50fabf.shasta(default)\n4) craype-x86-rome 10) cray-mpich/8.0.15(default)\n5) libfabric/1.11.0.0.233(default) 11) cray-libsci/20.08.1.2(default)\n6) craype-network-ofi\n
To find out which software modules are available on the system, use the module avail
command. To list all software modules available, use:
auser@uan01:~> module avail\n------------------------------- /opt/cray/pe/perftools/20.09.0/modulefiles --------------------------------\nperftools perftools-lite-events perftools-lite-hbm perftools-nwpc \nperftools-lite perftools-lite-gpu perftools-lite-loops perftools-preload \n\n---------------------------------- /opt/cray/pe/craype/2.7.0/modulefiles ----------------------------------\ncraype-hugepages1G craype-hugepages8M craype-hugepages128M craype-network-ofi \ncraype-hugepages2G craype-hugepages16M craype-hugepages256M craype-network-slingshot10 \ncraype-hugepages2M craype-hugepages32M craype-hugepages512M craype-x86-rome \ncraype-hugepages4M craype-hugepages64M craype-network-none \n\n------------------------------------- /usr/local/Modules/modulefiles --------------------------------------\ndot module-git module-info modules null use.own \n\n-------------------------------------- /opt/cray/pe/cpe-prgenv/7.0.0 --------------------------------------\ncpe-aocc cpe-cray cpe-gnu \n\n-------------------------------------------- /opt/modulefiles ---------------------------------------------\naocc/2.1.0.3(default) cray-R/4.0.2.0(default) gcc/8.1.0 gcc/9.3.0 gcc/10.1.0(default) \n\n\n---------------------------------------- /opt/cray/pe/modulefiles -----------------------------------------\natp/3.7.4(default) cray-mpich-abi/8.0.15 craype-dl-plugin-py3/20.06.1(default) \ncce/10.0.3(default) cray-mpich-ucx/8.0.15 craype/2.7.0(default) \ncray-ccdb/4.7.1(default) cray-mpich/8.0.15(default) craypkg-gen/1.3.10(default) \ncray-cti/2.7.3(default) cray-netcdf-hdf5parallel/4.7.4.0 gdb4hpc/4.7.3(default) \ncray-dsmml/0.1.2(default) cray-netcdf/4.7.4.0 iobuf/2.0.10(default) \ncray-fftw/3.3.8.7(default) cray-openshmemx/11.1.1(default) papi/6.0.0.2(default) \ncray-ga/5.7.0.3 cray-parallel-netcdf/1.12.1.0 perftools-base/20.09.0(default) \ncray-hdf5-parallel/1.12.0.0 cray-pmi-lib/6.0.6(default) valgrind4hpc/2.7.2(default) \ncray-hdf5/1.12.0.0 cray-pmi/6.0.6(default) 
\ncray-libsci/20.08.1.2(default) cray-python/3.8.5.0(default) \n
This will list all the names and versions of the modules available on the service. Not all of them may work in your account, however, due to, for example, licensing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops the default version will change and old versions of software may be deleted.
You can list all the modules of a particular type by providing an argument to the module avail
command. For example, to list all available versions of the HPE Cray FFTW library, use:
auser@uan01:~> module avail cray-fftw\n\n---------------------------------------- /opt/cray/pe/modulefiles -----------------------------------------\ncray-fftw/3.3.8.7(default) \n
If you want more info on any of the modules, you can use the module help
command:
auser@uan01:~> module help cray-fftw\n\n-------------------------------------------------------------------\nModule Specific Help for /opt/cray/pe/modulefiles/cray-fftw/3.3.8.7:\n\n\n===================================================================\nFFTW 3.3.8.7\n============\n Release Date:\n -------------\n June 2020\n\n\n Purpose:\n --------\n This Cray FFTW 3.3.8.7 release is supported on Cray Shasta Systems. \n FFTW is supported on the host CPU but not on the accelerator of Cray systems.\n\n The Cray FFTW 3.3.8.7 release provides the following:\n - Optimizations for AMD Rome CPUs.\n See the Product and OS Dependencies section for details\n\n[...]\n
The module show
command reveals what operations the module actually performs to change your environment when it is loaded. We give a brief overview of what these settings mean below. For example, for the default FFTW module:
auser@uan01:~> module show cray-fftw\n-------------------------------------------------------------------\n/opt/cray/pe/modulefiles/cray-fftw/3.3.8.7:\n\nconflict cray-fftw\nconflict fftw\nsetenv FFTW_VERSION 3.3.8.7\nsetenv CRAY_FFTW_VERSION 3.3.8.7\nsetenv CRAY_FFTW_PREFIX /opt/cray/pe/fftw/3.3.8.7/x86_rome\nsetenv FFTW_ROOT /opt/cray/pe/fftw/3.3.8.7/x86_rome\nsetenv FFTW_DIR /opt/cray/pe/fftw/3.3.8.7/x86_rome/lib\nsetenv FFTW_INC /opt/cray/pe/fftw/3.3.8.7/x86_rome/include\nprepend-path PATH /opt/cray/pe/fftw/3.3.8.7/x86_rome/bin\nprepend-path MANPATH /opt/cray/pe/fftw/3.3.8.7/share/man\nprepend-path CRAY_LD_LIBRARY_PATH /opt/cray/pe/fftw/3.3.8.7/x86_rome/lib\nprepend-path PE_PKGCONFIG_PRODUCTS PE_FFTW\nsetenv PE_FFTW_TARGET_x86_skylake x86_skylake\nsetenv PE_FFTW_TARGET_x86_rome x86_rome\nsetenv PE_FFTW_TARGET_x86_cascadelake x86_cascadelake\nsetenv PE_FFTW_TARGET_x86_64 x86_64\nsetenv PE_FFTW_TARGET_share share\nsetenv PE_FFTW_TARGET_sandybridge sandybridge\nsetenv PE_FFTW_TARGET_mic_knl mic_knl\nsetenv PE_FFTW_TARGET_ivybridge ivybridge\nsetenv PE_FFTW_TARGET_haswell haswell\nsetenv PE_FFTW_TARGET_broadwell broadwell\nsetenv PE_FFTW_VOLATILE_PKGCONFIG_PATH /opt/cray/pe/fftw/3.3.8.7/@PE_FFTW_TARGET@/lib/pkgconfig\nsetenv PE_FFTW_PKGCONFIG_VARIABLES PE_FFTW_OMP_REQUIRES_@openmp@\nsetenv PE_FFTW_OMP_REQUIRES { }\nsetenv PE_FFTW_OMP_REQUIRES_openmp _mp\nsetenv PE_FFTW_PKGCONFIG_LIBS fftw3_mpi:libfftw3_threads:fftw3:fftw3f_mpi:libfftw3f_threads:fftw3f\nmodule-whatis {FFTW 3.3.8.7 - Fastest Fourier Transform in the West}\n [...]\n
"},{"location":"user-guide/sw-environment-4cab/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"To load a module, use the module load
command. For example, to load the default version of HPE Cray FFTW into your environment, use:
auser@uan01:~> module load cray-fftw\n
Once you have done this, your environment will be set up to use the HPE Cray FFTW library. The above command will load the default version of HPE Cray FFTW. If you need a specific version of the software, you can add more information:
auser@uan01:~> module load cray-fftw/3.3.8.7\n
will load HPE Cray FFTW version 3.3.8.7 into your environment, regardless of the default.
If you want to remove software from your environment, module remove
will remove a loaded module:
auser@uan01:~> module remove cray-fftw\n
will unload whatever version of cray-fftw
(even if it is not the default) you might have loaded.
There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule
.
If you have loaded version 3.3.8.7 of cray-fftw
, the following command will change to version 3.3.8.5:
auser@uan01:~> module swap cray-fftw cray-fftw/3.3.8.5\n
You do not need to specify the version of the module currently loaded in your environment: it can be inferred, as it will be the only version you have loaded.
"},{"location":"user-guide/sw-environment-4cab/#changing-programming-environment","title":"Changing Programming Environment","text":"The three programming environments PrgEnv-aocc
, PrgEnv-cray
, PrgEnv-gnu
are implemented as module collections. The correct way to change programming environment, that is, change the collection of modules, is therefore via module restore
. For example:
auser@uan01:~> module restore PrgEnv-gnu\n
Note that there is only one argument, which is the collection to be restored. The command module restore
will output a list of modules in the outgoing collection as they are unloaded, and the modules in the incoming collection as they are loaded. If you prefer not to have messages
auser@uan01:~> module -s restore PrgEnv-gnu\n
will suppress the messages. An attempt to restore a collection which is already loaded will result in no operation.
Module collections are stored in a user's home directory ${HOME}/.module
. However, as the home directory is not available to the back end, module restore
may fail for batch jobs. In this case, it is possible to restore one of the three standard programming environments via, e.g.,
module restore /etc/cray-pe.d/PrgEnv-gnu\n
"},{"location":"user-guide/sw-environment-4cab/#capturing-your-environment-for-reuse","title":"Capturing your environment for reuse","text":"Sometimes it is useful to save the module environment that you are using to compile a piece of code or execute a piece of software. This is saved as a module collection. You can save a collection from your current environment by executing:
auser@uan01:~> module save [collection_name]\n
Note
If you do not specify the environment name, it is called default
.
You can find the list of saved module environments by executing:
auser@uan01:~> module savelist\nNamed collection list:\n 1) default 2) PrgEnv-aocc 3) PrgEnv-cray 4) PrgEnv-gnu \n
To list the modules in a collection, you can execute, e.g.,:
auser@uan01:~> module saveshow PrgEnv-gnu\n-------------------------------------------------------------------\n/home/t01/t01/auser/.module/default:\nmodule use --append /opt/cray/pe/perftools/20.09.0/modulefiles\nmodule use --append /opt/cray/pe/craype/2.7.0/modulefiles\nmodule use --append /usr/local/Modules/modulefiles\nmodule use --append /opt/cray/pe/cpe-prgenv/7.0.0\nmodule use --append /opt/modulefiles\nmodule use --append /opt/cray/modulefiles\nmodule use --append /opt/cray/pe/modulefiles\nmodule use --append /opt/cray/pe/craype-targets/default/modulefiles\nmodule load cpe-gnu\nmodule load gcc\nmodule load craype\nmodule load craype-x86-rome\nmodule load --notuasked libfabric\nmodule load craype-network-ofi\nmodule load cray-dsmml\nmodule load perftools-base\nmodule load xpmem\nmodule load cray-mpich\nmodule load cray-libsci\nmodule load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env\n
Note again that the details of the collection have been saved to the home directory (the first line of output above). It is possible to save a module collection with a fully qualified path, e.g.,
auser@uan1:~> module save /work/t01/z01/auser/.module/PrgEnv-gnu\n
which would make it available from the batch system.
To delete a module environment, you can execute:
auser@uan01:~> module saverm <environment_name>\n
"},{"location":"user-guide/sw-environment-4cab/#shell-environment-overview","title":"Shell environment overview","text":"When you log in to ARCHER2, you are using the bash shell by default. As with any other software, the bash shell has loaded a set of environment variables that can be listed by executing printenv
or export
.
The environment variables listed before are useful to define the behaviour of the software you run. For instance, OMP_NUM_THREADS
defines the number of OpenMP threads to use.
To define an environment variable, you need to execute:
export OMP_NUM_THREADS=4\n
Please note that there must be no spaces between the variable name, the equals sign and the value. If the value contains spaces or other special characters, enclose it in double quotation marks.
You can show the value of a specific environment variable if you print it:
echo $OMP_NUM_THREADS\n
Do not forget the dollar symbol. To remove an environment variable, just execute:
unset OMP_NUM_THREADS\n
"},{"location":"user-guide/sw-environment/","title":"Software environment","text":"The software environment on ARCHER2 is managed using the Lmod software. Selecting which software is available in your environment is primarily controlled through the module
command. By loading and switching software modules you control which software and versions are available to you.
Information
A module is a self-contained description of a software package -- it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
By default, all users on ARCHER2 start with the default software environment loaded.
Software modules on ARCHER2 are provided by both HPE (usually known as the HPE Cray Programming Environment, CPE) and by EPCC, who provide the Service Provision, and Computational Science and Engineering services.
In this section, we provide:
module
commandmodule
command manipulates your environmentmodule
command","text":"We only cover basic usage of the Lmod module
command here. For full documentation please see the Lmod documentation
The module
command takes a subcommand to indicate what operation you wish to perform. Common subcommands are:
module restore - Restore the default module setup (i.e. as if you had logged out and back in again)
module list [name] - List modules currently loaded in your environment, optionally filtered by [name]
module avail [name] - List modules available, optionally filtered by [name]
module spider [name][/version] - Search available modules (including hidden modules) and provide information on modules
module load name - Load the module called name into your environment
module remove name - Remove the module called name from your environment
module help name - Show help information on module name
module show name - List what module name actually does to your environment
These are described in more detail below.
Tip
Lmod allows you to use the ml
shortcut command. Without any arguments, ml
behaves like module list
; when a module name is specified to ml
, ml
behaves like module load
.
Note
You will often have to include module
commands in any job submission scripts to set up the software to use in your jobs. Generally, if you load modules in interactive sessions, these loaded modules do not carry over into any job submission scripts.
Important
You should not use the module purge
command on ARCHER2 as this will cause issues for the HPE Cray programming environment. If you wish to reset your modules, you should use the module restore
command instead.
The key commands for getting information on modules are covered in more detail below. They are:
module list
module avail
module spider
module help
module show
module list
","text":"The module list
command will give the names of the modules and their versions you have presently loaded in your environment:
auser@ln03:~> module list\n\nCurrently Loaded Modules:\n 1) craype-x86-rome 6) cce/15.0.0 11) PrgEnv-cray/8.3.3\n 2) libfabric/1.12.1.2.2.0.0 7) craype/2.7.19 12) bolt/0.8\n 3) craype-network-ofi 8) cray-dsmml/0.2.2 13) epcc-setup-env\n 4) perftools-base/22.12.0 9) cray-mpich/8.1.23 14) load-epcc-module\n 5) xpmem/2.5.2-2.4_3.30__gd0f7936.shasta 10) cray-libsci/22.12.1.1\n
All users start with a default set of modules loaded corresponding to:
module avail
","text":"To find out which software modules are currently available to load on the system, use the module avail
command. To list all software modules currently available to load, use:
auser@uan01:~> module avail\n\n--------------------------- /work/y07/shared/archer2-lmod/utils/compiler/crayclang/10.0 ---------------------------\n darshan/3.3.1\n\n------------------------------------ /work/y07/shared/archer2-lmod/python/core ------------------------------------\n matplotlib/3.4.3 netcdf4/1.5.7 pytorch/1.10.0 scons/4.3.0 seaborn/0.11.2 tensorflow/2.7.0\n\n------------------------------------- /work/y07/shared/archer2-lmod/libs/core -------------------------------------\n aocl/3.1 (D) gmp/6.2.1 matio/1.5.23 parmetis/4.0.3 slepc/3.14.1\n aocl/4.0 gsl/2.7 metis/5.1.0 petsc/3.14.2 slepc/3.18.3 (D)\n boost/1.72.0 hypre/2.18.0 mkl/2023.0.0 petsc/3.18.5 (D) superlu-dist/6.4.0\n boost/1.81.0 (D) hypre/2.25.0 (D) mumps/5.3.5 scotch/6.1.0 superlu-dist/8.1.2 (D)\n eigen/3.4.0 libxml2/2.9.7 mumps/5.5.1 (D) scotch/7.0.3 (D) superlu/5.2.2\n\n------------------------------------- /work/y07/shared/archer2-lmod/apps/core -------------------------------------\n castep/22.11 namd/2.14 (D) py-chemshell/21.0.3\n code_saturne/7.0.1-cce15 nektar/5.2.0 quantum_espresso/6.8 (D)\n code_saturne/7.0.1-gcc11 (D) nwchem/7.0.2 quantum_espresso/7.1\n cp2k/cp2k-2023.1 onetep/6.1.9.0-CCE-LibSci (D) tcl-chemshell/3.7.1\n elk/elk-7.2.42 onetep/6.1.9.0-GCC-LibSci vasp/5/5.4.4.pl2-vtst\n fhiaims/210716.3 onetep/6.1.9.0-GCC-MKL vasp/5/5.4.4.pl2\n gromacs/2022.4+plumed openfoam/com/v2106 vasp/6/6.3.2-vtst\n gromacs/2022.4 (D) openfoam/com/v2212 (D) vasp/6/6.3.2 (D)\n lammps/17Feb2023 openfoam/org/v9.20210903\n namd/2.14-nosmp openfoam/org/v10.20230119 (D)\n\n------------------------------------ /work/y07/shared/archer2-lmod/utils/core -------------------------------------\n amd-uprof/3.6.449 darshan-util/3.3.1 imagemagick/7.1.0 reframe/4.1.0\n forge/24.0 epcc-reframe/0.2 ncl/6.6.2 tcl/8.6.13\n bolt/0.7 epcc-setup-env (L) nco/5.0.3 (D) tk/8.6.13\n bolt/0.8 (L,D) gct/v6.2.20201212 nco/5.0.5 usage-analysis/1.2\n cdo/1.9.9rc1 genmaskcpu/1.0 ncview/2.1.7 visidata/2.1\n cdo/2.1.1 (D) 
gnuplot/5.4.2-simg other-software/1.0 vmd/1.9.3-gcc10\n cmake/3.18.4 gnuplot/5.4.2 (D) paraview/5.9.1 (D) xthi/1.3\n cmake/3.21.3 (D) gnuplot/5.4.3 paraview/5.10.1\n\n--------------------- /opt/cray/pe/lmod/modulefiles/mpi/crayclang/14.0/ofi/1.0/cray-mpich/8.0 ---------------------\n cray-hdf5-parallel/1.12.2.1 cray-mpixlate/1.0.0.6 cray-parallel-netcdf/1.12.3.1\n\n--------------------------- /opt/cray/pe/lmod/modulefiles/comnet/crayclang/14.0/ofi/1.0 ---------------------------\n cray-mpich-abi/8.1.23 cray-mpich/8.1.23 (L)\n\n...output trimmed...\n
This will list all the names and versions of the modules that you can currently load. Note that other modules may be defined but not available to you as they depend on modules you do not have loaded. Lmod only shows modules that you can currently load, not all those that are defined. You can search for modules that are not currently visible to you using the module spider
command - we cover this in more detail below.
Note also that not all modules may work in your account due to, for example, licensing restrictions. You will notice that for many modules we have more than one version, each of which is identified by a version number. One of these versions is the default. As the service develops the default version will change and old versions of software may be deleted.
You can list all the modules of a particular type by providing an argument to the module avail
command. For example, to list all available versions of the HPE Cray FFTW library, use:
auser@ln03:~> module avail cray-fftw\n\n--------------------------------- /opt/cray/pe/lmod/modulefiles/cpu/x86-rome/1.0 ----------------------------------\n cray-fftw/3.3.10.3\n\nModule defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.\nSee https://lmod.readthedocs.io/en/latest/060_locating.html for details.\n\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
"},{"location":"user-guide/sw-environment/#module-spider","title":"module spider
","text":"The module spider
command is used to find out which modules are defined on the system. Unlike module avail
, this includes modules that cannot currently be loaded because you have not yet loaded the dependencies that make them available.
module spider
takes 3 forms:
module spider without any arguments lists all modules defined on the system
module spider <module> shows information on which versions of <module> are defined on the system
module spider <module>/<version> shows information on the specific version of the module defined on the system, including dependencies that must be loaded before this module can be loaded (if any)
If you cannot find a module that you expect to be on the system using module avail
then you can use module spider
to find out which dependencies you need to load to make the module available.
For example, the module cray-netcdf-hdf5parallel
is installed on ARCHER2 but it will not be found by module avail
:
auser@ln03:~> module avail cray-netcdf-hdf5parallel\nNo module(s) or extension(s) found!\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
We can use module spider
without any arguments to verify it exists and list the versions available:
auser@ln03:~> module spider\n\n-----------------------------------------------------------------------------------------------\nThe following is a list of the modules and extensions currently available:\n-----------------------------------------------------------------------------------------------\n\n...output trimmed...\n\n cray-mpich-abi: cray-mpich-abi/8.1.23\n\n cray-mpixlate: cray-mpixlate/1.0.0.6\n\n cray-mrnet: cray-mrnet/5.0.4\n\n cray-netcdf: cray-netcdf/4.9.0.1\n\n cray-netcdf-hdf5parallel: cray-netcdf-hdf5parallel/4.9.0.1\n\n cray-openshmemx: cray-openshmemx/11.5.7\n\n...output trimmed...\n
Now we know which versions are available, we can use module spider cray-netcdf-hdf5parallel/4.9.0.1
to find out how we can make it available:
auser@ln03:~> module spider cray-netcdf-hdf5parallel/4.9.0.1\n\n---------------------------------------------------------------------------------------------------------------\n cray-netcdf-hdf5parallel: cray-netcdf-hdf5parallel/4.9.0.1\n---------------------------------------------------------------------------------------------------------------\n\n You will need to load all module(s) on any one of the lines below before the \"cray-netcdf-hdf5parallel/4.9.0.1\" module is available to load.\n\n aocc/3.2.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n cce/15.0.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n craype-network-none cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n craype-network-ofi cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n craype-network-ucx cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n gcc/10.3.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n gcc/11.2.0 cray-mpich/8.1.23 cray-hdf5-parallel/1.12.2.1\n\n Help:\n Release info: /opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/release_info\n
There is a lot of information here, but what the output is essentially telling us is that in order to have cray-netcdf-hdf5parallel/4.9.0.1
available to load we need a compiler (any version of CCE, GCC or AOCC), an MPI library (any version of cray-mpich) and cray-hdf5-parallel
loaded. As we always have a compiler and MPI library loaded, we can satisfy all of the dependencies by loading cray-hdf5-parallel
, and then we can use module avail cray-netcdf-hdf5parallel
again to show that the module is now available to load:
auser@ln03:~> module load cray-hdf5-parallel\nauser@ln03:~> module avail cray-netcdf-hdf5parallel\n\n--- /opt/cray/pe/lmod/modulefiles/hdf5-parallel/crayclang/14.0/ofi/1.0/cray-mpich/8.0/cray-hdf5-parallel/1.12.2 ---\n cray-netcdf-hdf5parallel/4.9.0.1\n\nModule defaults are chosen based on Find First Rules due to Name/Version/Version modules found in the module tree.\nSee https://lmod.readthedocs.io/en/latest/060_locating.html for details.\n\nUse \"module spider\" to find all possible modules and extensions.\nUse \"module keyword key1 key2 ...\" to search for all possible modules matching any of the \"keys\".\n
"},{"location":"user-guide/sw-environment/#module-help","title":"module help
","text":"If you want more info on any of the modules, you can use the module help
command:
auser@ln03:~> module help gromacs\n
"},{"location":"user-guide/sw-environment/#module-show","title":"module show
","text":"The module show
command reveals what operations the module actually performs to change your environment when it is loaded. For example, for the default GROMACS module:
auser@ln03:~> module show gromacs\n\n [...]\n
"},{"location":"user-guide/sw-environment/#loading-removing-and-swapping-modules","title":"Loading, removing and swapping modules","text":"To change your environment and make different software available you use the following commands which we cover in more detail below.
module load
module remove
module swap
module load
","text":"To load a module, use the module load
command. For example, to load the default version of GROMACS into your environment, use:
auser@ln03:~> module load gromacs\n
Once you have done this, your environment will be set up to use GROMACS. The above command will load the default version of GROMACS. If you need a specific version of the software, you can add more information:
auser@uan01:~> module load gromacs/2022.4 \n
will load GROMACS version 2022.4 into your environment, regardless of the default.
"},{"location":"user-guide/sw-environment/#module-remove","title":"module remove
","text":"If you want to remove software from your environment, module remove
will remove a loaded module:
auser@uan01:~> module remove gromacs\n
will unload whatever version of gromacs
you might have loaded (even if it is not the default).
module swap
","text":"There are many situations in which you might want to change the presently loaded version to a different one, such as trying the latest version which is not yet the default or using a legacy version to keep compatibility with old data. This can be achieved most easily by using module swap oldmodule newmodule
.
For example, to swap from the default CCE (cray) compiler environment to the GCC (gnu) compiler environment, you would use:
auser@ln03:~> module swap PrgEnv-cray PrgEnv-gnu\n
You do not need to specify the version of the module currently loaded in your environment: it can be inferred, as it will be the only version you have loaded.
"},{"location":"user-guide/sw-environment/#shell-environment-overview","title":"Shell environment overview","text":"When you log in to ARCHER2, you are using the bash shell by default. As with any software, the bash shell has loaded a set of environment variables that can be listed by executing printenv
or export
.
The environment variables listed before are useful to define the behaviour of the software you run. For instance, OMP_NUM_THREADS
defines the number of OpenMP threads to use.
To define an environment variable, you need to execute:
export OMP_NUM_THREADS=4\n
Please note that there must be no spaces between the variable name, the equals sign and the value. If the value contains spaces or other special characters, enclose it in double quotation marks.
You can show the value of a specific environment variable if you print it:
echo $OMP_NUM_THREADS\n
Do not forget the dollar symbol. To remove an environment variable, just execute:
unset OMP_NUM_THREADS\n
Note that the dollar symbol is not included when you use the unset
command.
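The steps above can be combined into one short sketch. The variable MY_APP_LABEL and its value are made up for illustration; only OMP_NUM_THREADS comes from the text above.

```shell
# Define a variable: no spaces around "=".
export OMP_NUM_THREADS=4

# A value containing spaces must be quoted.
# MY_APP_LABEL is a hypothetical variable used only for illustration.
export MY_APP_LABEL="test run 1"

# Print the values: the dollar symbol is required here.
echo "$OMP_NUM_THREADS"
echo "$MY_APP_LABEL"

# Remove the variables again: no dollar symbol with unset.
unset OMP_NUM_THREADS MY_APP_LABEL

# ${VAR:-text} expands to the fallback text when VAR is unset or empty.
echo "${OMP_NUM_THREADS:-not set}"
```

Variables defined this way only exist in the current shell session and in processes started from it.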
Note that it is not possible for a single user to monopolise the resources on a login node as this is controlled by cgroups. This means that a user cannot slow down the response time for other users.
"},{"location":"user-guide/tds/","title":"ARCHER2 Test and Development System (TDS) user notes","text":"The ARCHER2 Test and Development System (TDS) is a small system used for testing changes before they are rolled out onto the full ARCHER2 system. This page contains useful information for people using the TDS on its configuration and what they can expect from the system.
Important
The TDS is used for testing on a day to day basis. This means that nodes and the entire system may be made unavailable or rebooted with little or no warning.
"},{"location":"user-guide/tds/#tds-system-details","title":"TDS system details","text":"Compute nodes: 8 compute nodes in total
Slingshot interconnect
Storage:
You can only log into the TDS from an ARCHER2 login node. You should create an SSH key pair on an ARCHER2 login node and add the public part to your ARCHER2 account in SAFE in the usual way.
Once your new key pair is set up, you can then log in to the TDS (from an ARCHER2 login node) with
ssh login-tds.archer2.ac.uk\n
You will require your SSH key passphrase (for the new key pair you generated) and your usual ARCHER2 account password to log in to the TDS.
"},{"location":"user-guide/tds/#slurm-scheduler-configuration","title":"Slurm scheduler configuration","text":"standard
: includes all compute nodeshighmem
: includes high memory compute nodesstandard
: same limits as on ARCHER2 main systemhighmem
: same limits as on ARCHER2 main systemSoftware modules
/work
file system - i.e. you may be able to load a module but the software it points to may not be available. Check if the software is actually installed before trying to use it.
The GCC 12.2.0 (gcc/g++/gfortran) compiler has been shown to give incorrect numerical results for a number of software packages (VASP, CASTEP, CP2K). If you want to use this compiler version we recommend checking output carefully. We may remove this version from the PE software stack installed on the full system as part of the software upgrade.
Singularity + MPI does not currently work - MPI executable in the Singularity container segfaults.
Energy use data is not available from TDS compute nodes.
Change of behaviour of the --cpus-per-task
Slurm option. If you set --cpus-per-task
greater than 1
in your job submission script (e.g. using #SBATCH
directives) then this option is not inherited by srun
commands in the job script. You need to either set something like export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
or repeat the option explicitly in the srun
command (e.g. srun --cpus-per-task=$SLURM_CPUS_PER_TASK --hint=nomultithread --distribution=block:block
).
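A minimal job script fragment sketching the first approach is shown below. The #SBATCH values and the executable name ./my_app are illustrative assumptions, and account, partition and QoS options are omitted.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=8

# Inside a job, Slurm sets SLURM_CPUS_PER_TASK from --cpus-per-task.
# Export it under the name that srun reads so that every srun call in
# this script picks it up. The ":-1" fallback is only a safety net for
# the case where --cpus-per-task was not given.
export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK:-1}"

# Assumption: the application is threaded with OpenMP, so the thread
# count is set to match the CPUs reserved per task.
export OMP_NUM_THREADS="${SRUN_CPUS_PER_TASK}"

# ./my_app is a hypothetical executable name.
srun --hint=nomultithread --distribution=block:block ./my_app
```

Exporting the variable once keeps every srun line in the script consistent; repeating --cpus-per-task on each srun command is equivalent but easier to forget.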
Change in definition of a Slurm NUMA region. On the TDS, a Slurm NUMA region is 4 cores (corresponding to a Core CompleX, or CCX, in the AMD EPYC Zen2 architecture). This means cyclic process placements on NUMA regions (e.g. --distribution=block:cyclic
) will cycle over 4-core CCX. (On the main system, a Slurm NUMA region is 16 cores).
The vast majority of parallel scientific applications use the MPI library as the main way to implement parallelism; it is used so universally that the Cray compiler wrappers on ARCHER2 link to the Cray MPI library by default. Unlike other clusters you may have used, there is no choice of MPI library on ARCHER2: regardless of what compiler you are using, your program will use Cray MPI. This is because the Slingshot network on ARCHER2 is Cray-specific and significant effort has been put in by Cray software engineers to optimise the MPI performance on their Shasta systems.
Here we list a number of suggestions for improving the performance of your MPI programs on ARCHER2. Although MPI programs are capable of scaling very well due to the bespoke communications hardware and software, the details of how a program calls MPI can have significant effects on achieved performance.
Note
Many of these tips are actually quite generic and should be beneficial to any MPI program; however, they all become much more important when running on very large numbers of processes on a machine the size of ARCHER2.
"},{"location":"user-guide/tuning/#mpi-environment-variables","title":"MPI environment variables","text":"There are a number of environment variables available to control aspects of MPI behaviour on ARCHER2. The full set of options can be displayed by running
man intro_mpi\n
on the ARCHER2 login nodes. A couple of specific variables to highlight are MPICH_OFI_STARTUP_CONNECT and MPICH_OFI_RMA_STARTUP_CONNECT.
When using the default OFI transport layer, the connections between ranks are set up as they are required. This allows for good performance while reducing memory requirements. However, for jobs using all-to-all communication it might be better to generate these connections in a coordinated way at the start of the application. To enable this, set the following environment variable:
export MPICH_OFI_STARTUP_CONNECT=1 \n
Additionally, for RMA jobs requiring an all-to-all communication pattern on node, it may be beneficial to set up the connections between processes on a node in a coordinated fashion:
export MPICH_OFI_RMA_STARTUP_CONNECT=1\n
This option automatically enables MPICH_OFI_STARTUP_CONNECT.
"},{"location":"user-guide/tuning/#synchronous-vs-asynchronous-communications","title":"Synchronous vs asynchronous communications","text":""},{"location":"user-guide/tuning/#mpi_send","title":"MPI_Send","text":"A standard way to send data in MPI is using MPI_Send
(aptly called standard send). Somewhat confusingly, MPI is allowed to choose how to implement this in two different ways:
Synchronously The sending process waits until a matching receive has been posted, i.e. it operates like MPI_Ssend
. This clearly has the risk of deadlock if no receive is ever issued.
Asynchronously MPI makes a copy of the message into an internal buffer and returns straight away without waiting for a matching receive; the message may actually be delivered later on. This is like the behaviour of the buffered send routine MPI_Bsend
.
The rationale is that MPI, rather than the user, should decide how best to send a message.
In practice, what typically happens is that MPI tries to use an asynchronous approach via the eager protocol: the message is sent directly to a preallocated buffer on the receiver and the routine returns immediately afterwards. Clearly there is a limit on how much space can be reserved for this, so:
The threshold is often termed the eager limit which is fixed for the entire run of your program. It will have some default setting which varies from system to system, but might be around 8K bytes.
"},{"location":"user-guide/tuning/#implications","title":"Implications","text":"MPI_Send
is implemented asynchronously using the eager protocol since synchronisation between sender and receiver is much reduced.MPI_Send
buffers your message, so if you have concerns about deadlock you will need to use the non-blocking variant MPI_Isend
to guarantee that the send routine returns control to you immediately even if there is no matching receive.MPI_Send
/ MPI_Isend
with MPI_Ssend
/ MPI_Issend
. A correct MPI program should still run correctly when all references to standard send are replaced by synchronous send (since MPI is allowed to implement standard send as synchronous send).
With most MPI libraries you should be able to alter the default value of the eager limit at runtime, perhaps via an environment variable or a command-line argument to mpirun
.
The advice for tuning the performance of MPI_Send
is
MPI_Send
is (a profiling tool may be useful here);MPI_Isend
as well: even in the non-blocking form, which can help to weaken synchronisation between sender and receiver, the amount of hand-shaking required is much reduced if the eager protocol is used;
Note
It cannot be stressed strongly enough that although the performance may be affected by the value of the eager limit, the functionality of your program should be unaffected. If changing the eager limit affects the correctness of your program (e.g. whether or not it deadlocks) then you have an incorrect MPI program.
"},{"location":"user-guide/tuning/#setting-the-eager-limit-on-archer2","title":"Setting the eager limit on ARCHER2","text":"On ARCHER2, things are a little more complicated. Although the eager limit defaults to 16KiB, messages up to 256KiB are sent asynchronously because they are actually sent as a number of smaller messages.
To send even larger messages asynchronously, alter the value of FI_OFI_RXM_SAR_LIMIT
in your job submission script, e.g. to set to 512KiB:
export FI_OFI_RXM_SAR_LIMIT=524288\n
You can also control the size of the smaller messages by altering the value of FI_OFI_RXM_BUFFER_SIZE
in your job submission script, e.g. to set to 128KiB:
export FI_OFI_RXM_BUFFER_SIZE=131072\n
A different protocol is used for messages between two processes on the same node. The default eager limit for these is 8K. Although the performance of on-node messages is unlikely to be a limiting factor for your program you can change this value, e.g. to set to 16KiB:
export MPICH_SMP_SINGLE_COPY_SIZE=16384\n
"},{"location":"user-guide/tuning/#collective-operations","title":"Collective operations","text":"Many of the collective operations that are commonly required by parallel scientific programs, i.e. operations that involve a group of processes, are already implemented in MPI. The canonical operation is perhaps adding up a double precision number across all MPI processes, which is best achieved by a reduction operation:
MPI_Allreduce(&x, &xsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);\n
This will be implemented using an efficient algorithm, for example based on a binary tree. Using such divide-and-conquer approaches typically results in an algorithm whose execution time on P processes scales as log_2(P); compare this to a naive approach where every process sends its input to rank 0 where the time will scale as P. This might not be significant on your laptop, but even on as few as 1000 processes the tree-based algorithm will already be around 100 times faster.
So, the basic advice is always use a collective routine to implement your communications pattern if at all possible.
In real MPI applications, collective operations are often called on a small amount of data, for example a global reduction of a single variable. In these cases, the time taken will be dominated by message latency and the first port of call when looking at performance optimisation is to call them as infrequently as possible!
Sometimes, the collective routines available may not appear to do exactly what you want. However, they can sometimes be used with a small amount of additional programming work:
To operate on a subset of processes, create sub-communicators containing the relevant subset(s) and use these communicators instead of MPI_COMM_WORLD
. Useful functions for communicator management include:
MPI_Comm_split is the most general routine;
MPI_Comm_split_type can be used to create a separate communicator for each shared-memory node with split type = MPI_COMM_TYPE_SHARED;
MPI_Cart_sub can divide a Cartesian communicator into regular slices.
If the communication pattern is what you want, but the data on each process is not arranged in the required layout, consider using MPI derived data types for the input and/or output. This can be useful, for example, if you want to communicate non-contiguous data such as a subsection of a multidimensional array, although care must be taken in defining these types to ensure they have the correct extents.
Another example would be using MPI_Allreduce
to add up an integer and a double-precision variable using a single call by putting them together into a C struct
and defining a matching MPI datatype using MPI_Type_create_struct
. Here you would also have to provide MPI with a custom reduction operation using MPI_Op_create
.
Many MPI programs call MPI_Barrier
to explicitly synchronise all the processes. Although this can be useful for getting reliable performance timings, it is rare in practice to find a program where the call is actually needed for correctness. For example, you may see:
// Ensure the input x is available on all processes\nMPI_Barrier(MPI_COMM_WORLD);\n// Perform a global reduction operation\nMPI_Allreduce(&x, &xsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);\n// Ensure the result xsum is available on all processes\nMPI_Barrier(MPI_COMM_WORLD);\n
Neither of these barriers is needed as the reduction operation performs all the required synchronisation.
If removing a barrier from your MPI code makes it run incorrectly, then this should ring alarm bells -- it is often a symptom of an underlying bug that is simply being masked by the barrier.
For example, if you use non-blocking calls such as MPI_Irecv
then it is the programmer's responsibility to ensure that these are completed at some later point, for example by calling MPI_Wait
on the returned request object. A common bug is to forget to do this, in which case you might be reading the contents of the receive buffer before the incoming message has arrived (e.g. if the sender is running late).
Calling a barrier may mask this bug as it will make all the processes wait for each other, perhaps allowing the late sender to catch up. However, this is not guaranteed so the real solution is to call the non-blocking communications correctly.
One of the few times when a barrier may be required is if processes are communicating with each other via some other non-MPI method, e.g. via the file system. If you want processes to sequentially open, append to, then close the same file then barriers are a simple way to achieve this:
for (i=0; i < size; i++)\n{\n if (rank == i) append_data_to_file(data, filename);\n MPI_Barrier(comm);\n}\n
but this is really something of a special case.
Global synchronisation may be required if you are using more advanced techniques such as hybrid MPI/OpenMP or single-sided MPI communication with put and get, but typically you should be using specialised routines such as MPI_Win_fence
rather than MPI_Barrier
.
Tip
If you run a performance profiler on your code and it shows a lot of time being spent in a collective operation such as MPI_Allreduce
, this is not necessarily a sign that the reduction operation itself is the bottleneck. This is often a symptom of load imbalance: even if a reduction operation is efficiently implemented, it may take a long time to complete if the MPI processes do not all call it at the same time. MPI_Allreduce
synchronises across processes so will have to wait for all the processes to call it before it can complete. A single slow process will therefore adversely impact the performance of your entire parallel program.
There are a variety of possible issues that can result in poor performance of OpenMP programs. These include:
"},{"location":"user-guide/tuning/#sequential-code","title":"Sequential code","text":"Code outside of parallel regions is executed sequentially by the master thread.
"},{"location":"user-guide/tuning/#idle-threads","title":"Idle threads","text":"If different threads have different amounts of computation to do, then threads may be idle whenever a barrier is encountered, for example at the end of parallel regions or the end of worksharing loops. For worksharing loops, choosing a suitable schedule kind may help. For more irregular computation patterns, using OpenMP tasks might offer a solution: the runtime will try to load balance tasks across the threads in the team.
Synchronisation mechanisms that enforce mutual exclusion, such as critical regions, atomic statements and locks can also result in idle threads if there is contention - threads have to wait their turn for access.
"},{"location":"user-guide/tuning/#synchronisation","title":"Synchronisation","text":"The act of synchronising threads comes at some cost, even if the threads are never idle. In OpenMP, the most common source of synchronisation overheads is the implicit barriers at the end of parallel regions and worksharing loops. The overhead of these barriers depends on the OpenMP implementation being used as well as on the number of threads, but is typically in the range of a few microseconds. This means that for a simple parallel loop such as
#pragma omp parallel for reduction(+:sum)\nfor (i=0;i<n;i++){\n sum += a[i];\n}\n
the number of iterations required to make parallel execution worthwhile may be of the order of 100,000. On ARCHER2, benchmarking has shown that for the AOCC compiler, OpenMP barriers have significantly higher overhead than for either the Cray or GNU compilers.
It is possible to suppress the implicit barrier at the end of a worksharing loop using a nowait
clause, taking care that this does not introduce any race conditions.
Atomic statements are designed to be capable of more efficient implementation than the equivalent critical region or lock/unlock pair, so should be used where applicable.
"},{"location":"user-guide/tuning/#scheduling","title":"Scheduling","text":"Whenever we rely on the OpenMP runtime to dynamically assign computation to threads (e.g. dynamic or guided loop schedules, tasks), there is some overhead incurred (some of this cost may actually be internal synchronisation in the runtime). It is often necessary to adjust the granularity of the computation to find a compromise between too many small units (and high scheduling cost) and too few large units (where load imbalance may dominate). For example, we can choose a non-default chunksize for the dynamic schedule, or adjust the amount of computation within each OpenMP task construct.
"},{"location":"user-guide/tuning/#communication","title":"Communication","text":"Communication between threads in OpenMP takes place via the cache coherency mechanism. In brief, whenever a thread writes a memory location, all copies of this location which are in a cache belonging to a different core have to be marked as invalid. Subsequent accesses to this location by other threads will result in the up-to-date value being retrieved from the cache where the last write occurred (or possibly from main memory).
Due to the fine granularity of memory accesses, these overheads are difficult to analyse or monitor. To minimise communication, we need to write code with good data affinity - i.e. each thread should access the same subset of program data as much as possible.
"},{"location":"user-guide/tuning/#numa-effects","title":"NUMA effects","text":"On modern CPU nodes, main memory is often organised in NUMA regions - sections of main memory associated with a subset of the cores on a node. On ARCHER2 nodes, there are 8 NUMA regions per node, each associated with 16 CPU cores. On such systems the location of data in main memory with respect to the cores that are accessing it can be important. The default OS policy is to place data in the NUMA region which first accesses it (first touch policy). For OpenMP programs this can be the worst possible option: if the data is initialised by the master thread, it is all allocated in one NUMA region, and having all threads access that data becomes a bandwidth bottleneck.
This default policy can be changed using the numactl
command, but it is probably better to make use of the first touch policy by explicitly parallelising the data initialisation in the application code. This may be straightforward for large multidimensional arrays, but more challenging for irregular data structures.
The cache coherency mechanism described above operates on units of data corresponding to the size of cache lines - for ARCHER2 CPUs this is 64 bytes. This means that if different threads are accessing neighbouring words in memory, and at least some of the accesses are writes, then communication may be happening even if no individual word is actually being accessed by more than one thread. Consequently, patterns such as
#pragma omp parallel shared(count) private(myid) \n{\n myid = omp_get_thread_num();\n ....\n count[myid]++;\n ....\n}\n
may give poor performance if the updates to the count
array are sufficiently frequent.
Whenever there are multiple threads (or processes) executing inside a node, they may contend for some hardware resources. The most important of these for many HPC applications is memory bandwidth. This effect is very evident on ARCHER2 CPUs - it is possible for just 2 threads to almost saturate the available memory bandwidth in a NUMA region which has 16 cores associated with it. For very bandwidth-intensive applications, running more than 2 threads per NUMA region may gain little additional performance. If an OpenMP code is not using all the cores on a node, by default Slurm will spread the threads out across NUMA regions to maximise the available bandwidth.
Another resource that threads may contend for is space in shared caches. On ARCHER2, every set of 4 cores shares 16MB of L3 cache.
"},{"location":"user-guide/tuning/#compiler-non-optimisation","title":"Compiler non-optimisation","text":"In rare cases, adding OpenMP directives can adversely affect the compiler's optimisation process. The symptom of this is that the OpenMP code running on 1 thread is slower than the same code compiled without the OpenMP flag. It can be difficult to find a workaround - using the compiler's diagnostic flags to find out which optimisation (e.g. vectorisation, loop unrolling) is being affected and adding compiler-specific directives may help.
"},{"location":"user-guide/tuning/#hybrid-mpi-and-openmp","title":"Hybrid MPI and OpenMP","text":"There are two main motivations for using both MPI and OpenMP in the same application code: reducing memory requirements and improving performance. At low core counts, where the pure MPI version of the code is still scaling well, adding OpenMP is unlikely to improve performance. In fact, it can introduce some additional overheads which make performance worse! The benefit is likely to come in the regime where the pure MPI version starts to lose scalability - here adding OpenMP can reduce communication costs, make load balancing easier, or be an effective way of exploiting additional parallelism without excessive code re-writing.
An important performance consideration for MPI + OpenMP applications is the choice of the number of OpenMP threads per MPI process. The optimum value will depend on the application, the input data, the number of nodes requested and the choice of compiler, and is hard to predict without experimentation. However, there are some considerations that apply to ARCHER2:
Due to NUMA effects, it is likely that running at least one MPI process per NUMA region (i.e. at least 8 MPI processes per node) will be beneficial.
The number of MPI processes per node should be a power of 2, so that all OpenMP threads run in the same NUMA region as their parent MPI process.
For applications where each process has a small memory footprint (e.g. some molecular dynamics codes), running no more than 4 OpenMP threads per MPI process may be beneficial, so that all the threads in a process share a single L3 cache.
module -q load pytorch/1.13.1-gpu
if you are not running DeepCam and have no need for additional Python packages such as mlperf-logging
and warmup-scheduler
.
In the script above, we specify four tasks per node, one for each GPU. These tasks are evenly spaced across the node so as to maximise the communications
bandwidth between the host and the GPU devices. Note, PyTorch is not using Cray MPICH for inter-task communications, which is instead being handled by the
-ROCm Collective Communications Library (RCCL), hence the --wireup_method nccl-slurm
option (nccl-slurm
works as an alias for `rccl-slurm in this context).
--wireup_method nccl-slurm
option (nccl-slurm
works as an alias for rccl-slurm
in this context).
The above job should achieve convergence — an Intersection over Union (IoU) of 0.82 — after 35 epochs or so. Runtime should be around 20-30 minutes.
We can also modify the DeepCam train.py
script so that the accuracy and loss are logged using TensorBoard.
The following lines must be added to the DeepCam train.py
script.
In order to run a DeepCam training job, you must first clone the MLCommons HPC github repo.
+mkdir ${HOME/home/work}/tests
+cd ${HOME/home/work}/tests
+
+git clone https://github.com/mlcommons/hpc.git mlperf-hpc
+
+cd ./mlperf-hpc/deepcam/src/deepCam
+
Next, we need to edit some parts of the DeepCam Python source such that DeepCam is properly integrated with Cray MPICH.
+The init
function defined in ./utils/comm.py
contains an if
statement that initialises the DeepCam job according
+to the selected communications method. You will need to edit the mpi
branch of this if
statement as shown below.
...
+
+def init(method, batchnorm_group_size=1):
+
+ if method == "nccl-openmpi":
+
+ ...
+
+ elif method == "mpi":
+ rank = int(os.getenv("SLURM_PROCID"))
+ world_size = int(os.getenv("SLURM_NTASKS"))
+ dist.init_process_group(backend = "mpi",
+ rank = rank,
+ world_size = world_size)
+
+ else:
+ raise NotImplementedError()
+
+ ...
+
Second, as we're not running on a GPU platform, we'll need to comment out a statement that calls a GPU-based
+synchronisation method, see the synchronize
method within ./utils/bnstats.py
.
...
+
+def synchronize(self):
+
+ if dist.is_initialized():
+ # sync the device before
+ #torch.cuda.synchronize()
+
+ with torch.no_grad():
+ ...
+
DeepCam can now be run on the CPU nodes using a submission script like the one below.
#!/bin/bash