Commit

Merge pull request #603 from kevinstratford/fix-issue-602
GPU time limits updated
mbareford authored May 3, 2024
2 parents d2efbe9 + 21deca8 commit deb59a2
Showing 2 changed files with 38 additions and 41 deletions.
71 changes: 34 additions & 37 deletions docs/user-guide/gpu.md
@@ -4,12 +4,12 @@
This page is a work in progress. More details on the GPU Development Platform
and how to use it will be added as they become available.

In early 2024 ARCHER2 users will gain access to a small GPU system
integrated into ARCHER2 which is designed to allow users to test and develop software
using AMD GPUs.

!!! important
The GPU component is very small and so is aimed at software development and
testing rather than for production research.

## Hardware available
@@ -28,7 +28,7 @@ The GPU Development Platform consists of 4 compute nodes each with:

The GPU nodes can be accessed through the Slurm job submission system from the
standard ARCHER2 login nodes. Details of the scheduler limits and configuration
and example job submission scripts are provided below.

## Compiling software for the GPU compute nodes

@@ -44,7 +44,7 @@ offloads to the AMD GPUs is as follows:
- Load any other modules (e.g. libraries)
- Use the usual compiler wrappers `ftn`, `cc`, or `CC`

For details and alternative approaches, see below.
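As a rough illustration of that sequence, the following sketch builds a hypothetical
OpenMP offload code with the Cray programming environment; the source file name is a
placeholder and the exact compile flag depends on the programming model and compiler
you are using:

```
module load PrgEnv-cray              # or PrgEnv-amd / PrgEnv-gnu
module load rocm                     # ROCm stack, needed when targeting the AMD GPUs
module load craype-accel-amd-gfx90a  # tell the compiler wrappers to target the AMD gfx90a GPUs
module load craype-x86-milan         # CPU target

ftn -fopenmp my_offload_code.f90 -o my_offload_code.x
```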

### Programming Environments

@@ -86,7 +86,7 @@ InstalledDir: /opt/rocm-5.2.3/llvm/bin
```


### ROCm

Access to AMD's ROCm software stack is provided through the `rocm`
module:
@@ -108,7 +108,7 @@ for most users, at least initially.

**Note**: the `rocm` module should be loaded whenever you are compiling
for the AMD GPUs, even if you are not using the AMD LLVM compilers
(`amdflang`, `amdclang`, `amdclang++`).

The `rocm` module also provides access to other AMD tools, such as
HIPIFY (`hipify-clang` or `hipify-perl` command), which enables CUDA
source code to be translated into portable HIP C++.
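For example, a CUDA source file (here the hypothetical `saxpy.cu`) can be translated
with the regex-based tool; this is only a sketch and the generated HIP code usually
still needs review by hand:

```
module load rocm
hipify-perl saxpy.cu > saxpy.hip.cpp
```

(`hipify-clang` performs a more rigorous clang-based translation but needs access to
CUDA headers, which may not be available on this system.)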
@@ -151,7 +151,7 @@ table:
| `PrgEnv-amd` | `amdclang` ||||
| `PrgEnv-amd` | `amdclang++` ||||
| `PrgEnv-cray` | `crayftn` ||||
| `PrgEnv-cray` | `craycc` ||||
| `PrgEnv-cray` | `crayCC` ||||
| `PrgEnv-gnu` | `gfortran` ||||
| `PrgEnv-gnu` | `gcc` ||||
@@ -170,7 +170,7 @@ module load craype-x86-milan
And then to use the `ftn`, `cc` and/or `CC` wrapper to compile as
appropriate for the programming model in question. Specific guidance
on how to do this for different programming models is provided in the
subsections below.

When deviating from this procedure and using underlying compilers
directly, or when debugging a problematic build using the wrappers, it
@@ -218,7 +218,7 @@ fail to compile with some compilers. Under `PrgEnv-cray` an explicit
description of supported OpenMP features can be viewed using the
command `man intro_openmp`.
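As a hedged illustration of an OpenMP offload build of a C code with the wrappers
(the file name is a placeholder, and some compilers accept `-homp` rather than
`-fopenmp` for Fortran):

```
cc -fopenmp my_openmp_code.c -o my_openmp_code.x
```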


#### HIP

To compile C or C++ code that uses HIP written specifically to offload
@@ -261,7 +261,7 @@ HIP code and non-HIP code, it is possible to use a different compiler than

- Compile the HIP code as above using `hipcc`
- Compile the non-HIP code using the compiler wrapper `CC` and a *different* PrgEnv than `PrgEnv-amd` loaded
- Link the resulting code objects (`.o` files) together using the compiler wrapper
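A minimal sketch of that workflow, assuming illustrative file names and that the
`rocm` module defines `ROCM_PATH`; additional ROCm libraries may need to be added at
the link step:

```
hipcc -c kernels.hip.cpp -o kernels.o                        # HIP code built with hipcc
CC -c main.cpp -o main.o                                     # non-HIP code built with the wrapper
CC kernels.o main.o -L${ROCM_PATH}/lib -lamdhip64 -o app.x   # link with the compiler wrapper
```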


#### OpenACC
@@ -279,7 +279,7 @@ module load craype-x86-milan
OpenACC Fortran code can then be compiled using the `-hacc` flag, as follows:

```
ftn -hacc source.f90
```

Details on what OpenACC standard and features are supported under
@@ -365,7 +365,7 @@ This supports GPU-GPU transfers:
- Inter-node via GPU-NIC RDMA
- Intra-node via GPU Peer2Peer IPC

Be aware that there are only two PCIe network cards in each node and they may not be attached to the same NUMA region as a given GPU.
Therefore NUMA effects are to be expected in multi-node communication. More detail on this is provided below.
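To use GPU-aware MPI at run time, HPE Cray MPICH typically also requires the following
variable to be set in the job environment (a sketch; check `man intro_mpi` on the
system for the authoritative setting):

```
export MPICH_GPU_SUPPORT_ENABLED=1
```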

### Libraries
@@ -374,13 +374,13 @@ In order to access the GPU-accelerated version of Cray's LibSci maths libraries,

`cray-libsci_acc`

With this module loaded, documentation can be viewed using the command `man intro_libsci_acc`.

Additionally a number of libraries are provided as part of the `rocm` module.

- [Math Libraries](https://rocm.docs.amd.com/en/docs-5.2.3/reference/gpu_libraries/math.html)
- [AI Libraries](https://rocm.docs.amd.com/projects/MIOpen/en/docs-5.2.3/index.html)
- [C++ primitive Libraries](https://rocm.docs.amd.com/en/docs-5.2.3/reference/gpu_libraries/c%2B%2B_primitives.html)
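As an example, a C code calling rocBLAS from the ROCm math libraries might be built
along these lines (a sketch only, assuming the `rocm` module defines `ROCM_PATH`; the
source file name is illustrative):

```
module load rocm
cc my_rocblas_code.c -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lrocblas -o my_rocblas_code.x
```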



@@ -404,10 +404,10 @@ from mpi4py import MPI

The ARCHER2 GPU development platform is intended for code development, testing and experimentation and will not have the centrally installed, supported versions of codes that are available on the standard ARCHER2 CPU compute nodes. However, some builds are being made available to users by members of the CSE team on a best-effort basis to support the community.

Codes that have modules targeting GPUs are:

!!! Note
This list will be filled out as applications are compiled and made available.

## Running jobs on the GPU nodes

@@ -454,8 +454,8 @@ QoS specifications are as follows.

| QoS | Max Nodes Per Job | Max Walltime | Jobs Queued | Jobs Running | Partition(s) | Notes |
| ---------- | ----------------- | ------------ | ----------- | ------------ | ------------ | ------|
| gpu-shd | 1 | 12 hr | 2 | 1 | gpu | Nodes potentially shared with other users |
| gpu-exc | 2 | 12 hr | 2 | 1 | gpu | Exclusive node access |
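As a quick illustration of how these QoS map onto job options (a minimal sketch only;
complete scripts are given in the examples below), a one-hour job on a shared GPU node
might request:

```
#SBATCH --partition=gpu
#SBATCH --qos=gpu-shd
#SBATCH --gpus=1
#SBATCH --time=01:00:00
```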

### Example job submission scripts

@@ -572,7 +572,7 @@ srun --ntasks=4 --cpus-per-task=8 \

!!! note
When you use the `--qos=gpu-exc` QoS you must also add the `--exclusive` flag
and then specify the number of nodes you want with `--nodes=1`.

### Multiple GPU on multiple nodes - exclusive node access (max. 8 GPU)

@@ -622,7 +622,7 @@ srun --ntasks=8 --cpus-per-task=8 \

!!! note
When you use the `--qos=gpu-exc` QoS you must also add the `--exclusive` flag
and then specify the number of nodes you want with the `--nodes` option (e.g. `--nodes=2` for the maximum of two GPU nodes).
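A hedged sketch of the corresponding resource request for the full two nodes is shown
below; the GPU count per node is an assumption and should be adjusted to the actual
node configuration:

```
#SBATCH --partition=gpu
#SBATCH --qos=gpu-exc
#SBATCH --exclusive
#SBATCH --nodes=2
#SBATCH --gpus=8          # assuming four GPUs per node
#SBATCH --time=01:00:00
```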

### Interactive jobs

@@ -633,8 +633,8 @@ srun --ntasks=8 --cpus-per-task=8 \
want an interactive shell on the GPU compute nodes, see the `srun` method described
below.

If you wish to have a terminal to perform interactive testing, you can
use the `salloc` command to reserve the resources so you can use `srun` commands interactively.
For example, to request 1 GPU for 20 minutes you would use (remember to replace `t01` with your
budget code):

@@ -653,8 +653,8 @@ auser@ln04:/work/t01/t01/auser> srun rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 31.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
@@ -666,8 +666,8 @@ auser@ln04:/work/t01/t01/auser> module load xthi
auser@ln04:/work/t01/t01/auser> srun --ntasks=1 --cpus-per-task=8 --hint=nomultithread xthi
Node summary for 1 nodes:
Node 0, hostname nid200001, mpi 1, omp 1, executable xthi
MPI summary: 1 ranks
Node 0, rank 0, thread 0, (affinity = 0-7)
```

#### Using `srun`
@@ -680,7 +680,7 @@ would use (remember to replace `t01` with your budget code):
auser@ln04:/work/t01/t01/auser> srun --gpus=1 --time=00:20:00 --partition=gpu --qos=gpu-shd --account=t01 --pty /bin/bash
srun: job 5335771 queued and waiting for resources
srun: job 5335771 has been allocated resources
auser@nid200001:/work/t01/t01/auser>
```

Note that the command prompt has changed to indicate we are now on a GPU compute node. You can now directly run commands
@@ -691,8 +691,8 @@ auser@nid200001:/work/t01/t01/auser> rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 29.0c 43.0W 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
```
@@ -702,7 +702,7 @@ GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
use job submission scripts or the `salloc` method of interactive use described above.


### Environment variables



@@ -769,7 +769,7 @@ Runtime : HIP Runtime. Applies only to applications using HIP on the AMD platform

##### AMD_SERIALIZE_KERNEL

To serialize the kernel enqueuing, set the following variable:

`export AMD_SERIALIZE_KERNEL=1`

Expand Down Expand Up @@ -1125,7 +1125,7 @@ More commands can be found by running,

`rocm-smi --help`

on the login nodes to get more information about probing the GPUs.

More detail can be found [here](https://github.com/ROCm/rocm_smi_lib/tree/rocm-5.2.3/python_smi_tools).

@@ -1149,7 +1149,4 @@ available on the Frontier exascale system:
- [Lumi docs](https://docs.lumi-supercomputer.eu/hardware/lumig/)
- [rocm-examples](https://github.com/amd/rocm-examples/tree/develop)
- [hip-examples](https://github.com/ROCm/HIP-Examples/tree/rocm-5.2.x)
- [hello-jobstep](https://code.ornl.gov/olcf/hello_jobstep)
8 changes: 4 additions & 4 deletions docs/user-guide/scheduler.md
@@ -222,8 +222,8 @@ lists the active QoS on ARCHER2.
| serial | 32 cores and/or 128 GB memory | 24 hrs | 32 | 4 | serial | Jobs not charged but requires at least 1 CU in budget to use. Maximum of 32 cores and/or 128 GB in use by any one user at any time. |
| reservation | Size of reservation | Length of reservation | No limit | no limit | standard | |
| capabilityday | At least 4096 nodes | 3 hrs | 8 | 2 | standard | Minimum job size of 512 nodes. Jobs only run during [Capability Days](#capability-days) |
| gpu-shd | 1 | 12 hrs | 2 | 1 | gpu | GPU nodes potentially shared with other users |
| gpu-exc | 2 | 12 hrs | 2 | 1 | gpu | Exclusive GPU node access |

You can find out the QoS that you can use by running the following
command:
@@ -433,7 +433,7 @@ that they wish to charge the job to with the option:
to in SAFE.

!!! important
You **must** specify an account code for your job otherwise it will
fail to submit with the error: `sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified`.
(This error can also mean that you have specified a budget that
has run out of resources.)
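In a job script this is simply an `#SBATCH` directive, for example (with `t01` as a
placeholder budget code):

```
#SBATCH --account=t01
```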
@@ -2298,7 +2298,7 @@ charge with the following limits:
- 2 jobs maximum running per user
- Users must have a valid, positive CU budget to be able to run jobs during Capability Days

Users wishing to run jobs during Capability Day should submit to the `capabilityday` QoS. Jobs can be
submitted ahead of time and will start when the Capability Day starts.

### Example Capability Day job submission script
