From 9c80197a1c97d3c4bd4d0538bc1f671c5c1fde6a Mon Sep 17 00:00:00 2001
From: Nathan Rockershousen
Date: Tue, 7 Jan 2025 13:29:52 -0600
Subject: [PATCH] TECHPUBS-4669: Add PML recovery docs and fix table formatting (#108)

* TECHPUBS-4669: PML recovery docs

* TECHPUBS-4669: support matrix fix

* fix versions

Signed-off-by: prudvi-danda

---------

Signed-off-by: prudvi-danda
Co-authored-by: prudvi-danda
---
 ...stallation_and_Configuration_Guide.ditamap | 1 +
 .../developer-portal/install/pml_recovery.md | 20 +++++
 .../release_notes/support_matrix.md | 74 +++++++++----------
 3 files changed, 57 insertions(+), 38 deletions(-)
 create mode 100644 docs/portal/developer-portal/install/pml_recovery.md

diff --git a/docs/portal/developer-portal/HPE_Slingshot_Host_Software_Installation_and_Configuration_Guide.ditamap b/docs/portal/developer-portal/HPE_Slingshot_Host_Software_Installation_and_Configuration_Guide.ditamap
index c206ad2..e261b6c 100644
--- a/docs/portal/developer-portal/HPE_Slingshot_Host_Software_Installation_and_Configuration_Guide.ditamap
+++ b/docs/portal/developer-portal/HPE_Slingshot_Host_Software_Installation_and_Configuration_Guide.ditamap
@@ -56,4 +56,5 @@
+
diff --git a/docs/portal/developer-portal/install/pml_recovery.md b/docs/portal/developer-portal/install/pml_recovery.md
new file mode 100644
index 0000000..951ff7d
--- /dev/null
+++ b/docs/portal/developer-portal/install/pml_recovery.md
@@ -0,0 +1,20 @@
+# Configure PML recovery
+
+Starting in the Slingshot Host Software (SHS) 11.1.0 release, PML recovery is supported for edge links. Edge links of earlier SHS versions will flap instead of recovering.
+
+PML recovery is disabled by default and must be enabled on the fabric before configuring it on the host. See the [PML recovery on the fabric](#./pml-recovery-on-the-fabric) section.
+
+PML recovery enables links to recover from transient faults that would otherwise cause the link to flap, without packet loss and only a brief delay in transmission. This can stabilize the fabric by reducing occasional random disruptions.
+
+Links that frequently PML recover require hardware action like repeat flappers. Monitor `HsnPmlRecoveryDetected` and `HsnLinkFlapDetected` Redfish events to identify maintenance candidates. See the "PML recovery summary events" section of the _HPE Slingshot Administration Guide_ for more information.
+
+Recovery must be enabled on both ends of a link. Use `ethtool` to enable it on the host as follows.
+
+```screen
+ethtool --set-priv-flags disable-pml-recovery off
+```
+
+## PML recovery on the fabric
+
+PML recovery is available for the HPE Slingshot fabric.
+See the "Configure PML recovery" section in the _HPE Slingshot Installation Guide_ for the environment you are installing.
diff --git a/docs/portal/developer-portal/release_notes/support_matrix.md b/docs/portal/developer-portal/release_notes/support_matrix.md
index d4d8886..adaa046 100644
--- a/docs/portal/developer-portal/release_notes/support_matrix.md
+++ b/docs/portal/developer-portal/release_notes/support_matrix.md
@@ -12,52 +12,50 @@ Advisory: older platform targets (i.e. SLE 15 SP3, COS 2.4, CSM 1.3, RHEL 8.5) a
 
 ## Fabric Manager and HPE Slingshot Host Software Release Compatibility
 
-X-Axis: Fabric Manager + Switch Agent version
+In the following table, **FM/SA Version** stands for the Fabric Manager and Switch Agent version.
 
-Y-Axis: SHS version
+| SHS Version | FM/SA Version | FM/SA Version | FM/SA Version |
+|:--------------:|:-------------:|:-------------:|:-------------:|
+| | **2.1.3** | **2.2.0** | **2.3.0** |
+| **2.1.3** | R-A-B | Supported | Supported |
+| **2.2.0** | Supported\*\* | R-A-B | Supported |
+| **SHS-11.0.2** | Supported\*\* | Supported | Supported |
+| **SHS-11.1.0** | Supported\*\* | Supported | R-A-B |
 
-| | 2.1.0 | 2.1.1 | 2.2.0 |
-| :--------: | :-----------: | :-----------: | :-------: |
-| 2.1.0 | R-A-B | Supported | Supported |
-| 2.1.2 | Supported | R-A-B | Supported |
-| 2.2.0 | Supported\*\* | Supported\*\* | R-A-B |
-| SHS-11.0.0 | Supported\*\* | Supported\*\* | Supported |
-| SHS-11.1.0 | Supported\*\* | Supported\*\* | Supported |
-
-\*\* The combination is supported, but the FMN features introduced in version 2.2.0 will not be available.
+\*\* The combination is supported, but the new FMN features added in later versions will not be available.
 
 **KEY:**
 
 | Label | Meaning |
-| ------------ | ---------------------------- |
+|--------------|------------------------------|
 | R-A-B | Release As Bundle, Supported |
 | Supported | Supported |
 | Incompatible | Incompatible |
 
 ## AMD ROCM and Nvidia CUDA Versions
 
-| Distribution | Versions | ROCM Version | CUDA Version | Nvidia SDK |
-| ------------------------ | --------------------------------- | ------------ | ------------ | ---------- |
-| Red Hat Enterprise Linux | 8.9 | 6.0.0 | 535.154.05 | 23.11 |
-| Red Hat Enterprise Linux | 8.10 | 6.1.0 | 550.54.15 | 24.03 |
-| Red Hat Enterprise Linux | 9.4 | 6.1.0 | 550.54.15 | 24.03 |
-| Red Hat Enterprise Linux | 9.4 ARM | NA | 550.54.15 | 24.03 |
-| SuSE Linux Enterprise 15 | SP4 | 5.5.1 | 525.105.17 | 23.03 |
-| SuSE Linux Enterprise 15 | SP5 | 6.1.0 | 550.54.15 | 24.03 |
-| SuSE Linux Enterprise 15 | SP5 ARM | NA | 550.54.15 | 24.03 |
-| SuSE Linux Enterprise 15 | SP6 | 6.2.1 | 550.90.07 | 24.07 |
-| SuSE Linux Enterprise 15 | SP6 ARM | NA | 550.90.07 | 24.07 |
-| Cray Operating System | 2.5 | 5.5.1 | 525.105.17 | 23.03 |
-| Cray Operating System | COS 24.07.x w/ COS Base 3.1 | 6.1.0 | 550.54.15 | 24.03 |
-| Cray Operating System | COS 24.07.x w/ COS Base 3.1 ARM | NA | 550.54.15 | 24.03 |
-| Cray Operating System | COS 24.11.x w/ COS Base 3.2 | 6.2.1 | 550.90.07 | 24.07 |
-| Cray Operating System | COS 24.11.x w/ COS Base 3.2 ARM | NA | 550.90.07 | 24.07 |
+| Distribution | Versions | ROCM Version | CUDA Version | Nvidia SDK |
+|--------------------------|---------------------------------|--------------|--------------|------------|
+| Red Hat Enterprise Linux | 8.9 | 6.0.0 | 535.154.05 | 23.11 |
+| Red Hat Enterprise Linux | 8.10 | 6.1.0 | 550.54.15 | 24.03 |
+| Red Hat Enterprise Linux | 9.4 | 6.1.0 | 550.54.15 | 24.03 |
+| Red Hat Enterprise Linux | 9.4 ARM | NA | 550.54.15 | 24.03 |
+| SuSE Linux Enterprise 15 | SP4 | 5.5.1 | 525.105.17 | 23.03 |
+| SuSE Linux Enterprise 15 | SP5 | 6.1.0 | 550.54.15 | 24.03 |
+| SuSE Linux Enterprise 15 | SP5 ARM | NA | 550.54.15 | 24.03 |
+| SuSE Linux Enterprise 15 | SP6 | 6.2.1 | 550.90.07 | 24.07 |
+| SuSE Linux Enterprise 15 | SP6 ARM | NA | 550.90.07 | 24.07 |
+| Cray Operating System | 2.5 | 5.5.1 | 525.105.17 | 23.03 |
+| Cray Operating System | COS 24.07.x w/ COS Base 3.1 | 6.1.0 | 550.54.15 | 24.03 |
+| Cray Operating System | COS 24.07.x w/ COS Base 3.1 ARM | NA | 550.54.15 | 24.03 |
+| Cray Operating System | COS 24.11.x w/ COS Base 3.2 | 6.2.1 | 550.90.07 | 24.07 |
+| Cray Operating System | COS 24.11.x w/ COS Base 3.2 ARM | NA | 550.90.07 | 24.07 |
 
 ## NIC Support
 
 | Distribution | Versions | Mellanox NIC | Mellanox Version | HPE Slingshot Ethernet 200Gb |
-| ------------------------ | ------------------------------- | ------------ | ---------------- | ---------------------------- |
+|--------------------------|---------------------------------|--------------|------------------|------------------------------|
 | Red Hat Enterprise Linux | 8.9 | Yes | 23.10-3.2.2.0 | Yes |
 | Red Hat Enterprise Linux | 8.10 | Yes | 23.10-3.2.2.0 | Yes |
 | Red Hat Enterprise Linux | 9.4 | Yes | 23.10-3.2.2.0 | Yes |
@@ -77,10 +75,10 @@ Y-Axis: SHS version
 
 _**Mellanox External Vendor Software**_
 
-| Name | Contains | Typical Install Target | Recommended Version | URL |
-| -------------------------- | ------------------------------------------- | --------------------------------------- | ------------------- | ---------------------------------------------------------------------------------------------- |
-| Mellanox OFED distribution | Mellanox Networking Software Stack | all compute nodes and user access nodes | Listed Above | [Mellanox OFED download](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) |
-| Mellanox Device Firmware | Mellanox NIC Firmware | all compute nodes | 16.32.1010 | Contact your Support or account team to obtain the recommended firmware |
+| Name | Contains | Typical Install Target | Recommended Version | URL |
+|----------------------------|------------------------------------|-----------------------------------------|---------------------|------------------------------------------------------------------------------------------------|
+| Mellanox OFED distribution | Mellanox Networking Software Stack | all compute nodes and user access nodes | Listed Above | [Mellanox OFED download](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) |
+| Mellanox Device Firmware | Mellanox NIC Firmware | all compute nodes | 16.32.1010 | Contact your Support or account team to obtain the recommended firmware |
 
 ## Libfabric Versions
 
@@ -91,15 +89,15 @@ All the Distributions provided with libfabric 1.22.x.
 The following cluster manager software compatibility information is for reference.
 For the most up-to-date compatibility details, see the "CSM Software Compatibility Matrix Version" section of the _HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052)_ and the "2.2 Operating System Support" section in the _HPE Performance Cluster Manager Release Notes_.
 
-| Cluster Management | Versions Supported |
-| --------------------------- | ------------------------- |
-| HPE Cray EX System Software | 1.4.X, 1.5.X, 1.6.X |
-| HPE Performance Cluster Manager (HPCM) | 1.11, 1.12 |
+| Cluster Management | Versions Supported |
+|----------------------------------------|---------------------|
+| HPE Cray EX System Software | 1.4.X, 1.5.X, 1.6.X |
+| HPE Performance Cluster Manager (HPCM) | 1.11, 1.12 |
 
 _**Compute Node Image and Cluster Management Software Compatibility**_
 
 | Distribution | Versions | Cray EX(CSM) | HPCM |
-| ------------------------ | ------------------------------- | ------------ | ----- |
+|--------------------------|---------------------------------|--------------|-------|
 | Red Hat Enterprise Linux | 8.9 | NA | 1.10+ |
 | Red Hat Enterprise Linux | 8.10 | NA | 1.11+ |
 | Red Hat Enterprise Linux | 9.3 ARM | NA | 1.10+ |