TECHPUBS-4452: HPE Slingshot Host Software User Guide (#101)
* TECHPUBS-4452: HPE Slingshot Host Software User Guide

* TECHPUBS-4452: grammar, wording, and review feedback
nrockershousen authored and GitHub Enterprise committed Dec 10, 2024
1 parent 1353b7e commit 7a82ee6
Showing 37 changed files with 978 additions and 3 deletions.
34 changes: 34 additions & 0 deletions .spelling
Original file line number Diff line number Diff line change
Expand Up @@ -20,20 +20,29 @@
200gb
200Gbps
2-port
4KiB
802.1Q
ack
ACKs
acknowledgment
acknowledgments
adminStatus
AMA
AMAs
amdgpu
AMOs
aarch64
AERs
all2all
analytics
analyze
analyzing
Ansible
API
APIs
arm64
Arista
artifact
Artifactory
ASIC
ASIC_0
Expand Down Expand Up @@ -61,6 +70,7 @@ BOS
bootprep
catalog
Cassini
Center
Ceph
CFS
CFS-based
Expand Down Expand Up @@ -126,6 +136,7 @@ deallocation
debugfs
default.yml
defragmented
deregistering
dgnettest
diags
diskless
Expand Down Expand Up @@ -162,6 +173,7 @@ EC_TRNSNT_S
EC_UNCOR_NS
EC_UNCOR_S
EPEL
enablement
ENOMEM
ENOENT
eth0
Expand All @@ -174,6 +186,7 @@ EX235a
failback
Failback
failover
favoring
fi_info
fi_pingpong
Flavored
Expand All @@ -183,6 +196,7 @@ GbE
Gbps
Gen4
gc_thresh
GDRCopy
Git
Gitea
GPCNeT
Expand All @@ -197,6 +211,7 @@ HealthCheck
HeartBeat
heatsink
highpriority
high_empty
hodagd
honor
honoring
Expand All @@ -216,10 +231,13 @@ hsn_traffic
hugepages
Hugepages
Jenkinsfile
IB-over-Ethernet
IBM
ifcfg
IMAGE_NAME
image.rpmlist
incast
inflight
initramfs
initrd
int
Expand All @@ -242,6 +260,7 @@ Keycloak
keycloak_group
keycloak_passwd
keypair
Kfabric
kfi
kfi1
kfi2
Expand All @@ -255,6 +274,8 @@ libcxi
Libfabric
libfabric
libfabric-devel
Libfabric-to-NCCL
Libfabric-to-RCCL
libpals
limits
Linux
Expand All @@ -270,13 +291,15 @@ LNM
Loadbalance
loadbalance
localtime
lockup
LOG_DEBUG
LogLevelMax
LOG_NOTICE
LOG_INFO
LOG_WARN
Loopback
loopback
low_empty
low-noise-mode
lownoise-service
Lua
Expand All @@ -296,16 +319,19 @@ Mellanox
memhooks
Memhooks
metadata
misconfigurations
modprobe
mountpoints
MOFED
MPI
MPI-3
MPI-3.1
MPIR
MpiDefault
MpiParams
mpiexec
mpirun
MRs
msr-safe
munged*
multisocket
Expand Down Expand Up @@ -352,6 +378,7 @@ nil
node-identity
nodename
non-CFS
non-HPE
non-VLAN
nonprivileged
Nonprivileged
Expand All @@ -362,6 +389,7 @@ NVIDIA
ntp
OData
ogopogod
onloaded
OOM
OPA
OpenMPI
Expand All @@ -387,6 +415,7 @@ Podman
PMI
PMIx
pmix
preemptive
prepended
prepends
ProLiant
Expand Down Expand Up @@ -444,6 +473,7 @@ serdes
SEQUENCE_ERROR
shadow
SharePoint
Shmem
SHS
shs-docs
shs-version
Expand Down Expand Up @@ -486,6 +516,7 @@ TCTs
TEMPLATE_NAME
tmpfs
TLV
toolkits
traceback
tunable
tunables
Expand Down Expand Up @@ -547,6 +578,8 @@ Zypper
2.x
3.x
4.x
5.0.x
5.1.2
cos-2.x
sle15spx
SSHOT1.2.1
Expand All @@ -571,6 +604,7 @@ S-9009
S-9010
S-9011
S-9012
S-9929


#
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">
<map id="shs_relnotes">
<title>HPE Slingshot Host Software User Guide (S-9014) (@product_version@)</title>
<topicmeta>
<shortdesc>This publication describes user procedures for SHS.</shortdesc>
<data name="pubsnumber" value="S-9014"></data>
<data name="edition" value="SHS Software Release @product_version@"></data>
</topicmeta>
<topicref href="VeRsIoN.md" format="mdita" />
<topicref href="user/about.md" format="mdita" />
<topichead>
<topicmeta>
<navtitle>HPE Slingshot NIC overview</navtitle>
</topicmeta>
<topicref href="user/hardware_overview.md" format="mdita" />
<topicref href="user/libfabric_and_the_hpe_slingshot_nic_offloads.md" format="mdita" />
<topicref href="user/memory_registration.md" format="mdita" />
<topicref href="user/hardware_offload_capabilities.md" format="mdita" />
<topicref href="user/software_architecture.md" format="mdita" />
<topicref href="user/performance_counters.md" format="mdita" />
<topicref href="user/ip_networking_considerations.md" format="mdita" />
</topichead>
<topichead>
<topicmeta>
<navtitle>HPE Slingshot NIC Libfabric</navtitle>
</topicmeta>
<topicref href="user/user_configurable_libfabric_environment_variables.md" format="mdita">
<topicref href="user/rdma_messaging_and_relationship_to_environment_settings.md" format="mdita"/>
<topicref href="user/memory_cache_monitor_settings.md" format="mdita"/>
<topicref href="user/endpoint_receive_size_attribute.md" format="mdita"/>
<topicref href="user/endpoint_transmit_size_attribute.md" format="mdita"/>
<topicref href="user/completion_queue_size_attribute.md" format="mdita"/>
<topicref href="user/expected_number_of_ranks_and_peers.md" format="mdita"/>
<topicref href="user/tag_matching_mode_settings.md" format="mdita"/>
<topicref href="user/rendezvous_protocol_configuration.md" format="mdita"/>
</topicref>
<topicref href="user/debug_performance_and_failure_issues.md" format="mdita"/>
</topichead>
<topicref href="user/application_software_overview.md" format="mdita">
<topicref href="user/hpe_cray_programming_environment.md" format="mdita"/>
<topicref href="user/nccl.md" format="mdita"/>
<topicref href="user/rccl.md" format="mdita"/>
<topicref href="user/intel_mpi.md" format="mdita"/>
<topicref href="user/daos.md" format="mdita"/>
<topicref href="user/openmpi.md" format="mdita"/>
</topicref>
<topichead>
<topicmeta>
<navtitle>Appendix</navtitle>
</topicmeta>
<topicref href="user/hpe_slingshot_nic_rdma_protocol_and_traffic_classes.md" format="mdita"/>
<topicref href="user/ip_performance_and_configuration_settings.md" format="mdita"/>
<topicref href="user/memory_registration_and_cache_monitors.md" format="mdita"/>
<topicref href="user/libfabric_runtime_configurable_parameters.md" format="mdita"/>
</topichead>
</map>
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
{
"content_type": "htmlzip",
"content_class": "html-default",
"source_system": "git",
"source_system_id": "https://github.com/Cray-HPE/docs-shs",
"source_system_version": "@docs_git_hash@",
"lifecycle": "DRAFT",
"products": ["1013247219","1013083813"],
"product_version": "@product_version@",
"full_title": "HPE Slingshot Host Software User Guide (S-9014) @product_version@",
"description": "This publication includes user procedures for SHS.",
"language_code": "en_US",
"submitter": "nathan.rockershousen@hpe.com",
"company_info": "HPE-green",
"customer_available_date": "",
"content_org": "CMG708"
}
1 change: 1 addition & 0 deletions docs/portal/developer-portal/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,7 @@ dir: $(TMPDIR)/VeRsIoN.md
cp -r performance $(TMPDIR)
cp -r operations $(TMPDIR)
cp -r overview $(TMPDIR)
cp -r user $(TMPDIR)
cp -r images $(TMPDIR)


Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ Each of the traditional monitors has advantages and disadvantages. Memhooks are
By default, HPE Slingshot uses the memhooks monitor unless set otherwise with the appropriate Libfabric environment variable. Also, HPE guides to select userfaultfd for applications that use NCCL or RCCL collectives libraries as they can hang at scale under memhooks.

To overcome many of the previously described limitations as well as avoiding the need to configure this per-application, HPE introduced kdreg2 as a third memory cache monitor. kdreg2 is provided as a Linux kernel module and uses an open-source licensing model.
As of the date of this note, it ships in the HPE Slingshot host software distribution and is optionally installed.
As of the date of this note, it ships in the HPE Slingshot Host Software distribution and is optionally installed.
Future releases may install this by default, and eventually HPE expects HPE Slingshot NIC Libfabric provider to select kdreg2 by default instead of memhooks.

kdreg2 uses kernel mechanisms to monitor mapping changes and provides synchronous notification to the memory registration cache. It can report changes at the byte level to any memory within the application’s virtual address space.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,7 @@ echo "e4000003,80000000" > /sys/class/cxi/cxi0/device/err_flgs_irqa/hni_pml/no_p

## `tct_tbl_dealloc` errors

This error occurs when, under certain conditions, the HPE Slingshot host software stack does not take proper precautions to prevent the HPE Slingshot 200 GbE NIC from entering an error state. An example of such a condition that may initiate this error - is a fabric event causing packet transfers to be significantly delayed. Normal NIC and fabric operation is not expected to initiate this error.
This error occurs when, under certain conditions, the HPE Slingshot Host Software stack does not take proper precautions to prevent the HPE Slingshot 200 GbE NIC from entering an error state. An example of such a condition that may initiate this error - is a fabric event causing packet transfers to be significantly delayed. Normal NIC and fabric operation is not expected to initiate this error.

The following is an example of this error:

Expand Down
9 changes: 9 additions & 0 deletions docs/portal/developer-portal/user/about.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# About this publication

This document provides an overview of the HPE Slingshot NIC software environment
for application users. It includes background information on the "theory of operations" to offer context for product configuration and troubleshooting. This document supplements the configuration and troubleshooting information found in the product documentation.

Tuning guidance discussed here is specific to each system or application, so consider your intended application workload and system configuration. For example, the HPE Cray Programming Environment runtime middleware (MPI and SHMEM) sets default values, as detailed in this document and in the Cray PE documentation.
Users may need to adjust settings for non-HPE Cray Software, such as open-source MPI stacks that may not have tuned values, and for specific applications.

Default environment settings are rarely changed to avoid unintended impacts during upgrades. Therefore, users are encouraged to evaluate whether adjusting environment variables will improve performance. Tuning environment settings is also useful when a specific application is failing or running slowly.
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Configure application software

This section provides guidance on configuring applications for the HPE Slingshot NIC using the environment variables previously described. It shares best-known methods from HPE and other users.
The needs of specific applications with specific data sets may vary from these guidelines.
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Completion queue size attribute (`FI_CXI_DEFAULT_CQ_SIZE`)

This variable specifies the maximum number of entries in the CXI provider completion queue. This is used for various software and hardware event queues to generate Libfabric completion events.
While the size of the software queues may grow dynamically, hardware event queue sizes are static. If the hardware event queue is undersized, it fills more quickly than expected, and the next operation targeting a full event queue results in the message operation being dropped and flow control being triggered. Recovering from flow control requires expensive, side-band messaging internal to the CXI provider, which can appear as a lockup to the user.

The provider default is 1024. Users are encouraged to set the completion queue size attribute based on the expected number of inflight RDMA operations to and from a single endpoint. An application or middleware layer, such as MPI, can set this value to override the provider default.
The default CXI provider value is sized to handle the sum of the TX and RX default values, and it must not be below the sum of the TX and RX values if they have been changed from the default. Cray MPI sets this value to a default size of 131072.
This size is partially an artifact of wanting to prevent a condition in earlier versions of the `cxi` provider in which overflowing the buffer could cause lockups.
This is no longer the case. Instead, overflowing the buffer causes slower performance because it triggers flow control.

The impact of sizing this too high is reserving extra host memory that may ultimately be unnecessary.
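
The sizing rule above (the completion queue must not be smaller than the sum of the TX and RX sizes) can be sketched as a small helper. This is an illustrative sketch only, not part of the CXI provider: the function names `cq_size_for` and `cq_env` are hypothetical, and using the provider default of 1024 as a floor is an assumption made for the example.

```python
import os

# Provider default stated in this section.
PROVIDER_DEFAULT_CQ_SIZE = 1024


def cq_size_for(tx_size, rx_size, floor=PROVIDER_DEFAULT_CQ_SIZE):
    """Return a completion queue size that is at least tx_size + rx_size.

    The floor (assumed here to be the provider default) keeps the result
    from dropping below the default when small TX/RX sizes are used.
    """
    if tx_size <= 0 or rx_size <= 0:
        raise ValueError("queue sizes must be positive")
    return max(floor, tx_size + rx_size)


def cq_env(tx_size, rx_size, base_env=None):
    """Return a copy of the environment with consistent TX, RX, and CQ sizes."""
    env = dict(os.environ if base_env is None else base_env)
    env["FI_CXI_DEFAULT_TX_SIZE"] = str(tx_size)
    env["FI_CXI_DEFAULT_RX_SIZE"] = str(rx_size)
    env["FI_CXI_DEFAULT_CQ_SIZE"] = str(cq_size_for(tx_size, rx_size))
    return env
```

For example, raising the TX size to 1024 while leaving the RX size at 512 yields a completion queue size of 1536 under these assumptions. The returned dictionary can be passed as the `env` argument when launching the application.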
12 changes: 12 additions & 0 deletions docs/portal/developer-portal/user/daos.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# DAOS

Intel DAOS sets the following environment variables for compatibility with the HPE Slingshot Host Software (SHS) stack.

- `setenv("CRT_MRC_ENABLE","1")`
- `setenv("FI_CXI_OPTIMIZED_MRS","0")`
- `setenv("FI_CXI_RX_MATCH_MODE","hybrid")`
- `setenv("FI_MR_CACHE_MONITOR","memhooks")`
- `setenv("FI_CXI_REQ_BUF_MIN_POSTED","8")`
- `setenv("FI_CXI_REQ_BUF_SIZE","8388608")`
- `setenv("FI_CXI_DEFAULT_CQ_SIZE","131072")`
- `setenv("FI_CXI_OFLOW_BUF_SIZE","8388608")`
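
The list above can also be captured as data, for instance to compare a job environment against it or to reproduce the same settings in a launcher script. The following is a minimal sketch and not DAOS code: the names `DAOS_SHS_ENV` and `with_daos_settings` are hypothetical, and letting values already present in the environment take precedence is a design assumption for the example.

```python
import os

# Values copied from the list above.
DAOS_SHS_ENV = {
    "CRT_MRC_ENABLE": "1",
    "FI_CXI_OPTIMIZED_MRS": "0",
    "FI_CXI_RX_MATCH_MODE": "hybrid",
    "FI_MR_CACHE_MONITOR": "memhooks",
    "FI_CXI_REQ_BUF_MIN_POSTED": "8",
    "FI_CXI_REQ_BUF_SIZE": "8388608",
    "FI_CXI_DEFAULT_CQ_SIZE": "131072",
    "FI_CXI_OFLOW_BUF_SIZE": "8388608",
}


def with_daos_settings(base_env=None):
    """Return a copy of the environment with the settings above filled in.

    Variables that are already set are left untouched (an assumption made
    here so that explicit user choices win over these values).
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in DAOS_SHS_ENV.items():
        env.setdefault(name, value)
    return env
```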
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
# Debug performance and failure issues

This section describes how to debug applications once a fabric is considered operational.

When a fabric is first being brought up and applications are failing, there can be many issues related to either the network or the host. Transient network failures can impact applications, but debugging whether that is the cause of the application failure is not covered here in depth.
For example, if flapping links are causing an application to fail, use the link debugging procedures instead.

## Prerequisites

- AMAs must be assigned to every NIC as is done at boot up.
- TCP communication must be working. Even for RDMA communications, the job scheduler and MPI use TCP/IP to set up connections. If a system is being set up, TCP failures can relate to Linux misconfigurations in the ARP cache, static ARP tables, or missing routing rules that should have been set up using `ifroute` during boot up (for nodes with more than one NIC).
- VNI job configuration must be enabled unless the system is running with the “default” `cxi-service`.
- For systems with GPUs, there is a matched set of GPU drivers and programming toolkits for each version of the `cxi` driver as documented in the release notes. Install the GDRCopy library for NVIDIA GPUs.

## Debug steps

The following is a high-level list of actions that can be taken to debug applications:

- Check the _HPE Slingshot Host Software Release Notes_ for known issues or resolved issues. If not running the latest release, check the release notes for the releases that came after the running system.
- Run the application with Libfabric logging, `FI_LOG_LEVEL=warn` and `FI_LOG_PROV=cxi`. The resulting logs provide guidance and will greatly aid the teams in responding to support tickets.
- For memory registration related issues, try running with the `kdreg2` memory cache monitor to see if the issue relates to the choice of memory cache monitor. Alternatively, disable memory registration caching altogether. If an application that was deadlocking then runs instead of locking up, this points to tuning the memory registration cache settings.
- If failures are being caused by hardware matching resource exhaustion, try setting the matching mode to hybrid.
- For general concerns about resource exhaustion when not running Cray MPI, try setting the size-related environment variables to larger values. Using the Cray MPI settings described in this guide plus setting the matching mode to hybrid helps determine whether the default settings are too small for the system or application. If so, subsequent testing can help tune the sizes when avoiding unneeded memory consumption is desired.
- If the application performs differently after an upgrade of the HPE Slingshot Host Software, try running with the previous version of the Libfabric user space libraries, or even a more recent version of the Libfabric libraries. This might be easier for a user to try than building a new host image. It is possible that this combination will not work; ask the HPE support team whether there are any known incompatibilities. Mixing and matching versions is not always an officially tested or supported combination, but it can be helpful in debugging and is sometimes perfectly fine in production.
- Try the alternative rendezvous protocol. If the application uses large messages and performance is extremely slow, following the instructions for the alternative rendezvous protocol may be a useful debug step.
- Collect the NIC counters for an application. See the _HPE Cray Cassini Performance Counters User Guide (S-9929)_ on the [HPE Support Center](https://support.hpe.com/connect/s/?language=en_US) for details.
Counters are collected with Cray MPI, Libfabric, `sysfs`, or LDMS; different deployments use different strategies. Some of these counters are the same as those that can be collected on the switch port, but they are easier for the user to access. They can reveal issues such as PCIe congestion, network congestion (pause exertions), and other factors. They are also of great use to the support teams in responding to tickets.
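
Several of the steps above amount to rerunning the application with a modified environment. The following sketch shows one way to script that. It is illustrative only: the helper names `debug_env` and `run_with_debug` are hypothetical, and it assumes that the installed Libfabric accepts `kdreg2` as a value of `FI_MR_CACHE_MONITOR` (which requires the kdreg2 kernel module to be installed).

```python
import os
import subprocess


def debug_env(base_env=None, *, kdreg2=False, hybrid_match=False):
    """Return a copy of the environment with Libfabric debug settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    # Libfabric logging, restricted to the cxi provider.
    env["FI_LOG_LEVEL"] = "warn"
    env["FI_LOG_PROV"] = "cxi"
    if kdreg2:
        # Assumption: this value is accepted by the installed Libfabric.
        env["FI_MR_CACHE_MONITOR"] = "kdreg2"
    if hybrid_match:
        env["FI_CXI_RX_MATCH_MODE"] = "hybrid"
    return env


def run_with_debug(cmd, **options):
    """Run cmd (a list of arguments) with the debug environment.

    Returns the exit code of the command.
    """
    return subprocess.run(cmd, env=debug_env(**options), check=False).returncode
```

A call such as `run_with_debug(["./my_app"], hybrid_match=True)`, where `./my_app` is a placeholder for the real launch command, reruns the application with logging enabled and hybrid matching selected, so that each debug step changes only one setting at a time.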
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Endpoint receive size attribute (`FI_CXI_DEFAULT_RX_SIZE`)

This attribute sizes the internal receive command and hardware event queues at job startup. Users are encouraged to set the endpoint receive size attribute based on the number of outstanding receive buffers being posted. The primary benefit of changing from the default setting comes when running in hybrid match mode, which is more common with HPE Slingshot release 2.1.1 and later.
See the section on [Tag matching mode settings](tag_matching_mode_settings.md#tag-matching-mode-settings-fi_cxi_rx_match_mode) for more information.

The current default is 512 (which is not changed by Cray MPI). Over-specifying this value can consume more memory, while under-specifying it can cause flow control to be exerted, which reduces performance. When running in “hybrid mode” (see [Tag matching mode settings](tag_matching_mode_settings.md#tag-matching-mode-settings-fi_cxi_rx_match_mode)), over-specifying the number of hardware receive buffers will force other processes to use a software endpoint.

Libfabric allows applications to suggest a receive attribute size in the `fi_info hints` specific to an application.
If explicitly set, the `cxi` provider will use the size specified rather than the value of this environment variable.
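
The precedence described above (an explicit application hint wins over the environment variable, which wins over the provider default) can be modeled as follows. This is a simplified model for reasoning about which value applies, not the provider's implementation; the function name `effective_rx_size` is hypothetical.

```python
import os

# Provider default stated in this section.
PROVIDER_DEFAULT_RX_SIZE = 512


def effective_rx_size(hint_size=None, env=None):
    """Model which receive size applies to an endpoint.

    hint_size: size the application set explicitly in its hints, or None.
    env: mapping to read FI_CXI_DEFAULT_RX_SIZE from (defaults to os.environ).
    """
    env = os.environ if env is None else env
    if hint_size is not None:
        return hint_size
    value = env.get("FI_CXI_DEFAULT_RX_SIZE")
    return int(value) if value else PROVIDER_DEFAULT_RX_SIZE
```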
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# Endpoint transmit size attribute (`FI_CXI_DEFAULT_TX_SIZE`)

The endpoint transmit size attribute sizes the internal command and hardware event queues. This controls how many messages are in flight, so at a minimum, users are encouraged to set the endpoint transmit size attribute based on the expected number of inflight, initiator RDMA operations.

If users are going to be issuing messages larger than the CXI provider rendezvous threshold (`FI_CXI_RDZV_THRESHOLD`), the transmit size attribute must also include the number of outstanding, unexpected rendezvous operations.
That is, the value must cover the inflight, initiator RDMA operations plus the outstanding, unexpected rendezvous operations.
See the section on [Rendezvous protocol configuration](rendezvous_protocol_configuration.md#rendezvous-protocol-configuration) for more information.

The current default is 512. Cray MPI sets this to 1024.

If the setting is too high, it can consume more memory than necessary and allow too many messages to be in flight, potentially overwhelming an endpoint. Conversely, if the setting is too low, it can impact performance due to the instantiation of flow control.
In some cases, a low setting may cause a deadlock because an application might post too many transmissions before it can post a receive. These issues are often caused by poorly written applications. This situation typically occurs with the Rendezvous protocol, where too many unexpected messages are received.
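
The sizing guidance above reduces to simple arithmetic: the transmit size should cover the inflight, initiator RDMA operations plus any outstanding, unexpected rendezvous operations. A minimal sketch of that calculation follows. The function names are hypothetical, and treating values at or below the provider default of 512 as needing no override is an assumption for the example.

```python
# Provider default stated in this section.
PROVIDER_DEFAULT_TX_SIZE = 512


def required_tx_size(inflight_initiator_ops, unexpected_rendezvous_ops=0):
    """Return the number of transmit entries the workload is expected to need."""
    if inflight_initiator_ops < 0 or unexpected_rendezvous_ops < 0:
        raise ValueError("operation counts must not be negative")
    return inflight_initiator_ops + unexpected_rendezvous_ops


def tx_size_setting(inflight_initiator_ops, unexpected_rendezvous_ops=0):
    """Return a value for FI_CXI_DEFAULT_TX_SIZE, or None if the default suffices."""
    needed = required_tx_size(inflight_initiator_ops, unexpected_rendezvous_ops)
    if needed <= PROVIDER_DEFAULT_TX_SIZE:
        return None
    return str(needed)
```

For example, 400 inflight initiator operations plus 300 outstanding, unexpected rendezvous operations need 700 entries, which exceeds the provider default, so `tx_size_setting(400, 300)` suggests exporting `FI_CXI_DEFAULT_TX_SIZE=700`.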