TECHPUBS-4452: HPE Slingshot Host Software User Guide (#101)
* TECHPUBS-4452: HPE Slingshot Host Software User Guide
* TECHPUBS-4452: grammar, wording, and review feedback
1 parent 1353b7e, commit 7a82ee6
Showing 37 changed files with 978 additions and 3 deletions.
57 changes: 57 additions & 0 deletions
docs/portal/developer-portal/HPE_Slingshot_Host_Software_User_Guide.ditamap
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd"> | ||
<map id="shs_relnotes"> | ||
<title>HPE Slingshot Host Software User Guide (S-9014) (@product_version@)</title> | ||
<topicmeta> | ||
<shortdesc>This publication describes user procedures for SHS.</shortdesc> | ||
<data name="pubsnumber" value="S-9014"></data> | ||
<data name="edition" value="SHS Software Release @product_version@"></data> | ||
</topicmeta> | ||
<topicref href="VeRsIoN.md" format="mdita" /> | ||
<topicref href="user/about.md" format="mdita" /> | ||
<topichead> | ||
<topicmeta> | ||
<navtitle>HPE Slingshot NIC overview</navtitle> | ||
</topicmeta> | ||
<topicref href="user/hardware_overview.md" format="mdita" /> | ||
<topicref href="user/libfabric_and_the_hpe_slingshot_nic_offloads.md" format="mdita" /> | ||
<topicref href="user/memory_registration.md" format="mdita" /> | ||
<topicref href="user/hardware_offload_capabilities.md" format="mdita" /> | ||
<topicref href="user/software_architecture.md" format="mdita" /> | ||
<topicref href="user/performance_counters.md" format="mdita" /> | ||
<topicref href="user/ip_networking_considerations.md" format="mdita" /> | ||
</topichead> | ||
<topichead> | ||
<topicmeta> | ||
<navtitle>HPE Slingshot NIC Libfabric</navtitle> | ||
</topicmeta> | ||
<topicref href="user/user_configurable_libfabric_environment_variables.md" format="mdita"> | ||
<topicref href="user/rdma_messaging_and_relationship_to_environment_settings.md" format="mdita"/> | ||
<topicref href="user/memory_cache_monitor_settings.md" format="mdita"/> | ||
<topicref href="user/endpoint_receive_size_attribute.md" format="mdita"/> | ||
<topicref href="user/endpoint_transmit_size_attribute.md" format="mdita"/> | ||
<topicref href="user/completion_queue_size_attribute.md" format="mdita"/> | ||
<topicref href="user/expected_number_of_ranks_and_peers.md" format="mdita"/> | ||
<topicref href="user/tag_matching_mode_settings.md" format="mdita"/> | ||
<topicref href="user/rendezvous_protocol_configuration.md" format="mdita"/> | ||
</topicref> | ||
<topicref href="user/debug_performance_and_failure_issues.md" format="mdita"/> | ||
</topichead> | ||
<topicref href="user/application_software_overview.md" format="mdita"> | ||
<topicref href="user/hpe_cray_programming_environment.md" format="mdita"/> | ||
<topicref href="user/nccl.md" format="mdita"/> | ||
<topicref href="user/rccl.md" format="mdita"/> | ||
<topicref href="user/intel_mpi.md" format="mdita"/> | ||
<topicref href="user/daos.md" format="mdita"/> | ||
<topicref href="user/openmpi.md" format="mdita"/> | ||
</topicref> | ||
<topichead> | ||
<topicmeta> | ||
<navtitle>Appendix</navtitle> | ||
</topicmeta> | ||
<topicref href="user/hpe_slingshot_nic_rdma_protocol_and_traffic_classes.md" format="mdita"/> | ||
<topicref href="user/ip_performance_and_configuration_settings.md" format="mdita"/> | ||
<topicref href="user/memory_registration_and_cache_monitors.md" format="mdita"/> | ||
<topicref href="user/libfabric_runtime_configurable_parameters.md" format="mdita"/> | ||
</topichead> | ||
</map> |
17 changes: 17 additions & 0 deletions
docs/portal/developer-portal/HPE_Slingshot_Host_Software_User_Guide.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
{ | ||
"content_type": "htmlzip", | ||
"content_class": "html-default", | ||
"source_system": "git", | ||
"source_system_id": "https://github.com/Cray-HPE/docs-shs", | ||
"source_system_version": "@docs_git_hash@", | ||
"lifecycle": "DRAFT", | ||
"products": ["1013247219","1013083813"], | ||
"product_version": "@product_version@", | ||
"full_title": "HPE Slingshot Host Software User Guide (S-9014) @product_version@", | ||
"description": "This publication includes user procedures for SHS.", | ||
"language_code": "en_US", | ||
"submitter": "nathan.rockershousen@hpe.com", | ||
"company_info": "HPE-green", | ||
"customer_available_date": "", | ||
"content_org": "CMG708" | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# About this publication | ||
|
||
This document provides an overview of the HPE Slingshot NIC software environment | ||
for application users. It includes background information on the "theory of operations" to offer context for product configuration and troubleshooting. This document supplements the configuration and troubleshooting information found in the product documentation. | ||
|
||
Tuning guidance discussed here is specific to each system or application, so consider your intended application workload and system configuration. For example, the HPE Cray Programming Environment runtime middleware (MPI and SHMEM) sets default values, as detailed in this document and in the Cray PE documentation. | ||
Users may need to adjust settings for non-HPE Cray Software, such as open-source MPI stacks that may not have tuned values, and for specific applications. | ||
|
||
Default environment settings are rarely changed to avoid unintended impacts during upgrades. Therefore, users are encouraged to evaluate whether adjusting environment variables will improve performance. Tuning environment settings is also useful when a specific application is failing or running slowly. |
4 changes: 4 additions & 0 deletions
docs/portal/developer-portal/user/application_software_overview.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Configure application software | ||
|
||
This section provides guidance on configuring applications for the HPE Slingshot NIC using the environment variables described previously. It shares best-known methods from HPE and other users. | ||
The needs of specific applications with specific data sets may vary from these guidelines. |
11 changes: 11 additions & 0 deletions
docs/portal/developer-portal/user/completion_queue_size_attribute.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Completion queue size attribute (`FI_CXI_DEFAULT_CQ_SIZE`) | ||
|
||
This variable specifies the maximum number of entries in the CXI provider completion queue. It is used to size the various software and hardware event queues that generate Libfabric completion events. | ||
While the software queues may grow dynamically, hardware event queue sizes are static. If a hardware event queue is undersized, it fills more quickly than expected, and the next operation targeting the full event queue causes the message operation to be dropped and flow control to be triggered. Recovering from flow control requires expensive, side-band messaging internal to the CXI provider, which can appear as a lockup to the user. | ||
|
||
The provider default is 1024. Users are encouraged to set the completion queue size attribute based on the expected number of inflight RDMA operations to and from a single endpoint. An application or middleware, such as MPI, can set this value to override the provider default. | ||
The default CXI provider value is sized to handle the sum of the TX and RX default values, and it must not be below the sum of the TX and RX values if they have been changed from the default. Cray MPI sets this value to a default size of 131072. | ||
This size is partially an artifact of wanting to prevent a condition in earlier versions of the CXI provider in which overflowing the buffer could cause lockups. | ||
This is no longer the case; overflowing the buffer now triggers flow control, which slows performance. | ||
|
||
The impact of sizing this too high is reserving extra host memory that may ultimately be unnecessary. |
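The sizing rule above (the completion queue size must not be below the sum of the TX and RX sizes) can be checked before a job is launched. The following is a minimal sketch, not part of the CXI provider: the helper names `effective_size` and `cq_size_is_sufficient` are hypothetical, and the defaults are the values stated in this guide (TX 512, RX 512, CQ 1024).

```python
# Provider defaults as stated in this guide; adjust if your release differs.
PROVIDER_DEFAULTS = {
    "FI_CXI_DEFAULT_TX_SIZE": 512,
    "FI_CXI_DEFAULT_RX_SIZE": 512,
    "FI_CXI_DEFAULT_CQ_SIZE": 1024,
}


def effective_size(env, name):
    """Return the value set in env (a dict-like), or the provider default."""
    return int(env.get(name, PROVIDER_DEFAULTS[name]))


def cq_size_is_sufficient(env):
    """True if the CQ size is at least the sum of the TX and RX sizes."""
    tx = effective_size(env, "FI_CXI_DEFAULT_TX_SIZE")
    rx = effective_size(env, "FI_CXI_DEFAULT_RX_SIZE")
    cq = effective_size(env, "FI_CXI_DEFAULT_CQ_SIZE")
    return cq >= tx + rx
```

For example, passing `os.environ` checks the current shell environment. Raising only the TX size to 1024 while leaving the CQ size at its default fails the check, which is a hint to raise `FI_CXI_DEFAULT_CQ_SIZE` as well.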
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# DAOS | ||
|
||
Intel DAOS sets the following environment variables for compatibility with the HPE Slingshot Host Software (SHS) stack. | ||
|
||
- `setenv("CRT_MRC_ENABLE","1")` | ||
- `setenv("FI_CXI_OPTIMIZED_MRS","0")` | ||
- `setenv("FI_CXI_RX_MATCH_MODE","hybrid")` | ||
- `setenv("FI_MR_CACHE_MONITOR","memhooks")` | ||
- `setenv("FI_CXI_REQ_BUF_MIN_POSTED","8")` | ||
- `setenv("FI_CXI_REQ_BUF_SIZE","8388608")` | ||
- `setenv("FI_CXI_DEFAULT_CQ_SIZE","131072")` | ||
- `setenv("FI_CXI_OFLOW_BUF_SIZE","8388608")` |
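When comparing another application against the DAOS configuration, it can help to reproduce the same settings in a launcher script. The sketch below copies the values from the list above into a dictionary; the helper name `with_daos_settings` is hypothetical, and overwriting existing values is an assumption about how the settings should be applied.

```python
import os

# Values copied from the list above; DAOS itself applies them with setenv().
# 8388608 bytes is 8 MiB (8 * 1024 * 1024).
DAOS_SHS_ENV = {
    "CRT_MRC_ENABLE": "1",
    "FI_CXI_OPTIMIZED_MRS": "0",
    "FI_CXI_RX_MATCH_MODE": "hybrid",
    "FI_MR_CACHE_MONITOR": "memhooks",
    "FI_CXI_REQ_BUF_MIN_POSTED": "8",
    "FI_CXI_REQ_BUF_SIZE": "8388608",
    "FI_CXI_DEFAULT_CQ_SIZE": "131072",
    "FI_CXI_OFLOW_BUF_SIZE": "8388608",
}


def with_daos_settings(base_env=None):
    """Return a copy of base_env (default: os.environ) with the DAOS values applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(DAOS_SHS_ENV)  # assumption: DAOS values take precedence
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so that the calling shell's environment is left unchanged.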
27 changes: 27 additions & 0 deletions
docs/portal/developer-portal/user/debug_performance_and_failure_issues.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# Debug performance and failure issues | ||
|
||
This section describes how to debug applications once a fabric is considered operational. | ||
|
||
When a fabric is first being brought up and applications are failing, there can be many issues related to either the network or the host. Transient network failures can impact applications, but debugging whether that is the cause of the application failure is not covered here in depth. | ||
For example, if flapping links are causing an application to fail, use the link debugging procedures. | ||
|
||
## Prerequisites | ||
|
||
- AMAs must be assigned to every NIC as is done at boot up. | ||
- TCP communication must be working. Even for RDMA communications, the job scheduler and MPI use TCP/IP to set up connections. If a system is being set up, TCP failures can relate to Linux misconfigurations in the ARP cache, static ARP tables, or missing routing rules that should have been set up using `ifroute` during boot up (for nodes with more than one NIC). | ||
- VNI job configuration must be enabled unless the system is running with the “default” `cxi-service`. | ||
- For systems with GPUs, there is a matched set of GPU drivers and programming toolkits for each version of the `cxi` driver as documented in the release notes. Install the GDRCopy library for NVIDIA GPUs. | ||
|
||
## Debug steps | ||
|
||
The following is a high-level list of actions that can be taken to debug applications: | ||
|
||
- Check the _HPE Slingshot Host Software Release Notes_ for known issues or resolved issues. If not running the latest release, check the release notes for the releases that came after the running system. | ||
- Run the application with Libfabric logging, `FI_LOG_LEVEL=warn` and `FI_LOG_PROV=cxi`. The resulting logs provide guidance and will greatly aid the teams in responding to support tickets. | ||
- For memory registration related issues, try running with the `kdreg2` memory monitor to see if the issue relates to the choice of memory cache monitor. Memory registration caching can also be disabled altogether; if an application that was deadlocking then runs to completion, this points to tuning the memory registration cache settings. | ||
- If failures are being caused by hardware matching resource exhaustion, try setting the matching mode to hybrid. | ||
- For general concerns about resource exhaustion when not running Cray MPI, try setting the environment variables to larger sizes. Using the Cray MPI settings described in this guide plus setting the matching mode to hybrid helps detect whether the default settings are too small for the system or application. If so, subsequent testing can help tune the sizes if avoiding unneeded memory consumption is desired. | ||
- If the application performs differently after an upgrade of the HPE Slingshot Host Software, try running with the previous version of the Libfabric user space libraries, or even a more recent version. This might be easier for a user to try than building a new host image. The combination may not work; ask the HPE support team whether there are any known incompatibilities. Mixing and matching versions is not always an officially tested or supported combination, but it can be helpful in debugging and is sometimes perfectly fine in production. | ||
- Try the alternative rendezvous protocol. If the application is using large messages and is performing extremely slowly, following the instructions for the alternative rendezvous protocol may be a useful debug step. | ||
- Collect the NIC counters for an application. See the _HPE Cray Cassini Performance Counters User Guide (S-9929)_ on the [HPE Support Center](https://support.hpe.com/connect/s/?language=en_US) for details. | ||
Counters are collected with Cray MPI, Libfabric, `sysfs`, or LDMS; different deployments use different strategies. Some of these counters are the same as those that can be collected on the switch port, but they are easier for the user to access. They can reveal issues such as PCIe congestion, network congestion (pause exertions), and other factors. They are also of great use to the support teams in responding to tickets. |
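The logging and matching-mode steps above can be wrapped in a small launcher. This is a minimal sketch under stated assumptions: the helper names `debug_env` and `run_with_logging` are hypothetical, and it assumes Libfabric log output is written to the standard error stream of the application.

```python
import os
import subprocess


def debug_env(base_env=None, hybrid_matching=False):
    """Return a copy of the environment with Libfabric logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["FI_LOG_LEVEL"] = "warn"
    env["FI_LOG_PROV"] = "cxi"
    if hybrid_matching:
        # Optional step for suspected hardware matching resource exhaustion.
        env["FI_CXI_RX_MATCH_MODE"] = "hybrid"
    return env


def run_with_logging(cmd, log_path, hybrid_matching=False):
    """Run cmd (a list of arguments), saving stderr to log_path.

    Returns the exit code of the command.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(
            cmd,
            env=debug_env(hybrid_matching=hybrid_matching),
            stderr=log,
        )
    return result.returncode
```

A call such as `run_with_logging(["./my_app"], "fi_cxi.log")`, where `./my_app` is a placeholder for the application under test, leaves a log file that can be attached to a support ticket.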
9 changes: 9 additions & 0 deletions
docs/portal/developer-portal/user/endpoint_receive_size_attribute.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Endpoint receive size attribute (`FI_CXI_DEFAULT_RX_SIZE`) | ||
|
||
This attribute sizes the internal receive command and hardware event queues at job start up. Users are encouraged to set the endpoint receive size attribute based on the number of outstanding receive buffers being posted. The primary benefit of changing from the default setting comes when running in hybrid match mode, which is more common with HPE Slingshot release 2.1.1 and later. | ||
See the section on [Tag matching mode settings](tag_matching_mode_settings.md#tag-matching-mode-settings-fi_cxi_rx_match_mode) for more information. | ||
|
||
The current default is 512 (Cray MPI does not change it). Over-specifying can consume more memory, while under-specifying can cause flow control to be exerted, which reduces performance. When running in “hybrid mode” (see [Tag matching mode settings](tag_matching_mode_settings.md#tag-matching-mode-settings-fi_cxi_rx_match_mode)), over-specifying the number of hardware receive buffers will force other processes to use a software endpoint. | ||
|
||
Libfabric allows applications to suggest a receive attribute size in the `fi_info` hints specific to an application. | ||
If explicitly set, the `cxi` provider will use the size specified rather than the value of this environment variable. |
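The precedence described above (an explicit size in the application hints, then the environment variable, then the provider default) can be modeled as follows. This is an illustrative sketch of the documented behavior, not the provider's code; the function name `resolve_rx_size` is hypothetical, and treating a hint of 0 as "not specified" is an assumption.

```python
import os

DEFAULT_RX_SIZE = 512  # provider default stated in this guide


def resolve_rx_size(hint_size=None, env=None):
    """Model which receive size applies for an endpoint.

    hint_size: size the application set in its fi_info hints, or None/0
               if the application did not specify one (assumption).
    env:       dict-like environment; defaults to os.environ.
    """
    env = os.environ if env is None else env
    if hint_size:
        return hint_size  # an explicit hint takes priority
    if "FI_CXI_DEFAULT_RX_SIZE" in env:
        return int(env["FI_CXI_DEFAULT_RX_SIZE"])
    return DEFAULT_RX_SIZE
```

In practice this means that changing `FI_CXI_DEFAULT_RX_SIZE` has no effect on an application or middleware that already sets the size in its hints.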
12 changes: 12 additions & 0 deletions
docs/portal/developer-portal/user/endpoint_transmit_size_attribute.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# Endpoint transmit size attribute (`FI_CXI_DEFAULT_TX_SIZE`) | ||
|
||
The endpoint transmit size attribute sizes the internal command and hardware event queues. This controls how many messages are in flight, so at a minimum, users are encouraged to set the endpoint transmit size attribute based on the expected number of inflight, initiator RDMA operations. | ||
|
||
If users are going to be issuing messages larger than the CXI provider rendezvous threshold (`FI_CXI_RDZV_THRESHOLD`), the transmit size attribute must also include the number of outstanding, unexpected rendezvous operations. | ||
That is, the size should cover the sum of the inflight, initiator RDMA operations and the outstanding, unexpected rendezvous operations. | ||
See the section on [Rendezvous protocol configuration](rendezvous_protocol_configuration.md#rendezvous-protocol-configuration) for more information. | ||
|
||
The current default is 512. Cray MPI sets this to 1024. | ||
|
||
If the setting is too high, it can consume more memory than necessary and allow too many messages to be in flight, potentially overwhelming an endpoint. Conversely, if the setting is too low, it can reduce performance because flow control is triggered. | ||
In some cases, a low setting may cause a deadlock because an application might post too many transmissions before it can post a receive. These issues are often caused by poorly written applications. This situation typically occurs with the rendezvous protocol, where too many unexpected messages are received. |
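The transmit sizing guidance above amounts to a simple sum. The sketch below is a hedged illustration: the function names are hypothetical, and the operation counts are estimates that the user must supply for their own application.

```python
def suggested_tx_size(inflight_initiator_ops,
                      unexpected_rendezvous_ops=0,
                      uses_rendezvous=True):
    """Estimate a value for FI_CXI_DEFAULT_TX_SIZE.

    inflight_initiator_ops:    expected inflight, initiator RDMA operations.
    unexpected_rendezvous_ops: expected outstanding, unexpected rendezvous
                               operations; counted only when the application
                               sends messages that use the rendezvous protocol.
    """
    size = inflight_initiator_ops
    if uses_rendezvous:
        size += unexpected_rendezvous_ops
    return size


def tx_env(size):
    """Return the environment setting for the chosen transmit size."""
    return {"FI_CXI_DEFAULT_TX_SIZE": str(size)}
```

For example, an application expecting 600 inflight initiator operations and 200 outstanding unexpected rendezvous operations would need a size of at least 800, which is above the provider default of 512 and below the Cray MPI setting of 1024.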