From 9273a529ee44abeb6f186c66ee2d3f742a7af98a Mon Sep 17 00:00:00 2001
From: Ian Ziemba
Date: Wed, 20 Nov 2024 14:39:06 -0600
Subject: [PATCH] man/fi_cxi: Update manpage for force dev reg

Signed-off-by: Ian Ziemba
---
 man/man7/fi_cxi.7 | 82 +++++++++++++++++++++++++++++------------------
 1 file changed, 50 insertions(+), 32 deletions(-)

diff --git a/man/man7/fi_cxi.7 b/man/man7/fi_cxi.7
index aa0271826e0..5171b6a9e83 100644
--- a/man/man7/fi_cxi.7
+++ b/man/man7/fi_cxi.7
@@ -1,7 +1,21 @@
-.\"t
-.\" Automatically generated by Pandoc 2.9.2.1
+'\" t
+.\" Automatically generated by Pandoc 2.18
 .\"
-.TH "fi_cxi" "7" "2024\-10\-15" "Libfabric Programmer\[cq]s Manual" "#VERSION#"
+.\" Define V font for inline verbatim, using C font in formats
+.\" that render this, and otherwise B font.
+.ie "\f[CB]x\f[]"x" \{\
+. ftr V B
+. ftr VI BI
+. ftr VB B
+. ftr VBI BI
+.\}
+.el \{\
+. ftr V CR
+. ftr VI CI
+. ftr VB CB
+. ftr VBI CBI
+.\}
+.TH "fi_cxi" "7" "2024\-11\-20" "Libfabric Programmer\[cq]s Manual" "#VERSION#"
 .hy
 .SH NAME
 .PP
@@ -176,7 +190,7 @@ Classes.
 .PP
 While a libfabric user provided authorization key is optional, it is
 highly encouraged that libfabric users provide an authorization key
-through the domain attribute hints during \f[C]fi_getinfo()\f[R].
+through the domain attribute hints during \f[V]fi_getinfo()\f[R].
 How libfabric users acquire the authorization key may vary between the
 users and is outside the scope of this document.
 .PP
@@ -192,18 +206,18 @@ authorization key using them.
 .IP \[bu] 2
 \f[I]SLINGSHOT_VNIS\f[R]: Comma separated list of VNIs.
 The CXI provider will only use the first VNI if multiple are provide.
-Example: \f[C]SLINGSHOT_VNIS=234\f[R].
+Example: \f[V]SLINGSHOT_VNIS=234\f[R].
 .IP \[bu] 2
 \f[I]SLINGSHOT_DEVICES\f[R]: Comma separated list of device names.
 Each device index will use the same index to lookup the service ID in
 \f[I]SLINGSHOT_SVC_IDS\f[R].
-Example: \f[C]SLINGSHOT_DEVICES=cxi0,cxi1\f[R].
+Example: \f[V]SLINGSHOT_DEVICES=cxi0,cxi1\f[R].
 .IP \[bu] 2
 \f[I]SLINGSHOT_SVC_IDS\f[R]: Comma separated list of pre-configured CXI
 service IDs.
 Each service ID index will use the same index to lookup the CXI device
 in \f[I]SLINGSHOT_DEVICES\f[R].
-Example: \f[C]SLINGSHOT_SVC_IDS=5,6\f[R].
+Example: \f[V]SLINGSHOT_SVC_IDS=5,6\f[R].
 .PP
 \f[B]Note:\f[R] How valid VNIs and device services are configured is
 outside the responsibility of the CXI provider.
@@ -608,7 +622,7 @@ into the fi_control(FI_QUEUE_WORK) critical path.
 The following subsections outline the CXI provider fork support.
 .SS RDMA and Fork Overview
 .PP
-Under Linux, \f[C]fork()\f[R] is implemented using copy-on-write (COW)
+Under Linux, \f[V]fork()\f[R] is implemented using copy-on-write (COW)
 pages, so the only penalty that it incurs is the time and memory
 required to duplicate the parent\[cq]s page tables, mark all of the
 process\[cq]s page structs as read only and COW, and create a unique
@@ -651,22 +665,22 @@ The crux of the issue is the parent issuing forks while trying to do
 RDMA operations to registered memory regions.
 Excluding software RDMA emulation, two options exist for RDMA NIC
 vendors to resolve this data corruption issue.
-- Linux \f[C]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - RDMA NIC
+- Linux \f[V]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - RDMA NIC
 support for on-demand paging (ODP)
 .SS Linux madvise() MADV_DONTFORK and MADV_DOFORK
 .PP
 The generic (i.e.\ non-vendor specific) RDMA NIC solution to the Linux
 COW fork policy and RDMA problem is to use the following
-\f[C]madvise()\f[R] operations during memory registration and
+\f[V]madvise()\f[R] operations during memory registration and
 deregistration: - MADV_DONTFORK: Do not make the pages in this range
-available to the child after a \f[C]fork()\f[R].
+available to the child after a \f[V]fork()\f[R].
 This is useful to prevent copy-on-write semantics from changing the
 physical location of a page if the parent writes to it after a
-\f[C]fork()\f[R].
+\f[V]fork()\f[R].
 (Such page relocations cause problems for hardware that DMAs into the
-page.) - MADV_DOFORK: Undo the effect of MADV_DONTFORK, restoring the
-default behavior, whereby a mapping is inherited across
-\f[C]fork()\f[R].
+page.)
+- MADV_DOFORK: Undo the effect of MADV_DONTFORK, restoring the default
+behavior, whereby a mapping is inherited across \f[V]fork()\f[R].
 .PP
 In the Linux kernel, MADV_DONTFORK will result in the virtual memory
 area struct (VMA) being marked with the VM_DONTCOPY flag.
@@ -677,14 +691,14 @@ Should the child reference the virtual address corresponding to the VMA
 which was not duplicated, it will segfault.
 .PP
 In the previous example, if Process A issued
-\f[C]madvise(0xffff0000, 4096, MADV_DONTFORK)\f[R] before performing
+\f[V]madvise(0xffff0000, 4096, MADV_DONTFORK)\f[R] before performing
 RDMA memory registration, the physical address 0x1000 would have
 remained with Process A.
 This would prevent the Process A data corruption as well.
 If Process B were to reference virtual address 0xffff0000, it will
 segfault due to the hole in the virtual address space.
 .PP
-Using \f[C]madvise()\f[R] with MADV_DONTFORK may be problematic for
+Using \f[V]madvise()\f[R] with MADV_DONTFORK may be problematic for
 applications performing RDMA and page aliasing.
 Paging aliasing is where the parent process uses part or all of a page
 to share information with the child process.
@@ -738,7 +752,7 @@ The CXI provider is subjected to the Linux COW fork policy and RDMA
 issues described in section \f[I]RDMA and Fork Overview\f[R].
 To prevent data corruption with fork, the CXI provider supports the
 following options: - CXI specific fork environment variables to enable
-\f[C]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - ODP Support*
+\f[V]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - ODP Support*
 .PP
 **Formal ODP support pending.*
 .SS CXI Specific Fork Environment Variables
@@ -746,27 +760,27 @@ following options: - CXI specific fork environment variables to enable
 The CXI software stack has two environment variables related to fork: 0
 CXI_FORK_SAFE: Enables base fork safe support.
 With this environment variable set, regardless of value, libcxi will
-issue \f[C]madvise()\f[R] with MADV_DONTFORK on the virtual address
+issue \f[V]madvise()\f[R] with MADV_DONTFORK on the virtual address
 range being registered for RDMA.
-In addition, libcxi always align the \f[C]madvise()\f[R] to the system
+In addition, libcxi always align the \f[V]madvise()\f[R] to the system
 default page size.
 On x86, this is 4 KiB.
-To prevent redundant \f[C]madvise()\f[R] calls with MADV_DONTFORK
+To prevent redundant \f[V]madvise()\f[R] calls with MADV_DONTFORK
 against the same virtual address region, reference counting is used
-against each tracked \f[C]madvise()\f[R] region.
-In addition, libcxi will spilt and merge tracked \f[C]madvise()\f[R]
+against each tracked \f[V]madvise()\f[R] region.
+In addition, libcxi will spilt and merge tracked \f[V]madvise()\f[R]
 regions if needed.
 Once the reference count reaches zero, libcxi will call
-\f[C]madvise()\f[R] with MADV_DOFORK, and no longer track the region.
+\f[V]madvise()\f[R] with MADV_DOFORK, and no longer track the region.
 - CXI_FORK_SAFE_HP: With this environment variable set, in conjunction
 with CXI_FORK_SAFE, libcxi will not assume the page size is system
 default page size.
-Instead, libcxi will walk \f[C]/proc//smaps\f[R] to determine the
-correct page size and align the \f[C]madvise()\f[R] calls accordingly.
+Instead, libcxi will walk \f[V]/proc//smaps\f[R] to determine the
+correct page size and align the \f[V]madvise()\f[R] calls accordingly.
 This environment variable should be set if huge pages are being used for
 RDMA.
 To amortize the per memory registration walk of
-\f[C]/proc//smaps\f[R], the libfabric MR cache should be used.
+\f[V]/proc//smaps\f[R], the libfabric MR cache should be used.
 .PP
 Setting these environment variables will prevent data corruption when
 the parent issues a fork.
@@ -800,7 +814,7 @@ transfer.
 The following is the CXI provider fork support guidance: - Enable
 CXI_FORK_SAFE.
 If huge pages are also used, CXI_FORK_SAFE_HP should be enabled as well.
-Since enabling this will result in \f[C]madvice()\f[R] with
+Since enabling this will result in \f[V]madvice()\f[R] with
 MADV_DONTFORK, the following steps should be taken to prevent a child
 process segfault: - Avoid using stack memory for RDMA - Avoid child
 process having to access a virtual address range the parent process is
@@ -1559,6 +1573,10 @@ events.
 \f[I]FI_CXI_MR_CACHE_EVENTS_DISABLE_LE_POLL_NSECS\f[R]
 Max amount of time to poll when LE invalidate disabling an MR configured
 with MR match events.
+.TP
+\f[I]FI_CXI_FORCE_DEV_REG_COPY\f[R]
+Force the CXI provider to use the HMEM device register copy routines.
+If not supported, RDMA operations or memory registration will fail.
 .PP
 Note: Use the fi_info utility to query provider environment variables:
 fi_info -p cxi -e
@@ -1624,7 +1642,7 @@ It can only be changed prior to any MR being created.
 .PP
 CXI domain extensions have been named \f[I]FI_CXI_DOM_OPS_6\f[R].
 The flags parameter is ignored.
-The fi_open_ops function takes a \f[C]struct fi_cxi_dom_ops\f[R].
+The fi_open_ops function takes a \f[V]struct fi_cxi_dom_ops\f[R].
 See an example of usage below:
 .IP
 .nf
@@ -1717,7 +1735,7 @@ removed from the domain opts prior to software release 2.2.
 .PP
 CXI counter extensions have been named \f[I]FI_CXI_COUNTER_OPS\f[R].
 The flags parameter is ignored.
-The fi_open_ops function takes a \f[C]struct fi_cxi_cntr_ops\f[R].
+The fi_open_ops function takes a \f[V]struct fi_cxi_cntr_ops\f[R].
 See an example of usage below.
 .IP
 .nf
@@ -1846,7 +1864,7 @@ memory operation as a PCIe operation as compared to a NIC operation.
 The CXI provider extension flag FI_CXI_PCIE_AMO is used to signify this.
 .PP
 Since not all libfabric atomic memory operations can be executed as a
-PCIe atomic memory operation, \f[C]fi_query_atomic()\f[R] could be used
+PCIe atomic memory operation, \f[V]fi_query_atomic()\f[R] could be used
 to query if a given libfabric atomic memory operation could be executed
 as PCIe atomic memory operation.
 .PP
@@ -2164,6 +2182,6 @@ In this case, the target NIC is reachable.
 FI_EIO: Catch all errno.
 .SH SEE ALSO
 .PP
-\f[C]fabric\f[R](7), \f[C]fi_provider\f[R](7),
+\f[V]fabric\f[R](7), \f[V]fi_provider\f[R](7),
 .SH AUTHORS
 OpenFabrics.