man/fi_cxi: Update manpage for force dev reg
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
iziemba committed Nov 21, 2024
1 parent dafa631 commit 9273a52
Showing 1 changed file with 50 additions and 32 deletions.
82 changes: 50 additions & 32 deletions man/man7/fi_cxi.7
@@ -1,7 +1,21 @@
.\"t
.\" Automatically generated by Pandoc 2.9.2.1
'\" t
.\" Automatically generated by Pandoc 2.18
.\"
.TH "fi_cxi" "7" "2024\-10\-15" "Libfabric Programmer\[cq]s Manual" "#VERSION#"
.\" Define V font for inline verbatim, using C font in formats
.\" that render this, and otherwise B font.
.ie "\f[CB]x\f[]"x" \{\
. ftr V B
. ftr VI BI
. ftr VB B
. ftr VBI BI
.\}
.el \{\
. ftr V CR
. ftr VI CI
. ftr VB CB
. ftr VBI CBI
.\}
.TH "fi_cxi" "7" "2024\-11\-20" "Libfabric Programmer\[cq]s Manual" "#VERSION#"
.hy
.SH NAME
.PP
@@ -176,7 +190,7 @@ Classes.
.PP
While a libfabric user provided authorization key is optional, it is
highly encouraged that libfabric users provide an authorization key
through the domain attribute hints during \f[C]fi_getinfo()\f[R].
through the domain attribute hints during \f[V]fi_getinfo()\f[R].
How libfabric users acquire the authorization key may vary between the
users and is outside the scope of this document.
.PP
@@ -192,18 +206,18 @@ authorization key using them.
.IP \[bu] 2
\f[I]SLINGSHOT_VNIS\f[R]: Comma separated list of VNIs.
The CXI provider will only use the first VNI if multiple are provided.
Example: \f[C]SLINGSHOT_VNIS=234\f[R].
Example: \f[V]SLINGSHOT_VNIS=234\f[R].
.IP \[bu] 2
\f[I]SLINGSHOT_DEVICES\f[R]: Comma separated list of device names.
Each device index will use the same index to lookup the service ID in
\f[I]SLINGSHOT_SVC_IDS\f[R].
Example: \f[C]SLINGSHOT_DEVICES=cxi0,cxi1\f[R].
Example: \f[V]SLINGSHOT_DEVICES=cxi0,cxi1\f[R].
.IP \[bu] 2
\f[I]SLINGSHOT_SVC_IDS\f[R]: Comma separated list of pre-configured CXI
service IDs.
Each service ID index will use the same index to lookup the CXI device
in \f[I]SLINGSHOT_DEVICES\f[R].
Example: \f[C]SLINGSHOT_SVC_IDS=5,6\f[R].
Example: \f[V]SLINGSHOT_SVC_IDS=5,6\f[R].
.PP
\f[B]Note:\f[R] How valid VNIs and device services are configured is
outside the responsibility of the CXI provider.
@@ -608,7 +622,7 @@ into the fi_control(FI_QUEUE_WORK) critical path.
The following subsections outline the CXI provider fork support.
.SS RDMA and Fork Overview
.PP
Under Linux, \f[C]fork()\f[R] is implemented using copy-on-write (COW)
Under Linux, \f[V]fork()\f[R] is implemented using copy-on-write (COW)
pages, so the only penalty that it incurs is the time and memory
required to duplicate the parent\[cq]s page tables, mark all of the
process\[cq]s page structs as read only and COW, and create a unique
@@ -651,22 +665,22 @@ The crux of the issue is the parent issuing forks while trying to do
RDMA operations to registered memory regions.
Excluding software RDMA emulation, two options exist for RDMA NIC
vendors to resolve this data corruption issue.
- Linux \f[C]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - RDMA NIC
- Linux \f[V]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - RDMA NIC
support for on-demand paging (ODP)
.SS Linux madvise() MADV_DONTFORK and MADV_DOFORK
.PP
The generic (i.e.\ non-vendor specific) RDMA NIC solution to the Linux
COW fork policy and RDMA problem is to use the following
\f[C]madvise()\f[R] operations during memory registration and
\f[V]madvise()\f[R] operations during memory registration and
deregistration: - MADV_DONTFORK: Do not make the pages in this range
available to the child after a \f[C]fork()\f[R].
available to the child after a \f[V]fork()\f[R].
This is useful to prevent copy-on-write semantics from changing the
physical location of a page if the parent writes to it after a
\f[C]fork()\f[R].
\f[V]fork()\f[R].
(Such page relocations cause problems for hardware that DMAs into the
page.) - MADV_DOFORK: Undo the effect of MADV_DONTFORK, restoring the
default behavior, whereby a mapping is inherited across
\f[C]fork()\f[R].
page.)
- MADV_DOFORK: Undo the effect of MADV_DONTFORK, restoring the default
behavior, whereby a mapping is inherited across \f[V]fork()\f[R].
.PP
In the Linux kernel, MADV_DONTFORK will result in the virtual memory
area struct (VMA) being marked with the VM_DONTCOPY flag.
@@ -677,14 +691,14 @@ Should the child reference the virtual address corresponding to the VMA
which was not duplicated, it will segfault.
.PP
In the previous example, if Process A issued
\f[C]madvise(0xffff0000, 4096, MADV_DONTFORK)\f[R] before performing
\f[V]madvise(0xffff0000, 4096, MADV_DONTFORK)\f[R] before performing
RDMA memory registration, the physical address 0x1000 would have
remained with Process A.
This would prevent the Process A data corruption as well.
If Process B were to reference virtual address 0xffff0000, it will
segfault due to the hole in the virtual address space.
.PP
Using \f[C]madvise()\f[R] with MADV_DONTFORK may be problematic for
Using \f[V]madvise()\f[R] with MADV_DONTFORK may be problematic for
applications performing RDMA and page aliasing.
Page aliasing is where the parent process uses part or all of a page
to share information with the child process.
@@ -738,35 +752,35 @@ The CXI provider is subjected to the Linux COW fork policy and RDMA
issues described in section \f[I]RDMA and Fork Overview\f[R].
To prevent data corruption with fork, the CXI provider supports the
following options: - CXI specific fork environment variables to enable
\f[C]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - ODP Support*
\f[V]madvise()\f[R] MADV_DONTFORK and MADV_DOFORK - ODP Support*
.PP
**Formal ODP support pending.*
.SS CXI Specific Fork Environment Variables
.PP
The CXI software stack has two environment variables related to fork: -
CXI_FORK_SAFE: Enables base fork safe support.
With this environment variable set, regardless of value, libcxi will
issue \f[C]madvise()\f[R] with MADV_DONTFORK on the virtual address
issue \f[V]madvise()\f[R] with MADV_DONTFORK on the virtual address
range being registered for RDMA.
In addition, libcxi always aligns the \f[C]madvise()\f[R] to the system
In addition, libcxi always aligns the \f[V]madvise()\f[R] to the system
default page size.
On x86, this is 4 KiB.
To prevent redundant \f[C]madvise()\f[R] calls with MADV_DONTFORK
To prevent redundant \f[V]madvise()\f[R] calls with MADV_DONTFORK
against the same virtual address region, reference counting is used
against each tracked \f[C]madvise()\f[R] region.
In addition, libcxi will split and merge tracked \f[C]madvise()\f[R]
against each tracked \f[V]madvise()\f[R] region.
In addition, libcxi will split and merge tracked \f[V]madvise()\f[R]
regions if needed.
Once the reference count reaches zero, libcxi will call
\f[C]madvise()\f[R] with MADV_DOFORK, and no longer track the region.
\f[V]madvise()\f[R] with MADV_DOFORK, and no longer track the region.
- CXI_FORK_SAFE_HP: With this environment variable set, in conjunction
with CXI_FORK_SAFE, libcxi will not assume the page size is system
default page size.
Instead, libcxi will walk \f[C]/proc/<pid>/smaps\f[R] to determine the
correct page size and align the \f[C]madvise()\f[R] calls accordingly.
Instead, libcxi will walk \f[V]/proc/<pid>/smaps\f[R] to determine the
correct page size and align the \f[V]madvise()\f[R] calls accordingly.
This environment variable should be set if huge pages are being used for
RDMA.
To amortize the per memory registration walk of
\f[C]/proc/<pid>/smaps\f[R], the libfabric MR cache should be used.
\f[V]/proc/<pid>/smaps\f[R], the libfabric MR cache should be used.
.PP
Setting these environment variables will prevent data corruption when
the parent issues a fork.
@@ -800,7 +814,7 @@ transfer.
The following is the CXI provider fork support guidance: - Enable
CXI_FORK_SAFE.
If huge pages are also used, CXI_FORK_SAFE_HP should be enabled as well.
Since enabling this will result in \f[C]madvise()\f[R] with
Since enabling this will result in \f[V]madvise()\f[R] with
MADV_DONTFORK, the following steps should be taken to prevent a child
process segfault: - Avoid using stack memory for RDMA - Avoid child
process having to access a virtual address range the parent process is
@@ -1559,6 +1573,10 @@ events.
\f[I]FI_CXI_MR_CACHE_EVENTS_DISABLE_LE_POLL_NSECS\f[R]
Max amount of time to poll when LE invalidate disabling an MR configured
with MR match events.
.TP
\f[I]FI_CXI_FORCE_DEV_REG_COPY\f[R]
Force the CXI provider to use the HMEM device register copy routines.
If not supported, RDMA operations or memory registration will fail.
.PP
Note: Use the fi_info utility to query provider environment variables:
fi_info -p cxi -e
@@ -1624,7 +1642,7 @@ It can only be changed prior to any MR being created.
.PP
CXI domain extensions have been named \f[I]FI_CXI_DOM_OPS_6\f[R].
The flags parameter is ignored.
The fi_open_ops function takes a \f[C]struct fi_cxi_dom_ops\f[R].
The fi_open_ops function takes a \f[V]struct fi_cxi_dom_ops\f[R].
See an example of usage below:
.IP
.nf
@@ -1717,7 +1735,7 @@ removed from the domain opts prior to software release 2.2.
.PP
CXI counter extensions have been named \f[I]FI_CXI_COUNTER_OPS\f[R].
The flags parameter is ignored.
The fi_open_ops function takes a \f[C]struct fi_cxi_cntr_ops\f[R].
The fi_open_ops function takes a \f[V]struct fi_cxi_cntr_ops\f[R].
See an example of usage below.
.IP
.nf
@@ -1846,7 +1864,7 @@ memory operation as a PCIe operation as compared to a NIC operation.
The CXI provider extension flag FI_CXI_PCIE_AMO is used to signify this.
.PP
Since not all libfabric atomic memory operations can be executed as a
PCIe atomic memory operation, \f[C]fi_query_atomic()\f[R] could be used
PCIe atomic memory operation, \f[V]fi_query_atomic()\f[R] could be used
to query if a given libfabric atomic memory operation could be executed
as PCIe atomic memory operation.
.PP
@@ -2164,6 +2182,6 @@ In this case, the target NIC is reachable.
FI_EIO: Catch all errno.
.SH SEE ALSO
.PP
\f[C]fabric\f[R](7), \f[C]fi_provider\f[R](7),
\f[V]fabric\f[R](7), \f[V]fi_provider\f[R](7),
.SH AUTHORS
OpenFabrics.
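
For readers of the fork sections touched by this diff, the following is a minimal C sketch of the generic madvise() pattern the man page describes: MADV_DONTFORK before memory registration and MADV_DOFORK after deregistration, aligned to the system default page size. It is not libcxi code. The helper names (page_align_range, advise_buffer, mark_dontfork, mark_dofork) are hypothetical, and the sketch omits the reference counting and region splitting/merging that the man page attributes to libcxi.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MADV_DONTFORK / MADV_DOFORK on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, not part of libcxi or libfabric: widen
 * [addr, addr + len) to page boundaries, because madvise() requires a
 * page-aligned start address. page_size must be a power of two. */
static void page_align_range(uintptr_t addr, size_t len, size_t page_size,
                             uintptr_t *start, size_t *aligned_len)
{
    uintptr_t mask = (uintptr_t)page_size - 1;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    *start = addr & ~mask;
    *aligned_len = (size_t)(((addr + len + mask) & ~mask) - *start);
}

/* Hypothetical helper: apply one madvise() advice value to the pages
 * covering buf, using the system default page size. */
static int advise_buffer(void *buf, size_t len, int advice)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t start;
    size_t aligned_len;

    if (page_size <= 0)
        return -1;
    page_align_range((uintptr_t)buf, len, (size_t)page_size,
                     &start, &aligned_len);
    return madvise((void *)start, aligned_len, advice);
}

/* Before registration: keep these pages out of any future child. */
static int mark_dontfork(void *buf, size_t len)
{
    return advise_buffer(buf, len, MADV_DONTFORK);
}

/* After deregistration: restore normal inheritance across fork(). */
static int mark_dofork(void *buf, size_t len)
{
    return advise_buffer(buf, len, MADV_DOFORK);
}
```

A caller would invoke mark_dontfork(buf, len) before registering buf and mark_dofork(buf, len) once it is deregistered. Because the sketch assumes the default page size, it corresponds only to the CXI_FORK_SAFE behavior described in the man page; huge-page mappings would need the page size lookup that CXI_FORK_SAFE_HP is described as performing.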
