Bphilip/add mofed and dependencies #11479
InfiniBand is a requirement for inter-GPU communication and scale-out GPU clusters in AI/ML deployments. The HPC team's Azure Linux image needs the InfiniBand MOFED drivers. Their major customer, Singularity, is on a deadline to move from Ubuntu to Azure Linux for AI/ML workloads. All of the required sources are open source and have specs that are already being used by NVIDIA. This PR brings the MOFED InfiniBand driver to Azure Linux, along with all the dependencies needed to use, manage, and debug the stack. These modules have already been built, integrated into an HPC image, and tested on an existing 16-node cluster owned by the HPC team. Performance characteristics are within the specified tolerance limits.
%{!?_name: %define _name fwctl}
%{!?_version: %define _version 24.10}
%{!?_release: %define _release OFED.24.10.0.6.7.1}
Why define `_version` separately? Won't `%{version}` work? Similar question for `_name`. And what's the logic behind `_release`? I see it trickles down to `_kmp_rel`, for instance, but I don't see `_kmp_rel` used anywhere?
This version is defined in the MOFED SPECS for alignment, hence keeping it the same for future updates.
Fixed.
@@ -17,6 +17,7 @@
"Fedora (ISC)": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from Fedora \d+ \(license: ISC\)(\.|\n|$)',
"Magnus Edenhill Open Source": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from Magnus Edenhill Open Source \(license: BSD\)(\.|\n|$)',
"NVIDIA": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from NVIDIA \(license: (ASL 2\.0|GPLv2)\)(\.|\n|$)',
"NVIDIA (BSD)": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from NVIDIA \(BSD\) \(license: (BSD)\)(\.|\n|$)',
No need for "CBL-Mariner" in the new source attributions. The other ones are to support the old specs. And there's no need to repeat "BSD", I think.
- "NVIDIA (BSD)": r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from NVIDIA \(BSD\) \(license: (BSD)\)(\.|\n|$)',
+ "NVIDIA (BSD)": r'\n-\s+Initial Azure Linux import from NVIDIA \(license: (BSD)\)(\.|\n|$)',
A bigger question, perhaps: why add a new entry specially for BSD? I see the older one for Nvidia already covers two licenses. Is there a reason we don't want to just update that one?
I was following the Fedora entries and tried to keep different licenses separate. If there is no problem in ASL/GPL/BSD going to the same source we can make that change.
Fixed. Added BSD to default Nvidia entry.
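For illustration, here is a minimal sketch of what folding BSD into the existing NVIDIA entry might look like. The merged pattern below is an assumption (the final regex is not shown in this thread), and the changelog snippet and helper name `nvidia_license` are hypothetical.

```python
import re

# Assumed merged pattern: the existing NVIDIA entry with BSD added to the
# license alternation. The exact regex that was committed is not shown here.
NVIDIA_ATTRIBUTION = re.compile(
    r'\n-\s+Initial (CBL-Mariner|Azure Linux) import from NVIDIA '
    r'\(license: (ASL 2\.0|GPLv2|BSD)\)(\.|\n|$)'
)


def nvidia_license(changelog: str):
    """Return the license named in the NVIDIA attribution line, or None."""
    match = NVIDIA_ATTRIBUTION.search(changelog)
    return match.group(2) if match else None


# Hypothetical changelog snippet, for illustration only.
sample = (
    "%changelog\n"
    "* Tue Dec 17 2024 Example Author <author@example.com> - 24.10-1\n"
    "- Initial Azure Linux import from NVIDIA (license: BSD).\n"
)
```

With a single entry, a changelog line naming any of the three licenses resolves to the same "NVIDIA" source, which is the behavior the reviewer asked about.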
BuildRequires: gcc
BuildRequires: make
BuildRequires: kernel-devel = %{target_kernel_version_full}
BuildRequires: kernel-headers = %{target_kernel_version_full}
BuildRequires: binutils
BuildRequires: systemd
BuildRequires: kmod
BuildRequires: mlnx-ofa_kernel-devel = %{_version}
BuildRequires: mlnx-ofa_kernel-source = %{_version}
General comment for all signed specs: they don't need most `BuildRequires`, since they only expand already built RPMs and replace a few files. The packages providing `rpm2cpio` and `cpio` are part of the default build chroot but to be on the safe side, I'd add them to `BuildRequires`.
Fixed
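As a rough sketch of the "expand an already built RPM" step the reviewer describes, the following pipes `rpm2cpio` into `cpio`. The helper names are hypothetical, and this is not taken from the signed specs in this PR, which would do the equivalent in shell inside the spec.

```python
import subprocess
from pathlib import Path


def rpm_extract_commands(rpm_path: str):
    """Return the two argv lists for `rpm2cpio <rpm> | cpio -idm`."""
    # cpio flags: -i extract, -d create directories, -m keep modification times
    return ["rpm2cpio", rpm_path], ["cpio", "-idm"]


def extract_rpm(rpm_path: str, dest_dir: str) -> None:
    """Unpack an RPM payload into dest_dir by piping rpm2cpio into cpio."""
    to_cpio, unpack = rpm_extract_commands(str(Path(rpm_path).resolve()))
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with subprocess.Popen(to_cpio, stdout=subprocess.PIPE) as producer:
        # cpio reads the archive from rpm2cpio's stdout and unpacks in dest_dir
        subprocess.run(unpack, stdin=producer.stdout, cwd=dest_dir, check=True)
    if producer.returncode != 0:
        raise subprocess.CalledProcessError(producer.returncode, to_cpio)
```

Both binaries must be on the path for `extract_rpm` to work, which is why listing the packages that provide them in `BuildRequires` is the safe choice.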
Requires: kmod

%description
fwctl signed kernel modules
This is the description that the published package will have, so I'd suggest something that gives users more information about the package.
Description is copied from unsigned SPECS. Both unsigned and signed SPECS have same package description.
# This package's "version" and "release" must reflect the unsigned version that
# was signed.
# An important consequence is that when making a change to this package, the
# unsigned version/release must be increased to keep the two versions consistent.
# Ideally though, this spec will not change much or at all, so the version will
# just track the unsigned package's version/release.
No need for the comment - the specs entanglement PR check will handle that for you. Just make sure to update the set of spec pairs, which need to be kept in sync.
Removed.
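To make the sync requirement concrete, here is a minimal sketch of the kind of comparison such a check could perform on a signed/unsigned spec pair. It is not the actual specs entanglement check; the function names and sample spec text are hypothetical.

```python
import re


def spec_tag(spec_text: str, tag: str):
    """Return the value of a top-level spec tag such as Version or Release."""
    match = re.search(rf'^{tag}:\s*(\S+)', spec_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def specs_in_sync(unsigned_spec: str, signed_spec: str) -> bool:
    """True when both specs define identical Version and Release tags."""
    return all(
        spec_tag(unsigned_spec, tag) is not None
        and spec_tag(unsigned_spec, tag) == spec_tag(signed_spec, tag)
        for tag in ("Version", "Release")
    )


# Hypothetical spec fragments, for illustration only.
unsigned = "Name: fwctl\nVersion: 24.10\nRelease: 1%{?dist}\n"
signed = "Name: fwctl-signed\nVersion: 24.10\nRelease: 1%{?dist}\n"
```

This sketch compares the raw tag text, so it does not expand macros; a real check would need to handle that.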
Merge Checklist
All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)
`*-static` subpackages, etc.) have had their `Release` tag incremented.
(`./cgmanifest.json`, `./toolkit/scripts/toolchain/cgmanifest.json`, `.github/workflows/cgmanifest.json`)
(`./LICENSES-AND-NOTICES/SPECS/data/licenses.json`, `./LICENSES-AND-NOTICES/SPECS/LICENSES-MAP.md`, `./LICENSES-AND-NOTICES/SPECS/LICENSE-EXCEPTIONS.PHOTON`)
`*.signatures.json` files
`sudo make go-tidy-all` and `sudo make go-test-coverage` pass
Summary
What does the PR accomplish, why was it needed?
This PR brings in MOFED and the dependent packages required to build our AI/ML distribution story. Most immediately, HPC images require InfiniBand for GPU clustering, so MOFED is a core requirement. There are first- and third-party requirements for the MOFED drivers directly and for the dependent packages.
The CUDA and GdrCopy drivers need MOFED as a build-time and runtime requirement. This is not satisfied today, so the CUDA driver is built without MOFED support. Integrating these packages into core will unblock that work stream as well.
Change Log
This change adds 18 new packages to the SPECS directory and 8 new SIGNED-SPECS. They are either directly required to build out MOFED and its support infrastructure, or are tools needed for a complete InfiniBand solution, such as tools to configure the hardware/driver and debug issues.
fwctl
ibarr
ibsim
iser
isert
knem
mft_kernel
mlnx-ethtool
mlnx-iproute2
mlnx-nfsrdma
mlnx-ofa_kernel
mlx-steering-dump
multiperf
rshim
sockperf
srp
xpmem-lib
xpmem
Does this affect the toolchain?
No
Associated issues
Links to CVEs
Test Methodology
https://dev.azure.com/mariner-org/mariner/_build/results?buildId=709280&view=results
The RPMs from a local build of these specs were installed on A100, V100, and T4 GPU-based VMs, and functionality was verified.