Test Nvidia RTX A400 Desktop Workstation GPU #677

Open
geerlingguy opened this issue Oct 17, 2024 · 5 comments

@geerlingguy
Owner

The Nvidia RTX A400 is a relatively inexpensive graphics card targeted at the desktop workstation space, for when you need something slightly better than the built-in iGPU on a server-class CPU, or slightly worse than low-end consumer gaming cards.

[Image: nvidia-rtx-a400]

I have one I'd like to test on the Pi. Newer cards like this might have better out-of-the-box arm64 driver support from Nvidia, since they're building these things to work in Ampere workstations and have a lot going on with their own Arm chips like Grace Hopper...

@geerlingguy
Owner Author

geerlingguy commented Dec 10, 2024

On a System76 Thelio Astra system (see geerlingguy/sbc-reviews#53), I can see the card:

0004:01:00.0 VGA compatible controller: NVIDIA Corporation Device 25b2 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 1879
	Physical Slot: 1-3
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 323
	NUMA node: 0
	IOMMU group: 26
	Region 0: Memory at 20000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 280000000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at 280010000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at 21000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fffaf040  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x8 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+ LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: LaneErr at lane: 0 1 2 3 5 7
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

0004:01:00.1 Audio device: NVIDIA Corporation Device 2291 (rev a1)
	Subsystem: NVIDIA Corporation Device 1879
	Physical Slot: 1-3
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 322
	NUMA node: 0
	IOMMU group: 26
	Region 0: Memory at 21080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x8 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+ LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [160 v1] Data Link Feature <?>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
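
In the output above, LnkCap advertises Width x16 while LnkSta reports Width x8 (downgraded). A rough sketch for pulling that comparison out of `lspci -vv` text follows; `link_width_check` is a made-up helper name, and it assumes the line layout shown here:

```shell
# Sketch: compare advertised (LnkCap) and negotiated (LnkSta) PCIe link width.
# Reads `lspci -vv` text on stdin. link_width_check is a hypothetical name.
link_width_check() {
  awk '
    # Device header lines start in column 1, e.g. "0004:01:00.0 VGA ..."
    /^[0-9a-f:.]+ / { dev = $1 }
    # "LnkCap:" carries the advertised width ("LnkCap2:" does not match)
    /LnkCap:/ {
      if (match($0, /Width x[0-9]+/)) cap = substr($0, RSTART + 7, RLENGTH - 7)
    }
    # "LnkSta:" carries the negotiated width
    /LnkSta:/ {
      if (match($0, /Width x[0-9]+/)) {
        sta = substr($0, RSTART + 7, RLENGTH - 7)
        verdict = (sta + 0 < cap + 0) ? "downgraded" : "ok"
        printf "%s cap=x%s sta=x%s %s\n", dev, cap, sta, verdict
      }
    }
  '
}

# Usage (needs root for the full capability dump):
#   sudo lspci -vv | link_width_check
```

A narrower negotiated width is not necessarily a fault; it can simply reflect the slot or the card's physical lane count.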

And nvidia-smi works:

system76@thelio-astra:~/Downloads$ nvidia-smi
Tue Dec 10 14:34:30 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A400                Off |   00000004:01:00.0  On |                  N/A |
| 46%   72C    P0             N/A /   50W |    2271MiB /   4094MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      5555      G   /usr/bin/gnome-shell                          196MiB |
|    0   N/A  N/A      6553      G   /usr/bin/Xwayland                              11MiB |
|    0   N/A  N/A     13368    C+G   ./GravityMark.arm64                          1998MiB |
|    0   N/A  N/A     14614      G   /usr/bin/gnome-control-center                  27MiB |
|    0   N/A  N/A     15265      G   /usr/bin/nautilus                              25MiB |
+-----------------------------------------------------------------------------------------+
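
For logging the same numbers during a benchmark run, `nvidia-smi` can also print CSV via `--query-gpu`. A small sketch that turns the memory figures into a percentage; `mem_used_pct` is a hypothetical helper name:

```shell
# Sketch: VRAM usage as a whole-number percentage.
# Expects "used, total" in MiB on stdin, one GPU per line, as printed by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
# mem_used_pct is a hypothetical helper name.
mem_used_pct() {
  # %d truncates, so 55.47 is reported as 55%
  awk -F', *' '$2 > 0 { printf "%d%%\n", 100 * $1 / $2 }'
}

# Usage:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits | mem_used_pct
```

With the 2271MiB / 4094MiB shown in the table above, this prints 55%.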

nvtop shows the card appropriately.

GravityMark will run with OpenGL but not Vulkan, and it got the following score:

M:  3:00.819: Benchmark Finished
M:  3:00.819: NVIDIA RTX A400/PCIe
M:  3:00.819: API: OpenGL
M:  3:00.819: Platform: Linux
M:  3:00.819: Resolution: 1024x768
M:  3:00.819: Antialiasing: Temporal
M:  3:00.819: Asteroids: 200,000
M:  3:00.819: Score: 9125
M:  3:00.819: Time: 167.1 s
M:  3:00.819: FPS: 54.6

And running again with all the proper settings at 1600x900, and a proper benchmark result upload:

[Image: GravityMark benchmark result at 1600x900]

So we know the thing works on Arm :P

@geerlingguy
Owner Author

Card is on the site here: https://pipci.jeffgeerling.com/cards_gpu/nvidia-rtx-a400.html

@geerlingguy
Owner Author

geerlingguy commented Dec 13, 2024

As @Coreforge's 6.6.y fork is a little out of date and I was lazy this morning, I decided to pop the A400 into my Pi 5 rig, to procrastinate on a video edit :)

On the Pi 5, I see it on the bus of course:

pi@pi5-pcie:~ $ lspci
0000:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21)
0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA107GL [RTX A400] (rev a1)
0000:01:00.1 Audio device: NVIDIA Corporation Device 2291 (rev a1)
0001:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries BCM2712 PCIe Bridge (rev 21)
0001:01:00.0 Ethernet controller: Raspberry Pi Ltd RP1 PCIe 2.0 South Bridge

And @bexcran just mentioned a new driver (565.77) dropped, so I might as well test it.

sudo apt install -y raspberrypi-kernel-headers
cd Downloads
wget https://us.download.nvidia.com/XFree86/aarch64/565.77/NVIDIA-Linux-aarch64-565.77.run
chmod +x NVIDIA-Linux-aarch64-565.77.run
sudo ./NVIDIA-Linux-aarch64-565.77.run
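
If you want to repeat this for another driver release, the steps above could be wrapped roughly like this. It is only a sketch: both function names are made up, and it assumes other aarch64 releases use the same URL layout as the 565.77 link, and that kernel build files sit under `/lib/modules/<release>/build`:

```shell
# Sketch: build the aarch64 .run installer URL for a driver version.
# Assumption: other releases follow the same path layout as 565.77.
nvidia_run_url() {
  version="$1"
  printf 'https://us.download.nvidia.com/XFree86/aarch64/%s/NVIDIA-Linux-aarch64-%s.run\n' \
    "$version" "$version"
}

# Sketch: check that kernel build files exist before running the installer.
# Arguments: kernel release (defaults to the running kernel) and modules root
# (defaults to /lib/modules; overridable only so the check can be tested).
headers_present() {
  release="${1:-$(uname -r)}"
  root="${2:-/lib/modules}"
  [ -e "$root/$release/build" ]
}

# Usage:
#   headers_present || sudo apt install -y raspberrypi-kernel-headers
#   wget "$(nvidia_run_url 565.77)"
```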

Interestingly, this pops up a new selection for which driver to install (hadn't seen this before):

[Screenshot: installer prompt for choosing which driver to install]

I chose the Proprietary driver and continued installation. Installation completed successfully and then I rebooted.

After reboot:

pi@pi5-pcie:~ $ dmesg | grep nvidia
[    3.470685] nvidia: loading out-of-tree module taints kernel.
[    3.470697] nvidia: module license 'NVIDIA' taints kernel.
[    3.470701] nvidia: module license taints kernel.
[    3.495652] nvidia-nvlink: Nvlink Core is being initialized, major device number 507
[    3.503451] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[    3.503480] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[    3.576994] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.77  Wed Nov 27 22:53:24 UTC 2024
[    3.601818] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    3.601821] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 2

pi@pi5-pcie:~ $ dmesg | grep NVRM
[    3.558933] NVRM: loading NVIDIA UNIX aarch64 Kernel Module  565.77  Wed Nov 27 23:58:34 UTC 2024
[    5.694117] NVRM: Chipset not recognized (vendor ID 0x14e4, device ID 0x2712)
[    5.785581] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2482)
[    5.785888] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    5.893608] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2482)
[    5.894032] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    6.004417] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2482)
[    6.004771] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    6.112904] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:2482)
[    6.113282] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
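
To compare runs across driver versions, it helps to boil the NVRM lines down to the unrecognized chipset ID and the distinct RmInitAdapter error codes. A sketch, assuming the message wording shown above; the function names are hypothetical:

```shell
# Sketch: extract "vendor:device" from the "Chipset not recognized" message.
# Reads dmesg text on stdin. nvrm_chipset is a hypothetical name.
nvrm_chipset() {
  sed -n 's/.*Chipset not recognized (vendor ID \(0x[0-9a-fA-F]*\), device ID \(0x[0-9a-fA-F]*\)).*/\1:\2/p'
}

# Sketch: count each distinct RmInitAdapter error code, printed as "<count> <code>".
# nvrm_failure_codes is a hypothetical name.
nvrm_failure_codes() {
  sed -n 's/.*RmInitAdapter failed! (\([^)]*\)).*/\1/p' \
    | sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage:
#   sudo dmesg | nvrm_chipset
#   sudo dmesg | nvrm_failure_codes
```

For the log above, this reports 0x14e4:0x2712 (0x14e4 is Broadcom's PCI vendor ID, so this is presumably the BCM2712 PCIe bridge from the lspci listing) and four failures with code 0x62:0xffff:2482.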

Same old same old.

@geerlingguy
Owner Author

geerlingguy commented Dec 13, 2024

I uninstalled with nvidia-uninstall, rebooted, and installed again, this time choosing the open driver (MIT/GPL).

After a reboot:

pi@pi5-pcie:~ $ dmesg | grep NVRM
[    3.368416] NVRM: loading NVIDIA UNIX Open Kernel Module for aarch64  565.77  Release Build  (dvs-builder@U16-A23-20-3)  Wed Nov 27 23:12:03 UTC 2024
[    5.657785] NVRM: Chipset not recognized (vendor ID 0x14e4, device ID 0x2712)
[    5.657807] NVRM: clCheckUpstreamLtrSupport_IMPL: PCIE config space is inaccessible!
[    5.753823] NVRM: kgspExecuteFwsec_TU102: failed to execute FWSEC for FRTS: no initialized WPR2 found
[    5.753831] NVRM: kgspExecuteFwsec_TU102: (note: VBIOS version 94.07.9B.00.01)
[    5.753834] NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from status @ kernel_gsp_tu102.c:482
[    5.753942] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[    5.755325] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:1863)
[    5.755657] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    5.771962] NVRM: clCheckUpstreamLtrSupport_IMPL: PCIE config space is inaccessible!
[    5.863358] NVRM: kgspExecuteFwsec_TU102: failed to execute FWSEC for FRTS: no initialized WPR2 found
[    5.863365] NVRM: kgspExecuteFwsec_TU102: (note: VBIOS version 94.07.9B.00.01)
[    5.863368] NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from status @ kernel_gsp_tu102.c:482
[    5.863468] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[    5.864850] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:1863)
[    5.865204] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    5.880993] NVRM: clCheckUpstreamLtrSupport_IMPL: PCIE config space is inaccessible!
[    5.972240] NVRM: kgspExecuteFwsec_TU102: failed to execute FWSEC for FRTS: no initialized WPR2 found
[    5.972246] NVRM: kgspExecuteFwsec_TU102: (note: VBIOS version 94.07.9B.00.01)
[    5.972250] NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from status @ kernel_gsp_tu102.c:482
[    5.972370] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[    5.973723] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:1863)
[    5.974089] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    5.989666] NVRM: clCheckUpstreamLtrSupport_IMPL: PCIE config space is inaccessible!
[    6.079505] NVRM: kgspExecuteFwsec_TU102: failed to execute FWSEC for FRTS: no initialized WPR2 found
[    6.079511] NVRM: kgspExecuteFwsec_TU102: (note: VBIOS version 94.07.9B.00.01)
[    6.079514] NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from status @ kernel_gsp_tu102.c:482
[    6.079619] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[    6.080914] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:1863)
[    6.081257] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

That's at least giving me more reference points!

But same ol same ol: NVIDIA/open-gpu-kernel-modules#725

The config space warning comes from https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/platform/chipset/chipset_pcie.c#L548-L555

@geerlingguy
Owner Author

geerlingguy commented Jan 14, 2025

Going to re-test today and submit a full bug report with nvidia-bug-report.sh.

Edit: done: NVIDIA/open-gpu-kernel-modules#725 (reply in thread)
