I recently purchased a new GPU for my server, but have not been able to get the NVIDIA drivers to work with it, either as a graphics card or as a CUDA compute engine. The card in question is an NVIDIA GeForce GT 710 (https://www.zotac.com/us/product/graphi … b-pcie-x-1). The motherboard is an Asus PRIME B250M-A, if that matters.
I have all the needed packages installed:
[darose@darsys12 ~]$ pacman -Q | grep nvidia
nvidia-dkms 465.31-1
nvidia-settings 465.31-1
nvidia-utils 465.31-1
opencl-nvidia 465.31-1
The machine definitely *sees* the card:
[darose@darsys12 ~]$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
But no matter what I try, I keep getting "GPU has fallen off the bus" errors:
[darose@darsys12 ~]$ dmesg | grep -i -e nvidia -e nvrm
[ 0.000000] Command line: BOOT_IMAGE=../vmlinuz-linux root=LABEL=root rw nvidia-drm.modeset=1 initrd=../intel-ucode.img,../initramfs-linux.img
[ 0.046539] Kernel command line: BOOT_IMAGE=../vmlinuz-linux root=LABEL=root rw nvidia-drm.modeset=1 initrd=../intel-ucode.img,../initramfs-linux.img
[ 1.099115] nvidia: loading out-of-tree module taints kernel.
[ 1.099123] nvidia: module license 'NVIDIA' taints kernel.
[ 1.109036] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1.132441] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 1.133265] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1.249990] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 465.31 Thu May 13 22:24:36 UTC 2021
[ 1.253004] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 465.31 Thu May 13 22:14:23 UTC 2021
[ 1.253833] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 1.262431] nvidia-uvm: Loaded the UVM driver, major device number 237.
[ 1.284546] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 1.692351] NVRM: GPU at PCI:0000:01:00: GPU-94909028-e642-5b1e-39c9-b3c510991de2
[ 1.692354] NVRM: Xid (PCI:0000:01:00): 79, pid=140, GPU has fallen off the bus.
[ 1.692355] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 1.696537] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xf:1242)
[ 1.696593] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 1.696712] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 1.696916] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
...
And the nvidia drivers can't access it:
[darose@darsys12 ~]$ sudo nvidia-smi
No devices were found
[darose@darsys12 ~]$ sudo clinfo
Number of platforms 0
That "GPU has fallen off the bus" message is pretty vague, and there aren't many good suggestions online about how to fix it. Several pages suggested that either the card wasn't seated correctly or there's a hardware fault. I tried re-seating the card (as well as installing it in a different slot), but that made no difference. A hardware fault seems unlikely too, as a) this is a brand new card, and b) it works *perfectly* when I use the nouveau driver instead (though without any compute capabilities, obviously).
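For anyone landing here with similar logs: the Xid number in the NVRM line (79 in the output above) is the most searchable part of the error. A minimal sketch for pulling just the PCI address and Xid code out of kernel log text follows; the function name xid_events is made up for illustration, and the pattern assumes the NVRM line format shown in this post.

```shell
# xid_events (hypothetical helper): read kernel log text on stdin and
# print "<PCI address> <Xid code>" for every NVRM Xid line found.
xid_events() {
    sed -n 's/.*NVRM: Xid (\(PCI:[^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Usage on a live system (reading the kernel log may require root):
#   dmesg | xid_events | sort | uniq -c
```

Counting the events with sort and uniq -c shows whether the same Xid keeps recurring on the same device or whether several different errors are mixed together.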
I've tried upgrading and downgrading kernels, upgrading and downgrading the nvidia packages, booting using "rcutree.rcu_idle_gp_delay=1 pcie_aspm=off" and numerous other things, but nothing seems to fix the issue.
Yes, I can get the graphics working on the card if I need to (using nouveau), but that doesn't really help me. This is a headless server, and so doesn't really make any use of graphics capabilities, and I bought the card in order to use it for compute purposes.
I've been wrestling with this for several days now, and I'm pretty frustrated and at my wits' end.
Any suggestions on how to fix or debug this further would be most welcome!
Last edited by darose (2021-07-13 22:07:31)
Offline
According to page 1-1 of your motherboard manual[1], the board uses an Intel chipset and has two x1 slots: PCIEX1_1 and PCIEX1_2.
Which slot is the card in, and does it make a difference if you change the slot?
Many Intel processors have an integrated GPU; does yours?
Please post the full lspci -k output, plus dmesg and/or journalctl -b.
Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.
clean chroot building not flexible enough ?
Try clean chroot manager by graysky
Offline
Thanks very much for the response!
1) I'm currently using slot PCIEX1_1. (Though as I mentioned earlier I've tried using the other slot as well.)
2) It looks like the CPU does have integrated graphics. (Intel® HD Graphics 630, see https://ark.intel.com/content/www/us/en … -ghz.html) But I've updated the BIOS to instruct it to use the PCIE graphics card, and I don't see the intel graphics being used:
[darose@darsys12 tmp]$ inxi -G --display
Graphics: Device-1: NVIDIA GK208B [GeForce GT 710] driver: nvidia v: 465.31
Display: server: X.org 1.20.12 driver: loaded: nvidia note: n/a (using device driver) unloaded: modesetting
Message: No advanced graphics data found on this system.
[darose@darsys12 tmp]$ lsmod | grep i915
[darose@darsys12 tmp]$
3) http://darose.net/journalctl.txt
[darose@darsys12 ~]$ lspci -k
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
Subsystem: ASUSTeK Computer Inc. Device 8694
Kernel driver in use: skl_uncore
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller
Subsystem: ASUSTeK Computer Inc. Device 8694
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
Subsystem: ASUSTeK Computer Inc. Device 8694
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
Subsystem: ASUSTeK Computer Inc. Device 8694
Kernel driver in use: ahci
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #5 (rev f0)
Kernel driver in use: pcieport
00:1c.7 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #8 (rev f0)
Kernel driver in use: pcieport
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation 200 Series PCH LPC Controller (B250)
Subsystem: ASUSTeK Computer Inc. Device 8694
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller
Subsystem: ASUSTeK Computer Inc. Device 8694
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller
Subsystem: ASUSTeK Computer Inc. Device 8694
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. Device 5360
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
01:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. Device 5360
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
Subsystem: ASUSTeK Computer Inc. PRIME B450M-A Motherboard
Kernel driver in use: r8169
Kernel modules: r8169
Thanks!
Offline
I bought the card in order to use it for compute purposes.
You're aware that it's not exactly a powerful GPU?
Is there currently a display attached to the GPU?
Try to use the 460xx driver via kma, because of https://bbs.archlinux.org/viewtopic.php?id=265563 (or wait for 470xx to hit the repos)
Offline
Are you booting in EFI or BIOS mode?
(If unsure, post the output of ls -l /sys/firmware/efi/efivars/ | wc -l.)
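The check above relies on the kernel exposing /sys/firmware/efi only when the system was booted through UEFI. A small sketch of the same test as a function; boot_mode is a hypothetical name, and the optional path argument exists only so the function can be tried against another directory.

```shell
# boot_mode (hypothetical helper): print UEFI if the EFI sysfs directory
# exists, BIOS otherwise. The path can be overridden for testing.
boot_mode() {
    if [ -d "${1:-/sys/firmware/efi}" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}

# Usage:
#   boot_mode
```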
But I've updated the BIOS to instruct it to use the PCIE graphics card, and I don't see the intel graphics being used:
Usually that setting prompts the firmware/chipset to look at the primary PCIe graphics slot (the x16 one).
Some firmware does allow selecting a specific card/slot there.
You could try setting the Intel GPU as primary for testing and then see whether the NVIDIA proprietary drivers can find and use the card.
Offline
You're aware that it's not exactly a powerful GPU?
Yes, of course. I'm not doing crypto mining or gaming. (I'm just planning to use it for FoldingAtHome and some pycuda stuff.) And I didn't have *any* GPU in the machine prior to this, so this is a step up. (Plus there's a big shortage of good GPUs right now, and the prices are pretty insane.)
Is there currently a display attached to the GPU?
I had a Lantronix Spider KVM hooked up to the VGA port on the GPU card. I've now moved that to the integrated graphics port. But the nvidia card "falls off the bus" in either scenario.
Try to use the 460xx driver via kma, because of https://bbs.archlinux.org/viewtopic.php?id=265563 (or wait for 470xx to hit the repos)
Tried that as well, but no love:
[root@darsys12 ~]# dmesg | grep NVRM
[ 3.556602] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.67 Thu Mar 11 00:11:45 UTC 2021
[ 70.780217] NVRM: GPU at PCI:0000:01:00: GPU-94909028-e642-5b1e-39c9-b3c510991de2
[ 70.780221] NVRM: Xid (PCI:0000:01:00): 79, pid=1326, GPU has fallen off the bus.
[ 70.780222] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 70.785988] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x1a:1290)
...
Offline
Are you booting to EFI or Bios ?
(if unsure, post ls -l /sys/firmware/efi/efivars/ | wc -l output ).
I'm definitely using BIOS, not EFI:
[darose@darsys12 ~]$ ls -l /sys/firmware/efi/efivars
ls: cannot access '/sys/firmware/efi/efivars': No such file or directory
As this box has never had Windows on it, I never saw the need to use UEFI. Is UEFI needed in order to use an NVIDIA GPU?
But I've updated the BIOS to instruct it to use the PCIE graphics card, and I don't see the intel graphics being used:
Usually that setting prompts the firmware/chipset to look at the primary PCIe graphics slot (the x16 one).
Some firmware does allow selecting a specific card/slot there.
You could try setting the Intel GPU as primary for testing and then see whether the NVIDIA proprietary drivers can find and use the card.
I tried this. The good news is that I was able to set the Intel onboard GPU as primary and get to a login screen with it (using the i915 kernel module and the modesetting Xorg driver). The bad news is that I still can't get the NVIDIA card working:
[root@darsys12 ~]# dmesg | grep NVRM
[ 3.556602] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.67 Thu Mar 11 00:11:45 UTC 2021
[ 70.780217] NVRM: GPU at PCI:0000:01:00: GPU-94909028-e642-5b1e-39c9-b3c510991de2
[ 70.780221] NVRM: Xid (PCI:0000:01:00): 79, pid=1326, GPU has fallen off the bus.
[ 70.780222] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 70.785988] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x1a:1290)
[ 70.786074] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
...
[root@darsys12 ~]# nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
[root@darsys12 ~]# clinfo
Number of platforms 0
This is really frustrating! :-( The card is worthless to me if I can't get it to do any compute work.
Offline
Can you reload the nvidia modules at runtime?
How's the PSU situation? It's a 25W GPU, but if the rest of the system is at its limit, 25W can still be 5W too much.
(nvidia-smi won't work as long as the GPU keeps falling off the bus. It's as if it had physically fallen out of the slot.)
Offline
As this box has never had Windows on it, I never saw the need to use UEFI. Is UEFI needed in order to use an NVIDIA GPU?
GPU manufacturers increasingly focus on making their cards work under EFI and neglect BIOS users.
This has a rather big impact on hardware initialisation in non-standard situations.
Whether switching to EFI boot would help is another matter.
Try seth's suggestions in #8.
Offline
Can you reload the nvidia modules at runtime?
Doesn't seem to work:
[root@darsys12 ~]# lsmod | grep nvidia
nvidia_drm 65536 0
nvidia_modeset 1187840 1 nvidia_drm
nvidia 34992128 1 nvidia_modeset
drm_kms_helper 290816 2 nvidia_drm,i915
drm 581632 5 drm_kms_helper,nvidia_drm,i915
[root@darsys12 ~]# modprobe -r nvidia_drm
[root@darsys12 ~]# lsmod | grep nvidia
[root@darsys12 ~]#
[root@darsys12 ~]# modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
[root@darsys12 ~]# modprobe nvidia_modeset
modprobe: ERROR: could not insert 'nvidia_modeset': No such device
[root@darsys12 ~]# modprobe nvidia_drm
modprobe: ERROR: could not insert 'nvidia_drm': No such device
How's the PSU situation? It's a 25W GPU but if the rest of the system is at its limit, 25W can still be 5W too much.
The PSU should be fine. The box has a 500W supply in it (https://www.newegg.com/evga-500-b1-100- … 6817438012), and the card's specs say it needs a minimum of 300W. Plus there's very little other hardware running in the box: I think this might be the only card I have installed, and about the only other things installed are an SSD and three HDDs.
(nvidia-smi won't work as long as the GPU keeps falling off the bus. It's as if it had physically fallen out of the slot.)
:-)
I'm going to try a few more things to see if my current arch setup is the issue: 1) booting with LTS and fallback initramfs, and 2) booting with systemrescuecd. After that I think the next step might be to try a UEFI boot.
This is really frustrating ...
Offline
Try seth's suggestions in #8.
Tried. (See previous comment.) No luck. :-(
GPU manufacturers increasingly focus on making their cards work under EFI and neglect BIOS users.
This has a rather big impact on hardware initialisation in non-standard situations.
Whether switching to EFI boot would help is another matter.
This may have to be my next step.
Offline
I'm going to try a few more things to see if my current arch setup is the issue: 1) booting with LTS and fallback initramfs, and 2) booting with systemrescuecd. After that I think the next step might be to try a UEFI boot.
Meh. Tried all of these, but nothing worked. (And I can't get my mobo to boot Arch in UEFI mode.) I think this graphics card just became a paperweight. :-(
Offline
Do you have an unused PCIe x16 slot, and did you try the GPU in it?
Offline
Do you have an unused PCIe x16 slot, and did you try the GPU in it?
? It's a PCIe x1 card. I didn't think you could put it in a x16 slot.
Offline
Sure you can.
There're even PCIe x1 slots that can take x16 cards (on limited power and features)
Offline
Sure you can.
There're even PCIe x1 slots that can take x16 cards (on limited power and features)
Huh! I never knew this! I know that there are no issues with the x16 slot on my mobo, as I had an older card in there for ages until I bought this new one. So perhaps this might make a difference. Will give this a try later.
Offline
Do you have an unused PCIe x16 slot, and did you try the GPU in it?
And that ... worked!!!
[darose@darsys12 ~]$ nvidia-smi
Tue Jul 13 17:41:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31 Driver Version: 465.31 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 N/A | N/A |
| 40% 35C P8 N/A / N/A | 3MiB / 981MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Go figure!
Oh man, I cannot thank you enough!!!!
Offline
Hardware design by optimism
Either the GPU draws more than 25W or the PCIe x1 slot doesn't provide the full 25W - either way there was no margin for error…
Please always remember to mark resolved threads by editing your initial post's subject - so others will know that there's no task left, but maybe a solution to find.
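When comparing slots like this, it can help to look at what link the card actually negotiated. The sketch below filters lspci -vv output down to the advertised (LnkCap) and negotiated (LnkSta) link widths. It does not measure power draw, so it can't confirm the power theory above; link_widths is a hypothetical helper name, and the pattern assumes the usual "Width xN" wording in lspci's verbose output.

```shell
# link_widths (hypothetical helper): read `lspci -vv` output on stdin and
# print the advertised (LnkCap) and negotiated (LnkSta) PCIe link widths.
link_widths() {
    sed -nE 's/.*(LnkCap|LnkSta):.*Width (x[0-9]+).*/\1 \2/p'
}

# Usage (root is usually needed to read the capability registers;
# 01:00.0 is the GPU's address in this thread):
#   sudo lspci -vv -s 01:00.0 | link_widths
```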
Thanks.
Offline