#1 2021-07-10 16:50:15

darose
Member
Registered: 2004-04-13
Posts: 158

[SOLVED] "GPU has fallen off the bus" :-(

I recently purchased a new GPU for my server, but have not been able to get the NVIDIA drivers to work with it, either as a graphics card or as a CUDA compute device.  The card in question is an NVIDIA GeForce GT 710.  (https://www.zotac.com/us/product/graphi … b-pcie-x-1)  (The motherboard I'm using is an Asus PRIME B250M-A, if that matters.)

I have all the needed packages installed:

[darose@darsys12 ~]$ pacman -Q | grep nvidia
nvidia-dkms 465.31-1
nvidia-settings 465.31-1
nvidia-utils 465.31-1
opencl-nvidia 465.31-1

The machine definitely *sees* the card:

[darose@darsys12 ~]$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)

But no matter what I try, I keep getting "GPU has fallen off the bus" errors:

[darose@darsys12 ~]$ dmesg | grep -i -e nvidia -e nvrm
[    0.000000] Command line: BOOT_IMAGE=../vmlinuz-linux root=LABEL=root rw nvidia-drm.modeset=1 initrd=../intel-ucode.img,../initramfs-linux.img
[    0.046539] Kernel command line: BOOT_IMAGE=../vmlinuz-linux root=LABEL=root rw nvidia-drm.modeset=1 initrd=../intel-ucode.img,../initramfs-linux.img
[    1.099115] nvidia: loading out-of-tree module taints kernel.
[    1.099123] nvidia: module license 'NVIDIA' taints kernel.
[    1.109036] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    1.132441] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[    1.133265] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    1.249990] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  465.31  Thu May 13 22:24:36 UTC 2021
[    1.253004] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  465.31  Thu May 13 22:14:23 UTC 2021
[    1.253833] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    1.262431] nvidia-uvm: Loaded the UVM driver, major device number 237.
[    1.284546] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    1.692351] NVRM: GPU at PCI:0000:01:00: GPU-94909028-e642-5b1e-39c9-b3c510991de2
[    1.692354] NVRM: Xid (PCI:0000:01:00): 79, pid=140, GPU has fallen off the bus.
[    1.692355] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[    1.696537] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xf:1242)
[    1.696593] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    1.696712] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    1.696916] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
...
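
For anyone triaging a similar log, the Xid lines are the part worth isolating: each one carries the PCI address of the affected GPU and a numeric code (79 in the output above). A minimal sketch in POSIX shell, assuming the NVRM message format shown above; xid_events is a made-up helper name, not part of any NVIDIA tool:

```shell
# Sketch: list NVIDIA Xid events found in kernel log text read from stdin.
# xid_events is a hypothetical helper name used only for this illustration.
xid_events() {
    # Matches lines such as:
    #   NVRM: Xid (PCI:0000:01:00): 79, pid=140, GPU has fallen off the bus.
    # and prints "<pci-address> <xid-number>" for each match.
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p'
}
```

On a live system this would typically be fed from the kernel log, e.g. dmesg | xid_events (reading the kernel log may require root).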

And the nvidia drivers can't access it:

[darose@darsys12 ~]$ sudo nvidia-smi 
No devices were found
[darose@darsys12 ~]$ sudo clinfo 
Number of platforms                               0

That "GPU has fallen off the bus" is a pretty vague error message, and there aren't many good suggestions online about how to fix it.  Several pages suggested either that the card wasn't seated correctly or that it's a hardware fault.  But I tried re-seating the card (as well as installing it in a different slot), and that made no difference.  A hardware fault seems unlikely too, as a) this is a brand new card, and b) it works *perfectly* when I use the nouveau driver instead (though without any compute capabilities, obviously).

I've tried upgrading and downgrading kernels, upgrading and downgrading the nvidia packages, booting using "rcutree.rcu_idle_gp_delay=1 pcie_aspm=off" and numerous other things, but nothing seems to fix the issue.

Yes, I can get the graphics working on the card if I need to (using nouveau), but that doesn't really help me.  This is a headless server, and so doesn't really make any use of graphics capabilities, and I bought the card in order to use it for compute purposes.

I've been wrestling with this for several days now, and I'm pretty frustrated and at my wits' end.

Any suggestions on how to fix or debug this further would be most welcome!

Last edited by darose (2021-07-13 22:07:31)

#2 2021-07-10 18:18:58

Lone_Wolf
Administrator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 12,953

According to page 1-1 of your motherboard manual [1], the board uses an Intel chipset and has two x1 slots: PCIEX1_1 and PCIEX1_2.
Which slot is the card in, and does it make a difference if you change the slot?

Many Intel processors have an integrated GPU; does yours?

Please post the full lspci -k output, and also dmesg and/or journalctl -b.


[1] https://dlcdnets.asus.com/pub/ASUS/mb/L … B_only.pdf


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.

clean chroot building not flexible enough ?
Try clean chroot manager by graysky

#3 2021-07-10 19:00:22

darose
Member
Registered: 2004-04-13
Posts: 158

Thanks very much for the response!

1) I'm currently using slot PCIEX1_1.  (Though as I mentioned earlier I've tried using the other slot as well.)

2) It looks like the CPU does have integrated graphics.  (Intel® HD Graphics 630, see https://ark.intel.com/content/www/us/en … -ghz.html)  But I've updated the BIOS to instruct it to use the PCIE graphics card, and I don't see the intel graphics being used:

[darose@darsys12 tmp]$ inxi -G --display
Graphics:  Device-1: NVIDIA GK208B [GeForce GT 710] driver: nvidia v: 465.31 
           Display: server: X.org 1.20.12 driver: loaded: nvidia note: n/a (using device driver) unloaded: modesetting 
           Message: No advanced graphics data found on this system. 

[darose@darsys12 tmp]$ lsmod | grep i915
[darose@darsys12 tmp]$ 

3) http://darose.net/journalctl.txt

[darose@darsys12 ~]$ lspci -k
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
        Subsystem: ASUSTeK Computer Inc. Device 8694
        Kernel driver in use: skl_uncore
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller
        Subsystem: ASUSTeK Computer Inc. Device 8694
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
        Subsystem: ASUSTeK Computer Inc. Device 8694
        Kernel driver in use: mei_me
        Kernel modules: mei_me
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
        Subsystem: ASUSTeK Computer Inc. Device 8694
        Kernel driver in use: ahci
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #5 (rev f0)
        Kernel driver in use: pcieport
00:1c.7 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #8 (rev f0)
        Kernel driver in use: pcieport
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
        Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation 200 Series PCH LPC Controller (B250)
        Subsystem: ASUSTeK Computer Inc. Device 8694
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller
        Subsystem: ASUSTeK Computer Inc. Device 8694
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller
        Subsystem: ASUSTeK Computer Inc. Device 8694
        Kernel driver in use: i801_smbus
        Kernel modules: i2c_i801
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 5360
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
01:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 5360
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
        Subsystem: ASUSTeK Computer Inc. PRIME B450M-A Motherboard
        Kernel driver in use: r8169
        Kernel modules: r8169

Thanks!
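
A side note on the lspci -k output above: "Kernel driver in use: nvidia" only shows that the driver bound to the device, not that initialization succeeded (the RmInitAdapter failure in dmesg occurred with the driver bound). A small sketch for pulling just that field out for graphics devices, assuming the lspci -k layout shown above; gpu_drivers is a hypothetical helper name:

```shell
# Sketch: from `lspci -k` text on stdin, print "<slot> <driver>" for each
# graphics device that has a kernel driver bound.
# gpu_drivers is a hypothetical helper name used only for this illustration.
gpu_drivers() {
    awk '
        # Device header lines start at column 0 with the slot address.
        /^[0-9a-f]/ {
            is_gpu = ($0 ~ /VGA compatible controller|3D controller/)
            slot = $1
        }
        # Indented detail line belonging to the most recent device.
        is_gpu && /Kernel driver in use:/ { print slot, $NF }
    '
}
```

Typical use would be lspci -k | gpu_drivers; a GPU that is listed by lspci but missing from this output has no driver bound at all.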

#4 2021-07-10 21:45:59

seth
Member
Registered: 2012-09-03
Posts: 58,859

darose wrote:

I bought the card in order to use it for compute purposes.

You're aware that it's not exactly a powerful GPU?

Is there currently a display attached to the GPU?
Try to use the 460xx driver via kma, because of https://bbs.archlinux.org/viewtopic.php?id=265563 (or wait for 470xx to hit the repos)

#5 2021-07-11 10:34:22

Lone_Wolf
Administrator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 12,953

Are you booting in EFI or BIOS mode?
(If unsure, post the output of ls -l /sys/firmware/efi/efivars/ | wc -l.)
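
The same check can be scripted without counting efivars entries: the /sys/firmware/efi directory only exists when the kernel was booted through UEFI. A minimal sketch; boot_mode and its optional sysfs-root argument are hypothetical, added only so the function can be tried against a scratch directory:

```shell
# Sketch: report whether the running system was booted via UEFI or legacy BIOS.
# boot_mode is a hypothetical helper; the optional argument (a sysfs root,
# default /sys) exists only so the check can be exercised on a test tree.
boot_mode() {
    sysfs_root="${1:-/sys}"
    if [ -d "$sysfs_root/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}
```

Calling boot_mode with no argument inspects the real /sys; "BIOS" here also covers a UEFI firmware running in legacy (CSM) mode.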


darose wrote:

But I've updated the BIOS to instruct it to use the PCIE graphics card, and I don't see the intel graphics being used:

Usually that setting prompts the firmware/chipset to look at the primary PCIe graphics slot (the x16 one).
Some firmware does allow selecting a specific card/slot there.

You could try setting the Intel GPU as primary for testing and then see whether the NVIDIA proprietary drivers can find and use the card.


#6 2021-07-12 02:02:58

darose
Member
Registered: 2004-04-13
Posts: 158

seth wrote:

You're aware that it's not exactly a powerful GPU?

Yes of course.  I'm not doing crypto mining or gaming.  (Just planning to use it for FoldingAtHome and some pycuda stuff.)  And I didn't have *any* GPU in the machine prior to this, so this is a step up.  (Plus there's a big shortage of good GPUs right now, and the prices are pretty insane.)

seth wrote:

Is there currently a display attached to the GPU?

I had a Lantronix Spider KVM hooked up to the VGA port on the GPU card.  I've now moved that to the integrated graphics port.  But the nvidia card "falls off the bus" in either scenario.

seth wrote:

Try to use the 460xx driver via kma, because of https://bbs.archlinux.org/viewtopic.php?id=265563 (or wait for 470xx to hit the repos)

Tried that as well, but no love:

[root@darsys12 ~]# dmesg | grep NVRM
[    3.556602] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.67  Thu Mar 11 00:11:45 UTC 2021
[   70.780217] NVRM: GPU at PCI:0000:01:00: GPU-94909028-e642-5b1e-39c9-b3c510991de2
[   70.780221] NVRM: Xid (PCI:0000:01:00): 79, pid=1326, GPU has fallen off the bus.
[   70.780222] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[   70.785988] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x1a:1290)
...

#7 2021-07-12 02:12:34

darose
Member
Registered: 2004-04-13
Posts: 158

Lone_Wolf wrote:

Are you booting in EFI or BIOS mode?
(If unsure, post the output of ls -l /sys/firmware/efi/efivars/ | wc -l.)

I'm definitely using BIOS, not EFI:

[darose@darsys12 ~]$ ls -l /sys/firmware/efi/efivars
ls: cannot access '/sys/firmware/efi/efivars': No such file or directory

As this box has never had Windows on it, I never saw the need to use UEFI.  Is UEFI needed in order to use an NVIDIA GPU?

darose wrote:

But I've updated the BIOS to instruct it to use the PCIE graphics card, and I don't see the intel graphics being used:

Lone_Wolf wrote:

Usually that setting prompts the firmware/chipset to look at the primary PCIe graphics slot (the x16 one).
Some firmware does allow selecting a specific card/slot there.

You could try setting the Intel GPU as primary for testing and then see whether the NVIDIA proprietary drivers can find and use the card.

I tried this.  The good news is that I was able to set the Intel onboard GPU as primary and get to a login screen with it (using the i915 kernel module and the modesetting Xorg driver).  The bad news is that I still can't get the NVIDIA card working:

[root@darsys12 ~]# dmesg | grep NVRM
[    3.556602] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.67  Thu Mar 11 00:11:45 UTC 2021
[   70.780217] NVRM: GPU at PCI:0000:01:00: GPU-94909028-e642-5b1e-39c9-b3c510991de2
[   70.780221] NVRM: Xid (PCI:0000:01:00): 79, pid=1326, GPU has fallen off the bus.
[   70.780222] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[   70.785988] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x1a:1290)
[   70.786074] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
...
[root@darsys12 ~]# nvidia-smi 
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

[root@darsys12 ~]# clinfo 
Number of platforms                               0

This is really frustrating!  :-(  The card is worthless to me if I can't get it to do any compute work.

#8 2021-07-12 06:50:37

seth
Member
Registered: 2012-09-03
Posts: 58,859

Can you reload the nvidia modules at runtime?
How's the PSU situation? It's a 25W GPU but if the rest of the system is at its limit, 25W can still be 5W too much.

(nvidia-smi won't work as long as the GPU keeps falling off the bus. It's as if it had physically fallen out of the slot.)

#9 2021-07-12 11:48:41

Lone_Wolf
Administrator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 12,953

darose wrote:

As this box has never had Windows on it, I never saw the need to use UEFI.  Is UEFI needed in order to use an NVIDIA GPU?

GPU manufacturers increasingly focus on making their cards work under EFI and neglect BIOS users.
This has a rather big impact on hardware initialisation in non-standard situations.

Whether switching to EFI boot would help is another matter.
Try seth's suggestions in #8.


#10 2021-07-12 15:06:37

darose
Member
Registered: 2004-04-13
Posts: 158

seth wrote:

Can you reload the nvidia modules at runtime?

Doesn't seem to work:

[root@darsys12 ~]# lsmod | grep nvidia
nvidia_drm             65536  0
nvidia_modeset       1187840  1 nvidia_drm
nvidia              34992128  1 nvidia_modeset
drm_kms_helper        290816  2 nvidia_drm,i915
drm                   581632  5 drm_kms_helper,nvidia_drm,i915
[root@darsys12 ~]# modprobe -r nvidia_drm
[root@darsys12 ~]# lsmod | grep nvidia
[root@darsys12 ~]#

[root@darsys12 ~]# modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': No such device
[root@darsys12 ~]# modprobe nvidia_modeset
modprobe: ERROR: could not insert 'nvidia_modeset': No such device
[root@darsys12 ~]# modprobe nvidia_drm
modprobe: ERROR: could not insert 'nvidia_drm': No such device

seth wrote:

How's the PSU situation? It's a 25W GPU but if the rest of the system is at its limit, 25W can still be 5W too much.

PSU should be fine.  The box has a 500W supply in it (https://www.newegg.com/evga-500-b1-100- … 6817438012) and the card's specs say it needs a minimum of 300W.  Plus there's very little other hardware running in the box: I think this might be the only card I have installed, and about the only other things installed are an SSD and three HDDs.

seth wrote:

(nvidia-smi won't work as long as the GPU keeps falling off the bus. It's as if it had physically fallen out of the slot.)

:-)

I'm going to try a few more things to see if my current arch setup is the issue:  1) booting with LTS and fallback initramfs, and 2) booting with systemrescuecd.  After that I think the next step might be to try a UEFI boot.

This is really frustrating ...

#11 2021-07-12 15:08:34

darose
Member
Registered: 2004-04-13
Posts: 158

Lone_Wolf wrote:

Try seth's suggestions in #8.

Tried.  (See previous comment.)  No luck.  :-(

Lone_Wolf wrote:

GPU manufacturers increasingly focus on making their cards work under EFI and neglect BIOS users.
This has a rather big impact on hardware initialisation in non-standard situations.

Whether switching to EFI boot would help is another matter.

This may have to be my next step.

#12 2021-07-12 22:32:06

darose
Member
Registered: 2004-04-13
Posts: 158

darose wrote:

I'm going to try a few more things to see if my current arch setup is the issue:  1) booting with LTS and fallback initramfs, and 2) booting with systemrescuecd.  After that I think the next step might be to try a UEFI boot.

Meh.  Tried all of these, but nothing worked.  (And I can't get my mobo to boot Arch in UEFI mode.)  I think this graphics card just became a paperweight.  :-(

#13 2021-07-13 06:14:16

seth
Member
Registered: 2012-09-03
Posts: 58,859

Do you have and did you try the GPU in an unused PCIe x16 slot?

#14 2021-07-13 12:33:46

darose
Member
Registered: 2004-04-13
Posts: 158

seth wrote:

Do you have and did you try the GPU in an unused PCIe x16 slot?

?  It's a PCIe x1 card.  I didn't think you could put it in an x16 slot.

#15 2021-07-13 12:37:15

seth
Member
Registered: 2012-09-03
Posts: 58,859

Sure you can.
There are even PCIe x1 slots that can take x16 cards (with limited power and features).

https://en.wikipedia.org/wiki/PCI_Expre … ical_layer
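
To see how a card is actually attached once it is in a slot, lspci -vv reports the link width the device advertises (LnkCap) next to the width that was negotiated (LnkSta). This says nothing about how much power a slot delivers, so treat it as a complementary check only. A sketch that extracts both values, assuming the usual lspci -vv line layout; link_widths is a made-up helper name, and the sample lines in the comments are illustrative, not taken from this machine:

```shell
# Sketch: pull the advertised (LnkCap) and negotiated (LnkSta) PCIe link
# widths out of `lspci -vv` text read from stdin.
# link_widths is a hypothetical helper name used only for this illustration.
link_widths() {
    # lspci -vv prints lines such as (illustrative values):
    #   LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, ...
    #   LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
    # LnkCap2/LnkSta2 lines are skipped because the colon must follow directly.
    sed -n -E 's/^[[:space:]]*(LnkCap|LnkSta):.*Width (x[0-9]+).*/\1 \2/p'
}
```

A typical invocation would be sudo lspci -vv -s 01:00.0 | link_widths (without root, lspci usually cannot read the capability list).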

#16 2021-07-13 12:41:34

darose
Member
Registered: 2004-04-13
Posts: 158

seth wrote:

Sure you can.
There are even PCIe x1 slots that can take x16 cards (with limited power and features).

Huh!  I never knew this!  I know that there are no issues with the x16 slot on my mobo, as I had an older card in there for ages until I bought this new one.  So perhaps this might make a difference.  I'll give this a try later.

#17 2021-07-13 21:42:29

darose
Member
Registered: 2004-04-13
Posts: 158

seth wrote:

Do you have and did you try the GPU in an unused PCIe x16 slot?

And that ... worked!!!

[darose@darsys12 ~]$ nvidia-smi
Tue Jul 13 17:41:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 N/A |                  N/A |
| 40%   35C    P8    N/A /  N/A |      3MiB /   981MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Go figure!

Oh man, I cannot thank you enough!!!!

#18 2021-07-13 21:52:38

seth
Member
Registered: 2012-09-03
Posts: 58,859

Hardware design by optimism :-)

Either the GPU draws more than 25W or the PCIe x1 slot doesn't provide the full 25W. Either way, there was no margin for error…

Please always remember to mark resolved threads by editing the subject of your initial post, so others will know that there's no task left, but maybe a solution to find.
Thanks.
