GPU has fallen off the bus

Marcus_x80 · 2017-08-23 10:38:03

Hi,

I've had a problem since yesterday. My system freezes every 10 minutes when I'm browsing or just using Thunar. I can still hear sound, but the screen and mouse freeze. My last system update was a week ago, so I updated the system again, but this didn't solve the problem. After the reboot I looked in the system journal and found the following error before every freeze:

kernel: NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
kernel: NVRM: GPU at 0000:02:00.0 has fallen off the bus.
kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

I suspected the graphics card could be damaged, so I booted Windows and ran a benchmark. In Windows everything works fine, with no freezes. Has anybody had the same problem and maybe found a solution?

Kind regards,
Marcus

seth · 2017-08-23 12:57:20

Exactly a 10-minute interval? In that case some power-saving feature might be turning off the PCI bus.
Otherwise I'd start with downgrading the driver (nvidia blob?).
A typical HW cause is insufficient power (with a spike in power draw triggering this).

Marcus_x80 · 2017-08-24 11:52:20

No, sometimes after 1 minute or after 15 minutes or later. I thought about the power supply, but the system runs fine under Windows. So I think it's not the hardware. I started a stress test under Windows and everything runs fine. I updated the driver after the freezes but with no results. Downgrading the driver did not help either.

seth · 2017-08-24 14:19:14

Is this a hybrid system ("optimus", check/dump lspci)?
How is the GPU cooled and do temperatures relate to the incident?
Other warnings/errors from NVRM?

Marcus_x80 · 2017-08-24 21:52:14

No, it is not a hybrid system. The GPU is cooled by two fans. I don't think it's temperature related, because I had this error yesterday one minute after booting. Also, the stress test under Windows was fine. Got no other warnings from NVRM.

seth · 2017-08-26 11:17:16

Marcus_x80 wrote:

Got no other warnings from NVRM.

None? Not even the typical "don't use a framebuffer console" ones?
Check your pacman logs around the time of the first occurrence - notably whether you updated the kernel or the nvidia driver - and try downgrading the driver.
If this is not HW related, it's a driver bug and either in the device driver or in the PCI stack.

Marcus_x80 · 2017-08-29 09:09:05

I got no error saying "don't use a framebuffer console" in the systemd journal. When the first freeze occurred, the system had not been updated for two weeks (and it ran with no problems during that time). Are there any other logs I can check?

seth · 2017-08-29 18:38:49

dmesg | grep -iE '(NVRM|pci)'
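
A side note on collecting these lines: dmesg only covers the running boot, so after a hard freeze and reboot the interesting messages have to come from the previous boot's kernel log. Here is a minimal sketch, assuming a persistent systemd journal; filter_gpu_log and xid_codes are hypothetical helper names, not part of any package.

```shell
# Keep only NVIDIA driver (NVRM) and PCI messages from a kernel log on stdin.
# Same filter as the dmesg command above.
filter_gpu_log() {
    grep -iE '(NVRM|pci)'
}

# Print the numeric Xid code of every "Xid (PCI:...): NN, ..." line on stdin.
xid_codes() {
    sed -n 's/.*Xid (PCI:[^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Usage on a live system (-b -1 selects the boot before the current one):
#   journalctl -k -b -1 --no-pager | filter_gpu_log
#   journalctl -k -b -1 --no-pager | xid_codes | sort -n | uniq -c
```

Both helpers only read stdin, so they can also be run against a saved log file.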

Marcus_x80 · 2017-09-07 19:45:16

The system ran fine for a week and today froze four times in a row. I've checked

 dmesg | grep -iE '(NVRM|pci)'

but got no more errors than the system journal gave me. The current error says:

NVRM: GPU at PCI:0000:02:00: GPU-487dd981-a21a-29ca-9690-2c9d38cc746e
Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
GPU at 0000:02:00.0 has fallen off the bus.
NVRM: A GPU crash dump has been created. If possible, please run
                             NVRM: nvidia-bug-report.sh as root to collect this data before
                             NVRM: the NVIDIA kernel module is unloaded.
arch kernel: irq 17: nobody cared (try booting with the "irqpoll" option)
arch kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O    4.12.10-1-ARCH #1
arch kernel: Hardware name: MSI MS-7681/Z68A-GD65 (G3) (MS-7681), BIOS V25.8 01/11/2013
arch kernel: Call Trace:
arch kernel:  <IRQ>
arch kernel:  dump_stack+0x63/0x8d
arch kernel:  __report_bad_irq+0x35/0xc0
arch kernel:  note_interrupt+0x24b/0x2a0
arch kernel:  handle_irq_event_percpu+0x54/0x80
arch kernel:  handle_irq_event+0x39/0x60
arch kernel:  handle_fasteoi_irq+0x8b/0x150
arch kernel:  handle_irq+0x1a/0x30
arch kernel:  do_IRQ+0x46/0xd0
arch kernel:  common_interrupt+0x89/0x89
arch kernel: RIP: 0010:cpuidle_enter_state+0x128/0x300
arch kernel: RSP: 0018:ffffffff94a03dc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff8e
arch kernel: RAX: ffff91071ec18c40 RBX: 00000013d0b5aaec RCX: 000000000000001f
arch kernel: RDX: 00000013d0b5aaec RSI: ffff91071ec16458 RDI: 0000000000000000
arch kernel: RBP: ffffffff94a03e00 R08: cccccccccccccccd R09: 0000000000000008
arch kernel: R10: 0000000000000001 R11: 0000000000000001 R12: ffff91071ec21200
arch kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffffffff94aa9098
arch kernel:  </IRQ>
arch kernel:  cpuidle_enter+0x17/0x20
arch kernel:  call_cpuidle+0x23/0x40
arch kernel:  do_idle+0x18a/0x1e0
arch kernel:  cpu_startup_entry+0x71/0x80
arch kernel:  rest_init+0x84/0x90
arch kernel:  start_kernel+0x43b/0x45c
arch kernel:  ? early_idt_handler_array+0x120/0x120
arch kernel:  x86_64_start_reservations+0x29/0x2b
arch kernel:  x86_64_start_kernel+0x143/0x166
arch kernel:  secondary_startup_64+0x9f/0x9f
arch kernel: handlers:
arch kernel: [<ffffffffc06ab7c0>] oxygen_interrupt [snd_oxygen_lib]
arch kernel: Disabling IRQ #17

Could it be an IRQ conflict with my sound card? By the way, thanks for your help seth!

seth · 2017-09-08 14:31:29

Let's see

cat /proc/interrupts

Voodoo solutions Google came up with:

/etc/modprobe.d/nvidia.conf

options nvidia NVreg_EnableMSI=1

Kernel parameter

acpi=debug
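
Whether the MSI option actually took effect can be read from /proc/interrupts: with MSI the nvidia handler sits on a PCI-MSI line, without it on an IO-APIC line. A rough sketch follows; nvidia_irq_mode is a hypothetical helper name, and it assumes the handler shows up under the name "nvidia".

```shell
# Print the interrupt controller type ("PCI-MSI" or "IO-APIC") of every
# /proc/interrupts line that mentions the nvidia handler.
# Reads the file named by $1, defaulting to /proc/interrupts.
nvidia_irq_mode() {
    awk '/nvidia/ {
        for (i = 1; i <= NF; i++)
            if ($i == "PCI-MSI" || $i == "IO-APIC") print $i
    }' "${1:-/proc/interrupts}"
}

# Usage: nvidia_irq_mode
```

If nothing is printed, no handler named nvidia is registered at all (for example because the module is not loaded).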

Marcus_x80 · 2017-09-08 15:22:09

MSI is already activated.

 cat /proc/interrupts

gives me the following output:

  
           CPU0       CPU1       CPU2       CPU3       
  0:         14          0          0          0   IO-APIC   2-edge      timer
  8:          1          0          0          0   IO-APIC   8-edge      rtc0
  9:          0          0          0          0   IO-APIC   9-fasteoi   acpi
 16:     816782          0          0          0   IO-APIC  16-fasteoi   ehci_hcd:usb1
 17:        123          0          0          0   IO-APIC  17-fasteoi   snd_oxygen_lib
 18:        415          0          0          0   IO-APIC  18-fasteoi   i801_smbus, snd_hda_intel:card1
 19:          0          0          0          0   IO-APIC  19-fasteoi   ata_piix
 23:       8798          0          0          0   IO-APIC  23-fasteoi   ehci_hcd:usb4
 26:          0          0          0          0   PCI-MSI 3145728-edge      xhci_hcd
 27:          0          0          0          0   PCI-MSI 3145729-edge      xhci_hcd
 28:          0          0          0          0   PCI-MSI 3145730-edge      xhci_hcd
 29:          0          0          0          0   PCI-MSI 3145731-edge      xhci_hcd
 30:          0          0          0          0   PCI-MSI 3145732-edge      xhci_hcd
 31:       1280          0          0          0   PCI-MSI 4718592-edge      xhci_hcd
 32:          0          0          0          0   PCI-MSI 4718593-edge      xhci_hcd
 33:          0          0          0          0   PCI-MSI 4718594-edge      xhci_hcd
 34:          0          0          0          0   PCI-MSI 4718595-edge      xhci_hcd
 35:          0          0          0          0   PCI-MSI 4718596-edge      xhci_hcd
 36:     630962          0          0          0   PCI-MSI 512000-edge      ahci[0000:00:1f.2]
 37:         13          0          0          0   PCI-MSI 360448-edge      mei_me
 38:     569833          0          0          0   PCI-MSI 5242880-edge      enp10s0
 39:      96469          0          0          0   PCI-MSI 1048576-edge      nvidia
NMI:         62         53         52         52   Non-maskable interrupts
LOC:     800738     512037     752322     467936   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:         62         53         52         52   Performance monitoring interrupts
IWI:          1          0          0          0   IRQ work interrupts
RTR:          1          0          0          0   APIC ICR read retries
RES:     190427     173242     187343     172887   Rescheduling interrupts
CAL:     126859     121613     127758     120121   Function call interrupts
TLB:     115942     111307     114983     108516   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
DFR:          0          0          0          0   Deferred Error APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:         36         36         36         36   Machine check polls
ERR:          0
MIS:          0
PIN:          0          0          0          0   Posted-interrupt notification event
PIW:          0          0          0          0   Posted-interrupt wakeup event

Last edited by Marcus_x80 (2017-09-08 16:17:08)

seth · 2017-09-08 20:26:00

At least they don't share an interrupt - why do you suspect a conflict?
Did you try to blacklist snd_hda_intel? (This should nuke both audio devices)
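
For reference, blacklisting is a single line in a file under /etc/modprobe.d/, and after a reboot /proc/modules shows whether the module still got loaded. A small sketch follows; the file name blacklist-audio.conf and the helper module_loaded are made up for illustration.

```shell
# /etc/modprobe.d/blacklist-audio.conf (hypothetical file name) would contain:
#   blacklist snd_hda_intel

# Succeed (exit status 0) if module $1 is listed in the /proc/modules-style
# file $2, which defaults to /proc/modules.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage after rebooting:
#   module_loaded snd_hda_intel && echo "still loaded" || echo "not loaded"
```

The check takes any module name, so it works the same way for the other sound modules.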

progandy · 2017-09-08 20:40:08

Is your Arch Linux stable when you boot to the terminal (multi-user.target)? https://wiki.archlinux.org/index.php/Sy … _boot_into
Does it work with the nouveau driver instead of the nvidia binary?

Marcus_x80 · 2017-09-13 21:27:32

I tried to disable MSI and got the following interrupts:

          CPU0       CPU1       CPU2       CPU3       
  0:         15          0          0          0   IO-APIC   2-edge      timer
  8:          1          0          0          0   IO-APIC   8-edge      rtc0
  9:          0          0          0          0   IO-APIC   9-fasteoi   acpi
 16:      17627          0          0          0   IO-APIC  16-fasteoi   ehci_hcd:usb1
 17:       2953          0          0          0   IO-APIC  17-fasteoi   nvidia, snd_oxygen_lib
 18:        462          0          0          0   IO-APIC  18-fasteoi   i801_smbus, snd_hda_intel:card0
 19:          0          0          0          0   IO-APIC  19-fasteoi   ata_piix
 23:         33          0          0          0   IO-APIC  23-fasteoi   ehci_hcd:usb4
 26:          0          0          0          0   PCI-MSI 3145728-edge      xhci_hcd
 27:          0          0          0          0   PCI-MSI 3145729-edge      xhci_hcd
 28:          0          0          0          0   PCI-MSI 3145730-edge      xhci_hcd
 29:          0          0          0          0   PCI-MSI 3145731-edge      xhci_hcd
 30:          0          0          0          0   PCI-MSI 3145732-edge      xhci_hcd
 31:      32230          0          0          0   PCI-MSI 512000-edge      ahci[0000:00:1f.2]
 32:        190          0          0          0   PCI-MSI 4718592-edge      xhci_hcd
 33:          0          0          0          0   PCI-MSI 4718593-edge      xhci_hcd
 34:          0          0          0          0   PCI-MSI 4718594-edge      xhci_hcd
 35:          0          0          0          0   PCI-MSI 4718595-edge      xhci_hcd
 36:          0          0          0          0   PCI-MSI 4718596-edge      xhci_hcd
 37:         13          0          0          0   PCI-MSI 360448-edge      mei_me
 38:        181          0          0          0   PCI-MSI 5242880-edge      enp10s0
NMI:          0          0          0          0   Non-maskable interrupts
LOC:       8296       3378       2756       3106   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0   Performance monitoring interrupts
IWI:          0          0          0          0   IRQ work interrupts
RTR:          0          0          0          0   APIC ICR read retries
RES:        979       1109        872       1012   Rescheduling interrupts
CAL:       1142       1609       1526       1678   Function call interrupts
TLB:         20         74         50         68   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
DFR:          0          0          0          0   Deferred Error APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:          2          2          2          2   Machine check polls
ERR:          0
MIS:          0
PIN:          0          0          0          0   Posted-interrupt notification event
PIW:          0          0          0          0   Posted-interrupt wakeup event

That's why I guessed an IRQ conflict. Disabling MSI did not resolve the problem; it froze moments ago (it's annoying that I cannot reproduce the freeze and have to wait until it happens).
I have now blacklisted snd_hda_intel (I still get sound from my Asus sound card). If the system freezes again, I'm going to try the nouveau driver instead of the nvidia driver.
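
Shared lines like IRQ 17 and 18 above can be picked out of /proc/interrupts automatically. Here is a minimal sketch; shared_irqs is a hypothetical helper name, and it assumes the trigger column ends in -edge, -fasteoi or -level as in the listings in this thread. Sharing a legacy IRQ is not an error by itself; the "nobody cared" message only means the line fired and no registered handler claimed the interrupt.

```shell
# Print "IRQ: handlers" for every numbered /proc/interrupts line whose
# handler list contains a comma, i.e. more than one driver uses that IRQ.
# Reads the file named by $1, defaulting to /proc/interrupts.
shared_irqs() {
    awk '$1 ~ /^[0-9]+:$/ && /,/ {
        start = 0
        # The handler names begin right after the trigger-type column.
        for (i = 2; i <= NF; i++)
            if ($i ~ /-(edge|fasteoi|level)$/) { start = i + 1; break }
        if (!start) next
        names = $start
        for (i = start + 1; i <= NF; i++) names = names " " $i
        sub(/:$/, "", $1)
        print $1 ": " names
    }' "${1:-/proc/interrupts}"
}

# Usage: shared_irqs
```

Comparing its output with MSI enabled and disabled shows directly which devices fall back to a shared line.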

anarki@buttereblume · 2017-10-09 11:51:47

Since switching from kernel 4.12.x to 4.13.4, and thus to nvidia 387.12 (earlier versions depend on <4.13), I have been experiencing the same issue with a GP107 (GTX 1050). For me, the last working kernel / nvidia combination was 4.12.8 / 384.59.

The news is that since using nvidia-settings to switch the GPU clock from adaptive to maximum, I have had no more freezes for three days with the current kernel / nvidia packages.

Last edited by anarki@buttereblume (2017-10-09 11:52:54)

seth · 2017-10-09 13:46:35

anarki@buttereblume, probably rather the wildly popular "intel_iommu=off" bug?

anarki@buttereblume · 2017-10-17 14:56:12

After two weeks of testing, I can conclude that there is a correlation between the issue and the modeset kernel parameter. With it turned on by setting nvidia-drm.modeset=1, it takes between 10 and 30 minutes until the nvidia module crashes. Turning it off makes the issue disappear.

Last edited by anarki@buttereblume (2017-10-27 09:56:02)
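
For anyone wanting to compare their own setup, the modeset value passed at boot can be read from the kernel command line. A minimal sketch follows; modeset_from_cmdline is a hypothetical helper name. It only looks at the command line, so a value set through an options line in /etc/modprobe.d/ would not show up here.

```shell
# Print the value of the last nvidia-drm.modeset=... token of a kernel
# command line read from stdin, or "unset" if there is none.
# The kernel treats "-" and "_" in module names alike, so both are accepted.
modeset_from_cmdline() {
    tr ' ' '\n' | awk -F= '
        $1 == "nvidia-drm.modeset" || $1 == "nvidia_drm.modeset" { v = $2 }
        END { print (v == "" ? "unset" : v) }'
}

# Usage:
#   modeset_from_cmdline < /proc/cmdline
# The value the loaded module actually uses can be read as root from:
#   cat /sys/module/nvidia_drm/parameters/modeset
```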
