X hangs: NVRM: GPU has fallen off the bus

kokoko3k · 2013-08-28 11:39:49

Hi, I need some advice on debugging this issue.

When my wife plays Civilization IV under Wine, after some time (30 minutes on average) Xorg locks up completely.
Switching VTs doesn't work, but the network stays up, so I can ssh in and check the logs.

Here's what I think is relevant:

My kernel boot line:

kernel /vmlinuz-linux root=/dev/disk/by-uuid/a1468340-35da-456b-8e02-c0a263a8b0ab ro vga=792 resume=/dev/sda5 pcie_aspm=force consoleblank=0

dmesg:

[11439.712213] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[11439.712223] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[11440.425226] irq 16: nobody cared (try booting with the "irqpoll" option)
[11440.425233] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P  R        O 3.10.9-1-pae #1
[11440.425235] Hardware name: System manufacturer System Product Name/P5QL-ASUS-SE, BIOS 0402    10/09/2008
[11440.425237]  f4482a80 f4482a80 f4409f50 c050490d f4409f70 c01b51f9 c05db3c0 00000010
[11440.425242]  0032ff30 e08eb01a f4482a80 00000010 f4409f94 c01b5599 e0bc4f2a 00000010
[11440.425247]  c0408110 00000000 f4482a80 f4482ad0 00000000 f4409fd4 c01b355d f4482b90
[11440.425252] Call Trace:
[11440.425259]  [<c050490d>] dump_stack+0x16/0x18
[11440.425263]  [<c01b51f9>] __report_bad_irq+0x29/0xd0
[11440.425266]  [<c01b5599>] note_interrupt+0xf9/0x1a0
[11440.425270]  [<c0408110>] ? cpuidle_enter_state+0x40/0xd0
[11440.425273]  [<c01b355d>] handle_irq_event_percpu+0xcd/0x1f0
[11440.425276]  [<c01b5f20>] ? unmask_irq+0x30/0x30
[11440.425279]  [<c0509ff7>] ? nmi_stack_correct+0x2f/0x34
[11440.425282]  [<c01b36b1>] handle_irq_event+0x31/0x50
[11440.425284]  [<c01b5f20>] ? unmask_irq+0x30/0x30
[11440.425287]  [<c01b5f6e>] handle_fasteoi_irq+0x4e/0xe0
[11440.425288]  <IRQ>  [<c051030c>] ? do_IRQ+0x3c/0xb0
[11440.425293]  [<c050c9c1>] ? notifier_call_chain+0x41/0x60
[11440.425296]  [<c05101b3>] ? common_interrupt+0x33/0x38
[11440.425300]  [<c018007b>] ? hibernation_restore+0xeb/0x150
[11440.425315]  [<f8a900e0>] ? is_processor_present+0x1f/0x69 [processor]
[11440.425318]  [<c0408110>] ? cpuidle_enter_state+0x40/0xd0
[11440.425322]  [<c040823e>] ? cpuidle_idle_call+0x9e/0x240
[11440.425326]  [<c010999d>] ? arch_cpu_idle+0xd/0x30
[11440.425329]  [<c0186b63>] ? cpu_startup_entry+0x1a3/0x210
[11440.425332]  [<c04f3e71>] ? rest_init+0x71/0x80
[11440.425335]  [<c06c1ab4>] ? start_kernel+0x39c/0x3a2
[11440.425338]  [<c06c154f>] ? repair_env_string+0x51/0x51
[11440.425341]  [<c06c1376>] ? i386_start_kernel+0x12c/0x12f
[11440.425342] handlers:
[11440.425352] [<f85c3490>] usb_hcd_irq [usbcore]
[11440.425360] [<f84c34c0>] ata_bmdma_interrupt [libata] <---- MY NOTE: IS ATA SOMEHOW RELATED?
[11440.425361] Disabling IRQ #16                        <---- MY NOTE: IRQ16 DISABLED... WHY?
[11441.453258] nvidia 0000:01:00.0: irq 47 for MSI/MSI-X
[11441.454355] NVRM: RmInitAdapter failed! (0x23:0x2f:558)
[11441.454359] NVRM: rm_init_adapter(0) failed

At this point, if I rmmod the nvidia module and load it again, I get:

[11685.809853] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=io+mem
[11685.809879] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:0614)
NVRM: installed in this system is not supported by the 325.15
NVRM: NVIDIA Linux driver release.  Please see 'Appendix
NVRM: A - Supported NVIDIA GPU Products' in this release's
NVRM: README, available on the Linux driver download page
NVRM: at www.nvidia.com.
[11685.809890] nvidia: probe of 0000:01:00.0 failed with error -1
[11685.810087] NVRM: The NVIDIA probe routine failed for 1 device(s).
[11685.810089] NVRM: None of the NVIDIA graphics adapters were initialized!
[11685.810090] [drm] Module unloaded
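Since the machine stays reachable over ssh, the interesting lines can be pulled out of the dmesg output automatically. Below is a minimal sketch in Python, assuming the timestamped format shown above. The function name `find_events` and the labels in `PATTERNS` are my own naming, not anything the driver defines; the matched substrings are copied from the log in this post.

```python
import re

# Matches a timestamped dmesg line such as
# "[11439.712213] NVRM: GPU at 0000:01:00.0 has fallen off the bus."
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s*(?P<msg>.*)$")

# Substrings copied from the log above; the labels are hypothetical names.
PATTERNS = {
    "gpu_lost": "has fallen off the bus",
    "irq_disabled": "Disabling IRQ",
    "init_failed": "RmInitAdapter failed",
}


def find_events(dmesg_text):
    """Return a list of (timestamp_seconds, label, message) for matching lines."""
    events = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines without a timestamp are skipped
        msg = m.group("msg")
        for label, needle in PATTERNS.items():
            if needle in msg:
                events.append((float(m.group("ts")), label, msg))
    return events
```

To use it over ssh, pipe `dmesg` into a small script that calls `find_events(sys.stdin.read())` and prints the result. The time between the first "gpu_lost" entry and the "irq_disabled" entry shows how quickly the IRQ trouble follows the GPU dropping out.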

I should add that this problem started about a month ago, but unfortunately I can't say what changed.
The system was definitely working with nvidia driver 319.23 and kernel 3.9.6.

After that, I've had this problem with:
linux-3.9.9 + nvidia 319.32
linux-3.10.9 + nvidia 325.15

The log messages suggested to me that it was an IRQ problem.
Before nvidia 325.15, the nvidia card shared IRQ 16 with an IDE/ATA adapter (module pata_jmicron).
With nvidia 325.15 the driver uses message signaled interrupts (MSI), so my /proc/interrupts looks like this:

           CPU0       CPU1       
  0:      16188      16075   IO-APIC-edge      timer
  1:         10          6   IO-APIC-edge      i8042
  6:          2          1   IO-APIC-edge      floppy
  7:          1          0   IO-APIC-edge      parport0
  8:          0          1   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 16:         80         80   IO-APIC-fasteoi   uhci_hcd:usb1, pata_jmicron
 17:         22         19   IO-APIC-fasteoi   snd_ice1712
 18:         35         36   IO-APIC-fasteoi   uhci_hcd:usb3, ehci_hcd:usb4, uhci_hcd:usb7, i801_smbus
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
 21:          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 23:         16         19   IO-APIC-fasteoi   uhci_hcd:usb5, ehci_hcd:usb8
 44:       8590       8679   PCI-MSI-edge      ahci
 45:        292        282   PCI-MSI-edge      eth0
 46:        158        159   PCI-MSI-edge      snd_hda_intel
 47:        216        203   PCI-MSI-edge      nvidia
NMI:         14         11   Non-maskable interrupts
LOC:      14545      14270   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:         14         11   Performance monitoring interrupts
IWI:        167        220   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:      11746      11730   Rescheduling interrupts
CAL:         62         94   Function call interrupts
TLB:        199        175   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          1          1   Machine check polls
ERR:          0
MIS:          0

 # lspci -v|grep  "IRQ 16" -B3

00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 (prog-if 00 [UHCI])
        Subsystem: ASUSTeK Computer Inc. P5Q Deluxe Motherboard
        Flags: bus master, medium devsel, latency 0, IRQ 16
--

03:00.0 IDE interface: JMicron Technology Corp. JMB368 IDE controller (prog-if 85 [Master SecO PriO])
        Subsystem: ASUSTeK Computer Inc. Device 827e
        Flags: bus master, fast devsel, latency 0, IRQ 16

So it seems that nvidia is now using IRQ 47, while pata_jmicron is using IRQ 16.
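The same check (which devices share an interrupt line, and which line nvidia is on) can be done by parsing /proc/interrupts directly. This is a rough sketch, assuming the column layout shown above: one count per CPU, a single controller-type token, then a comma-separated list of names. Newer kernels may lay out the columns differently. The function names are hypothetical.

```python
def parse_interrupts(text):
    """Map each numbered IRQ to the list of handler names attached to it."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    irqs = {}
    for line in lines[1:]:
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        fields = rest.split()
        # fields: one count per CPU, the controller type, then device names
        names = " ".join(fields[ncpu + 1:])
        irqs[int(head)] = [n.strip() for n in names.split(",") if n.strip()]
    return irqs


def shared_irqs(irqs):
    """Keep only the IRQ lines that have more than one handler."""
    return {irq: devs for irq, devs in irqs.items() if len(devs) > 1}


def irq_of(irqs, name):
    """Return the IRQ numbers on which the given handler name appears."""
    return [irq for irq, devs in irqs.items() if name in devs]
```

Feed it the contents of /proc/interrupts, e.g. `parse_interrupts(open("/proc/interrupts").read())`, then compare `irq_of(irqs, "nvidia")` against the keys of `shared_irqs(irqs)` to see whether the GPU still shares a line with anything.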

-EDIT-
What I've tried so far, without success:
setting the BIOS option "Plug and Play OS" to YES (it was NO)
setting the nvidia option UseEvents=True in xorg.conf
disabling, in the BIOS, the ATA controller that shared IRQ 16 with the GPU

Now I'm trying the old combination of linux-3.9.6 + nvidia-319.23 again; it has been working for 20 minutes...

Last edited by kokoko3k (2013-08-28 16:23:39)

kokoko3k · 2013-09-01 11:12:57

Oh man, I think (I hope) I solved the issue.
And the log message was right, the card was "falling" off the bus... I took it out, cleaned the connector and plugged it back in, and now it seems to work with the latest driver and kernel.

Will update the thread in case of other errors.

falconindy · 2013-09-01 18:09:18

Yeah, this error most often coincides with a physical problem with the card. Either it isn't seated properly or the card is dying. If the errors go away after reseating the card, then you're probably good to go.
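If you want to check from an ssh session whether the card is still answering on the bus, one option is to read the start of its PCI config space from sysfs. A sketch follows, with assumptions stated: a PCI device that no longer responds typically reads back as all ones, so a vendor ID of 0xFFFF is treated as "gone"; whether sysfs exposes exactly that for this failure is not something confirmed in this thread. The address 0000:01:00.0 is the one from the dmesg output above, and the function names are made up for illustration.

```python
from pathlib import Path


def vendor_id_from_config(config_bytes):
    """Decode the 16-bit little-endian vendor ID at offset 0 of PCI config space."""
    if len(config_bytes) < 2:
        raise ValueError("need at least 2 bytes of config space")
    return int.from_bytes(config_bytes[:2], "little")


def gpu_responds(config_bytes):
    """Assumption: a device that stopped answering reads back as all ones (0xFFFF)."""
    return vendor_id_from_config(config_bytes) != 0xFFFF


def read_config(address="0000:01:00.0"):
    """Read the first 64 bytes of config space for the given PCI address."""
    return Path(f"/sys/bus/pci/devices/{address}/config").read_bytes()[:64]
```

On a healthy NVIDIA card `vendor_id_from_config(read_config())` should give 0x10de, matching the "PCI ID: 10de:0614" line in the log. Running `gpu_responds(read_config())` in a loop during a game session would show roughly when the card drops out, though it does not say why.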

Arch Linux
