Hi,
I've had a problem since yesterday. My system freezes every 10 minutes when I'm browsing or just using thunar. I can still hear sound, but the screen and mouse freeze. My last system update was a week ago, so I updated the system, but this didn't solve the problem. After the reboot I looked in the system journal and found the following error before every freeze:
kernel: NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
kernel: NVRM: GPU at 0000:02:00.0 has fallen off the bus.
kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
I guessed the graphics card could be damaged, so I booted Windows and ran a benchmark. In Windows everything works fine, no freezes. Has anybody had the same problem and maybe a solution?
Kind regards,
Marcus
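For anyone following along, a minimal sketch of how the Xid reports quoted above could be pulled out of a saved kernel log. It assumes the log lines look like the ones in this post; the function name `find_xid_events` is made up for illustration and is not part of any NVIDIA tool.

```python
import re
from typing import List, Tuple

# Matches NVRM Xid reports in the form seen in this thread, e.g.
#   NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(r"Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text: str) -> List[Tuple[str, int]]:
    """Return (pci_address, xid_code) for every Xid line in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events
```

One way to use it is to save the kernel messages of the previous boot with `journalctl -k -b -1 > kernel.log` and pass the file contents to `find_xid_events`. Counting how often each code shows up makes it easier to see whether Xid 79 is the only error involved.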
Offline
Exactly a 10 minute interval? Some power-saving feature may be turning off the PCI bus.
Otherwise I'd start with downgrading the driver (nvidia blob?)
A typical HW cause is an underpowered supply (with a spike drain causing this)
Offline
No, sometimes after 1 minute or after 15 minutes or later. I thought about the power supply, but the system runs fine under Windows. So I think it's not the hardware. I started a stress test under Windows and everything runs fine. I updated the driver after the freezes but with no results. Downgrading the driver did not help either.
Offline
Is this a hybrid system ("optimus", check/dump lspci)?
How is the GPU cooled and do temperatures relate to the incident?
Other warnings/errors from NVRM?
Offline
No, it is not a hybrid system. The GPU is cooled by two fans. I don't think it's temperature related, because I had this error yesterday one minute after booting. Also, the stress test under Windows was fine. Got no other warnings from NVRM.
Offline
"Got no other warnings from NVRM."
None? Not even the typical "don't use a framebuffer console" ones?
Check your pacman logs around the time of the first occurrence, notably whether you updated the kernel or the nvidia driver, and try downgrading the driver.
If this is not HW related, it's a driver bug, either in the device driver or in the PCI stack.
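A rough sketch of the pacman log check suggested above. It assumes the usual `[timestamp] [ALPM] upgraded pkg (old -> new)` line layout of /var/log/pacman.log; the function name `package_changes` and the default prefixes are illustrative choices, not anything pacman provides.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout of an ALPM transaction line in /var/log/pacman.log:
#   [<timestamp>] [ALPM] upgraded <package> (<old> -> <new>)
ALPM_RE = re.compile(
    r"^\[([^\]]+)\] \[ALPM\] (installed|upgraded|downgraded) (\S+) \((.+)\)$"
)


def package_changes(
    log_lines: Iterable[str],
    prefixes: Tuple[str, ...] = ("linux", "nvidia"),
) -> List[Tuple[str, str, str, str]]:
    """Return (timestamp, action, package, versions) for matching packages."""
    changes = []
    for line in log_lines:
        match = ALPM_RE.match(line.strip())
        # Prefix matching also catches related packages such as linux-headers.
        if match and match.group(3).startswith(prefixes):
            changes.append(match.groups())
    return changes
```

Feeding it the lines of /var/log/pacman.log gives a short list of kernel and driver changes that can be compared against the date of the first freeze.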
Offline
I got no error saying "don't use a framebuffer console" in the systemd journal. When the first freeze occurred, the system had not been updated for two weeks (and it ran with no problems during that time). Are there any other logs I can check?
Offline
dmesg | grep -iE '(NVRM|pci)'
Offline
The system ran fine for a week, and today it froze four times in a row. I've checked
dmesg | grep -iE '(NVRM|pci)'
but it showed no errors beyond what the system journal gave me. The current error says:
NVRM: GPU at PCI:0000:02:00: GPU-487dd981-a21a-29ca-9690-2c9d38cc746e
Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
GPU at 0000:02:00.0 has fallen off the bus.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
arch kernel: irq 17: nobody cared (try booting with the "irqpoll" option)
arch kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 4.12.10-1-ARCH #1
arch kernel: Hardware name: MSI MS-7681/Z68A-GD65 (G3) (MS-7681), BIOS V25.8 01/11/2013
arch kernel: Call Trace:
arch kernel: <IRQ>
arch kernel: dump_stack+0x63/0x8d
arch kernel: __report_bad_irq+0x35/0xc0
arch kernel: note_interrupt+0x24b/0x2a0
arch kernel: handle_irq_event_percpu+0x54/0x80
arch kernel: handle_irq_event+0x39/0x60
arch kernel: handle_fasteoi_irq+0x8b/0x150
arch kernel: handle_irq+0x1a/0x30
arch kernel: do_IRQ+0x46/0xd0
arch kernel: common_interrupt+0x89/0x89
arch kernel: RIP: 0010:cpuidle_enter_state+0x128/0x300
arch kernel: RSP: 0018:ffffffff94a03dc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff8e
arch kernel: RAX: ffff91071ec18c40 RBX: 00000013d0b5aaec RCX: 000000000000001f
arch kernel: RDX: 00000013d0b5aaec RSI: ffff91071ec16458 RDI: 0000000000000000
arch kernel: RBP: ffffffff94a03e00 R08: cccccccccccccccd R09: 0000000000000008
arch kernel: R10: 0000000000000001 R11: 0000000000000001 R12: ffff91071ec21200
arch kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffffffff94aa9098
arch kernel: </IRQ>
arch kernel: cpuidle_enter+0x17/0x20
arch kernel: call_cpuidle+0x23/0x40
arch kernel: do_idle+0x18a/0x1e0
arch kernel: cpu_startup_entry+0x71/0x80
arch kernel: rest_init+0x84/0x90
arch kernel: start_kernel+0x43b/0x45c
arch kernel: ? early_idt_handler_array+0x120/0x120
arch kernel: x86_64_start_reservations+0x29/0x2b
arch kernel: x86_64_start_kernel+0x143/0x166
arch kernel: secondary_startup_64+0x9f/0x9f
arch kernel: handlers:
arch kernel: [<ffffffffc06ab7c0>] oxygen_interrupt [snd_oxygen_lib]
arch kernel: Disabling IRQ #17
Could it be an IRQ conflict with my sound card? By the way, thanks for your help seth!
Offline
Let's see
cat /proc/interrupts
Voodoo solutions Google came up with:
/etc/modprobe.d/nvidia.conf
options nvidia NVreg_EnableMSI=1
Kernel parameter
acpi=debug
Offline
MSI is already activated.
cat /proc/interrupts
gives me the following output:
CPU0 CPU1 CPU2 CPU3
0: 14 0 0 0 IO-APIC 2-edge timer
8: 1 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 IO-APIC 9-fasteoi acpi
16: 816782 0 0 0 IO-APIC 16-fasteoi ehci_hcd:usb1
17: 123 0 0 0 IO-APIC 17-fasteoi snd_oxygen_lib
18: 415 0 0 0 IO-APIC 18-fasteoi i801_smbus, snd_hda_intel:card1
19: 0 0 0 0 IO-APIC 19-fasteoi ata_piix
23: 8798 0 0 0 IO-APIC 23-fasteoi ehci_hcd:usb4
26: 0 0 0 0 PCI-MSI 3145728-edge xhci_hcd
27: 0 0 0 0 PCI-MSI 3145729-edge xhci_hcd
28: 0 0 0 0 PCI-MSI 3145730-edge xhci_hcd
29: 0 0 0 0 PCI-MSI 3145731-edge xhci_hcd
30: 0 0 0 0 PCI-MSI 3145732-edge xhci_hcd
31: 1280 0 0 0 PCI-MSI 4718592-edge xhci_hcd
32: 0 0 0 0 PCI-MSI 4718593-edge xhci_hcd
33: 0 0 0 0 PCI-MSI 4718594-edge xhci_hcd
34: 0 0 0 0 PCI-MSI 4718595-edge xhci_hcd
35: 0 0 0 0 PCI-MSI 4718596-edge xhci_hcd
36: 630962 0 0 0 PCI-MSI 512000-edge ahci[0000:00:1f.2]
37: 13 0 0 0 PCI-MSI 360448-edge mei_me
38: 569833 0 0 0 PCI-MSI 5242880-edge enp10s0
39: 96469 0 0 0 PCI-MSI 1048576-edge nvidia
NMI: 62 53 52 52 Non-maskable interrupts
LOC: 800738 512037 752322 467936 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 62 53 52 52 Performance monitoring interrupts
IWI: 1 0 0 0 IRQ work interrupts
RTR: 1 0 0 0 APIC ICR read retries
RES: 190427 173242 187343 172887 Rescheduling interrupts
CAL: 126859 121613 127758 120121 Function call interrupts
TLB: 115942 111307 114983 108516 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 36 36 36 36 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
PIW: 0 0 0 0 Posted-interrupt wakeup event
Last edited by Marcus_x80 (2017-09-08 16:17:08)
Offline
At least they don't share an interrupt - why do you suspect a conflict?
Did you try to blacklist snd_hda_intel? (This should nuke both audio devices)
Offline
Is your Arch Linux stable when you boot to the terminal (multi-user.target)? https://wiki.archlinux.org/index.php/Sy … _boot_into
Does it work with the nouveau driver instead of the nvidia binary?
Offline
I tried to disable MSI and got the following interrupts:
CPU0 CPU1 CPU2 CPU3
0: 15 0 0 0 IO-APIC 2-edge timer
8: 1 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 IO-APIC 9-fasteoi acpi
16: 17627 0 0 0 IO-APIC 16-fasteoi ehci_hcd:usb1
17: 2953 0 0 0 IO-APIC 17-fasteoi nvidia, snd_oxygen_lib
18: 462 0 0 0 IO-APIC 18-fasteoi i801_smbus, snd_hda_intel:card0
19: 0 0 0 0 IO-APIC 19-fasteoi ata_piix
23: 33 0 0 0 IO-APIC 23-fasteoi ehci_hcd:usb4
26: 0 0 0 0 PCI-MSI 3145728-edge xhci_hcd
27: 0 0 0 0 PCI-MSI 3145729-edge xhci_hcd
28: 0 0 0 0 PCI-MSI 3145730-edge xhci_hcd
29: 0 0 0 0 PCI-MSI 3145731-edge xhci_hcd
30: 0 0 0 0 PCI-MSI 3145732-edge xhci_hcd
31: 32230 0 0 0 PCI-MSI 512000-edge ahci[0000:00:1f.2]
32: 190 0 0 0 PCI-MSI 4718592-edge xhci_hcd
33: 0 0 0 0 PCI-MSI 4718593-edge xhci_hcd
34: 0 0 0 0 PCI-MSI 4718594-edge xhci_hcd
35: 0 0 0 0 PCI-MSI 4718595-edge xhci_hcd
36: 0 0 0 0 PCI-MSI 4718596-edge xhci_hcd
37: 13 0 0 0 PCI-MSI 360448-edge mei_me
38: 181 0 0 0 PCI-MSI 5242880-edge enp10s0
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 8296 3378 2756 3106 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 0 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 979 1109 872 1012 Rescheduling interrupts
CAL: 1142 1609 1526 1678 Function call interrupts
TLB: 20 74 50 68 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 2 2 2 2 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
PIW: 0 0 0 0 Posted-interrupt wakeup event
That's why I guessed an IRQ conflict. Disabling MSI did not resolve the problem; it froze moments ago (it's annoying that I cannot reproduce the freeze, I have to wait until it happens).
I have now blacklisted snd_hda_intel (I still get sound from my Asus sound card). If the system freezes again, I'm going to try the nouveau driver instead of the nvidia driver.
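The comparison of the two /proc/interrupts listings above can be sketched as a small script that reports IRQ lines with more than one handler. It assumes the column layout shown in this thread (per-CPU counters, chip name, "hwirq-trigger" field, then a comma-separated handler list); the function name `shared_irqs` is hypothetical, and other kernels may lay the file out differently.

```python
from typing import Dict, List


def shared_irqs(interrupts_text: str) -> Dict[str, List[str]]:
    """Map IRQ number -> handler names for IRQs that have several handlers."""
    lines = interrupts_text.strip().splitlines()
    if not lines:
        return {}
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        fields = rest.split()
        # Skip the per-CPU counters, the chip name and the hwirq-trigger
        # field; what remains is the handler list.
        handler_text = " ".join(fields[cpu_count + 2:])
        handlers = [h.strip() for h in handler_text.split(",") if h.strip()]
        if len(handlers) > 1:
            shared[label] = handlers
    return shared
```

With MSI disabled, as in the second listing, this would flag IRQ 17 as shared between nvidia and snd_oxygen_lib. With MSI enabled the nvidia driver has its own PCI-MSI vector and IRQ 17 carries snd_oxygen_lib alone.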
Offline
Since switching from kernel 4.12.x to 4.13.4, and thus to nvidia 387.12 (earlier versions depend on <4.13), I have experienced the same issue on a GP107 (GTX 1050). For me, the last working kernel / nvidia combination was 4.12.8 / 384.59.
The news is that since using nvidia-settings to set the GPU clock from adaptive to maximum, I have experienced no more freezes for three days with the current kernel / nvidia packages.
Last edited by anarki@buttereblume (2017-10-09 11:52:54)
Offline
anarki@buttereblume, probably rather the wildly popular "intel_iommu=off" bug?
Offline
After two weeks of testing, I can conclude that there is a correlation between the issue and the modeset kernel parameter. With it turned on (nvidia-drm.modeset=1), it takes between 10 and 30 minutes until the nvidia module crashes. Turning it off makes the issue disappear.
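To check whether a given boot actually had the parameter set, a small sketch that reads it off a kernel command line string. The function name `modeset_from_cmdline` is made up for illustration, and taking the last occurrence as the effective one is an assumption here, not something verified against the driver.

```python
from typing import Optional


def modeset_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the nvidia-drm.modeset value on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        # The kernel accepts '-' and '_' interchangeably in parameter names.
        if sep and name.replace("-", "_") == "nvidia_drm.modeset":
            value = val  # assumption: the last occurrence is the effective one
    return value
```

It could be called as `modeset_from_cmdline(open("/proc/cmdline").read())`. This only covers the kernel command line; if the option is set through a file in /etc/modprobe.d instead, it will not show up here.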
Last edited by anarki@buttereblume (2017-10-27 09:56:02)
Offline