#1 2021-01-09 19:41:32

areuz
Member
Registered: 2016-09-16
Posts: 56

Nvidia "recursive fault" causes system to permanently lock up

As the title says, I believe the nvidia driver causes my system to occasionally lock up (often during normal daily tasks: opening and closing tabs, switching songs, etc.). I don't believe the lockup ever goes away, but I tend to lose my patience after about 5 minutes and hard-reboot.
I've seen and partly read through these topics:
https://bbs.archlinux.org/viewtopic.php?id=261618
https://bbs.archlinux.org/viewtopic.php … 3#p1938113
https://bbs.archlinux.org/viewtopic.php?id=260131
https://forums.developer.nvidia.com/t/4 … nts/155250

But I don't believe they are related to my issue - I've also seen this issue many years ago (2015 or so, shortly before I stopped using Linux for a while).
The issue has some very specific symptoms:
Screens freeze - no artifacts, unplugging and replugging monitors "works" but doesn't solve the lock-up
The keyboard seems to be unresponsive - numlock, capslock, other mod keys don't work, replugging also doesn't work
Audio keeps playing, be it YT, spotify, anything - until I press the power button, then it stops, but I have to long-press for the system to actually reboot

Journal logs: (beginning at what I believe to be the relevant part, I didn't see any other errors or warnings prior)
First occurrence: (pressed power button, waited for 30+ seconds, then hard reset. My PC normally turns off within 5-10s)

Jan 09 09:10:21 mia-arch kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: #PF: supervisor read access in kernel mode
Jan 09 09:10:21 mia-arch kernel: #PF: error_code(0x0000) - not-present page
Jan 09 09:10:21 mia-arch kernel: PGD 25bb60067 P4D 25bb60067 PUD 0 
Jan 09 09:10:21 mia-arch kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 09 09:10:21 mia-arch kernel: CPU: 9 PID: 664 Comm: irq/106-nvidia Tainted: P           OE     5.10.4-arch2-1 #1
Jan 09 09:10:21 mia-arch kernel: Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII IMPACT, BIOS 3003 12/04/2020
Jan 09 09:10:21 mia-arch kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 09 09:10:21 mia-arch kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 09 09:10:21 mia-arch kernel: RSP: 0018:ffffae61c11c7be0 EFLAGS: 00010202
Jan 09 09:10:21 mia-arch kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 09 09:10:21 mia-arch kernel: RDX: ffff9f9f63b9cf48 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: RBP: ffff9f9e18dbd940 R08: ffffffffc29cd510 R09: ffff9f9e18dbd920
Jan 09 09:10:21 mia-arch kernel: R10: ffffffffc1618690 R11: ffff9f9e0df2b008 R12: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: R13: 0000000000000000 R14: ffff9f9e18dbdaa8 R15: ffff9f9e18dbdbb0
Jan 09 09:10:21 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff9fa50ec40000(0000) knlGS:0000000000000000
Jan 09 09:10:21 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 09:10:21 mia-arch kernel: CR2: 0000000000000020 CR3: 00000001b64c4000 CR4: 0000000000350ee0
Jan 09 09:10:21 mia-arch kernel: Call Trace:
Jan 09 09:10:21 mia-arch kernel:  ? _nv029950rm+0x1b/0x90 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv025474rm+0x18/0x60 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv011691rm+0x13d/0x1c0 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv011619rm+0xff/0x180 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv018449rm+0x1af/0x210 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv018389rm+0xd9a/0xe90 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv018390rm+0xde/0x260 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv018356rm+0x72/0xc0 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv018370rm+0x235/0x2d0 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv026076rm+0x10/0x10 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv018403rm+0xac/0xe0 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv027734rm+0x820/0xdc0 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv007566rm+0x155/0x270 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv027742rm+0x8d/0x180 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? disable_irq_nosync+0x10/0x10
Jan 09 09:10:21 mia-arch kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Jan 09 09:10:21 mia-arch kernel:  ? irq_thread_fn+0x20/0x60
Jan 09 09:10:21 mia-arch kernel:  ? irq_thread+0xf5/0x1a0
Jan 09 09:10:21 mia-arch kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Jan 09 09:10:21 mia-arch kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Jan 09 09:10:21 mia-arch kernel:  ? kthread+0x133/0x150
Jan 09 09:10:21 mia-arch kernel:  ? __kthread_bind_mask+0x60/0x60
Jan 09 09:10:21 mia-arch kernel:  ? ret_from_fork+0x22/0x30
Jan 09 09:10:21 mia-arch kernel: Modules linked in: uas usb_storage md4 cmac nls_utf8 cifs dns_resolver fscache libdes hid_logitech_hidpp mousedev joydev qcserial usb_wwan btusb nvidia_drm(POE) nvidia_modeset(POE) btrtl btbcm btintel bluetooth hid_corsair hid_logitech_dj >
Jan 09 09:10:21 mia-arch kernel:  pinctrl_amd mac_hid acpi_cpufreq drm vmmon(OE) vmw_vmci fuse crypto_user agpgart bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci xhci_pci_renesas
Jan 09 09:10:21 mia-arch kernel: CR2: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: ---[ end trace f64cb1d7fb847c2a ]---
Jan 09 09:10:21 mia-arch kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 09 09:10:21 mia-arch kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 09 09:10:21 mia-arch kernel: RSP: 0018:ffffae61c11c7be0 EFLAGS: 00010202
Jan 09 09:10:21 mia-arch kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 09 09:10:21 mia-arch kernel: RDX: ffff9f9f63b9cf48 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: RBP: ffff9f9e18dbd940 R08: ffffffffc29cd510 R09: ffff9f9e18dbd920
Jan 09 09:10:21 mia-arch kernel: R10: ffffffffc1618690 R11: ffff9f9e0df2b008 R12: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: R13: 0000000000000000 R14: ffff9f9e18dbdaa8 R15: ffff9f9e18dbdbb0
Jan 09 09:10:21 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff9fa50ec40000(0000) knlGS:0000000000000000
Jan 09 09:10:21 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 09:10:21 mia-arch kernel: CR2: 0000000000000020 CR3: 00000001b64c4000 CR4: 0000000000350ee0
Jan 09 09:10:21 mia-arch kernel: BUG: kernel NULL pointer dereference, address: 0000000000000959
Jan 09 09:10:21 mia-arch kernel: #PF: supervisor write access in kernel mode
Jan 09 09:10:21 mia-arch kernel: #PF: error_code(0x0002) - not-present page
Jan 09 09:10:21 mia-arch kernel: PGD 25bb60067 P4D 25bb60067 PUD 0 
Jan 09 09:10:21 mia-arch kernel: Oops: 0002 [#2] PREEMPT SMP NOPTI
Jan 09 09:10:21 mia-arch kernel: CPU: 9 PID: 664 Comm: irq/106-nvidia Tainted: P      D    OE     5.10.4-arch2-1 #1
Jan 09 09:10:21 mia-arch kernel: Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII IMPACT, BIOS 3003 12/04/2020
Jan 09 09:10:21 mia-arch kernel: RIP: 0010:mutex_lock+0x10/0x20
Jan 09 09:10:21 mia-arch kernel: Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 a1 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
Jan 09 09:10:21 mia-arch kernel: RSP: 0018:ffffae61c11c7e30 EFLAGS: 00010246
Jan 09 09:10:21 mia-arch kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Jan 09 09:10:21 mia-arch kernel: RDX: ffff9f9e13539ec0 RSI: 0000000000001b41 RDI: 0000000000000959
Jan 09 09:10:21 mia-arch kernel: RBP: 0000000000000959 R08: 000000000000001f R09: 0000000000000000
Jan 09 09:10:21 mia-arch kernel: R10: ffff9f9e0b0f3000 R11: 0000000000000000 R12: ffff9f9e1353a6b4
Jan 09 09:10:21 mia-arch kernel: R13: 0000000000000001 R14: 0000000000000001 R15: ffff9f9e13539ec0
Jan 09 09:10:21 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff9fa50ec40000(0000) knlGS:0000000000000000
Jan 09 09:10:21 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 09:10:21 mia-arch kernel: CR2: 0000000000000959 CR3: 00000001b64c4000 CR4: 0000000000350ee0
Jan 09 09:10:21 mia-arch kernel: Call Trace:
Jan 09 09:10:21 mia-arch kernel:  perf_event_exit_task+0x30/0x440
Jan 09 09:10:21 mia-arch kernel:  ? kfree+0x40c/0x440
Jan 09 09:10:21 mia-arch kernel:  do_exit+0x355/0xa40
Jan 09 09:10:21 mia-arch kernel:  ? task_work_run+0x5c/0x90
Jan 09 09:10:21 mia-arch kernel:  ? do_exit+0x345/0xa40
Jan 09 09:10:21 mia-arch kernel:  ? kthread+0x133/0x150
Jan 09 09:10:21 mia-arch kernel:  ? rewind_stack_do_exit+0x17/0x17
Jan 09 09:10:21 mia-arch kernel: Modules linked in: uas usb_storage md4 cmac nls_utf8 cifs dns_resolver fscache libdes hid_logitech_hidpp mousedev joydev qcserial usb_wwan btusb nvidia_drm(POE) nvidia_modeset(POE) btrtl btbcm btintel bluetooth hid_corsair hid_logitech_dj >
Jan 09 09:10:21 mia-arch kernel:  pinctrl_amd mac_hid acpi_cpufreq drm vmmon(OE) vmw_vmci fuse crypto_user agpgart bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci xhci_pci_renesas
Jan 09 09:10:21 mia-arch kernel: CR2: 0000000000000959
Jan 09 09:10:21 mia-arch kernel: ---[ end trace f64cb1d7fb847c2b ]---
Jan 09 09:10:21 mia-arch kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 09 09:10:21 mia-arch kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 09 09:10:21 mia-arch kernel: RSP: 0018:ffffae61c11c7be0 EFLAGS: 00010202
Jan 09 09:10:21 mia-arch kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 09 09:10:21 mia-arch kernel: RDX: ffff9f9f63b9cf48 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: RBP: ffff9f9e18dbd940 R08: ffffffffc29cd510 R09: ffff9f9e18dbd920
Jan 09 09:10:21 mia-arch kernel: R10: ffffffffc1618690 R11: ffff9f9e0df2b008 R12: 0000000000000020
Jan 09 09:10:21 mia-arch kernel: R13: 0000000000000000 R14: ffff9f9e18dbdaa8 R15: ffff9f9e18dbdbb0
Jan 09 09:10:21 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff9fa50ec40000(0000) knlGS:0000000000000000
Jan 09 09:10:21 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 09:10:21 mia-arch kernel: CR2: 0000000000000959 CR3: 00000001b64c4000 CR4: 0000000000350ee0
Jan 09 09:10:21 mia-arch kernel: Fixing recursive fault but reboot is needed!
Jan 09 09:10:36 mia-arch systemd-logind[463]: Power key pressed.
Jan 09 09:10:36 mia-arch systemd-logind[463]: Powering Off...
Jan 09 09:10:36 mia-arch systemd-logind[463]: System is powering down.
Jan 09 09:10:36 mia-arch systemd[1]: Stopping Session 1 of user niraami.
Jan 09 09:10:36 mia-arch audit[717]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=1 pid=717 comm="flameshot" exe="/usr/bin/flameshot" sig=11 res=1
Jan 09 09:10:36 mia-arch systemd[1]: Removed slice system-modprobe.slice.
Jan 09 09:10:36 mia-arch systemd[1]: Removed slice system-systemd\x2dcoredump.slice.

Second occurrence:

Jan 09 20:04:48 mia-arch kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: #PF: supervisor read access in kernel mode
Jan 09 20:04:48 mia-arch kernel: #PF: error_code(0x0000) - not-present page
Jan 09 20:04:48 mia-arch kernel: PGD 1afff9067 P4D 1afff9067 PUD 0 
Jan 09 20:04:48 mia-arch kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 09 20:04:48 mia-arch kernel: CPU: 9 PID: 659 Comm: irq/106-nvidia Tainted: P           OE     5.10.4-arch2-1 #1
Jan 09 20:04:48 mia-arch kernel: Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII IMPACT, BIOS 3003 12/04/2020
Jan 09 20:04:48 mia-arch kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 09 20:04:48 mia-arch kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 09 20:04:48 mia-arch kernel: RSP: 0018:ffffb4540169bbe0 EFLAGS: 00010202
Jan 09 20:04:48 mia-arch kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 09 20:04:48 mia-arch kernel: RDX: ffff921a2cf2d708 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: RBP: ffff9217977b2940 R08: ffffffffc301f510 R09: ffff9217977b2920
Jan 09 20:04:48 mia-arch kernel: R10: ffffffffc1c6a690 R11: ffff921797393808 R12: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: R13: 0000000000000000 R14: ffff9217977b2aa8 R15: ffff9217977b2bb0
Jan 09 20:04:48 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff921e8ec40000(0000) knlGS:0000000000000000
Jan 09 20:04:48 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 20:04:48 mia-arch kernel: CR2: 0000000000000020 CR3: 00000001bc62a000 CR4: 0000000000350ee0
Jan 09 20:04:48 mia-arch kernel: Call Trace:
Jan 09 20:04:48 mia-arch kernel:  ? _nv029950rm+0x1b/0x90 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv025474rm+0x18/0x60 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv011691rm+0x13d/0x1c0 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv011619rm+0xff/0x180 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv018449rm+0x1af/0x210 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv018389rm+0xd9a/0xe90 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv018390rm+0xde/0x260 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv018356rm+0x72/0xc0 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv018370rm+0x235/0x2d0 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv026076rm+0x10/0x10 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv018403rm+0xac/0xe0 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv027734rm+0x820/0xdc0 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv007566rm+0x155/0x270 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv027742rm+0x8d/0x180 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? disable_irq_nosync+0x10/0x10
Jan 09 20:04:48 mia-arch kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Jan 09 20:04:48 mia-arch kernel:  ? irq_thread_fn+0x20/0x60
Jan 09 20:04:48 mia-arch kernel:  ? irq_thread+0xf5/0x1a0
Jan 09 20:04:48 mia-arch kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Jan 09 20:04:48 mia-arch kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Jan 09 20:04:48 mia-arch kernel:  ? kthread+0x133/0x150
Jan 09 20:04:48 mia-arch kernel:  ? __kthread_bind_mask+0x60/0x60
Jan 09 20:04:48 mia-arch kernel:  ? ret_from_fork+0x22/0x30
Jan 09 20:04:48 mia-arch kernel: Modules linked in: ch341 md4 cmac nls_utf8 cifs dns_resolver fscache libdes hid_logitech_hidpp mousedev joydev vmnet(OE) qcserial usb_wwan nvidia_drm(POE) nvidia_modeset(POE) btusb btrtl btbcm btintel bluetooth hid_corsair ecdh_generic nvi>
Jan 09 20:04:48 mia-arch kernel:  mac_hid acpi_cpufreq drm vmmon(OE) vmw_vmci fuse crypto_user agpgart bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci xhci_pci_renesas
Jan 09 20:04:48 mia-arch kernel: CR2: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: ---[ end trace 3cdff250a0cd1785 ]---
Jan 09 20:04:48 mia-arch kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 09 20:04:48 mia-arch kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 09 20:04:48 mia-arch kernel: RSP: 0018:ffffb4540169bbe0 EFLAGS: 00010202
Jan 09 20:04:48 mia-arch kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 09 20:04:48 mia-arch kernel: RDX: ffff921a2cf2d708 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: RBP: ffff9217977b2940 R08: ffffffffc301f510 R09: ffff9217977b2920
Jan 09 20:04:48 mia-arch kernel: R10: ffffffffc1c6a690 R11: ffff921797393808 R12: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: R13: 0000000000000000 R14: ffff9217977b2aa8 R15: ffff9217977b2bb0
Jan 09 20:04:48 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff921e8ec40000(0000) knlGS:0000000000000000
Jan 09 20:04:48 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 20:04:48 mia-arch kernel: CR2: 0000000000000020 CR3: 00000001bc62a000 CR4: 0000000000350ee0
Jan 09 20:04:48 mia-arch kernel: BUG: kernel NULL pointer dereference, address: 0000000000000959
Jan 09 20:04:48 mia-arch kernel: #PF: supervisor write access in kernel mode
Jan 09 20:04:48 mia-arch kernel: #PF: error_code(0x0002) - not-present page
Jan 09 20:04:48 mia-arch kernel: PGD 1afff9067 P4D 1afff9067 PUD 0 
Jan 09 20:04:48 mia-arch kernel: Oops: 0002 [#2] PREEMPT SMP NOPTI
Jan 09 20:04:48 mia-arch kernel: CPU: 9 PID: 659 Comm: irq/106-nvidia Tainted: P      D    OE     5.10.4-arch2-1 #1
Jan 09 20:04:48 mia-arch kernel: Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII IMPACT, BIOS 3003 12/04/2020
Jan 09 20:04:48 mia-arch kernel: RIP: 0010:mutex_lock+0x10/0x20
Jan 09 20:04:48 mia-arch kernel: Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 a1 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
Jan 09 20:04:48 mia-arch kernel: RSP: 0018:ffffb4540169be30 EFLAGS: 00010246
Jan 09 20:04:48 mia-arch kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Jan 09 20:04:48 mia-arch kernel: RDX: ffff921794b88000 RSI: 0000000000001b41 RDI: 0000000000000959
Jan 09 20:04:48 mia-arch kernel: RBP: 0000000000000959 R08: 000000000000001f R09: 0000000000000000
Jan 09 20:04:48 mia-arch kernel: R10: ffff921793fd6000 R11: 0000000000000000 R12: ffff921794b887f4
Jan 09 20:04:48 mia-arch kernel: R13: 0000000000000001 R14: 0000000000000001 R15: ffff921794b88000
Jan 09 20:04:48 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff921e8ec40000(0000) knlGS:0000000000000000
Jan 09 20:04:48 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 20:04:48 mia-arch kernel: CR2: 0000000000000959 CR3: 00000001bc62a000 CR4: 0000000000350ee0
Jan 09 20:04:48 mia-arch kernel: Call Trace:
Jan 09 20:04:48 mia-arch kernel:  perf_event_exit_task+0x30/0x440
Jan 09 20:04:48 mia-arch kernel:  ? kfree+0x40c/0x440
Jan 09 20:04:48 mia-arch kernel:  do_exit+0x355/0xa40
Jan 09 20:04:48 mia-arch kernel:  ? task_work_run+0x5c/0x90
Jan 09 20:04:48 mia-arch kernel:  ? do_exit+0x345/0xa40
Jan 09 20:04:48 mia-arch kernel:  ? kthread+0x133/0x150
Jan 09 20:04:48 mia-arch kernel:  ? rewind_stack_do_exit+0x17/0x17
Jan 09 20:04:48 mia-arch kernel: Modules linked in: ch341 md4 cmac nls_utf8 cifs dns_resolver fscache libdes hid_logitech_hidpp mousedev joydev vmnet(OE) qcserial usb_wwan nvidia_drm(POE) nvidia_modeset(POE) btusb btrtl btbcm btintel bluetooth hid_corsair ecdh_generic nvi>
Jan 09 20:04:48 mia-arch kernel:  mac_hid acpi_cpufreq drm vmmon(OE) vmw_vmci fuse crypto_user agpgart bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci xhci_pci_renesas
Jan 09 20:04:48 mia-arch kernel: CR2: 0000000000000959
Jan 09 20:04:48 mia-arch kernel: ---[ end trace 3cdff250a0cd1786 ]---
Jan 09 20:04:48 mia-arch kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 09 20:04:48 mia-arch kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 09 20:04:48 mia-arch kernel: RSP: 0018:ffffb4540169bbe0 EFLAGS: 00010202
Jan 09 20:04:48 mia-arch kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 09 20:04:48 mia-arch kernel: RDX: ffff921a2cf2d708 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: RBP: ffff9217977b2940 R08: ffffffffc301f510 R09: ffff9217977b2920
Jan 09 20:04:48 mia-arch kernel: R10: ffffffffc1c6a690 R11: ffff921797393808 R12: 0000000000000020
Jan 09 20:04:48 mia-arch kernel: R13: 0000000000000000 R14: ffff9217977b2aa8 R15: ffff9217977b2bb0
Jan 09 20:04:48 mia-arch kernel: FS:  0000000000000000(0000) GS:ffff921e8ec40000(0000) knlGS:0000000000000000
Jan 09 20:04:48 mia-arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 09 20:04:48 mia-arch kernel: CR2: 0000000000000959 CR3: 00000001bc62a000 CR4: 0000000000350ee0
Jan 09 20:04:48 mia-arch kernel: Fixing recursive fault but reboot is needed!
Jan 09 20:05:11 mia-arch audit[27998]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=1 pid=27998 comm="GpuWatchdog" exe="/opt/spotify/spotify" sig=11 res=1
Jan 09 20:05:11 mia-arch kernel: GpuWatchdog[28016]: segfault at 0 ip 00007f69f52bdc4f sp 00007f69ed340120 error 6 in libcef.so[7f69f1218000+6dbb000]
Jan 09 20:05:11 mia-arch kernel: Code: 89 de e8 44 c2 9f fe 80 7d cf 00 79 09 48 8b 7d b8 e8 15 31 67 fe 41 8b 84 24 e0 00 00 00 89 45 b8 48 8d 7d b8 e8 f1 ae f5 fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
Jan 09 20:05:11 mia-arch kernel: audit: type=1701 audit(1610219111.138:54): auid=1000 uid=1000 gid=1000 ses=1 pid=27998 comm="GpuWatchdog" exe="/opt/spotify/spotify" sig=11 res=1
Jan 09 20:05:24 mia-arch pulseaudio[718]: Got POLLNVAL from ALSA
Jan 09 20:05:24 mia-arch kernel: usb 5-3: USB disconnect, device number 2
Jan 09 20:05:24 mia-arch vmware-usbarbitrator[486]: USBGL: kevent: removing USB device 5-3
Jan 09 20:05:33 mia-arch kernel: usb 5-3: new high-speed USB device number 4 using xhci_hcd
Jan 09 20:05:34 mia-arch kernel: usb 5-3: New USB device found, idVendor=2708, idProduct=0006, bcdDevice= 1.25
Jan 09 20:05:34 mia-arch kernel: usb 5-3: New USB device strings: Mfr=1, Product=3, SerialNumber=0
Jan 09 20:05:34 mia-arch kernel: usb 5-3: Product: EVO4
Jan 09 20:05:34 mia-arch kernel: usb 5-3: Manufacturer: Audient
Jan 09 20:05:34 mia-arch vmware-usbarbitrator[486]: USBGL: kevent: adding USB device 5-3
Jan 09 20:05:34 mia-arch kernel: usb 5-3: cannot get ctl value: req = 0x81, wValue = 0x100, wIndex = 0x3200, type = 0
Jan 09 20:05:34 mia-arch systemd[601]: Reached target Sound Card.
Jan 09 20:05:34 mia-arch rtkit-daemon[776]: Supervising 3 threads of 1 processes of 1 users.
Jan 09 20:05:34 mia-arch rtkit-daemon[776]: Successfully made thread 40972 of process 718 owned by '1000' RT at priority 5.
Jan 09 20:05:34 mia-arch rtkit-daemon[776]: Supervising 4 threads of 1 processes of 1 users.
Jan 09 20:05:34 mia-arch rtkit-daemon[776]: Supervising 4 threads of 1 processes of 1 users.
Jan 09 20:05:34 mia-arch rtkit-daemon[776]: Successfully made thread 40973 of process 718 owned by '1000' RT at priority 5.
Jan 09 20:05:34 mia-arch rtkit-daemon[776]: Supervising 5 threads of 1 processes of 1 users.
Jan 09 20:05:35 mia-arch kernel: usb 5-4: USB disconnect, device number 3
Jan 09 20:05:35 mia-arch vmware-usbarbitrator[486]: USBGL: kevent: removing USB device 5-4
Jan 09 20:05:37 mia-arch kernel: usb 5-4: new full-speed USB device number 5 using xhci_hcd
Jan 09 20:05:37 mia-arch vmware-usbarbitrator[486]: USBGL: kevent: adding USB device 5-4
Jan 09 20:05:37 mia-arch kernel: usb 5-4: New USB device found, idVendor=1b1c, idProduct=1b09, bcdDevice= 1.09
Jan 09 20:05:37 mia-arch kernel: usb 5-4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Jan 09 20:05:37 mia-arch kernel: usb 5-4: Product: Corsair K70R Gaming Keyboard
Jan 09 20:05:37 mia-arch kernel: usb 5-4: Manufacturer: Corsair
Jan 09 20:05:37 mia-arch kernel: input: Corsair Corsair K70R Gaming Keyboard as /devices/pci0000:00/0000:00:08.1/0000:0b:00.3/usb5/5-4/5-4:1.0/0003:1B1C:1B09.0009/input/input33
Jan 09 20:05:37 mia-arch kernel: corsair 0003:1B1C:1B09.0009: input,hidraw4: USB HID v1.11 Keyboard [Corsair Corsair K70R Gaming Keyboard] on usb-0000:0b:00.3-4/input0
Jan 09 20:05:37 mia-arch kernel: input: Corsair Corsair K70R Gaming Keyboard as /devices/pci0000:00/0000:00:08.1/0000:0b:00.3/usb5/5-4/5-4:1.1/0003:1B1C:1B09.000A/input/input34
Jan 09 20:05:37 mia-arch kernel: corsair 0003:1B1C:1B09.000A: input,hidraw6: USB HID v1.11 Device [Corsair Corsair K70R Gaming Keyboard] on usb-0000:0b:00.3-4/input1
Jan 09 20:05:37 mia-arch kernel: input: Corsair Corsair K70R Gaming Keyboard as /devices/pci0000:00/0000:00:08.1/0000:0b:00.3/usb5/5-4/5-4:1.2/0003:1B1C:1B09.000B/input/input35
Jan 09 20:05:38 mia-arch kernel: corsair 0003:1B1C:1B09.000B: input,hidraw7: USB HID v1.11 Keyboard [Corsair Corsair K70R Gaming Keyboard] on usb-0000:0b:00.3-4/input2
Jan 09 20:05:38 mia-arch upowerd[2922]: treating change event as add on /sys/devices/pci0000:00/0000:00:08.1/0000:0b:00.3/usb5/5-4
Jan 09 20:05:38 mia-arch systemd-logind[484]: Watching system buttons on /dev/input/event5 (Corsair Corsair K70R Gaming Keyboard)
Jan 09 20:05:38 mia-arch systemd-logind[484]: Watching system buttons on /dev/input/event8 (Corsair Corsair K70R Gaming Keyboard)
Jan 09 20:05:38 mia-arch systemd-logind[484]: Watching system buttons on /dev/input/event6 (Corsair Corsair K70R Gaming Keyboard)

The USB events at the end are due to my attempt to bring the system back by unplugging everything from the motherboard and plugging it back in. No dice.
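For anyone who wants to pull a comparable excerpt from their own machine after a hard reset, here is a minimal sketch. It assumes a persistent journal (so the previous boot is still stored), and `oops_summary` is a hypothetical helper name, not an existing tool. It only keeps the headline lines of each oops, so the full trace still has to be read from the journal itself.

```shell
# Hypothetical helper: reads journal text on stdin and keeps only the
# headline lines of kernel oopses (BUG/Oops/RIP, the faulting task, and
# the final "recursive fault" message).
oops_summary() {
    grep -E 'BUG: |Oops: |RIP: |Comm: |Fixing recursive fault'
}
```

Usage, assuming the lockup happened during the previous boot: `journalctl -k -b -1 --no-pager | oops_summary`. Here `-k` limits the output to kernel messages and `-b -1` selects the boot before the current one.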

So far it has only happened on one of my computers, a total of 3 times (I posted the 2 most recent logs), but as I said, I've seen this exact issue on older hardware years ago, and also on a notebook. All had Nvidia GPUs...

System:
MB: Asus Crosshair VIII Impact
CPU: Ryzen 3700x, stock
GPU: Nvidia 1070, stock
RAM: HyperX 32GB 3466 CL16, stock

❯ uname -r
5.10.4-arch2-1

❯ yay -Qs linux
local/linux 5.10.4.arch2-1
local/linux-api-headers 5.8-1
local/linux-firmware 20201218.646f159-1
local/linux-headers 5.10.4.arch2-1

❯ yay -Qs nvidia
local/lib32-nvidia-utils 455.45.01-1
local/libvdpau 1.4-1
local/libxnvctrl 455.45.01-1
local/nvidia 455.45.01-12
local/nvidia-utils 455.45.01-1
local/opencl-nvidia 455.45.01-1

Other info about my setup is also in my dotfiles Readme, but I'll be glad to answer any questions and test anything to try and resolve this... you can tell I'm pretty much at the end of my current skillset here :D

I also know that I can update to 5.10.5 1-1 and nvidia 460.32.03-1, but I doubt this will make a difference (I believe this to be a very old issue), and I want to keep the current logs relevant.


If I'm not a bush, I'm not no one.
Dotfiles Git

Offline

#2 2021-01-10 08:50:22

shyaminayesh
Member
From: Gampaha, Sri Lanka
Registered: 2015-04-19
Posts: 12
Website

Re: Nvidia "recursive fault" causes system to permanently lock up

I think I have the same problem. The exact same thing happened a few minutes ago. I came here to post the logs and ask for help, then saw your thread, so I'm adding mine here too.

Jan 10 13:50:27 bifrost.shyamin.com kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020
Jan 10 13:50:27 bifrost.shyamin.com kernel: #PF: supervisor read access in kernel mode
Jan 10 13:50:27 bifrost.shyamin.com kernel: #PF: error_code(0x0000) - not-present page
Jan 10 13:50:27 bifrost.shyamin.com kernel: PGD 80000001db130067 P4D 80000001db130067 PUD 0 
Jan 10 13:50:27 bifrost.shyamin.com kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Jan 10 13:50:27 bifrost.shyamin.com kernel: CPU: 0 PID: 555 Comm: irq/33-nvidia Tainted: P           OE     5.9.14-arch1-1 #1
Jan 10 13:50:27 bifrost.shyamin.com kernel: Hardware name: ECS H81H3-M4/H81H3-M4, BIOS 4.6.5 10/21/2015
Jan 10 13:50:27 bifrost.shyamin.com kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 >
Jan 10 13:50:27 bifrost.shyamin.com kernel: RSP: 0018:ffff9fe9836efc00 EFLAGS: 00010202
Jan 10 13:50:27 bifrost.shyamin.com kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 10 13:50:27 bifrost.shyamin.com kernel: RDX: ffff938532e46e48 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 10 13:50:27 bifrost.shyamin.com kernel: RBP: ffff9385318629d0 R08: ffffffffc2497530 R09: ffff9385318629b0
Jan 10 13:50:27 bifrost.shyamin.com kernel: R10: ffffffffc10e2820 R11: ffff938553199808 R12: 0000000000000020
Jan 10 13:50:27 bifrost.shyamin.com kernel: R13: 0000000000000000 R14: ffff938531862b38 R15: ffff938531862c78
Jan 10 13:50:27 bifrost.shyamin.com kernel: FS:  0000000000000000(0000) GS:ffff938556c00000(0000) knlGS:0000000000000000
Jan 10 13:50:27 bifrost.shyamin.com kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 10 13:50:27 bifrost.shyamin.com kernel: CR2: 0000000000000020 CR3: 00000001e7390006 CR4: 00000000001706f0
Jan 10 13:50:27 bifrost.shyamin.com kernel: Call Trace:
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv029950rm+0x1b/0x90 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv025474rm+0x18/0x60 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv011691rm+0x13d/0x1c0 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv036719rm+0xc3/0x350 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv036718rm+0x5c/0x70 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv011615rm+0x78/0xd0 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv011615rm+0x1a/0xd0 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv024757rm+0x251/0x3e0 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv024705rm+0x1f/0xf0 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv015452rm+0xcb/0x370 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv026076rm+0x10/0x10 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv027734rm+0x273/0xdc0 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv007566rm+0x155/0x270 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv027742rm+0x8d/0x180 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? disable_irq_nosync+0x10/0x10
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? irq_thread_fn+0x20/0x60
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? irq_thread+0xf5/0x1a0
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? kthread+0x142/0x160
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? __kthread_bind_mask+0x60/0x60
Jan 10 13:50:27 bifrost.shyamin.com kernel:  ? ret_from_fork+0x22/0x30
Jan 10 13:50:27 bifrost.shyamin.com kernel: Modules linked in: ccm algif_aead cbc des_generic libdes ecb algif_skcipher cmac md4 algif_hash af_alg hwmon_vid nls_iso8859_1 nls_cp437 vfat fat nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) >
Jan 10 13:50:27 bifrost.shyamin.com kernel:  pkcs8_key_parser ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid dm_mod crc32c_intel xhci_pci xhci_pci_renesas ehci_pci xhci_hcd ehci_hcd
Jan 10 13:50:27 bifrost.shyamin.com kernel: CR2: 0000000000000020
Jan 10 13:50:27 bifrost.shyamin.com kernel: ---[ end trace 48313f7d452dd5f2 ]---
Jan 10 13:50:27 bifrost.shyamin.com kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 10 13:50:27 bifrost.shyamin.com kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 >
Jan 10 13:50:27 bifrost.shyamin.com kernel: RSP: 0018:ffff9fe9836efc00 EFLAGS: 00010202
Jan 10 13:50:27 bifrost.shyamin.com kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 10 13:50:27 bifrost.shyamin.com kernel: RDX: ffff938532e46e48 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 10 13:50:27 bifrost.shyamin.com kernel: RBP: ffff9385318629d0 R08: ffffffffc2497530 R09: ffff9385318629b0
Jan 10 13:50:27 bifrost.shyamin.com kernel: R10: ffffffffc10e2820 R11: ffff938553199808 R12: 0000000000000020
Jan 10 13:50:27 bifrost.shyamin.com kernel: R13: 0000000000000000 R14: ffff938531862b38 R15: ffff938531862c78
Jan 10 13:50:27 bifrost.shyamin.com kernel: FS:  0000000000000000(0000) GS:ffff938556c00000(0000) knlGS:0000000000000000
Jan 10 13:50:27 bifrost.shyamin.com kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 10 13:50:27 bifrost.shyamin.com kernel: CR2: 0000000000000020 CR3: 00000001e7390006 CR4: 00000000001706f0
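
To check whether two reports are the same crash, it helps to compare the faulting symbol, its offset, and the CR2 fault address rather than the whole trace. Below is a minimal sketch that pulls those fields out of journal text, assuming the line format shown in the log above. The names `OopsSignature` and `oops_signature` are made up for this example and are not part of any tool.

```python
import re
from typing import NamedTuple, Optional


class OopsSignature(NamedTuple):
    symbol: str            # faulting function, e.g. _nv027527rm
    offset: int            # offset into the function
    size: int              # function size as reported by the kernel
    module: Optional[str]  # module name, None for core kernel symbols
    cr2: Optional[int]     # faulting address from CR2, if present


# Matches e.g. "RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]"
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)"
    r"/0x(?P<size>[0-9a-f]+)(?: \[(?P<mod>\w+)\])?"
)
# Matches e.g. "CR2: 0000000000000020"
CR2_RE = re.compile(r"CR2: (?P<cr2>[0-9a-f]{16})")


def oops_signature(log_text: str) -> Optional[OopsSignature]:
    """Return the first RIP/CR2 pair found in the text, or None."""
    rip = RIP_RE.search(log_text)
    if rip is None:
        return None
    cr2 = CR2_RE.search(log_text)
    return OopsSignature(
        symbol=rip["sym"],
        offset=int(rip["off"], 16),
        size=int(rip["size"], 16),
        module=rip["mod"],
        cr2=int(cr2["cr2"], 16) if cr2 else None,
    )


if __name__ == "__main__":
    sample = (
        "Jan 10 13:50:27 bifrost.shyamin.com kernel: CR2: 0000000000000020\n"
        "Jan 10 13:50:27 bifrost.shyamin.com kernel: "
        "RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]\n"
    )
    print(oops_signature(sample))
```

Two logs that yield the same symbol, offset, and CR2 value are very likely hitting the same code path, even when the call traces above the faulting function differ.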

Tech specs:
Kernel: 5.9.14-arch1
Nvidia: nvidia 455.45.01-7
GPU: GTX 960 (Galax)
CPU: Intel Core i5 (4th generation)

Please help us figure out what is happening.

Offline

#3 2021-01-10 09:40:35

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 11,800

Re: Nvidia "recursive fault" causes system to permanently lock up

There is not much we can do here: https://forums.developer.nvidia.com/t/b … 155506/122

Apparently nvidia has managed to pin down where the issue is, and there might be a fix in the next driver version.

Other than that, you should both update. A new driver was released that fixes the page allocation issues linked earlier, and from what I'm seeing it also alleviated the null pointer issue slightly. It's not completely gone, but it doesn't seem to occur as often.

Offline

#4 2021-01-10 14:11:52

areuz
Member
Registered: 2016-09-16
Posts: 56

Re: Nvidia "recursive fault" causes system to permanently lock up

Cool! It seems like I missed that bug report when researching - it looks related. I'll update everything and see where it goes.
Thanks for the help!


If I'm not a bush, I'm not no one.
Dotfiles Git

Offline

#5 2021-01-16 00:30:13

jdoggsc
Member
Registered: 2011-08-15
Posts: 81

Re: Nvidia "recursive fault" causes system to permanently lock up

Just to chime in, I'm getting this issue as well and have been dealing with it for months. I'll be knee-deep in work and the screen will randomly freeze completely. I can ssh in, but I can't rescue the session; even poweroff and reboot don't work.

Jan 15 15:16:09 ryzen kernel: PGD 2bd03f067 P4D 2bd03f067 PUD 0 
Jan 15 15:16:09 ryzen kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 15 15:16:09 ryzen kernel: CPU: 0 PID: 709 Comm: irq/70-nvidia Tainted: P           OE     5.9.14-arch1-1 #1
Jan 15 15:16:09 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
Jan 15 15:16:09 ryzen kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 15 15:16:09 ryzen kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 15 15:16:09 ryzen kernel: RSP: 0018:ffff94e580d73be0 EFLAGS: 00010202
Jan 15 15:16:09 ryzen kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 15 15:16:09 ryzen kernel: RDX: ffff8f764b7f89c8 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 15 15:16:09 ryzen kernel: RBP: ffff8f76a675d940 R08: ffffffffc29ce530 R09: ffff8f76a675d920
Jan 15 15:16:09 ryzen kernel: R10: ffffffffc1619820 R11: ffff8f76c68c5808 R12: 0000000000000020
Jan 15 15:16:09 ryzen kernel: R13: 0000000000000000 R14: ffff8f76a675daa8 R15: ffff8f76a675dbb0
Jan 15 15:16:09 ryzen kernel: FS:  0000000000000000(0000) GS:ffff8f76ce800000(0000) knlGS:0000000000000000
Jan 15 15:16:09 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 15:16:09 ryzen kernel: CR2: 0000000000000020 CR3: 000000010f4e2000 CR4: 00000000003506f0
Jan 15 15:16:09 ryzen kernel: Call Trace:
Jan 15 15:16:09 ryzen kernel:  ? _nv029950rm+0x1b/0x90 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv025474rm+0x18/0x60 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv011691rm+0x13d/0x1c0 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv011619rm+0xff/0x180 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv018449rm+0x1af/0x210 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv018389rm+0xd9a/0xe90 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv018390rm+0xde/0x260 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv018356rm+0x72/0xc0 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv018370rm+0x235/0x2d0 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv026076rm+0x10/0x10 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv018403rm+0xac/0xe0 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv027734rm+0x820/0xdc0 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv007566rm+0x155/0x270 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv027742rm+0x8d/0x180 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? disable_irq_nosync+0x10/0x10
Jan 15 15:16:09 ryzen kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Jan 15 15:16:09 ryzen kernel:  ? irq_thread_fn+0x20/0x60
Jan 15 15:16:09 ryzen kernel:  ? irq_thread+0xf5/0x1a0
Jan 15 15:16:09 ryzen kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Jan 15 15:16:09 ryzen kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Jan 15 15:16:09 ryzen kernel:  ? kthread+0x142/0x160
Jan 15 15:16:09 ryzen kernel:  ? __kthread_bind_mask+0x60/0x60
Jan 15 15:16:09 ryzen kernel:  ? ret_from_fork+0x22/0x30
Jan 15 15:16:09 ryzen kernel: Modules linked in: md4 nls_utf8 cifs libarc4 dns_resolver fscache libdes nfnetlink_queue nfnetlink_log nfnetlink elan_i2c rfcomm cfg80211 8021q garp mrp stp llc cmac algif_hash algif_skcipher af_alg bnep vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) nvidia_drm(POE) nvidia_modeset(POE) nvi>
Jan 15 15:16:09 ryzen kernel:  v4l2loopback_dc(OE) videodev drm mc sg crypto_user fuse agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid crc32c_intel xhci_pci xhci_pci_renesas xhci_hcd
Jan 15 15:16:09 ryzen kernel: CR2: 0000000000000020
Jan 15 15:16:09 ryzen kernel: ---[ end trace 6eee9e0cf91b80a8 ]---
Jan 15 15:16:09 ryzen kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 15 15:16:09 ryzen kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 15 15:16:09 ryzen kernel: RSP: 0018:ffff94e580d73be0 EFLAGS: 00010202
Jan 15 15:16:09 ryzen kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 15 15:16:09 ryzen kernel: RDX: ffff8f764b7f89c8 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 15 15:16:09 ryzen kernel: RBP: ffff8f76a675d940 R08: ffffffffc29ce530 R09: ffff8f76a675d920
Jan 15 15:16:09 ryzen kernel: R10: ffffffffc1619820 R11: ffff8f76c68c5808 R12: 0000000000000020
Jan 15 15:16:09 ryzen kernel: R13: 0000000000000000 R14: ffff8f76a675daa8 R15: ffff8f76a675dbb0
Jan 15 15:16:09 ryzen kernel: FS:  0000000000000000(0000) GS:ffff8f76ce800000(0000) knlGS:0000000000000000
Jan 15 15:16:09 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 15:16:09 ryzen kernel: CR2: 0000000000000020 CR3: 000000010f4e2000 CR4: 00000000003506f0
Jan 15 15:16:09 ryzen kernel: BUG: kernel NULL pointer dereference, address: 0000000000000930
Jan 15 15:16:09 ryzen kernel: #PF: supervisor write access in kernel mode
Jan 15 15:16:09 ryzen kernel: #PF: error_code(0x0002) - not-present page
Jan 15 15:16:09 ryzen kernel: PGD 2bd03f067 P4D 2bd03f067 PUD 0 
Jan 15 15:16:09 ryzen kernel: Oops: 0002 [#2] PREEMPT SMP NOPTI
Jan 15 15:16:09 ryzen kernel: CPU: 0 PID: 709 Comm: irq/70-nvidia Tainted: P      D    OE     5.9.14-arch1-1 #1
Jan 15 15:16:09 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
Jan 15 15:16:09 ryzen kernel: RIP: 0010:mutex_lock+0x10/0x20
Jan 15 15:16:09 ryzen kernel: Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 61 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
Jan 15 15:16:09 ryzen kernel: RSP: 0018:ffff94e580d73e30 EFLAGS: 00010246
Jan 15 15:16:09 ryzen kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jan 15 15:16:09 ryzen kernel: RDX: ffff8f76bb8ddb80 RSI: 0000000000000000 RDI: 0000000000000930
Jan 15 15:16:09 ryzen kernel: RBP: 0000000000000930 R08: 0000000000000001 R09: 0000000000000000
Jan 15 15:16:09 ryzen kernel: R10: ffff8f76c3a95400 R11: ffff94e580d73800 R12: ffff8f76bb8de34c
Jan 15 15:16:09 ryzen kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffff8f76bb8ddb80
Jan 15 15:16:09 ryzen kernel: FS:  0000000000000000(0000) GS:ffff8f76ce800000(0000) knlGS:0000000000000000
Jan 15 15:16:09 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 15:16:09 ryzen kernel: CR2: 0000000000000930 CR3: 000000010f4e2000 CR4: 00000000003506f0
Jan 15 15:16:09 ryzen kernel: Call Trace:
Jan 15 15:16:09 ryzen kernel:  perf_event_exit_task+0x30/0x440
Jan 15 15:16:09 ryzen kernel:  ? kfree+0x40f/0x440
Jan 15 15:16:09 ryzen kernel:  do_exit+0x37f/0xaa0
Jan 15 15:16:09 ryzen kernel:  ? task_work_run+0x5c/0x90
Jan 15 15:16:09 ryzen kernel:  ? do_exit+0x36f/0xaa0
Jan 15 15:16:09 ryzen kernel:  ? kthread+0x142/0x160
Jan 15 15:16:09 ryzen kernel:  ? rewind_stack_do_exit+0x17/0x17
Jan 15 15:16:09 ryzen kernel: Modules linked in: md4 nls_utf8 cifs libarc4 dns_resolver fscache libdes nfnetlink_queue nfnetlink_log nfnetlink elan_i2c rfcomm cfg80211 8021q garp mrp stp llc cmac algif_hash algif_skcipher af_alg bnep vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) nvidia_drm(POE) nvidia_modeset(POE) nvi>
Jan 15 15:16:09 ryzen kernel:  v4l2loopback_dc(OE) videodev drm mc sg crypto_user fuse agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid crc32c_intel xhci_pci xhci_pci_renesas xhci_hcd
Jan 15 15:16:09 ryzen kernel: CR2: 0000000000000930
Jan 15 15:16:09 ryzen kernel: ---[ end trace 6eee9e0cf91b80a9 ]---
Jan 15 15:16:09 ryzen kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Jan 15 15:16:09 ryzen kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Jan 15 15:16:09 ryzen kernel: RSP: 0018:ffff94e580d73be0 EFLAGS: 00010202
Jan 15 15:16:09 ryzen kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Jan 15 15:16:09 ryzen kernel: RDX: ffff8f764b7f89c8 RSI: ffffffffffffffff RDI: 0000000000000020
Jan 15 15:16:09 ryzen kernel: RBP: ffff8f76a675d940 R08: ffffffffc29ce530 R09: ffff8f76a675d920
Jan 15 15:16:09 ryzen kernel: R10: ffffffffc1619820 R11: ffff8f76c68c5808 R12: 0000000000000020
Jan 15 15:16:09 ryzen kernel: R13: 0000000000000000 R14: ffff8f76a675daa8 R15: ffff8f76a675dbb0
Jan 15 15:16:09 ryzen kernel: FS:  0000000000000000(0000) GS:ffff8f76ce800000(0000) knlGS:0000000000000000
Jan 15 15:16:09 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 15 15:16:09 ryzen kernel: CR2: 0000000000000930 CR3: 000000010f4e2000 CR4: 00000000003506f0
Jan 15 15:16:09 ryzen kernel: Fixing recursive fault but reboot is needed!
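
The log above contains two oopses: `Oops: 0000 [#1]` inside the nvidia module, then `Oops: 0002 [#2]` in `mutex_lock` while the kernel is running `do_exit` for the same thread. That second fault appears to be what triggers the "recursive fault" message. The hex number after `Oops:` is the x86 page-fault error code, and the following sketch decodes its low bits. The function name `decode_pf_error_code` is hypothetical, and only the five lowest bits are handled.

```python
def decode_pf_error_code(code: int) -> dict:
    """Decode the low bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "page": "protection violation" if code & 0x1 else "not-present page",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = supervisor (kernel) access, 1 = user access
        "mode": "user" if code & 0x4 else "supervisor",
        # bit 3: a reserved bit was set in a page-table entry
        "reserved_bit": bool(code & 0x8),
        # bit 4: the fault was caused by an instruction fetch
        "instruction_fetch": bool(code & 0x10),
    }


if __name__ == "__main__":
    # The two error codes that appear in the log above.
    for code in (0x0000, 0x0002):
        print(f"{code:#06x}", decode_pf_error_code(code))
```

Decoded this way, the first oops is a kernel-mode read of a not-present page at address 0x20, and the second is a kernel-mode write to a not-present page at 0x930. This matches the `#PF:` lines the kernel printed.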

I'm looking forward to word from NVIDIA on a fix for this.

Offline

#6 2021-01-16 00:37:34

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 11,800

Re: Nvidia "recursive fault" causes system to permanently lock up

You can reboot safe-ish with SysRq REISUB: hold Alt+SysRq and press R, E, I, S, U, B in turn.
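
If the keyboard is dead but ssh still works, the same SysRq commands can be sent by writing single letters to `/proc/sysrq-trigger` as root. Below is a minimal sketch of that, not a tested recovery tool. The function name `send_sysrq`, the default delay, and the dry-run default are choices made for this example. It sends only `s`, `u`, `b`, on the assumption that `e` and `i` would kill the ssh session and the script before it reached the reboot step.

```python
import time

# SysRq command letters and what the kernel does for each.
SYSRQ_ACTIONS = {
    "r": "take the keyboard out of raw mode",
    "e": "send SIGTERM to all processes except init",
    "i": "send SIGKILL to all processes except init",
    "s": "sync all mounted filesystems",
    "u": "remount all filesystems read-only",
    "b": "reboot immediately without syncing or unmounting",
}


def send_sysrq(keys, trigger_path="/proc/sysrq-trigger",
               delay=2.0, dry_run=True):
    """Send SysRq letters one at a time; return the letters processed.

    With dry_run=True (the default) nothing is written, so the sequence
    can be checked safely before it is used for real.
    """
    sent = []
    for key in keys:
        if key not in SYSRQ_ACTIONS:
            raise ValueError(f"unsupported SysRq key: {key!r}")
        if not dry_run:
            # One letter per write; needs root.
            with open(trigger_path, "w") as trigger:
                trigger.write(key)
            # Give sync / remount some time before the next step.
            time.sleep(delay)
        sent.append(key)
    return sent


if __name__ == "__main__":
    # Dry run: only prints what would be sent.
    for key in send_sysrq("sub"):
        print(key, "->", SYSRQ_ACTIONS[key])
```

Calling `send_sysrq("sub", dry_run=False)` as root would sync, remount read-only, and then reboot the machine on the spot, so it is a last resort. According to the kernel's SysRq documentation, the `kernel.sysrq` sysctl restricts only the keyboard combination, while writes to `/proc/sysrq-trigger` by root are always honored.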

Offline
