
#1 2020-11-06 21:44:56

mjandrews
Member
From: Nottingham, UK
Registered: 2013-03-26
Posts: 17

Diagnosing system freezes using journalctl

I have a new AMD Ryzen 7 3700X with 64GB of RAM and Nvidia GeForce GTX 1660.
From `lspci -v`, for the GPU, I have this

        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

I have been getting relatively frequent system freezes since I got this machine about 2 months ago: the monitor freezes, the mouse and keyboard stop responding, and even the Caps Lock LED on the keyboard does not react. However, I can still ssh into the machine, which leads me to think that it is a graphics card driver problem.

The kernel I am using is 5.4.74-1-lts

Looking at `journalctl -b -1` from the time of the last crash, I get things like this:

Nov 06 20:49:12 siegfried kernel: NVRM: GPU at PCI:0000:07:00: GPU-b1980631-3d3d-03b6-07f4-90df35f1e845
Nov 06 20:49:12 siegfried kernel: NVRM: GPU Board Serial Number: 
Nov 06 20:49:12 siegfried kernel: NVRM: Xid (PCI:0000:07:00): 8, pid=54331, Channel 0000004b
Nov 06 20:49:42 siegfried kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [irq/110-nvidia:909]
Nov 06 20:49:42 siegfried kernel: Modules linked in: xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc rfkill overlay wacom nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_hdmi nvidia(POE) mousedev input_leds ucsi_ccg typec_ucsi edac_mce_amd typec wmi_bmof snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio ccp rng_core kvm snd_hda_intel snd_intel_nhlt snd_hda_codec snd_usb_audio snd_hda_core irqbypass crct10dif_pclmul crc32_pclmul snd_usbmidi_lib ghash_clmulni_intel snd_hwdep snd_rawmidi uvcvideo aesni_intel snd_seq_device drm_kms_helper snd_pcm videobuf2_vmalloc igb crypto_simd videobuf2_memops videobuf2_v4l2 cryptd glue_helper videobuf2_common snd_timer joydev syscopyarea sysfillrect snd sysimgblt videodev fb_sys_fops i2c_algo_bit pcspkr sp5100_tco k10temp i2c_piix4 i2c_nvidia_gpu dca soundcore mc evdev pinctrl_amd mac_hid wmi
Nov 06 20:49:42 siegfried kernel:  acpi_cpufreq drm crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid sd_mod ahci libahci libata crc32c_intel xhci_pci xhci_hcd scsi_mod
Nov 06 20:49:42 siegfried kernel: CPU: 14 PID: 909 Comm: irq/110-nvidia Tainted: P           OE     5.4.74-1-lts #1
Nov 06 20:49:42 siegfried kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F11 12/06/2019
Nov 06 20:49:42 siegfried kernel: RIP: 0010:_nv031954rm+0x12/0x40 [nvidia]
Nov 06 20:49:42 siegfried kernel: Code: d2 0e 31 c0 e8 5f 1e 88 ff e8 aa 87 f8 ff 31 c0 48 83 c4 08 c3 0f 1f 00 48 83 ec 08 39 4a 10 76 17 48 8b 02 c1 e9 02 8b 04 88 <48> 83 c4 08 c3 66 0f 1f 84 00 00 00 00 00 be 00 00 e2 09 bf 0a ad
Nov 06 20:49:42 siegfried kernel: RSP: 0018:ffffb6afc0757c20 EFLAGS: 00000256 ORIG_RAX: ffffffffffffff13
Nov 06 20:49:42 siegfried kernel: RAX: 00000000168000a1 RBX: ffff94b65757c008 RCX: 0000000000000000
Nov 06 20:49:42 siegfried kernel: RDX: ffff94b65757d1d8 RSI: ffff94b65757c008 RDI: ffff94b6922cb008
Nov 06 20:49:42 siegfried kernel: RBP: ffff94b657582c10 R08: 0000000000000020 R09: ffff94b657582c28
Nov 06 20:49:42 siegfried kernel: R10: ffffffffc14b9500 R11: ffff94b6922cb008 R12: 0000000000000000
Nov 06 20:49:42 siegfried kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff94b65757d1d8
Nov 06 20:49:42 siegfried kernel: FS:  0000000000000000(0000) GS:ffff94b69eb80000(0000) knlGS:0000000000000000
Nov 06 20:49:42 siegfried kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 06 20:49:42 siegfried kernel: CR2: 00007fce1a029000 CR3: 0000000f37f92000 CR4: 0000000000340ee0
Nov 06 20:49:42 siegfried kernel: Call Trace:
Nov 06 20:49:42 siegfried kernel:  ? _nv021047rm+0x10c/0x190 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv020490rm+0x27/0x70 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv035148rm+0x5c/0xa0 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv008593rm+0xe6/0xf0 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv018374rm+0x2c0/0x330 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv018375rm+0x1f7/0x320 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv018339rm+0x72/0xc0 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv018353rm+0x235/0x2d0 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv026058rm+0x290/0x290 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv018355rm+0x39/0x4b0 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv026058rm+0x290/0x290 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv018386rm+0xac/0xe0 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv027717rm+0x820/0xdc0 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv007563rm+0x155/0x270 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv027725rm+0x8d/0x180 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? irq_thread_dtor+0xb0/0xb0
Nov 06 20:49:42 siegfried kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Nov 06 20:49:42 siegfried kernel:  ? irq_thread_fn+0x20/0x60
Nov 06 20:49:42 siegfried kernel:  ? irq_thread+0xf5/0x1a0
Nov 06 20:49:42 siegfried kernel:  ? irq_finalize_oneshot.part.0+0xf0/0xf0
Nov 06 20:49:42 siegfried kernel:  ? irq_thread_check_affinity+0xe0/0xe0
Nov 06 20:49:42 siegfried kernel:  ? kthread+0x117/0x130
Nov 06 20:49:42 siegfried kernel:  ? __kthread_bind_mask+0x60/0x60
Nov 06 20:49:42 siegfried kernel:  ? ret_from_fork+0x22/0x40

Does this imply that I have an Nvidia driver problem? Does it make sense to anyone else?

Another strange thing to note is that even though I can ssh in, I can't reboot using `sudo shutdown -r now`. That command immediately breaks the ssh connection, but the computer does not reboot. I have to press the reset button.
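For anyone triaging a similar freeze over ssh, the two lines worth isolating in an excerpt like the one above are the NVRM Xid report and the soft-lockup warning. A minimal sketch, assuming the frozen session was the previous boot; `filter_gpu_events` is a hypothetical helper name, not part of any package:

```shell
# Hypothetical helper: keep only NVIDIA Xid reports and soft-lockup
# warnings from kernel log text supplied on stdin.
filter_gpu_events() {
    grep -E 'NVRM: Xid|soft lockup'
}

# Example use (assumes the freeze happened during the previous boot):
#   journalctl -k -b -1 --no-pager | filter_gpu_events
```

Comparing the timestamps of the surviving lines shows whether the Xid report precedes the lockup, as it does here by about 30 seconds.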


#2 2020-11-27 17:05:33

ponyrider
Member
Registered: 2014-11-18
Posts: 112

Re: Diagnosing system freezes using journalctl

watchdog: BUG: soft lockup - CPU#14

It seems like a bug in kernel code is causing some type of infinite loop, and the NVIDIA kernel module looks like the culprit...

See if disabling the kernel module helps

modprobe -r nvidia
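
Unloading the nvidia module is likely to be refused while other modules or a running graphical session still hold it; the module list in the trace above shows nvidia_drm and nvidia_modeset loaded alongside it. A small sketch for checking that first. `nvidia_module_users` is a made-up helper name, and it assumes the usual `lsmod` column layout (name, size, use count):

```shell
# Hypothetical helper: print each nvidia* module and its use count
# from lsmod-style text on stdin (columns: Module, Size, Used by).
nvidia_module_users() {
    awk '$1 ~ /^nvidia/ { print $1, $3 }'
}

# Example use:
#   lsmod | nvidia_module_users
```

A non-zero count next to a module suggests something still depends on it, so it would probably have to be released before the module can be removed.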


#3 2020-11-28 03:34:26

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 25,231

Re: Diagnosing system freezes using journalctl

I haven't really seen that stack trace yet, but there are some issues not unlike what you see here with the 455 series of drivers.

I suggest you ssh in after the crash, run nvidia-bug-report.sh, and email the result to linux-bugs[at]nvidia[dot]com or post it on the Nvidia Linux boards.

You might want to give https://aur.archlinux.org/packages/nvidia-440xx-dkms a shot, as most of the issues in 455 are presumably not present in that version
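
To confirm which driver series is installed for the running kernel before switching packages, something along these lines may help. It is only a sketch: `driver_series` is a hypothetical helper, and it assumes the version string has the usual dotted form (for example 455.38) that `modinfo -F version nvidia` prints:

```shell
# Hypothetical helper: reduce a version string such as "455.38" on stdin
# to its series number (the part before the first dot).
driver_series() {
    cut -d. -f1
}

# Example use:
#   modinfo -F version nvidia | driver_series
```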

