#1 2024-03-19 20:03:03

awbs
Member
Registered: 2013-01-18
Posts: 27

Random kernel oops/panics since upgrading from Dec-2023 to Feb-2024

I have two systems with different hardware, one headless and one running KDE. Both have been experiencing frequent, random kernel oopses and panics since I ran pacman -Syu in February 2024. The last system upgrade before that was in December 2023 (running kernel linux-6.6.8.arch1-1 or linux-lts-6.1.69-1).

The oopses and panics happen on both the regular stock kernel and the LTS kernel, so I assume the cause is not the kernel update itself but some other package or something else in my environment?
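Since the suspicion is "some other package" that changed between December 2023 and February 2024, one way to get a candidate list is to read the upgrade entries out of /var/log/pacman.log. The following is only a minimal Python sketch: the function name upgrades_between and the sample lines are made up for illustration, and it assumes the usual "[timestamp] [ALPM] upgraded pkg (old -> new)" line format.

```python
import re
from datetime import date, datetime

# Assumed pacman.log entry format:
# [2024-02-10T09:15:02+0000] [ALPM] upgraded somepkg (1.0-1 -> 1.1-1)
UPGRADE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[ALPM\] upgraded (?P<pkg>\S+) "
    r"\((?P<old>\S+) -> (?P<new>\S+)\)$"
)

def upgrades_between(lines, start, end):
    """Return (day, package, old, new) for upgrades with start <= day < end."""
    found = []
    for line in lines:
        m = UPGRADE_RE.match(line.strip())
        if not m:
            continue  # installs, removals, hooks and other entries are ignored
        # The first 10 characters of the timestamp are YYYY-MM-DD.
        day = datetime.strptime(m.group("ts")[:10], "%Y-%m-%d").date()
        if start <= day < end:
            found.append((day, m.group("pkg"), m.group("old"), m.group("new")))
    return found

# Hypothetical sample lines, NOT taken from a real log.
SAMPLE = [
    "[2023-12-28T10:00:00+0000] [ALPM] upgraded examplepkg (1.0-1 -> 1.1-1)",
    "[2024-02-10T09:15:02+0000] [ALPM] upgraded examplepkg (1.1-1 -> 1.2-1)",
    "[2024-02-10T09:15:03+0000] [ALPM] installed otherpkg (2.0-1)",
]

for day, pkg, old, new in upgrades_between(SAMPLE, date(2024, 1, 1), date(2024, 3, 1)):
    print(day, pkg, old, "->", new)
```

On a real system the lines would come from open("/var/log/pacman.log") instead of SAMPLE. The resulting list only shows what changed in that window; it does not say which package is at fault.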

Two systems:
ThinkPad L390 Yoga with KDE
Intel NUC NUC10i7FNB, headless system accessed via SSH

Relevant dmesg logs:

Mar 19 13:16:15 ArchL390 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000398
Mar 19 13:16:15 ArchL390 kernel: #PF: supervisor write access in kernel mode
Mar 19 13:16:15 ArchL390 kernel: #PF: error_code(0x0002) - not-present page
Mar 19 13:16:15 ArchL390 kernel: PGD 0 P4D 0 
Mar 19 13:16:15 ArchL390 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Mar 19 13:16:15 ArchL390 kernel: CPU: 3 PID: 690152 Comm: kworker/3:0 Tainted: G           OE      6.6.21-1-lts #1 0c0a74bb77159d2e130f727f514cce3b101bcba5
Mar 19 13:16:15 ArchL390 kernel: Hardware name: LENOVO 20NTCTO1WW/20NTCTO1WW, BIOS R10ET46W (1.31 ) 07/09/2020
Mar 19 13:16:15 ArchL390 kernel: Workqueue: events sg_remove_sfp_usercontext [sg]
Mar 19 13:16:15 ArchL390 kernel: RIP: 0010:mutex_lock+0x1d/0x30
Mar 19 13:16:15 ArchL390 kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 48 89 fb e8 7e db ff ff 31 c0 65 48 8b 14 25 80 39 03 00 <f0> 48 0f b1 13 75 06 5b >
Mar 19 13:16:15 ArchL390 kernel: RSP: 0018:ffffb7f5e728bdd0 EFLAGS: 00010246
Mar 19 13:16:15 ArchL390 kernel: RAX: 0000000000000000 RBX: 0000000000000398 RCX: 00000000820001ed
Mar 19 13:16:15 ArchL390 kernel: RDX: ffff9ffb2e1db600 RSI: fffff83f84f56780 RDI: 0000000000000398
Mar 19 13:16:15 ArchL390 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000820001ed
Mar 19 13:16:15 ArchL390 kernel: R10: ffff9ff63d59e000 R11: 0000000000000246 R12: 0000000000000398
Mar 19 13:16:15 ArchL390 kernel: R13: ffff9ff6215329c0 R14: ffff9ff600188805 R15: ffff9ff6e7493328
Mar 19 13:16:15 ArchL390 kernel: FS:  0000000000000000(0000) GS:ffff9ffd5f6c0000(0000) knlGS:0000000000000000
Mar 19 13:16:15 ArchL390 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 19 13:16:15 ArchL390 kernel: CR2: 0000000000000398 CR3: 000000010cf68003 CR4: 00000000003706e0
Mar 19 13:16:15 ArchL390 kernel: Call Trace:
Mar 19 13:16:15 ArchL390 kernel:  <TASK>
Mar 19 13:16:15 ArchL390 kernel:  ? __die+0x23/0x70
Mar 19 13:16:15 ArchL390 kernel:  ? page_fault_oops+0x171/0x4e0
Mar 19 13:16:15 ArchL390 kernel:  ? __slab_free+0xf1/0x380
Mar 19 13:16:15 ArchL390 kernel:  ? exc_page_fault+0x7f/0x180
Mar 19 13:16:15 ArchL390 kernel:  ? asm_exc_page_fault+0x26/0x30
Mar 19 13:16:15 ArchL390 kernel:  ? mutex_lock+0x1d/0x30
Mar 19 13:16:15 ArchL390 kernel:  blk_trace_remove+0x1a/0x70
Mar 19 13:16:15 ArchL390 kernel:  sg_device_destroy+0x26/0xa0 [sg e0aa8dc69a365a6d88a9d8a7b2ca623614d59679]
Mar 19 13:16:15 ArchL390 kernel:  sg_remove_sfp_usercontext+0x15a/0x1c0 [sg e0aa8dc69a365a6d88a9d8a7b2ca623614d59679]
Mar 19 13:16:15 ArchL390 kernel:  process_one_work+0x178/0x350
Mar 19 13:16:15 ArchL390 kernel:  worker_thread+0x30f/0x450
Mar 19 13:16:15 ArchL390 kernel:  ? __pfx_worker_thread+0x10/0x10
Mar 19 13:16:15 ArchL390 kernel:  kthread+0xe5/0x120
Mar 19 13:16:15 ArchL390 kernel:  ? __pfx_kthread+0x10/0x10
Mar 19 13:16:15 ArchL390 kernel:  ret_from_fork+0x31/0x50
Mar 19 13:16:15 ArchL390 kernel:  ? __pfx_kthread+0x10/0x10
Mar 19 13:16:15 ArchL390 kernel:  ret_from_fork_asm+0x1b/0x30
Mar 19 13:16:15 ArchL390 kernel:  </TASK>
Mar 19 13:16:15 ArchL390 kernel: Modules linked in: udf crc_itu_t uas iwlmvm iwlwifi sr_mod cdrom usb_storage typec_displayport ax88179_178a cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet mii xt_M>
Mar 19 13:16:15 ArchL390 kernel:  snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel snd_ctl_led nls_iso8859_1 snd_soc_acpi snd_hda_codec_realtek vfat snd_soc_core tps6598x snd_hda_codec_gene>
Mar 19 13:16:15 ArchL390 kernel:  int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad mac_hid ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_comment xt_recen>
Mar 19 13:16:15 ArchL390 kernel: CR2: 0000000000000398
Mar 19 13:16:15 ArchL390 kernel: ---[ end trace 0000000000000000 ]---
Mar 19 13:16:15 ArchL390 kernel: RIP: 0010:mutex_lock+0x1d/0x30
Mar 19 13:16:15 ArchL390 kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 48 89 fb e8 7e db ff ff 31 c0 65 48 8b 14 25 80 39 03 00 <f0> 48 0f b1 13 75 06 5b >
Mar 19 13:16:15 ArchL390 kernel: RSP: 0018:ffffb7f5e728bdd0 EFLAGS: 00010246
Mar 19 13:16:15 ArchL390 kernel: RAX: 0000000000000000 RBX: 0000000000000398 RCX: 00000000820001ed
Mar 19 13:16:15 ArchL390 kernel: RDX: ffff9ffb2e1db600 RSI: fffff83f84f56780 RDI: 0000000000000398
Mar 19 13:16:15 ArchL390 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000820001ed
Mar 19 13:16:15 ArchL390 kernel: R10: ffff9ff63d59e000 R11: 0000000000000246 R12: 0000000000000398
Mar 19 13:16:15 ArchL390 kernel: R13: ffff9ff6215329c0 R14: ffff9ff600188805 R15: ffff9ff6e7493328
Mar 19 13:16:15 ArchL390 kernel: FS:  0000000000000000(0000) GS:ffff9ffd5f6c0000(0000) knlGS:0000000000000000
Mar 19 13:16:15 ArchL390 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 19 13:16:15 ArchL390 kernel: CR2: 0000000000000398 CR3: 000000010cf68003 CR4: 00000000003706e0

Immediately afterwards on the Lenovo, but not the NUC:

Mar 19 13:16:25 ArchL390 kwin_wayland[1544]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Mar 19 13:16:30 ArchL390 kwin_wayland[1544]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Mar 19 13:16:35 ArchL390 wpa_supplicant[1290]: wlp0s20f3: CTRL-EVENT-SIGNAL-CHANGE above=0 signal=-48 noise=9999 txrate=1560000
Mar 19 13:16:35 ArchL390 kwin_wayland[1544]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Mar 19 13:16:40 ArchL390 kwin_wayland[1544]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Mar 19 13:16:45 ArchL390 kwin_wayland[1544]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Mar 19 13:16:46 ArchL390 systemd-logind[1012]: Power key pressed short.
Mar 19 13:16:47 ArchL390 wpa_supplicant[1290]: wlp0s20f3: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-49 noise=9999 txrate=1560000
Mar 19 13:16:50 ArchL390 kwin_wayland[1544]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug

Interestingly, the Lenovo (the one with the GUI) freezes and has to be hard-rebooted, but it still logs the power button press I use for the hard reboot, so part of the system is still up.

The NUC remains accessible via SSH after the oops, but USB devices are no longer accessible; for example, using the terminal to 'cd' into a mounted USB device hangs indefinitely. Eventually the NUC also hangs, but it is usually responsive for a while after the first oops.

As you can see, the oops message is very vague, and after extensive searching through the last few weeks of logs I haven't been able to find the actual panic message, only the kernel oops right before the panic.

Can anyone help me narrow down what I should test next to figure out what may be causing this? Since USB devices are not accessible afterwards on the NUC, I assume it is something USB-related on both devices?
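One way to narrow this down is to reduce each oops to its faulting function, its workqueue and the call-trace frames that are not prefixed with "?" (frames marked "?" are addresses found on the stack that the unwinder could not verify). The Python below is a rough sketch for doing that on journal lines like the ones pasted above; the function name summarize_oops and the regular expressions are assumptions for illustration, not an existing tool.

```python
import re

RIP_RE = re.compile(r"RIP: \w+:(\S+)\+0x")
WQ_RE = re.compile(r"Workqueue: (.+)$")
# A call-trace frame: optional "? " marker, symbol+offset/size, optional [module ...]
FRAME_RE = re.compile(
    r"kernel:\s+(?P<guess>\? )?(?P<sym>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+))?"
)

def summarize_oops(lines):
    """Extract the faulting symbol, workqueue and reliable frames from one oops."""
    summary = {"rip": None, "workqueue": None, "frames": []}
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if "</TASK>" in line:
            in_trace = False
            continue
        m = RIP_RE.search(line)
        if m and summary["rip"] is None:
            summary["rip"] = m.group(1)
        m = WQ_RE.search(line)
        if m:
            summary["workqueue"] = m.group(1).strip()
        if in_trace:
            m = FRAME_RE.search(line)
            if m and not m.group("guess"):  # skip unverified "?" frames
                summary["frames"].append((m.group("sym"), m.group("mod")))
    return summary

# Lines copied from the log in this post.
SAMPLE = [
    "Mar 19 13:16:15 ArchL390 kernel: Workqueue: events sg_remove_sfp_usercontext [sg]",
    "Mar 19 13:16:15 ArchL390 kernel: RIP: 0010:mutex_lock+0x1d/0x30",
    "Mar 19 13:16:15 ArchL390 kernel: Call Trace:",
    "Mar 19 13:16:15 ArchL390 kernel:  <TASK>",
    "Mar 19 13:16:15 ArchL390 kernel:  ? mutex_lock+0x1d/0x30",
    "Mar 19 13:16:15 ArchL390 kernel:  blk_trace_remove+0x1a/0x70",
    "Mar 19 13:16:15 ArchL390 kernel:  sg_device_destroy+0x26/0xa0 [sg e0aa8dc69a365a6d88a9d8a7b2ca623614d59679]",
    "Mar 19 13:16:15 ArchL390 kernel:  sg_remove_sfp_usercontext+0x15a/0x1c0 [sg e0aa8dc69a365a6d88a9d8a7b2ca623614d59679]",
    "Mar 19 13:16:15 ArchL390 kernel:  process_one_work+0x178/0x350",
    "Mar 19 13:16:15 ArchL390 kernel:  </TASK>",
]

print(summarize_oops(SAMPLE))
```

For the trace in this post, the verified frames run through the sg (SCSI generic) module into blk_trace_remove, and the module list includes usb_storage, uas and sr_mod. That is consistent with the USB storage suspicion, but it only points at where the NULL pointer was hit, not at what caused it. Comparing summaries from several oopses on both machines would show whether it is always the same path.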

Last edited by awbs (2024-03-20 02:15:54)

#2 2024-05-29 15:11:31

aleh
Member
Registered: 2024-05-29
Posts: 2

Re: Random kernel oops/panics since upgrading from Dec-2023 to Feb-2024

A similar issue started on my MSI Prestige 14 laptop after updating the system in February 2024.
Sometimes it is a memory page fault, sometimes a PCI bus write error, and sometimes a kernel panic.

I tried reverting the Linux kernel to the pre-upgrade version, but it did not help.

I am going to try downgrading the "linux-firmware" package.

#3 2024-06-07 12:43:30

aleh
Member
Registered: 2024-05-29
Posts: 2

Re: Random kernel oops/panics since upgrading from Dec-2023 to Feb-2024

Downgrading the nvidia driver to version 535 by installing nvidia-535xx-dkms from the AUR [https://aur.archlinux.org/packages/nvidia-535xx-dkms] solved the issue.

#4 2024-06-09 11:53:44

CuriousRubick
Member
Registered: 2023-11-19
Posts: 6

Re: Random kernel oops/panics since upgrading from Dec-2023 to Feb-2024

Keep an eye on this thread: https://forums.developer.nvidia.com/t/s … /284772/59.

The 555.52.04 beta driver with the kernel parameter NVreg_EnableGpuFirmware=0 might also solve the issue. I'm using nvidia-open (555), which also works but doesn't suspend correctly.
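To confirm that a parameter like NVreg_EnableGpuFirmware=0 actually reached the running kernel, the boot command line can be checked. This is a small Python sketch under one assumption: that the option was passed on the kernel command line in the usual module.param=value form, i.e. as nvidia.NVreg_EnableGpuFirmware=0. If it was set through a modprobe.d options file instead, this check will not see it. The function name module_params and the sample command line are made up for illustration.

```python
def module_params(cmdline, module):
    """Collect 'module.param=value' entries from a kernel command line string.

    Splits on whitespace only, so quoted values containing spaces are not handled.
    """
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params

# Hypothetical command line; on a real system read it from /proc/cmdline.
sample = "root=/dev/sda2 rw nvidia.NVreg_EnableGpuFirmware=0 quiet"
print(module_params(sample, "nvidia"))
```

An empty result means the option was not given at boot in this form.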
