[SOLVED] Seemingly Random Kernel Oops and Application Core Dumps

Fish · 2020-10-12 11:21:48

Hello!

I have just built a new machine and installed Arch on it, added the Nvidia drivers and got Gnome running via GDM. However, the system freezes randomly, anywhere between 5 minutes and a couple of hours after booting. When investigating the logs I find kernel Oops and core dump entries. The hardware seems to be OK, as I can run it under Windows with no problem. If you can help me track down what I've done wrong, that would be fantastic!

AMD Ryzen 5 3600XT
Gigabyte B550 AORUS PRO AC (F10 BIOS version)
Nvidia RTX2060
Samsung 970 Evo NVMe 1TB SSD

I have tried the following:
- Using the LTS kernel
- I have forced GDM to use X.org, as I use an Nvidia GPU and I use the machine for games
- I installed the r8125 module from the AUR for the Ethernet adapter on the motherboard
- I investigated a number of errors in the boot log, but each one seemed to be benign
- Updated to the latest packages (this was a couple of days ago)
- Ensured that the AMD microcode loader was working (seems there's no patches for my chip yet though)
- Checked the SMART status of the SSD
- There are no firmware updates for the SSD
- Stressing the system with FurMark on Windows, or Unigine Superposition on either OS, does not cause the system to fail

Kernel Oops

kernel: BUG: unable to handle page fault for address: 000055d7a0360600

kernel: #PF: error_code(0x0003) - permissions violation

kernel: Oops: 0003 [#1] SMP NOPTI

kernel: RIP: 0010:account_entity_dequeue+0x6e/0x80

kernel: #PF: error_code(0x0010) - not-present page

kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/9:2:269]
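
For anyone reading the #PF lines above, the error_code value can be decoded bit by bit. The following is a minimal sketch in Python, assuming the usual x86 page-fault error code layout (bit 0 = protection violation, bit 1 = write, bit 2 = user mode, bit 4 = instruction fetch). The function name `decode_pf_error_code` is made up for this example and the remaining bits are ignored.

```python
# Hypothetical helper: decode the value from "#PF: error_code(0x....)" lines.
# Assumes the x86 bit layout described in the lead-in; other bits are ignored.
def decode_pf_error_code(code):
    # Bit 0: set means the page was present but access was not permitted.
    cause = "protection violation" if code & 0x01 else "not-present page"
    # Bit 4 (instruction fetch) takes priority over bit 1 (write) here.
    if code & 0x10:
        access = "instruction fetch"
    elif code & 0x02:
        access = "write"
    else:
        access = "read"
    # Bit 2: set means the fault was raised from user mode.
    mode = "user mode" if code & 0x04 else "kernel mode"
    return cause, access, mode


# The codes that appear in this thread's logs.
for code in (0x0003, 0x0010, 0x0000):
    print(f"{code:#06x}:", ", ".join(decode_pf_error_code(code)))
```

Under those assumptions, all three codes seen in this thread decode to faults taken in kernel mode, which fits the "supervisor" wording the kernel prints next to them.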

Gnome Core Dump

gnome-shell[1111]: gnome-shell: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.

gnome-session-binary[1062]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1062]: GLib-CRITICAL: g_hash_table_foreach_remove_or_steal: assertion 'version == hash_table->version' failed
gnome-session-binary[1062]: GLib-CRITICAL: g_variant_new_object_path: assertion 'g_variant_is_object_path (object_path)' failed
gnome-session-binary[1062]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

systemd-coredump[4627]: Process 1062 (gnome-session-b) of user 120 dumped core.

I also have two more Oops logs, an Evolution mail client dump and another Gnome dump, if any of these would be useful.

Last edited by Fish (2022-02-19 13:20:00)

schard · 2020-10-12 12:26:45

My first try would be to replace the proprietary Nvidia drivers with nouveau and see whether the issue still occurs.
Some time ago one of my machines encountered a crashing WiFi driver when the proprietary Nvidia drivers were loaded.

Fish · 2020-10-14 10:20:14

Thanks for taking the time to reply!

I tried it out using the nouveau driver from the mesa package, but I ended up with a NULL pointer kernel Oops in the same manner as before: a couple of hours after startup the machine froze, and I checked journalctl afterwards.

kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page

The stack trace does mention nouveau, though...

kernel: Call Trace:
kernel:  drm_atomic_state_default_release+0x1b/0x30 [drm]
kernel:  nv50_disp_atomic_state_free+0xe/0x20 [nouveau]
kernel:  drm_atomic_helper_update_plane+0xf6/0x140 [drm_kms_helper]
kernel:  drm_mode_cursor_universal+0x128/0x240 [drm]
kernel:  drm_mode_cursor_common+0x102/0x230 [drm]
kernel:  ? drm_mode_setplane+0x330/0x330 [drm]
kernel:  drm_mode_cursor_ioctl+0x4d/0x70 [drm]
kernel:  drm_ioctl_kernel+0xb2/0x100 [drm]
kernel:  drm_ioctl+0x208/0x360 [drm]
kernel:  ? drm_mode_setplane+0x330/0x330 [drm]
kernel:  nouveau_drm_ioctl+0x55/0xa0 [nouveau]
kernel:  ksys_ioctl+0x82/0xc0
kernel:  __x64_sys_ioctl+0x16/0x20
kernel:  do_syscall_64+0x44/0x70
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9

twelveeighty · 2020-10-14 21:19:46

Random freezes are tricky to troubleshoot. Does the system freeze if you leave it "alone" for 5 minutes - without any activity on browsers, email client, music players, etc.? Alternatively, can you force the freeze by performing some task?

Since you have a log statement that says "CPU#9 stuck for 23s!", also monitor your system load with "top" or any other tool of choice. Keep an eye out for memory usage spikes as well. Perhaps there is an app or thread that acts up and eventually crashes your system. Then again, it could also be that the CPU was "stuck" for those 23s while the kernel dump was being made, so it could be just an "effect", not a "cause".
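
Because the freeze takes the whole machine down, an interactive tool like top loses its display at the moment of interest. One hedged alternative is to append samples to a file and sync each line to disk, so the last sample before a freeze should still be readable after a reboot. The sketch below is in Python and assumes a Linux /proc layout; the names `parse_meminfo`, `sample` and `log_samples`, and the file name `load.log`, are made up for this example.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        value = rest.split()
        if value:
            fields[name.strip()] = int(value[0])
    return fields


def sample():
    """Return one log line with the 1-minute load average and MemAvailable."""
    with open("/proc/loadavg") as f:
        load1 = f.read().split()[0]
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    stamp = time.strftime("%H:%M:%S")
    return f"{stamp} load1={load1} MemAvailable={mem.get('MemAvailable')}kB"


def log_samples(path="load.log", interval=5, count=None):
    """Append samples to path; count=None keeps going until interrupted."""
    taken = 0
    with open(path, "a") as out:
        while count is None or taken < count:
            out.write(sample() + "\n")
            out.flush()
            os.fsync(out.fileno())  # ask the kernel to write the line to disk now
            taken += 1
            time.sleep(interval)
```

Calling `log_samples()` in a terminal and checking the tail of `load.log` after the next freeze would show whether load or available memory changed in the seconds before the hang.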

d_fajardo · 2020-10-14 22:12:52

Fish wrote: "ensured that the AMD microcode loader was working (seems there's no patches for my chip yet though)"

What do you mean there are no patches for the chip? Do you actually have the microcode installed and in your kernel parameters?
Also, can you post a log from exactly the point when the freeze happens?
Your Gnome problem might be unrelated to the freezing.

Last edited by d_fajardo (2020-10-14 22:13:40)

Fish · 2020-10-15 14:06:48

twelveeighty: I have been trying to find a task that would reproduce the issue, but I haven't found one as yet. I have had situations where I have left it for some time and it has been fine when I returned. I'll try an extended session and see how we fare. I had also thought the same in terms of monitoring CPU usage. Since I'm using Gnome I'm using the Gnome system monitor, but so far I have not caught it. TBH though, if it were a leak, I would have thought there would be a gradual build-up before it fell over as the OS struggles under load, but for me it's perfectly smooth until it completely halts.

d_fajardo:
So, following the article on loading microcode, I looked to see that it was trying to load something if it was available.

dmesg | grep microcode
[    0.711561] microcode: CPU0: patch_level=0x08701021
[    0.711565] microcode: CPU1: patch_level=0x08701021
[    0.711567] microcode: CPU2: patch_level=0x08701021
[    0.711571] microcode: CPU3: patch_level=0x08701021
[    0.711576] microcode: CPU4: patch_level=0x08701021
[    0.711582] microcode: CPU5: patch_level=0x08701021
[    0.711587] microcode: CPU6: patch_level=0x08701021
[    0.711590] microcode: CPU7: patch_level=0x08701021
[    0.711594] microcode: CPU8: patch_level=0x08701021
[    0.711599] microcode: CPU9: patch_level=0x08701021
[    0.711604] microcode: CPU10: patch_level=0x08701021
[    0.711609] microcode: CPU11: patch_level=0x08701021
[    0.711625] microcode: Microcode Update Driver: v2.2.

which I think indicates my microcode loading is working correctly, but that there isn't a microcode update for it? I have the AMD microcode package installed, and the systemd-boot configs have been modified to do the early loading.
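
As a hedged aside on reading that output: the patch_level lines report the microcode revision each core is running, so one quick check is that every core reports the same value. The sketch below is in Python; the function name `patch_levels` is made up for this example, and the sample text is a shortened copy of the dmesg output above. It does not determine whether a newer revision exists.

```python
import re


def patch_levels(dmesg_text):
    """Map CPU index to the patch_level string found in dmesg microcode lines."""
    pattern = re.compile(r"microcode: CPU(\d+): patch_level=(0x[0-9a-fA-F]+)")
    return {int(cpu): level for cpu, level in pattern.findall(dmesg_text)}


# Shortened copy of the output posted above.
sample_text = """\
[    0.711561] microcode: CPU0: patch_level=0x08701021
[    0.711565] microcode: CPU1: patch_level=0x08701021
[    0.711625] microcode: Microcode Update Driver: v2.2.
"""

levels = patch_levels(sample_text)
distinct = sorted(set(levels.values()))
print(f"{len(levels)} CPUs, distinct patch levels: {distinct}")
```

Feeding it the full `dmesg | grep microcode` output would list one entry per logical CPU; more than one distinct level would be worth mentioning in the thread.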

As to the gnome issue not being related, that's entirely possible, but it would be bad practice not to mention it as part of the symptoms in case it was important.

The Oops log was what I found when I rebooted the system, the dump was at about the right time for the crash, and there were no more log lines after that for that boot session. I'll add seconds to my clock in Gnome so I can give a more accurate timestamp with the next log.

Update:
So, I caught it freezing at 10:54:06 today, and the last entry in journalctl is from 10:53:08, complaining that the system can't keep up with my mouse, which as far as I'm aware is benign:

 /usr/lib/gdm-x-session[1054]: (EE) event18 - Corsair Corsair Gaming SABRE RGB Mouse: client bug: event processing lagging behind by 15ms, your system is too slow

This message repeats every 30 seconds or so, going back to 10:41. Frustratingly, it also means I have no dump or segfault recorded for this freeze.
I also managed to see the CPU usage at the point of the freeze, and it seemed normal.

Update
Tried Manjaro and Ubuntu 20.04 LTS; both showed the same freezing issue.
Reinstalled Arch; the LAN driver is in the kernel now (hooray!).
Tracker was crashing all over the place, obviously having issues with some of my files, and the Gnome music app was struggling, but that didn't bring the system down. Firefox has crashed a few times, and nouveau encountered a protection fault. I have now moved back to having the Nvidia driver installed.

Update
Uninstalled Gnome and tried LXDE with LXDM; this also showed the system freeze. Going over the journalctl logs I found:

Process 6046 (lxsession-defau) of user 1000 dumped core.

Stack trace of thread 6046:
#0  0x000056395f96118c n/a (lxsession-default-apps + 0x1318c)
#1  0x000056395f96321f n/a (lxsession-default-apps + 0x1521f)
#2  0x000056395f96339e n/a (lxsession-default-apps + 0x1539e)
#3  0x000056395f95b6d8 n/a (lxsession-default-apps + 0xd6d8)
#4  0x000056395f95bd02 n/a (lxsession-default-apps + 0xdd02)
#5  0x000056395f956939 n/a (lxsession-default-apps + 0x8939)
#6  0x000056395f959fde n/a (lxsession-default-apps + 0xbfde)
#7  0x00007f51dba60224 n/a (libglib-2.0.so.0 + 0x50224)
#8  0x00007f51dba62914 g_main_context_dispatch (libglib-2.0.so.0 + 0x52914)
#9  0x00007f51dbab67d1 n/a (libglib-2.0.so.0 + 0xa67d1)
#10 0x00007f51dba61e63 g_main_loop_run (libglib-2.0.so.0 + 0x51e63)
#11 0x00007f51dbeabb7e gtk_main (libgtk-x11-2.0.so.0 + 0x12eb7e)
#12 0x000056395f95a234 n/a (lxsession-default-apps + 0xc234)
#13 0x00007f51db86f152 __libc_start_main (libc.so.6 + 0x28152)
#14 0x000056395f95405e n/a (lxsession-default-apps + 0x605e)

Stack trace of thread 6048:
#0  0x00007f51db93c46f __poll (libc.so.6 + 0xf546f)
#1  0x00007f51dbab675f n/a (libglib-2.0.so.0 + 0xa675f)
#2  0x00007f51dba61121 g_main_context_iteration (libglib-2.0.so.0 + 0x51121)
#3  0x00007f51dba61172 n/a (libglib-2.0.so.0 + 0x51172)
#4  0x00007f51dba8fce1 n/a (libglib-2.0.so.0 + 0x7fce1)
#5  0x00007f51db2c23e9 start_thread (libpthread.so.0 + 0x93e9)
#6  0x00007f51db947293 __clone (libc.so.6 + 0x100293)

The machine seems to have stumbled on for 7 minutes, then the kernel seems to have detonated on me, mentioning iwlwifi. The card is an Intel Dual Band Wireless-AC 3168, for which there are iwlwifi drivers in the kernel.

BUG: workqueue leaked lock or atomic: kworker/u65:2/0x00000001/488
                                         last function: iwl_pcie_rx_allocator_work [iwlwifi]
 CPU: 3 PID: 488 Comm: kworker/u65:2 Tainted: P           OE     5.9.2-arch1-1 #1
 Hardware name: Gigabyte Technology Co., Ltd. B550 AORUS PRO AC/B550 AORUS PRO AC, BIOS F10 09/18/2020
 Workqueue: rb_allocator iwl_pcie_rx_allocator_work [iwlwifi]
 Call Trace:
  dump_stack+0x6b/0x88
  process_one_work.cold+0x2b/0x30
  worker_thread+0x4d/0x3d0
  ? rescuer_thread+0x410/0x410
  kthread+0x142/0x160
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 BUG: scheduling while atomic: kworker/u65:2/488/0x00000002
 Modules linked in: ccm joydev input_leds mousedev btusb btrtl btbcm btintel hid_generic bluetooth usbhid ecdh_generic hid ecc nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg edac_mce_amd iwlmvm snd_hda_codec mac80211 kvm libarc4 snd_hda_core ucsi_ccg irqbypass iwlwifi crct10dif_pclmul snd_hwdep crc32_pclmul drm_kms_helper r8169 ghash_clmulni_intel typec_ucsi snd_pcm typec cec aesni_intel realtek crypto_simd wmi_bmof rc_core cfg80211 mdio_devres syscopyarea snd_timer cryptd nls_iso8859_1 ccp of_mdio nls_cp437 glue_helper sysfillrect snd fixed_phy uas vfat sp5100_tco sysimgblt fat rapl usb_storage pcspkr libphy k10temp i2c_piix4 rng_core soundcore fb_sys_fops rfkill i2c_nvidia_gpu wmi evdev acpi_cpufreq pinctrl_amd gpio_amdpt mac_hid drm agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci xhci_pci_renesas xhci_hcd
 Preemption disabled at:
 [<ffffffffa4a81543>] kernel_init_free_pages+0x33/0x90
 CPU: 3 PID: 488 Comm: kworker/u65:2 Tainted: P           OE     5.9.2-arch1-1 #1
 Hardware name: Gigabyte Technology Co., Ltd. B550 AORUS PRO AC/B550 AORUS PRO AC, BIOS F10 09/18/2020
 Workqueue:  0x0 (rb_allocator)
 Call Trace:
  dump_stack+0x6b/0x88
  __schedule_bug.cold+0x89/0x97
  __schedule+0x717/0x830
  schedule+0x46/0xf0
  worker_thread+0xbf/0x3d0
  ? rescuer_thread+0x410/0x410
  kthread+0x142/0x160
  ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
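
A side note on the "Tainted: P OE" field in the trace above: each letter is a kernel taint flag, and here they are consistent with the nvidia modules marked (POE) in the module list. The sketch below is in Python and decodes only a partial set of letters, with meanings taken from my reading of the kernel's tainted-kernels documentation. The names `TAINT_FLAGS` and `decode_taint` are made up for this example.

```python
import re

# Partial table, assumed from the kernel's tainted-kernels documentation.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "O": "out-of-tree (externally built) module was loaded",
    "E": "unsigned module was loaded",
    "W": "kernel issued a warning earlier",
    "D": "kernel has already hit an oops or BUG",
    "L": "soft lockup occurred earlier",
}


def decode_taint(line):
    """Return (letter, meaning) pairs for the Tainted field of an oops header line."""
    match = re.search(r"Tainted: ([A-Z ]+)", line)
    if not match:
        return []
    return [
        (letter, TAINT_FLAGS.get(letter, "not covered by this sketch"))
        for letter in match.group(1)
        if letter != " "
    ]


# Header line copied from the trace above.
header = "CPU: 3 PID: 488 Comm: kworker/u65:2 Tainted: P           OE     5.9.2-arch1-1 #1"
for letter, meaning in decode_taint(header):
    print(letter, "-", meaning)
```

The taint flags describe what has been loaded or has happened since boot; they do not by themselves identify which module caused a given oops.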

Update
OK, so now I can use the 2.5 Gbps LAN card, and I get the same symptoms as described above. If I turn on the WiFi card and boot Gnome, it will reliably crash within a couple of minutes; blacklisting iwlmvm and iwlwifi means I can run it for between one and three hours. I have also found that the motherboard does not apply good power settings for my RAM when in XMP, so I had to set the DRAM power manually. This is stable under Windows, but not with Arch.

Last edited by Fish (2020-12-20 15:48:46)

Fish · 2022-02-19 13:20:40

Replaced the motherboard and I am no longer affected by the issue, so I am assuming it was a H/W fault.
