#1 2020-10-12 11:21:48

Fish
Member
Registered: 2020-10-04
Posts: 3

Seemingly Random Kernel Oops and Application Core Dumps

Hello!

I have just built a new machine and installed Arch on it, added the Nvidia drivers, and got Gnome running via GDM. However, the system freezes randomly, anywhere from 5 minutes to a couple of hours after booting. When I investigate the logs I find kernel Oops and core dump entries. The hardware seems to be OK, as I can run it under Windows without problems. If you can help me track down what I've done wrong, that would be fantastic!

AMD Ryzen 5 3600XT
Gigabyte B550 AORUS PRO AC (F10 BIOS version)
Nvidia RTX2060
Samsung 970 Evo NVMe 1TB SSD


I have tried the following:
- Using the LTS kernel
- Forcing GDM to use X.org, as I use an Nvidia GPU and I use the machine for games
- Installing the r8125 module from the AUR for the motherboard's Ethernet adapter
- Investigating a number of errors in the boot log, but each one seemed to be benign
- Updating to the latest packages (this was a couple of days ago)
- Ensuring that the AMD microcode loader was working (it seems there are no patches for my chip yet, though)
- Checking the SMART status of the SSD
- Checking for SSD firmware updates (there are none)
- Stressing the system with FurMark on Windows, or Unigine Superposition on either OS; neither causes the system to fail

Kernel Oops

kernel: BUG: unable to handle page fault for address: 000055d7a0360600
kernel: #PF: error_code(0x0003) - permissions violation
kernel: Oops: 0003 [#1] SMP NOPTI
kernel: RIP: 0010:account_entity_dequeue+0x6e/0x80
kernel: #PF: error_code(0x0010) - not-present page
kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/9:2:269]
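
The error_code values in the lines above can be decoded by hand. Below is a minimal Python sketch, assuming the usual x86 page-fault bit layout (bit 0: protection violation vs. not-present page, bit 1: write, bit 2: user mode, bit 4: instruction fetch). The function name is made up for illustration and is not part of any kernel tooling.

```python
# Low bits of the x86 page-fault error code (assumed standard layout).
PF_PROT = 0x01   # set: permissions violation, clear: not-present page
PF_WRITE = 0x02  # set: write access
PF_USER = 0x04   # set: fault happened in user mode
PF_INSTR = 0x10  # set: fault was an instruction fetch


def decode_pf_error_code(code: int) -> dict[str, str]:
    """Describe cause, access type and CPU mode for a #PF error code."""
    if code & PF_INSTR:
        access = "instruction fetch"
    elif code & PF_WRITE:
        access = "write access"
    else:
        access = "read access"
    return {
        "cause": "permissions violation" if code & PF_PROT else "not-present page",
        "access": access,
        "mode": "user mode" if code & PF_USER else "kernel mode",
    }


if __name__ == "__main__":
    # The two codes quoted in the Oops excerpt above.
    for code in (0x0003, 0x0010):
        print(f"{code:#06x}", decode_pf_error_code(code))
```

Read this way, 0x0003 is a kernel-mode write that hit a permissions violation, and 0x0010 is a kernel-mode instruction fetch from a page that was not present. Two unrelated kinds of fault would be consistent with a cause that is not specific to one driver, though that is only a guess from these excerpts.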

Gnome Core Dump

gnome-shell[1111]: gnome-shell: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
gnome-session-binary[1062]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
gnome-session-binary[1062]: GLib-CRITICAL: g_hash_table_foreach_remove_or_steal: assertion 'version == hash_table->version' failed
gnome-session-binary[1062]: GLib-CRITICAL: g_variant_new_object_path: assertion 'g_variant_is_object_path (object_path)' failed
gnome-session-binary[1062]: GLib-GObject-CRITICAL: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
systemd-coredump[4627]: Process 1062 (gnome-session-b) of user 120 dumped core.

I also have two more Oops logs, an Evolution mail client core dump, and another Gnome core dump, if any of these would be useful.

#2 2020-10-12 12:26:45

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 995

My first try would be to replace the proprietary Nvidia drivers with nouveau and see whether the issue still occurs.
Some time ago one of my machines encountered a crashing WiFi driver when the proprietary Nvidia drivers were loaded.

#3 2020-10-14 10:20:14

Fish
Member
Registered: 2020-10-04
Posts: 3

Thanks for taking the time to reply!

I tried it out with the nouveau driver from the mesa package, but I ended up with a NULL pointer kernel Oops in the same manner as before: a couple of hours after startup the machine froze, and I found the following in journalctl.

kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page

The stack trace does mention nouveau, though:

kernel: Call Trace:
kernel:  drm_atomic_state_default_release+0x1b/0x30 [drm]
kernel:  nv50_disp_atomic_state_free+0xe/0x20 [nouveau]
kernel:  drm_atomic_helper_update_plane+0xf6/0x140 [drm_kms_helper]
kernel:  drm_mode_cursor_universal+0x128/0x240 [drm]
kernel:  drm_mode_cursor_common+0x102/0x230 [drm]
kernel:  ? drm_mode_setplane+0x330/0x330 [drm]
kernel:  drm_mode_cursor_ioctl+0x4d/0x70 [drm]
kernel:  drm_ioctl_kernel+0xb2/0x100 [drm]
kernel:  drm_ioctl+0x208/0x360 [drm]
kernel:  ? drm_mode_setplane+0x330/0x330 [drm]
kernel:  nouveau_drm_ioctl+0x55/0xa0 [nouveau]
kernel:  ksys_ioctl+0x82/0xc0
kernel:  __x64_sys_ioctl+0x16/0x20
kernel:  do_syscall_64+0x44/0x70
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
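
To compare several Oops logs, it can help to tally which modules show up in each call trace. The following Python sketch assumes frames look like the ones above (`symbol+0xOFF/0xLEN [module]`, with an optional leading `?`). The function name and the "(core kernel)" label are invented for this example.

```python
import re
from collections import Counter

# One call-trace frame as printed in the journal. The "? " prefix and the
# "[module]" suffix are both optional; frames without a module belong to
# the core kernel image.
FRAME_RE = re.compile(
    r"kernel:\s+(?:\?\s+)?(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def modules_in_trace(log_text: str) -> Counter:
    """Count call-trace frames per kernel module."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            counts[match.group(2) or "(core kernel)"] += 1
    return counts
```

If the same module dominates every trace, that module is a reasonable suspect. If the modules differ from crash to crash, the driver named in any single trace is less likely to be the root cause. Frames prefixed with `?` are counted too, although the kernel marks those as unreliable.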

#4 2020-10-14 21:19:46

twelveeighty
Member
From: Alberta, Canada
Registered: 2011-09-04
Posts: 689

Random freezes are tricky to troubleshoot. Does the system freeze if you leave it "alone" for 5 minutes - without any activity on browsers, email client, music players, etc.? Alternatively, can you force the freeze by performing some task?

Since you have a log statement that says "CPU#9 stuck for 23s!", also monitor your system load with "top" or any other tool of choice. Keep an eye out for memory usage spikes as well. Perhaps there is an app or thread that acts up and eventually crashes your system. Then again, those 23 seconds of being "stuck" could simply be when the kernel dump was being written, in which case it is an "effect", not a "cause".
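
Because a hard freeze also freezes "top", one option is to log load and memory to disk at a fixed interval, so the last sample before the freeze survives the reboot. This is a rough Python sketch that assumes a Linux system with /proc/meminfo. The function names and the output format are made up for illustration.

```python
import os
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kiB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def format_sample(timestamp: float, load: tuple, meminfo: dict[str, int]) -> str:
    """Build one log line from a timestamp, load averages and memory info."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    available = meminfo.get("MemAvailable", -1)  # -1 if the field is missing
    return f"{stamp} load1={load[0]:.2f} load5={load[1]:.2f} mem_available_kib={available}"


def log_samples(path: str, count: int, interval: float) -> None:
    """Append `count` samples to `path`, forcing each one to disk."""
    with open(path, "a", encoding="utf-8") as log:
        for _ in range(count):
            with open("/proc/meminfo", encoding="utf-8") as meminfo_file:
                meminfo = parse_meminfo(meminfo_file.read())
            log.write(format_sample(time.time(), os.getloadavg(), meminfo) + "\n")
            log.flush()
            os.fsync(log.fileno())  # so the line is on disk before a freeze
            time.sleep(interval)
```

A call such as `log_samples("freeze-watch.log", count=3600, interval=1.0)` (hypothetical file name) would cover an hour at one sample per second. A steady climb in load or a steady drop in available memory before the final line would point to a runaway process. Flat values right up to the end would suggest the freeze is not load-related.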

#5 2020-10-14 22:12:52

d_fajardo
Member
Registered: 2017-07-28
Posts: 938

"Ensuring that the AMD microcode loader was working (it seems there are no patches for my chip yet, though)"

What do you mean there are no patches for the chip? Do you actually have microcode installed and in your kernel parameters?
Can you also post a log from exactly the point when the freeze happens?
Your Gnome problem might be unrelated to the freezing.

Last edited by d_fajardo (2020-10-14 22:13:40)

#6 2020-10-15 14:06:48

Fish
Member
Registered: 2020-10-04
Posts: 3

twelveeighty: I have been trying to find a task that reproduces the issue, but I haven't found one yet. There have been times when I left the machine alone for a while and it was fine when I returned. I'll try an extended session and see how we fare. I had also thought the same about monitoring CPU usage; since I'm on Gnome I'm using the Gnome system monitor, but so far I have not caught anything. To be honest, though, if it were a leak I would have expected a gradual build-up before it fell over as the OS struggled under load, but for me it's perfectly smooth until it completely halts.

d_fajardo:
So, following the article on loading microcode, I checked whether it was trying to load an update if one was available.

dmesg | grep microcode
[    0.711561] microcode: CPU0: patch_level=0x08701021
[    0.711565] microcode: CPU1: patch_level=0x08701021
[    0.711567] microcode: CPU2: patch_level=0x08701021
[    0.711571] microcode: CPU3: patch_level=0x08701021
[    0.711576] microcode: CPU4: patch_level=0x08701021
[    0.711582] microcode: CPU5: patch_level=0x08701021
[    0.711587] microcode: CPU6: patch_level=0x08701021
[    0.711590] microcode: CPU7: patch_level=0x08701021
[    0.711594] microcode: CPU8: patch_level=0x08701021
[    0.711599] microcode: CPU9: patch_level=0x08701021
[    0.711604] microcode: CPU10: patch_level=0x08701021
[    0.711609] microcode: CPU11: patch_level=0x08701021
[    0.711625] microcode: Microcode Update Driver: v2.2.

I think this indicates that my microcode loading is working correctly, but that there isn't a microcode update for my CPU? I have the AMD microcode package installed, and the systemd-boot configs have been modified to do the early loading.
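
One thing that can be checked mechanically from output like the above is whether every core reports the same patch level. This is a small Python sketch, assuming the `microcode: CPU<n>: patch_level=0x...` line format shown in the dmesg excerpt. The function names are invented for this example.

```python
import re

# Matches lines such as "microcode: CPU3: patch_level=0x08701021".
PATCH_RE = re.compile(r"microcode: CPU(\d+): patch_level=(0x[0-9a-fA-F]+)")


def patch_levels(dmesg_text: str) -> dict[int, str]:
    """Map CPU number to the patch level reported in dmesg output."""
    return {int(cpu): level.lower() for cpu, level in PATCH_RE.findall(dmesg_text)}


def all_cores_match(levels: dict[int, str]) -> bool:
    """True if at least one core was found and all cores share one patch level."""
    return len(set(levels.values())) == 1
```

This only confirms that the cores are consistent with each other. It cannot tell whether the reported level came from the BIOS or from the early loader, or whether a newer patch exists for this CPU.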

As to the gnome issue not being related, that's entirely possible, but it would be bad practice not to mention it as part of the symptoms in case it was important.

The Oops log was what I found when I rebooted the system, the dump was at about the right time for the crash, and there were no more log lines after that for that boot session. I'll add seconds to my clock in Gnome so I can give a more accurate timestamp with the next log.

Update:
So, I caught it freezing at 10:54:06 today, and the last entry in journalctl is from 10:53:08, complaining that the system can't keep up with my mouse, which as far as I'm aware is benign:

 /usr/lib/gdm-x-session[1054]: (EE) event18 - Corsair Corsair Gaming SABRE RGB Mouse: client bug: event processing lagging behind by 15ms, your system is too slow

These messages go back every 30 seconds or so until 10:41. This also means I have no dump or segfault recorded for this freeze, frustratingly enough.
I also managed to see the CPU usage at the point of the freeze, and it seemed normal.

Update:
Tried Manjaro and Ubuntu 20.04 LTS; both showed the same freezing issue.
Reinstalled Arch; the LAN driver is in the kernel now (hooray!).
Tracker was crashing all over the place, obviously having issues with some of my files, and the Gnome music app was struggling, but that didn't tear the system down. Firefox has crashed a few times, and nouveau encountered a protection fault. I have now moved back to having the Nvidia driver installed.

Last edited by Fish (2020-10-24 15:06:43)
