Crashed - kernel: BUG: unable to handle page fault

truth_believer · 2022-07-07 15:32:46

Hi again!

Experience

I was oblivious:

pacman -Ss wine
Killed

and, puzzled, I repeated it a couple of times.
I wondered: does that mean the Wine project got killed? Kidding.

top -Hd .2
Killed

I went to search for that, but some windows wouldn't draw.
On tty2, journalctl showed the kernel printing something that was not English to me, ending in:

kernel: CR2:  ???????????????? CR3:  ???????????????? CR4:  ????????????????

After a minute, tty7 was black, and a bit later tty2 wouldn't come up either.
And this old ThinkPad has no disk LED.
After 10 minutes, I forcefully powered off.

Evidence

Kernel version: linux-lts 5.15.52-1
journalctl: http://0x0.st/oQtt.txt
The dumped core of gnome-system-monitor (Should I upload it?).

What I could understand

gnome-system-monitor had an ANOM_ABEND, which is:

Triggered when a process ends abnormally (with a signal that could cause a core dump, if enabled).

And it did cause a core dump.
But then nothing for 8 minutes.
Then:

kernel: BUG: unable to handle page fault for address: ????
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD ...
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ...

and the same kind of report appears 4 more times for pacman (I guess those are the times I ran pacman before noticing the freeze), which make up [#3] to [#6].

Each is followed by debugging info, and then nothing for several seconds before the next one.
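To keep the numbered reports straight, here is a minimal Python sketch that pulls the `[#N]` counter and the `Comm:` task name out of journal text. The regular expressions are only fitted to the lines quoted in this post (an assumption; other kernels may word these lines differently), and `summarize_oopses` is a made-up helper name.

```python
import re

# Header of each report, e.g.
#   kernel: Oops: 0000 [#1] SMP PTI
#   kernel: general protection fault, ...: 0000 [#2] SMP PTI
HEADER = re.compile(
    r"kernel: (?P<kind>Oops|general protection fault).*\[#(?P<seq>\d+)\]"
)
# Line naming the task that was running on the faulting CPU.
TASK = re.compile(r"kernel: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")


def summarize_oopses(lines):
    """Return one (seq, kind, comm, pid) tuple per report header found."""
    reports = []
    pending = None  # header seen, still waiting for its CPU/PID/Comm line
    for line in lines:
        header = HEADER.search(line)
        if header:
            pending = (int(header["seq"]), header["kind"])
            continue
        task = TASK.search(line)
        if task and pending:
            reports.append((*pending, task["comm"], int(task["pid"])))
            pending = None
    return reports


# Sample input: the lines quoted in this post, with their elisions left as-is.
sample = [
    "kernel: BUG: unable to handle page fault for address: ????",
    "kernel: #PF: supervisor read access in kernel mode",
    "kernel: #PF: error_code(0x0000) - not-present page",
    "kernel: Oops: 0000 [#1] SMP PTI",
    "kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ...",
    "kernel: general protection fault, probably for non-canonical address ????: 0000 [#2] SMP PTI",
    "kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ....",
]

for report in summarize_oopses(sample):
    print(report)
```

The same function could be fed the kernel messages of the crashed boot, for example the output of `journalctl -k -b -1` saved to a file and read line by line.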

Report [#2] is different and, instead of "Oops", says:

kernel: general protection fault, probably for non-canonical address ????: 0000 [#2] SMP PTI
kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ....

and seems related to #1, because:

It happens at the same moment as #1.
It is for "firefox" as well.
It is the only one that ends with:

kernel: Fixing recursive fault but reboot is needed!
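On the "non-canonical address" wording in #2: on x86-64 with 4-level paging, a virtual address is canonical only if bits 63 down to 47 are all equal. The sketch below checks that rule. It assumes 48-bit virtual addresses (5-level paging would use 57), and since the faulting address in the log above is elided, the kernel-half and non-canonical values are made-up examples, not values from this crash.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all 0 or all 1."""
    top = addr >> (va_bits - 1)               # bit 47 and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


examples = {
    "user-space ip from the dmesg line in this post": 0x00007F78AAE0E9D3,
    "hypothetical kernel-half address": 0xFFFF888000000000,
    "hypothetical corrupted pointer": 0xDEAD000000000100,
}

for label, addr in examples.items():
    print(f"{addr:#018x}  canonical={is_canonical(addr)}  ({label})")
```

A pointer failing this check cannot be translated at all, which would explain why the CPU raises a general protection fault for it instead of a page fault.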

Research in the dark

I wanted to know what PTI is, and found "15. Page Table Isolation (PTI)".
The [Overview] and this part of [Debugging] are interesting:

15.6. Debugging
Bugs in PTI cause a few different signatures of crashes that are worth noting here.
...
Double faults: overflowing the kernel stack because of page faults upon page faults. Caused by touching non-pti-mapped data in the entry code, or forgetting to switch to kernel CR3 before calling into C functions which are not pti-mapped.

Then I tried to search for "unable to handle page fault" and "SMP PTI" in kernel bugs and patches and found things that I couldn't quite follow!

Like these and these.
And this in the latest patch of my kernel from kernel.org:

    /*
-    * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
-    * unconditional CR3 write, even in the PTI case.  So do an lfence
-    * to prevent GS speculation, regardless of whether PTI is enabled.
-    */

Empirical guesses

History of crashes:
On Arch Linux, at first I had none.
At some point, I had a crash every once in a while.
Until yesterday, I had gone months without one.
Yesterday I updated, and 17 hours later, I had one.
And now dmesg is showing this today:

[Thu Jul  7 14:39:26 2022] pool-pcmanfm[20765]: segfault at 9c ip 00007f78aae0e9d3 sp 00007f78a88e6b38 error 6 in libfm.so.4.1.3[7f78aae06000+25000]

So I downgraded the kernel to a version before the second to last update.
Is this a good thing to do?
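For reading the pcmanfm segfault line above, here is a rough Python sketch. It assumes the usual x86 page-fault error-code bits (bit 0: protection violation vs. not-present page, bit 1: write, bit 2: user mode, bit 4: instruction fetch) and that the kernel prints the error value in hexadecimal. `parse_segfault` and `decode_error` are made-up helper names.

```python
import re

SEGFAULT = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_error(code):
    """Split an x86 page-fault error code into its main flags."""
    if code & 0x10:
        access = "instruction fetch"
    elif code & 0x2:
        access = "write"
    else:
        access = "read"
    return {
        "cause": "protection violation" if code & 0x1 else "not-present page",
        "access": access,
        "mode": "user" if code & 0x4 else "kernel",
    }


def parse_segfault(line):
    """Extract the fault address, object name, mapping offset and decoded flags."""
    m = SEGFAULT.search(line)
    if m is None:
        return None
    ip, base = int(m["ip"], 16), int(m["base"], 16)
    return {
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        "offset": ip - base,  # offset of ip inside the mapping in brackets
        "fault": decode_error(int(m["err"], 16)),
    }


line = (
    "[Thu Jul  7 14:39:26 2022] pool-pcmanfm[20765]: segfault at 9c "
    "ip 00007f78aae0e9d3 sp 00007f78a88e6b38 error 6 "
    "in libfm.so.4.1.3[7f78aae06000+25000]"
)
info = parse_segfault(line)
print(info["object"], hex(info["offset"]), hex(info["fault_addr"]), info["fault"])
```

Read this way, "error 6" is a user-mode write to a not-present page at address 0x9c, 0x89d3 bytes into the libfm mapping. An address that close to zero is consistent with a write through a NULL pointer, though that is only a guess from the number. With the same bit layout, `error_code(0x0000)` in the kernel report is a kernel-mode read of a not-present page, which agrees with the "#PF:" lines quoted earlier.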

Further guessing

I guess a memtest86+ is in order before going deep into software debugging. Will do.

And, is there any chance that it's because of this udev rule? You see, I applied it the day before the crash.

Thanks in advance.

Last edited by truth_believer (2022-07-07 15:49:18)

truth_believer · 2022-07-07 17:31:32

truth_believer wrote:

I guess a memtest86+ is in order before going deep into software debugging. Will do.

memtest86+ did 1 pass with 0 errors.

V1del · 2022-07-07 18:10:19

general consensus is that one pass is not sufficient, let it run over night.

truth_believer · 2022-07-08 08:52:15

V1del wrote:

general consensus is that one pass is not sufficient, let it run over night.

Thanks. Did it; that made about another 2.8 passes with no errors (in 4-5 hours).

Last edited by truth_believer (2022-07-08 08:54:08)

Arch Linux
