
#1 2022-07-07 15:32:46

truth_believer
Member
Registered: 2022-06-15
Posts: 17

Crashed - kernel: BUG: unable to handle page fault

Hi again! :D

Experience

Without realizing anything was wrong, I ran:

pacman -Ss wine
Killed

Puzzled, I repeated it a couple of times.
I wondered: does it mean the Wine project got killed? :D Just kidding.

top -Hd .2
Killed
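For reference, "Killed" is how the shell reports a command that died from SIGKILL. As far as I know, when a task oopses inside the kernel, the kernel terminates it with SIGKILL, which would fit pacman showing up in the oopses further down; that link is my assumption, not something the log states. The sketch below (assuming a Linux/POSIX system with a `sleep` binary; `describe_exit` is a hypothetical helper) just reproduces the "Killed" case on a harmless child process:

```python
import signal
import subprocess


def describe_exit(returncode):
    """Roughly mimic how a POSIX shell reports a finished command.

    subprocess reports a death by signal as -signum; shells report it as
    exit status 128 + signum, and bash prints "Killed" for SIGKILL.
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        label = "Killed" if sig == signal.SIGKILL else sig.name
        return f"{label} (signal {sig.name}, shell exit status {128 + sig.value})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Start a harmless child and kill it with SIGKILL to reproduce the
    # shell's "Killed" case.
    child = subprocess.Popen(["sleep", "30"])
    child.send_signal(signal.SIGKILL)
    child.wait()
    print(describe_exit(child.returncode))
```

Run on its own, this prints `Killed (signal SIGKILL, shell exit status 137)`, the same 137 that `echo $?` would show in bash right after a "Killed" line.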

I went to search for that, but some windows wouldn't draw.
In tty2, journalctl showed kernel messages I couldn't make sense of, ending in:

kernel: CR2:  ???????????????? CR3:  ???????????????? CR4:  ????????????????

After a minute, tty7 went black, and a bit later tty2 wouldn't come up either.
This old ThinkPad has no disk LED, so I couldn't tell whether anything was still happening.
After 10 minutes, I forced a power-off.

Evidence

Kernel version: linux-lts 5.15.52-1
journalctl: http://0x0.st/oQtt.txt
The core dump of gnome-system-monitor (should I upload it?).

What I could understand

gnome-system-monitor had an ANOM_ABEND, which is:

Triggered when a process ends abnormally (with a signal that could cause a core dump, if enabled).

And it did cause a core dump.
But then the journal shows nothing for 8 minutes.
Then:

kernel: BUG: unable to handle page fault for address: ????
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD ...
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ...

and the same report appears 4 more times for pacman (I guess those are the few times I ran pacman before noticing the freeze); they make up #3 to #6 [SMP PTI].

Each one is followed by debugging info, then nothing for several seconds before the next one.

The #2 [SMP PTI] is different; instead of "Oops" it says:

kernel: general protection fault, probably for non-canonical address ????: 0000 [#2] SMP PTI
kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ....

and seems related to #1, because:

  • It happens at the same moment as #1.

  • It is also for "firefox".

  • It is the only one that ends with:

kernel: Fixing recursive fault but reboot is needed!
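To check which task each report belongs to without scrolling through the whole journal, a small Python sketch can pull out each numbered Oops/GPF header and the CPU/PID/Comm line that follows it. It assumes the journal was exported as plain text (like the 0x0.st paste, or for example `journalctl -k -b -1 > journal.txt` for the previous boot; the file name is hypothetical) and that each report looks like the excerpts quoted above. `summarize_oopses` and `OopsReport` are hypothetical names, and the sample only reuses lines from this post, placeholders included.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Report headers such as "Oops: 0000 [#1] SMP PTI" or
# "general protection fault, probably for non-canonical address ????: 0000 [#2] SMP PTI".
HEADER_RE = re.compile(r"kernel: (?P<kind>Oops|general protection fault)\b.*?\[#(?P<num>\d+)\]")
# Task line such as "CPU: 1 PID: 15645 Comm: firefox Tainted: ...".
TASK_RE = re.compile(
    r"kernel: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>.*?) (?:Not tainted|Tainted:)"
)
RECURSIVE = "Fixing recursive fault but reboot is needed!"


@dataclass
class OopsReport:
    number: int
    kind: str
    cpu: Optional[int] = None
    pid: Optional[int] = None
    comm: Optional[str] = None
    recursive_fault: bool = False


def summarize_oopses(lines):
    """Group journal lines into one OopsReport per numbered Oops/GPF header."""
    reports = []
    current = None
    for line in lines:
        header = HEADER_RE.search(line)
        if header:
            current = OopsReport(number=int(header["num"]), kind=header["kind"])
            reports.append(current)
            continue
        if current is None:
            continue  # lines before the first numbered header
        task = TASK_RE.search(line)
        if task and current.comm is None:
            current.cpu = int(task["cpu"])
            current.pid = int(task["pid"])
            current.comm = task["comm"]
        elif RECURSIVE in line:
            current.recursive_fault = True
    return reports


# Excerpts copied from this post (addresses were already redacted as ????).
SAMPLE = """\
kernel: BUG: unable to handle page fault for address: ????
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD ...
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ...
kernel: general protection fault, probably for non-canonical address ????: 0000 [#2] SMP PTI
kernel: CPU: 1 PID: 15645 Comm: firefox Tainted: ....
kernel: Fixing recursive fault but reboot is needed!
"""

if __name__ == "__main__":
    for r in summarize_oopses(SAMPLE.splitlines()):
        flag = " [recursive fault]" if r.recursive_fault else ""
        print(f"#{r.number} {r.kind}: {r.comm} (PID {r.pid}, CPU {r.cpu}){flag}")
```

On the sample this lists #1 as an Oops and #2 as a general protection fault, both for firefox PID 15645 on CPU 1, with only #2 flagged as the recursive fault. Since the function only iterates over lines, an open file of the full journal should work in place of `SAMPLE.splitlines()`.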
Research in the dark

I looked up what PTI is and found the kernel documentation page "15. Page Table Isolation (PTI)".
The [Overview] and this part of [Debugging] are interesting:

15.6. Debugging
Bugs in PTI cause a few different signatures of crashes that are worth noting here.
...
Double faults: overflowing the kernel stack because of page faults upon page faults. Caused by touching non-pti-mapped data in the entry code, or forgetting to switch to kernel CR3 before calling into C functions which are not pti-mapped.
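As far as I can tell from the x86 oops code, the trailing "SMP PTI" in each header only lists features of the running kernel (built for SMP, PTI active at boot), so it appears on every oops from such a system and does not by itself say PTI caused the fault. To confirm that PTI is actually active on this machine, the kernel exposes a Meltdown status file in sysfs. A minimal sketch, assuming the usual `/sys/devices/system/cpu/vulnerabilities/meltdown` path; `pti_status` is a hypothetical helper and the file's exact wording may vary between kernels:

```python
from pathlib import Path

# Usual sysfs location of the kernel's Meltdown mitigation status.
MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def pti_status(text):
    """Classify the contents of the meltdown status file (wording may vary)."""
    text = text.strip()
    if "PTI" in text:  # typically "Mitigation: PTI"
        return "PTI enabled"
    if text == "Not affected":
        return "CPU not affected, PTI not needed"
    return f"other: {text}"


if __name__ == "__main__":
    try:
        print(pti_status(MELTDOWN_STATUS.read_text()))
    except OSError as exc:
        print(f"could not read {MELTDOWN_STATUS}: {exc}")
```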

Then I searched kernel bug reports and patches for "unable to handle page fault" and "SMP PTI", and found things I couldn't quite follow! :D

Like these and these.
And this, in the latest patch for my kernel from kernel.org:

   /*
-    * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
-    * unconditional CR3 write, even in the PTI case.  So do an lfence
-    * to prevent GS speculation, regardless of whether PTI is enabled.
-    */
Empirical guesses

History of crashes:
        On Arch Linux, at first I had none.
        At some point, I had a crash every once in a while.
        Before yesterday, I hadn't had one for months.
        Yesterday I updated, and 17 hours later I had one.
And now:
        dmesg is showing this today:

[Thu Jul  7 14:39:26 2022] pool-pcmanfm[20765]: segfault at 9c ip 00007f78aae0e9d3 sp 00007f78a88e6b38 error 6 in libfm.so.4.1.3[7f78aae06000+25000]
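For the pcmanfm line, a sketch that pulls the fields apart, assuming the segfault message format shown above. The number after `error` is the x86 page-fault error code (bit 0: protection violation vs. not-present page, bit 1: write vs. read, bit 2: user vs. kernel mode, bit 3: reserved bit set, bit 4: instruction fetch), the same kind of code the kernel spelled out as `error_code(0x0000)` in the oops. `parse_segfault` and `decode_pf_error` are hypothetical helpers. The computed offset is relative to the start of the libfm mapping printed in brackets; whether it can be handed straight to a tool like addr2line depends on that mapping's file offset, which this sketch does not check. A fault address as small as 0x9c often means a struct field was accessed through a NULL pointer, but that is only a common pattern, not something the line proves.

```python
import re

# Segfault message as printed above, e.g.
# "pool-pcmanfm[20765]: segfault at 9c ip ... sp ... error 6 in libfm.so.4.1.3[base+size]".
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp [0-9a-f]+ error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_pf_error(code):
    """Spell out the low bits of an x86 page-fault error code."""
    parts = [
        "write" if code & 0x2 else "read",
        "user mode" if code & 0x4 else "kernel mode",
        "protection violation" if code & 0x1 else "page not present",
    ]
    if code & 0x8:
        parts.append("reserved bit set in a page-table entry")
    if code & 0x10:
        parts.append("instruction fetch")
    return parts


def parse_segfault(line):
    """Return the interesting fields of a kernel segfault message, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip, base = int(m["ip"], 16), int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        "ip_offset": ip - base,  # relative to the start of the printed mapping
        "in_mapping": 0 <= ip - base < int(m["size"], 16),
        "error": decode_pf_error(int(m["err"], 16)),
    }


LINE = ("[Thu Jul  7 14:39:26 2022] pool-pcmanfm[20765]: segfault at 9c "
        "ip 00007f78aae0e9d3 sp 00007f78a88e6b38 error 6 "
        "in libfm.so.4.1.3[7f78aae06000+25000]")

if __name__ == "__main__":
    info = parse_segfault(LINE)
    print(f"{info['comm']} (PID {info['pid']}) faulted at {info['fault_addr']:#x}")
    print(f"ip is {info['ip_offset']:#x} into the {info['object']} mapping")
    print("access:", ", ".join(info["error"]))
```

Decoded this way, error 6 is a user-mode write to a page that was not mapped, so the faulting instruction was in pcmanfm's own code (libfm, 0x89d3 into its mapping), whereas `error_code(0x0000)` in the oops is a kernel-mode read of a not-present page.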

So I downgraded the kernel to the version from before the second-to-last update.
Is this a good thing to do?

Further guessing

I guess a memtest86+ run is in order before digging into software debugging. Will do.

Also, is there any chance this is caused by this udev rule? I applied it the day before the crash.

Thanks in advance.

Last edited by truth_believer (2022-07-07 15:49:18)


#2 2022-07-07 17:31:32

truth_believer
Member
Registered: 2022-06-15
Posts: 17

Re: Crashed - kernel: BUG: unable to handle page fault

truth_believer wrote:

I guess a memtest86+ run is in order before digging into software debugging. Will do.

memtest86+ did 1 pass with 0 errors.


#3 2022-07-07 18:10:19

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 25,104

Re: Crashed - kernel: BUG: unable to handle page fault

The general consensus is that one pass is not sufficient; let it run overnight.


#4 2022-07-08 08:52:15

truth_believer
Member
Registered: 2022-06-15
Posts: 17

Re: Crashed - kernel: BUG: unable to handle page fault

V1del wrote:

The general consensus is that one pass is not sufficient; let it run overnight.

Thanks. I did; it completed about 2.8 more passes with no errors (in 4-5 hours).

Last edited by truth_believer (2022-07-08 08:54:08)



Powered by FluxBB