#1 2023-09-07 18:07:09

cyayon
Member
Registered: 2016-09-05
Posts: 35

[SOLVED] Random kernel panic since > 6.3.7

Hi,

I have been getting random kernel panics on my Arch Linux server since upgrading past kernel 6.3.7.
It's a headless server: no X11, no graphics card, and an Intel processor (not AMD).

On 6.3.7 everything was fine. Since upgrading to 6.3.9, 6.4.3, 6.4.4, and now 6.4.12, I get random crashes (sometimes after 24h, sometimes 48h, sometimes a little longer).

There is nothing in journald/syslog. I only managed to capture these three lines (by mounting /var/log/journal outside my NVMe boot disk):

BUG: unable to handle page fault for address: 00000000352aa941
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
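
For reference, the error code in the last line can be decoded by hand. This is a minimal sketch, assuming the standard x86 page-fault bit layout (bit 0: page present, bit 1: write access, bit 2: user mode). The function name decode_pf_error is made up for illustration, and only the three lowest bits are handled:

```sh
#!/bin/sh
# Decode the three lowest bits of an x86 page-fault error code,
# as printed by the kernel in "#PF: error_code(0x....)".
# decode_pf_error is a hypothetical helper, not a kernel or distro tool.
decode_pf_error() {
    code=$(( $1 ))    # accepts hex input such as 0x0002

    # Bit 0: 0 = page was not present, 1 = permissions violation
    if [ $(( code & 1 )) -ne 0 ]; then cause="permissions violation"; else cause="not-present page"; fi
    # Bit 1: 0 = read access, 1 = write access
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    # Bit 2: 0 = supervisor (kernel) mode, 1 = user mode
    if [ $(( code & 4 )) -ne 0 ]; then mode="user"; else mode="supervisor"; fi

    printf '%s %s access, %s\n' "$mode" "$access" "$cause"
}

decode_pf_error 0x0002
```

For 0x0002 this prints "supervisor write access, not-present page", which agrees with the two #PF lines above: kernel code tried to write to an address that had no mapping.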

Some information :
uname -a : Linux xxxxorg 6.4.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 24 Aug 2023 00:38:14 +0000 x86_64 GNU/Linux
dmesg | curl -F 'f:1=<-' ix.io : http://ix.io/4FE8
journalctl -b -1 | tail -1000 | curl -F 'f:1=<-' ix.io : http://ix.io/4FE9 (only last 1000 lines, too big)
lspci -vvv | curl -F 'f:1=<-' ix.io : http://ix.io/4FEa
lsmod | curl -F 'f:1=<-' ix.io : http://ix.io/4FEf
lscpu | curl -F 'f:1=<-' ix.io : http://ix.io/4FEh

Of course, if I roll back to 6.3.7, the crashes stop.

Could you please help me debug this?

Thanks.

Last edited by cyayon (2023-10-07 23:16:38)

#2 2023-09-07 18:37:53

loqs
Member
Registered: 2014-03-06
Posts: 17,900

Re: [SOLVED] Random kernel panic since > 6.3.7

What about 6.3.8?  Available from the ALA.
As the system is headless does it have a serial port,  baseboard management controller or some other way to extract more info on the panic when it happens?
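
A small sketch of how the package URL on the Arch Linux Archive can be built, assuming the usual archive layout (packages/<first letter>/<name>/<file>). The helper name ala_pkg_url is made up, and the version string 6.3.8.arch1-1 is an assumption based on the naming of the other Arch kernels in this thread; check the archive listing for the exact file name:

```sh
#!/bin/sh
# Build a download URL for a package on the Arch Linux Archive (ALA).
# ala_pkg_url is a hypothetical helper; the version passed below is an
# assumption and should be verified against the archive listing.
ala_pkg_url() {
    pkg=$1
    ver=$2
    arch=${3:-x86_64}
    first=$(printf '%s' "$pkg" | cut -c1)    # archive directories are keyed by first letter
    printf 'https://archive.archlinux.org/packages/%s/%s/%s-%s-%s.pkg.tar.zst\n' \
        "$first" "$pkg" "$pkg" "$ver" "$arch"
}

ala_pkg_url linux 6.3.8.arch1-1
```

The resulting URL can be passed to pacman -U as root. If linux-headers is installed, it should be downgraded to the same version in the same command.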

#3 2023-09-07 18:42:06

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

loqs wrote:

What about 6.3.8?  Available from the ALA.
As the system is headless does it have a serial port,  baseboard management controller or some other way to extract more info on the panic when it happens?

I haven't tried 6.3.8 (I went 6.3.7 -> 6.3.9 -> 6.4.3 ...).

No, I do not have a management board. Sorry, it is not really headless: I have a keyboard and a screen with console access. There is no X11, only the console. When the crash occurs, the screen is just black.
Sometimes the server reboots itself (thanks to sysctl kernel.panic = 60), and sometimes it just freezes with nothing logged on the console.

Last edited by cyayon (2023-09-07 18:44:14)

#4 2023-09-07 19:10:32

loqs
Member
Registered: 2014-03-06
Posts: 17,900

Re: [SOLVED] Random kernel panic since > 6.3.7

cyayon wrote:
loqs wrote:

What about 6.3.8?  Available from the ALA.

I haven't tried 6.3.8 (I went 6.3.7 -> 6.3.9 -> 6.4.3 ...).

Please do so.

#5 2023-09-07 19:14:14

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

Unless loqs already has a commit in mind and isn't just narrowing his inevitable bisecting efforts ;)

Sep 07 17:29:01 daneel.nbux.org root[2099643]: ExecMGR[2099527] ExecMGR@daneel v1.91 [ id:2099527 user:root Kill: Cmd:'/jobs/check_wan_onu.sh' Prom:  ] - 17:29:01 07/09/2023
Sep 07 17:29:02 daneel.nbux.org root[2099707]: ExecMGR[2099527] ExecMGR id:2099527 cmd:'/jobs/check_wan_onu.sh' elaps:1 finished successfully
Sep 07 17:29:02 daneel.nbux.org CROND[2099526]: (root) CMDEND (/jobs/ExecMGR -X "/jobs/check_wan_onu.sh")
Sep 07 17:29:02 daneel.nbux.org CROND[2099526]: pam_unix(crond:session): session closed for user root
Sep 07 17:29:48 daneel.nbux.org root[2102328]: UpsMGR[2102012] UpsMGR@daneel v1.20 [ UpsName:eaton1 Prom:1 ] - 17:29:48 07/09/2023
Sep 07 17:29:48 daneel.nbux.org root[2102377]: HddMGR[2102031] HddMGR@daneel v2.3 [ HddName:any Prom:1 ] - 17:29:48 07/09/2023
Sep 07 17:29:48 daneel.nbux.org kernel: BUG: unable to handle page fault for address: 00000000352aa941
Sep 07 17:29:48 daneel.nbux.org kernel: #PF: supervisor write access in kernel mode
Sep 07 17:29:48 daneel.nbux.org kernel: #PF: error_code(0x0002) - not-present page

1st 4 lines only for context.

Do you have more journals of affected boots?
Are all incidents immediately preceded by UpsMGR/HddMGR?
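
One way to check this across several boots is to print the journal line that comes right before each page-fault message. This is a rough sketch: the helper name lines_before_fault is made up, and it only looks at the single preceding line:

```sh
#!/bin/sh
# Read journal text on stdin and print the line that immediately
# precedes every "BUG: unable to handle page fault" message.
# lines_before_fault is a hypothetical helper for illustration.
lines_before_fault() {
    awk '
        /BUG: unable to handle page fault/ { if (prev != "") print prev }
        { prev = $0 }
    '
}

# Example usage against the previous boot (not run here):
#   journalctl -b -1 | lines_before_fault
```

If the printed lines always name the same job, that job is a good first suspect. If they differ from crash to crash, the trigger is probably elsewhere.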

If you add "consoleblank=0" to the kernel commandline and keep the monitor on (except if it's a VGA one, then it probably doesn't matter) do you get a visible output after the freeze?
(You'd expect an oops that would help narrow down the likely offending module, assuming it's not the core kernel.)
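
After rebooting, it is worth confirming that the parameter really reached the kernel. This is a minimal sketch that looks for an exact token in /proc/cmdline; the helper name cmdline_has is made up, and the optional second argument exists only so the check can be tried against a sample file:

```sh
#!/bin/sh
# Return success if the kernel command line contains the exact
# parameter given as $1 (for example consoleblank=0).
# cmdline_has is a hypothetical helper; $2 optionally overrides the
# file to read, defaulting to /proc/cmdline.
cmdline_has() {
    file=${2:-/proc/cmdline}
    # One parameter per line, then match the whole line literally.
    tr ' ' '\n' < "$file" | grep -qxF -- "$1"
}

# Example usage (not run here):
#   cmdline_has consoleblank=0 && echo "console blanking disabled on the command line"
```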

#6 2023-09-07 21:53:02

loqs
Member
Registered: 2014-03-06
Posts: 17,900

Re: [SOLVED] Random kernel panic since > 6.3.7

I see you have been asked to bisect the kernel.  Please ask if you need any help with that after testing 6.3.8.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217884

#7 2023-09-08 05:57:25

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

Thanks for your answer.
I have disabled a script which runs a lot of smartctl commands (to get SMART info from all my 20 disks/SSDs/NVMe drives). This script runs via cron (28 * * * * *).
The HddMGR script also runs smartctl on all my disks to get their temperature. UpsMGR runs the upsc command from NUT to get info from my USB UPS.
I will test for a few days; if it crashes again I will check the scripts above and disable them too.
Of course, I will test the 6.3.8 kernel. I have just downloaded the packages from the archive.
Thanks!

#8 2023-09-08 06:51:26

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

Were all crashes in direct proximity to that script being run?
Make sure to disable consoleblank. The page fault happens in kernel context, so even if some userspace process triggers it, that would only (maybe) help to ballpark the involved kernel code.
The cause would still be either a kernel bug or faulty HW.

#9 2023-09-08 08:09:53

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

Hi,
I just checked the previous crashes. They do not seem to be related to script execution.

To disable consoleblank, do I just have to add consoleblank=0 to the kernel command line? Do I have to use setterm too?

Which faulty hardware could it be?

Thanks.

#10 2023-09-08 08:16:30

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

just have to add consoleblank=0  to the kernel command line

This.

Which faulty hardware could it be ?

Any. None (given the kernel version relevance).
That's why the kernel OOPs would be helpful - if it had always been related to your HDD scripts, that could have indicated ata* or nvme, but w/o any pattern anything might be the trigger.

Edit:
https://wiki.archlinux.org/title/Displa … _Signaling
Seems console blanking has actually been disabled by default for a couple of years.
Might be the i915 module…
Does the system still respond to pings? Can you ssh into it?

Last edited by seth (2023-09-08 08:19:42)

#11 2023-09-08 08:58:14

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

When the system crashes, it no longer responds to ping or any other network communication.

#12 2023-09-08 12:26:35

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

If you can't get any console output (leave the monitor attached and powered, disable DPMS w/ setterm and force the consoleblank parameter), https://wiki.archlinux.org/title/Kdump might get us a usable dump. (Since this is not deterministic and can take 48+h to show up, bisecting can become a LOOOOOOOOONG process.)
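
To put a rough number on that: a bisection needs about log2(N) test builds for N candidate commits, and each build has to run long enough for the crash to show up. This is a back-of-the-envelope sketch; the commit count of 1000 is purely hypothetical, the 48 hours come from the crash interval reported above, and bisect_estimate is a made-up helper:

```sh
#!/bin/sh
# Estimate how long a bisection takes when every step needs a soak test.
#   $1: number of candidate commits (hypothetical figure)
#   $2: hours each build must run before it is judged
# bisect_estimate is an illustrative helper, not part of git.
bisect_estimate() {
    commits=$1
    soak_hours=$2
    steps=0
    n=$commits
    # Halve the candidate range until one commit is left: ceil(log2(N)) steps.
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    printf '%d steps, about %d hours\n' "$steps" $(( steps * soak_hours ))
}

bisect_estimate 1000 48
```

With these assumed inputs the result is "10 steps, about 480 hours", i.e. roughly three weeks. In practice it is worse, because a build that survives 48 hours is not proven good when the crash is random.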

#13 2023-09-08 12:51:28

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

I have explicitly added consoleblank=0 to my kernel command line.
I am not sure about disabling DPMS with setterm.

Should I just execute the following?
for i in {0..256}; do setterm --blank 0 --powerdown 0 --powersave off >> /dev/tty$i; done

Thanks.

Last edited by cyayon (2023-09-08 12:54:48)

#14 2023-09-08 12:59:16

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

https://wiki.archlinux.org/title/Displa … th_setterm
Disable all, not sure whether doing that for a lot of TTYs that aren't active is required anyway (and you don't have 257 terminals)

for ttydev in /dev/tty{0..9}*; do setterm --powerdown 0 >> $ttydev; setterm --blank 0 >> $ttydev; setterm --powersave off >> $ttydev; done

#15 2023-09-08 13:12:50

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

Thanks.
Done.

Perhaps I should disable sysctl kernel.panic=60 to avoid the automatic reboot and keep the last message on the local display?

#16 2023-09-08 13:14:10

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

If you can accept the dead system for a random duration, yes.
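
For clarity, this is how the kernel.panic value is interpreted, as I understand the kernel sysctl documentation: zero means the machine stays halted, a positive value is a reboot delay in seconds, and a negative value means an immediate reboot. The helper name describe_panic_timeout is made up for illustration:

```sh
#!/bin/sh
# Describe what a given kernel.panic value means.
# describe_panic_timeout is a hypothetical helper; it only interprets
# the number and does not change any system setting.
describe_panic_timeout() {
    t=$1
    if [ "$t" -eq 0 ]; then
        echo "stay halted after a panic (no automatic reboot)"
    elif [ "$t" -gt 0 ]; then
        echo "reboot $t seconds after a panic"
    else
        echo "reboot immediately after a panic"
    fi
}

describe_panic_timeout 60
describe_panic_timeout 0

# Example usage against the running system (not run here):
#   describe_panic_timeout "$(cat /proc/sys/kernel/panic)"
```

To keep the last message on screen, the value can be set to 0 at runtime with sysctl -w kernel.panic=0 as root. Any kernel.panic line in a sysctl configuration file also needs to be changed, otherwise the old value comes back on the next boot.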

#17 2023-09-08 13:17:01

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

thanks. done.
waiting for the next crash

#18 2023-09-10 06:46:49

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

Hi,
It crashed again last night (kernel 6.4.12). I managed to get some logs from syslog-ng (nothing from journald).

Here are the last logs just before the crash.

http://ix.io/4FTW

It seems to be related to nft.

thanks.

Last edited by cyayon (2023-09-10 06:47:31)

#19 2023-09-10 07:27:11

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

Yup.
At least superficially, https://lore.kernel.org/netdev/20230810 … ter.org/T/ looks relevant, but those are in 6.4.12 :(
https://github.com/archlinux/linux/comm … /netfilter

https://github.com/archlinux/linux/comm … /netfilter - the upper two blocks are for 6.3.8 (already results on this) and 6.3.9
https://lore.kernel.org/lkml/87ttuuajbo.fsf@43-1.org/T/  suggests that https://github.com/archlinux/linux/comm … bd315a313b might be it … hmm

Edit:blank position...

Last edited by seth (2023-09-10 07:51:26)

#20 2023-09-10 07:37:17

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

Hi Seth,

Thanks for your answer !

If I understand correctly, this seems to explain why everything was fine in 6.3.7.
Note that I use some nft sets with a 48h timeout for elements, and I also have "gc-interval 1h" in these sets. Perhaps that's why it crashes after about 48h?

What is your advice in this situation? Roll back to 6.3.7 and wait for a fix, or try the very latest 6.5.2? Or even try mainline 6.5?

Thanks.

Last edited by cyayon (2023-09-10 07:37:43)

#21 2023-09-10 07:50:39

seth
Member
Registered: 2012-09-03
Posts: 57,030

Re: [SOLVED] Random kernel panic since > 6.3.7

If this isn't somehow fixed in 6.5, first see whether 6.3.8 is stable, then whether 6.3.9 w/ the mentioned commit reverted is.
Then file a bug upstream, since this seems not fully resolved.

Can you reduce the timeouts to perhaps trigger this more aggressively?
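
Something along these lines might do as a test set. This is only a sketch: the table name gc_repro, the set name short_lived, and the 2m/10s values are made up, and the real sets in the affected ruleset (48h timeout, gc-interval 1h) will look different. The function only prints the ruleset; nothing is loaded:

```sh
#!/bin/sh
# Print an nftables ruleset with one set whose elements expire quickly,
# so that timeout handling and garbage collection run far more often
# than with a 48h timeout and a 1h gc-interval.
# Table and set names are hypothetical and chosen for illustration.
emit_repro_ruleset() {
    timeout=${1:-2m}
    gc=${2:-10s}
    cat <<EOF
table inet gc_repro {
    set short_lived {
        type ipv4_addr
        flags timeout
        timeout $timeout
        gc-interval $gc
    }
}
EOF
}

emit_repro_ruleset 2m 10s
```

The output can be written to a file, checked with nft -c -f, and then loaded with nft -f as root. The set still has to be fed with elements, either by a rule or by adding them in a loop, for anything to expire.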

#22 2023-09-10 22:37:05

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

Hi,

I opened a ticket with netfilter.

Pablo N. told me to test 6.4.15 instead of 6.5.x and 6.3.x (revert of bdace3b1a51887211d3e49417a18fdbd315a313b required).

Do you have a PKGBUILD for version 6.4.15?

Thanks.

#23 2023-09-10 23:09:03

loqs
Member
Registered: 2014-03-06
Posts: 17,900

Re: [SOLVED] Random kernel panic since > 6.3.7

Could you please link to your report? 6.4.15 does not contain bdace3b1a51887211d3e49417a18fdbd315a313b, as that is the 6.3 backport of 1240eb93f0616b21c675416516ff3d74798fdc97, although reverting either should be equivalent.
Neither can be reverted cleanly from 6.4.15.

$ git checkout v6.4.15
HEAD is now at f60d5fd5e950 Linux 6.4.15
$ git status
HEAD detached at v6.4.15
$ git revert -n bdace3b1a51887211d3e49417a18fdbd315a313b
Auto-merging net/netfilter/nf_tables_api.c
CONFLICT (content): Merge conflict in net/netfilter/nf_tables_api.c
error: could not revert bdace3b1a518... netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
$ git revert --abort
$ git revert -n 1240eb93f0616b21c675416516ff3d74798fdc97
Auto-merging net/netfilter/nf_tables_api.c
CONFLICT (content): Merge conflict in net/netfilter/nf_tables_api.c
error: could not revert 1240eb93f061... netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'

You can locally generate such a PKGBUILD (requires devtools and base-devel)

$ pkgctl repo clone --protocol=https linux
$ cd linux/
$ git checkout 6.4.12.arch1-1 # The last Arch 6.4 release
$ curl -o PKGBUILD.diff  http://ix.io/4FYM  # download a diff of the changes I made
$ git apply -v PKGBUILD.diff
$ makepkg -rsi

Edit:
Manually merged the failing revert, added that to the PKGBUILD and updated the diff. The old diff is at http://ix.io/4FYH
Locally built package with pkgrel changed to 1.1:
https://drive.google.com/file/d/1oJTRYe … sp=sharing linux-6.4.15-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qUomh3 … sp=sharing linux-headers-6.4.15-1.1-x86_64.pkg.tar.zst

Last edited by loqs (2023-09-11 00:06:04)

#24 2023-09-11 07:50:41

cyayon
Member
Registered: 2016-09-05
Posts: 35

Re: [SOLVED] Random kernel panic since > 6.3.7

Hi,

Can I directly download and install the shared packages?

The netfilter bug report was sent by email directly to bugzilla-account@netfilter.org. Registration on the netfilter bugzilla is closed to avoid spam.
Pablo N. just answered by email and asked me to test 6.4.15, because 6.5.2 is a little behind 6.4.15 for this issue.

Last edited by cyayon (2023-09-11 08:35:45)

#25 2023-09-11 10:27:38

loqs
Member
Registered: 2014-03-06
Posts: 17,900

Re: [SOLVED] Random kernel panic since > 6.3.7

cyayon wrote:

Hi,

I can directly download and install shared packages ?

Yes.
