Hi,
I have been getting random kernel panics on my Arch Linux server since upgrading past kernel 6.3.7.
It's a headless server: no X11, no graphics card, and an Intel processor (not AMD).
On 6.3.7 everything was fine. Since upgrading to 6.3.9, then 6.4.3, 6.4.4, and now 6.4.12, I get random crashes (sometimes after 24h, sometimes 48h, sometimes a little more).
Nothing is logged in journald/syslog. I only managed to capture these three lines (by mounting /var/log/journal outside my NVMe boot disk):
BUG: unable to handle page fault for address: 00000000352aa941
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
Some information :
uname -a : Linux xxxxorg 6.4.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 24 Aug 2023 00:38:14 +0000 x86_64 GNU/Linux
dmesg | curl -F 'f:1=<-' ix.io : http://ix.io/4FE8
journalctl -b -1 | tail -1000 | curl -F 'f:1=<-' ix.io : http://ix.io/4FE9 (only last 1000 lines, too big)
lspci -vvv | curl -F 'f:1=<-' ix.io : http://ix.io/4FEa
lsmod | curl -F 'f:1=<-' ix.io : http://ix.io/4FEf
lscpu | curl -F 'f:1=<-' ix.io : http://ix.io/4FEh
Of course, if I roll back to 6.3.7, there are no more crashes.
Could you please help me debug this?
thanks.
Last edited by cyayon (2023-10-07 23:16:38)
Offline
I didn't try 6.3.8 (I went 6.3.7 -> 6.3.9 -> 6.4.3 ...).
No, I do not have a management board. Sorry, it is not really headless: I have a keyboard and a screen with console access. There is no X11, only the console. When the crash occurs, the screen is just black.
Sometimes the server reboots itself (thanks to sysctl kernel.panic = 60); sometimes it just freezes, with no log on the console.
Last edited by cyayon (2023-09-07 18:44:14)
Offline
Unless loqs already has a commit in mind and isn't just narrowing his inevitable bisecting efforts
Sep 07 17:29:01 daneel.nbux.org root[2099643]: ExecMGR[2099527] ExecMGR@daneel v1.91 [ id:2099527 user:root Kill: Cmd:'/jobs/check_wan_onu.sh' Prom: ] - 17:29:01 07/09/2023
Sep 07 17:29:02 daneel.nbux.org root[2099707]: ExecMGR[2099527] ExecMGR id:2099527 cmd:'/jobs/check_wan_onu.sh' elaps:1 finished successfully
Sep 07 17:29:02 daneel.nbux.org CROND[2099526]: (root) CMDEND (/jobs/ExecMGR -X "/jobs/check_wan_onu.sh")
Sep 07 17:29:02 daneel.nbux.org CROND[2099526]: pam_unix(crond:session): session closed for user root
Sep 07 17:29:48 daneel.nbux.org root[2102328]: UpsMGR[2102012] UpsMGR@daneel v1.20 [ UpsName:eaton1 Prom:1 ] - 17:29:48 07/09/2023
Sep 07 17:29:48 daneel.nbux.org root[2102377]: HddMGR[2102031] HddMGR@daneel v2.3 [ HddName:any Prom:1 ] - 17:29:48 07/09/2023
Sep 07 17:29:48 daneel.nbux.org kernel: BUG: unable to handle page fault for address: 00000000352aa941
Sep 07 17:29:48 daneel.nbux.org kernel: #PF: supervisor write access in kernel mode
Sep 07 17:29:48 daneel.nbux.org kernel: #PF: error_code(0x0002) - not-present page
1st 4 lines only for context.
Do you have more journals of affected boots?
Are all incidents immediately preceded by UpsMGR/HddMGR?
If you add "consoleblank=0" to the kernel command line and keep the monitor on (unless it's a VGA one, then it probably doesn't matter), do you get any visible output after the freeze?
(You'd expect an oops that would allow narrowing down the likely offending module, assuming it's not the core.)
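For reference, the error_code in the third line can be read bit by bit. This is only a sketch of the low x86 page-fault bits as I understand them (bit 0 = page was present, bit 1 = write, bit 2 = user mode, bit 4 = instruction fetch); the function name decode_pf_error_code is made up for this post and is not a kernel or distro tool.

```shell
# Decode the low bits of an x86 page-fault error code, as printed in
# "#PF: error_code(0x....)". Hypothetical helper, plain POSIX sh.
decode_pf_error_code() (
    code=$(( $1 ))    # accepts hex input such as 0x0002

    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="not-present page"; fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode="user"; else mode="supervisor"; fi
    # Bit 4 marks an instruction fetch and replaces the read/write label.
    if [ $(( code & 16 )) -ne 0 ]; then access="instruction fetch"; fi

    printf '%s %s access, %s\n' "$mode" "$access" "$cause"
)

decode_pf_error_code 0x0002
```

For 0x0002 this agrees with the two "#PF:" lines in the log: a supervisor write to a page that was not mapped. That says what kind of access failed, not which code did it, which is why the full oops is still needed.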
Online
I see you have been asked to bisect the kernel. Please ask if you need any help with that after testing 6.3.8.
Offline
Thanks for your answer.
I have disabled a script that runs a lot of smartctl commands (to get SMART info from all my 20 disks/SSDs/NVMe drives). This script runs via cron: 28 * * * * *.
The HddMGR script also runs smartctl on all my disks to get their temperature. UpsMGR runs the upsc command from nut to get info from my USB UPS.
I will test for a few days; if it crashes again I will check the other scripts and disable them too.
Of course, I will test the 6.3.8 kernel. I have just downloaded the packages from the archive.
Thanks !
Offline
Were all crashes in direct proximity to that script being run?
Make sure to disable the consoleblank, the page fault happens in the kernel context so even if some userspace process triggers that, that would only (maybe) help to ballpark the involved kernel code.
The cause would still be either a kernel bug or faulty HW.
Online
Hi,
I just checked the previous crashes. They do not seem to be related to the scripts' execution.
To disable console blanking, do I just have to add consoleblank=0 to the kernel command line? Do I have to use setterm too?
Which faulty hardware could it be?
thanks.
Offline
just have to add consoleblank=0 to the kernel command line
This.
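To confirm the parameter actually reached the running kernel after the reboot, something like the following should do. It is a rough sketch: has_kernel_param is a name invented for this post, and the optional second argument only exists so the check can be pointed at a file other than /proc/cmdline.

```shell
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# Hypothetical helper; CMDLINE_FILE defaults to /proc/cmdline.
has_kernel_param() (
    param=$1
    file=${2:-/proc/cmdline}
    set -f    # do not glob-expand words read from the file
    for word in $(cat "$file" 2>/dev/null); do
        [ "$word" = "$param" ] && exit 0
    done
    exit 1
)

if has_kernel_param consoleblank=0; then
    echo "consoleblank=0 is on the running kernel command line"
else
    echo "consoleblank=0 is NOT on the running kernel command line"
fi

# Blanking timeout in seconds as the kernel currently reports it (0 = never blank).
cat /sys/module/kernel/parameters/consoleblank 2>/dev/null || true
```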
Which faulty hardware could it be ?
Any. None (given the kernel version relevance).
That's why the kernel OOPs would be helpful - if it had always been related to your HDD scripts, that could have indicated ata* or nvme, but w/o any pattern anything might be the trigger.
Edit:
https://wiki.archlinux.org/title/Displa … _Signaling
It seems console blanking has actually been disabled by default for a couple of years.
Might be the i915 module…
Does the system still respond to pings? Can you ssh into it?
Last edited by seth (2023-09-08 08:19:42)
Online
When the system crashes, it no longer responds to ping or any other network communication.
Offline
If you can't get any console output (leave the monitor attached and powered, disable DPMS w/ setterm and force the consoleblank parameter), https://wiki.archlinux.org/title/Kdump might get us a usable dump. Since this is not deterministic and can take 48+h to show up, bisecting can become a LOOOOOOOOONG process.
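To put a rough number on "long": git bisect needs about log2(N) test builds for N candidate commits, and here each test means waiting out the crash window. A back-of-the-envelope sketch follows; the commit count of 500 is purely hypothetical (in a kernel checkout, git rev-list --count v6.3.7..v6.3.9 would give the real figure), and bisect_steps is a name made up for this post.

```shell
# Estimate how many builds a bisection needs by halving the candidate
# range until one commit is left. Hypothetical helper, plain POSIX sh.
bisect_steps() (
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))    # each step discards about half the commits
        steps=$(( steps + 1 ))
    done
    echo "$steps"
)

commits=500          # hypothetical; replace with the real rev-list count
hours_per_test=48    # a kernel only counts as "good" after surviving this long

steps=$(bisect_steps "$commits")
echo "about $steps builds, roughly $(( steps * hours_per_test / 24 )) days of waiting"
```

A "good" verdict is never certain with a random crash, so the real duration is likely longer than this estimate.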
Online
I have explicitly added consoleblank=0 to my kernel command line.
I am not sure about disabling DPMS with setterm.
Should I just execute the following?
for i in {0..256}; do setterm --blank 0 --powerdown 0 --powersave off >> /dev/tty$i; done
thanks.
Last edited by cyayon (2023-09-08 12:54:48)
Offline
https://wiki.archlinux.org/title/Displa … th_setterm
Disable all, not sure whether doing that for a lot of TTYs that aren't active is required anyway (and you don't have 257 terminals)
for ttydev in /dev/tty{0..9}*; do setterm --powerdown 0 >> $ttydev; setterm --blank 0 >> $ttydev; setterm --powersave off >> $ttydev; done
Online
thanks.
done.
Should I perhaps disable sysctl kernel.panic=60 to avoid the automatic reboot and keep the last message on the local display?
Offline
If you can accept the dead system for a random duration, yes.
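A quick way to see what the box will currently do on a panic is shown below. As far as I know, kernel.panic is 0 for "stay halted", a positive number for "reboot after that many seconds", and negative for "reboot immediately". The function name and its optional file argument are made up for illustration.

```shell
# describe_panic_timeout [FILE]
# Print what the current kernel.panic value means.
# Hypothetical helper; FILE defaults to /proc/sys/kernel/panic.
describe_panic_timeout() (
    file=${1:-/proc/sys/kernel/panic}
    value=$(cat "$file" 2>/dev/null) || { echo "cannot read $file"; exit 1; }

    if [ "$value" -eq 0 ]; then
        echo "kernel.panic=0: stay halted after a panic"
    elif [ "$value" -gt 0 ]; then
        echo "kernel.panic=$value: reboot $value seconds after a panic"
    else
        echo "kernel.panic=$value: reboot immediately after a panic"
    fi
)

describe_panic_timeout || true

# To keep the panic output on screen until a manual reset (needs root):
#   sysctl -w kernel.panic=0
# Also check /etc/sysctl.d/ for a file that sets it back to 60 at boot.
```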
Online
thanks. done.
waiting for the next crash
Offline
Hi,
It crashed again last night (kernel 6.4.12). I managed to get some logs from syslog-ng (nothing from journald).
Here are the last logs from just before the crash.
It seems to be related to nft.
thanks.
Last edited by cyayon (2023-09-10 06:47:31)
Offline
Yup.
At least superficially, https://lore.kernel.org/netdev/20230810 … ter.org/T/ looks relevant, but those are in 6.4.12
https://github.com/archlinux/linux/comm … /netfilter
https://github.com/archlinux/linux/comm … /netfilter - the upper two blocks are for 6.3.8 (already results on this) and 6.3.9
https://lore.kernel.org/lkml/87ttuuajbo.fsf@43-1.org/T/ suggests that https://github.com/archlinux/linux/comm … bd315a313b might be it …
Edit:blank position...
Last edited by seth (2023-09-10 07:51:26)
Online
Hi Seth,
Thanks for your answer !
If I understand correctly, this would explain why everything was fine on 6.3.7.
Note that I use some nft sets with a 48h timeout for elements, and I also have "gc-interval 1h" in these sets. Perhaps that's why I get a crash after about 48h?
What is your advice in this situation? Roll back to 6.3.7 and wait for a fix, or try the very latest 6.5.2? Or even try mainline 6.5?
thanks.
Last edited by cyayon (2023-09-10 07:37:43)
Offline
If this isn't somehow fixed in 6.5, first see whether 6.3.8 is stable, then whether 6.3.9 w/ the mentioned commit reverted is.
Then file a bug upstream, since this seems not fully resolved.
Can you reduce the timeouts to perhaps trigger this more aggressively?
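Something along these lines could serve as a throwaway test set with much shorter expiry. Everything here is an assumption for illustration: the table and set names (pf_repro, short_lived), the 2m/30s values and the element type are invented, since the real ruleset wasn't posted, and there is no guarantee that shorter timeouts actually hit the same code path. The script only writes the ruleset to a temp file and prints it; loading it is left commented out.

```shell
# Write a minimal nftables ruleset with a short element timeout and
# GC interval to the given file. Names and values are hypothetical.
write_repro_ruleset() {
    cat > "$1" <<'EOF'
table inet pf_repro {
    set short_lived {
        type ipv4_addr
        flags timeout
        timeout 2m
        gc-interval 30s
    }
}
EOF
}

ruleset=$(mktemp)
write_repro_ruleset "$ruleset"
cat "$ruleset"

# Syntax check without loading (typically needs root):
#   nft -c -f "$ruleset"
# Load it and add an element that expires after the set's default timeout:
#   nft -f "$ruleset"
#   nft add element inet pf_repro short_lived '{ 192.0.2.1 }'
```

Ideally the real sets would be copied with only the timeout and gc-interval values lowered, so the rest of the ruleset stays the same.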
Online
Hi,
I opened a ticket with netfilter.
Pablo N. told me to test 6.4.15 instead of 6.5.x and 6.3.x (revert of bdace3b1a51887211d3e49417a18fdbd315a313b required).
Do you have a PKGBUILD for version 6.4.15?
Thanks
Offline
Could you please link to your report? 6.4.15 does not contain bdace3b1a51887211d3e49417a18fdbd315a313b, as that is the 6.3 backport of 1240eb93f0616b21c675416516ff3d74798fdc97, although reverting either should be equivalent.
Neither can be reverted cleanly from 6.4.15.
$ git checkout v6.4.15
HEAD is now at f60d5fd5e950 Linux 6.4.15
$ git status
HEAD detached at v6.4.15
$ git revert -n bdace3b1a51887211d3e49417a18fdbd315a313b
Auto-merging net/netfilter/nf_tables_api.c
CONFLICT (content): Merge conflict in net/netfilter/nf_tables_api.c
error: could not revert bdace3b1a518... netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
$ git revert --abort
$ git revert -n 1240eb93f0616b21c675416516ff3d74798fdc97
Auto-merging net/netfilter/nf_tables_api.c
CONFLICT (content): Merge conflict in net/netfilter/nf_tables_api.c
error: could not revert 1240eb93f061... netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
You can locally generate such a PKGBUILD (requires devtools and base-devel)
$ pkgctl repo clone --protocol=https linux
$ cd linux/
$ git checkout 6.4.12.arch1-1 # The last Arch 6.4 release
$ curl -o PKGBUILD.diff http://ix.io/4FYM # download a diff of the changes I made
$ git apply -v PKGBUILD.diff
$ makepkg -rsi
Edit:
Manually merged the failing revert, added that to the PKGBUILD and updated the diff. The old diff is at http://ix.io/4FYH
Locally built package with pkgrel changed to 1.1:
https://drive.google.com/file/d/1oJTRYe … sp=sharing linux-6.4.15-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1qUomh3 … sp=sharing linux-headers-6.4.15-1.1-x86_64.pkg.tar.zst
Last edited by loqs (2023-09-11 00:06:04)
Offline
Hi,
Can I directly download and install the shared packages?
The netfilter bug report was sent by email directly to bugzilla-account@netfilter.org. Registration on the netfilter bugzilla is closed to avoid spam.
Pablo N. just answered me by email, saying to test 6.4.15, because 6.5.2 is a little behind 6.4.15 on this issue.
Last edited by cyayon (2023-09-11 08:35:45)
Offline
Hi,
Can I directly download and install the shared packages?
Yes.
Offline