I'm experiencing complete lockups (the machine doesn't even respond to ping), but SysRq still works. It crashed several times today while running `cargo test` on a big Rust project, and I had several of these over the course of the last few days. At first I thought it was VirtualBox, but today I was not running any VMs. There is nothing in the logs about the crash.
The kernel was updated today (to 6.10.9-arch1-2), but that does not fix the issue.
info:
CPU: 24-core (8-mt/16-st) 13th Gen Intel Core i9-13900K (-MST AMCP-)
speed/min/max: 811/800/5500:5800:4300 MHz Kernel: 6.10.9-arch1-2 x86_64
Up: 17m Mem: 5.47/31.09 GiB (17.6%) Storage: 3.29 TiB (13.5% used) Procs: 536
Shell: fish inxi: 3.3.35
Graphics:
Device-1: AMD Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
driver: amdgpu v: kernel
Device-2: Logitech Logitech Webcam C925e driver: snd-usb-audio,uvcvideo
type: USB
Display: x11 server: X.Org v: 21.1.13 with: Xwayland v: 24.1.2 driver: X:
loaded: amdgpu unloaded: modesetting,radeon dri: radeonsi gpu: amdgpu
resolution: 1: 2560x1440~60Hz 2: 2560x1440~60Hz
API: EGL v: 1.5 drivers: kms_swrast,radeonsi,swrast
platforms: gbm,x11,surfaceless,device
API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.2.2-arch1.1
renderer: AMD Radeon RX 580 Series (radeonsi polaris10 LLVM 18.1.8 DRM 3.57
6.10.9-arch1-2)
It seems to be related to memory pressure. When I run 'cargo test' on my project, it consumes ridiculous amounts of memory, and this crashes the kernel EVERY single time. With LTS kernel, the OOM killer kills the terminal and kernel does not crash:
Out of memory: Killed process 306333 (snarkvm_synthes) total-vm:52890776kB, anon-rss:29081172kB, file-rss:512kB, shmem-rss:0kB, UID:1000 pgtables:72928kB oom_score_adj:200
System was rock solid before. Seems like a kernel regression as the LTS kernel does not lock up.
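For scale, the sizes in the OOM-killer line quoted above are in kB. A minimal sketch that converts them to GiB, assuming the usual "Out of memory: Killed process PID (comm) total-vm:...kB, anon-rss:...kB" layout; the helper name `oom_summary` is made up for illustration:

```shell
# Summarise "Out of memory: Killed process ..." kernel lines read from stdin.
# The kernel reports sizes in kB; divide by 1048576 to get GiB.
# Assumption: the process name (comm) contains no spaces.
oom_summary() {
    awk '
        /Out of memory: Killed process/ {
            pid = ""; comm = ""; vm = ""; rss = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "process")   { pid = $(i + 1); comm = $(i + 2) }
                if ($i ~ /^total-vm:/) { vm = $i }
                if ($i ~ /^anon-rss:/) { rss = $i }
            }
            gsub(/[^0-9]/, "", vm)     # keep only the digits
            gsub(/[^0-9]/, "", rss)
            gsub(/[()]/, "", comm)     # strip the parentheses around comm
            printf "pid=%s comm=%s total-vm=%.1fGiB anon-rss=%.1fGiB\n",
                   pid, comm, vm / 1048576, rss / 1048576
        }'
}

# Usage (as root): journalctl -k -b -1 --no-pager | oom_summary
```

For the line above this gives an anon-rss of about 27.7 GiB, i.e. most of the 31.09 GiB of RAM reported by inxi.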
Last edited by joske (2024-09-16 06:26:50)
Offline
The following command reproduces the lockup:
perl -wE 'my @xs; for (1..2**20) { push @xs, q{a} x 2**20 }; say scalar @xs;'
On the LTS kernel 6.6.50-2-lts, the OOM killer just kills the process. On 6.10.9-arch1-2, the system locks up: the fans are still spinning, but it does not react to anything, not even pings.
On another system I was not able to reproduce this behaviour; there the OOM killer also killed the Perl process and the system stayed responsive.
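To see why the Perl one-liner above is guaranteed to exhaust memory: it tries to keep 2**20 strings of 2**20 bytes each, all consisting of the same character (so the data is highly compressible). A small sketch of the arithmetic, ignoring Perl's per-string overhead; the function name is only for illustration:

```shell
# Payload the reproducer tries to allocate, in GiB.
reproducer_gib() {
    count=$(( 1 << 20 ))    # number of strings pushed onto @xs
    size=$(( 1 << 20 ))     # bytes per string (q{a} x 2**20)
    echo $(( count * size / (1 << 30) ))
}

reproducer_gib    # prints 1024
```

That is 1 TiB of payload on a machine with about 31 GiB of RAM, so the process can never finish and something has to stop it.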
Offline
I built my own 6.10.9 kernel from kernel.org using 'localmodconfig'. With that kernel, OOM killer works and system doesn't lock up. So the problem must lie with the arch patches.
Offline
also the zen kernel doesn't lock up.
Offline
My suggestion is to publish the complete log (dmesg) from your system somewhere. There should be additional information from the kernel, logged before or after the OOM occurred.
Offline
Like I said, there's nothing at all in the logs. When I check with 'journalctl -t kernel' it just shows the new boot
Offline
Others may see things you missed.
Please reproduce the oom crash, after rebooting run as root
# journalctl -b -1 | curl -F 'file=@-' 0x0.st
and post the link it will output.
(the -1 is to get the journal from the previous boot; if you need older boot logs, use -2, -3, etc.)
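Before uploading, the previous boot's kernel messages can also be checked locally for OOM-killer activity. A rough sketch; the pattern list is an assumption about typical kernel wording and may miss some messages, and `oom_lines` is a made-up helper name:

```shell
# Keep only lines that look like OOM-killer activity.
oom_lines() {
    grep -iE 'out of memory|oom-kill|invoked oom-killer|oom_reaper'
}

# Usage (as root), kernel messages of the previous boot only:
#   journalctl -k -b -1 --no-pager | oom_lines
```

If this prints nothing for a boot that ended in a lockup, the OOM killer most likely never ran (or its messages never reached the disk).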
Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.
clean chroot building not flexible enough ?
Try clean chroot manager by graysky
Offline
FWIW the Arch patches are listed under https://github.com/archlinux/linux/rele … 0.10-arch1 for example. Of these the only thing I can imagine having any relevance is the increase in ASLR bits.
Offline
with zen kernel, no lockup: http://0x0.st/XxGR.txt
Offline
What is on the screen when the system is locked up? If you are in an X session, try to switch to a virtual terminal (the text one).
Additionally, if possible, you can configure a serial console for this PC (a real RS-232 port backed by a hardware UART, not a USB-to-RS-232 adapter) and then check the messages that show up there.
Take a photo of the screen, publish it somewhere and put the link here. We will take a look at it.
Offline
sep 14 10:36:45 silence gnome-shell[2333]: Extension apps-menu@gnome-shell-extensions.gcampax.github.com already installed in /home/jos/.local/share/gnome-shell/extensions/apps-menu@gnome-shell-extensions.gcampax.github.com. /usr/share/gnome-shell/extensions/apps-menu@gnome-shell-extensions.gcampax.github.com will not be loaded
sep 14 10:36:45 silence gnome-shell[2333]: Extension auto-move-windows@gnome-shell-extensions.gcampax.github.com already installed in /home/jos/.local/share/gnome-shell/extensions/auto-move-windows@gnome-shell-extensions.gcampax.github.com. /usr/share/gnome-shell/extensions/auto-move-windows@gnome-shell-extensions.gcampax.github.com will not be loaded
sep 14 10:36:45 silence gnome-shell[2333]: Extension launch-new-instance@gnome-shell-extensions.gcampax.github.com already installed in /home/jos/.local/share/gnome-shell/extensions/launch-new-instance@gnome-shell-extensions.gcampax.github.com. /usr/share/gnome-shell/extensions/launch-new-instance@gnome-shell-extensions.gcampax.github.com will not be loaded
sep 14 10:36:45 silence gnome-shell[2333]: Extension native-window-placement@gnome-shell-extensions.gcampax.github.com already installed in /home/jos/.local/share/gnome-shell/extensions/native-window-placement@gnome-shell-extensions.gcampax.github.com. /usr/share/gnome-shell/extensions/native-window-placement@gnome-shell-extensions.gcampax.github.com will not be loaded
sep 14 10:36:45 silence gnome-shell[2333]: Extension screenshot-window-sizer@gnome-shell-extensions.gcampax.github.com already installed in /home/jos/.local/share/gnome-shell/extensions/screenshot-window-sizer@gnome-shell-extensions.gcampax.github.com. /usr/share/gnome-shell/extensions/screenshot-window-sizer@gnome-shell-extensions.gcampax.github.com will not be loaded
sep 14 10:36:45 silence gnome-shell[2333]: Extension user-theme@gnome-shell-extensions.gcampax.github.com already installed in /home/jos/.local/share/gnome-shell/extensions/user-theme@gnome-shell-extensions.gcampax.github.com. /usr/share/gnome-shell/extensions/user-theme@gnome-shell-extensions.gcampax.github.com will not be loaded
You appear to be using local versions of extensions that are also present in the system-wide GNOME install.
Is this intentional?
There are at least 2 coredumps at the bottom of the log, and it looks like they are related to dleyna-renderer-service.
In the log from the zen kernel there's no sign of dleyna or coredumps.
I suggest you ensure dleyna is inactive while booted in the stock kernel and try to reproduce the crash.
Last edited by Lone_Wolf (2024-09-14 12:47:42)
Offline
What happens is, screen stays active, cursor doesn't move anymore, no corruption on screen. I can still see disk activity, and fans are spinning up. Switching VT does not work, system does not react to any key. Alt-SysRq does work.
https://photos.app.goo.gl/jrxa98asXY9281ha6
I was not aware of the extension duplicates. I removed the local versions. I also deleted dleyna.
Triggered the issue again with default kernel: http://0x0.st/Xx5B.txt
Last edited by joske (2024-09-14 18:46:24)
Offline
I think the kernel doesn't really lock up, but is just thrashing non-stop (RAM + swap full) without triggering the OOM killer.
Offline
The system does have a real serial port (but it is disabled in the BIOS). I need to dig out my old serial cables; it's been a long time since I last had to use them :-D
Offline
What happens is, screen stays active, cursor doesn't move anymore, no corruption on screen. I can still see disk activity, and fans are spinning up. Switching VT does not work, system does not react to any key. Alt-SysRq does work.
You could try to switch to a VT *before* you start the perl test application and then take the picture of the screen while still in VT mode.
Then you can also use some of the SysRq combinations, like "m" and then "e", and check whether the system recovers from the thrashing after "e".
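Whether a given Alt+SysRq key is honoured depends on the kernel.sysrq bitmask. A sketch of the check, using the bit values from the kernel's sysrq documentation; `sysrq_allows` is a made-up helper name:

```shell
# Succeed if a kernel.sysrq value permits a function group.
#   value 1 = everything enabled, value 0 = everything disabled
#   bit  4  = keyboard control    (k: kill all programs on the current VT)
#   bit  8  = debugging dumps     (m: dump memory info to the console)
#   bit 64  = process signalling  (e: SIGTERM to all processes except init,
#                                  f: invoke the OOM killer)
sysrq_allows() {
    mask=$1 bit=$2
    [ "$mask" -eq 1 ] || [ $(( mask & bit )) -ne 0 ]
}

# Usage:
#   mask=$(cat /proc/sys/kernel/sysrq)
#   sysrq_allows "$mask" 8  && echo "Alt+SysRq+m permitted"
#   sysrq_allows "$mask" 64 && echo "Alt+SysRq+e and +f permitted"
# As root the same functions can be triggered without the keyboard;
# writes to this file are not filtered by the mask:
#   echo m > /proc/sysrq-trigger
```

Besides "m" and "e", "f" may be worth trying here, since it asks the kernel to run the OOM killer directly.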
Last edited by wtx (2024-09-14 21:32:07)
Offline
bijiben crash looks like https://bbs.archlinux.org/viewtopic.php?id=295914 and isn't in the (shorter) zen kernel boot, do you eventually get those there as well?
Next to the ASLR bits, try to add "transparent_hugepage=never" (general advice) to the kernel command line and disable https://wiki.archlinux.org/title/Zswap because that can massively blow up when decompressing the swap.
If you're running out of memory (incl. swap), thrashing becomes inevitable until the OOM killer steps in. You can make the latter act more aggressively, but of course you probably first and foremost want to prevent an unwarranted OOM.
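To verify what is currently in effect, the THP and zswap settings can be read back from sysfs. A small sketch; the sysfs paths are the standard ones, while `active_choice` is a made-up helper that extracts the bracketed value from selector files such as "always [madvise] never":

```shell
# Print the bracketed (active) choice of a sysfs selector line on stdin.
active_choice() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage:
#   active_choice < /sys/kernel/mm/transparent_hugepage/enabled
#   cat /sys/module/zswap/parameters/enabled    # Y or N
#   cat /proc/cmdline                           # parameters the kernel booted with
```

With "transparent_hugepage=never" on the command line, the first command should print "never" after a reboot.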
Finally
System was rock solid before.
Which kernel was the last good one?
Online
Yes, the bijiben crash always happens. I've been trying to look into this, but never found a solution. It seems harmless though.
I don't know which kernel was the last good one, as I don't usually run 8 VMs (as I needed to do for work). That was the first crash, but I blamed it on the VirtualBox modules. A few days later I also had to run Rust unit tests that for some reason blow up memory; that's how I noticed that it was strange and reproducible (the system would hang every time I ran those tests). That they use so much memory was known (but not by me :-D) and I can't fix that now.
Offline
I have now tried the latest 6.10.10-arch1-1 kernel, and the system still becomes unresponsive. This time I let it run a bit longer, but it didn't recover. Then I tried Alt-SysRq-k; this killed the terminal and the perl process, and the system was working again.
I think the logs now contain something that may be useful: http://0x0.st/X3-b.txt
Offline
AHA! I disabled zswap at runtime and did swapoff/swapon, and now the same kernel OOM-killed the perl process!
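The post does not show the exact commands used; a minimal sketch of one way to do this, assuming the standard zswap module parameter under /sys/module/zswap/parameters (the function names and the ZSWAP_PARAMS override are only for illustration):

```shell
# Directory holding the zswap module parameters (overridable for testing).
ZSWAP_PARAMS="${ZSWAP_PARAMS:-/sys/module/zswap/parameters}"

# Turn zswap off for the running kernel (needs root on the real sysfs).
zswap_disable() {
    printf '0\n' > "$ZSWAP_PARAMS/enabled"
}

# Show the current setting; the real sysfs file reports Y or N.
zswap_state() {
    cat "$ZSWAP_PARAMS/enabled"
}

# Usage (as root):
#   zswap_disable
#   swapoff -a && swapon -a    # cycle swap, as described in the post
#   zswap_state
```

This only lasts until the next reboot; to make it permanent, the Arch wiki page on zswap describes the zswap.enabled=0 kernel parameter.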
Last edited by joske (2024-09-15 18:29:54)
Offline
systool -vm zswap
compare the output among kernels.
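If systool (from sysfsutils) is not installed, the same parameters can be read directly from sysfs. A sketch that prints them in a diff-friendly form; `zswap_params` is a made-up helper name and the file names in the usage lines are just examples:

```shell
# Print every zswap module parameter as "name=value", sorted by name.
zswap_params() {
    dir=${1:-/sys/module/zswap/parameters}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done | sort
}

# Usage, once per booted kernel:
#   zswap_params > "zswap-$(uname -r).txt"
# then compare, for example:
#   diff zswap-6.6.50-2-lts.txt zswap-6.10.9-arch1-2.txt
```

Any line that differs between the locking and non-locking kernels is a candidate for the behaviour change.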
But with little physical swap, much RAM, a large enough https://wiki.archlinux.org/title/Zswap# … _pool_size and highly compressible data, I can see how the principal concept has explosive potential and you may end up juggling pages between the physical swap and the cache instead of killing the process.
You might want to stress your finding in the subject ("zswap leads to thrashing" or something like that) and try to constrain the pool (more) to keep more RAM available and write back the zswap cache more gradually.
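To get a feeling for how large the pool may grow, its ceiling can be estimated as MemTotal times max_pool_percent. A rough sketch, assuming the max_pool_percent module parameter and the MemTotal line of /proc/meminfo; the helper name and the value 10 in the usage lines are arbitrary examples, not recommendations:

```shell
# Estimate the zswap pool ceiling in MiB: MemTotal (kB) * percent / 100 / 1024.
zswap_pool_cap_mib() {
    percent=$1
    meminfo=${2:-/proc/meminfo}
    awk -v p="$percent" '/^MemTotal:/ { printf "%d\n", $2 * p / 100 / 1024 }' "$meminfo"
}

# Usage:
#   zswap_pool_cap_mib "$(cat /sys/module/zswap/parameters/max_pool_percent)"
# Lowering the ceiling at runtime (as root):
#   echo 10 > /sys/module/zswap/parameters/max_pool_percent
```

A smaller pool means pages are written back to the physical swap sooner, which is what constraining the pool amounts to in practice.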
Did you also test to disable THP?
Online