Arch hard-locking up for unidentified reasons

bluedino · 2021-03-13 17:40:13

Hello (sorry if I'm posting in the wrong subforum, but I thought this would be the most fitting one),

For a good while I've been having an issue with my install (the install is 1 year old; the issue has occurred for the past 3-4 months). I've been trying to find the cause myself and blamed the nvidia drivers for the longest time without questioning it.
My issue is that the lock-ups now happen more frequently and at random times. Since I can't reproduce the issue, I'm a little clueless as to how to test things. The lock-ups also don't show up the same way every time. Many times they happen right when I start a program after booting the machine; other times they build up "slowly" during usage (especially while streaming via OBS). That means: it starts with small lags that go away once I tab out and back in, then escalates to huge lags (where my cursor is affected too), and after those resolve it's a gamble whether my PC will lock up or not. Usually under load it freezes, then I can use it for another 5-10 seconds, then it locks up entirely (sound is gone, no sysrq, no kernel panic even with everything set up for it, such as the NMI watchdog). No blinking lights on my keyboard, just a total freeze. (One thing irritates me slightly: the hardware reset button still works. In my experience, a total hardware lockup would make it not work.)
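For reference, a minimal sketch of how the "everything set up for it" part could be double-checked. It only prints the current values of a few kernel knobs under /proc/sys/kernel that relate to sysrq and lockup detection. The function name and the optional directory argument are made up for illustration, and which knobs exist depends on the kernel config.

```shell
#!/bin/sh
# Sketch: print the current value of sysrq and lockup-related sysctl knobs.
# The optional argument (a directory standing in for /proc/sys/kernel) is a
# hypothetical hook so the function can be tried against test files.
lockup_knobs() {
    dir=${1:-/proc/sys/kernel}
    for knob in sysrq nmi_watchdog hardlockup_panic softlockup_panic panic; do
        if [ -r "$dir/$knob" ]; then
            echo "$knob=$(cat "$dir/$knob")"
        else
            # Knob not present, e.g. lockup detector not built into this kernel
            echo "$knob=missing"
        fi
    done
}
```

Calling `lockup_knobs` with no argument reads the live values. A `nmi_watchdog=0` or `hardlockup_panic=0` line would mean a hard lockup is not turned into a panic, so nothing would be logged.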

I've asked in a few Discords and was told it might be a kernel bug in 5.10+. I looked into the reported issues; they seem similar, but in the end they aren't. For most people it seems to be some kind of GPU lockup, which isn't the case for me, as pretty much everything stops responding (including sound, etc.).
I've also checked my disks with extended SMART tests; all 3 are fine.
memtest overnight reported 0 errors.
I've tried with swap on and off.
Installed the newest microcode and updated the system.

This is essentially the only piece of log I have to work with from the time of the crash, before the reboot:

Mar 13 01:16:18 " " com.obsproject.Studio.desktop[9216]: error: xcompcap: cleanup glXReleaseTexImageEXT failed: GLXBadPixmap
Mar 13 01:16:18 " " com.obsproject.Studio.desktop[9216]: info: xcompcap: [window-capture: 'Window Capture (Xcomposite) 5'] update settings:
Mar 13 01:16:18 " " com.obsproject.Studio.desktop[9216]:         title: FINAL FANTASY XIV
Mar 13 01:16:18 " " com.obsproject.Studio.desktop[9216]:         class: ffxiv_dx11.exe
Mar 13 01:16:18 " " com.obsproject.Studio.desktop[9216]:         Bit depth: 24
Mar 13 01:16:18 " " com.obsproject.Studio.desktop[9216]:         Found proper GLXFBConfig (in 9): yes
Mar 13 01:16:45 " " discord.desktop[1286]: [2021-03-13 01:16:45.942] [1308] (audio_device_pulse_linux.cc:1855): Can't query latency

Again, nothing in dmesg, nothing in the Xorg log, no kernel panic. I've got essentially nothing to work with here and I'm at the end of my knowledge.
Everything else in my journal (extended view) is an occasional GNOME plugin error.

I'm using the latest version of the zen kernel.
My hardware includes an Intel i7 7700K and an NVIDIA GTX 1080 Ti; my mainboard is the Z270A from MSI.

Edit: I've tried the stock and LTS kernels as well, same issue.
Edit 2: Other people, and I myself, suspected an I/O issue, which was later ruled out since all drives are in good condition.

If there are more questions, please ask. I hope someone has an idea of how to get at least more information out of this. I doubt this is a HW issue, since other OSes seem to work.

Greetings

Last edited by bluedino (2021-03-13 17:53:04)

d_fajardo · 2021-03-13 19:11:23

There doesn't seem to be a lot to work with, since you've already tried most things one could recommend.
But as the error log you posted seems to point to graphics, why not try disabling your nvidia card via the BIOS and running on the integrated graphics (the i7 7700K has them), and see if you get the same behavior?
Needless to say, you'll have to disable the nvidia driver and adjust your KMS, etc.

bluedino · 2021-03-13 19:40:48

Will try doing so; I'll edit for updates. Can't really tell how long it will take, as the lockups sometimes happen multiple times a day and sometimes don't happen for a week.
Edit 1: This seems to be the output during the lags mentioned earlier:

Jan 15 23:27:05 "" /usr/lib/gdm-x-session[815]: (EE) event26 - Kingsis Peripherals ZOWIE Gaming mouse: client bug: event processing lagging behind by 12ms, your system is too slow
Jan 15 23:31:14 "" discord.desktop[1355]: [2021-01-15 23:31:14.841] [1378] (audio_device_pulse_linux.cc:1855): Can't query latency
Jan 15 23:31:33 "" /usr/lib/gdm-x-session[815]: (EE) event23 - CM Storm Quickfire TKL 6keys: client bug: event processing lagging behind by 33ms, your system is too slow

- again, nothing else seems sus in the log -
I know it's an older log, but I took my time to find literally anything that seems sus.
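To see how often and how badly input processing falls behind, the "lagging behind" warnings above (which appear to come from libinput) can be pulled out of journal text. This is only a sketch: the function name is made up, and the pattern is written against the exact wording of the two lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print one line per
# "lagging behind" warning in the form "<milliseconds> <device name>".
lag_events() {
    sed -n 's/.*(EE) event[0-9]* *- \(.*\): client bug: event processing lagging behind by \([0-9]*\)ms.*/\2 \1/p'
}
```

Something like `journalctl -b | lag_events | sort -n | tail` would then list the worst delays of the current boot, which could be compared against the times the lags were noticed.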

Last edited by bluedino (2021-03-13 19:52:08)

V1del · 2021-03-13 19:53:35

What are you using to produce these logs? This reads a bit too userspacey to me. If this is the equivalent of

journalctl -b-1

try running that with sudo so we get some kernel messages.
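A minimal sketch of that idea, assuming the journal from the previous boot was kept on disk: `-k` restricts the output to kernel messages and `-b -1` selects the previous boot. The function name and the `JOURNALCTL` variable are made up here so the command can be swapped out; they are not part of systemd.

```shell
#!/bin/sh
# Sketch: show only kernel messages from the previous boot.
# JOURNALCTL is a hypothetical override hook, not a systemd variable.
prev_boot_kernel_log() {
    # Extra arguments (e.g. -p warning) are passed through unchanged
    "${JOURNALCTL:-journalctl}" --no-pager -k -b -1 "$@"
}
```

From a root shell, `prev_boot_kernel_log -p warning` would keep only warnings and worse, which is usually where memory or driver complaints show up.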

bluedino · 2021-03-13 21:31:56

Sorry for the hold-up,

Something quite interesting has happened.
While I was opening Sublime Text to load the logfile and censor out my computer's name (sorry, it's just personal to me), the slow crash happened again. This time I immediately switched TTYs before the system became entirely unresponsive.
(I saved dmesg logs in that short period too!!)
Seems like it threw a lot of segfaults at me, and after the reboot my GNOME is back to stock settings (?).
Will upload logs in an edit. Just thought this was interesting.

Maniaxx · 2021-03-13 21:52:40

You could install a 2nd (kdump) kernel to get a backtrace/dmesg from the crashed kernel. You don't need to compile anything; it should work out of the box.
I recommend 'crashkernel=256M' as a kernel parameter and not using 'makedumpfile', as it doesn't support recent kernels.
I'm using this 'kdump-save.service' instead:

[Unit]
Description=Create dump after kernel crash
DefaultDependencies=no
Wants=local-fs.target
After=local-fs.target

[Service]
Type=idle
ExecStart=/bin/sh -c 'vmcore-dmesg /proc/vmcore > /root/vmcore-dmesg-$$(date +%%F-%%T)'
ExecStopPost=/usr/bin/systemctl reboot
UMask=0077
StandardInput=tty-force
StandardOutput=inherit
StandardError=inherit

And this 'kdump.service':

[Unit]
Description=Load dump capture kernel
After=local-fs.target

[Service]
ExecStart=/usr/bin/kexec -p /boot/vmlinuz-linux --initrd=/boot/initramfs-linux.img --append="root=UUID=11111111-1111-1111-1111-111111111111 systemd.unit=kdump-save.service irqpoll nr_cpus=1 reset_devices"
Type=oneshot

[Install]
WantedBy=multi-user.target

These commands might be helpful for testing:

Load/arm kexec with target=kdump-save.service:
sudo /usr/bin/kexec -p /boot/vmlinuz-linux --initrd=/boot/initramfs-linux.img --append="root=UUID=11111111-1111-1111-1111-111111111111 systemd.unit=kdump-save.service irqpoll nr_cpus=1 reset_devices"

Unload armed kexec:
sudo kexec -pu

Do immediate kexec to 'single-user' target (with graceful shutdown):
sudo /usr/bin/kexec /boot/vmlinuz-linux --initrd=/boot/initramfs-linux.img --append="root=UUID=11111111-1111-1111-1111-111111111111 single irqpoll nr_cpus=1 reset_devices"

Trigger crash (not graceful, beware mounted filesystems):
sudo sh -c "sync; echo c > /proc/sysrq-trigger"

Check kernel capabilities:
zgrep -E 'CONFIG_DEBUG_INFO=|CONFIG_CRASH_DUMP=|CONFIG_PROC_VMCORE=' /proc/config.gz

Get info:
cat /sys/kernel/kexec_crash_size
cat /sys/kernel/kexec_crash_loaded
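The two "get info" files can be wrapped into a quick check that the crash kernel is really armed before waiting for the next lockup. This is just a sketch: the function name and the optional directory argument are made up for illustration and are not part of kexec-tools.

```shell
#!/bin/sh
# Sketch: report whether a crash kernel is currently loaded.
# The optional argument (a directory standing in for /sys/kernel) is a
# hypothetical hook so the function can be tried against test files.
kdump_status() {
    dir=${1:-/sys/kernel}
    loaded=$(cat "$dir/kexec_crash_loaded" 2>/dev/null) || loaded=unknown
    size=$(cat "$dir/kexec_crash_size" 2>/dev/null) || size=unknown
    case $loaded in
        1) echo "crash kernel loaded, reserved bytes: $size" ;;
        0) echo "crash kernel NOT loaded, reserved bytes: $size" ;;
        *) echo "kexec status unavailable" ;;
    esac
}
```

If `kdump_status` reports 0 reserved bytes, the 'crashkernel=' parameter most likely did not take effect; if memory is reserved but nothing is loaded, 'kdump.service' probably did not run.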

Last edited by Maniaxx (2021-03-13 22:13:49)

bluedino · 2021-03-13 22:20:07

journal
dmesg

This is after the first short lag and before the total lock-up.

Edit 2: Will set up kdump! Thank you.

Last edited by bluedino (2021-03-13 23:18:01)

seth · 2021-03-13 23:02:56

https://wiki.archlinux.org/index.php/St … ing_memory

Also monitor the system temperature - there are memory errors all over the place. I doubt that this is a software issue, given it affects multiple kernels.

Maniaxx · 2021-03-13 23:05:00

Please do not use commercial heavy-weight services like discord to distribute plain text files. Use something simple like https://bpa.st instead. Thank you!

bluedino · 2021-03-13 23:12:10

seth wrote:

https://wiki.archlinux.org/index.php/St … ing_memory
Also monitor the system temperature - there are memory errors all over the place. I doubt that this is a software issue, given it affects multiple kernels.

Interesting point. I recall my mainboard having issues with XMP enabled. With that on, I experience program crashes/lockups basically all the time.
Nonetheless, I'll run another memtest for a longer period of time. The last memtest passed without errors, which is why I had excluded a HW issue altogether.
I've also tested most hardware components in different PCs during the day.

Maniaxx · 2021-03-14 14:35:45

But if it's a "rising" problem that comes out of nowhere, beginning slowly and getting worse quickly, it really, really smells like a temperature problem. I would forget about everything else and check that first.

bluedino · 2021-03-14 18:04:41

Temps seem fine.
The crash also happens during first startup.
For me personally, the most reasonable explanation would be that my mainboard is actually damaged.
I've had issues with it in the past (XMP is quite literally unusable, for example), with similar crashes happening. Those stopped once I turned off all OC on it, but I don't really have another lead.
I tested the RAM twice, no issues. Tested it in another system, no issues. Same with the GPU. CPU temps generally don't exceed 50 °C (with games running) and the GPU stays at or below 65 °C. Everything else is below 40 °C at all times.

Monitored them during my raid session just now. No crashes, but a scary lag out of nowhere.
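For anyone wanting to log temperatures over time instead of watching them live, here is a rough sketch that reads the kernel's thermal zones from sysfs, where values are stored in millidegrees Celsius. The function name and the optional directory argument are made up for illustration, and it only covers sensors the kernel exposes as thermal zones.

```shell
#!/bin/sh
# Sketch: print one line per thermal zone as "<type> <degrees C>".
# The optional argument (a directory standing in for /sys/class/thermal) is
# a hypothetical hook so the function can be tried against test files.
read_temps() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp")
        kind=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # sysfs reports millidegrees; integer division drops the fraction
        echo "$kind $((milli / 1000))"
    done
}
```

A loop such as `while sleep 5; do date; read_temps; done >> temps.log` would leave a record of the last readings before a lockup, as long as the final lines reach the disk in time.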

Last edited by bluedino (2021-03-14 18:14:21)

Maniaxx · 2021-03-14 19:01:06

Another possible candidate could be the power supply.
Did you test the system under Windows as well?
