#1 2021-10-26 00:06:29

kplaceholder50
Member
Registered: 2021-10-24
Posts: 5

[SOLVED] System-wide freezes happening at random

Hello. First post here. I have been experiencing constant, full system-wide crashes occurring regularly at apparently random times on my Arch installation. I'm coming here since, after several weeks attempting to find anything even remotely close to a cause, I'm still at square one. I was hesitant to post this since I have come across multiple posts on several forums online, both old and new, dealing with very similar issues. However, those threads either never found a solution, or the solution doesn't apply to my system because they were dealing with a different issue after all, and currently I'm at a loss.

The issue is that the system locks up entirely, and when it does, it responds to absolutely nothing other than a forced shutdown (further details below). There isn't any kind of build-up towards the crash that I could identify anywhere; the system can be running perfectly smoothly at one second, and be entirely paralyzed the next. This issue occurs, apparently, at completely random times, but often enough that it happens roughly on a daily basis, and I have been unable to find any sort of pattern whatsoever. So far, it has never crashed outside of the DE, but at this point I'm willing to bet that is just because the DE is on 99% of the time. While crashed, the screen freezes, all LEDs connected to the device that were on remain on, the fan remains on, and USB ports keep outputting voltage, but can no longer be used for I/O operations.

The device where this happens has a Windows 10 partition where I have never run into this kind of crash. Furthermore, I have a separate laptop running Arch Linux that doesn't present this kind of issue either.

I am posting this out of desperation since I'm running out of leads and have been unable to identify even the slightest trace of the cause of these crashes. While straight solutions are certainly appreciated, I hope to find more ways to identify the cause as well as a way to verify that this issue is indeed solved, rather than applying some patch and hoping it never crashes again. (So far, it always crashes again, eventually).

Since I'm uncertain whether this issue is coming from hardware or software, I'll post the specs of both sides. I skipped some details that I consider irrelevant to this problem; if I need to specify further details please do point it out. Please consider that I am posting this to "Kernel & Hardware" because I think that the kernel is the most reasonable cause for this that I can think of, but note that I could be wrong.

Hardware specs:

  • CPU: Intel i7-8700 / GPU: NVIDIA GeForce GTX 1060 / Motherboard: MSI Z370 / Memory: Kingston HyperX Fury, 8GB running at 2400 MHz (not overclocked)

  • Storage: An NVMe SSD with the Linux filesystem, a separate SSD with the Windows partitions, and an additional HDD without OS that is permanently mounted.

  • The device has two 1080p monitors connected to it, one via HDMI-0 and the other via DP-0.

Software specs:

  • OS: Arch Linux x86_64, currently running with the Linux Zen kernel version 5.14.14

  • Desktop environment: KDE Plasma 5.23.1 with Xorg. Notably, the Kscreen service is disabled because of an incompatibility between xrandr, one of my monitors, and the proprietary NVIDIA driver.

  • Bootloader: rEFInd 0.13.2-1

  • The GPU is running under the proprietary NVIDIA driver.

  • There are a handful of everyday programs that are permanently open while I'm using the computer, such as Firefox, Discord, Mailspring, Yakuake, etc. I really don't think these are causing such an issue, but at this point, it is a possibility I cannot disregard. These programs are open often enough that I have never experienced a crash without them, but since I'd like to be able to use my computer for regular everyday tasks, I'm not willing to avoid using them unless there is good reason to suspect that they are tied to the issue. Plus I find it unlikely that a regular program like Firefox can crash the system to a point that not even the SysRq key can do anything.

What I have tried so far:
  • Looking into journalctl/dmesg to find a potential cause. I do this every time after a crash, hoping to find a recurring message that only appears shortly before the time the crash occurs. Unfortunately, there is never such a message.

  • Keeping system monitor/htop open looking for some sort of a sudden spike in CPU/Memory/Disk usage. Again, nothing.

  • Enabling and using the SysRq key. All SysRq commands are enabled on my system, as confirmed by the fact that I can REISUB out of the system, as long as no crash happened. Once a crash happens, all keyboard functions (including SysRq commands) become entirely unresponsive. In addition, the lock LEDs remain in the state they were before the crash, and the lock keys don't respond. It is also not possible to switch to another TTY in this state.

  • Checking that there isn't a naughty cronjob single-handedly crashing the system without prior notification. Both my crontab and root's are empty. Furthermore, I used to have rsnapshot+anacron as my backup utility, which I have since uninstalled in order to find a possible cause. Unfortunately, the crashes still happen after removing rsnapshot.

  • Passing the nomodeset parameter to the kernel. This did seem to fix the issue for a while, since after adding it, the system didn't have any sort of crash for about a week. Then, 9 crashes happened over the following weekend, and after that I went back to experiencing a random crash roughly once per day.

  • Ensuring that the Intel microcode is indeed being passed to the kernel at boot with the initrd parameter. As it turns out, I did forget about this when I first installed the OS. Unfortunately, after I took care of it, the crashes persisted.

  • Switching kernels. I have tried both the standard Arch Linux kernel and the Linux Zen kernel, both in version 5.14, without success. I have also tried the Linux LTS kernel, version 5.10, but then I ran into a different issue with the proprietary NVIDIA driver that is out of the scope of this post.

  • Running hardware checks. I figured that, since my Windows and Linux partitions are installed on separate drives, the Linux SSD might be malfunctioning. However, smartctl shows no error or warning. Furthermore, I have tested the memory stick with memtest86, which was left running overnight and came back without any errors as well. I don't really know how to test other components and I don't have a good reason to search for the problem there, since I would assume that at that point, the problem would affect the Windows partition as well, and it does not. Again, if I am wrong, please do point it out.

  • Disabling hardware acceleration on everyday programs such as Firefox, just in case. Again, nothing.

  • Just to see what happens, I plugged my phone into a USB port after the computer crashed. The phone did begin to charge up, but it didn't show any option to transfer files, which it certainly does when there is no crash.
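
For the SysRq item in the list above, here is a minimal sketch of how the value in /proc/sys/kernel/sysrq can be interpreted. The function name sysrq_allows is made up for illustration. The bit values are taken from the kernel's SysRq documentation, where 1 enables everything and any other value is a bitmask.

```shell
# Hypothetical helper: succeeds if a kernel.sysrq value permits a function group.
# Bit values from the kernel SysRq documentation:
#   16 = sync, 32 = remount read-only, 64 = signal processes (term/kill),
#   128 = reboot/poweroff
sysrq_allows() {
    value=$1   # contents of /proc/sys/kernel/sysrq
    bit=$2     # function group to check
    # 1 means "all functions enabled"; anything else is treated as a bitmask.
    [ "$value" -eq 1 ] || [ $(( value & bit )) -ne 0 ]
}

# Example: is the reboot group (the B in REISUB) enabled on this machine?
# sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 128 && echo "SysRq+B is enabled"
```

This only confirms what the kernel would accept while it is still alive. It says nothing about why the keys stop responding once the freeze has happened.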

Short of more drastic options such as switching to the Nouveau driver, changing to a different DE, reinstalling Arch or even trying out a new distro and seeing if the issue persists, I don't really have many more options, and I feel like I have made no progress whatsoever in identifying the problem. In addition, everything I have tried so far has been a blind attempt at solving a problem that I don't know the nature of, and I'm not willing to make drastic decisions that take more effort than they're worth just for a chance that the crashes might go away because I don't know any better.

I hope this forum can shed some light on this problem and let me know about more possible conflicts I can look into. Again, I'm not just looking for a solution but also for the cause, so that if I were to fix this problem, I can verify that I actually fixed it. Thank you for your attention.

Last edited by kplaceholder50 (2022-02-05 16:27:16)

#2 2021-10-26 06:46:55

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,410

Re: [SOLVED] System-wide freezes happening at random

You should post such a journal from after the crash. If it does contain something, it will be within a

sudo journalctl -b-1

for the previous boot: https://wiki.archlinux.org/title/List_o … n_services

You might also want to post these other diagnostics like the actual output of

smartctl -a

from an affected drive.
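
The two commands above can be bundled so that the same files get captured after every crash. This is only a rough sketch: the function name, the default output directory, and the JOURNALCTL/SMARTCTL override variables are made up for illustration, and /dev/nvme0 is just an example device name.

```shell
# Hypothetical helper: save the previous boot's journal and a drive's SMART report.
# Run it as root (or via sudo) so both tools can read what they need.
collect_crash_info() (
    device=$1                  # e.g. /dev/nvme0 (example only)
    outdir=${2:-crash-info}    # made-up default directory name
    mkdir -p "$outdir" || exit 1
    # -b -1 selects the boot before the current one, i.e. the one that froze.
    "${JOURNALCTL:-journalctl}" -b -1 --no-pager > "$outdir/journal-previous-boot.txt"
    # -a prints all SMART information. smartctl may exit non-zero to flag a
    # condition, which should not abort the collection.
    "${SMARTCTL:-smartctl}" -a "$device" > "$outdir/smart.txt" || true
)

# Example (after rebooting from a freeze):
# collect_crash_info /dev/nvme0
```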

For testing whether it's a kernel bug it would indeed be helpful if you could try out the LTS kernel, what's your issue with the nvidia driver there? The only thing you'd need to set up would be installing nvidia-lts to have the corresponding kernel module.

Other than that there are sometimes firmware bugs, have you checked whether there are any upgrades in that space for your mainboard?

#3 2021-10-26 16:22:46

kplaceholder50
Member
Registered: 2021-10-24
Posts: 5

Re: [SOLVED] System-wide freezes happening at random

You should post such a journal from after the crash

Sure, I figured it wouldn't be of much help.

Here is the output of sudo journalctl for a run that ended in a crash: http://ix.io/3CXY
The system crashed a few minutes after the last message.
Here is the output of sudo journalctl for the run that came right after the crash: http://ix.io/3CXZ
This run ended in a smooth, regular shutdown.

The output of sudo smartctl -a from the affected drive: http://ix.io/3CY0

For testing whether it's a kernel bug it would indeed be helpful if you could try out the LTS kernel, what's your issue with the nvidia driver there? The only thing you'd need to set up would be installing nvidia-lts to have the corresponding kernel module.

The issue is that all software seems to ignore the GPU when I'm using Linux LTS + nvidia-lts, as given away by the fact that Plasma disables all fancy visual effects including blur and animations. This has more tangible consequences in graphically heavy programs such as Blender, where the framerate really takes a hit. However, it seems that the software can indeed detect the GPU, as it shows up in lspci, it's present in the NVIDIA settings program, and is enabled in Blender as well.

I came across a Reddit post advising against nvidia-lts and recommending nvidia-dkms instead. However, I still get the same issue with nvidia-dkms. Unfortunately, it isn't as simple as just installing nvidia-lts, and I wish it was.

Admittedly, I hadn't looked too deep into this issue and gave up after a few hours of trying, but after your reply I decided to give it a second try. And funnily enough, I booted into Linux LTS and the system had a full crash on its own, very shortly after boot, but still after Plasma launched. So I can confirm this happens with the LTS kernel as well. Here is the journal for that run: http://ix.io/3CY1

Other than that there are sometimes firmware bugs, have you checked whether there are any upgrades in that space for your mainboard?

I'm afraid I don't really understand this question. If you mean whether there are updates for the BIOS of my motherboard, as a matter of fact, yes there are. I'm planning to update the BIOS soon, but I considered this topic unrelated to the issue. I can try and update and see if it does anything.

EDIT: Sorry, I didn't realize that pastebin.com was discouraged on this forum. I have replaced all pastebin.com links with ix.io links.

Last edited by kplaceholder50 (2021-10-26 16:38:04)

#4 2022-01-04 15:04:07

catectonews
Member
Registered: 2022-01-04
Posts: 1

Re: [SOLVED] System-wide freezes happening at random

Hi kplaceholder50, did you end up finding the issue?
I've had the same problem description for a long while but lately it got worse. I'm not on Arch but tried 3 different major versions of my distro over time and the problem persists on all, hence I suspect a hardware fault.
I thought it was a faulty RAM module but, like you, I ran memtest86 full cycle twice (for 32GB about 8 hours each) with no issues. I'm also cycling through each module one at a time and the issue keeps happening with all modules.
I'm on an AMD system but I also have an nvidia 1060GTX, tried multiple kernel versions, driver versions, etc. Sadly I don't have another VGA to try as all my old ones don't boot at all.
I also have an NVMe SSD but I've upgraded it some time ago and the old one works fine in the machine it went into.
Most of the time it's a complete freeze (caps lock doesn't light up anymore) but sometimes I can still move the mouse around. Also I used to get apps to core dump and I could see those in the logs (reason why I thought it was a memory problem) but lately I just get freezes.

#5 2022-01-04 16:28:13

twobooks
Member
From: rainforest
Registered: 2020-06-23
Posts: 46

Re: [SOLVED] System-wide freezes happening at random

1. you can check if there is a new version of the BIOS firmware
2. use a different RAM module if possible, or a different slot
3. try to use a different hard drive or a different SATA port or a different cable, and remove all other unused hard drives
4. remove GTX 1060, use integrated GPU instead

#6 2022-01-05 21:30:39

kplaceholder50
Member
Registered: 2021-10-24
Posts: 5

Re: [SOLVED] System-wide freezes happening at random

Hi. Thank you for your replies. While I haven't managed to solve this issue yet, I made some progress on it. I apologize for not updating this thread; I was unsure if I should keep bumping the thread with new findings if no one else was replying.

I attempted to set up Kdump to get a proper trace of the kernel state after it crashed. However, I never got it working. The dump kernel never booted after an unexpected crash, despite the fact that it did boot if the crash was caused via SysRq+C. Even then, the file /proc/vmcore was not available even under the dump kernel so it could not be written to disk. I gave up, as I figured Kdump is just beyond the limits of my own Linux expertise. Despite that, I advise anybody else with crashes like these to give it a try.
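
For anybody following that advice, a first sanity check before waiting for a real crash is whether memory for the dump kernel was reserved and whether a crash kernel is loaded at all. Below is a minimal sketch of that check. The function name is made up, and the two file arguments exist only so the check can be tried against sample files; they default to the real kernel interfaces. It verifies preconditions only, not the capture itself.

```shell
# Hypothetical pre-flight check for Kdump.
kdump_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}
    # Memory for the dump kernel has to be reserved at boot via crashkernel=...
    grep -q 'crashkernel=' "$cmdline_file" || { echo "no crashkernel= parameter"; return 1; }
    # This file contains 1 once a crash kernel has been loaded with kexec.
    [ "$(cat "$loaded_file")" = "1" ] || { echo "crash kernel not loaded"; return 1; }
    echo "crash kernel reserved and loaded"
}

# Example: kdump_ready && echo "safe to test with SysRq+C"
```

If this passes, a deliberate panic (SysRq+C, or writing c to /proc/sysrq-trigger as root) tests the rest of the chain, which is the part that worked in my case.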

I managed to pin down the crashes to the NTFS driver. Crashes seem to come hand in hand with handling large amounts of data in one of my NTFS drives (the non-SSD drive in particular). Opening a torrent client (doesn't seem to matter which; I tried qBitTorrent and Deluge) almost guarantees that the system will crash later that session, if at least one of the seeded files is located on the NTFS drive. Furthermore, playing games with large save data, such as Minecraft, can occasionally cause a system freeze, albeit much less frequently, again as long as the save is written to the NTFS drive. Avoiding these certainly helped to avoid crashes for weeks on end. As far as I can tell, there has been one single instance of a crash happening without me doing anything that requires handling large amounts of data on that drive, but then again it could be some background process such as Baloo doing that. You would think that the drive is defective or in a bad state, but SMART tools don't report any issue at all, and Windows can handle large data on that drive just fine. My system has never frozen with the NTFS drives unmounted.

Lastly, I found out how much of a PITA debugging a system-wide crash like this is on Linux. There are tons of red herrings when it comes to searching the Internet for a possible cause; if your symptoms are even a tiny bit different, you might be experiencing an entirely different problem altogether. If you can move your mouse sometimes, chances are your cause is completely unrelated to mine, unfortunately. Being able to set up a core dump utility reasonably easily, and getting logs after the system has become completely unresponsive, would certainly go a long way towards solving this entire family of issues, rather than just my individual issue.

#7 2022-02-05 16:26:39

kplaceholder50
Member
Registered: 2021-10-24
Posts: 5

Re: [SOLVED] System-wide freezes happening at random

It seems I have finally solved this issue, and it turns out it was indeed a hardware issue after all.

After failing to get Kdump working, I set up netconsole, which is comparatively easier. I managed to get SysRq+C kernel panic logs on other devices reliably. However, the crash I'm looking for wasn't being logged remotely. Also, I realized that everything I was using to test the NTFS drive was also heavy on network traffic. I tried moving all my torrents to my main Linux SSD and unmounting the NTFS drives, and the system still hung regularly within less than an hour. So it became clear that the NTFS-3G issue I mentioned earlier was a red herring as well.
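
For reference, a minimal sketch of a netconsole setup like the one described. The helper name and all addresses are placeholders, not my actual configuration. The parameter format follows the kernel's netconsole documentation (source port@source IP/device, target port@target IP/target MAC).

```shell
# Hypothetical helper: build the netconsole= module parameter string.
netconsole_param() {
    src_ip=$1; dev=$2; tgt_ip=$3; tgt_mac=$4
    port=${5:-6666}   # the same port is used for both ends in this sketch
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$port" "$src_ip" "$dev" "$port" "$tgt_ip" "$tgt_mac"
}

# On the crashing machine, as root (all addresses are placeholders):
# modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20 aa:bb:cc:dd:ee:ff)"
#
# On the receiving machine, any UDP listener works, for example:
# nc -u -l 6666        (netcat variants differ; some need "nc -u -l -p 6666")
```

Triggering SysRq+C afterwards is a quick way to confirm that panic messages actually reach the receiver.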

This means that either the router's firmware, the network card, or NetworkManager was at fault. I tried the following strategy: set up my phone as a Wi-Fi extender, and then connect my system to it wirelessly, which, in the event of a crash, would rule out the router. It did crash within an hour of opening qbittorrent, so I connected the phone as a USB modem to the system, and the system no longer hung. I kept doing this for days, so it became clear that the culprit was indeed the network card.

I bought a new network card, installed it, and voila, crashes gone. It has been almost two weeks, I have been using heavy network traffic including torrents daily, and not a crash in sight. So yeah, that was it.

This was the network card that I was using while the system was still crashing:

Asus PCE-N15 WiFi 11n 300Mbps Low Profile PCI-e N300

Whose chipset showed up in lspci as:

Realtek Semiconductor Co., Ltd. RTL8192CE PCIe Wireless Network Adapter

I'm leaving this info for anybody who comes across this issue in the future. I really don't know what was with bittorrent clients that so consistently triggered a full system crash with that chipset, but they definitely helped in recreating the crash and testing possible causes. Either way, I'm marking this thread as solved.
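
For anybody who wants to check whether they have the same chipset, a small sketch that reads the output of lspci -k and prints the kernel driver bound to a matching device. The function name driver_for is made up for illustration.

```shell
# Hypothetical helper: read `lspci -k` output on stdin and print the kernel
# driver bound to each device whose description contains the given text.
driver_for() {
    awk -v pat="$1" '
        # Device lines start at column 0 with the bus address; detail lines are indented.
        /^[0-9a-f]/ { match_dev = (index($0, pat) > 0) }
        match_dev && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use: */, "")
            print
        }
    '
}

# Example: lspci -k | driver_for "RTL8192CE"
```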
