
#1 2015-08-28 07:49:12

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26

[SOLVED] High-end system hard locks after a little bit of usage.

Alright, I am posting here because I am at my wit's end trying to solve this really frustrating problem I'm having with Arch.

My PC specs are as follows:
Intel i7 5930K CPU
64GB of Corsair DDR4 RAM
EVGA X99 FTW motherboard
2 AMD Radeon R9 295X2 GPUs
4 Samsung 850 EVO 250GB SSDs in RAID10 on /
2 WD Black 3TB HDDs in degraded RAID10 on /mnt/data (Waiting on the other 2 drives.)
1 Lexar 16GB USB flash drive on /boot, GRUB is installed here too.

I am running a fresh install of Arch Linux with LightDM and Cinnamon. I am using the proprietary Catalyst driver from the wirephire repo. The partition on the RAID for / is encrypted with dm-crypt/LUKS.

My issue is as follows:

The system boots perfectly normally, I can log in, and X starts with no issues. I can use it for a little while, and it works flawlessly. Then, after some time of use, the system completely hard locks: the mouse and keyboard no longer accept input, the display no longer updates, any audio that was playing stops, and I can no longer ping the system from another machine on my LAN. The only way I have found to get a working system again is to hold down the power button until the system shuts down, and then boot up again. Windows 8.1 and Windows 10 both work flawlessly, and I ran CPU and GPU stress tests overnight with no issues.

I have observed at least two conditions under which this locking has occurred, at least 2 times for each condition:

- While a pacman package is installing, or an upgrade is running, after the packages are downloaded but before it is finished.
- When I close Firefox (Aurora) and re-open it, once the window appears but before my session restores.

I have tried the following to either fix my issue or simply help troubleshoot it or find the cause of the problem:

- Swapped out the GPU for an NVIDIA GTX 780 Ti and switched to the proprietary NVIDIA drivers - No change, the issue happened again after a while.
- REISUB (the Magic SysRq sequence) - Nothing happened, the system stayed on and locked.
- Switching TTYs with CTRL-ALT-F2 - Nothing happened.
- Writing 25GB to the SSD RAID array in an attempt to figure out if it was a disk issue (I know that needless writes degrade the drives.) - Nothing happened when I did this, the system did not lock.
- Read the journal from the last boot after I hard rebooted - Nothing in the journal indicated an error, it just cut off before the hard lock (In the most recent case, it cut off at me running `sudo pacman ...`.)

Can anybody please help me troubleshoot this issue? I have asked in IRC to no avail, and all my friends who use Linux have just told me to try a different distro, but I do not want to do that. I like Arch.

If you need any more information or want to see any specific logs, I will be more than happy to provide whatever you need!

Thanks in advance to anybody who helps me.

Last edited by AppleDash (2015-09-01 04:07:24)


#2 2015-08-28 10:52:57

Lone_Wolf
Forum Moderator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,966


A few things that come to mind :

did you implement intel microcode update changes ? see https://www.archlinux.org/news/changes- … deupdates/ .

Is the BIOS/UEFI firmware up to date ?

Have you tried with the LTS kernel ?

Are all SSD/HDD using motherboard SATA ports, or do you have additional SATA controllers ?

Added :

post lspci -k , lsusb and lsusb -t please.

What do you use to provide/manage the raid mirrors ( HW raid, mdadm, BtrFS , lvm ? )

Last edited by Lone_Wolf (2015-08-28 11:04:19)


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)


#3 2015-08-28 12:14:49

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


Yes, I have intel-ucode installed and the microcode is updated according to dmesg.
I updated the UEFI to the latest about a week ago to try and fix this.
I have not tried with the LTS kernel.
All the drives are on the motherboard SATA ports.
I use mdadm for the RAID.

I am in bed right now, but I will post those outputs for you when I wake up.


#4 2015-08-30 11:57:19

kris7t
Member
Registered: 2013-04-25
Posts: 20


Could this issue maybe be connected to the instabilities with quad-core Broadwell processors (https://bbs.archlinux.org/viewtopic.php?id=201194 ), even though yours is a Haswell-E (considering both are rather new Intel chips)?

You could try booting with mce=3 to disable kernel panic on Machine Check Exceptions, and use the system without X to actually see what gets printed on the text console. Using the default or mainline kernels, I was only able to get CPU stall messages and some SATA command timeouts printed as one of the CPU cores locked up completely, but booting the LTS kernel with mce=3 revealed a Machine Check Exception, most likely caused by some unfortunate interaction between the intel_idle driver and SpeedStep downclocking.

Last edited by kris7t (2015-08-30 11:57:31)


#5 2015-08-30 12:48:41

mrlamud
Member
Registered: 2014-09-27
Posts: 104


Since you have Samsung 850 SSDs, you might want to look at this topic: https://bbs.archlinux.org/viewtopic.php … 0#p1523030 .


#6 2015-08-30 21:27:27

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


kris7t wrote:

Could this issue maybe be connected to the instabilities with quad-core Broadwell processors (https://bbs.archlinux.org/viewtopic.php?id=201194 ), even though yours is a Haswell-E (considering both are rather new Intel chips)?

You could try booting with mce=3 to disable kernel panic on Machine Check Exceptions, and use the system without X to actually see what gets printed on the text console. Using the default or mainline kernels, I was only able to get CPU stall messages and some SATA command timeouts printed as one of the CPU cores locked up completely, but booting the LTS kernel with mce=3 revealed a Machine Check Exception, most likely caused by some unfortunate interaction between the intel_idle driver and SpeedStep downclocking.

Hmm, but the only way I have been able to trigger such a crash has been with X running... And perhaps it could be something like what you said. What if I booted with intel_pstate=disable? Would that be worth a shot?


#7 2015-08-30 21:40:10

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


Lone_Wolf wrote:

A few things that come to mind :

did you implement intel microcode update changes ? see https://www.archlinux.org/news/changes- … deupdates/ .

Is the BIOS/UEFI firmware up to date ?

Have you tried with the LTS kernel ?

Are all SSD/HDD using motherboard SATA ports, or do you have additional SATA controllers ?

Added :

post lspci -k , lsusb and lsusb -t please.

What do you use to provide/manage the raid mirrors ( HW raid, mdadm, BtrFS , lvm ? )

Here you go, sorry for taking so long!

lspci -k:
http://sprunge.us/DRDH

lsusb:
http://sprunge.us/AOcd

lsusb -t:
http://sprunge.us/HGON


#8 2015-08-30 21:44:53

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


I have just updated the OP with some more information. Specifically, I previously forgot to detail that GRUB and /boot are on a flash drive.


#9 2015-08-30 22:19:36

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


I have just tried booting with intel_pstate=disable and mce=3 (Running X too), and the system hard locked once again. This time, I was installing Sublime Text 2 from the AUR, and it locked up during "Compressing package..."
Any ideas what I can do to further troubleshoot or perhaps even solve this issue?


#10 2015-08-30 22:29:06

twelveeighty
Member
From: Alberta, Canada
Registered: 2011-09-04
Posts: 1,096


Have you carefully reviewed the entries in your journal (journalctl)? It's obviously not possible to check dmesg after it has locked up, but there could be events recorded in the journal leading up to the freeze.

For what it's worth: I recently switched laptops to a brand-new, highly powered one that had an advanced HD with a gigantic cache on it (not SSD). Turns out that this HD was causing low level (kernel) errors under high strain (pacman -Syu, for example). Replacing the drive fixed my problems. The point is that there were "early warning" errors in the journal just before it bombed.
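A minimal sketch of what such a review could look like, assuming a persistent journal so that the previous boot is still available. The function name `suspicious_lines` and its pattern list are illustrative assumptions, not something taken from this thread; `journalctl -k -b -1` shows the kernel messages of the previous boot.

```shell
# Hypothetical filter (name and patterns are assumptions, not from the thread):
# keep only lines that often accompany disk or CPU trouble.
suspicious_lines() {
    grep -Ei 'ata[0-9]+(\.[0-9]+)?: |machine check|hardware error|rcu.*stall|i/o error'
}

# Intended use, on a system with a persistent journal:
#   journalctl -k -b -1 --no-pager | suspicious_lines
```

Keep in mind that a hard lock can stop the last messages from ever reaching the disk, so an empty result does not rule out a hardware or driver problem.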


#11 2015-08-30 22:33:28

dice
Member
From: Germany
Registered: 2014-02-10
Posts: 413


Well, as mrlamud said, it might be caused by NCQ errors because of the 850 SSDs. I had similar problems with the new firmware for the 840 EVO some time ago. Disabling NCQ solved it: https://wiki.archlinux.org/index.php/So … NCQ_errors
The situations you mentioned where the lockup occurs all have in common that a lot of disk IO is produced. Compressing a package, installing packages, opening and closing Firefox, which causes it to update its database files and disk caches and whatnot...
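To see whether NCQ is actually in use on each disk, something like the following sketch may help. It assumes the drives show up as `/dev/sd*` and reads the `queue_depth` attribute from sysfs; the helper name `ncq_state` is made up for illustration. A depth of 1 means commands are not being queued, which is what booting with `libata.force=noncq` should produce.

```shell
# Hypothetical helper: classify a queue depth value read from sysfs.
ncq_state() {
    if [ "$1" -le 1 ]; then
        echo "NCQ off"
    else
        echo "NCQ on (depth $1)"
    fi
}

# Report the state of every SCSI/SATA disk that exposes queue_depth.
for f in /sys/block/sd*/device/queue_depth; do
    [ -r "$f" ] || continue          # skip if the glob matched nothing
    disk=${f#/sys/block/}            # strip the leading path
    disk=${disk%%/*}                 # keep only the device name, e.g. sda
    echo "$disk: $(ncq_state "$(cat "$f")")"
done
```

A depth of 1 can also simply mean the drive or controller does not support NCQ, so treat the output as a hint rather than proof.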


I put a button on it. Yes. I wish to press it, but I'm not sure what will happen if I do.  (Gune | Titan A.E.)


#12 2015-08-30 22:50:17

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


dice wrote:

Well, as mrlamud said, it might be caused by NCQ errors because of the 850 SSDs. I had similar problems with the new firmware for the 840 EVO some time ago. Disabling NCQ solved it: https://wiki.archlinux.org/index.php/So … NCQ_errors
The situations you mentioned where the lockup occurs all have in common that a lot of disk IO is produced. Compressing a package, installing packages, opening and closing Firefox, which causes it to update its database files and disk caches and whatnot...

I will try that. Can disabling NCQ globally cause anything bad to happen performance-wise or anything like that?


#13 2015-08-30 22:51:07

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


twelveeighty wrote:

Have you carefully reviewed the entries in your journal (journalctl)? It's obviously not possible to check dmesg after it has locked up, but there could be events recorded in the journal leading up to the freeze.

For what it's worth: I recently switched laptops to a brand-new, highly powered one that had an advanced HD with a gigantic cache on it (not SSD). Turns out that this HD was causing low level (kernel) errors under high strain (pacman -Syu, for example). Replacing the drive fixed my problems. The point is that there were "early warning" errors in the journal just before it bombed.

I've skimmed my entire journal for the relevant boots and scoured the last half of it or so after all the initialization and everything... Nothing really stands out to me.


#14 2015-08-30 23:21:20

dice
Member
From: Germany
Registered: 2014-02-10
Posts: 413


AppleDash wrote:

I will try that. Can disabling NCQ globally cause anything bad to happen performance-wise or anything like that?

Maybe... I don't know for sure, but I didn't notice anything special. Then again, I don't have such a complex setup with a RAID, just a single SSD, so I don't know what happens in your case.

But performance-wise: can it get worse than a hard lock? :P




#15 2015-08-31 01:35:09

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


dice wrote:
AppleDash wrote:

I will try that. Can disabling NCQ globally cause anything bad to happen performance-wise or anything like that?

Maybe... I don't know for sure, but I didn't notice anything special. Then again, I don't have such a complex setup with a RAID, just a single SSD, so I don't know what happens in your case.

But performance-wise: can it get worse than a hard lock? :P

You raise a good point there :P

Update to the situation: My system has been up for around 2 hours with NCQ disabled, and I've done lots of disk and CPU intensive tasks. I THINK this might have fixed it.


#16 2015-08-31 01:44:16

Webbeh
Member
Registered: 2012-07-08
Posts: 49


Looking at the spec sheet of your computer, I have a quick question for you...

...have you overclocked your CPU ?

I had the very same problems, and lockups occurred often enough when XZing some packages (manually or via makepkg) or playing Minecraft (two CPU-intensive tasks). A stability test showed me that, indeed, my CPU wasn't stable and caused the crash.

You might consider testing your system's stability (inside Linux, I don't trust Windows tools for some reason) if you indeed overclocked your CPU.

Note that the lockups didn't appear every time I played Minecraft or built packages, and also occurred sometimes in Firefox too.

Last edited by Webbeh (2015-08-31 01:45:52)


#17 2015-08-31 02:26:42

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


Webbeh wrote:

Looking at the spec sheet of your computer, I have a quick question for you...

...have you overclocked your CPU ?

I had the very same problems, and lockups occurred often enough when XZing some packages (manually or via makepkg) or playing Minecraft (two CPU-intensive tasks). A stability test showed me that, indeed, my CPU wasn't stable and caused the crash.

You might consider testing your system's stability (inside Linux, I don't trust Windows tools for some reason) if you indeed overclocked your CPU.

Note that the lockups didn't appear every time I played Minecraft or built packages, and also occurred sometimes in Firefox too.

I've overclocked to 4 GHz, yes, and I've stressed the CPU under both Windows and Linux: Windows using Prime95, and Linux using `stress`. Both ran fine for several hours before I closed them.


#18 2015-08-31 13:39:01

Webbeh
Member
Registered: 2012-07-08
Posts: 49


If the lockup still occurs after the SSD fix, I'm still fairly certain it's due to overclocking. No matter how stable you think your rig is, there are external factors, like room temperature, dust, power supply stability, etc., which can alter the stability of everything. Stress testing overnight isn't the same as stress testing in the day.

I heavily overclocked mine (from 3.4 to 5GHz) and was not having any lockups at night, but they would occur really often by day. Downgrading to 4.6GHz almost fixed it. I have enough stability for daily use and know that I have to save everything when I use a more CPU-hungry application. In both cases, stress testing was fine for several hours too...


#19 2015-09-01 04:07:11

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


It's been running fine for over 24 hours now with no issues. I'm calling it fixed.
The final solution was to add "processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll libata.force=noncq" to my kernel command line. I'm not sure if it was the CPU options or the libata option that fixed it, but I will not touch it for now as the system is running well.
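For anyone repeating this, here is a small sketch for confirming that the parameters really reached the running kernel. The helper name `has_param` is made up for illustration; the parameter list is the one quoted above. With GRUB, such parameters normally go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `grub-mkconfig -o /boot/grub/grub.cfg`.

```shell
# Hypothetical helper (not from the thread): succeed if the kernel command
# line passed as $1 contains the exact parameter $2.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# The four parameters from the post above.
wanted="processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll libata.force=noncq"

# /proc/cmdline holds the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    running=$(cat /proc/cmdline)
    for p in $wanted; do
        if has_param "$running" "$p"; then
            echo "active:  $p"
        else
            echo "missing: $p"
        fi
    done
fi
```

Matching whole space-separated words avoids false positives such as a longer parameter that merely starts with the same text.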


#20 2015-09-01 05:33:08

mich41
Member
Registered: 2012-06-22
Posts: 796


I think you meant "running hot" :)

You shouldn't need this cstate and idle stuff if it's only a matter of buggy SSD firmware.


#21 2015-09-03 04:55:33

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


mich41 wrote:

I think you meant "running hot" :)

You shouldn't need this cstate and idle stuff if it's only a matter of buggy SSD firmware.

~45°C under 20-30% load on all cores isn't all that hot, and honestly, before my upgrade (when I had a 2nd gen i5), I ran my system with intel_pstate=disable anyway because I didn't like the fact that it was downclocking, and had no issues. This system is also water cooled, so yeah.

Last edited by AppleDash (2015-09-03 04:56:22)


#22 2015-12-24 01:49:04

AppleDash
Member
From: Canada
Registered: 2015-08-28
Posts: 26


Sorry for the bump, but I figured I'd provide an update, finally.

Today, I disabled all of the previously-mentioned kernel command line arguments other than the libata one. My system hardlocked within a minute of booting. I rebooted and re-added them (before it locked again) and then rebooted again, and it hasn't locked in the past couple of hours. It would appear that the necessary flags for this setup are the libata one as well as one or more of the others.
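If anyone wants to narrow down which of the other flags is actually required, one option is to boot with one of them removed at a time. The sketch below only prints the candidate command lines to try and changes nothing on the system; the function name `drop_one` is an assumption made up for this example.

```shell
# Print one candidate command line per line, each omitting exactly one of the
# space-separated parameters in $1 and always keeping the fixed part $2.
drop_one() {
    for skip in $1; do
        line=$2
        for p in $1; do
            [ "$p" = "$skip" ] || line="$line $p"
        done
        echo "$line"
    done
}

# Keep the libata flag, vary the three CPU idle flags from the earlier post.
drop_one "processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll" "libata.force=noncq"
```

Each printed line can be tried for a boot (for example by editing the entry in the GRUB menu). If the system stays up without a given flag, that flag is probably not the one that matters.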

