A machine check exception means a hardware error. It is usually caused by "undercooling" (check all heatsinks and fans), overclocking, undervolting or hardware failure.
Offline
I don't undervolt or overclock. The temperatures seem to be normal.
Offline
I am having the same problems on my (new, bought last Friday) MSI GE62 2QE laptop with an Intel Core i7-5700HQ processor, both with the normal and the LTS kernel.
A nearly sure trigger for me is trying to install a large-ish emacs package through M-x package-install, e.g. ess or yasnippet. Looking at strace output from emacs until the system dies does not reveal anything useful: there are some mmap, munmap, mremap and write calls, as well as some SIGIO handling. I set mce=3 on the kernel command line to ignore machine check exceptions, but it didn't enlighten me. The dmesg output seems to be about stuff just timing out (iwlwifi, or some SATA WRITE QUEUED command) and the system remains completely unresponsive. It seems one of the CPU cores (which sends the machine check interrupt) just locks up.
I ran memcheck for a whole night without any errors. I tried disabling NCQ on my Sandisk X300s SSD, as well as switching it from AHCI to IDE mode, but it didn't help at all. I finally migrated all my partitions from the SSD to the HDD, but I was still able to reproduce the issue (although much more randomly).
Now I am downgrading to a kernel version from before Thursday (4.1.3-1, which, according to the Rollback Machine, hit the repos on 07/29/2015, to be exact), let's see if it helps...
Edit: No, downgrading to 4.1.3-1 does not help, just had a machine check right during boot.
Edit2: Downgrading to 4.0.7 doesn't help either...
Last edited by kris7t (2015-08-19 22:39:42)
Offline
Is it possible that the problem is caused by the apparently buggy Intel SpeedStep on Broadwell processors? See e.g. http://www.phoronix.com/scan.php?page=n … fixed-mode, http://superuser.com/questions/938199/n … o-problems and http://unix.stackexchange.com/questions … -broadwell. After disabling SpeedStep, my machine seems to run fine, but of course, the clock speed is stuck at 2.7GHz, which is unacceptable for some stuff I wanted to do with the machine (running and developing scientific applications). Unfortunately, installing the latest ucode does not fix the problem, and neither does installing the latest kernel.
Edit: The problem persists even after updating to 4.2rc7 (-mainline), which supposedly contains some Broadwell power management fixes.
Last edited by kris7t (2015-08-23 13:19:30)
Offline
I guess switching to the "performance" cpufreq governor doesn't suffice?
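For anyone who wants to check which governor is actually active before and after switching, here is a minimal sketch. It assumes the usual cpufreq sysfs layout under /sys/devices/system/cpu; the function name show_governors is made up for this example.

```shell
# Print the active cpufreq governor for every CPU that exposes one.
# Assumption: the standard sysfs path cpuN/cpufreq/scaling_governor.
show_governors() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # skip if cpufreq is not available
        cpu=${f#"$root"/}                # strip the leading directory
        cpu=${cpu%%/*}                   # keep only "cpuN"
        echo "$cpu: $(cat "$f")"
    done
}

show_governors
```

Switching is normally done as root by writing the governor name (for example performance) into the same scaling_governor file for each CPU.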
Offline
No, and neither does booting with intel_pstate=disable. It seems the good old ACPI scaling driver cannot handle newer Intel processors, so the only way to scale them is to use SpeedStep and intel_pstate.
I was really surprised that even though Phoronix has reported on the issue (and they claimed to have got in touch with the developer of intel_idle), there was no bug report about this issue on the kernel.org bugtracker. Thus, I created a ticket with the machine check exception details I managed to capture while investigating: https://bugzilla.kernel.org/show_bug.cgi?id=103351
Although to tell the truth, I don't really have much hope, and I am thinking about returning the computer to the store to exchange it for a 4720HQ model...
Last edited by kris7t (2015-08-23 16:51:11)
Offline
Well, it seems that the MCEs always happen during idle, so how about processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll?
If this works, you can try something less extreme.
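After rebooting it is worth confirming that the parameters really made it onto the kernel command line. A small sketch, assuming a POSIX shell; the helper name check_idle_params is hypothetical, and with no argument it reads /proc/cmdline of the running kernel.

```shell
# Report which of the three idle-related parameters are present on a
# kernel command line. With no argument, the running kernel's is used.
check_idle_params() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for want in processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll; do
        case " $cmdline " in
            *" $want "*) echo "set:     $want" ;;
            *)           echo "missing: $want" ;;
        esac
    done
}

# Example with a made-up command line; call it without arguments
# to check the kernel you are actually running.
check_idle_params "ro quiet processor.max_cstate=0 idle=poll"
```

The parameters themselves go wherever your bootloader keeps the kernel command line, for instance GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub if you use GRUB.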
Offline
Well, the issues stay even with intel_idle=disable, so I am unsure, but I will try tomorrow anyways, thanks!
Offline
For me, processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll is stable even with SpeedStep enabled (but of course, it increases power consumption substantially). However, I just had a panic during early userspace with processor.max_cstate=0 intel_idle.max_cstate=0 idle=halt. I wonder what happens on agihr's desktop CPU.
Offline
Was this panic the same MCE or something else?
IIRC, on x86 C1 is HALT, so maybe these CPUs simply don't like something about Linux' usage of C1. Normally C1 is used rarely, with idle=poll - never and with idle=halt - all the time.
BTW, this would explain how Linux can bring down Windows machines from inside of a VM.
I wonder if intel_idle could be easily hacked to avoid C1 and go straight to some further state. Does anybody know if Windows uses C1 at all?
I'm not sure what idle=nomwait does, you can try that. Or dropping idle= completely and setting only max_cstate.
Last edited by mich41 (2015-08-24 10:37:56)
Offline
I don't know whether it is the same or not, because this time I had no backtrace or MCE error message on the console, and obviously, the kernel log was not written to disk. I was able to boot the system with both idle=halt and idle=mwait and then trigger a CPU soft lockup (interestingly, not a stall, as previously) by ignoring the MCE with mce=3.
Windows's Performance monitor shows quite a bit of time spent in C1 and many transitions to C1 (although the majority of idle time is actually C2), while another tool (RealTemp TI) reports that the CPU idles in C2 (sometimes) and C3 (mostly). It is possible that the C1/C2/C3 states in the Performance monitor are just generic names, and do not actually correspond to the real C-states of the CPU.
If I have more free time, I will try to comment out some C-states from bdw_cstates in drivers/idle/intel_idle.c and test the system with those. Unfortunately, there are 8 different C-states for Broadwell, which means 256 different combinations to test in theory...
Offline
You may want to add your findings to this bug report and maybe change component to intel_idle since it's definitely Intel-specific and apparently triggered by the idle code, not by cpufreq.
Offline
Great findings guys! I just got an MSI GS60 2QE Ghost Pro, which comes with an Intel Core i7-5700HQ, and the system wasn't stable and froze randomly.
After booting with:
processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll
the system is stable.
Offline
Yes, but it kinda makes the system unusable (the fans spin at full speed, and it guzzles power like crazy).
Offline
Maybe 8 reboots with only one C-state uncommented at a time would quickly reveal the culprit?
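To put numbers on the two approaches: every on/off combination of n states is 2^n kernels, while keeping exactly one state per build is only n. A quick sketch of the arithmetic and the resulting test plan; the function name is hypothetical and the state names are illustrative labels, not copied from the kernel source.

```shell
# Compare exhaustive testing with one-state-at-a-time testing and
# print one planned build per C-state name given as an argument.
plan_single_state_tests() {
    n=$#
    echo "all on/off combinations: $((1 << n))"
    echo "one-state-at-a-time builds: $n"
    for keep in "$@"; do
        echo "build: keep only $keep, comment out the rest"
    done
}

# Eight placeholder names standing in for the entries of bdw_cstates.
plan_single_state_tests C1 C1E C3 C6 C7s C8 C9 C10
```

A single crashing build does not prove much on its own, so each of the n kernels probably deserves more than one boot.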
Offline
I have played around a little bit with 4.2-rc8 and commenting out Broadwell C-states. I decided to try compiling the kernel with NO_HZ_FULL and 1000HZ, as well as setting the CPU governor to performance. Following an observation by Phoronix that Broadwell is stable under Fedora (https://www.phoronix.com/scan.php?page= … d-For-5775), I reviewed Fedora's kernel configs and settings, and these were the only three significant differences from Arch I could find. I might try compiling with Fedora configs and patches (a bit difficult), or even installing Fedora, but for now, I compiled with mostly Arch configs.
grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
confirms (as does the datasheet, http://www.intel.com/content/dam/www/pu … -vol-1.pdf) that Broadwell actually only supports the C1, C1E, C3 and C6 C-states. Even though other C-states, such as C7, are mentioned in the Broadwell part of intel_idle.c, they are actually ignored at probe time, because the CPU reports 0 substates for them. I suspect C1, C3 and C6 directly map to the C1, C2 and C3 states reported by the Windows Performance Monitor.
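If the raw grep output is hard to read, the same information can be condensed to one line per state. This is only a sketch, assuming the standard cpuidle sysfs layout (stateN/name and stateN/latency); list_cstates is a made-up helper name.

```shell
# List the idle states the kernel exposes for one CPU, one per line:
# directory name, state name, exit latency in microseconds.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue     # no cpuidle states exposed
        printf '%s\t%s\t%s us\n' "${state##*/}" \
            "$(cat "$state/name")" \
            "$(cat "$state/latency" 2>/dev/null)"
    done
}

list_cstates
```

States that were ignored at probe time simply do not show up as stateN directories, so the list should be shorter than the table in the driver source.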
Unfortunately, with SpeedStep enabled, I was able to produce a crash with each C-state commented out. The kernel without C6 crashed just before the start of X.org, while no-C1E survived until login. No-C1 died when trying to start emacs. No-C3 was the most promising, but I managed to kill it by repeatedly starting and quitting emacs a few times. Because I only tried each kernel for a single reboot, these effects may be entirely random and may not reflect the general stability of the settings.
Last edited by kris7t (2015-08-30 18:37:35)
Offline
Obvious solution: use Fedora 22 kernel binaries
Or at least steal .config from them.
PS. I suggested only one state uncommented at a time; there's a tiny bit of difference if you think about it
BTW, according to Phoronix, Fedora uses gcc 5.1.1. Something to check if everything sane fails.
Last edited by mich41 (2015-08-30 19:00:28)
Offline
Yes, you are right, the system is not usable with this workaround, but it's stable, so at least we know something about what might be causing this issue.
This post from Phoronix, which explains another workaround, might help some:
http://www.phoronix.com/scan.php?page=n … Fixed-Mode
Offline
Oh yeah, I missed that one. Will try again.
Offline
Hey, on the Phoronix page at: http://www.phoronix.com/scan.php?page=a … 2-of&num=2
They mention "CONFIG_OF" and "CONFIG_CPUFREQ_DT" as causing issues, and CONFIG_OF IS enabled in the arch kernel. I'm going to try compiling a kernel with it disabled and see if that works on my system.
EDIT: If someone else wants to take a crack at it too, that would be appreciated - my only system stable enough to compile it due to this issue is an old AMD Bobcat netbook. And this is my first time messing around with kconfig, so I'm trying to figure that out...
EDIT: NEVERMIND, CONFIG_OF isn't in the x86_64 config.
Last edited by GourdCaptain (2015-08-31 22:37:35)
Avatar by Ditey: https://twitter.com/phrobitey
Offline
I was kinda serious about these Fedora binaries. Why not give them a try if your boxes are unusable anyway?
Offline
Because I'm really not sure how to set that kind of package up (I'm admittedly completely inept at building packages apart from tweaking others' PKGBUILDs). And I installed Fedora (XFCE spin) in the meantime to see if it works (a test that I do know how to do), because I kind of urgently needed a functioning system with more than 1.4 GHz of CPU QUICK.
On the plus side, it seems to be working so far. We'll be in the same test case once I install Bumblebee.
EDIT: I did set it up with separate partitions for root and other stuff, so if I get something to play with I can stick Arch in and see if that works? I'm willing to play guinea pig, and I want to get back to Arch eventually, it's just I've had one desktop die completely this week and my new laptop is unusable under most Linuxes, it seems...
Last edited by GourdCaptain (2015-08-31 23:49:17)
Offline
Oh, I didn't know that you simply switched to Fedora.
To get Fedora kernel in Arch it will probably suffice to manually copy /boot/vmlinuz-something and /lib/modules/4.x.y-fedora-whatever to the Arch partition, chroot or reboot to arch to run mkinitcpio -k 4.x.y-fedora-whatever and add a grub menu entry for this kernel.
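Spelled out as commands, that would look roughly like the following. This is a dry-run sketch that only prints the steps; the kernel version and the Fedora mount point are placeholders you have to fill in, the function name is made up, and the initramfs file name is just a choice made for this example.

```shell
# Print (not execute) the manual steps for borrowing a Fedora kernel.
#   $1 = kernel version string, i.e. the directory name under
#        Fedora's /lib/modules (placeholder, left unspecified here)
#   $2 = where the Fedora root partition is mounted (assumption)
fedora_kernel_steps() {
    kver=$1
    src=$2
    echo "cp $src/boot/vmlinuz-$kver /boot/vmlinuz-$kver"
    echo "cp -a $src/lib/modules/$kver /lib/modules/"
    # -k selects the kernel version, -g the image file to generate
    echo "mkinitcpio -k $kver -g /boot/initramfs-$kver.img"
    # or add the menu entry by hand instead of regenerating the config
    echo "grub-mkconfig -o /boot/grub/grub.cfg"
}

fedora_kernel_steps 4.x.y-fedora-whatever /mnt/fedora
```

Review the printed commands and run them as root from the Arch side; whether grub-mkconfig picks up the extra kernel on its own is not guaranteed, so check the generated menu.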
Offline
I literally did the Fedora install between my first and second posts. (Fedora's gotten substantially nicer for my purposes since I last used it, but I'm still going to get this machine over to Arch. It just jumped up over Linux Mint for personal use as the fallback if I don't feel like messing with Arch. Even getting BUMBLEBEE working wasn't that big of a pain.)
Oh, that makes sense, but it is a bit messier than I'd want to do on something I'd keep. Hmmm... I've not really had the chance to play around with my main install (sorry, I REALLY need this machine), but would it help if I did an Arch Linux install on a USB flash drive and used that to test this? That way I don't have to worry about trashing an install I need to do my job. I could get on that in the next few days.
Man, when I bought this machine, it never even crossed my mind to check if the problem would be the CPU...
Last edited by GourdCaptain (2015-09-01 05:34:19)
Offline