Troubleshoot recent overheating + immediate shutdowns (hard/software?)

jwhendy · 2014-04-15 03:18:48

Since last Friday, I've been experiencing frequent, instant shutdowns due to overheating. I've had these occur before, but generally when running intensive computations in R. I'd initiate a script at bedtime, and wake up to find my computer off (shut down in the night). I learned to put a small book under it, run a fan next to it, or literally place it on its side, resting on the non-fan edge and the lid to expose the whole bottom to "breathe." Recently, however, I have incessant, unexpected, immediate shutdowns during "normal use" (terminal, browser with, say, 4-10 non-video-playing tabs open), or even just updating with pacman.

Unfortunately, use of some proprietary software along with needing some IE-specific sites at work had me in Windows 7 for ~1.5 weeks non-stop. This is really unusual for me, and I prefer to live in Arch as much as possible. Because of this, I didn't get to notice if this would have gradually increased in frequency or started abruptly. The behavior is drastically and noticeably different than anything I've experienced, completely kills the use of Linux for me, and I'd really like to identify if this is something to do with software, or if I'm experiencing a hardware failure of some sort (loose thermal gel?). I did a mess of trials this weekend, with results below. Thanks for any input or suggestions.

System details

- HP 8540w workstation
- Intel i7 Q720 (4 physical cores, 8 virtual)
- Nvidia Quadro FX 1800M
- Using nvidia (334.21-4), nvidia-utils (334.21-4), nvidia-libgl (334.21-7)

- Relevant BIOS options (as far as I can tell):
--- Multi-core: turns off all physical cores other than the first
--- Intel HT Technology: hyperthreading support (appears to turn off all virtual/hyperthreaded cores)
--- Virtualization (unsure if relevant, but affects dmesg output for kvm abilities)

$ uname -a
Linux bigBang 3.14.0-5-ARCH #1 SMP PREEMPT Fri Apr 11 21:02:33 CEST 2014 x86_64 GNU/Linux

$ sudo cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 
ondemand     # same for all other cpus, 1-7

I have acpi installed, but not acpid, laptop tools, cpupower, or others mentioned in the power management or CPU frequency scaling Wiki articles.

Edit: Oops! I caught that I actually did have cpupower installed, from way back in 2012, and removed it last night. I don't recall ever setting it up, and issuing `locate cpupower` now that it's gone doesn't get any hits, so I'm thinking this means I never created a config file for it (which would have yielded, I think, a file.pacsave upon removal). Sorry for the mixup!

Assume everything else is "stock" (in other words, no kernel options appended, etc.). This is pretty much a "vanilla" Arch system other than the applications I've installed. Feel free to ask follow up questions if I've missed something relevant.

System messages

Here are examples of the errors I'm seeing:

$ dmesg | grep -i mce
[    0.040954] mce: CPU supports 9 MCE banks
[18912.069281] mce: [Hardware Error]: Machine check events logged
[19662.531507] mce: [Hardware Error]: Machine check events logged
[20112.808818] mce: [Hardware Error]: Machine check events logged

$ dmesg | grep -i temper
[  367.990859] CPU6: Core temperature above threshold, cpu clock throttled (total events = 1)
[  367.990862] CPU7: Core temperature above threshold, cpu clock throttled (total events = 1)
[  367.991919] CPU6: Core temperature/speed normal
[  367.991922] CPU7: Core temperature/speed normal
[  668.175622] CPU6: Core temperature above threshold, cpu clock throttled (total events = 508)
[  668.175623] CPU7: Core temperature above threshold, cpu clock throttled (total events = 508)
[  668.176683] CPU6: Core temperature/speed normal
[  668.176685] CPU7: Core temperature/speed normal
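To see at a glance which cpus are throttling and how often, the throttle lines can be reduced to one row per cpu. The following is a minimal sketch, not something from the original post: `throttle_summary` is a hypothetical helper name, and it assumes the kernel message wording shown above ("Core temperature above threshold ... (total events = N)").

```sh
#!/bin/sh
# Summarise kernel thermal-throttle messages read from stdin.
# Prints one line per cpu: "<cpu label> <highest total-events value seen>".
# throttle_summary is a hypothetical helper name, not an existing tool.
throttle_summary() {
    awk '
        /Core temperature above threshold/ {
            # Pull out the cpu label, e.g. "CPU6".
            if (!match($0, /CPU[0-9]+/)) next
            cpu = substr($0, RSTART, RLENGTH)
            # Pull out the running counter that follows "total events = " (15 characters).
            if (!match($0, /total events = [0-9]+/)) next
            n = substr($0, RSTART + 15, RLENGTH - 15) + 0
            if (n > max[cpu]) max[cpu] = n
        }
        END { for (cpu in max) print cpu, max[cpu] }
    ' | LC_ALL=C sort
}
```

Typical use would be `dmesg | throttle_summary`, run from the same terminal as the greps above.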

Here are my limits, per lm_sensors:

$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +57.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:       +57.0°C  (high = +84.0°C, crit = +100.0°C)
Core 2:       +57.0°C  (high = +84.0°C, crit = +100.0°C)
Core 3:       +57.0°C  (high = +84.0°C, crit = +100.0°C)

I installed the mcelog package, which shows messages like this:

Hardware event. This is not a software error.
Apr 14 21:36:50 bigBang mcelog[10019]: MCE 0
Apr 14 21:36:50 bigBang mcelog[10019]: CPU 5 THERMAL EVENT TSC 1b184fe88d88
Apr 14 21:36:50 bigBang mcelog[10019]: TIME 1397511379 Mon Apr 14 16:36:19 2014
Apr 14 21:36:50 bigBang mcelog[10019]: Processor 5 heated above trip temperature. Throttling enabled.
Apr 14 21:36:50 bigBang mcelog[10019]: Please check your system cooling. Performance will be impacted
Apr 14 21:36:50 bigBang mcelog[10019]: STATUS 880003c3 MCGSTATUS 0
Apr 14 21:36:50 bigBang mcelog[10019]: MCGCAP 1c09 APICID 5 SOCKETID 0
Apr 14 21:36:50 bigBang mcelog[10019]: CPUID Vendor Intel Family 6 Model 30
Apr 14 21:36:50 bigBang mcelog[10019]: Hardware event. This is not a software error.
Apr 14 21:36:50 bigBang mcelog[10019]: MCE 1
Apr 14 21:36:50 bigBang mcelog[10019]: CPU 4 THERMAL EVENT TSC 1b184fe8a5da
Apr 14 21:36:50 bigBang mcelog[10019]: TIME 1397511379 Mon Apr 14 16:36:19 2014
Apr 14 21:36:50 bigBang mcelog[10019]: Processor 4 heated above trip temperature. Throttling enabled.
Apr 14 21:36:50 bigBang mcelog[10019]: Please check your system cooling. Performance will be impacted
Apr 14 21:36:50 bigBang mcelog[10019]: STATUS 880003c3 MCGSTATUS 0
Apr 14 21:36:50 bigBang mcelog[10019]: MCGCAP 1c09 APICID 4 SOCKETID 0
Apr 14 21:36:50 bigBang mcelog[10019]: CPUID Vendor Intel Family 6 Model 30
Apr 14 21:36:50 bigBang mcelog[10019]: Hardware event. This is not a software error.

Method

In updating my AUR-built packages, I got repeated shutdowns when trying to build pandoc-static, a pretty compiling-heavy package. I googled around quite a bit about thermal events, with nothing conclusively sticking out as the solution. Honestly, it reduced to trying pcie_aspm=0 or pcie_aspm=force -- that's really about all I could find! I thought I would try various things and log the temperature and frequency while trying to build pandoc-static. Here was my final summary of trials conducted:

Key:
- bios virt: "Virtualization Technology" turned on in BIOS?
- bios multi: "Multi-Core" turned on in BIOS?
- intel ht: "Intel HT Technology" turned on in BIOS?
- X: started X with `startx` or run from TTY1?
- cores: which CPUs were active (as per the scheme in /sys/devices/system/cpu/cpu*)?
- success: did pandoc-static build successfully (1), or did the machine hard kill itself (0)?

| log id |   kernel | pcie_aspm       | bios virt | bios multi | intel ht | X? | cores         | success? | note                  |
|--------+----------+-----------------+-----------+------------+----------+----+---------------+----------+-----------------------|
|      4 | 3.14.0-5 | default         |         1 |          1 |        1 |  1 | [0-5]         |        1 |                       |
|      5 | 3.14.0-5 | pcie_aspm=0     |         1 |          1 |        1 |  1 | all           |        0 |                       |
|      6 | 3.14.0-5 | pcie_aspm=force |         1 |          1 |        1 |  1 | all           |        0 |                       |
|      7 | 3.14.0-5 | default         |         1 |          1 |        1 |  1 | all           |        1 |                       |
|      8 | 3.14.0-5 | default         |         1 |          1 |        1 |  1 | 0, 2, 4, 6    |        0 |                       |
|      9 | 3.14.0-5 | default         |         1 |          1 |        1 |  1 | 0, 1, 3, 5, 7 |        0 | can't turn off core 0 |
|     10 | 3.14.0-5 | default         |         1 |          1 |        1 |  1 | [0-5]         |        1 | replicate (4)         |
|     11 |  3.10.36 | default         |         1 |          1 |        1 |  1 | all           |        0 |                       |
|     12 | 3.14.0-5 | default         |         1 |          1 |        1 |  0 | all           |        0 |                       |
|     13 | 3.14.0-5 | default         |         0 |          1 |        1 |  1 | all           |        0 |                       |
|     14 | 3.14.0-5 | default         |         0 |          0 |        1 |  1 | 0-1           |        1 |                       |
|     15 | 3.14.0-5 | default         |         0 |          1 |        0 |  1 | 0-3           |        0 |                       |
|     16 |  3.10.36 | default         |         1 |          1 |        1 |  0 | all           |        0 |                       |
|     17 | 3.14.0-5 | default         |         1 |          1 |        1 |  1 | all           |        0 |                       |
|     18 | 3.14.0-5 | default         |         1 |          1 |        1 |  1 | [0-5]         |        1 | replicate (4)         |

For instances where only certain cores were enabled, the following method was used:

# echo 0 > /sys/devices/system/cpu/cpu6/online

The effect was verified in the output of dmesg for each core:

[ 2024.174235] kvm: disabling virtualization on CPU6
[ 2030.187053] kvm: disabling virtualization on CPU7
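For repeated trials, the same sysfs write can be wrapped in a small function that also reads the value back. This is only a sketch: `set_cpu_online` and the `SYSFS_CPU_ROOT` override are hypothetical names introduced here (the override exists so the function can be tried against a scratch directory), and on the real path it needs root.

```sh
#!/bin/sh
# Set a logical cpu online (1) or offline (0) through sysfs and confirm the result.
# set_cpu_online and SYSFS_CPU_ROOT are hypothetical names used for illustration.
set_cpu_online() {
    cpu=$1
    state=$2
    root=${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}
    ctl="$root/cpu$cpu/online"

    case $state in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac

    # Some cpus (often cpu0) expose no "online" file and cannot be switched off.
    if [ ! -e "$ctl" ]; then
        echo "cpu$cpu has no online control" >&2
        return 1
    fi

    echo "$state" > "$ctl" || return 1
    # Read the value back so a silently ignored write is reported as a failure.
    [ "$(cat "$ctl")" = "$state" ]
}
```

For example, `set_cpu_online 6 0 && set_cpu_online 7 0` as root would reproduce the cpu6/7-off configuration used in runs 4, 10, and 18.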

I tried a couple without X to rule out an Nvidia issue, as I also have a recent issue of closing my lid not suspending anymore. This is apparently an Nvidia issue, so I thought I'd at least try to rule it out. I threw in the LTS kernel just to see if I noticed any differences (I didn't).

In each case, I used the following procedure:
- Fresh boot
- Open one instance of urxvt with three tabs (or three TTYs if running one of the two non-X trials)
- run a cpu-logging script as root in one tab, a dmesg logging script from a regular user tab, and `yaourt pandoc-static` from a third tab
- if the machine hard killed, move on. If not, stop the logger, reboot, and start over (applying any BIOS changes when necessary)

All trials except 16, 17, and 18 were run at my house in the dining room, with the laptop resting on a small book (not obscuring the fan ports underneath). The last three were run today at work, and I put a similarly sized book under the laptop to try and replicate conditions reasonably well.

I wrote a kludgy script (I'm no bash guy...) to collect the data in csv format. It's called by another script (not shown) that runs a continuous while loop every 10 seconds.

#!/bin/sh
# Append one CSV row to the log: time, four core temps, eight cpu frequencies, top 3 processes by %CPU.
# Run as root (cpuinfo_cur_freq is normally readable by root only).

# Sample sensors once so all four temperatures come from the same reading.
readings=$(sensors)
for i in 0 1 2 3
  do
    eval temp$i=$(echo "$readings" | grep "Core $i" | awk '{print $3}' | sed 's/+//; s/°C//')
  done

# An offline cpu may have no cpufreq directory; silence the error and log an empty field.
for i in 0 1 2 3 4 5 6 7
  do
    eval cpu$i=$(cat /sys/devices/system/cpu/cpu$i/cpufreq/cpuinfo_cur_freq 2>/dev/null)
  done

time=$(date +%H:%M:%S)

# Top three processes by CPU share, flattened into "command, %cpu, ..." fields.
top=$(ps aux | awk '{print $11, $3}' | sort -k2rn | head -n 3)
top=$(echo $top | sed s/" "/", "/g)

echo "$time, $temp0, $temp1, $temp2, $temp3, $cpu0, $cpu1, $cpu2, $cpu3, $cpu4, $cpu5, $cpu6, $cpu7, $top" >> /home/jwhendy/temp-freq.log

I also output the results of `dmesg | grep -i mce` and `dmesg | grep -i temper` to two separate files every 10 seconds.
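The wrapper that calls the logger is not shown in the post. A minimal sketch of what such a fixed-interval loop could look like follows; `run_every` is a hypothetical name, and the optional run count is an addition here so the loop can be tried without running forever.

```sh
#!/bin/sh
# Run a command repeatedly with a fixed pause after each run.
# Usage: run_every SECONDS COUNT command [args...]
# COUNT=0 means "loop until interrupted". run_every is a hypothetical helper name.
run_every() {
    interval=$1
    count=$2
    shift 2
    n=0
    while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
        "$@"                 # run the command exactly as given
        n=$((n + 1))
        sleep "$interval"
    done
}
```

Something like `run_every 10 0 ./log-temps.sh` (script name illustrative) would give the 10-second cadence described above. Note the pause is added after each run, so the real period is 10 seconds plus however long the logger takes.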

Results

The result of most interest came when I noticed that it was always cpu6 and cpu7 that showed up in dmesg logs. I found the trick above about disabling specific cores, and decided to try it... the build succeeded! I replicated this effect three different times, which I find significant given that almost nothing else succeeded.

I used R and the ggplot2 package to plot the results, and here are two replicates of "default" Arch 3.14.0 (no changes) vs. the three runs in which I've disabled cores 6-7:

Note the absence of the rapid spikes > 90C, which are entirely the blue lines in the left two plots. That color is for the "Core 3" temperature reported by lm_sensors (fourth core), which I take it is responsible for cpu6 and cpu7. Subjectively, I noticed that the laptop ran with much lower fan speeds, they kicked in later in the build, and simply ran less with cpu6/7 off. Frequencies appear to jump around with both, showing a pretty even spread of them cycling. The plots don't make it very clear, but the first default run succeeded, while the second one hard shutdown. Again, all three with cpu6/7 off succeeded.

Discussion

Case closed? Is this definitely a hardware issue? I'm still not sure, and that's where I'd like some assistance. Having replicated three successes with cpu6 and cpu7 off, I felt pretty confident. But from my current boot (with cpu6/7 off), I checked dmesg after running rsync for a couple hours catching up on backups!

[18661.079071] CPU4: Core temperature above threshold, cpu clock throttled (total events = 1)
[18661.079072] CPU5: Core temperature above threshold, cpu clock throttled (total events = 1)
[18661.080133] CPU4: Core temperature/speed normal
[18661.080135] CPU5: Core temperature/speed normal
[19598.570458] CPU5: Core temperature above threshold, cpu clock throttled (total events = 182)
[19598.570459] CPU4: Core temperature above threshold, cpu clock throttled (total events = 182)
[19598.571517] CPU4: Core temperature/speed normal
[19598.571520] CPU5: Core temperature/speed normal
[20015.016963] CPU4: Core temperature above threshold, cpu clock throttled (total events = 221)
[20015.016965] CPU5: Core temperature above threshold, cpu clock throttled (total events = 221)
[20015.019037] CPU4: Core temperature/speed normal
[20015.019040] CPU5: Core temperature/speed normal

Also, note that run #15 above (disabling hyperthreading in BIOS) turns off cpu4-7; however, the dmesg temp logs show that cpu3 was reporting temperature overages, and that trial resulted in a hard shutdown. I'm not confident in how the system numbers the cpus (are cpu0 and cpu1 the hyperthreads for the first physical core? Is this always the case?). If numbering is consistent and this means that cpu6 and cpu7 are hyperthreads of two different cores, this is also perplexing, since runs #8 and #9 turned off the odd and even cpus (respectively) yet saw no improvement. Both reported one of the suspect cpus (6 for the even run, 7 for the odd), and both hard killed, #8 before it could even log any high spikes.
- Here's a plot of those two runs
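The numbering question can be checked directly rather than guessed: the kernel exposes each logical cpu's physical core and hyperthread siblings under sysfs. The sketch below reads the standard `topology/core_id` and `topology/thread_siblings_list` files; `cpu_topology` is a hypothetical helper name, and the optional directory argument exists only so it can be pointed at a test tree.

```sh
#!/bin/sh
# Print "cpuN core_id=<id> siblings=<list>" for every logical cpu that exposes topology info.
# cpu_topology is a hypothetical helper; the optional argument overrides the sysfs cpu directory.
cpu_topology() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*; do
        # Skip cpus without topology files (this may include cpus that are currently offline).
        [ -r "$dir/topology/core_id" ] || continue
        printf '%s core_id=%s siblings=%s\n' \
            "${dir##*/}" \
            "$(cat "$dir/topology/core_id")" \
            "$(cat "$dir/topology/thread_siblings_list")"
    done
}
```

Two cpus that list each other under `siblings` share one physical core. Running `cpu_topology` with all cpus online would show whether cpu6 and cpu7 sit on the same core on this machine, which in turn says which lm_sensors "Core N" reading they correspond to.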

There's also the issue that this never happens on Windows, or at least the forced shutdowns don't occur. Perhaps there are warnings or logs somewhere showing the same issue, but it's controlling the issue somehow. I'd love suggestions to check; if there aren't any errors reported, that shifts the evidence back to a potential software issue.

Lastly, mcelog says flat out "This is not a software error." Now, I hear that... but I don't understand why changing which hyperthreads are online should change which cores overheat (always the highest numbered one). Or, say, could disabling those cores just disable the associated temperature sensor and it's still just as hot in that area, but it's not being measured/reported?

For now, turning off cpu6 and cpu7 at least gives me a system that doesn't die on me with every action, so Arch remains usable. I've read a bunch of posts on this that frankly aren't much help. They range in content from "One cpu is ~10C higher than others" (to which people say, "Don't worry about it") to advice to simply go in and manually change the cpu temperature threshold values (seems like a horrible idea to me), to aspm fiddling (which seems like it was supposedly fixed in Linux 3.0+ if not before), and changing governors (I'm on ondemand, not performance, which seems reasonable).

I typed a lot above, and that's because I would like the best chance I have to know what this is related to. Alright, 'nuff of the problem. Onto the solution. Many thanks for reading, and any assistance/suggestions you can provide.

Last edited by jwhendy (2014-04-15 16:08:56)

ewaller · 2014-04-15 04:24:02

Dang. That has to be one of the best posts making a statement of a problem that I have ever seen. Bravo.

This system runs well under Windows, correct? It leads me to think this is not a failure of the cooling system. This cooling system is also not going to be simple. It probably uses a heat exchanger that uses mass flow to transfer heat from the processor to a remote heat sink that will have forced air cooling. The interface between the heat exchanger (heat pipe) will use a high performance bi-phase type thermal layer -- not some cheesy thermal paste. Maybe I skimmed past it in your post, but what has changed on your system in the last couple weeks? You might post the relevant parts of your pacman log.

jwhendy · 2014-04-15 15:47:11

@ewaller: thanks for the feedback! Yeah, this was one of my more obsessive ones for sure. I admit that data collection and analysis is both part of my job and a hobby.

Re. Windows: subjectively, I think Windows runs with the fan much less (but always has, not a recent thing). It's also much slower and far less responsive (issue something like alt+tab to change windows and wait 1-3 full seconds for it to catch up). I could run something intensive in Win7 and check the logs to see if anything similar shows up (could Win7 just be controlling it somehow?). What surprises me is that Arch is catching that the cpus are over critical and throttling them, but they seem to spike so fast from ~80C to 100C that it can't do anything but hard kill.

Re. hardware: thanks for your take. I plan to take this apart and brush/blow any dust out, so I'll take a look while I'm in there for anything obviously loose or weird.

Re. what's changed: I've wondered that myself and am just drawing a blank. I tried to check for over-temp reports in journalctl, but unfortunately, have SystemMaxUse=250M in /etc/systemd/journald.conf. It appears that all of my experiments this weekend purged older entries, so journalctl starts at 2014-04-12 -- bummer. Still, I haven't changed anything major that I'm aware of. Just booted up, started doing normal stuff (updating packages, chromium, etc.), and I noticed the fan ramp up like crazy. Then that horrid "peeew" sound when the computer kills itself. Thought it was a fluke, but happened again and again.

Re. pacman logs: here's my pacman log for what I think is a relevant period. Prior to 2014-04-09, the previous entry was 2014-03-31 (that period was the living-in-Windows time I mentioned). I omitted some more recent entries since the problem was already present as of Apr 09-10, and a majority of more recent entries are related to trying to get nouveau to work with acceleration (lots of installing/un-installing nvidia/nouveau to troubleshoot that issue -- in progress!).

[2014-04-09 17:48] [PACMAN] Running 'pacman -Syu'
[2014-04-09 17:48] [PACMAN] synchronizing package lists
[2014-04-09 17:48] [PACMAN] starting full system upgrade
[2014-04-09 17:54] [PACMAN] upgraded apr-util (1.5.3-3 -> 1.5.3-4)
[2014-04-09 17:54] [PACMAN] upgraded readline (6.3-3 -> 6.3.003-2)
[2014-04-09 17:54] [PACMAN] upgraded bash (4.3-3 -> 4.3.008-2)
[2014-04-09 17:54] [PACMAN] upgraded boost-libs (1.55.0-4 -> 1.55.0-5)
[2014-04-09 17:54] [PACMAN] upgraded openssl (1.0.1.f-2 -> 1.0.1.g-1)
[2014-04-09 17:54] [PACMAN] upgraded ca-certificates (20140223-2 -> 20140325-1)
[2014-04-09 17:54] [PACMAN] upgraded cpupower (3.13-2 -> 3.14-1)
[2014-04-09 17:54] [PACMAN] upgraded cups-filters (1.0.50-1 -> 1.0.52-1)
[2014-04-09 17:54] [PACMAN] installed libavc1394 (0.5.4-2)
[2014-04-09 17:54] [PACMAN] installed libiec61883 (1.2.0-4)
[2014-04-09 17:54] [PACMAN] installed gstreamer0.10-good-plugins (0.10.31-5)
[2014-04-09 17:54] [PACMAN] upgraded farstream-0.1 (0.1.2-3 -> 0.1.2-4)
[2014-04-09 17:54] [PACMAN] upgraded fftw (3.3.3-2 -> 3.3.4-1)
[2014-04-09 17:54] [PACMAN] upgraded sqlite (3.8.4.2-1 -> 3.8.4.3-1)
[2014-04-09 17:54] [PACMAN] upgraded nss (3.15.5-2 -> 3.16-1)
[2014-04-09 17:54] [PACMAN] upgraded flashplugin (11.2.202.346-1 -> 11.2.202.350-1)
[2014-04-09 17:54] [PACMAN] upgraded ghostscript (9.10-3 -> 9.14-1)
[2014-04-09 17:54] [PACMAN] upgraded gnutls (3.2.12.1-1 -> 3.2.13-1)
[2014-04-09 17:54] [PACMAN] upgraded groff (1.22.2-5 -> 1.22.2-6)
[2014-04-09 17:54] [PACMAN] upgraded hplip (3.14.3-2 -> 3.14.4-1)
[2014-04-09 17:55] [ALPM-SCRIPTLET] >>> Updating module dependencies. Please wait ...
[2014-04-09 17:55] [ALPM-SCRIPTLET] >>> Generating initial ramdisk, using mkinitcpio.  Please wait...
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Building image from preset: /etc/mkinitcpio.d/linux.preset: 'default'
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> -k /boot/vmlinuz-linux -c /etc/mkinitcpio.conf -g /boot/initramfs-linux.img
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Starting build: 3.13.8-1-ARCH
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [base]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [udev]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [autodetect]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [block]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [encrypt]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [filesystems]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [keyboard]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [fsck]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [shutdown]
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Generating module dependencies
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Creating gzip initcpio image: /boot/initramfs-linux.img
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Image generation successful
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Building image from preset: /etc/mkinitcpio.d/linux.preset: 'fallback'
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> -k /boot/vmlinuz-linux -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-fallback.img -S autodetect
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Starting build: 3.13.8-1-ARCH
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [base]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [udev]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [block]
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> WARNING: Possibly missing firmware for module: aic94xx
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> WARNING: Possibly missing firmware for module: smsmdtv
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [encrypt]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [filesystems]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [keyboard]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [fsck]
[2014-04-09 17:55] [ALPM-SCRIPTLET]   -> Running build hook: [shutdown]
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Generating module dependencies
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Creating gzip initcpio image: /boot/initramfs-linux-fallback.img
[2014-04-09 17:55] [ALPM-SCRIPTLET] ==> Image generation successful
[2014-04-09 17:55] [PACMAN] upgraded linux (3.13.7-1 -> 3.13.8-1)
[2014-04-09 17:55] [PACMAN] upgraded linux-headers (3.13.7-1 -> 3.13.8-1)
[2014-04-09 17:55] [PACMAN] upgraded man-pages (3.63-1 -> 3.64-1)
[2014-04-09 17:55] [ALPM] warning: /etc/pacman.d/mirrorlist installed as /etc/pacman.d/mirrorlist.pacnew
[2014-04-09 17:55] [PACMAN] upgraded pacman-mirrorlist (20140107-1 -> 20140405-1)
[2014-04-09 17:55] [PACMAN] upgraded python2-numpy (1.8.0-3 -> 1.8.1-1)
[2014-04-09 17:55] [PACMAN] upgraded r (3.0.2-2 -> 3.0.3-1)
[2014-04-09 17:55] [PACMAN] upgraded xorg-xauth (1.0.8-1 -> 1.0.9-1)
[2014-04-09 17:55] [PACMAN] upgraded yajl (2.0.4-2 -> 2.1.0-1)
[2014-04-09 18:03] [PACMAN] Running 'pacman --color auto -Sy'
[2014-04-09 18:03] [PACMAN] synchronizing package lists
[2014-04-10 13:45] [PACMAN] Running 'pacman --color auto -Sy'
[2014-04-10 13:45] [PACMAN] synchronizing package lists
[2014-04-10 13:45] [PACMAN] Running 'pacman -Syu'
[2014-04-10 13:45] [PACMAN] synchronizing package lists
[2014-04-10 13:45] [PACMAN] starting full system upgrade
[2014-04-10 13:46] [PACMAN] upgraded kmod (16-1 -> 17-1)
[2014-04-10 13:46] [PACMAN] upgraded libsystemd (212-1 -> 212-2)
[2014-04-10 13:46] [PACMAN] upgraded coreutils (8.22-3 -> 8.22-4)
[2014-04-10 13:46] [PACMAN] upgraded libutil-linux (2.24.1-4 -> 2.24.1-6)
[2014-04-10 13:46] [PACMAN] upgraded util-linux (2.24.1-4 -> 2.24.1-6)
[2014-04-10 13:46] [PACMAN] upgraded systemd (212-1 -> 212-2)
[2014-04-10 13:46] [PACMAN] upgraded chromium (33.0.1750.152-1 -> 34.0.1847.116-1)
[2014-04-10 13:46] [PACMAN] upgraded x265 (0.8-2 -> 0.9-1)
[2014-04-10 13:46] [PACMAN] upgraded ffmpeg (1:2.2-2 -> 1:2.2-3)
[2014-04-10 13:47] [PACMAN] upgraded git (1.9.1-1 -> 1.9.2-1)
[2014-04-10 13:47] [ALPM-SCRIPTLET] >>> Updating module dependencies. Please wait ...
[2014-04-10 13:47] [ALPM-SCRIPTLET] >>> Generating initial ramdisk, using mkinitcpio.  Please wait...
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Building image from preset: /etc/mkinitcpio.d/linux.preset: 'default'
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> -k /boot/vmlinuz-linux -c /etc/mkinitcpio.conf -g /boot/initramfs-linux.img
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Starting build: 3.14.0-4-ARCH
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [base]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [udev]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [autodetect]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [block]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [encrypt]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [filesystems]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [keyboard]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [fsck]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [shutdown]
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Generating module dependencies
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Creating gzip initcpio image: /boot/initramfs-linux.img
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Image generation successful
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Building image from preset: /etc/mkinitcpio.d/linux.preset: 'fallback'
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> -k /boot/vmlinuz-linux -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-fallback.img -S autodetect
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Starting build: 3.14.0-4-ARCH
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [base]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [udev]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [block]
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> WARNING: Possibly missing firmware for module: aic94xx
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> WARNING: Possibly missing firmware for module: smsmdtv
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [encrypt]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [filesystems]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [keyboard]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [fsck]
[2014-04-10 13:47] [ALPM-SCRIPTLET]   -> Running build hook: [shutdown]
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Generating module dependencies
[2014-04-10 13:47] [ALPM-SCRIPTLET] ==> Creating gzip initcpio image: /boot/initramfs-linux-fallback.img
[2014-04-10 13:48] [ALPM-SCRIPTLET] ==> Image generation successful
[2014-04-10 13:48] [PACMAN] upgraded linux (3.13.8-1 -> 3.14-4)
[2014-04-10 13:48] [PACMAN] upgraded linux-headers (3.13.8-1 -> 3.14-4)
[2014-04-10 13:48] [PACMAN] upgraded lirc-utils (1:0.9.0-70 -> 1:0.9.0-71)
[2014-04-10 13:48] [PACMAN] upgraded nvidia (334.21-2 -> 334.21-4)
[2014-04-10 13:48] [PACMAN] upgraded systemd-sysvcompat (212-1 -> 212-2)
[2014-04-10 13:51] [PACMAN] Running 'pacman -Syu'
[2014-04-10 13:51] [PACMAN] synchronizing package lists
[2014-04-10 13:51] [PACMAN] starting full system upgrade
[2014-04-10 13:51] [PACMAN] Running 'pacman -Sc'
[2014-04-10 14:16] [PACMAN] Running 'pacman --color auto -Sy'
[2014-04-10 14:16] [PACMAN] synchronizing package lists
[2014-04-10 14:21] [PACMAN] Running 'pacman --color auto -U /tmp/yaourt-tmp-jwhendy/PKGDEST.K0M/chromium-libpdf-1:34.0.1847.116-1-x86_64.pkg.tar.xz'
[2014-04-10 14:21] [PACMAN] upgraded chromium-libpdf (1:33.0.1750.149-1 -> 1:34.0.1847.116-1)
[2014-04-10 14:26] [PACMAN] Running 'pacman --color auto -U /tmp/yaourt-tmp-jwhendy/PKGDEST.R9f/dropbox-2.6.25-1-x86_64.pkg.tar.xz'
[2014-04-10 14:26] [PACMAN] upgraded dropbox (2.6.13-1 -> 2.6.25-1)
[2014-04-10 15:41] [PACMAN] Running 'pacman -Syu'
[2014-04-10 15:41] [PACMAN] synchronizing package lists
[2014-04-10 15:41] [PACMAN] Running 'pacman --color auto -Sy'
[2014-04-10 15:41] [PACMAN] synchronizing package lists
[2014-04-10 15:43] [PACMAN] Running 'pacman --color auto -S -u'
[2014-04-10 15:43] [PACMAN] starting full system upgrade
[2014-04-10 15:43] [PACMAN] upgraded libjpeg-turbo (1.3.0-4 -> 1.3.1-1)
[2014-04-10 15:43] [PACMAN] upgraded mpd (0.18.9-1 -> 0.18.10-1)
[2014-04-10 15:43] [PACMAN] upgraded python2-urwid (1.2.0-2 -> 1.2.1-1)
[2014-04-10 15:43] [PACMAN] Running 'pacman --color auto -S --asdeps --needed extra/itstool'
[2014-04-10 15:44] [PACMAN] installed itstool (2.0.2-1)
[2014-04-10 15:53] [PACMAN] Running 'pacman --color auto -U /tmp/yaourt-tmp-jwhendy/PKGDEST.Ikd/evince-gtk-3.12.0-1-x86_64.pkg.tar.xz'
[2014-04-10 15:53] [PACMAN] upgraded evince-gtk (3.10.3-1 -> 3.12.0-1)
[2014-04-10 16:08] [PACMAN] Running 'pacman --color auto -U /tmp/yaourt-tmp-jwhendy/PKGDEST.7Bs/jdk-8u0-1-x86_64.pkg.tar.xz'
[2014-04-10 16:11] [PACMAN] Running 'pacman -S lm_sensors'
[2014-04-10 16:12] [ALPM-SCRIPTLET] ==> Updating desktop MIME database...
[2014-04-10 16:12] [ALPM-SCRIPTLET] ==> Updating MIME database...
[2014-04-10 16:12] [ALPM-SCRIPTLET] ==> Updating icon cache...
[2014-04-10 16:12] [PACMAN] upgraded jdk (7.51-1 -> 8u0-1)
[2014-04-10 16:12] [PACMAN] Running 'pacman --color auto -S --asdeps --needed extra/cabal-install extra/ghc community/alex community/happy'
[2014-04-10 16:13] [PACMAN] installed cabal-install (1.18.0.2-1)
[2014-04-10 16:13] [PACMAN] installed ghc (7.6.3-1)
[2014-04-10 16:13] [PACMAN] installed alex (3.1.3-1)
[2014-04-10 16:13] [PACMAN] installed happy (1.19.3-1)
[2014-04-10 16:14] [PACMAN] Running 'pacman --color auto -U /tmp/yaourt-tmp-jwhendy/PKGDEST.DcA/zukitwo-themes-20140329-1-any.pkg.tar.xz'
[2014-04-10 16:14] [PACMAN] upgraded zukitwo-themes (20131210-1 -> 20140329-1)
[2014-04-10 16:14] [PACMAN] Running 'pacman -Syu'
[2014-04-10 16:14] [PACMAN] synchronizing package lists
[2014-04-10 16:15] [PACMAN] starting full system upgrade
[2014-04-10 16:16] [PACMAN] Running 'pacman -Rsc itstool happy ghc alex cabal-install'
[2014-04-10 16:16] [PACMAN] removed cabal-install (1.18.0.2-1)
[2014-04-10 16:16] [PACMAN] removed alex (3.1.3-1)
[2014-04-10 16:16] [PACMAN] removed ghc (7.6.3-1)
[2014-04-10 16:16] [PACMAN] removed happy (1.19.3-1)
[2014-04-10 16:16] [PACMAN] removed itstool (2.0.2-1)
[2014-04-10 17:40] [PACMAN] Running 'pacman -Rsc gnome-themes-standard unixobdc'
[2014-04-10 17:40] [PACMAN] Running 'pacman -Rsc gnome-themes-standard unixodbc'
[2014-04-10 17:40] [PACMAN] removed unixodbc (2.3.2-1)
[2014-04-10 17:40] [PACMAN] removed gnome-themes-standard (3.10.0-1)
[2014-04-10 17:40] [PACMAN] removed cantarell-fonts (0.0.15-1)
[2014-04-10 17:40] [PACMAN] Running 'pacman -Sc'
[2014-04-11 12:51] [PACMAN] Running 'pacman -S gnome-themes-standard'
[2014-04-11 12:52] [PACMAN] installed cantarell-fonts (0.0.15-1)
[2014-04-11 12:52] [PACMAN] installed gnome-themes-standard (3.10.0-1)

I found a post about the graphics driver affecting the laptop temperature, which is prompting me to re-run the test above with nouveau. Unfortunately, I can't startx without noaccel=1, and I get horrible performance with it. The solution there was to use the kernel option `radeon.dpm=1`; the closest thing I could find for nvidia is DPMS, which I seem to have enabled, so perhaps that isn't relevant/related?

$ grep -i dpms /var/log/Xorg.0.log
[    52.969] Initializing built-in extension DPMS
[    56.794] (==) NVIDIA(0): DPMS enabled

While I was looking for how far back the overheating went, I also stumbled across some other errors in journalctl like this:

Apr 15 08:10:07 bigBang kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Apr 15 08:10:07 bigBang kernel: ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052f conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20131218/utaddress-258)
Apr 15 08:10:07 bigBang kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Apr 15 08:10:07 bigBang kernel: ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053f conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20131218/utaddress-258)
Apr 15 08:10:07 bigBang kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Apr 15 08:10:07 bigBang kernel: ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054f conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20131218/utaddress-258)
Apr 15 08:10:07 bigBang kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Apr 15 08:10:07 bigBang kernel: ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042f conflicts with OpRegion 0x0000000000000400-0x000000000000047f (\PMIO) (20131218/utaddress-258)

This post mentions a similar warning, but doesn't really get resolved. Here's another bug about that, which is both recent and also unresolved (active). The bug, however, says nothing about what effect it actually has! I'm not sure if there's anything undesirable about it other than an ugly warning...

Action items for me
- Try with nouveau (X and no-X)
- I'll be trying one of these utilities on Windows 7 shortly to see if the cpu temp is reported there.

Thanks again for the response/suggestions!

ewaller · 2014-04-15 15:56:58

The standouts from your pacman log are the kernel itself and cpupower. You may want to downgrade those and see if that provides any clues.
BTW, you said in post 1 that you did not have cpupower installed, but it seems that you do.

jwhendy · 2014-04-15 16:05:47

@ewaller: Yeah... I actually caught that last night inadvertently and removed it -- I didn't know I had it! I just checked, and I installed it in Oct 2012; no idea why, and sorry for the mis-information. I plan to re-run my compiler tests with the following settings:
- Re-run of "default" (with cpupower now gone)
- Run with nouveau and/or downgraded nvidia

Would downgrading the kernel do something other than using the arch-lts kernel? I ran two tests with 3.10.36 (lts) in my original trials (see table of trials, #11 and #16).

Edit: changed original post with mixup about cpupower! Thanks for catching that, ewaller.

Last edited by jwhendy (2014-04-15 16:09:28)

ewaller · 2014-04-15 16:10:34

jwhendy wrote:

Would downgrading the kernel do something other than using the arch-lts kernel? I ran two tests with 3.10.36 (lts) in my original trials (see table of trials, #11 and #16).

I missed that. No, it would just put you back in that old state. I am running out of ideas, but let me ponder some.

jwhendy · 2014-04-15 22:47:27

Well, sunova! I tried a program called CoreTemp on Windows (found via this SuperUser post) while running various non-scientific, non-repeatable things to stress the CPU (creating a tile plot in R of a laser profilometry scan containing 1.6 million rows, running `summary(rnorm(100000000))` to generate 100 million random numbers and compute summary statistics, opening several videos at the same time, etc.). The temperature, frequency, and load of the four cores were recorded to .csv every 10 seconds. Here's the plot.

Not quite as clear as the Arch data, but I have to say it appears that core_3 (4th core) hit higher temperatures in every case (log01-log04 were four different "trials"). The first showed the most distinct difference.

Then again, even with it spiking high, it's not as high as on Arch, so either Windows is controlling things better or I'm not stressing things the same way that compilation would on Arch. On that note, I have `MAKEFLAGS=-j4` in /etc/makepkg.conf.

I don't have any compiler tools on Windows (and don't care to install whatever it's called -- mingw?), so if there's an equivalently stressful test to what compiling pandoc-static might do on Arch, I'd be happy to replicate to see if the effect is consistent on Windows.

Let me know if the plot seems to indicate that this actually is repeatable and thus more likely to be a hardware issue. I haven't found an obvious source of information for checking Windows logs somewhere to see if it's reported. That would be awesome. Maybe it's just that I don't run enough computationally heavy stuff on Windows compared to Arch? Still, Windows sure the hell acts like I'm stressing the hell out of it. Running those R commands above and two videos locked up my system (mouse moved, but I couldn't click anything, alt+tab, or ctrl+alt+del for 10 minutes at which point I hard killed it with the power button).

Edit: I brought home a can of compressed air with me to see if I can just clean stuff out. I plan to take off the bottom as well as the keyboard to thoroughly clean, and will re-run the default test on Arch tonight. Also, I re-ran with nouveau installed and without starting X and the build failed, so I don't think it's graphics/X/Nvidia related.

Last edited by jwhendy (2014-04-15 22:49:08)

jwhendy · 2014-04-16 19:17:33

Cleaned things out and originally thought it made a difference, as my cores were all at 52C, so I re-ran my pandoc-static compilation test. It succeeded; however, I still got 5 over-temp reports on cpu6 and cpu7, as usual. I'm thinking the compressed air can might have chilled things down a bit, and it just took a while to re-heat. Here's the plot comparing a representative "before" trial (Arch 3.14.0 kernel, nvidia, after startx, all BIOS options enabled) to my after-cleaning run. No significant difference, and the issue is still on the Core 3 temp.

I'm still torn on whether this is hardware or software. Windows sort of replicated the anomaly; however, it's not hard killing itself either...
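
For comparing runs, the over-temp reports can be tallied per logical CPU rather than counted by eye. This is only a rough sketch: the function name `count_overtemp` is made up for illustration, and it assumes the kernel messages contain text of the form `CPU6: Core temperature above threshold`, which may differ between kernel versions.

```shell
#!/bin/bash
# Sketch: count thermal-throttle messages per logical CPU from log text on stdin.
# ASSUMPTION: matching lines contain "CPU<n>: Core temperature above threshold".
count_overtemp() {
    grep -oE 'CPU[0-9]+: Core temperature above threshold' \
        | cut -d: -f1 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'   # print "CPU<n> <count>"
}

# Usage (kernel messages from the current boot):
# journalctl -k -b | count_overtemp
```

If the counts stay concentrated on the same two logical CPUs across kernels and drivers, that would lean toward hardware; it would not prove it.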

jwhendy · 2014-04-24 16:09:50

Just curious if there were any further thoughts on this. Still torn:
- Windows showing possible evidence of similar high temps on the 4th core
- Then again, Windows isn't hard killing my computer whereas Arch will, repeatedly.

Doing `echo 0 > /sys/devices/system/cpu/cpu[6/7]/online` is still working to provide a usable computer that doesn't turn off unexpectedly.
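
Spelled out, the `cpu[6/7]` shorthand above means one write per logical CPU. A minimal sketch of that as a loop follows; the function name `set_cpus_online` and the `CPU_SYSFS_ROOT` override are hypothetical (the override only exists so the loop can be tried against a scratch directory), and writing to the real files needs root.

```shell
#!/bin/bash
# Sketch: write 0 (offline) or 1 (online) to each listed CPU's "online" file.
# CPU_SYSFS_ROOT is a hypothetical override for dry runs; the default is the real sysfs path.
set_cpus_online() {
    local value="$1"; shift
    local root="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"
    local n f
    for n in "$@"; do
        f="$root/cpu$n/online"
        if [ -w "$f" ]; then
            echo "$value" > "$f"
        else
            echo "cannot write $f (not root, or no such CPU?)" >&2
            return 1
        fi
    done
}

# Usage (as root): take the two hot logical CPUs offline, then bring them back
# set_cpus_online 0 6 7
# set_cpus_online 1 6 7
```

The setting does not persist across reboots, so it has to be re-applied after each boot.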

Dmitrii · 2014-12-24 00:12:27

jwhendy wrote:

Since last Friday, I've been experiencing frequent, instant shutdowns due to overheating.

You must clean the dust out of your computer.
Don't be afraid, this is a common problem.
My advice is to use a vacuum cleaner with reverse thrust!
If you need a specific model, just ask me, but only through email.

Last edited by Dmitrii (2014-12-24 00:12:55)
