#1 2017-08-25 14:40:30

Purgator
Member
Registered: 2016-03-02
Posts: 102

Shut down due to wrong thermal information

Hello everyone!

I experienced an unexpected shutdown today, caused by overheating according to the kernel logs. The computer shut down properly, by the way.

The issue here is that my CPU was around 60 C or less, not 130 C as the log said. It's the first time this issue has shown up.

Here is the log :

thermal thermal_zone0: critical temperature reached(130 C),shutting down

My computer is a 2-year-old laptop.
Right after the shutdown, I touched the hot part of the laptop with my hand, and it was not that hot.

Do you think the sensor thermal_zone0 could be anything other than the CPU? Something that I can't touch with my hand? Or can we assume it's a bug?

What can I do to make sure the shutdown does not happen again?

Thank you.

#2 2017-08-25 14:44:49

ugjka
Member
From: Latvia
Registered: 2014-04-01
Posts: 1,808
Website

Re: Shut down due to wrong thermal information

Check thermal paste


https://ugjka.net
paru > yay | webcord > discord
pacman -S spotify-launcher
mount /dev/disk/by-...

#3 2017-08-25 14:51:12

Purgator
Member
Registered: 2016-03-02
Posts: 102

Re: Shut down due to wrong thermal information

As I said, my CPU is not hot: it is between 50 C and 60 C, not 130 C as in the log. I monitor my CPU temp all the time with a widget.

#4 2017-08-25 15:36:18

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,791

Re: Shut down due to wrong thermal information

What processor is it?  What tool are you using to monitor it?


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way

#5 2017-08-25 15:45:21

Purgator
Member
Registered: 2016-03-02
Posts: 102

Re: Shut down due to wrong thermal information

CPU is an Intel N3545 (baytrail)

My monitoring tool is the stock xfce4 sensors plugin with acpitz-0.

BTW the tool is accurate: my CPU is at 56 °C atm, and when I touch a laptop with my hand, I can feel the difference between 56 °C and 130 °C, just in case you were going to ask ;)
I have touched a lot of radiators in my life, btw.

EDIT: Hmm, something interesting in my widget: I can choose between acpitz-0, coretemp-0 and ACPI. Inside ACPI there is a sensor for thermal_zone0. I have added it to my widget. 51 °C ATM.

Last edited by Purgator (2017-08-25 15:47:58)

#6 2017-08-25 15:51:52

ugjka
Member
From: Latvia
Registered: 2014-04-01
Posts: 1,808
Website

Re: Shut down due to wrong thermal information

Install lm_sensors and watch the temps with "watch sensors" in a terminal.


#7 2017-08-25 16:57:00

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,791

Re: Shut down due to wrong thermal information

Short primer on thermodynamics. Heat will flow from a hot source to a cool sink. The rate at which it flows is an ugly equation dominated by the difference in temperature and the "thermal resistance". In some ways, it can be modeled as an electrical circuit, where the temperature differential is the "voltage" and the thermal path is a bunch of series resistors. These "resistors" include the thermal resistance between the junction of the transistors in your processor (the heat source) and the case of your processor. Another "resistor" is the thermal resistance of the thermal paste between the processor and the heat sink. Then there is the thermal mass and resistance of the heat sink. Finally, there is the inlet air temperature of your cooling air, the exhaust temperature, and the amount of airflow over your heat sink. What you can feel with your hand is the temperature of the heatsink as seen through the thermal resistance of whatever your computer's case is made of.

In general, I would expect the heatsink fins to be about 10C hotter than the outside of the computer case. I would expect the base of the heat sink to be another 10C hotter. At this point, the thermal path cross section starts to get small -- from maybe 15 sq cm on the heat sink to maybe 4 sq cm at the top of the CPU. This drives the thermal resistance up, so let's assume maybe a 5C or 10C drop over the thermal paste. Now the thermal path cross section gets really small, maybe 0.5 sq cm. Anyway, I would throw in another 10 to 15C drop to the die. What I am driving at is that if the heatsink fins are at 56C (which is where you might be able to put your hand), the CPU temperature could well be 30 to 50C higher. In addition, if CPU activity peaks, the die temperature can rise at startlingly high rates -- several hundred degrees per second kind of rates; the cooling system cannot keep up with this, and therefore the high activity can only be maintained as long as there is sufficient thermal mass near the die to absorb the spike.

The temperature in this room is in the mid 20 C range. My processor temperature is reporting 49C; it is an 8 core processor. If I compile something with a single core, using i7z, I can see that core temperature jump to 90C in much less than a second. This cannot be detected with sensors outside the processor for several seconds.
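
The series-resistor picture above can be put into numbers. The sketch below is only an illustration: it adds up the rough per-stage temperature drops quoted in this post (estimates, not measurements), and the stage names and the function name are made up for the example.

```python
# Rough per-stage temperature rises in degrees C, as (low, high) estimates
# taken from the figures quoted in the post. Stage names are illustrative only.
STAGES_FROM_FINS_TO_DIE = [
    ("fins -> heatsink base", (10, 10)),
    ("heatsink base -> top of CPU (thermal paste)", (5, 10)),
    ("top of CPU -> die", (10, 15)),
]


def die_temperature_range(fin_temp_c, stages=STAGES_FROM_FINS_TO_DIE):
    """Add up the series temperature drops, like voltages over series resistors."""
    low = fin_temp_c + sum(lo for _, (lo, _) in stages)
    high = fin_temp_c + sum(hi for _, (_, hi) in stages)
    return low, high


if __name__ == "__main__":
    low, high = die_temperature_range(56)
    print(f"fins at 56 C -> die roughly {low} to {high} C")
```

With the fins at 56C, the quoted drops add up to a margin of 25 to 35C at the die, which overlaps the rough 30 to 50C estimate; a short load spike would come on top of that.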


#8 2017-08-25 17:43:52

olive
Member
From: Belgium
Registered: 2008-06-22
Posts: 1,490

Re: Shut down due to wrong thermal information

You can disable the shutdown feature by passing thermal.nocrt=1 to the kernel.

But I am perplexed. Are you sure your CPU is not overheating? Where do you get the temperature from? If the log says you reached 130C, it means that some sensor read 130C. Maybe this information is wrong, but are you sure that the 56°C is correct? Have you installed and configured lm_sensors? What does it say? Are you sure you are not overheating?

Note that what you say is plausible. I think the BIOS would shut down the system itself if the laptop were really overheating, without the kernel having a chance to issue messages. But that's only my guess.
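
thermal.nocrt=1 goes on the kernel command line (for example through the bootloader configuration). As a minimal sketch, assuming a Linux system where /proc/cmdline is readable, the following checks whether the running kernel was actually booted with it. The helper name is made up for the example, and the simple whitespace split ignores quoted values that contain spaces.

```python
def has_kernel_param(cmdline, name, value=None):
    """Return True if `name` (optionally `name=value`) appears on the command line."""
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == name and (value is None or val == value):
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""  # not on Linux, or /proc is not available
    print("thermal.nocrt=1 set:", has_kernel_param(cmdline, "thermal.nocrt", "1"))
```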

Last edited by olive (2017-08-25 18:12:36)

#9 2017-08-25 18:30:51

seth
Member
Registered: 2012-09-03
Posts: 51,229

Re: Shut down due to wrong thermal information

1. The core temp should be the relevant one; the thermal zone is the thermal budget ("can I heat up a second core?")
=> WATCH *ALL* TEMPERATURES USING LMSENSORS, not some stupid "widget" some teenager might have clicked together.
This can already fail because of a wide poll interval - as ewaller suggested, the core temperature can rise *really* fast if the cooling system has a malfunction (or you e.g. forcefully lowered the fan speed or something)

2. let's assume the CPU is not actually overheating: 130°F are about 72°C ..

#10 2017-08-25 19:53:21

loqs
Member
Registered: 2014-03-06
Posts: 17,372

Re: Shut down due to wrong thermal information

seth wrote:

2. let's assume the CPU is not actually overheating: 130°F are about 72°C ..

thermal thermal_zone0: critical temperature reached(130 C),shutting down

C implies Celsius to me. Also, 72 seems somewhat low to me for a shutdown temp, even if it was TCase rather than TJunction.

#11 2017-08-25 19:56:28

seth
Member
Registered: 2012-09-03
Posts: 51,229

Re: Shut down due to wrong thermal information

The idea is a unit bug. Like your cpu reaches 130°F and that triggers a 130°C emergency.
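
To make the unit mix-up hypothesis concrete, here is the standard conversion as a small sketch (the function names are just for illustration). Note that by this formula 130 °F works out to about 54 °C rather than 72 °C, which is inside the 50 to 60 °C range reported for this laptop.

```python
def fahrenheit_to_celsius(temp_f):
    """Standard conversion: C = (F - 32) * 5 / 9."""
    return (temp_f - 32) * 5 / 9


def celsius_to_fahrenheit(temp_c):
    """Inverse conversion: F = C * 9 / 5 + 32."""
    return temp_c * 9 / 5 + 32


if __name__ == "__main__":
    print(f"130 F = {fahrenheit_to_celsius(130):.1f} C")
    print(f"72 C  = {celsius_to_fahrenheit(72):.1f} F")
```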

#12 2017-08-28 11:52:10

Purgator
Member
Registered: 2016-03-02
Posts: 102

Re: Shut down due to wrong thermal information

I will record with lm_sensors, but honestly guys, I was not asking for help recording my temps. I'm 100% sure that the reading was wrong. I'm not posting random information.
I would accept that the CPU or something else could overheat under a high CPU load, but I was just using Gmail...
I don't know how to explain it to you, but some things are evident when you are sitting in front of your computer: it doesn't have any cooling issue, it is cool. Sometimes I have pushed it very hot and it never shut down.

Next step: lm_sensors... but then I will want to fix the bug.

@olive I prefer not to use thermal.nocrt=1

#13 2017-08-29 10:05:35

Purgator
Member
Registered: 2016-03-02
Posts: 102

Re: Shut down due to wrong thermal information

I have run lm-sensors and it reports the same temp as my widget.

BTW :
[purgator@legendance ~]$ pacman -Qi xfce4-sensors-plugin
Name            : xfce4-sensors-plugin
Version         : 1.2.6-3
Description     : A lm_sensors plugin for the Xfce panel
Architecture    : x86_64
URL             : http://goodies.xfce.org/projects/panel- … ors-plugin
Licenses        : GPL2
Groups          : xfce4-goodies
Provides        : None
Depends On      : xfce4-panel  lm_sensors  libnotify  hicolor-icon-theme
Optional Deps   : hddtemp: for monitoring the temperature of hard drives
Required By     : None
Optional For    : None
Conflicts With  : None
Replaces        : None
Installed Size  : 474.00 KiB
Packager        : Evangelos Foutras <evangelos@foutrelis.com>
Build Date      : Sat 07 May 2016 06:55:46 AM CEST
Install Date    : Sun 08 May 2016 08:40:15 PM CEST
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

My widget uses lm-sensors.

So now can we talk about my issue? :D

Last edited by Purgator (2017-08-29 10:05:47)

#14 2017-08-29 13:40:09

seth
Member
Registered: 2012-09-03
Posts: 51,229

Re: Shut down due to wrong thermal information

I have run lm-sensors

This binary does not exist. https://bbs.archlinux.org/viewtopic.php?id=57855

Post the output of "sensors -u"

As mentioned, I suspect something confuses fahrenheit and celsius - do you get anywhere near 72°C?

#15 2017-08-29 14:35:06

Purgator
Member
Registered: 2016-03-02
Posts: 102

Re: Shut down due to wrong thermal information

It's all in Celsius, as I wrote. I don't think the system is stupid enough to confuse Celsius and Fahrenheit.

I never reached 72 °C, or maybe a long time ago and I don't remember.

BTW I have run sensors-detect.

[purgator@legendance ~]$ sensors -u
acpitz-virtual-0
Adapter: Virtual device
temp1:
  temp1_input: 58.000
  temp1_crit: 120.000

coretemp-isa-0000
Adapter: ISA adapter
Core 0:
  temp2_input: 52.000
  temp2_max: 105.000
  temp2_crit: 105.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 52.000
  temp3_max: 105.000
  temp3_crit: 105.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 53.000
  temp4_max: 105.000
  temp4_crit: 105.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 53.000
  temp5_max: 105.000
  temp5_crit: 105.000
  temp5_crit_alarm: 0.000
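
The raw output above is easy to post-process. The following is a minimal sketch, assuming the "sensors -u" layout shown here (chip name, Adapter line, unindented feature label, indented "name: value" lines); the function names are made up for the example and the embedded sample is a shortened copy of the output above. It computes how far each reading is from its critical limit. For comparison, the 130 C in the kernel message would be above the 120 C critical limit listed for acpitz.

```python
SAMPLE = """\
acpitz-virtual-0
Adapter: Virtual device
temp1:
  temp1_input: 58.000
  temp1_crit: 120.000

coretemp-isa-0000
Adapter: ISA adapter
Core 0:
  temp2_input: 52.000
  temp2_max: 105.000
  temp2_crit: 105.000
  temp2_crit_alarm: 0.000
"""


def parse_sensors_u(text):
    """Parse `sensors -u` style text into {(chip, label): {subfeature: value}}."""
    readings = {}
    chip = label = None
    for line in text.splitlines():
        if not line.strip():
            chip = label = None                 # a blank line ends a chip block
        elif chip is None:
            chip = line.strip()                 # first line of a block is the chip name
        elif line.startswith("Adapter:"):
            continue
        elif not line.startswith(" "):
            label = line.rstrip(":").strip()    # e.g. "temp1" or "Core 0"
            readings[(chip, label)] = {}
        elif label is not None:
            name, _, value = line.strip().partition(": ")
            # drop the "tempN_" prefix so the keys are "input", "crit", ...
            readings[(chip, label)][name.split("_", 1)[1]] = float(value)
    return readings


def headroom(readings):
    """Degrees C between each current reading and its critical limit."""
    return {
        key: vals["crit"] - vals["input"]
        for key, vals in readings.items()
        if "crit" in vals and "input" in vals
    }


if __name__ == "__main__":
    for (chip, label), margin in headroom(parse_sensors_u(SAMPLE)).items():
        print(f"{chip} {label}: {margin:.1f} C below critical")
```

On a live system the same functions can be fed the real output, for example via subprocess.run(["sensors", "-u"], capture_output=True, text=True).stdout.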

#16 2017-08-29 18:36:25

seth
Member
Registered: 2012-09-03
Posts: 51,229

Re: Shut down due to wrong thermal information

Well, the message is actually directly from the thermal module, so either there's a bug in that module (try the lts kernel) or your system in fact is overheating.
The inability to feel that on the case means nothing - it's probably overheating because of the inability to spread the heat.
The problem with lmsensors etc. is that they'll likely run at a much lower polling rate.

=> If there's a bug, this should not happen with the lts kernel.
If it does happen with the lts kernel, there's a major chance of a hardware issue.
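
Regarding the polling rate: a minimal sketch of a faster poll, assuming the zone in question is exposed at /sys/class/thermal/thermal_zone0/temp (sysfs reports this value in millidegrees Celsius). The function names and the default interval are arbitrary choices for the example, and the firmware may refresh the value less often than it is read here, so this can still miss a short spike.

```python
import time
from pathlib import Path

# Assumed location of the zone named in the kernel message.
ZONE_TEMP = Path("/sys/class/thermal/thermal_zone0/temp")


def read_celsius(path=ZONE_TEMP):
    """Read a sysfs thermal zone temperature (stored in millidegrees Celsius)."""
    return int(Path(path).read_text().strip()) / 1000.0


def poll_peak(path=ZONE_TEMP, duration_s=10.0, interval_s=0.05, read=read_celsius):
    """Sample the zone every `interval_s` seconds; return (peak, number of samples)."""
    peak = float("-inf")
    samples = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, read(path))
        samples += 1
        time.sleep(interval_s)
    return peak, samples
```

For example, poll_peak(duration_s=60) run alongside a load test returns the highest value seen in that minute, at roughly 20 samples per second.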

#17 2017-08-29 20:07:09

olive
Member
From: Belgium
Registered: 2008-06-22
Posts: 1,490

Re: Shut down due to wrong thermal information

As I understand it, the thermal module gets its info from ACPI, i.e. the firmware, while sensors reads the sensors directly. I would think that Purgator is right and that the laptop is not overheating. I think the firmware would shut down the machine itself if this were the case. Was the laptop fan running fast before the shutdown (the fans are normally managed by the firmware according to the temperature)?

Maybe this is a bug in the thermal module but more probably there is a problem with acpi, maybe a buggy acpi implementation. What's the output of acpi -t? This will show the temperature as reported by acpi vs the temperature directly read from the sensors.

If the module actually read 130°C, there isn't much that you can hope for here. It is a bug that needs to be reported to the kernel developers, and the only sensible solution is to blacklist the thermal module or at least prevent it from shutting down the machine, but you don't want to do that.
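
To see what the kernel's thermal layer itself exposes for each zone (its type, current temperature and trip points), the sysfs entries can be listed directly. This is a rough sketch, assuming the usual /sys/class/thermal/thermal_zone*/ layout with type, temp and trip_point_N_type / trip_point_N_temp files; the helper names are made up, and unreadable files are reported as None.

```python
from pathlib import Path


def _read(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def _milli_to_c(text):
    """sysfs reports millidegrees Celsius; convert to degrees, pass None through."""
    return int(text) / 1000.0 if text not in (None, "") else None


def describe_zones(base="/sys/class/thermal"):
    """Return one dict per thermal zone with its type, temperature and trip points."""
    zones = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        trips = []
        for type_file in sorted(zone.glob("trip_point_*_type")):
            # trip_point_N_type sits next to trip_point_N_temp
            temp_file = type_file.with_name(type_file.name[: -len("type")] + "temp")
            trips.append((_read(type_file), _milli_to_c(_read(temp_file))))
        zones.append({
            "zone": zone.name,
            "type": _read(zone / "type"),
            "temp_c": _milli_to_c(_read(zone / "temp")),
            "trips": trips,
        })
    return zones


if __name__ == "__main__":
    for zone in describe_zones():
        print(zone)
```

The "critical" trip listed for a zone is the threshold the shutdown message refers to, so it is worth comparing it with the temp1_crit value that sensors reports.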

#18 2017-08-30 06:00:45

seth
Member
Registered: 2012-09-03
Posts: 51,229

Re: Shut down due to wrong thermal information

"acpitz-virtual-0" ;-)

If this was a singular incident, one of the sensors could have had a hiccup.
Not sure whether the system would grant the OS a shot before cutting power for a hard reset.

#19 2017-08-30 09:02:41

Purgator
Member
Registered: 2016-03-02
Posts: 102

Re: Shut down due to wrong thermal information

I can't really use the lts kernel ATM.

I think I was not 100% clear, but I want to say that the issue happened only once. I don't remember if the fan was active because there was noise in the room.
About the fan: it starts spinning at 53-54 °C on acpitz-0.
I have had this laptop for 2 years and I have touched its bottom many times at many temperatures.
When 53 °C is displayed, the computer does not feel that hot.
When 65+ °C is displayed (I have to play a video, for example), it feels hot when I touch the computer.
When it's about 60, it feels a bit hot... etc.

The higher the displayed temp, the hotter it feels.

The cooling system is 100% working fine, I just don't know how to explain it.

By the way, I can't reproduce the bug.

[purgator@legendance ~]$ acpi -t
Thermal 0: ok, 52.0 degrees C

#20 2017-08-30 10:41:35

seth
Member
Registered: 2012-09-03
Posts: 51,229

Re: Shut down due to wrong thermal information

https://www.archlinux.org/packages/comm … 4/cpuburn/

If you cannot reproduce it even this way, this is going nowhere. There's been a singular incident which led to a (likely) false temperature report. This could be a very weird bug in the linux kernel, a hardware bug, physical stress, microwaves, cosmic rays, voodoo ...

If you want to avoid the system being shut down for such sensor errors, you need to deactivate the kernel feature.
If there's only one sensor, its validity cannot be cross-checked. You either trust it (and act) or you don't (and ignore it)

#21 2017-08-30 14:40:59

Purgator
Member
Registered: 2016-03-02
Posts: 102

Re: Shut down due to wrong thermal information

Someone is doing something nasty, someone is voodooing your computer!

I think it would have been interesting to understand the issue anyway.

I will run a burn test.
