#1 2020-03-03 20:10:08

mutcianm
Member
Registered: 2020-03-03
Posts: 3

Intel CPU freeze under load with kernel 5.x - kernel panic

My Dell Latitude 7480 (i7-7600U CPU @ 2.80GHz) started freezing under load (such as parallel compilation) some time after upgrading from linux-lts 4.x to 5.x.
When the freeze happens, the display becomes static, the mouse cursor is unresponsive, and I cannot switch to another virtual console via Ctrl+Alt+F1..F12; only the Caps Lock LED keeps blinking, indicating that a kernel panic has occurred.
The kernel panic is 100% reproducible unless the mitigations described later are applied.

No panic with kernel 4.19.71-1-lts
Panic with kernel 5.5.6.arch1-1


Remotely intercepting kernel messages via netconsole during the freeze yielded this:

[  348.924171] Disabling lock debugging due to kernel taint
[  348.924188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: ba00000011000402
[  348.924190] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f66c75a2552> 
[  348.924192] mce: [Hardware Error]: TSC 1083d31156a 
[  348.924194] mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1583262110 SOCKET 0 APIC 0 microcode ca
[  348.924196] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[  348.924198] mce: [Hardware Error]: Machine check: Processor context corrupt
[  348.924199] Kernel panic - not syncing: Fatal machine check
[  348.924207] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  348.924211] Rebooting in 30 seconds..

When decoded using mcelog, this gives the following info:

CPU 0 BANK 4 TSC 763fee2ddf8
RIP !INEXACT! 33:7f4405025e34
MISC 0
TIME 1583241581 Tue Mar  3 16:19:41 2020
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 402
STATUS ba00000009000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 142 Step 9
SOCKET 0 APIC 0 microcode ca
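
For anyone who wants to cross-check the mcelog output by hand, here is a minimal Python sketch that decodes the raw STATUS, MCGSTATUS and PROCESSOR values from the logs above. The bit positions are taken from the IA32_MCi_STATUS, IA32_MCG_STATUS and CPUID signature layouts in the Intel SDM; the function names are my own and it only covers the architectural bits, not the model-specific fields.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check chapter).
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_mcg_status(mcg):
    """Return the names of the flags set in an MCG_STATUS value."""
    return [name for bit, name in MCG_FLAGS.items() if (mcg >> bit) & 1]


def decode_cpuid_signature(sig):
    """Return (family, model, stepping) from a CPUID.1:EAX signature."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if family in (6, 15):
        model |= ext_model << 4   # extended model only applies to these families
    if family == 15:
        family += ext_family
    return family, model, stepping


# Example calls with the values from the logs above:
#   decode_mci_status(0xba00000009000402)
#   decode_mcg_status(5)
#   decode_cpuid_signature(0x806e9)
```

The flags decoded from ba00000009000402 line up with the "Uncorrected error", "Error enabled", "MCi_MISC register valid" and "Processor context corrupt" lines that mcelog printed, and the 806e9 signature corresponds to the "Family 6 Model 142 Step 9" line.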

My first guess was that it had to do with CPU cooling, so I tried replacing the thermal interface and monitoring CPU frequency and core temperature.
When the load kicks in and Turbo Boost bumps the frequency up to 3.9GHz, core temperatures spike to 100°C but quickly drop to ~85°C @ ~3.1GHz and stay constant. Some time after that, the panic happens.
However, I've witnessed the same thermal behaviour on kernel 4.19, and it doesn't panic even at those temperatures.
Bringing down the maximum CPU frequency as well as disabling Turbo Boost completely seems to have some positive effect, but the performance penalty of this workaround is too high.
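
In case someone wants to reproduce the frequency and temperature monitoring, here is a rough Python sketch of one way to do it. It is not necessarily the tool used for the numbers above: it assumes the standard cpufreq and thermal sysfs files (scaling_cur_freq in kHz, thermal_zone*/temp in millidegrees Celsius), and the function names and the sysfs_root parameter are made up for illustration.

```python
from pathlib import Path


def read_int(path):
    """Read a sysfs file that holds a single integer."""
    return int(Path(path).read_text().strip())


def sample(sysfs_root="/sys"):
    """Return ({cpu: GHz}, {zone: degrees C}) read from cpufreq and thermal sysfs."""
    root = Path(sysfs_root)

    freqs = {}
    pattern = "devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"
    for f in sorted(root.glob(pattern)):
        cpu = f.parent.parent.name               # e.g. "cpu0"
        freqs[cpu] = read_int(f) / 1_000_000     # sysfs reports kHz

    temps = {}
    for zone in sorted(root.glob("class/thermal/thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()   # e.g. "x86_pkg_temp"
            temps[f"{zone.name}:{kind}"] = read_int(zone / "temp") / 1000
        except (OSError, ValueError):
            continue  # some zones refuse reads; skip them
    return freqs, temps


# Example: poll once per second while the load is running.
#   import time
#   while True:
#       print(sample())
#       time.sleep(1)
```

Logging this to a file on another machine (or over ssh) keeps the last samples before a freeze, since the local disk may not get synced when the panic hits.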

Workarounds I've tried:

  • updating BIOS - no effect

  • booting with intel_idle.max_cstate=1 - no effect

  • booting with intel_pstate=disable - no effect

  • removing xf86-video-intel - no effect

  • booting with mitigations=off - no effect

  • lowering max CPU speed to 3600MHz - reduced frequency of crashes

  • disabling turbo boost - severely reduced frequency of crashes
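
For reference, the last two workarounds in the list can be applied at runtime through sysfs. The Python sketch below is only one possible way to do it and not necessarily how it was done here: it assumes the intel_pstate driver is active (the no_turbo file does not exist otherwise), it needs root, and the function names and the sysfs_root parameter are hypothetical.

```python
from pathlib import Path


def set_turbo(enabled, sysfs_root="/sys"):
    """Enable or disable Turbo Boost via the intel_pstate no_turbo knob."""
    knob = Path(sysfs_root) / "devices/system/cpu/intel_pstate/no_turbo"
    # The knob is inverted: writing 1 disables turbo, 0 enables it.
    knob.write_text("0" if enabled else "1")


def cap_max_freq(mhz, sysfs_root="/sys"):
    """Cap scaling_max_freq on every CPU; returns the files that were written."""
    pattern = "devices/system/cpu/cpu[0-9]*/cpufreq/scaling_max_freq"
    changed = []
    for knob in sorted(Path(sysfs_root).glob(pattern)):
        knob.write_text(str(mhz * 1000))   # cpufreq expects kHz
        changed.append(knob)
    return changed


# Example (as root):
#   set_turbo(False)      # disable Turbo Boost
#   cap_max_freq(3600)    # limit every core to 3600MHz
```

Both settings are lost on reboot, so they are easy to undo while testing.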

Last edited by mutcianm (2020-03-04 15:44:43)

#2 2020-03-13 15:36:24

paulkerry
Member
From: Sheffield, UK
Registered: 2014-10-02
Posts: 611

Re: Intel CPU freeze under load with kernel 5.x - kernel panic

Hi
To start off with, update your system - 5.5.6.arch1-1 is old: 5.5.8.arch1-1 is currently in core and 5.5.9.arch1-2 is in testing as of writing. 4.19.71-1-lts is ancient as well.
Also don't post truncated logs!

#3 2020-04-27 04:37:36

Kempniu
Member
Registered: 2020-04-27
Posts: 4

Re: Intel CPU freeze under load with kernel 5.x - kernel panic

I am observing identical symptoms on a sibling Latitude model (7280).  I believe these crashes are caused by the i915 kernel driver.  I described my findings in a different post.  In short: as of kernel 5.6.6, the only workaround that works reliably for me is blacklisting the i915 module, but that cripples certain laptop features.

#4 2020-06-25 10:57:27

mutcianm
Member
Registered: 2020-03-03
Posts: 3

Re: Intel CPU freeze under load with kernel 5.x - kernel panic

These MCE-related random freezes are still happening with kernel 5.7.2-arch1-1.
Blacklisting i915 is not really a workaround in my case, since I need graphics acceleration.

#5 2020-06-26 16:17:40

paulkerry
Member
From: Sheffield, UK
Registered: 2014-10-02
Posts: 611

Re: Intel CPU freeze under load with kernel 5.x - kernel panic

mutcianm wrote:

These MCE-related random freezes are still happening with kernel 5.7.2-arch1-1

You are still using an old kernel - have you tried with the latest linux and linux-lts packages?

Have you installed microcode - https://wiki.archlinux.org/index.php/Microcode ?

How about posting a full un-truncated log...

 journalctl -b

#6 2021-08-12 05:59:50

Kempniu
Member
Registered: 2020-04-27
Posts: 4

Re: Intel CPU freeze under load with kernel 5.x - kernel panic

Kempniu wrote:

I am observing identical symptoms on a sibling Latitude model (7280).  I believe these crashes are caused by the i915 kernel driver.  I described my findings in a different post.  In short: as of kernel 5.6.6, the only workaround that works reliably for me is blacklisting the i915 module, but that cripples certain laptop features.

I seem to have found another workaround which makes my machine stable (at the cost of increased power consumption): setting the i915.enable_dc=0 kernel parameter. More details here.
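
After adding the parameter to the bootloader configuration and rebooting, it is worth confirming that the running kernel actually received it. A minimal Python sketch, assuming the usual space-separated key=value format of /proc/cmdline (it does not handle quoted values containing spaces, and the function names are my own):

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter to its value (None for bare flags)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def running_kernel_params(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    with open(path) as f:
        return parse_cmdline(f.read())


# Example:
#   params = running_kernel_params()
#   print(params.get("i915.enable_dc"))   # "0" if the workaround is active
```

The same check works for the other parameters mentioned in this thread, such as intel_idle.max_cstate or intel_pstate.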
