My Dell Latitude 7480 (i7-7600U CPU @ 2.80GHz) started freezing under load (such as parallel compilation) some time after upgrading from linux-lts 4.x to 5.x.
When the freeze happens, the display stops updating, the mouse cursor is unresponsive, and I cannot switch to another virtual console via Ctrl+Alt+F1..F12; only the Caps Lock LED keeps blinking, indicating that a kernel panic has occurred.
The kernel panic is 100% reproducible unless the mitigations described below are applied.
No panic with kernel 4.19.71-1-lts.
Panic with kernel 5.5.6.arch1-1.
Remotely intercepting kernel messages via netconsole during the freeze yielded this:
[ 348.924171] Disabling lock debugging due to kernel taint
[ 348.924188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: ba00000011000402
[ 348.924190] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f66c75a2552>
[ 348.924192] mce: [Hardware Error]: TSC 1083d31156a
[ 348.924194] mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1583262110 SOCKET 0 APIC 0 microcode ca
[ 348.924196] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 348.924198] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 348.924199] Kernel panic - not syncing: Fatal machine check
[ 348.924207] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 348.924211] Rebooting in 30 seconds..
When decoded using mcelog, this gives the following info:
CPU 0 BANK 4 TSC 763fee2ddf8
RIP !INEXACT! 33:7f4405025e34
MISC 0
TIME 1583241581 Tue Mar 3 16:19:41 2020
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 402
STATUS ba00000009000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 142 Step 9
SOCKET 0 APIC 0 microcode ca
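For anyone who wants to check the mcelog decoding by hand, here is a minimal Python sketch that unpacks the raw status value from the netconsole log. It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, MCA error code in bits 15:0, model-specific code in bits 31:16). The function and field names are made up for illustration and are not part of mcelog.

```python
# Bit positions of the architectural flags in IA32_MCi_STATUS
# (assumed from the Intel SDM, not taken from this thread).
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a raw 64-bit MCi_STATUS value into flags and error codes."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["mca_code"] = status & 0xFFFF            # bits 15:0
    decoded["model_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return decoded


if __name__ == "__main__":
    info = decode_mci_status(0xBA00000011000402)
    print("flags set:", [name for name in FLAGS if info[name]])
    print("MCA error code:", hex(info["mca_code"]))
    print("model-specific code:", hex(info["model_code"]))
```

For 0xba00000011000402 this reports VAL, UC, EN, MISCV and PCC set, ADDRV clear, and an MCA error code of 0x402, which agrees with the mcelog lines above.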
My first guess was that it had to do with CPU cooling, so I tried replacing the thermal interface material and monitoring CPU frequency and core temperature.
When the load kicks in and Turbo Boost bumps the frequency up to 3.9GHz, core temperatures spike to 100°C but quickly drop to ~85°C @ ~3.1GHz and stay constant. Some time after that, the panic happens.
However, I've seen the same thermal behaviour on kernel 4.19, and it doesn't panic even at these temperatures.
Lowering the maximum CPU frequency, as well as disabling Turbo Boost completely, seems to have some positive effect, but the performance penalty of this workaround is too high.
Workarounds I've tried:
updating BIOS - no effect
booting with intel_idle.max_cstate=1 - no effect
booting with intel_pstate=disable - no effect
removing xf86-video-intel - no effect
booting with mitigations=off - no effect
lowering max CPU speed to 3600MHz - reduced frequency of crashes
disabling Turbo Boost - greatly reduced frequency of crashes
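The last two workarounds in the list can be applied at runtime through sysfs. The post does not say how the limits were set, so the following is only a sketch under assumptions: it assumes the intel_pstate driver is active (which provides `intel_pstate/no_turbo`) and uses cpufreq's per-CPU `scaling_max_freq`, which takes a value in kHz. The helper name `limit_cpu` and its parameters are hypothetical. On a real system it has to run as root; the `sysfs_root` argument lets you try it against a scratch directory first.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def limit_cpu(max_khz=None, no_turbo=None, sysfs_root=SYSFS_CPU):
    """Cap the per-CPU maximum frequency and/or toggle turbo boost.

    Returns the list of sysfs files that were written.
    """
    root = Path(sysfs_root)
    written = []

    if max_khz is not None:
        # One cpufreq directory per logical CPU: cpu0/cpufreq, cpu1/cpufreq, ...
        for node in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_max_freq")):
            node.write_text(f"{int(max_khz)}\n")
            written.append(node)

    if no_turbo is not None:
        # Assumption: this file only exists when intel_pstate is in use.
        node = root / "intel_pstate" / "no_turbo"
        if node.exists():
            node.write_text("1\n" if no_turbo else "0\n")
            written.append(node)

    return written
```

With these assumptions, `limit_cpu(max_khz=3_600_000)` corresponds to the 3600MHz cap and `limit_cpu(no_turbo=True)` to disabling Turbo Boost. Values written this way do not survive a reboot.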
Last edited by mutcianm (2020-03-04 15:44:43)
Offline
Hi
To start off with, update your system - 5.5.6.arch1-1 is old: 5.5.8.arch1-1 is currently in core and 5.5.9.arch1-2 is in testing as of writing. 4.19.71-1-lts is ancient as well.
Also don't post truncated logs!
Offline
I am observing identical symptoms on a sibling Latitude model (7280). I believe these crashes are caused by the i915 kernel driver. I described my findings in a different post. In short: as of kernel 5.6.6, the only workaround that works reliably for me is blacklisting the i915 module, but that cripples certain laptop features.
Offline
These MCE-related random freezes are still happening with kernel 5.7.2-arch1-1.
Blacklisting i915 is not really a workaround in my case, since I need graphics acceleration.
Offline
You are still using an old kernel - have you tried with the latest linux and linux-lts packages?
Have you installed the microcode updates? See https://wiki.archlinux.org/index.php/Microcode
How about posting a full, untruncated log:
journalctl -b
Offline
I seem to have found another workaround which makes my machine stable (at the cost of increased power consumption): setting the i915.enable_dc=0 kernel parameter. More details here.
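After adding the parameter to the bootloader configuration and rebooting, it is worth confirming that the running kernel actually received it. A minimal Python sketch that parses a kernel command line such as the contents of /proc/cmdline; the helper names are hypothetical, and `shlex` is used on the assumption that quoted values follow shell-like quoting:

```python
import shlex


def parse_cmdline(cmdline: str) -> dict:
    """Turn a kernel command line into {parameter: value}.

    Bare flags such as "quiet" map to None.
    """
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def running_cmdline(path="/proc/cmdline") -> dict:
    """Parse the command line of the currently running kernel."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

Calling `running_cmdline().get("i915.enable_dc")` should return "0" once the workaround is active, and None if the parameter did not make it onto the command line. The same check works for the other parameters mentioned in this thread, such as intel_idle.max_cstate or mitigations.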
Offline