Another iteration on random freezes

lncrz · 2020-02-15 20:14:35

Hello,

I am experiencing issues related to the ones described here:
https://bbs.archlinux.org/viewtopic.php?id=236686 (reports for Baytrail/Skylake/Broadwell processors)
https://bbs.archlinux.org/viewtopic.php?id=245608 (reports for AMD Ryzen processors)

Error description:
- The system will freeze randomly, usually a few seconds after boot, or after a few keypresses.
- As in the related issues, the system won't react to anything; only a hard reset will work. While hanging, the display does not update anymore and any music that was playing starts looping the same 2 seconds or so. The caps lock key starts blinking, indicating a kernel panic.
- Unlike the related issues, the freeze happens very fast, so I am not able to input anything significant after login.
- Booting with the "emergency" kernel parameter results in a stable system. Booting with "rescue" does not.
- Setting intel_idle.max_cstate=0 (or 1) does not fix the problem.
- The journal does not seem to contain any relevant information, much like what is described in the related topics. However, since I am only able to read it from the emergency shell, it is difficult for me to reproduce its contents here.

System information:
- Laptop model: DELL Latitude 7480.
- CPU: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor (Kaby Lake-based, if I understand correctly).
- Intel ucode is updated normally.
- My system is pretty much vanilla Arch, no tweaking was made.

Some history:
- With kernel 4.X I only had random freezes on suspend, which I could fix by disabling suspend.
- With kernel 5.1 I had random freezes just like the ones described in the related threads (they would happen after a few minutes to a few hours of uptime). Setting intel_idle.max_cstate=0 did fix the problem.
- Out of laziness, I downgraded to kernel 4.20 (iirc) and kept it like this until today.
- With kernel 5.5 (as of now) freezes will happen much faster and will not be fixed by disabling intel_idle.

Some observations:
- If my laptop is plugged into a DELL docking station (and *not* into a regular power outlet), then no freezes occur at all. This also fixed the problems with the earlier kernel versions.
- Booting in emergency shell does show the following error message after a few seconds:

pcieport 0000:00:1c.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
pcieport 0000:00:1c.4: AER:   device [8086:9d14] error status/mask=00000001/00002000
pcieport 0000:00:1c.4: AER:    [ 0] RxErr                    (First)
pcieport 0000:00:1c.4: can't change power state from D3cold to D0 (config space inaccessible)

which does not seem to have any consequence. I do not know if this is correlated to the freezes.
- Booting in rescue mode shows the same error message after a few seconds, followed by a kernel panic:

pcieport 0000:00:1c.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
pcieport 0000:00:1c.4: AER:   device [8086:9d14] error status/mask=00000001/00002000
pcieport 0000:00:1c.4: AER:    [ 0] RxErr                    (First)
pcieport 0000:00:1c.4: can't change power state from D3cold to D0 (config space inaccessible)
pci_bus 0000:05: busn_res: [bus 05] is released
pci_bus 0000:06: busn_res: [bus 06-3a] is released
pci_bus 0000:3b: busn_res: [bus 3b] is released
pci_bus 0000:04: busn_res: [bus 04-3b] is released
Disabling lock debugging due to kernel taint
mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: ba00000011000402
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff84154370> {mwait_idle+0x80/0x200}
mce: [Hardware Error]: TSC 2e3f6ff1cc
mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1581785993 SOCKET 0 APIC 0 microcode ca
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
[drm] [transcoder EDP] PSR aux error
Kernel Offset: 0x2800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Rebooting in 30 seconds..

This kernel panic message does not seem to show every time. It seems to indicate a hardware problem, which is weird given the absence of problems when plugged into a docking station, or with earlier kernels.
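
For reading the "error status/mask=00000001/00002000" line by hand, here is a minimal sketch that decodes the two values as PCIe AER Correctable Error Status and Mask bits. The bit table is my understanding of the standard AER layout and of the short names the kernel prints (such as RxErr); the function name is made up for illustration.

```python
# Bits of the PCIe AER Correctable Error Status / Mask registers, with the
# short names used in kernel AER messages (assumed from the standard layout).
AER_CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
}


def decode_aer_correctable(value_hex):
    """Return the names of all known bits set in a hex register value."""
    value = int(value_hex, 16)
    return [
        name
        for bit, name in sorted(AER_CORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]


# Values copied from the "error status/mask=" log line.
status_hex, mask_hex = "00000001/00002000".split("/")
print("status:", decode_aer_correctable(status_hex))
print("mask:  ", decode_aer_correctable(mask_hex))
```

Under these assumptions the status decodes to a single Receiver Error (matching the "[ 0] RxErr" line), and the mask only hides Advisory Non-Fatal errors, so the decode adds no information beyond what the kernel already printed.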

I would greatly appreciate any help on how to diagnose and/or fix this problem.
Thank you!

Kempniu · 2020-04-27 04:33:51

I can confirm identical symptoms on a Dell Latitude 7280, including the docking station aspect. After spending some time looking into this, I believe it is yet another chapter in the never-ending story of problems with the Intel GPU kernel driver, i915.

For me, the hard lockups also started happening with kernel 5.5, with machine check exceptions like this one being logged through netconsole:

mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: ba00000011000402
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffad5a82e8> {mutex_unlock+0x18/0x30}
mce: [Hardware Error]: TSC 29e93cf5d61 
mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1586929374 SOCKET 0 APIC 0 microcode ca
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
Kernel Offset: 0x2bc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
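
Both reports show the same status value (ba00000011000402) and the same PROCESSOR signature (806e9). As a rough sketch, assuming the architectural IA32_MCi_STATUS flag layout and the usual CPUID family/model/stepping encoding, these can be unpacked by hand as below. The function names are hypothetical, and the model-specific bits are left undecoded.

```python
# Architectural flag bits of IA32_MCi_STATUS (bits 63..57).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds valid information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # the MCi_MISC register is valid
    58: "ADDRV",  # the MCi_ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status_hex):
    """Split an MCi_STATUS value into flag names and error-code fields."""
    status = int(status_hex, 16)
    flags = [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_cpuid_signature(sig_hex):
    """Return (family, model, stepping) from a CPUID signature in hex."""
    sig = int(sig_hex, 16)
    family = (sig >> 8) & 0xF
    model = (sig >> 4) & 0xF
    stepping = sig & 0xF
    if family in (0x6, 0xF):
        model |= ((sig >> 16) & 0xF) << 4  # extended model
    if family == 0xF:
        family += (sig >> 20) & 0xFF       # extended family
    return family, model, stepping


info = decode_mci_status("ba00000011000402")
print(info["flags"], hex(info["mca_error_code"]), hex(info["model_specific_code"]))
print(decode_cpuid_signature("806e9"))
```

For this value the set flags are VAL, UC, EN, MISCV and PCC; the PCC bit is what the kernel reports as "Processor context corrupt". The signature decodes to family 6, model 0x8e, stepping 9.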

I also initially assumed that these were hardware issues, but I never got a single BSoD under Windows running on the same machine (and my understanding is that an actual hardware issue would result in one sooner or later). I started bisecting the kernel, hoping to find a single offending commit, but I only got as far back as 5.5.0 when I started getting a different flavor of hard lockups, this time identified by:

Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

I got discouraged at that point.

The only workaround that works reliably for me so far is to prevent the i915 module from being loaded at all, by adding the following parameter to the kernel command line:

module_blacklist=i915

With this in effect, X.org falls back to using xf86-video-fbdev and that mostly works fine for my needs, though there are caveats, of course (no LCD brightness control, no sound through HDMI, very high CPU usage during videoconferencing, and probably more). This is a bandaid, of course, but at least it allows me to have a stable machine when I am on the move.
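
To double-check that the parameter actually reached the kernel, one can inspect the running kernel's command line (on a live system it is in /proc/cmdline). The sketch below parses a sample string instead of the real file so it runs anywhere; the sample values (BOOT_IMAGE, root device) and the helper names are made up, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def blacklisted_modules(cmdline):
    """Return the modules listed in module_blacklist= (a comma-separated list)."""
    value = parse_cmdline(cmdline).get("module_blacklist")
    return value.split(",") if value else []


# Hypothetical command line; on a real system use open("/proc/cmdline").read().
sample = "BOOT_IMAGE=/vmlinuz-linux root=/dev/sda2 rw module_blacklist=i915 quiet"

print(blacklisted_modules(sample))
print("i915 blacklisted:", "i915" in blacklisted_modules(sample))
```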

I am on kernel 5.6.6 right now and these problems are still happening.

Lone_Wolf · 2020-04-27 10:25:59

Both of you show Machine Check Exceptions; possibly the AUR package rasdaemon can help to get more info.

see https://wiki.archlinux.org/index.php/Ma … _exception

Kempniu · 2021-08-12 05:56:03

I looked into this issue a bit in the past few days. It turns out the crashes also happen with kernel versions as old as 4.16; they are just not as frequent. The good news is that I seem to have found a reliable workaround: setting the i915.enable_dc=0 kernel parameter. This option disables the display power-saving states (DC states) in the i915 driver, but it allows me to use the i915 module without getting occasional crashes. The bad news is that this looks like a hardware flaw that a kernel driver cannot do much about except for trying not to trigger it.

I suspect the problem became worse in the more recent kernels because the power-saving logic in the i915 driver became more aggressive (read: more power-efficient) over time. The testing methodology I used (fire up netconsole, start a CPU-intensive task, wait until the CPU temperature becomes high, blank the screen using xset s activate, wait for a crash, confirm symptoms by looking at what the netconsole target machine captured) suggests that 4.15 is stable and 4.16 is not, but I was not getting consistent bisect results between those two versions (the same kernel commit crashed immediately on some occasions and then worked fine for a long time after a reboot), so I cannot tell which specific commit introduced the issue, or whether 4.15 is really 100% stable. This issue is cumbersome enough to troubleshoot for a non-expert that I am happy enough with the workaround I found for the time being and I do not plan on trying to figure this out any further.
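
The "wait until hot, then blank the screen" part of that procedure could be scripted roughly as follows. This is only a sketch of the idea, not what was actually used: the thermal zone path, the 80 °C threshold and all helper names are assumptions, and the demo runs against a fake temperature file with a dry-run xset call so that nothing is touched on the machine.

```python
import os
import subprocess
import tempfile
import time


def read_temp_celsius(path):
    """Read a sysfs-style thermal file holding millidegrees Celsius as text."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def wait_until_hot(path, threshold_c, poll_seconds=5.0, max_polls=None):
    """Poll the sensor; return True once it reaches threshold_c, else False."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if read_temp_celsius(path) >= threshold_c:
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False


def blank_screen(dry_run=True):
    """Activate the X screen saver; with dry_run only return the command."""
    cmd = ["xset", "s", "activate"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Demo with a fake sensor file. On real hardware the path would be something
# like /sys/class/thermal/thermal_zone0/temp (which zone is the CPU varies).
with tempfile.NamedTemporaryFile("w", delete=False) as fake:
    fake.write("85000\n")
if wait_until_hot(fake.name, threshold_c=80.0, poll_seconds=0.0, max_polls=1):
    print("would run:", " ".join(blank_screen(dry_run=True)))
os.unlink(fake.name)
```

The CPU-intensive load and the netconsole capture are left out on purpose; they would run alongside this script, with the crash confirmed on the netconsole target machine as described above.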

Hope this helps!

Speranskiy · 2021-09-24 19:27:20

Got the same issues with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz. I've added the i915.enable_dc=0 kernel parameter; will see if it helps.

zilverling · 2022-07-13 09:18:47

I might have a related issue. From time to time (not too often, I'm glad to say), my ASRock N3700-ITX machine (running a 5.15.53-2-lts kernel) freezes and the screen goes white. I have the impression that this happens after a period of inactivity, when the screen would normally go to sleep. I have to switch the machine off and restart it.

The journalctl logs show these lines just before the freeze:

Jul 13 10:24:51 celaeno kernel: i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+me>
Jul 13 10:24:53 celaeno kernel: i915 0000:00:02.0: [drm] *ERROR* pipe C underrun
Jul 13 10:24:53 celaeno kernel: i915 0000:00:02.0: [drm] *ERROR* CPU pipe C FIFO underrun
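
To pull only the i915 error lines out of a longer journal dump (for example the output of journalctl -k -b -1 for the previous boot, assuming a persistent journal), a small filter like this may help. The regular expression is written against the three lines quoted above and is an assumption about the format, not a general journal parser.

```python
import re

# Matches kernel lines of the form:
#   "... kernel: i915 <pci-address>: [drm] *ERROR* <message>"
I915_ERROR = re.compile(
    r"kernel: i915 (?P<dev>\S+): \[drm\] \*ERROR\* (?P<msg>.+)$"
)


def i915_errors(lines):
    """Return (device, message) pairs for every i915 *ERROR* line."""
    found = []
    for line in lines:
        match = I915_ERROR.search(line)
        if match:
            found.append((match.group("dev"), match.group("msg")))
    return found


# The three lines quoted in the post (the first one is truncated by the pager).
journal_lines = [
    "Jul 13 10:24:51 celaeno kernel: i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+me>",
    "Jul 13 10:24:53 celaeno kernel: i915 0000:00:02.0: [drm] *ERROR* pipe C underrun",
    "Jul 13 10:24:53 celaeno kernel: i915 0000:00:02.0: [drm] *ERROR* CPU pipe C FIFO underrun",
]

for dev, msg in i915_errors(journal_lines):
    print(dev, "->", msg)
```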

Any insights?

Last edited by zilverling (2022-07-13 09:19:16)

JonL · 2022-07-15 07:55:30

Is anyone using a cache file or partition? It seems like there is a deadlock somewhere between paging while a cache is mounted and journaling.

x4rch_q7 · 2022-08-01 03:41:14

Similar issue here.
Lenovo IdeaPad, Intel i3 7th gen (Kaby Lake).
I used Linux Mint for 1 year with no freezing issues (5.4 LTS kernel).
But I got this issue after switching to Arch. I switched between different Arch-based distros thinking it might be a distro issue, but no.
I am still using Arch, now with intel_idle.max_cstate=0, which doesn't fix the problem but makes the PC usable after a couple of initial freezes.
i915.enable_dc=0 didn't help.
Also, I have installed the Zen kernel and the Clear kernel, which give fewer freezes.

x4rch_q7 · 2022-11-15 02:41:26

The new kernel 6.0 update seems to have fixed the problem for me.

Maëlan · 2023-06-01 08:59:42

Adding my voice for the record: my laptop is a Dell Latitude 7390, I had the exact same symptoms, and the fix given by previous people solved the problem. :-)

The fix: adding either module_blacklist=i915 (brutal) or i915.enable_dc=0 (more surgical) to the kernel boot parameters. (With the “surgical” option at least, the screen flickers and sometimes turns black altogether when running the Arch Linux live USB or my newly installed, TTY-only system; but that problem seems to be gone now that I am running X.)
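
How the parameter gets added depends on the boot loader. As a sketch, assuming GRUB and its usual /etc/default/grub file with a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, the edit amounts to the following. The helper name and the sample file contents are made up, and the function only transforms text; it does not write to any file.

```python
import re

# Matches the (assumed double-quoted) GRUB_CMDLINE_LINUX_DEFAULT line.
CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
)


def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to the default command line.

    The text is returned unchanged if the parameter is already present.
    """
    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return CMDLINE_RE.sub(repl, grub_text, count=1)


# Hypothetical file contents for illustration.
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet"\n'
print(add_kernel_param(sample, "i915.enable_dc=0"))
```

With GRUB, the configuration then has to be regenerated (on Arch, typically with grub-mkconfig -o /boot/grub/grub.cfg) before the change takes effect at the next boot. Other boot loaders keep the kernel options elsewhere.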

The symptoms: random freeze, nothing responding, the capslock LED blinking, music stuck in a 2-second loop, all of which occur when on battery or when plugged into a charger, but not when plugged into my WD15 docking station. I started to notice it in December 2022, so probably after upgrading the kernel to 6.x, but it’s possible I had some unexplained crashes before that, only less frequent. The freeze could happen anytime from a few seconds after early booting (as early as the prompt for the LUKS password, to decrypt the root partition), to several hours later.

Since the freeze left no trace and I am by no means a kernel expert, I had lost all hope of debugging this, but at least I could attribute the issue to the Linux kernel (because it did not happen in GRUB, nor in Windows running on the same laptop, and it occurred very early in the process of booting Linux). My gratitude and admiration for the people who found the source of the problem and the fix!
