My kernel crashes at random times and I don't know why. I'm not even sure whether it only happens when I transcode video with Plex, but that's the only way I can consistently reproduce it.
Here is how I currently reproduce the problem:
Start server
open plex web interface
start movie
watch for a few seconds
switch to transcoded version
wait for 5-60sec
kernel crashes, system freezes
serial shows this error:
[ 246.698967] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[ 247.951544] Shutting down cpus with NMI
[ 247.967890] Kernel Offset: 0x4600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 247.990159] Rebooting in 30 seconds..
[ 279.066045] ACPI MEMORY or I/O RESET_REG.
[ 283.388997] ACPI MEMORY or I/O RESET_REG.
Other information:
While booting, the kernel shows the following messages:
[ 0.101676] DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x0000000088400000-0x000000008abfffff], contact BIOS vendor for fixes
[ 62.447709] intel_ish_ipc 0000:00:13.0: enabling device (0000 -> 0002)
[ 72.589918] intel_ish_ipc 0000:00:13.0: [ishtp-ish]: Timed out waiting for FW-initiated reset
[ 72.598488] intel_ish_ipc 0000:00:13.0: ISH: hw start failed.
Versions:
$ pacman -Q linux linux-lts zfs-linux zfs-linux-lts zfs-utils plex-media-server-plexpass
linux 5.7.2.arch1-1
linux-lts 5.4.46-1
zfs-linux 0.8.4_5.7.2.arch1.1-1
zfs-linux-lts 0.8.4_5.4.46.1-1
zfs-utils 0.8.4-1
plex-media-server-plexpass 1.19.4.2935-1
Server:
Mainboard: Supermicro X11SSH-LN4F
CPU: Xeon E3-1275v6
BIOS: 2.3
Firmware: 01.58
What I did so far:
replaced mainboard
replaced CPU (E3-1245v6 -> E3-1275v6)
replaced RAM (2x 16GB DIMM -> 3x 16GB DIMM)
replaced power supply (350W -> 450W)
replaced nvme ssd (boot drive)
The only parts I didn't replace:
4 HDDs which have the zfs pool
SAS/SATA backplane
I suspected a hardware issue first because the problem persisted across several kernel updates, so I replaced the mainboard. That didn't help.
The RAM is from Samsung and is on the recommended hardware list. I ordered a new DIMM but that didn't help, so I now have 3 DIMMs. memtest86 shows no errors on them after multiple runs.
Then I replaced the E3-1245v6 CPU with an E3-1275v6, but the crashes got worse; before, it crashed less often. That made me think it had something to do with power consumption, so I replaced the 350W power supply with a 450W one. That also didn't help. The server draws only about 100W, so the power supply is not under stress.
Then I replaced the NVMe. I cloned the OS from the old drive to the new one, so the software didn't change. That didn't help.
I basically replaced the whole server. So I'm pretty sure it's not a hardware issue.
I also tried booting without the zfs module enabled and ran my test with a movie on the local NVMe, but that didn't change anything.
I also tried the LTS kernel, but it shows the same behavior.
Does anyone have an idea? I don't know how to debug the kernel further to find out what really happens when it crashes. I would expect more output from it. Also, it doesn't reboot after the announced 30 seconds. It just stays that way.
Last edited by XenGi (2020-06-19 21:27:31)
# got root?█
I disabled the integrated Intel GPU today and have now been watching and transcoding a movie for over half an hour. I think I found the problem. Still, it would be interesting to know what exactly went wrong. It isn't the first time an Intel GPU driver has crashed a kernel, but it's still annoying because I wanted to use the GPU to offload some transcoding work.
Last edited by XenGi (2020-06-20 16:50:37)
Did you try updating the BIOS?
EDIT: Never mind, you seem to have fixed the problem.
Last edited by sunflsks (2020-06-20 16:56:38)
Not really. With the Intel GPU disabled, the problem doesn't occur any more. But this is like fixing a flat tire on your car by removing the wheel. Yes, your tire isn't flat anymore, but does the car still drive?
Last edited by XenGi (2020-06-20 17:15:12)
Don't get your hopes up too much, but have you tried passing kernel parameters like these: processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll (see https://wiki.archlinux.org/index.php/Kernel_parameters)?
This helped in an issue with the same error message here: https://www.linuxquestions.org/question … 175668068/
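To confirm after a reboot that the parameters really reached the kernel, the running command line in /proc/cmdline can be checked. This is only a minimal sketch, assuming a POSIX shell; `has_param` and `report_cstate_params` are hypothetical helper names made up for this example, not existing tools:

```shell
#!/bin/sh
# has_param (hypothetical helper): succeed if parameter $2 appears as a
# whole, space-separated token in the command line string $1.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# report_cstate_params (hypothetical helper): print one "name: set" or
# "name: missing" line for each of the three parameters suggested above.
# $1 is the command line to inspect; if omitted, /proc/cmdline is read.
report_cstate_params() {
    cmdline=${1-$(cat /proc/cmdline)}
    for p in processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll; do
        if has_param "$cmdline" "$p"; then
            echo "$p: set"
        else
            echo "$p: missing"
        fi
    done
}
```

Calling `report_cstate_params` without an argument inspects the booted kernel; passing a string such as `"quiet idle=poll"` checks that string instead, which is handy for trying it out.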
Since this could be a kernel bug, this may be helpful: https://www.kernel.org/doc/html/latest/ … -bugs.html
EDIT: If nothing else helps, you could try building a very old kernel, in the hope that it doesn't have your bug, and then (git) bisecting between that and a version that does have the bug. The result should be the git commit that introduced the bug:
https://www.kernel.org/doc/html/latest/ … isect.html
Before that, you should build the Linux git master to check that the bug hasn't already been fixed.
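In case the bisect mechanics are unfamiliar, here is a rough sketch on a throwaway toy repository rather than the kernel tree. Everything in it is made up for illustration: the function name `demo_bisect`, the file `value`, the eight commits, and the rule that commits 5 to 8 are "bad". With the kernel, the two endpoints would be kernel versions you have tested yourself, and instead of an automated check you would build and boot each step, try to trigger the crash, and then mark it with `git bisect good` or `git bisect bad`.

```shell
#!/bin/sh
# demo_bisect (hypothetical name): build a toy repo with 8 commits where the
# "bug" appears in commit 5, then let git bisect find that commit.
demo_bisect() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        # Placeholder identity for the throwaway commits.
        export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
        export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
        git init -q . 2>/dev/null
        i=1
        while [ "$i" -le 8 ]; do
            # Commits 1-4 are "good", commits 5-8 contain the "bug".
            if [ "$i" -ge 5 ]; then echo "bad $i" > value; else echo "ok $i" > value; fi
            git add value
            git commit -q -m "commit $i"
            i=$((i + 1))
        done
        first=$(git rev-list --max-parents=0 HEAD)
        # Known-bad endpoint first, then known-good endpoint.
        git bisect start HEAD "$first" >/dev/null
        # Exit status 0 marks a step as good, 1 marks it as bad.
        git bisect run sh -c '! grep -q bad value' >/dev/null
        # refs/bisect/bad now points at the first bad commit.
        git log -1 --format=%s refs/bisect/bad
        git bisect reset >/dev/null 2>&1
    )
    rm -rf "$repo"
}
```

Running `demo_bisect` prints `commit 5`, the first commit in which the check fails, after testing only three of the eight commits.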
Last edited by Neven (2020-06-21 03:34:15)
Is it happening with both "linux" and "linux-lts"?
have you tried giving the kernel parameters like this: processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll
Yes, all of those. My issue seems to be a different one. I think the messages are unrelated to it and can be ignored in my case. In the end my issue is definitely the GPU.
try building a very old kernel, with the hopes of it not having your bug, and then (git) bisecting between that and a version that does have the bug
That sounds like an interesting journey. Maybe I'll do that. Thx for the tip.
both "linux" and "linux-lts"
Yes, both kernels, and it has been happening for a few months, so not only since the last major kernel update.
My current solution is to keep the Intel GPU disabled. I ordered an NVIDIA GPU because it supports transcoding more streams at the same time. I hope I won't have any issues with that GPU. If the error comes back, I will probably have some fun with git bisect.
Last edited by XenGi (2020-07-18 00:35:10)