Kernel freezes when transcoding movies with plexmediaserver

XenGi · 2020-06-19 21:05:54

My kernel crashes at random times and I don't know why. I'm not even sure if it only happens when I transcode video with Plex, but that's the only way I can consistently reproduce it.

Here is how I currently reproduce the problem:

Start server
open plex web interface
start movie
watch for a few seconds
switch to transcoded version
wait for 5-60sec
kernel crashes, system freezes

The serial console shows this error:

[  246.698967] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[  247.951544] Shutting down cpus with NMI
[  247.967890] Kernel Offset: 0x4600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  247.990159] Rebooting in 30 seconds..
[  279.066045] ACPI MEMORY or I/O RESET_REG.
[  283.388997] ACPI MEMORY or I/O RESET_REG.

Other information:

While booting, the kernel shows the following messages:

[    0.101676] DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x0000000088400000-0x000000008abfffff], contact BIOS vendor for fixes

[   62.447709] intel_ish_ipc 0000:00:13.0: enabling device (0000 -> 0002)
[   72.589918] intel_ish_ipc 0000:00:13.0: [ishtp-ish]: Timed out waiting for FW-initiated reset
[   72.598488] intel_ish_ipc 0000:00:13.0: ISH: hw start failed.

Versions:

$ pacman -Q linux linux-lts zfs-linux zfs-linux-lts zfs-utils plex-media-server-plexpass
linux 5.7.2.arch1-1
linux-lts 5.4.46-1
zfs-linux 0.8.4_5.7.2.arch1.1-1
zfs-linux-lts 0.8.4_5.4.46.1-1
zfs-utils 0.8.4-1
plex-media-server-plexpass 1.19.4.2935-1

Server:

Mainboard: Supermicro X11SSH-LN4F
CPU: Xeon E3-1275v6
BIOS: 2.3
Firmware: 01.58

What I did so far:

replaced mainboard
replaced CPU (E3-1245v6 -> E3-1275v6)
replaced RAM (2x 16GB DIMM -> 3x 16GB DIMM)
replaced power supply (350W -> 450W)
replaced nvme ssd (boot drive)

The only parts I didn't replace:

4 HDDs which have the zfs pool
SAS/SATA backplane

I suspected a hardware issue first because the problem persisted across several kernel updates. So I replaced the mainboard. That didn't help.
The RAM is from Samsung and it is on the recommended hardware list. I ordered a new DIMM but that didn't help, so I now have 3 DIMMs. memtest86 shows no errors on them after multiple runs.
Then I replaced the CPU E3-1245v6 with an E3-1275v6 but the crashes got worse. Before, it would crash less often. That's why I thought it had something to do with power consumption and replaced the 350W power supply with a 450W one. That also didn't help. The server draws about 100W, so there is no stress on the power supply.
Then I replaced the nvme. I cloned the OS from the old one to the new one, so the software didn't change. That didn't help.

I basically replaced the whole server. So I'm pretty sure it's not a hardware issue.

I also tried booting it without the zfs module enabled and did my test with a movie on the local nvme but that didn't change anything.

I also tried the lts kernel but it shows the same behavior.

Does someone have an idea? I don't know how I could debug the kernel further to find out what really happens when it crashes. I would expect more lines from it. Also, it doesn't reboot after the announced 30 seconds. It just stays that way.

Last edited by XenGi (2020-06-19 21:27:31)

XenGi · 2020-06-20 16:50:12

I disabled the integrated Intel GPU today and I've now been watching and transcoding a movie for over half an hour. I think I found the problem. Still, it would be interesting to know what exactly went wrong. It isn't the first time that an Intel GPU driver crashes a kernel, but it's still annoying because I wanted to use it to offload some transcoding work.

Last edited by XenGi (2020-06-20 16:50:37)

sunflsks · 2020-06-20 16:56:03

Did you try updating the BIOS?

EDIT: Nevermind, you seem to have fixed the problem

Last edited by sunflsks (2020-06-20 16:56:38)

XenGi · 2020-06-20 17:14:40

Not really. By disabling the Intel GPU, the problem doesn't occur any more. But this is like fixing a flat tire on your car by removing the wheel. Yes your tire isn't flat anymore but does the car still drive?

Last edited by XenGi (2020-06-20 17:15:12)

Neven · 2020-06-21 02:18:21

Don't get your hopes up too much, but have you tried giving the kernel parameters like this: processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll (See https://wiki.archlinux.org/index.php/Kernel_parameters)?

This helped in an issue with the same error message here: https://www.linuxquestions.org/question … 175668068/
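
To confirm after a reboot that these parameters actually reached the kernel, a small shell sketch like the following might help. The function name check_params is made up for illustration; /proc/cmdline is where the running kernel exposes its command line. How the parameters get added in the first place depends on the boot loader, which this sketch does not cover.

```shell
# Report which of the suggested parameters appear on a kernel command line.
# With no argument the command line of the running kernel is inspected;
# a command line can also be passed as a single string for a dry run.
check_params() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for want in processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll; do
        # Pad with spaces so only whole parameters match.
        case " $cmdline " in
            *" $want "*) echo "$want: set" ;;
            *)           echo "$want: missing" ;;
        esac
    done
}
```

Calling check_params without an argument prints one line per parameter for the currently booted kernel.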

Since this could be a kernel bug, this may be helpful: https://www.kernel.org/doc/html/latest/ … -bugs.html

EDIT: If nothing else helps, you could try building a very old kernel, with the hopes of it not having your bug, and then (git) bisecting between that and a version that does have the bug. The result should be the git commit that introduced the bug:

https://www.kernel.org/doc/html/latest/ … isect.html

Before that you should build the linux git master to check that the bug is not fixed already.

Last edited by Neven (2020-06-21 03:34:15)

Ropid · 2020-06-21 03:13:56

Is it happening with both "linux" and "linux-lts"?

XenGi · 2020-07-18 00:33:36

Neven wrote:

have you tried giving the kernel parameters like this: processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll

Yes all of those. The issue I have seems to be a different one. I think the messages are not related to my issue and can be ignored in my case. In the end my issue is definitely the GPU.

Neven wrote:

try building a very old kernel, with the hopes of it not having your bug, and then (git) bisecting between that and a version that does have the bug

That sounds like an interesting journey. Maybe I'll do that. Thx for the tip.

Ropid wrote:

both "linux" and "linux-lts"

Yes, both kernels, and it has been happening for a few months. So not only since the last bigger kernel update.

My current solution is that I disabled the Intel GPU. I ordered an NVIDIA GPU because it supports transcoding more streams at the same time. I hope that I won't have any issues with that GPU. If the error comes back I will probably have some fun with git bisecting.

Last edited by XenGi (2020-07-18 00:35:10)
