Some kernel issue (possibly snd_hda_intel or amdgpu) crashing system

The Most Curious Thing · 2021-01-19 22:12:47

Since upgrading to linux 5.10.8 and mesa 20.3.3, the system crashes unrecoverably some time after each boot (usually within minutes, once after hours). This happens whether I'm on the desktop (idle or active) or in a TTY. journalctl is inconsistent in reporting what happened, but the culprit always seems to be snd_hda_intel or amdgpu.

What tends to happen is that my monitors will cut to black suddenly, and sound will cut out about a minute later.

I'm running a Sapphire Pulse RX 5600XT on an MSI B450 Tomahawk Max and a Ryzen 5 3600.

This journal is from a sample crash on desktop:

Jan 19 10:33:44 ocean kernel: snd_hda_intel 0000:28:00.1: can't change power state from D3cold to D0 (config space inaccessible)
Jan 19 10:33:44 ocean kernel: snd_hda_intel 0000:28:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 19 10:33:47 ocean kernel: amdgpu 0000:28:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jan 19 10:33:48 ocean kernel: amdgpu 0000:28:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Jan 19 10:33:55 ocean kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2414028, emitted seq=2414030
Jan 19 10:33:55 ocean kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 786 thread Xorg:cs0 pid 787
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 19 10:33:55 ocean kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 19 10:33:55 ocean kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Jan 19 10:33:55 ocean kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: Failed to disable smu features except BACO.
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: Fail to disable dpm features!
Jan 19 10:33:55 ocean kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -5
Jan 19 10:33:55 ocean kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000f588a2d0; ring_buffer_end = 0000000067cbc0a7; write_frame = 000000006624ebe3
Jan 19 10:33:55 ocean kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Jan 19 10:33:55 ocean kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to terminate hdcp ta
Jan 19 10:33:55 ocean kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: BACO reset
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: Failed to enter BACO state!
Jan 19 10:33:55 ocean kernel: amdgpu 0000:28:00.0: amdgpu: ASIC reset failed with error, -5 for drm dev, 0000:28:00.0
Jan 19 10:34:15 ocean kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 19 10:34:15 ocean kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 937E (len 110, WS 12, PS 8) @ 0x93E3
Jan 19 10:34:15 ocean kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 930C (len 37, WS 0, PS 8) @ 0x9325
Jan 19 10:34:15 ocean kernel: amdgpu 0000:28:00.0: amdgpu: asic atom init failed!
Jan 19 10:34:15 ocean kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset(2) failed
Jan 19 10:34:15 ocean kernel: snd_hda_intel 0000:28:00.1: can't change power state from D3cold to D0 (config space inaccessible)
Jan 19 10:34:15 ocean kernel: snd_hda_intel 0000:28:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 19 10:34:15 ocean kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset end with ret = -5
Jan 19 10:34:25 ocean kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
Jan 19 10:34:25 ocean kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=44083, emitted seq=44085
Jan 19 10:34:25 ocean kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 19 10:34:25 ocean kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!

Then, the following happened ten times in roughly 122-second intervals.

Jan 19 10:38:09 ocean kernel: INFO: task kworker/0:2:46273 blocked for more than 122 seconds.
Jan 19 10:38:09 ocean kernel:       Not tainted 5.10.8-arch1-1 #1
Jan 19 10:38:09 ocean kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 19 10:38:09 ocean kernel: task:kworker/0:2     state:D stack:    0 pid:46273 ppid:     2 flags:0x00004080
Jan 19 10:38:09 ocean kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Jan 19 10:38:09 ocean kernel: Call Trace:
Jan 19 10:38:09 ocean kernel:  __schedule+0x295/0x810
Jan 19 10:38:09 ocean kernel:  schedule+0x5b/0xc0
Jan 19 10:38:09 ocean kernel:  schedule_timeout+0x11c/0x160
Jan 19 10:38:09 ocean kernel:  dma_fence_default_wait+0x1c7/0x210
Jan 19 10:38:09 ocean kernel:  ? dma_fence_free+0x20/0x20
Jan 19 10:38:09 ocean kernel:  dma_fence_wait_timeout+0xe8/0x110
Jan 19 10:38:09 ocean kernel:  ? dma_fence_remove_callback+0x44/0x50
Jan 19 10:38:09 ocean kernel:  drm_sched_stop+0xea/0x160 [gpu_sched]
Jan 19 10:38:09 ocean kernel:  amdgpu_device_gpu_recover.cold+0x47f/0x95d [amdgpu]
Jan 19 10:38:09 ocean kernel:  amdgpu_job_timedout+0x121/0x140 [amdgpu]
Jan 19 10:38:09 ocean kernel:  drm_sched_job_timedout+0x64/0xe0 [gpu_sched]
Jan 19 10:38:09 ocean kernel:  process_one_work+0x1d6/0x3a0
Jan 19 10:38:09 ocean kernel:  worker_thread+0x4d/0x3d0
Jan 19 10:38:09 ocean kernel:  ? rescuer_thread+0x410/0x410
Jan 19 10:38:09 ocean kernel:  kthread+0x133/0x150
Jan 19 10:38:09 ocean kernel:  ? __kthread_bind_mask+0x60/0x60
Jan 19 10:38:09 ocean kernel:  ret_from_fork+0x22/0x30

I don't know how to troubleshoot video issues of this magnitude. Where's my next area of research? What info do you need?

Last edited by The Most Curious Thing (2021-01-20 10:25:39)

The Most Curious Thing · 2021-01-20 04:51:38

The common denominator seems to be the snd_hda_intel issue -- but I'm having difficulty troubleshooting on that axis because I'm seeing fixes that involve different config files (/etc/modprobe.d/my-alsa.conf? /etc/modprobe.d/alsa-base.conf?) and a line that the wiki doesn't mention:

options snd-hda-intel enable=0,1

Not sure how to get immediate feedback on my configuration attempts other than rebooting and waiting.

>cat /proc/asound/cards
 0 [Generic        ]: HDA-Intel - HD-Audio Generic
                      HD-Audio Generic at 0xfc800000 irq 97
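
For faster feedback than rebooting and waiting for a crash, the value a loaded module is actually using can usually be read back from sysfs. A minimal sketch, assuming the module is loaded and exposes the parameter under /sys/module/<module>/parameters/ (the function name module_param and its optional third argument are invented here for illustration):

```shell
# Print the value a loaded kernel module is currently using for one parameter.
# Usage: module_param MODULE PARAM [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys; it is overridable only so the function can be
# tried against a fake directory tree.
module_param() {
    # modprobe.d accepts snd-hda-intel, but sysfs spells it snd_hda_intel
    mod=$(printf '%s' "$1" | sed 's/-/_/g')
    file="${3:-/sys}/module/$mod/parameters/$2"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "cannot read $file (module not loaded, or no such parameter?)" >&2
        return 1
    fi
}
```

Running `module_param snd-hda-intel enable` after a reboot should then show whether the options line was picked up at all. It only confirms that the setting was applied, not that it cures the crash.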

Last edited by The Most Curious Thing (2021-01-20 05:04:13)

The Most Curious Thing · 2021-01-20 07:39:58

Another note: I've been seeing this ALSA message more often since the upgrade on Jan 18:

Aug 01 18:54:50 ocean pulseaudio[894]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 01 19:43:14 ocean pulseaudio[850]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 02 08:14:53 ocean pulseaudio[902]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 02 10:10:21 ocean pulseaudio[1017]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 02 11:08:28 ocean pulseaudio[989]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 02 13:41:26 ocean pulseaudio[1027]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 04 01:50:46 ocean pulseaudio[990]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 04 16:16:53 ocean pulseaudio[994]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 09 08:35:54 ocean pulseaudio[1028]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 16 18:00:00 ocean pulseaudio[996]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Aug 24 14:38:14 ocean pulseaudio[1008]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Sep 05 03:56:45 ocean pulseaudio[1016]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Sep 12 15:17:33 ocean pulseaudio[998]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Sep 17 02:29:58 ocean pulseaudio[1031]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Sep 26 15:23:26 ocean pulseaudio[1041]: E: [alsa-sink-ALC892 Analog] alsa-sink.c: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Oct 13 18:04:21 ocean pulseaudio[1017]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Oct 16 20:24:41 ocean pulseaudio[1030]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Oct 23 06:18:26 ocean pulseaudio[1003]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Nov 14 00:19:18 ocean pulseaudio[994]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Nov 14 18:56:19 ocean pulseaudio[988]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Dec 12 15:48:58 ocean pulseaudio[1013]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 03 23:10:50 ocean pulseaudio[1157]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 08 02:48:35 ocean pulseaudio[1048]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 08 03:40:29 ocean pulseaudio[19653]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 18 10:14:38 ocean pulseaudio[1038]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 18 10:22:23 ocean pulseaudio[1034]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 18 10:59:10 ocean pulseaudio[2923]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 18 16:46:54 ocean pulseaudio[1060]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 18 17:08:59 ocean pulseaudio[1067]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 18 21:11:20 ocean pulseaudio[2379]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 19 06:10:30 ocean pulseaudio[1030]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 19 16:23:28 ocean pulseaudio[1050]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
Jan 20 02:34:55 ocean pulseaudio[1016]: ALSA woke us up to write new data to the device, but there was actually nothing to write.
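
One way to check whether the message really became more frequent is to count it per day instead of eyeballing the list. A rough sketch, assuming journal lines in the default short format shown above (count_alsa_wakeups is a made-up helper name):

```shell
# Count "ALSA woke us up" messages per calendar day.
# Reads journal lines ("Mon DD HH:MM:SS host ...") on stdin and prints
# "<count> <Mon> <DD>" in the order the days first appear.
count_alsa_wakeups() {
    grep 'ALSA woke us up' |
        awk '{
                 day = $1 " " $2
                 if (!(day in n)) order[++k] = day   # remember first-seen order
                 n[day]++
             }
             END { for (i = 1; i <= k; i++) print n[order[i]], order[i] }'
}
```

Usage would be something like `journalctl | count_alsa_wakeups`. In the excerpt above, Jan 18 has six entries, but each line carries a different PID, so the count may simply track how often pulseaudio was started (for instance after each crash and reboot) rather than point to ALSA as the cause.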

So is ALSA the culprit for all this?

Last edited by The Most Curious Thing (2021-01-20 07:40:33)

kokoko3k · 2021-01-20 08:12:49

https://ask.fedoraproject.org/t/amd-rad … blem/11648

Are you able to rule out hardware issues by going back to the previous kernel and working around the issue that way?

V1del · 2021-01-20 08:13:18

snd_hda_intel recently had power saving enabled by default; it's possible this leads to a conflict with HDMI audio/power support on your amdgpu.

Two ways around this: disable power saving on snd_hda_intel

options snd_hda_intel power_save=0

(... the name, by the way, doesn't matter as long as the file ends in .conf)

Otherwise if you'd like to keep snd_hda_intel power saving and don't need HDMI audio you could potentially do

options amdgpu audio=0

to disable HDMI audio on the amdgpu side
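
The journal in the first post seems consistent with a link between the two drivers: snd_hda_intel reports for 0000:28:00.1 and amdgpu for 0000:28:00.0, which differ only in the PCI function number, so the audio device is most likely the HDMI/DP audio function of the graphics card itself. A small sketch that pulls this pairing out of kernel log lines (pci_functions is a made-up helper name, and the pattern only knows the two drivers named in this thread):

```shell
# List every "<pci-address> <driver>" pair found in kernel log lines on stdin.
# Sorting by address puts functions of the same slot (same domain:bus:device,
# different trailing .N) next to each other.
pci_functions() {
    grep -oE '(amdgpu|snd_hda_intel) [0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' |
        awk '{ print $2, $1 }' |
        sort -u
}
```

`journalctl -k -b -1 | pci_functions` would run it against the kernel messages of the previous boot.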

Last edited by V1del (2021-01-20 08:14:32)

jkhsjdhjs · 2021-01-20 09:18:05

I'm also experiencing this issue since I upgraded to linux 5.10.8 yesterday evening:

Jan 20 09:40:03 benziuminator kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

Followed by a coredump of Xorg and some other KDE programs.

I'm not seeing any snd_hda_intel errors, which is why I currently suspect this to be a regression in the amdgpu kernel module.

There's also a reddit post from 11 hours ago, describing the same problem: https://www.reddit.com/r/archlinux/comm … 0_timeout/

EDIT: Reddit OP says he also had this issue with 5.10.7, which doesn't fit the regression theory.

Last edited by jkhsjdhjs (2021-01-20 09:24:24)

V1del · 2021-01-20 10:00:46

If there is a kernel regression, I'd rather look at the really big window, i.e. the 5.9 -> 5.10 kernel update.

The Most Curious Thing · 2021-01-20 10:28:23

Thanks all for the responses!

V1del wrote:

Two ways around this: disable power saving on snd_hda_intel
options snd_hda_intel power_save=0
Otherwise if you'd like to keep snd_hda_intel power saving and don't need HDMI audio you could potentially do
options amdgpu audio=0
to disable HDMI audio on the amdgpu side

I've tried both of these and neither worked.

V1del wrote:

... the name, by the way, doesn't matter as long as the file ends in .conf

TIL! c:

jkhsjdhjs wrote:

There's also a reddit post from 11 hours ago, describing the same problem: https://www.reddit.com/r/archlinux/comm … 0_timeout/

Thanks for linking us up.

I've downgraded to kernel 5.9.14; will see what happens from here.

EDIT: No good, same issue from a downgrade to 5.9.14 (with no other change). Surely it couldn't be a hardware issue if my system was stable before my upgrades? (Worth noting that my chroot experience the past two days has been without any issues.) Will examine what else could've gone awry from the upgrade other than linux or mesa.
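
To see what else changed in the same upgrade, the pacman log can be filtered by date. A sketch, assuming the log lives at /var/log/pacman.log and uses lines of the form `[2021-01-18T...] [ALPM] upgraded <package> (<old> -> <new>)`; upgrades_on is a made-up helper name:

```shell
# Print "<package> (<old> -> <new>)" for every upgrade logged on a given day.
# Usage: upgrades_on YYYY-MM-DD < /var/log/pacman.log
# Assumes each log line starts with an ISO-style "[YYYY-MM-DDT..." timestamp.
upgrades_on() {
    grep -F "[$1" |
        grep -F '[ALPM] upgraded ' |
        sed 's/^.*\[ALPM\] upgraded //'
}
```

`upgrades_on 2021-01-18 < /var/log/pacman.log` would then list every package touched that day, which gives a list of candidates beyond linux and mesa.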

EDIT: Just as I say chroot was fine, I come back to a black screen in chroot. Something is clearly very broken with my computer and I want to step away until I have a stronger idea what's happening here.

Last edited by The Most Curious Thing (2021-01-20 10:48:06)

The Most Curious Thing · 2021-01-20 17:04:17

kokoko3k wrote:

https://ask.fedoraproject.org/t/amd-rad … blem/11648
Are you able to rule out hardware issues by going back to the previous kernel and workaround the issue that way?

That thread describes a situation curiously similar to mine, though I haven't tested the reported temps like they did.

I don't understand how the problem persists for both of us on live media.

(And yes, I have tested on 5.9.14 and the problem persists; I haven't been able to roll back mesa yet.)

Arch Linux
