[SOLVED] kworker stuck waiting in amdgpu driver

ctrlaltmilk · 2024-09-20 05:56:39

I've been dealing with this issue for a while: sometimes, for no apparent reason (possibly mostly under graphics load), one kworker task gets stuck in a busy loop and slows the whole system to a crawl. I got a stack trace of the worker task, and it seems to be stuck in a udelay() call in some amdgpu code. I'm unsure how to proceed from here. I tried tracking the bug down in the kernel code, but from what I can tell there is supposed to be a timeout that would stop this from happening. Here's the stack trace from dmesg:

[113314.728081] NMI backtrace for cpu 5
[113314.728083] CPU: 5 PID: 288532 Comm: kworker/u48:10 Not tainted 6.10.10-arch1-1 #1 e28ee6293423e91d57555c4cc06eb839714254b7
[113314.728085] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP05, BIOS 03.03 10/17/2023
[113314.728087] Workqueue: events_unbound commit_work
[113314.728097] RIP: 0010:delay_halt_mwaitx+0x3c/0x50
[113314.728101] Code: 31 d2 48 89 d1 48 05 00 60 00 00 0f 01 fa b8 ff ff ff ff b9 02 00 00 00 48 39 c6 48 0f 46 c6 48 89 c3 b8 f0 00 00 00 0f 01 fb <5b> e9 c9 35 33 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90
[113314.728102] RSP: 0018:ffff9d4907c2b8f8 EFLAGS: 00000293
[113314.728103] RAX: 00000000000000f0 RBX: 0000000000000da6 RCX: 0000000000000002
[113314.728104] RDX: 0000000000000000 RSI: 0000000000000da6 RDI: 0001680ea13fa39a
[113314.728105] RBP: 0001680ea13fa39a R08: 0000000000002000 R09: 0000000000000780
[113314.728106] R10: ffff9d4940ef9200 R11: 0000000000000000 R12: 00000000000186a0
[113314.728107] R13: ffff9d4907c2b998 R14: ffff8dc48833d400 R15: 0000000000000000
[113314.728108] FS:  0000000000000000(0000) GS:ffff8dcb01e80000(0000) knlGS:0000000000000000
[113314.728110] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[113314.728110] CR2: 000073bbbbe3100a CR3: 00000003e4c20000 CR4: 0000000000f50ef0
[113314.728112] PKRU: 55555554
[113314.728112] Call Trace:
[113314.728113]  <NMI>
[113314.728115]  ? nmi_cpu_backtrace.cold+0x32/0x68
[113314.728118]  ? nmi_cpu_backtrace_handler+0x11/0x20
[113314.728119]  ? nmi_handle+0x5e/0x120
[113314.728122]  ? default_do_nmi+0x40/0x130
[113314.728126]  ? exc_nmi+0x122/0x1a0
[113314.728128]  ? end_repeat_nmi+0xf/0x53
[113314.728131]  ? delay_halt_mwaitx+0x3c/0x50
[113314.728132]  ? delay_halt_mwaitx+0x3c/0x50
[113314.728134]  ? delay_halt_mwaitx+0x3c/0x50
[113314.728135]  </NMI>
[113314.728136]  <TASK>
[113314.728137]  delay_halt+0x3c/0x70
[113314.728140]  dmub_srv_wait_for_idle+0x55/0x90 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[113314.728611]  dc_dmub_srv_cmd_run_list+0x7b/0x160 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[113314.728763]  dc_dmub_update_dirty_rect+0x1f1/0x220 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[113314.728921]  commit_planes_for_stream+0x2e6/0x1700 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[113314.729073]  ? srso_alias_return_thunk+0x5/0xfbef5
[113314.729075]  update_planes_and_stream_v2+0x2b1/0x590 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[113314.729211]  amdgpu_dm_atomic_commit_tail+0x1bf2/0x3dc0 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[113314.729365]  ? srso_alias_return_thunk+0x5/0xfbef5
[113314.729366]  ? generic_reg_get+0x21/0x40 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[113314.729498]  ? __entry_text_end+0x101e45/0x101e49
[113314.729502]  ? srso_alias_return_thunk+0x5/0xfbef5
[113314.729503]  ? dma_fence_default_wait+0x8b/0x250
[113314.729506]  ? srso_alias_return_thunk+0x5/0xfbef5
[113314.729507]  ? wait_for_completion_timeout+0x130/0x180
[113314.729510]  ? srso_alias_return_thunk+0x5/0xfbef5
[113314.729510]  ? dma_fence_wait_timeout+0x108/0x140
[113314.729514]  commit_tail+0x91/0x130
[113314.729516]  process_one_work+0x17b/0x330
[113314.729519]  worker_thread+0x2e2/0x410
[113314.729521]  ? __pfx_worker_thread+0x10/0x10
[113314.729523]  kthread+0xcf/0x100
[113314.729525]  ? __pfx_kthread+0x10/0x10
[113314.729527]  ret_from_fork+0x31/0x50
[113314.729530]  ? __pfx_kthread+0x10/0x10
[113314.729531]  ret_from_fork_asm+0x1a/0x30
[113314.729536]  </TASK>
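
For anyone reading a trace like the one above: frames prefixed with "?" are addresses the unwinder found on the stack but could not confirm as part of the call chain, so the confirmed path is easier to see without them. Below is a minimal sketch of a filter, assuming the dmesg format shown in this post. reliable_frames is a made-up helper name, not an existing tool.

```shell
# Print only the confirmed frames from the <TASK> section of a kernel backtrace
# read on stdin. Frames prefixed with "?" are skipped.
reliable_frames() {
    sed -E 's/^\[[ 0-9.]+\] ?//' |    # drop the "[113314.728136] " timestamp
    awk '
        /<TASK>/   { in_task = 1; next }   # start of the task-context frames
        /<\/TASK>/ { in_task = 0 }         # end of the task-context frames
        in_task && $1 != "?" { print $1 }  # symbol+offset/size of each confirmed frame
    '
}

# Usage (commented out; reading dmesg usually needs root):
#   sudo dmesg | reliable_frames
```

On the trace in this post, that should leave the chain from delay_halt through dmub_srv_wait_for_idle, dc_dmub_srv_cmd_run_list and dc_dmub_update_dirty_rect down to commit_tail and the workqueue frames.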

If anyone has an idea what might be going on here, I'd really appreciate help as I'm pretty out of my depth on this one.

Last edited by ctrlaltmilk (2024-11-05 00:58:24)

foreverip · 2024-09-21 15:09:10

Don't have an answer here, but I have been having the same issue for the past few days. My computer starts running at <1fps at random (usually after a few hours of use), and it stays that way even if I close most programs. I usually end up having to restart the computer to fix the issue. I have an AMD Radeon 680M, and it seems like my stacktrace is fairly similar to the OP's:

[33504.134462] NMI backtrace for cpu 3
[33504.134463] CPU: 3 PID: 54614 Comm: kworker/u48:31 Tainted: P        W  OE      6.10.10-arch1-1 #1 e28ee6293423e91d57555c4cc06eb839714254b7
[33504.134466] Hardware name: ASUSTeK COMPUTER INC. ASUS TUF Gaming A15 FA507NU_FA507NU/FA507NU, BIOS FA507NU.312 08/11/2023
[33504.134467] Workqueue: events_unbound commit_work
[33504.134471] RIP: 0010:srso_alias_return_thunk+0x0/0xfbef5
[33504.134474] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 48 8d 64 24 08 c3 cc <e8> f4 ff ff ff 0f 0b cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[33504.134476] RSP: 0018:ffffb7aaebb2f8f8 EFLAGS: 00000206
[33504.134477] RAX: ffffffff8c5d0b00 RBX: 0000000000000cde RCX: 0000000000000003
[33504.134478] RDX: 000064653c1a9a33 RSI: 0000000000000cde RDI: 000064653c1a9a33
[33504.134479] RBP: 000064653c1a9a33 R08: 0000000000002000 R09: 0000000000000bc0
[33504.134480] R10: ffffb7ab1fae4100 R11: 0000000000000000 R12: 00000000000186a0
[33504.134481] R13: ffffb7aaebb2f998 R14: ffff935fcbad2880 R15: 0000000000000000
[33504.134482] FS:  0000000000000000(0000) GS:ffff9366fe580000(0000) knlGS:0000000000000000
[33504.134484] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[33504.134485] CR2: 00007a93c8004dd0 CR3: 0000000028a20000 CR4: 0000000000f50ef0
[33504.134486] PKRU: 55555554
[33504.134486] Call Trace:
[33504.134487]  <NMI>
[33504.134489]  ? nmi_cpu_backtrace.cold+0x32/0x68
[33504.134492]  ? nmi_cpu_backtrace_handler+0x11/0x20
[33504.134494]  ? nmi_handle+0x61/0x120
[33504.134496]  ? default_do_nmi+0x40/0x130
[33504.134499]  ? exc_nmi+0x122/0x1a0
[33504.134501]  ? end_repeat_nmi+0xf/0x53
[33504.134503]  ? __pfx_delay_halt_mwaitx+0x10/0x10
[33504.134506]  ? srso_alias_safe_ret+0x7/0x7
[33504.134508]  ? srso_alias_safe_ret+0x7/0x7
[33504.134511]  ? srso_alias_safe_ret+0x7/0x7
[33504.134512]  </NMI>
[33504.134513]  <TASK>
[33504.134514]  __pfx_delay_halt_mwaitx+0x10/0x10
[33504.134516]  delay_halt+0x3f/0x70
[33504.134519]  dmub_srv_wait_for_idle+0x55/0x90 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[33504.134749]  dc_dmub_srv_cmd_run_list+0x7b/0x160 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[33504.134968]  dc_dmub_update_dirty_rect+0x1f1/0x220 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[33504.135189]  commit_planes_for_stream+0x2e6/0x1700 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[33504.135407]  ? srso_alias_return_thunk+0x5/0xfbef5
[33504.135409]  update_planes_and_stream_v2+0x2b1/0x590 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[33504.135628]  amdgpu_dm_atomic_commit_tail+0x1bf2/0x3dc0 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[33504.135874]  ? srso_alias_return_thunk+0x5/0xfbef5
[33504.135875]  ? generic_reg_get+0x21/0x40 [amdgpu aacbd1c3f6c8ff6c07430a269d605a645f6ffb32]
[33504.136094]  ? __entry_text_end+0x101e45/0x101e49
[33504.136100]  ? srso_alias_return_thunk+0x5/0xfbef5
[33504.136101]  ? dma_fence_default_wait+0x8b/0x250
[33504.136104]  ? srso_alias_return_thunk+0x5/0xfbef5
[33504.136106]  ? wait_for_completion_timeout+0x130/0x180
[33504.136108]  ? srso_alias_return_thunk+0x5/0xfbef5
[33504.136110]  ? dma_fence_wait_timeout+0x108/0x140
[33504.136113]  commit_tail+0x94/0x130
[33504.136116]  process_one_work+0x17e/0x330
[33504.136119]  worker_thread+0x2e2/0x410
[33504.136122]  ? __pfx_worker_thread+0x10/0x10
[33504.136123]  kthread+0xd2/0x100
[33504.136126]  ? __pfx_kthread+0x10/0x10
[33504.136129]  ret_from_fork+0x34/0x50
[33504.136132]  ? __pfx_kthread+0x10/0x10
[33504.136134]  ret_from_fork_asm+0x1a/0x30
[33504.136139]  </TASK>

UPDATE: I will follow https://dri.freedesktop.org/docs/drm/gp … debug.html and try to reproduce the bug to get better logs.

Last edited by foreverip (2024-09-21 15:53:47)

foreverip · 2024-09-26 11:34:01

I haven't been able to reproduce the bug for the past few days. Don't know what changed, but maybe the issue got patched? Gonna move on for now, will update if anything happens.

tghe-retford · 2024-09-29 16:10:09

More of a +1 response, but I've had the same issue twice today. All I get from journalctl when the bug happens, with no traceback, is:

kernel: amdgpu 0000:09:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data

What I found is that krunner would start up to 30 seconds after that amdgpu error appeared in the log, but by then the system was so sluggish that a reboot was the only option.
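
Since a reboot wipes the running kernel log, one way to check whether the same DMCUB error preceded a freeze is to search the previous boot's journal. This is a rough sketch and not something from this post: dmcub_errors is a hypothetical helper name, and `-b -1` only works if the journal is stored persistently.

```shell
# Keep only the kernel log lines on stdin that contain the DMCUB error quoted above.
dmcub_errors() {
    grep -F 'DMCUB error'
}

# Usage (commented out; needs systemd's journalctl):
#   journalctl -k -b -1 | dmcub_errors    # kernel messages from the previous boot
#   journalctl -k -b    | dmcub_errors    # kernel messages from the current boot
```

Comparing the timestamps of the matches with when the slowdown started would show whether the error consistently comes first.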

ctrlaltmilk · 2024-09-30 19:24:47

Quick update: the issue is still happening. I looked through the amdgpu driver and I don't think there's a debug log near the code that's failing. At this point, to get meaningful info out, I think I might need to compile a kernel with extra debug logging statements added.

foreverip · 2024-10-01 13:34:48

Yep, you're right, the issue is still happening. Just got it again today. I guess I got lucky and just didn't hit the conditions that cause the bug for the past few days.

ccapitalK · 2024-10-06 07:06:52

I'm hitting a very similar issue and have hit it multiple times over the past few days on amdgpu with Plasma Wayland. I'm seeing the same symptom: 100% CPU stuck in delay_halt_mwaitx. One pattern I've noticed is that it appears to happen more often when I have Firefox open while watching videos, so this might be triggered by hardware video decode. I'm also seeing the following line in the logs around the time the slowdown occurs. It might be a red herring, but it might be useful for people trying to reliably repro:

[49993.490890] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.

I've also noticed a few other weird things:
- The CPU pinning is consistent but moves from kernel thread to kernel thread, so I have to keep changing PIDs when running `perf top` to keep a lock on it
- If I close all programs and don't move my mouse or have anything refreshing the display, the CPU usage eventually goes down to 0. As soon as I move my mouse again, it goes back up to 100% for a while.
- The system is perfectly responsive otherwise; I notice no lag whatsoever when ssh-ing in from another machine
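
Because the busy thread keeps changing, picking the PID by hand gets tedious. Here is a small sketch of one way to automate that step, assuming procps `ps` output in "PID %CPU COMMAND" order. busiest_kworker is a hypothetical helper name. Note that `ps` reports %CPU averaged over the thread's lifetime, so a kworker that only just started spinning may not rank first.

```shell
# Read "PID %CPU COMMAND" rows on stdin and print the PID of the kworker
# with the highest %CPU. Prints nothing if no kworker is above 0.
busiest_kworker() {
    awk '
        $3 ~ /^kworker/ && $2 + 0 > max { max = $2 + 0; pid = $1 }
        END { if (pid != "") print pid }
    '
}

# Usage (commented out; perf usually needs root):
#   pid=$(ps -eo pid,pcpu,comm --no-headers | busiest_kworker)
#   [ -n "$pid" ] && sudo perf top -p "$pid"
```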

These are my system details (Framework Laptop 16 (AMD Ryzen 7040 Series)):

$ lspci | grep VGA
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600] (rev c1)
c4:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (rev c1)
$ uname -a
Linux REDACTED 6.10.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 12 Sep 2024 17:21:02 +0000 x86_64 GNU/Linux
$ pacman -Qi mesa rocm-core rocm-opencl-runtime libva libva-mesa-driver libva-utils  | grep -B1 Version
Name            : mesa
Version         : 1:24.2.3-1
--
Name            : rocm-core
Version         : 6.0.2-2
--
Name            : rocm-opencl-runtime
Version         : 6.0.2-1
--
Name            : libva
Version         : 2.22.0-1
--
Name            : libva-mesa-driver
Version         : 1:24.2.3-1
--
Name            : libva-utils
Version         : 2.22.0-1

Zzzt · 2024-10-17 23:36:54

This might be an easy method to reproduce the issue:
1: Boot/login
2: Open chromium (only 1 tab needed)
3: Watch any (>10min) youtube video
4: Repeatedly/rapidly (for 2-4 minutes) hit keyboard left/right arrows to jump around

Without step 4, no issues. With step 4, I get soft lockup and pw.node errors in the logs.

This is highly repeatable for me (on another Linux distro, but this seems to be the best thread on the topic).
Hope this helps.

ctrlaltmilk · 2024-10-22 01:29:41

Update: tracked down a report on the AMD freedesktop gitlab, seems related if not the same problem
https://gitlab.freedesktop.org/drm/amd/-/issues/3647

Going to try the suggested fix of appending

amdgpu.dcdebugmask=0x10

to my kernel parameters, will update with results soon.
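
After rebooting, it is worth confirming that the parameter actually took effect, since a typo in the bootloader config fails silently. This is a minimal sketch: has_dcdebugmask is a hypothetical helper name, and the sysfs path assumes the driver exposes this module parameter read-only there. As an assumption to verify against the kernel's amdgpu documentation, bit 0x10 in this mask is the one that disables Panel Self Refresh (PSR).

```shell
# Succeed if the kernel command line passed as $1 sets amdgpu.dcdebugmask to any value.
has_dcdebugmask() {
    case " $1 " in
        *" amdgpu.dcdebugmask="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (commented out; run after rebooting with the new parameter):
#   has_dcdebugmask "$(cat /proc/cmdline)" && echo "parameter is on the command line"
#   cat /sys/module/amdgpu/parameters/dcdebugmask    # value as the driver sees it
```

The helper only checks that the parameter is present, not which value it has, so the sysfs read is the more direct check when that file exists.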

ctrlaltmilk · 2024-10-22 22:51:41

Seems to be working for me - loaded up the GPU and no freezes. Going to give it a bit longer before assuming this is the solution, but I think we're on the right track here.

ctrlaltmilk · 2024-11-05 00:58:02

Been working for nearly two weeks with no freezes, marking as solved.

JuicyMio · 2024-11-07 08:22:04

I've been randomly hitting a similar problem since 10/29. I'll try your solution.

Arch Linux
