Hi there,
I suspect the latest kernel update is the reason for the frequent freezes I have experienced over the last two days or so.
The screen freezes. I can still ssh into the machine, but I can't power off from the CLI: shutdown seems to hang somewhere.
This is what I got from dmesg:
[ 5281.846937] amdgpu 0000:11:00.0: amdgpu: Dumping IP State
[ 5282.257537] fbcon: Taking over console
[ 5282.257620] Console: switching to colour frame buffer device 240x67
[ 5285.182382] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5285.182386] amdgpu 0000:11:00.0: amdgpu: Failed to disable gfxoff!
[ 5289.719882] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5289.719887] amdgpu 0000:11:00.0: amdgpu: Failed to disable gfxoff!
[ 5294.183503] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5294.183507] amdgpu 0000:11:00.0: amdgpu: Failed to export SMU metrics table!
[ 5298.641082] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5298.641086] amdgpu 0000:11:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[ 5303.102423] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5303.102427] amdgpu 0000:11:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[ 5307.559974] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5307.559978] amdgpu 0000:11:00.0: amdgpu: Failed to disable gfxoff!
[ 5312.097470] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5312.097474] amdgpu 0000:11:00.0: amdgpu: Failed to export SMU metrics table!
[ 5316.555408] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5316.555412] amdgpu 0000:11:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[ 5321.018158] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5321.018163] amdgpu 0000:11:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[ 5325.482277] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5325.482282] amdgpu 0000:11:00.0: amdgpu: Failed to disable gfxoff!
[ 5329.939341] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5329.939345] amdgpu 0000:11:00.0: amdgpu: Failed to export SMU metrics table!
[ 5334.477337] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5334.477342] amdgpu 0000:11:00.0: amdgpu: Failed to export SMU metrics table!
[ 5338.937188] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5338.937193] amdgpu 0000:11:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[ 5343.399334] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5343.399338] amdgpu 0000:11:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[ 5347.860371] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5347.860375] amdgpu 0000:11:00.0: amdgpu: Failed to disable gfxoff!
[ 5347.860708] amdgpu 0000:11:00.0: amdgpu: Dumping IP State Completed
[ 5347.860767] amdgpu 0000:11:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[ 5347.860769] amdgpu 0000:11:00.0: amdgpu: [drm] Check your /sys/class/drm/card2/device/devcoredump/data
[ 5347.860771] amdgpu 0000:11:00.0: amdgpu: ring gfx_0.1.0 timeout, signaled seq=302421, emitted seq=302423
[ 5347.860773] amdgpu 0000:11:00.0: amdgpu: Process kwin_wayland pid 5448 thread kwin_wayla:cs0 pid 5862
[ 5347.860774] amdgpu 0000:11:00.0: amdgpu: Starting gfx_0.1.0 ring reset
[ 5348.021452] amdgpu 0000:11:00.0: amdgpu: Ring gfx_0.1.0 reset failed
[ 5348.021457] amdgpu 0000:11:00.0: amdgpu: GPU reset begin!
[ 5352.329109] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5352.329114] amdgpu 0000:11:00.0: amdgpu: Failed to export SMU metrics table!
[ 5356.857540] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5356.857545] amdgpu 0000:11:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Downgrading linux-firmware didn't solve the issue. I'm on 6.17.8 to check if the freezes disappear. Edit: nope. Downgraded amd-ucode for now...
Last edited by dingodoppelt (2025-11-30 00:02:05)
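A log like the one above is easier to compare across kernel or firmware versions when it is condensed into counts. The following is only a rough sketch, assuming the dmesg line shape shown in this post; the names LINE_RE, summarize and sample are made up for illustration and are not part of any amdgpu tooling.

```python
import re
from collections import Counter

# Matches the "[ 5285.182386] amdgpu 0000:11:00.0: amdgpu: <message>" shape
# seen in the log above (assumption: relative dmesg timestamps).
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+(?P<dev>[0-9a-f:.]+):\s+amdgpu:\s+(?P<msg>.*)$"
)


def summarize(dmesg_text):
    """Count SMU-busy messages and 'Failed to ...' messages per PCI device."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an amdgpu line in the expected shape
        msg = m.group("msg")
        if msg.startswith("SMU: I'm not done with your previous command"):
            counts[(m.group("dev"), "SMU busy")] += 1
        elif msg.startswith("Failed to"):
            counts[(m.group("dev"), msg)] += 1
    return counts


# A few lines copied from the log above.
sample = """\
[ 5285.182382] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5285.182386] amdgpu 0000:11:00.0: amdgpu: Failed to disable gfxoff!
[ 5294.183503] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5294.183507] amdgpu 0000:11:00.0: amdgpu: Failed to export SMU metrics table!
[ 5307.559974] amdgpu 0000:11:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000017 SMN_C2PMSG_82:0x00000000
[ 5307.559978] amdgpu 0000:11:00.0: amdgpu: Failed to disable gfxoff!
"""

if __name__ == "__main__":
    for (dev, msg), n in summarize(sample).most_common():
        print(f"{n:3d}  {dev}  {msg}")
```

To use it on a live system, the output of dmesg could be piped into a file and passed to summarize() as text. The counts say nothing about the root cause; they only make two runs easier to compare.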
I have also been running into this problem ever since 6.17.9. I was on the cachyos version, though.
I found that it can be reliably reproduced with:
stress-ng --cpu 0 --cpu-method fft --timeout 20m --metrics-brief
but I don't know why stressing the CPU would crash the GPU.
Last edited by schrodingerzhu (2025-12-01 01:37:58)
Yes, I can confirm downgrading to amd-ucode-20251111-1 works around the issue.
wget -4 https://archive.archlinux.org/repos/2025/11/11/core/os/x86_64/amd-ucode-20251111-1-any.pkg.tar.zst
Last edited by schrodingerzhu (2025-12-01 01:39:22)
Sorry, I was wrong.
Downgrading ucode only delays the issue; it can still be reproduced after stressing the CPU long enough.
I enabled the reset and gpu_recovery parameters and now get some more messages like:
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: Dumping IP State
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=6771, emitted seq=6773
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: Process wezterm-gui pid 20998 thread wezterm-gui pid 20998
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: MODE2 reset
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000A SMN_C2PMSG_82:0x00000002
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: Failed to mode reset!
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: Mode2 reset failed!
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: GPU mode2 reset failed
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:06:00.0
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -62
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: GPU Recovery Failed: -62
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: Dumping IP State
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=6773, emitted seq=6773
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: Process wezterm-gui pid 20998 thread wezterm-gui pid 20998
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: Starting gfx_0.0.0 ring reset
[Sun Nov 30 20:26:12 2025] amdgpu 0000:06:00.0: amdgpu: Ring gfx_0.0.0 reset failed
[Sun Nov 30 20:26:12 2025] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
Last edited by schrodingerzhu (2025-12-01 01:38:32)
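Two details in the log above can be pulled out mechanically: how far the ring's signaled sequence number lags behind the emitted one at each timeout, and the return code of the failed reset. The snippet below is a minimal sketch, assuming the message wording shown in this post; the helper names are hypothetical. On Linux, errno 62 is ETIME ("Timer expired"), which would fit the SMU not answering during the MODE2 reset, but errno.errorcode reflects the platform Python runs on, so the symbolic name is only meaningful on a Linux host. Reading emitted minus signaled as "submitted but not yet completed" is my interpretation of the two counters.

```python
import errno
import re

RING_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
RET_RE = re.compile(r"GPU reset end with ret = (?P<ret>-?\d+)")


def ring_timeouts(text):
    """Return (ring, signaled, emitted, emitted - signaled) per timeout line."""
    out = []
    for m in RING_RE.finditer(text):
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        out.append((m.group("ring"), signaled, emitted, emitted - signaled))
    return out


def reset_codes(text):
    """Return (code, symbolic errno name) for each 'GPU reset end' line."""
    codes = []
    for m in RET_RE.finditer(text):
        ret = int(m.group("ret"))
        # Kernel return codes are negated errno values.
        codes.append((ret, errno.errorcode.get(-ret, "unknown")))
    return codes


# Lines copied from the log above.
sample = """\
[Sun Nov 30 20:25:55 2025] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=6771, emitted seq=6773
[Sun Nov 30 20:26:01 2025] amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -62
[Sun Nov 30 20:26:11 2025] amdgpu 0000:06:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=6773, emitted seq=6773
"""

if __name__ == "__main__":
    for entry in ring_timeouts(sample):
        print(entry)
    for code, name in reset_codes(sample):
        print(code, name)
```

In the sample, the first timeout has a gap of two sequence numbers, while the second is reported with signaled and emitted both at 6773.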
This should be related to https://gitlab.freedesktop.org/drm/amd/-/issues/4737
Downgrading ucode didn't fix it.
Now trying a linux-firmware-amdgpu downgrade, as recommended in another thread.