More AMD GPU crashes

aa6kj · 2026-03-06 23:03:15

This is on 6.19.6-arch1-1 (Steam Deck):

[56637.829222] ------------[ cut here ]------------
[56637.829228] refcount_t: underflow; use-after-free.
[56637.829230] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x59/0x90, CPU#4: Xorg/1071
[56637.829244] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq r8153_ecm cdc_ether usbnet snd_acp5x_pcm_dma snd_usb_audio snd_soc_acp5x_mach snd_acp5x_i2s snd_sof_amd_acp70 snd_usbmidi_lib r8152 snd_ump mii snd_sof_amd_acp63 snd_rawmidi snd_sof_amd_vangogh libphy snd_seq_device mc cp210x snd_sof_amd_rembrandt mdio_bus snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp lpvo_usb_gpib snd_sof snd_sof_utils btusb snd_pci_ps btmtk snd_soc_acpi_amd_match intel_rapl_msr gpib_common rtw88_8822ce btrtl rtw88_8822c snd_amd_sdw_acpi soundwire_amd amd_atl btbcm soundwire_generic_allocation rtw88_pci intel_rapl_common btintel snd_soc_cs35l41_spi snd_hda_codec_atihdmi soundwire_bus rtw88_core ftdi_sio snd_soc_sdca snd_soc_cs35l41 cdc_acm snd_hda_codec_hdmi mousedev bluetooth snd_hda_intel snd_soc_cs35l41_lib snd_rpl_pci_acp6x snd_hda_codec snd_soc_wm_adsp vfat mac80211 snd_acp_pci kvm_amd cs_dsp snd_soc_nau8821 snd_hda_core snd_amd_acpi_mach spd5118 fat hid_multitouch snd_intel_dspcfg snd_soc_core
[56637.829366]  snd_acp_legacy_common snd_pci_acp6x kvm cfg80211 sp5100_tco snd_compress snd_pci_acp5x snd_intel_sdw_acpi ac97_bus irqbypass snd_rn_pci_acp3x snd_hwdep snd_pcm_dmaengine ghash_clmulni_intel snd_acp_config rfkill i2c_piix4 aesni_intel snd_soc_acpi ccp snd_pcm snd_pci_acp3x rapl snd_timer opt3001 wdat_wdt ltrf216a k10temp pcspkr libarc4 i2c_smbus snd i2c_hid_acpi industrialio soundcore i2c_hid 8250_dw joydev mac_hid i2c_dev crypto_user pkcs8_key_parser ntsync nfnetlink zram 842_decompress 842_compress lz4hc_compress lz4_compress uas hid_steam usb_storage ff_memless amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec sdhci_pci drm_panel_backlight_quirks gpu_sched drm_suballoc_helper sdhci_uhs2 nvme sdhci drm_buddy video nvme_core cqhci serio_raw nvme_keyring wmi drm_display_helper mmc_core nvme_auth spi_amd cec hkdf
[56637.829492] CPU: 4 UID: 0 PID: 1071 Comm: Xorg Not tainted 6.19.6-arch1-1 #1 PREEMPT(full)  a70f585a3574c37bff18875a6cf7bd8652b4cbca
[56637.829498] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0131 01/30/2024
[56637.829502] RIP: 0010:refcount_warn_saturate+0x59/0x90
[56637.829508] Code: 44 48 8d 3d 69 9b f4 01 67 48 0f b9 3a e9 ef 33 85 00 48 8d 3d 68 9b f4 01 67 48 0f b9 3a e9 de 33 85 00 48 8d 3d 67 9b f4 01 <67> 48 0f b9 3a e9 cd 33 85 00 48 8d 3d 66 9b f4 01 67 48 0f b9 3a
[56637.829511] RSP: 0018:ffffd3b309bbf870 EFLAGS: 00010246
[56637.829516] RAX: ffff8b17cb22b000 RBX: 0000000000000001 RCX: 0000000000000006
[56637.829518] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff8b903470
[56637.829521] RBP: ffff8b19e0700068 R08: 0000000000000040 R09: fffff68d8da16000
[56637.829523] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b19e0700000
[56637.829525] R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000019
[56637.829528] FS:  00007f1a55972a00(0000) GS:ffff8b1b63dc4000(0000) knlGS:0000000000000000
[56637.829531] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[56637.829534] CR2: 000000000bbe3000 CR3: 000000010cea1000 CR4: 0000000000350ef0
[56637.829537] Call Trace:
[56637.829540]  <TASK>
[56637.829545]  dc_stream_release+0x43/0x60 [amdgpu 5e714176d80a5d8ea5e1b22333f539a9a00d92a7]
[56637.830157]  dc_state_destruct+0x51/0x250 [amdgpu 5e714176d80a5d8ea5e1b22333f539a9a00d92a7]
[56637.830735]  dc_state_release+0x43/0xa0 [amdgpu 5e714176d80a5d8ea5e1b22333f539a9a00d92a7]
[56637.831331]  dm_atomic_destroy_state+0x24/0x40 [amdgpu 5e714176d80a5d8ea5e1b22333f539a9a00d92a7]
[56637.831960]  drm_atomic_state_default_clear+0x29e/0x350
[56637.831972]  __drm_atomic_state_free+0x71/0xc0
[56637.831977]  drm_mode_obj_set_property_ioctl+0x3a1/0x3e0
[56637.831985]  ? __pfx_drm_connector_property_set_ioctl+0x10/0x10
[56637.831991]  drm_connector_property_set_ioctl+0x3c/0x60
[56637.831996]  drm_ioctl_kernel+0xae/0x100
[56637.832003]  drm_ioctl+0x29b/0x520
[56637.832009]  ? __pfx_drm_connector_property_set_ioctl+0x10/0x10
[56637.832019]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu 5e714176d80a5d8ea5e1b22333f539a9a00d92a7]
[56637.832604]  __x64_sys_ioctl+0x97/0xe0
[56637.832614]  do_syscall_64+0x81/0x610
[56637.832621]  ? srso_return_thunk+0x5/0x5f
[56637.832626]  ? amdgpu_drm_ioctl+0x6a/0x80 [amdgpu 5e714176d80a5d8ea5e1b22333f539a9a00d92a7]
[56637.833163]  ? srso_return_thunk+0x5/0x5f
[56637.833171]  ? __x64_sys_ioctl+0xb1/0xe0
[56637.833178]  ? srso_return_thunk+0x5/0x5f
[56637.833181]  ? do_syscall_64+0x81/0x610
[56637.833187]  ? srso_return_thunk+0x5/0x5f
[56637.833190]  ? do_syscall_64+0x81/0x610
[56637.833193]  ? sock_poll+0x54/0x110
[56637.833199]  ? srso_return_thunk+0x5/0x5f
[56637.833202]  ? __x64_sys_epoll_ctl+0x6f/0xa0
[56637.833208]  ? srso_return_thunk+0x5/0x5f
[56637.833211]  ? do_syscall_64+0x81/0x610
[56637.833216]  ? srso_return_thunk+0x5/0x5f
[56637.833219]  ? srso_return_thunk+0x5/0x5f
[56637.833222]  ? sock_poll+0x54/0x110
[56637.833225]  ? srso_return_thunk+0x5/0x5f
[56637.833228]  ? srso_return_thunk+0x5/0x5f
[56637.833231]  ? ep_item_poll.isra.0+0x56/0x90
[56637.833236]  ? srso_return_thunk+0x5/0x5f
[56637.833240]  ? do_epoll_ctl+0x2b5/0xef0
[56637.833246]  ? srso_return_thunk+0x5/0x5f
[56637.833249]  ? __x64_sys_epoll_ctl+0x6f/0xa0
[56637.833254]  ? srso_return_thunk+0x5/0x5f
[56637.833257]  ? do_syscall_64+0x81/0x610
[56637.833260]  ? srso_return_thunk+0x5/0x5f
[56637.833265]  ? srso_return_thunk+0x5/0x5f
[56637.833268]  ? __x64_sys_epoll_ctl+0x6f/0xa0
[56637.833272]  ? srso_return_thunk+0x5/0x5f
[56637.833275]  ? do_syscall_64+0x81/0x610
[56637.833278]  ? __x64_sys_epoll_ctl+0x6f/0xa0
[56637.833283]  ? srso_return_thunk+0x5/0x5f
[56637.833286]  ? do_syscall_64+0x81/0x610
[56637.833290]  ? srso_return_thunk+0x5/0x5f
[56637.833293]  ? do_syscall_64+0x81/0x610
[56637.833297]  ? __irq_exit_rcu+0x4c/0xf0
[56637.833303]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[56637.833310] RIP: 0033:0x7f1a55dde04d
[56637.833344] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[56637.833347] RSP: 002b:00007fff9bde5430 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[56637.833352] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1a55dde04d
[56637.833355] RDX: 00007fff9bde54c0 RSI: 00000000c01064ab RDI: 0000000000000012
[56637.833357] RBP: 00007fff9bde5480 R08: 0000000000000003 R09: 0000000000000002
[56637.833360] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff9bde54c0
[56637.833362] R13: 00000000c01064ab R14: 0000000000000012 R15: 0000000000000000
[56637.833371]  </TASK>
[56637.833373] ---[ end trace 0000000000000000 ]---
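The trace is easier to read once the lines marked with `?` are filtered out. Those are addresses the kernel found on the stack but could not confirm as part of the actual call chain. A minimal sketch in Python, assuming the dmesg text has the same `[timestamp]  symbol+0xOFF/0xLEN` layout as above (the function name `reliable_frames` is made up for illustration):

```python
import re

# Matches "[56637.829545]  dc_stream_release+0x43/0x60 [amdgpu ...]" and the
# "? symbol+0x.../0x..." variant; the optional "guess" group captures the "?".
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def reliable_frames(dmesg_text):
    """Return the symbols of call-trace frames that are not marked with '?'."""
    frames = []
    for line in dmesg_text.splitlines():
        m = FRAME.match(line.strip())
        if m and not m["guess"]:
            frames.append(m["symbol"])
    return frames
```

Applied to the trace above, this should leave the chain from `dc_stream_release` down to `entry_SYSCALL_64_after_hwframe`, without the `srso_return_thunk` and epoll noise.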

Any workarounds for this one?

seth · 2026-03-07 10:20:10

That's a warning, not a "crash"; please describe the symptoms of the crash.
If you need to reboot, avoid the power button and use https://wiki.archlinux.org/title/Keyboa … el_(SysRq) instead, then please post your complete system journal for the boot, e.g.

sudo journalctl -b -1 | curl -F 'file=@-' 0x0.st

for the previous one ("-1")

aa6kj · 2026-03-07 17:32:43

I'll have to do that next time. I had to shut down everything for the thunderstorm that rolled by. The system was sitting at the Plasma desktop (Xorg) with a couple of programs running, nothing GPU intensive. The system did not crash or become unresponsive, but most of the running programs died and I had to restart them.

I have set the GPU clocks to a constant frequency (800 MHz or so, as I recall) to avoid other crashing issues with the AMD driver.

Based on the dmesg, it looks like it tried to use memory that was freed earlier. That should be fairly easy to track down (famous last words...).
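To illustrate why a refcount underflow gets reported as a possible use-after-free, here is a toy model in Python. It is only a sketch of the general idea, not the kernel's `refcount_t` implementation; the class name, method names and the `SATURATED` sentinel are made up for illustration.

```python
# Toy reference counter: the object is "freed" when the last reference is
# dropped. Dropping a reference after that means someone still held a pointer
# to freed memory, which is what the warning text hints at.
SATURATED = -1  # hypothetical sentinel; once set, the counter is no longer trusted

class ToyRefcount:
    def __init__(self):
        self.refs = 1        # the creator holds the first reference
        self.warnings = []   # collected warning messages

    def get(self):
        """Take an additional reference."""
        self.refs += 1

    def put(self):
        """Drop one reference; return True if the caller should free the object."""
        old = self.refs
        if old <= 0:
            # Nothing left to drop: the object was already released.
            self.warnings.append("underflow; use-after-free")
            self.refs = SATURATED
            return False
        self.refs = old - 1
        return old == 1
```

In this model a balanced get/put sequence never warns. The warning only appears when one more `put()` arrives than there were references, which would match a stream being released once too often somewhere along the `dc_state_destruct` path in the trace.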

slayerking · 2026-03-08 03:32:00

Here's mine: https://0x0.st/PeZW.txt, time of crash 14:17 (2:17)

LuxFerre · 2026-03-08 10:44:29

slayerking wrote:

Here's mine https://0x0.st/PeZW.txt time crash 14:17 (2:17)

It's probably worth mentioning that your system is a desktop, while the first post is about a handheld Steam Deck.
I recently had an AMDGPU crash with the latest kernel too (6.19.6), so I'm guessing they made quite a few changes in 6.19 (it was fine on 6.18), but these are probably separate issues.

slayerking · 2026-03-08 11:10:14

LuxFerre wrote:

slayerking wrote:
Here's mine https://0x0.st/PeZW.txt time crash 14:17 (2:17)
It's probably worth mentioning that your system is a desktop, while first post is about a handheld steam deck.
I recently had an AMDGPU crash with the latest kernel too (6.19.6), so I'm guessing they made quite a few changes to 6.19 (was fine on 6.18), but these are probably separate issues.

Good point, I didn't read that properly. Still, I have had this for over a month and it's starting to get on my nerves. It started in the later 6.18.x releases.

seth · 2026-03-08 14:19:57

Mar 08 14:17:53 hell kernel: amdgpu 0000:03:00.0: amdgpu: Dumping IP State
Mar 08 14:17:53 hell kernel: amdgpu 0000:03:00.0: amdgpu: Dumping IP State Completed
Mar 08 14:17:53 hell kernel: amdgpu 0000:03:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Mar 08 14:17:53 hell kernel: amdgpu 0000:03:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Mar 08 14:17:53 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=7962438, emitted seq=7962440
Mar 08 14:17:53 hell kernel: amdgpu 0000:03:00.0: amdgpu:  Process Xorg pid 1136 thread Xorg:cs0 pid 1166
Mar 08 14:17:53 hell kernel: amdgpu 0000:03:00.0: amdgpu: Starting gfx_0.0.0 ring reset
Mar 08 14:17:53 hell kernel: [drm:gfx_v11_0_bad_op_irq [amdgpu]] *ERROR* Illegal opcode in command stream 
Mar 08 14:17:55 hell kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=RESET
Mar 08 14:17:55 hell kernel: amdgpu 0000:03:00.0: amdgpu: failed to reset legacy queue
Mar 08 14:17:55 hell kernel: amdgpu 0000:03:00.0: amdgpu: reset via MES failed and try pipe reset -110
Mar 08 14:17:55 hell kernel: amdgpu 0000:03:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
Mar 08 14:17:55 hell kernel: amdgpu 0000:03:00.0: amdgpu: Ring gfx_0.0.0 reset failed
Mar 08 14:17:55 hell kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!. Source:  1
Mar 08 14:17:57 hell kernel: amdgpu 0000:03:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Mar 08 14:17:57 hell kernel: amdgpu 0000:03:00.0: amdgpu: failed to unmap legacy queue
Mar 08 14:17:57 hell kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Mar 08 14:17:57 hell kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Mar 08 14:17:57 hell kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Mar 08 14:17:57 hell kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
Mar 08 14:17:58 hell kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000F00000).
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: VRAM is lost due to GPU reset!
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: reserve 0x1300000 from 0x85fc000000 for PSP TMR
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: optional securedisplay ta ucode is not available
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x004e8300 (78.131.0)
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x07002F00
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(1) succeeded!
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: amdgpu: [drm] *ERROR* Failed to initialize parser -125!
Mar 08 14:17:58 hell systemd-coredump[35610]: Process 1136 (Xorg) of user 0 terminated abnormally with signal 6/ABRT, processing...
Mar 08 14:17:58 hell systemd[1]: Created slice Slice /system/systemd-coredump.
Mar 08 14:17:58 hell systemd[1]: Started Process Core Dump (PID 35610/UID 0).
Mar 08 14:17:58 hell kernel: amdgpu 0000:03:00.0: [drm] device wedged, but recovered through reset

You fit https://bbs.archlinux.org/viewtopic.php?id=311937 (2nd post there will lead you down the rabbit hole) - the FW was supposed to be fixed, though.
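For anyone comparing their own journal against the excerpt above, the key line is the `ring ... timeout` message. A small Python sketch that pulls those lines out of saved journal text and reports how many submissions were still outstanding; the function name `find_ring_timeouts` is made up for illustration, and the pattern assumes the message format shown in this log:

```python
import re

# Matches e.g. "ring gfx_0.0.0 timeout, signaled seq=7962438, emitted seq=7962440"
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def find_ring_timeouts(journal_text):
    """Return one dict per ring timeout message found in the journal text."""
    results = []
    for line in journal_text.splitlines():
        m = RING_TIMEOUT.search(line)
        if m:
            signaled = int(m["signaled"])
            emitted = int(m["emitted"])
            results.append({
                "ring": m["ring"],
                "signaled": signaled,
                "emitted": emitted,
                # submissions emitted to the ring but not yet signaled as done
                "pending": emitted - signaled,
            })
    return results
```

The input could be the output of `journalctl -b -1` saved to a file and read as a string.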
