Display repeatedly crashes; journal shows many amdgpu/atombios errors

iridyllic · 2023-12-06 03:42:14

Kernel: 6.6.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 04 Dec 2023 00:29:19 +0000 x86_64 GNU/Linux
GPU: AMD ATI Radeon RX 570 (I have vulkan-radeon, xf86-video-amdgpu, mesa installed)
DE: swaywm
Displays: two monitors

My display crashes at irregular intervals (sometimes several times in an hour, sometimes not for a few weeks).

When it crashes, audio continues playing, but I can no longer interact with the WM and have to reboot the system.

The journal messages somewhat vary but they are all similar to the following (these occur when the display crashes, and after these I reboot the system):

Dec 05 22:13:46 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=805217, emitted seq=805220
Dec 05 22:13:46 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 97881 thread firefox:cs0 pid 98083
Dec 05 22:14:06 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Dec 05 22:14:06 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D826 (len 824, WS 0, PS 0) @ 0xD9A6
Dec 05 22:14:06 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D6E0 (len 326, WS 0, PS 0) @ 0xD7D0
Dec 05 22:14:26 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Dec 05 22:14:26 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C1AA (len 116, WS 0, PS 0) @ 0xC1F7
Dec 05 22:14:46 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Dec 05 22:14:46 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C21E (len 62, WS 0, PS 0) @ 0xC23A
Dec 05 22:15:06 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Dec 05 22:15:06 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D826 (len 824, WS 0, PS 0) @ 0xD9A6
Dec 05 22:15:06 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D6E0 (len 326, WS 0, PS 0) @ 0xD7D0
Dec 05 22:15:26 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Dec 05 22:15:26 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C1AA (len 116, WS 0, PS 0) @ 0xC1F7
Dec 05 22:15:46 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Dec 05 22:15:46 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C21E (len 62, WS 0, PS 0) @ 0xC23A
Dec 05 22:15:46 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
Dec 05 22:15:47 kernel: amdgpu: Failed to force to switch arbf0!
Dec 05 22:15:47 kernel: amdgpu: [disable_dpm_tasks] Failed to disable DPM!
Dec 05 22:15:47 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
Dec 05 22:15:47 kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
Dec 05 22:15:47 kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Dec 05 22:15:48 kernel: amdgpu: cp is busy, skip halt cp
Dec 05 22:15:48 kernel: amdgpu: rlc is busy, skip halt rlc

I couldn't find any exact matches online, but I saw some users claiming similar issues were resolved by adding `amdgpu.dc=0` or `radeon.runpm=0` to the kernel parameters. Since I use systemd-boot, I added these to the `options` line in `/boot/loader/entries/arch.conf`. However, they have not reduced the frequency of display crashes.

I have ensured the temperature of the system and GPU are stable and cleaned out the case with compressed air, in case this is relevant.

Are there any other troubleshooting steps I could go through? Thanks.

Last edited by iridyllic (2023-12-06 03:47:56)

seth · 2023-12-06 08:55:21

https://bugzilla.kernel.org/show_bug.cgi?id=211425 seems to be related to the outputs, possibly DPMS.
"radeon.anything" won't do because you're not using that module.

amdgpu.runpm=0 amdgpu.aspm=0 amdgpu.bapm=0 pcie_aspm=off amdgpu.audio=0 amdgpu.dpm=0

amdgpu.dpm=0 is likely going to prevent the system from booting altogether, though.
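
To confirm the parameters actually reached the running kernel after editing the boot entry, something like the following could be used. It is only a sketch: `check_params` is a made-up helper name, and it does nothing beyond searching a command-line string for each requested parameter.

```shell
#!/bin/sh
# check_params is a hypothetical helper, not part of any package.
# Usage: check_params "<kernel command line>" param1 param2 ...
# Prints "set:" or "missing:" for every requested parameter.
check_params() {
    cmdline=$1
    shift
    for want in "$@"; do
        # Pad with spaces so only whole, space-separated words match.
        case " $cmdline " in
            *" $want "*) echo "set:     $want" ;;
            *)           echo "missing: $want" ;;
        esac
    done
}

# On the affected machine, after a reboot:
#   check_params "$(cat /proc/cmdline)" \
#       amdgpu.runpm=0 amdgpu.aspm=0 amdgpu.bapm=0 pcie_aspm=off amdgpu.audio=0
```

The values the module ended up with should also be readable under `/sys/module/amdgpu/parameters/`, assuming the parameter in question is exported there.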

xf86-video-amdgpu is irrelevant w/ sway, but can you reproduce this w/ an X11 session (i3, openbox, …) and perhaps an explicit

xset dpms force off

there?

iridyllic · 2023-12-06 16:21:12

I haven't noticed any definitive pattern, so it might be hard to reproduce. Entering full-screen used to trigger it occasionally, but it also often happens without entering full-screen.

seth · 2023-12-06 19:03:34

If it's dpms bound, the command in #2 should be a surefire trigger.

iridyllic · 2023-12-07 06:39:28

seth wrote:

If it's dpms bound, the command in #2 should be a surefire trigger.

Got it, I misunderstood. I started up an X session with `exec xterm` in sxrc and ran the command. The display did turn off but typing on the keyboard brought it back, unlike when the display is crashing. The amdgpu/atombios journal entries do not appear either.

seth · 2023-12-07 15:03:32

Already w/ any of the parameters in #2 ?
Can you cause the issue in an X11 session at all?

And please post a complete system journal covering the incident - there might be indicators of what leads up to the stall.
Since the stalling process is actually FF and you're on a wayland session, is this natively wayland FF or xwayland, https://wiki.archlinux.org/title/Firefox#Wayland ?

iridyllic · 2023-12-12 04:04:08

I tried it with the parameters I mentioned in the original post, since you said the parameters you listed might cause the system not to boot.
I'm not sure about X11, but I did reproduce it with Firefox on Wayland, both under xwayland and as native wayland Firefox. Additionally, one of the crashes was attributed to mpv while Firefox was not open on the system.

Thanks for the help. Do you think removing the graphics card and upgrading the CPU to one with integrated graphics would solve the issue? I do not need a graphics card since I don't game; I mostly just program and browse the web (no heavy AI/ML/3D workflows that would need one). I wonder if it could just be a hardware failure, since this only started after I'd had the GPU for more than 3 years and it worked fine before that.

Last edited by iridyllic (2023-12-12 04:04:32)

seth · 2023-12-12 07:39:02

Only "amdgpu.dpm=0" is prone to fail the boot, the others are not.

iridyllic · 2023-12-16 19:12:49

I used all the parameters except for "amdgpu.dpm=0". The system ran OK for a few days but then began crashing again repeatedly.

Here is a log file for a run that crashed about 4 minutes after boot as I opened firefox: http://0x0.st/HYM6.txt.

However, interestingly, even after I removed firefox from autostart, the system still sometimes crashed without it. Even `pid 0` has crashed, which is strange since I don't think that has anything to do with graphics.

~ journalctl -n 100000 -p warning | grep 'Process information' | cut -d ' ' -f 5-
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 897 thread firefox:cs0 pid 1417
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 833 thread sway:cs0 pid 879
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 814 thread sway:cs0 pid 859
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 935 thread firefox:cs0 pid 1726
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 937 thread firefox:cs0 pid 1554
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xwayland pid 941 thread Xwayland:cs0 pid 985
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 1031 thread firefox:cs0 pid 1838
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 920 thread firefox:cs0 pid 1743
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 97881 thread firefox:cs0 pid 98083
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 983 thread firefox:cs0 pid 1437
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 909 thread firefox:cs0 pid 1483
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 957 thread firefox:cs0 pid 1654
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 793 thread sway:cs0 pid 831
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mpv pid 1251 thread mpv:cs0 pid 1263
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Keybase pid 22121 thread Keybase:cs0 pid 22131
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 821 thread sway:cs0 pid 861
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 821 thread sway:cs0 pid 860
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process sway pid 802 thread sway:cs0 pid 840
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 1860 thread firefox:cs0 pid 1921
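
To see at a glance which processes the timeouts are attributed to, the same journal lines can be tallied per process name. This is a rough sketch; `count_timeouts` is a hypothetical helper, and it assumes the "Process information: process NAME pid N" layout shown above.

```shell
#!/bin/sh
# count_timeouts is a hypothetical helper: it reads journal lines on stdin
# and prints "<count> <process name>", most frequent first.
count_timeouts() {
    awk '/Process information: process / {
             name = $0
             # Drop everything up to and including "process ".
             sub(/.*Process information: process /, "", name)
             # Drop " pid N thread ..." and whatever follows.
             sub(/ pid .*/, "", name)
             # The "pid 0" entries carry no process name at all.
             if (name == "") name = "(none)"
             count[name]++
         }
         END { for (n in count) printf "%d %s\n", count[n], n }' |
        sort -k1,1nr -k2
}

# On the affected machine:
#   journalctl -n 100000 -p warning | count_timeouts
```

For the twenty lines listed above this comes to 10 for firefox, 6 for sway, and one each for Xwayland, mpv, Keybase and the nameless `pid 0` entry.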

I have not tried switching to X entirely. However I have ordered a new motherboard and CPU with integrated graphics. Hopefully this will resolve the issue.

Last edited by iridyllic (2023-12-16 19:13:42)

seth · 2023-12-16 19:42:00

Ever managed to cause this w/ an X11 session or maybe weston?

Things go south starting w/

Dec 16 03:38:08 sibylantgraze kernel: amdgpu 0000:07:00.0: [drm] *ERROR* [CRTC:60:crtc-0] flip_done timed out

for OUT in /sys/class/drm/card0*; do echo $OUT; edid-decode $OUT/edid; echo "================="; done

You'll need https://aur.archlinux.org/packages/edid-decode-git

plant.cat · 2023-12-17 06:02:09

I'm able to consistently trigger amdgpu kernel errors by switching Blender to GPU Compute.

[  214.630508] amdgpu 0000:03:00.0: amdgpu: bo 000000004cc48f8f va 0x0800000000-0x0800000001 conflict with 0x0800000000-0x0800000200
[  214.630516] amdgpu: Failed to map VA 0x800000000000 in vm. ret -22
[  214.630518] amdgpu: Failed to map bo to gpuvm
[  214.633239] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  214.633243] #PF: supervisor read access in kernel mode
[  214.633244] #PF: error_code(0x0000) - not-present page
[  214.633246] PGD 1bb6c4067 P4D 1bb6c4067 PUD 1f7736067 PMD 0 
[  214.633250] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  214.633252] CPU: 2 PID: 9778 Comm: blender Tainted: G           OE      6.6.7-arch1-1 #1 4505c4baa0b3d7c4037b0e8f5402626fa360717f
[  214.633255] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P2.90 11/27/2019
[  214.633257] RIP: 0010:dma_resv_add_fence+0x47/0x1f0
[  214.633262] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 59 01 00 00 8d 50 01 09 c2 0f 88 5d 01 00 00 <49> 8b 46 08 48 3d 00 73 5b ac 0f 84 c9 00 00 00 48 3d a0 72 5b ac
[  214.633264] RSP: 0018:ffffc9000c9d7c90 EFLAGS: 00010246
[  214.633266] RAX: ffff8881f7870000 RBX: ffff8881f7870158 RCX: 00000000013b4002
[  214.633268] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881f7870158
[  214.633269] RBP: ffff88815014f000 R08: 0000000000000000 R09: 000000000003a5f0
[  214.633270] R10: ffff888228070f60 R11: 0000000000000000 R12: ffff8881a99d0338
[  214.633271] R13: ffff8881a99d0340 R14: 0000000000000000 R15: ffff8881f7870000
[  214.633272] FS:  00007fa5dc6ef580(0000) GS:ffff8887a0680000(0000) knlGS:0000000000000000
[  214.633274] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  214.633275] CR2: 0000000000000008 CR3: 0000000195d74000 CR4: 00000000003506e0
[  214.633277] Call Trace:
[  214.633279]  <TASK>
[  214.633281]  ? __die+0x23/0x70
[  214.633285]  ? page_fault_oops+0x171/0x4e0
[  214.633289]  ? exc_page_fault+0x7f/0x180
[  214.633293]  ? asm_exc_page_fault+0x26/0x30
[  214.633297]  ? dma_resv_add_fence+0x47/0x1f0
[  214.633301]  amdgpu_amdkfd_gpuvm_acquire_process_vm+0x212/0x530 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[  214.633602]  kfd_process_device_init_vm+0xb0/0x320 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[  214.633869]  kfd_ioctl_acquire_vm+0x89/0xc0 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[  214.634133]  kfd_ioctl+0x3cc/0x4e0 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[  214.634396]  ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[  214.634660]  __x64_sys_ioctl+0x97/0xd0
[  214.634663]  do_syscall_64+0x60/0x90
[  214.634668]  ? syscall_exit_to_user_mode+0x2b/0x40
[  214.634670]  ? do_syscall_64+0x6c/0x90
[  214.634672]  ? syscall_exit_to_user_mode+0x2b/0x40
[  214.634674]  ? do_syscall_64+0x6c/0x90
[  214.634676]  ? do_syscall_64+0x6c/0x90
[  214.634678]  ? do_syscall_64+0x6c/0x90
[  214.634680]  ? do_syscall_64+0x6c/0x90
[  214.634682]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  214.634686] RIP: 0033:0x7fa5dc32a3af
[  214.634704] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[  214.634706] RSP: 002b:00007ffc2b343290 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  214.634708] RAX: ffffffffffffffda RBX: 00007ffc2b3433b0 RCX: 00007fa5dc32a3af
[  214.634709] RDX: 00007ffc2b343430 RSI: 0000000040084b15 RDI: 000000000000000d
[  214.634711] RBP: 00007ffc2b343430 R08: 000000000000000b R09: 0000000000000008
[  214.634712] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000040084b15
[  214.634713] R13: 000000000000000d R14: 00007fa5d5c72a40 R15: 00007fa5bc5a5180
[  214.634715]  </TASK>
[  214.634716] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq uinput rfkill xt_tcpudp nft_compat nf_tables nct6775 nct6775_core libcrc32c hwmon_vid intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd vfat fat kvm irqbypass crct10dif_pclmul snd_hda_codec_realtek crc32_pclmul polyval_clmulni snd_hda_codec_generic ledtrig_audio polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 snd_hda_codec_hdmi sha256_ssse3 sha1_ssse3 aesni_intel snd_hda_intel snd_usb_audio snd_intel_dspcfg snd_usbmidi_lib crypto_simd snd_ump snd_intel_sdw_acpi snd_hda_codec cryptd snd_rawmidi snd_seq_device rapl snd_hda_core mc snd_hwdep r8169 snd_pcm realtek wmi_bmof sp5100_tco mdio_devres snd_timer acpi_cpufreq pcspkr libphy snd i2c_piix4 ccp k10temp soundcore gpio_amdpt gpio_generic joydev mousedev mac_hid vmmon(OE) vmw_vmci i2c_dev fuse crypto_user loop dm_mod nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid amdgpu i2c_algo_bit drm_ttm_helper ttm drm_exec drm_suballoc_helper amdxcp nvme drm_buddy gpu_sched
[  214.634764]  nvme_core crc32c_intel drm_display_helper video xhci_pci nvme_common cec xhci_pci_renesas wmi
[  214.634771] CR2: 0000000000000008
[  214.634772] ---[ end trace 0000000000000000 ]---
[  214.634773] RIP: 0010:dma_resv_add_fence+0x47/0x1f0
[  214.634777] Code: 89 54 24 04 48 85 f6 74 21 48 8d 7e 38 b8 01 00 00 00 f0 0f c1 46 38 85 c0 0f 84 59 01 00 00 8d 50 01 09 c2 0f 88 5d 01 00 00 <49> 8b 46 08 48 3d 00 73 5b ac 0f 84 c9 00 00 00 48 3d a0 72 5b ac
[  214.634778] RSP: 0018:ffffc9000c9d7c90 EFLAGS: 00010246
[  214.634780] RAX: ffff8881f7870000 RBX: ffff8881f7870158 RCX: 00000000013b4002
[  214.634781] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881f7870158
[  214.634782] RBP: ffff88815014f000 R08: 0000000000000000 R09: 000000000003a5f0
[  214.634783] R10: ffff888228070f60 R11: 0000000000000000 R12: ffff8881a99d0338
[  214.634784] R13: ffff8881a99d0340 R14: 0000000000000000 R15: ffff8881f7870000
[  214.634785] FS:  00007fa5dc6ef580(0000) GS:ffff8887a0680000(0000) knlGS:0000000000000000
[  214.634787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  214.634788] CR2: 0000000000000008 CR3: 0000000195d74000 CR4: 00000000003506e0
[  214.634789] note: blender[9778] exited with irqs disabled

seth · 2023-12-17 09:47:19

Which bears no resemblance to the OP's errors, but see e.g. https://bbs.archlinux.org/viewtopic.php?id=290926 ("solved" by rendering LO in software, probably not applicable to blender…)
amdgpu is in a pretty lousy state lately, there're multiple threads w/ various issues over the past year, so "i have amdgpu crash, must be same" is not a reasonable assumption.

Try the LTS kernel.
