#26 2023-03-27 19:43:26

seth
Member
Registered: 2012-09-03
Posts: 49,992

Re: Random GPU crashes amdgpu across 2 different GPUs

So according to #24 it stabilized around the 15th and now it broke on the 26th - any suspiciously relevant updates in your pacman log for those dates?
Did anything else differ during this period (climate, access/usage patterns, etc.)?
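One way to narrow the pacman log to a date window, as a minimal sketch. It assumes the log lines have the shape `[timestamp] [ALPM] action package (version)`; `package_changes` is a made-up helper name, not a pacman tool.

```python
import re
from datetime import date, datetime
from typing import Iterable, Iterator, Tuple

# Assumed line shape, e.g.:
#   [2023-03-21T08:00:00+0000] [ALPM] upgraded firefox (111.0-1 -> 111.0.1-1)
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[ALPM\] "
    r"(?P<action>installed|upgraded|downgraded|reinstalled|removed) "
    r"(?P<pkg>\S+) \((?P<ver>[^)]*)\)"
)


def package_changes(
    lines: Iterable[str], start: date, end: date
) -> Iterator[Tuple[date, str, str, str]]:
    """Yield (day, action, package, version) for changes with start <= day <= end."""
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a package transaction line
        # The first 10 characters of the timestamp are the YYYY-MM-DD date.
        day = datetime.strptime(m["ts"][:10], "%Y-%m-%d").date()
        if start <= day <= end:
            yield day, m["action"], m["pkg"], m["ver"]
```

For example, `package_changes(open("/var/log/pacman.log"), date(2023, 3, 15), date(2023, 3, 26))` would list what changed between the 15th and the 26th, assuming the log is at pacman's default location.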

Offline

#27 2023-03-28 01:16:22

mearkat7
Member
Registered: 2023-03-07
Posts: 22

Re: Random GPU crashes amdgpu across 2 different GPUs

seth wrote:

So according to #24 it stabilized around the 15th and now it broke on the 26th - any suspiciously relevant updates in your pacman log for those dates?
Did anything else differ during this period (climate, access/usage patterns, etc.)?

I may not have worded it correctly: it was stable from around the 21st to the 26th, so slightly under a week.

When I browsed the log, Firefox was the only updated package that I know has been involved in the issue (although it has also occurred with just Chromium). The rest of my pacman.log was mainly the installs from switching to Wayland.

In terms of climate, I had started to notice that with X.org/i3 it seemed to occur mainly first thing in the morning, when the PC was first switched on, so I was thinking it might be a temperature thing. But the 2 crashes on Wayland were both at around 3pm. It seems less frequent on Wayland, but a crash has more impact there. I'll keep note of the times; if it happens again today around 3pm, that would suggest a pattern is emerging.
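To check the time-of-day hunch, the ring timeout lines can be counted per hour. A rough sketch, assuming kernel messages saved as text in the journal's default short format (for example from `journalctl -k`); `timeouts_by_hour` is a made-up name for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g.:
#   Apr 05 10:06:54 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, ...
# The follow-up "Process information" line has no "ring <name> timeout" part,
# so each crash is counted once.
TIMEOUT_RE = re.compile(
    r"^\S+ \d{1,2} (?P<hour>\d{2}):\d{2}:\d{2} "
    r".*amdgpu_job_timedout.*ring \S+ timeout"
)


def timeouts_by_hour(lines: Iterable[str]) -> Counter:
    """Count amdgpu ring timeout messages per hour of day (0-23)."""
    counts: Counter = Counter()
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            counts[int(m["hour"])] += 1
    return counts
```

A cluster around one hour over several weeks would support a usage or temperature pattern; a flat spread would argue against it.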

Last edited by mearkat7 (2023-03-28 01:20:15)

Offline

#28 2023-03-28 08:45:17

UndeadLeech
Member
Registered: 2016-03-07
Posts: 6

Re: Random GPU crashes amdgpu across 2 different GPUs

I tried it again after updating to 6.2.7 and can't reproduce it anymore, so it's quite possible that release fixed it.

Offline

#29 2023-04-05 08:51:02

Bodyash
Member
Registered: 2023-04-05
Posts: 9

Re: Random GPU crashes amdgpu across 2 different GPUs

Hello there

Running Arch 6.2.9-arch1-1 kernel
DE - Gnome (on Wayland)
Mesa 23.0.1-2

The GPU constantly crashes when using the Vivaldi browser. The chances are high when I have a Microsoft Teams URL open; it can crash a few times during calls or when chatting with someone.
I need to restart gnome-session when the GPU driver has crashed.
Other tasks in this browser are okay.
Playing games on this GPU is also okay, no issues.

Ryzen 6900HX + Radeon RX6600M

$ lspci | grep ' VGA '
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
e8:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] (rev c7)
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32774, for process vivaldi-bin pid 46941 thread vivaldi-bi:cs0 pid 46954)
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00008001454ff000 from client 0x1b (UTCL2)
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32774, for process vivaldi-bin pid 46941 thread vivaldi-bi:cs0 pid 46954)
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00008001454ff000 from client 0x1b (UTCL2)
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 05 10:06:44 arch kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 05 10:06:54 arch vivaldi-stable.desktop[46906]: [46941:46941:0405/100654.155185:ERROR:shared_context_state.cc(860)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_INNOCENT_CONTEXT_RESET_KHR
Apr 05 10:06:54 arch vivaldi-stable.desktop[46906]: [46941:46941:0405/100654.155430:ERROR:gpu_service_impl.cc(1011)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
Apr 05 10:06:54 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
Apr 05 10:06:54 arch vivaldi-stable.desktop[46906]: [46901:46901:0405/100654.163985:ERROR:gpu_process_host.cc(954)] GPU process exited unexpectedly: exit_code=8704
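The timeout line above ends in "but soft recovered", while the kernel can also print the same message with `signaled seq`/`emitted seq` numbers instead. A small sketch for telling the two apart when scanning a saved log; `RingTimeout` and `parse_timeout` are made-up names, and only these two message shapes are handled.

```python
import re
from typing import NamedTuple, Optional


class RingTimeout(NamedTuple):
    ring: str                # e.g. "gfx_0.0.0"
    soft_recovered: bool     # True for the "but soft recovered" variant
    signaled: Optional[int]  # last signaled sequence number, if printed
    emitted: Optional[int]   # last emitted sequence number, if printed


TIMEOUT_RE = re.compile(
    r"amdgpu_job_timedout.*ring (?P<ring>\S+) timeout, "
    r"(?:(?P<soft>but soft recovered)"
    r"|signaled seq=(?P<sig>\d+), emitted seq=(?P<emit>\d+))"
)


def parse_timeout(line: str) -> Optional[RingTimeout]:
    """Return the parsed ring timeout, or None if the line is something else."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    if m["soft"]:
        return RingTimeout(m["ring"], True, None, None)
    return RingTimeout(m["ring"], False, int(m["sig"]), int(m["emit"]))
```

The ring name is worth keeping, since it shows which part of the GPU timed out.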

Offline

#30 2023-04-07 18:41:27

JPenuchot
Member
Registered: 2020-02-10
Posts: 7

Re: Random GPU crashes amdgpu across 2 different GPUs

Hi there,

I've been having the same issue on Linux 6.2.9, Mesa 23.0.1 with a 5700 XT and an i7 4790K. This was while playing Minecraft; I've had issues while playing Cyberpunk too:

Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x0020113B
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x5
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x1
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:2 pasid:32771, for process java pid 1604 thread java:cs0 pid 1701)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080012b140000 from client 0x1b (UTCL2)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 07 20:24:37 jpenuchot-nzxt kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 07 20:24:43 jpenuchot-nzxt input-leapc[1485]: InputLeap 2.4.0-release: [2023-04-07T20:24:43] INFO: leaving screen
Apr 07 20:24:47 jpenuchot-nzxt kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=611913, emitted seq=611915
Apr 07 20:24:47 jpenuchot-nzxt kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process java pid 1604 thread java:cs0 pid 1701
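For reference, the fields the kernel prints under each GCVM_L2_PROTECTION_FAULT_STATUS line can be recomputed from the raw hex value. The bit positions below are an assumption inferred from the values in this thread (they reproduce every decoded field printed in the logs, including the vmid), not copied from the driver headers, so treat this as a sketch rather than a register reference.

```python
# (shift, width) per field. ASSUMED layout, inferred from the log output above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # printed as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw fault status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # First status word from the log above.
    for name, value in decode_fault_status(0x0020113B).items():
        print(f"{name}: {value:#x}")
```

Under this layout, the all-zero status words on the repeated faults decode to all-zero fields, which matches what the kernel printed for them.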

Last edited by JPenuchot (2023-04-07 18:41:49)

Offline

#31 2023-04-10 20:22:01

Noammac
Member
Registered: 2017-06-26
Posts: 5

Re: Random GPU crashes amdgpu across 2 different GPUs

Can confirm I am also experiencing this crash, and have been experiencing similar troubles with my AMD GPU for some time now.

Kernel version 6.2.10, Mesa 23.0.2, RX 6700XT, Ryzen 5 5600.

Relevant log from today:

Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x000080031d428000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x000080031d430000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x000080031143f000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x000080031d431000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800311437000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x000080031143e000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x000080031d438000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x000080031d439000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32769, for process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800311436000 from client 0x1b (UTCL2)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 10 23:11:24 prospit kernel: amdgpu 0000:0b:00.0: amdgpu:          RW: 0x0
Apr 10 23:11:34 prospit kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=84609, emitted seq=84611
Apr 10 23:11:34 prospit kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 1746 thread kwin_wayla:cs0 pid 1885
Apr 10 23:11:35 prospit kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
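Since this log faults on several different pages, it can help to collapse the repeated blocks into one list of distinct addresses per process. A minimal sketch over saved journal text; `fault_addresses` is a made-up helper, and it assumes each "page fault" header line is followed by its "in page starting at address" line, as in the log above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

FAULT_RE = re.compile(
    r"\[gfxhub\] page fault \(.*for process (?P<proc>.+?) pid (?P<pid>\d+) thread"
)
ADDR_RE = re.compile(r"in page starting at address (?P<addr>0x[0-9a-fA-F]+)")


def fault_addresses(lines: Iterable[str]) -> Dict[Tuple[str, int], List[int]]:
    """Map (process name, pid) to the sorted distinct faulting page addresses."""
    seen = defaultdict(set)
    current: Optional[Tuple[str, int]] = None
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            # Header line: remember which process the next address belongs to.
            current = (m["proc"], int(m["pid"]))
            continue
        m = ADDR_RE.search(line)
        if m and current is not None:
            seen[current].add(int(m["addr"], 16))
    return {key: sorted(addrs) for key, addrs in seen.items()}
```

Printing the result with `hex()` makes it easy to compare the affected pages between crashes.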

Offline

#32 2023-04-27 07:49:55

Bodyash
Member
Registered: 2023-04-05
Posts: 9

Re: Random GPU crashes amdgpu across 2 different GPUs

It looks like the problem disappeared for me with kernel 6.2.12. My system is stable again.

UPD: after writing this comment, the browser crashed the GPU driver again.

Last edited by Bodyash (2023-04-27 07:57:22)

Offline

#33 2023-05-05 18:55:48

tonatiuhmira
Member
Registered: 2020-04-28
Posts: 2

Re: Random GPU crashes amdgpu across 2 different GPUs

Just to add that I'm also having this bug in 6.3.1-zen1-1:

$ uname -r
6.3.1-zen1-1-zen
$ lspci | grep ' VGA '
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c3)

It crashes while browsing (Firefox, Twitter), roughly once a week, and has done so for a couple of months (i.e., since I got this computer). There's a sudden black screen followed by a "static" screen showing the last displayed frame of my desktop. I was able to switch to a VT and reboot the computer without further issue.

may 05 11:55:55 blanquita kernel: [drm] failed to load ucode VCN0_RAM(0x3A) 
may 05 11:55:55 blanquita kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
may 05 11:55:55 blanquita kernel: [drm] failed to load ucode VCN1_RAM(0x3B) 
may 05 11:55:55 blanquita kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
may 05 11:56:05 blanquita kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec_0 timeout, signaled seq=6610, emitted seq=6614
may 05 11:56:05 blanquita kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RDD Process pid 2067 thread firefox:cs0 pid 3092
may 05 11:56:05 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
may 05 11:56:05 blanquita kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
may 05 11:56:05 blanquita kernel: [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x000000e0 != 0x00000000
may 05 11:56:05 blanquita kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
may 05 11:56:05 blanquita kernel: [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
may 05 11:56:06 blanquita kernel: [drm] Register(1) [mmUVD_RBC_RB_RPTR] failed to reach value 0x000000a0 != 0x00000000
may 05 11:56:06 blanquita kernel: [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
may 05 11:56:06 blanquita plasmashell[1469]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
may 05 11:56:06 blanquita kernel: ------------[ cut here ]------------
may 05 11:56:06 blanquita kernel: WARNING: CPU: 5 PID: 7700 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:599 amdgpu_irq_put+0xf3/0x120 [amdgpu]
may 05 11:56:06 blanquita kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq ccm cmac algif_hash algif_skcipher af_alg hid_logitech_hidpp bnep btusb btrtl btbcm btintel btmtk bluetooth ecdh_generic snd_usb_audio snd_usbmidi_lib snd_rawmidi xpad hid_logitech_dj snd_>
may 05 11:56:06 blanquita kernel:  gpio_generic acpi_cpufreq mac_hid dm_multipath snd_aloop snd_pcm snd_timer snd soundcore v4l2loopback_dc(OE) videodev mc crypto_user loop fuse dm_mod bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid amdgpu i2c_algo_bit >
may 05 11:56:06 blanquita kernel: CPU: 5 PID: 7700 Comm: kworker/u24:1 Tainted: G           OE      6.3.1-zen1-1-zen #1 156717576b1d5c8078aee7319e8186602a31f594
may 05 11:56:06 blanquita kernel: Hardware name: ASUS System Product Name/ROG STRIX B650E-F GAMING WIFI, BIOS 1412 04/25/2023
may 05 11:56:06 blanquita kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
may 05 11:56:06 blanquita kernel: RIP: 0010:amdgpu_irq_put+0xf3/0x120 [amdgpu]
may 05 11:56:06 blanquita kernel: Code: 89 ff ff d0 0f 1f 00 4c 89 ee 4c 89 f7 89 04 24 e8 a2 a9 2a c7 8b 04 24 48 83 c4 08 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc <0f> 0b b8 ea ff ff ff e9 64 ff ff ff b8 ea ff ff ff e9 5a ff ff ff
may 05 11:56:06 blanquita kernel: RSP: 0018:ffffb8a2539d7c70 EFLAGS: 00010246
may 05 11:56:06 blanquita kernel: RAX: ffff9d7414c61a80 RBX: ffff9d7414820000 RCX: 0000000000000000
may 05 11:56:06 blanquita kernel: RDX: 0000000000000000 RSI: ffff9d7414822510 RDI: ffff9d7414820000
may 05 11:56:06 blanquita kernel: RBP: 0000000000000000 R08: 0000000000040000 R09: 0000000000000000
may 05 11:56:06 blanquita kernel: R10: 0000000000000001 R11: ffff9d74009c26c0 R12: 0000000000000000
may 05 11:56:06 blanquita kernel: R13: ffff9d74148389a0 R14: ffff9d7502ac6c00 R15: 0000000000000000
may 05 11:56:06 blanquita kernel: FS:  0000000000000000(0000) GS:ffff9d7b5dd40000(0000) knlGS:0000000000000000
may 05 11:56:06 blanquita kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
may 05 11:56:06 blanquita kernel: CR2: 00007f46ecbc4000 CR3: 0000000105ba2000 CR4: 0000000000750ee0
may 05 11:56:06 blanquita kernel: PKRU: 55555554
may 05 11:56:06 blanquita kernel: Call Trace:
may 05 11:56:06 blanquita kernel:  <TASK>
may 05 11:56:06 blanquita kernel:  gmc_v10_0_suspend+0x53/0x90 [amdgpu 847874e39535bccfb19fd35aab9642c104bab4b6]
may 05 11:56:06 blanquita kernel:  amdgpu_device_ip_suspend_phase2+0x104/0x1a0 [amdgpu 847874e39535bccfb19fd35aab9642c104bab4b6]
may 05 11:56:06 blanquita kernel:  ? amdgpu_device_ip_suspend_phase1+0x64/0xe0 [amdgpu 847874e39535bccfb19fd35aab9642c104bab4b6]
may 05 11:56:06 blanquita kernel:  amdgpu_device_pre_asic_reset+0xe3/0x2c0 [amdgpu 847874e39535bccfb19fd35aab9642c104bab4b6]
may 05 11:56:06 blanquita kernel:  amdgpu_device_gpu_recover+0x484/0xec0 [amdgpu 847874e39535bccfb19fd35aab9642c104bab4b6]
may 05 11:56:06 blanquita kernel:  amdgpu_job_timedout+0x18d/0x240 [amdgpu 847874e39535bccfb19fd35aab9642c104bab4b6]
may 05 11:56:06 blanquita kernel:  drm_sched_job_timedout+0x77/0x110 [gpu_sched 9b9c9d3603e187ddb26f6cc1cb6c604387b484a9]
may 05 11:56:06 blanquita kernel:  process_one_work+0x24f/0x460
may 05 11:56:06 blanquita kernel:  worker_thread+0x55/0x4f0
may 05 11:56:06 blanquita kernel:  ? __pfx_worker_thread+0x10/0x10
may 05 11:56:06 blanquita kernel:  kthread+0xdb/0x110
may 05 11:56:06 blanquita kernel:  ? __pfx_kthread+0x10/0x10
may 05 11:56:06 blanquita kernel:  ret_from_fork+0x29/0x50
may 05 11:56:06 blanquita kernel:  </TASK>
may 05 11:56:06 blanquita kernel: ---[ end trace 0000000000000000 ]---
may 05 11:56:06 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
may 05 11:56:06 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
may 05 11:56:06 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
may 05 11:56:06 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
may 05 11:56:06 blanquita kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000F00000).
may 05 11:56:06 blanquita kernel: [drm] VRAM is lost due to GPU reset!
may 05 11:56:06 blanquita kernel: [drm] PSP is resuming...
may 05 11:56:06 blanquita plasmashell[1469]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:06 blanquita plasmashell[1469]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:06 blanquita plasmashell[1469]: [GFX1]: Device reset due to WR context
may 05 11:56:06 blanquita plasmashell[1469]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
may 05 11:56:06 blanquita plasmashell[1702]: [GFX1-]: Failed to connect WebRenderBridgeChild. isParent=false
may 05 11:56:07 blanquita kernel: [drm] reserve 0xa00000 from 0x8001000000 for PSP TMR
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x00000040, smu fw if version = 0x00000041, smu fw program = 0, version = 0x003a5600 (58.86.0)
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: use vbios provided pptable
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
may 05 11:56:07 blanquita kernel: [drm] DMUB hardware initialized: version=0x02020017
may 05 11:56:07 blanquita kwin_wayland[798]: kwin_scene_opengl: A graphics reset not attributable to the current GL context occurred.
may 05 11:56:07 blanquita kernel: [drm] kiq ring mec 2 pipe 1 q 0
may 05 11:56:07 blanquita kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
may 05 11:56:07 blanquita kernel: [drm] JPEG decode initialized successfully.
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 1
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 1
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 1
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 1
may 05 11:56:07 blanquita plasmashell[2067]: amdgpu: amdgpu_cs_query_fence_status failed.
may 05 11:56:07 blanquita plasmashell[2067]: amdgpu: amdgpu_cs_query_fence_status failed.
may 05 11:56:07 blanquita plasmashell[2067]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
may 05 11:56:07 blanquita plasmashell[2067]: amdgpu: The process will be terminated.
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
may 05 11:56:07 blanquita kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(1) succeeded!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm] Skip scheduling IBs!
may 05 11:56:07 blanquita kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
may 05 11:56:07 blanquita plasmashell[1469]: [GFX1-]: VideoBridgeParent receives IPC close with reason=AbnormalShutdown
may 05 11:56:08 blanquita kwin_wayland[798]: kwin_scene_opengl: Waiting for glGetGraphicsResetStatus to return GL_NO_ERROR timed out!
may 05 11:56:08 blanquita kwin_wayland[798]: OpenGL vendor string:                   AMD
may 05 11:56:08 blanquita kwin_wayland[798]: OpenGL renderer string:                 AMD Radeon RX 6800 (navi21, LLVM 15.0.7, DRM 3.52, 6.3.1-zen1-1-zen)
may 05 11:56:08 blanquita kwin_wayland[798]: OpenGL version string:                  4.6 (Core Profile) Mesa 23.0.3
may 05 11:56:08 blanquita kwin_wayland[798]: OpenGL shading language version string: 4.60
may 05 11:56:08 blanquita kwin_wayland[798]: Driver:                                 Unknown
may 05 11:56:08 blanquita kwin_wayland[798]: GPU class:                              Unknown
may 05 11:56:08 blanquita kwin_wayland[798]: OpenGL version:                         4.6
may 05 11:56:08 blanquita kwin_wayland[798]: GLSL version:                           4.60
may 05 11:56:08 blanquita kwin_wayland[798]: Mesa version:                           23.0.3
may 05 11:56:08 blanquita kwin_wayland[798]: X server version:                       1.23.1
may 05 11:56:08 blanquita kwin_wayland[798]: Linux kernel version:                   6.3.1
may 05 11:56:08 blanquita kwin_wayland[798]: Requires strict binding:                no
may 05 11:56:08 blanquita kwin_wayland[798]: GLSL shaders:                           yes
may 05 11:56:08 blanquita kwin_wayland[798]: Texture NPOT support:                   yes
may 05 11:56:08 blanquita kwin_wayland[798]: Virtual Machine:                        no
may 05 11:56:08 blanquita kwin_wayland_wrapper[798]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:08 blanquita kwin_wayland_wrapper[798]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:08 blanquita kwin_wayland[798]: BlurConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland_wrapper[798]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:08 blanquita kwin_wayland_wrapper[798]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:08 blanquita kwin_wayland_wrapper[798]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:08 blanquita kwin_wayland_wrapper[798]: amdgpu: The CS has been rejected (-125). Recreate the context.
may 05 11:56:08 blanquita kwin_wayland[798]: ZoomConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland[798]: WindowViewConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland[798]: SlidingPopupsConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland[798]: SlideConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland[798]: OverviewConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland[798]: KscreenConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland[798]: DesktopGridConfig::instance called after the first use - ignoring
may 05 11:56:08 blanquita kwin_wayland[798]: kwin_scene_opengl: A graphics reset not attributable to the current GL context occurred.
may 05 11:56:09 blanquita kwin_wayland[798]: kwin_scene_opengl: Waiting for glGetGraphicsResetStatus to return GL_NO_ERROR timed out!
may 05 11:56:09 blanquita kwin_wayland[798]: OpenGL vendor string:                   AMD
may 05 11:56:09 blanquita kwin_wayland[798]: OpenGL renderer string:                 AMD Radeon RX 6800 (navi21, LLVM 15.0.7, DRM 3.52, 6.3.1-zen1-1-zen)
may 05 11:56:09 blanquita kwin_wayland[798]: OpenGL version string:                  4.6 (Core Profile) Mesa 23.0.3
may 05 11:56:09 blanquita kwin_wayland[798]: OpenGL shading language version string: 4.60
may 05 11:56:09 blanquita kwin_wayland[798]: Driver:                                 Unknown
may 05 11:56:09 blanquita kwin_wayland[798]: GPU class:                              Unknown
may 05 11:56:09 blanquita kwin_wayland[798]: OpenGL version:                         4.6
may 05 11:56:09 blanquita kwin_wayland[798]: GLSL version:                           4.60
may 05 11:56:09 blanquita kwin_wayland[798]: Mesa version:                           23.0.3
may 05 11:56:09 blanquita kwin_wayland[798]: X server version:                       1.23.1
may 05 11:56:09 blanquita kwin_wayland[798]: Linux kernel version:                   6.3.1
may 05 11:56:09 blanquita kwin_wayland[798]: Requires strict binding:                no
may 05 11:56:09 blanquita kwin_wayland[798]: GLSL shaders:                           yes
may 05 11:56:09 blanquita kwin_wayland[798]: Texture NPOT support:                   yes
may 05 11:56:09 blanquita kwin_wayland[798]: Virtual Machine:                        no
may 05 11:56:09 blanquita kwin_wayland[798]: BlurConfig::instance called after the first use - ignoring
may 05 11:56:09 blanquita kwin_wayland[798]: ZoomConfig::instance called after the first use - ignoring

...

And it keeps repeating the same messages.


ADDENDUM: It happened again just a moment ago, while loading YouTube. It seems 6.3 has made it worse.

Last edited by tonatiuhmira (2023-05-05 19:11:06)

Offline

#34 2023-05-06 10:55:00

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,868

Re: Random GPU crashes amdgpu across 2 different GPUs

tonatiuhmira, you may not be facing the same bug.

The other problem logs all have (at least) 2 things in common:
- a gpu reset happens
- messages mentioning GCVM_L2_PROTECTION_FAULT_STATUS appear

The snippet you posted is about a gpu reset, but doesn't show the other message.
Please check if GCVM_L2_PROTECTION_FAULT is present in your logs after a crash.

If that phrase is not present, you are facing another bug and should start a new thread.
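
A minimal sketch of that check in Python, for anyone who wants to run it against a saved journal excerpt. The function name classify_crash_log and the returned keys are made up for illustration; the two patterns are copied from the log lines posted in this thread ("GPU reset begin!" and "GCVM_L2_PROTECTION_FAULT_STATUS:"), so other kernel versions may word them differently.

```python
import re

# Both patterns are taken from messages quoted in this thread;
# treat them as assumptions for other kernel versions.
RESET_RE = re.compile(r"amdgpu: GPU reset begin!")
FAULT_RE = re.compile(r"GCVM_L2_PROTECTION_FAULT_STATUS:(0x[0-9a-fA-F]+)")


def classify_crash_log(text):
    """Report which of the two common markers appear in a log excerpt."""
    has_reset = bool(RESET_RE.search(text))
    fault_values = FAULT_RE.findall(text)
    return {
        "gpu_reset": has_reset,
        "l2_protection_fault": bool(fault_values),
        "fault_status_values": fault_values,
        # Only logs with both markers match the pattern discussed here.
        "same_bug_candidate": has_reset and bool(fault_values),
    }


# Two lines copied from a log in this thread, used as a small demo input.
sample = (
    "May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu: "
    "GCVM_L2_PROTECTION_FAULT_STATUS:0x00241051\n"
    "May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!\n"
)
print(classify_crash_log(sample))
```

To use it on a real crash, save the kernel messages of the affected boot to a file (for example with journalctl -k -b -1 > crash.log), read the file, and pass its contents to the function.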


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)

Offline

#35 2023-05-19 08:19:19

Bodyash
Member
Registered: 2023-04-05
Posts: 9

Re: Random GPU crashes amdgpu across 2 different GPUs

So, it's still crashing sometimes:

May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32771, for process vivaldi-bin pid 254791 thread vivaldi-bi:cs0 pid 254806)
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800100639000 from client 0x1b (UTCL2)
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00241051
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x1
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x1
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32771, for process vivaldi-bin pid 254791 thread vivaldi-bi:cs0 pid 254806)
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800100639000 from client 0x1b (UTCL2)
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
May 19 10:03:28 arch kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
May 19 10:03:38 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
May 19 10:03:49 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=26948552, emitted seq=26948555
May 19 10:03:49 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
May 19 10:03:49 arch kernel: ------------[ cut here ]------------
May 19 10:03:49 arch kernel: WARNING: CPU: 12 PID: 254207 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:599 amdgpu_irq_put+0x46/0x70 [amdgpu]
May 19 10:03:49 arch kernel: Modules linked in: uvcvideo videobuf2_vmalloc uvc videobuf2_memops videobuf2_v4l2 videodev videobuf2_common vhost_net vhost vhost_iotlb tap tun uinput hid_microsoft ff_memless hidp ccm rfcomm snd_seq_dummy sn>
May 19 10:03:49 arch kernel:  sha512_ssse3 snd_hda_core snd_seq_device aesni_intel drm_ttm_helper mc snd_hwdep crypto_simd libarc4 ttm snd_pci_acp5x wmi_bmof cryptd usbhid snd_pcm snd_rn_pci_acp3x cfg80211 drm_display_helper rapl snd_acp>
May 19 10:03:49 arch kernel: CPU: 12 PID: 254207 Comm: kworker/u32:2 Tainted: G        W          6.3.2-arch1-1 #1 44a850778a68c42d012ba8e685997cb0375875a4
May 19 10:03:49 arch kernel: Hardware name: Micro Computer (HK) Tech Limited HX99G/F7BAA, BIOS 0.18 03/02/2023
May 19 10:03:49 arch kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
May 19 10:03:49 arch kernel: RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
May 19 10:03:49 arch kernel: Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
May 19 10:03:49 arch kernel: RSP: 0018:ffff9dac23907ca0 EFLAGS: 00010246
May 19 10:03:49 arch kernel: RAX: ffff8df0c48a0a40 RBX: ffff8df0e1e00000 RCX: 0000000000000000
May 19 10:03:49 arch kernel: RDX: 0000000000000000 RSI: ffff8df0e1e02510 RDI: ffff8df0e1e00000
May 19 10:03:49 arch kernel: RBP: ffff8df0e1e00000 R08: 0000000000000000 R09: 0000000000000000
May 19 10:03:49 arch kernel: R10: 0000000000000001 R11: 0000000000000100 R12: 0000000000001050
May 19 10:03:49 arch kernel: R13: ffff8df0e1e189a0 R14: ffff8dfc1ab70a00 R15: 0000000000000000
May 19 10:03:49 arch kernel: FS:  0000000000000000(0000) GS:ffff8dffbe900000(0000) knlGS:0000000000000000
May 19 10:03:49 arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 10:03:49 arch kernel: CR2: 000025a40d0ffc50 CR3: 00000002cec20000 CR4: 0000000000750ee0
May 19 10:03:49 arch kernel: PKRU: 55555554
May 19 10:03:49 arch kernel: Call Trace:
May 19 10:03:49 arch kernel:  <TASK>
May 19 10:03:49 arch kernel:  gmc_v10_0_hw_fini+0x53/0x90 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  gmc_v10_0_suspend+0xe/0x20 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  amdgpu_device_ip_suspend_phase2+0x107/0x1a0 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  ? amdgpu_device_ip_suspend_phase1+0x71/0xe0 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  amdgpu_device_ip_suspend+0x36/0x70 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  amdgpu_device_pre_asic_reset+0xd3/0x2b0 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  amdgpu_device_gpu_recover+0x4c7/0xd60 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  amdgpu_job_timedout+0x18d/0x240 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:49 arch kernel:  drm_sched_job_timedout+0x7a/0x110 [gpu_sched f3b1fbe337249fc10d476a810e412960d1b556b8]
May 19 10:03:49 arch kernel:  process_one_work+0x1c7/0x3d0
May 19 10:03:49 arch kernel:  worker_thread+0x51/0x390
May 19 10:03:49 arch kernel:  ? __pfx_worker_thread+0x10/0x10
May 19 10:03:49 arch kernel:  kthread+0xde/0x110
May 19 10:03:49 arch kernel:  ? __pfx_kthread+0x10/0x10
May 19 10:03:49 arch kernel:  ret_from_fork+0x2c/0x50
May 19 10:03:49 arch kernel:  </TASK>
May 19 10:03:49 arch kernel: ---[ end trace 0000000000000000 ]---
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
May 19 10:03:49 arch kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
May 19 10:03:49 arch kernel: [drm] VRAM is lost due to GPU reset!
May 19 10:03:49 arch kernel: [drm] PSP is resuming...
May 19 10:03:49 arch kernel: [drm] reserve 0xa00000 from 0x8001000000 for PSP TMR
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b2a00 (59.42.0)
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: use vbios provided pptable
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
May 19 10:03:49 arch kernel: [drm] DMUB hardware initialized: version=0x02020017
May 19 10:03:49 arch kernel: [drm] kiq ring mec 2 pipe 1 q 0
May 19 10:03:49 arch kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
May 19 10:03:49 arch kernel: [drm] JPEG decode initialized successfully.
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 19 10:03:49 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
May 19 10:03:50 arch kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(4) succeeded!
May 19 10:03:50 arch kernel: [drm] Skip scheduling IBs!
May 19 10:03:50 arch gnome-shell[164568]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
May 19 10:03:50 arch gnome-shell[164568]: amdgpu: The process will be terminated.
May 19 10:03:50 arch WebKitWebProces[255448]: Error reading events from display: Broken pipe
May 19 10:03:50 arch gnome-shell[167627]: (EE) failed to read Wayland events: Broken pipe

This time I had no luck switching to a tty and killing the session; the image was stuck. A second error also occurred:

May 19 10:03:55 arch kernel: ------------[ cut here ]------------
May 19 10:03:55 arch kernel: WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:599 amdgpu_irq_put+0x46/0x70 [amdgpu]
May 19 10:03:55 arch kernel: Modules linked in: uvcvideo videobuf2_vmalloc uvc videobuf2_memops videobuf2_v4l2 videodev videobuf2_common vhost_net vhost vhost_iotlb tap tun uinput hid_microsoft ff_memless hidp ccm rfcomm snd_seq_dummy sn>
May 19 10:03:55 arch kernel:  sha512_ssse3 snd_hda_core snd_seq_device aesni_intel drm_ttm_helper mc snd_hwdep crypto_simd libarc4 ttm snd_pci_acp5x wmi_bmof cryptd usbhid snd_pcm snd_rn_pci_acp3x cfg80211 drm_display_helper rapl snd_acp>
May 19 10:03:55 arch kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.3.2-arch1-1 #1 44a850778a68c42d012ba8e685997cb0375875a4
May 19 10:03:55 arch kernel: Hardware name: Micro Computer (HK) Tech Limited HX99G/F7BAA, BIOS 0.18 03/02/2023
May 19 10:03:55 arch kernel: RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
May 19 10:03:55 arch kernel: Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
May 19 10:03:55 arch kernel: RSP: 0018:ffff9dac00003e20 EFLAGS: 00010046
May 19 10:03:55 arch kernel: RAX: ffff8df0c4ba2d60 RBX: ffff8df0ec749800 RCX: 0000000000000000
May 19 10:03:55 arch kernel: RDX: 0000000000000000 RSI: ffff8df0e1e065c8 RDI: ffff8df0e1e00000
May 19 10:03:55 arch kernel: RBP: 0000000000000000 R08: ffffffffc1a1fb22 R09: 0000000000000000
May 19 10:03:55 arch kernel: R10: ffff9dac00003d10 R11: ffff9dac00003d14 R12: ffff8df0e1e00010
May 19 10:03:55 arch kernel: R13: ffff8df0e1e00000 R14: ffff8dfac3324e00 R15: ffff8dffbe621fc0
May 19 10:03:55 arch kernel: FS:  0000000000000000(0000) GS:ffff8dffbe600000(0000) knlGS:0000000000000000
May 19 10:03:55 arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 10:03:55 arch kernel: CR2: 000014fc00a0c000 CR3: 00000002cec20000 CR4: 0000000000750ef0
May 19 10:03:55 arch kernel: PKRU: 55555554
May 19 10:03:55 arch kernel: Call Trace:
May 19 10:03:55 arch kernel:  <IRQ>
May 19 10:03:55 arch kernel:  dm_set_vblank+0x187/0x1b0 [amdgpu 0323f8491ce43051980d5a5c8bf0138a850d0341]
May 19 10:03:55 arch kernel:  drm_vblank_disable_and_save+0xba/0xf0
May 19 10:03:55 arch kernel:  vblank_disable_fn+0x67/0x80
May 19 10:03:55 arch kernel:  ? __pfx_vblank_disable_fn+0x10/0x10
May 19 10:03:55 arch kernel:  call_timer_fn+0x27/0x130
May 19 10:03:55 arch kernel:  ? __pfx_vblank_disable_fn+0x10/0x10
May 19 10:03:55 arch kernel:  __run_timers+0x222/0x2c0
May 19 10:03:55 arch kernel:  run_timer_softirq+0x1d/0x40
May 19 10:03:55 arch kernel:  __do_softirq+0xd4/0x2c8
May 19 10:03:55 arch kernel:  __irq_exit_rcu+0xbb/0xf0
May 19 10:03:55 arch kernel:  sysvec_apic_timer_interrupt+0x72/0x90
May 19 10:03:55 arch kernel:  </IRQ>
May 19 10:03:55 arch kernel:  <TASK>
May 19 10:03:55 arch kernel:  asm_sysvec_apic_timer_interrupt+0x1a/0x20
May 19 10:03:55 arch kernel: RIP: 0010:cpuidle_enter_state+0xcc/0x440
May 19 10:03:55 arch kernel: Code: aa 6f 3d ff e8 c5 f3 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 a3 71 3c ff 45 84 ff 0f 85 56 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
May 19 10:03:55 arch kernel: RSP: 0018:ffffffffb4803e40 EFLAGS: 00000246
May 19 10:03:55 arch kernel: RAX: ffff8dffbe633e80 RBX: ffff8df0c0bab800 RCX: 0000000000000000
May 19 10:03:55 arch kernel: RDX: 0000000000000000 RSI: fffffffcdf9d35d3 RDI: 0000000000000000
May 19 10:03:55 arch kernel: RBP: 0000000000000003 R08: 0000000000000002 R09: 0000000026dc593f
May 19 10:03:55 arch kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffb4949360
May 19 10:03:55 arch kernel: R13: 00010ac3ebb9db45 R14: 0000000000000003 R15: 0000000000000000
May 19 10:03:55 arch kernel:  cpuidle_enter+0x2d/0x40
May 19 10:03:55 arch kernel:  do_idle+0x1bf/0x220
May 19 10:03:55 arch kernel:  cpu_startup_entry+0x1d/0x20
May 19 10:03:55 arch kernel:  rest_init+0xc8/0xd0
May 19 10:03:55 arch kernel:  arch_call_rest_init+0xe/0x30
May 19 10:03:55 arch kernel:  start_kernel+0x778/0xb80
May 19 10:03:55 arch kernel:  secondary_startup_64_no_verify+0xe5/0xeb
May 19 10:03:55 arch kernel:  </TASK>
May 19 10:03:55 arch kernel: ---[ end trace 0000000000000000 ]---

Kernel 6.3.2-arch1-1

Offline

#36 2023-05-20 22:03:34

topcat01
Member
Registered: 2019-09-17
Posts: 124

Re: Random GPU crashes amdgpu across 2 different GPUs

Have you managed to run memtest86+ for at least a day (24 hrs)?

Offline

#37 2023-05-22 21:29:56

blueford
Member
Registered: 2023-05-22
Posts: 2

Re: Random GPU crashes amdgpu across 2 different GPUs

I'm having the same issue with a 6700XT. It is completely random: it might take 5 minutes, it might take 5 hours. It has only happened to me while playing Assassin's Creed Valhalla.

lspci
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev df)

uname -rms
Linux 6.3.3-arch1-1 x86_64

log output:

May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32775, for process ACValhalla.exe pid 1732 thread vkd3d_queue pid 1824)
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x000080032aab0000 from client 0x1b (UTCL2)
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501031
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x1
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x0
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x0
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32775, for process ACValhalla.exe pid 1732 thread vkd3d_queue pid 1824)
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x000080032aab1000 from client 0x1b (UTCL2)
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x0
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x0
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x0
May 22 17:10:29 bl kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0
May 22 17:10:39 bl kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=141786, emitted seq=141788
May 22 17:10:39 bl kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ACValhalla.exe pid 1732 thread vkd3d_queue pid 1824
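
The indented lines under GCVM_L2_PROTECTION_FAULT_STATUS in the log above look like bit fields of the raw status word. Here is a minimal Python sketch that unpacks such a value by hand. The shifts and widths in FIELDS are an assumption (a GFX10-style layout), checked only against the values printed in this thread, and decode_fault_status is a hypothetical helper name, not part of any amdgpu tool.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS (GFX10-style).
# This is NOT taken from the thread; it is only cross-checked against the
# decoded values that the kernel printed in the log above.
FIELDS = {
    # name: (shift, width in bits)
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # the "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }
```

Under that assumed layout, decode_fault_status(0x00501031) gives MORE_FAULTS 1, PERMISSION_FAULTS 3, CID 8 and VMID 5, which agrees with the "TCP (0x8)", "PERMISSION_FAULTS: 0x3" and "vmid:5" values in the log. A status of 0x00000000 decodes to all zeros, like the follow-up faults.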

Offline

#38 2023-05-24 06:28:24

orlfman
Member
Registered: 2007-11-20
Posts: 138

Re: Random GPU crashes amdgpu across 2 different GPUs

there are a few things that stand out as possible culprits.

#1
you might actually have defective cards. i know it's not something people like to hear, but it is possible. you can try enabling overclocking to gain access to the voltage and core / memory frequency controls on your gpu. you can use something like corectrl to downclock your gpu core by 100-200 MHz and see if that helps stabilize the cards.
#2
seems like a lot of you have amd cpus. amd cpus are notoriously dependent on system memory speeds due to infinity fabric being tied to system memory. if system memory is remotely unstable, infinity fabric will be unstable and it will manifest in unexpected ways, such as gpus crashing. the gpu 16x pci-e lane is connected to the cpu over infinity fabric. you can try loosening your timings or resetting your ram to its default JEDEC settings rather than the xmp profile it is advertised for to see if that helps.
#3
the 5700 xt series is notorious for how unstable it was. the entire rdna1 lineup was prone to crashing from day 1. if you're on a 5xxx series card, i would just look at upgrading to something else. amd should have issued a recall for that series and refunded people.
#4
buggy motherboard firmware. motherboards play a big role and can cause really odd problems. i had one motherboard from msi, a b550 tomahawk, that with my creative fx v2 and ae-5 sound card would cause audio popping / crackling sounds when first starting to play any sound. it would go away once the sound card was playing audio continuously, but going from idle to sound gave really bad crackling and popping. i also would get journalctl spammed heavily with "Too many BDL entries." i didn't have this with my gigabyte x470 before it, nor with my current msi board, a z690 edge wifi ddr5 with my intel 13700k. everything might seem fine and stable, but in reality it could be the motherboard. JayzTwoCents on youtube did a video recently with a nvidia 3070 iirc that would crash in one particular motherboard but be 100% stable in a completely different motherboard with the same cpu and memory. yet a different gpu would run fine in the motherboard that crashed with that 3070. completely bizarre.
#5
power supply could be the culprit as well. even power cables.

Last edited by orlfman (2023-05-24 06:30:48)

Offline

#39 2023-05-24 06:49:49

knoellle
Member
Registered: 2023-05-22
Posts: 1

Re: Random GPU crashes amdgpu across 2 different GPUs

Personally I have had these issues since migrating my 6750 XT from a Ryzen 3000 to a Ryzen 7000 system.
At least in my case I can eliminate the power supply as a possible source since I get crashes with the new and the old power supply in the new system, but it ran without issues in the old system.
With the number of people reporting these issues, I also doubt it's all defective cards.
It could be a motherboard or infinity fabric issue. I'm running DDR5 6000, will try slower speeds.

I've had multiple crashes per day for a few weeks now (since I built the system), although it's kind of hard to reproduce. Days without crashes were few and far between.
But three days ago I started using this kernel from the AUR: https://aur.archlinux.org/packages/linux-drm-next-git which has yet to crash even once.
Perhaps others could try this as well and report their findings.

Offline

#40 2023-05-24 06:53:51

Bodyash
Member
Registered: 2023-04-05
Posts: 9

Re: Random GPU crashes amdgpu across 2 different GPUs

knoellle wrote:

Personally I have had these issues since migrating my 6750 XT from a Ryzen 3000 to a Ryzen 7000 system.
At least in my case I can eliminate the power supply as a possible source since I get crashes with the new and the old power supply in the new system, but it ran without issues in the old system.
With the number of people reporting these issues, I also doubt it's all defective cards.
It could be a motherboard or infinity fabric issue. I'm running DDR5 6000, will try slower speeds.

I've had multiple crashes per day for a few weeks now (since I built the system), although it's kind of hard to reproduce. Days without crashes were few and far between.
But three days ago I started using this kernel from the AUR: https://aur.archlinux.org/packages/linux-drm-next-git which has yet to crash even once.
Perhaps others could try this as well and report their findings.

You can try the 5.19-lts kernel and expect ~0 crashes.

Offline

#41 2023-05-24 07:44:27

bernd_b
Member
Registered: 2013-07-30
Posts: 164

Re: Random GPU crashes amdgpu across 2 different GPUs

In my long life, I have had similar problems with nvidia too, and people guessed that there could be a defective card or memory or ...

Of course, it is possible that the hardware is broken, but in my case the culprit was a new feature of the nvidia driver that my old card didn't support. Some updates later the mystery was gone; in between I did days of memory testing and nearly bought a new card, which I definitely didn't need. Until the upstream fix arrived, adding a simple configuration file was all that was needed and everything was back to normal.

So my point is:
Don't blame your hardware too soon.

I did some tests here regarding webgl:
They are pretty clearly repeatable, and as long as nobody with the same hardware manages to run them without a crash, I see no reason to assume a hardware defect.

Last edited by bernd_b (2023-06-07 08:13:35)

Offline

#42 2023-05-24 08:06:44

blueford
Member
Registered: 2023-05-22
Posts: 2

Re: Random GPU crashes amdgpu across 2 different GPUs

orlfman wrote:

there are a few things that stand out as possible culprits.

#1
you might actually have defective cards. i know it's not something people like to hear, but it is possible. you can try enabling overclocking to gain access to the voltage and core / memory frequency controls on your gpu. you can use something like corectrl to downclock your gpu core by 100-200 MHz and see if that helps stabilize the cards.

I really hope it's not my card because I just got done RMAing it after getting a faulty one :)

This happens far too infrequently for me to try to diagnose it, I guess I'll have to wait until I can find a way to cause it to happen.

Offline

#43 2023-05-25 01:39:42

Bodyash
Member
Registered: 2023-04-05
Posts: 9

Re: Random GPU crashes amdgpu across 2 different GPUs

blueford wrote:
orlfman wrote:

there are a few things that stand out as possible culprits.

#1
you might actually have defective cards. i know it's not something people like to hear, but it is possible. you can try enabling overclocking to gain access to the voltage and core / memory frequency controls on your gpu. you can use something like corectrl to downclock your gpu core by 100-200 MHz and see if that helps stabilize the cards.

I really hope it's not my card because I just got done RMAing it after getting a faulty one :)

This happens far too infrequently for me to try to diagnose it, I guess I'll have to wait until I can find a way to cause it to happen.

You can check your journalctl output. Also check this: https://gitlab.freedesktop.org/drm/amd/-/issues
There are so many similar crashes reported there.
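
If it helps when comparing against those reports, here is a rough Python sketch for condensing a saved journal into the two things people keep posting in this thread: which process triggered the page faults, and which ring timed out. The regular expressions are modelled only on the log lines quoted here, so treat them as an assumption; other kernel versions may word the messages differently, and process names containing spaces are not handled. summarize is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns modelled on the amdgpu messages quoted in this thread (assumption:
# the wording is the same on your kernel).
PAGE_FAULT = re.compile(
    r"amdgpu: \[(?P<hub>\w+)\] page fault .*for process (?P<proc>\S+) pid (?P<pid>\d+)"
)
RING_TIMEOUT = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def summarize(lines):
    """Count page faults per process name and collect ring timeouts."""
    faults = Counter()
    timeouts = []
    for line in lines:
        m = PAGE_FAULT.search(line)
        if m:
            faults[m.group("proc")] += 1
            continue
        m = RING_TIMEOUT.search(line)
        if m:
            timeouts.append(
                (m.group("ring"), int(m.group("signaled")), int(m.group("emitted")))
            )
    return faults, timeouts
```

One way to use it would be to save the kernel messages of the crashed boot with something like `journalctl -k -b -1 > crash.log` and then pass `open("crash.log")` to summarize.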

Offline

#44 2023-06-07 08:03:51

Bodyash
Member
Registered: 2023-04-05
Posts: 9

Re: Random GPU crashes amdgpu across 2 different GPUs

Feels like the 6.3.5 and 6.3.6 kernels are more stable.
Driver stopped crashing for me.

Offline

#45 2023-06-08 09:21:17

dark_desert
Member
Registered: 2014-01-18
Posts: 3

Re: Random GPU crashes amdgpu across 2 different GPUs

Interesting. I'm on 6.3.6 and I get more crashes/freezes than before. I use a laptop with an AMD Ryzen 7 PRO 6850U (T14s Gen 3). Logs:

Jun 08 10:49:03 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* flip_done timed out
Jun 08 10:49:03 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* [CRTC:72:crtc-0] commit wait timed out
Jun 08 10:49:13 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* flip_done timed out
Jun 08 10:49:13 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* [CONNECTOR:83:eDP-1] commit wait timed out
Jun 08 10:49:23 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* flip_done timed out
Jun 08 10:49:23 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* [PLANE:55:plane-3] commit wait timed out
Jun 08 10:49:34 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* flip_done timed out
Jun 08 10:49:34 thinkarch kernel: amdgpu 0000:33:00.0: [drm] *ERROR* [PLANE:70:plane-6] commit wait timed out
Jun 08 10:49:34 thinkarch kernel: ------------[ cut here ]------------
Jun 08 10:49:34 thinkarch kernel: WARNING: CPU: 12 PID: 1247 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8119 amdgpu_dm_atomic_commit_tail+0x3403/0x3620 [amdgpu]
Jun 08 10:49:34 thinkarch kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device ccm michael_mic cmac algif_hash algif_skcipher af_alg qrtr_mhi bnep btusb btrtl btbcm btintel btmtk bluetooth ecdh_generic crc16 qrtr uvcvideo snd_acp6x_pdm_dma snd_soc_dmic snd_soc_acp6x_mach videobuf2_vmalloc uvc snd_sof_amd_rembrandt ath11k_pci videobuf2_memops videobuf2_v4l2 snd_sof_amd_renoir uinput ath11k snd_sof_amd_acp videodev snd_sof_pci snd_sof_xtensa_dsp videobuf2_common qmi_helpers mc snd_sof snd_sof_utils mac80211 snd_ctl_led snd_soc_core snd_hda_codec_realtek joydev libarc4 snd_compress amdgpu ac97_bus mousedev intel_rapl_msr snd_pcm_dmaengine snd_hda_codec_generic intel_rapl_common snd_hda_codec_hdmi snd_pci_ps snd_rpl_pci_acp6x thinkpad_acpi cfg80211 edac_mce_amd snd_acp_pci snd_hda_intel drm_buddy ledtrig_audio gpu_sched snd_intel_dspcfg snd_pci_acp6x platform_profile snd_intel_sdw_acpi i2c_algo_bit snd_pci_acp5x kvm_amd snd_hda_codec snd_rn_pci_acp3x drm_ttm_helper snd_hda_core ttm snd_acp_config snd_hwdep kvm
Jun 08 10:49:34 thinkarch kernel:  ucsi_acpi hid_multitouch drm_display_helper snd_pcm snd_soc_acpi typec_ucsi sp5100_tco irqbypass think_lmi rfkill vfat video snd_timer firmware_attributes_class wmi_bmof fat rapl typec psmouse thunderbolt snd cec snd_pci_acp3x i2c_piix4 pcspkr k10temp roles mhi soundcore wmi i2c_hid_acpi i2c_hid amd_pmc acpi_cpufreq acpi_tad mac_hid ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) crypto_user fuse loop bpf_preload ip_tables x_tables usbhid dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod serio_raw atkbd crct10dif_pclmul libps2 vivaldi_fmap crc32_pclmul polyval_clmulni nvme polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 aesni_intel nvme_core crypto_simd i8042 cryptd xhci_pci ccp nvme_common xhci_pci_renesas serio btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_intel
Jun 08 10:49:34 thinkarch kernel: CPU: 12 PID: 1247 Comm: seatd Tainted: G        W  OE      6.3.6-arch1-1 #1 a07497485287c74e7a472f42ded4b2ddcf7a6fd7
Jun 08 10:49:34 thinkarch kernel: Hardware name: LENOVO 21CQCTO1WW/21CQCTO1WW, BIOS R22ET60W (1.30 ) 02/09/2023
Jun 08 10:49:34 thinkarch kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x3403/0x3620 [amdgpu]
Jun 08 10:49:34 thinkarch kernel: Code: 48 89 f9 49 8b 7d 28 48 39 79 28 41 0f 95 c1 0f 95 85 d0 fc ff ff 45 0f b6 c9 48 85 f6 0f 85 a8 f6 ff ff e9 a7 f6 ff ff 0f 0b <0f> 0b e9 ef f1 ff ff 48 89 85 b8 fc ff ff 8b 85 f0 fc ff ff 31 f6
Jun 08 10:49:34 thinkarch kernel: RSP: 0018:ffffa8cc03337578 EFLAGS: 00010002
Jun 08 10:49:34 thinkarch kernel: RAX: 0000000000000286 RBX: 0000000000000286 RCX: 0000000000000020
Jun 08 10:49:34 thinkarch kernel: RDX: 0000000000000001 RSI: 0000000000000297 RDI: ffff9b1406ac0178
Jun 08 10:49:34 thinkarch kernel: RBP: ffffa8cc033378f8 R08: 0000000000000002 R09: 0000000000000001
Jun 08 10:49:34 thinkarch kernel: R10: 0000000000000000 R11: ffff9b13deddf918 R12: ffff9b13deddf918
Jun 08 10:49:34 thinkarch kernel: R13: 0000000000000000 R14: ffff9b1402e74c00 R15: ffff9b13deddf800
Jun 08 10:49:34 thinkarch kernel: FS:  00007f70c0aa9740(0000) GS:ffff9b1adf100000(0000) knlGS:0000000000000000
Jun 08 10:49:34 thinkarch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 08 10:49:34 thinkarch kernel: CR2: 00007f734a640040 CR3: 00000001479e0000 CR4: 0000000000750ee0
Jun 08 10:49:34 thinkarch kernel: PKRU: 55555554
Jun 08 10:49:34 thinkarch kernel: Call Trace:
Jun 08 10:49:34 thinkarch kernel:  <TASK>
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x3403/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  ? __warn+0x81/0x130
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x3403/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  ? report_bug+0x171/0x1a0
Jun 08 10:49:34 thinkarch kernel:  ? handle_bug+0x3c/0x80
Jun 08 10:49:34 thinkarch kernel:  ? exc_invalid_op+0x17/0x70
Jun 08 10:49:34 thinkarch kernel:  ? asm_exc_invalid_op+0x1a/0x20
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x3403/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x25e6/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  commit_tail+0x94/0x130
Jun 08 10:49:34 thinkarch kernel:  drm_atomic_helper_commit+0x11a/0x140
Jun 08 10:49:34 thinkarch kernel:  drm_atomic_commit+0x9a/0xd0
Jun 08 10:49:34 thinkarch kernel:  ? __pfx___drm_printfn_info+0x10/0x10
Jun 08 10:49:34 thinkarch kernel:  drm_client_modeset_commit_atomic+0x203/0x250
Jun 08 10:49:34 thinkarch kernel:  drm_client_modeset_commit_locked+0x5a/0x160
Jun 08 10:49:34 thinkarch kernel:  ? default_send_IPI_single_phys+0x36/0x50
Jun 08 10:49:34 thinkarch kernel:  drm_fb_helper_set_par+0x7f/0x100
Jun 08 10:49:34 thinkarch kernel:  fb_set_var+0x204/0x420
Jun 08 10:49:34 thinkarch kernel:  ? update_load_avg+0x7e/0x780
Jun 08 10:49:34 thinkarch kernel:  ? __flush_work.isra.0+0x1aa/0x280
Jun 08 10:49:34 thinkarch kernel:  ? update_load_avg+0x7e/0x780
Jun 08 10:49:34 thinkarch kernel:  fbcon_blank+0x213/0x310
Jun 08 10:49:34 thinkarch kernel:  do_unblank_screen+0xac/0x160
Jun 08 10:49:34 thinkarch kernel:  complete_change_console+0x54/0x120
Jun 08 10:49:34 thinkarch kernel:  vt_ioctl+0xd8b/0x13f0
Jun 08 10:49:34 thinkarch kernel:  ? dput+0x3a/0x310
Jun 08 10:49:34 thinkarch kernel:  ? __fsnotify_parent+0x129/0x360
Jun 08 10:49:34 thinkarch kernel:  tty_ioctl+0x4db/0x8a0
Jun 08 10:49:34 thinkarch kernel:  ? kmem_cache_free+0x19/0x360
Jun 08 10:49:34 thinkarch kernel:  ? do_sys_openat2+0x9b/0x170
Jun 08 10:49:34 thinkarch kernel:  __x64_sys_ioctl+0x94/0xd0
Jun 08 10:49:34 thinkarch kernel:  do_syscall_64+0x60/0x90
Jun 08 10:49:34 thinkarch kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 08 10:49:34 thinkarch kernel: RIP: 0033:0x7f70c0bab76f
Jun 08 10:49:34 thinkarch kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jun 08 10:49:34 thinkarch kernel: RSP: 002b:00007fff1befcc70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 08 10:49:34 thinkarch kernel: RAX: ffffffffffffffda RBX: 0000000000000015 RCX: 00007f70c0bab76f
Jun 08 10:49:34 thinkarch kernel: RDX: 0000000000000001 RSI: 0000000000005605 RDI: 0000000000000015
Jun 08 10:49:34 thinkarch kernel: RBP: 000055bc4eb604bf R08: 000055bc4eb60078 R09: 0000000000000000
Jun 08 10:49:34 thinkarch kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff1befce10
Jun 08 10:49:34 thinkarch kernel: R13: 000055bc4eb6014d R14: 000055bc4ff963b0 R15: 000055bc4ff96120
Jun 08 10:49:34 thinkarch kernel:  </TASK>
Jun 08 10:49:34 thinkarch kernel: ---[ end trace 0000000000000000 ]---
Jun 08 10:49:34 thinkarch kernel: ------------[ cut here ]------------
Jun 08 10:49:34 thinkarch kernel: WARNING: CPU: 12 PID: 1247 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7666 amdgpu_dm_atomic_commit_tail+0x3589/0x3620 [amdgpu]
Jun 08 10:49:34 thinkarch kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device ccm michael_mic cmac algif_hash algif_skcipher af_alg qrtr_mhi bnep btusb btrtl btbcm btintel btmtk bluetooth ecdh_generic crc16 qrtr uvcvideo snd_acp6x_pdm_dma snd_soc_dmic snd_soc_acp6x_mach videobuf2_vmalloc uvc snd_sof_amd_rembrandt ath11k_pci videobuf2_memops videobuf2_v4l2 snd_sof_amd_renoir uinput ath11k snd_sof_amd_acp videodev snd_sof_pci snd_sof_xtensa_dsp videobuf2_common qmi_helpers mc snd_sof snd_sof_utils mac80211 snd_ctl_led snd_soc_core snd_hda_codec_realtek joydev libarc4 snd_compress amdgpu ac97_bus mousedev intel_rapl_msr snd_pcm_dmaengine snd_hda_codec_generic intel_rapl_common snd_hda_codec_hdmi snd_pci_ps snd_rpl_pci_acp6x thinkpad_acpi cfg80211 edac_mce_amd snd_acp_pci snd_hda_intel drm_buddy ledtrig_audio gpu_sched snd_intel_dspcfg snd_pci_acp6x platform_profile snd_intel_sdw_acpi i2c_algo_bit snd_pci_acp5x kvm_amd snd_hda_codec snd_rn_pci_acp3x drm_ttm_helper snd_hda_core ttm snd_acp_config snd_hwdep kvm
Jun 08 10:49:34 thinkarch kernel:  ucsi_acpi hid_multitouch drm_display_helper snd_pcm snd_soc_acpi typec_ucsi sp5100_tco irqbypass think_lmi rfkill vfat video snd_timer firmware_attributes_class wmi_bmof fat rapl typec psmouse thunderbolt snd cec snd_pci_acp3x i2c_piix4 pcspkr k10temp roles mhi soundcore wmi i2c_hid_acpi i2c_hid amd_pmc acpi_cpufreq acpi_tad mac_hid ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) crypto_user fuse loop bpf_preload ip_tables x_tables usbhid dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod serio_raw atkbd crct10dif_pclmul libps2 vivaldi_fmap crc32_pclmul polyval_clmulni nvme polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 aesni_intel nvme_core crypto_simd i8042 cryptd xhci_pci ccp nvme_common xhci_pci_renesas serio btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_intel
Jun 08 10:49:34 thinkarch kernel: CPU: 12 PID: 1247 Comm: seatd Tainted: G        W  OE      6.3.6-arch1-1 #1 a07497485287c74e7a472f42ded4b2ddcf7a6fd7
Jun 08 10:49:34 thinkarch kernel: Hardware name: LENOVO 21CQCTO1WW/21CQCTO1WW, BIOS R22ET60W (1.30 ) 02/09/2023
Jun 08 10:49:34 thinkarch kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x3589/0x3620 [amdgpu]
Jun 08 10:49:34 thinkarch kernel: Code: 4c 8b 95 98 fc ff ff 48 83 c4 18 83 bd f0 fc ff ff 02 4c 8b 9d 90 fc ff ff 77 27 c7 85 f0 fc ff ff 02 00 00 00 e9 6d fd ff ff <0f> 0b e9 88 f0 ff ff 8b 85 20 fd ff ff 89 85 50 fd ff ff e9 5e fb
Jun 08 10:49:34 thinkarch kernel: RSP: 0018:ffffa8cc03337578 EFLAGS: 00010082
Jun 08 10:49:34 thinkarch kernel: RAX: 0000000000000001 RBX: 0000000000000286 RCX: 0000000000000020
Jun 08 10:49:34 thinkarch kernel: RDX: 0000000000000001 RSI: 0000000000000297 RDI: ffff9b1406ac0178
Jun 08 10:49:34 thinkarch kernel: RBP: ffffa8cc033378f8 R08: 0000000000000002 R09: 0000000000000001
Jun 08 10:49:34 thinkarch kernel: R10: 0000000000000000 R11: ffff9b13deddf918 R12: ffff9b13deddf918
Jun 08 10:49:34 thinkarch kernel: R13: 0000000000000000 R14: ffff9b1402e74c00 R15: ffff9b13deddf800
Jun 08 10:49:34 thinkarch kernel: FS:  00007f70c0aa9740(0000) GS:ffff9b1adf100000(0000) knlGS:0000000000000000
Jun 08 10:49:34 thinkarch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 08 10:49:34 thinkarch kernel: CR2: 00007f734a640040 CR3: 00000001479e0000 CR4: 0000000000750ee0
Jun 08 10:49:34 thinkarch kernel: PKRU: 55555554
Jun 08 10:49:34 thinkarch kernel: Call Trace:
Jun 08 10:49:34 thinkarch kernel:  <TASK>
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x3589/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  ? __warn+0x81/0x130
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x3589/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  ? report_bug+0x171/0x1a0
Jun 08 10:49:34 thinkarch kernel:  ? handle_bug+0x3c/0x80
Jun 08 10:49:34 thinkarch kernel:  ? exc_invalid_op+0x17/0x70
Jun 08 10:49:34 thinkarch kernel:  ? asm_exc_invalid_op+0x1a/0x20
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x3589/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  ? amdgpu_dm_atomic_commit_tail+0x25e6/0x3620 [amdgpu 0fa3473e1f35a3200140f96d82060a17aecc2505]
Jun 08 10:49:34 thinkarch kernel:  commit_tail+0x94/0x130
Jun 08 10:49:34 thinkarch kernel:  drm_atomic_helper_commit+0x11a/0x140
Jun 08 10:49:34 thinkarch kernel:  drm_atomic_commit+0x9a/0xd0
Jun 08 10:49:34 thinkarch kernel:  ? __pfx___drm_printfn_info+0x10/0x10
Jun 08 10:49:34 thinkarch kernel:  drm_client_modeset_commit_atomic+0x203/0x250
Jun 08 10:49:34 thinkarch kernel:  drm_client_modeset_commit_locked+0x5a/0x160
Jun 08 10:49:34 thinkarch kernel:  ? default_send_IPI_single_phys+0x36/0x50
Jun 08 10:49:34 thinkarch kernel:  drm_fb_helper_set_par+0x7f/0x100
Jun 08 10:49:34 thinkarch kernel:  fb_set_var+0x204/0x420
Jun 08 10:49:34 thinkarch kernel:  ? update_load_avg+0x7e/0x780
Jun 08 10:49:34 thinkarch kernel:  ? __flush_work.isra.0+0x1aa/0x280
Jun 08 10:49:34 thinkarch kernel:  ? update_load_avg+0x7e/0x780
Jun 08 10:49:34 thinkarch kernel:  fbcon_blank+0x213/0x310
Jun 08 10:49:34 thinkarch kernel:  do_unblank_screen+0xac/0x160
Jun 08 10:49:34 thinkarch kernel:  complete_change_console+0x54/0x120
Jun 08 10:49:34 thinkarch kernel:  vt_ioctl+0xd8b/0x13f0
Jun 08 10:49:34 thinkarch kernel:  ? dput+0x3a/0x310
Jun 08 10:49:34 thinkarch kernel:  ? __fsnotify_parent+0x129/0x360
Jun 08 10:49:34 thinkarch kernel:  tty_ioctl+0x4db/0x8a0
Jun 08 10:49:34 thinkarch kernel:  ? kmem_cache_free+0x19/0x360
Jun 08 10:49:34 thinkarch kernel:  ? do_sys_openat2+0x9b/0x170
Jun 08 10:49:34 thinkarch kernel:  __x64_sys_ioctl+0x94/0xd0
Jun 08 10:49:34 thinkarch kernel:  do_syscall_64+0x60/0x90
Jun 08 10:49:34 thinkarch kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  ? do_syscall_64+0x6c/0x90
Jun 08 10:49:34 thinkarch kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 08 10:49:34 thinkarch kernel: RIP: 0033:0x7f70c0bab76f
Jun 08 10:49:34 thinkarch kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jun 08 10:49:34 thinkarch kernel: RSP: 002b:00007fff1befcc70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 08 10:49:34 thinkarch kernel: RAX: ffffffffffffffda RBX: 0000000000000015 RCX: 00007f70c0bab76f
Jun 08 10:49:34 thinkarch kernel: RDX: 0000000000000001 RSI: 0000000000005605 RDI: 0000000000000015
Jun 08 10:49:34 thinkarch kernel: RBP: 000055bc4eb604bf R08: 000055bc4eb60078 R09: 0000000000000000
Jun 08 10:49:34 thinkarch kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff1befce10
Jun 08 10:49:34 thinkarch kernel: R13: 000055bc4eb6014d R14: 000055bc4ff963b0 R15: 000055bc4ff96120
Jun 08 10:49:34 thinkarch kernel:  </TASK>
Jun 08 10:49:34 thinkarch kernel: ---[ end trace 0000000000000000 ]---
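
The two traces above look almost identical, but the WARNING lines point at different source lines (amdgpu_dm.c:8119 and amdgpu_dm.c:7666). A small Python sketch for pulling the distinct warning sites out of a long journal could look like the following; the pattern is modelled only on the WARNING lines in this post, and warning_sites is a hypothetical helper name.

```python
import re

# Modelled on "WARNING: CPU: 12 PID: 1247 at <file>:<line> <symbol+offset>"
# as it appears in the log above (assumption: same format on other kernels).
WARNING_SITE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) (?P<symbol>\S+)"
)


def warning_sites(lines):
    """Return distinct (file, line, symbol) triples in first-seen order."""
    seen = {}
    for text in lines:
        m = WARNING_SITE.search(text)
        if m:
            key = (m.group("file"), int(m.group("line")), m.group("symbol"))
            seen.setdefault(key, None)  # dicts preserve insertion order
    return list(seen)
```

Run over the log in this post, it should report two sites in amdgpu_dm_atomic_commit_tail, which is a handy detail to quote when searching for or filing a bug report.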

Offline

#46 2023-06-09 11:27:44

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,868

Re: Random GPU crashes amdgpu across 2 different GPUs

Looks like you're facing a different bug, dark_desert. Maybe start a new thread?


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)

Offline

#47 2023-07-17 18:39:08

RBNXI
Member
Registered: 2023-07-17
Posts: 2

Re: Random GPU crashes amdgpu across 2 different GPUs

Can confirm that it's still not solved:

jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c0b000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501031
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x1
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c1d000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c1c000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c1b000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c0a000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c0d000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501031
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x1
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c0b000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501030
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c1a000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c0a000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c0c000 from client 0x1b (UTCL2)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:19 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: gmc_v10_0_process_interrupt: 19 callbacks suppressed
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c8d000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501031
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x1
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c8c000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c98000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c99000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c8c000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c99000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c9a000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c89000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510caa000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510cbc000 from client 0x1b (UTCL2)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0
jul 17 20:30:29 ruben kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3394557, emitted seq=3394559
jul 17 20:30:29 ruben kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066
jul 17 20:30:29 ruben kernel: amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
jul 17 20:30:39 ruben kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:72:crtc-0] hw_done or flip_done timed out

Offline

#48 2023-07-18 16:50:14

eephyne
Member
Registered: 2023-06-03
Posts: 1

Re: Random GPU crashes amdgpu across 2 different GPUs

I can confirm this here too.

I'm on Manjaro, but my issue is the same (same error).
I tested on kernels 6.4.1 and 6.5 (also on LTS 6.1.38).
I tried amdgpu.runpm=0.
I also tried setting AMD_DEBUG=dcc as mentioned in the GitLab thread https://gitlab.freedesktop.org/drm/amd/-/issues/2496

juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32779, for process gnome-control-c pid 19674 thread gnome-cont:cs0 pid 19711)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x000001944c3e9000 from client 0x1b (UTCL2)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00700031
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x1
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32779, for process gnome-control-c pid 19674 thread gnome-cont:cs0 pid 19711)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x000000e3895bf000 from client 0x1b (UTCL2)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32779, for process gnome-control-c pid 19674 thread gnome-cont:cs0 pid 19711)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x00002de3895be000 from client 0x1b (UTCL2)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32779, for process gnome-control-c pid 19674 thread gnome-cont:cs0 pid 19711)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x0000e8944c3e8000 from client 0x1b (UTCL2)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
juil. 17 14:39:53 asgard kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0

Offline

#49 2023-07-25 09:39:21

CkNoSFeRaTU
Member
Registered: 2023-07-25
Posts: 1

Re: Random GPU crashes amdgpu across 2 different GPUs

I have a Sapphire AMD Radeon RX 6700 XT 11306-01-20 Gaming NITRO+ with Micron memory (the VBIOS supports both Samsung and Micron with autodetection) and have the same issue as the topic starter.
The GPU works fine overall, aside from a big delta between the edge and hotspot temperatures, but it can never recover once a GPU reset is issued. Since there are plenty of buggy applications, and the AMD drivers are a buggy mess as well, this happens quite often, and it is aggravating to have to force a reboot every time. Suspend, on the other hand, seems to work fine.

My GPU has behaved like this from the start, including on completely different hardware. It also behaves the same under Windows: no signal on the monitors, the driver disabled after reboot, and output on only one monitor. To get back to a working state, the driver needs to be enabled, disabled again because there is an error, and then enabled again.

I've created the issue https://gitlab.freedesktop.org/drm/amd/-/issues/2709, but there has been no response yet, if there ever will be...
You can check whether you are affected by triggering a GPU reset at any time (as root):

cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

If you are affected, I encourage you to post to the aforementioned issue.

I also get [gfxhub] page faults with GCVM_L2_PROTECTION_FAULT_STATUS on recent 6.4.x kernels with some applications, such as the ROCm + PyTorch + Stable Diffusion stack, and it looks like a regression.
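
The sub-fields printed under each GCVM_L2_PROTECTION_FAULT_STATUS line in the logs in this thread can be recovered from the raw hex value. A minimal Python sketch follows. The bit layout in FIELDS is an assumption (it is not stated anywhere in this thread); it does reproduce the MORE_FAULTS, PERMISSION_FAULTS, client ID and vmid values shown in the logs posted here, but it should be checked against the amdgpu register headers for your ASIC. The function name decode_fault_status is made up for illustration.

```python
# Assumed (shift, width) layout of GCVM_L2_PROTECTION_FAULT_STATUS on GFX10 parts.
# This is an assumption, not taken from the thread; verify it for your hardware.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # printed as "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
    "VMID": (20, 4),  # should match the vmid in the "page fault" line
}


def decode_fault_status(status):
    """Split a raw status value into the named bit fields (hypothetical helper)."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw values copied from the logs posted in this thread.
    for raw in ("0x00501031", "0x00700031", "0x00101031", "0x00000000"):
        print(raw, decode_fault_status(int(raw, 16)))
```

A status of 0x00000000 decodes to all zeros, which is why those entries show no permission or mapping information; only the first fault in each burst carries a populated status in these logs.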

Regarding GTT memory size: it defaults to half of your RAM. You can change it via amdgpu.gttsize, with the caveat that 3 GB seems to be the minimum, so something like amdgpu.gttsize=3072 should work. In my experience, though, it won't make any difference.
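
For reference, amdgpu.gttsize takes a value in MiB, so 3 GB corresponds to 3072. A small Python sketch of that conversion, treating the 3 GB floor reported above as an assumption (it is this post's observation, not a verified driver limit); gttsize_param and MIN_GTT_MIB are hypothetical names:

```python
# Floor reported in this post; assumed here, not verified against the driver.
MIN_GTT_MIB = 3072


def gttsize_param(gib):
    """Build the kernel command-line parameter for a GTT size given in GiB."""
    mib = int(gib * 1024)
    if mib < MIN_GTT_MIB:
        raise ValueError(f"{mib} MiB is below the assumed {MIN_GTT_MIB} MiB minimum")
    return f"amdgpu.gttsize={mib}"


if __name__ == "__main__":
    print(gttsize_param(3))
    print(gttsize_param(8))
```

The resulting string goes on the kernel command line in the bootloader configuration.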

Offline

#50 2023-07-27 18:22:42

bitterhalt
Member
Registered: 2022-06-19
Posts: 21

Re: Random GPU crashes amdgpu across 2 different GPUs

I've had tons of these GPU soft resets lately with x5600+RX6600.

kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process pic>
kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x000080050489f000 from client 0x1b (UTCL2)
kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x1
kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process pic>
kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x000080050424e000 from client 0x1b (UTCL2)
kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process pic>
kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x000080050489e000 from client 0x1b (UTCL2)
kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process pic>
kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x000080050489a000 from client 0x1b (UTCL2)
kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process pic>
kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x000080050489b000 from client 0x1b (UTCL2)
kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process pic>
kernel: amdgpu 0000:08:00.0: amdgpu:   in page starting at address 0x0000800108e6e000 from client 0x1b (UTCL2)
kernel: amdgpu 0000:08:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
kernel: amdgpu 0000:08:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
kernel: amdgpu 0000:08:00.0: amdgpu:          MORE_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:08:00.0: amdgpu:          RW: 0x0
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

These are nothing new with this GPU, but lately they have been happening more frequently than usual.

In this case the failing "process" is Picom, but it also happens with KWin and even without any compositor, in which case the culprit can be almost any active process. This is only a wild guess, but since the last 2-3 Firefox versions I've had a lot more of these crashes.
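
To see which processes the faults are actually attributed to over a longer period, the "page fault" lines can be tallied by process name. A rough Python sketch, assuming the message format seen in the logs in this thread; count_faults_by_process and SAMPLE are illustrative names, and the sample lines are shortened copies of lines posted here:

```python
import re
from collections import Counter

# Matches the "[gfxhub] page fault (... for process NAME" lines seen in this thread.
FAULT_RE = re.compile(r"\[gfxhub\] page fault \(.*?for process (\S+)")

SAMPLE = """\
kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000800510c0a000 from client 0x1b (UTCL2)
kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32770, for process kwin_wayland pid 1010 thread kwin_wayla:cs0 pid 1066)
kernel: amdgpu 0000:08:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:1 pasid:32770, for process pic>
"""


def count_faults_by_process(log_text):
    """Count page-fault log lines per reported process name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for name, n in count_faults_by_process(SAMPLE).most_common():
        print(f"{n:5d}  {name}")
```

Names truncated by the pager (such as "pic>" above) are counted as they appear, so it is better to feed the function untruncated kernel log output, for example from journalctl -k.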

Last edited by bitterhalt (2023-07-27 23:37:15)

Offline
