Random GPU crashes amdgpu across 2 different GPUs

ockerman0 · 2023-07-27 20:16:01

I've been having this issue only in Judgement (the RGG game), after around a minute of gameplay, with my PowerColor Hellhound 6700 XT.
Sometimes when it crashes I can recover by alt-tabbing out after a few seconds, closing the game, and closing and reopening KDE, but other times it logs me out completely.
Changing Proton versions didn't help at all (Denuvo really hates that too), and neither did using the LTS kernel.

The output of a "soft" crash:

Jul 27 19:15:42 amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32795, for process timing pid 3317 thread vkd3d_queue pid 3386)
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x00008000e2400000 from client 0x1b (UTCL2)
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x0
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 27 19:15:42 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0

The output of a "hard" crash:

Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32796, for process timing pid 3185 thread vkd3d_queue pid 3258)
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000800072200000 from client 0x1b (UTCL2)
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x0
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 27 19:38:29 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0
Jul 27 19:38:39 ciarandpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32796, for process timing pid 3185 thread vkd3d_queue pid 3258)
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000800064600000 from client 0x1b (UTCL2)
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x0
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 27 19:38:39 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0
Jul 27 19:38:49 ciarandpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32796, for process timing pid 3185 thread vkd3d_queue pid 3258)
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000800062400000 from client 0x1b (UTCL2)
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x0
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x0
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jul 27 19:38:49 ciarandpc kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0
Jul 27 19:38:59 ciarandpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=161392, emitted seq=161394
Jul 27 19:38:59 ciarandpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process timing pid 3185 thread vkd3d_queue pid 3258
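
For anyone comparing fault codes across posts, here is a minimal Python sketch that unpacks a GCVM_L2_PROTECTION_FAULT_STATUS value into the fields the driver prints under it. The bit layout is an assumption inferred from the status/field pairs in the logs above (0x00201430 giving client ID 0xa and PERMISSION_FAULTS 0x3); it is not copied from the kernel headers and may differ on other GPU generations. The names `FIELDS` and `decode_fault_status` are made up for this example.

```python
# Assumed bit layout (shift, width), inferred from the logs in this thread.
# Check it against the register headers of your kernel before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # printed by the driver as "Faulty UTCL2 client ID"
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Split a fault status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status value taken from the "soft" crash log above.
    for name, value in decode_fault_status(0x00201430).items():
        print(f"{name}: {value:#x}")
```

If the decoded fields do not match what the driver prints for your card, the assumed layout is wrong for that hardware and the table needs adjusting.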

Last edited by ockerman0 (2023-07-27 20:18:39)

LienNoir · 2023-07-28 12:07:05

Hi guys, I had the same issue for about 3 days; my GPU is an RX 6700 XT. The crashes are very specific too: same area in the game.
I think I found a solution to mine in a Fedora post on Reddit. It's definitely a kernel issue; it started when I updated my system from 6.3 to 6.4, and it kept happening even when I reverted to my old kernel.
The post was about reverting to 5.15 from a Fedora Copr (it's like a PPA in Ubuntu). That's what I did, and no crashes since then.

LienNoir · 2023-08-01 17:53:08

LienNoir wrote:

Hi guys, I had the same issue for about 3 days; my GPU is an RX 6700 XT. The crashes are very specific too: same area in the game.
I think I found a solution to mine in a Fedora post on Reddit. It's definitely a kernel issue; it started when I updated my system from 6.3 to 6.4, and it kept happening even when I reverted to my old kernel.
The post was about reverting to 5.15 from a Fedora Copr (it's like a PPA in Ubuntu). That's what I did, and no crashes since then.

Never mind, I still get the error, but I found out what it was: it's a Proton issue with a specific game (Street Fighter 6).

RBNXI · 2023-08-16 13:45:30

Still getting crashes
ago 15 23:12:58 ruben kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32770, for process kwin_wayland pid 1033 thread kwin_wayla:cs0 pid 1089)

They often happen after updating; maybe it has something to do with caches?

mearkat7 · 2023-08-30 22:16:12

I haven't updated for a while, but I'm still getting the crashes too. I've switched GPU again, so it's now been 3 different AMD GPUs giving the same issue. The crashes seem to be less frequent but more severe; most of mine these days require a full power cycle to bring the PC back.

Sometimes the problem seems to have gone away for a few weeks, then it crashes again the next day for no apparent reason.

Tempted to do a full wipe and reinstall from scratch but at the moment don't have the time for that.

Last edited by mearkat7 (2023-08-30 22:16:43)

andyturfer · 2023-09-07 00:47:39

Jumping in here with a "me too"...

Laptop (Legion 7 gen 7 AMD, Ryzen 7 6800H, Radeon RX 6700m). KDE Plasma.

Laptop is always in "hybrid" mode.

$ lspci -k | grep -A 3 -E "(VGA|3D)"
pcilib: Error reading /sys/bus/pci/devices/0000:00:08.3/label: Operation not permitted
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev cf)
        DeviceName: Realtek
        Subsystem: Lenovo Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT]
        Kernel driver in use: amdgpu
--
37:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] (rev c8)
        Subsystem: Lenovo Rembrandt [Radeon 680M]
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

Kernel:

Linux 6.4.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 24 Aug 2023 00:38:14 +0000 x86_64 GNU/Linux

From journalctl:

...
Sep 06 15:41:30 l7 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=52235, emitted seq=52237
Sep 06 15:41:30 l7 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32770, for process Xorg pid 1635 thread Xorg:cs0 pid 1639)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:   in page starting at address 0x0000800102400000 from client 0x1b (UTCL2)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00401031
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MORE_FAULTS: 0x1
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          WALKER_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MAPPING_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          RW: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32770, for process Xorg pid 1635 thread Xorg:cs0 pid 1639)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:   in page starting at address 0x0000800102400000 from client 0x1b (UTCL2)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MORE_FAULTS: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          WALKER_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MAPPING_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          RW: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32770, for process Xorg pid 1635 thread Xorg:cs0 pid 1639)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:   in page starting at address 0x0000800102401000 from client 0x1b (UTCL2)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MORE_FAULTS: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          WALKER_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MAPPING_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          RW: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32770, for process Xorg pid 1635 thread Xorg:cs0 pid 1639)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:   in page starting at address 0x0000800102401000 from client 0x1b (UTCL2)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MORE_FAULTS: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          WALKER_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          MAPPING_ERROR: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:          RW: 0x0
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32770, for process Xorg pid 1635 thread Xorg:cs0 pid 1639)
Sep 06 15:41:32 l7 kernel: amdgpu 0000:37:00.0: amdgpu:   in page starting at address 0x0000800102402000 from client 0x1b (UTCL2)
...

seth · 2023-09-07 06:05:05

"amdgpu.dpm=0" was reported to no longer break the boot; try adding it to your kernel parameters, see https://wiki.archlinux.org/title/Kernel_parameters
(Use your bootloader's interactive command-line editor: there's currently only *one* report where it's benign, after a streak of "GPU doesn't init at all" reports.)

andyturfer · 2023-09-09 23:45:50

I added these two entries to my boot params:

amd_iommu=off
amdgpu.ppfeaturemask=0xffffbffd

I've been using my laptop for at least 8 hours per day and haven't experienced the issue (crash) in 3 days now.

Sije008 · 2023-09-10 11:17:50

andyturfer wrote:

I added these two entries to my boot params:
amd_iommu=off
amdgpu.ppfeaturemask=0xffffbffd
I've been using my laptop for at least 8 hours per day and haven't experienced the issue (crash) in 3 days now.

I have no image at all with this parameter "amdgpu.ppfeaturemask=0xffffbffd"

Last edited by Sije008 (2023-09-10 11:27:05)

andyturfer · 2023-09-11 09:27:17

Sije008 wrote:

I have no image at all with this parameter "amdgpu.ppfeaturemask=0xffffbffd"

You will have to adjust the value depending on your CPU/GPU - only enable features your hardware supports. I have a Ryzen 7 6800H and Radeon RX 6700m, and that value works on my laptop. I still haven't experienced the crash since using these settings.
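
To see what a given amdgpu.ppfeaturemask value actually turns off, the mask can be split into bit positions. The sketch below only reports which bits are cleared; what each bit means is defined by the kernel (the `PP_FEATURE_MASK` enum in the amdgpu sources) and should be looked up there for your kernel version. The helper name `cleared_bits` is made up for this example.

```python
def cleared_bits(mask: int, width: int = 32) -> list:
    """Return the positions of the bits that are 0 in a width-bit mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    mask = 0xFFFFBFFD  # value quoted in this thread
    print(f"{mask:#010x} clears bits: {cleared_bits(mask)}")
```

For 0xffffbffd this reports bits 1 and 14, so every other feature bit stays requested. That fits the advice above: a mask that enables a feature the hardware does not support can leave you without a picture.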

sid0831 · 2023-09-11 22:46:39

Hi,

Greetings from an Arch Linux Forum newbie. I've been using Arch for years, but so far I haven't had many issues, nor the courage and ability to help others solve their problems. But here I am, finally.

Anyway,
I have been experiencing a similar issue since the kernel 6.5.2 update. Both 6.5.2-arch1 and 6.5.2-zen1 crash the GPU, especially when I'm using Chrome or Chromium. Below is the log from the system journal:

Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:1 pasid:32814, for process chrome pid 4073 thread chrome:cs0 pid 4101)
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000e38dbdd3b000 from IH client 0x1b (UTCL2)
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00100430
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: IA (0x2)
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Sep 11 20:00:46 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0

It made the system virtually unusable. I tried switching between Mesa and the proprietary drivers, changing the session type Chrome uses, and adding kernel parameters suggested here and there such as amdgpu.runpm=0, amdgpu.dpm=0, amdgpu.vm_update_mode=3 and so on, individually and all together, but had no luck stabilising the system. It also happened more frequently when external monitors (one 4K2K@60Hz and one FHD@60Hz) were connected via DP Alt Mode and converted to HDMI by the docking station.

Fortunately, downgrading the kernel to 6.4.12 (both arch vanilla and zen kernel) seemed to work, but since I can't stay on that version of those kernel packages forever, I switched to the linux-lts kernel, and it's fairly stable for now.
I'm happy to try to bisect which commit(s) contributed to this issue, with some kind guidance from any of you.

Thanks!

seth · 2023-09-11 23:54:14

Fortunately, downgrading the kernel to 6.4.12 (both arch vanilla and zen kernel) seemed to work

nb. that this particular thread concerns issues since at least 6.2.2 and the pattern is different, https://bbs.archlinux.org/viewtopic.php … 2#p2098202

=> https://bugzilla.kernel.org/show_bug.cgi?id=211157
Try passing "amdgpu.noretry=0"

sid0831 · 2023-09-11 23:58:32

Thanks Seth for the reply.

But unfortunately, I forgot to mention that I tried amdgpu.noretry=0 too; it made the system slightly more stable, but the issue persisted. The system was usable for 5~10 more minutes.
Well, as you said, the issue is not new, so maybe more investigation from more people is needed.

Thanks for the suggestion nevertheless, maybe it'd be helpful for some people

andyturfer · 2023-09-12 11:48:04

Laptop: Legion 7 gen 7 AMD (Ryzen 7 6800H, Radeon RX 6700m).

I ran sudo pacman -Syu recently, and there were a bunch of updates (including mesa updates). The kernel also updated to 6.5.2. I've removed all amdgpu kernel params, and I haven't experienced the crash so far (it has been less than 24 hours).

sid0831 · 2023-09-20 05:13:23

sid0831 wrote:

Anyway,
I have been experiencing a similar issue since the kernel 6.5.2 update. Both 6.5.2-arch1 and 6.5.2-zen1 crash the GPU, especially when I'm using Chrome or Chromium. Below is the log from the system journal:

It seems there's an open issue on the freedesktop.org GitLab (https://gitlab.freedesktop.org/drm/amd/-/issues/2848) that is exactly like what I'm seeing. I'm not sure whether it needs a separate thread, though, because many people seem to suspect it is related to the kernel 6.2.2 issue; maybe it only started to become a problem in 6.5.y on some hardware combinations. The issue is quite frustrating, tbh.

Lone_Wolf · 2023-09-20 09:28:05

@sid0831
check the comment I just added to that issue, https://gitlab.freedesktop.org/drm/amd/ … te_2092272 .
Do your logs show similar lines with gmc ?

sid0831 · 2023-09-20 10:31:38

Lone_Wolf wrote:

@sid0831
check the comment I just added to that issue, https://gitlab.freedesktop.org/drm/amd/ … te_2092272 .
Do your logs show similar lines with gmc ?

Yes, they indeed do:

Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: gmc_v9_0_process_interrupt: 20 callbacks suppressed
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32814, for process chrome pid 4138 thread chrome:cs0 pid 4162)
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000065188e24f000 from IH client 0x1b (UTCL2)
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00700431
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: IA (0x2)
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x1
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:7 pasid:32814, for process chrome pid 4138 thread chrome:cs0 pid 4162)
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000064d91e960000 from IH client 0x1b (UTCL2)
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          Faulty UTCL2 client ID: CB (0x0)
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          MORE_FAULTS: 0x0
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          WALKER_ERROR: 0x0
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          MAPPING_ERROR: 0x0
Sep 15 17:28:29 yoohyeon.dc.sidlibrary.org kernel: amdgpu 0000:07:00.0: amdgpu:          RW: 0x0
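
When the journal is flooded with these entries (note the "callbacks suppressed" line), it can help to count which process the faults are attributed to. Below is a rough Python sketch that does this for the "page fault (... for process ... pid ... thread ... pid ...)" line format seen in this thread. The regular expression is written against these excerpts only, so other driver versions may word the line differently; `FAULT_RE` and `count_faults` are names made up for this example.

```python
import re
from collections import Counter

# Matches the page-fault header lines as they appear in this thread's logs.
FAULT_RE = re.compile(
    r"page fault \(.*?for process (?P<process>.*?) pid (?P<pid>\d+) "
    r"thread (?P<thread>.*?) pid (?P<tid>\d+)\)"
)


def count_faults(lines):
    """Count page-fault header lines per (process name, pid)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match["process"], int(match["pid"]))] += 1
    return counts


if __name__ == "__main__":
    # Two header lines shortened from the journal excerpts in this thread.
    sample = [
        "amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:7 "
        "pasid:32814, for process chrome pid 4138 thread chrome:cs0 pid 4162)",
        "amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:4 pasid:32770, "
        "for process Xorg pid 1635 thread Xorg:cs0 pid 1639)",
    ]
    for (process, pid), n in count_faults(sample).most_common():
        print(f"{process} (pid {pid}): {n} page fault(s)")
```

To run it on a real log, save the kernel messages of the crashed boot to a file (for example with `journalctl -k -b -1`) and pass the file's lines to `count_faults`.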

On a side note, I tried compiling kernel 6.6-rc2 and booting into it. The crash didn't happen immediately, but the graphics lagged and stuttered a lot. I'll check whether any gmc-related commit was merged into 6.6-rc2 when I have more time.

sid0831 · 2023-09-21 03:05:26

It seems like amdgpu.mcbp=0 fixes the issue for some people.
I also found that the default value of the amdgpu.mcbp kernel module parameter was changed from "0" to the new value "-1" (auto) in kernel 6.5.
(See the amdgpu module parameters entry in the kernel documentation for versions 6.4 and 6.5.)
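
Before judging whether amdgpu.mcbp=0 helps, it is worth confirming that the parameter actually reached the driver. The Python sketch below lists the amdgpu.* options on a kernel command line and, if available, reads the value the loaded module reports. The sysfs location is an assumption: it only exists when the amdgpu module is loaded and exports the parameter as readable. `amdgpu_params` and `effective_value` are names made up for this example.

```python
from pathlib import Path
from typing import Optional


def amdgpu_params(cmdline: str) -> dict:
    """Extract amdgpu.<name>=<value> options from a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params


def effective_value(name: str) -> Optional[str]:
    """Read the value the loaded module reports, or None if unavailable."""
    # Assumed path; present only if the module exports this parameter.
    path = Path("/sys/module/amdgpu/parameters") / name
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    print("amdgpu options on the command line:", amdgpu_params(cmdline))
    print("mcbp as reported by the module:", effective_value("mcbp"))
```

If the command line shows mcbp=0 but the module reports something else (or nothing), the option may be misspelled or set in the wrong bootloader entry.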

yuvadm · 2023-09-22 09:25:36

amdgpu.mcbp=0 definitely works for me, after seeing constant crashes when loading Google Maps in all browsers.

UFODivebomb · 2023-09-23 21:43:21

I've been encountering these crashes as well. Only in Baldur's Gate 3. T_T

`amd_iommu=off` reduced the frequency of the crashes but they still occurred.

Trying `amdgpu.mcbp=0`.

duskfox · 2023-10-06 21:11:59

UFODivebomb wrote:

I've been encountering these crashes as well. Only in Baldur's Gate 3. T_T
`amd_iommu=off` reduced the frequency of the crashes but they still occurred.
Trying `amdgpu.mcbp=0`.

Same issue with Baldur's Gate 3. I'm using both amd_iommu=off and amdgpu.mcbp=0; the crashes are reduced in frequency, but they still happen.

peldax · 2023-10-08 09:20:44

Hi,

it is happening on my otherwise rock-stable system exclusively while playing Wolfenstein: New Order / Wolfenstein: Old Blood on Steam using Proton. Multiple other games run completely fine.
This issue was probably introduced recently, as I completed the New Order in July without any problems. Last week I tried to grind for some missing achievements, and it is currently unplayable because the crashes are far too frequent.

The crashes seem random, but sometimes one happened in a cutscene at exactly the same spot multiple times.

The game freezes, and I am prompted to log in to a new session when the GPU recovers.

I tried multiple Proton versions, kernel versions (linux-lts 6.1.55 and 6.1.56, linux 6.5.5 and 6.5.6), setting amdgpu.mcbp=0, and multiple combinations of these, but nothing helps.

System info:
5900X
RX 6800XT

Last edited by peldax (2023-10-08 09:27:08)

seth · 2023-10-08 10:46:04

This issue was probably introduced recently, as I completed the New Order in July without any problems.

This thread started on 7th Mar 2023 02:58 …
Do you get the recurring VM_L2_PROTECTION_FAULT_STATUS pattern in your journal after those crashes?

peldax · 2023-10-08 14:54:01

seth wrote:

This issue was probably introduced recently, as I completed the New Order in July without any problems.
This thread started on 7th Mar 2023 02:58 …
Do you get the recurring VM_L2_PROTECTION_FAULT_STATUS pattern in your journal after those crashes?

I am getting the GCVM_L2_PROTECTION_FAULT_STATUS status.

Rena · 2023-10-21 00:08:46

I've been having this issue for the past couple weeks. Opening Blender has about a 50% chance of causing it. Opening a web page in IceCat, letting the screen go blank from idle, or starting a 3D game also are quite likely to crash. My system is nearly unusable. I've had it crash four times today. Sometimes the Magic Sysrq works, sometimes it will even "recover" enough to show a "the CS was rejected" message in the terminal before freezing completely, and sometimes only holding the power button can recover it.

I've done memory tests. I haven't made any hardware changes. I've tried various kernel parameters. Nothing has helped.

If I'm lucky enough to get Blender or a game open without a crash, I can usually use it for hours without an issue, but opening another window or browser tab is almost certain to crash again.

It might be worth noting that I'm using a laptop with the built-in screen, plus the HDMI output, plus a third screen connected by USB-C.

This seems like the relevant bug, but I can't comment there because (lol) signups are broken and (lol) my login from the old bug tracker doesn't work.

I can reproduce this 100% by opening Blender, adding a cube to the empty scene, and clicking Save. Otherwise it just happens randomly.

I've tried:

Using LTS kernel (the cursor keeps moving a bit when it crashes, but otherwise no change)
Using kernel 6.5.7 (no change)
Using amdgpu.dc=0 (boots to blank screen)
Using amdgpu.gpu_recovery=1 (no change)
Using amdgpu.mcbp=0 (no change)
Using amdgpu.dcdebugmask=0x10 (no change)
Using iommu=pt (I could open the save prompt three whole times before it crashed!)
Unplugging all external monitors (no change)
memtest86+ (passed)
memtest_vulkan (passed)
Physically inspecting the GPU and cleaning off dust (no change, no visible problems)

Last edited by Rena (2023-10-23 14:43:47)
