UFODivebomb wrote: I've been encountering these crashes as well. Only in Baldur's Gate 3. T_T
`amd_iommu=off` reduced the frequency of the crashes but they still occurred.
Trying `amdgpu.mcbp=0`.
Same issue with Baldur's Gate 3. I'm using both `amd_iommu=off` and `amdgpu.mcbp=0`; the crashes are less frequent, but they still happen.
The issue happened again on kernel 6.6.10-arch1-1. I had removed both boot parameters because I never had a crash on 6.6 and thought the issue was resolved upstream.
Over the next few days I'll try again with the parameters.
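Since the parameters were removed and later re-added, it can be worth confirming what the running kernel was actually booted with. A minimal sketch; `has_boot_param` is a made-up helper name, and it only does a whole-word match against the command line, nothing more.

```shell
# Hypothetical helper: succeeds if the given parameter appears as a whole
# word in a kernel command line (defaults to the running kernel's).
has_boot_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Report the two parameters discussed in this thread.
for p in amd_iommu=off amdgpu.mcbp=0; do
    if has_boot_param "$p"; then
        echo "$p: active"
    else
        echo "$p: not set"
    fi
done
```

The optional second argument is only there so the check can be tried against a pasted command line instead of the live one.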
I've been having these issues too.
https://forums.fedoraforum.org/showthre … ra-kernels
Great.
But you could simply try to follow the lead and set THP to "never" as a broadsword attempt and see whether that stabilizes the system.
https://wiki.archlinux.org/title/Zswap
https://wiki.archlinux.org/title/Zram
cat /sys/module/zswap/parameters/enabled
swapon # prints all swap devices
I'll try this as soon as I find some time to re-install Arch on this system (it's currently running Fedora).
But you could simply try to follow the lead and set THP to "never" as a broadsword attempt and see whether that stabilizes the system.
https://wiki.archlinux.org/title/Zswap
https://wiki.archlinux.org/title/Zram
cat /sys/module/zswap/parameters/enabled
swapon # prints all swap devices
zram, zswap or THP?
https://bbs.archlinux.org/viewtopic.php … 7#p2110357
zgrep TRANSPARENT_HUGEPAGE /proc/config.gz # on F39
There is no /proc/config.gz on my Fedora 39 install.
From the "THP" link you sent:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
$ cat /proc/sys/vm/compaction_proactiveness
20
On another laptop I have running Arch (with KDE), the first value is different:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
I don't know how to check zram or zswap.
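For anyone else unsure how to check these, here is a rough sketch that prints the active THP choice plus the zswap and zram state in one go. `active_choice` is a made-up helper that just pulls out the bracketed value; the sysfs and procfs paths are the ones already quoted in this thread plus `/proc/swaps`, and each is skipped if it doesn't exist on your system.

```shell
# Hypothetical helper: print the bracketed (active) choice from a sysfs
# multiple-choice line such as "always [madvise] never".
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

thp=/sys/kernel/mm/transparent_hugepage
for f in enabled defrag; do
    if [ -r "$thp/$f" ]; then
        echo "THP $f: $(active_choice "$(cat "$thp/$f")")"
    fi
done

# zswap: the parameter file reads Y or N when zswap is built in
if [ -r /sys/module/zswap/parameters/enabled ]; then
    echo "zswap enabled: $(cat /sys/module/zswap/parameters/enabled)"
fi

# zram: an active zram swap shows up as /dev/zramN in /proc/swaps
if grep -q '^/dev/zram' /proc/swaps 2>/dev/null; then
    echo "zram swap: in use"
else
    echo "zram swap: not in use"
fi
```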
zgrep TRANSPARENT_HUGEPAGE /proc/config.gz # on F39
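If `/proc/config.gz` is missing, the same lookup can fall back to a config file on disk. This is only a sketch: `kconfig_grep` is a made-up helper, and `/boot/config-$(uname -r)` is an assumption about where the distribution installs the kernel config, so check that the file exists on your install.

```shell
# Hypothetical helper: print kernel config lines matching a pattern,
# handling both gzip-compressed and plain config files.
kconfig_grep() {
    pattern=$1
    file=$2
    case $file in
        *.gz) gzip -dc "$file" | grep "$pattern" ;;
        *)    grep "$pattern" "$file" ;;
    esac
}

# Try the in-kernel copy first, then the (assumed) on-disk copy.
for cfg in /proc/config.gz "/boot/config-$(uname -r)"; do
    if [ -r "$cfg" ]; then
        kconfig_grep TRANSPARENT_HUGEPAGE "$cfg" || echo "no match in $cfg"
        break
    fi
done
```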
amdgpu.ppfeaturemask=0xfffd3fff
works in my case for the 7950X iGPU
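To see what a mask like this actually switches off, you can list which of the 32 bits are cleared. A small sketch; `cleared_bits` is a made-up helper and it deliberately does not name the features, since the meaning of each bit has to be looked up in the amdgpu driver source for your kernel version.

```shell
# Hypothetical helper: print the positions (0 = lowest) of the bits that
# are zero in a 32-bit mask given in decimal or 0x hex.
cleared_bits() {
    mask=$(( $1 ))
    bit=0
    out=
    while [ "$bit" -lt 32 ]; do
        if [ $(( (mask >> bit) & 1 )) -eq 0 ]; then
            out="$out${out:+ }$bit"
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$out"
}

cleared_bits 0xfffd3fff   # prints: 14 15 17
```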
If it happens again I'll post the full log, but for now it's "fixed" by downgrading some packages and using LTS kernel:
composable-kernel 5.5.1-2
xf86-video-amdgpu 21.0.0-2
vulkan-radeon 23.1.0-1
amdvlk 2023.Q3.2-1
mesa 23.1.0-1
Probably not all of these needed to be downgraded (or are needed at all) but this is what got things back to being usable for now.
Here's the part of the log from when the first error appears to when I used Magic SysRq:
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: SQC (data) (0xa)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x3
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00241051
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: TCP (0x8)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x5
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x1
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000003f80000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000003f80000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000003f80000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000003f80000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000003f80000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000003f80000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:2 pasid:32773, for process blender pid 3829 thread blender:cs0 pid 3856)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000003f80000000 from client 0x1b (UTCL2)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: CB/DB (0x0)
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
Oct 25 13:22:12 greymon kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
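The decoded fields printed under each GCVM_L2_PROTECTION_FAULT_STATUS line can be recomputed from the raw value, which helps when comparing logs between machines. A rough sketch: the bit layout below is an assumption inferred only from the status values posted in this thread (it reproduces their decoded fields), not checked against the driver source, and `decode_fault_status` is a made-up helper name.

```shell
# Hypothetical helper: split a raw fault status value into the fields the
# driver prints. Bit positions are inferred from the logs in this thread.
decode_fault_status() {
    s=$(( $1 ))
    printf 'MORE_FAULTS=%d WALKER_ERROR=%d PERMISSION_FAULTS=%d MAPPING_ERROR=%d CID=0x%x RW=%d\n' \
        $(( s & 1 )) \
        $(( (s >> 1) & 7 )) \
        $(( (s >> 4) & 15 )) \
        $(( (s >> 8) & 1 )) \
        $(( (s >> 9) & 511 )) \
        $(( (s >> 18) & 1 ))
}

decode_fault_status 0x00201431
decode_fault_status 0x00241051
```

For the first value this gives client ID 0xa with PERMISSION_FAULTS 3, and for the second client ID 0x8 with RW 1, matching the lines in the log above.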
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32779, for process WolfOldBlood_x6 pid 8887 thread WolfOldBlo:cs0 pid 8938)
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: in page starting at address 0x000003c942200000 from client 0x1b (UTCL2)
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701430
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: MORE_FAULTS: 0x0
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: WALKER_ERROR: 0x0
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: PERMISSION_FAULTS: 0x3
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: MAPPING_ERROR: 0x0
říj 21 12:02:54 kernel: amdgpu 0000:0c:00.0: amdgpu: RW: 0x0
říj 21 12:03:04 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=111964, emitted seq=111966
říj 21 12:03:04 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process WolfOldBlood_x6 pid 8887 thread WolfOldBlo:cs0 pid 8938
říj 21 12:03:04 kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
říj 21 12:03:08 kernel: amdgpu 0000:0c:00.0: amdgpu: failed to suspend display audio
říj 21 12:03:08 kernel: amdgpu 0000:0c:00.0: amdgpu: MODE1 reset
říj 21 12:03:08 kernel: amdgpu 0000:0c:00.0: amdgpu: GPU mode1 reset
říj 21 12:03:08 kernel: amdgpu 0000:0c:00.0: amdgpu: GPU smu mode1 reset
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is resuming...
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: smu driver if version = 0x00000040, smu fw if version = 0x00000041, smu fw program = 0, version = 0x003a5800 (58.88.0)
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: SMU driver if version not matched
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: use vbios provided pptable
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: SMU is resumed successfully!
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 8
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 8
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 8
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 8
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow start
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow done
říj 21 12:03:09 kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) succeeded!
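Most of a log like this is the ring re-initialisation after the reset. To see just the sequence that matters (page fault, ring timeout, reset), a simple filter is enough. A sketch only; `gpu_hang_timeline` is a made-up name and the patterns are taken from the messages in the log above.

```shell
# Hypothetical helper: keep only the page fault, ring timeout and GPU reset
# lines from kernel log text read on stdin.
gpu_hang_timeline() {
    grep -E 'page fault|ring [^ ]+ timeout|GPU reset'
}

# Example: filter the kernel messages of the previous boot, if available.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -1 --no-pager 2>/dev/null | gpu_hang_timeline
fi
```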
I've tried various kernel parameters. Nothing has helped.
https://bbs.archlinux.org/viewtopic.php?id=57855
Did you try https://bbs.archlinux.org/viewtopic.php … 5#p2124945 ?
I've done memory tests. I haven't made any hardware changes. I've tried various kernel parameters. Nothing has helped.
If I'm lucky enough to get Blender or a game open without a crash, I can usually use it for hours without an issue, but opening another window or browser tab is almost certain to crash again.
It might be worth noting that I'm using a laptop with the built-in screen, plus the HDMI output, plus a third screen connected by USB-C.
This seems like the relevant bug, but I can't comment there because (lol) signups are broken and (lol) my login from the old bug tracker doesn't work.
I can reproduce this 100% by opening Blender, adding a cube to the empty scene, and clicking Save. Otherwise it just happens randomly.
I've tried:
Using LTS kernel (the cursor keeps moving a bit when it crashes, but otherwise no change)
Using kernel 6.5.7 (no change)
Using amdgpu.dc=0 (boots to blank screen)
Using amdgpu.gpu_recovery=1 (no change)
Using amdgpu.mcbp=0 (no change)
Using amdgpu.dcdebugmask=0x10 (no change)
Using iommu=pt (I could open the save prompt three whole times before it crashed!)
Unplugging all external monitors (no change)
memtest86+ (passed)
memtest_vulkan (passed)
Physically inspecting the GPU and cleaning off dust (no change, no visible problems)
This issue was probably introduced recently, as I completed the New Order in July without any problems.
This thread started on 7 Mar 2023 02:58 …
Do you get the recurring VM_L2_PROTECTION_FAULT_STATUS pattern in your journal after those crashes?
I am getting the GCVM_L2_PROTECTION_FAULT_STATUS status.
This issue was probably introduced recently, as I completed the New Order in July without any problems.
This thread started on 7 Mar 2023 02:58 …
Do you get the recurring VM_L2_PROTECTION_FAULT_STATUS pattern in your journal after those crashes?
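One way to answer that from the journal is to count the page fault lines per process. A minimal sketch; `count_page_faults` is a made-up helper, and it assumes the "page fault ... for process NAME pid" wording seen in the logs earlier in this thread.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<count> <process>" for each process named in an amdgpu page fault line.
count_page_faults() {
    sed -n 's/.*page fault .*for process \([^ ]*\) pid.*/\1/p' \
        | sort | uniq -c | awk '{print $1, $2}'
}

# Example: kernel messages from the previous boot (the one that crashed).
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -1 --no-pager 2>/dev/null | count_page_faults
fi
```

No output means no page fault lines of that form were logged for that boot.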