Hi, I also get a similar issue, but only when running the GUI mode of CERN ROOT. I downgraded to 5.11.arch2-1, but that didn't solve the issue.
CPU/GPU: AMD Ryzen 7 4800U with Radeon Graphics
Current kernel: 5.12.5
BTW, I also tried "iommu=pt" but it didn't help.
May 21 11:20:08 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences tim>
May 21 11:20:09 systemd-timesyncd[438]: Initial synchronization to time server [2001:1600:3:6::123>
May 21 11:20:13 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences tim>
May 21 11:20:13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=>
May 21 11:20:13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xo>
May 21 11:20:13 kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
May 21 11:20:13 kernel: [drm] free PSP TMR buffer
May 21 11:20:13 kernel: amdgpu 0000:03:00.0: amdgpu: MODE2 reset
May 21 11:20:13 kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
May 21 11:20:13 kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
May 21 11:20:13 kernel: [drm] PSP is resuming...
May 21 11:20:13 kernel: [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
May 21 11:20:13 kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
May 21 11:20:13 kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
May 21 11:20:13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
May 21 11:20:13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
May 21 11:20:14 kernel: [drm] kiq ring mec 2 pipe 1 q 0
May 21 11:20:14 kernel: [drm] DMUB hardware initialized: version=0x01020008
May 21 11:20:14 kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
May 21 11:20:14 kernel: [drm] JPEG decode initialized successfully.
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
May 21 11:20:14 /usr/lib/gdm-x-session[945]: (II) event18 - Logitech B330/M330/M3: SYN_DROPPED eve>
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(2) succeeded!
May 21 11:20:14 kernel: [drm] Skip scheduling IBs!
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 systemd[928]: run-user-120.mount: Deactivated successfully.
May 21 11:20:14 systemd[1]: user-runtime-dir@120.service: Deactivated successfully.
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 kernel: audit: type=1131 audit(1621567214.371:90): pid=1 uid=0 auid=4294967295 ses>
May 21 11:20:14 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=user-r>
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 systemd[1]: Stopped User Runtime Directory /run/user/120.
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 /usr/lib/gdm-x-session[945]: amdgpu: The CS has been cancelled because the context>
May 21 11:20:14 systemd[1]: Removed slice User Slice of UID 120.
May 21 11:20:14 systemd[1]: user-120.slice: Consumed 4.783s CPU time.
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
May 21 11:20:14 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Offline
Tried 5.10.15 and it still crashes.
I don't remember crashing way back... Possibly not the kernel?
Offline
mesa also saw a few big version bumps recently and could be a likely candidate.
Offline
I've downgraded to mesa 21.0.3-2.
Checking journalctl, the earliest gfxhub0 page faults I can find are from May 13. I upgraded mesa on the same day (and rebooted). Looks like this is indeed the culprit.
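For anyone who wants to do the same cross-check, here is a rough sketch that pulls the upgrade history of a few packages out of pacman's log so the dates can be compared with the first fault in the journal. It assumes the default log location /var/log/pacman.log and the usual "[timestamp] [ALPM] upgraded pkg (old -> new)" line format; the function name pkg_upgrades is made up for this example.

```shell
#!/bin/sh
# pkg_upgrades LOGFILE PKG...
# Print the "upgraded"/"downgraded" entries for the given packages
# from a pacman-style log file.
pkg_upgrades() {
    log=$1
    shift
    for pkg in "$@"; do
        # Matches e.g. "[ALPM] upgraded mesa (21.0.3-2 -> 21.1.0-1)".
        # The trailing " (" keeps "mesa" from also matching "mesa-demos".
        # Note: the package name is used as a regex, so names containing
        # characters like "+" would need escaping.
        grep -E "\[ALPM\] (upgraded|downgraded) $pkg \(" "$log" || true
    done
}

# Example (assumed default path):
#   pkg_upgrades /var/log/pacman.log mesa linux-firmware linux
# Oldest journal line mentioning the fault, for comparison:
#   journalctl --no-pager --grep='retry page fault' | head -n 1
```

A date match like this only shows correlation, so it is worth checking every package that was upgraded around that day, not just mesa.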
Offline
Does downgrading mesa work?
Offline
Still observing. If I don't get a crash this session, I'll report back.
Offline
According to folks on other distros, this may be a firmware issue (possibly introduced in linux-firmware-20210426, which came out in Arch on 5th May).
https://forums.opensuse.org/showthread. … ost3029836
https://forum.manjaro.org/t/system-freq … /62139/114
Offline
I tried downgrading to linux-firmware-20210426, but the issue is still there.
Offline
Try downgrading to 20210315.3568f96-2. As I said above I think the issues started with 20210426 and weren't fixed yet with 20210511.
I'm running 20210315.3568f96-2 myself for a few hours now and everything appears to be in order, although it's too early to be sure...
Last edited by codicodi (2021-05-21 11:09:35)
Offline
Just popping in to say I have the same issue. I thought it was a hardware issue until I found this thread. I'm also on Sway. I've been having games crash, and I recently noticed graphical corruption in Alacritty, which caused a freeze that my session was able to recover from automatically. I think you guys are right in saying it has something to do with acceleration. I'm on a 3400G, using integrated graphics.
sudo dmesg
[53507.534738] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[53509.653244] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653257] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x800106000000 from client 27
[53509.653266] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653270] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653273] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653275] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653278] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653280] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653282] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653289] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653295] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x800106003000 from client 27
[53509.653303] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653305] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653308] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653310] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653312] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653314] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653316] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653321] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653326] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x800106004000 from client 27
[53509.653333] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653336] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653338] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653340] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653342] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653344] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653346] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653351] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653356] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x800106001000 from client 27
[53509.653363] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653365] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653368] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653370] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653372] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653374] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653376] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653380] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653386] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x800106002000 from client 27
[53509.653393] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653395] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653397] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653399] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653402] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653404] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653406] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653410] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653415] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x800106009000 from client 27
[53509.653422] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653424] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653427] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653429] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653431] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653433] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653435] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653439] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653444] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x80010600c000 from client 27
[53509.653451] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653453] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653455] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653458] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653460] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653462] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653464] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653468] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653473] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x80010600e000 from client 27
[53509.653479] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653482] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653484] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653486] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653488] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653490] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653492] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653496] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653501] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x800106006000 from client 27
[53509.653508] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653510] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653513] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653515] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653517] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653519] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653521] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53509.653525] amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:2 pasid:32769, for process sway pid 7837 thread sway:cs0 pid 7845)
[53509.653530] amdgpu 0000:0a:00.0: amdgpu: in page starting at address 0x80010600b000 from client 27
[53509.653537] amdgpu 0000:0a:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00201031
[53509.653539] amdgpu 0000:0a:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[53509.653541] amdgpu 0000:0a:00.0: amdgpu: MORE_FAULTS: 0x1
[53509.653543] amdgpu 0000:0a:00.0: amdgpu: WALKER_ERROR: 0x0
[53509.653545] amdgpu 0000:0a:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[53509.653547] amdgpu 0000:0a:00.0: amdgpu: MAPPING_ERROR: 0x0
[53509.653549] amdgpu 0000:0a:00.0: amdgpu: RW: 0x0
[53512.654702] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[53517.774691] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[53519.697569] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[53545.331994] kauditd_printk_skb: 21 callbacks suppressed
Versions of my packages below.
pacman -Qs amd
local/amd-ucode 20210511.7685cf4-1
Microcode update image for AMD CPUs
local/amdvlk 2021.Q2.3-1
AMD's standalone Vulkan driver
local/lib32-amdvlk 2021.Q2.3-1
AMD's standalone Vulkan driver
local/lib32-opencl-mesa 21.1.0-1
OpenCL support for AMD/ATI Radeon mesa drivers (32-bit)
local/opencl-mesa 21.1.0-1
OpenCL support for AMD/ATI Radeon mesa drivers
local/xf86-video-amdgpu 19.1.0-2 (xorg-drivers)
X.org amdgpu video driver
pacman -Qs linux-firmware
local/linux-firmware 20210511.7685cf4-1
Firmware files for Linux
pacman -Qi linux
Name : linux
Version : 5.12.5.arch1-1
...
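Since the dmesg output above repeats the same fault block many times, a small filter can make pastes like this easier to compare between machines. This is only a sketch that assumes the "retry page fault (... for process NAME pid N ...)" wording shown in the log above; the function name fault_summary is invented for this example.

```shell
#!/bin/sh
# fault_summary: read kernel log text on stdin and print, per process,
# how many amdgpu "retry page fault" lines were logged.
fault_summary() {
    grep -F 'retry page fault' |
        # Keep only the process name and its pid from each fault line.
        sed -n 's/.*for process \([^ ]*\) pid \([0-9]*\).*/\1 (pid \2)/p' |
        sort | uniq -c
}

# Example:
#   sudo dmesg | fault_summary
```

Each output line is a count followed by the process name and pid, so a compositor that accounts for nearly all faults stands out immediately.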
Offline
Still observing. If I don't get a crash this session, I'll report back.
Downgrading mesa to 21.0.3-2 fixed my issues! No crashes last session.
Offline
*ERROR* Waiting for fences timed out!
first appeared in my journalctl on March 23 (although it might have caused crashes before that), with
mesa 20.3.4-3
linux-lts 5.10.25-1
linux-firmware 20210315.3568f96-1
Still having this problem today with
mesa 21.1.0-1
linux-lts 5.10.38-1
linux-firmware 20210511.7685cf4-1
using a Ryzen 7 3700U with Vega 10.
I also randomly get kernel panics on entering sleep mode, though I don't know whether that has anything to do with it.
Offline
EDIT: I did not notice the second page of comments in this thread before replying... so basically everything I'm saying here had already been discussed, oops...
---- Original message ---
I'm experiencing the same issue: random GPU resets that start with this message appearing on journalctl:
*ERROR* Waiting for fences timed out!
This error was reported to the amdgpu maintainers one year ago. The report went almost silent after that but, judging by that thread's history, it seems to have been "resurrected" recently.
In particular, those comments point to a Manjaro thread where some people seem to have fixed the issue by downgrading linux-firmware.
I just did, and haven't had an issue for several hours (but should wait a few days just in case):
sudo pacman -U /var/cache/pacman/pkg/linux-firmware-20210315.3568f96-2-any.pkg.tar.zst
(...and don't forget to add "linux-firmware" to "IgnorePkg" in "/etc/pacman.conf" to tell pacman not to upgrade it from now on)
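In case it helps, here is a minimal sketch of the "IgnorePkg" step as a shell function. It assumes a stock-style pacman.conf with an "[options]" section, and that pacman accepts more than one "IgnorePkg" line (which is my understanding of the config format); the name hold_package is just something I made up. Back up the file before trying it on the real config.

```shell
#!/bin/sh
# hold_package CONF PKG
# Make sure PKG appears on an active IgnorePkg line in a pacman.conf-style file.
hold_package() {
    conf=$1
    pkg=$2
    # Nothing to do if an uncommented IgnorePkg line already names the package.
    if grep -Eq "^[[:space:]]*IgnorePkg[[:space:]]*=(.*[[:space:]])?$pkg([[:space:]]|\$)" "$conf"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Add a new IgnorePkg line right after the [options] header instead of
    # editing an existing line. If there is no [options] header, the file
    # is left unchanged.
    awk -v pkg="$pkg" '
        { print }
        /^\[options\]/ { print "IgnorePkg = " pkg }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (needs root for the real file):
#   hold_package /etc/pacman.conf linux-firmware
```

Writing the result back with cat rather than mv keeps the original file's owner and permissions. Remember to remove the line again once a fixed firmware package is released, otherwise the package stays pinned.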
UPDATE:
It's been two days now without crashes (which was never the case before downgrading). However, I see others in this thread saying they still get crashes even with the downgraded "linux-firmware-20210315" package.
For reference (in case it helps), I'm using a Vega 8 GPU (Ryzen 3500U) and these versions of kernel and mesa:
- mesa 21.1.0
- linux 5.12.5
Last edited by greenfoo (2021-05-23 19:57:42)
Offline
I am still getting crashes with the downgraded linux-firmware, version 20210315.3568f96-2.
EDIT:
Also still getting crashes with mesa 21.0.3-2. I'm starting to wonder if this is a BIOS bug. I'm on a Gigabyte B450I with BIOS revision F61a and a 3400G. I saw another thread at one point that mentioned XMP on this board causes the voltage to drop on the iGPU. Is anyone else running XMP? With XMP disabled I still get the issue, but it happens significantly less often.
EDIT 2:
Slapping this reddit thread down since it's relevant.
Last edited by cyberrumor (2021-05-22 23:33:35)
Offline
I'm starting to wonder if this is a BIOS bug
I've had my Ryzen 2700U for 2 years already and the issues only started recently, so it is hard to blame the BIOS (although it isn't completely impossible).
Offline
It looks like the fix is incoming. I'm not sure if this has been pulled into linux-mainline in the AUR yet, but I'm going to build it and give it a shot.
Offline
Just experienced the bug again with downgraded mesa. Hope that fix works!
Offline
Ok, so, I'm back with news!
This is my current setup, based on comments here: linux 5.11.6.arch1-1 AND linux-firmware 20210315.3568f96-2
With these, the crashes have stopped occurring completely. It's now been a few days of constant heavy usage of my computer, without a single crash or any sign of the bug in my journal whatsoever.
CPU: AMD Ryzen 7 PRO 3700U w/ Radeon Vega Mobile Gfx
It is possible linux-firmware is the sole culprit, but I'm staying downgraded on linux as well just to be sure, until I can test the fix mentioned above. I hope this helps someone!
Offline
That's great news! I'm still on the official up-to-date linux-firmware from the Arch repos, and my only change was compiling linux-mainline from the AUR and booting from it. I'm using kernel 5.13-rc2-1. I've now gone 5 days without the issue. I'm betting the fix comes from different packages depending on which CPU you have, which would explain our differing experiences. So, 3400G = fixed in 5.13. Everything else seems to be fixed by downgrading linux-firmware.
EDIT:
Just crashed again, but that was the longest period of stability since I started seeing the issue. Am currently building 5.13-rc3, will report back.
EDIT 2:
Built 5.13-rc4 instead, still crashing.
Last edited by cyberrumor (2021-06-01 15:03:01)
Offline
FWIW, this issue isn't fixed for me yet. I've tried using almost every kernel in the LTS series from 5.10.23 to 5.10.41, 5.11.x, and 5.12.x but system freezes still happen. They happen less frequently in 5.12.x but they still happen.
Here are some logs.
Offline
Did you also downgrade linux-firmware?
Offline
To which version should I downgrade? The oldest version I have locally is linux-firmware-20210315.
Also, here's another system freeze log from today. Kernel version 5.12.8-arch1-1
EDIT: Another day, another crash
[86728.466221] myhost kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[86733.586224] myhost kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[86733.586361] myhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[86738.706215] myhost kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[86743.616759] myhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
So much for AMDGPU being better for Linux. My system freezes every day and I have to do hard reboots.
Last edited by 7nm (2021-06-02 06:12:42)
Offline
I have my linux-firmware at version 20210315.3568f96-2. No crashes since. I've since ignored that package and am updating fine.
Offline
Ok, I'll downgrade to 20210315.3568f96-2 and report back if there are any improvements.
My laptop is basically a big paperweight right now and journalctl is filled with amdgpu errors.
Offline
Just posting as a reminder to myself and as confirmation: it seems to be Vega-related. R3200G, iGPU in use, crashing when using Chromium. Downgraded to "linux-firmware-20210426", let's see how it goes.
I've been using this PC for about a year now and never had such an issue.
Offline