Fun fact: You can get the mainline kernel prebuilt from miffe's repo: https://wiki.archlinux.org/title/Unoffi … ries#miffe
They're the person who maintains the linux-mainline AUR package.
Last edited by jpegxguy (2021-07-17 21:24:48)
hi
Offline
So I upgraded to 5.14rc2 (released yesterday) and experienced the bug again!
Very frustrating experience with amdgpu on Linux for the past few months!
I guess this is a general issue with other distros as well and not only Arch.
Seriously, I got so annoyed that I just pulled the trigger and purchased an Intel-based Laptop today.
Last edited by neemo (2021-07-20 02:01:50)
Offline
I have been experiencing this bug for the past few months, and it seemed to occur more often as time progressed. Since updating the day before yesterday, I haven't experienced a single crash. I am running on kernel 5.11.1, mesa 21.1.5-1, and linux-firmware 20210716.b7c134f-1. I have not tested using linux-mainline, as zfs does not support it yet. Is it magically solved for anyone else here as well?
Offline
I have been using linux-lts 5.10.17 and linux-firmware 20210315.3568f96-3 without an issue for a while now, but it's only been 2 or 3 days, so I need more time to confirm. @H0yVoy do you know how old 5.11.1 is? Could it be about as old as 5.10.17?
Offline
I have been using linux-lts 5.10.17 and linux-firmware 20210315.3568f96-3 without an issue for a while now, but it's only been 2 or 3 days, so I need more time to confirm. @H0yVoy do you know how old 5.11.1 is? Could it be about as old as 5.10.17?
Same here, let's see what will happen in a week or so.
Offline
Wish I had read your post about 5.14rc2 before I recompiled linux-mainline. I can confirm the regression in 5.14rc2: the bug is present in this version but not in 5.14rc1.
Offline
@cyberrumor do you (or anybody else) know how to reliably trigger this issue? From the comments above, it seems to happen only semi-randomly after multiple hours of usage, possibly related to VAAPI, which makes it hard to test. If we could trigger this issue easily, we could do a bisect between 5.14-rc1 and rc2, which shouldn't be many commits.
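To illustrate why a reliable trigger matters for a bisect, here is a minimal Python sketch of the binary search that a bisect boils down to. It assumes a linear list of commits, ordered oldest to newest, where every commit from the regression onwards is bad. The commit names, the history length and the `is_bad` check are hypothetical placeholders, not real kernel commits; in practice each `is_bad` call would mean building and booting a kernel and using it long enough to trust the verdict.

```python
from typing import Callable, Sequence, Tuple


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, int]:
    """Return the first bad commit and the number of commits tested.

    Assumes commits are ordered oldest to newest, the last one is known
    to be bad, and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    tested = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tested += 1
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo], tested


# Hypothetical history: 200 placeholder commits, regression at index 137.
history = [f"commit-{i:03d}" for i in range(200)]
culprit, steps = first_bad_commit(
    history, lambda c: int(c.split("-")[1]) >= 137
)
print(culprit, steps)
```

The number of test rounds grows only with the logarithm of the number of commits, so even a few hundred commits need fewer than ten rounds. The catch is that a single wrong "good" verdict, which is easy to give when the crash is semi-random, sends the search to the wrong commit.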
Offline
@Mthw0 I could trigger it by increasing my GPU usage when my RAM is almost full (but swap isn't), e.g. by starting a program like Blender or resizing a GPU-accelerated Firefox window playing a YouTube video when I have many tabs open. Right now that doesn't seem to happen anymore, but my system still crashes (same GPU ring errors) when starting DaVinci Resolve or using the internal GPU for rendering in Blender.
Offline
@Mthw0 I haven't noticed a way to reliably reproduce the crash 100% of the time. Crashes have happened for all sorts of reasons: during Alacritty terminal sessions with transparency in Sway, browsing reddit via Firefox, etc. Most often my crashes occur while running DirectX games through Proton. They seem to happen less often while running native OpenGL games, but perhaps this is confirmation bias or illusory correlation.
Offline
I've just had a freeze with the upgraded firmware; I had no issues with linux-firmware 20210315.3568f96-3. Has anyone reported the issue? This is unacceptable, and it's the firmware's fault.
The best way to reproduce it for me is to watch videos on YouTube etc. in Chromium.
Offline
Yesterday when I upgraded my packages, the problem came back. However, I didn't upgrade mesa, linux-firmware, or the kernel itself. The only graphics-related packages that I upgraded were the proprietary amdgpu pro drivers. After rolling back my system and upgrading my packages, excluding these drivers, everything still seems to work fine, which leads me to believe this issue is more complicated than I initially thought. Maybe this information helps you @lpr1?
According to the AMD website, these drivers aren't made for the built-in Ryzen gpus. I just installed them because I kept running into graphics issues a few months ago, and just never bothered to remove them. They seem to at least be partially related to the issue though...
Offline
Yesterday when I upgraded my packages, the problem came back. However, I didn't upgrade mesa, linux-firmware, or the kernel itself. The only graphics-related packages that I upgraded were the proprietary amdgpu pro drivers. After rolling back my system and upgrading my packages, excluding these drivers, everything still seems to work fine, which leads me to believe this issue is more complicated than I initially thought. Maybe this information helps you @lpr1?
According to the AMD website, these drivers aren't made for the built-in Ryzen gpus. I just installed them because I kept running into graphics issues a few months ago, and just never bothered to remove them. They seem to at least be partially related to the issue though...
This looks to me like an AMD-created issue. I do not use the proprietary drivers (AMDGPU-PRO); I use Mesa, with no issues so far. I'm on the latest kernel in the repo (5.13.5-arch1-1) and have no issue with it, but as soon as the linux-firmware package is upgraded, the issue comes back. It's beyond my knowledge to figure out what is wrong, but it's clear to me who is at fault for this issue (AMD), and coming from them, that's not acceptable IMO.
I can see there is different info from other users, but in general it seems that linux-firmware is the culprit, so I guess bug reports have to be directed at AMD.
Offline
So I upgraded to 5.14rc2 (released yesterday) and experienced the bug again!
Very frustrating experience with amdgpu on Linux for the past few months!
I guess this is a general issue with other distros as well and not only Arch.
Seriously, I got so annoyed that I just pulled the trigger and purchased an Intel-based Laptop today.
What firmware version are you running? I'm running 5.14rc2 with linux-firmware 20210315 and have had no problems so far.
It's a shame because I want to try out the new firmware to take advantage of FreeSync over HDMI.
Offline
For me, one of the most annoying things is the fact that I can not reproduce the error easily. It is so random...
Offline
Even though the gpu is not crashing with ring errors anymore, I still see plenty of gpu-related errors in the dmesg output:
amdgpu: failed to write reg 28b4 wait reg 28c6
amdgpu: failed to write reg 1a6f4 wait reg 1a706
It complains about the same registers every time, and from googling around it seems that this error with these particular registers has popped up repeatedly, dating as far back as 2019. But no working solutions have been posted, as far as I can see.
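To check whether it really is the same registers every time, the messages can be tallied. Here is a minimal Python sketch, assuming the kernel log has been saved as text (for example from `dmesg`) and that the lines look like the two quoted above. The function name is made up for illustration, and the sample simply repeats one of the two lines to show the counting.

```python
import re
from collections import Counter

# Matches lines such as "amdgpu: failed to write reg 28b4 wait reg 28c6".
PATTERN = re.compile(
    r"failed to write reg ([0-9a-fA-F]+) wait reg ([0-9a-fA-F]+)"
)


def count_register_errors(log_text: str) -> Counter:
    """Count how often each (write reg, wait reg) pair appears in the log."""
    return Counter(m.groups() for m in PATTERN.finditer(log_text))


# Illustrative sample: the two lines from the post, the first one repeated.
sample = """\
amdgpu: failed to write reg 28b4 wait reg 28c6
amdgpu: failed to write reg 1a6f4 wait reg 1a706
amdgpu: failed to write reg 28b4 wait reg 28c6
"""
for (write_reg, wait_reg), n in count_register_errors(sample).most_common():
    print(f"write {write_reg} / wait {wait_reg}: {n}")
```

A tally like this, taken across several boots, would be a compact thing to attach to a bug report.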
Offline
I also have the Ryzen 2700U and had freezes and crashes occasionally while using Firefox and watching videos.
I found that, at least in my case, the freezes stop completely when I don't maximize my windows and use them a bit smaller on the screen.
Watching fullscreen videos like on youtube or netflix doesn't appear to be an issue, as long as the browser window is not maximized.
This has worked for me since Linux version 5.12 without any kernel parameters (it worked for a couple of months without issues). I had some parameters like iommu=pt or idle=nomwait for some time which seemed to help, but I was never completely sure if they did anything, since freezes still occurred from time to time.
Offline
I've been through all the 5.14 rc versions (currently on rc5) and I've seen this issue on all of them. 5.14 has fixed the amdgpu issue with systemd-backlight, but not this one. Sometimes it freezes and recovers; other times the screen takes some LSD and displays pixels like random contents of RAM (corruption); other times it freezes to a black screen and I have to use the SysRq combo to reboot. I've had this issue for a while, probably at least since 5.11.
It's probably the #1 most annoying thing for me right now; it hits randomly, even during word editing work.
Last edited by jpegxguy (2021-08-09 21:34:53)
hi
Offline
I own a laptop with a Ryzen 3200U and I had no problems with firmware version linux-firmware-20210315.3568f96. All later versions cause a hang: the laptop screen just goes black. When that happens, I hold the power button until it shuts down. I use the latest kernel (5.13.9 at the moment), and the video engine failure happens with any firmware later than 20210315 (though with the latest one it happens less often). I believe I should post a bug report, but I don't know where. It's obvious (at least in my case) that this problem is directly linked to some changes in the firmware code, but I don't know whom I should tell about it. Can you give me directions?
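For anyone keeping track of which installed versions fall on which side of 20210315, here is a rough Python sketch that splits a linux-firmware package version into its parts and compares it against a baseline. It assumes the `date.commit-pkgrel` layout seen in the version strings in this thread; the helper names are made up for illustration. It is a simplification and not how pacman itself orders versions (pacman ships a `vercmp` tool for that).

```python
from typing import NamedTuple


class FirmwareVersion(NamedTuple):
    date: str    # snapshot date, YYYYMMDD
    commit: str  # abbreviated upstream commit hash
    pkgrel: int  # package release number


def parse_firmware_version(version: str) -> FirmwareVersion:
    """Split a string like '20210315.3568f96-3' into its parts."""
    upstream, _, pkgrel = version.rpartition("-")
    date, _, commit = upstream.partition(".")
    return FirmwareVersion(date, commit, int(pkgrel))


def is_newer_than(version: str, baseline: str) -> bool:
    """True if version sorts after baseline by (date, pkgrel)."""
    v = parse_firmware_version(version)
    b = parse_firmware_version(baseline)
    # Fixed-width YYYYMMDD strings compare correctly as plain strings.
    return (v.date, v.pkgrel) > (b.date, b.pkgrel)


# Version strings mentioned in this thread, checked against 20210315.3568f96-3.
for v in ["20200916.00a84c5-1", "20210315.3568f96-3", "20210716.b7c134f-1"]:
    print(v, is_newer_than(v, "20210315.3568f96-3"))
```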
Offline
Yeah, I've been running linux-firmware 20210315.3568f96-3 since yesterday to see if that changes anything. I'll need to have it running for a week though to see if it strikes, because it cannot be reproduced reliably.
hi
Offline
This happened again. I notice that it likes to happen during window switches, like maximizing a previously minimized window.
This was the full experience with the black screen and no recovery; I needed the SysRq combo to reboot.
I'm using:
linux-mainline 5.14rc5-1
linux-firmware 20210315.3568f96-3
mesa 21.2.0-1
cinnamon 5.0.5-1
So, using the downgraded linux-firmware doesn't work for me.
Last edited by jpegxguy (2021-08-11 01:20:51)
hi
Offline
This happened again. I notice that it likes to happen during window switches, like maximizing a previously minimized window.
This was the full experience with the black screen and no recovery; I needed the SysRq combo to reboot.
I'm using:
linux-mainline 5.14rc5-1
linux-firmware 20210315.3568f96-3
mesa 21.2.0-1
cinnamon 5.0.5-1
So, using the downgraded linux-firmware doesn't work for me.
Use linux-firmware 20210315.3568f96-2 (or even -1) instead; it happened to me with -3 as well. I'm testing -2 as of now.
EDIT: linux-firmware 20210315.3568f96-2 has the issue as well; linux-firmware 20210315.3568f96-2-1 is the way to go.
Last edited by lpr1 (2021-08-18 12:39:18)
Offline
Having the same problem running a 3700U w/ 5.13.8-1 and plasma wayland:
amdgpu 0000:04:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:3 pasid:32771, for process Xwayland pid 1272 thread Xwayland:cs0 pid 1289)
in page starting at address 0x000080010302a000 from IH client 0x1b (UTCL2)
Faulty UTCL2 client ID: TCP (0x8)
MORE_FAULTS: 0x1
WALKER_ERROR: 0x0
PERMISSION_FAULTS: 0x3
MAPPING_ERROR: 0x0
RW: 0x0
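When comparing several of these page-fault reports, it can help to pull the status fields into a structure instead of eyeballing them. Here is a minimal Python sketch, assuming the fields appear as `NAME: 0x...` as in the log above. The function name is made up for illustration, and the sketch only extracts the values; it does not try to interpret what they mean.

```python
import re
from typing import Dict

# Matches status fields such as "MORE_FAULTS: 0x1", with or without a
# dmesg prefix in front of them on the same line.
FIELD = re.compile(r"\b([A-Z][A-Z_]+):\s*(0x[0-9a-fA-F]+)")


def parse_fault_fields(log_text: str) -> Dict[str, int]:
    """Collect 'NAME: 0x..' status fields into a {name: value} mapping."""
    return {name: int(value, 16) for name, value in FIELD.findall(log_text)}


# Status lines copied from the log above.
sample = """\
Faulty UTCL2 client ID: TCP (0x8)
MORE_FAULTS: 0x1
WALKER_ERROR: 0x0
PERMISSION_FAULTS: 0x3
MAPPING_ERROR: 0x0
RW: 0x0
"""
fields = parse_fault_fields(sample)
nonzero = {name: value for name, value in fields.items() if value}
print(nonzero)
```

The "Faulty UTCL2 client ID" line is deliberately skipped, because its value is not in the plain `NAME: 0x...` form.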
Offline
On my 3750H laptop, I haven't seen the error since I downgraded the firmware to 20200916.00a84c5-1. I noticed that, with different firmware versions from this year, radeontop would randomly show the memory clock at 100% from boot. That was supposedly caused by a specific kernel version and fixed later, but when testing with different firmware and kernel versions, even manually compiled ones, I still randomly got the memory clock at max from boot, along with the OP's error. So I went with a firmware version mentioned in a comment on the Mesa GitLab by a person who sees the same behaviour. So far, after 3 weeks of using Sway, I have had no sign of the OP's error and radeontop shows only 60-70% usage; before, the laptop would hang with the error almost every 1-2 hours.
Offline
The new linux-firmware may contain a fix for this issue, judging by the commit log at https://git.kernel.org/pub/scm/linux/ke … e.git/log/.
Offline