The thread has become extremely confusing and everything is slowly getting mixed up here...
MR 33248 was merged to trunk in a slightly different form that only affects raven & raven2 chipsets.
mesa 25.0 hasn't been branched off yet, so 25.0 rc candidates and stable will have the change.
mesa-test-git-25.0.0_devel.200908.66775c89fce-1-x86_64.pkg.tar.zst
https://app.box.com/s/8ednrt82hzac90x9ng5x0f5i9fjkxpos
So if the fix is included in that test-git above, then it is also included in the following mesa-test-git. Also, @OJaksch had installed the following mesa-test-git in post {#545} and has not reported a crash to date.
mesa-test-git-25.0.0_rc3.201182.3a8abfa39b7-1-x86_64.pkg.tar.zst
https://app.box.com/s/3npzkdloj8fgx6068i612prg0pd8172t
Offline
EDIT: Removed more confusion...
Is this quote of Lone_Wolf confirming that a slight variation of his patch has been applied upstream to mesa 25.0 rc, including the one he has posted?
Lone_Wolf {#329} wrote:
MR 33248 was merged to trunk in a slightly different form that only affects raven & raven2 chipsets.
mesa 25.0 hasn't been branched off yet, so 25.0 rc candidates and stable will have the change.
Sorry, but not being a dev or an appropriately educated user, I find the following terms very confusing and/or lack a clear understanding of them.
This makes it difficult to follow the situation.
MR 33248 ?
merged ?
trunk ?
25.0 hasn't been branched off ?
"so 25.0 rc candidates and stable will have the change" I 'think' this means mesa 25rc and 25 when it hits the Arch official repos will have the fix.
Would anyone please decipher Lone_Wolf's quote using language that a dummy like me could make sense of?
There are likely silent followers that may find this helpful as well.
Last edited by NuSkool (2025-02-17 19:47:06)
Scripts I use: https://github.com/Cody-Learner
$ glxinfo | grep Device
Device: AMD Radeon Vega 11 Graphics (radeonsi, raven, LLVM 19.1.7, DRM 3.60, 6.13.3-arch1-1.1) (0x15dd)
$ sudo dmesg | awk '/drm/ && /gfx/'
[ 6.427009] [drm] add ip block number 6 <gfx_v9_0>
Offline
@NuSkool and @all
You don't need to apologise for anything. It's not your fault that the thread here can no longer easily provide the information needed by new users.
You can no longer read everything and you can no longer summarise everything that has been said.
At the beginning in December 2024 and also in January 2025 I still tried to keep the thread together here, but that is no longer possible.
Now Manjaro is also affected and they can hardly get in here. I only got in touch to make sure @fixitharder didn't drown here.
The kernel investigation would have needed its own thread IMHO. But it already seems too late for that.
But nobody could have known that mesa developers and maintainers would leave us out in the rain like this. Without Lone_Wolf we wouldn't have been able to use our systems for months.
I've been more than just angry for a long time and would prefer to stay in the background at the moment. In addition, I can't really help either. Just waiting!
Offline
@NuSkool
--- I
a) applies
At the time, the stable mesa was 24.3.4, which I understand means that the fix will be included in a 24.3.5 release, but also in any 25.x release.
b) does not apply
If 25.0 had already been branched off, then the fix would not have been included in versions 24.3.5 and 24.3.6.
The only problem is that mesa 24.3.5 has not yet been released, but was supposed to be released on 05.02.2025.
I now suspect that mesa 25.x could be released before 24.3.5, which doesn't matter at all by now.
At least that's how I understand it.
--- II
Regarding the other question: The original fix was revised twice and has been improved. The last version was then merged in. Lone_Wolf's original fix was IMHO the first version of it.
At least that's how I understand it, but I can't find the source right now. ;-)
Last edited by orbit-oc (2025-02-17 20:20:20)
Offline
@orbit-oc Agree with the confusion in this thread... I've removed my last few additions to it.
I've used Linux for more than 15 years and have never had anything break a system like this (i.e. a non-recoverable freeze requiring a hard reset) for such a long time.
I'll do my best to not direct any frustration towards anyone or team as a user though. Lessons hard learned over the years... Seems there is a reason, but it may not be motivated by anything we'd be aware of. I just don't understand why this is taking so long after a viable solution has been found and passed testing (admittedly small volume, etc.), considering the percentage of hardware out there it breaks.
And yes, Lone_Wolf has saved the day for those of us who have found his mesa packages.
Early on, I was wondering whether Arch would release a patched version of mesa to the repos. After learning that mesa is so closely tied to several other packages, though, who knows how our fix might break things for other users. It's way outside my wheelhouse to even guess.
Offline
Build: linux-amdgpu-testing-6.13.2.arch1-11 - one crash confirmed.
Included patches:
- Optimize mutex protected blocks in amdgpu_vm_flush
- Add PG workaround for Raven Compute rings
- Refactor GFXOFF handler v3 (lock-free atomic reference counting)
- Use gfx_off_ctrl_immediate for GFX9/10/11/12
- Remove workaround for Raven in amdgpu_dpm
- Remove workaround for TLB seq race
laikm from pacoandres for reporting issues
Last edited by Mechanicus (2025-02-18 19:43:33)
Offline
Perhaps I can attempt to summarize for normal Arch users since I've been following along.
- A mesa regression (likely the main mesa regression causing problems for users here) was bisected about a month ago: https://bbs.archlinux.org/viewtopic.php … 0#p2219060.
- Marek proposed a fix which was then merged (into mesa's test branch) about three weeks ago: https://gitlab.freedesktop.org/mesa/mes … te_2754154. This will be included in mesa 25.0 when released as stable, which could be as early as a few days from now, though it could also be two weeks: https://docs.mesa3d.org/release-calendar.html. I am not sure what the status of 24.3.5 is. Maybe someone can request that the patch be ported to that branch too, if it hasn't been already.
Arch users can install a mesa test build thanks to Lone_Wolf while we wait: https://bbs.archlinux.org/viewtopic.php … 3#p2226213. I have been using Lone_Wolf's builds with Marek's patches for over a month with no issue.
Looking more broadly, I think there are two main options for distro/package maintainers:
- add the patch manually
- wait for a mesa release with the patch
Maybe someone can ping our mesa package maintainers if they weren't already aware to see how they want to handle this.
Separately, people have been working on identifying and fixing the root cause in the kernel.
@Mechanicus
Please make a new thread (or just use github) for your kernel tests to not pollute this thread, especially since the issue https://gitlab.freedesktop.org/drm/amd/ … te_2780504 has been closed, a fix has been merged into the kernel, and it seems a thread really isn't ideal for your kind of feedback.
@fixitharder (and any other Manjaro users)
If you see freezes on mesa 24.2.7 (and you've rebooted), then it seems very likely to be a different issue than the ones here. Try to bisect if you can, otherwise we likely can't help you much in this thread since you use Manjaro.
Offline
@orbit-oc said: "I've been more than just angry for a long time and would prefer to stay in the background at the moment."
your input has been valuable - please don't get discouraged and thanks for all you've done - i'm in the same boat and just as pissed off at this crap
also thanks to @Lone_Wolf and everyone else who has been of help
one of the problems here that was pointed out is the mix of geeks (those with intimate knowledge of the subject) and nooblets (me, @NuSkool, etc.) - we nooblets are looking for *simple solutions* and those solutions are naturally dynamic as this situation is evolving - be nice if the 1st post was kept up to date (@pacmancrashedagain)
BUT, my primary point in posting is to mention that i'm on mesa 1:24.3.4-1 and llvm-libs 19.1.7-1 and have had *no freezing* for ~16 hrs (https://forum.manjaro.org/t/constant-sy … /174105/36) - journalctl -p 4 -b is still looking mighty ugly however (-p 3 is fine)
i also reset BIOS (MSI) which changed global c-state from disabled to automatic and adjusted some RAM settings
so far, so good - no Firefox wonkiness, no instability - another 24 hrs. and i'll be happy
Offline
Seems to be stable on linux-amdgpu-testing 6.13.2-arch1-11, though there's still an issue with hotplug bringing up the display on first boot if no displays were attached and the iGPU is enabled. Though the iGPU being enabled also seems to help with the S3 Suspend working, so I'm not sure. I was able to bring up the displays by using SSH to start up GDM. I will let you know how this proceeds.
The test results you requested, however, still using Arch's stable Mesa:
- temperature: While the dGPU is mostly idle and using ~10% GPU, the temperature is 51°C, while the iGPU is using 0% GPU and running at 41°C.
- average package power (sudo turbostat -s PkgWatt): dGPU: 25W, iGPU 34W
- WebGL performance Aquarium for different amount of fish: All fish counts up to 15,000 were 60fps on both GPUs. I only tested 20,000 and higher on the dGPU, and it ran down to 40-45fps on up to 30,000 fish, but only using 12% of the GPU.
- vblank_mode=0 glxgears: On the dGPU, almost 21,000 fps at the default window size.
Offline
Kclisp has given a good summary of events in #582 .
A few additional comments
1.
There are currently 3 binaries provided by me
A. mesa 24.2.8
B. mesa development version (includes raven/raven2 fix)
C. mesa 25.0.0 release candidate 3 (includes raven/raven2 fix)
All 3 binaries are built against the archlinux repos with the pkgctl tool from the archlinux devtools package, which is also used by archlinux devs.
A is the final version of mesa 24.2 , B and C both have the fix but C is much closer to what soon will be released as mesa 25.0.0 .
Using those binaries on other distros is similar to asking the Formula 1 champion to participate in Nascar races using their formula 1 racecar and expecting them to win .
In case someone wants to build the binaries themselves, send me a forum message to get the PKGBUILD for that specific binary.
2.
No one has explicitly asked mesa devs to add the fix to a 24.3.x release. There are no signals that mesa devs plan to do that.
3.
Kclisp wrote:
Maybe someone can ping our mesa package maintainers if they weren't already aware to see how they want to handle this.
That should be done through a bug report against the mesa package.
Note that as far as I know no one has tested the fix with 24.3 releases and this reduces the chance of arch devs adding it.
Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.
clean chroot building not flexible enough ?
Try clean chroot manager by graysky
Offline
Lone_Wolf wrote:
No one has explicitly asked mesa devs to add the fix to a 24.3.x release. There are no signals that mesa devs plan to do that.
Debian Testing (‘trixie’) uses mesa 24.3.4. Do you seriously think that Debian will crash its users if the fix is not implemented in mesa 24.3.5 /24.3.6?
How big is the damage yet going to be? Folks: 'go back to Windows'?
Offline
Debian will do what they always do in cases like these and carry a backport if necessary. Note that none of the mesa devs are here; if you have relevant concerns, you need to bring them upstream.
Offline
Keep in mind that while mesa 24.3.0 had several issues a lot of them have been fixed in later releases like 24.3.4 .
The issue the majority of this thread deals with is limited to amd raven & raven2 chipsets.
It's a nasty bug and the code that causes it brought a massive change in radeonsi that had seen very little testing.
Personally I feel it should not have been in mesa 24.3 at all.
However it only affects users with that exact hardware, other linux users may not know this bug exists.
Are debian mesa developers even aware of the bug ?
Offline
Lone_Wolf wrote:
Are debian mesa developers even aware of the bug ?
I have no idea. But sooner or later that will be the case and then the users there or Debian itself will go looking for the cause again, just like here.
With such a large impact, and one that is known at mesa, the least that could have been done would have been a reasonable information policy, so that at least all distributions know and the same work does not have to be done again and again.
There is no need to submit a request to mesa for this. The effects are known and the people there can certainly assess the effects.
I'll give it a rest now, but I'm still very surprised.
Offline
It's just a suggestion, but I also think the whole kernel testing effort might belong in a new thread, so that the people who are investigating can keep their information organized, separate from the people who just want to know whether Mesa is finally fixed. After Mesa 25.0 is released, this thread is probably going to cease all activity.
I guess none of the Mesa devs had this APU to see how bad the issue was, and the fact that Mesa 24.3 was released near the holiday season probably didn't help. A big thanks again to Lone_Wolf for saving the day; otherwise we would probably be screwed in the long term with this APU and Linux once the bug reached the other stable distros.
Last edited by pacmancrashedagain (2025-02-18 12:18:00)
Offline
I just want to chime in to say that I seem to be affected by this issue (applications occasionally locking up the CPU, freezing the whole system and forcing me to hard reboot), and I have a Ryzen 5 3600XT CPU, with a Radeon RX 5600XT graphics card.
This mesa patch, as I've understood it, will fix the issue for Raven APUs, but mine is a Matisse CPU. So I'm left uncertain if it's going to fix it for me. Can anyone offer some insight on this? Thanks.
I have very recently rolled my system back to the older mesa/vulkan-radeon/llvm-libs versions as suggested here, but it's too early to tell with confidence if that alleviates it for me.
Offline
ParaSait wrote:
I just want to chime in to say that I seem to be affected by this issue (applications occasionally locking up the CPU, freezing the whole system and forcing me to hard reboot), and I have a Ryzen 5 3600XT CPU, with a Radeon RX 5600XT graphics card.
This mesa patch, as I've understood it, will fix the issue for Raven APUs, but mine is a Matisse CPU. So I'm left uncertain if it's going to fix it for me. Can anyone offer some insight on this? Thanks.
I have very recently rolled my system back to the older mesa/vulkan-radeon/llvm-libs versions as suggested here, but it's too early to tell with confidence if that alleviates it for me.
The fix in this thread is for people with Vega (gfx9) integrated graphics; it's not related to the CPU. Your graphics card seems to be Navi from what I can find, so it shouldn't fix anything for you.
Offline
Build: linux-amdgpu-testing-6.13.2.arch1-12 - freeze confirmed.
Included patches:
- Optimize mutex protected blocks in amdgpu_vm_flush
- Add PG workaround for Raven Compute rings
- Refactor GFXOFF handler v3.1 (updated lock-free atomic reference counting)
- Use gfx_off_ctrl_immediate for GFX9/10/11/12
- Remove workaround for Raven in amdgpu_dpm
- Remove workaround for TLB seq race
- Add PG workaround for RDNA1-2
Build: linux-amdgpu-testing-6.13.2.arch1-13 - crashed for 1 configuration
Included patches:
- Optimize mutex protected blocks in amdgpu_vm_flush
- Add PG workaround for Raven Compute rings
- Refactor GFXOFF handler v4
- Use gfx_off_ctrl_immediate for GFX9/10/11/12
- Remove workaround for Raven in amdgpu_dpm
- Remove workaround for TLB seq race
- Add PG workaround for RDNA1-2
glibc 2.41+r9+ga900dbaf70f0-1
Build: linux-amdgpu-testing-6.13.3.arch1-1 - freezes
Included patches:
- Optimize mutex protected blocks in amdgpu_vm_flush
- Add PG workaround for Raven Compute rings
- Refactor GFXOFF handler v5
- Use gfx_off_ctrl_immediate for GFX9/10/11/12
- Remove workaround for TLB seq race
- Add PG workaround for RDNA1-2
Build: linux-amdgpu-testing-6.13.3.arch1-2 - crash after 2 hours of video streaming
Included patches:
- Optimize mutex protected blocks in amdgpu_vm_flush
- Add PG workaround for Raven Compute rings
- Refactor GFXOFF handler v6
- Use gfx_off_ctrl_immediate for GFX9/10/11/12
- Remove workaround for TLB seq race
- Add PG workaround for RDNA1-2
What to check:
- temperature
- average package power (sudo turbostat -s PkgWatt)
- WebGL performance Aquarium for different amount of fish
- vblank_mode=0 glxgears
GitHub
Last edited by Mechanicus (Today 08:47:18)
Offline
ParaSait wrote:
I just want to chime in to say that I seem to be affected by this issue (applications occasionally locking up the CPU, freezing the whole system and forcing me to hard reboot), and I have a Ryzen 5 3600XT CPU, with a Radeon RX 5600XT graphics card.
This mesa patch, as I've understood it, will fix the issue for Raven APUs, but mine is a Matisse CPU. So I'm left uncertain if it's going to fix it for me. Can anyone offer some insight on this? Thanks.
I have very recently rolled my system back to the older mesa/vulkan-radeon/llvm-libs versions as suggested here, but it's too early to tell with confidence if that alleviates it for me.
According to this thread, https://gitlab.freedesktop.org/drm/amd/-/issues/3930, it looks like it could be a similar problem. But both upstreamed fixes (kernel and mesa) are workarounds for gfx9, and the RX 5600XT is not gfx9 (not sure if gfx10 or 11).
Maybe I'm wrong, but that is what I'm seeing in the commits for mesa and amdgpu module.
But I'm sure here are people that can give you a better answer.
Offline
According to this thread, https://gitlab.freedesktop.org/drm/amd/-/issues/3930, it looks like it could be a similar problem. But both upstreamed fixes (kernel and mesa) are workarounds for gfx9, and the RX 5600XT is not gfx9 (not sure if gfx10 or 11).
Maybe I'm wrong, but that is what I'm seeing in the commits for mesa and amdgpu module.
But I'm sure here are people that can give you a better answer.
Thanks, but my problem is definitely taking place on the CPU, not on the GPU. It typically starts with some "event" broadly speaking (for example, opening firefox or stopping my display manager) that causes the process to lock up my CPU, making the computer *almost* completely frozen at first (I can still type some things in my terminal emulator for example, it just won't spawn any new processes), kill -9 doesn't work, and after about a minute the entire system finally hangs and I can't do anything anymore.
Messages like this pop up in my journal when that happens:
Feb 12 19:05:22 desky kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Feb 12 19:05:22 desky kernel: rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P406896/1:b..l
Feb 12 19:05:22 desky kernel: rcu: (detected by 5, t=60002 jiffies, g=8600213, q=170814 ncpus=12)
Feb 12 19:06:18 desky kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kcompactd0:101]
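For anyone who wants to pull these events out of a longer journal, a small script along the following lines might help. It is only a sketch: the two regular expressions are written against the four lines quoted above, the function name `summarize` is made up for illustration, and other kernel versions may word these messages differently.

```python
import re

# Sample lines copied from the journal excerpt above.
SAMPLE = """\
Feb 12 19:05:22 desky kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Feb 12 19:05:22 desky kernel: rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P406896/1:b..l
Feb 12 19:05:22 desky kernel: rcu: (detected by 5, t=60002 jiffies, g=8600213, q=170814 ncpus=12)
Feb 12 19:06:18 desky kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kcompactd0:101]
"""

# Patterns are assumptions based only on the sample lines above.
RCU_STALL = re.compile(r"rcu: INFO: \S+ detected stalls")
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")


def summarize(log_text):
    """Count RCU stall reports and collect soft lockup details from log text."""
    stalls = 0
    lockups = []
    for line in log_text.splitlines():
        if RCU_STALL.search(line):
            stalls += 1
        match = SOFT_LOCKUP.search(line)
        if match:
            lockups.append({
                "cpu": int(match.group(1)),
                "seconds": int(match.group(2)),
                "task": match.group(3),
                "pid": int(match.group(4)),
            })
    return {"rcu_stalls": stalls, "soft_lockups": lockups}


print(summarize(SAMPLE))
```

Running it on the excerpt reports one RCU stall and one soft lockup on CPU 9 in the task kcompactd0. It only counts what the kernel logged; it says nothing about the cause.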
My first reflex was to suspect there's some kind of deadlock regression in the kernel, so I switched from linux 6.13.2.arch1-1 to linux-lts 6.12.13-1, but that did not solve the problem. So now I'm considering if it could be related (even tangentially) to this issue with mesa, which sounds suspiciously familiar.
I apologize if I'm adding even more confusion to this quagmire... I'm not having an easy time figuring out the cause of this one.
Last edited by ParaSait (2025-02-18 18:16:41)
Offline
pacmancrashedagain wrote:
It's just a suggestion, but I also think the whole kernel testing effort might belong in a new thread, so that the people who are investigating can keep their information organized, separate from the people who just want to know whether Mesa is finally fixed. After Mesa 25.0 is released, this thread is probably going to cease all activity.
I guess none of the Mesa devs had this APU to see how bad the issue was, and the fact that Mesa 24.3 was released near the holiday season probably didn't help. A big thanks again to Lone_Wolf for saving the day; otherwise we would probably be screwed in the long term with this APU and Linux once the bug reached the other stable distros.
Especially for users who need a stable system for work and do not have time to test everything, there is only one workaround that actually makes the system rock solid without adding any third-party packages: the kernel parameter from Lone_Wolf or orbit-oc (I forget who posted it a while back):
amdgpu.ppfeaturemask=0xfff73fff
It's not all shiny there either: there is a performance degradation when using this. However, it is well worth it in this scenario.
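For the curious, the value is a bitmask, and a few lines of Python show which bits this particular value switches off. This is just a sketch: the helper name `cleared_bits` is made up, and the 32-bit width is an assumption. As far as I know, bit 15 (0x8000) is the GFXOFF feature in the kernel's amdgpu headers, but please check the source for your kernel version before relying on that.

```python
def cleared_bits(mask, width=32):
    """Return the bit positions (0 = least significant) that are 0 in mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


mask = 0xfff73fff  # value from the kernel parameter above
print(cleared_bits(mask))  # prints [14, 15, 19]
```

So the parameter disables three power features rather than one, which might explain part of the performance and power cost mentioned above.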
Offline
Lone_Wolf wrote:
No one has explicitly asked mesa devs to add the fix to a 24.3.x release. There are no signals that mesa devs plan to do that.
Debian Testing (‘trixie’) uses mesa 24.3.4. Do you seriously think that Debian will crash its users if the fix is not implemented in mesa 24.3.5 /24.3.6?
How big is the damage yet going to be? Folks: 'go back to Windows'?
I went back to Windows, I use the computer to work and I can't have random crashes.
Offline
Is there consensus that the following would be a good indicator if the system has 'gfx9'?
$ sudo dmesg | grep '<gfx'
[ 6.366573] [drm] add ip block number 6 <gfx_v9_0>
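If someone wants to script this check, something like the following could work. It is a minimal sketch: the pattern is written against the single dmesg line above, the function name `gfx_ip_version` is made up, and other kernel versions may print the line differently. Note also that, as far as I understand, gfx9 covers more hardware than raven and raven2, so the glxinfo Device line (which names the chip) is the more specific check.

```python
import re

# Pattern is an assumption based on the dmesg line shown above.
GFX_BLOCK = re.compile(r"add ip block number \d+ <gfx_v(\d+)_(\d+)>")


def gfx_ip_version(dmesg_text):
    """Return (major, minor) of the first gfx IP block found, or None."""
    match = GFX_BLOCK.search(dmesg_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "[ 6.366573] [drm] add ip block number 6 <gfx_v9_0>"
version = gfx_ip_version(sample)
print(version)                                   # prints (9, 0)
print(version is not None and version[0] == 9)   # prints True
```

To use it on a real system, the text would come from the output of the dmesg command above instead of the `sample` string.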
@Lone_Wolf I think it may be worth adding this to #585 or an alternative command along with an explanation for users to ID their system for coverage.
Then maybe consider adding the entire #585 as a quote to the first (top?) post or your post #3? It may help users looking for solutions.
Another consideration that comes to mind is a wiki page or addition with the relative info?
Perhaps a stickied locked top post in one of the forums?
EDIT1: On the other hand, if a fix in the repo is days away, it may be too late to bother with these proposals.
EDIT2: Looks like Weezer has brought it up in the OpenGL wiki discussions: https://wiki.archlinux.org/title/Talk:OpenGL
Last edited by NuSkool (2025-02-19 01:00:18)
Offline
NuSkool wrote:
EDIT1: On the other hand, if a fix in the repo is days away maybe too late.
drm-amdgpu-gfx9-manually-control-gfxoff-for-cs-on-rv.patch is queued for 6.13.4. does that fix the issue for you NuSkool?
Offline
NuSkool wrote:
EDIT1: On the other hand, if a fix in the repo is days away maybe too late.
drm-amdgpu-gfx9-manually-control-gfxoff-for-cs-on-rv.patch is queued for 6.13.4. does that fix the issue for you NuSkool?
I've taken a break from the kernel tests offered here.
I'm on repo 'linux' + Lone_Wolf's 'mesa-test-git 25.0.0_rc' ATM.
Is there/has there been a pre-built (or an AUR pkg, a PKGBUILD) kernel with those patches?
I'd be glad to test it along with repo mesa.
Last edited by NuSkool (2025-02-19 02:36:53)
Offline