Issues with Mesa 24.3.x and amdgpu Vega graphics

pacoandres · 2025-01-26 07:50:59

Mechanicus wrote:

Just a reminder: the pageflip timeout problems did not start in December, but about a year ago.
https://gitlab.freedesktop.org/drm/amd/-/issues/2950
https://community.amd.com/t5/pc-drivers … d-p/723574
https://www.reddit.com/r/archlinux/comm … p_timeout/
https://www.reddit.com/r/AMDHelp/commen … imeout_on/
https://bugs.kde.org/show_bug.cgi?id=493277
https://lists.freedesktop.org/archives/ … 46656.html
https://github.com/M-Bab/linux-kernel-a … /issues/48
https://forums.freebsd.org/threads/amdg … set.89229/
The most interesting outcome is the last comment in the last link above:
For me, a recent DRM update of "latest" package tree, has solved the issue.
Which means that the problem is not in Mesa. It goes deeper, to DRM and the firmware binaries.
Since the "fix" you're trying to apply changes only the state of the compute queues, I think we can try to force them to be handled by the CPU by adding the amdgpu.vm_update_mode=3 kernel parameter (see documentation).
The next thing to try is enabling the Micro Engine Scheduler via amdgpu.mes=1.
Anyway, a fix that disables half of the GPU functionality should be the last resort.

I don't know much about GPU functionality and internal details, so I have a question about your comment:
Do you mean the patch I'm using disables half of the GPU functionality?

Thanks.

Mechanicus · 2025-01-26 07:55:09

lpr1 wrote:

Does amdgpu.mes=1 work with vega?

The drm/amdgpu driver supports all AMD Radeon GPUs based on the Graphics Core Next (GCN) architecture (source: drm/amdgpu AMDgpu driver documentation)

Mechanicus · 2025-01-26 08:05:35

pacoandres wrote:

I don't know much about GPU functionality and internal details, so I have a question about your comment:
Do you mean the patch I'm using disables half of the GPU functionality?
Thanks.

Yes. Mesa has finally started using the compute part of the GPU, and I think we should try to force it to work as expected. This is performance you've already paid for.
There are several parameters we can adjust manually, so we should start with those until fixed firmware is released. The first parameter to apply is vm_update_mode:

Override VM update mode. VM updated by using CPU (0 = never, 1 = Graphics only, 2 = Compute only, 3 = Both). The default is -1 (Only in large BAR(LB) systems Compute VM tables will be updated by CPU, otherwise 0, never).

I proposed setting it to 3, but 2 will probably be enough. BTW, are ReBAR and Above 4G Decoding enabled on your systems?
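
To see which mode is currently in effect, the quoted value table can be turned into a small lookup. This is only a minimal sketch: the function name describe_vm_update_mode is made up for illustration, and the sysfs path is an assumption (it should exist only while the amdgpu module is loaded and exposes the parameter).

```shell
#!/bin/sh
# Map an amdgpu.vm_update_mode value to the meaning given in the quoted documentation.
describe_vm_update_mode() {
    case "$1" in
        -1) echo "default (CPU updates compute VM tables only on large-BAR systems)" ;;
        0)  echo "never" ;;
        1)  echo "graphics only" ;;
        2)  echo "compute only" ;;
        3)  echo "both" ;;
        *)  echo "unknown value: $1" ;;
    esac
}

# Assumed location of the live module parameter; skipped silently if it is absent.
param=/sys/module/amdgpu/parameters/vm_update_mode
if [ -r "$param" ]; then
    describe_vm_update_mode "$(cat "$param")"
fi
```

Called by hand, describe_vm_update_mode 2 prints the label for the "Compute only" setting proposed above.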

lpr1 · 2025-01-26 08:56:08

Mechanicus wrote:

lpr1 wrote:
Does amdgpu.mes=1 work with vega?
The drm/amdgpu driver supports all AMD Radeon GPUs based on the Graphics Core Next (GCN) architecture (source: drm/amdgpu AMDgpu driver documentation)

Yes, but that specific feature is for newer hardware (gfx10 or later). I will try it though, thanks.

Mechanicus · 2025-01-26 09:18:09

lpr1 wrote:

Mechanicus wrote:
lpr1 wrote:
Does amdgpu.mes=1 work with vega?
The drm/amdgpu driver supports all AMD Radeon GPUs based on the Graphics Core Next (GCN) architecture (source: drm/amdgpu AMDgpu driver documentation)
Yes, but that specific feature is for newer hardware (gfx10 or later). I will try it though, thanks.

I applied this option on GFX9 and tested for 30 minutes with multiple browser windows, each with multiple tabs playing video (HW acceleration enabled) and WebGL streams, plus window switching, resizing, etc. So far so good.
I will test the amdgpu.vm_update_mode options (1, 2, 3) afterwards if a freeze happens.

Update:
1) amdgpu.mes=1: no quick system freezes. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang, like before.
2) amdgpu.vm_update_mode=1 (Graphics only): no glitches when loading the WebGL Aquarium in Chrome, but some performance drop when switching to 30000 fish. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang, like before.
3) amdgpu.vm_update_mode=2 (Compute only): adds visual glitches when loading the WebGL Aquarium in Chrome, plus some performance drop when switching to 30000 fish. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang, like before.
4) amdgpu.vm_update_mode=3 (Both): no glitches when loading the WebGL Aquarium in Chrome, but some performance drop when switching to 30000 fish. Playing a 4k60p YouTube video with HW acceleration quickly causes a system freeze when switching from fullscreen to windowed mode. Interestingly, though, executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover no longer causes a system freeze.
5) amdgpu.cwsr_enable=0: no performance drop in the WebGL Aquarium in Chrome, no freezes in 4k60p YouTube. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang.

Summary:
1) Only amdgpu.vm_update_mode=3 makes amdgpu_gpu_recover work as expected.
2) The bug is related to firmware.
3) I propose testing amdgpu.cwsr_enable=0 further, since it reduces context switching inside the GPU VM, so this might be a way to work around the buggy firmware for now.
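
When testing one parameter per boot like this, it helps to confirm which amdgpu options the running kernel actually received. The following is a minimal sketch; the helper name amdgpu_params is hypothetical, and it only inspects the command line text, not whether the driver honoured the option.

```shell
#!/bin/sh
# Print every amdgpu.* option found in a kernel command line string, one per line.
# Exit status is non-zero when no amdgpu option is present (grep's behaviour).
amdgpu_params() {
    printf '%s\n' "$1" | tr -s ' \t' '\n' | grep '^amdgpu\.'
}

# Inspect the running kernel, if /proc/cmdline is available.
if [ -r /proc/cmdline ]; then
    amdgpu_params "$(cat /proc/cmdline)" || echo "no amdgpu options set"
fi
```

Recording this output next to each test result makes it harder to mix up which boot had which option.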

Last edited by Mechanicus (2025-01-26 11:14:22)

davlucas · 2025-01-26 09:20:23

Issue mitigated by using Fluxbox.
Actually, when using Wayland it crashes frequently; with a light WM it does not crash. There are other light WMs that one could try.
Thanks, Mechanicus, for the proposed mitigation.

Lone_Wolf · 2025-01-26 12:17:06

New binary using Mesa commit 61e289d0ca05073d079e3e6004089afd80e0df1b + gfx9 patch uploaded.

Download link

lpr1 · 2025-01-26 13:43:57

Mechanicus wrote:

lpr1 wrote:
Mechanicus wrote:
The drm/amdgpu driver supports all AMD Radeon GPUs based on the Graphics Core Next (GCN) architecture (source: drm/amdgpu AMDgpu driver documentation)
Yes, but that specific feature is for newer hardware (gfx10 or later). I will try it though, thanks.
I applied this option on GFX9 and tested for 30 minutes with multiple browser windows, each with multiple tabs playing video (HW acceleration enabled) and WebGL streams, plus window switching, resizing, etc. So far so good.
I will test the amdgpu.vm_update_mode options (1, 2, 3) afterwards if a freeze happens.
Update:
1) amdgpu.mes=1: no quick system freezes. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang, like before.
2) amdgpu.vm_update_mode=1 (Graphics only): no glitches when loading the WebGL Aquarium in Chrome, but some performance drop when switching to 30000 fish. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang, like before.
3) amdgpu.vm_update_mode=2 (Compute only): adds visual glitches when loading the WebGL Aquarium in Chrome, plus some performance drop when switching to 30000 fish. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang, like before.
4) amdgpu.vm_update_mode=3 (Both): no glitches when loading the WebGL Aquarium in Chrome, but some performance drop when switching to 30000 fish. Playing a 4k60p YouTube video with HW acceleration quickly causes a system freeze when switching from fullscreen to windowed mode. Interestingly, though, executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover no longer causes a system freeze.
5) amdgpu.cwsr_enable=0: no performance drop in the WebGL Aquarium in Chrome, no freezes in 4k60p YouTube. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover causes a system hang.
Summary:
1) Only amdgpu.vm_update_mode=3 makes amdgpu_gpu_recover work as expected.
2) The bug is related to firmware.
3) I propose testing amdgpu.cwsr_enable=0 further, since it reduces context switching inside the GPU VM, so this might be a way to work around the buggy firmware for now.

I've only enabled amdgpu.mes=1 and worked in Blender and GIMP, which previously froze the system quickly at times (it's kind of random). So far it's good; let's wait some time and see. You might be on to something with this option alone.

bernd_b · 2025-01-26 18:53:10

lpr1 wrote:

I've only enabled amdgpu.mes=1

So the idea is that enabling this kernel option at boot could replace the patched Mesa version?

Mechanicus · 2025-01-26 18:58:04

bernd_b wrote:

lpr1 wrote:
I've only enabled amdgpu.mes=1
So the idea is that enabling this kernel option at boot could replace the patched Mesa version?

Exactly. The patch completely disables use of the Compute module in Mesa. This option, or another of the proposed ones, might help force the driver to work as expected, using both the Graphics and Compute modules. At least until the firmware is fixed.

Last edited by Mechanicus (2025-01-26 19:01:27)

V1del · 2025-01-26 19:24:34

Begs the question whether the firmware here will ever be fixed. Afaik AMD pretty much dropped Vega support for compute use cases, and the firmware is one of the things where "FOSS" and "community" can do squat.

Mechanicus · 2025-01-26 19:29:41

V1del wrote:

Begs the question whether the firmware here will ever be fixed. Afaik AMD pretty much dropped Vega support for compute use cases, and the firmware is one of the things where "FOSS" and "community" can do squat.

I know... A lot of related errors have not shown any progress since 2018. But there are some interesting fixes related to VM and IP management in Linux 6.13: https://github.com/torvalds/linux/commi … amd/amdgpu
Let's see if they help in our case.
Also some firmware binaries are already updated in linux-firmware master: https://gitlab.com/kernel-firmware/linu … type=heads

Last edited by Mechanicus (2025-01-26 19:34:41)

pacmancrashedagain · 2025-01-26 19:35:31

Just to keep track: are both @Mechanicus and @lpr1 using that kernel parameter with Mesa 24.3.4?

Mechanicus · 2025-01-26 19:38:56

pacmancrashedagain wrote:

Just to keep track: are both @Mechanicus and @lpr1 using that kernel parameter with Mesa 24.3.4?

I'm using up-to-date mesa 24.3.4 and other components from the repo. And yes, after some tests with other params, I kept amdgpu.mes=1 for now.

Last edited by Mechanicus (2025-01-26 19:44:50)

bernd_b · 2025-01-26 22:01:51

Mechanicus wrote:

I'm using up-to-date mesa 24.3.4...

I switched to an unpatched git-version (25.0.0_devel.200681.5cc977bee40.d41d8cd-1) with amdgpu.mes=1 added to my boot options. First hours were successful.

Last edited by bernd_b (2025-01-26 22:02:51)

pacmancrashedagain · 2025-01-26 22:29:25

Thanks guys. If that kernel parameter works, I feel like it's an easier workaround.

And from my understanding we wouldn't be losing anything compared to before, because Mesa only started using the compute units now.

NuSkool · 2025-01-26 23:56:37

This latest info sounds encouraging so I'm in...

Switched to current AUR mesa-git unpatched:

$ pacman -Ss mesa-git
aur/mesa-git 25.0.0_devel.200736.dea4b46f213.d41d8cd-1 [installed]

Added kernel parameter: amdgpu.mes=1

$ awk '{$1="";$2="";print}' /proc/cmdline
  rw loglevel=3 sysrq_always_enabled=1 amd_pstate=passive amdgpu.mes=1

Plus a system update, then rebooted about half an hour ago.

GPU info:

$ lspci -k -d ::03xx ; sudo dmesg | grep 'gfx_'
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev d6)
	DeviceName: Onboard IGD
	Subsystem: Hewlett-Packard Company Device 83e9
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
[    6.263651] [drm] add ip block number 6 <gfx_v9_0>
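
The gfx IP version in that last dmesg line is what identifies a GPU as GFX9 (Vega). As a rough sketch, the token can be pulled out with sed; the helper name gfx_block is hypothetical, and the pattern assumes the log line keeps the "<gfx_vX_Y>" form shown above.

```shell
#!/bin/sh
# Read dmesg-style text on stdin and print any gfx_vX_Y IP block names found.
gfx_block() {
    sed -n 's/.*<\(gfx_v[0-9_]*\)>.*/\1/p'
}

# Typical use (needs root to read the kernel log on most systems):
#   sudo dmesg | gfx_block
```

A result starting with gfx_v9 would indicate the generation discussed in this thread.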

Last edited by NuSkool (2025-01-27 00:06:24)

NuSkool · 2025-01-27 05:49:04

Well, unfortunately I have to report that my system froze after about 5 1/2 hours, as is typical of my other Mesa testing without the patch.

Running unpatched AUR mesa-git:
aur/mesa-git 25.0.0_devel.200736.dea4b46f213.d41d8cd-1 [installed]

With kernel parameters:
rw loglevel=3 sysrq_always_enabled=1 amd_pstate=passive amdgpu.mes=1

I'm going to leave the kernel parameter and switch to official repo mesa:
extra/mesa 1:24.3.4-1 [installed]

Last edited by NuSkool (2025-01-27 06:05:02)

lpr1 · 2025-01-27 08:28:28

Using amdgpu.mes=1 alone unfortunately still results in a freeze. It took some time (~4-5h working in Blender), but eventually it did freeze.

Next I will try amdgpu.vm_update_mode=1, I guess? I want to test one option at a time.

Interestingly, most freezes in Blender happen during context switching(?), e.g. while exporting a model, when things are painted/destroyed.

UPDATE: With amdgpu.vm_update_mode=1 I got a freeze very quickly, in the file manager actually, after maybe 5-10 minutes.

Now trying the line mentioned above by NuSkool: sysrq_always_enabled=1 amd_pstate=passive amdgpu.mes=1

Last edited by lpr1 (2025-01-27 08:41:33)

Mechanicus · 2025-01-27 12:01:07

lpr1 wrote:

Using amdgpu.mes=1 alone unfortunately still results in a freeze. It took some time (~4-5h working in Blender), but eventually it did freeze.
Next I will try amdgpu.vm_update_mode=1, I guess? I want to test one option at a time.
Interestingly, most freezes in Blender happen during context switching(?), e.g. while exporting a model, when things are painted/destroyed.
UPDATE: With amdgpu.vm_update_mode=1 I got a freeze very quickly, in the file manager actually, after maybe 5-10 minutes.
Now trying the line mentioned above by NuSkool: sysrq_always_enabled=1 amd_pstate=passive amdgpu.mes=1

Probably better to try amdgpu.cwsr_enable=0 or amdgpu.enforce_isolation=1.
Some info about CWSR: https://gist.github.com/fxkamd/ecaaab72 … 0edaab9d8d

I'm trying amdgpu.enforce_isolation=1 now (https://docs.kernel.org/gpu/amdgpu/proc … ation.html)
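
Since several posters want to test one option at a time, here is a small sketch that rewrites a command line string so it carries exactly one amdgpu option. The function name only_amdgpu_option is hypothetical; it only produces text and does not touch any bootloader configuration, so the result still has to be entered there by hand.

```shell
#!/bin/sh
# Print command line $1 with every existing amdgpu.* option removed
# and the single option $2 appended at the end.
only_amdgpu_option() {
    # One word per line, drop amdgpu.* words and empty lines, rejoin with spaces.
    _base=$(printf '%s\n' "$1" | tr -s ' \t' '\n' \
        | { grep -v -e '^amdgpu\.' -e '^$' || true; } | tr '\n' ' ')
    printf '%s%s\n' "$_base" "$2"
}
```

For example, only_amdgpu_option 'rw loglevel=3 amdgpu.mes=1' 'amdgpu.cwsr_enable=0' drops the old amdgpu.mes=1 and keeps the unrelated options.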

Last edited by Mechanicus (2025-01-27 12:58:49)

lpr1 · 2025-01-27 14:21:49

Mechanicus wrote:

lpr1 wrote:
Using amdgpu.mes=1 alone unfortunately still results in a freeze. It took some time (~4-5h working in Blender), but eventually it did freeze.
Next I will try amdgpu.vm_update_mode=1, I guess? I want to test one option at a time.
Interestingly, most freezes in Blender happen during context switching(?), e.g. while exporting a model, when things are painted/destroyed.
UPDATE: With amdgpu.vm_update_mode=1 I got a freeze very quickly, in the file manager actually, after maybe 5-10 minutes.
Now trying the line mentioned above by NuSkool: sysrq_always_enabled=1 amd_pstate=passive amdgpu.mes=1
Probably better to try amdgpu.cwsr_enable=0 or amdgpu.enforce_isolation=1.
Some info about CWSR: https://gist.github.com/fxkamd/ecaaab72 … 0edaab9d8d
I'm trying amdgpu.enforce_isolation=1 now (https://docs.kernel.org/gpu/amdgpu/proc … ation.html)

I will try amdgpu.cwsr_enable=0 alone now, since sysrq_always_enabled=1 amd_pstate=passive amdgpu.mes=1 resulted in a freeze as well.

Last edited by lpr1 (2025-01-27 14:24:21)

Mechanicus · 2025-01-27 16:40:22

Setting amd_pstate is useless for the Zen and Zen+ generations. The kernel loads amd_pstate_epp automatically for Zen2 and newer. I recommend adding fsck.mode=force to mitigate the negative effects of hard resets.

Last edited by Mechanicus (2025-01-27 16:44:54)

bernd_b · 2025-01-27 16:55:32

bernd_b wrote:

I switched to an unpatched git-version (25.0.0_devel.200681.5cc977bee40.d41d8cd-1) with amdgpu.mes=1 added to my boot options. First hours were successful.

And on day three, which was this afternoon, I failed too: the PC froze out of the blue. Funnily enough, a short time before that, my Windows guest running in VirtualBox rebooted out of the blue too.

I went back to the patched git version (25.0.0_devel.200681.5cc977bee40.d41d8cd-1) and will check once again whether this lasts longer.

NuSkool · 2025-01-27 17:44:30

Sorry if I caused any confusion; I listed all the kernel parameters set on this machine.
Only the last one, amdgpu.mes=1, has anything to do with Mesa troubleshooting.
amd_pstate=passive is set as a workaround so that an old Python-based CPU monitor I use has full functionality.

In my last post I switched Mesa versions from the AUR mesa-git to the official repo mesa. This made no difference, and I had a system freeze shortly after.

I'm currently back on Lone_Wolf's latest patched mesa until I do some reading on these parameters.
https://bbs.archlinux.org/viewtopic.php … 4#p2222544
mesa-test-git-25.0.0_devel.200729.61e289d0ca0-1-x86_64.pkg.tar.zst

Are some of the participants here mesa devs? Seems at least a few have a good understanding of the inner workings of this mesa bug.
Are we just throwing stuff (parameters) at the wall to see what sticks at this point?
Has the mesa development or other related packages teams made any progress on this bug other than the test patch Lone_Wolf has been using?

Last edited by NuSkool (2025-01-27 17:57:16)

Mechanicus · 2025-01-27 18:45:37

The bug internals, for the curious:
1) There is a data race in VM access between GFX, Compute and other IP blocks. It isn't new, but before Mesa started using the Compute unit, there were fewer active participants in the race. This is evident from the outcome of the amdgpu.vm_update_mode=1,2,3 kernel parameter: when the "tick" for VM access is provided externally (from the CPU), the GPU hangs very quickly. I tried to analyze the amdgpu source code, but it is hard for a non-AMD developer (especially when you see commits like this: https://github.com/torvalds/linux/commi … 5659dd10a5).
2) Judging from the links to similar problems I found earlier, it seems to be firmware related as well. The worst case is that we've found a hardware bug, but let's hope it isn't.

Finally, I can say that there are some driver options that could possibly prevent sharing the same memory between GPU units (amdgpu.enforce_isolation=1, for example).
Commits, that might fix the issue (we'll know exactly after switching to linux-6.13):
https://github.com/torvalds/linux/commi … dca23c00f4
https://github.com/torvalds/linux/commi … 08a36e5038

That's all I've found so far. Thanks for your attention.

Last edited by Mechanicus (2025-01-27 18:49:23)
