Issues with Mesa 24.3.x and amdgpu Vega graphics

Mechanicus · 2025-02-01 17:28:42

If anyone has some time, please test the linux-ring-recovery build. https://bbs.archlinux.org/viewtopic.php … 1#p2223911

grayich · 2025-02-01 17:45:09

Is it normal that after "cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover" the GPU frequency rises to the maximum (1240 MHz) and does not drop?
GPU load is 99%, according to nvtop
Ryzen 3 3200G, Vega8

NuSkool · 2025-02-01 18:20:44

Mechanicus wrote:

If anyone has some time, please test the linux-ring-recovery build. https://bbs.archlinux.org/viewtopic.php … 1#p2223911

I'm about 20 hours into testing linux-amdgpu 6.13.arch1-2. This setup is working so well, including passing my overnight test, that I'd hate to stop the testing early.

Considering these findings, https://bbs.archlinux.org/viewtopic.php … 6#p2223896 would you prefer I move onto testing 'linux-ring-recovery' or would you rather have the results from longer term linux-amdgpu 6.13.arch1-2 testing?

I've also set the contents of /sys/module/drm/parameters/debug to '15'. Is there any mesa debug logging available (a file or other data location) other than the journal, and would you be interested in it at this point? If it's only in the journal, is there a term I could search/grep for to filter it out?

Yes. 0xf in hexadecimal is 15 decimal.
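
For anyone unsure what a given drm.debug value turns on, here is a minimal sketch that splits the value into its category bits. The bit meanings follow the kernel's DRM debug categories as I understand them; the short labels and the function names are my own, so treat them as assumptions and check the kernel documentation for your version.

```python
# Category bits of the drm.debug module parameter. The labels are shortened,
# illustrative names; verify the bit meanings against your kernel's docs.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def parse_drm_debug(text):
    """Accept the value as written on the command line ("0xf") or in sysfs ("15")."""
    return int(text.strip(), 0)


def decode_drm_debug(value):
    """Return the labels of all category bits set in value, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if value & bit]


if __name__ == "__main__":
    for raw in ("15", "0xf", "0x2"):
        print(raw, "->", decode_drm_debug(parse_drm_debug(raw)))
```

So '15' and 0xf select the same four categories, while 0x2 (used later in the thread) selects only driver messages.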

EDIT:
I've found I can consistently induce a graphical freeze in XFCE by entering the command 'radeontop' in 'Application Finder' ([Alt]+[F2]).
With my current setup, it's recoverable by switching tty, killing 'radeontop' and 'xinit', then rerunning 'startx'.
Since this isn't a complete system freeze with this setup anyway, it may not prove useful in producing a system freeze with different mesa setups.
Thought this was interesting nonetheless and worth passing on.

Last edited by NuSkool (2025-02-01 18:46:53)

flemingfleming · 2025-02-01 18:37:30

Mechanicus wrote:

AMDGPU
Build: linux-ring-recovery-6.13.arch1-1, linux-ring-recovery-headers-6.13.arch1-1
Included patches:
- Extend amdgpu_ring_soft_recovery function with PG control
What to check:
- sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
- idle stability
- workflow stability
- glxgears window resizing
- vkcube window resizing
Kernel option to keep during testing period: fsck.mode=force

I'm able to cause the crash with this build: linux-ring-recovery-6.13.arch1-1.

I did try with drm.debug=0xf, but for some reason I can no longer cause the crash with that set! Sorry about that. If this is the correct documentation, I could try setting it to some other value if you want?

(Ryzen 5 2500U (Raven Ridge))

Mechanicus · 2025-02-01 19:43:58

flemingfleming wrote:

I'm able to cause the crash with this build: linux-ring-recovery-6.13.arch1-1.
I did try with drm.debug=0xf but for some reason if I do that I can no longer cause the crash! So sorry about that. If this is the correct documentation I could try setting it to some other value if you want?
(Ryzen 5 2500U (Raven Ridge))

The documentation link is correct. Well, if enabling logging changes the behavior of the driver, that's another funny story.
I'm trying to find a better place for the powergating state switch: I want it to be triggered only when there is a problem, not on every GPU operation.
Was the crash similar to the one on the unpatched kernel? This build was meant to check whether it fixes the amdgpu_gpu_recovery mechanism. But if it still crashes, it seems the recovery problem is entirely on the firmware side.

Last edited by Mechanicus (2025-02-01 20:27:42)

Mechanicus · 2025-02-01 19:47:33

NuSkool wrote:

Mechanicus wrote:
Who has some time please test linux-ring-recovery build. https://bbs.archlinux.org/viewtopic.php … 1#p2223911
I'm about 20hr into testing linux-amdgpu 6.13.arch1-2. This setup is working so well including passing my overnight test, I hate to stop the testing early.
Considering these findings, https://bbs.archlinux.org/viewtopic.php … 6#p2223896 would you prefer I move onto testing 'linux-ring-recovery' or would you rather have the results from longer term linux-amdgpu 6.13.arch1-2 testing?
I've also set the contents of /sys/module/drm/parameters/debug to '15'. Would there be any mesa debug logging available (location of file or data) other than the journal, and would you be interested in it at this point. If it's only in the journal, could I search/grep for a term that would filter it out?
Yes. 0xf in hexadecimal is 15 decimal.
EDIT:
I've found I can consistently induce a graphical freeze in XFCE by entering the command 'radeontop' in 'Application Finder' ([Alt]+[F2]).
With my current setup, it's recoverable by switching tty, killing 'radeontop' and 'xinit', then rerunning 'startx'.
Since this isn't a complete system freeze with this setup anyway, it may not prove useful in producing a system freeze with different mesa setups.
Thought this was interesting nonetheless and worth passing on.

Continue your test of linux-amdgpu 6.13.arch1-2 please.
Regarding the freeze with radeontop: that's a different story. Since the fixes included in this build, and the fix that is going to be merged into mainline, add extra register accesses, running any application that also tries to read GPU registers could now be tricky...
update: could you post dmesg when radeontop was running?

Last edited by Mechanicus (2025-02-01 20:46:18)

Mechanicus · 2025-02-01 20:14:07

grayich wrote:

Is it normal that after "cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover" the GPU frequency rises to the maximum (1240 MHz) and does not drop?
GPU load is 99%, according to nvtop
Ryzen 3 3200G, Vega8

On what build?

Mechanicus · 2025-02-01 20:43:52

AMDGPU
Build: linux-ring-recovery-6.13.1.arch1-2 - freezes
Included patches:
- Extend amdgpu_ring_soft_recovery function with PG control v2 (now try to recover GFX and Compute rings)

What to check:
- stability

Kernel option to keep during testing period: fsck.mode=force

Last edited by Mechanicus (2025-02-01 22:34:09)

grayich · 2025-02-01 21:38:06

Mechanicus wrote:

On what build?

Any

6.12.10-arch1-1
mesa 24.3.4
mesa-test-git 25.0.0_devel.200908.66775c89fce-1

flemingfleming · 2025-02-01 21:50:40

Mechanicus wrote:

Was the crash similar to unpatched kernel? This build was aimed to check if it fixes amdgpu_gpu_recovery mechanism.

Sorry for the confusion. I was testing and reporting the "original" freeze that happens randomly; it makes sense if this build isn't meant to fix that (so far only linux-amdgpu 6.13.arch1-2 seems to fix it).

All of the kernels I have tried so far have crashed on trying to read /sys/kernel/debug/dri/1/amdgpu_gpu_recover, regardless of logging output.

Mechanicus wrote:

Well, if enabling logging changes the behavior of the driver, that's another funny story.

It's probably just affecting the way I was triggering it. When I enable all the logging, performance goes down massively (from about 60 to 30 fps when playing videos and similar).

With your latest linux-ring-recovery-6.13.1.arch1-2, I triggered the second type of crash (the recovery crash) by reading /sys/kernel/debug/dri/1/amdgpu_gpu_recover with the kernel parameter drm.debug=0x2, and got some debug output, which I've recorded here:

https://gist.github.com/fleming-2/dc963 … tcrash-log

Mechanicus · 2025-02-01 22:03:09

Thank you, @flemingfleming!

kode54 · 2025-02-02 03:45:22

I tested linux-ring-recovery-6.13.arch1-1 with drm.debug=0x2, enabled by writing to the parameter after boot. I tested amdgpu_gpu_recovery: it managed to bring my displays back up almost immediately, but CRTC flipping hung, and the GPU never recovered. I generated a dmesg log:

https://gist.github.com/kode54/ddc47a5c … dacf482888

And from this excerpt, you can see a message being spammed from the moment I turned on drm.debug=0x2:

[   92.165875] amdgpu 0000:03:00.0: [drm:amdgpu_dm_crtc_vblank_control_worker [amdgpu]] dc_allow_idle_optimizations_internal: enabled
[   92.239661] amdgpu 0000:03:00.0: [drm:amdgpu_dm_crtc_vblank_control_worker [amdgpu]] dc_allow_idle_optimizations_internal: disabled
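
To see which functions are doing the spamming in a long log, something like the following could help. It is only a sketch: the regular expression is fitted to the two lines above, so it assumes the default dmesg timestamp format and the "[drm:function [module]]" prefix shown there, and the function name count_drm_sources is just an illustrative choice.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above:
# [   92.165875] amdgpu 0000:03:00.0: [drm:some_function [amdgpu]] message
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"           # kernel timestamp in seconds
    r"(?P<dev>\S+ \S+): "                     # driver name and PCI address
    r"\[drm:(?P<func>\w+)(?: \[\w+\])?\] "    # emitting function, optional module
    r"(?P<msg>.*)$"                           # the message itself
)


def count_drm_sources(lines):
    """Count drm debug messages per emitting function; other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("func")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[   92.165875] amdgpu 0000:03:00.0: [drm:amdgpu_dm_crtc_vblank_control_worker [amdgpu]] dc_allow_idle_optimizations_internal: enabled",
        "[   92.239661] amdgpu 0000:03:00.0: [drm:amdgpu_dm_crtc_vblank_control_worker [amdgpu]] dc_allow_idle_optimizations_internal: disabled",
    ]
    for func, n in count_drm_sources(sample).most_common():
        print(n, func)
```

Feeding it the lines of a saved dmesg file instead of the sample list gives a quick ranking of the noisiest sources.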

NotAnArchUser · 2025-02-02 04:53:18

Mechanicus wrote:

NotAnArchUser wrote:
Mechanicus, I'm now testing patches proposed by Alex Deucher on GitLab. So I suppose my kernel is now equivalent to your linux-amdgpu 6.13.arch1-2 to some degree?
Correct. The differences are that my changes affect all chips and put the GPU into power-efficient mode faster.

~~Unfortunately I've got another freeze using mpv last night~~ (freeze may be unrelated to bug)

Last edited by NotAnArchUser (2025-02-02 11:17:17)

kode54 · 2025-02-02 07:35:42

6.13.arch1-2 holding steady for me. This includes gaming and running mpv HEVC decoding using gpu-next output.

lpr1 · 2025-02-02 08:26:54

Mechanicus wrote:

grayich wrote:
Is it normal that after "cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover" the GPU frequency rises to the maximum (1240 MHz) and does not drop?
GPU load is 99%, according to nvtop
Ryzen 3 3200G, Vega8
On what build?

The same thing happens to me as well (also 6.12.10-arch1-1 and mesa from the repo). Interestingly, according to radeontop only the graphics pipe is 100% utilized; all the rest sit at 0% or fluctuate. 3400G.

Graphics pipe 100,00%
Event Engine 0,00%
Vertex Grouper + Tesselator 0,00%
Texture Addresser 0,00%
Shader Export 0,00%
Sequencer Instruction Cache 0,00%
Shader Interpolator 0,00%
Scan Converter 0,00%
Primitive Assembly 0,00%
Depth Block 0,00%
Color Block 0,00%
52M / 40M VRAM 129,43%
516M / 15950M GTT 3,24%
1,33G / 1,33G Memory Clock 100,00%
1,40G / 1,40G Shader Clock 100,00%

Confirmed with:

watch -n1 sudo 'cat /sys/kernel/debug/dri/1/amdgpu_pm_info | grep "MHz"'
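
If someone wants to compare such dumps between builds, here is a rough sketch that turns text like the radeontop listing above into numbers. It assumes exactly the layout pasted in this post (a label followed by a percentage, with a comma as the decimal separator in my locale); the function name is hypothetical and this is not a parser for any official radeontop output format.

```python
import re

# A label, whitespace, then a percentage with "," or "." as decimal separator.
ROW_RE = re.compile(r"^(?P<label>.+?)\s+(?P<pct>\d+(?:[.,]\d+)?)%$")


def parse_utilization(text):
    """Map each row label to its percentage as a float; unmatched lines are skipped."""
    result = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            result[match.group("label")] = float(match.group("pct").replace(",", "."))
    return result


if __name__ == "__main__":
    dump = """Graphics pipe 100,00%
Event Engine 0,00%
52M / 40M VRAM 129,43%
516M / 15950M GTT 3,24%
1,40G / 1,40G Shader Clock 100,00%"""
    stats = parse_utilization(dump)
    # Rows pinned at (or above) 99% are the ones worth reporting.
    print([label for label, pct in stats.items() if pct >= 99.0])
```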

Last edited by lpr1 (2025-02-02 08:30:48)

kode54 · 2025-02-02 08:35:02

As an aside, also noting once again that amdgpu_top exists.

NuSkool · 2025-02-02 08:47:25

Report: @ ~35 hours on the following setup:

pacman -Q linux-amdgpu : linux-amdgpu 6.13.arch1-2
uname -rs : Linux 6.13.0-arch1-2-amdgpu
pacman -Q mesa : mesa 1:24.3.4-1

Kernel parameters. (nothing added to normal config)
cat /proc/cmdline : .............. rw loglevel=3 sysrq_always_enabled=1 amd_pstate=passive fsck.mode=force

This test kernel seems pretty robust. Have not been able to induce system freeze.

Regarding the freeze mentioned previously https://bbs.archlinux.org/viewtopic.php … 8#p2223938
It seems to freeze only the XFCE session, and is caused by trying to start the terminal application 'radeontop' in the XFCE "Application Finder", which is intended for apps listed in the menu. Entering other terminal apps results in nothing, as if it ignores the commands altogether. During a freeze, I can switch tty, log in as a different user and start a new xfce session via startx. It's only the 'radeontop' application that causes this freeze, and there's nothing logged to journal/dmesg. I also believe it's unrelated to mesa or the kernel.

Let me know if I can provide any additional info and/or switch testing to something different.

Last edited by NuSkool (2025-02-02 08:57:10)

Mechanicus · 2025-02-02 09:19:16

NotAnArchUser wrote:

Mechanicus wrote:
NotAnArchUser wrote:
Mechanicus, I'm now testing patches proposed by Alex Deucher on GitLab. So I suppose my kernel is now equivalent to your linux-amdgpu 6.13.arch1-2 to some degree?
Correct. The differences are that my changes affect all chips and put the GPU into power-efficient mode faster.
Unfortunately I've got another freeze using mpv last night

That's why I didn't support the proposed "fix". And that's why we are here and doing experiments.

Mechanicus · 2025-02-02 09:22:07

NuSkool wrote:

Report: @ ~35 hours on the following setup:
pacman -Q linux-amdgpu : linux-amdgpu 6.13.arch1-2
uname -rs : Linux 6.13.0-arch1-2-amdgpu
pacman -Q mesa : mesa 1:24.3.4-1
Kernel parameters. (nothing added to normal config)
cat /proc/cmdline : .............. rw loglevel=3 sysrq_always_enabled=1 amd_pstate=passive fsck.mode=force

This test kernel seems pretty robust. Have not been able to induce system freeze.
Regarding the freeze mentioned previously https://bbs.archlinux.org/viewtopic.php … 8#p2223938
It seems to freeze only the XFCE session, and is caused by trying to start the terminal application 'radeontop' in the XFCE "Application Finder", which is intended for apps listed in the menu. Entering other terminal apps results in nothing, as if it ignores the commands altogether. During a freeze, I can switch tty, log in as a different user and start a new xfce session via startx. It's only the 'radeontop' application that causes this freeze, and there's nothing logged to journal/dmesg. I also believe it's unrelated to mesa or the kernel.
Let me know if I can provide any additional info and/or switch testing to something different.

"there's nothing logged to journal/dmesg" - that's weird. I was expecting to see warnings about register read timeouts.

pacoandres · 2025-02-02 09:32:45

Sorry, but I'm getting lost in this thread.
By now I'm not sure whether I'm testing the same linux-amdgpu 6.13.arch1-2 as @NuSkool, or which version other people are testing. I've also lost track of which Mesa versions are being tested now.

If you think it could help, I've created a GitHub project, meant to be used only for its issue tracker, so we can use one issue per test and have all the results gathered together. I think this could make the developers' and testers' job easier, and anyone who wants to join the search for a solution would have a place to start without reading this ever-growing thread.

This is the link to the project: https://github.com/pacoandres/laikm.
Anyone who wants permissions on it, just ask me. Any suggestions or criticisms are welcome.

By now I think I'm testing the same configuration as @NuSkool, with no issues.

NotAnArchUser · 2025-02-02 10:21:18

Mechanicus wrote:

That's why I didn't support the proposed "fix". And that's why we are here and doing experiments.

Do you think I should try your kernel? If so, can you share it as patches please?

Besides, I've realized I may be causing some confusion here, because I still experience freezes even with mesa 24.2.8

Last edited by NotAnArchUser (2025-02-02 10:26:17)

kode54 · 2025-02-02 10:38:10

Mechanicus wrote:

linux-amdgpu 6.13.arch1-2
What to check:
- idle stability
- workflow stability
- glxgears window resizing
- vkcube window resizing
What to report:
- stability
- performance
- temperatures
- power consumption (if available)
- dmesg (if possible)
Kernel option to keep during testing period: fsck.mode=force

- idle stability: OK
- workflow stability: OK
- glxgears window resizing: OK
- vkcube window resizing: OK both X11 and -wayland variants

- Also played a 4k HEVC movie for about 20 minutes to test decoder stability

- stability: Stable for all intents and purposes
- performance: Seems to be fine
- temperatures: chart here
- power consumption: not available to Beszel for some reason
- dmesg: gist

Kernel commandline: (/proc/cmdline)

splash quiet root=UUID=23028a63-b140-422f-a6d6-9fc61a90db87 rw rootfstype=btrfs rootflags=rw,relatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=256,subvol=/@ lsm=landlock,lockdown,yama,integrity,apparmor,bpf usbcore.autosuspend=-1 fsck.mode=force
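
Since everyone is asked to keep fsck.mode=force during testing, a quick way to check a pasted command line is sketched below. It splits on whitespace only, which is an assumption that holds for the line above but would break on quoted values containing spaces; parse_cmdline is just an illustrative name.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Only the first "=" separates name from value, so values such as
    "UUID=..." or "compress=zstd:3,..." are kept intact.
    Quoted values containing spaces are not handled by this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    line = (
        "splash quiet root=UUID=23028a63-b140-422f-a6d6-9fc61a90db87 rw "
        "rootfstype=btrfs usbcore.autosuspend=-1 fsck.mode=force"
    )
    params = parse_cmdline(line)
    print("fsck.mode=force present:", params.get("fsck.mode") == "force")
    print("drm.debug:", params.get("drm.debug", "not set"))
```

On a running system the input would be the contents of /proc/cmdline.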

Mechanicus · 2025-02-02 10:44:56

NotAnArchUser wrote:

Mechanicus wrote:
That's why I didn't support the proposed "fix". And that's why we are here and doing experiments.
Do you think I should try your kernel? If so, can you share it as patches please?
Besides, I've realized I may be causing some confusion here, because I still experience freezes even with mesa 24.2.8

I keep the patches updated on my Linux branch. You can also download them from here: https://drive.google.com/drive/folders/ … drive_link
These patches are included in linux-amdgpu 6.13.arch1-2, the most stable one.

Last edited by Mechanicus (2025-02-02 10:55:41)

kode54 · 2025-02-02 11:06:54

I'm building a new kernel from linux-zen from the Arch repos, with the only change being your latest three patches applied. Let me know if there's anything different I should test.

Mechanicus · 2025-02-02 12:29:46

AMDGPU testing
Build: linux-amdgpu-6.13.1.arch1-5 - freezes.
Included patches:
- Disallow GFXOFF during amdgpu_ring_commit in amdgpu_ib_schedule

Build: linux-amdgpu-testing-6.13.1.arch1-1 - freezes.
Included patches:
- Deny GFXOFF for amdgpu_job_run

Last edited by Mechanicus (2025-02-02 15:33:58)
