Issues with Mesa 24.3.x and amdgpu Vega graphics

bernd_b · 2025-01-13 14:53:14

I compiled both packages on a 4-core CPU this afternoon. Yes, it takes quite a few minutes, but not hours.

mesa-minimal-git from the AUR should be a time-saver as well.

Last edited by bernd_b (2025-01-13 14:56:12)

orbit-oc · 2025-01-13 20:24:06

Progress
The mesa ticket from Dacia Mountable (13.01.2025) seems to me to be the best description of our problem so far, and it got a response on the same day.
The only thing missing is a clear mention of the affected hardware: integrated Vega graphics (gfx9). The connection to the ticket from timber22 and kclisp has been established, though.

Now linked tickets in chronological order:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12310
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12433 - closed
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12457

@bernd_b

bernd_b wrote:

I also managed to compile mesa-git from the arch-repo which I use at the moment.

To emphasize: You're still using mesa-git 25.0.0 and it hasn't crashed yet? If so, that's important information (with a request for confirmation).
Where did you get the PKGBUILD for it (link)? I can only find the AUR.
@Lone_Wolf wants to create a new binary package in the coming days.

I am now expecting mesa 24.3.4 first...

Mechanicus · 2025-01-13 22:43:42

Hi guys! Have you tried disabling GPU recovery with the current mesa, as described in AMD hang debugging?
I applied

amdgpu.gpu_recovery=0

to the kernel parameters and there are no more freezes when running Google Chrome.
I'm using current mesa 1:24.3.3-2, llvm-libs 19.1.6-3, linux 6.12.9.arch1-1. The kernel parameter also fixed the freeze on the previous mesa with llvm-libs 18.
The system is based on an Athlon 200GE with Vega 3.
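
For anyone wanting to double-check that the parameter actually took effect after a reboot, here is a minimal sketch. It assumes the parameter is passed on the kernel command line (readable from /proc/cmdline) and that the amdgpu module exposes the active value under /sys/module/amdgpu/parameters/gpu_recovery. The function name gpu_recovery_from_cmdline is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print the amdgpu.gpu_recovery value found in a
# kernel command line string, or "unset" if the parameter is absent.
# If the parameter appears several times, the last occurrence is printed.
gpu_recovery_from_cmdline() (
    # $1: kernel command line text (e.g. the contents of /proc/cmdline)
    set -f                      # no glob expansion while splitting words
    value=unset
    for word in $1; do
        case "$word" in
            amdgpu.gpu_recovery=*) value=${word#amdgpu.gpu_recovery=} ;;
        esac
    done
    printf '%s\n' "$value"
)

# Usage on a real system:
#   gpu_recovery_from_cmdline "$(cat /proc/cmdline)"
#   cat /sys/module/amdgpu/parameters/gpu_recovery   # assumed sysfs path
```

If the first command prints "unset", the boot loader configuration probably was not regenerated after editing it.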

Last edited by Mechanicus (2025-01-13 22:53:33)

nek0panchi · 2025-01-13 23:24:43

Mechanicus wrote:

Hi guys! Have you tried disabling GPU recovery with the current mesa, as described in AMD hang debugging?
I applied
amdgpu.gpu_recovery=0

I just tried it and it still froze while using Gimp. There is nothing in the logs except

kernel: amdgpu 0000:05:00.0: amdgpu: Dumping IP State

I'm on an AMD Ryzen 5 3400G with Radeon Vega Graphics, with mesa 24.3.3-2 and GNOME (Wayland).
I will wait for 24.3.4.

lpr1 · 2025-01-14 04:16:01

What is certain is that the error is unrelated to X11/Wayland; it happens at random with Gimp, Blender, Firefox, etc.

If it's of any help, it's probably related to the issue that randomly occurs in Firefox on YouTube pages, where the whole page UI becomes unresponsive. It doesn't always crash the system, and it goes away after Firefox is restarted (sometimes it appears to be gone even after just replacing tabs). Seen on 3200G and 3400G with different boards and memory, under X or Wayland.

Last edited by lpr1 (2025-01-14 04:26:55)

Lone_Wolf · 2025-01-14 10:25:23

I've built 2 new binaries, one for latest trunk with the patch applied and another for version 24.2.8.
Both are built against repo llvm 19.1.6.

mesa trunk commit 94da1edbe49

mesa 24.2.8

@kclisp: I made a mistake in applying the upstream patch for the previous trunk binary, please re-test with the new binary.

To those trying mesa trunk builds themselves:

mesa-git with repo llvm (default choice) should build in minutes. mesa-minimal-git itself is a bit faster but requires llvm-minimal-git.
llvm-minimal-git takes ~50 minutes on my 12C/24T desktop; it could take several hours on lower-specced systems.

orbit-oc · 2025-01-14 11:07:08

@nek0panchi
Could Gimp be a useful way to reproduce the crash? Does the system freeze immediately, or after an unpredictable amount of time? Can the crash be triggered faster if the image loaded into Gimp is large enough?

Sjoerd fc wrote:

My system has an AMD Ryzen 5 3400G with Radeon Vega Graphics, and I'm using openSuse Tumbleweed. With some applications my system instantly freezes, such as Gimp or KiCAD.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12457

Maybe someone can have a look at the test by lixo 6504 from the same source (mpv, 2 videos).

---

Pierre-Eric Pelloux-Prayer (mesa Developer) wrote:

FWIW I'm testing the reproduction steps from @kclisp:
kclisp wrote:
I use i3; I start two terminals in one workspace, and hold Super + Shift + Space on one of them, which rapidly alternates it between a floating window and a tiling window, producing a lot of graphical window resizing. This can produce a freeze on broken builds in a few seconds.
And I can't get a freeze on a 2700u, using a Mesa build at the problematic commit. I'll try updating my kernel in case the failure requires at 6.12+.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12310

compare here:
https://bbs.archlinux.org/viewtopic.php … 2#p2219232
I'm not saying that the error doesn't exist...

---

Many thanks to @Lone_Wolf for building 2 new binaries...

Mechanicus · 2025-01-14 12:46:04

nek0panchi wrote:

I just tried it and it still froze while using Gimp. There is nothing in the logs except
kernel: amdgpu 0000:05:00.0: amdgpu: Dumping IP State
I'm on an AMD Ryzen 5 3400G with Radeon Vega Graphics, with mesa 24.3.3-2 and GNOME (Wayland).
I will wait for 24.3.4.

Did you clear the whole mesa cache before the test? That's an important aspect.
I recommend cleaning your system properly and testing again. You can use my program for this: Advanced linux system cleaning

Mechanicus · 2025-01-14 17:58:01

Maybe this information is helpful:
On a Raven Ridge system (Athlon 200GE), the system freeze can be triggered by reading this file:

/sys/kernel/debug/dri/N/amdgpu_gpu_recover

Note: check which N is available on your system with ls first.
In my case the returned value was -110 and the reboot command didn't work (within 2 minutes at least).

Summary:
1) Default kernel parameters. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover freezes the system almost completely (sound, IO and video do not respond; the ssh session responds slowly). journalctl output (before the hard reset):

Jan 14 20:14:10 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Jan 14 20:14:10 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 14 20:14:10 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: PSP is resuming...
Jan 14 20:14:10 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: reserve 0x400000 from 0xf47fc00000 for PSP TMR
Jan 14 20:14:10 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jan 14 20:14:10 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jan 14 20:14:10 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jan 14 20:14:11 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Jan 14 20:14:11 amd-a320mhdvr40 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Jan 14 20:14:11 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(1) failed
Jan 14 20:14:11 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
Jan 14 20:14:21 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: Dumping IP State
Jan 14 20:14:21 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: Dumping IP State Completed
Jan 14 20:14:21 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: ring sdma0 timeout, signaled seq=4512, emitted seq=4514
Jan 14 20:14:21 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!

2) amdgpu.gpu_recovery=0 added to the kernel parameters. Executing sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover freezes the system partially (sound continues playing, IO and video do not work, the ssh session responds normally). journalctl output (before the hard reset):

Jan 14 19:41:38 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Jan 14 19:41:38 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 14 19:41:38 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: PSP is resuming...
Jan 14 19:41:38 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: reserve 0x400000 from 0xf47fc00000 for PSP TMR
Jan 14 19:41:38 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jan 14 19:41:38 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jan 14 19:41:38 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jan 14 19:41:39 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Jan 14 19:41:39 amd-a320mhdvr40 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Jan 14 19:41:39 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(1) failed
Jan 14 19:41:39 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
Jan 14 19:41:49 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: Dumping IP State
Jan 14 19:41:49 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: Dumping IP State Completed
Jan 14 19:41:49 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: ring sdma0 timeout, signaled seq=48781, emitted seq=48783
Jan 14 19:41:49 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU recovery disabled.
Jan 14 19:41:59 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: Dumping IP State
Jan 14 19:41:59 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: Dumping IP State Completed
Jan 14 19:41:59 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: ring sdma0 timeout, signaled seq=48781, emitted seq=48783
Jan 14 19:41:59 amd-a320mhdvr40 kernel: amdgpu 0000:07:00.0: amdgpu: GPU recovery disabled.

dmesg output for amdgpu for this machine:

[   10.291188] [drm] amdgpu kernel modesetting enabled.
[   10.291364] amdgpu: Virtual CRAT table created for CPU
[   10.291384] amdgpu: Topology: Add CPU node
[   10.291580] amdgpu 0000:07:00.0: enabling device (0006 -> 0007)
[   10.294220] amdgpu 0000:07:00.0: amdgpu: Fetched VBIOS from VFCT
[   10.294226] amdgpu: ATOM BIOS: 113-RAVEN-117
[   10.374857] amdgpu 0000:07:00.0: vgaarb: deactivate vga console
[   10.374871] amdgpu 0000:07:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
[   10.374958] amdgpu 0000:07:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[   10.374965] amdgpu 0000:07:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
[   10.375280] [drm] amdgpu: 2048M of VRAM memory ready
[   10.375285] [drm] amdgpu: 6948M of GTT memory ready.
[   10.376644] amdgpu: hwmgr_sw_init smu backed is smu10_smu
[   10.398973] amdgpu 0000:07:00.0: amdgpu: reserve 0x400000 from 0xf47fc00000 for PSP TMR
[   10.465267] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[   10.472009] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   10.472020] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[   10.480717] snd_hda_intel 0000:07:00.1: bound 0000:07:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[   10.590314] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   10.590339] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[   10.590506] amdgpu: Virtual CRAT table created for GPU
[   10.590585] amdgpu: Topology: Add dGPU node [0x15dd:0x1002]
[   10.590590] kfd kfd: amdgpu: added device 1002:15dd
[   10.590615] amdgpu 0000:07:00.0: amdgpu: SE 1, SH per SE 1, CU per SH 11, active_cu_number 3
[   10.590622] amdgpu 0000:07:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   10.590627] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   10.590630] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   10.590634] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[   10.590638] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[   10.590642] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[   10.590645] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[   10.590649] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[   10.590652] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[   10.590656] amdgpu 0000:07:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[   10.590659] amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
[   10.590662] amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 8
[   10.590666] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 8
[   10.590669] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 8
[   10.590672] amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 8
[   10.594230] amdgpu: pp_dpm_get_sclk_od was not implemented.
[   10.594236] amdgpu: pp_dpm_get_mclk_od was not implemented.
[   10.594433] amdgpu 0000:07:00.0: amdgpu: Runtime PM not available
[   10.595044] [drm] Initialized amdgpu 3.59.0 for 0000:07:00.0 on minor 1
[   10.601569] fbcon: amdgpudrmfb (fb0) is primary device
[   10.601582] amdgpu 0000:07:00.0: [drm] fb0: amdgpudrmfb frame buffer device

Last edited by Mechanicus (2025-01-14 20:08:45)

orbit-oc · 2025-01-14 19:39:26

Mechanicus wrote:

On a Raven Ridge system (Athlon 200GE), the system freeze can be triggered by reading this file:
/sys/kernel/debug/dri/N/amdgpu_gpu_recover
In my case the returned value was -110 and the reboot command didn't work (within 2 minutes at least).

Not /N/ but /1/. This could perhaps be an approach. Apparently, reading the file triggers a GPU reset.

$ sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
--> Result: 0

$ journalctl -b | grep amdgpu
Jan 14 19:34:50 ... kernel: amdgpu: ATOM BIOS: 113-PICASSO-118
--> first call
Jan 14 19:46:01 ... kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset begin!
Jan 14 19:46:02 ... kernel: amdgpu 0000:26:00.0: amdgpu: MODE2 reset
Jan 14 19:46:02 ... kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset succeeded, trying to resume
...
Jan 14 19:46:02 ... kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset(1) succeeded!
--> second call
Jan 14 19:47:32 ... kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset begin!
Jan 14 19:47:32 ... kernel: amdgpu 0000:26:00.0: amdgpu: MODE2 reset
Jan 14 19:47:32 ... kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset succeeded, trying to resume
...
Jan 14 19:47:32 ... kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset(2) succeeded!

With mesa 1:24.2.7-1 the returned value is 0. The reset of the gpu can be seen visually. According to your test, it seems that the system freeze can actually be initiated in this way.
As soon as mesa 1:24.3.4-1 is installed here, I will run this test again to confirm your result.

Thanks for the contribution @Mechanicus. Maybe this will lead to something...
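
To compare results like the ones above between machines, the reset outcome could be pulled out of the kernel log with a small filter. This is only a sketch based on the message formats quoted in this thread ("GPU reset(N) succeeded!" and "GPU reset(N) failed"); the function name summarize_gpu_resets is hypothetical, and other kernel versions may word these messages differently. For reference, -110 is the negated Linux errno ETIMEDOUT.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print one line
# per finished GPU reset, classified as "ok" or "failed".
summarize_gpu_resets() {
    # Keep only the final verdict lines, then shorten them.
    grep -E 'amdgpu: GPU reset\([0-9]+\) (succeeded!|failed)' |
        sed -E \
            -e 's/.*GPU reset\(([0-9]+)\) succeeded!.*/reset \1: ok/' \
            -e 's/.*GPU reset\(([0-9]+)\) failed.*/reset \1: failed/'
}

# Usage on a real system (kernel messages of the current boot):
#   journalctl -k -b | summarize_gpu_resets
```

On the Picasso log above this should report the resets as ok, and on the Raven Ridge log as failed, which makes it easier to paste a compact result into a ticket.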

Mechanicus · 2025-01-14 20:06:30

orbit-oc wrote:

Not /N/ but /1/. This could perhaps be an approach. Apparently, reading the file triggers a GPU reset.

I'm glad to help.
I think N can depend on the multi-GPU configuration, so I just wrote it generically. In my case N=1 because only the integrated graphics is used.
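
Since N differs between setups, the right directory could be looked up instead of guessed. The sketch below assumes debugfs is mounted at /sys/kernel/debug and simply tests which dri/N directories contain an amdgpu_gpu_recover entry; find_amdgpu_recover_nodes is a made-up name. It only checks that the entry exists and never reads it, so by itself it should not trigger a reset.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every directory below the DRI
# debugfs root that contains an amdgpu_gpu_recover entry.
find_amdgpu_recover_nodes() {
    # $1: DRI debugfs directory (defaults to /sys/kernel/debug/dri)
    base=${1:-/sys/kernel/debug/dri}
    for dir in "$base"/*/; do
        # -e only checks existence; the file content is never read here.
        if [ -e "${dir}amdgpu_gpu_recover" ]; then
            basename "$dir"
        fi
    done
}

# Usage: run as root, since debugfs is normally readable by root only:
#   find_amdgpu_recover_nodes
```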

Last edited by Mechanicus (2025-01-14 20:15:56)

Mechanicus · 2025-01-14 20:14:41

Please clear the mesa cache before any test or version change. This is important, because the shader cache is also a source of problems.
You can do it manually or with this tool: Advanced Linux System Cleaning

Last edited by Mechanicus (2025-01-15 10:18:38)

NuSkool · 2025-01-14 20:19:35

I've held off on updating the system to the new llvm-libs and reverted to the following mesa for this output.

$ pacman -Q mesa llvm-libs
mesa 1:24.2.7-1
llvm-libs 18.1.8-5

Ryzen 5 PRO 2400GE with Vega 11 Graphics:

I can induce a system freeze from a fresh reboot in the console with:

$ sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover

Beginning of journal:

Jan 14 11:15:23 Arch2024p8 systemd[1]: getty@tty1.service: Scheduled restart job, restart counter is at 1.
Jan 14 11:15:23 Arch2024p8 systemd[1]: Started Getty on tty1.
Jan 14 11:15:23 Arch2024p8 systemd-logind[671]: Removed session 1.
Jan 14 11:15:29 Arch2024p8 login[2086]: pam_unix(login:session): session opened for user jeff(uid=1000) by jeff(uid=0)
Jan 14 11:15:29 Arch2024p8 systemd-logind[671]: New session 3 of user jeff.
Jan 14 11:15:29 Arch2024p8 systemd[1]: Started Session 3 of User jeff.
Jan 14 11:15:29 Arch2024p8 login[2086]: LOGIN ON tty1 BY jeff
Jan 14 11:15:40 Arch2024p8 sudo[2118]: jeff : TTY=tty1 ; PWD=/home/jeff ; USER=root ; COMMAND=/usr/bin/cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
Jan 14 11:15:40 Arch2024p8 sudo[2118]: pam_unix(sudo:session): session opened for user root(uid=0) by jeff(uid=1000)
Jan 14 11:15:40 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
Jan 14 11:15:40 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 14 11:15:40 Arch2024p8 kernel: [drm] PCIE GART of 1024M enabled.
Jan 14 11:15:40 Arch2024p8 kernel: [drm] PTB located at 0x000000F400A00000
Jan 14 11:15:40 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: PSP is resuming...
Jan 14 11:15:41 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: reserve 0x400000 from 0xf43fc00000 for PSP TMR
Jan 14 11:15:41 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jan 14 11:15:41 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jan 14 11:15:41 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Jan 14 11:15:41 Arch2024p8 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Jan 14 11:15:41 Arch2024p8 kernel: amdgpu 0000:0a:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32770)
....

And the complete relevant journal contents: https://0x0.st/8oTC.txt

This seems like abnormal behavior to me, considering the version of mesa...

EDIT:

Seems we cross posted....

You can do it manually

Would this involve removing the contents of ~/.cache/mesa* directories or is there more to it?
Nothing personal but I don't want to run a script with dozens of 'rm' commands before I thoroughly go through it.
We may have very different ideas on what cleaning is...

Note the already empty directory upon checking for contents.

$ ls -la ~/.cache/mesa_shader_cache/
total 8.0K
drwx------  2 jeff jeff 4.0K Sep  5 10:40 .
drwxr-xr-x 29 jeff jeff 4.0K Dec 22 18:04 ..

However, '~/.cache/mesa_shader_cache_db/' has plenty in it.
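
For a cautious manual check before deleting anything, the cache directories could first be listed with their sizes. The following is a minimal sketch, assuming the caches live in directories named mesa_shader_cache* below $XDG_CACHE_HOME or ~/.cache, as in the listing above; list_mesa_caches is a hypothetical name, and a custom cache location (for example one set through an environment variable) would not be found by it.

```shell
#!/bin/sh
# Hypothetical helper: show the size of every mesa_shader_cache* directory
# below a cache root. Nothing is deleted.
list_mesa_caches() {
    # $1: cache root (defaults to $XDG_CACHE_HOME, else ~/.cache)
    root=${1:-${XDG_CACHE_HOME:-$HOME/.cache}}
    for dir in "$root"/mesa_shader_cache*; do
        # Skip the unexpanded pattern and anything that is not a directory.
        [ -d "$dir" ] && du -sh "$dir"
    done
    return 0
}

# Usage:
#   list_mesa_caches
```

Once the listed directories look right, they can be removed by hand, which keeps the cleanup limited to exactly what was reviewed.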

Last edited by NuSkool (2025-01-14 20:40:38)

Mechanicus · 2025-01-14 21:02:12

NuSkool wrote:

Would this involve removing the contents of ~/.cache/mesa* directories or is there more to it?
Nothing personal but I don't want to run a script with dozens of 'rm' commands before I thoroughly go through it.
We may have very different ideas on what cleaning is...
Note the already empty directory upon checking for contents.
$ ls -la ~/.cache/mesa_shader_cache/
total 8.0K
drwx------  2 jeff jeff 4.0K Sep  5 10:40 .
drwxr-xr-x 29 jeff jeff 4.0K Dec 22 18:04 ..
However, '~/.cache/mesa_shader_cache_db/' has plenty in it.

Yes, the script cleans up ~/.cache/, /tmp/ and removes shader caches from supported web browsers and Electron-based applications. The latter like to store it inside user configuration directories.

Last edited by Mechanicus (2025-01-14 21:09:08)

lpr1 · 2025-01-15 00:43:15

orbit-oc wrote:

With mesa 1:24.2.7-1 the returned value is 0. The reset of the gpu can be seen visually. According to your test, it seems that the system freeze can actually be initiated in this way.
As soon as mesa 1:24.3.4-1 is installed here, I will run this test again to confirm your result.
Thanks for the contribution @Mechanicus. Maybe this will lead to something...

With mesa 1:24.3.3-2 I can't reproduce the freeze this way on a 3400G; however, the GPU reset can indeed be seen visually, and the return value is 0. At least we got something, thanks to Mechanicus.
The random freezes are still present in 1:24.3.3-2.

Last edited by lpr1 (2025-01-15 00:44:09)

orbit-oc · 2025-01-15 01:20:19

@lpr1 @Mechanicus

Update to mesa 1:24.3.3-2, llvm-libs 19.1.6-3, linux-firmware 20250109.7673dffd-1, amd-ucode 20250109.7673dffd-1

confirmation to @lpr1:
$ sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
--> returned value = 0
--> no crash/freeze

lpr1 · 2025-01-15 05:58:09

orbit-oc wrote:

@lpr1 @Mechanicus
Update to mesa 1:24.3.3-2, llvm-libs 19.1.6-3, linux-firmware 20250109.7673dffd-1, amd-ucode 20250109.7673dffd-1
confirmation to @lpr1:
$ sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
--> returned value = 0
--> no crash/freeze

All packages are updated, now we wait for the freeze I guess.

EDIT: Still freezing, Picasso.

Last edited by lpr1 (2025-01-15 10:56:07)

Mechanicus · 2025-01-15 10:02:15

orbit-oc wrote:

@lpr1 @Mechanicus
Update to mesa 1:24.3.3-2, llvm-libs 19.1.6-3, linux-firmware 20250109.7673dffd-1, amd-ucode 20250109.7673dffd-1
confirmation to @lpr1:
$ sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
--> returned value = 0
--> no crash/freeze

On Raven Ridge the new firmware makes no difference --> the freeze still occurs.

Last edited by Mechanicus (2025-01-15 10:26:10)

Mechanicus · 2025-01-15 19:07:26

lpr1 wrote:

orbit-oc wrote:
@lpr1 @Mechanicus
Update to mesa 1:24.3.3-2, llvm-libs 19.1.6-3, linux-firmware 20250109.7673dffd-1, amd-ucode 20250109.7673dffd-1
confirmation to @lpr1:
$ sudo cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
--> returned value = 0
--> no crash/freeze
All packages are updated, now we wait for the freeze I guess.
EDIT: Still freezing, Picasso.

Can you post the dmesg output for a successful amdgpu_gpu_recover? I want to check whether it contains a line similar to this:

[  127.100875] amdgpu 0000:07:00.0: [drm] ERROR [CRTC:67:crtc-0] flip_done timed out

Last edited by Mechanicus (2025-01-15 19:08:01)

orbit-oc · 2025-01-15 20:17:00

Slowly, slowly, @Mechanicus.
I'm glad that Pierre-Eric Pelloux-Prayer seems to have correctly identified the problem.
We need to work it out properly here before the mesa ticket *gets stuffed*.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/12457

I just can't reproduce the problem and the developer seems to have identified *this* problem correctly as well.

Mesa 24.3.3-2 continues to crash here in a way I can't reproduce on demand, and I'm now waiting for 24.3.4. There can always be cross-connections; 25.0.0 is also running (see @bernd_b).
After 24.3.4 I will probably go to 24.2.8 from @Lone_Wolf.
If 25.0.0 is still broken after its release in February, I might temporarily switch to a Debian-based distribution.

The bug here will cause an avalanche if it can't be found soon.

Mechanicus · 2025-01-15 21:46:08

@orbit-oc could you set the environment variable MESA_SHADER_CACHE_DISABLE=true on your system, clean the ~/.cache directory after a reboot and then do what you usually do?
I just checked all commits from mesa 24.2.7 to 24.3.3 and found several interesting ones related to caching. If there is no crash when the shader cache is disabled, then one of those commits introduced a double-free issue.

Last edited by Mechanicus (2025-01-15 21:49:58)

lpr1 · 2025-01-16 01:52:50

Mechanicus wrote:

Can you post the dmesg output for a successful amdgpu_gpu_recover? I want to check whether it contains a line similar to this:
[  127.100875] amdgpu 0000:07:00.0: [drm] ERROR [CRTC:67:crtc-0] flip_done timed out

Sure, all looks normal

[  833.138985] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[  833.284451] amdgpu 0000:07:00.0: amdgpu: psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
[  833.320089] amdgpu 0000:07:00.0: amdgpu: MODE2 reset
[  833.320571] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
[  833.320992] amdgpu 0000:07:00.0: amdgpu: PSP is resuming...
[  833.341030] amdgpu 0000:07:00.0: amdgpu: reserve 0x400000 from 0xf403c00000 for PSP TMR
[  833.408258] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  833.415546] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[  833.415549] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  833.661767] amdgpu 0000:07:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[  833.661775] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  833.661778] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  833.661781] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  833.661784] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  833.661787] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  833.661789] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  833.661792] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  833.661795] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  833.661798] amdgpu 0000:07:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[  833.661801] amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
[  833.661803] amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 8
[  833.661806] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 8
[  833.661809] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 8
[  833.661811] amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 8
[  833.669505] amdgpu 0000:07:00.0: amdgpu: GPU reset(1) succeeded!

SmillerMP · 2025-01-16 03:10:32

Hi, I'm new to the forum. I've also been experiencing freezes with mesa 24.3.x; I have a Ryzen 5 3400G.

I recently noticed that an update of some package on my system breaks the login, and the login works again after updating mesa and libmesa,
so staying on mesa 24.2.7 while waiting may start to cause problems.

I don't have such deep experience with arch, but I'm here to help in any way I can.

Mechanicus · 2025-01-16 07:06:53

lpr1 wrote:

Sure, all looks normal

Also it executes pretty fast. Good results.

Mechanicus · 2025-01-16 12:12:54

SmillerMP wrote:

Hi, I'm new to the forum. I've also been experiencing freezes with mesa 24.3.x; I have a Ryzen 5 3400G.
I recently noticed that an update of some package on my system breaks the login, and the login works again after updating mesa and libmesa,
so staying on mesa 24.2.7 while waiting may start to cause problems.
I don't have such deep experience with arch, but I'm here to help in any way I can.

Could you please also add export MESA_SHADER_CACHE_DISABLE=true to your ~/.bashrc, clean the ~/.cache directory after a reboot and then do what you usually do when the system freezes?
Make sure to use the latest mesa.
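
Because ~/.bashrc is not necessarily read by applications started from a desktop session, it may be worth confirming that the variable really reached the process under test. Below is a minimal sketch, assuming a Linux /proc filesystem where /proc/<pid>/environ holds the NUL-separated start-up environment of a process; shader_cache_disable_set_in is a hypothetical name, and the check only recognizes the exact value true as suggested above.

```shell
#!/bin/sh
# Hypothetical helper: report whether an environ-style file (NUL-separated
# VAR=value entries, as in /proc/<pid>/environ) contains
# MESA_SHADER_CACHE_DISABLE=true.
shader_cache_disable_set_in() {
    # $1: path to the environ file
    if tr '\000' '\n' < "$1" | grep -qx 'MESA_SHADER_CACHE_DISABLE=true'; then
        echo "set"
    else
        echo "not set"
    fi
}

# Usage (replace <pid> with the process ID of the application under test):
#   shader_cache_disable_set_in /proc/<pid>/environ
```

For a quick one-off test without touching ~/.bashrc, the variable can also be set for a single program by starting it from a terminal as MESA_SHADER_CACHE_DISABLE=true followed by the program name.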

Last edited by Mechanicus (2025-01-16 12:13:13)
