
#476 2025-02-04 11:58:46

lpr1
Member
Registered: 2017-10-08
Posts: 109

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

Mechanicus wrote:
lpr1 wrote:

Alright, but they/you are moving too fast; 2-3 days are required to be sure there's no issue (or to catch one if there is), so I'll keep testing 1-2 for now, at least 2 more days.

Well... probably. This is my typical speed of problem resolution, sorry :)

Always go quality over quantity to ensure the issue is solved, so to test your 4 kernels (stable, 1-3, 1-7, 1-8 and 1-9), I will need about 12 days, but as a result, there will be little doubt about the accuracy of the results ;)

Last edited by lpr1 (2025-02-04 12:00:34)

Offline

#477 2025-02-04 15:00:28

Mechanicus
Member
Registered: 2025-01-13
Posts: 104

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

AMDGPU testing

glibc-2.41

Build: linux-amdgpu-testing-6.13.1.arch1-10 - freezes.
Included patches:
- Reduce GFXOFF enable delay from 100ms to 10ms
- Execute amdgpu_vm_flush with one mutex lock

Build: linux-amdgpu-testing-6.13.1.arch1-11 - freezes. The problem has been found.
Included patches:
- Reduce GFXOFF enable delay from 100ms to 10ms
- Fix data races in amdgpu_vm_flush v2.0

Build: linux-amdgpu-testing-6.13.1.arch1-12 - freeze confirmed.
Included patches:
- Fix data races in amdgpu_vm_flush v3.0

Build: linux-amdgpu-testing-6.13.1.arch1-13 - freeze confirmed.
Included patches:
- Reduce GFXOFF enable delay from 100ms to 10ms
- Fix data races in amdgpu_vm_flush v3.0
- Protect all fence operations with mutex in amdgpu_vm_flush()

Build: linux-amdgpu-testing-6.13.1.arch1-14 - freeze confirmed.
Included patches:
- Optimize mutex protected blocks in amdgpu_vm_flush
- Protect pm_runtime_get_noresume from being called when GFXOFF active

Build: linux-amdgpu-testing-6.13.1.arch1-15 - freeze confirmed. No Dumping IP state message.
Included patches:
- Reduce GFXOFF enable delay from 100ms to 10ms
- Optimize mutex protected blocks in amdgpu_vm_flush
- Add missing mutex protection for amdgpu_vmid

GitHub

What to check:
- stability
If stable:
- temperature
- average package power (sudo turbostat -s PkgWatt)
- performance - for example WebGL Aquarium fps for different numbers of fish
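
A rough way to get the average from a captured turbostat log, in case it helps. This is only a sketch in Python: it assumes the output of sudo turbostat -s PkgWatt was saved as text and contains a single PkgWatt column (a header line followed by one reading per interval). The function name and the sample readings are made up for illustration and are not real measurements.

```python
import statistics


def average_pkg_watt(turbostat_output: str) -> float:
    """Average the numeric readings in single-column PkgWatt output."""
    samples = []
    for line in turbostat_output.splitlines():
        field = line.strip()
        if not field:
            continue
        try:
            samples.append(float(field))
        except ValueError:
            # Header line ("PkgWatt") or any other non-numeric text: skip it.
            continue
    if not samples:
        raise ValueError("no PkgWatt samples found")
    return statistics.mean(samples)


# Made-up readings for illustration only, not a real measurement.
sample = """PkgWatt
4.10
3.90
4.00
"""

print(f"{average_pkg_watt(sample):.2f} W")
```

On a real capture you would pass the contents of the saved log instead of the sample string.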

Kernel option to keep during testing period: fsck.mode=force

laikm from pacoandres for reporting issues

Last edited by Mechanicus (2025-02-07 09:18:16)

Offline

#478 2025-02-04 15:21:10

pacoandres
Member
Registered: 2020-03-05
Posts: 43

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

I've been testing linux-amdgpu-testing-6.13.1.arch1-7 for 4 hours with no freeze.
All tests have passed and the idle power consumption is slightly lower than with the repo kernel and patched mesa.

The only issue is that after recovery the GPU load and power consumption rise (consumption grows from 4W before to 12W after) and stay that way until reboot.
More detailed analysis with logs and captures here: https://github.com/pacoandres/laikm/issues/9

I'll keep testing this release at least until tomorrow.

Offline

#479 2025-02-04 15:26:56

Mechanicus
Member
Registered: 2025-01-13
Posts: 104

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

@pacoandres good results. linux-amdgpu-testing-6.13.1.arch1-7 is an adaptation of the stable workaround, which uses a powergating state change for compute queue rings. The main goal of this build is to move the workaround closer to the actual problematic place in the code.

Offline

#480 2025-02-04 19:43:03

NuSkool
Member
Registered: 2015-03-23
Posts: 214

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

lpr1 wrote:
Mechanicus wrote:
lpr1 wrote:

Alright, but they/you are moving too fast; 2-3 days are required to be sure there's no issue (or to catch one if there is), so I'll keep testing 1-2 for now, at least 2 more days.

Well... probably. This is my typical speed of problem resolution, sorry :)

Always go quality over quantity to ensure the issue is solved, so to test your 4 kernels (stable, 1-3, 1-7, 1-8 and 1-9), I will need about 12 days, but as a result, there will be little doubt about the accuracy of the results ;)

Alright, let's see if I'm getting my thick head around this correctly...

There will be/are testing kernels that go through testing/development. After one passes that phase, it becomes stable for more thorough testing on various hardware?

Then, only for the stable kernels, a few days' minimum testing time to assure quality?

If this is correct, would you want more than one person (different hardware, skills) to test both testing and stable versions?

My thought process: if there are testers of various skill levels, perhaps have the right people testing the testing vs. stable kernels to expedite the process as a whole?
It seems there is more than one software dev here doing testing. On the other hand, I'm just a user doing what I can to help out.

EDIT:
I've started testing linux-amdgpu-stable 6.13.1.arch1-3 on an up-to-date system (glibc 2.41).

Last edited by NuSkool (2025-02-04 20:15:18)

Offline

#481 2025-02-04 19:49:10

flemingfleming
Member
Registered: 2024-12-27
Posts: 14

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

NuSkool wrote:

There will be/are testing kernels that go through testing/development. After one passes that phase, it becomes stable for more thorough testing on various hardware?

Then, only for the stable kernels, a few days' minimum testing time to assure quality?

As far as I understand it, the stable kernels contain only patches which are known to work, and don't have the freeze. The testing kernels contain additional patches on top of this to try and fix more stuff/clean up the code.

Speaking of which, 6.13.1.arch1-10 is working well right now.

Offline

#482 2025-02-04 21:09:51

Mechanicus
Member
Registered: 2025-01-13
Posts: 104

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

flemingfleming wrote:
NuSkool wrote:

There will be/are testing kernels that go through testing/development. After one passes that phase, it becomes stable for more thorough testing on various hardware?

Then, only for the stable kernels, a few days' minimum testing time to assure quality?

As far as I understand it, the stable kernels contain only patches which are known to work, and don't have the freeze. The testing kernels contain additional patches on top of this to try and fix more stuff/clean up the code.

Speaking of which, 6.13.1.arch1-10 is working well right now.

Sometimes "on top", sometimes "instead" of the patches that are called stable. I provide the list of patches applied on top of the default Linux kernel for every release. 6.13.1.arch1-10, for example, contains my own fix for a video memory access error and doesn't contain any workarounds.

Last edited by Mechanicus (2025-02-04 21:10:30)

Offline

#483 2025-02-04 21:13:59

Mechanicus
Member
Registered: 2025-01-13
Posts: 104

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

@NuSkool let me explain what is called stable. Stable is an ex-testing kernel that has passed all stability tests. linux-amdgpu-stable 6.13.1.arch1-3 is the same as linux-amdgpu-stable 6.13.1.arch1-2 and linux-amdgpu-6.13.arch1-2. It is just a rebuild for compatibility with the new glibc. The main purpose of the rebuilds is to let you switch to this stable kernel when things go wrong with the others.
My goal is to solve the root cause of the problem, not to create one more workaround. And I hope 6.13.1.arch1-10 is a step in the right direction.

Last edited by Mechanicus (2025-02-04 21:47:06)

Offline

#484 2025-02-04 21:22:05

SnowF
Member
Registered: 2025-01-17
Posts: 12

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

  • Kernel: 6.13.1-zen1-1-zen

  • Mesa: 24.3.4

Kernel boot option: amdgpu.ppfeaturemask=0xf7fff

+5 days without crashes
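
For anyone wondering what that mask changes: a small Python sketch that lists which bits are zero in a ppfeaturemask value. The helper name and the 20-bit width are assumptions for illustration (0xf7fff happens to span 20 bits). For 0xf7fff the only zero bit is bit 15 (0x8000), which as far as I know is PP_GFXOFF_MASK in the kernel's amd_shared.h; please check that against your own kernel source.

```python
def cleared_bits(mask: int, width: int) -> list[int]:
    """Return the bit positions that are 0 in `mask` within the low `width` bits."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


mask = 0xF7FFF
# Width of 20 is an assumption: it is the number of bits 0xf7fff spans.
for bit in cleared_bits(mask, width=20):
    print(f"bit {bit} cleared (0x{1 << bit:x})")
```

The same helper works for wider masks if you pass width=32.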

Offline

#485 2025-02-04 23:09:20

lpr1
Member
Registered: 2017-10-08
Posts: 109

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

NuSkool wrote:
lpr1 wrote:
Mechanicus wrote:

Well... probably. This is my typical speed of problem resolution, sorry :)

Always go quality over quantity to ensure the issue is solved, so to test your 4 kernels (stable, 1-3, 1-7, 1-8 and 1-9), I will need about 12 days, but as a result, there will be little doubt about the accuracy of the results ;)

Alright, let's see if I'm getting my thick head around this correctly...

There will be/are testing kernels that go through testing/development. After one passes that phase, it becomes stable for more thorough testing on various hardware?

Then, only for the stable kernels, a few days' minimum testing time to assure quality?

If this is correct, would you want more than one person (different hardware, skills) to test both testing and stable versions?

My thought process: if there are testers of various skill levels, perhaps have the right people testing the testing vs. stable kernels to expedite the process as a whole?
It seems there is more than one software dev here doing testing. On the other hand, I'm just a user doing what I can to help out.

EDIT:
I've started testing linux-amdgpu-stable 6.13.1.arch1-3 on an up-to-date system (glibc 2.41).

Nothing special about my testing method, I just give it time to freeze via "normal use".

Offline

#486 2025-02-05 13:20:26

lpr1
Member
Registered: 2017-10-08
Posts: 109

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

6.13.1-arch1-3-amdgpu-stable (glibc 2.41): it seems that I can't reproduce the freeze with this kernel. However, the system feels less responsive for some reason (for example, when opening a texture in GIMP, a project in Blender, or even files in the file manager).

It's pointless to share temperature and usage, since I have no clue what they are with the repo kernel. I disable turbo on all of my machines, so the CPU/GPU works at a lower voltage (~1.15v for 3400G, ~1.20-1.25v for 3200G = all auto, voltage/frequency set by motherboard UEFI) and frequency, and therefore with less heat output and consumption. But if it means anything under these conditions, I can share it.

What's the next kernel you would like me to test, @Mechanicus?

Offline

#487 2025-02-05 13:39:57

Mechanicus
Member
Registered: 2025-01-13
Posts: 104

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

lpr1 wrote:

6.13.1-arch1-3-amdgpu-stable (glibc 2.41): it seems that I can't reproduce the freeze with this kernel. However, the system feels less responsive for some reason (for example, when opening a texture in GIMP, a project in Blender, or even files in the file manager).

It's pointless to share temperature and usage, since I have no clue what they are with the repo kernel. I disable turbo on all of my machines, so the CPU/GPU works at a lower voltage (~1.15v for 3400G, ~1.20-1.25v for 3200G = all auto, voltage/frequency set by motherboard UEFI) and frequency, and therefore with less heat output and consumption. But if it means anything under these conditions, I can share it.

What's the next kernel you would like me to test, @Mechanicus?

Please try linux-amdgpu-testing-6.13.1.arch1-11. It contains an updated fix for the data race in the VM flush function. Regarding responsiveness with 6.13.1-arch1-3-amdgpu-stable: you're right, the workaround of switching power saving on/off for every frame is not free. That's why I don't want to see such a "fix" in the mainline kernel.

Last edited by Mechanicus (2025-02-05 13:42:26)

Offline

#488 2025-02-05 17:40:46

lpr1
Member
Registered: 2017-10-08
Posts: 109

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

Mechanicus wrote:
lpr1 wrote:

6.13.1-arch1-3-amdgpu-stable (glibc 2.41): it seems that I can't reproduce the freeze with this kernel. However, the system feels less responsive for some reason (for example, when opening a texture in GIMP, a project in Blender, or even files in the file manager).

It's pointless to share temperature and usage, since I have no clue what they are with the repo kernel. I disable turbo on all of my machines, so the CPU/GPU works at a lower voltage (~1.15v for 3400G, ~1.20-1.25v for 3200G = all auto, voltage/frequency set by motherboard UEFI) and frequency, and therefore with less heat output and consumption. But if it means anything under these conditions, I can share it.

What's the next kernel you would like me to test, @Mechanicus?

Please try linux-amdgpu-testing-6.13.1.arch1-11. It contains an updated fix for the data race in the VM flush function. Regarding responsiveness with 6.13.1-arch1-3-amdgpu-stable: you're right, the workaround of switching power saving on/off for every frame is not free. That's why I don't want to see such a "fix" in the mainline kernel.

Got it, now testing 6.13.1-arch1-11-amdgpu-testing, and indeed it is much more responsive, I would say on the level of the stock kernel. Now we will see how it goes.

Offline

#489 2025-02-05 17:43:30

Mechanicus
Member
Registered: 2025-01-13
Posts: 104

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

lpr1 wrote:

Got it, now testing 6.13.1-arch1-11-amdgpu-testing, and indeed it is much more responsive, I would say on the level of the stock kernel. Now we will see how it goes.

Seems it's not quite there yet. @flemingfleming got a freeze. But I've found where it came from, so the new kernel is building now.
linux-amdgpu-testing-6.13.1.arch1-13 uploaded. I will test it tomorrow too.

Last edited by Mechanicus (2025-02-05 23:17:37)

Offline

#490 2025-02-06 07:59:18

CarlenWhite
Member
Registered: 2025-02-02
Posts: 3

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

NixOS unstable-small's zen kernel updated to 6.13, and I decided to try the testing patches since they can be easily applied now, but encountered a hang. The quickest way to cause a crash on my system is using GIMP, especially exporting a PNG from a project. On 6.13.1-zen it hung during export, while on vanilla 6.13.1 it got as far as trying to open a project before it hung. If there's anything else I can supply, let me know.

Note: I am building amdgpu myself instead of repackaging and installing the ones offered here.

Set of commits used, aka linux-amdgpu-testing-6.13.1.arch1-13
AMD Ryzen 7 2700U
Crashed on: 6.13.1-zen1, 6.13.1

Last edited by CarlenWhite (2025-02-06 08:28:17)

Offline

#491 2025-02-06 10:57:19

lpr1
Member
Registered: 2017-10-08
Posts: 109

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

Mechanicus wrote:
lpr1 wrote:

Got it, now testing 6.13.1-arch1-11-amdgpu-testing, and indeed it is much more responsive, I would say on the level of the stock kernel. Now we will see how it goes.

Seems it's not quite there yet. @flemingfleming got a freeze. But I've found where it came from, so the new kernel is building now.
linux-amdgpu-testing-6.13.1.arch1-13 uploaded. I will test it tomorrow too.

So we should move to testing 1-13?

Offline

#492 2025-02-06 11:13:09

pacoandres
Member
Registered: 2020-03-05
Posts: 43

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

lpr1 wrote:
Mechanicus wrote:
lpr1 wrote:

Got it, now testing 6.13.1-arch1-11-amdgpu-testing, and indeed it is much more responsive, I would say on the level of the stock kernel. Now we will see how it goes.

Seems it's not quite there yet. @flemingfleming got a freeze. But I've found where it came from, so the new kernel is building now.
linux-amdgpu-testing-6.13.1.arch1-13 uploaded. I will test it tomorrow too.

So we should move to testing 1-13?

Seems it's still broken: https://github.com/pacoandres/laikm/issues/14

Offline

#493 2025-02-06 12:11:04

lpr1
Member
Registered: 2017-10-08
Posts: 109

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

pacoandres wrote:
lpr1 wrote:
Mechanicus wrote:

Seems it's not quite there yet. @flemingfleming got a freeze. But I've found where it came from, so the new kernel is building now.
linux-amdgpu-testing-6.13.1.arch1-13 uploaded. I will test it tomorrow too.

So we should move to testing 1-13?

Seems it's still broken: https://github.com/pacoandres/laikm/issues/14

Oh, I guess I should continue testing 1-11 then, because so far I haven't experienced issues. Unfortunately I can't participate on GitHub anymore: as soon as MS decided to lock my account and ask for my private data to authenticate, I decided against it, so I have an account but no access to it.

Offline

#494 2025-02-06 12:28:51

pacoandres
Member
Registered: 2020-03-05
Posts: 43

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

lpr1 wrote:
pacoandres wrote:
lpr1 wrote:

So we should move to testing 1-13?

Seems it's still broken: https://github.com/pacoandres/laikm/issues/14

Oh, I guess I should continue testing 1-11 then, because so far I haven't experienced issues. Unfortunately I can't participate on GitHub anymore: as soon as MS decided to lock my account and ask for my private data to authenticate, I decided against it, so I have an account but no access to it.

1-11 is also marked as broken: https://github.com/pacoandres/laikm/issues/12

So far the only testing release that stays stable is 1-7: https://github.com/pacoandres/laikm/issues/9. This is the one I'm using while not testing a new release.

Offline

#495 2025-02-06 12:48:26

tenebraus
Member
Registered: 2021-01-01
Posts: 4

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

No crashes so far, using mesa 24.3.4.1, 17h with gaming + videos. I also had some uptime before the power went out in the neighborhood, but no crashes so far.

Kernel 6.12.10 (didn't update yet on purpose), using kernel parameters:

GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 amdgpu.ppfeaturemask=0xfff73fff quiet"

Let me know how I can help with testing!

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 24 [Radeon RX 6400/6500 XT/6500M] (rev c1)
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series] (rev c9)

Offline

#496 2025-02-06 13:08:36

pacoandres
Member
Registered: 2020-03-05
Posts: 43

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

tenebraus wrote:

No crashes so far, using mesa 24.3.4.1, 17h with gaming + videos. I also had some uptime before the power went out in the neighborhood, but no crashes so far.

Kernel 6.12.10 (didn't update yet on purpose), using kernel parameters:

GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 amdgpu.ppfeaturemask=0xfff73fff quiet"

Let me know how I can help with testing!

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 24 [Radeon RX 6400/6500 XT/6500M] (rev c1)
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series] (rev c9)

With this workaround the system is much more stable, but it sometimes freezes too.
And the power consumption is also higher with this parameter; on my system it's about 37% higher when idle compared to a kernel with no parameters.

At this time @Mechanicus is building testing kernels; if you can help, the best you can do is test them.
You can take a look at the status here: https://github.com/pacoandres/laikm/issues
Open issues are for current testing kernels, closed issues are for crashed ones.

Offline

#497 2025-02-06 14:52:56

lpr1
Member
Registered: 2017-10-08
Posts: 109

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

pacoandres wrote:
lpr1 wrote:
pacoandres wrote:

Oh, I guess I should continue testing 1-11 then, because so far I haven't experienced issues. Unfortunately I can't participate on GitHub anymore: as soon as MS decided to lock my account and ask for my private data to authenticate, I decided against it, so I have an account but no access to it.

1-11 is also marked as broken: https://github.com/pacoandres/laikm/issues/12

So far the only testing release that stays stable is 1-7: https://github.com/pacoandres/laikm/issues/9. This is the one I'm using while not testing a new release.

Yup, just got a freeze with 1-11 as well while working in Blender, so yeah.

Offline

#498 2025-02-06 17:57:28

Mechanicus
Member
Registered: 2025-01-13
Posts: 104

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

AMDGPU testing

Build: linux-amdgpu-testing-6.13.1.arch1-16 - freezes
Included patches:
- Reduce GFXOFF enable delay from 100ms to 10ms
- Optimize mutex protected blocks in amdgpu_vm_flush
- Add missing mutex protection for amdgpu_vmid
- Execute all amdgpu_vmid changes under locked mutex
- Remove workaround for TLB seq race

Build: linux-amdgpu-testing-6.13.1.arch1-17 - freeze confirmed.
Included patches:
- Add verbose logging in amdgpu_vmid_grab()
- Perform amdgpu_vmid access under locked mutex in amdgpu_ib_schedule
- Add missing mutex protection for amdgpu_vmid
- Do not use invalidation semaphore for flush_gpu_tlb
- Remove GFXOFF workaround for doorbell for sdma_v5_2

Build: linux-amdgpu-testing-6.13.1.arch1-18 - stable.
Included patches:
- Reduce GFXOFF enable delay from 100ms to 10ms
- Refactor GFXOFF handling in amdgpu_dpm_force_performance_level
- Remove GFXOFF workaround for doorbell for sdma_v5_2
- Deny GFXOFF in amdgpu_ring_commit for GFX IP (with PG change)
- Optimize mutex protected blocks in amdgpu_vm_flush

GitHub

What to check:
- stability
If stable:
- temperature
- average package power (sudo turbostat -s PkgWatt)
- performance - for example WebGL Aquarium fps for different numbers of fish

Kernel option to keep during testing period: fsck.mode=force

laikm from pacoandres for reporting issues

Last edited by Mechanicus (Yesterday 14:44:47)

Offline

#499 2025-02-07 11:31:43

pacoandres
Member
Registered: 2020-03-05
Posts: 43

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

linux-amdgpu-testing-6.13.1.arch1-16 also freezes.

Offline

#500 2025-02-07 20:30:14

Aivar
Member
Registered: 2025-02-07
Posts: 1

Re: Issues with Mesa 24.3.x and amdgpu Vega graphics

6.13.1-arch1-14-amdgpu-testing - passed all my tests except cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover
6.13.1-arch1-18-amdgpu-testing - passed all my tests, but cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover freezes the system and turns off the video output on the third try.
In both cases I didn't notice any difference in temperature or Aquarium performance compared to linux-lts 6.12.12-1 (with kernel parameter amdgpu.ppfeaturemask=0xf7fff).
---
Raven Ridge <gfx_v9_0>


Blame the google-translator for all language errors.

Offline
