I'm building a new kernel from linux-zen from Arch repos, with the only change being to apply your latest three patches. Let me know if there's anything different I should test.
I had to patch the patch since it seems like zen does their own changes to amdgpu. Turns out NixOS is on zen 6.12, not zen 6.13, follow-up here.
--- a-0002-Extend-amdgpu_gfx_-_ring_begin_use-end_use-functions.patch 1980-01-01 00:00:00.000000000 -0500
+++ b-0002-Extend-amdgpu_gfx_-_ring_begin_use-end_use-functions.patch 2025-02-02 07:19:27.851287493 -0500
@@ -21,9 +21,9 @@
index 785c5ca0b72a..a5be647848ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
-@@ -1991,6 +1991,16 @@ void amdgpu_gfx_enforce_isolation_ring_begin_use(struct amdgpu_ring *ring)
+@@ -1838,6 +1838,17 @@ void amdgpu_gfx_enforce_isolation_ring_begin_use(struct amdgpu_ring *ring)
+ struct amdgpu_device *adev = ring->adev;
u32 idx;
- bool sched_work = false;
+ /* Disable powergating before ringing doorbell for compute */
+ const struct amdgpu_ring_funcs *funcs = ring->funcs;
@@ -35,13 +35,14 @@
+ }
+ }
+
++
if (!adev->gfx.enable_cleaner_shader)
return;
-@@ -2042,6 +2052,15 @@ void amdgpu_gfx_enforce_isolation_ring_end_use(struct amdgpu_ring *ring)
-
- if (sched_work)
- amdgpu_gfx_kfd_sch_ctrl(adev, idx, true);
+@@ -1879,4 +1890,13 @@ void amdgpu_gfx_enforce_isolation_ring_end_use(struct amdgpu_ring *ring)
+ amdgpu_gfx_kfd_sch_ctrl(adev, idx, true);
+ }
+ mutex_unlock(&adev->enforce_isolation_mutex);
+
+ /* Restore powergating after ringing doorbell for compute */
+ const struct amdgpu_ring_funcs *funcs = ring->funcs;
@@ -52,8 +53,6 @@
+ }
+ }
}
-
- /*
--
2.48.1
Disclaimer: I'm on NixOS, but this issue has been doing my head in, probably as much as it has for everyone here. So I'm glad that there's a patch I can apply on my system.
I'll try to report in if this is stable or not.
Edit:
Seems like it's stable so far. The thing I could think of that would semi-reliably get a hang was GIMP and reenacting the crop and export on a specific project, and I did that several times before opening a much larger project to then export.
Last edited by CarlenWhite (2025-02-03 11:59:57)
Offline
Build: linux-amdgpu-testing-6.13.1.arch1-2 - freezes.
Included patches:
- Remove GFXOFF/PG workaround from amdgpu_amdkfd
- Remove GFXOFF workarounds from gfx part
- Remove GFXOFF workaround from amdgpu_dpm
- Remove GFXOFF workarounds from sdma
- Deny GFXOFF for amdgpu_job operations
What to check:
- stability
Kernel option to keep during testing period: fsck.mode=force
Last edited by Mechanicus (2025-02-03 11:14:25)
Offline
Report: @ ~45 hours on the following setup:
pacman -Q linux-amdgpu : linux-amdgpu 6.13.arch1-2
uname -rs : Linux 6.13.0-arch1-2-amdgpu
pacman -Q mesa : mesa 1:24.3.4-1
sudo cat /sys/module/drm/parameters/debug: 15
Kernel parameters: (nothing added to normal config)
cat /proc/cmdline : ........... rw loglevel=3 sysrq_always_enabled=1 amd_pstate=passive fsck.mode=force
dmesg: http://0x0.st/8KPv.txt
I performed the following:
1) Wrote dmesg to a file.
2) Induced a desktop session freeze with radeontop in 'Application Finder', then wrote dmesg to a file.
3) Ran radeontop normally in a terminal, then wrote dmesg to a file.
The three dmesg files are identical.
Ran a 4K fullscreen youtube video for ~8hr overnight.
Ran a 4K fullscreen youtube video in background while watching ~1.5hr movie in MPV.
Ran a 4K fullscreen youtube video in background while simultaneously running both glxgears and vkcube.
Manually moved the glxgears and vkcube windows around while the above was running.
Rapidly switched between windows.
OK - idle stability
OK - workflow stability
OK - glxgears window resizing
OK - vkcube window resizing
This test kernel is robust for this system. Have not been able to induce system freeze.
I keep an eye on my CPU loads/temp realtime via DE apps with nothing abnormal to report.
Not sure how to record and log APU load and temp over time.
Open for suggestions on how to get this set up.
Let me know what else I can do to help with testing.
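On the question of recording APU load and temperature over time, one option is to poll the amdgpu sysfs files. This is a minimal sketch, not a tested tool: it assumes the GPU is card0 and that the driver exposes gpu_busy_percent and a hwmon temp1_input file (in millidegrees Celsius). The function names, the CSV layout, and the default log file name are made up for illustration.

```shell
# Sketch: sample GPU load and temperature from amdgpu sysfs files.
# ASSUMPTIONS: the device directory contains gpu_busy_percent and
# hwmon/hwmon*/temp1_input (millidegrees Celsius); card0 is the GPU.

# Print one "epoch_seconds,busy_percent,temp_celsius" line for a device dir.
sample_gpu() {
    local dev="$1" busy temp_file temp_mc
    busy=$(cat "$dev/gpu_busy_percent" 2>/dev/null || echo NA)
    # The hwmon index can change between boots, so glob for it.
    temp_file=$(ls "$dev"/hwmon/hwmon*/temp1_input 2>/dev/null | head -n 1)
    if [ -n "$temp_file" ]; then
        temp_mc=$(cat "$temp_file")
        printf '%s,%s,%s\n' "$(date +%s)" "$busy" "$((temp_mc / 1000))"
    else
        printf '%s,%s,NA\n' "$(date +%s)" "$busy"
    fi
}

# Append one sample every INTERVAL seconds until interrupted (Ctrl+C).
log_gpu() {
    local dev="${1:-/sys/class/drm/card0/device}"
    local out="${2:-gpu-log.csv}" interval="${3:-5}"
    while true; do
        sample_gpu "$dev" >> "$out"
        sleep "$interval"
    done
}
```

Running `log_gpu` in a spare terminal during a test session would leave a CSV that can be attached to a report. On a system with more than one GPU the card index may differ.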
EDIT:
FYI: Just figured out that '/sys/module/drm/parameters/debug' gets reset to zero upon reboot.
On a fresh boot, I logged in as root (on console, no DM), and ran:
echo 0xf > /sys/module/drm/parameters/debug
logged out and into my user account...
Now this (debug logging?) is working as expected, flooding the journal with lines prefixed with 'amdgpu'.
I also noticed my system suddenly got laggy with debug logging actually working.
I had previously tried to implement this https://bbs.archlinux.org/viewtopic.php … 9#p2223819 and https://bbs.archlinux.org/viewtopic.php … 1#p2223821 but it seemed to not be working as expected. Not sure, but possibly implementing the debug logging command before starting a GUI session for each re/start is the key for it to work?
dmesg with debug logging enabled: http://0x0.st/8KNr.txt
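Since the value is lost on reboot, the step above has to be repeated on every boot. A small helper like the following could do that; it is only a sketch, the function name is hypothetical, and it must be run as root. The 0xf mask is the one used in this post.

```shell
# Sketch: enable DRM debug logging at runtime and read the value back.
# ASSUMPTIONS: run as root; the default path is the parameter file named
# in this post; set_drm_debug is a made-up helper name.

# Write a debug mask to the parameter file, then print what the file holds.
set_drm_debug() {
    local mask="$1" param="${2:-/sys/module/drm/parameters/debug}"
    echo "$mask" > "$param" && cat "$param"
}
```

Calling `set_drm_debug 0xf` right after boot should be followed by a check of the printed value. As the earlier `cat` output in this post shows, the kernel reports the mask in decimal (15). Passing the mask on the kernel command line in the drm.debug=<mask> form used elsewhere in this thread is presumably an alternative that does not need a root login.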
Last edited by NuSkool (2025-02-02 20:10:10)
Scripts I use: https://github.com/Cody-Learner
Offline
Sorry, but I'm just getting lost in this thread. …
If you think it could help, I've set up a GitHub project that uses only the issues section, so we can use one issue per test and have all the results gathered together. I think this could make developers' and testers' jobs easier, and anyone who wants to join the search for a solution can have a place to start without reading this ever-growing thread.
I wholeheartedly support this. There are too many versions of Mesa and the Linux kernel. One version per GitHub thread would be nice.
we are not condemned to write ugly code
Offline
glibc-2.40
Build: linux-amdgpu-stable-6.13.1.arch1-2 - stable.
Included patches:
- Remove 100ms delay from amdgpu_gfx_off_ctrl
- Add PG to amdgpu_gfx_enforce_isolation_ring_begin_use/end_use
This is a rebuild of stable patches on top of 6.13.1 kernel for consistency.
Last edited by Mechanicus (2025-02-04 17:13:02)
Offline
Sorry, but I'm just getting lost in this thread.
By now I'm not sure whether I'm testing the same linux-amdgpu 6.13.arch1-2 as @NuSkool, or which versions other people are testing. I've also lost track of which Mesa versions are being tested now.
If you think it could help, I've set up a GitHub project that uses only the issues section, so we can use one issue per test and have all the results gathered together. I think this could make developers' and testers' jobs easier, and anyone who wants to join the search for a solution can have a place to start without reading this ever-growing thread.
This is the link to the project: https://github.com/pacoandres/laikm.
Anyone who wants permissions on it, just ask me. Any suggestions or criticisms are welcome.
By now I think I'm testing the same configuration as @NuSkool with no issues.
This is very useful, thanks!
Offline
Big thanks, @NuSkool! I think you can switch to the 6.13.1 rebuild, to make sure the stability remains the same.
Offline
My experiment with linux-zen using the stable branch patches was sort of a failure. Fall Guys would regularly dip into sub-20fps and the Mangohud would report "0fps" and "INF ms" frame times.
Offline
My experiment with linux-zen using the stable branch patches was sort of a failure. Fall Guys would regularly dip into sub-20fps and the Mangohud would report "0fps" and "INF ms" frame times.
If linux-zen modifies GPU rings frequency the patches might hurt performance very much. They are stable, but not ideal yet. I'm trying to get rid of all sh* in the driver and make it stable without workarounds.
Offline
I know this is a Vega thread, but I wanted to confirm that Mechanicus' build (6.13.1-arch1-2-amdgpu-testing) with mesa 1:24.3.4-1 made my system more stable, so thank you for that.
My system became unusable with latest kernel and mesa. Even opening a heavier web page like LinkedIn on Brave was enough to either reboot or freeze my computer.
It seems to play media just fine now *if* I don't abuse it - if I'm playing something using mpv or jellyfin and fast-forward/jump ahead too much, the system crashes and reboots.
My system:
Ryzen 5500
RX 6600
Offline
kode54 wrote:
My experiment with linux-zen using the stable branch patches was sort of a failure. Fall Guys would regularly dip into sub-20fps and the Mangohud would report "0fps" and "INF ms" frame times.
If linux-zen modifies GPU rings frequency the patches might hurt performance very much. They are stable, but not ideal yet. I'm trying to get rid of all sh* in the driver and make it stable without workarounds.
Yeah, thanks for that, and I didn't realize when I started that Zen already applies AMDGPU patches.
PS. You didn't link to -headers package for the -testing kernel.
Offline
I know this is a Vega thread, but I wanted to confirm that Mechanicus' build (6.13.1-arch1-2-amdgpu-testing) with mesa 1:24.3.4-1 made my system more stable, so thank you for that.
My system became unusable with latest kernel and mesa. Even opening a heavier web page like LinkedIn on Brave was enough to either reboot or freeze my computer.
It seems to play media just fine now *if* I don't abuse it - if I'm playing something using mpv or jellyfin and fast-forward/jump ahead too much, the system crashes and reboots.
My system:
Ryzen 5500
RX 6600
Have you tried https://bbs.archlinux.org/viewtopic.php … 0#p2223890 ? It's a workaround while the underlying issue is being investigated.
Last edited by topcat01 (2025-02-02 21:54:17)
Offline
Report: @ ~30 min on the following setup:
pacman -Q linux-amdgpu : linux-amdgpu-stable 6.13.1.arch1-1
uname -rs : Linux 6.13.1-arch1-1-amdgpu-stable
pacman -Q mesa : mesa 1:24.3.4-1
Kernel parameters: (nothing added to normal config)
cat /proc/cmdline : ........... rw loglevel=3 sysrq_always_enabled=1 amd_pstate=passive fsck.mode=force
System froze before any testing.
Offline
@NuSkool sorry, looks like I mixed the wrong change. Rebuilding now.
Offline
I can corroborate, linux-amdgpu-stable 6.13.1.arch1-1 also freezes for me.
Offline
PS. You didn't link to -headers package for the -testing kernel.
Yes, because I think it is not necessary for testing. dkms will not use custom headers without modification, for example. But if necessary, I'll continue to upload kernel + headers, like before.
Offline
@flemingfleming, @NuSkool Updated build uploaded: https://bbs.archlinux.org/viewtopic.php … 4#p2224124
Any feedback on linux-amdgpu-testing-6.13.1.arch1-2?
Last edited by Mechanicus (2025-02-02 23:22:41)
Offline
I got a freeze with 6.13.1-arch1-2-amdgpu-testing.
drm.debug=0x2 doesn't give any logging beyond the "dumping IP state" message that you sometimes get anyway with the freeze. I did get some errors from thorium-browser (chromium fork) this time though (which I'm currently using to try and cause the freeze more often):
...
Feb 03 00:48:52 v330-14arr thorium-browser[2842]: [2884:2962:0203/004852.923925:ERROR:vulkan_swap_chain.cc(431)] vkAcquireNextImageKHR() hangs.
Feb 03 00:48:52 v330-14arr thorium-browser[2842]: [2884:2884:0203/004852.928516:ERROR:gpu_service_impl.cc(1164)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
Feb 03 00:49:01 v330-14arr kernel: amdgpu 0000:03:00.0: amdgpu: Dumping IP State
Feb 03 00:49:02 v330-14arr kernel: sysrq: Keyboard mode set to system default
Feb 03 00:49:02 v330-14arr kernel: sysrq: Terminate All Tasks
...
Offline
kode54 wrote:
PS. You didn't link to -headers package for the -testing kernel.
Yes, because I think it is not necessary for testing. dkms will not use custom headers without modification, for example. But if necessary, I'll continue to upload kernel + headers, like before.
"dkms will not use custom headers without modification" ??? It just includes from /usr/src/<kernel name> which is a symlink?
Fine, I'll uninstall my DKMS packages while testing.
Listed here for posterity:
nct6687d-dkms-git r109.2f1a27a-1
xpadneo-dkms-git 0.9.r214.g8d20a23-1
zenstats-dkms 0.1.0-1
Edit: Okay, 6.13.1-arch1-2-amdgpu-testing seems to be holding steady on 7700 XT / GFX11 / RDNA 3 / DCN 3.2. Idle is stable, workload is stable, glxgears and vkcube can be freely resized without stutters. HEVC 4k video puts the GPU at about 65°C, while Fall Guys at max settings, 1080p165, utilizes about 50% of the GPU and puts the temperature around 63.8°C. Still has the REG_WAIT timeout message on DPMS state, but wakes the monitors just fine.
Edit 2: Now testing the amdgpu-testing patch series from commit hash d7e545ec689161fa26924ca0675c751f7bf56e13, as applied to linux-zen 6.13.1-1. Seems to be just as stable, under all the usual tests. GPU junction hits about 54°C and really low utilization when decoding the 4k HEVC video, and 62.4°C when playing Fall Guys, which uses about 46% while loading levels and 58% while in-game.
Last edited by kode54 (2025-02-03 05:00:56)
Offline
After more than 12 hours of running, the amdgpu-6.13.arch1-2 kernel seems stable.
glxgears resize: passed
vkcube resize: passed (alone and concurrent with glxgears)
cat to recovery: not tested
Watch HD videos (2 hours): passed
Work with Eclipse and Android Studio: passed
The only thing I've noticed is that at idle the power consumption is 1 W higher than with the repo kernel and patched Mesa.
The problem is that I don't know which build of this kernel I'm using or which patches it has. If it helps, the sha1sum of the installed package is:
e047a9f86c6d667cfdebc896f8704d37c1d8d005
I've added this report to GitHub; anyone who wants to add theirs there to keep them tidy is welcome.
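To make it easier to tell which build a report refers to, the header fields used in reports throughout this thread could be collected by a small script. This is only a sketch: build_report is a hypothetical helper, and the package names passed to pacman are examples taken from earlier reports, so they may need changing for the -stable or -testing builds.

```shell
# Sketch: print the identifying fields used in test reports in this thread.
# ASSUMPTIONS: build_report is a made-up helper; the pacman package names
# are examples and may not match the kernel package actually installed.

build_report() {
    printf 'uname -rs : %s\n' "$(uname -rs)"
    if command -v pacman >/dev/null 2>&1; then
        # Prints nothing for packages that are not installed.
        pacman -Q linux-amdgpu mesa 2>/dev/null | sed 's/^/pacman -Q : /'
    fi
    if [ -r /proc/cmdline ]; then
        printf 'cmdline : %s\n' "$(cat /proc/cmdline)"
    fi
}
```

Pasting the output of `build_report` at the top of each report would remove most of the guesswork about kernel and Mesa versions.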
Last edited by pacoandres (2025-02-03 05:51:43)
Offline
Holy crap, the linux-zen package I compiled survived GPU recovery. (Based on the linux-amdgpu-testing patch series.)
Edit: Tried my luck again with another GPU reset. LabWC didn't like the reset and hit flip_done timeouts. Continued resetting the GPU from an SSH login. Got the idea to loginctl terminate-user my old session and reset the GPU again. GDM came up, and I was able to log back into a session. The fun continues.
Last edited by kode54 (2025-02-03 08:59:46)
Offline
I've just tested the recovery in the amdgpu-6.13.arch1-2 kernel I'm using with no problem.
It takes 3 seconds to recover.
Here are the dmesg log messages
EDIT: I've done more experiments.
Usually I use Plasma on Wayland, so I logged out and logged in with X11: no problems. I also logged out and logged in with Kodi, with no problems.
When testing the recovery on X11, the desktop freezes completely, but not the system. So I was able to get into a console and force a Plasma logout, and everything worked again. No need for a hard reset or REISUB.
While reading dmesg to check whether the logs differed between Wayland and X11 (they differ only in addresses), I saw these three lines twice in three hours of uptime:
amdgpu 0000:09:00.0: amdgpu: ring comp_1.3.1 timeout, signaled seq=2, emitted seq=14
amdgpu 0000:09:00.0: amdgpu: Process information: process ksmserver pid 11879 thread ksmserver:cs0 pid 11915
amdgpu 0000:09:00.0: amdgpu: Starting comp_1.3.1 ring reset
But I haven't noticed any glitch or any abnormal behavior in the system. This is the first time I've seen these logs; maybe I didn't notice them before.
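Because these messages are easy to miss, a quick filter over the kernel log can show how often each ring timed out. A minimal sketch, assuming the message format matches the three lines above; count_ring_timeouts is a hypothetical helper name.

```shell
# Sketch: count amdgpu ring timeout messages per ring.
# ASSUMPTION: log lines look like
#   "... amdgpu: ring <name> timeout, signaled seq=..., emitted seq=..."
# Reads kernel log text on stdin, prints "<ring> <count>" per ring.
count_ring_timeouts() {
    sed -n 's/.*amdgpu: ring \([^ ]*\) timeout.*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Typical use would be `dmesg | count_ring_timeouts`, or `journalctl -k -b | count_ring_timeouts` for the current boot. Empty output means no timeout lines were found.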
Last edited by pacoandres (2025-02-03 10:38:18)
Offline
Turns out I made a mistake in my last post. NixOS's pkgs.linuxPackages_zen is on 6.12 and not 6.13 like I thought (zen's current main is 6.12, which Arch's repo follows, so I'd have made the same mistake regardless), which explains why I needed to patch the patch to begin with. Basically I wasn't paying attention to version numbers.
Anyhow, because I'm updating the NixOS Wiki page on AMD GPUs and didn't want to leave people to assemble their own patch alone, I ended up making a fork that cherry-picks Mechanicus' patches onto the zen kernel, with a backport for 6.12: 6.12/amdgpu-stable and 6.13/amdgpu-stable (though not strictly necessary; you could use the patch.diff from Mechanicus if you prefer). Mechanicus' 6.13 merges fine onto zen 6.13, but 6.12 came with conflicts that needed to be resolved. I'm not really interested in the testing branch since I value a stable system, but anyone can do a cherry-pick themselves onto their own zen fork.
Last edited by CarlenWhite (2025-02-03 12:51:28)
Offline
grapplarr wrote:
I know this is a Vega thread, but I wanted to confirm that Mechanicus' build (6.13.1-arch1-2-amdgpu-testing) with mesa 1:24.3.4-1 made my system more stable, so thank you for that.
My system became unusable with latest kernel and mesa. Even opening a heavier web page like LinkedIn on Brave was enough to either reboot or freeze my computer.
It seems to play media just fine now *if* I don't abuse it - if I'm playing something using mpv or jellyfin and fast-forward/jump ahead too much, the system crashes and reboots.
My system:
Ryzen 5500
RX 6600
Have you tried https://bbs.archlinux.org/viewtopic.php … 0#p2223890 ? It's a workaround while the underlying issue is being investigated.
Yes, still had the aforementioned issues with both versions.
Offline
glibc-2.40
Build: linux-amdgpu-testing-6.13.1.arch1-4 - freezes for one user.
Included patches:
- Remove GFXOFF/PG workaround from amdgpu_amdkfd
- Remove GFXOFF workarounds from gfx part
- Remove GFXOFF workaround from amdgpu_dpm
- Remove GFXOFF workarounds from sdma
- Deny GFXOFF for amdgpu_job operations
- Deny GFXOFF for amdgpu_ring operations
- Switch PG to UNGATE before ringing doorbell for AMDGPU_RING_TYPE_COMPUTE (rework of the stable patch)
Build: linux-amdgpu-testing-6.13.1.arch1-5 - freezes.
Included patches:
- Remove GFXOFF/PG workaround from amdgpu_amdkfd
- Remove GFXOFF workarounds from gfx part
- Remove GFXOFF workaround from amdgpu_dpm
- Remove GFXOFF workarounds from sdma
- Deny GFXOFF for amdgpu_job operations
- Deny GFXOFF for amdgpu_ring operations
- Switch PG to UNGATE before ringing doorbell for AMDGPU_RING_TYPE_COMPUTE (rework of the stable patch)
- Remove 100ms delay from amdgpu_gfx_off_ctrl (from stable patches)
Build: linux-amdgpu-testing-6.13.1.arch1-6 - stable for several configurations
Included patches:
- Remove GFXOFF/PG workaround from amdgpu_amdkfd
- Remove GFXOFF workarounds from gfx part
- Remove GFXOFF workaround from amdgpu_dpm
- Remove GFXOFF workarounds from sdma
- Deny GFXOFF in amdgpu_ring_commit
What to check:
- stability
If stable:
- temperature
- average package power (sudo turbostat -s PkgWatt)
- performance - for example WebGL Aquarium fps for different numbers of fish
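For the average package power item, the per-interval PkgWatt readings can be reduced to a single number. The sketch below only does the averaging: it reads text on stdin, ignores header lines, and averages every line that is a bare number. average_watts is a hypothetical helper, and the assumption is that turbostat is run so that each sample is printed as one number per line.

```shell
# Sketch: average numeric samples (e.g. PkgWatt readings) read from stdin.
# ASSUMPTION: each sample is a single number on its own line; repeated
# header lines such as "PkgWatt" are skipped.
average_watts() {
    awk 'NF == 1 && $1 ~ /^[0-9]+(\.[0-9]+)?$/ { sum += $1; n++ }
         END { if (n) printf "%.2f\n", sum / n; else print "NA" }'
}
```

One possible invocation, assuming the long options match the installed turbostat version, is `sudo turbostat --quiet --Summary --show PkgWatt --interval 5 --num_iterations 12 | average_watts`, which would average about a minute of samples. Reporting idle and under-load averages separately makes results easier to compare between builds.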
Kernel option to keep during testing period: fsck.mode=force
NOTE: kernels compiled with glibc 2.40. Please don't update your system to glibc 2.41 which was moved to Arch stable today.
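Before installing one of these builds, the installed glibc version can be checked against the one named in the note. A minimal sketch: glibc_matches is a hypothetical helper, and it relies on `getconf GNU_LIBC_VERSION` printing a string of the form "glibc <version>".

```shell
# Sketch: check whether the installed glibc matches an expected version.
# ASSUMPTION: `getconf GNU_LIBC_VERSION` prints "glibc <version>".
# Usage: glibc_matches WANT [HAVE]; HAVE defaults to the running system.
glibc_matches() {
    local want="$1" have="${2:-$(getconf GNU_LIBC_VERSION 2>/dev/null)}"
    [ "${have#glibc }" = "$want" ]
}
```

For example, `glibc_matches 2.40 || echo "glibc differs from the build environment"` prints a warning when the system has already moved to another version.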
laikm from pacoandres for reporting issues
Last edited by Mechanicus (2025-02-04 09:03:58)
Offline