AMDGPU testing
Build: linux-amdgpu-testing-6.13.1.arch1-4, linux-amdgpu-testing-headers-6.13.1.arch1-4
Included patches:
- Remove GFXOFF/PG workaround from amdgpu_amdkfd
- Remove GFXOFF workarounds from gfx part
- Remove GFXOFF workaround from amdgpu_dpm
- Remove GFXOFF workarounds from sdma
- Deny GFXOFF for amdgpu_job operations
- Deny GFXOFF for amdgpu_ring operations
- Switch PG to UNGATE before ringing doorbell for AMDGPU_RING_TYPE_COMPUTE (rework of the stable patch)
Build: linux-amdgpu-testing-6.13.1.arch1-5, linux-amdgpu-testing-headers-6.13.1.arch1-5
Included patches:
- Remove GFXOFF/PG workaround from amdgpu_amdkfd
- Remove GFXOFF workarounds from gfx part
- Remove GFXOFF workaround from amdgpu_dpm
- Remove GFXOFF workarounds from sdma
- Deny GFXOFF for amdgpu_job operations
- Deny GFXOFF for amdgpu_ring operations
- Switch PG to UNGATE before ringing doorbell for AMDGPU_RING_TYPE_COMPUTE (rework of the stable patch)
- Remove 100ms delay from amdgpu_gfx_off_ctrl (from stable patches)
Build: linux-amdgpu-testing-6.13.1.arch1-6, linux-amdgpu-testing-headers-6.13.1.arch1-6
Included patches:
- Remove GFXOFF/PG workaround from amdgpu_amdkfd
- Remove GFXOFF workarounds from gfx part
- Remove GFXOFF workaround from amdgpu_dpm
- Remove GFXOFF workarounds from sdma
- Deny GFXOFF for amdgpu_ring operations only
What to check:
- stability
If stable:
- temperature
- average package power (sudo turbostat -s PkgWatt)
- performance - for example WebGL Aquarium fps for different numbers of fish
Kernel option to keep during testing period: fsck.mode=force
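For the package power check, the PkgWatt samples have to be averaged over the test run. A minimal sketch in Python, assuming the output of `sudo turbostat -s PkgWatt` was saved as text with a `PkgWatt` header line followed by one number per line (the exact layout may differ between turbostat versions; the function name and the sample values are made up for illustration):

```python
def average_pkgwatt(text):
    """Average all numeric samples in saved turbostat output.

    Assumption: every sample sits alone on its line; header lines such as
    "PkgWatt" and any other non-numeric lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        field = line.strip()
        try:
            samples.append(float(field))
        except ValueError:
            continue  # header, blank line, or other non-numeric text
    if not samples:
        raise ValueError("no PkgWatt samples found")
    return sum(samples) / len(samples)


# Made-up sample values, only to show the call.
example = "PkgWatt\n12.50\n13.50\nPkgWatt\n14.00\n"
print(f"{average_pkgwatt(example):.2f} W")
```

Averaging over the same workload and the same duration on each kernel keeps the numbers comparable between reports.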
Both kernels freeze my system instantly as soon as GDM is loaded. I must be doing something wrong. Also, why is pahole a dependency for headers?
I'll try to add a new entry for the testing kernel and see how it goes, but amdgpu.ppfeaturemask=0xfff73fff definitely solves the issue.
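The mask value above can be read as the driver's default feature mask with the GFXOFF bit cleared. A small sketch of that arithmetic, assuming the default `amdgpu.ppfeaturemask` is 0xfff7bfff and that GFXOFF is bit 15 (`PP_GFXOFF_MASK = 0x8000` in the kernel's amd_shared.h); both values are assumptions worth checking against your kernel source, and the helper names are hypothetical:

```python
# Assumed values (believed to match the amdgpu driver sources);
# verify them for your kernel version.
DEFAULT_PPFEATUREMASK = 0xFFF7BFFF
PP_GFXOFF_MASK = 0x8000  # bit 15


def without_gfxoff(mask):
    """Return the feature mask with the GFXOFF bit cleared."""
    return mask & ~PP_GFXOFF_MASK


def gfxoff_enabled(mask):
    """True if the GFXOFF bit is set in the mask."""
    return bool(mask & PP_GFXOFF_MASK)


print(hex(without_gfxoff(DEFAULT_PPFEATUREMASK)))
print(gfxoff_enabled(0xFFF73FFF))
```

Clearing a single bit this way leaves the other power features at their default state.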
Offline
All 3 test kernels from the last post freeze the system instantly.
title AMDGPU TESTING (linux)
linux /vmlinuz-linux-amdgpu-testing
initrd /initramfs-linux-amdgpu-testing.img
options root=PARTUUID=x-x-x-x rw rootfstype=ext4
Nothing unusual, this error is always present for some reason:
16:39:10 archlinux kernel: amdgpu 0000:07:00.0: amdgpu: psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
16:39:10 archlinux kernel: amdgpu 0000:07:00.0: amdgpu: psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
16:39:10 archlinux kernel: amdgpu 0000:07:00.0: amdgpu: Secure display: Generic Failure.
16:39:10 archlinux kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
EDIT: Oh, I see the edit about the glibc update.
EDIT1: It freezes with glibc 2.40 at GDM load.
Last edited by lpr1 (2025-02-03 16:05:41)
Offline
AMDGPU stable
glibc-2.41
Build: linux-amdgpu-stable-6.13.1.arch1-3, linux-amdgpu-stable-headers-6.13.1.arch1-3
Included patches:
- Remove 100ms delay from amdgpu_gfx_off_ctrl
- Add PG to amdgpu_gfx_enforce_isolation_ring_begin_use/end_use
Offline
Ok, this one loads fine, I guess I rolled back to the wrong 2.40 version (since there are multiple). Should I test this one?
Offline
Yes. This is just a rebuild of stable patches on top of Linux 6.13.1 and with new glibc 2.41+r2+g0a7c7a3e283a-1.
Last edited by Mechanicus (2025-02-03 17:02:48)
Offline
lpr1 wrote: All 3 test kernels from the last post freeze the system instantly.
Just to clarify: you installed those kernels after today's system update (that comes with new glibc and gcc), or before that?
Offline
Probably, since I had to roll back to 2.40. I don't know when they got updated because I didn't pay attention, but I rolled back to glibc-2.40-1-x86_64.pkg.tar.zst (22-Jul-2024 16:52, 10M) from the archive, which is probably the wrong version, since there are newer ones (glibc-2.40+r16+gaa533d58ff-1-x86_64.pkg.tar.zst, 03-Aug-2024 16:49, 10M). linux-amdgpu-stable-6.13.1.arch1-3-x86_64 loads fine with glibc 2.41 and I'm testing it now. So far it's good, but we'll see over time.
Offline
Well... I asked because I briefly tested all three builds on my Raven machine, and all of them seemed stable.
Last edited by Mechanicus (2025-02-03 19:55:08)
Offline
I tested the testing builds 1-3 and 1-4 before and after the glibc update (I noticed there was an update after the first test), and the system froze in less than one minute with both releases and with both glibc versions, 2.40 and 2.41.
Now I'm testing 6.13.1-arch1-3-amdgpu-stable and I think it's good.
Temp and power are similar to repo kernel with patched mesa.
All tests have passed except the recovery one. I'll try that one when I stop working.
EDIT: I forgot to post the aquarium results (I'm using a 75 Hz monitor):
5,000 fish: 75 fps, 38 W, 55 ºC
10,000 fish: 65 fps, 46 W, 66 ºC
15,000 fish: 45 fps, 46 W, 68 ºC
20,000 fish: 35 fps, 46 W, 68 ºC
25,000 fish: 30 fps, 46 W, 68 ºC
30,000 fish: 25 fps, 46 W, 68 ºC
Last edited by pacoandres (2025-02-03 18:12:34)
Offline
Sorry for the rough post, but I've been following this because of my own AMD cards and iGPU crashing. Here is my own experience.
I have come to the conclusion that it's AMD's dreaded "reset bug". The GFXOFF patch won't work, because you're re-enabling it, which can still trigger the "power off/low power" portion of the code that results in the bug.
I turned off GFXOFF in my BIOS (ASRock B650E Taichi [non-lite]) and have had 100% stability since, with the currently released Arch kernel. Uptime is a little over 9 hours, which is 8 hours more than I could get with any previous attempt, and all of my previously reproducible ways of crashing the system now fail to trigger it. I've tried to crash my 7900XTX and my Raphael iGPU (7900X3D), and they have had no instability since I turned this feature off.
The reason I've come to the conclusion that it's the "reset bug" is that GFXOFF was added but turned off in ~2018, and we had no issues.
It was recently turned on in 6.8, ~8 months ago. I had roughly 4 months of stability before the crashing was introduced, due to not having an updated BIOS. I updated my BIOS and began crashing. I bought a new PSU and even went as far as changing power outlets, thinking the power was just too dirty/unstable. It got better but was inconsistent. After a kernel update I kept getting random power-offs (resets, not full power-offs) or black screens. It was actually the GPU blacking out, which is exactly the same symptom I had when trying to use a Windows 10 VM (black screen / host crashing with AMD GPUs going back through POST). I bought a WX3100 work card from eBay and had the same results: constant instability (even with the iGPU disabled).
After disabling GFXOFF in the BIOS, all of the instability and my own reproducible crashes have stopped. I've played games, watched YouTube, played about 5 videos in MPV at the same time (while skimming 1+ hour long videos), and used CTRL + PAGEDOWN to swap tabs rapidly in Firefox, and it hasn't frozen, crashed, or black-screened.
Linux aki 6.12.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 18 Jan 2025 02:26:57 +0000 x86_64 GNU/Linux
13:00:23 up 9:03, 1 user, load average: 5.31, 5.12, 5.46
Source for my own thoughts:
https://www.phoronix.com/search/GFXOFF
2018 - Default Off, Tested and caused issues
2024-01-19 - Enabled, crashing/instability.
*Edited the uptime, because I read the wrong number
Last edited by Sora (2025-02-03 18:22:04)
Offline
Mechanicus wrote: Well... I asked because I tested a bit all three builds on my Raven machine. And all of them seemed stable.
Do you have a Vega machine for testing?
I'm testing on a refurb 'HP EliteDesk 705 G4', which only costs a little over $100.00 atm.
They're readily available on Amazon for example: https://www.amazon.com/HP-EliteDesk-Min … NB7RD?th=1
Would having a Vega-based system streamline testing at this point? Would you like one for testing, and would it be worth it at this stage?
Perhaps people could donate to get you one for free?
I'd be willing to throw in ~$25 to get you set up if 4-5 more people would step in.
Anyone else?
Scripts I use: https://github.com/Cody-Learner
Offline
I have an Athlon 200GE with Vega 3. I run some tests on it while it is not in use. Also, I think it is better to cover a wider range of AMD chips and not focus on Vega only.
Why? Because the fix proposed to the kernel is GFX9-only, but the instability issues are not limited to the Vega family.
My goal is to eliminate the root cause, not the consequences.
Last edited by Mechanicus (2025-02-03 20:17:34)
Offline
The recovery test passed: it takes less than a second to recover and everything works. After this test I repeated some more tests and everything works without freezes.
A more detailed description with logs here.
Last edited by pacoandres (2025-02-03 19:40:03)
Offline
glibc-2.41
Build: linux-amdgpu-testing-6.13.1.arch1-7, linux-amdgpu-testing-headers-6.13.1.arch1-7
Included patches:
- Remove 100ms delay from amdgpu_gfx_off_ctrl
- Deny GFXOFF in amdgpu_ring_commit for GFX IP (with PG workaround)
What to check:
- stability
If stable:
- temperature
- average package power (sudo turbostat -s PkgWatt)
- performance - for example WebGL Aquarium fps for different numbers of fish
Kernel option to keep during testing period: fsck.mode=force
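Before starting a test run it may help to confirm that the requested option is really on the kernel command line. A minimal sketch, assuming the command line is read from /proc/cmdline; the helper names are hypothetical and the example string is made up:

```python
def has_kernel_option(cmdline, option):
    """Check whether an exact option token is present on the kernel command line."""
    return option in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Made-up command line, only to show the call.
example = "root=PARTUUID=x-x-x-x rw rootfstype=ext4 fsck.mode=force"
print(has_kernel_option(example, "fsck.mode=force"))
```

On a real system the check would be `has_kernel_option(read_cmdline(), "fsck.mode=force")`.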
laikm from pacoandres for reporting issues
Last edited by Mechanicus (2025-02-04 13:53:40)
Offline
NuSkool wrote: Perhaps people could donate to get you one for free? I'd be willing to throw in ~$25 to get you setup if 4-5 more people would step in.
This proposal is very kind of you, but Amazon doesn't ship anything to Ukraine directly. Through this "black hole" only our US managers can launder billions.
Offline
Mechanicus wrote: Well... I asked because I tested a bit all three builds on my Raven machine. And all of them seemed stable.
Got it. Yes, I got an instant freeze on all 3 kernels, and it seems it's not related to the glibc version, as @pacoandres experienced a similar thing. I also don't think this is related to the workarounds, unless those patches are relatively recent, but I don't know. I see new builds, so what are we supposed to test now?
Offline
NuSkool wrote: I'm testing on a refurb 'HP EliteDesk 705 G4', which only costs a little over $100.00 atm.
I have exactly the same machine with a Ryzen 2400G (Vega 11). Maybe I can help in some way?
The problem is that I don't have a lot of time, being away from home almost all day...
Offline
lpr1 wrote: I see new builds, so what are we supposed to test now?
linux-amdgpu-testing-6.13.1.arch1-7 and -8 are aimed at finding the actual place where the problem starts.
Last edited by Mechanicus (2025-02-03 23:59:09)
Offline
I was all ready to write up a 28hr report on linux-amdgpu-stable 6.13.1.arch1-2
Unfortunately, I've been wasting all this time retesting linux-amdgpu 6.13.arch1-2!
A 'uname -rs' saved me from giving a false report...
I'll move on to testing linux-amdgpu-stable 6.13.1.arch1-2 unless I hear otherwise.
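To avoid testing the wrong kernel, the running release can be compared with the package under test. A rough sketch, where the naming rule is only inferred from the pacman/uname pair reported in this post (package `linux-amdgpu-stable` at `6.13.1.arch1-2` running as `6.13.1-arch1-2-amdgpu-stable`); it may not hold for every version string, and the function names are hypothetical:

```python
import platform


def expected_release(pkgname, pkgver):
    """Guess the `uname -r` string for an installed kernel package.

    Assumption: the release is the package version with ".arch" turned into
    "-arch", followed by the package name without its leading "linux".
    """
    suffix = pkgname[len("linux"):] if pkgname.startswith("linux") else ""
    return pkgver.replace(".arch", "-arch") + suffix


def running_expected(uname_r, pkgname, pkgver):
    """True if the given release string matches the package under test."""
    return uname_r == expected_release(pkgname, pkgver)


# platform.release() returns the same string as `uname -r`.
print(running_expected(platform.release(), "linux-amdgpu-stable", "6.13.1.arch1-2"))
```

Running this right after booting makes a mismatch visible before hours of testing are spent on the wrong kernel.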
EDIT:
Preliminary report on:
pacman -Q linux-amdgpu-stable : linux-amdgpu-stable 6.13.1.arch1-2
uname -rs : Linux 6.13.1-arch1-2-amdgpu-stable
pacman -Q glibc : glibc 2.40+r66+g7d4b6bcae91f-1
cat /proc/cmdline : .... rw loglevel=3 sysrq_always_enabled=1 amd_pstate=passive fsck.mode=force
Power usage at idle desktop w/ chromium :           129F   2.62W
WebGL Aquarium Chromium :  5000 fish  60fps  154F  13.12W
                          10000 fish  60fps  159F  17.39W
                          20000 fish  40fps  162F  17.19W
                          30000 fish  31fps  169F  17.95W
Note: Firefox: ~ 8-10 fps less
Only about an hour into this test, seems stable enough to do some testing, but I'll add more results in about 24hr.
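The temperatures in this report are in Fahrenheit, while other reports in the thread use Celsius. A quick conversion sketch for comparing them (plain arithmetic; the input values are copied from the table above and the function name is just for illustration):

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# Idle and WebGL Aquarium temperatures from this report.
for temp_f in (129, 154, 159, 162, 169):
    print(f"{temp_f}F = {fahrenheit_to_celsius(temp_f):.1f}C")
```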
Last edited by NuSkool (2025-02-04 05:05:15)
Offline
Sora wrote: After disabling GFXOFF in the BIOS, all of the instability and my own reproducible crashes have stopped.
Have you tried any of the kernel or mesa patches that are suggested in this thread?
The point is that not everyone can disable GFXOFF in the BIOS, as some boards don't offer that option (mine doesn't; I'm using an Asus B450M-A). Some of the releases posted here bring stability to some of us.
Offline
Don't know if someone's testing linux-amdgpu-testing-6.13.1.arch1-7. I'm going to do it now.
Offline
NuSkool wrote: Only about an hour into this test, seems stable enough to do some testing, but I'll add more results in about 24hr.
I've copied these results into https://github.com/pacoandres/laikm/iss … 2633179337 to keep everything together.
Offline
Thank you, @NuSkool! I just want to clarify that the stable builds are for normal use in case a testing kernel crashes. linux-amdgpu-stable-6.13.1.arch1-2 and linux-amdgpu-stable-6.13.1.arch1-3 are the same; the only difference is the glibc version they are compiled with. So you can skip -2, update your system to the latest, and use -3.
Offline
Mechanicus wrote: linux-amdgpu-testing-6.13.1.arch1-7 and 8 aimed to find the actual place where the problem starts.
Alright, but they/you are moving too fast. 2-3 days are required to be sure whether there is an issue or not, so I'll keep testing 1-2 for now, for at least 2 more days.
Offline
Well... probably. This is my typical speed of problem resolution, sorry
Offline