Ah, sorry Dobie, 6.10.2 is still working flawlessly here, but I guess I will notice if it hits. On 6.10.4 it happened within an hour or two, whereas it took 2 days before the first occurrence on 6.10.3. And yes, the one I built myself here is, so far, so good. I will of course update if it turns out it happens on this one as well.
Last edited by pliny (2024-08-13 19:48:36)
Offline
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[60887a89986f0b762e0916acd55068d525ff63fb] drm/udl: Remove DRM_CONNECTOR_POLL_HPD
$ git describe
v6.10.2-697-g60887a89986f
linux-stable-6.10.2.r697.g60887a89986f-1-x86_64.pkg.tar.zst/linux-stable-headers-6.10.2.r697.g60887a89986f-1-x86_64.pkg.tar.zst.
crash
Offline
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[09a67694edd1f787c206cd2fd066b0ca37debe9f] drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell
$ git describe
v6.10.2-696-g09a67694edd1
linux-stable-6.10.2.r696.g09a67694edd1-1-x86_64.pkg.tar.zst/linux-stable-headers-6.10.2.r696.g09a67694edd1-1-x86_64.pkg.tar.zst
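The version part of these package names appears to be derived from the `git describe` output shown above. A minimal sketch of that mapping, assuming plain `sed` and a helper name (`describe_to_pkgver`) that is made up here and not taken from the actual PKGBUILD:

```shell
#!/bin/sh
# Hypothetical helper: turn `git describe` output into the version string
# seen in the package file names (v6.10.2-696-g09a67694edd1 ->
# 6.10.2.r696.g09a67694edd1). The real PKGBUILD may do this differently.
describe_to_pkgver() {
    # First expression drops the leading "v".
    # Second expression rewrites "-<count>-g" as ".r<count>.g".
    printf '%s\n' "$1" | sed -e 's/^v//' -e 's/-\([0-9][0-9]*\)-g/.r\1.g/'
}

describe_to_pkgver v6.10.2-696-g09a67694edd1
# prints 6.10.2.r696.g09a67694edd1
```

This makes it easier to match a test package back to the exact commit it was built from.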
Offline
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[09a67694edd1f787c206cd2fd066b0ca37debe9f] drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell
$ git describe
v6.10.2-696-g09a67694edd1
linux-stable-6.10.2.r696.g09a67694edd1-1-x86_64.pkg.tar.zst/linux-stable-headers-6.10.2.r696.g09a67694edd1-1-x86_64.pkg.tar.zst
crash
Offline
$ git bisect bad
09a67694edd1f787c206cd2fd066b0ca37debe9f is the first bad commit
commit 09a67694edd1f787c206cd2fd066b0ca37debe9f (HEAD)
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Tue Jul 9 17:54:11 2024 -0400

    drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell

    commit a03ebf116303e5d13ba9a2b65726b106cb1e96f6 upstream.

    We seem to have a case where SDMA will sometimes miss a doorbell
    if GFX is entering the powergating state when the doorbell comes in.
    To workaround this, we can update the wptr via MMIO, however,
    this is only safe because we disallow gfxoff in begin_ring() for
    SDMA 5.2 and then allow it again in end_ring().

    Enable this workaround while we are root causing the issue with
    the HW team.

    Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/3440
    Tested-by: Friedrich Vock <friedrich.vock@gmx.de>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org
    (cherry picked from commit f2ac52634963fc38e4935e11077b6f7854e5d700)
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [d29de02effd4e8816333582ed8230d41e14a73dc] Linux 6.10.3
git bisect bad d29de02effd4e8816333582ed8230d41e14a73dc
# status: waiting for good commit(s), bad commit known
# good: [2d002356c3bb628937e0fb5d72a91dc493a984fe] Linux 6.10.2
git bisect good 2d002356c3bb628937e0fb5d72a91dc493a984fe
# good: [e158d5ecd0c0676d13524275f8b7ff06f4468163] usb: typec-mux: nb7vpq904m: unregister typec switch on probe error and remove
git bisect good e158d5ecd0c0676d13524275f8b7ff06f4468163
# good: [7b17fbfe4fd6b8da85b345d819cdce6ed4ee1d24] media: i2c: alvium: Move V4L2_CID_GAIN to V4L2_CID_ANALOG_GAIN
git bisect good 7b17fbfe4fd6b8da85b345d819cdce6ed4ee1d24
# bad: [60609323f13288b4d39d96080c3c58d06a43ccef] ASoC: codecs: wcd939x: Fix typec mux and switch leak during device removal
git bisect bad 60609323f13288b4d39d96080c3c58d06a43ccef
# good: [154d33dc8de81a9f61fa8f4d2004f0a2f2d829fe] ubi: eba: properly rollback inside self_check_eba
git bisect good 154d33dc8de81a9f61fa8f4d2004f0a2f2d829fe
# good: [8192c533e89d9fb69b2490398939236b78cda79b] scsi: qla2xxx: Fix for possible memory corruption
git bisect good 8192c533e89d9fb69b2490398939236b78cda79b
# good: [1a802eaa152b8380070a65e1e11710d880a99291] drm/i915/gt: Do not consider preemption during execlists_dequeue for gen8
git bisect good 1a802eaa152b8380070a65e1e11710d880a99291
# bad: [9f33d44ab5ef592f2b4175d7fc87dc44291e2832] drm/amd/amdgpu: Fix uninitialized variable warnings
git bisect bad 9f33d44ab5ef592f2b4175d7fc87dc44291e2832
# bad: [972dd51f1857b1da0aeedcfbe4a32e9aea743542] drm/dp_mst: Fix all mstb marked as not probed after suspend/resume
git bisect bad 972dd51f1857b1da0aeedcfbe4a32e9aea743542
# bad: [60887a89986f0b762e0916acd55068d525ff63fb] drm/udl: Remove DRM_CONNECTOR_POLL_HPD
git bisect bad 60887a89986f0b762e0916acd55068d525ff63fb
# bad: [09a67694edd1f787c206cd2fd066b0ca37debe9f] drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell
git bisect bad 09a67694edd1f787c206cd2fd066b0ca37debe9f
# first bad commit: [09a67694edd1f787c206cd2fd066b0ca37debe9f] drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell
6.10.3 with 09a67694edd1f787c206cd2fd066b0ca37debe9f reverted:
linux-stable-6.10.3-1.1-x86_64.pkg.tar.zst/linux-stable-headers-6.10.3-1.1-x86_64.pkg.tar.zst
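For anyone who has saved a bisect log like the one above and just wants the culprit hash out of it (for example, to pass to a revert), here is a minimal sketch. The helper name `first_bad_commit` is made up for illustration; it only relies on the `# first bad commit: [...]` line that the log above ends with.

```shell
#!/bin/sh
# Hypothetical helper: read `git bisect log` text on stdin and print the
# hash from the "# first bad commit: [<hash>] ..." line, if there is one.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]\{1,\}\)\].*/\1/p'
}

# Usage inside a bisected tree (not run here, since it needs the repository):
#   git bisect log | first_bad_commit
```

If the bisection has not finished yet, the log has no such line and the helper prints nothing.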
Offline
Ah, sorry Dobie, 6.10.2 is still working flawlessly here, but I guess I will notice if it hits. On 6.10.4 it happened within an hour or two, whereas it took 2 days before the first occurrence on 6.10.3. And yes, the one I built myself here is, so far, so good. I will of course update if it turns out it happens on this one as well.
At this point I'm going to say that 6.10.2 is as bad as .3 & .4. I've had 2 random reboots this morning in less than an hour's usage. I've concluded that it also seems to be related to the graphics. Certainly unlike the behavior that I saw on this system on kernels before 6.10.x.
Offline
Seems like seth's suspicion has been confirmed.
So what's the process? Should we mention it in the issue that this patch should have fixed (https://gitlab.freedesktop.org/drm/amd/-/issues/3440), or do we create a new one?
Offline
pliny wrote: Ah, sorry Dobie, 6.10.2 is still working flawlessly here, but I guess I will notice if it hits. On 6.10.4 it happened within an hour or two, whereas it took 2 days before the first occurrence on 6.10.3. And yes, the one I built myself here is, so far, so good. I will of course update if it turns out it happens on this one as well.
At this point I'm going to say that 6.10.2 is as bad as .3 & .4. I've had 2 random reboots this morning in less than an hour's usage. I've concluded that it also seems to be related to the graphics. Certainly unlike the behavior that I saw on this system on kernels before 6.10.x.
It is possible that you have a different (maybe additional) issue, as the commit we identified here should not be in 6.10.2.
Last edited by Gnat (2024-08-14 11:54:31)
Offline
Seems like seth's suspicion has been confirmed.
Thanks seth.
So what's the process? Should we mention it in the issue that this patch should have fixed (https://gitlab.freedesktop.org/drm/amd/-/issues/3440), or do we create a new one?
I would start a new one and link back to the original issue.
Offline
@loqs could you compile a kernel with the patch?
https://gitlab.freedesktop.org/drm/amd/-/issues/3556
Offline
It is possible that you have a different (maybe additional) issue, as the commit we identified here should not be in 6.10.2.
Thanks for the response. I went ahead and did a full upgrade that included "6.10.4-arch2-1" and an update to Mesa & Firefox. Maybe what you guys are doing will correct my system.
Offline
@loqs could you compile a kernel with the patch?
https://gitlab.freedesktop.org/drm/amd/-/issues/3556
6.10.3 with https://gitlab.freedesktop.org/-/projec … -5.2.patch applied:
linux-stable-6.10.3-1.2-x86_64.pkg.tar.zst/linux-stable-headers-6.10.3-1.2-x86_64.pkg.tar.zst
Offline
Also seeing random reboots on my machine after updating to linux-6.10.4. Reverted back to 6.10.2 and it has been stable so far. Got no undervolting applied. Reboots were happening at random times doing random things, like opening a browser tab. Screen just goes black and logs are cut off at this point without any specific messages left for analysis.
Dual booting with Windows 11 through complete shutdown as recommended, and Windows didn't exhibit any random reboots which led me to blame one of the updates on Arch side.
Specs:
ROG Zephyrus Duo 16 (2023) GX650 GX650PY-NM044X
AMD Ryzen 9 7945HX with Radeon Graphics (AMD GPU driving the display)
NVIDIA Corporation GN21-X11 [GeForce RTX 4090 Laptop GPU] (rev a1)
2x 16GB DDR5-4800 SO-DIMM
Offline
pliny wrote: Ah, sorry Dobie, 6.10.2 is still working flawlessly here, but I guess I will notice if it hits. On 6.10.4 it happened within an hour or two, whereas it took 2 days before the first occurrence on 6.10.3. And yes, the one I built myself here is, so far, so good. I will of course update if it turns out it happens on this one as well.
At this point I'm going to say that 6.10.2 is as bad as .3 & .4. I've had 2 random reboots this morning in less than an hour's usage. I've concluded that it also seems to be related to the graphics. Certainly unlike the behavior that I saw on this system on kernels before 6.10.x.
I'm having a similar issue, and the update to 6.10.4 may have triggered something as a side effect. In my case I ended up with a corrupted bootloader and I have not been able to boot Arch since the update.
I'm using an AMD Ryzen 5 7600X (it has integrated graphics too), in case it's worth knowing all the affected hardware.
Offline
I have noticed that my system has rebooted twice. It happened again tonight just as the Linux firmware package was being updated, which caused the boot partition file system to be corrupted. I had to chroot in using the installation USB I always keep to hand, do a filesystem check on said partition and reinstall the kernel as a safeguard, and now I am back in. I have noted that both times I had a reboot, it happened while I was using Telegram and, while I can't be 100% sure, it seemed to be when it was playing video.
Offline
Anyone tried the kernel in #87 and got a reboot?
@blinkingbit, there's probably a different issue w/ your setup, the system doesn't boot at all.
Offline
Anyone tried the kernel in #87 and got a reboot?
@blinkingbit, there's probably a different issue w/ your setup, the system doesn't boot at all.
It is running very stable for me. Would be good if some of the others that had problems in this thread can also chime in.
Offline
Anyone tried the kernel in #87 and got a reboot?
@blinkingbit, there's probably a different issue w/ your setup, the system doesn't boot at all.
My system has returned to stability, I'm pretty certain because of the update to Mesa, as I went back to 6.10.4 and I'm no longer seeing the issue. So as you mentioned, I was most likely experiencing a different problem than the one being addressed here. Thank you for the assistance.
Offline
I compiled my kernel with the patch on 6.10.5 and it resolved the issue (I did not use #87 because I use a custom tkg kernel). I think this fix will be added in 6.11. I have had one random kernel panic since, but I think that's unrelated.
Last edited by rabaimorp (2024-08-15 16:22:31)
Offline
I have the same issue with the LTS kernel (linux-lts), and on the mainline kernels (linux + linux-zen) 6.10.4.
My system is:
AMD Ryzen 9 7950X3D
ASUS ROG x670E-E
Nvidia RTX4090 (through PRIME, main display is amdgpu)
I guess downgrading the kernel to 6.9 won't help me, and compiling a kernel with the patch is the only fix for now, but how does it explain the sudden reboots with the LTS kernel?
Offline
The offending patch was backported into LTS, https://cdn.kernel.org/pub/linux/kernel … Log-6.6.44
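To check a given stable release yourself, you can search its ChangeLog for the commit subject. A minimal sketch, assuming you have already downloaded the ChangeLog file locally; the helper name `changelog_has_commit` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the ChangeLog file in $1 contains the
# commit subject given in $2 (fixed-string match, no regex).
changelog_has_commit() {
    grep -qF -- "$2" "$1"
}

# Usage, assuming ChangeLog-6.6.44 has been saved to the current directory:
#   changelog_has_commit ChangeLog-6.6.44 \
#       'drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell' \
#       && echo "backported"
```

A match only shows that the patch is in that release, not that it is the cause of the reboots on a particular machine.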
Offline
I have the same issue with the LTS kernel (linux-lts)! and on the mainline kernels (linux + linux-zen) 6.10.4
My system is:
AMD Ryzen 9 7950X3D
ASUS ROG x670E-E
Nvidia RTX4090 (through PRIME, main display is amdgpu)
I guess downgrading the kernel to 6.9 won't help me, and compiling a kernel with the patch is the only fix for now, but how does it explain the sudden reboots with the LTS kernel?
Maybe that's a cstate problem?
Offline
I have had the same issue with random reboots recently. Applying the sdma5.2 patch from post #87 and compiling the linux 6.10.5 Arch kernel with it has kept the system from rebooting for about a day now.
CPU: Amd 7950x3d
RAM: 32x2GB DDR 6000 CL32
GPU: iGPU + 4090 (vfio)
Mobo: Asus ProArt x670e
Offline
I haven't seen 0001-drm-amdgpu-sdma5.2-limit-wptr-workaround-to-sdma-5.2.patch in the stable queue for 6.10 yet. Does anyone have any idea how long it will take? I assume we won't have to wait until 6.11, right?
Offline
That's pretty much Alex D's call - you have to ask on the upstream bug, but the generic answer is "when it's done"
Given that the offending patch was added w/ a point release and jammed into the LTS kernel, I don't see why one would have to wait for a version bump with this.
Offline