#1 2024-10-22 16:49:34

skjaldmaer
Member
Registered: 2024-10-22
Posts: 3

amdgpu: black screen, audio looping, system unresponsive

System spec:

  • Linux kernel 6.10.10 (downgraded from 6.11.4 where the problem occurred too)

  • AMD Radeon RX 6650 XT

Symptoms:

After a random amount of time, the screen would go dark and display "no signal".
If there was audio playing, it started looping.
System was totally unresponsive, and forcing a shutdown through the power button was the only option.
Caps lock light was not flickering.

This happens randomly: on the same day, I used the system for about 2 hours without problems, and on the next boot it crashed.

What I tried:

  • For context, I have a dual-boot setup with Windows (fast startup and hibernation disabled, on a different physical disk), which is relegated to gaming purposes.
    Windows started crashing and freezing badly while gaming. I identified a couple of updates that were reported to cause memory problems and random restarts (without any shutdown sequence or warning) for other users, and removed them. This fixed the random restarts.
    I was able to run a couple of games, but one of the heavier ones still caused a complete freeze.

  • After this, crashes started on Linux too, so I suspected the RAM had been affected, as those Windows updates were known to possibly cause problems, but a Memtest run found no errors.
    However, the kernel and graphics drivers had been updated the day before, so the Windows problems may have been a coincidence as well.

  • Originally, this happened with kernel 6.11.4 while watching videos. After searching the forum, I came across this, which seemed to match my problem.

  • Hence I suspected a kernel regression and downgraded the kernel to the latest 6.10 release.
    This seemed to resolve the issue, as I was able to use the system for a couple of hours, programming and listening to music.
    However, on the afternoon of the same day, in the same situation, the crash happened again. This time I recovered the log shown below through journalctl (only the error part; I can submit the full log if needed). Unfortunately, logs from previous crashes did not show anything.

  • I also found this, which shows the same symptoms that I have, with the same errors.
    The card is not new, however, and has worked flawlessly since I bought it around 2 years ago. The power supply is connected and the fans do start under heavy load (e.g., when gaming on Windows). At least from the outside, it also seems properly seated.
    I haven't tried reinstalling it, as for now I am trying to rule out software-related issues, but I will try if I don't find anything.

Since the drivers and the WM/DM were also updated recently, I downgraded them to versions that I recall working more than fine, so as to at least determine whether the problem is hardware-related.

The log:

Oct 22 15:48:45 darkness kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:40 param:0x00000000 message:AllowGfxOff?
Oct 22 15:48:45 darkness kernel: amdgpu 0000:03:00.0: amdgpu: Failed to enable gfxoff!
Oct 22 15:48:47 darkness kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:40 param:0x00000000 message:AllowGfxOff?
Oct 22 15:48:47 darkness kernel: amdgpu 0000:03:00.0: amdgpu: Failed to enable gfxoff!
Oct 22 15:48:55 darkness kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:85:crtc-0] flip_done timed out
Oct 22 15:48:57 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=15064, emitted seq=15067
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset end with ret = -19
Oct 22 15:48:57 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -19
Oct 22 15:48:57 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=5275, emitted seq=5276
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset end with ret = -19
Oct 22 15:48:57 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -19
Oct 22 15:49:07 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=15064, emitted seq=15067
Oct 22 15:49:07 darkness kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Oct 22 15:49:07 darkness kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Oct 22 15:49:07 darkness kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset end with ret = -19
Oct 22 15:49:07 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -19
Oct 22 15:49:13 darkness kernel: amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
Oct 22 15:49:13 darkness kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:85:crtc-0] commit wait timed out
Oct 22 15:49:17 darkness kernel: amdgpu 0000:03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2
Oct 22 15:49:23 darkness kernel: amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
Oct 22 15:49:23 darkness kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CONNECTOR:121:HDMI-A-1] commit wait timed out
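
For anyone comparing against their own journal, here is a minimal Python sketch that summarizes a log like the one above. It only counts ring timeouts and "device lost from bus" messages, and translates the GPU reset return code into its errno name. The function name summarize and the shortened SAMPLE are my own choices for illustration, not part of any amdgpu tooling.

```python
import errno
import re

# A few lines copied from the log above; a full journal dump works the same way.
SAMPLE = """\
Oct 22 15:48:45 darkness kernel: amdgpu 0000:03:00.0: amdgpu: Failed to enable gfxoff!
Oct 22 15:48:57 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=15064, emitted seq=15067
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: device lost from bus!
Oct 22 15:48:57 darkness kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset end with ret = -19
Oct 22 15:48:57 darkness kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=5275, emitted seq=5276
"""

RING_TIMEOUT = re.compile(r"ring (\S+) timeout")
RESET_RET = re.compile(r"GPU reset end with ret = (-?\d+)")


def summarize(text):
    """Count ring timeouts, bus-loss messages and GPU reset return codes."""
    ring_timeouts = {}
    reset_errors = {}
    device_lost = 0
    for line in text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match:
            ring = match.group(1)
            ring_timeouts[ring] = ring_timeouts.get(ring, 0) + 1
        match = RESET_RET.search(line)
        if match:
            code = int(match.group(1))
            # The kernel reports failures as negative errno values.
            reset_errors[code] = errno.errorcode.get(-code, "unknown")
        if "device lost from bus" in line:
            device_lost += 1
    return {
        "ring_timeouts": ring_timeouts,
        "reset_errors": reset_errors,
        "device_lost": device_lost,
    }


print(summarize(SAMPLE))
```

On Linux, the return code -19 maps to ENODEV ("No such device"), which is consistent with the "device lost from bus!" lines: the reset fails because the driver can no longer reach the card, not because the reset itself went wrong.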

More stuff that I checked but unfortunately did not help:

Last edited by skjaldmaer (2024-10-22 17:03:27)


#2 2024-10-22 17:35:51

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 23,451

Re: amdgpu: black screen, audio looping, system unresponsive

If Windows is crashing as well, I strongly suspect HW trouble. Anecdotally, a friend of mine was complaining about game crashes (on Windows), also under heavier load. We reproduced the same on a Linux live system. Then, on actually checking inside, the power cable turned out to be fried; a borderline miracle that it only died under "heavier" load. So don't just check from the outside, open it up and check what the situation looks like.


#3 2024-10-22 18:54:55

skjaldmaer
Member
Registered: 2024-10-22
Posts: 3

Re: amdgpu: black screen, audio looping, system unresponsive

I just opened it up and took a look around.
All cables appear to be in fine condition, and no socket seemed strange or fried.
The GPU was seated well, with no loose connections. In general, nothing seemed out of the ordinary.

As a side note, maybe I should have clarified it better in the original post, but as of now, I've been able to crash only a single game on Windows, and unfortunately not in a reproducible manner.
That game also received a patch before the problems started, so it may be that too. I also checked the integrity of its files, but it was all in order.
I also tested another game, run at 1.2x rendering on 2560x1440 with no upscaling and the highest graphics settings I could choose (no ray tracing), for 1 to 2 hours, and it did not show any problems.
I don't know if that can be qualified as a "heavy" load though, as the game itself is not as recent as the other one.


#4 2024-10-23 08:13:01

coxe87b
Member
From: Canberra
Registered: 2019-12-08
Posts: 86

Re: amdgpu: black screen, audio looping, system unresponsive

I have resolved the crashing issues I recently experienced, for now, by using the linux-lts kernel.
It may be worth trying that for a while to identify if it is indeed kernel related or not.


Desktop: Arch Linux  |  i3-gaps WM  |  AMD Ryzen 5700X  |  32GB RAM  |  AMD Radeon RX 6700XT  |  Dual monitors @ 2560x1440
Laptop: Debian Linux  |  i3WM  |  Dell Latitude E7270  |  Intel Core i5-6300U  |  16GB RAM
~ Do or do not, there is no try ~


#5 2024-10-23 13:59:49

skjaldmaer
Member
Registered: 2024-10-22
Posts: 3

Re: amdgpu: black screen, audio looping, system unresponsive

As of today, I have run a stress test through glmark2 on a system with the kernel downgraded to 6.10.10-arch1-1 and Mesa downgraded to 24.2.3, rendering on a 2560x1440 surface. I didn't have any crashes or hangs, and the GPU fans spun up just fine.
I also checked whether Windows detected any HW problems, but everything was reported as good. I don't know how much to trust it though.

I'll continue to daily-drive this configuration and see if it holds. But I think I'll also try and use the LTS kernel in the future with the most recent drivers as well.

For the sake of documentation, I'll leave this issue here, as it shows an error that appears in my logs too.
I may try to set amdgpu.runpm=0 and see if that works with the latest kernel too.
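
To confirm that the parameter actually reached the kernel after a reboot, a minimal Python sketch like the following can parse a kernel command line. The helper name module_param and the example command line are hypothetical, made up for illustration; on a running system you would pass the contents of /proc/cmdline instead.

```python
def module_param(cmdline, module, name):
    """Return the last value given for module.name on a kernel command line, or None."""
    key = f"{module}.{name}="
    value = None
    for token in cmdline.split():
        if token.startswith(key):
            # Keep scanning so that a later duplicate overrides an earlier one.
            value = token[len(key):]
    return value


# Hypothetical command line, not copied from my system.
example = "BOOT_IMAGE=/vmlinuz-linux root=/dev/sda2 rw amdgpu.runpm=0 quiet"
print(module_param(example, "amdgpu", "runpm"))  # prints 0
```

This only looks at the command line; a value set through an options line in a modprobe.d file would not show up here.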
