Starting a few days ago, my GPU seemingly crashes at random, after anywhere from a few minutes to a few hours of uptime.
The computer otherwise keeps working, except there is no video at all, and the monitors lose signal.
I have a Ryzen 5 5600X, an RX 6700 XT, and 32 GiB of RAM.
Here are some useful logs from journalctl:
Nov 12 18:31:28 Belphegor kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:80:crtc-1] flip_done timed out
Nov 12 18:31:30 Belphegor kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=703488, emitted seq=703490
Nov 12 18:31:30 Belphegor kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Discord pid 86701 thread Discord:cs0 pid 86709
Nov 12 18:31:30 Belphegor kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Nov 12 18:31:34 Belphegor kernel: amdgpu 0000:08:00.0: amdgpu: failed to suspend display audio
Nov 12 18:31:34 Belphegor kernel: amdgpu 0000:08:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Nov 12 18:31:34 Belphegor kernel: amdgpu 0000:08:00.0: amdgpu: Failed to disable gfxoff!
Nov 12 18:31:34 Belphegor kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:434
Nov 12 18:31:34 Belphegor kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:508
Nov 12 18:31:34 Belphegor kernel: [drm:dcn20_wait_for_blank_complete [amdgpu]] *ERROR* DC: failed to blank crtc!
Nov 12 18:31:34 Belphegor kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:442
Nov 12 18:31:34 Belphegor kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:516
Nov 12 18:31:34 Belphegor kernel: [drm:dcn20_wait_for_blank_complete [amdgpu]] *ERROR* DC: failed to blank crtc!
Nov 12 18:31:34 Belphegor kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 000000002e2d4f10; ring_buffer_end = 0000000075bea464; write_frame = 00000000457a5668
Nov 12 18:31:34 Belphegor kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Nov 12 18:31:35 Belphegor kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Nov 12 18:31:35 Belphegor kernel: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_disable_crtc line:544
Nov 12 18:31:35 Belphegor kernel: BUG: unable to handle page fault for address: ffffb7aba05206f8
Nov 12 18:31:35 Belphegor kernel: #PF: supervisor read access in kernel mode
Nov 12 18:31:35 Belphegor kernel: #PF: error_code(0x0000) - not-present page
Nov 12 18:31:35 Belphegor kernel: PGD 100000067 P4D 100000067 PUD 0Offline
I've had this kind of problem for a long time with 2 different PCs and AMD GPUs (the older one was Zen1 based with a Vega 64 and the new one is Zen3 based with a RX 5700 XT). I've never been able to completely solve the issue, so it still occurs, but it occurs very rarely now so that I can tolerate it.
I've searched quite a bit for solutions to this and it's really mostly voodoo stuff. But I'll tell you the things I've tried personally, some of which might help according to other posts on the web, but they also might not help in your specific case. You probably have to try several things until the problem either goes away or almost goes away for your specific setup.
So in no particular order, these are the things I've done which mitigated the problem for me at least (it's likely that some of those things are completely irrelevant, but after doing all that the situation improved massively for me, so I'll just keep it that way):
Ensure you have the latest UEFI firmware and software updates in general
Disable any overclocking if you have any
Use the kernel parameters:
amdgpu.noretry=0 amdgpu.lockup_timeout=1000 amdgpu.gpu_recovery=1 amdgpu.audio=0
(audio=0 will disable the HDMI audio feature). You can also try
iommu=pt
or
iommu=soft
(IOMMU must be on or auto in UEFI). Another thing to try:
pcie_aspm=off
(that will disable a PCIe power management feature). I also use
processor.max_cstate=5
because my Ryzens seem to generally have issues with the C6 power saving state. But that has probably nothing to do with the GPU issue.
In UEFI, set PCIe slot generation from "Auto" to "Gen4" (which is probably what you have) and set power supply current idle control to "Typical". You can also try to disable even more power saving features in UEFI, but for me that didn't really help.
Try
echo "high" > /sys/class/drm/card0/device/power_dpm_force_performance_level
(default is "auto")
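After editing the bootloader configuration it is easy to lose track of which of these parameters actually reached the running kernel. A minimal sketch of a check, assuming the values from the list above; the helper names are mine, the example command line is made up, and the parser splits on whitespace only, so quoted values containing spaces are not handled. The iommu setting is left out of the table because the post offers two alternatives for it.

```python
from pathlib import Path

# Values copied from the list above; whether they help is setup-specific.
SUGGESTED = {
    "amdgpu.noretry": "0",
    "amdgpu.lockup_timeout": "1000",
    "amdgpu.gpu_recovery": "1",
    "amdgpu.audio": "0",
    "pcie_aspm": "off",
    "processor.max_cstate": "5",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def not_applied(cmdline, wanted=SUGGESTED):
    """Return the names from `wanted` that are absent or set to another value."""
    current = parse_cmdline(cmdline)
    return [name for name, value in wanted.items() if current.get(name) != value]


def running_cmdline():
    """Read the command line of the currently running kernel."""
    return Path("/proc/cmdline").read_text()


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz-linux root=UUID=abcd rw quiet amdgpu.gpu_recovery=1 pcie_aspm=off"
print(not_applied(example))
```

On a real system, `not_applied(running_cmdline())` lists whatever still needs to be added.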
Most of the tips I've found have to do with various power saving stuff as you can see. With the above tips I managed to have this issue appear only very rarely (like once every 1-3 months) instead of every 1-2 days, which is of course a massive improvement and makes it very usable. I still have found no explanation why this occurs to begin with. Faulty hardware could also be a thing. There are even more voodoo tips out there. Good luck. 
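Since the performance level written through sysfs does not survive a reboot, it can be handy to script the change and remember the previous value. A rough sketch, assuming the GPU is card0 as in the path above (it may be card1 on some systems); the function names are my own, and writing to the real file needs root.

```python
from pathlib import Path

# Path taken from the post above; the card0 index is an assumption.
LEVEL_FILE = Path("/sys/class/drm/card0/device/power_dpm_force_performance_level")


def read_level(path=LEVEL_FILE):
    """Return the current performance level, e.g. "auto" or "high"."""
    return Path(path).read_text().strip()


def set_level(level, path=LEVEL_FILE):
    """Write a new level and return the previous one so it can be restored."""
    previous = read_level(path)
    Path(path).write_text(level + "\n")
    return previous
```

A typical use would be `old = set_level("high")` before a test session and `set_level(old)` afterwards, so the default "auto" behaviour comes back without a reboot.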
Last edited by m6x (2021-11-13 07:50:27)
Offline

Have you tried your system in Windows? If you have the same problems in Windows, it's probably a bad video card.
Offline
This started happening to me with my 6800x-xt with 5.15.x using the zen kernel. Downgrading to the zen version of 5.14 resolves the issue for me, and I'm currently testing the stock kernel to see if it happens there too.
EDIT: I should add that while it seems like the computer continues to be functional, what actually happens for me is the display turns off and the computer reboots, but the display doesn't turn back on until I power off and start it back up.
EDIT 2: Looks like stock 5.15.x has the same issue 
EDIT 3: On the off chance it's not the GPU, I should also add that I have a Ryzen 5900X
Last edited by prurigro (2021-11-22 19:56:31)
Offline
I have the same issue, but only when I either lock the screen or blank it. I could still SSH into the machine, but couldn't shut it down.
Here are my specs:
OS: Arch Linux x86_64 
Kernel: 5.14.16-zen1-1-zen 
CPU: Intel Xeon E5-1650 v2 (12) @ 4.000GHz 
GPU: AMD ATI Radeon RX 6600/6600 XT/6600M 
Memory: 765MiB / 32037MiB
Offline

Yep, 6600XT reporting in as broken.
When it tries to turn on the screens from standby (suspend or just idle, doesn't matter), they just flicker and go back off immediately. Turning one off and on might yield a picture, but it's frozen.
Journal gets spammed with this:
[drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
After a bit, timeout call traces into amdgpu and gpu_sched appear in the journal. I don't see a way to recover from there except forcing the machine off.
Offline
This is sounding like a number of very similar but different issues. @a1ex, are you only seeing this on 5.15? (I see @Yukiseekyo is hitting theirs on 5.14.)
Offline
Only on 5.15; 5.14 works fine for me, and 5.16rc works as well.
Offline
That's good to hear that 5.16 isn't triggering it for you! Hopefully that means they solved whatever the issue was.
Offline

That gives me hope. The best I could find is this bisection effort, which apparently still failed in the end:
https://lore.kernel.org/regressions/8e4 … uin.co.uk/
I haven't yet found a pattern with this issue. It tends to work for a while after a reboot, then some display wakeup suddenly kills the GPU driver.
From the description and the versions, this is very likely a different issue from the thread starter's, though (I first noticed it on some 5.15 version).
Offline
@a1ex: Yeah, it seems like that lines up more closely with your issue. It doesn't seem out of the question that more than one issue would have appeared in 5.15 considering all the AMD changes, but hopefully they're connected enough to be fixed together. Have you tested the 5.16 release candidate yet?
Offline
Yep, 6600XT reporting in as broken.
When it tries to turn on the screens from standby (suspend or just idle, doesn't matter), they just flicker and go back off immediately. Turning one off and on might yield a picture, but it's frozen.
Journal gets spammed with this:
[drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
After a bit, timeout call traces into amdgpu and gpu_sched appear in the journal. I don't see a way to recover from there except forcing the machine off.
I am seeing this same behavior with an XFX Swift 6600XT and kernel 5.15.7. I didn't try rolling back the kernel or any of the other power saving "optimizations" suggested in previous posts.
What seems to have improved the situation is switching the "power profile" to "compute" in /sys/class/drm/card0/device/pp_power_profile_mode, which appears to prevent both the GPU and the memory from going below a certain frequency. Maybe that has some indirect effect on the power saving features.
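The profile is selected by writing its index to pp_power_profile_mode, and the index of "compute" is not necessarily the same on every card, so it is safer to look it up from the file's own listing. Per the amdgpu kernel documentation, power_dpm_force_performance_level has to be set to "manual" before a profile can be selected. The sketch below only does the lookup; the sample listing is a simplified, made-up excerpt (real output has more columns and differs between GPU generations) and the function name is my own.

```python
def find_profile(listing, name):
    """Return (index, is_active) for the named profile, or None if absent.

    Expects lines that start with a numeric index followed by the profile
    name; the active profile is marked with '*'.
    """
    for line in listing.splitlines():
        tokens = line.replace("*", " ").replace(":", " ").split()
        if len(tokens) >= 2 and tokens[0].isdigit() and tokens[1] == name:
            return int(tokens[0]), "*" in line
    return None


# Simplified, hypothetical excerpt of what reading the file might print.
SAMPLE = """\
PROFILE_INDEX(NAME)
 0 BOOTUP_DEFAULT*:
 1 3D_FULL_SCREEN :
 2 POWER_SAVING :
 5 COMPUTE :
"""

print(find_profile(SAMPLE, "COMPUTE"))
```

With the index in hand, the switch itself would be two writes as root: "manual" to power_dpm_force_performance_level, then the index to pp_power_profile_mode. Like other sysfs settings, this does not persist across reboots.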
Offline
I'm not sure if my issue is connected to various ones described in this thread, but it was so frustrating for me that I'll post my observations anyway – maybe they'll help someone.
I have a 5700 XT that I've been using for 1.5 years without many issues. However recently (about a month ago, similarly to OP) I started seeing much more frequent crashes/hangs. In some cases the screen turned black, but the PC could still be pinged and Magic SysRq keys worked. But most of the time the PC completely froze – not even SysRq could reboot it, I had to physically hit the reset switch. And worst of all, in such cases there were not even any logs in the journal.
Around that time Arch updated the main kernel from 5.14.16 to 5.15.2. So, I tried with a newer kernel (linux-mainline 5.16rc3), and with an older one (linux-lts 5.10.83), but in both cases the hangs still occurred.
I also tried linking this to Firefox, as I saw page crashes (mostly videos/streams) happen more often than usual, and most of the time when the PC hanged I was actively using Firefox. But I managed to catch a hang with Firefox closed, so that was not it.
Defeated, I started thinking about swapping my PC components one by one to see if it's related to any specific piece of hardware. But then I got one soft crash, and in the journal there were some logs about X crashing. This gave me an idea; I took another look at pacman.log and found this:
[2021-11-10T09:37:35+0100] [ALPM] upgraded xorg-server-common (1.20.13-3 -> 21.1.1-2)
[2021-11-10T09:37:35+0100] [ALPM] upgraded xorg-server (1.20.13-3 -> 21.1.1-2)
It's a major version update (which introduced some regressions), and this date is awfully close to the moment when I started seeing the hangs.
Since I wanted to try out Wayland for some time, I decided to just do it and see if it has any impact on the issue. I've been running it for 4 days now, and the hard hangs are gone. I had two soft lockups during this time, but they might just be different issues.
My suggestion is that if you are experiencing more frequent hangs since around a month ago, there is a possibility that it was introduced by update of xorg-server to 21.1.1.
To confirm/deny this, you would need to build the older version of xorg-server (1.20.13) and see if the problem persists there. (Personally I'm not interested in doing this, as I'm happily on Wayland now – and I'm just hoping I didn't speak too soon about there being no hangs here.)
Or maybe I'm completely wrong and I just got lucky. In any case, I guess it's worth trying if you're out of other options.
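To line up the start of the hangs with package updates, the upgrade history of a single package can be pulled out of pacman.log (normally /var/log/pacman.log on Arch). A minimal sketch, with the pattern written against the two ALPM lines quoted above; the function name is my own, and older log entries that use a different timestamp format would simply be skipped or fail to parse.

```python
import re
from datetime import datetime

# Matches lines like:
# [2021-11-10T09:37:35+0100] [ALPM] upgraded xorg-server (1.20.13-3 -> 21.1.1-2)
UPGRADE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[ALPM\] upgraded (?P<pkg>\S+) "
    r"\((?P<old>\S+) -> (?P<new>\S+)\)"
)


def upgrades_of(lines, package):
    """Return (timestamp, old_version, new_version) for each upgrade of `package`."""
    found = []
    for line in lines:
        m = UPGRADE.match(line)
        if m and m["pkg"] == package:
            when = datetime.strptime(m["when"], "%Y-%m-%dT%H:%M:%S%z")
            found.append((when, m["old"], m["new"]))
    return found


# The two lines quoted in the post above.
log = [
    "[2021-11-10T09:37:35+0100] [ALPM] upgraded xorg-server-common (1.20.13-3 -> 21.1.1-2)",
    "[2021-11-10T09:37:35+0100] [ALPM] upgraded xorg-server (1.20.13-3 -> 21.1.1-2)",
]
for when, old, new in upgrades_of(log, "xorg-server"):
    print(when.isoformat(), old, "->", new)
```

Running the same function for the kernel package, mesa and linux-firmware gives a short timeline to compare against the date of the first crash.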
Edit:
It turned out my issue with the system completely freezing was caused by a CPU undervolt which I did to limit fan noise (I was using the stock cooler on a Ryzen 3700X, and the fan curves were impossible to adjust to a comfortable level). According to information I found on the web, a -100 mV or -200 mV undervolt should work without any issues, but in my case even -100 mV led to an unstable system.
I have since then upgraded to a better cooler, but forgot about the undervolt. And sure enough, after removing it, I did not experience any more hard freezes.
I still get occasional soft hangs, mainly when using Firefox. It got better with kernel 5.16, but not completely fixed. However it's rare enough that it's not bothering me too much.
Last edited by stanczew (2022-02-13 12:02:25)
Online
I've had two hard lock ups on a 3800X / 5700XT system today. Never had issues with either, but since 5.15.7 it looks like something is bork.
Running Plasma & Chrome both times it happened. Output from kernel ring buffer/dmesg is here: https://pastebin.com/ULeT68h2
Snippet here:
[ 4948.937249] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec timeout, signaled seq=419914, emitted seq=419916
[ 4948.937594] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chrome pid 4894 thread chrome:cs0 pid 4914
[ 4948.937916] amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
[ 4952.937938] amdgpu 0000:2f:00.0: amdgpu: failed to suspend display audio
[ 4953.384351] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
[ 4953.636890] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x00000290 != 0x00000230
[ 4953.888882] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
[ 4953.915067] [drm] free PSP TMR buffer
[ 4953.948647] amdgpu 0000:2f:00.0: amdgpu: BACO reset
[ 4957.098186] amdgpu 0000:2f:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 4957.098372] [drm] PCIE GART of 512M enabled (table at 0x0000008001FA4000).
[ 4957.098397] [drm] VRAM is lost due to GPU reset!
[ 4957.098691] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 4957.098974] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 4957.099142] [drm] PSP is resuming...
[ 4957.138749] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 4957.267584] [drm] reserve 0x900000 from 0x81f1800000 for PSP TMR
[ 4957.306834] amdgpu 0000:2f:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 4957.310998] amdgpu 0000:2f:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 4957.310999] amdgpu 0000:2f:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 4957.311001] amdgpu 0000:2f:00.0: amdgpu: SMU is resuming...
[ 4957.313401] amdgpu 0000:2f:00.0: amdgpu: SMU is resumed successfully!
[ 4957.474956] [drm] kiq ring mec 2 pipe 1 q 0
[ 4957.475502] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 4957.475974] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 4957.476958] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 4957.477295] [drm] JPEG decode initialized successfully.
[ 4957.477358] amdgpu 0000:2f:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 4957.477360] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 4957.477361] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 4957.477362] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 4957.477363] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 4957.477364] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 4957.477365] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 4957.477366] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 4957.477367] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 4957.477368] amdgpu 0000:2f:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 4957.477369] amdgpu 0000:2f:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 4957.477370] amdgpu 0000:2f:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 4957.477371] amdgpu 0000:2f:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[ 4957.477372] amdgpu 0000:2f:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[ 4957.477372] amdgpu 0000:2f:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[ 4957.477373] amdgpu 0000:2f:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[ 4957.480317] amdgpu 0000:2f:00.0: amdgpu: recover vram bo from shadow start
[ 4957.480386] amdgpu 0000:2f:00.0: amdgpu: recover vram bo from shadow done
[ 4957.480393] [drm] Skip scheduling IBs!
[ 4957.480394] [drm] Skip scheduling IBs!
[ 4957.480406] amdgpu 0000:2f:00.0: amdgpu: GPU reset(1) succeeded!
Have downgraded to using linux-lts for now
Offline
I have a 6600 XT with an i5-8600 CPU, using KDE + SDDM.
After logging in from SDDM, every time I try to shut down my computer in KDE, the DP output loses signal, but the computer's power LED does not go out.
When I downgrade to linux-lts, which is a 5.10.x kernel, SDDM doesn't appear at all.
If I use linux-hardened, which is a 5.14.x kernel, it enters KDE and shuts down normally.
This is the log produced using the latest kernel (5.15.7 at the time of writing):
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=437, emitted seq=438
Dec 14 19:11:18 Falcon-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Dec 14 19:11:18 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Dec 14 19:11:18 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:18 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:19 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:19 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:19 Falcon-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=357, emitted seq=358
Dec 14 19:11:19 Falcon-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Dec 14 19:11:19 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Dec 14 19:11:19 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: Bailing on TDR for s_job:165, as another already in progress
Dec 14 19:11:19 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:19 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:19 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:19 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:20 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:20 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:20 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:20 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:20 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:20 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:21 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:21 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:21 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:21 Falcon-PC kernel: [drm] perform_link_training_with_retries: Link training attempt 1 of 4 failed
Dec 14 19:11:30 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:30 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:30 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:30 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:30 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:31 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:31 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:31 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:31 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:31 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:32 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:32 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:32 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:32 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:32 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:32 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:33 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:33 Falcon-PC kernel: [drm] perform_link_training_with_retries: Link training attempt 2 of 4 failed
Dec 14 19:11:41 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:42 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:42 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:42 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:42 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:42 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:42 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:43 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:43 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:43 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:43 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:43 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:44 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:44 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:44 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:44 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:44 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:44 Falcon-PC kernel: [drm] perform_link_training_with_retries: Link training attempt 3 of 4 failed
Dec 14 19:11:53 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:53 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:53 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:54 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:54 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:54 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:54 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:54 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:55 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:55 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:55 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:55 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:55 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:55 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:56 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:56 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:56 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:11:56 Falcon-PC kernel: [drm] enabling link 1 failed: 15
Dec 14 19:12:00 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:12:05 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:12:07 Falcon-PC dbus-daemon[364]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.3' (uid=0 pid=365 comm="/usr/bin/NetworkManager --no-daemon ")
Dec 14 19:12:07 Falcon-PC dbus-daemon[364]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.nm-dispatcher.service': Refusing activation, D-Bus is shutting down.
Dec 14 19:12:09 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:12:15 Falcon-PC kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Dec 14 19:12:17 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Dec 14 19:12:29 Falcon-PC kernel: [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:72:crtc-0] flip_done timed out
Dec 14 19:12:32 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
Dec 14 19:12:33 Falcon-PC systemd[1]: sddm.service: State 'stop-sigterm' timed out. Killing.
Dec 14 19:12:33 Falcon-PC systemd[1]: sddm.service: Killing process 389 (sddm) with signal SIGKILL.
Dec 14 19:12:33 Falcon-PC systemd[1]: sddm.service: Killing process 985 (Xorg) with signal SIGKILL.
Dec 14 19:12:33 Falcon-PC systemd[1]: sddm.service: Killing process 402 (QDBusConnection) with signal SIGKILL.
Dec 14 19:12:33 Falcon-PC systemd[1]: sddm.service: Main process exited, code=killed, status=9/KILL
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ An ExecStart= process belonging to unit sddm.service has exited.
░░ 
░░ The process' exit code is 'killed' and its exit status is 9.
Dec 14 19:12:35 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
Dec 14 19:12:36 Falcon-PC kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Dec 14 19:12:36 Falcon-PC kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Dec 14 19:12:37 Falcon-PC kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Dec 14 19:12:37 Falcon-PC kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Dec 14 19:12:37 Falcon-PC kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Dec 14 19:12:40 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
Dec 14 19:12:40 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable smu features.
Dec 14 19:12:40 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: Fail to disable dpm features!
Dec 14 19:12:40 Falcon-PC kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
Dec 14 19:12:40 Falcon-PC kernel: [drm] free PSP TMR buffer
Dec 14 19:12:41 Falcon-PC kernel: [drm] psp gfx command DESTROY_TMR(0x7) failed and response status is (0x80000306)
Dec 14 19:12:41 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Dec 14 19:12:41 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Dec 14 19:12:41 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
Dec 14 19:12:44 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
Dec 14 19:12:44 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset failed
Dec 14 19:12:44 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:03:00.0
Dec 14 19:12:46 Falcon-PC kernel: [drm:drm_crtc_commit_wait] *ERROR* flip_done timed out
Dec 14 19:12:46 Falcon-PC kernel: [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:72:crtc-0] commit wait timed out
Dec 14 19:12:55 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
Dec 14 19:12:55 Falcon-PC kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
Dec 14 19:12:55 Falcon-PC kernel: [drm] VRAM is lost due to GPU reset!
Dec 14 19:12:55 Falcon-PC kernel: [drm] PSP is resuming...
Dec 14 19:12:56 Falcon-PC kernel: [drm] failed to load ucode SMC(0x18) 
Dec 14 19:12:56 Falcon-PC kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x80000306)
Dec 14 19:12:56 Falcon-PC kernel: [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR
Dec 14 19:12:56 Falcon-PC kernel: [drm:drm_crtc_commit_wait] *ERROR* flip_done timed out
Dec 14 19:12:56 Falcon-PC kernel: [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CONNECTOR:93:DP-2] commit wait timed out
Dec 14 19:12:57 Falcon-PC dbus-daemon[364]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.3' (uid=0 pid=365 comm="/usr/bin/NetworkManager --no-daemon ")
Dec 14 19:12:57 Falcon-PC dbus-daemon[364]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.nm-dispatcher.service': Refusing activation, D-Bus is shutting down.
Dec 14 19:12:58 Falcon-PC kernel: [drm] psp gfx command AUTOLOAD_RLC(0x21) failed and response status is (0x0)
Dec 14 19:12:58 Falcon-PC kernel: [drm:psp_load_non_psp_fw [amdgpu]] *ERROR* Failed to start rlc autoload
Dec 14 19:12:58 Falcon-PC kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Dec 14 19:12:58 Falcon-PC kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
Dec 14 19:12:58 Falcon-PC kernel: [drm] Skip scheduling IBs!
Dec 14 19:12:58 Falcon-PC kernel: [drm] Skip scheduling IBs!
Dec 14 19:12:58 Falcon-PC kernel: [drm] Skip scheduling IBs!
Dec 14 19:12:58 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(1) failed
Dec 14 19:12:58 Falcon-PC kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset end with ret = -22
Dec 14 19:12:58 Falcon-PC kernel: snd_hda_intel 0000:03:00.1: refused to change power state from D0 to D3hot
Dec 14 19:13:06 Falcon-PC kernel: [drm:drm_crtc_commit_wait] *ERROR* flip_done timed out
Dec 14 19:13:06 Falcon-PC kernel: [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:60:plane-4] commit wait timed out
AMD Ryzen R5 5600, RX-6600XT, KDE User
Offline
I am now testing the same kernel with https://patchwork.freedesktop.org/patch/466321/ applied, as suggested in https://gitlab.freedesktop.org/drm/amd/-/issues/1824 , and I have not seen the issue for a couple of days.
Offline
So I've been testing 5.16.0 today and so far I haven't had any issues for almost 5 hours now, which is longer than my system ever lasted on 5.15. I also noticed that 5.16.0 includes the patch mentioned by @guilivo (https://patchwork.freedesktop.org/patch/466321/), which is promising. I'll report back if I do run into any lockups, or (hopefully) after a couple of days if everything continues working smoothly.
Offline
Well, I came back at the end of the evening and I could ssh into my computer, but the display was dead and it required a full reboot to fix. I guess 5.16.0 is no bueno for me.
Offline
I'm seeing this error log as well with my RX 570. I'm on 5.16.2; the log is a bit different, but the screen is blank all the same.
Offline

There are lots of fixes in that direction in 5.16.3; try with that.
Offline
On 5700XT / 5.15.1.x with ~ 7m DP->miniDP-Cable @ 1440p with SDDM/Plasma, everything's working, with screen blanking for a few seconds once a week or so.
5.16, including 5.16.3, gives only a blank screen with that cable.
With another 7m cable (DP-DP), it sometimes works, with regular screen blanking several times a minute at least, but it shows something.
Funny thing is, with 5.15.x, this DP-DP cable is blank most of the time and only works sometimes.
With both cables, 5.16.X shows lots of
perform_link_training_with_retries: Link training attempt
Although with the DP-DP cable, most of the time I just see attempt 1 of 4 failing.
I'm using a custom modeline with reduced blanking for 1440p in /etc/X11/xorg.conf.d if that matters, but deleting it doesn't help.
Offline

Something improved, I haven't had the hard locks for a couple of kernel versions. sddm still dies on me sometimes after resume, but that's recoverable by killing the login session from another tty now.
Unfortunately I don't have a way to test anything apart from waiting to see what happens for another week.
Offline
Hello gentlemen, I had the same behaviour with my RX6700. I tried a lot of kernel hacks, but the issue stayed the same, even on other OSes:
Windows 10: blinking, with a short burst of white snow noise on screen for 2 or 3 seconds, then back to the desktop
Windows 11: blinking and a black background in the window
Arch Linux: the X server crashes and the desktop freezes.
Strangely, there was no problem during fullscreen 3D gaming...
The solution that worked for me was to... flash the video card :-/
The first step is to identify the correct card ID with https://www.techpowerup.com/download/techpowerup-gpu-z/
The second step is to get the latest firmware for the GPU and host card from https://www.techpowerup.com/vgabios/ (for me, ASUS RX 6700 XT 12 GB Dual)
The last step is to flash the GPU with https://www.techpowerup.com/download/ati-atiflash/
I'm sorry, all the tools are for Windows, and I'm not sure they work correctly under Wine. I didn't try to deal with the devil.
Caution: flashing a video card is not a routine operation. Be sure you have the right firmware before proceeding.
For me, there has been no more blinking on the desktop for 2 weeks now (I consider the issue gone).
That means the problem was a GPU firmware bug, not a kernel issue.
I hope this radical hint will be useful for other RX6700 owners. But I regret the lack of communication from ATI/AMD.
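When comparing before and after a flash like the one described above, it can help to note which VBIOS version the card reports under Linux. Here is a minimal Python sketch, assuming the amdgpu driver exposes a `vbios_version` attribute under `/sys/class/drm/cardN/device/` (that path is an assumption about your kernel and driver, and `read_vbios_versions` is just an illustrative helper name, not part of any tool mentioned here):

```python
from pathlib import Path


def read_vbios_versions(sysfs_root="/sys/class/drm"):
    """Return {card_name: vbios_version} for every card exposing the attribute.

    Assumption: the driver publishes cardN/device/vbios_version in sysfs.
    Cards without that file are simply not listed.
    """
    versions = {}
    for attr in sorted(Path(sysfs_root).glob("card[0-9]*/device/vbios_version")):
        card = attr.parent.parent.name  # e.g. "card0"
        try:
            versions[card] = attr.read_text().strip()
        except OSError:
            # Unreadable attribute: skip it rather than fail.
            continue
    return versions


if __name__ == "__main__":
    for card, version in read_vbios_versions().items():
        print(f"{card}: {version}")
```

If nothing is printed, the attribute is probably not available on that system, and GPU-Z on Windows remains the way to read the version.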
Last edited by Artemia76 (2022-03-31 21:28:36)
Offline
Hey everyone, I realize there are a few issues getting tracked in this thread at this point, but I believe the one I was having has been solved! https://gitlab.freedesktop.org/drm/amd/-/issues/1871
Offline
Joining in on this, I just got a 6900xt in the mail and to my dismay my system is locking up after a few minutes of games and occasionally when idling as well. The screen goes blank and the last second or so of audio repeats. This seems to be an issue some other people are experiencing, so I'm not sure if it's a bad card or not. Haven't been able to try in Windows yet.
I am on kernel version 5.17.2
Here are relevant sections of journalctl -b -1
First, it starts spitting out a series of messages like this:
Apr 13 23:33:54 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Apr 13 23:33:54 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Failed to export SMU metrics table!
It repeats those lines for a while before different messages start:
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: [drm] *ERROR* [CRTC:80:crtc-1] flip_done timed out
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma3 timeout, signaled seq=1438, emitted seq=1440
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma2 timeout, signaled seq=718, emitted seq=720
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2520, emitted seq=2522
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Bailing on TDR for s_job:2ce, as another already in progress
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Bailing on TDR for s_job:937, as another already in progress
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Failed to disable gfxoff!
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=178191, emitted seq=178191
Apr 13 23:34:04 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process steamwebhelper pid 3345 thread steamwebhe:cs0 pid 3347
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Apr 13 23:34:04 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Bailing on TDR for s_job:2b3f0, as another already in progress
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 10us * 10 tries - mpc3_power_on_ogam_lut line:157
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 10us * 10 tries - mpc3_power_on_ogam_lut line:157
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 10us * 10 tries - mpc3_power_on_ogam_lut line:157
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:437
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:511
Apr 13 23:34:04 ryuko kernel: [drm:dcn20_wait_for_blank_complete [amdgpu]] *ERROR* DC: failed to blank crtc!
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 10us * 10 tries - mpc3_power_on_ogam_lut line:157
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 10us * 10 tries - mpc3_power_on_ogam_lut line:157
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 10us * 10 tries - mpc3_power_on_ogam_lut line:157
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:445
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:519
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:469
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:543
Apr 13 23:34:04 ryuko kernel: [drm:dcn20_wait_for_blank_complete [amdgpu]] *ERROR* DC: failed to blank crtc!
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:453
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:527
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_dpp_pg_control line:461
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 1us * 1000 tries - dcn20_hubp_pg_control line:535
Apr 13 23:34:04 ryuko kernel: [drm:dcn20_wait_for_blank_complete [amdgpu]] *ERROR* DC: failed to blank crtc!
Apr 13 23:34:04 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d832546c; ring_buffer_end = 00000000f3a103db; write_frame = 00000000490837e7
Apr 13 23:34:04 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Apr 13 23:34:04 ryuko kernel: [drm] REG_WAIT timeout 10us * 10020 tries - enc1_stream_encoder_dp_blank line:944
Apr 13 23:34:04 ryuko kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=5
Apr 13 23:34:05 ryuko kernel: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_disable_crtc line:528
Apr 13 23:34:05 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d832546c; ring_buffer_end = 00000000f3a103db; write_frame = 00000000490837e7
Apr 13 23:34:05 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Apr 13 23:34:05 ryuko kernel: [drm] REG_WAIT timeout 10us * 10020 tries - enc1_stream_encoder_dp_blank line:944
Apr 13 23:34:05 ryuko kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=5
Apr 13 23:34:05 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d832546c; ring_buffer_end = 00000000f3a103db; write_frame = 00000000490837e7
Apr 13 23:34:05 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Apr 13 23:34:05 ryuko kernel: [drm] REG_WAIT timeout 10us * 10020 tries - enc1_stream_encoder_dp_blank line:944
Apr 13 23:34:05 ryuko kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=5
Apr 13 23:34:05 ryuko kernel: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_disable_crtc line:528
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Apr 13 23:34:06 ryuko kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Apr 13 23:34:06 ryuko kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Apr 13 23:34:06 ryuko kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Apr 13 23:34:06 ryuko kernel: __smu_cmn_reg_print_error: 71 callbacks suppressed
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Failed to disable smu features.
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Fail to disable dpm features!
Apr 13 23:34:06 ryuko kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -121
Apr 13 23:34:06 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 00000000d832546c; ring_buffer_end = 00000000f3a103db; write_frame = 00000000490837e7
Apr 13 23:34:06 ryuko kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Apr 13 23:34:06 ryuko kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to terminate ras ta
Apr 13 23:34:06 ryuko kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: MODE1 reset
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU mode1 reset
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU smu mode1 reset
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:48 param:0x00000000 message:Mode1Reset?
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU mode1 reset failed
Apr 13 23:34:06 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: ASIC reset failed with error, -121 for drm dev, 0000:08:00.0
Apr 13 23:34:08 ryuko kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:77:crtc-0] hw_done or flip_done timed out
Apr 13 23:34:14 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2522, emitted seq=2522
Apr 13 23:34:14 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Apr 13 23:34:14 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Apr 13 23:34:14 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Bailing on TDR for s_job:938, as another already in progress
Apr 13 23:34:14 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=178191, emitted seq=178191
Apr 13 23:34:14 ryuko kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process steamwebhelper pid 3345 thread steamwebhe:cs0 pid 3347
Apr 13 23:34:14 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset begin!
Apr 13 23:34:14 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: Bailing on TDR for s_job:2b3f1, as another already in progress
Apr 13 23:34:17 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset succeeded, trying to resume
Apr 13 23:34:17 ryuko kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
Apr 13 23:34:17 ryuko kernel: [drm] VRAM is lost due to GPU reset!
Apr 13 23:34:17 ryuko kernel: [drm] PSP is resuming...
Apr 13 23:34:17 ryuko kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
Apr 13 23:34:17 ryuko kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Apr 13 23:34:17 ryuko kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset(3) failed
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
Apr 13 23:34:17 ryuko kernel: [drm] Skip scheduling IBs!
It then repeats that Skip scheduling IBs line several hundred times before a new set of messages:
Apr 13 23:34:17 ryuko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 13 23:34:17 ryuko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 13 23:34:17 ryuko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 13 23:34:17 ryuko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 13 23:34:17 ryuko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 13 23:34:17 ryuko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 13 23:34:17 ryuko kernel: snd_hda_intel 0000:08:00.1: CORB reset timeout#2, CORBRP = 65535
Apr 13 23:34:17 ryuko kernel: amdgpu 0000:08:00.0: amdgpu: GPU reset end with ret = -62
Apr 13 23:34:19 ryuko kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:92:crtc-5] hw_done or flip_done timed out
Apr 13 23:34:25 ryuko audit[3345]: ANOM_ABEND auid=1001 uid=1001 gid=1001 ses=2 pid=3345 comm="GpuWatchdog" exe="/home/user/.local/share/Steam/ubuntu12_64/steamwebhelper" sig=11 res=1
Apr 13 23:34:25 ryuko kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 13 23:34:25 ryuko kernel: GpuWatchdog[3385]: segfault at 0 ip 00007fa16e5234af sp 00007fa1652514a0 error 6 in libcef.so[7fa16a290000+6fc6000]
Apr 13 23:34:25 ryuko kernel: Code: 89 de e8 a4 0a 8e fe 80 7d cf 00 79 09 48 8b 7d b8 e8 d5 cc d2 02 41 8b 84 24 e0 00 00 00 89 45 b8 48 8d 7d b8 e8 11 d0 d6 fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
Apr 13 23:34:25 ryuko kernel: audit: type=1701 audit(1649910865.470:92): auid=1001 uid=1001 gid=1001 ses=2 pid=3345 comm="GpuWatchdog" exe="/home/user/.local/share/Steam/ubuntu12_64/steamwebhelper" sig=11 res=1
Apr 13 23:34:25 ryuko audit: BPF prog-id=27 op=LOAD
Apr 13 23:34:25 ryuko kernel: audit: type=1334 audit(1649910865.475:93): prog-id=27 op=LOAD
Apr 13 23:34:25 ryuko audit: BPF prog-id=28 op=LOAD
Apr 13 23:34:25 ryuko audit: BPF prog-id=29 op=LOAD
Apr 13 23:34:25 ryuko systemd[1]: Started Process Core Dump (PID 5507/UID 0).
Apr 13 23:34:25 ryuko kernel: audit: type=1334 audit(1649910865.476:94): prog-id=28 op=LOAD
Apr 13 23:34:25 ryuko kernel: audit: type=1334 audit(1649910865.476:95): prog-id=29 op=LOAD
Apr 13 23:34:25 ryuko kernel: audit: type=1130 audit(1649910865.476:96): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-5507-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Apr 13 23:34:25 ryuko audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-5507-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Apr 13 23:34:26 ryuko systemd-coredump[5511]: [?] Process 3345 (steamwebhelper) of user 1001 dumped core.
                                              
                                              Module /home/user/.local/share/Steam/ubuntu12_64/steamwebhelper with build-id 2df4ef6babfc44a6391362551a5df8955db480fa
                                              Module /home/user/.local/share/Steam/ubuntu12_64/libSDL2-2.0.so.0 with build-id d19038a3aea8a3772c51cb208a50f0e288dc5942
                                              Module /home/user/.local/share/Steam/ubuntu12_64/libminigbm.so with build-id 6ac45b920b9a761a66d2c5a295ec46946620d605
                                              Module /home/user/.local/share/Steam/ubuntu12_64/libcef.so with build-id 928a4a5a528723b2eaadc15f9072ae5888dd3d05
                                              Module linux-vdso.so.1 with build-id 48d1f2a0ba6b9bf33e371e464222a5d7f3ba1da9
                                              Module libva-drm.so.2 with build-id 7a02a127b38763e2ec9a2503f0ae4c3d57c20327
                                              Module libva-x11.so.2 with build-id 61e734cd9bd8265d76a73698e94b891539801ec1
                                              Module libva.so.2 with build-id 7636f0d16fc8cb315f9274e487bd226d1236891e
                                              Module libicudata.so.70 with build-id e1dcc2a88cfaafed882d09c90c668af0eed4efed
                                              Module libicuuc.so.70 with build-id 2e245c2bf12f95fd8ab79b3a4be99524677cbd70
                                              Module libxml2.so.2 with build-id 34aa03d6fadb52a051964f0e50a977efaea9482e
It then continues core dumping a bunch of other stuff before I eventually power cycled.
EDIT: Booted into my Windows partition and could not recreate the issue, so it seems it's probably not a bad card. Disabling dpm is not a workable solution here, or there wouldn't be much point in spending $1000 on a GPU. Reflashed the BIOS per one of the previous suggestions, but it's getting late so I'll have to test more tomorrow.
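For anyone wading through a journal like the one above, here is a rough Python sketch that collapses runs of identical messages (such as the repeated Skip scheduling IBs lines) and pulls out the ring timeout entries. The helper names `collapse_repeats` and `ring_timeouts` are made up for illustration, and the regular expressions only assume the syslog-style prefix and the `ring ... timeout, signaled seq=..., emitted seq=...` wording seen in the lines quoted in this thread:

```python
import re

# Matches e.g. "ring sdma3 timeout, signaled seq=1438, emitted seq=1440"
RING_TIMEOUT = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)
# Matches the "Apr 13 23:34:17 ryuko " prefix (month, day, time, hostname).
PREFIX = re.compile(r"^\w{3}\s+\d+ \d\d:\d\d:\d\d \S+ ")


def collapse_repeats(lines):
    """Collapse consecutive lines with the same message into (message, count)."""
    collapsed = []
    for line in lines:
        message = PREFIX.sub("", line.rstrip("\n"))
        if collapsed and collapsed[-1][0] == message:
            collapsed[-1][1] += 1
        else:
            collapsed.append([message, 1])
    return [(message, count) for message, count in collapsed]


def ring_timeouts(lines):
    """Return (ring, signaled_seq, emitted_seq) for every ring timeout line."""
    found = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return found


if __name__ == "__main__":
    import sys

    log = sys.stdin.readlines()
    for ring, signaled, emitted in ring_timeouts(log):
        print(f"timeout on {ring}: signaled={signaled} emitted={emitted}")
    for message, count in collapse_repeats(log):
        print(f"{count:5d}x {message}")
```

It reads the log from standard input, so something like `journalctl -b -1 | python summarize.py` should work, where `summarize.py` is whatever name you saved the script under.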
Last edited by mikesbytes (2022-04-14 07:18:51)
Offline