#1 2022-10-16 05:53:55

eggrole
Member
Registered: 2021-01-30
Posts: 13

Radeon RX 5500 blank screen when gaming (sometimes when not gaming)

sudo lspci -v
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (rev c5) (prog-if 00 [VGA controller])
        Subsystem: Hightech Information System Ltd. Device 2401
        Flags: bus master, fast devsel, latency 0, IRQ 79, IOMMU group 18
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at e000 [size=256]
        Memory at fcb00000 (32-bit, non-prefetchable) [size=512K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] Physical Resizable BAR
        Capabilities: [240] Power Budgeting <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Capabilities: [400] Data Link Feature <?>
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [440] Lane Margining at the Receiver <?>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

CPU: AMD Ryzen 3 3100 (8) @ 3.6GHz
GPU: AMD ATI Radeon RX 5500/5500M / Pro 5500M
MOBO:  Gigabyte A520 AORUS ELITE
16GB RAM @ 2666
550W Corsair PSU

Machine is about 21 months old and was working fine up until about 2 months
ago.

I am randomly getting blank screens. It can happen at any time, but it is most
common when I am playing a game. Sometimes it happens multiple times in a day
and sometimes there are a few days between occurrences. When I am not playing a
game it can sometimes go for 2+ weeks without happening. When I am playing a
game it will usually happen within 15 minutes, but sometimes it can go an hour
or more.

Initially I thought it might be the video card overheating, but I've been
watching it closely for the past few weeks and the card never seems to get over
75C. Once or twice I saw it spike to 85C, but that is very rare and well within
spec. I am using amdgpu-fan to control the fan speeds.
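
For anyone who wants to log the temperature rather than watch it, here is a
minimal sketch that reads the amdgpu hwmon sensor directly. It assumes the card
is card0 and that temp1_input is the edge sensor, reported in millidegrees
Celsius. The function names are made up for illustration.

```shell
#!/bin/sh
# Convert a hwmon reading in millidegrees Celsius to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print a timestamped temperature for each hwmon node under card0.
# Assumption: card0 is the RX 5500; adjust if there is more than one GPU.
log_gpu_temp() {
    for f in /sys/class/drm/card0/device/hwmon/hwmon*/temp1_input; do
        [ -r "$f" ] || continue
        printf '%s %s C\n' "$(date '+%F %T')" "$(millideg_to_c "$(cat "$f")")"
    done
}

log_gpu_temp
```

Running log_gpu_temp in a loop with a sleep and appending to a file would leave
a record of the last reading before a blank screen.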

When the screen blanks, I no longer have control with the keyboard or mouse
either. I can't switch to another tty or "blindly" type in reboot etc.


Things I have tried:

Started on kernel 5.18.xx. Tried LTS and regular kernels all the way up to
current (6.0.1).

Flashed the motherboard BIOS to the latest version.

I tried each of these individually and all together.

# /etc/modprobe.d/amdgpu.conf
options amdgpu runpm=0 ppfeaturemask=0xffffffff aspm=0 bapm=0
# dpm=0 seemed to freeze the machine at the tty login, so I removed it
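
To confirm that the options above were actually picked up by the loaded module,
the values can be read back from sysfs. This is only a sketch: the helper name
is hypothetical, and it assumes the parameters are exposed under
/sys/module/amdgpu/parameters. If amdgpu is loaded early from the initramfs,
the initramfs may also need to be regenerated before the file takes effect.

```shell
#!/bin/sh
# Print name=value for selected amdgpu module parameters.
# $1: parameter directory (defaults to the live sysfs location).
show_amdgpu_params() {
    dir=${1:-/sys/module/amdgpu/parameters}
    for p in runpm ppfeaturemask dpm aspm bapm; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            printf '%s=<not available>\n' "$p"
        fi
    done
}

show_amdgpu_params
```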

echo performance > /sys/class/drm/card0/device/power_dpm_state
echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level

echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo "5" > /sys/class/drm/card0/device/pp_power_profile_mode
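
These writes need root and do not survive a reboot, so it is worth checking
that they stuck. A minimal sketch, with a hypothetical helper name: it writes a
value and compares it with what the file reports afterwards. That works for
power_dpm_force_performance_level, which reports the current level back, but
not for pp_power_profile_mode, which prints a table of profiles instead.

```shell
#!/bin/sh
# Write a value to a sysfs-style file and confirm the file reports it back.
# $1: file path, $2: value. Returns non-zero if the write or the check fails.
set_and_verify() {
    printf '%s\n' "$2" > "$1" || return 1
    [ "$(cat "$1")" = "$2" ]
}

# Example against the live file (needs root and an amdgpu card0).
level=/sys/class/drm/card0/device/power_dpm_force_performance_level
if [ -w "$level" ]; then
    set_and_verify "$level" high && echo "performance level is now high"
fi
```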

Disabled HD Audio in the BIOS (I read that the
'snd_hda_intel 0000:07:00.1: Unable to change power state from D3cold to D0, device inaccessible'
error might be related), but the screen still blanked.

I did not try the kernel param pcie_aspm=off, since ASPM seems to be disabled
in my BIOS.
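
Whether ASPM is really off can be checked from the running system instead of
relying on the BIOS setting. A rough sketch, assuming the kernel exposes
/sys/module/pcie_aspm/parameters/policy with the active policy shown in square
brackets. The function name is hypothetical.

```shell
#!/bin/sh
# Extract the active policy (the entry in square brackets) from stdin,
# e.g. "[default] performance powersave powersupersave".
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

policy=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy" ]; then
    active_aspm_policy < "$policy"
fi

# Per-device link state for the GPU (root is needed for full output):
#   lspci -vv -s 07:00.0 | grep -i aspm
```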

Booting the latest live Ubuntu and surfing the net for a night produced the
same blank screen after several hours. This makes me worried it might be bad
hardware.

Searching for the errors below comes up with a lot of results but no real
solutions. At this point I am simply throwing everything I come across at it
and I am hoping someone can point me in the right direction.


The relevant journal:

Not sure if this is relevant. Happens on boot.
Oct 15 23:31:59 r3 kernel: kvm: support for 'kvm_amd' disabled by bios

Then it starts with this error:
Oct 16 01:18:26 r3 kernel: snd_hda_intel 0000:07:00.1: Unable to change power state from D3hot to D0, device inaccessible
Oct 16 01:18:26 r3 kernel: snd_hda_intel 0000:07:00.1: CORB reset timeout#2, CORBRP = 65535
...
Oct 16 01:18:36 r3 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=874967, emitted seq=874969
Oct 16 01:18:36 r3 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process MonsterTrain.ex pid 211482 thread dxvk-submit pid 211522
Oct 16 01:18:36 r3 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
...
Oct 16 01:18:37 r3 kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 0000000054ad1599; ring_buffer_end = 000000007d66eeae; write_frame = 000000007060fe4e
Oct 16 01:18:37 r3 kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
...
Oct 16 01:18:37 r3 kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Oct 16 01:18:38 r3 kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
...
Oct 16 01:18:38 r3 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -121
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:52 param:0x00000000 message:PrepareMp1ForShutdown?
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: [PrepareMp1] Failed!
Oct 16 01:18:38 r3 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* SMC failed to set mp1 state 1, -121
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: GPU pre asic reset failed with err, -121 for drm dev, 0000:07:00.0
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: MODE1 reset
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: GPU mode1 reset
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: GPU psp mode1 reset
Oct 16 01:18:38 r3 kernel: [drm] psp is not working correctly before mode1 reset!
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: GPU mode1 reset failed
Oct 16 01:18:38 r3 kernel: amdgpu 0000:07:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:07:00.0
Oct 16 01:18:58 r3 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Oct 16 01:18:58 r3 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 936A (len 110, WS 12, PS 8) @ 0x93CF
Oct 16 01:18:58 r3 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 92F8 (len 37, WS 0, PS 8) @ 0x9311
Oct 16 01:18:58 r3 kernel: amdgpu 0000:07:00.0: amdgpu: asic atom init failed!
Oct 16 01:18:58 r3 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed
Oct 16 01:18:58 r3 kernel: snd_hda_intel 0000:07:00.1: Unable to change power state from D3cold to D0, device inaccessible
Oct 16 01:18:58 r3 kernel: snd_hda_intel 0000:07:00.1: CORB reset timeout#2, CORBRP = 65535
Oct 16 01:18:58 r3 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -22
Oct 16 01:18:58 r3 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -22
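
An excerpt like the one above can be pulled from the previous boot's kernel log
with a small filter. This is a sketch only: the function name and the keyword
list are my own choice and will likely need adjusting.

```shell
#!/bin/sh
# Keep only amdgpu / snd_hda_intel lines that look like errors.
# Reads log lines on stdin and writes the matching ones to stdout.
gpu_errors() {
    grep -E 'amdgpu|snd_hda_intel' | grep -E 'ERROR|failed|timeout|inaccessible'
}

# Usage, kernel messages from the boot before the current one:
#   journalctl -k -b -1 --no-pager | gpu_errors
```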

The journal from the crash forward is at https://pastebin.com/1HtrBZNp
