[amdgpu] no display, lots of errors in dmesg

Moviuro · 2023-09-28 20:02:47

Hi all,

I just came back from 2 weeks away, ran pacman -Syu and rebooted, and my R9 Fury that had been working is suddenly not working anymore.

Recently, I swapped the system nvme to a larger one, but the upgrade was successful and the machine was usable before I went away.

Kernel log: journalctl -b -k, in particular:

Sep 28 21:44:05 houndsditch kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
Sep 28 21:44:05 houndsditch kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -110
Sep 28 21:44:05 houndsditch kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
Sep 28 21:44:05 houndsditch kernel: amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
Sep 28 21:44:05 houndsditch kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
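
For anyone triaging a similar log, here is a minimal sketch for pulling just the amdgpu error lines out of the kernel log. The function name amdgpu_errors is made up for illustration; it reads log text on stdin, so it works on journalctl output or on a saved file. As far as I know, -110 is ETIMEDOUT on Linux, i.e. the ring test timed out.

```shell
# Hypothetical helper: print only the amdgpu lines flagged *ERROR* from log text on stdin.
amdgpu_errors() {
    grep -F 'amdgpu' | grep -F '*ERROR*'
}

# Typical use on the current boot's kernel log (needs journal read access):
#   journalctl -b -k --no-pager | amdgpu_errors
```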

I have tried:

Rolling back the kernel to some versions that were still in the pacman cache
Using the LTS, zen, and linux kernels
Unplugging the PSU, waiting a few minutes and restarting the machine
Reseating the GPU after pulling it out of the machine
Checking for a new BIOS (there wasn't one)
Pressing the "BIOS switch" button on the card
Unplugging the power and SATA cables of the hard drives inside the machine (that were used for the nvme swap)

# lspci -kk
[snip]
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev cb)
        Subsystem: PC Partner Limited / Sapphire Technology Fiji [Radeon R9 FURY / NANO Series]
        Kernel modules: amdgpu
[snip]

# tree /sys/class/drm
/sys/class/drm
└── version
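
A tree with only "version" in it means the driver never registered a DRM card node. Below is a small sketch that makes that check scriptable. drm_cards is a hypothetical helper, and the directory argument exists only so it can be pointed somewhere other than /sys/class/drm.

```shell
# Hypothetical helper: list DRM card nodes (card0, card1, ...) under a sysfs-style directory.
# Prints nothing and returns non-zero when no card node exists.
drm_cards() {
    dir=${1:-/sys/class/drm}
    found=1
    for node in "$dir"/card[0-9]*; do
        # An unmatched glob stays literal in POSIX sh, so test for existence.
        [ -e "$node" ] || continue
        # Skip connector entries such as card0-HDMI-A-1.
        case ${node##*/} in
            *-*) continue ;;
        esac
        printf '%s\n' "${node##*/}"
        found=0
    done
    return "$found"
}
```

Run without arguments, it inspects /sys/class/drm. On the machine above it would print nothing and return non-zero.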

I will also be asking on the amdgpu bug tracker https://gitlab.freedesktop.org/drm/amd/-/issues/2882

Last edited by Moviuro (2023-10-01 19:53:17)

Lone_Wolf · 2023-09-29 09:20:35

The log indicates that the initialisation of your card starts very late in the boot process, e.g. the card's HDMI audio output initialisation starts before that of the video card.

Please post your /etc/mkinitcpio.conf

seth · 2023-09-29 14:47:46

https://x0.at/0dBu.txt is obviously 404. From the upstream bug:

I also tried launching a Debian live system: live didn't display anything past GRUB, but the installer (very low res) was showing stuff on screen. Using the installer, dmesg|grep amdgpu returned nothing though.

=> can you still boot "nomodeset systemd.unit=multi-user.target"?
Did you maybe forget to attach some 6/8-pin dedicated power supply?

Moviuro · 2023-10-01 19:48:26

My GPU is indeed used for audio over HDMI (through the TV display); mkinitcpio.conf(5): https://x0.at/pMXP.conf -- nothing weird, it's worked since July.

I have validated that fans are spinning on the GPU (ouch, fingers), and reversed the power connectors (2x 8pin, which were correctly plugged in AFAICT, and that cable has been in use since 2017 when I put that machine together). I don't have easy access to the PSU side of power cables though, so that test will wait until I can plug the card in another machine (later this week).

=> can you still boot "nomodeset systemd.unit=multi-user.target"?

Did that on the Mageia live CD, the machine works, display too (startx, run firefox...) (very low resolution though I suppose this is expected).

Last edited by Moviuro (2023-10-01 19:48:50)

seth · 2023-10-01 20:13:00

X11 will run on the vesa or fbdev driver this way - low performance is to be expected.
Following Lone_Wolf's idea, add "amdgpu" to the MODULES array and regenerate the initramfs.
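
A sketch of what that change could look like. It assumes the stock layout where /etc/mkinitcpio.conf has a single-line MODULES=(...) entry. add_early_module is a hypothetical helper, written to take a file path so it can be tried on a copy first.

```shell
# Hypothetical helper: add a module to the single-line MODULES=(...) array of a mkinitcpio config.
# Usage: add_early_module amdgpu /path/to/mkinitcpio.conf
add_early_module() {
    mod=$1
    conf=$2
    # Do nothing if the module is already listed.
    if grep -Eq "^MODULES=\(([^)]* )?$mod( [^)]*)?\)" "$conf"; then
        return 0
    fi
    # Prepend the module, keeping whatever was already in the array;
    # the second expression tidies the trailing space left by an empty array.
    sed -E "s/^MODULES=\(([^)]*)\)/MODULES=($mod \1)/; s/^MODULES=\(([^)]*) \)/MODULES=(\1)/" \
        "$conf" > "$conf.new" && mv "$conf.new" "$conf"
}

# On the real system (as root), followed by regenerating all presets:
#   add_early_module amdgpu /etc/mkinitcpio.conf
#   mkinitcpio -P
```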

I don't think it's gonna be relevant given the errors, but since the topic was brought up you can try

amdgpu.audio=0

What might be more interesting next to the early KMS could be

amdgpu.runpm=0 amdgpu.bapm=0 amdgpu.aspm=0 pcie_aspm=off
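
These are kernel command-line parameters. How they get set depends on the boot loader, which the thread does not state. After rebooting, a sketch like the following can confirm they were actually passed. has_param and check_suggested are hypothetical helpers; they check a command-line string and default to /proc/cmdline.

```shell
# Hypothetical helper: succeed if an exact parameter (e.g. amdgpu.runpm=0) appears in a
# kernel command line. The second argument defaults to the running kernel's command line.
has_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    # Unquoted on purpose: split the command line into words.
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Report which of the suggested parameters are present.
# With no argument it checks the running kernel; a command-line string may be passed instead.
check_suggested() {
    for p in amdgpu.runpm=0 amdgpu.bapm=0 amdgpu.aspm=0 pcie_aspm=off; do
        if has_param "$p" "$@"; then
            echo "$p: set"
        else
            echo "$p: missing"
        fi
    done
}
```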

Finally, as things start to fall apart w/

Sep 28 21:31:47 houndsditch kernel: [drm] UVD ENC is disabled
Sep 28 21:31:47 houndsditch kernel: [drm] Found VCE firmware Version: 55.2 Binary ID: 3
Sep 28 21:31:48 houndsditch kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
Sep 28 21:31:48 houndsditch kernel: [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
Sep 28 21:31:48 houndsditch kernel: [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
Sep 28 21:31:48 houndsditch kernel: [drm] Display Core v3.2.241 initialized on DCE 10.0
Sep 28 21:31:48 houndsditch kernel: snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Sep 28 21:31:48 houndsditch kernel: [drm] UVD initialized successfully.
Sep 28 21:31:48 houndsditch kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vce2 test failed (-110)
Sep 28 21:31:48 houndsditch kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <vce_v3_0> failed -110
Sep 28 21:31:48 houndsditch kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
Sep 28 21:31:48 houndsditch kernel: amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
Sep 28 21:31:48 houndsditch kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.

you could try what happens if you move /usr/lib/firmware/amdgpu/fiji_{uvd,vce}.bin.zst out of the way …
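
A sketch of that experiment, kept reversible. stash_fiji_firmware is a hypothetical helper and the backup directory name is arbitrary. The firmware directory is a parameter only so the function can be tried on a scratch copy. If amdgpu is loaded from the initramfs, the image would presumably need regenerating afterwards for the change to take effect.

```shell
# Hypothetical helper: move the Fiji UVD and VCE firmware blobs into a backup directory.
# Usage: stash_fiji_firmware [firmware_dir] [backup_dir]
stash_fiji_firmware() {
    fwdir=${1:-/usr/lib/firmware/amdgpu}
    backup=${2:-/root/fiji-firmware-backup}
    mkdir -p "$backup" || return 1
    for blob in fiji_uvd.bin.zst fiji_vce.bin.zst; do
        if [ -e "$fwdir/$blob" ]; then
            mv "$fwdir/$blob" "$backup/" || return 1
            echo "moved $blob"
        else
            echo "not found: $blob"
        fi
    done
}

# To undo: mv /root/fiji-firmware-backup/fiji_*.bin.zst /usr/lib/firmware/amdgpu/
```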

Moviuro · 2023-10-07 10:17:30

My other PC is also going bonkers with that card. So I suspect the card itself is dead ☠️ https://x0.at/eu2C.txt

Arch Linux
