#1 2020-12-21 21:27:33

knacky
Member
Registered: 2020-12-21
Posts: 3

Boot hangs on fb0: switching to amdgpudrmfb from EFI VGA

Hello Arch friends,

Writing up this post forced me to do significantly more digging and testing, so thanks for scaring me with the help vampirism warning! Otherwise I would have just reported the subject line and never figured out the set of kernel params that lets me SSH in and pull logs.

I cannot boot either an updated Arch system (pacman -Syu) or a USB ISO without nomodeset. Both boot via UEFI/GPT.

Enabling early KMS via MODULES in /etc/mkinitcpio.conf does not change the freeze point (the switch away from "EFI VGA").
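For reference, early KMS here means listing amdgpu in the MODULES array and regenerating the initramfs. A minimal sketch for checking that the setting is really in place, assuming the array-style MODULES=(...) syntax on a single line; has_early_kms is a hypothetical helper name, not part of mkinitcpio:

```shell
#!/bin/sh
# Succeed if amdgpu appears in an uncommented, single-line MODULES=(...) entry.
# Assumption: array syntax as used in mkinitcpio.conf; quoted or multi-line
# forms are not handled by this sketch.
has_early_kms() {
    conf=${1:-/etc/mkinitcpio.conf}
    grep -Eq '^[[:space:]]*MODULES=\(([^)]* )?amdgpu( [^)]*)?\)' "$conf"
}

# Usage:
#   has_early_kms && echo "amdgpu is in MODULES" || echo "amdgpu is not in MODULES"
```

After editing the file, the initramfs still has to be rebuilt (mkinitcpio -P) before the change takes effect.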

Adding ALL of amdgpu.dc=0 amdgpu.dpm=0 amdgpu.gpu_recovery=1 seems to let the boot continue, although the screen stays frozen. I can then SSH in and look at the logs. Without these parameters, the freeze does not leave a journal entry that I could inspect later by rebooting with KMS disabled.
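Since all three parameters have to be present, it can help to confirm over SSH that they actually reached the kernel. A small sketch, assuming a POSIX shell; check_amdgpu_params is a hypothetical helper, and how the parameters get onto the command line depends on the boot loader, which is not covered here:

```shell
#!/bin/sh
# Report which of the three workaround parameters are on a kernel command line.
# Reads /proc/cmdline by default; pass a string as $1 to check it offline.
check_amdgpu_params() {
    cmdline=${1-$(cat /proc/cmdline)}
    for p in amdgpu.dc=0 amdgpu.dpm=0 amdgpu.gpu_recovery=1; do
        case " $cmdline " in
            *" $p "*) echo "$p: set" ;;
            *)        echo "$p: missing" ;;
        esac
    done
}

# Usage on the affected machine:
#   check_amdgpu_params
```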

Relevant dmesg:

[    5.286673] [drm] amdgpu kernel modesetting enabled.
[    5.289173] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.289282] CRAT table not found
[    5.291820] Virtual CRAT table created for CPU
[    5.292517] amdgpu: Topology: Add CPU node
[    5.293308] checking generic (c0000000 300000) vs hw (c0000000 10000000)
[    5.295896] fb0: switching to amdgpudrmfb from EFI VGA
[    5.298653] Console: switching to colour dummy device 80x25
[    5.298682] amdgpu 0000:01:00.0: vgaarb: deactivate vga console
[    5.298886] [drm] initializing kernel modesetting (TAHITI 0x1002:0x6798 0x1043:0x3006 0x00).
[    5.298898] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    5.298920] [drm] register mmio base: 0xFEA00000
[    5.298925] [drm] register mmio size: 262144
[    5.298931] [drm] PCIE atomic ops is not supported
[    5.298941] [drm] add ip block number 0 <si_common>
[    5.298946] [drm] add ip block number 1 <gmc_v6_0>
[    5.298951] [drm] add ip block number 2 <si_ih>
[    5.298956] [drm] add ip block number 3 <gfx_v6_0>
[    5.298961] [drm] add ip block number 4 <si_dma>
[    5.298966] [drm] add ip block number 5 <si_dpm>
[    5.298971] [drm] add ip block number 6 <dce_v6_0>
[    5.298976] [drm] add ip block number 7 <uvd_v3_1>
[    5.298983] kfd kfd: TAHITI  not supported in kfd
[    5.317829] [drm] BIOS signature incorrect 0 0
[    5.317841] amdgpu 0000:01:00.0: No more image in the PCI ROM
[    5.317903] amdgpu: ATOM BIOS: 113-AD47800-100
[    5.318270] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    5.319147] amdgpu 0000:01:00.0: amdgpu: VRAM: 3072M 0x000000F400000000 - 0x000000F4BFFFFFFF (3072M used)
[    5.319155] amdgpu 0000:01:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[    5.319166] [drm] Detected VRAM RAM=3072M, BAR=256M
[    5.319170] [drm] RAM width 384bits GDDR5
[    5.319338] [TTM] Zone  kernel: Available graphics memory: 4050482 KiB
[    5.319342] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[    5.319347] [TTM] Initializing pool allocator
[    5.319355] [TTM] Initializing DMA pool allocator
[    5.319398] [drm] amdgpu: 3072M of VRAM memory ready
[    5.319406] [drm] amdgpu: 3072M of GTT memory ready.
[    5.319414] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    5.319866] amdgpu 0000:01:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400300000).
[    5.322642] [drm] AMDGPU Display Connectors
[    5.322645] [drm] Connector 0:
[    5.322648] [drm]   DP-1
[    5.322650] [drm]   HPD6
[    5.322653] [drm]   DDC: 0x1958 0x1958 0x1959 0x1959 0x195a 0x195a 0x195b 0x195b
[    5.322657] [drm]   Encoders:
[    5.322660] [drm]     DFP1: INTERNAL_UNIPHY1
[    5.322663] [drm] Connector 1:
[    5.322665] [drm]   HDMI-A-1
[    5.322667] [drm]   HPD1
[    5.322670] [drm]   DDC: 0x1954 0x1954 0x1955 0x1955 0x1956 0x1956 0x1957 0x1957
[    5.322674] [drm]   Encoders:
[    5.322676] [drm]     DFP2: INTERNAL_UNIPHY1
[    5.322679] [drm] Connector 2:
[    5.322681] [drm]   DVI-I-1
[    5.322683] [drm]   HPD4
[    5.322686] [drm]   DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 0x1953
[    5.322690] [drm]   Encoders:
[    5.322692] [drm]     DFP3: INTERNAL_UNIPHY2
[    5.322695] [drm]     CRT1: INTERNAL_KLDSCP_DAC1
[    5.322697] [drm] Connector 3:
[    5.322699] [drm]   DVI-D-1
[    5.322702] [drm]   HPD3
[    5.322704] [drm]   DDC: 0x1960 0x1960 0x1961 0x1961 0x1962 0x1962 0x1963 0x1963
[    5.322708] [drm]   Encoders:
[    5.322710] [drm]     DFP4: INTERNAL_UNIPHY
[    5.323704] [drm] Found UVD firmware Version: 64.0 Family ID: 13
[    5.325017] [drm] PCIE gen 2 link speeds already enabled
[    5.348886] EDAC amd64: F15h detected (node 0).
[    5.348994] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.418775] EDAC amd64: F15h detected (node 0).
[    5.418867] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.496035] EDAC amd64: F15h detected (node 0).
[    5.496116] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.504480] random: crng init done
[    5.504485] random: 7 urandom warning(s) missed due to ratelimiting
[    5.505191] [drm:uvd_v3_1_hw_init [amdgpu]] *ERROR* amdgpu: UVD Firmware validate fail (-22).
[    5.505308] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <uvd_v3_1> failed -22
[    5.505313] amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[    5.505316] amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[    5.505319] [drm] amdgpu: finishing device.
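To pull only the failure lines out of a log like the one above, a filter along these lines may be enough. It is a sketch, not a complete classifier: the patterns are simply the strings that appear in this log, and amdgpu_errors is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only amdgpu/drm lines that look like failures.
# Reads kernel log text on stdin and writes the matching lines to stdout.
amdgpu_errors() {
    grep -E 'amdgpu|\[drm' | grep -E '\*ERROR\*|Fatal error|failed'
}

# Usage:
#   dmesg | amdgpu_errors
```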

Is this a driver/firmware issue that can be solved? Is my graphics card dead? Is this a red herring for something else completely?

Thanks much in advance!



Potentially related:
iommu=soft is set because of networking issues with this specific motherboard, but including or excluding it has no effect here either.
I have enabled/disabled IOMMU, EHCI, XHCI options in BIOS as per other forum posts regarding similar issues; this does not seem to affect the outcome.

Hardware:
Motherboard: Gigabyte 990FXA-UD5 R.3, updated BIOS
CPU: AMD FX-8350
RAM: G.SKILL Ripjaw 1x8GB 1600MHz
GPU: ASUS R9 280X DIRECTCU II TOP 3GB

uname -r 
5.9.14-arch1-1
lspci -k | grep VGA -A2
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Tahiti XTL [Radeon R9 280X DirectCU II TOP]
	Kernel modules: radeon, amdgpu


#2 2020-12-21 21:39:38

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,079

Re: Boot hangs on fb0: switching to amdgpudrmfb from EFI VGA

Do you have a time frame when this started? Try downgrading the linux-firmware package.
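If older builds are still in the local pacman cache, something like the following sketch lists the candidates. It assumes the default cache directory and relies on linux-firmware versions starting with a date, so a plain sort puts the oldest first; list_cached_firmware is a hypothetical helper name.

```shell
#!/bin/sh
# List cached linux-firmware package files, oldest version first.
# Assumption: default pacman cache path unless a directory is given as $1.
list_cached_firmware() {
    dir=${1:-/var/cache/pacman/pkg}
    ls -1 "$dir" 2>/dev/null |
        grep -E '^linux-firmware-[0-9].*\.pkg\.tar\.[a-z0-9]+$' |
        sort
}

# Usage:
#   list_cached_firmware
#   sudo pacman -U /var/cache/pacman/pkg/<file chosen from the list>
```

If the cache has already been cleaned, the list is empty and the older package has to come from somewhere else.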


#3 2020-12-21 22:52:26

knacky
Member
Registered: 2020-12-21
Posts: 3

Re: Boot hangs on fb0: switching to amdgpudrmfb from EFI VGA

V1del wrote:

Do you have a time frame when this started? Try downgrading the linux-firmware package.

Unfortunately my previous Arch install was on an MBR-partitioned drive, and I was a BAD_ARCH_USER and hadn't updated in... a while. A previous GPU crashed, which led to a reinstall and then to these issues.

I'll try downgrading the firmware and get back with an update!


#4 2020-12-27 06:16:44

knacky
Member
Registered: 2020-12-21
Posts: 3

Re: Boot hangs on fb0: switching to amdgpudrmfb from EFI VGA

No dice: I downgraded linux and linux-firmware to versions from around March and from around July, and neither helped (looking at the commits, there were changes to enable UVD for Tahiti around that time).

Additionally, my ability to SSH in when the boot hangs seems highly variable; I was not able to correlate it with either software versions or kernel boot parameters.

With linux-lts and linux-lts-firmware, I was able to boot! However, as soon as I tried to start an X program (KDE Plasma), I got tearing, flickering, artifacts, and crashes. I was able to check dmesg during one of the crashes:

[  104.317858] amdgpu 0000:01:00.0: GPU fault detected: 146 0x00031014
[  104.317865] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101200
[  104.317869] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03010014
[  104.317875] amdgpu 0000:01:00.0: VM fault (0x14, vmid 1) at page 1053184, write from '' (0x00000000) (16)
[  104.317883] amdgpu 0000:01:00.0: GPU fault detected: 146 0x00031014
[  104.317886] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101222
[  104.317890] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03050014
[  104.317895] amdgpu 0000:01:00.0: VM fault (0x14, vmid 1) at page 1053218, write from '' (0x00000000) (80)
[  104.317901] amdgpu 0000:01:00.0: GPU fault detected: 146 0x00431014
[  104.317905] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101250
[  104.317908] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03060014
[  104.317913] amdgpu 0000:01:00.0: VM fault (0x14, vmid 1) at page 1053264, write from '' (0x00000000) (96)
[  104.319723] amdgpu 0000:01:00.0: GPU fault detected: 147 0x054a7002
[  104.319724] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000022A
[  104.319726] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A070002
[  104.319727] amdgpu 0000:01:00.0: VM fault (0x02, vmid 5) at page 554, read from '' (0x00000000) (112)
[  114.737900] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1260, emitted seq=1262
[  114.738105] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ksmserver-logou pid 817 thread ksmserver-:cs0 pid 820
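The VM fault lines above can be tallied per VM ID and access type to see whether the faults cluster. A rough sketch, assuming the exact "VM fault (..., vmid N) at page P, read/write from" wording seen in this log; vm_fault_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Count amdgpu VM faults per vmid and access type.
# Reads kernel log text on stdin; prints lines like "vmid 1 write: 3".
vm_fault_summary() {
    sed -nE 's/.*VM fault \(0x[0-9a-fA-F]+, vmid ([0-9]+)\) at page [0-9]+, (read|write) from.*/vmid \1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $4 ": " $1 }'
}

# Usage:
#   dmesg | vm_fault_summary
```

For the excerpt above this gives three write faults on vmid 1 and one read fault on vmid 5.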

This, along with the inconsistent boot errors, makes me think this is a hardware issue. The card is old, and I could believe it is nearing the end of its life. Is it worth trying Ubuntu or Windows? I imagine that some things may work and others may not, and debugging will be much harder without Arch... Are there any other tools I could use to test the card?

I'll try a couple more things, such as a different PCIe slot, before calling it a hardware problem and marking this as solved. If nothing else, I may try baking it. I wonder if these fancy new air fryers have a "Video Card" setting...
