#1 2019-09-14 17:52:51

rbn14
Member
Registered: 2009-03-22
Posts: 23

Identical GPU PCI passthrough

I have two Vega 64 GPUs. After following the wiki, it looks like the scripts prevent amdgpu from loading for the second card, but vfio-pci is not loaded in its place.

30:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
30:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
33:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
	Kernel modules: amdgpu
33:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Kernel modules: snd_hda_intel
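
For reference, the same "Kernel driver in use" information can be read straight from sysfs, which is handy inside scripts. A minimal sketch, assuming the usual sysfs layout; the function name and the optional sysfs-root argument are made up for illustration:

```shell
#!/bin/sh
# Print the driver currently bound to a PCI device, or "none" if unbound.
# $1: PCI address such as 0000:33:00.0
# $2: sysfs root, defaults to /sys (a parameter only so the function can be
#     tried against a fake directory tree)
bound_driver() {
    dev_path="${2:-/sys}/bus/pci/devices/$1"
    if [ -L "$dev_path/driver" ]; then
        # The "driver" symlink points at the driver's directory; its last
        # path component is the driver name.
        basename "$(readlink "$dev_path/driver")"
    else
        echo "none"
    fi
}
```

With the state shown above, `bound_driver 0000:33:00.0` would be expected to print `none`, since lspci lists no driver in use for that card.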

Thanks for any help.

Offline

#2 2019-09-14 18:03:40

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,916

Re: Identical GPU PCI passthrough

following the wiki

Which page and section are you referring to ?
If it's the page / section I expect, you should have created 3 files and edited one .
Post the content of those 4 files please.


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)

Offline

#3 2019-09-14 18:52:17

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Lone_Wolf wrote:

following the wiki

Which page and section are you referring to ?
If it's the page / section I expect, you should have created 3 files and edited one .
Post the content of those 4 files please.

This section

/usr/bin/vfio-pci-override.sh

#!/bin/sh

for i in /sys/bus/pci/devices/*/boot_vga; do
    if [ $(cat "$i") -eq 0 ]; then
        GPU="${i%/boot_vga}"
        AUDIO="$(echo "$GPU" | sed -e "s/0$/1/")"
        echo "vfio-pci" > "$GPU/driver_override"
        if [ -d "$AUDIO" ]; then
            echo "vfio-pci" > "$AUDIO/driver_override"
        fi
    fi
done

modprobe -i vfio-pci

/etc/initcpio/install/vfio

#!/bin/bash
build() {
    add_file /usr/bin/vfio-pci-override.sh
    add_runscript
}

/etc/initcpio/hooks/vfio

#!/usr/bin/ash

run_hook() {
    msg ":: Triggering vfio-pci override"
    /bin/sh /usr/bin/vfio-pci-override.sh
}

/etc/mkinitcpio.conf

MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd)

BINARIES=("/usr/bin/btrfs")

FILES=(/crypto_keyfile.bin /usr/bin/vfio-pci-override.sh)

HOOKS=(base udev autodetect block encrypt bcache filesystems keyboard fsck modconf vfio)

Thanks!

Offline

#4 2019-09-14 19:00:00

loqs
Member
Registered: 2014-03-06
Posts: 17,362

Re: Identical GPU PCI passthrough

You are using the script from the section Passthrough all GPUs but the boot GPU.  Which GPU is the boot GPU?
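
If in doubt, the boot GPU can be read from the same boot_vga attribute the wiki script tests. A rough sketch; the function name and the optional sysfs-root argument are hypothetical:

```shell
#!/bin/sh
# List every PCI device that exposes a boot_vga attribute and say whether
# the firmware marked it as the boot display adapter.
# $1: sysfs root, defaults to /sys
list_boot_vga() {
    for attr in "${1:-/sys}"/bus/pci/devices/*/boot_vga; do
        [ -e "$attr" ] || continue   # the glob matched nothing
        dev="${attr%/boot_vga}"
        if [ "$(cat "$attr")" -eq 1 ]; then
            echo "${dev##*/} boot"
        else
            echo "${dev##*/} non-boot"
        fi
    done
}
```

The override script only touches devices reported as non-boot here, so the card meant for the guest has to show up in that category.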

Offline

#5 2019-09-14 21:58:39

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

loqs wrote:

You are using the script from the section Passthrough all GPUs but the boot GPU.  Which GPU is the boot GPU?

This one

30:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
30:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

Offline

#6 2019-09-14 22:06:35

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Something I just realized is that after I start the virtual machine I get

33:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu
33:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel

I don't get any complaints from virt-manager but I just get a black screen and the monitor says no source.

Updated:
Looks like disabling ROM BAR gets me a little farther: I get to the boot screen, but then the VM freezes and one CPU gets pegged at 100%.

Last edited by rbn14 (2019-09-14 22:59:00)

Offline

#7 2019-09-15 11:01:21

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,916

Re: Identical GPU PCI passthrough

So passthrough seems to work; maybe both video cards are in the same IOMMU group.

Post the output of the script at https://wiki.archlinux.org/index.php/PC … _are_valid .
Also the full output of

$ lspci -tv
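
In case the wiki link is hard to reach, the grouping can also be listed with a few lines of shell. This is only a sketch along the lines of the wiki script, not a copy of it; it prints bare PCI addresses, and the function name and sysfs-root argument are assumptions:

```shell
#!/bin/sh
# Print each IOMMU group followed by the PCI addresses it contains.
# $1: sysfs root, defaults to /sys
list_iommu_groups() {
    for group in "${1:-/sys}"/kernel/iommu_groups/*; do
        [ -d "$group" ] || continue   # no groups: IOMMU is probably off
        echo "IOMMU Group ${group##*/}:"
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            printf '\t%s\n' "${dev##*/}"
        done
    done
}
```

Each printed address can be passed to `lspci -nns` to get the device name and IDs.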

Offline

#8 2019-09-15 21:30:19

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

iommu groups

IOMMU Group 0:
	00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 1:
	00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 10:
	00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 11:
	00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
IOMMU Group 12:
	00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
	00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 13:
	00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
	00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
	00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
	00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
	00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
	00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
	00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
	00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
IOMMU Group 14:
	01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
IOMMU Group 15:
	03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01)
	03:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
	03:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)
	1d:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
	1d:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
	1d:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
	1d:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
	1d:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
	1e:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM2142 USB 3.1 Host Controller [1b21:2142]
	20:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)
	21:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
	26:01.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
	26:03.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
	26:05.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
	26:07.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
	27:00.0 Network controller [0280]: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] [8086:24fb] (rev 10)
	2a:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
	2c:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
	2d:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:07b1] (rev 02)
IOMMU Group 16:
	2e:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1470] (rev c1)
IOMMU Group 17:
	2f:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1471]
IOMMU Group 18:
	30:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
IOMMU Group 19:
	30:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
IOMMU Group 2:
	00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 20:
	31:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1470] (rev c1)
IOMMU Group 21:
	32:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1471]
IOMMU Group 22:
	33:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
IOMMU Group 23:
	33:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
IOMMU Group 24:
	34:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
IOMMU Group 25:
	34:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
IOMMU Group 26:
	34:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
IOMMU Group 27:
	35:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
IOMMU Group 28:
	35:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 29:
	35:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
IOMMU Group 3:
	00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 4:
	00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 5:
	00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 6:
	00:03.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 7:
	00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 8:
	00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 9:
	00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]

lspci -tv

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-01.1-[01]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           +-01.3-[03-2d]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43d0
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller
           |               \-00.2-[1d-2d]--+-00.0-[1e]----00.0  ASMedia Technology Inc. ASM2142 USB 3.1 Host Controller
           |                               +-02.0-[20]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           |                               +-03.0-[21-2b]----00.0-[26-2b]--+-01.0-[27]----00.0  Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak]
           |                               |                               +-03.0-[28]--
           |                               |                               +-05.0-[2a]----00.0  Intel Corporation I211 Gigabit Network Connection
           |                               |                               \-07.0-[2b]--
           |                               +-04.0-[2c]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           |                               \-09.0-[2d]----00.0  Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion]
           +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.1-[2e-30]----00.0-[2f-30]----00.0-[30]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64]
           |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
           +-03.2-[31-33]----00.0-[32-33]----00.0-[33]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64]
           |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
           +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.1-[34]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-08.1-[35]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
           +-18.4  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
           +-18.5  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
           +-18.6  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
           \-18.7  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7

Offline

#9 2019-09-16 11:17:13

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,916

Re: Identical GPU PCI passthrough

IOMMU groups are separated further than I expected, so there is no need for special trickery like ACS.

You do have a separate monitor attached to the 2nd videocard, right ?

Try starting a VM with plain qemu , see https://wiki.archlinux.org/index.php/PC … ut_libvirt

Last edited by Lone_Wolf (2019-09-16 11:17:37)


Offline

#10 2019-09-16 14:14:06

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

The card is dead at this point, because you cannot reset the device after the driver is bound. I am almost certain that trying to reboot the VM after would result in PCIe header errors and can only be fixed by a full power cycle.

Anyway, the issue remains the same. The second GPU must be assigned to vfio before amdgpu.
In the past I had one case where the issue was fixed by adding amdgpu and the sound drivers to the initramfs. Please try this first.
Another thing you could try is to bind both cards to the stub driver and do the script approach in reverse.

If neither of them works, check whether the GPU really is not used during boot (efifb) and continue from there.

Offline

#11 2019-09-18 02:17:31

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:

The card is dead at this point, because you cannot reset the device after the driver is bound. I am almost certain that trying to reboot the VM after would result in PCIe header errors and can only be fixed by a full power cycle.

Anyway, the issue remains the same. The second GPU must be assigned to vfio before amdgpu.
In the past I had one case where the issue was fixed by adding amdgpu and the sound drivers to the initram. Try this first please.
What you could try is bind both cards to the stub driver and do the script approach in reverse.

If neither of them work check if the GPU really is not used during boot (efifb) and continue from here.

The symptoms do seem a bit akin to what I've read about the "amd reset bug". I will experiment a bit more and report back here. Thanks.

Offline

#12 2019-09-21 21:19:59

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:

The card is dead at this point, because you cannot reset the device after the driver is bound. I am almost certain that trying to reboot the VM after would result in PCIe header errors and can only be fixed by a full power cycle.

Anyway, the issue remains the same. The second GPU must be assigned to vfio before amdgpu.
In the past I had one case where the issue was fixed by adding amdgpu and the sound drivers to the initram. Try this first please.
What you could try is bind both cards to the stub driver and do the script approach in reverse.

If neither of them work check if the GPU really is not used during boot (efifb) and continue from here.

Well, I patched the kernel to fix the reset bug, but nothing changed.
I'm not totally sure what you mean by adding amdgpu and the sound drivers to the initramfs. Place them in

MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd)

?

Thanks

Offline

#13 2019-09-21 21:29:06

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

Yes, at the end.

I don't know if there is a fix for the reset bug. I recently read that gnif is working on one (after he is done with Navi), but I haven't checked back. At least you can easily fix VM reboot by adding bind and unbind tasks in Windows.
For now the goal is to not even run into the problem, therefore the vfio driver must be used at boot time.

Offline

#14 2019-09-22 23:03:01

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:

Yes, at the end.

I don't know if there is a fix for the reset bug. I recently read that gnif is working on one (after he is done with Navi), but I haven't checked back. At least you can easily fix VM reboot by adding bind and unbind tasks in Windows.
For now the goal is to not even run into the problem, therefore the vfio driver must be used at boot time.

Added the other drivers

MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd amdgpu snd_hda_intel)

but it didn't result in any changes.

As for the reset bug, I used this patch based on this thread. I have an ASRock Taichi Ultimate X470.

I might try a fresh Windows VM as well and see if it is behaving the same.

Offline

#15 2019-09-22 23:28:49

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

I didn't really expect it to fix the issue, but something to try as said before. Please try the other options as well.

rbn14 wrote:

As for the reset bug I used this patch based on this thread. I have a Asrock Taichi Ultimate x470

This is unrelated and a different issue. Looks like some bug on the PCI bridge and you did not mention you were affected by this exact header error?

rbn14 wrote:

I might try a fresh Windows VM as well and see if it is behaving the same.

It won't change.

The issue is the same since the first post: The card must not be initialized by any driver. It's the same for any recent RX or related GPUs, well known and the only reason why the vfio bind is required. No way around it until (if ever) a proper fix is found.

Offline

#16 2019-09-28 20:56:18

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:

I didn't really expect it to fix the issue, but something to try as said before. Please try the other options as well.

rbn14 wrote:

As for the reset bug I used this patch based on this thread. I have a Asrock Taichi Ultimate x470

This is unrelated and a different issue. Looks like some bug on the PCI bridge and you did not mention you were affected by this exact header error?

I am not affected by it; it just read like it was a fix for the reset bug as well. Anyway, I tried to bind all cards to vfio-pci and then override in reverse, as you suggested.

/etc/modprobe.d/vfio.conf

options vfio-pci ids=1002:687f,1002:aaf8

/etc/mkinitcpio.conf

MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd)
BINARIES=("/usr/bin/btrfs")
FILES=(/crypto_keyfile.bin /usr/bin/vfio-pci-override.sh)
HOOKS=(base udev autodetect block encrypt bcache filesystems keyboard fsck modconf amd_gpu)

/usr/bin/vfio-pci-override.sh

#!/bin/sh

DEVS="0000:30:00.0 0000:30:00.1"

if [ ! -z "$(ls -A /sys/class/iommu)" ]; then
    for DEV in $DEVS; do
        echo "amdgpu" > /sys/bus/pci/devices/$DEV/driver_override
    done
fi

/etc/initcpio/install/amd_gpu

#!/bin/bash
build() {
    add_file /usr/bin/vfio-pci-override.sh
    add_runscript
}

/etc/initcpio/hooks/amd_gpu

#!/usr/bin/bash

run_hook() {
    msg ":: Triggering vfio-pci override"
    /bin/sh /usr/bin/vfio-pci-override.sh
}

After regenerating the initramfs and rebooting, I get a black screen after the GRUB menu.

Last edited by rbn14 (2019-09-28 21:19:43)

Offline

#17 2019-09-28 21:07:51

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

rbn14 wrote:
/etc/modprobe.d/vfio.conf

options vfio-pci ids=10de:13c2,10de:0fbb

That's not your GPU, but some other card.

Offline

#18 2019-09-28 21:19:14

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:
rbn14 wrote:
/etc/modprobe.d/vfio.conf

options vfio-pci ids=10de:13c2,10de:0fbb

That's not your GPU, but some other card.

Sorry, that was a mistake. I changed it to

options vfio-pci ids=1002:687f,1002:aaf8

and I get a black screen.

My guess is that in both cases my scripts are not working correctly. Unfortunately, I don't know enough about bash scripts or the override mechanism to troubleshoot it.

Last edited by rbn14 (2019-09-28 21:22:09)

Offline

#19 2019-09-28 21:38:21

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

Also hooks are supposed to be busybox ash compatible and therefore should have the shebang "#!/usr/bin/ash". Tbh I do not know if it makes a difference, but it is what the mkinitcpio doc says.

What happens if you ssh into the machine and run the script after boot? Make sure to not boot into the graphical target. If it does not work can you manually unbind the driver, load the correct ones and move into the graphical target?

Offline

#20 2019-09-29 02:09:52

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:

Also hooks are supposed to be busybox ash compatible and therefore should have the shebang "#!/usr/bin/ash". Tbh I do not know if it makes a difference, but it is what the mkinitcpio doc says.

What happens if you ssh into the machine and run the script after boot? Make sure to not boot into the graphical target. If it does not work can you manually unbind the driver, load the correct ones and move into the graphical target?

I switched to "#!/usr/bin/ash" but it didn't change anything.

I ssh'd into the computer and was able to unbind and bind drivers like so:

echo 0000:30:00.0 > /sys/bus/pci/devices/0000\:30\:00.0/driver/unbind
echo 0000:30:00.0 > /sys/bus/pci/drivers/amdgpu/bind

It worked and my screen came back up and I could issue "startx" and go into my graphical environment as usual.

Running the override script manually did nothing.
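
That is probably expected: as far as I understand the kernel's sysfs documentation, writing to driver_override does not unbind a device from its current driver; it only decides which driver may claim the device at the next probe. A sketch of the full sequence, assuming the target driver module is already loaded; the function name and the sysfs-root argument are hypothetical:

```shell
#!/bin/sh
# Move one PCI device to another driver: set the override, unbind the
# current driver if there is one, then ask the PCI bus to probe the device
# again so the override takes effect.
# $1: PCI address, $2: driver name, $3: sysfs root (defaults to /sys)
override_and_reprobe() {
    pci_root="${3:-/sys}/bus/pci"
    dev_path="$pci_root/devices/$1"
    [ -d "$dev_path" ] || { echo "no such device: $1" >&2; return 1; }
    echo "$2" > "$dev_path/driver_override"
    if [ -e "$dev_path/driver/unbind" ]; then
        echo "$1" > "$dev_path/driver/unbind"
    fi
    echo "$1" > "$pci_root/drivers_probe"
}
```

Run as root, `override_and_reprobe 0000:30:00.0 amdgpu` should end up in the same state as the two manual echo commands above.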

Offline

#21 2019-09-29 11:53:59

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

Ok, at least some good news.
Here is a rough plan to automate it:

/etc/systemd/system/bind-gpu.service

[Unit]
Description=Bind GPU to display driver
Before=basic.target
After=local-fs.target sysinit.target
DefaultDependencies=no

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/bind-gpu

[Install]
WantedBy=basic.target

/usr/local/sbin/bind-gpu

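#!/bin/bash

# PCI addresses of the host (boot) GPU and its HDMI audio function.
gpu='0000:30:00.0'
gpu_audio='0000:30:00.1'

# rebind <pci-address> <driver>: detach the device from its current driver
# (if any) and bind it to the named driver.
function rebind {
    local device_path="/sys/bus/pci/devices/${1}"
    local driver_path="/sys/bus/pci/drivers/${2}"
    if [[ -d "$device_path" ]]; then
        if [[ -d "$driver_path" ]]; then
            # The driver link is missing when nothing is bound yet.
            if [[ -e "${device_path}/driver/unbind" ]]; then
                echo "$1" > "${device_path}/driver/unbind"
            fi
            echo "$1" > "${driver_path}/bind"
        else
            echo "Driver: \"${driver_path}\" does not exist" >&2
        fi
    else
        echo "Device: \"${device_path}\" does not exist" >&2
    fi
}

rebind "$gpu" amdgpu
rebind "$gpu_audio" snd_hda_intel

Then, as root:

chmod 744 /usr/local/sbin/bind-gpu
chown root:root /usr/local/sbin/bind-gpu
systemctl enable bind-gpu.service
reboot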

Note: This is untested and might require some tweaking. Moving the devices to a config file might be a good idea after it is confirmed working.
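
For the config file idea, something like the following could work. It is an equally untested sketch: the file format (one "address driver" pair per line) and the function name are invented here, and the second argument is the command to call for each pair:

```shell
#!/bin/sh
# Read "PCI-address driver" pairs from a config file and call a handler
# for each one. Blank lines and lines starting with # are skipped.
# $1: path to the config file (hypothetical format)
# $2: handler command, defaults to echo so the pairs are just printed
read_gpu_config() {
    # The "|| [ -n ... ]" part keeps a last line without trailing newline.
    while read -r addr driver rest || [ -n "$addr" ]; do
        case "$addr" in
            ''|'#'*) continue ;;
        esac
        [ -n "$driver" ] || { echo "missing driver for $addr" >&2; continue; }
        "${2:-echo}" "$addr" "$driver"
    done < "$1"
}
```

Inside bind-gpu, a call such as `read_gpu_config /etc/bind-gpu.conf rebind` (the path is only an example) would replace the two hard-coded rebind lines.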

Last edited by Swiggles (2019-09-29 12:00:10)

Offline

#22 2019-09-29 19:36:11

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:

Ok, at least some good news.
Here is a rough plan to automate it:

/etc/systemd/system/bind-gpu.service

[Unit]
Description=Bind GPU to display driver
Before=basic.target
After=local-fs.target sysinit.target
DefaultDependencies=no

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/bind-gpu

[Install]
WantedBy=basic.target

/usr/local/sbin/bind-gpu

#!/bin/bash

gpu='0000:30:00.0'
gpu_audio='0000:30:00.1'

function rebind {
    local device_path="/sys/bus/pci/devices/${1}"
    local driver_path="/sys/bus/pci/drivers/${2}"
    if [[ -d "$device_path" ]]; then
        if [[ -d "$driver_path" ]]; then
            echo $1 > "${device_path}/driver/unbind"
            echo $1 > "${driver_path}/bind"
        else
            echo Driver: \"${driver_path}\" does not exist >&2
        fi
    else
        echo Device: \"${device_path}\" does not exist >&2
    fi  
}

rebind "$gpu" amdgpu
rebind "$gpu_audio" snd_hda_intel
chmod 744 /usr/local/sbin/bind-gpu
chown root:root /usr/local/sbin/bind-gpu
systemctl enable bind-gpu.service
reboot

Note: This is untested and might require some tweaking. Moving the devices to a config file might be a good idea after it is confirmed working.

Looks like the drivers are being correctly assigned now. It is a bit slow to bring up the screen at boot, but that doesn't bother me. I really appreciate all your help with this!

Offline

#23 2019-09-29 21:21:49

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

I don't know if there is any point you can run this earlier. Ofc you could try to run this script as an initcpio hook, but I guess we kinda tried it before. Maybe someone else has a better idea to implement this?

Does this mean the VM passthrough is also working for you?

Offline

#24 2019-09-30 19:31:39

rbn14
Member
Registered: 2009-03-22
Posts: 23

Re: Identical GPU PCI passthrough

Swiggles wrote:

I don't know if there is any point you can run this earlier. Ofc you could try to run this script as an initcpio hook, but I guess we kinda tried it before. Maybe someone else has a better idea to implement this?

Does this mean the VM passthrough is also working for you?

It is still not working. With ROM BAR checked, I just get a black screen in virt-manager and "no signal" on the monitor. With ROM BAR unchecked, I get the TianoCore boot screen and the Windows "wheel", then a freeze in virt-manager and "no signal" on the monitor. When I remove the PCI devices from virt-manager, it boots up fine and displays in virt-manager.

Offline

#25 2019-09-30 20:12:55

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: Identical GPU PCI passthrough

What if you power off your system, start it again (note: not reboot!) and directly start the VM with rom bar unchecked? Edit: Also monitor dmesg for possible errors.
If that's working follow this on the Windows side: https://forum.level1techs.com/t/linux-h … fix/121097

It takes care of the reinitialization problem with some AMD cards (as mentioned in #13).
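
For the dmesg monitoring mentioned above, a small filter keeps the output readable. The keyword list is only a guess at what tends to matter for passthrough, and the function name is made up:

```shell
#!/bin/sh
# Keep only kernel log lines that are likely relevant to passthrough.
# Reads log text on stdin, so it works with dmesg or journalctl -k output.
# The pattern list is an assumption, not an exhaustive set.
filter_passthrough_log() {
    grep -iE 'vfio|amdgpu|AMD-Vi|iommu|pcieport|BAR [0-9]'
}
```

Typical use would be `dmesg | filter_passthrough_log` right after starting the VM, or piping `dmesg --follow` into it while the VM boots.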

On the other hand, if it is not working, try to supply a ROM file and re-enable ROM BAR. I have seen some success with it in some cases. Follow this until you modify your libvirt file: https://wiki.archlinux.org/index.php/PC … y_in_VBIOS
Do not continue further and try to flash your card!
If the dump is not working for you (which may happen for some cards and setups), the alternative is to download it from a third-party source: TechPowerUp. Make sure to pick the right vendor and just select the newest.
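
Whichever way you get the file, a quick sanity check before pointing libvirt at it is to look for the PCI expansion ROM signature (bytes 0x55 0xAA) at the start. A minimal sketch; the function name is hypothetical and the check says nothing about whether the image matches your card:

```shell
#!/bin/sh
# Return success if the file starts with the PCI expansion ROM signature
# bytes 0x55 0xAA. This is a sanity check only.
# $1: path to the ROM file
has_rom_signature() {
    # od prints the first two bytes as hex; tr strips spaces and newline.
    sig="$(od -An -tx1 -N2 "$1" | tr -d ' \n')"
    [ "$sig" = "55aa" ]
}
```

For example, `has_rom_signature vega64.rom && echo "looks like a PCI ROM"`, where vega64.rom stands for whatever file name you used. An empty or truncated dump fails this check.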

Last edited by Swiggles (2019-09-30 20:14:10)

Offline
