I have two Vega 64 GPUs. After following the wiki, it looks like the scripts prevent amdgpu from being loaded for the second card, but vfio-pci is not being loaded in its place.
30:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu
30:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
33:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
    Kernel modules: amdgpu
33:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Kernel modules: snd_hda_intel
Thanks for any help.
Offline
rbn14 wrote: following the wiki
Which page and section are you referring to?
If it's the page and section I expect, you should have created three files and edited one.
Please post the contents of those four files.
Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.
(A works at time B) && (time C > time B ) ≠ (A works at time C)
Offline
/usr/bin/vfio-pci-override.sh
#!/bin/sh
for i in /sys/bus/pci/devices/*/boot_vga; do
    if [ $(cat "$i") -eq 0 ]; then
        GPU="${i%/boot_vga}"
        AUDIO="$(echo "$GPU" | sed -e "s/0$/1/")"
        echo "vfio-pci" > "$GPU/driver_override"
        if [ -d "$AUDIO" ]; then
            echo "vfio-pci" > "$AUDIO/driver_override"
        fi
    fi
done
modprobe -i vfio-pci
/etc/initcpio/install/vfio
#!/bin/bash
build() {
    add_file /usr/bin/vfio-pci-override.sh
    add_runscript
}
/etc/initcpio/hooks/vfio
#!/usr/bin/ash
run_hook() {
    msg ":: Triggering vfio-pci override"
    /bin/sh /usr/bin/vfio-pci-override.sh
}
/etc/mkinitcpio.conf
MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd)
BINARIES=("/usr/bin/btrfs")
FILES=(/crypto_keyfile.bin /usr/bin/vfio-pci-override.sh)
HOOKS=(base udev autodetect block encrypt bcache filesystems keyboard fsck modconf vfio)
Thanks!
Offline
You are using the script from the section Passthrough all GPUs but the boot GPU. Which GPU is the boot GPU?
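One way to answer that from the running system is to read the boot_vga attribute that the override script itself tests. The following is only a minimal sketch: the function name list_boot_vga and its optional sysfs-root argument are illustrative assumptions, not something taken from the wiki.

```shell
#!/bin/sh
# Print every PCI device that exposes a boot_vga attribute, with its value.
# boot_vga=1 marks the GPU the firmware used for boot output; 0 marks the others.
# The optional first argument (sysfs root) is a hypothetical addition so the
# function can be pointed at a test directory; it defaults to /sys.
list_boot_vga() {
    lbv_root="${1:-/sys}"
    for lbv_file in "$lbv_root"/bus/pci/devices/*/boot_vga; do
        [ -e "$lbv_file" ] || continue      # glob matched nothing
        lbv_dev="${lbv_file%/boot_vga}"     # strip the attribute name
        printf '%s boot_vga=%s\n' "${lbv_dev##*/}" "$(cat "$lbv_file")"
    done
}
```

Calling list_boot_vga without arguments reads /sys. The device reported with boot_vga=1 is the one the wiki script leaves alone.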
Offline
This one
30:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu
30:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
Offline
Something I just realized is that after I start the virtual machine I get
33:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RX Vega64 [1002:0b36]
    Kernel driver in use: vfio-pci
    Kernel modules: amdgpu
33:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
    Kernel driver in use: vfio-pci
    Kernel modules: snd_hda_intel
I don't get any complaints from virt-manager, but I just get a black screen and the monitor says no source.
Update:
It looks like disabling ROM BAR gets me a little further: I get to the boot screen, but then the VM freezes and one CPU is pegged at 100%.
Last edited by rbn14 (2019-09-14 22:59:00)
Offline
So passing through seems to work. Maybe both video cards are in the same IOMMU group.
Post the output of the script at https://wiki.archlinux.org/index.php/PC … _are_valid .
Also the full output of
$ lspci -tv
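For readers without the wiki page at hand, here is a rough sketch of how IOMMU groups can be listed straight from sysfs. It is not the wiki script itself: the name list_iommu_groups and the optional root argument are assumptions made for illustration, and it prints bare PCI addresses rather than lspci descriptions.

```shell
#!/bin/sh
# List each IOMMU group and the PCI addresses of the devices it contains.
# The optional first argument (sysfs root) is a hypothetical addition for
# testing; it defaults to /sys.
list_iommu_groups() {
    lig_root="${1:-/sys}"
    for lig_group in "$lig_root"/kernel/iommu_groups/*; do
        [ -d "$lig_group" ] || continue         # no groups: IOMMU likely disabled
        echo "IOMMU Group ${lig_group##*/}:"
        for lig_dev in "$lig_group"/devices/*; do
            [ -e "$lig_dev" ] || continue
            echo "    ${lig_dev##*/}"
        done
    done
}
```

Groups come out in shell glob order (0, 1, 10, 11, ...), not numeric order. A GPU and its audio function should not share a group with devices the host still needs.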
Offline
iommu groups
IOMMU Group 0:
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 1:
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 10:
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 11:
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
IOMMU Group 12:
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 13:
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
IOMMU Group 14:
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
IOMMU Group 15:
03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:43d0] (rev 01)
03:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller [1022:43c8] (rev 01)
03:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge [1022:43c6] (rev 01)
1d:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
1d:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
1d:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
1d:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
1d:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port [1022:43c7] (rev 01)
1e:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM2142 USB 3.1 Host Controller [1b21:2142]
20:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)
21:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
26:01.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
26:03.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
26:05.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
26:07.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1184e PCIe Switch Port [1b21:1184]
27:00.0 Network controller [0280]: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] [8086:24fb] (rev 10)
2a:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
2c:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
2d:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:07b1] (rev 02)
IOMMU Group 16:
2e:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1470] (rev c1)
IOMMU Group 17:
2f:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1471]
IOMMU Group 18:
30:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
IOMMU Group 19:
30:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
IOMMU Group 2:
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 20:
31:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1470] (rev c1)
IOMMU Group 21:
32:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1471]
IOMMU Group 22:
33:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1)
IOMMU Group 23:
33:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
IOMMU Group 24:
34:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
IOMMU Group 25:
34:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
IOMMU Group 26:
34:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
IOMMU Group 27:
35:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
IOMMU Group 28:
35:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 29:
35:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
IOMMU Group 3:
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 4:
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 5:
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 6:
00:03.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 7:
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 8:
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 9:
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
lspci -tv
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
           +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-01.1-[01]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           +-01.3-[03-2d]--+-00.0 Advanced Micro Devices, Inc. [AMD] Device 43d0
           |               +-00.1 Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller
           |               \-00.2-[1d-2d]--+-00.0-[1e]----00.0 ASMedia Technology Inc. ASM2142 USB 3.1 Host Controller
           |                               +-02.0-[20]----00.0 ASMedia Technology Inc. ASM1062 Serial ATA Controller
           |                               +-03.0-[21-2b]----00.0-[26-2b]--+-01.0-[27]----00.0 Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak]
           |                               |                               +-03.0-[28]--
           |                               |                               +-05.0-[2a]----00.0 Intel Corporation I211 Gigabit Network Connection
           |                               |                               \-07.0-[2b]--
           |                               +-04.0-[2c]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           |                               \-09.0-[2d]----00.0 Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion]
           +-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.1-[2e-30]----00.0-[2f-30]----00.0-[30]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64]
           |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
           +-03.2-[31-33]----00.0-[32-33]----00.0-[33]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64]
           |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
           +-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.1-[34]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
           |            +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
           |            \-00.3 Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
           +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-08.1-[35]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
           |            +-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           |            \-00.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
           +-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
           +-18.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
           +-18.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
           +-18.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
           +-18.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
           +-18.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
           +-18.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
           \-18.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
Offline
The IOMMU groups are separated further than I expected, so there is no need for special trickery like ACS.
You do have a separate monitor attached to the second video card, right?
Try starting a VM with plain QEMU, see https://wiki.archlinux.org/index.php/PC … ut_libvirt
Last edited by Lone_Wolf (2019-09-16 11:17:37)
Offline
The card is dead at this point, because you cannot reset the device after the driver is bound. I am almost certain that trying to reboot the VM after would result in PCIe header errors and can only be fixed by a full power cycle.
Anyway, the issue remains the same. The second GPU must be assigned to vfio before amdgpu.
In the past I had one case where the issue was fixed by adding amdgpu and the sound drivers to the initramfs. Please try this first.
What you could also try is binding both cards to the stub driver and doing the script approach in reverse.
If neither of them works, check whether the GPU really is not used during boot (efifb) and continue from there.
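To see which driver actually claimed a device after boot, without reading through the full lspci output, the driver symlink in sysfs can be inspected. This is a minimal sketch under stated assumptions: pci_driver and its optional second argument are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Print the name of the driver currently bound to a PCI device, or "none".
# $1: full PCI address, e.g. 0000:33:00.0
# $2: sysfs root (hypothetical parameter for testing; defaults to /sys)
pci_driver() {
    pd_root="${2:-/sys}"
    pd_link="$pd_root/bus/pci/devices/$1/driver"
    if [ -L "$pd_link" ]; then
        pd_target=$(readlink "$pd_link")    # points at .../drivers/<name>
        echo "${pd_target##*/}"
    else
        echo "none"                         # no driver bound to this device
    fi
}
```

On the setup in this thread, pci_driver 0000:33:00.0 should report vfio-pci right after boot if the early binding worked. The first post, where 33:00.0 has no "Kernel driver in use" line, corresponds to none.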
Offline
The symptoms do seem a bit akin to what I've read about the "amd reset bug". I will experiment a bit more and report back here. Thanks.
Offline
Well, I patched the kernel to fix the reset bug, but there was no change.
I'm not totally sure what you mean by adding amdgpu and the sound drivers to the initramfs. Do you mean placing them in
MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd)
?
Thanks
Offline
Yes, at the end.
I don't know if there is a fix for the reset bug. I recently read that gnif is working on one (after he is done with Navi), but I haven't checked back. At least you can easily fix VM reboot by adding bind and unbind tasks in Windows.
For now the goal is to not even run into the problem, therefore the vfio driver must be used at boot time.
Offline
Added the other drivers
MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd amdgpu snd_hda_intel)
but it didn't result in any changes.
As for the reset bug, I used this patch based on this thread. I have an ASRock Taichi Ultimate X470.
I might try a fresh Windows VM as well and see whether it behaves the same.
Offline
I didn't really expect it to fix the issue, but it was something to try, as I said before. Please try the other options as well.
rbn14 wrote: As for the reset bug I used this patch based on this thread. I have a Asrock Taichi Ultimate x470
This is unrelated and a different issue. It looks like some bug on the PCI bridge, and you did not mention that you were affected by this exact header error?
rbn14 wrote: I might try a fresh Windows VM as well and see if it is behaving the same.
It won't change.
The issue has been the same since the first post: the card must not be initialized by any driver. This is the same for any recent RX or related GPU; it is well known and the only reason why the vfio bind is required. There is no way around it until (if ever) a proper fix is found.
Offline
I am not affected by it; it just read like it was a fix for the reset bug as well. Anyway, I tried to bind all cards to vfio-pci and then override in reverse, as you suggested.
/etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:687f,1002:aaf8
/etc/mkinitcpio.conf
MODULES=(bcache vfio_pci vfio vfio_iommu_type1 vfio_virqfd)
BINARIES=("/usr/bin/btrfs")
FILES=(/crypto_keyfile.bin /usr/bin/vfio-pci-override.sh)
HOOKS=(base udev autodetect block encrypt bcache filesystems keyboard fsck modconf amd_gpu)
/usr/bin/vfio-pci-override.sh
#!/bin/sh
DEVS="0000:30:00.0 0000:30:00.1"
if [ ! -z "$(ls -A /sys/class/iommu)" ]; then
    for DEV in $DEVS; do
        echo "amdgpu" > /sys/bus/pci/devices/$DEV/driver_override
    done
fi
/etc/initcpio/install/amd_gpu
#!/bin/bash
build() {
    add_file /usr/bin/vfio-pci-override.sh
    add_runscript
}
/etc/initcpio/hooks/amd_gpu
#!/usr/bin/bash
run_hook() {
    msg ":: Triggering vfio-pci override"
    /bin/sh /usr/bin/vfio-pci-override.sh
}
Then I regenerated the initramfs and rebooted, which gets me a black screen after the GRUB menu.
Last edited by rbn14 (2019-09-28 21:19:43)
Offline
rbn14 wrote:
/etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:13c2,10de:0fbb
That's not your GPU, but some other card.
Offline
Sorry, that was a mistake. I changed it to
options vfio-pci ids=1002:687f,1002:aaf8
and I get a black screen.
My guess is that in both cases my scripts are not working correctly. Unfortunately, I don't know enough about bash scripts or the override mechanism to troubleshoot it.
Last edited by rbn14 (2019-09-28 21:22:09)
Offline
Also, hooks are supposed to be busybox ash compatible and therefore should have the shebang "#!/usr/bin/ash". To be honest, I do not know if it makes a difference, but it is what the mkinitcpio documentation says.
What happens if you ssh into the machine and run the script after boot? Make sure not to boot into the graphical target. If it does not work, can you manually unbind the driver, load the correct ones and move into the graphical target?
Offline
I switched to "#!/usr/bin/ash", but it didn't change anything.
I ssh'd into the computer and was able to unbind and bind drivers like so:
echo 0000:30:00.0 > /sys/bus/pci/devices/0000\:30\:00.0/driver/unbind
echo 0000:30:00.0 > /sys/bus/pci/drivers/amdgpu/bind
It worked: my screen came back up and I could issue "startx" and go into my graphical environment as usual.
Running the override script manually did nothing.
Offline
Ok, at least some good news.
Here is a rough plan to automate it:
/etc/systemd/system/bind-gpu.service
[Unit]
Description=Bind GPU to display driver
Before=basic.target
After=local-fs.target sysinit.target
DefaultDependencies=no
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/bind-gpu
[Install]
WantedBy=basic.target
/usr/local/sbin/bind-gpu
#!/bin/bash
gpu='0000:30:00.0'
gpu_audio='0000:30:00.1'
function rebind {
    local device_path="/sys/bus/pci/devices/${1}"
    local driver_path="/sys/bus/pci/drivers/${2}"
    if [[ -d "$device_path" ]]; then
        if [[ -d "$driver_path" ]]; then
            echo $1 > "${device_path}/driver/unbind"
            echo $1 > "${driver_path}/bind"
        else
            echo Driver: \"${driver_path}\" does not exist >&2
        fi
    else
        echo Device: \"${device_path}\" does not exist >&2
    fi
}
rebind "$gpu" amdgpu
rebind "$gpu_audio" snd_hda_intel
chmod 744 /usr/local/sbin/bind-gpu
chown root:root /usr/local/sbin/bind-gpu
systemctl enable bind-gpu.service
reboot
Note: This is untested and might require some tweaking. Moving the devices to a config file might be a good idea after it is confirmed working.
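As a rough illustration of the config file idea, the device list could live in a plain text file with one "address driver" pair per line. This is only a sketch: the file format, the function name read_gpu_config and the path /etc/bind-gpu.conf are assumptions made here, not part of the script above.

```shell
#!/bin/sh
# Print "address driver" pairs from a config file, skipping blank lines
# and lines starting with '#'.
# $1: path to the config file (format is a hypothetical convention).
read_gpu_config() {
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while read -r rgc_addr rgc_driver || [ -n "$rgc_addr" ]; do
        case "$rgc_addr" in
            ''|'#'*) continue ;;
        esac
        printf '%s %s\n' "$rgc_addr" "$rgc_driver"
    done < "$1"
}

# Possible use together with the rebind function from the script above
# (the path is a hypothetical example):
#   read_gpu_config /etc/bind-gpu.conf | while read -r addr driver; do
#       rebind "$addr" "$driver"
#   done
```

With a file containing the lines "0000:30:00.0 amdgpu" and "0000:30:00.1 snd_hda_intel", the two hard-coded variables in bind-gpu would no longer be needed.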
Last edited by Swiggles (2019-09-29 12:00:10)
Offline
It looks like the drivers are being correctly assigned now. It is a bit slow to bring up the screen at boot, but that doesn't bother me. I really appreciate all your help with this!
Offline
I don't know if there is any earlier point at which you can run this. Of course you could try to run this script as an initcpio hook, but I guess we kind of tried that before. Maybe someone else has a better idea for implementing this?
Does this mean the VM passthrough is also working for you?
Offline
It is still not working. With ROM BAR checked I just get a black screen in virt-manager and "no signal" on the monitor. With ROM BAR unchecked I get the TianoCore boot screen and the Windows "wheel", then a freeze in virt-manager and "no signal" on the monitor. When I remove the PCI devices from virt-manager, the VM boots up fine and displays in virt-manager.
Offline
What if you power off your system, start it again (note: not a reboot!) and directly start the VM with ROM BAR unchecked? Edit: Also monitor dmesg for possible errors.
If that works, follow this on the Windows side: https://forum.level1techs.com/t/linux-h … fix/121097
It takes care of the reinitialization problem with some AMD cards (as mentioned in #13).
On the other hand, if it is not working, try to supply a ROM file and re-enable ROM BAR. I have seen success with this in some cases. Follow this until you modify your libvirt file: https://wiki.archlinux.org/index.php/PC … y_in_VBIOS
Do not continue further and try to flash your card!
If the dump is not working for you (which may happen for some cards and setups), the alternative is to download it from a third-party source: techpowerup. Make sure to pick the right vendor and just select the newest.
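Whichever way the ROM file is obtained, a quick sanity check before pointing libvirt at it is to look at the first two bytes: a PCI option ROM image starts with the signature 0x55 0xAA. The sketch below assumes a hypothetical helper name, is_pci_rom. Passing the check only shows that the file looks like an option ROM, not that it is the right one for the card.

```shell
#!/bin/sh
# Return success if the file starts with the PCI option ROM signature 55 AA.
# $1: path to the dumped or downloaded ROM file.
is_pci_rom() {
    # od prints the first two bytes as hex; tr strips the spacing and newline.
    ipr_sig=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$ipr_sig" = "55aa" ]
}
```

A dump that fails this check (for example a file of all zeros or 0xFF bytes) is a sign that reading the ROM from sysfs did not work on this setup.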
Last edited by Swiggles (2019-09-30 20:14:10)
Offline