#1 2023-11-27 05:38:17

ARCHLOVER123
Member
Registered: 2023-11-27
Posts: 3

Loading VFIO drivers causing freezing on host

I am attempting to load VFIO drivers so I can pass a GPU through to a virtual machine on my laptop. I have two graphics cards: Intel integrated graphics and an NVIDIA graphics card. I want to bind the NVIDIA card to the VFIO drivers and pass it through to the virtual machine. For some reason, when my computer boots using these drivers, it runs normally for a while, then freezes and I have to hard reboot. Also, when I run lspci during this time, the command hangs the terminal in which it runs (pressing Ctrl + C does nothing). After I hard reboot and type in the password to decrypt the drive, the computer either stays at the decryption screen or I can only access the TTY consoles (it's random). When the computer is frozen, I cannot SSH into it. I have successfully loaded VFIO and passed the card through to a virtual machine in the past, so my laptop should be able to do this.

I loaded the VFIO drivers by putting this in mkinitcpio.conf:

MODULES=(vfio_pci vfio vfio_iommu_type1 i915 nvidia nvidia_modeset nvidia_uvm nvidia_drm)
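For reference, the ordering in that array can be checked mechanically. The following is only a rough sketch in Python: the helper names `parse_modules` and `vfio_loads_first` are made up for illustration, and it looks at the first uncommented MODULES line only (the real file is sourced as shell, so a later assignment would win). It checks the order in the array and nothing else; it does not prove which driver actually claimed the card.

```python
import re

def parse_modules(conf_text):
    """Return the module names from the first uncommented MODULES=(...) line."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue
        match = re.match(r"MODULES=\(([^)]*)\)", line)
        if match:
            return match.group(1).split()
    return []

def vfio_loads_first(modules):
    """True if every vfio* module is listed before any nvidia*/nouveau module."""
    # Module names may be written with '-' or '_'; normalise before comparing.
    norm = [m.replace("-", "_") for m in modules]
    vfio = [i for i, m in enumerate(norm) if m.startswith("vfio")]
    gpu = [i for i, m in enumerate(norm) if m.startswith("nvidia") or m == "nouveau"]
    if not vfio:
        return False  # no vfio modules listed at all
    if not gpu:
        return True   # nothing listed that could compete for the card
    return max(vfio) < min(gpu)

conf = "MODULES=(vfio_pci vfio vfio_iommu_type1 i915 nvidia nvidia_modeset nvidia_uvm nvidia_drm)"
print(vfio_loads_first(parse_modules(conf)))
```

For the array above this prints True, since all three vfio entries come before the first nvidia entry.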

This is in my grub config

GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet rd.luks.name=93a85318-61ed-44fc-b763-a558c576b440=system rd.luks.name=45979d22-2832-4965-a516-6b410838bc79=system2 root=/dev/mapper/system rd.luks.options=timeout=60s rd.luks.options=password-echo=no intel_iommu=on iommu=pt nvida_drm.modeset=1 vfio-pci.ids=10de:1f15,10de:10f9,10de:1ada,10de:1adb"

I also have tried using modprobe for specifying the ids, putting this in /etc/modprobe.d/vfio.conf:

options vfio-pci ids=10de:1f15,10de:10f9,10de:1ada,10de:1adb
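Whichever way the ids are specified, what matters is which driver ends up bound to each function of the card. A small sketch of how that could be read from sysfs, without going through lspci, is below. It assumes the card sits in PCI domain 0000 at 01:00.0 to 01:00.3 (matching the lspci output in this thread); the function names `bound_driver` and `report` are only illustrative.

```python
import os

def bound_driver(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI function, or None."""
    link = os.path.join(sysfs_root, address, "driver")
    if not os.path.islink(link):
        return None  # device missing or no driver bound
    # The 'driver' entry is a symlink whose last component is the driver name.
    return os.path.basename(os.readlink(link))

# Assumed addresses of the four NVIDIA functions (domain 0000 is an assumption).
GPU_FUNCTIONS = ["0000:01:00.0", "0000:01:00.1", "0000:01:00.2", "0000:01:00.3"]

def report(addresses=GPU_FUNCTIONS, sysfs_root="/sys/bus/pci/devices"):
    """Map each address to its bound driver name (None if unbound)."""
    return {addr: bound_driver(addr, sysfs_root) for addr in addresses}

if __name__ == "__main__":
    for addr, driver in report().items():
        print(addr, driver or "(no driver bound)")
```

If the binding worked, each of the four functions would be expected to show vfio-pci here.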

This is the IOMMU group I'm passing through:

IOMMU Group 2:
        00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 02)
        01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU106M [GeForce RTX 2060 Mobile] [10de:1f15] (rev a1)
        01:00.1 Audio device [0403]: NVIDIA Corporation TU106 High Definition Audio Controller [10de:10f9] (rev a1)
        01:00.2 USB controller [0c03]: NVIDIA Corporation TU106 USB 3.1 Host Controller [10de:1ada] (rev a1)
        01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU106 USB Type-C UCSI Controller [10de:1adb] (rev a1)
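A listing like this can be rebuilt from sysfs. The sketch below is one way to do it in Python, assuming the usual /sys/kernel/iommu_groups layout; the function name `iommu_groups` is just illustrative, and it prints bare PCI addresses only, not the device descriptions lspci adds.

```python
import os

def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled or sysfs not available
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups

if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(f"IOMMU Group {number}:")
        for address in devices:
            print(f"        {address}")
```

An empty result would suggest the IOMMU is not active on that boot.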

This is my /etc/X11/xorg.conf file

Section "Device"
    Identifier "IntelGraphics"
    Driver "intel"
    BusID "PCI:0:2:0"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device "IntelGraphics"
    Monitor "Monitor0"
EndSection

Section "Monitor"
    Identifier "Monitor0"
EndSection

I'm using linux-6.6.2.arch1-1 as my kernel, which is the latest kernel as of writing this. I am using nvidia-open-dkms as my nvidia drivers, and I have also tried uninstalling the nvidia drivers before loading VFIO drivers, but this same issue still persists.

Edit: journalctl log on boot

Last edited by ARCHLOVER123 (2023-11-27 09:30:44)

#2 2023-11-27 08:40:23

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 76,056

Re: Loading VFIO drivers causing freezing on host

Why are you adding the nvidia modules to the initramfs?
The point of vfio is to get to the chip before the driver can claim it.

intel_iommu=on iommu=pt

Why? Does the IOMMU not work unless you explicitly enabled it?

Please post your complete system journal for the boot:

sudo journalctl -b | curl -F 'file=@-' 0x0.st

#3 2023-11-27 09:33:29

ARCHLOVER123
Member
Registered: 2023-11-27
Posts: 3

Re: Loading VFIO drivers causing freezing on host

Why are you adding the nvidia modules to the initramfs?
The point of vfio is to get to the chip before the driver can claim it.

The nvidia modules can be there if the vfio modules precede it.

intel_iommu=on iommu=pt

Why? Does the IOMMU not work unless you explicitly enabled it?

The wiki says to enable it this way.

Please post your complete system journal for the boot:

sudo journalctl -b | curl -F 'file=@-' 0x0.st

I edited my post to contain a link to the journalctl

#4 2023-11-27 10:09:02

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 76,056

Re: Loading VFIO drivers causing freezing on host

They /can/ be there, but
* they don't have to
* completely bloat the initramfs
* with modules you're not going to use
* and you still have to make sure vfio loads first

Given the sheer number of attempts to load the nvidia module, resulting in

Nov 27 13:11:00 archlinux kernel: NVRM: This can occur when a driver such as: 
                                  NVRM: nouveau, rivafb, nvidiafb or rivatv 
                                  NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Nov 27 13:11:00 archlinux kernel: NVRM: Try unloading the conflicting kernel module (and/or
                                  NVRM: reconfigure your kernel without the conflicting
                                  NVRM: driver(s)), then try loading the NVIDIA kernel module
                                  NVRM: again.

why do you have the nvidia modules installed at all when you're going to pass the device through anyway?

The accuracy of that statement is also disputed, https://wiki.archlinux.org/title/Talk:P … me_systems
I just wanted to make sure you're not forcing it because it otherwise doesn't work - the parameter won't harm in and by itself.

Nov 27 13:11:10 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:10 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:10 cs kernel: pcieport 0000:00:1d.6: AER: Multiple Corrected error received: 0000:05:00.0
Nov 27 13:11:10 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:10 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:10 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Nov 27 13:11:10 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:10 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:11 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:11 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:12 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:12 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Nov 27 13:11:12 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:12 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Nov 27 13:11:12 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:12 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:22 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:22 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:22 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:22 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:22 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:22 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:22 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:22 cs kernel: pcieport 0000:00:1d.6: AER: Multiple Corrected error received: 0000:05:00.0
Nov 27 13:11:22 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:22 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0

looks concerning (there's MUCH more of that in the journal)
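To put a number on "MUCH more", the messages can be tallied per device. This is a rough Python sketch, not a definitive tool: the regular expression is written against the message format shown in the excerpt above and may need adjusting for other kernels, and the function name `count_corrected_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches both "Corrected error received" and "Multiple Corrected error received".
AER_LINE = re.compile(
    r"pcieport (?P<port>[0-9a-f:.]+): AER: (?:Multiple )?Corrected error received: "
    r"(?P<source>[0-9a-f:.]+)"
)

def count_corrected_errors(journal_text):
    """Count corrected-error reports per device named as the error source."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            counts[match.group("source")] += 1
    return counts

sample = """\
Nov 27 13:11:10 cs kernel: pcieport 0000:00:1d.6: AER: Corrected error received: 0000:05:00.0
Nov 27 13:11:10 cs kernel: alx 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Nov 27 13:11:10 cs kernel: pcieport 0000:00:1d.6: AER: Multiple Corrected error received: 0000:05:00.0
"""
print(count_corrected_errors(sample))
```

On a real system the text could come from something like `journalctl -b -k` saved to a file and passed in as a string.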

Nov 27 13:11:08 cs kernel: alx 0000:05:00.0 eth0: Qualcomm Atheros AR816x/AR817x Ethernet [00:d8:61:e8:27:97]

Is that on-board or can you remove the card?

#5 2023-11-28 05:32:19

ARCHLOVER123
Member
Registered: 2023-11-27
Posts: 3

Re: Loading VFIO drivers causing freezing on host

Nov 27 13:11:08 cs kernel: alx 0000:05:00.0 eth0: Qualcomm Atheros AR816x/AR817x Ethernet [00:d8:61:e8:27:97]

Is that on-board or can you remove the card?

I'm not sure, but it is probably on board because this is a laptop. There is also an ethernet port in my laptop. Why is that important?

why do you have the nvidia modules installed at all when you're going to pass the device through anyway?

The accuracy of that statement is also disputed, https://wiki.archlinux.org/title/Talk:P … me_systems
I just wanted to make sure you're not forcing it because it otherwise doesn't work - the parameter won't harm in and by itself.

I modified config files after reading this. I removed nvidia modules from /etc/mkinitcpio.conf and uninstalled anything nvidia related. I also blacklisted anything nvidia related in /etc/modprobe.d/blacklist.conf :

blacklist nouveau
blacklist nvidia_uvm
blacklist nvdia_drm 
blacklist nvidia_modeset 
blacklist nvidia
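A blacklist entry only has an effect if the name matches a real module, so it can be worth comparing the entries against a list of expected names. The sketch below does that in Python; the `KNOWN` list is a hypothetical reference list typed in by hand for this example, and the function names are illustrative.

```python
def blacklisted_modules(conf_text):
    """Return module names from 'blacklist <name>' lines, ignoring comments."""
    names = []
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "blacklist":
            names.append(fields[1])
    return names

def unknown_names(conf_text, known_modules):
    """Return blacklisted names that do not appear in known_modules."""
    known = {m.replace("-", "_") for m in known_modules}
    return [n for n in blacklisted_modules(conf_text)
            if n.replace("-", "_") not in known]

# Hypothetical reference list; on a real system it would have to come from
# somewhere trustworthy, such as the module names shipped by the driver package.
KNOWN = ["nouveau", "nvidia", "nvidia_drm", "nvidia_modeset", "nvidia_uvm"]

conf = """\
blacklist nouveau
blacklist nvidia_uvm
blacklist nvdia_drm
blacklist nvidia_modeset
blacklist nvidia
"""
print(unknown_names(conf, KNOWN))
```

Run against the file above, this reports nvdia_drm as a name that is not in the reference list.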

I specified vfio-pci ids in /etc/modprobe.d/vfio.conf

options vfio-pci ids=10de:1f15,10de:10f9,10de:1ada,10de:1adb

This is in my current /etc/default/grub file:

GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet rd.luks.name=93a85318-61ed-44fc-b763-a558c576b440=system rd.luks.name=45979d22-2832-4965-a516-6b410838bc79=system2 root=/dev/mapper/system rd.luks.options=timeout=60s rd.luks.options=password-echo=no intel_iommu=on iommu=pt"

I booted and the VFIO drivers worked. I got the VM set up and the GPU passthrough worked. However, one issue still remains. When I shut down my VM, I cannot start the same virtual machine again; libvirt and virt-manager stop responding and crash. When I then reboot my laptop, the issue where lspci hangs and the computer freezes happens again. I rebooted my laptop yet again, and the same issue persisted.

After this, I just turned off my laptop and went to sleep. I woke up and turned on the laptop, and for some unknown reason, the VFIO drivers loaded and worked again. I turned on my VM and it successfully booted with the GPU passthrough; however, when I turned off the VM, libvirt and virt-manager became unresponsive. I rebooted my laptop yet again, and the same issue persisted.

I'm not sure if it is a coincidence or not, but if I keep my laptop off for a long period of time, then boot it up, the VFIO drivers seem to load and my computer does not randomly freeze, and I can use my VM. I think this is related, SMART is reporting issues with one of my NVME drives, and it seems that the temperature is going too high.

#6 2023-11-28 08:47:00

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 76,056

Re: Loading VFIO drivers causing freezing on host

I'm not sure, but it is probably on board because this is a laptop. There is also an ethernet port in my laptop. Why is that important?

Because it's clearly acting up.
In case it turns out to be relevant, can you disable the ethernet chip in the BIOS/UEFI settings?

It might however be a symptom of

but if I keep my laptop off for a long period of time, then boot it up, the VFIO drivers seem to load and my computer does not randomly freeze, and I can use my VM. I think this is related, SMART is reporting issues with one of my NVME drives, and it seems that the temperature is going too high.

What issue w/ your nvme (there're no errors in the journal) and where does the temperature issue come from?
(If the sensors can't really tell, maybe powertop can - the heat ultimately comes from your battery and what uses most of that will produce the most heat)
Can you rule out a D.U.S.T. issue?
