vfio-pci not ready after resume, ACPI PCI error messages

chairless · 2025-03-07 19:38:17

Hey Arch forum,

I have an HP Omen laptop with an RTX 2060 dGPU and a 10th-gen i7 processor. I have vfio-pci loaded and properly asserting control of the IOMMU group associated with the dGPU. Despite this configuration, i2c_nvidia_gpu is always loaded on boot, and I still get occasional NVRM messages in the journal. I believe this is linked to lspci hanging indefinitely after specific applications fail to launch.

I am running Gnome on Wayland through my iGPU, and I have a few environment variables set to make most applications work:

EGL_PLATFORM=wayland
__EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json
GSK_RENDERER=ngl
MESA_VK_DEVICE_SELECT=00:02.0

The last two are probably redundant (as I only recently figured out that you can have NVIDIA and Intel Vulkan drivers simultaneously), but this set of variables makes the system mostly stable on integrated graphics.
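
One thing that may be worth checking here: going by the Mesa environment-variable documentation, MESA_VK_DEVICE_SELECT is matched against a PCI vendor:device ID pair (vid:did) rather than a bus address such as 00:02.0, and setting it to "list" prints the candidates. Below is a minimal sketch for deriving the pair from lspci -n output. The helper name vk_select_id is made up for illustration, and it assumes the usual lspci -n line layout of address, class, then vendor:device.

```sh
#!/bin/sh
# vk_select_id (hypothetical helper): read `lspci -n -s <address>` output on
# stdin and print the vendor:device pair found in the third column.
# Assumed line layout: "<address> <class>: <vendor>:<device> (rev xx)"
vk_select_id() {
    awk 'NF >= 3 && $3 ~ /^[0-9a-f]+:[0-9a-f]+$/ { print $3; exit }'
}

# Intended use (not run here):
#   export MESA_VK_DEVICE_SELECT="$(lspci -n -s 00:02.0 | vk_select_id)"
```

If the value currently in use does not match any device, Mesa presumably just falls back to its default ordering, which would fit the suspicion that the variable is redundant.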

On a related note, my /etc/modprobe.d/blacklist.conf is:

blacklist nvidia nvidia_drm nouveau i2c_nvidia_gpu nvidia_uvm
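
A possible explanation for i2c_nvidia_gpu still loading: as far as the modprobe.d(5) syntax goes, a blacklist line takes a single module name, so with all five names on one line it seems likely that only nvidia is actually covered. The sketch below mimics that assumed parsing rule to show which names a file covers and which it would ignore. Both helper names are hypothetical.

```sh
#!/bin/sh
# Both helpers are hypothetical and only mimic the assumed parsing rule:
# one module name per `blacklist` line, anything after it is not blacklisted.

# effective_blacklist FILE: print the module name each blacklist line covers.
effective_blacklist() {
    awk '$1 == "blacklist" && NF >= 2 { print $2 }' "$1"
}

# ignored_blacklist_names FILE: print the extra names that would be ignored.
ignored_blacklist_names() {
    awk '$1 == "blacklist" { for (i = 3; i <= NF; i++) print $i }' "$1"
}

# Intended use:
#   effective_blacklist /etc/modprobe.d/blacklist.conf
#   ignored_blacklist_names /etc/modprobe.d/blacklist.conf
```

If any names show up as ignored, splitting the file into one "blacklist <name>" line per module would be the first thing to try. Note that blacklisting only stops alias-based autoloading; a module can still be pulled in explicitly or as a dependency.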

After boot, lsmod | grep -i nv returns:

i2c_nvidia_gpu         12288  0
nvme                   65536  7
nvme_core             262144  8 nvme
nvme_auth              24576  1 nvme_core

Notably, modprobe -r i2c_nvidia_gpu successfully unloads the module, and the system continues to function. This makes sense, as nothing is actually using the module according to lsmod, and the USB C controller on the GPU (which is normally bound to i2c_nvidia_gpu) remains bound to vfio-pci.
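
Since lspci can hang on this machine, the binding claim above can also be checked straight from sysfs, which should not need to touch the device itself. A minimal sketch follows. bound_driver is a hypothetical helper, and its optional second argument exists only so it can be pointed at a test tree instead of /sys.

```sh
#!/bin/sh
# bound_driver ADDR [SYSFS_ROOT]: print the driver bound to a PCI function,
# or "none" if nothing is bound. SYSFS_ROOT defaults to /sys.
bound_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink points at .../bus/pci/drivers/<driver name>
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Intended use; the function numbers are an assumption, adjust them to
# whatever the GPU exposes on this machine:
#   for fn in 0 1 2 3; do
#       printf '%s ' "0000:01:00.$fn"; bound_driver "0000:01:00.$fn"
#   done
```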

Also, NVRM messages periodically appear in the journal. Output of journalctl -b is here. These messages do not appear to have any effect on the system, but I find it odd that they appear at all when all NVIDIA modules are unloaded.

Firstly, how can I prevent i2c_nvidia_gpu and/or NVRM from loading on boot?

There is also a second issue which I believe to be related to the first. I usually cannot launch Obsidian in my configuration. Unloading i2c_nvidia_gpu allows me to launch it once, but it fails in the usual way again once I close and reopen it.

Launching Obsidian causes the following things to happen:
1. No window is opened, and no errors appear on the command line.
2. When I stop it with ^C, a window briefly opens, displays nothing, and closes.
3. The program segfaults and stops.
4. Running lspci now hangs forever.
5. dmesg records three to four messages of the format:

 vfio-pci 0000:01:00.0: not ready 1023ms after resume; waiting

6. Full system hang shortly thereafter.

Below is the relevant section of the journal when Obsidian fails to launch:

Mar 07 11:54:28 chairlessarchbtw kgx[1990]: ../gtk/gdk/wayland/gdkcursor-wayland.c:210 cursor image size (32) not an integer multiple of theme size (24)
Mar 07 11:54:28 chairlessarchbtw kgx[1990]: ../gtk/gdk/wayland/gdkcursor-wayland.c:210 cursor image size (32) not an integer multiple of theme size (24)
Mar 07 11:54:29 chairlessarchbtw systemd[1210]: Started VTE child process 4545 launched by kgx process 1990.
Mar 07 11:54:34 chairlessarchbtw kgx[1990]: ../gtk/gdk/wayland/gdkcursor-wayland.c:210 cursor image size (32) not an integer multiple of theme size (24)
Mar 07 11:54:36 chairlessarchbtw kernel: electron[4390]: segfault at 10d0 ip 00005b4c4e53e0a7 sp 00007ffc099b12e0 error 4 in electron[2d610a7,5b4c4cc0c000+92b7000] likely on CPU 3 (core 3, socket 0)
Mar 07 11:54:36 chairlessarchbtw kernel: Code: 24 49 c7 04 24 00 00 00 00 49 89 06 41 0f 10 07 41 0f 11 46 08 49 89 5e 18 5b 41 5c 41 5e 41 5f 5d c3 cc 55 48 89 e5 48 89 f8 <48> 8b 0e 48 89 0f 48 c7 06 00 00 00 00 5d c3 cc cc cc cc cc cc cc
Mar 07 11:54:36 chairlessarchbtw systemd-coredump[4560]: Process 4390 (electron) of user 1000 terminated abnormally with signal 11/SEGV, processing...
Mar 07 11:54:36 chairlessarchbtw systemd[1]: Created slice Slice /system/systemd-coredump.
Mar 07 11:54:36 chairlessarchbtw systemd[1]: Started Process Core Dump (PID 4560/UID 0).
Mar 07 11:54:37 chairlessarchbtw (sd-parse-elf)[4563]: Could not parse core file, dwfl_core_file_report() failed: (null)
Mar 07 11:54:37 chairlessarchbtw (sd-parse-elf)[4563]: Failed to inspect core file: Invalid argument
Mar 07 11:54:37 chairlessarchbtw systemd-coredump[4561]: Process 4390 (electron) of user 1000 dumped core.
Mar 07 11:54:37 chairlessarchbtw systemd[1]: systemd-coredump@0-4560-0.service: Deactivated successfully.
Mar 07 11:54:37 chairlessarchbtw systemd[1]: systemd-coredump@0-4560-0.service: Consumed 1.168s CPU time, 874.5M memory peak.
Mar 07 11:54:37 chairlessarchbtw kgx[1990]: ../gtk/gdk/wayland/gdkcursor-wayland.c:210 cursor image size (32) not an integer multiple of theme size (24)

My best guess is that the NVIDIA drivers are attempting to take control of the GPU again when I launch Obsidian, so vfio-pci freaks out and the entire system crashes.
I believe that this Obsidian issue is a symptom of the first issue noted in this post, and the solution to one is the solution to the other.

I'm somewhat new to much of this, and any help is greatly appreciated!

Last edited by chairless (2025-04-02 15:24:25)

xerxes_ · 2025-03-08 21:12:13

Read or view all, especially point 6:
https://wiki.archlinux.org/title/Blacklist
Also may be useful:
https://wiki.archlinux.org/title/Kernel_parameters

chairless · 2025-03-24 15:05:11

xerxes_,

Sorry I haven't responded in a while. I got bogged down by school and spring break stuff. Anyways, thanks for the ideas. I tried

modprobe.blacklist=i2c_nvidia_gpu

to prevent implicit loading, but that doesn't stop i2c_nvidia_gpu from being loaded. Since I might want to load the module later, I don't want to use module_blacklist.
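
Before ruling the parameter out, it may be worth confirming that it actually reached the running kernel, since a bootloader entry that was not regenerated would silently drop it. A small sketch follows. cmdline_blacklists is a hypothetical helper, and the optional file argument is only there so it can be tested against something other than /proc/cmdline.

```sh
#!/bin/sh
# cmdline_blacklists [FILE]: print every modprobe.blacklist= and
# module_blacklist= parameter found in a kernel command line, one per line.
# FILE defaults to /proc/cmdline. Prints nothing if neither is present.
cmdline_blacklists() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" |
        grep -E '^(modprobe\.blacklist|module_blacklist)='
}

# Intended use:
#   cmdline_blacklists
```

If the parameter is present and the module still loads, that would suggest it is being loaded explicitly or as a dependency rather than through alias matching, which modprobe.blacklist does not cover.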

In the last few days, I've noticed a few more things. Firstly, launching Code OSS sometimes causes the same kind of crash as Obsidian. Notably, these apps both run on Electron 34, so it seems likely that there is some environment variable I need to set for Electron to work properly.

Still, I would like to know if there is another way to prevent the loading of i2c_nvidia_gpu on startup.
Secondly, I'm noticing that xhci_hcd also seems grabby with the GPU. Namely, the following message also appears in the journal when I launch Code OSS (and a few times in the journal I posted).

Mar 24 09:29:27 chairlessarchbtw kernel: xhci_hcd 0000:01:00.2: xHC error in resume, USBSTS 0x401, Reinit
Mar 24 09:29:27 chairlessarchbtw kernel: usb usb3: root hub lost power or was reset
Mar 24 09:29:27 chairlessarchbtw kernel: usb usb4: root hub lost power or was reset

When working properly (i.e. lspci does not hang forever), the first line of that message appears a few times at boot, and nothing happens. However, on boots where lspci hangs indefinitely, the "root hub lost power or was reset" message appears as above. It seems to me that on some boots, xhci_hcd grabs the USB controller on the graphics card before vfio-pci, then i2c_nvidia_gpu tries to take control itself, causing a chain of events that eventually leads to the vfio-pci "not ready" messages. However, this doesn't totally align with what my system ought to be doing, as I have the following line in a conf file in modprobe.d:

softdep xhci_hcd pre: vfio-pci

Note that 0000:01:00.2 corresponds to the USB-C controller on the graphics card.
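
A caveat on the softdep line: the name in the kernel log prefix is the driver name, which is not necessarily the module name, and softdep keys on module names. On many systems the xHCI PCI driver registers as xhci_hcd while the module providing it is xhci_pci, though that is an assumption to verify on this machine. The sketch below pulls both names out of lspci -k output. driver_and_modules is a hypothetical helper.

```sh
#!/bin/sh
# driver_and_modules (hypothetical helper): read `lspci -k -s <address>`
# output on stdin and print the "Kernel driver in use" and "Kernel modules"
# values as "driver=<name> modules=<list>".
driver_and_modules() {
    awk -F': ' '
        /Kernel driver in use:/ { drv = $2 }
        /Kernel modules:/       { mods = $2 }
        END { printf "driver=%s modules=%s\n", drv, mods }
    '
}

# Intended use (run it while lspci still responds):
#   lspci -k -s 01:00.2 | driver_and_modules
```

If the modules field names something other than xhci_hcd, the softdep would need to reference that module instead. A softdep also only changes load order, so vfio-pci still has to be told to claim the device.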

Curious to hear anyone's thoughts on further troubleshooting/solutions!

Last edited by chairless (2025-03-24 15:07:11)

chairless · 2025-04-02 14:59:26

Okay, I spent a while trying to wrap my head around this problem. Firstly, xhci_hcd being in charge of the USB host seems to be normal and expected behavior. My friend, who has a stable VFIO setup on Gentoo, says that xhci_hcd releases the device to a VM on startup. I launched my Windows VM, and I can confirm that to be the case.

Secondly, I resolved most of the crashes caused by applications by installing vulkan-intel and vulkan-mesa-layers, then configuring the environment variables as in this post: https://bbs.archlinux.org/viewtopic.php … 3#p2185993. This action, along with blacklisting nvidia_uvm, seemed to remove the NVRM errors I mentioned in my first post.

Thirdly, the most consistent error when the system fails is:

Apr 01 22:45:50 chairlessarchbtw kernel: ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20240827/psparse-529)
Apr 01 22:45:50 chairlessarchbtw kernel: ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20240827/psparse-529)
Apr 01 22:45:53 chairlessarchbtw kernel: vfio-pci 0000:01:00.0: not ready 1023ms after resume; waiting
Apr 01 22:45:54 chairlessarchbtw kernel: vfio-pci 0000:01:00.0: not ready 2047ms after resume; waiting
Apr 01 22:45:56 chairlessarchbtw kernel: vfio-pci 0000:01:00.0: not ready 4095ms after resume; waiting
Apr 01 22:46:00 chairlessarchbtw kernel: vfio-pci 0000:01:00.0: not ready 8191ms after resume; waiting
Apr 01 22:46:09 chairlessarchbtw kernel: vfio-pci 0000:01:00.0: not ready 16383ms after resume; waiting
Apr 01 22:46:25 chairlessarchbtw kernel: vfio-pci 0000:01:00.0: not ready 32767ms after resume; waiting
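
The waits in these messages roughly double each time (1023, 2047, 4095, ...), which suggests the kernel is polling the device with a growing backoff. To compare boots, the longest wait per device can be pulled out of the kernel log with something like the sketch below. max_resume_wait is a hypothetical helper that only matches the message format shown above.

```sh
#!/bin/sh
# max_resume_wait (hypothetical helper): read kernel log lines on stdin and
# print "<device> <longest wait in ms>" for each vfio-pci device that logged
# "not ready ... after resume; waiting".
max_resume_wait() {
    sed -n 's/.*vfio-pci \([0-9a-f:.]*\): not ready \([0-9]*\)ms after resume; waiting.*/\1 \2/p' |
        awk '{ if ($2 + 0 > max[$1] + 0) max[$1] = $2 + 0 }
             END { for (dev in max) print dev, max[dev] }'
}

# Intended use:
#   journalctl -k -b | max_resume_wait
```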

I know that vfio-pci is working mostly as intended, as I was able to run my Windows VM for an hour and shut it down properly. The computer froze shortly after I closed the VM.
I've attached the full journal from that boot in case anyone is able to peruse it and figure out what else might be going wrong.
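
The ACPI _ON timeouts followed by the "not ready after resume" messages look like the dGPU failing to power back up after being runtime-suspended, though that is a guess from the log rather than something confirmed in this thread. One way to test the guess is to look at the device's runtime power-management state in sysfs. A minimal sketch follows. pm_state is a hypothetical helper, and its optional second argument is only there so it can be pointed at a test tree.

```sh
#!/bin/sh
# pm_state ADDR [SYSFS_ROOT]: print the runtime power-management policy and
# status of a PCI function as "control=<auto|on> status=<...>".
# SYSFS_ROOT defaults to /sys.
pm_state() {
    dir="${2:-/sys}/bus/pci/devices/$1/power"
    printf 'control=%s status=%s\n' \
        "$(cat "$dir/control")" "$(cat "$dir/runtime_status")"
}

# Intended use:
#   pm_state 0000:01:00.0
# Experiment (as root), keeps the device from runtime-suspending:
#   echo on > /sys/bus/pci/devices/0000:01:00.0/power/control
```

If the freezes stop with control set to "on", the vfio-pci module option disable_idle_d3=1 might be a more permanent way to get the same effect, at the cost of the dGPU drawing power while idle.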

chairless · 2025-04-02 15:24:53

Updating the title of this thread to reflect progress.

xerxes_ · 2025-04-08 16:36:54

Maybe try some ACPI kernel boot parameters, such as 'acpi=noirq', 'acpi=off', 'acpi=rsdt', or 'acpi=copy_dsdt'?
