
#1 2023-02-12 20:29:24

andrej.podzimek
Member
From: Zürich, Switzerland
Registered: 2005-04-10
Posts: 115

NVidia Quadro P5000 in a Core X Chroma eGPU broken since Linux 6.0.x

This post from 2+ years ago describes the hardware setup I’m using. (I’ve since upgraded the ASRock X570 Creator’s UEFI firmware to 3.50; otherwise everything has stayed the same.) The OS is a stock, frequently updated Arch Linux with kernel 6.1.11. The TL;DR is that I used to make my eGPU work using the following kernel command line magic:

pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=128M,hpmmioprefsize=16G

With this↑ everything was perfect; I could use my “internal” AMD Radeon Pro W5700 for desktop work and the NVidia Quadro P5000 in a Razer Core X Chroma eGPU for VMs.

Up until kernel 5.19.x, that is. The NVidia in the eGPU doesn’t work any more with 6.0.x kernels.
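For readers unfamiliar with the pci= options on that command line, here is a minimal Python sketch (not part of the original setup) that splits the command line into individual options and converts the sizes into plain numbers. The helper names parse_size and parse_pci_options are made up for illustration, and the option summaries in the comments are paraphrased from the kernel's kernel-parameters documentation.

```python
# Size suffixes as the kernel's memparse() understands them (powers of 1024).
SIZE_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Turn '128M', '16G' or '0x33' into an integer."""
    suffix = text[-1:].upper()
    if suffix in SIZE_SUFFIXES:
        return int(text[:-1], 0) * SIZE_SUFFIXES[suffix]
    return int(text, 0)


def parse_pci_options(cmdline):
    """Collect all pci=... options of a kernel command line into a dict.

    Options without a value (e.g. 'realloc') map to True.
    """
    options = {}
    for word in cmdline.split():
        if not word.startswith("pci="):
            continue
        for item in word[len("pci="):].split(","):
            key, sep, value = item.partition("=")
            options[key] = parse_size(value) if sep else True
    return options


CMDLINE = ("pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,"
           "hpmmiosize=128M,hpmmioprefsize=16G")

opts = parse_pci_options(CMDLINE)
# hpbussize:      minimum number of bus numbers reserved below a hotplug bridge
# hpmmiosize:     MMIO window reserved for a hotplug bridge
# hpmmioprefsize: prefetchable MMIO window reserved for a hotplug bridge
for name in ("hpbussize", "hpmmiosize", "hpmmioprefsize"):
    print(name, opts[name])
```

With the values from the post this works out to 51 reserved bus numbers (0x33), a 128 MiB MMIO window and a 16 GiB prefetchable window per hotplug bridge.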

There is an error message that all modules attempting to control the NVidia card have in common, i.e. they all fail to use the card “the same way”:

Feb 12 19:27:03 kernel: nvidia 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed
Feb 12 20:25:06 kernel: vfio-pci 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed
Feb 12 20:25:36 kernel: nouveau 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed
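To make the "same way" claim checkable, here is a small Python sketch that pulls the driver, device address, BAR number and address range out of such journal lines. The regular expression only targets the message format shown above, and the function name unclaimed_bars is invented for this example.

```python
import re

# Matches lines like:
#   kernel: nvidia 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed
BAR_ERROR = re.compile(
    r"kernel: (?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"can't enable device: BAR (?P<bar>\d+) "
    r"\[(?P<kind>io|mem)\s+0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)[^\]]*\] "
    r"not claimed"
)


def unclaimed_bars(log_lines):
    """Return one dict per 'BAR ... not claimed' message found in log_lines."""
    found = []
    for line in log_lines:
        match = BAR_ERROR.search(line)
        if match is None:
            continue
        start = int(match["start"], 16)
        end = int(match["end"], 16)
        found.append({
            "driver": match["driver"],
            "bdf": match["bdf"],
            "bar": int(match["bar"]),
            "kind": match["kind"],
            "start": start,
            "size": end - start + 1,
        })
    return found


LOG = [
    "Feb 12 19:27:03 kernel: nvidia 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed",
    "Feb 12 20:25:06 kernel: vfio-pci 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed",
    "Feb 12 20:25:36 kernel: nouveau 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed",
]

errors = unclaimed_bars(LOG)
# Collapse to the distinct (device, BAR, type, start, size) combinations.
distinct = {(e["bdf"], e["bar"], e["kind"], e["start"], e["size"]) for e in errors}
print(sorted(e["driver"] for e in errors), distinct)
```

All three drivers trip over the same 128-byte I/O BAR at address 0, which supports the reading that the resource was never assigned in the first place and that the driver choice is irrelevant.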

What I tried, to no avail:

  1. install nvidia / enable nouveau / blacklist and/or uninstall both — (A particular host driver for the NVidia is not needed anyhow, vfio-pci that lets VMs use it is all that matters.)

  2. ibt=off on the kernel command line, “just in case”, based on these stories.

  3. mem_encrypt=off (It has always worked fine with AMD SME on, but … again, just in case.)

  4. Removing the PCIe magic from the kernel command line. (This renders ~all but trivial Thunderbolt devices inoperable, including the eGPU; not a way to go.)

  5. Tricks loosely based on qemu’s error messages, but qemu didn’t realize that the problem had started before VFIO came into play and that any driver would have had the same problem when trying to control the NVidia:

    echo -n vfio-pci > /sys/bus/pci/devices/0000\:3d\:00.0/driver_override
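For completeness, the usual manual rebinding sequence around driver_override can be sketched in Python as below. This is only an illustration of the sysfs writes involved (driver_override, the current driver's unbind file and drivers_probe); the function names rebind_plan and apply_plan are hypothetical, and in the situation described here the final probe would presumably still fail with the same BAR 5 message.

```python
def rebind_plan(bdf, driver, bound=True, sysfs="/sys"):
    """Return the (path, value) writes needed to hand a PCI device to `driver`.

    bound: whether some driver currently holds the device and must let go first.
    """
    device = f"{sysfs}/bus/pci/devices/{bdf}"
    plan = [(f"{device}/driver_override", driver)]
    if bound:
        # Detach the device from whatever driver currently owns it.
        plan.append((f"{device}/driver/unbind", bdf))
    # Ask the PCI core to probe again; driver_override restricts the match.
    plan.append((f"{sysfs}/bus/pci/drivers_probe", bdf))
    return plan


def apply_plan(plan):
    """Perform the writes. Needs root and touches the live system."""
    for path, value in plan:
        with open(path, "w") as handle:
            handle.write(value)


for path, value in rebind_plan("0000:3d:00.0", "vfio-pci", bound=False):
    print(f"echo -n {value} > {path}")
```

Keeping the plan separate from apply_plan makes it possible to review the exact writes before anything is changed on a production machine.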

The Razer Core X Chroma has a built-in USB hub and a (USB) network card. These devices work perfectly fine. It’s just the NVidia that cannot be used.

Everything looks normal w.r.t. Thunderbolt (where the Chroma has always appeared as two devices):

 ● Razer Core X Chroma
   ├─ type:          peripheral
   ├─ name:          Core X Chroma
   ├─ vendor:        Razer
   ├─ uuid:          00653854-e510-2701-ffff-ffffffffffff
   ├─ generation:    Thunderbolt 3
   ├─ status:        authorized
   │  ├─ domain:     ce010000-0060-6c0e-03b7-b91c46b12223
   │  ├─ rx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  ├─ tx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  └─ authflags:  secure
   ├─ authorized:    Sun 12 Feb 2023 07:40:05 PM UTC
   ├─ connected:     Sun 12 Feb 2023 07:39:27 PM UTC
   └─ stored:        Mon 30 Nov 2020 02:18:57 PM UTC
      ├─ policy:     auto
      └─ key:        yes

 ● Razer Core X Chroma #2
   ├─ type:          peripheral
   ├─ name:          Core X Chroma
   ├─ vendor:        Razer
   ├─ uuid:          00306925-e510-2701-ffff-ffffffffffff
   ├─ generation:    Thunderbolt 3
   ├─ status:        authorized
   │  ├─ domain:     ce010000-0060-6c0e-03b7-b91c46b12223
   │  ├─ rx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  ├─ tx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  └─ authflags:  secure
   ├─ authorized:    Sun 12 Feb 2023 07:40:17 PM UTC
   ├─ connected:     Sun 12 Feb 2023 07:39:27 PM UTC
   └─ stored:        Mon 30 Nov 2020 02:19:08 PM UTC
      ├─ policy:     auto
      └─ key:        yes

lspci -tv shows the NVidia and its adjacent audio device, exactly as it always did:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
           +-00.2  Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
           +-01.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-01.2-[01-7a]----00.0-[02-7a]--+-01.0-[03]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
           |                               +-02.0-[04-6d]----00.0-[05-6d]--+-00.0-[06]----00.0  Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018]
           |                               |                               +-01.0-[07-39]----00.0-[08-39]--+-01.0-[09]----00.0  Adaptec Smart Storage PQI SAS
           |                               |                               |                               \-04.0-[0a-39]--
           |                               |                               +-02.0-[3a]----00.0  Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018]
           |                               |                               \-04.0-[3b-6d]----00.0-[3c-6d]--+-01.0-[3d]--+-00.0  NVIDIA Corporation GP104GL [Quadro P5000]
           |                               |                                                               |            \-00.1  NVIDIA Corporation GP104 High Definition Audio Controller
           |                               |                                                               \-04.0-[3e-6d]----00.0-[3f-42]--+-00.0-[40]----00.0  ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
           |                               |                                                                                               +-01.0-[41]----00.0  ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
           |                               |                                                                                               \-02.0-[42]----00.0  ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
           |                               +-03.0-[6e-76]----00.0-[6f-76]--+-01.0-[70]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           |                               |                               +-02.0-[71]----00.0  Intel Corporation I211 Gigabit Network Connection
           |                               |                               +-03.0-[72]----00.0  Intel Corporation Wi-Fi 6 AX200
           |                               |                               +-04.0-[73]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           |                               |                               +-05.0-[74]--
           |                               |                               +-06.0-[75]--
           |                               |                               \-07.0-[76]--
           |                               +-04.0-[77]----00.0  Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion]
           |                               +-08.0-[78]--+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
           |                               |            +-00.1  Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
           |                               |            \-00.3  Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
           |                               +-09.0-[79]----00.0  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           |                               \-0a.0-[7a]----00.0  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           +-01.3-[7b]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
           +-02.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-03.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-03.1-[7c-7e]----00.0-[7d-7e]----00.0-[7e]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon Pro W5700]
           |                                            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                                            +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] Device 7316
           |                                            \-00.3  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 USB
           +-04.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-05.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-07.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-07.1-[7f]----00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
           +-08.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-08.1-[80]--+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
           |            +-00.1  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
           |            +-00.3  Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
           |            \-00.4  Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0
           +-18.1  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1
           +-18.2  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2
           +-18.3  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3
           +-18.4  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4
           +-18.5  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5
           +-18.6  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6
           \-18.7  Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7

But the “can't enable device: BAR 5 [io  0x0000-0x007f] not claimed” problem ruins it all. Here’s some dmesg context for the nvidia case; the other drivers (nouveau and vfio-pci) produce something similarly pessimistic:

Feb 12 19:23:53 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 509
Feb 12 19:23:53 kernel: 
Feb 12 19:23:53 kernel: nvidia 0000:3d:00.0: can't enable device: BAR 5 [io  0x0000-0x007f] not claimed
Feb 12 19:23:53 kernel: NVRM: pci_enable_device failed, aborting
Feb 12 19:23:53 kernel: nvidia: probe of 0000:3d:00.0 failed with error -1
Feb 12 19:23:53 kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Feb 12 19:23:53 kernel: NVRM: None of the NVIDIA devices were initialized.
Feb 12 19:23:53 kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 509
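One way to cross-check the message is to look at what the kernel recorded for each BAR in the device's sysfs resource file. The sketch below assumes the usual layout of that file (one line per resource with start, end and flags in hex, the first six lines being BAR 0 to BAR 5). The function name parse_bars is made up, and the "unassigned" flag is only a heuristic that marks BARs with a nonzero size but a start address of 0.

```python
IORESOURCE_IO = 0x100   # flag bit for I/O port resources
IORESOURCE_MEM = 0x200  # flag bit for memory resources


def parse_bars(resource_text):
    """Parse the first six lines (BAR 0-5) of a PCI device's sysfs 'resource' file."""
    bars = []
    for index, line in enumerate(resource_text.splitlines()[:6]):
        start, end, flags = (int(field, 16) for field in line.split())
        if start == 0 and end == 0 and flags == 0:
            continue  # BAR not implemented, or upper half of a 64-bit BAR
        if flags & IORESOURCE_IO:
            kind = "io"
        elif flags & IORESOURCE_MEM:
            kind = "mem"
        else:
            kind = "other"
        bars.append({
            "bar": index,
            "kind": kind,
            "start": start,
            "size": end - start + 1,
            # Heuristic (an assumption): a sized BAR starting at 0 was never placed.
            "unassigned": start == 0,
        })
    return bars


def read_bars(bdf, sysfs="/sys"):
    """Read and parse the resource file of the device at address `bdf`."""
    with open(f"{sysfs}/bus/pci/devices/{bdf}/resource") as handle:
        return parse_bars(handle.read())
```

Calling read_bars("0000:3d:00.0") on the affected machine should list BAR 5 as an I/O resource of 0x80 bytes if the sysfs view agrees with the dmesg line.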

What can be wrong? Was there a major PCIe change between kernel 5.19.x and 6.0.x? I currently run 6.1.11 and the problem is still there.

The machine is my home server and NAS and I just don’t want to get into a bisecting spree on it… :( Any ideas or guesses?

#2 2023-02-13 09:16:13

seth
Member
Registered: 2012-09-03
Posts: 58,484

Smells like https://bbs.archlinux.org/viewtopic.php?id=281363

Edit, FYI: a problem that ibt=off would fix leaves an error about a missing ENDBR in the journal.

Last edited by seth (2023-02-13 09:18:52)

#3 2023-02-13 19:01:46

andrej.podzimek
Member
From: Zürich, Switzerland
Registered: 2005-04-10
Posts: 115

It might be related, but I don’t think it’s the same problem. In my case the problem stays exactly the same when both nvidia + nouveau are uninstalled + blocked, so there’s neither a problem when compiling nvidia nor a conflict among the multiple possible drivers; vfio-pci is free to do whatever it pleases with the GPU. But it can’t, due to the strange “BAR 5” problem (which ibt=off doesn’t solve in this case).

Something must have changed in the way the PCIe virtual address spaces are handled.

#4 2023-02-13 23:52:40

andrej.podzimek
Member
From: Zürich, Switzerland
Registered: 2005-04-10
Posts: 115

Further trial-and-error:

  1. Added big_root_window to the pci= list. Well, this has no effect whatsoever and it doesn’t even cause a kernel taint (which should happen, as documented). So the Ryzen 3950X probably doesn’t have this capability / functionality.

  2. Lots of fiddling with bind / unbind virtual files. The NVidia GPU (0000:3d:00.0) can be manually bound to vfio-pci, no problem. But it doesn’t help. Its adjacent sound device (0000:3d:00.1) can be manually unbound from snd_hda_intel, but cannot be (re)bound to vfio-pci. So I removed that sound device from the VM’s configuration.

  3. The error from qemu says that the IOMMU group 52 “is not viable”. Guessing from the error message, I also tried to unbind a Thunderbolt device one level up the PCIe hierarchy (0000:3c:01.0), the only other device in IOMMU group 52, from pcieport and bind it to vfio-pci. The former succeeded, the latter did not. This is the error message suggesting that the whole IOMMU group must be re-bound:

    2023-02-13T23:40:17.205830Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:3d:00.0","id":"hostdev0","bus":"pci.11","addr":"0x0"}: vfio 0000:3d:00.0: group 52 is not viable
    Please ensure all devices within the iommu_group are bound to their vfio bus driver.

As for the “please ensure”: No, qemu, sorry, but that has always been your job. Here’s the dmesg during a VM start attempt:

Feb 14 00:43:17 kernel: vfio-pci 0000:3d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Feb 14 00:43:17 kernel: virbr0: port 2(vnet1) entered blocking state
Feb 14 00:43:17 kernel: virbr0: port 2(vnet1) entered disabled state
Feb 14 00:43:17 kernel: device vnet1 entered promiscuous mode
Feb 14 00:43:17 kernel: virbr0: port 2(vnet1) entered blocking state
Feb 14 00:43:17 kernel: virbr0: port 2(vnet1) entered listening state
Feb 14 00:43:17 kernel: virbr0: port 2(vnet1) entered disabled state
Feb 14 00:43:17 kernel: device vnet1 left promiscuous mode
Feb 14 00:43:17 kernel: virbr0: port 2(vnet1) entered disabled state
Feb 14 00:43:18 kernel: vfio-pci 0000:3d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none

(The intertwined audit messages don’t seem to report any failures, so I left them out for brevity.) Now, the owns=none thing looks suspicious. Could it be the case that libvirtd itself and/or the libvirt-qemu user lack some ownerships or capabilities? Sadly I no longer have logs from successful runs (5.19.x, pre-November) for comparison.
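To see at a glance which drivers hold the members of an IOMMU group such as group 52, the group's sysfs directory can be walked as in the Python sketch below. The function names group_drivers and not_on_vfio are invented for this example, and the check is deliberately simplified: it reports every member bound to something other than the wanted driver, whereas the kernel's own viability rules are more nuanced, so the output is a hint rather than a verdict.

```python
import os


def group_drivers(group, sysfs="/sys"):
    """Map each device in an IOMMU group to its bound driver (None if unbound)."""
    devices_dir = os.path.join(sysfs, "kernel", "iommu_groups", str(group), "devices")
    drivers = {}
    for bdf in sorted(os.listdir(devices_dir)):
        link = os.path.join(devices_dir, bdf, "driver")
        # 'driver' is a symlink to the driver's sysfs directory when one is bound.
        drivers[bdf] = os.path.basename(os.readlink(link)) if os.path.islink(link) else None
    return drivers


def not_on_vfio(drivers, wanted="vfio-pci"):
    """Return the group members bound to a driver other than `wanted`."""
    return {bdf: name for bdf, name in drivers.items() if name not in (None, wanted)}


# State described in the post before the unbind attempt: GPU on vfio-pci, bridge on pcieport.
print(not_on_vfio({"0000:3c:01.0": "pcieport", "0000:3d:00.0": "vfio-pci"}))
```

On the live system, not_on_vfio(group_drivers(52)) would do the same using the real sysfs contents.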

Last edited by andrej.podzimek (2023-02-13 23:53:52)

#5 2023-03-24 01:23:54

Lacam
Member
Registered: 2023-03-24
Posts: 3

Hey, I am having a similar problem here with my eGPU setup on Razer Core X Chroma.
I want to ask if you remember what your specific version of 5.19.x kernel was,
since I am getting the same "can't enable device: BAR 5 [io  0x0000-0x007f] not claimed".
Or have you solved your problem on 6.1.11?

I am currently running Ubuntu 22.04 LTS, with 5.19.0-35-generic kernel.
The eGPU is a Razer Core X Chroma with a GTX 1650 Super.

I also had to change the kernel command line a bit to get the eGPU recognized by nvidia-smi. I used:

pci=assign-busses,hpbussize=0x40,realloc,hpmmiosize=128M,hpmmioprefsize=4G intel_iommu=on

With the above, I can actually use the eGPU and run benchmark tests on it.
I can also bind the eGPU with the VFIO-PCI driver, but when I try to pass it through to the VM with QEMU I get the above BAR 5 error.

Thanks!

Last edited by Lacam (2023-03-24 04:14:34)

#6 2023-03-24 07:43:31

andrej.podzimek
Member
From: Zürich, Switzerland
Registered: 2005-04-10
Posts: 115

Lacam wrote:

I want to ask if you remember what your specific version of 5.19.x kernel was,

5.19.4, Arch’s default configuration.

Lacam wrote:

Or have you solved your problem on 6.1.11?

I wish I had that problem-solving capability, but there are only so many LoC one can understand well enough in professional + free time contexts combined. :(

This is still broken (since November at least) and it is a huge pain and it renders expensive hardware inoperable.

Lacam wrote:

I am currently running Ubuntu 22.04 LTS, with 5.19.0-35-generic kernel.

Well, if Ubuntu has that problem on 5.19.x and Arch doesn’t, then some .config diffing between Arch vs. Ubuntu and Arch 5.19 vs. Arch 6.x, followed by diffing the diffs, may yield a clue.
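A rough Python sketch of such a .config comparison, limited to options whose names look relevant, might be the following. The name filter (PCI, VFIO, IOMMU, HOTPLUG, THUNDERBOLT) is just a guess at what matters here, and read_config and config_diff are illustrative names rather than any existing tool.

```python
import re


def read_config(text):
    """Map CONFIG_* names to their values; '# CONFIG_X is not set' becomes 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def config_diff(old_text, new_text, pattern=r"PCI|VFIO|IOMMU|HOTPLUG|THUNDERBOLT"):
    """Return {option: (old, new)} for differing options whose name matches pattern.

    An option missing from one side shows up as None on that side.
    """
    old, new = read_config(old_text), read_config(new_text)
    keep = re.compile(pattern)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after and keep.search(name):
            changes[name] = (before, after)
    return changes
```

Running config_diff once for Arch vs. Ubuntu and once for Arch 5.19 vs. Arch 6.x, then intersecting the keys of the two results, would give the "diff of diffs" mentioned above.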

Lacam wrote:

I also had to change the kernel command line a bit to get the eGPU recognized by nvidia-smi. I used:

pci=assign-busses,hpbussize=0x40,realloc,hpmmiosize=128M,hpmmioprefsize=4G intel_iommu=on

Tried that↑, but there’s almost no change. The address of the GPU changed from 0000:3d:00.0 to 0000:4a:00.0, but other than that everything remained the same. The failure is still the same, nvidia-smi doesn’t work and none of the modules that could control the NVidia will do so (due to the “bar 5” failure).

Lacam wrote:

With the above, I can actually use the eGPU and run benchmark tests on it.

Magic. That change changes nothing for me, unfortunately. (Obviously I’m not using the intel_iommu bit on an AMD machine, but tried the 0x40 thing.)

Lacam wrote:

I can also bind the eGPU with the VFIO-PCI driver, but when I try to pass it through to the VM with QEMU I get the above BAR 5 error.

I can neither unbind it (because there is nowhere to unbind it from; no driver loads successfully) nor bind it to vfio_pci (because of the “bar 5” error).

The only new thing I tried since last time is the open-source NVidia driver. No change there; still “bar 5” as usual.

#7 2023-03-24 07:54:34

Lacam
Member
Registered: 2023-03-24
Posts: 3

Thanks for the detailed reply!

I'll try upgrading my kernel to 5.19.4, and see if there are any changes.

I'll be sure to update you on this if I succeed.

#8 2023-03-28 10:23:37

Lacam
Member
Registered: 2023-03-24
Posts: 3

FYI, I tried diffing the following:

  • Arch Linux 5.19.4 -> Ubuntu 5.19.0-35-generic source code

  • Arch Linux 5.19.4 -> Arch Linux 6.1.11 source code

Interestingly, both diffs share a patch: it is present in Ubuntu 5.19.0-35 and Arch 6.1.11, but not in Arch 5.19.4.
setup-res.c is the file where our BAR error is emitted, so I thought the patch might be relevant and switched my kernel to Ubuntu-hwe-5.19-5.19.0-18.18_22.04.3, the most recent Ubuntu kernel without it, hoping that would work. But I am still getting the same BAR error :(
I also tried swapping the GPU in my eGPU from the 1650 Super to a 3060, but nothing changed.

I also diffed the kernel build configs of Arch 5.19.4 and Ubuntu-hwe-5.19-5.19.0-18.18_22.04.3, but found nothing interesting, at least nothing related to PCI or VFIO.

Do you remember if there were any other modifications you made when your eGPU passthrough to QEMU worked on Arch 5.19.4?

My next step is to dig into the BAR error and understand what it really means.
My last resort will be to set up Arch Linux 5.19.4 for testing (I hope to get it working before that), so any thoughts you can share would help a lot.
