#1 2024-08-16 06:21:41

RobertSchmobert
Member
Registered: 2024-08-16
Posts: 5

SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

Hello,

I have a strange issue after upgrading nvidia drivers to version 555. The performance dropped to about one frame every few seconds across the board, beginning at the login screen.

I have two RTX 3060s that I use for running LLMs. I'm on GNOME + Wayland.

I tried the solution proposed here:
https://forums.developer.nvidia.com/t/m … 555/293606

I also tried running Xorg but I have a different issue with it and it crashes on boot.

What worked was pulling one GPU out. Doesn't matter which, same result.

My logs:
http://0x0.st/X4WX.txt

Where would I even begin to attempt to solve this?

Thanks in advance.

Last edited by RobertSchmobert (2024-08-27 07:55:07)

Offline

#2 2024-08-16 08:42:27

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,175

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: (EE) No devices detected.
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: (EE)
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: Fatal server error:
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: (EE) no screens found(EE)
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: (EE)
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: Please consult the The X.Org Foundation support
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]:          at http://wiki.x.org
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]:  for help.
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: (EE) Please also check the log file at "/home/robert/.local/share/xorg/Xorg.0.log" for additional information.
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: (EE)
aug 16 08:50:26 robert-PC /usr/lib/gdm-x-session[3331]: (EE) Server terminated with error (1). Closing log file.

Remove /etc/X11/xorg.conf and

aug 16 08:44:28 robert-PC kernel: simple-framebuffer simple-framebuffer.0: [drm] fb0: simpledrmdrmfb frame buffer device
aug 16 08:44:31 robert-PC org.gnome.Shell.desktop[871]: MESA-LOADER: failed to open simpledrm: /usr/lib/dri/simpledrm_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri, suffix _dri)

Enable https://wiki.archlinux.org/title/NVIDIA … de_setting - use the "nvidia_drm.modeset=1" kernel parameter (modprobe.conf won't do!) to get rid of the simpledrm device to make sure you're not running on that.
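A quick way to confirm the parameter really reached the kernel command line (rather than only a modprobe config) could look like the sketch below. `has_modeset_param` is a made-up helper name, not part of any package; it only inspects the string it is handed, and it accepts the `nvidia-drm` spelling too, since the kernel treats `-` and `_` in parameter names alike.

```shell
#!/bin/bash
# has_modeset_param (hypothetical helper): succeed if the given kernel
# command line contains nvidia_drm.modeset=1 as a separate word.
has_modeset_param() {
  case " $1 " in
    *" nvidia_drm.modeset=1 "*|*" nvidia-drm.modeset=1 "*) return 0 ;;
  esac
  return 1
}

# Usage on a live system:
#   has_modeset_param "$(cat /proc/cmdline)" && echo "modeset=1 is on the kernel command line"
```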

1fps sounds like some PRIME sync issue, i.e. you're running on GPU1 but the (only) output is attached to GPU2, which is also supported by

What worked was pulling one GPU out. Doesn't matter which, same result.

so you'd have to control the target GPU, for GDM
=> https://bbs.archlinux.org/viewtopic.php … 3#p2174563

Online

#3 2024-08-16 10:51:07

RobertSchmobert
Member
Registered: 2024-08-16
Posts: 5

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

Thank you for the reply!

Removing xorg.conf fixed Xorg, so even though the login screen is laggy, after booting into Xorg everything runs as it should.

seth wrote:

Enable https://wiki.archlinux.org/title/NVIDIA … de_setting - use the "nvidia_drm.modeset=1" kernel parameter (modprobe.conf won't do!) to get rid of the simpledrm device to make sure you're not running on that.

I have added nvidia_drm.modeset=1 to the options row in arch.conf, and I have nvidia, nvidia_drm, nvidia_uvm, nvidia_modeset in the modules section in mkinitcpio.conf, and removed kms from the hooks section. I hope that's what you meant.

cat /sys/module/nvidia_drm/parameters/modeset

returns Y

seth wrote:

so you'd have to control the target GPU, for GDM

This is what fixed gdm and wayland for me. Placing the following rule into /etc/udev/rules.d:

ENV{DEVNAME}=="/dev/dri/card0", TAG+="mutter-device-preferred-primary"

And then plugging the monitor into that GPU. The problem is that that's my secondary GPU. The one in the PCIe x16 slot is card1. Telling gdm to use card1 and plugging my monitor into that is not working, it's still laggy. Is there any way to make it work on card1?
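To see which PCI address sits behind card0 and card1 before writing such a rule, something like the following may help. It is only a sketch: `list_drm_cards` is a hypothetical helper, and it assumes the usual sysfs layout where /sys/class/drm/cardN/device links to the PCI device.

```shell
#!/bin/bash
# list_drm_cards (hypothetical helper): print "cardN <PCI address>" for each
# DRM card node. The sysfs root is a parameter only so it can be tested.
list_drm_cards() {
  ldc_root="${1:-/sys}"
  for ldc_card in "$ldc_root"/class/drm/card[0-9]*; do
    # Skip connector entries such as card0-DP-1.
    case "${ldc_card##*/}" in *-*) continue ;; esac
    # Skip entries without a device link (also covers an unmatched glob).
    [ -e "$ldc_card/device" ] || continue
    printf '%s %s\n' "${ldc_card##*/}" "$(basename "$(readlink -f "$ldc_card/device")")"
  done
}

# Usage on a live system:
#   list_drm_cards
```

The addresses printed this way can be compared with the Bus-Id column of nvidia-smi.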

Thanks again.

Offline

#4 2024-08-16 12:44:38

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,175

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

cat /sys/module/nvidia_drm/parameters/modeset

The meaningful check is

sudo journalctl -b | grep simple

you don't want the simpledrm device to show up.
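If you want that check as a yes/no answer, a small wrapper along these lines would do. This is a sketch only: `uses_simpledrm` is a made-up name, and matching on "simpledrm" or "simple-framebuffer" is an assumption based on the journal lines quoted earlier in this thread.

```shell
#!/bin/bash
# uses_simpledrm (hypothetical helper): read journal text on stdin and
# succeed if any line mentions the simpledrm / simple-framebuffer device.
uses_simpledrm() {
  grep -qE 'simpledrm|simple-framebuffer'
}

# Usage on a live system:
#   if sudo journalctl -b | uses_simpledrm; then
#     echo "simpledrm still shows up in this boot"
#   fi
```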

Why do you want to use the GPU that's not in the PEG?
Do you actually want to pass through the other one to a VM?
Because I'd start with that, as it will greatly reduce the complexity of the situation.
Also check e.g. nvidia-smi to figure out what GDM actually ends up running on (since the 1fps and the behavior on removing either card still sound like some reverse PRIME/sync issue)
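For a more compact view than the full nvidia-smi table, its CSV query mode can be filtered. The sketch below assumes the `index`, `pci.bus_id` and `display_active` query fields and that the latter reports "Enabled" or "Disabled"; `display_gpus` is a hypothetical helper name.

```shell
#!/bin/bash
# display_gpus (hypothetical helper): read CSV lines of the form
# "index, pci.bus_id, display_active" on stdin and print the bus IDs of
# the GPUs whose display is reported as active.
display_gpus() {
  awk -F', *' '$3 == "Enabled" { print $2 }'
}

# Usage on a live system:
#   nvidia-smi --query-gpu=index,pci.bus_id,display_active --format=csv,noheader | display_gpus
```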

Online

#5 2024-08-16 13:05:00

RobertSchmobert
Member
Registered: 2024-08-16
Posts: 5

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

simpledrm doesn't show up at all, but that's a good check to know.

seth wrote:

Why do you want to use the GPU that's not in the PEG?

I'm sorry, but I have no idea what PEG means in this context. I have a Ryzen 5 5600X with no integrated GPU, only 2 x RTX 3060, one in the PCIe x16 slot and one in the x8 slot, I think. I like to game occasionally, so I would prefer the default card to be the one in the x16 slot.

seth wrote:

Do you actually want to pass through the other one to a VM?

Nope, they're both for inferencing on arch, and the occasional gaming, also on arch.

seth wrote:

Also check eg. nvidia-smi to figure what GDM actually ends up running on (since the 1fps and behavior on removing either card still sounds like some reverse prime/sync issue)

So despite setting:

ENV{DEVNAME}=="/dev/dri/card1", TAG+="mutter-device-preferred-primary"

as well as:

MESA_VK_DEVICE_SELECT=10de:2487!

which corresponds to card1, nvidia-smi returns:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:07:00.0  On |                  N/A |
| 78%   41C    P8             22W /  140W |    2417MiB /  12288MiB |     37%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1354      G   /usr/bin/gnome-shell                          121MiB |
|    0   N/A  N/A      2306      G   ...local/share/Steam/ubuntu12_32/steam          2MiB |
|    0   N/A  N/A      2307      G   /usr/bin/Xwayland                              32MiB |
|    0   N/A  N/A      2326    C+G   /usr/lib/mutter-x11-frames                      9MiB |
|    0   N/A  N/A      2757      G   ./steamwebhelper                               34MiB |
|    0   N/A  N/A      3300    C+G   ...ommon\ELDEN RING\Game\eldenring.exe       2120MiB |
|    0   N/A  N/A      7160      G   /usr/bin/kgx                                   20MiB |
+-----------------------------------------------------------------------------------------+

So I'm pretty sure everything ends up using card0 anyway. The display at this point is plugged into card1. I think I'm completely lost.

Offline

#6 2024-08-16 13:40:04

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,175

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

PEG (PCI Express Graphics) is a(n often dedicated) x16 slot that typically can also draw more power.

Please post your complete system journal for the boot:

sudo journalctl -b | curl -F 'file=@-' 0x0.st

The GPU selection is fairly easy to control with X11; for Wayland you're completely dependent on mutter's behavior

And then plugging the monitor into that GPU. The problem is that that's my secondary GPU. The one in the PCIe x16 slot is card1. Telling gdm to use card1 and plugging my monitor into that is not working

reads more like you're actually intending to run gnome on the GPU in the x8 slot, but anyway: is 07:00 the x8 slot? (The "P8" there is a power state, it doesn't say anything about the bus)
If you're not intending to forward one of the GPUs, you're probably going to dedicate it to CUDA? I.e. you've no interest in using it for graphics AT ALL?

MESA_VK_DEVICE_SELECT=10de:2487!

Even if nvidia cared about the mesa setting: if the GPUs are identical, they have the same PCI ID.
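To illustrate: two identical cards differ only in their bus address, not in the vendor:device pair. A rough sketch for listing both side by side, where `pci_ids` is a hypothetical helper and the input is assumed to look like `lspci -nn` output:

```shell
#!/bin/bash
# pci_ids (hypothetical helper): read `lspci -nn` style lines on stdin and
# print "<bus address> <vendor:device>" for each line that carries an ID.
pci_ids() {
  sed -nE 's/^([0-9a-f:.]+) .*\[([0-9a-f]{4}:[0-9a-f]{4})\].*$/\1 \2/p'
}

# Usage on a live system (10de is NVIDIA's PCI vendor ID):
#   lspci -nn -d 10de: | pci_ids
```

If both GPU lines end in the same ID, a selector based on that ID cannot tell them apart.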

Online

#7 2024-08-17 10:24:26

RobertSchmobert
Member
Registered: 2024-08-16
Posts: 5

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

I'm sorry if I caused any confusion.

I don't have an integrated graphics card, only the 2 x RTX 3060 12GB.

I want to use the GPU that's in the x16 slot for graphics like gnome, and gaming. Also inferencing.
I only want to use the x8 GPU for inferencing when I need more than 12GB of VRAM. Inferencing already works perfectly. Other than that I don't want this card to be used at all.

My x16 GPU is recognized as card1, or GPU1.
My x8 GPU is recognized as card0, or GPU0.

This is the log of my PC booting into Xorg. The login screen had 1fps behavior. After booting into Xorg, everything works fine.
http://0x0.st/X466.txt

This is the output of nvidia-smi on Xorg

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:04:00.0 Off |                  N/A |
|  0%   37C    P8              9W /  140W |       9MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        Off |   00000000:08:00.0  On |                  N/A |
|  0%   44C    P8             16W /  140W |     324MiB /  12288MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1342      G   /usr/lib/Xorg                                   4MiB |
|    1   N/A  N/A      1342      G   /usr/lib/Xorg                                 101MiB |
|    1   N/A  N/A      1410      G   /usr/bin/gnome-shell                          142MiB |
|    1   N/A  N/A      2085      G   /usr/bin/kgx                                   10MiB |
|    1   N/A  N/A      2409      G   ...ures=SpareRendererForSitePerProcess         57MiB |
+-----------------------------------------------------------------------------------------+

This is the log of my PC booting into Wayland. Wayland still has the 1fps behavior.
http://0x0.st/X46g.txt

This is the output of nvidia-smi on Wayland:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:04:00.0 Off |                  N/A |
|  0%   42C    P8             10W /  140W |       5MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        Off |   00000000:08:00.0  On |                  N/A |
|  0%   41C    P8             14W /  140W |     275MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1408      G   /usr/bin/gnome-shell                            2MiB |
|    1   N/A  N/A      1408      G   /usr/bin/gnome-shell                          247MiB |
|    1   N/A  N/A      2484      G   /usr/bin/kgx                                   20MiB |
+-----------------------------------------------------------------------------------------+

gdm is set to use card1 in /etc/udev/rules.d/, and the monitor is plugged into card1

Offline

#8 2024-08-17 16:21:41

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,175

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

I don't have an integrated graphics card, only the 2 x RTX 3060 12GB.

I got that; either way you're (now) running on GPU1 at 08:00. The issue is going to be the gnome wayland compositor (assuming you're using GDM, that's the case there as well - https://wiki.archlinux.org/title/GDM#Use_Xorg_backend )

Unfortunately I've no idea how or whether it's even possible to tell the gnome/mutter wayland compositor to only use a specific PCI device (since driver and even model are equivalent)

My best idea would be to unbind the second GPU, start gnome and then re-bind it, something like /usr/bin/aigpu:

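#!/bin/bash
# Detach or re-attach the second GPU and its audio function (run as root).

gpu="0000:04:00.0"  # the x8 GPU, per the nvidia-smi output above
aud="0000:04:00.1"  # assumed audio function of the same card

case "$1" in
"unbind")
  # Detach both functions from their current drivers.
  echo "$gpu" > "/sys/bus/pci/devices/$gpu/driver/unbind"
  echo "$aud" > "/sys/bus/pci/devices/$aud/driver/unbind"
;;

"bind")
  # Drop the devices and rescan the bus so the drivers probe them again.
  echo 1 > "/sys/bus/pci/devices/$gpu/remove"
  echo 1 > "/sys/bus/pci/devices/$aud/remove"
  echo 1 > "/sys/bus/pci/rescan"
;;
*)
  echo "usage: $0 unbind|bind" >&2
  exit 1
;;
esac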
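To verify that an unbind or re-bind actually took effect, the driver link in sysfs can be inspected. Again just a sketch: `pci_driver` is a hypothetical helper, and it assumes the usual layout where /sys/bus/pci/devices/<address>/driver is a symlink to the bound driver.

```shell
#!/bin/bash
# pci_driver (hypothetical helper): print the name of the driver a PCI device
# is bound to, or "none" if it is unbound. The sysfs root is a parameter only
# so the function can be tested.
pci_driver() {
  pd_link="${2:-/sys}/bus/pci/devices/$1/driver"
  if [ -L "$pd_link" ]; then
    basename "$(readlink "$pd_link")"
  else
    echo "none"
  fi
}

# Usage on a live system:
#   pci_driver 0000:04:00.0
```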

Online

#9 2024-08-27 07:53:50

RobertSchmobert
Member
Registered: 2024-08-16
Posts: 5

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

The issue resolved itself after updating nvidia drivers to version 560.

Offline

#10 2024-08-27 11:56:58

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,175

Re: SOLVED-Extreme lag with multi-GPU setup after upgrading nvidia drivers

Online
