[solved] CUDA on a computer with two generations of NVIDIA GPUs

dmidge · 2018-12-07 18:19:31

Hi,

I have a fairly old computer (from around 2008) with an old Nvidia GPU. I just added a fairly new Nvidia graphics card (GTX 1060) to it, because I want to do some CUDA work on it.

As far as I understand from what I have read, I need to replace nouveau with the proprietary Nvidia driver to be able to use CUDA. Nouveau works fine on my old graphics card, so I would rather stick with it if I can. However, I wouldn't mind replacing it with the proprietary driver if that works.
So I tried to install the proprietary driver. But when I install it, nouveau seems to be removed for both graphics cards. (I made that mistake earlier and chose to reinstall from scratch.) My display stopped working, even after I tried to "unblacklist" nouveau.
But the driver versions for the two cards seem to target different kernel versions, so I guess they are not compatible with each other. Thus, if I install the old version, my old graphics card connected to the display would work, but not the new one for CUDA, and vice versa. But I want both a functional display and CUDA.

So now, my question is: how can I use CUDA on my new GPU while keeping the display working on my old graphics card?

Thanks for your help!
Cheers,

Last edited by dmidge (2019-01-06 21:56:26)

Lone_Wolf · 2018-12-08 12:49:04

Looking at the dependencies of cuda, it may be possible.

Try installing nvidia-utils but NOT nvidia.
In order to keep nouveau working you'll probably have to manually remove /usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.conf.

Edit: also install opencl-nvidia.

Last edited by Lone_Wolf (2018-12-08 12:50:51)

dmidge · 2018-12-08 18:44:53

Hi Lone_Wolf,

Thanks for your help!
Unfortunately, I ended up with a black screen at reboot time.
For both GPUs, lspci says that the kernel module in use is nouveau. (I just renamed /usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.conf to /usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.confDel, in case I need to restore it. I don't think it would change anything.)

With dmesg, I get audit information. Nouveau loads and, a bit later, it seems that a core dump is triggered. I guess it comes from sddm (the display manager that starts the KDE/Plasma desktop), because that name pops up in some of the messages.
That is consistent with what I see in systemctl, where its status is failed and a "6/ABRT" (abort, I guess) status is shown.

With journalctl -u sddm, I get a bit more info. The first error is "Failed to read display number from pipe", which comes just after the command "Running: /usr/bin/X -nolisten tcp auth [...]" ([...] because I shortened the line)

I renamed /etc/X11/xorg.conf to /etc/X11/xorg.confDel and managed to get the display back. I will now test whether I can compile some CUDA code.

Last edited by dmidge (2018-12-08 18:45:10)

dmidge · 2018-12-09 00:33:56

Sadly, CUDA is still not working and I don't know how to fix it. Even nvidia-smi cannot access the card...

dmidge · 2018-12-09 01:07:49

Actually, at this stage, I am not even sure that I can make CUDA work even if I scrapped my old GPU...

dmidge · 2018-12-09 03:12:30

Okay. Installing linux419-nvidia and blacklisting nouveau gets CUDA working. But now my first card doesn't work and I only have a CLI.
How can I make nouveau work again alongside that?

Ropid · 2018-12-09 04:17:13

You normally can't use both. The two kernel modules clash with each other so nouveau needs to be blacklisted to be able to use the nvidia module.

Maybe there's a way to have a certain kernel module only get used for one PCIe device but not another one?

dmidge · 2018-12-09 20:44:04

That is what I am looking for. Is there a way? Maybe through a proper Xorg.conf? Or some special modprobe.conf file?
Looking online, it may be more of a job for udev. But I dunno.

Last edited by dmidge (2018-12-09 20:46:08)

Lone_Wolf · 2018-12-09 21:25:35

This needs to be solved way before Xorg.conf is applied; custom udev rules might be "exactly what the doctor ordered".
Let's hope some people that know how to write those respond.

dmidge · 2018-12-25 19:43:30

@Lone_Wolf: I think you are right, but do you have any idea how to find someone I could poke for help with it?
(Merry Christmas to whom it applies.)

Last edited by dmidge (2018-12-25 19:43:56)

mich41 · 2018-12-26 19:17:50

Of course you can use the 1060 for display AND cuda and throw out the old GPU.
One of the legacy versions may support both GPUs. See https://www.nvidia.com/object/unix.html and install with pacman if you find a suitable version.

Finally, I think it may be possible to run nouveau and nvidia side by side.
Make sure the old GPU is the one picked up by BIOS/UEFI. Typically a matter of rearranging them in PCIe slots.
Restore nouveau because that's the graphics stack you will use. If installation of nvidia removed some conflicting packages like mesa, reinstall them (check /var/log/pacman.log for that). If it installed module blacklists, remove them and blacklist nvidia instead (run "pacman -Ql nvidia" to find out what files it installed).
Once you have nouveau graphics running, reserve the 1060 as you would for GPU passthrough so that nouveau doesn't touch it during boot.
Manually load nvidia kernel module and see what happens. It won't touch the old GPU used by nouveau. It may bitch about nouveau being loaded. If you are lucky, it may still bind to the 1060 though and work with it.

Depending on how exactly the nvidia package is found to interfere with nouveau, some fiddling may be necessary to prevent problems at future updates.

dmidge · 2018-12-26 21:08:00

@mich41:
Well, I can't remove the old GPU. It is soldered onto the motherboard (as in a lot of laptops). And sadly, there is no driver version that supports both. Actually, it seems this is also because the old GPU drivers from Nvidia don't support the "new" Linux kernels.

Sadly, it was very tough to get nouveau working again with the nvidia proprietary drivers loaded. But it finally worked. Then I tried to load the nvidia module by hand. It just won't load because of the conflict with nouveau (and nouveau can use the new card). Thus I would need to find a way to unload the module for this specific card while keeping it running on the other one, and then try to load the nvidia driver. Which is exactly the problem that we are trying to solve with udev.

progandy · 2018-12-26 21:26:14

You could try to pass your cuda gpu to a virtual machine and only install the nvidia drivers there.

mich41 · 2018-12-26 23:35:45

dmidge wrote:

Sadly, it was very tough to get nouveau working again with the nvidia proprietary drivers loaded. But it finally worked.

Great, getting nouveau to work without uninstalling nvidia is half the job done.

dmidge wrote:

Then I tried to load the nvidia module by hand. It just won't load because of the conflict with nouveau (and nouveau can use the new card). Thus I would need to find a way to unload the module for this specific card while keeping it running on the other one, and then try to load the nvidia driver. Which is exactly the problem that we are trying to solve with udev.

I don't think it can be achieved with a udev rule. Problem is, once any of these kernel modules is loaded, it will bind to both devices, regardless of which device triggered the loading of this module. Later the other module will find both devices busy and fail to work. I don't think there is any way to stop that, except by binding the dummy pci-stub driver to the 1060 beforehand, like people do to reserve GPUs for VM passthrough. (Well, technically, you could edit nouveau code and recompile, but come on).

If you are lucky, it may be enough to unbind nouveau from the 1060 after it has loaded and then load nvidia, but this approach may also fail for reasons including nouveau being buggy and crashing, or nvidia failing to properly initialize a device previously initialized by nouveau.
To try it anyway, make sure that no X server is running on the 1060 and do

# lspci | grep VGA
01:05.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RS780L [Radeon 3000]         <- note the numbers on the left
# echo 0000:01:05.0 > /sys/module/nouveau/drivers/pci\:nouveau/unbind                                  <- yes, you need to add 0000:

This example is obviously from an integrated Radeon GPU; use the numbers of your 1060 instead.
If this succeeds and the machine doesn't crash, reload nvidia

rmmod nvidia
modprobe nvidia

Post any errors printed by modprobe or dmesg after nvidia loading, if the machine survives.
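
As a hedged aside, a small guard around that unbind step could look like the sketch below. It assumes the standard sysfs layout under /sys/bus/pci/drivers (the /sys/module/nouveau/drivers/pci:nouveau path used above is a link into it). The function name unbind_if_bound and its optional third argument (an alternative sysfs root, only there so the sketch can be tried on a fake directory tree) are made up for illustration. The kernel answers a write to unbind with "No such device" when the address is not bound to that driver, so checking first gives a clearer message.

```shell
# Unbind a PCI device from a driver, but only if it is really bound to it.
# $1 = driver name, $2 = full PCI address (with the 0000: domain prefix),
# $3 = sysfs root (hypothetical extra argument, defaults to /sys).
unbind_if_bound() {
    driver=$1
    addr=$2
    root=${3:-/sys}
    drvdir=$root/bus/pci/drivers/$driver

    # Every bound device shows up as an entry named after its PCI address.
    if [ ! -e "$drvdir/$addr" ]; then
        echo "$addr is not bound to $driver" >&2
        return 1
    fi

    # Same write as the manual "echo ... > unbind" (needs root on real sysfs).
    printf '%s\n' "$addr" > "$drvdir/unbind"
}
```

As root, this would be called as "unbind_if_bound nouveau 0000:03:00.0", with the address taken from your own lspci output.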

dmidge · 2018-12-30 21:52:37

Hi,

Thanks for the clarification, mich41.

mich41 wrote:

To try it anyway, make sure that no X server is running on the 1060 and do
# lspci | grep VGA
01:05.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RS780L [Radeon 3000]         <- note the numbers on the left
# echo 0000:01:05.0 > /sys/module/nouveau/drivers/pci\:nouveau/unbind                                  <- yes, you need to add 0000:
This example is obviously from an integrated Radeon GPU; use the numbers of your 1060 instead.
If this succeeds and the machine doesn't crash, reload nvidia
rmmod nvidia
modprobe nvidia
Post any errors printed by modprobe or dmesg after nvidia loading, if the machine survives.

So, as asked, I tried what you said. I loaded both nvidia and nouveau. Both are loaded at the same time as shown:

# lsmod
Module                  Size  Used by
nouveau              2187264  1
mxm_wmi                16384  1 nouveau
wmi                    28672  2 mxm_wmi,nouveau
i2c_algo_bit           16384  1 nouveau
ttm                   126976  1 nouveau
nvidia_drm             53248  0
nvidia_modeset       1040384  1 nvidia_drm
nvidia              17313792  1 nvidia_modeset
[...]

This way, I have access to the nouveau folder:

# ls /sys/module/nouveau/drivers/pci\:nouveau
0000:02:00.0  bind  module  new_id  remove_id  uevent  unbind

The lspci shows:

# lspci | grep VGA
02:00.0 VGA compatible controller: NVIDIA Corporation GT218M [GeForce 315M] (rev a2)
03:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)

and then, I get this weird error:

# echo 0000:03:00.0 > /sys/module/nouveau/drivers/pci\:nouveau/unbind    
-bash: echo: write error: No such device

I just don't get it...
(Btw, sorry, it is a 1050 Ti, not a 1060. I was hesitating between the two during my spending adventure...)

Lone_Wolf · 2018-12-31 10:41:48

# ls /sys/module/nouveau/drivers/pci\:nouveau
0000:02:00.0  bind  module  new_id  remove_id  uevent  unbind

That suggests only the 315M is bound to the nouveau module.

Check the output of lspci -k | grep -A 3 VGA (the -A 3 keeps the "Kernel driver in use" line that follows each VGA entry).

mich41 · 2018-12-31 17:37:52

Lone_Wolf wrote:

That suggests only the 315M is bound to the nouveau module.

That's how it looks.

I was probably wrong in saying that the first driver to load will grab both GPUs. Actually, since the old GPU is no longer supported by NVIDIA, it seems entirely reasonable that if the proprietary driver loads first, it will completely ignore that old GPU and then nouveau will be able to pick it up later.
So nvidia comes first and binds to the 1050Ti, then nouveau takes the internal GPU. The display is driven by nouveau and, guess what, I think there is a chance that cuda may work on the 1050Ti.

I suggest the following checks

dmesg | grep -i nvidia                         <- it may say something about which GPUs it found and enabled
ls /sys/module/nvidia/drivers/pci\:nvidia      <- I think this or a similar directory should exist and contain 0000:03:00.0
ls /sys/module/nouveau/drivers/pci\:nouveau    <- this directory still shouldn't contain 0000:03:00.0
nvidia-smi -L                                  <- from the nvidia-utils package, lists all GPUs currently accessible through the proprietary driver

Finally, try to run any random cuda demo.
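
The two ls checks above can be wrapped in a small helper, sketched here under the assumption of the standard /sys/bus/pci/drivers layout. The name bound_pci_devices and its optional second argument (a different sysfs root, useful only for trying the function on a fake tree) are hypothetical.

```shell
# Print the PCI addresses currently bound to a driver, one per line.
# $1 = driver name (e.g. nvidia or nouveau), $2 = sysfs root (default /sys).
bound_pci_devices() {
    driver=$1
    root=${2:-/sys}
    for entry in "$root/bus/pci/drivers/$driver"/*; do
        name=${entry##*/}
        # Keep only entries shaped like a full PCI address (0000:03:00.0);
        # this skips bind, unbind, new_id, uevent and similar control files.
        case $name in
            ????:??:??.?) echo "$name" ;;
        esac
    done
}

# Example calls (commented out so that sourcing this file has no side effects):
# bound_pci_devices nvidia
# bound_pci_devices nouveau
```

On the machine discussed in this thread, the hoped-for result would be 0000:03:00.0 for nvidia and 0000:02:00.0 for nouveau.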

dmidge · 2019-01-02 00:17:06

Good news, it works!
However, the X11 graphical display is very slow, as if it were rendered on the CPU instead of the GPU... (By that, I mean that even moving the mouse is not fluid, with the computer doing nothing else.)
How can I check that the old GPU is actually used for X rendering? (The KDE desktop is launched through startx.)
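
One common way to check this, assuming the glxinfo tool is installed, is to look at the "OpenGL renderer string" line it prints: llvmpipe or softpipe there means Mesa is rendering on the CPU. A minimal sketch follows; the function name classify_renderer is made up, and it only reports on whatever display glxinfo itself connects to.

```shell
# Read glxinfo output on stdin and report whether the renderer looks like
# Mesa's software rasterizer (llvmpipe/softpipe) or a hardware driver.
classify_renderer() {
    renderer=$(sed -n 's/^OpenGL renderer string: *//p' | head -n 1)
    case $renderer in
        '') echo "unknown" ;;
        *llvmpipe*|*softpipe*) echo "software: $renderer" ;;
        *) echo "hardware: $renderer" ;;
    esac
}

# Typical use inside the running X session:
# glxinfo | classify_renderer
```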

dmidge · 2019-01-02 03:18:54

Actually, in the nvidia-smi output, I have this line:

0       911      G   /usr/bin/plasmashell                          45MiB

That means that X11 is using the wrong GPU. How can I change that?

V1del · 2019-01-02 08:36:35

Set up an xorg config that explicitly sets the device X should render to, something like

Section "Device"
        Identifier "nouveau"
        # alternatively, if installed, instead of modesetting:
        # Driver "nouveau"
        Driver "modesetting"
        BusID "PCI:0:2:0"
EndSection
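
A hedged note on the BusID value: lspci prints bus, device and function in hexadecimal (for example 02:00.0), while xorg.conf expects PCI:bus:device:function in decimal, so the value has to match your own lspci output. A small conversion sketch, with to_xorg_busid being a made-up helper name:

```shell
# Convert an lspci-style address ("02:00.0" or "0000:02:00.0", hexadecimal)
# into the decimal "PCI:bus:device:function" form used by xorg.conf.
to_xorg_busid() {
    addr=$1
    # Drop an optional 4-digit PCI domain prefix such as "0000:".
    case $addr in
        ????:??:??.?) addr=${addr#*:} ;;
    esac
    case $addr in
        ??:??.?) ;;
        *) echo "unexpected PCI address: $1" >&2; return 1 ;;
    esac
    bus=${addr%%:*}     # e.g. "02"
    rest=${addr#*:}     # e.g. "00.0"
    dev=${rest%%.*}     # e.g. "00"
    fn=${rest#*.}       # e.g. "0"
    # 0x... forces hexadecimal interpretation; %d prints decimal.
    printf 'PCI:%d:%d:%d\n' "$((0x$bus))" "$((0x$dev))" "$((0x$fn))"
}
```

For a GPU that lspci lists at 02:00.0, "to_xorg_busid 02:00.0" prints PCI:2:0:0.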

dmidge · 2019-01-06 21:54:44

@V1del:
I was afraid of getting an answer that involves generating an Xorg.conf. For some reason, I have had bad experiences with them most of the time. But it went quite smoothly this time!
I regenerated the Xorg file with:

# Xorg :0 -configure

Then I deleted everything related to the new Nvidia GPU. That gave the following configuration:

Section "ServerLayout"
        Identifier     "X.org Configured"
        Screen      0  "Screen0" 0 0
        Screen      1  "Screen1" RightOf "Screen0"
        InputDevice    "Mouse0" "CorePointer"
        InputDevice    "Keyboard0" "CoreKeyboard"
EndSection

Section "Files"
        ModulePath   "/usr/lib/xorg/modules"
        FontPath     "/usr/share/fonts/misc"
        FontPath     "/usr/share/fonts/TTF"
        FontPath     "/usr/share/fonts/OTF"
        FontPath     "/usr/share/fonts/Type1"
        FontPath     "/usr/share/fonts/100dpi"
        FontPath     "/usr/share/fonts/75dpi"
EndSection

Section "Module"
        Load  "glx"
EndSection

Section "InputDevice"
        Identifier  "Keyboard0"
        Driver      "kbd"
EndSection

Section "InputDevice"
        Identifier  "Mouse0"
        Driver      "mouse"
        Option      "Protocol" "auto"
        Option      "Device" "/dev/input/mice"
        Option      "ZAxisMapping" "4 5 6 7"
EndSection

Section "Monitor"
        Identifier   "Monitor0"
        VendorName   "Monitor Vendor"
        ModelName    "Monitor Model"
EndSection

Section "Monitor"
        Identifier   "Monitor1"
        VendorName   "Monitor Vendor"
        ModelName    "Monitor Model"
EndSection

Section "Device"
        ### Available Driver options are:-
        ### Values: <i>: integer, <f>: float, <bool>: "True"/"False",
        ### <string>: "String", <freq>: "<f> Hz/kHz/MHz",
        ### <percent>: "<f>%"
        ### [arg]: arg optional
        #Option     "SWcursor"                  # [<bool>]
        #Option     "HWcursor"                  # [<bool>]
        #Option     "NoAccel"                   # [<bool>]
        #Option     "ShadowFB"                  # [<bool>]
        #Option     "VideoKey"                  # <i>
        #Option     "WrappedFB"                 # [<bool>]
        #Option     "GLXVBlank"                 # [<bool>]
        #Option     "ZaphodHeads"               # <str>
        #Option     "PageFlip"                  # [<bool>]
        #Option     "SwapLimit"                 # <i>
        #Option     "AsyncUTSDFS"               # [<bool>]
        #Option     "AccelMethod"               # <str>
        #Option     "DRI"                       # <i>
        Identifier  "Card0"
        Driver      "nouveau"
        BusID       "PCI:2:0:0"
EndSection

Section "Screen"
        Identifier "Screen0"
        Device     "Card0"
        Monitor    "Monitor0"
        SubSection "Display"
                Viewport   0 0
                Depth     1
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     4
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     8
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     15
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     16
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     24
        EndSubSection
EndSection

Section "Screen"
        Identifier "Screen1"
        Device     "Card1"
        Monitor    "Monitor1"
        SubSection "Display"
                Viewport   0 0
                Depth     1
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     4
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     8
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     15
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     16
        EndSubSection
        SubSection "Display"
                Viewport   0 0
                Depth     24
        EndSubSection
EndSection

And that works! Thank you all, guys!

dmidge · 2019-01-06 22:00:34

Summary of the steps to solve the problem (for the next ones to come):
- Install nouveau. Make a working configuration for the X display.
- Install the nvidia driver for the new card (it is not compatible with the old one).
- Unload (blacklist) all nouveau modules and work from a tty console. Neither the X display nor the nouveau module should be loaded at startup.
- Load the nvidia module.
- Then load the nouveau module (after the nvidia module; the order is important).
- Then edit the X configuration file to use only the old GPU for the display.
- Then startx!
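
The blacklist and load-order steps above can be sketched as a pair of shell functions. This is only an illustration under assumptions: the file name nouveau-manual.conf, the function names and the RUN switch are invented here, and the real commands need root. A blacklist line in modprobe.d only stops automatic loading by alias; an explicit "modprobe nouveau" still works, which is what makes the "nvidia first, nouveau second" order possible.

```shell
# Keep nouveau from being auto-loaded at boot.
# $1 = target directory (defaults to /etc/modprobe.d); the file name is
# hypothetical, any *.conf name in that directory would do.
write_nouveau_blacklist() {
    dir=${1:-/etc/modprobe.d}
    printf 'blacklist nouveau\n' > "$dir/nouveau-manual.conf"
}

# Load the proprietary module first, then nouveau.
# By default the commands are only printed; set RUN=1 to execute them.
load_modules_in_order() {
    for mod in nvidia nouveau; do
        if [ "${RUN:-0}" = 1 ]; then
            modprobe "$mod" || return 1
        else
            echo "modprobe $mod"
        fi
    done
}
```

Run from a tty as root, "RUN=1 load_modules_in_order" followed by startx would cover the middle steps of the list.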

mich41 · 2019-01-07 11:12:17

That's cumbersome. I would definitely give this a try, it should load the proprietary module before nouveau is loaded automatically.

V1del · 2019-01-09 11:38:02

FWIW regarding the xorg config, you really don't need all that cruft, my example as written should suffice and contain everything you need to make xorg do the correct thing.

Arch Linux
