#1 2021-09-22 16:38:53

FuzzySPb
Member
Registered: 2013-01-21
Posts: 46

[SOLVED] nvidia 470.74-1 broke CUDA

Hi,
I'm not sure about the correct way to report this problem. It doesn't look like a bug, and I'm not sure whether some package is outdated, so I decided to create a topic here and hopefully get some good advice.

I upgraded the nvidia drivers from version 470.63.01-12-x86_64 to 470.74-1-x86_64 on 20/09/2021 (packages nvidia, nvidia-settings, nvidia-utils, opencl-nvidia). There was no cuda upgrade after that.
After this upgrade I got the following messages from tensorflow:

E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 470.63.1 does not match DSO version 470.74.0 -- cannot find working devices in this configuration
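
The second message suggests that the kernel module currently loaded is older than the user-space libraries. A minimal sketch of how one might check this by hand is given here. It assumes the usual layout of /proc/driver/nvidia/version for the proprietary driver; the helper names loaded_version and check_versions are made up for this illustration.

```shell
#!/bin/sh
# Sketch: compare the version of the loaded NVIDIA kernel module with the
# version of the module file installed on disk for the running kernel.

# Extract the driver version from the text of /proc/driver/nvidia/version.
# Assumed line format (proprietary driver):
#   NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.63.01  <build date>
loaded_version() {
    sed -n 's/^NVRM version:.*Kernel Module *\([0-9][0-9.]*\).*/\1/p'
}

# $1 = version of the loaded module, $2 = version of the module on disk
check_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "MISMATCH: loaded $1, installed $2"
    fi
}

# Only run the real check when the driver is actually loaded.
if [ -r /proc/driver/nvidia/version ]; then
    loaded=$(loaded_version < /proc/driver/nvidia/version)
    installed=$(modinfo -F version nvidia 2>/dev/null)
    check_versions "$loaded" "$installed"
fi
```

A mismatch that survives a reboot hints that the old module is being loaded from somewhere other than the freshly installed package, for example from an initramfs that was not regenerated.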

I have now downgraded nvidia back to 470.63.01-12 and it works fine. It looks like the cuda package (or tensorflow-opt-cuda, python-tensorflow-opt-cuda) should be upgraded as well, but I currently have cuda 11.4.2-1 and don't see a newer version on the NVIDIA site. So it looks like an incompatibility between package versions after the upgrade.

Could someone guide me on how to report this problem the right way?

Last edited by FuzzySPb (2021-09-23 07:18:42)

#2 2021-09-22 16:45:24

Scimmia
Fellow
Registered: 2012-09-01
Posts: 11,463

Re: [SOLVED] nvidia 470.74-1 broke CUDA

Sounds like you didn't reboot to load the new module.

#3 2021-09-22 17:08:37

FuzzySPb
Member
Registered: 2013-01-21
Posts: 46

Re: [SOLVED] nvidia 470.74-1 broke CUDA

Scimmia wrote:

Sounds like you didn't reboot to load the new module.

No, I rebooted the system several times.

#4 2021-09-22 20:24:48

seth
Member
Registered: 2012-09-03
Posts: 49,980

Re: [SOLVED] nvidia 470.74-1 broke CUDA

pacman -Qs '(nvidia|linux)'
uname -a

#5 2021-09-23 05:52:42

FuzzySPb
Member
Registered: 2013-01-21
Posts: 46

Re: [SOLVED] nvidia 470.74-1 broke CUDA

Hi seth, thanks for the guidance.
I ran these commands twice:
1) on the working system before the upgrade; then I performed the upgrade and rebooted the system
2) again after the aforementioned errors from tensorflow.

The full output is at the end of this message, but I think a diff of the two cases is more useful:

[root@solid flex]# diff ok.txt nok.txt 
66c66
< local/nvidia 470.63.01-12
---
> local/nvidia 470.74-1
70c70
< local/nvidia-settings 470.63.01-1
---
> local/nvidia-settings 470.74-1
72c72
< local/nvidia-utils 470.63.01-1
---
> local/nvidia-utils 470.74-1
74c74
< local/opencl-nvidia 470.63.01-1
---
> local/opencl-nvidia 470.74-1
[root@solid flex]# 

Here is the full log:

######  GOOD CASE BEFORE THE UPGRADE  ########

[root@solid ~]# uname -a
Linux solid 5.14.6-arch1-1 #1 SMP PREEMPT Sat, 18 Sep 2021 16:19:35 +0000 x86_64 GNU/Linux
[root@solid ~]# pacman -Qs '(nvidia|linux)'
local/alsa-lib 1.2.5.1-3
    An alternative implementation of Linux sound support
local/alsa-utils 1.2.5.1-1
    Advanced Linux Sound Architecture - Utilities
local/android-udev 20210501-1
    Udev rules to connect Android devices to your linux box
local/archlinux-keyring 20210902-1
    Arch Linux PGP keyring
local/archlinuxcn-keyring 20210401-1
    Arch Linux CN PGP keyring
local/avahi 0.8+22+gfd482a7-1
    Service Discovery for Linux using mDNS/DNS-SD -- compatible with Bonjour
local/base 2-2
    Minimal package set to define a basic Arch Linux installation
local/cuda 11.4.2-1
    NVIDIA's GPU programming toolkit
local/cudnn 8.2.4.15-1
    NVIDIA CUDA Deep Neural Network library
local/efibootmgr 17-2
    Linux user-space application to modify the EFI Boot Manager
local/egl-wayland 1.1.7-1
    EGLStream-based Wayland external platform
local/filesystem 2021.05.31-1
    Base Arch Linux files
local/iptables 1:1.8.7-1
    Linux kernel packet control tool (using legacy interface)
local/keyutils 1.6.3-1
    Linux Key Management Utilities
local/kmod 29-1
    Linux kernel module management tools and library
local/libaio 0.3.112-2
    The Linux-native asynchronous I/O facility (aio) library
local/libcap-ng 0.8.2-3
    A library for Linux that makes using posix capabilities easy
local/libiec61883 1.2.0-6
    A higher level API for streaming DV, MPEG-2 and audio over Linux IEEE 1394
local/libimobiledevice 1.3.0-3
    Library that talks the protocols to support iPhone and iPod Touch devices on Linux
local/libraw1394 2.1.2-3
    Provides an API to the Linux IEEE1394 (FireWire) driver
local/libva 2.12.0-1
    Video Acceleration (VA) API for Linux
local/libvdpau 1.4-1
    Nvidia VDPAU library
local/libxnvctrl 470.74-1
    NVIDIA NV-CONTROL X extension
local/libxshmfence 1.3-2
    a library that exposes a event API on top of Linux futexes
local/linssid 3.6-9
    Graphical wireless scanner for Linux
local/linux 5.14.6.arch1-1
    The Linux kernel and modules
local/linux-api-headers 5.12.3-1
    Kernel headers sanitized for use in userspace
local/linux-firmware 20210818.c46b8c3-1
    Firmware files for Linux
local/mdadm 4.1-2
    A tool for managing/monitoring Linux md device arrays, also known as Software RAID
local/nccl 2.11.4-1
    Library for NVIDIA multi-GPU and multi-node collective communication primitives
local/ndctl 71.1-1
    Utility library for managing the libnvdimm (non-volatile memory device) sub-system in the Linux kernel
local/nvidia 470.63.01-12
    NVIDIA drivers for linux
local/nvidia-prime 1.0-4
    NVIDIA Prime Render Offload configuration and utilities
local/nvidia-settings 470.63.01-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 470.63.01-1
    NVIDIA drivers utilities
local/opencl-nvidia 470.63.01-1
    OpenCL implemention for NVIDIA
local/pacman-mirrorlist 20210822-1
    Arch Linux mirror list for use by pacman
local/python-distro 1.6.0-1
    Linux OS platform information API
local/python-pycuda 2021.1-4
    Python wrapper for Nvidia CUDA
local/solaar 1.0.6-2
    Linux device manager for a wide range of Logitech devices
local/util-linux 2.37.2-1
    Miscellaneous system utilities for Linux
local/util-linux-libs 2.37.2-1
    util-linux runtime libraries
local/v4l-utils 1.20.0-1
    Userspace tools and conversion library for Video 4 Linux
[root@solid ~]# 



######  BAD CASE AFTER THE UPGRADE  ########

[root@solid ~]# uname -a
Linux solid 5.14.6-arch1-1 #1 SMP PREEMPT Sat, 18 Sep 2021 16:19:35 +0000 x86_64 GNU/Linux
[root@solid ~]# pacman -Qs '(nvidia|linux)'
local/alsa-lib 1.2.5.1-3
    An alternative implementation of Linux sound support
local/alsa-utils 1.2.5.1-1
    Advanced Linux Sound Architecture - Utilities
local/android-udev 20210501-1
    Udev rules to connect Android devices to your linux box
local/archlinux-keyring 20210902-1
    Arch Linux PGP keyring
local/archlinuxcn-keyring 20210401-1
    Arch Linux CN PGP keyring
local/avahi 0.8+22+gfd482a7-1
    Service Discovery for Linux using mDNS/DNS-SD -- compatible with Bonjour
local/base 2-2
    Minimal package set to define a basic Arch Linux installation
local/cuda 11.4.2-1
    NVIDIA's GPU programming toolkit
local/cudnn 8.2.4.15-1
    NVIDIA CUDA Deep Neural Network library
local/efibootmgr 17-2
    Linux user-space application to modify the EFI Boot Manager
local/egl-wayland 1.1.7-1
    EGLStream-based Wayland external platform
local/filesystem 2021.05.31-1
    Base Arch Linux files
local/iptables 1:1.8.7-1
    Linux kernel packet control tool (using legacy interface)
local/keyutils 1.6.3-1
    Linux Key Management Utilities
local/kmod 29-1
    Linux kernel module management tools and library
local/libaio 0.3.112-2
    The Linux-native asynchronous I/O facility (aio) library
local/libcap-ng 0.8.2-3
    A library for Linux that makes using posix capabilities easy
local/libiec61883 1.2.0-6
    A higher level API for streaming DV, MPEG-2 and audio over Linux IEEE 1394
local/libimobiledevice 1.3.0-3
    Library that talks the protocols to support iPhone and iPod Touch devices on Linux
local/libraw1394 2.1.2-3
    Provides an API to the Linux IEEE1394 (FireWire) driver
local/libva 2.12.0-1
    Video Acceleration (VA) API for Linux
local/libvdpau 1.4-1
    Nvidia VDPAU library
local/libxnvctrl 470.74-1
    NVIDIA NV-CONTROL X extension
local/libxshmfence 1.3-2
    a library that exposes a event API on top of Linux futexes
local/linssid 3.6-9
    Graphical wireless scanner for Linux
local/linux 5.14.6.arch1-1
    The Linux kernel and modules
local/linux-api-headers 5.12.3-1
    Kernel headers sanitized for use in userspace
local/linux-firmware 20210818.c46b8c3-1
    Firmware files for Linux
local/mdadm 4.1-2
    A tool for managing/monitoring Linux md device arrays, also known as Software RAID
local/nccl 2.11.4-1
    Library for NVIDIA multi-GPU and multi-node collective communication primitives
local/ndctl 71.1-1
    Utility library for managing the libnvdimm (non-volatile memory device) sub-system in the Linux kernel
local/nvidia 470.74-1
    NVIDIA drivers for linux
local/nvidia-prime 1.0-4
    NVIDIA Prime Render Offload configuration and utilities
local/nvidia-settings 470.74-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 470.74-1
    NVIDIA drivers utilities
local/opencl-nvidia 470.74-1
    OpenCL implemention for NVIDIA
local/pacman-mirrorlist 20210822-1
    Arch Linux mirror list for use by pacman
local/python-distro 1.6.0-1
    Linux OS platform information API
local/python-pycuda 2021.1-4
    Python wrapper for Nvidia CUDA
local/solaar 1.0.6-2
    Linux device manager for a wide range of Logitech devices
local/util-linux 2.37.2-1
    Miscellaneous system utilities for Linux
local/util-linux-libs 2.37.2-1
    util-linux runtime libraries
local/v4l-utils 1.20.0-1
    Userspace tools and conversion library for Video 4 Linux
[root@solid ~]# 

Last edited by FuzzySPb (2021-09-23 05:53:03)

#6 2021-09-23 06:51:17

seth
Member
Registered: 2012-09-03
Posts: 49,980

Re: [SOLVED] nvidia 470.74-1 broke CUDA

Do you have the nvidia modules in the initramfs?
Did you update the initramfs?
https://wiki.archlinux.org/title/NVIDIA#Pacman_hook
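
To answer the first question, one could list the contents of the initramfs and look for nvidia modules. This is only a rough sketch: the image path /boot/initramfs-linux.img is an assumption (the default for the stock "linux" kernel), and has_nvidia_module is a name invented for this example.

```shell
#!/bin/sh
# Sketch: report whether an initramfs image contains NVIDIA kernel modules.

# Reads a file listing (one path per line) on stdin and succeeds if any
# entry looks like an nvidia module, e.g. nvidia.ko.xz or nvidia_drm.ko.xz.
has_nvidia_module() {
    grep -Eq '/nvidia[^/]*\.ko(\.[a-z0-9]+)?$'
}

img=/boot/initramfs-linux.img   # assumed default image for the "linux" kernel

# lsinitcpio is shipped with mkinitcpio and prints the files in an image.
if command -v lsinitcpio >/dev/null 2>&1 && [ -r "$img" ]; then
    if lsinitcpio "$img" | has_nvidia_module; then
        echo "nvidia modules are embedded in $img"
    else
        echo "no nvidia modules found in $img"
    fi
fi
```

If the modules are embedded, the image has to be rebuilt after every driver update, for example with mkinitcpio -P; the hook described on the wiki page above automates that step.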

#7 2021-09-23 07:18:26

FuzzySPb
Member
Registered: 2013-01-21
Posts: 46

Re: [SOLVED] nvidia 470.74-1 broke CUDA

seth wrote:

Do you have the nvidia modules in the initramfs?
Did you update the initramfs?
https://wiki.archlinux.org/title/NVIDIA#Pacman_hook

Hi seth, you are right, this was the reason. I didn't have this hook before. I was probably lucky and always got nvidia driver updates together with kernel updates, which triggered the initramfs update for nvidia.
I created this hook and now everything works with the new nvidia 470.74-1.

But why isn't it possible to install this hook with the nvidia package itself? Is it too complex because of the different possible dependencies? Or would it not be triggered after the first installation?
As an alternative, maybe the nvidia package should show a warning if the hook is missing?

Last edited by FuzzySPb (2021-09-23 07:20:44)

#8 2021-09-23 07:30:01

seth
Member
Registered: 2012-09-03
Posts: 49,980

Re: [SOLVED] nvidia 470.74-1 broke CUDA

Early KMS is opt-in (though because of recent developments it's kinda mandatory…), and shipping the hook with the package would make it the default, which is possibly undesired for optimus systems.
