
#1 2021-06-04 17:31:47

Californian
Member
Registered: 2014-03-05
Posts: 14

[SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

I'm following the (very simple) instructions here (https://wiki.archlinux.org/title/Docker … VIDIA_GPUs) for the recommended route (nvidia-container-toolkit) and getting

Failed to initialize NVML: Unknown Error

when I try to run any container with "--gpus". In particular, I tried the command here (https://docs.nvidia.com/datacenter/clou … tml#docker):

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

(both as unprivileged user and root, same result).

Also, I attempted to add the kernel param "systemd.unified_cgroup_hierarchy=false", but I don't know whether that succeeded, and

sudo sysctl -a | grep unified_cgroup_hierarchy

returns nothing, so it may not have worked -- I added it to the entries in

/boot/loader/entries/arch.conf
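
For reference, the boot entry now looks roughly like this (the title, image names, and UUID are of course specific to my setup, and the param is simply appended to the options line):

title   Arch Linux
linux   /vmlinuz-linux
initrd  /initramfs-linux.img
options root=UUID=... rw systemd.unified_cgroup_hierarchy=false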

The output of "nvidia-smi" on the host is

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+

My docker version: 20.10.6
My kernel version: 5.12.7
My GPU: NVIDIA® GeForce RTX™ 2060 6GB GDDR6 with Max-Q

Last edited by Californian (2021-06-05 00:33:57)


#2 2021-06-04 23:38:56

szalinski
Member
Registered: 2021-06-04
Posts: 1

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

I've bumped into the same issue after a recent update of the nvidia-related packages. Fortunately, I managed to fix it.

-----
Method 1, recommended

1) Kernel parameter
The easiest way to verify that the systemd.unified_cgroup_hierarchy=false param is actually present is to check /proc/cmdline:

cat /proc/cmdline

This of course applies when the parameter is set through the boot loader. You can also hijack this file to set the parameter at runtime (https://wiki.archlinux.org/title/Kernel … ng_cmdline).
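
If the parameter took effect, it will show up at the end of the output, something like this (the other entries will look different on your system):

BOOT_IMAGE=/vmlinuz-linux root=UUID=... rw systemd.unified_cgroup_hierarchy=false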

2) nvidia-container configuration
In the file

/etc/nvidia-container-runtime/config.toml

set the parameter

no-cgroups = false
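
For reference, that option lives in the [nvidia-container-cli] section, so after the edit that part of the file should look roughly like this (leave the other keys as they are):

[nvidia-container-cli]
no-cgroups = false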

After that, restart docker and run a test container:

sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

---------
Method 2
Alternatively, you can try to bypass cgroups v2 by setting (in the file mentioned above)

no-cgroups = true

Then you must manually pass all GPU devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker … -851039827 (a sketch with explicit device flags follows the debug command below).
For debugging purposes, just run:

sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
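
Once that works, a lighter variant is to drop --privileged and the /dev bind mount and pass the individual device nodes instead; the exact list comes from the linked answer and may differ per system, but on a single-GPU machine it is roughly:

sudo docker run --rm --gpus all \
    --device /dev/nvidia0 \
    --device /dev/nvidiactl \
    --device /dev/nvidia-modeset \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools \
    nvidia/cuda:11.0-base nvidia-smi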

Good luck

Last edited by szalinski (2021-06-04 23:41:06)


#3 2021-06-05 00:32:31

Californian
Member
Registered: 2014-03-05
Posts: 14

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

Amazing, thank you! The /etc/nvidia-container-runtime/config.toml parameter (no-cgroups = false) was what I was missing; I'll add that to the wiki.


#4 2021-06-06 10:24:00

lahwaacz
Wiki Admin
From: Czech Republic
Registered: 2012-05-29
Posts: 698

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

FWIW, no-cgroups=true is forced by default in the AUR package: https://aur.archlinux.org/cgit/aur.git/ … 08ba39c5df

