
#1 2021-06-04 17:31:47

Californian
Member
Registered: 2014-03-05
Posts: 21

[SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

I'm following the (very simple) instructions here (https://wiki.archlinux.org/title/Docker … VIDIA_GPUs) for the recommended route (nvidia-container-toolkit) and getting

Failed to initialize NVML: Unknown Error

when I try to run any container with "--gpus". In particular, I tried the command here (https://docs.nvidia.com/datacenter/clou … tml#docker):

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

(both as unprivileged user and root, same result).

Also, I attempted to add the kernel parameter "systemd.unified_cgroup_hierarchy=false", but I don't know whether that succeeded, and

sudo sysctl -a | grep unified_cgroup_hierarchy

returns nothing, so it may not have worked. I added it to the entries in

/boot/loader/entries/arch.conf

The output of "nvidia-smi" on the host is

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+

My docker version: 20.10.6
My kernel version: 5.12.7
My GPU: NVIDIA® GeForce RTX™ 2060 6GB GDDR6 with Max-Q

Last edited by Californian (2021-06-05 00:33:57)


#2 2021-06-04 23:38:56

szalinski
Member
Registered: 2021-06-04
Posts: 1

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

I've bumped into the same issue after a recent update of the nvidia-related packages. Fortunately, I managed to fix it.

-----
Method 1, recommended

1) Kernel parameter
The easiest way to check whether the systemd.unified_cgroup_hierarchy=false parameter is actually in effect is to look at /proc/cmdline:

cat /proc/cmdline

This of course applies when the parameter is set through the boot loader. You can also hijack this file to set the parameter at runtime (https://wiki.archlinux.org/title/Kernel … ng_cmdline).
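To pull the parameter out of a cmdline-style string, a quick sketch (the example string below is made up for illustration; your real one comes from /proc/cmdline):

```shell
# Example of the kind of line /proc/cmdline contains (this one is made up):
cmdline="BOOT_IMAGE=/vmlinuz-linux root=/dev/sda1 rw systemd.unified_cgroup_hierarchy=false quiet"

# Extract the parameter if it is present; no output means it is not set
echo "$cmdline" | grep -o 'systemd\.unified_cgroup_hierarchy=[a-z]*'
```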

2) nvidia-container configuration
In the file

/etc/nvidia-container-runtime/config.toml

set the parameter

no-cgroups = false

After that, restart docker and run a test container:

sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

---------
Method 2
Alternatively, you can try to bypass cgroups v2 by setting (in the file mentioned above)

no-cgroups = true

Then you must manually pass all GPU devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker … -851039827
For debugging purposes, just run:

sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
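Whether you need Method 1 or Method 2 depends on which cgroup hierarchy your system is actually running. A quick way to check (a sketch, assuming GNU coreutils' stat):

```shell
# cgroup2fs -> pure cgroups v2 (the layout that triggers this NVML error)
# tmpfs     -> v1 / hybrid hierarchy
stat -fc %T /sys/fs/cgroup/
```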

Good luck

Last edited by szalinski (2021-06-04 23:41:06)


#3 2021-06-05 00:32:31

Californian
Member
Registered: 2014-03-05
Posts: 21

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

Amazing, thank you! The /etc/nvidia-container-runtime/config.toml parameter (no-cgroups = false) was what I was missing; I'll add that to the wiki.


#4 2021-06-06 10:24:00

lahwaacz
Wiki Admin
From: Czech Republic
Registered: 2012-05-29
Posts: 711

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

FWIW, no-cgroups=true is forced by default in the AUR package: https://aur.archlinux.org/cgit/aur.git/ … 08ba39c5df


#5 2021-10-15 12:57:59

Wmog
Member
Registered: 2017-09-13
Posts: 19

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

I have exactly the same problem, and although I thoroughly followed szalinski's instructions, I still have it.

My /proc/cmdline contains:

BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=9af686ac-51d7-4011-9bfd-74556a615a87 rw systemd.unified_cgroup_hierarchy=false loglevel=3 quiet

and my /etc/nvidia-container-runtime/config.toml contains:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

When I run docker run --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi, I get the following error:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled

Here are my GPU and system's characteristics:

* nvidia-smi's output: NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4
* GPU: GeFORCE RTX 2080 Ti
* Docker version: Docker version 20.10.9, build c2ea9bc90b

Thank you in advance for your help.

EDIT: I eventually solved my problem. It had nothing to do with the solutions proposed above; I had to run a privileged container with `docker run`'s option `--privileged` to get access to the GPU:

docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi

Now everything works perfectly.

Last edited by Wmog (2021-10-15 13:32:41)


#6 2021-11-13 16:39:53

howard-o-neil
Member
Registered: 2021-11-13
Posts: 1

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

I'm just adding some info to szalinski's answer, to help you guys understand this problem better, since I have worked on building a tensorflow container for a few days.

-----------------------------------------------------------------------

My docker version is 20.10.8, and at this point I only have the additional package

nvidia-container-toolkit

installed by

yay -S nvidia-container-toolkit

So I use method 2 from this post, which bypasses the cgroups option.
When using nvidia-container-runtime or nvidia-container-toolkit with the cgroup option, it automatically allocates machine resources for the container.

So when you bypass this option, you have to allocate the resources on your own. Here's an example.
A single docker run

docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi

A compose file

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]

Or you can run docker in privileged mode, which allows the container to access all devices on your machine.
A single docker run

docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi

A compose file

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]

-----------------------------------------------------------------------

You can also use

nvidia-container-runtime

installed by

yay -S nvidia-container-runtime

which is another option alongside nvidia-container-toolkit.
Edit your /etc/docker/daemon.json:

{
    "data-root": "/virtual/data/docker",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

sudo systemctl restart docker

(The "data-root" entry is just my personal setting. Note that daemon.json is strict JSON, so comments are not allowed in it.)

There's no need to edit the docker unit. When you run your container with the additional runtime option, docker will look up the runtime in daemon.json.
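Since daemon.json must be strict JSON (no comments, no trailing commas), it can be worth validating it before restarting docker. A sketch using python3 (the /tmp path is just for illustration; point it at /etc/docker/daemon.json on a real system):

```shell
# Write an example config like the one above (strict JSON: no comments allowed)
cat <<'EOF' > /tmp/daemon-example.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# python3 exits non-zero if the file is not valid JSON
python3 -c 'import json; json.load(open("/tmp/daemon-example.json")); print("valid JSON")'
```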

For a single docker run, with or without privileged mode, just replace

--gpus all

with

--runtime nvidia

For a compose file

Without privileged mode

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    entrypoint: ["nvidia-smi"]

With privileged mode

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    privileged: true
    entrypoint: ["nvidia-smi"]

Last edited by howard-o-neil (2021-11-13 16:49:08)


#7 2021-12-20 02:46:25

Sukabulie
Member
Registered: 2021-12-20
Posts: 1

Re: [SOLVED] Docker with GPU: "Failed to initialize NVML: Unknown Error"

Thank you very much, the answer from howard-o-neil works for me.

