I'm following the (very simple) instructions here (https://wiki.archlinux.org/title/Docker … VIDIA_GPUs) for the recommended route (nvidia-container-toolkit) and getting
Failed to initialize NVML: Unknown Error
when I try to run any container with "--gpus". In particular, I tried the command here (https://docs.nvidia.com/datacenter/clou … tml#docker):
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
(both as unprivileged user and root, same result).
Also, I attempted to add the kernel param "systemd.unified_cgroup_hierarchy=false", but I don't know whether that succeeded, and
sudo sysctl -a | grep unified_cgroup_hierarchy
returns nothing so it may not have worked -- I added it to the entries in
/boot/loader/entries/arch.conf.
The output of "nvidia-smi" on the host is
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
My docker version: 20.10.6
My kernel version: 5.12.7
My GPU: NVIDIA® GeForce RTX™ 2060 6GB GDDR6 with Max-Q
Last edited by Californian (2021-06-05 00:33:57)
I ran into the same issue after a recent update of the nvidia-related packages. Fortunately, I managed to fix it.
-----
Method 1, recommended 
1) Kernel parameter
The easiest way to check that the systemd.unified_cgroup_hierarchy=false parameter is present is to look at /proc/cmdline:
cat /proc/cmdline
This of course applies when the parameter is set through the boot loader. You can also hijack this file to set the parameter at runtime (https://wiki.archlinux.org/title/Kernel … ng_cmdline).
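For anyone who wants to script this check, here is a minimal sketch in POSIX shell. The function names has_legacy_cgroup_param and cgroup_fs_type are made up for illustration, and the first one only recognises the "false" and "0" spellings of the value. The second one relies on stat reporting cgroup2fs for /sys/fs/cgroup when the unified (v2) hierarchy is mounted.

```shell
#!/bin/sh
# Returns 0 if the given kernel command line disables the unified cgroup
# hierarchy, 1 otherwise.
# Argument: a command-line string; defaults to the running kernel's /proc/cmdline.
has_legacy_cgroup_param() {
    cmdline=${1-$(cat /proc/cmdline)}
    # Unquoted on purpose: split the command line into words.
    for word in $cmdline; do
        case $word in
            systemd.unified_cgroup_hierarchy=false|systemd.unified_cgroup_hierarchy=0)
                return 0 ;;
        esac
    done
    return 1
}

# Prints the filesystem type mounted at /sys/fs/cgroup.
# "cgroup2fs" means the unified (v2) hierarchy is in use.
cgroup_fs_type() {
    stat -fc %T /sys/fs/cgroup
}
```

Usage would be something like "has_legacy_cgroup_param && echo present" followed by "cgroup_fs_type". If the parameter is present but the second command still prints cgroup2fs, the parameter probably did not take effect.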
2) nvidia-container configuration
In the file
/etc/nvidia-container-runtime/config.toml
set the parameter
no-cgroups = false
After that, restart docker and run a test container:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
-----
Method 2
Alternatively, you can try to bypass cgroups v2 by setting (in the file mentioned above)
no-cgroups = true
Then you must manually pass all GPU devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker … -851039827
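As a rough sketch of what passing the devices manually could look like, the helper below prints one --device flag for every NVIDIA node it finds. The name nvidia_device_flags is hypothetical, and it simply assumes that everything matching /dev/nvidia* should be passed through; compare the result against the list in the linked answer.

```shell
#!/bin/sh
# Prints one "--device PATH" line for each entry matching nvidia* in the
# given directory (default: /dev). Prints nothing if there are none.
nvidia_device_flags() {
    dev_dir=${1:-/dev}
    for node in "$dev_dir"/nvidia*; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$node" ] || continue
        printf '%s %s\n' --device "$node"
    done
}
```

It can then be spliced into the command line, unquoted so that the shell splits the flags into separate words: sudo docker run --rm --gpus all $(nvidia_device_flags) nvidia/cuda:11.0-base nvidia-smi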
For debugging purposes, just run:
sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
Good luck
Last edited by szalinski (2021-06-04 23:41:06)
Amazing, thank you! The /etc/nvidia-container-runtime/config.toml parameter (no-cgroups = false) was what I was missing, I'll add that to the wiki.
FWIW, no-cgroups=true is forced by default in the AUR package: https://aur.archlinux.org/cgit/aur.git/ … 08ba39c5df
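Since the packaged default may differ from what you expect, it can help to print the value that is actually active. A minimal sketch, assuming the config lives at the path used in this thread; the helper name no_cgroups_value is hypothetical, and commented-out lines are ignored.

```shell
#!/bin/sh
# Prints the value of every active (uncommented) no-cgroups line in an
# nvidia-container-runtime config file.
# Argument: path to the file; defaults to the location used in this thread.
no_cgroups_value() {
    config=${1:-/etc/nvidia-container-runtime/config.toml}
    sed -n 's/^[[:space:]]*no-cgroups[[:space:]]*=[[:space:]]*\([a-z]*\).*/\1/p' "$config"
}
```

Running no_cgroups_value with no argument should print either true or false. Empty output would mean the option is commented out or missing.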
I have exactly the same problem, and although I carefully followed szalinski's instructions, I still have it.
My /proc/cmdline contains:
BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=9af686ac-51d7-4011-9bfd-74556a615a87 rw systemd.unified_cgroup_hierarchy=false loglevel=3 quiet
and my /etc/nvidia-container-runtime/config.toml contains:
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
When I run docker run --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi, I have the following error:
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled
Here are my GPU and system's characteristics:
* nvidia-smi's output: NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4
* GPU: GeFORCE RTX 2080 Ti
* Docker version: Docker version 20.10.9, build c2ea9bc90b
Thank you in advance for your help.
EDIT: I eventually solved my problem. It had nothing to do with the solutions proposed above; I had to run a privileged container, using `docker run`'s `--privileged` option, to get access to the GPU:
docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
Now everything works perfectly.
Last edited by Wmog (2021-10-15 13:32:41)
I'm just adding some info to szalinski's answer, so that you can understand more about this problem, since I have worked on building a tensorflow container for a few days.
-----------------------------------------------------------------------
My docker version is 20.10.8. On top of that, the only additional package I have is
nvidia-container-toolkit
installed with
yay -S nvidia-container-toolkit
So I use method 2 from the post above, which bypasses the cgroups option.
When nvidia-container-runtime or nvidia-container-toolkit is used with the cgroups option, it automatically allocates the machine's resources to the container.
When you bypass this option, you have to allocate the resources yourself. Here's an example.
A single docker run:
docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi
A compose file:
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]
Or you can run docker in privileged mode, which allows docker to access all devices on your machine.
A single docker run:
docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi
A compose file:
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]
-----------------------------------------------------------------------
You can also use
nvidia-container-runtime
installed with
yay -S nvidia-container-runtime
which is another option alongside nvidia-container-toolkit.
Edit your /etc/docker/daemon.json:
{
    "data-root": "/virtual/data/docker",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
The "data-root" entry is just my personal setting; leave it out or adjust it. Don't keep a // comment in the file, since daemon.json is plain JSON.
sudo systemctl restart docker
There's no need to edit the docker unit. When you run your container with the additional runtime option, docker looks up the runtime in daemon.json.
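Docker will refuse to start if daemon.json is not valid JSON (a // comment is enough to break it), so it may be worth checking the file before restarting. A minimal sketch, assuming python3 is installed; the helper name is_valid_json is hypothetical, and json.tool is part of the Python standard library.

```shell
#!/bin/sh
# Returns 0 if the given file parses as strict JSON, non-zero otherwise
# (including when python3 is not installed).
is_valid_json() {
    python3 -m json.tool "$1" >/dev/null 2>&1
}
```

For example: is_valid_json /etc/docker/daemon.json && sudo systemctl restart docker. Afterwards, "sudo docker info | grep -i runtimes" should list nvidia among the runtimes.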
For a single docker run, with or without privileged mode, just replace
--gpus all
with
--runtime nvidia
For a compose file:
Without privileged mode
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    entrypoint: ["nvidia-smi"]
With privileged mode
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    privileged: true
    entrypoint: ["nvidia-smi"]
Last edited by howard-o-neil (2021-11-13 16:49:08)
Thank you very much, howard-o-neil's answer works for me.