I'm following the (very simple) instructions here (https://wiki.archlinux.org/title/Docker … VIDIA_GPUs) for the recommended route (nvidia-container-toolkit) and getting
Failed to initialize NVML: Unknown Error
when I try to run any container with "--gpus". In particular, I tried the command here (https://docs.nvidia.com/datacenter/clou … tml#docker):
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
(both as unprivileged user and root, same result).
Also, I attempted to add the kernel param "systemd.unified_cgroup_hierarchy=false", but I don't know whether that succeeded, and
sudo sysctl -a | grep unified_cgroup_hierarchy
returns nothing, so it may not have worked -- I added it to the entries in /boot/loader/entries/arch.conf.
The output of "nvidia-smi" on the host is
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31 Driver Version: 465.31 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
My docker version: 20.10.6
My kernel version: 5.12.7
My GPU: NVIDIA® GeForce RTX™ 2060 6GB GDDR6 with Max-Q
Last edited by Californian (2021-06-05 00:33:57)
I bumped into the same issue after a recent update of the nvidia-related packages. Fortunately, I managed to fix it.
-----
Method 1, recommended
1) Kernel parameter
The easiest way to verify that the systemd.unified_cgroup_hierarchy=false parameter is actually in effect is to check /proc/cmdline:
cat /proc/cmdline
This of course applies when the parameter is set via the boot loader. You can also hijack this file to set the parameter at runtime (https://wiki.archlinux.org/title/Kernel … ng_cmdline).
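For reference, with systemd-boot the parameter just gets appended to the options line of your boot entry and only takes effect after a reboot; a sketch of such an entry (title, image paths and UUID below are placeholders, not from anyone's real setup):
# /boot/loader/entries/arch.conf -- example only, keep your own root= value and image paths
title   Arch Linux
linux   /vmlinuz-linux
initrd  /initramfs-linux.img
options root=UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx rw systemd.unified_cgroup_hierarchy=false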
2) nvidia-container configuration
In the file
/etc/nvidia-container-runtime/config.toml
set the parameter
no-cgroups = false
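If you prefer to flip it from the shell, something like this should do it (just a sketch -- it assumes the no-cgroups line already exists in the file, so check the result with the grep):
sudo sed -i 's/^no-cgroups = .*/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
grep no-cgroups /etc/nvidia-container-runtime/config.toml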
After that, restart docker and run a test container:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
---------
Method 2
Alternatively, you can try to bypass cgroups v2 by setting (in the file mentioned above)
no-cgroups = true
Then you must manually pass all gpu devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker … -851039827
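For example, a non-privileged run would look roughly like this (just a sketch: the device list below is the usual set, so check which /dev/nvidia* nodes actually exist on your machine and see the linked answer for the full details):
sudo docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidiactl \
  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
  --device /dev/nvidia-modeset \
  nvidia/cuda:11.0-base nvidia-smi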
For debugging purposes, just run:
sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
Good luck
Last edited by szalinski (2021-06-04 23:41:06)
Amazing, thank you! The /etc/nvidia-container-runtime/config.toml parameter (no-cgroups = false) was what I was missing; I'll add that to the wiki.
FWIW, no-cgroups=true is forced by default in the AUR package: https://aur.archlinux.org/cgit/aur.git/ … 08ba39c5df
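So after that package gets updated it's worth re-checking the value, e.g.:
grep -n no-cgroups /etc/nvidia-container-runtime/config.toml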
I have exactly the same problem, and although I thoroughly followed szalinski's instructions, I still have it.
My /proc/cmdline contains:
BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=9af686ac-51d7-4011-9bfd-74556a615a87 rw systemd.unified_cgroup_hierarchy=false loglevel=3 quiet
and my /etc/nvidia-container-runtime/config.toml contains:
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
When I run docker run --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi, I get the following error:
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled
Here are my GPU and system characteristics:
* nvidia-smi's output: NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4
* GPU: GeFORCE RTX 2080 Ti
* Docker version: Docker version 20.10.9, build c2ea9bc90b
Thank you in advance for your help.
EDIT: I eventually solved my problem. It had nothing to do with the solutions proposed above; I had to run a privileged container with `docker run`'s `--privileged` option to get access to the GPU:
docker run --gpus all --privileged nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
Now everything works perfectly.
Last edited by Wmog (2021-10-15 13:32:41)
I'm just adding some info to szalinski's answer, so you guys can understand more about this problem; I've been working on building a TensorFlow container for a few days.
-----------------------------------------------------------------------
My docker version is 20.10.8. At this point the only extra package I have is
nvidia-container-toolkit
installed with
yay -S nvidia-container-toolkit
So I use method 2 from THIS POST, which bypasses the cgroups option.
When using nvidia-container-runtime or nvidia-container-toolkit with the cgroup option, it automatically allocates machine resources to the container.
So when you bypass this option, you gotta allocate the resources on your own. Here's an example.
A single docker run
docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
Or you can run docker in privileged mode, which allows the container to access all devices on your machine.
A single docker run
docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
privileged: true
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
entrypoint: ["nvidia-smi"]
-----------------------------------------------------------------------
You can also use
nvidia-container-runtime
installed by
yay -S nvidia-container-runtime
which is another option alongside nvidia-container-toolkit.
Edit your /etc/docker/daemon.json (the data-root line is just my personal setting and can be dropped; JSON doesn't allow comments, so don't add any):
{
    "data-root": "/virtual/data/docker",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
sudo systemctl restart docker
There's no need to edit the docker unit. When you run your container with the additional runtime option, it'll look up the runtime in daemon.json.
For a single docker run, with or without privileged mode, just replace
--gpus all
with
--runtime nvidia
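For instance, the privileged one-liner from above would become something like this (a sketch, same idea as before, just with the explicit runtime):
docker run --rm --privileged --runtime nvidia nvidia/cuda:11.0-base nvidia-smi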
For a compose file
Without privileged mode
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
devices:
- /dev/nvidia0:/dev/nvidia0
- /dev/nvidiactl:/dev/nvidiactl
- /dev/nvidia-caps:/dev/nvidia-caps
- /dev/nvidia-modeset:/dev/nvidia-modeset
- /dev/nvidia-uvm:/dev/nvidia-uvm
- /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
entrypoint: ["nvidia-smi"]
With privileged mode
version: "3.9"
services:
nvidia-startup:
container_name: nv-udm
image: nvidia/cuda:11.0-base
runtime: nvidia
privileged: true
entrypoint: ["nvidia-smi"]
Last edited by howard-o-neil (2021-11-13 16:49:08)
Thank you very much, howard-o-neil's answer works for me.