Hello everyone,
I’m running Arch Linux on a host with an NVIDIA RTX 3070 and the proprietary drivers. Until yesterday everything was rock-solid, both on the host and inside a Docker container based on tensorflow/tensorflow:2.19.0-gpu. However, after a routine pacman -Syu that upgraded:
linux-lts from 6.12.38-1 → 6.12.40-1
nvidia-lts from 575.64.03-3 → 575.64.05-2
nvidia-utils from 575.64.03-1 → 575.64.05-1
my GPU‑accelerated Python scripts inside Docker immediately began failing with an unrecoverable error. The container logs show:
… device_event_mgr.cc:226] Unexpected Event status: 1
… CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
… cuDNN launch failure : input shape ([1,3,3,192])
… FusedBatchNormV3 … cuDNN_STATUS_EXECUTION_FAILED
… Killing Process…
Simultaneously, journalctl -k on the host reports:
NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: MISSING_INLINE_DATA
… general protection fault ip:… in libc.so.6
What I’ve tried so far:
- Rollback tests: rolling both packages back makes everything work again; I haven’t yet tried downgrading the NVIDIA packages and the kernel independently.
- Debugging TensorFlow: enabled memory growth, tried smaller batch sizes and alternate TF models; no difference (a minimal repro sketch follows this list).
- Container toolkit: Reinstalled nvidia-container-toolkit, restarted Docker, confirmed nvidia-smi works inside the container.
- Driver/kernel flags: Tried pcie_aspm=off and blacklisting nova_core; no effect.
- Hardware checks: GPU temps remain low, no errors under native CUDA samples.
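For anyone who wants to poke at this, a stripped-down sketch along these lines should exercise the same FusedBatchNormV3/cuDNN path on the same [1,3,3,192] shape as in the log (this is not my actual training script, just a minimal approximation):

import tensorflow as tf

# Keep TF from grabbing all VRAM up front (one of the things I already tried).
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# Same tensor shape as in the failing log: [1, 3, 3, 192], NHWC layout.
x = tf.random.normal([1, 3, 3, 192])
scale = tf.ones([192])
offset = tf.zeros([192])

# fused_batch_norm should dispatch to the fused batch-norm (FusedBatchNormV3) kernel on the GPU.
with tf.device("/GPU:0"):
    y, batch_mean, batch_var = tf.compat.v1.nn.fused_batch_norm(
        x, scale, offset, epsilon=1e-3, data_format="NHWC", is_training=True)

print("batch norm ran:", y.shape)

If that single op already dies with CUDA_ERROR_LAUNCH_FAILED under the new kernel, it at least takes the rest of the model out of the picture.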
I think that something in NVIDIA 575.64.05 and/or linux-lts 6.12.40 is corrupting the GPU command stream (XID 13 “MISSING_INLINE_DATA”) and causing cuDNN kernels to fail to launch. Since the exact same container works fine under the previous driver + kernel, I suspect a regression either in the driver’s buffer management or in an LTS-kernel DRM/DMA change.
- Has anyone else seen XID 13 “MISSING_INLINE_DATA” in Arch after updating to Nvidia 575.64.05 or linux-lts 6.12.40?
- Are there known fixes or updated packages pending for this regression?
- Any pointers on Arch-specific workarounds (e.g. alternative PKGBUILD flags, AUR workaround builds, specific module blacklists)?
I’m happy to provide any additional logs or run further tests. Thanks in advance for any insights!
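If it helps, here is the kind of quick environment summary I can attach from inside the container (it only uses TensorFlow’s own build info, nothing extra needs to be installed):

import tensorflow as tf

# CUDA/cuDNN versions this TF build was compiled against, plus the GPUs it can see.
info = tf.sysconfig.get_build_info()
print("TF version:", tf.__version__)
print("CUDA (build):", info.get("cuda_version"))
print("cuDNN (build):", info.get("cudnn_version"))
print("GPUs visible to TF:", tf.config.list_physical_devices("GPU"))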
Last edited by merasil (2025-07-25 11:53:09)
Offline
NVRM is from the nvidia kernel module. 6.12.39-1 is available with both 575.64.03-5 and 575.64.05-1, which should allow you to check whether it's the driver update.
Offline
So it seems the culprit is the kernel... 575.64.05-1 runs perfectly fine with 6.12.39-1. If I use the newest non-LTS kernel, I get the same problem. But can going from 6.12.39 to .40 really change something so big that my GPU starts to misbehave?
Offline
https://gitlab.archlinux.org/archlinux/ … a9e94d2c1b
Not a build option/config change.
And it's doubly indirect (Docker + the out-of-tree nvidia module) => you'll probably have to bisect the offending commit.
https://wiki.gentoo.org/wiki/Kernel_git-bisect
https://git.kernel.org/pub/scm/linux/ke … 12.39&dt=2
Also, what specifically is
… general protection fault ip:… in libc.so.6
(i.e. all logs and errors around the failure)
Offline