[Solved] Nvidia eGPU: [nvidia-drm] Failed to allocate NvKmsKapiMemory

alllexx88 · 2024-10-08 07:25:04

Hi,
I have an issue with a Razer Core X Chroma eGPU holding an Nvidia RTX 4070 Ti: nvidia-drm fails to configure a DRM device for it, with errors like

Oct 08 09:08:46 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 4096

As a result there's no output to the external monitor connected to the eGPU.

Here's a full boot log: http://0x0.st/XEIR.txt

There's also an org_kde_powerdevil crash in the logs, which doesn't happen if I boot without the eGPU connected:

Oct 08 09:09:08 AlexArch systemd-coredump[1615]: Process 1484 (org_kde_powerde) of user 1000 dumped core.
                                                 
                                                 Stack trace of thread 1484:
                                                 #0  0x0000703f230a53f4 n/a (libc.so.6 + 0x963f4)
                                                 #1  0x0000703f2304c120 raise (libc.so.6 + 0x3d120)
                                                 #2  0x0000703f249b72a1 _ZN6KCrash19defaultCrashHandlerEi (libKF6Crash.so.6 + 0x62a1)
                                                 #3  0x0000703f2304c1d0 n/a (libc.so.6 + 0x3d1d0)
                                                 #4  0x0000703f2317e39d n/a (libc.so.6 + 0x16f39d)
                                                 #5  0x0000703f22e707e3 n/a (libddcutil.so.5 + 0x6c7e3)
                                                 #6  0x0000703f22e709b4 n/a (libddcutil.so.5 + 0x6c9b4)
                                                 #7  0x0000703f22e30703 n/a (libddcutil.so.5 + 0x2c703)
                                                 #8  0x0000703f22e324f1 n/a (libddcutil.so.5 + 0x2e4f1)
                                                 #9  0x0000703f22e70458 n/a (libddcutil.so.5 + 0x6c458)
                                                 #10 0x0000703f22e32db4 n/a (libddcutil.so.5 + 0x2edb4)
                                                 #11 0x0000703f22e0d9e1 n/a (libddcutil.so.5 + 0x99e1)
                                                 #12 0x0000703f22e5b13d n/a (libddcutil.so.5 + 0x5713d)
                                                 #13 0x0000703f2497a455 n/a (libpowerdevilcore.so.2 + 0x53455)
                                                 #14 0x0000703f249832d3 n/a (libpowerdevilcore.so.2 + 0x5c2d3)
                                                 #15 0x0000703f2497e484 _ZN26ScreenBrightnessController14detectDisplaysEv (libpowerdevilcore.so.2 + 0x57484)
                                                 #16 0x00005d9c637228bf n/a (org_kde_powerdevil + 0x78bf)
                                                 #17 0x0000703f23034e08 n/a (libc.so.6 + 0x25e08)
                                                 #18 0x0000703f23034ecc __libc_start_main (libc.so.6 + 0x25ecc)
                                                 #19 0x00005d9c63722995 n/a (org_kde_powerdevil + 0x7995)
                                                 
                                                 Stack trace of thread 1506:
                                                 #0  0x0000703f2309fa19 n/a (libc.so.6 + 0x90a19)
                                                 #1  0x0000703f230a2479 pthread_cond_wait (libc.so.6 + 0x93479)
                                                 #2  0x0000703f238ca030 _ZN14QWaitCondition4waitEP6QMutex14QDeadlineTimer (libQt6Core.so.6 + 0x2ca030)
                                                 #3  0x0000703f2104847f n/a (libQt6WaylandClient.so.6 + 0x6047f)
                                                 #4  0x0000703f238c53f7 n/a (libQt6Core.so.6 + 0x2c53f7)
                                                 #5  0x0000703f230a339d n/a (libc.so.6 + 0x9439d)
                                                 #6  0x0000703f2312849c n/a (libc.so.6 + 0x11949c)
                                                 
                                                 Stack trace of thread 1536:
                                                 #0  0x0000703f231261fd syscall (libc.so.6 + 0x1171fd)
                                                 #1  0x0000703f21d3ef20 g_cond_wait (libglib-2.0.so.0 + 0x8ef20)
                                                 #2  0x0000703f21cd598c n/a (libglib-2.0.so.0 + 0x2598c)
                                                 #3  0x0000703f21d45137 n/a (libglib-2.0.so.0 + 0x95137)
                                                 #4  0x0000703f21d41026 n/a (libglib-2.0.so.0 + 0x91026)
                                                 #5  0x0000703f230a339d n/a (libc.so.6 + 0x9439d)
                                                 #6  0x0000703f2312849c n/a (libc.so.6 + 0x11949c)
                                                 
                                                 Stack trace of thread 1538:
                                                 #0  0x0000703f2311abb0 ppoll (libc.so.6 + 0x10bbb0)
                                                 #1  0x0000703f21d70227 n/a (libglib-2.0.so.0 + 0xc0227)
                                                 #2  0x0000703f21d0e287 g_main_loop_run (libglib-2.0.so.0 + 0x5e287)
                                                 #3  0x0000703f1bf1cb44 n/a (libgio-2.0.so.0 + 0x113b44)
                                                 #4  0x0000703f21d41026 n/a (libglib-2.0.so.0 + 0x91026)
                                                 #5  0x0000703f230a339d n/a (libc.so.6 + 0x9439d)
                                                 #6  0x0000703f2312849c n/a (libc.so.6 + 0x11949c)
                                                 
                                                 Stack trace of thread 1537:
                                                 #0  0x0000703f2311abb0 ppoll (libc.so.6 + 0x10bbb0)
                                                 #1  0x0000703f21d70227 n/a (libglib-2.0.so.0 + 0xc0227)
                                                 #2  0x0000703f21d0ca55 g_main_context_iteration (libglib-2.0.so.0 + 0x5ca55)
                                                 #3  0x0000703f21d0cab2 n/a (libglib-2.0.so.0 + 0x5cab2)
                                                 #4  0x0000703f21d41026 n/a (libglib-2.0.so.0 + 0x91026)
                                                 #5  0x0000703f230a339d n/a (libc.so.6 + 0x9439d)
                                                 #6  0x0000703f2312849c n/a (libc.so.6 + 0x11949c)
                                                 
                                                 Stack trace of thread 1507:
                                                 #0  0x0000703f2311a63d __poll (libc.so.6 + 0x10b63d)
                                                 #1  0x0000703f210484e7 n/a (libQt6WaylandClient.so.6 + 0x604e7)
                                                 #2  0x0000703f238c53f7 n/a (libQt6Core.so.6 + 0x2c53f7)
                                                 #3  0x0000703f230a339d n/a (libc.so.6 + 0x9439d)
                                                 #4  0x0000703f2312849c n/a (libc.so.6 + 0x11949c)
                                                 
                                                 Stack trace of thread 1505:
                                                 #0  0x0000703f2311abb0 ppoll (libc.so.6 + 0x10bbb0)
                                                 #1  0x0000703f21d70227 n/a (libglib-2.0.so.0 + 0xc0227)
                                                 #2  0x0000703f21d0ca55 g_main_context_iteration (libglib-2.0.so.0 + 0x5ca55)
                                                 #3  0x0000703f239a985d _ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE (libQt6Core.so.6 + 0x3a985d)
                                                 #4  0x0000703f23750106 _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt6Core.so.6 + 0x150106)
                                                 #5  0x0000703f238432b0 _ZN7QThread4execEv (libQt6Core.so.6 + 0x2432b0)
                                                 #6  0x0000703f24611e4e n/a (libQt6DBus.so.6 + 0x2de4e)
                                                 #7  0x0000703f238c53f7 n/a (libQt6Core.so.6 + 0x2c53f7)
                                                 #8  0x0000703f230a339d n/a (libc.so.6 + 0x9439d)
                                                 #9  0x0000703f2312849c n/a (libc.so.6 + 0x11949c)
                                                 ELF object binary architecture: AMD x86-64

Thanks

Last edited by alllexx88 (2024-10-18 08:41:16)

seth · 2024-10-08 13:10:19

Oct 08 09:08:44 AlexArch kernel: Linux version 6.11.2-zen1-1-zen (linux-zen@archlinux) (gcc (GCC) 14.2.1 20240910, GNU ld (GNU Binutils) 2.43.0) #1 ZEN SMP PREEMPT_DYNAMIC Fri, 04 Oct 2024 21:51:07 +0000
…
Oct 08 09:08:44 AlexArch kernel: nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
Oct 08 09:08:44 AlexArch kernel: nvidia 0000:04:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
…
Oct 08 09:08:45 AlexArch kernel: NVRM: GPU at PCI:0000:04:00: GPU-a5b5851e-dfe0-833c-36f5-e63990058e1e
Oct 08 09:08:45 AlexArch kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 08 09:08:45 AlexArch kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
Oct 08 09:08:45 AlexArch kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                 NVRM: nvidia-bug-report.sh as root to collect this data before
                                 NVRM: the NVIDIA kernel module is unloaded.
…
Oct 08 09:08:46 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 4096
Oct 08 09:08:46 AlexArch kernel: nvidia-modeset: ERROR: GPU:1: The requested configuration of display devices (HKC 27E1QA (DP-4)) is not supported on this GPU.
…
Oct 08 09:08:47 AlexArch kernel: pci 0000:04:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x4 link at 0000:00:07.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
…
Oct 08 09:08:47 AlexArch kernel: nvidia 0000:04:00.0: enabling device (0000 -> 0003)
Oct 08 09:08:47 AlexArch kernel: nvidia 0000:04:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
…
Oct 08 09:09:01 AlexArch kernel: nvidia-modeset: ERROR: GPU:1: The requested configuration of display devices (HKC 27E1QA (DP-4)) is not supported on this GPU.
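
To pull just these lines out of a long journal, a filter along these lines should do. This is only a sketch: the function name is made up here, and the patterns are taken from the messages quoted above.

```shell
# Keep only the kernel log lines that matter for this failure: the Xid
# report, the "fallen off the bus" message and the nvidia-drm allocation
# error. Reads the log on stdin, prints matching lines on stdout.
nvidia_bus_errors() {
    grep -E 'NVRM: Xid|fallen off the bus|Failed to allocate NvKmsKapiMemory'
}
```

Typical use would be `journalctl -k -b | nvidia_bus_errors` for the kernel messages of the current boot.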

Try adding "rcutree.gp_init_delay=1" to the kernel parameters: https://wiki.archlinux.org/title/Kernel_parameters
Is the lower half the result of a bus rescan?
Is your nvidia-module juggling oneshot service still active?
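
Where the parameter goes depends on the boot loader (the wiki page covers the common ones). After rebooting, it's worth confirming that it actually arrived on the kernel command line. A minimal check, assuming a POSIX shell; the function name and the optional file argument are made up for this sketch:

```shell
# Succeeds if the given parameter appears as a whole word on the kernel
# command line. The second argument only exists so the check can be pointed
# at another file; it defaults to /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

For example: `has_kernel_param rcutree.gp_init_delay=1 && echo "parameter is active"`.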

alllexx88 · 2024-10-08 13:36:39

seth wrote:

Try adding "rcutree.gp_init_delay=1" to the kernel parameters: https://wiki.archlinux.org/title/Kernel_parameters

Can't seem to boot now, hitting

Oct 08 15:24:13 AlexArch kernel: BUG: unable to handle page fault for address: ffffb0da478d404c
Oct 08 15:24:13 AlexArch kernel: #PF: supervisor read access in kernel mode
Oct 08 15:24:13 AlexArch kernel: #PF: error_code(0x0000) - not-present page

http://0x0.st/XE03.txt

I've had this exact error before when the eGPU was plugged in at boot, but could work around it by booting Windows and powering the eGPU off and on again. Now it's persistent: the error is there every time, and I can only boot by switching the eGPU off.

seth wrote:

Is the lower half the result of a bus rescan?

No, at least I wasn't doing it manually.

seth wrote:

Is your nvidia-module juggling oneshot service still active?

It's disabled now. It was probably a half-measure workaround anyway, since it cripples the internal GPU's DRM for the Wayland session.

Thanks

UPD: so I reset the BIOS and was able to boot with the new kernel parameter. The issues look similar: http://0x0.st/XEGo.txt

Last edited by alllexx88 (2024-10-08 14:16:41)

seth · 2024-10-08 14:22:27

Oct 08 15:24:12 AlexArch kernel: NVRM: GPU at PCI:0000:04:00: GPU-a5b5851e-dfe0-833c-36f5-e63990058e1e
Oct 08 15:24:12 AlexArch kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 08 15:24:12 AlexArch kernel: xhci_hcd 0000:09:00.0: remove, state 1
Oct 08 15:24:12 AlexArch kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.

Still happens

Oct 08 15:24:13 AlexArch kernel: pcieport 0000:03:04.0: Runtime PM usage count underflow!
Oct 08 15:24:13 AlexArch kernel: BUG: unable to handle page fault for address: ffffb0da478d404c

03:04.0 is a usb hub where your mouse is attached - there's generally a lot of fuss on the USB stack, revolving around a webcam, a microphone and a soundblaster x4

I'd start by stripping the device of everything that's not strictly necessary, disable the webcam (microphone?) in the BIOS etc. and see how the system responds to that.

On the upside, "nv_drm_dumb_create" is gone - or the system just didn't make it there

For clarification: this condition isn't a 6.11 regression, is it?

alllexx88 · 2024-10-08 14:31:11

Thank you
The soundblaster x4 is connected directly to the laptop, but there're also a usb mic and a webcam connected to the TB3 dock from the egpu enclosure. I'll remove it all and try another boot.

I actually had more issues before the 6.11 upgrade, such as not being able to reboot or power off properly. I don't think 6.11 introduced any regressions for me.

UPD: I disconnected the usb mic, the webcam and the soundblaster. The first attempt didn't boot: http://0x0.st/XEGc.txt then, after switching the egpu off/on, it booted with similar issues: http://0x0.st/XEGT.txt
I'll go and try to disable the built-in webcam and mic, and maybe wifi (I'm on ethernet atm) and bluetooth, and see how it goes

UPD2: I've run a few tries. It looks like with most of the internal devices (webcam, sound, wifi and bluetooth) switched off, and the soundblaster disconnected, I don't run into freezes on boot. An example of such a boot: http://0x0.st/XEG9.txt
With just the soundblaster connected, but internals off, it boots like 50% of the time, but with different issues. Freezes: http://0x0.st/XEGf.txt and http://0x0.st/XEGO.txt. When it boots fine: http://0x0.st/XEG4.txt. Oh, now that I think of it I'll try to connect the soundblaster to a usb c port, maybe it's on a different bus.

UPD3: no, the same deal via usb c. I feel like giving up on using the egpu with Arch for video output and as a dock. I've been doing this on my old laptop, also only with X11 with the oneshot service, but there it was fine since the internal screen was just FHD and bigger (17"). I've ordered a usb c dock for usb devices and connected the external monitor directly to a laptop's HDMI port. Since the monitor has 3 inputs, I'll just switch egpu on only when I need CUDA on Arch (not on boot) or when I game on windows. Not the most elegant setup, but at least it should work

Last edited by alllexx88 (2024-10-08 15:24:37)

seth · 2024-10-09 15:16:51

Even the boots that supposedly worked ( http://0x0.st/XEG9.txt & http://0x0.st/XEG4.txt ) have e.g.

Oct 08 16:41:38 AlexArch kernel: NVRM: GPU at PCI:0000:04:00: GPU-a5b5851e-dfe0-833c-36f5-e63990058e1e
Oct 08 16:41:38 AlexArch kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 08 16:41:38 AlexArch kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
…
Oct 08 16:41:39 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 4096

and a bunch of xhci errors.

Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:09:00.0: Host halt failed, -19
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:09:00.0: Host not accessible, reset failed.
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:08:00.0: Host halt failed, -19
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:08:00.0: Host not accessible, reset failed.
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:07:00.0: Host halt failed, -19
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:07:00.0: Host not accessible, reset failed.
…
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:07:00.0: Host halt failed, -19
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:07:00.0: Host not accessible, reset failed.
Oct 08 16:34:36 AlexArch kernel: xhci_hcd 0000:07:00.0: USB bus 5 deregistered
Oct 08 16:34:36 AlexArch kernel: pcieport 0000:03:04.0: Runtime PM usage count underflow!

09 has

Oct 08 16:34:35 AlexArch kernel: input: Razer Razer Core X Chroma as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:04.0/0000:05:00.0/0000:06:02.0/0000:09:00.0/usb9/9-2/9-2:1.0/0003:1532:0F1A.0007/input/input13
Oct 08 16:34:35 AlexArch kernel: input: Razer Razer Core X Chroma Keyboard as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:04.0/0000:05:00.0/0000:06:02.0/0000:09:00.0/usb9/9-2/9-2:1.1/0003:1532:0F1A.0008/input/input14
Oct 08 16:34:35 AlexArch kernel: input: Razer Razer Core X Chroma as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:04.0/0000:05:00.0/0000:06:02.0/0000:09:00.0/usb9/9-2/9-2:1.1/0003:1532:0F1A.0008/input/input15
Oct 08 16:34:35 AlexArch kernel: input: Razer Razer Core X Chroma as /devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:04.0/0000:05:00.0/0000:06:02.0/0000:09:00.0/usb9/9-2/9-2:1.2/0003:1532:0F1A.0009/input/input16
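
The sysfs path in those lines also shows how the devices hang together: the xhci controller 09:00.0 sits behind pcieport 03:04.0, which in turn sits behind 00:07.0, the same root port the GPU's link is reported against. A small sketch to list that chain from any such path (the function name is made up for illustration):

```shell
# Print each PCI address (domain:bus:device.function) found in a sysfs
# device path, one per line, starting at the root port.
pci_chain() {
    printf '%s\n' "$1" | tr '/' '\n' |
        grep -E '^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$'
}
```

Fed the first path above, it prints 0000:00:07.0, 0000:02:00.0, 0000:03:04.0, 0000:05:00.0, 0000:06:02.0 and 0000:09:00.0, in that order.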

But the main problem remains

Oct 08 16:34:36 AlexArch kernel: snd_hda_intel 0000:04:00.1: Unable to change power state from D3cold to D0, device inaccessible
Oct 08 16:34:36 AlexArch kernel: snd_hda_intel 0000:04:00.1: Disabling MSI
Oct 08 16:34:36 AlexArch kernel: snd_hda_intel 0000:04:00.1: Handle vga_switcheroo audio client
Oct 08 16:34:36 AlexArch kernel: NVRM: GPU at PCI:0000:04:00: GPU-a5b5851e-dfe0-833c-36f5-e63990058e1e
Oct 08 16:34:36 AlexArch kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 08 16:34:36 AlexArch kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
…
Oct 08 16:34:37 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 4096

Though both also have the razer device attached.

Can you keep the GPU by blacklisting snd_hda_intel to avoid access to its audio output?

Either way, those seem red-herrings in the light of the "working" boots.
Can you reliably boot the multi-user.target (2nd link below)?
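
For a one-off test, both can be done from the kernel command line, assuming you can edit it in the boot loader: "module_blacklist=snd_hda_intel" keeps the module from loading and "systemd.unit=multi-user.target" overrides the default target for that boot only. To check afterwards that the module really stayed out, something like this sketch works (function name and the optional file argument are made up here):

```shell
# Succeeds if the named module is listed in /proc/modules. The second
# argument only exists so the check can be pointed at another file.
module_loaded() {
    modules_file=${2:-/proc/modules}
    grep -q "^$1 " "$modules_file"
}
```

For example: `module_loaded snd_hda_intel && echo "still loaded"`. `systemctl get-default` shows the configured default target.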

alllexx88 · 2024-10-10 08:36:04

Thanks,

A boot with just snd_hda_intel blacklisted: http://0x0.st/X6HH.txt (hung early on boot, shut down with SysRq reisuo)
Also boot into multi-user.target: http://0x0.st/X6H8.txt (a "successful" boot)
Another boot into multi-user.target that hung: http://0x0.st/X6HK.txt

"GPU has fallen off the bus" is still there, even on the "successful" boot:

Okt 10 10:24:40 AlexArch kernel: pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
Okt 10 10:24:40 AlexArch kernel: pcieport 0000:00:07.0: pciehp: Slot(14): Card not present
Okt 10 10:24:40 AlexArch kernel: xhci_hcd 0000:09:00.0: remove, state 1
Okt 10 10:24:40 AlexArch kernel: usb usb10: USB disconnect, device number 1
Okt 10 10:24:40 AlexArch kernel: usb 10-1: USB disconnect, device number 2
Okt 10 10:24:40 AlexArch kernel: NVRM: GPU at PCI:0000:04:00: GPU-a5b5851e-dfe0-833c-36f5-e63990058e1e
Okt 10 10:24:40 AlexArch kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Okt 10 10:24:40 AlexArch kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
Okt 10 10:24:40 AlexArch kernel: [drm] [nvidia-drm] [GPU ID 0x00000400] Framebuffer memory not appropriate for scanout
Okt 10 10:24:40 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 14745600
Okt 10 10:24:40 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 14745600
Okt 10 10:24:40 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 14745600
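
For what it's worth, the size in those last three lines is exactly what one 2560x1440 buffer at 4 bytes per pixel needs. That it's a framebuffer for the external monitor is only my guess; the arithmetic itself can be checked with a throwaway helper (the name is made up):

```shell
# Bytes needed for one width x height buffer at the given bytes per pixel.
dumb_buffer_size() {
    echo $(( $1 * $2 * $3 ))
}

dumb_buffer_size 2560 1440 4   # prints 14745600
```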

seth · 2024-10-10 12:37:21

Also boot into multi-user.target: http://0x0.st/X6H8.txt (a "successful" boot)

Okt 10 10:24:40 AlexArch systemd[1]: Queued start job for default target Graphical Interface.

How exactly did you configure the default target? It's certainly not in the kernel commandline and more curiously, none of the journals has either the multi-user or graphical.target being reached.

I think the nvidia thing is a red herring, there's something wrong with the boot process at large.
The "successful" boot takes 20s before you log in, but only 3s until

Okt 10 10:24:43 AlexArch systemd[1]: Started Getty on tty1.
Okt 10 10:24:43 AlexArch systemd[1]: Reached target Login Prompts.

which is followed by a second PCI probe and a hell of a lot of network setup actions.
You have both NM and networkd enabled: disable the latter categorically and, for a test, also all the vmware, container and docker stuff, the nats server, and everything else that's not strictly necessary to boot the system (incl. boltd?), then see what happens.

Okt 10 10:25:53 AlexArch kernel: sysrq: Keyboard mode set to system default

find /etc/systemd -type l -exec test -f {} \; -print | awk -F'/' '{ printf ("%-40s | %s\n", $(NF-0), $(NF-1)) }' | sort -f
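
The one-liner lists every symlink below /etc/systemd that resolves to a regular file, i.e. the enabled units, next to the directory that pulls each one in. Broken links, and masked units (which point at /dev/null), are skipped by the `test -f`. The same thing as a commented function, with the directory made an optional argument purely for this sketch:

```shell
# List enabled units as "unit | parent directory", sorted case-insensitively.
list_enabled_units() {
    dir=${1:-/etc/systemd}
    # -type l: symlinks only; test -f follows the link and keeps it only if
    # the target is a regular file.
    find "$dir" -type l -exec test -f {} \; -print |
        # Last path component is the unit, the one before it the directory
        # (for example multi-user.target.wants).
        awk -F'/' '{ printf ("%-40s | %s\n", $NF, $(NF-1)) }' |
        sort -f
}
```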

alllexx88 · 2024-10-10 13:45:34

seth wrote:

How exactly did you configure the default target? It's certainly not in the kernel commandline and more curiously, none of the journals has either the multi-user or graphical.target being reached.

I just disabled the sddm service, which also brings me to a tty, like multi-user.target would. I have now changed the default to multi-user.target, and couldn't boot a couple of times: http://0x0.st/X68t.txt. And this is what a boot to multi-user.target without the egpu connected looks like: http://0x0.st/X68v.txt

With systemd-networkd, vmware, nats, docker services disabled it still hangs on boot: http://0x0.st/X686.txt. For comparison a boot without egpu attached: http://0x0.st/X68l.txt

$ find /etc/systemd -type l -exec test -f {} \; -print | awk -F'/' '{ printf ("%-42s | %s\n", $(NF-0), $(NF-1)) }' | sort -f
acpid.service                              | multi-user.target.wants
apparmor.service                           | multi-user.target.wants
AX88179.service                            | graphical.target.wants
bluetooth.service                          | bluetooth.target.wants
creative.service                           | graphical.target.wants
creative-shutdown.service                  | halt.target.wants
creative-shutdown.service                  | reboot.target.wants
creative-shutdown.service                  | shutdown.target.wants
ctrl-alt-del.target                        | system
cups.path                                  | multi-user.target.wants
cups.service                               | multi-user.target.wants
cups.service                               | printer.target.wants
cups.socket                                | sockets.target.wants
dbus-org.bluez.service                     | system
dbus-org.freedesktop.home1.service         | system
dbus-org.freedesktop.nm-dispatcher.service | system
dbus-org.freedesktop.resolve1.service      | system
dbus-org.freedesktop.timesync1.service     | system
default.target                             | system
display-manager.service                    | system
getty@tty1.service                         | getty.target.wants
i8kmon.service                             | sys-subsystem-hwmon-devices-dell_smm.device.wants
machines.target                            | multi-user.target.wants
NetworkManager.service                     | multi-user.target.wants
NetworkManager-wait-online.service         | network-online.target.wants
nfs-client.target                          | multi-user.target.wants
nfs-client.target                          | remote-fs.target.wants
nvidia-hibernate.service                   | systemd-hibernate.service.wants
nvidia-resume.service                      | systemd-hibernate.service.wants
nvidia-resume.service                      | systemd-suspend.service.wants
nvidia-suspend.service                     | systemd-suspend.service.wants
p11-kit-server.socket                      | sockets.target.wants
pipewire-pulse.socket                      | sockets.target.wants
pipewire-session-manager.service           | user
pipewire.socket                            | sockets.target.wants
remote-cryptsetup.target                   | multi-user.target.wants
remote-fs.target                           | multi-user.target.wants
systemd-boot-update.service                | sysinit.target.wants
systemd-confext.service                    | sysinit.target.wants
systemd-homed-activate.service             | systemd-homed.service.wants
systemd-homed.service                      | multi-user.target.wants
systemd-journald-audit.socket              | sockets.target.wants
systemd-journald-audit.socket              | systemd-journald.service.wants
systemd-mountfsd.socket                    | sockets.target.wants
systemd-nsresourced.socket                 | sockets.target.wants
systemd-pstore.service                     | sysinit.target.wants
systemd-resolved.service                   | sysinit.target.wants
systemd-sysext.service                     | sysinit.target.wants
systemd-timesyncd.service                  | sysinit.target.wants
systemd-userdbd.socket                     | sockets.target.wants
wireplumber.service                        | pipewire.service.wants
xdg-user-dirs-update.service               | default.target.wants

Also, not sure how to disable boltd:

$ sudo systemctl disable bolt
The unit files have no installation config (WantedBy=, RequiredBy=, UpheldBy=,
Also=, or Alias= settings in the [Install] section, and DefaultInstance= for
template units). This means they are not meant to be enabled or disabled using systemctl.
 
Possible reasons for having these kinds of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/, .requires/, or .upholds/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.

Wouldn't disabling it make the egpu inaccessible?

UPD: disabled all the services I remember I added/enabled manually:

$ find /etc/systemd -type l -exec test -f {} \; -print | awk -F'/' '{ printf ("%-42s | %s\n", $(NF-0), $(NF-1)) }' | sort -f
acpid.service                              | multi-user.target.wants
apparmor.service                           | multi-user.target.wants
bluetooth.service                          | bluetooth.target.wants
ctrl-alt-del.target                        | system
cups.path                                  | multi-user.target.wants
cups.service                               | multi-user.target.wants
cups.service                               | printer.target.wants
cups.socket                                | sockets.target.wants
dbus-org.bluez.service                     | system
dbus-org.freedesktop.home1.service         | system
dbus-org.freedesktop.nm-dispatcher.service | system
dbus-org.freedesktop.resolve1.service      | system
dbus-org.freedesktop.timesync1.service     | system
default.target                             | system
display-manager.service                    | system
getty@tty1.service                         | getty.target.wants
machines.target                            | multi-user.target.wants
NetworkManager.service                     | multi-user.target.wants
NetworkManager-wait-online.service         | network-online.target.wants
nfs-client.target                          | multi-user.target.wants
nfs-client.target                          | remote-fs.target.wants
p11-kit-server.socket                      | sockets.target.wants
pipewire-pulse.socket                      | sockets.target.wants
pipewire-session-manager.service           | user
pipewire.socket                            | sockets.target.wants
remote-cryptsetup.target                   | multi-user.target.wants
remote-fs.target                           | multi-user.target.wants
systemd-boot-update.service                | sysinit.target.wants
systemd-confext.service                    | sysinit.target.wants
systemd-homed-activate.service             | systemd-homed.service.wants
systemd-homed.service                      | multi-user.target.wants
systemd-journald-audit.socket              | sockets.target.wants
systemd-journald-audit.socket              | systemd-journald.service.wants
systemd-mountfsd.socket                    | sockets.target.wants
systemd-nsresourced.socket                 | sockets.target.wants
systemd-pstore.service                     | sysinit.target.wants
systemd-resolved.service                   | sysinit.target.wants
systemd-sysext.service                     | sysinit.target.wants
systemd-timesyncd.service                  | sysinit.target.wants
systemd-userdbd.socket                     | sockets.target.wants
wireplumber.service                        | pipewire.service.wants
xdg-user-dirs-update.service               | default.target.wants

It still doesn't boot with the egpu connected: http://0x0.st/X6Ki.txt. Without it it's fine: http://0x0.st/X6Ko.txt

Last edited by alllexx88 (2024-10-10 13:58:19)

seth · 2024-10-10 14:29:30

You'd mask the service or remove the bolt package. Lack of the device manager might render the egpu useless, but not inaccessible.
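
Roughly like this. It's a cautious sketch that only prints the commands unless APPLY=1 is set; the unit name bolt.service is assumed from your systemctl call above, and the wrapper names are made up.

```shell
# Print the command by default; run it for real only when APPLY=1 (needs root).
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Mask the unit so nothing can start it (not even D-Bus activation) and stop
# a running instance; unmask reverts this.
mask_bolt()   { run systemctl mask --now bolt.service; }
unmask_bolt() { run systemctl unmask bolt.service; }
```

Calling `mask_bolt` shows what would be executed; `APPLY=1 mask_bolt` as root does it, and `APPLY=1 unmask_bolt` undoes it after the test.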

Okt 10 15:16:52 AlexArch kernel: nvidia-modeset: ERROR: GPU:1: Error while waiting for GPU progress: 0x0000c77d:0 2:0:4048:4040
…
Okt 10 15:16:53 AlexArch kernel: pcieport 0000:03:04.0: Runtime PM usage count underflow!
Okt 10 15:16:53 AlexArch (udev-worker)[580]: controlC2: /usr/lib/udev/rules.d/78-sound-card.rules:5 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:04.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/usb7/7-2/7-2:1.0/sound/card2/controlC2/../uevent}="change", ignoring: No such file or directory
Okt 10 15:16:53 AlexArch (udev-worker)[601]: controlC1: Process '/usr/bin/alsactl restore /dev/snd/controlC1' failed with exit code 2.
Okt 10 15:16:53 AlexArch (udev-worker)[580]: controlC2: Process '/usr/bin/alsactl restore /dev/snd/controlC2' failed with exit code 2.
Okt 10 15:16:53 AlexArch (udev-worker)[574]: controlC1: /usr/lib/udev/rules.d/78-sound-card.rules:5 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:04.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/usb5/5-2/5-2:1.2/sound/card1/controlC1/../uevent}="change", ignoring: No such file or directory
Okt 10 15:16:53 AlexArch systemd[1]: Dispatch Password Requests to Console Directory Watch was skipped because of an unmet condition check (ConditionPathExists=!/run/plymouth/pid).
Okt 10 15:16:53 AlexArch systemd[1]: Rebuild Dynamic Linker Cache was skipped because no trigger condition checks were met.
Okt 10 15:16:53 AlexArch (udev-worker)[574]: controlC1: Process '/usr/bin/alsactl restore /dev/snd/controlC1' failed with exit code 2.
…
Okt 10 15:16:53 AlexArch kernel: BUG: unable to handle page fault for address: ffff9924c9680004
Okt 10 15:16:53 AlexArch kernel: #PF: supervisor read access in kernel mode
Okt 10 15:16:53 AlexArch kernel: #PF: error_code(0x0000) - not-present page
Okt 10 15:16:53 AlexArch kernel: PGD 100000067 P4D 100000067 PUD 100255067 PMD 0 
Okt 10 15:16:53 AlexArch kernel: Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
Okt 10 15:16:53 AlexArch kernel: CPU: 2 UID: 0 PID: 609 Comm: plymouthd Tainted: P           OE      6.11.2-zen1-1-zen #1 d30fef8841e6fc6d439846897474034e8403afe5
Okt 10 15:16:53 AlexArch kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Okt 10 15:16:53 AlexArch kernel: Hardware name: Alienware Alienware m16 R2/0NNNJG, BIOS 1.6.0 07/09/2024
Okt 10 15:16:53 AlexArch kernel: RIP: 0010:_nv002568kms+0x133/0x230 [nvidia_modeset]
Okt 10 15:16:53 AlexArch kernel: Code: 85 ed 74 3d 4c 39 e8 72 3e 4c 29 e8 48 39 d0 77 36 e8 b1 91 fc ff 83 bb 8c 00 00 00 01 0f 87 24 ff ff ff 48 8b 83 90 00 00 00 <44> 8b 78 04 e9 56 ff ff ff 0f 1f 40 00 ba 40 4b 4c 00 4d 85 ed 75
Okt 10 15:16:53 AlexArch kernel: RSP: 0018:ffff9924c26e7368 EFLAGS: 00010246
Okt 10 15:16:53 AlexArch kernel: RAX: ffff9924c9680000 RBX: ffff8bbf1d522e08 RCX: 0000000000000002
Okt 10 15:16:53 AlexArch kernel: RDX: ffff8bd61f937370 RSI: 0000000000000002 RDI: ffff8bd61f936840
Okt 10 15:16:53 AlexArch kernel: RBP: ffff9924c26e73a8 R08: 0000000000000001 R09: 0000000399076187
Okt 10 15:16:53 AlexArch kernel: R10: 00000000000001c3 R11: 00000000000001d8 R12: 0000000000000fc8
Okt 10 15:16:53 AlexArch kernel: R13: 0000000000e1e8c7 R14: ffff9924c01a1008 R15: 0000000000000fd0
Okt 10 15:16:53 AlexArch kernel: FS:  00007371be762480(0000) GS:ffff8bd61f900000(0000) knlGS:0000000000000000
Okt 10 15:16:53 AlexArch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Okt 10 15:16:53 AlexArch kernel: CR2: ffff9924c9680004 CR3: 000000011da72001 CR4: 0000000000f70ef0
Okt 10 15:16:53 AlexArch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Okt 10 15:16:53 AlexArch kernel: DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
Okt 10 15:16:53 AlexArch kernel: PKRU: 55555554
Okt 10 15:16:53 AlexArch kernel: Call Trace:
Okt 10 15:16:53 AlexArch kernel:  <TASK>
Okt 10 15:16:53 AlexArch kernel:  ? __die_body.cold+0x8/0x12
Okt 10 15:16:53 AlexArch kernel:  ? page_fault_oops+0x15a/0x2d0
Okt 10 15:16:53 AlexArch kernel:  ? exc_page_fault+0x18a/0x190
Okt 10 15:16:53 AlexArch kernel:  ? asm_exc_page_fault+0x26/0x30
Okt 10 15:16:53 AlexArch kernel:  ? _nv002568kms+0x133/0x230 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  ? _nv002568kms+0x11f/0x230 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  _nv000521kms+0x177/0x2d0 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  _nv000205kms+0x5d/0x530 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  _nv002891kms+0xdc/0x1a0 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  _nv002909kms+0x443/0x48f0 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  ? alloc_pages_bulk_noprof+0x535/0x8f0
Okt 10 15:16:53 AlexArch kernel:  ? vmap_small_pages_range_noflush+0x417/0x980
Okt 10 15:16:53 AlexArch kernel:  ? __vmalloc_node_range_noprof+0x76e/0xb90
Okt 10 15:16:53 AlexArch kernel:  ? nvkms_alloc+0x90/0xa0 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  ? _nv000352kms+0xf0/0xf0 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  nvKmsIoctl+0xf7/0x270 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  nvkms_ioctl_from_kapi_try_pmlock+0x64/0xb0 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  _nv000022kms+0x350/0xa90 [nvidia_modeset fe970f9cc8df885a5bdde1e7374f235aa6e5331d]
Okt 10 15:16:53 AlexArch kernel:  nv_drm_atomic_apply_modeset_config.isra.0+0x789/0x800 [nvidia_drm cd15be0c22a3d70339c4397d6071cc78f4ca3ba9]
Okt 10 15:16:53 AlexArch kernel:  drm_atomic_check_only+0x5c6/0xa20
Okt 10 15:16:53 AlexArch kernel:  drm_atomic_commit+0x69/0xe0
Okt 10 15:16:53 AlexArch kernel:  ? __pfx___drm_printfn_info+0x10/0x10
Okt 10 15:16:53 AlexArch kernel:  drm_client_modeset_commit_atomic.constprop.0+0x1c6/0x200
Okt 10 15:16:53 AlexArch kernel:  drm_client_modeset_commit_locked+0x55/0x190
Okt 10 15:16:53 AlexArch kernel:  drm_client_modeset_commit+0x25/0x40
Okt 10 15:16:53 AlexArch kernel:  drm_fb_helper_lastclose+0x49/0x80
Okt 10 15:16:53 AlexArch kernel:  drm_fbdev_ttm_client_restore+0x11/0x20 [drm_ttm_helper 43c13e97dc804a440e1d7fe5b1a42a38926fdf96]
Okt 10 15:16:53 AlexArch kernel:  drm_client_dev_restore+0x6e/0x100
Okt 10 15:16:53 AlexArch kernel:  drm_release+0x170/0x1b0
Okt 10 15:16:53 AlexArch kernel:  __fput+0xed/0x2d0
Okt 10 15:16:53 AlexArch kernel:  __x64_sys_close+0x87/0x100
Okt 10 15:16:53 AlexArch kernel:  do_syscall_64+0x82/0x190
Okt 10 15:16:53 AlexArch kernel:  ? __mod_memcg_lruvec_state+0xa0/0x150
Okt 10 15:16:53 AlexArch kernel:  ? __lruvec_stat_mod_folio+0x83/0xd0
Okt 10 15:16:53 AlexArch kernel:  ? folio_add_file_rmap_ptes+0x3b/0xb0
Okt 10 15:16:53 AlexArch kernel:  ? next_uptodate_folio+0x88/0x290
Okt 10 15:16:53 AlexArch kernel:  ? filemap_map_pages+0x4fa/0x5c0
Okt 10 15:16:53 AlexArch kernel:  ? __count_memcg_events+0x57/0xf0
Okt 10 15:16:53 AlexArch kernel:  ? handle_mm_fault+0x58a/0x1500
Okt 10 15:16:53 AlexArch kernel:  ? do_user_addr_fault+0x5d5/0x860
Okt 10 15:16:53 AlexArch kernel:  ? exc_page_fault+0x81/0x190
Okt 10 15:16:53 AlexArch kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Okt 10 15:16:53 AlexArch kernel: RIP: 0033:0x7371bea37cac
Okt 10 15:16:53 AlexArch kernel: Code: 0f 05 48 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d fc e8 20 96 f8 ff 8b 7d fc 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2c 89 d7 89 45 fc e8 82 96 f8 ff 8b 45 fc c9
Okt 10 15:16:53 AlexArch kernel: RSP: 002b:00007ffd29c71610 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
Okt 10 15:16:53 AlexArch kernel: RAX: ffffffffffffffda RBX: 00005a4005de0400 RCX: 00007371bea37cac
Okt 10 15:16:53 AlexArch kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000c
Okt 10 15:16:53 AlexArch kernel: RBP: 00007ffd29c71620 R08: 00000005a4005df4 R09: 0000000000000007
Okt 10 15:16:53 AlexArch kernel: R10: 00005a4005df40b0 R11: 0000000000000293 R12: 00007371be762408
Okt 10 15:16:53 AlexArch kernel: R13: 0000000000000016 R14: 00007371beb49c90 R15: 0000000000000016
Okt 10 15:16:53 AlexArch kernel:  </TASK>

This looks new (in this case, it's actually fairly common), have you tried to use https://aur.archlinux.org/packages/nvidia-535xx-dkms ?

alllexx88 · 2024-10-10 15:57:28

I masked bolt service, and with nvidia-dkms it boots with egpu connected, but yes, it's unusable.

With nvidia-535xx-dkms, I still get "Failed to allocate NvKmsKapiMemory" and "BUG: unable to handle page fault for address": http://0x0.st/X6PN.txt

seth · 2024-10-10 18:01:59

The log runs 560.35.03, but the page fault has shifted into snd_hda_core?
[Edit: the reason is likely a copy of the module in the initramfs?]

But if you can reliably boot w/o boltd we at least know where it's falling apart, the open question is whether the 535xx driver fares better.

Last edited by seth (2024-10-10 18:02:56)

alllexx88 · 2024-10-11 10:11:14

I don't know about snd_hda_core. Looks like I tried blacklisting snd_hda_intel before, it didn't (seem to?) change anything, so I brought it back so that sound works when I'm without the usb soundcard.

Looks like I can reliably boot w/o boltd with nvidia-dkms, but not with nvidia-535xx-dkms: http://0x0.st/X6TU.txt
Also, even without egpu connected, and with nvidia-dkms, sometimes the boot takes quite a while, like: http://0x0.st/X6TG.txt. Don't think it should be that way.

Thanks

alllexx88 · 2024-10-16 11:28:13

Hi again

Maybe due to nvidia-dkms upgrade, or my laptop BIOS upgrade, or me moving all USB devices to a Type C hub, or all of these together, but I can now consistently boot with the eGPU connected. It's now down to the "Failed to allocate NvKmsKapiMemory for dumb object of size 4096" error preventing setting up the drm device: http://0x0.st/XIbW.txt

Thanks

UPD: Just tried the 535 version of nvidia drivers: http://0x0.st/XIbk.txt. It didn't even boot with the egpu connected, also failing on the dgpu. It boots fine without the egpu connected though. But the current nvidia-dkms (560.35.03-16) seems to fare better, at least it boots.

UPD2: Sorry, just realized I had bolt service masked. Unmasked it. Now booting takes much longer, but it still seems to boot: http://0x0.st/XIcs.txt

UPD3: Ah damn, after a couple of tries I stumbled on a boot freeze again: http://0x0.st/XIci.txt. Looks like nothing has changed

Last edited by alllexx88 (2024-10-16 12:54:02)

seth · 2024-10-16 13:07:39

Oct 16 13:22:38 AlexArch kernel: NVRM: GPU at PCI:0000:04:00: GPU-a5b5851e-dfe0-833c-36f5-e63990058e1e
Oct 16 13:22:38 AlexArch kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Oct 16 13:22:38 AlexArch kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
Oct 16 13:22:38 AlexArch kernel: BUG: unable to handle page fault for address: ffffae994784805c
…
Oct 16 13:22:38 AlexArch kernel:  ? __die_body.cold+0x8/0x12
Oct 16 13:22:38 AlexArch kernel:  ? page_fault_oops+0x15a/0x2d0
Oct 16 13:22:38 AlexArch kernel:  ? exc_page_fault+0x18a/0x190
Oct 16 13:22:38 AlexArch kernel:  ? asm_exc_page_fault+0x26/0x30
Oct 16 13:22:38 AlexArch kernel:  ? snd_hdac_bus_stop_cmd_io+0x69/0xf0 [snd_hda_core 1400000003000000474e55006a4163f8b3984578]
…
Oct 16 13:22:38 AlexArch kernel: pci_bus 0000:04: busn_res: [bus 04] is released
Oct 16 13:22:38 AlexArch kernel: pci_bus 0000:07: busn_res: [bus 07] is released
Oct 16 13:22:38 AlexArch kernel: pci_bus 0000:08: busn_res: [bus 08] is released
Oct 16 13:22:38 AlexArch kernel: pci_bus 0000:09: busn_res: [bus 09] is released
Oct 16 13:22:38 AlexArch kernel: pci_bus 0000:06: busn_res: [bus 06-09] is released
Oct 16 13:22:38 AlexArch kernel: pci_bus 0000:05: busn_res: [bus 05-2b] is released
Oct 16 13:22:38 AlexArch kernel: pci_bus 0000:03: busn_res: [bus 03-2b] is released
Oct 16 13:22:38 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 4096
…
Oct 16 13:22:43 AlexArch sddm-helper-start-wayland[977]: Starting Wayland process "kwin_wayland --drm --no-lockscreen --no-global-shortcuts --locale1" "sddm"
Oct 16 13:22:43 AlexArch sddm-helper-start-wayland[977]: started succesfully "kwin_wayland --drm --no-lockscreen --no-global-shortcuts --locale1"
Oct 16 13:22:58 AlexArch sddm-helper-start-wayland[977]: "kwin_wayland_drm: atomic commit failed: Invalid argument\n"
Oct 16 13:22:59 AlexArch kwin_wayland[1167]: No backend specified, automatically choosing drm
Oct 16 13:22:59 AlexArch kwin_wayland[1167]: kwin_wayland_drm: failed to open drm device at "/dev/dri/card1"

The main issue is still that the GPU is "falling off the bus", does it (still) show up in lspci?

535xx fails on

Oct 16 14:12:09 AlexArch kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to apply atomic modeset.  Error code: -22
Oct 16 14:12:09 AlexArch kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to apply atomic modeset.  Error code: -22
Oct 16 14:12:09 AlexArch kernel: [drm:nv_drm_dumb_create [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000400] Failed to allocate NvKmsKapiMemory for dumb object of size 4096
Oct 16 14:12:14 AlexArch kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to apply atomic modeset.  Error code: -22

alllexx88 · 2024-10-16 14:37:29

seth wrote:

The main issue is still that the GPU is "falling off the bus", does it (still) show up in lspci?

Yes, it shows up on lspci when it boots:

$ lspci | grep NVIDIA
0000:01:00.0 VGA compatible controller: NVIDIA Corporation AD106M [GeForce RTX 4070 Max-Q / Mobile] (rev a1)
0000:01:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
0000:04:00.0 VGA compatible controller: NVIDIA Corporation AD104 [GeForce RTX 4070 Ti] (rev a1)
0000:04:00.1 Audio device: NVIDIA Corporation AD104 High Definition Audio Controller (rev a1)

As well as seen on nvidia-smi:

$ nvidia-smi
Wed Oct 16 16:36:05 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   52C    P5             10W /   80W |     977MiB /   8188MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070 Ti     Off |   00000000:04:00.0 Off |                  N/A |
|  0%   27C    P8              2W /  285W |       4MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
...

Thanks

seth · 2024-10-16 14:49:03

So luckily your internal GPU shows up w/ a different ID 10de:2860 than your external one 10de:2782
What if you add "pci-stub.ids=10de:2782,10de:22bc" (the second one is the audio) to the kernel parameters to claim the eGPU and then later on unbind it from pci-stub and rescan the bus, https://wiki.archlinux.org/title/PCI_pa … _device_ID (see the script at the bottom of the paragraph, except we're not using vfio but pci-stub)

gpu="0000:04:00.0"
aud="0000:04:00.1"
# vendor/device ID pairs handed to pci-stub via pci-stub.ids (eGPU and its audio function)
gpu_vd="10de 2782"
aud_vd="10de 22bc"
echo "$gpu_vd" > "/sys/bus/pci/drivers/pci-stub/remove_id"
echo "$aud_vd" > "/sys/bus/pci/drivers/pci-stub/remove_id"
echo 1 > "/sys/bus/pci/devices/$gpu/remove"
echo 1 > "/sys/bus/pci/devices/$aud/remove"
echo 1 > "/sys/bus/pci/rescan"

alllexx88 · 2024-10-17 09:43:29

I added pci-stub.ids to the kernel parameters, and also added this initcpio hook:

#!/usr/bin/ash

run_hook() {
    gpu="0000:04:00.0"
    aud="0000:04:00.1"
    gpu_id="10de 2782"
#   gpu_id="$(cat /sys/bus/pci/devices/$gpu/vendor) $(cat /sys/bus/pci/devices/$gpu/device)"
#   aud_id="$(cat /sys/bus/pci/devices/$aud/vendor) $(cat /sys/bus/pci/devices/$aud/device)"
    aud_id="10de 22bc"
    echo "$gpu_id" > "/sys/bus/pci/drivers/pci-stub/remove_id"
    echo "$aud_id" > "/sys/bus/pci/drivers/pci-stub/remove_id"
    echo 1 > "/sys/bus/pci/devices/$gpu/remove"
    echo 1 > "/sys/bus/pci/devices/$aud/remove"
    echo 1 > "/sys/bus/pci/rescan"
}

# vim: set ft=sh ts=4 sw=4 et:
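
The two commented-out lines in the hook read the IDs from sysfs instead of hard-coding them. A minimal sketch of that idea as a helper follows. The function name pci_id_pair and the SYSFS_PCI override are made up here for illustration (the override only exists so the function can be tried against a fake directory tree). Stripping the 0x prefix is purely cosmetic, so the result has the same "10de 2782" form as the hard-coded values.

```sh
#!/bin/sh
# Print the "vendor device" pair for a PCI address, e.g. "10de 2782".
# SYSFS_PCI is a hypothetical override for trying this on a fake tree;
# by default the real sysfs device directory is used.
pci_id_pair() {
    dev_dir="${SYSFS_PCI:-/sys/bus/pci/devices}/$1"
    # Bail out if the device is not (or no longer) present.
    [ -r "$dev_dir/vendor" ] && [ -r "$dev_dir/device" ] || return 1
    vendor=$(cat "$dev_dir/vendor")
    device=$(cat "$dev_dir/device")
    # sysfs reports values like 0x10de; drop the prefix to match the
    # hard-coded form used in the hook.
    printf '%s %s\n' "${vendor#0x}" "${device#0x}"
}
```

With that, gpu_id="$(pci_id_pair "$gpu")" would replace the hard-coded value, as long as the lookup happens before the device is removed from the bus.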

It was booting fine for a few boots in a row, but without trying to create an nvidia-drm device for the external GPU: http://0x0.st/XIQw.txt
Then, however, it got stuck on boot again: http://0x0.st/XIQ3.txt

Oct 17 11:10:40 AlexArch kernel: BUG: unable to handle page fault for address: ffffb6c88179404c
Oct 17 11:10:40 AlexArch kernel: #PF: supervisor read access in kernel mode
Oct 17 11:10:40 AlexArch kernel: #PF: error_code(0x0000) - not-present page

(I also added "8086:7d55" to pci-stub ids: it's the Intel Arc graphics, which causes some problems by being visible, even though nvidia advanced optimus is disabled via BIOS. I used an initcpio hook before to remove the pci device, but it now comes back due to pci bus rescan, so pci-stub is a better solution).

Thanks

seth · 2024-10-17 12:15:17

Nonono.
The plan is to keep the device at the pci-stub and activate it lay-tor.
If in doubt, *much* later (like maybe "seconds")

Initially, just always boot w/ the device on the pci-stub and then, after your session started (or after you login and before starting a GUI session if that's relevant) manually reclaim the device.
If that works reliably, you only have to figure how long you actually have to wait (and can, w/ the delay getting annoying)

alllexx88 · 2024-10-17 16:18:32

Ah, I see
I did just that: switched to multi-user.target and ran the reclaim manually before sddm: http://0x0.st/XI2i.txt
This made the eGPU usable as a CUDA device, but didn't attempt to create a drm device for video output.
The same thing with a systemd service that runs before display-manager.service:

[Unit]
Description=Reclaim egpu pci
Before=display-manager.service
After=bolt.service

[Service]
Type=oneshot
ExecStart=bash -c 'gpu="0000:04:00.0"; aud="0000:04:00.1"; gpu_id="10de 2782"; aud_id="10de 22bc"; echo "$gpu_id" > "/sys/bus/pci/drivers/pci-stub/remove_id"; echo "$aud_id" > "/sys/bus/pci/drivers/pci-stub/remove_id"; echo 1 > "/sys/bus/pci/devices/$gpu/remove"; echo 1 > "/sys/bus/pci/devices/$aud/remove"; echo 1 > "/sys/bus/pci/rescan"'

[Install]
WantedBy=graphical.target

http://0x0.st/XI2o.txt
The result's the same, as with doing it manually: eGPU works, but without a respective drm device created.
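
For readability, the one-line ExecStart could also live in a small script. This is only a sketch of the same steps, not a tested replacement: the function name reclaim_egpu and the SYSFS_PCI override are invented here (the override lets the function be dry-run against a fake directory tree), and the addresses and IDs are the ones from this thread.

```sh
#!/bin/sh
# Release the eGPU and its audio function from pci-stub, then rescan.
# SYSFS_PCI is a hypothetical override for dry runs; the default is
# the real /sys/bus/pci.
reclaim_egpu() {
    root="${SYSFS_PCI:-/sys/bus/pci}"
    gpu="0000:04:00.0"
    aud="0000:04:00.1"
    gpu_id="10de 2782"
    aud_id="10de 22bc"
    # Tell pci-stub to stop claiming these IDs.
    echo "$gpu_id" > "$root/drivers/pci-stub/remove_id" || return 1
    echo "$aud_id" > "$root/drivers/pci-stub/remove_id" || return 1
    # Drop both functions from the bus ...
    echo 1 > "$root/devices/$gpu/remove" || return 1
    echo 1 > "$root/devices/$aud/remove" || return 1
    # ... and let the kernel rediscover them.
    echo 1 > "$root/rescan"
}
```

Saved to a file with a final reclaim_egpu line, the unit's ExecStart would just point at that file (the path is up to you).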

Also, I rebooted several times, and so far it boots and powers off/reboots fine.

This solves the problem for me currently, as I use a USB C hub for HDMI output to the external monitor, while being able to use the eGPU for ML stuff. But still a pity I can't get it fully functional, eliminating the need to eat up a portion of the hub's bandwidth for the HDMI output (though on the other hand it frees some eGPU VRAM for CUDA)

Thank you

Last edited by alllexx88 (2024-10-17 16:19:07)

seth · 2024-10-17 18:46:18

\o/
Nothing's falling off the bus anymore

What happens if you "modprobe.blacklist=nvidia_drm" and only load that later (preferably before starting the GUI)?

alllexx88 · 2024-10-17 20:56:02

Wow, drm device gets created this way, awesome! I added "sleep 3" to ExecStart of the pci_reclaim.service, and created nvidia_drm_load.service:

[Unit]
Description=nvidia_drm modprobe
Before=display-manager.service
After=pci_reclaim.service

[Service]
Type=oneshot
ExecStart=modprobe nvidia_drm

[Install]
WantedBy=graphical.target

And I get the drm device:

# find /sys/devices -name drm
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm
/sys/devices/pci0000:00/0000:00:07.0/0000:02:00.0/0000:03:01.0/0000:04:00.0/drm
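
Instead of a fixed "sleep 3", the service could poll until the device shows up again after the rescan. This is a minimal sketch under the assumption that the sleep is only there to give the rescanned device time to reappear in sysfs. Whether that is enough to avoid the freeze on this hardware is untested. The function name wait_for_path is made up for illustration.

```sh
#!/bin/sh
# Wait until a path exists, checking once per second.
# $1: path to wait for, $2: timeout in seconds (default 10).
# Returns 0 as soon as the path exists, non-zero on timeout.
wait_for_path() {
    path=$1
    remaining=${2:-10}
    while [ "$remaining" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    # One last check after the final sleep.
    [ -e "$path" ]
}
```

For example, wait_for_path /sys/bus/pci/devices/0000:04:00.0 10 could run before "modprobe nvidia_drm", so the module is only loaded once the eGPU is back on the bus.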

For some reason though, sddm on wayland has the external screen, connected to egpu, black (but with cursor), while when I log into a plasma wayland session, it's the opposite: the built-in screen is black with cursor, while the external one (from egpu) is fine: http://0x0.st/XILg.txt

If I start X11 plasma session, both screens work fine.

Thanks again

seth · 2024-10-18 05:47:44

The main problem is likely a race condition, the nvidia drivers try to handle the eGPU before it has properly initialized.
As for wayland, I'm not sure how good it is at handling two random GPUs, but the relevant information would come from kscreen-doctor and "qdbus org.kde.KWin /KWin supportInformation" and (for contrast) your xorg log.

Please always remember to mark resolved threads by editing your initial post's subject - so others will know that there's no task left, but maybe a solution to find.
Thanks.

alllexx88 · 2024-10-18 08:38:54

Thank you very much for your help seth!

I'm marking this topic as "solved", the wayland issue seems to be a separate problem indeed.
