
#1 2023-01-22 11:57:41

Zer0C00l
Member
Registered: 2023-01-22
Posts: 3

[solved] nvidia - GPU has fallen off the bus when returning from sleep

When waking the system (a desktop) from suspend, my displays don't come back. I can see the backlight in them turning on, but there is no output. I can toggle caps lock on and off for a while, meaning the kernel hasn't panicked quite yet. After a minute or two, however, caps lock stops responding.

Looking at the logs I can see the system went into suspend, but there is an error from the nvidia drivers about the GPU falling off the bus.

I'm using the nvidia-open drivers on an RTX 3090, and I have `nvidia-suspend` and `nvidia-resume` enabled (the latter being recommended when you run GDM). I'm on Wayland (Sway); I haven't tried Xorg.

I've tried disabling `nvidia-suspend` and `nvidia-resume`, but then the system won't even go into sleep. Disabling just `nvidia-resume` also doesn't make a difference. I've also tried enabling and disabling `nvidia-persistence.service`, but I saw no changes.

Now, the interesting thing here is that on Fedora, suspend and resume worked flawlessly 100% of the time (when running Sway, but *not* when running GNOME). So this is not (at least not exclusively) a hardware issue. I must assume there is something Fedora does by default that I haven't done on my Arch install, but I can't for the life of me figure out what it is.

The logs:

Jan 21 08:05:54 czernobog systemd-logind[964]: The system will suspend now!
Jan 21 08:05:54 czernobog NetworkManager[969]: <info>  [1674299154.2657] manager: sleep: sleep requested (sleeping: no  enabled: yes)
Jan 21 08:05:54 czernobog NetworkManager[969]: <info>  [1674299154.2657] device (enp5s0): state change: unavailable -> unmanaged (reason 's>
Jan 21 08:05:54 czernobog NetworkManager[969]: <info>  [1674299154.2746] device (wlan0): state change: disconnected -> unmanaged (reason 's>
Jan 21 08:05:54 czernobog NetworkManager[969]: <info>  [1674299154.2988] device (wlan0): set-hw-addr: reset MAC address to 84:5C:F3:FD:D4:8>
Jan 21 08:05:54 czernobog NetworkManager[969]: <info>  [1674299154.3009] device (p2p-dev-wlan0): state change: disconnected -> unmanaged (r>
Jan 21 08:05:54 czernobog NetworkManager[969]: <info>  [1674299154.3010] manager: NetworkManager state is now ASLEEP
Jan 21 08:05:54 czernobog systemd[1]: Reached target Sleep.
Jan 21 08:05:54 czernobog wpa_supplicant[1304]: p2p-dev-wlan0: CTRL-EVENT-DSCP-POLICY clear_all
Jan 21 08:05:54 czernobog wpa_supplicant[1304]: p2p-dev-wlan0: CTRL-EVENT-DSCP-POLICY clear_all
Jan 21 08:05:54 czernobog wpa_supplicant[1304]: nl80211: deinit ifname=p2p-dev-wlan0 disabled_11b_rates=0
Jan 21 08:05:54 czernobog systemd[1]: Starting NVIDIA system suspend actions...
Jan 21 08:05:54 czernobog suspend[12372]: nvidia-suspend.service
Jan 21 08:05:54 czernobog logger[12372]: <13>Jan 21 08:05:54 suspend: nvidia-suspend.service
Jan 21 08:05:54 czernobog wpa_supplicant[1304]: wlan0: CTRL-EVENT-DSCP-POLICY clear_all
Jan 21 08:05:54 czernobog wpa_supplicant[1304]: wlan0: CTRL-EVENT-DSCP-POLICY clear_all
Jan 21 08:05:54 czernobog wpa_supplicant[1304]: nl80211: deinit ifname=wlan0 disabled_11b_rates=0
Jan 21 08:05:54 czernobog kernel: snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 80
Jan 21 08:05:54 czernobog kernel: snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 7
Jan 21 08:05:54 czernobog systemd[1]: nvidia-suspend.service: Deactivated successfully.
Jan 21 08:05:54 czernobog systemd[1]: Finished NVIDIA system suspend actions.
Jan 21 08:05:54 czernobog audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=nvidia-suspend comm="systemd" exe="/>
Jan 21 08:05:54 czernobog audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=nvidia-suspend comm="systemd" exe="/u>
Jan 21 08:05:54 czernobog kernel: audit: type=1130 audit(1674299154.669:185): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=nvidia-s>
Jan 21 08:05:54 czernobog kernel: audit: type=1131 audit(1674299154.669:186): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=nvidia-s>
Jan 21 08:05:54 czernobog systemd[1]: Starting System Suspend...
Jan 21 08:05:54 czernobog systemd-sleep[12387]: Entering sleep state 'suspend'...
Jan 21 08:05:54 czernobog kernel: PM: suspend entry (deep)
Jan 21 08:05:54 czernobog kernel: Filesystems sync: 0.004 seconds
Jan 21 08:06:32 czernobog kernel: Freezing user space processes ... (elapsed 0.001 seconds) done.
Jan 21 08:06:32 czernobog kernel: OOM killer disabled.
Jan 21 08:06:32 czernobog kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Jan 21 08:06:32 czernobog kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Jan 21 08:06:32 czernobog kernel: ACPI: EC: interrupt blocked
Jan 21 08:06:32 czernobog kernel: NVRM: GPU at PCI:0000:0b:00: GPU-86065511-1dea-f4f6-9c60-39185c120a5f
Jan 21 08:06:32 czernobog kernel: NVRM: Xid (PCI:0000:0b:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jan 21 08:06:32 czernobog kernel: NVRM: GPU 0000:0b:00.0: GPU has fallen off the bus.
Jan 21 08:06:32 czernobog kernel: NVRM prbEncStartAlloc: Can't allocate memory for protocol buffers.
Jan 21 08:06:32 czernobog kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                  NVRM: nvidia-bug-report.sh as root to collect this data before
                                  NVRM: the NVIDIA kernel module is unloaded.
Jan 21 08:06:32 czernobog kernel: ACPI: PM: Preparing to enter system sleep state S3
Jan 21 08:06:32 czernobog kernel: ACPI: EC: event blocked
Jan 21 08:06:32 czernobog kernel: ACPI: EC: EC stopped
Jan 21 08:06:32 czernobog kernel: ACPI: PM: Saving platform NVS memory

[SNIP]

Jan 21 08:06:32 czernobog kernel: Restarting tasks ... done.
Jan 21 08:06:32 czernobog kernel: random: crng reseeded on system resumption
Jan 21 08:06:32 czernobog kernel: PM: suspend exit
Jan 21 08:06:32 czernobog systemd-sleep[12387]: System returned from sleep state.
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Device revision is 0
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Secure boot is enabled
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: OTP lock is enabled
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: API lock is enabled
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Debug lock is disabled
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Minimum firmware build 1 week 10 2014
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Bootloader timestamp 2019.40 buildtype 1 build 38
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Found device firmware: intel/ibt-0041-0041.sfi
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Boot Address: 0x100800
Jan 21 08:06:32 czernobog kernel: Bluetooth: hci0: Firmware Version: 191-39.22
Jan 21 08:06:32 czernobog systemd[1]: Starting Load/Save RF Kill Switch Status...
Jan 21 08:06:32 czernobog systemd[1]: Stopped target Bluetooth Support.
Jan 21 08:06:32 czernobog systemd[1]: Reached target Bluetooth Support.
Jan 21 08:06:32 czernobog systemd[1826]: Reached target Bluetooth.

[SNIP]

Jan 21 08:07:08 czernobog systemd[1]: Started Process Core Dump (PID 12683/UID 0).
Jan 21 08:07:08 czernobog audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-12683-0 comm="systemd" exe="/>
Jan 21 08:07:08 czernobog kernel: audit: type=1130 audit(1674299228.816:195): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-1>
Jan 21 08:07:08 czernobog systemd-coredump[12687]: [?] Process 2079 (wireplumber) of user 1000 dumped core.

                                                   Stack trace of thread 2128:
                                                   #0  0x00007efe93d957c8 _ZN9libcamera20DeviceEnumeratorUdev13addV4L2DeviceEm (libcamera.so.0.0.3 + 0x>
                                                   #1  0x00007efe93d95d08 _ZN9libcamera20DeviceEnumeratorUdev13addUdevDeviceEP11udev_device (libcamera.>
                                                   #2  0x00007efe93d96b51 _ZN9libcamera20DeviceEnumeratorUdev10udevNotifyEv (libcamera.so.0.0.3 + 0x118>
                                                   #3  0x00007efe93c6f8ef _ZN9libcamera19EventDispatcherPoll16processNotifiersERKSt6vectorI6pollfdSaIS2>
                                                   #4  0x00007efe93c6fedc _ZN9libcamera19EventDispatcherPoll13processEventsEv (libcamera-base.so.0.0.3 >
                                                   #5  0x00007efe93c6db1f _ZN9libcamera6Thread4execEv (libcamera-base.so.0.0.3 + 0x15b1f)
                                                   #6  0x00007efe93d07504 _ZN9libcamera13CameraManager7Private3runEv (libcamera.so.0.0.3 + 0x89504)
                                                   #7  0x00007efe93ad7283 execute_native_thread_routine (libstdc++.so.6 + 0xd7283)
                                                   #8  0x00007efea2d768fd n/a (libc.so.6 + 0x868fd)
                                                   #9  0x00007efea2df7f34 __clone (libc.so.6 + 0x107f34)

[SNIP]

Jan 21 08:07:08 czernobog systemd[1826]: wireplumber.service: Main process exited, code=dumped, status=11/SEGV
Jan 21 08:07:08 czernobog systemd[1826]: wireplumber.service: Failed with result 'core-dump'.
Jan 21 08:07:08 czernobog systemd[1]: systemd-coredump@1-12683-0.service: Deactivated successfully.
Jan 21 08:07:08 czernobog audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-12683-0 comm="systemd" exe="/u>
Jan 21 08:07:08 czernobog kernel: audit: type=1131 audit(1674299228.959:196): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-1>
Jan 21 08:07:09 czernobog audit: BPF prog-id=0 op=UNLOAD
Jan 21 08:07:09 czernobog audit: BPF prog-id=0 op=UNLOAD
Jan 21 08:07:09 czernobog audit: BPF prog-id=0 op=UNLOAD
Jan 21 08:07:09 czernobog kernel: audit: type=1334 audit(1674299229.053:197): prog-id=0 op=UNLOAD
Jan 21 08:07:09 czernobog kernel: audit: type=1334 audit(1674299229.053:198): prog-id=0 op=UNLOAD
Jan 21 08:07:09 czernobog kernel: audit: type=1334 audit(1674299229.053:199): prog-id=0 op=UNLOAD
Jan 21 08:07:09 czernobog systemd[1826]: wireplumber.service: Scheduled restart job, restart counter is at 1.
Jan 21 08:07:09 czernobog systemd[1826]: Stopped Multimedia Service Session Manager.
Jan 21 08:07:09 czernobog systemd[1826]: Started Multimedia Service Session Manager.
Jan 21 08:07:09 czernobog rtkit-daemon[1402]: Successfully made thread 12707 of process 12707 owned by '1000' high priority at nice level -11.
Jan 21 08:07:09 czernobog rtkit-daemon[1402]: Supervising 8 threads of 6 processes of 1 users.
Jan 21 08:07:09 czernobog rtkit-daemon[1402]: Successfully made thread 12709 of process 12707 owned by '1000' RT at priority 20.
Jan 21 08:07:09 czernobog rtkit-daemon[1402]: Supervising 9 threads of 6 processes of 1 users.
Jan 21 08:07:09 czernobog wireplumber[12707]: Failed to set scheduler settings: Operation not permitted
Jan 21 08:07:09 czernobog wireplumber[12707]: [0:25:52.918652289] [12707]  INFO Camera camera_manager.cpp:299 libcamera v0.0.3
Jan 21 08:07:09 czernobog wireplumber[12707]: [0:25:52.919740871] [12714] ERROR MediaDevice media_device.cpp:483 /dev/media0[]: Failed to open media de>
Jan 21 08:07:09 czernobog wireplumber[12707]: [0:25:52.919755097] [12714]  INFO DeviceEnumerator device_enumerator.cpp:218 Unable to populate media dev>
Jan 21 08:07:09 czernobog wireplumber[12707]: [0:25:52.919760597] [12714]  WARN DeviceEnumerator device_enumerator_udev.cpp:173 Failed to add device fo>
Jan 21 08:07:16 czernobog kernel: NVRM: Xid (PCI:0000:0b:00): 119, pid=367, name=nv_queue, Timeout waiting for RPC from GSP! Expected function 78 (DUMP>
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119: RPC history (CPU -> GSP):
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         entry        func                                data
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:          0         78 DUMP_PROTOBUF_COMPONENT        0x00000000 0x00000000
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         -1         76 GSP_RM_CONTROL                0x20801117 0x00000001
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         -2         10 FREE                          0x00010011 0x00000000
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         -3         10 FREE                          0x00010016 0x00000000
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         -4         76 GSP_RM_CONTROL                0x50700117 0x00000008
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         -5         10 FREE                          0x0001006f 0x00000000
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         -6         76 GSP_RM_CONTROL                0x50700117 0x00000008
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119:         -7         10 FREE                          0x00010070 0x00000000
Jan 21 08:07:16 czernobog kernel: NVRM _kgspLogXid119: Dumping stack:
Jan 21 08:07:16 czernobog kernel: CPU: 17 PID: 367 Comm: nv_queue Tainted: G           OE      6.1.7-arch1-1 #1 a2d6f1dcaa775aaae1f25aaf758ae968e3493665
Jan 21 08:07:16 czernobog kernel: Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII HERO, BIOS 3801 07/30/2021

[SNIP]

Jan 21 08:07:16 czernobog kernel: NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 78!
Jan 21 08:07:16 czernobog kernel: NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:205
Jan 21 08:07:16 czernobog kernel: NVRM nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from nvdEngineDumpCall>
Jan 21 08:07:29 czernobog wireplumber[12707]: 1 pending linkable(s) not activated in 20sec. This should never happen.
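
The two NVRM Xid events in the log above (79, "GPU has fallen off the bus", and 119, the timeout waiting for an RPC from GSP) are easy to miss among the audit and wireplumber noise. Below is a minimal sketch for pulling just the Xid codes out of journal text. The function name `xid_codes` is made up for illustration, and it assumes the message format shown in this log.

```shell
# Hypothetical helper: print the distinct NVRM Xid codes found in log text on stdin.
# Assumes lines shaped like "NVRM: Xid (PCI:0000:0b:00): 79, ..." as in the log above.
xid_codes() {
    sed -n 's/.*NVRM: Xid (PCI:[^)]*): \([0-9][0-9]*\),.*/\1/p' | sort -un
}

# Example with two lines shortened from the log above:
printf '%s\n' \
    "kernel: NVRM: Xid (PCI:0000:0b:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus." \
    "kernel: NVRM: Xid (PCI:0000:0b:00): 119, pid=367, name=nv_queue, Timeout waiting for RPC from GSP!" |
    xid_codes
```

On a real system the input would come from something like `journalctl -k -b -1` (kernel messages from the previous boot) piped into `xid_codes`.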

Edit: Solved by installing the closed drivers instead of `nvidia-open`.

Last edited by Zer0C00l (2023-01-22 19:35:33)


#2 2023-01-22 12:15:29

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,427


Are you using nvidia-open on fedora as well? The open modules are generally considered experimental for consumer GPUs, so you probably want to check whether you can reproduce this on the proprietary driver.

For the suspend and resume services, did you configure the relevant module settings so that it uses a persistent proper backing storage that has enough free space to house your VRAM? See https://wiki.archlinux.org/title/NVIDIA … er_suspend. By default it uses /tmp, which in turn defaults to a tmpfs held in RAM and limited to half of your RAM, which might be problematic.
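
The "enough free space to house your VRAM" check can be sketched as a small comparison. This is only an illustration: `check_room` is a made-up helper, both sizes are passed in by hand in MiB, and the 5 percent default headroom is an assumption rather than a value taken from this thread.

```shell
# Hypothetical helper: is AVAIL_MIB enough to hold VRAM_MIB plus some headroom?
# The 5 percent default headroom is an assumption; pass a third argument to change it.
# usage: check_room VRAM_MIB AVAIL_MIB [HEADROOM_PERCENT]
check_room() {
    required=$(( ($1 * (100 + ${3:-5}) + 99) / 100 ))   # integer math, rounded up
    if [ "$2" -ge "$required" ]; then
        echo "ok: need ${required} MiB, have $2 MiB"
    else
        echo "too small: need ${required} MiB, have $2 MiB"
    fi
}

# Example: a 24 GiB card (24576 MiB) against 64512 MiB of free space
check_room 24576 64512
```

The real numbers could come from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` (reported in MiB) and, with GNU df, `df --output=avail -BM` on the directory that the driver is configured to use.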


#3 2023-01-22 12:28:29

Zer0C00l
Member
Registered: 2023-01-22
Posts: 3


V1del wrote:

Are you using nvidia-open on fedora as well?

Yep.

V1del wrote:

did you configure the relevant module settings so that it uses a persistent proper backing storage that has enough free space to house your VRAM?

Yes. I have the following in `/etc/modprobe.d/nvidia.conf`:

options nvidia NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=/var/tmp NVreg_UsePageAttributeTable=1
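
Options in a modprobe.d file only apply when the module is loaded, so it can be worth confirming what the running driver actually picked up by reading `/proc/driver/nvidia/params`. A minimal sketch, assuming that file lists one `Name: value` pair per line; `nv_param` is a made-up helper name and the sample text below is illustrative, not output from this system.

```shell
# Hypothetical helper: print the value of one parameter from "Name: value" text on stdin.
nv_param() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Example with sample text in the assumed format:
printf '%s\n' 'PreserveVideoMemoryAllocations: 1' 'TemporaryFilePath: "/var/tmp"' |
    nv_param TemporaryFilePath
```

Against the live driver this would be run as `nv_param PreserveVideoMemoryAllocations < /proc/driver/nvidia/params`.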

`/var` is part of the `/` btrfs subvolume, as created by archinstall with the default options:

➜ df -h
Filesystem      Size  Used Avail Use% Mounted on
dev              63G     0   63G   0% /dev
run              63G  1.9M   63G   1% /run
/dev/nvme0n1p2  1.9T  649G  1.2T  35% /
tmpfs            63G  992K   63G   1% /dev/shm
/dev/nvme0n1p2  1.9T  649G  1.2T  35% /.snapshots
/dev/nvme0n1p2  1.9T  649G  1.2T  35% /home
/dev/nvme0n1p2  1.9T  649G  1.2T  35% /var/cache/pacman/pkg
/dev/nvme0n1p2  1.9T  649G  1.2T  35% /var/log
tmpfs            63G   44K   63G   1% /tmp
/dev/nvme0n1p1  510M  151M  360M  30% /boot
tmpfs            13G  148K   13G   1% /run/user/1000


#4 2023-01-22 15:10:00

seth
Member
Registered: 2012-09-03
Posts: 49,995


Zer0C00l wrote:

Fedora suspend and resume worked flawlessly 100% of the time (when running Sway, but *not* when running Gnome)

Does Fedora/gnome produce the same error you're seeing here?
Fedora/sway on GBM or EGLstreams?
Which kernel does said Fedora run?

cat /proc/cmdline # whether you're setting nvidia-drm.modeset=1

Does the same or a similar problem exist w/ the "regular" nvidia kernel module?

Possibly aspm? Try

pcie_aspm=off
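
To confirm that a parameter such as `pcie_aspm=off` or `nvidia-drm.modeset=1` actually reached the kernel, the command line can be checked word by word. The sketch below uses made-up helper names (`norm_karg`, `has_karg`) and treats dashes and underscores in the parameter name as equivalent, which is how the kernel handles them; the sample command line is shortened for illustration.

```shell
# Hypothetical helper: normalise the name part of a kernel argument (dashes become underscores).
norm_karg() {
    case $1 in
        *=*) printf '%s=%s\n' "$(printf '%s' "${1%%=*}" | tr '-' '_')" "${1#*=}" ;;
        *)   printf '%s\n' "$1" | tr '-' '_' ;;
    esac
}

# Hypothetical helper: print "yes" if WANTED appears in CMDLINE, otherwise "no".
# usage: has_karg WANTED CMDLINE
has_karg() (
    # Body runs in a subshell so "set -f" and the variables do not leak out.
    wanted=$(norm_karg "$1")
    found=no
    set -f                      # no globbing while splitting on whitespace
    for arg in $2; do
        [ "$(norm_karg "$arg")" = "$wanted" ] && found=yes
    done
    echo "$found"
)

# Example with a shortened sample command line:
cmdline='rw rootfstype=btrfs nvidia_drm.modeset=1'
has_karg nvidia-drm.modeset=1 "$cmdline"
has_karg pcie_aspm=off "$cmdline"
```

On a running system the second argument would be `"$(cat /proc/cmdline)"`.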


#5 2023-01-22 19:27:16

Zer0C00l
Member
Registered: 2023-01-22
Posts: 3


seth wrote:

Does Fedora/gnome produce the same error you're seeing here?

Nope.

seth wrote:

Fedora/sway on GBM or EGLstreams?

GBM, no extra patches.

seth wrote:

Which kernel does said Fedora run?

I no longer have access to that Fedora installation, but it was a few patches behind the current kernel on Arch. This has been happening on Arch ever since I replaced Fedora, and at that point Arch was running the same kernel version Fedora had.

seth wrote:

cat /proc/cmdline # whether you're setting nvidia-drm.modeset=1

initrd=\amd-ucode.img initrd=\initramfs-linux.img root=PARTUUID=ea80b925-a84e-4b4e-8ec2-c1a7c2b4d942 zswap.enabled=0 rootflags=subvol=@ rw intel_pstate=no_hwp rootfstype=btrfs nvidia_drm.modeset=1

seth wrote:

Does the same or a similar problem exist w/ the "regular" nvidia kernel module?

Will try next and report back in a bit.

Edit: tried it and, what do you know, suspend actually works with the regular nvidia module!

seth wrote:

Possibly aspm? Try

pcie_aspm=off

Tried that, and no dice. Same behavior.

Last edited by Zer0C00l (2023-01-22 19:34:30)

