Suspend to RAM leads to Kernel errors from NVIDIA

javex · 2024-08-12 12:14:11

I've dug through the Wiki articles, NVIDIA docs and various forums and posts, and I can find some related issues, but I'm not sure if I'm experiencing the same problem: whenever I use Sleep (Suspend-to-RAM), a bunch of kernel log entries appear that seem to be backtraces from "nvidia-sleep.sh". The number of them varies, and it's not clear to me what's causing them or whether I'm doing something wrong. Also, I wasn't sure if this belongs here or in Multimedia, so please move it if necessary.

The errors I'm getting look like this:

[  553.155550] WARNING: CPU: 7 PID: 17684 at include/linux/rwsem.h:80 follow_pte+0x1de/0x200
[  553.155553] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cmac nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core dimlib cifs_md4 dns_resolver netfs amd_atl intel_rapl_msr snd_hda_codec_realtek intel_rapl_common snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_scodec_component snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec vfat eeepc_wmi nvidia_drm(POE) fat kvm_amd snd_hda_core asus_wmi nvidia_uvm(POE) nvidia_modeset(POE) platform_profile snd_hwdep i8042 kvm snd_pcm r8169 sparse_keymap serio snd_timer realtek rfkill sp5100_tco mdio_devres gpio_amdpt snd joydev mousedev rapl pcspkr acpi_cpufreq wmi_bmof k10temp video libphy soundcore i2c_piix4 gpio_generic mac_hid nvidia(POE) i2c_dev crypto_user loop nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee hid_generic usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3
[  553.155621]  sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd nvme cryptd ccp nvme_core xhci_pci nvme_auth xhci_pci_renesas wmi
[  553.155631] CPU: 7 PID: 17684 Comm: nvidia-sleep.sh Tainted: P        W  OE      6.10.4-arch2-1 #1 517ed45cc9c4492ee5d5bfc2d2fe6ef1f2e7a8eb
[  553.155633] Hardware name: System manufacturer System Product Name/TUF B450M-PLUS GAMING, BIOS 2006 11/13/2019
[  553.155635] RIP: 0010:follow_pte+0x1de/0x200
[  553.155637] Code: 3c b0 00 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 20 e8 b2 f5 ff ff 48 8b 35 6b e3 5c 01 48 81 e6 00 00 00 c0 eb 8d <0f> 0b 48 3b 1f 0f 83 50 fe ff ff bd ea ff ff ff eb b6 49 8b 3c 24
[  553.155639] RSP: 0018:ffff998bcdc6f7b0 EFLAGS: 00010246
[  553.155641] RAX: 0000000000000000 RBX: 0000713f10baf000 RCX: ffff998bcdc6f7f0
[  553.155643] RDX: ffff998bcdc6f7e8 RSI: 0000713f10baf000 RDI: ffff89f804625730
[  553.155644] RBP: ffff998bcdc6f830 R08: ffff998bcdc6f988 R09: 0000000000000000
[  553.155646] R10: ffff89facefb7af0 R11: 0000000000000000 R12: ffff998bcdc6f7f0
[  553.155647] R13: ffff998bcdc6f7e8 R14: ffff89f7c3387380 R15: 0000000000000000
[  553.155649] FS:  00007f44df062b80(0000) GS:ffff89facef80000(0000) knlGS:0000000000000000
[  553.155651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  553.155652] CR2: 00005fe852f752b8 CR3: 000000013ab4e000 CR4: 0000000000350ef0
[  553.155654] Call Trace:
[  553.155655]  <TASK>
[  553.155656]  ? follow_pte+0x1de/0x200
[  553.155659]  ? __warn.cold+0x8e/0xe8
[  553.155661]  ? follow_pte+0x1de/0x200
[  553.155664]  ? report_bug+0xff/0x140
[  553.155667]  ? handle_bug+0x3c/0x80
[  553.155669]  ? exc_invalid_op+0x17/0x70
[  553.155672]  ? asm_exc_invalid_op+0x1a/0x20
[  553.155677]  ? follow_pte+0x1de/0x200
[  553.155681]  follow_phys+0x49/0x110
[  553.155685]  untrack_pfn+0x55/0x120
[  553.155688]  unmap_single_vma+0xa6/0xe0
[  553.155692]  zap_page_range_single+0x122/0x1d0
[  553.155699]  unmap_mapping_range+0x116/0x140
[  553.155704]  nv_revoke_gpu_mappings_locked+0x47/0x70 [nvidia c7f6e139f220ed9e0f8d83b04273a1b52e844dc8]
[  553.155918]  nv_set_system_power_state+0x1cd/0x470 [nvidia c7f6e139f220ed9e0f8d83b04273a1b52e844dc8]
[  553.156137]  nv_procfs_write_suspend+0xef/0x170 [nvidia c7f6e139f220ed9e0f8d83b04273a1b52e844dc8]
[  553.156352]  proc_reg_write+0x5d/0xa0
[  553.156354]  ? srso_return_thunk+0x5/0x5f
[  553.156356]  vfs_write+0xf8/0x460
[  553.156358]  ? mntput_no_expire+0x4a/0x260
[  553.156362]  ? srso_return_thunk+0x5/0x5f
[  553.156365]  ksys_write+0x6d/0xf0
[  553.156368]  do_syscall_64+0x82/0x190
[  553.156371]  ? srso_return_thunk+0x5/0x5f
[  553.156373]  ? get_page_from_freelist+0x17a0/0x1a30
[  553.156377]  ? srso_return_thunk+0x5/0x5f
[  553.156379]  ? __do_sys_newfstat+0x68/0x70
[  553.156384]  ? srso_return_thunk+0x5/0x5f
[  553.156385]  ? page_counter_uncharge+0x33/0x80
[  553.156389]  ? srso_return_thunk+0x5/0x5f
[  553.156390]  ? drain_stock+0x68/0xa0
[  553.156393]  ? srso_return_thunk+0x5/0x5f
[  553.156395]  ? __refill_stock+0x81/0x90
[  553.156398]  ? srso_return_thunk+0x5/0x5f
[  553.156400]  ? refill_stock+0x1a/0x30
[  553.156402]  ? srso_return_thunk+0x5/0x5f
[  553.156404]  ? srso_return_thunk+0x5/0x5f
[  553.156406]  ? __mem_cgroup_threshold+0x15/0x150
[  553.156408]  ? srso_return_thunk+0x5/0x5f
[  553.156410]  ? memcg_check_events+0x71/0x1c0
[  553.156414]  ? srso_return_thunk+0x5/0x5f
[  553.156416]  ? __mod_memcg_lruvec_state+0xa6/0x150
[  553.156418]  ? srso_return_thunk+0x5/0x5f
[  553.156420]  ? srso_return_thunk+0x5/0x5f
[  553.156422]  ? set_ptes.isra.0+0x28/0x90
[  553.156425]  ? srso_return_thunk+0x5/0x5f
[  553.156427]  ? do_anonymous_page+0xfa/0x820
[  553.156429]  ? __pte_offset_map+0x1b/0x180
[  553.156433]  ? srso_return_thunk+0x5/0x5f
[  553.156434]  ? __handle_mm_fault+0xbe0/0x1050
[  553.156440]  ? srso_return_thunk+0x5/0x5f
[  553.156442]  ? __count_memcg_events+0x58/0xf0
[  553.156445]  ? srso_return_thunk+0x5/0x5f
[  553.156447]  ? count_memcg_events.constprop.0+0x1a/0x30
[  553.156449]  ? srso_return_thunk+0x5/0x5f
[  553.156451]  ? handle_mm_fault+0x1f0/0x300
[  553.156454]  ? srso_return_thunk+0x5/0x5f
[  553.156456]  ? do_user_addr_fault+0x36c/0x620
[  553.156460]  ? srso_return_thunk+0x5/0x5f
[  553.156462]  ? srso_return_thunk+0x5/0x5f
[  553.156464]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  553.156467] RIP: 0033:0x7f44df1df7a4
[  553.156470] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 28 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[  553.156472] RSP: 002b:00007fff75dbb538 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  553.156475] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f44df1df7a4
[  553.156476] RDX: 0000000000000008 RSI: 00005fe852f74eb0 RDI: 0000000000000001
[  553.156478] RBP: 00007fff75dbb560 R08: 0000000000000410 R09: 0000000000000001
[  553.156479] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000008
[  553.156481] R13: 00005fe852f74eb0 R14: 00007f44df2bb5c0 R15: 00007f44df2b8ea0
[  553.156486]  </TASK>
[  553.156487] ---[ end trace 0000000000000000 ]---

The number of times this happens varies; this last time journald actually started missing messages: "systemd-journald[561]: Missed 8 kernel messages"

The reason I'm looking into this is that my system sometimes doesn't return from suspend properly and just shows a black screen. When I check "journalctl -b-1", these traces are usually among the last things logged. Because I can't reliably reproduce the failure to return from suspend, I don't know for certain that the two issues are related, but the number of errors here doesn't seem right to me, and they do happen when I trigger a suspend, so I'm trying to resolve them. I'm out of search engine results, so I'm hoping for some help on how to debug this. It's been happening across multiple driver & kernel versions, including the latest, so updating has not fixed it.
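To put a number on "how many traces per suspend", something like the following could be piped behind journalctl. This is only a rough sketch: the helper name count_kernel_warnings is made up for illustration, and it assumes every splat starts with a "WARNING: CPU: " line like the one quoted above.

```shell
# Hypothetical helper: count kernel WARNING splats in log text read from stdin.
# Assumes each splat begins with a line containing "WARNING: CPU: ".
count_kernel_warnings() {
    # grep -c prints 0 but exits non-zero when nothing matches; keep status 0.
    grep -c 'WARNING: CPU: ' || true
}

# Example use against the previous boot's kernel log (not run here):
#   journalctl -k -b -1 --no-pager | count_kernel_warnings
```

Since journald reported missed kernel messages, the count may be lower than the real number of warnings.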

Some package versions:

% pacman -Q linux nvidia nvidia-utils systemd
linux 6.10.4.arch2-1
nvidia 555.58.02-15
nvidia-utils 555.58.02-1
systemd 256.4-1

My "mkinitcpio.conf" (comments removed)

MODULES=()

BINARIES=()

FILES=()

HOOKS=(base systemd keyboard autodetect modconf block sd-encrypt lvm2 filesystems fsck)

My cmdline:

initrd=\amd-ucode.img initrd=\initramfs-linux.img rd.luks.name=dc74f6fb-000c-48a7-ac16-1d9a308691c1=cryptlvm rd.luks.options=discard root=/dev/mapper/system-root random.trust_cpu=on resume=/dev/mapper/system-swap nvidia_drm.fbdev=1 delayacct quiet rw

Note the "nvidia_drm.fbdev=1" is here because I was not able to start a GUI without it, possibly related to https://gitlab.archlinux.org/archlinux/ … /issues/53, but not sure. The parameter fixed that issue though.

And modprobe:

% cat /etc/modprobe.d/*
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
options nvidia_drm modeset=1

Note that I've played around with the top two options for the "nvidia" module: I've removed them and changed the top one to 0 (and stopped/disabled the nvidia-{suspend,hibernate,resume}.service units). I couldn't find a combination that caused the issue to stop, except when "NVreg_PreserveVideoMemoryAllocations=0" was set, in which case this issue appeared instead (which recommends what I already had).
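When switching between these combinations, it may help to confirm what the loaded driver actually picked up, since edits to /etc/modprobe.d only apply once the module is reloaded. A minimal sketch, assuming the driver exposes its settings as "Name: value" lines in /proc/driver/nvidia/params (without the NVreg_ prefix); the helper name nv_param is made up for illustration:

```shell
# Hypothetical helper: print the value of one parameter from "Name: value"
# lines read on stdin. The line format is an assumption about
# /proc/driver/nvidia/params.
nv_param() {
    awk -v key="$1" 'index($0, key ": ") == 1 { print substr($0, length(key) + 3) }'
}

# Example use on a running system (not run here):
#   nv_param PreserveVideoMemoryAllocations < /proc/driver/nvidia/params
#   systemctl is-enabled nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service
```

If the printed value does not match the modprobe configuration, the old setting is presumably still in effect from the previous module load.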

I feel like I've exhausted all my options, because all the workarounds I've found suggest going back to the "old way" with "NVreg_PreserveVideoMemoryAllocations=0", but that doesn't work either. Here's a collection of posts I've found, none of which helped:

Any hints on what to do next?

Last edited by javex (2024-08-12 13:18:38)

seth · 2024-08-12 15:25:54

https://bbs.archlinux.org/viewtopic.php?id=293400&p=4

nvidia-535xx-dkms from the AUR, nvidia-open (though STR has been a constant issue w/ that in the past unrelated to the current situation) or the "experimental" 550.107.02 driver linked in that thread.

fish4terrisa-MSDSM · 2024-08-12 15:34:27

The dmesg output is similar to the nv_queue crashes I experienced with nvidia-open-beta+linux-mainline on a laptop with an RTX 4060.
Changing to linux-lts fixed the problem for me.
Maybe you could give that a try?

birdie-github · 2024-08-30 10:04:31

Has already been reported here: https://bbs.archlinux.org/viewtopic.php?id=297997
