I've dug through the Wiki articles, NVIDIA docs and various forum posts, and I can find some related issues, but I'm not sure I'm experiencing the same problem: whenever I use Sleep (Suspend-to-RAM), the kernel logs a bunch of backtraces that seem to come from "nvidia-sleep.sh". The number of traces varies, and it's not clear to me what's causing them or whether I'm doing something wrong. Also, I wasn't sure if this belongs here or in Multimedia, so please move it if necessary.
The errors I'm getting look like this:
[ 553.155550] WARNING: CPU: 7 PID: 17684 at include/linux/rwsem.h:80 follow_pte+0x1de/0x200
[ 553.155553] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device cmac nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core dimlib cifs_md4 dns_resolver netfs amd_atl intel_rapl_msr snd_hda_codec_realtek intel_rapl_common snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_scodec_component snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec vfat eeepc_wmi nvidia_drm(POE) fat kvm_amd snd_hda_core asus_wmi nvidia_uvm(POE) nvidia_modeset(POE) platform_profile snd_hwdep i8042 kvm snd_pcm r8169 sparse_keymap serio snd_timer realtek rfkill sp5100_tco mdio_devres gpio_amdpt snd joydev mousedev rapl pcspkr acpi_cpufreq wmi_bmof k10temp video libphy soundcore i2c_piix4 gpio_generic mac_hid nvidia(POE) i2c_dev crypto_user loop nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee hid_generic usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3
[ 553.155621] sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd nvme cryptd ccp nvme_core xhci_pci nvme_auth xhci_pci_renesas wmi
[ 553.155631] CPU: 7 PID: 17684 Comm: nvidia-sleep.sh Tainted: P W OE 6.10.4-arch2-1 #1 517ed45cc9c4492ee5d5bfc2d2fe6ef1f2e7a8eb
[ 553.155633] Hardware name: System manufacturer System Product Name/TUF B450M-PLUS GAMING, BIOS 2006 11/13/2019
[ 553.155635] RIP: 0010:follow_pte+0x1de/0x200
[ 553.155637] Code: 3c b0 00 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 20 e8 b2 f5 ff ff 48 8b 35 6b e3 5c 01 48 81 e6 00 00 00 c0 eb 8d <0f> 0b 48 3b 1f 0f 83 50 fe ff ff bd ea ff ff ff eb b6 49 8b 3c 24
[ 553.155639] RSP: 0018:ffff998bcdc6f7b0 EFLAGS: 00010246
[ 553.155641] RAX: 0000000000000000 RBX: 0000713f10baf000 RCX: ffff998bcdc6f7f0
[ 553.155643] RDX: ffff998bcdc6f7e8 RSI: 0000713f10baf000 RDI: ffff89f804625730
[ 553.155644] RBP: ffff998bcdc6f830 R08: ffff998bcdc6f988 R09: 0000000000000000
[ 553.155646] R10: ffff89facefb7af0 R11: 0000000000000000 R12: ffff998bcdc6f7f0
[ 553.155647] R13: ffff998bcdc6f7e8 R14: ffff89f7c3387380 R15: 0000000000000000
[ 553.155649] FS: 00007f44df062b80(0000) GS:ffff89facef80000(0000) knlGS:0000000000000000
[ 553.155651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 553.155652] CR2: 00005fe852f752b8 CR3: 000000013ab4e000 CR4: 0000000000350ef0
[ 553.155654] Call Trace:
[ 553.155655] <TASK>
[ 553.155656] ? follow_pte+0x1de/0x200
[ 553.155659] ? __warn.cold+0x8e/0xe8
[ 553.155661] ? follow_pte+0x1de/0x200
[ 553.155664] ? report_bug+0xff/0x140
[ 553.155667] ? handle_bug+0x3c/0x80
[ 553.155669] ? exc_invalid_op+0x17/0x70
[ 553.155672] ? asm_exc_invalid_op+0x1a/0x20
[ 553.155677] ? follow_pte+0x1de/0x200
[ 553.155681] follow_phys+0x49/0x110
[ 553.155685] untrack_pfn+0x55/0x120
[ 553.155688] unmap_single_vma+0xa6/0xe0
[ 553.155692] zap_page_range_single+0x122/0x1d0
[ 553.155699] unmap_mapping_range+0x116/0x140
[ 553.155704] nv_revoke_gpu_mappings_locked+0x47/0x70 [nvidia c7f6e139f220ed9e0f8d83b04273a1b52e844dc8]
[ 553.155918] nv_set_system_power_state+0x1cd/0x470 [nvidia c7f6e139f220ed9e0f8d83b04273a1b52e844dc8]
[ 553.156137] nv_procfs_write_suspend+0xef/0x170 [nvidia c7f6e139f220ed9e0f8d83b04273a1b52e844dc8]
[ 553.156352] proc_reg_write+0x5d/0xa0
[ 553.156354] ? srso_return_thunk+0x5/0x5f
[ 553.156356] vfs_write+0xf8/0x460
[ 553.156358] ? mntput_no_expire+0x4a/0x260
[ 553.156362] ? srso_return_thunk+0x5/0x5f
[ 553.156365] ksys_write+0x6d/0xf0
[ 553.156368] do_syscall_64+0x82/0x190
[ 553.156371] ? srso_return_thunk+0x5/0x5f
[ 553.156373] ? get_page_from_freelist+0x17a0/0x1a30
[ 553.156377] ? srso_return_thunk+0x5/0x5f
[ 553.156379] ? __do_sys_newfstat+0x68/0x70
[ 553.156384] ? srso_return_thunk+0x5/0x5f
[ 553.156385] ? page_counter_uncharge+0x33/0x80
[ 553.156389] ? srso_return_thunk+0x5/0x5f
[ 553.156390] ? drain_stock+0x68/0xa0
[ 553.156393] ? srso_return_thunk+0x5/0x5f
[ 553.156395] ? __refill_stock+0x81/0x90
[ 553.156398] ? srso_return_thunk+0x5/0x5f
[ 553.156400] ? refill_stock+0x1a/0x30
[ 553.156402] ? srso_return_thunk+0x5/0x5f
[ 553.156404] ? srso_return_thunk+0x5/0x5f
[ 553.156406] ? __mem_cgroup_threshold+0x15/0x150
[ 553.156408] ? srso_return_thunk+0x5/0x5f
[ 553.156410] ? memcg_check_events+0x71/0x1c0
[ 553.156414] ? srso_return_thunk+0x5/0x5f
[ 553.156416] ? __mod_memcg_lruvec_state+0xa6/0x150
[ 553.156418] ? srso_return_thunk+0x5/0x5f
[ 553.156420] ? srso_return_thunk+0x5/0x5f
[ 553.156422] ? set_ptes.isra.0+0x28/0x90
[ 553.156425] ? srso_return_thunk+0x5/0x5f
[ 553.156427] ? do_anonymous_page+0xfa/0x820
[ 553.156429] ? __pte_offset_map+0x1b/0x180
[ 553.156433] ? srso_return_thunk+0x5/0x5f
[ 553.156434] ? __handle_mm_fault+0xbe0/0x1050
[ 553.156440] ? srso_return_thunk+0x5/0x5f
[ 553.156442] ? __count_memcg_events+0x58/0xf0
[ 553.156445] ? srso_return_thunk+0x5/0x5f
[ 553.156447] ? count_memcg_events.constprop.0+0x1a/0x30
[ 553.156449] ? srso_return_thunk+0x5/0x5f
[ 553.156451] ? handle_mm_fault+0x1f0/0x300
[ 553.156454] ? srso_return_thunk+0x5/0x5f
[ 553.156456] ? do_user_addr_fault+0x36c/0x620
[ 553.156460] ? srso_return_thunk+0x5/0x5f
[ 553.156462] ? srso_return_thunk+0x5/0x5f
[ 553.156464] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 553.156467] RIP: 0033:0x7f44df1df7a4
[ 553.156470] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 28 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[ 553.156472] RSP: 002b:00007fff75dbb538 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 553.156475] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f44df1df7a4
[ 553.156476] RDX: 0000000000000008 RSI: 00005fe852f74eb0 RDI: 0000000000000001
[ 553.156478] RBP: 00007fff75dbb560 R08: 0000000000000410 R09: 0000000000000001
[ 553.156479] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000008
[ 553.156481] R13: 00005fe852f74eb0 R14: 00007f44df2bb5c0 R15: 00007f44df2b8ea0
[ 553.156486] </TASK>
[ 553.156487] ---[ end trace 0000000000000000 ]---
The number of times this happens varies; this last time journald actually started dropping messages: "systemd-journald[561]: Missed 8 kernel messages"
The reason I'm looking into this is that my system sometimes doesn't return from suspend properly and just shows a black screen. When I check "journalctl -b-1", these traces are usually among the last things logged. Because I can't reliably reproduce the failure to resume, I don't know for certain that the two are related, but the number of errors doesn't seem right to me and they do appear whenever I trigger a suspend, so I'm trying to resolve them. I'm out of search engine results, so I'm hoping for some help on how to debug this. It has happened across multiple driver and kernel versions, including the latest, so updating has not fixed it.
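To put a number on how often this happens per boot, the kernel log can be summarized. This is only a minimal sketch: it assumes each trace contains exactly one "Comm: nvidia-sleep.sh" line, as in the output above, and that journald reports drops as "Missed N kernel messages". The function name summarize_sleep_traces is made up for illustration.

```shell
# Summarize a kernel log read from stdin:
#   traces = lines naming nvidia-sleep.sh as the task (assumed one per WARNING trace)
#   missed = sum of N over all "Missed N kernel messages" lines from journald
summarize_sleep_traces() {
    awk '
        /Comm: nvidia-sleep\.sh/ { traces++ }
        /Missed [0-9]+ kernel messages/ {
            # The count is the field right after the word "Missed".
            for (i = 1; i < NF; i++)
                if ($i == "Missed") missed += $(i + 1)
        }
        END { printf "traces=%d missed=%d\n", traces, missed }
    '
}
```

For the previous boot it could be fed with `journalctl -k -b -1 --no-pager | summarize_sleep_traces`. If "missed" is non-zero, the trace count is a lower bound.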
Some package versions:
% pacman -Q linux nvidia nvidia-utils systemd
linux 6.10.4.arch2-1
nvidia 555.58.02-15
nvidia-utils 555.58.02-1
systemd 256.4-1
My "mkinitcpio.conf" (comments removed):
MODULES=()
BINARIES=()
FILES=()
HOOKS=(base systemd keyboard autodetect modconf block sd-encrypt lvm2 filesystems fsck)
My cmdline:
initrd=\amd-ucode.img initrd=\initramfs-linux.img rd.luks.name=dc74f6fb-000c-48a7-ac16-1d9a308691c1=cryptlvm rd.luks.options=discard root=/dev/mapper/system-root random.trust_cpu=on resume=/dev/mapper/system-swap nvidia_drm.fbdev=1 delayacct quiet rw
Note that "nvidia_drm.fbdev=1" is there because I was not able to start a GUI without it, possibly related to https://gitlab.archlinux.org/archlinux/ … /issues/53, but I'm not sure. The parameter did fix that issue, though.
And modprobe:
% cat /etc/modprobe.d/*
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
options nvidia_drm modeset=1
Note that I've played around with the top two options for the "nvidia" module: I've removed them, and I've changed the top one to 0 (and stopped/disabled the nvidia-{suspend,hibernate,resume}.service units). I couldn't find a combination that made the traces stop, except with "NVreg_PreserveVideoMemoryAllocations=0", in which case this issue appeared instead (which recommends what I already had).
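When toggling these options, it may be worth confirming what the loaded driver actually picked up, since a stale initramfs or an already-loaded module can leave old values in place. A small sketch, assuming the driver exposes its settings in /proc/driver/nvidia/params as "Name: value" lines; the function name show_nvidia_sleep_params is made up for illustration.

```shell
# Print the two suspend-related NVIDIA settings from a params file.
# With no argument it reads the live driver's /proc/driver/nvidia/params
# (assumed to contain one "Name: value" pair per line).
show_nvidia_sleep_params() {
    params_file="${1:-/proc/driver/nvidia/params}"
    grep -E '^(PreserveVideoMemoryAllocations|TemporaryFilePath):' "$params_file"
}
```

Run without arguments on the live system after a reboot. `systemctl is-enabled nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service` shows whether the units mentioned above are currently enabled.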
I feel like I've exhausted all my options, because all the workarounds I've found suggest going back to the "old way" with "NVreg_PreserveVideoMemoryAllocations=0", but that doesn't work either. Here's a collection of posts I've found, none of which helped:
Any hints on what to do next?
Last edited by javex (2024-08-12 13:18:38)
https://bbs.archlinux.org/viewtopic.php?id=293400&p=4
Try nvidia-535xx-dkms from the AUR, nvidia-open (though suspend-to-RAM has been a constant issue with that in the past, unrelated to the current situation), or the "experimental" 550.107.02 driver linked in that thread.
The dmesg output is similar to the nv_queue crashes I experienced with nvidia-open-beta + linux-mainline on a laptop with an RTX 4060.
Switching to linux-lts fixed the problem for me.
Maybe give it a try?
Has already been reported here: https://bbs.archlinux.org/viewtopic.php?id=297997