
#1 2024-09-19 06:11:28

Bidski
Member
Registered: 2019-07-31
Posts: 44

Issues with nvidia GPU

I have recently noticed that my laptop doesn't shut down properly. This message is displayed every couple of seconds until I finally hit the power button:

nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state

Here is a dump of "journalctl -k -b -1" that I took of the logs from yesterday: https://0x0.st/X34S.txt

I haven't yet noticed any other issues, but I did see these messages in my kernel logs this morning

Sep 19 08:54:47 kernel: NVRM: failed to allocate vmap() page descriptor table!
Sep 19 08:54:47 kernel: NVRM: osMapSystemMemory: failed to create system memory kernel mapping!
Sep 19 08:54:47 kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from memdescMap(*ppMemdescRadix3, 0, allocSize, NV_TRUE, NV_PROTECT_WRITEABLE, &pVaKernel, &pPrivKernel) @ kernel_gsp.c:4188
Sep 19 08:54:47 kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspCreateRadix3(pGpu, pKernelGsp, &pKernelGsp->pSRRadix3Descriptor, NULL, NULL, gspfwSRMeta.sizeOfSuspendResumeData) @ kernel_gsp_tu102.c:1052
Sep 19 08:54:47 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspSavePowerMgmtState_HAL(pGpu, pKernelGsp) @ gpu_suspend.c:114
Sep 19 08:54:47 kernel: nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
Sep 19 09:04:53 kernel: NVRM: Error in service of callback

I am currently using 560.35.03 of the nvidia Open Kernel Module, but I also saw the same results using the nvidia-dkms package.

I have an NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)

Kernel command line is

kernel: Command line: initrd=\intel-ucode.img initrd=\initramfs-linux.img root="UUID=efa38477-7366-4343-8c9c-c83d0f1aedd2" rw

Has anyone seen this sort of behaviour before, or does anyone know how I can go about debugging this?


#2 2024-09-19 07:35:51

seth
Member
Registered: 2012-09-03
Posts: 58,071

Re: Issues with nvidia GPU

Here is a dump of "journalctl -k -b -1"

That's just a random selection of messages.

Sep 18 17:05:08 kernel: pcieport 0000:03:00.0: not ready 1023ms after resume; giving up
Sep 18 17:05:08 kernel: pcieport 0000:00:07.0: pciehp: Slot(0): Card not present
Sep 18 17:05:08 kernel: pcieport 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 18 17:05:08 kernel: pcieport 0000:04:04.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 18 17:05:08 kernel: pcieport 0000:04:03.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 18 17:05:08 kernel: pcieport 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 18 17:05:08 kernel: pcieport 0000:04:01.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 18 17:05:08 kernel: pcieport 0000:04:02.0: Unable to change power state from D3cold to D0, device inaccessible
Sep 18 17:05:08 kernel: pcieport 0000:04:03.0: Runtime PM usage count underflow!
Sep 18 17:05:08 kernel: pcieport 0000:04:02.0: Runtime PM usage count underflow!
Sep 18 17:05:08 kernel: pcieport 0000:04:01.0: Runtime PM usage count underflow!

and multiple devices seem to act up? (Though the context might matter)

Please post your complete system journal for the boot:

sudo journalctl -b -1 | curl -F 'file=@-' 0x0.st

But in doubt you're facing https://bbs.archlinux.org/viewtopic.php … 7#p2181317
You can just add "nvidia.NVreg_EnableGpuFirmware=0" to the https://wiki.archlinux.org/title/Kernel_parameters to test this.
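To confirm after a reboot that the parameter actually reached the kernel, a check along these lines may help. This is only a minimal sketch: `has_kernel_param` is a hypothetical helper name (not part of any package) and it does nothing more than a whole-word match against the text of /proc/cmdline.

```shell
# Hypothetical helper: succeed if parameter $1 appears as a whole word in
# the command-line text $2 (kernel parameters are separated by spaces).
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

param="nvidia.NVreg_EnableGpuFirmware=0"

# /proc/cmdline holds the command line of the currently running kernel.
if [ -r /proc/cmdline ]; then
    if has_kernel_param "$param" "$(cat /proc/cmdline)"; then
        echo "$param is set for this boot"
    else
        echo "$param is NOT on the kernel command line"
    fi
fi
```

The same string also shows up in the "Command line:" entry near the top of the journal for that boot, so either place can be used to verify it.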


#3 2024-09-29 23:07:43

Bidski
Member
Registered: 2019-07-31
Posts: 44

Re: Issues with nvidia GPU

seth wrote:

But in doubt you're facing https://bbs.archlinux.org/viewtopic.php … 7#p2181317
You can just add "nvidia.NVreg_EnableGpuFirmware=0" to the https://wiki.archlinux.org/title/Kernel_parameters to test this.

Thought this had fixed it as the issue seemed to have gone away for a bit, but it returned the other day. Here are the logs from that boot https://0x0.st/Xgjf.txt


#4 2024-09-30 08:19:00

seth
Member
Registered: 2012-09-03
Posts: 58,071

Re: Issues with nvidia GPU

a) the kernel parameter isn't in that journal
b) do you (still) get the same issue w/ the binary driver instead of nvidia-open

Sep 27 08:33:29 4TELLT129 kernel: ------------[ cut here ]------------
Sep 27 08:33:29 4TELLT129 kernel: WARNING: CPU: 13 PID: 1167 at include/linux/rwsem.h:80 follow_pte+0x1de/0x200
Sep 27 08:33:29 4TELLT129 kernel: Modules linked in: rfcomm xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc nf_conntrack_netlink xfrm_user xfrm_algo ip6table_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype evdi(OE) nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core dimlib cifs_md4 dns_resolver netfs cmac algif_hash algif_skcipher af_alg ip6table_filter ip6_tables iptable_filter overlay bnep tun nvidia_drm(OE) nvidia_modeset(OE) vfat fat snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_soc_acpi_intel_match soundwire_generic_allocation snd_soc_acpi soundwire_bus snd_soc_avs snd_soc_hda_codec snd_hda_ext_core snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine r8153_ecm cdc_ether usbnet joydev snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common intel_uncore_frequency
Sep 27 08:33:29 4TELLT129 kernel:  intel_uncore_frequency_common intel_tcc_cooling iwlmvm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_scodec_component crct10dif_pclmul mac80211 crc32_pclmul polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 libarc4 snd_hda_intel sha256_ssse3 ptp snd_intel_dspcfg sha1_ssse3 snd_intel_sdw_acpi pps_core uvcvideo aesni_intel snd_hda_codec btusb videobuf2_vmalloc uvc crypto_simd btrtl snd_hda_core videobuf2_memops cryptd videobuf2_v4l2 btintel snd_hwdep iwlwifi iTCO_wdt videodev btbcm hid_multitouch mei_pxp snd_pcm r8169 intel_pmc_bxt mei_hdcp ee1004 vboxnetflt(OE) btmtk iTCO_vendor_support rapl vboxnetadp(OE) intel_cstate cfg80211 bluetooth r8152 ucsi_acpi videobuf2_common realtek snd_timer vboxdrv(OE) i2c_i801 spi_nor mdio_devres mei_me mii intel_lpss_pci snd typec_ucsi i2c_smbus intel_lpss mousedev mc intel_uncore psmouse pcspkr typec libphy mtd i2c_mux mei i2c_hid_acpi thunderbolt idma64 soundcore rfkill
Sep 27 08:33:29 4TELLT129 kernel:  intel_pmc_core roles i2c_hid nvidia_uvm(OE) intel_vsec intel_hid pmt_telemetry sparse_keymap pmt_class pinctrl_tigerlake acpi_pad mac_hid nvidia(OE) sg crypto_user dm_mod loop nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec hid_generic usbhid i915 serio_raw i2c_algo_bit drm_buddy sdhci_pci atkbd nvme ttm cqhci libps2 intel_gtt sdhci vivaldi_fmap nvme_core mxm_wmi spi_intel_pci drm_display_helper xhci_pci mmc_core crc32c_intel spi_intel xhci_pci_renesas nvme_auth cec i8042 video serio wmi
Sep 27 08:33:29 4TELLT129 kernel: CPU: 13 PID: 1167 Comm: nv_queue Tainted: G           OE      6.10.10-arch1-1 #1 e28ee6293423e91d57555c4cc06eb839714254b7
Sep 27 08:33:29 4TELLT129 kernel: Hardware name: Metabox Alpha-S NP50HP/NP5x_NP6x_NP7xHP, BIOS 1.07.04TMB1 07/19/2021
Sep 27 08:33:29 4TELLT129 kernel: RIP: 0010:follow_pte+0x1de/0x200
Sep 27 08:33:29 4TELLT129 kernel: Code: cc cc cc 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 20 e8 b2 f5 ff ff 48 8b 35 6b dd 5c 01 48 81 e6 00 00 00 c0 eb 8d <0f> 0b 48 3b 1f 0f 83 50 fe ff ff bd ea ff ff ff eb b6 49 8b 3c 24
Sep 27 08:33:29 4TELLT129 kernel: RSP: 0018:ffffa5af438f3b48 EFLAGS: 00010246
Sep 27 08:33:29 4TELLT129 kernel: RAX: 0000000000000000 RBX: 00007759cd0ab000 RCX: ffffa5af438f3b88
Sep 27 08:33:29 4TELLT129 kernel: RDX: ffffa5af438f3b80 RSI: 00007759cd0ab000 RDI: ffff9a10c2115e30
Sep 27 08:33:29 4TELLT129 kernel: RBP: ffffa5af438f3bc8 R08: ffffa5af438f3d20 R09: 0000000000000000
Sep 27 08:33:29 4TELLT129 kernel: R10: 0000000000000001 R11: 0000000000000003 R12: ffffa5af438f3b88
Sep 27 08:33:29 4TELLT129 kernel: R13: ffffa5af438f3b80 R14: ffff9a108d5ddd80 R15: 0000000000000000
Sep 27 08:33:29 4TELLT129 kernel: FS:  0000000000000000(0000) GS:ffff9a13fb480000(0000) knlGS:0000000000000000
Sep 27 08:33:29 4TELLT129 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 08:33:29 4TELLT129 kernel: CR2: 0000241822073180 CR3: 0000000128ba0006 CR4: 0000000000f70ef0
Sep 27 08:33:29 4TELLT129 kernel: PKRU: 55555554
Sep 27 08:33:29 4TELLT129 kernel: Call Trace:
Sep 27 08:33:29 4TELLT129 kernel:  <TASK>
Sep 27 08:33:29 4TELLT129 kernel:  ? follow_pte+0x1de/0x200
Sep 27 08:33:29 4TELLT129 kernel:  ? __warn.cold+0x8e/0xe8
Sep 27 08:33:29 4TELLT129 kernel:  ? follow_pte+0x1de/0x200
Sep 27 08:33:29 4TELLT129 kernel:  ? report_bug+0xff/0x140
Sep 27 08:33:29 4TELLT129 kernel:  ? handle_bug+0x3c/0x80
Sep 27 08:33:29 4TELLT129 kernel:  ? exc_invalid_op+0x17/0x70
Sep 27 08:33:29 4TELLT129 kernel:  ? asm_exc_invalid_op+0x1a/0x20
Sep 27 08:33:29 4TELLT129 kernel:  ? follow_pte+0x1de/0x200
Sep 27 08:33:29 4TELLT129 kernel:  follow_phys+0x49/0x110
Sep 27 08:33:29 4TELLT129 kernel:  untrack_pfn+0x55/0x120
Sep 27 08:33:29 4TELLT129 kernel:  unmap_single_vma+0xa6/0xe0
Sep 27 08:33:29 4TELLT129 kernel:  zap_page_range_single+0x122/0x1d0
Sep 27 08:33:29 4TELLT129 kernel:  unmap_mapping_range+0x116/0x140
Sep 27 08:33:29 4TELLT129 kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia 60797da24455cf8cecb9ac63336161f36efd1db6]
Sep 27 08:33:29 4TELLT129 kernel:  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia 60797da24455cf8cecb9ac63336161f36efd1db6]
Sep 27 08:33:29 4TELLT129 kernel:  RmHandleIdleSustained+0x3b/0x140 [nvidia 60797da24455cf8cecb9ac63336161f36efd1db6]
Sep 27 08:33:29 4TELLT129 kernel:  ? gpumgrGetGpu+0x69/0xa0 [nvidia 60797da24455cf8cecb9ac63336161f36efd1db6]
Sep 27 08:33:29 4TELLT129 kernel:  rm_execute_work_item+0xda/0x150 [nvidia 60797da24455cf8cecb9ac63336161f36efd1db6]
Sep 27 08:33:29 4TELLT129 kernel:  _main_loop+0x95/0x150 [nvidia 60797da24455cf8cecb9ac63336161f36efd1db6]
Sep 27 08:33:29 4TELLT129 kernel:  kthread+0xcf/0x100
Sep 27 08:33:29 4TELLT129 kernel:  ? __pfx_kthread+0x10/0x10
Sep 27 08:33:29 4TELLT129 kernel:  ret_from_fork+0x31/0x50
Sep 27 08:33:29 4TELLT129 kernel:  ? __pfx_kthread+0x10/0x10
Sep 27 08:33:29 4TELLT129 kernel:  ret_from_fork_asm+0x1a/0x30
Sep 27 08:33:29 4TELLT129 kernel:  </TASK>
Sep 27 08:33:29 4TELLT129 kernel: ---[ end trace 0000000000000000 ]---

or when disabling zswap?

Looks like https://bbs.archlinux.org/viewtopic.php?id=290126 which is about resuming from sleep but your GPU is a render device and will likely enter rtd3, you could try to disable that at the cost of increased battery drain: https://wiki.archlinux.org/title/PRIME# … Management
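Whether the GPU is allowed to runtime-suspend, and what state it is in right now, can be read from sysfs. The sketch below assumes the GPU sits at 0000:01:00.0 (the address in the "can't suspend" line from the first post); `runtime_pm_info` is a hypothetical helper name. A control value of "auto" means the kernel may suspend the device when idle, "on" keeps it powered.

```shell
# Hypothetical helper: print runtime PM control and status for the PCI device
# directory $1 (for example /sys/bus/pci/devices/0000:01:00.0).
runtime_pm_info() {
    for f in control runtime_status; do
        if [ -r "$1/power/$f" ]; then
            echo "$f: $(cat "$1/power/$f")"
        else
            echo "$f: unknown"
        fi
    done
}

# Assumption: the GPU address matches the one reported in the kernel log.
runtime_pm_info /sys/bus/pci/devices/0000:01:00.0
```

If the status reads "suspended" while nothing is using the GPU, the device is being powered down at runtime regardless of whether system sleep is masked.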


#5 2024-09-30 22:55:18

Bidski
Member
Registered: 2019-07-31
Posts: 44

Re: Issues with nvidia GPU

seth wrote:

a) the kernel parameter isn't in that journal

I added it to /etc/modprobe.d/nvidia.conf

options nvidia "NVreg_EnableGpuFirmware=0"

I have rerun mkinitcpio -P since doing that
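One way to see whether the module option was picked up is to read the value the loaded driver reports. This is a sketch under the assumption that /proc/driver/nvidia/params lists registry keys as "Key: value" lines; `nvreg_value` is a hypothetical helper name.

```shell
# Hypothetical helper: print the value of registry key $1 from "Key: value"
# lines on stdin (the assumed layout of /proc/driver/nvidia/params).
nvreg_value() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

params=/proc/driver/nvidia/params
if [ -r "$params" ]; then
    echo "EnableGpuFirmware = $(nvreg_value EnableGpuFirmware < "$params")"
else
    echo "$params not readable (nvidia module not loaded?)"
fi
```

If the value shown is not 0, the option in /etc/modprobe.d was not applied when the module was loaded.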

seth wrote:

b) do you (still) get the same issue w/ the binary driver instead of nvidia-open

I haven't tried yet. I will try disabling zswap with the nvidia-open driver first, then try with binary driver

seth wrote:

or when disabling zswap?

Didn't know this was a thing, will give it a try.
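For reference, the current zswap state can be read from the module's parameter file. A minimal sketch, with `zswap_state` being a hypothetical helper name:

```shell
# Hypothetical helper: print the zswap state from parameter file $1.
zswap_state() {
    if [ -r "$1" ]; then
        case "$(cat "$1")" in
            Y|1) echo enabled ;;
            *)   echo disabled ;;
        esac
    else
        echo unavailable
    fi
}

zswap_state /sys/module/zswap/parameters/enabled

# To turn it off until the next reboot (as root):
#   echo 0 > /sys/module/zswap/parameters/enabled
# To keep it off across reboots, add zswap.enabled=0 to the kernel parameters.
```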

seth wrote:

Looks like https://bbs.archlinux.org/viewtopic.php?id=290126 which is about resuming from sleep but your GPU is a render device and will likely enter rtd3, you could try to disable that at the cost of increased battery drain: https://wiki.archlinux.org/title/PRIME# … Management

I don't allow my laptop to sleep or hibernate, those targets are masked on my system and the nvidia power management units are disabled. Is this still likely to be an option?

$ systemctl list-unit-files | grep nvidia
nvidia-hibernate.service                                                      disabled        disabled
nvidia-persistenced.service                                                   disabled        disabled
nvidia-powerd.service                                                         disabled        disabled
nvidia-resume.service                                                         disabled        disabled
nvidia-suspend.service                                                        disabled        disabled

I am also using a thunderbolt dock to drive external monitors (the monitors are connected via thunderbolt -> DisplayPort adapters), could this be related?


#6 2024-10-01 13:17:47

seth
Member
Registered: 2012-09-03
Posts: 58,071

Re: Issues with nvidia GPU

There seem to be no outputs attached to the nvidia GPU.
Another thing would be to enable https://wiki.archlinux.org/title/NVIDIA … de_setting - use the "nvidia_drm.modeset=1" kernel parameter (modprobe.conf won't do!) to get rid of the simpledrm device and also expose the nvidia attached outputs (iff any) to the drm subsystem.
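After rebooting with that parameter, the setting the loaded module ended up with can be read back from sysfs. A minimal sketch, with `drm_modeset_state` being a hypothetical helper name:

```shell
# Hypothetical helper: report the nvidia_drm modeset setting from file $1.
drm_modeset_state() {
    if [ -r "$1" ]; then
        case "$(cat "$1")" in
            Y|1) echo "modeset on" ;;
            *)   echo "modeset off" ;;
        esac
    else
        echo "nvidia_drm not loaded"
    fi
}

drm_modeset_state /sys/module/nvidia_drm/parameters/modeset
```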


#7 2024-10-01 22:22:41

Bidski
Member
Registered: 2019-07-31
Posts: 44

Re: Issues with nvidia GPU

seth wrote:

There seem to be no outputs attached to the nvidia GPU.

This would be correct. The thunderbolt port is, I believe, separate to the nvidia GPU. So I suppose it could be going into a low power mode because it isn't being used?

$ lspci
~~snip~~
00:07.0 PCI bridge: Intel Corporation Tiger Lake-H Thunderbolt 4 PCI Express Root Port #0 (rev 05)
~~snip~~
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)
~~snip~~

seth wrote:

use the "nvidia_drm.modeset=1" kernel parameter

I will give this a go too.


#8 2024-10-02 07:35:51

seth
Member
Registered: 2012-09-03
Posts: 58,071

Re: Issues with nvidia GPU

So I suppose it could be going into a low power mode because it isn't being used?

Yup (I kinda skipped your last line - I've a reflex to click on logs ;-)

I don't expect kms to help you out here, though. It's just generally a good idea.
You might alternatively want to try https://aur.archlinux.org/packages/nvidia-535xx-dkms (nvidia had severe issues w/ the 55yxx drivers; they seem to have been fixed w/ the 560xx ones, but maybe the problem was only shifted somewhere else)
Likewise you can expect issues w/ 6.11 (though a patch for that is pending) and generally may want to test the behavior of the LTS kernel here.

