#26 2025-07-03 22:18:54

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

Looks completely normal, nothing unusual.

The questions are
1. what processes are (still) using the GPU
2. how much VRAM is in use.
I trust that it will always render a nice table…

even though I downgraded _all_ NVIDIA packages at once as intended

You'd have to downgrade the kernel accordingly or use nvidia-dkms, https://wiki.archlinux.org/title/Dynami … le_Support

My gut is telling me this is a linux-firmware issue now.

The recent suspend failures are because you're using more VRAM than you have space available at the storage location.

df -h
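
A rough way to put the two numbers side by side, as a sketch only: the helper names `avail_mib` and `vram_fits` are made up for illustration, and the path you pass has to be wherever the driver is actually configured to save VRAM on your system (the thread does not pin that down at this point).

```shell
# avail_mib PATH: free space, in MiB, on the filesystem that holds PATH.
avail_mib() {
    # -P forces the one-line-per-filesystem POSIX format, -k reports KiB.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# vram_fits VRAM_MIB PATH: succeeds if VRAM_MIB MiB would fit below PATH.
vram_fits() {
    [ "$1" -le "$(avail_mib "$2")" ]
}

# Possible use on the real system (worst case: the whole VRAM gets saved):
#   total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)
#   vram_fits "$total" / && echo "enough room" || echo "not enough room"
```

Using the total VRAM rather than the currently used amount is a deliberately pessimistic choice; the used figure from nvidia-smi would give a tighter estimate.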

Offline

#27 2025-07-03 22:37:56

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

The questions are
1. what processes are (still) using the GPU
2. how much VRAM is in use.

Just Steam and Hyprland. That's it. The memory usage before suspension is 49 MB. After suspending and failing it is still 49 MB, and if I close Steam it goes down to 6 MB, but suspend will still fail unless I fully exit my WM. The storage location is root, and I already know that it has *plenty* of space for 4 GB of total VRAM (8 GB left on the root partition), and my swap is 16 GB. It's definitely not because of a lack of VRAM or storage; that would not make much sense when I actively close any game before suspending, and often close Steam as well, and Hyprland uses very little VRAM (at the moment only 1 MB). Everything else also remained the same before and after suspension. Performance level at P8 (lowest), power at 2 W.

seth wrote:

You'd have to downgrade the kernel accordingly

Ah that'd explain it. Oh well.

Last edited by Sync Shard (2025-07-03 22:39:21)


"What is the cost of lies?"

Offline

#28 2025-07-04 06:59:44

seth
Member
Registered: 2012-09-03
Posts: 66,248

Offline

#29 2025-07-04 09:28:21

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

This definitely seems related to the issue. I tested with a game and only went into the main menu, where it uses only 579 MB of VRAM. I can quit out, then suspend, but it fails, even though I never reached the VRAM limit of my GPU.

However I looked at that forum, and saw there was a patch in the next version, so I checked the AL NVIDIA packages and it has incremented versions, so I went ahead and installed the update.

I can suspend now, and probably hibernate now completely fine. That solves the major issue now. However, there is still the issue that my GPU only appears on the bus 8 seconds or so into boot which was my original issue.

The secondary issue was fixed by just updating. The boot issue, though, I feel is going to be a pain to find the source of. I could just wait for the next updates, as the boot issue only means I have to restart nvidia-persistenced whenever I boot, and then it's fine. The original issue in and of itself hasn't changed in my journal logs. It's still the exact same errors:

NVIDIA module initializes -> "nvidia-nvlink: Nvlink Core is being initialized, major device number 235" -> No NVIDIA GPU found -> Failed to insert module 'nvidia_drm': No such device -> later on in boot: nvidia-nvlink: Nvlink Core is being initialized, major device number 234 -> nvidia 0000:01:00.0: enabling device (0000 -> 0003) -> nvidia finally loads

Last edited by Sync Shard (2025-07-04 09:28:37)


"What is the cost of lies?"

Offline

#30 2025-07-04 18:47:36

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

I can suspend now, and probably hibernate now completely fine. That solves the major issue now.

\o/

there is still the issue that my GPU only appears on the bus 8 seconds or so into boot which was my original issue

That's somehow gonna hinge on the firmware - the thing very much behaves like a https://wiki.archlinux.org/title/Extern … VIDIA_eGPU - but it's built into the system and not somehow a dock, is it?

Offline

#31 2025-07-05 19:14:34

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

but it's built into the system and not somehow a dock, is it?

The GPU is fully built into the system, and I never had this issue until the 575 drivers. 570 did not have this issue whatsoever. So I suppose this isn't really a package issue, just an NVIDIA issue. It just seems like a semi-harmless regression in NVIDIA GPU detection or initialization.

Hopefully the next updates fix whatever this booting issue really is. As it stands, it doesn't actually break anything and doesn't prevent me from booting; it's just scary and I have to restart nvidia-persistenced.

It's still an _issue_, but it's probably better left to the NVIDIA guys, and I would assume if anyone else was having this problem, they'd already be on the NVIDIA forums reporting about it.

Shall I mark this as solved as the most serious issue has been fixed by simply waiting on an update?

Last edited by Sync Shard (2025-07-05 19:17:52)


"What is the cost of lies?"

Offline

#32 2025-07-05 20:23:34

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

The GPU showing up late is /not/ a driver issue. The driver will not automatically load and no version of it can do anything before the GPU actually appears on the bus.
No driver version/update will *ever* fix that.

Did you recently-ish update your firmware/bios?
Do you still have journals from before the GPU showed up late? ("sudo journalctl -b -δ", increasing the δ will get you older journals)

Offline

#33 2025-07-05 20:46:38

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

Did you recently-ish update your firmware/bios?

Not at all. I've not done a single BIOS/UEFI/firmware update since I got my PC back in 2022/2023 or so. Never touched BIOS/firmware stuff.

seth wrote:

Do you still have journals from before the GPU showed up late? ("sudo journalctl -b -δ", increasing the δ will get you older journals)

I tried to look back a few dozen reboots, but the journal had already rolled over; I had previously set pretty strict size limits. However, I remembered that I have Timeshift backups and do not exclude system journals, so I pulled a journal from one of those. The journals cover a month's worth, every day, so I had to limit it to the first 5000 lines. Here you go, just last month:
https://0x0.st/8GoM.txt

I have a feeling it is all to do with:

May 26 13:07:11 archlinux kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
May 26 13:07:11 archlinux kernel: 
May 26 13:07:11 archlinux kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May 26 13:07:11 archlinux kernel: nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none

Those lines in the journal appear much EARLIER in that one, and the nvidia module loads fine _first try_. Now with 575.64, they appear much LATER. The moment it does that "enabling device", the nvidia GPU can be detected and the module loads. This just _smells_ like a regression.

Last edited by Sync Shard (2025-07-05 20:47:38)


"What is the cost of lies?"

Offline

#34 2025-07-05 21:38:19

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

The journal you posted is from June 11th and the relevant part is

Jun 11 15:54:55 archlinux kernel: Linux version 6.14.7-arch2-1 (linux@archlinux) (gcc (GCC) 15.1.1 20250425, GNU ld (GNU Binutils) 2.44.0) #1 SMP PREEMPT_DYNAMIC Thu, 22 May 2025 05:37:49 +0000
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: [10de:1f9d] type 00 class 0x030000 PCIe Legacy Endpoint
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.1: [10de:10fa] type 00 class 0x040300 PCIe Endpoint
Jun 11 15:54:55 archlinux kernel: microcode: Current revision: 0x0860010d

The GPU is on the bus *immediately*
BIOS indeed completely unchanged:

Jun 11 15:54:55 archlinux kernel: DMI: ASUSTeK COMPUTER INC. ASUS TUF Gaming A17 FA706IHRB_FA706IHRB/FA706IHRB, BIOS FA706IHRB.307 12/28/2022
Jun 30 17:40:05 archlinux kernel: DMI: ASUSTeK COMPUTER INC. ASUS TUF Gaming A17 FA706IHRB_FA706IHRB/FA706IHRB, BIOS FA706IHRB.307 12/28/2022

Here's the crucial context:

Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: [10de:1f9d] type 00 class 0x030000 PCIe Legacy Endpoint
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: BAR 0 [mem 0xfb000000-0xfbffffff]
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: BAR 1 [mem 0xb0000000-0xbfffffff 64bit pref]
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: BAR 3 [mem 0xc0000000-0xc1ffffff 64bit pref]
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: BAR 5 [io  0xf000-0xf07f]
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: ROM [mem 0xfc000000-0xfc07ffff pref]
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.1: [10de:10fa] type 00 class 0x040300 PCIe Endpoint
Jun 11 15:54:55 archlinux kernel: pci 0000:01:00.1: BAR 0 [mem 0xfc080000-0xfc083fff]
Jun 11 15:54:55 archlinux kernel: pci 0000:00:01.1: PCI bridge to [bus 01]
Jun 11 15:54:55 archlinux kernel: pci 0000:02:00.0: [10ec:8168] type 00 class 0x020000 PCIe Endpoint
Jun 11 15:54:55 archlinux kernel: pci 0000:02:00.0: BAR 0 [io  0xe000-0xe0ff]
Jun 11 15:54:55 archlinux kernel: pci 0000:02:00.0: BAR 2 [mem 0xfc704000-0xfc704fff 64bit]
Jun 11 15:54:55 archlinux kernel: pci 0000:02:00.0: BAR 4 [mem 0xfc700000-0xfc703fff 64bit]
Jun 11 15:54:55 archlinux kernel: pci 0000:02:00.0: supports D1 D2

turned into

Jun 30 17:40:05 archlinux kernel: pci 0000:00:18.4: [1022:144c] type 00 class 0x060000 conventional PCI endpoint
Jun 30 17:40:05 archlinux kernel: pci 0000:00:18.5: [1022:144d] type 00 class 0x060000 conventional PCI endpoint
Jun 30 17:40:05 archlinux kernel: pci 0000:00:18.6: [1022:144e] type 00 class 0x060000 conventional PCI endpoint
Jun 30 17:40:05 archlinux kernel: pci 0000:00:18.7: [1022:144f] type 00 class 0x060000 conventional PCI endpoint
Jun 30 17:40:05 archlinux kernel: pci 0000:00:01.1: PCI bridge to [bus 01]
Jun 30 17:40:05 archlinux kernel: pci 0000:02:00.0: [10ec:8168] type 00 class 0x020000 PCIe Endpoint
Jun 30 17:40:05 archlinux kernel: pci 0000:02:00.0: BAR 0 [io  0xe000-0xe0ff]
Jun 30 17:40:05 archlinux kernel: pci 0000:02:00.0: BAR 2 [mem 0xfc704000-0xfc704fff 64bit]
Jun 30 17:40:05 archlinux kernel: pci 0000:02:00.0: BAR 4 [mem 0xfc700000-0xfc703fff 64bit]
Jun 30 17:40:05 archlinux kernel: pci 0000:02:00.0: supports D1 D2

and then later

Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: [10de:1f9d] type 00 class 0x030000 PCIe Legacy Endpoint
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: BAR 0 [mem 0x00000000-0x00ffffff]
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: BAR 1 [mem 0x00000000-0x0fffffff 64bit pref]
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: BAR 3 [mem 0x00000000-0x01ffffff 64bit pref]
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: BAR 5 [io  0x0000-0x007f]
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: ROM [mem 0x00000000-0x0007ffff pref]
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: Max Payload Size set to 256 (was 128, max 256)
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: Adding to iommu group 13
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: vgaarb: bridge control possible
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.1: [10de:10fa] type 00 class 0x040300 PCIe Endpoint
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.1: BAR 0 [mem 0x00000000-0x00003fff]
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
Jun 30 17:40:13 ASUS-A17-Arch kernel: pci 0000:01:00.1: Adding to iommu group 13

1. did we already try the LTS kernel?
2. can you roll back to 6.14.7?
3. Try to add "pci=realloc" to the https://wiki.archlinux.org/title/Kernel_parameters (macbooks have that but the GPU doesn't show up at all…)
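
To put a number on the difference between the two boots quoted above, one could measure how long after the first kernel message the GPU is enumerated. This is a minimal sketch: `gpu_enum_delay` is a made-up helper name, it assumes the default journalctl line layout (month, day, time, host, `kernel:`), it uses the address 0000:01:00.0 and vendor ID 10de from the logs above, and it does not handle a boot that crosses midnight.

```shell
# gpu_enum_delay ADDR: read a kernel journal on stdin and print the number of
# seconds between the first kernel line and the enumeration of device ADDR.
# Exits non-zero if the device is never enumerated in the given log.
gpu_enum_delay() {
    awk -v dev="$1" '
        function secs(hms,   t) {
            split(hms, t, ":")
            return t[1] * 3600 + t[2] * 60 + t[3]
        }
        # Remember the timestamp of the very first kernel message.
        $5 == "kernel:" && start == "" { start = secs($3) }
        # First enumeration line for the device with the NVIDIA vendor ID.
        $5 == "kernel:" && $6 == "pci" && $7 == (dev ":") && $8 ~ /^\[10de:/ {
            print secs($3) - start
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}

# Possible use on the real system:
#   journalctl -k -b    | gpu_enum_delay 0000:01:00.0   # current boot
#   journalctl -k -b -3 | gpu_enum_delay 0000:01:00.0   # three boots back
```

Fed the Jun 11 excerpt above it prints 0, and fed the Jun 30 excerpts (17:40:05 to 17:40:13) it prints 8.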

Offline

#35 2025-07-05 21:58:41

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

1. did we already try the LTS kernel?

Yep. Same there. My LTS also has the proprietary nvidia module, not the open.

seth wrote:

2. can you roll back to 6.14.7?

I don't remember what NVIDIA module I had on there, I'd have to roll back all of the NVIDIA stuff as well, and my other kernel. DKMS never played nice with me before and was the cause of OOM killer crashes in the middle of a kernel module build.

seth wrote:

3. Try to add "pci=realloc"

It boots up fine, but still has the same error. The log however does look a little different during early boot (I think?) so here is the log:
http://0x0.st/8GHy.txt
It still has the same issue of NVIDIA enabling the GPU far too late. I can almost guarantee that the NVIDIA gpu *does* appear on the bus, it's just not enabled until the driver enables it (perhaps a timeout in the code finally enabling it?).


"What is the cost of lies?"

Offline

#36 2025-07-05 22:09:42

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

The nvidia driver does not and cannot enable the device, get that out of your head.

Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.0: [1022:1632] type 00 class 0x060000 conventional PCI endpoint
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1: [1022:1633] type 01 class 0x060400 PCIe Root Port
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1: PCI bridge to [bus 01]
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1:   bridge window [io  0xf000-0xffff]
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1:   bridge window [mem 0xfb000000-0xfc0fffff]
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1:   bridge window [mem 0xb0000000-0xc1ffffff 64bit pref]
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1: retraining failed
Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1: PME# supported from D0 D3hot D3cold
lspci -tvnn

Have you tried the behavior w/ any live distro?
grml.org will do - nvidia does NOT matter, we're only looking for when the device actually shows up on the bus, because

I can almost guarantee that the NVIDIA gpu *does* appear on the bus

I can absolutely guarantee that your logs distinctly disagree here.

Edit: also, what happens if you add "rootdelay=3" to the kernel parameters?
This will stall the boot by 3s but maybe the nvidia GPU is then ready by the time the bus is initially probed…

Last edited by seth (2025-07-05 22:28:09)

Offline

#37 2025-07-05 23:08:39

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

The nvidia driver does not and cannot enable the device, get that out of your head.

Ah, if it isn't responsible then I am completely out of ideas, as I have never updated my BIOS/firmware. Not even for a security database update. Nada. I never touch anything in the BIOS either except for a boot menu.

seth wrote:

lspci -tvnn

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex [1022:1630]
           +-00.2  Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU [1022:1631]
           +-01.0  Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
           +-01.1-[01]--+-00.0  NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f9d]
           |            \-00.1  NVIDIA Corporation Device [10de:10fa]
           +-02.0  Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
           +-02.1-[02]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-02.2-[03]----00.0  MEDIATEK Corp. MT7921 802.11ax PCI Express Wireless Network Adapter [14c3:7961]
           +-02.4-[04]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller 980 (DRAM-less) [144d:a809]
           +-08.0  Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge [1022:1632]
           +-08.1-[05]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Renoir [Radeon Vega Series / Radeon Vega Mobile Series] [1002:1636]
           |            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller [1002:1637]
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]
           |            +-00.3  Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
           |            +-00.4  Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1 [1022:1639]
           |            +-00.5  Advanced Micro Devices, Inc. [AMD] Audio Coprocessor [1022:15e2]
           |            \-00.6  Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3]
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b]
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e]
           +-18.0  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 0 [1022:1448]
           +-18.1  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 1 [1022:1449]
           +-18.2  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 2 [1022:144a]
           +-18.3  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 3 [1022:144b]
           +-18.4  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 4 [1022:144c]
           +-18.5  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 5 [1022:144d]
           +-18.6  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 6 [1022:144e]
           \-18.7  Advanced Micro Devices, Inc. [AMD] Renoir Device 24: Function 7 [1022:144f]
seth wrote:

Have you tried the behavior w/ any live distro?

I have my Emergency™ mint live distro, so here is the log of that:
https://0x0.st/8GXB.txt
I looked at the PCI logs myself and they're nearly identical compared to my current boot on Arch. However it was able to detect that there was a GTX 1650 in the live environment as GPU 0, even though it probably doesn't matter.

seth wrote:

I can absolutely guarantee that your logs distinctly disagree here.

If this was a BIOS/UEFI issue, then I have no clue why it wasn't appearing before, only now. If this was a kernel problem then almost everyone would be hearing of the issue, so it isn't that either. This makes my head hurt. :(


"What is the cost of lies?"

Offline

#38 2025-07-06 06:23:07

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

The nvidia GPU doesn't show up in the mint log *at all* (just grep it for "10de") and it is

           +-01.1-[01]--+-00.0  NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f9d]

this device

Jul 05 17:44:21 archlinux kernel: pci 0000:00:01.1: retraining failed
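
The two checks above can be sketched as small filters over a saved log. The function names are made up for illustration, and the patterns are derived only from the kernel lines quoted in this thread, so other kernels may word these messages differently.

```shell
# nvidia_pci_lines: keep only PCI enumeration lines carrying NVIDIA's vendor
# ID (10de). Exits non-zero if the log on stdin contains none.
nvidia_pci_lines() {
    grep -E 'kernel: pci [0-9a-f:.]+: \[10de:[0-9a-f]{4}\]'
}

# link_failures: keep lines where a port reports a broken or failed link
# retraining, like the 0000:00:01.1 root port in the log above.
link_failures() {
    grep -E 'kernel: pci [0-9a-f:.]+: (broken device, retraining|retraining failed)'
}

# Possible use, with mint.log being a hypothetical local copy of the log:
#   nvidia_pci_lines < mint.log || echo "GPU never enumerated in this log"
#   link_failures    < mint.log
```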

Did you attempt a slight rootwait?
My current theory is that it's marginally late, still gets missed (by an inch, missed by a mile) until the persistenced starts, which possibly triggers a pci rescan, and that's when it finally gets probed.
In that case you might get away with waiting "a bit"™ early on to not have to wait 8s (or whenever the bus gets rescanned)

I have no clue why it wasn't appearing before

Did you change any UEFI settings? Add or remove devices to the system? Does it make a difference whether you're booting on battery or AC?

Offline

#39 2025-07-06 11:03:20

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

Did you attempt a slight rootwait?

Does nothing. Tried both rootwait and rootdelay for what it's worth.

seth wrote:

Did you change any UEFI settings?

Again I never touch the BIOS/UEFI like ever. Only for the boot menu every now and then. However I did change one setting _to see if it fixes this issue_, and that is fast boot. I turned it off, no change, nothing different. Went ahead and turned it back on, no change again. I would assume something like this is due to fast boot but, apparently not.

seth wrote:

Add or remove devices to the system?

Only my USB drive and 2tb external HDD. That's all. Unplugging them all does nothing either.

seth wrote:

Does it make a difference whether you're booting on battery or AC?

Nope. I always boot on AC anyway, I went and tried battery boot, no difference.


"What is the cost of lies?"

Offline

#40 2025-07-06 12:02:43

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

fast boot. I turned it off, no change, nothing different. Went ahead and turned it back on, no change again

That actually sucks as it suggests that we cannot impact this w/ any delays :(

What happens if you disable the nvidia-persistenced.service?
Does the device then never show up? And then when you start the service, does it magically show up?
Assuming the persistenced triggers something special™ (acpi call, pci rescan, whatever) that can get the device to show up on the bus, it might be possible to run it early on in the initramfs to kick the GPU out of sleep…
https://man.archlinux.org/man/mkinitcpi … TIME_HOOKS

Offline

#41 2025-07-06 15:12:31

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

What happens if you disable the nvidia-persistenced.service?
Does the device then never show up? And then when you start the service, does it magically show up?
Assuming the persistenced triggers something special™ (acpi call, pci rescan, whatever) that can get the device to show up on the bus, it might be possible to run it early on in the initramfs to kick the GPU out of sleep…

OK, now we're getting somewhere. If I disable persistenced, the GPU wakeup (or maybe PCI rescan) happens *SO LATE* that the boot finishes, I can go into TTY3 and log in, then run nvidia-smi and have it fail twice before the nvidia driver finally loads, and I feel like it only loaded because I tried nvidia-smi. I swapped my DM to SDDM and it still works fine even with NVIDIA not loaded when it starts, so I know it works on the iGPU. Now we know nvidia-persistenced is waking up the GPU, and it's a GPU wakeup/PCI rescan issue.

Devices in /sys/bus have a rescan file, perhaps this could be used in a systemd service unit before sysinit to possibly rescan the bus as soon as possible? Instead of using a userspace daemon in initramfs, as I don't even know if that would work or not.

Forgot to provide the journal:
http://0x0.st/8GTJ.txt

Last edited by Sync Shard (2025-07-06 15:21:22)


"What is the cost of lies?"

Offline

#42 2025-07-06 15:27:09

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

Instead of nvidia-smi, do you get

echo 1 | sudo tee /sys/bus/pci/rescan

to wake the device?
Unfortunately we currently don't know what actually kicks it into existence, but if nvidia-smi achieves the same you could execute that in the initramfs
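
To test this in a repeatable way, the rescan and the "did it show up" check can be wrapped into one helper. This is only a sketch: the function names are made up, the address 0000:01:00.0 is taken from the logs in this thread, the retry loop is an addition that was not discussed here, and `SYSFS` exists solely so the logic can be tried against a fake directory tree.

```shell
SYSFS="${SYSFS:-/sys}"
GPU_ADDR="${GPU_ADDR:-0000:01:00.0}"   # address taken from the logs in this thread

# gpu_present: succeeds once the device has a node in sysfs.
gpu_present() {
    [ -e "$SYSFS/bus/pci/devices/$GPU_ADDR" ]
}

# rescan_until_present TRIES: rescan the PCI bus up to TRIES times, one
# second apart, and succeed as soon as the GPU is visible.
rescan_until_present() {
    tries="$1"
    while [ "$tries" -gt 0 ]; do
        gpu_present && return 0
        echo 1 > "$SYSFS/bus/pci/rescan"   # needs root on a real system
        sleep 1
        tries=$((tries - 1))
    done
    gpu_present
}

# Possible use as root on the real system:
#   rescan_until_present 3 && echo "GPU is on the bus" || echo "still missing"
```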

Instead of using a userspace daemon in initramfs, as I don't even know if that would work or not.

It so far has also initially failed after the root switch and would likely do the same from the initramfs, just earlier.

If you don't have any outputs attached to the nvidia GPU (?) you could also just start the GUI session and see whether "prime-run superturboturkeypuncher³" works on the first or second attempt (cause after all, that'd be the goal, right?)

Offline

#43 2025-07-06 15:37:43

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

Instead of nvidia-smi, do you get

echo 1 | sudo tee /sys/bus/pci/rescan

to wake the device?

Yes, that's exactly it. I rebooted and checked with fastfetch instead of nvidia-smi; my GPU was not showing up whatsoever, which was a bit scary. Then I did a PCI rescan and checked again, and it showed up instantly, and the nvidia errors stopped popping up. The friggin' GPU is just too sleepy. That's the problem. Figured it out. Now for solutions...

I'm going to test a systemd service unit ordered before sysinit.target that executes /bin/sh -c 'echo 1 > /sys/bus/pci/rescan', and will report back in about 5-10 minutes.
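
For reference, such a unit could look roughly like the one printed by the helper below. This is a sketch under assumptions, not the unit actually used in this thread (that one is never shown): the file name pci-rescan.service and the function name are made up, and `DefaultDependencies=no` is included because a service otherwise gets ordered after sysinit.target by default, which would conflict with `Before=sysinit.target`.

```shell
# print_rescan_unit: write a oneshot unit to stdout that rescans the PCI bus
# before sysinit.target is reached.
print_rescan_unit() {
    cat <<'EOF'
[Unit]
Description=Rescan PCI Bus for NVIDIA GPU detection
DefaultDependencies=no
Before=sysinit.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 1 > /sys/bus/pci/rescan'

[Install]
WantedBy=sysinit.target
EOF
}

# Possible use (hypothetical unit name):
#   print_rescan_unit | sudo tee /etc/systemd/system/pci-rescan.service
#   sudo systemctl enable pci-rescan.service
```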


"What is the cost of lies?"

Offline

#44 2025-07-06 16:00:29

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

I've tried it, and I think it does work for making it load before it's fully booted, however despite it being as early as I can make it (before sysinit) it's still too late. And oddly this time the GPU woke up and the NVIDIA module loaded before the unit file. Odd, but oh well.


"What is the cost of lies?"

Offline

#45 2025-07-06 21:49:39

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

Stupid question: you're not relying on the nvidia chip as VGA output - you just want to offload rendering onto it?
What if you initially just allow it to fail (keep the persistenced disabled so it won't trigger it) start your GUI session w/o it and then rescan the bus to get access to it?

Also

however despite it being as early as I can make it (before sysinit) it's still too late

what does that mean? Too late for what? When does the GPU show up?
Do you have a journal of that attempt?

Offline

#46 2025-07-07 10:47:49

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

Stupid question: you're not relying on the nvidia chip as VGA output - you just want to offload rendering onto it?

I honestly _want_ to use it to render everything, but I don't think this issue is really letting me do that, and the boot renderer is the iGPU no matter what. I might just have to live with it honestly.

seth wrote:

what does that mean? Too late for what? When does the GPU show up?

Not in initramfs, but before nvidia-persistenced starts, so it at least lets that start on its own completely fine and I don't have to restart it.

seth wrote:

Do you have a journal of that attempt?

Here you go http://0x0.st/8G2U.txt The log:

Jul 07 06:19:59 ASUS-A17-Arch systemd[1]: Finished Rescan PCI Bus for NVIDIA GPU detection.

is the unit I made to rescan. It does make nvidia-persistenced work fine on startup though. But again, it's not in initramfs. I wouldn't know how to do that, as the unit uses /bin/sh and echo which I don't think are available in initramfs. (are they?), and if yes, /sys availability that early on then could be another issue. NVIDIA tries to load the modules very early in there.

seth wrote:

What if you initially just allow it to fail (keep the persistenced disabled so it won't trigger it) start your GUI session w/o it and then rescan the bus to get access to it?

So. I've only tested KDE so far but it shouldn't matter. But Kaccess does some weird acpi/pci rescan and wakes the GPU, but it is too late to put any apps onto the GPU before they're autostarted, so I have to restart some apps to put them on the GPU to render instead of my iGPU. Here's a log of that as well just in case: https://0x0.st/8G2l.txt

So far this is just a case of the GPU not being woken up in initramfs. It was being woken up before, and was being woken up when I first installed Arch with a very bare-bones set of programs. The question becomes _why_ now?

However, it still just works by itself, it is woken up even by KDE. But not... the initramfs? The thing that has the modules for it right there? It's weird. Hmm.

But for now, I could just remove NVIDIA from my initramfs, and keep my custom unit enabled to rescan the pci bus, and reenable nvidia-persistenced, and I don't think I'll have boot errors, and it'd work fine, just as a late loading GPU. Bandaid, but I can probably live with it.


"What is the cost of lies?"

Offline

#47 2025-07-07 12:16:51

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

I honestly _want_ to use it to render everything

Does the UEFI allow you to deactivate the IGP/APU?

However, it still just works by itself, it is woken up even by KDE. But not... the initramfs?

Might be too early and the GPU not ready, however:

Jul 07 06:19:57 archlinux kernel: nvidia: loading out-of-tree module taints kernel.
Jul 07 06:19:57 archlinux kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Jul 07 06:19:57 archlinux kernel: NVRM: No NVIDIA GPU found.
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 235
Jul 07 06:19:57 archlinux systemd-modules-load[181]: Failed to insert module 'nvidia': No such device
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Jul 07 06:19:57 archlinux kernel: NVRM: No NVIDIA GPU found.
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 235
Jul 07 06:19:57 archlinux systemd-modules-load[181]: Failed to insert module 'nvidia_modeset': No such device
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Jul 07 06:19:57 archlinux kernel: NVRM: No NVIDIA GPU found.
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 235
Jul 07 06:19:57 archlinux systemd-modules-load[181]: Failed to insert module 'nvidia_uvm': No such device
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Jul 07 06:19:57 archlinux kernel: NVRM: No NVIDIA GPU found.
Jul 07 06:19:57 archlinux kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 235
Jul 07 06:19:58 archlinux systemd-modules-load[181]: Failed to insert module 'nvidia_drm': No such device
…
Jul 07 06:19:58 ASUS-A17-Arch systemd[1]: Starting Rescan PCI Bus for NVIDIA GPU detection...
…
Jul 07 06:19:58 ASUS-A17-Arch systemd[1]: Finished Remount Root and Kernel File Systems.
…
Jul 07 06:19:59 ASUS-A17-Arch kernel: pci 0000:01:00.1: [10de:10fa] type 00 class 0x040300 PCIe Endpoint
…
Jul 07 06:19:59 ASUS-A17-Arch systemd[1]: Finished Rescan PCI Bus for NVIDIA GPU detection.

remove. the. nvidia. modules. from. the. initramfs.

But for now, I could just remove NVIDIA from my initramfs

Yes. It's stalling the boot and the service (or run_hook) will not start before it has failed.
There's no point in adding the nvidia modules to the initramfs if the device is gonna be late no matter what - it's better to focus on getting it to show up as early as possible and allow the nvidia modules to pick it up after the root switch.
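
If the modules were added through the MODULES array in /etc/mkinitcpio.conf (an assumption; the thread does not say how they got into the image), removing them could be sketched with a filter like the one below. `strip_nvidia_modules` is a made-up name, and the word-boundary syntax relies on GNU sed.

```shell
# strip_nvidia_modules: read an mkinitcpio.conf on stdin and print it with
# the four nvidia modules removed from the MODULES=(...) line only.
strip_nvidia_modules() {
    sed -E '
        /^MODULES=\(/ {
            s/\<nvidia(_modeset|_uvm|_drm)?\>//g
            s/  +/ /g
            s/\( /(/
            s/ \)/)/
        }
    '
}

# Possible use: write the result to a scratch file, review it, then install
# it and rebuild all presets.
#   strip_nvidia_modules < /etc/mkinitcpio.conf > /tmp/mkinitcpio.conf.new
#   diff /etc/mkinitcpio.conf /tmp/mkinitcpio.conf.new
#   sudo cp /tmp/mkinitcpio.conf.new /etc/mkinitcpio.conf && sudo mkinitcpio -P
```

Other lines, such as HOOKS, are left untouched on purpose, so anything nvidia-related there has to be reviewed by hand.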

The rescan there actually
a) happens late (so late that it crosses the root switch)
b) kicks up the nvidia GPU
and you technically don't need some systemd service to rescan the bus, https://man.archlinux.org/man/mkinitcpi … TIME_HOOKS
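
A runtime hook for that could be sketched as follows. The hook name pcirescan is made up, `PCI_RESCAN` exists only so the function can be tried outside an initramfs, and as far as I know runtime hooks are only executed by the busybox-based init: an image built with the systemd hook does not run them, in which case a unit inside the initramfs would be needed instead.

```shell
# --- /etc/initcpio/hooks/pcirescan: runtime part, runs inside the initramfs ---
run_hook() {
    rescan="${PCI_RESCAN:-/sys/bus/pci/rescan}"
    # Only write if sysfs is mounted and the rescan file is writable.
    if [ -w "$rescan" ]; then
        echo 1 > "$rescan"
    fi
}

# --- /etc/initcpio/install/pcirescan: build part, runs when the image is built ---
build() {
    # Copies the runtime hook of the same name into the image.
    add_runscript
}

help() {
    echo "Rescans the PCI bus early so a late GPU is enumerated sooner."
}
```

The hook name would then be added to the HOOKS array in /etc/mkinitcpio.conf and the image rebuilt. The two parts live in separate files on a real system; they are shown in one block here only to keep the sketch short.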

Offline

#48 2025-07-07 13:59:04

Sync Shard
Member
Registered: 2024-08-17
Posts: 26

Re: [SOLVED] New issues with NVIDIA 575.64

seth wrote:

Does the UEFI allow you to deactivate the IGP/APU?

Nope. No setting for basically anything of that except for fastboot which I have already tried.

seth wrote:

Might be too early and the GPU not ready, however: remove. the. nvidia. modules. from. the. initramfs.

Sorry, I was still experimenting with ways to make the GPU wake in initramfs without having to do *scary mkinitcpio hooks* /j

seth wrote:

Yes. It's stalling the boot and the service (or run_hook) will not start before it has failed.
There's no point in adding the nvidia modules to the initramfs if the device is gonna be late no matter what - it's better to focus on getting it to show up as early as possible and allow the nvidia modules to pick it up after the root switch.

That's a good point, I've fully removed them from both of my kernels' initramfs images. It's so strange because it _used to_ not do this, but oh well. It does at least boot up faster.

seth wrote:

The rescan there actually
a) happens late (so late that it crosses the root switch)
b) kicks up the nvidia GPU
and you technically don't need some systemd service to rescan the bus

Now fully trying to use the sysd service I made and enabling nvidia-persistenced again without NVIDIA in my initramfs, there are zero errors, everything loads perfectly, just... not in initrd as I would hope. But I can live with this, no nvidia GPU not found warnings. It's a bandaid, but it works. I could most likely get it all to work in the initramfs with those hooks as you mentioned, but I will be busy today, so it'll be another time (probably tomorrow or later tonight.)

Shall I mark this as solved?


"What is the cost of lies?"

Offline

#49 2025-07-07 15:46:51

seth
Member
Registered: 2012-09-03
Posts: 66,248

Re: [SOLVED] New issues with NVIDIA 575.64

Shall I mark this as solved?

Once you're satisfied w/ the situation and no longer seek help, yes.
(And you can un-solve and bump it w/ new information any time anyways)

Offline
