Note that either TLP or NVIDIA itself is entirely capable of changing the power/control value between "auto" and "on" as necessary, but the GPU will refuse to enter D3 while it is set to "on".
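For reference, a minimal way to inspect and flip that knob by hand (the 0000:01:00.0 address is the dGPU's as it appears later in this thread; adjust for your own system):
cat /sys/bus/pci/devices/0000:01:00.0/power/control
echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control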
Sanity check:
1. Is this w/ the nvidia-open driver?
Same w/ the blob?
2. This is w/o any prime or nvidia-smi invocation (the latter might include some fancy desktop widget polling it), because those will prevent entering D3.
Only check: cat /sys/class/drm/card*/device/power_state
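E.g. something along these lines polls it repeatedly and should not wake the card by itself:
watch -n1 cat /sys/class/drm/card*/device/power_state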
Nvidia-open-dkms 550. And it's with prime, but I've always had prime and always idled at 7W. I might be mis-speaking in saying that the card could previously enter D3, but all I know is that it consumed no power most of the time before a month ago, and prime-run worked fine. D3hot? D2?
EDIT: I do not think I'm mis-speaking. Here is what I see when booting the laptop on battery (as opposed to the reproduction process for the failure, which is booting on AC and unplugging afterwards):
[asuka@Marojejy ~]$ cat /sys/class/drm/card*/device/power_state
D3cold
D0
Indeed, as I type this, card0 (which is Nvidia, vendor 0x10de) is in D3cold, and my laptop is using less than 10 watts.
The wiki does list alternative ways to get Optimus to work (nvidia-xrun looks promising); all I really care about is being able to switch without a reboot, and not having any issues with battery life or shutdown failing. This looks almost identical: https://forums.developer.nvidia.com/t/n … 022/250023 Though the early date means it's probably not something caused by a recent kernel.
Last edited by mesaprotector (2024-05-26 17:50:16)
Offline
I'm not sure whether nvidia-open supports RTD3 - it certainly didn't a couple of months ago.
Offline
I thought of that - possibly dumb question, but are the 535 and 525 AUR drivers also based on the open source flavor?
In addition, see my edit. There's definitely some functionality for running in D3cold. If I call nvidia-smi, the device powers up to D0, then drops back to D3cold within seconds, as long as I booted on battery instead of AC. I changed the subject title of this thread to reflect the actual cause of the problem.
EDIT: Checking Nvidia's documentation, it says that RTD3 works on nvidia-open for Ampere+, and my GPU is Ampere. Though perhaps it's still buggy.
Last edited by mesaprotector (2024-05-26 20:11:22)
Offline
are the 535 and 525 AUR drivers also based on the open source flavor
No. But 535xx doesn't have the 550xx-related memory corruption bug either (if that's your concern), and at least the 535xx AUR package should build against 6.9.
I'd certainly try whether that has an impact on the RTD3 situation; nvidia-open is still deemed alpha-quality.
Offline
Thank you for the suggestions. I already mentioned that I have the same issue on 535, but I double-checked. Sure enough, runtime D3 does not work on 535 either. In fact, it's not even just the fine-grained functionality that isn't working - the issues I've mentioned happen even when I never load Xorg at all and just stay in the TTY. There is nothing running on the Nvidia GPU, but it still won't turn off.
-- If I boot with the laptop plugged in, /sys/bus/pci/devices/0000:01:00.0/power/control is permanently set to "on" (regardless of udev rules like those mentioned on the wiki - see the sketch after this list).
-- If I boot with it unplugged, the GPU properly goes into D3cold; if I wake it up it will shortly return to D3cold. It continues to work if I plug in the charger, going to D3hot when idle for 15 seconds or so. But if I unplug it again it switches to D0 and stays there forever, and the system hangs on suspend or shutdown. (power/control is set to "auto" during all this.)
-- If I boot with only "coarse-grained" power management ("NVReg_DynamicPowerManagement=0x01"), then the GPU never changes out of D0 even before plugging/unplugging.
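For completeness, the rules/options referred to in the first item are roughly these - an abridged version of what the wiki and NVIDIA's runtime D3 documentation show, not something that fixed anything here:
/etc/udev/rules.d/80-nvidia-pm.rules:
# let the kernel runtime-suspend the dGPU once the nvidia driver binds to it
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="auto"
/etc/modprobe.d/nvidia-pm.conf:
# 0x02 = fine-grained, 0x01 = coarse-grained power management
options nvidia NVreg_DynamicPowerManagement=0x02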
Some other things I've tried, with no effect unless mentioned: kernel parameters noapic, acpi=force, acpi=noirq (this one causes Nvidia to not even work); downgrading the kernel to 6.8.8; disabling the nvidia-powerd service; blacklisting nvidia_uvm.
---
Somehow, TLP was really, really unhappy with all this. I told it to ignore nvidia in its config file, but it still froze in the background; then I disabled it and it inexplicably woke up anyway and froze. Uninstalling it didn't obviously stop the freezes either - with TLP gone, a blocked shutdown would just get attributed to the shutdown task itself.
I haven't updated my laptop's BIOS, so that doesn't seem like it was the breaking change. Will try older kernels (I have to grab them from the backups off my hard drive) but I'm done for the day.
Last edited by mesaprotector (2024-05-30 20:42:18)
Offline
I apologize for double-posting in my own thread. I try not to do so unless I have significant new information.
Well, only disabling Active State Power Management entirely (kernel parameter pcie_aspm=off) lets my Nvidia GPU reach D3 after unplugging as it should, though it's D3hot rather than D3cold. Of course disabling ASPM makes my battery life horrendous, so I looked into per-device disabling of ASPM; the kernel's own documentation says to echo to files under /sys/bus/pci/devices/*/link. I tried this and it causes an immediate kernel bug and shutdown. I managed to get a video that captures the kernel bug (it flashes on screen for less than ¼ second before the machine shuts down), but it doesn't help much - a NULL pointer dereference with a trace.
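What I echoed was along these lines (file names per the kernel's sysfs-bus-pci ABI documentation, the address is my dGPU's - don't copy this blindly, it is exactly what oopsed here):
echo 0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/link/l1_aspm
echo 0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/link/clkpm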
I will edit this post if anything new comes up. I do have a full backup from May 1st, though so much has changed I don't know exactly how to use it to debug. I might try the Nvidia Linux forums as this issue also happens on the current driver.
Last edited by mesaprotector (2024-05-30 20:43:24)
Offline
For clarification: you've completely removed TLP atm?
Otherwise https://linrunner.de/tlp/settings/runtimepm.html# - check the config,
RUNTIME_PM_DRIVER_BLACKLIST was renamed to RUNTIME_PM_DRIVER_DENYLIST and apparently nvidia was removed from the defaults, but RUNTIME_PM_DRIVER_BLACKLIST is still parsed, so you might have a stale entry there…
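I.e. something along these lines in /etc/tlp.conf (values are only illustrative - check what your file actually contains):
# old key, no longer documented but still parsed - make sure nvidia isn't lingering here
#RUNTIME_PM_DRIVER_BLACKLIST="…"
# current key - nvidia must not be listed here if you want RTD3
RUNTIME_PM_DRIVER_DENYLIST="mei_me nouveau radeon"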
Do you have a picture of the crash? Bonus points if it has the RIP code in it.
Did you alter the values of *all* link/* files?
One more thing: the recent nvidia chips come w/ a USB hub that might get in the way and is a known caveat, https://bbs.archlinux.org/viewtopic.php … 4#p2159544
Offline
The behavior I described in #30 is what happens with TLP completely gone. But things are almost the same with and without it (not to mention that I only installed TLP for the first time weeks after this issue presented itself), so I've been doing most of my testing with it on.
I forget if I mentioned, but if I disable nvidia kernel modesetting, the Runtime D3 issue continues, but it no longer blocks shutdown.
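(For the record, I toggle modesetting the usual way, roughly:
# /etc/modprobe.d/nvidia-drm.conf - modeset=1 to enable, 0 to disable
options nvidia_drm modeset=0
plus a reboot, and an initramfs rebuild if the module sits in there.)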
Do you have a picture of the crash? Bonus points if it has the RIP code in it.
Did you alter the values of *all* link/* files?
I do have a picture: https://i.imgur.com/2q6VGOi.jpeg. You can see from it what I attempted. The exact same thing happens if I echo to any of the other link files for the nvidia device.
One more thing: the recent nvidia chips come w/ a USB hub that might get in the way and is a known caveat, https://bbs.archlinux.org/viewtopic.php … 4#p2159544
I did see this in Nvidia's documentation. The USB seems unlikely to be relevant but the audio function that's also mentioned could be. I don't currently see it, but during my testing I've occasionally seen an "0000:01:00.1" in /sys/bus/pci/devices. Also I inadvertently installed pipewire at almost exactly the time issues started happening, though I didn't remove pulseaudio until a bit later.
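(A quick way to see which functions of the card are currently exposed, assuming vendor 0x10de as above:
lspci -d 10de: -nn
This should list the VGA controller plus any audio/USB/UCSI functions at 0000:01:00.1 and up.)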
Finally, I found a workaround, which is absolutely nonsensical, but if I ONLY unplug my laptop when suspended (that is, systemctl suspend, wait a few seconds, unplug, wake up), Runtime D3 continues to work perfectly. I don't have to do this for plugging in. Note that I use suspend-to-RAM (S3).
Offline
per-device disabling of ASPM and the kernel's own documentation says to echo to files under /sys/bus/pci/devices/*/link. I tried this and it causes an immediate kernel bug and shutdown
page fault in aspm code/clkpm_store
if I ONLY unplug my laptop when suspended (that is, systemctl suspend, wait a few seconds, unplug, wake up), Runtime D3 continues to work perfectly
I.e. when hiding the event from the kernel and userspace. Did you check the behavior at the rescue.target?
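(E.g. boot with
systemd.unit=rescue.target
appended to the kernel command line, or run "systemctl isolate rescue.target" on a running system.)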
Offline
I.e. when hiding the event from the kernel and userspace. Did you check the behavior at the rescue.target?
In rescue mode, if kernel modesetting is enabled then the behavior is the same. But something different happens if kernel modesetting is disabled. First of all, the power options under '/proc/driver/nvidia/gpus/0000:01:00.0/power' will all be listed as '?'. On unplug, the GPU completely disappears from the bus, and nvidia-smi fails with "Failed to initialize NVRM: Unknown Error". After this, plugging it back in will make it reappear, but with the previous behavior of sticking in D0. Also, the folder /sys/class/drm/card1 will stay empty, and nvidia-smi will take an extra couple of seconds to run.
Am I correct in guessing that what's been going on is that my PC is trying to unmount the GPU every time it recognizes an unplug event? And if it fails to do so (as with kernel modesetting, Xorg running, etc.), it hangs up the GPU and causes all the other problems?
Last edited by mesaprotector (2024-06-03 21:48:20)
Offline
if kernel modesetting is disabled … the folder /sys/class/drm/card1 will stay empty
is normal but
On unplug, the GPU completely disappears from the bus, and nvidia-smi fails with "Failed to initialize NVRM: Unknown Error".
is probably not - from the rescue.target I don't see how some userspace daemon could do this; either the firmware cuts power to the GPU or the unsupported battery cannot generate enough voltage to keep it up.
Are there any dGPU settings in the UEFI?
You could also try to test this from some live distro to rule out a local config issue (but you'll need a distro that ships the nvidia driver on the live system, eg. manjaro, perhaps ubuntu)
Offline
I feel a little silly for taking a month to figure out what was causing this issue. I'm still surprised nobody else has reported it (I have done a LOT of searching), as my laptop is fairly common and dual-booting can't be that rare.
The cause (hopefully, at the very least the issue is currently fixed) was a setting in Windows. Since I bought this laptop, the battery life in Windows had been terrible, similar to my recent issues in Linux, with the same cause - the Nvidia dGPU was not powering off. Since I almost never use Windows on battery, I didn't bother dealing with it until about a month ago when I decided to try the "Hybrid-Auto" graphics mode setting in Lenovo Legion Toolkit (and Vantage). The description of this says it turns off the dGPU on battery, while using the default Advanced Optimus on AC. Indeed it does that, and the mechanism it uses to do so is not confined to Windows - it kicks the GPU off the bus.
I don't remember exactly when I changed that setting, but it must have been very near May 8th, the date I switched to the 535 Nvidia driver and the 6.8.9 mainline kernel, so I assumed my issues were the result of that instead. Well, I switched it back. Unfortunately it seems like there's no way to get good idle battery life in both Windows and Linux at the same time, and since I'm much more often in Linux, it's an easy choice. It would be nice if Lenovo Legion Toolkit (or its Linux equivalent, LenovoLegionLinux) would allow setting the graphics mode on boot and setting it back on shutdown, instead of persistently changing it. They have no trouble doing so in userspace. Maybe I can still find a way to set that up, but the main issue is solved.
As for why the card blocks shutdown when kernel modesetting is on, I don't know, but I assume it's a firmware issue I can't do much about.
Offline
Windows can crash your system even when it's not running
graphics mode setting in Lenovo Legion Toolkit
sounds like it's probably actually controlling the firmware; otherwise see the 3rd link below and double- and triple-check that it's disabled (Windows frequently enables it again with updates)
Offline
Oh, I know it's not fast start. I've been double-checking every other day since I started this thread, because I'd never live it down if that were the cause.
Still, I made a bug report to LenovoLegionLinux two-ish weeks ago about it causing problems in Windows, so I was already aware the firmware settings could be shared across OSs. The way Lenovo decided to do it, plenty of functions like fan control and power modes are hidden from ordinary monitoring software, and require some level of messing with firmware to work, which is unfriendly to dual-booting setups.
This is only gonna get worse on ARM, isn't it? Well, anyway. I hope this thread helps somebody. Tl;dr: don't use Hybrid-Auto (or Hybrid-iGPU) mode on Lenovo laptops, and if you do, switch it on for just the duration you're in Windows.
Offline