Nvidia Drivers unstable/unusable

Pena · 2024-10-06 20:55:22

Two or three weeks ago I noticed that my system stopped behaving well with the latest nvidia-dkms changes: I get a lot of black screens and freezes, along with kernel messages from the NVIDIA firmware.

I've tried a lot of variations, so a single message from the journal won't magically debug it, but basically: if it doesn't crash Xorg, it is so unstable that it inevitably takes the entire system down with a random freeze.
Right now the only way I can make the system usable is to intentionally install the nvidia package, which for some reason makes optimus-manager unable to find the NVIDIA driver and fall back to the integrated graphics. I haven't checked this properly, but I have a feeling the integrated GPU can't launch via optimus otherwise (I know this doesn't make a lot of sense, but I really just want to fix NVIDIA for now).

So let's start with some logs.
This one was made with what I believe was nvidia-dkms:
http://0x0.st/XEVx.txt
and here's the full text:
http://0x0.st/XEVg.txt
and this one with nvidia-beta-dkms:
http://0x0.st/XEVl.txt
full journal:
http://0x0.st/XEVU.txt

I've also disabled a firmware flag via a kernel parameter hoping to fix the above, and I don't think it did anything.

Last edited by Pena (2024-12-15 21:30:19)

Pena · 2024-10-06 20:58:11

Ah, I forgot about my system:
The graphics card is an NVIDIA GeForce GTX 1050 Mobile (Pascal)
The CPU is an Intel i5-7300HQ (4) @ 3.500GHz
I'm running 6.6.52-1-lts and I've also tried the latest kernel
I'm running Xorg + optimus-manager
and although I used SDDM with the whole Plasma 6, I've since installed LightDM

V1del · 2024-10-08 09:35:21

fbdev got enabled by default, which potentially doesn't play ball with certain use cases. Try 'nvidia_drm.fbdev=0' instead of the GPU firmware parameter (which is inert; that one only has relevance on Turing+ GPUs).

Pena · 2024-10-08 18:50:10

V1del wrote:

fbdev got enabled by default, which potentially doesn't play ball with certain use cases. Try 'nvidia_drm.fbdev=0' instead of the GPU firmware parameter (which is inert; that one only has relevance on Turing+ GPUs).

Thank you, I think that solved it. Give me an hour or so to push it and see whether it's actually fixed or still unstable.

Regardless... well, it crashed two or three times (I guess one didn't count, as it didn't even boot fully), but it seems a lot more stable than before.
Here are the new logs (journalctl -b -3 -e -g nvidia):

http://0x0.st/XEkl.txt

Last edited by Pena (2024-10-08 18:55:02)

Pena · 2024-10-12 08:32:48

Hmm, upon looking at the logs closely, it appears the kernel parameters haven't changed even though I edited /etc/default/grub and updated GRUB.
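
For anyone following along, here is a minimal sketch for checking whether a parameter actually reached the running kernel. The function names has_param and report_params are made up for illustration. On a typical Arch GRUB setup, edits to /etc/default/grub only take effect after regenerating the config (usually grub-mkconfig -o /boot/grub/grub.cfg) and rebooting; that path is an assumption about this machine.

```shell
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
# (Hypothetical helper, not part of any package.)
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# report_params CMDLINE PARAM...: print "ok PARAM" or "missing PARAM" per parameter.
report_params() {
    cmdline=$1
    shift
    for p in "$@"; do
        if has_param "$cmdline" "$p"; then
            echo "ok $p"
        else
            echo "missing $p"
        fi
    done
}
```

On the affected machine this could be run as: report_params "$(cat /proc/cmdline)" nvidia_drm.fbdev=0 nvidia_drm.modeset=1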

Last edited by Pena (2024-10-12 08:34:17)

Pena · 2024-12-09 22:16:40

I hate "necrobumping" (I guess it's still not solved), but after all this time I decided to retry and... well, it apparently only works on the non-LTS kernel.

I also installed plymouth. Marking it as solved, even though for me the story doesn't end here.

Last edited by Pena (2024-12-09 22:17:31)

Pena · 2024-12-15 21:31:14

Seth recommended picking this conversation back up instead of creating a new one, so here's the post from the other thread:

So apparently the issue wasn't fully solved; it just takes a bit longer, or requires me to unplug my second monitor (I'm on a laptop), for it to show up. I followed seth's advice and set fbdev=0, and I saw posts of his advising other people to set nvidia_drm.fbdev=0 nvidia_drm.modeset=1. I also tested modeset=0, with no major results. I'm pretty sure it's related to nvidia_drm, as plugging in a new monitor or switching to a TTY (Ctrl+Alt+F?) typically causes Xorg to crash or enter a weird frozen mess (which did not happen before, as there was little latency).

The issue has been going on for some months, but only now have I decided to try to fix it instead of just using the Intel integrated graphics. These are my kernel parameters: BOOT_IMAGE=/vmlinuz-linux root=UUID=ae37db1b-0112-43d0-ac95-ae2df5da46ad rw loglevel=3 quiet nvidia_drm.fbdev=0 nvidia_drm.modeset=1 nouveau.modeset=0. I don't think I have nouveau, but I'm not too familiar with how graphics driver nomenclature works, so I just put it there hoping it would work.

seth wrote:

I guess this is a continuation of https://bbs.archlinux.org/viewtopic.php?id=300013 then?
you should rather pick up that thread to maintain context and report this one for the bin.
Also please post a complete system journal, but http://0x0.st/XEkl.txt looks a lot like it's driven by optimus-manager. Start by getting rid of that.
sudo journalctl -b | curl -F 'file=@-' 0x0.st

It is being driven by optimus-manager. The issue is that by uninstalling it I will also have to uninstall the Intel graphics drivers, which makes it kind of risky and requires a few more free hours.

Mostly because the issue presents itself almost like a Heisenbug: whenever I try to fix it or observe the output, it shuts off the graphics interface and freezes the desktop, forcing me to reboot into Intel integrated graphics to collect the data. Then if I want to retry something I have to redo everything again. Without optimus-manager it will presumably take me more time to provide results.

Last edited by Pena (2024-12-15 21:44:07)

seth · 2024-12-15 21:50:47

seth also wrote:

Also please post a complete system journal, but http://0x0.st/XEkl.txt looks a lot like it's driven by optimus-manager. Start by getting rid of that.
sudo journalctl -b | curl -F 'file=@-' 0x0.st

Pena · 2024-12-16 22:11:14

Here is a boot with NVIDIA until the screen randomly blanked out:
http://0x0.st/XFkJ.txt
And here is the current one I'm using (hybrid/Intel):
http://0x0.st/XFky.txt

seth · 2024-12-17 09:14:51

set 26 13:08:23 micron python3[2181]: [1511] INFO: Writing state {'type': 'pending_post_xorg_start', 'switch_id': '20240926T130821', 'requested_mode': 'nvidia'}
set 26 13:08:25 micron lightdm[2865]: [560] INFO: Running /etc/optimus-manager/xsetup-nvidia.sh
set 26 13:08:51 micron systemd-logind[2072]: New session 2 of user micron.
set 26 13:10:03 micron kernel: NVRM: GPU at PCI:0000:01:00: GPU-8ec817dd-35d1-e7c3-5ba1-c017c9200b60
set 26 13:10:03 micron kernel: NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', name=<unknown>, 000190ed 0001a9f3 0001a993 0001a978 0001a9ef 0001a9ec 00000011 00000000
set 26 13:10:21 micron kernel: NVRM: Xid (PCI:0000:01:00): 41, pid='<unknown>', name=<unknown>, CCMDs 0000000a 0000c1b5
set 26 13:10:21 micron kernel: NVRM: Xid (PCI:0000:01:00): 41, pid='<unknown>', name=<unknown>, CCMDs 00000000 0000c1b5
set 26 13:10:22 micron kernel: NVRM: Xid (PCI:0000:01:00): 32, pid='<unknown>', name=<unknown>, Channel ID 00000008 intr 00040000
set 26 13:10:32 micron systemd-logind[2072]: Removed session 2.
set 26 13:10:32 micron lightdm[5622]: [10] INFO: # Xorg pre-start hook
set 26 13:10:32 micron lightdm[5622]: [10] INFO: Previous state was: {'type': 'done', 'switch_id': '20240926T130821', 'current_mode': 'nvidia'}
set 26 13:10:32 micron lightdm[5622]: [10] INFO: Requested mode is: nvidia
set 26 13:10:44 micron systemd[1]: lightdm.service: Main process exited, code=exited, status=1/FAILURE
set 26 13:10:32 micron lightdm[5622]: [27] INFO: Writing to /etc/X11/xorg.conf.d/10-optimus-manager.conf

1. You have both systemd-networkd and NetworkManager enabled; probably disable networkd (assuming you want to use NM)
2. You're running a 6.10 kernel and nvidia 560.35.03 ?
3. Add "nvidia_drm.modeset=1 nvidia_drm.fbdev=0" to the kernel parameters (in part to get rid of the simpledrm device)
4. "looks a lot like it's driven by optimus-manager. Start by getting rid of that" - notably

And here is the current one im using(hybrid/intel)

Is the hybrid behavior stable and you can run "prime-run superturboturkeypuncher³" w/o problems?

XID 62 is "Internal micro-controller halt (newer drivers)", 32 is "Invalid or corrupted push buffer stream" and 41 is unused…
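
A rough sketch for tallying the Xid codes in a journal, assuming log lines shaped like the NVRM ones quoted above. The descriptions are only the three given in this post; any other code is reported as not covered. The function name xid_summary is made up for illustration.

```shell
# xid_summary: read kernel log lines on stdin and print each distinct Xid code
# with its count and the description quoted in this thread.
xid_summary() {
    # Pull the numeric code out of "NVRM: Xid (PCI:...): NN, ..." lines.
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' |
    sort -n | uniq -c |
    while read -r count code; do
        case $code in
            32) desc="Invalid or corrupted push buffer stream" ;;
            41) desc="unused" ;;
            62) desc="Internal micro-controller halt (newer drivers)" ;;
            *)  desc="not covered in this thread" ;;
        esac
        echo "Xid $code x$count: $desc"
    done
}
```

Against a real boot this could be fed with something like: journalctl -k -b -1 | xid_summary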

set 26 13:08:21 micron python3[2188]: modinfo: ERROR: Module bbswitch not found.
set 26 13:08:21 micron python3[2190]: modinfo: ERROR: Module acpi_call not found.
set 26 13:08:19 micron kernel: acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration

So OM doesn't get to power down the GPU, ASPM isn't available, and the GP107M Pascal chip isn't eligible for RTD3.

If it's not optimus-manager, is it a thermal issue? Power? Does this only happen on battery or AC?

Pena · 2024-12-18 19:47:35

Sorry for going missing for a few days. I have powerline Ethernet extenders and only today did I have the will to fix them (they usually require me to plug and unplug random extenders or, worse, to re-pair them by running up and down the stairs).

About the post:
1. Good point. I still have the old config from when I transitioned from Qt to a custom Openbox environment, so everything network-related is probably misconfigured, but I'm too lazy and too scared of not having a network on the PC.
2. ? uname says: Linux micron 6.12.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 09 Dec 2024 14:31:57 +0000 x86_64 GNU/Linux
I'm using nvidia-dkms at this point, as it never failed me... until recently. Here's everything NVIDIA:

local/cuda 12.6.3-1
local/cudnn 9.5.1.17-1
local/egl-gbm 1.1.2-1
local/egl-wayland 4:1.1.17-1
local/egl-x11 1.0.0-1
local/ffnvcodec-headers 12.2.72.0-1
local/libvdpau 1.5-3
local/libxnvctrl 565.57.01-1
local/nccl 2.23.4-1
local/nvidia-cg-toolkit 3.1-8
local/nvidia-dkms 565.77-2
local/nvidia-utils 565.77-2
local/nvtop 3.1.0-1
local/opencl-nvidia 565.77-2
local/optimus-manager-git 1:r771.6b3d3e5.python3.12-1
local/vulkan-nouveau 1:24.3.1-3
local/xf86-video-nouveau 1.0.18-1 (xorg-drivers)

I might have some Wayland stuff installed, but I'm running X11.
3. I already did; cat /proc/cmdline returns: BOOT_IMAGE=/vmlinuz-linux root=UUID=ae37db1b-0112-43d0-ac95-ae2df5da46ad rw loglevel=3 quiet nvidia_drm.fbdev=0 nvidia_drm.modeset=1 nouveau.modeset=0
I injected the parameters via GRUB; however, I'm unsure if they were actually parsed or just ignored as random garbage.
4. No, I don't even have prime-run as far as I'm aware; I just run hybrid so NVIDIA isn't turned on and I can actually debug my device. I can systemctl disable it to check what happens, but I will probably wait for this weekend, as I'm likely to need some tweaks before I can deactivate optimus: I'm on a laptop with the Intel drivers and have always used optimus (because when I tried to install Arch 4-5 years ago, the drivers conflicted and I was unable to do anything for a week until I discovered TTYs exist).
5. That's a good call. Months ago my Lenovo power brick started to have bad contacts, and since I haven't repaired it yet (I want to modify it with a pluggable connector on both ends) I've been using an off-brand charger I bought from Amazon (which as far as I'm concerned is AliExpress quality). I'm always plugged in, as my PC has about an hour or two of charge, especially if I turn on the graphics card. I've already confirmed that it isn't a thermal issue, as I've done maintenance twice in a short span, i.e. removing dust and applying new paste. If there is an issue, it's most likely a power issue.

seth · 2024-12-18 21:34:46

however, I'm unsure if they were actually parsed

They show up in /proc/cmdline - it just doesn't match the log you posted (of the problematic boot, not the newer intel/hybrid one)
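
To compare the two, one option is to pull the command line out of the journal of the boot in question rather than reading /proc/cmdline of the current one. This is a small sketch, assuming the kernel's usual "Kernel command line:" log line is present in that boot's journal; boot_cmdline is a made-up name.

```shell
# boot_cmdline: read journal or dmesg text on stdin and print the kernel
# command line that boot was started with (first match only).
boot_cmdline() {
    sed -n 's/.*Kernel command line: //p' | head -n 1
}
```

For example, journalctl -k -b -1 | boot_cmdline prints the parameters of the previous boot, which can then be compared by eye with cat /proc/cmdline. journalctl --list-boots shows which offsets are available.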

No, I don't even have prime-run as far as I'm aware

Your journal is aware of optimus manager

dez 16 21:33:33 micron python3[2115]: [27] INFO: # Daemon pre-start hook
dez 16 21:33:33 micron python3[2115]: [27] INFO: Removing /etc/X11/xorg.conf.d/10-optimus-manager.conf (if present)
dez 16 21:33:33 micron python3[2115]: [28] INFO: Copying /etc/optimus-manager/optimus-manager.conf to /var/lib/optimus-manager/tmp/config_copy.conf

And so is your package list

local/optimus-manager-git 1:r771.6b3d3e5.python3.12-1

Please completely get rid of that for regular prime use.

Pena · 2025-01-05 14:51:06

OK, I've removed optimus-manager-git and I'm now seemingly stuck with the Intel drivers?

http://0x0.st/8ijh.txt

seth · 2025-01-05 14:58:14

No?

jan 05 14:39:57 micron kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.77  Wed Nov 27 23:33:08 UTC 2024
jan 05 14:39:57 micron kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
jan 05 14:39:57 micron systemd-modules-load[280]: Inserted module 'nvidia_uvm'
jan 05 14:39:57 micron kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.77  Wed Nov 27 22:53:48 UTC 2024
jan 05 14:39:57 micron kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
jan 05 14:39:57 micron kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
jan 05 14:39:58 micron kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:01:00.0 on minor 1

But there may be residual xorg config files from OM, please post your Xorg log, https://wiki.archlinux.org/title/Xorg#General
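
A quick way to look for such leftovers, as a sketch only: it lists *.conf files whose name contains "optimus-manager" (like the 10-optimus-manager.conf seen in the journal) or whose contents mention "optimus". That content match is a heuristic and an assumption, and find_om_leftovers is a made-up name.

```shell
# find_om_leftovers DIR...: print Xorg config files in the given directories
# that look like optimus-manager residue (by file name or by content).
find_om_leftovers() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/*.conf; do
            [ -f "$f" ] || continue
            # Match on the file name only, not on the directory path.
            case ${f##*/} in
                *optimus-manager*) echo "$f"; continue ;;
            esac
            if grep -qi 'optimus' "$f"; then
                echo "$f"
            fi
        done
    done
}
```

Typical call: find_om_leftovers /etc/X11 /etc/X11/xorg.conf.d. Anything it prints should be reviewed before being deleted.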

Pena · 2025-01-05 18:27:01

seth wrote:

No?

jan 05 14:39:57 micron kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.77  Wed Nov 27 23:33:08 UTC 2024
jan 05 14:39:57 micron kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
jan 05 14:39:57 micron systemd-modules-load[280]: Inserted module 'nvidia_uvm'
jan 05 14:39:57 micron kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.77  Wed Nov 27 22:53:48 UTC 2024
jan 05 14:39:57 micron kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
jan 05 14:39:57 micron kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
jan 05 14:39:58 micron kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:01:00.0 on minor 1

But there may be residual xorg config files from OM, please post your Xorg log, https://wiki.archlinux.org/title/Xorg#General

I still have the NVIDIA drivers installed; I never uninstalled them, I only removed optimus. The reason I said I'm stuck with Intel is that when NVIDIA is driving the monitors, the outputs get different names (HDMI1-1 instead of HDMI1 and EDP1-1 instead of EDP1), and picom also stops working when Intel is driving, although from memory I can't recall whether that was intentional configuration on my part or just me not being bothered by it. I think this should address the doubt/confusion?

I will post the Xorg logs, as they might help diagnose the problem regardless.

Pena · 2025-01-05 18:29:57

The instructions given in the forum output only a short excerpt and I'm unsure if it's of any relevance. Here it is:

[     6.135] (==) Log file: "/var/log/Xorg.0.log", Time: Sun Jan  5 15:52:39 2025
[     6.142] (II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration

Here's the entire file:
http://0x0.st/8iLs.txt

Lone_Wolf · 2025-01-05 19:44:17

[     6.164] (II) LoadModule: "intel"
[     6.164] (II) Loading /usr/lib/xorg/modules/drivers/intel_drv.so
[     6.165] (II) Module intel: vendor="X.Org Foundation"
[     6.165] 	compiled for 1.21.1.11, module version = 2.99.917
[     6.165] 	Module class: X.Org Video Driver
[     6.165] 	ABI class: X.Org Video Driver, version 25.2

pena wrote:

The cpu Intel i5-7300HQ (4) @ 3.500GHz

Your CPU was released in Q1 2017 and is a few years too young to need xf86-video-intel.
Please remove xf86-video-intel.
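
To see at a glance which modules Xorg tried to load, the LoadModule lines can be extracted from the log. This is a sketch that assumes the log format shown above; loaded_modules is a made-up name.

```shell
# loaded_modules: read an Xorg log on stdin and print every module named in a
# LoadModule line, once each, in the order first seen.
loaded_modules() {
    sed -n 's/.*LoadModule: "\([^"]*\)".*/\1/p' | awk '!seen[$0]++'
}
```

Usage: loaded_modules < /var/log/Xorg.0.log. If "intel" still shows up after removing xf86-video-intel, the log is probably from an older session.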

V1del · 2025-01-05 19:49:18

Also make sure you have mesa installed and not mesa-amber; that the intel driver finds i965 already reads weird.

Pena · 2025-01-05 19:53:22

V1del wrote:

Also make sure you have mesa installed and not mesa-amber; that the intel driver finds i965 already reads weird.

 
paru -Qss mesa
local/glu 9.0.3-2
    Mesa OpenGL utility library
local/lib32-glu 9.0.3-2
    Mesa OpenGL utility library (32 bits)
local/lib32-mesa 1:24.3.2-1
    Open-source OpenGL drivers - 32-bit
local/mesa 1:24.3.2-1
    Open-source OpenGL drivers
local/mesa-demos 9.0.0-5
    Mesa demos
local/mesa-utils 9.0.0-5
    Essential Mesa utilities

mesa-amber is not installed, as far as I am aware (unless it's not being managed by pacman?)

Pena · 2025-01-05 19:54:30

Lone_Wolf wrote:

[     6.164] (II) LoadModule: "intel"
[     6.164] (II) Loading /usr/lib/xorg/modules/drivers/intel_drv.so
[     6.165] (II) Module intel: vendor="X.Org Foundation"
[     6.165] 	compiled for 1.21.1.11, module version = 2.99.917
[     6.165] 	Module class: X.Org Video Driver
[     6.165] 	ABI class: X.Org Video Driver, version 25.2
pena wrote:
The cpu Intel i5-7300HQ (4) @ 3.500GHz
Your CPU was released in Q1 2017 and is a few years too young to need xf86-video-intel.
Please remove xf86-video-intel.

I removed it, and now the following are present:

paru -Qss xf86-video

local/xf86-video-dummy 0.4.1-2 (xorg-drivers)
    X.org dummy video driver
local/xf86-video-nouveau 1.0.18-1 (xorg-drivers)
    Open Source 3D acceleration driver for nVidia cards
local/xf86-video-vesa 2.6.0-2 (xorg-drivers xorg)
    X.org vesa video driver
local/xf86-video-vmware 13.4.0-3 (xorg-drivers)
    X.org vmware video driver

Last edited by Pena (2025-01-05 19:55:03)

Scimmia · 2025-01-05 19:57:16

Why do you have any of those installed?

Pena · 2025-01-05 20:07:04

Scimmia wrote:

Why do you have any of those installed?

Some are leftovers from projects/work related to software development, especially virtualization. Others are leftovers from when I first installed Arch; once it finally began working I never questioned or cleaned things up. There's a lot that I did wrong and have yet to discover.

In detail:
xf86-video-dummy: I always presumed this is needed to launch X11 headless (such as when you're running Qt applications in Docker), and it never gave me issues
xf86-video-nouveau: a remnant of when I first installed Arch; I installed both this and the Intel one, as I saw it in a tutorial on YouTube, I think
xf86-video-vmware: a remnant of when I used VMware for a semester a long time ago; no longer needed
xf86-video-vesa: unsure

Pena · 2025-01-05 20:07:51

Pena wrote:

Scimmia wrote:
Why do you have any of those installed?
Some are leftovers from projects/work related to software development, especially virtualization. Others are leftovers from when I first installed Arch; once it finally began working I never questioned or cleaned things up. There's a lot that I did wrong and have yet to discover.

In detail:
xf86-video-dummy: I always presumed this is needed to launch X11 headless (such as when you're running Qt applications in Docker), and it never gave me issues
xf86-video-nouveau: a remnant of when I first installed Arch; I installed both this and the Intel one, as I saw it in a tutorial on YouTube, I think
xf86-video-vmware: a remnant of when I used VMware for a semester a long time ago; no longer needed
xf86-video-vesa: unsure

EDIT: apparently virtualbox-guest-utils pulls in xf86-video-vmware as a dependency
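
For checking what keeps a package installed, the "Required By" field of pacman -Qi can be read directly. This is a minimal sketch; required_by is a made-up name, and it assumes English field labels (hence LC_ALL=C) and a dependency list short enough to fit on one line.

```shell
# required_by: read `pacman -Qi PKG` output on stdin and print the value of
# its "Required By" field.
required_by() {
    sed -n 's/^Required By *: *//p'
}
```

Usage: LC_ALL=C pacman -Qi xf86-video-vmware | required_by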

Last edited by Pena (2025-01-06 19:26:11)

Scimmia · 2025-01-05 20:09:24

Wait, this is all in a virtualbox VM?

Pena · 2025-01-05 20:12:17

Scimmia wrote:

Wait, this is all in a virtualbox VM?

Nope, I just never figured out whether virtualbox-guest-utils is needed on the host too (because, strangely, something only started working inside the VM after I installed the package). Yes, I've used both VirtualBox and VMware, and I'm aware that my system is filled to the brim with bloat and errors; only now am I less dependent on the computer, and I have been making an effort to clean it up a bit.
