Hello
I have a complete display freeze problem after upgrading my system from AM4 platform to AM5 and doing a complete system update. I'm not entirely sure if it's a hardware issue or software issue, because:
- Before the upgrade, my system was about 1-2 months out of date
- Just before replacing the parts I did a full system update, restarted the PC, made sure everything works and proceeded with replacements.
So I don't know the exact kernel and NVIDIA driver versions before upgrade. Kernel was probably 5.18-6.0.
DE: KDE Plasma
Specs Before upgrade:
- PSU: Corsair HX1000
- Motherboard: MSI MPG X570 GAMING PLUS
- RAM: Corsair Vengeance 2x16GB CL16 @3000MHz
- CPU: AMD Ryzen 7 3800X
- Cooler: Noctua NH-D15
- GPU: Gigabyte RTX 2070 Super Windforce
- Display 1: HP 24ea 1080p (with sound), HDMI
- Display 2: HP 24fw 1080p (w/o sound), DP
Upgraded components:
- Motherboard: ASRock X670E PG Lightning
- RAM: G.Skill Ripjaws S5 CL32 @6000MHz
- CPU: AMD Ryzen 9 7950X
Problem description
Sometimes, randomly, the display on both of my monitors freezes. I can't do anything other than move the mouse. TTYs don't work, but I can still log in remotely over SSH and do some things there, although any NVIDIA-related command (like nvidia-smi) hangs indefinitely. It can happen when I watch a YouTube video, play a game, or do some Android development with AVD. From what I've noticed, it does not happen during GPU-demanding tasks, but rather during light ones. If there is no freeze after 1-2 minutes of playing a game, I can continue to play for many hours with no problems. If I'm running a virtual machine or watching a video, it tends to happen more randomly. Sometimes I'm not doing anything specific, for example when I come back to a PC that was idle, and a freeze can happen then too. Sleep/hibernate are disabled, so that shouldn't be the cause.
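For anyone trying to catch the hang from an SSH session, here is a minimal Python sketch of a check that a command returns within a time limit. The function name `command_hangs` and the 10-second default are assumptions for illustration, not part of any NVIDIA tool. It deliberately does not wait for the process after killing it, on the assumption that a process blocked inside a driver call may not exit even after SIGKILL.

```python
import subprocess


def command_hangs(argv, timeout_s=10.0):
    """Return True if the command has not exited within timeout_s seconds."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # Request termination but do not block on it: a process stuck in the
        # kernel may only go away once the stuck call returns (if ever).
        proc.kill()
        return True
```

Calling `command_hangs(["nvidia-smi"])` over SSH would then return `True` while the freeze described above is in effect, assuming nvidia-smi is on the PATH.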
There is also another symptom - sometimes some other error happens, which decreases my GPU performance to about 20% (subjectively, not measured). If that happens, I can still use my PC without any problems, just the performance will be worse. I can see that it happened through nvidia-smi.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   48C    P8    27W / 215W |    688MiB /  8192MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
When this performance problem occurs, 0% and 27W here are printed as ERR!. Kernel logs don't seem to show anything interesting about this error, only that there was a failure querying fan speed and power usage. If necessary, I can post the full dmesg output when this happens again; I don't have those logs easily available right now.
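A small Python sketch of how the degraded state could be spotted automatically from saved nvidia-smi output. The function name and both sample rows are hypothetical: the degraded row is constructed from the description above (fan and power shown as ERR!), not copied from a real capture.

```python
def find_err_fields(smi_output):
    """Return (line_number, row) pairs for table rows that contain 'ERR!'."""
    hits = []
    for number, line in enumerate(smi_output.splitlines(), start=1):
        # Table rows in the default nvidia-smi layout start with '|'.
        if line.lstrip().startswith("|") and "ERR!" in line:
            hits.append((number, line.strip()))
    return hits


# Healthy row, as posted above.
HEALTHY_ROW = "|  0%   48C    P8    27W / 215W |    688MiB /  8192MiB |     14%      Default |"
# Hypothetical degraded row, built from the description in the post.
DEGRADED_ROW = "| ERR!  48C    P8   ERR! / 215W |    688MiB /  8192MiB |     14%      Default |"
```

Feeding it the text captured from nvidia-smi (for example from a periodic job) gives an empty list while the card reports normally.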
I'm currently running:
- Kernel 6.2.0-1-mainline (linux-mainline)
- NVIDIA drivers: 525.89.02 (nvidia-dkms)
What I tried so far:
- Replacing thermal pads and thermal paste on my GPU. It greatly improved the temperatures, but did not solve the problem at all.
- Updating Motherboard's BIOS from 1.07 to 1.18 AS04.
- Disabling the iGPU in the BIOS and removing every package I could find related to amdgpu+vulkan.
- Trying out kernels 6.1 (linux) and 6.2 (linux-mainline) with NVIDIA drivers 515 (nvidia-dkms from package archive) and 525 (nvidia-dkms).
None of this helped.
I am somewhat confident that this is not a GPU hardware/thermal problem, since I run dual boot with Windows 11. No such problems on Windows, everything works perfectly fine. Not even a single freeze.
Anyway, here are some logs from last crash, collected with nvidia-bug-report.sh. It happened after I went idle after playing God of War 3 through rpcs3 for about two hours. When I came back to my PC, the freeze happened.
Kernel Command line (I use sddm-plymouth)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-mainline root=UUID=a362c5da-c0f4-4317-9224-556dbf661437 rw apparmor=1 lsm=lockdown,yama,apparmor amd_iommu=on iommu=pt quiet loglevel=3 rd.systemd.show_status=auto rd.udev.log_priority=3 vga=current splash nvidia-drm.modeset=1
Nvidia related logs from journalctl
journalctl -b -0:
lut 24 17:33:49 pc kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-linux-mainline root=UUID=a362c5da-c0f4-4317-9224-556dbf661437 rw apparmor=1 lsm=lockdown,yama,apparmor amd_iommu=on iommu=pt quiet loglevel=3 rd.systemd.show_status=auto rd.udev.log_priority=3 vga=current splash nvidia-drm.modeset=1
lut 24 17:33:49 pc kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-linux-mainline root=UUID=a362c5da-c0f4-4317-9224-556dbf661437 rw apparmor=1 lsm=lockdown,yama,apparmor amd_iommu=on iommu=pt quiet loglevel=3 rd.systemd.show_status=auto rd.udev.log_priority=3 vga=current splash nvidia-drm.modeset=1
lut 24 17:33:49 pc kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 238
lut 24 17:33:49 pc kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.89.02 Wed Feb 1 23:23:25 UTC 2023
lut 24 17:33:49 pc kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.89.02 Wed Feb 1 23:09:40 UTC 2023
lut 24 17:33:49 pc kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
lut 24 17:33:49 pc kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
lut 24 17:33:49 pc kernel: nvidia-uvm: Loaded the UVM driver, major device number 236.
lut 24 17:33:49 pc (udev-worker)[619]: nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
lut 24 17:33:49 pc (udev-worker)[619]: nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \ -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) ${i}; done'' failed with exit code 1.
lut 24 17:33:49 pc kernel: nvidia-gpu 0000:01:00.3: enabling device (0000 -> 0002)
lut 24 17:33:50 pc kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
lut 24 21:39:58 pc ksysguard[21371]: QProcess: Destroyed while process ("/usr/bin/nvidia-smi") is still running.
lut 24 21:40:28 pc ksysguard[21371]: QProcess: Destroyed while process ("/home/artur/.local/share/ksysguard/nvidia-gpu-sensor.pl") is still running.
lut 25 00:39:20 pc kernel: [drm:drm_new_set_master] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
lut 25 00:42:59 pc kernel: [drm:drm_new_set_master] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
lut 25 04:15:45 pc kernel: NVRM: GPU at PCI:0000:01:00: GPU-8c280fdc-bac3-c482-5c52-17bcd21089fc
lut 25 04:15:45 pc kernel: NVRM: Xid (PCI:0000:01:00): 8, pid=3512, name=ferdium, Channel 000000a8
lut 25 04:16:17 pc dbus-daemon[1718]: [system] Activating via systemd: service name='org.freedesktop.home1' unit='dbus-org.freedesktop.home1.service' requested by ':1.187' (uid=0 pid=170982 comm="sudo nvidia-bug-report.sh")
lut 25 04:16:22 pc sudo[170982]: artur : TTY=pts/2 ; PWD=/home/artur ; USER=root ; COMMAND=/usr/bin/nvidia-bug-report.sh
What's interesting here is
Failed to grab modeset ownership
It does not happen right after boot or when the freeze occurs. It's somewhere in between those two events.
The actual error that causes the freeze is
NVRM: Xid (PCI:0000:01:00): 8, pid=3512, name=ferdium, Channel 000000a8
According to NVIDIA, it can be either:
- Driver Error
- User App Error
- Bus Error
- Thermal Issue
I would probably exclude Thermal Issue here, since nothing improved after replacing the thermal pads and thermal paste. The card also works properly on Windows, so I think it could be one of the other three.
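To make it easier to compare logs between machines, here is a minimal Python sketch that pulls the Xid fields out of a journal line. The regular expression only covers the line format shown in this post (other Xid messages may be laid out differently), and the function name and `POSSIBLE_CAUSES` table are assumptions for illustration. The causes listed for Xid 8 are the four quoted above.

```python
import re

# Matches lines like:
# NVRM: Xid (PCI:0000:01:00): 8, pid=3512, name=ferdium, Channel 000000a8
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>[^,]+), name=(?P<name>[^,]+), (?P<detail>.*)$"
)

# Possible causes for Xid 8 as listed in the post above.
POSSIBLE_CAUSES = {
    8: ["Driver Error", "User App Error", "Bus Error", "Thermal Issue"],
}


def parse_xid(line):
    """Return a dict describing the Xid event in a log line, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        # Kept as text in case the driver reports a non-numeric value.
        "pid": match.group("pid"),
        "name": match.group("name"),
        "detail": match.group("detail"),
    }
```

Lines that do not contain an Xid message return `None`, so the function can be mapped over a whole journalctl dump and the results filtered.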
If I wait about 2 minutes, then I get a log similar to the one here. I did not wait for this one for the current freeze, but if it's necessary, I can collect one when the problem occurs again.
My full bug report is available here. There is more information about my system, various versions of everything, what happened there, I just picked some parts which I thought were relevant and pasted them here.
My question is - what could be causing this, or at least, what else could I do to pinpoint the cause. Also - I tried to go back to kernel 5.19, but I did not manage to find a working combination of kernel and nvidia-drivers. If you could point me to a working combination, I would greatly appreciate that.
Last edited by bathory0xff (2023-02-25 05:06:34)
Offline
I managed to roll back the kernel to 5.19.13-arch1-1 and the NVIDIA drivers to 515.76, but the freeze happened anyway, while using an Android VM (AVD). Those versions were released in October last year and I'm pretty positive that I already had those or newer installed. This probably excludes the possibility of regressions in newer kernel and driver versions. So it could be either a hardware issue with one of the replaced parts or some bug in the support for one of them. I'm unsure if it's a problem with the GPU itself, since it worked properly on the previous motherboard up to the day of the upgrade. The problems all started after installing the new CPU.
I suppose, since I have had these new parts for almost a month and had no issues whatsoever on Windows, that this could be a kernel problem.
Last edited by bathory0xff (2023-02-26 00:56:50)
Offline
I am also experiencing random screen freezes with my 9700x and 3070ti. I have tried a variety of combinations of kernels and drivers, just as you have, but nothing has worked. Have you found a solution?
Offline
Nope.
I eventually bought a 4080. It works fine.
But my girlfriend built herself a PC from my old parts, and it runs this 2070 Super on my old MSI motherboard with no problems whatsoever. So I would guess that in my case it is either a compatibility issue or some hardware problem that manifests only in specific configurations. I have no solution, unfortunately. I don't know whether you checked the logs and got the same errors as me (Xid 8). Maybe it's something different for you, so I would suggest starting there.
Best of luck, hope you will find the cause.
Last edited by bathory0xff (2023-05-27 22:53:34)
Offline