While I was going about my day, my GPU fan suddenly ramped up for no reason and I realized I could no longer access my NVIDIA GPU.
Then I checked `journalctl` and found this:
NVRM: GPU at PCI:0000:01:00: GPU-7ab9172a-dcda-5b17-aad4-74934860ecdd
NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
A restart fixed it, but it happened again at some point, and I'm now concerned that it will happen again soon. What should I do to prevent this crash?
Useful information about my hardware:
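To check whether the error has come back without scrolling through the whole journal, something like the following might help. It is only a rough sketch: the regular expression is modeled on the single Xid line quoted above, and the function names `find_xid_events` and `read_kernel_log` are made up for this example. It assumes `journalctl` is available and that your user is allowed to read kernel messages.

```python
import re
import subprocess

# Modeled on the line quoted above, e.g.:
# NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, rest_of_message) tuples."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), match.group(3)))
    return events


def read_kernel_log():
    """Return kernel messages from the journal (hypothetical helper).

    Assumes journalctl exists and kernel messages are readable by this user.
    """
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True,
        text=True,
    )
    return result.stdout


# Example use on a live system:
#     for pci, code, message in find_xid_events(read_kernel_log()):
#         print(pci, code, message)
```

Xid 79 is the code from the log above; any other codes this prints would need to be looked up separately.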
CPU: Intel i7-8700K (12) @ 4.700GHz
GPU 0: NVIDIA GeForce RTX 3060 Lite Hash Rate
GPU 1: AMD ATI Radeon RX 5600 OEM/5600 XT / 5700/5700 XT (drives my main display)
GPU 2: Intel CoffeeLake-S GT2 [UHD Graphics 630]
Kernel: 6.5.9-zen2-1-zen
NVIDIA Driver ver: 545.29.02
I'm using Plasma 6 dev on Wayland if that makes any difference.
Last edited by TpapLinuxer (2023-11-19 18:46:41)
https://docs.nvidia.com/deploy/xid-erro … #topic_5_9
- HW Error
- Driver Error
- System Memory Corruption
- Bus Error
- Thermal Issue
Try a non-zen kernel and the 535 driver, disable ASPM, and monitor the GPU and system temperatures.
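Before disabling ASPM it may be worth seeing which policy is currently active. The sketch below reads the kernel's sysfs entry for it, where the active policy is the one shown in square brackets. The helper names are made up for this example, and the file may be missing on kernels built without ASPM support, in which case the reader returns `None`. A common way to turn ASPM off is the `pcie_aspm=off` kernel parameter, but check that it suits your setup first.

```python
from pathlib import Path

# The kernel lists all policies here and marks the active one with brackets,
# e.g. "[default] performance powersave powersupersave".
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the active ASPM policy from sysfs; None if the file is unreadable."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None


# Example use on a live system:
#     print(read_aspm_policy())
```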
I did some extensive research and found out that it was a thermal issue: the GPU got up to around 100 degrees at full load. I repasted my GPU and now it's maintaining a temperature of 80 - 85 degrees.
The problem is now solved and hasn't occurred in about 10 days, which is a good sign.
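For anyone who wants to keep an eye on the temperature after a fix like this, here is a minimal sketch that polls `nvidia-smi` once and flags hot readings. The function names and the 90 degree limit are assumptions made for this example, not NVIDIA recommendations. It assumes `nvidia-smi` is installed and the driver can still talk to the GPU.

```python
import subprocess


def parse_temperatures(csv_text):
    """Parse one integer temperature per line, skipping unreadable entries."""
    temps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            temps.append(int(line))
        except ValueError:
            # Skip values that are not plain integers, e.g. "[N/A]".
            continue
    return temps


def query_gpu_temperatures():
    """Ask nvidia-smi for the core temperature of each NVIDIA GPU (in C)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_temperatures(result.stdout)


def too_hot(temps, limit_c=90):
    """Return the readings at or above limit_c (an assumed threshold)."""
    return [t for t in temps if t >= limit_c]


# Example use on a live system:
#     hot = too_hot(query_gpu_temperatures())
#     if hot:
#         print("GPU running hot:", hot)
```

Running this from a timer or a loop while the GPU is under load would show whether it creeps back toward the 100 degree range described above.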
\o/
Please always remember to mark resolved threads by editing your initial post's subject, so others will know that there's no task left, but maybe a solution to find.
Thanks.