
#1 2023-02-01 22:24:46

manchmalscott
Member
Registered: 2023-02-01
Posts: 1

NVidia NVRM Xid error suddenly brought down whole system

Hey y'all,

Running the 5.15.88-2-lts kernel, nvidia-lts 525.78.01-4, X11, and KDE Plasma 5.26.5 on a Gigabyte B660M DS3H DDR4 with an i5-12600K and an RTX 2060 Super.

I was watching a YouTube video when suddenly everything froze, my monitors restarted, and then I was on a text-mode TTY showing my boot messages, along with an occasional

`watchdog: BUG: soft lockup - CPU#0 stuck for XYs! [GlobalQueue]`

about every 24 seconds. I tried logging into another TTY and rebooting, but I got the error `Call to Reboot failed: Connection timed out`, so I had to do a hard reset.

After I logged back in, I pulled the dmesg logs with journalctl, and it looks like the problems started with the following:

Feb 01 14:25:31 archbox kernel: NVRM: GPU at PCI:0000:01:00: GPU-dfc57b8b-bd59-1dd4-0187-3fc3ac02b900
Feb 01 14:25:31 archbox kernel: NVRM: Xid (PCI:0000:01:00): 38, pid=1659, name=Xorg, 0008 0000c597 00000000 00000000
Feb 01 14:25:36 archbox kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2366, name=kwin_x11, Ch 00000038, errorString CTX SWITCH TIMEOUT, Info 0x1f4016
Feb 01 14:25:42 archbox kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=1135, name=systemd-udevd, Ch 00000001, errorString CTX SWITCH TIMEOUT, Info 0x14016
Feb 01 14:25:47 archbox kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=7771, name=Renderer, Ch 00000090, errorString CTX SWITCH TIMEOUT, Info 0x164016

According to NVIDIA's website, Xid 38 is a driver firmware error, and Xid 109... isn't a real error number? The table stops at 92.
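For anyone wanting to tally these events out of a long kernel log, here is a minimal Python sketch. It assumes the Xid lines have exactly the shape shown in this thread (`NVRM: Xid (PCI:...): <number>, pid=<pid>, name=<name>, <details>`); other driver versions may format them differently. The names `parse_xid_lines` and `count_by_xid` are made up for this example, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Assumed line shape, based only on the log excerpts in this thread:
#   NVRM: Xid (PCI:0000:01:00): 109, pid=2366, name=kwin_x11, Ch 00000038, ...
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), (?P<detail>.*)$"
)


def parse_xid_lines(lines):
    """Return one dict per line that looks like an NVRM Xid report."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            events.append({
                "pci": m["pci"],
                "xid": int(m["xid"]),
                "pid": int(m["pid"]),
                "name": m["name"],
                "detail": m["detail"].strip(),
            })
    return events


def count_by_xid(events):
    """Count how many times each Xid number appears."""
    return Counter(e["xid"] for e in events)


if __name__ == "__main__":
    # Sample lines copied from the journal excerpt in this post.
    sample = [
        "Feb 01 14:25:31 archbox kernel: NVRM: GPU at PCI:0000:01:00: GPU-dfc57b8b-bd59-1dd4-0187-3fc3ac02b900",
        "Feb 01 14:25:31 archbox kernel: NVRM: Xid (PCI:0000:01:00): 38, pid=1659, name=Xorg, 0008 0000c597 00000000 00000000",
        "Feb 01 14:25:36 archbox kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=2366, name=kwin_x11, Ch 00000038, errorString CTX SWITCH TIMEOUT, Info 0x1f4016",
    ]
    for xid, n in sorted(count_by_xid(parse_xid_lines(sample)).items()):
        print(f"Xid {xid}: {n} event(s)")
```

To run it on a real log, pass `sys.stdin` to `parse_xid_lines` instead of the sample list and pipe in the kernel messages from the previous boot, e.g. `journalctl -k -b -1 | python3 xid_summary.py` (the script name is hypothetical).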

The full dmesg from that point onward is hosted here (I had to trim it down: my system uptime was a few weeks, and this was 99% of the way down the log): https://termbin.com/pcc63

Is this a bug with the nvidia driver? kwin? Is this a hardware issue? Misconfiguration?

Any help would be appreciated.

Last edited by manchmalscott (2023-02-01 22:25:07)


#2 2023-02-08 21:59:44

kellerman
Member
From: Latvia
Registered: 2011-07-20
Posts: 104

Re: NVidia NVRM Xid error suddenly brought down whole system

I experience errors like this:

[ 1720.336957] NVRM: GPU at PCI:0000:01:00: GPU-50ea39f8-76d4-57dd-9d58-004667e5725b
[ 1720.336960] NVRM: Xid (PCI:0000:01:00): 109, pid=2075, name=Stray-Win64-Shi, Ch 0000004c, errorString CTX SWITCH TIMEOUT, Info 0x2ec04c

[ 1720.336998] NVRM krcCheckBusError_KERNEL: PCI-E corelogic status has pending errors (CL_PCIE_DEV_CTRL_STATUS = 00010000):
[ 1720.336999] NVRM krcCheckBusError_KERNEL:      _CORR_ERROR_DETECTED
[ 5571.654365] NVRM dmaAllocMapping_GM107: can't alloc VA space for mapping.
[ 5571.654367] NVRM kbusMapFbAperture_GM107: Failed: [GPU0] Could not map pAperOffset: 0x0
[ 5571.655736] NVRM dmaAllocMapping_GM107: can't alloc VA space for mapping.
[ 5571.655738] NVRM kbusMapFbAperture_GM107: Failed: [GPU0] Could not map pAperOffset: 0x0
[ 5571.657040] NVRM dmaAllocMapping_GM107: can't alloc VA space for mapping.
[ 5571.657042] NVRM kbusMapFbAperture_GM107: Failed: [GPU0] Could not map pAperOffset: 0x0
[ 5571.658344] NVRM dmaAllocMapping_GM107: can't alloc VA space for mapping.
[ 5571.658345] NVRM kbusMapFbAperture_GM107: Failed: [GPU0] Could not map pAperOffset: 0x0
[ 5571.659713] NVRM dmaAllocMapping_GM107: can't alloc VA space for mapping.
[ 5571.659715] NVRM kbusMapFbAperture_GM107: Failed: [GPU0] Could not map pAperOffset: 0x0
[ 5571.661093] NVRM dmaAllocMapping_GM107: can't alloc VA space for mapping.
[ 5571.661095] NVRM kbusMapFbAperture_GM107: Failed: [GPU0] Could not map pAperOffset: 0x0
[ 5916.042276] NVRM: Xid (PCI:0000:01:00): 109, pid=5588, name=Stray-Win64-Shi, Ch 0000004c, errorString CTX SWITCH TIMEOUT, Info 0x10c04c

[ 5916.042320] NVRM krcCheckBusError_KERNEL: PCI-E corelogic status has pending errors (CL_PCIE_DEV_CTRL_STATUS = 00010000):
[ 5916.042321] NVRM krcCheckBusError_KERNEL:      _CORR_ERROR_DETECTED

This happens occasionally for me when playing GPU-intensive games. It seems like an NVIDIA driver problem. It happens with both the nvidia and nvidia-open driver variants. I tried removing the overclock on the GPU, but that didn't help.
Mostly the GPU driver recovers (resets) and I can continue gaming, but there were also times when the whole system froze.

Last edited by kellerman (2023-02-08 22:01:15)


#3 2023-02-10 11:36:30

kellerman
Member
From: Latvia
Registered: 2011-07-20
Posts: 104

Re: NVidia NVRM Xid error suddenly brought down whole system

This looks linked to the following issue on GitHub:
https://github.com/NVIDIA/open-gpu-kern … issues/447
It seems to be a general NVIDIA driver issue:
https://forums.developer.nvidia.com/t/m … ors/235459


#4 2023-02-27 20:07:26

kellerman
Member
From: Latvia
Registered: 2011-07-20
Posts: 104

Re: NVidia NVRM Xid error suddenly brought down whole system

I have tried all sorts of things, from modifying kernel parameters to trying a bunch of different kernels and NVIDIA drivers, though I didn't go older than 515.xx. The vkd3d games still crashed on older driver versions, but with slightly different Xid errors rather than CTX SWITCH TIMEOUT. The latest drivers mostly gave only CTX SWITCH TIMEOUT errors.
It's a shame that NVIDIA hasn't been able to fix their drivers for so long. I would exchange my GTX 1660 for an AMD card without a second thought if I had the chance.
We just have to wait for them to fix this. Games crash sometimes after 5 minutes, sometimes after an hour or so.

