I'm running KDE Plasma on X11 on a desktop with proprietary NVIDIA drivers.
For work I frequently have to use MS Teams in the browser.
Aside from the occasional screen tearing on X I never had issues before.
Since 5/9, so about two weeks ago, the computer randomly freezes, kwin (or kwin_wayland) crashes, and occasionally all power to the USB peripherals is lost (mouse, keyboard, webcam, mic). This happens most often when using Teams, but I later discovered it also happens when playing video in the browser. I also found that my internet connection drops multiple times every few minutes.
As the system was quite stable before, I believe these issues may be related somehow as they appeared all at once.
If I follow journalctl while I watch video in the browser or use Teams with video:
I get a lot of kernel warnings of the form:
igb 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0015 address=0xc870fa2187c00244 flags=0x0030]
interleaved with
amd_iommu_report_page_fault: 68 callbacks suppressed
I did not have these messages before 5/9, although my journal only goes back to 1/9.
These messages don't seem to affect the user experience, but often they precede a crash and reset of the ethernet interface:
igb 0000:04:00.0: Detected Tx Unit Hang
Tx Queue <1>
TDH <89>
TDT <93>
next_to_use <93>
next_to_clean <8b>
buffer_info[next_to_clean]
time_stamp <100517329>
next_to_watch <000000003a775847>
jiffies <1005175c0>
desc.status <1200200>
igb 0000:04:00.0 enp4s0: Reset adapter
The unit hang may be repeated multiple times over multiple seconds.
This causes the Teams call to drop for a few seconds. I have the feeling it is happening more and more frequently, but this may be because I'm now actively watching for it.
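To check whether the hangs really are becoming more frequent, the three message types quoted above can be counted per boot. This is only a rough sketch: the function name `igb_summary` is made up for illustration, and the patterns are simply the message fragments from the log excerpts above.

```shell
# Count IOMMU page faults, Tx unit hangs and adapter resets in journal
# text read from stdin. The patterns are the fragments quoted above;
# the function name is hypothetical.
igb_summary() {
    awk '
        /IO_PAGE_FAULT/         { faults++ }
        /Detected Tx Unit Hang/ { hangs++ }
        /Reset adapter/         { resets++ }
        END {
            printf "page_faults=%d tx_hangs=%d resets=%d\n", faults, hangs, resets
        }
    '
}

# Example: kernel messages of the current boot
#   journalctl -k -b | igb_summary
# Example: kernel messages of the previous boot
#   journalctl -k -b -1 | igb_summary
```

Comparing the numbers across boots gives a less subjective picture than watching the journal live.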
I get quite a lot of NVRM: Xid errors like this:
NVRM: Xid (PCI:0000:08:00): 32, pid=69982, name=plasmashell, Channel ID 00000087 intr0 00040000
NVRM: Xid (PCI:0000:08:00): 31, pid=91022, name=Renderer, Ch 000000a0, intr 00000000. MMU Fault: ENGINE GRAPHICS HUBCLIENT_FE faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
I've also seen Xid errors 11, 13, and 69, but 32 is the most common. I've also seen 56.
Most of these errors are then followed by errors from plasmashell
plasmashell[6229]: [GFX1]: Device reset due to WR context
plasmashell[6229]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
plasmashell[6229]: [GFX1-]: Failed to make render context current during destroying.
Xid 32 also occurs when I'm not watching video and seems to occur totally randomly. It does not always impact UX but sometimes causes a noticeable delay/hang in the screen update.
Again I don't see these errors before 5/9.
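A quick way to see which Xid codes dominate is to tally them from the kernel log. This is a minimal sketch that assumes the `NVRM: Xid (PCI:...): <code>,` layout of the lines quoted above; the function name `xid_counts` is made up.

```shell
# Tally NVRM Xid codes found in journal text on stdin.
# Assumes lines shaped like: NVRM: Xid (PCI:0000:08:00): 32, pid=...
xid_counts() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' |
        sort -n |
        uniq -c |
        awk '{ print "Xid " $2 ": " $1 }'
}

# Example: kernel messages of the current boot
#   journalctl -k -b | xid_counts
```

Running it against older boots (`journalctl -k -b -N`) is one way to confirm when the first Xid appeared.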
Plasmashell crashes and restarts with a long systemd-coredump
USB peripherals lose power and don't come back until I manually unplug and plug them back in:
kernel: xhci_hcd 0000:0a:00.3: xHCI host not responding to stop endpoint command
kernel: xhci_hcd 0000:0a:00.3: xHCI host controller not responding, assume dead
kernel: xhci_hcd 0000:0a:00.3: HC died; cleaning up
kernel: usb 5-1: USB disconnect, device number 2
kernel: usb 5-4: Not enough bandwidth for altsetting 0
kernel: usb 5-3: USB disconnect, device number 3
kernel: usb 5-4: USB disconnect, device number 4
kernel: usb 3-3: new low-speed USB device number 5 using xhci_hcd
kernel: usb 3-3: New USB device found, idVendor=046d, idProduct=c328, bcdDevice=86.00
kernel: usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
kernel: usb 3-3: Product: USB Keyboard
kernel: usb 3-3: Manufacturer: Logitech
kernel: input: Logitech USB Keyboard as /devices/pci0000:00/0000:00:01.2/0000:02:00.0/0000:03:08.0/0000:05:00.3/usb3/3-3/3-3:1.0/0003:046D:C328.0007/input/input21
kernel: hid-generic 0003:046D:C328.0007: input,hidraw2: USB HID v1.10 Keyboard [Logitech USB Keyboard] on usb-0000:05:00.3-3/input0
kernel: input: Logitech USB Keyboard Consumer Control as /devices/pci0000:00/0000:00:01.2/0000:02:00.0/0000:03:08.0/0000:05:00.3/usb3/3-3/3-3:1.1/0003:046D:C328.0008/input/input22
mtp-probe[77438]: checking bus 3, device 5: "/sys/devices/pci0000:00/0000:00:01.2/0000:02:00.0/0000:03:08.0/0000:05:00.3/usb3/3-3"
kernel: input: Logitech USB Keyboard System Control as /devices/pci0000:00/0000:00:01.2/0000:02:00.0/0000:03:08.0/0000:05:00.3/usb3/3-3/3-3:1.1/0003:046D:C328.0008/input/input23
kernel: hid-generic 0003:046D:C328.0008: input,hiddev97,hidraw4: USB HID v1.10 Device [Logitech USB Keyboard] on usb-0000:05:00.3-3/input1
mtp-probe[77438]: bus: 3, device: 5 was not an MTP device
systemd-logind[1532]: Watching system buttons on /dev/input/event3 (Logitech USB Keyboard)
systemd-logind[1532]: Watching system buttons on /dev/input/event5 (Logitech USB Keyboard System Control)
systemd-logind[1532]: Watching system buttons on /dev/input/event4 (Logitech USB Keyboard Consumer Control)
mtp-probe[77463]: checking bus 3, device 5: "/sys/devices/pci0000:00/0000:00:01.2/0000:02:00.0/0000:03:08.0/0000:05:00.3/usb3/3-3"
mtp-probe[77463]: bus: 3, device: 5 was not an MTP device
Things I've tried from googling around that don't seem to do anything:
- updating the BIOS, which is now at the latest available version (F38f)
- running Plasma under Wayland
- using the LTS kernel
- disabling hardware encoding in Firefox
- using a different browser (Chromium)
- disabling virtualization in the BIOS
- doing a deep clean of the hardware (dust). Also GPU temperatures never exceeded ~50 degrees C.
- configuring the ethernet interface with ethtool, e.g. disabling sgm rx and tx
- memtest86+ to check that RAM is OK. No errors it seems.
Things I've tried that have helped but not solved the issue(s):
- when on X11, I've disabled kwin, which has reduced the number of full system freezes, however most of the issues listed above persist.
- Added a file /etc/sysctl.d/usb_power.conf with usbcore.autosuspend=-1. Since then I haven't had a full crash of the peripherals but since it was a rare event I can't be sure it's solved.
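Since `usbcore.autosuspend` is a kernel module parameter, it may be worth confirming that the value actually reached the kernel. The value in effect should be readable from sysfs. The sketch below assumes the usual path `/sys/module/usbcore/parameters/autosuspend`; the function name is made up.

```shell
# Report the usbcore autosuspend delay the kernel is currently using.
# A negative value means autosuspend is disabled.
# The optional argument overrides the sysfs path (useful for testing).
usb_autosuspend_status() {
    param_file=${1:-/sys/module/usbcore/parameters/autosuspend}
    if [ ! -r "$param_file" ]; then
        echo "unknown (cannot read $param_file)"
        return 1
    fi
    value=$(cat "$param_file")
    if [ "$value" -lt 0 ]; then
        echo "autosuspend disabled ($value)"
    else
        echo "autosuspend after ${value}s"
    fi
}

# Example:
#   usb_autosuspend_status
```

If this still reports a positive delay after a reboot, the setting did not take effect, and the improvement seen so far may be coincidental.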
Potentially relevant hardware info:
Motherboard: Gigabyte X570 Aorus Elite
CPU: AMD Ryzen 9 3950X
GPU: NVIDIA GeForce RTX 3080 (Gigabyte)
Ethernet controller: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
Displays: 2 displays with 2560x1440 resolution at 60 Hz, connected directly to the GPU HDMI outputs
Software/driver info:
kernel: 6.5.3-arch1-1 (regular) and linux-lts 6.1.54-1
plasma: 5.27.8
nvidia: 535.104.05-6 and nvidia-lts 1:535.104.05-11
These are the currently installed versions.
Since 5/9 I've tried to update the system a couple times but I'm not entirely sure whether Linux, plasma or nvidia drivers were affected by those updates.
In any case it did not resolve any issues.
The issue seems to have come out of nowhere, not immediately after an update, which made me think it could be a broken hardware issue.
However, I've tested the system briefly using an Ubuntu 20.04 live install and don't seem to have any of the same issues, though I might have to drive it for a bit longer to gather more data.
Ubuntu uses the nouveau drivers, which made me lean more towards a driver error/misconfiguration.
Unfortunately I need to use the proprietary drivers for CUDA.
What else can I try to figure out what is going on or to fix the issues?
Kwin crashing also seems to be related to Xid errors:
Sep 05 08:41:26 colormonster kwin_x11[4191]: kwin_core: Failed to focus 0x3c00010 (error 8)
Sep 05 08:41:27 colormonster kwin_x11[4191]: kwin_core: XCB error: 3 (BadWindow), sequence: 9121, resource id: 4194329, major code: 129 (SHAPE), minor code: 6 (Input)
Sep 05 08:41:27 colormonster kwin_x11[4191]: kwin_core: XCB error: 152 (BadDamage), sequence: 9127, resource id: 16777279, major code: 143 (DAMAGE), minor code: 3 (Subtract)
Sep 05 09:01:00 colormonster kernel: NVRM: Xid (PCI:0000:08:00): 69, pid=4191, name=kwin_x11, Class Error: ChId 0038, Class 0000c797, Offset 00001b0c, Data 2001056d, ErrorCode 0000000c
Sep 05 09:01:00 colormonster kwin_x11[4191]: kwin_scene_opengl: A graphics reset attributable to the current GL context occurred.
Sep 05 09:09:18 colormonster kwin_x11[4191]: kwin_scene_opengl: A graphics reset attributable to the current GL context occurred.
Sep 05 09:11:03 colormonster kwin_x11[4191]: 22 -- exe=/usr/bin/kwin_x11
Sep 05 09:11:03 colormonster kwin_x11[4191]: 17 -- appname=kwin_x11
Sep 05 09:11:03 colormonster kwin_x11[4191]: 17 -- programname=KWin
Sep 05 09:11:03 colormonster kwin_x11[4191]: KCrash: Application Name = kwin_x11 path = /usr/bin pid = 4191
Sep 05 09:11:03 colormonster kwin_x11[4191]: KCrash: Arguments: /usr/bin/kwin_x11 --replace
Sep 05 09:11:03 colormonster systemd-coredump[15325]: [?] Process 4191 (kwin_x11) of user 1000 dumped core.
Before 5/9 there are some kwin_core XCB errors as well but never errors related to opengl.
https://docs.nvidia.com/deploy/xid-errors/index.html
- 31 is "GPU memory page fault": driver or client error
- 69 is "Graphics Engine class error": hardware or driver error
- 32 is "Invalid or corrupted push buffer stream": driver or external (RAM, temperature etc.) error
=> most likely a driver bug
"Since 5/9, so about two weeks ago"
Check your pacman log; 535.104.05 entered the repos on August 23rd.
You could try to get the 535.98 versions of nvidia-dkms and nvidia-utils from the ALA and see whether that stabilizes the situation.
https://wiki.archlinux.org/title/ALA
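To see exactly when the relevant packages changed, the pacman log can be filtered by package name. This is a sketch that assumes the default log location `/var/log/pacman.log` and the current log layout (`[timestamp] [ALPM] upgraded <package> (<old> -> <new>)`); the function name `pacman_changes` is made up.

```shell
# List install/upgrade/downgrade entries from the pacman log whose
# package name matches an extended regular expression.
#   $1: regex matched against the package name
#   $2: log file (defaults to /var/log/pacman.log)
pacman_changes() {
    pattern=$1
    log=${2:-/var/log/pacman.log}
    awk -v pat="$pattern" '
        $2 == "[ALPM]" &&
        ($3 == "installed" || $3 == "upgraded" || $3 == "downgraded") &&
        $4 ~ pat
    ' "$log"
}

# Example: driver, kernel and window manager changes
#   pacman_changes '^(nvidia|linux|kwin|plasma)'
```

The timestamps in the output can then be lined up with the first appearance of the errors in the journal.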
Thanks for the great suggestion of looking into the pacman logs, it hasn't solved everything but I can start to trace back the multitude of issues!
Seems like I updated the system with a pretty big update on 1/09 and didn't make heavy use of the computer until 5/09, which is probably why everything started crashing and it seemed like all issues were related.
As you suggested I downgraded the driver (moving to nvidia-dkms) + downgraded nvidia-utils to the version before the update of 1/09, and it feels slightly more stable with video but unfortunately it doesn't solve the NVRM XID errors or the igb errors. Also kwin is slightly more stable but still crashes occasionally, especially on screenshare, and seems to cause full system freezes. The following corresponds to a full freeze of the graphical environment:
Sep 21 21:00:15 colormonster kwin_x11[4206]: Freeze in OpenGL initialization detected
Sep 21 21:00:15 colormonster kwin_x11[4206]: Application::crashHandler() called with signal 6; recent crashes: 1
Sep 21 21:00:16 colormonster kwin_x11[4206]: 22 -- exe=/usr/bin/kwin_x11
Sep 21 21:00:16 colormonster kwin_x11[4206]: 13 -- platform=xcb
Sep 21 21:00:16 colormonster kwin_x11[4206]: 11 -- display=:0
Sep 21 21:00:16 colormonster kwin_x11[4206]: 17 -- appname=kwin_x11
Sep 21 21:00:16 colormonster kwin_x11[4206]: 17 -- apppath=/usr/bin
Sep 21 21:00:16 colormonster kwin_x11[4206]: 9 -- signal=6
Sep 21 21:00:16 colormonster kwin_x11[4206]: 9 -- pid=4206
Sep 21 21:00:16 colormonster kwin_x11[4206]: 18 -- appversion=5.27.8
Sep 21 21:00:16 colormonster kwin_x11[4206]: 17 -- programname=KWin
Sep 21 21:00:16 colormonster kwin_x11[4206]: 31 -- bugaddress=submit@bugs.kde.org
Sep 21 21:00:16 colormonster kwin_x11[4206]: KCrash: crashing... crashRecursionCounter = 2
Sep 21 21:00:16 colormonster kwin_x11[4206]: KCrash: Application Name = kwin_x11 path = /usr/bin pid = 4206
Sep 21 21:00:16 colormonster kwin_x11[4206]: KCrash: Arguments: /usr/bin/kwin_x11 --replace
Sep 21 21:00:18 colormonster kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1116
Sep 21 21:00:20 colormonster kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1116
Sep 21 21:00:20 colormonster systemd[4112]: plasma-kwin_x11.service: Main process exited, code=killed, status=14/ALRM
Sep 21 21:00:20 colormonster systemd[4112]: plasma-kwin_x11.service: Failed with result 'signal'.
Sep 21 21:00:20 colormonster systemd[4112]: plasma-kwin_x11.service: Consumed 19.426s CPU time.
Sep 21 21:00:20 colormonster systemd[4112]: plasma-kwin_x11.service: Scheduled restart job, restart counter is at 1.
Sep 21 21:00:20 colormonster systemd[4112]: Starting KDE Window Manager...
Sep 21 21:00:31 colormonster CROND[11150]: (root) CMDOUT ( 0.00% complete (06:41:51 remaining) 0.57% complete (00:03:03 remaining) 4.49% complete (00:00:43 remaining) 4.49% complete (00:01:05 remaining) 6.40% complete (00:00:59 remaining) 7.20% complete (00:01:05 remaining)>
Sep 21 21:00:59 colormonster CROND[11150]: (root) CMDOUT (06 remaining) 83.08% complete (00:00:05 remaining) 86.88% complete (00:00:04 remaining) 93.74% complete (00:00:02 remaining) 99.52% complete (00:00:00 remaining)104.29% complete (00:00:00 remaining)110.79% complete (00:00:>
Sep 21 21:01:01 colormonster CROND[11424]: (root) CMD (run-parts /etc/cron.hourly)
Sep 21 21:01:01 colormonster CROND[11423]: (root) CMDEND (run-parts /etc/cron.hourly)
Sep 21 21:01:26 colormonster CROND[11150]: (root) CMDOUT (plete (00:00:00 remaining)180.74% complete (00:00:00 remaining)190.83% complete (00:00:00 remaining)192.81% complete (00:00:00 remaining)193.33% complete (00:00:00 remaining)193.91% complete (00:00:00 remaining)196.41% com>
Sep 21 21:01:30 colormonster kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1116
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (g)283.87% complete (00:00:00 remaining)283.87% complete (00:00:00 remaining)283.87% complete (00:00:00 remaining)283.87% complete (00:00:00 remaining)284.85% complete (00:00:00 remaining)285.82% complete (00:00:00 remainin>
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (RSYNC Snapshot saved successfully (104s))
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Tagged snapshot '2023-09-21_21-00-02': hourly)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Daily snapshots are enabled)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Last daily snapshot is 0 hours old)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Weekly snapshots are enabled)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Last weekly snapshot is 5 days old)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (------------------------------------------------------------------------------)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Maximum backups exceeded for backup level 'hourly')
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Snapshot '2023-09-20_17-00-01' un-tagged 'hourly')
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Removing snapshots (un-tagged):)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (------------------------------------------------------------------------------)
Sep 21 21:01:47 colormonster CROND[11150]: (root) CMDOUT (Removing '2023-09-20_17-00-01'...)
Sep 21 21:01:50 colormonster systemd[4112]: plasma-kwin_x11.service: start operation timed out. Terminating.
Sep 21 21:01:50 colormonster systemd[4112]: plasma-kwin_x11.service: Failed with result 'timeout'.
Sep 21 21:01:50 colormonster systemd[4112]: Failed to start KDE Window Manager.
Sep 21 21:01:50 colormonster systemd[4112]: plasma-kwin_x11.service: Scheduled restart job, restart counter is at 2.
Sep 21 21:01:50 colormonster systemd[4112]: Starting KDE Window Manager...
Sep 21 21:01:53 colormonster kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:445
Sep 21 21:01:57 colormonster kernel: NVRM: Xid (PCI:0000:08:00): 56, pid='<unknown>', name=<unknown>, CMDre 00000007 00000240 0001009a 00000007 00000000
Sep 21 21:01:57 colormonster kernel: NVRM: Xid (PCI:0000:08:00): 56, pid='<unknown>', name=<unknown>, CMDre 00000007 00000240 0001009a 00000007 00000000
Sep 21 21:01:57 colormonster kernel: NVRM: Xid (PCI:0000:08:00): 56, pid='<unknown>', name=<unknown>, CMDre 00000007 00000444 00010084 00000007 00000000
So I conclude graphical issues are related to KDE updates from 5.108.0 -> 5.109.0.
The resetting of the ethernet interface is probably not related to nvidia, but some issue with linux-firmware or the kernel itself?
Still it's weird that one update causes so many issues in so many places, I haven't had that before and am running the system now for about 3 years.
Maybe the best way to ensure there is no broken hardware is reverting to the last update before 1/09. Will try that and report back.
Still, I can't keep the system frozen at August 2023 forever. Is there a recommended approach for when and how to update the system after a rollback?
Ok I'm now 95% sure something in the hardware is broken.
What I've tried:
First I rolled back only the packages to before 5/9. Still all the same errors.
Second, I used timeshift to roll back the entire machine to the middle of August. Still the same errors. My timeshift doesn't affect /home so I thought maybe it could still be related to some config. Tried a fresh user but still the same problems.
Thanks to the timeshift roll back I could also see the journal earlier in the past. It gives me more confidence that the NVRM and igb errors were never seen in the past and are related to the instability. They just suddenly appeared in the journal on the evening of 4/9 (though it didn't manifest as crashes until the next day).
Unless there is some permanent state that got botched on a chip in the GPU or somewhere else on the mobo besides the BIOS, I can only conclude something in the hardware is broken.
The only question remains what.
As I've tested the memory, which seems ok, that leaves me something on the mobo, the GPU, the CPU or the power supply. Fans and temperatures seem fine.
Is there a way to figure out which component may be broken, except for replacing each one with a spare (which I don't have)?
Does your graphics card have a heavy cooling system? Maybe it hangs in the slot without support and is getting damaged in that position?
XID 56 is hardware or driver - ideally test a different SW stack (live distro)
Is there a parallel Windows installation?
Please post your complete system journal for the boot:
sudo journalctl -b | curl -F 'file=@-' 0x0.st
Can you please try to boot with kernel parameter
amd-iommu=off
and check again if you have all these errors in the log?
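After rebooting, it is worth verifying that the parameter really made it onto the kernel command line before drawing conclusions from the log. The sketch below reads `/proc/cmdline` and treats dashes and underscores in the parameter name as equivalent, which is how the kernel handles parameter names (so `amd-iommu=off` and `amd_iommu=off` should both match). The function name is made up.

```shell
# Succeed if the kernel command line contains the given name=value pair.
# Dashes and underscores in the parameter NAME are treated as equal.
#   $1: parameter to look for, e.g. amd_iommu=off
#   $2: command line to search (defaults to the contents of /proc/cmdline)
has_kernel_param() {
    want_name=$(printf '%s' "${1%%=*}" | tr '-' '_')
    want_value=${1#*=}
    cmdline=${2:-$(cat /proc/cmdline)}
    for arg in $cmdline; do
        name=$(printf '%s' "${arg%%=*}" | tr '-' '_')
        if [ "$name" = "$want_name" ] && [ "${arg#*=}" = "$want_value" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   has_kernel_param amd_iommu=off && echo "IOMMU disabled via cmdline"
```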
Last edited by snazza (2023-09-22 16:44:43)
Thanks all for your input and patience. Some more updates.
I don't think the GPU is busted (luckily). I got my hands on another cheaper nvidia card (GTX 1650) and swapped it out. Card was completely new so I doubt I have two busted cards. All tests revealed the same Xid errors. I tried to "challenge" the system by playing youtube videos, simultaneously doing a Teams meeting with camera and screen share, and by turning kwin on. Often see Xid 32 and 69. When kwin was on it crashed a couple times and I got Xid 56 a few times.
Just to be sure, I will take the suggestion of xerxes_ to add additional mechanical support to the RTX 3080 GPU as it is indeed quite heavy. But I don't think it is the cause of these issues.
As for the kernel parameter suggestion from snazza, this indeed seems to have fixed the igb page faults for now. Thanks! I haven't seen any of the network crashes for a while. What I still don't understand is why I didn't have these errors before when I didn't have this configured, and why when I rolled the system back to a point when everything worked I still got the igb errors. This only makes sense if something on the hardware changed or became broken.
So this leaves the NVRM Xid errors and resulting graphics glitches + plasmashell/kwin crashes.
For seth:
Some journal excerpts from different boots:
A boot from 1/9 (before the update in the evening of 1/9) when everything worked fine: http://0x0.st/HV8K.txt
A boot from 2/9 (after the update of 1/9) when everything worked fine: http://0x0.st/HV8a.txt
A boot from 3/9 when everything worked fine: http://0x0.st/HV8m.txt
A boot from 4/9 when everything worked fine except in the evening (at 19:40 the first NVRM Xid 32 error but I didn't notice): http://0x0.st/HV8B.txt
A boot from 5/9 when everything graphical was broken (but no igb errors): http://0x0.st/HVXd.txt
A boot from 6/9 when there were graphical issues + igb errors started showing: http://0x0.st/HVPW.txt
A boot from 23/9 with the cheap card + the amd iommu kernel parameter off (no more igb errors): http://0x0.st/HVix.txt
Also potentially interesting:
A boot from 24/9 but with the system rolled back to the state of 14:00 on 1/9 (before the update) with timeshift: http://0x0.st/HV8Z.txt. Here we still see the NVRM errors. Also using the cheap new GPU. You will see some differences here due to the updated BIOS. I updated the BIOS on 19/09, before that I never touched it.
There is no parallel Windows install, Arch is the only OS on the system, but I have two kernels both of which show broken behavior.
As mentioned before I tested a live Ubuntu which gave me the same igb errors but no NVRM errors because this one had nouveau drivers. None of the live USBs I have use the nvidia drivers. Perhaps I can create a minimal bootable version of this system on a USB but this will take me a bit of time.
"boot from 24/9 but with the system rolled back to the state of 14:00 on 1/" has only "Invalid or corrupted push buffer stream" in ksplashhtm and chromium, which could simply be client errors.
"A boot from 23/9 with the cheap card + the amg iommu kernel parameter off" looks much worse
I'd try the current system but with either
1. the LTS kernel
2. nvidia-dkms and nvidia-utils 535.98 from the ALA
Just a small intermediate update: my problems are still not fixed (Xid errors remain when using Teams/screen sharing and/or YouTube), even with older drivers or an older kernel. I'm sharing some additional experiments and findings.
I moved the GPU to the other PCI slot on the mobo, but that didn't do anything.
I've disabled almost every feature in the BIOS (SMT mode, IOMMU) and messed with CPU and PCI power settings.
I've toggled various combinations of nvidia kernel module parameters, e.g. NVreg_RegisterForACPIEvents, NVreg_EnableMSI, NVreg_EnablePCIeGen3, NVreg_PreserveVideoMemoryAllocations, and NVreg_UsePageAttributeTable.
It seems others have had seemingly random issues with nvidia cards on this mobo, see this thread: https://forums.developer.nvidia.com/t/r … k-up/79731, though they report Xid 61, which I have not seen. Still, as I have very similar hardware and the issues appeared out of the blue, I tried some of the suggestions about setting GPU frequencies. No dice.
Still want to try out a live install with nvidia drivers but haven't gotten around to that.
Also might try different RAM sticks just in case there is an issue there.
If I can't find a solution I will probably try to find answers on the NVIDIA forums as it's unlikely this is a problem specific to Arch.
I'm getting the feeling that I see a lot fewer errors when I turn off my swap partition. Swap partition is a logical volume inside a LUKS partition. I don't know why or what that means, but it is noticeable.
Swap inside LVM has previously shown up as inhibiting performance and (iirc) raising CPU load; the encryption part isn't relevant.
Do you get better behavior when using a swapfile?
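When comparing the two setups, it helps to confirm which kind of swap is actually active at the time of a test. A minimal sketch, assuming the usual `/proc/swaps` layout (a header line, then one row per swap area with the name in the first column and the type in the second); the function name is made up. A logical volume typically shows up as a `/dev/dm-*` device of type `partition`, a swapfile as type `file`.

```shell
# Print each active swap area with its type ("partition" or "file").
#   $1: file to read (defaults to /proc/swaps)
swap_summary() {
    awk 'NR > 1 { print $1 " (" $2 ")" }' "${1:-/proc/swaps}"
}

# Example:
#   swap_summary
```

No output means no swap is active, which matches the "swap turned off" case described above.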