#1 2026-05-10 14:51:24

foraphe
Member
Registered: 2025-06-23
Posts: 7

Infrequent kernel soft lockups under light loads that don't leave logs

Update: Also tried fabric clock (FCLK) at 1800MHz and memory clock at 5600MT/s, and still had freezes.

I've had an issue since last month similar to https://bbs.archlinux.org/viewtopic.php?id=311723, but since that thread has been inactive for two months, I felt it would be better to start a new one. The freezes only started occurring after both hardware and software changes, so I haven't 100% ruled out a hardware failure (in which case my issue would be unrelated to the other thread), but given the behavior I don't think that's likely.

I'm running a system with an ASUS TUF Gaming B650M-PLUS WiFi motherboard and a Ryzen 9 7950X CPU. In early April I installed a PCIe packet switch, updated the UEFI firmware to 3842 (AGESA 1.3.0.0a), and after the update made minor changes to DRAM timings (they still pass stress tests). Since then I've had random freezes where the kernel seems to soft or hard lock up: SysRq doesn't work, the NICs go down, switching TTYs doesn't work, and I can only force a shutdown using the power button. After rebooting there are no logs on disk related to the freezes.
I've tried disabling PCIe ASPM, setting the kernel parameters processor.max_cstate=1 and usbcore.autosuspend=-1 (as discussed in the other thread), upgrading and downgrading the UEFI firmware, upgrading and downgrading the kernel, tuning voltages and memory timings, raising the minimum CPU frequency using cpupower, increasing Load-Line Calibration levels, and disabling boosting. None of these mitigations fixed the freezes; they either changed how often the freezes happen or did nothing (I haven't had enough freezes to tell whether freezing after 5 versus 15 minutes in the test explained below is just random or shows some sort of progress).

One specific scenario I discovered that increases the frequency of the freezes from around once every few days to around once every 10-15 minutes is to have the rhythm game osu! running its tournament client (multiple instances are started in this mode), capture all these game windows using OBS, and do some other light work or just let the system stay in that state (this creates a fairly large number of tasks that don't consume many resources). Before I used usbcore.autosuspend=-1 as per the fix in the other thread, there was usually nothing in the logs before the soft lockups happened. After setting it on the kernel command line, I saw only once that, just before a freeze, two different GPUs (one AMD, one Intel) behind a PCIe packet switch timed out at around the same time. The usb_poll test program (mentioned in the other thread) never seems to freeze my system with that command line. Other common actions before a freeze include starting VSCode and potentially other Electron applications. The freezes only happen under transient and spiky workloads, not when I play heavier games, run stress tests, or otherwise have the system under heavier load. I tried different stress tests (y-cruncher, stress-ng, mprime, an ffmpeg encode, LLM finetuning on the GPU, hashcat, the OCCT CPU stress test with both constant and variable load, etc.) and they all seem to run well without errors or freezes.

I tried raising voltages (reducing negative core voltage offsets and increasing SoC voltage), which fixed the issue for me during April, but since yesterday it's happening again, and this time further increasing voltages or loosening timings didn't help. I tried the latest UEFI firmware for my motherboard (3854, AGESA 1.3.0.0b) and also downgrading to 3602 (AGESA 1.2.7.0, released in 2025/11); neither fixed the freezes. Downgrading the kernel to 6.19.12 also didn't help.

This has led me to suspect that the I/O die or the fabric itself is unstable with certain power-saving features, not just a USB controller or a PCIe link. This isn't backed by any conclusive proof, since I only see anything from netconsole around 10% of the time (I now load the netconsole module and log to my other system most of the time), and when logs do come through there's usually nothing before the kernel detects soft lockups on many different CPU cores. I also tried using pstore/ramoops, but it seems the kernel didn't panic before it completely hung, and pstore didn't preserve any logs either (similar to the other thread). But since the other thread has multiple cases where ASUS motherboards and Zen 4/5 Ryzen 7/9 CPUs had similar issues, I don't think it's necessarily a hardware failure.
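
As an illustration of the netconsole logging described above, here is a minimal receiver sketch in Python. It assumes the sender targets UDP port 6666 on the logging machine; the port, the file name, and the function names are placeholders made up for this example, not taken from this setup. Each datagram is stamped with the receiver's clock and written line-buffered, so the last message before a hang can be lined up with wall-clock time.

```python
import socket
import sys
import time


def format_record(payload: bytes, received_at: float) -> str:
    """Prefix one netconsole datagram with the receiver's wall-clock time."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(received_at))
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {text}\n"


def serve(port: int, log_path: str) -> None:
    """Append every datagram received on the UDP port to log_path."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        # buffering=1 selects line buffering in text mode, so each record
        # is flushed as soon as it is written.
        with open(log_path, "a", buffering=1, encoding="utf-8") as log:
            while True:
                payload, _sender = sock.recvfrom(65535)
                log.write(format_record(payload, time.time()))


# Only start listening when a port and a file name are given, e.g.
#   python netconsole_rx.py 6666 netconsole.log
if __name__ == "__main__" and len(sys.argv) == 3:
    serve(int(sys.argv[1]), sys.argv[2])
```

The receiver never exits on its own; stop it with Ctrl+C once the sender has frozen.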

One specific case that might provide some insight into what's happening: during one freeze my mouse stopped registering inputs first, and during the next second, before my display and the whole system froze, an application did start. So between the mouse inputs stopping and the system locking up there was a brief period when userspace was still running.

At this point I'm out of ideas on where to explore next, and I'll be grateful if someone could point me in some directions on how to diagnose the issue.

lspci (this system had a PEX88080 PCIe packet switch hosting GPUs and another PLX8747 switch, which then hosts NVMe SSDs, so there are a lot of PCIe bridges, and while some I/O memory space failed to assign, every device has been working properly):

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: ryzen_smu
	Kernel modules: ryzen_smu
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU
	Subsystem: ASUSTeK Computer Inc. Device 8877
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: pcieport
	Kernel modules: shpchp
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: pcieport
	Kernel modules: shpchp
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: pcieport
	Kernel modules: shpchp
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: pcieport
	Kernel modules: shpchp
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A]
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: pcieport
	Kernel modules: shpchp
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A]
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: pcieport
	Kernel modules: shpchp
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 71)
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: piix4_smbus
	Kernel modules: i2c_piix4, sp5100_tco
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
	Subsystem: ASUSTeK Computer Inc. Device 8877
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3
	Kernel modules: k10temp
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7
01:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
02:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
02:04.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
02:08.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
02:0c.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
02:1c.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
03:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
04:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
04:08.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
05:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev 24)
	Subsystem: ASUSTeK Computer Inc. Device 1478
	Kernel driver in use: pcieport
	Kernel modules: shpchp
06:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch (rev 24)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 48 [Radeon RX 9070/9070 XT/9070 GRE] (rev c0)
	Subsystem: ASUSTeK Computer Inc. Device 061a
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
07:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 48 HDMI/DP Audio Controller
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 48 HDMI/DP Audio Controller
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
09:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
0a:10.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
0a:18.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
0b:00.0 PCI bridge: Intel Corporation Device 4fa1 (rev 01)
	Kernel driver in use: pcieport
	Kernel modules: shpchp
0c:01.0 PCI bridge: Intel Corporation Device 4fa4
	Subsystem: Intel Corporation Device 4fa4
	Kernel driver in use: pcieport
	Kernel modules: shpchp
0c:04.0 PCI bridge: Intel Corporation Device 4fa4
	Subsystem: Intel Corporation Device 0000
	Kernel driver in use: pcieport
	Kernel modules: shpchp
0d:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05)
	Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1814
	Kernel driver in use: i915
	Kernel modules: i915, xe
0e:00.0 Audio device: Intel Corporation DG2 Audio Controller
	Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1283
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
0f:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
	Subsystem: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
10:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
	Subsystem: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
10:09.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
	Subsystem: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
10:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
	Subsystem: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
10:11.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
	Subsystem: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
11:00.0 Non-Volatile memory controller: Yangtze Memory Technologies Co.,Ltd PC411 M.2 2280 NVMe SSD (DRAM-less) (rev 01)
	Subsystem: Yangtze Memory Technologies Co.,Ltd PC411 M.2 2280 NVMe SSD (DRAM-less)
	Kernel driver in use: nvme
	Kernel modules: nvme
13:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
	Subsystem: Samsung Electronics Co Ltd SSD 970 EVO/PRO
	Kernel driver in use: nvme
	Kernel modules: nvme
14:00.0 Non-Volatile memory controller: Intel Corporation NVMe Optane Memory Series
	Subsystem: Intel Corporation Optane Memory M10 16GB
	Kernel driver in use: nvme
	Kernel modules: nvme
15:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
16:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
16:08.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
19:00.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
1a:14.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
1a:15.0 PCI bridge: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI PEX88064 64 lane/port PCIe Gen 4.0 Switch
	Kernel driver in use: pcieport
	Kernel modules: shpchp
1d:00.0 Mass storage controller: Broadcom / LSI PEX880xx PCIe Gen 4 Switch (rev b0)
	Subsystem: Broadcom / LSI Device 00b2
1e:00.0 Non-Volatile memory controller: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPlus7100 (rev 01)
	Subsystem: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPlus7100
	Kernel driver in use: nvme
	Kernel modules: nvme
1f:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
20:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
20:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
20:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
20:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
20:0b.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
20:0c.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
20:0d.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
	Subsystem: ASMedia Technology Inc. Device 3328
	Kernel driver in use: pcieport
	Kernel modules: shpchp
21:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
	Subsystem: Mellanox Technologies Device 0051
	Kernel driver in use: mlx4_core
	Kernel modules: mlx4_core
22:00.0 PCI bridge: Texas Instruments XIO2001 PCI Express-to-PCI Bridge
	Kernel modules: shpchp
23:00.0 Multimedia audio controller: ESI Audiotechnik GmbH MAYA44 family PCI Audio Controller (rev 03)
	Subsystem: Device 0e51:0003
	Kernel driver in use: snd_ice1724
25:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
	DeviceName: Realtek RTL8125BG LAN
	Subsystem: ASUSTeK Computer Inc. Device 87d7
	Kernel driver in use: r8169
	Kernel modules: r8169
26:00.0 Network controller: MEDIATEK Corp. MT7921 802.11ax PCIe Wireless Network Adapter [Filogic 330]
	Subsystem: AzureWave Device 4680
	Kernel driver in use: mt7921e
	Kernel modules: mt7921e
27:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller (rev 01)
	Subsystem: ASMedia Technology Inc. Device 1142
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci
28:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller (rev 01)
	Subsystem: ASMedia Technology Inc. Device 1062
	Kernel driver in use: ahci
	Kernel modules: ahci
29:00.0 Non-Volatile memory controller: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPlus7100 (rev 01)
	Subsystem: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPlus7100
	Kernel driver in use: nvme
	Kernel modules: nvme
2a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raphael (rev c1)
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
2a:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Radeon High Definition Audio Controller
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
2a:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: ccp
	Kernel modules: ccp
2a:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci
2a:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci
2a:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Ryzen HD Audio Controller
	DeviceName: Realtek ALC897 Audio
	Subsystem: ASUSTeK Computer Inc. Device 8841
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
2b:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI
	Subsystem: ASUSTeK Computer Inc. Device 8877
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci

Current kernel version: 7.0.5-arch1-1

Last edited by foraphe (2026-05-10 16:10:46)

#2 Yesterday 17:48:17

jl2
Member
From: 47° 18' N 8° 34' E
Registered: 2022-06-01
Posts: 1,321

Re: Infrequent kernel soft lockups under light loads that don't leave logs

This reminds me of an issue I had, does it improve if you completely disable XMP?


#3 Yesterday 19:47:44

jonno2002
Member
Registered: 2016-11-21
Posts: 854

Re: Infrequent kernel soft lockups under light loads that don't leave logs

wow, that's hard to read, so sorry if i missed vital info. process of elimination is always the answer: remove everything but the basics (board/cpu/1 stick ram) (apparently your cpu has graphics built in so take the gpu out of the equation too) and set uefi to defaults with xmp off as suggested and see what happens, if you still get a crash then try different ram stick/cpu/ssd till you find the offending device

all that messing around with voltages and other crap is pointless; devices should run at defaults with no issues, and if they don't, they're faulty

#4 Yesterday 20:12:37

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,310

Re: Infrequent kernel soft lockups under light loads that don't leave logs

Does the script in https://bbs.archlinux.org/viewtopic.php … 2#p2297852 trigger the problem (not immediately but after a short while)?

To be clear, you did test "processor.max_cstate=1 iommu=soft rcu_nocbs=0-15 pcie_aspm=off" ? Together?
You are getting *soft* lockups, you can reboot the system (in doubt using https://wiki.archlinux.org/title/Keyboa … el_(SysRq) + REISUB) and/or otherwise get a journal covering those incidents?
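
Whether REISUB can do anything depends on the kernel.sysrq value. A rough sketch for decoding it, assuming the bit values listed in the kernel's SysRq documentation; the helper names are hypothetical and only exist for this example:

```python
# Bit values as listed in the kernel's SysRq documentation.
SYSRQ_BITS = {
    2: "console log level control",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}

# unRaw (4), tErm and kIll (64), Sync (16), Unmount (32), reBoot (128)
REISUB_MASK = 4 | 64 | 16 | 32 | 128


def allowed_functions(value: int) -> list:
    """Return the SysRq function groups enabled by a kernel.sysrq value."""
    if value == 0:
        return []
    if value == 1:  # exactly 1 means "everything enabled"
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def reisub_fully_allowed(value: int) -> bool:
    """True if every step of REISUB is permitted by this kernel.sysrq value."""
    return value == 1 or (value & REISUB_MASK) == REISUB_MASK


if __name__ == "__main__":
    for sample in (16, 244, 1):  # 16 = sync only
        print(sample, reisub_fully_allowed(sample), allowed_functions(sample))
```

The live value can be read from /proc/sys/kernel/sysrq and passed to either function as an integer.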

#5 Today 02:46:48

foraphe
Member
Registered: 2025-06-23
Posts: 7

Re: Infrequent kernel soft lockups under light loads that don't leave logs

jl2 wrote:

This reminds me of an issue I had, does it improve if you completely disable XMP?

I tried running at reduced frequencies (DDR5-5600 with FCLK reduced to 1800MHz), which didn't seem to help, but I haven't tried JEDEC (DDR5-4800). I'll try JEDEC specs.

jonno2002 wrote:

remove everything but the basics (board/cpu/1 stick ram) (apparently your cpu has graphics built in so take the gpu out of the equation too) and set uefi to defaults with xmp off as suggested and see what happens, if you still get a crash then try different ram stick/cpu/ssd till you find the offending device

seth wrote:

Does the script in https://bbs.archlinux.org/viewtopic.php … 2#p2297852 trigger the problem (not immediately but after a short while)?

I needed a stable system to work with, so I have moved my peripherals to a different system. I'll try with only the motherboard+CPU, a single RAM stick and maybe a fresh install on a USB drive, and report back once I have some free time tomorrow.

seth wrote:

To be clear, you did test "processor.max_cstate=1 iommu=soft rcu_nocbs=0-15 pcie_aspm=off" ? Together?

I forgot to do iommu=soft, but all others were set together. I recall the related kernel command line was "pcie_aspm=off rcu_nocbs=0-31 processor.max_cstate=1 idle=nomwait pci=nomsi usbcore.autosuspend=-1 iommu=pt amd_iommu=on amd_pstate=passive".
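
To double-check which of the suggested parameters actually made it onto a command line, a small comparison sketch can help. It treats the command line as whitespace-separated name=value tokens, which is a simplification: quoted values containing spaces are not handled, and a repeated parameter keeps only its last value. The function names are invented for this example; both strings are the ones quoted in this thread.

```python
from typing import Dict, List, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def compare(wanted: str, actual: str) -> List[str]:
    """Describe each wanted parameter that is missing or set differently."""
    have = parse_cmdline(actual)
    report = []
    for name, value in parse_cmdline(wanted).items():
        if name not in have:
            report.append(f"{name}: missing (wanted {value})")
        elif have[name] != value:
            report.append(f"{name}: is {have[name]}, wanted {value}")
    return report


# Both strings are copied from this thread.
SUGGESTED = "processor.max_cstate=1 iommu=soft rcu_nocbs=0-15 pcie_aspm=off"
TESTED = (
    "pcie_aspm=off rcu_nocbs=0-31 processor.max_cstate=1 idle=nomwait "
    "pci=nomsi usbcore.autosuspend=-1 iommu=pt amd_iommu=on amd_pstate=passive"
)

if __name__ == "__main__":
    for line in compare(SUGGESTED, TESTED):
        print(line)
```

On these two strings it reports iommu (pt instead of soft) and rcu_nocbs (0-31 instead of 0-15); the latter presumably just reflects the 32 threads of the 7950X. For a live check, pass the contents of /proc/cmdline as the second argument.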

seth wrote:

You are getting *soft* lockups, you can reboot the system (in doubt using https://wiki.archlinux.org/title/Keyboa … el_(SysRq) + REISUB) and/or otherwise get a journal covering those incidents?

SysRq also didn't seem to work when I did enable it. The only logs I got (the netconsole logs 10% of the time) were reporting soft lockups, but it seems to escalate until NICs and other things also go down. In some occurrences pinging the computer worked for around a minute more after the freeze before it stopped responding.
The keyboard/mouse also stopped working as soon as a freeze started (caps lock won't toggle, and the mouse stopped moving first in that one case when userspace was still alive).
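
Because netconsole output from several CPUs arrives interleaved, it can be easier to regroup the watchdog reports per CPU. A minimal sketch, assuming the "watchdog: BUG: soft lockup - CPU#N stuck for Ns! [task:pid]" line format seen in these logs; the names LOCKUP and summarize are made up here, and the sample lines are copied from the log in this post.

```python
import re
from collections import defaultdict

# Matches only the watchdog header lines; the per-CPU utilization lines
# that follow each header are ignored.
LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize(log_text: str) -> dict:
    """Map each CPU to its (timestamp, stuck seconds, task, pid) reports."""
    per_cpu = defaultdict(list)
    for match in LOCKUP.finditer(log_text):
        per_cpu[int(match["cpu"])].append(
            (float(match["ts"]), int(match["secs"]), match["task"], int(match["pid"]))
        )
    return dict(per_cpu)


SAMPLE = """\
[  100.529643] watchdog: BUG: soft lockup - CPU#29 stuck for 26s! [kworker/u130:6:918]
[  100.529645] CPU#29 Utilization every 4000ms during lockup:
[  100.528398] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [ThreadPoolForeg:4407]
[  128.528394] watchdog: BUG: soft lockup - CPU#10 stuck for 48s! [ThreadPoolForeg:4407]
"""

if __name__ == "__main__":
    for cpu, reports in sorted(summarize(SAMPLE).items()):
        print(f"CPU#{cpu}: {reports}")
```

The task name is matched greedily up to the last colon, so names that contain a colon themselves (such as kworker/u130:6) stay intact.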

Netconsole logs usually look very similar to this one when they do come through:

[    8.231405] snd_hda_intel 0000:0e:00.0: Unknown capability 0 
[    9.126739] amdgpu: Overdrive is enabled, please disable it before reporting any bugs unrelated to overdrive.
[  100.529643] watchdog: BUG: soft lockup - CPU#29 stuck for 26s! [kworker/u130:6:918]
[  100.529645] CPU#29 Utilization every 4000ms during lockup:
[  100.529646]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529648]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529649]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529650]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529651]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528398] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [ThreadPoolForeg:4407]
[  100.528400] CPU#10 Utilization every 4000ms during lockup:
[  100.528401]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528402]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.528403]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528404]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529856] watchdog: BUG: soft lockup - CPU#15 stuck for 26s! [chrome:4241]
[  100.529858] CPU#15 Utilization every 4000ms during lockup:
[  100.529859]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529860]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528405]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529861]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529861]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529862]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.528681] watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [chrome:4716]
[  100.528683] CPU#27 Utilization every 4000ms during lockup:
[  100.528684]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529995] watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [kworker/u130:0:212]
[  100.529996] CPU#13 Utilization every 4000ms during lockup:
[  100.529997]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529998]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.528686]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529999]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.530000]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528686]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.528687]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.530001]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.528688]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528895] watchdog: BUG: soft lockup - CPU#16 stuck for 26s! [osu!.exe:7848]
[  100.528898] CPU#16 Utilization every 4000ms during lockup:
[  100.530194] watchdog: BUG: soft lockup - CPU#12 stuck for 26s! [osu!.exe:8616]
[  100.530196] CPU#12 Utilization every 4000ms during lockup:
[  100.530197]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.530198]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.530199]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.530200]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.530200]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528899]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528900]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.528901]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.528901]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.528902]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529067] watchdog: BUG: soft lockup - CPU#17 stuck for 26s! [tosu.exe:4819]
[  100.529069] CPU#17 Utilization every 4000ms during lockup:
[  100.529070]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529071]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529072]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529073]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529074]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529255] watchdog: BUG: soft lockup - CPU#25 stuck for 26s! [ServiceWorker t:4437]
[  100.529258] CPU#25 Utilization every 4000ms during lockup:
[  100.529259]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529260]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529261]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529262]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529263]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529434] watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [kworker/u130:5:917]
[  100.529435] CPU#11 Utilization every 4000ms during lockup:
[  100.529436]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529438]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529439]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  100.529439]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  100.529440]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528389] watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [worker:8408]
[  104.528391] CPU#8 Utilization every 4000ms during lockup:
[  104.528392]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528393]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528394]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528395]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528396]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528570] watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [worker:8476]
[  104.528573] CPU#14 Utilization every 4000ms during lockup:
[  104.528573]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528575]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528576]  #3: 100% system,          1% softirq,     1% hardirq,     0% idle
[  104.528577]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528577]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528759] watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [ThreadPoolForeg:3966]
[  104.528761] CPU#28 Utilization every 4000ms during lockup:
[  104.528761]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528762]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528763]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528764]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528764]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528879] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [electron:2999]
[  104.528880] CPU#31 Utilization every 4000ms during lockup:
[  104.528880]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528881]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528882]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  104.528883]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  104.528883]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  108.528543] watchdog: BUG: soft lockup - CPU#26 stuck for 23s! [chrome:4247]
[  108.528545] CPU#26 Utilization every 4000ms during lockup:
[  108.528546]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  108.528547]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  108.528548]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  108.528549]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  108.528550]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528394] watchdog: BUG: soft lockup - CPU#10 stuck for 48s! [ThreadPoolForeg:4407]
[  128.528395] CPU#10 Utilization every 4000ms during lockup:
[  128.528396]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528397]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528398]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528399]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528400]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528606] watchdog: BUG: soft lockup - CPU#27 stuck for 52s! [chrome:4716]
[  128.528608] CPU#27 Utilization every 4000ms during lockup:
[  128.528609]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528610]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528611]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528612]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528612]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528799] watchdog: BUG: soft lockup - CPU#17 stuck for 52s! [tosu.exe:4819]
[  128.528801] CPU#17 Utilization every 4000ms during lockup:
[  128.528802]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528803]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528804]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528805]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528806]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528971] watchdog: BUG: soft lockup - CPU#16 stuck for 52s! [osu!.exe:7848]
[  128.528972] CPU#16 Utilization every 4000ms during lockup:
[  128.528973]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528973]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528974]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.528975]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.528976]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529080] watchdog: BUG: soft lockup - CPU#25 stuck for 52s! [ServiceWorker t:4437]
[  128.529082] CPU#25 Utilization every 4000ms during lockup:
[  128.529083]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529084]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529085]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529086]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529087]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529237] watchdog: BUG: soft lockup - CPU#11 stuck for 48s! [kworker/u130:5:917]
[  128.529238] CPU#11 Utilization every 4000ms during lockup:
[  128.529238]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529239]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529240]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529241]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529242]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529439] watchdog: BUG: soft lockup - CPU#13 stuck for 48s! [kworker/u130:0:212]
[  128.529440] CPU#13 Utilization every 4000ms during lockup:
[  128.529441]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529442]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529443]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529444]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529445]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529640] watchdog: BUG: soft lockup - CPU#12 stuck for 52s! [osu!.exe:8616]
[  128.529641] CPU#12 Utilization every 4000ms during lockup:
[  128.529642]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529643]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529643]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529644]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529645]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529810] watchdog: BUG: soft lockup - CPU#15 stuck for 52s! [chrome:4241]
[  128.529811] CPU#15 Utilization every 4000ms during lockup:
[  128.529812]  #1: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529813]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529813]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529814]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529815]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529931] watchdog: BUG: soft lockup - CPU#29 stuck for 52s! [kworker/u130:6:918]
[  128.529932] CPU#29 Utilization every 4000ms during lockup:
[  128.529933]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529934]  #2: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529935]  #3: 100% system,          0% softirq,     1% hardirq,     0% idle
[  128.529936]  #4: 100% system,          0% softirq,     0% hardirq,     0% idle
[  128.529937]  #5: 100% system,          0% softirq,     1% hardirq,     0% idle
[  132.528386] watchdog: BUG: soft lockup - CPU#8 stuck for 49s! [worker:8408]
[  132.528387] CPU#8 Utilization every 4000ms during lockup:
[  132.528388]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  132.528389]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  132.528390]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  132.528391]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  132.528392]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  132.528552] watchdog: BUG: soft lockup - CPU#14 stuck for 49s! [worker:8476]
[  132.528554] CPU#14 Utilization every 4000ms during lockup:
[  132.528555]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  132.528556]  #2: 100% system,          0% softirq,     1% hardirq,     0% idle
[  132.528557]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  132.528558]  #4: 100% system,          0% softirq,     1% hardirq,     0% idle
[  132.528559]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
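
The watchdog output above is easier to compare between boots when condensed to one entry per CPU. What follows is a minimal sketch, not a standard tool: it assumes the message format shown in this log, and `summarize_lockups` is a hypothetical helper name.

```python
import re

# Assumed format, taken from the log above:
#   watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [worker:8408]
# No timestamp prefix is matched, so dmesg and journalctl output both work.
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize_lockups(log_text):
    """Return {cpu: (longest_stuck_seconds, task_name, pid)}.

    Hypothetical helper: keeps only the longest report seen for each CPU.
    """
    summary = {}
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if not m:
            continue  # utilization lines and unrelated messages are skipped
        cpu, secs = int(m["cpu"]), int(m["secs"])
        if cpu not in summary or secs > summary[cpu][0]:
            summary[cpu] = (secs, m["task"], int(m["pid"]))
    return summary
```

Feeding it the text of a saved `dmesg` or `journalctl -k` capture gives a per-CPU view of which tasks were stuck and for how long.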

Last edited by foraphe (Today 03:06:14)

Offline

#6 Today 09:03:58

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,310

Re: Infrequent kernel soft lockups under light loads that don't leave logs

Obviously

[    9.126739] amdgpu: Overdrive is enabled, please disable it before reporting any bugs unrelated to overdrive.

I don't think it's the Ryzen bug, but "iommu=soft" might actually do something; because of the "somewhat large number of tasks", also add "maxcpus=15" (no HT and skip one core altogether).
Also please post your complete system journal:

sudo journalctl -b | curl -s -H "Accept: application/json, */*" --upload-file - 'https://paste.c-net.org/'

and the output of "lsmod"
Edit: just for a general overview and rounding up the usual suspects.

Last edited by seth (Today 09:04:38)

Online

#7 Today 10:58:32

foraphe
Member
Registered: 2025-06-23
Posts: 7

Re: Infrequent kernel soft lockups under light loads that don't leave logs

My hypothesis might have been wrong from the beginning; this might instead be a signal integrity (SI) issue that results in platform hangs.

Before I had time to test the now "old" system I already ran into problems on the new one, and they might help figure out what happened. With a different CPU/motherboard combo and otherwise the same setup (now without any of the kernel options meant to mitigate the freeze), I got AER corrected error log spam on my SSDs and two AER fatal errors on the main GPU within a day. This time the system continued running somewhat normally, except for the GPU being gone.
The AER spam was fixed by properly re-routing the SlimSAS cables coming from the PCIe switch, but I still encountered one fatal error after doing so.

The old Asus motherboard masked AER errors. I tried digging into it in the past with setpci but didn't successfully unmask them.
Maybe the GPU is actually falling off the bus due to bad SI, and maybe the firmware on Asus motherboards doesn't handle this type of event very cleanly (a device not waking up from deeper power-saving levels in time and a device not responding due to losing a PCIe link "might" be similar enough to trigger the same firmware quirks, i.e. a freeze).
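
For cross-checking AER activity without relying on journal spam, a minimal sketch follows. It assumes the kernel exposes per-device AER statistics as `aer_dev_correctable`, `aer_dev_nonfatal` and `aer_dev_fatal` under `/sys/bus/pci/devices/<BDF>/`, with one "name count" pair per line. These counters presumably only advance when the kernel's own AER handling sees the error, so a board that masks AER may show all zeros. Both function names are hypothetical.

```python
from pathlib import Path

# sysfs attribute names as assumed from the kernel's AER statistics ABI.
AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def parse_aer_counters(text):
    """Parse lines like 'Receiver Error 2' into {'Receiver Error': 2}.

    Counter names may contain spaces, so the count is split off the right.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def nonzero_aer_counters(sysfs_root="/sys/bus/pci/devices"):
    """Return {(device, file, counter_name): count} for non-zero counters."""
    found = {}
    for dev in sorted(Path(sysfs_root).glob("*")):
        for fname in AER_FILES:
            try:
                text = (dev / fname).read_text()
            except OSError:
                continue  # no AER capability on this device, or unreadable
            for name, count in parse_aer_counters(text).items():
                if count:
                    found[(dev.name, fname, name)] = count
    return found
```

Comparing the result of `nonzero_aer_counters()` before and after a test run would show which devices behind the switch accumulate errors, independent of what made it into the journal.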

I'll still test the old system on the iGPU and with iommu=soft once I have some time (if this hypothesis is correct, the freezes should then go away).

seth wrote:

Also please post your complete system journal

Journal of a boot that ended in a freeze (the kernel options differ slightly because I was trying different combinations): https://paste.c-net.org/ProofingWhatley. As for the lsmod output, I also have to wait until I have more time to set up the system on a test bench.

Last edited by foraphe (Today 11:18:22)

Offline

#8 Today 13:07:53

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,310

Re: Infrequent kernel soft lockups under light loads that don't leave logs

Why is there "amd_iommu=on iommu=pt" ?
"pci=hpiosize=0,hpmmiosize=0,hpmemsize=0,hpmmioprefsize=0" is a mitigational effort?

Did you notice the insane amount of errors wrt

…
May 12 13:11:20 D795-P obs[5372]: error: [pipewire-audio] Error id:35 seq:80 res:-2 :no global 498
May 12 13:11:20 D795-P obs[5372]: error: [pipewire-audio] Error id:35 seq:80 res:-2 :no global 498
May 12 13:11:20 D795-P pipewire[2259]: mod.protocol-native: connection 0x564d507f5110: can't DUP fd:971 Too many open files
May 12 13:11:20 D795-P pipewire[2259]: mod.protocol-native: connection 0x564d507f5110: can't DUP fd:976 Too many open files
…
…
May 12 13:11:24 D795-P xdg-desktop-portal-kde[3164]: QSGRhiLayer: Unsupported size requested: [5398990, 124]. Maximum texture size: 65536
May 12 13:11:24 D795-P xdg-desktop-portal-kde[3164]: QSGRhiLayer: Unsupported size requested: [5398990, 124]. Maximum texture size: 65536
…

Your immediate concern should be pipewire, see eg. https://bbs.archlinux.org/viewtopic.php?id=311683 but afaict you're not using raop - anything else special about the PW config?
But might also just be OBS

May 12 13:11:11 D795-P obs[5372]: info: [pipewire-audio] 0x557be3e63210 Got format: rate 48000 - channels 2 - format 8
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu7-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu6-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu5-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu4-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu3-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu2-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu1-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu0-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu7-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu6-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu5-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu4-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu3-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu2-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu1-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu0-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:     - source: 'lazer' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:     - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu7-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu6-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu5-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu4-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu3-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu2-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu1-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu0-pw' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:     - source: 'lazer' (pipewire-screen-capture-source)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'osu!' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'chrome' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'ts3' (pipewire_audio_application_capture)
May 12 13:11:11 D795-P obs[5372]: info:         - source: 'mic' (pipewire_audio_input_capture)

[Edit: if this is a legit requirement of the setup, see https://discussion.fedoraproject.org/t/ … /85544/11]

As for the other error, something wants to allocate a 5.398.990 pixel wide texture via xdg-desktop-portal, could be OBS or Trojan-Qt5-Linux.AppImage

Overall I'd ignore the HW setup - the system is clearly just running out of resources (or rather, hitting a quota in the case of pipewire)
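
To check the "Too many open files" theory directly, the number of open descriptors of the pipewire process can be compared against its limit. A minimal sketch, assuming a Linux `/proc` layout where `/proc/<pid>/limits` contains a "Max open files" row with soft and hard values; the function names are hypothetical, and the PID has to be looked up separately (2259 in the journal above).

```python
import os


def parse_nofile_limits(limits_text):
    """Return (soft, hard) for 'Max open files'; None stands for 'unlimited'."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Assumed column layout: Max open files <soft> <hard> files
            fields = line.split()
            soft, hard = fields[3], fields[4]

            def to_int(value):
                return None if value == "unlimited" else int(value)

            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit, hard_limit) for a PID (Linux only).

    Reading /proc/<pid>/fd requires owning the process or being root.
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        soft, hard = parse_nofile_limits(fh.read())
    return open_fds, soft, hard
```

If `fd_usage(pid)` for pipewire returns a count close to the soft limit while OBS is setting up its captures, that would support the quota explanation; a count far below it would not.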

Last edited by seth (Today 13:34:31)

Online

#9 Today 15:07:25

foraphe
Member
Registered: 2025-06-23
Posts: 7

Re: Infrequent kernel soft lockups under light loads that don't leave logs

seth wrote:

Why is there "amd_iommu=on iommu=pt" ?

The IOMMU options have been there for as long as I can remember (maybe they came from years ago when I tried PCIe passthrough via OVMF), and I didn't think of removing them until I read your previous posts.

seth wrote:

"pci=hpiosize=0,hpmmiosize=0,hpmemsize=0,hpmmioprefsize=0" is a mitigational effort?

These PCI options are there to prevent pciehp from reserving resources for unused switch ports, which helps fit the bridge windows without BARs failing to assign. This was a mitigation attempt but seems to have no effect on the freezes.

seth wrote:

Your immediate concern should be pipewire, see eg. https://bbs.archlinux.org/viewtopic.php?id=311683 but afaict you're not using raop - anything else special about the PW config?

This should just be OBS trying to open at least 8 different portal requests at the same time (each with 10+ windows to preview). After capture is set up for every window, the errors stop appearing. The previews were all blank (so the failures likely happened there instead of in OBS itself).
I don't think I'm using raop: a search in ~/.config/pipewire and /etc/pipewire showed no string "raop", and I have no memory of using it before. The config should be mostly original except for the available sample rates and quantum values.

Based on the behavior on two different platforms, I think I still have good reason to suspect HW, manifesting as freezes on one platform and fatal AER errors on the other.

Offline
