I have been experiencing full unrecoverable crashes on my PC for a few years now, even REISUB doesn't recover from them. Since I have now reinstalled the system at least twice I think it may be hardware related.
There was a period during which the system didn't crash. The hardware changes since then, from oldest to newest, are: a new AIO cooler for the CPU, new memory (the previous memory was giving errors that I thought could be causing these crashes) and an extra HDD.
I am not sure if the crashes started after the change of cooler or before the change.
Journal log of latest crash: https://0x0.st/XtBe.txt
The installs do have a few things in common:
Encrypted partitions with luks
UKI boot
Wayland DE or WM
(May add more later if I remember them)
Fastfetch:
OS: Arch Linux x86_64
Kernel: Linux 6.10.6-arch1-1
Uptime: 3 mins
Packages: 1208 (pacman), 6 (flatpak)
Shell: bash 5.2.32
Display (L24i-30): 1920x1080 @ 60 Hz (as 2560x1440) in 24″ [External]
Display (L24q-35): 2560x1440 @ 60 Hz in 24″ [External]
WM: Hyprland (Wayland)
Theme: Breeze-Dark [GTK2/3]
Icons: breeze-dark [GTK2/3/4]
Font: Noto Sans (10pt) [GTK2/3/4]
Cursor: breeze (24px)
Terminal: alacritty 0.13.2
Terminal Font: alacritty (11pt)
CPU: AMD Ryzen 5 7600X (12) @ 5.45 GHz
GPU: AMD Radeon RX 6750 XT [Discrete]
Memory: 2.37 GiB / 62.54 GiB (4%)
Motherboard: ASUS TUF Gaming B650 Plus Wifi
Swap: 0 B / 64.00 GiB (0%)
Disk (/): 101.35 GiB / 273.44 GiB (37%) - ext4
Disk (/home): 1.28 TiB / 1.52 TiB (84%) - ext4
Local IP (eno1): 192.168.100.4/24
Locale: en_US.UTF-8
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14d8
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Device 14d9
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14db
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14db
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14db
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14dd
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14dd
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 71)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c0)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c0)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN570 NVMe SSD 2TB (rev 01)
05:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port (rev 01)
06:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0b.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0c.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0d.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
6b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
6c:00.0 Network controller: MEDIATEK Corp. MT7921 802.11ax PCI Express Wireless Network Adapter
6d:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller (rev 01)
6e:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller (rev 01)
6f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix PCIe Dummy Function (rev c7)
6f:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP
6f:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b6
6f:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b7
6f:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
70:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b8
Any help with this persistent issue would be greatly appreciated.
If I manage to capture more logs on other crashes I will add them to the issue.
If any additional info may be helpful please do ask.
(My timezone is UTC+2)
Last edited by theRedCyclops (2024-08-27 20:00:37)
Offline
Have you also switched out the CPU?
https://wiki.archlinux.org/title/Ryzen#Troubleshooting
Try to limit the c-states.
Online
I have not switched out the CPU, although I have switched out the cooler, so it may have moved??, but the CPU hasn't even been taken out of the socket as far as I remember.
Offline
Well, did you see the wiki?
There's a good chance it's the CPU - try "processor.max_cstate=1" to aggressively prevent the CPU from downstepping and see whether that stabilizes the system. Then check whether you only need to prevent C6.
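A quick way to confirm that the limit actually reached the running kernel is to look for it on the kernel command line. The snippet below is only a sketch: param_value is a hypothetical helper name, and it assumes a Linux system exposing /proc/cmdline. With a UKI the command line is normally embedded in the image, so the parameter has to be added wherever the image gets its command line from (on an Arch mkinitcpio setup that is commonly /etc/kernel/cmdline, which is an assumption about this install) and the image rebuilt before rebooting.

```shell
# param_value CMDLINE NAME: print the value of NAME=... from a kernel
# command line string; fails if the parameter is absent.
# (Hypothetical helper, runs in a subshell so "set -f" stays local.)
param_value() (
    set -f                      # do not glob-expand words of the command line
    for word in $1; do
        case "$word" in
            "$2"=*) printf '%s\n' "${word#*=}"; exit 0 ;;
        esac
    done
    exit 1
)

# Report whether the running kernel was booted with the C-state limit.
if [ -r /proc/cmdline ]; then
    if value=$(param_value "$(cat /proc/cmdline)" processor.max_cstate); then
        echo "processor.max_cstate=$value is active"
    else
        echo "processor.max_cstate is not on the kernel command line"
    fi
fi
```

If the second message shows up after a reboot, the image was probably not rebuilt or the parameter went into a file the boot setup does not read.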
Online
I will try that
Offline
It usually takes a while and then randomly crashes. If it doesn't crash in the following week I will call this a success, thanks for the help.
Offline
It crashed. I will retry with the more aggressive C-state limit (I was using 5), and I will also reseat the CPU and see if that does anything. The captured logs do seem to indicate an issue with the CPU: http://0x0.st/Xt5q.txt
Offline
Also try the rcu_nocbs and yes, there is a CPU soft lock. Right on the money.
Online
What is rcu_nocbs? And is there anything else I can do to fix the soft locks, or have I lost the silicon lottery?
Offline
Ok, figured out the rcu_nocbs. I'm assuming there will be performance penalties due to these settings? Thanks for the help again
(Restarting the count, let's see if this time it doesn't crash for at least a week)
Last edited by theRedCyclops (2024-08-28 23:04:46)
Offline
rcu_nocbs[=cpu-list]
[KNL] The optional argument is a cpu list,
as described above.
In kernels built with CONFIG_RCU_NOCB_CPU=y,
enable the no-callback CPU mode, which prevents
such CPUs' callbacks from being invoked in
softirq context. Invocation of such CPUs' RCU
callbacks will instead be offloaded to "rcuox/N"
kthreads created for that purpose, where "x" is
"p" for RCU-preempt, "s" for RCU-sched, and "g"
for the kthreads that mediate grace periods; and
"N" is the CPU number. This reduces OS jitter on
the offloaded CPUs, which can be useful for HPC
and real-time workloads. It can also improve
energy efficiency for asymmetric multiprocessors.
If a cpulist is passed as an argument, the specified
list of CPUs is set to no-callback mode from boot.
Otherwise, if the '=' sign and the cpulist
arguments are omitted, no CPU will be set to
no-callback mode from boot but the mode may be
toggled at runtime via cpusets.
Note that this argument takes precedence over
the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
https://en.wikipedia.org/wiki/Read-copy-update
https://www.kernel.org/doc/Documentatio … tisRCU.txt
https://docs.redhat.com/en/documentatio … -operation
To be clear, you're supposed to pass sth. like "rcu_nocbs=0-15" to the kernel, not the nproc-invoking bash function; that's just how you get to the correct numbers for your system.
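For illustration, the literal string to pass can be derived from the CPU count like this. It is a sketch only: rcu_nocbs_param is a made-up helper name, and it assumes the CPUs are numbered contiguously from 0 and that nproc (coreutils) reports all of them.

```shell
# rcu_nocbs_param N: print the kernel parameter covering CPUs 0..N-1.
# (Hypothetical helper; assumes contiguous CPU numbering starting at 0.)
rcu_nocbs_param() {
    echo "rcu_nocbs=0-$(( $1 - 1 ))"
}

# Print the literal string for this machine; this output, not the
# shell expression, is what goes on the kernel command line.
rcu_nocbs_param "$(nproc)"
```

On the 12-thread Ryzen 5 7600X from the first post this should come out as rcu_nocbs=0-11; "0-15" is the value for a 16-thread CPU.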
Online
Got it, thanks again, let's hope it works this time
Offline
It just crashed again. I was using it remotely and have lost all contact; the router shows it as disconnected. I have also experienced issues with shutdown not finishing: it reaches shutdown.target but the computer doesn't power off.
Offline
I don't have physical access right now, I'm guessing there isn't a way to recover this?
Offline
No, certainly not if you can't interact with the system at all.
You might however have lost remote access for entirely unrelated reasons (or have you seen any errors before the host went down?)
Online
Sadly no logs for now, I have full access to the network through a VPN and ssh to other machines, and I have verified it shows up as offline in the router, same thing as the last time it crashed remotely.
Last edited by theRedCyclops (2024-09-06 21:15:43)
Offline
have you tried to replace the power supply?
it's just a wild guess but not so long ago a friend of mine had issues booting - as he had one spare we just gave it a try and it worked
Offline
I don't have a spare. Can you suggest a way to test it and verify whether it's the culprit? Right now I am not sure whether it's that, the CPU, the motherboard or anything else.
Offline
But the system boots, then at some point crashes, previously w/ a softlock.
That's probably more in the vicinity of the general Ryzen issues than an underdimensioned PSU and afaiu the CPU is also the only remaining constant?
Online
@seth
it's not about an underdimensioned PSU but rather a faulty one - although I have to agree with you that the system booting up and then hanging/crashing points away from the PSU, as a PSU failure would likely result in either a shutoff or a restart
Offline
I finally got physical access to the machine, I have a trace on the tty, how do I upload it? (figured it out)
Full logs from the last boot:
https://0x0.st/XwhU.txt
Last edited by theRedCyclops (2024-09-08 18:14:25)
Offline
I also got the PSU specs: Corsair RM1000x, 80 Plus Gold
Offline
But the system boots, then at some point crashes, previously w/ a softlock.
That's probably more in the vicinity of the general Ryzen issues than an underdimensioned PSU and afaiu the CPU is also the only remaining constant?
The motherboard and the GPU have also stayed the same
Offline
Crashes in runc, containerd-shim, nginx and sshd-session - all in kmem_cache_alloc_noprof - but the log starts late, into a series of
sep 06 16:39:40 kheprix kernel: Oops: general protection fault, probably for non-canonical address 0x746de29a9334037: 0000 [#57] PREEMPT SMP NOPTI (they're all for the same address)
Is this all you have or could you save more journal leading up to this?
The kernel doesn't seem to actually panic (so the system was likely still responsive?)
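As background on the "non-canonical address" wording: on x86_64 a virtual address is canonical only if all bits above the implemented address width are copies of the top implemented bit. The sketch below checks the address from the Oops line under two assumed widths, 48 bits (4-level paging) and 57 bits (5-level paging); which one this machine uses is not stated in the thread, and is_canonical is a hypothetical helper name.

```shell
# is_canonical ADDR [BITS]: succeed if ADDR is a canonical x86_64 virtual
# address for BITS-wide virtual addressing (default 48, an assumption).
# Bits 63..BITS-1 must be all zeros or all ones.
# Note: this relies on 64-bit shell arithmetic; addresses >= 2^63 depend
# on how the shell parses large hex constants and are not exercised here.
is_canonical() {
    ic_bits=${2:-48}
    ic_shift=$(( ic_bits - 1 ))
    ic_mask=$(( (1 << (64 - ic_shift)) - 1 ))
    ic_top=$(( ($1 >> ic_shift) & ic_mask ))
    [ "$ic_top" -eq 0 ] || [ "$ic_top" -eq "$ic_mask" ]
}

# The address reported in every Oops of the series.
addr=0x746de29a9334037
for width in 48 57; do
    if is_canonical "$addr" "$width"; then
        echo "$addr is canonical with ${width}-bit virtual addresses"
    else
        echo "$addr is non-canonical with ${width}-bit virtual addresses"
    fi
done
```

The address fails the check under both widths, so it does not look like a valid pointer at all, which would fit a corrupted pointer in the allocator's data rather than an ordinary bad access.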
switched out the memory because the previous memory was giving errors that I thought could be causing these crashes
Have you tried to downclock the RAM/use the most conservative timings possible?
It looks like RAM, but since you switched it the culprit could be the board…
Online
I have the memory on auto. If I try to use any of the EXPO profiles it either fails to boot or gives memory errors, and that's with the new memory; the old memory gave memory errors even on auto.
If it is the motherboard, is there any way to verify that? I would rather not buy a new component that ends up being completely redundant, as I don't exactly have a huge budget.
I'm guessing that the easiest way to find out what's actually failing would be taking the pc to a computer shop and having them test the components?
Last edited by theRedCyclops (2024-09-08 20:11:52)
Offline