#1 2024-08-26 22:22:33

theRedCyclops
Member
Registered: 2022-06-17
Posts: 69

Random system crashes that seemingly follow even across reinstalls

I have been experiencing full, unrecoverable crashes on my PC for a few years now; even REISUB doesn't recover from them. Since I have now reinstalled the system at least twice, I think it may be hardware related.
There was a period during which the system didn't crash. The hardware changes since then, from oldest to newest, are: a new AIO cooler for the CPU, new memory (the previous memory was giving errors that I thought could be causing these crashes), and an extra HDD.
I am not sure if the crashes started after the change of cooler or before the change.
Journal log of latest crash: https://0x0.st/XtBe.txt
The installs do have a few things in common:
    Encrypted partitions with LUKS
    UKI boot
    Wayland DE or WM
    (May add more later if I remember them)
Fastfetch:

OS: Arch Linux x86_64
Kernel: Linux 6.10.6-arch1-1
Uptime: 3 mins
Packages: 1208 (pacman), 6 (flatpak)
Shell: bash 5.2.32
Display (L24i-30): 1920x1080 @ 60 Hz (as 2560x1440) in 24″ [External]
Display (L24q-35): 2560x1440 @ 60 Hz in 24″ [External]
WM: Hyprland (Wayland)
Theme: Breeze-Dark [GTK2/3]
Icons: breeze-dark [GTK2/3/4]
Font: Noto Sans (10pt) [GTK2/3/4]
Cursor: breeze (24px)
Terminal: alacritty 0.13.2
Terminal Font: alacritty (11pt)
CPU: AMD Ryzen 5 7600X (12) @ 5.45 GHz
GPU: AMD Radeon RX 6750 XT [Discrete]
Memory: 2.37 GiB / 62.54 GiB (4%)
Motherboard: ASUS TUF Gaming B650 Plus Wifi
Swap: 0 B / 64.00 GiB (0%)
Disk (/): 101.35 GiB / 273.44 GiB (37%) - ext4
Disk (/home): 1.28 TiB / 1.52 TiB (84%) - ext4
Local IP (eno1): 192.168.100.4/24
Locale: en_US.UTF-8

lspci:

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14d8
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Device 14d9
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14db
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14db
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14db
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14da
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14dd
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14dd
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 71)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14e7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c0)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c0)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN570 NVMe SSD 2TB (rev 01)
05:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port (rev 01)
06:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0b.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0c.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
06:0d.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port (rev 01)
6b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
6c:00.0 Network controller: MEDIATEK Corp. MT7921 802.11ax PCI Express Wireless Network Adapter
6d:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller (rev 01)
6e:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller (rev 01)
6f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix PCIe Dummy Function (rev c7)
6f:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP
6f:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b6
6f:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b7
6f:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
70:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b8

Any help with this persistent issue would be greatly appreciated.
If I manage to capture more logs from other crashes I will add them to the thread.
If any additional info may be helpful please do ask.
(My timezone is UTC+2)

Last edited by theRedCyclops (2024-08-27 20:00:37)

#2 2024-08-27 13:23:21

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,176

Have you also switched out the CPU?
https://wiki.archlinux.org/title/Ryzen#Troubleshooting
Try to limit the c-states.

#3 2024-08-27 19:55:33

theRedCyclops

I have not switched out the CPU. I have switched out the cooler, so the CPU may have moved, but it hasn't been taken out of the socket as far as I remember.

#4 2024-08-27 19:59:37

seth

Well, did you see the wiki?
There's a good chance it's the CPU. Try "processor.max_cstate=1" to aggressively prevent the CPU from downstepping and see whether that stabilizes the system. Then check whether you only need to prevent C6.
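
A minimal sketch of checking whether such a parameter actually reached the kernel command line. The helper name has_kernel_param is made up for illustration; /proc/cmdline is where the running kernel exposes its command line. With a UKI the command line is usually embedded when the image is built, so the image most likely has to be rebuilt after changing it (that depends on how the UKI is generated).

```shell
#!/bin/sh
# has_kernel_param is a hypothetical helper name, not an existing tool.
# Usage: has_kernel_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word in CMDLINE.
# Without a second argument, the running kernel's command line is used.
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with an explicit (made-up) command line, so it runs anywhere:
if has_kernel_param "processor.max_cstate=1" "rw quiet processor.max_cstate=1"; then
    echo "c-state limit is on the command line"
fi
```

After a reboot, calling has_kernel_param "processor.max_cstate=1" with no second argument would test the live system instead of the example string.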

#5 2024-08-27 20:01:07

theRedCyclops

I will try that

#6 2024-08-27 20:04:12

theRedCyclops

It usually takes a while and then randomly crashes. If it doesn't crash in the following week I will call this a success. Thanks for the help.

#7 2024-08-28 21:57:44

theRedCyclops

It crashed. I will retry with the more aggressive c-state limit (I was using 5), and I will also reseat the CPU and see if that does anything. The logs that were captured do seem to indicate an issue with the CPU: http://0x0.st/Xt5q.txt

#8 2024-08-28 22:08:04

seth

Also try rcu_nocbs. And yes, there is a CPU soft lock. Right on the money.

#9 2024-08-28 22:58:36

theRedCyclops

What is rcu_nocbs? And is there anything else I can do to fix the soft locks, or have I lost the silicon lottery?

#10 2024-08-28 23:01:30

theRedCyclops

OK, figured out rcu_nocbs. I'm assuming there will be performance penalties due to these settings? Thanks for the help again.
(Restarting the count, let's see if this time it doesn't crash for at least a week)

Last edited by theRedCyclops (2024-08-28 23:04:46)

#11 2024-08-29 05:56:45

seth

        rcu_nocbs[=cpu-list]
                        [KNL] The optional argument is a cpu list,
                        as described above.

                        In kernels built with CONFIG_RCU_NOCB_CPU=y,
                        enable the no-callback CPU mode, which prevents
                        such CPUs' callbacks from being invoked in
                        softirq context.  Invocation of such CPUs' RCU
                        callbacks will instead be offloaded to "rcuox/N"
                        kthreads created for that purpose, where "x" is
                        "p" for RCU-preempt, "s" for RCU-sched, and "g"
                        for the kthreads that mediate grace periods; and
                        "N" is the CPU number. This reduces OS jitter on
                        the offloaded CPUs, which can be useful for HPC
                        and real-time workloads.  It can also improve
                        energy efficiency for asymmetric multiprocessors.

                        If a cpulist is passed as an argument, the specified
                        list of CPUs is set to no-callback mode from boot.

                        Otherwise, if the '=' sign and the cpulist
                        arguments are omitted, no CPU will be set to
                        no-callback mode from boot but the mode may be
                        toggled at runtime via cpusets.

                        Note that this argument takes precedence over
                        the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.

https://en.wikipedia.org/wiki/Read-copy-update
https://www.kernel.org/doc/Documentatio … tisRCU.txt
https://docs.redhat.com/en/documentatio … -operation

To be clear, you're supposed to pass something like "rcu_nocbs=0-15" to the kernel, not the nproc-invoking bash expression; that's just how you get to the correct numbers for your system.
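
A minimal sketch of how that literal value can be worked out, assuming the logical CPUs are numbered contiguously from 0. The function name rcu_nocbs_param is made up for illustration; nproc is the coreutils tool and is only used when no count is given.

```shell
#!/bin/sh
# rcu_nocbs_param is a hypothetical helper, named for illustration only.
# Prints the literal kernel parameter covering logical CPUs 0 .. N-1.
# N defaults to what nproc reports on the machine the script runs on.
rcu_nocbs_param() {
    n=${1:-$(nproc)}
    echo "rcu_nocbs=0-$((n - 1))"
}

# 12 logical CPUs, as reported for the Ryzen 5 7600X in the first post:
rcu_nocbs_param 12
```

The printed text is what goes on the kernel command line; the shell expression itself must never be pasted there.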

#12 2024-08-29 10:08:33

theRedCyclops

Got it, thanks again. Let's hope it works this time.

#13 2024-09-06 15:08:42

theRedCyclops

It just crashed again. I was using it remotely and have lost all contact; the router shows it as disconnected. I have also experienced issues with the shutdown not finishing: it reaches shutdown.target but the computer doesn't shut down.

#14 2024-09-06 15:17:15

theRedCyclops

I don't have physical access right now. I'm guessing there isn't a way to recover this?

#15 2024-09-06 19:19:06

seth

No, certainly not if you can't interact with the system at all :(

You might however have lost remote access for entirely unrelated reasons (or have you seen any errors before the host went down?)

#16 2024-09-06 21:12:28

theRedCyclops

Sadly no logs for now. I have full access to the network through a VPN and SSH to other machines, and I have verified that it shows up as offline in the router, the same as the last time it crashed remotely.

Last edited by theRedCyclops (2024-09-06 21:15:43)

#17 2024-09-07 06:50:04

cryptearth
Member
Registered: 2024-02-03
Posts: 2,104

have you tried to replace the power supply?
it's just a wild guess but not so long ago a friend of mine had issues booting - as he had one spare we just gave it a try and it worked

#18 2024-09-07 08:14:05

theRedCyclops

I don't have a spare. If you can suggest a way to test it and verify whether it's the culprit, I can try that, but right now I am not sure if it's that, the CPU, the motherboard or anything else.

#19 2024-09-07 12:12:04

seth

But the system boots, then at some point crashes, previously w/ a softlock.
That's probably more in the vicinity of the general Ryzen issues than an underdimensioned PSU and afaiu the CPU is also the only remaining constant?

#20 2024-09-07 14:03:48

cryptearth

@seth
it's not about an underdimensioned PSU but rather a faulty one, although I have to agree with you that the system booting up and then hanging/crashing points away from the PSU, as a PSU failure would likely result in either a shutoff or a restart

#21 2024-09-08 17:53:56

theRedCyclops

I finally got physical access to the machine, I have a trace on the tty, how do I upload it? (figured it out)
Full logs from the last boot:
https://0x0.st/XwhU.txt

Last edited by theRedCyclops (2024-09-08 18:14:25)

#22 2024-09-08 18:01:35

theRedCyclops

I also got the PSU specs: Corsair RM1000x, 80 Plus Gold

#23 2024-09-08 18:19:12

theRedCyclops

seth wrote:

But the system boots, then at some point crashes, previously w/ a softlock.
That's probably more in the vicinity of the general Ryzen issues than an underdimensioned PSU and afaiu the CPU is also the only remaining constant?

The motherboard and the GPU have also stayed the same

#24 2024-09-08 19:28:13

seth

Crashes in runc, containerd-shim, nginx and sshd-session, all in kmem_cache_alloc_noprof, but the log starts late, in the middle of a series of

sep 06 16:39:40 kheprix kernel: Oops: general protection fault, probably for non-canonical address 0x746de29a9334037: 0000 [#57] PREEMPT SMP NOPTI

(they're all for the same address)
Is this all you have or could you save more journal leading up to this?
The kernel doesn't seem to actually panic (so the system was likely still responsive?)
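
A rough sketch of how the "same address" observation can be checked on a saved log, assuming the Oops lines have the format quoted above. The name count_gpf_addresses is made up for illustration; the pipeline only uses grep, awk, sort and uniq.

```shell
#!/bin/sh
# count_gpf_addresses is a hypothetical helper name.
# Reads journal text on stdin and prints each faulting address
# preceded by the number of general protection faults that report it.
count_gpf_addresses() {
    grep -o 'non-canonical address 0x[0-9a-f]*' \
        | awk '{print $3}' \
        | sort | uniq -c | sort -rn
}

# Example on the single line quoted in this post:
printf '%s\n' 'sep 06 16:39:40 kheprix kernel: Oops: general protection fault, probably for non-canonical address 0x746de29a9334037: 0000 [#57] PREEMPT SMP NOPTI' \
    | count_gpf_addresses
```

For a real log, something like "journalctl -b -1 -k | count_gpf_addresses" would feed in the kernel messages of the previous boot, provided the journal is stored persistently.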

switched out the memory because the previous memory was giving errors that I thought could be causing these crashes

Have you tried to downclock the RAM/use the most conservative timings possible?
It looks like RAM, but since you switched it the culprit could be the board…

#25 2024-09-08 19:55:51

theRedCyclops

I have the memory on auto. If I try to use any of the EXPO profiles, it either fails to boot or gives memory errors, and that's with the new memory; the old memory gives memory errors even on auto.
If it is the motherboard, is there any way to verify that? I would rather not buy a new component that ends up being completely redundant, as I don't exactly have a huge budget.
I'm guessing that the easiest way to find out what's actually failing would be taking the pc to a computer shop and having them test the components?

Last edited by theRedCyclops (2024-09-08 20:11:52)
