I can only keep replacing software/hardware component until i find the culprit.
Be careful about buying new hardware. I almost did that after being convinced that my CPU was on the fritz, as I was experiencing seemingly "random" crashes.
In the end the crashes came from several things: C-states (disabled in UEFI), a VLC hardware acceleration bug (worked around by changing settings), and a kernel bug (eventually fixed by a kernel update).
After figuring out what it was, I went from crashing quite often to not at all, and I haven't had an issue for quite some time.
I'm sorry I can't offer more suggestions, but remember that hardware is easy to blame, and replacing it can appear to fix the issue. For example, if a software bug only affects your CPU model and you swap in a CPU that isn't affected, you'd be forgiven for thinking it was a physical hardware fault. That's also why it's an easy explanation to give when you ask for help.
Ryzen 7 9850X3D | AMD 7800XT | KDE Plasma
I wish I knew how to coredump the kernel and just analyze its broken state, but that's beyond my skill level at the moment.
I can only keep replacing software/hardware components until I find the culprit.
I haven't tried disabling C-states, but I'll try that after trying different nvidia drivers.
Also, b/c of the OOM:
https://wiki.archlinux.org/title/MongoD … ages_(THP)
You can also just add "transparent_hugepage=never" to the kernel parameters.
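To confirm which THP mode is actually active after a reboot, something like the sketch below could help. It assumes the usual sysfs path /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is shown in square brackets (e.g. "always [madvise] never"). The function name is made up for illustration.

```python
import re
from pathlib import Path

# Assumed standard sysfs location; the active mode is the bracketed word.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from e.g. 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP sysfs entry not found on this system")
```

If the kernel parameter took effect, this should report "never".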
May 19 08:17:19 night-pc kernel: list_del corruption. prev->next should be fffff0ddcf10cd08, but was fffff0ddc20cbd08. (prev=fffff0ddc20cbd48)
Apr 15 19:35:55 night-pc kernel: list_del corruption. prev->next should be ffffeaa2c9430c08, but was 0000000000000000. (prev=ffffeaa2c438fd88)
Did you update your CPU firmware/microcode? These values don't look random, so I would not suspect RAM instability as a likely cause.
I would advise against creating a coredump or trying to debug this issue yourself, as race conditions like this in the spinlock/RCU code are not for beginners. I also doubt the dump would even contain the data you need, because the CPU that caused the race condition might already have made a context switch.
You can try disabling the cores of your second CCX and see if that helps.
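One way to make "don't look random" a bit more concrete is to XOR the expected and actual pointers from each report and count the differing bits. The sketch below is only a rough heuristic and not a diagnosis: treating a single differing bit as a hint toward a memory bit flip is an assumption on my part, and the function name is made up.

```python
import re

# Matches the "list_del corruption" report format seen in the journal lines.
PATTERN = re.compile(
    r"list_del corruption\. prev->next should be ([0-9a-f]{16}), "
    r"but was ([0-9a-f]{16})"
)


def classify(line: str):
    """Return (xor_hex, differing_bits, kind), or None if the line doesn't match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    expected, actual = (int(g, 16) for g in m.groups())
    diff = expected ^ actual
    differing_bits = bin(diff).count("1")
    if actual == 0:
        kind = "null pointer"
    elif differing_bits == 1:
        kind = "single-bit flip"  # assumption: would hint at a RAM bit flip
    else:
        kind = "multi-bit difference"  # looks like some other pointer
    return hex(diff), differing_bits, kind
```

Feeding it the May 19 line gives a multi-bit difference (the stored value looks like another pointer from the same region), and the Apr 15 line is a plain NULL, neither of which looks like a single flipped bit.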
If you really want to run some tests you can compile and run the following:
https://www.kernel.org/doc/html/latest/RCU/torture.html
Last edited by AaAaAAaaAAaARCH (2024-05-22 09:50:22)
Thanks for all the suggestions, guys.
I have now completely uninstalled everything made by nvidia and even added an AMD GPU to my PC.
I haven't crashed yet and it's been a few days, but I need more time to make sure the system is stable.
If I do crash, I will try disabling C-states and setting up kdump and other stuff next.
And yes, I updated firmware/microcode, latest UEFI and latest Arch packages.
My latest theory is that some kernel module corrupts the kernel and leaks memory, which would explain the very weird OOM kills sometimes.
But like I mentioned, I haven't crashed yet after switching to AMD, so we will see how this goes.
Last edited by hey_adora (2024-05-26 11:23:49)
My latest theory is that some kernel module corrupts the kernel and leaks memory, which would explain the very weird OOM kills sometimes.
https://bbs.archlinux.org/viewtopic.php … 5#p2172675
FWIW, the 550xx nvidia drivers are on record for corrupting kernel memory - try to reproduce this w/ nvidia-open or https://aur.archlinux.org/packages/nvidia-535xx-dkms
My latest theory is that some kernel module corrupts the kernel and leaks memory, which would explain the very weird OOM kills sometimes.
https://bbs.archlinux.org/viewtopic.php … 5#p2172675
seth wrote: FWIW, the 550xx nvidia drivers are on record for corrupting kernel memory - try to reproduce this w/ nvidia-open or https://aur.archlinux.org/packages/nvidia-535xx-dkms
here we go again
May 27 15:44:00 night-pc logd: Skipping 256 entries from slow reader, pid 24731, from LogBuffer::kickMe()
May 27 15:44:07 night-pc kernel: ------------[ cut here ]------------
May 27 15:44:07 night-pc kernel: list_del corruption. prev->next should be ffffc025c5f20c08, but was ffffc025c4a425c8. (prev=ffffc025cdf20c48)
May 27 15:44:07 night-pc kernel: WARNING: CPU: 16 PID: 2429 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xad/0xcd
May 27 15:44:07 night-pc kernel: Modules linked in: xt_nat rndis_host cdc_ether usbnet mii sch_ingress tcp_diag af_key vsock_loopback vmw_vsock_virtio_transport_common nfnetlink_log vmw_vsock_vmci_transport vsock xfrm_interface xfrm6_tunnel vmw_vmci tunnel4 cfg80211 udp_diag tunnel6 inet_diag veth nft_masq nft_chain_nat nf_tables snd_seq_dummy snd_hrtimer snd_seq xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bridge stp llc amdgpu amdxcp drm_buddy overlay binder_linux(OE) mousedev qrtr rfkill amd_atl intel_rapl_msr snd_hda_codec_realtek intel_rapl_common nouveau uvcvideo vfat snd_hda_codec_generic videobuf2_vmalloc mxm_wmi snd_hda_scodec_component snd_hda_codec_hdmi uvc fat drm_gpuvm videobuf2_memops snd_hda_intel snd_usb_audio videobuf2_v4l2 radeon snd_intel_dspcfg drm_exec ledtrig_netdev
May 27 15:44:07 night-pc systemd-journald[1499]: Missed 1925 kernel messages(20mb)
full log https://raw.githubusercontent.com/hey-a … main/1.txt
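The "Modules linked in:" line marks tainting modules with flags in parentheses, e.g. binder_linux(OE), where O means out-of-tree and E means unsigned. A small sketch for pulling those out of a long line like the one above (the function name is made up, and the flag set in the regex is an assumption about what can appear there):

```python
import re


def flagged_modules(line: str) -> dict[str, str]:
    """Map module name -> taint flags for modules listed like 'name(OE)'."""
    # Only look at the part after the marker; other lines yield an empty dict.
    _, _, tail = line.partition("Modules linked in:")
    # Flags are uppercase letters, optionally followed by +/- while a module
    # is being loaded or unloaded.
    return dict(re.findall(r"([\w-]+)\(([A-Z+-]+)\)", tail))


if __name__ == "__main__":
    sample = "kernel: Modules linked in: amdgpu binder_linux(OE) mousedev"
    print(flagged_modules(sample))
```

Any module that shows up here is a candidate to unload first when hunting for kernel memory corruption.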
newest report:
it wasn't the nvidia drivers
it wasn't the kernel
it wasn't the UEFI
it wasn't zram/zswap
it probably isn't RAM (8h memtest)
but maybe it could be this...
[ 35.558350] razermouse: loading out-of-tree module taints kernel.
[ 35.558365] razermouse: module verification failed: signature and/or required key missing - tainting kernel
gonna try getting rid of this now
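To check whether the kernel is still tainted after removing a module and rebooting, the bitmask in /proc/sys/kernel/tainted can be decoded. This is a minimal sketch that covers only a handful of the documented bits (the full table is in the kernel's "Tainted kernels" documentation), and the names here are made up for illustration.

```python
from pathlib import Path

# Subset of the documented taint bits; the kernel defines more than these.
TAINT_BITS = {
    0: "P: proprietary module loaded",
    9: "W: kernel issued a warning",
    12: "O: out-of-tree module loaded",
    13: "E: unsigned module loaded",
}


def decode_taint(value: int) -> list[str]:
    """Return descriptions for the known taint bits set in value."""
    return [desc for bit, desc in sorted(TAINT_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    tainted = Path("/proc/sys/kernel/tainted")
    if tainted.exists():
        value = int(tainted.read_text().strip())
        print(value, decode_taint(value) or "not tainted (for the bits checked)")
```

A value of 0 after boot would mean no out-of-tree or unsigned module has been loaded so far.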
Last edited by hey_adora (2024-05-28 08:09:49)