#1 2022-09-13 19:09:33

defborn
Member
Registered: 2022-02-08
Posts: 6

[SOLVED] Unreliable (re)boots

I have been following the forum for almost a year, learning a lot and hoping I would eventually fix this problem myself, but I failed. It has become utterly frustrating. My system is - once it is up - really stable. The only thing that results in a complete system crash (as in: black screen, computer shuts down instantly) is doing something graphics-related like editing a large file in Gimp or doing video editing - which I rarely have to do. Other than that it is fine. But whenever I update the system I live in fear of not being able to reboot. I have 4 boot options: two kernels (linux and linux-lts), each with a separate 'initramfs fallback' option. Sometimes the system starts booting and just cuts out at a random point. Sometimes it reaches the login, sometimes it doesn't. In the beginning I had some luck booting into the lts kernel - which made me believe the problem was in the linux kernel. But after some time that made no difference either. It's just so unreliable. If I have to reboot I only do it in the evening so I have time to get it back up before work the next day.

I already ran memtest, twice, and nothing is wrong there. I do load the nvidia driver with early module loading, and given the crashes in Gimp or video editing, I still kind of think the driver might have something to do with it. But then again, I can play Team Fortress 2 for hours straight without any hiccup...

My system:

OS: Arch Linux 
Kernel: x86_64 Linux 5.15.67-1-lts
Uptime: 9m
Packages: 1091
Shell: zsh 5.9
Resolution: 3840x2160
WM: dwm
GTK Theme: Arc-Lighter [GTK2/3]
Icon Theme: Adwaita
Font: Nimbus Sans 11
Disk: 690G / 3.6T (20%)
CPU: AMD Ryzen 9 5950X 16-Core @ 32x 3.4GHz
GPU: NVIDIA GeForce GTX 1650
RAM: 2865MiB / 32023MiB

lsmod info: http://0x0.st/oOAc.txt

My last boot log (journal): http://0x0.st/oOT5.txt
I can post the dmesg but is that not just from the current boot? Which went fine.
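On the dmesg question: dmesg does only cover the running boot, but if the journal is persistent, the kernel messages of an earlier (failed) boot can be read back with journalctl, where -k limits output to kernel messages and -b -1 selects the boot before the current one. A minimal sketch that only builds the command line; the helper name is hypothetical, and it assumes journalctl is on the PATH and that the journal is stored persistently:

```python
import shlex


def previous_boot_kernel_log_cmd(boots_back=1):
    """Build a journalctl command line for the kernel log of an earlier boot.

    boots_back=1 is the boot before the current one, 2 the one before that.
    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    if boots_back < 1:
        raise ValueError("boots_back must be at least 1")
    # -k: kernel messages only (what dmesg would have shown for that boot)
    # -b -N: select the boot N boots before the current one
    return ["journalctl", "--no-pager", "-k", "-b", "-{}".format(boots_back)]


if __name__ == "__main__":
    # Print the command so it can be copied into a shell.
    print(shlex.join(previous_boot_kernel_log_cmd(1)))
```

'journalctl --list-boots' shows which boots the journal still holds, which helps to pick the right offset.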

Yesterday I tried something else. I have 4000 MHz RAM, which only reaches that speed with the OC profile enabled; otherwise it runs at 2666 MHz. I disabled the OC profile, rebooted twice, and got two perfectly fine boots. So I figured that was the problem. I rebooted once more just to be sure, and the boot failed :(

Feel free to ask for any insight/info, I will do my best to provide it. Maybe it is a hardware issue; I am willing to replace whatever is needed, but not unless it is necessary. I also own a ThinkPad X1 Carbon which runs Arch, and I never have issues there, so that might point towards hardware? Especially because the cut-outs sometimes also happen when booting from the live image. Those are rare, but still, it happens...
Thank you 1000 times, in advance. I just hope to have confidence in rebooting the machine.

Last edited by defborn (2022-09-19 11:23:22)

Offline

#2 2022-09-13 21:44:07

xerxes_
Member
Registered: 2018-04-29
Posts: 665

Re: [SOLVED] Unreliable (re)boots

There have been many threads about instability with AMD Ryzen systems on this forum. What solutions have you tried so far?
Check that the amd-ucode package is installed and working: 'pacman -Ss ucode', 'ls /boot/*ucode*' or 'ls /efi/*ucode*', 'sudo journalctl -b -g microcode'.
If I were you, I would start by disabling the power-saving C-states in the BIOS and see if that helps.
Maybe check if there is a newer BIOS version for your motherboard, but before you update, search for known problems with that new version.

Offline

#3 2022-09-14 12:21:40

Wild Penguin
Member
Registered: 2015-03-19
Posts: 319

Re: [SOLVED] Unreliable (re)boots

My bet is faulty hardware (well, I wouldn't bet my house but perhaps my car ... but I don't own a car). More specifically, I'll bet it's the PSU.

Note that, generally, no single test or application really tests the system thoroughly - so you cannot deem a system stable just because a test comes back negative. In other words: only positive test results (failures) are indicative of a hardware fault; negative results have little value.

In your case, given that there is no specific way to reliably reproduce this - and that these are hard resets - the facts strongly point towards faulty hardware.

The PSU would be the first thing to swap out, since it is (perhaps) the easiest to swap, and generally not too expensive. Also, there's a high likelihood (IMO!) that it actually is the PSU. Borrow one, or buy a spare if you cannot borrow one (it never hurts to have one).

If the issues persist after changing the PSU, then I'd first take away the GPU (if you have an IGP) and test if the system is stable without the GPU. If it is, it does not mean the GPU is faulty - it might just mean your MB/CPU/RAM/PSU can not work reliably with this GPU (for example: faulty PCIe lanes...). Try the GPU on another computer.

After this, swap the RAM, CPU, M/B one by one to a known good one and try to isolate the bad part (you can also try the parts being tested on a known good computer). But you will need lots of alternative parts, it can be a hassle and expensive (unless you have spares / can borrow some) - but there is really no way around it.

Cheers, and hope you get it sorted out!

p.s. I've had identical stability problems before, on two different M/Bs. On the first one (Z97 chipset, i7-4790K) I was able to work around them by reducing the maximum simultaneous boost clocks on all four cores (the Z97 MB actually wanted to apply a slight overclock). But I got similar problems on a B550 motherboard with the same PSU. After swapping the PSU (for an inferior, but known good one), all stability problems were gone! The same Z97/i7-4790K is also working with no stability issues to this day, with full boost simultaneously on all four cores, in a relative's computer - but with a different PSU.

p.p.s. My go-to test for stability is to stress the CPU and GPU simultaneously with synthetic tests. I have yet to come across a system which was unstable in real-world workloads but stable under simultaneous synthetic CPU+GPU stress tests. Tests of just the CPU (or RAM) or just the GPU, on the other hand, might still pass (and have!) on such a system. Perhaps throw in a simultaneous RAM+GPU test for good measure...

p.p.p.s. This (GPU+CPU, i.e. maximum power draw) is also kind of mandatory for thermal testing IMHO - just because temperatures are good in one gaming session does not mean they will be good in every gaming session (I commonly see this kind of misdeduction - some games, while being a fairly light workload most of the time, can suddenly get surprisingly near the max power draw limit(s) in rare but demanding situations!). I'm not saying that your stability problems are caused by bad thermals (IMHO a correctly working system should stay stable in any case and just throttle down) - but this is something which makes sense to do if you want the system to stay healthy and want to know how quiet it can safely be...

Last edited by Wild Penguin (2022-09-14 12:30:37)

Offline

#4 2022-09-15 06:44:55

defborn
Member
Registered: 2022-02-08
Posts: 6

Re: [SOLVED] Unreliable (re)boots

@Wild Penguin and @xerxes_ thank you for taking the time to reply. I was in luck - in a way, since I have a second system which is a little bit newer than my main system, also running Arch, with a Ryzen 7 2700X, whereas my main system has a beefier Ryzen 9 5950X. I went through the list you proposed, since I had all the spare parts in the other system.

- swapped PSU -> still cut outs when booting, randomly (first try worked so I got my hopes up, second boot failed again ^^)
- swapped GPU -> cut outs
- tried with an updated BIOS and setting the AMD curve + 4 positive points (as suggested in an article on the wiki) -> cut outs

Then after the second swap I noticed that once there was a cut out, the red CPU LED on the MB was on. I never noticed that before (it will have been there every time, I guess), how could I have missed it! That got me thinking. What if the CPU was the problem? Could that really be the case? I thought that a CPU would either work or not, not just 'a bit' and 'sometimes'. I swapped the two CPUs and since then I have rebooted my main machine successfully around 15 times. I even edited large RAW photos in Gimp, which always led to a crash before, and nothing... So I am very happy about that!

So to end, I fully swapped both systems, just keeping the mobo in the case and swapping out SSD, GPU, CPU (I left PSU and fans). Rebooting multiple times on my main machine with no issue.

Strangely, the other machine has been booting flawlessly as well, multiple times. Hmm, go figure. So I have both systems working. I will have to give it some time to really judge it though...
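How much a streak of clean reboots says can be estimated roughly. If a fault were still present and made each boot fail independently with probability p, a streak of n clean boots would occur with probability (1 - p)^n. A minimal sketch; the function names are hypothetical, and both the independence and the values of p are assumptions, since no failure rate was measured in this thread:

```python
def chance_of_clean_streak(p_fail, boots):
    """Probability of `boots` clean boots in a row if each boot fails
    independently with probability p_fail (an assumed, unmeasured rate)."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if boots < 0:
        raise ValueError("boots must not be negative")
    return (1.0 - p_fail) ** boots


def boots_needed(p_fail, max_chance=0.05):
    """Smallest number of clean boots in a row that a still-present fault
    would produce with probability of at most max_chance."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < max_chance < 1.0:
        raise ValueError("max_chance must be strictly between 0 and 1")
    boots = 0
    while chance_of_clean_streak(p_fail, boots) > max_chance:
        boots += 1
    return boots


if __name__ == "__main__":
    # Assumed failure rates, for illustration only.
    for p in (0.5, 0.1):
        print(p, chance_of_clean_streak(p, 15), boots_needed(p))
```

With an assumed failure rate of one boot in two, 15 clean reboots in a row would be a 1 in 32768 coincidence, so they are strong evidence. With a rarer fault, say one boot in ten, the same streak says much less, which is why giving it some time makes sense.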

To recap:
1/ I am very happy, now working on what was my lighter machine, which now just has the new Ryzen 9 CPU (as if I did an upgrade), no problems so far;
2/ I have the former 'beefier' machine also running, with the Ryzen 7, also without any issue - though this is my software build machine, I do not actually work on it continuously.

The only thing I can think of is that there must be some kind of bug/conflict on the MoBo. The system that had the problems initially has a 'B450 GAMING PRO CARBON MAX WIFI', while now I am using an MSI B450 TOMAHAWK MAX II (MS-7C02). I have read about specific CPU/MoBo bugs before.

Should I already mark this post as solved, or give it a week or so?

Offline

#5 2022-09-19 11:01:18

Wild Penguin
Member
Registered: 2015-03-19
Posts: 319

Re: [SOLVED] Unreliable (re)boots

I think you can mark this as solved, since indeed it is now established it is a H/W problem (and not something with Arch Linux, and as such a bit OT for this forum).

A few remarks on the HW: I've never heard of bugs in M/Bs that would prevent a supported CPU from being used reliably and that were not promptly fixed by BIOS updates after release. One exception could be (old, near-EOL) MBs which had support for recent CPUs added later by BIOS updates, and then only for the high end of these CPUs (i.e. CPU support extended as an afterthought). But B450 is not that old compared to the 5950X.

So either your MB is faulty - or perhaps there was dirt / dust / something in the CPU socket. You could still try the CPU on your main system, just to rule out a seating/dirt problem. Make sure to clean the CPU well beforehand. Also, check for bent pins in the socket. (Note: some CPUs might not use all pins, and some might be more demanding on the power delivery components on the MB... so if one CPU works, that does not mean the MB is not faulty.)

If it still resets, then the MB is faulty (just your unit - I'm really skeptical it was a design fault; perhaps a BIOS bug, but those should be fixed by now). If it still has a warranty, I'd consider doing an RMA for the motherboard.

Offline

#6 2022-09-19 11:22:35

defborn
Member
Registered: 2022-02-08
Posts: 6

Re: [SOLVED] Unreliable (re)boots

@WildPenguin well, I have swapped CPUs over on the two systems, both work. I also am rather convinced it is a HW bug, but considering the following it will be hard to address that with the vendor:

- system A with CPU A: bugged (original setup);
- system B with CPU B: works;
- system B with CPU A: works;
- system A with CPU B: works.
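The four results above can be tabulated to check whether any single part fails in every combination it was tested in. A minimal sketch; the data structure and function names are hypothetical, and 'system' here means everything that stayed in the case:

```python
# Swap-test results as listed above: (system, cpu) -> boots reliably?
RESULTS = {
    ("A", "A"): False,  # original setup, cuts out
    ("B", "B"): True,
    ("B", "A"): True,
    ("A", "B"): True,
}


def always_fails(results, axis, name):
    """True if every tested combination containing this part failed.

    axis 0 selects the system, axis 1 the CPU.
    """
    outcomes = [ok for combo, ok in results.items() if combo[axis] == name]
    return bool(outcomes) and not any(outcomes)


def diagnose(results):
    """Summarise failing combinations and parts that never worked anywhere."""
    failing = sorted(combo for combo, ok in results.items() if not ok)
    bad_systems = sorted({s for s, _ in results if always_fails(results, 0, s)})
    bad_cpus = sorted({c for _, c in results if always_fails(results, 1, c)})
    return {"failing": failing, "bad_systems": bad_systems, "bad_cpus": bad_cpus}


if __name__ == "__main__":
    print(diagnose(RESULTS))
```

Neither system A nor CPU A fails everywhere, so the table alone cannot name one faulty part; it only isolates the combination of the two.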

So it is very odd that everything works as long as I don't use the Ryzen 9 on the B450 Gaming Pro Carbon... And testing it is easy even without a stress test: with the faulty setup I can never get 2 or 3 successful reboots in a row, it just won't happen. And with all other setups reboots work every time. I had also been thinking about dust, too much/not enough cooling paste, dirt under the CPU, but I cleaned and retried numerous times, across the two systems. But as you said, I also could not find anything about CPU/MB bugs of this kind that were not addressed soon after the MB and/or CPU came out.

I think getting the RMA will be rather difficult, since I'm almost certain that when the vendor tests it, it'll just work :P To be 100% sure I would have to get my hands on a second Ryzen 9 and try with that, but I have neither the opportunity nor the budget to buy an extra one.

Thank you for your input! I will try to work on the 'second' machine (with the slower CPU but with the possibly faulty MB) this week to see if it triggers anything. But following your line of thought and circling back to my original post: how would you explain that the (allegedly) faulty MB works fine with the Ryzen 7 for days on end (day 5 now, with reboots due to OS updates), and the minute I put in the Ryzen 9 it cuts out on boot, straight away (tried it yesterday, on Sunday)? OTOH the Ryzen 9 has been my main driver on the other MB for an equal number of days, and it runs fine too. So confusing...

Marked as solved BTW. I am so so happy and relieved, my system feels mega reliable right now.

Last edited by defborn (2022-09-19 11:23:11)

Offline
