#51 2025-07-24 19:02:26

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,587

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

stock setting …  hits TJmax in certain loads … despite dual-tower cooler and proper thermal paste application

Sure, the CPU will run as fast as the TDP allows - that's a combination of critical temperature and fan power. Nothing to write home about.

That aside, PBO is NOT a thermal issue!
The problem is that the CPU runs at low voltages and boosts the frequency so much that it can no longer reliably distinguish 1 and 0.
A positive offset will actually increase the voltage and inherently run the CPU at slightly higher temperatures (and slower/less boosted), but hopefully more reliably, because the 1s and 0s are further apart and don't flip so fast.

Edit: it is also a problem that gets worse over time and will typically only manifest as the CPU slowly degrades; it is extremely common, well documented and the most likely cause.
You can of course exhaust every other option first and only look at the most obvious contender last …

Last edited by seth (2025-07-24 19:04:25)


#52 2025-07-24 21:11:01

LuxFerre
Member
Registered: 2010-03-01
Posts: 117

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

LunarLambda wrote:

Well, I thought I was in the clear but after around 4 days of uninterrupted uptime I did manage to get another freeze on the new CPU, all stock settings. (Which is longer than anything the old CPU managed, in fairness!)

I'm gonna try the only other source of errors in the system, one of the drives in my ZFS pools frequently drops offline and has to be reset by the kernel:

[321271.809465] ata9.00: exception Emask 0x0 SAct 0x808000 SErr 0xed0001 action 0x6 frozen
[321271.809732] ata9: SError: { RecovData PHYRdyChg CommWake 10B8B BadCRC Handshk LinkSeq }
[321271.809940] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.810136] ata9.00: cmd 61/08:78:10:9d:3e/00:00:00:00:00/40 tag 15 ncq dma 4096 out
[321271.810136]          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[321271.810584] ata9.00: status: { DRDY }
[321271.810775] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.810955] ata9.00: cmd 61/40:b8:48:bb:20/00:00:00:00:00/40 tag 23 ncq dma 32768 out
[321271.810955]          res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[321271.811320] ata9.00: status: { DRDY }
[321271.811502] ata9: hard resetting link
[321272.274459] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[321272.275949] ata9.00: NCQ Non-Data Log not supported
[321272.291337] ata9.00: NCQ Non-Data Log not supported
[321272.291522] ata9.00: configured for UDMA/133
[321272.301910] ata9: EH complete

I'm not sure how a dying drive would cause full-system crashes (without anything else reporting errors, ZFS being clean, and it not even being the root drive). But eh.
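To get a feel for how often a port drops like this, the log can be tallied per port. The following is only a minimal sketch: the function name `summarize_ata_errors`, the `EVENTS` table and the regular expression are illustrative assumptions based on the lines quoted above, not an official libata log format.

```python
import re
from collections import Counter

# Matches kernel log lines shaped like the ones quoted above, e.g.
# "[321271.811502] ata9: hard resetting link" or
# "[321271.809465] ata9.00: exception Emask 0x0 ...".
# Continuation lines ("res 40/00:...") do not name a port and are skipped.
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)

# Message prefixes counted here (an assumption taken from the excerpt above).
EVENTS = {
    "exception": "exception Emask",
    "failed_command": "failed command:",
    "hard_reset": "hard resetting link",
}

# The excerpt from the post, used as sample input.
SAMPLE = """\
[321271.809465] ata9.00: exception Emask 0x0 SAct 0x808000 SErr 0xed0001 action 0x6 frozen
[321271.809732] ata9: SError: { RecovData PHYRdyChg CommWake 10B8B BadCRC Handshk LinkSeq }
[321271.809940] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.810136]          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[321271.810775] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.811502] ata9: hard resetting link
[321272.301910] ata9: EH complete
"""


def summarize_ata_errors(lines):
    """Return {port: Counter(event -> count)} for the given log lines."""
    counts = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an ataN line (or a continuation line)
        for name, prefix in EVENTS.items():
            if match.group("msg").startswith(prefix):
                counts.setdefault(match.group("port"), Counter())[name] += 1
    return counts


summary = summarize_ata_errors(SAMPLE.splitlines())
for port, events in sorted(summary.items()):
    print(port, dict(events))
```

Feeding it the saved output of `dmesg` instead of `SAMPLE` would show whether the resets cluster on a single port, which points at one drive, cable or connector rather than the controller as a whole.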

I'll also try rcu_nocbs=0-15 again, I suspect that might be the most likely thing to fix it.
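When testing a parameter like this, it is worth confirming that it actually reached the running kernel. A minimal sketch, assuming the command line is read from /proc/cmdline; the helper names `find_param` and `parse_cpu_list` and the example command line are hypothetical, and only plain comma/range lists such as "0-15" or "0,2-3" are handled:

```python
def find_param(cmdline, name):
    """Return the value of name=value on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


def parse_cpu_list(spec):
    """Expand a simple CPU list such as "0-15" or "0,2-3" into sorted CPU numbers.

    Assumption: only comma-separated numbers and inclusive ranges are supported.
    """
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


# Hypothetical command line for illustration; on a real system use
# open("/proc/cmdline").read() instead.
example_cmdline = "root=/dev/sda1 rw rcu_nocbs=0-15"

value = find_param(example_cmdline, "rcu_nocbs")
if value is None:
    print("rcu_nocbs is not set")
else:
    print("rcu_nocbs covers CPUs:", parse_cpu_list(value))
```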

I had a somewhat similar horror story with an MSI X870 Tomahawk motherboard. Random crashes (I even installed Windows to check, and it crashed too), and one NVMe drive would give errors and sometimes crash as well. Sometimes boot would get stuck counting the memory. It would only be stable with RAM at JEDEC (slow) speeds.
MSI forum was full of similar reports, one topic had dozens of pages...

Try running your RAM at JEDEC/slow/non-EXPO speeds and see if it's better.
I think your issue is the MB, not the CPU. Mine was solved with an X870 Pro RS from ASRock, cheaper and never gave issues...
Also, the latest Ryzen generations are very finicky with RAM. See if yours is in the motherboard's QVL.

Finally, as seth said, high temperatures are not a problem per se; this CPU hits the thermal limit at full load by design.

Last edited by LuxFerre (2025-07-24 21:14:35)


#53 2025-07-25 07:41:12

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

Both the old 2x16 GB DDR5-6400 CL32 and the current 2x32 GB DDR5-6000 CL30 kit are in the motherboard QVL list for Ryzen 9000 CPUs.
Admittedly, while AMD's QVL website lists my kit, it only lists the slower CL36 model.
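For comparing kits like these, the usual rule of thumb is first-word latency in nanoseconds = CL × 2000 / data rate in MT/s. A minimal sketch follows; the function name is illustrative, and the CL36 entry is assumed to be a DDR5-6000 kit, which the QVL remark above does not state:

```python
def cas_latency_ns(data_rate_mts, cas_latency):
    """First-word CAS latency in nanoseconds.

    One memory clock cycle lasts 2000 / data_rate_mts nanoseconds, because
    DDR transfers data twice per clock.
    """
    return cas_latency * 2000 / data_rate_mts


kits = {
    "DDR5-6400 CL32 (old 2x16 GB kit)": (6400, 32),
    "DDR5-6000 CL30 (current 2x32 GB kit)": (6000, 30),
    # Assumption: the QVL-listed CL36 model also runs at 6000 MT/s.
    "DDR5-6000 CL36 (QVL-listed model, assumed speed)": (6000, 36),
}

for name, (rate, cl) in kits.items():
    print(f"{name}: {cas_latency_ns(rate, cl):.1f} ns")
```

Under that assumption both owned kits land on the same latency, while the listed CL36 model is the looser of the three, so a QVL entry for it says little about whether the tighter CL30 timings were validated.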

I already ran a memtest for 18+ hours though without issues. I could run a longer one but absent any of the usual "RAM problem" symptoms it's low on my list of suspects.
Either the motherboard or CPU had issues running 4x16 GB, but I went to 2x32 GB instead before I could determine if it was a CPU or BIOS issue (see earlier in the thread).

I believe I had deadlocks before the RAM change too. But they were possibly also exacerbated by running into OOM.

Bisecting the issue is also taking up a substantial amount of my time because each deadlock now takes somewhere between 1 and 4 days of uptime to manifest.
I removed the faulty hard drive under suspicion it might be causing some sort of deadlock bug in ZFS, which could explain why most of the deadlocks occurred while playing MC: the server sits on that pool.

The only other thing that changed since May when the deadlocks started happening was adding the RX 6650 XT GPU. While it is also possible that it is the root cause, again, I've not had any of the typical "Faulty GPU" problems either.

Replacing the motherboard is challenging because:

1. It's specced fairly specifically for the types and amount of hardware in the system
2. I don't have the money to buy new hardware without sending in the current board for replacement, which
3. means I would be without a PC for anywhere from 1-4 weeks, and I need the PC for work, among other things.

So while I've heard everyone saying it's likely a motherboard issue loud and clear, I do feel the need to rule out *everything* else first.


#54 2025-07-25 12:59:53

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,587

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

when the deadlocks started happening was adding the RX 6650 XT GPU

which will draw a lot of power from the board and might hamper its ability to provide stable voltage to the CPU, exposing too narrow margins.
I'd not call 18h definitive for 64GB, but at least it's not a bad sign for the RAM - have you tried adjusting the PBO curve optimizer yet?


#55 2025-07-25 13:05:36

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

I plan to try a positive offset after this current test (ruling out a ZFS deadlock)


#56 2025-07-30 08:46:34

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

Well, after removing the faulty drive, I am now sitting at an unprecedented 5 days 16 hours of uptime. I'll probably go until 7 days, but at risk of jinxing it:

I think it might've been the drive. Either it timing out somehow caused a deadlock inside of ZFS, or some combination of ZFS and/or the drive timing out was crashing the chipset on my motherboard and taking the system down with it. While I've never encountered anything like this, it does make some sense given the symptoms (slow grind to a complete halt, sounds playing/I/O being done seems to make it happen faster, no log messages...)

If this is the end of this entire mess, I'm happy, but also wow, what a nightmare.


#57 2025-08-15 06:20:09

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

Well, zero crashes or anything since removing that drive. Closing this thread finally.


#58 2025-08-17 22:17:24

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

I'm glad you were able to find the solution!

I also learned a lot in this thread, it was an insane troubleshooting effort to be honest.
Your faulty drive is an HDD, right? If I'm not wrong, the PCB controller in an HDD is a whole world of its own, and you cannot replace it. I believe back in the 90s you could do that, but nowadays you can't, for some security reasons I believe. Whatever it was, it does create a big desync, I guess. I mean, your system was freezing at idle, right? If that's the case, it's hard for me to tell whether it was a memory paging issue. But surely it was an I/O issue.

Also, the MSI X870 Tomahawk mentioned here opened up a new frontier of troubleshooting possibilities for me. Do you all think that the RAM could be the main cause of a mobo just not starting? I mean you just press the start button and nothing happens. At least for me that doesn't make sense, because the mobo should check if the RAM is OK, and if it isn't, start beeping.

Nevertheless, congratulations on your endeavor. Your issue makes me think that I must check some hardware stuff more deeply.


str( @soyg ) == str( @potplant ) btw!

Also now with avatar logo included!


#59 2025-08-17 22:23:07

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

In my experience, if the PC doesn't start at all (no beeps, no fan spin, no lights), it's one of two things:

1. PSU issue (e.g., it immediately trips due to a fault/connection issue or is otherwise not functioning)
2. Issue with the power button or the connection from power button to the front panel header

Some motherboards have a dedicated power button on them, or else you could try removing the front panel connector and manually bridging the pins (check the manual for which pins are the power & reset switch). This could help you rule out the simple stuff.


#60 2025-08-17 23:27:59

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

Yes, I already tried everything you mention. I guess a RAM issue should cause beeps when the BIOS/UEFI checks whether I/O and all the peripherals are OK.
The other option is that some internal solder joint in the mobo's PCB has just died, like with GPUs when you need to do reballing for example, but for mobos, as far as I know, that's impossible to do.



#61 2025-08-26 00:32:31

LuxFerre
Member
Registered: 2010-03-01
Posts: 117

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

Succulent of your garden wrote:

Yes, I already tried everything you mention. I guess a RAM issue should cause beeps when the BIOS/UEFI checks whether I/O and all the peripherals are OK.
The other option is that some internal solder joint in the mobo's PCB has just died, like with GPUs when you need to do reballing for example, but for mobos, as far as I know, that's impossible to do.

With the RAM issues I had, it would at least run the fans and light up some LEDs on the MB. The screen would often be black and it wouldn't always beep though. If nothing at all happens when you press power, I would suspect the PSU is dead... or one of the power cables has issues.


#62 2025-08-26 01:17:11

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

LuxFerre wrote:
Succulent of your garden wrote:

Yes, I already tried everything you mention. I guess a RAM issue should cause beeps when the BIOS/UEFI checks whether I/O and all the peripherals are OK.
The other option is that some internal solder joint in the mobo's PCB has just died, like with GPUs when you need to do reballing for example, but for mobos, as far as I know, that's impossible to do.

With the RAM issues I had, it would at least run the fans and light up some LEDs on the MB. The screen would often be black and it wouldn't always beep though. If nothing at all happens when you press power, I would suspect the PSU is dead... or one of the power cables has issues.

Thanks for the info. The PSU is not the issue; I'm fairly sure some soldering just died in the mobo in my case :C. The good news is that I have already replaced it :) but since the RAM issue mentioned here was as weird as mine, I was wondering whether there was something I was not seeing. I don't think that's the case, though. That mobo in my case is just dead.

