#26 2025-07-09 15:34:57

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Re: [SOLVED] Is my CPU faulty? Regular, random userspace lockup/crash

Yes, without pci=nomsi it boots.

#27 2025-07-09 15:48:08

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,608

Good, keep it this way (unlike the other two options I'm not sure how this will mitigate the CPU bug anyway)
Then wait for a crash.

Last edited by seth (2025-07-09 15:48:22)

#28 2025-07-10 09:51:22

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Found a different form of crash issue today: suspend completely bricks the amdgpu driver (mainline kernel, 6650 XT), needs hard shutoff to fix.

https://bbs.archlinux.org/viewtopic.php?id=306813 This post mentions C-State control working around the issue, though I haven't tried yet
Wondering if my motherboard model/firmware just has severe C-state control problems.

Mentioning it only in case it provides useful info; I'm just going to disable suspend for the time being.

#29 2025-07-10 10:29:33

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I've also set up netconsole in case it helps capture additional log messages I would otherwise be unable to retrieve.

#30 2025-07-10 15:53:05

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,608

#31 2025-07-10 20:51:46

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I've been looking through various threads about this kind of soft-lock issue on Ryzen CPUs, and I'm starting to notice that basically every single report uses ASUS or Asrock motherboards.
Has any investigation been done in that direction? I find it *kind of* implausible that this would be such a consistent issue with these motherboards, all the way from the 1st generation Ryzen CPUs to the current generation, but with how many other issues have been prevalent in ASUS/Asrock boards...

Not that I would really be able to do anything about that, my motherboard is a highly specific pick for the hardware I have in my machine.

But I would feel more comfortable putting in workarounds for motherboard issues than CPU issues...


While I could find some reports with MSI boards, they were only about 15% of the amount of ASUS/Asrock mentions. And some reports were 'works perfectly fine on the MSI board out of box' or 'works with Typical Current Idle' (cf. https://wiki.archlinux.org/title/Ryzen# … d_suspend).
So maybe Ryzen CPUs are just more sensitive and ASUS/Asrock are being more haphazard with the current/voltage management?

Even if really, this sort of thing happening on *any* board with stock settings is kind of outrageous.

Last edited by LunarLambda (2025-07-10 20:58:02)

#32 2025-07-10 22:07:18

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

LunarLambda wrote:

Even if really, this sort of thing happening on *any* board with stock settings is kind of outrageous

Do you know the YouTube channel Gamers Nexus? It's not a bad source for finding out which brands have the worst quality assurance. These days ASUS seems to be winning that contest, with ASRock right behind, I guess. The channel has an interview with ASRock, and there is news about the latest ASUS mishaps.

If you can still get your money back and want to make the change, try switching to Intel 12th gen, or to any generation that isn't broken and works well for your needs; as you may know, Intel's 13th and 14th gen are broken too. If you believe the problem is the motherboard brand and its inability to deliver a proper fix, then in my humble opinion I would suggest doing something like that.

Nevertheless, I hope you can somehow fix your issue soon without doing what I'm suggesting.


str( @soyg ) == str( @potplant ) btw!

Also now with avatar logo included!

#33 2025-07-10 22:16:49

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,608

So maybe Ryzen CPUs are just more sensitive and ASUS/Asrock are being more haphazard with the current/voltage management?

Yes and possibly, but

While I could find some reports with MSI boards, they were only about 15% of the amount of ASUS/Asrock mentions.

Which would probably meet the ratio I've seen on unrelated journals - asus might just have a higher market share (among the relevant audience)

It would be a rather bold strategy to maximize your profits by gaining a reputation of being unusable.
In reality this most likely gets fine-tuned to the usage patterns on windows. Running a bunch of pointless daemons likely helps to keep the cores warm and lower the PBO triggers.
Also the situation is aggravating.

Did you ever

If not, adjust the curve optimizer and pass slightly more voltage to each core (+4 as suggested in the wiki is probably a good starting point) - running the CPU (slightly) hotter is however kinda the plan in that case (to prevent the PBO from going overboard)

?

#34 2025-07-11 08:31:51

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I haven't yet. I initially ran the CPU with a -30mV undervolt, but then gradually reduced this once crashing started happening. I believe I experienced at least one crash with no Curve Shaper/Curve Optimizer applied at all, in either direction. I can try +4. I've kept my settings as is thus far, and only messed with the kernel command line.

I also had issues running 4 RAM sticks previously, no idea if that was fixed by the BIOS update or also a sign of bad luck with the silicon.

Last edited by LunarLambda (2025-07-11 08:32:45)

#35 2025-07-11 08:35:37

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Succulent of your garden wrote:
LunarLambda wrote:

Even if really, this sort of thing happening on *any* board with stock settings is kind of outrageous

Do you know the YouTube channel Gamers Nexus? It's not a bad source for finding out which brands have the worst quality assurance. These days ASUS seems to be winning that contest, with ASRock right behind, I guess. The channel has an interview with ASRock, and there is news about the latest ASUS mishaps.

If you can still get your money back and want to make the change, try switching to Intel 12th gen, or to any generation that isn't broken and works well for your needs; as you may know, Intel's 13th and 14th gen are broken too. If you believe the problem is the motherboard brand and its inability to deliver a proper fix, then in my humble opinion I would suggest doing something like that.

Nevertheless, I hope you can somehow fix your issue soon without doing what I'm suggesting.

I'm aware of the somewhat poor track record of ASUS/ASrock, but I had hoped that things were mostly resolved at the time I had put this PC together. It's unfortunate, this motherboard was otherwise rock solid and great to work with.

It's really just very stressful that I can't figure out if it's a CPU or motherboard issue. I have contacted the retailer for my CPU about getting it exchanged under warranty, in the meantime, though.

#36 2025-07-11 12:12:38

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

LunarLambda wrote:

I'm aware of the somewhat poor track record of ASUS/ASrock, but I had hoped that things were mostly resolved at the time I had put this PC together. It's unfortunate, this motherboard was otherwise rock solid and great to work with.

It's really just very stressful that I can't figure out if it's a CPU or motherboard issue. I have contacted the retailer for my CPU about getting it exchanged under warranty, in the meantime, though.

This is going to sound a little silly given your situation, but have you ever checked the ASUS Pro WS motherboard series? They are workstation boards, and if I remember correctly some of them are pretty cheap. Maybe you could switch to a Pro WS board from a few years back, though I'm not sure, because your CPU seems to be too new. But if you find a Pro WS board compatible with your current CPU socket, that might work for you: since they are workstation boards, they are pretty neat and not too new.

Last edited by Succulent of your garden (2025-07-11 12:14:36)


#37 2025-07-11 12:17:39

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Have not. I'll keep it in mind as an option

#38 2025-07-11 12:25:45

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

LunarLambda wrote:

Have not. I'll keep it in mind as an option

Check it out; there are also models for Intel CPUs if you want to change brands. For the cheaper models you would probably need to use a riser cable and do some magic to use all the available PCIe lanes, since some boards split their slots between PCIe 5.0 and PCIe 4.0 or 3.0; obviously this only matters if your GPUs are big and take up two slots each.

Last edited by Succulent of your garden (2025-07-11 12:26:19)


#39 2025-07-15 14:11:12

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I have now updated my motherboard's UEFI to the latest version (1512 iirc) and have a replacement CPU on my desk.

My plan is now the following:

1. Run the old CPU without all the workarounds and see if the crashing still occurs
2. If it does, reset CPU voltage & performance settings in the UEFI to complete stock and see if that crashes too
3. Rinse and repeat with the new CPU

If it crashes with the new CPU, I conclude it's a motherboard power management issue and I put the workarounds back/adjust voltages to prevent the issue

If it *doesn't* crash with the new CPU, I return the old one and live happily ever after ^^

I also enabled the CPU Watchdog Timer in UEFI, on the off-chance that can net any extra clues/let Linux report any additional info.

Update the next day: despite leaving the PC running overnight, with MC running, with the CPU undervolted and C6 enabled, I actually didn't run into any crash yet.
I even tried to run `stress` on a single CPU core using `systemd-run --property CPUAffinity=...` (and mprime) in case that could directly confirm an issue with C6, and nothing so far.
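
For reference, a minimal sketch of the same single-core idea in Python. This is not what was actually run above (that was `stress` and mprime via `systemd-run`); it only shows pinning a busy loop to one core with `os.sched_setaffinity`, which is Linux-only. The helper names `pin_to_core` and `burn` and the one-second duration are made up for illustration.

```python
import os
import time


def pin_to_core(core):
    """Restrict the calling process to a single CPU core (Linux only)."""
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


def burn(seconds):
    """Busy-loop for roughly `seconds`; returns the number of iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


if __name__ == "__main__":
    # Pick a core this process is actually allowed to run on.
    core = min(os.sched_getaffinity(0))
    print("pinned to", pin_to_core(core))
    print("iterations:", burn(1.0))
```

A busy loop like this only keeps one core awake while the others idle; it is far lighter than mprime and says nothing about numerical stability.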

I'll keep observing, but there's a minor chance that the 1504 -> 1512 BIOS update just fixed things. That'd be a little embarrassing after all that back and forth.

Also installed rasdaemon for long-term tracking, but as far as I can tell the machine has never logged any sort of MCE or other issue.

Last edited by LunarLambda (2025-07-16 06:59:12)

#40 2025-07-16 18:08:46

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Old CPU still freezes, disabled the undervolt (= all CPU settings are *stock*).

Nothing in netconsole, no MCE or panic or anything at all. CPU Watchdog timer didn't seem to kick in either, machine was locked up for at least 15 minutes.


The only thing I can really add is that basically every single time, the last things logged by the journal are messages from rtkit-daemon:

Jul 16 19:10:15 ancyor rtkit-daemon[1914]: Supervising 18 threads of 11 processes of 2 users.
Jul 16 19:10:15 ancyor rtkit-daemon[1914]: Supervising 18 threads of 11 processes of 2 users.
Jul 16 19:40:20 ancyor rtkit-daemon[1914]: Supervising 18 threads of 11 processes of 2 users.
Jul 16 19:40:20 ancyor rtkit-daemon[1914]: Supervising 18 threads of 11 processes of 2 users.

Freeze happened around 19:45 based on the frozen Desktop.
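
As a rough illustration of how quiet the journal was before the freeze, here is a small sketch that computes the gap between the last logged line and the observed freeze time. The journal lines are taken from above (duplicates dropped); the year 2025 is an assumption, since the short journal format omits it, and `last_timestamp` is a made-up helper name.

```python
from datetime import datetime

# Journal lines copied from the post; the short format has no year,
# so 2025 (the year of the post) is assumed below.
JOURNAL = """\
Jul 16 19:10:15 ancyor rtkit-daemon[1914]: Supervising 18 threads of 11 processes of 2 users.
Jul 16 19:40:20 ancyor rtkit-daemon[1914]: Supervising 18 threads of 11 processes of 2 users.
"""


def last_timestamp(journal, year=2025):
    """Return the timestamp of the last non-empty journal line."""
    lines = [ln for ln in journal.splitlines() if ln.strip()]
    stamp = " ".join(lines[-1].split()[:3])  # e.g. "Jul 16 19:40:20"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Approximate freeze time, read off the frozen desktop.
freeze = datetime(2025, 7, 16, 19, 45)
gap = freeze - last_timestamp(JOURNAL)
print(f"journal silent for about {gap.total_seconds():.0f} s before the freeze")
```

Since rtkit-daemon only logs every so often, a gap of a few minutes does not by itself point at rtkit; it mostly shows that nothing else got written in that window.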

Last edited by LunarLambda (2025-07-16 18:11:40)

#41 2025-07-17 14:56:43

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

LunarLambda wrote:

Nothing in netconsole, no MCE or panic or anything at all

That looks like a fault coming from the motherboard, in my humble opinion. If the old CPU still freezes, it's very unlikely to be a bug in the AMD PSP hardware of your new CPU or something similar. Since your Arch Linux doesn't log anything, as you probably know, everything points to a hardware issue.

Since you have two GPUs, may I ask: how many watts does your power supply provide? Have you ever tried unplugging one GPU from its slot to see if that helps? You could do that just to rule out a power supply issue and/or some kind of PCIe lane issue on your motherboard. For example, if the whole system works fine with just one GPU, then you can be sure it's a fault in how the motherboard chipset handles the PCIe lanes between the CPU and the GPUs. If your AMD CPU has integrated graphics, also try unplugging both GPUs to see what happens. If everything works fine with only the CPU, plug in the card you haven't tested yet, just to rule out a GPU firmware/driver issue with the chipset handling the PCIe lanes.

If everything stays the same, then in my opinion there is probably a bug in the motherboard, possibly a hardware-level one in how the UEFI and/or chipset is made. Support from the brand has been pretty shitty, and it's most likely something AMD can't fix alone; if it could, they would probably just ship a CPU firmware update. So the motherboard brands and/or AMD probably just fucked everything up while they were working out the CPU specifications and how the motherboard handles them.

I guess welcome to the age of x86_64 CPU enshittification. Hope someday RISC-V will be mainstream to help you troubleshoot more deeply and faster :C

Last edited by Succulent of your garden (2025-07-17 15:00:32)


#42 2025-07-17 15:48:22

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

The PSU is 1000W and has a Cybenetics Platinum rating, I'm very confident it's not an issue with that.

As for PCIe lanes, the GPU slots are both Gen 5 x8 (though the GPUs only do Gen 4 speed), behind a PCIe switch on the motherboard, but then routed directly to the CPU (so it does not go through the chipset)
I disabled the CPU's integrated graphics in the UEFI to save a bit of power/heat, and to make sure the system doesn't accidentally try to use it for anything.

The two-GPU setup has been working since February; I only started having the soft-lock issues around May, and almost exclusively while playing modded MC.
(worth noting that the CPU was *always* undervolted, since I built the PC, though while troubleshooting I've disabled it for the time being)

I have nothing to suggest a GPU issue, the soft-lock always affects both desktops simultaneously and in the exact same way.

Some things that seemed to mitigate things were disabling CPU turbo, or messing with C-states. But I dislike those options since it loses performance I paid for.
Right now I'm running everything at stock settings to see if it will still crash.

It could still be a motherboard issue ultimately, but right now I don't have a good way to confirm either theory other than process of elimination.

Last edited by LunarLambda (2025-07-17 15:48:37)

#43 2025-07-17 16:13:41

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

With your new clarifications I also think it's neither a GPU issue nor a PSU issue.

LunarLambda wrote:

Right now I'm running everything at stock settings to see if it will still crash.

It could still be a motherboard issue ultimately, but right now I don't have a good way to confirm either theory other than process of elimination.

I guess if stock settings fail too, then you can more than confirm that it's the motherboard, since it was running fine months ago. In that case it could also be bad soldering in the inner layers of the motherboard PCB that just no longer works well enough. Sadly, if that's your case, there is not much you can do other than get another motherboard.

EDIT: You can try taking out the PSU and checking with a multimeter whether all the rails deliver the correct voltages, after you have closed the circuit of course. But I don't think that is going to help with anything; it's just to be 100% sure that the PSU hasn't become slightly faulty over the past months, which I don't think is your case.

Last edited by Succulent of your garden (2025-07-17 16:18:30)


#44 2025-07-19 12:48:54

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I bought the PSU this year, it's a Corsair and I'd be astounded if it was faulty.

That said, I just had another crash. Latest BIOS, all stock CPU settings, no undervolt or anything at all. It honestly spooked me a bit because after a few days of running fine I wasn't really expecting it to die on stock settings.

#45 2025-07-19 12:50:16

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I also don't have a multi-meter so PSU-level troubleshooting would be difficult. That said I'll inspect all the cables for everything when I swap in the new CPU for testing.

EDIT: Newest crash also did not leave any sort of logs behind. I saw some 'interrupt took too long' and hrtimer messages over the last few days, but they were always isolated and did not coincide with anything noticeable and were always many hours away from any crash.

Last edited by LunarLambda (2025-07-19 13:29:54)

#46 2025-07-19 14:40:34

Succulent of your garden
Member
From: Majestic kingdom of pot plants
Registered: 2024-02-29
Posts: 1,380

Does your motherboard have a backup BIOS option to restore the default stock BIOS? If I'm not wrong, the issue also happens with the stock default BIOS, right?

If the issue does happen with the stock BIOS, then it's 100% your motherboard. It could be purely a firmware issue, or a combination of very specific bad soldering of the PCI inner layers and firmware that just freezes your computer and doesn't reset. I mean, a regular faulty motherboard should reset immediately or turn off if there is some power issue, for example an electromagnetic field issue or detected voltage anomalies. That's why I'm leaning towards a firmware issue, but since your computer worked fine for a couple of months, that makes me think about the bad soldering too. With current-day standards, who knows :C ?

If you don't want to buy a multimeter, try to find a good friend you can visit and use their computer with your PSU to check whether it happens there. Or check whether they have a multimeter and try running the PSU in isolation, but I don't think your PSU is the issue TBH. Also, if you have or can get another PSU, swap it into your computer. If it has fewer watts, just take out one GPU. If your PC still freezes, that is another way to be 100% sure that the issue is coming from the motherboard.

Last edited by Succulent of your garden (2025-07-19 14:42:03)


#47 2025-07-19 14:47:46

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I had not updated the BIOS from when I bought the Motherboard before making this thread, so yes. I don't think there's any way to roll back to the very first BIOS version, though? At least if the vendor is to be believed.

But, from what other people have said, if it was the motherboard delivering too little power in certain situations, it wouldn't necessarily detect this as an anomaly and reset?

PCIe could be plausible, but, I feel like there would be intermittent issues, that the kernel would detect and try to reset devices / report random errors during operations?

It seems unlikely that faulty PCIe connections/layering would exclusively cause this one type of crash with this one type of symptom, with no logs or driver errors?

I will try the new CPU for now, and report back. I really hope the world is simple and it's just a faulty CPU.

EDIT 2025-07-20: I've now swapped in the new CPU. Observing.

Last edited by LunarLambda (2025-07-20 12:01:16)

#48 2025-07-24 10:31:09

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

Well, I thought I was in the clear but after around 4 days of uninterrupted uptime I did manage to get another freeze on the new CPU, all stock settings. (Which is longer than anything the old CPU managed, in fairness!)

I'm gonna look into the only other source of errors in the system. One of the drives in my ZFS pools frequently drops offline and has to be reset by the kernel:

[321271.809465] ata9.00: exception Emask 0x0 SAct 0x808000 SErr 0xed0001 action 0x6 frozen
[321271.809732] ata9: SError: { RecovData PHYRdyChg CommWake 10B8B BadCRC Handshk LinkSeq }
[321271.809940] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.810136] ata9.00: cmd 61/08:78:10:9d:3e/00:00:00:00:00/40 tag 15 ncq dma 4096 out
[321271.810136]          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[321271.810584] ata9.00: status: { DRDY }
[321271.810775] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.810955] ata9.00: cmd 61/40:b8:48:bb:20/00:00:00:00:00/40 tag 23 ncq dma 32768 out
[321271.810955]          res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
[321271.811320] ata9.00: status: { DRDY }
[321271.811502] ata9: hard resetting link
[321272.274459] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[321272.275949] ata9.00: NCQ Non-Data Log not supported
[321272.291337] ata9.00: NCQ Non-Data Log not supported
[321272.291522] ata9.00: configured for UDMA/133
[321272.301910] ata9: EH complete

I'm not sure how a dying drive would cause full-system crashes (without anything else reporting errors, ZFS being clean, and it not even being the root drive). But eh.
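
To keep an eye on how often the link drops, something like the following sketch could tally ATA error-handling events per port from the kernel log. The sample lines are a subset of the output above; the regular expression and the `summarize` helper are my own illustration, not an existing tool, and only cover the message shapes seen here.

```python
import re
from collections import Counter

# A few lines from the kernel log above, used as sample input.
DMESG = """\
[321271.809465] ata9.00: exception Emask 0x0 SAct 0x808000 SErr 0xed0001 action 0x6 frozen
[321271.810136] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.810775] ata9.00: failed command: WRITE FPDMA QUEUED
[321271.811502] ata9: hard resetting link
[321272.301910] ata9: EH complete
"""

# "[timestamp] ataN(.MM): message"; continuation lines do not match.
LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+(?P<port>ata\d+)(?:\.\d+)?: (?P<msg>.*)$")


def summarize(log):
    """Count exceptions, failed commands and hard resets per ATA port."""
    counts = {}
    for raw in log.splitlines():
        match = LINE.match(raw)
        if match is None:
            continue
        per_port = counts.setdefault(match["port"], Counter())
        msg = match["msg"]
        if msg.startswith("exception"):
            per_port["exceptions"] += 1
        elif msg.startswith("failed command"):
            per_port["failed_commands"] += 1
        elif msg == "hard resetting link":
            per_port["hard_resets"] += 1
    return counts


for port, per_port in summarize(DMESG).items():
    print(port, dict(per_port))
```

Fed with the full `dmesg` or `journalctl -k` output, this would show whether the resets are confined to one port and whether they cluster around the freezes.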

I'll also try rcu_nocbs=0-15 again, I suspect that might be the most likely thing to fix it.
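
To confirm after a reboot that `rcu_nocbs=0-15` was actually passed to the running kernel, the command line can be read back from `/proc/cmdline`. A minimal sketch of parsing it follows; `parse_cmdline` and `expand_cpu_list` are made-up helper names, the sample string is hypothetical, and the naive whitespace split ignores quoted values.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value or None}."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def expand_cpu_list(spec):
    """Expand a CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in spec.split(","):
        low, sep, high = part.partition("-")
        cpus.extend(range(int(low), int(high if sep else low) + 1))
    return cpus


# Hypothetical sample; on a real system, read the text of /proc/cmdline instead.
sample = "root=/dev/nvme0n1p2 rw rcu_nocbs=0-15"
params = parse_cmdline(sample)
print(expand_cpu_list(params["rcu_nocbs"]))
```

Seeing the parameter there only shows it was passed; whether the kernel honours it also depends on how the kernel was built.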

Last edited by LunarLambda (2025-07-24 10:38:15)

#49 2025-07-24 11:03:56

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,608

If this is a PBO issue, nothing (incl. new silicon) but adjusting the voltage supply (curve optimizer) will be able to fully address this.
Have you already tried to offset that by +4 (which is a rough starting point for tests, not a guarantee)?

#50 2025-07-24 15:19:50

LunarLambda
Member
Registered: 2021-08-02
Posts: 50

I can try, but ultimately it doesn't really solve my troubles - at stock settings the CPU hits TJmax in certain loads, despite dual-tower cooler and proper thermal paste application. Besides, would a PBO issue take 4 days to manifest?

I'm gonna try removing the faulty drive, then try a PSU swap, and then I might need to reconsider the possibility of a motherboard fault...
