#1 2022-09-10 23:59:21

sfbahr
Member
Registered: 2022-08-31
Posts: 2

[Solved] Hardware Errors every ~5 minutes, eventually reboots itself

First post here, as I'm new to Arch in general (but not Linux). I decided to run Arch for the first time on my new home server, but I'm afraid I've ended up with a bad CPU. The kernel prints hardware errors about CPUs 4 and 12 consistently every 5 minutes, and there are some concerning errors on boot as well (via `journalctl -p 3 -b`):

https://i.imgur.com/pB60feo.png

https://i.imgur.com/YGLIcsK.png

Edit: Here's an attempted copy of the error messages:

CPU 12: Machine Check: 0 Bank 1: dc20000006030151
TSC 291b203d6e ADDR ffffff9b8f4d2a MISC d01b0ff00000000 SYND 1a002868 IPID 100b000000000
Processor 2:a50f00 TIME 1661905752 SOCKET 0 APIC 9 microcode a50000b

CPU:12 (19:50:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc2000000c0a0015
Error Addr: 0x00ff8f3bc171f000
IPID: 0x001000b000000000, Syndrome: 0x000000003a010021
Load Store Unit Ext. Error Code: 10, A parity error was detected in a PWC entry by any access.

CPU:4 (19:50:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
Error Addr: 0x00ffffff9b8f4d2a
IPID: 0x000100b000000000, Syndrome: 0x000000001a002868
Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
cache level: L1, tx: INSN, mem-tx: IRD
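
For anyone trying to read the raw status values above, here is a minimal sketch of how the flag list and the "Ext. Error Code" can be recovered from the 64-bit MCA status word. The bit positions are an assumption taken from the Linux kernel's MCI_STATUS_* definitions (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, SyndV 53), with the extended error code assumed to sit in bits 21:16. The names `FLAGS` and `decode_mca_status` are made up for this example and are not part of any tool.

```python
# Sketch: decode an AMD MCA status word into flags and error codes.
# Bit positions are assumptions based on the Linux kernel's MCI_STATUS_*
# definitions; check them against AMD's documentation for your CPU family.
FLAGS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error; clear means corrected ("CE")
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid
}


def decode_mca_status(status: int) -> dict:
    """Return the flag bits plus the (extended) error codes of a status word."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["ext_error_code"] = (status >> 16) & 0x3F  # assumed bits 21:16
    decoded["error_code"] = status & 0xFFFF            # bits 15:0
    return decoded


if __name__ == "__main__":
    # The two status values quoted in this post.
    for raw in (0xDC2000000C0A0015, 0xDC20000006030151):
        d = decode_mca_status(raw)
        set_flags = [name for name in FLAGS if d[name]]
        print(f"{raw:#018x}: flags={set_flags} "
              f"ext={d['ext_error_code']} code={d['error_code']:#06x}")
```

With the two values from the log, the UC bit is clear in both (hence "CE", a corrected error), and the extended error codes come out as 10 and 3, which matches the kernel's decoded lines.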

I had the same problem with the Arch live install image while I was booting from it to set up my permanent install. Unfortunately, I can't just ignore what's going on, because the machine ends up forcibly rebooting after anywhere from 10 minutes to 1 day of uptime, which doesn't work for a home server that I'm hoping to depend on being online.
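
To check whether the errors really stay confined to the same cores and banks between reboots, a small script can tally them from saved kernel log text. This is only a sketch: the regular expression assumes lines shaped like the `CPU:12 (19:50:0) MC0_STATUS[...]` lines quoted above, and `MCE_LINE`, `count_mce_events` and `SAMPLE` are hypothetical names used for illustration.

```python
# Sketch: count machine check reports per (CPU, bank) in kernel log text.
# The pattern assumes the decoded line format quoted in this post.
import re
from collections import Counter

MCE_LINE = re.compile(r"CPU:(\d+) \([0-9a-fA-F:]+\) MC(\d+)_STATUS")


def count_mce_events(log_text: str) -> Counter:
    """Return a Counter keyed by (cpu, bank) for every matching log line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# Two of the lines from this post, used as sample input.
SAMPLE = """\
CPU:12 (19:50:0) MC0_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc2000000c0a0015
Error Addr: 0x00ff8f3bc171f000
CPU:4 (19:50:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000006030151
"""

if __name__ == "__main__":
    for (cpu, bank), n in sorted(count_mce_events(SAMPLE).items()):
        print(f"CPU {cpu} bank {bank}: {n} event(s)")
```

In practice the input could be the kernel messages of a boot saved to a file (for example the output of `journalctl -k -b`) and passed to `count_mce_events` in place of `SAMPLE`.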

My first suspicion was a memory issue, since at first only one of my two 16GB DIMMs was detected, but after I reseated them both they've been working fine. I just ran MemTest86 from a flash drive and it hasn't found any memory errors so far.

I then suspected my CPU was perhaps overheating, so I tried taking my case completely off but that made no difference at all and temps seem quite reasonable even when running memtest (<70 Celsius).

My build is:
Motherboard: ASRock Rack X470D4U
CPU: AMD Ryzen 7 5700G (integrated Radeon graphics)
Memory: Corsair Vengeance LPX 32GB (2X16GB) DDR4 3200 (PC4-25600) C16 1.35V Desktop Memory - Black
Storage: SAMSUNG 970 EVO Plus SSD 2TB - M.2 NVMe

When I searched Google for my error messages, some people suggested that the CPU voltage is set too high. That seems odd, as I haven't set any custom CPU settings in my BIOS, and in fact I'm struggling to find where I could change them. Other results seem to indicate that some AMD CPUs are simply faulty and have this issue.

I have installed the `amd-ucode` package as per https://wiki.archlinux.org/title/Ryzen# … de_support but I'm struggling to find any next steps besides requesting warranty service from AMD and hopefully getting an RMA: https://www.amd.com/en/support/kb/warra … mation/pib

Am I missing something here? Is there something more I can adjust or test? Or should I just go ahead and request AMD's warranty service?


Mod Edit - Replaced oversized images with links.
CoC - Pasting pictures and code

Last edited by sfbahr (2022-10-19 23:17:14)

#2 2022-09-11 17:44:19

xerxes_
Member
Registered: 2018-04-29
Posts: 675

Re: [Solved] Hardware Errors every ~5 minutes, eventually reboots itself

#3 2022-09-11 17:48:01

Slithery
Administrator
From: Norfolk, UK
Registered: 2013-12-01
Posts: 5,776

Re: [Solved] Hardware Errors every ~5 minutes, eventually reboots itself

Please don't post images of text, post the actual text in [⁣code] [⁣/code] tags.

CoC - Pasting pictures and code


#4 2022-09-11 17:56:36

Maniaxx
Member
Registered: 2014-05-14
Posts: 738

Re: [Solved] Hardware Errors every ~5 minutes, eventually reboots itself

Maybe check for BIOS updates.


#5 2022-10-19 15:54:47

sfbahr
Member
Registered: 2022-08-31
Posts: 2

Re: [Solved] Hardware Errors every ~5 minutes, eventually reboots itself

To close the loop here: I contacted AMD for warranty service, and they issued me an RMA after I reseated everything and took pictures of my computer. They confirmed there was an issue with the CPU (or APU, as they like to call it) and sent me a replacement. I just installed the replacement 5700G and it has now been running stable for over 10 hours without any hardware errors!

So yes, the CPU was indeed the problem.

Checking `journalctl -p 3 -b` now I can still see I have errors related to sound and my iGPU, so these were unrelated:

snd_hda_intel 0000:30:00.6: no codecs found!
amdgpu 0000:30:00.0: amdgpu: Unable to locate a BIOS ROM
amdgpu 0000:30:00.0: amdgpu: Fatal error during GPU init

#6 2022-10-19 22:40:44

xerxes_
Member
Registered: 2018-04-29
Posts: 675

Re: [Solved] Hardware Errors every ~5 minutes, eventually reboots itself

So now you can edit your first post in this thread and add [Solved] to the title. For the other errors, please start a new thread.
