mce: hardware error CPU

cyayon · 2018-01-22 12:51:09

Hi all,

For a few weeks I have been getting random crashes with no log output and a black screen.

I ran memtest86 (only 3 passes) and it reported no errors...

I just installed the "stress" tool (pacman -S stress) and ran "stress --cpu 8". After about 10 seconds I got these error messages on the console:

mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: 9000004000010005
mce: [Hardware Error]: TSC 139be719174
mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1516624544 SOCKET 0 APIC 4 microcode 1c
mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: 9000004000010005
mce: [Hardware Error]: TSC 13b4643c2a7
mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1516624546 SOCKET 0 APIC 6 microcode 1c
mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: 9000004000010005
mce: [Hardware Error]: TSC 13ce936121b
mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1516624548 SOCKET 0 APIC 6 microcode 1c
mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: 9000004000010005
mce: [Hardware Error]: TSC 13d452276d6
mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1516624549 SOCKET 0 APIC 4 microcode 1c

and then the system crashes!

Does my CPU need to be replaced? Or the motherboard?
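The hexadecimal value after "Bank 0:" is the raw machine-check status register for that bank. Here is a minimal sketch of decoding it, assuming the IA32_MCi_STATUS bit layout described in Intel's Software Developer's Manual. The function name, the dictionary keys, and the short code table are illustrative, and the corrected-error count field is only meaningful on CPUs that implement it.

```python
# Decode an IA32_MCi_STATUS value as printed by the kernel's mce driver.
# Bit positions are assumed from the Intel SDM machine-check chapter.

# A few "simple" MCA error codes (bits 15:0); not an exhaustive table.
SIMPLE_MCA_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status value into its main fields."""
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),            # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),         # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: hardware could not correct it
        "enabled": bool((status >> 60) & 1),          # EN: reporting was enabled
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38, if implemented
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "mca_meaning": SIMPLE_MCA_CODES.get(mca_code, "compound or unknown code"),
    }


if __name__ == "__main__":
    for key, value in decode_mci_status(0x9000004000010005).items():
        print(f"{key}: {value}")
```

Under these assumptions, the value from the log decodes as a valid, corrected (UC bit clear) error with MCA code 0x0005, which the table above lists as an internal parity error. That by itself does not say whether the CPU, the board, or its settings are at fault.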

Thanks in advance.

seth · 2018-01-22 13:15:57

temperature issue?

drcouzelis · 2018-01-22 13:35:40

What CPU are you using?

If it is a Ryzen CPU (like mine) I might have some bad news for you...

cyayon · 2018-01-22 13:53:34

Hi,

I don't think it's a thermal issue. In normal operation I poll the temperature every 15 s and have not seen anything wrong.

It's an "old" Intel Core i5-3570K on an Asus P8Z77-M PRO motherboard.

I had no problems since 2012 (when I bought it), but for the past few weeks I have been getting random crashes after some I/O-intensive batch jobs. A few kernel updates made no difference.

I ran memtest86 (only 3 passes, no errors), and today, after a crash, I tried the "stress" command from the stress package.

About 5-10 seconds into "stress --cpu 8", the system crashes with MCE hardware errors.

Thanks.

seth · 2018-01-22 13:58:59

Don't guess when you can measure.
Your poll interval is longer than the stress run, and the crash happens exactly when you heat up all cores at once.
It's an "old" board so this can easily be a D.U.S.T. error.
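To catch a temperature spike that builds up within a few seconds, the sensors need to be polled much faster than every 15 s. Here is a minimal sketch, assuming the kernel exposes the sensors as temp*_input files (in millidegrees Celsius) under /sys/class/hwmon. The function names and the half-second interval are illustrative choices, not a recommendation from the thread.

```python
import glob
import os
import time


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {sensor_path: degrees_C} for every temp*_input file found."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # The hwmon interface reports millidegrees Celsius.
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable right now; skip it
    return temps


def poll(interval=0.5, duration=30.0, hwmon_root="/sys/class/hwmon"):
    """Print the hottest reading every `interval` seconds for `duration` seconds."""
    end = time.monotonic() + duration
    while time.monotonic() < end:
        temps = read_temps(hwmon_root)
        if temps:
            stamp = time.strftime("%H:%M:%S")
            # flush so the last line reaches the terminal before a hard crash
            print(f"{stamp} max {max(temps.values()):.1f} C", flush=True)
        time.sleep(interval)
```

Calling poll() in one terminal (ideally over SSH from another machine, so the last readings survive the crash) while "stress --cpu 8" runs in another would show how fast the temperature climbs.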

cyayon · 2018-01-22 14:02:42

Thanks seth.

But when the crash happens during my batch job, it's after about 15-30 minutes; I check the temperature every 15 s and it stays stable.
Of course, when I ran stress I couldn't monitor the temperature; the run was too short.

What is a D.U.S.T. error, please?

seth · 2018-01-22 14:07:42

http://computimesinc.com/wp-content/upl … r-dust.jpg

Does your batch leave the same errors behind?
Just to be sure, you've read https://wiki.archlinux.org/index.php/Mi … de_updates ?

cyayon · 2018-01-22 14:16:21

Ah... OK, no, it's not a dust problem :-) It's clean.

I can't see any error while the batch is running, because there is nothing on the screen and the system doesn't have time to write anything to syslog...
Sometimes the same batch runs without a problem, sometimes it crashes the system.
But I assume the crash under "stress" is abnormal...

Yes, I have installed and enabled the latest available intel-ucode (20180108). It was applied at boot:

kernel: microcode: microcode updated early to revision 0x1c, date = 2015-02-26
kernel: microcode: sig=0x306a9, pf=0x2, revision=0x1c
kernel: microcode: Microcode Update Driver: v2.2.
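The sig=0x306a9 value is the CPUID signature, and the same number appears as "PROCESSOR 0:306a9" in the MCE lines. Here is a minimal sketch of splitting it into family, model, and stepping, assuming Intel's documented CPUID leaf 1 layout. The function name is illustrative.

```python
def decode_cpu_signature(sig: int) -> dict:
    """Split an Intel CPUID (leaf 1, EAX) signature into family/model/stepping."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only prepended for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    print(decode_cpu_signature(0x306A9))
```

For 0x306a9 this gives family 6, model 0x3a (58), stepping 9, which matches an Ivy Bridge part such as the i5-3570K. The "microcode 1c" in the MCE lines also agrees with the revision 0x1c reported at boot.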

Of course, no overclocking or anything like that.

drcouzelis · 2018-01-22 15:11:37

cyayon wrote:

It's an "old" Intel Core i5-3570K on an Asus P8Z77-M PRO motherboard.

Ok. There were some known issues with early batches of Ryzen processors, but of course that's not an issue here.

From what I understand (and I might be waaaaay wrong), an MCE error is pretty much just the computer saying "ACK! SOMETHING WENT WRONG!". It could be the CPU, the motherboard, the RAM, or even the PSU...

My advice would be to take things slow, don't start blowing money replacing parts if you don't know if they're actually broken, test your computer with other known working parts as much as you can, and lastly, take advice from someone much smarter than me.

Good luck!

cyayon · 2018-01-22 15:23:07

Ok thanks.

But could you please confirm that running "stress --cpu 8" should not generate these MCE logs and crash the system?

Thanks

drcouzelis · 2018-01-22 19:22:46

Sure! I ran the command "stress --cpu 12" for a little over two hours just now. This was the only output to my terminal:

stress: info: [2544] dispatching hogs: 12 cpu, 0 io, 0 vm, 0 hdd

In the output of "journalctl" (as root), the last MCE error I have was from August 2016.
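Checking the journal for past MCE events can be scripted. Here is a minimal sketch that counts "Machine Check" lines per CPU, bank, and status value, assuming the kernel messages have the same shape as the ones quoted earlier in this thread. The regular expression and the function name are illustrative.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: 9000004000010005
MCE_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ "
    r"Bank (\d+): ([0-9a-fA-F]+)"
)


def summarize_mce(lines):
    """Count machine-check events per (cpu, bank, status) tuple."""
    counts = Counter()
    for line in lines:
        match = MCE_RE.search(line)
        if match:
            cpu = int(match.group(1))
            bank = int(match.group(2))
            status = match.group(3).lower()
            counts[(cpu, bank, status)] += 1
    return counts


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for (cpu, bank, status), n in sorted(summarize_mce(sys.stdin).items()):
        print(f"CPU {cpu} bank {bank} status {status}: {n} event(s)")
```

One way to feed it would be to pipe the output of "journalctl -k --no-pager" (run as root) into the script. An empty result means no matching MCE lines were logged.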

seth · 2018-01-22 19:39:33

In 88% of all cases this is a thermal issue, and afaiu we did not rule that out with your (unexplained) batch halt.
Please watch the temps closely, and whether the cores boost. If the cooler is loosely attached, the temperature under full load can rise extremely fast.
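Whether the cores boost can be watched alongside the temperatures. Here is a minimal sketch, assuming the cpufreq driver exposes scaling_cur_freq (in kHz) under /sys/devices/system/cpu. The function name is illustrative, and the reported value is the kernel's view of the current frequency, which may lag behind very short bursts.

```python
import glob
import os


def read_core_freqs_mhz(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: current_frequency_MHz} for each CPU with cpufreq data."""
    freqs = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # e.g. "cpu0"
        try:
            with open(path) as f:
                freqs[cpu] = int(f.read().strip()) / 1000.0  # kHz -> MHz
        except (OSError, ValueError):
            continue  # CPU offline or value unreadable; skip it
    return freqs


if __name__ == "__main__":
    for cpu, mhz in read_core_freqs_mhz().items():
        print(f"{cpu}: {mhz:.0f} MHz")
```

Readings above the CPU's base clock under load would indicate that turbo is active.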

cyayon · 2018-01-23 06:24:34

Hi,

After simply clearing the CMOS, everything is fine now!
No more crashes.

I think it was a bad default overclock from Asus in turbo mode...

Thanks :-)

Arch Linux
