#1 2017-12-16 11:41:54

sl1pkn07
Member
From: Spanishtán
Registered: 2010-03-30
Posts: 371

[SOLVED] MCE error, how isolate the problem?

Hi, I have a dual-CPU setup:

Asus Z10PE-D8
2x E5-2650 v4 ES
48 GB Hynix DDR4-2400

The machine crashes at random, and after the reboot I get this error:

http://wstaw.org/m/2017/12/16/IMG_20171216_072123.jpg

Any help isolating the problematic core?

Greetings

Last edited by sl1pkn07 (2018-02-02 16:29:58)

#2 2017-12-25 16:59:38

sl1pkn07
Member
From: Spanishtán
Registered: 2010-03-30
Posts: 371

nothing?

#3 2017-12-25 17:55:27

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,739

Please don't bump.  You have been asked not to bump before, but hey, it's Christmas.  Next time, remember that "crickets" is an invitation to provide more information.

It appears to be a hardware problem.  Are you overclocking your system?
Have you installed and configured the microcode updates for your processor?


#4 2017-12-25 19:02:46

sl1pkn07
Member
From: Spanishtán
Registered: 2010-03-30
Posts: 371

No, zero overclocking (turbo is also disabled).

And yes, the latest microcode (20171117) is installed and loaded (it seems to have a bug and needs another update from Intel):

└───╼  dmesg | grep microcode
[    0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0xb000020 (or later)
[    2.230426] microcode: sig=0x406f0, pf=0x1, revision=0x14
[    2.231892] microcode: Microcode Update Driver: v2.2.
[   14.347336] microcode: late loading on model 79 is disabled.
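
For what it's worth, the two revision numbers in that output can be compared directly. Below is a minimal Python sketch; the function name `microcode_revisions` is made up for illustration, and the regular expressions only assume the message wording shown above, not a stable kernel log format.

```python
import re

# Two of the dmesg lines from the output above, copied verbatim.
DMESG = """\
[    0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0xb000020 (or later)
[    2.230426] microcode: sig=0x406f0, pf=0x1, revision=0x14
"""

def microcode_revisions(text):
    """Return (loaded, required) microcode revisions as ints; None if a line is missing."""
    loaded = re.search(r"microcode: .*revision=(0x[0-9a-fA-F]+)", text)
    required = re.search(r"update microcode to version: (0x[0-9a-fA-F]+)", text)
    return (
        int(loaded.group(1), 16) if loaded else None,
        int(required.group(1), 16) if required else None,
    )

loaded, required = microcode_revisions(DMESG)
if loaded is not None and required is not None:
    verdict = "older than" if loaded < required else "at least"
    print(f"loaded {loaded:#x} is {verdict} the requested {required:#x}")
```

With the lines above, the loaded revision (0x14) is lower than the one the kernel asks for (0xb000020), so whatever was loaded did not bring the CPU up to the revision named in the firmware-bug message.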

#5 2017-12-25 20:20:42

R00KIE
Forum Fellow
From: Between a computer and a chair
Registered: 2008-09-14
Posts: 4,734

sl1pkn07 wrote:
[   14.347336] microcode: late loading on model 79 is disabled.

Are you sure you have set things up to update the microcode before anything else loads[1]?

[1] https://wiki.archlinux.org/index.php/Microcode


#6 2017-12-25 21:28:51

sl1pkn07
Member
From: Spanishtán
Registered: 2010-03-30
Posts: 371

Yes. The microcode has been loaded through GRUB 2, as described in the wiki page you linked, for ages.

EDIT: https://patchwork.kernel.org/patch/10014361

EDIT2: https://patchwork.kernel.org/patch/10142303/

Last edited by sl1pkn07 (2018-01-04 14:53:20)

#7 2018-01-17 18:14:37

sl1pkn07
Member
From: Spanishtán
Registered: 2010-03-30
Posts: 371

OK, it seems no microcode exists for the ES processors; the explanation is posted here: https://bbs.archlinux.org/viewtopic.php … 8#p1760578

The new CPU has arrived, but I have the same problem.

New tests I performed:

- Disabled all C-states in the kernel, as described in this thread: https://bbs.archlinux.org/viewtopic.php … 8#p1557698
- Compiled glibc without lock elision: https://bbs.archlinux.org/viewtopic.php … 7#p1564317

Neither fixed it: the reboots still occur, only at longer intervals.

Does anyone know another way to mitigate or fix this error?

Note: every reboot reports the same MCE error, except for these fields:

ADDR
MISC
TIME

and sometimes the data after "Bank 7:".

Does anyone know what that data means?
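
As a rough guide to those fields: the value after "Bank 7:" is the bank's machine-check status word, ADDR is the address associated with the error, MISC carries extra model-specific detail, and TIME is a Unix timestamp in seconds. The Python sketch below decodes only the architectural parts of the status word as laid out in the Intel SDM (the flag bits 63 to 57 and the low 16-bit error code). The helper names `decode_status` and `decode_time` are made up for illustration, and the sample values are taken from the journal output posted later in this thread.

```python
from datetime import datetime, timezone

# Architectural flag bits in the upper part of the status word.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Request type (MMM) in memory-controller codes of the form 0000 0000 1MMM CCCC.
MEM_REQUEST = {0: "generic undefined request", 1: "memory read error",
               2: "memory write error", 3: "address/command error",
               4: "memory scrubbing error"}

def decode_status(hex_status):
    """Split a hex status word into flags, error codes and memory details."""
    status = int(hex_status, 16)
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16, model-specific
    }
    code = result["mca_code"]
    # Memory-controller errors: bits 15:8 clear, bit 7 set.
    if (code & 0xFF80) == 0x0080:
        result["request"] = MEM_REQUEST.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel  # 0xF = not specified
    return result

def decode_time(seconds):
    """Convert the TIME field (Unix seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

for word in ("fe00000000010091", "be00000000010091"):
    print(word, decode_status(word))
print(decode_time(1516211147).isoformat())
```

Read this way, both sample words describe an uncorrected (UC) memory read error on memory channel 1 with a valid ADDR and MISC, and they differ only in the OVER (overflow) flag. The first TIME value corresponds to 2018-01-17 17:45:47 UTC. Everything outside these bits is model-specific and is not interpreted here.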

greetings

#8 2018-01-19 19:04:14

sl1pkn07
Member
From: Spanishtán
Registered: 2010-03-30
Posts: 371

More findings

└───╼  journalctl | grep Hardware\ Error
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: Machine check events logged
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: fe00000000010091
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: TSC 0 ADDR 36d06c580 MISC 1503cbc86 
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: PROCESSOR 0:406f0 TIME 1516211147 SOCKET 0 APIC 0 microcode 14
ene 17 18:45:50 archlinux kernel: [Hardware Error]: event severity: fatal
ene 17 18:45:50 archlinux kernel: [Hardware Error]:  Error 0, type: fatal
ene 17 18:45:50 archlinux kernel: [Hardware Error]:  fru_text: DIMM B1
ene 17 18:45:50 archlinux kernel: [Hardware Error]:   section_type: memory error
ene 17 18:45:50 archlinux kernel: [Hardware Error]:   physical_address: 0x000000036d06c580
ene 17 18:45:50 archlinux kernel: [Hardware Error]:   physical_address_mask: 0x00003fffffffffc0
ene 17 18:45:50 archlinux kernel: [Hardware Error]:   node: 0 card: 1 module: 0 bank: 1 row: 23382 column: 552 
ene 17 18:45:50 archlinux kernel: [Hardware Error]:   error_type: 3, multi-bit ECC
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: Machine check events logged
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: be00000000010091
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: TSC 0 ADDR 36766ea40 MISC 1503cbc86 
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: PROCESSOR 0:406f0 TIME 1516287934 SOCKET 0 APIC 0 microcode 14
ene 18 16:05:45 archlinux kernel: [Hardware Error]: event severity: fatal
ene 18 16:05:45 archlinux kernel: [Hardware Error]:  Error 0, type: fatal
ene 18 16:05:45 archlinux kernel: [Hardware Error]:  fru_text: DIMM B1
ene 18 16:05:45 archlinux kernel: [Hardware Error]:   section_type: memory error
ene 18 16:05:45 archlinux kernel: [Hardware Error]:   physical_address: 0x000000036766ea40
ene 18 16:05:45 archlinux kernel: [Hardware Error]:   physical_address_mask: 0x00003fffffffffc0
ene 18 16:05:45 archlinux kernel: [Hardware Error]:   node: 0 card: 1 module: 0 bank: 1 row: 23006 column: 848 
ene 18 16:05:45 archlinux kernel: [Hardware Error]:   error_type: 3, multi-bit ECC

I think this is the root of the problem

http://wstaw.org/m/2018/01/19/IMG_20180 … 17_768.jpg
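
To check whether every fatal record names the same module, the journal lines can be grouped by their `fru_text` field. Below is a minimal Python sketch; `errors_by_module` is a hypothetical helper, the input is an excerpt of the lines above, and the parsing assumes only the field layout visible in this output.

```python
import re
from collections import defaultdict

# Excerpt of the journal output above (only the lines the parser looks at).
JOURNAL = """\
ene 17 18:45:50 archlinux kernel: [Hardware Error]:  fru_text: DIMM B1
ene 17 18:45:50 archlinux kernel: [Hardware Error]:   physical_address: 0x000000036d06c580
ene 17 18:45:50 archlinux kernel: [Hardware Error]:   physical_address_mask: 0x00003fffffffffc0
ene 18 16:05:45 archlinux kernel: [Hardware Error]:  fru_text: DIMM B1
ene 18 16:05:45 archlinux kernel: [Hardware Error]:   physical_address: 0x000000036766ea40
ene 18 16:05:45 archlinux kernel: [Hardware Error]:   physical_address_mask: 0x00003fffffffffc0
"""

def errors_by_module(text):
    """Map each fru_text label to the physical addresses reported under it."""
    by_module = defaultdict(list)
    current = None
    for line in text.splitlines():
        fru = re.search(r"fru_text: (.+)$", line)
        if fru:
            current = fru.group(1).strip()
            continue
        # The trailing colon keeps physical_address_mask lines from matching.
        addr = re.search(r"physical_address: (0x[0-9a-fA-F]+)", line)
        if addr and current is not None:
            by_module[current].append(int(addr.group(1), 16))
    return dict(by_module)

for module, addresses in errors_by_module(JOURNAL).items():
    print(module, [hex(a) for a in addresses])
```

For the two records above, both addresses land under DIMM B1. Different addresses on the same module point at that DIMM (or its slot or channel) rather than at a particular CPU core.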

#9 2018-01-20 13:26:53

liara
Member
Registered: 2013-04-10
Posts: 32

Don't know if this is relevant, but just in case:

https://bugs.archlinux.org/task/53112

#10 2018-02-02 16:29:38

sl1pkn07
Member
From: Spanishtán
Registered: 2010-03-30
Posts: 371

Fixed with new RAM.

greetings
