greetings
]]>└───╼ journalctl | grep Hardware\ Error
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: Machine check events logged
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: fe00000000010091
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: TSC 0 ADDR 36d06c580 MISC 1503cbc86
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: PROCESSOR 0:406f0 TIME 1516211147 SOCKET 0 APIC 0 microcode 14
ene 17 18:45:50 archlinux kernel: [Hardware Error]: event severity: fatal
ene 17 18:45:50 archlinux kernel: [Hardware Error]: Error 0, type: fatal
ene 17 18:45:50 archlinux kernel: [Hardware Error]: fru_text: DIMM B1
ene 17 18:45:50 archlinux kernel: [Hardware Error]: section_type: memory error
ene 17 18:45:50 archlinux kernel: [Hardware Error]: physical_address: 0x000000036d06c580
ene 17 18:45:50 archlinux kernel: [Hardware Error]: physical_address_mask: 0x00003fffffffffc0
ene 17 18:45:50 archlinux kernel: [Hardware Error]: node: 0 card: 1 module: 0 bank: 1 row: 23382 column: 552
ene 17 18:45:50 archlinux kernel: [Hardware Error]: error_type: 3, multi-bit ECC
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: Machine check events logged
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: be00000000010091
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: TSC 0 ADDR 36766ea40 MISC 1503cbc86
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: PROCESSOR 0:406f0 TIME 1516287934 SOCKET 0 APIC 0 microcode 14
ene 18 16:05:45 archlinux kernel: [Hardware Error]: event severity: fatal
ene 18 16:05:45 archlinux kernel: [Hardware Error]: Error 0, type: fatal
ene 18 16:05:45 archlinux kernel: [Hardware Error]: fru_text: DIMM B1
ene 18 16:05:45 archlinux kernel: [Hardware Error]: section_type: memory error
ene 18 16:05:45 archlinux kernel: [Hardware Error]: physical_address: 0x000000036766ea40
ene 18 16:05:45 archlinux kernel: [Hardware Error]: physical_address_mask: 0x00003fffffffffc0
ene 18 16:05:45 archlinux kernel: [Hardware Error]: node: 0 card: 1 module: 0 bank: 1 row: 23006 column: 848
ene 18 16:05:45 archlinux kernel: [Hardware Error]: error_type: 3, multi-bit ECC
I think this is the root of the problem.
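Both records above name the same module (fru_text: DIMM B1), which points at memory rather than a CPU core. As a rough sketch, the log can be tallied per module like this. The function name `tally_memory_errors` and the embedded sample are only illustrative; the real input would be the kernel log text, for example from `journalctl -k`.

```python
import re
from collections import Counter

def tally_memory_errors(log_text):
    """Count error records per fru_text label and collect their physical addresses."""
    counts = Counter()
    addresses = {}
    current_fru = None
    for line in log_text.splitlines():
        m = re.search(r"fru_text:\s*(.+?)\s*$", line)
        if m:
            # A fru_text line starts a new record for that module.
            current_fru = m.group(1)
            counts[current_fru] += 1
            addresses.setdefault(current_fru, [])
            continue
        # Matches "physical_address:" but not "physical_address_mask:".
        m = re.search(r"physical_address:\s*(0x[0-9a-fA-F]+)", line)
        if m and current_fru is not None:
            addresses[current_fru].append(int(m.group(1), 16))
    return counts, addresses

# Sample lines copied from the log above (timestamps trimmed).
sample = """\
kernel: [Hardware Error]: fru_text: DIMM B1
kernel: [Hardware Error]: physical_address: 0x000000036d06c580
kernel: [Hardware Error]: physical_address_mask: 0x00003fffffffffc0
kernel: [Hardware Error]: fru_text: DIMM B1
kernel: [Hardware Error]: physical_address: 0x000000036766ea40
"""

counts, addresses = tally_memory_errors(sample)
for fru, n in counts.items():
    print(fru, n, [hex(a) for a in addresses[fru]])
```

If every record keeps landing on one label while the addresses change, swapping that module with another slot and checking whether the label follows it would be a reasonable next test.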
]]>The new CPU has arrived, but I have the same problem.
New tests I performed:
- Disabled all C-states in the kernel, as posted in this thread: https://bbs.archlinux.org/viewtopic.php … 8#p1557698
- Compiled glibc without lock elision: https://bbs.archlinux.org/viewtopic.php … 7#p1564317
Neither of these worked; the reboots still occur, but at longer intervals.
Does anyone know another method to mitigate or fix this error?
Note: all reboots report the same MCE error, except for these fields:
ADDR
MISC
TIME
and sometimes the value after "Bank 7:".
Does anyone know what that data means?
Greetings
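On the question of what the value after "Bank 7:" means: it looks like the raw machine-check status register for that bank. The sketch below decodes it using the bit layout as I understand it from Intel's Software Developer's Manual (flags in the top bits, MCA error code in the low 16 bits), so treat the tables as an assumption to check against the manual. The names `decode_mci_status`, `STATUS_FLAGS` and `MEM_OPS` are made up for this example. In the log, ADDR matches the physical_address line of the same record, and TIME looks like a Unix timestamp in seconds.

```python
# Flag bits of the status register (assumed layout, per Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Memory controller error codes have the form 0000 0000 1MMM CCCC.
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}

def decode_mci_status(status):
    """Split a 64-bit status value into flags and error-code fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if status >> bit & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_error_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    if mca_code & 0xFF80 == 0x0080:
        result["memory_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        result["channel"] = None if channel == 0xF else channel
    return result

first = decode_mci_status(0xfe00000000010091)   # value from ene 17
second = decode_mci_status(0xbe00000000010091)  # value from ene 18
print(first)
print(second)
```

Under that layout, the two values from the log differ only in the OVER flag; both decode as an uncorrected memory read error on channel 1, which fits the "multi-bit ECC" line in the same records.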
]]>[ 14.347336] microcode: late loading on model 79 is disabled.
Are you sure you have set things up to update the microcode before anything else loads[1]?
]]>And yes, the latest microcode (20171117) is installed and loaded (it seems to have a bug and needs another update from Intel).
└───╼ dmesg | grep microcode
[ 0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0xb000020 (or later)
[ 2.230426] microcode: sig=0x406f0, pf=0x1, revision=0x14
[ 2.231892] microcode: Microcode Update Driver: v2.2.
[ 14.347336] microcode: late loading on model 79 is disabled.
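The dmesg output above reports revision=0x14 while the firmware-bug line asks for 0xb000020 or later, which suggests the running revision is older than the one the kernel wants. A small sketch of that comparison, with `microcode_status` as a made-up helper name and the sample text copied from the output above:

```python
import re

def microcode_status(dmesg_text):
    """Return (loaded, required, ok), or None if either value is missing."""
    loaded = re.search(r"revision=(0x[0-9a-fA-F]+)", dmesg_text)
    required = re.search(r"update microcode to version:\s*(0x[0-9a-fA-F]+)",
                         dmesg_text)
    if not loaded or not required:
        return None
    loaded_rev = int(loaded.group(1), 16)
    required_rev = int(required.group(1), 16)
    return loaded_rev, required_rev, loaded_rev >= required_rev

# Lines copied from the dmesg output above.
sample = """\
[ 0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0xb000020 (or later)
[ 2.230426] microcode: sig=0x406f0, pf=0x1, revision=0x14
"""

status = microcode_status(sample)
if status is not None:
    loaded_rev, required_rev, ok = status
    print(hex(loaded_rev), hex(required_rev), "ok" if ok else "older than required")
```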
It appears to be a hardware problem. Are you overclocking your system?
Have you installed and configured the microcode updates for your processor?
Asus Z10PE-D8
2x E5-2650 v4 ES
48 GB Hynix DDR4-2400
And I randomly receive this error (shown after the reboot that follows a machine crash):
http://wstaw.org/m/2017/12/16/IMG_20171216_072123.jpg
Any help isolating the problematic core?
Greetings
]]>