Hi, I have a dual-CPU setup:
Asus Z10PE-D8
2x E5-2650 v4 ES
48 GB Hynix DDR4-2400
I randomly get this error (shown after the reboot that follows the crash):
http://wstaw.org/m/2017/12/16/IMG_20171216_072123.jpg
Any help isolating the problematic core?
Greetings
Last edited by sl1pkn07 (2018-02-02 16:29:58)
nothing?
Please don't bump. You have been asked not to bump before, but hey, it's Christmas. Next time, remember that "crickets" is an invitation to provide more information.
It appears to be a hardware problem. Are you overclocking your system?
Have you installed and configured the microcode updates for your processor?
No, zero overclocking (turbo is also disabled).
And yes, the latest microcode (20171117) is installed and loaded (it seems to have a bug and needs another update from Intel):
└───╼ dmesg | grep microcode
[ 0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0xb000020 (or later)
[ 2.230426] microcode: sig=0x406f0, pf=0x1, revision=0x14
[ 2.231892] microcode: Microcode Update Driver: v2.2.
[ 14.347336] microcode: late loading on model 79 is disabled.
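The two revisions in that output can be compared mechanically. The following is only a rough sketch: the function name `parse_microcode_revisions` and the regular expressions are illustrative assumptions built around the exact lines shown above, and a plain numeric comparison may not be meaningful for engineering-sample (ES) parts, whose revision numbering can differ from retail CPUs.

```python
import re

def parse_microcode_revisions(dmesg_text):
    """Return (loaded, required) microcode revisions as ints; None where not found.

    Hypothetical helper: the patterns only target the message formats quoted
    in this thread and are not a general dmesg parser.
    """
    loaded = re.search(r"microcode: .*revision=(0x[0-9a-fA-F]+)", dmesg_text)
    required = re.search(r"update microcode to version: (0x[0-9a-fA-F]+)", dmesg_text)

    def to_int(match):
        return int(match.group(1), 16) if match else None

    return to_int(loaded), to_int(required)

# Lines copied from the dmesg output in this post.
sample = (
    "[ 0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; "
    "please update microcode to version: 0xb000020 (or later)\n"
    "[ 2.230426] microcode: sig=0x406f0, pf=0x1, revision=0x14\n"
)

loaded, required = parse_microcode_revisions(sample)
if loaded is not None and required is not None:
    relation = "older than" if loaded < required else "at least"
    print(f"loaded 0x{loaded:x} is {relation} the requested 0x{required:x}")
```

With the sample above, the loaded revision (0x14) is numerically far below the one the kernel asks for (0xb000020), which is what the Firmware Bug line is complaining about.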
[ 14.347336] microcode: late loading on model 79 is disabled.
Are you sure you have set things up to update the microcode before anything else loads[1]?
Yes. The microcode has been loaded through GRUB2, as described in the wiki page you posted, for ages.
EDIT: https://patchwork.kernel.org/patch/10014361
EDIT2: https://patchwork.kernel.org/patch/10142303/
Last edited by sl1pkn07 (2018-01-04 14:53:20)
OK, it seems no microcode exists for the ES processors; the explanation is posted here: https://bbs.archlinux.org/viewtopic.php … 8#p1760578
The new CPU has arrived, but I have the same problem.
New tests I performed:
- Disabled all C-states in the kernel, as described in this thread: https://bbs.archlinux.org/viewtopic.php … 8#p1557698
- Compiled glibc without lock elision: https://bbs.archlinux.org/viewtopic.php … 7#p1564317
Neither of these worked; the reboots still occur, just at longer intervals.
Does anyone know another way to mitigate or fix this error?
Note: every reboot reports the same MCE error, except for these fields:
ADDR
MISC
TIME
and sometimes the data after "bank 7:".
Does anyone know what that data means?
Greetings
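Of those fields, TIME is the easiest to interpret: as far as I know, the kernel records it as wall-clock seconds since the Unix epoch. A minimal sketch, assuming that interpretation and using the two TIME values that appear in the journal output further down this thread (the helper name `mce_time_to_utc` is made up for illustration):

```python
from datetime import datetime, timezone

def mce_time_to_utc(time_field):
    """Convert the TIME value of an MCE record (assumed epoch seconds) to UTC."""
    return datetime.fromtimestamp(int(time_field), tz=timezone.utc)

# TIME values taken from the journal lines posted later in the thread.
for raw in ("1516211147", "1516287934"):
    print(raw, "->", mce_time_to_utc(raw).isoformat())
# 1516211147 -> 2018-01-17T17:45:47+00:00
# 1516287934 -> 2018-01-18T15:05:34+00:00
```

The results are in UTC. The journal timestamps are local time and are taken when the record is logged on the next boot, so they come a few seconds later. ADDR is the physical address the error was reported for, so it naturally changes from crash to crash.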
More findings
└───╼ journalctl | grep Hardware\ Error
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: Machine check events logged
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: fe00000000010091
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: TSC 0 ADDR 36d06c580 MISC 1503cbc86
ene 17 18:45:50 archlinux kernel: mce: [Hardware Error]: PROCESSOR 0:406f0 TIME 1516211147 SOCKET 0 APIC 0 microcode 14
ene 17 18:45:50 archlinux kernel: [Hardware Error]: event severity: fatal
ene 17 18:45:50 archlinux kernel: [Hardware Error]: Error 0, type: fatal
ene 17 18:45:50 archlinux kernel: [Hardware Error]: fru_text: DIMM B1
ene 17 18:45:50 archlinux kernel: [Hardware Error]: section_type: memory error
ene 17 18:45:50 archlinux kernel: [Hardware Error]: physical_address: 0x000000036d06c580
ene 17 18:45:50 archlinux kernel: [Hardware Error]: physical_address_mask: 0x00003fffffffffc0
ene 17 18:45:50 archlinux kernel: [Hardware Error]: node: 0 card: 1 module: 0 bank: 1 row: 23382 column: 552
ene 17 18:45:50 archlinux kernel: [Hardware Error]: error_type: 3, multi-bit ECC
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: Machine check events logged
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: be00000000010091
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: TSC 0 ADDR 36766ea40 MISC 1503cbc86
ene 18 16:05:45 archlinux kernel: mce: [Hardware Error]: PROCESSOR 0:406f0 TIME 1516287934 SOCKET 0 APIC 0 microcode 14
ene 18 16:05:45 archlinux kernel: [Hardware Error]: event severity: fatal
ene 18 16:05:45 archlinux kernel: [Hardware Error]: Error 0, type: fatal
ene 18 16:05:45 archlinux kernel: [Hardware Error]: fru_text: DIMM B1
ene 18 16:05:45 archlinux kernel: [Hardware Error]: section_type: memory error
ene 18 16:05:45 archlinux kernel: [Hardware Error]: physical_address: 0x000000036766ea40
ene 18 16:05:45 archlinux kernel: [Hardware Error]: physical_address_mask: 0x00003fffffffffc0
ene 18 16:05:45 archlinux kernel: [Hardware Error]: node: 0 card: 1 module: 0 bank: 1 row: 23006 column: 848
ene 18 16:05:45 archlinux kernel: [Hardware Error]: error_type: 3, multi-bit ECC
I think this is the root of the problem
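For anyone wondering what the value after "Bank 7:" encodes, here is a hedged sketch of decoding it by hand. The bit positions follow my reading of the IA32_MCi_STATUS layout in the Intel SDM (machine-check architecture chapter); treat them, and the helper name `decode_mci_status`, as assumptions to verify against the manual rather than as an authoritative decoder. Tools such as mcelog or rasdaemon do this properly.

```python
# Flag bits of IA32_MCi_STATUS (assumed from the Intel SDM, verify before relying on it).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN", 59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM field of the compound memory-controller error code 0000 0000 1MMM CCCC.
MEMORY_OPS = {
    0: "generic/undefined",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}

def decode_mci_status(status):
    """Split a 64-bit machine-check status value into flags and error codes."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca_code = status & 0xFFFF                      # bits 15:0
    info = {
        "flags": flags,
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }
    # Only interpret MMM/CCCC when the code has the memory-controller shape.
    if (mca_code & 0xFF80) == 0x0080:
        info["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF = not specified
    return info

# The two status values from the journal output above.
for raw in (0xFE00000000010091, 0xBE00000000010091):
    print(f"{raw:#018x}", decode_mci_status(raw))
```

Under those assumptions, both values decode to an uncorrected (UC) memory read error with processor context corrupt (PCC) set, which fits the "fatal" and "multi-bit ECC" lines in the same log. The only difference between them is the OVER bit, which would indicate that an earlier error in that bank was overwritten before it could be read.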
Don't know if this is relevant, but just in case:
Fixed with new RAM.
greetings