#1 2018-03-17 11:38:48

Lockheed
Member
Registered: 2010-03-16
Posts: 1,550

Investigating MCE errors

I have some MCE errors I'd like to investigate:

[    0.018122] mce: [Hardware Error]: Machine check events logged
[    0.018130] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: f600000000070f0f
[    0.018258] mce: [Hardware Error]: TSC 0 ADDR fea10190
[    0.018343] mce: [Hardware Error]: PROCESSOR 2:700f01 TIME 1521186153 SOCKET 0 APIC 0 microcode 7000106

I read that the way to find out what it means is to run:

/usr/sbin/mcelog --k8 --ascii < myerror

However, mcelog no longer works on Arch and has been replaced by rasdaemon.
How can I replicate this particular mcelog functionality with the new tool?

Offline

#2 2018-03-17 17:19:08

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

Re: Investigating MCE errors

I can't find a relevant option in either of the two executables that come with the rasdaemon package, and I can't find anything in the two man pages either.

In the past, with 'mcelog', you would enable the service that came with the package. It would then add the explanation message to the journal whenever a machine check event happened. Perhaps 'rasdaemon' works the same way: you enable the service that comes with the package and then wait until the next event happens?

This seems annoying, but better than nothing, I guess.

Offline

#3 2018-03-17 19:18:23

Lockheed
Member
Registered: 2010-03-16
Posts: 1,550

Re: Investigating MCE errors

I have it enabled but I haven't seen any relevant message.
Any idea where I should look for it?

Offline

#4 2018-03-17 20:12:22

loqs
Member
Registered: 2014-03-06
Posts: 18,960

Re: Investigating MCE errors

What if you try building mcelog and see if it can decode anything from the error? It cannot use /dev/mcelog with the Arch kernels, but can it still decode extracted errors?

git clone git://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git
cd mcelog
make
./mcelog --k8 --ascii < myerror

Offline

#5 2018-03-17 20:19:20

Lockheed
Member
Registered: 2010-03-16
Posts: 1,550

Re: Investigating MCE errors

It does not compile correctly:

cc -c -g -Os  -Wall -Wextra -Wno-missing-field-initializers -Wno-unused-parameter -Wstrict-prototypes -Wformat-security -Wmissing-declarations -Wdeclaration-after-statement  -o denverton.o denverton.c
cc -c -g -Os  -Wall -Wextra -Wno-missing-field-initializers -Wno-unused-parameter -Wstrict-prototypes -Wformat-security -Wmissing-declarations -Wdeclaration-after-statement  -o msr.o msr.c
cc -c -g -Os  -Wall -Wextra -Wno-missing-field-initializers -Wno-unused-parameter -Wstrict-prototypes -Wformat-security -Wmissing-declarations -Wdeclaration-after-statement  -o bus.o bus.c
cc -c -g -Os  -Wall -Wextra -Wno-missing-field-initializers -Wno-unused-parameter -Wstrict-prototypes -Wformat-security -Wmissing-declarations -Wdeclaration-after-statement  -o unknown.o unknown.c
( printf "char version[] = \"" ; 			\
if test -e .os_version; then				\
	cat .os_version	| tr -d '\n' ;			\
elif command -v git >/dev/null; then 			\
	if [ -d .git ] ; then 				\
		git describe --tags HEAD | tr -d '\n'; 	\
	else 						\
		printf "unknown" ; 			\
	fi ;						\
else							\
	printf "unknown" ;				\
fi ;							\
printf '";\n'						\
) > version.tmp
cmp version.tmp version.c || mv version.tmp version.c
cmp: version.c: No such file or directory
cc -c -g -Os  -Wall -Wextra -Wno-missing-field-initializers -Wno-unused-parameter -Wstrict-prototypes -Wformat-security -Wmissing-declarations -Wdeclaration-after-statement  -o version.o version.c
cc   mcelog.o p4.o k8.o dmi.o tsc.o core2.o bitfield.o intel.o nehalem.o dunnington.o tulsa.o config.o memutil.o msg.o eventloop.o leaky-bucket.o memdb.o server.o trigger.o client.o cache.o sysfs.o yellow.o page.o rbtree.o sandy-bridge.o ivy-bridge.o haswell.o broadwell_de.o broadwell_epex.o skylake_xeon.o denverton.o msr.o bus.o unknown.o version.o   -o mcelog

Offline

#6 2018-03-17 20:33:14

loqs
Member
Registered: 2014-03-06
Posts: 18,960

Re: Investigating MCE errors

cc   mcelog.o p4.o k8.o dmi.o tsc.o core2.o bitfield.o intel.o nehalem.o dunnington.o tulsa.o config.o memutil.o msg.o eventloop.o leaky-bucket.o memdb.o server.o trigger.o client.o cache.o sysfs.o yellow.o page.o rbtree.o sandy-bridge.o ivy-bridge.o haswell.o broadwell_de.o broadwell_epex.o skylake_xeon.o denverton.o msr.o bus.o unknown.o version.o   -o mcelog

Unless there was an error after that last line, that matches my build here, which produced mcelog in the local directory.

Offline

#7 2018-03-17 20:41:25

Lockheed
Member
Registered: 2010-03-16
Posts: 1,550

Re: Investigating MCE errors

Right. What should I put under "myerror"?
I am trying the numerical value, but no luck:

# ./mcelog --k8 --ascii < f600000000070f0f
-bash: f600000000070f0f: No such file or directory

Offline

#8 2018-03-17 20:48:26

loqs
Member
Registered: 2014-03-06
Posts: 18,960

Re: Investigating MCE errors

cat myerror 
[    0.018122] mce: [Hardware Error]: Machine check events logged
[    0.018130] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: f600000000070f0f
[    0.018258] mce: [Hardware Error]: TSC 0 ADDR fea10190
[    0.018343] mce: [Hardware Error]: PROCESSOR 2:700f01 TIME 1521186153 SOCKET 0 APIC 0 microcode 7000106
./mcelog --k8 --ascii < myerror
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
Machine check events logged
mcelog: Unknown CPU type vendor 2 family 22 model 0
Hardware event. This is not a software error.
CPU 0 0 data cache 
TIME 1521186153 Fri Mar 16 07:42:33 2018
STATUS 0 MCGSTATUS 0
CPUID Vendor AMD Family 22 Model 0
(Fields were incomplete)
SOCKET 0 APIC 0 microcode 7000106

Unfortunately, that is not much help on my system.
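
For reference, a file like myerror can be produced by filtering the kernel log for the machine-check lines. This is only a sketch: the function name extract_mce is made up for illustration, and reading the kernel log with dmesg may require root depending on the system's settings.

```shell
#!/usr/bin/env bash
# Keep only the kernel's machine-check lines from whatever log text arrives on stdin.
# extract_mce is a hypothetical helper name, not part of mcelog or rasdaemon.
extract_mce() {
    # -F treats the pattern as a fixed string, so the brackets need no escaping.
    grep -F 'mce: [Hardware Error]'
}

# Typical use (assumption: dmesg is readable, possibly only as root):
#   dmesg | extract_mce > myerror
#   ./mcelog --k8 --ascii < myerror
```

The filter reads stdin, so it works equally well on a saved copy of an old kernel log.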

Offline

#9 2018-03-17 20:56:13

Lockheed
Member
Registered: 2010-03-16
Posts: 1,550

Re: Investigating MCE errors

Ah, got it. Thank you for your guidance.
I ran it as root and got some results.

# ./mcelog --k8 --ascii < myerror
Machine check events logged
mcelog: Unknown CPU type vendor 2 family 22 model 0
Hardware event. This is not a software error.
CPU 0 0 data cache 
TIME 1521186153 Fri Mar 16 08:42:33 2018
STATUS 0 MCGSTATUS 0
CPUID Vendor AMD Family 22 Model 0
(Fields were incomplete)
SOCKET 0 APIC 0 microcode 7000106

I am not sure how reliable its indication of a hardware error is, or whether more information can be extracted from it. My processor is an AMD Kabini, and I get the same output whether I use --k8 or --generic.
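
Since mcelog reports STATUS 0 here, one way to get a little more out of the raw value is to split the status word from the "Bank 4" line by hand. The sketch below assumes the standard x86 machine-check status layout (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57, and the MCA error code in the low 16 bits). The function name decode_mci_status is made up for illustration. What the model-specific code in bits 31:16 means depends on the CPU family and has to be looked up in AMD's documentation, so this script does not attempt to name the error.

```shell
#!/usr/bin/env bash
# Split a 64-bit machine-check status value (hex, no 0x prefix) into flag bits and code fields.
# decode_mci_status is a hypothetical helper name.
decode_mci_status() {
    # bash arithmetic is 64-bit signed; a set top bit makes the value negative,
    # but shifting and masking with "& 1" still yields the individual bits.
    local status=$(( 16#$1 ))
    local names=(VAL OVER UC EN MISCV ADDRV PCC)
    local bit=63 name
    for name in "${names[@]}"; do
        printf '%-5s (bit %d) = %d\n' "$name" "$bit" $(( (status >> bit) & 1 ))
        bit=$(( bit - 1 ))
    done
    printf 'model-specific code (bits 31:16) = 0x%04x\n' $(( (status >> 16) & 0xffff ))
    printf 'MCA error code (bits 15:0) = 0x%04x\n' $(( status & 0xffff ))
}

decode_mci_status f600000000070f0f
```

For this value the flags show a valid, uncorrected error with a recorded address (ADDRV set, matching the ADDR line in the log), which at least confirms that the raw status carries more than the "STATUS 0" that mcelog printed.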

Offline
