
#1 2017-05-15 22:35:12

bpont
Banned
Registered: 2010-03-24
Posts: 161

Machine Check Exceptions and other Boot Errors

I need some help deciphering boot errors and troubleshooting some recent sudden system failures/reboots.

Some system info and error messages:

Lenovo ThinkCentre M58p Desktop
CPU: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz

$ dmesg | grep DMI:
[    0.000000] DMI: LENOVO 7220RY8/LENOVO, BIOS 5CKT77AUS 05/07/2012
$ sudo dmidecode --type memory
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.5 present.

Handle 0x001E, DMI type 16, 15 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: None
	Maximum Capacity: 8 GB
	Error Information Handle: Not Provided
	Number Of Devices: 4

Handle 0x001F, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x001E
	Error Information Handle: 0xFF00
	Total Width: 40960 bits
	Data Width: 40960 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: 1
	Locator: J6G1
	Bank Locator: DIMM 0
	Type: DDR2
	Type Detail: Synchronous
	Speed: 1067 MHz
	Manufacturer: Unknown                                         
	Serial Number: 00000000
	Asset Tag: 00000000
	Part Number: 000000000000000000000000000000000000

Handle 0x0020, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x001E
	Error Information Handle: 0xFF00
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: DIMM
	Set: 1
	Locator: J6G2
	Bank Locator: DIMM 1
	Type: DDR2
	Type Detail: Synchronous
	Speed: 1067 MHz
	Manufacturer: 48spaces                                        
	Serial Number: 01234567
	Asset Tag: 01234567
	Part Number: 012345678901234567890123456789012345

Handle 0x0021, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x001E
	Error Information Handle: 0xFF00
	Total Width: 41984 bits
	Data Width: 41984 bits
	Size: 2048 MB
	Form Factor: DIMM
	Set: 1
	Locator: J6H1
	Bank Locator: DIMM 2
	Type: DDR2
	Type Detail: Synchronous
	Speed: 1067 MHz
	Manufacturer: Unknown                                         
	Serial Number: 00000000
	Asset Tag: 00000000
	Part Number: 000000000000000000000000000000000000

Handle 0x0022, DMI type 17, 27 bytes
Memory Device
	Array Handle: 0x001E
	Error Information Handle: 0xFF00
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: DIMM
	Set: 1
	Locator: J6H2
	Bank Locator: DIMM 3
	Type: DDR2
	Type Detail: Synchronous
	Speed: 1067 MHz
	Manufacturer: 48spaces                                        
	Serial Number: 01234567
	Asset Tag: 01234567
	Part Number: 012345678901234567890123456789012345
$ journalctl -xb
May 15 14:33:11 host kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: 942000160004010a
May 15 14:33:11 host kernel: mce: [Hardware Error]: TSC 0 ADDR 10cfa9640 
May 15 14:33:11 host kernel: mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1494876777 SOCKET 0 APIC 0 microcode a0b
May 15 14:33:11 host kernel: ACPI Error: [CAPB] Namespace lookup failure, AE_ALREADY_EXISTS (20160930/dsfield-211)
May 15 14:33:11 host kernel: ACPI Error: Method parse/execution failed [\_SB.PCI0._OSC] (Node ffff88010989eac8), AE_ALREADY_EXISTS (20160930/psparse-543)
May 15 14:33:11 host kernel: platform INT0800:00: failed to claim resource 0
May 15 14:33:11 host kernel: acpi INT0800:00: platform device creation failed: -16
May 15 14:33:16 host kernel: tpm tpm0: A TPM error (6) occurred attempting to read a pcr value

The mce hardware errors show up during the onscreen boot messages of a forced (unwanted) reboot which was caused by some unknown event. If I manually reboot after that, the hardware error messages are gone, but the remaining error messages listed above always occur.
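To isolate just the machine-check lines from the kernel log, a small filter along these lines may help. This is only a sketch: `mce_lines` is a made-up helper name, and it assumes the `mce: [Hardware Error]` prefix shown in the journal excerpt above.

```shell
#!/bin/sh
# Keep only the machine-check lines from kernel log text read on stdin.
# Assumption: lines carry the "mce: [Hardware Error]" prefix seen above.
# mce_lines is a hypothetical helper name, not an existing tool.
mce_lines() {
    # -F treats the pattern as a fixed string, so the brackets are literal.
    # "|| true" keeps "no match" from being reported as a failure.
    grep -F 'mce: [Hardware Error]' || true
}

# Possible usage (kernel messages of the current boot):
#   journalctl -k -b --no-pager | mce_lines
# For the boot before the forced reboot, "-b -1" can be used instead,
# which only works if the journal is stored persistently.
```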

$ sudo mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 3 
ADDR 10cfa9640 
TIME 1494876777 Mon May 15 14:32:57 2017
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000160004010a MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 23
Hardware event. This is not a software error.
MCE 1
CPU 1 BANK 3 
ADDR 9aba8a40 
TIME 1494877091 Mon May 15 14:38:11 2017
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000160004010a MCGSTATUS 0
MCGCAP 806 APICID 1 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 23
Hardware event. This is not a software error.
MCE 2
CPU 0 BANK 3 
ADDR 9aba8a40 
TIME 1494877091 Mon May 15 14:38:11 2017
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000160004010a MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 23
Hardware event. This is not a software error.
MCE 3
CPU 0 BANK 3 
ADDR c7c68a40 
TIME 1494877255 Mon May 15 14:40:55 2017
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000160004010a MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 23
Hardware event. This is not a software error.
MCE 4
CPU 1 BANK 3 
ADDR c7c68a40 
TIME 1494877255 Mon May 15 14:40:55 2017
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000160004010a MCGSTATUS 0
MCGCAP 806 APICID 1 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 23

I really don't know what any of these error messages mean or how to troubleshoot or fix them. On the off chance it's not a kernel bug, I'm trying to isolate any programs I'm running that could be causing the sudden shutdowns/reboots. Any helpful advice would be appreciated.

Offline

#2 2017-05-15 22:55:27

loqs
Member
Registered: 2014-03-06
Posts: 17,333

Re: Machine Check Exceptions and other Boot Errors

Which kernel version is the affected system using?  Which kernel version did the issues first occur with?  Are there other kernels such as linux-lts that are not affected by the issues?
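As a side note, the STATUS value in the mcelog output from the first post can be broken down by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (VAL = bit 63, OVER = 62, UC = 61, EN = 60, ADDRV = 58, PCC = 57, MCA error code in bits 15:0). `decode_mci_status` is a made-up helper name, not part of mcelog, and it only covers a handful of fields.

```shell
#!/bin/sh
# Decode a few architectural fields of an Intel IA32_MCi_STATUS value.
# Assumption: bit positions as documented in the Intel SDM.
# decode_mci_status is a hypothetical helper, not part of mcelog.
decode_mci_status() (
    # Left-pad to 16 hex digits, then split into two 32-bit halves so the
    # arithmetic never depends on how the shell handles bit 63.
    hex=$(printf '%16s' "${1#0x}" | tr ' ' '0')
    hi_hex=$(printf '%s' "$hex" | cut -c1-8)
    lo_hex=$(printf '%s' "$hex" | cut -c9-16)
    hi=$(( 0x$hi_hex ))
    lo=$(( 0x$lo_hex ))

    echo "VAL=$(( (hi >> 31) & 1 ))"     # bit 63: register contents valid
    echo "OVER=$(( (hi >> 30) & 1 ))"    # bit 62: an earlier error was overwritten
    echo "UC=$(( (hi >> 29) & 1 ))"      # bit 61: uncorrected error
    echo "EN=$(( (hi >> 28) & 1 ))"      # bit 60: error reporting enabled
    echo "ADDRV=$(( (hi >> 26) & 1 ))"   # bit 58: MCi_ADDR holds a valid address
    echo "PCC=$(( (hi >> 25) & 1 ))"     # bit 57: processor context corrupt

    code=$(( lo & 0xFFFF ))              # bits 15:0: MCA error code
    printf 'MCA_CODE=0x%04x\n' "$code"

    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL,
    # where LL (bits 1:0) is the cache level.
    if [ $(( code & 0xEF00 )) -eq $(( 0x0100 )) ]; then
        case $(( code & 3 )) in
            0) level=L0 ;;
            1) level=L1 ;;
            2) level=L2 ;;
            *) level=generic ;;
        esac
        echo "CACHE_LEVEL=$level"
    fi
)

decode_mci_status 942000160004010a
```

For the value posted above this reports UC=0 and PCC=0 with a cache-level code of L2, which agrees with mcelog's "Corrected error" and "Generic CACHE Level-2 Generic Error" lines.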

Offline

#3 2017-05-15 23:13:13

bpont
Banned
Registered: 2010-03-24
Posts: 161

Re: Machine Check Exceptions and other Boot Errors

loqs wrote:

Which kernel version is the affected system using?  Which kernel version did the issues first occur with?  Are there other kernels such as linux-lts that are not affected by the issues?

I'm using kernel 4.10.13-1 (base) from the core repo.  I haven't closely tracked my kernel versioning against this problem, but the problem recently surfaced within the past week. I don't have linux-lts installed, so I can't say whether or not it would be affected.

Offline

#4 2017-05-15 23:30:22

loqs
Member
Registered: 2014-03-06
Posts: 17,333

Re: Machine Check Exceptions and other Boot Errors

bpont wrote:

the problem recently surfaced within the past week.

So what kernel updates, if any, does pacman.log show for the past week?
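One way to pull those entries out of the log is sketched below. It assumes the default log location `/var/log/pacman.log` and the `[ALPM] upgraded <package> (old -> new)` line format; `kernel_upgrades` is a made-up helper name.

```shell
#!/bin/sh
# List kernel package upgrades recorded by pacman.
# Assumptions: default log path /var/log/pacman.log and the
# "[ALPM] upgraded <pkg> (old -> new)" line format.
# kernel_upgrades is a hypothetical helper name.
kernel_upgrades() {
    log=${1:-/var/log/pacman.log}
    # Match the kernel packages themselves; the " (" after the name keeps
    # packages such as linux-firmware or linux-api-headers out of the result.
    grep -E '\[ALPM\] upgraded linux(-lts|-zen|-hardened)? \(' "$log"
}

# Possible usage, showing the five most recent kernel upgrades:
#   kernel_upgrades | tail -n 5
```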

bpont wrote:

I don't have linux-lts installed, so I can't say whether or not it would be affected.

You can have multiple kernels installed concurrently. Could you install the linux-lts kernel to test the 4.9 series, and either linux-hardened or the linux-zen package from testing to test 4.11, if you do not want to build a 4.11 kernel yourself?

Offline

#5 2017-05-15 23:59:20

bpont
Banned
Registered: 2010-03-24
Posts: 161

Re: Machine Check Exceptions and other Boot Errors

loqs wrote:
bpont wrote:

the problem recently surfaced within the past week.

So what kernel updates, if any, does pacman.log show for the past week?

bpont wrote:

I don't have linux-lts installed, so I can't say whether or not it would be affected.

You can have multiple kernels installed concurrently. Could you install the linux-lts kernel to test the 4.9 series, and either linux-hardened or the linux-zen package from testing to test 4.11, if you do not want to build a 4.11 kernel yourself?

These are my most recent updates:

[2017-04-20 17:33] [ALPM] upgraded linux (4.10.8-1 -> 4.10.10-1)
[2017-04-26 19:21] [ALPM] upgraded linux (4.10.10-1 -> 4.10.11-1)
[2017-05-01 15:50] [ALPM] upgraded linux (4.10.11-1 -> 4.10.13-1)

I suppose it's related to the latest kernel. I'll probably just install the linux-lts kernel as a fallback, because I really can't invest much time installing and testing other kernels or rolling my own. I have an older system, so maybe an older kernel would be better anyway. I just hope there are no conflicts from running newer packages with an older kernel. I'm also assuming linux-lts uses the standard /etc/mkinitcpio.conf and that I'll only need to adjust my grub config.
This is the first time I've ever had kernel issues like this, and I've been on Arch for a long time, so I'm inexperienced with troubleshooting the error messages I posted.
Thanks for helping.

Offline

#6 2017-05-16 00:09:36

loqs
Member
Registered: 2014-03-06
Posts: 17,333

Re: Machine Check Exceptions and other Boot Errors

Yes, linux-lts uses the same mkinitcpio.conf, so you just need to update grub.cfg.
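After installing the package and regenerating the config, a quick check like the one below can confirm that every installed kernel image has a menu entry. It is a sketch under a few assumptions: images are named `vmlinuz-*` in `/boot` (for example `vmlinuz-linux-lts`), the config is at `/boot/grub/grub.cfg`, and `kernels_missing_from_grub` is a made-up helper name.

```shell
#!/bin/sh
# Report kernel images in a boot directory that a GRUB config does not mention.
# Typical steps beforehand (as root):
#   pacman -S linux-lts
#   grub-mkconfig -o /boot/grub/grub.cfg
# Assumptions: images are named vmlinuz-*, config at /boot/grub/grub.cfg.
# kernels_missing_from_grub is a hypothetical helper name.
kernels_missing_from_grub() (
    boot_dir=${1:-/boot}
    cfg=${2:-/boot/grub/grub.cfg}
    for img in "$boot_dir"/vmlinuz-*; do
        [ -e "$img" ] || continue        # the glob matched nothing
        name=$(basename "$img")
        # Require whitespace or end of line after the name so that
        # vmlinuz-linux is not satisfied by a vmlinuz-linux-lts entry.
        grep -qE "/${name}([[:space:]]|\$)" "$cfg" || echo "$name"
    done
    return 0
)

# Possible usage (reading grub.cfg may require root):
#   kernels_missing_from_grub
```

An empty result means every image found is referenced in the config; any name printed is a kernel that GRUB would not offer at boot.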

Offline
