
#1 2010-02-13 11:26:35

skwo
Member
Registered: 2008-11-13
Posts: 133

Hardware Error, Machine Check Exception

Following up on this post: http://bbs.archlinux.org/viewtopic.php?id=90266
After installing all the updates, I still get kernel panics:

::Waiting for udev events to be processed
Hardware Error
CPU1: Machine Check Exception 4 Bank 0: b20000001040080f
RIP !INEXACT! 00:<00000000c10c7121>
TSC 1403dc409e
Processor 0:f43 time 1266058880 Socket 0 APIC 1
CPU0: Machine Check Exception 4 Bank 0: b20000001040080f
RIP !INEXACT! 00:<00000000c10b4ad2>
TSC 1403dc4069
Processor 0:f43 time 1266058880 Socket 0 APIC 1
This is not a software problem
Machine check: Processor context corrupt
Kernel Panic not syncing: Fatal Machine Check
PID 956, comm load_modules.sh Tainted: G  M 2.6.32-ARCH #1
Call Trace:
  <hex numbers> panic+<hex>
  <hex numbers> mce_panic+<hex>
  <hex numbers> do_machine_check+<hex>
  <and more here>

The CPU is an Intel Pentium 4 3.0 GHz with Hyper-Threading, which is why it is identified as two CPUs.
I had a problem with CPU cooling, but I replaced the heatsink and temperatures are OK now. Other than that, I suspect the motherboard, the PSU and the video card. Memory is OK: it completed 22 passes of Memtest86+ over 12 hours.

Any suggestions? What hardware part may cause this error?

Thanks a lot!


ArchLinux x86_64 on Dell Latitude E5410


#2 2010-02-13 15:01:49

skwo
Member
Registered: 2008-11-13
Posts: 133

Re: Hardware Error, Machine Check Exception

OK, after trying to understand the error, I got the following information:

CPU1: Machine Check Exception 4 Bank 0: b20000001040080f

No idea what the 4 is. Bank 0 means the CPU (AFAIK banks 0-3 belong to the CPU).

0xb = 1011b - This means that the information in the IA32_MCi_STATUS register is valid, there was no overflow from a previous error, the error was not corrected by the CPU, and the error was enabled by the associated EEj bit in the IA32_MCi_CTL register.

0x2 = 0010b - This means that the IA32_MCi_MISC and IA32_MCi_ADDR registers are either invalid or do not exist, and that the processor context was corrupted.
The other zeros mean something architectural (there is no clear explanation in the Intel manuals, or I haven't found it).
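The flag reading above can be double-checked with a few shifts. This is only a minimal sketch: it assumes the architectural IA32_MCi_STATUS layout (VAL in bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, model-specific code in bits 31:16, MCA error code in bits 15:0), and the name decode_mci_status is made up for illustration.

```python
# Flag bit positions in IA32_MCi_STATUS (architectural layout, assumed here).
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by the CPU
    "EN": 60,     # error reporting enabled in IA32_MCi_CTL
    "MISCV": 59,  # IA32_MCi_MISC holds valid data
    "ADDRV": 58,  # IA32_MCi_ADDR holds valid data
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit status value into its flags and two error-code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF
    decoded["mca_code"] = status & 0xFFFF
    return decoded


if __name__ == "__main__":
    # Status value taken from the panic message in this thread.
    for name, value in decode_mci_status(0xB20000001040080F).items():
        shown = value if isinstance(value, bool) else f"{value:#06x}"
        print(f"{name:20} {shown}")
```

For the value from the panic, this reports VAL, UC, EN and PCC as set, with 0x1040 and 0x080f as the two code fields, which matches the manual decode.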

0x1040 = Model-specific error code. Condition by condition: no address parity error; no hardware failure detected on response; no parity error detected on response; no data parity error detected on either PIC or FSB access; no invalid PIC request error; the state machine that tracks P and N data-strobe relative timing has NOT become unsynchronized and NO glitch has been detected; a data strobe glitch WAS detected (the only condition here that is set); no address strobe glitch.
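The 0x1040 reading can be reproduced mechanically. The sketch below makes one loud assumption: that the eight conditions listed above map, in the order given, to bits 0 through 7 of the 16-bit model-specific field (status bits 16 through 23). The names CONDITIONS, set_conditions and unlisted_bits are illustrative only.

```python
# Conditions in the order the post lists them. ASSUMPTION: they map to
# bits 0..7 of the model-specific field, i.e. status bits 16..23.
CONDITIONS = [
    "address parity error",
    "hardware failure detected on response",
    "parity error detected on response",
    "data parity error on PIC or FSB access",
    "invalid PIC request",
    "data-strobe timing state machine unsynchronized or glitch",
    "data strobe glitch",
    "address strobe glitch",
]


def set_conditions(field):
    """Return the listed conditions whose bit is set in the 16-bit field."""
    return [name for bit, name in enumerate(CONDITIONS) if (field >> bit) & 1]


def unlisted_bits(field):
    """Return set bit positions that the eight listed conditions do not cover."""
    return [bit for bit in range(len(CONDITIONS), 16) if (field >> bit) & 1]


if __name__ == "__main__":
    print(set_conditions(0x1040))
    print(unlisted_bits(0x1040))
```

Under that assumption only the data strobe glitch condition is set. Bit 12 of the field is also set, but the list above does not cover it.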

0x080f = BUSLG_SRC_ERR_NOTIMEOUT_ERR
0x080F = 0000 1000 0000 1111
According to the following scheme:

0000 1000 0000 1111
000F 1PPT RRRR IILL
F = 0 = Something related to Pentium family
PP = 00 = Local processor originated the request
T = 0 = Request did not timeout
RRRR = 0000 = Generic error
II = 11 = Other transaction
LL = 11 = Processor cannot determine the memory hierarchy level
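The field-by-field split above can be sketched in code. The 000F 1PPT RRRR IILL scheme is the layout the Intel manuals give for bus and interconnect errors. The lookup tables below are written from memory of those manuals, so treat them as an assumption and check them against the manual for your CPU; decode_bus_error is an illustrative name.

```python
# Field encodings for the 000F 1PPT RRRR IILL layout (assumed, see lead-in).
PARTICIPATION = {
    0b00: "SRC (local processor originated the request)",
    0b01: "RES (local processor responded to the request)",
    0b10: "OBS (local processor observed the error as a third party)",
    0b11: "generic",
}
TIMEOUT = {0: "NOTIMEOUT", 1: "TIMEOUT"}
REQUEST = {
    0b0000: "ERR (generic error)",
    0b0001: "RD (generic read)",
    0b0010: "WR (generic write)",
    0b0011: "DRD (data read)",
    0b0100: "DWR (data write)",
    0b0101: "IRD (instruction fetch)",
    0b0110: "PREFETCH",
    0b0111: "EVICT",
    0b1000: "SNOOP",
}
MEM_OR_IO = {0b00: "M (memory access)", 0b01: "reserved",
             0b10: "IO", 0b11: "other transaction"}
LEVEL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG (generic)"}


def decode_bus_error(code):
    """Decode a 16-bit MCA error code laid out as 000F 1PPT RRRR IILL."""
    # Bits 15..13 must be 0 and bit 11 must be 1 for this layout to apply.
    if code >> 13 != 0 or (code >> 11) & 1 != 1:
        raise ValueError(f"{code:#06x} does not have the 000F 1PPT RRRR IILL form")
    return {
        "F": (code >> 12) & 0b1,
        "PP": PARTICIPATION[(code >> 9) & 0b11],
        "T": TIMEOUT[(code >> 8) & 0b1],
        "RRRR": REQUEST.get((code >> 4) & 0b1111, "reserved"),
        "II": MEM_OR_IO[(code >> 2) & 0b11],
        "LL": LEVEL[code & 0b11],
    }


if __name__ == "__main__":
    for field, meaning in decode_bus_error(0x080F).items():
        print(f"{field:4} = {meaning}")
```

For 0x080f this gives SRC, NOTIMEOUT, ERR, other transaction and LG, which lines up with the BUSLG_SRC_ERR_NOTIMEOUT_ERR mnemonic.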

From this analysis I understand that it is not a TLB or cache error, probably not a memory error, and apparently not a bus error. I still suspect the PSU, which might be over-volting or under-volting a specific rail (I suspect the +12V rail). It might also be CPU corruption due to heat, though that seems unlikely.

Would like to hear your opinions.




#3 2010-02-16 00:35:40

RJARRRPCGP
Member
From: USA (Vermont)
Registered: 2006-07-08
Posts: 23

Re: Hardware Error, Machine Check Exception

Likely bad caps or a bad PSU.

Unless you're overclocking your processor.

