Hello,
For the past 5 or 6 weeks (roughly since the last kernel 3.16 and xorg 1.16 upgrade, although I think it is unrelated), my Samsung laptop (an NP900X3F) has been crashing approximately twice a week. I haven't noticed any correlation with any particular activity.
Last time it happened, I was able to grab a picture of the error message in the console. It goes like this:
[95178.311349] mce: [Hardware Error]: CPU3: Machine Check Exception: 5 Bank 4: b200000000100402
[more time] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f38863dbc72>
[more time] mce: [Hardware Error]: TSC 430970232f3
[more time] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1411595110 SOCKET 0 APIC 3 microcode 17
[more time] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[more time] mce: [Hardware Error]: CPU2: Machine Check Exception: 5 Bank 4: b200000000100402
[more time] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8131a547> {intel_idle+0xe7/0x180}
[more time] mce: [Hardware Error]: TSC 430970235cc
[more time] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1411595110 SOCKET 0 APIC 2 microcode 17
[more time] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[more time] mce: [Hardware Error]: CPU1: Machine Check Exception: 5 Bank 4: b200000000100402
[more time] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8131a547> {intel_idle+0xe7/0x180}
[more time] mce: [Hardware Error]: TSC 43097038e7c
[more time] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1411595110 SOCKET 0 APIC 1 microcode 17
[more time] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[more time] mce: [Hardware Error]: CPU0: Machine Check Exception: 5 Bank 4: b200000000100402
[more time] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8131a547> {intel_idle+0xe7/0x180}
[more time] mce: [Hardware Error]: TSC 43097038e94
[more time] mce: [Hardware Error]: PROCESSOR 0: 306a9 TIME 1411595110 SOCKET 0 APIC 0 microcode 17
[more time] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[more time] mce: [Hardware Error]: Machine check: Processor context corrupt
[more time] Kernel panic - not syncing: Fatal Machine check
[more time] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[more time] drm_kms_helper: panic occurred, switching back to text console
[95178.514479] Rebooting in 30 seconds..
Not good. After some searching on the web, I ran the message through mcelog and got:
# mcelog --ascii < error3
CPU3: Machine Check Exception: 5 Bank 4: b200000000100402
Hardware event. This is not a software error.
CPU 0 BANK 0 TSC 430970232f3
RIP !INEXACT! 33:7f38863dbc72
TIME 1411595110 Thu Sep 25 07:45:10 2014
MCG status:
MCi status:
Machine check not valid
Corrected error
MCA: No Error
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 58
SOCKET 0 APIC 3 microcode 17
So it says it is not a software error. Fair enough. I've run memtest86+ over 2 passes and it detected no error. I will run more overnight. [edit: I ran 7 passes last night with no error /edit, sorry for the bump]
This page reports on a similar error, but there is no conclusion. There is also mention of a similar hardware crash that was solved by changing the motherboard and CPU. I'd like to avoid going there if I can.
Lastly, this page contains information on MCA messages, and it seems the "Bank 4" message refers to the "Northbridge or DRAM controller".
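The mcelog output above shows "STATUS 0", so it apparently did not pick up the status word from the pasted text. As a rough cross-check, the value b200000000100402 can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 31:16 model-specific code, bits 15:0 MCA error code). The function name decode_mci_status is made up for this post and is not part of mcelog.

```sh
#!/bin/sh
# Decode the flag bits of a 16-hex-digit IA32_MCi_STATUS value.
# Assumption: bit positions follow the Intel SDM machine-check chapter.
# decode_mci_status is a hypothetical helper, not an mcelog command.
decode_mci_status() {
    # Split into two 32-bit halves so the shell never has to hold a
    # value with bit 63 set (which would overflow a signed 64-bit int).
    hi=$(( 0x$(printf '%s' "$1" | cut -c1-8) ))
    lo=$(( 0x$(printf '%s' "$1" | cut -c9-16) ))
    flags=""
    bit=31                       # bit 31 of the high half is bit 63 overall
    for name in VAL OVER UC EN MISCV ADDRV PCC; do
        if [ $(( (hi >> bit) & 1 )) -eq 1 ]; then
            flags="${flags:+$flags }$name"
        fi
        bit=$(( bit - 1 ))
    done
    printf 'flags: %s\n' "$flags"
    printf 'model-specific code: 0x%04x\n' $(( (lo >> 16) & 0xffff ))
    printf 'MCA error code: 0x%04x\n' $(( lo & 0xffff ))
}

decode_mci_status b200000000100402
```

For the value in the panic this reports the flags VAL UC EN PCC, i.e. a valid, uncorrected error with processor context corrupt, which matches the "Processor context corrupt" line in the log. If I read the SDM's simple error code table correctly, an MCA error code of 0x0402 falls in the "internal unclassified" range, but treat that as my reading rather than a definitive diagnosis.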
I might open and reseat the RAM, but not sure it's gonna make any difference.
Or I could go ahead and take the opportunity to buy a new laptop, but this one is only one year old and it is a very nifty one (or at least it seemed to be until the latest errors), so I am reluctant to do that :-)
Thanks for any input.
Last edited by frigaut (2014-09-25 22:16:37)
Archer since 03/2009 - AUR packages
Offline
This seems to be an internal CPU problem that triggers some checksum error in one or more cores. My laptop (Samsung NP900X3C) does the same from time to time.
The reason Windows does not crash and Linux does might have to do with the fact that Windows automatically loads the latest Intel microcode into the CPU at boot. Microcode is the "firmware" running within the CPU (yes, that exists); from time to time Intel patches it (bug fixes) and releases new microcode. This should really be delivered through a BIOS update from Samsung, but that is not likely to happen...
Fortunately, it is possible to load the new Intel microcode into the CPU at boot time, which is what Windows does, but this only holds until the next reboot.
Even more fortunately, in Debian and Ubuntu this can be done as easily as:
sudo apt-get install intel-microcode
After this, Linux will load the newest Intel microcode into the CPU at every boot. Please try it and see if your problems go away. I did this myself on my laptop a few days ago and have not seen any crashes since.
Sorry about the Debian-centric answer, but you can probably find out rather easily how to do the same under Arch. Good luck.
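Whichever distribution is used, it is worth confirming that a newer microcode revision is actually loaded after the reboot (the panic above reports "microcode 17"). A minimal sketch, assuming an x86 Linux system where /proc/cpuinfo has a "microcode" field; the function name microcode_revision is only illustrative:

```sh
#!/bin/sh
# Print the microcode revision of the first CPU listed in a
# cpuinfo-style file (defaults to /proc/cpuinfo).
# microcode_revision is a made-up helper name for this post.
microcode_revision() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Only query the live system when the file is readable.
if [ -r /proc/cpuinfo ]; then
    microcode_revision
fi
```

Comparing the value before and after installing the microcode package shows whether the early-load step took effect; if it is unchanged, the update is probably not being applied at boot.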
Offline
Thanks. It turns out that after my initial post I found the info about the microcode update and installed intel-ucode (link given by mauritiusdadd) on 28/10/2014. I believe the crashes continued, but I am not totally sure, as I did not use the laptop much after that date. I am back on it now, so I will monitor. Thanks anyway, as this might be useful for other users.
Last edited by frigaut (2015-07-17 02:48:22)
Offline
I have the same model: no issues running either custom or vanilla kernels.
A little more detail about your setup might help: do you boot in UEFI mode?
Offline
@jasonwryan: I am not using UEFI. Are you, on this laptop?
Some information on my config is below. Happy to provide more if you think it would help.
71:~ $ cat /proc/cmdline
BOOT_IMAGE=../vmlinuz-linux root=/dev/sda2 rw quiet vga=current resume=/dev/sda1 initrd=../intel-ucode.img,../initramfs-linux.img
85:~ $ sudo parted -l
[sudo] password for frigaut:
Model: ATA LITEONIT LMT-128 (scsi)
Disk /dev/sda: 128GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 1049kB 4296MB 4295MB primary linux-swap(v1)
2 4296MB 128GB 124GB primary ext4 boot
86:~ $ lspci
00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.3 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 4 (rev c4)
00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation HM75 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
01:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
03:00.0 USB controller: Renesas Technology Corp. uPD720202 USB 3.0 Host Controller (rev 02)
87:~ $
Offline
Yes, UEFI works great for me.
Our machines are slightly different: mine has the Intel card...
Offline
OK, if the crashes persist, I may try booting in UEFI mode.
What do you mean by the Intel card? Mine has one too, doesn't it?
Offline
01:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
I doubt it is related, though...
Offline
Well, it seems you're not the only one who has experienced this issue with that machine: https://lists.archlinux.org/pipermail/a … 37851.html.
--edit: just out of curiosity, are you still able to access the BIOS?
Last edited by mauritiusdadd (2015-07-17 08:12:57)
Offline
Thanks for the link. I have indeed been in contact with 2 other guys with the same problem. My problem is not as acute as theirs. At the worst period I had one crash every other day. Some other people experience 3 crashes per day, and the guy you are linking to has lost his laptop altogether. It might come to that for me, but for now, with the microcode patch installed, I have had no crash since I started using this laptop again, i.e. about a week.
And yes, I am still able to access the BIOS.
Last edited by frigaut (2015-07-19 12:09:39)
Offline
My Samsung Series 9 laptop crashes under Ubuntu 13.04-15.10 and other systems (Debian, Arch, FreeBSD).
Checked: cleaned the laptop, changed the battery, changed the CPU cooler, ran memtest, monitored temperatures.
kernel 3.11 - crashes every two weeks
kernel 3.16-4.1 - crashes every day
Tried:
microcode update
installing with UEFI and with BIOS CSM
blacklisting samsung_laptop
Log:
> [19367.116180] Disabling lock debugging due to kernel taint
> [19367.116196] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402
> [19367.116202] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f8b4934c8b7>
> [19367.116205] mce: [Hardware Error]: TSC 2824672b8e7
> [19367.116211] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 14010118857 SOCKET 0 APIC 1 microcode 12
> [19367.116213] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [19367.116216] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
> [19367.116218] mce: [Hardware Error]: Machine check: Invalid
> [19367.116220] Kernel panic - not syncing: Fatal machine check on current CPU
Offline