Apart from suffering from kernel panics caused by Intel SpeedStep, this CPU also seems to suffer from faulty instructions:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6b877e0 in __lll_unlock_elision () from /usr/lib/libpthread.so.0
I have intel-ucode.img loaded:
title Arch Linux (LVM)
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options root=/dev/mapper/yuuzora-root resume=/dev/sda1 rw noquiet
but apparently the microcode package lacks an update for Broadwell CPUs (as noted in the latest available update https://downloadcenter.intel.com/downlo … -Data-File), and
dmesg | grep microcode
[ 0.420492] microcode: CPU0 sig=0x40671, pf=0x2, revision=0xd
[ 0.420498] microcode: CPU1 sig=0x40671, pf=0x2, revision=0xd
[ 0.420504] microcode: CPU2 sig=0x40671, pf=0x2, revision=0xd
[ 0.420510] microcode: CPU3 sig=0x40671, pf=0x2, revision=0xd
[ 0.420538] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
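The sig= field in those lines is the raw CPUID signature, and it can be split into family, model and stepping by hand. Below is a minimal sketch of that decoding, assuming the standard x86 layout (stepping in bits 0-3, model in 4-7, family in 8-11, extended model in 16-19, extended family in 20-27). The names decode_signature and parse_microcode_line are made up for this illustration and are not part of any tool.

```python
import re

# Matches kernel log lines such as:
# [ 0.420492] microcode: CPU0 sig=0x40671, pf=0x2, revision=0xd
MICROCODE_LINE = re.compile(
    r"microcode: CPU(?P<cpu>\d+) sig=(?P<sig>0x[0-9a-fA-F]+), "
    r"pf=(?P<pf>0x[0-9a-fA-F]+), revision=(?P<rev>0x[0-9a-fA-F]+)"
)


def decode_signature(sig):
    """Split a CPUID signature into (family, model, stepping)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended model only counts for families 6 and 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family only counts for family 15.
    if family == 0xF:
        family += ext_family
    return family, model, stepping


def parse_microcode_line(line):
    """Return a dict for one 'microcode: CPUn sig=...' line, or None."""
    match = MICROCODE_LINE.search(line)
    if match is None:
        return None
    family, model, stepping = decode_signature(int(match.group("sig"), 16))
    return {
        "cpu": int(match.group("cpu")),
        "family": family,
        "model": model,
        "stepping": stepping,
        "revision": int(match.group("rev"), 16),
    }


# The first line from the dmesg output above.
sample = "[ 0.420492] microcode: CPU0 sig=0x40671, pf=0x2, revision=0xd"
info = parse_microcode_line(sample)
print(info)
```

For sig=0x40671 this gives family 6, model 0x47, stepping 1, which is the triple to look for when checking whether a microcode release covers this CPU.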
How can I work around this?
Last edited by Goresome (2015-09-22 10:33:04)
Offline
Are you reasonably sure it's not a software bug, like passing NULL to pthread_mutex_lock?
If it really is hardware then the only workaround is to recompile glibc without lock elision. Something like this:
sudo abs core
cp -r /var/abs/core/glibc /tmp
cd /tmp/glibc
# now edit PKGBUILD and remove the line containing --enable-lock-elision
makepkg
sudo pacman -U /tmp/glibc/glibc-whatever.pkg.tar.xz
You may want to ensure that you have some other OS or a bootable CD/pendrive available in case anything goes wrong.
Offline
Thank you, I have compiled it with --disable-lock-elision and the crashes are gone. However, I've seen people here with Skylake CPUs having the same problem.
Haswell CPUs have this functionality, but it is disabled by microcode. Broadwell and Skylake have it too, and even have the same bug, but there is no microcode update for them, so I wonder why the maintainers enabled --enable-lock-elision.
Offline
Ran into this same issue on my new Skylake box, everything was crashing (gdm was the most immediately apparent one, but I've got core dumps all over the place). This doesn't bode well.
Going to build a debug version of glibc and have a go at debugging the issue.
Offline
neunon wrote: Ran into this same issue on my new Skylake box, everything was crashing (gdm was the most immediately apparent one, but I've got core dumps all over the place). This doesn't bode well.
Going to build a debug version of glibc and have a go at debugging the issue.
Unfortunately it is Intel that developed the bugged chips (Haswell, Broadwell and Skylake); they have it in their errata (BDD50, 51): https://www-ssl.intel.com/content/dam/w … update.pdf
Offline
My contacts at Intel told me that Haswell-EP and Broadwell-EP had the errata, but Broadwell-K, Haswell-E (desktop), Haswell-EX and Skylake are fine.
Also it's a SIGSEGV on the xend instruction. Not clear why yet:
>>> l
24 __lll_unlock_elision(int *lock, int private)
25 {
26 /* When the lock was free we're in a transaction.
27 When you crash here you unlocked a free lock. */
28 if (*lock == 0)
29 _xend();
30 else
31 lll_unlock ((*lock), private);
32 return 0;
33 }
>>> disas
Dump of assembler code for function __lll_unlock_elision:
0x00007ffff685b7b0 <+0>: mov eax,DWORD PTR [rdi]
0x00007ffff685b7b2 <+2>: mov rdx,rdi
0x00007ffff685b7b5 <+5>: test eax,eax
0x00007ffff685b7b7 <+7>: je 0x7ffff685b7e0 <__lll_unlock_elision+48>
0x00007ffff685b7b9 <+9>: lock dec DWORD PTR [rdx]
0x00007ffff685b7bc <+12>: je 0x7ffff685b7d4 <__lll_unlock_elision+36>
0x00007ffff685b7be <+14>: lea rdi,[rdx]
0x00007ffff685b7c1 <+17>: sub rsp,0x80
0x00007ffff685b7c8 <+24>: call 0x7ffff6858d80 <__lll_unlock_wake>
0x00007ffff685b7cd <+29>: add rsp,0x80
0x00007ffff685b7d4 <+36>: xor eax,eax
0x00007ffff685b7d6 <+38>: ret
0x00007ffff685b7d7 <+39>: nop WORD PTR [rax+rax*1+0x0]
=> 0x00007ffff685b7e0 <+48>: xend
0x00007ffff685b7e3 <+51>: xor eax,eax
0x00007ffff685b7e5 <+53>: ret
End of assembler dump.
>>> info regis
rax 0x0 0
rbx 0x7ffff7fbc548 140737353860424
rcx 0x2 2
rdx 0x7ffff1c4ed08 140737249602824
rsi 0x0 0
rdi 0x7ffff1c4ed08 140737249602824
rbp 0x7fffffffd600 0x7fffffffd600
rsp 0x7fffffffd5d8 0x7fffffffd5d8
r8 0x0 0
r9 0x7ffff6a61200 140737331466752
r10 0x51 81
r11 0x7fffefefacc0 140737218849984
r12 0x7ffff1f07f48 140737252458312
r13 0x7fffffffd730 140737488344880
r14 0x7fffffffd7f0 140737488345072
r15 0xffffffff 4294967295
rip 0x7ffff685b7e0 0x7ffff685b7e0 <__lll_unlock_elision+48>
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
>>> p $_siginfo._sifields._sigfault
$2 = {
si_addr = 0x0
}
Last edited by neunon (2015-09-23 11:45:33)
Offline
neunon wrote: Ran into this same issue on my new Skylake box, everything was crashing (gdm was the most immediately apparent one, but I've got core dumps all over the place). This doesn't bode well.
Going to build a debug version of glibc and have a go at debugging the issue.
Unfortunately it is Intel that developed the bugged chips (Haswell, Broadwell and Skylake); they have it in their errata (BDD50, 51): https://www-ssl.intel.com/content/dam/w … update.pdf
That PDF doesn't apply to Skylake. It's 6th generation, not 5th. I am having the same crash on my i7-6700K. Perhaps it's a bug in libpthread?
Offline
neunon wrote: My contacts at Intel told me that Haswell-EP and Broadwell-EP had the errata, but Broadwell-K, Haswell-E (desktop), Haswell-EX and Skylake are fine.
Also it's a SIGSEGV on the xend instruction. Not clear why yet:
Skylake is not "fine" at all, at least not outside of the Intel dreamlands...
SIGSEGV in xend means TSX itself is misbehaving (were it a SIGILL, it could have been a known, but supposedly already fixed, glibc bug). That SIGSEGV is exactly what was reported in Broadwell-H processors before they received a microcode update that disabled TSX-NI support for real.
You likely need a microcode update for your Skylake system. ASUS and ASRock have already started deploying them; pester your motherboard vendor for one, and please report back (with the contents of /proc/cpuinfo) if it fixes anything.
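For anyone reporting back, the two things worth pulling out of /proc/cpuinfo are the microcode revision and whether the TSX flags (rtm, hle) are still advertised. Here is a rough sketch of that check, assuming the usual x86 Linux layout with "microcode" and "flags" lines. The function name summarize_cpuinfo is invented for this example.

```python
def summarize_cpuinfo(text):
    """Return (microcode, has_rtm, has_hle) for the first CPU in the text.

    microcode is the raw string from the "microcode" line, or None if
    the text has no such line.
    """
    microcode = None
    flags = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPUs
        key = key.strip()
        if key == "microcode" and microcode is None:
            microcode = value.strip()
        elif key == "flags" and flags is None:
            flags = set(value.split())
    flags = flags or set()
    return microcode, "rtm" in flags, "hle" in flags


def main():
    # Only meaningful on Linux; elsewhere the file does not exist.
    try:
        with open("/proc/cpuinfo") as handle:
            microcode, has_rtm, has_hle = summarize_cpuinfo(handle.read())
    except OSError as error:
        print("could not read /proc/cpuinfo:", error)
        return
    print("microcode revision:", microcode)
    print("rtm advertised:", has_rtm)
    print("hle advertised:", has_hle)


if __name__ == "__main__":
    main()
```

If rtm is no longer listed after a firmware or microcode update, glibc should have no reason to take the elided path at all.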
Offline
Hmm, MSI as well. They were actually the first to ship some updates with the newer Skylake microcode, it seems.
Offline
Meh, I was mistaken about the SIGSEGV. It can also happen due to a programming error, not just because of chip errata. And the software error is _a lot more common_.
The programming error goes like this: should a program/library attempt to unlock an already unlocked mutex, it will cause a SIGSEGV when Intel TSX-NI (RTM) is in use.
So, what application or library (not libpthread, the one that *called* libpthread) was crashing in __lll_unlock_elision? It may need to be fixed...
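To make the double-unlock case concrete, here is a toy model of the branch in the __lll_unlock_elision listing posted earlier. It is plain Python, not glibc code: MutexModel, GeneralProtectionFault and double_unlock_faults are invented names, and the "transaction" is just a boolean. The one hardware fact it relies on is that xend executed outside a transaction raises a general protection fault, which Linux reports as SIGSEGV.

```python
class GeneralProtectionFault(Exception):
    """Stands in for the #GP that shows up as SIGSEGV."""


class MutexModel:
    def __init__(self, elision_build):
        self.elision_build = elision_build  # glibc built with lock elision?
        self.lock_word = 0                  # 0 means the lock looks free
        self.in_transaction = False

    def lock(self):
        if self.elision_build:
            # Elided path: start a transaction, leave the lock word at 0.
            self.in_transaction = True
        else:
            self.lock_word = 1

    def _xend(self):
        if not self.in_transaction:
            raise GeneralProtectionFault("xend outside a transaction")
        self.in_transaction = False

    def unlock(self):
        if not self.elision_build:
            # Plain unlock: a stray second call goes unnoticed here.
            self.lock_word = 0
            return
        # Same test as the listing: a free-looking lock means "commit".
        if self.lock_word == 0:
            self._xend()
        else:
            self.lock_word = 0


def double_unlock_faults(elision_build):
    """Lock once, unlock twice, and report whether the model faults."""
    mutex = MutexModel(elision_build)
    mutex.lock()
    mutex.unlock()
    try:
        mutex.unlock()  # the programming error
    except GeneralProtectionFault:
        return True
    return False


print("with elision:", double_unlock_faults(True))
print("without elision:", double_unlock_faults(False))
```

In this model the second unlock only faults in the elision build, which would also explain why rebuilding glibc without lock elision makes the crashes disappear without fixing the caller.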
Offline
Meh, I was mistaken about the SIGSEGV. It can also happen due to a programming error, not just because of chip errata. And the software error is _a lot more common_.
The programming error goes like this: should a program/library attempt to unlock an already unlocked mutex, it will cause a SIGSEGV when Intel TSX-NI (RTM) is in use.
So, what application or library (not libpthread, the one that *called* libpthread) was crashing in __lll_unlock_elision? It may need to be fixed...
Seems like NVIDIA: http://www.phoronix.com/scan.php?page=n … atest-Woes
Offline