[SOLVED] Broadwell 5675c elision crash.

Goresome · 2015-09-17 13:31:53

Apart from suffering from kernel panics caused by Intel SpeedStep, it seems that this CPU also suffers from faulty instructions:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6b877e0 in __lll_unlock_elision () from /usr/lib/libpthread.so.0

I have intel-ucode.img loaded

title Arch Linux (LVM)
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options root=/dev/mapper/yuuzora-root resume=/dev/sda1 rw noquiet

but apparently the microcode update lacks an update for Broadwell CPUs (as noted in the latest available update https://downloadcenter.intel.com/downlo … -Data-File), and

 dmesg | grep microcode
[    0.420492] microcode: CPU0 sig=0x40671, pf=0x2, revision=0xd
[    0.420498] microcode: CPU1 sig=0x40671, pf=0x2, revision=0xd
[    0.420504] microcode: CPU2 sig=0x40671, pf=0x2, revision=0xd
[    0.420510] microcode: CPU3 sig=0x40671, pf=0x2, revision=0xd
[    0.420538] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba

How can I work around this?

Last edited by Goresome (2015-09-22 10:33:04)

mich41 · 2015-09-17 15:30:34

Are you reasonably sure it's not a software bug, like passing NULL to pthread_mutex_lock?

If it really is a hardware problem, then the only workaround is to recompile glibc without lock elision. Something like this:

sudo abs core
cp -r /var/abs/core/glibc /tmp
cd /tmp/glibc
# now edit PKGBUILD and remove the line containing --enable-lock-elision
makepkg
sudo pacman -U /tmp/glibc/glibc-whatever.pkg.tar.xz

You may want to ensure that you have some other OS or bootable CD/pendrive available in case anything goes wrong

Goresome · 2015-09-22 10:32:10

Thank you, I have compiled it with --disable-lock-elision and the crashes are gone. However, I've seen people here with Skylake CPUs hitting the same thing.
Haswell CPUs have this functionality, but it is disabled by microcode. Broadwell and Skylake have it too, and even have the same bug, but there is no microcode update for them, so I wonder why the maintainers enabled --enable-lock-elision.

neunon · 2015-09-23 10:54:34

Ran into this same issue on my new Skylake box, everything was crashing (gdm was the most immediately apparent one, but I've got core dumps all over the place). This doesn't bode well.

Going to build a debug version of glibc and have a go at debugging the issue.

Goresome · 2015-09-23 10:57:59

neunon wrote:

Ran into this same issue on my new Skylake box, everything was crashing (gdm was the most immediately apparent one, but I've got core dumps all over the place). This doesn't bode well.
Going to build a debug version of glibc and have a go at debugging the issue.

Unfortunately it is Intel that developed buggy chips (Haswell, Broadwell and Skylake); they have it in their errata (BDD50, BDD51): https://www-ssl.intel.com/content/dam/w … update.pdf

neunon · 2015-09-23 11:40:40

My contacts at Intel told me that Haswell-EP and Broadwell-EP had the errata, but Broadwell-K, Haswell-E (desktop), Haswell-EX and Skylake are fine.

Also it's a SIGSEGV on the xend instruction. Not clear why yet:

>>> l
24      __lll_unlock_elision(int *lock, int private)
25      {
26        /* When the lock was free we're in a transaction.
27           When you crash here you unlocked a free lock.  */
28        if (*lock == 0)
29          _xend();
30        else
31          lll_unlock ((*lock), private);
32        return 0;
33      }
>>> disas
Dump of assembler code for function __lll_unlock_elision:
   0x00007ffff685b7b0 <+0>:     mov    eax,DWORD PTR [rdi]
   0x00007ffff685b7b2 <+2>:     mov    rdx,rdi
   0x00007ffff685b7b5 <+5>:     test   eax,eax
   0x00007ffff685b7b7 <+7>:     je     0x7ffff685b7e0 <__lll_unlock_elision+48>
   0x00007ffff685b7b9 <+9>:     lock dec DWORD PTR [rdx]
   0x00007ffff685b7bc <+12>:    je     0x7ffff685b7d4 <__lll_unlock_elision+36>
   0x00007ffff685b7be <+14>:    lea    rdi,[rdx]
   0x00007ffff685b7c1 <+17>:    sub    rsp,0x80
   0x00007ffff685b7c8 <+24>:    call   0x7ffff6858d80 <__lll_unlock_wake>
   0x00007ffff685b7cd <+29>:    add    rsp,0x80
   0x00007ffff685b7d4 <+36>:    xor    eax,eax
   0x00007ffff685b7d6 <+38>:    ret    
   0x00007ffff685b7d7 <+39>:    nop    WORD PTR [rax+rax*1+0x0]
=> 0x00007ffff685b7e0 <+48>:    xend   
   0x00007ffff685b7e3 <+51>:    xor    eax,eax
   0x00007ffff685b7e5 <+53>:    ret    
End of assembler dump.
>>> info regis
rax            0x0      0
rbx            0x7ffff7fbc548   140737353860424
rcx            0x2      2
rdx            0x7ffff1c4ed08   140737249602824
rsi            0x0      0
rdi            0x7ffff1c4ed08   140737249602824
rbp            0x7fffffffd600   0x7fffffffd600
rsp            0x7fffffffd5d8   0x7fffffffd5d8
r8             0x0      0
r9             0x7ffff6a61200   140737331466752
r10            0x51     81
r11            0x7fffefefacc0   140737218849984
r12            0x7ffff1f07f48   140737252458312
r13            0x7fffffffd730   140737488344880
r14            0x7fffffffd7f0   140737488345072
r15            0xffffffff       4294967295
rip            0x7ffff685b7e0   0x7ffff685b7e0 <__lll_unlock_elision+48>
eflags         0x10246  [ PF ZF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
>>> p $_siginfo._sifields._sigfault
$2 = {
  si_addr = 0x0
}

Last edited by neunon (2015-09-23 11:45:33)

Splex · 2015-09-24 03:10:19

Goresome wrote:

neunon wrote:
Ran into this same issue on my new Skylake box, everything was crashing (gdm was the most immediately apparent one, but I've got core dumps all over the place). This doesn't bode well.
Going to build a debug version of glibc and have a go at debugging the issue.
Unfortunately it is Intel that developed buggy chips (Haswell, Broadwell and Skylake); they have it in their errata (BDD50, BDD51): https://www-ssl.intel.com/content/dam/w … update.pdf

That PDF doesn't apply to Skylake. It's 6th generation, not 5th. I am having the same crash on my i7-6700K. Perhaps it's a bug in libpthread?

hmh · 2015-10-09 12:06:56

neunon wrote:

My contacts at Intel told me that Haswell-EP and Broadwell-EP had the errata, but Broadwell-K, Haswell-E (desktop), Haswell-EX and Skylake are fine.
Also it's a SIGSEGV on the xend instruction. Not clear why yet:

Skylake is not "fine" at all, at least not outside of the Intel dreamlands...

SIGSEGV in xend means TSX itself is misbehaving (were it a SIGILL, it could have been a known, but supposedly already fixed, glibc bug). That SIGSEGV is exactly what was reported in Broadwell-H processors before they received a microcode update that disabled TSX-NI support for real.

You likely need a microcode update for your Skylake system. ASUS and ASRock have already started deploying them; pester your motherboard vendor for one, and please report back (with the contents of /proc/cpuinfo) if it fixes anything.
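
For anyone reporting back: whether the CPU still advertises TSX can be read from the "flags" line of /proc/cpuinfo, where rtm and hle are the relevant entries. Below is a minimal C sketch of that check. The helper names has_cpu_flag and cpuinfo_has_flag are made up for illustration, and the code assumes the Linux x86 cpuinfo layout, where each CPU entry has a line starting with "flags".

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-delimited word in `line`,
 * 0 otherwise. Hypothetical helper, not part of any library. */
int has_cpu_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;

    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        /* Accept the match only if it is not part of a longer word. */
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p++;
    }
    return 0;
}

/* Look at the first "flags" line of a cpuinfo-style file.
 * Returns 1 if `flag` is listed, 0 if it is not (or no flags line exists),
 * and -1 if the file cannot be opened. */
int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            found = has_cpu_flag(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling cpuinfo_has_flag("/proc/cpuinfo", "rtm") before and after a firmware or microcode update shows whether RTM is still exposed. It only reflects what the CPU advertises and says nothing about whether the implementation behaves correctly.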

hmh · 2015-10-09 12:08:27

Hmm, MSI as well. They were actually the first to ship some updates with the newer Skylake microcode, it seems.

hmh · 2015-10-20 12:00:42

Meh, I was mistaken about the SIGSEGV. It can also happen due to a programming error, not just because of chip errata. And the software error is _a lot more common_.

The programming error goes like this: should a program/library attempt to unlock an already unlocked mutex, it will cause a SIGSEGV when Intel TSX-NI (RTM) is in use.

So, what application or library (not libpthread itself, but the one that *called* libpthread) was crashing in __lll_unlock_elision? It may need to be fixed...
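
The double unlock described above can be caught without relying on TSX. POSIX leaves unlocking an unlocked default mutex undefined, but a PTHREAD_MUTEX_ERRORCHECK mutex is required to report it as an error (EPERM). Here is a small C sketch of that; the helper names init_errorcheck_mutex and double_unlock_result are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Initialise `m` as an error-checking mutex. Returns 0 on success,
 * otherwise the error number from the failing pthread call. */
int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock once, unlock twice, and return the result of the second unlock.
 * Returns -1 if the mutex could not be set up. */
int double_unlock_result(void)
{
    pthread_mutex_t m;
    int rc;

    if (init_errorcheck_mutex(&m) != 0)
        return -1;
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);      /* matching unlock, fine */
    rc = pthread_mutex_unlock(&m); /* the bug: mutex is already unlocked */
    pthread_mutex_destroy(&m);
    return rc;
}
```

On a conforming implementation double_unlock_result() returns EPERM instead of crashing. Build with -pthread. Temporarily switching a suspect mutex to the error-checking type is one way to find which caller performs the stray unlock.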

orlfman · 2015-11-05 22:20:11

hmh wrote:

Meh, I was mistaken about the SIGSEGV. It can also happen due to a programming error, not just because of chip errata. And the software error is _a lot more common_.
The programming error goes like this: should a program/library attempt to unlock an already unlocked mutex, it will cause a SIGSEGV when Intel TSX-NI (RTM) is in use.
So, what application or library (not libpthread itself, but the one that *called* libpthread) was crashing in __lll_unlock_elision? It may need to be fixed...

Seems like NVIDIA: http://www.phoronix.com/scan.php?page=n … atest-Woes
