[SOLVED] CPU stall on boot, Intel Core 2 Processors

NiceGuy · 2018-08-30 19:42:15

As for the Fedora 29 installation, it installed fine. It booted normally without stalling, using just the usual Fedora kernel boot parameters,
but it failed to load gdm.service and gnome-initial-setup, so no root or user account was created and configured.

Tried to hack a little bit, lost interest and just did reboots. All boots with the Fedora 29 kernel 4.18.5 had no technical glitches or stalling,
and the system was usable except for the login. ;-)

loqs · 2018-08-30 20:07:13

In kernel/time/clocksource.c there are three calls to

schedule_work(&watchdog_work);

If you replace one of those calls with

;

See if that makes any difference.

xchoice · 2018-08-31 00:13:10

In my case (LG E500 laptop, Core 2 Duo T7100), booting into fallback, available_clocksource lists "hpet acpi_pm" and current_clocksource is "hpet". So I added clocksource= to the GRUB kernel line, and both hpet and acpi_pm boot fine (one at a time, of course). So is this a problem where the kernel or some module fails to set up the proper clock source?
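
For anyone who wants to repeat this check, here is a minimal sketch. It assumes the standard sysfs location for clocksource information; the function name show_clocksources and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Print the available and the current clocksource.
# The optional directory argument exists only so the function can be pointed
# at a copy of the files; by default it reads the live sysfs entries.
show_clocksources() {
    dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    printf 'available: %s\n' "$(cat "$dir/available_clocksource")"
    printf 'current:   %s\n' "$(cat "$dir/current_clocksource")"
}

# Run against the live system only if the sysfs files are readable.
if [ -r /sys/devices/system/clocksource/clocksource0/current_clocksource ]; then
    show_clocksources
fi
```

To try a specific source for one boot, append for example clocksource=hpet to the kernel line in the boot loader, then run the function again after booting to confirm which source is in use.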

loqs · 2018-08-31 00:23:28

I wonder whether tsc is judged unstable by the watchdog and removed from the list of clock sources, but the affected systems somehow get caught in the watchdog thread.

viktorj · 2018-08-31 07:52:02

I have a question about bisecting that I asked myself earlier but didn't give much thought until now. During the bisection I found one bad commit which seemed to be not as bad as all the others:

[0bbcce5d1ef3f771a349896f1c7574d20dc6f4bd] Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

This is the merge commit that includes the bad commit which @nic3-14159 and I found. I tested every bad commit without the "quiet" parameter and then additionally with "debug". This particular commit was bad without "quiet" but fine with "debug", whereas all other bad commits were bad in both cases. I marked them all as bad, of course, because none of them, including the one above, would let me boot under normal conditions, i.e. with the default initramfs and with the microcode image.

Is it possible that several commits are somehow related to each other, so that the resulting kernel is affected not by one bad commit but by several? In my case this means that there is at least one commit after

[7197e77abcb65a71d0b21d67beb24f153a96055e] clocksource: Remove kthread

which makes the kernel behave slightly better, i.e. booting with "debug" works, but there is another one after

[0bbcce5d1ef3f771a349896f1c7574d20dc6f4bd] Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

which makes the kernel behave worse again. I am completely new to this, I have never bisected before, so I have no idea what is going on and I also don't know what the code does. But I can test my kernels and perhaps bisect again a bit near the mentioned commits if anyone thinks that it is useful. If I remember correctly, @nic3-14159 could not identify the bad commits as easily as I could, so he might not have seen what I saw. I can give you my bisection log if anyone else wants to try to reproduce what I saw.

NiceGuy · 2018-08-31 08:39:09

I noticed that heftig changed the default Arch kernel .config for 4.18.5 because of bug report https://bugs.archlinux.org/task/59824

diff:
# Platform RTC drivers
#
-CONFIG_RTC_DRV_CMOS=m
+CONFIG_RTC_DRV_CMOS=y

I noticed yesterday the journalctl message about the rtc. Forgot to mention it.
Could this influence our issue?

Going back to at least 4.17.14 the config was also set to CONFIG_RTC_DRV_CMOS=m.

Output in the journal:
---
kernel: hctosys: unable to open rtc device (rtc0)
kernel: Unstable clock detected, switching default tracing clock to "global"
If you want to keep using the local clock, then add:
"trace_clock=local"
on the kernel command line
---
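
To check which value a given kernel build uses for this option, something like the sketch below may help. The function name config_value is made up for illustration, and it assumes you have an uncompressed kernel .config file at hand.

```shell
#!/bin/sh
# Print the value of one option from an uncompressed kernel .config file.
# $1 = option name (e.g. CONFIG_RTC_DRV_CMOS), $2 = path to the config file.
# Prints nothing if the option is not set in the file.
config_value() {
    sed -n "s/^$1=//p" "$2"
}

# Example (path is an assumption, adjust to your build directory):
#   config_value CONFIG_RTC_DRV_CMOS ./.config
```

If the running kernel exposes /proc/config.gz, decompressing it with zcat into a temporary file gives a config that can be checked the same way.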

viktorj · 2018-08-31 09:31:51

progandy wrote:

By the way, here is the commit that initially introduced the kthread. So apparently it wasn't just silliness.
https://git.kernel.org/pub/scm/linux/ke … 7fd34a34ab

commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
Author: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date:   Tue Aug 18 17:09:42 2009 +0200

    clocksource: Avoid clocksource watchdog circular locking dependency
    
    stop_machine from a multithreaded workqueue is not allowed because
    of a circular locking dependency between cpu_down and the workqueue
    execution. Use a kernel thread to do the clocksource downgrade.
    
    Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: john stultz <johnstul@us.ibm.com>
    LKML-Reference: <20090818170942.3ab80c91@skybase>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Nobody here commented on that. How did you find it? Are there other commits related to this? Should we mention this to the kernel developers?

GoatWarning · 2018-08-31 09:33:44

loqs wrote:

Those affected are you using intel-ucode.img and what flags does lscpu show?

I am using intel-ucode.img and I am affected. Installed linux-lts as a workaround.

Booting with linux-lts 4.14.67-1, lscpu lists these flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti tpr_shadow vnmi flexpriority dtherm ida
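
A quick way to check a flag list like this for the TSC-related flags is sketched below. The helper name has_flag is made up for illustration, and whether nonstop_tsc actually matters for this bug is only a guess; the flags string is copied from the lscpu output above.

```shell
#!/bin/sh
# has_flag FLAG "FLAG LIST": succeed if FLAG appears as a whole word
# in the space-separated list.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Flags as reported by lscpu in this post.
flags="fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti tpr_shadow vnmi flexpriority dtherm ida"

for f in tsc constant_tsc nonstop_tsc; do
    if has_flag "$f" "$flags"; then
        echo "$f: present"
    else
        echo "$f: absent"
    fi
done
```

For this list it reports tsc and constant_tsc as present and nonstop_tsc as absent.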

progandy · 2018-08-31 11:01:54

viktorj wrote:

Nobody here commented on that. How did you find it? Are there other commits related to this? Should we mention this to the kernel developers?

I used git blame. The kernel devs should normally think of that themselves, but if you want to let them know, sure. It is also possible that the commit message is wrong, or that the kernel has changed and the deadlock has been resolved.

git show v4.17:kernel/time/clocksource.c | less -N
git blame -L 132,133  v4.17 -- kernel/time/clocksource.c

Last edited by progandy (2018-08-31 11:05:12)

NiceGuy · 2018-08-31 11:48:58

loqs wrote:

In kernel/time/clocksource.c there are three calls to
schedule_work(&watchdog_work);
If you replace one of those calls with
;
See if that makes any difference

I replaced the third occurrence of "schedule_work(&watchdog_work);" with ";" (line 268).
I also changed the .config to include CONFIG_RTC_DRV_CMOS=y.

No difference, it stalls -> had to use tsc=unstable.

Last edited by NiceGuy (2018-09-02 12:11:47)

xerxes_ · 2018-08-31 14:35:52

I'm also affected by this bug, as I have previously written in this thread. I installed the LTS kernel (4.14.65) to work around the issue and it worked fine (except that sometimes, when started from a cold boot, the system automatically restarted during boot, but after the restart it booted to the end).

The most important thing I want to share is that after updating the kernel from 4.18.3 to 4.18.5, I can boot successfully just by deleting the quiet parameter (so rw is the only parameter at the end of the linux line in the GRUB config). I have been booting like that for four days and there has been no stall so far. And yes, I am using intel-ucode.img.

Is it possible that this issue has something to do with timing and/or CPU frequency, so that when messages are displayed at boot, the boot process is slowed down by a tiny fraction and the stall does not occur? Or maybe the double space after the rw parameter has something to do with it (I deleted that double space when I deleted the quiet parameter)?

Sorry for the possibly silly questions, but when logic fails, strange ideas come to mind.

Last edited by xerxes_ (2018-08-31 14:38:33)

progandy · 2018-08-31 14:45:42

It is likely that there is some rare race condition that triggers only under specific circumstances. Even something small like the size and compression of the initramfs, hardware found by acpi probes, or console output might affect the timing and make the bug appear and disappear, yes.

The commit found to cause the issue might not be the real culprit either. It might just change the timing of some kernel processes and therefore uncover the bug.

Last edited by progandy (2018-08-31 14:51:44)

NiceGuy · 2018-09-02 12:03:12

I'm going to build a CONFIG_DEBUG_INFO enabled kernel. Maybe this will provide us with a more detailed output.

For those that do not know about CONFIG_DEBUG_INFO:

  If you say Y here the resulting kernel image will include 
  debugging info resulting in a larger kernel image. 
  This adds debug symbols to the kernel and modules (gcc -g), and 
  is needed if you intend to use kernel crashdump or binary object 
  tools like crash, kgdb, LKCD, gdb, etc on the kernel.
  Say Y here only if you plan to debug the kernel.

[UPDATE:] As for the outcome of the fully debug-enabled kernel: it made no difference, except that after several attempts I was always able to boot successfully, with any or none of the additional kernel boot parameters. That matches the behavior some of you already reported, but it kept stalling repeatedly.

For those of you curious enough for some statistics of the fully debug enabled 4.18.5 kernel (based on Arch default config):
The build took a bit more than 3 hours to finish. It also needed 17 GiB of space (more than my RAM-based tmpfs could offer), and the two kernel debug packages combined were ~1.6 GiB in size, compared to just 160 MiB for the default non-debug Arch linux and linux-headers packages. ;-)

I'm also investigating why on Fedora 29 the 4.18.5 kernel showed no signs of boot stalling. -> postponed and now unnecessary (patch available)

Last edited by NiceGuy (2018-09-03 20:39:05)

xchoice · 2018-09-02 21:45:59

from wikipedia (https://en.wikipedia.org/wiki/Time_Stamp_Counter):
"Use in exploiting cache side-channel attacks
The time stamp counter can be used to time instructions accurately which can be exploited in the Meltdown and Spectre security vulnerabilities. However if this is not available other counters or timers can be used, as is the case with the ARM processors vulnerable to this type of attack."

Maybe a new Meltdown/Spectre patch did something that affects tsc on these older platforms, which would explain why the hpet and acpi_pm clock sources have no problems?

Just found this while doing nothing on my PC, so maybe it's of some help to someone.

viktorj · 2018-09-03 11:52:49

A kernel developer posted a patch, see here:
https://lkml.org/lkml/2018/9/3/253

You are welcome to test it. I think you have to apply it on top of the latest mainline release (4.19-rc2) or newer, older releases might not work. If your system still fails to boot with the patch, post it here or on the mailing list. In the latter case you should provide further information to help the developers resolve the problem.

sjensen · 2018-09-03 20:03:09

Just applied the patch (with some fuzz) to 4.18.5-arch1 and did a quick test. Works fine on my Lenovo X200 without "clocksource=hpet".
That hardware has an unstable TSC and previously wouldn't boot without "clocksource=hpet".

# dmesg|grep DMI

[ 0.000000] DMI: LENOVO 7454CTO/7454CTO, BIOS 6DET72WW (3.22 ) 10/25/2012

# head /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz
stepping : 10
microcode : 0xa0c
cpu MHz : 1127.401
cache size : 3072 KB
physical id : 0

# dmesg |grep -E 'clocksource|tsc|rtc'

[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[ 0.016666] tsc: Fast TSC calibration using PIT
[ 0.019999] tsc: Detected 2393.816 MHz processor
[ 0.019999] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x228165ba90a, max_idle_ns: 440795221225 ns
[ 0.133350] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
[ 0.255595] clocksource: Switched to clocksource tsc-early
[ 0.339183] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.398162] tsc: Marking TSC unstable due to TSC halts in idle
[ 0.398188] clocksource: Switched to clocksource hpet
[ 0.405749] rtc_cmos 00:02: RTC can wake from S4
[ 0.405983] rtc_cmos 00:02: registered as rtc0
[ 0.406004] rtc_cmos 00:02: alarms up to one month, y3k, 114 bytes nvram, hpet irqs

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource

hpet acpi_pm

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource

hpet

The only thing is, I now get this message twice, but this may be because the patch was not really meant for this kernel.

[ 0.000000] tsc: Fast TSC calibration using PIT
[...]
[ 0.016666] tsc: Fast TSC calibration using PIT
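
To pull just the clocksource switches out of dmesg output like the one in this post, a small filter along these lines should do. The function name clocksource_switches is made up for illustration; the two sample lines are copied from the log above.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print each clocksource the kernel
# switched to, in the order the switches happened.
clocksource_switches() {
    sed -n 's/.*clocksource: Switched to clocksource //p'
}

# Two lines copied from the dmesg output in this post.
printf '%s\n' \
    '[ 0.255595] clocksource: Switched to clocksource tsc-early' \
    '[ 0.398188] clocksource: Switched to clocksource hpet' |
    clocksource_switches
```

On a live system the same filter can be fed directly, e.g. dmesg | clocksource_switches. For the sample lines it prints tsc-early followed by hpet.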

benoliver999 · 2018-09-03 20:44:23

sjensen wrote:

Just applied the patch (with some fuzz) to 4.18.5-arch1 and did a quick test. Works fine on my Lenovo X200 without "clocksource=hpet".
That hardware has an unstable TSC and previously wouldn't boot without "clocksource=hpet".

Good stuff! Couple of questions:

- Is it consistently booting up?
- Have you emailed the ML?

Thank you for testing this.

NiceGuy · 2018-09-03 20:50:18

sjensen wrote:

The only thing is, I now get this message twice, but this may be because the patch was not really meant for this kernel.
[ 0.000000] tsc: Fast TSC calibration using PIT
[...]
[ 0.016666] tsc: Fast TSC calibration using PIT

Don't worry, with 4.17.19 (based on Arch default config) I also get this twice and there is no "boot stalling fix" patch applied or necessary.

Last edited by NiceGuy (2018-09-03 20:51:28)

NiceGuy · 2018-09-03 21:56:10

benoliver999 wrote:

sjensen wrote:
Just applied the patch (with some fuzz) to 4.18.5-arch1 and did a quick test. Works fine on my Lenovo X200 without "clocksource=hpet".
That hardware has an unstable TSC and previously wouldn't boot without "clocksource=hpet".
Good stuff! Couple of questions:
- Is it consistently booting up?
- Have you emailed the ML?
Thank you for testing this.

- Yes, in my case I had ~10 successful boots and no stalling (Intel Core 2 Duo). I applied the patch Peter Zijlstra developed on top of 4.18.5!
- LKML has been informed - they know the fix is working well by now.

It might take a few hours or even days, until the patch is applied in Linus' tree and for Greg KH to backport it to the 4.18 stable series.

Last edited by NiceGuy (2018-09-03 23:47:00)

malcontent · 2018-09-04 00:10:00

Peter Zijlstra is the best. ❤️❤️❤️

https://static.lwn.net/images/conf/2015 … jlstra.jpg

Mod Edit - Replaced oversized images with links.
CoC - Pasting pictures and code

Last edited by Slithery (2018-09-04 00:21:28)

malcontent · 2018-09-04 00:11:08

Please contact Phoronix and tell Michael about this great news!

gregfrankenstein · 2018-09-04 03:39:38

I went ahead and tested with the linux-mainline package in the AUR, both for future-proofing and to test the patch as-is without fuzzing. As an aside, I learned the hard way why you shouldn't have your vimrc set to convert tabs to spaces when you're pasting in the contents of a diff. That was a good dozen minutes of going "why does this hunk not match" when the code and line numbers were identical, and dangit, I'll never get those minutes back.

My failure to conform to spacing standards aside, Peter's patch allows 4.19-rc2 to boot consistently on my Lenovo T400 and everything looks okay.

# dmesg | grep DMI

[ 0.000000] DMI: LENOVO 64752N3/64752N3, BIOS CBET4000 1c84243 09/07/2016
[ 0.602600] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)

# head /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz
stepping : 6
cpu MHz : 1398.732
cache size : 6144 KB
physical id : 0
siblings : 2

# uname -r

4.19.0-rc2-mainline

Boy will I be glad when this is merged upstream, because it's quite the eternity to compile a kernel on this thing.
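
The tabs-versus-spaces problem described earlier in this post can be spotted before applying a patch. A minimal sketch, assuming the patch was saved to a file; the function name count_tab_lines is made up for illustration. Kernel sources indent with tabs, so a kernel patch in which no line contains a tab has most likely been mangled by the editor.

```shell
#!/bin/sh
# Count the lines in a file that contain at least one tab character.
# Prints 0 (and still succeeds) when there are none.
count_tab_lines() {
    tab=$(printf '\t')
    grep -c "$tab" "$1" || true
}

# Example (file name is an assumption):
#   count_tab_lines clocksource-fix.patch
```

A result of 0 for a patch that touches indented kernel code is a strong hint that tabs were converted to spaces on paste.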

malcontent · 2018-09-04 04:25:37

Peter Zijlstra is a fucking boss, a king.

Please make sure you give him hugs, love and all the nice things. <3

Peter. <3

Last edited by malcontent (2018-09-04 04:34:27)

sjensen · 2018-09-04 08:26:03

benoliver999 wrote:

Good stuff! Couple of questions:
- Is it consistently booting up?
- Have you emailed the ML?

1) It is booting consistently. (+8 and counting, without stalling)
2) No. I think it's not necessary anymore?

great work!

benoliver999 · 2018-09-04 08:57:42

I think it looks like we have a fix. I'll update the topic when it makes its way to Arch.

Thank you so much everyone for looking into this, this thread went far beyond my level of expertise and it was a real learning experience.
