[SOLVED] CPU stall on boot, Intel Core 2 Processors

nic3-14159 · 2018-08-24 19:49:28

viktorj wrote:

Thanks for pointing to ccache! Unfortunately this config option didn't help to reduce the compile time here, but according to this blog post one has to set the kernel build timestamp like this
KBUILD_BUILD_TIMESTAMP='' make -j2
in order to make ccache actually useful for repeated kernel builds. I haven't tested it yet, but I will use both tweaks from now on because I believe both are necessary.

I'll probably try that tweak as well. One thing to be careful of when testing a kernel build, though, is to reboot several times. Some of my builds would boot a couple of times, only to fail a few reboots later. I had actually finished bisecting yesterday, but when I tested the apparently broken commit, the computer booted several times. I had only been testing each kernel once, so I may have told git that a commit was good even though it would have failed on a later boot. So, back to square one. Hopefully that tweak does work, and I can regain my lost progress.

loqs · 2018-08-24 20:00:38

You could reuse the bisection up to the point where you marked the first commit as good, so that might save you something.
Edit:
Or you could ask viktorj for their bisect log, then check the last good and last bad commits from that.

Last edited by loqs (2018-08-24 20:06:58)

viktorj · 2018-08-24 22:01:41

It turned out that setting the kernel build timestamp as above didn't really help. It's probably more complicated than that, because this is part of getting reproducible builds, which is way too complicated for me. So I'm still at about 20-25 minutes per build. Maybe it is just the huge amount of changes between the commits, because the issue seems to be in a commit in the "merge window", where all the big changes are made.

My last bad commit is:

# bad: [3c89adb0d11117f64d5b501730be7fb2bf53a479] Merge tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

My last good commit is:

# good: [db020be9f7a0eb667761f0b762c1aadef2d7bd24] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

I tested the good one only twice, so you can test it a few times more. Still about 6 steps to do from there, if I did everything correctly.

nic3-14159 · 2018-08-25 07:33:46

Just finished my git bisect. According to my tests, this is the bad commit:

# first bad commit: [7197e77abcb65a71d0b21d67beb24f153a96055e] clocksource: Remove kthread

I did test it, and it did indeed hang on boot. However, it did not fail until the 5th reboot, and then it would boot around 5 more times before failing again. I wrote a script for the build process and had it back up all kernels and initrd images to a folder, so I can recheck some of the good commits. If a kernel booted 5 times, I marked it as good, but given the results of my bisect, it might be a good idea to recheck them. The ccache tweaks also didn't seem to have an effect on build times, but as I neared the end of the bisect, almost all of the kernel was being pulled from the cache, and it built in under 10 minutes. It probably is the huge changes between commits: as the commits get closer together while the bisect progresses, the differences shrink, allowing for more cache hits.
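
To put a rough number on why 5 clean boots may not be enough to mark a commit as good, here is a small sketch. It assumes (my assumption, not something measured in this thread) that each boot of a bad kernel stalls independently with a fixed probability p; the value 0.2 below is only a guess based on "failed on the 5th reboot". The function names are made up for illustration.

```shell
# Assumption: every boot of a bad kernel stalls independently with probability P.

# p_all_boot P N: chance that a bad kernel still survives N boots in a row.
p_all_boot() {
    LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

# boots_needed P ALPHA: smallest N for which (1 - P)^N drops to ALPHA or below,
# i.e. how many clean boots are needed before a wrong "good" is that unlikely.
boots_needed() {
    LC_ALL=C awk -v p="$1" -v a="$2" 'BEGIN {
        n = 0; q = 1
        while (q > a) { q *= (1 - p); n++ }
        print n
    }'
}
```

With these assumptions, `p_all_boot 0.2 5` prints 0.328, so roughly one bad kernel in three would still pass a 5-boot test, and `boots_needed 0.2 0.05` prints 14. If the real failure rate is lower than 1 in 5, even more boots per commit would be needed.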

viktorj · 2018-08-25 08:07:38

nic3-14159 wrote:

# first bad commit: [7197e77abcb65a71d0b21d67beb24f153a96055e] clocksource: Remove kthread
I did test it, and it did indeed hang on boot. However, it did not fail until the 5th reboot, and then it would continue to boot around 5 times before failing.

This is very confusing to me, because if you look in the git history for the upstream kernel, you see that this commit is from May 2nd, just after the release of v4.17-rc3, so it should be included in all the 4.17.y releases. My boot failures are much more "reliable" than yours and only occurred with the 4.18.y releases, so now I am wondering if we have the same issue. If we don't, you should test a 4.17 kernel from the Arch repos and see if it also fails in the same "unreliable" way. Maybe there are two different changes that are related to each other.

Hopefully I will finish bisecting today or tomorrow, depending on how long the build times are.

loqs · 2018-08-25 08:54:51

https://github.com/torvalds/linux/commi … 153a96055e

git tag --contains 7197e77abcb65a71d0b21d67beb24f153a96055e
v4.18
v4.18-rc1
v4.18-rc2
v4.18-rc3
v4.18-rc4
v4.18-rc5
v4.18-rc6
v4.18-rc7
v4.18-rc8
v4.18.1
v4.18.2
v4.18.3
v4.18.4
v4.18.5

Edit:
@nic3-14159 if you check out the latest stable release (use a clean directory or at least a different branch to preserve the bisect) and then revert 7197e77abcb65a71d0b21d67beb24f153a96055e, does that kernel work?
Edit2:
Oops, fixed the comment: it was intended for nic3-14159, not viktorj

Last edited by loqs (2018-08-25 11:29:29)

nic3-14159 · 2018-08-25 14:13:10

I will definitely do some more testing, because I was a little confused when git bisect was continually giving me v4.17-rc* releases, and not a single v4.18-rc release. I'll test the things that have been suggested, and if they do not give any meaningful result, I may try to re-run a bisect on a fresh copy of the kernel repository. I was working on the git.kernel.org repository, so I might try bisecting the one from git.archlinux.org.

I found an article suggesting the use of qemu and scripts to test kernels without rebooting, and then automate the entire process with git bisect run, but I do not know how accurate that would be. I was also thinking of doing something with an Arduino and a bunch of servos that would power cycle my laptop and select a working boot entry if the test one fails, and then somehow make a way to detect a no boot and report to git, and trying to automate it that way. Automating kernel bisects would be nice, but likely quite difficult to get working properly.
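
For the git bisect run idea, here is a minimal sketch of the part that matters for an intermittent bug: repeating the boot test and calling the commit bad on the first failure. The exit codes follow what git bisect run expects (0 means good, 125 means skip, other codes from 1 to 127 mean bad). The function run_n_times and the boot_once placeholder are hypothetical names, and whether the stall reproduces under qemu at all is still an open question.

```shell
#!/bin/sh
# Sketch of a helper for a `git bisect run` test script.
# git bisect run treats exit 0 as good, 125 as "skip this commit",
# and any other code from 1 to 127 as bad.

# run_n_times N CMD...: run CMD up to N times, stop at the first failure.
run_n_times() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" || return 1
        i=$((i + 1))
    done
    return 0
}

# Hypothetical use inside the bisect script; boot_once is a placeholder for
# whatever boots the freshly built kernel and returns 0 on a clean boot:
#   make -j2 || exit 125        # cannot build this commit: skip it
#   run_n_times 15 boot_once    # bad as soon as a single boot stalls
```

The script would then be started with git bisect run ./your-script.sh after the usual git bisect start, git bisect bad and git bisect good.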

loqs · 2018-08-25 16:33:37

man 1 git-describe wrote:

The command finds the most recent tag that is reachable from a commit. If the tag points to the commit, then only the tag is shown. Otherwise, it suffixes the tag name with the number of additional commits on top of the tagged object and the abbreviated object name of the most recent commit.

git describe 7197e77abcb65a71d0b21d67beb24f153a96055e
v4.17-rc3-21-g7197e77abcb6
git describe --contains 7197e77abcb65a71d0b21d67beb24f153a96055e
v4.18-rc1~155^2~11

The most recent tag reachable from 7197e77abcb65a71d0b21d67beb24f153a96055e (the last tag published before it) is v4.17-rc3, but the first tag that includes that commit is v4.18-rc1
I hope that helps explain the confusing output you were seeing.
Edit:
Fixed commit to be the one in question not its parent.

Last edited by loqs (2018-08-25 16:54:28)

nic3-14159 · 2018-08-25 22:03:29

loqs wrote:

The most recent tag reachable from 7197e77abcb65a71d0b21d67beb24f153a96055e (the last tag published before it) is v4.17-rc3, but the first tag that includes that commit is v4.18-rc1
I hope that helps explain the confusing output you were seeing.

Oh, I think I understand now. So that commit was made during the v4.17 kernel builds, but not merged until the v4.18 builds?

loqs wrote:

@nic3-14159 if you checkout the latest stable release (use a clean directory or at least a different branch to preserve the bisect) then revert 7197e77abcb65a71d0b21d67beb24f153a96055e does that kernel work?

I just tried this on the 4.18.4-arch1 release, and the kernel does indeed boot after reverting that commit.

loqs · 2018-08-25 22:25:09

nic3-14159 wrote:

loqs wrote:
The most recent tag reachable from 7197e77abcb65a71d0b21d67beb24f153a96055e (the last tag published before it) is v4.17-rc3, but the first tag that includes that commit is v4.18-rc1
I hope that helps explain the confusing output you were seeing.
Oh, I think I understand now. So that commit was made during the v4.17 kernel builds, but not merged until the v4.18 builds?

Exactly

nic3-14159 wrote:

loqs wrote:
@nic3-14159 if you checkout the latest stable release (use a clean directory or at least a different branch to preserve the bisect) then revert 7197e77abcb65a71d0b21d67beb24f153a96055e does that kernel work?
I just tried this on the 4.18.4-arch1 release, and the kernel does indeed boot after reverting that commit.

That would indicate you have found the source, although since the bug seems intermittent it probably needs more testing before notifying upstream.

perl scripts/get_maintainer.pl kernel/time/clocksource.c
John Stultz <john.stultz@linaro.org> (supporter:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)
Thomas Gleixner <tglx@linutronix.de> (supporter:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)
Stephen Boyd <sboyd@kernel.org> (reviewer:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)
linux-kernel@vger.kernel.org (open list:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)

Unfortunately there is no subsystem-only list, so if you do report it upstream it should go to the main linux-kernel mailing list.
The kernel's reporting-bugs document has more information about reporting kernel bugs.

cb9 · 2018-08-26 07:33:10

Hi,
I am not a Linux expert and have never downgraded a kernel before. I am facing the same problem after a recent update. After reading all the posts and the wiki, I think the following commands will let me downgrade the linux kernel, firmware and headers. Please correct me if I am wrong. I just wanted to confirm before trying to downgrade.

pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.17.14.arch1-1-x86_64.pkg.tar.xz   
https://archive.archlinux.org/packages/l/linux-firmware/linux-firmware-20180717.8d69bab-1-any.pkg.tar.xz
https://archive.archlinux.org/packages/l/linux-headers/linux-headers-4.17.14.arch1-1-x86_64.pkg.tar.xz

nic3-14159 · 2018-08-26 07:47:17

cb9 wrote:

Hi,
I am not a Linux expert and have never downgraded a kernel before. I am facing the same problem after a recent update. After reading all the posts and the wiki, I think the following commands will let me downgrade the linux kernel, firmware and headers. Please correct me if I am wrong. I just wanted to confirm before trying to downgrade.
pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.17.14.arch1-1-x86_64.pkg.tar.xz   
https://archive.archlinux.org/packages/l/linux-firmware/linux-firmware-20180717.8d69bab-1-any.pkg.tar.xz
https://archive.archlinux.org/packages/l/linux-headers/linux-headers-4.17.14.arch1-1-x86_64.pkg.tar.xz

Yes, I believe that is correct. Just make sure that all those files are either on one line, separated by spaces, or that you run pacman -U separately with each URL.
However, if you had those packages installed before, they are likely still in your pacman cache, located at /var/cache/pacman/pkg. By running pacman -U on the same packages in that directory, you can save the time of re-downloading them:

cd /var/cache/pacman/pkg
pacman -U linux-4.17.14.arch1-1-x86_64.pkg.tar.xz linux-firmware-20180717.8d69bab-1-any.pkg.tar.xz linux-headers-4.17.14.arch1-1-x86_64.pkg.tar.xz
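
If you are not sure which of the packages are still in the cache, something like the following sketch can pick the local file when it exists and fall back to the archive URL otherwise. The helper name pkg_source is made up for this post; the URL layout is simply copied from the archive links quoted above, so treat it as an assumption for other packages.

```shell
# pkg_source CACHE_DIR PKGNAME FILE
# Prints the cached path if FILE exists in CACHE_DIR, otherwise an
# archive.archlinux.org URL built the same way as the links quoted above.
pkg_source() {
    cache_dir=$1
    pkgname=$2
    file=$3
    if [ -f "$cache_dir/$file" ]; then
        printf '%s\n' "$cache_dir/$file"
    else
        # First character of the package name, e.g. "l" for linux.
        first=${pkgname%"${pkgname#?}"}
        printf 'https://archive.archlinux.org/packages/%s/%s/%s\n' \
            "$first" "$pkgname" "$file"
    fi
}
```

The output of each call can then be passed to a single pacman -U invocation, for example pacman -U "$(pkg_source /var/cache/pacman/pkg linux linux-4.17.14.arch1-1-x86_64.pkg.tar.xz)", since pacman -U accepts both local files and URLs.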

cb9 · 2018-08-26 07:48:02

Thank you.

gregfrankenstein · 2018-08-26 09:48:11

Yep, that would seem to be the commit. I rebuilt my kernel with a patch like so and I can boot and reboot to my heart's content.

Last edited by gregfrankenstein (2018-08-26 20:32:17)

loqs · 2018-08-26 09:57:38

With an independent confirmation you could report upstream or wait for viktorj's bisection results to see if they match.

progandy · 2018-08-26 13:28:35

By the way, here is the commit that initially introduced the kthread. So apparently it wasn't just silliness.
https://git.kernel.org/pub/scm/linux/ke … 7fd34a34ab

commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
Author: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date:   Tue Aug 18 17:09:42 2009 +0200

    clocksource: Avoid clocksource watchdog circular locking dependency
    
    stop_machine from a multithreaded workqueue is not allowed because
    of a circular locking dependency between cpu_down and the workqueue
    execution. Use a kernel thread to do the clocksource downgrade.
    
    Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: john stultz <johnstul@us.ibm.com>
    LKML-Reference: <20090818170942.3ab80c91@skybase>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

josealb77 · 2018-08-26 14:53:14

Hi everybody, one more here with the same problem, on an Intel Core Duo HP G61 laptop.

I've temporarily worked around the problem by adding "nolapic" to the kernel parameters (GRUB_CMDLINE_LINUX in /etc/default/grub) until you guys (or the upstream people) get to a real solution. I took it from this old thread: https://bbs.archlinux.org/viewtopic.php?id=140426 Now only 1 CPU appears, but the system boots and works fine for the use I make of it.

I hope it helps somebody. And if someone is absolutely against this solution because it implies some kind of risk I haven't realized, feel free to tell us about it. Thank you!

progandy · 2018-08-26 15:16:00

nolapic disables all but one CPU core, so you lose performance.

You could try to manually set a specific clocksource on the commandline instead of letting the kernel choose it. Maybe that would prevent the stall as well?
https://www.kernel.org/doc/html/v4.18/a … eters.html

Baltazard2013 · 2018-08-26 17:44:03

4.18.5.arch1-1 doesn't work for me :-(

josealb77 · 2018-08-26 19:19:36

progandy wrote:

nolapic disables all but one CPU core, so you lose performance.
You could try to manually set a specific clocksource on the commandline instead of letting the kernel choose it. Maybe that would prevent the stall as well?
https://www.kernel.org/doc/html/v4.18/a … eters.html

I think I got my CPU's name wrong. My laptop has an Intel Pentium(R) Dual-Core CPU T4300 @2.10GHz, which I think is not the same as an Intel Core Duo. My fault, sorry. So it only has 2 x64 cores. Using "nolapic" and "clocksource=X86-64" on the same line does not have any effect, and using "clocksource=X86-64" alone doesn't solve the boot failure either.

Since working with just 1 of 2 cores does not affect me too much, I will stay with just the "nolapic" parameter while waiting for the final solution.

Thank you for the ideas by the way.

loqs · 2018-08-26 19:27:10

I think the valid options for clocksource on X86-64, given

        clocksource=    Override the default clocksource
                        Format: <string>
                        Override the default clocksource and use the clocksource
                        with the name specified.
                        Some clocksource names to choose from, depending on
                        the platform:
                        [all] jiffies (this is the base, fallback clocksource)
                        [ACPI] acpi_pm
                        [ARM] imx_timer1,OSTS,netx_timer,mpu_timer2,
                                pxa_timer,timer3,32k_counter,timer0_1
                        [X86-32] pit,hpet,tsc;
                                scx200_hrt on Geode; cyclone on IBM x440
                        [MIPS] MIPS
                        [PARISC] cr16
                        [S390] tod
                        [SH] SuperH
                        [SPARC64] tick
                        [X86-64] hpet,tsc

are jiffies, hpet, tsc and acpi_pm.

mneiner · 2018-08-26 19:38:07

Baltazard2013 wrote:

4.18.5.arch1-1 doesn't work for me :-(

Me neither: it takes 5 to 10 reboot attempts before one is successful. I guess I can roll back several kernel versions, but for the time being, I'll just limit reboots.

This problem is significant. I would hate to be a potential new Arch user with a Core Duo, or other impacted hardware, attempting to get up and running. It seems this might be worthy of some kind of post in the "news" section on the main page.

I certainly appreciate all the more advanced users who have gone to work to solve this; my skill set just doesn't allow participation in those efforts. I did do a practice run at chrooting from a live CD into my FS in case the time comes when I can't get in. So there's a silver lining in this problem: my horizons have been expanded.

nic3-14159 · 2018-08-26 19:50:55

I finished testing manually setting the clocksource option, and it does appear to work. It seems that setting it to any valid clocksource will allow the system to boot without stalling. You can get a list of available clocksources for your system by running

cat /sys/devices/system/clocksource/clocksource0/available_clocksource

On my system it printed hpet and acpi_pm, but jiffies and refined-jiffies also worked for me. I tested each clocksource 6 times, and each time it booted successfully.
Edit: Tested on linux 4.18.3. I would have tested on 4.18.5, but I got a signature error and didn't want to spend time fixing it or waiting to redownload the update.

One thing that I found interesting is that after running

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

on my system, on one of the few occasions when it did boot without the clocksource parameter set, it gave me "hpet". However, setting clocksource to hpet on the kernel command line allowed it to boot consistently.
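
For anyone who wants to try the same workaround, here is a rough sketch of adding the parameter to the GRUB defaults. It assumes you boot with GRUB and that GRUB_CMDLINE_LINUX in /etc/default/grub is written with double quotes; with_kernel_param is just a name I made up, and it only prints the modified file instead of editing it in place.

```shell
# with_kernel_param PARAM FILE
# Prints FILE with PARAM appended inside the quotes of GRUB_CMDLINE_LINUX="...".
# GRUB_CMDLINE_LINUX_DEFAULT is left alone because the pattern requires "="
# directly after GRUB_CMDLINE_LINUX.
# Assumption: PARAM contains no "|" or "&" characters, which would confuse sed.
with_kernel_param() {
    sed "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 $1\"|" "$2"
}
```

For example, with_kernel_param clocksource=hpet /etc/default/grub > /tmp/grub.new lets you review the result before copying it over /etc/default/grub as root. After that, regenerate the config with grub-mkconfig -o /boot/grub/grub.cfg and reboot. Other bootloaders keep the kernel command line elsewhere.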

Last edited by nic3-14159 (2018-08-26 20:02:36)

loqs · 2018-08-26 19:53:25

@progandy do you think there is enough information to report the issue upstream?

gregfrankenstein · 2018-08-26 20:30:21

I think there's easily enough information, but maybe in the interim, the Arch maintainers can temporarily patch it out, given how fatal the issue is on some hardware? I moved the patch I used to GitLab because wow, Pastebin has gotten a lot worse since I used it last.

https://gitlab.com/snippets/1748678
