Thanks for pointing to ccache! Unfortunately this config option didn't help to reduce the compile time here, but according to this blog post one has to set the kernel build timestamp like this
KBUILD_BUILD_TIMESTAMP='' make -j2
in order to make ccache actually useful for repeated kernel builds. I haven't tested it yet, but I will use both tweaks from now on because I believe both are necessary.
I'll probably try that tweak as well. One thing to be careful of when testing a kernel build, though, is to reboot several times. Some of my builds would boot a couple of times, only to fail a few reboots later. I had actually finished bisecting yesterday, but when I tested the apparently bad commit, the computer booted several times. I had only been testing each kernel once, so I may have told git that a commit was good even though it would have failed on a later boot. So, back to square one. Hopefully that tweak does work, and I can regain my lost progress.
Offline
You could reuse the bisection up until you marked the first commit as good so that might save you something.
Edit:
Or you could ask viktorj for their bisect log, then check the last good and bad commits from that.
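A rough sketch of reusing an earlier bisection, assuming the log was saved with git bisect log. The helper name trim_bisect_log and the file names are made up for illustration; it prints the saved log up to, but not including, the first line that mentions the commit whose verdict is in doubt, so the result can be fed to git bisect replay.

```shell
#!/bin/sh
# trim_bisect_log LOGFILE HASH   (hypothetical helper, not a git command)
# Print the saved bisect log up to, but not including, the first line that
# mentions HASH. Both the "# good: [hash] ..." comment and the matching
# "git bisect good hash" command for that commit are dropped, along with
# everything after them.
trim_bisect_log() {
    awk -v h="$2" 'index($0, h) { exit } { print }' "$1"
}

# Example usage inside the kernel tree (file names are placeholders):
#   git bisect log > full.log
#   trim_bisect_log full.log <doubtful-commit-hash> > trimmed.log
#   git bisect reset
#   git bisect replay trimmed.log
```

After the replay, git should offer the doubtful commit for testing again, this time with more reboots.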
Last edited by loqs (2018-08-24 20:06:58)
Offline
It turned out that setting the kernel build timestamp as above didn't really help, probably it's more complicated than that because this is part of getting reproducible builds, which is way too complicated for me. So I'm still at about 20-25 minutes. Maybe it is just the huge amount of changes between the commits, because the issue seems to be in a commit in the "merge window" where all the big changes are made.
My last bad commit is:
# bad: [3c89adb0d11117f64d5b501730be7fb2bf53a479] Merge tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
My last good commit is:
# good: [db020be9f7a0eb667761f0b762c1aadef2d7bd24] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
I tested the good one only twice, so you can test it a few times more. Still about 6 steps to do from there, if I did everything correctly.
Offline
Just finished my git bisect. According to my tests, this is the bad commit:
# first bad commit: [7197e77abcb65a71d0b21d67beb24f153a96055e] clocksource: Remove kthread
I did test it, and it did indeed hang on boot. However, it did not fail until the 5th reboot, and then it would boot around 5 more times before failing again. I wrote a script for the build process, and I had it back up all kernels and initrd images to a folder, so I may recheck some of the good commits. If a kernel booted 5 times, I marked it as good, but given the results of my bisect, it might be a good idea to recheck them. The ccache tweaks also didn't seem to have an effect on build times, but as I neared the end of the bisect, almost all of the kernel was being pulled from the cache, and it built in under 10 minutes. It probably is the huge changes between commits: as the commits get closer together while the bisect progresses, the differences shrink, allowing for more cache hits.
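Since the hang is intermittent, it may help to keep a record of every boot attempt instead of trusting a handful of reboots. The following is only a sketch: the log file name, the threshold of 10 boots and the helper names record_boot and verdict are all assumptions, not part of the script mentioned above.

```shell
#!/bin/sh
# Hypothetical bookkeeping for an intermittent boot failure: log every boot
# attempt per commit and only call a commit good after enough clean boots.
BOOT_LOG=${BOOT_LOG:-./boot-results.log}   # assumed log location
REQUIRED_BOOTS=${REQUIRED_BOOTS:-10}       # assumed threshold, pick your own

# record_boot COMMIT ok|fail
record_boot() {
    printf '%s %s\n' "$1" "$2" >> "$BOOT_LOG"
}

# verdict COMMIT  -> prints bad, good or undecided
verdict() {
    if grep -q "^$1 fail\$" "$BOOT_LOG" 2>/dev/null; then
        echo bad            # a single hang is enough to mark it bad
        return
    fi
    # Count clean boots; an empty result (missing log) is treated as 0.
    ok=$(grep -c "^$1 ok\$" "$BOOT_LOG" 2>/dev/null || true)
    if [ "${ok:-0}" -ge "$REQUIRED_BOOTS" ]; then
        echo good
    else
        echo undecided      # keep rebooting before telling git anything
    fi
}
```

The idea is to run record_boot after each attempt and only pass the result to git bisect once verdict stops printing undecided.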
Offline
# first bad commit: [7197e77abcb65a71d0b21d67beb24f153a96055e] clocksource: Remove kthread
I did test it, and it did indeed hang on boot. However, it did not fail until the 5th reboot, and then it would boot around 5 more times before failing again.
This is very confusing to me, because if you look in the git history of the upstream kernel, you see that this commit is from May 2nd, just after the release of v4.17-rc3, so it should be included in all the 4.17.y releases. My boot failures are much more "reliable" than yours and only occurred with the 4.18.y releases, so now I am wondering whether we have the same issue. If not, you should test a 4.17 kernel from the Arch repos and see if it also fails in the same "unreliable" way. Maybe there are two different changes that are related to each other.
Hopefully I will finish bisecting today or tomorrow, depending on how long the build times are.
Offline
https://github.com/torvalds/linux/commi … 153a96055e
git tag --contains 7197e77abcb65a71d0b21d67beb24f153a96055e
v4.18
v4.18-rc1
v4.18-rc2
v4.18-rc3
v4.18-rc4
v4.18-rc5
v4.18-rc6
v4.18-rc7
v4.18-rc8
v4.18.1
v4.18.2
v4.18.3
v4.18.4
v4.18.5
Edit:
@nic3-14159 if you checkout the latest stable release (use a clean directory or at least a different branch to preserve the bisect) then revert 7197e77abcb65a71d0b21d67beb24f153a96055e does that kernel work?
Edit2:
Oops fixed comment intended for nic3-14159 not viktorj
Last edited by loqs (2018-08-25 11:29:29)
Offline
I will definitely do some more testing, because I was a little confused when git bisect was continually giving me v4.17-rc* releases, and not a single v4.18-rc release. I'll test the things that have been suggested, and if they do not give any meaningful result, I may try to re-run a bisect on a fresh copy of the kernel repository. I was working on the git.kernel.org repository, so I might try bisecting the one from git.archlinux.org.
I found an article suggesting the use of qemu and scripts to test kernels without rebooting, and then automate the entire process with git bisect run, but I do not know how accurate that would be. I was also thinking of doing something with an Arduino and a bunch of servos that would power cycle my laptop and select a working boot entry if the test one fails, and then somehow make a way to detect a no boot and report to git, and trying to automate it that way. Automating kernel bisects would be nice, but likely quite difficult to get working properly.
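For what it's worth, a script for git bisect run only has to report its result through the exit code. Below is a minimal sketch of that idea; build_kernel and boot_once are hypothetical placeholders (the boot test via qemu or real hardware is exactly the hard part and is not implemented here), and the number of boot attempts is an assumption.

```shell
#!/bin/sh
# Sketch of a script for "git bisect run". build_kernel and boot_once are
# hypothetical placeholders: replace them with your real build command and
# with whatever boots the kernel (qemu, a test machine, ...) and returns
# non-zero when the boot hangs.
BOOT_ATTEMPTS=${BOOT_ATTEMPTS:-10}   # assumed; the hang is intermittent

build_kernel() { make -j2 >/dev/null 2>&1; }
boot_once()    { echo "boot_once is not implemented" >&2; return 1; }

# Exit codes understood by git bisect run:
#   0 = good, 1..127 (except 125) = bad, 125 = cannot test, skip this commit
bisect_step() {
    build_kernel || return 125
    i=0
    while [ "$i" -lt "$BOOT_ATTEMPTS" ]; do
        boot_once || return 1       # one hang marks the commit bad
        i=$((i + 1))
    done
    return 0
}

# Run the step only when asked to, so the file can also be sourced.
if [ "${RUN_BISECT_STEP:-0}" = 1 ]; then
    bisect_step
    exit $?
fi
```

Saved as, say, bisect-step.sh, it would be started with something like RUN_BISECT_STEP=1 git bisect run sh ./bisect-step.sh once the good and bad commits have been marked.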
Offline
The command finds the most recent tag that is reachable from a commit. If the tag points to the commit, then only the tag is shown. Otherwise, it suffixes the tag name with the number of additional commits on top of the tagged object and the abbreviated object name of the most recent commit.
git describe 7197e77abcb65a71d0b21d67beb24f153a96055e
v4.17-rc3-21-g7197e77abcb6
git describe --contains 7197e77abcb65a71d0b21d67beb24f153a96055e
v4.18-rc1~155^2~11
The most recent tag before 7197e77abcb65a71d0b21d67beb24f153a96055e is v4.17-rc3, but the first tag that includes that commit is v4.18-rc1.
I hope that helps explain the confusing output you were seeing.
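To make the git describe output above easier to read, here is a small sketch that splits it into its three parts. explain_describe is a made-up helper name and it assumes the plain TAG-N-gHASH form shown above.

```shell
#!/bin/sh
# Split "git describe" output of the form TAG-N-gHASH into its parts.
# explain_describe is a made-up helper name for illustration.
explain_describe() {
    d=$1
    hash=${d##*-g}      # abbreviated object name after the final "-g"
    rest=${d%-g*}       # TAG-N
    count=${rest##*-}   # number of commits on top of the tag
    tag=${rest%-*}      # most recent tag reachable from the commit
    printf '%s is %s commits after %s\n' "$hash" "$count" "$tag"
}

explain_describe v4.17-rc3-21-g7197e77abcb6
```

This prints "7197e77abcb6 is 21 commits after v4.17-rc3", which says nothing about which release first shipped the commit; that is what the --contains variant answers.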
Edit:
Fixed commit to be the one in question not its parent.
Last edited by loqs (2018-08-25 16:54:28)
Offline
The most recent tag before 7197e77abcb65a71d0b21d67beb24f153a96055e is v4.17-rc3, but the first tag that includes that commit is v4.18-rc1.
I hope that helps explain the confusing output you were seeing.
Oh, I think I understand now. So that commit was made during the v4.17 kernel builds, but not merged until the v4.18 builds?
@nic3-14159 if you checkout the latest stable release (use a clean directory or at least a different branch to preserve the bisect) then revert 7197e77abcb65a71d0b21d67beb24f153a96055e does that kernel work?
I just tried this on the 4.18.4-arch1 release, and the kernel does indeed boot after reverting that commit.
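The revert test described above could look roughly like this when done in a separate worktree, so the bisect state in the main checkout is left alone. This is a sketch under assumptions: run and revert_in_worktree are illustrative helper names, the worktree path is a placeholder, and whether the revert applies cleanly depends on the release you start from.

```shell
#!/bin/sh
# Sketch: test a revert in a separate worktree. With DRY_RUN=1 the commands
# are only printed, which is a cheap way to check them before running.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"      # only show what would be executed
    else
        "$@"
    fi
}

# revert_in_worktree TAG COMMIT DIR
revert_in_worktree() {
    run git worktree add --detach "$3" "$1" &&
    run git -C "$3" revert --no-edit "$2"
}

# Example, run from inside the kernel repository (path is a placeholder):
#   revert_in_worktree v4.18.5 7197e77abcb65a71d0b21d67beb24f153a96055e ../linux-revert-test
```

The kernel is then built from the new directory as usual.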
Offline
loqs wrote: The most recent tag before 7197e77abcb65a71d0b21d67beb24f153a96055e is v4.17-rc3, but the first tag that includes that commit is v4.18-rc1.
I hope that helps explain the confusing output you were seeing.
Oh, I think I understand now. So that commit was made during the v4.17 kernel builds, but not merged until the v4.18 builds?
Exactly
loqs wrote:@nic3-14159 if you checkout the latest stable release (use a clean directory or at least a different branch to preserve the bisect) then revert 7197e77abcb65a71d0b21d67beb24f153a96055e does that kernel work?
I just tried this on the 4.18.4-arch1 release, and the kernel does indeed boot after reverting that commit.
That would indicate you have found the source, although, as the bug seems intermittent, it probably needs more testing before notifying upstream.
perl scripts/get_maintainer.pl kernel/time/clocksource.c
John Stultz <john.stultz@linaro.org> (supporter:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)
Thomas Gleixner <tglx@linutronix.de> (supporter:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)
Stephen Boyd <sboyd@kernel.org> (reviewer:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)
linux-kernel@vger.kernel.org (open list:TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER)
Unfortunately there is no subsystem-only list, so if you do report it upstream it should go to the main linux-kernel mailing list.
reporting-bugs covers more information about reporting kernel bugs.
Offline
Hi,
I am not a Linux expert and have never downgraded a kernel before. I am facing the same problem after a recent update. After reading all the posts and the wiki, I think the following command will let me downgrade the linux kernel, firmware and headers. Please correct me if I am wrong; I just wanted to confirm before trying to downgrade.
pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.17.14.arch1-1-x86_64.pkg.tar.xz
https://archive.archlinux.org/packages/l/linux-firmware/linux-firmware-20180717.8d69bab-1-any.pkg.tar.xz
https://archive.archlinux.org/packages/l/linux-headers/linux-headers-4.17.14.arch1-1-x86_64.pkg.tar.xz
Offline
Hi,
I am not a Linux expert and have never downgraded a kernel before. I am facing the same problem after a recent update. After reading all the posts and the wiki, I think the following command will let me downgrade the linux kernel, firmware and headers. Please correct me if I am wrong; I just wanted to confirm before trying to downgrade.
pacman -U https://archive.archlinux.org/packages/l/linux/linux-4.17.14.arch1-1-x86_64.pkg.tar.xz https://archive.archlinux.org/packages/l/linux-firmware/linux-firmware-20180717.8d69bab-1-any.pkg.tar.xz https://archive.archlinux.org/packages/l/linux-headers/linux-headers-4.17.14.arch1-1-x86_64.pkg.tar.xz
Yes, I believe that is correct. Just make sure that all those URLs are either on one line, separated by spaces, or that you run pacman -U separately for each URL.
However, if you had those packages installed before, they are likely still in your pacman cache, located at /var/cache/pacman/pkg. By running pacman -U on the same packages in that directory, you can save the time of downloading them again:
cd /var/cache/pacman/pkg
pacman -U linux-4.17.14.arch1-1-x86_64.pkg.tar.xz linux-firmware-20180717.8d69bab-1-any.pkg.tar.xz linux-headers-4.17.14.arch1-1-x86_64.pkg.tar.xz
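Before relying on the cache, it can be checked which of the files are actually there. A small sketch follows; cached_pkgs is a hypothetical helper name, not a pacman feature.

```shell
#!/bin/sh
# Sketch: list which of the wanted package files are already in the pacman
# cache. cached_pkgs is a hypothetical helper name.
# cached_pkgs CACHE_DIR FILE...
cached_pkgs() {
    dir=$1
    shift
    for f in "$@"; do
        if [ -f "$dir/$f" ]; then
            printf '%s\n' "$dir/$f"     # usable with pacman -U as is
        else
            echo "not cached: $f" >&2   # would need the archive URL instead
        fi
    done
}

# Example:
#   cached_pkgs /var/cache/pacman/pkg \
#       linux-4.17.14.arch1-1-x86_64.pkg.tar.xz \
#       linux-headers-4.17.14.arch1-1-x86_64.pkg.tar.xz
```

Anything reported as not cached has to be installed from the archive URL as in the original command.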
Offline
Thank you.
Offline
Yep, that would seem to be the commit. I rebuilt my kernel with a patch like so and I can boot and reboot to my heart's content.
Last edited by gregfrankenstein (2018-08-26 20:32:17)
Offline
With an independent confirmation you could report upstream or wait for viktorj's bisection results to see if they match.
Offline
By the way, here is the commit that initially introduced the kthread. So apparently it wasn't just silliness.
https://git.kernel.org/pub/scm/linux/ke … 7fd34a34ab
commit 01548f4d3e8e94caf323a4f664eb347fd34a34ab
Author: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date: Tue Aug 18 17:09:42 2009 +0200
clocksource: Avoid clocksource watchdog circular locking dependency
stop_machine from a multithreaded workqueue is not allowed because
of a circular locking dependency between cpu_down and the workqueue
execution. Use a kernel thread to do the clocksource downgrade.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090818170942.3ab80c91@skybase>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |
Online
Hi everybody, one more here with the same problem, on an HP G61 laptop with an Intel Core Duo.
I've worked around the problem temporarily by adding "nolapic" to the kernel parameters (GRUB_CMDLINE_LINUX in /etc/default/grub) until you guys (or the upstream people) find the real solution. I got it from this old thread: https://bbs.archlinux.org/viewtopic.php?id=140426 Now only 1 CPU shows up, but the system boots and works fine for what I use it for.
I hope it helps somebody. And if someone is strongly against this workaround because it carries a risk I haven't realized, feel free to tell us about it. Thank you!
Offline
nolapic disables all but one cpu core, so you lose performance.
You could try to manually set a specific clocksource on the commandline instead of letting the kernel choose it. Maybe that would prevent the stall as well?
https://www.kernel.org/doc/html/v4.18/a … eters.html
Online
4.18.5.arch1-1 doesn't work for me :-(
Offline
nolapic disables all but one cpu core, so you lose performance.
You could try to manually set a specific clocksource on the commandline instead of letting the kernel choose it. Maybe that would prevent the stall as well?
https://www.kernel.org/doc/html/v4.18/a … eters.html
I think I got my CPU's name wrong. My laptop has an Intel Pentium(R) Dual-Core CPU T4300 @ 2.10GHz, which I think is not the same as an Intel Core Duo. My fault, sorry. So that means it only has 2 x86-64 cores. Using "nolapic" and "clocksource=X86-64" on the same line does not have any effect, and using "clocksource=X86-64" on its own doesn't solve the boot failure either.
Since working with just 1 of 2 cores does not affect me too much, I will stay with just the "nolapic" parameter while waiting for the final solution.
Thank you for the ideas by the way.
Offline
I think the valid options for clocksource on X86-64, given
clocksource= Override the default clocksource
Format: <string>
Override the default clocksource and use the clocksource
with the name specified.
Some clocksource names to choose from, depending on
the platform:
[all] jiffies (this is the base, fallback clocksource)
[ACPI] acpi_pm
[ARM] imx_timer1,OSTS,netx_timer,mpu_timer2,
pxa_timer,timer3,32k_counter,timer0_1
[X86-32] pit,hpet,tsc;
scx200_hrt on Geode; cyclone on IBM x440
[MIPS] MIPS
[PARISC] cr16
[S390] tod
[SH] SuperH
[SPARC64] tick
[X86-64] hpet,tsc
are jiffies, hpet, tsc and acpi_pm.
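So the parameter takes a clocksource name, for example clocksource=hpet, not the platform tag. Here is a sketch of adding such a parameter to GRUB_CMDLINE_LINUX; add_kernel_param is a hypothetical helper, it assumes the usual GRUB_CMDLINE_LINUX="..." line format, and it is best tried on a copy of /etc/default/grub first.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX in a grub defaults
# file. add_kernel_param is a hypothetical helper; try it on a copy first.
# add_kernel_param FILE PARAM
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on that line.
    if grep -q "^GRUB_CMDLINE_LINUX=\".*$param" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # First expression appends the parameter before the closing quote,
    # second one removes the leading space left when the value was empty.
    if sed -e "s|^\(GRUB_CMDLINE_LINUX=\".*\)\"|\1 $param\"|" \
           -e "s|^GRUB_CMDLINE_LINUX=\" |GRUB_CMDLINE_LINUX=\"|" \
           "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # keeps the owner and mode of the original
    fi
    rm -f "$tmp"
}

# Example (as root), followed by regenerating the grub configuration:
#   add_kernel_param /etc/default/grub clocksource=hpet
#   grub-mkconfig -o /boot/grub/grub.cfg
```

The GRUB_CMDLINE_LINUX_DEFAULT line is deliberately left untouched by the pattern.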
Offline
4.18.5.arch1-1 doesn't work for me :-(
Me neither, it takes 5 to 10 reboot attempts before a successful boot. I guess I can roll back several kernel versions, but for the time being, I'll just limit reboots.
This problem is significant. I would hate to be a potential new Arch user with a Core Duo, or other impacted hardware, attempting to get up and running. Seems this might be worthy of some kind of post in the "news" section of the main page.
I certainly appreciate all the more advanced users who have gone to work to solve this; my skill set just doesn't allow participation in those efforts. I did do a practice run of chrooting into my filesystem from a live CD in case the time comes when I can't get in. So there's a silver lining in this problem: my horizons have been expanded.
_________________________
Asus X200CA Notebook
Offline
I finished testing manually setting the clocksource option, and it does appear to work. It seems that setting it to any valid clocksource will allow the system to boot without stalling. You can get a list of available clocksources for your system by running
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
On my system it outputted hpet and acpi_pm, but jiffies and refined-jiffies also worked for me. I tested each clocksource 6 times, and each time it booted successfully.
Edit: Tested on linux 4.18.3. I would have tested on 4.18.5, but I got a signature error and didn't want to spend time fixing it or waiting to re-download the update.
One thing that I found interesting is that after running
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
on my system in one of the few instances where it did boot without the clocksource parameter set, it gave me "hpet". However, setting clocksource to hpet on the kernel command line allowed it to boot consistently.
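For anyone who wants to try each candidate in turn, the list from sysfs can be turned into ready-made boot parameters. This is just a sketch: clocksource_params is a made-up helper name and CS_DIR is only there so the path can be overridden.

```shell
#!/bin/sh
# Sketch: turn the kernel's list of available clocksources into candidate
# boot parameters. clocksource_params is a made-up helper name; the sysfs
# path is the one used in the commands above.
CS_DIR=${CS_DIR:-/sys/devices/system/clocksource/clocksource0}

clocksource_params() {
    # available_clocksource holds space-separated names on one line
    for name in $(cat "$CS_DIR/available_clocksource"); do
        printf 'clocksource=%s\n' "$name"
    done
}

# Example:
#   clocksource_params
```

Each printed line can be added to the kernel command line one at a time and tested over several reboots.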
Last edited by nic3-14159 (2018-08-26 20:02:36)
Offline
@progandy you think there is enough information to report the issue upstream?
Offline
I think there's easily enough information, but maybe in the interim, the Arch maintainers can temporarily patch it out, given how fatal the issue is on some hardware? I moved the patch I used to GitLab because wow, Pastebin has gotten a lot worse since I used it last.
Offline