
#1 2014-11-02 20:54:14

Potomac
Member
Registered: 2011-12-25
Posts: 481

random freeze at boot with kernel 3.17.x

I get a random freeze roughly once every 5 to 10 boots with kernels 3.17.1 and 3.17.2.

The freeze usually occurs shortly after the kernel is loaded. I can see these messages on screen and then nothing more happens:

:: running early hook [udev]
:: running hook [udev]
:: Triggering uevents...

Sometimes the freeze happens a few seconds after systemd starts (for example, after the "mount /home" message).

I suspect the new way the "intel-ucode" package is loaded; maybe it introduces some collision or timing problem?

https://www.archlinux.org/news/changes- … deupdates/

My configuration:
Arch Linux 64-bit
CPU: Pentium Dual-Core E6800, 3.33 GHz
ATI Radeon HD 4650 PCIe (open source radeon driver, early KMS start)

I use early KMS (kernel mode setting) start; in /etc/mkinitcpio.conf I have this line:

MODULES="radeon"

Last edited by Potomac (2014-11-03 00:56:52)

Offline

#2 2014-11-02 22:30:32

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

In the end, intel-ucode is not the culprit.

When the freeze occurs after systemd has started, I get this kernel trace message (120 seconds after the freeze). I took a photo of my screen:

https://bugzilla.kernel.org/attachment.cgi?id=156341

Offline

#3 2014-11-03 00:34:46

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

I upgraded to the testing version of systemd (217.3) and I still have the bug.

Last edited by Potomac (2014-11-03 00:57:17)

Offline

#4 2014-11-05 23:36:45

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

Am I the only one with this bug?

Offline

#5 2014-11-06 03:09:15

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

If I downgrade to kernel 3.16.4 I don't have this bug

Offline

#6 2014-11-06 06:54:15

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

When the freeze occurs I can read these error messages:

:: running hook [udev]
:: Triggering uevents...

worker [53] /devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:1/7:0:1:0 is taking a long time
worker [60] /devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:0/6:0:0:0/block/sr0 is taking a long time

https://bugzilla.kernel.org/attachment.cgi?id=156871

Offline

#7 2014-11-06 19:38:24

nadley
Member
Registered: 2014-09-01
Posts: 7

Re: random freeze at boot with kernel 3.17.x

I have the same problem with the 3.17.2 and 3.17.1 kernels. About a third of all boots hang after "Triggering uevents...". Other times the boot process continues a bit longer and then hangs when mounting either a partition from my SSD or an NFS share. Sometimes the system boots OK, which seems to indicate a race condition somewhere.

Disabling intel-ucode does not help, but reverting back to 3.16.4 solves the problem for me. I have a 64-bit system with Intel i7-2600K and use nouveau with a GeForce 210. I do not load KMS early.

Offline

#8 2014-11-06 19:51:58

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

OK, thanks for your report. I have created a bug report in the kernel's Bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=87581


Can you tell me more about your PC configuration?

I want to collect clues about what triggers the bug. Do you use an IDE/SATA controller PCIe card?
Do you have an IDE hard disk?

Do you have any NTFS partitions on your hard disks?

I think some people don't see this bug because they don't have whatever triggers it. We must find the culprit; maybe a driver, or a race condition on some PCI devices?

Last edited by Potomac (2014-11-06 20:00:04)

Offline

#9 2014-11-06 22:07:46

nadley
Member
Registered: 2014-09-01
Posts: 7

Re: random freeze at boot with kernel 3.17.x

Complete computer specs:

Motherboard: Gigabyte Z68P-DS3 with 16GB RAM
CPU: Intel i7 2600K 3.4GHz
PCIe graphics card: ASUS EN210 using nouveau driver
PCIe NIC: Intel PRO/1000 using e1000e driver

Two SATA SSDs connected to the motherboard:
Crucial M500 240GB (with system installed on an ext4 partition, no swap)
SanDisk Ultra Plus 256GB (not in fstab)

A SATA DVD writer, internal USB card reader and external USB sound card are also connected.

No IDE drives and no NTFS partitions. I may try to upgrade the firmware on the Crucial SSD, since I believe that the one I have is pretty old.

With 3.17 I also had a problem shutting down the machine (this might be an unrelated problem). The OS seemed to shut down correctly, but it did not halt the computer. This also works correctly in previous versions of the kernel.

I have other computers with similar (but not identical) hardware, and may upgrade them to 3.17 to try to reproduce the problem when I have the time.

Offline

#10 2014-11-06 23:08:05

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

What we have in common:

- we use the same OS (Arch Linux 64-bit)

- the same brand of motherboard (Gigabyte), but not the same model (I have an old one, a Gigabyte GA-P31-DSL3). Gigabyte motherboards sometimes share the same defects in ACPI management and BIOS design, and sometimes use the same chips for things like temperature/voltage sensors and the onboard sound card (Realtek HD audio)

- we both use a SATA DVD writer (LiteOn ATAPI iHAS124 C)

But I don't have an SSD. I have only 3 SATA hard disks connected to the motherboard and 2 IDE hard disks (one connected to the IDE port on the motherboard and one connected to the JMicron SATA/IDE controller PCIe card).

My internal network card: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 01)

I hope we will find the culprit; maybe it is a problem with ACPI or a concurrency problem (collision, race condition).

If you have the time and a fast PC, you could do a "git bisect" to find the commit that introduced the bug:

https://www.kernel.org/pub/software/scm … k2009.html
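
For anyone unfamiliar with it, the idea behind "git bisect" is a binary search over the commit history between a known-good and a known-bad kernel. Below is a minimal Python sketch of that search, not git's actual implementation. The names first_bad_commit and is_bad are hypothetical, the history is assumed to be linear, and the list of integers is only a stand-in for real commits. With an intermittent bug like this one, is_bad would in practice mean "boot this kernel several times and see whether it ever hangs".

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first bad commit in a linear history.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    Returns (first_bad, number_of_kernels_tested).
    """
    good, bad = 0, len(commits) - 1   # indices of the known-good / known-bad ends
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1                   # one kernel build + boot test per step
        if is_bad(commits[mid]):
            bad = mid                 # bug already present: look earlier
        else:
            good = mid                # bug not present yet: look later
    return commits[bad], tested


# Hypothetical history of 1000 commits where commit 637 introduced the bug.
history = list(range(1000))
culprit, tested = first_bad_commit(history, lambda c: c >= 637)
print(culprit, tested)
```

Each step halves the remaining range, so even a few thousand commits need only about a dozen kernel builds.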

Last edited by Potomac (2014-11-06 23:17:55)

Offline

#11 2014-11-08 15:33:07

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

I suspect one of these commits to be the culprit for the race condition:


commit 1d3408209d43d2e72b5d8dbfb9b60fece399d75b
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date:   Sat Aug 16 18:47:15 2014 +0200

    x86: Tell irq work about self IPI support
    
    commit 3010279f0fc36f0388872203e63ca49912f648fd upstream.
    
    x86 supports irq work self-IPIs when local apic is available. This is
    partly known on runtime so lets implement arch_irq_work_has_interrupt()
    accordingly.
    
    This should be safely called after setup_arch().
    
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6fd5de08a5337d0d601f6361671813df4f013da9
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date:   Sat Aug 16 18:37:19 2014 +0200

    irq_work: Force raised irq work to run on irq work interrupt
    
    commit 76a33061b9323b7fdb220ae5fa116c10833ec22e upstream.
    
    The nohz full kick, which restarts the tick when any resource depend
    on it, can't be executed anywhere given the operation it does on timers.
    If it is called from the scheduler or timers code, chances are that
    we run into a deadlock.
    
    This is why we run the nohz full kick from an irq work. That way we make
    sure that the kick runs on a virgin context.
    
    However if that's the case when irq work runs in its own dedicated
    self-ipi, things are different for the big bunch of archs that don't
    support the self triggered way. In order to support them, irq works are
    also handled by the timer interrupt as fallback.
    
    Now when irq works run on the timer interrupt, the context isn't blank.
    More precisely, they can run in the context of the hrtimer that runs the
    tick. But the nohz kick cancels and restarts this hrtimer and cancelling
    an hrtimer from itself isn't allowed. This is why we run in an endless
    loop:
    
    	Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
    	CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
    	Workqueue: btrfs-endio-write normal_work_helper [btrfs]
    	 ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
    	 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
    	 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
    	Call Trace:
    	 <NMI>  [<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
    	 [<ffffffff8a7ef928>] panic+0xd4/0x207
    	 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
    	 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
    	 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
    	 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
    	 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20
    	 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
    	 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
    	 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390
    	 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
    	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
    	 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
    	 [<ffffffff8a008268>] do_nmi+0xb8/0x100
    	 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
    	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
    	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
    	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
    	 <<EOE>>  <IRQ>  [<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
    	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
    	 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
    	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
    	 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
    	 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
    	 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
    	 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
    	 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
    	 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
    	 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
    	 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
    	 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70
    	 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
    	 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
    	 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
    	 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
    	 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
    	 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
    	 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
    	 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80
    
    To fix this we force non-lazy irq works to run on irq work self-IPIs
    when available. That ability of the arch to trigger irq work self IPIs
    is available with arch_irq_work_has_interrupt().
    
    Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
    Reported-by: Dave Jones <davej@redhat.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b677a767bcb3b9fbb4c7f921e5ba9577f1577049
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sat Sep 6 15:43:02 2014 +0200

    irq_work: Introduce arch_irq_work_has_interrupt()
    
    commit c5c38ef3d70377dc504a6a3f611a3ec814bc757b upstream.
    
    The nohz full code needs irq work to trigger its own interrupt so that
    the subsystem can work even when the tick is stopped.
    
    Lets introduce arch_irq_work_has_interrupt() that archs can override to
    tell about their support for this ability.
    
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

https://gitorious.org/opensuse/kernel/c … ece399d75b

http://gitorious.ti.com/ti-linux-kernel … 10833ec22e

http://gitorious.ti.com/ti-linux-kernel … c814bc757b

Last edited by Potomac (2014-11-08 15:38:30)

Offline

#12 2014-11-08 15:45:46

nadley
Member
Registered: 2014-09-01
Posts: 7

Re: random freeze at boot with kernel 3.17.x

After a lot of trial and error, my system finally boots and shuts down correctly now (at least for the last 15-20 attempts). I don't know exactly what fixed it, but here is some of what I tried first that had no effect on the problem:

* Updated BIOS to latest version.

* Disabled stuff in BIOS that I don't use (parallel port, serial port, onboard VGA and LAN ports, etc)

* Removed USB sound card and card reader.

* Removed all partitions except / in fstab

* Tried to boot with acpi=off

I then noticed that the SATA controller mode in BIOS was set to IDE (which is the default). I changed this to AHCI (and rebuilt my initramfs), and the system suddenly started every time without freezing. However, my home directory is an NFS share, and it did not get mounted. Somehow the NetworkManager-wait-online.service that I use to delay the mounting until the network is active stopped working. I removed NetworkManager and now use dhcpcd instead, and everything seems to work as it should.

I don't know if the boot problem is fixed by using AHCI, or if it just changes the timing so that the race condition does not occur as often, but I will use the 3.17.2 kernel from now on and report any further problems here.

Offline

#13 2014-11-09 12:42:27

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

I'm doing a git bisect between kernel 3.16.7 (which doesn't have the bug) and kernel 3.17 (which has the bug). I hope I will find the commit that introduced the bug.

Offline

#14 2014-11-10 23:00:02

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

After doing a git bisect, I found the commit that introduced the bug.

the first bad commit is :

first bad commit: [045065d8a300a37218c548e9aa7becd581c6a0e8] [SCSI] fix qemu boot hang problem

http://git.kernel.org/cgit/linux/kernel … d581c6a0e8

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9c44392..ce62e87 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q)
 	blk_requeue_request(q, req);
 	atomic_dec(&sdev->device_busy);
 out_delay:
-	if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
+	if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }

I think I can create a patch in order to revert this faulty commit
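
To make the one-line change easier to read, here is a toy Python model of the two versions of the test at the out_delay label. It only models the boolean condition taken from the diff above. The function names are hypothetical, device_busy stands in for atomic_read(&sdev->device_busy), and blocked stands in for scsi_device_blocked(sdev). It says nothing about why one form hangs on some hardware.

```python
def delay_before_commit(device_busy, blocked):
    """Test at out_delay before commit 045065d8: busy != 0 and not blocked."""
    return device_busy != 0 and not blocked


def delay_after_commit(device_busy, blocked):
    """Test at out_delay after commit 045065d8: busy == 0 and not blocked."""
    return device_busy == 0 and not blocked


# Truth table for the cases where the device is not blocked.
for busy in (0, 1):
    print(busy, delay_before_commit(busy, False), delay_after_commit(busy, False))
```

When the device is not blocked, the two versions call blk_delay_queue() in exactly opposite cases; when it is blocked, neither does.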

Offline

#15 2014-11-11 01:58:02

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

I created a patch that solves the problem:

--- a/drivers/scsi/scsi_lib.c	2014-10-05 21:23:04.000000000 +0200
+++ b/drivers/scsi/scsi_lib.c	2014-11-11 00:24:29.656031943 +0100
@@ -1775,7 +1775,7 @@ static void scsi_request_fn(struct reque
 	blk_requeue_request(q, req);
 	atomic_dec(&sdev->device_busy);
 out_delay:
-	if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
+	if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
 

https://bugs.archlinux.org/task/42692?getfile=12416

This patch reverts the faulty commit 045065d8a300a37218c548e9aa7becd581c6a0e8.

I rebuilt kernel 3.17.2 with this patch and now the bug is gone, no problems.

Last edited by Potomac (2014-11-11 02:15:52)

Offline

#16 2014-11-11 07:39:46

Ledti
Member
Registered: 2010-07-31
Posts: 122
Website

Re: random freeze at boot with kernel 3.17.x

Potomac wrote:

I created a patch that solves the problem:

--- a/drivers/scsi/scsi_lib.c	2014-10-05 21:23:04.000000000 +0200
+++ b/drivers/scsi/scsi_lib.c	2014-11-11 00:24:29.656031943 +0100
@@ -1775,7 +1775,7 @@ static void scsi_request_fn(struct reque
 	blk_requeue_request(q, req);
 	atomic_dec(&sdev->device_busy);
 out_delay:
-	if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
+	if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
 

https://bugs.archlinux.org/task/42692?getfile=12416

This patch reverts the faulty commit 045065d8a300a37218c548e9aa7becd581c6a0e8.

I rebuilt kernel 3.17.2 with this patch and now the bug is gone, no problems.

Nice one. That patch fixes the issue I described here. Oddly enough after rebooting a few times (testing stability), I noticed a sporadic boot-time freeze similar to the OP. Sticking with my original assessment that 3.17 is rather buggy overall.

Last edited by Ledti (2014-11-11 08:36:27)

Offline

#17 2014-11-11 08:37:14

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

Ledti: you could try kernel 3.18-rc4 with my patch:

https://aur.archlinux.org/packages/linux-mainline/

My patch also works with kernel 3.18-rc4.

Maybe kernel 3.18-rc4 with my patch will solve all your bugs related to kernel 3.17.

To be reasonably confident that the bug is gone, you need to reboot at least 15 times in a row; if you see no freezes or hangs at boot, the bug is very likely gone.
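
As a rough check on the 15-reboot rule, here is a small Python sketch. It assumes that each boot freezes independently with a fixed probability, taken from the "every 5~10 boots" estimate in the first post; the function names are hypothetical and the independence assumption may not hold for a timing-dependent bug.

```python
import math


def chance_of_missing_bug(p_freeze, boots):
    """Probability that `boots` boots in a row all succeed although the bug is still there."""
    return (1.0 - p_freeze) ** boots


def boots_needed(p_freeze, max_miss_chance):
    """Smallest number of clean boots that brings the miss chance down to max_miss_chance."""
    return math.ceil(math.log(max_miss_chance) / math.log(1.0 - p_freeze))


for p in (1 / 5, 1 / 10):   # one freeze every 5 boots, one freeze every 10 boots
    print(p, round(chance_of_missing_bug(p, 15), 3), boots_needed(p, 0.05))
```

Under these assumptions, 15 clean boots leave about a 3.5% chance of missing the bug if it strikes once in 5 boots, but about a 20% chance if it strikes once in 10 boots; in that case about 29 clean boots are needed to get below 5%.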

Last edited by Potomac (2014-11-11 08:40:29)

Offline

#18 2014-11-16 18:11:44

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

There is another way to work around my bug, with this patch:

--- a/drivers/scsi/scsi_lib.c   2014-10-05 21:23:04.000000000 +0200
+++ b/drivers/scsi/scsi_lib.c   2014-11-16 17:39:16.819674725 +0100
@@ -1776,7 +1776,7 @@ static void scsi_request_fn(struct reque
    atomic_dec(&sdev->device_busy);
 out_delay:
    if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
-      blk_delay_queue(q, SCSI_QUEUE_DELAY);
+      blk_delay_queue(q, 40);
 }
 
 static inline int prep_to_mq(int ret)

It gives the queue a little more time (40 ms instead of 3 ms, which is the default value of SCSI_QUEUE_DELAY).

Offline

#19 2014-11-19 20:23:46

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

I solved the mystery.

I found that the bug (random hang at boot with kernels 3.17 and 3.18) is triggered by the combination of 3 elements:

- a SATA DVD burner (LiteOn iHAS124 C) on an ICH7 SATA controller
- a Gigabyte GA-P31-DSL3 motherboard (BIOS F10A, ICH7 controller, Intel P31 chipset)
- commit 74665016086615bbaa3fa6f83af410a0a4e029ee (scsi: convert host_busy to atomic_t)

If I connect this SATA DVD burner and a SATA hard disk to the SATA ports of the motherboard, the bug occurs (but only with kernels 3.17 and 3.18; there are no problems with older kernels or with Windows 7).

If I disconnect the SATA DVD burner, the bug is gone: no problems with kernels 3.17 and 3.18.

And if I connect the SATA DVD burner to my JMicron SATA/IDE PCIe card, there is no problem either. This is a perfect workaround for me, because I can use kernel 3.17/3.18 without problems in this configuration.

But I don't know which element to blame: my Gigabyte motherboard (a faulty BIOS?), or the use of "atomic_t" in the SCSI source code (an inappropriate way to handle SATA devices that breaks compatibility with some PC configurations?).

Offline

#20 2014-11-23 17:32:32

Potomac
Member
Registered: 2011-12-25
Posts: 481

Re: random freeze at boot with kernel 3.17.x

Another patch, made by a Linux developer, that solves the bug:

https://bugzilla.kernel.org/attachment.cgi?id=158411

Last edited by Potomac (2014-11-23 17:32:55)

Offline
