watchdog: BUG: soft lockup - CPU#4 stuck for 45s! [worker:131043]

Gruntz · 2020-01-30 20:26:00

Hello all,

I have a Supermicro motherboard with two Xeon X5650 CPUs. I have 64 GB of RAM and several VMs on it.

I have one Windows 10 VM that I use for gaming (with GPU pass-through).
I have another with Arch Linux for work (software development, testing...). I use this one as my main desktop. It has GPU pass-through too.
I have three to five more for a database server, a GitLab server, and things like that.

From time to time my host machine crashes. The log then contains a lot of "watchdog: BUG: soft lockup - CPU#4 stuck for 45s! [worker:131043]" messages.
The affected processor is different each time. After the messages appear, the computer slowly becomes unresponsive and eventually hangs completely.

Do you have any clue what this could be? I read about the problem. It appears when a CPU is stuck on a task for a long time, or something like that.
But that is normal for VMs. Can I work around it?

Best regards.

loqs · 2020-01-30 20:47:38

See https://www.kernel.org/doc/Documentatio … chdogs.txt for the details; the message indicates a kernel task that has been stuck for more than 20 seconds.
Edit:
The backtrace triggered by the lockup detection should help locate the cause.
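
For reference, the 20-second figure is twice the `kernel.watchdog_thresh` sysctl, which defaults to 10 seconds. A minimal Python sketch to check the value on a given host, assuming the sysctl is exposed at `/proc/sys/kernel/watchdog_thresh`; the function name is made up for illustration:

```python
from pathlib import Path

# Assumed location of the sysctl on a typical Linux system.
WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")


def soft_lockup_threshold(watchdog_thresh: int) -> int:
    """Soft lockup threshold in seconds: twice watchdog_thresh."""
    return 2 * watchdog_thresh


if __name__ == "__main__":
    if WATCHDOG_THRESH.exists():
        thresh = int(WATCHDOG_THRESH.read_text().strip())
        print(f"watchdog_thresh={thresh}s, "
              f"soft lockup reported after {soft_lockup_threshold(thresh)}s")
    else:
        print("watchdog_thresh is not available on this system")
```

Raising `watchdog_thresh` would only delay the warning; it would not remove whatever keeps the CPU stuck.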

Last edited by loqs (2020-01-30 20:55:07)

Gruntz · 2020-01-31 07:07:13

Hello,

Thank you for your reply, loqs. The doc you sent from kernel.org was very helpful for understanding the problem.

Here is my backtrace:

Jan 30 22:14:43 darkstar kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 44s! [worker:131042]
Jan 30 22:14:43 darkstar kernel: Modules linked in: cdc_acm rpcsec_gss_krb5 dm_mod vhost_net tun vhost macvtap tap macvlan bonding xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ip>
Jan 30 22:14:43 darkstar kernel:  vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
Jan 30 22:14:43 darkstar kernel: CPU: 2 PID: 131042 Comm: worker Tainted: G             L    5.4.12-arch1-1 #1
Jan 30 22:14:43 darkstar kernel: Hardware name: Gateway GT350 F1/GT350 F1, BIOS P03     07/26/2010
Jan 30 22:14:43 darkstar kernel: RIP: 0010:smp_call_function_many+0x21d/0x280
Jan 30 22:14:43 darkstar kernel: Code: e8 88 c8 7c 00 3b 05 d6 c1 20 01 89 c7 0f 83 7a fe ff ff 48 63 c7 48 8b 0b 48 03 0c c5 20 f9 f7 9d 8b 41 18 a8 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f>
Jan 30 22:14:43 darkstar kernel: RSP: 0018:ffffbc648b4ffcf8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Jan 30 22:14:43 darkstar kernel: RAX: 0000000000000003 RBX: ffff9a7c5f8aba40 RCX: ffff9a7c5f8312c0
Jan 30 22:14:43 darkstar kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
Jan 30 22:14:43 darkstar kernel: RBP: ffffffff9ce7ee90 R08: ffff9a7c5f8aba48 R09: 0000000000000006
Jan 30 22:14:43 darkstar kernel: R10: ffff9a7c5f8aba48 R11: 0000000000000005 R12: ffff9a7c5f8aa340
Jan 30 22:14:43 darkstar kernel: R13: ffff9a7c5f8aba48 R14: 0000000000000001 R15: 0000000000000140
Jan 30 22:14:43 darkstar kernel: FS:  00007f58b0dbf700(0000) GS:ffff9a7c5f880000(0000) knlGS:0000000000000000
Jan 30 22:14:43 darkstar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 30 22:14:43 darkstar kernel: CR2: 0000000000000030 CR3: 0000000808eb8004 CR4: 00000000000226e0
Jan 30 22:14:43 darkstar kernel: DR0: 000000007ffe0270 DR1: 00000000002f2040 DR2: 0000000000000000
Jan 30 22:14:43 darkstar kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 30 22:14:43 darkstar kernel: Call Trace:
Jan 30 22:14:43 darkstar kernel:  flush_tlb_mm_range+0xed/0x150
Jan 30 22:14:43 darkstar kernel:  tlb_flush_mmu+0xa4/0x160
Jan 30 22:14:43 darkstar kernel:  tlb_finish_mmu+0x3d/0x70
Jan 30 22:14:43 darkstar kernel:  unmap_region+0xf4/0x130
Jan 30 22:14:43 darkstar kernel:  __do_munmap+0x255/0x4c0
Jan 30 22:14:43 darkstar kernel:  __vm_munmap+0x67/0xb0
Jan 30 22:14:43 darkstar kernel:  __x64_sys_munmap+0x28/0x30
Jan 30 22:14:43 darkstar kernel:  do_syscall_64+0x4e/0x140
Jan 30 22:14:43 darkstar kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 30 22:14:43 darkstar kernel: RIP: 0033:0x7f5c028d10db
Jan 30 22:14:43 darkstar kernel: Code: 8b 15 a9 5d 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 0>
Jan 30 22:14:43 darkstar kernel: RSP: 002b:00007f58b0dbd038 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Jan 30 22:14:43 darkstar kernel: RAX: ffffffffffffffda RBX: 00007f58da31e9c0 RCX: 00007f5c028d10db
Jan 30 22:14:43 darkstar kernel: RDX: 0000000000000000 RSI: 0000000000801000 RDI: 00007f58d9b1e000
Jan 30 22:14:43 darkstar kernel: RBP: 00007f58b4dc79c0 R08: 0000000000000000 R09: 0000000000000001
Jan 30 22:14:43 darkstar kernel: R10: 00007f58d76000c0 R11: 0000000000000206 R12: 00007f5c029bb020
Jan 30 22:14:43 darkstar kernel: R13: 0000000002800000 R14: 00007f58b0dbd140 R15: 00007f58b0dbf700

Do you have any idea what may cause the problem?

Best regards.

loqs · 2020-02-01 21:55:58

Are there any other threads that are also stuck which might be contending for the lock?
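
One way to check is to group the soft lockup lines in the kernel log by CPU. A minimal Python sketch, assuming the messages have the same format as the one in the thread title; `group_lockups` and the script name below are hypothetical:

```python
import re
import sys
from collections import defaultdict

# Matches lines such as:
#   watchdog: BUG: soft lockup - CPU#2 stuck for 44s! [worker:131042]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def group_lockups(lines):
    """Return {cpu: [(comm, pid, seconds), ...]} for every soft lockup line."""
    by_cpu = defaultdict(list)
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            by_cpu[int(m["cpu"])].append(
                (m["comm"], int(m["pid"]), int(m["secs"]))
            )
    return dict(by_cpu)


if __name__ == "__main__":
    # Reads the kernel log from standard input.
    for cpu, hits in sorted(group_lockups(sys.stdin).items()):
        print(f"CPU#{cpu}: {hits}")
```

It could be fed from the journal of the previous boot, for example `journalctl -k -b -1 | python lockups.py`. Several CPUs reporting at the same timestamp would suggest they are waiting on each other.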

Gruntz · 2020-02-02 18:08:21

Hello,

This happens from time to time. Usually I see lockups on more than one CPU. Sometimes two, sometimes three. It happens to a random CPU. This particular lockup is on CPU 4, which is pinned to a Windows 10 VM. I was using the Windows VM heavily when it happened, but sometimes it happens when the computer is idle. There is no clear pattern.

Here is a bit more of the log.

And here is a log from another boot. This was on 21 December, and it kept outputting these messages for some time.

Best regards.
