Hello all,
I have a Supermicro motherboard with two Xeon X5650 CPUs and 64 GB of RAM, and I run several VMs on it.
One is Windows 10, which I use for gaming (with GPU pass-through).
Another runs Arch Linux for work (software development, testing...). I use this one as my main desktop, also with GPU pass-through.
I have three to five more for a database server, a GitLab server, and things like that.
From time to time my host machine crashes. The log fills with messages like "watchdog: BUG: soft lockup - CPU#4 stuck for 45s! [worker:131043]".
It is a different CPU each time. After the messages appear, the computer slowly becomes unresponsive and eventually hangs completely.
Do you have any clue what this could be? I read about the problem: it seems to appear when a CPU is stuck on a task for a long time, or something like that.
But that is normal for VMs. Can I work around it?
Best regards.
See https://www.kernel.org/doc/Documentatio … chdogs.txt for the details. The message indicates a kernel task that has been stuck for more than 20 seconds.
Edit:
The backtrace triggered by the lockup detection should help locate the cause.
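For reference, the 20 second figure is not fixed: the kernel documentation describes the soft lockup threshold as twice the watchdog_thresh sysctl, which defaults to 10. A minimal sketch for checking the value on your host, assuming the usual /proc/sys/kernel/watchdog_thresh path is present (the function names are my own, just for illustration):

```python
from pathlib import Path


def soft_lockup_threshold_seconds(watchdog_thresh: int) -> int:
    """Soft lockup threshold is documented as 2 * watchdog_thresh."""
    return 2 * watchdog_thresh


def read_watchdog_thresh(path: str = "/proc/sys/kernel/watchdog_thresh"):
    """Return the current watchdog_thresh, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (watchdog not built in, non-Linux host) or unexpected content.
        return None


if __name__ == "__main__":
    thresh = read_watchdog_thresh()
    if thresh is None:
        print("watchdog_thresh not readable on this system")
    else:
        print(f"watchdog_thresh = {thresh}s, "
              f"soft lockup threshold = {soft_lockup_threshold_seconds(thresh)}s")
```

Raising the threshold would only delay the message. It would not fix whatever is keeping the CPU stuck.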
Last edited by loqs (2020-01-30 20:55:07)
Hello,
Thank you for your reply, loqs. The doc you sent from kernel.org was very helpful for understanding the problem.
Here is my backtrace:
Jan 30 22:14:43 darkstar kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 44s! [worker:131042]
Jan 30 22:14:43 darkstar kernel: Modules linked in: cdc_acm rpcsec_gss_krb5 dm_mod vhost_net tun vhost macvtap tap macvlan bonding xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ip>
Jan 30 22:14:43 darkstar kernel: vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
Jan 30 22:14:43 darkstar kernel: CPU: 2 PID: 131042 Comm: worker Tainted: G L 5.4.12-arch1-1 #1
Jan 30 22:14:43 darkstar kernel: Hardware name: Gateway GT350 F1/GT350 F1, BIOS P03 07/26/2010
Jan 30 22:14:43 darkstar kernel: RIP: 0010:smp_call_function_many+0x21d/0x280
Jan 30 22:14:43 darkstar kernel: Code: e8 88 c8 7c 00 3b 05 d6 c1 20 01 89 c7 0f 83 7a fe ff ff 48 63 c7 48 8b 0b 48 03 0c c5 20 f9 f7 9d 8b 41 18 a8 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f>
Jan 30 22:14:43 darkstar kernel: RSP: 0018:ffffbc648b4ffcf8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Jan 30 22:14:43 darkstar kernel: RAX: 0000000000000003 RBX: ffff9a7c5f8aba40 RCX: ffff9a7c5f8312c0
Jan 30 22:14:43 darkstar kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
Jan 30 22:14:43 darkstar kernel: RBP: ffffffff9ce7ee90 R08: ffff9a7c5f8aba48 R09: 0000000000000006
Jan 30 22:14:43 darkstar kernel: R10: ffff9a7c5f8aba48 R11: 0000000000000005 R12: ffff9a7c5f8aa340
Jan 30 22:14:43 darkstar kernel: R13: ffff9a7c5f8aba48 R14: 0000000000000001 R15: 0000000000000140
Jan 30 22:14:43 darkstar kernel: FS: 00007f58b0dbf700(0000) GS:ffff9a7c5f880000(0000) knlGS:0000000000000000
Jan 30 22:14:43 darkstar kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 30 22:14:43 darkstar kernel: CR2: 0000000000000030 CR3: 0000000808eb8004 CR4: 00000000000226e0
Jan 30 22:14:43 darkstar kernel: DR0: 000000007ffe0270 DR1: 00000000002f2040 DR2: 0000000000000000
Jan 30 22:14:43 darkstar kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 30 22:14:43 darkstar kernel: Call Trace:
Jan 30 22:14:43 darkstar kernel: flush_tlb_mm_range+0xed/0x150
Jan 30 22:14:43 darkstar kernel: tlb_flush_mmu+0xa4/0x160
Jan 30 22:14:43 darkstar kernel: tlb_finish_mmu+0x3d/0x70
Jan 30 22:14:43 darkstar kernel: unmap_region+0xf4/0x130
Jan 30 22:14:43 darkstar kernel: __do_munmap+0x255/0x4c0
Jan 30 22:14:43 darkstar kernel: __vm_munmap+0x67/0xb0
Jan 30 22:14:43 darkstar kernel: __x64_sys_munmap+0x28/0x30
Jan 30 22:14:43 darkstar kernel: do_syscall_64+0x4e/0x140
Jan 30 22:14:43 darkstar kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 30 22:14:43 darkstar kernel: RIP: 0033:0x7f5c028d10db
Jan 30 22:14:43 darkstar kernel: Code: 8b 15 a9 5d 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb 89 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 0>
Jan 30 22:14:43 darkstar kernel: RSP: 002b:00007f58b0dbd038 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Jan 30 22:14:43 darkstar kernel: RAX: ffffffffffffffda RBX: 00007f58da31e9c0 RCX: 00007f5c028d10db
Jan 30 22:14:43 darkstar kernel: RDX: 0000000000000000 RSI: 0000000000801000 RDI: 00007f58d9b1e000
Jan 30 22:14:43 darkstar kernel: RBP: 00007f58b4dc79c0 R08: 0000000000000000 R09: 0000000000000001
Jan 30 22:14:43 darkstar kernel: R10: 00007f58d76000c0 R11: 0000000000000206 R12: 00007f5c029bb020
Jan 30 22:14:43 darkstar kernel: R13: 0000000002800000 R14: 00007f58b0dbd140 R15: 00007f58b0dbf700
Do you have any idea what may cause the problem?
Best regards.
Are there any other threads that are also stuck which might be contending for the lock?
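One way to check this is to group the soft lockup reports by timestamp and see which CPUs and tasks were reported together. A rough sketch, assuming the log lines look like the default journalctl output pasted above (the regular expression and function name are my own and may need adjusting for other log formats):

```python
import re
from collections import defaultdict

# Matches lines such as:
# Jan 30 22:14:43 darkstar kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 44s! [worker:131042]
LOCKUP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def group_lockups(lines):
    """Map each timestamp to a list of (cpu, task, pid, seconds) tuples."""
    groups = defaultdict(list)
    for line in lines:
        m = LOCKUP_RE.match(line.strip())
        if m:
            groups[m["stamp"]].append(
                (int(m["cpu"]), m["task"], int(m["pid"]), int(m["secs"]))
            )
    return dict(groups)


if __name__ == "__main__":
    # Sample line copied from the backtrace earlier in this thread.
    sample = [
        "Jan 30 22:14:43 darkstar kernel: watchdog: BUG: soft lockup - "
        "CPU#2 stuck for 44s! [worker:131042]",
    ]
    for stamp, hits in group_lockups(sample).items():
        for cpu, task, pid, secs in hits:
            print(f"{stamp}: CPU {cpu} stuck {secs}s in {task} (pid {pid})")
```

Several CPUs showing up under the same timestamp would suggest they were waiting on each other rather than one task misbehaving on its own.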
Hello,
This happens from time to time. Usually I see lockups on more than one CPU: sometimes two, sometimes three. It hits a random CPU. This particular lockup is on CPU 4, which is pinned to the Windows 10 VM. I was using the Windows VM heavily when it happened, but sometimes it happens when the computer is idle. There is no clear pattern.
Here is a bit more of the log.
And here is a log from another boot. This was on 21 December, and it kept printing these messages for some time.
Best regards.