
#1 2019-07-29 08:43:33

mareurs
Member
Registered: 2019-07-29
Posts: 3

AMD TR 2950x and kvm hard locks the system

Hi!

I bought an AMD TR2 system to run KVM and virtualize a couple of machines, one of which runs Windows Server 2016. Once a day the kernel hard locks, and not even the Magic SysRq keys work. The system is as follows:

CPU: AMD ThreadRipper 2950x
MB: Gigabyte X399 AORUS PRO (BIOS version: F2g, Update AGESA 1.1.0.2)
NAND: 2x Samsung 970 EVO Plus - 500G
OS: Arch Linux
Kernels: 5.2.3-arch1-1-ARCH, 4.19.61-1 (LTS)

Before this I had two other no-name NAND disks that crashed the kernel in the same way every time I copied from one to the other. With the new Samsung NAND I cannot reproduce that issue, but the system still hard locks once a day. The only way to recover is a hard reset.

Reading through various forums I tried the following:
1. Disabled/enabled IOMMU in BIOS (and tried various kernel params: amd_iommu=on, amd_iommu=pt, amd_iommu=soft)
2. Added the kernel param for NVMe APST issues: nvme_core.default_ps_max_latency_us=0
3. Tried the latest kernel and linux-lts
4. Compiled a kernel with IOMMU debugging options, PCI debugging, etc. Enabled panic on every oops to be able to catch the defect
5. Enabled kernel dump on oops
6. My current kernel parameters are:
kernel.nmi_watchdog = 0 hugepagesz=1G hugepages=48 processor.max_cstate=1 rcu_nocbs=0-31 idle=nomwait nvme_core.default_ps_max_latency_us=0 clocksource=hpet amd_iommu=on iommu=pt pcie_acs_override=downstream vfio_iommu_type1.allow_unsafe_interrupts=1 kvm.ignore_msrs=1 rd.driver.pre=vfio-pci netconsole=6665@192.168.2.20/br0,6666@192.168.2.1/c0:25:e9:0f:2a:e3 audit=0 loglevel=8 quiet
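
With a command line this long it helps to confirm what the running kernel was actually booted with. Below is a minimal Python sketch, assuming a Linux host where /proc/cmdline is readable. The function names parse_cmdline and read_running_cmdline are made up for illustration. If a parameter appears twice, this sketch keeps only the last value.

```python
import shlex
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value}.

    Bare flags such as "quiet" map to None. Quoted values are kept
    together by shlex, and only the first "=" separates key from value.
    """
    params: Dict[str, Optional[str]] = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_running_cmdline(path: str = "/proc/cmdline") -> Dict[str, Optional[str]]:
    """Parse the command line the running kernel was booted with."""
    with open(path) as handle:
        return parse_cmdline(handle.read())
```

Printing the result of read_running_cmdline() after each reboot shows whether an edit to the bootloader config really took effect.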

No matter the configuration, the hang is always the same: no magic sysrq, no logs, no dump.
I am a developer, but not a kernel developer, so I am asking whether there is any way to capture this hard lock and understand what the bug or the hardware issue is.
The Arch manual https://wiki.archlinux.org/index.php/Ge … leshooting didn't help too much. Is there other documentation, Arch-related or not, on debugging hard locks?

Thank you!

Offline

#2 2019-07-29 12:26:57

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,919

Re: AMD TR 2950x and kvm hard locks the system

Let's try to narrow down things a bit first.


Do you have early microcode updating configured?
If not, do so.

Please post the output of lspci -k.

Is "CSM support" disabled and "Above 4G decoding" enabled?
Keep IOMMU enabled in the UEFI firmware[1].

Check the kernel parameters documentation and remove as many as possible.

examples:
clocksource=hpet
The kernel has logic to determine whether hpet is a stable clocksource and falls back to other methods if it isn't.
Don't force it unless you have a very good reason.

amd_iommu doesn't even have an "on" value
iommu=pt is intended for X86 systems, not x86_64
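
For the clocksource example, the choice the kernel made on its own can be read back from sysfs once the forced parameter is removed. Below is a minimal Python sketch, assuming the usual clocksource0 directory under /sys/devices/system/clocksource. The function name clocksource_status is made up for illustration.

```python
from pathlib import Path
from typing import List, Tuple


def clocksource_status(
    root: str = "/sys/devices/system/clocksource/clocksource0",
) -> Tuple[str, List[str]]:
    """Return (current clocksource, list of available clocksources)."""
    base = Path(root)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Comparing the current value with the available list shows which clocksource the kernel selected and what it could have picked instead.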


[1] AMD x86_64 systems have used the IOMMU for lots of things since approx. 2006.
Disabling IOMMU when running a 64-bit OS cripples the system.
For Intel cpus things are different.


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)

Offline

#3 2019-07-29 14:58:26

mareurs
Member
Registered: 2019-07-29
Posts: 3

Re: AMD TR 2950x and kvm hard locks the system

Hi! Thanks for the quick reply!
1. I do have early microcode updating configured. The microcode image always gets loaded before vmlinuz in GRUB.
2. lspci -k

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
        Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
        Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Kernel driver in use: pcieport
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Kernel driver in use: pcieport
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Kernel driver in use: pcieport
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
        Kernel driver in use: pcieport
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
        Kernel driver in use: pcieport
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 59)
        Subsystem: Gigabyte Technology Co., Ltd FCH SMBus Controller
        Kernel modules: i2c_piix4, sp5100_tco
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
        Subsystem: Gigabyte Technology Co., Ltd FCH LPC Bridge
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
        Kernel driver in use: k10temp
        Kernel modules: k10temp
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
00:19.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
00:19.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
00:19.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
00:19.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
        Kernel driver in use: k10temp
        Kernel modules: k10temp
00:19.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
00:19.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
00:19.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
00:19.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller (rev 02)
        Subsystem: ASMedia Technology Inc. X399 Series Chipset USB 3.1 xHCI Controller
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller (rev 02)
        DeviceName: Promontory SATA
        Subsystem: ASMedia Technology Inc. X399 Series Chipset SATA Controller
        Kernel driver in use: ahci
        Kernel modules: ahci
01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset PCIe Bridge (rev 02)
        Kernel driver in use: pcieport
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
        DeviceName: Onboard LAN Atheros
        Kernel driver in use: pcieport
02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
        Kernel driver in use: pcieport
02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
        DeviceName: Onboard LAN Realtek
        Kernel driver in use: pcieport
02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
        Kernel driver in use: pcieport
02:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
        Kernel driver in use: pcieport
04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
        Subsystem: Gigabyte Technology Co., Ltd I211 Gigabit Network Connection
        Kernel driver in use: igb
        Kernel modules: igb
08:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01)
        Subsystem: Phison Electronics Corporation E12 NVMe Controller
        Kernel driver in use: nvme
09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
        Subsystem: TP-LINK Technologies Co., Ltd. TG-3468 Gigabit PCI Express Network Adapter
        Kernel driver in use: r8169
        Kernel modules: r8169
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
        Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
0a:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
        Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
        Kernel driver in use: ccp
        Kernel modules: ccp
0a:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
        Subsystem: Gigabyte Technology Co., Ltd Zeppelin USB 3.0 Host controller
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
0b:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
        Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
0b:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
        DeviceName: DIE0 M.2 SATA
        Subsystem: Gigabyte Technology Co., Ltd FCH SATA Controller [AHCI mode]
        Kernel driver in use: ahci
        Kernel modules: ahci
0b:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
        DeviceName: Audio Codec ALC1220
        Subsystem: Gigabyte Technology Co., Ltd Family 17h (Models 00h-0fh) HD Audio Controller
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
40:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
        Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
40:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
        Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
40:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Kernel driver in use: pcieport
40:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Kernel driver in use: pcieport
40:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
40:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
        Kernel driver in use: pcieport
40:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
40:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
40:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
        Kernel driver in use: pcieport
40:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
40:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
        Kernel driver in use: pcieport
41:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
        Kernel driver in use: nvme
42:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
        Kernel driver in use: nvme
43:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)
        Subsystem: Gigabyte Technology Co., Ltd GT218 [GeForce 210]
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia
43:00.1 Audio device: NVIDIA Corporation High Definition Audio Controller (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd High Definition Audio Controller
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
44:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
        Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
44:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
        Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
        Kernel driver in use: ccp
        Kernel modules: ccp
44:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
        Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
45:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
        Subsystem: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
45:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
        Subsystem: Gigabyte Technology Co., Ltd FCH SATA Controller [AHCI mode]
        Kernel driver in use: ahci
        Kernel modules: ahci

3. CSM is enabled (although I boot in EFI mode). "Above 4G decoding" was disabled; I thought it only mattered for 32-bit systems accessing memory above 4G and was mostly used for GPUs.
4. I will remove the unneeded parameters and reboot. Do you need the IOMMU groups as well?
amd_iommu=on and iommu=pt are from an ArchWiki page:
Enable IOMMU support by setting the correct kernel parameter depending on the type of CPU in use:

For Intel CPUs (VT-d) set intel_iommu=on
For AMD CPUs (AMD-Vi) set amd_iommu=on
You should also append the iommu=pt parameter. This will prevent Linux from touching devices which cannot be passed through.

found here: https://wiki.archlinux.org/index.php/PC … h_via_OVMF
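
Regarding the IOMMU groups, they can be listed from sysfs. Below is a minimal Python sketch, assuming the kernel exposes groups under /sys/kernel/iommu_groups (the directory is empty or missing when the IOMMU is not active). The function name iommu_groups is made up for illustration.

```python
from pathlib import Path
from typing import Dict, List


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> Dict[int, List[str]]:
    """Return {group number: sorted PCI addresses} from the sysfs tree."""
    groups: Dict[int, List[str]] = {}
    base = Path(root)
    if not base.is_dir():
        # No IOMMU groups exposed (IOMMU disabled or not supported).
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    # Sort by group number so the listing is stable between runs.
    return dict(sorted(groups.items()))
```

Each address in the result can be matched against the first column of the lspci -k output above, with the 0000: domain prefix added.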

Last edited by mareurs (2019-08-01 08:43:38)

Offline

#4 2019-07-29 16:16:52

2ManyDogs
Forum Fellow
Registered: 2012-01-15
Posts: 4,645

Re: AMD TR 2950x and kvm hard locks the system

Please edit your post and use [ code ] tags (not italics) when posting output.

https://wiki.archlinux.org/index.php/Co … s_and_code
https://bbs.archlinux.org/help.php#bbcode

Offline

#5 2019-08-01 08:49:23

mareurs
Member
Registered: 2019-07-29
Posts: 3

Re: AMD TR 2950x and kvm hard locks the system

Hi!
The system has now been stable for about 3-4 days with both VMs (Windows Server 2016 and CentOS 7).
I did a couple of things:
1. Removed unneeded kernel params as you suggested; I am now left with only: amd_iommu=on audit=1 nvme_core.default_ps_max_latency_us=0 quiet
2. Enabled some kernel module options:
options kvm_amd npt=1 nested=1 avic=1
3. Rewrote the libvirt XML files for the VMs. Each VM now gets a separate NUMA node
4. After a couple of days of stability, I switched the kernel to linux-lts since this is the one I want to run in production
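
Items 2 and 3 can be double-checked from sysfs. Below is a minimal Python sketch, assuming the kvm_amd module is loaded and that the host exposes its NUMA topology under /sys/devices/system/node. The function names module_parameters and numa_node_cpus are made up for illustration. The values are returned exactly as the kernel prints them, so the format depends on the parameter type.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def module_parameters(
    module: str, names: Iterable[str], root: str = "/sys/module"
) -> Dict[str, Optional[str]]:
    """Read the named parameters of a loaded module; missing ones map to None."""
    params_dir = Path(root) / module / "parameters"
    values: Dict[str, Optional[str]] = {}
    for name in names:
        param_file = params_dir / name
        values[name] = param_file.read_text().strip() if param_file.is_file() else None
    return values


def numa_node_cpus(root: str = "/sys/devices/system/node") -> Dict[int, str]:
    """Return {NUMA node number: CPU list string such as "0-7,16-23"}."""
    nodes: Dict[int, str] = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            # Directory names look like "node0", "node1", ...
            nodes[int(node_dir.name[len("node"):])] = cpulist.read_text().strip()
    return dict(sorted(nodes.items()))
```

Calling module_parameters("kvm_amd", ["npt", "nested", "avic"]) shows whether the options line took effect, and numa_node_cpus() gives the CPU ranges to compare with the pinning in each VM's XML.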

For now it seems to be working nicely. I get good throughput from the NAND and the processor, but a bit of latency on the memory (that is another story).
I will update here with more info as I find it.
Thanks a lot Wolf, you saved my life! :)

Last edited by mareurs (2019-08-01 08:55:21)

Offline
