I have trouble running OpenCL applications on my AMD Radeon RX 580. KataGo, which uses OpenCL heavily, keeps segfaulting (with specific settings) and sometimes even freezes my computer. I opened an issue there and we concluded it's likely not a KataGo issue but a hardware/driver/OpenCL problem. I also tried Geekbench, which fails after the first few tests, but only with "internal errors" rather than a segmentation fault. clpeak runs fine.
I tried running OpenCL apps such as KataGo on Windows (same machine) and they work flawlessly: no freezes, no segfaults, just as it should be. So I assume my hardware is all right and the problem is Linux or driver related.
It may be something similar to "GPU fault detected, eventually system freeze" from the Gentoo forums.
Here is some info about my system:
# lshw -sanitize
computer
description: Desktop Computer
product: System Product Name (SKU)
vendor: System manufacturer
version: System Version
serial: [REMOVED]
width: 4294967295 bits
capabilities: smbios-3.1 dmi-3.1 smp vsyscall32
configuration: boot=normal chassis=desktop family=To be filled by O.E.M. sku=SKU uuid=[REMOVED]
*-core
description: Motherboard
product: PRIME X370-PRO
vendor: ASUSTeK COMPUTER INC.
physical id: 0
version: Rev X.0x
serial: [REMOVED]
slot: Default string
*-firmware
description: BIOS
vendor: American Megatrends Inc.
physical id: 0
version: 4207
date: 12/08/2018
size: 64KiB
capacity: 15MiB
capabilities: pci apm upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
*-memory
description: System Memory
physical id: 2c
slot: System board or motherboard
size: 16GiB
*-bank:0
description: [empty]
product: Unknown
vendor: Unknown
physical id: 0
serial: [REMOVED]
slot: DIMM_A1
*-bank:1
description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
product: CMK16GX4M2A2666C16
vendor: Corsair
physical id: 1
serial: [REMOVED]
slot: DIMM_A2
size: 8GiB
width: 64 bits
clock: 2133MHz (0.5ns)
*-bank:2
description: [empty]
product: Unknown
vendor: Unknown
physical id: 2
serial: [REMOVED]
slot: DIMM_B1
*-bank:3
description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
product: CMK16GX4M2A2666C16
vendor: Corsair
physical id: 3
serial: [REMOVED]
slot: DIMM_B2
size: 8GiB
width: 64 bits
clock: 2133MHz (0.5ns)
*-cache:0
description: L1 cache
physical id: 2e
slot: L1 - Cache
size: 768KiB
capacity: 768KiB
clock: 1GHz (1.0ns)
capabilities: pipeline-burst internal write-back unified
configuration: level=1
*-cache:1
description: L2 cache
physical id: 2f
slot: L2 - Cache
size: 4MiB
capacity: 4MiB
clock: 1GHz (1.0ns)
capabilities: pipeline-burst internal write-back unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 30
slot: L3 - Cache
size: 16MiB
capacity: 16MiB
clock: 1GHz (1.0ns)
capabilities: pipeline-burst internal write-back unified
configuration: level=3
*-cpu
description: CPU
product: AMD Ryzen 7 1700 Eight-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 31
bus info: cpu@0
version: AMD Ryzen 7 1700 Eight-Core Processor
serial: [REMOVED]
slot: AM4
size: 2888MHz
capacity: 3750MHz
width: 64 bits
clock: 100MHz
capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca cpufreq
configuration: cores=8 enabledcores=8 threads=16
*-pci:0
description: Host bridge
product: Family 17h (Models 00h-0fh) Root Complex
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 100
bus info: pci@0000:00:00.0
version: 00
width: 32 bits
clock: 33MHz
*-generic UNCLAIMED
description: IOMMU
product: Family 17h (Models 00h-0fh) I/O Memory Management Unit
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.2
bus info: pci@0000:00:00.2
version: 00
width: 32 bits
clock: 33MHz
capabilities: msi ht cap_list
configuration: latency=0
*-pci:0
description: PCI bridge
product: Family 17h (Models 00h-0fh) PCIe GPP Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 1.3
bus info: pci@0000:00:01.3
version: 00
width: 32 bits
clock: 33MHz
capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:28 ioport:e000(size=4096) memory:fe500000-fe7fffff
*-usb
description: USB controller
product: X370 Series Chipset USB 3.1 xHCI Controller
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0
bus info: pci@0000:01:00.0
version: 02
width: 64 bits
clock: 33MHz
capabilities: msi pm pciexpress xhci bus_master cap_list
configuration: driver=xhci_hcd latency=0
resources: irq:47 memory:fe7a0000-fe7a7fff
*-usbhost:0
product: xHCI Host Controller
vendor: Linux 5.5.13-arch2-1 xhci-hcd
physical id: 0
bus info: usb@1
logical name: usb1
version: 5.05
capabilities: usb-2.00
configuration: driver=hub slots=14 speed=480Mbit/s
*-usbhost:1
product: xHCI Host Controller
vendor: Linux 5.5.13-arch2-1 xhci-hcd
physical id: 1
bus info: usb@2
logical name: usb2
version: 5.05
capabilities: usb-3.10
configuration: driver=hub slots=8 speed=10000Mbit/s
*-storage
description: SATA controller
product: X370 Series Chipset SATA Controller
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.1
bus info: pci@0000:01:00.1
version: 02
width: 32 bits
clock: 33MHz
capabilities: storage msi pm pciexpress ahci_1.0 bus_master cap_list rom
configuration: driver=ahci latency=0
resources: irq:41 memory:fe780000-fe79ffff memory:fe700000-fe77ffff
*-pci
description: PCI bridge
product: X370 Series Chipset PCIe Upstream Port
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.2
bus info: pci@0000:01:00.2
version: 02
width: 32 bits
clock: 33MHz
capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:32 ioport:e000(size=4096) memory:fe500000-fe6fffff
*-pci:0
description: PCI bridge
product: 300 Series Chipset PCIe Port
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0
bus info: pci@0000:02:00.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:33
*-pci:1
description: PCI bridge
product: 300 Series Chipset PCIe Port
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 2
bus info: pci@0000:02:02.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:34
*-pci:2
description: PCI bridge
product: 300 Series Chipset PCIe Port
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 3
bus info: pci@0000:02:03.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:36
*-pci:3
description: PCI bridge
product: 300 Series Chipset PCIe Port
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 4
bus info: pci@0000:02:04.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:37 memory:fe600000-fe6fffff
*-usb
description: USB controller
product: ASM1143 USB 3.1 Host Controller
vendor: ASMedia Technology Inc.
physical id: 0
bus info: pci@0000:06:00.0
version: 00
width: 64 bits
clock: 33MHz
capabilities: msi pm pciexpress xhci bus_master cap_list
configuration: driver=xhci_hcd latency=0
resources: irq:48 memory:fe600000-fe607fff
*-usbhost:0
product: xHCI Host Controller
vendor: Linux 5.5.13-arch2-1 xhci-hcd
physical id: 0
bus info: usb@3
logical name: usb3
version: 5.05
capabilities: usb-2.00
configuration: driver=hub slots=2 speed=480Mbit/s
*-usbhost:1
product: xHCI Host Controller
vendor: Linux 5.5.13-arch2-1 xhci-hcd
physical id: 1
bus info: usb@4
logical name: usb4
version: 5.05
capabilities: usb-3.10
configuration: driver=hub slots=2 speed=10000Mbit/s
*-pci:4
description: PCI bridge
product: 300 Series Chipset PCIe Port
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 6
bus info: pci@0000:02:06.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:38 ioport:e000(size=4096) memory:fe500000-fe5fffff
*-network
description: Ethernet interface
product: I211 Gigabit Network Connection
vendor: Intel Corporation
physical id: 0
bus info: pci@0000:07:00.0
logical name: enp7s0
version: 03
serial: [REMOVED]
size: 1Gbit/s
capacity: 1Gbit/s
width: 32 bits
clock: 33MHz
capabilities: pm msi msix pciexpress bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.6.0-k duplex=full firmware=0. 6-1 ip=[REMOVED] latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
resources: irq:24 memory:fe500000-fe51ffff ioport:e000(size=32) memory:fe520000-fe523fff
*-pci:5
description: PCI bridge
product: 300 Series Chipset PCIe Port
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 7
bus info: pci@0000:02:07.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:39
*-pci:1
description: PCI bridge
product: Family 17h (Models 00h-0fh) PCIe GPP Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 3.1
bus info: pci@0000:00:03.1
version: 00
width: 32 bits
clock: 33MHz
capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:29 ioport:d000(size=4096) memory:fe900000-fe9fffff ioport:e0000000(size=270532608)
*-display
description: VGA compatible controller
product: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:09:00.0
version: e7
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: irq:61 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:d000(size=256) memory:fe900000-fe93ffff memory:c0000-dffff
*-multimedia
description: Audio device
product: Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0.1
bus info: pci@0000:09:00.1
version: 00
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi bus_master cap_list
configuration: driver=snd_hda_intel latency=0
resources: irq:57 memory:fe960000-fe963fff
*-pci:2
description: PCI bridge
product: Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 7.1
bus info: pci@0000:00:07.1
version: 00
width: 32 bits
clock: 33MHz
capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:30 memory:fe200000-fe4fffff
*-generic:0 UNCLAIMED
description: Non-Essential Instrumentation
product: Zeppelin/Raven/Raven2 PCIe Dummy Function
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0
bus info: pci@0000:0a:00.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: pm pciexpress cap_list
configuration: latency=0
*-generic:1
description: Encryption controller
product: Family 17h (Models 00h-0fh) Platform Security Processor
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.2
bus info: pci@0000:0a:00.2
version: 00
width: 32 bits
clock: 33MHz
capabilities: pm pciexpress msi msix bus_master cap_list
configuration: driver=ccp latency=0
resources: irq:44 memory:fe300000-fe3fffff memory:fe400000-fe401fff
*-usb
description: USB controller
product: Family 17h (Models 00h-0fh) USB 3.0 Host Controller
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.3
bus info: pci@0000:0a:00.3
version: 00
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi xhci bus_master cap_list
configuration: driver=xhci_hcd latency=0
resources: irq:50 memory:fe200000-fe2fffff
*-usbhost:0
product: xHCI Host Controller
vendor: Linux 5.5.13-arch2-1 xhci-hcd
physical id: 0
bus info: usb@5
logical name: usb5
version: 5.05
capabilities: usb-2.00
configuration: driver=hub slots=4 speed=480Mbit/s
*-usb:0
description: Keyboard
product: Keyboard
vendor: Cherry GmbH
physical id: 1
bus info: usb@5:1
version: 0.32
capabilities: usb-2.00
configuration: driver=usbhid maxpower=100mA speed=2Mbit/s
*-usb:1
description: Keyboard
product: USB Gaming Mouse
vendor: Holtek
physical id: 2
bus info: usb@5:2
version: 1.06
capabilities: usb-2.00
configuration: driver=usbhid maxpower=100mA speed=12Mbit/s
*-usbhost:1
product: xHCI Host Controller
vendor: Linux 5.5.13-arch2-1 xhci-hcd
physical id: 1
bus info: usb@6
logical name: usb6
version: 5.05
capabilities: usb-3.00
configuration: driver=hub slots=4 speed=5000Mbit/s
*-pci:3
description: PCI bridge
product: Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 8.1
bus info: pci@0000:00:08.1
version: 00
width: 32 bits
clock: 33MHz
capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:31 memory:fe800000-fe8fffff
*-generic UNCLAIMED
description: Non-Essential Instrumentation
product: Zeppelin/Renoir PCIe Dummy Function
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0
bus info: pci@0000:0b:00.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: pm pciexpress cap_list
configuration: latency=0
*-storage
description: SATA controller
product: FCH SATA Controller [AHCI mode]
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.2
bus info: pci@0000:0b:00.2
version: 51
width: 32 bits
clock: 33MHz
capabilities: storage pm pciexpress msi ahci_1.0 bus_master cap_list
configuration: driver=ahci latency=0
resources: irq:43 memory:fe808000-fe808fff
*-multimedia
description: Audio device
product: Family 17h (Models 00h-0fh) HD Audio Controller
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0.3
bus info: pci@0000:0b:00.3
version: 00
width: 32 bits
clock: 33MHz
capabilities: pm pciexpress msi bus_master cap_list
configuration: driver=snd_hda_intel latency=0
resources: irq:59 memory:fe800000-fe807fff
*-serial
description: SMBus
product: FCH SMBus Controller
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 14
bus info: pci@0000:00:14.0
version: 59
width: 32 bits
clock: 66MHz
configuration: driver=piix4_smbus latency=0
resources: irq:0
*-isa
description: ISA bridge
product: FCH LPC Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 14.3
bus info: pci@0000:00:14.3
version: 51
width: 32 bits
clock: 66MHz
capabilities: isa bus_master
configuration: latency=0
*-pci:1
description: Host bridge
product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 101
bus info: pci@0000:00:01.0
version: 00
width: 32 bits
clock: 33MHz
*-pci:2
description: Host bridge
product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 102
bus info: pci@0000:00:02.0
version: 00
width: 32 bits
clock: 33MHz
*-pci:3
description: Host bridge
product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 103
bus info: pci@0000:00:03.0
version: 00
width: 32 bits
clock: 33MHz
*-pci:4
description: Host bridge
product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 104
bus info: pci@0000:00:04.0
version: 00
width: 32 bits
clock: 33MHz
*-pci:5
description: Host bridge
product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 105
bus info: pci@0000:00:07.0
version: 00
width: 32 bits
clock: 33MHz
*-pci:6
description: Host bridge
product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 106
bus info: pci@0000:00:08.0
version: 00
width: 32 bits
clock: 33MHz
*-pci:7
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 107
bus info: pci@0000:00:18.0
version: 00
width: 32 bits
clock: 33MHz
*-pci:8
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 108
bus info: pci@0000:00:18.1
version: 00
width: 32 bits
clock: 33MHz
*-pci:9
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 109
bus info: pci@0000:00:18.2
version: 00
width: 32 bits
clock: 33MHz
*-pci:10
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 10a
bus info: pci@0000:00:18.3
version: 00
width: 32 bits
clock: 33MHz
configuration: driver=k10temp
resources: irq:0
*-pci:11
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 10b
bus info: pci@0000:00:18.4
version: 00
width: 32 bits
clock: 33MHz
*-pci:12
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 10c
bus info: pci@0000:00:18.5
version: 00
width: 32 bits
clock: 33MHz
*-pci:13
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 10d
bus info: pci@0000:00:18.6
version: 00
width: 32 bits
clock: 33MHz
*-pci:14
description: Host bridge
product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 10e
bus info: pci@0000:00:18.7
version: 00
width: 32 bits
clock: 33MHz
More info about the related installed packages:
linux 5.5.13.arch2-1
mesa 20.0.3-1
opencl-mesa 20.0.3-1
When KataGo segfaults, the kernel log shows
# dmesg
[ 5143.722198] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 16800 thread katago:cs0 pid 16801
[ 5143.722203] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000600
[ 5143.722205] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04800C
[ 5143.722210] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
[ 5144.595948] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 16800 thread katago:cs0 pid 16801
[ 5144.595952] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000600
[ 5144.595954] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04800C
[ 5144.595959] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
[ 5152.414302] katago[16855]: segfault at 8 ip 00007f6af5865c1d sp 00007f6a537fd050 error 6 in pipe_radeonsi.so[7f6af576a000+2ad000]
[ 5152.414316] Code: 00 00 8b 83 40 06 00 00 c6 83 2d 6e 00 00 01 48 89 df 48 8d 74 24 60 83 e0 fe 83 c8 02 89 83 40 06 00 00 48 8b 83 c8 04 00 00 <c6> 40 08 01 ff 93 68 03 00 00 48 8b 93 c8 04 00 00 8b 83 40 06 00
[ 5152.414365] audit: type=1701 audit(1585684467.023:119): auid=1000 uid=1000 gid=1000 ses=4 pid=16800 comm="katago" exe="/home/xy/repos/KataGo/cpp/katago" sig=11 res=1
[ 5152.433480] audit: type=1130 audit(1585684467.043:120): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-16870-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5155.633706] audit: type=1131 audit(1585684470.243:121): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-16870-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Running KataGo with settings that don't segfault right away causes lag, sometimes freezes the system, and leaves this in the journal:
# journalctl -xa
...
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 17185 thread katago:cs0 pid 17189
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000600
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
...
Mar 31 23:27:07 arx rtkit-daemon[1409]: The canary thread is apparently starving. Taking action.
Mar 31 23:27:07 arx rtkit-daemon[1409]: Demoting known real-time threads.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1838 of process 1722.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1433 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1427 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1404 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Demoted 4 threads.
...
Mar 31 23:29:50 arx kernel: amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 17185 thread katago:cs0 pid 17189
Mar 31 23:29:50 arx kernel: amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000600
Mar 31 23:29:52 arx kernel: amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
Mar 31 23:29:55 arx kernel: amdgpu 0000:09:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
Mar 31 23:35:47 arx systemd-udevd[732]: card0: Worker [17487] processing SEQNUM=4011 is taking a long time
-- Reboot --
where `...` stands for many more repetitions of the first four lines shown in full.
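To see how often the fault repeats and what the status word encodes, the log lines above can be summarized with a short script. This is only a sketch: the bit positions used for `VM_CONTEXT1_PROTECTION_FAULT_STATUS` (protections in bits 0-7, memory client id in bits 12-19, read/write flag in bit 24, VMID in bits 25-28) are an assumption inferred from the values printed in the log itself, and `decode_status`, `decode_client_tag` and `summarize_faults` are hypothetical helper names, not part of any driver tooling.

```python
import re
from collections import Counter

# Assumed bit layout, inferred from the log above
# (0x0E04800C -> protections 0x0c, client 72, read, vmid 7).
def decode_status(status: int) -> dict:
    return {
        "protections": status & 0xFF,
        "client_id": (status >> 12) & 0xFF,
        "write": bool((status >> 24) & 0x1),
        "vmid": (status >> 25) & 0xF,
    }

def decode_client_tag(tag: int) -> str:
    """Turn the packed ASCII tag (e.g. 0x54433400) into text such as 'TC4'."""
    return tag.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")

FAULT_RE = re.compile(
    r"GPU fault detected: \d+ 0x[0-9a-fA-F]+ for process (\S+) pid (\d+)"
)
STATUS_RE = re.compile(
    r"VM_CONTEXT1_PROTECTION_FAULT_STATUS\s+0x([0-9a-fA-F]+)"
)

def summarize_faults(log_text: str):
    """Count faults per (process, pid) and occurrences of each status word."""
    faults = Counter()
    statuses = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults[(match.group(1), int(match.group(2)))] += 1
            continue
        match = STATUS_RE.search(line)
        if match:
            statuses[int(match.group(1), 16)] += 1
    return faults, statuses

if __name__ == "__main__":
    sample = (
        "amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c "
        "for process katago pid 16800 thread katago:cs0 pid 16801\n"
        "amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04800C\n"
    )
    faults, statuses = summarize_faults(sample)
    for (process, pid), count in faults.items():
        print(f"{process} (pid {pid}): {count} fault(s)")
    for status, count in statuses.items():
        print(f"status {status:#010x} x{count}: {decode_status(status)}")
```

Feeding it the saved output of `dmesg` or `journalctl` (read from a file and passed as a string) gives one line per faulting process and one per distinct status word, which is easier to quote in a bug report than the raw repetition.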
A backtrace of the segfault shows, and lightvector confirms, that KataGo is just "calling a normal OpenCL routine", `clEnqueueCopyBuffer`.
I would be grateful for hints on how to track the issue down further and, if it turns out to be a bug, where to file the report.
Are you fine with using closed source packages? There's a package "opencl-amd" in the AUR that extracts just the OpenCL library out of AMD's "AMDGPU-PRO" Ubuntu package. That opencl-amd package solved crashes with OpenCL for me on my RX480. You can remove the "opencl-mesa" package when you install opencl-amd. The opencl-mesa package is just terrible and was crashing my whole PC the last time I tried using it.
There is also the "ROCm" project that offers an alternative to opencl-mesa and is open source. I couldn't build it the last time I wanted to try it. The AUR package for the OpenCL part of ROCm is "rocm-opencl-runtime".
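After swapping packages it can be worth checking which OpenCL implementations the ICD loader will actually see. A minimal sketch, assuming the usual Linux ICD-loader convention that each `*.icd` file in `/etc/OpenCL/vendors` names one vendor library; `list_icds` is a hypothetical helper, and the exact file names each package installs are not confirmed here.

```python
from pathlib import Path

def list_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd file name: library it points to} for every *.icd file."""
    result = {}
    directory = Path(vendors_dir)
    if not directory.is_dir():
        # No vendor directory means no registered OpenCL implementation.
        return result
    for icd in sorted(directory.glob("*.icd")):
        result[icd.name] = icd.read_text().strip()
    return result

if __name__ == "__main__":
    for name, library in list_icds().items():
        print(f"{name}: {library}")
```

If more than one entry shows up, both implementations are registered and an application may still pick the one that crashes, so removing the unwanted package (as suggested above for opencl-mesa) keeps the choice unambiguous.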
Wow! Thanks for the hint! I was not aware of how terrible "opencl-mesa" is. I used it because there was no mention of it anywhere and I like to stick to open source whenever possible. However, using "opencl-amd", OpenCL applications run faster and without issues at all. Probably one should add this info to the wiki. Just a note for AMD Radeon RX 480/580 and maybe other similar GPUs.
https://wiki.archlinux.org/index.php/GPGPU would be the first guess, maybe https://wiki.archlinux.org/index.php/AMDGPU too. Should I add notes there?
Also, shouldn't I file a bug report at mesa3d.org for this?
It would make sense to report the bug but I bet that kind of problem is already known upstream. I had crashes with different programs so I'm guessing bugs to work on are not rare at all.
About open source, you could check out that "rocm-opencl-runtime" AUR package. I wanted to try it a few months ago but it didn't compile at that time. That ROCm stuff is what AMD intends to officially use in the future. The home is here (there's also a fancy homepage for it somewhere on AMD's website):
https://github.com/RadeonOpenCompute/ROCm
Last edited by Ropid (2020-04-02 16:22:41)