#1 2020-04-02 13:46:56

xy
Member
Registered: 2015-04-08
Posts: 16

OpenCL App segfaults - GPU fault detected: 146

I have trouble running OpenCL applications on my AMD Radeon RX 580. KataGo, which relies heavily on OpenCL, keeps segfaulting (with specific settings) and sometimes even freezes my computer. I opened an issue with the KataGo project, and we concluded that it is likely not a KataGo problem but a hardware, driver, or OpenCL problem. I also tried Geekbench, which fails after the first few tests, though only with "internal errors" rather than a segmentation fault. clpeak runs fine.

I tried running OpenCL apps such as KataGo on Windows (same machine), and they work flawlessly: no freezes, no segfaults. So I assume my hardware is fine and the problem is Linux- or driver-related.

It may be something similar to the thread "GPU fault detected, eventually system freeze" on the Gentoo forums.

Here is some info about my system:

# lshw -sanitize
computer                    
    description: Desktop Computer
    product: System Product Name (SKU)
    vendor: System manufacturer
    version: System Version
    serial: [REMOVED]
    width: 4294967295 bits
    capabilities: smbios-3.1 dmi-3.1 smp vsyscall32
    configuration: boot=normal chassis=desktop family=To be filled by O.E.M. sku=SKU uuid=[REMOVED]
  *-core
       description: Motherboard
       product: PRIME X370-PRO
       vendor: ASUSTeK COMPUTER INC.
       physical id: 0
       version: Rev X.0x
       serial: [REMOVED]
       slot: Default string
     *-firmware
          description: BIOS
          vendor: American Megatrends Inc.
          physical id: 0
          version: 4207
          date: 12/08/2018
          size: 64KiB
          capacity: 15MiB
          capabilities: pci apm upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
     *-memory
          description: System Memory
          physical id: 2c
          slot: System board or motherboard
          size: 16GiB
        *-bank:0
             description: [empty]
             product: Unknown
             vendor: Unknown
             physical id: 0
             serial: [REMOVED]
             slot: DIMM_A1
        *-bank:1
             description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
             product: CMK16GX4M2A2666C16
             vendor: Corsair
             physical id: 1
             serial: [REMOVED]
             slot: DIMM_A2
             size: 8GiB
             width: 64 bits
             clock: 2133MHz (0.5ns)
        *-bank:2
             description: [empty]
             product: Unknown
             vendor: Unknown
             physical id: 2
             serial: [REMOVED]
             slot: DIMM_B1
        *-bank:3
             description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
             product: CMK16GX4M2A2666C16
             vendor: Corsair
             physical id: 3
             serial: [REMOVED]
             slot: DIMM_B2
             size: 8GiB
             width: 64 bits
             clock: 2133MHz (0.5ns)
     *-cache:0
          description: L1 cache
          physical id: 2e
          slot: L1 - Cache
          size: 768KiB
          capacity: 768KiB
          clock: 1GHz (1.0ns)
          capabilities: pipeline-burst internal write-back unified
          configuration: level=1
     *-cache:1
          description: L2 cache
          physical id: 2f
          slot: L2 - Cache
          size: 4MiB
          capacity: 4MiB
          clock: 1GHz (1.0ns)
          capabilities: pipeline-burst internal write-back unified
          configuration: level=2
     *-cache:2
          description: L3 cache
          physical id: 30
          slot: L3 - Cache
          size: 16MiB
          capacity: 16MiB
          clock: 1GHz (1.0ns)
          capabilities: pipeline-burst internal write-back unified
          configuration: level=3
     *-cpu
          description: CPU
          product: AMD Ryzen 7 1700 Eight-Core Processor
          vendor: Advanced Micro Devices [AMD]
          physical id: 31
          bus info: cpu@0
          version: AMD Ryzen 7 1700 Eight-Core Processor
          serial: [REMOVED]
          slot: AM4
          size: 2888MHz
          capacity: 3750MHz
          width: 64 bits
          clock: 100MHz
          capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca cpufreq
          configuration: cores=8 enabledcores=8 threads=16
     *-pci:0
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Root Complex
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 100
          bus info: pci@0000:00:00.0
          version: 00
          width: 32 bits
          clock: 33MHz
        *-generic UNCLAIMED
             description: IOMMU
             product: Family 17h (Models 00h-0fh) I/O Memory Management Unit
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 0.2
             bus info: pci@0000:00:00.2
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: msi ht cap_list
             configuration: latency=0
        *-pci:0
             description: PCI bridge
             product: Family 17h (Models 00h-0fh) PCIe GPP Bridge
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 1.3
             bus info: pci@0000:00:01.3
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:28 ioport:e000(size=4096) memory:fe500000-fe7fffff
           *-usb
                description: USB controller
                product: X370 Series Chipset USB 3.1 xHCI Controller
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0
                bus info: pci@0000:01:00.0
                version: 02
                width: 64 bits
                clock: 33MHz
                capabilities: msi pm pciexpress xhci bus_master cap_list
                configuration: driver=xhci_hcd latency=0
                resources: irq:47 memory:fe7a0000-fe7a7fff
              *-usbhost:0
                   product: xHCI Host Controller
                   vendor: Linux 5.5.13-arch2-1 xhci-hcd
                   physical id: 0
                   bus info: usb@1
                   logical name: usb1
                   version: 5.05
                   capabilities: usb-2.00
                   configuration: driver=hub slots=14 speed=480Mbit/s
              *-usbhost:1
                   product: xHCI Host Controller
                   vendor: Linux 5.5.13-arch2-1 xhci-hcd
                   physical id: 1
                   bus info: usb@2
                   logical name: usb2
                   version: 5.05
                   capabilities: usb-3.10
                   configuration: driver=hub slots=8 speed=10000Mbit/s
           *-storage
                description: SATA controller
                product: X370 Series Chipset SATA Controller
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0.1
                bus info: pci@0000:01:00.1
                version: 02
                width: 32 bits
                clock: 33MHz
                capabilities: storage msi pm pciexpress ahci_1.0 bus_master cap_list rom
                configuration: driver=ahci latency=0
                resources: irq:41 memory:fe780000-fe79ffff memory:fe700000-fe77ffff
           *-pci
                description: PCI bridge
                product: X370 Series Chipset PCIe Upstream Port
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0.2
                bus info: pci@0000:01:00.2
                version: 02
                width: 32 bits
                clock: 33MHz
                capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
                configuration: driver=pcieport
                resources: irq:32 ioport:e000(size=4096) memory:fe500000-fe6fffff
              *-pci:0
                   description: PCI bridge
                   product: 300 Series Chipset PCIe Port
                   vendor: Advanced Micro Devices, Inc. [AMD]
                   physical id: 0
                   bus info: pci@0000:02:00.0
                   version: 02
                   width: 32 bits
                   clock: 33MHz
                   capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
                   configuration: driver=pcieport
                   resources: irq:33
              *-pci:1
                   description: PCI bridge
                   product: 300 Series Chipset PCIe Port
                   vendor: Advanced Micro Devices, Inc. [AMD]
                   physical id: 2
                   bus info: pci@0000:02:02.0
                   version: 02
                   width: 32 bits
                   clock: 33MHz
                   capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
                   configuration: driver=pcieport
                   resources: irq:34
              *-pci:2
                   description: PCI bridge
                   product: 300 Series Chipset PCIe Port
                   vendor: Advanced Micro Devices, Inc. [AMD]
                   physical id: 3
                   bus info: pci@0000:02:03.0
                   version: 02
                   width: 32 bits
                   clock: 33MHz
                   capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
                   configuration: driver=pcieport
                   resources: irq:36
              *-pci:3
                   description: PCI bridge
                   product: 300 Series Chipset PCIe Port
                   vendor: Advanced Micro Devices, Inc. [AMD]
                   physical id: 4
                   bus info: pci@0000:02:04.0
                   version: 02
                   width: 32 bits
                   clock: 33MHz
                   capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
                   configuration: driver=pcieport
                   resources: irq:37 memory:fe600000-fe6fffff
                 *-usb
                      description: USB controller
                      product: ASM1143 USB 3.1 Host Controller
                      vendor: ASMedia Technology Inc.
                      physical id: 0
                      bus info: pci@0000:06:00.0
                      version: 00
                      width: 64 bits
                      clock: 33MHz
                      capabilities: msi pm pciexpress xhci bus_master cap_list
                      configuration: driver=xhci_hcd latency=0
                      resources: irq:48 memory:fe600000-fe607fff
                    *-usbhost:0
                         product: xHCI Host Controller
                         vendor: Linux 5.5.13-arch2-1 xhci-hcd
                         physical id: 0
                         bus info: usb@3
                         logical name: usb3
                         version: 5.05
                         capabilities: usb-2.00
                         configuration: driver=hub slots=2 speed=480Mbit/s
                    *-usbhost:1
                         product: xHCI Host Controller
                         vendor: Linux 5.5.13-arch2-1 xhci-hcd
                         physical id: 1
                         bus info: usb@4
                         logical name: usb4
                         version: 5.05
                         capabilities: usb-3.10
                         configuration: driver=hub slots=2 speed=10000Mbit/s
              *-pci:4
                   description: PCI bridge
                   product: 300 Series Chipset PCIe Port
                   vendor: Advanced Micro Devices, Inc. [AMD]
                   physical id: 6
                   bus info: pci@0000:02:06.0
                   version: 02
                   width: 32 bits
                   clock: 33MHz
                   capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
                   configuration: driver=pcieport
                   resources: irq:38 ioport:e000(size=4096) memory:fe500000-fe5fffff
                 *-network
                      description: Ethernet interface
                      product: I211 Gigabit Network Connection
                      vendor: Intel Corporation
                      physical id: 0
                      bus info: pci@0000:07:00.0
                      logical name: enp7s0
                      version: 03
                      serial: [REMOVED]
                      size: 1Gbit/s
                      capacity: 1Gbit/s
                      width: 32 bits
                      clock: 33MHz
                      capabilities: pm msi msix pciexpress bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
                      configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.6.0-k duplex=full firmware=0. 6-1 ip=[REMOVED] latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
                      resources: irq:24 memory:fe500000-fe51ffff ioport:e000(size=32) memory:fe520000-fe523fff
              *-pci:5
                   description: PCI bridge
                   product: 300 Series Chipset PCIe Port
                   vendor: Advanced Micro Devices, Inc. [AMD]
                   physical id: 7
                   bus info: pci@0000:02:07.0
                   version: 02
                   width: 32 bits
                   clock: 33MHz
                   capabilities: pci msi pm pciexpress normal_decode bus_master cap_list
                   configuration: driver=pcieport
                   resources: irq:39
        *-pci:1
             description: PCI bridge
             product: Family 17h (Models 00h-0fh) PCIe GPP Bridge
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 3.1
             bus info: pci@0000:00:03.1
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:29 ioport:d000(size=4096) memory:fe900000-fe9fffff ioport:e0000000(size=270532608)
           *-display
                description: VGA compatible controller
                product: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
                vendor: Advanced Micro Devices, Inc. [AMD/ATI]
                physical id: 0
                bus info: pci@0000:09:00.0
                version: e7
                width: 64 bits
                clock: 33MHz
                capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
                configuration: driver=amdgpu latency=0
                resources: irq:61 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:d000(size=256) memory:fe900000-fe93ffff memory:c0000-dffff
           *-multimedia
                description: Audio device
                product: Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
                vendor: Advanced Micro Devices, Inc. [AMD/ATI]
                physical id: 0.1
                bus info: pci@0000:09:00.1
                version: 00
                width: 64 bits
                clock: 33MHz
                capabilities: pm pciexpress msi bus_master cap_list
                configuration: driver=snd_hda_intel latency=0
                resources: irq:57 memory:fe960000-fe963fff
        *-pci:2
             description: PCI bridge
             product: Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 7.1
             bus info: pci@0000:00:07.1
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:30 memory:fe200000-fe4fffff
           *-generic:0 UNCLAIMED
                description: Non-Essential Instrumentation
                product: Zeppelin/Raven/Raven2 PCIe Dummy Function
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0
                bus info: pci@0000:0a:00.0
                version: 00
                width: 32 bits
                clock: 33MHz
                capabilities: pm pciexpress cap_list
                configuration: latency=0
           *-generic:1
                description: Encryption controller
                product: Family 17h (Models 00h-0fh) Platform Security Processor
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0.2
                bus info: pci@0000:0a:00.2
                version: 00
                width: 32 bits
                clock: 33MHz
                capabilities: pm pciexpress msi msix bus_master cap_list
                configuration: driver=ccp latency=0
                resources: irq:44 memory:fe300000-fe3fffff memory:fe400000-fe401fff
           *-usb
                description: USB controller
                product: Family 17h (Models 00h-0fh) USB 3.0 Host Controller
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0.3
                bus info: pci@0000:0a:00.3
                version: 00
                width: 64 bits
                clock: 33MHz
                capabilities: pm pciexpress msi xhci bus_master cap_list
                configuration: driver=xhci_hcd latency=0
                resources: irq:50 memory:fe200000-fe2fffff
              *-usbhost:0
                   product: xHCI Host Controller
                   vendor: Linux 5.5.13-arch2-1 xhci-hcd
                   physical id: 0
                   bus info: usb@5
                   logical name: usb5
                   version: 5.05
                   capabilities: usb-2.00
                   configuration: driver=hub slots=4 speed=480Mbit/s
                 *-usb:0
                      description: Keyboard
                      product: Keyboard
                      vendor: Cherry GmbH
                      physical id: 1
                      bus info: usb@5:1
                      version: 0.32
                      capabilities: usb-2.00
                      configuration: driver=usbhid maxpower=100mA speed=2Mbit/s
                 *-usb:1
                      description: Keyboard
                      product: USB Gaming Mouse
                      vendor: Holtek
                      physical id: 2
                      bus info: usb@5:2
                      version: 1.06
                      capabilities: usb-2.00
                      configuration: driver=usbhid maxpower=100mA speed=12Mbit/s
              *-usbhost:1
                   product: xHCI Host Controller
                   vendor: Linux 5.5.13-arch2-1 xhci-hcd
                   physical id: 1
                   bus info: usb@6
                   logical name: usb6
                   version: 5.05
                   capabilities: usb-3.00
                   configuration: driver=hub slots=4 speed=5000Mbit/s
        *-pci:3
             description: PCI bridge
             product: Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 8.1
             bus info: pci@0000:00:08.1
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:31 memory:fe800000-fe8fffff
           *-generic UNCLAIMED
                description: Non-Essential Instrumentation
                product: Zeppelin/Renoir PCIe Dummy Function
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0
                bus info: pci@0000:0b:00.0
                version: 00
                width: 32 bits
                clock: 33MHz
                capabilities: pm pciexpress cap_list
                configuration: latency=0
           *-storage
                description: SATA controller
                product: FCH SATA Controller [AHCI mode]
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0.2
                bus info: pci@0000:0b:00.2
                version: 51
                width: 32 bits
                clock: 33MHz
                capabilities: storage pm pciexpress msi ahci_1.0 bus_master cap_list
                configuration: driver=ahci latency=0
                resources: irq:43 memory:fe808000-fe808fff
           *-multimedia
                description: Audio device
                product: Family 17h (Models 00h-0fh) HD Audio Controller
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0.3
                bus info: pci@0000:0b:00.3
                version: 00
                width: 32 bits
                clock: 33MHz
                capabilities: pm pciexpress msi bus_master cap_list
                configuration: driver=snd_hda_intel latency=0
                resources: irq:59 memory:fe800000-fe807fff
        *-serial
             description: SMBus
             product: FCH SMBus Controller
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 14
             bus info: pci@0000:00:14.0
             version: 59
             width: 32 bits
             clock: 66MHz
             configuration: driver=piix4_smbus latency=0
             resources: irq:0
        *-isa
             description: ISA bridge
             product: FCH LPC Bridge
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 14.3
             bus info: pci@0000:00:14.3
             version: 51
             width: 32 bits
             clock: 66MHz
             capabilities: isa bus_master
             configuration: latency=0
     *-pci:1
          description: Host bridge
          product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 101
          bus info: pci@0000:00:01.0
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:2
          description: Host bridge
          product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 102
          bus info: pci@0000:00:02.0
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:3
          description: Host bridge
          product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 103
          bus info: pci@0000:00:03.0
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:4
          description: Host bridge
          product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 104
          bus info: pci@0000:00:04.0
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:5
          description: Host bridge
          product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 105
          bus info: pci@0000:00:07.0
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:6
          description: Host bridge
          product: Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 106
          bus info: pci@0000:00:08.0
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:7
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 107
          bus info: pci@0000:00:18.0
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:8
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 108
          bus info: pci@0000:00:18.1
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:9
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 109
          bus info: pci@0000:00:18.2
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:10
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 10a
          bus info: pci@0000:00:18.3
          version: 00
          width: 32 bits
          clock: 33MHz
          configuration: driver=k10temp
          resources: irq:0
     *-pci:11
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 10b
          bus info: pci@0000:00:18.4
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:12
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 10c
          bus info: pci@0000:00:18.5
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:13
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 10d
          bus info: pci@0000:00:18.6
          version: 00
          width: 32 bits
          clock: 33MHz
     *-pci:14
          description: Host bridge
          product: Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
          vendor: Advanced Micro Devices, Inc. [AMD]
          physical id: 10e
          bus info: pci@0000:00:18.7
          version: 00
          width: 32 bits
          clock: 33MHz

More info about the related installed packages:

linux 5.5.13.arch2-1
mesa 20.0.3-1
opencl-mesa 20.0.3-1

Segfaulting KataGo produces

# dmesg
[ 5143.722198] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 16800 thread katago:cs0 pid 16801
[ 5143.722203] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
[ 5143.722205] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04800C
[ 5143.722210] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
[ 5144.595948] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 16800 thread katago:cs0 pid 16801
[ 5144.595952] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
[ 5144.595954] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04800C
[ 5144.595959] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
[ 5152.414302] katago[16855]: segfault at 8 ip 00007f6af5865c1d sp 00007f6a537fd050 error 6 in pipe_radeonsi.so[7f6af576a000+2ad000]
[ 5152.414316] Code: 00 00 8b 83 40 06 00 00 c6 83 2d 6e 00 00 01 48 89 df 48 8d 74 24 60 83 e0 fe 83 c8 02 89 83 40 06 00 00 48 8b 83 c8 04 00 00 <c6> 40 08 01 ff 93 68 03 00 00 48 8b 93 c8 04 00 00 8b 83 40 06 00
[ 5152.414365] audit: type=1701 audit(1585684467.023:119): auid=1000 uid=1000 gid=1000 ses=4 pid=16800 comm="katago" exe="/home/xy/repos/KataGo/cpp/katago" sig=11 res=1
[ 5152.433480] audit: type=1130 audit(1585684467.043:120): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-16870-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5155.633706] audit: type=1131 audit(1585684470.243:121): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-16870-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
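
The numbers in the fault lines above can be cross-checked against each other. Here is a minimal Python sketch that splits the VM_CONTEXT1_PROTECTION_FAULT_STATUS value into fields and turns the 0x54433400 tag back into text. The bit positions are an assumption: they were picked because they reproduce the values printed in the "VM fault" line (0x0c, vmid 7, client id 72, read), not taken from official register documentation. The function names are made up for illustration.

```python
def decode_fault_status(status: int) -> dict:
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into fields.

    Assumed layout (inferred from the log, not from official docs):
    bits 0-7 protections, bits 12-20 memory client id,
    bit 24 read/write flag, bits 25-28 VMID.
    """
    return {
        "protections": status & 0xFF,
        "client_id": (status >> 12) & 0x1FF,
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
    }


def decode_client_tag(tag: int) -> str:
    """Interpret the 32-bit client tag as big-endian ASCII, dropping NUL padding."""
    return tag.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    print(decode_fault_status(0x0E04800C))
    print(decode_client_tag(0x54433400))
    # The fault address 0x00000600 is the same number as "page 1536".
    print(hex(1536))
```

Under these assumptions, the only difference between 0x0E04800C here and 0x0C04800C in the journal excerpt further down is the VMID (7 versus 6). The faulting page and the client are the same every time.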

Running KataGo with settings that don't segfault immediately causes lag, sometimes freezes the system, and produces:

# journalctl -xa
...
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 17185 thread katago:cs0 pid 17189
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
...
Mar 31 23:27:07 arx rtkit-daemon[1409]: The canary thread is apparently starving. Taking action.
Mar 31 23:27:07 arx rtkit-daemon[1409]: Demoting known real-time threads.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1838 of process 1722.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1433 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1427 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1404 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Demoted 4 threads.
...
Mar 31 23:29:50 arx kernel: amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 17185 thread katago:cs0 pid 17189
Mar 31 23:29:50 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
Mar 31 23:29:52 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
Mar 31 23:29:55 arx kernel: amdgpu 0000:09:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
Mar 31 23:35:47 arx systemd-udevd[732]: card0: Worker [17487] processing SEQNUM=4011 is taking a long time
-- Reboot --

where `...` denotes many more instances of the first four lines printed in full.
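
To get a feel for how many faults are hidden behind the elided parts, the kernel log can be summarized per process. The following Python sketch counts "GPU fault detected" lines grouped by process name and pid. The regular expression is based only on the line format shown above, and the function name is hypothetical.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 17185 ...
FAULT_RE = re.compile(
    r"GPU fault detected: (\d+) (0x[0-9a-fA-F]+) for process (\S+) pid (\d+)"
)


def count_gpu_faults(lines):
    """Return a Counter mapping (process name, pid) to the number of fault lines."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            process, pid = match.group(3), int(match.group(4))
            counts[(process, pid)] += 1
    return counts


if __name__ == "__main__":
    # Reads log text from standard input.
    for (process, pid), n in count_gpu_faults(sys.stdin).most_common():
        print(f"{process} (pid {pid}): {n} GPU faults")
```

Saved under a name of your choice (say `count_faults.py`), it could be fed the kernel messages of the previous boot with `journalctl -k -b -1 | python count_faults.py`.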

A backtrace of the segfault shows, and lightvector confirms, that KataGo is just "calling a normal OpenCL routine", `clEnqueueCopyBuffer`.

I would appreciate some hints on how to track the issue down further and, if it turns out to be a bug, where to file a report.

#2 2020-04-02 15:15:26

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

Re: OpenCL App segfaults - GPU fault detected: 146

Are you fine with using closed source packages? There's a package "opencl-amd" in the AUR that extracts just the OpenCL library out of AMD's "AMDGPU-PRO" Ubuntu package. That opencl-amd package solved crashes with OpenCL for me on my RX480. You can remove the "opencl-mesa" package when you install opencl-amd. The opencl-mesa package is just terrible and was crashing my whole PC the last time I tried using it.

There is also the "ROCm" project that offers an alternative to opencl-mesa and is open source. I couldn't build it the last time I wanted to try it. The AUR package for the OpenCL part of ROCm is "rocm-opencl-runtime".
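
After swapping packages it can help to check which OpenCL implementations are actually registered. On Linux the ICD loader normally picks them up from `.icd` files in `/etc/OpenCL/vendors`, each of which names the library to load. The Python sketch below just lists those files. The default path is the usual location and is assumed here, and the function name is hypothetical.

```python
from pathlib import Path


def list_opencl_icds(vendor_dir="/etc/OpenCL/vendors"):
    """Return {icd file name: library it points to} for each registered ICD.

    Returns an empty dict if the directory does not exist.
    """
    directory = Path(vendor_dir)
    if not directory.is_dir():
        return {}
    return {
        icd.name: icd.read_text().strip()
        for icd in sorted(directory.glob("*.icd"))
    }


if __name__ == "__main__":
    for name, library in list_opencl_icds().items():
        print(f"{name}: {library}")
```

If an entry from a removed package is still listed, or the new one is missing, applications may still be picking up the old implementation.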

#3 2020-04-02 16:06:13

xy
Member
Registered: 2015-04-08
Posts: 16

Re: OpenCL App segfaults - GPU fault detected: 146

Wow! Thanks for the hint! I was not aware of how terrible "opencl-mesa" is. I used it because I had not seen these problems mentioned anywhere and I like to stick to open source whenever possible. However, with "opencl-amd", OpenCL applications run faster and without any issues. This should probably be added to the wiki as a note for the AMD Radeon RX 480/580 and maybe other similar GPUs.

https://wiki.archlinux.org/index.php/GPGPU would be the first guess, maybe https://wiki.archlinux.org/index.php/AMDGPU too. Should I add notes there?

Also, shouldn't I file a bug report at mesa3d.org for this?

#4 2020-04-02 16:22:14

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

Re: OpenCL App segfaults - GPU fault detected: 146

It would make sense to report the bug, but I bet that kind of problem is already known upstream. I had crashes with several different programs, so I'm guessing bugs like this are not rare at all.

About open source, you could check out that "rocm-opencl-runtime" AUR package. I wanted to try it a few months ago but it didn't compile at that time. That ROCm stuff is what AMD intends to officially use in the future. The home is here (there's also a fancy homepage for it somewhere on AMD's website):

https://github.com/RadeonOpenCompute/ROCm

Last edited by Ropid (2020-04-02 16:22:41)
