
#1 2023-08-02 22:37:39

robbert-vdh
Member
Registered: 2023-08-02
Posts: 7

[SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

This issue has me stumped because I don't really know where the problem lies. A quick Google search doesn't turn up any hits from people suffering from both of these issues at the same time either. I think this started happening about a month ago. The Xid error causes the screen to black out while audio continues playing and the GPU's fans spin at full tilt. The MCE occurs only about once every 10-20 times this happens, and it results in an immediate reboot.

Xid 79 usually seems to be related to insufficient power delivery to the GPU, but the odd thing is that when this happens the GPU isn't anywhere close to its thermal or power limits, nor is it fully utilized. The sporadic MCEs on top of that also make me suspect the GPU itself, the motherboard, and maybe even the CPU. So at this point I'm not really sure which component to replace, and randomly replacing components in the hope that it fixes the problem gets pretty expensive. Does any of this ring a bell for anyone? See the bottom of this post for journald logs of the two errors.

Hardware:

  • Ryzen 9 5900X

  • Gigabyte B550 AORUS ELITE V2 motherboard

  • Gigabyte RTX 2080 SUPER (GAMING OC 8G (rev. 2.0))

  • 2x16 GB 3600 MT/s CL 16 RAM (Crucial Ballistix BL2K8G36C16U4B)

  • Corsair RM850x 2021 power supply

On the software side:

  • Packages are up to date.

  • I'm using amd-ucode, the Zen kernel, and I have tried NVIDIA DKMS driver packages for several 525 and 535 driver versions (through nvidia-all).

  • So far it has only happened with Proton games using DXVK (Bioshock 2 Remastered and Persona 5 Royal). I have not run into any issues on the desktop or while watching movies in MPV with hardware decoding and fairly resource-intensive shaders, and I've also run Furmark for hours on end without any issues. I have not yet tested whether it also happens with those Proton games when idling at a static screen. So far it has only happened while actively interacting with the game.

  • When restarting the game after a reboot, sometimes the issue repeats itself at the exact same place (or time).

  • As mentioned above, when it does happen, the GPU is not anywhere near its limits. I had `nvidia-smi dmon` write the utilization and temperatures to a file every second, and just before the Xid 79 error the GPU was drawing 123 of 250 W at 70 degrees Celsius with only 74%/21% shader/memory utilization.

  • It happens on both Xorg and Wayland. Using KDE Plasma in both cases.

  • I have tried disabling PCIe Active State Power Management with the pcie_aspm=off kernel parameter.
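For anyone who wants to repeat the `nvidia-smi dmon` logging mentioned in the list above, here is a minimal Python sketch for pulling the last samples before a crash out of such a log. It assumes the file starts with the usual `# gpu pwr gtemp ...` header line; the exact column set differs between driver versions, so the columns are read from the header rather than hard-coded. `parse_dmon` and `last_samples` are hypothetical helper names, and the first data row in the sample is made up for illustration (only the 123 W / 70 C / 74% / 21% row comes from the post).

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon`-style text into a list of dicts keyed by column name."""
    columns = None
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            names = line.lstrip("#").split()
            # The first header line names the columns; the units line that
            # follows it (Idx, W, C, %) is skipped.
            if names and names[0] == "gpu":
                columns = names
            continue
        if columns is None:
            continue
        fields = line.split()
        if len(fields) != len(columns):
            continue  # truncated line, e.g. the last write before a hard freeze
        # Values stay strings because unsupported fields may be printed as "-".
        samples.append(dict(zip(columns, fields)))
    return samples


def last_samples(text, count=5):
    """Return the last `count` complete samples of the log."""
    return parse_dmon(text)[-count:]


# Illustrative sample: assumed column layout, first data row invented.
SAMPLE = """\
# gpu   pwr gtemp    sm   mem
# Idx     W     C     %     %
    0   118    69    71    20
    0   123    70    74    21
    0   12
"""

for row in last_samples(SAMPLE):
    print(row["pwr"], "W", row["gtemp"], "C", row["sm"], "% sm", row["mem"], "% mem")
```

The truncated last line is dropped on purpose, since a log that is cut off by a freeze or reboot will often end mid-write.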

On the hardware side:

  • I tried different PCIe slots and I reconnected (and swapped around) the two PCIe power cables.

  • I updated the original 2021 motherboard firmware to the latest version.

  • I tried with and without XMP enabled (that and resizable BAR are the only things I've changed from the stock UEFI settings after updating).

  • I have not tried disabling resizable BAR, but I have also not found any reports of that causing issues.

  • I have also not yet run memtest.

This is an excerpt from the early startup after the MCE:

aug 02 23:06:57 desktop kernel: microcode: CPU0: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU1: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU2: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU6: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU8: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU11: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU12: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU13: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: Machine check events logged
aug 02 23:06:57 desktop kernel: microcode: CPU14: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU20: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU23: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000000000108
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffb397d4fc MISC d012000100000000
aug 02 23:06:57 desktop kernel: microcode: CPU15: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU3: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU19: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU7: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: fbcon: Taking over console
aug 02 23:06:57 desktop kernel: microcode: CPU16: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU4: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU21: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU9: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU5: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU17: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU22: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: CPU10: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: SYND 4d000000 IPID 500b000000000
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1691010410 SOCKET 0 APIC a microcode a201025
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: Machine check events logged
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: CPU 23: Machine Check: 0 Bank 5: bea0000001000108
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: TSC 0 ADDR 35907acdb MISC d012000100000000 SYND 4d000000 IPID 500b000000000
aug 02 23:06:57 desktop kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1691010410 SOCKET 0 APIC 1b microcode a201025
aug 02 23:06:57 desktop kernel: microcode: CPU18: patch_level=0x0a201025
aug 02 23:06:57 desktop kernel: microcode: Microcode Update Driver: v2.2.
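As a reading aid for the two status values in the log above, here is a small Python sketch that splits an MCA status word into its flag bits. The bit positions follow the `MCI_STATUS_*` definitions in the Linux kernel's `arch/x86/include/asm/mce.h` (TCC, SYNDV, DEFERRED and POISON are AMD-specific there); `decode_status` is a hypothetical helper name. It only decodes the generic flags and the low 16-bit error code. It does not identify which unit raised the error, and it does not interpret bit 24, which is the only difference between the two logged values.

```python
# (bit, name) pairs, highest bit first.
STATUS_BITS = [
    (63, "VAL"),       # the logged error is valid
    (62, "OVER"),      # an earlier error was overwritten
    (61, "UC"),        # uncorrected error
    (60, "EN"),        # error reporting was enabled
    (59, "MISCV"),     # MISC register holds valid data
    (58, "ADDRV"),     # ADDR register holds valid data
    (57, "PCC"),       # processor context corrupt
    (55, "TCC"),       # AMD: task context corrupt
    (53, "SYNDV"),     # AMD: SYND register holds valid data
    (44, "DEFERRED"),  # AMD: deferred error
    (43, "POISON"),    # AMD: poisoned data consumed
]


def decode_status(status):
    """Return the set flag names and the low 16-bit error code of a status word."""
    flags = [name for bit, name in STATUS_BITS if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


# The two values from the journal excerpt.
for raw in ("bea0000000000108", "bea0000001000108"):
    decoded = decode_status(int(raw, 16))
    print(raw, decoded["flags"], hex(decoded["error_code"]))
```

Both values have UC and PCC set, i.e. an uncorrected error with the processor context marked as corrupt, which fits the immediate reboot described in the post.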

And this is the Xid error that happens when the screen blanks out and the fans start spinning:

aug 01 21:14:00 desktop kernel: NVRM: GPU at PCI:0000:05:00: GPU-69c30050-6cd9-b1ff-4065-0e002029b3d3
aug 01 21:14:00 desktop kernel: NVRM: Xid (PCI:0000:05:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
aug 01 21:14:00 desktop kernel: NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.
aug 01 21:14:00 desktop kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                NVRM: nvidia-bug-report.sh as root to collect this data before
                                NVRM: the NVIDIA kernel module is unloaded.
aug 01 21:14:00 desktop kernel: nvidia-gpu 0000:05:00.3: Unable to change power state from D3hot to D0, device inaccessible
aug 01 21:14:00 desktop kernel: xhci_hcd 0000:05:00.2: Unable to change power state from D3hot to D0, device inaccessible
aug 01 21:14:00 desktop kernel: xhci_hcd 0000:05:00.2: Unable to change power state from D3cold to D0, device inaccessible
aug 01 21:14:00 desktop kernel: xhci_hcd 0000:05:00.2: Controller not ready at resume -19
aug 01 21:14:00 desktop kernel: xhci_hcd 0000:05:00.2: PCI post-resume error -19!
aug 01 21:14:00 desktop kernel: xhci_hcd 0000:05:00.2: HC died; cleaning up
aug 01 21:14:01 desktop kernel: nvidia-gpu 0000:05:00.3: i2c timeout error ffffffff
aug 01 21:14:01 desktop kernel: ucsi_ccg 7-0008: i2c_transfer failed -110
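To count how often this happens across boots, the Xid lines can be picked out of saved journal text with a regular expression. This is a rough Python sketch that assumes the `NVRM: Xid (PCI:...): <number>, <details>` layout seen in the excerpt above; `find_xid_events` is a hypothetical helper name, and other driver versions may format the line differently.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:05:00): 79, pid=..., GPU has fallen off the bus."
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def find_xid_events(journal_text):
    """Return (pci_address, xid_number, detail) for every Xid line in the text."""
    events = []
    for line in journal_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match["pci"], int(match["xid"]), match["detail"].strip()))
    return events


# Lines copied from the journal excerpt in this post.
JOURNAL = """\
aug 01 21:14:00 desktop kernel: NVRM: GPU at PCI:0000:05:00: GPU-69c30050-6cd9-b1ff-4065-0e002029b3d3
aug 01 21:14:00 desktop kernel: NVRM: Xid (PCI:0000:05:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
aug 01 21:14:00 desktop kernel: NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.
"""

for pci, xid, detail in find_xid_events(JOURNAL):
    print(pci, xid, detail)
```

The input could come from something like `journalctl -k -b -1` saved to a file and read back in.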

Last edited by robbert-vdh (2023-08-12 10:30:06)


#2 2023-08-03 07:01:59

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 70,041

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

Please post your complete system journal for the boot, eg.

sudo journalctl -b -1 | curl -F 'file=@-' 0x0.st

for the previous one.
Also (see the xhci_hcd error and the whole "interaction" thing)

lspci -nn
lsusb -tv

Then see https://wiki.archlinux.org/title/Ryzen#Troubleshooting for the usual suspects. The zen kernel isn't a ryZEN-specific kernel or anything like that, so you might want to try the regular and lts kernels as well.
https://wiki.archlinux.org/title/Kernel … ed_kernels


#3 2023-08-03 07:46:26

robbert-vdh
Member
Registered: 2023-08-02
Posts: 7

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

seth wrote:

Please post your complete system journal for the boot, eg.

sudo journalctl -b -1 | curl -F 'file=@-' 0x0.st

for the previous one.

I don't think there's a lot more useful information in here, but this is a journal from a boot that ended in the Xid 79 error: http://0x0.st/H2E-.txt

seth wrote:

Also (see the xhci_hcd error and the whole "interaction" thing)

lspci -nn
lsusb -tv

The xhci_hcd thing from the log is from the USB-C port that's on the GPU. I'm not seeing any errors related to USB ports from controllers that are not part of the GPU, and if the entire GPU is disconnected from the PCIe bus it makes sense that the USB controller also is.

seth wrote:

Then see https://wiki.archlinux.org/title/Ryzen#Troubleshooting for the usual suspects. The zen kernel isn't a ryZEN-specific kernel or anything like that, so you might want to try the regular and lts kernels as well.
https://wiki.archlinux.org/title/Kernel … ed_kernels

I'm using the Zen kernel for its scheduler and low latency tuning because it gives me better DSP loads in low latency realtime audio applications while still getting good throughput (unlike the linux-rt kernel). I'll try running the games with the LTS kernel next time I get the chance but I have no idea if that would make any difference here.


#4 2023-08-03 11:35:25

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 70,041

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

The xhci_hcd thing from the log is from the USB-C port that's on the GPU.

I know. The idea was that you maybe have a controller wired there (that you then use w/ the game) or that a different device on the bus initiates the crash and the GPU gets entangled through this rather than being the cause.

aug 02 23:24:24 desktop kernel: nvidia-gpu 0000:05:00.3: Unable to change power state from D3hot to D0, device inaccessible

Is the very first sign of any trouble, so please record the HW layout.

Then there's OC (overclocking).

aug 02 23:06:58 desktop nvidia-smi[795]: Warning: persistence mode is disabled on device 00000000:05:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.

and in general maybe try to simplify the test setup and crash the system playing the game on an openbox/X11 session.

On a technical note: I'm pretty sure there's no redactworthy VPN bullshit in the midst of some stacktrace?
Please be more careful w/ the obfuscations, it's ok to replace IPs w/ obvious bogus nonsense (123.45.67.89) but brute-force sed on a generic term just messes up the logs.


#5 2023-08-03 16:24:43

robbert-vdh
Member
Registered: 2023-08-02
Posts: 7

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

seth wrote:

The xhci_hcd thing from the log is from the USB-C port that's on the GPU.

I know. The idea was that you maybe have a controller wired there (that you then use w/ the game) or that a different device on the bus initiates the crash and the GPU gets entangled through this rather than being the cause.

There is nothing plugged into the USB port. I think the system just tries to turn off the card after the Xid error, and it can't since the card is no longer connected to the PCIe bus.

seth wrote:

aug 02 23:24:24 desktop kernel: nvidia-gpu 0000:05:00.3: Unable to change power state from D3hot to D0, device inaccessible

Is the very first sign of any trouble, so please record the HW layout.

~
❯ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex [1022:1480]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU [1022:1481]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0 [1022:1440]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1 [1022:1441]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2 [1022:1442]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3 [1022:1443]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4 [1022:1444]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5 [1022:1445]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6 [1022:1446]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7 [1022:1447]
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 500 Series Chipset USB 3.1 XHCI Controller [1022:43ee]
02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 500 Series Chipset SATA Controller [1022:43eb]
02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 500 Series Chipset Switch Upstream Port [1022:43e9]
03:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43ea]
04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2080 SUPER] [10de:1e81] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller [10de:10f8] (rev a1)
05:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host Controller [10de:1ad8] (rev a1)
05:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller [10de:1ad9] (rev a1)
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
07:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
07:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
07:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
07:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]
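To make the relationship between the GPU and the failing USB controller easier to see, the `lspci -nn` output above can be grouped by bus and device number. The following is a minimal Python sketch (`functions_by_slot` is a hypothetical helper name) that assumes the `bus:device.function description` layout shown above, without a PCI domain prefix.

```python
import re
from collections import defaultdict

ADDRESS_RE = re.compile(r"[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def functions_by_slot(lspci_text):
    """Group lspci lines by "bus:device" so sibling functions end up together."""
    slots = defaultdict(list)
    for line in lspci_text.splitlines():
        address, _, description = line.strip().partition(" ")
        if not ADDRESS_RE.fullmatch(address):
            continue  # skip prompts, blank lines and anything else
        slot, function = address.rsplit(".", 1)
        slots[slot].append((function, description))
    return dict(slots)


# Lines copied from the lspci output in this post.
LSPCI = """\
04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2080 SUPER] [10de:1e81] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller [10de:10f8] (rev a1)
05:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host Controller [10de:1ad8] (rev a1)
05:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller [10de:1ad9] (rev a1)
"""

for function, description in functions_by_slot(LSPCI)["05:00"]:
    print(function, description)
```

All four functions on 05:00 belong to the same physical card, so the xhci_hcd (05:00.2) and nvidia-gpu (05:00.3) messages in the Xid log refer to the GPU's own USB-C hardware rather than to a motherboard controller.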

~
❯ lsusb -tv
/:  Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
/:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 480M
    ID 1d6b:0002 Linux Foundation 2.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Audio, Driver=snd-usb-audio, 480M
        ID 1235:8014 Focusrite-Novation Scarlett 18i8
    |__ Port 1: Dev 2, If 1, Class=Audio, Driver=snd-usb-audio, 480M
        ID 1235:8014 Focusrite-Novation Scarlett 18i8
    |__ Port 1: Dev 2, If 2, Class=Audio, Driver=snd-usb-audio, 480M
        ID 1235:8014 Focusrite-Novation Scarlett 18i8
    |__ Port 1: Dev 2, If 3, Class=Audio, Driver=snd-usb-audio, 480M
        ID 1235:8014 Focusrite-Novation Scarlett 18i8
    |__ Port 1: Dev 2, If 4, Class=Audio, Driver=snd-usb-audio, 480M
        ID 1235:8014 Focusrite-Novation Scarlett 18i8
    |__ Port 1: Dev 2, If 5, Class=Application Specific Interface, Driver=, 480M
        ID 1235:8014 Focusrite-Novation Scarlett 18i8
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
    ID 1d6b:0002 Linux Foundation 2.0 root hub
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 480M
    ID 1d6b:0002 Linux Foundation 2.0 root hub
    |__ Port 1: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 12M
        ID 28de:1142 Valve Software Wireless Steam Controller
    |__ Port 1: Dev 2, If 1, Class=Human Interface Device, Driver=usbhid, 12M
        ID 28de:1142 Valve Software Wireless Steam Controller
    |__ Port 1: Dev 2, If 2, Class=Human Interface Device, Driver=usbhid, 12M
        ID 28de:1142 Valve Software Wireless Steam Controller
    |__ Port 1: Dev 2, If 3, Class=Human Interface Device, Driver=usbhid, 12M
        ID 28de:1142 Valve Software Wireless Steam Controller
    |__ Port 1: Dev 2, If 4, Class=Human Interface Device, Driver=usbhid, 12M
        ID 28de:1142 Valve Software Wireless Steam Controller
    |__ Port 3: Dev 3, If 0, Class=Human Interface Device, Driver=usbhid, 12M
        ID 046d:c07d Logitech, Inc. G502 Mouse
    |__ Port 3: Dev 3, If 1, Class=Human Interface Device, Driver=usbhid, 12M
        ID 046d:c07d Logitech, Inc. G502 Mouse
    |__ Port 6: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
        ID 05e3:0608 Genesys Logic, Inc. Hub
    |__ Port 7: Dev 5, If 0, Class=Human Interface Device, Driver=usbhid, 12M
        ID 048d:5702 Integrated Technology Express, Inc. RGB LED Controller
    |__ Port 9: Dev 6, If 0, Class=Hub, Driver=hub/4p, 480M
        ID 2109:2811 VIA Labs, Inc. Hub
        |__ Port 1: Dev 7, If 0, Class=Human Interface Device, Driver=usbhid, 12M
            ID 3434:0220
        |__ Port 1: Dev 7, If 1, Class=Human Interface Device, Driver=usbhid, 12M
            ID 3434:0220
        |__ Port 1: Dev 7, If 2, Class=Human Interface Device, Driver=usbhid, 12M
            ID 3434:0220

seth wrote:

Then there's OC (overclocking).

Outside of the factory overclock, the card is running at stock speeds. I was also giving Wayland a try in the log I posted and you currently can't overclock NVIDIA cards on Wayland. I have not yet tried to underclock the card.

seth wrote:

aug 02 23:06:58 desktop nvidia-smi[795]: Warning: persistence mode is disabled on device 00000000:05:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.

It will print that on any NVIDIA GPU. Driver persistence is disabled by default. Enabling it will allow slightly faster CUDA startup times in headless setups: https://docs.nvidia.com/deploy/driver-p … index.html

seth wrote:

and in general maybe try to simplify the test setup and crash the system playing the game on an openbox/X11 session.

I'll try Openbox+linux-lts with threadirqs disabled (I normally have that enabled) in the next couple of days when I get the chance, but I don't think Openbox is going to behave differently here considering the problem is the same between Plasma on Wayland and X11.

seth wrote:

On a technical note: I'm pretty sure there's no redactworthy VPN bullshit in the midst of some stacktrace?
Please be more careful w/ the obfuscations, it's ok to replace IPs w/ obvious bogus nonsense (123.45.67.89) but brute-force sed on a generic term just messes up the logs.

That backtrace is the result of SysRq+E. It will list pretty much any application running, including mullvad-gui. I can guarantee that no useful information was lost when redacting journal lines related to mullvad.


#6 2023-08-03 19:05:01

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 70,041

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

Outside of the factory overclock

"of course"

It will print that on any NVIDIA GPU. Driver persistence is disabled by default.

Yes. Enabling it might prevent the D-state transition (and issues)
You could also (at least on X11 w/ nvidia-settings) explicitly move the GPU to maximum performance mode.

the problem is the same between Plasma on Wayland and X11

That's news, and it's then indeed less likely to be relevant.

The edited backtraces related to processes other than mullvad.


#7 2023-08-05 23:41:32

robbert-vdh
Member
Registered: 2023-08-02
Posts: 7

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

seth wrote:

Yes. Enabling it might prevent the D-state transition (and issues)
You could also (at least on X11 w/ nvidia-settings) explicitly move the GPU to maximum performance mode.

I'm still convinced that this isn't the problem, but rather a result of the GPU running into an Xid 79 error and turning itself off. Every Xid 79 error you'll find online looks like this.

But to test your theory, I also enabled maximum performance mode and I disabled the legacy driver persistence mode (which, like it says in the documentation, doesn't do anything if you're not running a headless system). The problem still happened with those settings. I just tried underclocking the card by 1000 MHz on the core clock and 2000 MHz on the memory clock (the lowest they'll go), as I've seen one other report from someone who was able to avoid the issue by underclocking. No crashes yet, but I've only tried this for an hour or two so far. That's still somewhat odd though, considering the card isn't even maxed out and I can run Furmark for hours with no problems.

Like I said in the OP, the problem is almost certainly a hardware problem. I just don't know which component to start replacing first at this point since this could be a power supply issue (even though there are no issues at max load or with power spikes when running native workloads), an issue with the GPU, or even an issue with the CPU or motherboard. And then there's the part where it only seems to happen with Proton/DXVK.

seth wrote:

the problem is the same between Plasma on Wayland and X11

That's news, and it's then indeed less likely to be relevant.

The edited backtraces related to processes other than mullvad.

No, those were definitely from mullvad. And that part of the log isn't relevant anyway. Like I said, all of that happened after I used SysRq+E to send SIGTERM to the still-running processes in order to cleanly reboot the system: https://en.wikipedia.org/wiki/Magic_SysRq_key


#8 2023-08-06 07:28:54

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 70,041

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

Xid 79 means "GPU has fallen off the bus", https://docs.nvidia.com/deploy/xid-errors/index.html
You can envision this literally as "somebody yanked it out of the board" but it means that it doesn't respond on the bus (PCI) level and the reason is almost always (bad seating aside) that it gets underpowered.

If it works better when underclocked, either
1. your PSU is insufficient
2. you forgot to attach the dedicated 6/8-pin power supply
3. you didn't put it into the PEG slot (consult your board)
4. the GPU or PSU cables are badly seated
5. the PSU or cables are broken
6. the board or GPU are broken
7. the GPU overheats (insufficient cooling)
8. any combination of the above

1-4 are easy to check (the PSU looks ok at least on paper, the PEG slot is listed in the board manual, and you can re-plug the GPU and cables to make sure they're in place and firmly attached).
You can monitor 7, but figuring out 5 and 6 is inevitably gonna be a PITA.


#9 2023-08-06 09:51:51

robbert-vdh
Member
Registered: 2023-08-02
Posts: 7

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

Like I said before, I already established that 1-4 and 7 are not the case, and considering that the problem happens only at half power usage and with low temps, I'm also not sure about 5. (Again, I have had Furmark running for hours, which gets the card to the full 250/250 W power budget and to 80-something degrees Celsius with no issues.)

That's why I was curious if anyone's seen this specific combination of problems before (Xid 79+spurious MCEs, only while using Proton+DXVK). I'll try asking in a different forum, thanks for the effort though!


#10 2023-08-06 10:33:44

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 70,041

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

the problem happens only at half power usage and with low temps

Which is (along with the related d3hot/d0 change) why my first best guess was/is that the GPU powers down and doesn't wake up from there.
Do your underclocking efforts result in a static (low) clock rather than adaptive behavior?


#11 2023-08-06 10:36:58

robbert-vdh
Member
Registered: 2023-08-02
Posts: 7

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

seth wrote:

the problem happens only at half power usage and with low temps

Which is (along with the related d3hot/d0 change) why my first best guess was/is that the GPU powers down and doesn't wake up from there.
Do your underclocking efforts result in a static (low) clock rather than adaptive behavior?

The problem does not occur during regular day to day desktop usage though. GPUs also constantly reclock there.

With the underclock the GPU is still set to adaptive mode, but with an additional (static) offset applied to the clock speeds it would normally hit.


#12 2023-08-06 10:54:13

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 70,041

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

And then there's the part where it only seems to happen with Proton/DXVK.

Does it happen if you enforce D3D (WineD3D instead of DXVK)?
https://wiki.archlinux.org/title/Steam/ … _emulation

The ryzen c-states or RAM integrity (still, run memtest86+ overnight) are unlikely to cause this kind of apparently hyper-specific and isolated issue.
You could also try the 470xx drivers hmm

And in case the affected games are 32bit, make sure to have the multilib versions of the driver installed.


#13 2023-08-12 10:29:12

robbert-vdh
Member
Registered: 2023-08-02
Posts: 7

Re: [SOLVED?] MCE bea0000000000108 and XID 79 reboots and freezes w/ DXVK

I've been testing this downclock (-1000 MHz on the core clock, -2000 MHz on the memory clock) for the past week and everything seems to be stable. So it's likely just the GPU dying, as I had tried lowering just the power limit to 150 W before and that didn't solve the problem. If anyone finds this thread, maybe try downclocking first and see if that resolves the problem.
