
#1 2024-04-12 18:08:09

mtawame
Member
Registered: 2024-04-12
Posts: 6

[SOLVED] Thunderbolt eGPU PCI error after update to linux-6.8.5

Issue with Radeon eGPU Blackout After System Updates


I've been using two displays attached to a Radeon eGPU, connected to my laptop via Thunderbolt, for quite some time without any issues. After running pacman -Syu earlier today, however, I hit a serious problem: about five minutes into using the eGPU setup, the displays black out and the laptop becomes unusable.

To investigate, I reviewed the upgrade history in /var/log/pacman.log, and the kernel update looked like the most likely culprit. I downgraded to the previous kernel version (6.8.4.arch1-1) and the problem went away. The eGPU setup works as expected again, with no further blackouts.
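For anyone retracing these steps, here is a rough sketch of the log check and the rollback. The helper name upgrades_on is made up for illustration, and the cached package path in the comment is an assumption based on the default pacman cache location and the versions in the log below; check what is actually in your cache before running it.

```shell
#!/bin/bash
# upgrades_on DATE [LOGFILE]: list the packages pacman upgraded on DATE (YYYY-MM-DD).
# Hypothetical helper; LOGFILE defaults to the standard pacman log.
upgrades_on() {
    local day="$1" log="${2:-/var/log/pacman.log}"
    # Keep only "upgraded" entries, then only those stamped with the given day.
    grep -F "[ALPM] upgraded " "$log" | grep "^\[$day"
}

# Example:
#   upgrades_on 2024-04-12
#
# Roll back the kernel from the package cache (path is an assumption):
#   sudo pacman -U /var/cache/pacman/pkg/linux-6.8.4.arch1-1-x86_64.pkg.tar.zst
# If linux-headers is installed, it likely needs the same downgrade.
```

Until a fixed kernel arrives, the downgraded package will be upgraded again by the next pacman -Syu unless it is held back, for example with IgnorePkg in /etc/pacman.conf.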


Summary of Recent System Upgrades:


[2024-04-12T15:18:13+0800] [ALPM] upgraded mesa (1:24.0.4-2 -> 1:24.0.5-1)
[2024-04-12T15:18:13+0800] [ALPM] upgraded lib32-mesa (1:24.0.4-2 -> 1:24.0.5-1)

[2024-04-12T15:18:13+0800] [ALPM] upgraded vulkan-radeon (1:24.0.4-2 -> 1:24.0.5-1)
[2024-04-12T15:18:13+0800] [ALPM] upgraded lib32-vulkan-radeon (1:24.0.4-2 -> 1:24.0.5-1)

[2024-04-12T15:18:14+0800] [ALPM] upgraded linux (6.8.4.arch1-1 -> 6.8.5.arch1-1)
[2024-04-12T15:18:15+0800] [ALPM] upgraded linux-headers (6.8.4.arch1-1 -> 6.8.5.arch1-1)

Additional Log Information:

After the blackout, I could still reach the affected Arch Linux laptop via SSH from another device. I examined the system logs with journalctl and then tried to shut the system down with sudo shutdown 0. The SSH connection dropped, but the laptop never completed the shutdown and remained stuck.


journalctl output:

Apr 12 15:41:47 X16ARCH kernel: pcieport 0000:00:04.1: AER: Uncorrectable (Fatal) error message received from 0000:0b:01.0
Apr 12 15:41:47 X16ARCH kernel: pcieport 0000:0b:01.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data Link Layer, (Receiver ID)
Apr 12 15:41:47 X16ARCH kernel: pcieport 0000:0b:01.0:   device [8086:15ef] error status/mask=00000010/00000000
Apr 12 15:41:47 X16ARCH kernel: pcieport 0000:0b:01.0:    [ 4] DLP                    (First)
Apr 12 15:41:47 X16ARCH kernel: [drm] PCI error: detected callback, state(2)!!
Apr 12 15:41:47 X16ARCH kernel: snd_hda_intel 0000:0e:00.1: AER: can't recover (no error_detected callback)
Apr 12 15:41:48 X16ARCH kernel: pcieport 0000:0b:01.0: AER: Downstream Port link has been reset (0)
Apr 12 15:41:48 X16ARCH kernel: pcieport 0000:0b:01.0: AER: device recovery failed
Apr 12 15:42:08 X16ARCH assert_20240412154208_54.dmp[10724]: Uploading dump (out-of-process)
                                                             /tmp/dumps/assert_20240412154208_54.dmp
Apr 12 15:42:18 X16ARCH assert_20240412154208_54.dmp[10724]: Finished uploading minidump (out-of-process): success = yes
Apr 12 15:42:18 X16ARCH assert_20240412154208_54.dmp[10724]: response: CrashID=bp-7ef27122-589e-4bbf-af50-d6ede2240412
Apr 12 15:42:18 X16ARCH assert_20240412154208_54.dmp[10724]: file ''/tmp/dumps/assert_20240412154208_54.dmp'', upload yes: ''CrashID=bp-7ef27122-589e-4bbf-af50-d6ede2
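The kernel lines above can be separated from the rest of the journal with a small filter. This is only a sketch: aer_lines is a hypothetical helper name, the pattern is simply chosen to match the message prefixes seen in this log, and the -b -1 example assumes the journal is persistent so the previous boot is still available.

```shell
#!/bin/bash
# aer_lines: read journal text on stdin and keep only the PCIe port / AER /
# DRM PCI-error kernel messages. Hypothetical helper; the pattern is based
# on the prefixes that appear in the log excerpt in this thread.
aer_lines() {
    grep -E 'pcieport|AER:|PCI error'
}

# Example (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | aer_lines
```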

Is there anything I can do to fix this blackout, or should I just wait for the next update?

Last edited by mtawame (2024-04-15 06:30:31)


#2 2024-04-13 10:58:11

Lone_Wolf
Administrator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 13,582


This looks like it may be hardware-specific (not noticed by many) and require in-depth troubleshooting.

Let's gather some data first.

Please post the full journal output of a boot with the new kernel, and the output of lspci -k.
What is the brand and model of the device providing the eGPU connection?
How is the eGPU connected to the laptop (USB-C, Thunderbolt 3 cable, Thunderbolt cable, etc.)?

Last edited by Lone_Wolf (2024-04-13 10:58:37)


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.

clean chroot building not flexible enough ?
Try clean chroot manager by graysky


#3 2024-04-15 06:27:49

mtawame
Member
Registered: 2024-04-12
Posts: 6


I solved it by temporarily disabling the dGPU.

On the Arch Linux eGPU wiki page I found a section describing how to disable the dGPU temporarily. Following those instructions, I ran this script before starting X:

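#!/bin/bash
# Unload the NVIDIA modules, dependents first (rmmod fails if one is still in use).
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

# Load vfio-pci and bind the dGPU (0000:01:00.0 here) to it; --nosave keeps the override from persisting.
sudo modprobe vfio-pci
sudo driverctl --nosave set-override 0000:01:00.0 vfio-pci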

The script unloads the NVIDIA drivers and binds the dGPU to vfio-pci instead. With it in place, the PCI error no longer occurs and I can use my Thunderbolt eGPU again.
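To confirm that the override actually took effect, the driver bound to the dGPU can be read back from sysfs. A minimal sketch, assuming the same PCI address as in the script above; bound_driver is a hypothetical helper, and its second argument exists only so an alternative sysfs root can be passed in for testing.

```shell
#!/bin/bash
# bound_driver ADDR [SYSFS_ROOT]: print the kernel driver currently bound to
# the PCI device at ADDR, or "none" if no driver is bound.
# Hypothetical helper; SYSFS_ROOT defaults to /sys.
bound_driver() {
    local addr="$1" root="${2:-/sys}"
    local link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last path component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example:
#   bound_driver 0000:01:00.0
# The same information is shown by: lspci -k -s 01:00.0
```

If this still reports nvidia after running the script, one of the rmmod calls probably failed because something was still using the dGPU.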

