#1 2022-11-29 04:29:22

jchau_xyz
Member
Registered: 2022-11-29
Posts: 14

[SOLVED] Laptop random crashes with eGPU attached

Alright, I'll try my best to summarize my issue.

For about a month now, my laptop + eGPU setup with 2 external monitors has been rebooting seemingly at random (a full reboot: I see the BIOS POST screen).

It can happen within 1h, after 5h, or not at all.

I am not sure whether there is something wrong with my hardware or I might have messed up something in my eGPU setup, so I am hoping someone can sniff a lead.

Here's a neofetch:

                   -`                    jchau@archz13
                  .o+`                   -------------
                 `ooo/                   OS: Arch Linux x86_64
                `+oooo:                  Host: 21D2CTO1WW ThinkPad Z13 Gen 1
               `+oooooo:                 Kernel: 6.0.10-arch2-1
               -+oooooo+:                Uptime: 19 mins
             `/:-:++oooo+:               Packages: 885 (pacman)
            `/++++/+++++++:              Shell: zsh 5.9
           `/++++++++++++++:             Resolution: 1920x1200
          `/+++ooooooooooooo/`           WM: awesome
         ./ooosssso++osssssso+`          Theme: Everforest-Dark-B [GTK2/3]
        .oossssso-````/ossssss+`         Icons: Everforest-Dark [GTK2/3]
       -osssssso.      :ssssssso.        Terminal: kitty
      :osssssss/        osssso+++.       CPU: AMD Ryzen 7 PRO 6850U with Radeon 
     /ossssssss/        +ssssooo/-       GPU: AMD ATI Radeon RX 6800/6800 XT / 6
   `/ossssso+/:-        -:/+osssso+-     GPU: AMD ATI Radeon 680M
  `+sso+:-`                 `.-/+oso:    Memory: 2321MiB / 31361MiB
`++:.                           `-/+/
.`                                 `/

Some things to consider:

  • I don't use this setup for anything demanding; just some browsing and coding.

  • Temps seem fine.

  • Crashes have happened on Windows as well in the past, but it seems to have been more stable recently (fingers crossed).

  • I use TLP and have tried disabling USB autosuspend and runtime PM for the devices related to the eGPU, which sadly did not fix it.

  • I use the internal screen and the external screens together, so both the iGPU and the eGPU are in use.

  • I have tried using the LTS kernel, but that gave me another problem where my system would freeze except my cursor, forcing me to hold down the power button.

  • These random reboots do not happen without an eGPU, but the problem mentioned right above used to happen (hasn't happened in a while tho).

  • RAM seems to be healthy according to MemTest86.

  • I have an issue booting the laptop with the eGPU attached: the laptop beeps at me (a display-related error according to the Lenovo Diagnostics code) and does not recognize my internal display at all.
    If I plug in the eGPU moments after the BIOS POST screen shows, the internal display works, but my external displays are not recognized (going by lspci and xrandr).
    To make the eGPU setup work, I have to wait for the BIOS POST > plug in the eGPU > boot into Linux > reboot with the eGPU still attached. Maybe this info will be useful...

  • I haven't touched the kernel parameters in my GRUB config related to ASPM and the like yet, because I am worried they might affect battery life when I am using my laptop as a laptop.

  • I doubt that the external graphics card is the issue (copium) since I also had other similar issues with another card.

  • My cursor flickers on my external monitors, but not on my internal one.


I could not spot much in the journalctl -b-1 logs after a reboot happens other than this error:

Nov 28 22:26:29 archz13 kernel: mce: [Hardware Error]: Machine check events logged
Nov 28 22:26:29 archz13 kernel: [Hardware Error]: Corrected error, no action required.
Nov 28 22:26:29 archz13 kernel: [Hardware Error]: CPU:0 (19:44:1) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c011b
Nov 28 22:26:29 archz13 kernel: [Hardware Error]: Error Addr: 0x0000000036bb8f40
Nov 28 22:26:29 archz13 kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x000001d70a241f00
Nov 28 22:26:29 archz13 kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 12
Nov 28 22:26:29 archz13 kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

This seems to be firmware-related, but updating my BIOS and EC firmware did not make it go away. I don't even know whether it is related to the reboot problem.
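
For reference, the MC17_STATUS word in the log above can be decoded by hand. This is a minimal sketch, assuming the AMD MCA status bit positions as I understand them from the Linux kernel's arch/x86/include/asm/mce.h; decode_mca_status and extended_error_code are hypothetical helper names, not part of any existing tool.

```python
# Assumed bit positions of the flags in an AMD MCA status register
# (my reading of the Linux kernel's mce.h; verify before relying on them).
STATUS_BITS = {
    "Val": 63,
    "Over": 62,
    "UC": 61,
    "En": 60,
    "MiscV": 59,
    "AddrV": 58,
    "PCC": 57,
    "TCC": 55,
    "SyndV": 53,
    "CECC": 46,
    "UECC": 45,
    "Deferred": 44,
    "Poison": 43,
}


def decode_mca_status(status):
    """Return the names of the flags set in an MCA status word (hypothetical helper)."""
    return [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]


def extended_error_code(status):
    """Extract the extended error code, assumed to sit in bits 21:16."""
    return (status >> 16) & 0x3F


status = 0xDC204000000C011B  # value copied from the journal excerpt above
print(decode_mca_status(status))
print(extended_error_code(status))
```

For the value above this yields Val, Over, En, MiscV, AddrV, SyndV and CECC with UC clear, and an extended error code of 12, which lines up with the "Corrected error" and "Ext. Error Code: 12" lines in the log.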

I now suspect that either the USB4 cable or the PSU inside the eGPU enclosure might be the culprit (although I don't even know whether that would cause reboots in an eGPU setup), but I'd rather not get ahead of myself and buy new parts without asking here first.

Let me know if you need any other logs from me.

Thanks!

Last edited by jchau_xyz (2022-12-05 03:57:08)


#2 2022-11-29 05:04:50

cfr
Member
From: Cymru
Registered: 2011-11-27
Posts: 7,171

Re: [SOLVED] Laptop random crashes with eGPU attached

I would try removing ctpv (or ctpv-git). It appears to be the first process to dump core. A bunch of other things then crash shortly before the journal ends. You can use coredumpctl to investigate the crashes. To see if ctpv is just the unlucky one this time, look at previous journals to see which processes dump core.
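
One way to see which processes dump core across a journal is to count the systemd-coredump messages per process name. This is a minimal sketch, assuming those messages contain the text "Process <pid> (<name>) of user <uid> dumped core"; count_core_dumps is a hypothetical helper, and the sample lines (PIDs, times, and the second process name) are made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of a systemd-coredump journal message:
#   "Process 1234 (name) of user 1000 dumped core."
DUMP_RE = re.compile(r"Process \d+ \((?P<name>[^)]+)\) of user \d+ dumped core")


def count_core_dumps(journal_text):
    """Count core dumps per process name in plain-text journal output."""
    return Counter(m.group("name") for m in DUMP_RE.finditer(journal_text))


# Illustrative input only: PIDs, timestamps and "example-app" are invented.
sample = """\
Nov 28 20:14:02 archz13 systemd-coredump[4321]: Process 4242 (ctpv) of user 1000 dumped core.
Nov 28 20:14:05 archz13 systemd-coredump[4330]: Process 4300 (ctpv) of user 1000 dumped core.
Nov 28 21:02:30 archz13 systemd-coredump[5120]: Process 5001 (example-app) of user 1000 dumped core.
"""

# Most frequent crashers first.
print(count_core_dumps(sample).most_common())
```

On a real system the text could come from the output of journalctl -b -1 --no-pager (one run per earlier boot) instead of the sample string.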

You should also check your fstab as you have multiple entries somewhere (though I don't think fstab is causing this issue).

Since you also have Windows, boot into Windows and make sure Fast Startup is disabled.


CLI Paste | How To Ask Questions

Arch Linux | x86_64 | GPT | EFI boot | refind | stub loader | systemd | LVM2 on LUKS
Lenovo x270 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz | Intel Wireless 8265/8275 | US keyboard w/ Euro | 512G NVMe INTEL SSDPEKKF512G7L


#3 2022-11-29 07:44:42

jchau_xyz
Member
Registered: 2022-11-29
Posts: 14

Re: [SOLVED] Laptop random crashes with eGPU attached

Oh, I haven't disabled fast startup yet, I'll do that.

I'll also try removing ctpv to see if that helps, though it sounds odd to me since this is just a utility to preview stuff in my file manager?
Edit: I don't think this is it. According to coredumpctl, ctpv has crashed a bunch before, but the dates don't correlate closely with the times I've crashed. Also, in the journalctl log, the ctpv errors happen 2h before the actual crash.

Also, an (unfortunate) update: Windows has crashed twice within 2h since I posted this. Within that time span, I also tried swapping the USB cable, but that was not the problem in the end, unless both of the cables I have are defective...

So it makes me think that maybe it's more hardware related (either my PSU or hopefully not my eGPU)...

Last edited by jchau_xyz (2022-11-29 07:56:42)


#4 2022-11-29 08:40:20

jchau_xyz
Member
Registered: 2022-11-29
Posts: 14

Re: [SOLVED] Laptop random crashes with eGPU attached

I'll share the journalctl logs of the two crashes I had today, maybe that'll give more info.

First crash

Second crash (I compressed the mce Unified Memory Controller error to 1 line in this one)


#5 2022-11-29 14:42:17

cfr
Member
From: Cymru
Registered: 2011-11-27
Posts: 7,171

Re: [SOLVED] Laptop random crashes with eGPU attached

If Fast Startup is enabled, weird problems will manifest in Linux because Windows does not actually shut down, so device states are screwed up.

I would also look at SMART data for disks and run any available tests.

If it runs fine without the eGPU, though ....



#6 2022-11-29 19:00:28

jchau_xyz
Member
Registered: 2022-11-29
Posts: 14

Re: [SOLVED] Laptop random crashes with eGPU attached

Ok, I'll report back with the results if I encounter another crash (fingers crossed fast startup was the suspect)


#7 2022-12-05 03:56:55

jchau_xyz
Member
Registered: 2022-11-29
Posts: 14

Re: [SOLVED] Laptop random crashes with eGPU attached

So far, I haven't crashed again in the same fashion as described above (for now)... so I'll mark this as solved!

I've since fixed the fstab issue, disabled fast startup, removed picom, and added iommu=pt to my grub config.
Not sure which one solved the issue, but it's been running fine for 5-12h blocks of use.
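
To double-check that a parameter such as iommu=pt actually reached the running kernel after the grub config was regenerated, the kernel command line can be inspected. This is a minimal sketch; kernel_params is a hypothetical helper, and the sample command line is made up for illustration (only iommu=pt comes from my setup).

```python
import shlex


def kernel_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        # Split at the first "=" only, so values like UUID=... stay intact.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Illustrative command line; on a live system read the text from /proc/cmdline:
#   with open("/proc/cmdline") as f:
#       params = kernel_params(f.read())
sample = "BOOT_IMAGE=/vmlinuz-linux root=UUID=abcd rw quiet iommu=pt"
print(kernel_params(sample).get("iommu"))
```

If the lookup returns None on the live command line, the parameter was not picked up and the grub config probably needs regenerating.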

I'm still having some trouble making the setup more convenient (i.e. somehow allowing a cold boot with the eGPU attached, without needing to reboot to get it working), but that'll be for me to explore whenever I have time.

I've also encountered a "freeze" crash without the eGPU where everything is frozen except my mouse cursor, but I think excluding the iGPU from runtime PM should solve that.
