
#1 2021-05-19 16:21:36

whereswaldon
Member
Registered: 2021-05-19
Posts: 5

How to debug unpredictable system hang on battery power?

Hello all! I'm running Arch on a System76 Galago Pro (model galp2). It has treated me well for a long time, but sometimes hangs unpredictably.

What I mean by hanging is:

- the mouse stops responding and all on-screen content freezes (audio is initially unaffected)
- after a few seconds, the mouse will process some more motion and more frames are drawn (anything on-screen that was animating progresses slightly)
- then the mouse stops again, and this time the lockup never recovers. At this point, audio playback begins to loop the last second or so of audio that was played (I assume whatever is currently in the buffer).

During this event, I am unable to switch virtual terminals or interact with the machine in any meaningful way. The only way I know to recover is to hold down the power button.

After I reboot, checking system logs reveals nothing. No errors were recorded, and I assume this may be because those logs weren't flushed to disk or were partially written and discarded as corrupt compressed data (not sure about this).

This problem occurs only on battery power, and it correlates with taxing, graphically accelerated workloads. It has often happened during videoconferencing in Discord, in the browser, or in Signal.

I first experienced this problem on Pop!_OS, and I started running Arch in the hope that newer software would help (and also, you know, because I like Arch). I have seen this happen in both X11 and Wayland environments, though it's much more frequent when running Wayland.

The fact that it persists across different distros makes me think it's either a hardware fault, a firmware problem, or a lack of Linux compatibility with something. Thing is, I can't tell which one of those it is without any kind of log. I'd be amazed if the kernel isn't spitting out some kind of warning, but I don't know how to _see_ it.

So my question really is "how can I capture any errors generated during this time if journalctl isn't capturing them?" If you have other recommendations on how to investigate this problem, I'd love to hear those as well. Thanks for any and all suggestions!

I've captured the output of `lspci -v` here in case it is useful at this stage: https://paste.sr.ht/~whereswaldon/2a188 … 9e057deda2


#2 2021-05-20 12:18:28

whereswaldon
Member
Registered: 2021-05-19
Posts: 5

Re: How to debug unpredictable system hang on battery power?

I was able to capture my log output on a different device using socat:

On my problematic laptop, I ran

sudo dmesg -w | socat - udp-send:10.0.0.2:6666

to send logs to a different device.

On the other device, I ran

socat udp-recv:6666 -

to display them on stdout. That allowed me to capture this error:

[45281.536758] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[45281.537463] i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[45281.546751] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffa, in gnome-shell [852]
[45291.242801] Asynchronous wait on fence 0000:00:02.0:gnome-shell[852]:17b9d4 timed out (hint:intel_atomic_commit_ready [i915])
[45296.372318] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffa, in gnome-shell [852]
[45296.373337] i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
[45296.374043] i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[45296.375683] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0

Nothing after the error is ever printed though (even when I hold the power button to force an emergency shutdown). This makes me suspect that the whole kernel hangs after this GPU hang, though I can't be certain.
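For longer sessions, the receiving side could also save everything to a file and flag the hang lines as they arrive. The following is only a rough sketch of that idea, not something taken from the setup above: the function names `parse_hang` and `receive`, the log file name `remote-dmesg.log`, and the regular expression are all assumptions, and the pattern is modeled only on the `GPU HANG` lines shown in this post. It listens on the same UDP port 6666 used by the `socat` commands.

```python
import re
import socket
import sys
from typing import Optional

# Modeled on lines like the ones captured above, e.g.:
# [45281.546751] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85dffffa, in gnome-shell [852]
# Other kernel versions may format this message differently (assumption).
HANG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+i915 \S+ \[drm\] GPU HANG: "
    r"ecode (?P<ecode>[^,\s]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)


def parse_hang(line: str) -> Optional[dict]:
    """Return details of an i915 GPU HANG line, or None for any other line."""
    m = HANG_RE.match(line)
    if m is None:
        return None
    return {
        "uptime_s": float(m["ts"]),
        "ecode": m["ecode"],
        "process": m["proc"],
        "pid": int(m["pid"]),
    }


def receive(port: int = 6666, log_path: str = "remote-dmesg.log") -> None:
    """Append every received datagram to log_path and report GPU hangs.

    Hypothetical helper: runs until interrupted. A log line split across
    two datagrams would still be saved, but would not be flagged.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    with open(log_path, "a") as log:
        while True:
            data, _addr = sock.recvfrom(65535)
            text = data.decode("utf-8", errors="replace")
            log.write(text)
            log.flush()  # keep the file current in case this side is interrupted
            for line in text.splitlines():
                hang = parse_hang(line)
                if hang is not None:
                    print(
                        f"GPU hang at uptime {hang['uptime_s']:.3f}s "
                        f"in {hang['process']} (pid {hang['pid']})",
                        file=sys.stderr,
                    )
```

Calling `receive()` on the second device would take the place of `socat udp-recv:6666 -`; the sending command on the laptop stays the same.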


#3 2021-05-24 21:08:48

TheGag96
Member
Registered: 2021-05-24
Posts: 2

Re: How to debug unpredictable system hang on battery power?

Hi there. I have a System76 Lemur and I believe I'm having the exact same problem, with the same errors in my journalctl logs. This didn't happen before I updated all my packages just recently. I guess our GPU driver is broken? Is it possible we can roll back a bit and see if that fixes it?


#4 2021-05-25 12:06:24

whereswaldon
Member
Registered: 2021-05-19
Posts: 5

Re: How to debug unpredictable system hang on battery power?

Hi @TheGag96, that's interesting to hear. I have another friend with a Lemur who appears to have the same problem, but who has not yet been able to capture the logs to prove it. I have filed a bug report against the Intel DRM driver here: https://gitlab.freedesktop.org/drm/intel/-/issues/3510

If you do have *exactly* the same logs, can you chime in there with your experiences? In particular, I haven't been able to provide instructions to reproduce the hang. I know it requires me to be on battery, and that's about it. I can't find any particular action that triggers it reliably. I've been writing `ydotool` scripts to try to automate reproducing it, but so far with no luck.
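Since being on battery is the only known precondition, an automated reproduction attempt could first confirm that the machine really is off the charger. Here is a minimal sketch of such a check, assuming the kernel exposes the AC adapter under `/sys/class/power_supply` with a `type` of `Mains` and an `online` flag. The function name `on_battery` is made up for illustration, and on machines that charge over USB-C the adapter may report a different type, which this sketch does not handle.

```python
from pathlib import Path


def on_battery(root: str = "/sys/class/power_supply") -> bool:
    """Return True if a mains adapter exists and none reports online.

    Hypothetical helper. Returns False when no "Mains" supply is found
    at all, since the power state cannot be determined in that case.
    """
    mains_seen = False
    for supply in Path(root).iterdir():
        type_file = supply / "type"
        online_file = supply / "online"
        # Batteries have no "online" file; skip anything incomplete.
        if not type_file.is_file() or not online_file.is_file():
            continue
        if type_file.read_text().strip() != "Mains":
            continue
        mains_seen = True
        if online_file.read_text().strip() == "1":
            return False  # plugged in
    return mains_seen
```

A reproduction script could call `on_battery()` before each attempt and skip or log the run when it returns `False`, so that attempts made while plugged in are not counted as failures to reproduce.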

> This didn't happen before updating all my packages just recently. I guess our GPU driver is broken? Is it possible we can roll back a bit and see if that fixes it?

I have had this problem for over a year, so I don't think rolling back will help much. Perhaps you're just experiencing a similar problem? Or perhaps you have just been getting lucky? I misunderstood the problem for a long time because it usually happened when I was away from home, and I thought it was Wi-Fi related. It turns out I was just on the charger a lot more at home, so there was less opportunity for it to go wrong.


#5 2021-05-25 21:35:25

TheGag96
Member
Registered: 2021-05-24
Posts: 2

Re: How to debug unpredictable system hang on battery power?

Just made a post there.

