How can I preserve the dump produced when the GPU hangs so that I can post it with an upstream bug report?
Sometimes, when my laptop is left alone, it appears unresponsive when I return. Key presses etc. are registered, but the display remains blank. The only remedy I've found is to hard reset the machine. As far as I can tell, this happens when KDE's power manager attempts to suspend the machine and a GPU hang is triggered. At least, it just happened and I found this in the journal:
Rha 07 00:32:10 MyComputer org_kde_powerdevil[1112]: powerdevil: Suspend session triggered with QMap(("GraceFade", QVariant(bool, true))("Type", QVariant(uint, 0)))
Rha 07 00:32:17 MyComputer kernel: [drm] GPU HANG: ecode 9:0:0xfffffffe, reason: Hang on rcs0, action: reset
Rha 07 00:32:17 MyComputer kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Rha 07 00:32:17 MyComputer kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Rha 07 00:32:17 MyComputer kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Rha 07 00:32:17 MyComputer kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Rha 07 00:32:17 MyComputer kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Rha 07 00:32:17 MyComputer kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Rha 07 00:32:29 MyComputer kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Rha 07 00:32:41 MyComputer kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
However, /sys/class/drm/card0/error is empty, presumably because I can view it only after rebooting.
Because this issue cannot be reliably reproduced, I cannot ssh in or hook up an external monitor. To do these things, I'd need to be in a particular room in a particular building on campus. If I could reliably reproduce, I could go there, configure sshd on the laptop and then trigger the bug. But the bug is triggered only occasionally - most of the time, I don't see it. So, short of moving in for a month or so, this kind of diagnostics is not a realistic option.
Is there some way of retrieving the file after rebooting? Or of configuring things so the dump is saved to disk? As I understand it from https://01.org/linuxgraphics/documentat … eport-bugs, the file does not actually have any content, but is generated only on being read. So even a script which automatically copied that file to disk on file change wouldn't, as I understand things, actually do the job.
Note that this question is related to https://bbs.archlinux.org/viewtopic.php?id=231954, but I did not know that a GPU hang was involved when I described the problem there (as the journal didn't contain this information then) and this seemed to me a different question (how do I collect specific diagnostic information?) and better asked separately. Apologies in advance if this is an incorrect judgement - I wasn't sure whether this should be a separate thread or not.
CLI Paste | How To Ask Questions
Arch Linux | x86_64 | GPT | EFI boot | refind | stub loader | systemd | LVM2 on LUKS
Lenovo x270 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz | Intel Wireless 8265/8275 | US keyboard w/ Euro | 512G NVMe INTEL SSDPEKKF512G7L
the file does not actually have any content, but is generated only on being read. So even a script which automatically copied that file to disk on file change wouldn't, as I understand things, actually do the job.
There are probably better approaches, but the above isn't really a limitation.
`cp` itself may not work, but any reading of the file will generate the output - it need not be human reading. You could collect the output of `cat /sys/class/drm/card0/error >> /path/to/log` in a loop - or as a systemd timer to gather the results.
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Perhaps something like the following
dmesg -w | awk '/GPU crash dump saved to \/sys\/class\/drm\/card0\/error/ {system("cat /sys/class/drm/card0/error | bzip2 > error.bz2")}'
Edit:
removed quotes from string match that cause it to fail
Last edited by loqs (2017-12-07 03:22:22)
cfr wrote:the file does not actually have any content, but is generated only on being read. So even a script which automatically copied that file to disk on file change wouldn't, as I understand things, actually do the job.
There are probably better approaches, but the above isn't really a limitation.
`cp` itself may not work, but any reading of the file will generate the output - it need not be human reading. You could collect the output of `cat /sys/class/drm/card0/error >> /path/to/log` in a loop - or as a systemd timer to gather the results.
I guess my thought was that I could have something watch the file and trigger the copy, if the file itself changed to include the dump. Whereas otherwise I have to keep triggering the script, which I thought might be problematic given how infrequently I'm triggering the bug.
What would a better approach be? I only asked about this because I couldn't think of any alternatives. If I've posed an XY problem, I'm happy to learn about Y (or X - can't remember which way round they go).
inotifywatch/inotifywait (inotify-tools), but no guarantees that it's gonna work reliably on sysfs (I'd say "no" but haven't tried in years)
Maybe you could set up a shortcut to dump /sys/class/drm/card0/error into a regular file, sync the filesystem and reboot/stop X11/wayland and reload i915?
Whereas otherwise I have to keep triggering the script
That's why I suggested the loop.
while grep -q '^No error' /sys/class/drm/card0/error; do sleep 1; done
cat /sys/class/drm/card0/error > /path/to/my/log
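A slightly fuller sketch of the polling idea above, assuming the node reports a line starting with `No error` until a hang is recorded (as the loop above assumes). The function names and the `ERROR_NODE` / `OUT_DIR` defaults are illustrative, not something from this thread.

```shell
#!/bin/sh
# Poll the i915 error node and save its contents once a dump appears.
# ERROR_NODE and OUT_DIR are illustrative defaults; adjust to taste.
ERROR_NODE=${ERROR_NODE:-/sys/class/drm/card0/error}
OUT_DIR=${OUT_DIR:-/var/tmp}

# Succeeds only if the node has content and that content is not the
# "No error ..." placeholder. The node is read, not stat'ed, because
# sysfs files do not report a meaningful size.
has_dump() {
    grep -q . "$1" 2>/dev/null && ! grep -q '^No error' "$1"
}

# Write the dump to a timestamped file in the given directory,
# flush it to disk, and print the file name.
save_dump() {
    out="$2/gpu-error-$(date +%Y%m%d-%H%M%S).txt"
    cat "$1" > "$out" && sync
    printf '%s\n' "$out"
}

# Only start polling when invoked as: sh script.sh run
if [ "${1:-}" = run ]; then
    until has_dump "$ERROR_NODE"; do sleep 1; done
    save_dump "$ERROR_NODE" "$OUT_DIR"
fi
```

The `sync` is there because the machine may need a hard reset shortly afterwards; without it the saved file could still be sitting in the page cache.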
I had the same thought: watching dmesg with awk will just sit until a new line is produced by dmesg.
If these errors are logged in dmesg then the problem is much simpler as they should also then be in the journal (which is maintained across a reboot).
The problem is that in the journal you get
Rha 07 00:32:17 MyComputer kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
but it does not contain the contents of /sys/class/drm/card0/error
So the idea with awk was to scan the output of dmesg -w, which follows the kernel log, for the line mentioning /sys/class/drm/card0/error, and then use awk's system() to bzip2 the contents of /sys/class/drm/card0/error to some location.
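The same idea can be sketched as a small script, which makes it easier to give each dump its own file name. This is only a sketch: `watch_for_dump`, the output directory and the file naming are made up for illustration, and it assumes the kernel message contains the text `GPU crash dump saved to` as in the journal excerpt above.

```shell
#!/bin/sh
# Read kernel log lines on stdin; when the i915 "crash dump saved"
# message appears, compress the error node into a timestamped file.
PATTERN='GPU crash dump saved to'

watch_for_dump() {
    node=$1 dir=$2
    while IFS= read -r line; do
        case $line in
            *"$PATTERN"*)
                out="$dir/gpu-error-$(date +%Y%m%d-%H%M%S).bz2"
                # Reading the node is what generates the dump text.
                bzip2 -c < "$node" > "$out"
                sync
                printf 'saved %s\n' "$out"
                ;;
        esac
    done
}

# Only follow the kernel log when invoked as: sh script.sh run [outdir]
if [ "${1:-}" = run ]; then
    dmesg --follow | watch_for_dump /sys/class/drm/card0/error "${2:-/var/tmp}"
fi
```

Like the one-liner, this blocks until dmesg produces a new line, so it costs nothing while idle. It has to be started before the hang occurs, and reading dmesg may require root.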
Last edited by loqs (2017-12-07 14:13:12)
Oops, yes, sorry I missed that part.
@slithery, he *could* but lacks the ability to ssh into it in general (see his first post)
So the idea with awk was to scan the output of dmesg -w, which follows the kernel log, for the line mentioning /sys/class/drm/card0/error, and then use awk's system() to bzip2 the contents of /sys/class/drm/card0/error to some location.
Ah! OK. Now I understand. I didn't realise you could do that with dmesg. That sounds ideal. I'll see if I can get this working.
inotifywatch/inotifywait (inotify-tools), but no guarantees that it's gonna work reliably on sysfs (I'd say "no" but haven't tried in years)
Maybe you could set up a shortcut to dump /sys/class/drm/card0/error into a regular file, sync the filesystem and reboot/stop X11/wayland and reload i915?
I'm not sure. I tried (1) switching to another TTY and (2) killing X during the last incident. No trace of these attempts appeared in the journal. In contrast, brightness key presses were registered. (I had thought that maybe the screen just had brightness set to zero - I didn't know about the GPU hang as this didn't occur in the journal when this happened before.) There was also no trace of my pressing the power button briefly. But sleep key and lid open/close were registered. So I'm not clear what the system will/won't register when in this state.
If these errors are logged in dmesg then the problem is much simpler as they should also then be in the journal (which is maintained across a reboot).
This is actually how I know it is a GPU hang. Last time, I didn't get this. But this time, the GPU hang was logged to the journal. Unfortunately, as noted above, I need the stuff from sysfs to file a useful bug report.
@slithery, he *could* but lacks the ability to ssh into it in general (see his first post)
Oops, I missed that.
Are you really not able to SSH in at other locations, for example by using a smartphone?
seth wrote:@slithery, he *could* but lacks the ability to ssh into it in general (see his first post)
Oops, I missed that.
Are you really not able to SSH in at other locations, for example by using a smartphone?
No smart phone. I have a phone, but I don't think you can ssh by SMS, can you?
Now I just have to remember to invoke loqs's magic line *before* I get a hang. (Rebooted yesterday, left the machine and ... too late.)
Perhaps something like the following
dmesg -w | awk '/GPU crash dump saved to \/sys\/class\/drm\/card0\/error/ {system("cat /sys/class/drm/card0/error | bzip2 > error.bz2")}'
Got it!
Bug ref: https://bugs.freedesktop.org/show_bug.cgi?id=104545 with crash dump attached. (I decompressed it as it was pretty small - hope that is not bad.) Does all that garbage-looking stuff in the middle actually mean something to developers? Just curious - it does not look like dumps I've seen in other cases.
Now my question is: is there some way that I can use this to recover from a hang without forcing a hard reboot? Can I at least force a cleaner shut down? That is, rather than having it copy the dump to a file, would it work to do something else in the `system` bit? I guess it should. At least, I guess I could have it power off or reboot or something? Or won't that work with the GPU hung?
I'm currently trying (as root):
dmesg -w | awk '/GPU crash dump saved to \/sys\/class\/drm\/card0\/error/ {system("systemctl reboot")}'
because I have no idea how to unhang the GPU.
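If the watcher is going to reboot the machine, it may be worth saving the dump first so that a later bug report still has it. A minimal sketch, assuming the recovery command is passed in as arguments; `on_hang` and the output path are hypothetical names, and whether `systemctl reboot` actually completes while the GPU is hung is not something this thread establishes.

```shell
#!/bin/sh
# On a hang: save the dump, flush it to disk, then run a recovery command.
# Usage: on_hang NODE OUTFILE COMMAND [ARGS...]
on_hang() {
    node=$1 out=$2
    shift 2
    cat "$node" > "$out"   # reading the node generates the dump
    sync                   # make sure it is on disk before rebooting
    "$@"                   # e.g. systemctl reboot
}

# Only follow the kernel log when invoked (as root) as: sh script.sh run
if [ "${1:-}" = run ]; then
    dmesg --follow | while IFS= read -r line; do
        case $line in
            *'GPU crash dump saved to'*)
                on_hang /sys/class/drm/card0/error /var/tmp/gpu-error.txt \
                    systemctl reboot
                ;;
        esac
    done
fi
```

Passing the command as arguments makes it easy to try something gentler than a reboot first, without touching the part that saves the dump.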
Last edited by cfr (2018-01-09 04:07:12)
So there is a fix available: https://patchwork.freedesktop.org/patch/187414/. If that works, then you can file a bug with Arch asking for it to be applied.
So there is a fix available: https://patchwork.freedesktop.org/patch/187414/. If that works, then you can file a bug with Arch asking for it to be applied.
The date suggests it was fixed in November, even though I filed the bug right after getting the dump. Does that mean that it is fixed somewhere, but hasn't made it into Arch yet? I am not sure how to tell if it works or not. Do I patch the kernel?
The date suggests it was fixed in November, even though I filed the bug right after getting the dump. Does that mean that it is fixed somewhere, but hasn't made it into Arch yet?
Yes, it is fixed in https://github.com/freedesktop/drm-tip/ … 289cf16703 but not in Linux mainline or Linux stable.
I am not sure how to tell if it works or not. Do I patch the kernel?
Yes.
cfr wrote:The date suggests it was fixed in November, even though I filed the bug right after getting the dump. Does that mean that it is fixed somewhere, but hasn't made it into Arch yet?
Yes, it is fixed in https://github.com/freedesktop/drm-tip/ … 289cf16703 but not in Linux mainline or Linux stable.
Is there a way to tell when it makes it into Arch's stable kernels?
cfr wrote:I am not sure how to tell if it works or not. Do I patch the kernel?
Yes.
Thanks.
It may be a while before it is introduced to a stable kernel: https://git.kernel.org/pub/scm/linux/ke … =v4.15-rc9 does not have the patch.
If the patch works for you, you can file a bug report on the arch bug tracker asking for the patch to be applied until it makes its way through the upstream trees to a stable release.
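One way to check whether a given kernel tag already contains a commit, assuming a local git clone of the relevant kernel tree: `git merge-base --is-ancestor` succeeds when the commit is reachable from the ref. The helper name `ref_contains` is made up, and the commit and ref are supplied as arguments because the hash in the link above is truncated.

```shell
#!/bin/sh
# Succeeds if COMMIT is an ancestor of REF in the clone at REPO,
# i.e. if REF already contains COMMIT.
# Usage: ref_contains REPO COMMIT REF
ref_contains() {
    git -C "$1" merge-base --is-ancestor "$2" "$3"
}

# When run as: sh script.sh REPO COMMIT REF
if [ "$#" -eq 3 ]; then
    if ref_contains "$1" "$2" "$3"; then
        echo "$3 contains $2"
    else
        echo "$3 does not contain $2"
    fi
fi
```

This only works when the commit lands in the tree with the same hash. Fixes backported to stable releases are cherry-picked and get a new hash, so for those it may be necessary to search the log for the patch subject or the upstream hash instead.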
It may be a while before it is introduced to a stable kernel: https://git.kernel.org/pub/scm/linux/ke … =v4.15-rc9 does not have the patch.
If the patch works for you, you can file a bug report on the arch bug tracker asking for the patch to be applied until it makes its way through the upstream trees to a stable release.
It is not clear how I can determine this since the problem manifests rather infrequently. That's one reason I asked how I could tell when it is fixed. Mostly, I'm running my script to power down if it happens and nothing ever does happen: I end up rebooting for some completely independent reason (new kernel or something). I might be able to establish it doesn't work for me, but I can't see how I could establish that it does.