#1 2024-10-28 15:21:27

why_do_i_need_a_username
Member
Registered: 2024-08-26
Posts: 13

[CLOSED] [drm] VRAM is lost due to GPU reset; hibernate resume broken

Update: this is no longer relevant, so ignore this post. I'm leaving it here because I need to refer to it elsewhere.

I have a ThinkPad A285 (what's relevant is that it has an AMD CPU and (i)GPU) running Arch. (I actually intend to run Debian 12 KDE on it long-term, but I currently have Arch on it as well and the problem happens there too, so *technically* I get to post about it here.) It seems to be completely incapable of hibernating/suspending to disk, at least not in any capacity relevant to desktop use. Arch is set up on encrypted LVM, and the swap partition used for hibernation is also inside a LUKS partition (though a different one from the one holding the LVM PV).

mkinitcpio hooks look like this:

HOOKS=(base systemd autodetect microcode modconf kms keyboard sd-vconsole block sd-encrypt lvm2 filesystems fsck)

Kernel cmdline is this:

$ cat /proc/cmdline
initrd=\amd-ucode.img initrd=\initramfs-linux.img rd.luks.name=2ed266e4-7db0-44d1-ac9d-d402ef7e760d=a285_vg_nvme0n1p3_pv root=/dev/a285_vg/arch rootfstype=btrfs rootflags=defaults,noatime,ssd,space_cache=v2,subvol=/@ rd.luks.name=08b2f8c0-45c8-47ec-88ab-cc5461c07303=cryptswap resume=/dev/mapper/cryptswap rw quiet
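
As a quick sanity check that the kernel is being pointed at the intended swap device, the resume= value can be pulled out of that cmdline. This is only a rough sketch: the function name resume_target is made up for illustration, and it assumes the parameters are separated by single spaces as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): print the value of
# the resume= parameter from a kernel command line given as one string.
resume_target() {
    # Put each parameter on its own line, then keep only the resume= one
    # and strip the "resume=" prefix.
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^resume=//p'
}

# Usage on a live system:
#   resume_target "$(cat /proc/cmdline)"
# The result can then be compared by hand against the active swap device:
#   readlink -f "$(resume_target "$(cat /proc/cmdline)")"
#   cat /proc/swaps
```

For the cmdline above this prints /dev/mapper/cryptswap. The device name listed in /proc/swaps may be the resolved /dev/dm-N node, which is why the usage note runs the value through readlink -f first.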

The experience goes something like this:

1. Run `systemctl hibernate`
2. Wait for it to power off
3. Turn the machine back on
4. When prompted, punch in your LUKS password(s) (I usually set up TPM2 auto-unlock for this kind of setup, but I haven't gotten around to it yet)
5. Be greeted with either a corrupted version of the screen contents when you hibernated the machine initially, an undisturbed (but unresponsive) version, or a black screen
6. Attempt to switch to another VT using Ctrl+Alt+F[1-6,10], nothing happens
7. Forcefully turn the thing off
8. Turn it back on
9. Get presented with these diag code beeps
10. Hold the power button to turn it back off
11. Turn it back on again, boot process goes as normal

The screen looks something like this, at least when it decides to give me the aforementioned "corrupted" version. A seriously more messed-up version is this.

However, the device is very much still running when it's like this, as evidenced by the fact that I am able to SSH into it (over WiFi at that) while in this state, and as such I was able to run `dmesg` and `journalctl -xe` to see what was happening, with the relevant bit being this:

Oct 28 13:14:45 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: Dumping IP State
Oct 28 13:14:45 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: Dumping IP State Completed
Oct 28 13:14:45 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
Oct 28 13:14:45 a285-arch kernel: [drm] PCIE GART of 1024M enabled.
Oct 28 13:14:45 a285-arch kernel: [drm] PTB located at 0x000000F400A00000
Oct 28 13:14:45 a285-arch kernel: [drm] VRAM is lost due to GPU reset!
Oct 28 13:14:45 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: PSP is resuming...
Oct 28 13:14:45 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: reserve 0x400000 from 0xf43fc00000 for PSP TMR
Oct 28 13:14:46 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
Oct 28 13:14:46 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
Oct 28 13:14:46 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Oct 28 13:14:46 a285-arch kernel: [drm] kiq ring mec 2 pipe 1 q 0
Oct 28 13:14:47 a285-arch kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Oct 28 13:14:47 a285-arch kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Oct 28 13:14:47 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(4) failed
Oct 28 13:14:47 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -110
Oct 28 13:14:47 a285-arch kernel: amdgpu 0000:06:00.0: amdgpu: GPU Recovery Failed: -110

It attempts to do this reset at least 3 times, and never succeeds. However, interestingly enough, the mouse cursor still works in this state, at least on Xorg sessions (not on Wayland ones), but other indicators of life like the CapsLock key LED don't seem to do anything.
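
For anyone wanting to count those attempts in their own logs rather than eyeballing them, something like the following could work. It is a minimal sketch, not a proper diagnostic tool: gpu_reset_summary is a name I made up, and the two match strings are simply copied from the log excerpt above, so they may not match other kernel versions or GPUs.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print how many
# GPU resets got as far as "trying to resume" and how many recoveries
# were reported as failed. Match strings are taken from the excerpt above.
gpu_reset_summary() {
    awk '
        /GPU reset succeeded, trying to resume/ { attempts++ }
        /GPU Recovery Failed/                   { failed++ }
        END { printf "attempts=%d failed=%d\n", attempts, failed }
    '
}

# Usage on a live system (kernel messages from the current boot):
#   journalctl -k -b --no-pager | gpu_reset_summary
```

Run over SSH while the machine is stuck, equal attempts and failed counts would mean that no reset recovered the GPU.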

When running like this, the device gets very hot, and drains the battery somewhat quickly, so I try not to leave it like this for long. When trying to shut it down the "proper way" (`sudo poweroff`) via SSH, the screen never actually turns off, the fan keeps spinning, and the only way out is to force it off. Afterwards, I get the same diag beeps (I tried Lenovo's beep diagnostics app and it didn't tell me anything of importance, just that this is error "0102FB", which I wasn't able to find anything about online), and the machine must be forcefully powered off and turned back on again.

As for the desktop environment I have on here, I initially installed GNOME, where it displayed a black screen after resume (only tested Wayland, not Xorg). I then installed XFCE (and switched the login manager from gdm to lightdm), where it shows either a corrupted or an intact version of what was on screen, but it's completely unresponsive and the contents do not update (e.g. the clock on the panel stays the same).

Here are the logs:

dmesg
journalctl -xe
/var/log/Xorg.0.log (probably/possibly irrelevant)

The closest thing to this that I've found online is this amdgpu freedesktop GitLab issue; however, the cause there is different and the logs are also not the same, though the description is pretty similar to what I have here (it does indeed sometimes go black for me as well, I just failed to mention it).

Before anybody mentions it, there are no firmware updates available for this:

$ sudo fwupdmgr get-updates
Devices with no available firmware updates: 
 • SSD 980 500GB
 • System Firmware
 • TPM
 • UEFI Device Firmware
 • UEFI Device Firmware
 • UEFI Device Firmware
No updatable devices

Is there anything that can be done about this, or should I just forgo hibernate/suspend-to-disk/S4 on this machine completely? I really want it to work, since it draws pretty much zero power while hibernated (because the image is on the SSD and it doesn't have to keep the RAM alive), but it wouldn't be that much of an issue if it wasn't there. I'd be perpetually upset about it, though.

Issue originally described here, though on a different distro.

Last edited by why_do_i_need_a_username (2024-10-29 09:39:15)
