Hello everyone,
I’m experiencing persistent stability issues with my AMD Radeon RX 5700 XT (Navi10) on Arch Linux. The system reboots instantly whenever the GPU is heavily used, e.g., in Upscayl (image upscaling) or games. This happens on both Hyprland and KDE Plasma (X11). Additionally, suspend only works properly on linux-lts kernels. Suspend fails on both the mainline (linux) and zen (linux-zen) kernels.
System Info:
Kernel: Currently on linux-lts 6.12.45-1
mesa / vulkan-radeon / opencl-mesa: 25.2.2-1
linux-firmware-amdgpu: 20250808-1
Xorg-server: 21.1.18-2
Display Manager / Compositor: SDDM / Hyprland / Plasma X11
What I’ve Tried:
Downgrading Mesa / Vulkan / opencl-mesa / linux-lts / linux-firmware / linux-firmware-amdgpu / linux-firmware-radeon: temporary partial stability
Testing both linux-lts 6.16.38 and linux-zen / mainline kernels: same reboot issue
Kernel parameters: amdgpu.noretry=0 amdgpu.gpu_recovery=1 pcie_aspm=off
Stress-testing GPU with glmark2 (works fine under light OpenGL load)
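To confirm that the kernel parameters listed above actually reached the running kernel, a small helper can look them up on the kernel command line. This is only a sketch: the function name cmdline_value is made up for illustration, and it assumes the parameters were passed on the command line (readable from /proc/cmdline) rather than via a modprobe configuration file.

```shell
# Print the value of a kernel parameter, or return 1 if it is absent.
# $1 = parameter name
# $2 = command line string (optional; defaults to the contents of /proc/cmdline)
cmdline_value() {
    name=$1
    line=${2-$(cat /proc/cmdline)}
    # Split the command line on whitespace and look for name=value
    for word in $line; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Example (run on the affected machine):
#   cmdline_value amdgpu.gpu_recovery || echo "parameter not set"
```

If the helper reports a parameter as missing, the bootloader entry was probably not regenerated or the edit applied to a different boot entry.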
Observations:
System reboots instantly under GPU compute / Vulkan / Upscayl, sometimes before the task even begins.
glmark2 and light OpenGL workloads run fine.
Reboots happen regardless of compositor or display server (Hyprland / Plasma X11).
Suspend only works on linux-lts; fails on mainline and zen kernels.
Even with fully aligned driver stack (Mesa, Vulkan, libva, OpenCL, firmware) the problem persists.
Question:
Has anyone successfully stabilized an RX 5700 XT on Arch with compute workloads, suspend working properly on non-LTS kernels, and stable on X11/Wayland? Could this be a deeper kernel / amdgpu issue rather than just a driver mismatch? Any insights, known-good kernel + driver + firmware snapshots, or workarounds would be hugely appreciated.
Thanks in advance.
Last edited by defaultuser0 (2025-09-11 11:52:51)
Offline
The system reboots instantly whenever the GPU is heavily used
https://wiki.archlinux.org/title/Ryzen#Troubleshooting - GPU failures would not cause reboots (but be related to the above for APUs and in case the GPU just starves the CPU)
Next to the very specific Ryzen problems, general overheating (monitor temperatures) or under-powering (is this a notebook?) would cause emergency shutdowns/reboots
What are the exact symptoms of "improperly" behaving suspends?
And is this S3 or s2idle? ("cat /sys/power/mem_sleep")
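For reference, /sys/power/mem_sleep lists the available suspend modes and marks the active one in square brackets, where "deep" corresponds to S3 and "s2idle" to suspend-to-idle. A minimal sketch for extracting the active mode follows; the function name active_mem_sleep is hypothetical.

```shell
# Print the active suspend mode, i.e. the entry shown in [brackets].
# $1 = file to read (optional; defaults to /sys/power/mem_sleep)
active_mem_sleep() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/power/mem_sleep}"
}

# Example (run on the affected machine):
#   active_mem_sleep
# To try the other mode until the next reboot (as root):
#   echo s2idle > /sys/power/mem_sleep
```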
Offline
https://wiki.archlinux.org/title/Ryzen#Troubleshooting - GPU failures would not cause reboots (but be related to the above for APUs and in case the GPU just starves the CPU)
Next to the very specific Ryzen problems, general overheating (monitor temperatures) or under-powering (is this a notebook?) would cause emergency shutdowns/reboots
What are the exact symptoms of "improperly" behaving suspends?
And is this S3 or s2idle? ("cat /sys/power/mem_sleep")
I ran a pure CPU stress test through stress-ng, and my system did not reboot, which suggests the CPU itself is stable. I am using a Ryzen 5 5500. The reboots happen when the GPU is under load (in Upscayl, games, or Vulkan workloads) and sometimes randomly. But since the reboot is instant, the kernel has no time to write anything to the journal. For suspend: when I try, the system fails to enter suspend at all, like, my screen goes gray, and then nothing works until I reboot my PC. And I’m currently on s2idle [deep]. I’m using a discrete RX 5700 XT with the latest kernel, the latest Mesa/Vulkan/libva/OpenCL, and the latest linux-firmware-amdgpu. Any ideas whether this could be a GPU/kernel/firmware interaction issue rather than a CPU problem?
Offline
For suspend: when I try, the system fails to enter suspend at all, like, my screen goes gray, and then nothing works until I reboot my PC.
After another failed suspension on linux-zen, I checked the logs through journalctl, and found out the issue.
kernel: amdgpu 0000:12:00.0: amdgpu: psp reg (0x16061) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
kernel: [drm] psp mode 1 reset failed!
kernel: amdgpu 0000:12:00.0: amdgpu: GPU mode1 reset failed
kernel: amdgpu 0000:12:00.0: PM: pci_pm_suspend_noirq(): amdgpu_pmops_suspend_noirq [amdgpu] returns -22
kernel: amdgpu 0000:12:00.0: PM: dpm_run_callback(): pci_pm_suspend_noirq returns -22
kernel: amdgpu 0000:12:00.0: PM: failed to suspend async noirq: error -22
kernel: PM: noirq suspend of devices failed
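A rough way to pull lines like the ones above out of the kernel log is to filter it for the failure messages. The sketch below is only an illustration: the function name amdgpu_suspend_errors is made up, and the patterns are taken from the messages in this post, so other failure modes will not be matched.

```shell
# Keep only kernel log lines that point at a failed amdgpu suspend.
# Reads log text on stdin.
amdgpu_suspend_errors() {
    grep -E 'reset failed|psp reg .* wait timed out|failed to suspend|suspend of devices failed'
}

# Example (run on the affected machine):
#   journalctl -k -b 0  | amdgpu_suspend_errors   # current boot
#   journalctl -k -b -1 | amdgpu_suspend_errors   # previous boot, if a reboot was needed
```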
Offline
I ran a pure CPU stress test through stress-ng, and my system did not reboot, which suggests the CPU itself is stable.
No, the error is triggered by specific access patterns, apparently core cycling does this. Just constant maximum load usually does not.
https://bbs.archlinux.org/viewtopic.php … 9#p2260219
and I’m currently on s2idle [deep]
Ie S3, does s2idle work?
https://wiki.archlinux.org/title/Power_ … end_method
Though also you're likely in the https://bbs.archlinux.org/viewtopic.php … 9#p2258599 bucket.
Offline
Okay, I've dug deeper and came back with an update:
Suspend only works if I downgrade both linux to 6.16.1.arch1-1 and linux-zen to 6.16.1.zen1-1. With newer kernels, suspend breaks.
I did not downgrade linux-firmware, so suspend seems to depend purely on the kernel version.
GPU-heavy workloads (Upscayl, Vulkan apps, games) still trigger hard reboots, even if I set the amdgpu.noretry=0 amdgpu.gpu_recovery=1 pcie_aspm=off parameters in GRUB for one boot.
The system reboots so suddenly that no kernel logs are captured right before or after the crash.
does s2idle work?
While the suspend issue was still present, I tried s2idle once, and I think it most likely didn't enter suspend. On s2idle my monitor turned off (whereas on S3 it displayed a gray screen), along with my keyboard's lighting, but the fans of my PC were still spinning. When I tried to wake it up, it did not respond. But now that I have downgraded the kernel, suspend works again on S3.
Alright, the suspend issue is hopefully fixed. But what do I do about the instant reboots under heavy GPU usage? Should I try downgrading the drivers again, or is it pointless?
Offline
See whether enforcing a performance level or power profile avoids the crashes, https://wiki.archlinux.org/title/AMDGPU … nce_levels
AMD has issues w/ adaptive power modes, you could try to run some 6.14 kernel (downgrading the kernel in isolation is, unlike with pretty much every other package, ok-ish)
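As a sketch of what forcing a performance level can look like: amdgpu exposes a power_dpm_force_performance_level file in sysfs that accepts values such as auto, low and high. The function name set_perf_level is hypothetical, and the default path assumes the GPU is card0, which may differ on a given system.

```shell
# Write a forced performance level to the amdgpu sysfs file.
# $1 = level (auto, low or high)
# $2 = target file (optional; the default assumes the GPU is card0)
set_perf_level() {
    level=$1
    target=${2:-/sys/class/drm/card0/device/power_dpm_force_performance_level}
    case $level in
        auto|low|high) printf '%s\n' "$level" > "$target" ;;
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

# Example (as root, on the affected machine):
#   set_perf_level high    # pin the clocks high
#   set_perf_level auto    # back to the default behaviour
```

The setting does not survive a reboot, so it is a cheap experiment to run before touching anything permanent.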
Offline
Well, now I tried changing performance level to low, and nothing changed. The crashes still occur.
you could try to run some 6.14 kernel
I also tried that, but 6.14.7-zen2-1-zen changed nothing, and neither did 6.14.6-zen1-1-zen. With 6.14.9-zen1-1-zen, however, my GPU fans maxed out, the monitor turned off, and the system hard-locked: nothing works until I forcibly shut down my PC. And yes, I do mean forcibly, because hitting the restart button does nothing.
Offline
Well, now I tried changing performance level to low, and nothing changed. The crashes still occur.
So unlikely the absolute power draw of the GPU.
Test whether core cycling causes this and in any event have a look at the PBO adjustment.
Also check the journal after a spontaneous reboot (not the one that crashed but the running one) for MCE errors (see the Ryzen troubleshooting section again)
Offline
Also check the journal after a spontaneous reboot (not the one that crashed but the running one) for MCE errors (see the Ryzen troubleshooting section again)
I checked the journal through
journalctl -b -1 | grep -i mce
And only saw “In-kernel MCE decoding enabled” (no actual MCEs). So the reboots probably aren’t triggered by the CPU itself throwing a fatal machine check.
Offline
Not! "-1"
You're not interested in the boot that crashed but the subsequent one.
The MCE errors would only prove the condition, their absence doesn't prove all that much.
The behavior is OTR, it's not the absolute power draw of the GPU, try core cycling and/or PBO adjustment.
The next potential offender would be the RAM, especially when it's used for GART/GTT
But you'll have to run memtest86+ for hours (the entire night ie. 16+ h) to rule out issues w/ that.
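The MCE check on the boot after the crash, as described above, could look roughly like the sketch below. The function name mce_reports is made up for illustration; it drops the "In-kernel MCE decoding enabled" banner, which is printed at boot and is not an error.

```shell
# Keep lines that look like machine check reports, dropping the harmless
# boot-time banner "In-kernel MCE decoding enabled".
# Reads log text on stdin.
mce_reports() {
    grep -iE 'mce|machine check' | grep -v 'In-kernel MCE decoding enabled'
}

# Example (run right after a spontaneous reboot, on the current boot):
#   journalctl -k -b 0 | mce_reports
```

An empty result does not clear the CPU; as noted above, the errors would only prove the condition.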
Offline
Okay, I came back after like 9 hours of memtest86+ running. The test was passed 11 times, 0 errors.
So, what I can say is that the CPU is stable under stress load, the RAM is stable, and the GPU runs synthetic benchmarks fine (glmark2 didn’t cause reboots). But real workloads like Upscayl still reboot my PC. I mostly suspect driver instability (Mesa / amdgpu / ROCm / OpenCL stack quirks), since the glmark2 run was smooth and it only used OpenGL. Upscayl, on the other hand, leans on Vulkan + OpenCL, which is a lot more fragile on Navi10 (RX 5700 XT). Also, speaking of drivers, I noticed that pacman -Syu offers new versions of Mesa, Vulkan, and OpenCL, but I decided not to install them because I don't think they will fix anything. But, what else do you suggest me to do now? Because I don't know anymore.
Offline
But, what else do you suggest me to do now? Because I don't know anymore.
https://wiki.archlinux.org/title/Ryzen#Troubleshooting
Again: maximum CPU stress is entirely *NOT* what is causing this, it very much *is* "real workloads" dependent.
Offline
Okay, I came back and I finally fixed it. The cause was because my CPU voltage was too low in the firmware settings. I increased it to 1.3V (it was 1.2750 before), and the rebooting stopped.
Offline
Hi
I am experiencing a similar issue for suspend failure.
My system does not reboot. But....
It occurs on both the stock kernel and the CachyOS kernel. When I force systemctl suspend, the PC appears to suspend—the screens turn off—but the system keeps running in the background. Upon attempting to resume, the displays remain black and I cannot return to the session. This issue did not occur before, and undervolting is not a factor, as it worked fine with my previous setup.
amdgpu 0000:08:00.0: amdgpu: MODE1 reset
amdgpu 0000:08:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:08:00.0: amdgpu: GPU psp mode1 reset
amdgpu 0000:08:00.0: amdgpu: psp reg (0x16061) wait timed out, mask: 8000ffff, read: ffffffff exp: 80000000
[drm] psp mode 1 reset failed!
amdgpu 0000:08:00.0: amdgpu: GPU mode1 reset failed
amdgpu 0000:08:00.0: PM: pci_pm_suspend_noirq(): amdgpu_pmops_suspend_noirq [amdgpu] returns -22
amdgpu 0000:08:00.0: PM: dpm_run_callback(): pci_pm_suspend_noirq... returns -22
amdgpu 0000:08:00.0: PM: failed to suspend async noirq: error -22
Last edited by SupKurtJ (2025-09-15 22:57:13)
Offline
Okay, I came back and I finally fixed it. The cause was because my CPU voltage was too low in the firmware settings. I increased it to 1.3V (it was 1.2750 before), and the rebooting stopped.
I think you might be dealing with two separate issues.
I have the same GPU as you, but I don’t experience reboots (related to your undervolting—using Curve Optimizer, if available in your BIOS, is a better way to lower per-core voltage and fine-tune stability for each core).
Regarding the suspend failures, I’ve found that many people are experiencing the same problem:
https://gitlab.freedesktop.org/drm/amd/-/issues/4531
https://discuss.cachyos.org/t/kernel-6- … or/14937/9
https://forum.manjaro.org/t/stable-upda … /181125/76
So I think it's an issue from mesa amd driver (I use it) or kernel.
Offline
It seems that a patch exists for this issue
https://gitlab.freedesktop.org/agd5f/li … c3f5bc3a82
Offline
defaultuser0 wrote:Okay, I came back and I finally fixed it. The cause was because my CPU voltage was too low in the firmware settings. I increased it to 1.3V (it was 1.2750 before), and the rebooting stopped.
I think you might be dealing with two separate issues.
I have the same GPU as you, but I don’t experience reboots (related to your undervolting—using Curve Optimizer, if available in your BIOS, is a better way to lower per-core voltage and fine-tune stability for each core).
Regarding the suspend failures, I’ve found that many people are experiencing the same problem:
https://gitlab.freedesktop.org/drm/amd/-/issues/4531
https://discuss.cachyos.org/t/kernel-6- … or/14937/9
https://forum.manjaro.org/t/stable-upda … /181125/76
So I think it's an issue from mesa amd driver (I use it) or kernel.
Kernel 6.17.0-0rc6 appears to fix this issue. I have a machine with an RX 5700 that was experiencing what I assume to be the same suspend/resume bug on 6.16.7-1. It seems to be happening with 5000 series GPUs for some reason. I booted Fedora Rawhide 20250917 running kernel version 6.17.0-0rc6, and suspend and resume seem to be fixed.
Offline