[SOLVED] Driver for AMDGPU doesn't work anymore after a random time

Ekaradon · 2025-04-02 19:05:01

Hello guys,

So I've got this weird issue where I can run this command:

❯ DRI_PRIME=1 LIBGL_DEBUG=verbose glxinfo

And get the desired output with all the 0x2xx lines and so on.
But if I wait a bit (I have no idea how long; it can be almost instantaneous or take a while), I end up with this output instead:

❯ DRI_PRIME=1 LIBGL_DEBUG=verbose glxinfo

name of display: :1
DRI_PRIME: device 0 skipped (default device)
DRI_PRIME: device 1  -> selected (/dev/dri/renderD128)
[1]    33277 segmentation fault (core dumped)  DRI_PRIME=1 LIBGL_DEBUG=verbose glxinfo

If I reboot the computer, it may or may not work. I usually upgrade and reinstall these packages:

sudo pacman -Syu mesa xf86-video-amdgpu linux

And then, after a reboot, it may work. I seriously have no idea what is happening. Why would something work and then break after a while?

Here are some logs I found with journalctl:

april 02 20:54:15 user kernel: [drm] PCIE GART of 512M enabled (table at 0x00000081FEB00000).
april 02 20:54:15 user kernel: amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: reserve 0x1300000 from 0x81fc000000 for PSP TMR
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x00000035, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x00525c00 (82.92.0)
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
april 02 20:54:16 user kernel: [drm] kiq ring mec 3 pipe 1 q 0
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: MES FW versoin must be larger than 0x63 to support limit single process feature.
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: failed to change_config.
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: resume of IP block <mes_v11_0> failed -22
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
april 02 20:54:17 user kernel: [drm] Register(0) [regUVD_PGFSM_STATUS] failed to reach value 0x00800000 != 0x00c00000n
april 02 20:54:17 user kernel: amdgpu 0000:03:00.0: [drm:jpeg_v4_0_set_powergating_state [amdgpu]] *ERROR* amdgpu: JPEG enable power gating failed
april 02 20:54:17 user kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <jpeg_v4_0> failed -110
april 02 20:54:18 user kernel: [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003n
april 02 20:54:18 user kernel: [drm] Register(0) [regUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003n
april 02 20:54:18 user kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:43 param:0x00000000 message:PowerDownVcn?
april 02 20:54:18 user kernel: amdgpu 0000:03:00.0: amdgpu: Failed to power gate VCN!
april 02 20:54:18 user kernel: [drm:vcn_v4_0_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -121. 
april 02 20:54:18 user kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:36 param:0x00000001 message:SetWorkloadMask?
april 02 20:54:18 user kernel: amdgpu 0000:03:00.0: amdgpu: Failed to set workload mask 0x00000001
april 02 20:54:18 user kernel: amdgpu 0000:03:00.0: amdgpu: (-121) failed to disable video power profile mode



april 02 20:54:29 user systemd-coredump[33194]: Process 33192 (glxinfo) of user 1000 dumped core.
                                                    
                                                    Stack trace of thread 33192:
                                                    #0  0x000077a743bbc8ed glFramebufferTextureMultisampleMultiviewOVR (libGLX_mesa.so.0 + 0x478ed)
                                                    #1  0x000077a743bb0f0c n/a (libGLX_mesa.so.0 + 0x3bf0c)
                                                    #2  0x000077a743ba399b n/a (libGLX_mesa.so.0 + 0x2e99b)
                                                    #3  0x000077a743ba4d9a n/a (libGLX_mesa.so.0 + 0x2fd9a)
                                                    #4  0x000077a743ba040a n/a (libGLX_mesa.so.0 + 0x2b40a)
                                                    #5  0x0000560cb7e23366 n/a (/usr/bin/glxinfo + 0x2366)
                                                    #6  0x000077a743d27488 n/a (libc.so.6 + 0x27488)
                                                    #7  0x000077a743d2754c __libc_start_main (libc.so.6 + 0x2754c)
                                                    #8  0x0000560cb7e23df5 n/a (/usr/bin/glxinfo + 0x2df5)
                                                    ELF object binary architecture: AMD x86-64
april 02 20:54:29 user systemd[1]: systemd-coredump@2-33193-0.service: Deactivated successfully.

Does anyone understand what is happening here?

Last edited by Ekaradon (2025-04-04 14:39:00)

seth · 2025-04-02 20:16:09

april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: failed to change_config.
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: resume of IP block <mes_v11_0> failed -22
april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).

https://community.frame.work/t/suspend- … over/66891 - seems a 6.13.6 regression?
No problems w/ the LTS kernel?
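
For anyone who wants to check their own journal for the same failure, here is a minimal Python sketch that pulls the failing IP block and the error code out of lines like the ones quoted above. The function name `find_resume_failures` and both regular expressions are assumptions based only on the two message formats visible in this thread; other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the two message formats quoted in this thread (assumption:
# other kernels may phrase them differently).
IP_BLOCK = re.compile(r"resume of IP block <(?P<block>[^>]+)> failed (?P<code>-\d+)")
DEVICE = re.compile(r"amdgpu_device_ip_resume failed \((?P<code>-\d+)\)")


def find_resume_failures(lines):
    """Return (source, code) tuples for every amdgpu resume failure found."""
    failures = []
    for line in lines:
        match = IP_BLOCK.search(line)
        if match:
            failures.append((match.group("block"), int(match.group("code"))))
            continue
        match = DEVICE.search(line)
        if match:
            failures.append(("device", int(match.group("code"))))
    return failures


# Sample lines copied from the first post.
sample = [
    "april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: failed to change_config.",
    "april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: resume of IP block <mes_v11_0> failed -22",
    "april 02 20:54:16 user kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).",
]

for source, code in find_resume_failures(sample):
    print(source, code)
```

To scan a real boot, the same function could be fed `sys.stdin` and the kernel log of the previous boot piped in from `journalctl -k -b -1`.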

Ekaradon · 2025-04-02 20:42:48

I'm going to test. It really looks like you are onto something:
https://community.frame.work/t/graphics … e/66269/26
It's just going to take a bit of time to test the LTS kernel, as I don't understand yet how to trigger this bug.

Ekaradon · 2025-04-04 14:38:15

I can confirm that the issue doesn't appear anymore once using the lts. Thanks Seth!

OpusOne · 2025-04-14 13:45:46

I have run into a problem that looks like this suspend/resume thing, but in my case, I get this error code instead:

amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62)

The sequence that triggers it:
- Suspend to RAM
- The suspend sequence almost completes... but from what I can see in the journal, it seems to resume by itself just before the suspend is fully completed:

...
ACPI: PM: Preparing to enter system sleep state S3
ACPI: EC: event blocked
ACPI: EC: EC stopped
kernel: ACPI: PM: Saving platform NVS memory
kernel: Disabling non-boot CPUs ...
kernel: Wakeup pending. Abort CPU freeze
kernel: Non-boot CPUs are not disabled
ACPI: EC: EC started
ACPI: PM: Waking up from system sleep state S3
...

After this auto-resume (I don't know what triggers it), amdgpu fails to resume properly and ends with this:

amdgpu: amdgpu_device_ip_resume failed (-62)

There is no display, and the machine needs a reboot after that.

It definitely looks like the issue mentioned above, but in my case I get this -62 error (what is it?), and the machine also wakes up instantly after suspending. I can't see a cause for that in the journal at all, except for this "kernel: Wakeup pending." line.

* Note that:
- It only ever happens if I suspend to RAM. If the suspend to RAM has succeeded, resuming always works fine.
- It happens only infrequently when suspending to RAM, which makes it hard to debug. I can't reproduce it reliably; it just happens every once in a while. It has happened twice in the past few weeks: once with kernel 6.13.8 and once with kernel 6.14.2. I think I skipped a couple of kernel versions and updated from 6.13.5 to 6.13.8 directly, which makes the above bug the likely cause, but since it isn't systematic it is hard to reproduce and therefore hard to "bisect".
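
Since the failure is this rare, it might help to count how often a suspend attempt gets aborted at all. Below is a rough Python sketch that does this over journal lines. The function name `count_aborted_suspends` is hypothetical, and the three marker strings are taken only from the excerpt above, so they are assumptions about what other machines log.

```python
def count_aborted_suspends(lines):
    """Count suspend attempts and how many were aborted by a pending wakeup.

    Returns a tuple (attempts, aborted).
    """
    attempts = 0
    aborted = 0
    in_suspend = False
    for line in lines:
        if "Preparing to enter system sleep state" in line:
            # A new suspend attempt starts here.
            attempts += 1
            in_suspend = True
        elif in_suspend and "Wakeup pending. Abort CPU freeze" in line:
            # The kernel gave up on this attempt before the CPUs were frozen.
            aborted += 1
            in_suspend = False
        elif "Waking up from system sleep state" in line:
            in_suspend = False
    return attempts, aborted


# Lines copied from the journal excerpt in this post.
excerpt = [
    "ACPI: PM: Preparing to enter system sleep state S3",
    "ACPI: EC: event blocked",
    "ACPI: EC: EC stopped",
    "kernel: ACPI: PM: Saving platform NVS memory",
    "kernel: Disabling non-boot CPUs ...",
    "kernel: Wakeup pending. Abort CPU freeze",
    "kernel: Non-boot CPUs are not disabled",
    "ACPI: EC: EC started",
    "ACPI: PM: Waking up from system sleep state S3",
]

attempts, aborted = count_aborted_suspends(excerpt)
print(f"{aborted} of {attempts} suspend attempts were aborted")
```

Comparing the aborted count against the number of amdgpu resume failures in the same journal would show whether every aborted suspend breaks the GPU or only some of them.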

Last edited by OpusOne (2025-04-14 13:50:19)

seth · 2025-04-14 19:22:05

◉ LC_ALL=C errno 62                    
ETIME 62 Timer expired
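
The other codes in the logs of this thread (-22, -110, -121) can be looked up the same way. If the `errno` tool is not installed, a short Python sketch using the standard `errno` and `os` modules does the same job. The helper name `describe` is just for illustration, and the numbers only map to these names on Linux.

```python
import errno
import os


def describe(code: int) -> str:
    """Turn a kernel-style negative return code into 'NAME (number): message'."""
    number = abs(code)
    # errno.errorcode maps the platform's error numbers to their symbolic names.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({number}): {os.strerror(number)}"


# Codes that appear in the kernel logs quoted in this thread.
for code in (-22, -62, -110, -121):
    print(code, describe(code))
```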

Ekaradon wrote:

the issue doesn't appear anymore once using the lts[-kernel]

Same?

Do you have a journal covering an affected boot?
Eg.

sudo journalctl -b -1 | curl -F 'file=@-' 0x0.st

for the previous ("-1") one.

Sanity check

kernel: Wakeup pending. Abort CPU freeze

Is there a parallel Windows installation?
