[Solved] It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready,

gdl · 2025-02-04 04:07:25

535 wrote:

I'm seeing this issue on my RX 570; another Polaris card.
Possibly relevantly, I had to disable power management and the Display Core driver (by adding the kernel parameters `amdgpu.dpm=0` and `amdgpu.dc=0`) in order to prevent my RX 570 from losing its mind (heavily corrupted graphics), especially when waking from sleep. Is anyone else who's having this issue using `amdgpu` kernel parameters?
I've also been seeing an infrequent display flicker where what looks to be a terminal or console replaces the entire screen for a single frame. I need to do some more testing, but I'm starting to think that they occur at the same time as the log gets spammed with those `[drm] scheduler comp_X.Y.Z is not ready, skipping` messages.

I do not have display flicker on Arch but interestingly it happens on Windows 11. A bit OT but I noticed that on Windows 11, 90% of the time I resume from suspend I have to turn off and on (power cable) the monitor a few times to finally wake up the GPU.
At least on Arch, I am 100% sure the monitor will turn on, and that the scheduler comp messages will appear.

DeKay · 2025-02-04 14:09:18

No improvements with 6.13.1-arch1-1

Magnesium · 2025-02-08 14:01:24

I have had this problem for a long time, even with 6.14.0-rc1-1-mainline and sway (wayland)

but have found a possible workaround here https://bbs.archlinux.org/viewtopic.php?id=302767

downgrading mesa and vulkan-radeon to 24.2.7-1
and installing llvm18-libs

***** update *****
3 days in a row, resuming from sleep and hibernation, without crash!

[ia@sun ~] % date [0][16:11:00]
Thu Feb 13 04:11:20 PM -03 2025
[ia@sun ~] % uptime [0][16:11:20]
16:11:22 up 3 days, 5:09, 1 user, load average: 0.57, 0.92, 0.88

**** update ****
unfortunately, had a crash: kernel null pointer dereference
new tactics: using latest mesa and vulkan-radeon, downgrading kernel to 5.19.13 - yes, you read that correctly
status: no more log spam [drm] scheduler stuff
after resuming from suspend, no more amdgpu error ringtest stuff
let's see if the system will remain stable for more than four days in a row, with successive suspending/hibernating and resuming.

Last edited by Magnesium (2025-02-21 18:05:59)

altoid · 2025-04-15 01:33:05

This also happens to me after resume from suspend.

Kernel build script and config (6.14.2) available here on my GitHub.

Apr 15 02:28:00 Workstation kernel: sd 0:0:0:0: [sda] Starting disk
Apr 15 02:28:00 Workstation kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
Apr 15 02:28:00 Workstation kernel: random: crng reseeded on system resumption
Apr 15 03:28:51 Workstation kernel: [drm] scheduler comp_1.1.1 is not ready, skipping
Apr 15 03:28:51 Workstation kernel: [drm] scheduler comp_1.1.1 is not ready, skipping
Apr 15 03:28:51 Workstation kernel: [drm] scheduler comp_1.1.1 is not ready, skipping
Apr 15 03:28:51 Workstation kernel: [drm] scheduler comp_1.1.1 is not ready, skipping

CPU: Xeon E3-1231 v3
GPU: RX 480 with 4GB of VRAM

No visual glitches, it works just fine but the console spam is obviously suggesting that something is wrong.

Last edited by altoid (2025-04-15 01:36:38)

DeKay · 2025-04-15 02:14:38

loqs wrote:

Which kernel introduced the issue? Is the issue still present in 6.13?
sudo pacman -U https://pkgbuild.com/\~gromit/linux-bisection-kernels/linux-mainline-6.13-1-x86_64.pkg.tar.zst
Do you need help performing the bisection requested by upstream?

Someone has done a git bisect to find the problematic commit. Hope this leads to a fix soon.

altoid · 2025-04-15 07:43:56

DeKay wrote:

Someone has done a git bisect to find the problematic commit. Hope this leads to a fix soon.

I can offer to look into this today, for you or anyone else.

I was reverse engineering the Polaris firmware anyway, so I might as well recompile the kernel but undo that commit, then debug it and try to re-create that commit but without the side-effects we all are experiencing.

After all, I have to delete my latest commit, namely Linux 6.14.2, in my (Arch Build System) kernel repository anyway since I forgot to enable UDF filesystem drivers / CONFIG_UDF_FS in the kernel config.

I will try to get it to work when I have some free time!

Last edited by altoid (2025-04-15 07:46:50)

DeKay · 2025-04-15 15:28:34

altoid wrote:

I can offer to look into this today, for you or anyone else.

That would be amazing!

altoid · 2025-04-17 06:45:43

@DeKay: I have to apologize, since holidays are coming up I now have a bunch of stuff to do.

I will try to get to it once I have the time.

Maybe the guys in the ticket(s) will get it first.

A -110 return value usually means timeout, so the card failed to resume some compute queues and an application tries to use them despite the driver not having them initialized fully.
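To make the -110 reading concrete, here is a minimal sketch that pulls the ring name and error code out of a ring-test failure line like the one in the log above. The function name `parse_ring_failure` and the tiny `ERRNO_NAMES` table are made up for illustration; 110 is ETIMEDOUT in the generic Linux errno numbering.

```python
import re

# Only the errno value relevant to this thread (generic Linux numbering).
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches e.g. "ring comp_1.1.1 test failed (-110)".
RING_FAIL = re.compile(r"ring (?P<ring>\S+) test failed \((?P<code>-?\d+)\)")


def parse_ring_failure(line):
    """Return (ring_name, errno_name) for a ring-test failure line, else None."""
    match = RING_FAIL.search(line)
    if match is None:
        return None
    code = abs(int(match.group("code")))
    return match.group("ring"), ERRNO_NAMES.get(code, f"errno {code}")


line = (
    "Apr 15 02:28:00 Workstation kernel: amdgpu 0000:01:00.0: "
    "[drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)"
)
print(parse_ring_failure(line))
```

Running this over a saved journal shows which rings failed their test on resume, and those should be the same ones that later show up in the "is not ready, skipping" messages.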

The commit in question is related to retimer test pattern sequencing in the AMD Display Core (drm/amd/display).

I did take a manual look and some code stands out to me, although I am not very familiar with this code-base (AMDGPU) at all.

The commit that our friend landed on after bisecting is mostly related to DisplayPort and I don't really see (yet) how it could interfere.

I will try some stuff while I still have time.

DeKay · 2025-04-17 20:42:10

@altoid. No worries. But before you do anything, be sure to check back to the ticket first. There is some dispute on that bisect and a new theory as to what causes the problem.

altoid · 2025-04-17 22:13:59

@DeKay: Thank you for the heads up.

Yes, I had a strong feeling (from a developer's viewpoint) that this commit is not at all related.

We will see soon.

Last edited by altoid (2025-04-17 22:14:36)

itektur · 2025-04-21 13:37:26

Chiming in because I just found this thread, which is awesome because I have had the same issue for what feels like forever.

Some machine details:

Motherboard: X570-A PRO (MS-7C37)
CPU: AMD Ryzen 9 5900X 12-Core Processor
Video: Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
uname: Linux hostname 6.14.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 10 Apr 2025 18:43:59 +0000 x86_64 GNU/Linux

I am using Xorg and Fluxbox without any compositing whatsoever (so no transparencies even if I or applications ask for it).

The most prominent symptom is the logspam that happens after the first suspend/resume (no surprises here):

Apr 21 14:04:50 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping
Apr 21 14:04:51 hostname kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Apr 21 14:04:51 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping
Apr 21 14:04:51 hostname kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Apr 21 14:04:51 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping
Apr 21 14:04:52 hostname kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Apr 21 14:04:52 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping
Apr 21 14:04:52 hostname kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Apr 21 14:04:52 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping
Apr 21 14:04:53 hostname kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Apr 21 14:04:53 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping

However, there are a few more things I would like to add that might or might not be related:

I use only one display/monitor. I use guake (and tmux) for virtual terminals. If I turn guake into fullscreen, the logspam stops!
However, even with guake at fullscreen the whole time, switching to another window of another application (which will then be displayed on top of guake) will add a few logspam lines again. The number of log lines seems to be the number of different comp_* schedulers times the number of windows being "dirty" and then redrawn, at least most of the time. For example, in my case, there are comp_1.2.0 and comp_1.3.1, so the first number is 2. If I switch to another window (using keyboard shortcuts) such that it was hidden by guake and now is brought to the foreground, that other window needs to be redrawn, so the second number is 1. Thus, 2 * 1 = 2 new log lines. Switching back to guake such that the other window disappears means that guake is dirty, so the second number will also be 1 in this case: two lines. Switching focus such that 2 windows reappear on top of guake makes 4 lines, and so on. Moving windows around with their contents being displayed during movement means lots of redraws because of windows being dirty/repositioned, thus, OMG many lines.
If guake is displayed but not fullscreen, I get my usual logspam back at a roughly fixed interval, on top of the log lines I get from moving windows or redisplaying them because of switching virtual desktops and so on. Not sure what that regular interval correlates to (it seems to be once every half-second).
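A minimal sketch of that counting rule, for anyone who wants to check it against their own journal. It assumes the message format shown in the log excerpt above; `count_spam` and `expected_lines` are illustrative names, and the "schedulers times redrawn windows" formula is only the rule of thumb observed here, not something taken from the driver.

```python
import re
from collections import Counter

# Captures the scheduler name from each "not ready" message.
SPAM = re.compile(r"\[drm\] scheduler (comp_\d+\.\d+\.\d+) is not ready, skipping")


def count_spam(log_text):
    """Count 'not ready' messages per compute scheduler."""
    return Counter(SPAM.findall(log_text))


def expected_lines(num_schedulers, num_redrawn_windows):
    """Observed rule of thumb: one line per scheduler per redrawn window."""
    return num_schedulers * num_redrawn_windows


sample = """\
Apr 21 14:04:50 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping
Apr 21 14:04:51 hostname kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Apr 21 14:04:51 hostname kernel: [drm] scheduler comp_1.2.0 is not ready, skipping
"""

counts = count_spam(sample)
print(dict(counts))
print(expected_lines(len(counts), 2))
```

Comparing the per-scheduler counts before and after a single window switch is a quick way to see whether the rule holds on another setup.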

The whole issue becomes apparent after the first suspend/resume cycle, but there are other symptoms that seem to worsen over time:

After a month or two or three of suspending/resuming every day, mpv starts behaving strangely. That is, it seems to have a hard time starting video playback, as if the harddisk were broken or something. However, audio playback works fine throughout. Starting a new video takes almost a second! If it has become that bad, I basically know that the next suspend/resume cycle will result in a frozen system (or at least a frozen X) right after resume (or at least when the next logspam line should appear, can't quite remember). No screensaver, btw. Everything is being displayed and I can move the mouse cursor around but that's it. Not even Numlock will work (LED will not respond).
Similarly, if it is not just mpv having issues, it might happen that, after a suspend/resume, every process I start from then on will freeze as soon as it asks too much of the video card. More specifically, geany, firefox and mpv, for example, are all listed in nvtop, and new processes of them will all freeze. The only thing I see of them is the window decoration, but no content whatsoever. New processes of programs that do not appear in nvtop will work just fine. Also, processes which already have opened all the windows they need also continue to work. (I cannot remember whether still-working nvtop-listed processes can open new windows without problems.) However, similar to the situation with mpv above, this is usually the last day anything works at all. If I suspend and then resume the next day, everything will be frozen and I need to make use of the reset button.
By the way, guake is listed in nvtop as well.

I hope that helps solve the issue.

DeKay · 2025-04-21 13:51:05

itektur wrote:

Chiming in because I just found this thread, which is awesome because I have had the same issue for what feels like forever.
Some machine details:
Motherboard: X570-A PRO (MS-7C37)
CPU: AMD Ryzen 9 5900X 12-Core Processor
Video: Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
uname: Linux hostname 6.14.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 10 Apr 2025 18:43:59 +0000 x86_64 GNU/Linux
<snip>
I use only one display/monitor. I use guake (and tmux) for virtual terminals. If I turn guake into fullscreen, the logspam stops!
However, even with guake at fullscreen the whole time, switching to another window of another application (which will then be displayed on top of guake) will add a few logspam lines again. The number of log lines seems to be the number of different comp_* schedulers times the number of windows being "dirty" and then redrawn
<snip>
The whole issue becomes apparent after the first suspend/resume cycle, but there are other symptoms that seem to worsen over time:
<snip>
I hope that helps solve the issue.

Awesome detective work @itektur. Could you add your findings to the bug I filed on the drm/amd GitLab? I think this is starting to get developer interest now that recent findings have narrowed down the issue somewhat. Your findings will help.

kschendel · 2025-04-21 16:52:42

I get 4 of these every time I switch activities (essentially, desktops) in KDE Plasma. 5800X and RX550, running 6.12.16 x86_64. Single monitor. Every activity switch gets me 4 of those stupid warnings.

The KDE stuff is whatever was in OpenSUSE Tumbleweed as of the 2025-04-02 update. If anyone needs more info and can tell me how to get it, I'm happy to oblige.

I hadn't been paying much attention to this; however I now need to track down a separate issue and the logs are so full of this garbage that I can't get anywhere. I'm going to simply remove the offending printk and recompile, but I thought I would report this first.
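For reading around the spam without rebuilding anything, a minimal sketch of filtering it out after the fact. It assumes the log has already been dumped to text lines and that the message format matches the excerpts in this thread; `drop_spam` is a made-up helper name.

```python
import re

# Matches the spam line for any scheduler name.
SPAM = re.compile(r"\[drm\] scheduler \S+ is not ready, skipping")


def drop_spam(lines):
    """Return only the lines that are not the 'not ready, skipping' message."""
    return [line for line in lines if not SPAM.search(line)]


log = [
    "Apr 15 02:28:00 Workstation kernel: sd 0:0:0:0: [sda] Starting disk",
    "Apr 15 02:28:00 Workstation kernel: random: crng reseeded on system resumption",
    "Apr 15 03:28:51 Workstation kernel: [drm] scheduler comp_1.1.1 is not ready, skipping",
    "Apr 15 03:28:51 Workstation kernel: [drm] scheduler comp_1.1.1 is not ready, skipping",
]

for line in drop_spam(log):
    print(line)
```

This only hides the messages from view; they are still written to the journal.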

Last edited by kschendel (2025-04-22 06:23:45)

altoid · 2025-04-22 09:59:42

@itektur: Thank you a lot for actually contributing to the ticket!

I must say that I have not gotten very far, mostly due to lack of time and patience.

The only thing I tried was compiling and running the latest mainline kernel that was available at the time - 6.15-rc2 - which you will not be able to compile unless you disable many, many different features (the landlock LSM being one of them) due to compiler errors.

It did have a whole bunch of new drm/amdgpu related commits in it, though nothing specific to this issue, and after booting it, the problem persisted.

If you want to try for yourself, 6.15-rc3 just came out, and it should compile just fine if you replace your kernel config with my minimal one from my GitHub - make sure to disable landlock though!

Then, I dug further into the documentation of drm and amdgpu module related parameters.

Here are some interesting parameters for debugging, harvested from here and here.

[drm.debug]
| Bitflag | Meaning                              |
|---------|--------------------------------------|
| 0x1     | CORE (DRM core code)                 |
| 0x2     | DRIVER (vendor-specific driver code) |
| 0x4     | KMS (Kernel Mode Setting)            |
| 0x8     | PRIME (buffer-sharing code)          |
| 0x10    | ATOMIC (atomic modesetting)          |
| 0x20    | VBL (vblank/event handling)          |
| 0x40    | STATE (atomic-state debugging)       |
| 0x80    | LEASE (leasing code)                 |
| 0x100   | DP (DisplayPort code)                |
| 0x200   | DRMRES (managed-resources tracking)  |

[amdgpu.debug]
| Bitflag | Meaning                            |
|---------|------------------------------------|
| 0x1     | Debug VM handling                  |
| 0x2     | Simulate Large BAR                 |
| 0x4     | Disable GPU soft recovery          |
| 0x8     | Use VRAM for firmware buffer       |
| 0x10    | Enable RAS ACA logging             |
| 0x20    | Enable experimental reset paths    |
| 0x40    | Disable GPU ring reset             |
| 0x80    | Use VRAM for SMU debug memory pool |
Of particular interest to me are the

amdgpu.debug = 0x4 | 0x20 | 0x80

bit flags, but the full DRM debug output would be nice to see as well.
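For anyone unsure how those bit flags turn into a single value, here is a minimal sketch that ORs them together. The dictionaries simply restate the two tables above, assuming those values are accurate for your kernel; the short key names and the `combine` helper are made up for illustration.

```python
# Bit values restated from the drm.debug table above.
DRM_DEBUG = {
    "core": 0x1, "driver": 0x2, "kms": 0x4, "prime": 0x8, "atomic": 0x10,
    "vbl": 0x20, "state": 0x40, "lease": 0x80, "dp": 0x100, "drmres": 0x200,
}

# Bit values restated from the amdgpu.debug table above.
AMDGPU_DEBUG = {
    "vm": 0x1, "large_bar": 0x2, "no_soft_recovery": 0x4, "vram_fw_buf": 0x8,
    "ras_aca": 0x10, "exp_resets": 0x20, "no_ring_reset": 0x40, "smu_pool_vram": 0x80,
}


def combine(table, *names):
    """OR together the bit flags for the given names."""
    mask = 0
    for name in names:
        mask |= table[name]
    return mask


# The three amdgpu flags singled out above: 0x4 | 0x20 | 0x80.
print(hex(combine(AMDGPU_DEBUG, "no_soft_recovery", "exp_resets", "smu_pool_vram")))

# Every drm.debug category at once, for full DRM debug output.
print(hex(combine(DRM_DEBUG, *DRM_DEBUG)))
```

The first value works out to 0xa4 and the second to 0x3ff. Be aware that full DRM debug output is very verbose.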

Later, I will play around with the options from the two tables above, as well as with power management related driver options such as:

amdgpu.aspm
amdgpu.runpm

I also finally got to the AMDGPU soft-recovery code for ring failures and have some ideas in mind to try out!

Should I stumble upon something interesting, I will edit this post (or, post a new reply altogether).

Thank you to everyone that posted in the ticket, and if you did, you should probably enable some verbose debug messages!

Consult the tables and the two links I posted for more information, until I have some time (and patience) to test my theories.

P.S: It should be obvious, but you can get around this bug if you suspend your machine without any hardware acceleration features currently being used / activated by applications (VA-API, Vulkan, et cetera).

Of course that is not a fix, but for machines with high-uptime requirements it is a temporary workaround.

Edit:

I noticed that

amdgpu.debug

is not exposed to the user by default, but you can patch

amdgpu_init_debug_options

or alternatively add the module parameter yourself.

Last edited by altoid (2025-04-26 00:40:57)

altoid · 2025-04-25 11:59:18

Hi, can everyone try

amdgpu.reset_method=0

on your kernel command line?

Completely fixes it for me: suspend and resume worked 3 (now 5) times in a row with both Firefox and Spotify using hardware acceleration, and rings don't time out anymore.
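To confirm the parameter actually made it onto the running kernel's command line after a reboot, a minimal sketch follows. `parse_cmdline` is a made-up helper that splits on whitespace only, so quoted values containing spaces are not handled, and the example string is hypothetical; on a real system the text comes from /proc/cmdline.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line for illustration; on a real system use
# parse_cmdline(read_running_cmdline()) instead.
example = "root=/dev/sda2 rw quiet amdgpu.reset_method=0"
params = parse_cmdline(example)
print(params.get("amdgpu.reset_method"))
```

If this prints None on your machine, the parameter was not picked up, for example because the bootloader configuration was not regenerated.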

This is definitely related to an older issue, and will likely take a bit longer to "properly" fix.

I also posted the same finding on GitLab... hope this works, at least temporarily.

Last edited by altoid (2025-04-25 12:29:16)

itektur · 2025-04-26 11:40:33

altoid wrote:

Hi, can everyone try
amdgpu.reset_method=0
on your kernel command line?
Completely fixes it for me: suspend and resume worked 3 (now 5) times in a row with both Firefox and Spotify using hardware acceleration, and rings don't time out anymore.

Works for me as well, thanks.

WFV · 2025-04-27 02:12:54

@altoid: Just curious: is this approach regarding "legacy" resets as in lines 544 and 567 of linux/drivers/gpu/drm/amd/amdgpu/amdgpu.h? Thank you.

altoid · 2025-04-27 04:12:59

WFV wrote:

@altoid: Just curious: is this approach regarding "legacy" resets as in lines 544 and 567 of linux/drivers/gpu/drm/amd/amdgpu/amdgpu.h? Thank you.

Yes.

I posted exactly that enum and more information on GitLab: https://gitlab.freedesktop.org/drm/amd/ … te_2884988

Whether it actually works seems to vary by vendor and card version though, and I am still trying to figure out why. It's a bit of a huge undertaking if you are working alone and you don't have all the different cards with all the different BIOS versions from all the different vendors.

DeKay · 2025-07-11 23:19:18

A fix has been found, confirmed to work, and confirmed as correct by Alex Deucher. Is there a way to get this one-liner in the Arch kernel?

https://gitlab.freedesktop.org/drm/amd/ … te_3001452

seth · 2025-07-12 09:26:20

Arch will typically not pick up patches unless they've been officially upstreamed. You can ask for a patch to be picked up (https://gitlab.archlinux.org/archlinux/ … x/-/issues), but depending on how things move upstream, you might get it with the next kernel releases anyway.
Alternatively see https://wiki.archlinux.org/title/Compile_kernel_module

DeKay · 2025-07-17 18:47:56

Well, looks like I'm famous now. My bug report on this issue made it into Phoronix. The fix is going into 6.16 but will hopefully be backported.

https://www.phoronix.com/news/Linux-Fix … laris-Spam

DeKay · 2025-07-27 18:48:05

The fix has been backported to 6.15.8 in Arch and works great! Will mark this thread as [solved].

https://gitlab.freedesktop.org/drm/amd/ … te_3027395
