#1 2019-11-11 22:04:25

luken
Member
Registered: 2014-06-26
Posts: 37

[SOLVED] AMDGPU driver is crashing GPU after it will warm up.

I need the AMDGPU driver for Vulkan support in order to play some games, but with AMDGPU the GPU crashes (no video output) as soon as it warms up.

Literally as soon as it warms up: about one second after I start hearing the GPU's fan. The crash depends on load. A game that doesn't warm the GPU up keeps working until I hear the fan, and then the GPU crashes shortly after. Less GPU-intensive games run a bit longer, but we are talking seconds here.

The radeon driver works just fine (on non-Vulkan games), so it's not a hardware issue.

I can trigger a crash by running the gputest program:

gputest /test=fur /width=1280 /height=1024 /benchmark /benchmark_duration_ms=5000 /no_scorebox /print_score

In journalctl the only messages I get during the crash are these:

lis 11 21:03:42 Luken-Desktop kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
lis 11 21:03:47 Luken-Desktop kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
lis 11 21:03:47 Luken-Desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=36133, emitted seq=36135
lis 11 21:03:47 Luken-Desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GpuTest pid 8267 thread GpuTest:cs0 pid 8306

This is followed by a GPU restart.
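A minimal sketch for pulling just these lines out of the kernel log, in case it helps anyone reproduce this. The function names amdgpu_errors and ring_timeouts are made up for illustration, and the patterns are only matched against the message format shown above; other kernel versions may word these messages differently.

```shell
# Keep only amdgpu *ERROR* lines from kernel log text read on stdin.
amdgpu_errors() {
    grep -E 'amdgpu.*\*ERROR\*'
}

# From ring timeout messages, print: ring name, signaled seq, emitted seq.
# The gap between the two sequence numbers is the work still in flight.
ring_timeouts() {
    sed -n -E 's/.*ring ([a-z0-9_.]+) timeout, signaled seq=([0-9]+), emitted seq=([0-9]+).*/\1 \2 \3/p'
}

# Typical use (kernel messages from the current boot):
#   journalctl -k -b | amdgpu_errors
#   journalctl -k -b | ring_timeouts
```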

My GPU is a Radeon R9 290.

I enable amdgpu in /etc/modprobe.d/amdgpu.conf, which looks like this:

options amdgpu cik_support=1
options amdgpu dc=1
options amdgpu dpm=1
blacklist radeon

Enabling or disabling the dc and dpm parameters doesn't change anything (well, I'm not really sure, but disabling dc may delay the crash by a few seconds).
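When toggling options like these, it can help to confirm what the loaded module actually picked up. A rough sketch, assuming the parameters are readable under /sys/module/amdgpu/parameters (the usual location for exposed module parameters); show_module_params is a hypothetical helper name.

```shell
# Print name=value for each requested module parameter found in a directory.
# Usage: show_module_params DIR NAME...
show_module_params() {
    dir=$1
    shift
    for p in "$@"; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            # Parameter file missing: module not loaded or parameter not exposed.
            printf '%s=<not exposed>\n' "$p"
        fi
    done
}

# Typical use:
#   show_module_params /sys/module/amdgpu/parameters cik_support dc dpm
```

Running lspci -k should also show which kernel driver is in use for the card.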

I have the newest packages and the 5.3.10-arch1-1 kernel.

Since everything generally works until the GPU warms up, could it be that the AMDGPU driver is somehow overclocking the GPU, which causes overheating and then the crash? And if that's the case, how can it be fixed or worked around?

I figured the output of sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info might be useful here:

Clock Gating Flags Mask: 0x0
    Graphics Medium Grain Clock Gating: Off
    Graphics Medium Grain memory Light Sleep: Off
    Graphics Coarse Grain Clock Gating: Off
    Graphics Coarse Grain memory Light Sleep: Off
    Graphics Coarse Grain Tree Shader Clock Gating: Off
    Graphics Coarse Grain Tree Shader Light Sleep: Off
    Graphics Command Processor Light Sleep: Off
    Graphics Run List Controller Light Sleep: Off
    Graphics 3D Coarse Grain Clock Gating: Off
    Graphics 3D Coarse Grain memory Light Sleep: Off
    Memory Controller Light Sleep: Off
    Memory Controller Medium Grain Clock Gating: Off
    System Direct Memory Access Light Sleep: Off
    System Direct Memory Access Medium Grain Clock Gating: Off
    Bus Interface Medium Grain Clock Gating: Off
    Bus Interface Light Sleep: Off
    Unified Video Decoder Medium Grain Clock Gating: Off
    Video Compression Engine Medium Grain Clock Gating: Off
    Host Data Path Light Sleep: Off
    Host Data Path Medium Grain Clock Gating: Off
    Digital Right Management Medium Grain Clock Gating: Off
    Digital Right Management Light Sleep: Off
    Rom Medium Grain Clock Gating: Off
    Data Fabric Medium Grain Clock Gating: Off
    Address Translation Hub Medium Grain Clock Gating: Off
    Address Translation Hub Light Sleep: Off

GFX Clocks and Power:
    1250 MHz (MCLK)
    300 MHz (SCLK)
    300 MHz (PSTATE_SCLK)
    150 MHz (PSTATE_MCLK)
    1000 mV (VDDGFX)
    47.105 W (average GPU)

GPU Temperature: 69 C
GPU Load: 0 %
MEM Load: 0 %

UVD: Disabled

VCE: Disabled
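A small sketch for boiling this dump down to the three numbers that matter for a thermal question: core clock, temperature and load. The function name pm_info_summary is made up, and the field labels are taken from the output above; other kernel versions may format the file differently.

```shell
# Summarise amdgpu_pm_info text read on stdin as one line.
pm_info_summary() {
    awk '
        # "(SCLK)" with the opening paren, so "(PSTATE_SCLK)" is skipped.
        /\(SCLK\)/         { sclk = $1 }
        /GPU Temperature:/ { temp = $3 }
        /GPU Load:/        { load = $3 }
        END { printf "sclk_mhz=%s temp_c=%s load_pct=%s\n", sclk, temp, load }
    '
}

# Typical use:
#   sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info | pm_info_summary
```

For the dump above this gives sclk_mhz=300 temp_c=69 load_pct=0, i.e. the lowest core clock and no load at 69 C.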

I also tried to underclock the GPU with this:

echo manual | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
echo 3 | sudo tee /sys/class/drm/card0/device/pp_dpm_sclk

(I had added options amdgpu ppfeaturemask=0xffffffff to my amdgpu.conf file first.) But it didn't help.
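Going by the kernel's amdgpu documentation, pp_dpm_sclk takes a space-separated list of the levels to keep enabled, so writing a single 3 would select only level 3 rather than cap the clock at it. A sketch of the capping variant follows; cap_sclk_level is a hypothetical helper, and whether the power code for this card honors a list the same way is an assumption.

```shell
# Allow sclk levels 0..MAX only. Needs root when pointed at real sysfs.
# Usage: cap_sclk_level MAX [DEVICE_DIR]
cap_sclk_level() {
    max=$1
    dev=${2:-/sys/class/drm/card0/device}
    levels=""
    i=0
    # Build the list "0 1 ... MAX".
    while [ "$i" -le "$max" ]; do
        levels="${levels:+$levels }$i"
        i=$((i + 1))
    done
    # Manual mode must be selected before the level mask is accepted.
    echo manual > "$dev/power_dpm_force_performance_level" &&
    echo "$levels" > "$dev/pp_dpm_sclk"
}

# Typical use, as root:
#   cap_sclk_level 3
```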

I feel like I'm close to the solution and that some specific underclocking or limiting should prevent this issue, but I don't know how to proceed from here.

Do you have any ideas how to fix this or how to proceed from here?


Edit #1:

I added part of the glxinfo output for some additional data regarding my hardware and software.

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD Radeon R9 200 Series (HAWAII, DRM 3.33.0, 5.3.10-arch1-1, LLVM 9.0.0) (0x67b1)
    Version: 19.2.3
    Accelerated: yes
    Video memory: 4096MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.5
    Max compat profile version: 4.5
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 3355 MB, largest block: 3355 MB
    VBO free aux. memory - total: 4019 MB, largest block: 4019 MB
    Texture free memory - total: 3355 MB, largest block: 3355 MB
    Texture free aux. memory - total: 4019 MB, largest block: 4019 MB
    Renderbuffer free memory - total: 3355 MB, largest block: 3355 MB
    Renderbuffer free aux. memory - total: 4019 MB, largest block: 4019 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 4096 MB
    Total available memory: 8192 MB
    Currently available dedicated video memory: 3355 MB
OpenGL vendor string: X.Org
OpenGL renderer string: AMD Radeon R9 200 Series (HAWAII, DRM 3.33.0, 5.3.10-arch1-1, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.3
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

Last edited by luken (2019-11-15 11:20:26)

#2 2019-11-12 03:43:09

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

What were you doing when you got that 69°C measurement in that file in /sys/kernel/debug/?

I'm asking because the output you showed from there says that, at that moment, the GPU core was running at a very low clock speed, yet the temperature was 69°C. If you took that measurement while you weren't using the GPU much, say while just using desktop programs like the web browser, then that's really worrying. In that case I'd think the cooler of your graphics card has problems, and you should research how to clean it and replace the thermal paste.

Last edited by Ropid (2019-11-12 03:43:47)

#3 2019-11-12 11:11:33

luken
Member
Registered: 2014-06-26
Posts: 37

I was pretty much idling (with a browser running in the background). I just compared temps under the "radeon" and "amdgpu" drivers, and it looks like the card idles 10 degrees Celsius higher on "amdgpu" than on "radeon". On the "radeon" driver I'm idling at about 60 degrees Celsius, and as I mentioned, it doesn't crash at all.

This card runs very hot by design (95 degrees at full load is normal) and also worked without issues on Windows, so the question is: can we underclock it somehow or tweak something in the amdgpu driver configuration to lower the temps?

I looked through the files in /sys/class/drm/card0/device/ for both drivers to find any difference, and I noticed that /sys/class/drm/card0/device/power_dpm_state contains performance for amdgpu but balanced for radeon (both when idling).

Could that be it? I'm not sure, since that value may be dynamic, and radeon might just as well switch to performance mode under load.

Some other interesting files that are available when amdgpu is active:

/sys/class/drm/card0/device/pp_power_profile_mode

NUM        MODE_NAME     SCLK_UP_HYST   SCLK_DOWN_HYST SCLK_ACTIVE_LEVEL     MCLK_UP_HYST   MCLK_DOWN_HYST MCLK_ACTIVE_LEVEL
  0   BOOTUP_DEFAULT:        -                -                -                -                -                -
  1 3D_FULL_SCREEN *:        0              100               30                0              100               10
  2     POWER_SAVING:       10                0               30                -                -                -
  3            VIDEO:        -                -                -               10               16               31
  4               VR:        0               11               50                0              100               10
  5          COMPUTE:        0                5               30                -                -                -
  6           CUSTOM:        -                -                -                -                -                -

/sys/class/drm/card0/device/pp_dpm_sclk

0: 300Mhz 
1: 483Mhz *
2: 682Mhz 
3: 867Mhz 
4: 908Mhz 
5: 942Mhz 
6: 968Mhz 
7: 977Mhz
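The asterisk marks the level the card is currently in. A tiny sketch for extracting that level while watching the card under load; active_dpm_level is a made-up name, and it assumes the listing format shown above.

```shell
# Print the index and clock of the active level from pp_dpm_sclk text on stdin.
active_dpm_level() {
    awk '/\*/ { sub(/:$/, "", $1); print $1, $2 }'
}

# Typical use:
#   active_dpm_level < /sys/class/drm/card0/device/pp_dpm_sclk
```

For the listing above this prints 1 483Mhz.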

/sys/class/drm/card0/device/pp_num_states

states: 2
0 boot
1 performance

#4 2019-11-12 16:57:24

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

It should be something like 40°C at idle (at 20°C room temperature). Even if it's a hot card that reaches 95°C while gaming, it should still be pretty cool at idle; it shouldn't be 60°C or 70°C.
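To check the idle temperature without the debugfs file, something like the following sketch should work under either driver. It assumes the card exposes the usual hwmon temp1_input file (millidegrees Celsius) under its device directory; gpu_temp_c is a hypothetical helper name.

```shell
# Print the GPU temperature in whole degrees Celsius, or fail if no sensor file.
# Usage: gpu_temp_c [DEVICE_DIR]
gpu_temp_c() {
    dev=${1:-/sys/class/drm/card0/device}
    for f in "$dev"/hwmon/hwmon*/temp1_input; do
        [ -r "$f" ] || continue
        # hwmon reports millidegrees; integer division drops the fraction.
        echo $(( $(cat "$f") / 1000 ))
        return 0
    done
    return 1
}

# Typical use: print the temperature every 2 seconds while idling
#   while sleep 2; do gpu_temp_c; done
```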

I have a newer AMD card. The amdgpu driver is also reporting "performance" for me in that variable you mention. My idle temperatures are around 35°C. It seems when the driver reports "performance", it will still dynamically change clock speeds and will slow down the card if there's nothing much to do, making things run cool at idle.

I think you should try to fix the cooling problem. There's something physically wrong with the card or the PC case. It could be that the inside of the case is stuffed with dust, that the case fans or the GPU's fans are broken, or that the GPU's thermal paste has gone bad over the years.

Last edited by Ropid (2019-11-12 17:00:28)

#5 2019-11-12 17:43:52

luken
Member
Registered: 2014-06-26
Posts: 37

I will try replacing the thermal paste in a week and report back. I found out that this model has really horrid factory paste, so it might solve the issue, but we will see :) Still, it's strange that it took amdgpu for the thermals to go that crazy. It must be making the GPU do something extra compared to Windows or the 'radeon' driver.

Last edited by luken (2019-11-12 17:45:40)

#6 2019-11-15 11:19:55

luken
Member
Registered: 2014-06-26
Posts: 37

Yep, replacing the thermal paste fixed the issue :) The idle temp actually dropped by about 10 degrees on the amdgpu driver and about 20 degrees on the radeon driver (!). I was too fixated on the software side of things to consider this, so thanks!
