
#1 2024-02-08 18:33:31

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

amdgpu crashes since this week

As the title says, since this week I have had crashes while using my 6900 XT.
This happens while playing a game. The usage of the card doesn't matter though.

Here's the relevant journalctl log:

Feb 08 18:48:48.687102 xbox kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=8446608, emitted seq=8446610
Feb 08 18:48:48.702165 xbox kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process java pid 57305 thread java:cs0 pid 57452
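
To collect these timeout lines across a whole boot instead of scrolling through the journal, something like the following can help. It is only a sketch: the function names are made up for illustration, and the "pending" value is simply the difference between the two sequence numbers printed in the message, read here as fences that were emitted but never signaled.

```python
import re
import subprocess

# Matches the amdgpu ring-timeout message as it appears in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return one dict per ring-timeout line found in an iterable of log lines."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if not m:
            continue
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        hits.append({
            "ring": m.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Emitted but not yet signaled (interpretation, see lead-in).
            "pending": emitted - signaled,
        })
    return hits


def kernel_log(boot="-1"):
    """Read the kernel messages of one boot via journalctl (-1 = previous boot)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", boot, "-o", "short-precise", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Calling `parse_ring_timeouts(kernel_log())` after a crash and reboot lists the timeouts from the crashed session; passing `"0"` instead looks at the current boot.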

(yes, my PC is called xbox)

I was playing Minecraft, so java as the cause makes sense. This happens in every game though.
This however doesn't happen every time. Sometimes I can play a game just fine, but quite often the GPU just crashes.


I have no idea what causes this, although I have noticed the *new* entry in journalctl:

Feb 08 19:10:23.419224 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: New power limit (0) is out of range [229,293]
Feb 08 19:10:23.419399 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: New power limit (0) is out of range [229,293]

My first guess was that CoreCtrl is causing this issue. However, I have disabled it and the logs still appear.
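
To see what the driver itself reports as the current cap and its allowed range, the hwmon files can be read directly. A minimal sketch, assuming the amdgpu hwmon directory exposes `power1_cap`, `power1_cap_min` and `power1_cap_max` in microwatts and that the range in the message above is in watts; the helper names are made up for illustration:

```python
from pathlib import Path


def find_amdgpu_hwmon(root="/sys/class/hwmon"):
    """Yield hwmon directories whose 'name' file says 'amdgpu'."""
    for d in sorted(Path(root).glob("hwmon*")):
        name = d / "name"
        if name.is_file() and name.read_text().strip() == "amdgpu":
            yield d


def read_power_caps(hwmon_dir):
    """Return (current, minimum, maximum) power cap in watts."""
    hwmon = Path(hwmon_dir)

    def watts(filename):
        # The hwmon power files hold microwatts.
        return int((hwmon / filename).read_text().strip()) / 1_000_000

    return watts("power1_cap"), watts("power1_cap_min"), watts("power1_cap_max")
```

Printing `read_power_caps(d)` for each `d` from `find_amdgpu_hwmon()` shows whether the current cap sits inside the range from the log. It does not tell you which program tried to set the limit to 0.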

I really don't know what to do now.



Other info:
Kernel: 6.7.4-zen1-1-zen
DE: KDE Plasma (5.27.10) (Wayland)

and since this is a hardware issue:
MB: X570 Steel Legend
CPU: 5800X3D
RAM: DDR4 3600 MHz

Last edited by UnknownUser95 (2024-02-10 14:59:15)

Offline

#2 2024-02-08 19:14:42

cryptearth
Member
Registered: 2024-02-03
Posts: 427

Re: amdgpu crashes since this week

As there's been quite a lot of work on amdgpu due to recent issues, it's possible that this is causing the crashes. Still, a few questions:

- What specific brand and model is your gpu?
- Is it stock speeds or overclocked?
- Why ZEN kernel? Are you using ZEN?
- Have you tried standard kernel?
- Have you tried older kernels like 6.6-lts?
- Have you tried X11 instead of Wayland?
- How are your displays connected? DisplayPort, HDMI, DVI, any adapter?
- Does the issue also occur on 2D desktop or only on 3D load?

Online

#3 2024-02-08 19:33:30

loqs
Member
Registered: 2014-03-06
Posts: 17,765

Re: amdgpu crashes since this week

In addition what was the last kernel version without the issue?  What was the first kernel version with the issue?

Offline

#4 2024-02-08 20:34:35

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

cryptearth wrote:

As there's been quite a lot of work on amdgpu due to recent issues, it's possible that this is causing the crashes. Still, a few questions:

- What specific brand and model is your gpu?
- Is it stock speeds or overclocked?
- Why ZEN kernel? Are you using ZEN?
- Have you tried standard kernel?
- Have you tried older kernels like 6.6-lts?
- Have you tried X11 instead of Wayland?
- How are your displays connected? DisplayPort, HDMI, DVI, any adapter?
- Does the issue also occur on 2D desktop or only on 3D load?

Whoops, forgot about that: I have the XFX Speedster SWFT 319.

I use the Zen Kernel, because it is supposed to be better (better performance for higher power usage). I didn't think about that too much, so I didn't try the regular or the LTS one. That's on me, although I hoped for something for the current one.
I'm not really interested in X11. Wayland has solved quite a few annoyances, so I simply didn't want to switch - again on me.

I really should've tried more instead of only using the stuff I regularly use, very sorry about that. I'll get back with some information regarding the other kernels and X11.


And for the rest:
- it crashes on any usage. I had a very low usage ffmpeg encode with nothing else running and got a crash. It seems any non-desktop dGPU usage means a crash.
- I use DP only

Offline

#5 2024-02-08 20:36:13

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

loqs wrote:

In addition what was the last kernel version without the issue?  What was the first kernel version with the issue?

I can't answer that.
If I remember correctly, it was the 6.6 series, but I can't say for sure.

Offline

#6 2024-02-08 20:39:50

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

Quick Addition:

I changed nothing, except using the LTS kernel. The "amdgpu 0000:0f:00.0: amdgpu: New power limit (0) is out of range [229,293]" message is gone.

Offline

#7 2024-02-08 21:36:28

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

The LTS kernel didn't change anything.
I'm trying X11 now.

Offline

#8 2024-02-08 21:41:39

loqs
Member
Registered: 2014-03-06
Posts: 17,765

Re: amdgpu crashes since this week

You can check in /var/log/pacman.log to see which packages have been upgraded in the past week and the versions involved.
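
A small sketch for pulling just the upgrades out of that log, assuming the current pacman.log line layout (`[ISO timestamp] [ALPM] upgraded pkg (old -> new)`). The function names are made up for illustration, and lines with the older timestamp layout are simply skipped:

```python
import re
from datetime import datetime, timedelta, timezone

# "[2024-02-03T10:15:32+0100] [ALPM] upgraded pkg (old -> new)"
UPGRADE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[ALPM\] upgraded "
    r"(?P<pkg>\S+) \((?P<old>\S+) -> (?P<new>\S+)\)"
)


def recent_upgrades(lines, since):
    """Yield (timestamp, package, old_version, new_version) for upgrades at or after `since`."""
    for line in lines:
        m = UPGRADE_RE.match(line)
        if not m:
            continue
        try:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S%z")
        except ValueError:
            continue  # older timestamp layout, not handled in this sketch
        if ts >= since:
            yield ts, m.group("pkg"), m.group("old"), m.group("new")


def upgrades_in_last_days(path="/var/log/pacman.log", days=7):
    """Read the log file and return the upgrades from the last `days` days."""
    since = datetime.now(timezone.utc) - timedelta(days=days)
    with open(path, encoding="utf-8", errors="replace") as fh:
        return list(recent_upgrades(fh, since))
```

`upgrades_in_last_days()` then returns the list to go through; the kernel, mesa and linux-firmware entries are the obvious ones to compare against the first crash date.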

Offline

#9 2024-02-08 22:11:45

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

That's a really good idea!

So, I updated on the third, two days before a crash:
linux 6.7.2 -> 6.7.3
linux-lts 6.6.14-1 -> 6.6.15-1
and mesa updates

I can try a rollback.





While looking for an exact date, I found something that may be related:
On the fourth, right after logging in, my PC was unresponsive, so I had to hard reset.
I found this log repeated 3 times:

watchdog: BUG: soft lockup - CPU#10 stuck for 85s! [Thread (pooled):1685]
Feb 04 11:45:55.359340 xbox kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq nfnetlink_queue nf_conntrack_netlink ip6t_REJECT nf_reject_ipv6 ip6table_nat ip6table_filter ip6table_mangle ip6_tables xt_nat ipt_REJECT nf_reject_ipv4 xt_NFQUEUE xt_mark xt_connmark iptable_nat nf_nat nf_conntrack nf_>
Feb 04 11:45:55.375907 xbox kernel:  rapl wmi_bmof dca snd pcspkr acpi_cpufreq soundcore k10temp ccp i2c_piix4 ryzen_smu(OE) mac_hid vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) uinput i2c_dev sg crypto_user dm_mod fuse loop nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid amdgpu video amdxc>
Feb 04 11:45:55.375926 xbox kernel: CPU: 10 PID: 1685 Comm: Thread (pooled) Tainted: G           OEL     6.7.3-zen1-1-zen #1 139a3424eeb05a95d5e71c5f895423dc81d9738c
Feb 04 11:45:55.375937 xbox kernel: Hardware name: To Be Filled By O.E.M. X570 Steel Legend/X570 Steel Legend, BIOS P4.10 10/19/2022
Feb 04 11:45:55.375948 xbox kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x225/0x2e0
Feb 04 11:45:55.375958 xbox kernel: Code: 41 c1 e4 10 41 c1 e5 12 45 09 ec 44 89 e0 c1 e8 10 66 87 43 02 89 c2 c1 e2 10 81 fa ff ff 00 00 77 5e 31 d2 eb 02 f3 90 8b 03 <66> 85 c0 75 f7 44 39 e0 0f 84 8e 00 00 00 c6 03 01 48 85 d2 74 0e
Feb 04 11:45:55.375970 xbox kernel: RSP: 0018:ffffb13185c7bad8 EFLAGS: 00000206
Feb 04 11:45:55.375979 xbox kernel: RAX: 00000000002c0200 RBX: ffff9cbc1e3a5a28 RCX: 0000000000000000
Feb 04 11:45:55.375989 xbox kernel: RDX: 0000000000000000 RSI: 0000000000000200 RDI: ffff9cbc1e3a5a28
Feb 04 11:45:55.375998 xbox kernel: RBP: ffff9cc21ecb5040 R08: 00000000fffbf832 R09: fffffffa02ba3eb8
Feb 04 11:45:55.376007 xbox kernel: R10: 000000000000057a R11: 0000000000000001 R12: 00000000002c0000
Feb 04 11:45:55.376016 xbox kernel: R13: 00000000002c0000 R14: 0000000000014668 R15: 0000000000000000
Feb 04 11:45:55.376026 xbox kernel: FS:  000077f597c006c0(0000) GS:ffff9cc21ec80000(0000) knlGS:0000000000000000
Feb 04 11:45:55.376040 xbox kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 04 11:45:55.376050 xbox kernel: CR2: 000077f5a17c400a CR3: 0000000156f6e000 CR4: 0000000000f50ef0
Feb 04 11:45:55.376061 xbox kernel: PKRU: 55555554
Feb 04 11:45:55.376071 xbox kernel: Call Trace:
Feb 04 11:45:55.376079 xbox kernel:  <IRQ>
Feb 04 11:45:55.376089 xbox kernel:  ? watchdog_timer_fn+0x1b8/0x220
Feb 04 11:45:55.376098 xbox kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Feb 04 11:45:55.376109 xbox kernel:  ? __hrtimer_run_queues+0x124/0x2c0
Feb 04 11:45:55.376118 xbox kernel:  ? hrtimer_interrupt+0xfb/0x440
Feb 04 11:45:55.376128 xbox kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 04 11:45:55.376139 xbox kernel:  ? __sysvec_apic_timer_interrupt+0x50/0x140
Feb 04 11:45:55.376146 xbox kernel:  ? sysvec_apic_timer_interrupt+0x6d/0x90
Feb 04 11:45:55.376155 xbox kernel:  </IRQ>
Feb 04 11:45:55.376164 xbox kernel:  <TASK>
Feb 04 11:45:55.376173 xbox kernel:  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
Feb 04 11:45:55.376182 xbox kernel:  ? native_queued_spin_lock_slowpath+0x225/0x2e0
Feb 04 11:45:55.376193 xbox kernel:  _raw_spin_lock+0x29/0x30
Feb 04 11:45:55.376203 xbox kernel:  shrink_lock_dentry+0xb9/0x100
Feb 04 11:45:55.376224 xbox kernel:  shrink_dentry_list+0xb5/0x2a0
Feb 04 11:45:55.376234 xbox kernel:  prune_dcache_sb+0x55/0x80
Feb 04 11:45:55.376247 xbox kernel:  super_cache_scan+0x128/0x1c0
Feb 04 11:45:55.376256 xbox kernel:  do_shrink_slab+0x13d/0x360
Feb 04 11:45:55.376266 xbox kernel:  shrink_slab+0x296/0x3c0
Feb 04 11:45:55.376276 xbox kernel:  shrink_node+0x1de/0xd00
Feb 04 11:45:55.376286 xbox kernel:  ? filemap_map_pages+0x3d0/0x500
Feb 04 11:45:55.376295 xbox kernel:  do_try_to_free_pages+0xe7/0x5c0
Feb 04 11:45:55.376304 xbox kernel:  try_to_free_mem_cgroup_pages+0x10b/0x230
Feb 04 11:45:55.376313 xbox kernel:  reclaim_high+0xb1/0x100
Feb 04 11:45:55.376322 xbox kernel:  mem_cgroup_handle_over_high+0xa8/0x260
Feb 04 11:45:55.376331 xbox kernel:  exit_to_user_mode_prepare+0x160/0x1f0
Feb 04 11:45:55.376338 xbox kernel:  irqentry_exit_to_user_mode+0x9/0x30
Feb 04 11:45:55.376347 xbox kernel:  asm_exc_page_fault+0x26/0x30
Feb 04 11:45:55.376356 xbox kernel: RIP: 0033:0x7835a190c1d8
Feb 04 11:45:55.376365 xbox kernel: Code: c2 0f b7 c9 c1 e2 10 09 ca 89 53 40 31 d2 48 89 74 c3 48 66 89 94 43 48 01 00 00 40 f6 c5 01 0f 85 1d 01 00 00 4c 8b 7c 24 10 <41> 0f b7 47 0a a8 01 0f 84 a3 00 00 00 8b 73 20 85 f6 74 11 41 0f
Feb 04 11:45:55.376375 xbox kernel: RSP: 002b:000077f597bff460 EFLAGS: 00010246
Feb 04 11:45:55.376384 xbox kernel: RAX: 0000000000000002 RBX: 000077f597bff550 RCX: 0000000000000003
Feb 04 11:45:55.376394 xbox kernel: RDX: 0000000000000000 RSI: 000077f5a17c4000 RDI: 000078359cc00010
Feb 04 11:45:55.376403 xbox kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
Feb 04 11:45:55.376412 xbox kernel: R10: 000077f588005630 R11: 91e91a8d36a7c031 R12: 0000000000000000
Feb 04 11:45:55.376425 xbox kernel: R13: 000077f597bff930 R14: 000000000000006a R15: 000077f5a17c4000
Feb 04 11:45:55.376433 xbox kernel:  </TASK>

(I updated to the latest BIOS today, so v4.10 is outdated)


The timeline so far:
03.02.  -> 04.02.       -> 05.02.
update  -> watchdog bug -> first GPU crash

Offline

#10 2024-02-09 00:21:06

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 22,387

Re: amdgpu crashes since this week

FWIW I have a 6900XT as well and no issues fully updated... if you're unlucky we're looking at HW issues.

Offline

#11 2024-02-09 09:05:06

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

I reseated the GPU, just to make sure. It didn't change anything.


I got two new errors though:

Feb 09 09:58:35.667638 xbox kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

This one is repeated twice:

Feb 09 09:56:39.868284 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.883386 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04b9e7c000 from client 0x1b (UTCL2)
Feb 09 09:56:39.883546 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
Feb 09 09:56:39.883684 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Feb 09 09:56:39.883816 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x1
Feb 09 09:56:39.883948 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.884109 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Feb 09 09:56:39.884239 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.884361 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.884468 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.884571 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04b9e9c000 from client 0x1b (UTCL2)
Feb 09 09:56:39.884675 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
Feb 09 09:56:39.884776 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Feb 09 09:56:39.884878 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x1
Feb 09 09:56:39.884980 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.885081 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Feb 09 09:56:39.885183 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.885292 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.885395 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.885497 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04b9ebb000 from client 0x1b (UTCL2)
Feb 09 09:56:39.885602 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
Feb 09 09:56:39.885702 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Feb 09 09:56:39.885801 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x1
Feb 09 09:56:39.885900 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.885997 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Feb 09 09:56:39.886095 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.886193 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.886300 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.886400 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04b9edb000 from client 0x1b (UTCL2)
Feb 09 09:56:39.886503 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 09 09:56:39.886603 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 09 09:56:39.886703 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 09 09:56:39.886804 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.886902 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 09 09:56:39.887003 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.887102 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.887201 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.887307 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04b9f1b000 from client 0x1b (UTCL2)
Feb 09 09:56:39.887408 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 09 09:56:39.887508 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 09 09:56:39.887608 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 09 09:56:39.887707 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.887805 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 09 09:56:39.887903 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.888002 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.888101 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.888198 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04b9efb000 from client 0x1b (UTCL2)
Feb 09 09:56:39.888305 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 09 09:56:39.888408 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 09 09:56:39.888507 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 09 09:56:39.888604 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.888702 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 09 09:56:39.888802 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.888900 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.889025 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.889126 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04b9f1b000 from client 0x1b (UTCL2)
Feb 09 09:56:39.889226 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 09 09:56:39.889333 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 09 09:56:39.889433 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 09 09:56:39.889532 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.889630 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 09 09:56:39.889728 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.889827 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.889926 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.890021 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04bdb01000 from client 0x1b (UTCL2)
Feb 09 09:56:39.890124 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 09 09:56:39.890224 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 09 09:56:39.890330 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 09 09:56:39.890428 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 09:56:39.890525 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 09 09:56:39.890625 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 09:56:39.890723 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:          RW: 0x0
Feb 09 09:56:39.890823 xbox kernel: amdgpu 0000:0f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32791, for process java pid 3215 thread java:cs0 pid 3341)
Feb 09 09:56:39.890922 xbox kernel: amdgpu 0000:0f:00.0: amdgpu:   in page starting at address 0x0000fd04bdb21000 from client 0x1b (UTCL2)
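
The GCVM_L2_PROTECTION_FAULT_STATUS word can also be decoded by hand, which helps when a log only shows the raw value. The bit positions below are an assumption: they were inferred by matching the raw values in this thread against the fields the driver prints next to them (including the vmid from the page fault line), not taken from the register documentation.

```python
# name: (shift, width); layout inferred from the log lines in this thread (assumption)
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # the "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw fault status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    for name, value in decode_fault_status(0x00701031).items():
        print(f"{name}: {value:#x}")
```

For 0x00701031 this gives the same values the driver printed: client ID 0x8 (TCP), PERMISSION_FAULTS 0x3, MORE_FAULTS 0x1, and VMID 7. The follow-up faults with a status of 0x00000000 decode to all zeros, so they carry no extra information beyond the address.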

Offline

#12 2024-02-09 10:39:39

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

Yesterday I tried different kernels, now I tried X11 instead of Wayland:

I haven't had a crash yet while putting more stress on the GPU, so it looks like a Plasma-Wayland issue with the new kernel, mesa, or glibc/gcc-libs versions.
However V1del has no issues, while we're using the same versions.
I just don't know.

I have to test X11 more, to see if that's really the cause.

Offline

#13 2024-02-09 11:20:46

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 22,387

Re: amdgpu crashes since this week

I might have to test something that really stresses things/different usage, sometimes some workloads can be wonky, e.g. I know AoEIV for example can lead to various crashes but that seems more game related (but will similarly trigger the kind of faults you see above).

FWIW main stuff I'm testing/playing right now is Street Fighter 6 which generally isn't that taxing on the GPU.

Offline

#14 2024-02-09 21:38:02

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

It's not an X11 or Wayland issue, just got a crash within the X11 session.

I'll fully disable the amdgpu.ppfeaturemask setting. If that doesn't fix it, it's probably a hardware issue (although it's weird that the crashes only appeared this week).

Offline

#15 2024-02-10 10:53:10

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

I now have downgraded all updates and still got a crash.

It's most likely a hardware issue hmm

Offline

#16 2024-02-10 21:29:40

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

Alright, I tested Windows and it looked stable.
So it's just something weird with my system.

To be fair, my system was cursed from the beginning. I'll just do an actual fresh install and it should be good again.

Offline

#17 2024-02-11 20:55:31

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

An update to this:

This looks like it's fixed now.

A reinstall worked. I have absolutely no idea what caused that issue.

Last edited by UnknownUser95 (2024-02-11 21:29:03)

Offline

#18 2024-02-12 17:09:03

Oddwierdo
Member
Registered: 2023-07-29
Posts: 32

Re: amdgpu crashes since this week

I'm kinda running into this issue as well.

amdgpu 0000:2d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32770, for process firefox pid 1683 thread firefox:cs0 pid 1687)
amdgpu 0000:2d:00.0: amdgpu:   in page starting at address 0x0000800300120000 from client 0x1b (UTCL2)
amdgpu 0000:2d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601031
amdgpu 0000:2d:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:2d:00.0: amdgpu:          MORE_FAULTS: 0x1
amdgpu 0000:2d:00.0: amdgpu:          WALKER_ERROR: 0x0
amdgpu 0000:2d:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
amdgpu 0000:2d:00.0: amdgpu:          MAPPING_ERROR: 0x0
amdgpu 0000:2d:00.0: amdgpu:          RW: 0x0

Happened twice so far. Both times it happened while browsing with Firefox. Dunno what's causing it. Gaming is fine, for example.

Last edited by Oddwierdo (2024-02-12 17:12:14)

Offline

#19 2024-02-15 19:25:36

UnknownUser95
Member
Registered: 2024-02-08
Posts: 13

Re: amdgpu crashes since this week

Update on my problem:

I got another crash, so I investigated more.

CoreCtrl has the option to let the GPU handle its fans and with that its temps. For some reason that was broken.
No fan control whatsoever. I made my own curve, and temps dropped drastically (while staying quiet).

I have to see if this was the issue.

Offline

#20 2024-02-16 07:03:46

kaholer444
Member
Registered: 2024-02-16
Posts: 5

Re: amdgpu crashes since this week

I have the same issue

amdgpu 0000:28:00.0: amdgpu: New power limit (0) is out of range [272,363]
amdgpu 0000:28:00.0: amdgpu: New power limit (0) is out of range [272,363]

desktop: kde plasma 5.27.10
kernel: 6.7.4-1-cachyos-eevdf
cpu: 5800x
gpu: 6950 xt

disabling TearFree seems to fix the issue on x11

/etc/X11/xorg.conf.d/20-amdgpu.conf
Section "OutputClass"
        Identifier "AMD"
        Driver "amdgpu"
        Option "TearFree" "false"
        Option "EnablePageFlip" "false"
EndSection

Offline

#21 2024-02-17 22:52:27

kaholer444
Member
Registered: 2024-02-16
Posts: 5

Re: amdgpu crashes since this week

more errors

Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00501431
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x1
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:5 pasid:32786, for process firefox pid 19172 thread firefox:cs0 pid 19250)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MORE_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 17 22:42:43 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu:          RW: 0x0
Feb 17 22:42:54 cachyos-x8664 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=35833771, emitted seq=35833773
Feb 17 22:42:54 cachyos-x8664 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 19172 thread firefox:cs0 pid 19250
Feb 17 22:42:54 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Feb 17 22:42:58 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: failed to suspend display audio
Feb 17 22:42:58 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: MODE1 reset
Feb 17 22:42:58 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset
Feb 17 22:42:58 cachyos-x8664 kernel: amdgpu 0000:28:00.0: amdgpu: GPU smu mode1 reset
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00
Feb 17 22:42:58 cachyos-x8664 kernel: snd_hda_intel 0000:28:00.1: spurious response 0x0:0x0, last cmd=0x7f7c00

Any ideas?

Offline

#22 2024-02-17 23:16:52

seth
Member
Registered: 2012-09-03
Posts: 54,562

Re: amdgpu crashes since this week

The OP wrote:

CoreCtrl has the option to let the GPU handle its fans and with that its temps. For some reason that was broken.

@UnknownUser95, has the situation since stabilized?

Otherwise:
https://bbs.archlinux.org/viewtopic.php … 2#p2151162 ?
Tried the LTS kernel?

Offline

#23 2024-02-18 00:20:21

cryptearth
Member
Registered: 2024-02-03
Posts: 427

Re: amdgpu crashes since this week

Sorry for being late back to the party. Just updated to 6.7.5 and had myself a beer because the reboot issue seems to be fixed (at least for my system).
First of all: I have to apologize for my initial reply 10 days ago: I totally messed up by misreading ZEN as XEN, hence my stupid question "do you use zen?". It was meant to say "do you use xen (virtualization)?". Since I read up on the Zen kernel and as you use a zen based system it seems obvious to use a kernel optimized for it.

Back to the topic:
You all seem to run very beefy high end cards: 6900 ... 6950 ...
@Oddwierdo what card do you use?
Just out of curiosity: Are your GPUs somehow marked as "O.C. version"? As Windows seems to be able to keep the cards stable, this may be driver related: OC cards are usually tested at the factory to be stable on the OC profile they come with, but I'm sure they're only tested with Windows and the then-current AMD drivers.

Also: As 6.7.5 was just released with a bunch of amdgpu fixes: Have any of you updated yet?

Online

#24 2024-02-18 00:41:47

kaholer444
Member
Registered: 2024-02-16
Posts: 5

Re: amdgpu crashes since this week

Not an OC version. The GPU is an "XFX Speedster MERC 319", p/n: RX-695XATBD9

I'm in the process of installing a fresh version of Arch to isolate the problem. I read that setting "amdgpu.mcbp=0" may help. I will test the kernel version you mentioned plus the LTS kernel and report back.
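
To confirm that "amdgpu.mcbp=0" actually took effect after a reboot, something like the rough Python sketch below could be used. It is only an illustration: the helper names (cmdline_param, read_text_or_none, report) are made up for this example, and it assumes the driver exposes the parameter under /sys/module/amdgpu/parameters/mcbp, which may not be the case on every kernel. If the file is missing, the script just reports None. How to add the parameter to the kernel command line depends on your boot loader and is not covered here.

```python
from pathlib import Path


def cmdline_param(cmdline, name):
    """Return the value of `name=value` from a kernel command line string.

    Returns None if the parameter is not present. If it appears more than
    once, the last occurrence is returned (an assumption of this sketch).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


def read_text_or_none(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report():
    """Print what was requested at boot and what the driver reports."""
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("requested on command line:", cmdline_param(cmdline, "amdgpu.mcbp"))
    # Assumed sysfs location; not guaranteed to exist on every system.
    print("value seen by the driver:",
          read_text_or_none("/sys/module/amdgpu/parameters/mcbp"))
```

Calling report() after booting with the parameter should show "0" on the first line. If the second line shows something else or None, the setting probably did not reach the driver.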

Offline

#25 2024-02-18 08:59:09

seth
Member
Registered: 2012-09-03
Posts: 54,562

Re: amdgpu crashes since this week

Since I read up on the Zen kernel and as you use a zen based system it seems obvious to use a kernel optimized for it.

The linux-zen kernel is NOT optimized for Ryzen systems, see https://wiki.archlinux.org/title/Kernel … ed_kernels

Offline
