EDIT 2026-05-25: The solution appears to ultimately consist of `amd_iommu=off`, switching from `mesa` to `mesa-git`, and switching from `lightdm` to `atrium`. Been running for 2 days now with no crashes, no dmesg errors. Can push the PC all day without any issue.
There are still occasional graphical glitches, but those generally just require restarting the affected application and don't cause any degraded functionality.
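To confirm after a reboot that a parameter such as `amd_iommu=off` really ended up on the kernel command line, something like the following could be used. This is only a minimal sketch: `parse_cmdline` and `read_cmdline` are made-up helper names, and the parser assumes no quoted values containing spaces.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(text: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Bare flags without '=' map to None. Quoted values containing
    spaces are NOT handled (assumption of this sketch).
    """
    params: Dict[str, Optional[str]] = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path: str = "/proc/cmdline") -> Dict[str, Optional[str]]:
    """Parse the command line of the running kernel."""
    return parse_cmdline(Path(path).read_text())


# Example using the command line from this thread plus the workaround:
sample = "loglevel=8 split_lock_detect=off preempt=full amd_iommu=off"
print(parse_cmdline(sample).get("amd_iommu"))
```

On the affected machine, `read_cmdline().get("amd_iommu")` would be the call to make; `None` means the parameter is not set.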
I run a multi-seat system with two ASRock Intel Arc B580 Challenger OC GPUs.
I'm using lightdm (nothing else currently seems to support multi-seat with Wayland sessions without crashing).
Seats are assigned via loginctl attach, and named seat0 and seat-hoodie. seat0 runs Sway, seat-hoodie runs Plasma (Wayland).
The GPU on seat-hoodie frequently reports errors such as the following:
[ 6065.892513] xe 0000:08:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0xf9c3c000 flags=0x0020]
[ 8204.519701] xe 0000:08:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0xf9c03000 flags=0x0020]
[ 8938.663986] xe 0000:08:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0xfa12d000 flags=0x0020]
[ 9744.086126] perf: interrupt took too long (3414 > 3328), lowering kernel.perf_event_max_sample_rate to 58500
[10593.961866] xe 0000:08:00.0: [drm] Tile0: GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0
[10593.961896] xe 0000:08:00.0: [drm] Tile0: GT0: Check job timeout: seqno=5314202, lrc_seqno=5314202, guc_id=0, not started
[10593.961905] xe 0000:08:00.0: [drm] Tile0: GT0: Timedout job: seqno=5314202, lrc_seqno=5314202, guc_id=0, flags=0x53 in no process [-1]
[10593.984330] xe 0000:08:00.0: [drm] Xe device coredump has been created
[10593.984338] xe 0000:08:00.0: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
[10593.984340] ------------[ cut here ]------------
[10593.984342] xe 0000:08:00.0: [drm] Tile0: GT0: Kernel-submitted job timed out
[10593.984366] WARNING: CPU: 9 PID: 2234299 at drivers/gpu/drm/xe/xe_guc_submit.c:1386 guc_exec_queue_timedout_job+0x8f9/0xc60 [xe]
[10593.984437] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq netconsole vfat fat uvcvideo videobuf2_vmalloc uvc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev atlantic igc macsec ptp pps_core ucsi_acpi typec_ucsi amd_atl intel_rapl_msr typec intel_rapl_common roles snd_hda_codec_alc882 snd_hda_codec_intelhdmi kvm_amd snd_hda_codec_realtek_lib snd_hda_codec_hdmi spd5118 snd_hda_codec_generic mei_lb mei_gsc_proxy kvm asus_ec_sensors irqbypass snd_hda_intel polyval_clmulni eeepc_wmi ghash_clmulni_intel snd_hda_codec asus_wmi aesni_intel sp5100_tco platform_profile snd_hda_core sparse_keymap rapl pcspkr snd_usb_audio rfkill wmi_bmof thunderbolt snd_intel_dspcfg gpio_amdpt gpio_generic snd_usbmidi_lib i2c_piix4 acpi_pad snd_intel_sdw_acpi amd_3d_vcache k10temp i2c_smbus snd_hwdep ccp snd_ump snd_rawmidi snd_seq_device snd_pcm snd_timer snd soundcore ses mc enclosure mei_gsc mei_me mtd_intel_dg pmt_telemetry mousedev joydev pmt_discovery pmt_crashlog mei pmt_class mtd mac_hid i2c_dev crypto_user ntsync dm_mod
[10593.984470] nfnetlink mpt3sas raid_class hid_logitech_dj scsi_transport_sas xe intel_vsec drm_ttm_helper ttm i2c_algo_bit drm_suballoc_helper drm_buddy gpu_sched nvme drm_gpuvm drm_exec nvme_core drm_gpusvm_helper nvme_keyring drm_display_helper video nvme_auth cec hkdf wmi hid_logitech_hidpp zfs(POE) spl(OE)
[10593.984493] CPU: 9 UID: 0 PID: 2234299 Comm: kworker/u64:2 Tainted: P OE 6.18.32-1-lts #1 PREEMPT(full) e22f7a5ec35a13e3d48c5f5fec5197f5d753ef16
[10593.984497] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[10593.984498] Hardware name: ASUS System Product Name/ProArt X870E-CREATOR WIFI, BIOS 2202 04/09/2026
[10593.984500] Workqueue: gt-ordered-wq drm_sched_job_timedout [gpu_sched]
[10593.984505] RIP: 0010:guc_exec_queue_timedout_job+0x8f9/0xc60 [xe]
[10593.984575] Code: 4c 24 10 44 89 44 24 08 e8 94 df 75 fb 44 8b 44 24 08 8b 4c 24 10 48 c7 c7 08 65 14 c1 48 8b 54 24 18 48 89 c6 e8 c7 8e b2 fa <0f> 0b 49 8b 47 60 e9 30 f9 ff ff c6 04 24 01 bb c2 ff ff ff e9 95
[10593.984577] RSP: 0018:ffffcc3cb19afd88 EFLAGS: 00010246
[10593.984579] RAX: 0000000000000000 RBX: 00000000ffffffc2 RCX: 0000000000000027
[10593.984581] RDX: ffff8a0abdc5d008 RSI: 0000000000000001 RDI: ffff8a0abdc5d000
[10593.984582] RBP: ffff89fba894e800 R08: 0000000000000000 R09: ffffffffbdc614a0
[10593.984584] R10: ffffcc3cb19afc30 R11: ffffcc3cb19afc28 R12: ffff89fb8a33c028
[10593.984585] R13: ffff89fba894e818 R14: ffff89fda3761180 R15: ffff89fbb6298000
[10593.984587] FS: 0000000000000000(0000) GS:ffff8a0aff3fc000(0000) knlGS:0000000000000000
[10593.984588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10593.984590] CR2: 00007fd60e53c000 CR3: 000000072e171000 CR4: 0000000000f50ef0
[10593.984592] PKRU: 55555554
[10593.984593] Call Trace:
[10593.984595] <TASK>
[10593.984597] ? mm_cid_get.isra.0+0x3d0/0x410
[10593.984602] ? __pfx_autoremove_wake_function+0x10/0x10
[10593.984605] drm_sched_job_timedout+0x92/0x1a0 [gpu_sched 27a45555e950af75100a45b09169d1b920eb4551]
[10593.984609] process_one_work+0x191/0x350
[10593.984612] worker_thread+0x198/0x300
[10593.984614] ? __pfx_worker_thread+0x10/0x10
[10593.984615] kthread+0xfa/0x240
[10593.984618] ? __pfx_kthread+0x10/0x10
[10593.984620] ? __pfx_kthread+0x10/0x10
[10593.984622] ret_from_fork+0x22a/0x260
[10593.984624] ? __pfx_kthread+0x10/0x10
[10593.984626] ret_from_fork_asm+0x1a/0x30
[10593.984629] </TASK>
[10593.984631] ---[ end trace 0000000000000000 ]---
[10593.984633] xe 0000:08:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
[10593.984677] xe 0000:08:00.0: [drm] Tile0: GT0: reset queued
[10593.984681] xe 0000:08:00.0: [drm] Tile0: GT0: reset started
[10593.993920] xe 0000:08:00.0: [drm] Tile0: GT0: reset done
[10839.339427] xe 0000:08:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0xf9b8d000 flags=0x0020]
The stack traces seem to variably report either "kernel-submitted job timed out" or "job timed out ... [kwin_wayland]".
Running Plasma on seat0 never has any issues.
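To see at a glance which timeouts are attributed to a process and which are kernel-submitted, the `Timedout job` lines can be tallied per device and process. A rough sketch, assuming every such line has the same shape as the `... in no process [-1]` line in the log above (the `kwin_wayland` variant is assumed to follow the same pattern); `TIMEDOUT_RE` and `timed_out_jobs` are hypothetical names.

```python
import re
from collections import Counter

# Matches e.g.
#   xe 0000:08:00.0: [drm] Tile0: GT0: Timedout job: ... in no process [-1]
# The "in <comm> [<pid>]" tail is assumed from the single line shown above.
TIMEDOUT_RE = re.compile(
    r"xe (?P<device>[0-9a-f:.]+): \[drm\] .*Timedout job: .* "
    r"in (?P<comm>.+) \[(?P<pid>-?\d+)\]\s*$"
)


def timed_out_jobs(lines):
    """Count timed-out jobs per (PCI device, process name)."""
    counts = Counter()
    for line in lines:
        match = TIMEDOUT_RE.search(line)
        if match:
            counts[(match.group("device"), match.group("comm"))] += 1
    return counts


sample = [
    "[10593.961905] xe 0000:08:00.0: [drm] Tile0: GT0: Timedout job: "
    "seqno=5314202, lrc_seqno=5314202, guc_id=0, flags=0x53 in no process [-1]",
]
for (device, comm), n in timed_out_jobs(sample).items():
    print(device, comm, n)
```

Feeding it the output of `dmesg` or `journalctl -k` line by line should work the same way, since the regex ignores whatever timestamp prefix comes first.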
Sometimes, this situation can devolve into the seat fully freezing, with the message
May 21 19:26:32 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
May 21 19:26:32 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
May 21 19:26:32 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
May 21 19:26:32 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
May 21 19:26:32 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
being printed nearly 1000 times per second(!!!), so fast that systemd-journald can't keep up with it and reports missed kernel messages.
At this point only force-resetting the machine helps. Because of the extreme nature of the crash, I also can't easily attach a full journal. `journalctl -b -1 -l` produced 1 GB (!) of output.
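One way to shrink such a journal to something attachable is to collapse runs of identical kernel messages into a single line plus a repeat count. The following is a minimal sketch, assuming the default journalctl short format (`<timestamp> <host> kernel: <message>`); `message_of` and `collapse_repeats` are hypothetical names.

```python
from itertools import groupby


def message_of(line: str) -> str:
    """Return the part after ' kernel: ' so timestamps don't defeat grouping."""
    _, sep, message = line.partition(" kernel: ")
    return message if sep else line


def collapse_repeats(lines):
    """Yield each run of identical messages once, followed by a repeat count.

    Works on any iterable of lines and never holds a whole run in memory.
    """
    for _, run in groupby(lines, key=message_of):
        first = next(run)
        extra = sum(1 for _ in run)
        yield first.rstrip("\n")
        if extra:
            yield f"  [previous message repeated {extra} more time(s)]"


sample = [
    "May 21 19:26:32 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: reset queued\n",
    "May 21 19:26:32 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: reset queued\n",
    "May 21 19:26:33 ancyor kernel: xe 0000:08:00.0: [drm] Tile0: GT0: reset done\n",
]
for out in collapse_repeats(sample):
    print(out)
```

In practice the input would be `sys.stdin`, with something like `journalctl -k -b -1` piped into the script, so the 1 GB of output is streamed rather than loaded at once.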
Video: https://txt.buny.computer/dmesg_crash.mp4
System information:
`lspci -nn`:
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex [1022:14d8]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU [1022:14d9]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A] [1022:14dd]
00:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A] [1022:14dd]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 71)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0 [1022:14e0]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1 [1022:14e1]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2 [1022:14e2]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3 [1022:14e3]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4 [1022:14e4]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5 [1022:14e5]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6 [1022:14e6]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7 [1022:14e7]
01:00.0 PCI bridge [0604]: Intel Corporation Device [8086:e2ff] (rev 01)
02:01.0 PCI bridge [0604]: Intel Corporation Device [8086:e2f0]
02:02.0 PCI bridge [0604]: Intel Corporation Device [8086:e2f1]
03:00.0 VGA compatible controller [0300]: Intel Corporation Battlemage G21 [Arc B580] [8086:e20b]
04:00.0 Audio device [0403]: Intel Corporation Device [8086:e2f7]
05:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal] [144d:a80c]
06:00.0 PCI bridge [0604]: Intel Corporation Device [8086:e2ff] (rev 01)
07:01.0 PCI bridge [0604]: Intel Corporation Device [8086:e2f0]
07:02.0 PCI bridge [0604]: Intel Corporation Device [8086:e2f1]
08:00.0 VGA compatible controller [0300]: Intel Corporation Battlemage G21 [Arc B580] [8086:e20b]
09:00.0 Audio device [0403]: Intel Corporation Device [8086:e2f7]
0a:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port [1022:43f4] (rev 01)
0b:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0b:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0b:0c.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0b:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0c:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal] [144d:a80c]
0d:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port [1022:43f4] (rev 01)
0e:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0e:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0e:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0e:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0e:0c.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0e:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0f:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx [1000:00e6]
10:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I226-V [8086:125c] (rev 06)
11:00.0 Ethernet controller [0200]: Aquantia Corp. AQtion AQC113CS NBase-T/IEEE 802.3an Ethernet Controller [Antigua 10G] [1d6a:94c0] (rev 03)
12:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal] [144d:a80c]
13:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 800 Series Chipset USB 3.x XHCI Controller [1022:43fd] (rev 01)
14:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller [1022:43f6] (rev 01)
15:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 800 Series Chipset USB 3.x XHCI Controller [1022:43fd] (rev 01)
16:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller [1022:43f6] (rev 01)
17:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM4242 PCIe Switch Upstream Port [1b21:2421] (rev 01)
18:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM4242 PCIe Switch Downstream Port [1b21:2423] (rev 01)
18:01.0 PCI bridge [0604]: ASMedia Technology Inc. ASM4242 PCIe Switch Downstream Port [1b21:2423] (rev 01)
18:02.0 PCI bridge [0604]: ASMedia Technology Inc. ASM4242 PCIe Switch Downstream Port [1b21:2423] (rev 01)
18:03.0 PCI bridge [0604]: ASMedia Technology Inc. ASM4242 PCIe Switch Downstream Port [1b21:2423] (rev 01)
7d:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM4242 USB 3.2 xHCI Controller [1b21:2426] (rev 01)
7e:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM4242 USB 4 / Thunderbolt 3 Host Router [1b21:2425] (rev 01)
7f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge PCIe Dummy Function [1022:14de] (rev cb)
7f:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP [1022:1649]
7f:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b6]
7f:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b7]
7f:00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Ryzen HD Audio Controller [1022:15e3]
80:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
`/proc/cmdline`:
loglevel=8 split_lock_detect=off preempt=full
Seat assignments (`/etc/udev/rules.d`):
File: /etc/udev/rules.d/72-seat-drm-pci-0000_08_00_0.rules
TAG=="seat", ENV{ID_FOR_SEAT}=="drm-pci-0000_08_00_0", ENV{ID_SEAT}="seat-hoodie"
File: /etc/udev/rules.d/72-seat-graphics-pci-0000_08_00_0.rules
TAG=="seat", ENV{ID_FOR_SEAT}=="graphics-pci-0000_08_00_0", ENV{ID_SEAT}="seat-hoodie"
File: /etc/udev/rules.d/72-seat-i2c-dev-pci-0000_08_00_0.rules
TAG=="seat", ENV{ID_FOR_SEAT}=="i2c-dev-pci-0000_08_00_0", ENV{ID_SEAT}="seat-hoodie"
File: /etc/udev/rules.d/72-seat-sound-pci-0000_09_00_0.rules
TAG=="seat", ENV{ID_FOR_SEAT}=="sound-pci-0000_09_00_0", ENV{ID_SEAT}="seat-hoodie"
File: /etc/udev/rules.d/72-seat-usb-pci-0000_13_00_0-usb-0_5.rules
TAG=="seat", ENV{ID_FOR_SEAT}=="usb-pci-0000_13_00_0-usb-0_5", ENV{ID_SEAT}="seat-hoodie"
All my packages are up to date; this happens on both the stable kernel (7.0.9-arch1-1) and LTS (6.18.32-1).
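The `ID_FOR_SEAT` values in the seat rules above can be mapped back to PCI addresses to cross-check them against `lspci`. This is a sketch only: the underscore encoding is inferred purely from the five rules shown here, and `SEAT_ID_RE` and `seat_id_to_pci` are hypothetical names.

```python
import re

# Inferred shape: <kind>-pci-<domain>_<bus>_<device>_<function>[-<rest>]
SEAT_ID_RE = re.compile(
    r"^(?P<kind>[a-z0-9-]+?)-pci-"
    r"(?P<domain>[0-9a-f]{4})_(?P<bus>[0-9a-f]{2})_"
    r"(?P<dev>[0-9a-f]{2})_(?P<fn>[0-7])"
    r"(?:-(?P<rest>.+))?$"
)


def seat_id_to_pci(id_for_seat: str):
    """Return (kind, 'dddd:bb:dd.f') for an ID_FOR_SEAT string."""
    match = SEAT_ID_RE.match(id_for_seat)
    if match is None:
        raise ValueError(f"unrecognised ID_FOR_SEAT: {id_for_seat!r}")
    address = "{domain}:{bus}:{dev}.{fn}".format(**match.groupdict())
    return match.group("kind"), address


for seat_id in (
    "drm-pci-0000_08_00_0",
    "sound-pci-0000_09_00_0",
    "usb-pci-0000_13_00_0-usb-0_5",
):
    print(seat_id, "->", seat_id_to_pci(seat_id))
```

For the rules in this post that gives `0000:08:00.0` (the second B580), `0000:09:00.0` (its audio function) and `0000:13:00.0` (a chipset USB controller), matching the `lspci -nn` listing.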
Things I've tried:
- Setting `xe.enable_psr=0`
- Limiting PCIe slot speed in UEFI settings
- Updating UEFI to latest version
- Updating GPU firmware to latest version using fwupd
- seat-hoodie has two displays, 60 Hz and 75 Hz. I've tried setting both to 60 Hz.
- Installing `mesa-git`
Things I've not yet tried:
- Setting `iommu=soft` or similar
- Physically or logically swapping the two GPUs between the seats (I have no reason to expect this would help)
- Running KDE Plasma from the KDE-Unstable repository
I'm fairly certain it's not a problem with the GPUs themselves, as they're brand new (I got them 4 days ago or so)
I'm also fairly certain it's not a problem with the PCIe slots on the motherboard, as the slot assignment for the seats was previously swapped, and again only seat-hoodie had issues.
That was with a Radeon RX 6650 XT, which also constantly printed stack traces about DRM timeouts in dmesg, though it never experienced full hangs.
Unfortunately, it's hard to narrow down whether the issue is with the `xe` driver itself, or with running Plasma Wayland on a secondary seat (it works fine on seat0).
Plasma X11 doesn't seem to start correctly: it flickers or gets stuck on the Plasma startup logo and never reaches the desktop. GNOME and GNOME Classic don't work at all on a secondary seat, because the code that would allow it has been unmerged for several years now.
The 'job timed out' stack traces, while annoying, don't seem to generally cause any issues. No flickering, no broken graphics, no game crashes. This makes it even harder to determine the root cause.
Very rarely, the seat will freeze for a few seconds if the GPU gets stuck and the driver issues a reset, but other than that there's no degraded functionality. That is, until it suddenly devolves into the reset loop I described above, which can happen during games, during simple desktop use (just a browser open), or even while the seat is on the lock screen/idle.
Unfortunately, searching online for various pieces and keywords from the error messages did not yield much of anything, so I'm out of ideas for how to further narrow the issue down.
Last edited by LunarLambda (2026-05-25 20:49:13)
with two ASRock Intel Arc
Does it help to assign *one* to each compositor?
https://wiki.archlinux.org/title/KDE#Me … -specific)
https://wiki.archlinux.org/title/Waylan … compositor
If not:
Using lightdm (nothing else seems to currently support multi-seat with Wayland sessions without crashing).
Which card does lightdm run on (check/post your xorg log)
Do you get any of this w/
- lightdm and only one seat/session (sway)
- No DM but two sessions started from the console each
- lightdm and weston instead of sway / two X11 sessions (WM/DE should™ be irrelevant here)
?
- As far as I can tell using `gputop`, each compositor is using the correct GPU, no issues there:
~ ❯ lsgpu
card0 Intel Battlemage (Gen20) drm:/dev/dri/card0
└─renderD128 drm:/dev/dri/renderD128
card1 Intel Battlemage (Gen20) drm:/dev/dri/card1
└─renderD129 drm:/dev/dri/renderD129
DRM minor 0 Frequency(MHz) GT0-1200/1200 GT1-1200/1200
PID MEM RSS rcs vcs vecs bcs ccs NAME
1 0B 0B | 0.0% || 0.0% || 0.0% || 0.0% || 0.0% | systemd
2319 32M 32M | 0.0% || 0.0% || 0.0% || 0.0% || 0.0% | sway
3173 0B 0B | 0.0% || 0.0% || 0.0% || 0.0% || 0.0% | Xwayland
DRM minor 1 Frequency(MHz) GT0-1200/1200 GT1-1200/1200
PID MEM RSS rcs vcs vecs bcs ccs NAME
1 0B 0B | 0.0% || 0.0% || 0.0% || 0.0% || 0.0% | systemd
13591 354M 354M | 8.0% ▎ || 0.0% || 0.0% || 0.0% || 0.0% | kwin_wayland
13800 0B 0B | 0.0% || 0.0% || 0.0% || 0.0% || 0.0% | Xwayland
- The GPUs also have a fixed assignment to lightdm using per-seat Xorg.conf files. The assignment is based on PCIe bus number. (lightdm starts one xorg/greeter per seat; they're separate servers)
[LightDM]
logind-check-graphical=true
run-directory=/run/lightdm
[Seat:*]
greeter-session=lightdm-slick-greeter
session-wrapper=/etc/lightdm/Xsession
[Seat:seat0]
xserver-config=/etc/X11/seat0.conf
[Seat:seat-hoodie]
xserver-config=/etc/X11/seat-hoodie.conf
[XDMCPServer]
[VNCServer]
File: /etc/X11/seat0.conf
Section "Device"
Identifier "CardIntel"
Driver "modesetting"
BusID "PCI:3:0:0"
EndSection
File: /etc/X11/seat-hoodie.conf
Section "Device"
Identifier "CardIntel"
Driver "modesetting"
BusID "PCI:8:0:0"
EndSection
- LightDM with only seat0 works fine (seat0 never has major issues in any configuration I've tested, which is Sway, Weston, KDE, GNOME)
- Since only seat0 has VT access, and seats are managed by logind, starting two Wayland seats from console is tricky / "not generally supported". I think sway might support it but I'd have to check
- X11 sessions are very flaky on this setup. Maybe that's an argument for trying to disable Xwayland as a test?
- Have not tried double-weston / double-sway configuration.
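A note on the per-seat Xorg configs above: as far as I know, `lspci` prints bus/device/function in hexadecimal while the Xorg `BusID` string expects decimal, so the two only look identical for small numbers like `03` and `08`. A small sketch of the conversion, where `xorg_bus_id` is a made-up helper and the PCI domain is simply ignored (an assumption that holds for a single-domain system like this one):

```python
def xorg_bus_id(pci_address: str) -> str:
    """Convert an lspci-style address ('08:00.0' or '0000:08:00.0')
    into an Xorg BusID string such as 'PCI:8:0:0'.

    The PCI domain, if present, is dropped (assumption: domain 0).
    """
    parts = pci_address.split(":")
    bus, dev_fn = parts[-2], parts[-1]
    device, function = dev_fn.split(".")
    return f"PCI:{int(bus, 16)}:{int(device, 16)}:{int(function, 16)}"


print(xorg_bus_id("03:00.0"))   # seat0 GPU
print(xorg_bus_id("08:00.0"))   # seat-hoodie GPU
print(xorg_bus_id("0a:00.0"))   # a bus number where hex and decimal differ
```

This matters if a card ever moves to a slot behind bus `0a` or higher, where copying the `lspci` digits verbatim would point Xorg at the wrong device.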
Since rebooting with `xe.enable_psr=0` I've had 3-4 hours of uptime without any kernel messages so far, so I'll wait and see if maybe that was the fix. If not, I'll test with sway on both seats.
EDIT: seat-hoodie also has a KDE startup script setting `DRI_PRIME=1!`. I can also try setting the KDE DRM devices that you linked, but again, I don't think the issue lies there, per gputop.
Last edited by LunarLambda (2026-05-21 20:57:35)
After some more testing today:
1) `mesa-git` and/or `xe.enable_psr=0` has removed all IO_PAGE_FAULT messages and *nearly* all timeouts logged as coming from kwin_wayland
2) Unfortunately, closing Steam games (using Proton/DXVK) still causes GPU crash loops / system lockups as initially described, this time including on seat0 (I started playing a new game today).
3) The most recent crash was logged as coming from iommu.c _domain_flush_pages, which led me to try `iommu.strict=1`. However, doing so actually caused the system to crash almost immediately upon launching a game, with the following:
May 22 18:45:36 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:36 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:40 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0000:03:00.0 address=0x1002f0cb0]
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0000:03:00.0 address=0x1002f0ce0]
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=18211 recv=18210
May 22 18:45:44 ancyor kernel: xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=18212 recv=18210
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: AMD-Vi: IOMMU 0000:00:00.2: Completion-Wait loop timed out
May 22 18:45:44 ancyor kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0000:03:00.0 address=0x1002f0d50]
It seems, for whatever reason, that the system is failing to properly invalidate IO TLB mappings, which then appears to result in those timeout/reset loops.
This would align with the crashes occurring upon *closing* games, or shortly thereafter: tearing down the Vulkan context and deallocating GPU resources seems to hit the problematic code path more easily.
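To check which GPU the IOMMU events point at, the `AMD-Vi: Event logged [...]` lines can be counted per event type and device. A rough sketch, assuming only the two line shapes that appear in this thread (`IO_PAGE_FAULT` with the device before `AMD-Vi:`, and `IOTLB_INV_TIMEOUT` with a `device=` field); all names here are hypothetical.

```python
import re
from collections import Counter

EVENT_RE = re.compile(
    r"AMD-Vi: Event logged \[(?P<event>\w+) (?P<fields>[^\]]*)\]"
)
PCI_RE = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def parse_event(line: str):
    """Return (event, device, fields) or None if the line is not an event."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    fields = dict(
        item.split("=", 1) for item in match.group("fields").split() if "=" in item
    )
    device = fields.get("device")
    if device is None:
        # IO_PAGE_FAULT lines carry the device before 'AMD-Vi:' instead.
        pci = PCI_RE.search(line[: match.start()])
        device = pci.group(0) if pci else "unknown"
    return match.group("event"), device, fields


def summarize_iommu_events(lines):
    """Count AMD-Vi events per (event type, PCI device)."""
    counts = Counter()
    for line in lines:
        parsed = parse_event(line)
        if parsed:
            counts[parsed[:2]] += 1
    return counts


sample = [
    "[ 6065.892513] xe 0000:08:00.0: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x0019 address=0xf9c3c000 flags=0x0020]",
    "May 22 18:45:44 ancyor kernel: iommu ivhd0: AMD-Vi: Event logged "
    "[IOTLB_INV_TIMEOUT device=0000:03:00.0 address=0x1002f0cb0]",
]
for key, n in summarize_iommu_events(sample).items():
    print(key, n)
```

The `Completion-Wait loop timed out` lines are deliberately skipped, since they name the IOMMU itself rather than a device.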
No problem w/ "iommu=soft"?
Does "amd_iommu=fullflush" help?
Didn't see your reply. I've been running with `amd_iommu=off` since yesterday (since `iommu.strict` made it worse, I figured disabling it entirely was the next logical step) and that seems to have improved things. I did have one or two more graphical session lockups (on seat0 this time, without kernel messages), which I suspect may have been X11-related (Xwayland in games/LightDM greeter).
Today I switched to the `atrium` display manager, which is purely Wayland-based, and am now on 8 hours of uptime with zero crashes, zero log messages, zero timeouts.
Last edited by LunarLambda (2026-05-23 19:15:18)