I have a series of annoying events that I'm almost sure are all related to the GPU. These errors happen almost randomly; there's nothing I can do to force them to happen. It's hard to reproduce them, but I get at least one a day. As you'll see, I'm guessing almost blindly and I don't even know if the output I'm providing here is useful, so any help is appreciated.
1. The screen suddenly fades to almost black (the white windows are barely visible). If I switch to a tty, it's still dark. Using the fn+4 combination on the ThinkPad Z16 puts it to sleep, and if I then press the power button, it wakes with the regular brightness back. I have several journalctl captures from when this happens:
Happened around 20:55
Happened around 22:00
Around 8:50
21:59
22:50
All of the above I got by running
journalctl -b-0
I always get the following if I grep for errors in Xorg.0.log:
grep EE .local/share/xorg/Xorg.0.log
[ 8.399] Current Operating System: Linux soti 6.5.5-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 23 Sep 2023 22:55:13 +0000 x86_64
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[ 8.419] (EE) Failed to load module "ati" (module does not exist, 0)
[ 8.422] (EE) open /dev/fb0: Permission denied
[ 8.590] (II) Initializing extension MIT-SCREEN-SAVER
[ 8.856] (II) XINPUT: Adding extended input device "Wacom HID 52F4 Finger" (type: TOUCHSCREEN, id 15)
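Side note on the output above: a plain `grep EE` matches any line that merely contains the letters "EE" (PREEMPT_DYNAMIC, MIT-SCREEN-SAVER, TOUCHSCREEN) as well as the legend line, which is why non-error lines show up. A minimal Python sketch that keeps only real error lines, assuming the usual `[ timestamp] (EE) message` layout. The name `xorg_errors` is made up for illustration:

```python
import re

# Assumed layout of a real error line: "[     8.419] (EE) message".
# The legend line and lines that only contain "EE" inside a word do not match.
EE_LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+\(EE\)\s")


def xorg_errors(lines):
    """Return only the lines whose marker field is (EE)."""
    return [line.rstrip("\n") for line in lines if EE_LINE.match(line)]


# Four lines copied from the grep output above.
sample = [
    "(WW) warning, (EE) error, (NI) not implemented, (??) unknown.",
    '[ 8.419] (EE) Failed to load module "ati" (module does not exist, 0)',
    "[ 8.422] (EE) open /dev/fb0: Permission denied",
    "[ 8.590] (II) Initializing extension MIT-SCREEN-SAVER",
]

for line in xorg_errors(sample):
    print(line)
```

To run it on the real log, pass an open file object for .local/share/xorg/Xorg.0.log instead of `sample`.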
2. My cursor stutters. This is an example of xorg.log when the cursor lags, which I got by running
cat .local/share/xorg/Xorg.0.log
and shows things like
[ 84.171] (EE) event18 - Logi POP Mouse: client bug: event processing lagging behind by 34ms, your system is too slow
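To see how often and how badly the cursor lags, the libinput "lagging behind" messages can be pulled out of the log and summarized. A rough Python sketch, assuming every such message has the same wording as the line above. The names `LAG` and `lag_events` are hypothetical:

```python
import re

# Pattern modelled on the single log line quoted above; other libinput
# versions may word this message differently (assumption).
LAG = re.compile(
    r"\(EE\) (?P<node>event\d+)\s+- (?P<device>.+?): client bug: "
    r"event processing lagging behind by (?P<ms>\d+)ms"
)


def lag_events(lines):
    """Return a list of (device name, delay in ms) for each lag warning."""
    found = []
    for line in lines:
        match = LAG.search(line)
        if match:
            found.append((match.group("device"), int(match.group("ms"))))
    return found


sample_lag = [
    "[ 84.171] (EE) event18 - Logi POP Mouse: client bug: event processing "
    "lagging behind by 34ms, your system is too slow",
]

for device, delay_ms in lag_events(sample_lag):
    print(f"{device}: {delay_ms} ms behind")
```

Comparing the timestamps of these entries with the times of the screen fades might show whether the two symptoms are related.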
3. Videos sometimes lag (don't know how to diagnose this one)
My system:
ThinkPad Z16 Gen
Kernel 6.5.5-arch1-1
Openbox
CPU AMD Ryzen 9 PRO 6950H with Radeon Graphics (16) @ 4.935GHz
GPU AMD ATI Radeon RX 6400/6500 XT/6500M
GPU AMD ATI Radeon 680M
Packages installed that might be related:
extra/libxxf86vm 1.1.5-1 (15.5 KiB 35.2 KiB) (Installed)
extra/xf86-input-synaptics 1.9.2-1 (57.0 KiB 137.8 KiB) [xorg-drivers] (Installed)
extra/xf86-input-libinput 1.4.0-1 (41.5 KiB 105.6 KiB) [xorg-drivers] (Installed)
extra/xf86-video-fbdev 0.5.0-3 (13.4 KiB 29.3 KiB) [xorg-drivers] (Installed)
extra/xf86-video-vesa 2.6.0-1 (16.4 KiB 33.4 KiB) [xorg-drivers xorg] (Installed)
aur/mesa-amdonly-gaming-git 23.2.0_devel.173114.7cf7ea25002.d41d8cd98f00b204e9800998ecf8427e-1 (+6 1.13) (Installed: 23.3.0_devel.177925.edd3cd67c2f.d41d8cd98f00b204e9800998ecf8427e-1)
Last edited by Trvzn (2024-03-20 16:43:45)
Remove the xf86-video-* packages and use the official Mesa drivers, then see if that works. There is also this very suspicious error:
Sep 27 22:00:00 soti kernel: [Hardware Error]: Error Addr: 0x00000001fea01a00
Sep 27 22:00:00 soti kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x000001ff0a240700
Sep 27 22:00:00 soti kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 12
Sep 27 22:00:00 soti kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep 27 22:00:00 soti kernel: [Hardware Error]: Corrected error, no action required.
Sep 27 22:00:00 soti kernel: [Hardware Error]: CPU:0 (19:44:1) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c011b
which happens around the times you say the problem occurs. The address and IPID are not consistent, but the syndrome is, and the pattern of the 6th digit from the right going up by one per error shows up in all but one of the journals. I think the one exception is only because that journal wasn't long enough to catch it; it ended too soon after the issue. There are also amdgpu messages, but they don't seem to show errors.
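Comparing those fields by eye is error-prone, so here is a small Python sketch that extracts the IPID and Syndrome values and prints the 6th hex digit from the right. It assumes the line format of the journal excerpts in this thread. The sample lines are the four entries logged at Sep 27 18:26:47 (quoted further down the thread), and the function names are made up for illustration:

```python
import re

# Matches the "IPID: 0x..., Syndrome: 0x..." part of an MCE journal line.
IPID_SYNDROME = re.compile(
    r"IPID: 0x([0-9a-fA-F]+), Syndrome: 0x([0-9a-fA-F]+)"
)


def parse_ipid_syndrome(lines):
    """Return a list of (ipid, syndrome) hex strings without the 0x prefix."""
    pairs = []
    for line in lines:
        match = IPID_SYNDROME.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def hex_digit_from_right(value, position):
    """Return the hex digit at the given 1-based position from the right."""
    return value[-position]


sample_mce = [
    "Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000001780a240001",
    "Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000001680a258001",
    "Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x0000010b0a240381",
    "Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600350f00, Syndrome: 0x000001780a258001",
]

for ipid, syndrome in parse_ipid_syndrome(sample_mce):
    print(ipid, syndrome, "6th digit of IPID:", hex_digit_from_right(ipid, 6))
```

In these four sample lines it is the IPID whose 6th digit steps up by one from entry to entry. Feeding in the other journals would show whether the same holds there.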
There is also this thread, which indicates that your CPU may just not support Linux at all: https://www.reddit.com/r/thinkpad/comme … and_linux/
I tried installing xf86-video-amdgpu, but the problems continued. Then I tried your solution
Remove the xf86-video-* packages and use the official Mesa drivers
but I just experienced the black screen again.
There is also this thread, which indicates that your CPU may just not support Linux at all
Well, the thread ends with many people talking about how they run Linux, including Arch, successfully on it, and it's on Lenovo's certified list for Fedora or Ubuntu, so there's hope. I'll continue to look for a solution and report it here. I hope I don't have a hardware problem or, worse, have to switch to another distro.
Last edited by Trvzn (2023-10-02 17:22:58)
Found this here.
I disabled the TPM in the BIOS and didn't experience problems for 3 days. Then the screen faded again on the 4th (it might be a coincidence, but it was after I updated the system with pacman -Syu).
What really intrigues me is how hard it is to find a pattern to the problem to replicate it.
Last edited by Trvzn (2023-10-06 17:27:10)
I don't experience this error when I use an external monitor, only the laptop screen. Today, it happened again and the output of dmesg -l warn is pasted below.
[ 0.145626] Speculative Return Stack Overflow: IBPB-extending microcode not applied!
[ 0.145627] Speculative Return Stack Overflow: WARNING: See https://kernel.org/doc/html/latest/admin-guide/hw-vuln/srso.html for mitigation options.
[ 0.280596] TSC synchronization [CPU#0 -> CPU#2]:
[ 0.280600] Measured 4689105531 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.282044] #1 #3 #5 #7 #9 #11 #13 #15
[ 0.471502] pnp 00:05: disabling [mem 0xffffffe0-0xffffffff] because it overlaps 0000:03:00.0 BAR 6 [mem 0xfffe0000-0xffffffff pref]
[ 0.488292] pci 0000:00:00.2: can't derive routing for PCI INT A
[ 0.488294] pci 0000:00:00.2: PCI INT A: not connected
[ 0.512785] ACPI: thermal: [Firmware Bug]: No valid trip found
[ 0.591779] blacklist: Duplicate blacklisted hash bin:80b4d96931bf0d02fd91a61e19d14f1da452e66db2408ca8604d411f92659f0a
[ 0.591781] blacklist: Duplicate blacklisted hash bin:f52f83a3fa9cfbd6920f722824dbe4034534d25b8507246b3b957dac6e1bce7a
[ 0.591783] blacklist: Duplicate blacklisted hash bin:c5d9d8a186e2c82d09afaa2a6f7f2e73870d3e64f72c4e08ef67796a840f0fbd
[ 0.591784] blacklist: Duplicate blacklisted hash bin:1aec84b84b6c65a51220a9be7181965230210d62d6d33c48999c6b295a2b0a06
[ 0.591786] blacklist: Duplicate blacklisted hash bin:c3a99a460da464a057c3586d83cef5f4ae08b7103979ed8932742df0ed530c66
[ 0.591787] blacklist: Duplicate blacklisted hash bin:58fb941aef95a25943b3fb5f2510a0df3fe44c58c95e0ab80487297568ab9771
[ 0.591789] blacklist: Duplicate blacklisted hash bin:5391c3a2fb112102a6aa1edc25ae77e19f5d6f09cd09eeb2509922bfcd5992ea
[ 0.591791] blacklist: Duplicate blacklisted hash bin:d626157e1d6a718bc124ab8da27cbb65072ca03a7b6b257dbdcbbd60f65ef3d1
[ 0.591792] blacklist: Duplicate blacklisted hash bin:d063ec28f67eba53f1642dbf7dff33c6a32add869f6013fe162e2c32f1cbe56d
[ 0.591793] blacklist: Duplicate blacklisted hash bin:29c6eb52b43c3aa18b2cd8ed6ea8607cef3cfae1bafe1165755cf2e614844a44
[ 0.591795] blacklist: Duplicate blacklisted hash bin:90fbe70e69d633408d3e170c6832dbb2d209e0272527dfb63d49d29572a6f44c
[ 0.591797] blacklist: Duplicate blacklisted hash bin:106faceacfecfd4e303b74f480a08098e2d0802b936f8ec774ce21f31686689c
[ 0.591798] blacklist: Duplicate blacklisted hash bin:174e3a0b5b43c6a607bbd3404f05341e3dcf396267ce94f8b50e2e23a9da920c
[ 0.591800] blacklist: Duplicate blacklisted hash bin:2b99cf26422e92fe365fbf4bc30d27086c9ee14b7a6fff44fb2f6b9001699939
[ 0.591802] blacklist: Duplicate blacklisted hash bin:2e70916786a6f773511fa7181fab0f1d70b557c6322ea923b2a8d3b92b51af7d
[ 0.591803] blacklist: Duplicate blacklisted hash bin:3fce9b9fdf3ef09d5452b0f95ee481c2b7f06d743a737971558e70136ace3e73
[ 0.591805] blacklist: Duplicate blacklisted hash bin:47cc086127e2069a86e03a6bef2cd410f8c55a6d6bdb362168c31b2ce32a5adf
[ 0.591806] blacklist: Duplicate blacklisted hash bin:71f2906fd222497e54a34662ab2497fcc81020770ff51368e9e3d9bfcbfd6375
[ 0.591808] blacklist: Duplicate blacklisted hash bin:82db3bceb4f60843ce9d97c3d187cd9b5941cd3de8100e586f2bda5637575f67
[ 0.591810] blacklist: Duplicate blacklisted hash bin:8ad64859f195b5f58dafaa940b6a6167acd67a886e8f469364177221c55945b9
[ 0.591812] blacklist: Duplicate blacklisted hash bin:8d8ea289cfe70a1c07ab7365cb28ee51edd33cf2506de888fbadd60ebf80481c
[ 0.591814] blacklist: Duplicate blacklisted hash bin:aeebae3151271273ed95aa2e671139ed31a98567303a332298f83709a9d55aa1
[ 0.591815] blacklist: Duplicate blacklisted hash bin:c409bdac4775add8db92aa22b5b718fb8c94a1462c1fe9a416b95d8a3388c2fc
[ 0.591817] blacklist: Duplicate blacklisted hash bin:c617c1a8b1ee2a811c28b5a81b4c83d7c98b5b0c27281d610207ebe692c2967f
[ 0.591819] blacklist: Duplicate blacklisted hash bin:c90f336617b8e7f983975413c997f10b73eb267fd8a10cb9e3bdbfc667abdb8b
[ 0.591820] blacklist: Duplicate blacklisted hash bin:64575bd912789a2e14ad56f6341f52af6bf80cf94400785975e9f04e2d64d745
[ 0.591822] blacklist: Duplicate blacklisted hash bin:45c7c8ae750acfbb48fc37527d6412dd644daed8913ccd8a24c94d856967df8e
[ 0.591997] blacklist: Duplicate blacklisted hash bin:47ff1b63b140b6fc04ed79131331e651da5b2e2f170f5daef4153dc2fbc532b1
[ 0.591998] blacklist: Duplicate blacklisted hash bin:5391c3a2fb112102a6aa1edc25ae77e19f5d6f09cd09eeb2509922bfcd5992ea
[ 0.592040] blacklist: Duplicate blacklisted hash bin:80b4d96931bf0d02fd91a61e19d14f1da452e66db2408ca8604d411f92659f0a
[ 0.592073] blacklist: Duplicate blacklisted hash bin:992d359aa7a5f789d268b94c11b9485a6b1ce64362b0edb4441ccc187c39647b
[ 0.592117] blacklist: Duplicate blacklisted hash bin:c452ab846073df5ace25cca64d6b7a09d906308a1a65eb5240e3c4ebcaa9cc0c
[ 0.592142] blacklist: Duplicate blacklisted hash bin:e051b788ecbaeda53046c70e6af6058f95222c046157b8c4c1b9c2cfc65f46e5
[ 0.600782] Unstable clock detected, switching default tracing clock to "global"
If you want to keep using the local clock, then add:
"trace_clock=local"
on the kernel command line
[ 0.885337] usb: port power management may be unreliable
[ 7.419658] FAT-fs (nvme0n1p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[ 7.539637] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[ 8.942141] ath11k_pci 0000:04:00.0: Failed to set the requested Country regulatory setting
[ 8.942306] ath11k_pci 0000:04:00.0: Failed to set the requested Country regulatory setting
[ 9.303925] Bluetooth: hci0: HCI Enhanced Setup Synchronous Connection command is advertised, but not supported.
[ 9.797774] Bluetooth: hci0: Bad flag given (0x1) vs supported (0x0)
[ 9.881170] Xorg[1319]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 10.228298] warning: `kdeconnectd' uses wireless extensions which will stop working for Wi-Fi 7 hardware; use nl80211
[ 2170.995220] ------------[ cut here ]------------
[ 2170.995226] WARNING: CPU: 12 PID: 9033 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_psr.c:225 dmub_psr_enable+0xf7/0x100 [amdgpu]
[ 2170.995677] Modules linked in: uhid ccm michael_mic ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_cgroup xt_mark snd_seq_dummy snd_hrtimer snd_seq snd_seq_device xt_owner rfcomm xt_tcpudp ip6table_raw iptable_raw ip6table_mangle iptable_mangle cmac algif_hash algif_skcipher af_alg ip6table_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c snd_ctl_led ip6table_filter ip6_tables iptable_filter bnep qrtr_mhi uvcvideo videobuf2_vmalloc uvc videobuf2_memops videobuf2_v4l2 btusb btrtl videodev btintel btbcm videobuf2_common btmtk mc bluetooth ecdh_generic intel_rapl_msr intel_rapl_common joydev snd_soc_acp6x_mach snd_acp6x_pdm_dma snd_soc_dmic snd_sof_amd_vangogh snd_sof_amd_rembrandt mousedev qrtr snd_sof_amd_renoir snd_sof_amd_acp ath11k_pci snd_sof_pci edac_mce_amd snd_sof_xtensa_dsp ath11k snd_sof kvm_amd qmi_helpers snd_sof_utils snd_hda_codec_realtek kvm snd_hda_codec_generic snd_soc_core snd_hda_codec_hdmi irqbypass mac80211 crct10dif_pclmul snd_compress crc32_pclmul
[ 2170.995783] ac97_bus polyval_clmulni snd_pcm_dmaengine snd_hda_intel polyval_generic snd_intel_dspcfg snd_pci_ps libarc4 gf128mul snd_intel_sdw_acpi wacom ghash_clmulni_intel snd_rpl_pci_acp6x snd_hda_scodec_cs35l41_spi usbhid hid_multitouch snd_hda_codec sha512_ssse3 snd_acp_pci cfg80211 sha256_ssse3 snd_acp_legacy_common snd_hda_core snd_hda_scodec_cs35l41_i2c sha1_ssse3 snd_hda_scodec_cs35l41 snd_pci_acp6x snd_hwdep aesni_intel snd_pci_acp5x thinkpad_acpi snd_hda_cs_dsp_ctls snd_pcm snd_rn_pci_acp3x ucsi_acpi cs_dsp ledtrig_audio crypto_simd snd_acp_config snd_timer typec_ucsi vfat rfkill think_lmi snd_soc_acpi sp5100_tco snd_soc_cs35l41_lib cryptd psmouse fat rapl pcspkr firmware_attributes_class typec wmi_bmof snd thunderbolt ccp snd_pci_acp3x mhi k10temp i2c_piix4 roles soundcore amd_pmf i2c_hid_acpi serial_multi_instantiate i2c_hid platform_profile amd_pmc acpi_tad mac_hid pkcs8_key_parser crypto_user fuse dm_mod loop nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 amdgpu i2c_algo_bit
[ 2170.995888] drm_ttm_helper ttm drm_exec serio_raw drm_suballoc_helper atkbd amdxcp sdhci_pci drm_buddy libps2 vivaldi_fmap gpu_sched cqhci nvme sdhci drm_display_helper nvme_core crc32c_intel xhci_pci mmc_core cec nvme_common i8042 xhci_pci_renesas video serio wmi
[ 2170.995922] CPU: 12 PID: 9033 Comm: kworker/12:2H Not tainted 6.6.7-arch1-1 #1 4505c4baa0b3d7c4037b0e8f5402626fa360717f
[ 2170.995925] Hardware name: LENOVO 21D4000KUS/21D4000KUS, BIOS N3GET47W (1.27 ) 12/08/2022
[ 2170.995927] Workqueue: events_highpri dm_irq_work_func [amdgpu]
[ 2170.996231] RIP: 0010:dmub_psr_enable+0xf7/0x100 [amdgpu]
[ 2170.996512] Code: c0 75 cf 81 fb e8 03 00 00 74 1f 48 8b 44 24 48 65 48 2b 04 25 28 00 00 00 75 13 48 83 c4 50 5b 5d 41 5c 41 5d e9 49 a1 2a d7 <0f> 0b eb dd e8 a0 16 29 d7 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 2170.996514] RSP: 0018:ffffc90007aebcf0 EFLAGS: 00010246
[ 2170.996517] RAX: 00000689c385ded8 RBX: 00000000000003e9 RCX: 000000000000000c
[ 2170.996518] RDX: 0000000000192c27 RSI: 00000000001921b3 RDI: 00000689c36cb2b1
[ 2170.996520] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc901409e4000
[ 2170.996521] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8881088fcbb0
[ 2170.996523] R13: 0000000000000000 R14: ffffc90007aebdce R15: 0000000000000000
[ 2170.996524] FS: 0000000000000000(0000) GS:ffff88883e900000(0000) knlGS:0000000000000000
[ 2170.996526] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2170.996528] CR2: 00000bf138c31000 CR3: 00000005cf220000 CR4: 0000000000f50ee0
[ 2170.996529] PKRU: 55555554
[ 2170.996531] Call Trace:
[ 2170.996536] <TASK>
[ 2170.996539] ? dmub_psr_enable+0xf7/0x100 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[ 2170.996819] ? __warn+0x81/0x130
[ 2170.996826] ? dmub_psr_enable+0xf7/0x100 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[ 2170.997141] ? report_bug+0x171/0x1a0
[ 2170.997148] ? handle_bug+0x3c/0x80
[ 2170.997152] ? exc_invalid_op+0x17/0x70
[ 2170.997154] ? asm_exc_invalid_op+0x1a/0x20
[ 2170.997161] ? dmub_psr_enable+0xf7/0x100 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[ 2170.997457] edp_set_psr_allow_active+0x27e/0x3b0 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[ 2170.997742] dp_handle_hpd_rx_irq+0x3f3/0x460 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[ 2170.998019] handle_hpd_rx_irq+0xcc/0x2d0 [amdgpu e701fa499ff52b3d93a2ef0e43fc9f785d64b1b0]
[ 2170.998303] process_one_work+0x174/0x340
[ 2170.998309] worker_thread+0x27b/0x3a0
[ 2170.998314] ? __pfx_worker_thread+0x10/0x10
[ 2170.998317] kthread+0xe8/0x120
[ 2170.998322] ? __pfx_kthread+0x10/0x10
[ 2170.998325] ret_from_fork+0x34/0x50
[ 2170.998330] ? __pfx_kthread+0x10/0x10
[ 2170.998334] ret_from_fork_asm+0x1b/0x30
[ 2170.998343] </TASK>
[ 2170.998345] ---[ end trace 0000000000000000 ]---
[48956.231071] ACPI: button: The lid device is not compliant to SW_LID.
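Since the module list and register dump bury the useful part of a kernel warning, here is a rough Python sketch that pulls just the source location and function out of the "WARNING:" line. It assumes the line format seen in this thread. The two sample lines are the warning above and the one in the Sep 27 journal, and the names `KERNEL_WARNING` and `warning_sites` are hypothetical:

```python
import posixpath
import re

# Assumed format: "WARNING: CPU: n PID: n at <file>:<line> <func>+0x../0x.. [module]"
KERNEL_WARNING = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def warning_sites(lines):
    """Return (source file, line number, function, module) per warning."""
    sites = []
    for line in lines:
        match = KERNEL_WARNING.search(line)
        if match:
            sites.append((
                posixpath.normpath(match.group("file")),  # collapse the "../"
                int(match.group("line")),
                match.group("func"),
                match.group("module"),
            ))
    return sites


sample_warnings = [
    "[ 2170.995226] WARNING: CPU: 12 PID: 9033 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_psr.c:225 dmub_psr_enable+0xf7/0x100 [amdgpu]",
    "Sep 27 18:23:42 soti kernel: WARNING: CPU: 12 PID: 225 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_psr.c:224 dmub_psr_enable+0x101/0x110 [amdgpu]",
]

for site in warning_sites(sample_warnings):
    print(site)
```

Both warnings, on kernels 6.6.7 and 6.5.5, resolve to the same function, dmub_psr_enable in amdgpu's dmub_psr.c.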
From your second journal:
Sep 27 18:23:42 soti kernel: ------------[ cut here ]------------
Sep 27 18:23:42 soti kernel: WARNING: CPU: 12 PID: 225 at drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_psr.c:224 dmub_psr_enable+0x101/0x110 [amdgpu]
Sep 27 18:23:42 soti kernel: Modules linked in: ccm michael_mic snd_seq_dummy snd_hrtimer snd_seq snd_seq_device rfcomm snd_ctl_led qrtr_mhi cmac algif_hash algif_skcipher af_alg bnep snd_soc_acp6x_mach snd_soc_dmic snd_acp6x_pdm_dma snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof intel_rapl_msr snd_sof_utils intel_rapl_common qrtr snd_soc_core edac_mce_amd ath11k_pci snd_compress ath11k ac97_bus kvm_amd snd_pcm_dmaengine snd_hda_codec_realtek qmi_helpers joydev snd_hda_scodec_cs35l41_spi snd_hda_codec_generic snd_hda_codec_hdmi kvm snd_pci_ps irqbypass mac80211 snd_hda_intel snd_rpl_pci_acp6x uvcvideo btusb crct10dif_pclmul snd_intel_dspcfg crc32_pclmul btrtl videobuf2_vmalloc polyval_clmulni snd_intel_sdw_acpi btbcm uvc polyval_generic libarc4 snd_acp_pci btintel gf128mul videobuf2_memops snd_hda_scodec_cs35l41_i2c btmtk ghash_clmulni_intel snd_hda_codec mousedev sha512_ssse3 snd_pci_acp6x snd_hda_scodec_cs35l41 videobuf2_v4l2 cfg80211 snd_pci_acp5x bluetooth aesni_intel videodev wacom
Sep 27 18:23:42 soti kernel: snd_hda_core snd_hda_cs_dsp_ctls snd_rn_pci_acp3x ucsi_acpi crypto_simd think_lmi snd_acp_config vfat sp5100_tco cs_dsp videobuf2_common snd_hwdep snd_soc_acpi cryptd typec_ucsi usbhid snd_pcm hid_multitouch fat mc rapl ecdh_generic psmouse pcspkr typec wmi_bmof firmware_attributes_class snd_soc_cs35l41_lib k10temp thunderbolt ccp snd_timer snd_pci_acp3x mhi i2c_piix4 roles i2c_hid_acpi serial_multi_instantiate i2c_hid amd_pmf amd_pmc acpi_tad mac_hid pkcs8_key_parser crypto_user fuse dm_mod loop ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 amdgpu i2c_algo_bit drm_ttm_helper serio_raw thinkpad_acpi ttm atkbd drm_suballoc_helper libps2 ledtrig_audio amdxcp vivaldi_fmap platform_profile sdhci_pci drm_buddy nvme snd cqhci gpu_sched sdhci soundcore crc32c_intel drm_display_helper nvme_core rfkill xhci_pci cec mmc_core i8042 xhci_pci_renesas nvme_common video serio wmi
Sep 27 18:23:42 soti kernel: CPU: 12 PID: 225 Comm: kworker/12:1H Not tainted 6.5.5-arch1-1 #1 d82a0f532dd8cfe67d5795c1738d9c01059a0c62
Sep 27 18:23:42 soti kernel: Hardware name: LENOVO 21D4000KUS/21D4000KUS, BIOS N3GET35W (1.16 ) 06/29/2022
Sep 27 18:23:42 soti kernel: Workqueue: events_highpri dm_irq_work_func [amdgpu]
Sep 27 18:23:42 soti kernel: RIP: 0010:dmub_psr_enable+0x101/0x110 [amdgpu]
Sep 27 18:23:42 soti kernel: Code: c0 75 c5 81 fb e8 03 00 00 74 1f 48 8b 44 24 48 65 48 2b 04 25 28 00 00 00 75 13 48 83 c4 50 5b 5d 41 5c 41 5d e9 9f 6e f2 dd <0f> 0b eb dd e8 d6 e5 f0 dd 66 0f 1f 44 00 00 90 90 90 90 90 90 90
Sep 27 18:23:42 soti kernel: RSP: 0018:ffffc08c0171bd00 EFLAGS: 00010246
Sep 27 18:23:42 soti kernel: RAX: 0000000000000000 RBX: 00000000000003e9 RCX: 000000000000e220
Sep 27 18:23:42 soti kernel: RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffc08c0171bc58
Sep 27 18:23:42 soti kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Sep 27 18:23:42 soti kernel: R10: 0000000000000001 R11: 0000000000000100 R12: ffff9a2c89d4e820
Sep 27 18:23:42 soti kernel: R13: 0000000000000000 R14: ffffc08c0171bdde R15: 0000000000000000
Sep 27 18:23:42 soti kernel: FS: 0000000000000000(0000) GS:ffff9a33be900000(0000) knlGS:0000000000000000
Sep 27 18:23:42 soti kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 18:23:42 soti kernel: CR2: 00007f0b30131000 CR3: 000000012ca3e000 CR4: 0000000000750ee0
Sep 27 18:23:42 soti kernel: PKRU: 55555554
Sep 27 18:23:42 soti kernel: Call Trace:
Sep 27 18:23:42 soti kernel: <TASK>
Sep 27 18:23:42 soti kernel: ? dmub_psr_enable+0x101/0x110 [amdgpu ee815d842cda2379c55304edd04f2f150ea9be03]
Sep 27 18:23:42 soti kernel: ? __warn+0x81/0x130
Sep 27 18:23:42 soti kernel: ? dmub_psr_enable+0x101/0x110 [amdgpu ee815d842cda2379c55304edd04f2f150ea9be03]
Sep 27 18:23:42 soti kernel: ? report_bug+0x171/0x1a0
Sep 27 18:23:42 soti kernel: ? handle_bug+0x3c/0x80
Sep 27 18:23:42 soti kernel: ? exc_invalid_op+0x17/0x70
Sep 27 18:23:42 soti kernel: ? asm_exc_invalid_op+0x1a/0x20
Sep 27 18:23:42 soti kernel: ? dmub_psr_enable+0x101/0x110 [amdgpu ee815d842cda2379c55304edd04f2f150ea9be03]
Sep 27 18:23:42 soti kernel: edp_set_psr_allow_active+0x27e/0x3b0 [amdgpu ee815d842cda2379c55304edd04f2f150ea9be03]
Sep 27 18:23:42 soti kernel: dp_handle_hpd_rx_irq+0x318/0x350 [amdgpu ee815d842cda2379c55304edd04f2f150ea9be03]
Sep 27 18:23:42 soti kernel: handle_hpd_rx_irq+0xcc/0x2d0 [amdgpu ee815d842cda2379c55304edd04f2f150ea9be03]
Sep 27 18:23:42 soti kernel: ? blk_queue_exit+0x12/0x50
Sep 27 18:23:42 soti kernel: process_one_work+0x1e1/0x3f0
Sep 27 18:23:42 soti kernel: worker_thread+0x51/0x390
Sep 27 18:23:42 soti kernel: ? __pfx_worker_thread+0x10/0x10
Sep 27 18:23:42 soti kernel: kthread+0xe8/0x120
Sep 27 18:23:42 soti kernel: ? __pfx_kthread+0x10/0x10
Sep 27 18:23:42 soti kernel: ret_from_fork+0x34/0x50
Sep 27 18:23:42 soti kernel: ? __pfx_kthread+0x10/0x10
Sep 27 18:23:42 soti kernel: ret_from_fork_asm+0x1b/0x30
Sep 27 18:23:42 soti kernel: </TASK>
Sep 27 18:23:42 soti kernel: ---[ end trace 0000000000000000 ]---
Sep 27 18:23:43 soti kernel: [drm] PCIE GART of 512M enabled (table at 0x00000080FEB00000).
Sep 27 18:23:43 soti kernel: [drm] PSP is resuming...
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
Sep 27 18:26:21 soti kernel: [drm] DMUB hardware initialized: version=0x02020017
Sep 27 18:26:21 soti kernel: [drm] kiq ring mec 2 pipe 1 q 0
Sep 27 18:26:21 soti kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Sep 27 18:26:21 soti kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
Sep 27 18:26:47 soti kernel: mce: [Hardware Error]: Machine check events logged
Sep 27 18:26:47 soti kernel: [Hardware Error]: Corrected error, no action required.
Sep 27 18:26:47 soti kernel: [Hardware Error]: CPU:0 (19:44:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c011b
Sep 27 18:26:47 soti kernel: [Hardware Error]: Error Addr: 0x0000000025ee0cc0
Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000001780a240001
Sep 27 18:26:47 soti kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 12
Sep 27 18:26:47 soti kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep 27 18:26:47 soti kernel: mce: [Hardware Error]: Machine check events logged
Sep 27 18:26:47 soti kernel: [Hardware Error]: Corrected error, no action required.
Sep 27 18:26:47 soti kernel: [Hardware Error]: CPU:0 (19:44:1) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c011b
Sep 27 18:26:47 soti kernel: [Hardware Error]: Error Addr: 0x0000000025ed1d80
Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000001680a258001
Sep 27 18:26:47 soti kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 12
Sep 27 18:26:47 soti kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep 27 18:26:47 soti kernel: [Hardware Error]: Corrected error, no action required.
Sep 27 18:26:47 soti kernel: [Hardware Error]: CPU:0 (19:44:1) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c011b
Sep 27 18:26:47 soti kernel: [Hardware Error]: Error Addr: 0x0000000025eee700
Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x0000010b0a240381
Sep 27 18:26:47 soti kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 12
Sep 27 18:26:47 soti kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep 27 18:26:47 soti kernel: [Hardware Error]: Corrected error, no action required.
Sep 27 18:26:47 soti kernel: [Hardware Error]: CPU:0 (19:44:1) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc204000000c011b
Sep 27 18:26:47 soti kernel: [Hardware Error]: Error Addr: 0x0000000025f53d00
Sep 27 18:26:47 soti kernel: [Hardware Error]: IPID: 0x0000009600350f00, Syndrome: 0x000001780a258001
Sep 27 18:26:47 soti kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 12
Sep 27 18:26:47 soti kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Sep 27 18:27:09 soti kernel: [drm] PCIE GART of 512M enabled (table at 0x00000080FEB00000).
Sep 27 18:27:09 soti kernel: [drm] PSP is resuming...
Sep 27 18:27:09 soti kernel: [drm] reserve 0xa00000 from 0x80fd000000 for PSP TMR
Sep 27 18:27:09 soti kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Sep 27 18:27:09 soti kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Sep 27 18:27:09 soti kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Sep 27 18:27:09 soti kernel: amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000d, smu fw if version = 0x0000000f, smu fw program = 0, version = 0x00492000 (73.32.0)
Sep 27 18:27:09 soti kernel: amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
Sep 27 18:27:09 soti kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
This happens a lot …
Sep 27 20:55:31 soti kernel: amdgpu 0000:67:00.0: amdgpu: GPU reset(1) succeeded!
Sep 27 20:55:31 soti kernel: [drm] Skip scheduling IBs!
Sep 27 20:55:31 soti kernel: [drm] Skip scheduling IBs!
Sep 27 20:55:31 soti kernel: [drm] Skip scheduling IBs!
Sep 27 20:55:31 soti systemd[1]: Created slice Slice /system/systemd-coredump.
Sep 27 20:55:31 soti systemd[1]: Started Process Core Dump (PID 18446/UID 0).
Sep 27 20:55:32 soti systemd-coredump[18457]: Process 667 (Xorg) of user 1000 dumped core.
Stack trace of thread 669:
#0 0x00007fec2d94483c n/a (libc.so.6 + 0x8e83c)
#1 0x00007fec2d8f4668 raise (libc.so.6 + 0x3e668)
#2 0x00007fec2d8dc4b8 abort (libc.so.6 + 0x264b8)
#3 0x000055fb09e145a0 OsAbort (Xorg + 0x1535a0)
#4 0x000055fb09e15a0b FatalError (Xorg + 0x154a0b)
#5 0x000055fb09e1c516 n/a (Xorg + 0x15b516)
#6 0x00007fec2d8f4710 n/a (libc.so.6 + 0x3e710)
#7 0x00007fec2d94483c n/a (libc.so.6 + 0x8e83c)
#8 0x00007fec2d8f4668 raise (libc.so.6 + 0x3e668)
#9 0x00007fec2d8dc4b8 abort (libc.so.6 + 0x264b8)
#10 0x00007fec2bc16657 n/a (radeonsi_dri.so + 0x816657)
#11 0x00007fec2bc19933 n/a (radeonsi_dri.so + 0x819933)
#12 0x00007fec2b4ba64c n/a (radeonsi_dri.so + 0xba64c)
#13 0x00007fec2b4d9b6c n/a (radeonsi_dri.so + 0xd9b6c)
#14 0x00007fec2d9429eb n/a (libc.so.6 + 0x8c9eb)
#15 0x00007fec2d9c671c n/a (libc.so.6 + 0x11071c)
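To make the core dump easier to read, the frames can be grouped by the library they belong to. A minimal Python sketch, assuming the systemd-coredump frame format shown above. The names `FRAME` and `module_runs` are made up for illustration:

```python
import re
from itertools import groupby

# Assumed frame format: "#10 0x00007fec2bc16657 n/a (radeonsi_dri.so + 0x816657)"
FRAME = re.compile(
    r"#(\d+)\s+0x[0-9a-f]+\s+(\S+)\s+\((\S+) \+ 0x[0-9a-f]+\)"
)

TRACE = """\
#0 0x00007fec2d94483c n/a (libc.so.6 + 0x8e83c)
#1 0x00007fec2d8f4668 raise (libc.so.6 + 0x3e668)
#2 0x00007fec2d8dc4b8 abort (libc.so.6 + 0x264b8)
#3 0x000055fb09e145a0 OsAbort (Xorg + 0x1535a0)
#4 0x000055fb09e15a0b FatalError (Xorg + 0x154a0b)
#5 0x000055fb09e1c516 n/a (Xorg + 0x15b516)
#6 0x00007fec2d8f4710 n/a (libc.so.6 + 0x3e710)
#7 0x00007fec2d94483c n/a (libc.so.6 + 0x8e83c)
#8 0x00007fec2d8f4668 raise (libc.so.6 + 0x3e668)
#9 0x00007fec2d8dc4b8 abort (libc.so.6 + 0x264b8)
#10 0x00007fec2bc16657 n/a (radeonsi_dri.so + 0x816657)
#11 0x00007fec2bc19933 n/a (radeonsi_dri.so + 0x819933)
#12 0x00007fec2b4ba64c n/a (radeonsi_dri.so + 0xba64c)
#13 0x00007fec2b4d9b6c n/a (radeonsi_dri.so + 0xd9b6c)
#14 0x00007fec2d9429eb n/a (libc.so.6 + 0x8c9eb)
#15 0x00007fec2d9c671c n/a (libc.so.6 + 0x11071c)
"""


def module_runs(trace):
    """Collapse consecutive frames from the same module into (module, count)."""
    modules = [m.group(3) for m in map(FRAME.search, trace.splitlines()) if m]
    return [(module, len(list(group))) for module, group in groupby(modules)]


for module, count in module_runs(TRACE):
    print(f"{module}: {count} frame(s)")
```

Frames are listed innermost first, so reading the groups from the bottom up, abort is first reached from the radeonsi_dri.so frames (#10 to #13), and only afterwards does Xorg's FatalError/OsAbort path run. That reading fits a failure inside the Mesa driver right after the GPU reset, though the missing symbols (n/a) leave this unconfirmed.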
More confirmation:
https://askubuntu.com/questions/1490791 … ous-reason
https://gitlab.freedesktop.org/drm/amd/-/issues/2872
https://gitlab.freedesktop.org/drm/amd/-/issues/2645
Generic advice:
https://wiki.archlinux.org/title/Ryzen#Troubleshooting
https://wiki.archlinux.org/title/Microcode
More confirmation:
...
https://gitlab.freedesktop.org/drm/amd/-/issues/2645
This one suggests it's still open but there are some workarounds (haarp and rorymichele describe the exact same problem I have). As it happens randomly and rarely, it might take time to check each solution. Thanks, seth!
I know this is a 4-month-old problem, but I want to update the thread in case someone comes across it in the future. It's now been 3 months and I no longer experience this issue. It's hard to pinpoint exactly what solved it, given how inconsistently it happened, but I followed all the suggestions above. Marking as solved and thanking this amazing community one more time.