Hello everyone!
Problem:
As the title says, my system freezes and becomes completely unresponsive. This happens no matter what I'm doing, and I am unable to switch to another TTY to kill the problematic process with htop. If I was listening to music or a stream, the audio keeps looping in the background while the screen goes black, as if there were no video input anymore. This problem is several months old and I'm getting quite desperate.
Journal log:
2022-02-13 https://go.0xfc.de/u8asnb
2022-02-14 https://go.0xfc.de/hu8a7y
What I tried to do but was ineffective:
Checked the RAM integrity with memtest86, 4 cycles: passed without errors.
Tried normal, lts and zen kernel: same problem.
Deactivated CoreCtrl (utility I use to undervolt/overclock the gpu): same problem.
Deactivated KDE widgets one by one: same problem.
Tested Firefox without hardware acceleration, with a new profile: same problem.
Tested several suggested kernel parameters:
nvme_core.default_ps_max_latency_us=0
nvme_core.default_ps_max_latency_us=5500
acpi_osi=Linux
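To confirm that one of these parameters actually reached the running kernel, the command line can be read back from /proc/cmdline. A minimal sketch: has_param is a hypothetical helper name and the sample command line is made up for illustration.

```shell
# has_param CMDLINE PARAM
# Succeeds if PARAM appears as a whole word in CMDLINE, either bare
# ("quiet"), as a name ("acpi_osi") or as name=value ("acpi_osi=Linux").
has_param() {
    case " $1 " in
        *" $2 "*|*" $2="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Illustrative command line; on a live system use: has_param "$(cat /proc/cmdline)" ...
sample='rw quiet acpi_osi=Linux nvme_core.default_ps_max_latency_us=0'

if has_param "$sample" nvme_core.default_ps_max_latency_us; then
    echo "parameter is set"
else
    echo "parameter is missing"
fi
# prints "parameter is set"
```

Matching on whole words avoids false positives from parameters that merely share a suffix with the one being checked.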
My system:
[OS] Arch Linux, Kernel: 5.16.8-zen1-1-zen
[DE] KDE, Display Server: x11
[CPU] AMD Ryzen 5 2600X
[MB] MSI B450M MORTAR MAX (MS-7B89), BIOS 2.80
[Memory] RAM: 16 GB, Swap: 4 GB
[Graphics] AMD Radeon RX 570
[nvme] KINGSTON SA2000M8250G, firmware S5Z42109 (latest)
[ziltoid@arch ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 232.9G 0 disk
└─sda1 8:1 0 232.9G 0 part /mnt/games-ssd
sdb 8:16 0 931.5G 0 disk
├─sdb2 8:18 0 400G 0 part /mnt/WinData
├─sdb3 8:19 0 200G 0 part
└─sdb4 8:20 0 331.5G 0 part /mnt/Games
nvme0n1 259:0 0 232.9G 0 disk
├─nvme0n1p1 259:1 0 1G 0 part /boot/efi
├─nvme0n1p2 259:2 0 50G 0 part /
├─nvme0n1p3 259:3 0 178G 0 part /home
└─nvme0n1p4 259:4 0 3.9G 0 part [SWAP]
If I have forgotten any useful info, tell me what's missing and I'll add it as soon as I can. Thanks in advance to all who can help.
Edit 14 Feb 22: Added this morning's crash log.
Edit 26 Feb 22: Updated situation.
Last edited by ziltoid (2022-04-05 18:37:10)
Offline
I have amd-ucode installed.
sudo journalctl -k --grep=microcode
feb 13 19:24:06 arch kernel: microcode: CPU0: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU1: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU2: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU3: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU4: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU5: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU6: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU7: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU8: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU9: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU10: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: CPU11: patch_level=0x0800820d
feb 13 19:24:06 arch kernel: microcode: Microcode Update Driver: v2.2.
feb 13 21:31:40 arch kernel: microcode: CPU1: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU2: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU3: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU4: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU5: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU6: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU7: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU8: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU9: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU10: patch_level=0x0800820d
feb 13 21:31:40 arch kernel: microcode: CPU11: patch_level=0x0800820d
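A quick way to see whether every core reports the same patch level is to reduce the output above to its distinct values. This is only a sketch: patch_levels is a hypothetical helper, and the two sample lines are copied from the log above.

```shell
# patch_levels: read kernel log lines on stdin and print each distinct
# microcode patch_level value once.
patch_levels() {
    sed -n 's/.*patch_level=\(0x[0-9a-fA-F]*\).*/\1/p' | sort -u
}

# Sample input taken from the log above; on a live system:
#   journalctl -k --grep=microcode | patch_levels
printf '%s\n' \
    'feb 13 19:24:06 arch kernel: microcode: CPU0: patch_level=0x0800820d' \
    'feb 13 19:24:06 arch kernel: microcode: CPU1: patch_level=0x0800820d' \
    | patch_levels
# prints 0x0800820d
```

A single line of output means all listed cores agree; more than one line would point at cores running different microcode revisions.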
Last edited by ziltoid (2022-02-13 22:12:31)
Offline
Added second log.
Offline
I have amd-ucode installed.
But is it applied in the bootloader configuration?
I have a Ryzen 5850U and the patch level is higher than yours:
alpine:~$ doas dmesg | grep microcode
[ 0.618684] microcode: CPU0: patch_level=0x0a50000c
[ 0.618688] microcode: CPU1: patch_level=0x0a50000c
[ 0.618691] microcode: CPU2: patch_level=0x0a50000c
[ 0.618702] microcode: CPU3: patch_level=0x0a50000c
[ 0.618707] microcode: CPU4: patch_level=0x0a50000c
[ 0.618711] microcode: CPU5: patch_level=0x0a50000c
[ 0.618715] microcode: CPU6: patch_level=0x0a50000c
[ 0.618721] microcode: CPU7: patch_level=0x0a50000c
[ 0.618725] microcode: CPU8: patch_level=0x0a50000c
[ 0.618731] microcode: CPU9: patch_level=0x0a50000c
[ 0.618735] microcode: CPU10: patch_level=0x0a50000c
[ 0.618741] microcode: CPU11: patch_level=0x0a50000c
[ 0.618745] microcode: CPU12: patch_level=0x0a50000c
[ 0.618750] microcode: CPU13: patch_level=0x0a50000c
[ 0.618753] microcode: CPU14: patch_level=0x0a50000c
[ 0.618758] microcode: CPU15: patch_level=0x0a50000c
[ 0.618760] microcode: Microcode Update Driver: v2.2.
alpine:~$
Jin, Jîyan, Azadî
Offline
Yes it is loaded:
### BEGIN /etc/grub.d/10_linux ###
menuentry 'Arch Linux, with Linux linux-zen' --class arch --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-linux-zen-advanced-74849db5-f411-4feb-b591-4f10b403f2c7' {
savedefault
load_video
set gfxpayload=keep
insmod gzio
insmod part_gpt
insmod ext2
search --no-floppy --fs-uuid --set=root 74849db5-f411-4feb-b591-4f10b403f2c7
echo 'Loading Linux linux-zen ...'
linux /boot/vmlinuz-linux-zen root=UUID=74849db5-f411-4feb-b591-4f10b403f2c7 rw lsm=lockdown,yama,apparmor,bpf audit=1 loglevel=3 quiet amdgpu.ppfeaturemask=0xffffffff
echo 'Loading initial ramdisk ...'
initrd /boot/amd-ucode.img /boot/initramfs-linux-zen.img
}
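For early loading, the microcode image has to be the first file on the initrd line, as it is in the entry above. The check can be scripted roughly like this; ucode_first is a hypothetical helper, and /boot/grub/grub.cfg is assumed to be the location of the generated config.

```shell
# ucode_first: read a GRUB config on stdin; for every "initrd" line report
# whether a *-ucode.img image is the first file listed.
ucode_first() {
    awk '$1 == "initrd" {
        if ($2 ~ /-ucode\.img$/)
            print "ok: " $0
        else
            print "no microcode first: " $0
    }'
}

# Sample line from the menuentry above; on a live system (path is an assumption):
#   ucode_first < /boot/grub/grub.cfg
printf '%s\n' 'initrd /boot/amd-ucode.img /boot/initramfs-linux-zen.img' | ucode_first
# prints "ok: initrd /boot/amd-ucode.img /boot/initramfs-linux-zen.img"
```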
Offline
The first log is interesting:
feb 13 11:15:41 arch kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
feb 13 11:15:41 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=186138, emitted seq=186140
feb 13 11:15:41 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Prey.exe pid 4221 thread Prey.exe pid 4307
feb 13 11:15:41 arch kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset begin!
…
feb 13 11:15:41 arch kernel: amdgpu: cp is busy, skip halt cp
feb 13 11:15:41 arch kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
feb 13 11:15:41 arch kernel: [drm:dce110_vblank_set [amdgpu]] *ERROR* Failed to get VBLANK!
feb 13 11:15:41 arch kernel: amdgpu: rlc is busy, skip halt rlc
feb 13 11:15:41 arch kernel: amdgpu 0000:26:00.0: amdgpu: BACO reset
feb 13 11:15:42 arch kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset succeeded, trying to resume
feb 13 11:15:42 arch kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
feb 13 11:15:42 arch kernel: [drm] VRAM is lost due to GPU reset!
feb 13 11:15:43 arch kernel: [UFW BLOCK] IN=enp34s0 OUT= MAC=01:00:5e:00:00:01:8c:dc:02:d5:53:f3:08:00 SRC=192.168.1.1 DST=224.0.0.1 LEN=32 TOS=0x00 PREC=0x60 TTL=1 ID=2266 DF PROTO=2
feb 13 11:15:44 arch kernel: amdgpu:
failed to send message 200 ret is 0
feb 13 11:15:49 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:15:52 arch kernel: amdgpu:
failed to send message 100 ret is 0
feb 13 11:15:55 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:15:55 arch kernel: amdgpu: SMU Firmware start failed!
feb 13 11:15:55 arch kernel: amdgpu: Failed to load SMU ucode.
feb 13 11:15:55 arch kernel: amdgpu: fw load failed
feb 13 11:15:55 arch kernel: amdgpu: smu firmware loading failed
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: [drm] Skip scheduling IBs!
feb 13 11:15:55 arch kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset(2) failed
feb 13 11:15:55 arch kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
feb 13 11:15:55 arch kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset end with ret = -22
feb 13 11:15:57 arch kernel: amdgpu:
failed to send message 201 ret is 0
feb 13 11:16:03 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:05 arch kernel: amdgpu:
failed to send message 282 ret is 0
feb 13 11:16:08 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:08 arch kernel: sysrq: This sysrq operation is disabled.
feb 13 11:16:10 arch kernel: amdgpu:
failed to send message 170 ret is 0
feb 13 11:16:11 arch kernel: sysrq: This sysrq operation is disabled.
feb 13 11:16:11 arch kernel: sysrq: This sysrq operation is disabled.
feb 13 11:16:13 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:14 arch kernel: sysrq: This sysrq operation is disabled.
feb 13 11:16:16 arch kernel: sysrq: This sysrq operation is disabled.
feb 13 11:16:16 arch kernel: amdgpu:
failed to send message 171 ret is 0
feb 13 11:16:19 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:21 arch kernel: amdgpu:
failed to send message 200 ret is 0
feb 13 11:16:24 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:26 arch kernel: amdgpu:
failed to send message 201 ret is 0
feb 13 11:16:29 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:32 arch kernel: amdgpu:
failed to send message 200 ret is 0
feb 13 11:16:34 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:34 arch kernel: sysrq: This sysrq operation is disabled.
feb 13 11:16:37 arch kernel: amdgpu:
failed to send message 201 ret is 0
feb 13 11:16:39 arch kernel: sysrq: This sysrq operation is disabled.
feb 13 11:16:42 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:45 arch kernel: amdgpu:
failed to send message 282 ret is 0
feb 13 11:16:47 arch kernel: amdgpu:
last message was failed ret is 0
feb 13 11:16:50 arch kernel: amdgpu:
failed to send message 170 ret is 0
https://bugs.freedesktop.org/show_bug.cgi?id=109001
Try to disable DPMS (does this happen when the monitor turn{s,ed} off for power saving?)
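On an X11 session, DPMS can be switched off at runtime with xset and the result read back with `xset q`. A minimal sketch: the xset commands are shown as comments because they need a running X server, dpms_state is a hypothetical helper that only parses `xset q` output, and the sample text imitates that output.

```shell
# On the live X session (not executed here):
#   xset -dpms            # disable DPMS
#   xset s off            # also turn off screen blanking
#   xset q | dpms_state   # read the state back

# dpms_state: read `xset q` output on stdin and print "enabled", "disabled",
# or "unknown" if no "DPMS is ..." line is present.
dpms_state() {
    awk '/DPMS is/ { print tolower($3); found = 1; exit }
         END { if (!found) print "unknown" }'
}

# Sample imitating the DPMS section of `xset q`
printf '%s\n' \
    'DPMS (Energy Star):' \
    '  Standby: 600    Suspend: 600    Off: 600' \
    '  DPMS is Enabled' \
    | dpms_state
# prints "enabled"
```

Settings made with xset last only for the current session, so this is suited to a quick test rather than a permanent change.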
Offline
In the past weeks/months there were times when, after resuming from suspend, the monitor stayed black or showed colored patterns and was unusable. The only way out was to reset the machine.
I'll try to disable DPMS and see if things change.
Offline
after resuming from suspend, the monitor stayed black
Is that (S3) a necessary condition? (It doesn't show up in the log.)
Offline
No, the freezes are not triggered by suspend to RAM; they happen at random moments while I'm doing random things on the PC. The last log shows a freeze 2 minutes after boot, before I had time to do anything.
Offline
UPDATE
20/02/2022: I set the kernel parameter `amdgpu.runpm=0` to prevent the video card from powering off. The freezes still occur, but now I can see how the PC behaves before it locks up (before this change the video signal disappeared immediately). What I can see:
1. Sometimes, while using Firefox or FreeTube, these programs start to crash repeatedly until the machine freezes. dmesg (see link below) shows the programs segfaulting.
2. Sometimes (rarely) the PC is already frozen right after boot.
3. Sometimes the PC forcibly logs me out.
4. I can't switch to a TTY.
segfault event 2022/02/24:
journal: https://go.0xfc.de/e6exo2
dmesg: https://go.0xfc.de/f8cd75
xorg: https://go.0xfc.de/ab81sm
Last edited by ziltoid (2022-02-26 10:07:32)
Offline
feb 24 23:00:30 arch kernel: Isolated Web Co[3162]: segfault at 686e2a7c ip 00007f4ce6bd0012 sp 00007fff686e1ca0 error 6 in libxul.so[7f4ce64ac000+596f000]
feb 24 23:00:30 arch kernel: Code: 00 75 78 80 7c 24 30 00 74 35 80 7c 24 20 11 0f 84 97 00 00 00 8a 4c 24 21 80 f9 0f 77 21 48 8b 44 24 28 ba 01 00 00 00 d3 e2 <66> 09 90 34 01 00 00 ba fe ff ff ff d3 c2 66 21 90 30 01 00 00 64
feb 24 23:00:30 arch systemd[1]: Created slice Slice /system/systemd-coredump.
feb 24 23:00:30 arch systemd[1]: Started Process Core Dump (PID 8550/UID 0).
feb 24 23:00:32 arch systemd-coredump[8551]: Process 3162 (Isolated Web Co) of user 1000 dumped core.
…
Stack trace of thread 3162:
#0 0x00007f4ce6bd0012 n/a (libxul.so + 0x40bf012)
#1 0x00007f4ce6bbf28e n/a (libxul.so + 0x40ae28e)
#2 0x00007f4ce6bb91e2 n/a (libxul.so + 0x40a81e2)
#3 0x00007f4cea8da576 n/a (libxul.so + 0x7dc9576)
#4 0x000022ff8f8e8d82 n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64
Firefox crashed somewhere™ in its own libxul - doesn't look amdgpu related.
feb 24 23:19:50 arch systemd[757]: Started Shisen-Sho - Shisen-Sho Mahjongg-like Tile Game.
20 minutes later you were still procrastinating
feb 24 23:43:43 arch sudo[9966]: ziltoid : TTY=pts/2 ; PWD=/home/ziltoid ; USER=root ; COMMAND=/usr/bin/pacman --sync -y -u --
Then running an update
feb 24 23:44:12 arch dbus-daemon[851]: [session uid=1000 pid=851] Activating service name='org.kde.Shutdown' requested by ':1.13' (uid=1000 pid=1099 comm="/usr/bin/ksmserver ")
Then follows a clean (well, as clean as plasma gets these days) shutdown.
There's no indication of a freeze or amdgpu error in that journal?
You've seen https://wiki.archlinux.org/title/Ryzen#Troubleshooting ?
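To double-check a journal for amdgpu trouble without reading it line by line, the driver's error and reset messages can be counted. A rough sketch: amdgpu_errors is a hypothetical helper, the patterns are based only on the messages quoted earlier in this thread, and the sample lines are copied from those logs.

```shell
# amdgpu_errors: read journal lines on stdin and print how many amdgpu lines
# contain a driver *ERROR* marker or a "GPU reset" message.
amdgpu_errors() {
    grep 'amdgpu' | grep -E -c '\*ERROR\*|GPU reset'
}

# Sample lines from the logs in this thread; on a live system, for the
# previous boot:  journalctl -k -b -1 | amdgpu_errors
printf '%s\n' \
    'feb 13 11:15:41 arch kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=186138, emitted seq=186140' \
    'feb 13 11:15:41 arch kernel: amdgpu 0000:26:00.0: amdgpu: GPU reset begin!' \
    'feb 24 23:19:50 arch systemd[757]: Started Shisen-Sho - Shisen-Sho Mahjongg-like Tile Game.' \
    | amdgpu_errors
# prints 2
```

A count of 0 does not prove the GPU was uninvolved, since a hard freeze can stop the journal before anything is written.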
Offline
Too bad, I was hoping the Firefox crash could lead to the main problem, because usually after a few crashes the system locks up and the logs are "broken", like today's log: https://go.0xfc.de/l6rmt5
I'll try to follow the suggestion in the page you linked and if nothing works I'll reinstall everything.
Thanks for the help
Offline
Last update: problem solved!
In the end the problem was faulty RAM: a more recent memtest run produced lots of errors within the first minute. The strange and annoying thing is that the RAM was the first thing I checked when the freezes started, and the same test gave zero errors back then.
With new ram the system is stable.
Last edited by ziltoid (2022-04-05 18:34:38)
Offline