Pages: 1
Topic closed
Hi,
For about two years, I have been getting random freezes with Arch Linux on my desktop PC.
I don't expect you to troubleshoot my problem for me, but maybe you can help me figure out how to troubleshoot it in the first place.
What's the best/fastest way to find the problem? Is there any CLI magic to get error statistics out of journalctl?
Problem Description
* Freeze = mouse not moving, Ctrl+Alt+Fx not working, keyboard Numpad LED always on, sound sometimes still works.
* Freezes occur completely at random, ~5 times a week
* Works for hours/days without freezes, even under high load
* The only way out is to reset the system
System Information
* Kernel: 5.0.2-arch1-1-ARCH
* Filesystem: LUKS > LVM >
My PC
* CPU: AMD Ryzen 5 1600
* Mainboard: Asrock AB350 Pro4
* GPU: Radeon HD 6850
* RAM: G.SKILL F4-3200C16-8GVKB 2x8GB
* nvme: Samsung SSD 970 EVO 500GB
Thanks in advance!
Last edited by hasdf (2019-04-19 16:27:31)
journalctl |grep -i "hardware err"
to make sure it is not the Linux Ryzen power management bug. If you see any output, that would be bad.
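To get the per-day error statistics asked about in the first post, a small script can summarize the journal instead of scrolling through grep output. This is a minimal Python sketch, assuming the journal is exported with journalctl -o json (one JSON object per line, with MESSAGE and __REALTIME_TIMESTAMP fields). The script name hwerr_per_day.py is hypothetical, and days are counted in UTC, so they can differ from the local dates journalctl prints.

```python
#!/usr/bin/env python3
"""Count journal messages that mention "Hardware Error", grouped by day (UTC).

Hypothetical usage:  journalctl -o json | python3 hwerr_per_day.py /dev/stdin
"""
import json
import sys
from collections import Counter
from datetime import datetime, timezone


def count_per_day(lines, needle="hardware error"):
    """Return a Counter mapping 'YYYY-MM-DD' (UTC) to the number of matching messages."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        # journalctl encodes non-UTF-8 fields as arrays of byte values (and may emit
        # null for very large fields); this sketch simply skips such entries.
        if not isinstance(message, str) or needle not in message.lower():
            continue
        # __REALTIME_TIMESTAMP is a string holding microseconds since the Unix epoch.
        seconds = int(entry["__REALTIME_TIMESTAMP"]) / 1_000_000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()
        counts[day] += 1
    return counts


def main(argv):
    if len(argv) < 2:
        print("usage: journalctl -o json | hwerr_per_day.py /dev/stdin", file=sys.stderr)
        return
    with open(argv[1], encoding="utf-8") as fh:
        for day, n in sorted(count_per_day(fh).items()):
            print(day, n)


if __name__ == "__main__":
    main(sys.argv)
```

Changing the needle argument gives the same kind of count for any other message text.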
I suppose adding "processor.max_cstate=1" to the kernel boot options to see if that avoids freezes is always worth a try on Ryzen.
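Whether the option actually reached the running kernel can be checked after a reboot by reading /proc/cmdline. How the option is added depends on the bootloader (for example the options line of a systemd-boot entry, or GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by regenerating grub.cfg). The Python sketch below only parses the command line; it splits on whitespace and therefore assumes there are no quoted values containing spaces.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplified: splits on whitespace, so quoted values with spaces are not supported.
    If a parameter appears more than once, the last occurrence wins in this dict.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def booted_max_cstate(path="/proc/cmdline"):
    """Return the processor.max_cstate value of the running kernel, or None if unset/unavailable."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_cmdline(p.read_text()).get("processor.max_cstate")


if __name__ == "__main__":
    print("processor.max_cstate =", booted_max_cstate())
```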
Why did you wait two years to investigate this?
Thank you for your fast reply!
Morn wrote:
journalctl |grep -i "hardware err"
to make sure it is not the Linux Ryzen power management bug. If you see any output, that would be bad.
This is the output. I have had more freezes since April 03.
➜ ~ journalctl |grep -i "hardware err"
Apr 03 21:10:00 martin-desktop kernel: mce: [Hardware Error]: Machine check events logged
Apr 03 21:10:00 martin-desktop kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
Apr 03 21:10:00 martin-desktop kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff852bb172 MISC d012000101000000 SYND 4d000000 IPID 500b000000000
Apr 03 21:10:00 martin-desktop kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1554313917 SOCKET 0 APIC 0 microcode 8001129
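The 64-bit value after "Bank 5:" is the machine-check status register. As a rough sketch, its architectural flag bits (same positions on Intel and AMD) and the low 16-bit MCA error code can be pulled out in Python. AMD-specific bits and the meaning of the error code for a particular bank are model-specific and are deliberately not decoded here. For the line above, the sketch reports VAL, UC, EN, MISCV, ADDRV and PCC as set, i.e. an uncorrected error flagged as having corrupted processor context, which would fit a hard freeze; MISCV and ADDRV also match the MISC and ADDR values printed in the log.

```python
import re

# Architectural MCA status bits (same bit positions on Intel and AMD).
STATUS_BITS = {
    63: "VAL",    # status register contents are valid
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # error was not corrected
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MISC register holds additional information
    58: "ADDRV",  # ADDR register holds the error address
    57: "PCC",    # processor context corrupt
}

BANK_RE = re.compile(r"Bank (\d+): ([0-9a-fA-F]+)")


def decode_mce_line(line):
    """Return (bank, flags, error_code) for a kernel 'mce: ... Bank N: <status>' line, else None."""
    m = BANK_RE.search(line)
    if m is None:
        return None
    bank = int(m.group(1))
    status = int(m.group(2), 16)
    flags = [name for bit, name in STATUS_BITS.items() if (status >> bit) & 1]
    error_code = status & 0xFFFF  # bits 15:0 hold the MCA error code
    return bank, flags, error_code


if __name__ == "__main__":
    sample = "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108"
    print(decode_mce_line(sample))
```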
Morn wrote: I suppose adding "processor.max_cstate=1" to the kernel boot options to see if that avoids freezes is always worth a try on Ryzen.
I'll try adding this.
Morn wrote: Why did you wait two years to investigate this?
Laziness, and because every time I tried to investigate, the error didn't occur... This time I lost 2 hours of work and got angry about it.
Last edited by hasdf (2019-04-10 16:52:00)
Yes, those are the Ryzen hardware errors I was expecting. In 2017 we had forum threads about this issue. That power management setting has helped me, at least; for other possible fixes, see https://wiki.gentoo.org/wiki/Ryzen#Rand … mce_events
Morn wrote: Why did you wait two years to investigate this?
Laziness, and because every time I tried to investigate, the error didn't occur... This time I lost 2 hours of work and got angry about it.
It's a miracle you did not lose more data during all that time. Linux file systems are simply not made for daily crashes...
It seems that adding "processor.max_cstate=1" to the kernel boot options has solved this issue.
Thank you Morn for your help!
I think you don't need to set it to 1. I also had freezes with the same CPU, but setting it to 5 was enough to fix the problem:
processor.max_cstate=5 rcu_nocbs=0-11
As you can see, I also added rcu_nocbs=0-11 to fix another problem with our Ryzen CPU. The other thread also mentions "idle=nomwait", but I think that is no longer needed with recent kernel versions.
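rcu_nocbs takes a kernel CPU list. The Ryzen 5 1600 has 6 cores and 12 threads, so 0-11 names all 12 logical CPUs. Below is a simplified Python sketch for expanding such a list and checking that it covers every CPU on the current machine; it only understands comma-separated numbers and ranges, not any more elaborate list syntax the kernel may also accept.

```python
import os


def parse_cpulist(spec):
    """Expand a simple kernel CPU list such as '0-11' or '0,2-4' into sorted CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def covers_all_cpus(spec, ncpus=None):
    """True if the list includes every logical CPU 0..ncpus-1 (default: os.cpu_count())."""
    if ncpus is None:
        ncpus = os.cpu_count() or 1
    return set(range(ncpus)) <= set(parse_cpulist(spec))


if __name__ == "__main__":
    print("rcu_nocbs=0-11 covers this machine:", covers_all_cpus("0-11"))
```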
OK. I'll try this and report back if the system crashes again.
Am I correct that C-states only matter for power saving? Or do they influence CPU lifetime as well?
OK. I'll try this and report back if the system crashes again.
Am I correct that C-states only matter for power saving? Or do they influence CPU lifetime as well?
If anything, I think deeper power-saving states might actually decrease CPU lifetime. AFAIK, running the CPU at a constant clock speed puts the least strain on the CPU and motherboard. Machines normally run best if you do not turn them on and off all the time, and CPUs are no different. That is why I personally do not mind max_cstate=1.
I see it the other way round: max_cstate=1 means the CPU is constantly operating at full voltage and therefore runs hot, which decreases its lifetime. Well, it doesn't really run at full voltage all the time, because C1 already lets the CPU "halt", so it doesn't get as hot as in C0. But I would at least allow the CPU to enter C3. And here's a reddit thread suggesting that you disable only C6. By the way, you can read about the C-states here.
edit: One reddit user says the C6 state is unstable with the DRAM "Power Down Enable" option enabled, and recommends disabling "Power Down Enable" rather than C6. However, I still had freezes with "DRAM Power Down" disabled.
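To see which idle states the kernel actually exposes and how often each one is entered, the cpuidle interface in sysfs can be read (on kernels built with cpuidle support). The Python sketch below lists the states of one CPU. The state names come from the idle driver and may not line up one-to-one with the C-state names used in BIOS menus or the reddit thread. Each state directory also has a disable file; writing 1 to it as root disables that state at runtime, which could be a way to test switching off a single state without rebooting, but this sketch only reads.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(dir_name, state_name, times_entered, disabled)] for one CPU's idle states."""
    root = Path(cpuidle_dir)
    if not root.is_dir():
        return []
    states = []
    # Directories are state0, state1, ...; sort by the number so state10 follows state9.
    for d in sorted(root.glob("state[0-9]*"), key=lambda p: int(p.name[len("state"):])):
        name = (d / "name").read_text().strip()
        usage = int((d / "usage").read_text())
        disable = d / "disable"
        disabled = disable.exists() and disable.read_text().strip() == "1"
        states.append((d.name, name, usage, disabled))
    return states


if __name__ == "__main__":
    for dir_name, name, usage, disabled in list_idle_states():
        note = " (disabled)" if disabled else ""
        print(f"{dir_name}: {name} entered {usage} times{note}")
```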
Last edited by thorstenhirsch (2019-04-19 19:25:26)
Out of curiosity, do you have the AMD microcode/firmware loading with the kernel? I ask because I had the same issue with random freezes, and removing it from my bootctl entries fixed it.
After reading the wiki page I would answer your question like this:
- I haven't enabled early microcode update, so it wasn't updated when the kernel was loaded
- but I haven't disabled late microcode update, so it was updated at a later boot stage by systemd
However, I've now also enabled early microcode update... and it made no difference. I don't even see any update messages in dmesg.
[ 0.832583] microcode: CPU0: patch_level=0x08001137
(repeated for each core)
I think this is an updated microcode version, because I've found reports from users with the same CPU as mine showing microcode version 0x8001126, and others showing 0x8001129. So I actually don't know where my update came from, but it seems to come from neither Linux's early nor its late update mechanism.
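Patch levels are easiest to compare as numbers; the leading zero in 0x08001137 does not matter, it is the same value as 0x8001137. Here is a minimal Python sketch that collects the per-CPU patch level from dmesg (or journalctl -k) lines like the one above, assuming that within the same CPU family a higher number means a newer microcode revision.

```python
import re

PATCH_RE = re.compile(r"microcode: CPU(\d+): patch_level=(0x[0-9a-fA-F]+)")


def patch_levels(lines):
    """Map CPU number -> microcode patch level (as int) from 'microcode: CPUn: patch_level=0x...' lines."""
    levels = {}
    for line in lines:
        m = PATCH_RE.search(line)
        if m:
            levels[int(m.group(1))] = int(m.group(2), 16)
    return levels


def is_newer(level, *reported):
    """True if `level` is numerically higher than every reported hex patch level."""
    return all(level > int(r, 16) for r in reported)
```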
edit: Maybe AGESA 1.0.0.6 came with this microcode update and it installs it even before the kernel is loaded...?
Last edited by thorstenhirsch (2019-04-19 23:11:48)
thorstenhirsch wrote: Maybe AGESA 1.0.0.6 came with this microcode update and it installs it even before the kernel is loaded...?
Definitely possible, as firmware updates often include the latest microcode updates.
thorstenhirsch wrote:
- I haven't enabled early microcode update, so it wasn't updated when the kernel was loaded
- but I haven't disabled late microcode update, so it was updated at a later boot stage by systemd
If you didn't install the amd-ucode package from core, then you aren't loading it, which means I was barking up the wrong tree and can be safely ignored.
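With amd-ucode installed, early loading on Arch works by having the bootloader load /boot/amd-ucode.img as the first initrd, before the regular initramfs. For systemd-boot entries (the *.conf files under loader/entries/ on the ESP), a hypothetical Python check for that ordering could look like this; the /boot mount point used in the main block is an assumption about the ESP layout.

```python
from pathlib import Path


def initrd_lines(entry_text):
    """Return the values of all 'initrd' keys in a systemd-boot entry, in order."""
    paths = []
    for raw in entry_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, *rest = line.split(None, 1)
        if key == "initrd" and rest:
            paths.append(rest[0].strip())
    return paths


def microcode_loaded_first(entry_text, ucode="/amd-ucode.img"):
    """True if the microcode image is the first initrd listed in the entry."""
    initrds = initrd_lines(entry_text)
    return bool(initrds) and initrds[0] == ucode


if __name__ == "__main__":
    # Assumes the ESP is mounted at /boot; adjust for other layouts.
    entries = Path("/boot/loader/entries")
    if entries.is_dir():
        for conf in sorted(entries.glob("*.conf")):
            try:
                print(conf.name, microcode_loaded_first(conf.read_text()))
            except OSError as err:
                print(conf.name, "unreadable:", err)
```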
Hi, I had the same problem on a Lenovo ideapad 720S-13ARR (AMD Ryzen 7 2700U).
Adding processor.max_cstate=1 to kernel cmdline was helpful for me.
Thanks for the contribution. Be careful of bumping old topics, especially those marked [SOLVED]. As this topic is more than six months old and the OP has not been back since May, I am closing this now.