I have an Arch server that I typically run headless, although I've currently got it set up with a monitor and keyboard to help debug this issue. A few weeks ago I switched out the server's hardware, changing from an Intel to an AMD processor, following the Top to Bottom guide on ArchWiki.
Ever since then I've had this stability issue where the system will freeze after 24-72 hours. I can't access it over the network, and now that I have a monitor and keyboard, I can see the command prompt there is frozen too (but still shows the latest output). I've only been able to get the system back with a hard reset when this happens.
Here's a summary of what I've tried so far:
- Uninstalled intel-ucode and installed amd-ucode (I was certain this was the issue when I discovered it, but alas it didn't fix anything).
- Flashed latest BIOS for the new motherboard.
- Checked journalctl --boot=-1, but there's nothing out of the ordinary right before the crash.
- I thought this might be a kernel panic, so I tried to set up a crashdump kernel following the guide on ArchWiki to get some more debug information. Unfortunately I haven't been able to get it working: the crashdump kernel fails to boot when it can't find /dev/disk/by-uuid.
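For anyone wanting to repeat the journal check, here is a minimal sketch of pulling only warnings and worse from the previous boot. It assumes the journal is persistent (otherwise the previous boot is not kept), and the helper name `previous_boot_cmd` and its defaults are made up for illustration.

```python
import shutil
import subprocess


def previous_boot_cmd(priority="warning", lines=200):
    """Build a journalctl command for the tail of the previous boot's log.

    The priority and line count defaults are arbitrary choices for this sketch.
    """
    return [
        "journalctl",
        "--boot=-1",               # previous boot
        f"--priority={priority}",  # this level and more severe
        f"--lines={lines}",        # only the last N matching entries
        "--no-pager",
    ]


if __name__ == "__main__" and shutil.which("journalctl"):
    result = subprocess.run(previous_boot_cmd(), capture_output=True, text=True)
    # journalctl reports problems (e.g. no persistent journal) on stderr
    print(result.stdout or result.stderr)
```

With a hard freeze the last few seconds of log may never reach the disk, so an empty tail here does not rule out a kernel problem.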
I'm not sure what else to try here, does anyone have any other suggestions? If this does sound like a kernel panic, can anyone advise on why the crashdump kernel is failing to boot?
Last edited by theneuralbit (2022-05-27 14:20:13)
Offline
Hopefully this post isn't beating the dead horse - but I'd agree with your first suspicion here:
- Uninstall intel-ucode and installed amd-ucode
So on that, can you confirm that 1) you've updated your bootloader / boot manager config(s) and 2) these new configs are really being used (e.g., by also adding a kernel parameter that makes an obvious / notable change, so you can see that it is applied)?
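A rough sketch of how both checks could be scripted, in case it helps. It assumes GRUB with its config at /boot/grub/grub.cfg; the function names and the marker parameter `ucode_test_marker=1` are hypothetical, so substitute whatever harmless parameter you actually add.

```python
import re
from pathlib import Path


def ucode_images(grub_cfg_text):
    """Return the sorted set of *-ucode.img names found on initrd lines."""
    found = set()
    for line in grub_cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("initrd"):
            found.update(re.findall(r"[\w-]+-ucode\.img", stripped))
    return sorted(found)


def has_param(cmdline_text, param):
    """True if the exact parameter appears on the kernel command line."""
    return param in cmdline_text.split()


if __name__ == "__main__":
    try:
        # Assumed path; reading it may require root.
        cfg = Path("/boot/grub/grub.cfg").read_text()
        print("microcode images in grub.cfg:", ucode_images(cfg))
    except OSError as exc:
        print("could not read grub.cfg:", exc)
    try:
        cmdline = Path("/proc/cmdline").read_text()
        # Hypothetical marker parameter added only to prove the config is in use.
        print("marker present:", has_param(cmdline, "ucode_test_marker=1"))
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
```

The first check only shows what the config file says; the second shows what the running kernel was actually booted with, which is the part that confirms the new config is in use.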
Last edited by Trilby (2022-01-03 17:05:04)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
You might consider testing the RAM.
At least 10 full passes, if not overnight.
Offline
Assuming the above suggestions don't fix it, I would suggest logging the CPU and RAM usage as well to make sure that's not the cause. There are various tools that do it, but a very simple solution is to write a script that appends the output of mpstat, free and so on to a file. Then have a systemd timer run that every minute. Maybe there's a resource usage spike immediately before the crash.
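As a minimal sketch of such a script: this version reads /proc/loadavg and /proc/meminfo directly instead of calling mpstat and free, which is a substitution on my part to keep it dependency-free. The function names and the log line format are made up for illustration.

```python
import os
import sys
import time
from pathlib import Path


def parse_meminfo(text):
    """Map /proc/meminfo field names to integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(timestamp, loadavg_text, meminfo_text):
    """Build one log line: timestamp, 1-minute load, available and total memory."""
    load1 = loadavg_text.split()[0]
    mem = parse_meminfo(meminfo_text)
    return (f"{timestamp} load1={load1} "
            f"mem_available_kb={mem.get('MemAvailable', -1)} "
            f"mem_total_kb={mem.get('MemTotal', -1)}")


def log_once(log_path):
    """Append a single sample to log_path and force it to disk."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    line = format_sample(stamp,
                         Path("/proc/loadavg").read_text(),
                         Path("/proc/meminfo").read_text())
    with open(log_path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # a hard freeze would otherwise lose buffered lines


if __name__ == "__main__" and len(sys.argv) > 1:
    log_once(sys.argv[1])  # log file path given as the first argument
```

Run it once a minute from a systemd timer with the log file path as its argument. The fsync is there because the machine is expected to freeze hard, and without it the last few samples may never be written.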
Offline
0) At least one pass of memtest as suggested above, and cpuburn for 4 hours (pay attention to temperature)
1) Does this server normally have a GPU, or is it really headless? Does it have an AMD GPU? Try
/etc/modprobe.d/amdgpu.conf
blacklist amdgpu
2) Have you tried enabling the systemd watchdog (if the motherboard supports it)?
3) LTS kernel? Do different kernels also freeze?
4) Check/reinstall all packages (see wiki) in case of file system corruption
5) Enable the REISUB commands and check whether they work when the system freezes
/etc/sysctl.d/99-sysctl.conf
kernel.sysrq = 1
6) Any out-of-repository/AUR packages?
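To check what the current kernel.sysrq value actually permits before the next freeze, something like the sketch below may help. The bit meanings follow the kernel's SysRq documentation as I understand it; the function names and the REISUB mask are my own summary, so treat them as assumptions.

```python
from pathlib import Path

# Bitmask meanings for kernel.sysrq values greater than 1.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all real-time tasks",
}

# R (unraw) + E/I (term/kill) + S (sync) + U (remount) + B (reboot)
REISUB_MASK = 4 | 64 | 16 | 32 | 128


def describe_sysrq(value):
    """List the SysRq function groups enabled by a kernel.sysrq value."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def reisub_allowed(value):
    """True if every key in the REISUB sequence is permitted."""
    return value == 1 or (value & REISUB_MASK) == REISUB_MASK


if __name__ == "__main__":
    try:
        current = int(Path("/proc/sys/kernel/sysrq").read_text())
    except (OSError, ValueError):
        current = None
    if current is not None:
        print(current, describe_sysrq(current), "REISUB ok:", reisub_allowed(current))
```

If REISUB still does nothing during a freeze even with kernel.sysrq = 1, that points toward a hardware-level hang rather than a stuck userspace.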
Last edited by avi9526 (2022-01-03 22:42:27)
Offline
I don't know which AMD CPU you have, but I had stability issues with an 1800X, which I fixed with this BIOS setting:
Advanced/AMD CBS/Zen Common Options
Power Supply Idle Control -> Typical Current Idle
Offline
Thanks for the suggestions everyone! Sorry I took so long to get back to this. I was actually waiting for an email notification and was surprised to find all this great advice when I checked back a week later.
You might consider testing the ram.
At least 10 full passes if not over night.
I installed memtest86+ and booted it up. The screen turns off, typically somewhere during the second pass, although the power LED stays on. Perhaps this is a crash indicating bad RAM? I was expecting it to just report an error if something was wrong. I'll see if I can swap in another stick and get different behavior.
So on that, can you confirm that 1) you've updated your bootloader / bootmanager config(s) and 2) confirm that these new configs are really being used
Good question. I did regenerate the grub config (grub-mkconfig). I also just did this again after installing memtest86+, and it found memtest86+ and added it to the grub menu. I just checked grub.cfg, and it references amd-ucode.img and not intel-ucode.img.
Offline
i dont know which amd cpu you have but i had stability issues with an 1800x which i fixed with this bios setting:
Advanced/AMD CBS/Zen Common Options Power Supply Idle Control -> Typical Current Idle
This turned out to be the issue. Thank you!!
I'm fine keeping this setting if it's what I need to do to keep the machine running, but it would be nice to reduce power usage when idle. I wonder if there's some software change that would resolve this instead?
Offline