Random system freeze -- how to get information?

clupus · 2017-01-31 16:05:08

Hello to everybody reading,

I have a problem: one of my headless machines (intended as a NAS and small network server) has been freezing randomly. This has happened roughly once a day for the last 4-5 days, mostly at night, when nobody can investigate.
The effect is that the machine becomes unresponsive on the network. The monitor shows the usual TTY login prompt. Any keyboard input seems to be discarded, including Num Lock. Neither Ctrl+Alt+Del nor any magic SysRq combination gets a response. The machine seems to be completely frozen.

I thought of a kernel panic, but there is no error output visible on the screen, which as far as I know should be there in such a case.
I once started `watch -n 0.1 'dmesg -dT | tail -n 40'` and looked the next day. There was nothing special in the kernel log, except that it ended with a message that a core dump had been created. However, I found no such core dump. Maybe the hard disk was already read-only or gone completely by then.

To investigate, I installed munin and atop and configured them to log the system state every 5 minutes. After restarting (and possibly repairing the file systems), munin shows that logging stopped at a certain point in time. I do not know whether this is the moment the machine halted. It could be that only the network broke and the machine kept running for a short while (I think I once saw a journalctl entry from 30 minutes after the end of the munin graph, but I am no longer sure). As far as I can see, none of the logged values are exceptional (no high CPU, memory, or network load, temperature, ...).
atop does not give any clue right before the failure either.
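One way to narrow down when the machine actually stops (as opposed to when the network or munin stops) would be a small heartbeat file that is flushed to disk on every write. This is only a sketch: `write_heartbeat` is a made-up helper name, and the log path and the 10 second interval in the usage comment are arbitrary assumptions.

```shell
#!/bin/sh
# write_heartbeat FILE
# Appends the current local time to FILE and flushes it to disk, so the
# last line survives a hard freeze (as long as the disk is still writable).
# Hypothetical helper, not part of munin or atop.
write_heartbeat() {
    date '+%Y-%m-%d %H:%M:%S' >> "$1"
    sync
}

# Possible usage, left commented out because it loops forever:
#   while :; do write_heartbeat /var/log/heartbeat.log; sleep 10; done
```

After the next freeze and reboot, the last line of the file gives the latest time at which the system was still able to run a process and write to disk.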

The machine has run for approx. 1 year with only short pauses (mainly for maintenance), so the Linux installation itself should be sound. There were no big software updates recently, so I think a regression is unlikely as well (correct me if I am wrong). Thus I suspected a hardware defect.

I ran memtest86+, which found no errors after 6 complete passes.
The hard disks are in a software RAID1, so the failure of one disk (or even several, if they are the right ones) should be recoverable for now.
Of course, any other hardware (mainboard, or the RAM/SATA/network controller) could cause problems. There I do not know how to investigate other than by replacing components outright.

Do you have any ideas what could be the issue and how to investigate? I have no real further ideas.

An additional constraint is that I am not on site all the time. The machine is located "off-site" and I can get there only once every 14 days. In between I can only instruct another person (my father, who has no knowledge of Linux) to restart it or take a photo of the screen. So...

Thanks in advance
Christian

chr0mag · 2017-02-02 01:08:20

Hi Christian,

I suggest checking whether you're suffering from this bug: https://bugzilla.kernel.org/show_bug.cgi?id=109051 . It affects Intel Bay Trail processors and causes a complete system lockup with no console access, no logs, no SysRq, no info at all to help troubleshoot.

*What kernel are you running?

uname -r

*What processor is your system using?

grep "model name" /proc/cpuinfo

Take the model name output from cpuinfo and look up your processor's codename here: https://en.wikipedia.org/wiki/List_of_I … processors . If you do in fact have a Bay Trail processor, this might be your culprit.

I finally managed to diagnose this as the issue on one of my systems, and the kernel boot parameter suggested in the bug report (and buried on this page https://wiki.archlinux.org/index.php/In … eshooting) mitigates the issue for me.

intel_idle.max_cstate=1
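
How the parameter gets onto the kernel command line depends on the boot loader, so that part is not shown here. To confirm after a reboot that it is actually active, something like the following sketch could be used. `has_kernel_param` is a hypothetical helper name; the only thing it relies on is that `/proc/cmdline` holds the running kernel's command line.

```shell
#!/bin/sh
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a complete, space-separated entry on the
# kernel command line. CMDLINE_FILE defaults to /proc/cmdline.
# Hypothetical helper, not a standard tool.
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # One parameter per line, then require an exact whole-line match.
    tr ' ' '\n' < "$file" | grep -Fxq -- "$param"
}

if has_kernel_param "intel_idle.max_cstate=1"; then
    echo "parameter is active"
else
    echo "parameter is not active"
fi
```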

Cheers,
chr0mag

clupus · 2017-02-02 11:50:05

Hello chr0mag,

chr0mag wrote:

*What kernel are you running?

4.9.6-1-ARCH

chr0mag wrote:

*What processor is your system using?

model name	: VIA C7-D Processor 1800MHz

So this is not an Intel CPU, and therefore the bug is (as I understand it) not applicable. Thanks nevertheless.

While looking around further I found something interesting. Long story short, `sudo ldconfig` segfaulted every time it was called. I found a hint to remove the cache and regenerate it, but that was not possible (segfault). So I reinstalled the glibc package using pacman, and at least ldconfig is no longer an issue.

I thought that maybe the file system under /usr was corrupted when the machine once hung for whatever reason, leaving an essential program unusable. Do not ask me why, it was just an idea. So I backed up /etc and reinstalled all installed packages. That was yesterday at 1pm, and the machine has run without problems since. Before the reinstallation it ran for only approx. 1-6 hours at a time.
Of course this is no proof that the problem is solved. If you have seen such an effect before, you might enlighten me ;-). I hope the problem is solved, but I will wait until I am reasonably sure before closing the issue.
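
For anyone who wants to check for damaged package files before reinstalling everything, pacman can verify installed files with `pacman -Qkk`. The sketch below filters its summary output down to the packages that report problems. `altered_packages` is a hypothetical helper name, and the summary line format shown in the comment is an assumption about pacman's output.

```shell
#!/bin/sh
# altered_packages
# Reads `pacman -Qkk` summary lines on stdin and prints the names of
# packages whose altered (or missing) file count is not zero.
# Assumed line format:  glibc: 1234 total files, 0 altered files
# Hypothetical helper, not part of pacman.
altered_packages() {
    awk '/ total files, / && $(NF-2) != 0 { sub(/:$/, "", $1); print $1 }'
}

# Possible usage on an Arch system (per-file warnings go to stderr):
#   pacman -Qkk 2>/dev/null | altered_packages
# One common way to reinstall all native packages, run as root:
#   pacman -Qqn | pacman -S -
```

Note that files edited on purpose (for example under /etc) can also be reported as altered, so a non-empty list does not by itself prove corruption.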

Thanks a lot
Christian
