#1 2013-10-30 09:15:32

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,963

hard shutdowns and failed reboots... not sure what to check

I have a desktop that I currently turn on every few days to upgrade and sync with a laptop (it's normally my main system, but I've been working more on the laptop recently). In the last 2 weeks, it has shut down (full poweroff) on its own while syncing (with unison). The first time it happened, it rebooted on its own and everything seemed to work. Just now, it fell into a cycle of failed reboots and shutdowns.

I switched off the PSU and opened up the case to see if anything was visibly wrong. I have checked all of the connectors. I clean out the dust every couple of months so there is no build-up.

I have had two physical problems with the computer before, but I do not think they are related. The first was a detached plastic power button on the front panel that could get stuck in the depressed position, but that was fixed with a little glue. It is holding steady, and it is definitely not the source of the error because the reboot cycle persisted even with the front panel open so that the button was not touching the switch. The other problem was old thermal grease, which I replaced a few weeks ago. The system has worked fine since then, and both the BIOS and the "sensors" application (lm_sensors package) report normal temperatures (40-50 C) while idle. Even under full load the core temperatures remain below the "high" threshold (74 C). For reference, "critical" is 100 C, so I'm far away from it if the sensors are to be believed.
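
For anyone who wants to automate the comparison against the "high" and "critical" thresholds rather than eyeballing it, here is a minimal sketch. It assumes the usual coretemp line layout printed by "sensors" (label, reading, then "high" and "crit" in parentheses); the function name classify and the SAMPLE text are made up for illustration and are not output from the machine in this thread.

```python
import re

# Matches lines such as (assumed coretemp layout):
#   Core 0:        +45.0°C  (high = +74.0°C, crit = +100.0°C)
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C"
    r"\s+\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C"
    r"(?:, crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C)?\)"
)


def classify(sensors_text):
    """Return {label: (temp, status)}; status is 'ok', 'high' or 'critical'."""
    results = {}
    for line in sensors_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header lines, adapters, fans, voltages, etc.
        temp = float(m.group("temp"))
        high = float(m.group("high"))
        crit = float(m.group("crit")) if m.group("crit") else None
        if crit is not None and temp >= crit:
            status = "critical"
        elif temp >= high:
            status = "high"
        else:
            status = "ok"
        results[m.group("label").strip()] = (temp, status)
    return results


# Hypothetical sample text, only here to exercise the parser.
SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0:        +45.0°C  (high = +74.0°C, crit = +100.0°C)
Core 1:        +76.0°C  (high = +74.0°C, crit = +100.0°C)
"""

if __name__ == "__main__":
    for label, (temp, status) in classify(SAMPLE).items():
        print(f"{label}: {temp:.1f} C [{status}]")
```

To run it against live readings, the sample can be swapped for subprocess.run(["sensors"], capture_output=True, text=True).stdout. Logging the result every few seconds during a sync would show whether the temperature climbs just before a poweroff.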

After waiting about 15 seconds, I turned the PSU back on and restarted the computer. I first entered the BIOS to check the "health" screen but saw nothing unusual (admittedly I do not know exactly what voltages to expect, but the temperatures and fan RPM look good). I have since booted the system and all appears to be working well. Temperatures and everything else reported by "sensors" look normal to me. dmesg reports no errors and there is nothing in journalctl or errors.log. I have not re-run fsck this time, but I did last time and no errors were reported.

Given the failure to reboot I obviously suspect a hardware error, but I'm not sure how to proceed. I will probably run memtest today when I have the time, but other than that I don't have any idea where to start. It may be the main disk, as the problem has only appeared under heavy disk I/O (the filesystems that were being synced are on different logical volumes on the same LVM partition), but again, fsck reported nothing (last time).

Does anyone have any suggestions?

p.s. The system is still running fine right now and I have managed to complete the sync and system upgrade. It really does seem random to me, but I suppose hardware failures usually are.


My Arch Linux Stuff | Forum Etiquette | Community Ethos - Arch is not for everyone

#2 2013-10-30 10:39:57

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,568

Re: hard shutdowns and failed reboots... not sure what to check

In addition to the memtest, a full SMART check may be worthwhile, particularly if it seems disk I/O related. (edit: caveat: I know very little about smartmontools, except that it seems to be the tool to use if disk failure may be an issue).

Last edited by Trilby (2013-10-30 10:41:02)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#3 2013-10-30 18:59:05

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,963

Re: hard shutdowns and failed reboots... not sure what to check

Thanks for the suggestion. I'll run a full SMART check after a few hours of memtesting and report back.


edit
Neither memtest nor smartctl reports any errors. The long test is still running on a backup disk just to check, but that disk wasn't even mounted when the system crashed.

The computer has been running for hours without issue, but I'm running the minimal backup system from another disk and only using it to run tests from the tty. I don't hear any strange noises. The only strange message that I see in errors.log is

<timestamp> localhost kernel: [...] nouveau E[    PTHERM][0000:01:00.0] unhandled intr 0x00000060

The timestamp matches the system start time according to the output of uptime. I'll stress the CPU next with mprime to see if that crashes it, but beyond that I really don't have any ideas besides rechecking all of the connectors and maybe digging around in BIOS again.
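
To check whether that message recurs across boots, the log can be filtered mechanically. The following is a minimal sketch that assumes the line layout shown above (a one-letter level, a unit name in brackets, then a PCI address in brackets); parse_nouveau_line is a made-up helper name, and the meaning of the individual bits in the interrupt value is not known here, so the code only lists which bits are set.

```python
import re

# Assumed layout: nouveau <LEVEL>[ <UNIT>][<PCI address>] <message>
NOUVEAU_RE = re.compile(
    r"nouveau\s+(?P<level>[A-Z])\[\s*(?P<unit>\w+)\]"
    r"\[(?P<pci>[0-9a-fA-F:.]+)\]\s+(?P<msg>.*)"
)
HEX_RE = re.compile(r"0x[0-9a-fA-F]+")


def parse_nouveau_line(line):
    """Split a nouveau kernel log line into fields, or return None."""
    m = NOUVEAU_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["msg"] = info["msg"].strip()
    hex_match = HEX_RE.search(info["msg"])
    if hex_match:
        value = int(hex_match.group(0), 16)
        info["value"] = value
        # Positions of the set bits; what each bit means is not decoded here.
        info["bits_set"] = [i for i in range(value.bit_length()) if value >> i & 1]
    return info


if __name__ == "__main__":
    # The line quoted in this post, with its elided parts left as they are.
    line = ("<timestamp> localhost kernel: [...] nouveau "
            "E[    PTHERM][0000:01:00.0] unhandled intr 0x00000060")
    print(parse_nouveau_line(line))
```

Feeding every line of errors.log through the helper and keeping the non-None results gives a quick list of which units and PCI devices complained, and when.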

Does anyone have any other ideas? Even if it seems to be working now, not knowing what caused the crash will drive me crazy with paranoia.

Last edited by Xyne (2013-10-31 02:30:23)

#4 2013-10-31 17:09:31

henk
Member
From: Weert, Netherlands
Registered: 2013-01-01
Posts: 334

Re: hard shutdowns and failed reboots... not sure what to check

Searching for the PTHERM and "unhandled intr" error, I found a few sites which might be of interest.
https://www.google.com/#q=nouveau+PTHER … intr+error

Last edited by henk (2013-10-31 17:11:50)

#5 2013-11-01 16:44:53

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,963

Re: hard shutdowns and failed reboots... not sure what to check

Several hours of mprime testing yielded no errors and the reported temperatures never went above "high".

@henk
Thanks. I will look for ways to test the graphics card and drivers now. I do not see the PTHERM error when running the main system, presumably because I am using the Nvidia drivers, but it may indicate an error in the card itself.