
#1 2015-04-29 13:22:08

bjcubsfan
Member
From: Oklahoma City, OK, USA
Registered: 2012-05-04
Posts: 32

Hard lock ups multiple times a day - How can I troubleshoot?

My machine locks up hard multiple times a day. If I happen to be sitting in front of it, I observe that the image on the monitors stops changing. Mouse and keyboard will not work at all (including Magic key SysRq sequences). I cannot ping the box from another machine. The system has not made it to 24 hours of uptime for some weeks.

I am beginning to suspect that this may be an issue with the NVIDIA driver, but that's only a guess since it's the only thing I've seen before that can lock up a machine hard like this. I have checked journalctl and the Xorg logs, but don't see anything telling.

What should be my next steps toward solving this issue?


#2 2015-04-29 14:22:38

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Hard lock ups multiple times a day - How can I troubleshoot?

Did you enable Magic key SysRq sequences?

Where's the rest of the X log?

The journal has

Apr 29 07:49:54 lore dkms[938]: Error! Could not locate dkms.conf file.
Apr 29 07:49:54 lore dkms[938]: File:  does not exist.
Apr 29 07:49:54 lore dkms[938]: /usr/sbin/dkms: line 1857: echo: write error: Broken pipe

...

Apr 29 07:49:54 lore rpc.idmapd[1000]: rpc.idmapd: Skipping configuration file "/etc/idmapd.conf": No such file or directory
Apr 29 07:49:54 lore kernel: nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Apr 29 07:49:54 lore rpc.idmapd[1017]: Unable to open '/proc/sys/fs/nfs/idmap_cache_timeout' to set client cache expiration time to 0 seconds

...

Apr 29 07:49:54 lore rpc.mountd[1023]: Could not make a socket: (97) Address family not supported by protocol
Apr 29 07:49:54 lore rpc.mountd[1023]: Could not make a socket: (97) Address family not supported by protocol

...

Apr 29 07:49:54 lore rm[1112]: /usr/bin/rm: cannot remove ‘/etc/dfs/sharetab’: No such file or directory

I have no idea what these mean.
The journal mentions NFS and ZFS; tell us more about your setup.


#3 2015-04-29 15:08:11

bjcubsfan
Member
From: Oklahoma City, OK, USA
Registered: 2012-05-04
Posts: 32

Re: Hard lock ups multiple times a day - How can I troubleshoot?

It looks like I didn't have the Magic SysRq key sequences enabled (kernel.sysrq was set to 16), so I have enabled them now. I will report after the next lockup whether they work.
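For reference, kernel.sysrq is a bitmask, so a value of 16 does not mean "off": it allows only the sync command. Below is a minimal sketch that decodes a value into the function groups it enables. The bit meanings follow the kernel's SysRq documentation as I understand it; the function names `decode_sysrq` and `read_sysrq` are made up for this example.

```python
# Bit meanings for kernel.sysrq, taken from the kernel's sysrq documentation.
# 0 disables SysRq completely and 1 enables every function.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the SysRq function groups enabled by a kernel.sysrq value."""
    if value == 0:
        return []
    if value == 1:
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

Calling `decode_sysrq(read_sysrq())` on the affected machine should show which key sequences the kernel will honour. Note that even with everything enabled, SysRq cannot help if the kernel no longer services keyboard interrupts.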

I have ZFS running for a large non-root data store. I think I am married to the NVIDIA binary drivers because I am using a 4k monitor over HDMI as my primary display.

I believe that is all of the X log. It also looks very similar to the current X log.

I am not sure about the dkms errors, but I think dkms is only needed for VirtualBox.

bpotter@lore➜ /var/log» dkms status
vboxhost, 4.2.0: added
vboxhost, 4.2.2: added
vboxhost, 4.3.26: added

I could probably get rid of NFS if that is an issue. I don't serve NFS, but I do mount one NFS share.

/etc/dfs may be ZFS related, but there are no files there.

I did some research and noticed that my problems likely began after the upgrade where these two things (and many others) happened:

extra/nvidia                          346.47-3               3 -> 10
core/linux                            3.18.6-1               -> 3.19.2-1


#4 2015-04-29 15:09:32

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 20,659

Re: Hard lock ups multiple times a day - How can I troubleshoot?

When it locks up, do any indicators on the keyboard blink?  That would be an indication of a kernel panic. 
If there are no blinking lights, does it unlock eventually?
Has the machine ever been stable?
Have you tried installing a RAM test in your boot loader and running it?

Next, I would start blacklisting any kernel module you don't absolutely need to see if you can isolate it that way.
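As a starting point for deciding what to blacklist, here is a rough sketch that reads /proc/modules and lists modules nothing else currently holds a reference to. The helper names are hypothetical. Be careful with the result: a reference count of 0 only means no other module or open handle pins the module, and plenty of drivers in active use (network cards, for example) show 0, so treat the output as a list to review by hand and not as something to blacklist wholesale.

```python
def unused_modules(proc_modules_text):
    """Return names of loaded modules with refcount 0 and no dependent modules.

    Each /proc/modules line looks like:
        name size refcount used_by state address [taint flags]
    where used_by is "-" when no other module depends on it.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, _size, refcount, used_by = fields[:4]
        if refcount == "0" and used_by == "-":
            names.append(name)
    return sorted(names)


def blacklist_lines(names):
    """Format names as 'blacklist <module>' lines for a modprobe.d file."""
    return "\n".join(f"blacklist {name}" for name in names)
```

A possible use is `print(blacklist_lines(unused_modules(open("/proc/modules").read())))`, then copying only the entries you are sure about into a file under /etc/modprobe.d/, one or two at a time, so a change in stability can be attributed to a specific module.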

Also, this looks ominous:

Apr 29 07:49:48 lore kernel: This system BIOS has enabled interrupt remapping
                             on a chipset that contains an erratum making that
                             feature unstable.  To maintain system stability
                             interrupt remapping is being disabled.  Please
                             contact your BIOS vendor for an update

I redouble my suggestion that you pare down the loaded modules to try to isolate this.

Also, I see that it checks the microcode:

Apr 29 07:49:48 lore kernel: microcode: CPU0 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU1 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU2 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU3 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU4 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU5 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU6 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU7 sig=0x106a5, pf=0x1, revision=0x11

But I don't see it being updated. That might be normal for your processor, but double check that, and make sure the microcode update is set up correctly in your boot loader.
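If you want to check this from a saved journal, something like the following sketch would do. It collects the revision each CPU reports and any "updated early" messages. The two patterns are based only on the log lines quoted in this thread; the wording differs between kernel versions, so they are an assumption and may need adjusting. The name `microcode_summary` is made up for the example.

```python
import re

# Patterns modelled on the journal lines quoted in this thread (assumed format).
REPORTED = re.compile(
    r"microcode: CPU(\d+) sig=(0x[0-9a-f]+), pf=(0x[0-9a-f]+), revision=(0x[0-9a-f]+)"
)
UPDATED = re.compile(r"CPU(\d+) microcode updated early to revision (0x[0-9a-f]+)")


def microcode_summary(journal_text):
    """Return (reported, updated): dicts mapping CPU number to revision (int)."""
    reported = {}
    updated = {}
    for line in journal_text.splitlines():
        match = REPORTED.search(line)
        if match:
            reported[int(match.group(1))] = int(match.group(4), 16)
            continue
        match = UPDATED.search(line)
        if match:
            updated[int(match.group(1))] = int(match.group(2), 16)
    return reported, updated
```

An empty `updated` dict for a boot suggests no early update was applied during that boot, which is either expected (nothing newer available for that CPU) or a sign the boot loader is not loading the microcode image.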


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
The shortest way to ruin a country is to give power to demagogues.— Dionysius of Halicarnassus
---
How to Ask Questions the Smart Way


#5 2015-04-29 15:31:09

bjcubsfan
Member
From: Oklahoma City, OK, USA
Registered: 2012-05-04
Posts: 32

Re: Hard lock ups multiple times a day - How can I troubleshoot?

I don't believe I get the kernel panic blinking lights, but I'll double check next time.

It never unlocks. Every morning when I come in it is locked, and no input gets through. I don't know how long it has been locked by then, but on at least one of the past 10 days or so it had surely been locked for a long time.
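One way to find out when the machine actually froze is to leave a small heartbeat running that appends a timestamp to a file and syncs it to disk; after the forced reboot, the last line gives the lockup time to within one interval. This is only a sketch under my own assumptions: the function names, the log path in the usage note, and the 10 second interval are arbitrary choices, not anything from the journal.

```python
import os
import time


def write_heartbeat(path, timestamp=None):
    """Append one timestamp line to path and force it to disk."""
    if timestamp is None:
        timestamp = time.time()
    line = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the line survives a hard lockup
    return line


def last_heartbeat(path):
    """Return the last timestamp line, or None if the file is missing or empty."""
    try:
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None


def run(path, interval_seconds=10, count=None):
    """Write a heartbeat every interval_seconds; count=None runs until interrupted."""
    written = 0
    while count is None or written < count:
        write_heartbeat(path)
        written += 1
        if count is None or written < count:
            time.sleep(interval_seconds)
```

For example, start `run("/var/tmp/heartbeat.log")` in a terminal before leaving for the day, then call `last_heartbeat("/var/tmp/heartbeat.log")` after the next reboot. Comparing that time against the journal may show what was running just before the freeze.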

The machine was previously stable. I was considering rolling back the nvidia or kernel or both to before the update I mentioned.

I will look into upgrading my BIOS and making sure the microcode update works with my boot loader.

I will also start removing unnecessary modules to see if that helps.

Thanks.


#6 2015-04-29 15:58:53

bjcubsfan
Member
From: Oklahoma City, OK, USA
Registered: 2012-05-04
Posts: 32

Re: Hard lock ups multiple times a day - How can I troubleshoot?

I have the latest BIOS version installed, so I guess I am out of luck on that point.

I was able to get the microcode to update on boot, so now I get:

Apr 29 10:49:48 lore kernel: CPU0 microcode updated early to revision 0x19, date = 2013-06-21

I will try paring down the modules before I go home tonight to see if that stabilizes the system.
