My machine locks up hard multiple times a day. If I happen to be sitting in front of it, I observe that the image on the monitors stops changing. Mouse and keyboard will not work at all (including Magic key SysRq sequences). I cannot ping the box from another machine. The system has not made it to 24 hours of uptime for some weeks.
I am beginning to suspect that this may be an issue with the NVIDIA driver, but that's only a guess since it's the only thing I've seen before that can lock up a machine hard like this. I have checked journalctl and the Xorg logs, but don't see anything telling.
What should be my next steps toward solving this issue?
Did you enable Magic key SysRq sequences?
Where's the rest of the X log?
The journal has
Apr 29 07:49:54 lore dkms[938]: Error! Could not locate dkms.conf file.
Apr 29 07:49:54 lore dkms[938]: File: does not exist.
Apr 29 07:49:54 lore dkms[938]: /usr/sbin/dkms: line 1857: echo: write error: Broken pipe
...
Apr 29 07:49:54 lore rpc.idmapd[1000]: rpc.idmapd: Skipping configuration file "/etc/idmapd.conf": No such file or directory
Apr 29 07:49:54 lore kernel: nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Apr 29 07:49:54 lore rpc.idmapd[1017]: Unable to open '/proc/sys/fs/nfs/idmap_cache_timeout' to set client cache expiration time to 0 seconds
...
Apr 29 07:49:54 lore rpc.mountd[1023]: Could not make a socket: (97) Address family not supported by protocol
Apr 29 07:49:54 lore rpc.mountd[1023]: Could not make a socket: (97) Address family not supported by protocol
...
Apr 29 07:49:54 lore rm[1112]: /usr/bin/rm: cannot remove ‘/etc/dfs/sharetab’: No such file or directory
No idea what it means.
The journal mentions NFS and ZFS; tell us more about your setup.
It looks like I didn't have the Magic SysRq key sequences fully enabled (kernel.sysrq was set to 16), so I have enabled them now. I will report after the next lockup whether they work.
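For reference, kernel.sysrq is a bitmask, and a value of 16 allows only the sync command, which would explain why the other key sequences did nothing. Below is a minimal sketch for decoding a value. The bit meanings are as I understand them from the kernel's SysRq documentation, and the function name sysrq_decode is made up for illustration:

```shell
#!/bin/sh
# Decode a kernel.sysrq value into the SysRq functions it allows.
# sysrq_decode is a hypothetical helper, not part of any package.
sysrq_decode() {
    value=$1
    if [ "$value" -eq 0 ]; then echo "disabled"; return; fi
    if [ "$value" -eq 1 ]; then echo "all functions enabled"; return; fi
    # Any other value is treated as a bitmask of allowed function groups.
    for pair in "2:console log level" "4:keyboard control (SAK, unraw)" \
                "8:process debugging dumps" "16:sync" \
                "32:remount read-only" "64:signal processes" \
                "128:reboot/poweroff" "256:renice real-time tasks"; do
        bit=${pair%%:*}     # number before the first colon
        name=${pair#*:}     # description after the first colon
        if [ $(( value & bit )) -ne 0 ]; then echo "$name"; fi
    done
}

# Typical use: sysrq_decode "$(cat /proc/sys/kernel/sysrq)"
```

To allow everything on the running system, `sysctl kernel.sysrq=1` as root should do it; a line `kernel.sysrq = 1` in a file under /etc/sysctl.d/ keeps the setting across reboots.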
I have ZFS running for a large non-root data store. I think I am married to the NVIDIA binary drivers because I am using a 4k monitor over HDMI as my primary display.
I believe that is all of the X log. It also looks very similar to the current X log.
I am not sure about the dkms errors, but I think dkms is only needed for VirtualBox.
bpotter@lore➜ /var/log» dkms status [9:58:55]
vboxhost, 4.2.0: added
vboxhost, 4.2.2: added
vboxhost, 4.3.26: added
I could probably get rid of NFS if that is an issue. I don't serve NFS, but I do mount one.
/etc/dfs may be zfs related, but there are no files there.
I did some research and noticed that my problems likely began after the upgrade where these two things (and many others) happened:
extra/nvidia 346.47-3 3 -> 10
core/linux 3.18.6-1 -> 3.19.2-1
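One way to pin down when the suspect packages changed is to filter the pacman log for their upgrade entries. This is only a sketch: pkg_upgrades is a made-up helper name, and it assumes log lines of the form `[date time] [TAG] upgraded NAME (OLD -> NEW)`, so adjust the pattern if your log looks different:

```shell
#!/bin/sh
# Print upgrade entries for the given packages from a pacman log on stdin.
# pkg_upgrades is a hypothetical helper; it assumes log lines shaped like
#   [date time] [TAG] upgraded NAME (OLD -> NEW)
pkg_upgrades() {
    # Join the package names into an alternation: "linux|nvidia"
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    # Requiring " (" right after the name keeps linux-headers from matching linux.
    grep -E "\] upgraded ($pattern) \("
}

# Typical use: pkg_upgrades linux nvidia < /var/log/pacman.log
```

Package names are pasted into the regular expression as they are, so a name containing characters such as `+` would need escaping first.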
When it locks up, do any indicators on the keyboard blink? That would be an indication of a kernel panic.
If there are no blinking lights, does it unlock eventually?
Has the machine ever been stable?
Have you tried installing a RAM test in your boot loader and running it?
Next, I would start blacklisting any kernel module you don't absolutely need to see if you can isolate it that way.
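If it helps, here is a rough sketch of generating the modprobe.d entries for a list of modules. The helper name make_blacklist and the output file name are made up; which modules you can live without is something only you can judge from the lsmod output:

```shell
#!/bin/sh
# Emit modprobe.d lines for each module name given as an argument.
# make_blacklist is a hypothetical helper; the module names are up to you.
make_blacklist() {
    for mod in "$@"; do
        # "blacklist" only stops alias-based automatic loading...
        printf 'blacklist %s\n' "$mod"
        # ...so also make modprobe run /bin/false instead of loading it.
        printf 'install %s /bin/false\n' "$mod"
    done
}

# Typical use (as root), with a file name of your choosing:
#   make_blacklist pcspkr > /etc/modprobe.d/debug-lockup.conf
```

Removing the file and rebooting restores the previous behavior, so it is easy to test a few modules at a time.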
Also, this looks ominous:
Apr 29 07:49:48 lore kernel: This system BIOS has enabled interrupt remapping
on a chipset that contains an erratum making that
feature unstable. To maintain system stability
interrupt remapping is being disabled. Please
contact your BIOS vendor for an update
I redouble my suggestion that you pare down loaded modules to try to isolate this.
Also, I see that it looks at the microcode:
Apr 29 07:49:48 lore kernel: microcode: CPU0 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU1 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU2 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU3 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU4 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU5 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU6 sig=0x106a5, pf=0x1, revision=0x11
Apr 29 07:49:48 lore kernel: microcode: CPU7 sig=0x106a5, pf=0x1, revision=0x11
But I don't see it being updated. That might be normal for your processor, but double check that and ensure you have the microcode update stuff installed in the boot loader correctly.
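A quick way to compare revisions before and after changing the boot loader setup is to pull them out of the kernel log. This is a sketch that assumes the log lines look like the ones quoted above; microcode_revisions is a made-up helper name:

```shell
#!/bin/sh
# Read kernel log lines on stdin and print each distinct microcode
# revision reported in "microcode: CPUn sig=..., revision=0x.." lines.
# microcode_revisions is a hypothetical helper name.
microcode_revisions() {
    sed -n 's/.*microcode: CPU[0-9]*.*revision=\(0x[0-9a-fA-F]*\).*/\1/p' | sort -u
}

# Typical use: journalctl -k -b | microcode_revisions
```

If the update is applied, the revision printed here (or shown by `grep -m1 microcode /proc/cpuinfo`) should be higher than the 0x11 in the log above.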
I don't believe I get the kernel panic blinking lights, but I'll double check next time.
It never unlocks. Every morning when I come in it is locked, and I can't get anything to go through. I don't know how long it has been locked at that point, but on at least one of the past 10 days or so it had surely been locked for a long time.
The machine was previously stable. I was considering rolling back the nvidia or kernel or both to before the update I mentioned.
I will look into upgrading my BIOS and ensuring that the microcode update works with my boot loader.
I will also start removing unnecessary modules to see if that helps.
Thanks.
I have the latest BIOS version installed, so I guess I am out of luck on that point.
I was able to get the microcode to update on boot so now I get:
Apr 29 10:49:48 lore kernel: CPU0 microcode updated early to revision 0x19, date = 2013-06-21
I will try paring down the modules before I go home tonight to see if that stabilizes the system.