Odd Deadlock issue - What logging can I enable to investigate?

darkfoon · 2013-11-10 23:54:40

Hello,

I have an Arch box that I leave running 24/7 to host some video games and run folding@home (to heat the room).

Lately, an issue that I find alarming has started and has become more frequent. The box will become completely "dead" to the outside until I hit the reset button.

The network port is sending out some kind of data when this happens, since I can see the light on the router port that the computer is connected to blinking as if there is constant traffic on the wire.
When I try to connect over SSH, the connection times out.

I carried a monitor over to the box and plugged it in along with a USB keyboard, but the monitor wouldn't wake from sleep and the USB keyboard wasn't initialized (no lights).

It has gotten to the point where it happens every night now.

I don't know what these symptoms mean. I don't see anything in the logs that indicates what is going on. I've installed mcelog as a daemon and it isn't logging anything.

I'm looking for what kind of logging I can enable to help diagnose the issue.

Full disclosure: the machine is overclocked, but it's stable and has more than enough cooling for the job (even with F@H running on all threads, the core temp is a solid 50C after 6 hours). I saw a machine check event once and installed mcelog just in case.
I saw a Kernel Oops in the logs earlier this week when F@H crashed a thread handler and went into an infinite loop.

Last edited by darkfoon (2013-11-17 20:08:39)

schmidtbag · 2013-11-12 18:57:19

Try a live CD that contains a few of the programs you regularly run and see if that'll work overnight. Also, I'd highly recommend running memtest and a S.M.A.R.T. test. I'm thinking you've got a defective stick of RAM.

darkfoon · 2013-11-15 08:54:58

When you mentioned a SMART test, I checked my logs for smartd, and I see a lot of these entries every few hours:

Nov 14 23:06:28 Server5 smartd[334]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 53 to 54
Nov 14 22:36:29 Server5 smartd[334]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 54 to 53
Nov 14 22:36:29 Server5 smartd[334]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 44 to 43

The error rates/recovered values keep going back and forth between 53 and 55 in 1 step increments.
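To check whether values like these are only oscillating or are actually trending in one direction, the smartd lines can be summarized per attribute. The following is a minimal Python sketch, not part of smartd itself: the regular expression is written against the three sample lines above, and the names `CHANGE_RE` and `summarize` are made up for illustration.

```python
import re
from collections import defaultdict

# Sample lines copied from the smartd output quoted above.
SAMPLE = """\
Nov 14 23:06:28 Server5 smartd[334]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 53 to 54
Nov 14 22:36:29 Server5 smartd[334]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 54 to 53
Nov 14 22:36:29 Server5 smartd[334]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 44 to 43
"""

# Assumption: every change message follows the shape of the samples above.
CHANGE_RE = re.compile(
    r"Device: (?P<device>\S+) .*?"
    r"SMART (?P<kind>Prefailure|Usage) Attribute: "
    r"(?P<attr_id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)


def summarize(log_text):
    """Return {(device, attribute): (lowest, highest, change_count)}."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        match = CHANGE_RE.search(line)
        if match:
            key = (match["device"], match["name"])
            seen[key].extend([int(match["old"]), int(match["new"])])
    # Each matched line contributes two values (old and new).
    return {key: (min(v), max(v), len(v) // 2) for key, v in seen.items()}


for (device, name), (low, high, count) in summarize(SAMPLE).items():
    print(f"{device} {name}: range {low}-{high} over {count} change(s)")
```

Fed a longer excerpt (for example, saved journal output for smartd), a narrow range with many changes would suggest oscillation, while a range that keeps widening would be more worrying.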

I've turned off all the servers and just have F@H running with only 6 threads (half as many as usual). The disk IO is significantly reduced at this point. And it hasn't "crashed" once in 4 days now.
I'll run memtest for a day this weekend, but based on this so far, it looks to me like the hard disk drive is the problem.
Your thoughts?

Last edited by darkfoon (2013-11-15 09:06:53)

schmidtbag · 2013-11-15 14:14:33

I'm not sure if those errors are normal or not, or if they're even errors (last one doesn't seem to be). I don't know everything that smartd offers, but there's a tool called gsmartcontrol that can show you a table of diagnostics your particular drive has, along with the results that you currently have and the results you SHOULD have. It will highlight any diagnostic as pink if there's a potential issue, and anything red means you need to replace the drives *now*. You don't need to run any tests to view the tables. Below is a sample screenshot of what you DON'T want to see:
http://gsmartcontrol.berlios.de/home/im … ailing.png

Hard drives are tricky. Like anything mechanical, once they show signs of failure, their failure rate will multiply.

Last edited by schmidtbag (2013-11-15 14:15:09)

darkfoon · 2013-11-17 20:07:35

I ran a long SMART test and ran memtest86+ for over 10 hours, and neither produced any errors. The SMART diagnostic says that the drive is HEALTHY, and memtest found 0 errors.
I'm at a loss for the cause of this issue.

Is there any extra logging or diagnostics that you can think of enabling? Can I enable a feature in the kernel to have it print a kernel panic to a log or dump core (not that I would really know how to debug it)?

At this point, the only thing I can think of is that somebody found an exploit in one of the game servers I am running and used it to crash the OS. The issue appears to have gone away when I turned off the game servers, so that supports this hypothesis. However, that seems fairly difficult since I run all the servers as unprivileged users that don't have a shell. They would have to find a way to get a root shell from a non-root user.

Last edited by darkfoon (2013-11-17 20:09:35)

schmidtbag · 2013-11-17 20:32:26

Well, considering you said this problem happens every night, what I would do is enable another one (and only one) of your services at the same time every night. Eventually, as your services accumulate, you should encounter the failure. When that happens, the culprit is likely one of the last 2 services you enabled. So when it fails, disable the last 2 services and continue enabling any remaining services you haven't gotten to yet. Once you have enabled the rest of your services, it's probably best to wait a few days just to make sure you've narrowed down your problem. Then, if nothing bad happens, re-enable only one of those 2 services and see if it still fails. If it doesn't fail, then you have your conclusion: the other one is to blame.
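The procedure above can be sketched as a small Python simulation. This is only an illustration under a strong assumption: exactly one service is responsible, and the failure shows up the same night that service is running. The function `narrow_down` and the callback `fails_overnight` are hypothetical names; in practice "calling" `fails_overnight` means leaving the box running overnight and checking it in the morning.

```python
from typing import Callable, List, Optional, Sequence


def narrow_down(
    services: Sequence[str],
    fails_overnight: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return the suspected service, or None if nothing conclusive is found."""
    enabled: List[str] = []
    suspects: List[str] = []

    # Enable one more service per night until the first failure.
    for service in services:
        enabled.append(service)
        if not suspects and fails_overnight(enabled):
            # Blame the last 2 services enabled, and switch them off again.
            suspects = enabled[-2:]
            del enabled[-2:]

    if not suspects:
        return None  # every service enabled, no failure seen
    if len(suspects) == 1:
        return suspects[0]  # it failed with only one service running

    # Soak period: everything except the suspects should now be stable.
    if fails_overnight(enabled):
        return None  # still failing, so the suspects were not the cause

    # Re-enable only one suspect; if it stays up, the other one is to blame.
    first, second = suspects
    return first if fails_overnight(enabled + [first]) else second


# Hypothetical example: pretend the service named "c" is the culprit.
print(narrow_down(["a", "b", "c", "d"], lambda running: "c" in running))
```

If the failure can lag by more than one night, the suspect window would need to be wider than 2 services.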

At this point, I'm not getting the impression this is a hardware or kernel problem. The kernel could possibly be panicking but if the problem went away due to disabling one of your services, it's really just a matter of narrowing down which service it is.

darkfoon · 2014-01-09 09:55:09

I tried what you suggested with the services and I had no problems. After 2 months of continuous uptime, I was confident the issue had been a misconfiguration in one of the servers (I increased security in the meantime as a precaution).
That is, until today, sometime after I unplugged my external USB 3.0 hard disk drive. I tried to log into the machine and I ran into the same issue where the network port was busy, but I couldn't connect to the machine.

At this point, I am back to thinking this might be a hardware issue. Perhaps a problem with the USB host on the system. Or the driver for that USB hard disk. It loads the SES module when I plug it into the system.
I have noticed that once I connect the drive to the system and disconnect it, I may not be able to reconnect it to the system until I reboot. I have this same issue on a different machine running Windows 8.

darkfoon · 2014-01-09 10:16:28

Just looked in the logs and I found these most recent entries. (journalctl -r -b -1)

Jan 08 21:11:58 Server5 kernel: BUG: unable to handle kernel paging request at 00000000fffea313
Jan 08 21:05:52 Server5 smartd[328]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 49 to 51
Jan 08 21:05:52 Server5 smartd[328]: Device: /dev/sda [SAT], 9 Offline uncorrectable sectors
Jan 08 21:05:52 Server5 smartd[328]: Device: /dev/sda [SAT], 9 Currently unreadable (pending) sectors

In looking at just kernel entries, I find over 31 of these (journalctl -r -k -b -1):

Jan 08 01:13:52 Server5 kernel: ata2: EH complete
Jan 08 01:13:52 Server5 kernel: end_request: I/O error, dev sda, sector 4017206
Jan 08 01:13:52 Server5 kernel: cdb[0]=0x28: 28 00 00 3d 4c 36 00 00 08 00
Jan 08 01:13:52 Server5 kernel: sd 1:0:0:0: [sda] CDB:
Jan 08 01:13:52 Server5 kernel: ASC=0x11 ASCQ=0x4
Jan 08 01:13:52 Server5 kernel: sd 1:0:0:0: [sda]
Jan 08 01:13:52 Server5 kernel:         00 3d 4c 36
Jan 08 01:13:52 Server5 kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jan 08 01:13:52 Server5 kernel: Descriptor sense data with sense descriptors (in hex):
Jan 08 01:13:52 Server5 kernel: Sense Key : 0x3 [current] [descriptor]
Jan 08 01:13:52 Server5 kernel: sd 1:0:0:0: [sda]
Jan 08 01:13:52 Server5 kernel: Result: hostbyte=0x00 driverbyte=0x08
Jan 08 01:13:52 Server5 kernel: sd 1:0:0:0: [sda]
Jan 08 01:13:52 Server5 kernel: sd 1:0:0:0: [sda] Unhandled sense code
Jan 08 01:13:52 Server5 kernel: ata2.00: configured for UDMA/133
Jan 08 01:13:52 Server5 kernel: ata2.00: error: { UNC }
Jan 08 01:13:52 Server5 kernel: ata2.00: status: { DRDY ERR }
Jan 08 01:13:52 Server5 kernel: ata2.00: cmd c8/00:08:36:4c:3d/00:00:00:00:00/e0 tag 0 dma 4096 in
                                           res 51/40:00:36:4c:3d/00:00:37:4c:3d/e0 Emask 0x9 (media error)
Jan 08 01:13:52 Server5 kernel: ata2.00: failed command: READ DMA
Jan 08 01:13:52 Server5 kernel: ata2.00: irq_stat 0x40000001
Jan 08 01:13:52 Server5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

This looks like my disk drive is dying.
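To make the kernel output above easier to read, here is a minimal Python sketch that decodes the two hex dumps from the log. It assumes the standard SCSI READ(10) layout for the CDB (opcode 0x28, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8) and descriptor-format sense data (response code 0x72, followed by sense key, ASC and ASCQ). The function names are made up for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a 10-byte SCSI READ(10) command descriptor block."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # starting sector
        "blocks": int.from_bytes(cdb[7:9], "big"),  # number of sectors
    }


def decode_descriptor_sense(sense_hex):
    """Decode the first 4 bytes of descriptor-format sense data."""
    sense = bytes.fromhex(sense_hex)
    if sense[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    return {
        "sense_key": sense[1] & 0x0F,
        "asc": sense[2],
        "ascq": sense[3],
    }


# Hex dumps copied from the kernel log above.
cdb = decode_read10("28 00 00 3d 4c 36 00 00 08 00")
sense = decode_descriptor_sense(
    "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00"
)
print(cdb)
print(sense)
```

The decoded LBA is 4017206, the same sector named in the `end_request: I/O error` line, and 8 blocks would match the 4096-byte transfer in the `READ DMA` line if the sectors are 512 bytes. Sense key 0x3 is MEDIUM ERROR, and ASC/ASCQ 0x11/0x04 is listed in the SCSI tables as an unrecovered read error where auto-reallocation failed, which is consistent with the `{ UNC }` and `(media error)` flags in the same log.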

Last edited by darkfoon (2014-01-09 10:17:32)

Arch Linux
