You are not logged in.
I've got some really weird behavior with my home file server that started early yesterday. It first presented itself with irssi being super slow (occasionally not responded) and I kept getting server disconnects. I poked around journalctl and found this:
pr 15 22:21:01 gaia.haiti systemd[1]: systemd-journald.service watchdog timeout!
Apr 15 22:22:32 gaia.haiti systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Apr 15 22:22:32 gaia.haiti systemd[1]: systemd-journald.service: main process exited, code=killed, status=10/USR1
Apr 15 22:22:32 gaia.haiti systemd[1]: Unit systemd-journald.service entered failed state.
Apr 15 22:22:32 gaia.haiti systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Apr 15 22:22:32 gaia.haiti systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
Apr 15 22:22:32 gaia.haiti systemd[1]: systemd-journald.service: main process exited, code=killed, status=10/USR1
Apr 15 22:22:32 gaia.haiti systemd[1]: Unit systemd-journald.service entered failed state.
Apr 15 22:22:32 gaia.haiti systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Apr 15 22:22:32 gaia.haiti systemd[1]: systemd-journald.service: main process exited, code=killed, status=10/USR1
Apr 15 22:22:32 gaia.haiti systemd[1]: Unit systemd-journald.service entered failed state.
Apr 15 22:33:52 gaia.haiti systemd-journal[8929]: Forwarding to syslog missed 8 messages.
Apr 15 22:33:53 gaia.haiti systemd-journal[8929]: Journal stopped
Apr 15 22:33:53 gaia.haiti systemd-journal[9057]: Permanent journal is using 1.3G (max allowed 1.3G, trying to leave 11.7M free of 33.9G available → current limit 1.3G).
Apr 15 22:33:53 gaia.haiti systemd-journald[8929]: Received SIGTERM from PID 1 (systemd).
Apr 15 22:33:53 gaia.haiti systemd[1]: Unit systemd-journald.service entered failed state.
Apr 15 22:33:53 gaia.haiti systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
Apr 15 22:33:53 gaia.haiti systemd[1]: Stopping Journal Service...
Apr 15 22:33:53 gaia.haiti systemd[1]: Starting Journal Service...
Apr 15 22:33:53 gaia.haiti systemd[1]: Started Journal Service.
Apr 15 22:33:53 gaia.haiti systemd-journal[9057]: Journal started
Which looks a little strange but didn't explain the irssi behavior. On a whim I turned up the front fan on the server as I had seen a couple alerts about the disk temp hitting 40C, within a few minutes munin spewed a bunch of alerts, primarily OK alerts around disk latency which is a little strange. That said, I keep getting alerts about eth0 errors being unknown/a high number.
This morning I looked at munin and there are huge gaps in the charts, 1-2 hours which is really strange. Looking in the munin logs I see errors about connecting to the tcp port. Anyone know what's going on? I've got munin configured to use CGI for the graph generation.
While I was typing this I lost network connectivity to the server (ssh/http) for 5-10 minutes, after I got back in I saw the systemd messages above. Scrolling through my buffer I see the following error appeared regularly yesterday:
[1279012.741565] INFO: task nginx:32084 blocked for more than 120 seconds.
[1279012.741616] Not tainted 3.13.7-1-ARCH #1
[1279012.741648] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1279012.741701] nginx D 00000343 0 32084 32083 0x00000000
[1279012.741706] c0f7fe7c 00200086 c0b4e930 00000343 00000000 c16aca40 2b3cbea6 00048ae4
[1279012.741712] c16aca40 f4c7c2c0 c0faac80 c0faac80 c0f7fe40 00054878 00000000 c0f7ff14
[1279012.741717] 00000000 00000000 c0b4e930 00000000 c0bfe42c c0bfe490 000005a8 000005a8
[1279012.741722] Call Trace:
[1279012.741731] [<c1453443>] schedule+0x23/0x60
[1279012.741736] [<c14537c4>] schedule_preempt_disabled+0x14/0x20
[1279012.741739] [<c1455725>] __mutex_lock_slowpath+0x105/0x330
[1279012.741742] [<c1455960>] mutex_lock+0x10/0x1c
[1279012.741747] [<c1104870>] generic_file_aio_write+0x40/0x90
[1279012.741763] [<c1156467>] do_sync_write+0x57/0x90
[1279012.741767] [<c1156410>] ? do_sync_read+0x90/0x90
[1279012.741770] [<c1156c25>] vfs_write+0x95/0x1c0
[1279012.741773] [<c11572f1>] SyS_write+0x51/0x90
[1279012.741777] [<c145d40d>] sysenter_do_call+0x12/0x28
Looking on google suggests this is related to resource starvation.
Output of free -m:
total used free shared buffers cached
Mem: 3234 3111 123 0 3 63
-/+ buffers/cache: 3044 189
Swap: 2047 1047 1000
I do have 8GB installed but it's i686 (legacy system) and so far my attempts to build a kernel with PAE have failed.
Edit: I stopped subsonic which was occupying about 500M Virtual RAM and munin is updating more reliably.
Edit Edit: Still happened. Installed sysstat and it looks like things are still being blocked. Also shows average memory utilization at 95%
Linux 3.13.7-1-ARCH (gaia.haiti) 04/16/2014 _i686_ (4 CPU)
10:27:33 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
10:37:33 AM 1 408 1.63 1.74 2.13 0
10:47:33 AM 1 409 1.28 1.39 1.76 0
10:57:38 AM 0 408 2.11 1.70 1.75 4
11:07:38 AM 0 403 11.43 8.34 5.02 14
11:17:38 AM 0 421 15.65 11.50 7.96 16
11:34:57 AM 1 440 29.46 35.64 25.85 4
11:44:57 AM 0 399 16.61 18.16 20.32 3
Average: 0 413 11.17 11.21 9.26 6
Update: Looks to be directly related to Memory utilization, followed this guide on diagnosing high load and appears the system is swapping a bit. I expected it to handle things a little better since I'm using a SSD (the only hardware change in a few months).
Last edited by Ashex (2014-04-16 10:03:31)
Offline