A lot of zombies

Llama · 2017-05-19 04:12:10

Hi,

I've been experiencing a zombie invasion. Not quite an apocalypse, but a routine occurrence. The apps involved: TeXstudio, gVim, Okular; Google Chrome and even Kdesu, very occasionally. The only thing vaguely in common is xelatex. That is, I use it daily, and the apps involved, the editor and the viewer, sometimes turn into zombies. It happens once or twice a day; a minor annoyance so far. The telltale sign is always the same: "can't write output.pdf", worded differently depending on the IDE/plugin.

Google is not aware of similar issues. Is there a way to investigate the problem?

nerolucas · 2017-05-30 12:58:16

I have the exact same problem on my home laptop. I noticed it only happens when using kernel 4.10 or newer (I hoped that 4.11 would fix the issue, but nope). The interesting thing is, I made a new installation on the laptop I use at work and it doesn't seem to occur there. I suspect it happens only on Optimus setups. Is that your case?

As a suggestion to fix your problem, use the linux-lts kernel. It fixed the problem for me.

ratcheer · 2017-05-30 13:19:59

I always see zombies when running btrfs scrub. But, the commands run to completion, successfully, and then the zombies go away. It doesn't seem to be a problem.

Tim

nerolucas · 2017-05-30 23:02:21

It's not exactly the same thing... Those zombie processes can't be killed, and shutting down the system fails because of said process, in my case Google Chrome.

seth · 2017-05-31 07:23:30

Zombies ("D") wait for an ioctl response. Check dmesg for I/O errors.
If you've no remote filesystems involved (smb, nfs), this is most likely either your RAM or local HDD, so run a SMART test and prepare for a backup. If the disk seems good, test your RAM.

https://wiki.archlinux.org/index.php/S.M.A.R.T.
https://www.archlinux.org/packages/extr … emtest86+/
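
To see which processes are actually stuck, one option is to read the state letter from /proc directly. The following is only a minimal Python sketch; the function names `parse_stat` and `stuck_processes` are made up for illustration. It relies on the documented /proc/<pid>/stat layout, where the state letter follows the parenthesised command name; `ps` uses the same letters, with "D" for uninterruptible sleep and "Z" for a defunct (zombie) process, so the sketch reports both.

```python
import os


def parse_stat(stat_text):
    """Split one /proc/<pid>/stat record into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren].strip())
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def stuck_processes(proc_root="/proc", states=("D", "Z")):
    """Return sorted (pid, comm, state) tuples for processes in the given states."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state in states:
            found.append((pid, comm, state))
    return sorted(found)
```

Calling `stuck_processes()` on an affected machine and printing the result should show the hung applications. Note that this looks only at the per-process record; per-thread records live under /proc/<pid>/task/ and are not inspected here.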

mwohah · 2017-05-31 17:01:08

I'm experiencing this as well.

I don't think it is due to faulty RAM or a local HDD, as I have fairly fresh hardware where everything was running peachy before, and it seems too much of a coincidence if other users are experiencing the same thing. Now that nerolucas mentioned it, the issues did start happening around the time I upgraded to 4.10. Upgrading to 4.11 did not solve the issue for me either.

A lot of processes that previously shut down just fine when they crashed are now suddenly becoming zombie processes. A good example of this is Firefox: when it freezes, GNOME will show the "This app is not responding" dialog after a couple of seconds. You just request to terminate the app and that's that; you can restart it and get on with your business. Now, if I request that the app be terminated, nothing happens and the Firefox window will remain stuck. What's worse, the dialog will close, have no effect, and then show up again a couple of seconds later. If you discard it, GNOME will start throwing it in your face on other virtual desktops as well.

As a result, the GUI of Firefox is now effectively stuck and unusable. The only actual solution is to reboot. You can try moving it to another desktop and ignoring it or even removing the window with xkill, which will ditch the window, but the app will remain present as a defunct process in your process list. You can't close it and you can't start Firefox again either, as it will complain that an instance is already running. In other words: can't kill it, can't close it, can't restart it, can only reboot.

As other users posted before, the same happens for other applications as well. I'm experiencing the same issue with, for example, Geary, which force quit just fine before.

Has anyone reported this, or does anyone have a link to a related issue for the kernel (if this even is a kernel issue)? I tried taking a look at journalctl after the crash occurs, but it literally shows nothing about the crash or freeze. Logs before and after the time of occurrence completely omit any information about it.

seth · 2017-05-31 17:09:20

strace such an app and see what it does when freezing; maybe you can spot a pattern in the actual ioctl.

Also, you should try to find patterns in your setups, i.e. "lspci", "lsusb", "lscpu" and "loginctl session-status".

mwohah · 2017-05-31 18:00:48

I'll see if I can get an strace. I'll have to try and do this before I do anything with the "The app is not responding" dialog, since any attempt to do it after "killing" it seems to be too late: strace won't attach to the process since it's already defunct.

In the interest of finding similarities: I'm running an Asus ROG laptop, have an Intel Kaby Lake CPU, using the NVIDIA proprietary driver and kernel 4.11. Also have an NVMe Samsung SSD that Arch is running from. I can post the remainder of the commands, but this seemed to be the most relevant data as the rest is mostly noise I guess.

It almost seems as if the process is correctly killed by GNOME, but no one is reaping the process after it's been killed. For the example of Geary, I've had the same crashes before this problem started occurring, but I could always kill the app without it becoming defunct. This is the reason I'm suspicious of something other than the app itself; the app is of course buggy as it is crashing, but the reaping or "crash recovery", if you will, doesn't seem to be working properly either.

seth · 2017-05-31 20:36:19

The two most interesting aspects would be whether this is an optimus/nvidia system (see post #2) and whether you're all on gnome (some daemon causing deadlocks on the journal coredumps)

Speaking of which: what if you redirect coredumps to files, or disable them altogether?
https://wiki.archlinux.org/index.php/Core_dump

mwohah · 2017-06-01 16:23:02

Regarding similarities to the setup of the other users:

I'm not using Optimus or dual-graphics (my device only exposes the discrete NVIDIA GPU).
I am indeed using GNOME (with GDM and all other things GNOME as well).
I happen to have coredumps completely disabled via systemd (since the start of my installation).
I'm running F2FS on my root partition (in case this is some weird filesystem-related issue).
I'm running ecryptfs on my home folder.
I'm not using any custom builds or AUR versions of GNOME or any other core packages.

nerolucas · 2017-06-01 17:27:50

From your list:

- I also have an Nvidia GPU, but it happens regardless of whether I'm using the Nvidia or Intel GPU.
- I'm also on F2FS.
- I'm not using GNOME, but KDE Plasma.

I'm betting it's a problem related to the filesystem.

mwohah · 2017-06-02 16:40:53

I've been using F2FS for a couple of years now, but only for my home folder until I got my new machine a couple of months back (after which I activated it for my entire root filesystem). Haven't had any issues with it so far, but they may well have broken something related to it in kernel 4.10.

It would be interesting to hear if Llama is also on F2FS.
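
For anyone who wants to check quickly whether they are on F2FS, here is a minimal Python sketch that filters the contents of /proc/mounts by filesystem type. The helper name `mounts_of_type` is made up for illustration; the only assumption about the input is the usual whitespace-separated layout of device, mount point and type.

```python
def mounts_of_type(mounts_text, fs_type="f2fs"):
    """Return (device, mount_point) pairs whose filesystem type matches."""
    matches = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] == fs_type:
            matches.append((fields[0], fields[1]))
    return matches
```

On a running system the input would come from reading /proc/mounts, e.g. `mounts_of_type(open("/proc/mounts").read())`; passing `"ecryptfs"` as the second argument lists stacked ecryptfs mounts in the same way.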

This may also be related (or not, but I thought I'd share it to see if anyone encountered the same): I started using docker about a month ago, when I was already on kernel 4.10. I've encountered strange issues where docker and/or its containers get into a state in which they become completely unresponsive. In other words, any command sent to the docker daemon just freezes until it times out about 60 seconds later. This appears to happen completely randomly: sometimes it works fine for days, then MySQL queries start freezing and docker becomes unresponsive as a result; sometimes even trying to start the docker service via systemctl is enough to make it freeze.

I tried switching from UNIX sockets to TCP as well as switching between overlay2 and devicemapper, but nothing helped. I had been blaming this on some hardware or software corner case, since most issues of this type are resolved on the docker bug tracker, but it may very well be related to the same problem: the docker daemon would also become defunct after it was killed when this happened (only a reboot helped). Below is one of the logs I grabbed from when that happened. F2FS is also mentioned in there, but it could just be the backend that happened to be active during the backtrace (not necessarily indicating a problem with F2FS):

kernel: INFO: task dockerd:3061 blocked for more than 120 seconds.
kernel:       Tainted: P        W  O    4.10.11-1-ARCH #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dockerd         D    0  3061      1 0x00000000
kernel: Call Trace:
kernel:  __schedule+0x22f/0x700
kernel:  schedule+0x3d/0x90
kernel:  schedule_timeout+0x243/0x3d0
kernel:  ? ttwu_do_activate+0x6f/0x80
kernel:  ? try_to_wake_up+0x18d/0x3c0
kernel:  wait_for_common+0xbe/0x180
kernel:  ? wake_up_q+0x80/0x80
kernel:  wait_for_completion+0x1d/0x20
kernel:  f2fs_issue_flush+0x160/0x1b0 [f2fs]
kernel:  f2fs_do_sync_file+0x46f/0x730 [f2fs]
kernel:  f2fs_sync_file+0x11/0x20 [f2fs]
kernel:  vfs_fsync_range+0x4b/0xb0
kernel:  do_fsync+0x3d/0x70
kernel:  SyS_fsync+0x10/0x20
kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa9
kernel: RIP: 0033:0x4967b4
kernel: RSP: 002b:000000c4211ad2b8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
kernel: RAX: ffffffffffffffda RBX: 00000000000000d0 RCX: 00000000004967b4
kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001c
kernel: RBP: 000000c4211ad170 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
kernel: R13: 0000000000000014 R14: 0000000000000f70 R15: 0000000000000027

Apart from that, how should we proceed next? We could create an issue on the kernel bug tracker, but I'm afraid that this isn't much to go on without a reliable way to reproduce the zombie creation.

seth · 2017-06-02 19:23:50

schedule() is *tremendously* suspicious to be related to a mutex.
I'd bet your right arm that the backtrace of other zombies looks similar.

mwohah · 2017-06-02 20:32:06

I think I can give you my left arm as well; here are a couple of other backtraces:

kernel: INFO: task evolution-addre:9530 blocked for more than 120 seconds.
kernel:       Tainted: P        W  O    4.10.13-1-ARCH #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: evolution-addre D    0  9530   3535 0x00000000
kernel: Call Trace:
kernel:  __schedule+0x22f/0x700
kernel:  schedule+0x3d/0x90
kernel:  schedule_timeout+0x243/0x3d0
kernel:  ? ttwu_do_activate+0x6f/0x80
kernel:  ? try_to_wake_up+0x18d/0x3c0
kernel:  wait_for_common+0xbe/0x180
kernel:  ? wake_up_q+0x80/0x80
kernel:  wait_for_completion+0x1d/0x20
kernel:  f2fs_issue_flush+0x160/0x1b0 [f2fs]
kernel:  f2fs_do_sync_file+0x46f/0x730 [f2fs]
kernel:  f2fs_sync_file+0x11/0x20 [f2fs]
kernel:  vfs_fsync_range+0x4b/0xb0
kernel:  ? filemap_fdatawait_range+0x1e/0x30
kernel:  vfs_fsync+0x1c/0x20
kernel:  ecryptfs_fsync+0x34/0x40 [ecryptfs]
kernel:  vfs_fsync_range+0x4b/0xb0
kernel:  do_fsync+0x3d/0x70
kernel:  SyS_fsync+0x10/0x20
kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa9

kernel: INFO: task systemd-journal:10954 blocked for more than 120 seconds.
kernel:       Tainted: P        W  O    4.10.13-1-ARCH #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: systemd-journal D    0 10954      1 0x00000104
kernel: Call Trace:
kernel:  __schedule+0x22f/0x700
kernel:  ? sched_clock+0x9/0x10
kernel:  schedule+0x3d/0x90
kernel:  schedule_timeout+0x243/0x3d0
kernel:  ? ttwu_do_activate+0x6f/0x80
kernel:  ? try_to_wake_up+0x18d/0x3c0
kernel:  wait_for_common+0xbe/0x180
kernel:  ? wake_up_q+0x80/0x80
kernel:  wait_for_completion+0x1d/0x20
kernel:  f2fs_issue_flush+0x160/0x1b0 [f2fs]
kernel:  f2fs_do_sync_file+0x46f/0x730 [f2fs]
kernel:  f2fs_sync_file+0x11/0x20 [f2fs]
kernel:  vfs_fsync_range+0x4b/0xb0
kernel:  do_fsync+0x3d/0x70
kernel:  SyS_fsync+0x10/0x20
kernel:  do_syscall_64+0x54/0xc0
kernel:  entry_SYSCALL64_slow_path+0x25/0x25

I can find a couple of these related to dockerd, evolution-addre(ss...), systemd-journal and dconf-service. I didn't even know about these other cases up until now, as they weren't directly visible.
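
For comparing logs across machines, the hung-task reports above can be summarised mechanically. This is only a rough Python sketch, assuming the journal text has the same line layout as the excerpts in this thread; the names `summarize_hung_tasks` and `first_module_frame` are hypothetical. Frames prefixed with "?" are skipped, since the kernel marks those as unreliable stack entries.

```python
import re

# "INFO: task <comm>:<pid> blocked for more than <n> seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than \d+ seconds"
)
# "kernel:  [? ]<function>+0x<offset>/0x<size> [<module>]"
FRAME_RE = re.compile(
    r"kernel:\s+(?P<maybe>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_hung_tasks(log_text):
    """Collect one dict per hung-task report: comm, pid and reliable frames."""
    reports = []
    current = None
    for line in log_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {"comm": hung["comm"], "pid": int(hung["pid"]), "frames": []}
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Ignore frames outside a report and frames marked unreliable ("?").
        if frame and current is not None and not frame["maybe"]:
            current["frames"].append((frame["func"], frame["module"]))
    return reports


def first_module_frame(report):
    """Return the innermost (function, module) frame that lives in a module."""
    for func, module in report["frames"]:
        if module:
            return func, module
    return None
```

Fed the traces posted in this thread, `first_module_frame` should point at `f2fs_issue_flush` in the `f2fs` module for each report, which is the common element between them.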

As mentioned before, there is nothing in the logs about the times I had this issue with Firefox; perhaps something went so haywire that it couldn't even be logged, which is strange to say the least.

The first occurrence of these issues is from May 11th; at that time I was on kernel 4.10.13, which I had been on since about May 2nd.

@nerolucas, could you take a gander at your logs to see if you have similar issues? This seems to point more and more towards an F2FS issue.

seth · 2017-06-02 21:20:43

https://bugs.archlinux.org/task/53663

loqs · 2017-06-03 20:03:10

Someone affected should probably report the issue upstream. Has anyone tried linux mainline 4.12-rc3 to see if the issue is still present?

mwohah · 2017-06-04 19:20:26

Posted upstream at https://bugzilla.kernel.org/show_bug.cgi?id=195983

Ernestosparalesto · 2017-06-09 11:07:51

Using linux 4.9.9-2, the bug is not present.
Spotted using npm.
I'm on F2FS too.

Last edited by Ernestosparalesto (2017-06-09 11:08:26)
