You are not logged in.

#1 2016-12-06 01:55:43

Leonid.I
Member
From: Aethyr
Registered: 2009-03-22
Posts: 999

[SOLVED] FYI: rpcbind 0.2.4 and multiple systemd failures

As the topic says, this is an FYI in hope that someone finds this useful. It took me sometime to un-break my system tongue This is not a bug in rpcbind (at least not only), nor in any ArchLinux packaging, but rather a race-condition issue (or several issues) in systemd. This issue does not show up in a VM or container, only on the real hardware. Even though it is almost 100% reproducible on my machine, I already found a workaround and have no interest in reporting it to systemd upstream and dealing with their insanities...

My setup is a NIS client and NFS server. It fetches various maps from a NIS master via ypbind and exports a bunch of /home's via NFSv4. The relevant packages are ypbind-mt, nfs-utils and their dependency rpcbind. With v0.2.4, the latter includes rpcbind.socket that containes

ListenStream=0.0.0.0:111
ListenDatagram=0.0.0.0:111
ListenStream=[::]:111
ListenDatagram=[::]:111

in addition to

ListenStream=/run/rpcbind.sock

which was the only entry present in earlier versions of rpcbind. rpcbind.socket is WantedBy= by nfs-server (i.e. rpcbind.service is socket-activated) but ypbind.service Requires= rpcbind.service. The latter also Requires= rpcbind.socket. So, the socket unit gets pulled in anyways.

systemd bug #1

It turns out, that on my system, ypbind always comes up before NFS. Due to a mysterious reason, an update rpcbind (0.2.3-3 -> 0.2.4-1) causes rpcbind.service to FAIL when pulled by ypbind.service. The latter also fails to start of course. Interestingly, rpcbind later gets socket-activated by nfs-server.service. Of course, when starting things manually by systemctl show no such failure.

systemd bug #2

This observation made me change ypbind.service to require rpcbind.socket (instead of service), which fixed my NIS setup, but instead made logind fail in at least 4 times out of 5 (measured by about 50 boot ups):

$ cat /var/log/errors.log
...
2016-12-04T00:54:13.000-07:00 takahe systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Connection timed out
2016-12-04T00:54:13.000-07:00 takahe systemd[1]: Failed to register name: Connection timed out
2016-12-04T00:54:13.000-07:00 takahe systemd[1]: Failed to set up API bus: Connection timed out
2016-12-04T00:54:38.000-07:00 takahe systemd-logind[2434]: Failed to add match for JobRemoved: Connection timed out
2016-12-04T00:54:38.000-07:00 takahe systemd-logind[2434]: Failed to fully start up daemon: Connection timed out
2016-12-04T00:54:38.000-07:00 takahe systemd[1]: Failed to start Login Service.
2016-12-04T00:58:14.000-07:00 takahe systemd-logind[2500]: Failed to enable subscription: Failed to activate service 'org.freedesktop.systemd1': timed out
2016-12-04T00:58:14.000-07:00 takahe systemd-logind[2500]: Failed to fully start up daemon: Connection timed out
2016-12-04T00:58:14.000-07:00 takahe systemd[1]: Failed to start Login Service.
...
$ cat /var/log/auth.log
...
2016-12-04T00:58:14.000-07:00 takahe dbus[2385]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
2016-12-04T00:58:14.000-07:00 takahe systemd-logind[2500]: Failed to enable subscription: Failed to activate service 'org.freedesktop.systemd1': timed out
2016-12-04T00:58:14.000-07:00 takahe systemd-logind[2500]: Failed to fully start up daemon: Connection timed out
...
$ cat /var/log/daemon.log
...
2016-12-04T00:58:06.000-07:00 takahe dbus[2385]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
2016-12-04T00:58:18.000-07:00 takahe dbus[2385]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30004ms)
...

At this point, systemd is totally borked and so is the machine. Adding systemd.journald.forward_to_console=1 in kernel cmdline revealed a super-helpful message that logind is "Looping too fast".

Fix
Somehow, the above issues have to do with the fact that we now ask systemd to handle all rpcbind sockets. This is not needed since rpcbind is fully capable of handling them. So, a simple fix that avoids changing lots of NFS and NIS-related units amounts to dropping the old rpcbind.socket in /etc/systemd/system:

$ systemctl status rpcbind.socket
● rpcbind.socket - RPCbind Server Activation Socket
   Loaded: loaded (/etc/systemd/system/rpcbind.socket; disabled; vendor preset:
disabled)
   Active: active (running) since Sun 2016-12-04 22:57:41 MST; 19h ago
   Listen: /var/run/rpcbind.sock (Stream)
$  cat /etc/systemd/system/rpcbind.socket
[Unit]
Description=RPCbind Server Activation Socket

[Socket]
ListenStream=/var/run/rpcbind.sock

[Install]
WantedBy=sockets.target

Yes, I know it's a workaround, but at least my system now boots smile Again, this is a race issue with the systemd parallel startup, and I was lucky to somehow run into it.

Last edited by Leonid.I (2016-12-06 02:00:07)


Arch Linux is more than just GNU/Linux -- it's an adventure
pkill -9 systemd

Offline

Board footer

Powered by FluxBB