I have an Infiniband network connecting a cluster of 20 machines, and I currently mount several NFS drives over that network.
I mount the drives using a simple fstab (mount options are defaults,vers=3,bg,hard,intr), and I connect the network using a netctl service file:
Description='Whole Cluster Inifiniband Network'
Interface=ib0
Connection=ethernet
IP=static
Address=('192.168.1.21/24')
DNS=('192.168.1.1')
This unit works fine post-boot, but at boot it fails every time. I think it is failing because the ib0 device is not ready yet. I load Infiniband modules using the file /etc/modules-load.d/infiniband:
ib_ipoib
rdma_ucm
The problem is that at boot, netctl fails about 20% of the time, and if it fails, the kernel still attempts to mount the drives and I end up with an unresponsive drive.
My guess is that the best bet is to drop /etc/fstab and /etc/modules-load.d and switch to a 100% systemd approach: a systemd service file loads the modules, the netctl service file depends on that service so that it waits for it to complete before starting, and the drive mount service file waits in the same way.
I have a few questions though:
- How do you make a netctl service file wait on a systemd unit? That seems non-trivial to me.
- Is this even the best way of fixing this problem, or is there something simpler that I can do?
Any other advice is also appreciated.
Thanks!
Home Page: www.michaeldacre.com
Lab: Hunter Fraser's Lab
GPG key: E76370D6
An enabled netctl profile creates a systemd service, which AFAICT has some sort of dependency on the device:
BindsToInterfaces=()
An array of physical network interfaces that this profile needs before it can be
started. For ‘enabled’ profiles, systemd will wait for the presence of the specified
interfaces before starting a profile. If this variable is not specified, it defaults to
the value of Interface.
$ cat /etc/systemd/system/netctl@net1.service
.include /usr/lib/systemd/system/netctl@.service
[Unit]
Description=Data port net1
BindsTo=sys-subsystem-net-devices-net1.device
After=sys-subsystem-net-devices-net1.device
Is there nothing in your journal or systemctl status to give you a more concrete idea of why the profiles are failing?
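On the question of making a netctl profile wait on another systemd unit: since the enabled profile is an ordinary systemd service, one possible approach is a drop-in that adds ordering on top of the generated BindsTo/After lines. The sketch below is only an illustration. The helper name write_netctl_dropin, the file name wait-for-modules.conf, and the DROPIN_ROOT override are all made up for this example, and ordering after systemd-modules-load.service only covers module loading; it does not guarantee that ib0 has a link.

```shell
# Hypothetical helper: write a drop-in for netctl@<profile>.service that
# orders the profile after the unit which processes /etc/modules-load.d.
# DROPIN_ROOT is an assumed override for dry runs; the default is the
# usual location for administrator drop-ins.
write_netctl_dropin() {
    profile=$1
    dir="${DROPIN_ROOT:-/etc/systemd/system}/netctl@$profile.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/wait-for-modules.conf" <<'EOF'
[Unit]
Wants=systemd-modules-load.service
After=systemd-modules-load.service
EOF
}

# Example (as root), followed by a reload so systemd picks up the file:
#   write_netctl_dropin Cluster_Infiniband && systemctl daemon-reload
```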
Last edited by alphaniner (2015-09-03 18:58:57)
But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner
Sure, here you go:
Oct 01 15:05:19 node16 systemd[1]: Starting Whole Cluster Inifiniband Network...
Oct 01 15:05:19 node16 network[424]: Starting network profile 'Cluster_Infiniband'...
Oct 01 15:05:24 node16 network[424]: No connection found on interface 'ib0' (timeout)
Oct 01 15:05:24 node16 network[424]: Failed to bring the network up for profile 'Cluster_Infiniband'
Oct 01 15:05:24 node16 systemd[1]: netctl@Cluster_Infiniband.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 15:05:24 node16 systemd[1]: Failed to start Whole Cluster Inifiniband Network.
Oct 01 15:05:24 node16 systemd[1]: netctl@Cluster_Infiniband.service: Unit entered failed state.
Oct 01 15:05:24 node16 systemd[1]: netctl@Cluster_Infiniband.service: Failed with result 'exit-code'.
Not very helpful, sadly. It just tells me that there is no connectivity on the Infiniband connection, which happens because the network interface is not ready yet. I know that netctl is supposed to wait for the device, but that is not happening in this case, and I am not sure why. Ideally I would use a systemd unit to manage the Infiniband device and test for connectivity first, and have the netctl profile depend on it. Right now I manually run a simple shell script post-boot that runs 'netctl start Cluster_Infiniband' followed by a series of mount commands. It is not sophisticated at all, but I just can't get netctl to start the Infiniband connection reliably at boot.
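The "test for connectivity first" idea could be sketched roughly as follows. This is an assumption-laden illustration, not the script actually in use: the function name wait_for_carrier and the SYSFS_NET override are invented for the example, and it relies on the kernel exposing link state in /sys/class/net/<interface>/carrier. Reading that file can fail while the interface is administratively down, so the interface would need to be brought up first.

```shell
# Poll an interface's carrier flag once per second until it reads 1
# or the timeout (in seconds, default 30) expires.
# SYSFS_NET is a hypothetical override so the function can be tried
# against a fake directory; by default it reads /sys/class/net.
wait_for_carrier() {
    iface=$1
    timeout=${2:-30}
    carrier_file="${SYSFS_NET:-/sys/class/net}/$iface/carrier"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Errors are silenced because the read may fail while the link is down.
        if [ "$(cat "$carrier_file" 2>/dev/null)" = "1" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Example (as root):
#   ip link set ib0 up
#   wait_for_carrier ib0 60 && netctl start Cluster_Infiniband
```

The function returns 0 as soon as a carrier is seen and 1 on timeout, so the mount commands can be chained after it with && and skipped when the link never comes up.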
Maybe more helpful than you realize. It times out because no connection is found. Look into the options TimeoutUp and TimeoutCarrier. I can't really distinguish between the two; from my limited understanding, "no carrier" and "network down" are synonymous.*
Alternatively, as a workaround, you could change your NFS mounts to noauto and add "SkipNoCarrier=yes" to the profile. This would result in the profile starting regardless of connectivity.
* In fact, the description of "SkipNoCarrier" in netctl.profile(5) defines a carrier as a "plugged-in cable"...
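As a rough illustration of the two suggestions, these are the kinds of lines that could be appended to the profile (netctl profiles use shell variable syntax). The value 30 is an arbitrary assumption, not a tested recommendation. The five-second gap between "Starting network profile" and the timeout in the journal above suggests a short default is in effect.

```shell
# Additions to /etc/netctl/Cluster_Infiniband (values are assumptions)

# Wait longer for the link before giving up, in seconds.
TimeoutCarrier=30
TimeoutUp=30

# Workaround alternative: start the profile even without a carrier.
# If used, the NFS entries in fstab would also get the noauto option.
#SkipNoCarrier=yes
```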