Possible workaround for stuck cifs mounts (?)

kokoko3k · 2017-02-10 12:33:38

Please see the latest work here: https://bbs.archlinux.org/viewtopic.php … 3#p1898333

From time to time I run into stuck cifs connections, which is really irritating.
If the system serving the samba/cifs shares goes down, the client stays stuck while trying to access the inaccessible share.

Some solutions are described here:
http://stackoverflow.com/questions/7462 … to-unmount

But they don't always work: lazy unmount can help, but blocked processes remain blocked, and force-removing the cifs kernel module will almost certainly lead to an unstable system.
This is very annoying, because the whole blocked process has to be killed; no recovery is possible.

So what I tried is the following:

# mount.cifs //unstable_server/share /well_hidden_folder/unstable_server
# sshfs user@localhost:/well_hidden_folder/unstable_server /mnt/unstable_server_share

Here I "wrapped" the cifs mount with a fuse mount (sshfs); any app that needs data on the unstable server share uses the wrapped mount instead.

So if a simple "ls" needs to access the share, it will go through the following:

ls <-> sshfs <-> cifs <-/../-> Unstable Samba Server

So if the share becomes inaccessible, the only process that stays hard-blocked is sshfs; but since sshfs runs in userspace, unlike cifs, it can be killed, and doing so "frees" any process accessing it.

At this point, recovering "ls" will be easy with:

killall sshfs;sleep 1;killall -9 sshfs
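
The same "ask politely, then force" sequence can be wrapped in a small helper that works on a single PID. This is only a sketch: the name kill_gracefully and the one-second default grace period are illustrative choices, not something sshfs requires.

```shell
# kill_gracefully PID [GRACE_SECONDS]
# Hypothetical helper: ask the process to exit (SIGTERM), wait a little,
# then force it (SIGKILL) only if it is still there.
kill_gracefully() {
    pid=$1
    grace=${2:-1}
    # If the process is already gone there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    sleep "$grace"
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null
    fi
    return 0
}
```

It could be called once per sshfs process, for example: for p in $(pgrep -x sshfs); do kill_gracefully "$p"; done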

I know there is a big performance hit here from fuse and the ssh protocol, so I was asking myself if there is a totally dumb fuse module that would just "remount" part of the filesystem elsewhere (edit: found! http://bindfs.org/).
It would be even better if it had a "timeout" mount parameter that would simply give up accessing the share after a specified time (<- this could probably be achieved via autofs).

-EDIT-
Found bindfs.
With it you can also just "overlay+wrap" the mounted share in place, which is nice:

bindfs /path/to/mounted/cifs/share /path/to/mounted/cifs/share

Afterwards, processes accessing /path/to/mounted/cifs/share will talk to fuse (bindfs) instead of talking to cifs directly.
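
To confirm which filesystem actually answers on a path after such an overlay, the mount table can be consulted. The sketch below is an illustration only: fstype_of is a made-up name, and the optional second argument exists just so the function can be tried against a sample file instead of /proc/mounts.

```shell
# fstype_of MOUNTPOINT [MOUNTS_FILE]
# Prints the filesystem type of the topmost mount on MOUNTPOINT.
# MOUNTS_FILE defaults to /proc/mounts.
# Assumption: the mountpoint contains no spaces (the kernel escapes them
# as \040 in /proc/mounts, which this sketch does not decode).
fstype_of() {
    awk -v mp="$1" '
        $2 == mp { t = $3 }          # later entries shadow earlier ones
        END { if (t == "") exit 1; print t }
    ' "${2:-/proc/mounts}"
}
```

On the overlaid path this should report a fuse type rather than cifs; it returns non-zero if nothing is mounted there.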

Dumb benchmarks with cached data don't seem to show any performance drop:

#time dd if=/mnt/bindfs/bigfile of=/dev/null
304085855 bytes (304 MB, 290 MiB) copied, 2,59298 s, 117 MB/s
real    0m2,598s
user    0m0,050s
sys     0m0,150s

#time dd if=/mnt/real_cifs_mount/bigfile of=/dev/null
304085855 bytes (304 MB, 290 MiB) copied, 2,62401 s, 116 MB/s
real    0m2,641s
user    0m0,060s
sys     0m0,240s
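
For comparing runs, the throughput can be recomputed from the byte count and elapsed time that dd reports. A small sketch; throughput_mbs is a made-up name, and it accepts the decimal comma that appears in the output above.

```shell
# throughput_mbs BYTES SECONDS
# Prints throughput in MB/s (10^6 bytes per second), one decimal place.
# Accepts "2,59298" as well as "2.59298".
throughput_mbs() {
    secs=$(printf '%s' "$2" | tr ',' '.')
    LC_ALL=C awk -v b="$1" -v s="$secs" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}
```

With the figures above, throughput_mbs 304085855 2,59298 prints 117.3 (bindfs) and throughput_mbs 304085855 2,62401 prints 115.9 (plain cifs), so the two are within noise of each other.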

Do you think this is a valid workaround?

Last edited by kokoko3k (2020-04-15 09:21:10)

kokoko3k · 2017-02-15 12:43:52

Yes, shameless bump.
I'd like to add this info to the wiki, but first I'd like to know whether anybody thinks it is a good way to solve the problem described above, or whether there are downsides I'm not thinking of.

-EDIT-
Here is a script that unmounts the share as soon as it becomes unavailable. It first mounts the cifs share to a location nobody should access, then mounts it over an accessible place via bindfs.
Then it checks the bindfs mount periodically via "ls"; if ls does not respond within a defined timeout, it first kills the bindfs process and then unmounts the cifs share. It repeats this in a while loop; seems to work so far.

#!/bin/bash

# smbbind.sh share                   mountpoint
# smbbind.sh "//SERVER/SHARE" "/mnt/somewhere"

check_every=30   #Seconds
timeout_after=5  #Seconds
restart_after=30 #Seconds

hide_path=~/.smbbind/
cifs_opts="rw,cache=strict,username=nt_username,password=nt_password,domain=nt_domain,uid=0,noforceuid,gid=0,noforcegid,file_mode=0755,dir_mode=0755,nounix,serverino,rsize=61440,wsize=57344,actimeo=60,_netdev"

share="$1"
bind_mountpoint="$2"

function clean {
	#Umount bindfs, next umount cifs, then exit.
	echo [..] Cleaning...
	#The only open handle opened on the cifs share should be the bindfs one.
	echo [..] forcing umounting bindfs on "$bind_mountpoint"
	kill -9 $bindfs_pid
	umount -f "$bind_mountpoint"
	echo [OK] Done.
	# Umounted bindfs, cifs should umount without problems
	echo [..] Forcing umounting cifs on "$cifs_mountpoint"
	umount -f "$cifs_mountpoint"
	umount -l "$cifs_mountpoint"
	echo [OK] Done cleaning.
}

function finish {
        echo exiting...
	clean
	# Reset the traps so the exit below does not call finish again.
	trap - INT TERM EXIT
	exit
}

trap finish INT TERM EXIT

#Prepare environment
    mkdir -p "$hide_path" &>/dev/null
    cifs_mountpoint=$hide_path/$(echo "$bind_mountpoint"|tr '/' '_')
    mkdir -p "$cifs_mountpoint" &>/dev/null
    mkdir -p "$bind_mountpoint" &>/dev/null

while true ; do
       
    #Mount things:
        echo [..] mounting cifs "$share" on "$cifs_mountpoint"
        if timeout 10 mount.cifs "$share" "$cifs_mountpoint"  -o $cifs_opts ; then
            echo [OK] mounted cifs "$share" on "$cifs_mountpoint"

            echo [..] mounting bind "$cifs_mountpoint" on "$bind_mountpoint"
            bindfs "$cifs_mountpoint" "$bind_mountpoint"
            #Getting the pid of bindfs is tricky.
            #awk is used instead of cut because ps pads the pid column with leading spaces.
            bindfs_pid=$(ps -eo pid,args|grep bindfs |grep "$cifs_mountpoint" |grep "$bind_mountpoint" |grep -v grep |awk '{print $1}')
            echo [OK] mounted bind "$cifs_mountpoint" on "$bind_mountpoint"
            
            #Check them
                echo [OK] Start Main check cycle, will check every $check_every seconds...
                while true ; do
                    if ! timeout -k 1 $timeout_after ls "$bind_mountpoint" &>/dev/null ; then
                        echo no answer from bindfs for "$bind_mountpoint"
                        clean
                        break
                    #else
                        #echo Share is alive
                    fi
                    sleep $check_every
                done

        else
            echo [!!] Cannot mount "$share" on "$cifs_mountpoint"
        fi
                
    echo [..] Waiting $restart_after seconds: $(date) 
    sleep $restart_after

done
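
The liveness check at the heart of the loop can be pulled out into a function of its own, which makes it easy to try by hand. A minimal sketch; is_alive and its default timeout are illustrative, not part of the script above.

```shell
# is_alive PATH [TIMEOUT_SECONDS]
# Returns 0 if listing PATH finishes within the timeout, non-zero otherwise.
# -k 1 sends SIGKILL one second after the initial SIGTERM if ls ignores it.
is_alive() {
    timeout -k 1 "${2:-5}" ls "$1" >/dev/null 2>&1
}
```

Note that a missing or unreadable directory also counts as "not alive" here, since ls fails in that case too.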

Last edited by kokoko3k (2017-05-24 11:43:53)

kokoko3k · 2020-04-15 09:06:34

I made a step forward and decided to integrate this with systemd instances; it seems to work nicely.
Basically, I mount what I need, then I rebind it and check whether it is alive.
When it is not, I umount everything with -l.
My real mounts live under /mnt/.bindfs, while the bound ones are under /mnt/bindfs.
When I need to use a share, I use the latter, the one handled by bindfs, which can be killed/unmounted without much trouble because it lives in userspace.

Seems to work flawlessly so far; thanks corona.

The New service running:

koko@slimer# systemctl status bindfsit@gozer.cfg
● bindfsit@gozer.cfg.service - Binds filesystems and recovers from hangs using config gozer.cfg
   Loaded: loaded (/etc/systemd/system/bindfsit@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-04-27 10:48:43 CEST; 5min ago
 Main PID: 5403 (bindfsit.sh)
    Tasks: 3 (limit: 4915)
   Memory: 2.1M
   CGroup: /system.slice/system-bindfsit.slice/bindfsit@gozer.cfg.service
           ├─5403 /bin/bash /home/koko/scripts/bindfsit.sh /etc/bindfsit/gozer.cfg
           ├─5407 bindfs -u koko /mnt/.bindfs/gozer /mnt/bindfs/gozer
           └─6804 sleep 10

The service itself:

koko@slimer# cat /etc/systemd/system/bindfsit\@.service 
[Unit]
Description=Binds filesystems and recovers from hangs using config %I

[Service]
Type=simple

#Config files live in /etc/bindfsit/
ExecStart=/home/koko/scripts/bindfsit.sh /etc/bindfsit/"%I"

[Install]
WantedBy=default.target
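
Since the unit is a template, each config file becomes one instance named after the file. A small sketch of that mapping; unit_for_config is a hypothetical helper, not something shipped with the service.

```shell
# unit_for_config CONFIG_PATH
# Prints the bindfsit@ instance unit name for a config file.
# Assumption: the file name needs no systemd escaping (letters, digits,
# "." and "_" only). As far as I know %I would turn a "-" into "/",
# so for other names systemd-escape should be used instead.
unit_for_config() {
    printf 'bindfsit@%s.service\n' "$(basename "$1")"
}
```

An instance could then be enabled and started with: systemctl enable --now "$(unit_for_config /etc/bindfsit/gozer.cfg)"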

The main script:

#!/bin/bash

#binds the fs mounted on $real_mountpoint to $bind_mountpoint and when $real_mountpoint hangs, executes $recover_cmd and rebinds.
#if mount_cmd is not empty it will be executed when script starts.

#$1 is the configuration file, sourced by me with (eg):
	#check_every=10 #seconds
	#timeout_after=40 #seconds
	#restart_after=5 #seconds
	#real_mountpoint=/mnt/.bindfs/.myshare1
	#bind_mountpoint=/mnt/bindfs/myshare1
	#mount_cmd="mount.cifs //pi/all $real_mountpoint -o rsize=131072,wsize=131072"
	#recover_cmd="umount -l $real_mountpoint ; $mount_cmd"
	#user=koko # Makes all files owned by the specified user.
               # Also causes chown on the mounted filesystem to always fail.


source "$1" || exit 1

# It is handy to bind the whole autofs tree, i use it that way:

#koko@slimer# cat /etc/systemd/system/bindfs_autofs.service 
#[Unit]
#Description=Binds the autofs tree and recover from stalls
#
#[Service]
#Type=simple
#ExecStartPre=systemctl start autofs
#ExecStart=/home/koko/scripts/bindfs_it.sh /mnt/autofs.real /mnt/autofs "umount -l /mnt/autofs.real/*"
#
#[Install]
#WantedBy=default.target

echo $(basename $0) config:
echo config file="$1"
echo check_every="$check_every"
echo timeout_after="$timeout_after"
echo restart_after="$restart_after"
echo real_mountpoint="$real_mountpoint"
echo bind_mountpoint="$bind_mountpoint"
echo mount_cmd="$mount_cmd"
echo recover_cmd="$recover_cmd"
echo user="$user"

#Make mountpoints:
if [ ! -d "$real_mountpoint" ] ; then mkdir -p "$real_mountpoint"  || exit 1 ; fi
if [ ! -d "$bind_mountpoint" ] ; then mkdir -p "$bind_mountpoint"  || exit 1 ; fi

#mount things?
if [ ! -z "$mount_cmd" ] ; then
	while ! sh -c "$mount_cmd" ; do sleep $restart_after ; done
fi


function clean {
	#Umount bindfs, next execute "$recover_cmd", then exit.
	echo [..] Cleaning...
	#The only open handle opened on the cifs share should be the bindfs one.
	echo [..] forcing umounting bindfs on "$bind_mountpoint"
	kill -9 $bindfs_pid
	umount -f "$bind_mountpoint"
	echo [OK] Done.
	# execute recover command
	echo [..] Execute: "$recover_cmd"
	sh -c "$recover_cmd"
	echo [OK] Done cleaning.
}

function finish {
    echo exiting...
	clean
	umount -l "$real_mountpoint"
	# Reset the traps so the exit below does not call finish again.
	trap - INT TERM EXIT
	exit
}

trap finish INT TERM EXIT

#Prepare environment
    mkdir -p $bind_mountpoint &>/dev/null

while true ; do
    #Mount things:
    echo [..] binding "$real_mountpoint" to "$bind_mountpoint"
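    # -u is passed only when $user is set, so a blank user= in the config works.
    bindfs ${user:+-u "$user"} "$real_mountpoint" "$bind_mountpoint"
    #Getting the pid of bindfs is tricky: options such as "-u user" sit between "bindfs" and the paths, so allow for them in the pattern.
    bindfs_pid=$(ps -eo pid,args|grep "bindfs .*$real_mountpoint $bind_mountpoint" |grep -vi grep |awk '{print $1}')
    echo [OK] mounted bind "$real_mountpoint" on "$bind_mountpoint", pid: "$bindfs_pid"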

    echo [OK] Start Main check cycle, will check every $check_every seconds...
    while true ; do
		#fixme: can we use stat and not ls here?
        if ! timeout -k 1 $timeout_after ls "$bind_mountpoint" &>/dev/null ; then 
            echo no answer from bindfs for "$bind_mountpoint"
            clean
            break
                #else
            #echo "$(date) Share is alive"
        fi
            sleep $check_every
    done

    echo [..] Waiting $restart_after seconds: $(date)
    sleep $restart_after
done
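
The script sources its config and prints the values, but it does not check that the required ones are present. A possible check, to be called right after the config is sourced, might look like this; require_vars is a made-up helper and the list of names passed to it is up to the caller.

```shell
# require_vars NAME...
# Returns non-zero and reports the first variable that is unset or empty.
# The names are expected to be plain shell identifiers chosen by the
# script author, since they are expanded through eval.
require_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing config value: $name" >&2
            return 1
        fi
    done
}
```

For example: require_vars check_every timeout_after restart_after real_mountpoint bind_mountpoint || exit 1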

An example config file:

koko@slimer# cat /etc/bindfsit/gozer.cfg 
#bindfs_it configuration file:

myownhost=gozer

check_every=10          #While share is alive, check every #seconds
timeout_after=40        #share is not alive if it does not answer within #seconds
restart_after=60        #When share does not respond, check every #seconds

real_mountpoint=/mnt/.bindfs/$myownhost         #Real mountpoint
bind_mountpoint=/mnt/bindfs/$myownhost          #Bound mountpoint

mount_cmd="sshfs root@$myownhost:/ $real_mountpoint -o nodev,noatime,allow_other,max_read=65536,follow_symlinks,IdentityFile=/root/.ssh/id_rsa,port=22"

recover_cmd="umount -l $real_mountpoint ; $mount_cmd" #What to do when share does not answer

#User ownership (can be blank, fixme in the script)
user=koko # Makes all files owned by the specified user.
              # Also causes chown on the mounted filesystem to always fail.
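
One way to handle a blank user= is to add "-u USER" to the bindfs arguments only when a user is given. The sketch below just prints the arguments so the logic can be checked in isolation; bindfs_args is an illustrative name.

```shell
# bindfs_args USER REAL_MOUNTPOINT BIND_MOUNTPOINT
# Prints the bindfs arguments, one per line; "-u USER" is included only
# when USER is non-empty.
bindfs_args() {
    if [ -n "$1" ]; then
        printf '%s\n' -u "$1"
    fi
    printf '%s\n' "$2" "$3"
}
```

In bash the same effect is available inline with the expansion ${user:+-u "$user"}, which produces nothing when $user is empty.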

Last edited by kokoko3k (2020-04-27 08:56:12)

Buddlespit · 2020-04-16 16:48:48

Have you tried just mounting the cifs share via systemd mount? My main system has no problems re-acquiring the cifs mount after rebooting the server. My two satellite systems (Vero4k+) use fstab to mount the shares and have to be rebooted to re-acquire when the server reboots.

Or am I just not understanding your issue?

kokoko3k · 2020-04-17 05:55:07

Maybe it would even recover when the server comes back online, but in the meantime any process with open handles on that filesystem just hangs indefinitely or waits for a very loooong time.
When that happens on a filesystem handled at kernel level (nfs, cifs), umount -f won't work (it hangs too).
Adding a userspace layer allows me to 'free' the stuck processes.
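
To see which processes are currently stuck, one can look for the D (uninterruptible sleep) state in ps output; processes blocked on a kernel filesystem typically show up that way. A sketch under that assumption, with blocked_pids as a made-up name:

```shell
# blocked_pids
# Reads "PID STAT COMMAND" lines on stdin (as printed by
# `ps -eo pid= -o stat= -o comm=`) and prints the PIDs whose state
# starts with D.
blocked_pids() {
    awk '$2 ~ /^D/ { print $1 }'
}
```

Usage: ps -eo pid= -o stat= -o comm= | blocked_pids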

seth · 2020-04-17 06:49:25

This sounds as if you were somehow hard-mounting cifs. Don't.
"soft" is also the default according to "man mount.cifs"

kokoko3k · 2020-04-17 12:41:32

seth wrote:

This sounds as if you were somehow hard-mounting cifs. Don't.
"soft" is also the default according to "man mount.cifs"

Nope, unfortunately.

man mount.cifs wrote:

"soft (default) The program accessing a file on the cifs mounted file system will not hang when the server crashes and will return errors to the user application."

It is not a crashing issue.

client:

# mount.cifs //pi/all /mnt/sambapi  -o username=xxx,password=xxx
# while true ; do ls /mnt/sambapi ; date ; done
bin  boot  dev  downloads  etc  home  lib  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
Fri Apr 17 14:35:41 CEST 2020
bin  boot  dev  downloads  etc  home  lib  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
Fri Apr 17 14:35:42 CEST 2020
bin  boot  dev  downloads  etc  home  lib  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
[..]

Server:

# systemctl stop smbd.service
# sleep 3600
# systemctl start smbd.service

The client loop (ls) hangs for an hour and a couple of seconds.

If I try to unmount while it is hanging:

koko@slimer# sudo umount -f /mnt/sambapi
umount: /mnt/sambapi: target is busy.

If I try with -l it succeeds, but that doesn't solve the hanging ls: it is still stuck and you have to kill it, which is not nice when the client is something big like plasma.
Layering via bindfs solves that.
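
The "try -f, fall back to -l" sequence can be written as a small function that also reports which variant worked. This is just a sketch: release_mount is an illustrative name, and the UMOUNT variable is there only so another command can be substituted for umount when trying it out.

```shell
# release_mount MOUNTPOINT
# Tries a forced unmount first and falls back to a lazy one.
# UMOUNT may name a replacement command; it defaults to umount.
release_mount() {
    if "${UMOUNT:-umount}" -f "$1" 2>/dev/null; then
        echo forced
    elif "${UMOUNT:-umount}" -l "$1" 2>/dev/null; then
        echo lazy
    else
        echo failed
        return 1
    fi
}
```

As described above, a "lazy" result only detaches the mount; processes that were already blocked on it stay blocked.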

Does your system behave differently, maybe?

Last edited by kokoko3k (2020-04-17 12:45:17)

seth · 2020-04-17 12:50:48

Whether it crashes doesn't matter. This is only about the behavior when the server doesn't respond.
What's the output of "mount"?

Last edited by seth (2020-04-17 12:51:02)

kokoko3k · 2020-04-17 12:55:08

Nothing special (last line is the "test")

root@slimer# mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
dev on /dev type devtmpfs (rw,nosuid,relatime,size=8161516k,nr_inodes=2040379,mode=755)
run on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755)
/dev/sdb3 on / type ext4 (rw,noatime,nobarrier,noacl,commit=180)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=13958)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
/dev/sdb1 on /boot type ext4 (rw,noatime,nouser_xattr,noacl,commit=180)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,relatime,size=1048576k)
/dev/sdd1 on /mnt/ssd_kingston type ext4 (rw,noatime,nouser_xattr,noacl,commit=300)
/dev/sdb4 on /home type ext4 (rw,noatime,nobarrier,nouser_xattr,noacl,commit=180,stripe=32747)
/dev/sdc1 on /mnt/SSD_Sandisk_480GB_Backup type ext4 (rw,noatime)
/dev/sda2 on /mnt/rotativo type ext4 (rw,noatime,commit=180)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
/etc/autofs/auto.master.d/koko on /mnt/autofs.real type autofs (rw,relatime,fd=17,pgrp=8750,timeout=30,minproto=5,maxproto=5,indirect,pipe_ino=59784)
/mnt/autofs.real on /mnt/autofs type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=1634248k,mode=700,uid=1000,gid=100)
/etc/autofs/auto.misc on /misc type autofs (rw,relatime,fd=5,pgrp=8750,timeout=10,minproto=5,maxproto=5,indirect,pipe_ino=59770)
-hosts on /net type autofs (rw,relatime,fd=11,pgrp=8750,timeout=10,minproto=5,maxproto=5,indirect,pipe_ino=59780)
gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=100)
//pi/all on /mnt/sambapi type cifs (rw,relatime,vers=3.1.1,cache=strict,username=root,uid=0,noforceuid,gid=0,noforcegid,addr=192.168.15.16,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=131072,wsize=131072,bsize=1048576,echo_interval=60,actimeo=1)
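
The last line does list soft among the cifs options. Checking for a single option in such a comma-separated list is easy to get wrong with a plain grep, since one option name can be a substring of another, so here is a small sketch; has_mount_opt is a made-up name.

```shell
# has_mount_opt OPTIONS NAME
# OPTIONS is the comma-separated list shown in parentheses by `mount`.
# Returns 0 if NAME appears as a whole option (not as a substring).
has_mount_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, has_mount_opt "rw,relatime,soft,nounix" soft succeeds, while asking for hard on the same list fails.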

edit:
Not sure what the man page means by "hang"; I can send SIGINT or hit CTRL-C and it just exits immediately.
The client is just waiting for an answer from the filesystem, just as if I pulled a sata cable while an application was reading from the disk. IMO it is an ultra-safe but correct behaviour, and the one I'd expect, because I certainly don't want applications to give up when a medium is slow (even very slow) to respond.

Last edited by kokoko3k (2020-04-17 13:03:44)
