Hello.
I have a weird problem. Sometimes the NFS exports become unresponsive and clients that have them mounted cannot read from or write to the shares. Also, showmount -e $nfsserver shows nothing.
When I restart the nfs-server service, everything goes back to normal.
The problem happens 4-5 times a year. When it happens, I don't see any error in the logs.
I have been getting the "peername failed" message for a long time and usually it is not a problem. But when access to the shares hangs, this is the last thing I see in the log:
[Sat Jul 31 07:35:48 2021] nfsd: peername failed (err 107)!
[Sat Jul 31 16:53:19 2021] nfsd: peername failed (err 107)!
[Sat Jul 31 19:20:23 2021] nfsd: peername failed (err 107)!
[Sun Aug 1 01:04:28 2021] nfsd: peername failed (err 107)!
[Sun Aug 1 06:49:00 2021] nfsd: peername failed (err 107)!
[Sun Aug 1 15:14:25 2021] nfsd: peername failed (err 107)!
[Sun Aug 1 21:50:46 2021] nfsd: peername failed (err 107)!
[Mon Aug 2 00:24:07 2021] nfsd: peername failed (err 107)!
[Mon Aug 2 02:45:53 2021] nfsd: peername failed (err 107)!
[Tue Aug 3 00:58:00 2021] nfsd: peername failed (err 107)!
[Mon Aug 9 10:04:49 2021] nfsd: peername failed (err 107)!
[Tue Aug 10 05:40:01 2021] nfsd: recvfrom returned errno 104
(uptime 305 days)
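For anyone reading the log above, here is a minimal sketch of how the two error numbers can be decoded. The errno table is hard-coded from the Linux asm-generic errno values (104 and 107 only), and the regular expression and the function name decode are my own illustration, not anything that nfsd provides.

```python
import re

# Linux errno values (asm-generic/errno.h), hard-coded so the result does not
# depend on the platform this sketch runs on. Only the two codes from the log.
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches both message shapes seen in the log: "(err 107)!" and "errno 104".
ERR_RE = re.compile(r"nfsd: (?P<what>.+?)\s*\(?err(?:no)? (?P<num>\d+)\)?")


def decode(line):
    """Return (message, errno number, errno name, description) or None."""
    m = ERR_RE.search(line)
    if m is None:
        return None
    num = int(m.group("num"))
    name, text = LINUX_ERRNO.get(num, ("UNKNOWN", "not in this table"))
    return m.group("what"), num, name, text


sample = [
    "[Mon Aug 9 10:04:49 2021] nfsd: peername failed (err 107)!",
    "[Tue Aug 10 05:40:01 2021] nfsd: recvfrom returned errno 104",
]
for line in sample:
    print(decode(line))
```

So, as far as I can tell, the recurring message means the client socket was already disconnected when nfsd asked for the peer address, and the final one means a peer reset its connection.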
netstat -s | grep socket
222 resets received for embryonic SYN_RECV sockets
20770 packets pruned from receive queue because of socket buffer overrun
8757252 TCP sockets finished time wait in fast timer
242790 delayed acks further delayed because of locked socket
10 SYNs to LISTEN sockets dropped
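To put those counters in perspective, here is a rough sketch that turns them into per-day averages. It assumes the netstat -s counters have been accumulating since boot and that the 305 days of uptime above belongs to the same machine. The helper names parse_counters and per_day are made up for this illustration.

```python
import re


def parse_counters(text):
    """Map 'N description' lines of `netstat -s` output to {description: N}.

    Lines where the number is not the first token are skipped.
    """
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.+)", line)
        if m:
            counters[m.group(2).strip()] = int(m.group(1))
    return counters


def per_day(counters, uptime_days):
    """Average growth per day, assuming the counters started at zero at boot."""
    return {name: count / uptime_days for name, count in counters.items()}


NETSTAT_SAMPLE = """\
222 resets received for embryonic SYN_RECV sockets
20770 packets pruned from receive queue because of socket buffer overrun
8757252 TCP sockets finished time wait in fast timer
242790 delayed acks further delayed because of locked socket
10 SYNs to LISTEN sockets dropped
"""

rates = per_day(parse_counters(NETSTAT_SAMPLE), uptime_days=305)
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{rate:12.1f}/day  {name}")
```

An average hides bursts, so sampling the same counters periodically and comparing consecutive samples would probably say more about whether the pruning spikes around a hang.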
I have 50+ exports and heavy client usage. The clients vary a lot: Linux, Windows and datastore.
The NFS server's hardware is very strong: 40 cores, plenty of free RAM and a 2x40GbE LACP network.
My nfs.conf has only one uncommented line, which is "threads=128".
I see a lot of packets pruned because of socket buffer overrun, but my rmem/wmem settings are not that big.
I think I don't really understand how these buffers work.
cat /etc/sysctl.d/10-network.conf
net.core.netdev_max_backlog = 65536
net.core.somaxconn = 1048576
net.ipv4.netfilter.ip_conntrack_max = 1048576
net.nf_conntrack_max = 1048576
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_max_tw_buckets = 65536
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_timestamps = 0
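To check my own reading of the buffer settings, here is a small sketch that parses the relevant lines and prints the TCP limits in readable units. It assumes, as I understand the kernel's ip-sysctl documentation, that the three fields of tcp_rmem and tcp_wmem are the minimum, default and maximum buffer size in bytes per socket. The function names are hypothetical.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines of a sysctl.d fragment into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def tcp_buffer_limits(settings, key):
    """Split a tcp_rmem/tcp_wmem triple into min, default and max bytes."""
    lo, default, hi = (int(v) for v in settings[key].split())
    return {"min": lo, "default": default, "max": hi}


SYSCTL_SAMPLE = """\
net.core.rmem_default = 1048576
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
"""

settings = parse_sysctl(SYSCTL_SAMPLE)
for key in ("net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem"):
    limits = tcp_buffer_limits(settings, key)
    print(key, {name: f"{size / 1024:.1f} KiB" for name, size in limits.items()})
```

If that reading is right, each TCP socket starts with a receive buffer of about 85 KiB and can grow to 16 MiB.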
Versions:
kernel: 5.4.85-1-lts
libnfs 3.0.0-2
nfs-utils 2.3.3-1
I've found 2 similar issues and I'm confused because they look like 2 different things. One of them includes a test procedure, but the problem only happens a few times a year and I cannot reproduce it to test!
https://access.redhat.com/solutions/543143
https://access.redhat.com/solutions/1360543
This is a production system, so I cannot take risks with it. When the problem happens I usually do not have time to examine the situation. I just restart nfs-server to solve the problem and keep my services up!
Can you give me some advice, please?
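Since there is never time to look around during an incident, something like the sketch below could save the server state to files just before the restart. The list of commands, the output directory and the function names are my own choices for illustration. Every command gets a timeout, so a command that hangs cannot delay the restart for long.

```python
import subprocess
import time
from pathlib import Path

# State worth keeping before a restart; this selection is an assumption.
COMMANDS = {
    "nfsd_stats": ["cat", "/proc/net/rpc/nfsd"],
    "nfsd_threads": ["cat", "/proc/fs/nfsd/threads"],
    "sockets": ["ss", "-tanp"],
    "netstat": ["netstat", "-s"],
    "rpcinfo": ["rpcinfo", "-p"],
    "dmesg": ["dmesg", "-T"],
}


def run(cmd, timeout=10):
    """Run one command and return its output, or a failure note on error."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return done.stdout + done.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"FAILED: {exc!r}\n"


def snapshot(outdir, commands=COMMANDS, runner=run):
    """Write the output of each command to its own timestamped file."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    paths = []
    for name, cmd in commands.items():
        path = outdir / f"{stamp}-{name}.txt"
        path.write_text(runner(cmd))
        paths.append(path)
    return paths
```

The idea would be to call snapshot("/var/tmp/nfs-hang") as root (the directory name is only an example) and then run systemctl restart nfs-server as usual, so the files can be examined later.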
Last edited by morphin (2021-08-12 12:45:47)