Connect timeouts in kernel 4.10.1

sulaweyo · 2017-03-12 13:26:05

I just wanted to point out that there is a bug in the current 4.10.1 kernel which causes connection timeouts.
I ran into it when I upgraded my home server yesterday. Running web services simply stopped being able to connect to the local DB, and similar local connections failed too. I downgraded to the latest 4.9.11 kernel and the issue is gone.

More details here: https://bugzilla.kernel.org/show_bug.cgi?id=194723

Easy way to verify: ncat -k -l 19999 & C=1 ; while true ; do echo -n "$C " ; echo ping | ncat localhost 19999 ; C=`expr $C + 1` ; sleep 1 ; done
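
For anyone who prefers a test that stops by itself, here is a minimal sketch of the same idea as a small script: it runs a fixed number of probes and prints how many failed. The function names, the PROBE hook and the PROBE_DELAY variable are my own placeholders, not anything from the bug report. It assumes a listener was started beforehand with `ncat -k -l 19999 &`.

```shell
#!/bin/sh
# Bounded variant of the one-liner above: run a fixed number of probes
# and report how many failed.

# default_probe HOST PORT: send "ping" to the listener.
# -w 3 sets a 3 second connect timeout, so a stalled connect counts as a
# failure instead of hanging; --send-only makes ncat quit on stdin EOF.
default_probe() {
    echo ping | ncat -w 3 --send-only "$1" "$2" >/dev/null 2>&1
}

# count_failures HOST PORT ATTEMPTS [PROBE]
# PROBE is a hypothetical hook (a function or command name) so the probe
# can be swapped out; PROBE_DELAY is the pause between attempts in seconds.
count_failures() {
    local host="$1" port="$2" attempts="$3" probe="${4:-default_probe}"
    local i=1 failures=0
    while [ "$i" -le "$attempts" ]; do
        if ! "$probe" "$host" "$port"; then
            failures=$((failures + 1))
            echo "attempt $i: failed" >&2
        fi
        i=$((i + 1))
        sleep "${PROBE_DELAY:-1}"
    done
    echo "$failures"
}
```

Something like `count_failures 127.0.0.1 19999 200` should then print 0 on an unaffected system and a non-zero count on an affected one. Passing 127.0.0.1 rather than localhost keeps the test on IPv4.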

Last edited by sulaweyo (2017-03-12 13:26:23)

aboe · 2017-03-12 15:39:22

I can confirm this bug. I'm running the LTS kernel until it is resolved.

damjan · 2017-03-12 16:00:54

Can't reproduce this here. I wonder why? On my system localhost resolves to an IPv6 address by default.

$ getent hosts localhost
::1             localhost

aboe · 2017-03-12 19:35:58

@damjan, it is only an issue if you have a service listening on IPv4 (127.0.0.1).

drankinatty · 2017-03-12 22:17:51

I can confirm this problem. The update to kernel 4.10.1-1, gcc, etc. broke bind9. During startup, bind doesn't get past the 'libseccomp sandboxing active' log message and does not load /etc/named.conf. As a result, bind is left dead and rndc cannot connect to 127.0.0.1#953 (e.g. rndc: connect failed: 127.0.0.1#953: connection refused).

I tried individually downgrading packages installed during update on 3/10, but the only thing that worked was downgrading the kernel, gcc and the firmware. Otherwise I get:

# rndc -V sync --clean
create memory context
create socket manager
<snip>
using server 127.0.0.1 (127.0.0.1#953)
create socket
bind socket
connect
rndc: connect failed: 127.0.0.1#953: connection refused

After downgrade of linux (4.10.1-1 -> 4.9.9-1) and associated downgrade of linux-api-headers, linux-firmware (20170227.5abb924-1 -> 20170217.12987ca-2), gcc, gcc-libs, glibc, openresolv, binutils, cifs-utils, libinput, xf86-input-libinput, and valgrind -- bind9 is working well again.

Last edited by drankinatty (2017-03-12 22:30:54)

damjan · 2017-03-12 22:45:14

so, for ipv4 only …

ncat -4 -k -l 19999 & C=1 ; while true ; do echo -n "$C " ; echo ping | ncat -4 localhost 19999 ; C=`expr $C + 1` ; sleep 1 ; done

still don't have the issue

drankinatty · 2017-03-13 12:11:07

Damjan, are there any other tests I can do on my end that may help narrow this down? I have another server that was broken by the 4.10 upgrade, but instead of downgrading, I switched it to 4.9-lts. Bind is OK there, but X will not start (strange -- I never had linux/linux-lts problems running X with the basic display drivers before...). Anyway, I have that box available to test the current config on. When I first discovered the issue, I wanted to check for updates, which is hard without name resolution, so I ended up pinging a repo and putting its IP in mirrorlist. That worked, but there was no update to fix this problem :(

Not sure what your ncat test is supposed to show, but with 4.9-lts, it dies at 107, e.g.

1 ping
...
103 ping
104 ping
105 ping
106 ping
107

Last edited by drankinatty (2017-03-13 12:19:24)

damjan · 2017-03-14 11:39:20

@drankinatty

I don't know; different people are complaining about different things (resolving, localhost issues, kernel, glibc??).

You should narrow down what the issue is: does `traceroute -n ...` work, does `ping -n localhost` work, does `getent hosts some.domain` work, etc.?
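
A rough sketch of how those checks could be bundled so everyone reports the same thing. The function names are made up for illustration, and `archlinux.org` is only a placeholder for "some.domain". The traceroute check is left out because its target depends on your network.

```shell
#!/bin/sh
# run_check LABEL COMMAND [ARGS...]
# Runs the command quietly and prints OK or FAIL followed by the label.
run_check() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK    $label"
    else
        echo "FAIL  $label"
    fi
}

# diagnose [DOMAIN]
# DOMAIN defaults to a placeholder; pass any name you expect to resolve.
diagnose() {
    run_check "resolve localhost"        getent hosts localhost
    run_check "resolve localhost (IPv4)" getent ahostsv4 localhost
    run_check "ping localhost"           ping -n -c 1 -W 2 localhost
    run_check "ping 127.0.0.1"           ping -n -c 1 -W 2 127.0.0.1
    run_check "resolve external name"    getent hosts "${1:-archlinux.org}"
}
```

Running `diagnose some.domain` prints one line per check, which makes it easier to tell a resolver problem from a loopback connection problem.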

slick517d · 2017-03-14 12:24:39

I confirm the bind/named issue here also. I had been running the 4.10.0 kernel for a couple of weeks without any problem until my update yesterday. Named loads and then dies, and I cannot get rid of the defunct process until after a reboot:

376 ? 00:00:00 named <defunct>
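
A defunct (zombie) process has already exited and cannot be killed. It stays in the process table until its parent reaps it, so the parent PID is the useful piece of information. A small sketch for listing zombies together with their parent, with helper names of my own choosing:

```shell
#!/bin/sh
# filter_zombies: read "PID PPID STAT COMMAND" lines on stdin and keep
# only zombies (STAT starts with Z), printing PID, PPID and command name.
filter_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# list_zombies: feed the live process table through the filter.
# The "=" after each column name suppresses the ps header line.
list_zombies() {
    ps -eo pid=,ppid=,stat=,comm= | filter_zombies
}
```

`list_zombies` prints nothing when there are no defunct processes.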

I put in the Google DNS servers instead of 127.0.0.1 and get internet access just fine.

I suspect one of these packages:

dnssec-anchors
openresolv
network-manager

sulaweyo · 2017-03-14 14:48:30

@slick517d I can drop network-manager from that list, as I don't have it installed.

slick517d · 2017-03-15 03:27:19

They fixed it with an upgrade of the bind and bind-tools packages. I had to re-enable named and reboot.

sulaweyo · 2017-03-15 07:13:22

Yesterday evening I was still able to reproduce it on all my machines. None of them has bind installed, only bind-tools.
Now I just tested on my work machine and I cannot reproduce it there with the latest updates. I'll verify when I get back home.

slick517d · 2017-03-15 14:36:51

@sulaweyo I use my desktop as its own DNS server, so it uses the 127.0.0.1 IP address for lookups. That part broke with the update two days ago. For some reason the named daemon (bind) would load and then die, leaving a defunct process. The bind (named) update yesterday fixed that for me.

There seem to be other issues going on in this thread, so for those here who were using 127.0.0.1 for DNS, your issue is probably fixed.

Last edited by slick517d (2017-03-15 14:44:31)

sulaweyo · 2017-03-15 16:55:35

I can still reproduce it on all my machines at home, while I can't at work. I'll have to dig deeper.

Last edited by sulaweyo (2017-03-15 16:59:50)

slick517d · 2017-03-15 18:55:10

@sulaweyo:

It appears that, as @damjan stated, there is more than one issue going on.

For clarification, I do not use the Arch kernel or its .config. I follow another kernel / .config with modified DVB modules designed for DVB blind scanning and higher bit rate capabilities.

So it appears that either you are not using your own local DNS server, or, if you are, you may be using some resolver other than bind. Alternatively, the problem may lie with Arch's new kernel and/or its .config, but that would not be the case if your home and work computers are currently running the same kernel.

Good luck on hunting down the issue

Last edited by slick517d (2017-03-15 18:58:22)

loqs · 2017-03-16 16:02:20

@sulaweyo on the affected systems what is the output of

$ cat /proc/sys/net/ipv4/tcp_tw_recycle

Have you been able to produce the same bisection as in https://bugzilla.kernel.org/show_bug.cgi?id=194723#c15?

sulaweyo · 2017-03-16 18:04:23

Yep, that workaround fixes the issue on all my nodes.

To test:

echo 0 >/proc/sys/net/ipv4/tcp_tw_recycle

Permanent via sysctl:

net.ipv4.tcp_tw_recycle = 0
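
To check a machine quickly, here is a small sketch that reads both tcp_tw_recycle and tcp_timestamps and flags the combination that, as far as I understand the bug report, triggers the timeouts (recycle enabled together with timestamps). The function names and the optional directory argument are my own additions. Kernels from 4.12 on no longer have tcp_tw_recycle at all, which the script reports as "absent".

```shell
#!/bin/sh
# read_tcp_setting NAME [DIR]
# Print the value of a setting under DIR (default /proc/sys/net/ipv4),
# or "absent" if the file does not exist on this kernel.
read_tcp_setting() {
    local file="${2:-/proc/sys/net/ipv4}/$1"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "absent"
    fi
}

# check_tw_recycle [DIR]
# Returns 1 when tcp_tw_recycle and tcp_timestamps are both 1, else 0.
check_tw_recycle() {
    local dir="${1:-/proc/sys/net/ipv4}"
    local recycle timestamps
    recycle=$(read_tcp_setting tcp_tw_recycle "$dir")
    timestamps=$(read_tcp_setting tcp_timestamps "$dir")
    echo "tcp_tw_recycle=$recycle tcp_timestamps=$timestamps"
    if [ "$recycle" = "1" ] && [ "$timestamps" = "1" ]; then
        echo "possibly affected: consider net.ipv4.tcp_tw_recycle = 0"
        return 1
    fi
    return 0
}
```

Run `check_tw_recycle` without arguments to inspect the live system. The permanent setting goes into a file under /etc/sysctl.d/ (the file name is up to you, for example 99-sysctl.conf).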

loqs · 2017-03-16 18:22:05

I wonder why it was set to 1 on your systems; on this system it is set to 0.

drankinatty · 2017-03-16 23:57:33

After updates today, all appears to be working fine (I'm one of the initial reporters that rely on bind9 for mail host/web host name resolution, dhcpd w/dyn_updates, etc.). So what was broken in 4.10.1/glibc 2.25/bind9 now appears working. (at least for my setup, which relies completely on bind9)

Last edited by drankinatty (2017-03-16 23:57:49)

twelveeighty · 2017-03-22 14:25:12

drankinatty wrote:

After updates today, all appears to be working fine (I'm one of the initial reporters that rely on bind9 for mail host/web host name resolution, dhcpd w/dyn_updates, etc.). So what was broken in 4.10.1/glibc 2.25/bind9 now appears working. (at least for my setup, which relies completely on bind9)

@drankinatty In your working setup, what settings do you currently have for these two values:

cat /proc/sys/net/ipv4/tcp_tw_recycle
cat /proc/sys/net/ipv4/tcp_timestamps
