Entire network connection down when arch computer segfaults

Vernox · 2023-04-30 23:12:11

I am running Arch Linux on multiple PCs. Aside from a few having a DE, which my server doesn't, the configuration (installed packages) across all of them is almost the same.

For some weird reason, when my server segfaults (don't ask me why), the entire network goes down as well. That affects every single device connected to the modem (including the modem itself).

The switches I'm running are all basic 1 Gbit unmanaged ones, and my router is the most standard home ISP modem.

There are no VLANs set up.

Relevant information?:
When rebooting my router, I usually have to disconnect my server as it would somehow prevent the router from connecting to my isp.
The server is running a lot of docker containers, but has never yet run out of RAM.
Kernel: 6.2.13-arch1-1

ethtool:

Settings for enp4s0:
	Supported ports: [ TP	MII ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                       100baseT/Half 100baseT/Full
	                       1000baseT/Full
	                       2500baseT/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  10baseT/Half 10baseT/Full
	                       100baseT/Half 100baseT/Full
	                       1000baseT/Full
	                       2500baseT/Full
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                    100baseT/Half 100baseT/Full
	                                    1000baseT/Full
	Link partner advertised pause frame use: Symmetric
	Link partner advertised auto-negotiation: Yes
	Link partner advertised FEC modes: Not reported
	Speed: 1000Mb/s
	Duplex: Full
	Auto-negotiation: on
	master-slave cfg: preferred slave
	master-slave status: slave
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: external
	MDI-X: Unknown (auto)
	Supports Wake-on: pumbg
	Wake-on: g
	Link detected: yes

I've also had a script running that writes dmesg every minute, but those logs didn't show anything at the time of the segfaults.
Would greatly appreciate any amount of help.

seth · 2023-05-01 05:53:45

when my server segfaults

What does that mean?
Processes can segfault, servers rather not.

the entire network goes down

And what does that mean? (DNS, dhcp, LAN, WAN…)?

I usually have to disconnect my server as it would somehow prevent the router from connecting to my isp.

It seems the server™ runs a relevant but rogue network service (DNS, DHCP, firewall) - what happens if you just yank the ethernet cable while the server is otherwise doing fine?

Please post the output of

find /etc/systemd -type l -exec test -f {} \; -print | awk -F'/' '{ printf ("%-40s | %s\n", $(NF-0), $(NF-1)) }' | sort -f

The server is running a lot of docker containers

And try the behavior without *all* of them.

I've also had a script running that writes dmesg every minute, but those logs didn't show anything at the time of the segfaults.

If the kernel halts and/or you force a hard reboot after the "segfault", no more data will be sync'd to disk.
Try at least https://wiki.archlinux.org/title/Keyboa … el_(SysRq)
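
Before relying on SysRq, it may be worth checking that the relevant key combinations are enabled at all. The following is only a sketch: the function names sysrq_allows and report_sysrq are made up for illustration, and the bit values (16 = sync, 32 = remount read-only, 128 = reboot/poweroff, 1 = everything) are taken from the kernel's sysrq documentation.

```shell
#!/bin/sh
# Sketch: check whether a kernel.sysrq value permits a given SysRq action.
# A value of 1 enables everything, 0 disables everything, anything else is a bitmask.

sysrq_allows() {
    # $1 = value of kernel.sysrq, $2 = action name (sync, remount, reboot)
    case "$2" in
        sync)    bit=16  ;;
        remount) bit=32  ;;
        reboot)  bit=128 ;;
        *) echo "unknown"; return 0 ;;
    esac
    if [ "$1" -eq 1 ] || [ $(( $1 & bit )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}

# Report on the running system (read-only, no root needed).
report_sysrq() {
    current=$(cat /proc/sys/kernel/sysrq)
    for action in sync remount reboot; do
        echo "$action: $(sysrq_allows "$current" "$action")"
    done
}
```

Calling report_sysrq prints one line per action. If sync or reboot shows "no", running "sysctl kernel.sysrq=1" as root should enable all functions until the next boot. If the keyboard is completely dead during the freeze, SysRq may not be processed either.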

These are very likely two problems:
1. your LAN configuration (not "the server", but perhaps only something that runs on "the server") is bonkers
2. the server stops (for whatever reason)

Vernox · 2023-05-01 22:11:56

seth wrote:

when my server segfaults
What does that mean?
Processes can segfault, servers rather not.

Oh, well. Whenever the system freezes (including Capslock), the screen just shows some stuff about a coredump and segfault, so I assumed that could happen to the kernel. Looking at the pictures, there is also something about a kernel panic.
I already have a hunch it's my Stable Diffusion container (causing the freeze, not the network issues) or something else, but fixing that is not my first priority if another faulty program could lead to the same network issues.

seth wrote:

the entire network goes down
And what does that mean? (DNS, dhcp, LAN, WAN…)?

I'm actually not quite sure. I can always access local computers (except for the server), for example the router, and browse its interface.
Once, I even tried the crappy troubleshooting wizard on my router, but it only reported the internet to be disconnected, even though the cable itself was reported "working". (it kinda differentiates between the dsl cable and the connection itself, luckily)

seth wrote:

I usually have to disconnect my server as it would somehow prevent the router from connecting to my isp.
It seems the server™ runs a relevant but rogue network service (DNS, DHCP, firewall) - what happens if you just yank the ethernet cable while the server is otherwise doing fine?

Nothing unexpected happens. I do not have enough devices, or care enough about it, to let anything else manage my network except for my router (dhcp...). Even the DNS service is running on a different machine.

seth wrote:

Please post the output of

find /etc/systemd -type l -exec test -f {} \; -print | awk -F'/' '{ printf ("%-40s | %s\n", $(NF-0), $(NF-1)) }' | sort -f

atd.service                              | multi-user.target.wants
dbus-org.freedesktop.nm-dispatcher.service | system
dbus-org.freedesktop.timesync1.service   | system
fail2ban.service                         | multi-user.target.wants
fcron.service                            | multi-user.target.wants
getty@tty1.service                       | getty.target.wants
httpd.service                            | multi-user.target.wants
NetworkManager.service                   | multi-user.target.wants
NetworkManager-wait-online.service       | network-online.target.wants
p11-kit-server.socket                    | sockets.target.wants
pacoloco.service                         | multi-user.target.wants
remote-fs.target                         | multi-user.target.wants
smb.service                              | multi-user.target.wants
sshd.service                             | multi-user.target.wants
systemd-timesyncd.service                | sysinit.target.wants
upsMonitor.service                       | multi-user.target.wants
vsftpd.service                           | multi-user.target.wants

seth wrote:

The server is running a lot of docker containers
And try the behavior without *all* of them.

Unfortunately, I wasn't able to reproduce the issue with the two tests I did today.
I am however pretty confident none of my docker containers are necessarily at fault, since they pretty much all just serve web services.

seth wrote:

I've also had a script running that writes dmesg every minute, but those logs didn't show anything at the time of the segfaults.
If the kernel halts and/or you force a hard reboot after the "segfault", no more data will be sync'd to disk.
Try at least https://wiki.archlinux.org/title/Keyboa … el_(SysRq)

Oh thanks. I'll try, but since capslock also tends to not work in this situation, I do not have high expectations.

seth wrote:

These are very likely two problems:
1. your LAN configuration (not "the server", but perhaps only something that runs on "the server") is bonkers
2. the server stops (for whatever reason)

1. likely. I can take another look into the bios and see if there is something turned on that shouldn't be.
2. yes, the computer is unresponsive/frozen, but the router still cannot connect to the internet, even if the computer is *un*frozen.

I hope this information is sufficient to draw a good conclusion. If not, I'm happy to run a few more tests to uncover the real cause.

seth · 2023-05-01 22:38:40

I am however pretty confident none of my docker containers are necessarily at fault, since they pretty much all just serve web services.

And your WAN is broken for the entire LAN…
Is one of those webservices a Pi-Hole or a firewall?

the router still cannot connect to the internet, even if the computer is *un*frozen.

Then what takes it to re-establish the WAN?

- "I can always access local computers"
=> LAN is unaffected
- "Nothing unexpected happens."
=> It's not so much the loss of the server
- the router still cannot connect to the internet
=> I guess that's a modem/router combo? What does it say about the up/downlink? Do you try to ping WAN domains/IPs from the router?
Can you still ping IPs ("ping 8.8.8.8")?

Is it possible that, notably if it takes a router reboot to get out of this, actually the router loses the up- or downlink (or both) and the server "crashes" because of the lost network connection, not the other way round?

Vernox · 2023-05-01 23:14:35

seth wrote:

I am however pretty confident none of my docker containers are necessarily at fault, since they pretty much all just serve web services.
And your WAN is broken for the entire LAN…
Is one of those webservices a Pi-Hole or a firewall?
the router still cannot connect to the internet, even if the computer is *un*frozen.
Then what takes it to re-establish the WAN?
- "I can always access local computers"
=> LAN is unaffected
- "Nothing unexpected happens."
=> It's not so much the loss of the server
- the router still cannot connect to the internet
=> I guess that's a modem/router combo? What does it say about the up/downlink? Do you try to ping WAN domains/IPs from the router?
Can you still ping IPs ("ping 8.8.8.8")?
Is it possible that, notably if it takes a router reboot to get out of this, actually the router loses the up- or downlink (or both) and the server "crashes" because of the lost network connection, not the other way round?

The arch system in question only hosts a WireGuard container, several instances similar to Plex or Emby, a Stable Diffusion web UI and a Telegram bot (all of them as containers).

Whenever the system interrupts the internet connection:
+ I can still access other computers
+ I can access the web UI of the modem/router, which reports the internet connection as "disconnected" while the DSL cable is still reported as connected.
+ I cannot ping anything outside of my network, as if the internet connection were literally non-existent.

In order to fix this, I either pull the LAN cable that's connected to the arch system, or just hold its power button and reboot the arch system.
Rebooting the router usually doesn't help at all.

When this happened the first time, I assumed my ISP was having issues *again*, and after I got in touch with their support, they reported my modem as being "detected/online" but unable to connect to (whatever that means).

seth · 2023-05-02 06:44:06

While things are fine:

ip a; ip r # on the server and one other host in the LAN
# And on another host in the LAN
sudo nmap --script broadcast-dhcp-discover
dig google.com
tracepath google.com

When things are not (the server crashed), on another host in the LAN

ping -c1 google.com
ping -c1 8.8.8.8
dig google.com
tracepath 8.8.8.8

In order to fix this, I either pull the LAN cable that's connected to the arch system, or just hold its power button and reboot the arch system.

Does that mean that otherwise the server becomes "*un*frozen" again?
We could then get a journal from the system for the actual crash.
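
Since the problem is not easy to reproduce on demand, the checks for the broken state could be left running on another host in the LAN. This is only a sketch under assumptions: classify_outage and probe_once are hypothetical names, the gateway address is passed in as an argument, and a DNS lookup only counts as working if dig exits cleanly and returns at least one record.

```shell
#!/bin/sh
# Sketch: classify connectivity from the exit codes of three probes.

classify_outage() {
    # $1 = exit code of a ping to the gateway
    # $2 = exit code of a ping to a WAN IP (8.8.8.8 here)
    # $3 = exit code of a DNS lookup
    if [ "$1" -ne 0 ]; then
        echo "lan-down"
    elif [ "$2" -ne 0 ]; then
        echo "wan-down"
    elif [ "$3" -ne 0 ]; then
        echo "dns-down"
    else
        echo "ok"
    fi
}

probe_once() {
    # $1 = LAN address of the router/gateway
    ping -c1 -W2 "$1" >/dev/null 2>&1; gw=$?
    ping -c1 -W2 8.8.8.8 >/dev/null 2>&1; wan=$?
    # dig can exit 0 on SERVFAIL, so also require a non-empty answer.
    out=$(dig +short +time=2 +tries=1 google.com 2>/dev/null) && [ -n "$out" ]; dns=$?
    echo "$(date '+%F %T') $(classify_outage "$gw" "$wan" "$dns")"
}
```

For example, "while sleep 60; do probe_once GATEWAY_IP >> net-probe.log; done" would append one timestamped line per minute, with GATEWAY_IP replaced by the router's LAN address. A run of "wan-down" lines while the gateway still answers would match the symptoms described so far.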

Vernox · 2023-05-03 01:27:30

seth wrote:

While things are fine:

ip a; ip r # on the server and one other host in the LAN
# And on another host in the LAN
sudo nmap --script broadcast-dhcp-discover
dig google.com
tracepath google.com

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b8:27:eb:9a:12:24 brd ff:ff:ff:ff:ff:ff
    inet 192.168.178.51/24 brd 192.168.178.255 scope global dynamic eth0
       valid_lft 629853sec preferred_lft 629853sec
    inet6 fe80::ba27:ebff:fe9a:1224/64 scope link
       valid_lft forever preferred_lft forever
3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:27:eb:cf:47:71 brd ff:ff:ff:ff:ff:ff
default via 192.168.178.1 dev eth0
192.168.178.0/24 dev eth0 proto kernel scope link src 192.168.178.51

Pre-scan script results:
| broadcast-dhcp-discover:
|   Response 1 of 1:
|     IP Offered: 192.168.178.74
|     DHCP Message Type: DHCPOFFER
|     Server Identifier: 192.168.178.1
|     IP Address Lease Time: 10d00h00m00s
|     Renewal Time Value: 5d00h00m00s
|     Rebinding Time Value: 8d18h00m00s
|     Subnet Mask: 255.255.255.0
|     Router: 192.168.178.1
|     Domain Name Server: 192.168.178.1
|     Domain Name: ___
|     Broadcast Address: 192.168.178.255
|_    NTP Servers: 192.168.178.1

; <<>> DiG 9.16.37-Debian <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15798
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		157	IN	A	142.251.36.238

;; Query time: 4 msec
;; SERVER: 192.168.178.1#53(192.168.178.1)
;; WHEN: Tue May 02 11:56:52 UTC 2023
;; MSG SIZE  rcvd: 55

 1?: [LOCALHOST]                      pmtu 1500
 1:  _gateway                                             0.719ms
 1:  _gateway                                             0.698ms
 2:  no reply
 3:  ....net                            13.002ms asymm  7
 4:  ....net                            12.661ms asymm  6
 5:  ....net                            12.773ms
 6:  no reply
 7:  no reply
 8:  no reply
...
29:  no reply
30:  no reply
     Too many hops: pmtu 1500
     Resume: pmtu 1500

seth wrote:

When things are not (the server crashed), on another host in the LAN
ping -c1 google.com
ping -c1 8.8.8.8
dig google.com
tracepath 8.8.8.8

Unfortunately, these freezes are pretty random and I was not yet able to reproduce them reliably.
Just an hour ago I found my router displaying an incorrect IP address for the arch system in question, which reminded me of a similar problem I had before. The router has now been completely reset and seems to display the correct information again.
I will keep this thread marked as unsolved for at least one month, and post any updates that might happen.

seth wrote:

In order to fix this, I either pull the LAN cable that's connected to the arch system, or just hold its power button and reboot the arch system.
Does that mean that otherwise the server becomes "*un*frozen" again?
We could then get a journal from the system for the actual crash.

No, this is just the procedure to make the modem connect to the internet again.
When the computer is frozen, nothing but a forced shutdown (power button) is able to "fix" it.
Not even the capslock indicator works during that situation.

seth · 2023-05-03 06:19:12

 1:  _gateway                                             0.698ms
 2:  no reply
 3:  ....net                            13.002ms asymm  7

If you remove the problematic host (server) from the network, does the "2: no reply" disappear and is it otherwise consistently there?

Vernox · 2023-05-03 13:29:13

seth wrote:

 1:  _gateway                                             0.698ms
 2:  no reply
 3:  ....net                            13.002ms asymm  7
If you remove the problematic host (server) from the network, does the "2: no reply" disappear and is it otherwise consistently there?

It's consistently there.
