I am running Arch Linux on multiple PCs. Apart from a few that have a DE, which my server doesn't, the configuration (installed packages) is almost the same across all of them.
For some weird reason, when my server segfaults (don't ask me why), the entire network goes down as well. This affects every single device that is connected to the modem (including the modem itself).
The switches I'm running are all basic 1 Gbit unmanaged ones, and my router is the most standard home ISP modem.
There are no VLANs set up.
Relevant information?:
When rebooting my router, I usually have to disconnect my server, as it would otherwise somehow prevent the router from connecting to my ISP.
The server is running a lot of Docker containers, but has never run out of RAM so far.
Kernel: 6.2.13-arch1-1
ethtool:
Settings for enp4s0:
Supported ports: [ TP MII ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
2500baseT/Full
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
2500baseT/Full
Advertised pause frame use: Symmetric Receive-only
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Link partner advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Link partner advertised pause frame use: Symmetric
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
master-slave cfg: preferred slave
master-slave status: slave
Port: Twisted Pair
PHYAD: 0
Transceiver: external
MDI-X: Unknown (auto)
Supports Wake-on: pumbg
Wake-on: g
Link detected: yes
I've also had a script running that writes dmesg every minute, but those logs didn't show anything at the time of the segfaults.
Would greatly appreciate any amount of help.
when my server segfaults
What does that mean?
Processes can segfault, servers rather not.
the entire network goes down
And what does that mean? (DNS, dhcp, LAN, WAN…)?
I usually have to disconnect my server as it would somehow prevent the router from connecting to my isp.
It seems the server™ runs a relevant but rogue network service (DNS, DHCP, firewall) - what happens if you just yank the ethernet cable while the server is otherwise doing fine?
Please post the output of
find /etc/systemd -type l -exec test -f {} \; -print | awk -F'/' '{ printf ("%-40s | %s\n", $(NF-0), $(NF-1)) }' | sort -f
The server is running a lot of docker containers
And try the behavior without *all* of them.
I've also had a script running that writes dmesg every minute, but those logs didn't show anything at the time of the segfaults.
If the kernel halts and/or you force a hard reboot after the "segfault", no more data will be sync'd to disk.
Try at least https://wiki.archlinux.org/title/Keyboa … el_(SysRq)
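A minimal sketch for checking, before the next freeze, whether the SysRq functions that matter here are enabled at all. It assumes the bitmask values from the kernel's sysrq documentation (16 = sync, 32 = remount read-only, 128 = reboot/poweroff); `sysrq_allows` is a made-up helper name, not an existing tool.

```sh
#!/bin/sh
# sysrq_allows VALUE BIT
# Succeeds if the kernel.sysrq VALUE permits the function whose bitmask
# value is BIT (assumed values: 16 = sync, 32 = remount read-only,
# 128 = reboot/poweroff). The helper name is hypothetical.
sysrq_allows() {
    value=$1
    bit=$2
    # A value of 1 enables every function; 0 disables SysRq entirely.
    [ "$value" -eq 1 ] && return 0
    [ $(( value & bit )) -ne 0 ]
}

# Usage on the server (read-only check, no root needed):
#   v=$(cat /proc/sys/kernel/sysrq)
#   sysrq_allows "$v" 16  && echo "sync allowed"
#   sysrq_allows "$v" 128 && echo "reboot allowed"
```

If sync is not allowed, the linked wiki page describes how to change the kernel.sysrq setting. Whether the key combination still registers during a hard freeze is a separate question.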
These are very likely two problems:
1. your LAN configuration (not "the server", but perhaps only something that runs on "the server") is bonkers
2. the server stops (for whatever reason)
when my server segfaults
What does that mean?
Processes can segfault, servers rather not.
Oh, well. Whenever the system freezes (including Capslock), the screen just shows some stuff about a coredump and segfault, so I assumed that could happen to the kernel. Looking at the pictures, there is also something about a kernel panic.
I already have a hunch that it's my Stable Diffusion container (causing the freeze, not the network issues) or something else, but fixing that is not my first priority if another faulty program could lead to the same network issues.
the entire network goes down
And what does that mean? (DNS, dhcp, LAN, WAN…)?
I'm actually not quite sure. I can always access local computers (except for the server), for example the router, and browse its interface.
Once, I even tried the crappy troubleshooting wizard on my router, but it only reported the internet to be disconnected, even though the cable itself was reported "working". (it kinda differentiates between the dsl cable and the connection itself, luckily)
I usually have to disconnect my server as it would somehow prevent the router from connecting to my isp.
It seems the server™ runs a relevant but rogue network service (DNS, DHCP, firewall) - what happens if you just yank the ethernet cable while the server is otherwise doing fine?
Nothing unexpected happens. I do not have enough devices, or care enough about it, to let anything else manage my network except for my router (dhcp...). Even the DNS service is running on a different machine.
Please post the output of
find /etc/systemd -type l -exec test -f {} \; -print | awk -F'/' '{ printf ("%-40s | %s\n", $(NF-0), $(NF-1)) }' | sort -f
atd.service | multi-user.target.wants
dbus-org.freedesktop.nm-dispatcher.service | system
dbus-org.freedesktop.timesync1.service | system
fail2ban.service | multi-user.target.wants
fcron.service | multi-user.target.wants
getty@tty1.service | getty.target.wants
httpd.service | multi-user.target.wants
NetworkManager.service | multi-user.target.wants
NetworkManager-wait-online.service | network-online.target.wants
p11-kit-server.socket | sockets.target.wants
pacoloco.service | multi-user.target.wants
remote-fs.target | multi-user.target.wants
smb.service | multi-user.target.wants
sshd.service | multi-user.target.wants
systemd-timesyncd.service | sysinit.target.wants
upsMonitor.service | multi-user.target.wants
vsftpd.service | multi-user.target.wants
The server is running a lot of docker containers
And try the behavior without *all* of them.
Unfortunately, I wasn't able to reproduce the issue with the two tests I did today.
I am however pretty confident none of my docker containers are necessarily at fault, since they pretty much all just serve web services.
I've also had a script running that writes dmesg every minute, but those logs didn't show anything at the time of the segfaults.
If the kernel halts and/or you force a hard reboot after the "segfault", no more data will be sync'd to disk.
Try at least https://wiki.archlinux.org/title/Keyboa … el_(SysRq)
Oh thanks. I'll try, but since capslock also tends to not work in this situation, I do not have high expectations.
These are very likely two problems:
1. your LAN configuration (not "the server", but perhaps only something that runs on "the server") is bonkers
2. the server stops (for whatever reason)
1. likely. I can take another look into the bios and see if there is something turned on that shouldn't be.
2. yes, the computer is unresponsive/frozen, but the router still cannot connect to the internet, even if the computer is *un*frozen.
I hope this information is sufficient to draw a good conclusion. If not, I'm happy to run a few more tests to uncover the real cause.
I am however pretty confident none of my docker containers are necessarily at fault, since they pretty much all just serve web services.
And your WAN is broken for the entire LAN…
Is one of those webservices a Pi-Hole or a firewall?
the router still cannot connect to the internet, even if the computer is *un*frozen.
Then what takes it to re-establish the WAN?
- "I can always access local computers"
=> LAN is unaffected
- "Nothing unexpected happens."
=> It's not so much the loss of the server
- the router still cannot connect to the internet
=> I guess that's a modem/router combo? What does it say about the up/downlink? Do you try to ping WAN domains/IPs from the router?
Can you still ping IPs ("ping 8.8.8.8")?
Is it possible, notably if it takes a router reboot to get out of this, that the router actually loses the up- or downlink (or both) and the server "crashes" because of the lost network connection, not the other way round?
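The distinction drawn above (LAN fine, WAN gone, possibly DNS on top) can be sketched as a small classifier to run from another host in the LAN. This is only an illustration: `classify_outage` and `ok` are hypothetical helper names, 192.168.178.1 is the gateway address that appears in this thread, and `getent hosts` is used for the name lookup on the assumption that its exit status reflects whether resolution succeeded (a cached answer could still mask a failure).

```sh
#!/bin/sh
# classify_outage GATEWAY_OK IP_OK DNS_OK
# Each argument is 1 (check succeeded) or 0 (check failed).
classify_outage() {
    gw=$1; ip=$2; dns=$3
    if [ "$gw" -ne 1 ]; then
        echo "lan-down"     # cannot even reach the router
    elif [ "$ip" -ne 1 ]; then
        echo "wan-down"     # router reachable, nothing beyond it
    elif [ "$dns" -ne 1 ]; then
        echo "dns-only"     # raw IPs work, name resolution does not
    else
        echo "ok"
    fi
}

# ok CMD...: prints 1 if CMD exits successfully, otherwise 0.
ok() { if "$@" >/dev/null 2>&1; then echo 1; else echo 0; fi; }

# Usage from another LAN host:
#   classify_outage "$(ok ping -c1 -W2 192.168.178.1)" \
#                   "$(ok ping -c1 -W2 8.8.8.8)" \
#                   "$(ok getent hosts google.com)"
```

Running this once a minute with a timestamp would also help to tell whether the uplink drops before or after the server freezes.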
I am however pretty confident none of my docker containers are necessarily at fault, since they pretty much all just serve web services.
And your WAN is broken for the entire LAN…
Is one of those webservices a Pi-Hole or a firewall?
the router still cannot connect to the internet, even if the computer is *un*frozen.
Then what takes it to re-establish the WAN?
- "I can always access local computers"
=> LAN is unaffected
- "Nothing unexpected happens."
=> It's not so much the loss of the server
- the router still cannot connect to the internet
=> I guess that's a modem/router combo? What does it say about the up/downlink? Do you try to ping WAN domains/IPs from the router?
Can you still ping IPs ("ping 8.8.8.8")?
Is it possible, notably if it takes a router reboot to get out of this, that the router actually loses the up- or downlink (or both) and the server "crashes" because of the lost network connection, not the other way round?
The Arch system in question only hosts a WireGuard container, several instances similar to Plex or Emby, a Stable Diffusion web UI and a Telegram bot (these are the containers).
Whenever the system interrupts the internet connection:
+ I can still access other computers
+ I can access the web UI of the modem/router, which reports the internet connection as "disconnected" while the DSL cable is still reported as connected.
+ I cannot ping anything outside of my network, as if the internet connection were literally non-existent.
In order to fix this, I either pull the LAN cable that's connected to the Arch system, or just hold its power button and reboot the Arch system.
Rebooting the router usually doesn't help at all.
When this happened the first time, I assumed my ISP was having issues again, and after I got in touch with their support, they reported my modem as being "detected/online" but unable to connect to (whatever that means).
While things are fine:
ip a; ip r # on the server and one other host in the LAN
# And on another host in the LAN
sudo nmap --script broadcast-dhcp-discover
dig google.com
tracepath google.com
When things are not (the server crashed), on another host in the LAN
ping -c1 google.com
ping -c1 8.8.8.8
dig google.com
tracepath 8.8.8.8
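Since the outage is random, it may help to have the commands above ready in a script that saves their output for later posting. A rough sketch, assuming a writable log path of your choosing; `run_and_log` is a hypothetical helper name, and the `sync` at the end is only there so the log reaches the disk promptly.

```sh
#!/bin/sh
# run_and_log LOGFILE CMD...
# Appends a timestamped header, the command's combined stdout/stderr and
# its exit status to LOGFILE, then asks the kernel to flush to disk.
run_and_log() {
    logfile=$1
    shift
    {
        printf '### %s $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
        status=0
        "$@" 2>&1 || status=$?
        printf '### exit status: %s\n' "$status"
    } >>"$logfile"
    sync
}

# Usage on another host in the LAN while the WAN is down
# (the log path is just an example):
#   log=/var/tmp/outage-$(date +%s).log
#   run_and_log "$log" ping -c1 google.com
#   run_and_log "$log" ping -c1 8.8.8.8
#   run_and_log "$log" dig google.com
#   run_and_log "$log" tracepath 8.8.8.8
```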
In order to fix this, I either pull the LAN-cable thats connected to the arch system, or just hold its power button and reboot the arch system.
Does that mean that otherwise the server becomes "*un*frozen" again?
We could then get a journal from the system for the actual crash.
While things are fine:
ip a; ip r # on the server and one other host in the LAN
# And on another host in the LAN
sudo nmap --script broadcast-dhcp-discover
dig google.com
tracepath google.com
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether b8:27:eb:9a:12:24 brd ff:ff:ff:ff:ff:ff
inet 192.168.178.51/24 brd 192.168.178.255 scope global dynamic eth0
valid_lft 629853sec preferred_lft 629853sec
inet6 fe80::ba27:ebff:fe9a:1224/64 scope link
valid_lft forever preferred_lft forever
3: wlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether b8:27:eb:cf:47:71 brd ff:ff:ff:ff:ff:ff
default via 192.168.178.1 dev eth0
192.168.178.0/24 dev eth0 proto kernel scope link src 192.168.178.51
Pre-scan script results:
| broadcast-dhcp-discover:
| Response 1 of 1:
| IP Offered: 192.168.178.74
| DHCP Message Type: DHCPOFFER
| Server Identifier: 192.168.178.1
| IP Address Lease Time: 10d00h00m00s
| Renewal Time Value: 5d00h00m00s
| Rebinding Time Value: 8d18h00m00s
| Subnet Mask: 255.255.255.0
| Router: 192.168.178.1
| Domain Name Server: 192.168.178.1
| Domain Name: ___
| Broadcast Address: 192.168.178.255
|_ NTP Servers: 192.168.178.1
; <<>> DiG 9.16.37-Debian <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15798
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 157 IN A 142.251.36.238
;; Query time: 4 msec
;; SERVER: 192.168.178.1#53(192.168.178.1)
;; WHEN: Tue May 02 11:56:52 UTC 2023
;; MSG SIZE rcvd: 55
1?: [LOCALHOST] pmtu 1500
1: _gateway 0.719ms
1: _gateway 0.698ms
2: no reply
3: ....net 13.002ms asymm 7
4: ....net 12.661ms asymm 6
5: ....net 12.773ms
6: no reply
7: no reply
8: no reply
...
29: no reply
30: no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
When things are not (the server crashed), on another host in the LAN
ping -c1 google.com
ping -c1 8.8.8.8
dig google.com
tracepath 8.8.8.8
Unfortunately, these freezes are pretty random and I was not yet able to reproduce them reliably.
Just an hour ago I found my router displaying an incorrect IP address for the Arch system in question, which reminded me of a similar problem I had before. The router has now been completely reset and seems to display the correct information again.
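One rough way to cross-check what the router displays is to look at the neighbour (ARP) table from another LAN host. The sketch below reads `ip neigh show` output on stdin and lists MAC addresses seen with more than one IPv4 address. `macs_with_multiple_ips` is a hypothetical helper name, and a hit is only a hint, since a host can legitimately hold several addresses.

```sh
#!/bin/sh
# macs_with_multiple_ips
# Reads `ip neigh show` style lines on stdin and prints every MAC address
# that appears with more than one distinct IPv4 address.
macs_with_multiple_ips() {
    awk '
        # Only consider IPv4 entries; skip IPv6 neighbours.
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            for (i = 2; i < NF; i++) {
                if ($i == "lladdr") {
                    mac = $(i + 1)
                    # Count each (MAC, IP) pair only once.
                    if (!((mac, $1) in seen)) {
                        seen[mac, $1] = 1
                        count[mac]++
                        ips[mac] = ips[mac] " " $1
                    }
                }
            }
        }
        END {
            for (mac in count)
                if (count[mac] > 1)
                    print mac ":" ips[mac]
        }
    '
}

# Usage on another host in the LAN:
#   ip neigh show | macs_with_multiple_ips
```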
I will keep this thread marked as unsolved for at least one month, and post any updates that might happen.
In order to fix this, I either pull the LAN-cable thats connected to the arch system, or just hold its power button and reboot the arch system.
Does that mean that otherwise the server becomes "*un*frozen" again?
We could then get a journal from the system for the actual crash.
No this is just the procedure to make the modem connect to the internet again.
When the computer is frozen, nothing but a force shutdown (powerbutton) is able to "fix" it.
Not even the capslock indicator works during that situation.
1: _gateway 0.698ms
2: no reply
3: ....net 13.002ms asymm 7
If you remove the problematic host (server) from the network, does the "2: no reply" disappear, and is it otherwise consistently there?
1: _gateway 0.698ms
2: no reply
3: ....net 13.002ms asymm 7
If you remove the problematic host (server) from the network, does the "2: no reply" disappear, and is it otherwise consistently there?
It's consistently there.