Here I am again - let me tell you, you can't get bored with Arch Linux - there's always something going on.
Somehow, when I do updates that require a reboot (like kernel-lts), the eno2 interface starts randomly flapping (link down/up). The only fix I have found is to reboot the switch that eno2 is connected to. After that, the port never drops again and everything runs as it should.
I replaced the patch cable between the Supermicro (eno2) and the switch with another one. That didn't help.
Setup:
Supermicro A2SDi-4C-HLN4F, BIOS 1.7a 10/13/2022
eno1 <-> Switch 1 (Ubiquiti EdgeSwitch ES-10XP)
eno2 <-> Switch 2 (Ubiquiti EdgeSwitch ES-10XP)
This never happened on port eno1.
Each down/up cycle takes only about 3 seconds:
Aug 15 02:24:18 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:24:21 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:24:22 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:24:25 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:24:26 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:24:29 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:24:32 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:24:35 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:24:37 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:24:40 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:24:52 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:24:55 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:26:10 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:26:13 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:26:15 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:26:17 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Aug 15 02:26:20 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 15 02:26:23 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
Here I restarted switch 2 (eno2) and since then everything is fine again:
Aug 16 00:17:59 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Down
Aug 16 00:18:46 kernel: ixgbe 0000:06:00.1 eno2: NIC Link is Up 1 Gbps, Flow Control: None
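To quantify the flapping rather than eyeball it, here is a minimal sketch that pairs each "NIC Link is Down" line with the next "NIC Link is Up" line and prints the outage length in seconds. The function name flap_durations is made up for illustration, and it assumes the default journalctl timestamp format shown above, with all lines falling in the same month.

```shell
# Sketch: print how long each "Link is Down" period lasted, in seconds.
# Assumption: short timestamps ("Aug 15 02:24:18") in fields 1-3, same month.
flap_durations() {
    awk '
        /NIC Link is Down/ || /NIC Link is Up/ {
            # Convert day-of-month plus HH:MM:SS into a running second count.
            split($3, t, ":")
            now = $2 * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        }
        /NIC Link is Down/ { down = now; have_down = 1 }
        /NIC Link is Up/ && have_down { print now - down; have_down = 0 }
    '
}
```

Possible usage: 'journalctl -k -b | grep "eno2: NIC Link is" | flap_durations'. Fed the excerpts above, this gives 3 s for most pairs, 2 s for one, and 47 s for the switch restart.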
It's another mystery I can't figure out. I googled and found this thread from 2018: https://bbs.archlinux.org/viewtopic.php?id=237502 but I don't know if it's related to my problem.
journalctl -k -b: https://0x0.st/s/AP-8L8PRSu3ouhh4tv13tw/HLTi.txt
Any tips, please? Thanks
Last edited by vecino (2024-04-05 20:07:04)
Offline
Post the output of 'lspci -nnk'.
Offline
Thanks for your response - here it is: https://0x0.st/HLLc.txt
Offline
Also post output of commands:
sudo ethtool eno1
sudo ethtool eno2
ethtool -a eno2
ethtool -i eno2
ethtool -S eno1
ethtool -S eno2
You may also see if 'ethtool --monitor eno2' displays something interesting.
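Since 'ethtool -S' output is long, a small filter can help when comparing eno1 and eno2. This is only a sketch: the function name nonzero_error_counters is hypothetical, and the pattern (err, drop, crc, fail) is a guess at which counter names are interesting, not a list taken from the driver.

```shell
# Sketch: from "ethtool -S" output on stdin, keep only counters whose name
# looks error-related and whose value is non-zero.
nonzero_error_counters() {
    awk -F': *' '
        tolower($1) ~ /err|drop|crc|fail/ && $2 + 0 != 0 {
            name = $1
            sub(/^[ \t]+/, "", name)   # strip the leading indentation
            print name "=" $2
        }
    '
}
```

Possible usage: 'ethtool -S eno2 | nonzero_error_counters', run once while the link is flapping and once after the switch restart.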
Last edited by xerxes_ (2023-08-18 20:03:10)
Offline
Now the problem does not manifest itself, because as I wrote above, I have already restarted Switch2 (eno2).
I am attaching the output of the required ethtool commands: https://0x0.st/HLVO.txt
Offline
Post the complete system journal of an affected boot to show the userspace context of the NIC flicker, not just the kernel messages.
Offline
Sorry, I should have done that right away. Here it is: https://0x0.st/HLW9.log
I went through it repeatedly and found nothing objectionable.
Offline
You're only using eno1 and eno2?
Can you flip their connections (plug the wire from eno1 into eno2 and vice versa) to see whether the issue remains w/ eno2 or moves to eno1?
Or delay the activation of eno2 and only connect it after eno1 had some moment?
(Physically delaying the connection by plugging the cable somewhat later might do)
Is this also a problem for a combination of eno1 & eno3?
Offline
I understand what you mean, seth. It's a production router and I can't experiment much on it - several clients depend on it.
I used Debian 10 and 11 on this hardware for more than two years before and never had anything like this happen. I think it has something to do with Arch. As I wrote above, I can live with rebooting the switch, but I would also like to find out what is actually happening.
I don't have ports eno3 and eno4 connected and I haven't tried swapping them with eno1 and eno2. I manage the router remotely via IPMI and now I can't change the ports.
I realize the problem is strange, but it's happened to me 3 times when I rebooted after upgrading to a newer kernel, so it definitely wasn't a coincidence.
Offline
Actually™, I didn't pay sufficient attention to the timestamps - the journal covers *days* and eno2 only goes down after ~46h - that's not a race condition on establishing multiple connections on the same PCI device.
"same pci device" also means that "pcie_aspm=off" likely won't help
You could "ixgbe.debug=16" but that will likely spam the journal.
So blind guessing first: what if you throw the NIC a life-line and run an eternal ping over it (and the attached switch)?
Were you running NM on Debian as well?
Offline
NIC and switch are under constant ping monitoring.
I didn't use NM on Debian ... I switched to it with Arch.
Offline
This actually shows a lot for ixgbe ("flapping"), incl. here: https://bbs.archlinux.org/viewtopic.php?id=246307 - "solution" was to use the other slot (eno1 in your case)
I also don't think it's a temperature issue if restarting the switch fixes it, and in a Proxmox thread, pinging either from or to the NIC failed to cure it.
Intel indeed suggested to disable aspm, https://community.intel.com/t5/Ethernet … -p/1258814 - but there was no follow up
=> I'd go w/ that next.
Un/surprisingly everyone seems to have the same ideas about this
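If "pcie_aspm=off" gets tried, it may be worth recording the ASPM state before and confirming the parameter after the reboot. A minimal sketch, assuming the usual sysfs location /sys/module/pcie_aspm/parameters/policy; both function names are made up for illustration.

```shell
# Sketch: print the active ASPM policy, i.e. the bracketed entry in the file.
# $1: policy file (defaults to the sysfs location assumed above)
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${1:-/sys/module/pcie_aspm/parameters/policy}"
}

# Sketch: succeed if the kernel command line contains the exact parameter.
# $1: parameter, e.g. pcie_aspm=off; $2: cmdline file (defaults to /proc/cmdline)
cmdline_has() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qx -- "$1"
}
```

Possible usage: 'active_aspm_policy' before the change, then 'cmdline_has pcie_aspm=off && echo applied' after rebooting with the parameter added to the boot loader entry.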
Offline
"Configure IntMode=1,1. Set this in modprobe.d and restart the system."
What do the Intel people mean by that? What is it supposed to do? Does it make sense to try?
https://www.intel.com/content/www/us/en … ducts.html
edit:
IntMode allows load time control over the type of interrupt registered for by the driver. MSI-X is required for multiple queue support, and some kernels and combinations of kernel .config options will force a lower level of interrupt support.
Last edited by vecino (2023-08-19 20:04:49)
Offline
That's a module parameter for e1000e, a kernel module you're not using for hardware you don't have.
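Whether a parameter such as IntMode applies can be checked against the module that is actually in use: 'modinfo -p' lists the parameters a module accepts, one per line in the form name:description. A small sketch with a hypothetical helper name:

```shell
# Sketch: succeed if "modinfo -p" output on stdin lists the given parameter.
# $1: parameter name, e.g. IntMode or debug
module_param_listed() {
    grep -q "^$1:"
}
```

Possible usage: 'modinfo -p ixgbe | module_param_listed IntMode || echo "not supported by this ixgbe build"'.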
Offline