I operate three Supermicro servers (DMI: Supermicro X8SIL/X8SIL, BIOS 1.2a 06/27/2012), each with two integrated Ethernet controllers. All currently run the LTS kernel, 6.1.18-1-lts.
Every now and then the network interface does not come up, leaving the server inaccessible over the 'normal' interfaces. The IPMI interface still responds, and a reboot normally cures the situation. This happens irregularly, about once or twice a week.
The journalctl output in this case is:
[root@nullnullsix ~]# journalctl -b-1 | grep kernel | grep e1000
Mär 15 23:15:30 nullnullsix kernel: e1000e: Intel(R) PRO/1000 Network Driver
Mär 15 23:15:30 nullnullsix kernel: e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Disabling ASPM L0s L1
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): Failed to initialize MSI-X interrupts. Falling back to MSI interrupts.
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): Failed to initialize MSI interrupts. Falling back to legacy interrupts.
Mär 15 23:15:30 nullnullsix kernel: e1000e: probe of 0000:04:00.0 failed with error -2
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0: Disabling ASPM L0s L1
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 0000:05:00.0 (uninitialized): registered PHC clock
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:25:90:09:bb:41
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 eth0: Intel(R) PRO/1000 Network Connection
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 eth0: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 enp5s0: renamed from eth0
A normal boot looks like this:
[root@nullnullsix ~]# journalctl -b | grep kernel | grep e1000
Mär 16 12:02:48 nullnullsix kernel: e1000e: Intel(R) PRO/1000 Network Driver
Mär 16 12:02:48 nullnullsix kernel: e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0: Disabling ASPM L0s L1
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): registered PHC clock
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:25:90:09:bb:40
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 eth0: Intel(R) PRO/1000 Network Connection
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 eth0: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0: Disabling ASPM L0s L1
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 0000:05:00.0 (uninitialized): registered PHC clock
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 eth1: (PCI Express:2.5GT/s:Width x1) 00:25:90:09:bb:41
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 eth1: Intel(R) PRO/1000 Network Connection
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 eth1: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 enp4s0: renamed from eth0
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 enp5s0: renamed from eth1
Mär 16 12:02:56 nullnullsix kernel: e1000e 0000:04:00.0 enp4s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
lspci -vv output:
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: Super Micro Computer Inc X8SIL
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fb5e0000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at dc00 [size=32]
Region 3: Memory at fb5dc000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <128ns, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout+ AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [140 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
Kernel driver in use: e1000e
Kernel modules: e1000e
Maybe someone has a clue here?
And no, I can't switch to kernel 6.2, because Wake-on-LAN does not work there.
Greetings
Harvey
Linux is like a wigwam: No Gates, no Windows and an Apache inside
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Does cold vs. warm boot matter?
Does the device show up and behave correctly on a rescan?
(e.g. https://stackoverflow.com/questions/323 … f-pcie-bus )
Seth,
first, thanks for your input!
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Does cold vs. warm boot matter?
No. I have had it after a cold boot as well as after a reboot. It turns up sporadically, without any pattern as far as I can tell.
Does the device show up and behave correctly on a rescan?
(e.g. https://stackoverflow.com/questions/323 … f-pcie-bus )
I will have to wait for the next failure to test that, but it is a good thing to try.
At some point I suspected the BMC, which shares a network port with the 'normal' LAN. I will connect an additional network cable the next time I am on site to rule that out. But then why did it work with pre-6 kernels...
Greetings
Harvey
Okay, today it failed again. I logged into the machine using the IPMI console, and the network devices do show up on the PCI bus:
[root@nullnullsix ~]# lspci | grep Ethernet
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Hence rescanning the PCI bus is probably not the way to go.
Nevertheless I did try
echo 1 > /sys/bus/pci/rescan
without any changes.
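A plain rescan only looks for devices the kernel does not know yet, and 04:00.0 is still listed, so it has nothing to do. A variant that might be worth trying is to remove the device first and rescan afterwards, which makes the kernel enumerate and probe it again. The following is only a sketch: the function name and the SYSFS_ROOT override are made up for illustration, it must run as root, and whether it revives a device the kernel reported as inaccessible in D3cold is untested.

```shell
#!/bin/sh
# Remove one PCI device from the kernel's view, then rescan the bus so
# the device is enumerated and probed again. Must run as root.
# SYSFS_ROOT is a hypothetical override (default /sys) for dry runs.
pci_remove_and_rescan() {
    bdf=$1
    root=${SYSFS_ROOT:-/sys}
    dev="$root/bus/pci/devices/$bdf"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi
    echo 1 > "$dev/remove"          # drop the device
    sleep 1                         # give the kernel a moment
    echo 1 > "$root/bus/pci/rescan" # re-enumerate all PCI buses
}

# Usage:
#   pci_remove_and_rescan 0000:04:00.0
```

If the probe fails again after this, the kernel log should show the same "Unable to change power state" message, which would point away from a simple enumeration problem.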
The second network interface seems to be functional, though (no cable is connected to it):
[root@nullnullsix ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 00:25:90:09:bb:41 brd ff:ff:ff:ff:ff:ff
Note that enp4s0 is missing...
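So the device is on the bus but has no network interface, which fits the failed probe: the driver never bound to it. A small sketch to check that directly (the function name and the SYSFS_ROOT override are hypothetical, the sysfs "driver" symlink is standard):

```shell
#!/bin/sh
# Print the name of the driver bound to a PCI device, or "(none)".
# A bound device has a "driver" symlink in its sysfs directory.
# SYSFS_ROOT is a hypothetical override (default /sys) for dry runs.
pci_driver_of() {
    root=${SYSFS_ROOT:-/sys}
    link="$root/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "(none)"
    fi
}

# Usage:
#   for bdf in 0000:04:00.0 0000:05:00.0; do
#       echo "$bdf: $(pci_driver_of "$bdf")"
#   done
```

In the failed state I would expect this to report a driver only for 0000:05:00.0, but that is an expectation, not something verified on the machine.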
I have a strong feeling that this could be related to the IPMI device and the 'normal' interface sharing the same physical interface. I will try giving the IPMI a dedicated network interface and cable and see if the problem persists... Weird.