I operate three Supermicro servers (DMI: Supermicro X8SIL/X8SIL, BIOS 1.2a 06/27/2012), each with two integrated Ethernet controllers. All of them currently run the LTS kernel: 6.1.18-1-lts
Every now and then the network interface does not come up, which leaves the server inaccessible over the 'normal' interfaces. The IPMI interface still responds and a reboot normally cures the situation. This happens irregularly, once or twice a week.
The journalctl output in these cases is:
[root@nullnullsix ~]# journalctl -b-1 | grep kernel | grep e1000
Mär 15 23:15:30 nullnullsix kernel: e1000e: Intel(R) PRO/1000 Network Driver
Mär 15 23:15:30 nullnullsix kernel: e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Disabling ASPM L0s L1
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): Failed to initialize MSI-X interrupts. Falling back to MSI interrupts.
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): Failed to initialize MSI interrupts. Falling back to legacy interrupts.
Mär 15 23:15:30 nullnullsix kernel: e1000e: probe of 0000:04:00.0 failed with error -2
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0: Disabling ASPM L0s L1
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 0000:05:00.0 (uninitialized): registered PHC clock
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:25:90:09:bb:41
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 eth0: Intel(R) PRO/1000 Network Connection
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 eth0: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:05:00.0 enp5s0: renamed from eth0
A normal boot looks like this:
[root@nullnullsix ~]# journalctl -b | grep kernel | grep e1000
Mär 16 12:02:48 nullnullsix kernel: e1000e: Intel(R) PRO/1000 Network Driver
Mär 16 12:02:48 nullnullsix kernel: e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0: Disabling ASPM L0s L1
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): registered PHC clock
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:25:90:09:bb:40
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 eth0: Intel(R) PRO/1000 Network Connection
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 eth0: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0: Disabling ASPM L0s L1
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 0000:05:00.0 (uninitialized): registered PHC clock
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 eth1: (PCI Express:2.5GT/s:Width x1) 00:25:90:09:bb:41
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 eth1: Intel(R) PRO/1000 Network Connection
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 eth1: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:04:00.0 enp4s0: renamed from eth0
Mär 16 12:02:48 nullnullsix kernel: e1000e 0000:05:00.0 enp5s0: renamed from eth1
Mär 16 12:02:56 nullnullsix kernel: e1000e 0000:04:00.0 enp4s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
lspci -vv output:
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: Super Micro Computer Inc X8SIL
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fb5e0000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at dc00 [size=32]
Region 3: Memory at fb5dc000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <128ns, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout+ AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [140 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
Kernel driver in use: e1000e
Kernel modules: e1000e
Maybe someone has a clue here?
And no, I can't switch to kernel 6.2, because Wake-on-LAN does not work there.
Greetings
Harvey
Linux is like a wigwam: No Gates, no Windows and an Apache inside
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Does cold vs. warm boot matter?
Does the device show up and behave correctly on a rescan?
(eg. https://stackoverflow.com/questions/323 … f-pcie-bus )
Seth,
first, thanks for your input!
Mär 15 23:15:30 nullnullsix kernel: e1000e 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Does cold vs. warm boot matter?
No. I have had it after a cold boot as well as after a reboot. It turns up sporadically, without any pattern as far as I can tell.
Does the device show up and behave correctly on a rescan?
(eg. https://stackoverflow.com/questions/323 … f-pcie-bus )
Will have to wait for the next failure to test that. But that is a good point to try.
At some point I suspected that the cause could be the BMC sharing the same network port with the 'normal' LAN. I will have to connect an additional network cable the next time I am on site to rule that out. But then why did it work with pre-6 kernels...
Greetings
Harvey
Okay, today it did fail again. So I logged into the machine using IPMI console and it looks like the network devices do show up on the PCI bus:
[root@nullnullsix ~]# lspci | grep Ethernet
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
So I don't think rescanning the PCI bus is the way to go.
Nevertheless, I did try
echo 1 > /sys/bus/pci/rescan
without any changes.
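A plain rescan only looks for devices the kernel does not know yet, so a device that is still listed is left alone. A more thorough attempt removes the device first and rescans afterwards. This is only a sketch, untested on this board, and a device stuck in D3cold may well fail to probe in the same way. The function name and the PCI_SYSFS override are made up for illustration:

```shell
# Drop one PCI device from the kernel's view, then rescan the bus so it is
# enumerated and probed again. Needs root. PCI_SYSFS is a made-up override
# so the function can be tried against a scratch directory.
pci_remove_and_rescan() {
    addr=$1                                   # e.g. 0000:04:00.0
    root=${PCI_SYSFS:-/sys/bus/pci}
    if [ ! -e "$root/devices/$addr/remove" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi
    echo 1 > "$root/devices/$addr/remove"
    sleep 1                                   # arbitrary settle time
    echo 1 > "$root/rescan"
}
```

Called as `pci_remove_and_rescan 0000:04:00.0`, followed by another look at the kernel log to see whether the probe succeeded this time.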
But the second network interface seems to be functional (no cable is connected to it):
[root@nullnullsix ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 00:25:90:09:bb:41 brd ff:ff:ff:ff:ff:ff
Note that enp4s0 is missing...
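The same finding (controller present on the bus, but no interface) can be read straight from sysfs. A rough sketch, assuming the usual layout where a working NIC has a net/ subdirectory under its PCI device; the function name and the PCI_SYSFS override are hypothetical:

```shell
# List PCI Ethernet controllers (class 0x0200xx) that have no network
# interface registered, i.e. whose driver probe did not complete.
# PCI_SYSFS is a made-up override for trying this on a scratch directory.
pci_nics_without_netdev() {
    root=${PCI_SYSFS:-/sys/bus/pci}
    for dev in "$root"/devices/*; do
        [ -r "$dev/class" ] || continue
        case $(cat "$dev/class") in
            0x0200*) ;;                       # Ethernet controller
            *) continue ;;
        esac
        # A working NIC has a net/ subdirectory holding its interface name.
        [ -d "$dev/net" ] || basename "$dev"
    done
}
```

In the failed state shown above, this would be expected to print 0000:04:00.0 and nothing for 0000:05:00.0.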
I have a strong feeling that this could be related to the IPMI device and the 'normal' interface sharing the same physical port. I will try giving the IPMI a dedicated network interface and cable and see whether the problem persists... Weird
Okay, so that was the wrong idea. I gave the BMC a dedicated cable and set the IPMI network to 'dedicated' which means that it is away from the normal network interfaces. And right after the next boot:
[root@numalfix ~]# journalctl -b-1 | grep e1000
Mär 29 16:04:05 numalfix kernel: e1000e: Intel(R) PRO/1000 Network Driver
Mär 29 16:04:05 numalfix kernel: e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Mär 29 16:04:05 numalfix kernel: e1000e 0000:04:00.0: Disabling ASPM L0s L1
Mär 29 16:04:05 numalfix kernel: e1000e 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Mär 29 16:04:05 numalfix kernel: e1000e 0000:04:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 29 16:04:05 numalfix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): Failed to initialize MSI-X interrupts. Falling back to MSI interrupts.
Mär 29 16:04:05 numalfix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): Failed to initialize MSI interrupts. Falling back to legacy interrupts.
Mär 29 16:04:05 numalfix kernel: e1000e: probe of 0000:04:00.0 failed with error -2
Mär 29 16:04:05 numalfix kernel: e1000e 0000:05:00.0: Disabling ASPM L0s L1
Mär 29 16:04:05 numalfix kernel: e1000e 0000:05:00.0: Unable to change power state from D3cold to D0, device inaccessible
Mär 29 16:04:05 numalfix kernel: e1000e 0000:05:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 29 16:04:05 numalfix kernel: e1000e 0000:05:00.0 0000:05:00.0 (uninitialized): Failed to initialize MSI-X interrupts. Falling back to MSI interrupts.
Mär 29 16:04:05 numalfix kernel: e1000e 0000:05:00.0 0000:05:00.0 (uninitialized): Failed to initialize MSI interrupts. Falling back to legacy interrupts.
Mär 29 16:04:05 numalfix kernel: e1000e: probe of 0000:05:00.0 failed with error -2
Both network interfaces were inactive. After a reset (via the management interface which is working...) all is back to normal:
[root@numalfix ~]# journalctl -b | grep e1000
Mär 29 16:05:08 numalfix kernel: e1000e: Intel(R) PRO/1000 Network Driver
Mär 29 16:05:08 numalfix kernel: e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Mär 29 16:05:08 numalfix kernel: e1000e 0000:04:00.0: Disabling ASPM L0s L1
Mär 29 16:05:08 numalfix kernel: e1000e 0000:04:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 29 16:05:08 numalfix kernel: e1000e 0000:04:00.0 0000:04:00.0 (uninitialized): registered PHC clock
Mär 29 16:05:08 numalfix kernel: e1000e 0000:04:00.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:25:90:37:67:f4
Mär 29 16:05:08 numalfix kernel: e1000e 0000:04:00.0 eth0: Intel(R) PRO/1000 Network Connection
Mär 29 16:05:08 numalfix kernel: e1000e 0000:04:00.0 eth0: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 29 16:05:08 numalfix kernel: e1000e 0000:05:00.0: Disabling ASPM L0s L1
Mär 29 16:05:08 numalfix kernel: e1000e 0000:05:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Mär 29 16:05:08 numalfix kernel: e1000e 0000:05:00.0 0000:05:00.0 (uninitialized): registered PHC clock
Mär 29 16:05:08 numalfix kernel: e1000e 0000:05:00.0 eth1: (PCI Express:2.5GT/s:Width x1) 00:25:90:37:67:f5
Mär 29 16:05:08 numalfix kernel: e1000e 0000:05:00.0 eth1: Intel(R) PRO/1000 Network Connection
Mär 29 16:05:08 numalfix kernel: e1000e 0000:05:00.0 eth1: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
Mär 29 16:05:08 numalfix kernel: e1000e 0000:04:00.0 enp4s0: renamed from eth0
Mär 29 16:05:08 numalfix kernel: e1000e 0000:05:00.0 enp5s0: renamed from eth1
Mär 29 16:05:11 numalfix kernel: e1000e 0000:04:00.0 enp4s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
That was my best guess so far. Does anyone else have an idea?
Last edited by Harey (2023-03-29 14:15:34)
"pcie_aspm=off"?
Add e1000e to the initramfs?
Did you check the journal for bus errors preceding the device failure?
I did check the journal: no bus errors, not even warnings beforehand. For now I have added the module to the initramfs. Let's see what happens.
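For reference, adding the module to the initramfs means listing it in the MODULES array of /etc/mkinitcpio.conf and rebuilding the images. A small sketch of that edit, assuming a single-line MODULES=(...) entry and GNU sed; the helper name add_initramfs_module is made up:

```shell
# Append a module to the MODULES=(...) line of a mkinitcpio config file,
# unless it is already listed. Assumes the array sits on one line.
add_initramfs_module() {
    conf=$1 module=$2
    # Already listed: nothing to do.
    if grep -Eq "^MODULES=\((.*[[:space:]])?${module}([[:space:]].*)?\)" "$conf"; then
        return 0
    fi
    # Append the module, then tidy up the space left by an empty array.
    sed -i -E "s/^MODULES=\(([^)]*)\)/MODULES=(\1 ${module})/; s/^MODULES=\( +/MODULES=(/" "$conf"
}
```

As root this would be `add_initramfs_module /etc/mkinitcpio.conf e1000e`, followed by `mkinitcpio -P` to regenerate the images. The other suggestion, pcie_aspm=off, is a kernel parameter and goes on the kernel command line in the boot loader configuration instead.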
"pcie_aspm=off"?
Add e1000e to the initramfs?
I did both, and it fixed my problem on an X8SIE after a reboot. Then I removed the kernel parameter and it stayed fixed, so most likely the mkinitcpio.conf edit is what made it work.
I'm not sure when my ports stopped working, as the machine was connected to my Google Wifi mesh, which falls back to wireless...
Last edited by prokrypt (2023-04-08 01:30:18)
@prokrypt: Is this on 6.2 or the LTS kernel? At least I am not alone.
The mkinitcpio.conf edit makes it a lot more stable for me too, but it's not fixed completely... Yesterday I was working on one of the servers and had to reboot several times, and there it was again, though only once. So far I can't tell why this is happening. I had hoped that moving to the 6.2 kernel, once its Wake-on-LAN problem is fixed, would squash this bug as well.
I was on 6.2.8.
Maybe I should upgrade to 6.2.10 and roll the dice again? Or perhaps just enjoy my ethernet while it's still working
Last edited by prokrypt (2023-04-08 19:49:08)