Logs report "Detected Hardware Unit Hang" repeatedly (e1000e)

hexadecagram · 2022-09-30 03:32:35

Hello,

I have a Shuttle XPC Cube SZ170R6V2 and was running the LTS kernel with this NIC:

> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)

I was experiencing what appear to be the same issues as reported in this thread. I have several other Arch installations, and this particular machine is the only one logging this, which has caused the NIC to lose carrier (and reconnect) at random intervals.

I have since switched from 5.15.71-1-lts to 5.19.11-arch1-1. About 45 minutes of uptime after the switch, I was disappointed to see the messages again. So far, however, they seem to be less frequent than when I was running linux-lts.

Here are a couple of examples:

[ 9541.162287] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                 TDH                  <a6>
                 TDT                  <e9>
                 next_to_use          <e9>
                 next_to_clean        <a5>
               buffer_info[next_to_clean]:
                 time_stamp           <1002a4bd4>
                 next_to_watch        <a9>
                 jiffies              <1002a4dc1>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[ 9543.078918] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                 TDH                  <a6>
                 TDT                  <e9>
                 next_to_use          <e9>
                 next_to_clean        <a5>
               buffer_info[next_to_clean]:
                 time_stamp           <1002a4bd4>
                 next_to_watch        <a9>
                 jiffies              <1002a5000>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>

Both of these are back-to-back in the output of dmesg. However, there is usually a corresponding log message from systemd-networkd saying that enp0s31f6's carrier has been lost and then gained.
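For anyone reading these dumps, here is a minimal sketch of the arithmetic behind the first one. It assumes TDH and TDT are the transmit descriptor head and tail registers and that the ring has not wrapped between them. The variable names are only illustrative; the values are copied from the dump.

```shell
# Values copied from the first dump above (hexadecimal, as printed by e1000e).
tdh=0xa6
tdt=0xe9
time_stamp=0x1002a4bd4
jiffies_now=0x1002a4dc1

# Descriptors queued at the tail that the hardware head has not reached yet
# (assumes no ring wrap-around between TDH and TDT).
pending=$(( tdt - tdh ))

# Jiffies elapsed since the oldest uncleaned buffer was time-stamped.
stalled=$(( jiffies_now - time_stamp ))

echo "pending descriptors: $pending"
echo "stalled for: $stalled jiffies"
```

That works out to 67 pending descriptors and 493 jiffies. The two dumps are 1.916631 s apart in dmesg and 575 jiffies apart, which suggests about 300 jiffies per second on this kernel, so the first dump would correspond to roughly 1.6 s without progress.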

Anyone else seeing this?

seth · 2022-09-30 06:40:21

Try to add "pcie_aspm=off e1000e.SmartPowerDownEnable=0" to the kernel parameters and please post a complete system journal covering the unit hangs.
Also test

ethtool --show-eee enp0s31f6

and if in doubt, disable it ("ethtool --set-eee enp0s31f6 eee off" is transient, i.e. it won't survive a reboot; "e1000e.EEE=0" *might* apply)

hexadecagram · 2022-09-30 08:14:33

I meant to mention that I tried this and I still kept seeing the log messages afterward:

seth wrote:

https://forum.proxmox.com/threads/e1000 … 284/page-8
Try
ethtool -K enp0s31f6 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

hexadecagram · 2022-09-30 08:30:05

seth wrote:

Try to add "pcie_aspm=off e1000e.SmartPowerDownEnable=0" to the kernel parameters and please post a complete system journal covering the unit hangs.

I have enabled that and will reboot the moment I see another hang. dmesg has not shown any more since I posted but I have also been distracted with other tasks. Is there a tool to pipe dmesg into systemd-journald?

seth wrote:

Also test
ethtool --show-eee enp0s31f6
and if in doubt, disable it ("ethtool --set-eee enp0s31f6 eee off" is transient, i.e. it won't survive a reboot; "e1000e.EEE=0" *might* apply)

I'll keep it in mind. But this particular machine is a server that does not ever sleep or hibernate. FWIW, here's the current output:

EEE settings for enp0s31f6:
        EEE status: enabled - inactive
        Tx LPI: 17 (us)
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  Not reported

seth · 2022-09-30 11:28:24

Is there a tool to pipe dmesg into systemd-journald?

The ring buffer goes into the journal by default; "sudo journalctl -b" will have the kernel and global userspace messages.
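A small sketch of counting the hang messages from that journal, assuming the message text stays exactly as in the dumps earlier in the thread. `count_hangs` is a made-up helper name, and the two sample lines are copied from this thread so the snippet runs without a journal.

```shell
# count_hangs: count "Detected Hardware Unit Hang" lines read from stdin.
# (Hypothetical helper name, not an existing tool.)
count_hangs() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
    grep -c 'Detected Hardware Unit Hang' || true
}

# Two sample journal lines taken from this thread: one hang, one reset.
sample='Sep 29 20:16:23 briareos kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Oct 02 15:48:55 persephone kernel: e1000e 0000:00:1f.6 enp0s31f6: Reset adapter unexpectedly'

printf '%s\n' "$sample" | count_hangs
```

On a live system the input would come from the journal instead, e.g. `journalctl -k -b | count_hangs`, where `-k` limits the output to kernel messages and `-b` to the current boot.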

But this particular machine is a server that does not ever sleep or hibernate.

That's not relevant to EEE; EEE powers the chip down when there's no traffic.

hexadecagram · 2022-09-30 21:24:35

seth wrote:

hexadecagram wrote:
Is there a tool to pipe dmesg into systemd-journald?
The ring buffer goes into the journal by default; "sudo journalctl -b" will have the kernel and global userspace messages.

That's odd. I tried journalctl -b | grep -i detect the other day (or I thought I did) and it came back empty. I must have typoed something because I get matches now.

seth wrote:

hexadecagram wrote:
But this particular machine is a server that does not ever sleep or hibernate.
That's not relevant to EEE; EEE powers the chip down when there's no traffic.

Oh okay. For what it's worth, I have left it enabled for the time being.

Last hang was:

Sep 29 20:16:23 briareos kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:

❯ uptime
 14:23:45 up 20:46, 17 users,  load average: 1.40, 1.64, 1.72
❯ journalctl -b | grep Hang | wc -l
17

hexadecagram · 2022-10-01 10:41:48

While I've been waiting for another hang to occur, I noticed that it started happening on a completely different machine! This time, a Shuttle DH170, with the same model of NIC:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)

I likewise rebooted this machine into 5.19.11-arch1-1 and it almost immediately started happening again so I enabled the kernel parameters that you recommended, and switched back to 5.15.71-1-lts. So far, it's been a much quieter experience.

For the record, I count over 59,178 (yes) hangs since Sept. 25 on the first machine. The second one "only" had 6,565 since Sept. 28.

If I'm understanding the tickets you shared correctly, the problem has been remedied, but I've only skimmed them and I'll have to have a closer look tomorrow.

seth · 2022-10-01 11:59:44

What tickets did I share?
I just went by the guess that the chip doesn't power up out of some idle mode and suggested some measures to keep the NIC powered up (pcie_aspm is a pretty broad sword and will affect all PCI devices to some degree).

hexadecagram · 2022-10-01 17:47:29

My mistake, it was lordnaikon, in the post that I referred to (not this one).

lordnaikon wrote:

What I did find was some discussion here https://bugzilla.kernel.org/show_bug.cgi?id=213377 and here https://bugzilla.kernel.org/show_bug.cgi?id=213651.

hexadecagram · 2022-10-01 21:47:45

Hangs today:

❯ journalctl -b G Hang | wc -l
61

I have implemented your second suggestion:

❯ sudo ethtool --set-eee enp0s31f6 eee off
❯ ethtool --show-eee enp0s31f6
EEE settings for enp0s31f6:
        EEE status: disabled
        Tx LPI: 17 (us)
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  Not reported

Last edited by hexadecagram (2022-10-01 21:53:06)

hexadecagram · 2022-10-02 01:54:54

More hangs unfortunately (on the DH170; the SZ170R6V2 is still yet to report one since I implemented your fixes):

❯ journalctl -b G Hang | wc -l
125

seth · 2022-10-02 07:32:16

seth wrote:

please post a complete system journal covering the unit hangs

The context might hint at the cause.

hexadecagram · 2022-10-03 00:18:26

seth wrote:

seth wrote:
please post a complete system journal covering the unit hangs
The context might hint at the cause.

Sure. If it isn't the kernel, it may be related to keepalived or named. Here are the last 2 hangs, which, going back through the logs, seem to be a recurring pattern. But again, the hangs are only occurring on the DH170 now. The SZ170R6V2 logs similar messages about sending gratuitous ARP messages but the interface does not hang. The SZ is also not running named. The DH170 is a router, so it's seeing a great deal more traffic than the SZ170R6V2, so that could also be a cause.

Note the logs below are from kernel 5.15. After gathering them, I have yet again switched back to kernel 5.19 and after an hour of uptime, I'm not seeing any hangs WITH the kernel parameters you've suggested, including e1000e.EEE=0 (yet). This is the first time I've run 5.19 WITH all kernel parameters. In other words, my attempts have been:

5.15,
5.19,
5.15 pcie_aspm=off e1000e.SmartPowerDownEnable=0,
5.15 pcie_aspm=off e1000e.SmartPowerDownEnable=0 e1000e.EEE=0 (which produced the following logs), and finally
5.19 pcie_aspm=off e1000e.SmartPowerDownEnable=0 e1000e.EEE=0 (now, AFTER the following logs were generated).
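When switching between configurations like these, it can help to confirm that the running kernel was actually booted with the intended parameters. A minimal sketch, assuming the parameters are plain space-separated words on the kernel command line; `cmdline_has` is a hypothetical helper, and the example command line contains only the parameters named in this thread.

```shell
# cmdline_has PARAM CMDLINE: succeed if PARAM is one of the space-separated
# words in CMDLINE. (Hypothetical helper, not a real tool.)
cmdline_has() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example command line built from the parameters discussed in this thread.
# On a live system, use: cmdline=$(cat /proc/cmdline)
cmdline='pcie_aspm=off e1000e.SmartPowerDownEnable=0 e1000e.EEE=0'

for p in pcie_aspm=off e1000e.SmartPowerDownEnable=0 e1000e.EEE=0; do
    if cmdline_has "$p" "$cmdline"; then
        echo "present: $p"
    else
        echo "missing: $p"
    fi
done
```

A parameter being present on the command line only shows that it was passed at boot; it does not show that the driver recognised or honoured it.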

Oct 02 15:48:50 persephone dockerd[1181]: time="2022-10-02T15:48:50.262488215-07:00" level=info msg="memberlist: Suspect b7ffbfcc802c has failed, no acks received"
Oct 02 15:48:50 persephone dockerd[1181]: time="2022-10-02T15:48:50.939913012-07:00" level=warning msg="memberlist: Refuting a suspect message (from: b7ffbfcc802c)"
Oct 02 15:48:51 persephone kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                                     TDH                  <99>
                                     TDT                  <ef>
                                     next_to_use          <ef>
                                     next_to_clean        <94>
                                   buffer_info[next_to_clean]:
                                     time_stamp           <100cb6ff1>
                                     next_to_watch        <99>
                                     jiffies              <100cb70e9>
                                     next_to_watch.status <0>
                                   MAC Status             <40080083>
                                   PHY Status             <796d>
                                   PHY 1000BASE-T Status  <3800>
                                   PHY Extended Status    <3000>
                                   PCI Status             <10>
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_1) Received advert from 192.168.0.78 with lower priority 101, ours 200, forcing new election
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_1) Sending/queueing gratuitous ARPs on br0 for 192.168.0.1
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) Receive advertisement timeout
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) Entering MASTER STATE
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) setting VIPs.
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) Sending/queueing gratuitous ARPs on br0 for 192.168.0.124
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.124
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) Sending/queueing gratuitous ARPs on br0 for 192.168.0.125
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.125
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) Sending/queueing gratuitous ARPs on br0 for 192.168.0.126
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.126
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.124
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.125
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.126
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.124
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.125
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.126
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.124
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.125
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.126
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.124
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.125
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.126
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) Master received advert from 192.168.0.77 with higher priority 200, ours 100
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) Entering BACKUP STATE
Oct 02 15:48:51 persephone Keepalived_vrrp[1167]: (VI_3) removing VIPs.
Oct 02 15:48:51 persephone named[973]: listening on IPv4 interface br0, 192.168.0.126#53
Oct 02 15:48:51 persephone named[973]: no longer listening on 192.168.0.126#53
Oct 02 15:48:53 persephone kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                                     TDH                  <99>
                                     TDT                  <ef>
                                     next_to_use          <ef>
                                     next_to_clean        <94>
                                   buffer_info[next_to_clean]:
                                     time_stamp           <100cb6ff1>
                                     next_to_watch        <99>
                                     jiffies              <100cb71b0>
                                     next_to_watch.status <0>
                                   MAC Status             <40080083>
                                   PHY Status             <796d>
                                   PHY 1000BASE-T Status  <3800>
                                   PHY Extended Status    <3000>
                                   PCI Status             <10>
Oct 02 15:48:53 persephone dockerd[1181]: time="2022-10-02T15:48:53.263230176-07:00" level=info msg="memberlist: Suspect b7ffbfcc802c has failed, no acks received"
Oct 02 15:48:55 persephone kernel: e1000e 0000:00:1f.6 enp0s31f6: Reset adapter unexpectedly
Oct 02 15:48:55 persephone systemd-networkd[2702941]: enp0s31f6: Lost carrier
Oct 02 15:48:55 persephone openvpn[2702985]: TLS Error: cannot locate HMAC in incoming packet from [AF_INET]94.102.61.34:53100
Oct 02 15:48:55 persephone nut-monitor[1163]: Poll UPS [ups@briareos] failed - Driver not connected
Oct 02 15:48:55 persephone kernel: bond0: (slave enp0s31f6): speed changed to 0 on port 1
Oct 02 15:48:55 persephone kernel: bond0: (slave enp0s31f6): link status definitely down, disabling slave
Oct 02 15:48:56 persephone Keepalived_vrrp[1167]: (VI_1) Sending/queueing gratuitous ARPs on br0 for 192.168.0.1
Oct 02 15:48:56 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:56 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:56 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:56 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:56 persephone Keepalived_vrrp[1167]: Sending gratuitous ARP on br0 for 192.168.0.1
Oct 02 15:48:58 persephone systemd-networkd[2702941]: enp0s31f6: Gained carrier
Oct 02 15:48:58 persephone kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Oct 02 15:48:58 persephone kernel: bond0: (slave enp0s31f6): link status definitely up, 1000 Mbps full duplex

Last edited by hexadecagram (2022-10-03 01:13:40)

seth · 2022-10-03 06:31:12

The log segment starts way too late

Oct 02 15:48:50 persephone dockerd[1181]: time="2022-10-02T15:48:50.262488215-07:00" level=info msg="memberlist: Suspect b7ffbfcc802c has failed, no acks received"

is already a symptom of the hang.

The DH170 is a router, so it's seeing a great deal more traffic than the SZ170R6V2, so that could also be a cause.

Wild guess (we haven't seen any data to support this):
https://bbs.archlinux.org/viewtopic.php … 7#p2022487

hexadecagram · 2022-10-03 09:29:26

seth wrote:

The log segment starts way too late

Oct 02 15:48:50 persephone dockerd[1181]: time="2022-10-02T15:48:50.262488215-07:00" level=info msg="memberlist: Suspect b7ffbfcc802c has failed, no acks received"

is already a symptom of the hang.

Here you are:

Oct 02 15:43:13 persephone dockerd[1181]: time="2022-10-02T15:43:13.661549053-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:yi19atxzd4idq03537ouc09ll leaving:false netPeers:4 entries:30 Queue qLen:0 netMsg/s:0"
Oct 02 15:43:13 persephone dockerd[1181]: time="2022-10-02T15:43:13.661608148-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:wvpxlmwf1tte5vfwuad4tcstz leaving:false netPeers:4 entries:112 Queue qLen:0 netMsg/s:0"
Oct 02 15:43:13 persephone dockerd[1181]: time="2022-10-02T15:43:13.661633192-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:0qrfiwthw0vi6ndjl3eclwwwv leaving:false netPeers:3 entries:65 Queue qLen:0 netMsg/s:0"
Oct 02 15:43:13 persephone dockerd[1181]: time="2022-10-02T15:43:13.661650047-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:ey98q5eyusupfn1ag5rm6wv66 leaving:false netPeers:4 entries:22 Queue qLen:0 netMsg/s:0"
Oct 02 15:43:13 persephone dockerd[1181]: time="2022-10-02T15:43:13.661665298-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:adwz3p67rg0p3anme9kit734q leaving:false netPeers:4 entries:34 Queue qLen:0 netMsg/s:0"
Oct 02 15:48:13 persephone dockerd[1181]: time="2022-10-02T15:48:13.861366099-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:yi19atxzd4idq03537ouc09ll leaving:false netPeers:4 entries:30 Queue qLen:0 netMsg/s:0"
Oct 02 15:48:13 persephone dockerd[1181]: time="2022-10-02T15:48:13.861439474-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:wvpxlmwf1tte5vfwuad4tcstz leaving:false netPeers:4 entries:112 Queue qLen:0 netMsg/s:0"
Oct 02 15:48:13 persephone dockerd[1181]: time="2022-10-02T15:48:13.861486492-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:0qrfiwthw0vi6ndjl3eclwwwv leaving:false netPeers:3 entries:65 Queue qLen:0 netMsg/s:0"
Oct 02 15:48:13 persephone dockerd[1181]: time="2022-10-02T15:48:13.861518765-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:ey98q5eyusupfn1ag5rm6wv66 leaving:false netPeers:4 entries:22 Queue qLen:0 netMsg/s:0"
Oct 02 15:48:13 persephone dockerd[1181]: time="2022-10-02T15:48:13.861555188-07:00" level=info msg="NetworkDB stats persephone(1ce6efe3729d) - netID:adwz3p67rg0p3anme9kit734q leaving:false netPeers:4 entries:34 Queue qLen:0 netMsg/s:0"
Oct 02 15:48:50 persephone dockerd[1181]: time="2022-10-02T15:48:50.262488215-07:00" level=info msg="memberlist: Suspect b7ffbfcc802c has failed, no acks received"
Oct 02 15:48:50 persephone dockerd[1181]: time="2022-10-02T15:48:50.939913012-07:00" level=warning msg="memberlist: Refuting a suspect message (from: b7ffbfcc802c)"

seth wrote:

hexadecagram wrote:
The DH170 is a router, so it's seeing a great deal more traffic than the SZ170R6V2, so that could also be a cause.
Wild guess (we haven't seen any data to support this):
https://bbs.archlinux.org/viewtopic.php … 7#p2022487

Done:

❯ journalctl -b G Hang | wc -l
40
❯ cat /proc/sys/net/core/wmem_max
212992
❯ echo 2097152 | sudo tee /proc/sys/net/core/wmem_max
2097152

hexadecagram · 2022-10-03 09:34:19

I just thought to have a look.

❯ ethtool --show-eee enp0s31f6
EEE settings for enp0s31f6:
        EEE status: enabled - inactive
        Tx LPI: 17 (us)
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  Not reported
❯ sudo ethtool --set-eee enp0s31f6 eee off
❯ ethtool --show-eee enp0s31f6
EEE settings for enp0s31f6:
        EEE status: disabled
        Tx LPI: 17 (us)
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  Not reported
❯ journalctl -b G Hang | wc -l
40

Evidently, the kernel parameter e1000e.EEE=0 does not work.
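One way to check whether a module even advertises a parameter is to look at the list printed by `modinfo -p`, which has one `name:description` entry per line. A minimal sketch follows; `has_param` is a hypothetical helper, and the sample input is made up to show the shape of the list rather than captured from either machine.

```shell
# has_param NAME: read `modinfo -p`-style lines ("name:description") on stdin
# and succeed if NAME is listed. (Hypothetical helper, not a real tool.)
has_param() {
    grep -q "^$1:"
}

# Illustrative input only, NOT captured from the machines in this thread.
sample='SmartPowerDownEnable:example description (array of int)
copybreak:example description (uint)'

for name in SmartPowerDownEnable EEE; do
    if printf '%s\n' "$sample" | has_param "$name"; then
        echo "$name: listed"
    else
        echo "$name: not listed"
    fi
done
```

Against the real driver this would be `modinfo -p e1000e | has_param EEE`. If the name is not listed for the installed module, that would be consistent with the parameter having no effect here.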

hexadecagram · 2022-10-03 19:07:37

Setting it to 2097152 did not prevent the occurrence of hangs, so I am setting it back.

❯ journalctl -b G Hang | wc -l
114
❯ echo 212992 | sudo tee /proc/sys/net/core/wmem_max
212992

seth · 2022-10-03 20:02:02

Can you prevent this by throwing the NIC a lifeline, eg. indefinitely "ping google.com" in a terminal?

Edit, and another thing I noticed: do you run docker on all affected systems?

Last edited by seth (2022-10-03 20:02:38)

hexadecagram · 2022-10-04 23:59:31

seth wrote:

Can you prevent this by throwing the NIC a lifeline, eg. indefinitely "ping google.com" in a terminal?

How would ICMP traffic differ from the gobs of other traffic it is routing? (It's an edge router.)

In any case, do you want the ping to originate from persephone itself or from another machine in the LAN, routed through persephone?

seth wrote:

Edit, and another thing I noticed: do you run docker on all affected systems?

There is currently only 1 affected system: persephone. briareos was the first one, and it has ceased logging hangs entirely. I have a 3rd machine that is more-or-less persephone's twin in terms of configuration (named, you guessed it, demeter) but is currently not routing any traffic as I cannot get the DHCP lease from my ISP to stick to it. All 3 systems are running Docker. { journalctl -b -u docker.service } looks very similar on all 3 machines.

hexadecagram · 2022-10-05 00:21:16

Also, there has been a substantial uptick in the number of hangs that persephone has seen since I reverted net.core.wmem_max. So I tried tweaking the send buffer again, this time to what I usually see recommended:

❯ journalctl -b G Hang | wc -l
2239
❯ echo 16777216 | sudo tee /proc/sys/net/core/wmem_max
16777216

Unfortunately, it still had no effect. Almost immediately, I get:

❯ journalctl -b G Hang | wc -l
2245

I did of course try 2097152 first, but the hang count kept climbing, so I decided to try 16MB.
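For reference, the sysctl key net.core.wmem_max and the file written with tee above are the same setting: the dots in the key map to directory separators under /proc/sys. A minimal sketch of that mapping, where `sysctl_path` is a hypothetical helper name:

```shell
# sysctl_path KEY: print the /proc/sys file that backs a sysctl key.
# (Hypothetical helper; assumes the key itself contains no literal dots
# other than the separators, which holds for net.core.wmem_max.)
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

sysctl_path net.core.wmem_max
```

The same change can be made with `sudo sysctl -w net.core.wmem_max=16777216`. Either way the value is lost on reboot unless it is also placed in a file under /etc/sysctl.d/.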

Could all of this be simply due to network overload? Hard to imagine, but I did have netfilter / network tuning on the todo list, I just hadn't got to it yet. I've got some time coming up soon where I can focus on that.

By the way, if this is a known issue (I've been very busy this week and still haven't had a chance to look closely at those 2 tickets), I could just disable enp0s31f6 for the time being until the kernel is patched. Each machine has it bonded via LACP with enp1s0. Since I've recently repurposed persephone to be an edge router, the plan, however, is to remove that bonding and utilize one port as an uplink and the other as LAN. It's possible that doing so could relieve all these problems but it would be nice to know that before making such a move in order to avoid downtime.

Last edited by hexadecagram (2022-10-05 03:05:59)

seth · 2022-10-05 08:42:36

from persephone itself

There's no difference, the only plan is to keep the NIC constantly active and prevent any kind of powering down.

Could all of this be simply due to network overload?

Raising wmem_max can help to prevent TCP buffer jams from getting critical.
E.g. if there's a period of no traffic and the NIC powers down, and then a massive burst arrives and the NIC can't power up fast enough, the buffer might get flooded.

if this is a known issue

No idea - but if it affects LTS and main kernel it's probably not a temporary regression that might get addressed soon.

Each machine has it bonded via LACP with enp1s0

If you can accept the test on any system, the bonding might actually be at the root of this because of the different loading patterns on the NIC…

hexadecagram · 2022-10-06 01:10:14

seth wrote:

from persephone itself
There's no difference, the only plan is to keep the NIC constantly active and prevent any kind of powering down.

Okay:

❯ journalctl -b G Hang | wc -l; ping -c 300 -q 8.8.8.8; journalctl -b G Hang | wc -l
19237
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

--- 8.8.8.8 ping statistics ---
300 packets transmitted, 300 received, 0% packet loss, time 299420ms
rtt min/avg/max/mdev = 3.939/4.618/5.630/0.308 ms
19305
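To put the two counts in perspective, a quick sketch of the arithmetic, using only the numbers printed above (the variable names are illustrative):

```shell
before=19237        # hang lines counted before the ping run
after=19305         # hang lines counted after the ping run
elapsed_ms=299420   # run time reported by ping

# Hang lines logged while the ping was running.
new=$(( after - before ))

echo "new hang lines during ping: $new"
echo "about one every $(( elapsed_ms / new )) ms"
```

That is 68 new hang lines, or about one every 4.4 seconds, so the continuous ping did not keep the messages from appearing.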

seth wrote:

Could all of this be simply due to network overload?
Raising wmem_max can help to prevent TCP buffer jams from getting critical.
E.g. if there's a period of no traffic and the NIC powers down, and then a massive burst arrives and the NIC can't power up fast enough, the buffer might get flooded.

Sounds like you're sure the NIC is powering down then? It isn't something else?

seth wrote:

if this is a known issue
No idea - but if it affects LTS and main kernel it's probably not a temporary regression that might get addressed soon.
Each machine has it bonded via LACP with enp1s0
If you can accept the test on any system, the bonding might actually be at the root of this because of the different loading patterns on the NIC…

I have a new network switch coming in the mail very soon. Once it does, I'll reconfigure without bonding and see if it helps. Here's /etc/systemd/network/bond0.netdev, if you're interested. I've had it this way for many years. I have tried commenting-out LACPTransmitRate and MIIMonitorSec. The hangs keep persisting.

[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
LACPTransmitRate=fast
MIIMonitorSec=100ms

In the meantime, I've downed the NIC and the hangs have stopped, as one would expect.

For what it's worth, these hangs do not seem to be causing huge waves in performance. They are having SOME effect periodically, e.g. webpages won't load on the first try but repeated reloads eventually work, but my VPN tunnels are staying connected. For the most part however the network is fairly stable.

Thanks for all the assistance.

hexadecagram · 2022-10-18 06:30:20

seth wrote:

hexadecagram wrote:
Each machine has it bonded via LACP with enp1s0
If you can accept the test on any system, the bonding might actually be at the root of this because of the different loading patterns on the NIC…

Yep. My network switch arrived, has been installed, and the box has been reconfigured WITHOUT bond0. It's been about a week now and there hasn't been a single hang reported since it was reconfigured.

I have several machines that use LACP, and persephone and briareos are the only 2 that have reported hangs at all. Both only started doing so when I repurposed them as edge routers and they started forwarding a lot of traffic.

I can live without LACP on my edge router; in fact, I can no longer use it at all. I do plan to continue using LACP on my other machines, though, so this issue could resurface. It does seem to be related to either traffic forwarding or the amount of traffic involved.
