You are not logged in.
EDIT: Basically, kernel crash due to bug FS#46570 (kernel soft locking). The patch is in kernel 4.1.10-2 (LTS).
Hi,
I have access to a server running Arch which seems to lose its connection since an update this morning (after restarting).
The server seems to start fine, and I can log into it (as well as visit webpages hosted on it) as long as I'm fast enough (i.e. just logging in and running htop shows a connection loss at an uptime of ~1min 22s).
I am not sure if this is a hardware crash or a software issue, as I have no physical access to the machine (~2000km away from me...). I have access to a rescue system and can chroot in.
The last updates before this issue arises are as follow:
[2015-10-01 19:53] [ALPM] transaction started
[2015-10-01 19:54] [ALPM] upgraded python (3.4.3-3 -> 3.5.0-2)
[2015-10-01 19:54] [ALPM] upgraded archey3 (0.5-3 -> 0.5-4)
[2015-10-01 19:54] [ALPM] upgraded avahi (0.6.31-16 -> 0.6.31-17)
[2015-10-01 19:54] [ALPM] upgraded libutil-linux (2.27-2 -> 2.27-5)
[2015-10-01 19:54] [ALPM] upgraded e2fsprogs (1.42.12-2 -> 1.42.13-1)
[2015-10-01 19:54] [ALPM] upgraded python-pyinotify (0.9.6-1 -> 0.9.6-2)
[2015-10-01 19:54] [ALPM] upgraded python-systemd (230-1 -> 230-2)
[2015-10-01 19:54] [ALPM] upgraded python-lxml (3.4.4-1 -> 3.4.4-2)
[2015-10-01 19:54] [ALPM] upgraded util-linux (2.27-2 -> 2.27-5)
[2015-10-01 19:54] [ALPM] upgraded fail2ban (0.9.3-3 -> 0.9.3-4)
[2015-10-01 19:54] [ALPM] upgraded libassuan (2.2.1-1 -> 2.3.0-1)
[2015-10-01 19:54] [ALPM] upgraded gnupg (2.1.8-3 -> 2.1.8-4)
[2015-10-01 19:54] [ALPM] upgraded intel-tbb (4.3_20150611-1 -> 4.4_20150728-1)
[2015-10-01 19:54] [ALPM] upgraded iproute2 (4.0.0-2 -> 4.1.1-1)
[2015-10-01 19:54] [ALPM] upgraded python-decorator (4.0.2-1 -> 4.0.2-2)
[2015-10-01 19:54] [ALPM] upgraded python-pexpect (3.3-1 -> 3.3-2)
[2015-10-01 19:54] [ALPM] upgraded python-packaging (15.3-1 -> 15.3-2)
[2015-10-01 19:54] [ALPM] upgraded python-setuptools (1:18.3.1-1 -> 1:18.3.2-1)
[2015-10-01 19:54] [ALPM] upgraded ipython (4.0.0-2 -> 4.0.0-3)
[2015-10-01 19:54] [ALPM] upgraded xcb-proto (1.11-1 -> 1.11-2)
[2015-10-01 19:54] [ALPM] upgraded mesa (11.0.1-1 -> 11.0.2-1)
[2015-10-01 19:54] [ALPM] upgraded mesa-libgl (11.0.1-1 -> 11.0.2-1)
[2015-10-01 19:54] [ALPM] upgraded python-numpy (1.9.2-2 -> 1.9.3-2)
[2015-10-01 19:54] [ALPM] upgraded python-scipy (0.16.0-1 -> 0.16.0-2)
[2015-10-01 19:54] [ALPM] upgraded python2-lxml (3.4.4-1 -> 3.4.4-2)
[2015-10-01 19:54] [ALPM] upgraded python2-pyinotify (0.9.6-1 -> 0.9.6-2)
[2015-10-01 19:54] [ALPM] upgraded python2-systemd (230-1 -> 230-2)
[2015-10-01 19:54] [ALPM] upgraded qscintilla (2.9-1 -> 2.9-2)
[2015-10-01 19:54] [ALPM] upgraded reflector (2014.11-1 -> 2014.11-2)
[2015-10-01 19:54] [ALPM] upgraded vim-runtime (7.4.854-2 -> 7.4.884-1)
[2015-10-01 19:54] [ALPM] upgraded vim (7.4.854-2 -> 7.4.884-1)
[2015-10-01 19:54] [ALPM] transaction completed
[2015-10-02 10:12] [ALPM] transaction started
[2015-10-02 10:12] [ALPM] installed gsl (1.16-1)
[2015-10-02 10:12] [ALPM] transaction completed
[2015-10-05 09:06] [ALPM] transaction started
[2015-10-05 09:06] [ALPM] upgraded harfbuzz (1.0.3-1 -> 1.0.4-1)
[2015-10-05 09:06] [ALPM] upgraded freetype2 (2.6-2 -> 2.6.1-1)
[2015-10-05 09:06] [ALPM] upgraded harfbuzz-icu (1.0.3-1 -> 1.0.4-1)
[2015-10-05 09:06] [ALPM] upgraded leptonica (1.71-2 -> 1.72-1)
[2015-10-05 09:06] [ALPM] upgraded linux-lts (4.1.8-1 -> 4.1.9-2)
[2015-10-05 09:06] [ALPM-SCRIPTLET] >>> Updating module dependencies. Please wait ...
[2015-10-05 09:06] [ALPM-SCRIPTLET] >>> Generating initial ramdisk, using mkinitcpio. Please wait...
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> Building image from preset: /etc/mkinitcpio.d/linux-lts.preset: 'default'
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> -k /boot/vmlinuz-linux-lts -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-lts.img
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> Starting build: 4.1.9-2-lts
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [base]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [mdadm]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [mdadm_udev]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [udev]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [autodetect]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [modconf]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [block]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [filesystems]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [keyboard]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [fsck]
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> Generating module dependencies
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> Creating gzip-compressed initcpio image: /boot/initramfs-linux-lts.img
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> Image generation successful
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> Building image from preset: /etc/mkinitcpio.d/linux-lts.preset: 'fallback'
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> -k /boot/vmlinuz-linux-lts -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-lts-fallback.img -S autodetect
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> Starting build: 4.1.9-2-lts
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [base]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [mdadm]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [mdadm_udev]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [udev]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [modconf]
[2015-10-05 09:06] [ALPM-SCRIPTLET] -> Running build hook: [block]
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> WARNING: Possibly missing firmware for module: aic94xx
[2015-10-05 09:06] [ALPM-SCRIPTLET] ==> WARNING: Possibly missing firmware for module: wd719x
[2015-10-05 09:07] [ALPM-SCRIPTLET] -> Running build hook: [filesystems]
[2015-10-05 09:07] [ALPM-SCRIPTLET] -> Running build hook: [keyboard]
[2015-10-05 09:07] [ALPM-SCRIPTLET] -> Running build hook: [fsck]
[2015-10-05 09:07] [ALPM-SCRIPTLET] ==> Generating module dependencies
[2015-10-05 09:07] [ALPM-SCRIPTLET] ==> Creating gzip-compressed initcpio image: /boot/initramfs-linux-lts-fallback.img
[2015-10-05 09:07] [ALPM-SCRIPTLET] ==> Image generation successful
[2015-10-05 09:07] [ALPM] upgraded php (5.6.13-1 -> 5.6.14-1)
[2015-10-05 09:07] [ALPM] upgraded php-apache (5.6.13-1 -> 5.6.14-1)
[2015-10-05 09:07] [ALPM] upgraded php-gd (5.6.13-1 -> 5.6.14-1)
[2015-10-05 09:07] [ALPM] upgraded php-mcrypt (5.6.13-1 -> 5.6.14-1)
[2015-10-05 09:07] [ALPM] upgraded php-pgsql (5.6.13-1 -> 5.6.14-1)
[2015-10-05 09:07] [ALPM] upgraded php-sqlite (5.6.13-1 -> 5.6.14-1)
[2015-10-05 09:07] [ALPM] upgraded php-tidy (5.6.13-1 -> 5.6.14-1)
[2015-10-05 09:07] [ALPM] upgraded poppler (0.33.0-1 -> 0.36.0-1)
[2015-10-05 09:07] [ALPM] upgraded python-decorator (4.0.2-2 -> 4.0.4-1)
[2015-10-05 09:07] [ALPM] upgraded texlive-bin (2015.37497-2 -> 2015.37497-3)
[2015-10-05 09:07] [ALPM-SCRIPTLET] >>> texlive: updating the filename database...
[2015-10-05 09:07] [ALPM-SCRIPTLET] mktexlsr: Updating /etc/texmf/ls-R...
[2015-10-05 09:07] [ALPM-SCRIPTLET] mktexlsr: Updating /usr/share/texmf/ls-R...
[2015-10-05 09:07] [ALPM-SCRIPTLET] mktexlsr: Updating /usr/share/texmf-dist/ls-R...
[2015-10-05 09:07] [ALPM-SCRIPTLET] mktexlsr: Updating /var/lib/texmf/ls-R...
[2015-10-05 09:07] [ALPM-SCRIPTLET] mktexlsr: Done.
[2015-10-05 09:07] [ALPM] transaction completed
The main updates I see which could cause an issue are the kernel and fail2ban. However even if I disable fail2ban when I can log in the connection is lost after about the same time.
The firewall rules are quite standard and have been in place using this page from the wiki (unchanged since a few years).
I have access to a rescue interface which can run a few hardware tests.
The hard drive seems alright (SMART status doesn't show anything bad, I can paste it if needed), as is the memory (memtester version 4.2.1 doesn't report any error).
The motherboard has been replaced today (after the problems appeared), which didn't change a thing.
lspci reports
00:00.0 Host bridge: Intel Corporation Atom Processor D2xxx/N2xxx DRAM Controller (rev 04)
00:02.0 VGA compatible controller: Intel Corporation Atom Processor D2xxx/N2xxx Integrated Graphics Controller (rev 0b)
00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02)
00:1d.0 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #3 (rev 02)
00:1d.3 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #4 (rev 02)
00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e2)
00:1f.0 ISA bridge: Intel Corporation NM10 Family LPC Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation NM10/ICH7 Family SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation NM10/ICH7 Family SMBus Controller (rev 02)
01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
I am running on the latest lts kernel ( 4.1.9-2-lts, standard from core). I am trying to access the last logs at the moment, I'll update this post to report anything significant.
Does anyone has any idea to where I should look ?
Edit: I am still digging in the logs. One error which appeared before the change of motherboard (after the updates, last line in the log file) is:
kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0]
I doesn't appear anymore after the change in hardware though. The last significant thing happening before the crash (at least in the last 10 lines of the log files) is the time update from ntp (systemd-timesyncd[305]: Synchronized to time server [2001:41d0:8:188d::16]:123 (2.arch.pool.ntp.org)) with no visible jump backwards in system time. I can try to deactivate ntp and restart, but I'm not sure this is the main issue here as that service is running since way before that last update...
Re-Edit: I downgraded to kernel linux-lts 4.1.8-1-lts and the server seems to run fine. It's nice but it isn't a long term solution. Any idea about what might cause these crashes ?
Last edited by Nicocys (2015-10-08 08:09:55)
Offline
Hi,
I have had very similar issues to the one you describe on two boxes. Yesterday the first of the boxes (a webserver at a remote location) completely locked up every few hours after I had done one of my weekly updates the day before. So, I ran tests for a few hours, tried to reproduce the lockups (no success), checked the memory and harddisks with no results, so I suspected a problem with server's hardware. As a last attempt, I downgraded the kernel to the version it previously had (linux-lts 3.14.52) using the Arch Rollback Machine (thank god!). The box has been running fine since, i.e. the last 14 hours.
Then, today around noon my desktop box started to act up the same way, hanging every hour or so. It is also a box with an lts kernel. So. my theory at the moment is that something broke in the lts kernel around version 4.1.8 or 4.1.9. I downgraded my desktop box's kernel just now to see if that works.
Offline
Ok, the problem appeared again using the 4.1.8-1 kernel, which wasn't the case before. According to the monitoring system, all activity stopped at around 2:15am, while everything was running fine starting from 8:32pm the day before.
The main difference this time is that the system is still alive. I am unable to get access to it via ssh, the webpages load indefinitely, but it answers to ping... badly:
103 packets transmitted, 103 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 34.342/1137.393/4048.458/1065.862 ms
A traceroute shows that the choking point is my server:
traceroute to theserver (XX.XX.XX.X), 64 hops max, 52 byte packets
1 XX.XX.XX.X (XX.XX.XX.X) 0.435 ms 0.278 ms 0.248 ms
2 XX.XX.XX.XX (XX.XX.X.XX) 0.273 ms 0.263 ms 0.279 ms
3 XX.XX.XX.XX (XX.XX.XX.XX) 0.258 ms 0.233 ms 0.384 ms
4 umu-br1.sunet.se (193.11.0.225) 0.339 ms 0.314 ms 0.318 ms
5 m1fre.sunet.se (130.242.85.181) 17.675 ms 9.682 ms 10.310 ms
6 se-fre.nordu.net (109.105.102.9) 21.486 ms 9.809 ms 20.136 ms
7 dk-ore.nordu.net (109.105.97.130) 18.186 ms 18.179 ms 18.283 ms
8 nl-sar.nordu.net (109.105.97.137) 28.681 ms 28.757 ms 28.651 ms
9 * * *
10 rbx-g2-a9.fr.eu (94.23.122.78) 34.502 ms 34.612 ms 34.584 ms
11 vss-9a-6k.routers.ovh.net (91.121.131.156) 34.452 ms 34.213 ms
vss-9b-6k.routers.ovh.net (178.33.100.74) 34.434 ms
12 myserver (XX.XX.XX.XX) 3633.082 ms 2018.286 ms 874.079 ms
I'll restart and try to see something in the logs.
Edit: OK this is getting weird...
Nothing appears in the logs (as in nothing unusual) except stuff like
perf interrupt took too long (2508 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
which pop up twice (going down to kernel.perf_event_max_sample_rate at 25000 in the end). The last occurrence of this is at 22:57, while the last log entry is at 02:14.
That last entry is weird:
Oct 06 02:14:44 theserver kernel: usb 4-1: USB disconnect, device number 2
Oct 06 02:14:45 theserver kernel: usb 4-1: new low-speed USB device number 3 using uhci_hcd
Note that the machine is a server, with supposedly no usb devices connected to it. According to the monitoring system, server activity stopped at ~02:17, but there is no log entries after these two lines.
lsusb returns
Bus 005 Device 002: ID 058f:6254 Alcor Micro Corp. USB Hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
The server seems to run fine again, but I'm not sure for how long...
Last edited by Nicocys (2015-10-06 06:39:55)
Offline
Just an update:
My server with the downgraded kernel has been running without problems for 26 hours now. Also, I found this: https://lkml.org/lkml/2015/9/30/1117.
Offline
Ok, it seems they have a patch for that (possibly) but missed the 4.1.10 release.
I am running on 4.1.8 for now, and I'll see if it crashes again, as it did during last night.
Offline
It's a known problem, I just filed FS#46588 to backport the fix.
Offline
Ok, running on 4.1.8 without crash for more than a day now. I assume that last crash since rollback was bad luck/something went wrong, and isn't happening anymore (probably the first boot after the rollback I should have restarted again, don't know).
So the culprit seems to be kernel 4.1.9-lts indeed, and the connection lss is due to the system crashing.
I'll update the topic title then.
Offline
@Nicocys dont use kernel 4.1.9 and 4.1.10, they are still broken.
perf interrupt took too long (2508 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
to "fix" it, add somewhere in sysctl:
kernel.nmi_watchdog = 0
Last edited by NOTMY (2015-10-07 09:59:34)
Offline
@NOTMY, are you sure this is the same issue?
The bug you are referring to looks much older?
Offline
@ghen, yes Im pretty sure because since some versions of kernels this messages started appear on my 3 totally different home machines (and some servers, debian, ubuntu) and those machines just after time slowed very down or just stopped responds.
so I added on every machine "kernel.nmi_watchdog = 0" and no more messages and again monthly uptimes + to every grub I add "nohpet" option because with this timer kernels have some problems too.
ps Just be sure that after reboot "kernel.nmi_watchdog" is still 0.
Last edited by NOTMY (2015-10-07 12:44:43)
Offline
I just tried 4.1.10 with kernel.nmi_watchdog=0, and it locked up again after an hour.
Offline
In the logs with 4.1.8, I have that "perf interrupt lock took too long" twice, but no crash. It seems to be known and not critical.
Server is running fine on 4.1.8, so it's most likely that soft locking issue mentioned earlier in the topic.
Offline
I just tried 4.1.10 with kernel.nmi_watchdog=0, and it locked up again after an hour.
@Nicocys dont use kernel 4.1.9 and 4.1.10, they are still broken.
In the logs with 4.1.8, I have that "perf interrupt lock took too long" twice, but no crash. It seems to be known and not critical.
Server is running fine on 4.1.8, so it's most likely that soft locking issue mentioned earlier in the topic.
ye @Nicocys I read everywhere that is normal warning, information message and it repeated few times and machines went down or were very slow, In my case wasnt so non-critical.
Offline
linux-lts-4.1.10-2 has the patch: https://bugs.archlinux.org/task/46570
edit: so far my systems are stable with 4.1.10-2.
Last edited by ghen (2015-10-08 07:10:30)
Offline
Ok, package is in testing, I'll try it as soon as it hits core. I tag the topic as solved then, and reopen if needed.
Offline
Ok guys I got the same issue I think, after an update.
The server starts but apparently crashes directly. I have no way of sshing to it.
I have a rescue access and can mount the arch partition.
How can I downgrade my linux from there? Help!
water vapor is invisible
Offline
Solved. Downgrading to 4.8.1 did the trick
water vapor is invisible
Offline