[ABANDONED] Complete system lockup, seems to be raid related

Fox2k · 2016-04-15 00:32:15

Hi all,

I'm hoping I can get a bit of help with this one. I have a system running Arch with a few RAID arrays. Some of the drives are hooked up to the onboard SATA ports, others are on a PCI-X SATA card. I am having trouble with one of the arrays in particular. The troubling thing is that, as far as I can tell, this started happening out of the blue: no hardware changes, Arch updates, or configuration changes that I can recall in the period leading up to the problem.

What happens is that, at random, while accessing that one array ("the array"), the system completely locks up. No logs, no kernel panic on the screen, nothing at all. Just complete unresponsiveness until I reboot the box. After the reboot the array is left in resync=PENDING, and shortly (within 10 minutes or so) after I schedule the check to start, the system locks up again. Here's some additional info:

- As long as the array is unmounted, the system is solid. Only once the array is mounted (and the check automatically starts shortly thereafter) does it soon lock up again.
- Booting from a Knoppix live CD, I am able to complete the check on the array with no problems. After rebooting back into Arch, the system is fine with a bit of access to the array here and there, but within a few hours (or sooner) it locks up again.

Things I have tried:
- Stopping X, unplugging network card, and disabling several onboard components before attempting to resync the array.
- Going into runlevel 3 before running the resync
- Leaving the machine on long enough for the resync to complete, in case it was still happening in the background and seeing if the system would become responsive again after a while. I left it on overnight, and in the morning it was still unresponsive. The resync typically takes about 4 hours.
- Doing a complete system update (pacman -Syu)
- Ensuring mdadm monitor service is running
- Running short and long SMART tests on all component drives of the array (completed without errors)
- Running a single badblocks non-destructive pass on all component drives of the array (completed without errors)
- Reseating all of the SATA cables (just in case?)
- Tried passing some configuration params to the kernel at boot as per this thread:
https://bbs.archlinux.org/viewtopic.php?id=201667
processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll libata.force=noncq
- Nothing is printed to any of the system logs in the moments leading up to the freeze. I suppose it's possible that there is data not yet flushed to the log file but I'm not sure how to check that.
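One way to check for messages that never reach the log file is to send kernel output over the network to a second machine with the kernel's netconsole module. The following is only a sketch under assumptions: the interface name eth0, the receiver address 192.168.1.10, and the helper name netconsole_option are hypothetical and do not come from this system.

```shell
#!/bin/sh
# Build the option string for the kernel's netconsole module, which sends
# kernel messages over UDP to another machine so they can survive a hard lockup.
# Format (from the kernel's netconsole documentation):
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# Empty fields fall back to the module defaults.
netconsole_option() {
    dev=$1            # local network interface to send from (assumed: eth0)
    target=$2         # IP of the machine collecting the messages (assumed)
    port=${3:-6666}   # UDP port on the receiver; 6666 is the module default
    printf 'netconsole=@/%s,%s@%s/\n' "$dev" "$port" "$target"
}

# Usage, as root on the machine that locks up:
#   modprobe netconsole "$(netconsole_option eth0 192.168.1.10)"
# On the receiving machine, listen with netcat (flag syntax varies by variant):
#   nc -u -l 6666
# Separately, if the systemd journal is persistent, whatever kernel messages
# did reach disk during the previous boot can be read with:
#   journalctl -k -b -1
```

If the receiver shows nothing at the moment of the freeze either, that would point toward a hard hang below the level where the kernel can still print, rather than an unflushed log.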

The system is not overclocked. Additional info:

# uname -a
Linux hostname 4.4.5-1-ARCH #1 SMP PREEMPT Thu Mar 10 07:38:19 CET 2016 x86_64 GNU/Linux

# lspci
00:00.0 Host bridge: Intel Corporation E7230/3000/3010 Memory Controller Hub (rev c0)
00:01.0 PCI bridge: Intel Corporation E7230/3000/3010 PCI Express Root Port (rev c0)
00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #1 (rev 01)
00:1d.1 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #2 (rev 01)
00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation NM10/ICH7 Family SATA Controller [IDE mode] (rev 01)
00:1f.3 SMBus: Intel Corporation NM10/ICH7 Family SMBus Controller (rev 01)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Juniper PRO [Radeon HD 5750]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Juniper HDMI Audio [Radeon HD 5700 Series]
02:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
03:02.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)
06:01.0 Multimedia audio controller: Creative Labs SB X-Fi
06:02.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] ES1000 (rev 02)

# cat /proc/mdstat

<snip>
md127 : active (auto-read-only) raid5 sdk1[4] sdf1[2] sdc1[1] sdd1[5] sde1[0] sdb1[3]
      4883799680 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
        resync=PENDING
<snip>
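For reference, the pending state shown above can be picked out of /proc/mdstat with a short script. This is a minimal sketch; the function name pending_arrays is made up for illustration, and the commands in the comments are the usual md/mdadm ones, which may not be exactly what was run on this box.

```shell
#!/bin/sh
# Print the name of every md array whose resync is pending.
# Reads /proc/mdstat-formatted text on standard input.
pending_arrays() {
    awk '
        /^md[^ ]* :/     { name = $1 }                   # array header line
        /resync=PENDING/ { if (name != "") print name }  # status line below it
    '
}

# Usage:
#   pending_arrays < /proc/mdstat
# An array in auto-read-only state starts its pending resync once it is
# switched to read-write, for example (as root):
#   mdadm --readwrite /dev/md127
# The current sync state can then be watched or interrupted through sysfs:
#   cat /sys/block/md127/md/sync_action
#   echo idle > /sys/block/md127/md/sync_action
```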

Any help would be greatly appreciated. The key seems to be the fact that Knoppix has no problems while Arch does; otherwise I would suspect the SATA controller card is on the fritz. However, I don't know anything about the differences in the standard kernel configs (or anything else for that matter) between Arch and Knoppix that would point me in the right direction.

Last edited by Fox2k (2016-04-16 22:49:51)

Fox2k · 2016-04-16 22:48:40

For anyone who's interested, I was unable to make much more progress with this issue. However, I suspected that software was the more likely culprit, considering that all of the drive diagnostics I tried came back clean and Knoppix didn't seem to have any issues with the arrays. I decided to point my mirror to an archive snapshot from January 2015 and then completely downgraded the system. Maybe a bit overkill, but unfortunately I just don't have any more time to spend on this issue as it has already cost me several days. I preferred to go this route rather than switching this box to Ubuntu or something else.

The kernel now running is 3.17.6-1, and the problem seems to have gone away.
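For anyone wanting to try the same downgrade, the Arch Linux Archive serves dated repository snapshots. The sketch below only builds the mirrorlist line; the helper name archive_server_line is hypothetical, and the date 2015/01/01 is a placeholder, since only "January 2015" is stated here and not the exact day.

```shell
#!/bin/sh
# Print a pacman mirrorlist entry pointing at an Arch Linux Archive snapshot.
# $1 is the snapshot date as YYYY/MM/DD (placeholder below, not the real date).
# $repo and $arch are left literal on purpose: pacman expands them itself.
archive_server_line() {
    printf 'Server=https://archive.archlinux.org/repos/%s/$repo/os/$arch\n' "$1"
}

# Usage, as root, after backing up the existing mirrorlist:
#   archive_server_line 2015/01/01 > /etc/pacman.d/mirrorlist
#   pacman -Syyuu    # refresh databases and allow packages to be downgraded
```

The doubled "uu" is what lets pacman move installed packages back to the older versions found in the snapshot.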

If anyone has any further insights into what might have been causing this issue, I'd be very happy to hear them. Maybe it can also help anyone else with similar issues who stumbles across this thread in the future. I think I will be keeping the box at this revision for quite some time though, as I can't afford to go through this hassle again any time soon!

Thanks for looking,

- Fox2k
