#1 2011-05-19 16:23:49

maz
Member
Registered: 2006-05-05
Posts: 38

Strange raid problem

Hi,

For the past 2 years I've been running a software RAID 6 array with 11 drives, using 2 Adaptec RAID PCI cards.
It's been working really well until a few hours ago.
The kernel is 2.6.30.
Out of nowhere it marked 6 of the drives as faulty and removed them. I panicked, of course. I've been going through the logs and found this:

May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,3,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,3,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter reset request. SCSI hang ?
May 19 16:27:04 BOLL kernel: AAC: Host adapter BLINK LED 0xef
May 19 16:27:04 BOLL kernel: AAC1: adapter kernel panic'd ef.
---
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1492664655
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499135
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499143
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499151
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499159
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1492664663
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499167
May 19 16:42:06 BOLL kernel: sd 7:0:1:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: sd 7:0:5:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: sd 7:0:4:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdj, sector 1550499199
May 19 16:42:06 BOLL kernel: sd 7:0:3:0: rejecting I/O to offline device

I've tried to reassemble the array with "mdadm --assemble", without any luck. It complains:
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has no superblock - assembly aborted
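
If I understand the error right, "Device or resource busy" probably means the old, failed md0 is still claiming those partitions, so presumably it has to be stopped before assembling again. Something like this is what I have in mind, though I haven't dared to run it yet and the device list is just how my drives happen to be named:

mdadm --stop /dev/md0
mdadm --assemble /dev/md0 /dev/sd[b-l]1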

When I do mdadm --examine --verbose /dev/sdc1 I get:

/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : fec34637:85b80b3b:afd3607f:5fb3c82e
  Creation Time : Wed May 13 22:12:48 2009
     Raid Level : raid6
  Used Dev Size : 976727744 (931.48 GiB 1000.17 GB)
     Array Size : 8790549696 (8383.32 GiB 9001.52 GB)
   Raid Devices : 11
  Total Devices : 11
Preferred Minor : 0

    Update Time : Thu May 19 16:57:15 2011
          State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 6
  Spare Devices : 0
       Checksum : f47d92d - correct
         Events : 32910

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       0        0        5      faulty removed
   6     6       0        0        6      faulty removed
   7     7       0        0        7      faulty removed
   8     8       0        0        8      faulty removed
   9     9       0        0        9      faulty removed
  10    10       0        0       10      faulty removed

But for sdh1, for instance, I get:

Checksum : f475480 - correct
         Events : 32903

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     6       8      113        6      active sync   /dev/sdh1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       97        5      active sync   /dev/sdg1
   6     6       8      113        6      active sync   /dev/sdh1
   7     7       8      129        7      active sync   /dev/sdi1
   8     8       8      145        8      active sync   /dev/sdj1
   9     9       8      161        9      active sync   /dev/sdk1
  10    10       8      177       10      active sync   /dev/sdl1
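
So the five drives that stayed in the array are at Events 32910, while the dropped ones like sdh1 are stuck at 32903. To compare all of them at once, something like this should do it (the sd[b-l]1 range is just how my drives are named here):

mdadm --examine /dev/sd[b-l]1 | grep -E '^/dev|Events'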

The driver I use for the cards is aacraid. I did a pacman upgrade a few days back and rebooted, and everything kept working after that until today.
I don't know whether it's the card that's faulty or all six drives, and I REALLY could use some help because I'm totally lost now.

#2 2011-05-19 18:52:32

maz
Member
Registered: 2006-05-05
Posts: 38

Re: Strange raid problem

I have an update.

The array responds to mdadm --misc --detail /dev/md0, which says:

Version : 0.90
  Creation Time : Wed May 13 22:12:48 2009
     Raid Level : raid6
  Used Dev Size : 976727744 (931.48 GiB 1000.17 GB)
   Raid Devices : 11
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May 19 16:57:15 2011
          State : active, FAILED, Not Started
Active Devices : 5
Working Devices : 5
Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : fec34637:85b80b3b:afd3607f:5fb3c82e
         Events : 0.32910

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       0        0        5      removed
       6       0        0        6      removed
       7       0        0        7      removed
       8       0        0        8      removed
       9       0        0        9      removed
      10       0        0       10      removed

However, I cannot re-add the drives.

mdadm --add /dev/md0 /dev/sdg1
mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sdg1 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first.

I've noted that sdg through sdl are on the same Adaptec card, and my theory is that the card temporarily failed and all of those disks got removed at the same time.
Now my questions are: how do I re-add them safely, and can I just buy a new card (same model, etc.) and make it work again?
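
My current understanding (which could well be wrong, so please correct me) is that --add is the wrong tool here, and that the way back is to stop the array and assemble it with --force, which should let mdadm accept the members with the slightly older event count. Something along these lines, untested, with the device list guessed from my layout:

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[b-l]1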

#3 2011-05-31 23:36:14

PreparationH67
Member
From: Chicago IL
Registered: 2010-07-28
Posts: 40

Re: Strange raid problem

You are not alone; the same thing happened to me, except I have 6 drives on a SAS controller with 2 mini-SAS ports. The 6 SATA drives are connected with 2 breakout cables, paired in groups of 3, in a RAID 6 configuration.

Recently, after a pacman upgrade, I started having problems with the controller. It seemed to be a problem with the latest kernel, so I switched to the LTS kernel. Everything seemed fine at first, except that after a reboot only 4 drives were added to the array. The other 2 drives are healthy according to smartctl, but I cannot re-add them and I get the same error. The controller sees the drives at boot and spins them up, and since the drives are grouped in threes and another drive on the same cable is working fine, I don't think the cable went bad.

Before this, on the regular kernel, the drives didn't show up at all and threw SCSI allocation errors at boot, and before that mdadm had removed all the drives from the array, although it came back up without a problem until this happened. Anyone have any ideas?
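
For reference, this is roughly what I checked and where it fails; the device names are just examples from my box and will differ on yours:

smartctl -H /dev/sde             # example device; SMART says the drive is healthy
mdadm --examine /dev/sde1        # the md superblock still shows it as an active member
mdadm --add /dev/md0 /dev/sde1   # this is the step that fails with the same error as above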
