Hi,
For the past 2 years I've been running a software RAID 6 with 11 drives, using 2 Adaptec RAID PCI cards.
It's been working really well until a few hours ago.
Kernel is 2.6.30.
Out of nowhere it marked 6 of the drives as faulty and removed them. I panicked, of course. I've been going through the logs and found this:
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,3,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,3,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter reset request. SCSI hang ?
May 19 16:27:04 BOLL kernel: AAC: Host adapter BLINK LED 0xef
May 19 16:27:04 BOLL kernel: AAC1: adapter kernel panic'd ef.
---
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1492664655
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499135
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499143
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499151
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499159
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1492664663
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499167
May 19 16:42:06 BOLL kernel: sd 7:0:1:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: sd 7:0:5:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: sd 7:0:4:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdj, sector 1550499199
May 19 16:42:06 BOLL kernel: sd 7:0:3:0: rejecting I/O to offline device
I've tried to reassemble the array with "mdadm --assemble" without any luck. It complains:
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has no superblock - assembly aborted
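For what it's worth, the "Device or resource busy" message usually means the kernel still holds a half-assembled /dev/md0 open, so mdadm cannot reopen the member devices. A sketch of the usual first step (device names assumed from the --examine output below; this doesn't write anything to the drives):

```shell
# Stop the half-assembled array so the member devices are released.
mdadm --stop /dev/md0

# Then retry assembly, naming the members explicitly.
mdadm --assemble /dev/md0 /dev/sd[b-l]1
```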
When I do mdadm --examine --verbose /dev/sdc1 I get:
/dev/sdc1:
Magic : a92b4efc
Version : 0.90.00
UUID : fec34637:85b80b3b:afd3607f:5fb3c82e
Creation Time : Wed May 13 22:12:48 2009
Raid Level : raid6
Used Dev Size : 976727744 (931.48 GiB 1000.17 GB)
Array Size : 8790549696 (8383.32 GiB 9001.52 GB)
Raid Devices : 11
Total Devices : 11
Preferred Minor : 0
Update Time : Thu May 19 16:57:15 2011
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 6
Spare Devices : 0
Checksum : f47d92d - correct
Events : 32910
Layout : left-symmetric
Chunk Size : 64K

Number Major Minor RaidDevice State
this 1 8 33 1 active sync /dev/sdc1

0 0 8 17 0 active sync /dev/sdb1
1 1 8 33 1 active sync /dev/sdc1
2 2 8 49 2 active sync /dev/sdd1
3 3 8 65 3 active sync /dev/sde1
4 4 8 81 4 active sync /dev/sdf1
5 5 0 0 5 faulty removed
6 6 0 0 6 faulty removed
7 7 0 0 7 faulty removed
8 8 0 0 8 faulty removed
9 9 0 0 9 faulty removed
10 10 0 0 10 faulty removed
But for sdh1, for instance, I get:
Checksum : f475480 - correct
Events : 32903
Layout : left-symmetric
Chunk Size : 64K

Number Major Minor RaidDevice State
this 6 8 113 6 active sync /dev/sdh1

0 0 8 17 0 active sync /dev/sdb1
1 1 8 33 1 active sync /dev/sdc1
2 2 8 49 2 active sync /dev/sdd1
3 3 8 65 3 active sync /dev/sde1
4 4 8 81 4 active sync /dev/sdf1
5 5 8 97 5 active sync /dev/sdg1
6 6 8 113 6 active sync /dev/sdh1
7 7 8 129 7 active sync /dev/sdi1
8 8 8 145 8 active sync /dev/sdj1
9 9 8 161 9 active sync /dev/sdk1
10 10 8 177 10 active sync /dev/sdl1
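Comparing the two superblocks: the surviving members are at Events 32910 while the dropped drives stopped at 32903. That gap is only a handful of writes, which matters for recovery, since `mdadm --assemble --force` can accept members whose event counts are only slightly behind. A trivial check of the gap (numbers copied from the output above):

```shell
# Event counters copied from the two --examine outputs above.
surviving=32910   # e.g. sdc1, still active
dropped=32903     # e.g. sdh1, kicked out when the controller hung
echo "event gap: $((surviving - dropped))"
```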
The driver I use for my card is aacraid. I did a pacman upgrade a few days back and rebooted, but everything worked after that until today.
I don't know if it's the card that's faulty or all six drives, and I REALLY could use some help because I'm totally lost now.
I have an update.
The array responds to mdadm --misc --detail /dev/md0, which says:
Version : 0.90
Creation Time : Wed May 13 22:12:48 2009
Raid Level : raid6
Used Dev Size : 976727744 (931.48 GiB 1000.17 GB)
Raid Devices : 11
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu May 19 16:57:15 2011
State : active, FAILED, Not Started
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : fec34637:85b80b3b:afd3607f:5fb3c82e
Events : 0.32910

Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
4 8 81 4 active sync /dev/sdf1
5 0 0 5 removed
6 0 0 6 removed
7 0 0 7 removed
8 0 0 8 removed
9 0 0 9 removed
10 0 0 10 removed
However, I cannot re-add the drives.
mdadm --add /dev/md0 /dev/sdg1
mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sdg1 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first.
I've noted that sdg through sdl are all on the same Adaptec card, and my theory is that it failed temporarily and all the disks were removed at the same time.
Now my questions are: how do I re-add them safely, and can I just buy a new card of the same model and make it work again?
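One caution first: do NOT run the --zero-superblock that the error message suggests; that wipes exactly the metadata mdadm needs to recognize the drive as a member. Since the six drives were kicked out together by a controller hang and their event counts are only slightly behind, the commonly recommended path is a forced assemble rather than --add. A sketch (device names assumed from this thread; not guaranteed, so image or back up what you can first):

```shell
# Stop whatever half-assembled state is holding the devices.
mdadm --stop /dev/md0

# Force-assemble with all eleven members; --force accepts the slightly
# stale event counts instead of treating those drives as spares.
mdadm --assemble --force /dev/md0 /dev/sd[b-l]1

# Verify before trusting it: the array should come up clean or degraded.
cat /proc/mdstat
mdadm --detail /dev/md0
```

As for the card: a replacement of the same model should work, since the md superblocks live on the drives themselves, not on the controller.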
You are not alone; the same thing happened to me. Except I have 6 drives on a SAS controller with 2 mini-SAS ports: the 6 SATA drives are connected with 2 breakout cables, paired in groups of 3, in a RAID 6 configuration. Recently, after a pacman upgrade, I started having problems with the controller. It seemed to be a problem with the latest kernel, so I switched to the LTS kernel. Everything seemed fine at first, except that after a reboot only 4 drives were added to the array.

The other 2 drives are healthy according to smartctl, but I cannot re-add them and I get the same error. The controller sees the drives at boot and spins them up, and since the drives are cabled in groups of 3 and another drive on the same cable is working fine, I don't think the cable went bad. Before this, on the regular kernel, the drives didn't show up at all and threw SCSI allocation errors at boot, and before that mdadm removed all the drives from the array, although it came back up without a problem until this happened. Anyone have any ideas?