#1 2011-05-19 16:23:49

maz
Member
Registered: 2006-05-05
Posts: 38

Strange raid problem

Hi,

For the past 2 years I've been running a software RAID 6 array with 11 drives, using 2 Adaptec RAID PCI cards.
It's been working really well until a few hours ago.
The kernel is 2.6.30.
Out of nowhere it marked 6 of the drives as faulty and removed them. I panicked, of course. I've been going through the logs and found this:

May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,5,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,4,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,3,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,3,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,2,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,1,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter abort request (7,0,0,0)
May 19 16:27:04 BOLL kernel: aacraid: Host adapter reset request. SCSI hang ?
May 19 16:27:04 BOLL kernel: AAC: Host adapter BLINK LED 0xef
May 19 16:27:04 BOLL kernel: AAC1: adapter kernel panic'd ef.
---
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1492664655
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499135
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499143
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499151
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499159
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1492664663
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdh, sector 1550499167
May 19 16:42:06 BOLL kernel: sd 7:0:1:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: sd 7:0:5:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: sd 7:0:4:0: rejecting I/O to offline device
May 19 16:42:06 BOLL kernel: end_request: I/O error, dev sdj, sector 1550499199
May 19 16:42:06 BOLL kernel: sd 7:0:3:0: rejecting I/O to offline device

I've tried to reassemble the array with "mdadm --assemble", without any luck. It complains:
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has no superblock - assembly aborted
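
If I understand the error right, "Device or resource busy" probably means the old, failed md0 is still claiming those partitions, so presumably it has to be stopped before assembling again. Something like this is what I have in mind, though I haven't dared to run it yet and the device list is just how my drives happen to be named:

mdadm --stop /dev/md0
mdadm --assemble /dev/md0 /dev/sd[b-l]1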

When I do mdadm --examine --verbose /dev/sdc1 I get:

/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : fec34637:85b80b3b:afd3607f:5fb3c82e
  Creation Time : Wed May 13 22:12:48 2009
     Raid Level : raid6
  Used Dev Size : 976727744 (931.48 GiB 1000.17 GB)
     Array Size : 8790549696 (8383.32 GiB 9001.52 GB)
   Raid Devices : 11
  Total Devices : 11
Preferred Minor : 0

    Update Time : Thu May 19 16:57:15 2011
          State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 6
  Spare Devices : 0
       Checksum : f47d92d - correct
         Events : 32910

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       0        0        5      faulty removed
   6     6       0        0        6      faulty removed
   7     7       0        0        7      faulty removed
   8     8       0        0        8      faulty removed
   9     9       0        0        9      faulty removed
  10    10       0        0       10      faulty removed

But for sdh1, for instance, I get:

Checksum : f475480 - correct
         Events : 32903

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     6       8      113        6      active sync   /dev/sdh1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       65        3      active sync   /dev/sde1
   4     4       8       81        4      active sync   /dev/sdf1
   5     5       8       97        5      active sync   /dev/sdg1
   6     6       8      113        6      active sync   /dev/sdh1
   7     7       8      129        7      active sync   /dev/sdi1
   8     8       8      145        8      active sync   /dev/sdj1
   9     9       8      161        9      active sync   /dev/sdk1
  10    10       8      177       10      active sync   /dev/sdl1
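
So the five drives that stayed in the array are at Events 32910, while the dropped ones like sdh1 are stuck at 32903. To compare all of them at once, something like this should do it (the sd[b-l]1 range is just how my drives are named here):

mdadm --examine /dev/sd[b-l]1 | grep -E '^/dev|Events'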

The driver I use for the cards is aacraid. I did a pacman upgrade a few days back and rebooted, and everything kept working after that until today.
I don't know whether it's the card that's faulty or all six drives, and I REALLY could use some help because I'm totally lost now.

#2 2011-05-19 18:52:32

maz
Member
Registered: 2006-05-05
Posts: 38

Re: Strange raid problem

I have an update.

The array responds to mdadm --misc --detail /dev/md0, which says:

Version : 0.90
  Creation Time : Wed May 13 22:12:48 2009
     Raid Level : raid6
  Used Dev Size : 976727744 (931.48 GiB 1000.17 GB)
   Raid Devices : 11
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May 19 16:57:15 2011
          State : active, FAILED, Not Started
Active Devices : 5
Working Devices : 5
Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : fec34637:85b80b3b:afd3607f:5fb3c82e
         Events : 0.32910

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       5       0        0        5      removed
       6       0        0        6      removed
       7       0        0        7      removed
       8       0        0        8      removed
       9       0        0        9      removed
      10       0        0       10      removed

However, I cannot re-add the drives.

mdadm --add /dev/md0 /dev/sdg1
mdadm: /dev/sdg1 reports being an active member for /dev/md0, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sdg1 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdg1" first.

I've noted that sdg through sdl are on the same Adaptec card, and my theory is that the card temporarily failed and all of those disks got removed at the same time.
Now my questions are: how do I re-add them safely, and can I just buy a new card (same model, etc.) and make it work again?
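
My current understanding (which could well be wrong, so please correct me) is that --add is the wrong tool here, and that the way back is to stop the array and assemble it with --force, which should let mdadm accept the members with the slightly older event count. Something along these lines, untested, with the device list guessed from my layout:

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[b-l]1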

#3 2011-05-31 23:36:14

PreparationH67
Member
From: Chicago IL
Registered: 2010-07-28
Posts: 40

Re: Strange raid problem

You are not alone; the same thing happened to me, except I have 6 drives on a SAS controller with 2 mini-SAS ports. The 6 SATA drives are connected with 2 breakout cables, paired in groups of 3, in a RAID 6 configuration.

Recently, after a pacman upgrade, I started having problems with the controller. It seemed to be a problem with the latest kernel, so I switched to the LTS kernel. Everything seemed fine at first, except that after a reboot only 4 drives were added to the array. The other 2 drives are healthy according to smartctl, but I cannot re-add them and I get the same error. The controller sees the drives at boot and spins them up, and since the drives are grouped in threes and another drive on the same cable is working fine, I don't think the cable went bad.

Before this, on the regular kernel, the drives didn't show up at all and threw SCSI allocation errors at boot, and before that mdadm had removed all the drives from the array, although it came back up without a problem until this happened. Anyone have any ideas?
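
For reference, this is roughly what I checked and where it fails; the device names are just examples from my box and will differ on yours:

smartctl -H /dev/sde             # example device; SMART says the drive is healthy
mdadm --examine /dev/sde1        # the md superblock still shows it as an active member
mdadm --add /dev/md0 /dev/sde1   # this is the step that fails with the same error as above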
