I have a 4-drive RAID5 running. Recently one of the drives kept removing itself from the array. I have re-partitioned and re-formatted the drive multiple times, but either mdadm doesn't accept the drive, or it accepts it, resyncs, and then fails again as soon as I try to access the RAID.
I eventually pulled the drive out and plugged it into my desktop machine, which recognised the partition fine. However, after putting the drive back into the RAID box I get errors like these on startup:
Jun 8 17:04:41 localhost kernel: [ 213.942913] ata3.00: failed command: READ DMA
Jun 8 17:04:41 localhost kernel: [ 213.942922] ata3.00: cmd c8/00:08:40:10:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jun 8 17:04:41 localhost kernel: [ 213.942924] res 51/40:00:40:10:00/00:00:00:00:00/e0 Emask 0x9 (media error)
Jun 8 17:04:41 localhost kernel: [ 213.942928] ata3.00: status: { DRDY ERR }
Jun 8 17:04:41 localhost kernel: [ 213.942931] ata3.00: error: { UNC }
Jun 8 17:04:41 localhost kernel: [ 213.979780] ata3.00: configured for UDMA/133
Jun 8 17:04:41 localhost kernel: [ 213.979794] ata3: EH complete
Jun 8 17:04:44 localhost kernel: [ 216.508311] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 8 17:04:44 localhost kernel: [ 216.508316] ata3.00: BMDMA stat 0x24
Jun 8 17:04:44 localhost kernel: [ 216.508321] ata3.00: failed command: READ DMA
Jun 8 17:04:44 localhost kernel: [ 216.508330] ata3.00: cmd c8/00:08:40:10:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jun 8 17:04:44 localhost kernel: [ 216.508332] res 51/40:00:40:10:00/00:00:00:00:00/e0 Emask 0x9 (media error)
Jun 8 17:04:44 localhost kernel: [ 216.508336] ata3.00: status: { DRDY ERR }
Jun 8 17:04:44 localhost kernel: [ 216.508339] ata3.00: error: { UNC }
Jun 8 17:04:44 localhost kernel: [ 216.538146] ata3.00: configured for UDMA/133
Jun 8 17:04:44 localhost kernel: [ 216.538160] ata3: EH complete
Jun 8 17:04:47 localhost kernel: [ 219.073696] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 8 17:04:47 localhost kernel: [ 219.073701] ata3.00: BMDMA stat 0x24
Jun 8 17:04:47 localhost kernel: [ 219.073705] ata3.00: failed command: READ DMA
Jun 8 17:04:47 localhost kernel: [ 219.073714] ata3.00: cmd c8/00:08:40:10:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jun 8 17:04:47 localhost kernel: [ 219.073716] res 51/40:00:40:10:00/00:00:00:00:00/e0 Emask 0x9 (media error)
Jun 8 17:04:47 localhost kernel: [ 219.073720] ata3.00: status: { DRDY ERR }
Jun 8 17:04:47 localhost kernel: [ 219.073723] ata3.00: error: { UNC }
Jun 8 17:04:52 localhost kernel: [ 224.093967] ata3.00: qc timeout (cmd 0x27)
Jun 8 17:04:52 localhost kernel: [ 224.093972] ata3.00: failed to read native max address (err_mask=0x4)
Jun 8 17:04:52 localhost kernel: [ 224.093976] ata3.00: HPA support seems broken, skipping HPA handling
Jun 8 17:04:52 localhost kernel: [ 224.093980] ata3.00: revalidation failed (errno=-5)
Jun 8 17:04:52 localhost kernel: [ 224.093989] ata3: hard resetting link
Jun 8 17:04:52 localhost kernel: [ 224.567030] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 8 17:04:52 localhost kernel: [ 224.587154] ata3.00: configured for UDMA/133
Jun 8 17:04:52 localhost kernel: [ 224.587174] sd 2:0:0:0: [sde] Unhandled sense code
Jun 8 17:04:52 localhost kernel: [ 224.587177] sd 2:0:0:0: [sde] Result: hostbyte=0x00 driverbyte=0x08
Jun 8 17:04:52 localhost kernel: [ 224.587182] sd 2:0:0:0: [sde] Sense Key : 0x3 [current] [descriptor]
Jun 8 17:04:52 localhost kernel: [ 224.587187] Descriptor sense data with sense descriptors (in hex):
Jun 8 17:04:52 localhost kernel: [ 224.587190] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 8 17:04:52 localhost kernel: [ 224.587202] 00 00 10 40
Jun 8 17:04:52 localhost kernel: [ 224.587207] sd 2:0:0:0: [sde] ASC=0x11 ASCQ=0x4
Jun 8 17:04:52 localhost kernel: [ 224.587212] sd 2:0:0:0: [sde] CDB: cdb[0]=0x28: 28 00 00 00 10 40 00 00 08 00
Jun 8 17:04:52 localhost kernel: [ 224.587223] end_request: I/O error, dev sde, sector 4160
Jun 8 17:04:52 localhost kernel: [ 224.587228] Buffer I/O error on device sde1, logical block 512
I have tried using SMART to test the drives, but every drive seems to fail with some sort of error. I believe these are false positives, as I have read elsewhere that the 1 TB Western Digital Green drives (WD10EARS) seem to have problems with SMART.
Is this drive dead? It's only a year old... and unfortunately just out of warranty.
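For what it's worth, here is a short SMART check sequence, assuming smartctl from smartmontools is available and that /dev/sde is the suspect drive (adjust to yours). Media errors like the UNC reads above usually show up in the attributes or the self-test log:

```shell
# Overall health verdict (PASSED or FAILED)
sudo smartctl -H /dev/sde

# Full attribute dump; watch Reallocated_Sector_Ct and Current_Pending_Sector
sudo smartctl -a /dev/sde

# Kick off a short self-test (takes a couple of minutes), then read the result log
sudo smartctl -t short /dev/sde
sleep 150
sudo smartctl -l selftest /dev/sde
```

A rising Current_Pending_Sector count matches the unrecoverable-read errors in the kernel log.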
Last edited by passbe (2011-06-24 08:52:16)
Jun 8 17:04:52 localhost kernel: [ 224.093989] ata3: hard resetting link
----
Jun 8 17:04:52 localhost kernel: [ 224.587223] end_request: I/O error, dev sde, sector 4160
Jun 8 17:04:52 localhost kernel: [ 224.587228] Buffer I/O error on device sde1, logical block 512
I'd go with drive is dead.
Are you familiar with our Forum Rules, and How To Ask Questions The Smart Way?
BlueHackers // fscanary // resticctl
I've had similar problems with a WD Green 1 TB disk.
Disks don't like having their power cut mid-write. I had a loose SATA power connector which tended to unplug randomly, and it left my disk with many "bad sectors".
The good news is that these "bad sectors" may only be logical bad sectors, and there is a way to fix them:
http://hddguru.com/software/2005.10.02-MHDD/
You need to zero out the whole disk so the drive can remap the bad sectors.
Try translating this Polish site to English (or any other language): http://www.forumpc.pl/index.php?showtopic=151905
The procedure is lengthy, but worth the time.
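The zero-out pass itself is just dd. Shown here against a scratch image file so it's safe to try; on the real disk you would write to the whole device (e.g. /dev/sde), which irreversibly erases everything on it:

```shell
# DANGER: on real hardware, of= would be the whole disk (e.g. of=/dev/sde),
# which destroys all data on it. A 4 MiB scratch file stands in for the disk here.
dd if=/dev/zero of=disk.img bs=1M count=4 status=none

# Confirm every byte really is zero
cmp -n $((4*1024*1024)) disk.img /dev/zero && echo "all zeros"
```

Writing to a pending (unreadable) sector is what triggers the drive's firmware to remap it to a spare.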
I used to run the disk through a cheap and nasty USB-SATA bridge with a horrible power adaptor; now I use a good Icy Box USB disk enclosure and everything is fine.
Hope it helps.
I will zero-fill the hard drive and see how that goes. Thank you.
Ok, so I zero-filled /dev/sde and re-added it to the array. It synced (for two days), and then my RAID5 failed completely. I can't bring it back up at all now. My BIOS says one of the hard drives is dead (SMART error). Fair enough, the drive is dead. But now I can't get the other three drives back up and running in the RAID.
I have attempted the following:
sudo mdadm --assemble -vv /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: looking for devices for /dev/md127
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has no superblock - assembly aborted
It complains that every device lacks a superblock. Here is mdadm --examine on each drive:
/dev/sdb1
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e3da6078:d4699e2d:c1da42cc:d58a42bf
Name : myhost:0
Creation Time : Sun Feb 27 09:27:39 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 1953517954 (931.51 GiB 1000.20 GB)
Array Size : 5860552704 (2794.53 GiB 3000.60 GB)
Used Dev Size : 1953517568 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 95618add:2d0b86be:d06cc198:1a5e101f
Update Time : Tue Jun 14 20:44:51 2011
Checksum : 51b4a80f - correct
Events : 4808
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : A.A. ('A' == active, '.' == missing)
/dev/sdc1
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e3da6078:d4699e2d:c1da42cc:d58a42bf
Name : myhost:0
Creation Time : Sun Feb 27 09:27:39 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 1953517954 (931.51 GiB 1000.20 GB)
Array Size : 5860552704 (2794.53 GiB 3000.60 GB)
Used Dev Size : 1953517568 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 3056502e:51e004d0:4380f027:fe2d9ebc
Update Time : Tue Jun 14 20:44:51 2011
Checksum : e0b65526 - correct
Events : 4808
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : A.A. ('A' == active, '.' == missing)
/dev/sdd1
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e3da6078:d4699e2d:c1da42cc:d58a42bf
Name : myhost:0
Creation Time : Sun Feb 27 09:27:39 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 1953517954 (931.51 GiB 1000.20 GB)
Array Size : 5860552704 (2794.53 GiB 3000.60 GB)
Used Dev Size : 1953517568 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 3b7ea2c8:9fd80575:94fbe3a0:cde69026
Update Time : Tue Jun 14 20:34:16 2011
Checksum : 2f1a716 - correct
Events : 4791
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAA ('A' == active, '.' == missing)
/dev/sde1 (This is the bad drive)
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e3da6078:d4699e2d:c1da42cc:d58a42bf
Name : myhost:0
Creation Time : Sun Feb 27 09:27:39 2011
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 1953523057 (931.51 GiB 1000.20 GB)
Array Size : 5860552704 (2794.53 GiB 3000.60 GB)
Used Dev Size : 1953517568 (931.51 GiB 1000.20 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : bad3c5fd:3e439065:957ca15a:412339dc
Update Time : Tue Jun 14 20:44:51 2011
Checksum : 98033b24 - correct
Events : 4808
Layout : left-symmetric
Chunk Size : 512K
Device Role : spare
Array State : A.A. ('A' == active, '.' == missing)
The only difference I can see is that /dev/sdd1 still thinks the RAID has all four devices (and its event count is older). Where do I go from here? I'm completely stuck and about to lose 50% of my data. Help!?!
First, pull the failed hard drive completely. You don't want that getting in the way and screwing up the rest.
I would try the "Finnix" rescue CD; it is very good at recovery work.
Also, I think this command should get the other 3 drives to assemble and run:
mdadm --assemble /dev/md0 --run --verbose /dev/sdb1 /dev/sdc1 /dev/sdd1
UNTESTED; I make no warranties as to its accuracy!
I tried that several times. It always complains about a bad superblock on the first drive specified. I still have the bad drive in, solely because I can't actually figure out which one it is. I'm not at the machine right now, but I will try pulling the drive and running the same command, though I'm pretty sure it won't work since I already tried that earlier.
sudo mdadm --assemble /dev/md127 --run --verbose /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: looking for devices for /dev/md127
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has no superblock - assembly aborted
Ok, I tried Finnix and it didn't help. I have also tried mdadm --create without any success. It seems that every drive has a bad superblock; no matter what I try, mdadm always stops on that.
I'm at my wit's end.
mdadm: cannot open device /dev/sdb1: Device or resource busy
mdadm: /dev/sdb1 has no superblock - assembly aborted
It looks like it can't find the superblock because it can't read sdb... See what else is using it; perhaps it's already assembled? Check /proc/mdstat, then:
lsof | grep sdb
fuser /dev/sdb
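Another quick way to see who is holding the partition (device names here match this thread; adjust to yours): the kernel lists any device-mapper or md device that has claimed a partition under its holders directory in sysfs.

```shell
# Is some md array (even an inactive one) already assembled?
cat /proc/mdstat

# Which kernel device, if any, has claimed sdb1? An entry like "md127"
# here means the md layer is holding the partition open.
ls /sys/block/sdb/sdb1/holders
```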
I've checked /proc/mdstat; it tells me the array is not assembled. lsof and fuser also returned nothing for all the drives. mdadm seems to be useless until I can reinstate the superblocks on these drives. What has me stumped is how I lost all the superblocks at once.
All drives report the same partition of roughly 900 GiB (1 TB drives) with a type of Linux RAID Autodetect.
So after some more tinkering, I decided to try zeroing the superblocks on each drive. mdadm keeps complaining about bad superblocks, so if I zero them out and then issue the create command I might have a chance (I'm grasping at straws here). I'm attempting the following:
sudo mdadm --zero-superblock /dev/sdb1
mdadm: Couldn't open /dev/sdb1 for write - not zeroing
lsof shows nothing accessing that partition, and umount /dev/sdb1 says it isn't mounted, so how can I be unable to write to it?
How can I get my system into a state where I can issue this command?
Thanks for the help.
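A likely explanation, going by the "Device or resource busy" errors earlier in the thread: the kernel has incrementally assembled the members into an inactive array at boot, which holds the devices open even though nothing is mounted. A sketch of freeing them, assuming the stale array shows up as md127 in /proc/mdstat:

```shell
# Look for an "inactive" md array still holding the member devices
cat /proc/mdstat

# Stop it to release the members...
sudo mdadm --stop /dev/md127

# ...after which writes to the member (e.g. zeroing its superblock) should succeed
sudo mdadm --zero-superblock /dev/sdb1
```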
Additionally, this is the post that gave me the idea:
So after some more time spent trying to debug the issue, the following commands suddenly rebuilt the array. I'm not sure why it worked this time, but either way I'm moving all my data off and starting fresh with a new batch of hard drives.
sudo mdadm -S /dev/md127
sudo mdadm --create /dev/md127 --assume-clean --level=5 --raid-devices=4 --run --verbose /dev/sdb1 /dev/sdd1 /dev/sdc1 missing
(Note the device order matches the original device roles from --examine: sdb1 was active device 0, sdd1 device 1, sdc1 device 2, with the dead drive's slot left as "missing".)
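Before trusting an array recreated with --create --assume-clean, it's worth verifying it read-only first: if the device order doesn't match the original roles, the data comes back scrambled. A cautious check might be (mount point is illustrative):

```shell
# Confirm the array state and device roles
sudo mdadm --detail /dev/md127

# Read-only filesystem check; -n makes no changes to the filesystem
sudo fsck -n /dev/md127

# Mount read-only and eyeball the data before writing anything
sudo mount -o ro /dev/md127 /mnt
```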