RAID 10 Array Broken

bronco21016 · 2013-03-30 18:05:32

So the other day my Arch server running an NFS share of media froze on me while I was watching a show. I came downstairs to investigate and since I couldn't ssh or anything into the server I decided I would have to simply try a hard restart. I did the restart and the system wouldn't boot at all. I took it down from the rack and brought it to my desk and started digging deeper and found it was locking up the boot process while trying to mount my RAID 10 array. I commented out the RAID lines in /etc/fstab and now I'm able to boot but I'm completely unable to get the RAID working. The RAID device is supposed to be /dev/md0 with two partitions /md0p0 and /md0p1. The only thing showing up is /md127.

Running mdadm --manage /dev/md127 --run yields...

mdadm: failed to run array /dev/md127: Input/output error

Running this does allow me to run ...

mdadm --detail /dev/md127

which yields...

/dev/md127:
        Version : 1.2
  Creation Time : Thu Aug  2 18:18:48 2012
     Raid Level : raid10
  Used Dev Size : 1465007104 (1397.14 GiB 1500.17 GB)
   Raid Devices : 4
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Mar 29 17:52:30 2013
          State : active, FAILED, Not Started 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : ArchNAS:0  (local to host ArchNAS)
           UUID : f2cc4295:758a7f87:784a9410:7968752c
         Events : 59

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       0        0        2      removed
       3       0        0        3      removed

I've run smartmontools on each of the disks involved in the array and every disk is passing. I also find it weird that suddenly two disks have gone off the array and it would appear the disks that disappeared are one of the sets of mirrors so I might lose my data if the disks did indeed fail.
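The mirror-set worry above can be checked with a little arithmetic. This is a minimal sketch, assuming the usual md "near" layout where copy k of chunk c lands on slot (c * copies + k) mod devices; near_slots is a hypothetical helper name, not an mdadm command.

```shell
#!/bin/sh
# Sketch only: near_slots is a made-up helper, not part of mdadm.
# Prints the RaidDevice slots that hold the copies of one chunk
# in an md RAID 10 "near" layout.
# Usage: near_slots CHUNK DEVICES COPIES
near_slots() {
    chunk=$1
    devices=$2
    copies=$3
    k=0
    out=""
    while [ "$k" -lt "$copies" ]; do
        # Copy k of chunk c sits on slot (c * copies + k) mod devices.
        slot=$(( (chunk * copies + k) % devices ))
        out="${out:+$out }$slot"
        k=$((k + 1))
    done
    echo "$out"
}
```

With 4 devices and near=2, even chunks come out on slots 0 and 1 and odd chunks on slots 2 and 3. If that assumption holds here, losing slots 2 and 3 together removes both copies of every odd chunk, which would explain why md refuses to start the array.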

I'm pretty reluctant to get too crazy on running commands until I can diagnose exactly what is going on because this is a large data set that I haven't found a backup solution for yet. I know, I know. My own fault if I lose the data but I'd like to try everything possible before just destroying it.

Thanks!

EDIT: How do I go about getting this moved to the Hardware forum without doing a double post? I'm not sure this was the proper place to post. Thank you.

Last edited by bronco21016 (2013-03-31 03:05:59)

jasonwryan · 2013-04-01 00:38:57

You can always use the report button to have a post moved. Moving to Kernel & Hardware...

What happens when you try and manually assemble the array from a Live environment? Can you get it started and run diagnostics on it? Were there any messages in your logs, or mail about a failure?

bronco21016 · 2013-04-05 18:58:49

Sorry I took so long to post back!

I'll remember that little tip! Thanks.

When I got home I started poking around and found that I had most of the data stored on the array kinda scattered in different places on different machines so I decided to try getting a little more adventurous to try to force the thing back to running.

I eventually got it all working by manually disabling it then running...

mdadm --assemble /dev/md0 --force

Once I did that I now had 3 of the 4 drives up and it was a matter of mirroring the data to the remaining drive.
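Before forcing an assemble it can help to see how far apart the members are. A rough sketch, assuming the v1.2 superblock output of mdadm --examine (a "/dev/...:" header line per member and an "Events : N" field); examine_events is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: examine_events is a made-up helper name.
# Reads `mdadm --examine` output on stdin and prints one
# "device event-count" pair per member.
examine_events() {
    awk '
        # A line such as "/dev/sdb1:" starts the section for one member.
        /^\/dev\/.*:$/ { dev = $1; sub(/:$/, "", dev) }
        # The event counter line looks like "Events : 59".
        $1 == "Events" && $2 == ":" { print dev, $3 }
    '
}
```

Something like mdadm --examine /dev/sd[b-e]1 | examine_events would feed it (the partition names for the last two drives are a guess based on the first two). Members with a lower count are the stale ones, and --force tells mdadm to assemble even though some metadata looks out of date.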

Before doing all of this I ran some thorough diagnostics on the hard drives and all passed so I'm not sure at all what happened. Where are logs for mdadm kept? I'm pretty new to RAID so I wasn't sure where to go to find out what exactly caused the array to come apart like that. I also don't have monitoring set up to notify me of issues. It's been on the todo list but just hasn't happened yet. The one thing that was on the todo list though was getting CrashPlan set up and I can say now that it is all set up and running regular backups! Don't need any more scares like that! Luckily I didn't lose any data this time around.

Can anybody offer some additional tips or reading for managing and recovering RAID? Thanks!
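On the monitoring question, a bare-bones check can be built on /proc/mdstat. This is a sketch, not a replacement for mdadm's own monitor mode; degraded_arrays is a hypothetical helper name, and it assumes the usual mdstat layout with a member map such as [UUU_] on the status line.

```shell
#!/bin/sh
# Sketch only: degraded_arrays is a made-up helper name.
# Reads /proc/mdstat-style text on stdin and prints every array
# that is inactive or has a missing member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ {
            name = $1
            # An array that failed to start is listed as inactive.
            if ($3 == "inactive") { print name, "inactive"; name = "" }
            next
        }
        # The status line carries a map such as [UUU_]; "_" marks a missing member.
        name != "" && match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_")) print name, map
            name = ""
        }
    '
}
```

Run it as degraded_arrays < /proc/mdstat, for example from cron. The md driver itself logs through the kernel, so the messages end up in dmesg and the journal rather than in a file of their own.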

bronco21016 · 2013-05-09 21:33:38

Sorry to revive this thread from the dead but I had this issue crop up again.

I woke up the other morning to an e-mail saying the same two drives dropped out of the array and caused it to fail. The array has rebuilt just fine with all of the data intact and the drives all pass thorough S.M.A.R.T. tests.

Do I need to just replace these drives or what? It just doesn't seem possible for the drives to be bad considering no data is lost and it drops both drives at exactly the same time. Seems like something else is going on. Is anyone able to point me in the right places to look? Thanks.

jasonwryan · 2013-05-09 22:01:19

Was there anything else that happened at the time the drives dropped out of the array? Check journalctl and dmesg.

The fact that it is the same two drives failing would make me nervous about continuing to use those drives; replacing drives is cheap, data not so.

fukawi2 · 2013-05-09 23:04:05

jasonwryan wrote:

The fact that it is the same two drives failing would make me nervous about continuing to use those drives; replacing drives is cheap, data not so.

This. If you can, get 1 additional drive and set it up as a hot spare at a minimum.

Having recently experienced huge data loss when 3 out of 4 drives in an 8TB RAID-5 failed... backups are even more important when running RAID. Remember that you're increasing your chance of experiencing disk failure (while hopefully reducing the impact).

bronco21016 · 2013-05-10 00:43:27

Data is backed up after the first scare. It isn't critical data (replaceable media) but I learned my lesson!

I found this interesting thing in journalctl at the time right before I got an e-mail from mdadm:

May 05 03:53:56 ArchNAS kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 05 03:53:56 ArchNAS kernel: ata4.00: failed command: FLUSH CACHE EXT
May 05 03:53:56 ArchNAS kernel: [123B blob data]
May 05 03:53:56 ArchNAS kernel: ata4.00: status: { DRDY }
May 05 03:54:01 ArchNAS kernel: ata4: link is slow to respond, please be patient (ready=0)
May 05 03:54:06 ArchNAS kernel: ata4: device not ready (errno=-16), forcing hardreset
May 05 03:54:06 ArchNAS kernel: ata4: soft resetting link
May 05 03:54:12 ArchNAS kernel: ata4: link is slow to respond, please be patient (ready=0)
May 05 03:54:16 ArchNAS kernel: ata4: SRST failed (errno=-16)
May 05 03:54:16 ArchNAS kernel: ata4: soft resetting link
May 05 03:54:22 ArchNAS kernel: ata4: link is slow to respond, please be patient (ready=0)
May 05 03:54:26 ArchNAS kernel: ata4: SRST failed (errno=-16)
May 05 03:54:26 ArchNAS kernel: ata4: soft resetting link
May 05 03:54:32 ArchNAS kernel: ata4: link is slow to respond, please be patient (ready=0)
May 05 03:55:01 ArchNAS kernel: ata4: SRST failed (errno=-16)
May 05 03:55:01 ArchNAS kernel: ata4: soft resetting link
May 05 03:55:06 ArchNAS kernel: ata4: SRST failed (errno=-16)
May 05 03:55:06 ArchNAS kernel: ata4: reset failed, giving up

Since the drives check out in SMART I feel like they can be eliminated as being the problem. Now I did some searching and I found a post that recommended checking the power supply and cables.

The power supply can be eliminated because this problem has happened twice now. The first time was in an older case with a different power supply, and now this second time in a new rackmount case with a new power supply.

The two drives that dropped out are attached to a different brand of SATA cable than the other two drives so this makes me think the cables are suspect. Since the data is backed up and I love to learn and tinker I plan on swapping the cables around once the array finishes building. If a drive fails that is attached to the same cable as the drive that failed on ata4 after I swap them around then I can isolate the problem to the SATA cable. Cheap fix and we can all learn something!
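To see whether the trouble follows a cable or a port, the kernel log can be tallied per ATA port. A minimal sketch, assuming journal lines shaped like the ones pasted above; ata_failures is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: ata_failures is a made-up helper name.
# Reads kernel log lines on stdin and prints "port count" for every
# ATA port that logged a "failed command" line.
ata_failures() {
    # Keep only the port name (ata4 from "ata4.00: failed command: ..."),
    # then count how often each port shows up.
    sed -n 's/.*kernel: \(ata[0-9][0-9]*\)\.[0-9][0-9]*: failed command:.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Feed it with journalctl -k | ata_failures. Note that -k only covers the current boot, so earlier failures need the older boots selected as well (for example with -b -1).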

fukawi2 · 2013-05-10 01:04:42

Sounds like a good plan; it could be the SATA plugs on the mobo too perhaps? Are they on the same controller as each other, and different to the other 2 drives?

bronco21016 · 2013-05-10 01:31:11

How can I tell if it's a different controller on the mobo? All 4 connectors are on the mobo no PCI controllers or anything. I'll have to check it out later.

Kinda hope the mobo is crapped out because I'd love a good excuse to upgrade to an i7 so I can do some on the fly transcoding for Plex on my tablet while on the road!

fukawi2 · 2013-05-10 02:13:15

Only way to tell is to RTFM most likely. IME there's nothing on the board to (easily) identify which port is which controller.

progandy · 2013-05-10 04:34:04

bronco21016 wrote:

How can I tell if it's a different controller on the mobo? All 4 connectors are on the mobo no PCI controllers or anything. I'll have to check it out later.

Maybe first list all sata controllers (lspci). If there is more than one, compare the path for the drives, e.g.

$ cat /sys/class/block/sd?/device/firmware_node/path
$ ls -ld /sys/class/block/sd?/device/firmware_node/physical_node

Last edited by progandy (2013-05-10 04:37:50)

R00KIE · 2013-05-10 10:12:10

fukawi2 wrote:

Only way to tell is to RTFM most likely. IME there's nothing on the board to (easily) identify which port is which controller.

RTFM is always the best way to go, but I think these things are usually color coded so you can easily tell which controller you are using (if there is more than one onboard sata controller).

fukawi2 · 2013-05-10 10:30:40

I've found they're usually color coded only to differentiate SATA-II and SATA-III connectors; there may be more than 1 of each type, and they're the same color. Again, just IME.

bronco21016 · 2013-05-13 03:51:40

Sorry it took me so long to get back to you guys. My job just requires so much travel so maintaining the hardware side of this server can be frustrating and uptime is so important because I use the server for stuff on the road. Thanks for all the helpful responses!

fukawi2 wrote:

I've found they're usually color coded only to differentiate SATA-II and SATA-III connectors; there may be more than 1 of each type, and they're the same color. Again, just IME.

That's been my experience as well. They're all the same color FWIW.

fukawi2 wrote:

Only way to tell is to RTFM most likely. IME there's nothing on the board to (easily) identify which port is which controller.

That didn't really get me anywhere. The manual isn't in depth enough. Model is Asus P5G41T-M if you feel inclined.

I do know it's based on the Intel G41 chipset so I found this... http://www.intel.com/content/www/us/en/ … ipset.html

The specs listed on that page toward the bottom say the G41 supports up to 6 SATA ports which seems to me it's all one controller.

progandy wrote:

Maybe first list all sata controllers (lspci). If there is more than one, compare the path for the drives, e.g.
$ cat /sys/class/block/sd?/device/firmware_node/path
$ ls -ld /sys/class/block/sd?/device/firmware_node/physical_node

That yields...

_SB_.PCI0.IDE1.CHN0.DRV0
_SB_.PCI0.IDE1.CHN0.DRV1
_SB_.PCI0.IDE1.CHN1.DRV0
_SB_.PCI0.IDE1.CHN1.DRV1

and

pci0000:00/0000:00:1f.2/ata3/host2/target2:0:0/2:0:0:0
pci0000:00/0000:00:1f.2/ata3/host2/target2:0:1/2:0:1:0
pci0000:00/0000:00:1f.2/ata4/host3/target3:0:0/3:0:0:0
pci0000:00/0000:00:1f.2/ata4/host3/target3:0:1/3:0:1:0

For drives sd[b-e]

I see two different channels but I don't know if that means separate controllers. In fact I don't really know what most of any of that means haha.
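For reading those physical_node paths: the segment before ataN is the PCI address of the controller, ataN is the port, and the last segment is the SCSI address of the drive. A small sketch that pulls the three apart; ata_location is a hypothetical helper name and it assumes paths shaped like the ones above.

```shell
#!/bin/sh
# Sketch only: ata_location is a made-up helper name.
# Reads sysfs device paths on stdin and prints
# "controller port scsi-address" for each one.
ata_location() {
    awk -F/ '{
        ctrl = ""; port = ""
        for (i = 2; i <= NF; i++) {
            # The path segment right before ataN is the PCI function of the controller.
            if ($i ~ /^ata[0-9]+$/) { port = $i; ctrl = $(i - 1) }
        }
        if (port != "") print ctrl, port, $NF
    }'
}
```

On the output above every drive reports 0000:00:1f.2, so all four sit behind one PCI function, and they pair up two per port on ata3 and ata4. Together with the IDE1.CHN0.DRV0/DRV1 names that may mean the controller is running in an IDE-compatible mode where two drives share a channel, but that is a guess from the names and nothing in the thread confirms it.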

fukawi2 · 2013-05-13 04:19:47

Looking at the details for the mobo model, it certainly looks like 1 controller. Unlikely that 1 channel would be stuffed. I still like your dodgy cable theory. Have you tried that yet?

bronco21016 · 2013-05-13 04:40:06

Nah, not yet. I found some more cables in the parts bin the other day so once I get the motivation to take it down from the rack I'll give it a shot.

It took almost a month between failures the last time though so it could be months before I can say with certainty if the cables fix it.

bronco21016 · 2013-05-13 14:00:49

The plot thickens... Woke up this morning about an hour ago to two e-mails from mdadm. The opposite two drives have failed out now and I never moved cables around or anything.

What in the world is going on here?

May 13 09:17:19 ArchNAS kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 13 09:17:19 ArchNAS kernel: ata3.00: failed command: FLUSH CACHE EXT
May 13 09:17:19 ArchNAS kernel: [123B blob data]
May 13 09:17:19 ArchNAS kernel: ata3.00: status: { DRDY }
May 13 09:17:24 ArchNAS kernel: ata3: link is slow to respond, please be patient (ready=0)
May 13 09:17:29 ArchNAS kernel: ata3: device not ready (errno=-16), forcing hardreset
May 13 09:17:29 ArchNAS kernel: ata3: soft resetting link
May 13 09:17:34 ArchNAS kernel: ata3: link is slow to respond, please be patient (ready=0)
May 13 09:17:39 ArchNAS kernel: ata3: SRST failed (errno=-16)
May 13 09:17:39 ArchNAS kernel: ata3: soft resetting link
May 13 09:17:44 ArchNAS kernel: ata3: link is slow to respond, please be patient (ready=0)
May 13 09:17:49 ArchNAS kernel: ata3: SRST failed (errno=-16)
May 13 09:17:49 ArchNAS kernel: ata3: soft resetting link
May 13 09:17:54 ArchNAS kernel: ata3: link is slow to respond, please be patient (ready=0)
May 13 09:18:24 ArchNAS kernel: ata3: SRST failed (errno=-16)
May 13 09:18:24 ArchNAS kernel: ata3: soft resetting link
May 13 09:18:29 ArchNAS kernel: ata3: SRST failed (errno=-16)
May 13 09:18:29 ArchNAS kernel: ata3: reset failed, giving up
May 13 09:18:29 ArchNAS kernel: ata3.00: disabled
May 13 09:18:29 ArchNAS kernel: ata3.01: disabled
May 13 09:18:29 ArchNAS kernel: ata3.00: device reported invalid CHS sector 0
May 13 09:18:29 ArchNAS kernel: ata3: EH complete
May 13 09:18:29 ArchNAS kernel: sd 2:0:0:0: [sdb] Unhandled error code
May 13 09:18:29 ArchNAS kernel: sd 2:0:0:0: [sdb]  
May 13 09:18:29 ArchNAS kernel: Result: hostbyte=0x04 driverbyte=0x00
May 13 09:18:29 ArchNAS kernel: sd 2:0:0:0: [sdb] CDB: 
May 13 09:18:29 ArchNAS kernel: cdb[0]=0x2a: 2a 00 00 00 00 47 00 00 01 00
May 13 09:18:29 ArchNAS kernel: end_request: I/O error, dev sdb, sector 71
May 13 09:18:30 ArchNAS kernel: end_request: I/O error, dev sdb, sector 71

fukawi2 · 2013-05-13 23:06:15

I'm out; this is the point that I'd start panicking and replacing motherboards/SATA Controllers with more expensive ones in absence of any actual theory of the cause

R00KIE · 2013-05-14 10:28:52

Can you correlate that with any increase in activity and power draw? A PSU going bad _might_ be the cause, but this depends on how old the PSU you are using is.

Also do you have smartd running and do you do periodic array resyncs? Or do you have smartd running and can correlate the time of failure with heavy writes to the drives? I have noticed that some drives do not play well with writing and getting their smart parameters read at the same time.

Can't think of anything else right now.

bronco21016 · 2013-05-15 16:53:11

I can't really correlate with any power draw. The only things running are NFS, transmission-daemon, and MariaDB for my XBMC library. At the time of the failure this particular time there was a lot of activity by transmission-daemon but that hasn't been consistent with each time I have the problem.

This time I went ahead and commented out the mount points in fstab and rebooted. After the reboot the system came back up but drives sdd and sde (the two that have been causing the most problems) didn't appear in /dev so I just went ahead and replaced the SATA cables. Everything came back up as normal after the reboot. I'm assuming the cables were the issue but I'll wait a few months before I decide that's for certain.

bronco21016 · 2013-05-27 02:01:16

Sorry to keep bringing this back from the dead.

Since replacing the cables the failure has happened three more times now.

I've been slowly digging further trying to really narrow things down and through my looking at the logs I've found that the failure has occurred on every SATA port at some point in the multitude of failures. This really leads me to believe that this is not a disk issue or a SATA cable issue. I've eliminated the power supply already by switching that with another one I had laying around several failures ago. This basically leaves only the motherboard SATA controller.

At this point I'm about to start following fukawi2's advice and start replacing the motherboard/CPU. Unfortunately, I don't have another system I can test this theory on so I might just have to bite the bullet and buy the hardware. Before I drop $300 on a Core i5 setup I want to just get some peace of mind and have someone tell me I'm taking the right step here. In the meantime I've reached out to ASUS about getting the board replaced under warranty but I'm honestly not very hopeful.

Thanks for all the help so far guys (or gals)!

fukawi2 · 2013-05-27 04:13:30

I wouldn't be too hopeful on the warranty; it's far too easy for them to point to a hundred other things especially since you're running Linux. But good luck!

I guess the next cheapest option would be to buy a $40 SATA card to get off the mobo controller; that's cheaper than replacing the whole mobo etc

bronco21016 · 2013-05-27 04:31:03

I've tried messing with PCI SATA cards a few times and have found them to be nothing short of a nightmare when it comes to Linux. I dunno, maybe I'm just too dumb to do the research beforehand or maybe I'm just justifying the new toys.

A new mobo etc will finally let me run Plex with on the fly transcoding since I'm far too lazy to do the conversions on about 600 video files ranging from short clips to feature length films.

I figured ASUS won't be very helpful but it's worth a try right? Kinda sad that the board barely made it a year... I purchased it July '12. It's only really functioning as a NAS appliance but I guess something was damaged at some point.

R00KIE · 2013-05-27 13:11:29

You didn't say, but are you running smartd? Like I said before, I suspect some drives don't like their smart status probed when they aren't idle.

Another thing you might try is shorter cables if you can find them. A shorter cable has less length to pick up possible interference from other components, although that shouldn't be a problem to start with since signaling is differential.

I take care of a machine which shows the same problem you are experiencing, when I had smartd running the failures seemed to be more frequent, also I have opened the machine and rotated the cables (cable from drive A now plugs into drive B, B->C and C->A) and I have tried to route the sata cables as far away from the supply cables as possible and it seems to have helped, it didn't completely eliminate the problem though.

bronco21016 · 2013-05-27 15:27:50

You're right. I did have a post typed up about smartd a few days ago but I forgot to hit send so when I re-did my reply it wasn't there. Anyways....

No, I do not have smartd running on this machine. The SATA cables are already about as short as they can get since the mobo and the drives are kind of far apart in the server case. The thing about this failure is it was also happening in a previous case and with a different power supply. I really believe this is the motherboard gone bad even though I don't wanna believe it.

The single thing I can think of that changed before the failure began was adding an IDE drive. The OS used to exist on a 2 GB flash drive but as the girlfriend purchased more electronics that needed a server (squeezebox and tablet with Plex) I found I needed more space. I had an old 20GB IDE drive laying around so I migrated the OS to that and now had the space I needed.

I guess I shouldn't have left this out earlier but just didn't even think of it until now. Is it possible the IDE drive is overloading the controller or somehow causing conflict? I don't have any other SATA ports available so the only other option at this point would be some sort of PCI card and as mentioned before I've purchased a few and can never get them working properly with Linux. Maybe I need to take it down from the rack and see if there's anything in the BIOS I can change that might alleviate the issue of an IDE/SATA conflict.

R00KIE wrote:

I take care of a machine which shows the same problem you are experiencing, when I had smartd running the failures seemed to be more frequent, also I have opened the machine and rotated the cables (cable from drive A now plugs into drive B, B->C and C->A) and I have tried to route the sata cables as far away from the supply cables as possible and it seems to have helped, it didn't completely eliminate the problem though.

This problem machine you mention... do you know what chipset the motherboard is using? Mine is the Intel G41. Maybe we could narrow this problem down to a particular chipset?

Again thanks for working with me on this guys. I feel like I'm becoming a vampire but narrowing this down really has me lost.

Last edited by bronco21016 (2013-05-27 15:41:31)
