
#1 2011-02-27 04:16:08

TomB17
Member
Registered: 2009-09-02
Posts: 102

[SOLVED] System hung while mdadm is stabilizing

I've got an external 8 bay SATA RAID chassis (connected to my PC with 2 x eSATA) that I'm trying to expand.  It had 5 x 2TB drives (WD20EARS) configured as RAID5.   I added 3 more 2TB drives.

I did an mdadm --grow /dev/md3 --raid-devices=7

I'd like to get to 7 devices with a spare.  It gave me an error about not being able to back up some table and then it just went to work stabilizing with the 7 disk configuration.  That was about 6 hours ago.  The last I saw, it was something like 13% complete.  lol!
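
For reference, mdadm's grow mode accepts a --backup-file option. Assuming the message above referred to the critical-section backup, a reshape could be started with one along these lines. This is only a sketch: the backup path is a made-up example (it should sit on a disk that is not part of the array being reshaped), and the function just prints the command so it can be reviewed before anything is run.

```shell
#!/bin/sh
# Print (do not run) an mdadm grow command that keeps a critical-section backup.
# The backup path used below is a hypothetical example, not from the original post.
grow_cmd() {
    array=$1      # e.g. /dev/md3
    devices=$2    # target number of active devices
    backup=$3     # file on a disk outside the array (assumed path)
    printf 'mdadm --grow %s --raid-devices=%s --backup-file=%s\n' "$array" "$devices" "$backup"
}

grow_cmd /dev/md3 7 /root/md3-grow.bak
```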

Meanwhile, I was adding 2 more drives to another drive array.  Basically, taking a 4 x 750GB drive RAID5 to a 6 x 750GB array.  They are on two different controllers and I was doing it simultaneously.  The smaller array gave me a similar error message.

As is always the case, I was moving a bunch of stuff around, clearing room so I could free loose drives I was using for backup and add them into the array.  The point being, I've got a mountain of data at stake including several databases and system images of systems that are no longer around but I'd really like to have on hand.

There is a backup of the backups (a 10 x 1.5TB RAID5 chassis).  I rsync to it every so often but it hasn't happened for a couple of months.  Anyway, I'm sure the backups are fine.

... so here's the situation...  the computer is locked.  The mouse pointer doesn't move.  I'm loath to reboot the system because I looked up the error message mdadm spit out when I added the disks and it said basically the system couldn't back up a critical table that would allow the operation to be failsafe.  The point being, hope nothing happens during the process or all is lost.  Of course, something did happen.

I will add the system has been rock stable since I got it.  I've never seen it hang before.  I'm running 32 bit Arch and did a full system upgrade on Friday.

I've noticed a few bugs regarding particularly large arrays being expanded leading to lockup but those bugs go back to 2006.  I would imagine those issues are long fixed.

The system responds to PINGs but will not accept SSH connections.

Any opinions on whether to power cycle the system, or not?  Maybe I should unplug the eSATA connections, hoping just one of the arrays will fail allowing the other to recover?

Switch to Windows?  lol!  :D

I'm wide open to advice here.

Last edited by TomB17 (2011-03-14 01:00:37)


#2 2011-02-28 16:46:38

TomB17

Re: [SOLVED] System hung while mdadm is stabilizing

OK...  here's how it sorted out.

I had to reset the box.  It came up and hung again during the reshapes twice.

From there, I booted to an Arch install USB.  I assembled the smaller of the two arrays and it stabilized in about 3 hours.  From there, I rebooted back into the main system and the big array has been stable while reshaping ever since.

It would seem that mdadm is not stable reshaping two arrays at once.
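
A cautious way to avoid overlapping reshapes is to check /proc/mdstat before starting the next one. The sketch below assumes the usual mdstat progress wording ("reshape =", "resync =", "recovery ="); the function name md_busy is made up for illustration. mdadm also has a --wait option that blocks until such activity on a given array finishes.

```shell
#!/bin/sh
# Return success (0) if the given mdstat text shows a reshape, resync or recovery.
# md_busy is a hypothetical helper name; $1 is normally /proc/mdstat.
md_busy() {
    grep -Eq '(reshape|resync|recovery) *=' "$1"
}

# Only suggest starting another grow when nothing else is in progress.
if md_busy /proc/mdstat 2>/dev/null; then
    echo "an array is still rebuilding or reshaping; wait before growing another"
else
    echo "no reshape in progress"
fi
```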

The smaller array came back with no errors.  The big array is still reshaping but I am already seeing tons of problems with it.  It's to the point that if I download a large archive to it and run md5sum against it, the check fails every time.  It's going to take a couple of weeks to stabilize and I've decided to wait it out.  There's too much data on the array to walk away from.  It's no big deal if my MythTV files have the odd bad block.  I'll still watch the shows.  I hope my databases are OK but suspect they won't be.  We'll see.

Once it's stabilized, I'd like to do an fsck on the file system.  It's XFS so I doubt it will work.  XFS has always blown up trying to check that drive.  Maybe I'll boot to the 64 bit Arch install and see if I can test the XFS from there.


#3 2011-03-01 17:24:35

TomB17

Re: [SOLVED] System hung while mdadm is stabilizing

Update: It's been running since Sunday.  I expect it to be complete sometime next week.

As raid456 reshapes my array, it is slowly but surely ruining the data.

Even copying a file onto that partition and then checking it with md5sum will fail, despite checking successfully on other partitions.

Personalities : [raid6] [raid5] [raid4]
md3 : active raid5 sdb1[6] sda1[8](S) sdc1[7] sdd1[5] sdg1[0] sdf1[2] sdh1[3] sde1[1]
      7814027264 blocks super 1.2 level 5, 512k chunk, algorithm 2 [7/7] [UUUUUUU]
      [==========>..........]  reshape = 53.9% (1053587456/1953506816) finish=3288.8min speed=4560K/sec
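
The finish estimate on that status line can be reproduced from the other numbers on it: remaining blocks divided by the current speed. A small sketch (the function name is made up; the inputs are the 1K-block counts and K/sec speed exactly as mdstat prints them):

```shell
#!/bin/sh
# Estimate remaining reshape time in minutes from an mdstat progress line.
# $1 = blocks done, $2 = total blocks (both 1K blocks), $3 = speed in K/sec.
reshape_eta_min() {
    awk -v d="$1" -v t="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (t - d) / s / 60 }'
}

# Figures from the status line above: (1053587456/1953506816) at 4560K/sec
reshape_eta_min 1053587456 1953506816 4560
```

This prints 3289.2, which agrees with the finish=3288.8min shown by the kernel to within the rounding of the speed figure. That is a bit over two days at the current rate.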

:(


#4 2011-03-12 19:20:12

TomB17

Re: [SOLVED] System hung while mdadm is stabilizing

Update:

Since these problems a couple of weeks ago, I ran 'badblocks -wvs' against every disk in the array.  After several days with no errors, I re-striped the array as RAID 6 and let it completely build before formatting or using the partition in any way.  This time, it completed without hanging.

md125 : active raid6 sdp1[7] sdo1[6] sdn1[5] sdm1[4] sdl1[3] sdk1[2] sdj1[1] sdi1[0]
      11721071616 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

I formatted with EXT4 and put some data on the drive.  It will not store data without corrupting it.

If I download the most recent MiniMyth distribution and its MD5 checksum to that partition, the md5sum check fails.  Downloading the distribution and checksum to another partition, however, works perfectly every time.

md5sum -c ram-minimyth-0.23.1-76.tar.bz2.md5
ram-minimyth-0.23.1-76.tar.bz2: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
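
The same kind of test can be done without a download by copying a known-good local file onto the suspect partition and comparing hashes. A minimal sketch, with a made-up function name; note that a read-back right after the copy may be served from the page cache, so a large file (or dropping caches first) makes for a more honest test.

```shell
#!/bin/sh
# Copy a file into a target directory and check that the copy hashes the same
# as the source. Prints OK or MISMATCH; returns non-zero on failure.
verify_copy() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/$(basename "$src")"
    cp "$src" "$dest" || return 2
    sync    # flush writes; the read-back may still come from cache
    src_sum=$(md5sum < "$src" | cut -d' ' -f1)
    dest_sum=$(md5sum < "$dest" | cut -d' ' -f1)
    if [ "$src_sum" = "$dest_sum" ]; then
        echo "OK $dest"
    else
        echo "MISMATCH $dest"
        return 1
    fi
}
```

Usage would be something like verify_copy /path/to/known-good-file /path/to/array/mountpoint, where both paths are placeholders.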

I'm at a bit of a roadblock.  I thought the partition size might be too big for 32 bit Linux but it looks like it should be good to 16TB with 4KB blocks.  dumpe2fs shows the filesystem has 4K block size.
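
The arithmetic behind that limit, as a sketch: 2^32 block numbers times 4096 bytes per block is 16 TiB, and the array size reported in mdstat (in 1K blocks) can be checked against it. The function name is made up for illustration.

```shell
#!/bin/sh
# Compare an md array size (in 1K blocks, as /proc/mdstat reports it) with the
# 16 TiB ceiling that 32-bit block numbers give at a 4 KiB block size.
fits_16tib() {
    blocks_1k=$1
    limit_bytes=$((4294967296 * 4096))    # 2^32 blocks * 4096 bytes = 16 TiB
    size_bytes=$((blocks_1k * 1024))
    if [ $((size_bytes <= limit_bytes)) -eq 1 ]; then
        echo "fits"
    else
        echo "too big"
    fi
}

fits_16tib 11721071616    # size of md125 from the mdstat output above
```

This prints "fits": the array is roughly 10.9 TiB, comfortably under the limit.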

For what it's worth, I didn't specify any of the tuning parameters when I created the stripe set or the file system.

mdadm --create /dev/md125 --level=6 --raid-devices=8 /dev/sd{i,j,k,l,m,n,o,p}1
mkfs.ext4 /dev/md125
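
For anyone who does want to set the ext4 RAID tuning values by hand, they follow from the array geometry: stride is the chunk size divided by the filesystem block size, and stripe width is the stride times the number of data disks. A sketch with a made-up helper name, using the geometry shown above (512k chunk, 4K blocks, 8 disks, RAID 6):

```shell
#!/bin/sh
# Work out ext4 stride and stripe width for an md array.
# stride       = chunk size / filesystem block size
# stripe width = stride * data disks (RAID 6 gives up two disks' worth to parity)
ext4_raid_opts() {
    chunk_kb=$1; block_kb=$2; disks=$3; parity=$4
    stride=$((chunk_kb / block_kb))
    width=$((stride * (disks - parity)))
    echo "stride=$stride,stripe_width=$width"
}

# 512k chunk, 4K blocks, 8 disks, RAID 6 (2 parity)
ext4_raid_opts 512 4 8 2
```

This prints stride=128,stripe_width=768, which could be passed to mkfs.ext4 with -E. These values only affect performance, and mkfs.ext4 may pick up the geometry from the md device on its own, so leaving them out should not cause corruption.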

I'm going to stew on this a bit but would appreciate ideas.  Has anyone else successfully created a large RAID 5 or 6 array?  Am I missing something?


#5 2011-03-14 00:59:48

TomB17

Re: [SOLVED] System hung while mdadm is stabilizing

The problem has been found.  It turned out to be buggy hardware.  I was using SiI 3132 cards (2 x eSATA each) to connect to external SATA cages.  The SiI 3132, it turns out, has a bug which causes 'silent corruption' when both ports are driven hard.

It was hard to track that one down.  Not many people know about it.

I've ordered a pair of Marvell 9128 cards to replace the almost new but malfunctioning SiI 3132 cards.

In the meantime, I've re-created the big array with it connected to only one port on each card.  It's working perfectly now.  This morning, I copied a TB of data onto the partition and did a bunch of md5sums.  Everything looks good.  It's much faster now too.

Longer term, I'd like to move to ZFS.  For now, I'm just happy everything is working.

:)

I hope this thread helps someone who thinks mdadm is buggy when it's really crappy Silicon Image hardware.  It was a tough one to track down.

Last edited by TomB17 (2011-03-14 01:02:20)

