Hi, all (Not sure if this belongs here or in Kernel/Hardware)
I'm hoping I can get some help here or, at the very least, some pointers to some help. In case it's relevant, I'm using the LTS kernel 3.0.36-1-lts and mdadm v3.2.5 (2012, May 18th).
I have (had?) a RAID5 array of 4x 1.5TB drives (that works out to 4.5TB, or 4.1TiB). I added a fifth drive, went through the standard growth procedure (sketched below), and everything seemed fine. At 68% through the reshape one of the disks failed, and the resulting hung-task errors (details below) appear to have caused a panic and reboot. RAID6 from here on out, I promise. xD
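For reference, the growth procedure I followed was roughly this (reconstructing from memory; sdg1 was the new disk, and the backup-file path is just the one I happened to use):

# mdadm /dev/md3 --add /dev/sdg1
# mdadm --grow /dev/md3 --raid-devices=5 --backup-file=/root/md3-grow.backup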
In theory all my data should still be available on the remaining disks; I just don't know how to get to it. Here's what I've been trying so far:
Attempting to assemble the array with 4 out of 5 drives fails because the new drive appears to be seen as a "spare" - perhaps that is standard until it has been fully integrated into the array. The error here is:
mdadm: /dev/md3 assembled from 3 drives and 1 spare - not enough to start the array.
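The 4-drive attempt looks roughly like this (device names as on my box), and running mdadm --examine on each member is a way to see what role each superblock claims:

# mdadm --assemble /dev/md3 /dev/sd[bcde]1
# mdadm --examine /dev/sd[bcde]1 | grep -E 'Device Role|Reshape'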
Attempting to assemble the array with 5 out of 5 drives works briefly but, no matter what I do, mdadm tries to finish the reshape. About two minutes after the assemble attempt, because the failing disk returns what appears to be a permanent read error, the console starts printing messages along the lines of:
INFO: task md3_reshape:$PID blocked for more than 120 seconds.
[ 1080.320000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
This happens even though /proc/mdstat shows the disk as failed and the array as degraded. After a few minutes of the above error I have to REISUB or hard-reset because the server becomes unresponsive, and I really don't want to do that too often. I've tried the "--freeze-reshape" flag, but either I'm doing it wrong or the option doesn't do what I expect, since the above error still happens.
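For the record, this is roughly how I've been invoking it (device names as on my box):

# mdadm --stop /dev/md3
# mdadm --assemble --freeze-reshape /dev/md3 /dev/sd[bcde]1 /dev/sdg1

As far as I can tell from the man page, --freeze-reshape is supposed to assemble the array without restarting the reshape, but the hung-task errors still appear.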
This is the status immediately after booting with sdg removed; md3 is the array I'm worried about. A reassemble requires a --stop, (optionally plugging the failed drive back in and a --stop again), and then the --assemble command (the sequence is sketched after the status below). If only it would work:
$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md3 : inactive sdc1[1](S) sdb1[6](S) sde1[4](S) sdd1[3](S)
      5860546144 blocks super 1.2

md1 : active raid1 sdf3[2] sda3[0]
      239746096 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md4 : active raid1 sdf2[1] sda2[0]
      4193268 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md0 : active raid1 sdf1[1] sda1[0]
      255936 blocks [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
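The reassemble sequence mentioned above, roughly (again, device names as on my box; the second --stop only matters if the failed drive gets plugged back in between, since the kernel may partially auto-assemble it again):

# mdadm --stop /dev/md3
(plug the failed drive back in here, if trying with all five)
# mdadm --stop /dev/md3
# mdadm --assemble /dev/md3 /dev/sd[bcde]1 /dev/sdg1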
Quick update (in case anyone else hits a similar situation in the future): I'm using GNU ddrescue to clone the failing disk to another new disk, which should at least get most of the data back.
See http://www.forensicswiki.org/wiki/Ddrescue for info on how to use ddrescue.
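In case the exact invocation helps anyone, I'm running something along these lines (sdg is the failing disk and sdh the new one in my case; triple-check the order before running, since ddrescue will happily overwrite the wrong target):

# ddrescue -f -n /dev/sdg /dev/sdh /root/sdg.map
# ddrescue -f -d -r3 /dev/sdg /dev/sdh /root/sdg.map

The first pass (-n skips the slow retry/split phase) grabs the easy data quickly; the second (-d direct disc access, -r3 retry bad sectors three times) goes back for the difficult areas, resuming from the mapfile.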