I have a system running a 12 x 2TB mdadm RAID6 with internal bitmap.
The system's root is located on that RAID, and recently the system became unresponsive during a pacman -Syu and had to be forcibly powered off. It has been unable to boot since then.
I have booted a recovery Arch installation from USB, and all 12 disks are operational and have good-looking mdadm headers.
Eleven disks have the same event count; one disk (/dev/sdb) has a slightly lower event count, so presumably that is the one that failed.
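For reference, the per-disk event counts can be compared with something like this (a sketch only; the drive letters match my setup and the fields are the usual mdadm --examine output):

# Compare event counts and update times across all member disks
for d in /dev/sd[abcdefghjlmn]; do
    echo "== $d"
    mdadm --examine "$d" | grep -E 'Events|Update Time'
done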
Therefore my plan was as follows (rough commands sketched below the list):
Assemble and start the RAID with 11 of 12 devices, without /dev/sdb, in degraded mode
Re-add /dev/sdb so that its missing events are replayed from the bitmap/RAID redundancy if possible
Manually fail and remove /dev/sdb, and add a good drive as replacement
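In terms of mdadm commands, something along these lines (a sketch only; /dev/md/raid stands in for the actual array device and /dev/sdX for the hypothetical replacement drive):

# 1. Assemble degraded, without /dev/sdb
mdadm --assemble /dev/md/raid /dev/sd[acdefghjlmn]

# 2. Re-add the stale disk; with the internal write-intent bitmap this
#    should only replay the regions that changed since it dropped out
mdadm /dev/md/raid --re-add /dev/sdb

# 3. Once it has caught up, swap in a fresh drive
mdadm /dev/md/raid --fail /dev/sdb --remove /dev/sdb
mdadm /dev/md/raid --add /dev/sdX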
So for starters I assembled the RAID from the 11 good devices, the ones that have the same event count:
[root@recovery defaultuser]# mdadm --assemble homeserver:raid --force /dev/sd[acdefghjlmn]
mdadm: /dev/md/homeserver:raid has been started with 11 devices (out of 12).
I did not expect anything else to happen at this point, before re-adding /dev/sdb. But to my great surprise mdadm started a resync operation on 11 drives. Wat.
What's that about? I mean, they all have the same event count and update time. What does mdadm think it is resyncing where to?
[root@recovery defaultuser]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdc[0] sdj[8] sdl[9] sdm[10] sdn[11] sdd[7] sde[6] sdg[5] sdh[4] sda[12] sdf[2]
19530270720 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/11] [U_UUUUUUUUUU]
[=>...................] resync = 5.5% (107548672/1953027072) finish=332.1min speed=92596K/sec
bitmap: 1/15 pages [4KB], 65536KB chunk
unused devices: <none>
As the disks are already somewhat old and worn out, I am not entirely happy with this turn of events.
Anyone have an explanation for what is going on?
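In the meantime, if I understand the md sysfs interface correctly, the resync can at least be paused while this gets sorted out (a sketch, using the md127 device from above):

# See what the array thinks it is doing (resync, recover, check, ...)
cat /sys/block/md127/md/sync_action

# Pause the ongoing resync ...
echo frozen > /sys/block/md127/md/sync_action
# ... and let it continue again later
echo idle > /sys/block/md127/md/sync_action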
For completeness, here is the output of mdadm --detail:
[root@recovery defaultuser]# mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Sun Oct 25 17:20:09 2015
Raid Level : raid6
Array Size : 19530270720 (18625.52 GiB 19999.00 GB)
Used Dev Size : 1953027072 (1862.55 GiB 1999.90 GB)
Raid Devices : 12
Total Devices : 11
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Nov 11 23:25:51 2018
State : clean, degraded, resyncing
Active Devices : 11
Working Devices : 11
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Resync Status : 15% complete
Name : homeserver:raid
UUID : XXXXXXXX:XXXXXXXX:XXXXXXXX:XXXXXXXX
Events : 55633
Number Major Minor RaidDevice State
0 8 32 0 active sync /dev/sdc
- 0 0 1 removed
2 8 80 2 active sync /dev/sdf
12 8 0 3 active sync /dev/sda
4 8 112 4 active sync /dev/sdh
5 8 96 5 active sync /dev/sdg
6 8 64 6 active sync /dev/sde
7 8 48 7 active sync /dev/sdd
11 8 208 8 active sync /dev/sdn
10 8 192 9 active sync /dev/sdm
9 8 176 10 active sync /dev/sdl
8 8 144 11 active sync /dev/sdj
-f, --force
       Assemble the array even if the metadata on some devices appears to be out-of-date. If mdadm cannot find enough working devices to start the array, but can find some devices that are recorded as having failed, then it will mark those devices as working so that the array can be started. An array which requires --force to be started may contain data corruption. Use it carefully.

-R, --run
       Attempt to start the array even if fewer drives were given than were present last time the array was active. Normally if not all the expected drives are found and --scan is not used, then the array will be assembled but not started. With --run an attempt will be made to start it anyway.
You lost the data on sdb; mdadm is using the data on the other 11 drives to recreate it.
Maybe --run would have been a better option than --force.
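Something along these lines, presumably (a sketch; /dev/md/raid stands in for the actual array device):

# Start the array degraded without forcing out-of-date members back in
mdadm --assemble --run /dev/md/raid /dev/sd[acdefghjlmn]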
You lost the data on sdb; mdadm is using the data on the other 11 drives to recreate it.
This I would have understood.
However, that wasn't the case here: mdadm started the resync with only 11 of 12 drives and no spare, with /dev/sdb not being part of the assembled array in any way, as can be seen in the /proc/mdstat output above.
Maybe --run would have been a better option than --force.
Thanks, I didn't know about --run; next time I'll try that first.
I suspect that when the system crashed, more went wrong than just losing /dev/sdb, because losing one disk, or even two, would not bring down a RAID6.
Anyway, the resync ran through without errors, so I had at least 11 of 12 disks in sync.
Following that, I discovered that the SATA connection of /dev/sdb is apparently faulty: neither /dev/sdb nor the replacement disk could write data, and both kept erroring out in the kernel log with driverbyte=DRIVER_SENSE.
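Roughly how those errors show up in the kernel log (a sketch; the exact wording varies by kernel version):

# Look for SCSI sense / write errors from the affected disk
dmesg | grep -iE 'sdb|sense|i/o error'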
As a stop-gap measure I connected /dev/sdb with a USB docking station and re-assembled the RAID, and it synced up all 12 disks in a matter of minutes.
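Roughly what that looked like (a sketch; the disk may get a different device name when attached via USB):

# Stop the degraded array, then assemble it again with all 12 members,
# /dev/sdb now sitting in the USB dock
mdadm --stop /dev/md127
mdadm --assemble /dev/md127 /dev/sd[abcdefghjlmn]

# The internal bitmap means only the changed regions get rewritten
cat /proc/mdstat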
I'll see how to proceed from there; maybe change the SATA cable or connect the disk to a different SATA port on the mainboard.
Or maybe get a different mainboard; this stuff is kind of old and all of it just reeks of shoddy hardware.