#1 2018-11-11 22:06:03

eomanis
Member
Registered: 2013-04-17
Posts: 50

Why is mdadm resyncing this degraded/incomplete array?

I have a system running a 12 x 2TB mdadm RAID6 with internal bitmap.
The system's root is located on that RAID, and recently the system became unresponsive during a pacman -Syu and had to be forcibly powered off. It has been unable to boot since then.

I have booted a recovery Arch installation from USB, and all 12 disks are operational and have good-looking mdadm headers.
11 disks have the same event count; 1 disk (/dev/sdb) has a slightly lower event count. Presumably that one failed.
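
(For reference, the per-device event counts can be read from the superblocks with something along these lines; the device letters are the ones from my setup, the exact invocation may differ:)

for d in /dev/sd[abcdefghjlmn]; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Update Time|Events'
done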

Therefore my plan was as follows (rough commands are sketched right after the list):

  • Assemble and start the RAID with 11 of 12 devices, without /dev/sdb, in degraded mode

  • Re-add /dev/sdb so that its missing events are replayed from the bitmap/RAID redundancy if possible

  • Manually fail and remove /dev/sdb, and add a good drive as replacement
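
Roughly, those steps amount to something like this (a sketch only; I am assuming the assembled array shows up as /dev/md127, and /dev/sdo is just a stand-in name for the replacement drive):

# 1. Assemble and start degraded, without /dev/sdb
mdadm --assemble /dev/md127 /dev/sd[acdefghjlmn]
# 2. Re-add the stale disk; with the internal bitmap only the missed writes should be replayed
mdadm --manage /dev/md127 --re-add /dev/sdb
# 3. Fail and remove it, then add the replacement
mdadm --manage /dev/md127 --fail /dev/sdb
mdadm --manage /dev/md127 --remove /dev/sdb
mdadm --manage /dev/md127 --add /dev/sdo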

So for starters I assembled the RAID from the 11 good devices, the ones that have the same event count:

[root@recovery defaultuser]# mdadm --assemble homeserver:raid --force /dev/sd[acdefghjlmn]
mdadm: /dev/md/homeserver:raid has been started with 11 devices (out of 12).

I did not expect anything else to happen at this point, before re-adding /dev/sdb. But to my great surprise, mdadm started a resync operation across the 11 drives.
What is that about? They all have the same event count and update time, so what does mdadm think it is resyncing, and where to?

[root@recovery defaultuser]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md127 : active raid6 sdc[0] sdj[8] sdl[9] sdm[10] sdn[11] sdd[7] sde[6] sdg[5] sdh[4] sda[12] sdf[2]
      19530270720 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/11] [U_UUUUUUUUUU]
      [=>...................]  resync =  5.5% (107548672/1953027072) finish=332.1min speed=92596K/sec
      bitmap: 1/15 pages [4KB], 65536KB chunk

unused devices: <none>

As the disks are already somewhat old and worn out, I am not entirely happy with this turn of events.
Anyone have an explanation for what is going on?

For completeness, here is the output of mdadm --detail:

[root@recovery defaultuser]# mdadm --detail /dev/md127 
/dev/md127:
        Version : 1.2
  Creation Time : Sun Oct 25 17:20:09 2015
     Raid Level : raid6
     Array Size : 19530270720 (18625.52 GiB 19999.00 GB)
  Used Dev Size : 1953027072 (1862.55 GiB 1999.90 GB)
   Raid Devices : 12
  Total Devices : 11
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Nov 11 23:25:51 2018
          State : clean, degraded, resyncing 
 Active Devices : 11
Working Devices : 11
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

  Resync Status : 15% complete

           Name : homeserver:raid
           UUID : XXXXXXXX:XXXXXXXX:XXXXXXXX:XXXXXXXX
         Events : 55633

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       -       0        0        1      removed
       2       8       80        2      active sync   /dev/sdf
      12       8        0        3      active sync   /dev/sda
       4       8      112        4      active sync   /dev/sdh
       5       8       96        5      active sync   /dev/sdg
       6       8       64        6      active sync   /dev/sde
       7       8       48        7      active sync   /dev/sdd
      11       8      208        8      active sync   /dev/sdn
      10       8      192        9      active sync   /dev/sdm
       9       8      176       10      active sync   /dev/sdl
       8       8      144       11      active sync   /dev/sdj

Last edited by eomanis (2018-11-11 22:29:42)


#2 2018-11-12 11:48:27

Lone_Wolf
Forum Moderator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,920

Re: Why is mdadm resyncing this degraded/incomplete array?

man mdadm wrote:

       -f, --force
              Assemble the array even if the metadata on some devices appears to be out-of-date.  If mdadm cannot find enough working  devices  to  start
              the  array,  but can find some devices that are recorded as having failed, then it will mark those devices as working so that the array can
              be started.  An array which requires --force to be started may contain data corruption.  Use it carefully.

       -R, --run
              Attempt to start the array even if fewer drives were given than were present last time the array was  active.   Normally  if  not  all  the
              expected  drives are found and --scan is not used, then the array will be assembled but not started.  With --run an attempt will be made to
              start it anyway.


You lost the data on sdb; mdadm is using the data on the other 11 drives to recreate it.
Maybe --run would have been a better option than --force.
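
For example, something along these lines (an untested sketch, using the same component devices and array node as in your output above):

mdadm --assemble --run /dev/md127 /dev/sd[acdefghjlmn]

Unlike --force, --run only starts the array with fewer devices than expected; it does not mark devices with out-of-date metadata as working.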




#3 2018-11-12 16:09:59

eomanis
Member
Registered: 2013-04-17
Posts: 50

Re: Why is mdadm resyncing this degraded/incomplete array?

Lone_Wolf wrote:

You lost the data on sdb; mdadm is using the data on the other 11 drives to recreate it.

This I would have understood.
However, that wasn't the case here: mdadm started the resync with only 11 of the 12 drives and no spare, and /dev/sdb was not part of the assembled array in any way, as can be seen in the /proc/mdstat output.

Lone_Wolf wrote:

Maybe --run would have been a better option than --force.

Thanks, I didn't know about --run, next time I'll try that first.

I suspect that when the system crashed, more went wrong than just losing /dev/sdb, because losing one disk, or even two, would not bring down a RAID6.

Anyway, the resync ran through without errors, so I had at least 11 of the 12 disks in sync.
Following that I discovered that the SATA connection of /dev/sdb is apparently faulty: neither /dev/sdb nor the replacement disk could write data on that port, and they always errored out in the kernel log with driverbyte=DRIVER_SENSE.
As a stop-gap measure I connected /dev/sdb via a USB docking station and re-assembled the RAID, and it synced up all 12 disks in a matter of minutes.
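
Roughly, that re-assembly amounts to something like the following (a sketch only; /dev/sdX stands in for whatever device name the USB dock gave the disk, and the other letters may of course have shifted after re-plugging):

mdadm --stop /dev/md127
mdadm --assemble /dev/md127 /dev/sd[acdefghjlmn]
mdadm --manage /dev/md127 --re-add /dev/sdX

With the internal bitmap, only the regions written since the disk dropped out need to be resynced, which would explain why it finished within minutes.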

I'll see how to proceed from there; maybe change the SATA cable or connect the drive to a different SATA port on the mainboard.
Or maybe get a different mainboard; the hardware is kind of old and all of this just reeks of shoddy hardware.
