#1 2021-07-25 06:05:54

lamarpavel
Member
Registered: 2015-10-18
Posts: 48

Spare device of degraded RAID5 failed to set as replacement

Hi there,

I've got a RAID5 here that wasn't looking too good even before I made a (possibly fatal) mistake. I'm assuming the data is lost, but it would be foolish not to at least ask for second opinions before wiping it.
So, first the situation:

Mistake 0: For years I never bothered reshaping the RAID5 to a RAID6 because I assumed this sort of failure probably wasn't going to happen to me.
I discovered that one of the devices of this RAID5 had apparently been removed without my doing anything. I'll call this one /dev/sdb1.
Then I discovered that my solution for sending notification mails from SMART had died of old age months ago without anyone noticing.
Mistake 1: it was a custom script, because ~5 years ago I didn't want to (or wasn't able to) set up sendmail/dma/etc.
For some weeks the error rates of /dev/sdb had been climbing, and for some days they had stayed the same, but I assume that's because it had been removed from the array and so no longer received R/W requests.
I got two new drives, tested their stability overnight and partitioned them.
Then I marked the (already removed) device /dev/sdb1 as failed, removed it (again, just to be sure), shut down the server and physically removed the drive while installing the two new ones.
Finally, I added the two new drives as spares (rough command sketch after the --detail output below) and mdadm immediately started recovering:

# mdadm --misc --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Sep 19 18:50:34 2013
        Raid Level : raid5
        Array Size : 11720659200 (11177.69 GiB 12001.96 GB)
     Used Dev Size : 3906886400 (3725.90 GiB 4000.65 GB)
      Raid Devices : 4
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Jul 24 15:55:47 2021
             State : clean, degraded, recovering
    Active Devices : 3
   Working Devices : 5
    Failed Devices : 0
     Spare Devices : 2

            Layout : left-symmetric
        Chunk Size : 256K

Consistency Policy : bitmap

    Rebuild Status : 3% complete

              Name : <redacted>
              UUID : <redacted>
            Events : 479831

    Number   Major   Minor   RaidDevice State
       6       8       17        0      spare rebuilding   /dev/sdb1
       7       8       49        1      active sync   /dev/sdd1
       4       8       33        2      active sync   /dev/sdc1
       5       8       65        3      active sync   /dev/sde1

       8       8       81        -      spare   /dev/sdf1
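
For reference, the sequence of commands was roughly the following. This is reconstructed from memory, so treat it as a sketch rather than an exact transcript (after the swap, the first new drive inherited the sdb name):

# mdadm /dev/md127 --fail /dev/sdb1
# mdadm /dev/md127 --remove /dev/sdb1
... shut down, swap the hardware, partition the new drives, boot ...
# mdadm /dev/md127 --add /dev/sdb1
# mdadm /dev/md127 --add /dev/sdf1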

Mistake 2: Before removing the failed device I should have tried to re-add it and then use the --replace option of mdadm, but I only read the Arch wiki and not the manpage...
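
From the manpage I gather it would have looked something like this (untested, and /dev/sdX1 is just a placeholder for one of the new drives):

# mdadm /dev/md127 --re-add /dev/sdb1
# mdadm /dev/md127 --add /dev/sdX1
# mdadm /dev/md127 --replace /dev/sdb1 --with /dev/sdX1

That is: bring the stale but still mostly readable member back in (assuming the internal bitmap still allows a --re-add), give the array a healthy spare, and let --replace copy onto the spare before failing the old drive.
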
Now, while the RAID was rebuilding, a second drive started to cough up SMART warnings.
This time I was watching the log and could see it happen, but maybe I'd be lucky and it wouldn't die before the rebuild was complete.

Well, did it fail? I don't know... SMART says:

[...]
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], 296 Currently unreadable (pending) sectors (changed +120)
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], 296 Offline uncorrectable sectors (changed +120)
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 56
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 82 to 77
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 62
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 39 to 38
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 197 Current_Pending_Sector changed from 98 to 97
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 198 Offline_Uncorrectable changed from 98 to 97
Jul 24 19:34:46 Butler smartd[380]: Device: /dev/sdd [SAT], ATA error count increased from 104 to 151
Jul 24 20:04:44 Butler smartd[380]: Device: /dev/sdd [SAT], 296 Currently unreadable (pending) sectors
Jul 24 20:04:44 Butler smartd[380]: Device: /dev/sdd [SAT], 296 Offline uncorrectable sectors
Jul 24 20:04:44 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 56 to 75
Jul 24 20:04:44 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 77 to 76
Jul 24 20:04:44 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 61
Jul 24 20:04:44 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 38 to 39
Jul 24 20:04:44 Butler smartd[380]: Device: /dev/sdd [SAT], ATA error count increased from 151 to 156
Jul 24 20:34:44 Butler smartd[380]: Device: /dev/sdd [SAT], 360 Currently unreadable (pending) sectors (changed +64)
Jul 24 20:34:44 Butler smartd[380]: Device: /dev/sdd [SAT], 360 Offline uncorrectable sectors (changed +64)
[...]
Jul 25 00:05:04 Butler smartd[380]: Device: /dev/sdd [SAT], ATA error count increased from 455 to 627
Jul 25 00:34:54 Butler smartd[380]: Device: /dev/sdd [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Jul 25 00:35:04 Butler smartd[380]: Device: /dev/sdd [SAT], 1056 Currently unreadable (pending) sectors (changed +240)
Jul 25 00:35:04 Butler smartd[380]: Device: /dev/sdd [SAT], 1056 Offline uncorrectable sectors (changed +240)
Jul 25 00:35:04 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 45
Jul 25 00:35:04 Butler smartd[380]: Device: /dev/sdd [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Jul 25 00:35:04 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 14 to 1
Jul 25 00:35:04 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 197 Current_Pending_Sector changed from 91 to 88
Jul 25 00:35:04 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 198 Offline_Uncorrectable changed from 91 to 88
Jul 25 00:35:33 Butler smartd[380]: Device: /dev/sdd [SAT], ATA error count increased from 627 to 831
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], 1032 Currently unreadable (pending) sectors (changed -24)
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], 1032 Offline uncorrectable sectors (changed -24)
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], Failed SMART usage Attribute: 1 Raw_Read_Error_Rate.
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 45 to 43
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 66
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 34
Jul 25 01:04:54 Butler smartd[380]: Device: /dev/sdd [SAT], ATA error count increased from 831 to 943
[...]

At these values it started to stagnate.
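
(For the full picture I could also dump the complete attribute table and self-test log directly, with something like

# smartctl -x /dev/sdd

but the smartd excerpts above already show the trend.)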

But as far as I can see, mdadm hasn't complained that the rebuild failed; instead, it looks like it simply did nothing:

# mdadm --misc --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Sep 19 18:50:34 2013
        Raid Level : raid5
        Array Size : 11720659200 (11177.69 GiB 12001.96 GB)
     Used Dev Size : 3906886400 (3725.90 GiB 4000.65 GB)
      Raid Devices : 4
     Total Devices : 5
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Jul 25 00:47:19 2021
             State : clean, degraded
    Active Devices : 3
   Working Devices : 5
    Failed Devices : 0
     Spare Devices : 2

            Layout : left-symmetric
        Chunk Size : 256K

Consistency Policy : bitmap

              Name : <redacted>
              UUID : <redacted>
            Events : 485548

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       7       8       49        1      active sync   /dev/sdd1
       4       8       33        2      active sync   /dev/sdc1
       5       8       65        3      active sync   /dev/sde1

       6       8       17        -      spare   /dev/sdb1
       8       8       81        -      spare   /dev/sdf1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdf1[8](S) sdb1[6](S) sdd1[7] sdc1[4] sde1[5]
      11720659200 blocks super 1.2 level 5, 256k chunk, algorithm 2 [4/3] [_UUU]
      bitmap: 6/15 pages [24KB], 131072KB chunk

unused devices: <none>

So, adding the spares prompted mdadm to start a recovery, during which a second drive started to deteriorate.
Now the process seems to have stopped, but the spare was not added to the array.
dmesg has a bunch of these:

[33162.250480] ata4.00: exception Emask 0x0 SAct 0x80000 SErr 0x0 action 0x0
[33162.250496] ata4.00: irq_stat 0x40000008
[33162.250503] ata4.00: failed command: READ FPDMA QUEUED
[33162.250507] ata4.00: cmd 60/08:98:98:87:75/00:00:a7:01:00/40 tag 19 ncq dma 4096 in
                        res 43/40:08:98:87:75/00:00:a7:01:00/00 Emask 0x409 (media error) <F>
[33162.250524] ata4.00: status: { DRDY SENSE ERR }
[33162.250530] ata4.00: error: { UNC }
[33162.253455] ata4.00: configured for UDMA/133
[33162.253469] sd 3:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
[33162.253474] sd 3:0:0:0: [sdd] tag#19 Sense Key : Medium Error [current]
[33162.253477] sd 3:0:0:0: [sdd] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[33162.253482] sd 3:0:0:0: [sdd] tag#19 CDB: Read(16) 88 00 <redacted> 08 00 00
[33162.253485] blk_update_request: I/O error, dev sdd, sector 7104464792 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[33162.253497] md/raid:md127: read error not correctable (sector 7104462744 on sdd1).
[33162.253513] ata4: EH complete

But I can still read data from the array without error (I carefully tried to read one file and performed one search with find).
So the drive (/dev/sdd, the one that started throwing errors during the recovery) isn't completely dead yet; it's just accumulating dead sectors.
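
(The careful test was essentially just this, with placeholder paths:

# md5sum /mnt/array/some/large/file
# find /mnt/array -name 'some-pattern'

i.e. one full sequential read of a file plus one walk of the directory tree.)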

I have replaced drives before, and IIRC simply adding a spare was sufficient in those cases: mdadm should automatically promote the spare to a regular active drive after recovery.
Here is what I tried next:

# mdadm /dev/md127 --replace /dev/sdd1 --with /dev/sdb1
mdadm: Marked /dev/sdd1 (device 1 in /dev/md127) for replacement
mdadm: Failed to set /dev/sdb1 as preferred replacement.

# mdadm --add /dev/md127 /dev/sdb1
mdadm: Cannot open /dev/sdb1: Device or resource busy

No changes to the output of mdadm --detail. (The "Device or resource busy" error presumably comes from /dev/sdb1 already being attached to md127 as a spare.)

So, what do I make of this behaviour?
Can I find more logs from mdadm other than the messages in dmesg? The only other place I've thought of is the journal (sketch below).
I don't want to walk into another mistake, so before I start copying as much data off the array as is still possible (which might degrade the drive further), I would welcome some advice.
And if push comes to shove, I hope this record will at least be of help to others in the future.
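
The journal check I have in mind is something like this (assuming the mdmonitor unit is what runs mdadm --monitor here, which I'd still have to verify):

# journalctl -k | grep -i 'md127\|raid'
# journalctl -u mdmonitor.service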

And for the future: I'm pretty sure that the drive that failed first had been removed for several days at least, so it was out of sync... would it even have been possible to re-add it to the array for use with the --replace option?


#2 2021-07-28 08:02:25

lamarpavel
Member
Registered: 2015-10-18
Posts: 48

Re: Spare device of degraded RAID5 failed to set as replacement

I've done some more searching and the best lead came from the answers here: https://superuser.com/questions/429776/ … ting-spare

Yes, this thread is about RAID1 and not 5, but the relevant parts seem to apply to mdadm in general, namely:

[...] I think your problem is, that the disk which is still in the md array has read errors, so adding a second disk fails.

This does match my situation insofar as one of my active disks started throwing errors during the recovery operation (specifically, a rising count of Currently unreadable (pending) sectors).

Still not sure what the best course of action is now. I assume mdadm aborted the resync when it encountered the first error.

The drive I removed first will likely not have read errors in the exact same sectors, so theoretically it could help with recovery, but since I don't know how long it had been removed from the array (likely between 0 and 2 days) and write operations must have occurred in the meantime, it could be disastrous to re-add it to the array... right?
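
(One thing I can check before deciding is how far out of date that drive actually is, by comparing its superblock event counter with the array's; something like the following, where the device letter is whatever the drive gets once plugged back in:

# mdadm --examine /dev/sdX1 | grep -i events
# mdadm --detail /dev/md127 | grep -i events

If the counts are close and the internal bitmap still covers the gap, a --re-add should only have to sync the blocks that changed in the meantime.)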

