Help restoring raid array

fireout · 2014-09-01 20:51:54

Hello,

I want to recover my RAID5 array. /dev/md* is not showing up, and mdadm --examine /dev/sdb (my array was on drives d, e, f, g) shows:

/dev/sdb:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

Investigating further, I found that smartctl -a /dev/sd* prints this:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.7-1-ARCH] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WMC4M2192751
LU WWN Device Id: 5 0014ee 6043e6510
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep  1 20:55:39 2014 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (26280) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 266) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   174   021    Pre-fail  Always       -       3250
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       54
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3164
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   113   103   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 198 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 198 occurred at disk power-on lifetime: 3164 hours (131 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:02:26.516  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.515  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:02:26.515  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:02:26.514  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.513  IDENTIFY DEVICE

Error 197 occurred at disk power-on lifetime: 3164 hours (131 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:02:26.515  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:02:26.514  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.513  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      00:02:26.493  SET FEATURES [Enable SATA feature]

Error 196 occurred at disk power-on lifetime: 3164 hours (131 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:02:26.514  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.513  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      00:02:26.493  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.493  IDENTIFY DEVICE

Error 195 occurred at disk power-on lifetime: 3164 hours (131 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      00:02:26.493  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.493  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:02:26.492  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:02:26.492  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.491  IDENTIFY DEVICE

Error 194 occurred at disk power-on lifetime: 3164 hours (131 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      00:02:26.492  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      00:02:26.492  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      00:02:26.491  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      00:02:26.471  SET FEATURES [Enable SATA feature]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%        19         -
# 2  Short offline       Interrupted (host reset)      10%         6         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Can "device fault" mean my disk is dead? Or do I have any chances of restoring anything?
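For reference, the ST (status register) value 61 in the error log above can be decoded bit by bit. The following is a minimal sketch, assuming the standard ATA status bit layout; the function name decode_ata_status is made up for illustration and is not part of smartctl.

```shell
#!/bin/sh
# Decode an ATA status register byte (the ST column of the SMART error log)
# into flag names. Bit masks follow the standard ATA status register layout;
# decode_ata_status is a hypothetical helper, not a smartctl feature.
decode_ata_status() {
    st=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    flags=""
    for pair in 0x80:BSY 0x40:DRDY 0x20:DF 0x08:DRQ 0x01:ERR; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( $st & $mask )) -ne 0 ]; then
            flags="$flags $name"
        fi
    done
    echo "${flags# }"     # strip the leading space
}

decode_ata_status 0x61    # ST value from the log above; prints: DRDY DF ERR
```

The DF bit is what smartctl reports as "Device Fault", and the ER value 04 is the ABRT (command aborted) bit. This only says the drive itself flagged a fault while handling IDENTIFY DEVICE / SET FEATURES; on its own it does not tell you whether the data on the platters is still readable.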

Can the smartctl self-test feature recover errors?

partprobe is giving me

Error: Invalid argument during seek for read on /dev/sdb

gdisk says my GPT partition table is damaged...

Not sure where to go from here; can anyone help, please?

Last edited by fireout (2014-09-01 23:47:55)

fukawi2 · 2014-09-01 23:14:34

Please do not bump your thread when no-one else has replied -- use the edit feature to add information to your first post.

You cannot recover a RAID5 with 1 disk. I'm not sure exactly what those errors in your SMART test are indicating, but they're not normal IME.

fireout · 2014-09-01 23:49:26

All 4 drives show up in /dev (sd[bdef]). It's just the superblock (I guess) that cannot be read, as my /dev/md is not listed.

Sorry about the bump, I updated my first post; you can remove the subsequent posts if needed.

fukawi2 · 2014-09-01 23:53:15

So you need to assemble the array with that drive missing?

Have you read through this and this?

fireout · 2014-09-02 00:19:28

Again, all the drives seem to be available. I'm guessing one or more of them might be corrupted or something (according to the smartctl output).

I did look into those, but I keep backing off the "recreating the array" step; I would like to get as many details as I can before I write anything to the disks, maybe something from the long smartctl test, but there are still four-and-something hours to go before that completes...

fukawi2 · 2014-09-02 04:26:44

fireout wrote:

keep backing off the "recreating the array" step

Recreating is not the same as Reassembling. What is this array from? What actually happened to lead you to need to recover the array? Is there any information in /proc/mdstat?

fireout · 2014-09-02 10:09:43

/proc/mdstat is not even there, although I booted off the ArchISO; not sure it should show up with the install ISO... My /var was mounted on the array, so I could not boot otherwise; I changed my fstab but haven't rebooted yet.

A power failure is the cause of all this: the power cord wasn't pushed in all the way and got knocked out.

fireout · 2014-09-02 22:22:20

Any other input before I resort to data retrieval services (and later, plan a backup strategy)?

Looking at the drives further, I found out that one drive is indeed defective/not detected. No partitions are present on any of the drives (/dev/sdb shows up but not /dev/sdb1).

This array was created by the Intel RAID chipset; can I recreate it without losing data even if one drive is missing?

Thanks

fukawi2 · 2014-09-02 22:56:39

fireout wrote:

This array was created by the intel raid chipset, can I recreate it without losing data even if one drive is missing?

Post the output of:

fdisk -l /dev/sd[bdef]

fireout · 2014-09-02 23:07:14

My partition table was GPT. fdisk shows:

fdisk -l /dev/sdb

Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x00000000
Partition 1 does not start on physical sector boundary.

Device    Boot Start        End     Blocks  Id System
/dev/sdb1          1 4294967295 2147483647+ ee GPT

and for the two other drives:

Disk /dev/sdd: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

gdisk shows:

gdisk -l /dev/sdb
GPT fdisk (gdisk) version 0.8.10

Warning! Disk size is smaller than the main header indicates! Loading
secondary header from the last sector of the disk! You should use 'v' to
verify disk integrity, and perhaps options on the experts' menu to repair
the disk.
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

Warning! One or more CRCs don't match. You should repair the disk!

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: damaged

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
Disk /dev/sdb: 3907029168 sectors, 1.8 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): F486503E-7A80-46A7-A351-9BF0686221E9
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 11720765406
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     11720765406   5.5 TiB     FD00  Linux RAID
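A quick sanity check on the numbers in the fdisk and gdisk output above may help explain the "disk size is smaller than the main header indicates" warning. This is only a sketch: the two values are copied from the output, and disks_spanned is a made-up helper name.

```shell
#!/bin/sh
# Values copied from the fdisk/gdisk output above.
disk_sectors=3907029168       # sectors on /dev/sdb
gpt_last_usable=11720765406   # "last usable sector" in the GPT header

# Print how many disks' worth of sectors the GPT spans, rounded to the
# nearest integer. Assumes the shell does 64-bit integer arithmetic
# (true for bash and dash on x86_64). disks_spanned is a hypothetical helper.
disks_spanned() {
    echo $(( ($2 + $1 / 2) / $1 ))
}

disks_spanned "$disk_sectors" "$gpt_last_usable"   # prints: 3
```

A four-disk RAID5 exposes three disks' worth of capacity, so this is consistent with the GPT on /dev/sdb describing the whole 5.5 TiB array volume rather than the single 1.8 TiB disk. If so, gdisk's complaints about the table are expected when it is pointed at one member, and do not by themselves prove the table is corrupt.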

fukawi2 · 2014-09-03 01:32:18

Yeah OK, I have no idea what's going on there. Sorry.

justin-8 · 2014-09-09 10:17:19

Can you show the output for gdisk -l on each disk, not just sdb, and also the output of:

mdadm --misc --examine /dev/sd[bdef]

and also:

for i in b d e f; do smartctl -a /dev/sd$i; done

Hopefully it's just one disk, but I have seen disks update out of sync before due to a failed SATA cable, and I assume the same could happen from a sudden loss of power. If that is the case, it is possible to lower the event count on the higher drive(s) and force the array to re-assemble. But hopefully I can help from the output of those commands.
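To make the event count comparison easier to read, the examine output can be reduced to one line per device. A rough sketch, assuming the members carry native md superblocks whose examine output contains an "Events : N" line; Intel firmware RAID metadata may print something different, and earlier in the thread mdadm --examine on /dev/sdb found only an MBR. The function name list_event_counts and the sample numbers are invented for illustration.

```shell
#!/bin/sh
# Reduce `mdadm --examine` output (read from stdin) to "<device> <events>".
# list_event_counts is a hypothetical helper; it only reads text and never
# touches the disks.
list_event_counts() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }          # "/dev/sdb:" header
        $1 == "Events" && $2 == ":" { print dev, $3 }        # "Events : 100"
    '
}

# Demo with made-up values (NOT taken from this thread):
list_event_counts <<'EOF'
/dev/sdb:
         Events : 100
/dev/sdd:
         Events : 98
EOF
```

On the real system this would be fed from the command above, for example: mdadm --misc --examine /dev/sd[bdef] | list_event_counts. Members whose count lags behind the others are the ones that stopped being updated first.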
