#1 2013-02-26 22:43:58

Ovion
Member
Registered: 2013-02-24
Posts: 78

RAID1 - harddisk partly dropped out

Hi all,
I'm puzzled by the behavior of my RAID1, because one of the hard disks partly dropped out of the array.
I have two 1 TB hard disks in my machine. I partitioned both with a 100 MB partition for /boot and a second partition using almost all of the remaining space (in fact, I partitioned one of them and cloned the partition table to the other). At the end of the disks there are a few MB of unused space.
I created a RAID1 via mdadm on these disks, using the two 100 MB partitions for the first RAID device and the two big ones for the second. Then I installed /boot on the 100 MB device and the rest of the system into an encrypted LUKS container on the second RAID device.
Everything worked fine and booted as it should, but recently something strange happened: one of the partitions forming the RAID that holds the encrypted container dropped out of that array, while the /boot RAID is still in perfect shape (at least according to /proc/mdstat). I have rebooted several times since; that changes nothing. The bigger RAID device remains with one disk (the same disk drops out every time), while the /boot RAID still has both partitions. I tried to reassemble, but that doesn't work either.
I'm quite confused at the moment. I assume it can't be a problem with the data on the RAID, because the RAID is the foundation of my configuration; everything was installed onto the RAID device, so in that case I'd expect both partitions to have dropped out. The other possibility is a hardware problem, but because the /boot array is still working fine I'm fairly sure the disk itself works. I can write to /boot without triggering problems; I even had a kernel update in the meantime, which is written to /boot. I also ran a smartctl test on the affected disk. The results seem fine to me, but I have to admit I'm not experienced in this matter (I've never had a defective disk before), so I'll put the results below.
I would be glad if someone could help me figure out why just one of the two partitions dropped out of my RAID and what's best to do now.
Both RAID devices were definitely working before (according to /proc/mdstat) and I haven't changed anything in their configuration. The only thing I did was enable SMART support, because I noticed it was disabled for some reason; but as far as I remember that wasn't coincident with the problem described, and besides, it should have affected both RAID devices.

Thanks in advance!
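For context, the setup I describe corresponds roughly to these commands (a sketch of the layout, not my literal command history; options and device names are assumed):

```shell
# Sketch of the described layout -- illustrative, not the original commands.
# Clone the partition table from the first disk to the second:
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Small mirror for /boot, big mirror for the rest of the system:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5

# Encrypted LUKS container on top of the big mirror:
cryptsetup luksFormat /dev/md1
cryptsetup luksOpen /dev/md1 cryptroot
```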

smartctl -t long /dev/sdb
#waited
smartctl -a /dev/sdb > down-below
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.7.9-1-ARCH] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Black
Device Model:     WDC WD1002FAEX-00Z3A0
Serial Number:    WD-WCATR3167936
LU WWN Device Id: 5 0014ee 204f24ec5
Firmware Version: 05.01D05
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 6.0 Gb/s
Local Time is:    Tue Feb 26 23:31:02 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(17280) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 200) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3037)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   169   021    Pre-fail  Always       -       4125
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1455
  5 Reallocated_Sector_Ct   0x0033   159   159   140    Pre-fail  Always       -       324
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6749
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1411
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       50
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1404
194 Temperature_Celsius     0x0022   117   106   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       273
197 Current_Pending_Sector  0x0032   200   199   000    Old_age   Always       -       111
198 Offline_Uncorrectable   0x0030   200   199   000    Old_age   Offline      -       51
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   194   183   000    Old_age   Offline      -       1224

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


#2 2013-02-26 22:58:54

jasonwryan
Forum & Wiki Admin
From: .nz
Registered: 2009-05-09
Posts: 18,762
Website

Re: RAID1 - harddisk partly dropped out

Welcome to the boards.

What does "partly fell out of the array" mean? If /proc/mdstat and the mdadm utilities are reporting a healthy array then I am not sure what the issue you are describing is.


Arch + dwm   •   Mercurial repos  •   Github

Registered Linux User #482438


#3 2013-02-26 23:23:03

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

Ok, I'll try to make my problem clearer: /proc/mdstat reports that my bigger array consists of just one disk when it should have two; one disk is missing there. Only the small array used for /boot is completely healthy (I only have it because I need an unencrypted /boot to boot from). What confuses me is that the small partition of the affected hard disk is still well integrated into its array, while the big partition of the same physical disk shows no interest in returning to the array it belongs to.
Does that make my problem clearer?

# cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sda1[0] sdb1[2]
      96320 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sda5[0] # sdb5 is missing here
      976444032 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>

sda1 and sda5, as well as sdb1 and sdb5, are on the same physical disks (sda and sdb, surprising no one).
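The missing member can also be spotted mechanically from the status string: an underscore in the [UU] field means a slot is empty. For example, parsing a saved copy of the output above:

```shell
# Save a copy of the /proc/mdstat output shown above and report every
# array whose status field ([UU], [U_], ...) contains an underscore,
# i.e. a missing member.
cat > /tmp/mdstat.sample <<'EOF'
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[2]
      96320 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sda5[0]
      976444032 blocks super 1.2 [2/1] [U_]

unused devices: <none>
EOF
awk '/^md/ {name=$1} /\[[U_]+\]$/ && /_/ {print name " is degraded"}' /tmp/mdstat.sample
# prints: md1 is degraded
```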

Edit:

jasonwryan wrote:

Welcome to the boards.

Thanks! smile

Edit:
Just for completeness: sdb5 is showing up in /dev but not in /proc/mdstat.

I hope I have delivered all necessary info now.

Last edited by Ovion (2013-02-26 23:32:39)


#4 2013-02-26 23:49:44

jasonwryan
Forum & Wiki Admin
From: .nz
Registered: 2009-05-09
Posts: 18,762
Website

Re: RAID1 - harddisk partly dropped out

OK: so the second array is degraded.

You can use mdadm to get a better view of the status of that array, but I would be thinking about dropping another drive into the array to ensure that you don't lose any data.



#5 2013-02-27 10:11:37

JGC
Developer
Registered: 2003-12-03
Posts: 1,659

Re: RAID1 - harddisk partly dropped out

Your hard disk has bad sectors; you should replace it.

Anyway, your RAID setup consists of two RAID arrays. The first array, containing /boot, is not degraded because the disk doesn't have bad sectors in that region. Unless you remove the disk, or it develops bad sectors there too, that array will stay fine. The second array contains all your data and has been degraded due to bad sectors.


#6 2013-02-27 13:40:04

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

Ok, that sounds logical. I'm running a badblocks test on sdb5 at the moment; I'll report the result.

Thanks to both of you!

There is one more thing that confuses me. I did a

mdadm --examine /dev/sd?5

The result is interesting:

#/dev/sda5, which is working according to /proc/mdstat
Device Role : Active device 0
Array State : A. ('A' == active, '.' == missing)

#/dev/sdb5, which doesn't work according to /proc/mdstat
Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing)

Why is sdb5 marked as fully active, while sda5 is marked as only half active? Or does the dot in sda5's output refer to sdb5, since that's the second device? But assuming that, why is sdb5 still marked as fully active?
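If I understand the metadata correctly, a possible explanation is that sdb5's superblock is simply stale: it was last written while both members were still active, so it still records "AA", while sda5's copy was updated after the drop. The event counters should show which copy is current; a sketch with an assumed value for the dropped member:

```shell
# Compare the Events counters of the two member superblocks. The member
# with the lower count holds stale metadata, and its "Array State" only
# reflects the moment it was last updated.
# Values below are assumed for illustration; on the real system read them with:
#   mdadm --examine /dev/sda5 /dev/sdb5 | grep -E 'Events|Array State'
events_sda5=426133   # counter of the member still in the array
events_sdb5=425800   # hypothetical counter of the dropped member
if [ "$events_sda5" -gt "$events_sdb5" ]; then
    echo "sdb5 metadata is stale"
else
    echo "sda5 metadata is stale"
fi
# prints: sdb5 metadata is stale
```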


#7 2013-02-27 17:20:00

jasonwryan
Forum & Wiki Admin
From: .nz
Registered: 2009-05-09
Posts: 18,762
Website

Re: RAID1 - harddisk partly dropped out

What does mdadm -D /dev/md1 show?



#8 2013-02-27 17:45:58

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

# mdadm -D /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Mon Aug  6 20:58:12 2012
     Raid Level : raid1
     Array Size : 976444032 (931.21 GiB 999.88 GB)
  Used Dev Size : 976444032 (931.21 GiB 999.88 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Wed Feb 27 18:39:28 2013
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : archiso:1
           UUID : 8791071e:1cfafc62:df05c7a0:1874216f
         Events : 426133

    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       0        0        1      removed

Does "removed" mean that I removed it manually, or is a failed device also marked as removed?

By the way: I have

badblocks -nsv /dev/sdb5

running in the background; judging by the reported time and percentage, I expect the result in roughly a week. I don't know whether that's a usual duration for checking 931 GB.


#9 2013-02-28 06:58:23

jasonwryan
Forum & Wiki Admin
From: .nz
Registered: 2009-05-09
Posts: 18,762
Website

Re: RAID1 - harddisk partly dropped out

You could try re-adding the drive to the array, but even if that works, this sort of thing is usually an indication that b0rkage is imminent.

Raid arrays tend to stress drives a little more; but they have the upside (in Raid 1, in your case) that you can just swap in a new drive, resync, and not lose any data.



#10 2013-03-08 21:38:07

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

I'm very sorry for responding so late.

Unfortunately my badblocks scan was discourteously interrupted by a kernel panic, but I did see it reporting bad blocks earlier. Is there a way to rebuild the RAID while ignoring the bad blocks, in case the remaining unused space on the disk is big enough for that?

jasonwryan wrote:

but they have the upside (in Raid 1, in your case) of meaning that you can just swap in a new drive, resynch and not lose any data.

Indeed, that's the reason I chose to build a RAID1. smile


#11 2013-03-08 21:53:20

jasonwryan
Forum & Wiki Admin
From: .nz
Registered: 2009-05-09
Posts: 18,762
Website

Re: RAID1 - harddisk partly dropped out

I wouldn't fsck around with it. Remove the faulty drive, insert a new one, resync and go on about your business...
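Roughly like this (a sketch; device names assumed to match your layout, so double-check before running anything destructive):

```shell
# Sketch of the drive swap -- device names assumed from this thread.
# Mark the failing members as faulty and remove them (where still listed):
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# Power off, swap in the new disk, then copy the partition table over
# from the healthy disk:
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Add the new partitions back; the kernel resyncs automatically:
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb5

# Watch the resync:
cat /proc/mdstat
```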



#12 2013-03-08 22:54:12

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

Yeah, probably the best idea. But since this is the first time I've had a faulty drive, I'd really like to play around a little and test the possibilities. wink
So, you mentioned fsck, but doesn't that operate on the filesystem, which sits on top of the RAID in this case? Could fsck nevertheless solve this issue?

Edit: Backup is made, that won't be a problem.

Last edited by Ovion (2013-03-08 22:54:43)


#13 2013-03-09 01:45:47

jasonwryan
Forum & Wiki Admin
From: .nz
Registered: 2009-05-09
Posts: 18,762
Website

Re: RAID1 - harddisk partly dropped out

Ovion wrote:

So, you mentioned fsck, but doesn't that operate on the filesystem, which sits on top of the RAID in this case? Could fsck nevertheless solve this issue?

No, it sounds like it is beyond that. Also, I was using it mostly as a play on words, not literally: s/s/u/ smile



#14 2013-03-12 20:45:10

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

Ok, thanks for help!

Isn't there a way to make the hard drive simply ignore the faulty sectors?
Anyway: how do faulty sectors arise? Production errors that only show up after a delay?


#15 2013-03-12 21:51:17

jasonwryan
Forum & Wiki Admin
From: .nz
Registered: 2009-05-09
Posts: 18,762
Website

Re: RAID1 - harddisk partly dropped out

Ovion wrote:

Isn't there a way to make the hard drive simply ignore the faulty sectors?
Anyway: how do faulty sectors arise? Production errors that only show up after a delay?


Not that I am aware of; and even if there were, I wouldn't recommend it. Drives are cheap, replacing your data is not...


I have been running Raid 1 on both my desktop and server for the last four years; in that time I have gone through at least four* failed 1 TB drives.
Raid seems definitely more demanding on drives, but then I don't regret replacing any of them, for the reason above.



* The original two were from the first batch of Seagate 1 TB drives that were subsequently found to have an inordinately high failure rate; that may have been a contributing factor.



#16 2013-03-12 23:33:50

Strike0
Member
From: Germany
Registered: 2011-09-05
Posts: 1,277

Re: RAID1 - harddisk partly dropped out

Ovion wrote:

Isn't there a possibility to make the harddrive just ignore the faulty sectors?

If you rebuild the bad array on the same drive, the bad sectors should automatically be ignored. As I understand it (please correct me if I'm wrong), SATA drives 'swap out' bad sectors after they fail checksums, replacing them with spare ones (a drive feature for exactly this purpose). I have no idea why the automatic swapping out of the bad apples led to the degraded state in your case. So, faulty sectors are generally ignored; maybe there are just too many of them. You should really consider the advice above. In your output:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   159   159   140    Pre-fail  Always       -       324
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       273
197 Current_Pending_Sector  0x0032   200   199   000    Old_age   Always       -       111
198 Offline_Uncorrectable   0x0030   200   199   000    Old_age   Offline      -       51

ID #5 shows the bad sectors already replaced from the drive's internal pool of spare sectors.
ID #197 shows the number of sectors pending to be marked good/bad (this should be zero! Maybe that's the problem relating to the degraded state, or ID #198).

edit: Note that for ID 5 the drive is already close to the threshold, and it is only going to get worse. If anything, use it for data you won't miss.
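To pull just those attributes out of a saved smartctl report and flag non-zero raw values (the sample lines are copied from your output):

```shell
# Flag reallocation-related SMART attributes with a non-zero raw value.
# Sample lines saved from the smartctl report earlier in the thread.
cat > /tmp/smart.sample <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   159   159   140    Pre-fail  Always       -       324
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       273
197 Current_Pending_Sector  0x0032   200   199   000    Old_age   Always       -       111
198 Offline_Uncorrectable   0x0030   200   199   000    Old_age   Offline      -       51
EOF
awk '$NF > 0 {print $2 ": raw value " $NF}' /tmp/smart.sample
# prints one warning line per attribute, e.g. "Current_Pending_Sector: raw value 111"
```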

Last edited by Strike0 (2013-03-12 23:41:19)


#17 2013-03-30 11:54:28

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

Ok, thanks! I will definitely replace this disk before I rely on that array again; that was clear even before I asked about "repairing" it. Thanks for the help.

When there is a sector failure, are the failed sectors independent of one another, or can a sector failure behave like a virus, spreading on its own once there is a failure somewhere on the disk?
I'm fairly sure it's the former, but I'd like to make sure, since I'm not that deep into the topic.


#18 2013-03-30 23:25:28

Strike0
Member
From: Germany
Registered: 2011-09-05
Posts: 1,277

Re: RAID1 - harddisk partly dropped out

Not if you mean viral as in infectious. I'm not an engineer either, but I picture it like material imperfections in the magnetic surface of parts of the disk, which may become increasingly evident with wear and tear. Further, it is easy to imagine that such material defects affect an area, so that sectors close to the first erroneous one are more likely to fail next; but they need not. I have a disk, for example, that developed a couple of bad sectors very early, but that number has not changed for a year now (knock on wood). When a drive develops another type of physical defect (e.g. in the spindle, or whatever), that may of course have viral effects.

The reason drives have those spare sectors to swap in is at least twofold: firstly, the error correction itself; secondly, optimising the production process and return rate.


#19 2013-03-31 19:28:59

Ovion
Member
Registered: 2013-02-24
Posts: 78

Re: RAID1 - harddisk partly dropped out

Ok, that's almost exactly what I assumed, thanks!

