Hard drive failure?

jjb2016 · 2016-05-28 01:12:43

Hi

I have a hard drive which I think is failing. I tried to run a SMART self-test in gnome-disks but the self-test failed. Before I tried to run the self-test I noticed it was showing 32 bad sectors.

Sound like a faulty drive? Any other way of checking the drive health?

TheChickenMan · 2016-05-28 03:45:55

"A few" bad sectors don't necessarily mean anything bad but combined with a failed SMART you probably should think about replacement. At least make sure your backup is current.

mich41 · 2016-05-28 09:21:42

Self-test failure is a result of bad sectors.

Well, bad sectors are bad sectors. Some result from one-time failures (sudden power loss, vibration, cosmic rays, bad voodoo spells), others from wearout of the disk machinery. The only sure way to tell them apart is to work the disk a bit and see if their number grows.

I found that unpacking 200 copies of the kernel source tarball is an effective way to detect disks prone to developing new bad sectors.
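For anyone who wants something similar without a kernel tarball at hand, here is a minimal Python sketch of the same idea: write a lot of data, read it back, and compare checksums. It is a stand-in, not the exact method described above. The function name `write_and_verify`, the file names, and the default sizes are made up for illustration.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, copies=200, size_bytes=1 << 20):
    """Write `copies` files of random data, read them back, compare SHA-256.

    Returns the names of files that could not be read or came back changed.
    """
    expected = {}
    for i in range(copies):
        data = os.urandom(size_bytes)
        name = os.path.join(directory, f"chunk_{i:04d}.bin")
        with open(name, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to flush this file to the device
        expected[name] = hashlib.sha256(data).hexdigest()

    bad = []
    for name, digest in expected.items():
        try:
            with open(name, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            # An I/O error on read is what an unreadable sector looks like here.
            bad.append(name)
            continue
        if actual != digest:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Small demo in a temporary directory; point `directory` at the
    # suspect disk's mount point for a real run.
    with tempfile.TemporaryDirectory() as d:
        print(write_and_verify(d, copies=5, size_bytes=64 * 1024))
```

Keep in mind that the read-back may be served from the page cache rather than the platters, so for a meaningful run the total amount written should be well above the installed RAM. Compare the SMART counters before and after.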

You may want to back up your data since the disk may get worse, but if you like adventure feel free to test and possibly keep using it. I use a disk which had one bad sector once. I own another one which develops a few bad sectors per gigabyte written. I know one guy who uses a disk which suffered a head crash, but still works except for the few gigabytes of platter permanently damaged by the crash. YMMV.

Can you post the full output of smartctl -a?

Last edited by mich41 (2016-05-28 09:25:38)

jjb2016 · 2016-05-28 10:26:59

Here is the smartctl output ...

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.5.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2503ABYZ-011FA0
Serial Number:    WD-WMAYP0DPA70X
LU WWN Device Id: 5 0014ee 0593454fc
Firmware Version: 01.01S03
User Capacity:    251,059,544,064 bytes [251 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat May 28 11:13:09 2016 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		( 3900) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  42) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   136   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   155   139   021    Pre-fail  Always       -       3208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       213
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10269
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       194
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       163
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       59
194 Temperature_Celsius     0x0022   109   094   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       36
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       36

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     10269         126309139
# 2  Short offline       Completed: read failure       90%     10268         126309139
# 3  Conveyance offline  Completed: read failure       90%     10260         126309139
# 4  Short offline       Completed: read failure       90%     10260         126309139
# 5  Extended offline    Completed: read failure       90%     10260         126309139
# 6  Short offline       Completed without error       00%      6600         -
# 7  Short offline       Completed without error       00%      1492         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Last edited by jjb2016 (2016-05-28 11:07:05)

WorMzy · 2016-05-28 10:37:08

Please use code tags when pasting terminal output; it retains the formatting and makes it easier to scroll through.

https://wiki.archlinux.org/index.php/Fo … s_and_code

Ropid · 2016-05-28 10:58:41

Other than the self-test errors, the "Current_Pending_Sector" entry in the table is worrying. Those are spots the drive can't read. The moment you write to one of those, the drive should move that sector into its spare area. The "Reallocated_Event_Count" will then go up and the "Current_Pending_Sector" count will go down.

You shouldn't use this drive anymore.

You could keep it around for fun and track how the counts develop over time.
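If you do want to track the counts, a minimal Python sketch along these lines could compare two saved copies of the `smartctl -a` output. The first snapshot uses rows copied from the output posted in this thread. The second snapshot is invented purely to show what a change would look like. The names `parse_attributes`, `changes` and `WATCHED` are made up for this example.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(smartctl_output):
    """Return {attribute_name: raw_value} from the attribute table."""
    attrs = {}
    in_table = False
    for line in smartctl_output.splitlines():
        if line.startswith("ID# ATTRIBUTE_NAME"):
            in_table = True
            continue
        if in_table:
            fields = line.split()
            if len(fields) < 10 or not fields[0].isdigit():
                break  # a blank or non-attribute line ends the table
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attrs


def changes(before, after, names=WATCHED):
    """Return {name: delta} for watched attributes whose raw value changed."""
    return {
        n: after[n] - before[n]
        for n in names
        if n in before and n in after and after[n] != before[n]
    }


HEADER = "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"

# Rows copied from the smartctl output posted in this thread.
FIRST = HEADER + """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       36
"""

# HYPOTHETICAL later snapshot: these numbers are invented for illustration.
SECOND = HEADER + """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       5
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       5
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       26
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       36
"""

if __name__ == "__main__":
    print(changes(parse_attributes(FIRST), parse_attributes(SECOND)))
```

In practice you would save the output to dated files (for example from a cron job) and feed two of them to `parse_attributes`. A steadily growing reallocated or pending count is the sign to watch for.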

jjb2016 · 2016-05-28 11:10:26

Disappointing! This is a Western Digital Re enterprise-grade disk. I bought it thinking it would be more reliable than the WD Blue I had before (which also failed).

I'm going to dd zeros to the disk to try and force reallocation of bad sectors ...

graysky · 2016-05-28 11:18:20

You can test for and flag bad sectors, but in my experience more bad sectors are usually detected over time. Best advice is to back up and replace the drive.

e2fsck -cckv /dev/sdxy

You can calculate which partition the first bad sector is on (from its LBA and the partition start sectors), but I haven't done that in years.
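As a rough illustration of that calculation, here is a minimal Python sketch. The LBA is the LBA_of_first_error from the self-test log posted above. The partition layout (names, start sectors, lengths) is entirely hypothetical, since the real partition table was not posted. The 4096-byte filesystem block size is also an assumption. The names `Partition` and `locate` are made up for this example.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    name: str
    start: int    # first sector of the partition (e.g. from `fdisk -l`)
    sectors: int  # partition length in sectors


def locate(lba, partitions, sector_size=512, fs_block_size=4096):
    """Return (partition_name, filesystem_block) for a disk LBA, or None."""
    for p in partitions:
        if p.start <= lba < p.start + p.sectors:
            offset_sectors = lba - p.start
            # Convert the sector offset inside the partition to a
            # filesystem block number.
            block = offset_sectors * sector_size // fs_block_size
            return p.name, block
    return None  # the LBA lies outside every listed partition


# HYPOTHETICAL layout for a 490350672-sector disk; not taken from the thread.
LAYOUT = [
    Partition("sda1", 2048, 1048576),
    Partition("sda2", 1050624, 489300048),
]

FIRST_ERROR_LBA = 126309139  # LBA_of_first_error from the self-test log

if __name__ == "__main__":
    print(locate(FIRST_ERROR_LBA, LAYOUT))
```

With this made-up layout the bad sector falls in the second partition. With your own numbers from `fdisk -l`, the resulting block number is what filesystem tools would need in order to find out which file, if any, sits on that spot.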

EDIT: And yes, 10k hours is really low to be seeing these.

Last edited by graysky (2016-05-28 11:39:43)

WorMzy · 2016-05-28 11:26:45

If the drive is still in its warranty period, you could try RMAing it. I would say it's unacceptable for an enterprise-class drive to go bad so quickly.

graysky · 2016-05-28 11:41:13

From googling the part number, it seems that it has a 5-year replacement warranty.

jjb2016 · 2016-05-28 13:42:52

Created an RMA for it. Never submitted an RMA before - will see what happens.

graysky · 2016-05-28 13:49:25

In my experience you will get a used but "factory certified" unit that will have bad sectors on it within a few months. My comment is based on a single experience.

R00KIE · 2016-05-29 17:55:06

The drive may be perfectly fine. Those pending sectors might not even (and most probably will not) get reallocated once you write over them. They could be caused by vibration or fluctuations in the power supply while the drive is writing.

What might be concerning is that Multi_Zone_Error_Rate is not zero. Although the value is low, my (limited) experience says that once that value starts increasing you may have trouble later on. You might want to run the diagnostic tools provided by Western Digital. They should have (or used to have) a bootable CD image with the tools, which you can run to get a SMART report or error code. If you plan on RMAing the drive you should use that, as I doubt they will do anything if you mention Linux.

jjb2016 · 2016-06-01 10:26:24

Thanks - I'll take a look at the WD diagnosis tool. Do you think high temperature could damage the disk in this way (not talking fire heat or anything)?

Ropid · 2016-06-01 11:25:55

You can do "smartctl -x ..." if you are interested in what happened in the past with the temperature. It will show a graph of the recent temperature history, and there should be an entry somewhere about "lifetime min, max temperature".

R00KIE · 2016-06-01 12:46:09

Drives do have minimum and maximum operating temperatures. Higher operating temperatures impact reliability, but I'd say as long as you haven't exceeded the operational limits you should be fine. If at any point you have exceeded the maximum operating temperature, you will have a "failed in the past" note in the output of smartctl for the temperature attribute.

If you have ever exceeded the maximum operating temperature then I suspect WD might not want to exchange the drive, but you should try anyway.
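A "failed in the past" note like the one mentioned above would appear in the WHEN_FAILED column of the attribute table, which is "-" for every attribute in the output posted earlier. A minimal Python sketch for checking saved `smartctl -a` output is below. The first row is copied from this thread. The second is a hypothetical row invented to show a flagged attribute, and the `In_the_past` spelling is how smartctl prints it as far as I know. The function name `failed_attributes` is made up.

```python
def failed_attributes(smartctl_output):
    """Return {name: when_failed} for attributes whose WHEN_FAILED is not '-'."""
    flagged = {}
    in_table = False
    for line in smartctl_output.splitlines():
        if line.startswith("ID# ATTRIBUTE_NAME"):
            in_table = True
            continue
        if in_table:
            fields = line.split()
            if len(fields) < 10 or not fields[0].isdigit():
                break  # a blank or non-attribute line ends the table
            if fields[8] != "-":  # the ninth column is WHEN_FAILED
                flagged[fields[1]] = fields[8]
    return flagged


HEADER = "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"

# Row copied from the smartctl output posted in this thread.
FROM_THREAD = HEADER + (
    "194 Temperature_Celsius     0x0022   109   094   000    Old_age   Always       -       34\n"
)

# HYPOTHETICAL row, invented to show what a past failure might look like.
HYPOTHETICAL = HEADER + (
    "194 Temperature_Celsius     0x0022   109   040   045    Old_age   Always   In_the_past 34\n"
)

if __name__ == "__main__":
    print(failed_attributes(FROM_THREAD))
    print(failed_attributes(HYPOTHETICAL))
```

For the drive in this thread the result is empty, so nothing in the posted attribute table suggests it was ever flagged for temperature.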

Arch Linux
