Hi
I have a hard drive which I think is failing. I tried to run a SMART self-test in gnome-disks but the self-test failed. Before I ran the self-test I noticed it was showing 32 bad sectors.
Does this sound like a faulty drive? Is there any other way of checking the drive's health?
Offline
"A few" bad sectors don't necessarily mean anything bad but combined with a failed SMART you probably should think about replacement. At least make sure your backup is current.
If quantum mechanics hasn't profoundly shocked you, you haven't understood it yet.
Niels Bohr
Offline
The self-test failure is a result of the bad sectors.
Well, bad sectors are bad sectors. Some result from one-time failures (sudden power loss, vibration, cosmic rays, bad voodoo spells), others from wear of the disk machinery. The only sure way to tell them apart is to work the disk a bit and see if their number grows.
I found that unpacking 200 copies of the kernel source tarball is effective at detecting disks prone to developing new bad sectors.
You may want to back up your data since the disk may get worse, but if you like adventure, feel free to test it and possibly keep using it. I use a disk which had one bad sector once. I own another one which creates a few bad sectors for every gigabyte written. I know one guy who uses a disk which suffered a head crash but still works, except for the few gigabytes of platter permanently damaged by the crash. YMMV.
Can you post the full output of smartctl -a?
Last edited by mich41 (2016-05-28 09:25:38)
Offline
Here is the smartctl output ...
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.5.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD2503ABYZ-011FA0
Serial Number: WD-WMAYP0DPA70X
LU WWN Device Id: 5 0014ee 0593454fc
Firmware Version: 01.01S03
User Capacity: 251,059,544,064 bytes [251 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat May 28 11:13:09 2016 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (3900) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  42) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   136   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   155   139   021    Pre-fail  Always       -       3208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       213
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10269
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       194
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       163
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       59
194 Temperature_Celsius     0x0022   109   094   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       36
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       36
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure        90%            10269  126309139
# 2  Short offline       Completed: read failure        90%            10268  126309139
# 3  Conveyance offline  Completed: read failure        90%            10260  126309139
# 4  Short offline       Completed: read failure        90%            10260  126309139
# 5  Extended offline    Completed: read failure        90%            10260  126309139
# 6  Short offline       Completed without error        00%             6600  -
# 7  Short offline       Completed without error        00%             1492  -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Last edited by jjb2016 (2016-05-28 11:07:05)
Offline
Please use code tags when pasting terminal output; they retain the formatting and make it easier to scroll through.
Sakura:-
Mobo: MSI MAG X570S TORPEDO MAX // Processor: AMD Ryzen 9 5950X @4.9GHz // GFX: AMD Radeon RX 5700 XT // RAM: 32GB (4x 8GB) Corsair DDR4 (@ 3000MHz) // Storage: 1x 3TB HDD, 6x 1TB SSD, 2x 120GB SSD, 1x 275GB M2 SSD
Making lemonade from lemons since 2015.
Offline
Other than the self-test errors, the "Current_Pending_Sector" entry in the table is worrying. Those are sectors the drive can't read. The moment you write to one of them, the drive should remap that sector to its spare area. The "Reallocated_Event_Count" will then go up and the "Current_Pending_Sector" count will go down.
You shouldn't use this drive anymore.
You could keep it around for fun and track how the counts develop over time.
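One way to track those counts is to save the attribute table (for example from `smartctl -A`) every so often and pull out the sector-related raw values. What follows is only a minimal Python sketch, assuming the column layout of the table posted earlier in this thread. The function name `sector_counters` and the set of watched attributes are made up for illustration; they are not part of smartmontools.

```python
# Attributes worth watching on a drive with unreadable sectors
# (an illustrative selection, not an official list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def sector_counters(smart_text):
    """Return {attribute_name: raw_value} for the watched attributes.

    Expects rows in the smartctl attribute-table layout:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    counters = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit():
            counters[name] = int(raw)
    return counters


# Rows copied from the output posted above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always  - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age  Always  - 0
197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always  - 31
198 Offline_Uncorrectable   0x0030 200 200 000 Old_age  Offline - 36
"""

print(sector_counters(sample))
```

Saving the returned dictionary with a timestamp after each run would give a simple history of whether the pending and reallocated counts are growing.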
Offline
Disappointing! This is a Western Digital Re enterprise-grade disk. I bought it thinking it would be more reliable than the WD Blue I had before (which also failed).
I'm going to dd zeros to the disk to try and force reallocation of bad sectors ...
Offline
You can test for and flag bad sectors, but in my experience more bad sectors usually turn up over time. The best advice is to back up and replace the drive.
e2fsck -cckv /dev/sdxy
You can calculate the partition on which the first bad sector was found but I haven't done that in years.
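That calculation is mostly arithmetic, so here is a rough Python sketch of it. It assumes 512-byte logical sectors (which is what smartctl reports for this drive) and a filesystem block size of 4096 bytes, which is only a guess. The partition layout below is HYPOTHETICAL and invented for the example; the real start and end sectors would have to be read from the disk's partition table. The function name `locate_lba` is also made up.

```python
SECTOR_SIZE = 512  # bytes, as reported by smartctl for this drive


def locate_lba(lba, partitions, fs_block_size=4096):
    """Find which partition holds `lba` and the filesystem block inside it.

    `partitions` is a list of (name, first_sector, last_sector) tuples,
    with both sector numbers inclusive. Returns (name, block_number),
    or None if the LBA lies outside every partition.
    """
    for name, first, last in partitions:
        if first <= lba <= last:
            offset_bytes = (lba - first) * SECTOR_SIZE
            return name, offset_bytes // fs_block_size
    return None


# HYPOTHETICAL layout, made up for this example only.
layout = [
    ("sdx1", 2048, 1050623),
    ("sdx2", 1050624, 490350671),
]

# LBA_of_first_error taken from the self-test log posted above.
print(locate_lba(126309139, layout))
```

The block number is counted from the start of the partition, which is the form filesystem tools usually expect when asked which file owns a given block.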
EDIT: And yes, 10k hours is really low to be seeing these.
Last edited by graysky (2016-05-28 11:39:43)
CPU-optimized Linux-ck packages @ Repo-ck • AUR packages • Zsh and other configs
Offline
If the drive is still in its warranty period, you could try RMAing it. I would say it's unacceptable for an enterprise-class drive to go bad so quickly.
Offline
From googling the part number, it seems that it has a 5-year replacement warranty.
Offline
Created an RMA for it. Never submitted an RMA before - will see what happens.
Offline
In my experience you will get a used but "factory certified" unit that will have bad sectors on it within a few months. My comment is based on a single experience.
Offline
The drive may be perfectly fine; those pending sectors might not even (and most probably will not) get reallocated once you write over them. They could have been caused by vibration or fluctuations in the power supply while the drive was writing.
What might be concerning is that Multi_Zone_Error_Rate is not zero. Although the value is low, my (limited) experience says that once that value starts increasing you may have trouble later on. You might want to run the diagnostic tools provided by Western Digital. They should have (or used to have) a bootable CD image with the tools, which you can run to get a SMART report or error code. If you plan on RMAing the drive you should use that, as I doubt they will do anything if you mention Linux.
R00KIE
Tm90aGluZyB0byBzZWUgaGVyZSwgbW92ZSBhbG9uZy4K
Offline
Thanks - I'll take a look at the WD diagnosis tool. Do you think high temperature could damage the disk in this way (not talking fire heat or anything)?
Offline
You can run "smartctl -x ..." if you are interested in what happened with the temperature in the past. It will show a graph of the recent temperature history, and there should be an entry somewhere for the lifetime min/max temperature.
Offline
Drives do have minimum and maximum operating temperatures. Higher operating temperatures impact reliability, but I'd say as long as you haven't exceeded the operational limits you should be fine. If at any point you have exceeded the maximum operating temperature, you will see a failed-in-the-past note for the temperature attribute in the smartctl output (the WHEN_FAILED column).
If you have ever exceeded the maximum operating temperature, then I suspect WD might not want to exchange the drive, but you should try anyway.
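As a rough illustration of reading that attribute, the Python sketch below takes one row of the attribute table and reports the current raw temperature and whether a failure has ever been recorded. It assumes the row layout from the output posted earlier in this thread, where WHEN_FAILED shows "-" for an attribute that has never failed. The function name `temperature_status` and the keys it returns are invented for this example.

```python
def temperature_status(attr_line):
    """Summarize one smartctl attribute row for the temperature attribute.

    Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    """
    fields = attr_line.split()
    worst, thresh = int(fields[4]), int(fields[5])
    when_failed = fields[8]
    return {
        # First token of the raw value, degrees Celsius on this drive.
        "celsius_now": int(fields[9]),
        # smartctl prints "-" when the attribute has never failed.
        "ever_failed": when_failed != "-",
        # Distance between the worst normalized value seen and the threshold.
        "worst_margin": worst - thresh,
    }


# Row copied from the output posted above.
row = "194 Temperature_Celsius 0x0022 109 094 000 Old_age Always - 34"
print(temperature_status(row))
```

The lifetime min/max figures mentioned above come from a different part of the "smartctl -x" output and are not covered by this sketch.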
Offline