SMART errors - is my disk fixable?

phunni · 2016-06-20 16:53:40

I've recently been getting some I/O errors during simple file copy actions, so I suspected that my disk might be in trouble.

I ran a short and an extended SMART test on both my disks (sda, sdb), and sdb produced the following report when running smartctl -a on it:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.6.2-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12
Device Model:     ST31000524AS
Serial Number:    9VPCGAWQ
LU WWN Device Id: 5 000c50 035476beb
Firmware Version: JC4B
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Jun 20 17:47:10 2016 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(  592) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 164) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   086   006    Pre-fail  Always       -       102468534
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       115
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       88
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       100169919
  9 Power_On_Hours          0x0032   052   052   000    Old_age   Always       -       42548
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       115
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       501
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       42950328330
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   062   056   045    Old_age   Always       -       38 (Min/Max 17/42)
194 Temperature_Celsius     0x0022   038   044   000    Old_age   Always       -       38 (0 11 0 0 0)
195 Hardware_ECC_Recovered  0x001a   026   015   000    Old_age   Always       -       102468534
197 Current_Pending_Sector  0x0012   093   092   000    Old_age   Always       -       319
198 Offline_Uncorrectable   0x0010   093   092   000    Old_age   Offline      -       319
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       42828 (146 107 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       577036012
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4132491586

SMART Error Log Version: 1
ATA Error Count: 485 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 485 occurred at disk power-on lifetime: 41487 hours (1728 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00   7d+04:05:01.686  READ DMA EXT
  27 00 00 00 00 00 e0 00   7d+04:05:01.683  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00   7d+04:05:01.677  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+04:05:01.671  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+04:05:01.650  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 484 occurred at disk power-on lifetime: 41487 hours (1728 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00   7d+04:04:58.829  READ DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:04:57.376  WRITE DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:04:57.376  WRITE DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:04:57.375  WRITE DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:04:57.375  WRITE DMA EXT

Error 483 occurred at disk power-on lifetime: 41487 hours (1728 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00   7d+04:04:12.916  READ DMA EXT
  ea 00 00 ff ff ff af 00   7d+04:04:08.243  FLUSH CACHE EXT
  ea 00 00 ff ff ff af 00   7d+04:04:02.751  FLUSH CACHE EXT
  ea 00 00 ff ff ff af 00   7d+04:04:02.242  FLUSH CACHE EXT
  35 00 08 ff ff ff ef 00   7d+04:03:57.140  WRITE DMA EXT

Error 482 occurred at disk power-on lifetime: 41487 hours (1728 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00   7d+04:03:27.824  READ DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:03:25.346  WRITE DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:03:25.346  WRITE DMA EXT
  ca 00 08 37 e2 f2 e6 00   7d+04:03:25.346  WRITE DMA
  35 00 08 ff ff ff ef 00   7d+04:03:25.345  WRITE DMA EXT

Error 481 occurred at disk power-on lifetime: 41487 hours (1728 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00   7d+04:01:13.790  READ DMA EXT
  25 00 00 ff ff ff ef 00   7d+04:01:13.789  READ DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:01:13.789  WRITE DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:01:13.789  WRITE DMA EXT
  35 00 08 ff ff ff ef 00   7d+04:01:13.788  WRITE DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     42548         1262549068
# 2  Extended offline    Completed: read failure       90%     42529         1262549068
# 3  Short offline       Completed: read failure       80%     42527         1262549068

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I've tried Googling some of these error messages, but, to be honest, not much makes sense. Is my disk done for? Is it possible to fix these errors, or should I just give up and replace it?

Thanks in advance!

daneel971 · 2016-06-21 12:24:51

  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       88
197 Current_Pending_Sector  0x0012   093   092   000    Old_age   Always       -       319
198 Offline_Uncorrectable   0x0010   093   092   000    Old_age   Offline      -       319

See if the number of reallocated and offline uncorrectable sectors increases.
If I were you, I'd replace it ASAP.

Anyway, run the Seagate diagnostic tool, and see what it says.
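
A minimal sketch of the "see if it increases" check, assuming you note down the raw values from two smartctl runs some time apart. The function name growth and the second snapshot are hypothetical and only there for illustration; the first snapshot uses the raw values from the report above.

```python
# Hypothetical helper: compare two snapshots of SMART raw values.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def growth(before, after, watched=WATCHED):
    """Return {attribute: increase} for watched attributes whose raw value grew."""
    result = {}
    for name in watched:
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > 0:
            result[name] = delta
    return result


# Raw values taken from the report in this thread.
first = {
    "Reallocated_Sector_Ct": 88,
    "Current_Pending_Sector": 319,
    "Offline_Uncorrectable": 319,
}

# Made-up later reading, purely to show what the output looks like.
later = {
    "Reallocated_Sector_Ct": 96,
    "Current_Pending_Sector": 319,
    "Offline_Uncorrectable": 325,
}

print(growth(first, later))
```

An empty result would mean none of the three counters moved between the two readings.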

R00KIE · 2016-06-21 13:15:36

I would not trust that disk. Get a new one, make it sweat with 'badblocks', then copy all data to the new disk. I would also make sure that the power supply is still in good condition. From my limited personal experience I'd say a bad power supply can cause some weird problems.

jsoy9pQbYVNu5nfU · 2016-06-21 14:03:23

daneel971 wrote:

  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       88
197 Current_Pending_Sector  0x0012   093   092   000    Old_age   Always       -       319
198 Offline_Uncorrectable   0x0010   093   092   000    Old_age   Offline      -       319
See if the number of reallocated and offline uncorrectable sectors increases.
If I were you, I'd replace it ASAP.
Anyway, run the Seagate diagnostic tool, and see what it says.

While Seagate drives are notorious for funny SMART values, attribute no 5 means trouble. Replace and/or do not trust this disk. On the other hand attr 196 Reallocated Event Count (missing here) changing around a bit is normal for some Seagate models.

You can use smartctl -t long to run an extended self-test and watch the attributes.

Soukyuu · 2016-06-21 18:21:05

To add to the "funny SMART values" 2ion mentioned: IDs 1 and 7 are not plain counters. They use the upper bits for errors and the lower bits to count accesses WITHOUT errors. According to this, unless your value for those IDs increases in huge leaps, everything is OK.

But as others have already said, your drive is probably dying. Reallocated sectors are those that cannot be written to anymore, and apparently your drive has run out of spare sectors as well.
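
The upper/lower bit split described for IDs 1 and 7 can be sketched like this. The exact layout is an assumption: a 48-bit raw value with the upper 16 bits as the error count and the lower 32 bits as the operation count is the commonly reported interpretation for Seagate drives, not something I can point to in official documentation. The function name split_seagate_raw is made up for the example.

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits).

    Assumed layout: upper 16 bits = errors, lower 32 bits = operations.
    """
    raw &= (1 << 48) - 1          # keep only the 48 bits of the raw field
    errors = raw >> 32            # upper 16 bits
    operations = raw & 0xFFFFFFFF # lower 32 bits
    return errors, operations


# Raw values from the report in this thread.
for name, raw in (("Raw_Read_Error_Rate", 102468534),
                  ("Seek_Error_Rate", 100169919)):
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors} operations={operations}")
```

Both raw values in the report are below 2^32, so under this reading the error part is 0 for both attributes.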

mich41 · 2016-06-21 20:30:51

It seems that you filled only 30% of the disk. If you are lucky, there is some chance that part of the platter was DOA but the rest is usable, and you just never found out until now. However, only solid testing could tell because, as others said, if you are unlucky the disk is failing and will make more and more errors in newly written data or just stop working altogether.
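
One plausible way to arrive at a figure like 30% is from Total_LBAs_Written in the report. This is only a sketch under two assumptions: that the attribute counts 512-byte logical sectors (the report lists a 512-byte sector size), and that cumulative writes are a fair stand-in for how much of the disk was ever touched, which they need not be if the same areas were rewritten.

```python
SECTOR_BYTES = 512                      # logical sector size from the report
total_lbas_written = 577_036_012        # attribute 241, raw value
user_capacity_bytes = 1_000_204_886_016 # "User Capacity" from the report

written_bytes = total_lbas_written * SECTOR_BYTES
fraction = written_bytes / user_capacity_bytes

print(f"{written_bytes / 1e9:.1f} GB written, {fraction:.1%} of capacity")
```

That works out to about 295 GB, or roughly 30% of the 1 TB capacity.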

Soukyuu wrote:

apparently your drive has run out of spare sectors

How do you know? If anything, I expected the normalized Reallocated_Sector_Ct value to show remaining percentage, but I don't know for sure.

frostschutz · 2016-06-21 20:44:16

Bad sectors aren't fixable. If you get Reallocated / Pending / Uncorrectable != 0 the drive is not trustworthy anymore. Maybe you can still use it on the side (video recorder, ...) but for anything important, you're gambling at this point.
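
The "!= 0" rule can be sketched as a small filter over the attribute table. It assumes the column layout shown in the report above (raw value in the tenth whitespace-separated field); the function name nonzero_suspects and the ID set are chosen for the example only.

```python
# A few attribute lines copied from the report in this thread.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       88
197 Current_Pending_Sector  0x0012   093   092   000    Old_age   Always       -       319
198 Offline_Uncorrectable   0x0010   093   092   000    Old_age   Offline      -       319
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Reallocated, pending, and offline uncorrectable sector counters.
SUSPECT_IDS = {5, 197, 198}


def nonzero_suspects(text):
    """Return {name: raw} for suspect attributes whose raw value is above zero."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in SUSPECT_IDS and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


print(nonzero_suspects(REPORT))
```

For this disk all three counters show up, which is the case described above as not trustworthy.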

Last edited by frostschutz (2016-06-21 20:45:33)

R00KIE · 2016-06-22 10:33:10

From my limited experience Current_Pending_Sectors and Offline_Uncorrectable are not critical, they show there was a problem but it might not be a problem with the disk itself, they can be caused by shock or vibration while writing or fluctuations on the supply voltage and usually get fixed after you write over the affected sectors (*). However if I see Reallocated_Sector_Ct != 0 then that means there is trouble and the disk is not to be trusted with anything.

(*) I've also had one disk that really was broken and would not fix or reallocate the bad/offending sectors and just mark them as Current_Pending_Sectors/Offline_Uncorrectable on the next read. If you plan to use the disk for anything make sure you test it properly with badblocks.

Soukyuu · 2016-06-23 18:17:01

I was under the impression that uncorrectable sector count shows how many sectors the HDD can not reallocate anymore, because no spare sectors are left.

frostschutz · 2016-06-23 18:31:07

R00KIE wrote:

From my limited experience Current_Pending_Sectors and Offline_Uncorrectable are not critical

A pending sector is a sector the disk can't read anymore. In other words, this disk already lost you data. (Some people use btrfs because they're worried about single bit errors... here you lose 512 Byte or 4 Kilobyte in one go). What's your definition of critical?

R00KIE wrote:

they show there was a problem but it might not be a problem with the disk itself

Nobody knows why it happened or when it may happen again, so all you can do is cross fingers and hope for the best (gamble).

For me it's a matter of trust. I trust my disks to keep the data I wrote. Once the disk starts losing data, I have no reason to assume it was somehow a fluke and not hardware related... the trust is gone.

I do use RAID-5, I can survive a single dead disk. But that works only if all other disks are in perfect condition...

[I also do have backups, but restoring backups is a lot more work than just popping in a new disk in a RAID.]

Last edited by frostschutz (2016-06-23 18:40:42)

R00KIE · 2016-06-23 21:03:45

frostschutz wrote:

R00KIE wrote:
From my limited experience Current_Pending_Sectors and Offline_Uncorrectable are not critical
A pending sector is a sector the disk can't read anymore. In other words, this disk already lost you data. (Some people use btrfs because they're worried about single bit errors... here you lose 512 Byte or 4 Kilobyte in one go). What's your definition of critical?

By critical I was thinking of things like increasing reallocated sectors or other problematic sectors, changes in other smart parameters or abnormal noises, something that would imply the drive could fail at any moment.

Like I said before, the drive might not be able to read a sector due to causes external to the drive, such as shocks/vibration or fluctuations of supply voltage, and the drive may be just fine.

From my limited experience I've seen three things happen: (a) the sector remains usable after being written over and doesn't give any more problems, (b) on a subsequent read it will become pending again, or as a nasty last option (c) the disk will look ok even after a few complete badblocks tests but will develop pending sectors randomly afterwards. I could name brands but I don't want to derail the thread

frostschutz wrote:

R00KIE wrote:
they show there was a problem but it might not be a problem with the disk itself
Nobody knows why it happened or when it may happen again, so all you can do is cross fingers and hope for the best (gamble).
For me it's a matter of trust. I trust my disks to keep the data I wrote. Once the disk starts losing data, I have no reason to assume it was somehow a fluke and not hardware related... the trust is gone.
I do use RAID-5, I can survive a single dead disk. But that works only if all other disks are in perfect condition...
[I also do have backups, but restoring backups is a lot more work than just popping in a new disk in a RAID.]

I agree, it's a matter of trust and gamble. Personally if I can assure that I will not lose data in case things go wrong (or if the data is not of much importance or I can get it again from somewhere) then I try giving the disk a second chance and see what happens. Of course this also depends if it will be too much work to recover from a failure, if it fails to read only one cat video I can download again then I might gamble, if I risk losing many files then I will probably just get a new disk.

mich41 · 2016-06-23 21:25:01

Just wanted to add that I've seen both (a) and (c) too.

Also (d) - a laptop had been dropped while running, the disk probably suffered a head crash, and a few tens of GB were unreadable and unwritable. However, the rest survived a few nights of testing (badblocks, sequential overwrites with pv, some tarring/untarring of kernel sources). The owner decided to save money and gamble, and so far nothing bad has happened after a few months.

But that's a rather hardcore story. I'm still not sure if this disk won't fail prematurely and I don't recommend dropping disks on the floor
