HDD Problem - Freezes for a while during linux init

I've got a problem with my HDD. My Arch install usually boots lightning fast, but for the last 3-4 days it has been freezing for a while before the login screen. I checked dmesg and this is what I found:

[   53.360021] ata1: lost interrupt (Status 0x50)
[   53.360041] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[   53.360045] ata1.00: failed command: WRITE DMA EXT
[   53.360053] ata1.00: cmd 35/00:18:e0:d9:46/00:00:12:00:00/e0 tag 0 dma 12288 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   53.360056] ata1.00: status: { DRDY }
[   53.360075] ata1: soft resetting link
[   53.562494] ata1: nv_mode_filter: 0x3f39f&0x3f39f->0x3f39f, BIOS=0x3f000 (0xc6000000) ACPI=0x3f01f (20:900:0x11)
[   53.628863] ata1.00: configured for UDMA/100
[   53.628868] ata1.00: device reported invalid CHS sector 0
[   53.628883] ata1: EH complete
[   76.701622] fuse init (API version 7.21)

This is the output from smartctl:

smartctl 6.1 2013-03-16 r3800 [i686-linux-3.9.4-1-ARCH] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke,

Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3500630A
Serial Number:    5QG1X3G1
Firmware Version: 3.AAE
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Fri Jun  7 11:41:09 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 163) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   095   006    Pre-fail  Always       -       43863857
  3 Spin_Up_Time            0x0003   094   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2833
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   047   036   030    Pre-fail  Always       -       32856947045143
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       16160
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   020    Old_age   Always       -       3563
187 Reported_Uncorrect      0x0032   016   016   000    Old_age   Always       -       84
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   047   045    Old_age   Always       -       43 (Min/Max 23/43)
194 Temperature_Celsius     0x0022   043   053   000    Old_age   Always       -       43 (0 16 0 0 0)
195 Hardware_ECC_Recovered  0x001a   063   054   000    Old_age   Always       -       32213260
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       4294967295
198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       4294967295
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 118 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 118 occurred at disk power-on lifetime: 15759 hours (656 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 20 01 00 00 00 e0 00      00:00:17.357  READ DMA EXT
  c6 20 10 01 00 00 ef 00      00:00:17.357  SET MULTIPLE MODE
  91 20 3f 01 00 00 ef 00      00:00:17.357  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 20 01 01 00 00 e1 00      00:00:17.350  RECALIBRATE [OBS-4]
  00 00 01 00 00 00 00 04      00:00:18.309  NOP [Abort queued commands]

Error 117 occurred at disk power-on lifetime: 15759 hours (656 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 20 01 00 00 00 e0 00      00:00:17.357  READ DMA EXT
  c6 20 10 01 00 00 ef 00      00:00:17.357  SET MULTIPLE MODE
  91 20 3f 01 00 00 ef 00      00:00:17.357  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 20 01 01 00 00 e1 00      00:00:17.350  RECALIBRATE [OBS-4]
  00 00 01 00 00 00 00 04      00:00:16.873  NOP [Abort queued commands]

Error 116 occurred at disk power-on lifetime: 15759 hours (656 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 20 01 00 00 00 e0 00      00:00:17.357  READ DMA EXT
  c6 20 10 01 00 00 ef 00      00:00:17.357  SET MULTIPLE MODE
  91 20 3f 01 00 00 ef 00      00:00:17.357  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 20 01 01 00 00 e1 00      00:00:17.350  RECALIBRATE [OBS-4]
  00 00 01 00 00 00 00 04      00:00:16.873  NOP [Abort queued commands]

Error 115 occurred at disk power-on lifetime: 15759 hours (656 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 20 01 00 00 00 e0 00      00:00:13.895  READ DMA EXT
  c6 20 10 01 00 00 ef 00      00:00:13.895  SET MULTIPLE MODE
  91 20 3f 01 00 00 ef 00      00:00:13.887  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 20 01 01 00 00 e1 00      00:00:13.419  RECALIBRATE [OBS-4]
  00 00 10 00 00 00 00 04      00:00:16.873  NOP [Abort queued commands]

Error 114 occurred at disk power-on lifetime: 15759 hours (656 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 00 00 00 00 e0  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 20 01 00 00 00 e0 00      00:00:13.895  READ DMA EXT
  c6 20 10 01 00 00 ef 00      00:00:13.895  SET MULTIPLE MODE
  91 20 3f 01 00 00 ef 00      00:00:13.887  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 20 01 01 00 00 e1 00      00:00:13.419  RECALIBRATE [OBS-4]
  00 00 01 00 00 00 00 04      00:00:13.419  NOP [Abort queued commands]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     16160         -
# 2  Short offline       Completed without error       00%     15759         -

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It would be great if someone could point out what's happening and help me prevent a major incident. Thanks in advance!


Re: HDD Problem - Freezes for a while during linux init

If my math is correct, you have more Current Pending Sectors and Offline Uncorrectable sectors than user-accessible sectors on the drive (over 4x). Something seems to have gone seriously wrong there. I have never seen anything like that, and logic says it shouldn't be possible.
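For anyone who wants to check the arithmetic, here is a quick sketch in Python. The constants are copied from the smartctl output in the first post (User Capacity, Sector Size, and the raw value of attributes 197 and 198); the variable names are just mine.

```python
# Figures copied from the smartctl output in the first post.
CAPACITY_BYTES = 500_107_862_016   # "User Capacity"
SECTOR_BYTES = 512                 # "Sector Size" (logical/physical)
PENDING_RAW = 4_294_967_295        # raw value of attributes 197 and 198

# Number of sectors the user can actually address on this drive.
user_sectors = CAPACITY_BYTES // SECTOR_BYTES

# How many times larger the reported count is than the whole drive.
ratio = PENDING_RAW / user_sectors

print(f"user-addressable sectors: {user_sectors}")
print(f"reported pending sectors: {PENDING_RAW}")
print(f"ratio: {ratio:.2f}x")
```

So the reported count is well over four times the number of sectors that exist on the drive.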

Also this doesn't inspire confidence:

  91 20 3f 01 00 00 ef 00      00:00:13.887  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 20 01 01 00 00 e1 00      00:00:13.419  RECALIBRATE [OBS-4]
  00 00 01 00 00 00 00 04      00:00:13.419  NOP [Abort queued commands]



Re: HDD Problem - Freezes for a while during linux init

R00KIE wrote:

If my math is correct, you have more Current Pending Sectors and Offline Uncorrectable sectors than user-accessible sectors on the drive (over 4x). Something seems to have gone seriously wrong there. I have never seen anything like that, and logic says it shouldn't be possible.

I'm guessing that you're looking at that figure of 4294967295. That happens to be 2^32-1, i.e. -1 read as an unsigned 32-bit integer, so it's more likely to just mean "undefined". The other events hint at a failing disk, although if it were me, I'd do a Google search on the messages before paying out for a replacement.
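A small Python sketch of what I mean. It only shows that the raw value has every bit of a 32-bit field set and that the same four bytes read as a signed integer give -1. Whether the drive firmware really uses it as an "undefined" marker is my assumption, not something the SMART output confirms.

```python
import struct

RAW = 4_294_967_295  # raw value reported for attributes 197 and 198

# 2^32 - 1 is the largest value an unsigned 32-bit field can hold.
assert RAW == 2**32 - 1

# Pack the value as an unsigned 32-bit integer, then reinterpret the
# same four bytes as a signed 32-bit integer.
(as_signed,) = struct.unpack("<i", struct.pack("<I", RAW))

print(hex(RAW), as_signed)
```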


Re: HDD Problem - Freezes for a while during linux init

Thanks to both of you for the feedback. The problem remains and it's getting worse. When shutting down, Arch came up with errors like:

EXT4-fs ext4_da_writepage: jbd2_start: 8192 pages, ino 1704770; err -30

and hung. Then it wouldn't even boot, and the console output was all garbled. Right now it is working again, but I'm afraid of shutting it down :E


Re: HDD Problem - Freezes for a while during linux init

Your best bet for getting the most uncorrupted data off that disk is to stop using it until you get a new disk that is at least twice as big. The reason is that once you have the new disk, you want to take a full disk image in one go and get as much data out as possible before more corruption occurs or the disk stops working altogether.

The twice-as-large requirement is so that you can then take your time extracting the files you want from the image and still have enough free space to store them. After that, set up a backup strategy if you don't have one yet.
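To illustrate the "one pass, keep going past bad spots" idea, here is a rough Python sketch. It is only an illustration under my own assumptions: the function name, the block size and the zero-fill policy are made up for the example, and it assumes that seeking to the end of the source reports its size. For the real job a dedicated rescue tool such as GNU ddrescue is a much better fit.

```python
import os

def image_device(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block in a single pass.

    Blocks that raise an I/O error are zero-filled in the image and their
    byte offsets are returned, so they can be retried later.
    """
    bad_offsets = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # assumed to report the source size
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(want)
                failed = False
            except OSError:
                failed = True
            if failed:
                # Unreadable block: note it and keep the image aligned.
                bad_offsets.append(offset)
                data = b"\x00" * want
            elif not data:
                break  # unexpected end of input
            dst.write(data)
            offset += len(data)
    return bad_offsets

# Hypothetical usage (run from a live system, with the target on the new disk):
# bad = image_device("/dev/sdX", "/mnt/newdisk/old_disk.img")
```

Writing zeros for unreadable blocks keeps every later offset in the image correct, so the filesystem inside it can still be mounted or scanned afterwards.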

As far as I know, both the Current_Pending_Sector and Offline_Uncorrectable attributes start out at zero and 1) increase when there is a problem with a sector, 2) decrease when the problem is solved, either by overwriting the affected sectors or by the sectors being reallocated. The count should return to zero when there are no more problem sectors. This implies that the count should never exceed the total number of user-addressable sectors. Like I said, I have never seen a disk where those attributes didn't make sense. If those attributes weren't defined, I suppose they wouldn't be reported at all.



