Hard drive failures

Rokurosv · 2009-03-07 18:07:50

Dunno if this is in the right section, but since it involves hardware I thought I'd post it here.

I've noticed that my hard drive sometimes makes a loud noise, as if it's spinning up and then stopping, and it keeps repeating that cycle for a while. It has happened several times, but the last couple of times my whole home partition changed to read-only, and not even root could change it back. I'm on a fresh install now, but I'm wondering if it will happen again.
Last time I was using ext4 partitions and XFCE; right now I'm using KDE and ext3, just to rule those out. Any ideas on what it may be? Perhaps my drive is failing or something. How can I check if it's my drive? I'm using a 40GB Western Digital disk btw.

Thanks for your help

skottish · 2009-03-07 18:11:56

All major drive manufacturers have test disks that can be downloaded, usually in CD format. They're self-contained, so the OS shouldn't be a factor. See if you can find one of those, and run the entire test suite. A thorough scan will take several hours, but it's worth the time.

pyther · 2009-03-07 18:27:30

If your kernel was remounting the partition read-only, that strongly suggests a bad drive, especially with the sounds you are hearing. If you run dmesg, there is a good chance you'll see a lot of I/O errors. A badly damaged filesystem could also cause this, but because you are hearing such sounds I'd guess the drive is failing.
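
A minimal sketch of the dmesg check, assuming the kernel reports the problem with one of the phrases below. The function name and the pattern list are illustrative assumptions, not an exhaustive set of kernel messages.

```sh
# Sketch: keep only kernel log lines that commonly accompany a failing disk.
# The patterns are assumptions; adjust them to what your kernel prints.
disk_error_lines() {
    grep -i -E 'i/o error|medium error|remounting filesystem read-only'
}

# Usage on a live system (not run here):
#   dmesg | disk_error_lines
```

If this prints nothing, that does not prove the disk is healthy; it only means none of these phrases appear in the current kernel log.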

R00KIE · 2009-03-07 21:04:16

Hmmm, I have seen something similar (without the read-only remount, but I was using windoze, which doesn't know what that is anyway), but it was in an external box. I guess the power supply that came with the box just went postal: sometimes the HD stopped, started and stopped again, and kept going like that until a power cycle. The hard disk itself worked perfectly when connected inside my desktop PC. Check whether the power supply in your PC is going postal too.
About the errors on disk: once you have sorted out where the problem comes from, and if it turns out not to be the disk, back up all your data and write the whole drive with zeros (dd will do the trick nicely). I have a laptop disk that had a few sectors that couldn't be read. I thought the disk was dying, but after I did that on the partition that was giving me trouble everything was fine again (not even one reallocated sector... or so the SMART data says).

shining · 2009-03-07 23:57:00

The hard disk of my two-month-old laptop is playing tricks on me as well.
I had exactly the same symptoms: I could hear the disk going through some cycles, then came some bad filesystem errors, and the filesystems were remounted read-only.
I installed smartmontools (pacman -S smartmontools) and started to read some docs:
http://smartmontools.sourceforge.net/faq.html
http://smartmontools.sourceforge.net/badblockhowto.html
The first steps I would take, assuming your disk is sda:
# smartctl -t short /dev/sda
The output tells you how long to wait for the test to complete (normally less than 2 minutes). Then:
# smartctl -a /dev/sda
I did not have any errors here, so I ran the long one.
# smartctl -t long /dev/sda
Again, the output tells you how long to wait (normally less than 2 hours). Then:
# smartctl -a /dev/sda
Here I found some bad blocks, so I followed the badblockhowto above. Basically it's what R00KIE said: you just fill the bad blocks with zeros.
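
To find the block to look up in the HOWTO, the self-test log at the end of the smartctl output lists the LBA of the first error. A small sketch that pulls it out, assuming the log layout smartctl 5.x prints (the function name is made up for illustration):

```sh
# Sketch: read `smartctl -l selftest` (or `smartctl -a`) output on stdin and
# print the LBA_of_first_error of the most recent test that ended in a
# read failure. Prints nothing if no such entry exists.
first_error_lba() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF; exit }'
}

# Usage on a live system (not run here):
#   smartctl -l selftest /dev/sda | first_error_lba
```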

If the disk can read the sector of data a single time, and the damage is permanent, not transient, then the disk firmware will mark the sector as 'bad' and allocate a spare sector to replace it. But if the disk can't read the sector even once, then it won't reallocate the sector, in hopes of being able, at some time in the future, to read the data from it. A write to an unreadable (corrupted) sector will fix the problem. If the damage is transient, then new consistent data will be written to the sector. If the damage is permanent, then the write will force sector reallocation. Please see Bad block HOWTO for instructions about how to force this sector to reallocate (Linux only).
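
As a rough illustration of such a write, the sketch below only prints a dd command for one sector; it does not run it. It assumes 512-byte logical sectors and an LBA counted from the start of the whole disk, not from the start of a partition. The function name is hypothetical, and the HOWTO's own procedure (which works out the affected filesystem block first) should take precedence.

```sh
# Sketch: print (do not run) a dd command that overwrites one sector.
# Assumes 512-byte sectors and a whole-disk LBA; whatever was stored in
# that sector is lost once the printed command is executed.
zero_sector_cmd() {
    dev=$1
    lba=$2
    printf 'dd if=/dev/zero of=%s bs=512 count=1 seek=%s\n' "$dev" "$lba"
}

# Example with placeholder values:
#   zero_sector_cmd /dev/sdX 12345
```

Printing instead of executing leaves a chance to double-check the device name and sector number before anything is written.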

This fixed the problem temporarily, but one week later I had a new bad block, apparently the last one on the disk, and this time I could not fix it; I am not sure why. I formatted the whole disk (filling it with zeros). After that the SMART long test did not show any errors, but the SMART short test still did. Since all the bad blocks were always at the end of the disk, I resized my disk from 320 GB to 250 GB, and now all the tests are fine. Well, if it could at least stay that way, I would be happy.

Last edited by shining (2009-03-08 00:03:29)

Rokurosv · 2009-03-08 00:20:51

OK, I ran a couple of tests.
First I did a
grep -R DriveStatusError *

on /usr/src/linux-2.6.28-ARCH/ and nothing showed up. Then I ran the long SMART test, and here is the output:

smartctl -a

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Protege
Device Model:     WDC WD400EB-00CPF0
Serial Number:    WD-WMAATF705942
Firmware Version: 06.04G06
User Capacity:    40,020,664,320 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Mar  7 18:07:06 2009 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 115)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:          (1754) seconds.
Offline data collection
capabilities:              (0x3b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    No General Purpose Logging support.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  30) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   199   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   104   089   021    Pre-fail  Always       -       2158
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       821
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1236
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       814
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   197   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0012   200   197   000    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 42 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 42 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 79 7f cf f0  Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 5b 7f cf f0 00      00:02:05.350  READ DMA
  f8 00 00 00 00 00 f0 08      00:02:05.350  READ NATIVE MAX ADDRESS
  ec 00 00 00 00 00 b0 0a      00:02:05.300  IDENTIFY DEVICE
  ef 03 45 00 00 00 b0 0a      00:02:05.300  SET FEATURES [Set transfer mode]
  f8 00 00 00 00 00 f0 08      00:02:05.300  READ NATIVE MAX ADDRESS

Error 41 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 79 7f cf f0  Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 5b 7f cf f0 00      00:02:02.050  READ DMA
  f8 00 00 00 00 00 f0 08      00:02:02.050  READ NATIVE MAX ADDRESS
  ec 00 00 00 00 00 b0 0a      00:02:02.050  IDENTIFY DEVICE
  ef 03 45 00 00 00 b0 0a      00:02:02.050  SET FEATURES [Set transfer mode]
  f8 00 00 00 00 00 f0 08      00:02:02.000  READ NATIVE MAX ADDRESS

Error 40 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 79 7f cf f0  Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 5b 7f cf f0 00      00:01:59.000  READ DMA
  f8 00 00 00 00 00 f0 08      00:01:58.950  READ NATIVE MAX ADDRESS
  ec 00 00 00 00 00 b0 0a      00:01:58.950  IDENTIFY DEVICE
  ef 03 45 00 00 00 b0 0a      00:01:58.950  SET FEATURES [Set transfer mode]
  f8 00 00 00 00 00 f0 08      00:01:58.950  READ NATIVE MAX ADDRESS

Error 39 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 79 7f cf f0  Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 5b 7f cf f0 00      00:01:55.750  READ DMA
  f8 00 00 00 00 00 f0 08      00:01:55.700  READ NATIVE MAX ADDRESS
  ec 00 00 00 00 00 b0 0a      00:01:55.700  IDENTIFY DEVICE
  ef 03 45 00 00 00 b0 0a      00:01:55.700  SET FEATURES [Set transfer mode]
  f8 00 00 00 00 00 f0 08      00:01:55.700  READ NATIVE MAX ADDRESS

Error 38 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 79 7f cf f0  Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 5b 7f cf f0 00      00:01:52.450  READ DMA
  f8 00 00 00 00 00 f0 08      00:01:52.450  READ NATIVE MAX ADDRESS
  ec 00 00 00 00 00 b0 0a      00:01:52.450  IDENTIFY DEVICE
  ef 03 45 00 00 00 b0 0a      00:01:52.450  SET FEATURES [Set transfer mode]
  f8 00 00 00 00 00 f0 08      00:01:52.450  READ NATIVE MAX ADDRESS

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       30%       143         48085936
# 2  Short offline       Completed without error       00%       140         -

Device does not support Selective Self Tests/Logging

So basically now I have to look for that block address and do what the guide says, right?
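
The LBA that smartctl prints for each ATA error can be cross-checked against the raw register bytes in the same entry. With 28-bit LBA addressing, the low four bits of DH hold bits 24-27, CH holds bits 16-23, CL bits 8-15 and SN bits 0-7. A small sketch (the function name is made up for illustration):

```sh
# Sketch: rebuild a 28-bit LBA from the hex register bytes in the error log.
# Arguments are SN CL CH DH as printed by smartctl (two hex digits each).
lba28() {
    sn=$1; cl=$2; ch=$3; dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

lba28 79 7f cf f0   # registers after the error  -> 13598585
lba28 5b 7f cf f0   # start of the READ DMA      -> 13598555
```

The READ DMA asked for 0xb0 = 176 sectors starting at 13598555, so the failing sector 13598585 lies inside that request. These error log entries are stamped at 125 power-on hours, while the extended self-test that reported LBA 48085936 ran at 143 hours.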

shining · 2009-03-08 00:28:50

I find it strange that it reports the first error at 48 085 936 while it also says there is an error at 13 598 585, which should come before it.

I would still do as R00KIE said, if possible. The guide only tells you how to address the reported issues, by writing zeros at specific places.
But it would be both safer and easier to just fill the whole drive with zeros.
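
A cautious sketch of the whole-drive zero fill, assuming it is run as root from a live CD so that nothing on the target disk is mounted. Both function names are illustrative; the guard simply looks for the device name at the start of entries in a /proc/mounts style table.

```sh
# Sketch: refuse to zero a disk while any of its partitions is mounted.
# is_mounted reads a mount table in /proc/mounts format on stdin.
is_mounted() {
    dev=$1
    awk -v d="$dev" 'index($1, d) == 1 { found = 1 } END { exit !found }'
}

# zero_fill DEVICE: overwrites the WHOLE device. All data on it is lost.
zero_fill() {
    dev=$1
    if is_mounted "$dev" < /proc/mounts; then
        echo "$dev is in use, boot a live CD first" >&2
        return 1
    fi
    dd if=/dev/zero of="$dev" bs=1M
}

# Usage (destructive, not run here):
#   zero_fill /dev/sdX
```

dd is expected to stop with a "No space left on device" message once it reaches the end of the disk; that is the normal end of a full zero fill.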

Rokurosv · 2009-03-08 00:38:26

How should I do that? Maybe with a LiveCD? I have a sidux LiveCD lying around... I lent out my Arch CD.

R00KIE · 2009-03-08 01:31:29

You see these two

197 Current_Pending_Sector  0x0012   200   197   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0012   200   197   000    Old_age   Always       -       3

These are the buggers that are causing you trouble.

These are still zero

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

Exactly the same happened to me. It was on a swap partition, so I just nuked it with zeros; the first two errors went away and no reallocations were done. Then, just to be sure, I nuked the whole drive with zeros and with test patterns... I can't remember which program I used, but there is one that does that.
If you start getting reallocations (too many, that is), consider contacting the manufacturer to ask how much is too much and whether you should return the drive. A few reallocated sectors are OK and somewhat expected, but if they keep increasing, better not to trust that drive.

Rokurosv · 2009-03-08 01:37:28

Since this was a semi-fresh install (I had only downloaded a couple of screencasts, and I saved those to my USB stick), I decided to just zero-fill the drive.
This drive was a gift from a friend who had an old HP sitting around gathering dust.

I think it was nice that this happened, since now I know a little more about how to use smartctl and how to search for blocks and stuff.

shining · 2009-03-08 11:22:39

Rokurosv wrote:

Since this was a semi-fresh install (I had only downloaded a couple of screencasts, and I saved those to my USB stick), I decided to just zero-fill the drive.
This drive was a gift from a friend who had an old HP sitting around gathering dust.
I think it was nice that this happened, since now I know a little more about how to use smartctl and how to search for blocks and stuff.

So you did it? And did it work?

You should have the following counters back to 0:

197 Current_Pending_Sector  0x0012   200   197   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0012   200   197   000    Old_age   Always       -       3

And the short and long smartctl tests should pass again.
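
A small sketch for checking those two counters after the zero fill, assuming the attribute table layout shown earlier in the thread, where the raw value is the last column. The function name is made up for illustration.

```sh
# Sketch: read `smartctl -A` output on stdin; succeed only if attributes
# 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable) both have a
# raw value of 0.
pending_clear() {
    awk '($1 == 197 || $1 == 198) && $NF != 0 { bad = 1 } END { exit bad + 0 }'
}

# Usage on a live system (not run here):
#   smartctl -A /dev/sda | pending_clear && echo "counters are back to 0"
```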

R00KIE · 2009-03-08 16:14:30

The other program whose name I couldn't remember is "badblocks". It can write a few patterns to the disk, read them back and check that everything is OK. I leave it here for reference.
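
For reference, a sketch that prints a badblocks command line for the two read-write modes. The wrapper and its "write"/"keep" mode names are invented for illustration; -w (destructive write-mode test), -n (non-destructive read-write test), -s (show progress) and -v (verbose) are standard badblocks options.

```sh
# Sketch: print (do not run) a badblocks invocation for the chosen mode.
#   write -> -w  destructive write-mode test (erases the device)
#   keep  -> -n  non-destructive read-write test
badblocks_cmd() {
    mode=$1; dev=$2
    case $mode in
        write) flag=-w ;;
        keep)  flag=-n ;;
        *) echo "usage: badblocks_cmd write|keep DEVICE" >&2; return 2 ;;
    esac
    printf 'badblocks %s -s -v %s\n' "$flag" "$dev"
}

# Example with a placeholder device:
#   badblocks_cmd write /dev/sdX
```

Either mode should only be run on an unmounted device.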
