I think my hard drive might be getting ready to crash again.

Gullible Jones · 2006-07-12 18:30:42

I'm starting to see some awfully weird stuff right now... after hard reboots due to X lockups (Sauerbraten fucking up), I seem to be losing data from config files that weren't being written to or used at the time of the reboot - .viminfo, for example, and the XML config stuff for Galeon. I also lost everything in my .xinitrc file after a normal, clean reboot. No SMART errors but I am getting rather suspicious. And no rootkits or trojans, AFAICT.

(Had this drive for only 2 years though, it shouldn't be failing now. Then again, it might have crashed before.)

Here's what smartctl --all has to say:

smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar family
Device Model:     WDC WD400BB-00JHA0
Serial Number:    WD-WMAMC1539009
Firmware Version: 05.01C05
User Capacity:    40,020,664,320 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jul 12 14:26:23 2006 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (1260) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  23) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   168   166   021    Pre-fail  Always       -       2591
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1443
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5145
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1443
194 Temperature_Celsius     0x0022   103   088   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
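For anyone who would rather check the attribute table above with a script than by eye, here is a minimal Python sketch. It assumes the ten-column layout that smartctl 5.36 prints above; the function names and the "watch list" of attributes whose raw count should stay at zero are illustrative choices and a rule of thumb, not something smartctl itself defines. The sample rows are copied from the output above.

```python
# Sample rows copied from the smartctl output in this post.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   103   088   000    Old_age   Always       -       40
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
"""

# Assumption: a nonzero raw count on these attributes is worth a closer look.
WATCH_RAW = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_attributes(text):
    """Turn the attribute table into a list of dicts, skipping non-data lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may contain spaces, so cap the split
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        })
    return rows


def suspicious(rows):
    """Return names of attributes that have failed, hit their threshold, or count bad sectors."""
    flagged = []
    for row in rows:
        failed = row["when_failed"] != "-"
        at_threshold = row["thresh"] > 0 and row["value"] <= row["thresh"]
        first_token = row["raw"].split()[0]
        raw_count = int(first_token) if first_token.isdigit() else 0
        bad_sectors = row["name"] in WATCH_RAW and raw_count > 0
        if failed or at_threshold or bad_sectors:
            flagged.append(row["name"])
    return flagged


rows = parse_attributes(SAMPLE)
flagged = suspicious(rows)
```

On the rows above nothing is flagged, which matches the PASSED self-assessment. That only says the drive's own counters look clean, not that the drive is innocent.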

iBertus · 2006-07-12 18:35:28

Did you run a self test on the drive? If not, try that and see if it reports any errors. You haven't heard anything funny coming from the general area of the drive?

Gullible Jones · 2006-07-12 21:39:51

Hmm... Where the heck does the self-test dump its log? :?

iBertus · 2006-07-13 04:07:36

Try running smartctl again and see if the log is included in the output.

Gullible Jones · 2006-07-13 09:52:48

That's what I did... 'smartctl --test=long /dev/hda' (as root) and no output at all other than the "wait 23 minutes, blah blah blah" stuff. It also backgrounded itself, and I'm not sure whether it ended after 23 minutes or not. There were no logs left anywhere, not in /root, not in my home dir, not in /var/log... And I'm not seeing anything in the man page about where it records its output. I'll try looking again.

Bebo · 2006-07-13 10:04:37

The only logs I know of are available through smartctl -l selftest /dev/hdX. Not very much info there, but you have some test progress info, and whether the test was completed without finding errors. The same info is available in the smartctl -a /dev/hdX output. If errors were encountered I guess they would be shown in the WHEN_FAILED column for the attributes.

iBertus · 2006-07-13 14:14:28

Bebo wrote:

The only logs I know of are available through smartctl -l selftest /dev/hdX. Not very much info there, but you have some test progress info, and whether the test was completed without finding errors. The same info is available in the smartctl -a /dev/hdX output. If errors were encountered I guess they would be shown in the WHEN_FAILED column for the attributes.

That's what I meant. Maybe I should write more next time.

Gullible Jones · 2006-07-13 14:58:47

Ooohhhh... That was dumb of me. :oops:

Gullible Jones · 2006-07-13 15:28:22

Okay, here:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5157         -
# 2  Extended offline    Completed without error       00%      5152         -
# 3  Extended offline    Completed without error       00%      5147         -
# 4  Extended offline    Completed without error       00%      5146         -
# 5  Extended offline    Aborted by host               90%      5146         -

Apparently no errors, hopefully the HDD is fine... but the last line should be different. I didn't abort anything - I didn't use smartctl -X at any point and I waited for the test to complete. Is this a bug in smartctl?
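A self-test log like the one above can also be read with a short script. This is only a sketch: the regular expression assumes the column layout shown in this post (fields separated by two or more spaces), and the names `ROW`, `parse_selftest_log` and `unfinished` are made up for the example. Note that smartctl lists the most recent test first, so entry #5 is the oldest run.

```python
import re

# Log lines copied from the post above.
SELFTEST_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5157         -
# 2  Extended offline    Completed without error       00%      5152         -
# 3  Extended offline    Completed without error       00%      5147         -
# 4  Extended offline    Completed without error       00%      5146         -
# 5  Extended offline    Aborted by host               90%      5146         -
"""

# Assumption: description and status are each followed by at least two spaces.
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_log(text):
    """Return one dict per logged self-test, in the order smartctl printed them."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header or unrelated line
        num, description, status, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "description": description,
            "status": status,
            "remaining_pct": int(remaining),
            "lifetime_hours": int(hours),
            "lba_of_first_error": None if lba == "-" else lba,
        })
    return entries


entries = parse_selftest_log(SELFTEST_LOG)
# Entry numbers of tests that did not run to completion.
unfinished = [e["num"] for e in entries if e["status"] != "Completed without error"]
# Entry numbers of tests that reported a failing LBA.
with_errors = [e["num"] for e in entries if e["lba_of_first_error"] is not None]
```

Here `unfinished` contains only entry 5 and `with_errors` is empty: one early run was cut short, and none of the runs reported a bad LBA.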

Bebo · 2006-07-13 16:46:18

Don't know if it's a bug, but I have that message on one of my (two) boxes too, even though I'm pretty sure I never aborted anything.

Gullible Jones · 2006-07-18 18:27:00

Hmm, I just lost all of my Quod Libet config stuff again. I wonder if there are bad sectors on my drive or something.

(Grrr... Also lost my ToME savefile and its backup, both corrupted beyond all recognition.)

Jarsto · 2006-07-18 20:50:44

Well if smartctl isn't telling you anything you could use the appropriate fsck.???? to scan the file system for bad blocks. The only problem with this is that you can't do it on a mounted partition (you'll need a second installation or a livecd with the right tools to scan your root partition), but it should find the problem, if there is a bad block/sector.

The command I would use, if I wanted to scan my root partition, would be "fsck.ext3 -c -c -p /dev/sda2". The double -c means it does a more thorough (non-destructive read-write) test and -p means it does repairs automatically. These options are for ext3 though; if you use a different filesystem your fsck probably needs different options. You should probably check for fsck.ext3 as well before you actually use it.

Gullible Jones · 2006-07-19 01:26:39

I use JFS... Is there a way to use the jfs_fsck program to look for bad blocks? Because I can't find one, neither on the net nor in the man pages.

(You can check for bad blocks on filesystem creation but that's another thing entirely...)

Jarsto · 2006-07-19 06:38:39

I don't use JFS but like you I can't find any sign of a way to check bad blocks on it. One of the reasons I continue to use ext3 is that it usually has the best support for this sort of thing.

If you have the space on other partitions you could use partimage to make a backup of the partition as it is now. Then format it with a bad-blocks check to see if there are any. I'm not sure what would happen to the information about the bad-blocks when you put the image back onto the partition though.
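Since bad blocks are a property of the drive rather than of the filesystem, a plain read of the whole device is another filesystem-independent check. Below is a minimal Python sketch of that idea. The function name `read_scan` and the 1 MiB chunk size are arbitrary choices for the example, it only reads (no write test), and data already sitting in the page cache may be served without touching the disk, so treat it as a rough first pass and not as a replacement for a dedicated tool or the manufacturer's diagnostics.

```python
import os


def read_scan(path, chunk_size=1024 * 1024):
    """Read `path` from start to end and return the byte offsets of chunks that raised I/O errors."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and block devices alike.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                os.pread(fd, length, offset)  # positional read, data is discarded
            except OSError:
                bad_offsets.append(offset)  # remember the failing chunk and keep going
            offset += length
    finally:
        os.close(fd)
    return bad_offsets
```

Run as root, a call such as `read_scan("/dev/hda")` would return an empty list if every chunk could be read. A smaller `chunk_size` narrows down where a failure sits at the cost of a slower scan.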

Kern · 2006-07-19 07:50:42

have you considered trying a manufacturers drivetest bootdisk. this will check the drive regardless of OS or Fs.
list here -> http://www.ultimatebootcd.com/
or just d/l the entire disk and burn to CD.
its one of my 3 "always carry" CD's

Gullible Jones · 2006-07-19 23:02:02

Jarsto wrote:

I don't use JFS but like you I can't find any sign of a way to check bad blocks on it. One of the reasons I continue to use ext3 is that it usually has the best support for this sort of thing.
If you have the space on other partitions you could use partimage to make a backup of the partition as it is now. Then format it with a bad-blocks check to see if there are any. I'm not sure what would happen to the information about the bad-blocks when you put the image back onto the partition though.

Okay I guess I'll try that...

Kern wrote:

have you considered trying a manufacturers drivetest bootdisk. this will check the drive regardless of OS or Fs.
list here -> http://www.ultimatebootcd.com/
or just d/l the entire disk and burn to CD.
its one of my 3 "always carry" CD's

Thanks... I haven't a floppy or CD drive at the moment but I'll see what I can do about that. 8)

Gullible Jones · 2006-07-21 20:02:50

Well, jfs_mkfs and e2fsck both fail to find any bad blocks on either partition. It should be noted, though, that I had to reformat the partitions to do a bad block check - could that have some effect on the detection of bad blocks?

Jarsto · 2006-07-22 05:41:42

Gullible Jones wrote:

Well, jfs_mkfs and e2fsck both fail to find any bad blocks on either partition. It should be noted, though, that I had to reformat the partitions to do a bad block check - could that have some effect on the detection of bad blocks?

To the best of my knowledge it shouldn't. Bad blocks are a physical problem with the drive, whatever filesystem you're using.

Gullible Jones · 2006-07-22 13:20:36

Hmm... Well I got some info on other stuff that could explain it.

- Electrostatic discharges: drive was pulled from a lightning-fried machine, apparently undamaged (I figured that it would be either okay or dead, given the surge's magnitude), but I'm starting to wonder if something got a little scrambled. If so, though, it took a while to show up.

- Power supply issues: probably not, since the power supply had proven quite adequate for this hard drive previously, and for a similar hard drive, and I haven't observed any other symptoms of an inadequate PSU.

- Jolts during shipping: doesn't seem incredibly likely to me (and again, why would damage during shipping take so long to show up?).

Bysshe · 2006-07-22 17:17:31

Motherboard or RAM problems are possible too.

Gullible Jones · 2006-07-22 18:04:26

Damn it... Maybe that explains the 8 MB of RAM that vanished into thin air? :evil:

(And I just lost all my data. Phooey!)

Bysshe · 2006-07-22 18:28:43

Damn, dude. I hate that this is happening again. But when you get the chance to get down and dirty with it, and start running tests, be thorough with it. As in, don't just run memtest86, run prime95 or mprime (same thing). If it's the mobo or CPU, it's more likely to lock up the system. If it's the RAM, it's more likely to just list errors. Both/either can happen if it's the power supply. If you have more than one RAM stick, another PSU, whatever to plug into the board and cross-test individual parts, then by all means do. Worst thing you can do is buy something hoping it fixes it. You need a reasonable diagnosis beyond circumstantial evidence. Expect this to take some time, and as always, how do the capacitors on the motherboard look?

Gullible Jones · 2006-07-22 20:28:09

Caps looked fine last I checked (will check again), no vcore issues, no system lockups. (X-related lockups have occurred, but that was with AIGLX and OpenGL games, and pretty clearly not affecting the whole system - sound still played, etc.) No spontaneous reboots but I'm not ruling out the PSU, it's (supposedly) 250 watts and probably budget crap.

Memtest86 has never detected anything wrong, however I know that one stick of RAM is 8 MB short of what it used to be. I'll try mprime since you recommended it... Annoyingly no one has made a PKGBUILD for it but what the heck.

Gullible Jones · 2006-07-22 22:57:34

Okay, 2 hours 13 minutes with mprime -t, everything passed with no errors or warnings, CPU temperature 49 C at end of test. I'm not sure that's adequate for rooting out more subtle stuff though (6-24 hours is recommended) so I suppose I'll also run it through the night. As for the capacitors, they still look quite fine.

Kern · 2006-07-23 14:02:33

Gullible Jones wrote:

- Electrostatic discharges: drive was pulled from a lightning-fried machine, apparently undamaged (I figured that it would be either okay or dead, given the surge's magnitude), but I'm starting to wonder if something got a little scrambled. If so, though, it took a while to show up.

ESD damage is cumulative. In silicon terms, you can weaken the transistors (silicon junctions) over a period of time just with low-voltage discharges. They won't necessarily fail totally and immediately. It's like a weld being OK but the metal around it rotting away over time.

Re the CPU etc.: a year or so back miqorz had similar problems with random crashes, especially with any medium-to-high CPU load. It turned out he had changed something on the CPU (cooler, fan or whatever) and the silicon heat transfer paste wasn't applied uniformly. This led to hotspots and temperature-related failures.

Back to basics: if you run a HD test (manufacturer's test disk), it'll remove the drive from the equation. Then you can focus on other probable causes.

I recently had trouble accessing a Maxtor 200G. It worked OK when I made an image of it (I work in data recovery/forensics), but when I tried to wipe it as the owner requested, it failed with I/O problems.

The Maxtor hard drive test showed the drive was failing.
So... it worked OK to get an image, but screwed up when trying to write to it. Warranty job.

If you want to scrap the drive should it prove faulty, I can point you to some OSS utils and procedures should you need to recover data. These have worked on drives where the manufacturer's test confirmed drive failure. If you need to take it out of the forum, don't be afraid to private-message me. I'm happy to make time to help.
good luck

Kern

Arch Linux
