I'm starting to see some awfully weird stuff right now... after hard reboots due to X lockups (Sauerbraten fucking up), I seem to be losing data from config files that weren't being written to or used at the time of the reboot - .viminfo, for example, and the XML config stuff for Galeon. I also lost everything in my .xinitrc file after a normal, clean reboot. No SMART errors but I am getting rather suspicious. And no rootkits or trojans, AFAICT.
(Had this drive for only 2 years though, it shouldn't be failing now. Then again, it might have crashed before.)
Here's what smartctl --all has to say:
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar family
Device Model: WDC WD400BB-00JHA0
Serial Number: WD-WMAMC1539009
Firmware Version: 05.01C05
User Capacity: 40,020,664,320 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Jul 12 14:26:23 2006 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (1260) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  23) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0003   168   166   021    Pre-fail  Always   -           2591
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always   -           1443
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always   -           0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always   -           5145
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always   -           0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always   -           1443
194 Temperature_Celsius     0x0022   103   088   000    Old_age   Always   -           40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline  -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
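For anyone who would rather not eyeball the attribute table by hand, here is a minimal Python sketch (not part of smartmontools) that parses rows in that layout and flags the ones usually worth a second look. The watched IDs (5, 196, 197, 198) and the names parse_attributes and suspicious are my own assumptions for illustration; the sample rows are copied from the smartctl output in this post.

```python
import re
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    attr_type: str
    raw: str


# One row of the "Vendor Specific SMART Attributes with Thresholds" table.
# Whitespace-tolerant, so it accepts aligned and single-space layouts alike.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S.*?)\s*$"
)

# Assumption: these sector-related counters are the ones to watch for bad blocks.
WATCHED_RAW = {5, 196, 197, 198}

SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
194 Temperature_Celsius 0x0022 103 088 000 Old_age Always - 40
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
"""


def parse_attributes(text):
    """Return a SmartAttribute for every table row found in text."""
    attrs = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs.append(SmartAttribute(
                int(m[1]), m[2], int(m[3]), int(m[4]), int(m[5]), m[6], m[9]))
    return attrs


def suspicious(attrs):
    """Return (name, reason) pairs for attributes that look unhealthy."""
    flagged = []
    for a in attrs:
        first = a.raw.split()[0]
        raw_count = int(first) if first.isdigit() else 0
        if a.thresh > 0 and a.value <= a.thresh:
            flagged.append((a.name, "normalized value at or below threshold"))
        elif a.attr_id in WATCHED_RAW and raw_count > 0:
            flagged.append((a.name, "nonzero raw count"))
    return flagged


if __name__ == "__main__":
    for name, reason in suspicious(parse_attributes(SAMPLE)) or [("nothing", "flagged")]:
        print(name, reason)
```

A clean result here only says that SMART has not recorded anything; it does not rule out the drive, cabling, RAM or motherboard.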
Offline
Did you run a self test on the drive? If not, try that and see if it reports any errors. You haven't heard anything funny coming from the general area of the drive?
Offline
Hmm... Where the heck does the self-test dump its log? :?
Offline
Try running smartctl again and see if the log is included in the output.
Offline
That's what I did... 'smartctl --test=long /dev/hda' (as root) and no output at all other than the "wait 23 minutes, blah blah blah" stuff. It also backgrounded itself, and I'm not sure whether it ended after 23 minutes or not. There were no logs left anywhere, not in /root, not in my home dir, not in /var/log... And I'm not seeing anything in the man page about where it records its output, I'll try looking again.
Offline
The only logs I know of are available through smartctl -l selftest /dev/hdX. Not very much info there, but you have some test progress info, and whether the test was completed without finding errors. The same info is available in the smartctl -a /dev/hdX output. If errors were encountered I guess they would be shown in the WHEN_FAILED column for the attributes.
Offline
The only logs I know of are available through smartctl -l selftest /dev/hdX. Not very much info there, but you have some test progress info, and whether the test was completed without finding errors. The same info is available in the smartctl -a /dev/hdX output. If errors were encountered I guess they would be shown in the WHEN_FAILED column for the attributes.
That's what I meant. Maybe I should write more next time.
Offline
Ooohhhh... That was dumb of me. :oops:
Offline
Okay, here:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5157         -
# 2  Extended offline    Completed without error       00%      5152         -
# 3  Extended offline    Completed without error       00%      5147         -
# 4  Extended offline    Completed without error       00%      5146         -
# 5  Extended offline    Aborted by host               90%      5146         -
Apparently no errors, hopefully the HDD is fine... but the last line should be different. I didn't abort anything - I didn't use smartctl -X at any point and I waited for the test to complete. Is this a bug in smartctl?
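In case it helps anyone checking a longer log, here is a rough Python sketch that parses self-test log entries like the ones above and lists any that did not finish cleanly. It is only an illustration: the names parse_selftest_log and unclean are made up, the test description and status are kept as one text field rather than split into columns, and the sample entries are copied from this post.

```python
import re
from typing import NamedTuple, Optional


class SelfTest(NamedTuple):
    num: int
    text: str                      # test description plus status, as printed
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[str]  # None when the log shows "-"


# "# 1 Extended offline Completed without error 00% 5157 -"
ENTRY = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")

SAMPLE_LOG = """\
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 5157 -
# 2 Extended offline Completed without error 00% 5152 -
# 3 Extended offline Completed without error 00% 5147 -
# 4 Extended offline Completed without error 00% 5146 -
# 5 Extended offline Aborted by host 90% 5146 -
"""


def parse_selftest_log(text):
    """Return a SelfTest for every '# N ...' entry found in text."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue  # header or unrelated line
        lba = None if m[5] == "-" else m[5]
        entries.append(SelfTest(int(m[1]), m[2], int(m[3]), int(m[4]), lba))
    return entries


def unclean(entries):
    """Entries whose status is anything other than a clean completion."""
    return [e for e in entries if "Completed without error" not in e.text]


if __name__ == "__main__":
    for e in unclean(parse_selftest_log(SAMPLE_LOG)):
        print(f"#{e.num}: {e.text} ({e.remaining_pct}% remaining, hour {e.lifetime_hours})")
```

On the log above this picks out only entry 5, the aborted run, which was logged at the same power-on hour as the completed entry 4.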
Offline
Don't know if it's a bug, but I have that message on one of my (two) boxes too, even though I'm pretty sure I never aborted anything.
Offline
Hmm, I just lost all of my Quod Libet config stuff again. I wonder if there are bad sectors on my drive or something.
(Grrr... Also lost my ToME savefile and its backup, both corrupted beyond all recognition.)
Offline
Well if smartctl isn't telling you anything you could use the appropriate fsck.???? to scan the file system for bad blocks. The only problem with this is that you can't do it on a mounted partition (you'll need a second installation or a livecd with the right tools to scan your root partition), but it should find the problem, if there is a bad block/sector.
The command I would use, if I wanted to scan my root partition, would be "fsck.ext3 -c -c -p /dev/sda2". The double -c means it does a more thorough test (a non-destructive read-write scan instead of a read-only one) and -p means it does repairs automatically. These options are for ext3 though; if you use a different filesystem your fsck probably needs different options. You should probably check for fsck.ext3 as well before you actually use it.
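To make the "not on a mounted partition" point concrete, here is a small Python sketch that builds that same fsck.ext3 command line but refuses if the device shows up in /proc/mounts-style text. The function names are hypothetical, it only builds the command and never runs it, and the mount check is simplistic: a root filesystem listed as /dev/root or rootfs would slip through, so treat it as a sanity check rather than a guarantee.

```python
def mounted_devices(mounts_text):
    """Return the device names listed in /proc/mounts-style text."""
    devices = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            devices.add(fields[0])  # first field is the device
    return devices


def build_badblock_check(device, mounts_text):
    """Return the argv for an ext3 bad-block check, or raise if device is mounted.

    The flags are the ones from the post above: -c twice for the more
    thorough scan, -p for automatic repairs. They apply to ext2/ext3 only.
    """
    if device in mounted_devices(mounts_text):
        raise RuntimeError(
            f"{device} is mounted; run the check from a live CD or a second install")
    return ["fsck.ext3", "-c", "-c", "-p", device]
```

On a real system you would pass in the contents of /proc/mounts and, as root, hand the resulting list to something like subprocess.run; that part is left out on purpose so the sketch cannot touch a disk by accident.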
Jarsto
Offline
I use JFS... Is there a way to use the jfs_fsck program to look for bad blocks? Because I can't find one, either on the net or in the man pages.
(You can check for bad blocks on filesystem creation but that's another thing entirely...)
Offline
I don't use JFS but like you I can't find any sign of a way to check bad blocks on it. One of the reasons I continue to use ext3 is that it usually has the best support for this sort of thing.
If you have the space on other partitions you could use partimage to make a backup of the partition as it is now. Then format it with a bad-blocks check to see if there are any. I'm not sure what would happen to the information about the bad-blocks when you put the image back onto the partition though.
Jarsto
Offline
Have you considered trying a manufacturer's drive-test boot disk? This will check the drive regardless of OS or FS.
List here -> http://www.ultimatebootcd.com/
Or just d/l the entire disk and burn it to CD.
It's one of my 3 "always carry" CDs.
Offline
I don't use JFS but like you I can't find any sign of a way to check bad blocks on it. One of the reasons I continue to use ext3 is that it usually has the best support for this sort of thing.
If you have the space on other partitions you could use partimage to make a backup of the partition as it is now. Then format it with a bad-blocks check to see if there are any. I'm not sure what would happen to the information about the bad-blocks when you put the image back onto the partition though.
Okay I guess I'll try that...
Have you considered trying a manufacturer's drive-test boot disk? This will check the drive regardless of OS or FS.
List here -> http://www.ultimatebootcd.com/
Or just d/l the entire disk and burn it to CD.
It's one of my 3 "always carry" CDs.
Thanks... I haven't a floppy or CD drive at the moment but I'll see what I can do about that. 8)
Offline
Well, jfs_mkfs and e2fsck both fail to find any bad blocks on either partition. It should be noted, though, that I had to reformat the partitions to do a bad block check - could that have some effect on the detection of bad blocks?
Offline
Well, jfs_mkfs and e2fsck both fail to find any bad blocks on either partition. It should be noted, though, that I had to reformat the partitions to do a bad block check - could that have some effect on the detection of bad blocks?
To the best of my knowledge it shouldn't. Bad blocks are a physical problem with the drive, whatever filesystem you're using.
Jarsto
Offline
Hmm... Well I got some info on other stuff that could explain it.
- Electrostatic discharges: drive was pulled from a lightning-fried machine, apparently undamaged (I figured that it would be either okay or dead, given the surge's magnitude), but I'm starting to wonder if something got a little scrambled. If so, though, it took a while to show up.
- Power supply issues: probably not, since the power supply had proven quite adequate for this hard drive previously, and for a similar hard drive, and I haven't observed any other symptoms of an inadequate PSU.
- Jolts during shipping: doesn't seem incredibly likely to me (and again, why would damage during shipping take so long to show up?).
Offline
Motherboard and RAM are possible causes too.
Offline
Damn it... Maybe that explains the 8 MB of RAM that vanished into thin air? :evil:
(And I just lost all my data. Phooey!)
Offline
Damn, dude. I hate that this is happening again. But when you get the chance to get down and dirty with it and start running tests, be thorough. As in, don't just run memtest86, run prime95 or mprime (same thing) too. If it's the mobo or CPU, it's more likely to lock up the system. If it's the RAM, it's more likely to just list errors. Either can happen if it's the power supply. If you have more than one RAM stick, another PSU, or anything else you can plug into the board to cross-test parts individually, then by all means do. Worst thing you can do is buy something hoping it fixes it. You need a reasonable diagnosis beyond circumstantial evidence. Expect this to take some time, and as always, how do the capacitors on the motherboard look?
Offline
Caps looked fine last I checked (will check again), no vcore issues, no system lockups. (X-related lockups have occurred, but that was with AIGLX and OpenGL games, and pretty clearly not affecting the whole system - sound still played, etc.) No spontaneous reboots but I'm not ruling out the PSU, it's (supposedly) 250 watts and probably budget crap.
Memtest86 has never detected anything wrong, however I know that one stick of RAM is 8 MB short of what it used to be. I'll try mprime since you recommended it... Annoyingly no one has made a PKGBUILD for it but what the heck.
Offline
Okay, 2 hours 13 minutes with mprime -t, everything passed with no errors or warnings, CPU temperature 49 C at end of test. I'm not sure that's adequate for rooting out more subtle stuff though (6-24 hours is recommended) so I suppose I'll also run it through the night. As for the capacitors, they still look quite fine.
Offline
- Electrostatic discharges: drive was pulled from a lightning-fried machine, apparently undamaged (I figured that it would be either okay or dead, given the surge's magnitude), but I'm starting to wonder if something got a little scrambled. If so, though, it took a while to show up.
ESD damage is cumulative. In silicon terms, you can weaken the transistors (silicon junctions) over a period of time just with low-voltage discharges. They won't necessarily fail totally and immediately. It's like a weld being OK but the metal around it rotting away over time.
Re the CPU etc.: a year or so back miqorz had similar probs with random crashes, especially with any process using medium to high CPU. Turned out he had changed something on the CPU (cooler, fan or whatever) and the heat transfer paste wasn't applied uniformly. That led to hotspots, obviously, and to breakdowns as the temperature rose.
Back to basics: if you run a HD test (manufacturer's test disk), it'll remove the drive from the equation. Then you can focus on the other probables.
I recently had trouble accessing a Maxtor 200G. It worked OK when I made an image of it (I work in data recovery/forensics), but when I tried to wipe it as the owner requested, it failed with I/O problems.
The Maxtor hard drive test showed the drive was failing.
So... it worked OK to get an image, but screwed up when trying to write to it. Warranty job.
If you want to scrap the drive should it prove faulty, I can point you to some OSS utils and procedures should you need to recover data. These have worked on drives where the manufacturer's test confirmed drive failure. If you need to take it out of the forum, don't be afraid to priv msg. I'm happy to make time to help.
good luck
Kern
Offline