Dunno if this is in the right section, but since it involves hardware I thought I'd post it here.
I've noticed that my hard drive sometimes makes a loud noise, as if it's spinning up and then stopping, and it keeps repeating that cycle for a while. It has happened several times, and the last couple of times my whole home partition was remounted read-only; not even root could change it. I'm on a fresh install now, but I'm wondering if it will happen again.
Last time I was using ext4 partitions and XFCE; right now I'm using KDE and ext3 just to rule those out. Any ideas on what it may be? Perhaps my drive is failing or something. How can I check if it's my drive? I'm using a 40GB Western Digital disk btw.
Thanks for your help
Offline
All major drive manufacturers have test disks that can be downloaded, usually in CD format. They're self-contained, so the OS shouldn't be a factor. See if you can find one of those, and run the entire test suite. It'll take a bunch of hours to do a thorough scan, but it's worth the time.
Offline
If your kernel was remounting a filesystem read-only, that strongly suggests a bad drive, especially with the sounds you are hearing. If you run dmesg, there is a good chance you'll see a lot of I/O errors. It could also have been a badly damaged filesystem, but because you are hearing such sounds I'd guess the drive is failing.
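A rough sketch of how to look for those errors. The function name find_io_errors and the pattern list are my own assumptions, and the exact kernel wording varies between versions and drivers, so treat a miss as "not found by this filter" rather than "no errors".

```shell
# Filter kernel log text (read from stdin) for lines that commonly accompany
# a failing disk. The pattern list is an assumption, not an exhaustive set.
find_io_errors() {
    grep -iE 'I/O error|read-only|medium error'
}

# Typical use (may need root on some systems):
#   dmesg | find_io_errors
```

If this prints lines that name your disk (sda, hda, ...) together with a sector number, keep them; they are useful for the later SMART checks.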
Offline
Hmmm I have seen something similar (without the re-mount read only, but I was using windoze which doesn't know what that is anyway) but it was in an external box. I guess the power supply that came with the box just went postal and sometimes the HD just stopped and started and stopped again and kept going like that until a power cycle. The hard disk itself was working perfectly when connected inside my desktop pc. Check if the power supply in your pc isn't going postal too.
About the errors on disk .... when you sort out where the problem comes from, and if you find that the problem isn't the disk, then back up all your data and write the whole drive with zeros (dd will do that trick nicely). I have a laptop disk that had a few sectors that couldn't be read and I thought the disk was dying. Then I did that on the partition that was giving me trouble and everything is fine again (not even one reallocated sector ... or so the smart data says ... ).
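A minimal sketch of the zero fill, written so that it only prints the dd command instead of running it. The helper name zero_fill_cmd and the /dev/sdX placeholder are hypothetical; bs=1M is just a reasonable block size for GNU dd, not a required value.

```shell
# Print (do not run) the dd command that would overwrite a whole device
# with zeros. DESTRUCTIVE if executed: every partition on the device is lost.
zero_fill_cmd() {
    dev=$1    # whole-disk node, e.g. /dev/sdX (placeholder, not a real name)
    case $dev in
        /dev/*) ;;
        *) echo "refusing: '$dev' is not under /dev" >&2; return 1 ;;
    esac
    printf 'dd if=/dev/zero of=%s bs=1M\n' "$dev"
}

# Review the output, double-check the device name with fdisk -l, then run
# the printed command as root from a live CD so nothing on the disk is mounted:
#   zero_fill_cmd /dev/sdX
```

When run for real, dd stops with a "No space left on device" message once it reaches the end of the disk; that is expected here.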
R00KIE
Tm90aGluZyB0byBzZWUgaGVyZSwgbW92ZSBhbG9uZy4K
Offline
The hard disk of my 2-month-old laptop is playing tricks on me as well.
I had exactly the same symptoms: hearing the disk go through some cycles, then some bad filesystem errors, and the filesystems being remounted read-only.
I installed smartmontools (pacman -S smartmontools) and started to read some docs :
http://smartmontools.sourceforge.net/faq.html
http://smartmontools.sourceforge.net/badblockhowto.html
The first steps I would do are, assuming your disk is sda :
# smartctl -t short /dev/sda
see the time you need to wait for the test to complete (normally less than 2 minutes)
# smartctl -a /dev/sda
I did not have any errors here, so I ran the long one.
# smartctl -t long /dev/sda
see the time you need to wait for the test to complete (normally less than 2 hours)
# smartctl -a /dev/sda
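Instead of reading the whole -a output each time, the self-test log alone can be checked. This is only a sketch: the helper name latest_selftest_ok is made up, and it assumes the usual log layout where the newest test is the line starting with "# 1".

```shell
# Read a SMART self-test log on stdin and report the newest entry ("# 1").
# Exit status: 0 = completed without error, 1 = anything else,
# 2 = no log entry found.
latest_selftest_ok() {
    awk '
        /^# *1 / {
            found = 1
            print
            ok = /Completed without error/
            exit
        }
        END {
            if (!found) exit 2
            exit (ok ? 0 : 1)
        }
    '
}

# Typical use, once the test has had time to finish:
#   smartctl -l selftest /dev/sda | latest_selftest_ok && echo "last test passed"
```

A test that is still running does not show up as a new "# 1" entry yet, so wait for the polling time smartctl reports before trusting the result.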
Here I found some bad blocks, so I followed the badblockhowto above. Basically it's what R00KIE said: you just fill the bad blocks with zeros.
If the disk can read the sector of data a single time, and the damage is permanent, not transient, then the disk firmware will mark the sector as 'bad' and allocate a spare sector to replace it. But if the disk can't read the sector even once, then it won't reallocate the sector, in hopes of being able, at some time in the future, to read the data from it. A write to an unreadable (corrupted) sector will fix the problem. If the damage is transient, then new consistent data will be written to the sector. If the damage is permanent, then the write will force sector reallocation. Please see Bad block HOWTO for instructions about how to force this sector to reallocate (Linux only).
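The part of the HOWTO that needs some arithmetic is turning the failing LBA into a filesystem block number inside the affected partition. A small sketch of that step, assuming 512-byte sectors; the function name lba_to_fs_block is mine, and the partition start of 63 in the example is a made-up value, so read yours from fdisk -lu.

```shell
# Convert a failing LBA (512-byte sectors, as reported by smartctl) into a
# filesystem block number inside the partition that contains it.
#   $1 = failing LBA
#   $2 = first sector of the partition (from fdisk -lu)
#   $3 = filesystem block size in bytes (from tune2fs -l)
# Integer division rounds down, which is what is wanted here.
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Example with an assumed partition start of 63 and 4096-byte blocks:
#   lba_to_fs_block 48085936 63 4096    # prints 6010734
```

As far as I understand the HOWTO, that block number is then used to find out which file (if any) owns the block, and as the seek offset when writing a single zeroed block to the partition with dd.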
This fixed the problem temporarily, but one week later I had a new bad block, apparently the last one on the disk, and this time I could not fix it; I am not sure why. I formatted the whole disk (filling it with zeros). After that the smart long test did not show any errors, but the smart short test still did. Since all the bad blocks were at the end of the disk, I resized my disk from 320 GB to 250 GB, and now all the tests are fine. If it could stay that way, I would be happy.
Last edited by shining (2009-03-08 00:03:29)
pacman roulette : pacman -S $(pacman -Slq | LANG=C sort -R | head -n $((RANDOM % 10)))
Offline
Ok, I ran a couple of tests:
First I did a
grep -R DriveStatusError *
on /usr/src/linux-2.6.28-ARCH/ and nothing showed up. Then I did the long smart test and here's a couple of outputs
smartctl -a
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Protege
Device Model: WDC WD400EB-00CPF0
Serial Number: WD-WMAATF705942
Firmware Version: 06.04G06
User Capacity: 40,020,664,320 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 5
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Mar 7 18:07:06 2009 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 115) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (1754) seconds.
Offline data collection
capabilities: (0x3b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   199   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   104   089   021    Pre-fail  Always       -       2158
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       821
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1236
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       814
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   197   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0012   200   197   000    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
SMART Error Log Version: 1
ATA Error Count: 42 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 42 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 b0 79 7f cf f0 Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 b0 5b 7f cf f0 00 00:02:05.350 READ DMA
f8 00 00 00 00 00 f0 08 00:02:05.350 READ NATIVE MAX ADDRESS
ec 00 00 00 00 00 b0 0a 00:02:05.300 IDENTIFY DEVICE
ef 03 45 00 00 00 b0 0a 00:02:05.300 SET FEATURES [Set transfer mode]
f8 00 00 00 00 00 f0 08 00:02:05.300 READ NATIVE MAX ADDRESS
Error 41 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 b0 79 7f cf f0 Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 b0 5b 7f cf f0 00 00:02:02.050 READ DMA
f8 00 00 00 00 00 f0 08 00:02:02.050 READ NATIVE MAX ADDRESS
ec 00 00 00 00 00 b0 0a 00:02:02.050 IDENTIFY DEVICE
ef 03 45 00 00 00 b0 0a 00:02:02.050 SET FEATURES [Set transfer mode]
f8 00 00 00 00 00 f0 08 00:02:02.000 READ NATIVE MAX ADDRESS
Error 40 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 b0 79 7f cf f0 Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 b0 5b 7f cf f0 00 00:01:59.000 READ DMA
f8 00 00 00 00 00 f0 08 00:01:58.950 READ NATIVE MAX ADDRESS
ec 00 00 00 00 00 b0 0a 00:01:58.950 IDENTIFY DEVICE
ef 03 45 00 00 00 b0 0a 00:01:58.950 SET FEATURES [Set transfer mode]
f8 00 00 00 00 00 f0 08 00:01:58.950 READ NATIVE MAX ADDRESS
Error 39 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 b0 79 7f cf f0 Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 b0 5b 7f cf f0 00 00:01:55.750 READ DMA
f8 00 00 00 00 00 f0 08 00:01:55.700 READ NATIVE MAX ADDRESS
ec 00 00 00 00 00 b0 0a 00:01:55.700 IDENTIFY DEVICE
ef 03 45 00 00 00 b0 0a 00:01:55.700 SET FEATURES [Set transfer mode]
f8 00 00 00 00 00 f0 08 00:01:55.700 READ NATIVE MAX ADDRESS
Error 38 occurred at disk power-on lifetime: 125 hours (5 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 b0 79 7f cf f0 Error: UNC 176 sectors at LBA = 0x00cf7f79 = 13598585
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 b0 5b 7f cf f0 00 00:01:52.450 READ DMA
f8 00 00 00 00 00 f0 08 00:01:52.450 READ NATIVE MAX ADDRESS
ec 00 00 00 00 00 b0 0a 00:01:52.450 IDENTIFY DEVICE
ef 03 45 00 00 00 b0 0a 00:01:52.450 SET FEATURES [Set transfer mode]
f8 00 00 00 00 00 f0 08 00:01:52.450 READ NATIVE MAX ADDRESS
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       30%       143         48085936
# 2  Short offline       Completed without error       00%       140         -
Device does not support Selective Self Tests/Logging
So basically now I have to look for that block address and do what the guide says, right?
Offline
I find it strange that it reports the first error at 48 085 936, while it also says there is an error at 13 598 585, which should come before.
I would still do as R00KIE said, if possible. The guide will just tell you how to address only the reported issues, by writing zeros at specific places.
But it would be both safer and easier to just fill the whole drive with zeros.
Offline
How should I do that? Maybe with a LiveCD? I have a sidux LiveCD lying around.......I borrowed my Arch CD
Offline
You see these two
197 Current_Pending_Sector 0x0012 200 197 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0012 200 197 000 Old_age Always - 3
These are the buggers that are causing you trouble.
These are still zero
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
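To keep an eye on just these four counters, something like the sketch below can help. The function name sector_counters is hypothetical, and it assumes the classic attribute table layout where the raw value is the last field on the line.

```shell
# Read `smartctl -A` output on stdin and print "ID NAME RAW" for the four
# attributes discussed above (5, 196, 197, 198). Assumes the raw value is
# the last field on each line; vendor-specific layouts may differ.
sector_counters() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $1, $2, $NF }'
}

# Typical use (as root):
#   smartctl -A /dev/sda | sector_counters
```

Running it before and after a zero fill makes it easy to see whether 197 and 198 dropped back to 0 and whether 5 and 196 started to grow.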
Exactly the same happened to me. It was on a swap partition, so I just nuked it with zeros; the first two errors went away and no reallocations were done. Then, just to be sure, I nuked the whole drive with zeros and with test patterns .... can't remember which program I used .... there is one program that does that.
In case you start getting reallocations (too many that is) then consider contacting the manufacturer and ask how much is too much and if you should return the drive. A few reallocated sectors are ok, it should be somewhat expected but if they start increasing then better think of not trusting that drive.
R00KIE
Offline
Since this was a semi fresh install, just downloaded a couple of screencasts but I saved them to my USB, I decided to just zero fill the drive.
This drive was a gift from a friend who had an old HP sitting and gathering dust.
I think it was nice that this happened, since now I know how to use smartctl a little more and how to search for blocks and stuff.
Offline
Since this was a semi fresh install, just downloaded a couple of screencasts but I saved them to my USB, I decided to just zero fill the drive.
This drive was a gift from a friend who had an old HP sitting and gathering dust. I think it was nice that this happened, since now I know how to use smartctl a little more and how to search for blocks and stuff.
So you did it? and did it work?
You should have the following counters back to 0 :
197 Current_Pending_Sector 0x0012 200 197 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0012 200 197 000 Old_age Always - 3
And the short and long smartctl tests should pass again.
Offline
The other program whose name I couldn't remember is "badblocks". It can write a few patterns to the disk, then read them back and check that everything is ok. I leave it here for reference.
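A cautious sketch of how it can be invoked. The helper below only prints the command; its name badblocks_cmd, the mode names and the /dev/sdX placeholder are my own inventions, while -w, -n, -s and -v are regular badblocks options.

```shell
# Print (do not run) a badblocks command line for the given mode and device.
#   destructive : -w writes test patterns and reads them back, ERASING the device
#   keep-data   : -n non-destructive read-write test, slower but preserves data
# -s shows progress, -v is verbose.
badblocks_cmd() {
    mode=$1
    dev=$2
    case $mode in
        destructive) flags=-wsv ;;
        keep-data)   flags=-nsv ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    printf 'badblocks %s %s\n' "$flags" "$dev"
}

# Example:
#   badblocks_cmd destructive /dev/sdX    # prints: badblocks -wsv /dev/sdX
```

Either mode should only be run on an unmounted device, and the destructive one only after the data has been backed up.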
R00KIE
Offline