Hi all!
I've started developing this small application to test hard drives. In a nutshell, it's MHDD for Linux. While MHDD is a very good piece of software, it's quite old and has problems working with many new controllers.
For those of you who have never heard of that software, it works like this: by measuring how long it takes to read a group of sectors, it can detect reallocated blocks (especially ones that SMART doesn't report) or bad sectors waiting to happen (latent bad sectors).
By using this method you can confirm whether the hard disk is dying and SMART is misreporting, or whether some other part of the PC is misbehaving.
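Roughly, the idea looks like this (a rough Python sketch of the principle only, not the actual implementation; the real hdck re-reads suspicious blocks and does proper statistics instead of using a single fixed threshold):

import os, time

DEV = "/dev/sdX"      # example device; raw read access (root) required
SECTOR = 512
GROUP = 256           # sectors read and timed together
SLOW_MS = 10.0        # roughly one extra revolution on a 7200rpm drive

fd = os.open(DEV, os.O_RDONLY)
lba = 0
while True:
    start = time.perf_counter()
    data = os.read(fd, GROUP * SECTOR)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if not data:
        break
    if elapsed_ms > SLOW_MS:
        print("suspicious block at LBA %d-%d: %.2f ms" % (lba, lba + GROUP - 1, elapsed_ms))
    lba += GROUP
os.close(fd)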
The project is hosted on SourceForge: http://sourceforge.net/projects/hdck/
The PKGBUILD is already in the AUR: http://aur.archlinux.org/packages.php?ID=39584
Usage is quite simple: after installing, run the command below (as root, or as a user with raw read access to block devices) and wait.
hdck -f /dev/disk-to-test
To test drives over slow interfaces (like USB or FireWire) or when testing multiple HDDs sharing bandwidth, use ATA VERIFY command:
hdck --ata-verify -f /dev/disk-to-test
You may also want to write the output to a log file (it can be quite copious):
hdck -f /dev/disk-to-test --log /tmp/hdck.log
Or use quick mode if you're in a hurry:
hdck -f /dev/disk-to-test --quick
Additional options are described in the help message:
hdck --help
Any suggestions, questions, etc. are welcome.
Tomato
EDIT: spelling, new features
Last edited by tomato (2011-02-06 13:15:51)
sorry for my miserable english, it's my third language ; )
Offline
I will give this a try.
I just ran sudo hdsentinel and my first drive's "Estimated lifetime" has gone from the default "more than 1000 days" to "more than 984 days"... Very worrying.
Its power on time is only 840 days.
Offline
From what I can see, my program does what hdsentinel's surface scan does, though hdck is much more precise.
And sorry to hear that, but you should always remember that those are only estimates. Also, in most applications any diagnosis is based only on SMART data, which is unreliable.
Anyway, I look forward to any suggestions.
sorry for my miserable english, it's my third language ; )
Offline
Hmm, trying this on my 500GB 7200RPM drive and ETA for completion is 4 hours.
It is better to keep your mouth shut and be thought a fool than to open it and remove all doubt. (Mark Twain)
Offline
Unfortunately, that's the side effect of the convenience of not having to restart the PC and being able to keep working.
If you are in a hurry, you can try restarting to single-user mode and running with the --min-reads 1 flag. The results should be similar and the run should take about a third of the normal time.
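That is, something like (the device name is just an example):
hdck --min-reads 1 -f /dev/sda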
sorry for my miserable english, it's my third language ; )
Offline
I used --background, the ETA was 2 hours, based on 3 loops (min-reads), but it kept going after that.
I thought it would stop at 20 loops (max-reads) after almost 14 hours, but then
read 8704000 sectors, 82.815MiB/s (61.743MiB/s), 2.78% (667.59%), in 13:45:08, loop 21 of 3, expected time: 02:03:35
I stopped it there. It's also using a lot of RAM, 250MB when I stopped it.
Why so many loops?
Offline
The first three loops gather preliminary data and search for sectors that could be bad.
After this, all gathered times are compared, and if a sector behaved consistently (e.g. three reads took 3.23ms, 3.20ms and 3.27ms, or 10.56ms, 10.43ms and 10.5ms) it's considered certain and not re-checked. Sectors that behaved inconsistently (e.g. reads took 3.5ms, 10.56ms and 3.8ms) are considered uncertain and are re-read until they are statistically consistent.
At the moment a sector is considered certain when its standard deviation is below 0.5, or when it has more than 5 reads and the standard deviation after discarding the 20% highest and 20% lowest times is below 0.5.
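In other words, the rule boils down to something like this (a simplified Python sketch of the logic, not the actual code):

from statistics import stdev

def truncated(times, frac=0.2):
    # drop the 20% highest and 20% lowest samples (when enough remain)
    s = sorted(times)
    k = int(len(s) * frac)
    return s[k:len(s) - k] if len(s) - 2 * k >= 2 else s

def is_certain(times, threshold=0.5):
    if len(times) < 2:
        return False
    if stdev(times) < threshold:
        return True
    return len(times) > 5 and stdev(truncated(times)) < threshold

print(is_certain([3.23, 3.20, 3.27]))  # True: consistent, no more reads needed
print(is_certain([3.5, 10.56, 3.8]))   # False: inconsistent, gets re-read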
I described max-reads rather badly: the absolute maximum number of reads is max-reads + min-reads. I'll have to re-phrase the description or subtract min-reads from max-reads in the program.
It's quite strange, though, that it needed over 20 full re-reads of the drive. It should do that only when over 25% of all reads were uncertain or 10% were totally unsuccessful. Were you using this drive intensively during this period (torrenting, compiling, streaming, etc.)?
As for the 250MB of RAM usage: that's quite normal for average-sized drives.
A 500GB drive has 976562500 LBAs, that's 3814697 hdck blocks. Each read is saved in an 8-byte variable. Three reads times 8 bytes times the number of hdck blocks:
3*8*3814697 bytes ≈ 87MiB
And that's only raw data; there is a quite big metadata overhead on top of that (12 bytes per block).
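If you want to check the figures yourself (assuming 512-byte LBAs and 256 LBAs per hdck block):

lbas = 500 * 10**9 // 512     # 976562500 LBAs on a 500GB drive
blocks = lbas // 256          # 3814697 hdck blocks
raw = 3 * 8 * blocks          # three 8-byte samples per block
meta = 12 * blocks            # per-block metadata
print(raw / 2**20)            # ~87 MiB of raw samples
print(meta / 2**20)           # plus ~44 MiB of metadata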
EDIT: wording
EDIT2: specific numbers
Last edited by tomato (2010-08-11 18:10:48)
sorry for my miserable english, it's my third language ; )
Offline
Program updated: it is now a bit more readable and verbose by default. Also, max-reads is now the actual maximum number of re-reads.
sorry for my miserable english, it's my third language ; )
Offline
I tried hdck with my eeepc 900's two SSDs and it seems to work fine (about 20 minutes to complete both tests).
Is it supposed to work with this config or was I just lucky? Are there special tweaks needed? (For example, I'm thinking of the RPM speed, which is of course meaningless on SSDs.)
BTW, those two (extremely crappy) SSDs don't even support SMART, so hdck will probably become a must for me.
Last edited by goran'agar (2010-08-12 07:48:51)
Sony Vaio VPCM13M1E - Arch Linux - LXDE
Offline
Sorry to say it, but hdck is kind of useless for non-rotating media.
I don't have an SSD, not even a broken one, to test with. I did conduct tests on a few SD cards, including one test that led to the card's destruction. While I was able to see a difference between an SD card with many writes and one without, it wasn't anything really significant. The card with many erase cycles was more variable (the plot of read times was more jagged), while new cards show a relatively smooth line with few peaks.
So while I probably could extrapolate from the jaggedness to the wear level of the media, I simply don't have enough data to do so.
There is also an additional problem with flash media: even if I could detect reallocations, they are part of normal operation. As such, without insight into the inner workings of the drive, it's hard to tell whether its health is good or not.
HDDs are pretty simple in comparison: the logical layout as seen by the OS closely resembles the way data is laid out on the platters, and, thanks to the rotational delay, any re-read or reallocation done internally by the drive is clearly visible (a normal read is around 3ms, a reallocated/re-read one is around 10ms for 7200rpm media).
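The rotational-delay numbers behind those thresholds are easy to check:

for rpm in (5400, 7200, 15000):
    rev_ms = 60_000.0 / rpm   # time of one full revolution in ms
    print("%5d rpm: one revolution = %.1f ms" % (rpm, rev_ms))
# 5400 rpm: 11.1 ms, 7200 rpm: 8.3 ms, 15000 rpm: 4.0 ms
# a read that needs a second attempt waits roughly one extra revolution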
sorry for my miserable english, it's my third language ; )
Offline
Sorry to say it, but hdck is kind of useless for non-rotating media.
Oh, I see. Thanks for your explanations.
Sony Vaio VPCM13M1E - Arch Linux - LXDE
Offline
I tried hdck on one of my laptop's external USB drives (80 GB, 5400 rpm) in background mode. I stopped the test after 4:30 hours at the 7th loop... It kept restarting and wanting to scan the whole drive each time. I'm guessing this is "normal" behaviour, but I didn't want my HD to burn out, and I thought it would/should only rescan the sectors with a time difference... As it was, it seemed like it would never stop.
I first tried the normal non-background mode, but then I couldn't use my computer.
I wasn't doing anything special besides browsing the web and talking on Pidgin during the scan, but I had Transmission and aMule running too.
Offline
This is normal behaviour only when the drive is in intensive use (torrenting/aMule-ing definitely counts). The less the drive is used during the test, the faster it will finish.
It's best to unmount all partitions on the drive and run with -x; you'll get the best results in the fastest way possible.
Testing a drive in a USB enclosure also decreases the sensitivity of this application. Testing the drive over its native interface or eSATA is highly recommended.
You could also try running it with --nort and --nortio; this should give it a much lower impact on overall system performance.
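So for an unmounted drive the fastest way would be something like (device names are just examples):
umount /dev/sdb1
hdck -x -f /dev/sdb
and for a lower-impact run on a system you keep using:
hdck --nort --nortio -f /dev/sdb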
One more thing: background testing should be used only when you don't want the testing to influence the performance of the system, and at the cost of a much, much longer test (the default 20 max-reads is rather optimistic in this context).
Last edited by tomato (2010-08-21 14:58:04)
sorry for my miserable english, it's my third language ; )
Offline
This is normal behaviour only when the drive is in intensive use (torrenting/aMule-ing definitely counts). The less the drive is used during the test, the faster it will finish.
I forgot to mention it, but this disk wasn't actually used much, as aMule and all but one of my torrents were accessing other drives. And the one torrent was only uploading at a (probably) very slow pace (think less than 10kB/s).
Anyway... I might try again on an unmounted disk when I get a chance.
Offline
I've updated the program today (the PKGBUILD is already updated).
New features include:
printing a simple diagnosis message after testing
a shorter status message, printed on multiple lines (so it is readable on 80-column screens); this should also make errors and warnings more prominent
log output with basically the same messages as the ones shown on screen
comparing block times to the rotational delay derived from the supplied disk rpm, instead of to fixed values
I'm open to any suggestions of new features or changes to already implemented ones.
sorry for my miserable english, it's my third language ; )
Offline
A few tricks to reduce the number of interrupted reads:
I've found the three biggest disk writers on an otherwise idle system:
two are KDE specific: nepomuk and akonadi
one is system wide: nscd
So, if you're testing the drive your /home is on, disabling nepomuk and akonadi is a good idea.
If you're testing the drive your /tmp or /var is on, disable nscd.
Adding noatime or relatime to the partitions on the tested drive is a good idea too:
mount -o remount,noatime /
In my case, this reduces the number of interrupted blocks by an order of magnitude.
Additional writers can be discovered easily with iotop.
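For example, running something like this (as root, with the drive otherwise idle) shows which processes are still writing:
iotop -o -a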
New version: 0.3.0
New features include:
quick mode
This mode of operation changes the default min-reads, max-reads and std-dev to 1, 10 and 0.75. What's more, it makes the algorithm ignore blocks whose reads are all below twice the rotational delay (about 16ms for 7200rpm), as well as blocks that have one slower read and two faster reads (e.g. 25.56ms, 3.8ms, 3.5ms) as long as the maximum is still below 4 times the rotational delay (see the sketch at the end of this post).
All in all, it should make the test of a 1TB drive last around 4h (the first read pass alone takes around 3h 15min), not the 12h it takes now with just the 1, 10 and 0.75 settings.
more verbose status
Now the status shows, live, the number of samples and blocks below the thresholds and the number of interrupted reads:
hdck status:
============
Loop: 1 of 1
Progress: 2.87%, 2.87% total
Read: 56064000 sectors of 1000204795904
Speed: 121.408MiB/s, average: 94.588MiB/s
Elapsed time: 00:04:49
Expected time: 02:47:50
Samples: Blocks:
< 2.1ms: 190880 190788
< 4.2ms: 26886 26877
< 8.3ms: 13 13
<16.7ms: 661 642
<33.3ms: 130 113
<50.0ms: 201 28
>50.0ms: 26 175
ERR : 0 0
Intrrpt: 203 364
The samples column counts only samples from non-interrupted reads, not all performed reads (the blocks column works the same way); that's why the stats differ.
more detailed result
The result now also includes the ten slowest blocks, along with the standard deviation of their samples, fastest and slowest times, average time, truncated average (20%) and number of samples:
Worst blocks:
block no st.dev avg tr. avg max min valid samples
415 0.0244 3.10 3.10 3.13 3.09 yes 3
1 4.8017 4.67 4.67 10.21 1.89 yes 3
7 4.8043 4.67 4.67 10.22 1.90 yes 3
6 4.7995 4.67 4.67 10.21 1.90 yes 3
3 4.8039 4.68 4.68 10.22 1.90 yes 3
3218 4.7852 5.86 5.86 11.39 3.07 yes 3
182 15.0787 11.53 6.89 44.40 1.90 yes 7
181 18.0823 13.92 7.44 45.40 1.90 yes 5
0 1.0596 8.92 8.92 10.15 8.31 yes 3
129 0.0021 10.21 10.21 10.22 10.21 yes 3
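Going back to the quick mode: the new skip rules boil down to something like this (a simplified Python sketch, not the actual code):

def quick_settled(times_ms, rpm=7200):
    rev = 60_000.0 / rpm                 # one revolution, ~8.3 ms at 7200rpm
    if max(times_ms) < 2 * rev:          # everything well within the normal range
        return True
    slow = [t for t in times_ms if t >= 2 * rev]
    # one slow read out of three, and even that one below 4x the rotational delay
    return len(times_ms) == 3 and len(slow) == 1 and max(times_ms) < 4 * rev

print(quick_settled([3.1, 3.2, 3.3]))      # True: skipped in quick mode
print(quick_settled([25.56, 3.8, 3.5]))    # True: a single slow read is tolerated
print(quick_settled([45.0, 3.8, 38.0]))    # False: still gets re-read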
As always, new version is already in AUR.
Last edited by tomato (2010-08-25 16:24:03)
sorry for my miserable english, it's my third language ; )
Offline
What is the impact on results from not supplying the correct rpm?
The problem I am having is that each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000. I'm sure I have compounded the problem by trying to run it on four 1.5/2TB unmounted drives in parallel. I was also using the -o option, just expecting the final report to be output to file, not a 600MB file of block statistics. Please make that more clear.
Offline
What is the impact on results from not supplying the correct rpm?
Fairly minimal, unless you've got uncommon drives: the rotational delay for a 7200rpm drive is ~8ms, for 5400rpm it's ~11ms. It is 4ms for 15000rpm disks, though.
The problem I am having is that each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000. I'm sure I have compounded the problem by trying to run it on four 1.5/2TB unmounted drives in parallel. I was also using the -o option, just expecting the final report to be output to file, not a 600MB file of block statistics. Please make that more clear.
I don't remember in which version I added the feature, but the --log (-l) option does exactly that. The -o option was described as "detailed statistics" from the beginning. As for running it on many drives in parallel: you may be bottlenecked by your mainboard. I had an Intel board (I don't remember the specific model) that couldn't move more than 300MB/s from the disks; considering that a single 1.5TB drive can easily output 120MiB/s, 4 drives may well be limited by the mainboard.
If you use a PCI controller card you will be limited as well, as PCI has a limit of 133MiB/s (shared between all cards); a single-lane PCIe slot has a bigger limit of 250MiB/s.
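A quick sanity check of those numbers:

drives, per_drive = 4, 120    # MiB/s a single modern drive can stream
print(drives * per_drive)     # 480 MiB/s wanted by 4 drives in parallel
# vs ~300 MB/s through that south bridge, 133 MiB/s on shared PCI,
# 250 MiB/s on a single PCIe 1.x lane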
In short: try to test 1 drive at a time; if the problem persists, we will have to investigate.
I'm currently working on using an ATA-specific command (READ_VERIFY) that has much smaller bandwidth requirements; it will allow testing multiple drives on such mainboards/PCI RAID cards, or relatively fast disks in USB and FireWire enclosures.
sorry for my miserable english, it's my third language ; )
Offline
new version
The two main changes are: a switch to quantile-based detection of bad sectors, and a much quicker quick mode.
Still working on READ_VERIFY.
sorry for my miserable english, it's my third language ; )
Offline
You may be interested to know of a bug in Samsung F4 1.5/2TB drives that causes data loss if IDENTIFY_DEVICE is issued during a write.
http://sourceforge.net/apps/trac/smartm … GBadBlocks
One of the drives I was testing hdck with was an F4. It doesn't look like hdck uses this command, although I probably did use smartctl -a during testing to check the temperatures.
I wonder if this was my cause for "each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000."
Last edited by fphillips (2010-12-09 19:05:38)
Offline
You may be interested to know of a bug in Samsung F4 1.5/2TB drives that causes data loss if IDENTIFY_DEVICE is issued during a write.
http://sourceforge.net/apps/trac/smartm … GBadBlocks
One of the drives I was testing hdck with was an F4. It doesn't look like hdck uses this command, although I probably did use smartctl -a during testing to check the temperatures.
I wonder if this was my cause for "each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000."
It's unrelated; any activity that vastly increases I/O request times will increase the number of questionable blocks.
hdck is just made to cope with that, it doesn't use any magic/hackery to do it.
sorry for my miserable english, it's my third language ; )
Offline
New version: 0.5.0.
I've finally found some time to implement the READ_VERIFY ATA command. Testing drives connected over USB or FireWire is now supported using the --ata-verify switch.
This should also help if your south bridge has very limited bandwidth available and you want to test multiple drives at the same time, or when you're using port multipliers.
sorry for my miserable english, it's my third language ; )
Offline
The ending sector in the --bad-sectors log has an off-by-one error (you're adding 256 rather than 255):
[root@ion shm]# head hdck.bad
69380864 69381120
69422848 69423104
69434624 69434880
69554944 69555200
69717504 69717760
69774848 69775104
69853184 69853440
69864960 69865216
69916416 69916672
70003456 70003712
[root@ion shm]# grep ^block hdck.log |awk {'print $4'}|head
69380864-69381119)
69422848-69423103)
69434624-69434879)
69554944-69555199)
69717504-69717759)
69774848-69775103)
69853184-69853439)
69864960-69865215)
69916416-69916671)
70003456-70003711)
BTW, does it make sense to write the -l/w/o logs to /dev/shm, or do they not get written until after the reads are done? I know at least the beginning of the -l log is written at start.
Offline
hdck results:
=============
possible latent bad sectors or silent realocations:
block 0 (LBA: 0-255) rel std dev: -nan, avg: 1.91, valid: yes, samples: 1, 9th decile: 1.91
block 1 (LBA: 256-511) rel std dev: -nan, avg: 1.97, valid: yes, samples: 1, 9th decile: 1.97
block 360 (LBA: 92160-92415) rel std dev: 1.83, avg: 23.30, valid: yes, samples: 4, 9th decile: 61.75
block 335687 (LBA: 85935872-85936127) rel std dev: 0.00, avg: 11.68, valid: yes, samples: 4, 9th decile: 11.69
block 424246 (LBA: 108606976-108607231) rel std dev: 0.00, avg: 5.30, valid: yes, samples: 14, 9th decile: 9.95
block 424953 (LBA: 108787968-108788223) rel std dev: 0.00, avg: 12.70, valid: yes, samples: 20, 9th decile: 9.98
block 753235 (LBA: 192828160-192828415) rel std dev: 0.00, avg: 8.76, valid: yes, samples: 4, 9th decile: 8.76
block 776270 (LBA: 198725120-198725375) rel std dev: 0.00, avg: 6.36, valid: yes, samples: 11, 9th decile: 13.18
block 776499 (LBA: 198783744-198783999) rel std dev: 0.00, avg: 5.10, valid: yes, samples: 11, 9th decile: 15.38
block 776524 (LBA: 198790144-198790399) rel std dev: 0.00, avg: 14.17, valid: yes, samples: 20, 9th decile: 12.79
10 uncertain blocks found
wall time: 9514s.194ms.662µs.931ns
sum time: 8360s.892ms.600µs
tested 915787 blocks (0 errors, 2998032 samples)
mean block time: 0s.2ms.765µs
std dev: 0.736480554(ms)
Number of invalid blocks because of detected interrupted reads: 0
Number of interrupted reads: 717
Individual block statistics:
<2.08ms: 191895
<4.17ms: 672444
<8.33ms: 51109
<16.67ms: 338
<33.33ms: 0
<50.00ms: 0
>50.00ms: 1
ERR: 0
Worst blocks:
block no st.dev avg 1stQ med 3rdQ valid samples 9th decile
857937 7.9138 7.38 3.42 3.43 7.39 yes 4 14.51
719754 7.6771 7.61 3.76 3.77 7.61 yes 4 14.52
764318 8.2016 7.14 3.04 3.04 7.14 yes 4 14.52
840018 7.9457 7.40 3.43 3.43 7.40 yes 4 14.55
864737 8.1238 7.49 3.42 3.43 7.49 yes 4 14.80
828547 8.3258 7.38 3.22 3.22 7.38 yes 4 14.87
884235 8.2141 7.77 3.66 3.66 7.77 yes 4 15.16
776499 5.1851 5.10 3.04 3.04 3.04 yes 11 15.38
799599 8.2557 8.36 4.23 4.24 8.37 yes 4 15.79
360 42.7147 23.30 1.94 1.95 23.31 yes 4 61.75
Disk status: CRITICAL
CAUTION! Sectors that required more than 6 read attempts detected, drive may be ALREADY FAILING!
Doesn't look so good... What can I do about it? It's a laptop, btw, with a regular non-SSD HD.
Offline
The ending sector in the --bad-sectors log has an off-by-one error (you're adding 256 rather than 255):
[root@ion shm]# head hdck.bad
69380864 69381120
<snip>
[root@ion shm]# grep ^block hdck.log |awk {'print $4'}|head
69380864-69381119)
<snip>
It's not really an off-by-one error, more of a feature: one file specifies inclusive ranges and the other right-exclusive ranges.
I subtract 1 from the ending sector in the .log file, as that's the one meant to be read by humans; .bad is meant for the application.
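For example, taking the first block from your output (start LBA 69380864, 256 sectors):

start, n = 69380864, 256
print(start, start + n)                   # what .bad stores: [start, start+n)
print("%d-%d" % (start, start + n - 1))   # what .log prints: [start, start+n-1]
# 69380864 69381120  vs  69380864-69381119 -- the same 256 sectors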
BTW, does it make sense to write the -l/w/o logs to /dev/shm, or do they not get written until after the reads are done? I know at least the beginning of the -l log is written at start.
I don't remember for sure, but if you're testing the drive your OS is running from, then yes. The logs are not flushed to disk continuously, so if you have enough RAM it may look like they are only written at program exit.
sorry for my miserable english, it's my third language ; )
Offline