Hi all!
I've started developing this small application to test hard drives. In a nutshell, it's MHDD for Linux. While MHDD is a very good piece of software, it's quite old and has problems working with many new controllers.
For those of you who have never heard of that software, it works like this: by measuring how long it takes to read a group of sectors, it can detect reallocated blocks (especially ones that SMART doesn't report) or bad sectors waiting to happen (latent bad sectors).
By using this method you can confirm whether the hard disk is dying and SMART is misreporting, or whether some other part of the PC is misbehaving.
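Roughly, the idea looks like this (a rough Python sketch of the principle only, not the actual implementation; the real hdck re-reads suspicious blocks and does proper statistics instead of using a single fixed threshold):

import os, time

DEV = "/dev/sdX"      # example device; raw read access (root) required
SECTOR = 512
GROUP = 256           # sectors read and timed together
SLOW_MS = 10.0        # roughly one extra revolution on a 7200rpm drive

fd = os.open(DEV, os.O_RDONLY)
lba = 0
while True:
    start = time.perf_counter()
    data = os.read(fd, GROUP * SECTOR)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if not data:
        break
    if elapsed_ms > SLOW_MS:
        print("suspicious block at LBA %d-%d: %.2f ms" % (lba, lba + GROUP - 1, elapsed_ms))
    lba += GROUP
os.close(fd)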
The project is hosted on SourceForge: http://sourceforge.net/projects/hdck/
The PKGBUILD is already in the AUR: http://aur.archlinux.org/packages.php?ID=39584
Usage is quite simple: after installing, run the command below (as root, or as a user with raw read access to block devices) and wait.
hdck -f /dev/disk-to-test
To test drives over slow interfaces (like USB or FireWire) or when testing multiple HDDs sharing bandwidth, use ATA VERIFY command:
hdck --ata-verify -f /dev/disk-to-test
You may also want to write the output to a log file (it can be quite copious):
hdck -f /dev/disk-to-test --log /tmp/hdck.log
Or use quick mode if you're in a hurry:
hdck -f /dev/disk-to-test --quick
Additional options are described in the help message:
hdck --help
Any suggestions, questions, etc. are welcome.
Tomato
EDIT: spelling, new features
Last edited by tomato (2011-02-06 13:15:51)
sorry for my miserable english, it's my third language ; )
Offline
I will give this a try.
I just ran sudo hdsentinel and my first drive's "Estimated lifetime" has gone from the default "more than 1000 days" to "more than 984 days"... Very worrying.
Its power on time is only 840 days.
Offline
From what I can see, my program does what hdsentinel's surface scan does, though hdck is much more precise.
And sorry to hear that, but you should always remember that those are only estimates. Also, in most applications any diagnosis is based only on SMART data, which is unreliable.
Anyway, I look forward to any suggestions.
sorry for my miserable english, it's my third language ; )
Offline
Hmm, trying this on my 500GB 7200RPM drive and ETA for completion is 4 hours.
It is better to keep your mouth shut and be thought a fool than to open it and remove all doubt. (Mark Twain)
Offline
Unfortunately, that's the side effect of the convenience of not having to restart the PC and being able to keep working.
If you are in a hurry, you can try restarting to single-user mode and running with the --min-reads 1 flag. The results should be similar and the run should take about a third of the normal time.
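That is, something like (the device name is just an example):
hdck --min-reads 1 -f /dev/sda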
sorry for my miserable english, it's my third language ; )
Offline
I used --background, the ETA was 2 hours, based on 3 loops (min-reads), but it kept going after that.
I thought it would stop at 20 loops (max-reads) after almost 14 hours, but then
read 8704000 sectors, 82.815MiB/s (61.743MiB/s), 2.78% (667.59%), in 13:45:08, loop 21 of 3, expected time: 02:03:35
I stopped it there. It's also using a lot of RAM, 250MB when I stopped it.
Why so many loops?
Offline
The first three loops gather preliminary data and search for sectors that could be bad.
After this, all gathered times are compared, and if a sector behaved consistently (e.g. three reads took 3.23ms, 3.20ms and 3.27ms, or 10.56ms, 10.43ms and 10.5ms) it's considered certain and not re-checked. Sectors that behaved inconsistently (e.g. reads took 3.5ms, 10.56ms and 3.8ms) are considered uncertain and are re-read until they are statistically consistent.
At the moment a sector is considered certain when its standard deviation is below 0.5, or when it has more than 5 reads and the standard deviation after discarding the 20% highest and 20% lowest times is below 0.5.
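In other words, the rule boils down to something like this (a simplified Python sketch of the logic, not the actual code):

from statistics import stdev

def truncated(times, frac=0.2):
    # drop the 20% highest and 20% lowest samples (when enough remain)
    s = sorted(times)
    k = int(len(s) * frac)
    return s[k:len(s) - k] if len(s) - 2 * k >= 2 else s

def is_certain(times, threshold=0.5):
    if len(times) < 2:
        return False
    if stdev(times) < threshold:
        return True
    return len(times) > 5 and stdev(truncated(times)) < threshold

print(is_certain([3.23, 3.20, 3.27]))  # True: consistent, no more reads needed
print(is_certain([3.5, 10.56, 3.8]))   # False: inconsistent, gets re-read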
I described max-reads rather badly: the absolute maximum number of reads is max-reads + min-reads. I'll have to re-phrase the description or subtract min-reads from max-reads in the program.
It's quite strange, though, that it needed over 20 full re-reads of the drive. It should do that only when over 25% of all reads were uncertain or 10% were totally unsuccessful. Were you using this drive intensively during this period (torrenting, compiling, streaming, etc.)?
As for the 250MB of RAM usage: that's quite normal for average-sized drives.
A 500GB drive has 976562500 LBAs, that's 3814697 hdck blocks. Each read is saved in an 8-byte variable. Three reads times 8 bytes times the number of hdck blocks:
3*8*3814697 bytes ≈ 87MiB
And that's only raw data; there is a quite big metadata overhead on top of that (12 bytes per block).
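If you want to check the figures yourself (assuming 512-byte LBAs and 256 LBAs per hdck block):

lbas = 500 * 10**9 // 512     # 976562500 LBAs on a 500GB drive
blocks = lbas // 256          # 3814697 hdck blocks
raw = 3 * 8 * blocks          # three 8-byte samples per block
meta = 12 * blocks            # per-block metadata
print(raw / 2**20)            # ~87 MiB of raw samples
print(meta / 2**20)           # plus ~44 MiB of metadata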
EDIT: wording
EDIT2: specific numbers
Last edited by tomato (2010-08-11 18:10:48)
sorry for my miserable english, it's my third language ; )
Offline
Program updated: it is now a bit more readable and verbose by default. Also, max-reads is now the actual maximum number of re-reads.
sorry for my miserable english, it's my third language ; )
Offline
I tried hdck with my eeepc 900's two SSDs and it seems to work fine (about 20 minutes to complete both tests).
Is it supposed to work with this config or was I just lucky? Are there special tweaks needed? (For example, I'm thinking of the RPM speed, which is of course meaningless on SSDs.)
BTW, those two (extremely crappy) SSDs don't even support SMART, so hdck will probably become a must for me.
Last edited by goran'agar (2010-08-12 07:48:51)
Sony Vaio VPCM13M1E - Arch Linux - LXDE
Offline
Sorry to say it, but hdck is kind of useless for non-rotating media.
I don't have an SSD, not even a broken one, to test with. I did conduct tests on a few SD cards, including one test that led to the card's destruction. While I was able to see a difference between an SD card with many writes and one without, it wasn't anything really significant. The card with many erase cycles was more variable (the plot of read times was more jagged), while new cards show a relatively smooth line with few peaks.
So while I probably could extrapolate from the jaggedness to the wear level of the media, I simply don't have enough data to do so.
There is also an additional problem with flash media: even if I could detect reallocations, they are part of normal operation. As such, without insight into the inner workings of the drive, it's hard to tell whether its health is good or not.
HDDs are pretty simple in comparison: the logical layout as seen by the OS closely resembles the way data is laid out on the platters, and, thanks to the rotational delay, any re-read or reallocation done internally by the drive is clearly visible (a normal read is around 3ms, a reallocated/re-read one is around 10ms for 7200rpm media).
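The rotational-delay numbers behind those thresholds are easy to check:

for rpm in (5400, 7200, 15000):
    rev_ms = 60_000.0 / rpm   # time of one full revolution in ms
    print("%5d rpm: one revolution = %.1f ms" % (rpm, rev_ms))
# 5400 rpm: 11.1 ms, 7200 rpm: 8.3 ms, 15000 rpm: 4.0 ms
# a read that needs a second attempt waits roughly one extra revolution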
sorry for my miserable english, it's my third language ; )
Offline
Sorry to say it, but hdck is kind of useless for non-rotating media.
Oh, I see. Thanks for your explanations.
Sony Vaio VPCM13M1E - Arch Linux - LXDE
Offline
I tried hdck on one of my laptop's external USB drives (80 GB, 5400 rpm) in background mode. I stopped the test after 4:30 hours at the 7th loop... It kept restarting and wanting to scan the whole drive each time. I'm guessing this is "normal" behaviour, but I didn't want my HD to burn out, and I thought it would/should only rescan the sectors with a time difference... As it was, it seemed like it would never stop.
I first tried the normal non-background mode, but then I couldn't use my computer.
I wasn't doing anything special besides browsing the web and talking on Pidgin during the scan, but I had Transmission and aMule running too.
Offline
This is normal behaviour only when the drive is in intensive use (torrenting/aMule-ing definitely counts). The less the drive is used during the test, the faster it will finish.
It's best to unmount all partitions on the drive and run with -x; you'll get the best results in the fastest way possible.
Testing a drive in a USB enclosure also decreases the sensitivity of this application. Testing the drive over its native interface or eSATA is highly recommended.
You could also try running it with --nort and --nortio; this should give it a much lower impact on overall system performance.
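So for an unmounted drive the fastest way would be something like (device names are just examples):
umount /dev/sdb1
hdck -x -f /dev/sdb
and for a lower-impact run on a system you keep using:
hdck --nort --nortio -f /dev/sdb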
One more thing: background testing should be used only when you don't want the testing to influence the performance of the system, and at the cost of a much, much longer test (the default 20 max-reads is rather optimistic in this context).
Last edited by tomato (2010-08-21 14:58:04)
sorry for my miserable english, it's my third language ; )
Offline
This is normal behaviour only when the drive is in intensive use (torrenting/aMule-ing definitely counts). The less the drive is used during the test, the faster it will finish.
I forgot to mention it, but this disk wasn't actually used much, as aMule and all but one of my torrents were accessing other drives. And the one torrent was only uploading at a (probably) very slow pace (think less than 10kB/s).
Anyway... I might try again on an unmounted disk when I get a chance.
Offline
I've updated the program today (the PKGBUILD is already updated).
New features include:
printing a simple diagnosis message after testing
a shorter status message, printed on multiple lines (so it is readable on 80-column screens); this should also make errors and warnings more prominent
log output with basically the same messages as the ones shown on screen
comparing block times to the rotational delay derived from the supplied disk rpm, instead of to fixed values
I'm open to any suggestions of new features or changes to already implemented ones.
sorry for my miserable english, it's my third language ; )
Offline
A few tricks to reduce the number of interrupted reads:
I've found the three biggest disk writers on an otherwise idle system:
two are KDE specific: nepomuk and akonadi
one is system wide: nscd
So, if you're testing the drive your /home is on, disabling nepomuk and akonadi is a good idea.
If you're testing the drive your /tmp or /var is on, disable nscd.
Adding noatime or relatime to the partitions on the tested drive is a good idea too:
mount -o remount,noatime /
In my case, this reduces the number of interrupted blocks by an order of magnitude.
Additional writers can be discovered easily with iotop.
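For example, running something like this (as root, with the drive otherwise idle) shows which processes are still writing:
iotop -o -a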
New version: 0.3.0
New features include:
quick mode
This mode of operation changes the default min-reads, max-reads and std-dev to 1, 10 and 0.75. What's more, it makes the algorithm ignore blocks whose reads are all below twice the rotational delay (about 16ms for 7200rpm), as well as blocks that have one slower read and two faster reads (e.g. 25.56ms, 3.8ms, 3.5ms) as long as the maximum is still below 4 times the rotational delay (see the sketch at the end of this post).
All in all, it should make the test of a 1TB drive last around 4h (the first read pass alone takes around 3h 15min), not the 12h it takes now with just the 1, 10 and 0.75 settings.
more verbose status
Now the status shows, live, the number of samples and blocks below the thresholds and the number of interrupted reads:
hdck status:
============
Loop: 1 of 1
Progress: 2.87%, 2.87% total
Read: 56064000 sectors of 1000204795904
Speed: 121.408MiB/s, average: 94.588MiB/s
Elapsed time: 00:04:49
Expected time: 02:47:50
Samples: Blocks:
< 2.1ms: 190880 190788
< 4.2ms: 26886 26877
< 8.3ms: 13 13
<16.7ms: 661 642
<33.3ms: 130 113
<50.0ms: 201 28
>50.0ms: 26 175
ERR : 0 0
Intrrpt: 203 364
The samples column counts only samples from non-interrupted reads, not all performed reads (the blocks column works the same way); that's why the stats differ.
more detailed result
The result now also includes the ten slowest blocks, along with the standard deviation of their samples, fastest and slowest times, average time, truncated average (20%) and number of samples:
Worst blocks:
block no st.dev avg tr. avg max min valid samples
415 0.0244 3.10 3.10 3.13 3.09 yes 3
1 4.8017 4.67 4.67 10.21 1.89 yes 3
7 4.8043 4.67 4.67 10.22 1.90 yes 3
6 4.7995 4.67 4.67 10.21 1.90 yes 3
3 4.8039 4.68 4.68 10.22 1.90 yes 3
3218 4.7852 5.86 5.86 11.39 3.07 yes 3
182 15.0787 11.53 6.89 44.40 1.90 yes 7
181 18.0823 13.92 7.44 45.40 1.90 yes 5
0 1.0596 8.92 8.92 10.15 8.31 yes 3
129 0.0021 10.21 10.21 10.22 10.21 yes 3
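Going back to the quick mode: the new skip rules boil down to something like this (a simplified Python sketch, not the actual code):

def quick_settled(times_ms, rpm=7200):
    rev = 60_000.0 / rpm                 # one revolution, ~8.3 ms at 7200rpm
    if max(times_ms) < 2 * rev:          # everything well within the normal range
        return True
    slow = [t for t in times_ms if t >= 2 * rev]
    # one slow read out of three, and even that one below 4x the rotational delay
    return len(times_ms) == 3 and len(slow) == 1 and max(times_ms) < 4 * rev

print(quick_settled([3.1, 3.2, 3.3]))      # True: skipped in quick mode
print(quick_settled([25.56, 3.8, 3.5]))    # True: a single slow read is tolerated
print(quick_settled([45.0, 3.8, 38.0]))    # False: still gets re-read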
As always, new version is already in AUR.
Last edited by tomato (2010-08-25 16:24:03)
sorry for my miserable english, it's my third language ; )
Offline
What is the impact on results from not supplying the correct rpm?
The problem I am having is that each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000. I'm sure I have compounded the problem by trying to run it on four 1.5/2TB unmounted drives in parallel. I was also using the -o option, just expecting the final report to be output to file, not a 600MB file of block statistics. Please make that more clear.
Offline
What is the impact on results from not supplying the correct rpm?
Fairly minimal, unless you've got uncommon drives: the rotational delay for a 7200rpm drive is ~8ms, for 5400rpm it's ~11ms. It is 4ms for 15000rpm disks, though.
The problem I am having is that each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000. I'm sure I have compounded the problem by trying to run it on four 1.5/2TB unmounted drives in parallel. I was also using the -o option, just expecting the final report to be output to file, not a 600MB file of block statistics. Please make that more clear.
I don't remember in which version I added the feature, but the --log (-l) option does exactly that. The -o option was described as "detailed statistics" from the beginning. As for running it on many drives in parallel: you may be bottlenecked by your mainboard. I had an Intel board (I don't remember the specific model) that couldn't move more than 300MB/s from the disks; considering that a single 1.5TB drive can easily output 120MiB/s, 4 drives may well be limited by the mainboard.
If you use a PCI controller card you will be limited as well, as PCI has a limit of 133MiB/s (shared between all cards); a single-lane PCIe slot has a bigger limit of 250MiB/s.
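A quick sanity check of those numbers:

drives, per_drive = 4, 120    # MiB/s a single modern drive can stream
print(drives * per_drive)     # 480 MiB/s wanted by 4 drives in parallel
# vs ~300 MB/s through that south bridge, 133 MiB/s on shared PCI,
# 250 MiB/s on a single PCIe 1.x lane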
In short: try to test 1 drive at a time; if the problem persists, we will have to investigate.
I'm currently working on using an ATA-specific command (READ_VERIFY) that has much smaller bandwidth requirements; it will allow testing multiple drives on such mainboards/PCI RAID cards, or relatively fast disks in USB and FireWire enclosures.
sorry for my miserable english, it's my third language ; )
Offline
new version
The two main changes are: a switch to quantile-based detection of bad sectors, and a much quicker quick mode.
Still working on READ_VERIFY.
sorry for my miserable english, it's my third language ; )
Offline
You may be interested to know of a bug in Samsung F4 1.5/2TB drives that causes data loss if IDENTIFY_DEVICE is issued during a write.
http://sourceforge.net/apps/trac/smartm … GBadBlocks
One of the drives I was testing hdck with was an F4. It doesn't look like hdck uses this command, although I probably did use smartctl -a during testing to check the temperatures.
I wonder if this was my cause for "each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000."
Last edited by fphillips (2010-12-09 19:05:38)
Offline
You may be interested to know of a bug in Samsung F4 1.5/2TB drives that causes data loss if IDENTIFY_DEVICE is issued during a write.
http://sourceforge.net/apps/trac/smartm … GBadBlocks
One of the drives I was testing hdck with was an F4. It doesn't look like hdck uses this command, although I probably did use smartctl -a during testing to check the temperatures.
I wonder if this was my cause for "each time it finishes a run, it increases the number of questionable blocks to reread, from 7000 it has gone up toward 200000."
It's unrelated; any activity that vastly increases I/O request times will increase the number of questionable blocks.
hdck is just made to cope with that, it doesn't use any magic/hackery to do it.
sorry for my miserable english, it's my third language ; )
Offline
New version: 0.5.0.
I've finally found some time to implement the READ_VERIFY ATA command. Testing drives connected over USB or FireWire is now supported using the --ata-verify switch.
This should also help if your south bridge has very limited bandwidth available and you want to test multiple drives at the same time, or when you're using port multipliers.
sorry for my miserable english, it's my third language ; )
Offline
The ending sector in the --bad-sectors log has an off-by-one error (you're adding 256 rather than 255):
[root@ion shm]# head hdck.bad
69380864 69381120
69422848 69423104
69434624 69434880
69554944 69555200
69717504 69717760
69774848 69775104
69853184 69853440
69864960 69865216
69916416 69916672
70003456 70003712
[root@ion shm]# grep ^block hdck.log |awk {'print $4'}|head
69380864-69381119)
69422848-69423103)
69434624-69434879)
69554944-69555199)
69717504-69717759)
69774848-69775103)
69853184-69853439)
69864960-69865215)
69916416-69916671)
70003456-70003711)
BTW, does it make sense to write the -l/w/o logs to /dev/shm, or do they not get written until after the reads are done? I know at least the beginning of the -l log is written at start.
Offline
hdck results:
=============
possible latent bad sectors or silent realocations:
block 0 (LBA: 0-255) rel std dev: -nan, avg: 1.91, valid: yes, samples: 1, 9th decile: 1.91
block 1 (LBA: 256-511) rel std dev: -nan, avg: 1.97, valid: yes, samples: 1, 9th decile: 1.97
block 360 (LBA: 92160-92415) rel std dev: 1.83, avg: 23.30, valid: yes, samples: 4, 9th decile: 61.75
block 335687 (LBA: 85935872-85936127) rel std dev: 0.00, avg: 11.68, valid: yes, samples: 4, 9th decile: 11.69
block 424246 (LBA: 108606976-108607231) rel std dev: 0.00, avg: 5.30, valid: yes, samples: 14, 9th decile: 9.95
block 424953 (LBA: 108787968-108788223) rel std dev: 0.00, avg: 12.70, valid: yes, samples: 20, 9th decile: 9.98
block 753235 (LBA: 192828160-192828415) rel std dev: 0.00, avg: 8.76, valid: yes, samples: 4, 9th decile: 8.76
block 776270 (LBA: 198725120-198725375) rel std dev: 0.00, avg: 6.36, valid: yes, samples: 11, 9th decile: 13.18
block 776499 (LBA: 198783744-198783999) rel std dev: 0.00, avg: 5.10, valid: yes, samples: 11, 9th decile: 15.38
block 776524 (LBA: 198790144-198790399) rel std dev: 0.00, avg: 14.17, valid: yes, samples: 20, 9th decile: 12.79
10 uncertain blocks found
wall time: 9514s.194ms.662µs.931ns
sum time: 8360s.892ms.600µs
tested 915787 blocks (0 errors, 2998032 samples)
mean block time: 0s.2ms.765µs
std dev: 0.736480554(ms)
Number of invalid blocks because of detected interrupted reads: 0
Number of interrupted reads: 717
Individual block statistics:
<2.08ms: 191895
<4.17ms: 672444
<8.33ms: 51109
<16.67ms: 338
<33.33ms: 0
<50.00ms: 0
>50.00ms: 1
ERR: 0
Worst blocks:
block no st.dev avg 1stQ med 3rdQ valid samples 9th decile
857937 7.9138 7.38 3.42 3.43 7.39 yes 4 14.51
719754 7.6771 7.61 3.76 3.77 7.61 yes 4 14.52
764318 8.2016 7.14 3.04 3.04 7.14 yes 4 14.52
840018 7.9457 7.40 3.43 3.43 7.40 yes 4 14.55
864737 8.1238 7.49 3.42 3.43 7.49 yes 4 14.80
828547 8.3258 7.38 3.22 3.22 7.38 yes 4 14.87
884235 8.2141 7.77 3.66 3.66 7.77 yes 4 15.16
776499 5.1851 5.10 3.04 3.04 3.04 yes 11 15.38
799599 8.2557 8.36 4.23 4.24 8.37 yes 4 15.79
360 42.7147 23.30 1.94 1.95 23.31 yes 4 61.75
Disk status: CRITICAL
CAUTION! Sectors that required more than 6 read attempts detected, drive may be ALREADY FAILING!
Doesn't look so good... What can I do about it? It's a laptop, btw, with a regular non-SSD HD.
Offline
The ending sector in the --bad-sectors log has an off-by-one error (you're adding 256 rather than 255):
[root@ion shm]# head hdck.bad
69380864 69381120
<snip>
[root@ion shm]# grep ^block hdck.log |awk {'print $4'}|head
69380864-69381119)
<snip>
It's not really an off-by-one error, more of a feature: one file specifies inclusive ranges and the other right-exclusive ranges.
I subtract 1 from the ending sector in the .log file, as that's the one meant to be read by humans; .bad is meant for the application.
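For example, taking the first block from your output (start LBA 69380864, 256 sectors):

start, n = 69380864, 256
print(start, start + n)                   # what .bad stores: [start, start+n)
print("%d-%d" % (start, start + n - 1))   # what .log prints: [start, start+n-1]
# 69380864 69381120  vs  69380864-69381119 -- the same 256 sectors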
BTW, does it make sense to write the -l/w/o logs to /dev/shm, or do they not get written until after the reads are done? I know at least the beginning of the -l log is written at start.
I don't remember for sure, but if you're testing the drive your OS is running from, then yes. The logs are not flushed to disk continuously, so if you have enough RAM it may look like they are only written at program exit.
sorry for my miserable english, it's my third language ; )
Offline