Hey guys, first post here.
I thought I'd share this, since it was quite useful for me. It may not be the most elegant solution, but it worked.
So, I have about a terabyte of data spanning 51,000+ files. Some are larger than 4 GiB; many, if not most, are around 4 MiB. I have a relatively slow 1.6 GHz Athlon that doesn't really like md5-ing huge files, and I didn't need md5 anyway, just a size comparison. Unfortunately, none of the duplicate-file-finding tools I tried (including fslint, fdupes, dmerge, and dupseek) had an option to skip the md5/sha1 hashing, or if they did, I couldn't find it.
So I decided to use fdupes (it's nice and simple for my needs) and trick it out.
I renamed /usr/bin/md5sum as md5sum.old, and created /usr/bin/md5sum with the following content:
#!/bin/bash
du "$1" | awk '{print $1}' | md5sum.old
Yes, my bash skills are weak and there are probably better ways of doing this, but again, it worked in this situation. It takes the du output for the file that gets passed to md5sum, trims that down to just the size, and then md5's the size string, which is much faster and effectively compares by size.
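For reference, here's the same trick written as a shell function (a hedged sketch, assuming GNU coreutils; in the actual /usr/bin/md5sum replacement you'd pipe into the renamed md5sum.old instead of the real md5sum). It uses stat rather than du, since stat -c %s prints the exact byte count, whereas du reports allocated blocks and can disagree for sparse files:

```shell
#!/bin/bash
# Sketch of the size-hashing wrapper. sizesum hashes the file's exact
# byte count instead of its contents, so it is fast regardless of file
# size. In the wrapper script you would call md5sum.old here; the demo
# calls the real md5sum so it can run standalone.
sizesum() {
    stat -c %s "$1" | md5sum
}
```

Two files of the same size then produce the same "checksum," which is all fdupes needs for this use case.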
So now, you just run (e.g.):
fdupes -r /path/to/files > /path/to/duplicatefilelist
The -r is for recursive, and I'm redirecting it to a file because I know there will be a lot of dupes. If you don't have many, you could skip that.
Then fdupes scans the files, with a nice progress percentage so you can see how much is done. After it finishes, you'll want to remove the wrapper /usr/bin/md5sum and restore the original md5sum.
I hope this is useful to someone else, and please share if there's a better way of doing this.
Thanks!
Give
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
a try. It prints 'filename (full path) - filesize (in bytes)' for the files that have the same size and sorts them with the biggest ones first. If only one file has e.g. 12345678 bytes, it won't get printed. If at least two files are of the same size, all of them will get printed.
I see that a bunch of small files like metadata ones have the same size, but for big multimedia files it should be OK.
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
Aha! Much faster and more elegant. I knew y'all would give me something better.
Thanks, that's very good.
Ah, I find that md5s aren't that slow after all (yes, I realize that rmlint is smarter than simply md5'ing every file). rmlint is very nice, thank you!
One thing though...can I specify a "preferred" dir (when passing multiple dirs to rmlint), so that it has the "originals," and the "duplicates" are from the other directories?
Thanks.
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
Use sort before uniq: find . -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1
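To see why the order matters: uniq only collapses (or, with -D, reports) adjacent identical lines, so equal sizes have to be grouped by sort first. A quick demo of the corrected pipeline, using scratch files made up for illustration (assumes GNU find/sort/uniq, and paths without spaces, since uniq -f1 splits fields on blanks):

```shell
# Three files, two with the same size. The pipeline should print only
# the equal-size pair; the unique 5-byte file is filtered out.
dir=$(mktemp -d)
printf 'aaaa'  > "$dir/a"   # 4 bytes
printf 'bbbb'  > "$dir/b"   # 4 bytes
printf 'ccccc' > "$dir/c"   # 5 bytes: unique size
# sort groups equal sizes together; uniq -D -f1 skips the path field
# and prints every line whose size field repeats.
find "$dir" -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1
```

Without the sort in front, same-size files that happen to be listed far apart by find would never be seen as duplicates.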
karol wrote:
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
Use sort before uniq: find . -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1
Errors like that make me cringe. Good thing I'm not an airline pilot.
Thanks.
...can I specify a "preferred" dir (when passing multiple dirs to rmlint), so that it has the "originals," and the "duplicates" are from the other directories?
I'm afraid not. But as you're not the first person to ask for this, I might add it if I have the time.
For now you're gonna have to parse the logfile to achieve this (which should be easy enough).
For now you gonna have to parse the logfile to achieve this (which should be easy enough).
I'm afraid this is beyond my current skills, but running the script will be fine for my use, as I'll be rearranging the files anyway.
Thanks again.
SahibBommelig: Great work on rmlint!
offtopic
SahibBommelig: Great work on rmlint!
+1, really, this is amazing!
I see there is a PKGBUILD in the tarball. Can we expect an AUR upload soon?
Sorry to necrobump, but this thread's pipe solution was nowhere else to be found on the web without actual programming. Putting bbs.archlinux.org into Google with the appropriate search terms found this, which saved me an hour yesterday and which I've saved locally for re-use. Thanks!
Sorry to necrobump
Then why did you do it?
Closing this ancient thread.