
#1 2011-03-20 02:59:46

bencahill
Member
Registered: 2011-03-20
Posts: 40

How to find file duplicates only matching size, not md5

Hey guys, first post here. :)

I thought I'd share this, since it was quite useful for me. It may not be the most elegant solution, but it worked for me :).

So, I have a terabyte (or so) of data, spanning 51,000+ files. Some are greater than 4GiB, many/most are ~4MiB. I have a (relatively) slow 1.6GHz Athlon, which doesn't really like md5-ing huge files :P, and I didn't need md5, just a size comparison. Unfortunately, none of the duplicate-file-finding tools I tried (including fslint, fdupes, dmerge, and dupseek) had an option to skip hashing (md5, sha1, or otherwise), or at least I did not find one :).

So I decided to use fdupes (it's nice and simple for my needs), and trick it out :D.

I renamed /usr/bin/md5sum as md5sum.old, and created /usr/bin/md5sum with the following content:

#!/bin/bash

du "$1" | awk '{print $1}' | md5sum.old

Yes, my bash skills suck, and there are probably better ways of doing it, but again, it worked in this situation ;). It basically takes the du output for the file passed to md5sum, keeps just the size column, and then md5's that number, which is much faster and depends only on the size.
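A note on the wrapper above: by default du reports disk usage in blocks rather than the exact length in bytes, so two files of slightly different length can end up with the same number. A minimal sketch of a byte-exact variant, assuming GNU stat is available; size_key is a made-up helper name, not part of fdupes or coreutils:

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): print a file's exact
# size in bytes. Assumes GNU coreutils stat; the quotes keep paths with
# spaces intact, and "--" protects against names starting with a dash.
size_key() {
    stat -c %s -- "$1"
}
```

Inside the wrapper, size_key "$1" | md5sum.old would stand in for the du/awk pipeline.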

So now, you just run (e.g.):

fdupes -r /path/to/files > /path/to/duplicatefilelist

The -r is for recursive, and I'm redirecting it to a file because I know there will be a lot of dupes. If you don't have many, you could skip that.

fdupes then counts the files and shows a nice progress percentage so you can see how much is done. Once it finishes, you'd want to remove the fake /usr/bin/md5sum and restore the old md5sum ;).

I hope this is useful to someone else, and please share if there's a better way of doing this.

Thanks!


#2 2011-03-20 04:16:33

karol
Archivist
Registered: 2009-05-06
Posts: 25,423

Re: How to find file duplicates only matching size, not md5

Give

find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3

a try. It prints 'filename (full path) - filesize (in bytes)' for the files that have the same size and sorts them with the biggest ones first. If only one file has e.g. 12345678 bytes, it won't get printed. If at least two files are of the same size, all of them will get printed.

I see that a bunch of small files like metadata ones have the same size, but for big multimedia files it should be OK.


#3 2011-03-20 18:47:23

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: How to find file duplicates only matching size, not md5

You might also want to try rmlint (self-advertisement...) or rdfind, both of which can find twins much faster than fdupes.

If you really only want to compare sizes, you're better off with karol's method.


#4 2011-03-20 19:49:19

bencahill
Member
Registered: 2011-03-20
Posts: 40

Re: How to find file duplicates only matching size, not md5

karol wrote:
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3

Aha! Much faster and more elegant. I knew y'all would give me something better :D.

Thanks, that's very good. :)


#5 2011-03-20 19:52:44

bencahill
Member
Registered: 2011-03-20
Posts: 40

Re: How to find file duplicates only matching size, not md5

SahibBommelig wrote:

You might also want to try rmlint (self-advertisement...) or rdfind, both of which can find twins much faster than fdupes.

If you really only want to compare sizes, you're better off with karol's method.

Ah, I find that md5s aren't that slow after all :D (yes, I realize that rmlint is smarter than simply md5'ing every file ;) ). rmlint is very nice, thank you!

One thing though...can I specify a "preferred" dir (when passing multiple dirs to rmlint), so that it has the "originals," and the "duplicates" are from the other directories?

Thanks.


#6 2011-03-20 20:02:55

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

Re: How to find file duplicates only matching size, not md5

karol wrote:
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3

Use sort before uniq: find . -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1
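One caveat with this pipeline: it splits on whitespace, so a path containing spaces can throw off both the -f1 skip and the -k3 sort key. A sketch of one way around that, assuming GNU find and awk; same_size_files is a hypothetical helper name, and it prints the size first, separated from the path by a tab:

```shell
#!/bin/bash
# Hypothetical helper: list files under a directory (default ".") that share
# their byte size with at least one other file, largest first, one
# "size<TAB>path" line per file. Paths containing newlines are not handled.
same_size_files() {
    find "${1:-.}" -type f -printf '%s\t%p\n' |
        sort -rn |
        awk -F'\t' '
            NR == 1 || $1 != prev {   # first line of a new size group
                prev = $1
                held = $0             # hold it back until a twin shows up
                next
            }
            held != "" {              # second member seen: flush the held line
                print held
                held = ""
            }
            { print }                 # print every further member of the group
        '
}
```

Running same_size_files /path/to/files > /path/to/duplicatefilelist gives a list comparable to the one-liner, with sizes in the first column.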


#7 2011-03-20 20:11:57

karol
Archivist
Registered: 2009-05-06
Posts: 25,423

Re: How to find file duplicates only matching size, not md5

Procyon wrote:
karol wrote:
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3

Use sort before uniq: find . -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1

Errors like that make me cringe. Good thing I'm not an airline pilot.

Thanks.


#8 2011-03-20 20:15:33

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: How to find file duplicates only matching size, not md5

bencahill wrote:

...can I specify a "preferred" dir (when passing multiple dirs to rmlint), so that it has the "originals," and the "duplicates" are from the other directories?

I'm afraid not. But as you're not the first person to ask for this, I might add it if I have the time. :)
For now you're going to have to parse the logfile to achieve this (which should be easy enough).


#9 2011-03-20 20:29:26

bencahill
Member
Registered: 2011-03-20
Posts: 40

Re: How to find file duplicates only matching size, not md5

SahibBommelig wrote:

For now you're going to have to parse the logfile to achieve this (which should be easy enough).

I'm afraid this is beyond my current skills, but running the script will be fine for my use, as I'll be re-arranging files anyway.

Thanks again. :)


#10 2011-03-21 11:12:14

Grinch
Member
Registered: 2010-11-07
Posts: 265

Re: How to find file duplicates only matching size, not md5

SahibBommelig: Great work on rmlint!


#11 2011-03-21 12:10:31

SanskritFritz
Member
From: Budapest, Hungary
Registered: 2009-01-08
Posts: 1,600

Re: How to find file duplicates only matching size, not md5

offtopic

Grinch wrote:

SahibBommelig: Great work on rmlint!

+1, really, this is amazing!
I see there is a PKGBUILD in the tarball. Can we expect an AUR upload soon? :)


zʇıɹɟʇıɹʞsuɐs AUR || Cycling in Budapest with a helmet camera || Revised log levels proposal: "FYI" "WTF" and "OMG" (John Barnette)


#12 2011-03-21 15:15:30

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: How to find file duplicates only matching size, not md5

also offtopic:

In AUR now. Please continue any discussion here.


#13 2011-03-27 11:09:43

karol
Archivist
Registered: 2009-05-06
Posts: 25,423

Re: How to find file duplicates only matching size, not md5
