Hey guys, first post here.
I thought I'd share this, since it was quite useful for me. It may not be the most elegant solution, but it worked.
So, I have about a terabyte of data spanning 51,000+ files. Some are larger than 4 GiB; many, if not most, are around 4 MiB. I have a relatively slow 1.6 GHz Athlon that doesn't really like md5-ing huge files, and I didn't need md5 anyway, just a size comparison. Unfortunately, none of the duplicate-file-finding tools I tried (including fslint, fdupes, dmerge, and dupseek) had an option to skip the md5/sha1 hashing, or if they did, I couldn't find it.
So I decided to use fdupes (it's nice and simple for my needs) and trick it out.
I renamed /usr/bin/md5sum as md5sum.old, and created /usr/bin/md5sum with the following content:
#!/bin/bash
du "$1" | awk '{print $1}' | md5sum.old
Yes, my bash skills are weak and there are probably better ways of doing this, but again, it worked in this situation. It takes the du output for the file that gets passed to md5sum, trims that down to just the size, and then md5's the size string, which is much faster and effectively compares by size.
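For reference, here's the same trick written as a shell function (a hedged sketch, assuming GNU coreutils; in the actual /usr/bin/md5sum replacement you'd pipe into the renamed md5sum.old instead of the real md5sum). It uses stat rather than du, since stat -c %s prints the exact byte count, whereas du reports allocated blocks and can disagree for sparse files:

```shell
#!/bin/bash
# Sketch of the size-hashing wrapper. sizesum hashes the file's exact
# byte count instead of its contents, so it is fast regardless of file
# size. In the wrapper script you would call md5sum.old here; the demo
# calls the real md5sum so it can run standalone.
sizesum() {
    stat -c %s "$1" | md5sum
}
```

Two files of the same size then produce the same "checksum," which is all fdupes needs for this use case.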
So now, you just run (e.g.):
fdupes -r /path/to/files > /path/to/duplicatefilelist
The -r is for recursive, and I'm redirecting it to a file because I know there will be a lot of dupes. If you don't have many, you could skip that.
Then fdupes scans the files, with a nice progress percentage so you can see how much is done. After it finishes, you'll want to remove the wrapper /usr/bin/md5sum and restore the original md5sum.
I hope this is useful to someone else, and please share if there's a better way of doing this.
Thanks!
Give
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
a try. It prints 'filename (full path) - filesize (in bytes)' for the files that have the same size and sorts them with the biggest ones first. If only one file has e.g. 12345678 bytes, it won't get printed. If at least two files are of the same size, all of them will get printed.
I see that a bunch of small files like metadata ones have the same size, but for big multimedia files it should be OK.
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
Aha! Much faster and more elegant. I knew y'all would give me something better.
Thanks, that's very good.
Ah, I find that md5s aren't that slow after all (yes, I realize that rmlint is smarter than simply md5'ing every file). rmlint is very nice, thank you!
One thing though...can I specify a "preferred" dir (when passing multiple dirs to rmlint), so that it has the "originals," and the "duplicates" are from the other directories?
Thanks.
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
Use sort before uniq: find . -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1
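To see why the order matters: uniq only collapses (or, with -D, reports) adjacent identical lines, so equal sizes have to be grouped by sort first. A quick demo of the corrected pipeline, using scratch files made up for illustration (assumes GNU find/sort/uniq, and paths without spaces, since uniq -f1 splits fields on blanks):

```shell
# Three files, two with the same size. The pipeline should print only
# the equal-size pair; the unique 5-byte file is filtered out.
dir=$(mktemp -d)
printf 'aaaa'  > "$dir/a"   # 4 bytes
printf 'bbbb'  > "$dir/b"   # 4 bytes
printf 'ccccc' > "$dir/c"   # 5 bytes: unique size
# sort groups equal sizes together; uniq -D -f1 skips the path field
# and prints every line whose size field repeats.
find "$dir" -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1
```

Without the sort in front, same-size files that happen to be listed far apart by find would never be seen as duplicates.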
karol wrote:
find . -type f -printf "%p - %s\n" | uniq -D -f1 | sort -nr -k3
Use sort before uniq: find . -type f -printf "%p - %s\n" | sort -nr -k3 | uniq -D -f1
Errors like that make me cringe. Good thing I'm not an airline pilot.
Thanks.
...can I specify a "preferred" dir (when passing multiple dirs to rmlint), so that it has the "originals," and the "duplicates" are from the other directories?
I'm afraid not. But as you're not the first person to ask for this, I might add it if I have the time.
For now you're gonna have to parse the logfile to achieve this (which should be easy enough).
For now you gonna have to parse the logfile to achieve this (which should be easy enough).
I'm afraid this is beyond my current skills, but running the script will be fine for my use, as I'll be rearranging the files anyway.
Thanks again.
SahibBommelig: Great work on rmlint!
offtopic
SahibBommelig: Great work on rmlint!
+1, really, this is amazing!
I see there is a PKGBUILD in the tarball. Can we expect an AUR upload soon?
Sorry to necrobump, but this thread's pipe solution was nowhere else to be found on the web without actual programming. Putting bbs.archlinux.org into Google with the appropriate search terms found this, which saved me an hour yesterday and which I've saved locally for re-use. Thanks!
Sorry to necrobump
Then why did you do it?
Closing this ancient thread.