
#1 2005-09-23 15:06:33

bardo
Member
Registered: 2004-12-06
Posts: 96

Dupe finder

...and finally I bought that Really Big Drive that let me back up my files, throw away debian and create another Arch machine. Thus, damocles was born smile

The only problem is, I just realized that on the RBD I have backups from many previous systems: that debian, two gentoos, even some win boxes (oh, my...), each of them containing even older backups of previous OSs. So I have a large number of files that are potentially replicated on the fs, and I'm looking for a good (and fast... 170 GB...) dupe finder to get rid of all the crap I don't need to keep anymore...
A quick glance at the repos didn't help me, but I'm able to create packages, so if you just know a name there's no problem, and I'll make it available for everyone on the AUR wink


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world


#2 2005-09-23 18:53:29

paranoos
Member
From: thornhill.on.ca
Registered: 2004-07-22
Posts: 442

Re: Dupe finder

why do you need to find duplicates? don't you need to just keep the most recent backup?


#3 2005-09-23 20:00:30

bardo
Member
Registered: 2004-12-06
Posts: 96

Re: Dupe finder

paranoos wrote:

why do you need to find duplicates? don't you need to just keep the most recent backup?

Well, my backups come from very different sources, some are backups of backups, and some things have been passed from some desktops to the laptop, so there surely are many dupes around.

A fortune told me: «God saves, but Buddha makes incremental backups». I just saved smile


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world


#4 2005-09-23 22:02:52

alterkacker
Member
From: Peoples Republic of Boulder
Registered: 2005-01-08
Posts: 52

Re: Dupe finder

I don't have anything like that at hand, but if I had to write it myself I would use a scripting language (Perl, Python, Ruby) to recursively descend through all the backup directories, compute the MD5 hash of every file encountered, and use the hash as the key in an associative array (a.k.a. hash or dictionary) whose value is the path to the file. Then, when another file hashes to an existing entry, print out the paths of the duplicate files.

I don't know how fast this would be for 170GB (surely you jest!) but it would be pretty easy to code and would get the job done.
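The approach above would look something like this in Python (a rough sketch, untested against anything near 170GB):

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(root):
    """Walk root, hash every file with MD5, and group paths by hash."""
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            # read in chunks so huge files don't eat all the RAM
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
    # keep only hashes that more than one file mapped to
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_dupes(".").items():
        print(digest)
        for p in paths:
            print("  ", p)
```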


#5 2005-09-23 23:07:42

Kern
Member
From: UK
Registered: 2005-02-09
Posts: 464

Re: Dupe finder

i had a similar dilemma till i lost a drive. then realised that, apart from the embuggerance of actually losing it, i never really recalled any of the files on there for real use.

Do you really intend to use 170 Gig of files or are you just hoarding stuff for a rainy century ?

Test: load the drive and then hide it away for a few weeks. if you can't remember what's on it, or a reason to recall anything, then it's of no real value.

Solution:
dd if=/dev/zero of=/dev/RBD

(note: posted in good humour) smile


#6 2005-09-23 23:25:47

bardo
Member
Registered: 2004-12-06
Posts: 96

Re: Dupe finder

Kern wrote:

Do you really intend to use 170 Gig of files or are you just hoarding stuff for a rainy century ?

Test: load the drive and then hide it away for a few weeks. if you can't remember what's on it, or a reason to recall anything, then it's of no real value.

Ehm... there is *everything* inside it. It holds my entire media collection, for example. My ~ is now empty, since this system is new smile
And I know very well what's inside it. Well, for the most part ^_^ I'm trying to catalog everything right now.

Hope the next century will be rainy, or I'll rm everything, hahaha!


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world


#7 2005-09-25 12:31:42

Kern
Member
From: UK
Registered: 2005-02-09
Posts: 464

Re: Dupe finder

i sympathise as i also have loads of backups on CD and a lot are duplicated.

my few thoughts on the matter are:

Maybe re-collate everything by moving files across into a single directory or drive for sorting, and then maybe subdivide by type of file; if there are, say, 1000s of mp3/mpg/avi etc., that makes simple operations possible.

How to ensure it's a true duplicate that you are erasing: you need to make some sort of check, either on filesize or md5sum.
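For the md5sum check, a quick first pass can be done with standard GNU coreutils alone; just a sketch, not a polished tool:

```shell
# hash every file under the current directory, sort so identical
# hashes are adjacent, then print only lines whose first 32 chars
# (the md5) appear more than once (GNU uniq's -w/-D options)
find . -type f -exec md5sum {} + | sort | uniq -w32 -D
```

Every line printed is a file that shares its checksum with at least one other file, so you can eyeball the groups before deleting anything.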

then i thought: oh no, i'm gonna end up trying to perl something...
then revisited the idea and found this...

http://freshmeat.net/projects/dupseek/
http://www.beautylabs.net/software/dupseek.html

and thought, why re-invent the wheel, unless of course i need a square one smile

hth


#8 2005-09-25 15:28:46

bardo
Member
Registered: 2004-12-06
Posts: 96

Re: Dupe finder

I tried dupseek; it's simple, fast and nice. If someone else wants to give it a try, it is in the AUR wink
The way it works, and what makes it so fast, is also very interesting...
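As far as I can tell, the speed comes from never reading anything it doesn't have to: files are first grouped by size, and only groups with more than one member get compared further. A rough sketch of that first pass in Python (my reading of the idea, not dupseek's actual code):

```python
import os
from collections import defaultdict

def size_groups(root):
    """Group files under root by size; only same-size files can be dupes."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    # a group of one is unique by definition -- skip it, no bytes read
    return [paths for paths in by_size.values() if len(paths) > 1]
```

On a big backup drive most files have a unique size, so the expensive byte-for-byte (or hash) comparison only ever runs on a small remainder.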


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world



Powered by FluxBB