
#1 2005-09-23 15:06:33

bardo
Member
Registered: 2004-12-06
Posts: 96

Dupe finder

...and finally I bought that Really Big Drive that let me back up my files, throw away debian and create another Arch machine. Thus, damocles was born smile

The only problem is, I just realized that on the RBD I have backups from many previous systems: that debian, two gentoos, even some win boxes (oh, my...), each of them containing even older backups of previous OSs. So I have a large number of files that are potentially replicated on the fs, and I'm looking for a good (and fast... 170 GB...) dupe finder to get rid of all the crap I don't need to keep anymore...
A quick glance at the repos didn't help me, but I'm able to create packages, so if you just know a name there's no problem, and I'll make it available for everyone on the AUR wink


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world


#2 2005-09-23 18:53:29

paranoos
Member
From: thornhill.on.ca
Registered: 2004-07-22
Posts: 442

Re: Dupe finder

why do you need to find duplicates? don't you need to just keep the most recent backup?


#3 2005-09-23 20:00:30

bardo
Member
Registered: 2004-12-06
Posts: 96

Re: Dupe finder

paranoos wrote:

why do you need to find duplicates? don't you need to just keep the most recent backup?

Well, my backups come from very different sources, some are backups of backups, and some things have been passed from some desktops to the laptop, so there surely are many dupes around.

A fortune told me: «God saves, but Buddha makes incremental backups». I just saved smile


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world


#4 2005-09-23 22:02:52

alterkacker
Member
From: Peoples Republic of Boulder
Registered: 2005-01-08
Posts: 52

Re: Dupe finder

I don't have anything like that at hand, but if I had to write it myself I would use a scripting language (Perl, Python, Ruby) to recursively descend through all the backup directories, compute the MD5 hash of every file encountered, and use the hash as the key in an associative array (a.k.a. hash or dictionary) whose value is the path to the file. Then, when another file hashes to an existing entry, print out the paths of the duplicate files.

I don't know how fast this would be for 170GB (surely you jest!) but it would be pretty easy to code and would get the job done.
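The approach above would look something like this in Python (a rough sketch, untested against anything near 170GB):

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(root):
    """Walk root, hash every file with MD5, and group paths by hash."""
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            # read in chunks so huge files don't eat all the RAM
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
    # keep only hashes that more than one file mapped to
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_dupes(".").items():
        print(digest)
        for p in paths:
            print("  ", p)
```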


#5 2005-09-23 23:07:42

Kern
Member
From: UK
Registered: 2005-02-09
Posts: 464

Re: Dupe finder

i had a similar dilemma till i lost a drive. then realised that, apart from the embuggerance of actually losing it, i never really recalled any of the files on there for real use.

Do you really intend to use 170 Gig of files or are you just hoarding stuff for a rainy century ?

Test: load the drive and then hide it away for a few weeks. if you can't remember what's on it, or a reason to recall anything, then it's of no real value.

Solution:
dd if=/dev/zero of=/dev/RBD

(note: posted in good humour) smile


#6 2005-09-23 23:25:47

bardo
Member
Registered: 2004-12-06
Posts: 96

Re: Dupe finder

Kern wrote:

Do you really intend to use 170 Gig of files or are you just hoarding stuff for a rainy century ?

Test: load the drive and then hide it away for a few weeks. if you can't remember what's on it, or a reason to recall anything, then it's of no real value.

Ehm... there is *everything* inside it. It holds my entire media collection, for example. My ~ is now empty, since this system is new smile
And I know very well what's inside it. Well, for the most part ^_^ I'm trying to catalog everything right now.

Hope the next century will be rainy, or I'll rm everything, hahaha!


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world


#7 2005-09-25 12:31:42

Kern
Member
From: UK
Registered: 2005-02-09
Posts: 464

Re: Dupe finder

i sympathise as i also have loads of backups on CD and a lot are duplicated.

my few thoughts on the matter are:

Maybe re-collate everything by moving files across into a single directory or drive for sorting, and then maybe subdivide by type of file; if there are, say, 1000s of mp3/mpg/avi etc., that makes simple operations possible.

How to ensure it's a true duplicate that you are erasing: you need to make some sort of check, either on filesize or md5sum.
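For the md5sum check, a quick first pass can be done with standard GNU coreutils alone; just a sketch, not a polished tool:

```shell
# hash every file under the current directory, sort so identical
# hashes are adjacent, then print only lines whose first 32 chars
# (the md5) appear more than once (GNU uniq's -w/-D options)
find . -type f -exec md5sum {} + | sort | uniq -w32 -D
```

Every line printed is a file that shares its checksum with at least one other file, so you can eyeball the groups before deleting anything.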

then i thought: oh no, i'm gonna end up trying to perl something...
then revisited the idea and found this...

http://freshmeat.net/projects/dupseek/
http://www.beautylabs.net/software/dupseek.html

and thought, why re-invent the wheel, unless of course i need a square one smile

hth


#8 2005-09-25 15:28:46

bardo
Member
Registered: 2004-12-06
Posts: 96

Re: Dupe finder

I tried dupseek; it's simple, fast and nice. If someone else wants to give it a try, it is in the AUR wink
The way it works, and what makes it so fast, is also very interesting...
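As far as I can tell, the speed comes from never reading anything it doesn't have to: files are first grouped by size, and only groups with more than one member get compared further. A rough sketch of that first pass in Python (my reading of the idea, not dupseek's actual code):

```python
import os
from collections import defaultdict

def size_groups(root):
    """Group files under root by size; only same-size files can be dupes."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    # a group of one is unique by definition -- skip it, no bytes read
    return [paths for paths in by_size.values() if len(paths) > 1]
```

On a big backup drive most files have a unique size, so the expensive byte-for-byte (or hash) comparison only ever runs on a small remainder.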


dreaming in digital / living in realtime / thinking in binary / talking in ip / welcome to our world



Powered by FluxBB