
#1 2011-10-19 10:43:08

xelados
Member
Registered: 2007-06-02
Posts: 314
Website

dupekill - delete files with duped data (2012-06-13 update)

A few years ago, a friend and I got hold of a game's audio ripped directly from the disc. It included thousands of files that made up the game's sound, and many of them sounded exactly the same. We set out to pare this massive set of files down to something more manageable, with Python on our side. Thus, dupekill was born. When we first made it, it didn't recurse and was basically hacked together. I spent all of last night and most of this morning tweaking it for Python 3 and turning it into a flexible console app/module.

dupekill will (by default) search the current directory for files with duplicated data and get rid of them. It accepts the -h (help), -r (recursive), and -v (verbose) flags, which do the obvious things, plus an optional path.

In addition, I added a "dry run" (-d) flag, so you know what will happen before you commit to getting rid of dupes. Without further delay, here it is:

https://github.com/sporkbox/dupekill
AUR Project Page
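
For the curious, a typical run looks something like this (the ~/music path is just an example):

dupekill -d -r -v ~/music    # dry run: recurse and report what would be deleted
dupekill -r ~/music          # same pass again, this time actually deleting the dupes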

It's my first Python 3 app and the closest I've come to a truly useful piece of software.

Comments and suggestions welcome.

UPDATE (2011-11-04)
dupekill v1.3 is out. I followed keenerd's optimization suggestions to make it run better. Next on the list is an AUR package.

UPDATE (2012-06-12)
dupekill 1.5 supports ignoring symlinks with `-i`. Also available in the AUR.

UPDATE (2012-06-13)
dupekill 1.6

  • ignores device and character nodes as well as sockets and FIFO pipes.

  • `-a` and `-v` will now display the file that a dupe is a copy of, or the original file that a symlink clashes with.

Last edited by xelados (2012-07-25 17:43:49)


#2 2011-10-19 11:16:56

keenerd
Package Maintainer (PM)
Registered: 2007-02-22
Posts: 647
Website

Re: dupekill - delete files with duped data (2012-06-13 update)

This sha256 sums every file.  That could be quite slow.  For example:

time find /{bin,opt,sbin,usr,lib,lib32,lib64,boot,etc} -xtype f -print0 | xargs -0 NNNsum > /dev/null
sha256sum   - 153 seconds (user)
md5sum      - 44 seconds (user)
head -c 512 - 2.5 seconds (user)
ls -l       - 5.6 seconds (user)

User time ignores IO, but I ran all on cold caches to be safe.

Some suggestions for optimizations:  first stat the files and get the total size.  If the sizes match, then read and compare the first 512 bytes.  If those match, then md5 the files.  If those match, then sha512.  (While head looks like a fast operation, comparing the 512 bytes is not, so it goes after the size.)
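
In Python, that staged comparison might look roughly like this (only a sketch to illustrate the ordering; file_hash and is_duplicate are made-up names, not dupekill's actual functions):

import hashlib
import os

def file_hash(path, algo):
    # hash in chunks so large files don't need to fit in memory
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(a, b):
    # cheapest test first: different sizes can never be duplicates
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    # next, compare only the first 512 bytes of each file
    with open(a, "rb") as fa, open(b, "rb") as fb:
        if fa.read(512) != fb.read(512):
            return False
    # only now pay for full-file hashes, weakest to strongest
    if file_hash(a, "md5") != file_hash(b, "md5"):
        return False
    return file_hash(a, "sha512") == file_hash(b, "sha512")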

edit, added stats

Last edited by keenerd (2011-10-19 11:32:12)


#3 2011-10-19 11:20:21

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: dupekill - delete files with duped data (2012-06-13 update)

EDIT: ^

Isn't sha256 overkill? And if all you are looking for is identical files, why not compare file sizes first?

Last edited by lolilolicon (2011-10-19 11:21:59)


This silver ladybug at line 28...


#4 2011-10-19 11:39:47

xelados
Member
Registered: 2007-06-02
Posts: 314
Website

Re: dupekill - delete files with duped data (2012-06-13 update)

keenerd wrote:

This sha256 sums every file.  That could be quite slow.  For example:

time find /{bin,opt,sbin,usr,lib,lib32,lib64,boot,etc} -xtype f -print0 | xargs -0 NNNsum > /dev/null
sha256sum   - 153 seconds (user)
md5sum      - 44 seconds (user)
head -c 512 - 2.5 seconds (user)
ls -l       - 5.6 seconds (user)

User time ignores IO, but I ran all on cold caches to be safe.

Some suggestions for optimizations:  first stat the files and get the total size.  If the sizes match, then read and compare the first 512 bytes.  If those match, then md5 the files.  If those match, then sha512.  (While head looks like a fast operation, comparing the 512 bytes is not, so it goes after the size.)

edit, added stats

If I understand correctly, the idea is to only spend resources when they're actually needed, correct? If two files are different sizes, for instance, their contents obviously differ. I tested dupekill on my music collection and it certainly took a while, so I imagine checking the simpler properties first would improve the speed quite a bit. Thanks for the suggestion! I'll see what I can do to introduce something like that in the next version.

Last edited by xelados (2011-10-19 11:40:15)


#5 2011-11-04 14:22:04

xelados
Member
Registered: 2007-06-02
Posts: 314
Website

Re: dupekill - delete files with duped data (2012-06-13 update)

I've updated dupekill with the optimizations suggested by keenerd. I skipped the md5 hashing step because I felt it was pointless to do two separate hashes; one good hash is enough. The optimization reduced the execution time by at least an order of magnitude: my music directory used to take about 40 seconds, and now it's less than 3!
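
Conceptually the new pass boils down to something like this (a simplified sketch rather than the actual dupekill code; find_dupes is an invented name, and the 512-byte pre-check is left out for brevity):

import hashlib
import os
from collections import defaultdict

def find_dupes(paths):
    # group by size first; a file with a unique size is never even read
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    dupes = []
    for group in by_size.values():
        if len(group) < 2:
            continue
        seen = {}  # digest -> first file seen with that content
        for p in group:
            with open(p, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                dupes.append((p, seen[digest]))
            else:
                seen[digest] = p
    return dupes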

Next on my list is figuring out how to turn it into an AUR package.

Just as before, any comments and suggestions are welcome. Thanks for the feedback again, keenerd.


#6 2012-06-12 17:40:41

xelados
Member
Registered: 2007-06-02
Posts: 314
Website

Re: dupekill - delete files with duped data (2012-06-13 update)

dupekill is now available as a package in the AUR (here). Feel free to test it and/or critique my code.


#7 2012-06-12 21:44:19

irtigor
Member
Registered: 2011-01-21
Posts: 44

Re: dupekill - delete files with duped data (2012-06-13 update)

Nice project, and thanks for making it available as a PKGBUILD. I would also like to see the idea in the TODO implemented, and I have to report that it failed in a directory containing a socket:

$ dupekill -d -a -r . > /dev/null
Traceback (most recent call last):
  File "/usr/bin/dupekill", line 186, in <module>
    dupekill(options.dry_run, options.recursive, options.verbose, options.all_files, options.ignore_links, path)
  File "/usr/bin/dupekill", line 114, in dupekill
    entry.append(open(entry[0], "rb").read(512))
IOError: [Errno 6] No such device or address: '/home/irtigor/.rxvt-unicode-seeadler'
$ file .rxvt-unicode-seeadler
.rxvt-unicode-seeadler: socket

(Possibly "options.ignore_links" is another error, since I did not use "-i".)


#8 2012-06-13 01:13:08

xelados
Member
Registered: 2007-06-02
Posts: 314
Website

Re: dupekill - delete files with duped data (2012-06-13 update)

ignore_links defaults to False, so that's not an issue.

It shouldn't be erroring, though... Hmmm.

Thanks for the bug report. I'll add sockets to the TODO and work on a fix.

EDIT: dupekill 1.6 has been committed and is ready to go! The socket bug should be fixed now. -v or -a will now also print the file that the dupe (or clash) belongs to.

What is a clash? Well, dupekill adds any link or regular file to its list of processed files. So what happens when it later reaches the file that a symlink points to, or a file with identical content to what the symlink points to?

The program used to delete that file, which isn't the proper behavior. If you're ignoring links, it won't be an issue. Otherwise, unless you're doing a dry run with `-d`, dupekill now removes the symlink from the list of files to compare against and deletes the symlink instead.

I haven't managed to make a clash happen yet, due to the strange way the file list generator sorts filenames. Feel free to toy around with it and see if you can create a scenario that tests it, since I couldn't.
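
For reference, skipping the special files comes down to checking the stat mode before a path is ever opened; something along these lines (an illustration of the idea, not necessarily the exact code in 1.6):

import os
import stat

def is_candidate(path, ignore_links=False):
    # lstat examines a symlink itself rather than its target
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return not ignore_links
    # only regular files are worth comparing; sockets, FIFOs and
    # device nodes would error out (or block) on read()
    return stat.S_ISREG(mode)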

Last edited by xelados (2012-06-13 05:15:18)

