
#1 2016-06-02 04:53:23

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,739

dup.py: A duplicate file manager with geolocation extensions

Here is one I've been working on for a while.   https://aur.archlinux.org/packages/dup.py/

I've been frustrated with the family photo collection: many, many duplicate copies of the same files, copied from SD cards and phones and pushed into directories scattered over half a dozen computers.
This program allows you to build a database of these files and their SHA-2 256 hashes.  Then, using the hashes, it identifies the duplicates, even if they have different names.  From there, you can delete one or more duplicates at a time by specifying a file name or a directory containing the files to be removed.  Every file under that path that has a duplicate elsewhere can then be removed in one operation, provided the duplicates are verified and are not merely links to each other.  The removal is paranoid, and the default is to do a dry run.
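
For readers who want to see the idea in miniature, here is a rough sketch of hash-based duplicate detection. It is not the code from dup.py; the function names `sha256_of` and `find_duplicates` are made up for illustration, and it keeps everything in memory instead of in a database.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each hash to the files under root sharing it (groups of 2+ only)."""
    by_hash = defaultdict(list)
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                by_hash[sha256_of(path)].append(path)
    # File names play no part here: only the content hash decides a match.
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicates("/some/photo/tree")` (a placeholder path) returns a dictionary keyed by hash, much like the `--duplicate` listing groups paths under each hash.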

While I was at it, I added the ability to grab geolocation and date/time metadata from image files.  That metadata allows one to search across multiple directories for files taken after a given date (in chronological order), or in order of great circle distance from a given geolocation.  It also allows images to be represented as pins in a Google Maps API-based HTML file.
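
As an illustration of ordering by great circle distance, here is a small sketch using the haversine formula. Whether dup.py uses haversine or another formula is not stated here, so treat this as one possible approach; `great_circle_km`, `sort_by_distance`, and the `(path, lat, lon)` tuple layout are assumptions for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def sort_by_distance(photos, lat, lon):
    """Sort (path, lat, lon) tuples by distance from the given point."""
    return sorted(photos, key=lambda p: great_circle_km(lat, lon, p[1], p[2]))
```

North and East are positive here, matching the sign convention of the `--lat` and `--long` options.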

The data are stored in sqlite3 files and are tied together through the SHA-2 256 hashes.
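
To give a feel for how a hash can tie the tables together, here is a sketch with a guessed schema. The table and column names (`files`, `metadata`, `path`, `hash`, and so on) are hypothetical and are not taken from dup.py's actual database.

```python
import sqlite3


def open_database(path=":memory:"):
    """Create the (hypothetical) tables; the hash links files to metadata."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, hash TEXT NOT NULL)"
    )
    connection.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(hash TEXT PRIMARY KEY, latitude REAL, longitude REAL, taken TEXT)"
    )
    return connection


def record_file(connection, path, file_hash):
    """Add a file, or update its hash if the path is already present."""
    connection.execute(
        "INSERT OR REPLACE INTO files (path, hash) VALUES (?, ?)",
        (path, file_hash),
    )


def duplicate_groups(connection):
    """Return {hash: [paths]} for every hash stored more than once."""
    rows = connection.execute(
        "SELECT hash, path FROM files WHERE hash IN "
        "(SELECT hash FROM files GROUP BY hash HAVING COUNT(*) > 1) "
        "ORDER BY hash, path"
    )
    groups = {}
    for file_hash, path in rows:
        groups.setdefault(file_hash, []).append(path)
    return groups
```

Keying the metadata on the hash rather than the path means that every copy of a photo shares one metadata row, which is one reason to link the tables this way.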

Here is the man file:

man(1)                                                                   dup.py man page                                                                   man(1)

NAME
       dup.py - Manage duplicate files

SYNOPSIS
       dup.py [OPTION]...

DESCRIPTION
       dup.py is a Python application that was designed to manage multimedia libraries that have multiple copies of files and directories scattered over the file
       system or over multiple file systems.  It is a general purpose duplicate file management tool with extensions for managing metadata associated with photos
       and movie clips.

       It  will  walk  directory structures and append the file names and SHA-2 hashes to a sqlite database.  That database may contain multiple tree structures.
       It is then possible to identify duplicate files and to safely purge duplicates.  The purging algorithm is paranoid and will only  delete  files  after  it
       checks that they are still duplicates, and that they are not mere links of each other.

       There are tools to verify the integrity of the database, to read EXIF geolocation data for the files if available, to locate by date or by distance from a
       given latitude/longitude, and to create an HTML file using the Google Maps API with "Pins" for each photo with geolocation tags that can be  "clicked"  to
       open.

       dup.py supports Bash completion if the Python library argcomplete is installed

OPTIONS
       -h, --help
              Show help message and exit

       -v, --verbose
              Provide more information

       --debug
              Generate debugging information (Very Verbose)

       --database DATABASE
              Select database name given by DATABASE.  If no database name is provided, the default database is $HOME/.dup.sqlite

       -p PATH, --path PATH
              Scan  a  directory tree into database.  Walk the directory PATH, calculate the SHA-2 sum for each file, and add the file name and hash to the data‐
              base.  If there is an existing entry for a given file path, update the hash as appropriate.

       -d, --duplicate
              Check for duplicates in database.  Search for duplicate hashes and produce a list of hashes that have duplicate entries and the file names
              associated with those hashes.

       --purge PURGE
              Purge  duplicate  files from the file system.  Using the path or partial path PURGE, find all files whose path name starts with PURGE, that have
              duplicates in the database, and for which the duplicates are verified to be valid, and produce a list of candidate files to delete.  The  program
              will not delete those files unless the --commit option is passed in addition to --purge.

       --remove REMOVE
              Remove  files from database.  Compare REMOVE to the names of files in the database and remove all entries whose names start with REMOVE.  This
              option will create a list of files that will be removed from the database, but will not actually remove those database entries unless the --commit
              option is provided in addition to --remove.  Note that entries are removed from the database, but no files are removed from the filesystem.

       -commit
              Really delete things.  Causes the --purge and --remove options to actually delete files or database entries

       -c, --check
              Check database for integrity.  Read each file that is indexed by the data base and verify the hash for each file

       --exif Obtain  metadata  for  all files.  Attempt to read the exif geolocation metadata for media files and update the database.  This option requires the
              python library GExiv2 to be installed.

       --byDate BYDATE
              Output file paths by date since BYDATE.  BYDATE format is mm/dd/yyyy

       --lat LAT
              Latitude in decimal format (North positive, South negative), required for file paths by distance

       --long LONG
              Longitude in decimal format (West negative, East Positive), required for file paths by distance.  If latitude and longitude are both provided,  the
              files with geolocation metadata will be output in order of distance from the given geolocation (great circle distance)

       --map MAP
              Generate geolocation map to given file name MAP.  MAP will be an html file using Google's map API with pins for each file with geolocation informa‐
              tion

OTHER FILES
SEE ALSO
       None

BUGS
       No known bugs.

AUTHOR
       Eric Waller (ewwaller-dup@gmail.com)

1.0                                                                        1 June 2016                                                                     man(1)
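
The "paranoid" purge described in the man page can be pictured roughly as follows. This is only a sketch of the stated checks (still duplicates, not links of each other, dry run by default); the names `file_hash`, `safe_to_delete`, and `purge` are invented for the example and are not dup.py's internals.

```python
import hashlib
import os


def file_hash(path):
    """SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_to_delete(candidate, other, expected_hash):
    """True only if `other` is an independent, still-identical copy."""
    if not (os.path.isfile(candidate) and os.path.isfile(other)):
        return False
    # samefile() is true for hard links and for a symlink and its target,
    # so such pairs are never treated as two separate copies.
    if os.path.samefile(candidate, other):
        return False
    # Re-hash both files now instead of trusting the stored hash.
    return file_hash(candidate) == expected_hash == file_hash(other)


def purge(candidate, other, expected_hash, commit=False):
    """Delete `candidate` only when it is safe and commit is requested."""
    if not safe_to_delete(candidate, other, expected_hash):
        return False
    if commit:
        os.remove(candidate)
    return True  # with commit=False this is a dry run: nothing was deleted
```

Defaulting `commit` to `False` mirrors the behaviour described above, where `--purge` alone only lists candidates.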

A sample session: First, list the duplicates in the database, then do a dry run of purging a directory containing some duplicates, then show some files in increasing radius from a place near here.

ewaller@turing/home/ewaller[1] % dup.py --duplicate 
Hash 3b48adb86709e48e8a40dbce322aecf8b099aefdf3f6f220950e987b03643f7c
  /home/ewaller/multimedia/photos/Router/Senior slideshow pics/Level 8 2nd place team close up.jpg
  /home/ewaller/multimedia/photos/Router/Vegas and Reno/Level 8 2nd place team close up.jpg
Hash 686ab04e9cd21464529dad5509d588624f76f52085e313501e1b7bc04b6828c3
  /home/ewaller/multimedia/photos/meet/DVD/VIDEO_TS/VIDEO_TS.BUP
  /home/ewaller/multimedia/photos/meet/DVD/VIDEO_TS/VIDEO_TS.IFO
Hash 8b1138e4b8d907f2159bcbcee08cb1c44d71e1915e4d99bfd6aff882c39b65c0
  /home/ewaller/multimedia/photos/recover/again/recup_dir.2/b0714092.txt
  /home/ewaller/multimedia/photos/recover/again/recup_dir.2/b0722092.txt
Hash a7e6198dcec1f6a36a4221635f741b3a1eaf8342707b05fc6f2e79f5fcaed3ab
  /home/ewaller/multimedia/photos/meet/DVD/VIDEO_TS/VTS_01_0.BUP
  /home/ewaller/multimedia/photos/meet/DVD/VIDEO_TS/VTS_01_0.IFO
Hash edf9f59c4ecfd45dd1f7604285e1ffd5be3e895f773a4892809043e649517cb7
  /home/ewaller/multimedia/photos/recover/again/recup_dir.2/b0868715.txt
  /home/ewaller/multimedia/photos/recover/again/recup_dir.2/b0876747.txt
  /home/ewaller/multimedia/photos/recover/again/recup_dir.2/b0884715.txt
  /home/ewaller/multimedia/photos/recover/again/recup_dir.2/b0896939.txt
ewaller@turing/home/ewaller % dup.py --purge  "/home/ewaller/multimedia/photos/Router/Vegas and Reno"
Purging file: /home/ewaller/multimedia/photos/Router/Vegas and Reno/Level 8 2nd place team close up.jpg
  2845mS (MainThread) WARNING : ... Just kidding, this is a dry run
ewaller@turing/home/ewaller % dup.py --lat 34.2 --long -118.1 > foo
ewaller@turing/home/ewaller % head -20 foo 
/home/ewaller/multimedia/photos/Andoid/IMAG0040.jpg
/home/ewaller/multimedia/photos/Camera Uploads_1/2012-05-20 18.40.18.jpg
/home/ewaller/multimedia/photos/Camera Uploads_1/2012-05-20 18.41.10.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0316.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0311.jpg
/home/ewaller/multimedia/photos/Camera Uploads/2015-03-04 20.36.10.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0310.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0312.jpg
/home/ewaller/multimedia/photos/Camera Uploads/2014-07-08 20.24.40.jpg
/home/ewaller/multimedia/photos/Camera Uploads/2015-01-03 19.01.58.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0313.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0282.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0283.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0284.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0315.jpg
/home/ewaller/multimedia/photos/Camera Uploads/2015-03-04 20.26.26.jpg
/home/ewaller/multimedia/photos/Camera Uploads_1/2013-03-31 09.36.48.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0280.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0281.jpg
/home/ewaller/multimedia/photos/Andoid/IMAG0329.jpg
ewaller@turing/home/ewaller % 

Screenshots from an HTML file generated by the --map option:
http://i.imgur.com/xXDa6tw.png
and zoomed into Baja California:
http://i.imgur.com/Pu1wnL4.png

Anyway, enjoy.

Last edited by ewaller (2016-06-02 13:52:17)


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way


#2 2016-06-02 06:55:14

Docbroke
Member
From: India
Registered: 2015-06-13
Posts: 1,433

Re: dup.py: A duplicate file manager with geolocation extensions

I haven't dared to organize my photo collection, but will give it a try soon now!


#3 2016-06-03 02:18:31

Steef435
Member
Registered: 2013-08-29
Posts: 577
Website

Re: dup.py: A duplicate file manager with geolocation extensions

I think there's a typo in your man page: the --commit flag is listed as -commit (one dash).

With my great organizing of documents alone I think I will have a fun time running this tool on my home directory ^^'

Thanks for sharing!


#4 2016-06-03 02:49:48

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,739

Re: dup.py: A duplicate file manager with geolocation extensions

Steef435 wrote:

Thanks for sharing!

My pleasure :)

Thanks for the catch.  I'll be updating that.

Edit:  Upstream has been updated

Last edited by ewaller (2016-06-03 03:04:52)


#5 2016-06-03 11:33:30

TheSaint
Member
From: my computer
Registered: 2007-08-19
Posts: 1,523

Re: dup.py: A duplicate file manager with geolocation extensions

I'd like to call it duphoto.py ;). But you may add the option to compare 2 thumbnails, in order to find similar photos of different sizes.
Good job.


do it good first, it will be faster than do it twice the saint ;)


#6 2016-09-10 14:22:35

Unia
Member
From: Stockholm, Sweden
Registered: 2010-03-30
Posts: 2,486
Website

Re: dup.py: A duplicate file manager with geolocation extensions

It seems as if argcomplete is now a required dependency? The manpage lists it as optional.

 > dup.py --path .
Traceback (most recent call last):
  File "/usr/bin/dup.py", line 649, in <module>
    main()
  File "/usr/bin/dup.py", line 641, in main
    argcomplete.autocomplete(parser)
NameError: name 'argcomplete' is not defined

This error is resolved by installing python-argcomplete.
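
A NameError here, rather than an ImportError, suggests that the import may already be optional while the call to it is not. That is a guess from the traceback, not something confirmed against the source. A minimal sketch of keeping argcomplete optional at both places, with a hypothetical `build_parser` standing in for the real argument setup:

```python
import argparse

try:
    import argcomplete  # optional: only needed for Bash completion
except ImportError:
    argcomplete = None


def build_parser():
    """Stand-in parser with a single option, for illustration only."""
    parser = argparse.ArgumentParser(prog="dup.py")
    parser.add_argument("-p", "--path",
                        help="scan a directory tree into the database")
    return parser


def parse_arguments(argv=None):
    parser = build_parser()
    # Only call into argcomplete when the import above succeeded.
    if argcomplete is not None:
        argcomplete.autocomplete(parser)
    return parser.parse_args(argv)
```

With this shape, the script runs the same way whether or not python-argcomplete is installed; completion is simply unavailable without it.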

EDIT: It works beautifully, thank you! :)

Last edited by Unia (2016-09-10 16:16:53)


If you can't sit by a cozy fire with your code in hand enjoying its simplicity and clarity, it needs more work. --Carlos Torres


#7 2016-09-10 16:23:08

Alad
Wiki Admin/IRC Op
From: Bagelstan
Registered: 2014-05-04
Posts: 2,407
Website

Re: dup.py: A duplicate file manager with geolocation extensions

Nice, I'm going to run this on my download folder which has 300 duplicates of every file thanks to Chrome's great download interface. :/

Small gripe, would you mind putting the .1 file uncompressed on github, to avoid "binary diffs"? makepkg takes care of the compression transparently.


Mods are just community members who have the occasionally necessary option to move threads around and edit posts. -- Trilby


#8 2016-09-16 18:39:40

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,739

Re: dup.py: A duplicate file manager with geolocation extensions

Thank you everyone.  I've not been ignoring you -- I was really sick for about a week, been on travel for work, and spinning like crazy trying to get caught up with real work, yard work and fixing appliances work after my down time.  I have a couple days and will be cleaning this up.

Alad wrote:

Nice, I'm going to run this on my download folder which has 300 duplicates of every file thanks to Chrome's great download interface. :/

Small gripe, would you mind putting the .1 file uncompressed on github, to avoid "binary diffs"? makepkg takes care of the compression transparently.

I suppose I can do that.  I have to admit that I maintain that file in emacs which handles the .gz transparently as well.  A small git restructure should not be too much effort.

