#1 2013-08-06 03:01:47

justinzane
Member
From: Weed, CA, US
Registered: 2013-07-19
Posts: 19

Bulk/Scriptable Image File Deduplication Tools?

*Note: I'm not sure if this should be in Sys. Admin. instead. Please correct me.*

I've got an archive of image files, mostly photos but also vector art, that has gotten thoroughly messed up over years of use by several users. This archive is far too large to handle using something like Digikam, which is not, as far as I can tell, scriptable. One of the biggest problems is that a single image may have been copied to several different directories, under several different names, possibly with different or missing EXIF/IPTC data, and possibly in different formats (JPEG, PNG, Nikon NEF, Adobe DNG, TIFF, or SVG, EPS, PDF).

Digikam has a "fingerprinting" and matching tool which works very well for finding duplicates. This is almost a perfect solution. **However**, I cannot figure out how to use it in a scriptable, headless manner, and using it through the Digikam interface would take weeks.

Does anyone have a suggestion as to how to deduplicate image files programmatically?

Thanks.

#2 2013-08-06 03:30:02

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: Bulk/Scriptable Image File Deduplication Tools?

AUR/findimagedupes seems what you have in mind.


This silver ladybug at line 28...

#3 2013-08-06 03:43:28

justinzane
Member

Re: Bulk/Scriptable Image File Deduplication Tools?

lolilolicon wrote:

AUR/findimagedupes seems what you have in mind.

Sure sounds like it! I'm installing now and will give it a spin. Thanks.

#4 2013-08-06 23:24:15

justinzane
Member

Re: Bulk/Scriptable Image File Deduplication Tools?

lolilolicon wrote:

AUR/findimagedupes seems what you have in mind.

justinzane wrote:

Sure sounds like it! I'm installing now and will give it a spin. Thanks.

So, `findimagedupes` works well for common image formats, but it cannot handle Nikon (NEF) raw files. After RTFM, the algorithm used is quite clear and simple:

DESCRIPTION
       findimagedupes compares a list of files for visual similarity.

       To calculate an image fingerprint:
         1) Read image.
         2) Resample to 160x160 to standardize size.
         3) Grayscale by reducing saturation.
         4) Blur a lot to get rid of noise.
         5) Normalize to spread out intensity as much as possible.
         6) Equalize to make image as contrasty as possible.
         7) Resample again down to 16x16.
         8) Reduce to 1bpp.
         9) The fingerprint is this raw image data.

       To compare two images for similarity:
         1) Take fingerprint pairs and xor them.
         2) Compute the percentage of 1 bits in the result.
         3) If percentage exceeds threshold, declare files to be similar.
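
The comparison half of that description is small enough to sketch in plain Python. This is only an illustration, not findimagedupes' actual code: the function names and the 0.90 default are my own assumptions, and I am reading the 1 bits left after the xor as the bits that *differ*, so similarity here is the share of bits that match. Fingerprints are assumed to be the 32 raw bytes of a 16x16 image at 1bpp.

```python
def hamming_bits(fp_a: bytes, fp_b: bytes) -> int:
    """Count the bits that differ between two equal-length fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # XOR each byte pair; every 1 bit in the result marks a differing pixel.
    return sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))


def similarity(fp_a: bytes, fp_b: bytes) -> float:
    """Return the fraction of matching bits, from 0.0 to 1.0."""
    total_bits = len(fp_a) * 8
    return 1.0 - hamming_bits(fp_a, fp_b) / total_bits


def is_similar(fp_a: bytes, fp_b: bytes, threshold: float = 0.90) -> bool:
    """Hypothetical helper: the threshold is an assumed default, tune to taste."""
    return similarity(fp_a, fp_b) >= threshold


if __name__ == "__main__":
    blank = bytes(32)                      # 16x16 pixels at 1bpp, all zero
    nearly = bytes([0xFF]) + bytes(31)     # same, but the first 8 pixels flipped
    print(hamming_bits(blank, nearly))     # 8
    print(similarity(blank, nearly))       # 0.96875
    print(is_similar(blank, nearly))       # True
```

Producing the fingerprint itself (steps 1 to 9) still needs an image library that can read the file, which is where the raw-format problem comes in.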

Since I **need** NEF support, and I dislike/suck at Perl, does it seem reasonable to rewrite/wrap the app in shell/Python/C to handle raw files with dcraw/libkdcraw? Or is there an alternative program/library that handles raw files? Suggestions?

#5 2013-08-06 23:58:13

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 30,456

Re: Bulk/Scriptable Image File Deduplication Tools?

imagemagick's convert may help


"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman

#6 2013-08-07 00:29:47

justinzane
Member

Re: Bulk/Scriptable Image File Deduplication Tools?

Trilby wrote:

imagemagick's convert may help

AFAIK, imagemagick wraps ufraw. While I could `find` the NEF files and `dcraw` each one to a tempfile, that seems like a kludge on top of a rather simple program to start with. In an age when even friggin' high-end smartphones produce raw files, it seems that findimagedupes or something similar should handle them. Or maybe I'm nuts.
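
For reference, the find-then-dcraw kludge could look roughly like the Python sketch below. It is a sketch under assumptions, not a tested workflow: the helper names and the `.ppm` naming scheme are made up for illustration, only `.nef` is treated as raw, and it relies on dcraw's `-c` (write to standard output) and `-h` (half-size output) options. The converted files mirror the archive's directory layout so that copies with the same name in different directories do not overwrite each other.

```python
import subprocess
from pathlib import Path

# Assumption: only Nikon raw files need converting; add suffixes as needed.
RAW_SUFFIXES = {".nef"}


def find_raw_files(root):
    """Recursively list raw files under root, matching suffixes case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in RAW_SUFFIXES
    )


def dcraw_command(raw_path):
    """Build the dcraw call: -c writes to stdout, -h makes a half-size image."""
    return ["dcraw", "-c", "-h", str(raw_path)]


def output_path(raw_path, root, out_dir):
    """Mirror the file's location under out_dir and append '.ppm' to its name."""
    rel = Path(raw_path).relative_to(root)
    return Path(out_dir) / rel.with_name(rel.name + ".ppm")


def convert_raw(raw_path, root, out_dir):
    """Decode one raw file to a PPM that ordinary image tools can read."""
    target = output_path(raw_path, root, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as out:
        # check=True raises CalledProcessError if dcraw fails on the file.
        subprocess.run(dcraw_command(raw_path), stdout=out, check=True)
    return target
```

Running `convert_raw` over everything `find_raw_files` returns would leave a tree of PPM files that could then be handed to findimagedupes alongside the original JPEGs and PNGs.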

#7 2014-01-16 12:23:27

jhnc
Member
Registered: 2014-01-16
Posts: 1

Re: Bulk/Scriptable Image File Deduplication Tools?

findimagedupes just uses Perl's wrapper around graphicsmagick, so it should handle all the formats that graphicsmagick does.
The man page for the Debian version of graphicsmagick that I have here says it understands NEF. To see if your version does, you could try something like:

    gm identify myfile.nef
