Not really a programming question, I suppose, but anyway:
I want to recursively compare two directory trees (which differ in structure) and list any duplicate files. I've been playing around with diff but can't make it handle differing tree structures. I'm kind of a newbie at this, though, and it's five o'clock in the morning.
For further clarification: the missus has an HDD with files (pictures, movies, music) sorted by some scheme not even she herself can fathom (though she'd never admit it).
I've made it my mission to back up and sort all her stuff. Now, some of the files on her drive are already on mine, but sorted (in a sane fashion), which is why I want to be able to find duplicates of any given filetype or pattern.
I hope that was clear enough.
I'm too tired to think about this too hard, but I can definitely see find(1) in your future.
By duplicate files, do you mean files that have the same name, or files with identical contents that may or may not have the same name?
From your post it sounds like you just want to compare based on filenames. find will be of help in both cases; however, if you need to find duplicate files based on contents, it helps to generate a hash of every file and then compare the hashes for duplicates.
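For illustration, one common way to do the content-hash comparison (GNU coreutils assumed; the paths are hypothetical placeholders):

find /path/to/mine /path/to/hers -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

md5sum prints a 32-character hash in front of each path, so sorting and then grouping lines whose first 32 characters repeat lists every set of content-identical files, regardless of name.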
Last edited by mikesd (2009-11-14 05:40:42)
Yeah, comparison based on filename is what I'm after. find is exactly what I needed. Maybe I should have tackled this task when fully awake... then I might not have had to bother you with it.
Thanks for pointing me in the right direction, though!
Welcome to the forums.
If you feel like it, report back to this thread with your final solution. It may help someone with a similar problem. By the way, I believe that diff can recursively compare trees. I've never tried it, so I can't elaborate.
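(For what it's worth, the recursive compare would be something like `diff -rq dir1 dir2`; but diff pairs files by identical relative path, so it likely won't spot duplicates when the two trees are organized differently, which is the situation here.)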
#!/bin/bash
his_directory="/mnt/mine"
her_directory="/mnt/yours"

# list just the filenames (basenames) from his tree,
# then look for each one anywhere in her tree
find "$his_directory" -type f -exec basename {} \; | while read -r file; do
    result="$(find "$her_directory" -name "$file")"
    [ -n "$result" ] && echo "$result found in hers and yours"
done
note: untested
Looks like you've already worked it out, but here's a solution in ruby (doing set intersection in bash is a bit of a pain):
#!/usr/bin/env ruby
# usage: ruby script.rb dir1 dir2
def filenames(dir) # return basenames of all regular files in tree
  Dir["#{dir}/**/*"].select { |f| File.file?(f) }.map { |f| File.basename(f) }.sort
end

dir1, dir2 = ARGV[0..1]
# set intersection: names present in both trees
puts (filenames(dir1) & filenames(dir2)).join("\n")
If the purpose of this is to avoid backing up duplicate files, why do you only want to compare by file name? It seems that "fdupes" might be more appropriate for this.
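For example (content-based, so the filenames don't matter; paths hypothetical):

fdupes -r /mnt/mine /mnt/yours

would list every set of identical files across both trees.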
*avoids temptation to rewrite previous script in Perl and Python*
fdupes looks interesting... I'll probably end up using it as part of the solution. Thanks for the heads-up. Filename comparison should be sufficient, though, as none of the file names have been altered, to my knowledge.
Thank you, brisbin and pox, for the examples; they might come in handy.
As for diff, it doesn't seem to go deeper into the trees unless the structure of both entries is identical... 'least that's what it seemed like to me.
I'll post what I came up with later. For now, back to bed again... working the night shift seriously messes up my sleep.
Last edited by simongh (2009-11-14 06:23:02)
working the night shift seriously messes up my sleep
What's sleep and how much does it cost?
find "$his_directory" -type f -exec basename {} \;
find has -printf, which in this case is useful if you want to add a check for e.g. file size: print name and size with -printf '%f\t%s\n', then, when looking in dir 2, read them back with IFS=$'\t' ... while read file size ... and match with find ... -size ${size}c.
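A minimal sketch of that idea grafted onto the earlier script (untested; same hypothetical mount points as above):

#!/bin/bash
his_directory="/mnt/mine"
her_directory="/mnt/yours"

# emit "name<TAB>size" for each of his files, then require both
# the same name and the same byte size in her tree
find "$his_directory" -type f -printf '%f\t%s\n' | while IFS=$'\t' read -r file size; do
    result="$(find "$her_directory" -name "$file" -size "${size}c")"
    [ -n "$result" ] && echo "$result found in hers and yours"
done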
In the end, a combination of `find . -name '*.ext' -exec basename {} \;`, `fdupes -rd`, and `diff` did the job, coupled with a pipe or two. Basically: use find and diff to determine which files were on both drives, put that list in a file, then loop through said file and remove (find -exec rm) all the dupes. Finally, `fdupes` was used to clear out the dupes the missus already had of her own files on her extremely well-sorted drive. I was planning on writing a bash script to automate the process a bit, but it seemed like a waste of time.
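A rough sketch of that workflow as one script, with hypothetical paths and extension, and using comm instead of diff for the intersection of the two name lists:

#!/bin/bash
mine="/mnt/mine"
hers="/mnt/yours"

# filenames of this type present on BOTH drives (comm needs sorted input)
comm -12 <(find "$mine" -type f -name '*.jpg' -printf '%f\n' | sort -u) \
         <(find "$hers" -type f -name '*.jpg' -printf '%f\n' | sort -u) > dupes.txt

# remove her copies of files that are already sorted on my drive
while read -r file; do
    find "$hers" -name "$file" -exec rm -v {} \;
done < dupes.txt

# let fdupes interactively clear content-identical leftovers on her drive
fdupes -rd "$hers"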
Thanks, all of you, for the tips and hints.