#1 2010-08-07 04:33:58

jalu
Member
Registered: 2009-04-05
Posts: 140

Performing a "shallow" diff?

Forgive me if this is a bit basic...

I have two directories that I need to diff, but each contains about 50GB of data. Doing a standard diff on the directories is very expensive and slow, as it compares all of the files as well as all of the data within the files.

Is there a way to perform a "shallow" diff -- that is, only look at file names, modification times, and sizes to determine if two files are different? I think rsync and Mercurial do something similar to keep overhead low, but I'm not 100% on that. I am aware that I can get some of the information I need from an rsync --dry-run, but the output is a bit difficult to parse in a shell script...

Any insight would be greatly appreciated. Thanks!

#2 2010-08-07 04:53:21

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Performing a "shallow" diff?

There are apps to find duplicates - is that what you need?

#3 2010-08-07 05:02:05

jalu
Member
Registered: 2009-04-05
Posts: 140

Re: Performing a "shallow" diff?

Hello again, karol. Thanks for the response.

A dupe finder isn't really what I need. I'm basically using a script to rsync one hard drive to another, but before actually running rsync I would like to know if the contents of the drive even differ (this knowledge is needed elsewhere in the program). Like I said, I can do a normal diff but that is very expensive.

#4 2010-08-07 05:14:59

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Performing a "shallow" diff?

I'm not sure I can get you anything better than 'rsync --dry-run'. What is the problem with the parsing?

#5 2010-08-07 05:23:50

jalu
Member
Registered: 2009-04-05
Posts: 140

Re: Performing a "shallow" diff?

Interesting.

The problem with parsing is basically that I need to know either 1) the files are identical or 2) they aren't. This is obviously a simple yes/no answer, but rsync unfortunately doesn't provide this information with simple yes/no output like an exit code or one-line response. rsync basically prints one of two things...

$ rsync --dry-run -av a b
sending incremental file list
a/
a/b
a/c
a/d

sent 89 bytes  received 25 bytes  228.00 bytes/sec
total size is 0  speedup is 0.00 (DRY RUN)

or

$ rsync --dry-run -av a b
sending incremental file list

sent 77 bytes  received 13 bytes  180.00 bytes/sec
total size is 0  speedup is 0.00 (DRY RUN)

In the first example, the files differ. In the second, they don't. I guess this isn't incredibly hard to parse, but using awk for something like this would feel a bit hacky, so I was hoping there was a more straightforward way to get this information.
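
For what it's worth, a possible way to get a yes/no answer without awk is rsync's --itemize-changes option: combined with --dry-run and without -v, rsync prints one line per item it would change and nothing at all otherwise, so an empty output means "no differences". A minimal sketch, assuming that behaviour and that trailing slashes are wanted (compare the contents of the two directories rather than copying a into b); the function name dirs_in_sync is made up for illustration. By default rsync's quick check only looks at size and modification time, so this does not read the file data.

```shell
# Hypothetical helper: exit status 0 if rsync would change nothing,
# 1 if it would change something, 2 if rsync itself failed.
dirs_in_sync() {
    local changes
    # --dry-run: touch nothing; --itemize-changes: one line per changed item;
    # --delete: also report files that exist only in the destination.
    changes=$(rsync --archive --dry-run --itemize-changes --delete "$1"/ "$2"/) || return 2
    [ -z "$changes" ]
}
```

Usage would be something like "if dirs_in_sync a b; then echo same; else echo different; fi". Note that with --archive a differing modification time on the directory itself also counts as a change.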

#6 2010-08-07 05:25:37

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Performing a "shallow" diff?

Wait a second, how about checking the size of the directories with 'du'? If it's different, something _had_ to change.

Edit: Unless you change the filenames only ;P
Edit 2: How about counting the lines of output? If there are more than four (if I count correctly), the files are not the same (or you got some errors that added extra lines).
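
A minimal sketch of the size check, assuming GNU find and flat directories. It sums the file sizes instead of calling du, because du reports allocated disk blocks by default, and those can differ between filesystems even when the files are identical. The function name total_bytes is made up for illustration. This is only a quick negative test: different totals mean something changed, equal totals prove nothing.

```shell
# Hypothetical helper: print the total size in bytes of the regular files
# directly inside directory $1 (-printf is a GNU find extension).
total_bytes() {
    find "$1" -maxdepth 1 -type f -printf '%s\n' |
        awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}
```

Usage would be something like: [ "$(total_bytes a)" = "$(total_bytes b)" ] || echo "sizes differ"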

Last edited by karol (2010-08-07 05:33:09)

#7 2010-08-07 05:37:45

jalu
Member
Registered: 2009-04-05
Posts: 140

Re: Performing a "shallow" diff?

Those both might work, and are interesting ideas.

Because the directories I'm comparing are flat (no subdirectories), I was even thinking of making two calls to ls -l, one on each directory, and comparing the output. Because of the information ls -l provides, the diff would be able to consider permissions, owner, file size, modification time, and file name.
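
Roughly the same idea can be sketched with find instead of ls, which avoids depending on the ls -l date format. This assumes GNU find (for -printf), flat directories, and file names without tabs or newlines; the function names shallow_listing and shallow_same are made up for illustration.

```shell
# Hypothetical helper: one line per regular file in directory $1 with
# name, size in bytes, mtime (seconds since the epoch), mode and owner.
shallow_listing() {
    (cd "$1" && find . -maxdepth 1 -type f -printf '%f\t%s\t%T@\t%m\t%u\n' | LC_ALL=C sort)
}

# Exit status 0 if both listings match, 1 if they differ,
# 2 if a directory could not be entered.
shallow_same() {
    local left right
    left=$(shallow_listing "$1") || return 2
    right=$(shallow_listing "$2") || return 2
    [ "$left" = "$right" ]
}
```

Usage would be something like "shallow_same a b || echo different". Only metadata is read, so the cost depends on the number of files rather than on the 50GB of data. If the two drives use filesystems with different timestamp resolution, the fractional part of the modification time could produce false differences.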

The only reason I wasn't comfortable doing this was that I thought there must be a better, more "right" way to do this, but it's looking like that might not be the case... :-)

Last edited by jalu (2010-08-07 05:39:09)

#8 2010-08-07 05:52:06

urist
Member
Registered: 2009-02-22
Posts: 248

Re: Performing a "shallow" diff?

What about the 'cmp' command? If it returns one line of output, the files are different. If it doesn't have any output whatsoever, like 'diff,' the files are the same.

This would probably require a script, however.
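
A rough sketch of such a script, assuming flat directories with no hidden files; the function name flat_dirs_identical is made up for illustration. It relies on cmp's exit status (0 for identical files) with the -s option to suppress output, rather than on counting lines. Keep in mind that cmp has to read matching files all the way through.

```shell
# Hypothetical helper: exit status 0 if both flat directories hold the same
# file names and every pair of regular files is byte-identical, 1 otherwise.
flat_dirs_identical() {
    local a=$1 b=$2 f
    # Different sets of names: no need to read any file data.
    [ "$(ls "$a")" = "$(ls "$b")" ] || return 1
    for f in "$a"/*; do
        [ -f "$f" ] || continue    # skip anything that is not a regular file
        # cmp -s prints nothing; a non-zero status means different or missing.
        cmp -s "$f" "$b/${f##*/}" || return 1
    done
    return 0
}
```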

Last edited by urist (2010-08-07 05:52:57)

#9 2010-08-07 05:53:43

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Performing a "shallow" diff?

urist wrote:

What about the 'cmp' command? If it returns one line of output, the files are different. If it doesn't have any output whatsoever, like 'diff,' the files are the same.

This would probably require a script, however.

From cmp man page: "Compare two files byte by byte" - I'm not sure I'd like to do this with a 50 GB pack of files :-)

#10 2010-08-07 05:58:13

jalu
Member
Registered: 2009-04-05
Posts: 140

Re: Performing a "shallow" diff?

karol wrote:

From cmp man page: "Compare two files byte by byte" - I'm not sure I'd like to do this with a 50 GB pack of files :-)

I must agree. :-)

cmp does seem like an interesting little program, though. I've never used it before. Thanks for the tip, urist.

In the meantime, I have managed to get my mirroring script to work (seemingly) well with ls comparisons, but I'm still wary as to whether this is the right way to do this... "If it ain't broke"?
