Forgive me if this is a bit basic...
I have two directories that I need to diff, but each contains about 50GB of data. Doing a standard diff on the directories is very expensive and slow, as it compares all of the files as well as all of the data within the files.
Is there a way to perform a "shallow" diff -- that is, only look at file names, modification times, and sizes to determine if two files are different? I think rsync and Mercurial do something similar to keep overhead low, but I'm not 100% on that. I am aware that I can get some of the information I need from an rsync --dry-run, but the output is a bit difficult to parse in a shell script...
Any insight would be greatly appreciated. Thanks!
Offline
There are apps to find duplicates - is that what you need?
Offline
Hello again, karol. Thanks for the response.
A dupe finder isn't really what I need. I'm basically using a script to rsync one hard drive to another, but before actually running rsync I would like to know if the contents of the drive even differ (this knowledge is needed elsewhere in the program). Like I said, I can do a normal diff but that is very expensive.
Offline
I'm not sure I can get you anything better than 'rsync --dry-run'. What is the problem with the parsing?
Offline
Interesting.
The problem with parsing is basically that I need to know either 1) the files are identical or 2) they aren't. This is obviously a simple yes/no answer, but rsync unfortunately doesn't provide this information with simple yes/no output like an exit code or one-line response. rsync basically prints one of two things...
$ rsync --dry-run -av a b
sending incremental file list
a/
a/b
a/c
a/d
sent 89 bytes received 25 bytes 228.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
or
$ rsync --dry-run -av a b
sending incremental file list
sent 77 bytes received 13 bytes 180.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
In the first example, the files differ. In the second, they don't. I guess this isn't incredibly hard to parse, but using awk for something like this would feel a bit hacky, so I was hoping there was a more straightforward way to get this information.
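One way to sidestep parsing the verbose summary, sketched here under the assumption that the installed rsync supports -i (--itemize-changes): without -v, a dry run with -i prints one line per item it would update and nothing at all when the trees already match, so the check reduces to "is the output empty?". The function name trees_differ is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if rsync would change anything when syncing
# directory $1 into directory $2, exit 1 if it would change nothing,
# exit 2 if rsync itself failed.
trees_differ() {
    # -a        archive mode (quick check by size and mtime; also compares
    #           permissions, owner, and group)
    # -n        dry run, nothing is copied
    # -i        itemize changes, one line per item that would be updated
    # --delete  also report files that exist only in the destination
    # The trailing slashes compare the contents of the two directories.
    # Caveat: a differing mtime on the directories themselves is also
    # reported as a change.
    changes=$(rsync -ani --delete "$1/" "$2/") || return 2
    [ -n "$changes" ]
}
```

In a script this could be used as: if trees_differ "$src" "$dst"; then echo "out of sync"; fi. Because a plain if treats exit status 2 the same as 1, a caller that cares about rsync errors would need to inspect $? explicitly.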
Offline
Wait a second, how about checking the size of the directories with 'du'? If it's different, something _had_ to change.
Edit: Unless you change the filenames only ;P
Edit 2: How about counting the lines of output? If there are more than four (if I count correctly), the files are not the same (or you got some errors that added extra files).
Last edited by karol (2010-08-07 05:33:09)
Offline
Those both might work, and are interesting ideas.
Because the directories I'm comparing are flat (no subdirectories), I was even thinking of making two calls to ls -l, one on each directory, and comparing the output. Because of the information ls -l provides, the diff would be able to consider permissions, owners, file size, file time, and file name.
The only reason I wasn't comfortable doing this was that I thought there must be a better, more "right" way to do this, but it's looking like that might not be the case... :-)
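The listing comparison described above could be sketched roughly as follows. This is not the script from the thread: it swaps ls -l for find with -printf, which is a GNU findutils extension (an assumption about the system), so that only name, size, and modification time are compared. The function names manifest and same_shallow are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one "name<TAB>size<TAB>mtime" line per regular
# file directly inside directory $1, sorted by name.
# Assumes GNU find (for -printf) and file names without newlines.
manifest() {
    (cd "$1" && find . -maxdepth 1 -type f -printf '%f\t%s\t%T@\n' | LC_ALL=C sort)
}

# Exit 0 if both flat directories list the same names, sizes, and
# modification times; exit 1 if they differ; exit 2 if either argument
# is not a directory. File contents are never read.
same_shallow() {
    [ -d "$1" ] && [ -d "$2" ] || return 2
    [ "$(manifest "$1")" = "$(manifest "$2")" ]
}
```

The times are compared at full precision, so two drives whose filesystems store timestamps at different resolutions may be reported as different even when rsync would consider them in sync.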
Last edited by jalu (2010-08-07 05:39:09)
Offline
What about the 'cmp' command? If it returns one line of output, the files are different. If it doesn't have any output whatsoever, like 'diff,' the files are the same.
This would probably require a script, however.
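A minimal sketch of such a script, assuming flat directories as described earlier in the thread. Rather than counting output lines it relies on cmp's exit status (0 for identical, 1 for different, greater than 1 for trouble) with -s to silence the output. The function name same_contents is hypothetical, and cmp still reads each pair of files up to the first differing byte.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if every regular file in flat directory $1
# has a byte-identical file of the same name in directory $2.
# Limitations: hidden files are skipped by the * glob, and files that
# exist only in $2 are not noticed.
same_contents() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # cmp -s prints nothing; a nonzero status means the files differ
        # or one of them could not be read (for example, it is missing).
        cmp -s "$f" "$2/$(basename "$f")" || return 1
    done
    return 0
}
```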
Last edited by urist (2010-08-07 05:52:57)
Offline
From cmp man page: "Compare two files byte by byte" - I'm not sure I'd like to do this with a 50 GB pack of files :-)
Offline
I must agree. :-)
cmp does seem like an interesting little program, though. I've never used it before. Thanks for the tip, urist.
In the meantime, I have managed to get my mirroring script to work (seemingly) well with ls comparisons, but I'm still wary as to whether this is the right way to do this... "If it ain't broke"?
Offline