#1 2019-04-22 19:39:36

catnap
Member
Registered: 2016-10-03
Posts: 131

How to perform an approximate comparison between directories?

Several years ago, when I was still learning about Linux and its file systems, I had the unfortunate habit of making backups of some of my Linux files to external hard drives that came with the standard NTFS formatting. More recently, I have found it more convenient to use the same ext3 file system both on my computer and on the external hard disks. However, the problem still persists with the old backups.

I would like to find out whether my newer backups supersede the old ones well enough to allow me to throw away some of the old external hard disks. For this, I would have to perform a comparison between two incompatible file systems. rsync can perform such a comparison, but its problem is that it announces differences that are inconsequential. For example, the NTFS file system does not store Unix permissions, so the permissions differ for every file. I would like to omit these differences from the report. The most robust way of checking the contents would probably be to perform the comparison approximately, just at the superficial level of the file names. Is there a good tool for this? Might there be a tool that also checks the contents, but does not report inconsequential, minor differences, and recognizes when one file is simply a newer version of the other?

Offline

#2 2019-04-23 06:01:55

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,354

Re: How to perform an approximate comparison between directories?

rsync does that. Just don't use the -p flag, so that permissions are ignored. You may also want to set --modify-window.
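
For example, a dry run along these lines (the paths are placeholders) should list only the meaningful differences:

# -n changes nothing, -i itemizes each difference; -rt compares
# recursively by size and modification time (no -p, so permissions
# are ignored); --modify-window allows a little slack in timestamps
rsync -rtni --modify-window=2 /path/to/source/ /mnt/old-ntfs-backup/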


Allan-Volunteer on the (topic being discussed) mailn lists. You never get the people who matters attention on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline

#3 2019-04-23 06:44:21

schard
Forum Moderator
From: Hannover
Registered: 2016-05-06
Posts: 1,965
Website

Re: How to perform an approximate comparison between directories?

Why not just use diff or sha256sum recursively?
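
For example (the paths are placeholders):

# -r recurses; -q only reports which files differ or exist on one
# side only, instead of printing the full diffs
diff -rq /path/to/backup /path/to/source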


macro_rules! yolo { { $($tokens:tt)* } => { unsafe { $($tokens)* } }; }

Offline

#4 2019-04-23 13:22:57

catnap
Member
Registered: 2016-10-03
Posts: 131

Re: How to perform an approximate comparison between directories?

ngoonee wrote:

rsync does that. Just don't use the -p flag to ignore permissions. You may also want to set 'modify-window'.

This will likely do the trick.

There's a further complexity to the issue that I forgot to mention in the original post. Maybe I will edit the original one later to have all of the question in one place. In any case, the file hierarchies of the different versions of the backup have become incompatible after several modifications. The command would therefore also have to ignore the file hierarchies and collapse the files onto the same level, ignoring links, directories, and such. Alternatively, the command could report the hierarchy differences without much verbosity, just a single difference per change in the structure. Either one of these solutions would probably work. But is there a concise command that will help? The sketch below shows the kind of flattened comparison I have in mind.
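
For instance (assuming GNU find, with placeholder paths):

# compare only the sorted sets of basenames, discarding the hierarchy
diff <(cd /path/to/old && find . -type f -printf '%f\n' | sort) \
     <(cd /path/to/new && find . -type f -printf '%f\n' | sort)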

Offline

#5 2019-04-23 13:55:21

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,517
Website

Re: How to perform an approximate comparison between directories?

Do you not have any files with the same name?  In other words, is every file's basename unique?  That's often not the case.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#6 2019-04-23 14:10:55

catnap
Member
Registered: 2016-10-03
Posts: 131

Re: How to perform an approximate comparison between directories?

Trilby wrote:

Do you not have any files with the same name?  In other words, is every file's basename unique?  That's often not the case.

That really could constitute a problem. The algorithm would apparently have to make an intelligent guess about which base-level files are supposed to correspond with one another. That could be difficult to implement, and precarious in any case.

So rather than collapsing the hierarchy, one could perhaps pursue the second solution that was suggested and make the difference report as brief as possible without losing essential information. E.g. if the directory

/home/user/Data/

was renamed into

/home/user/Results/

the algorithm should not report every file in the directory, as in:

deleting Data/file1.dat
deleting Data/file2.dat
creating Results/file1.dat
creating Results/file2.dat

Rather, the algorithm would report only:

renaming Data/ to Results/

Offline

#7 2019-04-23 14:28:51

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,517
Website

Re: How to perform an approximate comparison between directories?

In that case, I agree with schard: use hash values.

Generate a list of filenames and hashes for the source and the backup, then group the entries: A) hashes that overlap but have different names are moved or copied files, B) filenames that overlap but have different hashes are changed files, and C) anything else is unique to either the source or the backup.

Two find commands generate the lists, then a couple of join commands on the lists will categorize them:

cd /path/to/backup
find . -type f -exec sha1sum '{}' + | sort > /tmp/backup.hash
cd /path/to/source
find . -type f -exec sha1sum '{}' + | sort > /tmp/source.hash
cd /tmp
join -j 1 backup.hash source.hash > renamed.list
# join needs input sorted on the join field, so re-sort by name for -j 2
join -j 2 <(sort -k 2 backup.hash) <(sort -k 2 source.hash) > changed.list
# then perhaps a "grep -Ff ..." here depending on what you want for C
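
For C, one possibility is to extract and compare the name columns (a sketch that assumes file names without whitespace, using comm instead of the grep -Ff mentioned above):

# field 2 of each .hash file is the file name
awk '{print $2}' backup.hash | sort > backup.names
awk '{print $2}' source.hash | sort > source.names
comm -23 backup.names source.names > only-in-backup.list
comm -13 backup.names source.names > only-in-source.list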

Last edited by Trilby (2019-04-23 14:33:35)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#8 2019-05-15 18:51:52

catnap
Member
Registered: 2016-10-03
Posts: 131

Re: How to perform an approximate comparison between directories?

Upon contemplation, this topic seems quite broad and unlikely to be answered thoroughly any time soon. This is not a criticism: the question is quite open ended and therefore not amenable to a simple solution. However, one can make a list of related best practices and combine them to get the best results. In summary, the methods that I am currently using to tackle the problem of saving files from multiple devices and locations into one place are the following:

1. Use Git version control repositories as much as possible and Git Annex for large files (a minimal sketch follows below)

2. Use rsync with suitable options to compare unversioned directories and to synchronize them

3. Use diff <(ls old) <(ls new) or an equivalent vimdiff command for approximate comparisons

4. Begin by adopting a logical file structure that is consistent across devices

Others could probably supply more items to the list of best practices.
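
As a minimal sketch of item 1 (the path, directory name, and commit message are hypothetical placeholders):

cd /path/to/archive           # the consolidation directory
git init
git annex init "main archive"
git annex add measurements/   # large files are checksummed into the annex
git commit -m "import files from an old NTFS backup"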

I humbly propose that saving and structuring data from multiple devices, like USB sticks, external hard drives, decommissioned laptops, or old smartphones that gather dust in a drawer, must be a very common problem, and that it is no less relevant for being mundane and unexciting. I believe the topic could benefit from a concentrated discussion. Maybe we should coin a term for this mundane chore and establish a Wiki page under a corresponding title to list some of the related best practices.

I've been mentally toying with the terms "data scavenging" and "information scavenging". But maybe those should be regarded as just a first iteration, because "scavenging" has a somewhat negative undertone and might even connote illegality, quite apart from the intended meaning. The simpler term "gathering", on the other hand, might not be specific enough.

Offline
