I think the speed of the hashing is not the real concern. It's how to identify which files to actually hash. I suspect an approach like the following could work well:
1) Get a list of all files on disk1 that have changed since the last sync (e.g., by last-modified time)
2) Hash each of these files, creating a map/dictionary of hash -> pathname
3) For any file on disk2 whose name is in the list from #1, create a second map/dictionary of hash -> pathname
4) Append to the map from #3 a hash and pathname for any file on disk2 that is no longer present on disk1
5) Identify any matching hashes in the two maps and move/rename the files on disk2 to match the pathname from disk1, removing those matched entries from both maps as you go
6) Copy any files remaining in the first hash list to disk2
7) Optionally delete any files from disk2 that remain in the second hash list
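Steps 5 to 7 could be sketched roughly like this in Python. This is only a sketch under assumptions: the two hash lists are taken as already built and passed in as path -> hash dictionaries (inverted internally, since identical files share a hash), and the name plan_sync and the example hashes are made up for illustration. It only plans the work and touches no files.

```python
from collections import defaultdict


def plan_sync(src_hashes, dst_hashes):
    """Plan renames, copies and deletes for disk2 (hypothetical helper).

    src_hashes: path -> hash for files changed on disk1 (steps 1-2).
    dst_hashes: path -> hash for the disk2 candidates (steps 3-4).
    """
    # Invert disk2's entries to hash -> [paths]; identical files share a hash.
    dst_by_hash = defaultdict(list)
    for path, digest in sorted(dst_hashes.items()):
        dst_by_hash[digest].append(path)

    renames, copies = [], []
    for path, digest in sorted(src_hashes.items()):
        matches = dst_by_hash.get(digest)
        if matches:
            # Step 5: consume one matching disk2 entry, preferring the same path.
            old = path if path in matches else matches[0]
            matches.remove(old)
            if old != path:
                renames.append((old, path))
        else:
            # Step 6: no disk2 file has this content, so it must be copied.
            copies.append(path)

    # Step 7: whatever is left on the disk2 side is stale. Paths that a rename
    # or copy will overwrite anyway are left out of the delete list.
    overwritten = {new for _, new in renames} | set(copies)
    deletes = sorted(
        p for paths in dst_by_hash.values() for p in paths if p not in overwritten
    )
    return renames, copies, deletes


# Example: fileB was renamed to fileC and fileA to fileB on disk1.
src = {"fileB": "hashA", "fileC": "hashB", "new.txt": "hashN"}
dst = {"fileA": "hashA", "fileB": "hashB", "old.txt": "hashO"}
renames, copies, deletes = plan_sync(src, dst)
print(renames, copies, deletes)
```

Keeping the planning separate from the actual file operations makes it easy to do a dry run and inspect the result before anything on disk2 is moved.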
I think the two independent hash lists would be needed particularly to cover cases where files are renamed to take the place of previously existing files... for example:
Disk1 has fileA and fileB, both of which are already synced to disk2. But then fileB is renamed to fileC and fileA is renamed to fileB. Without independent hash lists, it would be easy to see the same filename (fileB) on both disks but with different hash values, so a naive backup program might assume fileB changed and must now be copied. But the independent hash lists would allow for no actual data to move in this backup process, and the renames could be processed correctly.
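One detail of the fileA/fileB example is that the renames on disk2 have to be applied in an order that doesn't clobber anything: moving fileA to fileB first would overwrite the old fileB before it can become fileC. A rough way around that, sketched below, is to go through temporary names. The helper names and the ".sync-tmp" suffix are assumptions for illustration (the suffix is assumed not to collide with any real file), and a plain dict stands in for disk2.

```python
def order_renames(renames, tmp_suffix=".sync-tmp"):
    """Turn (old, new) pairs into a clobber-free sequence of moves.

    Phase 1 moves every source to a temporary name; phase 2 moves each
    temporary name to its final name. tmp_suffix is a hypothetical choice.
    """
    staged = [(old, old + tmp_suffix, new) for old, new in renames]
    phase1 = [(old, tmp) for old, tmp, _ in staged]
    phase2 = [(tmp, new) for _, tmp, new in staged]
    return phase1 + phase2


def apply_moves(files, moves):
    """Simulate moves on a name -> content dict (overwrites like mv would)."""
    files = dict(files)
    for src, dst in moves:
        files[dst] = files.pop(src)
    return files


disk2 = {"fileA": "A-data", "fileB": "B-data"}
renames = [("fileA", "fileB"), ("fileB", "fileC")]

naive = apply_moves(disk2, renames)                 # loses B-data
safe = apply_moves(disk2, order_renames(renames))   # keeps both files
print(naive)
print(safe)
```

Sorting the renames so that each target is vacated before it is reused would also work for this particular case, but a straight swap (fileA <-> fileB) still needs at least one temporary name.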
If I were implementing this, I'd probably actually use two different hash algorithms. The first, for constructing the two hash lists, could be fairly "dumb" but fast (meaning that missing a change would be fine as long as it never reported a change that didn't happen; even file size might be good enough for this). If/when a potential match between hashes on the two lists was found, it could be confirmed or rejected with a better hash.
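A minimal sketch of that two-stage check, assuming file size as the cheap first-stage key and SHA-256 from Python's standard hashlib as the "better" hash. The function names are hypothetical, and any stronger hash could be swapped in for SHA-256.

```python
import hashlib
import os


def cheap_key(path):
    # Stage 1: file size. Identical files always agree on it, so it never
    # reports a change that didn't happen, but different files can collide.
    return os.path.getsize(path)


def strong_hash(path, chunk_size=1 << 20):
    # Stage 2: hash the full content in chunks to keep memory use flat.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a, path_b):
    # Only pay for the expensive hash when the cheap keys already match.
    if cheap_key(path_a) != cheap_key(path_b):
        return False
    return strong_hash(path_a) == strong_hash(path_b)
```

With this split, most non-matching pairs are rejected by a single stat call, and full reads happen only for candidates that already look alike.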
"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" - Richard Stallman
Instead of looking for a smart backup program, why not look for a 'smart' move or rename that keeps track of the changes on disk1? If you know what has moved on disk1, you can replicate it on disk2 before backing up disk1 to disk2.
In the conflict between users who want smarter software and software that wants smarter users, users win.
I suspect an approach like the following could work well
This is going to make my job easier. Thanks
I'd probably actually use two different hash algorithms. The first for constructing the two hash lists could be fairly "dumb" but fast
Agreed, and x33a's suggestion (xxHash, described as an "extremely fast non-cryptographic hash algorithm") seems a good fit for the job. I'm not sure about file size alone (the filename cannot be used, and I don't have other parameters to combine in a quick hash).
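If file size alone turns out to be too coarse, one possible quick key is the size combined with a checksum of just the first block of the file. The sketch below uses zlib.crc32 from the Python standard library purely as a stand-in for a fast non-cryptographic hash such as xxHash (which is a third-party package in Python); the name quick_key and the 64 KiB default are assumptions for illustration.

```python
import os
import zlib


def quick_key(path, head_bytes=64 * 1024):
    """Cheap first-stage key: (file size, CRC32 of the first head_bytes)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(head_bytes)
    # Identical files always produce identical keys. Different files can
    # still collide (same size and same head), so a match is only a
    # candidate that a stronger full-content hash should confirm.
    return (size, zlib.crc32(head))
```

This reads at most one block per file, so it stays cheap on large files while separating far more cases than size by itself.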
Last edited by eleius (2017-07-08 06:04:52)