
#1 2015-10-30 16:28:16

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

hsync - A filesystem hierarchy synchronizer

Hi!

When synchronizing 2 devices where some folders were renamed, synchronization tools like rsync do not see the rename and will transfer the full folder content instead.
I've always found it a big waste of time and resources!

Example: you have an old backup of a massive folder 'media' which you recently renamed 'medianew' on your local drive.

  • /local/medianew

  • /backup/media

If you synchronize /backup from /local with rsync, /backup/media will be deleted and /local/medianew will be transferred in full to /backup/medianew.
However, if you first rename '/backup/media -> /backup/medianew', the subsequent rsync run will be much faster.

Since I could not find any program that would detect renames, I decided to write my own tool.

Feedback is welcome!

Usage

From 'hsync -h':

Usage: hsync SOURCE TARGET

Filesystem hierarchy synchronizer

Rename files in TARGET so that identical files found in SOURCE and TARGET have
the same relative path.

The main goal of the program is to make folders synchronization faster by
sparing big file transfers when a simple rename suffices. It complements other
synchronization programs that lack this capability.

hsync previews all changes by default. You can tweak each rename operation to your liking.

If you want to mirror SOURCE to TARGET with rsync, running hsync beforehand can dramatically speed up the process:

$ hsync SOURCE TARGET
$ rsync -livr --size-only --delete-excluded --append-verify --progress -- SOURCE/ TARGET

Requirements

None!
It is written in Go and the binary is statically linked. The program should be pretty lightweight :)

Performance

Since the goal is to provide a significant speed-up to other synchronization tools, performance is a critical part of the program.
I've tried hard to minimize I/O, since it is the main bottleneck here.

A run over a few hundred thousand files (~100 GB) finishes in a few minutes on a mildly slow hard drive with a cold cache. A second run typically finishes within seconds with a hot cache.

I am currently using a rolling md5 checksum to match files. Adler-32 could be a bit faster but would yield more collisions. I am not sure which one is best.

Some parts could be parallelized, but a lot of thread synchronization would be required. It is unclear to me if it would yield better performance in all scenarios. I still need to work on it.

Memory-wise, hsync can use several hundred MB of RAM when working on millions of files. This could probably be optimized, albeit not easily.

Comments are welcome.

File integrity

By default, all changes are previewed and you can tweak the renaming operations before proceeding.
Nevertheless, there are many edge-cases with funky filesystem configurations that may lead to unexpected results.
A typical example would be bind mounts: most (all?) kernels do not support renaming across mount points, so if the TARGET folder contains such bind mounts, some renames might fail.

If you experience any weird behaviour, or if you are a filesystem guru and can spot some mistakes in my (very short) code, you are more than welcome to share your knowledge!

See the Implementation details.


Links

Last edited by Ambrevar (2015-11-02 10:16:48)


#2 2015-11-01 14:51:49

sekret
Member
Registered: 2013-07-22
Posts: 283

Re: hsync - A filesystem hierarchy synchronizer

Great idea!!! I'll check it out and will most probably include it in my "timemachine" script ;)


#3 2015-11-01 19:20:16

Rasi
Member
From: Germany
Registered: 2007-08-14
Posts: 1,914
Website

Re: hsync - A filesystem hierarchy synchronizer

Hmm. Support for ssh:// URIs would be nice.


He hoped and prayed that there wasn't an afterlife. Then he realized there was a contradiction involved here and merely hoped that there wasn't an afterlife.

Douglas Adams


#4 2015-11-01 19:36:30

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: hsync - A filesystem hierarchy synchronizer

What about mounting your remote SSH storage locally with SSHFS?
You could automate this with a simple wrapper script.
I don't use SSH remote access much, so I need some usability feedback. Tell me if that would not work for you.


#5 2015-11-13 10:02:34

Ambrevar
Member
Registered: 2011-08-14
Posts: 212
Website

Re: hsync - A filesystem hierarchy synchronizer

I did some backups recently. That was a good opportunity to measure the benefits provided by hsync :)

Document folder on hot cache:
* rsync: 91s.
* hsync + rsync: 18s.

Massive video folder with slow bandwidth, with 1/3 of files in common with the backup:
* rsync: 12 hours
* hsync + rsync: 8 hours (as could be expected)

