Question

I have a great deal of data to keep synchronized across 4 or 5 sites around the world, around half a terabyte at each site. The data changes (additions or modifications) by around 1.4 gigabytes per day, and changes can originate at any of the sites.

A large percentage (30%) of the data consists of duplicate packages (packaged-up JDKs, for example), so the solution would have to include a way of noticing that such things are already lying around on the local machine and grabbing them locally instead of downloading them from another site.

Version control is not an issue; this is not a codebase per se.

I'm just interested in whether there are any solutions out there (preferably open source) that come close to such a thing.

My baby rsync script doesn't cut the mustard any more; I'd like to do more complex, intelligent synchronization.
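
For reference, it's essentially just a one-liner along these lines (paths and host are placeholders, shown only to give an idea of the current level of sophistication):

    # one-way push of the local tree to a single remote site
    rsync -az --partial --delete /data/ syncuser@remote-site:/data/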

Thanks

Edit: This should be UNIX-based :)


Solution

Have you tried Unison?

I've had good results with it. It's basically a smarter rsync, which may be what you want. There is a listing comparing file synchronization tools here.
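
As a starting point, a minimal two-way run over SSH looks roughly like this (the roots are placeholders, and Unison also has to be installed on the remote side):

    # two-way synchronization of a local tree with a remote tree over SSH
    unison /data ssh://remote-site//data -batch -auto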

OTHER TIPS

Sounds like a job for BitTorrent.

For each new file at each site, create a BitTorrent seed file and put it into a centralized, web-accessible directory.

Each site then downloads (via BitTorrent) all files. This gets you bandwidth sharing and automatic reuse of local copies.

The actual recipe will depend on your needs. For example, you can create one torrent seed for each file on each host, and set the modification time of the seed file to be the same as the modification time of the file itself. Since you'll be doing it daily (hourly?), it's better to use something like "make" to (re-)create seed files only for new or updated files.
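
A rough sketch of that seed-creation step, assuming mktorrent is installed and that DATA_DIR, SEED_DIR and TRACKER_URL are placeholders for your own setup:

    # (re-)create a .torrent for every data file that is newer than its seed
    DATA_DIR=/data
    SEED_DIR=/data-seeds
    TRACKER_URL=http://tracker.example.com:6969/announce

    find "$DATA_DIR" -type f | while read -r f; do
        seed="$SEED_DIR/$(echo "$f" | tr / _).torrent"
        if [ ! -e "$seed" ] || [ "$f" -nt "$seed" ]; then
            rm -f "$seed"
            mktorrent -a "$TRACKER_URL" -o "$seed" "$f"
            touch -r "$f" "$seed"    # give the seed the data file's mtime
        fi
    done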

Then you copy all seed files from all hosts to the centralized location ("tracker dir") with an "overwrite only if newer" option. This gets you a set of torrent seeds for the newest copies of all files.
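
That "overwrite only if newer" copy can be as simple as rsync's -u option (host and paths are placeholders):

    # push local seeds to the central tracker dir, replacing only older copies
    rsync -auz /data-seeds/ tracker-host:/srv/torrents/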

Then each host downloads all seed files (again, with the "overwrite only if newer" setting) and starts a BitTorrent download on all of them. This will download/re-download all the new/updated files.
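
Any headless client should do for that step; purely as an illustration (aria2 is my assumption, not something the recipe requires), it might look like:

    # pull the newest seeds, then fetch/verify the data they describe
    rsync -auz tracker-host:/srv/torrents/ /data-seeds/
    aria2c --check-integrity=true --dir=/data /data-seeds/*.torrent

The --check-integrity flag makes the client hash-check anything already on disk, which is what avoids re-downloading files that are already present locally.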

Rinse and repeat, daily.

BTW, there will be no "downloading from itself", as you said in the comment. If a file is already present on the local host, its checksum will be verified and no downloading will take place.

How about something along the lines of Red Hat's Global File System, so that the whole structure is split across every site onto multiple devices, rather than having it all replicated at each location?

Or perhaps a commercial network storage system such as one from LeftHand Networks (disclaimer: I have no idea about cost, and haven't used them).

You have a lot of options:

  • You can try setting up a replicated database to store the data.
  • Use a combination of rsync or lftp and custom scripts, but that doesn't suit you.
  • Use git repos with maximum compression and sync between them using some scripts (see the sketch after this list).
  • Since the amount of data is rather large and probably important, either do some custom development or hire an expert ;)
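
For the git option, a minimal sketch of the kind of setup meant there (the remote name and paths are placeholders, and whether git copes well with half a terabyte of binary data is another question):

    # one-time: crank compression up in the repo that holds the data
    git config core.compression 9
    git config pack.compression 9

    # mirror the repo to a peer site (e.g. from a cron job)
    git remote add peer ssh://peer-site/srv/data.git
    git push --mirror peer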

Check out Super Flexible... it's pretty cool. I haven't used it in a large-scale environment, but on a 3-node system it seemed to work perfectly.

Sounds like a job for Foldershare

Have you tried the detect-renamed patch for rsync (http://samba.anu.edu.au/ftp/rsync/dev/patches/detect-renamed.diff)? I haven't tried it myself, but I wonder whether it will detect not just renamed but also duplicated files. If it won't detect duplicated files, then, I guess, it might be possible to modify the patch to do so.
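
If you do experiment with it, the sequence would presumably be along these lines (untested; the option name is inferred from the patch's file name):

    # apply the patch to an rsync source tree, rebuild, then use the new option
    cd rsync-src
    patch -p1 < detect-renamed.diff
    ./configure && make
    ./rsync -az --detect-renamed /data/ remote-site:/data/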

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow