Question

I am trying to sync 2 s3 buckets. Here is the command I am using to sync between 2 s3 buckets.

s3cmd sync s3://source-bucket s3://destination-bucket

I am setting this up in crontab and have specified the absolute path to s3cmd. I am logging the output, but the log file is empty: it doesn't show any errors, and the buckets don't sync either. What is the issue, and how do I solve it?
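For reference, a cron entry along these lines (the s3cmd path, schedule, and log file below are placeholders, not taken from the question) is the kind of setup being described. Note that without the 2>&1, anything s3cmd writes to stderr never reaches the log, which is one common reason a log file ends up empty:

    # run every 15 minutes; schedule and paths are illustrative only
    */15 * * * * /usr/local/bin/s3cmd sync s3://source-bucket s3://destination-bucket >> /var/log/s3sync.log 2>&1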


Solution

As we discovered in the comments, the solution to your problem was the same as the one described here: your version of s3cmd was too old to support bucket-to-bucket sync, and the fix was to upgrade s3cmd. I'm glad it was an easy fix.

However, there are two very significant problems with what you are trying to do with this tool.

The s3cmd utility is not an appropriate tool to use in a cron job to routinely synchronize two buckets, for two reasons:

First, you need to allow for the possibility that the tool runs so long that the cron job fires again before the previous run has finished. If that happens, you could have two or more copies of s3cmd running at the same time, all trying to synchronize the same two buckets. As the second instance discovers more and more objects that are already synched, it will probably catch up with the first to the point where both are re-synching approximately the same files, doubling the number of transfers you are doing.

The timeline could look something like this:

...A discovers file not there, begins to sync file

......B discovers file not there, also begins to sync file

.........A finishes synching file

............B finishes synching file.

Assuming you're not using versioned objects in your bucket, your data will be fine but you're paying for twice as many requests and twice as much bandwidth.

At an absolute minimum, your cron job needs to call a bash script that manages a lock file, to prevent multiple concurrent runs.
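A minimal sketch of such a wrapper, using flock(1); the s3cmd path, bucket names, and lock file location are placeholders:

    #!/bin/bash
    # Sketch of a lock-guarded sync wrapper for cron.
    # The lock file path, s3cmd path, and bucket names are illustrative only.

    LOCKFILE=/var/lock/s3-bucket-sync.lock

    (
        # Take an exclusive, non-blocking lock on file descriptor 9.
        # If a previous run still holds the lock, skip this run entirely.
        flock -n 9 || { echo "previous sync still running, skipping"; exit 0; }

        /usr/local/bin/s3cmd sync s3://source-bucket s3://destination-bucket

    ) 9>"$LOCKFILE"

The cron job then calls this script instead of invoking s3cmd directly.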

Second, and more seriously, s3cmd will not scale in this environment, as it appears to have no "memory" of what's in each bucket.

I have, for example, a bucket with 8 million objects in it. If I wanted to do a one-time copy from bucket to bucket with s3cmd, that would be okay. The problem is that s3cmd doesn't "remember" what it saw in your buckets before, so the second time around, and every time after that, it has to discover all 8 million objects in one bucket, check whether each one exists in the other bucket, and (presumably) verify whether they are identical, by sending a HEAD request against each object in both directions. This approach will not scale, and it could end up costing you a substantial amount in unnecessary requests to S3.

For my own internal systems, I maintain a local database of the objects in the buckets. When I add an object to a bucket, I update the database with the size, md5, and other attributes of the object after the transfer succeeds. Then, I have all of my buckets set up with logging (into a different, common bucket). My systems fetch the log files, parse them, and for any objects that have been uploaded by other processes (according to the logs) I fetch their metadata and store that in the local database, too... so I have a local representation of what's in S3 that is delayed by only a few minutes (the wait time for the logs to arrive and be discovered).

Then, when I need to synchronize buckets to filesystems or to each other, I can use the local database to compare the contents and decide which files need to be synched. Of course, I also have processes that audit the database for consistency against S3.
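A much cruder variant of the same idea, as a sketch only: cache a listing of each bucket and diff the listings, rather than letting the sync tool HEAD every object on every run. The bucket names and temporary paths are placeholders, and this only compares key names, not sizes or checksums:

    #!/bin/bash
    # Sketch: list both buckets once, then print the keys that exist in the
    # source but not the destination. Each such key could then be copied
    # individually (e.g. with s3cmd cp). Names and paths are illustrative.

    s3cmd ls --recursive s3://source-bucket \
        | awk '{print $4}' | sed 's|^s3://source-bucket/||' | sort > /tmp/source.keys

    s3cmd ls --recursive s3://destination-bucket \
        | awk '{print $4}' | sed 's|^s3://destination-bucket/||' | sort > /tmp/dest.keys

    # comm -23 prints the lines that appear only in the first file.
    comm -23 /tmp/source.keys /tmp/dest.keys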

If you are going to be synchronizing two buckets routinely, I would suggest that you need a more sophisticated solution than s3cmd.

OTHER TIPS

One option is to mount both buckets as local directories (using RiofS, for example) and run your favorite tool to synchronize the two folders.
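For example, once both buckets are mounted (the mount points below are placeholders, and the exact mount invocation depends on the FUSE tool and version you use), an ordinary rsync can do the comparison. Note that this still walks every object on each run, so the scaling caveat above applies here too:

    # Assumes both buckets are already mounted via a FUSE tool such as RiofS;
    # the mount points are placeholders.
    rsync -av /mnt/source-bucket/ /mnt/destination-bucket/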

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow