Question

I need to move a large amount of files (on the order of tens of terabytes) from Amazon S3 into Google Cloud Storage. The files in S3 are all under 500 MB.

So far I have tried using gsutil cp with the parallel option (-m), with S3 as the source and GCS as the destination directly. Even after tweaking the multiprocessing and multithreading parameters, I haven't been able to achieve throughput above 30 MB/s.
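For reference, the parallelism parameters mentioned above live in gsutil's `.boto` configuration file. The values below are illustrative placeholders, not recommendations:

```ini
# ~/.boto -- illustrative values only
[GSUtil]
parallel_process_count = 8
parallel_thread_count = 10
```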

What I am now contemplating:

  • Load the data in batches from S3 into HDFS using distcp, then find a way to distcp all the data into Google Storage (not supported as far as I can tell), or:

  • Set up a Hadoop cluster where each node runs a gsutil cp parallel job with S3 and GCS as source and destination

If the first option were supported, I would really appreciate details on how to do that. However, it seems I'm going to have to figure out the second one. I'm unsure how to pursue this avenue because I would need to keep track of gsutil's resumable-transfer feature across many nodes, and I'm generally inexperienced at running this sort of Hadoop job.

Any help on how to pursue one of these avenues (or something simpler I haven't thought of) would be greatly appreciated.


Solution

You could set up a Google Compute Engine (GCE) account and run gsutil from GCE to import the data. You can start up multiple GCE instances, each importing a subset of the data. This is one of the techniques covered in the talk we gave at Google I/O 2013, Importing Large Data Sets into Google Cloud Storage.
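As a minimal sketch of "each instance imports a subset," assuming the objects can be partitioned by leading key character (bucket and instance names are placeholders):

```shell
#!/bin/sh
# Partition the keyspace by leading hex character and assign each shard
# to one of four hypothetical GCE instances, round-robin. This only
# prints the commands; each would be run on the named instance.
NUM_INSTANCES=4
i=0
for p in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  host="importer-$((i % NUM_INSTANCES))"
  echo "$host: gsutil -m cp -R 's3://source-bucket/${p}*' gs://dest-bucket/"
  i=$((i + 1))
done
```

The right partitioning depends on how your keys are actually distributed; the point is only that each instance gets a disjoint slice of the bucket.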

One other thing you'll want to do with this approach is use the gsutil cp -L and -n options. -L creates a manifest that records details about what has been transferred, and -n lets you avoid re-copying files that were already copied (in case you restart the copy from the beginning, e.g., after an interruption). I suggest you update to gsutil version 3.30 (which will come out in the next week or so), which improves how the -L option works for this kind of copying scenario.
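Concretely, a restartable copy along those lines might look like the following (bucket and manifest names are placeholders); re-running the identical command after an interruption skips objects already at the destination:

```shell
# -m: parallel copy; -L: log each transfer to a manifest CSV;
# -n: no-clobber, skip objects that already exist at the destination.
CMD="gsutil -m cp -L transfer-manifest.csv -n -R s3://source-bucket/ gs://dest-bucket/"
echo "$CMD"
```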

Mike Schwartz, Google Cloud Storage team

Other tips

Google has since released the Cloud Storage Transfer Service, which is designed to transfer large amounts of data from S3 to GCS: https://cloud.google.com/storage/transfer/getting-started
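As a hedged sketch, a transfer job can be created by POSTing to the Storage Transfer Service REST API. The request body below is illustrative; the project and bucket names are placeholders, and AWS credential configuration is omitted:

```shell
# Illustrative request body for the Storage Transfer Service
# (storagetransfer.googleapis.com/v1/transferJobs endpoint).
BODY='{
  "projectId": "my-project",
  "status": "ENABLED",
  "transferSpec": {
    "awsS3DataSource": { "bucketName": "source-bucket" },
    "gcsDataSink": { "bucketName": "dest-bucket" }
  }
}'
echo "$BODY"
# The actual call would resemble:
#   curl -X POST -H "Authorization: Bearer $TOKEN" \
#        -H "Content-Type: application/json" -d "$BODY" \
#        https://storagetransfer.googleapis.com/v1/transferJobs
```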

(I realize this answer is a little late for the original question but it may help future visitors with the same question.)

License: CC BY-SA with attribution