For Hadoop, which data storage to choose, Amazon S3 or Azure Blob Store?

https://stackoverflow.com/questions/10490463

06-06-2021
|

문제

I am working on a Hadoop project and generating lots of data in my local cluster. Sooner later I will be using cloud based Hadoop solution because my Hadoop cluster is very small comparative to real work load, however I dont have a choice as of now which one I will be using i.e. Windows Azure based, EMR or something else. I am generating lots of data locally and want to store this data to some cloud based storage based on the fact that I will use this data with Hadoop later but very soon.

I am looking for suggestion to decided which cloud store to choose based in someone experience. Thanks in advance.

해결책

First of all it is a great question. Let's try to understand "How data is processed in Hadoop":

In Hadoop all the data is processed on Hadoop cluster means when you process any data, that data is copied from its sources to HDFS, which is an essential component of Hadoop.
When data is copied to HDFS only after your run Map/Reduce jobs in it to get your results.
That means it does not matter what and where your data sources is(Amazon S3, Azure Blob, SQL Azure, SQL Server, on premise source etc), you will have to move/transfer/copy your data from source to HDFS, within the limits of Hadoop.
Once data is processed in Hadoop cluster, the result will be stored the location you would have configured in your job. The output data source can be HDFS or an outside location accessible from Hadoop Cluster
Once you have data copied to HDFS you can keep it one HDFS as long as you want but you will have to pay the price to use the Hadoop cluster.
In some cases when you are running Hadoop Job between some interval and data move/copy can be done faster, it is good to have a strategy to 1) acquire Hadoop cluster 2) copy data 3) run job 4) release cluster.

So based on above details, when you choose a data source in Cloud for your Hadoop Cluster you would have to consider the following:

If you have large data (which is normal with Hadoop clusters) to process, consider different data sources and the time it will take to copy/move data from those data source to HDFS because this will be your first step.
You would need to choose a data source which must have the lowest network latency so you can get data in and out, as fast as possible.
You also need to consider how you will move large amount of data from your current location to any cloud store. The best option would be to have a storage where you can send your data disk (HDD/Tape etc) because uploading multiple TB data will take great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure in CTP) and Google (BigQuery in Preview, based on Google Dremel) provides pre-configured Hadoop clusters in cloud so you can choose where you would want to run your Hadoop job then you can consider the cloud storage.
Even if you choose one cloud data storage and decide to move to other because you want to use other Hadoop cluster in cloud, you sure can transfer the data however consider the time and data transfer support available to you.
For example, with HadooponAzure you can connect various data sources i.e. Amazon S3, Azure Blob Storage, SQL Server and SQL Azure etc so a variety of data sources are the best with any cloud Hadoop cluster.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow