Question

The task is to implement the T (transform) part of an ETL project in the Azure cloud. I believe HDInsight is the right service for it, but I am not sure. Please confirm or refute this choice.

I am quite new to the field and would appreciate it if someone could point me in the right direction.

I would like to develop the transform service (job) and test it locally using the Azure Storage/Compute Emulators and Visual Studio 2012 (ideally in C#). I am not sure how HDInsight fits into this picture (if it does at all). The transform job will read text files from blob storage and write the (map-reduce) output to Azure Table storage.

Solution

You can certainly run an HDInsight box locally. This is separate from the Azure storage and compute emulators, and is installed through the Web Platform Installer (just search for HDInsight).

There are some subtle differences between the local and Azure versions: the local version works with data stored in HDFS, whereas in the cloud you can use Azure Blob containers. For developing and testing your transform processes (in MapReduce / Hive / Pig) this makes no real difference. The only difference is how you get the data in and out.

Note that although you can certainly create MapReduce jobs with C# on HDInsight, for basic data transformations it can be a lot easier to use a higher-level language like Pig, or possibly the SQL-based HiveQL.
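
To make the map and reduce steps concrete, here is a minimal, language-neutral sketch in plain Python. It is not HDInsight or Hadoop API code and runs entirely in memory. The input layout (comma-separated `key,value` lines) and the names `map_line`, `reduce_values` and `run_job` are assumptions made for illustration only; the same shape would carry over to a C# MapReduce job.

```python
from collections import defaultdict


def map_line(line):
    """Map step: turn one 'key,value' text line into (key, number) pairs.

    The 'key,value' layout is an assumed input format, not something
    prescribed by HDInsight.
    """
    line = line.strip()
    if not line:
        return []  # blank lines produce no output
    key, raw_value = line.split(",", 1)
    return [(key.strip(), float(raw_value))]


def reduce_values(key, values):
    """Reduce step: collapse all values seen for one key into a summary row."""
    return {"key": key, "count": len(values), "total": sum(values)}


def run_job(lines):
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return [reduce_values(key, grouped[key]) for key in sorted(grouped)]


if __name__ == "__main__":
    sample = ["a,1", "b,2.5", "a,3"]
    for row in run_job(sample):
        print(row)
```

On a real cluster the grouping step is done by the framework across machines; only the map and reduce functions are yours to write.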

Other tips

You need to decide what level of transformation and automation you expect from this.

I suggest you start with a straightforward console application that pulls the data from blob storage and performs the transform.

Reasons for suggesting the console application approach:

  1. It is easy and straightforward, and uses the skill set you already have.
  2. There is a good SDK for blobs and tables that lets you do whatever you want.
  3. Map-Reduce (HDInsight) is a totally new species in your Azure Storage and C# family. I have heard HDInsight is good, but I am not sure it is good enough for you here.
  4. A console application is easy to run as a scheduled task, or to leave running based on a pub-sub model.
  5. If you are using your own C# console app or .exe, you can easily tweak it to run in an Azure Worker Role.
  6. Writing your own app removes the overhead of installing and setting up HDInsight.
  7. Cost-wise, Worker Roles are cheaper than HDInsight.
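
The core of such a console application can be sketched as a pure transform function, shown here in plain Python rather than C# to keep it short. It makes no Azure SDK calls: the blob contents are stood in for by a list of text lines, and the result is a list of dictionaries shaped like table entities. The tab-separated `category<TAB>amount` line format, the `Amount` property, and the names `to_entity` and `transform` are hypothetical; only `PartitionKey` and `RowKey` reflect what Azure Table storage requires of every entity.

```python
def to_entity(source_name, line_number, line):
    """Turn one 'category<TAB>amount' line into a table-style entity.

    The line format and the 'Amount' property are assumptions made
    for this sketch.
    """
    category, amount = line.rstrip("\n").split("\t")
    return {
        "PartitionKey": category,
        # Source name plus zero-padded line number keeps the key unique.
        "RowKey": f"{source_name}-{line_number:08d}",
        "Amount": float(amount),
    }


def transform(source_name, lines):
    """Transform every non-blank line; blank lines are skipped."""
    entities = []
    for line_number, line in enumerate(lines, start=1):
        if line.strip():
            entities.append(to_entity(source_name, line_number, line))
    return entities


if __name__ == "__main__":
    # In a real job these lines would be read from a blob, and the
    # entities would be written to table storage instead of printed.
    blob_lines = ["books\t12.50\n", "\n", "music\t3\n"]
    for entity in transform("input.txt", blob_lines):
        print(entity)
```

Keeping the transform free of storage calls means it can be tested locally without any emulator, and the read and write steps can be swapped later if the job moves to a Worker Role.
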
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow