Problem

I'm just beginning to learn about Big Data and I'm interested in Hadoop. I'm planning on building a simple analytics system to make sense of certain events that occur on my site.

So I'm planning to have code (both front end and back end) to trigger events that would queue messages (most likely with RabbitMQ). These messages will then be processed by a consumer that writes the data continuously to HDFS. Then I can run a MapReduce job at any time to analyze the current data set.

I'm leaning towards Amazon EMR for the Hadoop functionality. So my question is this: from my server running the consumer, how do I save the data to HDFS? I know there's a command like "hadoop dfs -copyFromLocal", but how do I use it across servers? Is there a tool available?

Has anyone tried a similar thing? I would love to hear about your implementations. Details and examples would be very helpful. Thanks!


Solution

Since you mention EMR: it takes its input from a folder in S3 storage, so you can use your preferred language's library to push data to S3 and analyze it later with EMR jobs. For example, in Python you can use boto.
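A minimal sketch of that approach with boto, assuming credentials are picked up from the environment or ~/.boto, and using a placeholder bucket and key name:

    import boto

    # Connect to S3 using credentials from the environment or ~/.boto.
    conn = boto.connect_s3()

    # Assumed bucket that the EMR job will later read its input from.
    bucket = conn.get_bucket('my-analytics-input')

    # Upload one batch file produced by the consumer; EMR can then be
    # pointed at s3://my-analytics-input/events/ as the job input folder.
    key = bucket.new_key('events/batch-0001.log')
    key.set_contents_from_filename('/tmp/batch-0001.log')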

There are even drivers that let you mount S3 storage as a device, but some time ago all of them were too buggy to use in production systems. Things may have changed since then.

EMR FAQ:

Q: How do I get my data into Amazon S3?
You can use Amazon S3 APIs to upload data to Amazon S3. Alternatively, you can use many open source or commercial clients to easily upload data to Amazon S3.

Note that EMR (as well as S3) incurs additional costs, and its use is justified for really big data. Also note that it is always beneficial to have relatively large files, both in terms of Hadoop performance and storage costs.
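To keep the files large, the consumer can buffer messages and flush them to S3 in batches rather than writing one object per event. A rough sketch using pika for RabbitMQ and boto for S3; the queue name, bucket, and batch size are assumptions:

    import boto
    import pika

    QUEUE = 'site-events'          # assumed queue your app publishes events to
    BUCKET = 'my-analytics-input'  # assumed S3 bucket read by the EMR job
    BATCH_SIZE = 10000             # flush to S3 after this many messages

    conn = boto.connect_s3()
    bucket = conn.get_bucket(BUCKET)

    buffer = []
    batch_no = 0

    def flush():
        """Write the buffered messages as one newline-delimited S3 object."""
        global buffer, batch_no
        if not buffer:
            return
        key = bucket.new_key('events/batch-%06d.log' % batch_no)
        key.set_contents_from_string('\n'.join(buffer) + '\n')
        buffer = []
        batch_no += 1

    def on_message(channel, method, properties, body):
        # Accumulate events and only upload once the batch is large enough.
        buffer.append(body.decode('utf-8'))
        if len(buffer) >= BATCH_SIZE:
            flush()
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()

You could also flush on a timer so slow periods still produce files, but the idea is the same: fewer, larger objects in S3 are cheaper to store and faster for Hadoop to process.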
