Question

I'm parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server), and aggregating statistics for each delivered file by date / referrer / user agent.

Tons of logs are generated every hour, and that volume is likely to increase dramatically in the near future, so processing this kind of data in a distributed manner via Amazon Elastic MapReduce sounds reasonable.

Right now my mappers and reducers are ready to process the data, and I have tested the whole process with the following flow:

  • uploaded the mappers, reducers, and data to Amazon S3
  • configured an appropriate job and ran it successfully
  • downloaded the aggregated results from Amazon S3 to my server and inserted them into a MySQL database by running a CLI script (a scripted sketch of this flow follows the list)
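
For context, a scripted version of that flow might look something like the sketch below, using boto3 and mysql-connector-python; the bucket name, prefixes, table schema, and the tab-separated reducer output format are placeholders, not a description of my actual setup.

```python
# Rough scripted equivalent of the manual flow above.
# Bucket, prefixes, table, and output format are placeholder assumptions.
import csv
import io

import boto3
import mysql.connector

S3_BUCKET = "my-logs-bucket"          # hypothetical bucket
RAW_PREFIX = "incoming/"              # where raw logs are uploaded
RESULT_PREFIX = "output/run-001/"     # where EMR writes its part-* files

s3 = boto3.client("s3")

def upload_raw_log(local_path: str, key: str) -> None:
    """Step 1: push a raw access log to S3 for the EMR job to pick up."""
    s3.upload_file(local_path, S3_BUCKET, RAW_PREFIX + key)

def load_results_into_mysql() -> None:
    """Step 3: pull the aggregated part-* files and insert rows into MySQL."""
    conn = mysql.connector.connect(
        host="localhost", user="stats", password="secret", database="stats"
    )
    cur = conn.cursor()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=RESULT_PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=S3_BUCKET, Key=obj["Key"])["Body"].read()
            # Assuming the reducer emits tab-separated lines:
            # date \t file \t referrer \t useragent \t count
            reader = csv.reader(io.StringIO(body.decode("utf-8")), delimiter="\t")
            for row in reader:
                if len(row) != 5:
                    continue  # skip blank or malformed lines
                stat_date, filename, referrer, useragent, hits = row
                cur.execute(
                    "INSERT INTO file_stats (stat_date, file, referrer, useragent, hits) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    (stat_date, filename, referrer, useragent, int(hits)),
                )
    conn.commit()
    cur.close()
    conn.close()
```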

I did all of that manually, following the many Amazon EMR tutorials that can be found online.

What should I do next? What is the best approach to automating this process?

  • Should I control the Amazon EMR JobTracker via an API? (a job-launch sketch follows this list)
  • How can I make sure my logs will not be processed twice?
  • What is the best way to move processed files to archive?
  • What is the best approach to insert results into PostgreSQL/MySQL?
  • How should data for the jobs be laid out in the input/output directories?
  • Should I create a new EMR job each time using the API?
  • What is the best approach to upload raw logs to Amazon S3?
  • Can anyone share their setup of the data processing flow?
  • How to control file uploads and jobs completions?
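
On the API questions, one option would be to launch a transient job flow per batch from a script. Here is a minimal sketch with boto3; all names, S3 paths, instance types, IAM roles, and the release label are placeholder assumptions, not a recommendation of specific values.

```python
# Minimal sketch: launch a one-off EMR streaming job flow per batch.
# Every name, path, role, and version below is a placeholder.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def run_log_processing_job(input_prefix: str, output_prefix: str) -> str:
    """Start a transient streaming job flow; the cluster terminates when done."""
    response = emr.run_job_flow(
        Name="access-log-aggregation",
        ReleaseLabel="emr-6.15.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the step
        },
        Steps=[
            {
                "Name": "aggregate-access-logs",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "hadoop-streaming",
                        "-files", "s3://my-bucket/code/mapper.py,s3://my-bucket/code/reducer.py",
                        "-mapper", "mapper.py",
                        "-reducer", "reducer.py",
                        "-input", input_prefix,
                        "-output", output_prefix,
                    ],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]

# Example use:
# job_id = run_log_processing_job("s3://my-bucket/incoming/2024-01-01/",
#                                 "s3://my-bucket/output/2024-01-01/")
```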

I think this topic can be useful for many people who are trying to process access logs with Amazon Elastic MapReduce but have not been able to find good materials and/or best practices.

UPD: Just to clarify, here is the single final question:

What are best practices for logs processing powered by Amazon Elastic MapReduce?

Related posts:

Getting data in and out of Elastic MapReduce HDFS


Solution

That's a very, very open-ended question, but here are some thoughts you could consider:

  • Using Amazon SQS: this is a distributed queue and is very useful for workflow management. You can have one process that writes a message to the queue as soon as a log is available, and another that reads from the queue, processes the log described in the message, and deletes the message when processing is done. This helps ensure that each log is processed only once (a sketch follows this list).
  • Apache Flume, as you mentioned, is very useful for log aggregation. It is something you should consider even if you don't need real-time processing, as it at the very least gives you a standardized aggregation process.
  • Amazon recently released Simple Workflow (SWF). I have just started looking into it, but it sounds promising for managing every step of your data pipeline.
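
For the SQS point, a minimal sketch of that producer/consumer handoff with boto3 might look like this; the queue name, message format, and the `process_log` helper are hypothetical:

```python
# Minimal sketch of the SQS-based handoff described above.
# Queue name, message format, and process_log are placeholders.
import json

import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="logs-to-process")  # hypothetical queue

def enqueue_log(s3_key: str) -> None:
    """Producer: announce a newly uploaded log file."""
    queue.send_message(MessageBody=json.dumps({"key": s3_key}))

def process_log(s3_key: str) -> None:
    """Placeholder for the real work (e.g. submitting the EMR step)."""
    print("processing", s3_key)

def process_pending_logs() -> None:
    """Consumer: read messages, process each log, delete only on success."""
    while True:
        messages = queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=20)
        if not messages:
            break
        for msg in messages:
            key = json.loads(msg.body)["key"]
            process_log(key)   # if this raises, the message becomes visible again
            msg.delete()       # deleting the message marks the log as handled
```

Note that SQS delivers messages at least once, so deleting a message only after successful processing (and keeping a record of processed keys) is what keeps duplicates in check.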

Hope that gives you some clues.
