Question

Can someone kindly tell me about the trade-offs involved when choosing between Storm and MapReduce in a Hadoop cluster for data processing? Of course, aside from the obvious one: Hadoop (processing via MapReduce in a Hadoop cluster) is a batch processing system, while Storm is a real-time processing system.

I have worked a bit with the Hadoop ecosystem, but I haven't worked with Storm. After looking through a lot of presentations and articles, I still haven't been able to find a satisfactory and comprehensive answer.

Note: The term trade-off here is not meant as a comparison between two similar things. It refers to the consequences of getting results in real time that are absent from a batch processing system.


Solution

MapReduce: A fault-tolerant distributed computational framework. MapReduce lets you operate over huge amounts of data, with a lot of work put in to prevent failures due to hardware faults. MapReduce is a poor choice for computing results on the fly because it is slow: a typical MapReduce job takes on the order of minutes or hours, not microseconds.

A MapReduce job takes a file (or some data store) as input and writes a file of results. If you want these results available to an application, it is your responsibility to put the data somewhere accessible. This is likely slow, and there will be a lag between the values you can display and the values that represent your system in its current state.
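
For concreteness, here is a minimal sketch of that file-in/file-out shape, using the classic Hadoop MapReduce word-count job (standard org.apache.hadoop.mapreduce API; the paths and class names are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountJob.class);
    job.setMapperClass(TokenizeMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output are paths on HDFS: the job reads files and writes files.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Everything the job produces lands in the output directory on HDFS; serving those results to an application is a separate step you have to build yourself.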

An important distinction to make when considering MapReduce for building real-time systems is between training your model and applying your model. If your model parameters do not change quickly, you can fit them with MapReduce and then have a mechanism for accessing these pre-fit parameters when you want to apply the model.
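
As a rough illustration of that train-offline / apply-online split, here is a hypothetical sketch: the ParameterStore interface, its loadLatest() method, and the OnlineScorer class are all assumptions standing in for whatever storage your batch job actually writes to (a file on HDFS, a row in HBase, etc.):

```java
import java.util.Map;

// Hypothetical sketch: a batch MapReduce job fits the model parameters offline
// and writes them somewhere durable; the online application loads them and
// applies the model per request.
public class OnlineScorer {

  private volatile Map<String, Double> weights;  // pre-fit model parameters
  private final ParameterStore store;

  public OnlineScorer(ParameterStore store) {
    this.store = store;
    this.weights = store.loadLatest();           // fit earlier by the batch job
  }

  // Called periodically (e.g. after each nightly batch run) to pick up new parameters.
  public void refresh() {
    this.weights = store.loadLatest();
  }

  // Apply the model at request time: here, a simple dot product over the features present.
  public double score(Map<String, Double> features) {
    double total = 0.0;
    for (Map.Entry<String, Double> f : features.entrySet()) {
      total += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
    }
    return total;
  }

  // Assumed interface over whatever storage the batch job writes to.
  public interface ParameterStore {
    Map<String, Double> loadLatest();
  }
}
```

The point is the division of labor: MapReduce does the slow, heavy fitting on its own schedule, and the application only ever does the cheap apply step.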

Storm: A real-time, streaming computational system. Storm is an online framework, meaning, in this sense, a service that interacts with a running application. In contrast to MapReduce, it receives small pieces of data (not a whole file) as your application produces them. You define a DAG of operations to perform on the data. A common and simple use case for Storm is tracking counters and using that information to populate a real-time dashboard.
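
To make the DAG idea concrete, here is a minimal counter topology sketch in the style of the storm-starter examples, written against the Storm 2.x API (class names, parallelism hints, and local-mode details vary slightly between Storm versions and are illustrative rather than definitive):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class CounterTopology {

  // Spout: stands in for the stream your application produces (clicks, page views, ...).
  public static class WordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Random random = new Random();
    private static final String[] WORDS = {"apple", "banana", "cherry"};

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(WORDS[random.nextInt(WORDS.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Bolt: keeps a running count per word; a real dashboard bolt would push these
  // counts to whatever backs your UI instead of printing them.
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getStringByField("word");
      int count = counts.merge(word, 1, Integer::sum);
      System.out.println(word + " -> " + count);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // Terminal bolt: nothing emitted downstream.
    }
  }

  public static void main(String[] args) throws Exception {
    // The DAG: spout -> count bolt, with tuples routed by the "word" field
    // so the same word always reaches the same bolt instance.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 1);
    builder.setBolt("counts", new CountBolt(), 2).fieldsGrouping("words", new Fields("word"));

    // Local mode for demonstration; a real deployment submits to a Storm cluster instead.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("counter-demo", new Config(), builder.createTopology());
    Utils.sleep(10_000);
    cluster.shutdown();
  }
}
```

The fieldsGrouping on "word" is the key design choice: it guarantees each counter lives on exactly one bolt instance, so counts stay consistent as you scale the bolt out.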

Storm doesn't (necessarily) have anything to do with persisting your data. Here, streaming is another way of saying keeping the information you care about and throwing the rest away. In reality, you probably have a persistence layer in your application that has already recorded the data, so this is a good and justified separation of concerns.

If you want to know more... If you would like to learn more about real-time systems that fit parameters with MR and apply the models a different way, here are slides for a talk I gave on building real-time recommendation engines on HBase.

An excellent paper that marries real-time counting and persistence in an interesting way is "Google News Personalization: Scalable Online Collaborative Filtering".

Another interesting marriage of MR and Storm is Summingbird. Summingbird allows you to define data analysis operations that can be applied via either Storm or MR.

OTHER TIPS

This is kind of like asking about the trade-offs between a frying pan and your drawer of silverware. They are not really two things you compare. You might use them together as part of a larger project.

Hadoop itself is not one thing, but a name for a federation of services, like HDFS, Hive, HBase, MapReduce, etc. Storm is something you use with some of these services, like HDFS or HBase. It is a stream-processing framework. There are others within the extended Hadoop ecosystem, like Spark Streaming.

When would you choose a stream-processing framework? When you need to react to new data in near real time. If you need that kind of tool, you deploy that kind of tool, too.

Licensed under: CC-BY-SA with attribution