Question

Going through the presentations and material on Summingbird by Twitter, one of the reasons mentioned for using Storm and Hadoop clusters together in Summingbird is that processing through Storm leads to a cascading and accumulation of error. To avoid this, a Hadoop cluster is used to batch-process the data, and the Storm results are discarded once the same data has been processed by Hadoop.

What causes this accumulation of error, and why is it not present in Hadoop? Since I have not worked with Storm, I do not know the reason for it. Is it because Storm uses some approximate algorithm in order to process the data in real time, or is the cause something else?


Solution

Twitter uses Storm for real-time processing of data. Problems can happen with real-time data. Systems might go down. Data might be inadvertently processed twice. Network connections can be lost. A lot can happen in a real-time system. Each time an event is dropped or replayed, the running aggregates drift a little further from the truth, and a streaming system has no natural point at which that drift gets corrected.
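To make the double-counting point concrete, here is a minimal, hypothetical sketch in plain Scala (not Storm or Summingbird code; the Event type and the replay scenario are made up for illustration): with at-least-once delivery, a tuple that times out gets re-emitted, a running counter has no memory of what it already saw, and the resulting over-count stays in the aggregate.

```scala
// Hypothetical illustration of how at-least-once delivery makes a streaming
// count drift: a replayed tuple is counted again, and nothing in the running
// aggregate can undo it later.
case class Event(id: Long, user: String)

object ReplayDrift {
  def main(args: Array[String]): Unit = {
    val original = Seq(Event(1, "alice"), Event(2, "bob"), Event(3, "alice"))
    // Suppose event 2 times out before it is acked, so it is re-emitted.
    val delivered = original :+ Event(2, "bob")

    // Streaming-style counter: increments as tuples arrive, keeps no memory of ids.
    val streamingCounts = delivered.groupBy(_.user).map { case (u, es) => u -> es.size }

    // Batch-style recount over the full event log: dedupe by id first.
    val batchCounts = delivered.distinct.groupBy(_.user).map { case (u, es) => u -> es.size }

    println(s"streaming: $streamingCounts") // bob is counted twice
    println(s"batch:     $batchCounts")     // correct counts
  }
}
```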

They use Hadoop to reliably process historical data. I don't know the specifics, but, for instance, getting solid information from aggregated logs is probably more reliable than attaching to the stream.

If they relied on Storm for everything, Storm would have problems due to the nature of providing real-time information at scale. If they relied on Hadoop for everything, there would be a good deal of latency involved. Combining the two with Summingbird is the next logical step.
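As a rough illustration of that combination, here is a minimal sketch with assumed names (merge, Counts, and the example numbers are not Summingbird's actual API): the batch layer's output is authoritative for everything up to its last run, the real-time layer only covers the short window since then, and the serving side sums the two, so whatever error the streaming side accumulates is confined to that window and is thrown away once the next batch run catches up.

```scala
// Minimal sketch, with assumed names, of merging a batch view with a
// real-time view. Summingbird expresses this kind of merge as a monoid sum;
// here it is just an element-wise addition of two count maps.
object MergedView {
  type Counts = Map[String, Long]

  // Sum the two views key by key.
  def merge(batch: Counts, realtime: Counts): Counts =
    (batch.keySet ++ realtime.keySet).map { k =>
      k -> (batch.getOrElse(k, 0L) + realtime.getOrElse(k, 0L))
    }.toMap

  def main(args: Array[String]): Unit = {
    val batchCounts    = Map("alice" -> 120L, "bob" -> 80L) // all events up to the last batch run
    val realtimeCounts = Map("alice" -> 3L, "carol" -> 1L)  // events seen since that run
    println(merge(batchCounts, realtimeCounts))
    // Map(alice -> 123, bob -> 80, carol -> 1)
  }
}
```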

Licensed under: CC-BY-SA with attribution