Question

I've been reading about Storm and playing around with the examples from storm-starter.

I think I got the concept and it applies very well to many cases. I have a test project I want to do to learn more about this, but I'm wondering if Storm is really suited for this.

The conceptual problem I'm having is with the 'streaming' definition. It seems that Storm would work like a charm subscribing to a stream and processing it in real time, but I don't really have a stream; rather, I have a finite collection of data that I want to process.

I know there's Hadoop for this, but I'm interested in the real-time capabilities of Storm, as well as other interesting points that Nathan, who wrote Storm, mentions in his talks.

So I was wondering: do people write spouts that poll non-streaming APIs and then diff the results to emulate a stream?

The second important point is that Storm topologies seem to run forever until interrupted, which again doesn't fit my case. I would like my topology to know that, once my finite list of source data is exhausted, processing can terminate and a final result can be emitted.

So, does that all make sense in Storm terms, or am I looking at the wrong thing? If so, what alternatives do you propose for this sort of real-time parallel computing need?

Thanks!


Solution

Found the answer in the Storm Google group. It seems that DRPC topologies emit a tuple with parameters that the DRPC spout receives as a stream, and then signal back when processing has finished (using a unique ID called the request ID).
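For illustration, here is a minimal sketch along the lines of storm-starter's BasicDRPCTopology. It assumes a recent Storm release where the classes live under org.apache.storm (older releases used backtype.storm); the class name DrpcSketch and the function name "exclaim" are just placeholders. The DRPC spout turns each call into a [request-id, args] tuple, and the last bolt's [id, result] output is routed back to the blocked execute() call:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.LocalDRPC;
import org.apache.storm.drpc.LinearDRPCTopologyBuilder;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DrpcSketch {

    // The bolt receives [request-id, args]; it must emit the request id back
    // alongside its result so Storm can match the answer to the original call.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            Object requestId = tuple.getValue(0);
            String arg = tuple.getString(1);
            collector.emit(new Values(requestId, arg + "!"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclaim");
        builder.addBolt(new ExclaimBolt(), 3);

        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", new Config(), builder.createLocalTopology(drpc));

        // execute() blocks until the topology reports this request id as finished.
        System.out.println(drpc.execute("exclaim", "hello"));

        cluster.shutdown();
        drpc.shutdown();
    }
}
```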

The same thread also notes that Hadoop is probably better suited for these cases, unless the data is small enough to be processed in its entirety each time.

OTHER TIPS

It's certainly possible to use Storm to process a finite collection of data and stop once all elements are processed. DRPC topologies are one way to do this, but rolling your own solution is not hard.

The idea is to keep track of which elements of your finite dataset have and haven't been processed yet, which is easy to do in the spout using the ack() and fail() methods (see the sketch below).
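As a rough illustration of that bookkeeping, here is a hypothetical spout over an in-memory list (the class and field names are mine, not from any library): each element is emitted with its index as the message id, fail() puts the index back on the queue for a retry, and once everything has been acked the spout knows the finite dataset is fully processed.

```java
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical spout over a finite in-memory dataset.
public class FiniteListSpout extends BaseRichSpout {
    private final List<String> data;                                   // the finite source collection
    private final Queue<Integer> pending = new LinkedList<>();         // indices still to emit
    private final Map<Integer, String> inFlight = new ConcurrentHashMap<>(); // emitted, not yet acked
    private SpoutOutputCollector collector;

    public FiniteListSpout(List<String> data) {
        this.data = data;
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        for (int i = 0; i < data.size(); i++) {
            pending.add(i);
        }
    }

    @Override
    public void nextTuple() {
        Integer idx = pending.poll();
        if (idx == null) {
            if (inFlight.isEmpty()) {
                // Every element has been emitted and acked: the dataset is done.
                // Set a flag, emit a sentinel tuple, or let an external process
                // poll this condition and kill the topology.
            }
            return;
        }
        inFlight.put(idx, data.get(idx));
        collector.emit(new Values(data.get(idx)), idx); // idx doubles as the message id
    }

    @Override
    public void ack(Object msgId) {
        inFlight.remove(msgId);       // element fully processed downstream
    }

    @Override
    public void fail(Object msgId) {
        inFlight.remove(msgId);
        pending.add((Integer) msgId); // retry the failed element later
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("element"));
    }
}
```

Detecting completion is the easy part; actually stopping is up to you, for example by emitting a sentinel "end of data" tuple that a final bolt uses to flush its result, or by killing the topology from outside once the done condition holds.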

If you are looking for a fast, interactively usable, and developer-friendly batch processing solution, you may want to look at Apache Spark instead of Storm.
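For comparison, a minimal Spark sketch in Java (assuming the spark-core dependency; the class name and data are illustrative): the job reads a finite collection in parallel, produces a single final result, and then exits on its own, which is exactly the life cycle the question asks for.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class FiniteBatchJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("finite-batch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
            // Parallel transformation over a finite dataset, then one final result.
            long evens = sc.parallelize(data).filter(x -> x % 2 == 0).count();
            System.out.println("Even elements: " + evens);
        } // the job ends here -- there is no long-running topology to interrupt
    }
}
```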

Trident/DRPC is more useful when you want to run queries on your continuous computation.
