Question

I am new to the Storm framework (https://storm.incubator.apache.org/about/integrates.html). I tested my code locally, and I think it will perform better if I remove stop words, but I searched online and could not find any example of removing stop words in Storm.

Solution

If the stop word list is small enough to fit in memory, the most straightforward approach is to filter the tuples with an implementation of a Storm (Trident) Filter that holds that list. This Filter could poll the DB every so often to pick up the latest list of stop words if the list evolves over time.
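
A minimal sketch of that in-memory approach, written as plain Java so it runs without Storm on the classpath. The class name `StopWordFilter`, the `loader` supplier (standing in for the DB query), and the refresh interval are assumptions for illustration; in a Trident topology, a Filter's `isKeep(TridentTuple)` method could delegate to `isKeep(String)` below.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

// Plain-Java core of a stop word filter (hypothetical helper, not part of Storm).
class StopWordFilter {
    private final Supplier<Set<String>> loader; // assumption: wraps the DB lookup
    private final long refreshMillis;           // how often to re-read the list
    private volatile Set<String> stopWords;
    private volatile long lastLoaded;

    StopWordFilter(Supplier<Set<String>> loader, long refreshMillis) {
        this.loader = loader;
        this.refreshMillis = refreshMillis;
        reload();
    }

    // Re-reads the list and normalizes it to lower case.
    private void reload() {
        Set<String> fresh = new HashSet<>();
        for (String w : loader.get()) {
            fresh.add(w.toLowerCase(Locale.ROOT));
        }
        stopWords = fresh;
        lastLoaded = System.currentTimeMillis();
    }

    // Returns true when the word should stay in the stream.
    boolean isKeep(String word) {
        if (System.currentTimeMillis() - lastLoaded >= refreshMillis) {
            reload();
        }
        return !stopWords.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

Refreshing lazily inside `isKeep` avoids a background thread, at the cost of one slow call whenever the list is reloaded.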

If the stop word list is too big for that, you can use a QueryFunction, called from your topology with the stateQuery function, which would:

  • receive a batch of tuples to check (say 10000 at a time)
  • build a single query from their content and look up the corresponding stop words in persistence
  • attach a boolean to each tuple specifying what to do with each one

Then add a Filter right after that to filter based on that boolean.
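
The batch lookup described above could look roughly like this. It is a plain-Java sketch rather than a real Trident QueryFunction: `StopWordBatchLookup` and the `store` function (standing in for the single query against persistence) are hypothetical names, and the method is only named after QueryFunction's `batchRetrieve` step to show where the logic would live.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Flags each word of a batch using one lookup for the whole batch.
class StopWordBatchLookup {
    // Assumption: given candidate words, returns the subset that are stop words.
    private final Function<Set<String>, Set<String>> store;

    StopWordBatchLookup(Function<Set<String>, Set<String>> store) {
        this.store = store;
    }

    // Returns one flag per input word, in input order; true means "stop word".
    List<Boolean> batchRetrieve(List<String> words) {
        Set<String> distinct = new HashSet<>(words); // deduplicate before querying
        Set<String> found = store.apply(distinct);   // single round trip
        List<Boolean> flags = new ArrayList<>(words.size());
        for (String w : words) {
            flags.add(found.contains(w));
        }
        return flags;
    }
}
```

The downstream Filter would then simply drop every tuple whose flag is true.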

And if you feel adventurous:

Another, faster approach would be to approximate the membership test with a Bloom filter. I heard that Algebird is meant to provide this kind of functionality and targets both Scalding and Storm (how cool is that?), but I don't know how stable it is, nor do I have any practical experience plugging it into Storm (maybe Sunday if it's rainy...).
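
To make the idea concrete, here is a small hand-rolled Bloom filter in plain Java. It is not Algebird's API, just a sketch of the data structure; the class name, the bit array size, and the choice of hash functions (String.hashCode combined with FNV-1a via double hashing) are assumptions. A Bloom filter never misses a word that was added, but it can report a false positive, so an occasional ordinary word might be dropped as if it were a stop word.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal Bloom filter for stop words (illustrative, not tuned).
class StopWordBloomFilter {
    private final BitSet bits;
    private final int size;   // number of bits
    private final int hashes; // number of hash probes per word

    StopWordBloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.size = size;
        this.hashes = hashes;
        this.bits = new BitSet(size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second hash.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th probe is h1 + i * h2, mapped into [0, size).
    private int index(String word, int i) {
        return Math.floorMod(word.hashCode() + i * fnv1a(word), size);
    }

    void add(String word) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(word, i));
        }
    }

    // false means "definitely not a stop word"; true means "probably a stop word".
    boolean mightContain(String word) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(word, i))) {
                return false;
            }
        }
        return true;
    }
}
```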

Also, Cascading (which is not directly related to Storm but has a very similar set of primitive abstractions on top of map reduce) suggests in this tutorial a method based on left joins. Such joins exist in Storm and the right branch could possibly be fed with a FixedBatchSpout emitting all stop words every time, or even a custom spout that reads the latest version of the list of stop words from persistence every time, so maybe that would work too? Maybe? This also assumes the size of the stop words list is relatively small though.
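
The left-join idea boils down to joining each word against the stop word stream and keeping only the rows that found no partner. The plain-Java sketch below shows that logic on in-memory lists; it does not use Storm's or Cascading's join API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Left join followed by "right side is null" filtering.
class LeftJoinStopWords {
    // Joins on the word itself; the right column is null when there is no match.
    static List<String[]> leftJoin(List<String> words, List<String> stopWords) {
        Map<String, String> right = new HashMap<>();
        for (String s : stopWords) {
            right.put(s, s);
        }
        List<String[]> joined = new ArrayList<>();
        for (String w : words) {
            joined.add(new String[] { w, right.get(w) });
        }
        return joined;
    }

    // Keeps the left value of every row without a join partner.
    static List<String> keepUnmatched(List<String[]> joined) {
        List<String> kept = new ArrayList<>();
        for (String[] row : joined) {
            if (row[1] == null) {
                kept.add(row[0]);
            }
        }
        return kept;
    }
}
```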

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow