Question

I'm at a crossroads: I've been using Mahout to classify some documents, and have stumbled across the OpenNLP document classifier.

They seem to do very similar things, and I can't figure out whether it's worth converting what I have currently written in Mahout to an OpenNLP implementation instead.

Are there any blatantly obvious advantages Mahout has over OpenNLP for document classification?

My situation is that I have several hundred thousand news articles, and I only want to extract a subset of them. Mahout does this reasonably well: I'm using Naive Bayes with term counts and TF-IDF weighting to determine which category each document falls into. The model is updated as and when new articles are found, so it is continually improving over time.
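To make the setup concrete, here is a minimal sketch of an incrementally updated Naive Bayes classifier. It is not Mahout's or OpenNLP's API: the class name `IncrementalNaiveBayes`, its methods, and the sample sentences are all made up for illustration, and it uses raw term counts with add-one smoothing rather than TF-IDF weighting.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Hypothetical multinomial Naive Bayes that can be updated one document at a time."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.term_counts = defaultdict(Counter)  # term frequencies per category
        self.vocabulary = set()                  # every term seen so far

    def update(self, text, category):
        """Fold one labelled article into the model (no retraining from scratch)."""
        tokens = text.lower().split()
        self.doc_counts[category] += 1
        self.term_counts[category].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the category with the highest log-probability for the text."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.term_counts[category]
            total_terms = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


model = IncrementalNaiveBayes()
model.update("stocks fall as markets react to rate rise", "finance")
model.update("team wins cup final after penalty shootout", "sport")
print(model.classify("markets rally as stocks recover"))

# Later, when a new article turns up, the same model is simply updated.
model.update("striker scores twice in league match", "sport")
print(model.classify("striker scores in cup final"))
```

Because the model only stores counts, adding an article is cheap, which is the property that matters for the "updated as new articles are found" workflow, whichever library ends up doing the work.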

It seems the OpenNLP document classifier does something very similar (although I have not tested how accurate it is). Does anyone have experience using both who can say definitively why one would be used over the other?


Solution

I don't have experience with these two, but while trying to figure out if one of them would make a difference in a personal project, I stumbled upon this blog, and I quote:

Data categorization with OpenNLP is another approach with more accuracy and performance rate as compared to mahout.

You can check the blog post here.

Licensed under: CC-BY-SA with attribution