Question

I'm working on a large text classification project and we have our text data (simple messages) stored in HBase.

We have two problems. First, we would like to use HBase as the source for Mahout classifiers, namely Bayes and Random Forests.

Second, we would like to store the generated model in HBase instead of using the in-memory approach (InMemoryBayesDatastore). As our sets grow we are running into problems with memory utilization, and we would like to test HBase as a viable alternative.

There seems to be little material around on using HBase with Mahout, or on whether it can serve as a datasource at all. I'm using the Mahout 0.6 core API in Java, which has the InMemory datastore.

Doing a bit of digging, I believe there was an HBase Bayes datastore component, org.apache.mahout.classifier.bayes.datastore.HBaseBayesDatastore. See the older JavaDoc here: http://www.jarvana.com/jarvana/view/org/apache/mahout/mahout-core/0.3/mahout-core-0.3-javadoc.jar!/org/apache/mahout/classifier/bayes/datastore/HBaseBayesDatastore.html

However, looking at the latest documentation, it looks like this feature has disappeared: https://builds.apache.org/job/Mahout-Quality/javadoc/

I wanted to know whether it is still possible to use HBase as a datasource for Bayes and Random Forests, and whether there are any previous use cases for this.

Thanks!


Solution

It's not directly possible, no. You can revive the old implementation, dust it off, and probably make it work without much trouble. It was indeed removed to slim down and focus the project.

You can of course also look at exporting your data, in some form, and adding it to a representation or store that is directly supported.
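The export route could look something like the sketch below. It only covers the formatting step and makes two assumptions you should verify against your own setup: that each HBase row yields a label and a raw message (the column layout is hypothetical), and that your Mahout version's Bayes trainer reads plain text with one document per line, the label first, then a tab, then space-separated terms. The class name, method names, and the crude tokenizer are all illustrative, not part of Mahout or HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TrainingLineExport {

    // Turns one (label, message) pair into a "label<TAB>term term ..." line.
    // Assumption: this is the line layout the Bayes trainer expects.
    static String toTrainingLine(String label, String message) {
        // Lowercase and split on anything that is not a letter or digit.
        // This is a crude stand-in for a real analyzer.
        String[] parts = message.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        StringBuilder sb = new StringBuilder(label.trim()).append('\t');
        boolean first = true;
        for (String part : parts) {
            if (part.isEmpty()) {
                continue; // a leading separator produces an empty first element
            }
            if (!first) {
                sb.append(' ');
            }
            sb.append(part);
            first = false;
        }
        return sb.toString();
    }

    // Each row is {label, message}. In a real export these pairs would come
    // from an HBase scan; that part is deliberately left out of this sketch.
    static List<String> toTrainingLines(List<String[]> rows) {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(toTrainingLine(row[0], row[1]));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"spam", "Buy NOW, cheap!"},
                new String[] {"ham", "See you at lunch?"});
        for (String line : toTrainingLines(rows)) {
            System.out.println(line);
        }
    }
}
```

Writing the resulting lines to a file on HDFS gives the trainer an input it already understands, at the cost of an extra copy of the data.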

Generally speaking, you can use HBase with Mahout by virtue of the fact that Mahout (mostly) uses Hadoop, and Hadoop can use HBase. That's not quite the situation here; there was a more direct integration point, which has since been removed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow