Question

I have a dataset on which I need to run PCA (Principal Component Analysis, a dimensionality reduction technique), which is easy to do with Weka.

Since the dataset is large, Weka runs into memory issues. I understand these can be resolved by linking Weka with Hadoop so that the algorithm runs on a server. How can I connect Weka to Hadoop to handle larger datasets?

Thank you.


Solution

Weka 3.7 has new packages for distributed processing in Hadoop. One of the jobs provided by these packages will compute a correlation (or covariance) matrix in Hadoop. The user can optionally have the job use the correlation matrix as input to a PCA analysis (this part runs outside of Hadoop) and produce a "trained" Weka PCA filter. This scales Weka's PCA analysis in the number of instances (but not in the number of original features since the PCA computation still happens locally on the client machine).
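For context, here is what the client-side end of that pipeline looks like with Weka's Java API: the Hadoop job hands back a trained PCA filter that behaves like the one constructed locally below. This is a minimal in-memory sketch (the file name mydata.arff is hypothetical), not the distributed job itself:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class PcaSketch {
    public static void main(String[] args) throws Exception {
        // Load the dataset (ARFF, CSV, etc. via DataSource).
        // "mydata.arff" is a placeholder file name.
        Instances data = new DataSource("mydata.arff").getDataSet();

        // Configure PCA to retain enough components to cover
        // 95% of the variance in the data.
        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95);
        pca.setInputFormat(data);

        // Transform the instances into principal-component scores.
        Instances transformed = Filter.useFilter(data, pca);
        System.out.println("Retained " + transformed.numAttributes()
                + " attributes after PCA");
    }
}
```

The distributed packages spare you the step of loading every instance into memory as this sketch does; only the final PCA over the features-by-features correlation matrix stays on the client.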

For more info on the Hadoop packages see:

http://markahall.blogspot.co.nz/2013/10/weka-and-hadoop-part-1.html

The distributedWekaHadoop package can be installed via the package manager in Weka 3.7.
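If you prefer a terminal to the GUI, the package manager can also be driven from the command line. A sketch, assuming weka.jar is on your classpath (verify the exact invocation for your 3.7 release):

```
java -cp weka.jar weka.core.WekaPackageManager -install-package distributedWekaHadoop
```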

Cheers, Mark.

Other tips

Depending on the algorithm, it may be quite complex to rewrite it to use Hadoop.

You can use Apache Mahout instead. It does have support for PCA.
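For a rough idea of the Mahout route: its stochastic SVD (ssvd) job has a PCA mode that mean-centers the input matrix before decomposition. The flags below are assumptions based on the Mahout 0.x CLI and should be checked against your version (the HDFS paths are hypothetical):

```
# Distributed PCA via stochastic SVD: rank 20, PCA (mean-centering)
# mode enabled, overwriting any previous output.
mahout ssvd -i /user/me/input-matrix -o /user/me/pca-out -k 20 --pca true -ow
```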
