I have a dataset on which I need to run PCA (Principal Component Analysis, a dimensionality reduction technique), which is easy to do in Weka.

Since the dataset is large, Weka runs into memory issues. These could be resolved by linking Weka with Hadoop so the algorithm runs on a cluster. Could anyone help me with this? How can I connect Weka with Hadoop to handle larger datasets?

Thank you.


Solution

Weka 3.7 has new packages for distributed processing in Hadoop. One of the jobs provided by these packages will compute a correlation (or covariance) matrix in Hadoop. The user can optionally have the job use the correlation matrix as input to a PCA analysis (this part runs outside of Hadoop) and produce a "trained" Weka PCA filter. This scales Weka's PCA analysis in the number of instances (but not in the number of original features since the PCA computation still happens locally on the client machine).
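
For context, the PCA step itself is Weka's standard PrincipalComponents filter; the Hadoop job only computes the correlation matrix used to "train" it. As a point of reference, a minimal local (in-memory) sketch of that filter, assuming a small ARFF sample at a hypothetical path, looks like this:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class PcaSketch {
    public static void main(String[] args) throws Exception {
        // Load a (small) sample of the data; "mydata.arff" is a placeholder path.
        Instances data = DataSource.read("mydata.arff");

        // Standard Weka PCA filter: keep components covering 95% of the variance.
        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95);
        pca.setInputFormat(data);

        // Project the instances into principal-component space.
        Instances transformed = Filter.useFilter(data, pca);
        System.out.println("Components retained: " + transformed.numAttributes());
    }
}
```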

For more info on the Hadoop packages see:

http://markahall.blogspot.co.nz/2013/10/weka-and-hadoop-part-1.html

The distributedWekaHadoop package can be installed via the package manager in Weka 3.7.
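
If you prefer the command line to the GUI, the same install can be done with Weka's package manager class (assuming weka.jar is on the classpath):

```
java weka.core.WekaPackageManager -install-package distributedWekaHadoop
```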

Cheers, Mark.

Other tips

Depending on the algorithm, rewriting it to run on Hadoop can be quite complex.

You can use Apache Mahout instead. It does have support for PCA.
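
For what it's worth, Mahout exposes PCA through its stochastic SVD (ssvd) job. A rough sketch of an invocation is below; the paths are placeholders and the exact flags vary between Mahout versions, so treat them as assumptions and verify with `mahout ssvd --help`:

```
# Input must already be a SequenceFile of Mahout vectors on HDFS.
mahout ssvd -i /user/me/vectors -o /user/me/pca-out -k 50 --pca true
```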
