Question

I am using Weka for my document classification research. I need to set a baseline on which I will show that my contribution improves classification. However, using default Latent Semantic Analysis in the Weka API results in an OutOfMemory error.

After some preprocessing, my training set consists of 25,765 attributes across 9,603 instances. The test set has the same attributes (class and regular), but only 3,299 instances.

I have 8 GB of RAM and have already set the Java heap size to 4 GB, yet I still get an OutOfMemoryError. Here is the error message:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at weka.core.matrix.Matrix.getArrayCopy(Matrix.java:301)
at weka.core.matrix.SingularValueDecomposition.<init>(SingularValueDecomposition.java:76)
at weka.core.matrix.Matrix.svd(Matrix.java:913)
at weka.attributeSelection.LatentSemanticAnalysis.buildAttributeConstructor(LatentSemanticAnalysis.java:511)
at weka.attributeSelection.LatentSemanticAnalysis.buildEvaluator(LatentSemanticAnalysis.java:416)
at weka.attributeSelection.AttributeSelection.SelectAttributes(AttributeSelection.java:596)
at weka.filters.supervised.attribute.AttributeSelection.batchFinished(AttributeSelection.java:455)
at weka.filters.Filter.useFilter(Filter.java:682)
at test.main(test.java:44)

I have tested my code with a smaller dataset and there everything works as it should, so it is not a code-related problem. Could someone explain how I can scale up LSA to fit my requirements? Or is there another, similar process I can apply that is more scalable?


Solution

You aren't going to like the answer, but Weka can't handle this. Its LSA implementation always computes a full SVD, so once you have more than a few thousand data points the decomposition alone takes an enormous amount of time.

Not to mention that Weka generally uses far more memory than it needs.

On top of all that, Weka builds a dense matrix to run the SVD on. Your data is probably sparse, and that destroys any hope of using Weka for LSA.
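To put rough numbers on the dense-matrix problem, here is a back-of-the-envelope sketch using the training-set dimensions from the question. The class and method names are made up for illustration, and the estimate counts only the raw 8-byte doubles, ignoring array headers and whatever else Weka allocates.

```java
public class DenseMatrixFootprint {

    /** Raw size in bytes of a dense rows x cols matrix of doubles. */
    static long denseBytes(long rows, long cols) {
        return rows * cols * Double.BYTES;
    }

    /** Converts a byte count to GiB (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Dimensions taken from the question: 9,603 instances, 25,765 attributes.
        long oneCopy = denseBytes(9_603, 25_765);
        System.out.printf("one dense copy:   %.2f GiB%n", toGiB(oneCopy));
        System.out.printf("two dense copies: %.2f GiB%n", toGiB(2 * oneCopy));
    }
}
```

A single dense copy is 1,979,370,360 bytes, or about 1.84 GiB. The stack trace fails in Matrix.getArrayCopy inside the SingularValueDecomposition constructor, which means a second copy was being allocated at that point, so two copies alone come to roughly 3.7 GiB before the decomposition has stored any of its results. That leaves essentially nothing of a 4 GB heap.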

The fact is you are going to have to use something other than Weka to get LSA done.
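As a rough illustration of why sparse-aware tools scale where the dense path does not, here is a minimal sketch that finds only the leading singular value of a sparse matrix by power iteration on A^T A. This is not Weka's method and it is not a full LSA: a real replacement would use a library that computes a truncated SVD (the top k components) on sparse input. The class name, the CSR layout, and the fixed iteration count are all assumptions made for this example. The point is that the work and memory depend on the number of non-zero entries rather than on rows times columns.

```java
import java.util.Random;

/** Sketch only: leading singular value of a sparse matrix via power iteration. */
public class SparsePowerIteration {
    // Compressed sparse row storage: row i owns entries rowPtr[i] .. rowPtr[i+1]-1.
    final int rows, cols;
    final int[] rowPtr, colIdx;
    final double[] values;

    SparsePowerIteration(int rows, int cols, int[] rowPtr, int[] colIdx, double[] values) {
        this.rows = rows; this.cols = cols;
        this.rowPtr = rowPtr; this.colIdx = colIdx; this.values = values;
    }

    /** y = A x, touching only the stored non-zeros. */
    double[] multiply(double[] x) {
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++)
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                y[i] += values[k] * x[colIdx[k]];
        return y;
    }

    /** x = A^T y, again touching only the stored non-zeros. */
    double[] multiplyTransposed(double[] y) {
        double[] x = new double[cols];
        for (int i = 0; i < rows; i++)
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                x[colIdx[k]] += values[k] * y[i];
        return x;
    }

    static double norm(double[] v) {
        double sum = 0;
        for (double d : v) sum += d * d;
        return Math.sqrt(sum);
    }

    /** Repeats v <- A^T A v / ||A^T A v||, then returns ||A v||. */
    double leadingSingularValue(int iterations, long seed) {
        Random rnd = new Random(seed);
        double[] v = new double[cols];
        for (int j = 0; j < cols; j++) v[j] = rnd.nextDouble() + 0.1; // non-zero start vector
        for (int it = 0; it < iterations; it++) {
            double[] w = multiplyTransposed(multiply(v));
            double n = norm(w);
            if (n == 0) return 0; // A is the zero matrix
            for (int j = 0; j < cols; j++) v[j] = w[j] / n;
        }
        return norm(multiply(v));
    }

    public static void main(String[] args) {
        // Tiny 2x2 example, [[3, 0], [0, 1]], stored as two non-zeros.
        SparsePowerIteration a = new SparsePowerIteration(
                2, 2, new int[] {0, 1, 2}, new int[] {0, 1}, new double[] {3.0, 1.0});
        System.out.println(a.leadingSingularValue(50, 42L));
    }
}
```

Nothing here ever materializes the full rows-by-columns array; the only dense objects are vectors of length rows or cols. Truncated SVD routines built on the same matrix-vector products are what make LSA feasible on a term-document matrix of this size.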

Licensed under: CC-BY-SA with attribution