Running Weka over a large ARFF dataset file

https://stackoverflow.com/questions/21387489

03-10-2022

Question

I have an ARFF file that contains 700 entries, each with 42000+ features, for an NLP-related project. The file is currently in dense format, but its size could be reduced substantially with a sparse representation. I am running on a Core 2 Duo machine with 2 GB RAM, and I am getting an out-of-memory exception even after increasing the memory limit to 1536 MB.

Would converting the ARFF file to a sparse representation help, or will I need to run my code on a much more powerful machine?

Solution

How much memory is needed depends on the algorithm's internal data structures and on how it processes the data (incrementally or all in memory). So the memory you will need depends on the algorithm.
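For a rough sense of scale, the raw dense matrix from the question can be estimated with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 8 bytes per value (a double), takes 42000 as the feature count even though the question says "42000+", and ignores JVM object overhead and whatever working structures the algorithm builds on top. The function name `dense_matrix_bytes` is made up for this example.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold an n_rows x n_cols matrix of fixed-size values."""
    return n_rows * n_cols * bytes_per_value


# Figures from the question: 700 entries, roughly 42000 features each.
size = dense_matrix_bytes(700, 42000)
print(f"{size} bytes, about {size / 2**20:.1f} MiB")
```

Under these assumptions the raw values come to roughly 224 MiB, so anything beyond that is overhead or memory used by the algorithm itself.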

So a sparse representation is convenient for you because the file is more compact, but, as far as I know, the algorithm will need the same amount of memory to create the model from the same dataset. The input format should be transparent to the algorithm.
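If you want to try the sparse format anyway, each data row in a sparse ARFF file is written as `{index value, ...}`, with zero-based attribute indices and zero values left out. The sketch below converts one dense data row to that form. It assumes simple comma-separated rows with no quoted values containing commas, and it keeps any non-numeric token (such as `?` or a class label) as-is; the function name `dense_row_to_sparse` is hypothetical, not part of Weka.

```python
def dense_row_to_sparse(row: str) -> str:
    """Convert one dense ARFF data row to sparse ARFF notation."""
    entries = []
    for index, token in enumerate(row.strip().split(",")):
        token = token.strip()
        try:
            is_zero = float(token) == 0.0
        except ValueError:
            # Non-numeric tokens (missing values, nominal labels) are kept.
            is_zero = False
        if not is_zero:
            entries.append(f"{index} {token}")
    return "{" + ", ".join(entries) + "}"


print(dense_row_to_sparse("0,3,0,0,1.5,pos"))
```

The header section of the file (`@relation`, `@attribute`) stays unchanged; only the rows after `@data` would be rewritten this way.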

Licensed under: CC-BY-SA with attribution