Question

I have a really large file, around 10 GB. I can't load it into memory, so I converted it to a .mat file. But the 'out of memory' error still comes up when I try clustering. I think the ultimate solution is to keep that in-memory data on disk. However, I need to call MATLAB's kmeans() function. Is there a way to put the local variables inside kmeans on disk as well, without rewriting the function?


Solution

You need a strategy to deal with large data sets. Possibilities are:

  1. Use a system with enough memory
  2. Reduce the precision of your data set. For clustering, small errors and scaling are not important, so convert attributes to scaled uint8 or uint16 if possible (and, obviously, delete all irrelevant data). See the sketch after this list.
  3. Use more appropriate algorithms. I'm not an expert in this field, but CLARA and CLARANS are two alternatives. These algorithms only require a subset of the data at a time, so it should be possible to combine them with matfile to keep just the relevant parts in memory.
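
A minimal sketch of how point 2 could look in practice, processing the big matrix chunk by chunk through matfile and writing a scaled uint16 copy to a new file. The variable name X, the file names, and the chunk size are assumptions; adjust them to your data.

```matlab
% Minimal sketch, assuming the 10 GB matrix is stored as variable X in
% 'data.mat' with one observation per row (names and sizes are assumptions).
in  = matfile('data.mat');
out = matfile('data_uint16.mat', 'Writable', true);

[nRows, nCols] = size(in, 'X');
chunk = 100000;                          % rows per chunk; tune to your RAM

% Pass 1: find the global min/max so every chunk uses the same scaling.
lo = inf;  hi = -inf;
for r = 1:chunk:nRows
    idx   = r:min(r + chunk - 1, nRows);
    block = in.X(idx, :);
    lo = min(lo, min(block(:)));
    hi = max(hi, max(block(:)));
end

% Pass 2: rescale each chunk to [0, 65535] and store it as uint16,
% shrinking 8-byte doubles to 2 bytes per value.
for r = 1:chunk:nRows
    idx   = r:min(r + chunk - 1, nRows);
    block = in.X(idx, :);
    out.Xq(idx, 1:nCols) = uint16((block - lo) / (hi - lo) * 65535);
end
```

If your kmeans version does not accept integer matrices directly, cast the reduced data with single(...) before clustering; single still uses half the memory of the original doubles.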

OTHER TIPS

When you load your data, it is first loaded into your computer's RAM, so I think the only ultimate solution to your problem is to have something like 16 GB of RAM.

You can probably try downsampling your data if it is not highly nonlinear. If you are interested, see the reference: http://www.mathworks.com/help/signal/ref/downsample.html

For example, you can take your data and downsample it by a factor of 4, which leaves you with 2.5 GB of data. You can go further, but that will increase the error. After your processing you can upsample your result using different techniques (MATLAB has them built in). Unfortunately I don't know the type of your data, so if my answer does not match your question, sorry.
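
A hedged sketch of that idea, assuming an n-by-d matrix X in 'data.mat' with one observation per row and a hypothetical k = 10 clusters (both are assumptions). Instead of calling downsample on the full matrix, every 4th row is kept while reading the file chunk by chunk through matfile, so the full 10 GB never has to sit in memory; the "upsampling" step is done by assigning each original row to its nearest centroid rather than by signal interpolation.

```matlab
% Minimal sketch; variable name X, file name, chunk size and k = 10 are assumptions.
m = matfile('data.mat');
[nRows, nCols] = size(m, 'X');

scale = 4;                                % keep every 4th row (~2.5 GB of 10 GB)
chunk = 100000;                           % rows per chunk; keep it a multiple of scale

% Build the downsampled matrix chunk by chunk.
Xs = zeros(ceil(nRows / scale), nCols);
filled = 0;
for r = 1:chunk:nRows
    idx   = r:min(r + chunk - 1, nRows);
    block = m.X(idx, :);
    keep  = block(1:scale:end, :);        % same rows downsample(block, scale) would keep
    Xs(filled + (1:size(keep, 1)), :) = keep;
    filled = filled + size(keep, 1);
end
Xs = Xs(1:filled, :);

[~, C] = kmeans(Xs, 10);                  % cluster the reduced data

% Assign every original row to its nearest centroid, again in chunks.
labels = zeros(nRows, 1);
for r = 1:chunk:nRows
    idx = r:min(r + chunk - 1, nRows);
    labels(idx) = knnsearch(C, m.X(idx, :));
end
```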

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow