Question

I have to perform k-means clustering on a really huge matrix (about 300,000 × 100,000 values; at 8 bytes per value that is roughly 240 GB, so well over 100 GB). I want to know whether I can use R or Weka to perform this. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space.

I have enough disk space for the calculations, but loading such a matrix seems to be a problem with R (I don't think the bigmemory package would help me: a big matrix automatically uses all my RAM, then my swap file if that is not enough).

So my question is: what software should I use (possibly in combination with other packages or custom settings)?

Thanks for helping me.

Note: I use Linux.


Solution

Does it have to be k-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. Its implementation should easily scale up to networks with millions of nodes; your example would have 300,000 nodes, assuming that you have 100,000 attributes. With this approach, the data will be naturally pruned in the transformation step, and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications I have dealt with I used the Pearson or Spearman correlation, and MCL ships with software to perform this computation efficiently on large-scale data (it can utilise multiple CPUs and multiple machines).
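To make the transformation step concrete, here is a rough R sketch (not MCL's own bundled tools) that streams the matrix in row chunks, keeps only strong Pearson correlations, and writes an edge list in MCL's label ("ABC") format. The file names, the chunk size, and the 0.7 cutoff are illustrative assumptions, and with 100,000 columns this step will be slow:

```r
# Sketch only: build a similarity network from the matrix in row chunks,
# keeping only strong edges. File names, chunk size, and the 0.7
# correlation cutoff are illustrative assumptions.

# Stream rows [start, start + n) of a whitespace-delimited text file
# (one matrix row per line) without loading the whole file.
read_chunk <- function(path, start, n, n_cols) {
  matrix(scan(path, skip = start - 1, nlines = n, quiet = TRUE),
         nrow = n, ncol = n_cols, byrow = TRUE)
}

path   <- "matrix.txt"
n_rows <- 300000
n_cols <- 100000
chunk  <- 1000
cutoff <- 0.7

con <- file("edges.abc", "w")   # one "node1 TAB node2 TAB weight" line per edge
for (i in seq(1, n_rows, by = chunk)) {
  a <- read_chunk(path, i, min(chunk, n_rows - i + 1), n_cols)
  for (j in seq(i, n_rows, by = chunk)) {
    b <- if (j == i) a else read_chunk(path, j, min(chunk, n_rows - j + 1), n_cols)
    s <- cor(t(a), t(b))                      # Pearson correlation of row vectors
    idx <- which(s >= cutoff, arr.ind = TRUE) # keep only strong similarities
    if (j == i)                               # same block: drop self/duplicate pairs
      idx <- idx[idx[, 1] < idx[, 2], , drop = FALSE]
    if (nrow(idx) > 0)
      writeLines(sprintf("r%d\tr%d\t%.4f",
                         i + idx[, 1] - 1, j + idx[, 2] - 1, s[idx]), con)
  }
}
close(con)
```

The resulting edge list can then be clustered on the command line with something like mcl edges.abc --abc -I 2.0 -o clusters.txt, where the inflation parameter -I controls cluster granularity.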

There is still an issue with the data size, as most clustering algorithms will require you to perform all pairwise comparisons at least once. Is your data really stored as a giant dense matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding smaller elements? Do you have access to more than one machine across which to distribute these computations?
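If many entries are zero, a sparse representation shrinks the storage to roughly the number of non-zero entries, which may make the problem tractable on one machine. A minimal R illustration with the Matrix package (the dimensions and the ~1% density are assumptions for the sake of the example):

```r
library(Matrix)

# Toy example: a 10,000 x 5,000 matrix with ~1% non-zero entries.
set.seed(1)
nnz <- 5e5
m <- sparseMatrix(i = sample(10000, nnz, replace = TRUE),
                  j = sample(5000,  nnz, replace = TRUE),
                  x = runif(nnz),
                  dims = c(10000, 5000))

# Dense storage would need 10000 * 5000 * 8 bytes = 400 MB;
# the sparse form holds only the non-zeros plus their indices.
print(object.size(m), units = "MB")   # a few MB instead of 400 MB
```

Of course, this only helps if the downstream clustering code can consume a sparse format, so check that before converting.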

OTHER TIPS

I keep the link (it can be useful to the specific user), but I agree with Gavin's comment! To perform k-means clustering on big data you can use the rxKmeans function from Revolution R Enterprise, a proprietary distribution of R (I know this can be a problem); this function seems capable of managing that kind of data.
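As a rough illustration only (untested at this scale): RevoScaleR works on its on-disk XDF format and streams the data in chunks, so something along these lines should avoid loading the matrix into RAM. The file names are placeholders and argument details may differ between versions.

```r
# Sketch only: requires the proprietary RevoScaleR package shipped with
# Revolution R Enterprise; file names are placeholders.
library(RevoScaleR)

# Convert the text data to the on-disk XDF format so it is processed
# in chunks rather than loaded into RAM.
rxImport(inData = "huge_matrix.csv", outFile = "huge_matrix.xdf",
         overwrite = TRUE)

# Build a formula over all columns and run the chunked k-means.
vars <- names(rxGetVarInfo("huge_matrix.xdf"))
f    <- as.formula(paste("~", paste(vars, collapse = " + ")))
fit  <- rxKmeans(formula = f, data = "huge_matrix.xdf", numClusters = 10)
```

Note that with 100,000 columns even building the formula is unwieldy, so some dimensionality reduction beforehand (see the links below) would probably pay off.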

Since we know nothing at all about the data, nor the questioner's goals for it, just a couple of general links:

- I. Guyon's video lectures (many papers and books too)
- feature selection on stats.stackexchange

Check out Mahout; it will do k-means on a large data set:

http://mahout.apache.org/
