Question

I am looking to do k-means clustering on a set of 10-dimensional points. The catch: there are 10^10 points.

I am looking for just the center and size of the largest clusters (let's say 10 to 100 clusters); I don't care which cluster each point ends up in. Using k-means specifically is not important; I am just looking for a similar effect, so any approximate k-means or related algorithm would be great (minibatch-SGD means, ...). Since GMM is in a sense the same problem as k-means, doing GMM on data of the same size is also interesting.

At this scale, subsampling the data probably doesn't change the result significantly: the odds of finding the same top 10 clusters in a 1/10000th sample of the data are very good. But even then, that is a 10^6-point problem, which is at or beyond the edge of tractability.
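To make the subsampling argument concrete, here is a minimal numpy sketch (my own illustration, not part of the question): a toy stand-in with 1.2 * 10^5 synthetic points instead of 10^10 real ones, clustered using only a 1% uniform subsample. The Gaussian layout, the farthest-point seeding, and the cluster count are all invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the full data set: 1.2 * 10^5 points in 10-D drawn
# from three well-separated Gaussians (the real problem has 10^10).
true_centers = rng.uniform(-50, 50, size=(3, 10))
full = np.concatenate([c + rng.normal(size=(40_000, 10)) for c in true_centers])

# Uniform 1% subsample (the question suggests 1/10000 at 10^10 scale).
sample = full[rng.choice(len(full), size=len(full) // 100, replace=False)]

# Greedy farthest-point seeding, then a few Lloyd iterations on the
# subsample only -- the full data set is never touched again.
centers = sample[:1]
for _ in range(2):
    d = np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    centers = np.vstack([centers, sample[d.argmax()]])
for _ in range(10):
    labels = np.linalg.norm(
        sample[:, None, :] - centers[None, :, :], axis=2
    ).argmin(axis=1)
    centers = np.array([sample[labels == k].mean(axis=0) for k in range(3)])
```

With well-separated clusters, the centers recovered from the 1% sample land essentially on top of the true cluster means, which is the point: the big clusters survive aggressive subsampling.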


Solution

k-means is based on averages.

It models clusters by their means, so the improvement from adding more data is marginal: the standard error of a mean estimate shrinks like 1/sqrt(n), so adding more data pays off less and less...
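A quick numerical illustration of that rate (a toy numpy sketch of my own, not from the answer): when estimating a mean from n standard-normal samples, quadrupling n only roughly halves the average error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Average absolute error of the sample mean over 500 repetitions,
# for increasing sample sizes n. Theory: error ~ 1/sqrt(n).
errors = {}
for n in (1_000, 4_000, 16_000):
    means = rng.normal(size=(500, n)).mean(axis=1)
    errors[n] = np.abs(means).mean()

# Each 4x step in n should roughly halve the error, so going from
# 10^6 to 10^10 points buys only a 100x better mean estimate for
# 10^4 times the work.
```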

Strategies for such large data always revolve around sampling:

If you want sublinear runtime, you have to do sampling!

In fact, mini-batch k-means and similar methods do exactly this: they repeatedly draw small random samples from the data set.
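A minimal sketch of that idea in plain numpy (sklearn's `MiniBatchKMeans` implements essentially this update rule, but this is not its code): each step draws a small batch and nudges each batch point's nearest center with a per-center learning rate that decays as the center accumulates points. The blob layout and the rough initial guesses below are invented for the demo; real code would seed with k-means++ on a small sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three 10-D Gaussian blobs standing in for the huge data set.
true_centers = np.array([[-20.0] * 10, [0.0] * 10, [20.0] * 10])
data = np.concatenate([c + rng.normal(size=(30_000, 10)) for c in true_centers])

k, batch_size = 3, 256
# Rough initial guesses for the demo (one near each blob).
centers = np.array([[-15.0] * 10, [5.0] * 10, [15.0] * 10])
counts = np.zeros(k)

for _ in range(200):
    batch = data[rng.choice(len(data), size=batch_size, replace=False)]
    labels = np.linalg.norm(
        batch[:, None, :] - centers[None, :, :], axis=2
    ).argmin(axis=1)
    for j, x in zip(labels, batch):
        counts[j] += 1
        # Learning rate 1/counts[j]: each center is the running mean
        # of every batch point ever assigned to it.
        centers[j] += (x - centers[j]) / counts[j]
```

Note that only 200 x 256 points are ever looked at per "epoch" of updates; the cost is governed by the number of batches, not the size of the data set.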

However, sampling (in particular unbiased sampling) isn't exactly free either... usually you will have to read your data linearly just to draw the sample, because storage rarely gives you cheap random access to individual records.

I'd go with MacQueen's algorithm. It's online; by default it does a single pass over your data (although it is popular to iterate it). It's not easy to distribute, but I guess you can afford to read your data linearly, say, 10 times from an SSD?
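MacQueen's update is easy to sketch: seed with k points, then stream through the data once, moving each point's nearest center to the exact running mean of the points assigned to it. In this toy numpy version (my own illustration) the three synthetic sources are interleaved point by point so that the first k points happen to come from distinct clusters; a real single-pass run would need more careful seeding, e.g. k-means++ on a small sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stream: three 10-D Gaussian sources, interleaved point by point
# so the first k points come from distinct sources.
true_centers = np.array([[-20.0] * 10, [0.0] * 10, [20.0] * 10])
blobs = [c + rng.normal(size=(10_000, 10)) for c in true_centers]
stream = np.stack(blobs, axis=1).reshape(-1, 10)

k = 3
centers = stream[:k].copy()  # seed with the first k points
counts = np.ones(k)

# Single pass: each point pulls its nearest center to the running mean.
for x in stream[k:]:
    j = np.linalg.norm(centers - x, axis=1).argmin()
    counts[j] += 1
    centers[j] += (x - centers[j]) / counts[j]
```

The `counts` array is exactly the cluster sizes the question asks for, and each point is touched once, so a pass over 10^10 points is pure sequential I/O.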

OTHER TIPS

As a side comment, note that using k-means on 10-D data might lead nowhere because of the curse of dimensionality. It varies with the nature of the data, of course, but when I once tried to determine the threshold at which k-means starts behaving strangely with respect to dimensionality, I got something like 7-D: beyond 7 dimensions it started to miss the correct clusters (my data was generated from 4 well-separated Gaussian distributions, and I used MATLAB's kmeans function for my little experiment).

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange