K-means vs. online K-means

https://datascience.stackexchange.com/questions/458

16-10-2019

Question

K-means is a well-known clustering algorithm, but there is also an online variant of it (online K-means). What are the pros and cons of each approach, and when should each be preferred?

Solution

Online k-means (more commonly known as sequential k-means) and traditional k-means are very similar. The difference is that online k-means allows you to update the model as new data is received.

Online k-means should be used when you expect the data to arrive one point at a time (or in chunks), since it lets you refine the model as more data becomes available. The drawback of this method is that the result depends on the order in which the data is received (ref).
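To make the order dependence concrete, here is a minimal sketch of a single-pass sequential k-means on 1-D data. The function name `sequential_kmeans_1d` and the choice to seed the centers with the first k points of the stream are assumptions made for this illustration, not something the answer prescribes.

```python
def sequential_kmeans_1d(stream, k):
    """Single pass of sequential k-means over 1-D points (illustrative sketch).

    Assumption: the first k points of the stream seed the k centers.
    """
    centers = list(stream[:k])
    counts = [1] * k  # each seed point counts as one observation
    for x in stream[k:]:
        # Assign the point to its nearest center.
        j = min(range(k), key=lambda i: abs(x - centers[i]))
        # Move that center toward the point (running-mean update).
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]
    return centers


# The same six points, presented in two different orders.
interleaved = sequential_kmeans_1d([0.0, 10.0, 1.0, 11.0, 2.0, 12.0], k=2)
ordered = sequential_kmeans_1d([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], k=2)

print(interleaved)  # centers near 1 and 11: both groups are found
print(ordered)      # centers near 0 and 7.2: the second group is merged into one center
```

With the interleaved order, both seeds land in different groups and the centers settle on the two group means. With the sorted order, both seeds come from the low group, so one center is dragged across all remaining points.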

Other tips

The original MacQueen k-means publication (the first to use the name "k-means") describes an online algorithm.

MacQueen, J. B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations". Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press. pp. 281–297

After each point is assigned, the mean of its cluster is updated incrementally using a simple weighted-average formula: if the mean was based on n observations so far, the old mean is weighted by n and the new observation by 1.
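The weighted-average update can be sketched in a few lines. The helper name `update_mean` is hypothetical; the formula itself is just (n * old_mean + x) / (n + 1), applied per coordinate.

```python
def update_mean(mean, n, x):
    """Return the cluster mean after adding observation x.

    `mean` is the current mean of n observations; the old mean is
    weighted by n and the new observation by 1.
    """
    return [(n * m + xi) / (n + 1) for m, xi in zip(mean, x)]


# Feed three 2-D points one at a time.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
mean, n = list(points[0]), 1
for p in points[1:]:
    mean = update_mean(mean, n, p)
    n += 1

# Compare with the mean computed over all points at once.
batch_mean = [sum(coord) / len(points) for coord in zip(*points)]
print(mean, batch_mean)
```

The incremental result matches the batch mean, which is why the algorithm never needs to store the points already assigned to a cluster, only the mean and the count.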

As far as I can tell, it was also meant to be a single pass over the data only, although it can be trivially repeated multiple times to reassign points until convergence.

MacQueen's algorithm usually takes fewer iterations than Lloyd's to converge if your data is shuffled, because the means are updated after every point rather than once per pass. On ordered data, it can have problems. On the downside, it requires more computation per object, so each iteration takes slightly longer.
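For contrast, here is a rough sketch of Lloyd's batch iteration on 1-D data, where the means change only after every point has been assigned. The function names, the exact-equality stopping test, and the rule that an empty cluster keeps its old center are choices made for this sketch.

```python
def lloyd_iteration(points, centers):
    """One Lloyd (batch) iteration: assign all points, then recompute means."""
    clusters = [[] for _ in centers]
    for x in points:
        j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        clusters[j].append(x)
    # Assumption for this sketch: an empty cluster keeps its old center.
    return [sum(c) / len(c) if c else old for c, old in zip(clusters, centers)]


def lloyd(points, centers, max_iter=100):
    """Repeat batch iterations until the centers stop changing."""
    for iteration in range(1, max_iter + 1):
        new_centers = lloyd_iteration(points, centers)
        if new_centers == centers:
            return centers, iteration
        centers = new_centers
    return centers, max_iter


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
final_centers, passes = lloyd(points, centers=[0.0, 1.0])
print(final_centers, passes)
```

Starting from two poorly placed centers on sorted data, this needs several full passes because each pass uses the stale centers for every assignment. Unlike a single online pass, it can revisit and reassign points, so it is not tied to the order in which they arrive.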

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange