Question

I have a set of datapoints from the unit interval (i.e. a 1-dimensional dataset with numerical values). I receive some additional datapoints online, and moreover the value of some datapoints might change dynamically. I'm looking for an ideal clustering algorithm which can handle these issues efficiently.

I know sequential k-means clustering copes with the addition of new instances, and I suppose that with a minor modification it can also work with dynamic instance values (i.e. first removing the modified instance from its cluster, then updating the cluster's mean, and finally feeding the modified instance back to the algorithm just as if it were an unseen instance).
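
To make that modification concrete, here is a rough sketch of what I have in mind (the class and the remove-then-reinsert handling of a changed value are my own illustration, not from any library):

```python
# Sketch of sequential (MacQueen-style) k-means for 1-D values, assuming a
# fixed set of k seed centroids. Names are illustrative only.
class SequentialKMeans:
    def __init__(self, initial_centroids):
        self.centroids = list(initial_centroids)  # 1-D cluster means (seeds)
        self.counts = [0] * len(self.centroids)   # points assigned so far

    def nearest(self, x):
        # Index of the centroid closest to x.
        return min(range(len(self.centroids)),
                   key=lambda i: abs(self.centroids[i] - x))

    def add(self, x):
        # Assign x to its nearest cluster and nudge that cluster's mean.
        i = self.nearest(x)
        self.counts[i] += 1
        self.centroids[i] += (x - self.centroids[i]) / self.counts[i]
        return i

    def remove(self, x, i):
        # Undo the contribution of x to cluster i's mean.
        if self.counts[i] > 1:
            self.centroids[i] = (self.centroids[i] * self.counts[i] - x) / (self.counts[i] - 1)
        self.counts[i] -= 1

    def update(self, old_x, new_x, i):
        # A changed value: take it out of its old cluster, reinsert as new.
        self.remove(old_x, i)
        return self.add(new_x)
```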

My concern with using the k-means algorithm is the requirement of supplying the number of clusters as an input. I know that it beats other clustering algorithms (GAs, MSTs, hierarchical methods, etc.) in time and space complexity. Honestly I'm not sure, but maybe I can get away with using one of the aforementioned algorithms: even though my datasets are relatively large, the fact that they have only a single dimension makes me wonder.

More specifically, a typical test case of mine would contain about 10K-200K 1-dimensional datapoints. I would like to complete the clustering preferably in under a second. The dynamic changes in the values are assumed to be smooth, i.e. relatively small. Thus being able to reuse the existing solution (i.e. continuing from the existing clustering when a value is changed or a new datapoint is added) is highly preferred.

So all in all:

Can you think of an algorithm that provides a sweet spot between computational efficiency and the accuracy of the clusters with respect to the problem defined above?

Are there some nice heuristics for the k-means algorithm to automatically compute the value of K beforehand?

Solution

I think hierarchical clustering would be more time-efficient in your case (with a single dimension). Depending on your task, you may implement something like this (a sketch in Python follows the list):

Given N datapoints di with their 1-dimensional values xi:

  1. Sort datapoints based on their xi value.
  2. Calculate the distances between adjacent datapoints (N-1 distances). Each distance is associated with its pair of original datapoints (di, dj).
  3. Sort the distances in ascending order to generate a list of datapoint pairs (di, dj), starting from the closest one.
  4. Iteratively unite datapoints (di, dj) into clusters, starting from the beginning of the list (the closest pair). (Depending on the current state of di and dj, uniting them means: (a) creating a new cluster for two unclustered datapoints, (b) adding a datapoint to an existing cluster, or (c) uniting two clusters.)
  5. Stop uniting once the distance exceeds some threshold.
  6. Create singleton clusters for the datapoints which did not get into any cluster.

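Here is a minimal sketch of the steps above; the union-find bookkeeping and the threshold parameter are my own choices, so treat it as an illustration rather than a tuned implementation:

```python
def single_linkage_1d(values, threshold):
    """Single-linkage clustering of 1-D values by uniting adjacent points
    whose gap is at most `threshold`. Returns labels in the input order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])            # step 1

    # Step 2: N-1 adjacent gaps, each tied to its pair of original datapoints.
    gaps = [(values[order[k + 1]] - values[order[k]], order[k], order[k + 1])
            for k in range(n - 1)]
    gaps.sort()                                                  # step 3

    # Union-find keeps step 4's three cases (new cluster, grow, merge) cheap.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for dist, i, j in gaps:
        if dist > threshold:                                     # step 5
            break
        parent[find(i)] = find(j)                                # step 4

    # Step 6: points never united stay as singleton clusters (their own root).
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Note that in one dimension single linkage degenerates into simply cutting the sorted sequence at every gap larger than the threshold, so the whole procedure is dominated by the O(N log N) sorts; inserting a new point incrementally should then only require locating its two sorted neighbours.
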
This algorithm implements single-linkage clustering. It can easily be tuned to implement average linkage. Complete linkage will be less efficient, but perhaps the simpler linkages will already give good results, depending on your data and task.

I believe that for 200K datapoints this should take under a second, if you use proper data structures for the above operations.

OTHER TIPS

Try using HDBSCAN. Even though it is a hierarchical method, it might prove more efficient. I am running it on multidimensional datasets a bit larger than 200K rows, and it takes less than a minute. The caveat is the number of clusters it might produce: if there are too many, you might want to stick with a partitioning method.
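
For reference, a minimal sketch with the hdbscan Python package; min_cluster_size=50 is just a placeholder you would tune for your data:

```python
import numpy as np
import hdbscan  # pip install hdbscan

values = np.random.rand(200_000)   # stand-in for your 1-D datapoints
X = values.reshape(-1, 1)          # the library expects a 2-D array

clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
labels = clusterer.fit_predict(X)  # label -1 marks points treated as noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters found")
```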

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange