Question

I'm currently learning clustering. I have performed k-means clustering on the average_duration_of_call of subscribers, which I store in my database. On the first run, with 3 centers, I got: cluster1 (53.33369 sec) with 367 subscribers, cluster2 (121.67123 sec) with 128 subscribers, and cluster3 (369.09000 sec) with 8 subscribers.

I then reran the clustering with 6 centers and obtained: cluster1 (904.66670 sec) with 1 subscriber, cluster2 (27.7 sec) with 108 subscribers, cluster3 (151.58 sec) with 43 subscribers, cluster4 (95 sec) with 135 subscribers, cluster5 (59.5 sec) with 207 subscribers, and cluster6 (278 sec) with 9 subscribers.

Now my question is: which of these is the better clustering, and how do I find the best one? Any help from experience is appreciated (I'm currently using R).

Solution

If you are a beginner, I recommend starting with density-based clustering, so that an initial value of K isn't required. You can start DBSCAN with epsilon = 10 and minPts = 5 and check the number of clusters it generates. Then gradually increase epsilon (11, 12, ..., 15) and decrease minPts (4, 3, ..., 1), checking the number of generated clusters each time. The average of these counts should give a rough estimate of the number of real clusters.
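The parameter sweep described above can be sketched roughly as follows. This is a minimal, dependency-free illustration in plain Python rather than R. The `durations` list is made-up toy data (not the asker's subscribers), and `dbscan_1d` and `count_clusters` are hypothetical helper names, a naive O(n²) textbook DBSCAN for one-dimensional values rather than any library's implementation.

```python
def dbscan_1d(values, eps, min_pts):
    """Naive DBSCAN for 1-D data: returns a cluster id per value, -1 for noise."""
    n = len(values)
    # A point's neighbourhood includes the point itself.
    neighbours = [[j for j in range(n) if abs(values[i] - values[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n              # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1           # provisional noise; may become a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is also core: keep expanding
    return labels


def count_clusters(labels):
    """Number of distinct clusters, ignoring noise (-1)."""
    return len(set(labels) - {-1})


# Toy call durations in seconds (assumed data, for illustration only).
durations = [20, 22, 25, 27, 30, 31, 33, 100, 102, 105, 107, 110, 380, 400, 420]

# Sweep epsilon upwards and minPts downwards, recording the cluster count.
counts = {}
for eps in range(10, 16):
    for min_pts in range(5, 0, -1):
        counts[(eps, min_pts)] = count_clusters(dbscan_1d(durations, eps, min_pts))

average_count = sum(counts.values()) / len(counts)
print("clusters at eps=10, minPts=5:", counts[(10, 5)])
print("average cluster count over the sweep:", average_count)
```

Note that with minPts = 1 every point counts as a core point, so isolated values become clusters of their own; this can pull the average upwards, which is worth keeping in mind when reading it as an estimate.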

But if you need to apply k-means clustering, you might find the paper "Selection of K in K-means clustering" useful.

OTHER TIPS

Well, k-means already computes a score for you: the within-cluster sum of squares.

For a fixed k, choose the result that achieved the better (lower) score.

However, the score naturally improves as you increase k. Obviously, if you set k to the data set size, it will be 0. So to compare results with different k, you may want to use a criterion such as the BIC or the Silhouette Coefficient (look it up on Wikipedia).
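To make this concrete, here is a small sketch in plain Python (standard library only) that runs a simple 1-D k-means for several values of k and reports both the sum of squares and the mean silhouette. The data is made up for illustration, and `kmeans_1d`, `total_within_ss` and `mean_silhouette` are hypothetical helper names, not library functions. The deterministic initialisation is an assumption chosen to keep the example reproducible; real implementations typically use random restarts. In R, the corresponding sum of squares should be the `tot.withinss` component returned by `kmeans()`.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on 1-D data; returns (labels, centers)."""
    xs = sorted(values)
    # Deterministic start (an assumption): spread centers over the sorted data.
    centers = [xs[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        new_centers = []
        for c in range(k):
            members = [x for x, l in zip(values, labels) if l == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break                    # assignments can no longer change
        centers = new_centers
    return labels, centers


def total_within_ss(values, labels, centers):
    """Sum of squared distances from each value to its cluster center."""
    return sum((x - centers[l]) ** 2 for x, l in zip(values, labels))


def mean_silhouette(values, labels):
    """Average silhouette coefficient; requires at least two clusters."""
    clusters = set(labels)
    scores = []
    for i, (x, own) in enumerate(zip(values, labels)):
        same = [abs(x - y) for j, (y, l) in enumerate(zip(values, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)       # convention: singleton clusters score 0
            continue
        a = sum(same) / len(same)    # mean distance within own cluster
        b = min(                     # mean distance to the nearest other cluster
            sum(abs(x - y) for y, l in zip(values, labels) if l == c)
            / sum(1 for l in labels if l == c)
            for c in clusters if c != own
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy call durations in seconds (assumed data, for illustration only).
durations = [20, 22, 25, 27, 30, 31, 33, 100, 102, 105, 107, 110, 380, 400, 420]

wss, sil = {}, {}
for k in range(2, 7):
    labels, centers = kmeans_1d(durations, k)
    wss[k] = total_within_ss(durations, labels, centers)
    sil[k] = mean_silhouette(durations, labels)
    print(k, round(wss[k], 2), round(sil[k], 3))

best_k = max(sil, key=sil.get)
print("k with the highest mean silhouette:", best_k)
```

On this toy data the sum of squares keeps shrinking as k grows, while the silhouette peaks at k = 3, which is why the latter is the more useful guide when comparing different k.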

Oh, and consider using a book. This is a classic question, and it should be covered in any good book.

Licensed under: CC-BY-SA with attribution