Question

I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Seagram's Programming Collective Intelligence. The tree output is a binary tree with a "distance" value at each node, that tells you how far apart the two child nodes are.

I can then display this as a Dendrogram and it makes it fairly easy for a human spot which values are grouped together. However I'm having difficult coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:

  • The number of group
  • Which points should be placed in each group

Is there a standard algorithm for this?

Was it helpful?

Solution

I think there is no default way to do this. Simple 'manual' methods would be to either:

  • specify the number of clusters you want/expect
  • set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster

There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package which automatically deals with this problem, also pvclust could be used. Here are two more methods described to deal with this problem, Salvador (2002) and Daniels (2006).

OTHER TIPS

I have found out that the Calinski-Harabasz index (also known as Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top