Question

I have a problem understanding group assignment in hierarchical cluster analysis. As an example, this is the dendrogram of complete linkage on the iris data set.

[Image: dendrogram of complete-linkage hierarchical clustering of the iris data set]

After I use

> table(cutree(hc, 3), iris$Species)

This is the output:

  setosa versicolor virginica
1     50          0         0
2      0         23        49
3      0         27         1

I have read on a statistics website that object 1 in the data always belongs to group/cluster 1. From the output above, we know that setosa is in group 1. But how do I know about the other two species? How do they end up in group 2 or group 3? Is there a calculation I need to know?


Solution

I'm guessing this is the code behind your result (the image doesn't appear to be there at the moment).

> lmbjck <- cutree(hclust(dist(iris[1:4], "euclidean")), 3)
> table(lmbjck, iris$Species)

lmbjck setosa versicolor virginica
     1     50          0         0
     2      0         23        49
     3      0         27         1

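dist computes the pairwise Euclidean distances between the 150 plants (three species) from their four measurements. Converted to a full matrix, the result has identical row and column names, one per plant. (Calling rownames or colnames on the dist object itself returns NULL for both, so the comparison is done on the matrix.)

> iris.dist <- dist(iris[1:4], "euclidean")
> iris.mat <- as.matrix(iris.dist)
> identical(rownames(iris.mat), colnames(iris.mat))
[1] TRUE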

That object is passed to hclust, which constructs the tree; cutree then cuts it into three groups. Object iris.order holds the left-to-right order in which the leaves of the dendrogram are drawn. The result of cutree keeps the original row order of the data; only the drawing of the tree uses the ordering in iris.order.

> iris.hclust <- hclust(iris.dist)
> iris.cutree <- cutree(iris.hclust, 3)
> iris.order <- iris.hclust$order
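To make the difference between the two orderings concrete, here is a minimal sketch. It rebuilds the objects from above so it runs on its own; the name clusters.in.plot.order is made up for illustration and is not part of the original answer.

```r
# Rebuild the objects from the answer so this block is self-contained.
iris.hclust <- hclust(dist(iris[1:4], "euclidean"))  # complete linkage is the default
iris.cutree <- cutree(iris.hclust, 3)
iris.order  <- iris.hclust$order

# cutree() returns one group label per row of the data, in original row order.
length(iris.cutree) == nrow(iris)

# $order is a permutation of the row numbers: the left-to-right leaf order.
all(sort(iris.order) == seq_len(nrow(iris)))

# Hypothetical helper vector: group labels rearranged to follow the
# dendrogram from left to right.
clusters.in.plot.order <- iris.cutree[iris.order]
head(data.frame(row = iris.order, group = clusters.in.plot.order))
```

Because table(iris.cutree, iris$Species) pairs two vectors that are both in original row order, no reordering is needed for the cross-tabulation.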

Here's proof. I've put together the original Species designations, the species designations ordered as they appear in the dendrogram, the order number, and the group from cutree (which, like the original column, is in original row order).

> data.frame(original = iris$Species, ordered = iris$Species[iris.order],
             order.num = iris.order, cutree = iris.cutree)

      original    ordered order.num cutree
1       setosa  virginica       108      1
2       setosa  virginica       131      1
3       setosa  virginica       103      1
4       setosa  virginica       126      1
5       setosa  virginica       130      1
6       setosa  virginica       119      1
    ...
103  virginica     setosa        31      2
104  virginica     setosa        26      2
105  virginica     setosa        10      2
106  virginica     setosa        35      2
107  virginica     setosa        13      3
108  virginica     setosa         2      2
    ...

Let's look at the output. In the first line, order.num is 108. This means that the first item on the left side of the dendrogram comes from row 108 of the data. Skim down to line 108, and you can see that the original Species there is indeed virginica, matching the ordered column in line 1. Note that the cutree column is in original row order, so the group of that leftmost item is the value in line 108 (group 2), not the 1 shown in line 1, which belongs to row 1 (a setosa). Now look at line 3. Under order.num you can see that this item comes from row 103. Again, if you go down and check the original species in row 103, it's (still) virginica. I'll leave it as an exercise for you to check other (random) rows and convince yourself that cutree keeps the original row order, which is the order used to build the table at the beginning. The table is therefore correct.
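As for which group gets which number: as far as I can tell, cutree numbers the groups in the order in which they first appear in the data, which is why row 1 always ends up in group 1. The sketch below checks this on the iris example; first.row is a name made up for illustration.

```r
# Rebuild the clustering so this block is self-contained.
iris.hclust <- hclust(dist(iris[1:4], "euclidean"))
iris.cutree <- cutree(iris.hclust, 3)

# For each group label, the first row of the data that carries it.
first.row <- tapply(seq_along(iris.cutree), iris.cutree, min)
first.row

# If groups are numbered by first appearance, these row numbers increase
# with the group label, and group 1 starts at row 1.
all(diff(as.vector(first.row)) > 0)

# Species of the row that "opens" each group.
iris$Species[as.vector(first.row)]
```

So the numbers 2 and 3 carry no meaning beyond the row order of the input: shuffling the rows of iris before clustering would give the same groups, possibly with different numbers.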

Licensed under: CC-BY-SA with attribution