Hierarchical clustering for bitsequences

https://stackoverflow.com/questions/8138226

01-03-2021

Question

This is a homework problem, and I'm having difficulty understanding it. The homework question is:

    Cluster the following bitsequences using hierarchical clustering. If d(:,:) defines the
    distance between two bitsequences a and b, d(a,b) = Hamming-Distance(a,b). If C1 and C2 are
    two clusters, the distance between C1 and C2 is
    d(C1,C2) = (1 / (|C1| |C2|)) * Sum(a in C1, b in C2) d(a,b).
    Show the cluster hierarchy with all the intermediate steps.

1   10001011
2   11010111
3   00101010
4   00011110
5   10101110
6   11100001

I read in a book that I should initially treat each sequence as its own cluster and then start merging the closest ones. Each merge forms a new cluster. I then have to find the cluster closest to this newly formed one by computing its distance to every other cluster, averaging the distances between every pair of elements drawn from the two clusters, as the question says.

My solution: I will compute the Hamming distance between all pairs and choose the pair with the smallest distance, which is C3 and C5 (Hamming distance 2). These two can now be merged into a new cluster.

My concern is: what exactly is meant by merging here? How do I do it? Or do I simply keep them as they are and call the pair a new cluster?

And how do I find the average distance between the new cluster and the other clusters?

Also, to calculate the average, the formula says to divide by |C1| and |C2|. Does this mean I have to divide by the number of elements, which is 8 per sequence, times that of the cluster it gets merged into?

Any help is greatly appreciated. Thank you.

Solution

It sounds as though you want bottom-up (agglomerative) clustering. The idea is to start with singleton sets:

{1} {2} {3} {4} {5} {6}

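While there are two or more sets, select the closest pair and replace the two sets with their union; taking that union is all that "merging" means. The merge order below is arbitrary and only illustrates the mechanics; it is not computed from the actual distances.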

{1, 2} {3} {4} {5} {6}
{1, 2} {3, 6} {4} {5}
{1, 2} {3, 4, 6} {5}
{1, 2, 5} {3, 4, 6}
{1, 2, 3, 4, 5, 6}
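
For comparison, here is a minimal Python sketch of the same loop driven by the actual distances from the question rather than by arbitrary choices. The names `SEQS`, `hamming`, `avg_dist`, and `agglomerate` are my own, and ties are broken by whichever pair is found first, which is an assumption the homework does not specify.

```python
from itertools import combinations

# The six bitsequences from the question, keyed by their labels.
SEQS = {
    1: "10001011",
    2: "11010111",
    3: "00101010",
    4: "00011110",
    5: "10101110",
    6: "11100001",
}


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def avg_dist(c1, c2):
    """Average pairwise Hamming distance between two clusters of labels."""
    total = sum(hamming(SEQS[i], SEQS[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate():
    """Merge the closest pair until one cluster remains.

    Returns a list of (distance, merged_cluster) tuples, one per merge.
    Ties go to the first pair encountered (an arbitrary choice).
    """
    clusters = [(label,) for label in sorted(SEQS)]
    history = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        merged = tuple(sorted(c1 + c2))
        # Drop the two merged clusters and append their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        history.append((avg_dist(c1, c2), merged))
    return history


for dist, merged in agglomerate():
    print(f"{dist:.3f}  {merged}")
```

On this data the first merge is {3, 5} at distance 2, as the question found. At the next step {1} and {4} are both at average distance 3 from {3, 5}, so the intermediate cluster depends on the tie-break; with the first-found rule the sketch merges {1} first. The resulting order therefore differs from the illustrative listing above.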

The hierarchical clustering consists of all of the sets that ever existed in the algorithm. They can be visualized as a tree where, if X is a descendant of Y, then X is a subset of Y.

           {1,2,3,4,5,6}
           /           \
          /             \
         /               \
     {1,2,5}           {3,4,6}
     /     \           /     \
  {1,2}     \       {3,6}     \
  /   \      \      /   \      \
{1}   {2}    {5}  {3}   {6}    {4}
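
The subset property can be checked mechanically. This is a small Python sketch that encodes the illustrative tree above as a parent-to-children mapping and verifies it; the `CHILDREN` layout and the function names are hypothetical choices of mine.

```python
# Each internal node maps to its two children; leaves do not appear as keys.
CHILDREN = {
    frozenset({1, 2, 3, 4, 5, 6}): (frozenset({1, 2, 5}), frozenset({3, 4, 6})),
    frozenset({1, 2, 5}): (frozenset({1, 2}), frozenset({5})),
    frozenset({3, 4, 6}): (frozenset({3, 6}), frozenset({4})),
    frozenset({1, 2}): (frozenset({1}), frozenset({2})),
    frozenset({3, 6}): (frozenset({3}), frozenset({6})),
}


def descendants(node):
    """Yield every cluster strictly below `node` in the tree."""
    for child in CHILDREN.get(node, ()):
        yield child
        yield from descendants(child)


def is_valid_hierarchy():
    """True if each parent is the disjoint union of its children and
    every descendant is a subset of each of its ancestors."""
    for node, (left, right) in CHILDREN.items():
        if left | right != node or left & right:
            return False
    return all(d <= node for node in CHILDREN for d in descendants(node))


print(is_valid_hierarchy())
```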

The average distance is computed with the formula given; |C1| and |C2| are the number of sequences in clusters 1 and 2 respectively. The length of the sequences is relevant only in computing the Hamming distance for a single pair. The distance between clusters {1, 2} and {3, 4, 6}, for example, is (d(1,3)+d(1,4)+d(1,6)+d(2,3)+d(2,4)+d(2,6))/6.
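
As a worked check of that example, the following Python sketch evaluates the formula for {1, 2} and {3, 4, 6} using the sequences from the question. The function names are illustrative, not part of the assignment.

```python
# The six bitsequences from the question, keyed by their labels.
SEQS = {
    1: "10001011",
    2: "11010111",
    3: "00101010",
    4: "00011110",
    5: "10101110",
    6: "11100001",
}


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_distance(c1, c2):
    """Sum d(a, b) over all cross-cluster pairs, divided by |C1| * |C2|."""
    total = sum(hamming(SEQS[i], SEQS[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


# List each cross-cluster pair with its Hamming distance, then the average.
for i in (1, 2):
    for j in (3, 4, 6):
        print(f"d({i},{j}) = {hamming(SEQS[i], SEQS[j])}")
print(cluster_distance((1, 2), (3, 4, 6)))
```

The six pairwise distances are 3, 4, 4, 7, 4, 4, so the result is 26/6, about 4.33. The divisor is 2 × 3 = 6, the product of the cluster sizes; the 8-bit sequence length does not enter into it.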

Licensed under: CC-BY-SA with attribution
