Question

I want to perform cluster analysis on the following data (sample):

    ID     CODE1     CODE2     CODE3     CODE4      CODE5      CODE6
   ------------------------------------------------------------------
   00001     0         1         1         0          0          0
   00002     1         0         0         0          1          1
   00003     0         1         0         1          1          1
   00004     1         1         1         0          1          0
    ...

where 1 indicates the presence of that code for a person and 0 its absence. Is k-means or hierarchical clustering more appropriate for clustering the codes in this kind of data (about a million distinct IDs), and with which distance measure? If neither of these methods is appropriate, what do you think is most appropriate?

Thank you

Solution

No, k-means does not make a lot of sense for binary data.

k-means works by computing means. But what is the mean vector of binary data?

Your cluster "centers" will not be part of your data space, and will look nothing like your input data. A "center" that is totally different from your objects doesn't seem like a proper center to me.
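To make that concrete, here is a small sketch that takes the mean of the four sample rows from the question. The helper name `mean_vector` is made up for this illustration; it is just the coordinate-wise average that k-means would use as a centroid.

```python
# The four sample rows from the question (columns CODE1..CODE6).
rows = [
    [0, 1, 1, 0, 0, 0],  # 00001
    [1, 0, 0, 0, 1, 1],  # 00002
    [0, 1, 0, 1, 1, 1],  # 00003
    [1, 1, 1, 0, 1, 0],  # 00004
]


def mean_vector(vectors):
    """Coordinate-wise mean, i.e. what k-means uses as a cluster centroid."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


centroid = mean_vector(rows)

# True only if every coordinate is still a valid 0/1 value.
centroid_is_binary = all(value in (0, 1) for value in centroid)

print(centroid)            # [0.5, 0.75, 0.5, 0.25, 0.75, 0.5]
print(centroid_is_binary)  # False
```

Not a single coordinate of this "center" is 0 or 1, so it does not describe any person that could exist in the data.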

Most likely, your cluster "centers" will end up being more similar to each other than to the actual cluster members, because the means lie somewhere in the interior of the space while all your data sits in the corners.

Seriously, investigate similarity functions for your data type. Then choose a clustering algorithm that works with the function you picked. Hierarchical clustering is quite general, but really slow. You don't have to use a 40-year-old algorithm, though; you may want to look into more modern approaches.
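As one example of such a similarity function, the Jaccard distance is a common choice for presence/absence data, because it ignores codes that are absent from both vectors. The sketch below is only an illustration under that assumption, not a recommendation for this particular dataset; `jaccard_distance` and `distance_matrix` are names invented here.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the positions that are set to 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        # Convention chosen for this sketch: two all-zero vectors are identical.
        return 0.0
    return 1.0 - both / either


def distance_matrix(vectors):
    """Full pairwise distance matrix; only feasible for small inputs."""
    return [[jaccard_distance(u, v) for v in vectors] for u in vectors]


# The four sample rows from the question (columns CODE1..CODE6).
rows = [
    [0, 1, 1, 0, 0, 0],  # 00001
    [1, 0, 0, 0, 1, 1],  # 00002
    [0, 1, 0, 1, 1, 1],  # 00003
    [1, 1, 1, 0, 1, 0],  # 00004
]

# Distances between persons (rows).
person_distances = distance_matrix(rows)

# Distances between codes (columns): transpose first.
code_distances = distance_matrix([list(column) for column in zip(*rows)])

print(person_distances[0][3])  # 00001 vs 00004 -> 0.5
print(person_distances[0][1])  # 00001 vs 00002 -> 1.0 (no shared codes)
```

A matrix like this can be handed to any clustering algorithm that accepts precomputed distances. For the codes it is tiny. For a million IDs the full pairwise matrix would have on the order of 10^12 entries, which is part of why hierarchical clustering is so slow at that scale.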

Licensed under: CC-BY-SA with attribution