Question

K-means clustering tries to minimize the within-cluster scatter and maximize the distances between clusters. It does so on all attributes.

I am learning about this method on several datasets. To illustrate, in one of the datasets countries are compared based on attributes related to their Human Development Index. However, some of the attributes are completely unrelated to this dimension, for example the total population of each country. How should I deal with these attributes? As mentioned before, k-means tries to minimize the scatter based on all attributes, which would mean these additional attributes could hurt the clusters. To illustrate, I know that k-means cannot discern three clusters that are perfectly clustered along one dimension and completely scattered along the other.

Should one just exclude some attributes based on prior knowledge? Is there perhaps a process that identifies irrelevant attributes?

Solution

First of all, if you know that certain attributes shouldn't affect the clusters, you should remove them altogether. There is no point in hoping that K-Means will figure it out on its own if that can be fixed upstream.

Second, not every attribute has to affect the clusters equally. K-Means is based on the distances between your points: each point is assigned to its nearest cluster centroid, usually by Euclidean distance, so the clusters it finds depend on how that distance is calculated. The good thing is that you can tweak this. You could weight the attributes so that differences in certain attributes count more than differences in others, which in practice usually means rescaling the columns before clustering.
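To make the weighting idea concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available. The helper name `weighted_kmeans`, the synthetic data, and the weight values are all made up for illustration. It relies on the fact that multiplying a standardized column by sqrt(w) multiplies its contribution to the squared Euclidean distance by w.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def weighted_kmeans(X, weights, n_clusters, seed=0):
    """Hypothetical helper: standardize X, apply per-attribute weights, run K-Means."""
    X_std = StandardScaler().fit_transform(X)
    # Scaling column j by sqrt(w_j) scales its share of the squared
    # Euclidean distance by w_j.
    X_weighted = X_std * np.sqrt(np.asarray(weights, dtype=float))
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(X_weighted)


# Synthetic example (assumed data, not from the question):
# one attribute that separates two groups, one that is unrelated.
rng = np.random.default_rng(0)
true_groups = np.repeat([0, 1], 50)
development = np.where(true_groups == 0, 0.4, 0.9) + rng.normal(0, 0.02, 100)
population = rng.uniform(1e6, 1e9, 100)
X = np.column_stack([development, population])

# Give the unrelated attribute only 1% of the weight of the relevant one.
labels = weighted_kmeans(X, weights=[1.0, 0.01], n_clusters=2)
print(labels)
```

A weight of 0 is equivalent to dropping the attribute, so this is a softer version of the first suggestion. How to pick the weights is still a judgment call based on what you know about the data.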

Third, if you want to programmatically find the "best" attributes for clustering, I don't know of any efficient way to do it, so your best bet is to try different combinations of attributes and see how good the clustering becomes. To rate the quality of a clustering, there are metrics such as the Dunn index and the Davies-Bouldin index (see this link for more detailed information: https://medium.com/@ODSC/assessment-metrics-for-clustering-algorithms-4a902e00d92d).
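A brute-force version of that search could look like the sketch below, assuming scikit-learn, whose `davies_bouldin_score` returns lower values for tighter, better-separated clusters. The helper `score_attribute_subsets`, the attribute names, and the synthetic data are hypothetical. Keep in mind that scores computed on different attribute subsets are not strictly comparable (fewer dimensions tend to look better), so treat the ranking as a hint rather than a verdict.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler


def score_attribute_subsets(X, names, n_clusters, seed=0):
    """Hypothetical helper: run K-Means on every non-empty attribute subset
    and return (Davies-Bouldin score, attribute names) pairs, best first."""
    X_std = StandardScaler().fit_transform(X)
    results = []
    for size in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), size):
            subset = X_std[:, list(cols)]
            model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
            labels = model.fit_predict(subset)
            score = float(davies_bouldin_score(subset, labels))
            results.append((score, tuple(names[c] for c in cols)))
    return sorted(results)  # lower Davies-Bouldin score comes first


# Synthetic example (assumed data): two attributes follow three groups,
# the third is unrelated to them.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 40)
life_expectancy = np.array([55.0, 70.0, 82.0])[groups] + rng.normal(0, 1.5, 120)
schooling = np.array([5.0, 9.0, 13.0])[groups] + rng.normal(0, 0.5, 120)
population = rng.uniform(1e6, 1e9, 120)
X = np.column_stack([life_expectancy, schooling, population])
names = ["life_expectancy", "schooling", "population"]

results = score_attribute_subsets(X, names, n_clusters=3)
for score, subset in results:
    print(f"{score:.3f}  {subset}")
```

The number of subsets grows as 2^d - 1 for d attributes, so this is only practical for a small number of attributes, which is the "no efficient way" caveat in practice.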

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange