Should dimensionality reduction be done before k-means clustering if there are many features?

https://datascience.stackexchange.com/questions/74978

11-12-2020

Question

My data contains over 200 features and over 500 observations. I want to place the observations into a number of clusters based on the features that make them different.

I have several ideas and I'm not sure which one is most appropriate:

1) Conduct principal component analysis (PCA) to reduce the features to two dimensions. I've already done this so that I could visualize them on a 2D plot. It would now be quite easy to use k-means clustering with these two dimensions to create the clusters, but I wonder whether this is a bad idea because of all the components that are being lost. Then again, if they're being lost, perhaps they're not that important? I'm not sure.

2) Conduct principal component analysis (PCA) to determine which features are worth including and then conduct k-means clustering on those features. So I probably wouldn't be reducing the dimensions to two, but they would be reduced and then the k-means clustering would be done. This seems like the best idea intuitively to me, but I'm not sure.

3) Forget the PCA and just conduct k-means clustering on all the features I have at the beginning. This feels like it's probably the worst idea because some of the features could be useless but could still be factored into the distance calculations for the clustering, but I'm just including everything I've thought of.

Solution

For the first idea, you cannot simply pick two components. Look at the variance explained by each principal component and choose the number of components based on that. If, for example, the first two components explain most of the variance (e.g., more than 95%), then you can use them for k-means clustering. In that case you can expect, though it is not guaranteed in every case, to get the same results as when you run k-means on all features.
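A minimal sketch of this selection step, assuming scikit-learn and NumPy are available. The data is a synthetic stand-in generated with `make_blobs`, and the standardization step, the 95% threshold, and the choice of 4 clusters are illustrative assumptions rather than anything taken from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 500 observations, 200 features.
X, _ = make_blobs(n_samples=500, n_features=200, centers=4, random_state=0)

# PCA is sensitive to feature scale, so standardize first (an assumption here).
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance
# reaches the threshold (95% is only an example value).
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold)) + 1
n_components = min(n_components, len(cumulative))
print(f"{n_components} components reach {threshold:.0%} explained variance")

# Cluster on the retained components only (k=4 is an assumption).
X_reduced = pca.transform(X_scaled)[:, :n_components]
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
```

If `n_components` turns out to be much larger than 2, a 2D projection is useful for plotting but probably discards too much information to cluster on.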

If you would need to keep a large number of components, my suggestion is to use all of your features instead. Your dataset is small, so running k-means on the full feature set is not computationally demanding.

As a side note, because your dataset is small, you can also try all of your options and compare the results to find out what's going on in your data.
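One way to compare the three options is to cluster each version of the data and measure how much the resulting labelings agree. The sketch below assumes scikit-learn and uses the adjusted Rand index for the comparison. The synthetic data, the helper name `cluster`, the option names, and the choice of 4 clusters are all hypothetical.

```python
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 500 observations, 200 features.
X, _ = make_blobs(n_samples=500, n_features=200, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)


def cluster(data, k=4):
    """Run k-means with a fixed seed and return the cluster labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)


# One labeling per option from the question.
labelings = {
    "pca_2d": cluster(PCA(n_components=2).fit_transform(X_scaled)),
    "pca_95pct": cluster(
        PCA(n_components=0.95, svd_solver="full").fit_transform(X_scaled)
    ),
    "all_features": cluster(X_scaled),
}

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean
# agreement no better than chance. It ignores how the clusters are numbered.
agreement = {
    (a, b): adjusted_rand_score(labelings[a], labelings[b])
    for a, b in combinations(labelings, 2)
}
for pair, score in agreement.items():
    print(pair, round(score, 3))
```

High agreement between the reduced and full-feature clusterings suggests the dropped components carried little cluster structure. Low agreement is a sign that the reduction changed the result and deserves a closer look.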

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange