Question

I'm a student in Data Analysis, working on a data clustering exercise.

Two clusters have been identified based on a dataset with 40 features. To interpret and label these clusters, I'm wondering if there is a way to determine which features are most influential in the clustering output. For instance, I imagine I could remove one feature from the clustering and see how much it affects the output. However, there are probably smarter ways.

I would greatly appreciate if someone could point me in the right direction.

Thanks!


Solution

A similar post appears on Cross-Validated, "Estimating the most important features in a k-means cluster partition".

Quoting from that post:

One way to quantify the usefulness of each feature (= variable = dimension) comes from Burns, Robert P., and Richard Burns, Business Research Methods and Statistics Using SPSS, Sage, 2008; here, usefulness is defined as a feature's discriminative power to tell clusters apart.

We usually examine the means for each cluster on each dimension using ANOVA to assess how distinct our clusters are. Ideally, we would obtain significantly different means for most, if not all, dimensions used in the analysis. The magnitude of the F values computed on each dimension is an indication of how well the respective dimension discriminates between clusters.
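To illustrate the idea, here is a minimal sketch (assuming SciPy is available, on synthetic data rather than your dataset): a one-way ANOVA is run per feature, grouping the rows by cluster label, and features are ranked by their F statistic.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Toy data: 3 features, 2 clusters. Feature 0 separates the
# clusters; features 1 and 2 are pure noise.
labels = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3))
X[labels == 1, 0] += 5.0  # shift feature 0 for cluster 1

# One-way ANOVA per feature: the F statistic compares
# between-cluster variance to within-cluster variance.
f_values = np.array([
    f_oneway(*(X[labels == k, j] for k in np.unique(labels))).statistic
    for j in range(X.shape[1])
])

ranking = np.argsort(f_values)[::-1]  # most discriminative first
print(ranking)  # feature 0 should rank first
```

In practice `labels` would be the output of your clustering algorithm; a large F on a feature means the cluster means differ strongly on that dimension relative to the within-cluster spread.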

Another way would be to remove a specific feature and see how this impacts internal quality indices (e.g., the silhouette coefficient). Unlike the first solution, you would have to redo the clustering for each feature (or set of features) you want to analyze.
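A minimal sketch of this leave-one-feature-out approach, assuming scikit-learn is available and using k-means with the silhouette coefficient as the internal quality index (again on synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy data: feature 0 drives the cluster structure, 1 and 2 are noise.
X = rng.normal(size=(100, 3))
X[50:, 0] += 5.0

def silhouette_without(X, drop, k=2, seed=0):
    """Re-cluster after dropping one feature; return the silhouette."""
    X_sub = np.delete(X, drop, axis=1)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_sub)
    return silhouette_score(X_sub, labels)

baseline_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
baseline = silhouette_score(X, baseline_labels)

# Dropping the informative feature should hurt the index the most.
drops = {j: silhouette_without(X, j) for j in range(X.shape[1])}
print(baseline, drops)
```

The feature whose removal degrades the index the most is the one contributing most to the cluster structure; note that each evaluation re-runs the clustering from scratch.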

FYI: there is also a paper on this topic, "Feature Selection in Clustering Problems", whose abstract reads:

A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.
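The wrapper idea from the paper can be caricatured with a much simpler greedy forward selection (this is not the paper's algorithm, which uses a dedicated optimization with resampling-based stability analysis; it is just a toy sketch of the wrapper principle, assuming scikit-learn is available): repeatedly add the feature that most improves the quality of the resulting partition, and stop when no feature helps.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy data: features 0 and 1 carry cluster structure, 2 and 3 are noise.
X = rng.normal(size=(100, 4))
X[50:, 0] += 5.0
X[50:, 1] += 4.0

def score(X, cols, k=2):
    """Cluster on a subset of features; return the silhouette."""
    sub = X[:, cols]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sub)
    return silhouette_score(sub, labels)

# Greedy forward selection: the wrapper evaluates feature subsets
# directly through the partitioning algorithm they induce.
selected, remaining = [], list(range(X.shape[1]))
while remaining:
    best = max(remaining, key=lambda j: score(X, selected + [j]))
    if selected and score(X, selected + [best]) <= score(X, selected):
        break  # no candidate improves the partition quality
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # noise features 2 and 3 should be excluded
```

This is exponentially cheaper than exhaustive subset search but only finds a local optimum, which is one reason the paper invests in a proper optimization algorithm and stability-based model selection.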

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange