Question

I'm a student in Data Analysis, working on a data clustering exercise.

Two clusters have been identified based on a dataset with 40 features. To interpret and label these clusters, I'm wondering if there is a way to determine which features are most influential in the clustering output. For instance, I imagine I could remove one feature from the clustering and see how much it affects the output. However, there are probably smarter ways.

I would greatly appreciate if someone could point me in the right direction.

Thanks!


Solution

A similar post appears on Cross-Validated, "Estimating the most important features in a k-means cluster partition".

Quoting from that post:

One way to quantify the usefulness of each feature (= variable = dimension) comes from Burns, Robert P., and Richard Burns, Business Research Methods and Statistics Using SPSS, Sage, 2008; here, usefulness is defined as a feature's discriminative power to tell clusters apart.

We usually examine the means for each cluster on each dimension using ANOVA to assess how distinct our clusters are. Ideally, we would obtain significantly different means for most, if not all, dimensions used in the analysis. The magnitude of the F values computed on each dimension is an indication of how well the respective dimension discriminates between clusters.
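To illustrate the idea, here is a minimal sketch (assuming SciPy is available, on synthetic data rather than your dataset): a one-way ANOVA is run per feature, grouping the rows by cluster label, and features are ranked by their F statistic.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Toy data: 3 features, 2 clusters. Feature 0 separates the
# clusters; features 1 and 2 are pure noise.
labels = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3))
X[labels == 1, 0] += 5.0  # shift feature 0 for cluster 1

# One-way ANOVA per feature: the F statistic compares
# between-cluster variance to within-cluster variance.
f_values = np.array([
    f_oneway(*(X[labels == k, j] for k in np.unique(labels))).statistic
    for j in range(X.shape[1])
])

ranking = np.argsort(f_values)[::-1]  # most discriminative first
print(ranking)  # feature 0 should rank first
```

In practice `labels` would be the output of your clustering algorithm; a large F on a feature means the cluster means differ strongly on that dimension relative to the within-cluster spread.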

Another way would be to remove a specific feature and see how this impacts internal quality indices (e.g., the silhouette coefficient). Unlike the first solution, you would have to redo the clustering for each feature (or set of features) you want to analyze.
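A minimal sketch of this leave-one-feature-out approach, assuming scikit-learn is available and using k-means with the silhouette coefficient as the internal quality index (again on synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy data: feature 0 drives the cluster structure, 1 and 2 are noise.
X = rng.normal(size=(100, 3))
X[50:, 0] += 5.0

def silhouette_without(X, drop, k=2, seed=0):
    """Re-cluster after dropping one feature; return the silhouette."""
    X_sub = np.delete(X, drop, axis=1)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_sub)
    return silhouette_score(X_sub, labels)

baseline_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
baseline = silhouette_score(X, baseline_labels)

# Dropping the informative feature should hurt the index the most.
drops = {j: silhouette_without(X, j) for j in range(X.shape[1])}
print(baseline, drops)
```

The feature whose removal degrades the index the most is the one contributing most to the cluster structure; note that each evaluation re-runs the clustering from scratch.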

FYI: there is also a paper on this topic, "Feature Selection in Clustering Problems", whose abstract reads:

A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.
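The wrapper idea from the paper can be caricatured with a much simpler greedy forward selection (this is not the paper's algorithm, which uses a dedicated optimization with resampling-based stability analysis; it is just a toy sketch of the wrapper principle, assuming scikit-learn is available): repeatedly add the feature that most improves the quality of the resulting partition, and stop when no feature helps.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy data: features 0 and 1 carry cluster structure, 2 and 3 are noise.
X = rng.normal(size=(100, 4))
X[50:, 0] += 5.0
X[50:, 1] += 4.0

def score(X, cols, k=2):
    """Cluster on a subset of features; return the silhouette."""
    sub = X[:, cols]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sub)
    return silhouette_score(sub, labels)

# Greedy forward selection: the wrapper evaluates feature subsets
# directly through the partitioning algorithm they induce.
selected, remaining = [], list(range(X.shape[1]))
while remaining:
    best = max(remaining, key=lambda j: score(X, selected + [j]))
    if selected and score(X, selected + [best]) <= score(X, selected):
        break  # no candidate improves the partition quality
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # noise features 2 and 3 should be excluded
```

This is exponentially cheaper than exhaustive subset search but only finds a local optimum, which is one reason the paper invests in a proper optimization algorithm and stability-based model selection.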

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange