Clustering Variables

https://stackoverflow.com/questions/21431678

04-10-2022

Question

What are some proven methods for finding groups of highly correlated variables in a large, high-dimensional binary dataset (think 200,000+ rows and 150+ fields) that can be easily implemented in R? I want groupings of variables that lend themselves to interpretation, so I don't think PCA would be the best method.

Answer

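    library(Hmisc)

    # Use columns 2 to 8 of mtcars (cyl through vs) as a numeric matrix
    mtc <- mtcars[, 2:8]
    mtcn <- data.matrix(mtc)

    # Hierarchical clustering of the variables (columns), not the rows
    clust <- varclus(mtcn)
    clust        # print the result
    plot(clust)  # dendrogram of the variables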

?varclus: Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
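To get explicit groups of variables rather than only a dendrogram, the tree can be cut into a chosen number of clusters. The following is a minimal sketch in base R that mirrors the squared Pearson option described above: it clusters variables on the dissimilarity 1 - r^2. The simulated binary data, the column names `a1` to `b3`, and the choice `k = 2` are assumptions made for illustration and are not part of the original answer.

```r
set.seed(1)
n <- 2000

# Illustrative data: two latent factors, each driving three binary variables.
z1 <- rnorm(n)
z2 <- rnorm(n)
x <- cbind(
  a1 = z1 + rnorm(n, sd = 0.5) > 0,
  a2 = z1 + rnorm(n, sd = 0.5) > 0,
  a3 = z1 + rnorm(n, sd = 0.5) > 0,
  b1 = z2 + rnorm(n, sd = 0.5) > 0,
  b2 = z2 + rnorm(n, sd = 0.5) > 0,
  b3 = z2 + rnorm(n, sd = 0.5) > 0
) * 1L  # convert logical to 0/1

# Similarity between variables: squared Pearson correlation.
sim <- cor(x)^2

# Hierarchical clustering of the variables on 1 - similarity.
hc <- hclust(as.dist(1 - sim), method = "complete")

# Cut the tree into k groups; each variable gets a cluster id.
groups <- cutree(hc, k = 2)

# List the variable names in each group.
split(names(groups), groups)
```

With a `varclus` fit, the same cut should be possible on its `hclust` component, for example `cutree(clust$hclust, k = 3)`, where the value of `k` is again an assumption to tune against the dendrogram.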

For binary variables:

    library(cluster)
    data(animals)
    ma <- mona(animals)
    ma

    plot(ma)

?mona: Returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
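Note that `mona` splits the observations, using one binary variable per split, so the interpretable part of the result is the sequence of variables chosen. A short sketch of how the returned list might be inspected, using the same `animals` data as above; the component names `variable`, `step`, and `order` are taken from the `mona` help page.

```r
library(cluster)
data(animals)

ma <- mona(animals)

# Variable used to separate each pair of neighbouring observations in the banner.
ma$variable

# Step at which each separation happened (lower step = earlier, coarser split).
ma$step

# Observations in the order they appear along the banner plot.
ma$order
```

Variables that appear at the earliest steps are the ones that divide the data most coarsely, which can serve as a starting point for interpretation.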

License: CC-BY-SA with attribution

Not affiliated with StackOverflow