What are some proven methods for finding groups of highly correlated variables in a large, high-dimensional binary dataset (think 200,000+ rows and 150+ fields) that can be easily implemented in R? I want groupings of variables that lend themselves to interpretation, so I don't think PCA would be the best method.


Solution

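    library(Hmisc)

    # Columns 2-8 of the built-in mtcars data, converted to a numeric matrix
    mtc <- mtcars[, 2:8]
    mtcn <- data.matrix(mtc)

    # Hierarchical clustering of the variables (columns), not the rows
    clust <- varclus(mtcn)
    clust
    plot(clust)  # dendrogram of the variables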

?varclus: Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.

For binary variables:

    library(cluster)

    # animals: 20 animals described by 6 binary attributes (ships with cluster)
    data(animals)
    ma <- mona(animals)
    ma

    plot(ma)

?mona: Returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
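
If Hmisc is not available, the same kind of variable grouping can be sketched in base R. This is not what varclus or mona do internally, just a rough stand-in: it uses the squared Pearson correlation (which on 0/1 data is the phi coefficient) as the similarity and feeds 1 minus that into hclust. The data are simulated, and the names `f1`, `f2`, `flip`, and the choice `k = 2` are made up for the illustration.

```r
# Simulated binary data: two hidden factors each drive three noisy variables
set.seed(1)
n  <- 2000
f1 <- rbinom(n, 1, 0.5)
f2 <- rbinom(n, 1, 0.5)

# flip(): copy a 0/1 vector, flipping each value with probability p (noise)
flip <- function(x, p = 0.1) ifelse(runif(length(x)) < p, 1 - x, x)

x <- cbind(a1 = flip(f1), a2 = flip(f1), a3 = flip(f1),
           b1 = flip(f2), b2 = flip(f2), b3 = flip(f2))

r  <- cor(x)                          # Pearson on 0/1 columns = phi coefficient
d  <- as.dist(1 - r^2)                # dissimilarity between variables
hc <- hclust(d, method = "complete")  # cluster the variables, not the rows

# k = 2 is known only because the data were simulated with two factors
groups <- cutree(hc, k = 2)
split(names(groups), groups)
```

With 150 fields the correlation matrix is only 150 x 150, so the clustering step itself stays cheap even when there are 200,000+ rows; calling `plot(hc)` draws the dendrogram, which can help in choosing the number of groups on real data.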
