Question

What are some proven methods for finding groups of highly correlated variables within a large, high-dimensional binary dataset (think 200,000+ rows and 150+ fields) that can be easily implemented in R? I want to find groupings of variables that lend themselves to interpretation, so I don't think PCA would be the best method.

Solution

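    library(Hmisc)
    # Example data: a few numeric columns from mtcars
    mtc <- mtcars[, 2:8]
    mtcn <- data.matrix(mtc)
    # Hierarchical clustering of the variables (columns), not the rows
    clust <- varclus(mtcn)
    clust
    # Dendrogram of the variable clusters
    plot(clust)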

?varclus: Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
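To make the idea concrete for binary data, here is a minimal sketch in base R of clustering variables on squared Pearson correlation, one of the similarity measures named above. It is not `varclus` itself. The simulated data, the helper `flip`, the column names, and the cut height of 0.9 are all assumptions chosen for illustration.

```r
# Simulated binary data (an assumption for illustration): two latent factors
# each drive three observed 0/1 variables, plus two pure-noise variables.
set.seed(1)
n  <- 2000
f1 <- rbinom(n, 1, 0.5)
f2 <- rbinom(n, 1, 0.5)

# Hypothetical helper: copy a 0/1 vector, flipping each entry with probability p.
flip <- function(x, p = 0.1) ifelse(runif(length(x)) < p, 1 - x, x)

x <- cbind(a1 = flip(f1), a2 = flip(f1), a3 = flip(f1),
           b1 = flip(f2), b2 = flip(f2), b3 = flip(f2),
           n1 = rbinom(n, 1, 0.5), n2 = rbinom(n, 1, 0.5))

# For 0/1 columns, the Pearson correlation is the phi coefficient.
r <- cor(x)

# Use squared correlation as similarity, so 1 - r^2 acts as a distance.
hc <- hclust(as.dist(1 - r^2), method = "complete")

# Cut the tree at an arbitrary height to obtain variable groups.
groups <- cutree(hc, h = 0.9)
split(names(groups), groups)
```

Only the variable-by-variable correlation matrix is clustered, so with 150 fields the tree is built on a 150 x 150 matrix regardless of how many rows the data has.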

For binary variables:

    library(cluster)
    data(animals)
    ma <- mona(animals)
    ma

    plot(ma)

?mona: Returns a list representing a divisive hierarchical clustering of a dataset with binary variables only.
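Note that `mona` clusters the observations (rows), splitting on one binary variable at a time. As a rough sketch of how to see which variables drive those splits, the `variable` and `step` components of the returned object can be inspected. The data frame name `splits` is just an illustrative choice.

```r
library(cluster)
data(animals)

ma <- mona(animals)

# One entry per gap between neighbouring rows in the final ordering:
# the variable that separates them and the step at which the split happened.
splits <- data.frame(variable = ma$variable, step = ma$step)
head(splits)

# How often each binary variable was used to split the observations.
table(ma$variable)
```

Variables that are used early (low `step`) separate the largest groups of rows, which can help with interpretation.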

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow