Question

I'm using R's built-in correlation matrix and hierarchical clustering methods to segment daily sales data into 10 clusters. Then, I'd like to create agglomerated daily sales data by cluster. I've got as far as creating a cutree() object, but am stumped on extracting only the column names in the cutree object where the cluster number is 1, for example.

For simplicity's sake, I'll use the EuStockMarkets data set and cut the tree into 2 segments; bear in mind that I'm working with thousands of columns here, so the solution needs to be scalable:

data=as.data.frame(EuStockMarkets)

corrMatrix<-cor(data)
dissimilarity<-round(((1-corrMatrix)/2), 3)
distSimilarity<-as.dist(dissimilarity)
hirearchicalCluster<-hclust(distSimilarity)
treecuts<-cutree(hirearchicalCluster, k=2)

Now I get stuck. I want to extract only the column names from treecuts where the cluster number is equal to 1, for example. But the object that cutree() returns is a named integer vector, not a data frame, which makes subsetting difficult. I've tried to convert treecuts into a data frame, but R does not create a column for the row names; all it does is coerce the numbers into a single column named treecuts.

I would want to do the following operations:

....Code that converts treecuts into a data frame called "treeIDs" with the 
columns "Index" and "Cluster"......

cluster1Columns<-colnames(treeIDs[Cluster==1, ])
cluster1DF<-data[ , (colnames(data) %in% cluster1Columns)]
rowSums(cluster1DF)

...and voila, I'm done.

Thoughts/suggestions?


Solution

Here is the solution:

names(treecuts[treecuts == 1])
[1] "DAX"  "SMI"  "FTSE"
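
If you need the names for every cluster rather than one at a time, a possible approach is base R's split(), which groups the names by cluster label in a single call. This is only a sketch that rebuilds treecuts from the question; columnsByCluster is a name introduced here for illustration.

```r
# Rebuild treecuts exactly as in the question
data <- as.data.frame(EuStockMarkets)
corrMatrix <- cor(data)
distSimilarity <- as.dist(round((1 - corrMatrix) / 2, 3))
treecuts <- cutree(hclust(distSimilarity), k = 2)

# One character vector of column names per cluster, keyed by cluster number
# (columnsByCluster is an illustrative name, not part of the question)
columnsByCluster <- split(names(treecuts), treecuts)

# Column names belonging to cluster 1
columnsByCluster[["1"]]
```

The list is keyed by the cluster number as a string, so it scales to any k without hard-coding indices.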

If you also want, say, cluster 2 (or several clusters at once), you can use %in%:

names(treecuts[treecuts %in% c(1, 2)])

[1] "DAX"  "SMI"  "CAC"  "FTSE"
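
To get the treeIDs data frame described in the question, one option is to build it directly from the names and values of treecuts. The sketch below assumes the same setup as the question. Note that drop = FALSE is added so that a cluster with a single column still comes back as a data frame, and cluster1Total is a name introduced here for illustration.

```r
# Rebuild treecuts exactly as in the question
data <- as.data.frame(EuStockMarkets)
corrMatrix <- cor(data)
distSimilarity <- as.dist(round((1 - corrMatrix) / 2, 3))
treecuts <- cutree(hclust(distSimilarity), k = 2)

# Two-column lookup table: column name and its cluster number
treeIDs <- data.frame(Index = names(treecuts),
                      Cluster = unname(treecuts),
                      stringsAsFactors = FALSE)

# Names of the columns assigned to cluster 1
cluster1Columns <- treeIDs$Index[treeIDs$Cluster == 1]

# Keep only those columns; drop = FALSE guards against one-column clusters
cluster1DF <- data[, cluster1Columns, drop = FALSE]

# Aggregated daily value for cluster 1 (illustrative name)
cluster1Total <- rowSums(cluster1DF)
```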

Other tips

Why not just

data$clusterID <- treecuts

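then subset data as usual? Note that treecuts holds one label per column of data, not per row, so the assignment only lines up if the data is transposed first.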
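
One base R way to get the totals for every cluster at once is rowsum() on the transposed data, so that each series becomes a row carrying a cluster label. This is a sketch under the question's setup, not necessarily what the answer above had in mind; clusterTotals is a name introduced here for illustration.

```r
# Rebuild treecuts exactly as in the question
data <- as.data.frame(EuStockMarkets)
corrMatrix <- cor(data)
distSimilarity <- as.dist(round((1 - corrMatrix) / 2, 3))
treecuts <- cutree(hclust(distSimilarity), k = 2)

# Transpose so each series is a row, sum rows within each cluster,
# then transpose back: one row per day, one column per cluster
clusterTotals <- t(rowsum(t(as.matrix(data)), group = treecuts))

head(clusterTotals)
```

The result has one column per cluster number, so the same line works unchanged for k = 10 and thousands of input columns.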

Licensed under: CC-BY-SA with attribution