Question

I'm using R's built-in correlation matrix and hierarchical clustering methods to segment daily sales data into 10 clusters. Then, I'd like to create agglomerated daily sales data by cluster. I've got as far as creating a cutree() object, but am stumped on extracting only the column names in the cutree object where the cluster number is 1, for example.

For simplicity's sake, I'll use the EuStockMarkets data set and cut the tree into 2 segments; bear in mind that I'm working with thousands of columns here, so the solution needs to be scalable:

data=as.data.frame(EuStockMarkets)

corrMatrix<-cor(data)
dissimilarity<-round(((1-corrMatrix)/2), 3)
distSimilarity<-as.dist(dissimilarity)
hirearchicalCluster<-hclust(distSimilarity)
treecuts<-cutree(hirearchicalCluster, k=2)

Now I get stuck. I want to extract only the column names from treecuts where the cluster number is equal to 1, for example. But the object that cutree() returns is not a data frame, which makes subsetting difficult. I've tried to convert treecuts into a data frame, but R does not create a column for the row names; all it does is coerce the numbers into a row with the name treecuts.

I would want to do the following operations:

# ...code that converts treecuts into a data frame called "treeIDs" with the columns "Index" and "Cluster"...

cluster1Columns<-treeIDs$Index[treeIDs$Cluster==1]
cluster1DF<-data[ , (colnames(data) %in% cluster1Columns)]
rowSums(cluster1DF)

...and voila, I'm done.

Thoughts/suggestions?


Solution

Here is the solution:

names(treecuts[which(treecuts==1)])
[1] "DAX"  "SMI"  "FTSE"

If you also want, say, cluster 2 (or higher), you can use %in%:

names(treecuts[which(treecuts %in% c(1,2))])

[1] "DAX"  "SMI"  "CAC"  "FTSE"
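
The same lookup can also be arranged the way the question sketches it, as a data frame with "Index" and "Cluster" columns. This is only a minimal sketch, not the answerer's code; it assumes treecuts is the named integer vector that cutree() returns (names are the column names, values are the cluster numbers), and cluster1Sales is a name introduced here for illustration.

```r
# Rebuild the objects from the question
data <- as.data.frame(EuStockMarkets)
corrMatrix <- cor(data)
dissimilarity <- round((1 - corrMatrix) / 2, 3)
treecuts <- cutree(hclust(as.dist(dissimilarity)), k = 2)

# names(treecuts) holds the column names; the values are the cluster numbers
treeIDs <- data.frame(Index = names(treecuts),
                      Cluster = unname(treecuts),
                      stringsAsFactors = FALSE)

# Column names that belong to cluster 1
cluster1Columns <- treeIDs$Index[treeIDs$Cluster == 1]

# drop = FALSE keeps a data frame even if the cluster has a single column
cluster1DF <- data[, cluster1Columns, drop = FALSE]

# Aggregated daily series for cluster 1 (hypothetical name)
cluster1Sales <- rowSums(cluster1DF)
```

Because no positions are hard-coded, the same lines should carry over to thousands of columns and k=10.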

OTHER TIPS

Why not just

data$clusterID <- treecuts

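then subset data as usual? (This fits when the clustered items are the rows of data. In the question, treecuts has one entry per column, so it does not line up with the rows.)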
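
Since treecuts in the question has one entry per column of data, a column-wise take on "group, then subset as usual" might look like the sketch below. It is an illustration under that assumption, not something from the original answers; columnsByCluster and clusterSums are names made up here.

```r
# Rebuild the objects from the question
data <- as.data.frame(EuStockMarkets)
corrMatrix <- cor(data)
dissimilarity <- round((1 - corrMatrix) / 2, 3)
treecuts <- cutree(hclust(as.dist(dissimilarity)), k = 2)

# Named list with one character vector of column names per cluster
columnsByCluster <- split(names(treecuts), treecuts)

# One aggregated series per cluster: rows are days, columns are clusters.
# drop = FALSE keeps single-column clusters as data frames for rowSums().
clusterSums <- sapply(columnsByCluster,
                      function(cols) rowSums(data[, cols, drop = FALSE]))
```

columnsByCluster[["1"]] then gives the column names for cluster 1, and clusterSums[, "1"] the summed series for that cluster, with no per-cluster code to repeat when k grows.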

Licensed under: CC-BY-SA with attribution