Question

I'm trying to follow a document that has some code on text mining clustering analysis. I'm fairly new to R and to the concepts of text mining and clustering, so please bear with me if I sound illiterate.

I create a simple matrix called dtm0.75 and then run kmeans to produce 3 clusters. The code I'm having issues with is where a function has been defined to get the "five most common words of the documents in the cluster":

dtm0.75 = as.matrix(dt0.75)
dim(dtm0.75)

kmeans.result = kmeans(dtm0.75, 3)

perClusterCounts = function(df, clusters, n)
{
  v = sort(colSums(df[clusters == n, ]), 
           decreasing = TRUE)
  d = data.frame(word = names(v), freq = v)
  d[1:5, ]
}
perClusterCounts(dtm0.75, kmeans.result$cluster, 1)

Upon running this code I get the following error:

Error in colSums(df[clusters == n, ]) : 'x' must be an array of at least two dimensions

Could someone help me fix this please?

Thank you.


Solution

I can't reproduce your error; it works fine for me. Update your question with a reproducible example and you might get a more useful answer. Perhaps your input data object is empty. What do you get with dim(dtm0.75)?

Here it is working fine on the data that comes with the tm package:

library(tm)
data(crude)

dt0.75 <- DocumentTermMatrix(crude)

dtm0.75 = as.matrix(dt0.75)
dim(dtm0.75)

kmeans.result = kmeans(dtm0.75, 3)

perClusterCounts = function(df, clusters, n)
{
  v = sort(colSums(df[clusters == n, ]), 
           decreasing = TRUE)
  d = data.frame(word = names(v), freq = v)
  d[1:5, ]
}
perClusterCounts(dtm0.75, kmeans.result$cluster, 1)

                 word freq
the               the   69
and               and   25
for               for   12
government government   11
oil               oil   10
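
One possible cause worth ruling out (an assumption, since the question's data is not available): if kmeans happens to put exactly one document in a cluster, df[clusters == n, ] selects a single row, and R drops that result to a plain vector. colSums then fails with the "must be an array of at least two dimensions" message. Because kmeans starts from random centers, this can show up on one run and not on another. The sketch below uses a small made-up matrix and hand-written cluster labels to show the effect, and a variant of perClusterCounts that subsets with drop = FALSE. The top argument and the use of head() are additions for illustration, not part of the original function.

```r
# Toy document-term matrix (hypothetical data): 5 documents, 3 terms
m <- matrix(c(5, 0, 1,
              4, 1, 0,
              0, 6, 2,
              1, 5, 0,
              0, 0, 9),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("oil", "price", "market")))

# Hand-written cluster labels so the example is deterministic;
# cluster 3 contains a single document (doc5)
clusters <- c(1, 1, 2, 2, 3)

# A one-row subset collapses to a vector, which has no dimensions
is.null(dim(m[clusters == 3, ]))        # TRUE

# With drop = FALSE the subset stays a matrix (1 row, 3 columns)
dim(m[clusters == 3, , drop = FALSE])

# Variant of the function that keeps the matrix structure
perClusterCounts <- function(df, clusters, n, top = 5) {
  rows <- df[clusters == n, , drop = FALSE]   # never drop to a vector
  v <- sort(colSums(rows), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = v)
  head(d, top)   # head() also avoids NA rows when there are fewer than `top` terms
}

perClusterCounts(m, clusters, 3)
```

To check whether this applies to your data, look at kmeans.result$size or table(kmeans.result$cluster). A cluster of size 1 would be consistent with the error.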
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow