Question

I have a list of 1000+ keywords that I would like to group together by similarity.

For example:

  • "patio furniture"
  • "living room furniture"
  • "used chairs"
  • "new chairs"

I'd like the "furniture" terms to cluster together and the "chairs" terms to cluster together.

One way I know of is to pre-select some "centroid" terms, compute the Levenshtein distance from each keyword to each centroid, and then cluster the keywords with kmeans.

What I'd like to find out is how I could do this without pre-specifying centroid terms like "chairs" and "furniture".

Thanks.


Solution

You could use the stringdist package to calculate the distance matrix:

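# named "keywords" rather than "str" to avoid shadowing base::str()
keywords <- c("patio furniture",
  "living room furniture",
  "used chairs",
  "new chairs")

library(stringdist)
# full pairwise distance matrix between all keywords
d <- stringdistmatrix(keywords, keywords)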

stringdist supports a number of distance functions. The default (method = "osa") is the restricted Damerau-Levenshtein distance, also known as optimal string alignment. You can then pass this distance matrix to hclust to perform hierarchical clustering:

cl <- hclust(as.dist(d))
plot(cl)
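Since stringdist supports several metrics, it can be worth comparing a q-gram based distance with the edit-distance default. The following is a minimal sketch and not part of the original answer: the choice of "jaccard" with q = 2 and "jw" with p = 0.1 is an assumption made for illustration, and the variable names are hypothetical.

```r
library(stringdist)

keywords <- c("patio furniture", "living room furniture",
              "used chairs", "new chairs")

# Jaccard distance on character bigrams (q = 2): compares the sets of
# two-character substrings instead of counting edits.
# With a single argument, stringdistmatrix() returns a "dist" object.
d_jac <- stringdistmatrix(keywords, method = "jaccard", q = 2,
                          useNames = "strings")

# Jaro-Winkler distance; p is the prefix weight (illustrative value).
d_jw <- stringdistmatrix(keywords, method = "jw", p = 0.1,
                         useNames = "strings")

# Cluster with each metric and compare the two-group assignments.
groups_jac <- cutree(hclust(d_jac), k = 2)
groups_jw  <- cutree(hclust(d_jw), k = 2)
groups_jac
groups_jw
```

With bigram Jaccard, the two furniture terms share bigrams only with each other, and the same holds for the two chairs terms, so a two-group cut separates them.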

hclust supports several linkage methods (the default is "complete"); see ?hclust. To cut the tree into a fixed number of groups (here 2):

cutree(cl, 2)
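To see how the linkage method and the cut affect the grouping, here is a minimal base-R sketch. It is an assumption-laden stand-in rather than the answer's exact code: it uses utils::adist() (generalized Levenshtein distance) in place of stringdist so that it runs without extra packages, and the height threshold of 10 is an illustrative value, not a recommendation.

```r
keywords <- c("patio furniture", "living room furniture",
              "used chairs", "new chairs")

# Base-R stand-in for stringdist: adist() returns a matrix of
# generalized Levenshtein distances between all pairs of strings.
d <- as.dist(adist(keywords))

# Compare two-group assignments under several linkage methods.
linkages <- c("complete", "average", "single")
by_linkage <- sapply(linkages,
                     function(m) cutree(hclust(d, method = m), k = 2))
rownames(by_linkage) <- keywords
by_linkage

# Cut by height instead of fixing the number of groups in advance.
cl <- hclust(d)
groups <- cutree(cl, h = 10)  # illustrative threshold (assumption)
split(keywords, groups)
```

Cutting by height (h) rather than by count (k) avoids having to know the number of groups up front, which is closer to what the question asks for, but the threshold then has to be chosen by inspecting the dendrogram.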

This is probably just one of many possible solutions.

OTHER TIPS

Basically, it could work like this, using hierarchical clustering on a document-term matrix:

library(tm)
library(arules) # or other package with (dis)similarity measures... 
docs <- c("patio furniture", "living room furniture", "used chairs", "new chairs")
dtm <- as.matrix(DocumentTermMatrix(Corpus(VectorSource(docs))))
# compare and choose a measure, e.g. Jaccard vs. Dice distance
plot(hc <- hclust(dist(dtm, method="binary")), main="Jaccard Dist")
plot(hc <- hclust(dissimilarity(dtm, method="Dice")), main="Dice Dist")
# determine the cutting height (e.g. 0.6); hc now holds the Dice tree
clusters <- cutree(hc, h=.6)
# result
cbind.data.frame(docs, clusters)
#                    docs clusters
# 1       patio furniture        1
# 2 living room furniture        1
# 3           used chairs        2
# 4            new chairs        2
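To make clear what the word-level approach computes, here is a minimal base-R sketch of the Jaccard variant. It is an illustration under assumptions, not the answer's code: it builds the binary document-term matrix by hand with strsplit() instead of tm, and the cutting height of 0.9 is an arbitrary illustrative value.

```r
docs <- c("patio furniture", "living room furniture",
          "used chairs", "new chairs")

# Split each keyword into lowercase words and collect the vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Binary document-term matrix: 1 if the word occurs in the keyword.
dtm <- t(sapply(tokens, function(w) as.integer(vocab %in% w)))
dimnames(dtm) <- list(docs, vocab)

# dist(method = "binary") is the Jaccard distance on presence/absence:
# 1 - (shared words) / (words in either keyword).
d <- dist(dtm, method = "binary")

hc <- hclust(d)
clusters <- cutree(hc, h = 0.9)  # illustrative height (assumption)
split(docs, clusters)
```

Because this compares whole words rather than characters, "used chairs" and "new chairs" are close only through the shared word "chairs"; keywords with no word in common are at the maximum distance of 1.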
Licensed under: CC-BY-SA with attribution