Question

I am attempting to run fastcluster on a very large set of distances, but I am running into a problem.

I have a very large CSV file (about 91 million rows, so a for loop in R takes too long) of similarities between keywords (about 50,000 unique keywords). When I read it into a data.frame it looks like this:

> df
  kwd1 kwd2 similarity
1    a    b          1
2    b    a          1
3    c    a          2
4    a    c          2

It is a sparse list, and I can convert it into a sparse matrix using sparseMatrix() from the Matrix package:

> myMatrix 
  a b c  
a . . .
b 1 . .
c 2 . .
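
The conversion itself is roughly along these lines (just a sketch; it assumes the three columns are named kwd1, kwd2 and similarity as shown above):

library(Matrix)

kwds <- sort(unique(c(df$kwd1, df$kwd2)))   # all ~50,000 keyword labels
myMatrix <- sparseMatrix(
  i = match(df$kwd1, kwds),                 # row index of each pair
  j = match(df$kwd2, kwds),                 # column index of each pair
  x = df$similarity,                        # similarity value for that pair
  dims = c(length(kwds), length(kwds)),
  dimnames = list(kwds, kwds)
)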

However, when I attempt to turn it into a dist object using as.dist(), R gives an error that the problem is too large. I have read the other dist questions on here, but the code suggested in them does not work for my example data set above.

Thanks for any help!

Solution

While using a sparse matrix in the first place seems like a good idea, I think there is a bit of a problem with that approach: your missing distances will be coded as 0s, not as NAs (see Creating (and Accessing) a Sparse Matrix with NA default entries). As you know, when clustering, a zero dissimilarity has a totally different meaning than a missing one...

So anyway, what you need is a dist object with a lot of NAs for your missing dissimilarities. Unfortunately, your problem is so big that it would require too much memory:

d <- dist(x = rep(NA_integer_, 50000))
# Error: cannot allocate vector of size 9.3 Gb
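
To see where the 9.3 Gb comes from, the back-of-the-envelope arithmetic is just the size of the lower triangle stored as doubles:

n <- 50000
n * (n - 1) / 2                 # ~1.25e9 pairwise dissimilarities
n * (n - 1) / 2 * 8 / 1024^3    # ~9.3 GiB of doubles, before any clustering starts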

And that's only dealing with the input... Even with a 64-bit machine and a lot of memory, I'm not sure the clustering algorithm itself wouldn't choke or run indefinitely.

You should consider breaking your problem into smaller pieces.
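
One possible way to do that (a rough sketch, not a drop-in solution; the block size and the handling of missing pairs are choices you will have to make) is to densify one manageable block of the sparse matrix at a time, make the missing pairs explicit, and cluster each block separately:

library(Matrix)

kwds  <- rownames(myMatrix)
block <- sample(kwds, 5000)                 # hypothetical block size

dense <- as.matrix(myMatrix[block, block])  # small enough to hold densely
dense[dense == 0] <- NA                     # missing pairs become NA, not 0
diag(dense) <- 0
d <- as.dist(dense)                         # ~12.5 million entries, easily manageable

# d still contains NAs; impute or drop them before handing it to hclust() & co.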

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow