I've come up with a chunking solution for those extra-large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that hardly matters, since it should only ever be used when dist() throws an error - usually one of the following:
"Error in double(N * (N - 1)/2) : vector size specified is too large"
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"
The function calculates the mean distance for the matrix, but you can change that to anything else. If you want to actually save the matrix, I believe some sort of file-backed bigmemory matrix is in order (there is a rough sketch of that at the end of this post). Kudos to link for the idea and Ari for his help!
FunDistanceMatrixChunking <- function(df, blockSize = 100) {
  n <- nrow(df)
  blocks <- n %/% blockSize
  if ((n %% blockSize) > 0) blocks <- blocks + 1
  # one row per block pair (i >= j): number of distances in the chunk and their mean
  chunk.means <- matrix(NA, nrow = blocks * (blocks + 1) / 2, ncol = 2)
  dex <- 1:blockSize
  chunk <- 0
  for (i in 1:blocks) {
    # row indices of block i; lex tracks where those rows land in the stacked matrix below
    p <- dex + (i - 1) * blockSize
    lex <- (blockSize + 1):(2 * blockSize)
    lex <- lex[p <= n]
    p <- p[p <= n]
    for (j in 1:blocks) {
      q <- dex + (j - 1) * blockSize
      q <- q[q <= n]
      if (i == j) {
        # within-block distances
        chunk <- chunk + 1
        x <- dist(df[p, , drop = FALSE])
        chunk.means[chunk, ] <- c(length(x), mean(x))
      }
      if (i > j) {
        # between-block distances: stack blocks j and i, then pull out the cross part
        chunk <- chunk + 1
        x <- as.matrix(dist(df[c(q, p), , drop = FALSE]))[lex, dex]
        chunk.means[chunk, ] <- c(length(x), mean(x))
      }
    }
  }
  # combine the chunk means, weighted by the number of distances in each chunk
  weighted.mean(chunk.means[, 2], chunk.means[, 1])
}
# quick sanity check on a matrix small enough for dist()
df <- cbind(var1 = rnorm(1000), var2 = rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize = 100)   # should agree up to floating-point rounding
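And here is the rough sketch of the file-backed idea mentioned above, in case you want to keep the whole matrix rather than just a summary. Treat it as a sketch under my own assumptions: the function name and the dist.bin/dist.desc filenames are made up, and it writes the full symmetric n x n matrix, so it needs about 8*n^2 bytes of disk space. I haven't tuned or benchmarked it at all.
library(bigmemory)

FunDistanceMatrixFilebacked <- function(df, blockSize = 100,
                                        backingfile = "dist.bin",
                                        descriptorfile = "dist.desc") {
  n <- nrow(df)
  # file-backed n x n matrix living on disk instead of in RAM
  D <- filebacked.big.matrix(n, n, type = "double",
                             backingfile = backingfile,
                             descriptorfile = descriptorfile)
  blocks <- n %/% blockSize
  if ((n %% blockSize) > 0) blocks <- blocks + 1
  for (i in 1:blocks) {
    p <- ((i - 1) * blockSize + 1):min(i * blockSize, n)
    for (j in 1:i) {
      q <- ((j - 1) * blockSize + 1):min(j * blockSize, n)
      if (i == j) {
        d <- as.matrix(dist(df[p, , drop = FALSE]))
      } else {
        # cross-block distances, same stacking trick as above
        d <- as.matrix(dist(df[c(q, p), , drop = FALSE]))[length(q) + seq_along(p),
                                                          seq_along(q), drop = FALSE]
      }
      D[p, q] <- d
      D[q, p] <- t(d)   # fill the symmetric half as well
    }
  }
  D
}
In a later session you should be able to re-attach the saved matrix with attach.big.matrix("dist.desc") rather than recomputing it.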
Not sure whether I should have posted this as an edit instead of an answer. It does solve my problem, although I didn't really specify it this way.