質問

I am generating a matrix using the lsa package in R. After the matrix is created, I would like to calculate the cosine similarity between specific pairs of documents (columns) in the matrix.

Currently, I am doing this with nested for-loops, and it is monstrously slow. In the code below, there are 150 sourceIDs and 6413 targetIDs, for a total of 961.950 comparisons. After an hour and a half on my number-crunching machine, it has only gotten through ~300k of them.

For more info, sourceIDs and targetIDs are vectors of column names, loaded in from two files containing those names. I want to apply the cosine function between all of the source->target pairs. The columns are indexed by the document name, which is a string.

I am sure there is a much faster way of doing this with apply, but I just cant wrap my head around it.

library(lsa)

# tf function
real_tf <- function(m)
{
    return (sweep(m, MARGIN=2, apply(m, 2, max), "/"))
}

#idf function
real_idf <- function(m)
{
    df = rowSums(lw_bintf(m), na.rm=TRUE)
    return (log(ncol(m)/df))
}

#load corpus
lsa.documents <- textmatrix(args[1], minWordLength=1, minDocFreq=0)

# compute tf-idf
lsa.weighted_documents <- real_tf(lsa.documents) * real_idf(lsa.documents)

# compute svd
lsa.nspace <- lsa(lsa.weighted_documents, dims = as.integer(args[5]))
lsa.matrix <- diag(lsa.nspace$sk) %*% t(lsa.nspace$dk)

# compute similarities
lsa.sourceIDs <- scan(args[2], what = character())
lsa.targetIDs <- scan(args[3], what = character())
lsa.similarities <- data.frame(SourceID=character(), TargetID=character(), Score=numeric(), stringsAsFactors=FALSE)
k <- 1
for (i in lsa.sourceIDs)
{
    for (j in lsa.targetIDs)
    {
        lsa.similarities[k,] <- c(i, j, cosine(lsa.matrix[,i], lsa.matrix[,j]))
        k <- k + 1
    }
}
lsa.ranklist <- lsa.similarities[order(lsa.similarities$Score, decreasing=TRUE),]

# save ranklist
write.table(lsa.ranklist, args[4], sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)

Edit: Reproducible example

# cosine function from lsa package
cosine <- function( x, y )
{
    return ( crossprod(x,y) / sqrt( crossprod(x)*crossprod(y) ) )
}

theMatrix <- structure(c(-0.0264639232505822, -0.0141165039351167, -0.0280459775632757, 
-0.041211247161448, -0.00331565717239375, -0.0291161345945683, 
-0.0451167802746869, -0.0116214407383401, -0.0381080747718958, 
-1.36693644389599, 0.274747343110076, 0.128100677705483, -0.401760905661056, 
-1.24876927957167, 0.368479552862631, -0.459711112157286, -0.544344448332346, 
-0.765378939625159, -1.28612431910459, 0.293455499695499, 0.025167452173962
), .Dim = c(3L, 7L), .Dimnames = list(NULL, c("doc1", "doc2", "doc3", 
"doc4", "doc5", "doc6", "doc7")))

sources <- c("doc1", "doc2", "doc3")
targets <- c("doc4", "doc5", "doc6", "doc7")

similarities <- data.frame(SourceID=character(), TargetID=character(), Score=numeric(), stringsAsFactors=FALSE)
k <- 1

for (i in sources)
{
    for (j in targets)
    {
        similarities[k,] <- c(i, j, cosine(theMatrix[,i], theMatrix[,j]))
        k <- k + 1
    }
}

ranklist <- similarities[order(similarities$Score, decreasing=TRUE),]
write.table(ranklist, "C:\\Temp\\outputfile.txt", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)

Which produces (outputfile.txt):

doc1    doc6    0.962195242094352
doc3    doc6    0.893461576046585
doc2    doc6    0.813856201398669
doc2    doc7    0.768837903803964
doc2    doc4    0.730093288388069
doc3    doc7    0.675640649189972
doc3    doc4    0.635982900340315
doc1    doc7    0.53871688669971
doc1    doc4    0.499235059782688
doc1    doc5    0.320383772495164
doc3    doc5    0.226751624753921
doc2    doc5    0.144680489733846
役に立ちましたか?

解決

Ok, thanks for the reproducible example. Here is a possible solution. Let's first split your theMatrix into source and target matrices. We do not need to use the names here, as we will not use loops:

matrix1 <- theMatrix[,1:3]
matrix2 <- theMatrix[,4:7]

Then we will create a function to loop through every column of matrix2, keeping a single column from matrix1 constant:

cycleM2 <- function(x) {
    # x is a vector from matrix1 
    apply(matrix2,2,cosine,x)
}

Finally, we will supply this function to every column of matrix1:

(mydata <- apply(matrix1,2,cycleM2))

#      doc1      doc2      doc3
# doc4 0.4992351 0.7300933 0.6359829
# doc5 0.3203838 0.1446805 0.2267516
# doc6 0.9621952 0.8138562 0.8934616
# doc7 0.5387169 0.7688379 0.6756406

Finally, if you really need your original data format:

require(reshape2)
melt(mydata)

This should speed up your code nicely. Also, as @flodel has noticed, when you use loops, pre-allocate your (empty) target object in memory, filling it e.g. with NA. Memory allocations are the most costly in terms of time, and that is why your original loop was so slow.

EDIT:

A better form using pure function would probably be:

pairwiseCosine <- function(matrix1,matrix2) {
    apply(matrix1,2,function(x){
        apply(matrix2,2,cosine,x)
    })
}

pairwiseCosine(theMatrix[,1:3],theMatrix[,4:7])
ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top