I want to find documents whose similarity to other documents is larger than a given value (0.1), by cutting the documents into blocks.

library(tm)
data("crude")

sample.dtm <- DocumentTermMatrix(
                    crude, control=list(
                        weighting=function(x) weightTfIdf(x, normalize=FALSE),
                        stopwords=TRUE
                    )
                )

step = 5
n = nrow(sample.dtm)
block = n %/% step 
start = (c(1:block)-1)*step+1
end = start+step-1


j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))

ij <- cbind(i,j)

library(skmeans)

getdocs <- function(k){
    ci <- c(start[k[[1]]]:end[k[[1]]])
    cj <- c(start[k[[2]]]:end[k[[2]]])
    combi <- sample.dtm[ci]
    combj <- sample.dtm[cj]

    rownames(combi)<-ci
    rownames(combj)<-cj

    comb<-c(combi,combj)
    sim<-1-skmeans_xdist(comb)

    cat("Block", k[[1]], "with Block", k[[2]], "\n")
    flush.console()

    tri.sim<-upper.tri(sim,diag=F)
    results<-tri.sim & sim>0.1

    docs<-apply(results,1,function(x) length(x[x==TRUE]))
    docnames<-names(docs)[docs>0]

    gc()
    return (docnames)

}

It works well when using apply:

system.time(rmdocs<-apply(ij,1,getdocs))

But it fails when using parRapply:

library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))

Error:

 Error in checkForRemoteErrors(val) : 
      2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
    Timing stopped at: 0.01 0 0.04 

It seems that sample.dtm couldn't be used in parRapply. I'm confused. Can anyone help me? Thanks!

Solution

In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there isn't a dimnames method defined for "DocumentTermMatrix" objects, causing rownames<- to fail.

You can load packages on the cluster workers with the clusterEvalQ function:

clusterEvalQ(cl, { library(tm); library(skmeans) })

After doing that, rownames(combi)<-ci will work correctly.
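
If you want to confirm that this is what is going on, you can ask each worker whether a package is attached before and after calling clusterEvalQ. The following is only a minimal sketch under a few assumptions: it uses the parallel package (which ships with R and offers the same makeCluster, clusterCall and clusterEvalQ functions as snow) so that it runs without extra installs, it uses splines as a stand-in for tm, and is_attached is a hypothetical helper name, not part of any package.

```r
library(parallel)  # assumption: stand-in for snow, same function names

cl <- makeCluster(2)

# Hypothetical helper: TRUE if package `pkg` is on the search path
# of the R session where this function is evaluated.
is_attached <- function(pkg) paste0("package:", pkg) %in% search()

# Workers are fresh R sessions; packages attached in the master
# session are not attached there automatically.
before <- unlist(clusterCall(cl, is_attached, "splines"))

# Attach the package on every worker (splines stands in for tm here).
invisible(clusterEvalQ(cl, library(splines)))

after <- unlist(clusterCall(cl, is_attached, "splines"))

stopCluster(cl)

print(before)
print(after)
```

With the code from the question, the same check with "tm" in place of "splines" should tell you whether the workers can see the DocumentTermMatrix methods.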

Also, if you want to see the output from cat, you should use the makeCluster outfile argument:

cl <- makeCluster(2, outfile='')
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow