Question

I want to find documents whose similarity to other documents is larger than a given value (0.1). To do this, I cut the documents into blocks and compare the blocks pairwise.

library(tm)
data("crude")

sample.dtm <- DocumentTermMatrix(
                    crude, control=list(
                        weighting=function(x) weightTfIdf(x, normalize=FALSE),
                        stopwords=TRUE
                    )
                )

step = 5
n = nrow(sample.dtm)
block = n %/% step 
start = (c(1:block)-1)*step+1
end = start+step-1


j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))

ij <- cbind(i,j)

library(skmeans)

getdocs <- function(k){
    ci <- c(start[k[[1]]]:end[k[[1]]])
    cj <- c(start[k[[2]]]:end[k[[2]]])
    combi <- sample.dtm[ci]
    combj <- sample.dtm[cj]

    rownames(combi)<-ci
    rownames(combj)<-cj

    comb<-c(combi,combj)
    sim<-1-skmeans_xdist(comb)

    cat("Block", k[[1]], "with Block", k[[2]], "\n")
    flush.console()

    tri.sim<-upper.tri(sim,diag=F)
    results<-tri.sim & sim>0.1

    docs<-apply(results,1,function(x) length(x[x==TRUE]))
    docnames<-names(docs)[docs>0]

    gc()
    return (docnames)

}

It works well when using apply:

system.time(rmdocs<-apply(ij,1,getdocs))

But it fails when using parRapply:

library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))

Error:

 Error in checkForRemoteErrors(val) : 
      2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
    Timing stopped at: 0.01 0 0.04 

It seems that sample.dtm couldn't be used in parRapply. I'm confused. Can anyone help me? Thanks!

Solution

In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there isn't a dimnames method defined for "DocumentTermMatrix" objects, causing rownames<- to fail.
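
The failure can be reproduced without tm or a cluster. The sketch below is only an analogy: toy_triplet is a hypothetical class standing in for a sparse matrix stored as a list, and its three methods are made up for illustration, not tm's actual code. While no dim or dimnames method is visible, R treats the object as a plain list and rownames<- stops with an error; once the methods exist, the same assignment succeeds.

```r
# Hypothetical sparse-matrix-like object: a plain list with a class attribute.
x <- structure(
    list(i = 1:2, j = 1:2, v = c(1, 1), nrow = 2L, ncol = 2L, dimnames = NULL),
    class = "toy_triplet"
)

# No methods defined yet (like a worker that has not loaded the package):
# dim(x) is NULL, so the assignment fails and the error message is captured.
msg_before <- tryCatch(
    { rownames(x) <- c("a", "b"); "ok" },
    error = function(e) conditionMessage(e)
)

# Define the S3 methods (like loading the package that provides them).
dim.toy_triplet <- function(x) c(x$nrow, x$ncol)
dimnames.toy_triplet <- function(x) x$dimnames
`dimnames<-.toy_triplet` <- function(x, value) {
    x$dimnames <- value
    x
}

# Now the same assignment dispatches to the methods and succeeds.
rownames(x) <- c("a", "b")
rows_after <- rownames(x)
```

Printing msg_before shows the message raised by the first attempt, and rows_after holds the row names set by the second.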

You can load packages on the cluster workers with the clusterEvalQ function:

clusterEvalQ(cl, { library(tm); library(skmeans) })

After doing that, rownames(combi)<-ci will work correctly.
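
To check that a package really is attached on the workers, you can ask each worker for its search path. This is a minimal sketch with two assumptions: it uses the parallel package that ships with R (its makeCluster, clusterEvalQ and stopCluster mirror the snow functions of the same name), and it uses splines as a stand-in for tm and skmeans so that it runs without extra packages installed.

```r
library(parallel)  # ships with R; assumed stand-in for snow here

cl <- makeCluster(2)

# Fresh workers attach only the default packages, so splines is missing.
attached_before <- unlist(clusterEvalQ(cl, "package:splines" %in% search()))

# Attach the package on every worker, as in the clusterEvalQ call above.
invisible(clusterEvalQ(cl, library(splines)))
attached_after <- unlist(clusterEvalQ(cl, "package:splines" %in% search()))

stopCluster(cl)
```

attached_before and attached_after each hold one logical value per worker, so the same check with "package:tm" tells you whether the workers are ready before you call parRapply.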

Also, if you want to see the output from cat, you should use the makeCluster outfile argument:

cl <- makeCluster(2, outfile='')
Licensed under: CC-BY-SA with attribution