Question

I want to find documents whose similarity to other documents is greater than a given value (0.1), and I do this by cutting the documents into blocks and comparing the blocks pairwise.

library(tm)
data("crude")

sample.dtm <- DocumentTermMatrix(
                    crude, control=list(
                        weighting=function(x) weightTfIdf(x, normalize=FALSE),
                        stopwords=TRUE
                    )
                )

step = 5
n = nrow(sample.dtm)
block = n %/% step 
start = (c(1:block)-1)*step+1
end = start+step-1


j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))

ij <- cbind(i,j)

library(skmeans)

getdocs <- function(k){
    ci <- c(start[k[[1]]]:end[k[[1]]])
    cj <- c(start[k[[2]]]:end[k[[2]]])
    combi <- sample.dtm[ci, ]
    combj <- sample.dtm[cj, ]

    rownames(combi)<-ci
    rownames(combj)<-cj

    comb<-c(combi,combj)
    sim<-1-skmeans_xdist(comb)

    cat("Block", k[[1]], "with Block", k[[2]], "\n")
    flush.console()

    tri.sim<-upper.tri(sim,diag=F)
    results<-tri.sim & sim>0.1

    docs<-apply(results,1,function(x) length(x[x==TRUE]))
    docnames<-names(docs)[docs>0]

    gc()
    return (docnames)

}

It works well with apply:

system.time(rmdocs<-apply(ij,1,getdocs))

But it fails with parRapply:

library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))

Error:

 Error in checkForRemoteErrors(val) : 
      2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
    Timing stopped at: 0.01 0 0.04 

It seems that sample.dtm can't be used inside parRapply. I'm confused. Can anyone help me? Thanks!


Solution

In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, because tm isn't loaded on the workers, there is no dimnames method defined there for "DocumentTermMatrix" objects, so rownames<- fails.

You can load packages on the cluster workers with the clusterEvalQ function:

clusterEvalQ(cl, { library(tm); library(skmeans) })

After doing that, rownames(combi)<-ci will work correctly.
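One way to see the effect directly is to inspect an exported matrix on the workers before and after loading tm. This is only a minimal sketch, assuming the snow and tm packages are installed; the names dtm, before and after are illustrative and do not come from the question. Before tm is loaded, the workers are expected to treat dtm as a plain list, so dim() should return NULL; afterwards it should return the real dimensions.

```r
library(snow)
library(tm)
data("crude")

# Small document-term matrix built on the master (illustrative name)
dtm <- DocumentTermMatrix(crude)

cl <- makeCluster(2)
clusterExport(cl, "dtm")

# tm is not loaded on the workers yet, so its matrix methods are not registered there
before <- clusterEvalQ(cl, dim(dtm))

# Load tm on every worker, then ask again
invisible(clusterEvalQ(cl, library(tm)))
after <- clusterEvalQ(cl, dim(dtm))

print(before)
print(after)

stopCluster(cl)
```

If the two printed results differ in this way, the same clusterEvalQ call placed before parRapply should make getdocs behave on the workers as it does locally.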

Also, if you want to see the output from cat, you should use the makeCluster outfile argument:

cl <- makeCluster(2, outfile='')
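As a quick check of the outfile setting, the sketch below has each worker print one line. It assumes snow is installed, and the variable res is illustrative. With outfile='' the worker output is not redirected, so the lines should appear in the master session's console.

```r
library(snow)

# outfile = '' leaves worker output unredirected instead of discarding it
cl <- makeCluster(2, outfile = '')

# Each worker prints one line and returns its own process id
res <- clusterEvalQ(cl, {
    cat("worker", Sys.getpid(), "reporting\n")
    Sys.getpid()
})

stopCluster(cl)
```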
Licensed under: CC-BY-SA with attribution