Question

I am trying to convert the following Simple Triplet Matrix created with TermDocumentMatrix() of the tm package

A term-document matrix (317443 terms, 86960 documents)

Non-/sparse entries: 18472230/27586371050
Sparsity           : 100%
Maximal term length: 653 
Weighting          : term frequency (tf)

of class

[1] "TermDocumentMatrix"    "simple_triplet_matrix" 

to a dense matrix.

But

dense <- as.matrix(tdm)

generates the error

Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

I can't really understand the error and warning message. Trying to replicate the error on a small dataset with

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
as.matrix(tdm)

doesn't produce the same issue. I saw from this answer that a similar problem was solved through the slam package (even though the question was about a sum operation and not a transformation into a dense matrix). I browsed the slam documentation but I couldn't find any specific function to transform an object of class simple_triplet_matrix into an object of class matrix.


Solution

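You get the error because, as noted in the comments, the number of cells exceeds the integer limit, which is expected with this many terms and documents: nr * nc is 317443 * 86960 = 27,604,843,280, well above .Machine$integer.max (2,147,483,647). This reproduces the NA: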

as.integer(.Machine$integer.max+1)
[1] NA
Warning message:
NAs introduced by coercion 

The vector() function takes the desired length as its second argument, and it fails here because that argument (nr * nc) is NA.
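To make the overflow concrete, here is a minimal sketch that uses the dimensions printed in the question. The variable names are only illustrative, and the memory figure assumes a numeric (double) matrix at 8 bytes per cell.

```r
# Dimensions reported for the term-document matrix in the question.
nr <- 317443L
nc <- 86960L

# Integer arithmetic overflows and yields NA (R warns about integer overflow);
# the warning is suppressed here only to keep the output quiet.
cells_int <- suppressWarnings(nr * nc)
is.na(cells_int)

# Double arithmetic gives the real cell count, which exceeds the integer range.
cells_dbl <- as.numeric(nr) * as.numeric(nc)
cells_dbl > .Machine$integer.max

# Approximate memory for a dense numeric matrix at 8 bytes per cell, in GiB.
dense_gib <- cells_dbl * 8 / 2^30
dense_gib
```

This also explains why the crude example does not fail: its term-document matrix is small enough that nr * nc stays well inside the integer range.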

One solution is to redefine as.matrix.simple_triplet_matrix without calling vector. For example:

as.matrix.simple_triplet_matrix <- 
function (x, ...) 
{
  nr <- x$nrow
  nc <- x$ncol
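  ## original line, which fails because nr * nc overflows the integer range:
  ## y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
  y <- matrix(0, nr, nc)  ## let matrix() allocate the zero-filled result directly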
  y[cbind(x$i, x$j)] <- x$v
  dimnames(y) <- x$dimnames
  y
}

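However, I am not sure it is a good idea to coerce such a sparse matrix (reported sparsity: 100%) to a dense one. With 317443 x 86960 cells at 8 bytes each, the dense result would need roughly 220 GB of memory.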
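If only part of the matrix is needed in dense form, one option is to shrink it while it is still in triplet form. The following is a rough sketch, not a tested recipe for the full data set: it builds a toy simple_triplet_matrix by hand (the terms, documents, and the threshold of 3 are all made up for illustration) and uses row_sums() and subsetting from the slam package before calling as.matrix().

```r
library(slam)

# Toy 4-term x 3-document matrix in triplet form (hypothetical data).
tdm_small <- simple_triplet_matrix(
  i = c(1, 1, 2, 3, 4),
  j = c(1, 2, 2, 3, 1),
  v = c(2, 1, 5, 1, 3),
  nrow = 4, ncol = 3,
  dimnames = list(c("oil", "price", "barrel", "opec"),
                  c("d1", "d2", "d3"))
)

# Total frequency of each term, computed without densifying the matrix.
term_totals <- row_sums(tdm_small)

# Keep only terms that occur at least 3 times overall (arbitrary threshold).
keep <- unname(which(term_totals >= 3))

# The reduced matrix is small enough to convert safely.
dense_subset <- as.matrix(tdm_small[keep, ])
dense_subset
```

Whether this helps depends on how many terms survive the filter; the subset still has to fit in memory once converted.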

EDIT

One idea is to use sparseMatrix from the Matrix package. Here is an example where I compare the objects generated by each coercion. You gain a factor of at least 10 in size by using sparseMatrix (and I think that with your very sparse matrix you will gain more). Moreover, addition and element-wise multiplication are supported by sparse matrices.

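library(tm)
library(Matrix)
data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(weighting = weightTfIdf,
                                         stopwords = TRUE))
## build the sparse matrix from the triplet slots; passing dims and dimnames
## keeps the original shape and labels even if trailing rows/columns are empty
sparse <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                       dims = c(tdm$nrow, tdm$ncol),
                       dimnames = tdm$dimnames)
dense <- as.matrix(tdm)
## check sizes: how many times larger the dense object is
floor(as.numeric(object.size(dense)) / as.numeric(object.size(sparse)))
## addition and element-wise multiplication are supported
sparse + sparse
sparse * sparse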
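As a sanity check that a sparse representation can hold a matrix of the size in the question, here is a small sketch that creates a sparse matrix with the same dimensions but only two non-zero entries. The positions and values are arbitrary placeholders, not data from the question.

```r
library(Matrix)

# Same shape as the term-document matrix in the question, but with only
# two made-up non-zero cells so that it is cheap to build.
sp <- sparseMatrix(i = c(1, 317443),
                   j = c(1, 86960),
                   x = c(1, 2),
                   dims = c(317443, 86960))

# The dimensions are stored as integers; their product is computed as a
# double, so no integer overflow occurs here.
dim(sp)
prod(dim(sp))

# Memory use depends on the non-zero entries and the number of columns,
# not on the total number of cells.
print(object.size(sp), units = "Kb")

# Arithmetic stays sparse.
sum(sp + sp)
```

The real matrix has about 18.5 million non-zero entries, so it would be much larger than this toy object, but still far smaller than its dense equivalent.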

Other suggestions

I just had a similar problem. I'm not sure if my problem is identical, but when combining a sparse matrix with a dense matrix I got a similar error message NAs produced by integer overflow. I was able to fix it by converting the dense matrix to single precision using as.single. I think the "overflowing integers" are caused by operations in the sparseMatrix package that somehow truncate double precision values leaving leftover digits.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow