Question

I am trying to convert the following Simple Triplet Matrix created with TermDocumentMatrix() of the tm package

A term-document matrix (317443 terms, 86960 documents)

Non-/sparse entries: 18472230/27586371050
Sparsity           : 100%
Maximal term length: 653 
Weighting          : term frequency (tf)

of class

[1] "TermDocumentMatrix"    "simple_triplet_matrix" 

to a dense matrix.

But

dense <- as.matrix(tdm)

generates the error

Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

I can't really understand the error and warning message. Trying to replicate the error on a small dataset with

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
as.matrix(tdm)

doesn't produce the same issue. I saw from this answer that a similar problem was solved through the slam package (even though the question was about a sum operation and not a transformation into a dense matrix). I browsed the slam documentation but I couldn't find any specific function to transform an object of class simple_triplet_matrix into an object of class matrix.


Solution

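You get this error because, as noted in the comments, the number of cells nr * nc exceeds the largest value R can store in an integer (.Machine$integer.max, i.e. 2147483647). That is to be expected with such a large number of terms and documents. This reproduces the same warning:

.Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow

The function vector, which takes the length as its second argument, then fails because that argument is NA.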
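To make the overflow concrete, here is a small sketch using the dimensions reported in the question (317443 terms, 86960 documents). The variable names are only illustrative; the point is that the product is computed in integer arithmetic inside as.matrix, whereas the same product in double arithmetic is perfectly representable.

```r
nr <- 317443L  # number of terms, from the question
nc <- 86960L   # number of documents, from the question

# Integer multiplication overflows: the result is NA (with a warning,
# suppressed here so the snippet runs quietly)
n_int <- suppressWarnings(nr * nc)
is.na(n_int)

# The same product in double arithmetic is fine
n_dbl <- as.numeric(nr) * as.numeric(nc)
n_dbl
n_dbl > .Machine$integer.max
```

The double result equals the sum of the sparse and non-sparse entry counts printed for the term-document matrix in the question.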

One solution is to redefine as.matrix.simple_triplet_matrix without calling vector. For example:

as.matrix.simple_triplet_matrix <- 
function (x, ...) 
{
  nr <- x$nrow
  nc <- x$ncol
  ## old line: y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
  ## new line: no explicit integer product nr * nc, so no overflow
  y <- matrix(0, nr, nc)
  y[cbind(x$i, x$j)] <- x$v
  dimnames(y) <- x$dimnames
  y
}

But I am not sure it is a good idea to coerce such a sparse matrix (sparsity 100%) to a dense matrix.
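A rough back-of-the-envelope estimate shows why. The sketch below assumes 8 bytes per cell for a dense numeric matrix, and 4 + 4 + 8 bytes per stored entry for the triplet form (integer row index, integer column index, double value). Those per-entry sizes are an assumption for illustration, and object overhead is ignored.

```r
n_cells   <- 317443 * 86960   # terms * documents, from the question
n_nonzero <- 18472230         # non-sparse entries, from the question

# Dense double matrix: 8 bytes per cell (assumed)
dense_bytes <- n_cells * 8
dense_gib   <- dense_bytes / 2^30

# Triplet form: i (4 bytes) + j (4 bytes) + v (8 bytes) per non-zero (assumed)
sparse_bytes <- n_nonzero * (4 + 4 + 8)
sparse_gib   <- sparse_bytes / 2^30

c(dense_gib = dense_gib, sparse_gib = sparse_gib)
```

Under these assumptions the dense matrix would need a bit more than 200 GiB, while the triplet form stays well under 1 GiB.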

EDIT

One idea is to use sparseMatrix from the Matrix package. Here is an example where I compare the objects generated by each coercion. You gain at least a factor of 10 in memory by using sparseMatrix (given how sparse your matrix is, I think you will gain more). Moreover, addition and multiplication are supported for sparse matrices.

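library(tm)
library(Matrix)
data("crude")
dtm <- TermDocumentMatrix(crude,
                          control = list(weighting = weightTfIdf,
                                         stopwords = TRUE))
## sparse representation built directly from the triplets
sparse <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                       dims = c(dtm$nrow, dtm$ncol))
## dense representation (fine here because crude is small)
dense <- as.matrix(dtm)
## check sizes: how many times larger the dense object is
floor(as.numeric(object.size(dense) / object.size(sparse)))
## addition and element-wise multiplication are supported
sparse + sparse
sparse * sparse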
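If a dense matrix is only needed for part of the data, one possible approach is to stay sparse and densify just a small slice. The sketch below uses a hypothetical toy list with the same fields as a simple_triplet_matrix (i, j, v, nrow, ncol, dimnames) so that it runs without tm. The term and document names are made up.

```r
library(Matrix)

# Hypothetical toy data mimicking the fields of a simple_triplet_matrix
stm <- list(i = c(1L, 3L, 2L),
            j = c(1L, 1L, 2L),
            v = c(2, 1, 5),
            nrow = 3L, ncol = 2L,
            dimnames = list(c("oil", "price", "market"), c("d1", "d2")))

# Build the sparse matrix, keeping dimensions and names
sp <- sparseMatrix(i = stm$i, j = stm$j, x = stm$v,
                   dims = c(stm$nrow, stm$ncol),
                   dimnames = stm$dimnames)

# Summaries work directly on the sparse object
term_totals <- rowSums(sp)

# Densify only the rows that are actually needed
small <- as.matrix(sp[c("oil", "market"), , drop = FALSE])
small
```

Passing dims explicitly matters when the last terms or documents have no non-zero entries, since sparseMatrix would otherwise infer smaller dimensions from the largest index it sees.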

OTHER TIPS

I just had a similar problem. I'm not sure if my problem is identical, but when combining a sparse matrix with a dense matrix I got a similar error message NAs produced by integer overflow. I was able to fix it by converting the dense matrix to single precision using as.single. I think the "overflowing integers" are caused by operations in the sparseMatrix package that somehow truncate double precision values leaving leftover digits.

Licensed under: CC-BY-SA with attribution