Question

So, I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, which are each ~1000 words). The process runs and then dies, I think because it is running out of memory.

So I try to shrink the size of the document term matrix that the lda() function takes as input; I figure I can do that using the minDocFreq option when I generate my document term matrices. But when I use it, it doesn't seem to make any difference.

Here is the relevant bit of code:

> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1]  6423 41613
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1]  6423 41613

Same dimensions, and same number of columns (that is, same number of terms).

Any sense what I'm doing wrong? Thanks.


The solution

The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)

In brief, more recent versions of the tm package no longer include minDocFreq; they use bounds instead. For example, your

smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))

should now be

require(tm)
data("crude")

smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(5,Inf))))
dim(smaller) # after terms that appear in <5 documents are discarded
[1] 20 67
smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(10,Inf))))
dim(smaller) # after terms that appear in <10 documents are discarded
[1] 20 17
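Translated back to the corpus from the question, the same call would look like this (a sketch, assuming the `corpus` object built above; 100 is the threshold the question tried to pass as minDocFreq):

```r
library(tm)

# Keep only terms that appear in at least 100 documents; the upper
# bound Inf means no maximum document frequency is enforced.
smaller <- DocumentTermMatrix(corpus,
    control = list(bounds = list(global = c(100, Inf))))

# The second number (terms/columns) should now be much smaller than
# 41613, which should make the matrix small enough to pass to lda().
dim(smaller)
```

Raising the lower bound is also a reasonable preprocessing step for topic modeling in its own right: terms that occur in only a handful of documents contribute little to the topics and inflate the vocabulary.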
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow