Question

I have this line I want to execute:

tdm_english <- DocumentTermMatrix(doc.corpus, list(dictionary = dictionary_english))

doc.corpus has length 191,000 and dictionary_english has 48 entries.

I ran the very same line on a corpus 3/4 the size of this one and everything ran smoothly in a few minutes (probably not even 5 minutes).

Now the function crashes my MacBook Pro. I ran it twice, and both times I had to force quit R and RStudio after more than an hour of computation.

Is there any way to optimize my call?


Solution

I bypassed the problem by using TermDocumentMatrix instead of DocumentTermMatrix, which apparently is more stable on big datasets.
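For reference, a minimal sketch of that swap, assuming the same doc.corpus and dictionary_english from the question; tm also provides a t() method if you later need the document-term orientation:

library(tm)

# Same dictionary-restricted call, but with terms as rows and documents as columns
tdm_english <- TermDocumentMatrix(doc.corpus,
                                  control = list(dictionary = dictionary_english))

# If a document-term layout is needed downstream, transpose it back
dtm_english <- t(tdm_english)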

Update: I made it work with DocumentTermMatrix as well. As pointed out by DWin, the problem seemed to be that DocumentTermMatrix was insatiable with memory. I managed to restrain its appetite with vapply. I tested it on 200k records and it completed the job without paralysing the whole system.

tdm_english <- vapply(doc.corpus, DocumentTermMatrix, FUN.VALUE = numeric(1), list(dictionary = dictionary_english), USE.NAMES = FALSE)

Other tips

From your description it sounds like you are running out of memory. To check this, open Activity Monitor and start the R script. Then check the System Memory tab in Activity Monitor and see how many Page Ins and Page Outs take place. If this number is significant, combined with high memory usage by your R process, it indicates your computer is running out of memory and is using hard drive space to make up for it. This is very slow.

The solution is to use a smaller dataset, process the data in chunks, find a DocumentTermMatrix setting that limits memory usage, or get more RAM.
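As a sketch of the chunked approach (the dtm_in_chunks helper and chunk_size are hypothetical, not part of tm; the dictionary restriction keeps each chunk's matrix small, here 48 columns):

library(tm)

# Hypothetical helper: build the dictionary-restricted matrix chunk by chunk,
# so only a slice of the corpus is materialised at once; tune chunk_size to your RAM
dtm_in_chunks <- function(corpus, dict, chunk_size = 10000) {
  idx <- split(seq_along(corpus), ceiling(seq_along(corpus) / chunk_size))
  parts <- lapply(idx, function(i) {
    dtm <- DocumentTermMatrix(corpus[i], control = list(dictionary = dict))
    as.matrix(dtm)  # dense but small: ncol equals length(dict), 48 terms here
  })
  do.call(rbind, parts)
}

dtm_english <- dtm_in_chunks(doc.corpus, dictionary_english)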

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow