I bypassed the problem by using TermDocumentMatrix instead of DocumentTermMatrix, which apparently is more stable on big datasets.
Update: I also made it work with DocumentTermMatrix. As DWin pointed out, the problem seemed to be that DocumentTermMatrix was insatiably memory-hungry. I managed to restrain its appetite with vapply: I tested it on 200k records and it completed the job without paralysing the whole system.
One caveat about the line below: vapply requires FUN.VALUE to match what the function returns for each element, and DocumentTermMatrix returns a whole matrix object, not a single number, so FUN.VALUE = numeric(1) would error. The sketch below instead uses termFreq (tm's per-document term counter) so each document yields a fixed-length vector of counts over the dictionary; the result is a terms-by-documents matrix:

tdm_english <- vapply(doc.corpus,
                      function(doc) termFreq(doc, control = list(dictionary = dictionary_english)),
                      FUN.VALUE = integer(length(dictionary_english)),
                      USE.NAMES = FALSE)