Question

I'm trying to build a term-document matrix with the TermDocumentMatrix function from the tm package in R, and I found that some words are missing from it.

> library(tm)
> tdm <- TermDocumentMatrix(Corpus(VectorSource("The book is of great importance.")))
> rownames(tdm)
[1] "book"        "great"       "importance." "the" 

Here, the words "is" and "of" have been excluded from the matrix. If the corpus contains only the excluded words, I get the following warning.

> tdm <- TermDocumentMatrix(Corpus(VectorSource("of is of is")))
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
> rownames(tdm)
NULL

The warning suggests that "is" and "of" are removed before the matrix is built, but I have not been able to figure out why this happens or how to include every token in the corpus.

Any help is appreciated.


Solution

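Use the control argument of TermDocumentMatrix. By default, wordLengths is c(3, Inf), so terms shorter than three characters, such as "is" and "of", are discarded. Lowering the minimum length keeps them: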

library(tm)
# Keep terms of any length and leave stop-word removal switched off
corpus <- Corpus(VectorSource("of is of is"))
tdm <- TermDocumentMatrix(corpus, control = list(stopwords = FALSE, wordLengths = c(0, Inf)))
rownames(tdm)
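
To see which terms the length filter removes, the two settings can be compared on the sentence from the question. This is only a minimal sketch: the variable names (corpus, default_terms, all_terms) are made up for illustration, and it assumes a minimum length of 1 is enough to keep every token.

```r
library(tm)

# Same sentence as in the question
corpus <- Corpus(VectorSource("The book is of great importance."))

# Default control: terms shorter than three characters are dropped
default_terms <- rownames(TermDocumentMatrix(corpus))

# Assumed setting for illustration: a minimum length of 1 keeps short terms
all_terms <- rownames(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))

# Terms that appear only once the minimum length is lowered
setdiff(all_terms, default_terms)
```

Comparing the two vectors this way makes it easy to check what a given control list removes before building the matrix for a larger corpus.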

License: CC-BY-SA with attribution