Question

I would like to read a text file and apply some text-mining techniques to it. When I used the tm package in R, I got a lot of error messages. For example, when I tried to correlate the most frequent words, I got only NAs. Here is the code I have used so far:

library(tm)

doc <- c("word1 word1 word2 word1 word2 word3 word1 word2 word3 word4 word1 word2 word3 word4 word5")

Corpus <- Corpus(VectorSource(doc))
Corpus <- tm_map(Corpus, stripWhitespace)
Corpus <- tm_map(Corpus, tolower)
Corpus <- tm_map(Corpus, removeWords, stopwords("english"))
Corpus <- tm_map(Corpus, removePunctuation)

tdm <- TermDocumentMatrix(Corpus)

#Plotting correlation of Terms
plot(tdm, terms = findFreqTerms(tdm, lowfreq = 2, Inf)[1:3], corThreshold = 0.1)

After that, I got the following error message:

Error in if (all(from == t(from))) "undirected" else "directed":
missing value where TRUE/FALSE needed

To investigate, I used the following code, which is a step-by-step version of findAssocs():

terms <- findFreqTerms(tdm, lowfreq = 2)[1:3]
m <- as.matrix(t(tdm[terms,]))
m
cor(m)

However, I got the following output:

          word1 word2 word3
    word1    NA    NA    NA
    word2    NA    NA    NA
    word3    NA    NA    NA

It seems to me that something is wrong with the text, but I cannot explain this strange behavior. Does anybody have a solution for this problem? I am running R 2.15.2 on a Mac (x86_64-apple-darwin9.8.0/x86_64 (64-bit)).

Thanks a lot!


Solution

cor() returns a matrix of NA values because you have only one observation of each variable: your corpus contains a single document, so each term has exactly one count. A correlation cannot be computed when each variable has only one observation.

You can check this by looking at your matrix m, which has only one row:

> m
    Terms
Docs word1 word2 word3
   1     5     4     3
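
To see the effect without tm, here is a minimal base R sketch. The first matrix copies the single row of counts shown above; the second uses made-up counts for three hypothetical documents (these numbers are an assumption for illustration only, not output from the question's data). With one row, cor() has nothing to vary over; with several rows it can return numeric values.

```r
# One document: a single row of term counts, as in the question's matrix m.
one_doc <- matrix(c(5, 4, 3), nrow = 1,
                  dimnames = list(Docs = "1",
                                  Terms = c("word1", "word2", "word3")))
cor_one <- cor(one_doc)      # every entry is NA: one observation per term

# Hypothetical counts for three documents (invented for illustration).
three_docs <- matrix(c(5, 4, 3,
                       2, 1, 4,
                       1, 3, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(Docs = c("1", "2", "3"),
                                     Terms = c("word1", "word2", "word3")))
cor_three <- cor(three_docs) # numeric correlations between the term columns

print(cor_one)
print(cor_three)
```

In practice this suggests splitting the text into several documents (for example one per line or paragraph) before building the term-document matrix, so that each term has more than one observation.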
Licensed under: CC-BY-SA with attribution