Question

I've been using R's tm package with much success on classification problems. I know how to find the most frequent terms across the entire corpus (with findFreqTerms()), but I don't see anything in the documentation that would find the most frequent term (after I've stemmed and removed stopwords, but before I remove sparse terms) in each individual document in the corpus. I've tried apply() with max(), but that gives me the number of times the most frequent term occurs in each document, not the name of the term itself.

library(tm)

data("crude")
corpus <- tm_map(crude, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))  # plain tolower breaks in newer tm versions
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)                  # requires the SnowballC package
dtm <- DocumentTermMatrix(corpus)
maxterms <- apply(dtm, 1, max)
maxterms
127 144 191 194 211 236 237 242 246 248 273 349 352 
 5  13   2   3   3  10   8   3   7   9   9   4   5 
353 368 489 502 543 704 708 
 4   4   4   5   5   9   4 

Thoughts?


The solution

Ben's answer gives what you asked for, but I am not sure that what you asked for is wise: it does not account for ties. Here are two approaches, one with base R on the document-term matrix and a second using the qdap package. They give you lists of the top words (in qdap's case, a list of data frames with words and frequencies). With the first option you can use unlist to get the rest of the way (as shown after the code below); with qdap, use lapply, indexing, and unlist. The qdap approach works on the raw Corpus:

Option #1:

apply(dtm, 1, function(x) unlist(dtm[["dimnames"]][2], 
    use.names = FALSE)[x == max(x)])
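To collapse Option #1's result into a single named character vector, unlist does the rest (a minimal sketch; tied terms simply add repeated entries for that document):

maxterms <- apply(dtm, 1, function(x) unlist(dtm[["dimnames"]][2],
    use.names = FALSE)[x == max(x)])
unlist(maxterms)  # names are document IDs; ties repeat the ID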

Option #2 with qdap:

library(qdap)
dat <- tm_corpus2df(crude)
tapply(stemmer(dat$text), dat$docs, freq_terms, top = 1, 
    stopwords = tm::stopwords("English"))

Wrapping the tapply call in lapply(WRAP_HERE, "[", 1) makes the two answers identical in content and nearly identical in format.
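Spelled out, that wrapping looks like the following sketch, which keeps only the first column (the words) of each data frame that freq_terms returns:

lapply(tapply(stemmer(dat$text), dat$docs, freq_terms, top = 1,
    stopwords = tm::stopwords("english")), "[", 1)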

EDIT: Added a leaner qdap example:

FUN <- function(x) freq_terms(x, top = 1, stopwords = stopwords("english"))[, 1]
lapply(stemmer(crude), FUN)

## [[1]]
## [1] "oil"   "price"
## 
## [[2]]
## [1] "opec"
## 
## [[3]]
## [1] "canada"   "canadian" "crude"    "oil"      "post"     "price"    "texaco"  
## 
## [[4]]
## [1] "crude"
## 
## [[5]]
## [1] "estim"  "reserv" "said"   "trust" 
## 
## [[6]]
## [1] "kuwait" "said"  
## 
## [[7]]
## [1] "report" "say"   
## 
## [[8]]
## [1] "yesterday"
## 
## [[9]]
## [1] "billion"
## 
## [[10]]
## [1] "market" "price" 
## 
## [[11]]
## [1] "mln"
## 
## [[12]]
## [1] "oil"
## 
## [[13]]
## [1] "oil"   "price"
## 
## [[14]]
## [1] "oil"  "opec"
## 
## [[15]]
## [1] "power"
## 
## [[16]]
## [1] "oil"
## 
## [[17]]
## [1] "oil"
## 
## [[18]]
## [1] "dlrs"
## 
## [[19]]
## [1] "futur"
## 
## [[20]]
## [1] "januari"

Other tips

You're almost there: replace max with which.max to get the column index of the term with the highest frequency in each document (i.e., each row). Then use that vector of column indices to subset the Terms (roughly, the column names) of the document-term matrix. That returns the actual term with the maximum frequency in each document, rather than just the frequency value, as max does. Following from your example:

maxterms <- apply(dtm, 1, which.max)
dtm$dimnames$Terms[maxterms]
[1] "oil"     "opec"    "canada"  "crude"   "said"    "said"    "report"  "oil"    
 [9] "billion" "oil"     "mln"     "oil"     "oil"     "oil"     "power"   "oil"    
[17] "oil"     "dlrs"    "futures" "january"