If I remember correctly, you can construct a TermDocumentMatrix of bigrams (pairs of adjacent words) using the RWeka package, and then process them as needed
library("tm")    # text mining
library("RWeka") # tokenization algorithms more complicated than single-word

# Tokenizer that emits bigrams only (min = max = 2 words per token).
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# NOTE(review): `corpus` must be a tm corpus built earlier — not shown here.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Process tdm as needed, e.g.:
# findFreqTerms(tdm, lowfreq = 3, highfreq = Inf)
# ...

# Drop bigrams absent from more than 99% of documents.
tdm <- removeSparseTerms(tdm, 0.99)

print("----")
print("tdm properties")
str(tdm)

# How many terms make up the top N percent of the matrix rows.
# `topN_percentage_wanted` must be defined by the caller (e.g. 10 for top 10%).
tdm_top_N_percent <- tdm$nrow / 100 * topN_percentage_wanted
Alternatively,
# Token lengths for the tokenizer: Weka_control(min, max) bounds the n-gram
# *size* in words, so this yields every 1- to 5-word combination of adjacent
# words (not occurrence counts).
wmin <- 1
wmax <- 5
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = wmin, max = wmax))
Sometimes it helps to perform word stemming first in order to get "better" word groups (e.g. so "run" and "running" collapse to the same term).