Question

I have edited the question as directed by Tyler in the comments below.

As part of a larger text mining project, I have created a .csv file which has titles of books in the first column and the whole contents of each book in the second column. My goal is to create a word cloud consisting of the top n (n = 100, 200, or 1000, depending on how skewed the scores turn out to be) most frequently repeated words in the text for each title, after removing the common English stop words (for which the R tm (text mining) package has a beautiful pair of functions: removeWords and stopwords). I hope this explains my problem better.

Problem statement:

My input is a CSV file in the following format:

title   text
1   <huge amount of text1>
2   <huge amount of text2>
3   <huge amount of text3>

Here's an MWE with similar data:

library(tm)
data(acq)
## build a title/text data frame from the first three Reuters documents
dat <- data.frame(title = names(acq[1:3]),
                  text = sapply(acq[1:3], function(d) paste(as.character(d), collapse = " ")),
                  row.names = NULL, stringsAsFactors = FALSE)

I would like to find the top "n" terms by frequency appearing in the corresponding text for each title, excluding the stop words. The ideal output would be a table in Excel or CSV that looks like:

title   term    frequency
1       ..      ..
1       ..      ..
1       ..      ..
1       ..      ..
1       ..      ..
2       ..      ..
2       ..      ..
2       ..      ..
2       ..      ..
2       ..      ..
3       ..      ..
3       ..      ..
3       ..      ..

Please advise whether this could be accomplished in R or Python. Any guidance would be appreciated.


Solution

In Python, you can use Counter from the collections module, and re to split the sentence on non-word characters, giving you this (note the empty-string entry in the output, produced where two non-word characters are adjacent; you would filter it out in practice):

>>> import re
>>> from collections import Counter
>>> t = "This is a sentence with many words. Some words are repeated"
>>> Counter(re.split(r'\W', t)).most_common()
[('words', 2), ('a', 1), ('', 1), ('sentence', 1), ('This', 1), ('many', 1), ('is', 1), ('Some', 1), ('repeated', 1), ('are', 1), ('with', 1)]

OTHER TIPS

In R:

dat <- read.csv("myFile", stringsAsFactors = FALSE)
splitPerRow <- strsplit(dat$text, "\\W")
tablePerRow <- lapply(splitPerRow, table)
tablePerRow <- lapply(tablePerRow, sort, decreasing = TRUE)
tablePerRow <- lapply(tablePerRow, head, n) # set n to be the threshold on frequency rank

output <- data.frame(title = rep(dat$title, times = sapply(tablePerRow, length)),
                     term  = unlist(lapply(tablePerRow, names)),
                     freq  = unlist(tablePerRow, use.names = FALSE))

Depending on the nature of the text, you might need to filter out non-word entries: if the text is "term1 term2, term3", you'll get an empty entry caused by the empty string between the comma and the space after term2. A minimal cleanup step that also drops stop words is sketched below.
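For example, assuming the tm package is available (its stopwords("english") function returns a character vector of common English stop words), the split results could be cleaned before tabulating:

library(tm)

## drop empty strings and common English stop words before counting
splitPerRow <- lapply(splitPerRow, function(words) {
  words <- tolower(words)
  words[nzchar(words) & !words %in% stopwords("english")]
})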

In base R:


## set up some data: three artificial word types, "Aa", "Bb", "Cc"
words <- paste(LETTERS[1:3], letters[1:3], sep = "")
dat <- data.frame(title = 1:3,
                  text = sapply(1:3, function(x){
                    paste(sample(words, 15, TRUE), collapse = " ")
                  }))
dat$text <- as.character(dat$text)

## solve the problem: one frequency table per row of dat
## (sapply simplifies to a matrix here because every text contains the
##  same three word types; with real data, keep the list form instead)
tabs <- sapply(dat$text, function(x){
    table(unlist(strsplit(x, " ")))
    }, USE.NAMES = FALSE)
data.frame(title = rep(dat$title, each = nrow(tabs)),
           text  = rep(rownames(tabs), ncol(tabs)),
           freq  = c(tabs))

## title text freq
##     1   Aa    6
##     1   Bb    3
##     1   Cc    6
##     2   Aa    9
##     2   Bb    4
##     2   Cc    2
##     3   Aa    4
##     3   Bb    7
##     3   Cc    4
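To match the original requirement (only the top n terms per title, with stop words excluded), each per-text table can be filtered and truncated before flattening. Here is a minimal sketch in the same base-R style; n and stop_words are illustrative placeholders:

n <- 2
stop_words <- c("Aa")  # hypothetical stop-word vector; substitute a real list

tabsPerText <- lapply(dat$text, function(x) table(strsplit(x, " ")[[1]]))
topPerText <- lapply(tabsPerText, function(tab) {
  tab <- tab[!names(tab) %in% stop_words]  # drop stop words
  head(sort(tab, decreasing = TRUE), n)    # keep the n most frequent terms
})
data.frame(title = rep(dat$title, times = sapply(topPerText, length)),
           term  = unlist(lapply(topPerText, names)),
           freq  = unlist(topPerText, use.names = FALSE))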

The qdap package allows you to do what you're after:

library(qdap)
list_df2df(setNames(lapply(dat$text, freq_terms, top=10, 
    stopwords = Dolch), dat$title), "Title")

With freq_terms you can remove stop words and get the top n terms, applied here to each text via lapply. Then setNames attaches the titles and list_df2df puts it all together into one data frame.

Here I use the qdapDictionaries::Dolch list for stop words, but use whatever character vector you want. Note also that if there is a tie at the top-ten cutoff, all words at that frequency level will be included.

##              Title           WORD FREQ
## 1   reut-00001.xml       computer    6
## 2   reut-00001.xml        company    4
## 3   reut-00001.xml           dlrs    4
## .
## .
## .
## .
## 112 reut-00003.xml        various    1
## 113 reut-00003.xml           week    1
## 114 reut-00003.xml         within    1
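Since the end goal is a CSV file, the combined data frame can be written straight out with base R's write.csv (a sketch; the file name is illustrative):

out <- list_df2df(setNames(lapply(dat$text, freq_terms, top = 10,
    stopwords = Dolch), dat$title), "Title")
write.csv(out, "top_terms_per_title.csv", row.names = FALSE)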

In R you can also use the stringi package and its stri_extract_all_charclass function to extract all runs of letters from the text:

library(stringi)
stri_extract_all_charclass(c("Ala ma; kota. Jaś nie ma go\n.To nic nie ma 123", "abc dce"), "\\p{Lc}")
## [[1]]
## [1] "Ala"  "ma"   "kota" "Jaś"  "nie"  "ma"   "go"   "To"   "nic"  "nie"  "ma"  
## 
## [[2]]
## [1] "abc" "dce"

Then you can count these words with the table function. You may also want to transform every word to lowercase first, using the stri_trans_tolower function (a sketch follows the output below).

temp <- stri_extract_all_charclass(c("Ala ma; kota. Jaś nie ma go\n.To nic nie ma 123", "abc dce"), "\\p{Lc}")
lapply(temp, table)
## [[1]]
## 
##  Ala   go  Jaś kota   ma  nic  nie   To 
##    1    1    1    1    3    1    2    1 

## [[2]]

## abc dce 
##   1   1 
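As noted above, lowercasing before extraction makes case variants count as a single word. A minimal sketch with stri_trans_tolower:

library(stringi)

txt <- c("Ala ma; kota. Jaś nie ma go\n.To nic nie ma 123", "abc dce")

## lowercase first (so e.g. "Ma" and "ma" would be counted together),
## then extract letter runs and tabulate
words <- stri_extract_all_charclass(stri_trans_tolower(txt), "\\p{Lc}")
lapply(words, table)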