Question

I used R to mine tweets and extracted the most frequent words. However, the most frequent words look like this:

 [1] "cant"     "dont"     "girl"     "gonna"    "lol"      "love"    
 [7] "que"      "thats"    "watching" "wish"     "youre"  

I am looking for trends, names, and events in the text. Is there a way to remove these text-message-style words (such as gonna, wanna, ...) from the corpus? Is there a stopword list for them? Any help would be appreciated.

Solution

The tm (text mining) package maintains its own list of stopwords and provides useful tools for managing and summarizing this type of text.

Let's say your tweets are stored in a vector.

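library(tm)
words <- vector_of_strings  # character vector, one tweet per element
corpus <- Corpus(VectorSource(words))
# Wrap base functions in content_transformer() so tm_map() keeps the corpus structure
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords before punctuation: entries such as "don't" no longer match
# once the apostrophe is stripped, which leaves "dont" and "cant" behind
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, removePunctuation)

You can pass your own extended list to removeWords in place of stopwords():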

stoppers <- c(stopwords(), "gonna", "wanna", "lol", ... ) 
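As a rough sketch of how an extended list could be applied end to end, the example below runs three made-up tweets through the pipeline and then lists the terms that survive. The sample tweets are hypothetical, and VCorpus, stripWhitespace, TermDocumentMatrix, and findFreqTerms are tm functions that the steps above do not use; treat this as one possible arrangement rather than the only way to do it.

```r
library(tm)

# Hypothetical sample tweets, for illustration only
tweets <- c("I'm gonna watch the game tonight lol",
            "Wanna see the Lakers game, lol",
            "The Lakers game was great")

# tm's English stopwords plus a few slang terms
stoppers <- c(stopwords("english"), "gonna", "wanna", "lol")

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stoppers)   # before punctuation removal
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Term frequencies across the cleaned tweets
tdm <- TermDocumentMatrix(corpus)
freq_terms <- findFreqTerms(tdm, lowfreq = 2)     # terms seen at least twice
print(freq_terms)
```

With the slang removed, the frequent terms are more likely to be the names and events you are after.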

Unfortunately, you'll have to generate your own list of "text messaging" or "internet messaging" stopwords.

But you could cheat a bit by borrowing from NetLingo (http://vps.netlingo.com/acronyms.php):

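library(XML)
theurl <- "http://vps.netlingo.com/acronyms.php"
doc <- htmlParse(theurl)
# Each acronym is a link inside a list item; the XPath depends on the page layout
nodes <- getNodeSet(doc, "//ul/li/span//a")
# Lowercase the acronyms so they match a lowercased corpus
stoppers <- tolower(sapply(nodes, xmlValue))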
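A scraped list like this probably needs some cleanup before it goes to removeWords, because removeWords joins the words into a single regular expression and some acronyms contain characters such as "?" or "/". The sketch below uses a hypothetical helper, clean_stoplist (not part of tm or XML), and a made-up sample vector standing in for the scraped values.

```r
# Hypothetical helper: keep only entries that are safe to splice into the
# regular expression that removeWords() builds
clean_stoplist <- function(x) {
  x <- tolower(trimws(x))
  # Keep plain tokens; drop empty strings and entries with spaces or regex metacharacters
  x <- x[grepl("^[a-z0-9']+$", x)]
  unique(x)
}

# Made-up sample standing in for the scraped values
scraped <- c("LOL", "lol", "BRB ", "?4U", "A/S/L", "gr8", "")
cleaned <- clean_stoplist(scraped)
print(cleaned)
```

It is worth skimming the cleaned list before using it, since some acronyms may coincide with ordinary words or names that you want to keep.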